July 30th, 2013

… or the definitive guide to getting Tomcat and Mysql working correctly with character sets

I thought I understood all you needed to know for taking user submitted data and storing it correctly in the database until somebody Polish tried to make a donation a few weeks back and my gaps in knowledge were exposed. Here is my comprehensive guide to everything you need to know and do to handling correctly all (technically not quite all but near enough) possible characters a user might submit when working with Tomcat and Mysql.

Before we start: What every developer should understand about working with text

The definition of character encodings, sets, points has evolved over time.  Here is how I understand the terms now based on the unicode standard.

Characters

A character is an abstract concept referring to some sign or symbol.  All the letters of an alphabet are characters, as are numbers.

Code point

A code point is a mapping of some abstract character to a number.  For example in UTF the character for the latin capital A is assigned the decimal number 65.

Character set

A character set is a set of code points.  That is a character set consists of a list of unique characters that it supports each of which is given a number.  UTF is a character set.

Character encoding

A character encoding specifies how a particular character set should be transmitted and stored on a computer.  Computers work in binary (1s and 0s) and most often in bytes (8 1s or 0s in a sequence).  You can represent 256 different numbers in a byte but character sets often consist of many 1000s of character and so you need a way of mapping those to sequences of more than one byte.  Some encodings use a fixed number of bytes but this can be quite wasteful in terms of memory and storage use.  UTF8 uses one byte for the first 128 code points and then starts using multiple bytes for higher value code points.  Check out the wikipedia entry for more detail.

UTF-8 the character encoding to use for development

UTF-8 has a number of advantages:

  • it uses exactly the same single byte code points as ASCII for the first 128 characters
  • it has a comprehensive coverage of characters users in most languages are ever likely to use
  • it is widely used and supported

In short, where you have a choice, choose UTF-8 as your character encoding until you find a good reason not to.

Client to server character communication and storage

When thinking about making sure your system supports any character a user might throw at it you need to consider how characters are encoded at each node in the path (browser, web server, database) and in the messages sent between each.  We need to make sure the UTF  character set is supported all the way through.

MySql: get those defaults set-up

A MySql installation consists of a database server made up of individual databases (or schemas) which in turn contain tables and each table has a set of columns which may or may not store text. Believe it or not you can configure the default encoding at each of these levels and so we need to do that. This is quite painless if you are starting from scratch, you just need to set the character set of the server before you start, but a real pain if you have a lot of tables created already.

Set the default character encoding of the MySql server

This means that the default character set of any new database you create will be UTF8.

  • Add the following line to the [mysqld]section of your my.ini file
    character-set-server=utf8
  • Restart the MySql server.

Alternatively you can use the command line or a configuration option when building from source.

Set the default character encoding of all your MySql databases

This means that the default character set of any new tables you create in the database will be UTF8.

  • Open a mysql client session
  • Run the following command for each database
    ALTER DATABASE db_name CHARACTER SET 'utf8';

Set the default character encoding of each of your existing MySql tables

This means that any new columns that store character data that you add in future to a table will be UTF-8 by default.

ALTER TABLE tb_name CHARACTER SET utf8;

Set the character encoding of each of your existing MySql columns

We’re still not done! So if you did not set your defaults up correctly before and did not specify a character set for each column that stores character data or did not define your columns as nvarchar then most likely they are set to store data as the MySql default latin1. Latin1 is not the worst character encoding and covers a lot of western European languages very well but if you have someone with a Polish ‘Ł’ or ‘ń’ you will run into difficulties. A good command to use to find out what each individual column is set to in a table is:

show full columns from tb_name;

The collation column tells you if you have the right character encoding (you want it to start with uft8). Collations are rules for comparing the characters within a particular character encoding. So expanding our ddl statement from before:

ALTER TABLE tb_name MODIFY COLUMN col_name SQL_TYPE CHARACTER SET utf8, CHARACTER SET utf8;

For example if you have a table called person and a column called name which is of type varchar(100):

ALTER TABLE person MODIFY COLUMN name CHARACTER VARYING(100) CHARACTER SET utf8, CHARACTER SET utf8;

There might be some columns where it might be more correct to specify the character set as ascii if only ascii characters are allowed for that data type (for example email, non internationalized URLs). You may want to do this for data integrity purposes.

Always specify a character encoding for each column in you ddl

For maximum reassurance and long term maintainability I’d advise against using the national character varying and national character data types (nchar and nvarchar) and always specifying a character encoding on every column that stores character data. All nchar and nvarchar mean is use a suitable character encoding to store internationalized character data. MySql happens to select UTF8 for this purpose but I think it’s better to be explicit. Setting a character encoding explicitly for the table is not part of the ANSI standard so you might prefer not to do that (and rely on the fact it will pick this up from the database default).

Specify a default character encoding for you mysql command line client

Finally when working on the server you want to be able to view any data stored correctly.  To do this you should set the character encoding for the mysql client.

  • Add the following line to the [client] section of your my.cnf file
    default-character-set=utf8
  • Restart your mysql client

Tomcat: get talking UTF8

Java works in UTF16, a fixed two byte encoding that supports all the characters of UTF8.  For Tomcat we need to ensure that the communication between client and server and also server and database is all in UTF8.

Tomcat to MySql communication

To ensure Tomcat is talking UTF8 to MySql you should always specify the character encoding on the JDBC connection URL.  In Tomcat we configure these in either the GlobalNamingResources element in server.xml or in the Context element of a web applications context.xml.  You simply add a parameter to the URL to do this.  For example

 <Resource name="jdbc/mydb" auth="Container" type="javax.sql.DataSource"
              maxActive="100" maxIdle="30" maxWait="10000"
              username="myuser" password="pwd" driverClassName="com.mysql.jdbc.Driver"
              url="jdbc:mysql://myhost:myport/mydb?autoReconnect=true&amp;characterEncoding=UTF-8"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory" />

Tomcat to browser communication

To ensure Tomcat returns UTF8 you need to ensure your jsps all have UTF-8 specified as their page encoding

<%@page pageEncoding="UTF-8"%>

and also ensure you set the default content type for jsps in your web.xml

  <jsp-config>
    <jsp-property-group>
        <url-pattern>/WEB-INF/jsp/*</url-pattern>
        <default-content-type>text/html;charset=utf-8</default-content-type>       
    </jsp-property-group>
  </jsp-config>

Browser to Tomcat configuration

This is troublesome. If you page is encoded in UTF8 then most browser will encode POST requests in UTF8. However there is no way to guarantee the encoding of GET request parameters. You might consider setting UTF-8 on the Connector element in server.xml.