automatic character set conversion in postgresql

Today, I encountered a few goofy characters in the data I am migrating from one ERP system to another. For example, “¢” isn’t represented the same way in UTF-8 as LATIN1 character sets. In UTF-8, the hex representation for “¢” is c2 a2, but in LATIN1 it is a2.

I started looking for an easy Perl way to translate everything into UTF-8 on the client side, when I discovered that PostgreSQL offers automatic client-to-server character set conversions. All I have to do is specify what my client character set is.

Here’s how you can do it with an SQL command:

SET CLIENT_ENCODING TO 'LATIN1';

Substitute your character set for “LATIN1″.

Lucky for me, my database is set to UTF8, and in that case, all supported encodings on my clients will be automatically converted to UTF-8 — as long as I specify which encoding I’m using.

The support for UTF-8 (formerly called UNICODE in the docs) in PostgreSQL has been around since version 7.1 (early 2000), and in version 8.1 the conversion support for UTF-8 was expanded to all known character sets.

7 thoughts on automatic character set conversion in postgresql

Comments are closed.

  1. SET NAMES utf8

    I believe that does everything you need? It does 2 settings in 1 to just make all queries IO as UTF8, so I think its more powerful than client encoding

  2. In theory, UTF-8 supports all characters in all encodings. However, SQL_ASCII encoding (which implies C locale) is not exactly a real encoding.

    Many SQL_ASCII strings cannot be converted to UTF-8. If you put something into SQL_ASCII, or treat the data as bytes (like a C string) rather than characters (like a unicode string) in your application, you essentially lose the original encoding that it was, so you can’t convert to UTF-8.

    This is the reason that so many people had encoding problems when upgrading from 8.1 to 8.2 (where PostgreSQL became more strict).

    Also, there are some slightly more obscure issues. I believe it was Tatsuo-san who mentioned that the Japanese government redefined some characters after UTF-8 was established, and there are therefore some Japanese characters that can’t be represented in UTF-8. That is my understanding anyway, and I could be mistaken on this point.

  3. Ahhh, good to know, Jeff. I was wondering about the SQL_ASCII encoding. This portion of the docs made me laugh — “Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding.”

    We are using UTF-8 for our database encoding, but don’t have to store non-latin characters or symbols at this time.

  4. Pingback: Sheeri Kritzer Cabral » Blog Archive » Log Buffer #72 — a Carnival of the Vanities for DBAs - The MySQL She-BA

  5. Pingback: Technocation, Inc.