Today, I encountered a few goofy characters in the data I am migrating from one ERP system to another. For example, “¢” isn’t represented the same way in UTF-8 as LATIN1 character sets. In UTF-8, the hex representation for “¢” is c2 a2
, but in LATIN1 it is a2
.
I started looking for an easy Perl way to translate everything into UTF-8 on the client side, when I discovered that PostgreSQL offers automatic client-to-server character set conversions. All I have to do is specify what my client character set is.
Here’s how you can do it with an SQL command:
SET CLIENT_ENCODING TO 'LATIN1';
Substitute your character set for “LATIN1”.
Lucky for me, my database is set to UTF8
, and in that case, all supported encodings on my clients will be automatically converted to UTF-8 — as long as I specify which encoding I’m using.
The support for UTF-8 (formerly called UNICODE in the docs) in PostgreSQL has been around since version 7.1 (early 2000), and in version 8.1 the conversion support for UTF-8 was expanded to all known character sets.
SET NAMES utf8
I believe that does everything you need? It does 2 settings in 1 to just make all queries IO as UTF8, so I think its more powerful than client encoding
Thanks for the pointer! The docs say that SET NAMES is an alias for SET CLIENT_ENCODING. Am I missing something?
Better use win1252 as encoding, not LATIN1.
In theory, UTF-8 supports all characters in all encodings. However, SQL_ASCII encoding (which implies C locale) is not exactly a real encoding.
Many SQL_ASCII strings cannot be converted to UTF-8. If you put something into SQL_ASCII, or treat the data as bytes (like a C string) rather than characters (like a unicode string) in your application, you essentially lose the original encoding that it was, so you can’t convert to UTF-8.
This is the reason that so many people had encoding problems when upgrading from 8.1 to 8.2 (where PostgreSQL became more strict).
Also, there are some slightly more obscure issues. I believe it was Tatsuo-san who mentioned that the Japanese government redefined some characters after UTF-8 was established, and there are therefore some Japanese characters that can’t be represented in UTF-8. That is my understanding anyway, and I could be mistaken on this point.
Ahhh, good to know, Jeff. I was wondering about the SQL_ASCII encoding. This portion of the docs made me laugh — “Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding.”
We are using UTF-8 for our database encoding, but don’t have to store non-latin characters or symbols at this time.
Pingback: Sheeri Kritzer Cabral » Blog Archive » Log Buffer #72 — a Carnival of the Vanities for DBAs - The MySQL She-BA
Pingback: Technocation, Inc.