Sql Server : Collation and Unicode vs UTF-8, UTF-16 and UTF-32
- Posted by Sqltimes
- On September 4, 2014
- 0 Comments
More often these days, there have been many conversations around the topic of handling unicode data and UTF-* type data in Sql Server. Some details are below:
Basics
- ASCII means American Standard Code for Information Interchange – based on English alphabet
- As you can see above, ASCII is only for English
- Unicode is a standard for representing characters of most of world’s languages (written)
- Just like ASCII, but covers more languages.
- UTF-8 means UCS Transformation Format – 8 bit (1 byte).
- UTF-8 is the most used standard for web and has closest compatibility with ASCII
- UCS means Universal Character Set defined by ISO
- UTF-16 uses 2 bytes
- UTF-32 uses 4
How this relates to Sql Server:
- Sql Server uses COLLATION setting to determine the techniques to compare and sort character data.
- In Sql Server we have COLLATION at instance level, database level and column level.
- Instance collation could be configured during installation
- Installation collation is inherited for database and table columns, when no other collation is explicitly specified.
- .Net string is unicode encoded in UTF-16
- Java string is unicode encoded in UTF-16
- In previous post, we discussed ways to store unicode data.
- When you store unicode data in non-unicode column, Sql Server finds the closest match. Example : ‘À’ to ‘A’
- But when a close match is not available, it replaces it with ‘?’. See this example.
- Use windows collation where possible (not SQL collation)
- Windows collations are updated when there are new characters (or alphabet), where as SQL collations are not going to up updated going forward.
- SQL collations are for backwards compatibility and when one Sql Server needs to communicate directly with other Sql Servers in the environment.
- Either through SSIS
- Replication
- etc
- In a multi-language environment, use the collation with best overall support for the languages used.
- Performance is not (and should not be) a factor in deciding collation.
Several posts have helped gather the information above. I’ll list some of them:
Hope this helps,
_Sqltimes
0 Comments