OK, now for once and for all I will outline what is really going on with unicode in Python.
First of all, you should understand the basics of character sets and encodings. Read Joel’s article entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” for more information and make sure you understand the difference between a character set and a character encoding. (Hint: a set is only a list of available characters while an encoding specifies how those characters are represented in 1s and 0s.) Once you understand why unicode is a character set and utf-8 is an encoding, let us continue.
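To make the set-versus-encoding distinction concrete, here is a small sketch (using b"…" and u"…" literals, which work in Python 2.6+ and modern Python alike): the same character, a single entry in the unicode character set, gets different byte representations under different encodings.

```python
# The unicode character set assigns 'é' the code point U+00E9.
# An encoding decides how that code point becomes actual bytes:
e_acute = u"\u00e9"

assert e_acute.encode("utf-8") == b"\xc3\xa9"    # two bytes in utf-8
assert e_acute.encode("iso-8859-1") == b"\xe9"   # one byte in iso-8859-1
```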
In Python there are two ways to represent text: in real 1s and 0s (the str) and in some abstract data type (unicode). When you represent something as a str it is vital that you know what encoding it is in; without that it is just a useless list of bytes. When you represent something as unicode it is unambiguous: a unicode object can mean only one thing. It is a sequence of unicode code points that you can treat similarly to str objects in many ways, save a few.
The most important thing to understand is that Python calls unicode objects “decoded” and regular strings “encoded”. If you want to transform one into the other, you have to encode or decode it. This is as simple as calling those methods on the object you want to convert:
>>> my_str = "abc"
>>> my_str
'abc'
>>> my_unicode = my_str.decode("utf-8")
>>> my_unicode
u'abc'
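The same round trip, sketched here with explicit b"…"/u"…" prefixes (valid from Python 2.6 on) so the two types are visible at a glance — this is an illustration, not part of the original session:

```python
raw = b"abc"                         # encoded: a plain sequence of bytes
text = raw.decode("utf-8")           # decoded: a unicode object
assert text == u"abc"
assert text.encode("utf-8") == raw   # encoding gets the original bytes back
```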
u"abc" is short-hand for "abc".decode(some_encoding), where some_encoding is auto-detected by Python. If the source is read from a file, it is the encoding Python thinks that file is written in. If it is read from the terminal, Python tries to determine the encoding of the terminal and falls back when there is an error decoding the literal.
Note that I assumed the string my_str was encoded in utf-8. This is because my terminal emulator is set to use that encoding, and I entered the string by typing it in my terminal emulator. It is vital that you understand this part, especially when you want to work with strange characters. Say I set my terminal to iso-8859-1, and I do this:
>>> my_str = "ça évite"
>>> my_unicode_1 = my_str.decode("iso-8859-1")
>>> my_unicode_1
u'\xe7a \xe9vite'
>>> my_unicode_2 = my_str.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
>>> print my_unicode_1.encode("utf-8")
Ã§a Ã©vite
As you can see, when you try to make Python interpret the raw string as utf-8 it fails, because the iso-8859-1 representation contains byte sequences that cannot exist in utf-8. The borked message at the end of the example is the result of my terminal emulator interpreting this utf-8 encoded data as iso-8859-1. If I change my terminal emulator back to utf-8 and do it again, here is what happens:
>>> print my_unicode_1.encode("utf-8") ça évite
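That borked output can also be reproduced without touching any terminal settings: decode utf-8 bytes as if they were iso-8859-1 and you get the familiar Ã-garbage. A sketch with b/u literals:

```python
utf8_bytes = u"\xe7a \xe9vite".encode("utf-8")  # "ça évite" as utf-8 bytes
# A terminal set to iso-8859-1 interprets each byte as one character:
mojibake = utf8_bytes.decode("iso-8859-1")
assert mojibake == u"\xc3\xa7a \xc3\xa9vite"    # displays as "Ã§a Ã©vite"
```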
It is sometimes possible to autodetect the encoding of stdin and/or stdout. You can access these values as sys.stdin.encoding and sys.stdout.encoding. Note that if Python cannot autodetect the encoding, these values are None.
So, now that you know how to manipulate text in Python, you should understand why it is useful to represent everything as unicode. It matters even for simple things like the length of a string; this cannot be determined from a raw string, because what takes three bytes in utf-8 may be just one code point. len() returns the number of bytes for raw strings (which is almost never what you want) and the number of actual characters for unicode objects.
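Sketched with b/u literals, using the example string from earlier:

```python
raw = u"\xe7a \xe9vite".encode("utf-8")  # "ça évite" as utf-8 bytes
assert len(raw) == 10                    # ç and é each take two bytes
assert len(raw.decode("utf-8")) == 8     # eight actual characters
```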
Thanks a lot for this. I first thought your initial statement ‘… once and for all …’ was a bit too strong. But your statement “The most important thing to understand is that python calls unicode objects ‘decoded’ and regular strings ‘encoded’” did the trick :-)
Thomas Ross said on: Monday, December 1, 2008 13:15