Fun with Unicode

Anton theatilla at gmail.com
Fri Feb 1 09:06:30 EST 2013


Let's talk python and Unicode (yay!)

The objects that I want to store will have non-ASCII strings in them.
Potentially a lot. How much is a lot? "Very many millions" should be a
good estimate.

Now, the default behaviour for storing a python object (ok, a dict of
stuff) using the PBC transport is to pass it to the json module and
encode it. I'm ok with that; I like JSON, and the fact that I can read
an object back as JSON, using a browser, helps a lot. It's really
great for developing project-specific tools, say debugging tools.

But here is where the fun part starts. The JSON encoder in python is
not a simple thing, and it takes a lot of parameters. By default it
works, so well that people rarely look at what's going on. When you do
look, however, things get more entertaining.

The JSON encoder works on unicode objects, not byte strings. When you
pass it unicode objects, it's happy. When you pass it byte strings, it
decodes them using a specified encoding; by default that's 'utf-8',
which makes everything quite ok. So far so good. However, there's
another option, 'ensure_ascii'. It is set to True by default, and it
means that the JSON encoder will spew out an ASCII-only string. That
is, in the result, every non-ASCII code point is escaped as \u0123, a
total of 6 bytes per code point (and 12 for anything outside the BMP,
via surrogate pairs).
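
To make this concrete, here's a quick sketch with the standard json
module (python 2.x; the Cyrillic word is just an arbitrary non-ASCII
stand-in):

# -*- coding: utf-8 -*-
import json

s = u'привет'

escaped = json.dumps(s)        # ensure_ascii=True is the default
print(escaped)                 # "\u043f\u0440\u0438\u0432\u0435\u0442"

utf8 = json.dumps(s, ensure_ascii=False).encode('utf-8')

print(len(escaped))            # 38 bytes: two quotes plus six 6-byte escapes
print(len(utf8))               # 14 bytes: two quotes plus six 2-byte UTF-8 chars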

Now, this is not good. For one, the JSON RFC expects Unicode text,
encoded with one of the UTF encodings (UTF-8 being the default), so
there is no need to escape everything down to ASCII. Also, even if
much of the data needs 3 bytes per character in UTF-8, that's still
only half of the 6 bytes that the python default takes.

Now, consider this elementary example. Even for a short string it
already shows a significant difference in size:
http://pastie.org/6011147
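
The same kind of comparison, for a small dict like the ones I'm
storing (the keys and values below are made-up stand-ins), looks like
this:

# -*- coding: utf-8 -*-
import json

# Made-up record, purely to illustrate the per-object size difference.
obj = {u'name': u'Фёдор Михайлович Достоевский',
       u'city': u'Санкт-Петербург'}

default_form = json.dumps(obj)                                # ASCII-escaped
utf8_form = json.dumps(obj, ensure_ascii=False).encode('utf-8')

print(len(default_form))    # the \uXXXX-escaped form
print(len(utf8_form))       # the UTF-8 form, well under half the size here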


Please tell me I'm not going crazy, that this really is the state of
affairs, and that it is, in fact, wrong and can/should be fixed.



