Fun with Unicode

Sean Cribbs sean at basho.com
Fri Feb 1 09:27:22 EST 2013


Anton,

I don't see any reason why this can't be fixed. However, since I'm not
familiar with the specifics of the JSON implementation, I'll need
assistance. Please open an issue or pull request on the Python client:
https://github.com/basho/riak-python-client/issues. We are open to
major, breaking changes for the next release.

On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at gmail.com> wrote:
> Let's talk Python and Unicode (yay!)
>
> The objects that I want to store will have non-ASCII strings in them.
> Potentially a lot. How much is a lot? "Very many millions" should be a
> good estimate.
>
> Now, the default behaviour for storing a Python object (ok, a dict of
> stuff) using the PBC transport is to pass it to json and encode
> it. I'm ok with that, I like JSON and the fact that I can read out
> an object in JSON, using a browser, helps a lot. It's really great for
> developing project-specific tools, say debugging tools.
>
> But here is where the fun part starts. The JSON encoder in Python is
> not a simple thing, and takes a lot of parameters. And by default it
> works. So well that people rarely look at what's going on. When you
> look at what's going on, however, things get more entertaining.
>
> The JSON encoder works on unicode objects, not strings. When you pass
> it unicode objects, it's happy. When you pass it strings, it decodes
> them, using a specified encoding. By default this is set to 'utf-8'
> which makes everything quite ok. So far so good. However, there's
> another option - 'ensure_ascii'. This is set to True by default and it
> means that the JSON encoder will spew out an ASCII-encoded string.
> That is, in the result, every non-ASCII code point is escaped as
> \uXXXX (for example \u0123), or a total of 6 bytes each.
>
> Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> using UTF-*. Also, even if much of the data will require 3 bytes in
> UTF-8, that's still only half the bytes that the Python default would
> take.
>
> Now, consider this elementary example. It already gives a significant
> (in bytes) difference for a short string:
> http://pastie.org/6011147
>
>
> Please tell me I'm not going crazy: that this really is the state of
> affairs, that it is in fact wrong, and that it can/should be fixed.
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
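
For reference, here is a minimal sketch of the size difference described
in the quoted message. It is written for Python 3 and uses only the
standard json module. The helper names (encode_json_ascii,
encode_json_utf8) and the sample document are hypothetical, chosen for
illustration; this is not the Python client's own code.

```python
import json


def encode_json_ascii(obj):
    """Serialize with the json default (ensure_ascii=True): non-ASCII
    characters come out as \\uXXXX escapes, so the result is pure ASCII."""
    return json.dumps(obj).encode("utf-8")


def encode_json_utf8(obj):
    """Serialize with ensure_ascii=False and encode the text as UTF-8,
    leaving non-ASCII characters unescaped."""
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")


# Hypothetical sample: "Anton" in Cyrillic (5 code points, 2 bytes each in UTF-8).
doc = {"name": "\u0410\u043d\u0442\u043e\u043d"}

escaped = encode_json_ascii(doc)
compact = encode_json_utf8(doc)

print(len(escaped), len(compact))  # 42 22

# Both forms decode back to the same object.
assert json.loads(escaped.decode("ascii")) == doc
assert json.loads(compact.decode("utf-8")) == doc
```

Both outputs are valid JSON and round-trip to the same object, so the
difference is only in the number of bytes stored per value.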



-- 
Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.
http://basho.com/



