Fun with Unicode

Adam Lindsay atl at
Fri Feb 1 10:45:10 EST 2013

Anton, Sean,  

Anton brings up a pretty interesting problem.

At first, I thought it might be easy to remedy with:

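import json
import functools
# JSON encoder that leaves non-ASCII characters unescaped
antonjson = functools.partial(json.dumps, ensure_ascii=False)

from riak import RiakClient
R = RiakClient()
# Use it in place of the client's default JSON encoder
R.set_encoder('application/json', antonjson)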

…however, upon testing this out, it seems likely that the underlying transport channels use the default encoding, 'ascii', and choke on the 8-bit data we now pass them, in (for the HTTP client) or protobuf.internal.type_checkers (for PBC).
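
One way around the implicit 'ascii' step might be to have the encoder do the UTF-8 encoding itself, so that the transport only ever sees bytes. A minimal sketch, assuming Python 3 (where json.dumps returns text); utf8_json_encoder and utf8_json_decoder are hypothetical names, and whether the client's transports accept a byte payload from a custom encoder is untested here:

```python
import json


def utf8_json_encoder(obj):
    """Serialize obj to JSON and return UTF-8 encoded bytes (hypothetical helper)."""
    # ensure_ascii=False keeps non-ASCII characters literal instead of \uXXXX escapes.
    text = json.dumps(obj, ensure_ascii=False)
    # Encode explicitly so no later layer has to guess an encoding.
    return text.encode("utf-8")


def utf8_json_decoder(data):
    """Inverse of utf8_json_encoder: parse UTF-8 encoded JSON bytes."""
    return json.loads(data.decode("utf-8"))


sample = {"name": "Антон"}          # sample value chosen for illustration
payload = utf8_json_encoder(sample)  # bytes, with no \u escapes
roundtrip = utf8_json_decoder(payload)
```

If that assumption holds, utf8_json_encoder could be passed to R.set_encoder in place of antonjson above.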

Maybe that's a suitable hint for Anton's further investigation, but I'll try to spend some time with it to see what I can find, as well.  
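
To put rough numbers on the size difference Anton describes, here is a minimal sketch, assuming Python 3; the sample string is an arbitrary illustration, not the example from his message:

```python
import json

# Hypothetical sample: a short dict with a 7-letter Cyrillic value.
sample = {"greeting": "Здравей"}

escaped = json.dumps(sample)                      # default: ensure_ascii=True
literal = json.dumps(sample, ensure_ascii=False)  # non-ASCII kept as-is

# Compare the sizes on the wire, assuming both are sent as UTF-8.
escaped_size = len(escaped.encode("utf-8"))
literal_size = len(literal.encode("utf-8"))

# Each Cyrillic letter costs 6 bytes when escaped and 2 bytes in UTF-8.
print(escaped_size, literal_size)
```

For text that needs 3 bytes per character in UTF-8, the saving is smaller but still half of the escaped size, as Anton notes.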

As to the OP's question: Yes, you've summarized the state of affairs quite nicely. IMHO it was a reasonable default (you can't be sure other Riak clients are as good as Python at 8-bit/Unicode!), but the underlying implementation definitely shows a bug that (again, IMHO) should and can be fixed.

--
Adam Lindsay

On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:

> Anton,
> I don't see any reason why this can't be fixed. However, since I'm not
> familiar with the specifics of the JSON implementation, I'll need
> assistance. Please open an issue or pull-request on the Python client:
> We are open to
> major, breaking changes for the next release.
> On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at (mailto:theatilla at> wrote:
> > Let's talk python and Unicode (yey!)
> >  
> > The objects that I want to store will have non-ASCII strings in them.
> > Potentially a lot. How much is a lot? "Very many millions" should be a
> > good estimate.
> >  
> > Now, the default behaviour for storing a python object (ok, a dict of
> > stuff), using the PBC transport is to pass them to json and encode
> > them. I'm ok with that, I like JSON and the fact that I can read out
> > an object in JSON, using a browser, helps a lot. It's really great for
> > developing project-specific tools, say debugging tools.
> >  
> > But here is where the fun part starts. The JSON encoder in python is
> > not a simple thing, and takes a lot of parameters. And by default it
> > works. So well that people rarely look at what's going on. When you
> > look at what's going on, however, things get more entertaining.
> >  
> > The JSON encoder works on unicode objects, not strings. When you pass
> > it unicode objects, it's happy. When you pass it strings, it decodes
> > them, using a specified encoding. By default this is set to 'utf-8'
> > which makes everything quite ok. So far so good. However, there's
> > another option - 'ensure_ascii'. This is set to True by default and it
> > means that the JSON encoder will spew out an ASCII-encoded string.
> > That is, in the result, every non-ASCII code point is escaped in the
> > form \u0123, for a total of 6 bytes each (12 for code points outside
> > the BMP, which need a surrogate pair).
> >  
> > Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> > using UTF-*. Also, even if much of the data will require 3 bytes in
> > UTF-8, that's still only half the bytes that the python default would
> > take.
> >  
> > Now, consider this elementary example. It already gives a significant
> > (in bytes) difference for a short string:
> >
> >  
> >  
> > Please tell me I'm not going crazy and all this is the state of
> > affairs and it is, in fact, wrong and can/should be fixed.
> >  
> > _______________________________________________
> > riak-users mailing list
> > riak-users at (mailto:riak-users at
> >
> >  
> --  
> Sean Cribbs <sean at (mailto:sean at>
> Software Engineer
> Basho Technologies, Inc.
