Fun with Unicode

Adam Lindsay atl at alum.mit.edu
Fri Feb 1 11:21:05 EST 2013


Ugh, yes, to both.  

Anton put his finger on it exactly--I made a typical Python Unicode goof in not being explicit about the encoding. My bad for assuming the json module would do so.
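
To make the goof concrete, here is a minimal sketch. It assumes Python 3, where json.dumps always returns text, rather than the Python 2 of this thread, and the sample record and variable names are made up for illustration. It shows that ensure_ascii=False still hands back text, so the encoding step has to be written out explicitly, and it compares the size of the two outputs.

```python
import json

# Hypothetical sample record: "Zürich" and "Антон", written with escapes so
# this source file stays ASCII-only.
record = {"city": "Z\u00fcrich", "name": "\u0410\u043d\u0442\u043e\u043d"}

# Default (ensure_ascii=True): each non-ASCII character is written as a
# six-character \uXXXX escape, so the output is plain ASCII.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are, but the result is
# still text, not bytes.
text = json.dumps(record, ensure_ascii=False)

# The explicit step: choose the wire encoding ourselves.
payload = text.encode("utf-8")

# Byte counts of the two serializations, for comparison.
escaped_size = len(escaped.encode("ascii"))
utf8_size = len(payload)
print(escaped_size, utf8_size)
```

The transport then only ever sees payload, which is already a chunk of bytes.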

So, Anton, would your use case be served by the following?

antonjson = lambda x: json.dumps(x, ensure_ascii=False).encode("utf8")
R.set_encoder('application/json', antonjson)
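
The same idea spelled out as named functions with a matching decoder, as a rough sketch only. The helper names are hypothetical and it assumes Python 3. The set_decoder call in the trailing comment is an assumption about the client API (only set_encoder appears in this thread), and none of the registration lines are exercised here.

```python
import json


def utf8_json_encode(obj):
    """Serialize obj to JSON and return UTF-8 encoded bytes (hypothetical helper)."""
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")


def utf8_json_decode(data):
    """Inverse of utf8_json_encode: accept bytes or text, return Python objects."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)


# Registration would mirror the snippet above (needs a running Riak client,
# so it is left as a comment; set_decoder is an assumption):
# R.set_encoder('application/json', utf8_json_encode)
# R.set_decoder('application/json', utf8_json_decode)
```

Keeping the encode and decode sides together means whatever is written as UTF-8 is also read back as UTF-8, rather than relying on a default codec somewhere in the transport.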

--  
Adam Lindsay


On Friday, 1 February 2013 at 16:11, Sean Cribbs wrote:

> For what it's worth, the underlying transports don't (read: shouldn't)
> care about the encoding of the payload. They just want a chunk of
> bytes. Is there an equivalent to "hey, I know this is probably a
> unicode or string object, but just give me the equivalent bytearray
> without transcoding anything"? If there is, we should be using that.
>  
> On Fri, Feb 1, 2013 at 9:57 AM, Anton <theatilla at gmail.com> wrote:
> > Adam, you should be able to write to any transport if you first
> > .encode('utf-8') the result there, right? ensure_ascii=False will feed
> > you unicode objects (if and only if there's something non-ASCII in the
> > input to .dumps). They of course will cause anything that attempts to
> > coerce them to a string to go wrong, as it'll attempt to do that by
> > encoding to ASCII.
> >  
> > On 1 February 2013 16:45, Adam Lindsay <atl at alum.mit.edu> wrote:
> > > Anton, Sean,
> > >  
> > > Anton brings up a pretty interesting problem.
> > >  
> > > At first, I thought it might be easy to remedy with:
> > >  
> > > import json
> > > import functools
> > > antonjson = functools.partial(json.dumps, ensure_ascii=False)
> > >  
> > > from riak import RiakClient
> > > R = RiakClient()
> > > R.set_encoder('application/json', antonjson)
> > >  
> > > …however, upon testing this out, it seems likely that the underlying
> > > transport channels use the default encoding, 'ascii,' and choke on the 8-bit
> > > data we now pass it, in socket.py (for the HTTP client) or
> > > protobuf.internal.type_checkers (for PBC).
> > >  
> > > Maybe that's a suitable hint for Anton's further investigation, but I'll try
> > > to spend some time with it to see what I can find, as well.
> > >  
> > > As to the OP's question: Yes, you've summarized the state of affairs quite
> > > nicely. IMHO it was a reasonable default (you can't be sure other Riak
> > > clients are as good as Python at 8-bit/Unicode!), but the underlying
> > > implementation definitely shows a bug that (again, IMHO) should and can be
> > > fixed.
> > > --
> > > Adam Lindsay
> > >  
> > > On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
> > >  
> > > Anton,
> > >  
> > > I don't see any reason why this can't be fixed. However, since I'm not
> > > familiar with the specifics of the JSON implementation, I'll need
> > > assistance. Please open an issue or pull-request on the Python client:
> > > https://github.com/basho/riak-python-client/issues. We are open to
> > > major, breaking changes for the next release.
> > >  
> > > On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at gmail.com> wrote:
> > >  
> > > Let's talk python and Unicode (yey!)
> > >  
> > > The objects that I want to store will have non-ASCII strings in them.
> > > Potentially a lot. How much is a lot? "Very many millions" should be a
> > > good estimate.
> > >  
> > > Now, the default behaviour for storing a python object (ok, a dict of
> > > stuff), using the PBC transport is to pass them to json and encode
> > > them. I'm ok with that, I like JSON and the fact that I can read out
> > > an object in JSON, using a browser, helps a lot. It's really great for
> > > developing project-specific tools, say debugging tools.
> > >  
> > > But here is where the fun part starts. The JSON encoder in python is
> > > not a simple thing, and takes a lot of parameters. And by default it
> > > works. So well that people rarely look at what's going on. When you
> > > look at what's going on, however, things get more entertaining.
> > >  
> > > The JSON encoder works on unicode objects, not strings. When you pass
> > > it unicode objects, it's happy. When you pass it strings, it decodes
> > > them, using a specified encoding. By default this is set to 'utf-8'
> > > which makes everything quite ok. So far so good. However, there's
> > > another option - 'ensure_ascii'. This is set to True by default and it
> > > means that the JSON encoder will spew out an ASCII-encoded string.
> > > That is, in the result, every non-ASCII code point is escaped in the
> > > form \u0123, for a total of 6 bytes.
> > >  
> > > Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> > > using UTF-*. Also, even if much of the data will require 3 bytes in
> > > UTF-8, that's still only half the bytes that the python default would
> > > take.
> > >  
> > > Now, consider this elementary example. It already gives a significant
> > > (in bytes) difference for a short string:
> > > http://pastie.org/6011147
> > >  
> > >  
> > > Please tell me I'm not going crazy and all this is the state of
> > > affairs and it is, in fact, wrong and can/should be fixed.
> > >  
> > > _______________________________________________
> > > riak-users mailing list
> > > riak-users at lists.basho.com
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > >  
> > >  
> > >  
> > >  
> > > --
> > > Sean Cribbs <sean at basho.com>
> > > Software Engineer
> > > Basho Technologies, Inc.
> > > http://basho.com/
> > >  
>  
> --  
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
>  
>  

