Fun with Unicode

Anton theatilla at gmail.com
Fri Feb 1 10:57:17 EST 2013


Adam, you should be able to write to any transport if you first
.encode('utf-8') the result, right? With ensure_ascii=False, .dumps
can hand you unicode objects (typically when the input to .dumps
contains unicode strings). Those will of course make anything that
tries to coerce them to a string go wrong, because the coercion
encodes to ASCII.
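
Roughly what I mean, as a minimal sketch. The name dumps_utf8 is made up
for illustration, and I have not verified that either transport accepts
the result unchanged:

```python
import json

def dumps_utf8(obj):
    """Serialize obj to JSON text, then encode that text as UTF-8 bytes."""
    # ensure_ascii=False leaves non-ASCII characters unescaped in the text.
    text = json.dumps(obj, ensure_ascii=False)
    # Encode explicitly so nothing downstream falls back to ASCII.
    return text.encode('utf-8')

payload = dumps_utf8({u'name': u'Zo\xeb'})   # u'\xeb' is "e with diaeresis"
print(repr(payload))
```

If that holds up, a function like this could be registered the same way
as antonjson below, via set_encoder('application/json', ...). One caveat
under Python 2: if the input holds non-ASCII byte strings rather than
unicode objects, the final .encode() will itself attempt an ASCII decode
first and fail.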

On 1 February 2013 16:45, Adam Lindsay <atl at alum.mit.edu> wrote:
> Anton, Sean,
>
> Anton brings up a pretty interesting problem.
>
> At first, I thought it might be easy to remedy with:
>
> import json
> import functools
> antonjson = functools.partial(json.dumps, ensure_ascii=False)
>
> from riak import RiakClient
> R = RiakClient()
> R.set_encoder('application/json', antonjson)
>
> …however, upon testing this out, it seems likely that the underlying
> transport channels use the default encoding, 'ascii', and choke on the 8-bit
> data we now pass them, in socket.py (for the HTTP client) or
> protobuf.internal.type_checkers (for PBC).
>
> Maybe that's a suitable hint for Anton's further investigation, but I'll try
> to spend some time with it to see what I can find, as well.
>
> As to the OP's question: Yes, you've summarized the state of affairs quite
> nicely. IMHO it was a reasonable default (you can't be sure other Riak
> clients are as good as Python at 8-bit/Unicode!), but the underlying
> implementation definitely shows a bug that (again, IMHO) should and can be
> fixed.
> --
> Adam Lindsay
>
> On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
>
> Anton,
>
> I don't see any reason why this can't be fixed. However, since I'm not
> familiar with the specifics of the JSON implementation, I'll need
> assistance. Please open an issue or pull-request on the Python client:
> https://github.com/basho/riak-python-client/issues. We are open to
> major, breaking changes for the next release.
>
> On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at gmail.com> wrote:
>
> Let's talk Python and Unicode (yay!)
>
> The objects that I want to store will have non-ASCII strings in them.
> Potentially a lot. How much is a lot? "Very many millions" should be a
> good estimate.
>
> Now, the default behaviour for storing a python object (ok, a dict of
> stuff), using the PBC transport is to pass them to json and encode
> them. I'm ok with that, I like JSON and the fact that I can read out
> an object in JSON, using a browser, helps a lot. It's really great for
> developing project-specific tools, say debugging tools.
>
> But here is where the fun part starts. The JSON encoder in python is
> not a simple thing, and takes a lot of parameters. And by default it
> works. So well that people rarely look at what's going on. When you
> look at what's going on, however, things get more entertaining.
>
> The JSON encoder works on unicode objects, not strings. When you pass
> it unicode objects, it's happy. When you pass it strings, it decodes
> them, using a specified encoding. By default this is set to 'utf-8'
> which makes everything quite ok. So far so good. However, there's
> another option - 'ensure_ascii'. This is set to True by default and it
> means that the JSON encoder will spew out an ASCII-only string.
> That is, in the result, every non-ASCII code point is escaped as \uXXXX,
> a total of 6 bytes (12 for code points outside the BMP, which need a
> surrogate pair of two escapes).
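
To make the escaping concrete, here is a small sketch. The sample string
is an arbitrary choice for illustration: one accented Latin letter, two
CJK characters and one character outside the BMP.

```python
import json

# e-acute, two CJK characters, and one emoji outside the BMP
sample = u'caf\xe9 \u65e5\u672c \U0001F600'

escaped = json.dumps(sample)                    # ensure_ascii=True is the default
raw = json.dumps(sample, ensure_ascii=False)    # non-ASCII characters kept as-is

print(escaped)   # "caf\u00e9 \u65e5\u672c \ud83d\ude00"
print(len(escaped), len(raw.encode('utf-8')))
```

Plain ASCII characters pass through untouched in both cases; only the
non-ASCII ones are expanded into escapes. Both forms decode back to the
same string with json.loads.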
>
> Now, this is not good. For one, the JSON RFC (4627) expects Unicode
> text, with UTF-8 as the default encoding. Also, even if much of the
> data requires 3 bytes per character in UTF-8, that's still only half
> the bytes that the Python default would take.
>
> Now, consider this elementary example. It already gives a significant
> (in bytes) difference for a short string:
> http://pastie.org/6011147
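
For anyone reading without the pastie: the following is a stand-in sketch,
not the contents of that link. The dict and its values are invented for
illustration; it compares the encoded size of one short document under
both settings.

```python
import json

doc = {u'city': u'\u6771\u4eac'}   # two CJK characters, 3 bytes each in UTF-8

# Default: ASCII-only output, each CJK character becomes a 6-byte escape.
ascii_json = json.dumps(doc).encode('utf-8')

# ensure_ascii=False: characters are kept, then encoded as UTF-8.
utf8_json = json.dumps(doc, ensure_ascii=False).encode('utf-8')

print('%d bytes escaped, %d bytes as UTF-8' % (len(ascii_json), len(utf8_json)))
# 24 bytes escaped, 18 bytes as UTF-8
```

The fixed JSON punctuation and the ASCII key are identical in both
outputs, so the whole difference comes from the two non-ASCII characters:
12 bytes escaped versus 6 bytes in UTF-8.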
>
>
> Please tell me I'm not going crazy: that this really is the state of
> affairs, that it is, in fact, wrong, and that it can/should be fixed.
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>
>
> --
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
>



