Fun with Unicode

Sean Cribbs sean at basho.com
Fri Feb 1 11:11:25 EST 2013


For what it's worth, the underlying transports don't (read: shouldn't)
care about the encoding of the payload. They just want a chunk of
bytes. Is there an equivalent to "hey, I know this is probably a
unicode or string object, but just give me the equivalent bytearray
without transcoding anything"? If there is, we should be using that.
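
A minimal sketch of what that "just give me bytes" step could look like, assuming Python 3 semantics (where json.dumps always returns text and .encode() always returns bytes). The helper name dumps_utf8 is hypothetical and not part of the Riak client. On Python 2, which this thread concerns, json.dumps may return either str or unicode, so a type check would be needed before encoding.

```python
import json


def dumps_utf8(obj):
    """Serialize obj to JSON and return it as UTF-8 encoded bytes."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of
    # replacing them with \uXXXX escapes.
    text = json.dumps(obj, ensure_ascii=False)
    # Encode explicitly so the transport never has to guess an encoding.
    return text.encode("utf-8")


payload = dumps_utf8({"name": "Антон"})
print(type(payload), len(payload))
```

A function like this might be handed to set_encoder in place of the plain json.dumps partial quoted below. Whether the HTTP and PBC transports then accept the bytes unchanged has not been tested here.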

On Fri, Feb 1, 2013 at 9:57 AM, Anton <theatilla at gmail.com> wrote:
> Adam, you should be able to write to any transport if you first
> .encode('utf-8') the result there, right? ensure_ascii=False will feed
> you unicode objects (if and only if there's something non-ASCII in the
> input to .dumps). They of course will cause anything that attempts to
> coerce them to a string to go wrong, as it'll attempt to do that by
> encoding to ASCII.
>
> On 1 February 2013 16:45, Adam Lindsay <atl at alum.mit.edu> wrote:
>> Anton, Sean,
>>
>> Anton brings up a pretty interesting problem.
>>
>> At first, I thought it might be easy to remedy with:
>>
>> import json
>> import functools
>> antonjson = functools.partial(json.dumps, ensure_ascii=False)
>>
>> from riak import RiakClient
>> R = RiakClient()
>> R.set_encoder('application/json', antonjson)
>>
>> …however, upon testing this out, it seems likely that the underlying
>> transport channels use the default encoding, 'ascii,' and choke on the 8-bit
>> data we now pass it, in socket.py (for the HTTP client) or
>> protobuf.internal.type_checkers (for PBC).
>>
>> Maybe that's a suitable hint for Anton's further investigation, but I'll try
>> to spend some time with it to see what I can find, as well.
>>
>> As to the OP's question: Yes, you've summarized the state of affairs quite
>> nicely. IMHO it was a reasonable default (you can't be sure other Riak
>> clients are as good as Python at 8-bit/Unicode!), but the underlying
>> implementation definitely shows a bug that (again, IMHO) should and can be
>> fixed.
>> --
>> Adam Lindsay
>>
>> On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
>>
>> Anton,
>>
>> I don't see any reason why this can't be fixed. However, since I'm not
>> familiar with the specifics of the JSON implementation, I'll need
>> assistance. Please open an issue or pull-request on the Python client:
>> https://github.com/basho/riak-python-client/issues. We are open to
>> major, breaking changes for the next release.
>>
>> On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at gmail.com> wrote:
>>
>> Let's talk python and Unicode (yey!)
>>
>> The objects that I want to store will have non-ASCII strings in them.
>> Potentially a lot. How much is a lot? "Very many millions" should be a
>> good estimate.
>>
>> Now, the default behaviour for storing a python object (ok, a dict of
>> stuff), using the PBC transport is to pass them to json and encode
>> them. I'm ok with that, I like JSON and the fact that I can read out
>> an object in JSON, using a browser, helps a lot. It's really great for
>> developing project-specific tools, say debugging tools.
>>
>> But here is where the fun part starts. The JSON encoder in python is
>> not a simple thing, and takes a lot of parameters. And by default it
>> works. So well that people rarely look at what's going on. When you
>> look at what's going on, however, things get more entertaining.
>>
>> The JSON encoder works on unicode objects, not strings. When you pass
>> it unicode objects, it's happy. When you pass it strings, it decodes
>> them, using a specified encoding. By default this is set to 'utf-8'
>> which makes everything quite ok. So far so good. However, there's
>> another option - 'ensure_ascii'. This is set to True by default and it
>> means that the JSON encoder will spew out an ASCII-encoded string.
>> That is, in the result, every non-ASCII code point is escaped as
>> \uXXXX, a total of 6 bytes (12 for code points outside the BMP,
>> which are written as a surrogate pair).
>>
>> Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
>> using UTF-*. Also, even if much of the data will require 3 bytes in
>> UTF-8, that's still only half the bytes that the python default would
>> take.
>>
>> Now, consider this elementary example. It already gives a significant
>> (in bytes) difference for a short string:
>> http://pastie.org/6011147
>>
>>
>> Please tell me I'm not going crazy and all this is the state of
>> affairs and it is, in fact, wrong and can/should be fixed.
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>>
>>
>> --
>> Sean Cribbs <sean at basho.com>
>> Software Engineer
>> Basho Technologies, Inc.
>> http://basho.com/
>>
>>
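
The size difference Anton describes in the quoted message can be checked with a short comparison. This is a sketch assuming Python 3. The sample string is an arbitrary illustration, not the content of the pastie link above.

```python
import json

# Five code points, each needing 3 bytes in UTF-8.
sample = "こんにちは"

# Default (ensure_ascii=True): every non-ASCII code point becomes a
# 6-byte \uXXXX escape.
escaped = json.dumps(sample).encode("ascii")

# ensure_ascii=False followed by an explicit UTF-8 encode keeps the
# characters themselves.
raw = json.dumps(sample, ensure_ascii=False).encode("utf-8")

print(len(escaped), len(raw))

# Both forms decode to the same value.
assert json.loads(escaped.decode("ascii")) == json.loads(raw.decode("utf-8"))
```

For this input the escaped form is 32 bytes and the UTF-8 form is 17 bytes, including the two surrounding quote characters in each case.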



-- 
Sean Cribbs <sean at basho.com>
Software Engineer
Basho Technologies, Inc.
http://basho.com/



