Fun with Unicode

Anton theatilla at gmail.com
Sat Feb 2 03:28:50 EST 2013


Yes. So far, I believe that this exact expression should replace every
invocation of .dumps() in the Python client, and in our case
specifically. The way I understand it, PHP and Python escape by default
in order to protect ancient systems which do not like to read non-ASCII
values, but I have to say that the reasons behind all this are quite
hard to dig out.

Unless someone else is already on this, I can give it a go in a branch
and see if there is anything more to it.

However, in the case where libraries in multiple languages will have
access to the same cluster, one has to check what the default encoder
in each of them does. Because the \uXXXX escaping will work,
transparently to you, while eating up storage space and bandwidth.
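
To make that concrete, here is a quick sketch (Python 2; the sample
data is made up):

# -*- coding: utf-8 -*-
import json

doc = {u'name': u'Атанас'}

escaped = json.dumps(doc)                                   # default: ensure_ascii=True
utf8 = json.dumps(doc, ensure_ascii=False).encode('utf-8')  # proposed replacement

# Any JSON consumer reads both forms back as the same object,
# so nothing downstream can tell the difference...
assert json.loads(escaped) == json.loads(utf8)

# ...except the storage layer: the escaped form spends 6 bytes per
# non-ASCII code point, the UTF-8 form only 2-3.
print len(escaped), len(utf8)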

On 1 February 2013 17:21, Adam Lindsay <atl at alum.mit.edu> wrote:
> Ugh, yes, to both.
>
> Anton put his finger on it exactly--I made a typical Python Unicode goof in
> not being explicit about the encoding. My bad for assuming the json module
> would do so.
>
> So, Anton, would your use case be served by the following?
>
> antonjson = lambda x: json.dumps(x, ensure_ascii=False).encode("utf8")
> R.set_encoder('application/json', antonjson)
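>
> If that does the trick, usage should then be as simple as (untested;
> the bucket and key names here are made up, assuming the client's usual
> bucket API):
>
> bucket = R.bucket('demo')
> obj = bucket.new('k1', data={u'name': u'Атанас'})
> obj.store()  # payload goes out as UTF-8 bytes, not \uXXXX escapes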
>
> --
> Adam Lindsay
>
> On Friday, 1 February 2013 at 16:11, Sean Cribbs wrote:
>
> For what it's worth, the underlying transports don't (read: shouldn't)
> care about the encoding of the payload. They just want a chunk of
> bytes. Is there an equivalent to "hey, I know this is probably a
> unicode or string object, but just give me the equivalent bytearray
> without transcoding anything"? If there is, we should be using that.
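>
> Something along these lines is what I have in mind (a Python 2 sketch;
> to_bytes is a made-up name, not an existing client function):
>
> def to_bytes(payload):
>     # unicode gets exactly one explicit, lossless encoding step
>     if isinstance(payload, unicode):
>         return payload.encode('utf-8')
>     # byte strings pass through untouched, with no implicit ASCII
>     # round-trip anywhere
>     return payload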
>
> On Fri, Feb 1, 2013 at 9:57 AM, Anton <theatilla at gmail.com> wrote:
>
> Adam, you should be able to write to any transport if you first
> .encode('utf-8') the result there, right? ensure_ascii=False will feed
> you unicode objects (if and only if there's something non-ASCII in the
> input to .dumps). Those will, of course, break anything that attempts
> to coerce them to a byte string, since the implicit coercion encodes
> to ASCII.
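>
> You can see all of this in an interpreter session (Python 2):
>
> >>> import json
> >>> # all-ASCII input comes back as a plain str
> >>> type(json.dumps({u'k': u'abc'}, ensure_ascii=False))
> <type 'str'>
> >>> # non-ASCII input comes back as unicode
> >>> type(json.dumps({u'k': u'héllo'}, ensure_ascii=False))
> <type 'unicode'>
> >>> # and any implicit coercion to str blows up on the ASCII codec
> >>> str(json.dumps({u'k': u'héllo'}, ensure_ascii=False))
> Traceback (most recent call last):
>   ...
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...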
>
> On 1 February 2013 16:45, Adam Lindsay <atl at alum.mit.edu> wrote:
>
> Anton, Sean,
>
> Anton brings up a pretty interesting problem.
>
> At first, I thought it might be easy to remedy with:
>
> import json
> import functools
> antonjson = functools.partial(json.dumps, ensure_ascii=False)
>
> from riak import RiakClient
> R = RiakClient()
> R.set_encoder('application/json', antonjson)
>
> …however, upon testing this out, it seems likely that the underlying
> transport channels use the default encoding, 'ascii', and choke on the
> 8-bit data we now pass them, in socket.py (for the HTTP client) or
> protobuf.internal.type_checkers (for PBC).
>
> Maybe that's a suitable hint for Anton's further investigation, but I'll try
> to spend some time with it to see what I can find, as well.
>
> As to the OP's question: Yes, you've summarized the state of affairs quite
> nicely. IMHO it was a reasonable default (you can't be sure other Riak
> clients are as good as Python at 8-bit/Unicode!), but the underlying
> implementation definitely shows a bug that (again, IMHO) should and can be
> fixed.
> --
> Adam Lindsay
>
> On Friday, 1 February 2013 at 14:27, Sean Cribbs wrote:
>
> Anton,
>
> I don't see any reason why this can't be fixed. However, since I'm not
> familiar with the specifics of the JSON implementation, I'll need
> assistance. Please open an issue or pull-request on the Python client:
> https://github.com/basho/riak-python-client/issues. We are open to
> major, breaking changes for the next release.
>
> On Fri, Feb 1, 2013 at 8:06 AM, Anton <theatilla at gmail.com> wrote:
>
> Let's talk Python and Unicode (yay!)
>
> The objects that I want to store will have non-ASCII strings in them.
> Potentially a lot. How much is a lot? "Very many millions" should be a
> good estimate.
>
> Now, the default behaviour for storing a Python object (OK, a dict of
> stuff) using the PBC transport is to pass it to the json module and
> encode it. I'm fine with that: I like JSON, and the fact that I can
> read an object out as JSON using a browser helps a lot. It's really
> great for developing project-specific tools, say debugging tools.
>
> But here is where the fun part starts. The JSON encoder in Python is
> not a simple thing, and it takes a lot of parameters. By default it
> works, so well that people rarely look at what's going on under the
> hood. When you do look, however, things get more entertaining.
>
> The JSON encoder works on unicode objects, not strings. When you pass
> it unicode objects, it's happy. When you pass it strings, it decodes
> them, using a specified encoding. By default this is set to 'utf-8'
> which makes everything quite ok. So far so good. However, there's
> another option - 'ensure_ascii'. This is set to True by default and it
> means that the JSON encoder will spew out an ASCII-encoded string.
> That is, every non-ASCII code point in the result is escaped as
> \uXXXX, a total of 6 bytes per code point.
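>
> For example (Python 2):
>
> >>> import json
> >>> json.dumps({'k': 'h\xc3\xa9llo'})  # UTF-8 bytes, decoded for you
> '{"k": "h\\u00e9llo"}'
> >>> json.dumps({u'k': u'h\xe9llo'})    # unicode input, same escaped result
> '{"k": "h\\u00e9llo"}'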
>
> Now, this is not good. For one, the JSON RFCs expect Unicode, encoded
> using one of the UTFs. Also, even if much of the data requires 3 bytes
> per character in UTF-8, that's still only half of the 6 bytes that the
> Python default takes.
>
> Now, consider this elementary example. It already gives a significant
> (in bytes) difference for a short string:
> http://pastie.org/6011147
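>
> In the same spirit, a minimal sketch of the comparison (Python 2; the
> sample string is made up, not the one from the pastie):
>
> # -*- coding: utf-8 -*-
> import json
>
> data = {u'msg': u'привет'}  # six Cyrillic code points
>
> print len(json.dumps(data))                                      # 47: 6 bytes per code point
> print len(json.dumps(data, ensure_ascii=False).encode('utf-8'))  # 23: 2 bytes per code point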
>
>
> Please tell me I'm not going crazy and all this is the state of
> affairs and it is, in fact, wrong and can/should be fixed.
>
> --
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/
>
> --
> Sean Cribbs <sean at basho.com>
> Software Engineer
> Basho Technologies, Inc.
> http://basho.com/