This is about riak search question. How to search utf8 format data?

Ryan Zezeski rzezeski at basho.com
Sun Oct 28 15:10:11 EDT 2012


On Wed, Oct 10, 2012 at 12:52 AM, 郎咸武 <langxianzhe at gmail.com> wrote:
>
>
> *2) To put an object into the <<"user1">> bucket. The data is in UTF-8 format.*
>
> (trends at jason-lxw)123> f(O), O=riakc_obj:new(<<"user1">>,
> <<"jason5">>,list_to_binary(mochijson:encode({struct, [{name,
> binary_to_list(unicode:characters_to_binary("爱"))},{sex,"male"}]})),
> "application/json").
> {riakc_obj,<<"user1">>,<<"jason5">>,undefined,[],
>            {dict,1,16,16,8,80,48,
>                  {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>                  {{[],[],[],[],[],[],[],[],[],[],[[<<...>>|...]],[],[],...}}},
>            <<"{\"name\":\"\\u00e7\\u0088\\u00b1\",\"sex\":\"male\"}">>}
> (trends at jason-lxw)124> riakc_pb_socket:put(Pid, O).
>
> ok
>
>
First, let's start with your data and make sure it's getting stored
properly.

3> UC = unicode:characters_to_binary("爱").
<<231,136,177>>

Okay, so Erlang properly encoded this as a 3-byte UTF-8 sequence.  What
does mochijson2 think? (I noticed you are using mochijson; I recommend using
mochijson2.)

4> mochijson2:encode({struct, [{name, UC}]}).
[123,[34,"name",34],58,[34,"\\u7231",34],125]

Good, mochijson2 properly escaped this as \u7231, i.e. code point U+7231.
A quick lookup on the web verifies this is correct:
http://www.fileformat.info/info/unicode/char/7231/index.htm.

But notice in your code you call binary_to_list on the binary before
passing it to mochi.  Let's see what happens.

15> binary_to_list(UC).
[231,136,177]

Okay, so the integers are correct.  But Erlang treats lists differently
from binaries.  To Erlang this is just a list of three integers, with no
record that together they are the UTF-8 bytes of a single character.

16> io:format("~ts~n",[binary_to_list(UC)]).
爱
ok

Mochi treats each integer in the list as a separate character, which is
why it converted it to 3 characters: \\u00e7\\u0088\\u00b1
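To make the effect concrete, here is a minimal sketch of per-integer escaping using only the standard library. The module name escape_sketch and the escape/1 helper are made up for illustration and are not mochi's actual code; the sketch only handles code points below 16#10000.

```erlang
-module(escape_sketch).
-export([escape/1, demo/0]).

%% Hypothetical helper (not part of mochijson): render every integer in
%% the list as a JSON \uXXXX escape, treating each integer as its own
%% code point. Only meant for code points below 16#10000.
escape(Codepoints) when is_list(Codepoints) ->
    lists:flatten([io_lib:format("\\u~4.16.0b", [C]) || C <- Codepoints]).

demo() ->
    UC = <<231,136,177>>,                    % UTF-8 bytes of U+7231
    Bytes = binary_to_list(UC),              % one integer per byte
    Chars = unicode:characters_to_list(UC),  % one integer per code point
    %% Three escapes for the byte list, a single escape for the
    %% code-point list.
    {escape(Bytes), escape(Chars)}.
```

Calling escape_sketch:demo() should return the three-escape string for the byte list and the single \u7231 escape for the code-point list, which is the same difference seen in the stored object.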

To make a proper Unicode list, the unicode:characters_to_list function must
be used.

17> UCS = unicode:characters_to_list("爱").
[29233]

18> io:format("~ts~n", [UCS]).
爱
ok
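If the input comes from somewhere you don't control, unicode:characters_to_list can also return an error or incomplete tuple instead of a list. A small sketch of a wrapper that handles those cases, assuming the input is a UTF-8 binary; the module name codepoints_sketch, the to_codepoints/1 helper, and the error atoms are hypothetical names chosen for this example.

```erlang
-module(codepoints_sketch).
-export([to_codepoints/1]).

%% Hypothetical helper: decode a UTF-8 binary into a list of Unicode
%% code points, returning a tagged tuple instead of a raw error tuple.
to_codepoints(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) ->
            {ok, Chars};
        {error, _Decoded, _Rest} ->
            %% The binary contains bytes that are not valid UTF-8.
            {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} ->
            %% The binary ends in the middle of a multi-byte sequence.
            {error, truncated_utf8}
    end.
```

For example, codepoints_sketch:to_codepoints(<<231,136,177>>) gives {ok,[29233]}, while passing only the first two bytes gives {error,truncated_utf8}.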

Let's try encoding again, but this time with mochijson2, passing the binary
directly and leaving out the binary_to_list and list_to_binary calls.

19> riakc_obj:new(<<"user1">>, <<"jason5">>, mochijson2:encode({struct,
[{name, unicode:characters_to_binary("爱")}]}), "application/json").
{riakc_obj,<<"user1">>,<<"jason5">>,undefined,[],
           {dict,1,16,16,8,80,48,
                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
                 {{[],[],[],[],[],[],[],[],[],[],[[<<...>>|...]],[],[],...}}},
           [123,[34,"name",34],58,[34,"\\u7231",34],125]}

And there we go.  A properly encoded unicode character.
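Note that the value above is an iolist rather than a binary. If your client version expects a binary value (an assumption on my part, check the riakc_obj docs for your release), you can flatten it first with iolist_to_binary. A minimal sketch using the encoded output shown above; the module name value_sketch and the to_binary_value/1 helper are hypothetical.

```erlang
-module(value_sketch).
-export([to_binary_value/1, demo/0]).

%% Hypothetical helper: flatten an encoder's iolist output into a single
%% binary suitable for use as an object value.
to_binary_value(IoList) ->
    iolist_to_binary(IoList).

demo() ->
    %% The iolist copied from the mochijson2:encode/1 output shown above.
    Encoded = [123,[34,"name",34],58,[34,"\\u7231",34],125],
    to_binary_value(Encoded).
```

Here value_sketch:demo() returns <<"{\"name\":\"\\u7231\"}">>, which could then be passed as the value argument to riakc_obj:new/4.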

-Z