Using UUID as keys is problematic for Riak Search

Eric Redmond eredmond at basho.com
Sun Aug 10 20:43:50 EDT 2014


I'm at my laptop now so I can talk a bit more about it.

Don't conflate the value type with the encodings. UUID is a field type, just as dates and integers are field types. A field type tells the Solr indexer how to reason about the value it gets. The string "20140810" is encoded differently than the integer 20140810 or the date "20140810". This matters for the queries you can build, since a date range query behaves differently than an integer or string range query.
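
To make the range point concrete, here is a small Python sketch. It is not Solr, just plain Python comparisons standing in for how a string field and an integer field would order the same values. The sample values and the helper name in_range are made up for illustration.

```python
# Illustration only: plain Python comparisons standing in for Solr field types.
values = ["9", "10", "100", "20140810"]

def in_range(items, low, high):
    """Return the items that fall in the inclusive range [low, high]."""
    return [v for v in items if low <= v <= high]

# Treated as strings, the range is lexicographic (compared character by character).
as_strings = in_range(values, "1", "2")

# Treated as integers, the same bounds give a numeric range.
as_integers = in_range([int(v) for v in values], 1, 2)

print(as_strings)
print(as_integers)
```

The string range picks up "10" and "100" because they sort between "1" and "2", while the integer range matches nothing. Same characters on the wire, different answers depending on the field type.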

That said, in Solr, usually UUID is generated on the backend, such as with UUIDUpdateProcessorFactory. Even so, you can no more send a binary UUID than you can a binary date value.

There are two encodings you have to think about when dealing with Solr. Anything that's binary needs to be converted to a string that Solr can understand, and Base64 is a standard way to convert a binary value to a string value. So in the case of your key (in Erlang):

1> base64:encode(<<94,143,33,35,45,180,78,164,151,237,72,81,56,13,28,250>>).
<<"Xo8hIy20TqSX7UhROA0c+g==">>

base64 encoding libs exist in any language.
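
For example, a rough Python equivalent using only the standard library might look like the sketch below. The variable names are mine, and the URL-safe variant is an extra you may or may not need; it only matters if the "+" or "/" characters in the standard alphabet cause trouble wherever your keys end up (HTTP paths, for instance).

```python
import base64
import uuid

# The same 16-byte key used in the Erlang example above.
raw_key = bytes([94, 143, 33, 35, 45, 180, 78, 164,
                 151, 237, 72, 81, 56, 13, 28, 250])

# Standard Base64: 24 characters including the "==" padding.
encoded = base64.b64encode(raw_key).decode("ascii")

# The URL-safe alphabet swaps "+" and "/" for "-" and "_".
url_safe = base64.urlsafe_b64encode(raw_key).decode("ascii")

# Decoding recovers the original bytes, so the UUID itself is not lost.
decoded = base64.b64decode(encoded)
as_uuid = uuid.UUID(bytes=decoded)

print(encoded)
print(url_safe)
print(as_uuid)
```

Whichever alphabet you pick, use it consistently for every key, since the two variants produce different strings for the same bytes.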

Once you have this key string in base64, internally, Yokozuna will assume that string is valid UTF8.

I was probably a bit hasty when I said "yokozuna only supports UTF8". What I should have said is that "yokozuna assumes types/buckets/keys are UTF8 and encodes values appropriately."
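
If you want to check a key before writing it, something like this Python sketch would do. It is not Yokozuna code, only a client-side approximation of the UTF8 assumption, and the helper name is_valid_utf8 is hypothetical.

```python
import base64

# The raw 16-byte UUID key from the failing index attempt.
raw_key = bytes([94, 143, 33, 35, 45, 180, 78, 164,
                 151, 237, 72, 81, 56, 13, 28, 250])

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The raw bytes are not valid UTF-8 (143 is a stray continuation byte).
raw_ok = is_valid_utf8(raw_key)

# The Base64 form is plain ASCII, which is always valid UTF-8.
encoded_ok = is_valid_utf8(base64.b64encode(raw_key))

print(raw_ok, encoded_ok)
```

The raw key fails the check and the Base64 key passes, which matches the bad_utf8_character_code error you saw when indexing the raw bytes.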

So in summation:

UUID:   Solr field type
Base64:  Encode binary values to a string
UTF8:  The assumed string encoding

Does that help?
Eric


On Aug 10, 2014, at 5:03 PM, David James <davidcjames at gmail.com> wrote:

> Thanks for the quick responses.
> 
> Eric: I don't understand. Why does Solr have the UUIDField (http://lucene.apache.org/solr/4_7_0/solr-core/org/apache/solr/schema/UUIDField.html) if it were not indexable? What is the nature of the limitation?
> 
> Jason: Thanks, I will consider Base 64 encoding.
> 
> 
> On Sun, Aug 10, 2014 at 7:19 PM, Jason Campbell <xiaclo at xiaclo.net> wrote:
> I like UUIDs for everything as well, although I expected compatibility issues with something. Base 64 encoding the binary value is a nice compromise for me, and takes 22 characters (if you drop the padding) instead of the usual 36 for the hyphenated hex format.
> 
> It would still require re-encoding all the keys, but it's a partial solution.
> 
> From: Eric Redmond
> Sent: Monday, 11 August 2014 9:15 AM
> To: David James
> Cc: riak-users
> Subject: Re: Using UUID as keys is problematic for Riak Search
> 
> You're correct that yokozuna only supports utf8, because the Solr interface only supports utf8 (note that the failure happens when attempting to build a non-utf8 JSON add document command). There's not much we can do here at the moment, since we don't yet (and may never) support a custom interface to Solr that accepts arbitrary binary values. In the meantime, to use yokozuna, you'll have to encode your keys to utf8.
> 
> Eric Redmond, Engineer @ Basho
> 
> 
> On Sun, Aug 10, 2014 at 4:01 PM, David James <davidcjames at gmail.com> wrote:
> 
> I'm using UUIDs for keys in Riak -- converted to bytes, not UTF-8 strings. (I'd rather spend 16 bytes for each key, not 36.)
> 
> As I understand it, Yokozuna maps the Riak key to _yz_id.
> 
> Here is the suggested schema from the documentation:
> 
> <!-- schema.xml -->
> <field name="_yz_id" type="_yz_str" indexed="true" stored="true" multiValued="false" required="true"/> 
> <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true"/>
> 
> Would you expect this to work with Riak Search? I would hope so.
> 
> (Or must keys be UTF-8 strings?)
> 
> I get this error, which does not surprise me, given that the _yz_id is defined as a string:
> ==> log/error.log <==
> 
> 2014-08-10 18:24:16.221 [error] <0.610.0>@yz_kv:index:206 failed to index object {<<"test-0001">>,<<94,143,33,35,45,180,78,164,151,237,72,81,56,13,28,250>>} with error {ucs,{bad_utf8_character_code}} because [{xmerl_ucs,from_utf8,1,[{file,"xmerl_ucs.erl"},{line,185}]},{mochijson2,json_encode_string,2,[{file,"src/mochijson2.erl"},{line,186}]},{mochijson2,'-json_encode_proplist/2-fun-0-',3,[{file,"src/mochijson2.erl"},{line,167}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{mochijson2,json_encode_proplist,2,[{file,"src/mochijson2.erl"},{line,170}]},{mochijson2,'-json_encode_proplist/2-fun-0-',3,[{file,"src/mochijson2.erl"},{line,167}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{mochijson2,json_encode_proplist,2,[{file,"src/mochijson2.erl"},{line,170}]}]
> 
> I don't think changing the schema.xml type for _yz_id to "solr.UUIDField" is a good idea.
> 
> What can I do?
> 
> Thanks,
> David
