Riak Search: re-indexing after a schema change

Guillaume Boddaert guillaume at lighthouse-analytics.co
Mon Aug 29 11:27:47 EDT 2016

Hi Fred, thanks for your answer.

I'm using Riak 2.1; see the attached status export.
I'm working on a single cluster, and from time to time I need to update 
a search index on all nodes.
As a cloud user, I can consider buying a spare host for a few days in 
order to achieve a complete rollout.

I understand your plan to remove a host from production while it 
rebuilds its index. From my point of view, though, that solution only 
applies to a broken Solr index that needs to be rebuilt from scratch on 
a single host.
In my case, I need to reindex my documents because I updated my Solr 
schema, which requires wiping the existing index beforehand (create the new 
index, change the bucket's index_name prop, drop the old index) on all hosts, 
since that's a bucket-type property that I need to update.

Fred, can your plan really be applied to an « I want to update my 
search schema on my full cluster » scenario?

At the moment, I have already created the new index and destroyed the old 
one, but I am unable to use a slow Python script to force all items to be 
written again (and subsequently pushed to Solr), since I get regular 
timeouts on the key-stream API (both protobuf and HTTP).
Is there a way to run a program inside the Riak nodes (not HTTP, not 
protobuf) to achieve this simple algorithm:

for keylist in bucket.stream_keys():
    for key in keylist:
        obj = bucket.get(key)
        obj.store()  # re-writing the object pushes it back to Solr
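For reference, here is the kind of defensive loop I have in mind, as a sketch. Only `bucket.get(key)` and `obj.store()` correspond to real Riak Python client calls; the helper names (`store_with_retry`, `reindex`) and retry parameters are mine, invented for illustration:

```python
import time

def store_with_retry(fetch, store, key, retries=3, base_delay=1.0):
    """Fetch one object and re-store it, retrying on transient errors.

    fetch/store are plain callables wrapping the Riak client, so the
    retry logic does not depend on the client library itself.
    """
    for attempt in range(retries):
        try:
            store(fetch(key))
            return True
        except Exception:
            if attempt == retries - 1:
                raise
            # simple exponential backoff between attempts
            time.sleep(base_delay * (2 ** attempt))

def reindex(fetch, store, keys, log_every=10000):
    """Re-store every key so each write is re-indexed by Solr."""
    done = 0
    for key in keys:
        store_with_retry(fetch, store, key)
        done += 1
        if done % log_every == 0:
            print("re-stored %d keys" % done)
    return done
```

With the real client this would be something like `reindex(bucket.get, lambda o: o.store(), keys)`, after dumping the key list to a file once, so a crash halfway through does not force a restart from zero.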

I really fear that I will not be able to restore my index any time soon. I 
am not stressed out, because we are not in production yet and I still have 
plenty of time to fix this as new data becomes available. But the kind of 
complex operation required by an index update really freaks me out.


On 29/08/2016 14:41, Fred Dushin wrote:
> Hi Guillaume,
> A few questions.
> What version of Riak?
> Does the reindexing need to occur across the entire cluster, or just 
> on one node?
> What are the expectations about query-ability while re-indexing is 
> going on?
> If you can afford to take a node out of commission for query, then one 
> approach would be to delete your YZ data and YZ AAE trees, and let AAE 
> sync your 30 million documents from Riak.  You can increase AAE tree 
> rebuild and exchange concurrency to make that occur more quickly than 
> it does by default, but that will put a fairly significant load on 
> that node.  Moreover, because you have deleted indexed data on one 
> node, you will get inconsistent search results from Yokozuna, as the 
> node being reindexed will still show up as part of a coverage plan. 
>  Depending on the version of Riak, however, you may be able to 
> manually remove that node from coverage plans through the Riak console 
> while re-indexing is going on.  The node is still available for Riak 
> get/put operations (including indexing new entries into Solr), but it 
> will be excluded from any cover set when a query plan is generated.  I 
> can't guarantee that this would take less than 5 days, however.
> -Fred
>> On Aug 29, 2016, at 3:56 AM, Guillaume Boddaert 
>> <guillaume at lighthouse-analytics.co> wrote:
>> Hi,
>> I recently needed to alter my Riak Search schema for a bucket type 
>> that contains ~30 million rows. As a result, my index was wiped, 
>> since we are waiting for a Riak Search 2.2 feature that will sync 
>> Riak storage with the Solr index on such an occasion.
>> I adapted a script suggested by Evren Esat Özkan there 
>> (https://github.com/basho/yokozuna/issues/130#issuecomment-196189344). It 
>> is a simple Python script that streams keys and triggers a store 
>> action for every item. Unfortunately it failed past 178k items due to 
>> timeouts on the key stream. I calculated that this kind of 
>> re-indexing mechanism would take up to 5 days without a crash to 
>> succeed.
>> I was wondering if there would be a pure Erlang means to achieve a 
>> complete forced rewrite of every single element in my bucket type, 
>> rather than an error-prone and very long Python process.
>> How would you guys reindex a 30-million-item bucket type in a fast 
>> and reliable way?
>> Thanks, Guillaume
>> _______________________________________________
>> riak-users mailing list
>> riak-users at lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

-------------- next part --------------
riak_auth_mods_version : <<"2.1.0-0-g31b8b30">>
erlydtl_version : <<"0.7.0">>
riak_control_version : <<"2.1.2-0-gab3f924">>
cluster_info_version : <<"2.0.3-0-g76c73fc">>
yokozuna_version : <<"2.1.2-0-g3520d11">>
ibrowse_version : <<"4.0.2">>
riak_search_version : <<"2.1.1-0-gffe2113">>
merge_index_version : <<"2.0.1-0-g0c8f77c">>
riak_kv_version : <<"2.1.2-0-gf969bba">>
riak_api_version : <<"2.1.2-0-gd8d510f">>
riak_pb_version : <<"">>
protobuffs_version : <<"0.8.1p5-0-gf88fc3c">>
riak_dt_version : <<"2.1.1-0-ga2986bc">>
sidejob_version : <<"2.0.0-0-gc5aabba">>
riak_pipe_version : <<"2.1.1-0-gb1ac2cf">>
riak_core_version : <<"2.1.5-0-gb02ab53">>
exometer_core_version : <<"1.0.0-basho2-0-gb47a5d6">>
poolboy_version : <<"0.8.1p3-0-g8bb45fb">>
pbkdf2_version : <<"2.0.0-0-g7076584">>
eleveldb_version : <<"2.0.17-0-g973fc92">>
clique_version : <<"0.3.2-0-ge332c8f">>
bitcask_version : <<"1.7.2">>
basho_stats_version : <<"1.0.3">>
webmachine_version : <<"1.10.8-0-g7677c24">>
mochiweb_version : <<"2.9.0">>
inets_version : <<"5.9.6">>
xmerl_version : <<"1.3.4">>
erlang_js_version : <<"1.3.0-0-g07467d8">>
runtime_tools_version : <<"1.8.12">>
os_mon_version : <<"2.2.13">>
riak_sysmon_version : <<"2.0.0">>
ssl_version : <<"5.3.1">>
public_key_version : <<"0.20">>
crypto_version : <<"3.1">>
asn1_version : <<"2.0.3">>
sasl_version : <<"2.3.3">>
lager_version : <<"2.1.1">>
goldrush_version : <<"0.1.7">>
compiler_version : <<"4.9.3">>
syntax_tools_version : <<"1.6.11">>
stdlib_version : <<"1.19.3">>
kernel_version : <<"2.16.3">>
