Import big data to Riak

Georgi Ivanov ivanov at vesseltracker.com
Tue Oct 29 11:32:00 EDT 2013


Hi, and thank you for the reply. My comments follow:

> Your tests are not close to what you are going to have in production
My tests are exactly what we will have in production: one node, or in the best
case two nodes.
We don't care about durability here. Our request rate will also be extremely
low (around 300 per day?).
> IMHO, here are few recommendations:
> 
>  1. Build a cluster with at least 5 nodes with N=3 and R=W=2 (You can
>     update your bucket properties via PBC with Java)
>  2. Use PBC instead of HTTP.
Hmm, I ran some tests that showed that PBC is slower. Keep in mind that the
import script runs on the same node as Riak.
We also use 2i indexes.
The docs say that 2i is emulated using PBC (whatever that means). I don't know
if this is a problem.

Secondary Indexes (emulated, native)	✓✗

I will try this, though.
>  3. If you are only importing data call
>     .store()....withoutFetch().execute() to avoid unnecessary roundtrips.
I already do this.
The problem goes deeper, as I am not only inserting but also updating some
keys, i.e.:
	1. Fetch
	2. Merge results
	3. Store back
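
The difference between the insert-only path and the fetch/merge/store path can be sketched roughly as below. This is only an illustration under stated assumptions: the class `FetchMergeStore`, its in-memory map, the keys `vessel-1`/`vessel-2`, and the use of set union as the merge rule are all hypothetical stand-ins, not the Riak Java client API or my actual import script.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a bucket; this is NOT the Riak client API.
class FetchMergeStore {
    private final Map<String, Set<String>> bucket = new HashMap<>();

    // Insert-only path: write the value without reading the key first.
    public void storeWithoutFetch(String key, Set<String> value) {
        bucket.put(key, new TreeSet<>(value));
    }

    // Update path: one read plus one write per key.
    public Set<String> update(String key, Set<String> incoming) {
        Set<String> current = bucket.get(key);   // 1. fetch (null if the key is new)
        Set<String> merged = new TreeSet<>();
        if (current != null) {
            merged.addAll(current);
        }
        merged.addAll(incoming);                 // 2. merge (assumed here: set union)
        bucket.put(key, merged);                 // 3. store back
        return merged;
    }

    public Set<String> fetch(String key) {
        return bucket.get(key);
    }

    public static void main(String[] args) {
        FetchMergeStore store = new FetchMergeStore();
        store.storeWithoutFetch("vessel-1", new TreeSet<>(Arrays.asList("pos-a")));
        store.update("vessel-1", new TreeSet<>(Arrays.asList("pos-b")));
        store.update("vessel-2", new TreeSet<>(Arrays.asList("pos-c")));
        System.out.println(store.fetch("vessel-1"));
        System.out.println(store.fetch("vessel-2"));
    }
}
```

The point of the sketch is only that every updated key costs a read in addition to the write, so skipping the fetch helps the pure inserts but not the updates.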


> 
> If you test using unrealistic scenarios you will find unpleasant
> surprises when you are about to go live so better to set your
> expectations right at the beginning.
> 
> HTH,
> 
> Guido.
> 
> On 29/10/13 14:59, Georgi Ivanov wrote:
> > Hello,
> > I am importing some big data to Riak.
> > I am importing about 10GB per day and I have to import one year of data.
> > The task is to speed up the initial import. After that I will import on a
> > daily basis, so the speed is not very important.
> > 
> > I am using the Java HTTP client. So far my tests show that the fastest
> > setup is to use n_val 1 and import to a single server.
> > 
> > I tested importing on 2 servers (with n_val:2), but it is actually slower.
> > My Java client is multi-threaded.
> > 
> > My idea is to use n_val:1 on a single node, then increase n_val to 2 and
> > add one more node to the cluster. The problem is that I don't see the
> > storage grow when I change n_val to 2.
> > I was looking at Riak's Active Anti-Entropy feature and I am expecting my
> > storage to grow after I increase the n_val. Unfortunately this is not the
> > case, or I don't understand the AAE feature.
> > I can't see any changes in storage size at all. I don't want to go in the
> > direction of force repair as it would take forever.
> > 
> > Can anyone shed some light on AAE? Or give any tips for speeding up the
> > import in general?
> > 
> > To summarize the situation:
> > 1. One Riak node with n_val:1, eLevelDB as back-end
> > 2. Import data.
> > 3. Change n_val to 2
> > 4. Join one more node to the cluster.
> > 
> > What I expect to happen:
> > To have all the keys distributed to 2 Riak nodes with n_val:2.
> > So if I had 1TB of data on node1 with n_val:1, after changing to n_val 2
> > and joining one more node, I would have 1TB of data on each node.
> > 
> > 
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
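
The storage expectation in the quoted message is simple arithmetic, which the sketch below spells out. It is a back-of-the-envelope estimate only: the class and method names are made up for illustration, 365 days stands in for "one year", 1TB is taken as 1000GB, and the formula assumes replicas are spread perfectly evenly and ignores backend overhead and compression. It says nothing about when Riak actually creates the extra replicas after n_val is raised.

```java
// Hypothetical helper for rough capacity estimates; not part of any Riak tooling.
class StorageEstimate {

    // Raw (unreplicated) volume for an import of `days` days at `gbPerDay` GB per day.
    public static double rawGb(int days, double gbPerDay) {
        return days * gbPerDay;
    }

    // Expected storage per node when each key is kept in nVal copies
    // and the copies are spread evenly over `nodes` nodes (an assumption).
    public static double perNodeGb(double rawGb, int nVal, int nodes) {
        if (nVal < 1 || nodes < 1) {
            throw new IllegalArgumentException("nVal and nodes must be at least 1");
        }
        return rawGb * nVal / nodes;
    }

    public static void main(String[] args) {
        double yearGb = rawGb(365, 10.0);  // about 10GB per day for one year
        System.out.println("raw import volume (GB): " + yearGb);

        // The scenario from the message: 1TB on one node with n_val 1,
        // then n_val 2 on two nodes.
        System.out.println("n_val 1, 1 node (GB per node): " + perNodeGb(1000.0, 1, 1));
        System.out.println("n_val 2, 2 nodes (GB per node): " + perNodeGb(1000.0, 2, 2));
    }
}
```

Under these assumptions the per-node figure stays the same when n_val and the node count are doubled together, which is the 1TB-per-node outcome described above.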




