Using Bucket Data Types slowed insert performance

Christopher Mancini cmancini at basho.com
Tue Oct 20 14:51:34 EDT 2015


Hi Mark / Dennis,

Can you share the code snippet that writes a 5k record to Riak as
a map?

Chris

On Tue, Oct 20, 2015 at 11:30 AM Mark Schmidt <mschmidt at orcawave.net> wrote:

> Hi folks, sorry for the confusion.
>
>
>
> Our scenario is as follows:
>
>
>
> We have a 6 node development cluster running on its own network segment
> using HAProxy to facilitate load-balancing across the nodes. A single
> Riak-dot-NET client service is performing the insert operations from
> dedicated hardware located within the same network segment. We have basic
> network throughput of 100 Mbit/s, with an achievable average of about
> 75 Mbit/s.
>
>
>
> The data we are attempting to insert is composed of phone call record
> receipts from telephone carriers. These records are batched and written to
> a flat file for incorporation into our reporting engine.
>
> 1) Our Riak client process takes a flat file (in this case, a 40MB
> collection of records, each record being approximately 5k in size) and
> parses the entire file so each record can be added to a local .NET queue.
>
> 2) Once the entire file has been parsed and each record loaded into the
> local queue, 20 threads are spawned and connections are opened to our Riak
> nodes via the HAProxy.
>
> 3) Each thread pulls a 5k record from the queue on a first-come,
> first-served basis and performs a put to the Riak environment.
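
The three steps above can be sketched roughly as follows. This is a minimal Python illustration rather than the .NET client code from the thread: `drain_queue` and `fake_put` are hypothetical names, and `fake_put` is only a stand-in for whatever call actually writes a record to Riak.

```python
import queue
import threading


def drain_queue(records, put_record, num_workers=20):
    """Load every record into a queue, then let num_workers threads drain it."""
    work = queue.Queue()
    for record in records:
        work.put(record)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                record = work.get_nowait()  # first come, first served
            except queue.Empty:
                return  # queue is drained, so this thread exits
            result = put_record(record)
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_put(record):
    # Hypothetical stand-in for a Riak put; returns the payload size in bytes.
    return len(record)


if __name__ == "__main__":
    records = ["x" * 5000 for _ in range(1000)]
    sizes = drain_queue(records, fake_put)
    print(len(sizes), "records stored,", sum(sizes), "bytes total")
```

Because the queue is fully loaded before the workers start, any difference between the string run and the map run would come from the put call itself, not from the parsing or queuing.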
>
>
>
> When first testing our client insert process, we were pushing the 5k
> records as whole strings into the Riak environment. Network throughput
> topped out at around 80 Mbit/s, with a total load time of 90 seconds for
> 149k records. When the client process was modified (same queuing and
> de-queuing methods) so that each record is written to a map datatype bucket
> with its fields stored as registers, we saw network throughput drop to
> around 10 Mbit/s and total upload time increase to around 270 seconds for
> the same 149k records.
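
To put the two runs side by side, here is a small back-of-the-envelope calculation. It assumes that "5k" means roughly 5,000 bytes per record and that all 20 threads stay busy for the whole run; neither assumption is confirmed in the thread, and `summarize` is a hypothetical helper name.

```python
RECORDS = 149_000
RECORD_BYTES = 5_000  # assumption: "5k" is roughly 5,000 bytes
THREADS = 20          # client threads, as described in step 2


def summarize(total_seconds):
    """Return (records/s, payload Mbit/s, mean ms per put) for one run."""
    records_per_sec = RECORDS / total_seconds
    payload_mbit_per_sec = records_per_sec * RECORD_BYTES * 8 / 1_000_000
    # Little's law: with THREADS puts in flight, latency = THREADS / throughput.
    latency_ms = THREADS / records_per_sec * 1000
    return records_per_sec, payload_mbit_per_sec, latency_ms


if __name__ == "__main__":
    for label, seconds in (("plain strings", 90), ("map with registers", 270)):
        rps, mbit, ms = summarize(seconds)
        print(f"{label}: {rps:.0f} records/s, "
              f"{mbit:.1f} Mbit/s payload, {ms:.1f} ms per put")
```

Under those assumptions the string run works out to roughly 1,656 records/s and about 12 ms per put, while the map run works out to roughly 552 records/s and about 36 ms per put, so the per-put latency roughly triples.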
>
>
>
> It appears as though we’ve either encountered a potential bottleneck
> unrelated to network throughput, or we’re just seeing an expected
> processing penalty for our use of Riak datatypes. Please note, we’re
> configuring Zabbix so we can monitor disk IO on each node as processor and
> memory resources don’t appear to be the culprit either.
>
>
>
> If the reduction in processing speed is a natural consequence of using
> Riak data types, is the inter-node network the best place to add
> resources? Our eventual datacenter implementation will support speeds of
> over 40 Gbit/s for inter-node communication. We’re just trying to identify
> which levers we can pull from an operational standpoint to boost
> performance, or whether our client implementation is suspect.
>
>
>
> You bring up some excellent points regarding our use of CRDTs. In our
> case, the call data records are mutable as they are subject to changes by
> phone carriers for billing error corrections, incorrect data and a host of
> other reasons. We may be better served by treating the records as immutable
> and performing wide scale record removal and “reprocessing” in the event
> changes to existing records are received/requested.
>
>
>
> Thank you,
>
>
>
> Mark Schmidt
>
>
>
> *From:* Alexander Sicular [mailto:siculars at gmail.com]
> *Sent:* Tuesday, October 20, 2015 10:55 AM
> *To:* Dennis Nicolay <dnicolay at orcawave.net>
> *Cc:* Christopher Mancini <cmancini at basho.com>; riak-users at lists.basho.com;
> Mark Schmidt <mschmidt at orcawave.net>
>
>
> *Subject:* Re: Using Bucket Data Types slowed insert performance
>
>
>
> Let's talk about Riak data types for a moment. Riak data types are
> implementations of what academia refers to as CRDTs (convergent or
> conflict-free replicated data types). The key benefit a CRDT offers over
> a traditional KV object is automatic conflict resolution. The various
> CRDTs provided in Riak have specific conflict resolution strategies. This
> does not come for free: there is a computational cost associated with
> CRDTs. If your use case requires automated conflict resolution, then
> CRDTs are a good fit. Internally, CRDTs rely on vector clocks (see DVVs
> in the documentation) to resolve conflicts.
>
>
>
> Considering your ETL use case, I'm going to presume that your data is
> immutable (I could very well be wrong here). If your data is immutable, I
> would consider simply using plain KV and not paying the CRDT computational
> penalty (and possibly even using a write-once bucket). The CRDT penalty you
> pay obviously depends on your use case, configuration, hardware deployment,
> etc.
>
>
>
> Hope that helps!
> -Alexander
>
>
>
> @siculars
>
> http://siculars.posthaven.com
>
>
>
> Sent from my iRotaryPhone
>
>
> On Oct 20, 2015, at 12:39, Dennis Nicolay <dnicolay at orcawave.net> wrote:
>
> Hi Alexander,
>
>
>
> I’m parsing the file and storing each row with its own key in a map datatype
> bucket, and each column is a register.
>
>
>
> Thanks,
>
> Dennis
>
>
>
> *From:* Alexander Sicular [mailto:siculars at gmail.com]
>
> *Sent:* Tuesday, October 20, 2015 10:34 AM
> *To:* Dennis Nicolay
> *Cc:* Christopher Mancini; riak-users at lists.basho.com
> *Subject:* Re: Using Bucket Data Types slowed insert performance
>
>
>
> Hi Dennis,
>
>
>
> It's a bit unclear what you are trying to do here. Are you 1. uploading
> the entire file and saving it to one key with the value being the file? Or
> are you 2. parsing the file and storing each row as a register in a map?
>
>
>
> Neither of those approaches is appropriate in Riak KV. For the first
> case I would point you to Riak S2, which is designed to manage large binary
> object storage. You can keep the large file as a single addressable entity
> and access it via the Amazon S3 or Swift protocol. For the second case I
> would consider maintaining one key (map) per row in the file, with a
> register per column in the row. Alternatively, skip Riak data types (maps,
> sets, registers, flags and counters) and simply keep each row in the file
> as a KV in Riak, either as a raw string or as a serialized JSON string.
> ETL'ing out of relational databases and into Riak is a very common use case
> and is often implemented in the fashion I described.
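
As a rough illustration of the two per-row shapes described above, the sketch below parses one delimited row and prepares it either as a single serialized JSON value or as one register update per column. The column names, the `|` delimiter and the function names are all hypothetical, and no Riak client is involved; it only shows what would be handed to a put in each case.

```python
import json

# Hypothetical column names; the real file layout is not shown in the thread.
COLUMNS = ["call_id", "caller", "callee", "duration_s"]


def parse_row(line, delimiter="|"):
    """Split one delimited row into a column -> value dict."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def as_kv_value(row):
    """Plain KV shape: one opaque JSON string per row, stored with one put."""
    return json.dumps(row, sort_keys=True)


def as_register_updates(row):
    """Map shape: one (register name, value) pair per column in the row."""
    return list(row.items())


if __name__ == "__main__":
    row = parse_row("c-1001|15551230000|15559870000|42")
    print(as_kv_value(row))
    print(as_register_updates(row))
```

Either way there is one key per row; the difference is whether the server sees an opaque value or a set of per-field updates that it has to merge.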
>
>
>
> As Chris mentioned, the soft upper bound on value size should be 1MB. I say
> soft because we won't enforce it, although there are settings in the config
> that can be changed to enforce it (default 5MB warning, 50MB reject, I
> believe).
>
> Best,
>
> Alexander
>
>
>
>
> @siculars
>
> http://siculars.posthaven.com
>
>
>
> Sent from my iRotaryPhone
>
>
> On Oct 20, 2015, at 10:22, Christopher Mancini <cmancini at basho.com> wrote:
>
> Hi Dennis,
>
> I am not the most experienced, but what I do know is that a file that size
> causes a great deal of network chatter because Riak has to hand off that
> data to the other nodes in the network, which delays Riak's ability
> to send and confirm consistency across the ring. Typically we recommend
> that you try to structure your objects to around 1MB or less to ensure
> consistent performance. That max object size can vary, of course, based on
> your network / server specs and configuration.
>
> I hope this helps.
>
> Chris
>
>
>
> On Tue, Oct 20, 2015 at 8:18 AM Dennis Nicolay <dnicolay at orcawave.net>
> wrote:
>
> Hi,
>
>
>
> I’m using the .NET RiakClient 2.0 to insert a 44MB delimited file with 139k
> rows of data into Riak. I switched to a map bucket data type with
> registers. It is taking about 3 times longer to insert into this bucket
> than into a non-data-typed bucket. Any suggestions?
>
>
>
> Thanks in advance,
>
> Dennis
>
> _______________________________________________
> riak-users mailing list
> riak-users at lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
>