Feedback for GSoC project - Riak Destination for Syslog-ng
fdushin at basho.com
Tue May 5 09:34:20 EDT 2015
First off, Parth, this is a really exciting project, and I'm glad you're taking it on.
As an SIEM refugee, I have a few questions about the proposal and a few thoughts about syslog in general, which may help you work out your ideas about data types and how you plan to structure data in Riak.
As far as I understand, you're talking about a mapping from keys to sets, but I'm unclear on a few things. What are the keys you are thinking about? Timestamps? If so, these are presumably the timestamps of the syslog events? Just a word of warning, if so: you might find a lot of variation in timestamp formats and granularity. Perhaps you can get something reliable out of syslog-ng, but that won't help you in the case where syslog-ng is functioning as a syslog relay and you want to preserve the timestamp of the originator, which you should if you want to preserve the integrity of the logs (e.g., for compliance). Or are you talking about a key being a (coarse-grained) timestamp, say, an integral value in UTC seconds? And the value(s) being all logs in that interval? Is that your motivation for sets?
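To make the coarse-grained key idea concrete, here is a minimal sketch in Python. It assumes the originator's timestamp has already been normalized to ISO 8601, which is not a given for logs in the wild. The name bucket_key and the choice to treat zone-less timestamps as UTC are illustrative assumptions, not part of the proposal.

```python
from datetime import datetime, timezone


def bucket_key(timestamp: str, interval_seconds: int = 1) -> str:
    """Map an ISO 8601 timestamp to the start of its UTC interval (epoch seconds)."""
    # Older Python versions reject a trailing "Z", so rewrite it as an explicit offset.
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: a timestamp without a zone is taken to be UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    epoch = int(ts.timestamp())
    # Round down to the start of the interval so all events in it share one key.
    return str(epoch - epoch % interval_seconds)
```

With a 60-second interval, every event from the same UTC minute maps to the same key, regardless of the offset the originator used.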
How much of the syslog payload are you planning to parse? RFC-3164 and RFC-5424 provide enough BNF to allow standard syslog producers/consumers to provide pretty elaborately structured data in a syslog "datagram" (be it sent via UDP or TCP, or what have you). RFC-5424, in particular, has support for arbitrarily structured data in a syslog header, which is pretty nice. However, I personally have run into a few issues with this RFC.
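As a rough illustration of what parsing only the standard RFC-5424 header involves, here is a minimal Python sketch. The regular expression follows the header layout in the RFC (PRI, VERSION, TIMESTAMP, HOSTNAME, APP-NAME, PROCID, MSGID), but it does not validate individual fields, and the names HEADER_5424 and parse_header are hypothetical.

```python
import re

# <PRI>VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID SP rest
HEADER_5424 = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>[1-9]\d{0,2})"
    r" (?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+)"
    r" (?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)",
    re.DOTALL,
)


def parse_header(line):
    """Return the RFC-5424 header fields as a dict, or None if the line does not match."""
    m = HEADER_5424.match(line)
    if m is None:
        return None
    # "-" is the NILVALUE in RFC 5424; represent it as None.
    header = {
        k: (None if v == "-" else v)
        for k, v in m.groupdict().items()
        if k not in ("pri", "rest")
    }
    # PRI encodes facility * 8 + severity.
    header["facility"], header["severity"] = divmod(int(m.group("pri")), 8)
    # Structured data and the free-form message are left unparsed here.
    header["rest"] = m.group("rest")
    return header
```

A line that does not match (for example, a quirky RFC-3164 message) returns None, so a caller can fall back to storing it as an opaque string.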
First, very few, if any, syslog generators support this RFC. Certainly the "legacy" enterprise log sources (operating systems, firewall vendors, etc.) don't, and even the syslog API [1] doesn't provide enough parameters to make structured logs a possibility. I think there may be some work to improve APIs in the community, but to my knowledge nothing has been standardized, and no vendors are taking up the work. Besides, most of the syslog implementations out there obey the "non-normative" behavior of RFC-3164, so you can get some pretty quirky logs in the wild.
Another interesting problem is that the STRUCTURED-DATA element of 5424 uses OIDs to discriminate different data types that are encoded in the header. And while there is a kind of loosely coupled authority for OIDs, there is no infrastructure for determining a parsing strategy for these fields. They could really be anything, in the worst case.
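The following Python sketch shows how far a generic parser can get with STRUCTURED-DATA. In RFC 5424 each element carries an SD-ID, which for non-registered IDs takes the form name@enterprise-number, followed by name="value" parameters. The sketch splits out those pieces but keeps every value as a raw, still-escaped string, since nothing in the message says how to interpret it. The function name and the shape of the result are assumptions made for illustration.

```python
import re

# One SD-ELEMENT: "[" SD-ID *(SP PARAM-NAME="PARAM-VALUE") "]"
SD_ELEMENT = re.compile(
    r'\[(?P<sd_id>[^\s="\]]+)(?P<params>(?: [^\s="\]]+="(?:\\.|[^"\\\]])*")*)\]'
)
SD_PARAM = re.compile(r' (?P<name>[^\s="\]]+)="(?P<value>(?:\\.|[^"\\\]])*)"')


def parse_structured_data(sd):
    """Split STRUCTURED-DATA into elements keyed by SD-ID; values stay uninterpreted."""
    elements = {}
    for el in SD_ELEMENT.finditer(sd):
        sd_id = el.group("sd_id")
        # The part after "@" is the enterprise number; registered IDs have none.
        _, _, enterprise = sd_id.partition("@")
        params = {
            p.group("name"): p.group("value")
            for p in SD_PARAM.finditer(el.group("params"))
        }
        elements[sd_id] = {"enterprise_number": enterprise or None, "params": params}
    return elements
```

Anything beyond this, such as deciding that a given parameter is an integer or an address, needs out-of-band knowledge of the particular SD-ID.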
But regardless of the deeply structured data, you could get some very interesting traction by just taking standard headers and indexing them through Yokozuna. Certainly, indexing the body of a syslog message is a great idea, as these messages are generally unstructured and fodder for Lucene. This is something that Logstash/ElasticSearch can do pretty effectively today, and it would be cool to see the same in Riak + some syslog provider.
Finally, it would be really nice if you could structure your plugin in such a way that it could eventually be ported to rsyslog. The rsyslogd daemon is deployed by default on certain Linux flavors and enjoys fairly widespread distribution. You might be able to get it supported in that community, as well.
Best of luck,
[1] See http://linux.die.net/man/3/syslog for example.
> On May 5, 2015, at 8:11 AM, Christopher Meiklejohn <cmeiklejohn at basho.com> wrote:
>> On May 5, 2015, at 1:01 PM, Gergely Nagy <algernon at madhouse-project.org> wrote:
>>>>>>> "Christopher" == Christopher Meiklejohn <cmeiklejohn at basho.com> writes:
>> Christopher> I’m a bit concerned with your use of the set embedded in the
>> Christopher> map.
>> The original idea was to use a Set directly. The Set-in-Map thing was
>> just a thought experiment (Map-in-Set would make more sense).
>> Christopher> Large objects have traditionally been a big problem in Riak due
>> Christopher> to the use of distributed Erlang and head of line blocking. I’m
>> Christopher> curious if you could elaborate on what type of data you will be
>> Christopher> storing in the set: how big you expect each item to be, how big you
>> Christopher> expect the map to be, and the overall layout of data inside of the
>> Christopher> data structure.
>> The intention is to store log messages in each element of the set:
>> either as a string (syslog or json, or whatever else the user sees fit),
>> or as a map of key-value pairs (where values themselves can be maps).
>> On average, the log messages are a few kilobytes in size. There may be
>> exceptions, but >1mb ones are fairly rare. How much data the set would
>> hold... now that's a question that can't really be answered. It is
>> really up to the syslog-ng user to configure that.
> I’m referring to the size of the entire set, not the objects that will be members of
> the set. Therefore, the performance penalty seen when using large objects would
> be observed as soon as the size of the entire set (or map) has reached ~1 MB.
> Given that restriction, I’d imagine you would only be able to store a few messages
> in each set. That granularity seems like you are no longer getting the benefits
> of the set.
> Additionally, the primary benefit of the data types in Riak is that they converge
> deterministically when dealing with concurrent operations. I’m curious if the set
> is the right choice here; could you just use a custom set format inside of a normal
> Riak object (or store one message per Riak object, given the write will be an
> immutable log entry?)
> - Chris
> Christopher Meiklejohn
> Senior Software Engineer
> Basho Technologies, Inc.
> cmeiklejohn at basho.com