Riak and Distributed Image Processing

andrew cooke andrew at acooke.org
Mon Nov 7 16:23:01 EST 2011


Apologies if this is a dumb idea, or I am asking in the wrong place.  I'm
muddling around trying to understand various bits of technology while piecing
together a possible project.  So feel free to tell me I'm wrong :o)

I am considering how best to design a system that processes data from
telescopes.  A typical "step" in the processing might involve combining a
small number of calibration images with a (possibly large) set of observation
images in some way and then summing the results.  To do this in a distributed
manner you would have the observations on various machines, broadcast the
calibrations, then do a map (the per-observation processing) followed by a
reduce (the summing).
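In case it helps make the shape of the computation concrete, here is a
single-machine sketch in numerical Python.  The calibration arithmetic
(subtract a dark frame, divide by a flat field) is only a placeholder for
whatever the real per-observation processing would be; in a distributed
setting the map calls would run on the nodes holding each observation and
the reduce would sum partial results across nodes.

```python
# Single-machine sketch of the map/reduce step: map = per-observation
# calibration, reduce = summing.  The calibration formula is a placeholder.
import numpy as np
from functools import reduce

def calibrate(observation, dark, flat):
    """Map step: combine one observation with the calibration images."""
    return (observation - dark) / flat

def combine(observations, dark, flat):
    """Map over observations, then reduce by summing."""
    calibrated = (calibrate(obs, dark, flat) for obs in observations)
    return reduce(np.add, calibrated)

# Toy data: three small "observations" plus calibration frames.
rng = np.random.default_rng(0)
observations = [rng.random((4, 4)) + 1.0 for _ in range(3)]
dark = np.full((4, 4), 0.1)
flat = np.full((4, 4), 2.0)
result = combine(observations, dark, flat)
print(result.shape)  # (4, 4)
```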

So, in very vague terms, this fits roughly into map-reduce territory.  What I
am doing now is seeing how the details work out with various "nosql" systems.

So my basic question is: how would the above fit with Riak?  Alternatively,
what else should I consider?

Some more details and speculation:

 - A typical image might contain 10 million 16-bit values, so is around
   20MB in size (and will get bigger as technology improves).

 - A typical process could involve anything from 1 to hundreds of images.

 - I have no problem with using Erlang for high level code, but would expect
   to delegate image processing to C, Fortran, or OpenCL (if GPUs were
   available on nodes; I know an OpenCL package exists for Erlang).

 - Integration with numerical Python or IDL or Matlab or similar would be an
   unexpected plus.

 - I imagine (though I have done no tests, so have no real idea how much
   time would be spent on number-crunching compared to data movement) that
   for efficiency it might (sometimes) be best to have mutable,
   memory-mapped access to the images in a map-reduce "task".

 - But exactly when processes would mutate image data, and when they would
   create new images, is not yet clear.

 - If images were immutable then you could consider the data processing as a
   directed graph of images.  Re-processing with modified parameters (a common
   occurrence as the astronomer "tweaks" the reduction) might re-use some
   nodes in the graph to avoid duplicating previous work.  Some kind of
   "garbage collection" could then be required to delete older images.

 - Some processing will require combining images on different nodes.

 - Something must preserve a history of the processing required to generate
   each image.  I assume this would be managed by the high-level code, but
   it's possible "data provenance" is already available in Riak, or supported
   by some library?

 - Most tasks would be expressed in terms of kernel operations (e.g. adding
   two images) taken from some library, but astronomers may want to add
   completely new code.
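On the memory-mapped point above, a minimal sketch of what I have in mind,
using numpy.memmap (the file name, shape, and dtype here are just
assumptions for illustration; a real system would map the image files
already stored on each node):

```python
# Sketch of mutable, memory-mapped access to an image via numpy.memmap.
# File name, shape, and dtype are assumptions for illustration.  Writes go
# straight to the backing file, so a map-reduce "task" could mutate an
# image in place without copying the whole ~20MB through the heap.
import numpy as np
import tempfile, os

path = os.path.join(tempfile.mkdtemp(), "observation.img")

# Create a 1000x1000 16-bit image backed by a file on disk.
image = np.memmap(path, dtype=np.uint16, mode="w+", shape=(1000, 1000))
image[:] = 100                    # initialise
image[10:20, 10:20] += 5          # mutate a region in place
image.flush()                     # push changes to the file

# Re-open read-only to confirm the mutation persisted.
check = np.memmap(path, dtype=np.uint16, mode="r", shape=(1000, 1000))
print(int(check[15, 15]))  # 105
```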
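And on the immutable-images-as-a-graph point, one cheap trick would be to
key each derived image by a hash of (operation, parameters, input keys), so
that re-running a reduction with unchanged parameters re-uses cached nodes,
while changing a parameter changes the keys and recomputes only the
affected branches.  A hypothetical sketch (the in-memory dicts stand in for
whatever store, Riak buckets say, actually holds the images and their
provenance):

```python
# Hypothetical sketch of the "directed graph of images" idea: each derived
# image is keyed by a hash of the operation name, its parameters, and its
# input keys.  The dicts stand in for a real store (e.g. Riak buckets).
import hashlib
import numpy as np

store = {}        # key -> image
provenance = {}   # key -> (op_name, params, input_keys): the history

def put(name, image):
    store[name] = image
    return name

def derive(op_name, params, input_keys, op):
    key = hashlib.sha1(repr((op_name, params, input_keys)).encode()).hexdigest()
    if key not in store:                      # re-use earlier work if cached
        inputs = [store[k] for k in input_keys]
        store[key] = op(*inputs, **params)
        provenance[key] = (op_name, params, input_keys)
    return key

a = put("obs1", np.ones((4, 4)))
b = put("dark", np.full((4, 4), 0.25))
k1 = derive("subtract", {}, [a, b], lambda x, y: x - y)
k2 = derive("scale", {"factor": 2.0}, [k1], lambda x, factor: x * factor)

# Repeating the same derivation hits the cache and yields the same key:
assert derive("subtract", {}, [a, b], lambda x, y: x - y) == k1
```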

If you've read this far I'd love to hear of any thoughts that pop into your
head in response to the above.  Possible problems?  Technical details of Riak
that might help?  Similar projects?
