Riak and Distributed Image Processing

andrew cooke andrew at acooke.org
Mon Nov 7 20:23:59 EST 2011


Thanks for the reply.

There were a lot of useful details there, but the thing I need to reply to, I
think, is the whole question of whether the map step is combining multiple
images, or processing a single image.

The number of calibrations (in the sense I used that term) is always small.
I'll explain why below (I'm not putting the text here because it's long and
more about astronomy than computing).  So I am considering them as "constants"
that are somehow present on every node.  Somewhere in the Riak docs I thought
I found some mention of a map step that cached a linked value locally.  I
can't find that now, but I thought this might be how the calibration was
pulled across initially.

Does the above address your concern?  So rather than

  fetch one image > fetch another image > mutate > write output

I see something more like:

  move to where the image is >
  fetch calibration if this is the first image on this node >
  mutate the image > write output

Does that improve things?


Here's the argument why the number of calibration images does not scale
with the number of observation images:

Astronomy is observing time limited.  The number of telescopes is small
compared to the number of astronomers.  Furthermore, new discoveries are made
by pushing limits - either by observing new objects (which means looking at
faint objects, since bright ones are already well-studied, and so requires a
lot of observing time) or by doing large statistical samples (which requires a
lot of observing time) or by looking at something in more detail (which means
restricting spatial or frequency ranges, and so implies fewer photons and
requires a lot of observing time).  In every case, instruments and observing
processes must spend as much time as possible obtaining observation images.
This forces calibration images to be made at the start/end of the night.  So a
small number of calibrations are applied to many observations.
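This is also why the O(n^2) worry below doesn't bite here.  A back-of-napkin check, with illustrative numbers:

```python
# With c calibrations and n observations, calibrating every observation
# costs c * n operations; with c fixed (say c = 3), that is O(n).
# All-pairs image-to-image comparison, by contrast, is O(n^2).

def calibrate_cost(c, n):
    return c * n

def pairwise_cost(n):
    return n * (n - 1) // 2

calibrate_cost(3, 10_000)   # 30,000 image operations, linear in n
pairwise_cost(10_000)       # 49,995,000 comparisons, quadratic in n
```

So as long as calibrations stay a small constant, the work grows linearly with the observation set.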

The above is simplified, but hopefully it gives the general idea.  What is new
in my email (if anything) is not the general approach, but the idea that
"commodity" software can be used rather than developing dedicated pipeline
software for a particular instrument (along with the idea that you can create
provenance graphs and reuse data in a similar way to pure data structures).
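To illustrate the "pure data structure" idea: if derived images are keyed by a hash of (operation, parameters, input keys), then re-running a tweaked reduction automatically reuses any node of the graph whose inputs and parameters are unchanged, and the key table doubles as the provenance record.  A sketch (the store and operation here are illustrative, not a real pipeline):

```python
import hashlib
import json

STORE = {}    # key -> image data
PARENTS = {}  # key -> provenance record

def derive(op_name, op, params, input_keys):
    # content-address the derived image by what produced it
    key = hashlib.sha1(
        json.dumps([op_name, params, input_keys], sort_keys=True).encode()
    ).hexdigest()
    if key not in STORE:  # otherwise reuse earlier work unchanged
        inputs = [STORE[k] for k in input_keys]
        STORE[key] = op(inputs, params)
        PARENTS[key] = {"op": op_name, "params": params, "inputs": input_keys}
    return key

def subtract_bias(inputs, params):
    (img,) = inputs
    return [px - params["bias"] for px in img]

STORE["obs1"] = [10, 12, 14]
k1 = derive("subtract_bias", subtract_bias, {"bias": 2}, ["obs1"])
k2 = derive("subtract_bias", subtract_bias, {"bias": 2}, ["obs1"])  # same key
```

"Garbage collection" then becomes deleting keys no newer derivation points at.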

On Mon, Nov 07, 2011 at 04:56:41PM -0500, Alexander Sicular wrote:
> Great project, Andrew. It's not a dumb idea, sounds pretty awesome actually. I just don't think Riak will get you there.
> As I see it, the basic outline looks something like:
> fetch one image > fetch another image > mutate > write output
> I just don't see how Riak's implementation of map reduce allows you to iterate over a collection of images accessed by key and then mutate over some other image. That said, Riak would work well as the distributed, replicated image repository. The main advantage would be computation would be done where the data resides. You could write some erlang-fu mod that shelled the image out to some matlab/scipy/opencl process and then wrote output directly back to riak or put output in some watched dir that made it back to riak some other way. Many nodes with many cores would help. But in reality disk i/o is often the slowest link in any moderately complex chain. Shipping 20MB images over the wire in a GigE network is not really a big deal.
> Further, image to image manipulation becomes some order of O(n^2) complexity which almost guarantees that you will never complete your compute task if either your calibration or observation set grow at any meaningful rate. I don't have experience in the image manipulation space but I do have some experience in the data similarity/comparison space which in this instance is more or less the same. You have two pieces of data that you need to run some sort of algorithm against to determine (dis)similarity. It is usually the case that direct comparison is not used in such cases if only with regard to computation time. Then again, if your set size is static disregard everything I just said, do direct comparisons and compute your execution time on the back of a napkin.
> Meh, that's my feeble stab at it.
> Let us know what you eventually decide on!
> Cheers,
> -Alexander Sicular
> @siculars
> http://siculars.posterous.com
> On Nov 7, 2011, at 4:23 PM, andrew cooke wrote:
> > 
> > Hi,
> > 
> > Apologies if this is a dumb idea, or I am asking in the wrong place.  I'm
> > muddling around trying to understand various bits of technology while piecing
> > together a possible project.  So feel free to tell me I'm wrong :o)
> > 
> > I am considering how best to design a system that processes data from
> > telescopes.  A typical "step" in the processing might involve combining a
> > small number of calibration images with a (possibly large) set of observation
> > images in some way and then summing the results.  To do this in a distributed
> > manner you would have the observations on various machines, broadcast the
> > calibrations, then do a map (the per-observation processing) followed by a
> > reduce (the summing).
> > 
> > So, in very vague terms, this fits roughly into map-reduce territory.  What I
> > am doing now is seeing how the details work out with various "nosql" systems.
> > 
> > So my basic question is: how would the above fit with Riak?  Alternatively,
> > what else should I consider?
> > 
> > Some more details and speculation:
> > 
> > - A typical image might contain 10 million 16 bit values, so is of around
> >   20MB in size (and will get bigger as technology improves).
> > 
> > - A typical process could involve anything from 1 to hundreds of images.
> > 
> > - I have no problem with using Erlang for high level code, but would expect
> >   to delegate image processing to C, Fortran, or OpenCL (if GPUs were
> >   available on nodes; I know an OpenCL package exists for Erlang).
> > 
> > - Integration with numerical Python or IDL or Matlab or similar would be an
> >   unexpected plus.
> > 
> > - I imagine (but have done no tests so have no real idea how much time would
> >   be spent in number-crunching, compared to data movement) that for
> >   efficiency it might (sometimes) be best to have mutable, memory mapped
> >   access to the images in a map-reduce "task".
> > 
> > - But exactly when processes would mutate image data, and when they would
> >   create new images, is not yet clear.
> > 
> > - If images were immutable then you could consider the data processing as a
> >   directed graph of images.  Re-processing with modified parameters (a common
> >   occurrence as the astronomer "tweaks" the reduction) might re-use some
> >   points in the graph to avoid duplicating previous work.  Some kind of
> >   "garbage collection" could then be required to delete older images.
> > 
> > - Some processing will require combining images on different nodes.
> > 
> > - Something must preserve a history of the processing required to generate
> >   each image.  I assume this would be managed by the high-level code, but
> >   it's possible "data provenance" is already available in Riak, or supported
> >   by some library?
> > 
> > - Most tasks would be expressed in terms of kernel operations (eg add two
> >   images) taken from some library, but astronomers may want to add completely
> >   new code.
> > 
> > If you've read this far I'd love to hear of any thoughts that pop into your
> > head in response to the above.  Possible problems?  Technical details of Riak
> > that might help?  Similar projects?
> > 
> > Thanks,
> > Andrew
> > 
> > _______________________________________________
> > riak-users mailing list
> > riak-users at lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
