Thursday, September 16, 2010

[slvpnhij] Limited concurrent programming

Concurrency bugs are hard.  Maybe even, the only hard bugs are concurrency bugs.

Consider the shell pipeline: cat input | foo | bar | baz > output

The several programs can be executed concurrently, and it does not matter which order the threads / processes are interleaved; the output will remain the same.  Furthemore, if we encounter a problem, each program is only single-threaded so is easy to debug.  Thus, this "single stream" concurrency is a safe form of concurrency if you wish to avoid nasty concurrency bugs.

The same form might be written in haskell with lazy lists.

output :: [String] -> [String];
output input = bar $ foo $ input;

though no compiler I know compiles it into parallel concurrent threads.

Can we do just a bit more under this model, perhaps "multiple streams" while still avoiding nasty concurrency bugs and maintaining ease of debugging?

If we have multiple streams, then the interesting operation is where two streams merge to become one.  As an example of "sort of" what we might want to do, in shell:

foo > temp1
bar > temp2
diff temp1 temp2 > output

The above example is not concurrent at all, though it could be / should be.  Perhaps with named pipes?

With just a single stream, the input will arrive to any program in exactly the same order regardless of the ordering of processes.  However, with two streams, the streams may be interleaved ; there are a super-exponential Binomial(n+m,n) possibilities of interleaving.  No wonder concurrency is so hard.  For a single thread, we only need to record the initial state and all the input, and we can easily replicate "what went wrong" for debugging.  For concurrency, we need to record and replay the time order of all the inputs.

However, the "diff" does not care about interleaving (or ought not to); it always produces the same output.  So for the sake of avoiding bugs and ease of debugging, let us limit ourselves to concurrency where the interleaving of the input does not matter.  (Yes, this limitation cuts off a lot.)  How can we express and enforce this?

One possible way is to forbid the single-threaded code from "peeking" at an input stream and continuing if there is no data ready.  That is, any input read succeeds returning a character, EOF, or blocks.

Another possible way is with lazy lists again:

maptwo :: (a -> b) -> [a] -> [a] -> [b];
maptwo f (x:xrest) (y:yrest) = (f x):(f y):(maptwo f xrest yrest);
maptwo f l [] = map f l;
maptwo f [] l = map f l;

But this code is weird: it necessarily forces one-to-one interleaving of the input.  But what if the input is not generated at identical speeds?  One can also hard code a two-to-one ratio or any fixed ratio.  This is the limitation of this concurrency method.  (Is it worth it?)

I'm guessing a process emitting two streams from one (or zero) is not difficult.

We have enough primitives to create an arbitrary directed acyclic graph of concurrent processes, which certainly is not enough to express everything, but should be enough to do a lot.

No comments :