Which is the correct way to parallelize this code?
I am reading this blog on how to parallelize Sow and Reap, but it uses SetSharedFunction:
Life[init_, step_] := First[CellularAutomaton[{224, {2, {{2, 2, 2}, {2, 1, 2}, {2, 2, 2}}}, {1, 1}}, {init, 0}, {{{step}}}]]
InterestingQ[result_, n_] := Count[Flatten[result], 1] >= n
SetSharedFunction[ParallelSow];
ParallelSow[expr_] := Sow[expr];
Parallelize[
   Do[Module[{init = RandomInteger[1, {6, 6}], result},
      If[InterestingQ[result = Life[init, 100], 200],
       ParallelSow[
        ArrayPlot[init, ImageSize -> 50] ->
         ArrayPlot[result, ImageSize -> 70]]]], {10000}]] //
  Reap // Last
According to @Szabolcs this is far from optimal use of Parallelize: how can I parallelize this kind of code efficiently?
parallelization
asked 3 hours ago, edited 2 hours ago – mattiav27
Can you please fix the mismatched brackets, the spelling errors in function names, and make the example executable? I.e. make sure every function has a definition, and we can run your code. Use Pause if you need to simulate slow processing.
– Szabolcs
2 hours ago
Now it should be ok
– mattiav27
2 hours ago
Can you add the definition for Life?
– Szabolcs
2 hours ago
now it should be fine
– mattiav27
2 hours ago
1 Answer
Introduction
So it seems that this question is aimed squarely at me, as I have warned against the use of SetSharedFunction and SetSharedVariable many times. Thus I will explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.
First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.
It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.
Parallelization in Mathematica
Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.
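To make this concrete, here is a minimal sketch (assuming subkernels can be launched on your machine; the kernel IDs shown are illustrative) of how separate the kernels really are:

LaunchKernels[];                 (* start the configured subkernels *)
ParallelEvaluate[$KernelID]      (* e.g. {1, 2, 3, 4}: each subkernel is a separate process *)

f[x_] := x^2;                    (* defined on the main kernel only *)
DistributeDefinitions[f];        (* explicitly copy the definition to every subkernel *)
ParallelEvaluate[f[$KernelID]]   (* e.g. {1, 4, 9, 16} *)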
Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:
1. Time spent computing by the parallel kernels.
2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.
3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.
4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels).
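A quick way to feel point 2 is to parallelize something trivially cheap. This is a hypothetical micro-benchmark (timings vary by machine); the point is that when per-item work is tiny, communication overhead can eat the entire gain:

AbsoluteTiming[Table[Total[Range[100]], {10^5}];]          (* cheap work, no communication *)
AbsoluteTiming[ParallelTable[Total[Range[100]], {10^5}];]  (* often no faster, or slower *)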
SetSharedFunction
SetSharedFunction and SetSharedVariable are a way to overcome this limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.
This means that after SetSharedFunction[fun], each use of fun triggers a communication with the main kernel (points 2 and possibly 3 above) and requires a computation by the main kernel (point 4 above).
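You can verify this directly. In the following hypothetical demo, whereAmI is a name I made up; $KernelID is 0 on the main kernel, so the output shows where each call actually ran (assuming four subkernels):

SetSharedFunction[whereAmI];
whereAmI[] := $KernelID;
ParallelEvaluate[whereAmI[]]   (* {0, 0, 0, 0}: every call was routed to the main kernel *)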
SetSharedFunction[fun] is only worth it if fun is called rarely compared to the number of computations done by the subkernels, and if each call evaluates much faster than the chunks of computation done on the subkernels.
The Game of Life example from the Wolfram Blog
This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, i.e. only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2) is negligible.
The amount of computation done on the main kernel (point 4), a simple Sow, is also negligible.
The amount of data transferred in each callback is actually large, and it is in a format that is slow to transfer (graphics, the result of an ArrayPlot). But this does not matter because there are so few callbacks.
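To get a rough feel for the size difference (illustrative numbers; exact values depend on version), compare the raw result with its plot:

result = RandomInteger[1, {50, 50}];
ByteCount[result]              (* a packed integer array, on the order of 20 kB *)
ByteCount[ArrayPlot[result]]   (* a Graphics expression, typically much larger *)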
To keep things simpler to analyse, I am going to remove the ArrayPlot in the following examples.
What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100, so that more of the results are kept. Instead of ~10 results, we now get ~600. And the parallel version is no longer faster than the serial version.
In[73]:= Parallelize[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 100],
        ParallelSow[result]]], {10000}]] // Reap // Last // Last //
   Length // AbsoluteTiming

Out[73]= {2.23647, 599}
In[74]:= Do[
    Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 100],
      ParallelSow[result]]], {10000}] // Reap // Last // Last //
   Length // AbsoluteTiming

Out[74]= {2.19312, 554}
If we use a threshold of 50 in InterestingQ, we get ~2000 results. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That's an almost 3x slowdown!
Let's think about how many main kernel – subkernel communications occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see the Method option of Parallelize), this would require only $2 \times 4$ communications (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as the number of calls to ParallelSow. That's only $2 \times 10$ in the original code, but it's as many as 4000 communications if ParallelSow is triggered in 20% of the cases (which is what happened with a threshold of 50).
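For reference, this batching behaviour can be requested explicitly through the Method option; a sketch based on the code above:

Parallelize[
  Do[Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 200],
      ParallelSow[result]]], {10000}],
  Method -> "CoarsestGrained"] // Reap // Last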
Can we use a different approach to parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels. Then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this. But in many specific cases, we can use simpler functions than ParallelCombine. In this case we can use a simple ParallelTable:
Flatten[
   ParallelTable[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 50],
        Sow[result]]], {2500}] // Reap // Last,
    {4}
    ],
   2
   ] // Length // AbsoluteTiming

(* {0.675195, 1955} *)
The differences are:

- Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table.
- Use Sow instead of ParallelSow to avoid triggering a shared function.
- Combine the four result lists with Flatten[..., 2].
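As a side note, the hard-coded 4 above assumes exactly 4 subkernels; a sketch keyed to $KernelCount removes that assumption:

With[{n = $KernelCount},
  Flatten[
    ParallelTable[
     Do[Module[{init = RandomInteger[1, {6, 6}], result},
        If[InterestingQ[result = Life[init, 100], 50],
         Sow[result]]], {Quotient[10000, n]}] // Reap // Last,
     {n}],           (* assumes n divides 10000 evenly; otherwise a few iterations are dropped *)
    2]
  ] // Length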
Let's time the serial version too:
Flatten[
   Table[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 50],
        Sow[result]]], {2500}] // Reap // Last,
    {4}
    ],
   2
   ] // Length // AbsoluteTiming

(* {2.26587, 1981} *)
We're back to a 3.3x speedup, even though we now need to collect 20% of the results.
A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further would make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or on the subkernels.
Summary
- As a general guideline, try to minimize the number of times inter-kernel communication must occur.
- In practice, SetSharedFunction is only useful if the shared function gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.
- These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.
- When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as output: try to produce a quarter on each core, then combine afterwards.
answered 27 mins ago, edited 15 mins ago – Szabolcs
Thanks for the explanation!
– mattiav27
16 mins ago