Which is the correct way to parallelize this code?
























I am reading this blog on how to parallelize Sow and Reap, but it uses SetSharedFunction:



Life[init_, step_] := First[CellularAutomaton[{224, {2, {{2, 2, 2}, {2, 1, 2}, {2, 2, 2}}}, {1, 1}}, {init, 0}, {{step}}]]
InterestingQ[result_, n_] := Count[Flatten[result], 1] >= n
SetSharedFunction[ParallelSow];
ParallelSow[expr_] := Sow[expr];

Parallelize[
 Do[Module[{init = RandomInteger[1, {6, 6}], result},
   If[InterestingQ[result = Life[init, 100], 200],
    ParallelSow[
     ArrayPlot[init, ImageSize -> 50] ->
      ArrayPlot[result, ImageSize -> 70]]]], 10000]] //
  Reap // Last


According to @Szabolcs this is far from an optimal use of Parallelize: how can I parallelize this kind of code efficiently?

































  • Can you please fix the mismatched brackets, the spelling errors in function names, and make the example executable? I.e. make sure every function has a definition, and we can run your code. Use Pause if you need to simulate slow processing.
    – Szabolcs
    2 hours ago










  • Now it should be ok
    – mattiav27
    2 hours ago










  • Can you add the definition for Life?
    – Szabolcs
    2 hours ago










  • now it should be fine
    – mattiav27
    2 hours ago














parallelization






edited 2 hours ago

























asked 3 hours ago









mattiav27












1 Answer



























Introduction



It seems this question is aimed squarely at me, as I have warned against the use of SetSharedFunction and SetSharedVariable many times. I will therefore explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.



First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.



It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.



Parallelization in Mathematica



Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.
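As a minimal illustration of the distributed-memory model (the names `square` and the example values are my own), a definition made on the main kernel has to be copied to the subkernels before they can use it:

```mathematica
(* a definition made on the main kernel *)
square[x_] := x^2

(* explicitly copy the definition to every subkernel; recent
   versions do this automatically for Parallelize, ParallelTable,
   etc., but the copy still has to happen, once per definition *)
DistributeDefinitions[square];

(* the work now runs on the subkernels; communication is limited
   to sending tasks and receiving results *)
ParallelTable[square[i], {i, 4}]  (* {1, 4, 9, 16} *)
```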



Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:



  1. Time spent computing by parallel kernels

  2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.

  3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.

  4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels)
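Overhead point 2 can be seen directly by changing the granularity of a trivial parallel map. This is only a sketch; absolute timings depend on the machine and number of subkernels, but the per-communication overhead dominates in the fine-grained case:

```mathematica
(* many tiny tasks: one scheduling round trip per element,
   so communication overhead dominates *)
ParallelMap[# + 1 &, Range[10000],
  Method -> "FinestGrained"] // AbsoluteTiming // First

(* a few large chunks: roughly one round trip per subkernel *)
ParallelMap[# + 1 &, Range[10000],
  Method -> "CoarsestGrained"] // AbsoluteTiming // First
```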

SetSharedFunction



SetSharedFunction and SetSharedVariable are a way to overcome the limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.



This means that after SetSharedFunction[fun], each usage of fun triggers a communication to the main kernel (point 2. and possibly 3. above) and requires a computation by the main kernel (point 4. above).
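To make this concrete, here is a toy shared function (`bump` and `counter` are my own names); every call from a subkernel is evaluated back on the main kernel, so the updates are serialized there and none are lost, but each call pays a full round trip:

```mathematica
counter = 0;
SetSharedFunction[bump];   (* bump now always evaluates on the main kernel *)
bump[] := counter++;

(* each of these 100 calls is a main-kernel round trip *)
ParallelDo[bump[], {100}];
counter  (* 100: correct, but every increment paid the communication overhead *)
```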



SetSharedFunction[fun] is only worth it if fun is called rarely compared to the amount of computation done by the subkernels, and if it evaluates much faster than the computations done on the subkernels.



The game of life example from Wolfram Blog



This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2.) is negligible.



The amount of computation done on the main kernel (point 4.), a simple Sow, is also negligible.



The amount of data transferred in each callback is actually a lot, and it is in a format that is slow to transfer (graphics—the result of an ArrayPlot). But this won't matter because there are so few callbacks.



To keep the analysis simpler, I am going to remove the ArrayPlot in the following examples.



What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100 so that more of the results are kept. Instead of ~10 results, we now get ~600. And the parallel version is no longer faster than the serial one.



In[73]:= Parallelize[
   Do[Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 100],
      ParallelSow[result]]], 10000]] // Reap // Last // Last //
  Length // AbsoluteTiming

Out[73]= {2.23647, 599}

In[74]:= Do[
   Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 100],
      ParallelSow[result]]], 10000] // Reap // Last // Last //
  Length // AbsoluteTiming

Out[74]= {2.19312, 554}


If we use a threshold of 50 in InterestingQ, we get ~2000 results. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That's almost a 3x slowdown!



Let's think about how many communications between the main kernel and the subkernels occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see the Method option of Parallelize), this would require only $2\times 4$ communications (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as there are calls to ParallelSow. That's only $2\times 10$ in the original code, but it's as many as 4000 communications if ParallelSow is triggered in 20% of the cases (which is what happened with a threshold of 50).



Can we use a different approach to parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels, then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this, but in many specific cases we can use simpler functions than ParallelCombine. In this case a simple ParallelTable will do:



Flatten[
 ParallelTable[
  Do[Module[{init = RandomInteger[1, {6, 6}], result},
    If[InterestingQ[result = Life[init, 100], 50],
     Sow[result]]], 2500] // Reap // Last,
  4
  ],
 2
 ] // Length // AbsoluteTiming

(* {0.675195, 1955} *)


The differences are:



  • Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table


  • Use Sow instead of ParallelSow to avoid triggering a shared function


  • Combine the four result lists with Flatten[..., 2].


Let's time the serial version too:



Flatten[
 Table[
  Do[Module[{init = RandomInteger[1, {6, 6}], result},
    If[InterestingQ[result = Life[init, 100], 50],
     Sow[result]]], 2500] // Reap // Last,
  4
  ],
 2
 ] // Length // AbsoluteTiming

(* {2.26587, 1981} *)


We're back to a 3.3x speedup, even though we now need to collect 20% of the results.
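The same collect-then-combine idea can also be written with ParallelCombine, which was mentioned above. This is only a sketch: unlike the loop versions, it generates all the initial conditions up front on the main kernel (a packed array, which is cheap to transfer) and lets ParallelCombine split the list into chunks and Join the per-chunk results:

```mathematica
inits = RandomInteger[1, {10000, 6, 6}];

(* each subkernel evolves its chunk and keeps only the
   interesting boards; chunk results are joined on the main kernel *)
ParallelCombine[
  Function[chunk,
   Select[Life[#, 100] & /@ chunk, InterestingQ[#, 50] &]],
  inits] // Length // AbsoluteTiming
```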



A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further will make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or subkernels.



Summary



  • As a general guideline, try to minimize the number of times inter-kernel communication must occur

  • In practice, SetSharedFunction is only useful if the shared function gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.

  • These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.

  • When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as an output. Try to produce a quarter on each core, then combine afterwards.
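That data-centric pattern can be sketched generically. Here `processChunk` is a hypothetical stand-in for any expensive, independent per-chunk computation, and I assume 4 subkernels for the chunk count:

```mathematica
(* placeholder for an expensive, independent per-chunk computation *)
processChunk[chunk_] := Total[chunk]

data = RandomReal[1, 10^6];

(* split into one chunk per core, process each chunk on a
   subkernel, then combine the partial results on the main kernel *)
chunks = Partition[data, Length[data]/4];
Total[ParallelMap[processChunk, chunks]]  (* same as Total[data] *)
```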



























  • Thanks for the explanation!
    – mattiav27
    16 mins ago










1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
4
down vote



accepted










Introduction



So it seems that this question is straightly aimed at me, as I warned against the use of SetSharedFunction and SetSharedVariable many times. Thus I will explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.



First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.



It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.



Parallelization in Mathematica



Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.



Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:



  1. Time spent computing by parallel kernels

  2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.

  3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.

  4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels)

SetSharedFunction



SetSharedFunction and SetSharedVariable are a way to overcome the limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.



This means that after SetSharedFunction[fun], each usage of fun triggers a communication to the main kernel (point 2. and possibly 3. above) and requires a computation by the main kernel (point 4. above).



SetSharedVariable[fun] is only worth it if fun is called rarely compared to the number of computations done by the subkernels, and it can evaluate much faster than the bits of computation done on subkernels.



The game of life example from Wolfram Blog



This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2.) is negligible.



The amount of computation done on the main kernel (point 4.), a simple Sow, is also negligible.



The amount of data transferred in each callback is actually a lot, and it is in a format that is slow to transfer (graphics—the result of an ArrayPlot). But this won't matter because there are so few callbacks.



To keep things simpler analyse, I am going to remove the ArrayPlot in the following examples.



What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100, so that more of the results are kept. Instead of ~10 result, now we get about ~600. And the parallel version is no longer faster than the serial version.



In[73]:= Parallelize[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 100],
ParallelSow[result]]], 10000]] // Reap // Last // Last //
Length // AbsoluteTiming

Out[73]= 2.23647, 599

In[74]:= Do[
Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 100],
ParallelSow[result]]], 10000] // Reap // Last // Last //
Length // AbsoluteTiming

Out[74]= 2.19312, 554


If we use a threshold of 50 in InterestingQ, we get ~2000 result. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That an almost 3x slowdown!



Let's think about how many main kernel – subkernel communications occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see Method option of Parallelize), this would require only $2times 4$ communication (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as the number of calls to ParallelSow. That's only $2times 10$ in the original code, but it's as many as 4000 communications if ParallelSow is triggered 20% of the cases (that's what happened with a threshold of 50).



Can we use a different approach parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels. Then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this. But in many specific cases, we can use simpler functions than ParallelCombine. In this case we could use a simple ParallelTable:



Flatten[
ParallelTable[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 50],
Sow[result]]], 2500] // Reap // Last,
4
],
2
] // Length // AbsoluteTiming

(* 0.675195, 1955 *)


The differences are:



  • Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table


  • Use Sow instead of ParallelSow to avoid triggering a shared function


  • Combine two four result lists with a Flatten[..., 2].


Let's time the serial version too:



Flatten[
Table[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 50],
Sow[result]]], 2500] // Reap // Last,
4
],
2
] // Length // AbsoluteTiming

(* 2.26587, 1981 *)


We're back to a 3.3x speedup, even if we need to collect 20% of results.



A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further will make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or subkernels.



Summary



  • As a general guideline, try to minimize the number of times inter-kernel communication must occur

  • In practice, SetSharedFunction is only useful if the shared function only gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.

  • These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.

  • When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as an output. Try to produce a quarter on each core, then combine afterwards.





share|improve this answer






















  • Thanks for the explanation!
    – mattiav27
    16 mins ago














up vote
4
down vote



accepted










Introduction



So it seems that this question is straightly aimed at me, as I warned against the use of SetSharedFunction and SetSharedVariable many times. Thus I will explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.



First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.



It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.



Parallelization in Mathematica



Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.



Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:



  1. Time spent computing by parallel kernels

  2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.

  3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.

  4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels)

SetSharedFunction



SetSharedFunction and SetSharedVariable are a way to overcome the limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.



This means that after SetSharedFunction[fun], each usage of fun triggers a communication to the main kernel (point 2. and possibly 3. above) and requires a computation by the main kernel (point 4. above).



SetSharedVariable[fun] is only worth it if fun is called rarely compared to the number of computations done by the subkernels, and it can evaluate much faster than the bits of computation done on subkernels.



The game of life example from Wolfram Blog



This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2.) is negligible.



The amount of computation done on the main kernel (point 4.), a simple Sow, is also negligible.



The amount of data transferred in each callback is actually a lot, and it is in a format that is slow to transfer (graphics—the result of an ArrayPlot). But this won't matter because there are so few callbacks.



To keep things simpler analyse, I am going to remove the ArrayPlot in the following examples.



What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100, so that more of the results are kept. Instead of ~10 result, now we get about ~600. And the parallel version is no longer faster than the serial version.



In[73]:= Parallelize[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 100],
ParallelSow[result]]], 10000]] // Reap // Last // Last //
Length // AbsoluteTiming

Out[73]= 2.23647, 599

In[74]:= Do[
Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 100],
ParallelSow[result]]], 10000] // Reap // Last // Last //
Length // AbsoluteTiming

Out[74]= 2.19312, 554


If we use a threshold of 50 in InterestingQ, we get ~2000 result. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That an almost 3x slowdown!



Let's think about how many main kernel – subkernel communications occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see Method option of Parallelize), this would require only $2times 4$ communication (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as the number of calls to ParallelSow. That's only $2times 10$ in the original code, but it's as many as 4000 communications if ParallelSow is triggered 20% of the cases (that's what happened with a threshold of 50).



Can we use a different approach parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels. Then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this. But in many specific cases, we can use simpler functions than ParallelCombine. In this case we could use a simple ParallelTable:



Flatten[
ParallelTable[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 50],
Sow[result]]], 2500] // Reap // Last,
4
],
2
] // Length // AbsoluteTiming

(* 0.675195, 1955 *)


The differences are:



  • Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table


  • Use Sow instead of ParallelSow to avoid triggering a shared function


  • Combine two four result lists with a Flatten[..., 2].


Let's time the serial version too:



Flatten[
Table[
Do[Module[init = RandomInteger[1, 6, 6], result,
If[InterestingQ[result = Life[init, 100], 50],
Sow[result]]], 2500] // Reap // Last,
4
],
2
] // Length // AbsoluteTiming

(* 2.26587, 1981 *)


We're back to a 3.3x speedup, even if we need to collect 20% of results.



A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further will make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or subkernels.



Summary



  • As a general guideline, try to minimize the number of times inter-kernel communication must occur

  • In practice, SetSharedFunction is only useful if the shared function only gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.

  • These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.

  • When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as an output. Try to produce a quarter on each core, then combine afterwards.





share|improve this answer






















  • Thanks for the explanation!
    – mattiav27
    16 mins ago












up vote
4
down vote



accepted







up vote
4
down vote



accepted






Introduction



So it seems that this question is straightly aimed at me, as I warned against the use of SetSharedFunction and SetSharedVariable many times. Thus I will explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.



First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.



It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.



Parallelization in Mathematica



Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.
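The distributed-memory behaviour is easy to observe directly. A minimal sketch, assuming 4 launched subkernels (the variable x is just an illustration):

```mathematica
x = 1;
DistributeDefinitions[x];  (* copy x's definition to every subkernel *)
ParallelEvaluate[x]        (* each subkernel reports its own copy: {1, 1, 1, 1} *)

x = 2;                     (* reassigning only changes the main kernel's copy *)
ParallelEvaluate[x]        (* subkernels still see the old value: {1, 1, 1, 1} *)
```

Until DistributeDefinitions (or a parallel function that does it implicitly) is called again, the subkernels keep working with their stale copies.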



Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:



  1. Time spent computing by parallel kernels

  2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.

  3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.

  4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels)
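The effect of point 3 can be demonstrated by transferring the same data in packed and unpacked form. A rough sketch (the exact timings depend on your machine, but the packed transfer should be considerably faster):

```mathematica
packed = RandomReal[1, 10^6];                  (* a packed array of 10^6 machine reals *)
unpacked = Developer`FromPackedArray[packed];  (* identical data, unpacked form *)

AbsoluteTiming[DistributeDefinitions[packed];]    (* packed data transfers quickly *)
AbsoluteTiming[DistributeDefinitions[unpacked];]  (* the unpacked form is noticeably slower *)
```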

SetSharedFunction



SetSharedFunction and SetSharedVariable are a way to overcome the limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.



This means that after SetSharedFunction[fun], each usage of fun triggers a communication to the main kernel (point 2. and possibly 3. above) and requires a computation by the main kernel (point 4. above).



SetSharedFunction[fun] is only worth it if fun is called rarely compared to the number of computations done by the subkernels, and if each call evaluates much faster than the chunks of computation done on the subkernels.
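One way to see that a shared function really runs on the main kernel is to have it report $KernelID, which is 0 on the main kernel and positive on subkernels. A small sketch with a hypothetical function whereAmI:

```mathematica
SetSharedFunction[whereAmI];
whereAmI[] := $KernelID       (* evaluates to 0 on the main kernel *)

ParallelEvaluate[$KernelID]   (* e.g. {1, 2, 3, 4}: ordinary code runs on the subkernels *)
ParallelEvaluate[whereAmI[]]  (* {0, 0, 0, 0}: every call was forwarded to the main kernel *)
```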



The game of life example from Wolfram Blog



This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2.) is negligible.



The amount of computation done on the main kernel (point 4.), a simple Sow, is also negligible.



The amount of data transferred in each callback is actually a lot, and it is in a format that is slow to transfer (graphics—the result of an ArrayPlot). But this won't matter because there are so few callbacks.



To keep things simpler to analyse, I am going to remove the ArrayPlot in the following examples.



What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100, so that more of the results are kept. Instead of ~10 results, we now get ~600. And the parallel version is no longer faster than the serial version.



In[73]:= Parallelize[
  Do[Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 100],
      ParallelSow[result]]], {10000}]] // Reap // Last // Last //
   Length // AbsoluteTiming

Out[73]= {2.23647, 599}

In[74]:= Do[
   Module[{init = RandomInteger[1, {6, 6}], result},
    If[InterestingQ[result = Life[init, 100], 100],
     ParallelSow[result]]], {10000}] // Reap // Last // Last //
  Length // AbsoluteTiming

Out[74]= {2.19312, 554}


If we use a threshold of 50 in InterestingQ, we get ~2000 results. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That's an almost 3x slowdown!



Let's think about how many main kernel – subkernel communications occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see the Method option of Parallelize), this would require only $2\times 4$ communications (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as the number of calls to ParallelSow. That's only $2\times 10$ in the original code, but as many as 4000 communications if ParallelSow is triggered in 20% of the cases (which is what happened with a threshold of 50).
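This back-of-envelope count can be written down as a tiny helper (commCount is just an illustrative name, not a built-in):

```mathematica
(* 2 communications per subkernel to distribute and collect the work,
   plus 2 per ParallelSow call (one round trip each) *)
commCount[kernels_, sowCalls_] := 2 kernels + 2 sowCalls

commCount[4, 10]            (* original code, threshold 200: 28 communications *)
commCount[4, 0.20 * 10000]  (* threshold 50: 4008 communications, dominated by ParallelSow *)
```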



Can we use a different approach to parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels. Then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this. But in many specific cases, we can use simpler functions than ParallelCombine. In this case we could use a simple ParallelTable:



Flatten[
  ParallelTable[
   Do[Module[{init = RandomInteger[1, {6, 6}], result},
      If[InterestingQ[result = Life[init, 100], 50],
       Sow[result]]], {2500}] // Reap // Last,
   {4}
   ],
  2
  ] // Length // AbsoluteTiming

(* {0.675195, 1955} *)


The differences are:



  • Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table.


  • Use Sow instead of ParallelSow to avoid triggering a shared function.


  • Combine the four result lists with Flatten[..., 2].


Let's time the serial version too:



Flatten[
  Table[
   Do[Module[{init = RandomInteger[1, {6, 6}], result},
      If[InterestingQ[result = Life[init, 100], 50],
       Sow[result]]], {2500}] // Reap // Last,
   {4}
   ],
  2
  ] // Length // AbsoluteTiming

(* {2.26587, 1981} *)


We're back to a 3.3x speedup, even though we need to collect 20% of the results.



A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further will make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or subkernels.



Summary



  • As a general guideline, try to minimize the number of times inter-kernel communication must occur.

  • In practice, SetSharedFunction is only useful if the shared function gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.

  • These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.

  • When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as an output. Try to produce a quarter on each core, then combine afterwards.
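The data-oriented pattern from the last point can be sketched generically (data and process are hypothetical placeholders for your input and per-element computation):

```mathematica
(* split the input into one chunk per subkernel, process whole chunks in
   parallel, and combine the partial results at the end *)
chunks = Partition[data, UpTo[Ceiling[Length[data]/$KernelCount]]];
Join @@ ParallelMap[Map[process, #] &, chunks]
```

Because each chunk is a single evaluation unit, each subkernel receives and returns data only once, regardless of how many elements it processes.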













edited 15 mins ago

























answered 27 mins ago









Szabolcs

  • Thanks for the explanation!
    – mattiav27
    16 mins ago















