Which is the correct way to parallelize this code?
I am reading this blog on how to parallelize Sow and Reap, but it uses SetSharedFunction:
Life[init_, step_] := First[CellularAutomaton[{224, {2, {{2, 2, 2}, {2, 1, 2}, {2, 2, 2}}}, {1, 1}}, {init, 0}, {{{step}}}]]
InterestingQ[result_, n_] := Count[Flatten[result], 1] >= n
SetSharedFunction[ParallelSow];
ParallelSow[expr_] := Sow[expr];
Parallelize[
   Do[Module[{init = RandomInteger[1, {6, 6}], result},
      If[InterestingQ[result = Life[init, 100], 200],
       ParallelSow[
        ArrayPlot[init, ImageSize -> 50] ->
         ArrayPlot[result, ImageSize -> 70]]]], {10000}]] //
  Reap // Last
According to @Szabolcs this is far from optimal use of Parallelize: how can I parallelize this kind of code efficiently?
parallelization
asked 3 hours ago, edited 2 hours ago – mattiav27
Can you please fix the mismatched brackets, the spelling errors in function names, and make the example executable? I.e. make sure every function has a definition, and we can run your code. Use Pause if you need to simulate slow processing.
– Szabolcs
2 hours ago
Now it should be ok
– mattiav27
2 hours ago
Can you add the definition for Life?
– Szabolcs
2 hours ago
now it should be fine
– mattiav27
2 hours ago
1 Answer
Introduction
So it seems that this question is aimed squarely at me, as I have warned against the use of SetSharedFunction and SetSharedVariable many times. Thus I will explain in detail why I consider these functions dangerous for performance, and what alternatives one can use.
First of all, let us point out that the Wolfram Blog example you linked to does work well. The parallel and serial versions run in 0.78 and 2.42 seconds, respectively, on my 4-core machine. That's a speedup factor of 3, i.e. very good for a 4-core machine.
It is thus clear that SetSharedFunction can be useful, if used correctly. However, most naïve uses I see here on SE result in a slowdown.
Parallelization in Mathematica
Mathematica uses distributed memory parallelism. This means that each parallel thread (implemented as a separate kernel process) has its own private memory. Memory cannot be shared between them. Instead, data is sent/copied between different kernels as needed. All this inter-kernel communication can be slow. The key to writing fast parallel code in Mathematica is understanding what triggers inter-kernel communication, and how much overhead this introduces.
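To make this concrete, here is a minimal sketch (assuming subkernels can be launched on your machine; the kernel IDs shown are illustrative) of how separate the kernels really are:

LaunchKernels[];                 (* start the configured subkernels *)
ParallelEvaluate[$KernelID]      (* e.g. {1, 2, 3, 4}: each subkernel is a separate process *)

f[x_] := x^2;                    (* defined on the main kernel only *)
DistributeDefinitions[f];        (* explicitly copy the definition to every subkernel *)
ParallelEvaluate[f[$KernelID]]   (* e.g. {1, 4, 9, 16} *)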
Roughly, these are the factors that determine the speed of a parallel computation, and each one can potentially be a bottleneck:
1. Time spent computing by the parallel kernels.
2. Each instance of communication between subkernels and the main kernel involves some overhead, even if almost no data is sent.
3. The time spent on inter-kernel communication also depends on the type and amount of data sent. E.g. sending a number is faster than sending a large array, and sending a packed array is faster than sending an unpacked one.
4. Time spent computing by the main kernel (which may limit how quickly it can feed data to subkernels).
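A quick way to feel point 2 is to parallelize something trivially cheap. This is a hypothetical micro-benchmark (timings vary by machine); the point is that when per-item work is tiny, communication overhead can eat the entire gain:

AbsoluteTiming[Table[Total[Range[100]], {10^5}];]          (* cheap work, no communication *)
AbsoluteTiming[ParallelTable[Total[Range[100]], {10^5}];]  (* often no faster, or slower *)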
SetSharedFunction
SetSharedFunction and SetSharedVariable are a way to overcome this limitation of distributed memory parallelism. They make functions or variables behave as if memory were shared between subkernels. This is achieved by forcing these functions and variables to always be evaluated on the main kernel.
This means that after SetSharedFunction[fun], each use of fun triggers a communication with the main kernel (points 2 and possibly 3 above) and requires a computation by the main kernel (point 4 above).
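You can verify this directly. In the following hypothetical demo, whereAmI is a name I made up; $KernelID is 0 on the main kernel, so the output shows where each call actually ran (assuming four subkernels):

SetSharedFunction[whereAmI];
whereAmI[] := $KernelID;
ParallelEvaluate[whereAmI[]]   (* {0, 0, 0, 0}: every call was routed to the main kernel *)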
SetSharedFunction[fun] is only worth it if fun is called rarely compared to the number of computations done by the subkernels, and if each call evaluates much faster than the chunks of computation done on the subkernels.
The Game of Life example from the Wolfram Blog
This is exactly the situation with the example you linked to. The parallelized Do has 10,000 iterations, but ParallelSow (the shared function) is only invoked in about 10 of them, i.e. only 0.1%! In this case, the overhead from the extra inter-kernel communication (point 2) is negligible.
The amount of computation done on the main kernel (point 4), a simple Sow, is also negligible.
The amount of data transferred in each callback is actually large, and it is in a format that is slow to transfer (graphics, the result of an ArrayPlot). But this does not matter because there are so few callbacks.
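To get a rough feel for the size difference (illustrative numbers; exact values depend on version), compare the raw result with its plot:

result = RandomInteger[1, {50, 50}];
ByteCount[result]              (* a packed integer array, on the order of 20 kB *)
ByteCount[ArrayPlot[result]]   (* a Graphics expression, typically much larger *)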
To keep things simpler to analyse, I am going to remove the ArrayPlot in the following examples.
What would happen if ParallelSow were triggered more often? After removing the ArrayPlot, let's change the second argument of InterestingQ from 200 to 100, so that more of the results are kept. Instead of ~10 results, we now get ~600. And the parallel version is no longer faster than the serial version.
In[73]:= Parallelize[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 100],
        ParallelSow[result]]], {10000}]] // Reap // Last // Last //
   Length // AbsoluteTiming

Out[73]= {2.23647, 599}
In[74]:= Do[
    Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 100],
      ParallelSow[result]]], {10000}] // Reap // Last // Last //
   Length // AbsoluteTiming

Out[74]= {2.19312, 554}
If we use a threshold of 50 in InterestingQ, we get ~2000 results. The parallel version now takes 6.6 s while the serial one takes only 2.3 s. That's an almost 3x slowdown!
Let's think about how many main kernel – subkernel communications occur in this computation. I am running 4 subkernels. The computations need to be both sent and received. With the "CoarsestGrained" setting (see the Method option of Parallelize), this would require only $2 \times 4$ communications (send the computations to each subkernel, then receive the results). In addition, we need twice as many communications as the number of calls to ParallelSow. That's only $2 \times 10$ in the original code, but it's as many as 4000 communications if ParallelSow is triggered in 20% of the cases (which is what happened with a threshold of 50).
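For reference, this batching behaviour can be requested explicitly through the Method option; a sketch based on the code above:

Parallelize[
  Do[Module[{init = RandomInteger[1, {6, 6}], result},
     If[InterestingQ[result = Life[init, 100], 200],
      ParallelSow[result]]], {10000}],
  Method -> "CoarsestGrained"] // Reap // Last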
Can we use a different approach to parallelization that avoids SetSharedFunction? Here's an idea: let's keep Sow/Reap, but collect partial results on the subkernels. Then transfer the partial results back to the main kernel and combine them there. Generally, ParallelCombine implements something similar to this. But in many specific cases, we can use simpler functions than ParallelCombine. In this case we can use a simple ParallelTable:
Flatten[
   ParallelTable[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 50],
        Sow[result]]], {2500}] // Reap // Last,
    {4}
    ],
   2
   ] // Length // AbsoluteTiming

(* {0.675195, 1955} *)
The differences are:

- Do only a quarter of the iterations, i.e. 2500 instead of 10000, but do them four times. This is implemented with (Parallel)Table.
- Use Sow instead of ParallelSow to avoid triggering a shared function.
- Combine the four result lists with Flatten[..., 2].
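As a side note, the hard-coded 4 above assumes exactly 4 subkernels; a sketch keyed to $KernelCount removes that assumption:

With[{n = $KernelCount},
  Flatten[
    ParallelTable[
     Do[Module[{init = RandomInteger[1, {6, 6}], result},
        If[InterestingQ[result = Life[init, 100], 50],
         Sow[result]]], {Quotient[10000, n]}] // Reap // Last,
     {n}],           (* assumes n divides 10000 evenly; otherwise a few iterations are dropped *)
    2]
  ] // Length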
Let's time the serial version too:
Flatten[
   Table[
    Do[Module[{init = RandomInteger[1, {6, 6}], result},
       If[InterestingQ[result = Life[init, 100], 50],
        Sow[result]]], {2500}] // Reap // Last,
    {4}
    ],
   2
   ] // Length // AbsoluteTiming

(* {2.26587, 1981} *)
We're back to a 3.3x speedup, even though we now need to collect 20% of the results.
A note on removing ArrayPlot: if you experiment with keeping ArrayPlot, you will notice that the behaviour is a bit different. The parallel version is still faster at an InterestingQ threshold of 100. I opted to remove ArrayPlot because decreasing the threshold further would make the computation too slow to be useful as an example, and also because it complicates the situation. Since the result of ArrayPlot is a complicated and slow-to-transfer expression, we would need to analyse which is slower: computing an ArrayPlot or transferring the result back to the main kernel. Based on the findings, we would decide whether the visualization is better done on the main kernel or on the subkernels.
Summary
- As a general guideline, try to minimize the number of times inter-kernel communication must occur.
- In practice, SetSharedFunction is only useful if the shared function gets called rarely (not in every iteration), or if the computations done outside of the shared function call are very slow.
- These are merely guidelines to help you understand what happens. They let you make reasonable guesses about what is likely to be slow or fast, but in the end you should always benchmark.
- When writing parallel code, try to think not in terms of algorithms, but in terms of data. For example, process a quarter of the dataset on each of the 4 cores you have, then combine the results. If there is no input dataset, there is still one produced as output: try to produce a quarter on each core, then combine afterwards.
answered 27 mins ago, edited 15 mins ago – Szabolcs
Thanks for the explanation!
– mattiav27
16 mins ago