Parallelization does not use all cores fully
Clash Royale CLAN TAG#URR8PPP
up vote
1
down vote
favorite
On a new laptop with a 6-core i7-8850H in Mma11.3, I'd like to use its full capabilities. Unicore code
prm = Prime /@ Range[1,1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
Table[f[5^100+i],i,0,10] //AbsoluteTiming
returns 43.9328,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
The multicore code
prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100+i],i,#]&, Range[0,10], Join] //AbsoluteTiming
returns 26.1113,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
In Evaluation -> Parallel Kernel Configuration, I have:
Why is the speed-up less than 2x? Why does the multicore code only give 33% use of my CPU and unicore 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?
parallelization
 |Â
show 1 more comment
up vote
1
down vote
favorite
On a new laptop with a 6-core i7-8850H in Mma11.3, I'd like to use its full capabilities. Unicore code
prm = Prime /@ Range[1,1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
Table[f[5^100+i],i,0,10] //AbsoluteTiming
returns 43.9328,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
The multicore code
prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100+i],i,#]&, Range[0,10], Join] //AbsoluteTiming
returns 26.1113,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
In Evaluation -> Parallel Kernel Configuration, I have:
Why is the speed-up less than 2x? Why does the multicore code only give 33% use of my CPU and unicore 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?
parallelization
1
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
1
try addingMethod-> "CoarsestGrained"
at the end ofParallelCombine
. you can also useLabeled
to show which$KernelId
was used to obtain the results
â Alucard
4 hours ago
CoarsestGrained
andFinestGrained
both require 26sec.
â Leon
4 hours ago
@Leon try addingLaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with$KernelCount
â Alucard
4 hours ago
 |Â
show 1 more comment
up vote
1
down vote
favorite
up vote
1
down vote
favorite
On a new laptop with a 6-core i7-8850H in Mma11.3, I'd like to use its full capabilities. Unicore code
prm = Prime /@ Range[1,1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
Table[f[5^100+i],i,0,10] //AbsoluteTiming
returns 43.9328,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
The multicore code
prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100+i],i,#]&, Range[0,10], Join] //AbsoluteTiming
returns 26.1113,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
In Evaluation -> Parallel Kernel Configuration, I have:
Why is the speed-up less than 2x? Why does the multicore code only give 33% use of my CPU and unicore 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?
parallelization
On a new laptop with a 6-core i7-8850H in Mma11.3, I'd like to use its full capabilities. Unicore code
prm = Prime /@ Range[1,1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
Table[f[5^100+i],i,0,10] //AbsoluteTiming
returns 43.9328,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
The multicore code
prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First/@FactorInteger[i]) [Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100+i],i,#]&, Range[0,10], Join] //AbsoluteTiming
returns 26.1113,1,4,3,4,3,8,0,2,1,1,2
. Task manager measures:
In Evaluation -> Parallel Kernel Configuration, I have:
Why is the speed-up less than 2x? Why does the multicore code only give 33% use of my CPU and unicore 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?
parallelization
parallelization
edited 10 mins ago
AccidentalFourierTransform
4,7921940
4,7921940
asked 5 hours ago
Leon
315111
315111
1
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
1
try addingMethod-> "CoarsestGrained"
at the end ofParallelCombine
. you can also useLabeled
to show which$KernelId
was used to obtain the results
â Alucard
4 hours ago
CoarsestGrained
andFinestGrained
both require 26sec.
â Leon
4 hours ago
@Leon try addingLaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with$KernelCount
â Alucard
4 hours ago
 |Â
show 1 more comment
1
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
1
try addingMethod-> "CoarsestGrained"
at the end ofParallelCombine
. you can also useLabeled
to show which$KernelId
was used to obtain the results
â Alucard
4 hours ago
CoarsestGrained
andFinestGrained
both require 26sec.
â Leon
4 hours ago
@Leon try addingLaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with$KernelCount
â Alucard
4 hours ago
1
1
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
1
1
try adding
Method-> "CoarsestGrained"
at the end of ParallelCombine
. you can also use Labeled
to show which $KernelId
was used to obtain the resultsâ Alucard
4 hours ago
try adding
Method-> "CoarsestGrained"
at the end of ParallelCombine
. you can also use Labeled
to show which $KernelId
was used to obtain the resultsâ Alucard
4 hours ago
CoarsestGrained
and FinestGrained
both require 26sec.â Leon
4 hours ago
CoarsestGrained
and FinestGrained
both require 26sec.â Leon
4 hours ago
@Leon try adding
LaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with $KernelCount
â Alucard
4 hours ago
@Leon try adding
LaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with $KernelCount
â Alucard
4 hours ago
 |Â
show 1 more comment
1 Answer
1
active
oldest
votes
up vote
3
down vote
The following was perfomed on a 4-processor, 8-thread PC running
$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)
To begin,
ParallelCombine[Table[f[5^100 + i], i, #] &, Range[0, 10], Join] // AbsoluteTiming
(* 37.2452, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
This is equivalent to
ParallelTable[f[5^100 + i], i, 0, 10] // AbsoluteTiming
(* 34.8027, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
which is a bit easier to understand and manipulate, in my view. I then tried both Method
s and also tried different orders of i
in ParallelTable
Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.
Table[f[5^100 + i] // AbsoluteTiming, i, 0, 10]
(* 0.0000608395, 1, 0.0108113, 4, 0.0120675, 3, 0.255642, 4,
2.88463, 3, 0.00794627, 8, 0.0868866, 0, 31.8475, 2,
3.3773, 1, 24.3363, 1, 0.0163816, 2 *)
i = 7
and i = 9
take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7
takes almost 32 seconds by itself, it is no longer is surprising that ParallelTable
takes 35 seconds to handle all i
.
Addendum
In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
The following was perfomed on a 4-processor, 8-thread PC running
$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)
To begin,
ParallelCombine[Table[f[5^100 + i], i, #] &, Range[0, 10], Join] // AbsoluteTiming
(* 37.2452, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
This is equivalent to
ParallelTable[f[5^100 + i], i, 0, 10] // AbsoluteTiming
(* 34.8027, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
which is a bit easier to understand and manipulate, in my view. I then tried both Method
s and also tried different orders of i
in ParallelTable
Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.
Table[f[5^100 + i] // AbsoluteTiming, i, 0, 10]
(* 0.0000608395, 1, 0.0108113, 4, 0.0120675, 3, 0.255642, 4,
2.88463, 3, 0.00794627, 8, 0.0868866, 0, 31.8475, 2,
3.3773, 1, 24.3363, 1, 0.0163816, 2 *)
i = 7
and i = 9
take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7
takes almost 32 seconds by itself, it is no longer is surprising that ParallelTable
takes 35 seconds to handle all i
.
Addendum
In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
add a comment |Â
up vote
3
down vote
The following was perfomed on a 4-processor, 8-thread PC running
$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)
To begin,
ParallelCombine[Table[f[5^100 + i], i, #] &, Range[0, 10], Join] // AbsoluteTiming
(* 37.2452, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
This is equivalent to
ParallelTable[f[5^100 + i], i, 0, 10] // AbsoluteTiming
(* 34.8027, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
which is a bit easier to understand and manipulate, in my view. I then tried both Method
s and also tried different orders of i
in ParallelTable
Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.
Table[f[5^100 + i] // AbsoluteTiming, i, 0, 10]
(* 0.0000608395, 1, 0.0108113, 4, 0.0120675, 3, 0.255642, 4,
2.88463, 3, 0.00794627, 8, 0.0868866, 0, 31.8475, 2,
3.3773, 1, 24.3363, 1, 0.0163816, 2 *)
i = 7
and i = 9
take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7
takes almost 32 seconds by itself, it is no longer is surprising that ParallelTable
takes 35 seconds to handle all i
.
Addendum
In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
add a comment |Â
up vote
3
down vote
up vote
3
down vote
The following was perfomed on a 4-processor, 8-thread PC running
$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)
To begin,
ParallelCombine[Table[f[5^100 + i], i, #] &, Range[0, 10], Join] // AbsoluteTiming
(* 37.2452, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
This is equivalent to
ParallelTable[f[5^100 + i], i, 0, 10] // AbsoluteTiming
(* 34.8027, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
which is a bit easier to understand and manipulate, in my view. I then tried both Method
s and also tried different orders of i
in ParallelTable
Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.
Table[f[5^100 + i] // AbsoluteTiming, i, 0, 10]
(* 0.0000608395, 1, 0.0108113, 4, 0.0120675, 3, 0.255642, 4,
2.88463, 3, 0.00794627, 8, 0.0868866, 0, 31.8475, 2,
3.3773, 1, 24.3363, 1, 0.0163816, 2 *)
i = 7
and i = 9
take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7
takes almost 32 seconds by itself, it is no longer is surprising that ParallelTable
takes 35 seconds to handle all i
.
Addendum
In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
The following was perfomed on a 4-processor, 8-thread PC running
$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)
To begin,
ParallelCombine[Table[f[5^100 + i], i, #] &, Range[0, 10], Join] // AbsoluteTiming
(* 37.2452, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
This is equivalent to
ParallelTable[f[5^100 + i], i, 0, 10] // AbsoluteTiming
(* 34.8027, 1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2 *)
which is a bit easier to understand and manipulate, in my view. I then tried both Method
s and also tried different orders of i
in ParallelTable
Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.
Table[f[5^100 + i] // AbsoluteTiming, i, 0, 10]
(* 0.0000608395, 1, 0.0108113, 4, 0.0120675, 3, 0.255642, 4,
2.88463, 3, 0.00794627, 8, 0.0868866, 0, 31.8475, 2,
3.3773, 1, 24.3363, 1, 0.0163816, 2 *)
i = 7
and i = 9
take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7
takes almost 32 seconds by itself, it is no longer is surprising that ParallelTable
takes 35 seconds to handle all i
.
Addendum
In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
edited 2 mins ago
answered 3 hours ago
bbgodfrey
43.2k857104
43.2k857104
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f184282%2fparallelization-does-not-use-all-cores-fully%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
1
In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
â bbgodfrey
4 hours ago
@bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45sec 15%, 26sec 30%).
â Leon
4 hours ago
1
try adding
Method-> "CoarsestGrained"
at the end ofParallelCombine
. you can also useLabeled
to show which$KernelId
was used to obtain the resultsâ Alucard
4 hours ago
CoarsestGrained
andFinestGrained
both require 26sec.â Leon
4 hours ago
@Leon try adding
LaunchKernels
before you distribute the definitions, then check how many kernels mathematica see with$KernelCount
â Alucard
4 hours ago