Parallelization does not use all cores fully

On a new laptop with a 6-core i7-8850H running Mathematica 11.3, I'd like to use its full capabilities. The single-core code

prm = Prime /@ Range[1, 1000]; 
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
Table[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming

returns {43.9328, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager measures:

[Task Manager screenshot: roughly 16% total CPU utilization]

The multicore code

prm = Prime /@ Range[1, 1000]; 
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming

returns {26.1113, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager measures:

[Task Manager screenshot: roughly 33% total CPU utilization]

In Evaluation -> Parallel Kernel Configuration, I have:

[Parallel Kernel Configuration screenshot]

Why is the speed-up less than 2x? Why does the multicore code use only 33% of my CPU, and the single-core code only 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?
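Since the configuration screenshot did not survive, here is a minimal sketch (not part of the original question) of how the corresponding information can be read off in code; these are standard parallel-tools symbols:

(* Inspect the parallel-kernel setup from within a session. *)
$ConfiguredKernels           (* kernels set up under Parallel Kernel Configuration *)
$KernelCount                 (* number of parallel kernels currently running *)
ParallelEvaluate[$KernelID]  (* IDs of the running subkernels (launches them if needed) *)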










Tag: parallelization






asked 5 hours ago by Leon, edited 10 mins ago by AccidentalFourierTransform







  • In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
    – bbgodfrey, 4 hours ago

  • @bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45 sec at 15%, 26 sec at 30%).
    – Leon, 4 hours ago

  • Try adding Method -> "CoarsestGrained" at the end of ParallelCombine. You can also use Labeled to show which $KernelID was used to obtain each result (see the sketch after these comments).
    – Alucard, 4 hours ago

  • "CoarsestGrained" and "FinestGrained" both require 26 sec.
    – Leon, 4 hours ago

  • @Leon Try adding LaunchKernels before you distribute the definitions, then check how many kernels Mathematica sees with $KernelCount (see the sketch after these comments).
    – Alucard, 4 hours ago
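A minimal sketch combining the two suggestions above (illustrative only, not from the thread; it assumes prm and f are defined exactly as in the question):

(* Launch the configured subkernels explicitly and check how many are running,
   then pass an explicit scheduling method to ParallelCombine and label each
   result with the subkernel that produced it. *)
LaunchKernels[];
$KernelCount

DistributeDefinitions[prm, f];
ParallelCombine[
  Table[Labeled[f[5^100 + i], $KernelID], {i, #}] &,
  Range[0, 10], Join,
  Method -> "CoarsestGrained"] // AbsoluteTiming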












1 Answer






The following was performed on a 4-core, 8-thread PC running

$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)

To begin,

ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming
(* {37.2452, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)

This is equivalent to

ParallelTable[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming
(* {34.8027, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)

which is a bit easier to understand and manipulate, in my view. I then tried both Method settings and also different orderings of i in ParallelTable. Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined:

Table[f[5^100 + i] // AbsoluteTiming, {i, 0, 10}]
(* {{0.0000608395, 1}, {0.0108113, 4}, {0.0120675, 3}, {0.255642, 4},
    {2.88463, 3}, {0.00794627, 8}, {0.0868866, 0}, {31.8475, 2},
    {3.3773, 1}, {24.3363, 1}, {0.0163816, 2}} *)

i = 7 and i = 9 take most of the time, with two kernels computing the others quickly and then finishing. Since the parallel wall time cannot be shorter than the slowest single item, and i = 7 alone takes almost 32 seconds, it is no longer surprising that ParallelTable needs 35 seconds to handle all i.
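To make the imbalance visible directly, here is a small diagnostic sketch (not part of the original answer; it assumes f and prm are defined and distributed as in the question) that tags every item with its own timing and the subkernel that handled it:

(* For each i, record which subkernel handled it and how long it took;
   the two slow items (i = 7 and i = 9) should dominate the wall time. *)
ParallelTable[{i, $KernelID, First@AbsoluteTiming[f[5^100 + i]]}, {i, 0, 10}] // TableForm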



Addendum

In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more kernels typically just adds overhead.
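As a quick check of that last point, the following sketch (again not from the original answer) compares the core count the system reports with the number of subkernels actually launched:

(* Compare the number of CPU cores the system reports with the number of
   subkernels actually running; launching more kernels than cores typically
   just adds overhead for a CPU-bound task like this one. *)
$ProcessorCount
LaunchKernels[];
$KernelCount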






answered 3 hours ago by bbgodfrey, edited 2 mins ago



























             
