Parallelization does not use all cores fully

On a new laptop with a 6-core i7-8850H running Mathematica 11.3, I'd like to use its full capabilities. The single-core code



prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
Table[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming


returns {43.9328, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager shows:
[screenshot of Task Manager: about 16% total CPU usage]
The multicore code



prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming


returns {26.1113, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager shows:
[screenshot of Task Manager: about 33% total CPU usage]
In Evaluation -> Parallel Kernel Configuration, I have:
[screenshot of the Parallel Kernel Configuration pane]



Why is the speed-up less than 2x? Why does the multicore code use only 33% of my CPU while the single-core code uses 16%? How can I use all 6 cores fully (or at least 5 of them to the max)?

  • In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)? – bbgodfrey, 4 hours ago

  • @bbgodfrey I thought the number of Mma kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45 sec at 15%, 26 sec at 30%). – Leon, 4 hours ago

  • Try adding Method -> "CoarsestGrained" at the end of ParallelCombine. You can also use Labeled to show which $KernelID was used to obtain the results. – Alucard, 4 hours ago

  • "CoarsestGrained" and "FinestGrained" both take 26 sec. – Leon, 4 hours ago

  • @Leon Try adding LaunchKernels before you distribute the definitions, then check how many kernels Mathematica sees with $KernelCount. – Alucard, 4 hours ago
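Putting the last two suggestions together, a minimal diagnostic sketch (using the prm and f definitions from the question) launches one subkernel per physical core, confirms the count, and tags each result with the kernel that produced it:

CloseKernels[];    (* start from a clean state *)
LaunchKernels[6];  (* one subkernel per physical core *)
$KernelCount       (* should return 6 *)
DistributeDefinitions[prm, f];
ParallelTable[{$KernelID, f[5^100 + i]}, {i, 0, 10}] // AbsoluteTiming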














1 Answer

The following was performed on a 4-core, 8-thread PC running



$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)


To begin,



ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming
(* {37.2452, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)


This is equivalent to



ParallelTable[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming
(* {34.8027, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)


which is a bit easier to understand and manipulate, in my view. I then tried both Method settings and also tried different orderings of i in ParallelTable; nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.



Table[f[5^100 + i] // AbsoluteTiming, {i, 0, 10}]
(* {{0.0000608395, 1}, {0.0108113, 4}, {0.0120675, 3}, {0.255642, 4},
    {2.88463, 3}, {0.00794627, 8}, {0.0868866, 0}, {31.8475, 2},
    {3.3773, 1}, {24.3363, 1}, {0.0163816, 2}} *)


i = 7 and i = 9 take most of the time, with two kernels computing the others quickly and then finishing. Given that i = 7 takes almost 32 seconds by itself, it is no longer surprising that ParallelTable takes 35 seconds to handle all i.
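In other words, the parallel wall time can never drop below the time of the slowest single item, no matter how many kernels are launched. A quick sanity check along these lines (a sketch that simply reuses the serial timings above):

timings = First /@ Table[AbsoluteTiming[f[5^100 + i]], {i, 0, 10}];
Total[timings]  (* total work for a single kernel, about 63 s here *)
Max[timings]    (* lower bound for any parallel schedule, about 32 s here *)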



Addendum



In my experience, a computation that lends itself well to parallelization will launch four kernels on my PC, each using 16% - 19% of total CPU capacity as measured by Task Manager. Thus, the best I have seen is total CPU utilization of about 70%. Note that "the number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to https://reference.wolfram.com/language/workflow/RunAComputationInParallel.html. Using more typically just adds overhead.
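As a quick check of that correspondence on a given machine, a minimal sketch using the standard system symbols:

{$ProcessorCount, $KernelCount}
(* CPU cores reported by the system vs. parallel subkernels currently launched *)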





