Parallelization does not use all cores fully, help :(

On a new laptop with a 6-core i7-8850H running Mathematica 11.3, I'd like to use the CPU's full capabilities. The single-core code

prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
Table[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming

returns {43.9328, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager measures:
[screenshot omitted]
The multicore code

prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];
DistributeDefinitions[prm, f];
ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming

returns {26.1113, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}}. Task Manager measures:
[screenshot omitted]
In Evaluation -> Parallel Kernel Configuration, I have:
[screenshot omitted]

Why is the speed-up less than 2x? Why does the multicore code use only 33% of my CPU, and the single-core code only 16%? How can I use all 6 cores fully (or at least 5 cores to the max)?



























  • In my experience, Mathematica uses only one thread per core. How does the CPU utilization look (average utilization and time history) with six kernels launched, and with one kernel launched (i.e., not parallel)?
    – bbgodfrey
    3 hours ago

  • @bbgodfrey I thought the number of Mathematica kernels should equal the number of CPU threads. Should it equal the number of CPU cores? Anyway, I tried the same two computations with 6 kernels, and the results are very similar (45 s at 15%, 26 s at 30%).
    – Leon
    3 hours ago

  • Try adding Method -> "CoarsestGrained" at the end of ParallelCombine. You can also use Labeled to show which $KernelID was used to obtain the results.
    – Alucard
    3 hours ago

  • CoarsestGrained and FinestGrained both require 26 s.
    – Leon
    3 hours ago

  • "The number of kernels available for parallel computation typically corresponds to the number of CPU cores", according to reference.wolfram.com/language/workflow/…. Using more typically just adds overhead. So, with two-thread cores, it is difficult to obtain better than 50% utilization as measured by Task Manager.
    – bbgodfrey
    3 hours ago















parallelization
















asked 4 hours ago by Leon







1 Answer
The following was performed on a 4-core, 8-thread PC running

$Version
(* "11.3.0 for Microsoft Windows (64-bit) (March 7, 2018)" *)


To begin,

ParallelCombine[Table[f[5^100 + i], {i, #}] &, Range[0, 10], Join] // AbsoluteTiming
(* {37.2452, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)

This is equivalent to

ParallelTable[f[5^100 + i], {i, 0, 10}] // AbsoluteTiming
(* {34.8027, {1, 4, 3, 4, 3, 8, 0, 2, 1, 1, 2}} *)


which is a bit easier to understand and manipulate, in my view. I then tried both Method settings ("CoarsestGrained" and "FinestGrained") and also tried different orders of i in ParallelTable. Nothing mattered. In all cases, Task Manager showed that all four kernels on my computer each ran at about 18% of total CPU capacity. However, two finished almost immediately, while the other two continued for some time, after which another finished and the last continued a bit longer. The reason for this behavior, it turns out, is easily determined.

Table[f[5^100 + i] // AbsoluteTiming, {i, 0, 10}]
(* {{0.0000608395, 1}, {0.0108113, 4}, {0.0120675, 3}, {0.255642, 4},
    {2.88463, 3}, {0.00794627, 8}, {0.0868866, 0}, {31.8475, 2},
    {3.3773, 1}, {24.3363, 1}, {0.0163816, 2}} *)

The cases i = 7 and i = 9 take most of the time; the two kernels that handle only the other values finish quickly and then sit idle. Given that i = 7 takes almost 32 seconds by itself, it is no longer surprising that ParallelTable takes 35 seconds to handle all i.
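The effect of this imbalance can be estimated directly from the per-item timings listed above. The following is only a rough sketch: the names timings, nKernels and lowerBound are illustrative and not part of the original session, and it assumes that each item runs on exactly one kernel and that scheduling overhead is negligible.

```mathematica
(* Per-item timings in seconds for i = 0, ..., 10, copied from the output above *)
timings = {0.0000608395, 0.0108113, 0.0120675, 0.255642, 2.88463,
   0.00794627, 0.0868866, 31.8475, 3.3773, 24.3363, 0.0163816};

nKernels = 4; (* assumption: one subkernel per physical core *)

(* No schedule can finish before the slowest single item,
   nor faster than a perfectly even split of the total work *)
lowerBound = Max[Max[timings], Total[timings]/nKernels]
(* 31.8475 *)

(* Best possible speed-up over the serial run for this set of items *)
Total[timings]/lowerBound
(* about 1.97 *)
```

Under these assumptions a single item dominates, so the speed-up for these eleven inputs stays just below 2x however many kernels are launched. Better utilization would require a workload with many items of comparable cost.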



Addendum

In my experience, a computation that lends itself well to parallelization will launch four kernels, each using 16% to 19% of total CPU capacity as measured by Task Manager. Thus, the best total CPU utilization I have ever seen is about 75%.
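To see how the work is actually spread over the subkernels, something along the following lines may help. It is a sketch rather than a recommended configuration: the names kernels and result are illustrative, and launching $ProcessorCount subkernels is an assumption (how that count relates to physical cores versus hardware threads is worth checking on your own system, and the license may limit how many subkernels start).

```mathematica
(* Definitions from the question *)
prm = Prime /@ Range[1, 1000];
f[i_] := Length[(First /@ FactorInteger[i]) \[Intersection] prm];

$ProcessorCount  (* number of processor cores reported for this machine *)

CloseKernels[];                            (* stop any running subkernels *)
kernels = LaunchKernels[$ProcessorCount];  (* assumption: one subkernel per reported core *)
$KernelCount                               (* number of subkernels now running *)

(* Make sure prm and f are known on the subkernels *)
DistributeDefinitions[prm, f];

(* Record, for each i, which subkernel evaluated it and how long it took *)
result = ParallelTable[
   {i, $KernelID, First[AbsoluteTiming[f[5^100 + i]]]},
   {i, 0, 10}, Method -> "FinestGrained"]
```

Each entry of result has the form {i, kernel ID, seconds}, which shows which subkernels end up with the expensive values of i and which ones sit idle.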


































































answered 2 hours ago by bbgodfrey (edited 1 hour ago)



























             
