How to speed up evaluation of expression (multiple variables) to gain speed? Compile?

up vote
6
down vote

favorite

I obtained some long expression and need to use them with a lot of different parameters inside a simulation. However, evaluation of the expressions is pretty slow, so that I would like to speed it up. The first (and only) thing that came to my mind was Compile. But, unfortunately, speed is not improved significantly.

This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):

fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0836853, 0.045048 *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0718691, 0.045048 *)

As you can see, Compile basically has no effect. The bigger issue is that the harmful expressions I am dealing with rather take 10s to evaluate and I'd like to see one or more orders of magnitude speed up (if possible). I hope that the small example here is essentially limited by the same effect as the bigger expressions, so that it serves as a proper toy example.

Background on fun: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).

Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile or some other trick? Currently this is a serious bottleneck in my simulations.

asked Aug 23 at 9:08

Lukas

1,6771716

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
â€“Â Henrik Schumacher
Aug 23 at 9:13

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
â€“Â Lukas
Aug 23 at 9:19

1

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
â€“Â Henrik Schumacher
Aug 23 at 9:22

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
â€“Â Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
â€“Â Lukas
Aug 23 at 10:04

Â |Â
show 1 more comment

up vote
6
down vote

favorite

This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):

fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0836853, 0.045048 *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0718691, 0.045048 *)

Background on fun: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).

Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile or some other trick? Currently this is a serious bottleneck in my simulations.

asked Aug 23 at 9:08

Lukas

1,6771716

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
â€“Â Henrik Schumacher
Aug 23 at 9:13

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
â€“Â Lukas
Aug 23 at 9:19

1

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
â€“Â Henrik Schumacher
Aug 23 at 9:22

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
â€“Â Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
â€“Â Lukas
Aug 23 at 10:04

Â |Â
show 1 more comment

up vote
6
down vote

favorite

This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):

fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0836853, 0.045048 *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0718691, 0.045048 *)

Background on fun: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).

Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile or some other trick? Currently this is a serious bottleneck in my simulations.

asked Aug 23 at 9:08

Lukas

1,6771716

This is an example of the shortest expression (unable to paste the larger ones because they exceed the limits on pastebin):

fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0836853, 0.045048 *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* 0.0718691, 0.045048 *)

Background on fun: It essentially is composed of sums and products of Gaussians and their derivatives to higher orders (also powers of them).

Is there a way to significantly speed up the evaluation process? Potentially by more advanced usage of Compile or some other trick? Currently this is a serious bottleneck in my simulations.

asked Aug 23 at 9:08

Lukas

1,6771716

asked Aug 23 at 9:08

Lukas

1,6771716

asked Aug 23 at 9:08

Lukas

1,6771716

asked Aug 23 at 9:08

Lukas

1,6771716

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
â€“Â Henrik Schumacher
Aug 23 at 9:13

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
â€“Â Lukas
Aug 23 at 9:19

1

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
â€“Â Henrik Schumacher
Aug 23 at 9:22

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
â€“Â Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
â€“Â Lukas
Aug 23 at 10:04

Â |Â
show 1 more comment

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
â€“Â Henrik Schumacher
Aug 23 at 9:13

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
â€“Â Lukas
Aug 23 at 9:19

1

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
â€“Â Henrik Schumacher
Aug 23 at 9:22

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
â€“Â Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
â€“Â Lukas
Aug 23 at 10:04

You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sum: That's great because the Sums can be more efficiently compiled as this humongous symbolic expression. Plus one might use vectorized code instead of Compile which may perform even better.
â€“Â Henrik Schumacher
Aug 23 at 9:13

@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution of Sums were used or is it a general idea?
â€“Â Lukas
Aug 23 at 9:19

Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replace by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be compute much faster than Exp /@ a or Table[ Exp[x],x,a]. And Total[Exp[a]] is mush faster than Sum[Exp[x], x, a].
â€“Â Henrik Schumacher
Aug 23 at 9:22

Hm. What is the external source? If it is wirtten in another programming language, maybe its code can be processed to a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
â€“Â Henrik Schumacher
Aug 23 at 9:31

@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
â€“Â Lukas
Aug 23 at 10:04

Â |Â
show 1 more comment

1 Answer
1

active

oldest

votes

up vote
9
down vote

accepted

The code for fun[tt, sig, time, expon, bShift, lam]] was not correctly inlined into the compiled function funComp. Most robust ways is to use With as follows. I also apply N to avoid some integer to double typecasts that would otherwise take place at runtime.

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C"
 ]
 ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

PS. You might get a further minor performance boost by changing expon, _Integer to expon, _Real because there are still some integer to double typecasts left in the C code. But that was hardly measurable with the simple test I tried so far.

PPS. You might probably want to apply this function to large lists of inputs. The the following way to compile the function (with options RuntimeAttributes -> Listable, Parallelization -> True) might be your friend:

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C",
 RuntimeAttributes -> Listable,
 Parallelization -> True
 ]
 ];

Here some timing examples.

a = RandomReal[-1, 1, 100];

r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

1

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

Â |Â
show 1 more comment

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "387"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f180492%2fhow-to-speed-up-evaluation-of-expression-multiple-variables-to-gain-speed-com%23new-answer', 'question_page');

);

Post as a guest

Name

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
9
down vote

accepted

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C"
 ]
 ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C",
 RuntimeAttributes -> Listable,
 Parallelization -> True
 ]
 ];

Here some timing examples.

a = RandomReal[-1, 1, 100];

r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

1

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

Â |Â
show 1 more comment

up vote
9
down vote

accepted

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C"
 ]
 ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C",
 RuntimeAttributes -> Listable,
 Parallelization -> True
 ]
 ];

Here some timing examples.

a = RandomReal[-1, 1, 100];

r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

1

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

Â |Â
show 1 more comment

up vote
9
down vote

accepted

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C"
 ]
 ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C",
 RuntimeAttributes -> Listable,
 Parallelization -> True
 ]
 ];

Here some timing examples.

a = RandomReal[-1, 1, 100];

r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

funComp2 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Integer, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C"
 ]
 ];

(Compilation takes longer now because now there happens a lot.)

Resulting timings on my machine:

fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming

0.072316, 0.045048

0.000197, 0.045048

funComp3 = With[code = N[fun[tt, sig, time, expon, bShift, lam]],
 Compile[tt, _Real, sig, _Real, time, _Real, expon, _Real, bShift, _Real, lam, _Real,
 code,
 CompilationTarget -> "C",
 RuntimeAttributes -> Listable,
 Parallelization -> True
 ]
 ];

Here some timing examples.

a = RandomReal[-1, 1, 100];

r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]

9.128

0.151

0.00040

0.000047

2.60209*10^-18, 6.07153*10^-18, 0.

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

edited Aug 29 at 7:33

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

answered Aug 23 at 9:19

Henrik Schumacher

36.2k249102

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

1

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

Â |Â
show 1 more comment

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

1

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
â€“Â Lukas
Aug 23 at 10:02

You're welcome!
â€“Â Henrik Schumacher
Aug 23 at 10:11

Wow, that's quite an improvment! I am a little bit confused tho which function is which. Is the very first code block supposed to define funComp, not funComp2? Or the funComp is the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
â€“Â kkm
Aug 29 at 7:25

@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate but it is also much more accurate, in particular with very small timings. In order to clarify: The Parallelization->True gets only active if (i) the compiled function has also the attribute Listable and (ii) the function is also called in the appropriate way: For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property while r4 profits from both listability and (a bit) from parallelization.
â€“Â Henrik Schumacher
Aug 29 at 7:39

It's clear now which is which, thanks! As for the parallel matrix ops, If MMA uses MKL automatic parallelization, then it depends a lot of CPU architecture, problem size, L3 cache, but the most on the present alignment of stars. I've experienced bizarre results with a pure C++ programs, like a VM (Hyper-V) benefitting from parallel computation greatly, being on the order of a few times faster, while its own host physical computer experiencing some degradation on the same problem. Same OS, same MKL version, same program and task, and VM computed faster than its host. Black magic ,no less!
â€“Â kkm
Aug 29 at 7:48

Â |Â
show 1 more comment

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Search This Blog

Iyfjky