How to speed up evaluation of expressions in multiple variables? Compile?
I obtained some long expressions and need to use them with a lot of different parameters inside a simulation. However, evaluation of the expressions is pretty slow, so I would like to speed it up. The first (and only) thing that came to my mind was Compile. But, unfortunately, the speed is not improved significantly.
This is an example of the shortest expression (I am unable to paste the larger ones because they exceed the pastebin size limits):
fun[tt_, sig_, time_, expon_, bShift_, lam_] = Uncompress@Import["https://pastebin.com/raw/nj83nT2b"];
funComp = Compile[{{tt, _Real}, {sig, _Real}, {time, _Real}, {expon, _Integer}, {bShift, _Real}, {lam, _Real}}, fun[tt, sig, time, expon, bShift, lam]];
fun[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* {0.0836853, 0.045048} *)
funComp[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
(* {0.0718691, 0.045048} *)
As you can see, Compile basically has no effect. The bigger issue is that the problematic expressions I am actually dealing with take around 10 s to evaluate, and I would like to see one or more orders of magnitude of speed-up (if possible). I hope that the small example here is essentially limited by the same effect as the bigger expressions, so that it serves as a proper toy example.
Background on fun: it is essentially composed of sums and products of Gaussians and their derivatives up to higher orders (also powers of them).
Is there a way to significantly speed up the evaluation process? Perhaps by more advanced usage of Compile, or some other trick? Currently this is a serious bottleneck in my simulations.
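A quick diagnostic (my addition, not part of the original question): you can inspect the compiled bytecode with CompilePrint from the built-in CompiledFunctionTools` context. Any MainEvaluate entries in the listing mark calls that fall back to the ordinary evaluator and therefore gain nothing from compilation, which would explain the lack of speed-up here.

Needs["CompiledFunctionTools`"];
(* Shows the compiled instruction listing; MainEvaluate lines indicate
   parts of the expression that were not actually compiled. *)
CompilePrint[funComp]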
performance-tuning simplifying-expressions compile
You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sums, that's great, because the Sums can be compiled more efficiently than this humongous symbolic expression. Plus, one might use vectorized code instead of Compile, which may perform even better.
– Henrik Schumacher, Aug 23 at 9:13
@HenrikSchumacher Sorry, by "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand", in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" referring to a possible solution only if Sums were used, or is it a general idea?
– Lukas, Aug 23 at 9:19
Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replaced by vectorized code. But many built-in arithmetic operations are also already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be computed much faster than Exp /@ a or Table[Exp[x], {x, a}]. And Total[Exp[a]] is much faster than Sum[Exp[x], {x, a}].
– Henrik Schumacher, Aug 23 at 9:22
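A minimal sketch of the point above (my illustration, not part of the original comment), comparing a single vectorized call against an element-wise Map on a packed array:

a = RandomReal[{-1, 1}, 10^6];    (* packed array of machine reals *)
Exp[a]; // RepeatedTiming         (* one vectorized call on the whole array *)
Exp /@ a; // RepeatedTiming       (* element-wise Map: typically far slower *)
Total[Exp[a]]; // RepeatedTiming  (* vectorized sum *)

The exact timings are machine-dependent, but the vectorized forms are usually faster by an order of magnitude or more.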
Hm. What is the external source? If it is written in another programming language, maybe its code can be processed into a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
– Henrik Schumacher, Aug 23 at 9:31
@HenrikSchumacher The external source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
– Lukas, Aug 23 at 10:04
asked Aug 23 at 9:08 by Lukas
1 Answer
The code for fun[tt, sig, time, expon, bShift, lam] was not correctly inlined into the compiled function funComp. The most robust way is to use With as follows. I also apply N to avoid some integer-to-double typecasts that would otherwise take place at runtime.
funComp2 = With[{code = N[fun[tt, sig, time, expon, bShift, lam]]},
   Compile[{{tt, _Real}, {sig, _Real}, {time, _Real}, {expon, _Integer}, {bShift, _Real}, {lam, _Real}},
    code,
    CompilationTarget -> "C"
    ]
   ];
(Compilation takes longer now because a lot more is happening.)
Resulting timings on my machine:
fun[3.4, 10./4, 10., 3, 1.3456, Sqrt[2.]] // AbsoluteTiming
funComp2[3.4, 10/4, 10, 3, 1.3456, Sqrt[2]] // AbsoluteTiming
{0.072316, 0.045048}
{0.000197, 0.045048}
PS. You might get a further minor performance boost by changing {expon, _Integer} to {expon, _Real}, because there are still some integer-to-double typecasts left in the C code. But that was hardly measurable in the simple tests I have tried so far.
PPS. You probably want to apply this function to large lists of inputs. Then the following way to compile the function (with the options RuntimeAttributes -> {Listable}, Parallelization -> True) might be your friend:
funComp3 = With[{code = N[fun[tt, sig, time, expon, bShift, lam]]},
   Compile[{{tt, _Real}, {sig, _Real}, {time, _Real}, {expon, _Real}, {bShift, _Real}, {lam, _Real}},
    code,
    CompilationTarget -> "C",
    RuntimeAttributes -> {Listable},
    Parallelization -> True
    ]
   ];
Here are some timing examples.
a = RandomReal[{-1, 1}, 100];
r1 = fun[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r2 = fun[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
r3 = funComp3[#, 10/4, 10, 3, 1.3456, Sqrt[2]] & /@ a; // RepeatedTiming // First
r4 = funComp3[a, 10/4, 10, 3, 1.3456, Sqrt[2]]; // RepeatedTiming // First
errors = {Max[Abs[r1 - r2]], Max[Abs[r2 - r3]], Max[Abs[r3 - r4]]}
9.128
0.151
0.00040
0.000047
{2.60209*10^-18, 6.07153*10^-18, 0.}
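Since funComp3 carries the Listable attribute, it can also be threaded over several parameter lists at once, with scalar arguments broadcast to every element. A small sketch (my addition, assuming the funComp3 definition above):

tts = RandomReal[{-1, 1}, 1000];
sigs = RandomReal[{1, 3}, 1000];
(* one listable call evaluates all 1000 (tt, sig) pairs in a single sweep;
   the remaining scalar arguments are reused for every pair *)
res = funComp3[tts, sigs, 10., 3., 1.3456, Sqrt[2.]];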
This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
– Lukas, Aug 23 at 10:02
You're welcome!
– Henrik Schumacher, Aug 23 at 10:11
Wow, that's quite an improvement! I am a little bit confused, though, about which function is which. Is the very first code block supposed to define funComp, not funComp2? Or is funComp the OP's function, and you have two examples with timing results for two different versions of funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallelization)?
– kkm, Aug 29 at 7:25
@kkm Well observed! I corrected that, and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate, but it is also much more accurate, in particular for very small timings. To clarify: Parallelization -> True only becomes active if (i) the compiled function also has the attribute Listable and (ii) the function is called in the appropriate way. For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property, while r4 profits from both listability and (a bit) from parallelization.
– Henrik Schumacher, Aug 29 at 7:39
It's clear now which is which, thanks! As for the parallel matrix ops: if MMA uses MKL automatic parallelization, then it depends a lot on the CPU architecture, problem size, and L3 cache, but most of all on the present alignment of the stars. I have experienced bizarre results with pure C++ programs, like a VM (Hyper-V) benefitting greatly from parallel computation, being on the order of a few times faster, while its own physical host experienced some degradation on the same problem. Same OS, same MKL version, same program and task, and the VM computed faster than its host. Black magic, no less!
– kkm, Aug 29 at 7:48
answered Aug 23 at 9:19 (edited Aug 29 at 7:33) by Henrik Schumacher
36.2k249102
36.2k249102
This is brilliant! Thank you very much for the detailed analysis. It helps a lot!
– Lukas
Aug 23 at 10:02
You're welcome!
– Henrik Schumacher
Aug 23 at 10:11
Wow, that's quite an improvement! I am a little bit confused, though, about which function is which. Is the very first code block supposed to define funComp, not funComp2? Or is funComp the OP's function, and you have two examples with timing results for two different funComp2, as in the 1st and 3rd code sections (they differ by Listable and Parallel)?
– kkm
Aug 29 at 7:25
@kkm Well observed! I corrected that and also replaced AbsoluteTiming by RepeatedTiming. The latter takes much longer to compute a timing estimate, but it is also much more accurate, in particular with very small timings. To clarify: Parallelization -> True only becomes active if (i) the compiled function also has the attribute Listable and (ii) the function is called in the appropriate way. For r3, funComp3 is Mapped over the input vector; this does not exploit the Listable property, while r4 profits from both listability and (a bit) from parallelization.
– Henrik Schumacher
Aug 29 at 7:39
It's clear now which is which, thanks! As for the parallel matrix ops: if MMA uses MKL automatic parallelization, then it depends a lot on CPU architecture, problem size, and L3 cache, but most of all on the present alignment of the stars. I've experienced bizarre results with pure C++ programs, like a VM (Hyper-V) benefiting greatly from parallel computation, being on the order of a few times faster, while its own physical host computer experienced some degradation on the same problem. Same OS, same MKL version, same program and task, and the VM computed faster than its host. Black magic, no less!
– kkm
Aug 29 at 7:48
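A minimal sketch of the Map-versus-listable-call distinction described above, using a hypothetical compiled function g (not one of the functions from this thread):

```
(* g carries the Listable attribute, so g[a] threads over a packed
   array in low-level code, optionally in parallel. *)
g = Compile[{{x, _Real}}, x^2 + Sin[x],
   RuntimeAttributes -> {Listable}, Parallelization -> True];
a = RandomReal[{-1, 1}, 10^6];
g /@ a; // AbsoluteTiming  (* Map: one call per element; no parallelization *)
g[a]; // AbsoluteTiming    (* listable call: threads over the packed array *)
```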
You haven't generated the expression for the function fun by hand, have you? Please give us the code that produces this expression. If it contains many Sums: that's great, because the Sums can be compiled more efficiently than this humongous symbolic expression. Plus, one might use vectorized code instead of Compile, which may perform even better.
– Henrik Schumacher
Aug 23 at 9:13
@HenrikSchumacher Sorry, with "obtained" I mean from external sources. The derivation does not explicitly contain Sum in the sense of using the corresponding MMA function. Indeed, the humongous expression was obtained "by hand" in the form of manually adding/multiplying Gaussians with certain parameters. Was "vectorized code" explicitly referring to a possible solution if Sums were used, or is it a general idea?
– Lukas
Aug 23 at 9:19
Well, vectorization often boils down to vector-vector and matrix-vector products, so many sums can be replaced by vectorized code. But also many built-in arithmetic operations are already vectorized (in particular, they have the attribute Listable). For example, if a is a list of machine-precision real or complex numbers (and a should also be a packed array), Exp[a] will be computed much faster than Exp /@ a or Table[Exp[x], {x, a}]. And Total[Exp[a]] is much faster than Sum[Exp[x], {x, a}].
– Henrik Schumacher
Aug 23 at 9:22
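To make that vectorization point concrete, here is a small sketch (my own toy numbers, not from the thread):

```
a = RandomReal[{-1, 1}, 10^6];          (* packed array of machine reals *)
Total[Exp[a]]; // AbsoluteTiming        (* Exp threads over the whole packed array at once *)
Sum[Exp[x], {x, a}]; // AbsoluteTiming  (* element-by-element evaluation, much slower *)
```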
Hm. What is the external source? If it is written in another programming language, maybe its code can be processed into a simple symbolic Mathematica expression (and a list of replacement rules for the constants).
– Henrik Schumacher
Aug 23 at 9:31
@HenrikSchumacher External source is someone who is working with me. I will get in touch to see if things can be boiled down so that vectorization may be used. But the answer you provide below already helps significantly, so a major step towards bearable evaluation times can be made!
– Lukas
Aug 23 at 10:04