Why don't C++ compilers do better constant folding?

I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the jacobians.



This surprised me, since most of the jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x that of the function, not 10-12x. In order to model what a large amount of the jacobian work is like, I made a super minimal example with just a dot product (instead of the sin, cos, sqrt and more that would be in a real situation) that the compiler should be able to optimize to a single return value:



#include <Eigen/Core>
#include <Eigen/Geometry>

using Array12d = Eigen::Matrix<double,12,1>;

double testReturnFirstDot(const Array12d& b)
{
    Array12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b);
}


Which should be the same as



double testReturnFirst(const Array12d& b)
{
    return b(0);
}


I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6, nor MSVC 19 was able to make any optimizations at all over the naive dot product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (they still involve multiplications and additions), and MSVC doesn't do any optimizations at all.



I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations, being able to do better constant propagation/folding would make more optimizations apparent, even if the constant fold itself didn't result in a speedup.



While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested in what I can do on the practical side to make my own code faster when facing these kinds of patterns.







asked Aug 31 at 10:29 by jkflying

  • Floating point numbers are not real numbers; they have rigorous correctness requirements which are violated by obvious optimisations. E.g. (1.0 / 3.0) * 3.0 != (1.0 * 3.0) / 3.0 because rounding behaviour is fully specified, so you cannot simply cancel the 3. – Ben, Aug 31 at 10:37

  • The answer depends on the implementation of dot. Probably, it is not just a for loop with accumulation, but involves rescaling. No wonder that compilers can't optimize it. – Evg, Aug 31 at 10:37

  • The point of -ffast-math is to say "it's not necessary to comply with the standard". The MSVC equivalent of fast-math is /fp:fast; you may find that it does some optimisation if you specify that. – Ben, Aug 31 at 10:44

  • Once you added -ffast-math, the remaining "problem" is explicit vectorization; see my answer. – ggael, Aug 31 at 11:41

  • You can see the options in the godbolt: -O3 for gcc/clang, /Ox for MSVC. – jkflying, Aug 31 at 13:19














3 Answers

Accepted answer (61 votes) – ggael, answered Aug 31 at 11:37

This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd and 1 horizontal reduction within the remaining 4-component registers (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math GCC and Clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.



So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.
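
A minimal way to try that (EIGEN_DONT_VECTORIZE is a real Eigen configuration macro; it must be defined before any Eigen header is included):

#define EIGEN_DONT_VECTORIZE   // or pass -DEIGEN_DONT_VECTORIZE on the compiler command line
#include <Eigen/Core>
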



If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internals, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:



static const int DontVectorize = 0x80000000;

namespace Eigen {
namespace internal {

template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
  : traits<Matrix<_Scalar, _Rows, _Cols> >
{
  typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
  enum {
    EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
  };
};

} // namespace internal
} // namespace Eigen

using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;


Full example here: https://godbolt.org/z/bOEyzv
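
For completeness, here is a sketch (my addition, untested, not part of the original answer; the function name is made up) of the question's test function rewritten against the unvectorized type defined above:

// Same computation as testReturnFirstDot, but on the DontVectorize matrix type,
// so the dot product is emitted as scalar code that -ffast-math can fold away.
double testReturnFirstDotScalar(const ArrayS12d& b)
{
    ArrayS12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b); // ideally folds down to b(0) under -ffast-math
}
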






  • Why can't the compiler optimize the remaining vector instructions? Is it a QoI issue or is there a technical reason? – Rakete1111, Aug 31 at 12:23

  • @Rakete1111 Presumably because nobody sat down to write a detailed enough rules/model by which the compiler would track constant propagation through vector instructions. Some rules (such as multiplying by or adding 0.0) have evidently been included already, but it's probably difficult to make them as encompassing as the scalar ones. – Max Langhof, Aug 31 at 13:02

  • That would be technically possible by "un-vectorizing" the code, but this would go against what the user explicitly asked for, so it's debatable whether that is reasonable or not. – ggael, Aug 31 at 13:08

  • You are asking an awful lot of the compiler... for it to do what you want would require it to really develop some machine insight into the particulars of the problem. It's not impossible, but not the kind of thing compiler writers focus on. To us humans, it is obvious that a dot product in N dimensions where all but the first element of one vector are zeros is a trivial multiplication, but that is not the compiler's focus. Further, as noted above, to maintain consistency floating point must do what it does. Python, for one, uses many 30-year-old Fortran libraries for this reason. – eSurfsnake, Sep 1 at 5:39

















Answer (37 votes) – Max Langhof, answered Aug 31 at 11:23

"I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s."




They have no other choice unfortunately. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:



-0.0 + 0.0 = 0.0 // Not -0.0!


Similarly, multiplying by zero does not always yield zero:



0.0 * Infinity = NaN // Not 0.0!


So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
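
To see these two cases concretely, here is a minimal standalone check (my own sketch, not part of the original answer):

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    const double inf = std::numeric_limits<double>::infinity();

    std::printf("-0.0 + 0.0 = %g\n", -0.0 + 0.0);                  // prints 0 (i.e. +0.0), not -0.0
    std::printf("0.0 * inf  = %g\n", 0.0 * inf);                   // prints nan, not 0
    std::printf("isnan      = %d\n", (int)std::isnan(0.0 * inf));  // prints 1
}
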



You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
-ffinite-math-only, -fno-signed-zeros



Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.




As for why your example is not optimized better even with -ffast-math: That is on Eigen. Presumably they have vectorization on their matrix operations, which are much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
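
To illustrate the last point, here is a sketch of a plain-loop dot product along the lines of the linked example (my own code, not taken from that godbolt; the function name is made up). Per the comment below, Clang folds this kind of loop with the flags above, while GCC may still need some help:

// Compile with e.g.:  g++ -O3 -ffinite-math-only -fno-signed-zeros dot.cpp
double dotFirst(const double* b)
{
    double a[12] = {1.0};          // a[0] == 1.0, the remaining elements are zero
    double sum = 0.0;
    for (int i = 0; i < 12; ++i)   // a plain scalar loop, no explicit SIMD
        sum += a[i] * b[i];
    return sum;                    // ideally folds to just b[0]
}
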






  • Only clang optimizes a for loop; gcc doesn't do it. – Evg, Aug 31 at 11:51

















Answer (8 votes) – Evg, answered Aug 31 at 11:00

One way to force a compiler to optimize multiplications by 0s and 1s is to manually unroll the loop. For simplicity, let's use



#include <array>
#include <cstddef>
constexpr std::size_t n = 12;
using Array = std::array<double, n>;


Then we can implement a simple dot function using fold expressions (or recursion if they are not available; a recursive sketch is shown after the next block):



#include <utility>

template<std::size_t... is>
double dot(const Array& x, const Array& y, std::index_sequence<is...>)
{
    return ((x[is] * y[is]) + ...);
}

double dot(const Array& x, const Array& y)
{
    return dot(x, y, std::make_index_sequence<n>{});
}
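
As an aside, for compilers without C++17 fold expressions, the recursion mentioned above could be written along these lines (my own sketch, not part of the original answer; the Dot helper and dot_recursive names are made up):

// Pre-C++17 alternative: compile-time recursion via a helper class template.
template<std::size_t i>
struct Dot
{
    static double apply(const Array& x, const Array& y)
    {
        return x[i - 1] * y[i - 1] + Dot<i - 1>::apply(x, y);
    }
};

template<>
struct Dot<0> // base case: empty sum
{
    static double apply(const Array&, const Array&) { return 0.0; }
};

double dot_recursive(const Array& x, const Array& y)
{
    return Dot<n>::apply(x, y);
}
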



Now let's take a look at your function



double test(const Array& b)
{
    const Array a{1.}; // = 1, 0, ...
    return dot(a, b);
}



With -ffast-math gcc 8.2 produces:



test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
ret


clang 6.0.0 goes along the same lines:



test(std::array<double, 12ul> const&): # @test(std::array<double, 12ul> const&)
movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
ret


For example, for



double test(const Array& b)
{
    const Array a{1., 1.}; // = 1, 1, 0...
    return dot(a, b);
}



we get



test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
ret


Addendum: Clang unrolls a plain for (std::size_t i = 0; i < n; ++i) ... loop without all these fold-expression tricks; gcc doesn't, and needs some help.





