Can one give an example(s) of when non-nested AIC model comparison is not useful for model selection?













Note: The question here is not the same as this one. Indeed, as an answer to that question the answer below was closed as unrelated, together with the suggestion (credit @gung) to ask a separate question.



Background: @JonesBC writes "Akaike himself thought that AIC was useful for comparing non-nested models." Moreover, @DavidJohnson writes "The derivation of AIC as an estimator of Kullback-Leibler information loss makes no assumptions of models being nested."



The basic assumption here is that non-nested models can be compared by AIC, and that is not really the case unless many other, usually ignored, conditions are met. It is not necessary to specify every condition that must hold for AIC model selection to be useful. It is sufficient to post a counterexample, and that is the question here.



Question: What is non-nesting and what would a counterexample of not useful non-nested AIC comparison look like?










      aic model-comparison nested-models
















      asked 1 hour ago









      Carl





















          1 Answer













          Models are nested when every model tested can be derived by eliminating (or fixing) parameters of a single parent model. Models are non-nested when their parameters do not stand in such a set-and-subset relation.



          I suspect that AIC is more limited than its more optimistic proponents suggest. First, please consider that goodness of fit is not always a useful regression target. For example, suppose that we want to model how a drug is eliminated from blood plasma. Almost to the exclusion of other formulas, the pharmaceutical industry and the FDA would recommend using a sum of exponential terms expression. Sums of exponential terms (SET) functions are ubiquitous and often given as $C_{\text{SET}}(t)=\sum_{i=1}^n c_i \,e^{-\lambda_i\,t}$. Their density functions have not received much attention, as the industry and the FDA are blissfully unaware of statistical considerations:



          \begin{equation}\label{eq:SET}
          \text{ED}_n(t;\lambda_{1,2,3\ldots n})=\sum_{i=1}^n\lambda_i \,p_i \,e^{-\lambda_i\,t},
          \end{equation}



          where $\sum_{i=1}^n p_i= 1$, the $\lambda_i$ are decay coefficients, the term constants of SET functions relate to the ED$_n$ scale parameters as $c_i=\kappa \lambda_i p_i$, and concentration is $C_{\text{SET}}(t)=\kappa\, \text{ED}_n$, where $\kappa=\text{AUC}(\text{SET})_0^\infty$. Note that ED$_n$ and SET only have scale parameters. Unfortunately, there are no shape parameters to aid in fitting disposition curves, the fitting of curve shapes, and extrapolation. There is no location parameter. It is typical in the industry to choose $n=2$ for the above equation, yielding $\text{ED}_2(t;\lambda_1,\lambda_2)=\lambda_1 \,p_1 \,e^{-\lambda_1\,t}+\lambda_2 \,p_2 \,e^{-\lambda_2\,t}$, where $p_1+p_2=1$.
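The relation $c_i=\kappa \lambda_i p_i$ between SET coefficients and ED$_n$ weights can be checked numerically. The following is only a minimal sketch: the function names and the biexponential coefficients are hypothetical and chosen for illustration, not taken from any data set.

```python
import math

def ed_density(t, lambdas, weights):
    """ED_n density: sum_i lambda_i * p_i * exp(-lambda_i * t), with sum(p_i) == 1."""
    return sum(l * p * math.exp(-l * t) for l, p in zip(lambdas, weights))

def set_concentration(t, lambdas, coeffs):
    """SET concentration: sum_i c_i * exp(-lambda_i * t)."""
    return sum(c * math.exp(-l * t) for l, c in zip(lambdas, coeffs))

def set_to_ed(lambdas, coeffs):
    """Recover (kappa, p) from SET coefficients using c_i = kappa * lambda_i * p_i."""
    # kappa is the area under the SET curve from 0 to infinity: sum_i c_i / lambda_i
    kappa = sum(c / l for l, c in zip(lambdas, coeffs))
    weights = [c / (kappa * l) for l, c in zip(lambdas, coeffs)]
    return kappa, weights

# Hypothetical biexponential (n = 2); the numbers are illustrative only.
lambdas = [1.5, 0.1]
coeffs = [30.0, 4.0]
kappa, weights = set_to_ed(lambdas, coeffs)

# The weights sum to 1 and C_SET(t) equals kappa * ED_2(t) at any t.
t = 2.0
difference = set_concentration(t, lambdas, coeffs) - kappa * ed_density(t, lambdas, weights)
```

Here `kappa` is the AUC of the SET curve, and `difference` is zero up to rounding, which is all the rescaling from concentration to density amounts to.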



          This merited a comment by @whuber: "... It is indeed possible to parameterize the ED$_2$ family with a scale parameter and two shape parameters; for instance, one could take $\lambda_1$ as an inverse scale parameter, leaving $\lambda_2>\lambda_1$ and $p_1$ as shape parameters. I cannot see any connection whatsoever between such considerations and derivatives unless "derivative" means something unusual in this context...." The response to which was "There is indeed an inefficient mutability of shape for an ED$_2$. Using multiple parameters to emulate a shape parameter is inefficient in the sense that the full range of shapes that a solitary shape parameter offers is not properly rendered.... One can reduce the error of fitting to zero by using a sufficient number of parameters... Consider overfitting a curve with a polynomial. One can reduce the error of fitting to zero by using a sufficient number of parameters. However, unless the physics of the problem is coincidentally an exact polynomial shape, that perfect goodness of fit is meaningless in terms of extrapolation, and the fit may be "wiggly" between the samples fit. That is, overfitting does not tell us what a good model is, and if one does not consider what the slope is between or among samples, the model itself may have achieved a pyrrhic goodness of fit."



          Indeed, ED$_2$ is inflexible enough that exact solutions for four time-samples sometimes lie in the complex field, i.e., they are not real and not physical. In one study of 413 subjects, eight results (1.9%) with four time-sample solutions had unphysical exponential coefficients.



          Now let us consider a non-nested model with respect to that latter equation. The gamma distribution (GD) is given by



          \begin{equation}\label{eq:GD}
          \text{GD}(t; a,b) = \dfrac{1}{t}\;\dfrac{e^{-b \, t}(b \, t)^{\,a}}{\Gamma (a)}, \hspace{2em} t\geq 0,
          \end{equation}



          where the gamma function satisfies $\Gamma (a)=\int _0^\infty e^{-t} t^{a-1}dt$. The GD is an ED when $a=1$. The GD also has a ($+\infty$) discontinuity at $t=0$ when $0<a<1$. However, that discontinuity is integrable $\left(\mathbb{R}_{\geq 0}\right)$. The gamma distribution has a rate parameter $b$, whereas $\frac{1}{b}$ is the scale parameter. The shape parameter for the GD is $a$. The shape parameter aids in fitting disposition curves and their shapes. There is no location parameter.
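Two of the properties just stated, that the GD reduces to an exponential density when $a=1$ and that it grows without bound as $t\to 0$ when $0<a<1$, can be checked with a few lines. This is a minimal sketch using only the standard library; the function names and parameter values are hypothetical.

```python
import math

def gd_density(t, a, b):
    """Gamma density written as (1/t) * exp(-b*t) * (b*t)**a / Gamma(a), for t > 0."""
    # Work on the log scale so that small t and non-integer a stay numerically stable.
    return math.exp(-b * t + a * math.log(b * t) - math.lgamma(a)) / t

def exp_density(t, b):
    """Single-term exponential density b * exp(-b*t), i.e. ED_1."""
    return b * math.exp(-b * t)

t, b = 2.0, 0.3  # illustrative values only

# With shape a = 1 the GD coincides with the exponential density.
reduces_to_ed = abs(gd_density(t, 1.0, b) - exp_density(t, b))

# With 0 < a < 1 the density keeps growing as t approaches 0 from above.
near_zero = [gd_density(eps, 0.5, b) for eps in (1e-2, 1e-4, 1e-6)]
```

The values in `near_zero` increase as the evaluation point approaches zero, while `reduces_to_ed` is zero up to rounding.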



          The GD and ED$_{n\geq2}$ are not nested in either direction, because one cannot reduce either model to the other by choosing particular values for its parameters. Now, let us cut to the chase. The gamma distribution has a shape parameter, and SET formulas do not. As a consequence, SET formulas do not fit the derivative of blood plasma concentration, whereas gamma distributions, or their convolutions, do in fact fit the derivatives. In the case of drug persistence in the body, without proper derivative fitting, there is no hope of predicting future plasma concentration of drugs using SET heuristics, whereas fitting of derivatives may permit more exact extrapolation. When one plots SET derivatives from actual data fits, the result is a wiggly curve, with one bump for each exponential term, which is pathognomonic for overfitting.



          AIC only compares departures from goodness of fit to the data themselves, and says nothing about how well derivatives are fit. B-spline fitting is an example of fitting both data and an arbitrary number of data derivatives. In the case of drug persistence in the body, AIC is insufficient as a fit criterion, because the useful model will fit not only the data but also the shape of the data. An AIC comparison of the non-nested SET and GD models is therefore not sufficient to characterize their respective utility, at least as far as drug concentration in blood plasma is concerned. For further comparison concerning this see this paper. When that paper was under review, one of the reviewers requested an AIC comparison. The authors compared correlation coefficients between the models and the data instead of using AIC, and even that was irrelevant.
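The point that AIC sees only the residuals at the sampled times can be illustrated with a deliberately contrived sketch. It assumes the common least-squares form of AIC for Gaussian errors, $n\ln(\text{RSS}/n)+2k$ up to an additive constant. The two curves, the noise values, and the parameter count `k=3` are all hypothetical; `curve_b` is constructed to agree with `curve_a` at the integer sample times while having a very different slope there, and `k` is simply set equal for both so that only the residuals matter.

```python
import math

def aic_least_squares(observed, fitted, k):
    """AIC for Gaussian errors, up to an additive constant: n*ln(RSS/n) + 2k."""
    n = len(observed)
    rss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return n * math.log(rss / n) + 2 * k

def curve_a(t):
    """Hypothetical smooth decay."""
    return 10.0 * math.exp(-0.5 * t)

def curve_b(t):
    """Same values as curve_a at integer t, but a different slope there."""
    return curve_a(t) + 0.8 * math.sin(math.pi * t)

def slope(f, t, h=1e-6):
    """Central-difference estimate of the first derivative of f at t."""
    return (f(t + h) - f(t - h)) / (2 * h)

# Illustrative sample times and fixed, made-up measurement errors.
times = [1, 2, 3, 4, 5, 6]
noise = [0.2, -0.1, 0.15, -0.05, 0.1, -0.2]
observed = [curve_a(t) + e for t, e in zip(times, noise)]

aic_a = aic_least_squares(observed, [curve_a(t) for t in times], k=3)
aic_b = aic_least_squares(observed, [curve_b(t) for t in times], k=3)
slope_gap = abs(slope(curve_b, 2.0) - slope(curve_a, 2.0))
```

In this toy setting `aic_a` and `aic_b` agree to rounding error, while `slope_gap` is about $0.8\pi$, so the criterion cannot distinguish two curves whose derivatives differ substantially at the samples.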



          In summary, how one compares models is context dependent. That is, the appropriate goal of modelling is often not simple goodness of fit. However, it has become, IMHO, an all too frequent reflex to think that AIC is relevant to any and all circumstances, without considering what those circumstances are. The first step in modelling should be, but seldom is, to identify what the goal or goals of modelling are, and to choose only those procedures that are appropriate to those goals.
























          • 2




            In addition to a well-thought-out question and answer (+2), I've learned a new word (pathognomonic)!
            – jbowman
            1 hour ago










          • @jbowman Many thanks for your inspiration and help! Indeed, when residuals are significantly structured, they are pathognomonic for overfitting. For example, the Student's t probabilities for an ED$_2$ fit series oscillated significantly above and below the fit values; see Table 5 in this.
            – Carl
            38 mins ago










          • @jbowman That is, the t-statistics oscillated above and below the fit values, and were significant both as individual errors for each time-sample category and collectively as Chi-squared probabilities for all time-sample groups of the population.
            – Carl
            12 mins ago






          • 1




            Thanks for this. I think this is a better approach than trying to shoehorn your argument into a superficially related thread.
            – gung♦
            11 mins ago










          • @gung It is I who am in your debt for the goodness of your suggestion. I had difficulty suspending disbelief concerning the prior question, which is technically an aside, as contrasted to an answer. So be it.
            – Carl
            5 mins ago










          Your Answer




          StackExchange.ifUsing("editor", function ()
          return StackExchange.using("mathjaxEditing", function ()
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          );
          );
          , "mathjax-editing");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "65"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          convertImagesToLinks: false,
          noModals: false,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













           

          draft saved


          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f369850%2fcan-one-give-an-examples-of-when-non-nested-aic-model-comparison-is-not-useful%23new-answer', 'question_page');

          );

          Post as a guest






























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          3
          down vote













          Nesting is when all of the models tested can be derived by eliminating parameters from a parent model. Non-nesting is when the models contain parameters that are not in a set with subset(s) format.



          I suspect that AIC is more limited than its more optimistic proponents suggest. First, please consider that goodness of fit is not always a useful regression target. For example, suppose that we want to model how a drug is eliminated from blood plasma. Almost to the exclusion of other formulas, the pharmaceutical industry and the FDA would recommend using a sum of exponential terms expression. Sums of exponential terms (SET) functions are ubiquitous and often given as $C_textSET(t)=sum_i=1^n c_i ,e^-lambda_i,t$. Their density functions have not received much attention, as the industry and the FDA are blissfully unaware of statistical considerations:



          beginequationlabeleq:SET
          textED_n(t;lambda_1,2,3...n)=sum_i=1^nlambda_i ,p_i ,e^-lambda_i,t,
          endequation



          where $sum_i=1^n p_i= 1$, the $lambda_i$ are decay coefficients, the term constants of SET functions relate to the ED$_n$ scale parameters as $c_i=kappa lambda_i p_i$, and concentration is $C_textSET(t)=kappa, textED_n$, where $kappa=textAUC(textSET)_0^infty$. Note that ED$_n$ and SET only have scale parameters. Unfortunately, there are no shape parameters to aid in fitting disposition curves, the fitting of curve shapes, and extrapolation. There is no location parameter. It is typical, in the industry to choose $n=2$ for the above equation yielding $textED_2(t;lambda_1,lambda_2)=lambda_1 ,p_1 ,e^-lambda_1,t+lambda_2 ,p_2 ,e^-lambda_2,t$, where $p_1+p_2=1$.



          This merited a comment by @whuber "... It is indeed possible to parameterize the ED$_2$ family with a scale parameter and two shape parameters; for instance, one could take $lambda_1$ as an inverse scale parameter, leaving $lambda_2>lambda_1$ and $p_1$ as shape parameters. I cannot see any connection whatsoever between such considerations and derivatives unless "derivative" means something unusual in this context...." The response to which was "There is indeed an inefficient mutability of shape for an ED$_2$. Using multiple parameters to emulate a shape parameter is inefficient in the sense that the full range of shapes that a solitary shape parameter offers is not properly rendered.... One can reduce the error of fitting to zero by using a sufficient number of parameters... Consider overfitting a curve with a polynomial. One can reduce the error of fitting to zero by using a sufficient number of parameters. However, unless the physics of the problem is coincidentally an exact polynomial shape, that perfect goodness of fit is meaningless in terms of extrapolation, and the fit may be "wiggly" between the samples fit. That is, overfitting does not tell us what a good model is, and if one does not consider what the slope is between or among samples, the model itself may have achieved a pyrrhic goodness of fit."



          Indeed, ED$_2$ is inflexible enough that exact solutions for four time-samples are sometimes complex field, i.e., not real and not physical. In one study of 413 subjects, eight results (1.9%) with four time-samples solutions had unphysical exponential coefficients



          Now let us consider a non-nested model with respect to that latter equation. The gamma distribution (GD) is given by



          beginequationlabeleq:GD
          textGD(t; a,b) =
          ,dfrac1t;dfrace^-b , t(b , t)^,a Gamma (a) ;; ;;hspace2emtgeq 0 ;; ;;\
          ,
          %tabularnewline
          endequation



          where the gamma function satisfies $Gamma (a)=int _0^infty e^-t t^a-1dt$. The GD is an ED when $a=1$. The GD also has a ($+infty$) discontinuity at $t=0$ when $0<a<1$. However, that discontinuity is integrable $left(mathbbR_geq 0right)$. The gamma distribution has a rate parameter $b$, whereas $frac1b$ is the scale parameter. The shape parameter for the GD is $a$. The shape parameter aids in fitting disposition curves and their shapes. There is no location parameter.



          The GD and ED$_ngeq2$ are not respectively nested because one cannot reduce either model to be equal to the other by choosing particular values for their parameters. Now, let us cut to the chase. The gamma distribution has a shape parameter, and SET formulas do not. As a consequence, SET formulas do not fit the derivative of blood plasma concentration, because they lack shape parameters, and gamma distributions, or their convolutions, do in fact fit the derivatives. In the case of drug persistence in the body, without proper derivative fitting, there is no hope of predicting future plasma concentration of drugs using SET heuristics, whereas fitting of derivatives may permit more exact extrapolation.
          When one plots SET derivatives from actual data fits, the result is a wiggly curve, with one bump for each exponential term, which is pathognomonic for overfitting.



          AIC use only compares departures from goodness of fit of the data itself, and says nothing about how well derivatives are fit. B-spline fitting is an example of fitting both data and an arbitrary number of data derivatives. In the case of drug persistence in the body, AIC is insufficient as a fit criterion, as the useful model will fit not only the data, but also the shape of the data, and an AIC comparison of the non-nested SET and GD models is not sufficient to characterize their respective utility, at least as far as drug concentration in blood plasma is concerned. For further comparison concerning this see this paper. When that paper was under review, one of the reviewers requested an AIC comparison. The authors compared correlation coefficients between the models and the data instead of using AIC, and even that was irrelevant.



          In summary, how one compares models is context dependent. That is, the appropriate goal of modelling is often not simple goodness-of-fit. However, it has become, IMHO, an all too frequent a reflex to think that AIC is relevant to any and all circumstances, and that without considering what those circumstances are. The first step in modelling should be, but seldom is, to identify what the goal or goals of modelling is, and to choose only those procedures that are appropriate to those goal(s).






          share|cite|improve this answer


















          • 2




            In addition to a well-thought-out question and answer (+2), I've learned a new word (pathognomonic)!
            – jbowman
            1 hour ago










          • @jbowman Many thanks for your inspiration and help! Indeed when residuals are significantly structured, they are pathognomonic for overfitting. For an example the Student’s-t probabilities for an ED$_2$ fit series significantly oscillated above and below the fit values, see Table 5 in this.
            – Carl
            38 mins ago










          • @jbowman That is, the t-statistics oscillated above and below the fit values, and were significant both as individual errors for each time-sample category and collectively as Chi-squared probabilities for all time-sample groups of the population.
            – Carl
            12 mins ago






          • 1




            Thanks for this. I think this is a better approach than trying to shoehorn your argument into a superficially related thread.
            – gung♦
            11 mins ago










          • @gung It is I who am in your debt for the goodness of your suggestion. I had difficulty suspending disbelief concerning the prior question, which is technically an aside, as contrasted to an answer. So be it.
            – Carl
            5 mins ago














          up vote
          3
          down vote













          Nesting is when all of the models tested can be derived by eliminating parameters from a parent model. Non-nesting is when the models contain parameters that are not in a set with subset(s) format.



          I suspect that AIC is more limited than its more optimistic proponents suggest. First, please consider that goodness of fit is not always a useful regression target. For example, suppose that we want to model how a drug is eliminated from blood plasma. Almost to the exclusion of other formulas, the pharmaceutical industry and the FDA would recommend using a sum of exponential terms expression. Sums of exponential terms (SET) functions are ubiquitous and often given as $C_textSET(t)=sum_i=1^n c_i ,e^-lambda_i,t$. Their density functions have not received much attention, as the industry and the FDA are blissfully unaware of statistical considerations:



          beginequationlabeleq:SET
          textED_n(t;lambda_1,2,3...n)=sum_i=1^nlambda_i ,p_i ,e^-lambda_i,t,
          endequation



          where $sum_i=1^n p_i= 1$, the $lambda_i$ are decay coefficients, the term constants of SET functions relate to the ED$_n$ scale parameters as $c_i=kappa lambda_i p_i$, and concentration is $C_textSET(t)=kappa, textED_n$, where $kappa=textAUC(textSET)_0^infty$. Note that ED$_n$ and SET only have scale parameters. Unfortunately, there are no shape parameters to aid in fitting disposition curves, the fitting of curve shapes, and extrapolation. There is no location parameter. It is typical, in the industry to choose $n=2$ for the above equation yielding $textED_2(t;lambda_1,lambda_2)=lambda_1 ,p_1 ,e^-lambda_1,t+lambda_2 ,p_2 ,e^-lambda_2,t$, where $p_1+p_2=1$.



          This merited a comment by @whuber "... It is indeed possible to parameterize the ED$_2$ family with a scale parameter and two shape parameters; for instance, one could take $lambda_1$ as an inverse scale parameter, leaving $lambda_2>lambda_1$ and $p_1$ as shape parameters. I cannot see any connection whatsoever between such considerations and derivatives unless "derivative" means something unusual in this context...." The response to which was "There is indeed an inefficient mutability of shape for an ED$_2$. Using multiple parameters to emulate a shape parameter is inefficient in the sense that the full range of shapes that a solitary shape parameter offers is not properly rendered.... One can reduce the error of fitting to zero by using a sufficient number of parameters... Consider overfitting a curve with a polynomial. One can reduce the error of fitting to zero by using a sufficient number of parameters. However, unless the physics of the problem is coincidentally an exact polynomial shape, that perfect goodness of fit is meaningless in terms of extrapolation, and the fit may be "wiggly" between the samples fit. That is, overfitting does not tell us what a good model is, and if one does not consider what the slope is between or among samples, the model itself may have achieved a pyrrhic goodness of fit."



          Indeed, ED$_2$ is inflexible enough that exact solutions for four time-samples are sometimes complex field, i.e., not real and not physical. In one study of 413 subjects, eight results (1.9%) with four time-samples solutions had unphysical exponential coefficients



          Now let us consider a non-nested model with respect to that latter equation. The gamma distribution (GD) is given by



          beginequationlabeleq:GD
          textGD(t; a,b) =
          ,dfrac1t;dfrace^-b , t(b , t)^,a Gamma (a) ;; ;;hspace2emtgeq 0 ;; ;;\
          ,
          %tabularnewline
          endequation



          where the gamma function satisfies $Gamma (a)=int _0^infty e^-t t^a-1dt$. The GD is an ED when $a=1$. The GD also has a ($+infty$) discontinuity at $t=0$ when $0<a<1$. However, that discontinuity is integrable $left(mathbbR_geq 0right)$. The gamma distribution has a rate parameter $b$, whereas $frac1b$ is the scale parameter. The shape parameter for the GD is $a$. The shape parameter aids in fitting disposition curves and their shapes. There is no location parameter.



          The GD and ED$_ngeq2$ are not respectively nested because one cannot reduce either model to be equal to the other by choosing particular values for their parameters. Now, let us cut to the chase. The gamma distribution has a shape parameter, and SET formulas do not. As a consequence, SET formulas do not fit the derivative of blood plasma concentration, because they lack shape parameters, and gamma distributions, or their convolutions, do in fact fit the derivatives. In the case of drug persistence in the body, without proper derivative fitting, there is no hope of predicting future plasma concentration of drugs using SET heuristics, whereas fitting of derivatives may permit more exact extrapolation.
          When one plots SET derivatives from actual data fits, the result is a wiggly curve, with one bump for each exponential term, which is pathognomonic for overfitting.



          AIC use only compares departures from goodness of fit of the data itself, and says nothing about how well derivatives are fit. B-spline fitting is an example of fitting both data and an arbitrary number of data derivatives. In the case of drug persistence in the body, AIC is insufficient as a fit criterion, as the useful model will fit not only the data, but also the shape of the data, and an AIC comparison of the non-nested SET and GD models is not sufficient to characterize their respective utility, at least as far as drug concentration in blood plasma is concerned. For further comparison concerning this see this paper. When that paper was under review, one of the reviewers requested an AIC comparison. The authors compared correlation coefficients between the models and the data instead of using AIC, and even that was irrelevant.



          In summary, how one compares models is context dependent. That is, the appropriate goal of modelling is often not simple goodness-of-fit. However, it has become, IMHO, an all too frequent a reflex to think that AIC is relevant to any and all circumstances, and that without considering what those circumstances are. The first step in modelling should be, but seldom is, to identify what the goal or goals of modelling is, and to choose only those procedures that are appropriate to those goal(s).






          share|cite|improve this answer


















          • 2




            In addition to a well-thought-out question and answer (+2), I've learned a new word (pathognomonic)!
            – jbowman
            1 hour ago










          • @jbowman Many thanks for your inspiration and help! Indeed when residuals are significantly structured, they are pathognomonic for overfitting. For an example the Student’s-t probabilities for an ED$_2$ fit series significantly oscillated above and below the fit values, see Table 5 in this.
            – Carl
            38 mins ago










          • @jbowman That is, the t-statistics oscillated above and below the fit values, and were significant both as individual errors for each time-sample category and collectively as Chi-squared probabilities for all time-sample groups of the population.
            – Carl
            12 mins ago






          • 1




            Thanks for this. I think this is a better approach than trying to shoehorn your argument into a superficially related thread.
            – gung♦
            11 mins ago










          • @gung It is I who am in your debt for the goodness of your suggestion. I had difficulty suspending disbelief concerning the prior question, which is technically an aside, as contrasted to an answer. So be it.
            – Carl
            5 mins ago












          up vote
          3
          down vote










          up vote
          3
          down vote









          Nesting is when all of the models tested can be derived by eliminating parameters from a parent model. Non-nesting is when the models contain parameters that are not in a set with subset(s) format.



          I suspect that AIC is more limited than its more optimistic proponents suggest. First, please consider that goodness of fit is not always a useful regression target. For example, suppose that we want to model how a drug is eliminated from blood plasma. Almost to the exclusion of other formulas, the pharmaceutical industry and the FDA would recommend using a sum of exponential terms expression. Sums of exponential terms (SET) functions are ubiquitous and often given as $C_textSET(t)=sum_i=1^n c_i ,e^-lambda_i,t$. Their density functions have not received much attention, as the industry and the FDA are blissfully unaware of statistical considerations:



          beginequationlabeleq:SET
          textED_n(t;lambda_1,2,3...n)=sum_i=1^nlambda_i ,p_i ,e^-lambda_i,t,
          endequation



          where $sum_i=1^n p_i= 1$, the $lambda_i$ are decay coefficients, the term constants of SET functions relate to the ED$_n$ scale parameters as $c_i=kappa lambda_i p_i$, and concentration is $C_textSET(t)=kappa, textED_n$, where $kappa=textAUC(textSET)_0^infty$. Note that ED$_n$ and SET only have scale parameters. Unfortunately, there are no shape parameters to aid in fitting disposition curves, the fitting of curve shapes, and extrapolation. There is no location parameter. It is typical, in the industry to choose $n=2$ for the above equation yielding $textED_2(t;lambda_1,lambda_2)=lambda_1 ,p_1 ,e^-lambda_1,t+lambda_2 ,p_2 ,e^-lambda_2,t$, where $p_1+p_2=1$.



          This merited a comment by @whuber: "... It is indeed possible to parameterize the ED$_2$ family with a scale parameter and two shape parameters; for instance, one could take $\lambda_1$ as an inverse scale parameter, leaving $\lambda_2>\lambda_1$ and $p_1$ as shape parameters. I cannot see any connection whatsoever between such considerations and derivatives unless "derivative" means something unusual in this context...." The response to which was "There is indeed an inefficient mutability of shape for an ED$_2$. Using multiple parameters to emulate a shape parameter is inefficient in the sense that the full range of shapes that a solitary shape parameter offers is not properly rendered.... One can reduce the error of fitting to zero by using a sufficient number of parameters... Consider overfitting a curve with a polynomial. One can reduce the error of fitting to zero by using a sufficient number of parameters. However, unless the physics of the problem is coincidentally an exact polynomial shape, that perfect goodness of fit is meaningless in terms of extrapolation, and the fit may be "wiggly" between the samples fit. That is, overfitting does not tell us what a good model is, and if one does not consider what the slope is between or among samples, the model itself may have achieved a pyrrhic goodness of fit."



          Indeed, ED$_2$ is inflexible enough that exact solutions for four time-samples sometimes lie in the complex field, i.e., they are not real and not physical. In one study of 413 subjects, eight of the four-time-sample solutions (1.9%) had unphysical exponential coefficients.



          Now let us consider a non-nested model with respect to that latter equation. The gamma distribution (GD) is given by



          \begin{equation}\label{eq:GD}
          \text{GD}(t; a,b) = \dfrac{1}{t}\;\dfrac{e^{-b\,t}(b\,t)^{\,a}}{\Gamma(a)}\;;\hspace{2em}t\geq 0,
          \end{equation}



          where the gamma function satisfies $\Gamma(a)=\int_0^\infty e^{-t}\,t^{a-1}\,dt$. The GD is an ED when $a=1$. The GD also has a ($+\infty$) discontinuity at $t=0$ when $0<a<1$. However, that discontinuity is integrable $\left(\mathbb{R}_{\geq 0}\right)$. The gamma distribution has a rate parameter $b$, whereas $\frac{1}{b}$ is the scale parameter. The shape parameter for the GD is $a$. The shape parameter aids in fitting disposition curves and their shapes. There is no location parameter.
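          The three properties just listed can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `gd_pdf` and `gd_mass` are hypothetical, the parameter values are arbitrary, and the integral uses a plain midpoint rule after the substitution $t=u^{1/a}$ (my choice, not part of the argument above), which removes the singularity at $t=0$ and is adequate only for $0<a\leq 1$ with $b$ of order one.

```python
import math

def gd_pdf(t, a, b):
    """GD(t; a, b) = (1/t) * exp(-b t) * (b t)^a / Gamma(a), for t > 0."""
    return math.exp(-b * t) * (b * t) ** a / (t * math.gamma(a))

def gd_mass(a, b, u_max=60.0, steps=200_000):
    """Midpoint-rule integral of GD over t >= 0 for 0 < a <= 1.

    With t = u**(1/a) we have t**(a-1) dt = du / a, so the integrand becomes
    the bounded function b**a * exp(-b * u**(1/a)) / (a * Gamma(a)).
    """
    h = u_max / steps
    scale = b ** a / (a * math.gamma(a))
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        total += math.exp(-b * u ** (1.0 / a))
    return scale * h * total

# a = 1: the GD collapses to the single exponential density b * exp(-b t).
print(gd_pdf(0.7, 1.0, 2.0), 2.0 * math.exp(-2.0 * 0.7))

# 0 < a < 1: the density grows without bound as t approaches 0 ...
print([gd_pdf(t, 0.5, 1.0) for t in (1.0, 1e-3, 1e-6)])

# ... yet the total probability mass is still (numerically) one.
print(gd_mass(0.5, 1.0))
```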



          The GD and ED$_{n\geq 2}$ are not nested with respect to each other because one cannot reduce either model to the other by choosing particular values for its parameters. Now, let us cut to the chase. The gamma distribution has a shape parameter, and SET formulas do not. As a consequence, SET formulas, lacking shape parameters, do not fit the derivative of blood plasma concentration, whereas gamma distributions, or their convolutions, do in fact fit the derivatives. In the case of drug persistence in the body, without proper derivative fitting there is no hope of predicting future plasma concentration of drugs using SET heuristics, whereas fitting of derivatives may permit more exact extrapolation.
          When one plots SET derivatives from actual data fits, the result is a wiggly curve, with one bump for each exponential term, which is pathognomonic for overfitting.



          AIC compares only departures from goodness of fit to the data itself, and says nothing about how well derivatives are fit. B-spline fitting is an example of fitting both data and an arbitrary number of data derivatives. In the case of drug persistence in the body, AIC is insufficient as a fit criterion, because the useful model will fit not only the data but also the shape of the data. An AIC comparison of the non-nested SET and GD models is therefore not sufficient to characterize their respective utility, at least as far as drug concentration in blood plasma is concerned. For further comparison, see this paper. When that paper was under review, one of the reviewers requested an AIC comparison. The authors compared correlation coefficients between the models and the data instead of using AIC, and even that was irrelevant.
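          To make concrete what AIC does and does not see, here is a minimal sketch of the common least-squares form of AIC. It assumes independent Gaussian errors, an assumption introduced here for illustration and not one made in the discussion above; under it, $\text{AIC}=n\ln(\text{RSS}/n)+2k$ up to an additive constant shared by all models fit to the same data. The function name and the residual values are hypothetical.

```python
import math

def aic_least_squares(residuals, k):
    """AIC = n * ln(RSS / n) + 2 * k, assuming i.i.d. Gaussian errors.

    residuals: data minus model at the sampled times only.
    k: number of fitted parameters (conventions differ on whether the
       error variance is counted; use one convention for all models).
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * k

# Hypothetical residuals at four sample times for two non-nested models.
# The two sets have the same squared sizes, so the same RSS.
resid_model_a = [0.10, -0.20, 0.05, 0.15]
resid_model_b = [-0.15, 0.05, 0.20, -0.10]

# With equal k, AIC cannot tell the two models apart, whatever their
# slopes are between or beyond the sampled times.
print(aic_least_squares(resid_model_a, k=3))
print(aic_least_squares(resid_model_b, k=3))
```

          Only the residuals at the sampled times and the parameter count enter the calculation, so nothing about the behaviour of the fitted curve's derivative between or beyond the samples can influence the ranking.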



          In summary, how one compares models is context dependent. That is, the appropriate goal of modelling is often not simple goodness of fit. However, it has become, IMHO, an all too frequent reflex to think that AIC is relevant to any and all circumstances, without considering what those circumstances are. The first step in modelling should be, but seldom is, to identify what the goal or goals of modelling are, and to choose only those procedures that are appropriate to those goal(s).














          edited 31 mins ago

























          answered 1 hour ago









          Carl








          • 2




            In addition to a well-thought-out question and answer (+2), I've learned a new word (pathognomonic)!
            – jbowman
            1 hour ago










          • @jbowman Many thanks for your inspiration and help! Indeed, when residuals are significantly structured, they are pathognomonic for overfitting. For example, the Student's t probabilities for an ED$_2$ fit series oscillated significantly above and below the fit values; see Table 5 in this.
            – Carl
            38 mins ago










          • @jbowman That is, the t-statistics oscillated above and below the fit values, and were significant both as individual errors for each time-sample category and collectively as Chi-squared probabilities for all time-sample groups of the population.
            – Carl
            12 mins ago






          • 1




            Thanks for this. I think this is a better approach than trying to shoehorn your argument into a superficially related thread.
            – gung♦
            11 mins ago










          • @gung It is I who am in your debt for the goodness of your suggestion. I had difficulty suspending disbelief concerning the prior question, which is technically an aside, as contrasted to an answer. So be it.
            – Carl
            5 mins ago





























           
