Metrics to determine K in K-fold cross-validation

Consider a scenario where the dataset at hand is quite large, say 50,000 samples (quite well balanced between two classes). What metrics can be used to decide the value of K in K-fold cross-validation? In other words, is a 5-fold CV enough, or should I go for a 10-fold CV?

The rule of thumb is: the higher K, the better. But, putting the computational costs aside, what can be used to decide the value of K? Should we look at the overall performance, e.g. the average accuracy? That is, if accuracy(5-fold CV) ≈ accuracy(10-fold CV), can we opt for 5-fold CV? Is the standard deviation of the performance across folds important? That is, the lower the better?
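To make the comparison concrete, here is a minimal sketch of the kind of check I have in mind (scikit-learn is assumed; the synthetic data and the logistic-regression baseline are only placeholders for my real setup):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data: ~50,000 roughly balanced samples, two classes
    X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    for k in (5, 10):
        scores = cross_val_score(clf, X, y, cv=k, scoring="accuracy")
        print(f"{k}-fold CV: mean accuracy = {scores.mean():.4f}, "
              f"std across folds = {scores.std():.4f}")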










Tags: cross-validation · accuracy · performance






asked 1 hour ago by NCL




















3 Answers






























First of all, choosing K is basically a heuristic; it depends on the data and the model. In my opinion, 5 is a good choice most of the time: it doesn't require too much computation power or time, but you still need to try and see which value works better for your data. There is no free lunch!

I would also suggest another CV idea. For example, if you use 5-fold CV (without stratification and shuffling), you simply divide your data into 5 equal folds. "Equal" here only means that every fold has the same shape; each fold can still have a different distribution. So you can also choose your folds manually: plot the distribution of the target variable and try to keep the same pattern in every fold.

You can also select among models evaluated with different K based on a criterion, for example AIC.
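A minimal sketch of that distribution check, assuming scikit-learn (the synthetic, mildly imbalanced data is only a stand-in for your own X and y): compare the per-fold class balance of a plain KFold split with a StratifiedKFold split.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, StratifiedKFold

    # Stand-in data with a 60/40 class balance
    X, y = make_classification(n_samples=10_000, weights=[0.6, 0.4], random_state=0)

    for name, splitter in [("KFold", KFold(n_splits=5)),
                           ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
        # Fraction of the positive class in each validation fold
        ratios = [y[val_idx].mean() for _, val_idx in splitter.split(X, y)]
        print(name, np.round(ratios, 3))

With stratification the per-fold ratios should come out nearly identical; with plain KFold they can drift, which is exactly the pattern the answer suggests checking by eye.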






answered 53 mins ago by silverstone








































You should ask yourself: why are we even doing cross-validation? It's not to get better accuracy. You're trying to get a better estimate of the accuracy (or another metric) on unseen data; you want to know how well the model generalizes.

If you try to grid-search for the "best K", you're going to either waste some data or get a worse estimate of the metric.

Wasting data - you split your data into two sets, grid-search on one of them, and then do a cross-validation (with the "best K") on the second set. Don't do this.

Getting a worse estimate - you do a grid search for the "best K" and choose the one that gets you the best result according to your chosen metric. But now you have brought in information that you shouldn't have, and you are being too optimistic with your estimate. That's the exact opposite of what you wanted when you started with the cross-validation. Don't do this either.

So what should you do? Pick the largest K that makes sense for the problem you are trying to solve. Don't put the computational cost aside; the computational cost should determine K.
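One way to act on that last point - a rough sketch only, assuming scikit-learn and a placeholder model, where budget_seconds stands for whatever wall-clock time you can afford - is to time a single train/validate cycle and extrapolate, since the cost of K-fold CV grows roughly linearly with K:

    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
    # A 90/10 split approximates the (K-1)/K training share of one fold
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

    start = time.perf_counter()
    RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr).score(X_val, y_val)
    one_fold = time.perf_counter() - start  # seconds for one train/validate cycle

    budget_seconds = 600  # assumed budget; substitute your own
    # Crude guide only: it ignores that the per-fold training set shrinks for smaller K
    print("largest affordable K is roughly", int(budget_seconds // one_fold))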






answered 39 mins ago by ExabytE (new contributor)






































"The rule of thumb is the higher K, the better."

I think a better rule of thumb is: the larger your dataset, the less important $k$ is.

However, it is useful to have a general understanding of the impact of $k$ on the performance estimator (leaving aside computational costs):

• Increasing $k$ decreases the bias, because the training sets better represent the full data.

• Increasing $k$ increases the variance of the estimator, because the training sets become more and more similar to each other.

Also note that there is no unbiased estimator for the variance of the $k$-fold CV estimate. Together this means that there is no metric that can tell you the best $k$ if you leave computational costs aside. Some empirical studies suggest that 10 is a reasonable default.

And to be clear, $k$ is not a hyper-parameter you want to tune to find the best accuracy. If you start performing $k_2$-fold CV to find the best $k_1$, something should hopefully feel wrong.
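As a purely illustrative sketch (synthetic data, scikit-learn assumed), you can at least look at how the CV estimate itself scatters when the same $k$-fold procedure is repeated with different random splits. This only shows the spread over re-splits, not the true variance with respect to the data-generating process, which, as noted above, has no unbiased estimator:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    for k in (2, 5, 10, 20):
        estimates = []
        for seed in range(10):  # repeat the k-fold CV with different random splits
            cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
            estimates.append(cross_val_score(clf, X, y, cv=cv).mean())
        print(f"k={k:2d}  mean={np.mean(estimates):.4f}  spread={np.std(estimates):.4f}")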






answered 31 mins ago by oW_



















