What is a good general regression technique for a problem with 50 independent variables?

I am new to data science and statistics. I came across a problem with 50 independent variables and one dependent variable, and I am trying to identify a good regression technique to start with. This is the workflow I followed:

Data Exploration -> Correlation matrix -> Dimension reduction -> PCA (Dimension reduction) -> Basic Linear Regression technique.

Can someone tell me whether there is a better technique or procedure?










Tags: regression, statistics, data-science-model






asked 6 hours ago by user86752

1 Answer
This is by no means an exhaustive answer, but it should give you a starting point in Python:

Data Exploration

Start with Pandas Profiling, which generates an HTML report of your variables. The report helps you judge data quality: it shows the fill rate of each variable and, depending on the variable type, some summary statistics.
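
If you want a quick look at the same kind of information without generating a full report, plain pandas is enough. The sketch below is only an illustration: the toy DataFrame and its column names (`x1`, `x2`, `y`) are made up, and "fill rate" is taken to mean the fraction of non-missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real 50-column dataset.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0],
    "x2": ["a", "b", "b", None],
    "y": [10.0, 12.5, 11.0, 15.0],
})

# Fill rate: fraction of non-missing values in each column.
fill_rate = df.notna().mean()

# Summary statistics; include="all" covers numeric and non-numeric columns.
summary = df.describe(include="all")

print(fill_rate)
print(summary)
```

Columns with a low fill rate are candidates for imputation or removal before any modelling.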



Correlation matrix

The Pandas Profiling report includes the correlation matrix. If you want to compute it yourself, call corr() on your DataFrame, e.g. df.corr(). Its method parameter selects the correlation metric: 'pearson', 'kendall' or 'spearman'.
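
A minimal sketch of computing the matrix by hand, assuming the predictors sit in a DataFrame called `df`. The data here is synthetic, and the 0.9 cut-off for "highly correlated" is an arbitrary choice for the illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is built to be almost collinear with x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Same call, different correlation metric.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Collect each pair of columns whose absolute Pearson correlation
# exceeds the (assumed) threshold, visiting every pair only once.
threshold = 0.9
high_pairs = [
    (a, b, pearson.loc[a, b])
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) > threshold
]
print(high_pairs)
```

With 50 variables the full matrix is hard to read, so a filtered list of pairs like this is usually easier to act on.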



Dimension Reduction -> PCA (Dimension reduction)

There are many ways to do this. Keep in mind that if you only care about predictive accuracy and not about how X influences y, step (1) is optional (and so is step (2)).

1. Analyse the correlation matrix and use the variance inflation factor (VIF) to drop highly correlated variables.
2. Use Factor Analysis / PCA for dimensionality reduction.
3. Fit a LASSO model and check the coefficients: variables whose coefficients are 0 or close to 0 can be treated as weak predictors and eliminated.
4. Keep all 50 variables and use Ridge Regression, varying the alpha parameter to fine-tune accuracy (or whatever metric you are trying to optimize).
5. If the model still doesn't seem stable, try engineering non-linear features with sklearn's PolynomialFeatures, regularize, and repeat.
6. Probably the most important step in the real world: ask a domain expert which variables they think are important.
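
Steps 1, 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not a recipe: the data is synthetic (10 predictors rather than 50, with one deliberately near-duplicated column), the alpha grid is arbitrary, and the VIF is computed with NumPy as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R_j^2) for each predictor j.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: column 1 is a near-duplicate of column 0,
# and only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Regularized models are scale-sensitive, so standardize first.
X_std = StandardScaler().fit_transform(X)

# Step 1: VIF per predictor from the inverse of the correlation matrix.
corr = np.corrcoef(X_std, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print("VIF:", np.round(vif, 1))

# Step 3: LASSO with alpha chosen by cross-validation;
# coefficients shrunk to (numerically) zero mark weak predictors.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
dropped = np.flatnonzero(np.isclose(lasso.coef_, 0.0))
print("LASSO alpha:", lasso.alpha_, "zeroed columns:", dropped)

# Step 4: keep every predictor and let cross-validation pick Ridge's alpha.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X_std, y)
print("Ridge alpha:", ridge.alpha_)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; in this toy data only the two near-duplicate columns should stand out.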

Basic Linear Regression technique

1. Tuning hyperparameters to get a good cross-validation/test score is the key here for a basic Linear Regression model.
2. Try as many techniques as you can from here and here
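
Plain linear regression has few hyperparameters of its own, so in the workflow from the question the main thing to tune is the number of principal components. Below is a minimal sketch of that, assuming scikit-learn; the data is synthetic, and the candidate grid and the 75/25 split are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 predictors, only the first 5 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
true_coefs = np.array([1.5, -2.0, 0.5, 1.0, -1.0])
y = X[:, :5] @ true_coefs + rng.normal(scale=0.5, size=400)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling and PCA live inside the pipeline so they are re-fit
# on each training fold, which avoids leaking validation data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Cross-validated search over the number of components to keep.
grid = {"pca__n_components": [5, 10, 20, 30, 50]}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated R^2:", search.best_score_)
print("held-out R^2:", search.score(X_test, y_test))
```

Swapping the final step for Ridge or Lasso and adding its alpha to the grid lets the same search compare the regularized alternatives mentioned earlier.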





answered 4 hours ago by Vivek Kalyanarangan
