Which type of data normalizing should be used with KNN?

I know that there are more than two types of normalization.

For example:

1- Transforming data using a z-score or t-score. This is usually called standardization.

2- Rescaling data to have values between 0 and 1.

My question is: if I need to normalize my data, which type of normalization should be used with KNN, and why?
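
For reference, the two transformations listed above are commonly written as follows. This is only a sketch of the usual definitions: $x$ is a single feature value, and $\mu$, $\sigma$, $x_{\min}$ and $x_{\max}$ (notation introduced here, not in the original post) are that feature's mean, standard deviation, minimum and maximum over the data set.

$$
z = \frac{x - \mu}{\sigma}
\qquad \text{(standardization, z-score)}
$$

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\qquad \text{(rescaling to } [0, 1] \text{)}
$$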







edited Aug 25 at 10:16
asked Aug 25 at 8:59
jeza

1 Answer

For k-NN, I'd suggest normalizing the data between $0$ and $1$.

k-NN commonly uses the Euclidean distance as its means of comparing examples. The distance between two points $x_1 = (f_1^1, f_1^2, \dots, f_1^M)$ and $x_2 = (f_2^1, f_2^2, \dots, f_2^M)$, where $f_1^i$ is the value of the $i$-th feature of $x_1$, is:



$$
d(x_1, x_2) = \sqrt{(f_1^1 - f_2^1)^2 + (f_1^2 - f_2^2)^2 + \dots + (f_1^M - f_2^M)^2}
$$
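
As a minimal sketch, the distance above can be computed in a few lines of Python. The function name `euclidean_distance` and the two sample points are illustrative assumptions, not part of the original answer.

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x1) != len(x2):
        raise ValueError("points must have the same number of features")
    # Sum the squared per-feature differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Two hypothetical points with M = 2 features each.
d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
print(d)  # 5.0
```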



For all of the features to be of equal importance when calculating the distance, they must have the same range of values. Min-max normalization achieves this directly.



Suppose the features were not normalized and, for instance, feature $f^1$ had values in $[0, 1)$ while $f^2$ had values in $[1, 10)$. When calculating the distance, differences in the second feature could be roughly $10$ times larger than differences in the first (and squaring widens the gap further), leading k-NN to rely more on the second feature than on the first. Normalization ensures that all features are mapped to the same range of values.
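
The effect can be illustrated with a small sketch. The data points, the helper names (`min_max_scale`, `nearest`) and the labels `A` and `B` are all made up for illustration; the only thing taken from the text is that $f^1$ lies in $[0, 1)$ and $f^2$ in $[1, 10)$. On the raw data the wider feature decides the nearest neighbour, and after rescaling both features to $[0, 1]$ the answer changes.

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def min_max_scale(rows):
    """Rescale every column of `rows` to [0, 1] using that column's min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest(query, candidates):
    """Name of the candidate closest to `query`."""
    return min(candidates, key=lambda name: euclidean_distance(query, candidates[name]))

# Hypothetical data: f1 takes values in [0, 1), f2 takes values in [1, 10).
rows = [
    [0.10, 5.0],  # query
    [0.90, 5.5],  # A: far from the query in f1, close in f2
    [0.15, 7.0],  # B: close to the query in f1, farther in f2
    [0.00, 1.0],  # extra points that pin down each feature's range
    [0.90, 9.9],
]

raw_nearest = nearest(rows[0], {"A": rows[1], "B": rows[2]})

scaled = min_max_scale(rows)
scaled_nearest = nearest(scaled[0], {"A": scaled[1], "B": scaled[2]})

print("nearest before scaling:", raw_nearest)     # A
print("nearest after scaling: ", scaled_nearest)  # B
```

Which neighbour is "right" depends on the problem; the sketch only shows that the choice of scaling can change the result.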



Standardization, on the other hand, has many useful properties but can't ensure that the features are mapped to the same range. While standardization may be best suited for other classifiers, this is not the case for k-NN or any other distance-based classifier.
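
The point about ranges can be checked with a short sketch using only the Python standard library. The two feature columns `f1` and `f2` are invented for illustration, and the helper names are assumptions. Min-max scaling gives both columns a range of exactly $1$, while z-scoring leaves them with different ranges because the second column is skewed.

```python
import statistics

def min_max(values):
    """Rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def value_range(values):
    """Width of the interval covered by the values."""
    return max(values) - min(values)

# Hypothetical feature columns: f1 is evenly spread, f2 is skewed by one large value.
f1 = [0.0, 1.0, 2.0, 3.0, 4.0]
f2 = [0.0, 0.0, 0.0, 0.0, 10.0]

for name, column in (("f1", f1), ("f2", f2)):
    print(
        name,
        "min-max range:", value_range(min_max(column)),
        "z-score range:", round(value_range(z_score(column)), 3),
    )
```

This only demonstrates that the ranges differ after standardization; it does not by itself show which scaling gives better k-NN accuracy on a given data set.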







answered Aug 25 at 11:40
Djib2011

Would your answer be the same if I used a different distance instead of the Euclidean distance (for example the Manhattan distance, or even a fractional distance)? And what if the ranges of the variables are already close to each other?
– jeza
Aug 25 at 12:33







Yes, I just showed the Euclidean distance as an example, but all distance metrics suffer from the same thing. If the ranges are close to one another, the effect on the metric is smaller, but it is still there. For example, if $f^1 \in [0, 1)$ and $f^2 \in [0, 1.2)$, $f^2$ would still be $20\%$ more important than $f^1$. One thing I forgot to mention is that standardizing, obviously, is much better than not performing any feature scaling; it is simply worse than normalization.
– Djib2011
Aug 25 at 13:07











Ah, I see. "it is simply worse than normalization"!?
– jeza
Aug 25 at 13:24











