Improving spam classification with TensorFlow logistic regression

I would like to classify an email (spam = 1 / ham = 0) using logistic regression. My implementation is similar to this implementation and uses TensorFlow.



An email is represented as a bag-of-words vector, with each entry counting how often a term appears in the email. The idea is to multiply that vector with a weight vector and use the sign function (equivalently, thresholding the sigmoid output at 0.5) to turn regression into classification: $$y_\text{predicted} = \sigma(x_i^T\theta),$$ with $\sigma(x) = \frac{1}{1 + e^{-x}}$. To calculate the loss, I am using the L2 loss (squared loss). Since I have a lot of training data, regularization does not seem necessary (training and testing accuracy are always very close). Still, I only get a maximum accuracy of about 90% (both training and testing). How can I improve this?
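Roughly, the setup looks like this (a minimal TensorFlow 1.x-style sketch with assumed names, sizes, and hyperparameters, not the exact code):

```python
import tensorflow as tf

n_features = 10000  # assumed vocabulary size of the bag-of-words vectors

x = tf.placeholder(tf.float32, shape=[None, n_features])  # term counts per mail
y = tf.placeholder(tf.float32, shape=[None, 1])           # 1 = spam, 0 = ham

theta = tf.Variable(tf.zeros([n_features, 1]))

y_predicted = tf.sigmoid(tf.matmul(x, theta))            # sigma(x_i^T theta)
prediction = tf.cast(y_predicted > 0.5, tf.float32)      # classify by thresholding

# Squared (L2) loss between the sigmoid output and the 0/1 label
loss = tf.reduce_mean(tf.square(y_predicted - y))

train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
```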



I already tried the following:



  • Use L1 and L2 regularization with different strengths (it does not seem necessary)


  • Use different learning rates


  • Use gradient descent, stochastic gradient descent, and mini-batch gradient descent (the hope is to avoid local minima in the loss function by introducing more variance with stochastic/mini-batch updates)


  • Create more training data using SMOTE, since the classes were imbalanced 80/20 spam/ham (see the sketch below this list)
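For the SMOTE step, a minimal sketch using the imbalanced-learn package (toy data standing in for the real bag-of-words matrix; names are placeholders):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced data standing in for the bag-of-words matrix (80/20 split)
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.8, 0.2], random_state=0)

# Oversample the minority class so both classes end up balanced
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(y.mean(), y_resampled.mean())  # fraction of class 1 before and after
```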


Things that I could still try:



  • Use a different loss function

Any other suggestions?







asked Aug 18 at 13:14 by User12547645, edited Aug 18 at 13:50 by Sycorax




















1 Answer

















Accepted answer (5 votes)










L2 loss for logistic regression is not convex, but the cross-entropy loss is. I'd recommend making the switch because convexity is a really nice property to have during optimization: with a convex loss you don't have to worry about bad local minima, since any local minimum is also a global minimum.
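For reference, the (mean) cross-entropy loss for this model is
$$L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \sigma(x_i^T\theta) + (1-y_i)\log\big(1-\sigma(x_i^T\theta)\big)\Big],$$
which is convex in $\theta$.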



A nice discussion of the mathematics comparing the convexity of log loss to the non-convexity of L2 loss can be found here: What is happening here, when I use squared loss in logistic regression setting?



The textbook way to estimate logistic regression coefficients is called Newton-Raphson updating, but I don't believe that it is implemented in TensorFlow since second-order methods are not generally used for neural networks. However, you might improve the rate of convergence if you use SGD + classical momentum or SGD + Nesterov momentum. Nesterov momentum is especially appealing in this case: since your problem is convex, the problem is more-or-less locally quadratic, and that is the use case where Nesterov momentum really shines.
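In TensorFlow 1.x, both changes might look roughly like this (a sketch with assumed names and hyperparameters):

```python
import tensorflow as tf

n_features = 10000  # assumed vocabulary size of the bag-of-words vectors

x = tf.placeholder(tf.float32, shape=[None, n_features])
y = tf.placeholder(tf.float32, shape=[None, 1])

theta = tf.Variable(tf.zeros([n_features, 1]))
logits = tf.matmul(x, theta)       # x_i^T theta
y_predicted = tf.sigmoid(logits)   # only needed for prediction, not for the loss

# Convex cross-entropy (log) loss instead of the squared loss
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

# SGD with Nesterov momentum
train_op = tf.train.MomentumOptimizer(
    learning_rate=0.1, momentum=0.9, use_nesterov=True).minimize(loss)
```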






answered Aug 18 at 13:44 by Sycorax, edited Aug 18 at 21:07






















  • Thank you very much for the suggestion. I will have a look into it and then report how good a result it gave me.
    – User12547645
    Aug 18 at 15:12










  • Thank you again! I am now at more than 98% accuracy for training and testing, with training still going.
    – User12547645
    Aug 18 at 20:58






  • That sounds like a pretty nice improvement, though. Almost 10%! -- in your post, you said you were getting 90% accuracy.
    – Sycorax
    Aug 18 at 21:08










  • Yes, it is very impressive indeed! And it seems as though I still do not need any regularization, since training and testing accuracy are fairly close together.
    – User12547645
    Aug 19 at 9:52









