Is there any paper which summarizes the mathematical foundation of deep learning?


I am currently studying the mathematical background of deep learning, but unfortunately I cannot tell to what extent neural network theory has been mathematically proved. I am therefore looking for a paper that reviews the historical development of neural network theory on a mathematical foundation, especially in terms of learning algorithms (convergence), the generalization ability of NNs, and NN architecture (why is deep good?). If you know of such a paper, please let me know its name.



For your reference, here are some of the papers I have read.



  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.

  • Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366.

  • Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3), 183-192.

  • Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6), 861-867.

  • Mhaskar, H. N., & Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied mathematics, 13(3), 350-373.

  • Delalleau, O., & Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems (pp. 666-674).

  • Telgarsky, M. (2016). Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.

  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3), 930-945.

  • Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1), 164-177.

  • Lee, H., Ge, R., Ma, T., Risteski, A., & Arora, S. (2017). On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028.

  • Bartlett, P. L., & Maass, W. (2003). Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 1188-1192.

  • Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems (pp. 586-594).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.

  • Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report.

  • Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

  • Yun, C., Sra, S., & Jadbabaie, A. (2017). Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444.

  • Zeng, J., Lau, T. T. K., Lin, S., & Yao, Y. (2018). Block Coordinate Descent for Deep Learning: Unified Convergence Guarantees. arXiv preprint arXiv:1803.00225.

  • Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1), 1-11.

  • Li, Q., Chen, L., Tai, C., & Weinan, E. (2017). Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1), 5998-6026.

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

  • Kawaguchi, K., Kaelbling, L. P., & Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.









neural-networks deep-learning references

asked 2 hours ago by almnagako, edited 40 mins ago by Ferdi

  • What do you mean by "mathematically proved"? A proof of what, exactly? Anyway, you've got a pretty complete bibliography. You're missing all the works of Daniel Roy, Tomaso Poggio and Sanjeev Arora, but apart from these (big) misses, you've got it pretty much covered. If there is something specific you want a proof for, edit your question accordingly.
    – DeltaIV
    2 hours ago

1 Answer

To my knowledge, there is no single paper that summarizes the proven mathematical results. For a general overview, I recommend going for textbooks instead, which are more likely to give you a broad background. Two prominent examples are:



  • Bishop, Christopher M. Neural networks for pattern recognition. Oxford University Press, 1995.

  • Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. Vol. 1. Cambridge: MIT Press, 2016.

These are rather introductory books, compared to the level of some papers you cited. If you want to go deeper into PAC learning theory (which you really should, if you plan on doing research on the learnability of NN models), read both of these:



  • Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, MIT Press, 2012 (but wait for the 2018 edition, it's due on Christmas and it has a few considerable improvements)

  • Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014

Also, if you are interested in the historical development of neural networks, read:



  • Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural networks, 61, pp.85-117.


The tricky thing with mathematical theory and proofs in deep learning is that many important results don't have practical implications. For example, the super famous Universal approximation theorem says that a neural network with a single hidden layer can approximate any continuous function to arbitrary precision. Why would you care about using more layers, then, if one is enough? Because it has been empirically demonstrated that deeper networks work.
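For concreteness, here is one common form of the statement (a sketch roughly following Cybenko's 1989 version for a sigmoidal activation $\sigma$; the exact assumptions on $\sigma$ and on the function class differ across the papers cited in the question): for every continuous $f$ on $[0,1]^d$ and every $\varepsilon > 0$ there exist a width $N$ and weights $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^d$ such that

$$
\sup_{x \in [0,1]^d} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon .
$$

Note that the statement is purely about existence: it says nothing about how large $N$ has to be or how to find the weights, which is part of why it has so little practical bite.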



Another example is convergence: training a neural network with first-order methods (gradient descent and the like) is guaranteed¹ to converge to a local minimum, but nothing more. Since this is a non-convex optimization problem, we simply cannot prove much more that is useful about it (although there is some research on how far local minima are from a global minimum [1,2]). Naturally, much more attention is paid to empirical research studying what we can do even if we cannot prove it².
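To make the "converges to a local minimum, nothing more" point concrete, here is a minimal, self-contained sketch (mine, not taken from any of the cited papers; the function and step size are arbitrary illustrative choices) of plain gradient descent on a non-convex one-dimensional function. Depending on the starting point it settles in different stationary points, only one of which is the global minimum:

```python
# Minimal illustration: gradient descent on the non-convex function
# f(x) = x^4 - 3x^2 + x, which has two local minima (only one is global).

def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain first-order descent: x <- x - lr * grad_f(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

for x0 in (-2.0, 2.0):
    x_star = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x* = {x_star:+.4f}, f(x*) = {f(x_star):+.4f}")

# Both runs stop where grad_f(x) is (numerically) zero, but only the run
# started at -2.0 reaches the global minimum; the other is stuck in the
# shallower local minimum -- exactly the kind of guarantee (and its limits)
# described above.
```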



Finally, I am not aware of works proving much of importance about network architecture or about generalization ability (to be honest, I am not sure what kinds of proofs you are looking for here; maybe if you reply in the comments or add details to your question, I can expand on this here).




[1]: Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. and LeCun, Y., 2015, February. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics (pp. 192-204).



[2]: Soudry, D. and Carmon, Y., 2016. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361.



¹ Guaranteed almost surely; see the discussion around this answer for some pathological counterexamples.



² This is not necessarily bad, and it does not mean that deep learning is alchemy: proofs and rigorous mathematical theories often follow empirical evidence and engineering results.






answered 1 hour ago by Jan Kukacka, edited 8 mins ago by DeltaIV

  • They are basic in the level of detail they provide, but they are full of references to a broad range of topics. Focusing on papers may be a bit like depth-first search...
    – Jan Kukacka
    1 hour ago






  • @DeltaIV btw, feel free to include the textbooks that you mention in an edit (if you're not planning to answer yourself), or I can do it too.
    – Jan Kukacka
    1 hour ago










  • not going to answer b/c I feel it's too broad, so I edited your question.
    – DeltaIV
    16 mins ago










  • I'd have an edit also on the Universal Approximation Theorem (I very much like your angle, btw). Shall I go with it, or would you rather have me introduce it first in a comment?
    – DeltaIV
    7 mins ago










