Is there any paper which summarizes the mathematical foundation of deep learning?
I am currently studying the mathematical background of deep learning, but unfortunately I cannot tell to what extent neural network theory has actually been proved mathematically. I am therefore looking for a paper that reviews the historical development of neural network theory from a mathematical point of view, especially regarding learning algorithms (convergence), the generalization ability of NNs, and NN architecture (why is deep good?). If you know of such a paper, please let me know its title.
For reference, here are some of the papers I have already read:
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366.
- Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3), 183-192.
- Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6), 861-867.
- Mhaskar, H. N., & Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied mathematics, 13(3), 350-373.
- Delalleau, O., & Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems (pp. 666-674).
- Telgarsky, M. (2016). Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.
- Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3), 930-945.
- Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1), 164-177.
- Lee, H., Ge, R., Ma, T., Risteski, A., & Arora, S. (2017). On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028.
- Bartlett, P. L., & Maass, W. (2003). Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 1188-1192.
- Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems (pp. 586-594).
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
- Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report.
- Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Yun, C., Sra, S., & Jadbabaie, A. (2017). Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444.
- Zeng, J., Lau, T. T. K., Lin, S., & Yao, Y. (2018). Block Coordinate Descent for Deep Learning: Unified Convergence Guarantees. arXiv preprint arXiv:1803.00225.
- Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1), 1-11.
- Li, Q., Chen, L., Tai, C., & Weinan, E. (2017). Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1), 5998-6026.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- Kawaguchi, K., Kaelbling, L. P., & Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.
Tags: neural-networks, deep-learning, references
What do you mean by "mathematically proved"? A proof of what, exactly? Anyway, you've got a pretty complete bibliography. You're missing all the works of Daniel Roy, Tomaso Poggio and Sanjeev Arora, but apart from these (big) misses, you've got it pretty much covered. If there is something specific you want a proof for, edit your question accordingly.
– DeltaIV
2 hours ago
1 Answer
To my knowledge, there is no single paper that summarizes the proven mathematical results. For a general overview, I recommend going for textbooks instead, which are more likely to give you broad background. Two prominent examples are:
- Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge, MA: MIT Press, 2016.
These books are rather introductory compared to the level of some of the papers you cited. If you want to go deeper into PAC learning theory (which you really should, if you plan to do research on the learnability of NN models), read both of these:
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, MIT Press, 2012 (but wait for the 2018 edition, it's due on Christmas and it has a few considerable improvements)
- Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014
Also, if you are interested in the historical development of neural networks, read:
- Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural networks, 61, pp.85-117.
The tricky thing about mathematical theory and proofs in deep learning is that many important results have few practical implications. For example, the famous universal approximation theorem says that a neural network with a single hidden layer can approximate any continuous function (on a compact domain) to arbitrary precision. Why would you then bother with more layers if one is enough? Because it has been empirically demonstrated that deeper networks work better in practice.
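For concreteness, here is roughly the single-hidden-layer statement that Cybenko (1989) proves (my paraphrase, with my own notation; see the paper for the exact conditions on the activation $\sigma$): sums of the form

$$
G(x) \;=\; \sum_{j=1}^{N} \alpha_j\, \sigma\!\left(w_j^{\top} x + b_j\right)
$$

are dense in $C([0,1]^n)$, i.e. for every continuous target $f$ on $[0,1]^n$ and every $\varepsilon > 0$ there exist $N$, $\alpha_j, b_j \in \mathbb{R}$ and $w_j \in \mathbb{R}^n$ such that $\sup_{x \in [0,1]^n} |G(x) - f(x)| < \varepsilon$, provided $\sigma$ is a continuous sigmoidal function. Note that the statement is purely existential: it says nothing about how large $N$ must be (Barron 1993, which you cite, gives such rates for a restricted function class) or how to find the weights, which is precisely why it has so little practical bite.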
Another example is convergence: training neural networks with first-order methods (gradient descent and the like) is guaranteed¹ to converge to a local minimum, but nothing more. Since this is a non-convex optimization problem, we simply cannot prove much more that is useful about it (although there is some research on how far the local minima are from a global minimum [1,2]). Naturally, much more attention is paid to empirical research, studying what we can do even when we cannot prove it.²
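To make the "local minimum, nothing more" point concrete, here is a minimal sketch (my own toy objective and hyperparameters, not taken from any of the papers above): plain gradient descent on a 1-D non-convex function simply settles into whichever basin the initialization lands in.

```python
# Toy 1-D non-convex objective with two minima:
# a global one near x = -1.04 (f = -1.31) and a local one near x = 0.96 (f = -0.71).
f  = lambda x: x**4 - 2*x**2 + 0.3*x
df = lambda x: 4*x**3 - 4*x + 0.3     # analytic gradient

x, lr = 2.0, 0.01                     # initialize on the "wrong" side of the barrier
for _ in range(5000):
    x -= lr * df(x)                   # vanilla gradient descent step

print(f"converged to x = {x:.3f}, f(x) = {f(x):.3f}")         # -> x = 0.960, f = -0.706 (local minimum)
print(f"global minimum near x = -1.035, f = {f(-1.035):.3f}")  # -> f = -1.305
```

Started from x = -2.0 instead, the very same loop reaches the global minimum; nothing in the update rule distinguishes the two cases, which is all the generic theory guarantees.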
Finally, I am not aware of works that prove much of importance about network architecture or generalization ability (to be honest, I am not sure exactly what kinds of proofs you are looking for here; if you reply in the comments or add details to your question, I can expand on this).
[1]: Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. and LeCun, Y., 2015, February. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics (pp. 192-204).
[2]: Soudry, D. and Carmon, Y., 2016. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361.
¹ Guaranteed almost surely; see the discussion around this answer for some pathological counterexamples.
² This is not necessarily bad, and it does not mean that deep learning is alchemy: proofs and rigorous mathematical theories often follow empirical evidence and engineering results.
They are basic in the level of detail they provide, but they are full of references to a broad range of topics. Focusing on papers may be a bit like depth-first search...
– Jan Kukacka
1 hour ago
@DeltaIV btw, feel free to include the textbooks that you mention in an edit (if you're not planning to answer yourself), or I can do it too.
– Jan Kukacka
1 hour ago
not going to answer b/c I feel it's too broad, so I edited your question.
– DeltaIV
16 mins ago
I'd have an edit also on the Universal Approximation Theorem (I very much like your angle, btw). Shall I go with it, or would you rather have me introduce it first in a comment?
– DeltaIV
7 mins ago