Is there any paper which summarizes the mathematical foundation of deep learning?


I am currently studying the mathematical background of deep learning, but unfortunately I cannot tell to what extent neural network theory has been mathematically proved. I am therefore looking for a paper that reviews the historical development of neural network theory on a mathematical foundation, especially in terms of learning algorithms (convergence), the generalization ability of NNs, and NN architecture (why is deep good?). If you know of such a paper, please let me know its name.



For your reference, here are some of the papers I have read.



  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.

  • Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5), 359-366.

  • Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3), 183-192.

  • Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6), 861-867.

  • Mhaskar, H. N., & Micchelli, C. A. (1992). Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied mathematics, 13(3), 350-373.

  • Delalleau, O., & Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems (pp. 666-674).

  • Telgarsky, M. (2016). Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.

  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3), 930-945.

  • Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1), 164-177.

  • Lee, H., Ge, R., Ma, T., Risteski, A., & Arora, S. (2017). On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028.

  • Bartlett, P. L., & Maass, W. (2003). Vapnik-Chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, 1188-1192.

  • Kawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems (pp. 586-594).

  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  • Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.

  • Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning. University of Toronto, Technical Report.

  • Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

  • Yun, C., Sra, S., & Jadbabaie, A. (2017). Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444.

  • Zeng, J., Lau, T. T. K., Lin, S., & Yao, Y. (2018). Block Coordinate Descent for Deep Learning: Unified Convergence Guarantees. arXiv preprint arXiv:1803.00225.

  • Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1), 1-11.

  • Li, Q., Chen, L., Tai, C., & Weinan, E. (2017). Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1), 5998-6026.

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

  • Kawaguchi, K., Kaelbling, L. P., & Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.









neural-networks deep-learning references

asked 2 hours ago by almnagako, edited 40 mins ago by Ferdi

  • What do you mean by "mathematically proved"? A proof of what, exactly? Anyway, you've got a pretty complete bibliography. You're missing all the works of Daniel Roy, Tomaso Poggio and Sanjeev Arora, but apart from these (big) misses, you've got it pretty much covered. If there is something specific you want a proof for, edit your question accordingly.
    – DeltaIV
    2 hours ago

1 Answer

To my knowledge, there is no single paper that summarizes the proven mathematical results. For a general overview, I recommend going for textbooks instead, which are more likely to give you a broad background. Two prominent examples are:



  • Bishop, Christopher M. Neural networks for pattern recognition. Oxford University Press, 1995.

  • Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. Vol. 1. Cambridge: MIT Press, 2016.

These are rather introductory books, compared to the level of some papers you cited. If you want to go deeper into PAC learning theory (which you really should, if you plan on doing research on the learnability of NN models), read both of these:



  • Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, MIT Press, 2012 (but wait for the 2018 edition, it's due on Christmas and it has a few considerable improvements)

  • Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014

Also, if you are interested in the historical development of neural networks, read:



  • Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural networks, 61, pp.85-117.


The tricky thing with mathematical theory and proofs in deep learning is that many important results don't have practical implications. For example, the super famous Universal approximation theorem says that a neural network with a single hidden layer can approximate any continuous function to arbitrary precision. Why would you care about using more layers, then, if one is enough? Because it has been empirically demonstrated that deeper networks work.
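For concreteness, here is one common form of the statement (a sketch roughly following Cybenko's 1989 version for a sigmoidal activation $\sigma$; the exact assumptions on $\sigma$ and on the function class differ across the papers cited in the question): for every continuous $f$ on $[0,1]^d$ and every $\varepsilon > 0$ there exist a width $N$ and weights $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^d$ such that

$$
\sup_{x \in [0,1]^d} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon .
$$

Note that the statement is purely about existence: it says nothing about how large $N$ has to be or how to find the weights, which is part of why it has so little practical bite.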



Another example is convergence: training a neural network with first-order methods (gradient descent and the like) is guaranteed¹ to converge to a local minimum, but nothing more. Since this is a non-convex optimization problem, we simply cannot prove much more that is useful about it (although there is some research on how far local minima are from a global minimum [1,2]). Naturally, much more attention is paid to empirical research studying what we can do even if we cannot prove it².
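To make the "converges to a local minimum, nothing more" point concrete, here is a minimal, self-contained sketch (mine, not taken from any of the cited papers; the function and step size are arbitrary illustrative choices) of plain gradient descent on a non-convex one-dimensional function. Depending on the starting point it settles in different stationary points, only one of which is the global minimum:

```python
# Minimal illustration: gradient descent on the non-convex function
# f(x) = x^4 - 3x^2 + x, which has two local minima (only one is global).

def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain first-order descent: x <- x - lr * grad_f(x)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

for x0 in (-2.0, 2.0):
    x_star = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x* = {x_star:+.4f}, f(x*) = {f(x_star):+.4f}")

# Both runs stop where grad_f(x) is (numerically) zero, but only the run
# started at -2.0 reaches the global minimum; the other is stuck in the
# shallower local minimum -- exactly the kind of guarantee (and its limits)
# described above.
```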



Finally, I am not aware of works proving much of importance about network architecture or about generalization ability (to be honest, I am not sure what kinds of proofs you are looking for here; maybe if you reply in the comments or add details to your question, I can expand on this here).




[1]: Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. and LeCun, Y., 2015, February. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics (pp. 192-204).



[2]: Soudry, D. and Carmon, Y., 2016. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361.



¹ Guaranteed almost surely; see the discussion around this answer for some pathological counterexamples.



² This is not necessarily bad, and it does not mean that deep learning is alchemy: proofs and rigorous mathematical theories often follow empirical evidence and engineering results.






answered 1 hour ago by Jan Kukacka, edited 8 mins ago by DeltaIV

  • They are basic in the level of detail they provide, but they are full of references to a broad range of topics. Focusing on papers may be a bit like depth-first search...
    – Jan Kukacka
    1 hour ago






  • @DeltaIV btw, feel free to include the textbooks that you mention in an edit (if you're not planning to answer yourself), or I can do it too.
    – Jan Kukacka
    1 hour ago










  • not going to answer b/c I feel it's too broad, so I edited your question.
    – DeltaIV
    16 mins ago










  • I'd have an edit also on the Universal Approximation Theorem (I very much like your angle, btw). Shall I go with it, or would you rather have me introduce it first in a comment?
    – DeltaIV
    7 mins ago










