Adding more layers decreases accuracy
























I have an ANN trained on the MNIST dataset. The input layer has 784 neurons and the hidden layer has 128 neurons. This gave me an accuracy of 94%. However, when I added one more hidden layer with 64 neurons, the accuracy dropped significantly to 35%. What could be the reason behind this?



Edit: activation function: sigmoid; trained for 521 epochs.
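
The question does not include any code, so here is a minimal sketch of the setup described above. Keras, the softmax output layer, plain SGD, the learning rate and the epoch count in the sketch are assumptions; only the 784/128/64 layer sizes and the sigmoid hidden activations come from the post.

```python
# Minimal sketch of the described setup -- NOT the asker's actual code.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

def build_model(hidden_sizes):
    """784-input MLP with sigmoid hidden layers, as in the question."""
    layers = [tf.keras.Input(shape=(784,))]
    layers += [tf.keras.layers.Dense(n, activation="sigmoid") for n in hidden_sizes]
    layers += [tf.keras.layers.Dense(10, activation="softmax")]  # assumed output layer
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),  # assumed
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# One hidden layer (reported ~94%) vs. an extra 64-neuron layer (reported ~35%).
for sizes in ([128], [128, 64]):
    model = build_model(sizes)
    model.fit(x_train, y_train, epochs=20, batch_size=128, verbose=0)
    print(sizes, model.evaluate(x_test, y_test, verbose=0))
```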




















































machine-learning neural-network deep-learning mlp














asked 5 hours ago by Pink (edited 1 hour ago)















  • What is the activation function you are using?
    – DuttaA
    2 hours ago










  • @DuttaA sigmoid
    – Pink
    55 mins ago


























2 Answers




































The reason is that by adding more layers you've added more trainable parameters to your model, so you have to train it longer. Also keep in mind that MNIST is a very easy-to-learn dataset: two hidden layers with far fewer neurons in each are enough. Try $10$ neurons per layer to make the learning process easier; you can still reach close to $100\%$ accuracy.






answered 4 hours ago by Media (edited 3 hours ago)
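
A minimal sketch of the smaller architecture suggested in this answer. Keras, the softmax output, the SGD settings and the epoch count are assumptions; only the two ~10-neuron sigmoid hidden layers come from the answer.

```python
# Sketch of the suggestion: two small sigmoid hidden layers, trained for longer.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

small_model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),   # assumed output layer
])
small_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),  # assumed
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])

# Far fewer parameters per layer, so each epoch does more useful work;
# sigmoid activations still tend to need many epochs.
small_model.fit(x_train, y_train, epochs=50, batch_size=128, verbose=2)
```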






















  • It's also a very small dataset.
    – Matthieu Brucher
    35 mins ago










  • Yes! $50$ thousand is very small for deep-learning purposes.
    – Media
    28 mins ago






























The problem in your case (as I suspected earlier) is the sigmoid activation function. It suffers from many problems, and your performance decrease is likely due to two of them:



  • Vanishing gradients

  • A high learning rate

The vanishing gradient problem traps your neural net in a non-optimal solution, and the high learning rate keeps it trapped there: after a few oscillations it pushes the network into saturation.



Solution (a sketch follows after this answer):



  • The best fix is to use the ReLU activation function, with perhaps the last layer as sigmoid.

  • Use an adaptive optimizer such as AdaGrad, Adam or RMSProp.

  • Alternatively, decrease the learning rate to $10^{-6}$ to $10^{-7}$, but to compensate, increase the number of epochs to $10^6$ to $10^7$.





answered 12 mins ago by DuttaA
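
A minimal sketch of the first two suggestions (ReLU hidden layers plus an adaptive optimizer), assuming Keras. The softmax output used here in place of the sigmoid last layer mentioned above, and the 1e-3 learning rate, are substitutions of mine, not part of the answer.

```python
# Sketch of the suggested fix: ReLU hidden layers + an adaptive optimizer (Adam).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),  # ReLU avoids saturating gradients
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # assumed output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128,
          validation_data=(x_test, y_test))
```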

















































