Does a "Normal Distribution" need to have mean=median=mode?
Score: 2
I've been in a debate with my graduate-level statistics professor about "normal distributions". I contend that to truly get a normal distribution one must have mean=median=mode, all the data must be contained under the bell curve, and the data must be perfectly symmetrical around the mean. Therefore, technically, there are virtually NO normal distributions in real studies, and we should call them something else, perhaps "near-normal".
She says I'm too picky, and that if the skew/kurtosis are less than 1.0 it is a normal distribution, and she took off points on an exam. The dataset is the total number of falls/year in a random sample of 52 nursing homes drawn from a larger population. Any insight?
QUESTION: 3. Compute measures of skewness and kurtosis for this data. Include a histogram with a normal curve. Discuss your findings. Is the data normally distributed?
Statistics: Number of falls
N Valid                  52
N Missing                0
Mean                     11.23
Median                   11.50
Mode                     4 (a)
(a) Multiple modes exist. The smallest value is shown.

Statistics: Number of falls
N Valid                  52
N Missing                0
Skewness                 .114
Std. Error of Skewness   .330
Kurtosis                 -.961
Std. Error of Kurtosis   .650
MY ANSWER:
The data is platykurtic and has only slight positive skewing, and it is NOT a normal distribution because the mean and median and mode are not equal and the data is not evenly distributed around the mean. In reality virtually no data is ever a perfect normal distribution, although we can discuss "approximately normal distributions" such as height, weight, temperature, or length of adult ring finger in large population groups.
PROFESSOR'S ANSWER:
You are correct that there is no perfectly normal distribution. But, we are not looking for perfection. We need to look at data in addition to the histogram and the measures of central tendency. What do the skewness and kurtosis statistics tell you about the distribution? Because they are both between the critical values of -1 and +1, this data is considered to be normally distributed.
mathematical-statistics descriptive-statistics
2
I think you are being a little pedantic, it makes no sense to talk about median or mode of a continuous distribution.
– user2974951
5 hours ago
2
I would like to know your professor's exact wording. In principle a normal distribution has mean, median and mode identical (but so do many other distributions) and has skewness 0 and (so-called excess) kurtosis 0 (and so do some other distributions). At best a distribution with (e.g.) slight skewness or kurtosis is approximately normal. Note that almost all real data are at best approximations to named distributions in the theoretical menagerie.
– Nick Cox
4 hours ago
7
I don't agree with @user2974951. In company with every good text I know, I am perfectly happy thinking that the normal distribution has a median and mode. And that applies widely to continuous distributions, although I don't doubt that pathological counter-examples can be identified.
– Nick Cox
4 hours ago
1
I didn't call you or your question pedantic, but the one answer to date is, I think, broadly on the right lines. The main issue here seems to be two different questions: what is a normal distribution in principle, and what is approximately normal in practice. (FWIW, I don't readily believe the data -- unless the people who fell over many, many times are now dead or in hospital. For other reasons, I know more than I want to about nursing homes. It is also crucial whether you are talking about falls per individual or some average rate of falls for a nursing home, e.g., falls per week per 100 patients).
– Nick Cox
4 hours ago
1
Thanks for the specific detail, which shows merit on both sides, but I am not grading either of you. However, I dissent strongly from the term critical values as used by your Professor, as the limits $\pm 1$ for skewness and kurtosis have no standing whatsoever beyond being rules of thumb that might be used. Depending on what you are doing with the data, a skewness $< 1$ might go along with wanting to transform the data and a skewness of $> 1$ might go along with leaving the data as they are, and similarly for kurtosis.
– Nick Cox
3 hours ago
edited 3 hours ago
asked 5 hours ago
Possum-Pie
5 Answers
Score: 3
The teacher is clearly out of his/her element, and probably should not be teaching statistics. It seems worse to me to teach something wrong than to not teach it at all.
These issues could all be cleared up easily if the distinction between "data" and "process that produced the data" were drawn more clearly. Data target the process that produced the data. The normal distribution is a model for this process.
It makes no sense to talk about whether the data are normally distributed. For one reason, the data are always discrete. For another reason, the normal distribution describes an infinity of potentially observable quantities, not a finite set of specific observed quantities.
Further, the answer to the question "is the process that produced the data a normally distributed process" is also always "no," regardless of the data. Two simple reasons: (i) any measurements we take are necessarily discrete, being rounded off to some level. (ii) perfect symmetry, like a perfect circle, does not exist in observable nature. There are always imperfections.
At best, the answer to the question "what do these data tell you about normality of the data-generating process" could be given as follows: "these data are consistent with what we would expect to see, had the data truly come from a normally distributed process." That answer correctly does not conclude that the distribution is normal.
These issues are very easily understood by using simulation. Just simulate data from a normal distribution and compare those to the existing data. If the data are counts (0,1,2,3,...), then obviously the normal model is wrong because it does not produce numbers like 0,1,2,3,...; instead, it produces numbers with decimals that go on forever (or at least as far as the computer will allow.) Such simulation should be the first thing you do when learning about the normality question. Then you can more correctly interpret the graphs and summary statistics.
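As a rough sketch of the simulation idea above (my own illustration, not anything from the question): draw many samples of size 52 from a true normal distribution and see how often their skewness and kurtosis land as far out as the reported .114 and -.961. The bias-adjusted formulas below are the ones I am assuming SPSS-style output uses; the seed and the 2000 repetitions are arbitrary choices. Skewness and kurtosis do not depend on location or scale, so a standard normal is enough.

```python
import random
import statistics


def sample_skew_kurt(xs):
    """Return (skewness, excess kurtosis) using bias-adjusted G1/G2 formulas.

    Assumption: these adjusted estimators are what the SPSS table reports.
    """
    n = len(xs)
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5          # plain moment skewness
    g2 = m4 / m2 ** 2 - 3.0      # plain moment excess kurtosis
    G1 = g1 * (n * (n - 1)) ** 0.5 / (n - 2)
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return G1, G2


rng = random.Random(1)           # arbitrary seed, for reproducibility
n, reps = 52, 2000               # n from the question; reps is arbitrary
obs_skew, obs_kurt = 0.114, -0.961  # values reported in the question

results = [
    sample_skew_kurt([rng.gauss(0, 1) for _ in range(n)])
    for _ in range(reps)
]

# Share of truly normal samples at least as skewed (either direction)
frac_skew = sum(abs(s) >= abs(obs_skew) for s, _ in results) / reps
# Share of truly normal samples at least as flat (platykurtic)
frac_kurt = sum(k <= obs_kurt for _, k in results) / reps

print(f"|skewness| >= {abs(obs_skew)}: {frac_skew:.3f} of normal samples")
print(f"kurtosis <= {obs_kurt}: {frac_kurt:.3f} of normal samples")
```

If a large share of normal samples look at least as extreme as the observed data, the summary statistics are consistent with a normal process; if very few do, they are not. Either way, the comparison says nothing about the data being counts, which the normal model never produces.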
I didn't downgrade your answer, but consider that you're judging a grad professor from the words of a student. How likely is it that a student is right and a teacher is wrong? Isn't it more likely that the student is misrepresenting his professor and the context of the conversation?
– Aksakal
55 mins ago
Based on my experience, and on the student's words, I'd say it is more likely that the teacher is wrong. There are teachers with little formal training who teach courses, even grad courses, in universities everywhere. If the accrediting agencies only knew the truth!
– Peter Westfall
20 mins ago
Score: 1
You're missing the point and probably are also being "difficult," which is not appreciated in the industry. She's showing you a toy example, to train you in assessing the normality of a data set. Looking at distribution moments is one way to check normality; e.g., the Jarque-Bera test is based on such an assessment.
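To make the moment-based check concrete, here is a small sketch using only the numbers from the question's table (n = 52, skewness .114, kurtosis -.961). The standard-error formulas are the usual ones under normality, and I am assuming they are what the software used. The Jarque-Bera statistic is normally computed from the unadjusted moment estimates and relies on large-sample theory, so plugging in the table values at n = 52 is only an approximation.

```python
import math

n = 52
skew, kurt = 0.114, -0.961   # skewness and excess kurtosis from the question

# Standard errors under normality (assumed to match the software's formulas)
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

# Ratios of each statistic to its standard error
z_skew = skew / se_skew
z_kurt = kurt / se_kurt

# Jarque-Bera statistic; approximately chi-square with 2 df under normality
jb = n / 6 * (skew ** 2 + kurt ** 2 / 4)
# For 2 degrees of freedom the chi-square upper tail is exp(-x / 2)
p_jb = math.exp(-jb / 2)

print(f"SE skewness = {se_skew:.3f}, SE kurtosis = {se_kurt:.3f}")
print(f"z skewness = {z_skew:.2f}, z kurtosis = {z_kurt:.2f}")
print(f"Jarque-Bera = {jb:.2f}, approximate p-value = {p_jb:.2f}")
```

The computed standard errors round to the .330 and .650 shown in the table, which suggests the formula assumption is reasonable. A large p-value here only means the moments do not contradict normality; it does not show that the distribution is normal.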
Yes, the normal distribution is perfectly symmetrical. However, if you draw a sample from a true normal distribution, that sample will most likely not be perfectly symmetrical. This is the point you're completely missing. You can test this very easily yourself. Just generate a sample from a Gaussian distribution, and check its moments. They'll never be perfectly "normal," despite the true distribution being such.
Here's a silly Python example. I'm generating 100 samples of 100 random numbers, then obtaining their means and medians. I print the first sample to show that the mean and median are different, then show the histogram of the difference between the means and medians. You can see that it's rather narrow, but the difference is basically never zero. Note that the numbers are coming from a normal distribution.
code:
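import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
# each column is one sample of 100 standard normal draws (100 samples in total)
s = np.random.normal(0, 1, (100, 100))
# mean and median of the first sample differ even though the source is normal
print('sample 0 mean:', np.mean(s[:, 0]), 'median:', np.median(s[:, 0]))
# histogram of (mean - median) across the 100 samples
diff = np.mean(s, 0) - np.median(s, 0)
plt.hist(diff)
plt.show()
print('avg mean-median:', np.mean(diff))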
outputs: [printed values and histogram of mean minus median not preserved]
1
(+1) Exactly the point. Confusing the random variable and a sample of realisations from that random variable.
– Xi'an
41 mins ago
Score: 0
I think you and your professor are talking in different contexts. Equality of mean = median = mode is a characteristic of the theoretical distribution, and it is not the only one. You cannot say that any distribution for which this property holds is normal: the t-distribution is also symmetric, but it is not normal. So you are talking about theoretical properties of the normal distribution, which always hold for the normal distribution.
Your professor is talking about the distribution of sample data. He is right: you will never get real-life data where mean = median = mode exactly. This is simply due to sampling error. Similarly, it is very unlikely that you will get a zero coefficient of skewness and zero excess kurtosis for sample data. Your professor is just giving you a simple rule for getting an idea about the distribution from the sample statistics, a rule which is not true in general (without further information).
Professor is said to be female.
– Nick Cox
4 hours ago
Why you don't get mean=median=mode is mostly because many distributions really are skewed! (Strictly, mean=median=mode is possible with skewed distributions too, despite what many textbooks say.)
– Nick Cox
4 hours ago
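A tiny made-up data set can illustrate the parenthetical claim in the comment above, that mean = median = mode does not rule out skewness. The eleven values below are hypothetical and chosen by hand; they are not from the question.

```python
import statistics

# Hypothetical values, hand-picked so that mean, median and mode all equal 0
data = [-3, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2]

mean = statistics.fmean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Plain moment-based skewness: third central moment over variance^(3/2)
n = len(data)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skew = m3 / m2 ** 1.5

print(f"mean={mean}, median={median}, mode={mode}, skewness={skew:.3f}")
```

The three measures of centre coincide, yet the skewness is clearly negative because of the single value at -3, so equality of mean, median and mode is a necessary consequence of normality but nowhere near sufficient for it.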
I disagree that lack of equality of mean/median/mode = sampling error. Suppose you randomly sampled 52 nursing homes for fall rates. Homes 27, 34, and 52 are chronically short-staffed and always have an above-average number of falls. Those homes push the mean towards the tail, and that isn't due to sampling error.
– Possum-Pie
4 hours ago
@Possum Pie What the data are is secondary here but you're giving different signals in different places. Here you are talking about several nursing homes -- but in your question you state "in a nursing home". Being unclear about even incidental details does not help.
– Nick Cox
4 hours ago
@Nick Cox Sorry, I clarified it. Number of falls/year in a sample of 52 nursing homes
– Possum-Pie
4 hours ago
Score: 0
I'm an engineer, so in my world the applied statistician is what I see most, and get the most concrete value from. If you are going to work in applied statistics, then you need to be solidly grounded in practice over theory: whether or not it is elegant, the aircraft has to fly and not crash.
The way I approach this question, as many of my technical betters here have also done, is to think about "what does it look like in the real world with the presence of noise".
The second thing that I do is, often, to make a simulation that allows me to get my hands around the question.
Here is a very brief exploration:
# show how the mean and the median differ with respect to sample size
# libraries
library(reshape2)
library(ggplot2)

# sample sizes
ssizes <- 10^(seq(from = 1, to = 3, by = 0.25))
ssizes <- round(ssizes)

# loops per sample
n_loops <- 5000

# pre-declare, prep for loop
my_store <- matrix(0,
                   ncol = 3,
                   nrow = n_loops * length(ssizes))
count <- 1

for (i in seq_along(ssizes)) {
  # how many samples
  n_samp <- ssizes[i]
  for (j in 1:n_loops) {
    # draw samples
    y <- rnorm(n = n_samp, mean = 0, sd = 1)
    # compute median and mean
    my_store[count, 1] <- n_samp
    my_store[count, 2] <- median(y)
    my_store[count, 3] <- mean(y)
    # update
    count <- count + 1
  }
}

# make data into ggplot friendly form
df <- data.frame(my_store)
names(df) <- c("n_samp", "median", "mean")
df <- melt(df, id.vars = 1, measure.vars = c("median", "mean"))

# make ggplot
ggplot(df, aes(x = as.factor(n_samp),
               y = value,
               fill = variable)) + geom_boxplot() +
  labs(title = "Contrast Median and Mean estimate variation vs. Sample Size",
       x = "Number of Samples",
       y = "Estimated value")
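It gives this as the output: [figure not preserved: boxplots of the median and mean estimates at each sample size, titled "Contrast Median and Mean estimate variation vs. Sample Size"]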
Note: be careful about the x-axis, because it is log-scaled, not uniform-scaled.
I know that the true mean and median are exactly the same: the code draws from a normal distribution with mean 0. The empirical estimates are greatly sensitive to sample size, and without truly infinite samples they can't ever perfectly match theory.
You can think about whether the uncertainty in the median envelops the estimated mean or vice versa. If the best estimate of the mean is within the 95% CI of the estimate for the median, then the data can't tell the difference; they are consistent with the theoretical claim that the two are the same. If you get more data, then see what it says.
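One way to sketch that last check, whether the mean sits inside a 95% interval for the median, is a percentile bootstrap. This is my own choice of interval method, not something from the answer above, and the sample is simulated (52 standard normal draws with an arbitrary seed) because the raw nursing-home data are not available.

```python
import random
import statistics

rng = random.Random(0)                         # arbitrary seed
n = 52                                         # same size as the question's sample
sample = [rng.gauss(0, 1) for _ in range(n)]   # simulated stand-in data


def bootstrap_median_ci(xs, reps=2000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the median of xs."""
    meds = sorted(
        statistics.median(rng.choices(xs, k=len(xs)))  # resample with replacement
        for _ in range(reps)
    )
    lo = meds[int(alpha / 2 * reps)]
    hi = meds[int((1 - alpha / 2) * reps) - 1]
    return lo, hi


lo, hi = bootstrap_median_ci(sample)
mean = statistics.fmean(sample)

print(f"mean = {mean:.3f}, 95% bootstrap CI for median = ({lo:.3f}, {hi:.3f})")
print("mean inside the median's CI:", lo <= mean <= hi)
```

With real data you would replace `sample` with the observed values. If the mean falls inside the interval, the data cannot distinguish the two; that is weaker than showing they are equal.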
Interesting graph. I would have thought the Mean would have been generally larger than the median at first considering the mean chases outliers...in other words the red bars would be mean and green be medians. What am I missing?
– Possum-Pie
2 hours ago
Score: 0
In medical statistics, we only ever comment on the shapes and seeming of distributions. The fact that no discrete finite sample can ever be normal is irrelevant and pedantic. I would mark you wrong for that.
If a distribution looks "mostly" normal, we are comfortable with calling it normal. When I describe distributions for a non-statistical audience, I am very comfortable with calling something approximately normal even when I know the normal distribution is not the underlying probability model. I get the sense I would side with your teacher here... but we have no histogram or dataset to verify.
As a tip, I would go through the following inspections very closely:
- Who are the outliers, how many are there, and what are their values?
- Are the data bimodal?
- Do the data seem to take a skewed shape so that some transformation (like a log) would better quantify the "distance" between observations?
- Is there apparent truncation or heaping so that assays or labs are failing to reliably detect a certain range of values?
add a comment |Â
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
The teacher is clearly out of his/her element, and probably should not be teaching statistics. It seems worse to me to teach something wrong than to not teach it at all.
These issues could all be cleared up easily if the distinction between "data" and "process that produced the data" were made more clearly. Data target the process that produced the data. The normal distribution is a model for this process.
It makes no sense to talk about whether the data are normally distributed. For one reason, the data are always discrete. For another reason, the normal distribution describes an infinity of potentially observable quantities, not a finite set of specific observed quantities.
Further, the answer to the question "is the process that produced the data a normally distributed process" is also always "no," regardless of the data. Two simple reasons: (i) any measurements we take are necessarily discrete, being rounded off to some level. (ii) perfect symmetry, like a perfect circle, does not exist in observable nature. There are always imperfections.
At best, the answer to the question "what do these data tell you about normality of the data-generating process" could be given as follows: "these data are consistent with what we would expect to see, had the data truly come from a normally distributed process." That answer correctly does not conclude that the distribution is normal.
These issues are very easily understood by using simulation. Just simulate data from a normal distribution and compare those to the existing data. If the data are counts (0,1,2,3,...), then obviously the normal model is wrong because it does not produce numbers like 0,1,2,3,...; instead, it produces numbers with decimals that go on forever (or at least as far as the computer will allow.) Such simulation should be the first thing you do when learning about the normality question. Then you can more correctly interpret the graphs and summary statistics.
I didnt downgrade your answer, but consider that you're judging a grad professor from the words of a student. How likely it is that a student is right and a teacher is wrong? Isn't it more likely that student is misrepresenting his professor and the context of the conversation?
â Aksakal
55 mins ago
Based on my experience, and on the students words, I'd say it is more likely that the teacher is wrong. There are teachers with little formal training that teach courses, even grad courses, in universities everywhere. If the accreditating agencies only knew the truth!
â Peter Westfall
20 mins ago
add a comment |Â
up vote
3
down vote
The teacher is clearly out of his/her element, and probably should not be teaching statistics. It seems worse to me to teach something wrong than to not teach it at all.
These issues could all be cleared up easily if the distinction between "data" and "process that produced the data" were made more clearly. Data target the process that produced the data. The normal distribution is a model for this process.
It makes no sense to talk about whether the data are normally distributed. For one reason, the data are always discrete. For another reason, the normal distribution describes an infinity of potentially observable quantities, not a finite set of specific observed quantities.
Further, the answer to the question "is the process that produced the data a normally distributed process" is also always "no," regardless of the data. Two simple reasons: (i) any measurements we take are necessarily discrete, being rounded off to some level. (ii) perfect symmetry, like a perfect circle, does not exist in observable nature. There are always imperfections.
At best, the answer to the question "what do these data tell you about normality of the data-generating process" could be given as follows: "these data are consistent with what we would expect to see, had the data truly come from a normally distributed process." That answer correctly does not conclude that the distribution is normal.
These issues are easily understood through simulation. Simulate data from a normal distribution and compare them with the existing data. If the data are counts (0, 1, 2, 3, ...), then the normal model is obviously wrong, because it does not produce numbers like 0, 1, 2, 3, ...; instead, it produces numbers with decimals that go on forever (or at least as far as the computer will allow). Such simulation should be the first thing you do when learning about the normality question. Then you can interpret the graphs and summary statistics more correctly.
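A minimal sketch of that kind of simulation, in Python with only the standard library. The sample size of 52 and the reported kurtosis of -0.961 come from the question; the seed, the number of simulated data sets, and the helper name `skew_kurt` are arbitrary choices made for illustration. The helper uses plain moment estimators, which differ slightly from the small-sample-adjusted versions that SPSS reports, so the comparison is only rough.

```python
import random
import statistics


def skew_kurt(xs):
    """Moment-based sample skewness and excess kurtosis (no small-sample adjustment)."""
    n = len(xs)
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(0)   # fixed seed so the run is repeatable (arbitrary choice)
N_HOMES = 52             # sample size from the question
N_SIMS = 2000            # number of simulated data sets (arbitrary choice)

# Skewness and kurtosis do not depend on location or scale,
# so a standard normal is enough for this comparison.
sims = [[rng.gauss(0.0, 1.0) for _ in range(N_HOMES)] for _ in range(N_SIMS)]
stats = [skew_kurt(s) for s in sims]

# Share of truly normal samples that look at least as flat as the observed data.
share_flat = sum(k <= -0.961 for _, k in stats) / N_SIMS

# A normal model never yields counts: check whether any draw is a whole number.
any_integer = any(x.is_integer() for s in sims for x in s)

print("first simulated values:", [round(x, 4) for x in sims[0][:5]])
print("share with excess kurtosis <= -0.961:", share_flat)
print("any whole-number draw:", any_integer)
```

Comparing the printed share with the observed statistic gives a feel for how unusual the nursing-home sample would be under a normal process; the first few simulated values show why no set of fall counts can literally come from one.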
answered 2 hours ago
Peter Westfall
I didn't downvote your answer, but consider that you're judging a grad professor from the words of a student. How likely is it that a student is right and a teacher is wrong? Isn't it more likely that the student is misrepresenting his professor and the context of the conversation?
– Aksakal
55 mins ago
Based on my experience, and on the student's words, I'd say it is more likely that the teacher is wrong. There are teachers with little formal training who teach courses, even grad courses, in universities everywhere. If the accrediting agencies only knew the truth!
– Peter Westfall
20 mins ago
You're missing the point and are probably also being "difficult," which is not appreciated in the industry. She's showing you a toy example to train you in assessing the normality of a data set. Looking at distribution moments is one way to check normality; for example, the Jarque-Bera test is based on such an assessment.
Yes, the normal distribution is perfectly symmetrical. However, if you draw a sample from a true normal distribution, that sample will most likely not be perfectly symmetrical. This is the point you're completely missing. You can test this very easily yourself. Just generate a sample from a Gaussian distribution and check its moments. They'll never be perfectly "normal," despite the true distribution being such.
Here's a silly Python example. I'm generating 100 samples of 100 random numbers each, then obtaining their means and medians. I print the first sample's mean and median to show that they differ, then show the histogram of the differences between the means and medians. You can see that it's rather narrow, but the difference is basically never zero. Note that the numbers are coming from a normal distribution.
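code:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
# 100 samples (columns) of 100 standard normal draws each (rows)
s = np.random.normal(0, 1, (100, 100))
# mean minus median, one value per sample
diff = np.mean(s, axis=0) - np.median(s, axis=0)
print('sample 0 mean:', np.mean(s[:, 0]), 'median:', np.median(s[:, 0]))
plt.hist(diff)
plt.show()
print('avg mean-median:', np.mean(diff))
outputs:
[Histogram of mean minus median across the 100 samples; image and printed values not preserved.]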
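The moment-based check mentioned above can be sketched as follows. This is only a rough illustration of the Jarque-Bera statistic, computed from the summary numbers in the question (n = 52, skewness 0.114, excess kurtosis -0.961). Two caveats are assumptions of this sketch: the SPSS values are small-sample-adjusted estimates rather than the plain moment estimates the test is defined on, and the chi-square approximation with 2 degrees of freedom is known to be crude for a sample this small. The function name `jarque_bera` is just a label chosen here.

```python
import math


def jarque_bera(n, skew, excess_kurt):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Summary statistics as reported in the question.
jb, p = jarque_bera(n=52, skew=0.114, excess_kurt=-0.961)
print(f"JB = {jb:.3f}, asymptotic p = {p:.3f}")   # JB = 2.114, asymptotic p = 0.348
```

A large p-value here only says the sample moments are consistent with a normal process; it does not show that the process is normal.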
edited 38 mins ago
answered 58 mins ago
Aksakal
(+1) Exactly the point. Confusing the random variable and a sample of realisations from that random variable.
– Xi'an
41 mins ago
I think you and your professor are talking in different contexts. Equality of mean = median = mode is a characteristic of the theoretical distribution, and it is not the only characteristic. You cannot say that any distribution for which this property holds is normal: the t-distribution is also symmetric, but it is not normal. So you are talking about theoretical properties, which always hold for the normal distribution.
Your professor is talking about the distribution of sample data. He is right: you will never get real-life data in which mean = median = mode. This is simply due to sampling error. Similarly, it is very unlikely that you will get a skewness coefficient of exactly zero and an excess kurtosis of exactly zero for sample data. Your professor is just giving you a simple rule for getting an idea of the distribution from the sample statistics, a rule that is not true in general (without further information).
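To put a number on that sampling error, here is a small sketch that recomputes the standard errors shown in the question's output from the sample size alone, then expresses the reported skewness and kurtosis in standard-error units. The formulas are the usual ones for the adjusted estimators under normality; the function names are labels chosen here, and dividing a statistic by its standard error is only one rough convention, not something the professor or this answer prescribes.

```python
import math


def se_skewness(n):
    """Standard error of the adjusted sample skewness under normality."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of the adjusted sample excess kurtosis under normality."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


n = 52                              # sample size from the question
ses, sek = se_skewness(n), se_kurtosis(n)
z_skew = 0.114 / ses                # reported skewness in standard-error units
z_kurt = -0.961 / sek               # reported kurtosis in standard-error units

print(f"SE skewness = {ses:.3f}, SE kurtosis = {sek:.3f}")   # 0.330 and 0.650
print(f"skewness / SE = {z_skew:.2f}, kurtosis / SE = {z_kurt:.2f}")
```

The two standard errors match the .330 and .650 in the question, which shows they depend only on n = 52 and not on the data.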
answered 4 hours ago
Neeraj
Professor is said to be female.
– Nick Cox
4 hours ago
Why you don't get mean=median=mode is mostly because many distributions really are skewed! (Strictly, mean=median=mode is possible with skewed distributions too, despite what many textbooks say.)
– Nick Cox
4 hours ago
I disagree that lack of equality of mean/median/mode = sampling error. Suppose you randomly sampled 52 nursing homes for fall rates. Homes 27, 34, and 52 are chronically short-staffed and always have an above-average number of falls. Those homes push the mean towards the tail, and that isn't due to sampling error.
– Possum-Pie
4 hours ago
@Possum Pie What the data are is secondary here, but you're giving different signals in different places. Here you are talking about several nursing homes, but in your question you state "in a nursing home". Being unclear about even incidental details does not help.
– Nick Cox
4 hours ago
@Nick Cox Sorry, I clarified it. Number of falls/year in a sample of 52 nursing homes
– Possum-Pie
4 hours ago
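The remark in the comments that mean = median = mode can hold for a skewed distribution is easy to check with a tiny made-up data set. The eleven values below are invented purely for illustration (they are not the nursing-home data), and skewness is computed with the plain moment formula.

```python
import statistics

# Invented values: one low outlier, a pile at 0, a few at 1.
data = [-4] + [0] * 6 + [1] * 4

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Moment-based skewness: third central moment over variance^(3/2).
m2 = sum((x - mean) ** 2 for x in data) / len(data)
m3 = sum((x - mean) ** 3 for x in data) / len(data)
skewness = m3 / m2 ** 1.5

print("mean:", mean, "median:", median, "mode:", mode)
print("skewness:", round(skewness, 3))
```

All three measures of centre coincide at 0, yet the skewness is clearly negative, so equal centres alone do not establish symmetry, let alone normality.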
I'm an engineer, so in my world the applied statistician is what I see most, and where I get the most concrete value. If you are going to work in applied statistics, then you need to be solidly grounded in practice over theory: whether or not it is elegant, the aircraft has to fly and not crash.
When I think about this question, the way I approach it, as many of my technical betters here have also done, is to ask "what does it look like in the real world, in the presence of noise?"
The second thing I often do is make a simulation that lets me get my hands around the question.
Here is a very brief exploration:
#show how the mean and the median differ with respect to sample size
#libraries
library(reshape2)
library(ggplot2)
#sample sizes
ssizes <- 10^(seq(from=1, to=3, by=0.25))
ssizes <- round(ssizes)
#loops per sample
n_loops <- 5000
#pre-declare, prep for loop
my_store <- matrix(0,
                   ncol = 3,
                   nrow = n_loops*length(ssizes))
count <- 1
for(i in seq_along(ssizes)) {
  #how many samples
  n_samp <- ssizes[i]
  for(j in 1:n_loops) {
    #draw samples
    y <- rnorm(n = n_samp, mean = 0, sd = 1)
    #store sample size, median, and mean
    my_store[count,1] <- n_samp
    my_store[count,2] <- median(y)
    my_store[count,3] <- mean(y)
    #update
    count <- count + 1
  }
}
#make data into ggplot friendly form
df <- data.frame(my_store)
names(df) <- c("n_samp", "median","mean")
df <- melt(df, id.vars = 1, measure.vars = c("median","mean"))
#make ggplot
ggplot(df, aes(x = as.factor(n_samp),
               y = value,
               fill = variable)) + geom_boxplot() +
  labs(title = "Contrast Median and Mean estimate variation vs. Sample Size",
       x = "Number of Samples",
       y = "Estimated value")
It gives this as the output:
[Box plots of the median and mean estimates at each sample size; image not preserved.]
Note: be careful about the x-axis, because the sample sizes are spaced logarithmically, not uniformly.
I know that the true mean and median are exactly the same, because the code draws every sample from a normal distribution centred at 0. The empirical realization is greatly sensitive to sample size, and unless there are truly infinite samples, it can't ever perfectly match the theory.
You can think about whether the uncertainty in the median envelops the estimated mean, or vice versa. If the best estimate of the mean is within the 95% CI of the estimate for the median, then the data can't tell the difference, and they are consistent with the two being the same, as in theory. If you get more data, then see what it says.
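For readers following the Python example elsewhere in this thread, here is a rough standard-library counterpart of the exploration above. It is a cut-down sketch, not a port: it uses only three of the sample sizes (10, 100, 1000), 500 repetitions instead of 5000, an arbitrary seed, and prints summary numbers instead of drawing box plots.

```python
import random
import statistics

rng = random.Random(1)            # arbitrary seed for repeatability
N_LOOPS = 500                     # repetitions per sample size (fewer than above)
SAMPLE_SIZES = [10, 100, 1000]    # a subset of the sizes used above

spread = {}
for n in SAMPLE_SIZES:
    means, medians = [], []
    for _ in range(N_LOOPS):
        # one sample of size n from a standard normal
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(y))
        medians.append(statistics.median(y))
    # centre and spread of each estimator across the repetitions
    spread[n] = (statistics.fmean(means), statistics.stdev(means),
                 statistics.fmean(medians), statistics.stdev(medians))

for n, (avg_mean, sd_mean, avg_med, sd_med) in spread.items():
    print(f"n={n:5d}  mean: {avg_mean:+.3f} (sd {sd_mean:.3f})  "
          f"median: {avg_med:+.3f} (sd {sd_med:.3f})")
```

Both estimators centre on 0 and tighten as the sample size grows, with the median being somewhat noisier than the mean at every size. Neither is systematically larger, because a symmetric distribution has no long tail for the mean to chase.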
Interesting graph. I would have thought the mean would generally be larger than the median at first, considering that the mean chases outliers. In other words, the red bars would be means and the green ones medians. What am I missing?
– Possum-Pie
2 hours ago
add a comment |Â
up vote
0
down vote
I'm an engineer, so in my world, the applied statistician is what I see most, and get the most concrete value. If you are going to work in applied, then you need to be solidly grounded in practice over theory: whether or not it is elegant, the aircraft has to fly and not crash.
When I think about this question the way I approach it, as many of my technical betters here have also done, is to think about "what does it look like in the real world with the presence of noise".
The second thing that I do is, often, to make a simulation that allows me to get my hands around the question.
Here is a very brief exploration:
#show how the mean and the median differ with respect to sample size
#libraries
library(reshape2)
library(ggplot2)
#sample sizes
ssizes <- 10^(seq(from=1, to=3, by=0.25))
ssizes <- round(ssizes)
#loops per sample
n_loops <- 5000
#pre-declare, prep for loop
my_store <- matrix(0,
ncol = 3,
nrow = n_loops*length(ssizes))
count <- 1
for(i in 1:length(ssizes))
#how many samples
n_samp <- ssizes[i]
for(j in 1:n_loops)
#draw samples
y <- 0
y <- rnorm(n = n_samp,mean = 0, sd = 1)
#compute mean, median, mode
my_store[count,1] <- n_samp
my_store[count,2] <- median(y)
my_store[count,3] <- mean(y)
#update
count = count + 1
#make data into ggplot friendly form
df <- data.frame(my_store)
names(df) <- c("n_samp", "median","mean")
df <- melt(df, id.vars = 1, measure.vars = c("median","mean"))
#make ggplot
ggplot(df, aes(x=as.factor(n_samp),
y = value,
fill = variable)) + geom_boxplot() +
labs(title = "Contrast Median and Mean estimate variation vs. Sample Size",
x = "Number of Samples",
y = "Estimated value")
It gives this as the output:
Note: be careful about the x-axis, because it is log-scaled, not uniform-scaled.
I know that the mean and median are exactly the same. The code says it. The empirical realization is greatly sensitive to sample size, and if there aren't truly infinite samples, then they can't ever perfectly match with theory.
You can think about whether the uncertainty in the median envelopes the estimated mean or vice versa. If the best estimate of the mean is within the 95% CI of the estimate for the median, then the data can't tell the difference. The data says they are the same in theory. If you get more data, then see what it says.
Interesting graph. I would have thought the Mean would have been generally larger than the median at first considering the mean chases outliers...in other words the red bars would be mean and green be medians. What am I missing?
â Possum-Pie
2 hours ago
add a comment |Â
up vote
0
down vote
up vote
0
down vote
I'm an engineer, so in my world, the applied statistician is what I see most, and get the most concrete value. If you are going to work in applied, then you need to be solidly grounded in practice over theory: whether or not it is elegant, the aircraft has to fly and not crash.
When I think about this question the way I approach it, as many of my technical betters here have also done, is to think about "what does it look like in the real world with the presence of noise".
The second thing that I do is, often, to make a simulation that allows me to get my hands around the question.
Here is a very brief exploration:
# show how the sample mean and the sample median vary with sample size
# libraries
library(reshape2)
library(ggplot2)
# sample sizes: 10, 18, 32, ..., 1000 (evenly spaced on a log scale)
ssizes <- 10^(seq(from = 1, to = 3, by = 0.25))
ssizes <- round(ssizes)
# loops per sample size
n_loops <- 5000
# pre-declare, prep for loop
my_store <- matrix(0,
                   ncol = 3,
                   nrow = n_loops * length(ssizes))
count <- 1
for (i in seq_along(ssizes)) {
  # how many samples
  n_samp <- ssizes[i]
  for (j in 1:n_loops) {
    # draw samples from a standard normal
    y <- rnorm(n = n_samp, mean = 0, sd = 1)
    # compute median and mean
    my_store[count, 1] <- n_samp
    my_store[count, 2] <- median(y)
    my_store[count, 3] <- mean(y)
    # update
    count <- count + 1
  }
}
# make data into ggplot friendly form
df <- data.frame(my_store)
names(df) <- c("n_samp", "median", "mean")
df <- melt(df, id.vars = 1, measure.vars = c("median", "mean"))
# make ggplot
ggplot(df, aes(x = as.factor(n_samp),
               y = value,
               fill = variable)) + geom_boxplot() +
  labs(title = "Contrast Median and Mean estimate variation vs. Sample Size",
       x = "Number of Samples",
       y = "Estimated value")
It gives this as the output: [figure: side-by-side boxplots of the median and mean estimates at each sample size]
Note: be careful about the x-axis, because the sample sizes are log-spaced, not uniformly spaced.
I know that the true mean and median are exactly the same (both are 0), because the code draws from a standard normal. The empirical realization is greatly sensitive to sample size, and unless there are truly infinite samples, the estimates can never perfectly match theory.
You can think about whether the uncertainty in the median envelops the estimated mean, or vice versa. If the best estimate of the mean lies within the 95% CI of the estimate for the median, then the data can't tell the two apart: they are consistent with the mean and median being equal, as theory says. If you get more data, see what it says then.
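The check described in the last paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample is simulated (the sample size of 52 is borrowed from the question, the data are not), the interval is a percentile bootstrap interval, which is one of several ways to get a CI for a median, and the names `n_boot`, `boot_medians`, `ci_median` and `mean_in_ci` are made up for this example.

```r
# Sketch: does a bootstrap 95% CI for the median contain the sample mean?
set.seed(1)                                  # arbitrary seed, for reproducibility only
y <- rnorm(n = 52, mean = 0, sd = 1)         # hypothetical sample, NOT the real data

n_boot <- 2000                               # number of bootstrap resamples (assumed)
# resample y with replacement and record the median of each resample
boot_medians <- replicate(n_boot, median(sample(y, replace = TRUE)))

# percentile bootstrap interval for the median
ci_median <- unname(quantile(boot_medians, probs = c(0.025, 0.975)))

# TRUE if the sample mean lies inside the interval for the median
mean_in_ci <- (mean(y) >= ci_median[1]) && (mean(y) <= ci_median[2])

print(ci_median)
print(mean(y))
print(mean_in_ci)
```

If `mean_in_ci` is `TRUE`, the sample gives no grounds for saying the mean and median differ; it does not show that they are equal.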
answered 2 hours ago
EngrStudent
In medical statistics, we only ever comment on the shape and appearance of distributions. The fact that no discrete finite sample can ever be normal is irrelevant and pedantic. I would mark you wrong for that.
If a distribution looks "mostly" normal, we are comfortable with calling it normal. When I describe distributions for a non-statistical audience, I am very comfortable with calling something approximately normal even when I know the normal distribution is not the underlying probability model. I get the sense I would side with your teacher here... but we have no histogram or dataset to verify.
As a tip, I would go through the following inspections very closely:
- Who are the outliers, how many are there, and what are their values?
- Are the data bimodal?
- Do the data seem to take a skewed shape so that some transformation (like a log) would better quantify the "distance" between observations?
- Is there apparent truncation or heaping so that assays or labs are failing to reliably detect a certain range of values?
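A rough sketch of how these inspections might look in base R follows. Everything specific here is an assumption for illustration: the counts are simulated with `rpois` because the real dataset is not available, the helper `skew` is a simple moment-based estimate (software that applies a small-sample adjustment may report a slightly different value), and counting density peaks is only a crude screen for bimodality, not a test.

```r
# Hypothetical count data standing in for the real dataset (which we do not have)
set.seed(42)                                 # arbitrary seed
falls <- rpois(n = 52, lambda = 11)

# 1. Outliers: values beyond 1.5 * IQR from the hinges (the boxplot rule)
outliers <- boxplot.stats(falls)$out

# 2. Bimodality: count local maxima of a kernel density estimate (crude screen)
dens <- density(falls)
n_peaks <- sum(diff(sign(diff(dens$y))) == -2)

# 3. Skew: simple moment-based skewness, before and after a log transform
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew_raw <- skew(falls)
skew_log <- skew(log(falls + 1))             # +1 guards against log(0)

# 4. Heaping or truncation: look at the frequency of each raw value
value_counts <- table(falls)

print(outliers)
print(n_peaks)
print(c(raw = skew_raw, log = skew_log))
print(value_counts)
```

A histogram (`hist(falls)`) and a normal quantile plot (`qqnorm(falls); qqline(falls)`) are the natural companions to these numbers, since the point of the list is to look at the data rather than rely on a single summary.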
edited 1 hour ago
answered 1 hour ago
AdamO
2
I think you are being a little pedantic; it makes no sense to talk about the median or mode of a continuous distribution.
– user2974951
5 hours ago
2
I would like to know your professor's exact wording. In principle a normal distribution has mean, median and mode identical (but so do many other distributions) and has skewness 0 and (so-called excess) kurtosis 0 (and so do some other distributions). At best a distribution with (e.g.) slight skewness or kurtosis is approximately normal. Note that almost all real data are at best approximations to named distributions in the theoretical menagerie.
– Nick Cox
4 hours ago
7
I don't agree with @user2974951. In company with every good text I know, I am perfectly happy thinking that the normal distribution has a median and mode. And that applies widely to continuous distributions, although I don't doubt that pathological counter-examples can be identified.
– Nick Cox
4 hours ago
1
I didn't call you or your question pedantic, but the one answer to date is, I think, broadly on the right lines. The main issue here seems to be that there are two different questions: what is a normal distribution in principle, and what is approximately normal in practice. (FWIW, I don't readily believe the data -- unless the people who fell over many, many times are now dead or in hospital. For other reasons, I know more than I want to about nursing homes. It is also crucial whether you are talking about falls per individual or some average rate of falls for a nursing home, e.g., falls per week per 100 patients.)
– Nick Cox
4 hours ago
1
Thanks for the specific detail, which shows merit on both sides, but I am not grading either of you. However, I dissent strongly from the term critical values as used by your Professor, as the limits $\pm 1$ for skewness and kurtosis have no standing whatsoever beyond being rules of thumb that might be used. Depending on what you are doing with the data, a skewness $< 1$ might go along with wanting to transform the data and a skewness of $> 1$ might go along with leaving the data as they are, and similarly for kurtosis.
– Nick Cox
3 hours ago