Now let us consider the probability of a certain sample mean. Suppose there is a sample of size \(n\). How special (how rare) is this sample? To answer such a question, we first calculate the mean of the sample. If the mean is close to the population mean, then the sample in question is probably not a special sample. If the mean is far away from the population mean, then the sample may not be a randomly chosen sample: it’s probably a rare sample. But how can we make such a decision?
Well, computing the z-score of the mean relative to the population is obviously one possibility. Suppose the sample mean is \(\bar{x}\), the population mean is \(\mu\), and the population sd is \(\sigma\). Then, is the z-score of the sample mean calculated by the following formula?
\[\dfrac{\bar{x}-\mu}{\sigma}\]
The answer is no. Why not? To see this, let us do an exercise.
Here is our favorite data set: the exam scores of 100 students at A High School.
a <- c(29, 30, 40, 41, 43, 44, 44, 45, 46, 47, 48, 48, 50, 51, 51, 52, 52, 55, 55, 55, 55, 56, 56, 57, 58, 58, 59, 59, 59, 59, 60, 60, 60, 60, 62, 62, 62, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 65, 66, 66, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 70, 70, 70, 71, 71, 71, 71, 72, 72, 72, 72, 72, 73, 73, 73, 75, 75, 75, 75, 75, 77, 77, 77, 80, 80, 81, 81, 82, 82, 84, 86, 86, 88, 89, 91, 94)
- What are the mean, variance, and sd of this data?
- Plot a histogram of the data. (Use `hist()`.)
- Sample a set of 10 data points. (Use `sample()`.)
- Repeat this procedure 100 times to create a data set of 100 mean values. (Use a `for` loop: e.g., `for (i in c(1:100)) {}`.) Put it in a vector variable called `samplemeans`.
- What is the mean of `samplemeans`?
- What are its variance and sd?
- Plot a histogram using `hist(samplemeans, xlim=c(20,100))`.
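One way to work through the steps above is sketched below. It repeats the `a` vector so that it runs on its own. The seed value and the names `n_draws` and `n_rep` are choices made here for illustration and are not part of the exercise.

```r
# Exam scores of the 100 students (same values as the vector `a` above)
a <- c(29, 30, 40, 41, 43, 44, 44, 45, 46, 47,
       48, 48, 50, 51, 51, 52, 52, 55, 55, 55,
       55, 56, 56, 57, 58, 58, 59, 59, 59, 59,
       60, 60, 60, 60, 62, 62, 62, 63, 63, 63,
       63, 63, 63, 64, 64, 64, 64, 65, 66, 66,
       67, 67, 68, 68, 68, 68, 68, 68, 69, 69,
       69, 69, 70, 70, 70, 70, 70, 71, 71, 71,
       71, 72, 72, 72, 72, 72, 73, 73, 73, 75,
       75, 75, 75, 75, 77, 77, 77, 80, 80, 81,
       81, 82, 82, 84, 86, 86, 88, 89, 91, 94)

# Mean, variance, and sd of the full data.
# Note: var() and sd() divide by n - 1, not n.
mean(a)
var(a)
sd(a)
hist(a)

set.seed(1)      # arbitrary seed, only to make the run repeatable
n_draws <- 10    # size of each sample (assumed name)
n_rep   <- 100   # number of repeated samples (assumed name)

# Collect the mean of each repeated sample
samplemeans <- numeric(n_rep)
for (i in c(1:n_rep)) {
  # sample() draws without replacement by default
  samplemeans[i] <- mean(sample(a, n_draws))
}

mean(samplemeans)
var(samplemeans)
sd(samplemeans)
hist(samplemeans, xlim = c(20, 100))
```

The histogram of `samplemeans` should come out clearly narrower than the histogram of `a`: means of 10 scores vary less than individual scores do.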
Now going back to the original question, how can we evaluate the rarity of a specific sample? When we evaluate a specific data point \(x\) in a population, we calculate its z-score using the following formula:
\[\dfrac{x-\mu}{\sigma}\]
But what we want to evaluate is the relative position (i.e., the z-score) of the sample mean. \(\sigma\) represents only the standard deviation of the individual data points in the population; it does not describe how much sample means vary from one sample to the next. To calculate the z-score of the sample mean, we need the standard deviation of the sample means, not of the population data points. If the population follows a normal distribution and we repeatedly draw samples of size \(n\), it is known mathematically that the variance of the sample means \(\hat{\sigma}^2\) is:
\[\hat{\sigma}^2 = \dfrac{\sigma^2}{n}\]
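A brief sketch of why this holds, assuming the \(n\) draws \(X_1, \dots, X_n\) are independent and each has variance \(\sigma^2\) (the symbols \(X_i\) are introduced here only for illustration):
\[\hat{\sigma}^2 = \mathrm{Var}\!\left(\dfrac{1}{n}\sum_{i=1}^{n} X_i\right) = \dfrac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \dfrac{1}{n^2}\cdot n\sigma^2 = \dfrac{\sigma^2}{n}\]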
Thus the standard deviation would be:
\[\hat{\sigma} = \dfrac{\sigma}{\sqrt{n}}\]
Thus the z-score of the sample mean \(\bar{x}\) should be:
\[z = \dfrac{\bar{x}-\mu}{\hat{\sigma}} = \dfrac{\bar{x}-\mu}{\dfrac{\sigma}{\sqrt{n}}} \]
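To make the difference between the two formulas concrete, here is a minimal sketch. The helper name `z_of_mean` and all the numbers are made up for illustration; they are not the A High School data.

```r
# Hypothetical helper: z-score of a sample mean
z_of_mean <- function(xbar, mu, sigma, n) {
  se <- sigma / sqrt(n)   # standard deviation of the sample means
  (xbar - mu) / se
}

# Made-up illustration values (not the A High School data)
mu    <- 100   # population mean
sigma <- 15    # population sd
n     <- 25    # sample size
xbar  <- 106   # observed sample mean

# Dividing by sigma treats the sample mean like a single data point
z_wrong <- (xbar - mu) / sigma
# Dividing by sigma / sqrt(n) gives the z-score of the sample mean
z_mean  <- z_of_mean(xbar, mu, sigma, n)

z_wrong   # 0.4
z_mean    # 2

# Probability of a sample mean at least this high under a normal distribution
p_upper <- pnorm(z_mean, lower.tail = FALSE)
p_upper
```

With these values the same 6-point difference looks unremarkable against the spread of individual data points, but sits two standard deviations out against the spread of sample means.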
- If you sample a set of 10 data points from the A High School data and calculate its mean, what is the minimum possible value?
- What is the z-score of this set?
- If we do random sampling, what is the probability of obtaining this set (under a normal distribution)? (Use `pnorm()`.)
At A High School’s cultural festival, a quiz competition was organized with randomly assigned teams of 10 students and a teacher working together. The team assignments were made by one of the teachers, Mr. K, who took full responsibility for the process. However, there arose suspicions that Mr. K, who claimed to have randomly assigned the students, might have unfairly included more top-performing students in his own team to gain an advantage. When examining the placement test scores of the students in Team K, the following scores were noted:
k <- c(47,55,63,68,75,77,80,82,86,91)
The average score was 72.4 points. Given that the average score for all 100 students was 65 points, the difference was 7.4 points, which seemed quite high. However, Mr. K insisted, “I am being falsely accused! The average score can randomly be higher or lower, and the 7.4 point difference is just a coincidence!”
- Calculate the z-score of Team K’s average score.
- What is the probability of obtaining an average at least this high, assuming that the team assignment was done randomly?
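One possible way to set up these two calculations is sketched below. It makes several assumptions that the text does not fix: the 100 scores are treated as the whole population (so the sd divides by N rather than N - 1), the probability is one-sided, and the \(\sigma/\sqrt{n}\) formula is applied as-is. The variable names are chosen here for illustration.

```r
# Scores of all 100 students and of Team K (same values as above)
a <- c(29, 30, 40, 41, 43, 44, 44, 45, 46, 47,
       48, 48, 50, 51, 51, 52, 52, 55, 55, 55,
       55, 56, 56, 57, 58, 58, 59, 59, 59, 59,
       60, 60, 60, 60, 62, 62, 62, 63, 63, 63,
       63, 63, 63, 64, 64, 64, 64, 65, 66, 66,
       67, 67, 68, 68, 68, 68, 68, 68, 69, 69,
       69, 69, 70, 70, 70, 70, 70, 71, 71, 71,
       71, 72, 72, 72, 72, 72, 73, 73, 73, 75,
       75, 75, 75, 75, 77, 77, 77, 80, 80, 81,
       81, 82, 82, 84, 86, 86, 88, 89, 91, 94)
k <- c(47, 55, 63, 68, 75, 77, 80, 82, 86, 91)

mu <- mean(a)
# Assumption: `a` is the whole population, so divide by N here.
# sd(a) would divide by N - 1 and give a slightly larger value.
sigma <- sqrt(mean((a - mu)^2))

n  <- length(k)
se <- sigma / sqrt(n)            # sd of the means of samples of size n

z_k <- (mean(k) - mu) / se       # z-score of Team K's average
# One-sided probability of an average at least this high
p_k <- pnorm(z_k, lower.tail = FALSE)

z_k
p_k
```

Whether the resulting probability is small enough to doubt Mr. K’s claim depends on the threshold one is willing to accept.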