1 Estimating probability

We have learned the concept of the z-score: it measures the “distance” between a specific data point and the mean of the data. Two great things about the z-score are:

  1. It takes both mean and variance (standard deviation) into account
  2. Thanks to the use of the standard deviation, it is a standardized measure. It is not affected by the numerical magnitude of the data.

The second point is important: the z-score works as a sort of “standard” for evaluating the relative position of a data point in a data set, regardless of whether we deal with data in very large numbers (e.g., how big is the earth compared to other planets?) or data in very small numbers (e.g., how big is this flea compared to other fleas?).

Our next question is this: when we get a z-score of, say, 1.2, for a specific data point, how special, how precious, how rare is this data point? If this z-score is very difficult to obtain, then this data point should be considered special, precious, and rare. If this z-score of 1.2 is easy to find, then this data point is not that special.

In order to evaluate a z-score, we attempt to calculate the probability of obtaining a particular z-score. But how?

Let us consider our favorite exam data from A High School:

a <- c(29, 30, 40, 41, 43, 44, 44, 45, 46, 47, 48, 48, 50, 51, 51, 52, 52, 55, 55, 55, 55, 56, 56, 57, 58, 58, 59, 59, 59, 59, 60, 60, 60, 60, 62, 62, 62, 63, 63, 63, 63, 63, 63, 64, 64, 64, 64, 65, 66, 66, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 70, 70, 70, 70, 70, 71, 71, 71, 71, 72, 72, 72, 72, 72, 73, 73, 73, 75, 75, 75, 75, 75, 77, 77, 77, 80, 80, 81, 81, 82, 82, 84, 86, 86, 88, 89, 91, 94)

The mean and the sd are:

mean(a)
## [1] 65
sd(a)
## [1] 12.67304

Now, suppose that one student was absent from the exam and is going to take the same exam as a make-up. Suppose further that we don’t know how good or bad this student is, because the make-up exam is held even before the first class of the first year. Since this student has passed the admission procedure like all the other students, we assume that she has a similar level of ability to the other students. Then how many points is she likely to score on this exam? What is the probability of this student scoring 72 or more? The probability of getting 65 or more? 40 or less? 100?

By the “counting” method, we may say that the probability of scoring 72 or more is 29%, that of 65 or more is 53%, that of 40 or less is 3%, and that of 100 is 0%.

# Counting the number of elements in a that are equal to or more than 72 
length(a[a>=72])
## [1] 29
# Counting the number of elements in a that are equal to or more than 65 
length(a[a>=65])
## [1] 53
# Counting the number of elements in a that are equal to or less than 40
length(a[a<=40])
## [1] 3
# Counting the number of elements in a that are equal to or more than 100 
length(a[a>=100])
## [1] 0

One problem with this method is that the data points are discrete and thus the data is not dense. For example, the probability of scoring 78 or more, that of scoring 79 or more, and that of scoring 80 or more would all be 13%, because 78 and 79 are missing from the data. Similarly, the probability of scoring 39 or less would be identical to that of scoring 30 or less, while the probability of scoring 28 or less and that of scoring 95 or more would be zero, because these values are missing from the data. If these values are missing for a good reason, then we might be happy with these calculations. But it is quite likely that their absence is just accidental.
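We can confirm these coinciding probabilities with the same counting approach, dividing each count by the number of students (which happens to be exactly 100 here):

# 78 and 79 do not occur in the data, so these proportions coincide:
length(a[a >= 78]) / length(a)
## [1] 0.13
length(a[a >= 80]) / length(a)
## [1] 0.13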

To overcome this problem, we might try a more sophisticated method: assuming a dense distribution of data points based on the mean and the variance/sd.

2 Normal distribution

For much of the data in the world, the distribution of the data points is centered around the mean: the majority of the data points lie around the mean value, and there are fewer and fewer data points as we move away from the mean. The histogram of such data looks like a symmetric mound, as we can see in the first histogram of the exam data from A High School.

A mathematical formula that generates such an idealized distribution, called the normal distribution, was proposed in the 18th century. The probability density function of a normal distribution with mean \(\mu\) and sd \(\sigma\) is defined as follows:

\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

Well, for the present purpose, you do not have to worry about what this formula means. In R, dnorm represents the above function.

When a variable \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), this is written as follows:

\[X \sim N(\mu, \sigma^2)\]

In R, the density function dnorm for a normal distribution is, a bit confusingly, specified with mean \(\mu\) and standard deviation \(\sigma\) (not variance \(\sigma^2\)).

For example, a normal distribution with \(\mu=50\) and \(\sigma^2 = 100\) (namely, \(\sigma=10\)) [note: this is the scale used for the Japanese standardized score, 偏差値 hensachi] is conventionally represented as:

\[X \sim N(50, 100)\] which is shorthand for the distribution defined by the following density function:

\[ f(x)=\frac{1}{10\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-50}{10}\right)^2} \] In R, this function is represented as dnorm(x, mean=50, sd=10). Graphically:

# Graph [0 and 100 specify the range of the x-axis; type="l" means a "line" graph]:
curve(dnorm(x, mean=50, sd=10), 0, 100, type="l", ylab="Density")

You can put any number in x to obtain its density (the value on the y-axis).

# Density (value in the y-axis) for 50 in the x-axis:
dnorm(50, mean=50, sd=10)
## [1] 0.03989423
# Density (value in the y-axis) for 70 in the x-axis:
dnorm(70, mean=50, sd=10)
## [1] 0.005399097
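These values are exactly what the density formula above gives. As a quick check, we can compute the formula by hand for x = 70:

# Density at x = 70 computed directly from the formula with mu = 50, sigma = 10;
# it reproduces dnorm(70, mean=50, sd=10):
mu <- 50; sigma <- 10; x <- 70
1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
## [1] 0.005399097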

The probability of obtaining x or less corresponds to the area under the curve to the left of x, and it is computed by pnorm(x, mean=50, sd=10) in R.

# What is the probability (= the area under the curve) of obtaining 50 or less?
pnorm(50, mean=50, sd=10)
## [1] 0.5
# The probability corresponds to the shaded area.
curve(dnorm(x, mean=50, sd=10), 0, 100, type="l", ylab="Density")
xvals <- seq(0, 50, length=100)   # 100 equally spaced points
dvals <- dnorm(xvals,mean=50,sd=10)  # corresponding heights of the curve
polygon(c(xvals,rev(xvals)),  c(rep(0,100),rev(dvals)),angle=60,density=20)

The probability of obtaining 70 or less:

pnorm(70, mean=50, sd=10)
## [1] 0.9772499
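In fact, pnorm(70, mean=50, sd=10) is exactly the area under the dnorm curve to the left of 70; as a quick sanity check, we can reproduce it by numerical integration:

# Numerically integrating the density from -Inf to 70 gives the same
# probability as pnorm(70, mean=50, sd=10), i.e., approximately 0.9772.
integrate(dnorm, lower=-Inf, upper=70, mean=50, sd=10)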

The probability of obtaining 70 or more can be computed using the lower.tail=F option, or simply by subtracting the lower-tail probability from 1.

pnorm(70, mean=50, sd=10, lower.tail=F)
## [1] 0.02275013
1-pnorm(70, mean=50, sd=10)
## [1] 0.02275013
curve(dnorm(x, mean=50, sd=10), 0, 100, type="l", ylab="Density")
xvals <- seq(70, 100, length=100)   # 100 equally spaced points
dvals <- dnorm(xvals,mean=50,sd=10)  # corresponding heights of the curve
polygon(c(xvals,rev(xvals)),  c(rep(0,100),rev(dvals)),angle=60,density=20)

# Note: without the lower.tail=F option, the probability will be the lower area:
pnorm(70, mean=50, sd=10)
## [1] 0.9772499
curve(dnorm(x, mean=50, sd=10), 0, 100, type="l", ylab="Density")
xvals <- seq(0, 70, length=100)   # 100 equally spaced points
dvals <- dnorm(xvals,mean=50,sd=10)  # corresponding heights of the curve
polygon(c(xvals,rev(xvals)),  c(rep(0,100),rev(dvals)),angle=60,density=20)

# What is the upper x value that corresponds to probability 0.05?
qnorm(0.05,mean=50,sd=10,lower.tail=F)
## [1] 66.44854
# What is the upper x value that corresponds to probability 0.025?
qnorm(0.025,mean=50,sd=10,lower.tail=F)
## [1] 69.59964
# What is the lower x value that corresponds to probability 0.025?
qnorm(0.025,mean=50,sd=10)
## [1] 30.40036
# 2 tail probabilities that sum up to 0.05
curve(dnorm(x, mean=50, sd=10), 0, 100, type="l", ylab="Density")
xvals <- seq(0, qnorm(0.025,mean=50,sd=10), length=100)   # 100 equally spaced points
dvals <- dnorm(xvals,mean=50,sd=10)  # corresponding heights of the curve
polygon(c(xvals,rev(xvals)),  c(rep(0,100),rev(dvals)),angle=60,density=20)

xvals <- seq(qnorm(0.025,mean=50,sd=10,lower.tail=F),100,length=100)   # 100 equally spaced points
dvals <- dnorm(xvals,mean=50,sd=10)  # corresponding heights of the curve
polygon(c(xvals,rev(xvals)),  c(rep(0,100),rev(dvals)),angle=60,density=20)
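The function qnorm works in the opposite direction from pnorm: given a probability, it returns the corresponding x value (the quantile). A quick check that the two functions are inverses of each other:

# pnorm and qnorm are inverses: converting a probability to a quantile
# and back recovers the original probability.
q <- qnorm(0.05, mean=50, sd=10, lower.tail=F)
q
## [1] 66.44854
pnorm(q, mean=50, sd=10, lower.tail=F)
## [1] 0.05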

When \(X\) is normally distributed with \(\mu=0\) and \(\sigma^2=1\), its distribution is called the standard normal distribution (\(N(0,1)\)). Note that the graph has basically the same shape as before, except that it is centered at 0 and the range of the x-axis is 10 times smaller.

curve(dnorm(x, mean=0, sd=1), -5, 5, type="l")
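This is where the z-score comes back in: the z-score \((x-\mu)/\sigma\) converts any normal distribution into the standard normal distribution, so a value and its z-score have the same tail probability. For example, using the \(N(50, 100)\) distribution from above:

# P(X >= 70) under N(50, 100) equals P(Z >= 2) under N(0, 1),
# because the z-score of 70 is (70 - 50) / 10 = 2.
pnorm(70, mean=50, sd=10, lower.tail=F)
## [1] 0.02275013
pnorm(2, lower.tail=F)   # standard normal by default (mean=0, sd=1)
## [1] 0.02275013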

2.1 Exercise: Calculating the probability using a normal distribution

We saw that using the “counting” method, the probability of scoring 72 or more would be 29%, that of 65 or more would be 53%, that of 40 or less would be 3%, and that of 100 would be 0%.

Let us now use the normal distribution based on the mean and standard deviation of the A High School data and calculate the probabilities of the absentee scoring the following:

  1. 72 or more
  2. 65 or more
  3. 40 or less
  4. 100 (or more)
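As a hint, each of these can be computed with pnorm, plugging in mean(a) and sd(a) for the parameters; for instance, one possible sketch for the first question:

# Probability of scoring 72 or more, assuming the scores follow a normal
# distribution with the sample mean and sd of the A High School data:
pnorm(72, mean=mean(a), sd=sd(a), lower.tail=F)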