17  Enrichment: Probability Tools

Things you should know that aren’t on the immediate path

This is optional enrichment material—probability tools that are good to know but not directly used in the main course content.

The Union Bound


The union bound states that for any events \(A\) and \(B\), \[ \P(A \cup B) \le \P(A) + \P(B). \]

More generally, for events \(A_1 \ldots A_k\), \[ \P(A_1 \cup \cdots \cup A_k) \le \P(A_1) + \cdots + \P(A_k). \]

This follows from inclusion-exclusion: \(\P(A \cup B) = \P(A) + \P(B) - \P(A \cap B)\), and the intersection has non-negative probability. The \(k\)-event version follows by applying the two-event bound repeatedly.
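As a quick sanity check, we can verify the bound numerically on a pair of overlapping events. This is a hypothetical example (a fair six-sided die), not something from the course data.

```r
# Union bound check on a fair six-sided die:
# A = "roll is even" = {2,4,6},  B = "roll is at least 4" = {4,5,6}.
rolls = 1:6
A = rolls %% 2 == 0
B = rolls >= 4

mean(A | B)         # P(A u B) = 4/6
mean(A) + mean(B)   # P(A) + P(B) = 1, so the bound holds
mean(A & B)         # the slack, P(A n B) = 2/6
```

The gap between the two sides is exactly the intersection's probability, as inclusion-exclusion says.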

TODO: Add exercises on applying the union bound.

Markov’s Inequality and Consistency

Convergence in mean square implies convergence in probability. Let’s use Markov’s inequality to see why.

Markov’s Inequality

Markov’s inequality says that for a non-negative random variable \(X\) and any \(t > 0\), \[ P(X \ge t) \le \frac{\mathop{\mathrm{E}}[X]}{t}. \]

The usual proof of Markov’s inequality is based on a few simple observations.

  1. The expectation of the indicator variable \(1_{\ge t}(X)\) is the probability that \(X\) is at least \(t\), \(P(X \ge t)\).
  2. If we have some function \(u_t\) that's everywhere at least as large as \(1_{\ge t}\), i.e., one satisfying \(u_t(x) \ge 1_{\ge t}(x)\) for all \(x\), we know that \(\mathop{\mathrm{E}}[u_t(X)] \ge \mathop{\mathrm{E}}[1_{\ge t}(X)]\) for any random variable \(X\). If it's at least as large only for non-negative \(x\), then \(\mathop{\mathrm{E}}[u_t(X)] \ge \mathop{\mathrm{E}}[1_{\ge t}(X)]\) for any non-negative random variable \(X\).
  3. The function \(u_t(x)=x/t\) is such a function.1
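Before proving the inequality, it can help to watch it hold numerically. Here's a small simulated sketch; the Exponential(1) distribution is just an arbitrary non-negative example, not anything from the course.

```r
# Markov's inequality check: X ~ Exponential(rate = 1), so E[X] = 1
# and P(X >= t) = exp(-t) exactly.
set.seed(1)
X = rexp(1e5, rate = 1)
t = 3

mean(X >= t)   # simulated P(X >= t), close to exp(-3) = 0.0498
mean(X) / t    # the Markov bound E[X]/t = 1/3: valid, but far from tight
```

The bound holds, but it's loose. That's typical: Markov's inequality trades sharpness for making no assumptions beyond non-negativity.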
Proving Markov’s Inequality

Prove Markov’s inequality.


Often, instead of applying this bound directly to the random variable we're interested in, e.g. \(X=|\hat\theta - \theta|\), we apply it to the random variable's square. Since \(|X| \ge \epsilon\) if and only if \(X^2 \ge \epsilon^2\), the probability that \(|X| \ge \epsilon\) is the same as the probability that \(X^2 \ge \epsilon^2\). Applying Markov's inequality to the random variable \(X^2\) gives us a bound in terms of \(X\)'s mean square. In the specific case that \(X\) is \(|\hat\theta-\theta|\), this is the mean squared error of the estimator \(\hat\theta\), \(\RMSE(\hat\theta)^2=\mathop{\mathrm{E}}[(\hat\theta-\theta)^2]\).

\[ P(X \ge \epsilon) = P(X^2 \ge \epsilon^2) \leq \frac{\mathop{\mathrm{E}}[X^2]}{\epsilon^2} \qqtext{ e.g.} P(|\hat\theta - \theta| \ge \epsilon) = P((\hat\theta - \theta)^2 \ge \epsilon^2) \leq \frac{\mathop{\mathrm{E}}[(\hat\theta - \theta)^2]}{\epsilon^2} \]

This tells us that, if the root-mean-squared error of \(\hat\theta\) goes to zero, then the probability that \(\hat\theta\) is any distance \(\epsilon\) away from \(\theta\) goes to zero, i.e., consistency in mean-square implies consistency in probability.
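We can see this happen in a simulation. The setup below is illustrative, not from the course data: for the sample mean of \(n\) Bernoulli(\(\theta\)) draws, the mean squared error is \(\theta(1-\theta)/n\), so the Markov bound on the miss probability shrinks as \(n\) grows.

```r
# Sample mean of n Bernoulli(theta) draws. Its mean squared error is
# theta*(1-theta)/n, so Markov's inequality applied to the square gives
# P(|theta.hat - theta| >= eps) <= theta*(1-theta)/(n*eps^2).
set.seed(1)
theta = 0.3
eps = 0.05
for (n in c(100, 1000, 10000)) {
  theta.hat = replicate(2000, mean(rbinom(n, 1, theta)))   # 2000 simulated estimates
  miss.prob = mean(abs(theta.hat - theta) >= eps)          # simulated miss probability
  bound = min(theta * (1 - theta) / (n * eps^2), 1)        # Markov bound, capped at 1
  cat("n =", n, " miss prob =", miss.prob, " bound =", bound, "\n")
}
```

Both the simulated miss probability and the bound go to zero as \(n\) grows. The bound is loose, but it's what we get without any assumptions about the estimator's distribution.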

Markov’s Inequality and Interval Estimation

So far, when we’ve calibrated interval estimates using our estimator’s standard deviation, we’ve relied on normal approximation. In effect, we’ve been using a formula for \(P(\lvert\hat\theta - \theta\rvert \le \epsilon)\) that’s accurate when \(\hat\theta\) has a normal distribution and close enough when its distribution is close enough to normal. In this problem, we’re going to think about doing without this reliance on approximate normality.

Let’s consider \(\hat\theta\), an unbiased estimator of \(\theta\) with standard deviation \(\sigma\), so the normal approximation to the distribution of \(\hat\theta-\theta\) has the density \(f_{0,\sigma}(x)\) below.

\[ P\qty(|\hat\theta - \theta| \le \epsilon) \approx \int_{-\epsilon}^{\epsilon} f_{0,\sigma}(x) dx \qfor f_{0, \sigma}(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-x^2/2\sigma^2} \]

The reason we’ve been talking about interval estimators of the form \(\hat\theta \pm 1.96 \sigma\) is that, if this approximation were perfect, these interval estimators would have 95% coverage. That is, it’d be true that \(P(|\hat\theta-\theta| \le 1.96 \sigma) = .95\). And if the approximation is pretty good, we should still expect coverage close to that. But suppose we’re not confident that it is. Markov’s inequality allows us to calibrate interval estimates in terms of our estimator’s standard deviation without any caveats about its sampling distribution being approximately normal. Let’s give it a shot.
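As a quick check of the 1.96 claim, we can compute the standard normal mass within \(\pm 1.96\) using R's normal CDF `pnorm`:

```r
# Coverage of theta.hat +/- 1.96*sigma under a perfect normal approximation:
# the standard normal mass between -1.96 and 1.96.
2 * pnorm(1.96) - 1   # 0.950 to three decimal places
```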

Exercise 20.1  

Let \(\hat\theta\) be an unbiased estimator of \(\theta\) with standard deviation \(\sigma\). By applying Markov’s inequality to \(|\hat\theta-\theta|^2\), find a lower bound on \(P(|\hat\theta-\theta| \le t\sigma)\) as a function of \(t\). For what choice of \(t\) is this bound equal to \(.95\)? Explain how you can use this to get an interval estimate \(\hat\theta \pm t\sigma\) with a coverage probability of at least 95%.

If you have an upper bound on \(P(X > \epsilon)\), then you have a lower bound on \(P(X \le \epsilon)\). Because \(X > \epsilon\) if and only if it is not the case that \(X \le \epsilon\), we have \(P(X \le \epsilon)=1-P(X > \epsilon)\); so if \(P(X > \epsilon) \le u\), then \(P(X \le \epsilon) = 1-P(X > \epsilon) \ge 1-u\). And if you're troubled that Markov's inequality bounds \(P(X \ge \epsilon)\) rather than \(P(X > \epsilon)\), don't be: \(X \ge \epsilon\) whenever \(X > \epsilon\), so \(P(X > \epsilon) \le P(X \ge \epsilon)\) and you have a bound on \(P(X > \epsilon)\) too.


Now let’s try this out on the NBA data we’ve been working with recently.

# Download records for the players we've seen before ('prior observations'),
# our sample, and the whole population.
prior.obs = read.csv("https://qtm285-1.github.io/assets/data/nba_sample_2.csv")
sam = read.csv("https://qtm285-1.github.io/assets/data/nba_sample_1.csv")
pop = read.csv("https://qtm285-1.github.io/assets/data/nba_population.csv")

# One if a player's team won more than half the games they played in
# (W wins, L losses), zero otherwise.
indicator = function(W, L, ...) { W / (W + L) > 1/2 }

library(purrr)
Y.prior = prior.obs |> pmap_vec(indicator)  # the prior observations' indicators
Y = sam |> pmap_vec(indicator)              # the sample Y_1 ... Y_n
y = pop |> pmap_vec(indicator)              # the population y_1 ... y_m

n = length(Y)      # sample size
m = length(y)      # population size
theta = mean(y)    # estimation target: the population mean

The sample \(Y_1 \ldots Y_{100}\) we’ll use is drawn with replacement from a population \(y_1 \ldots y_{539}\) of indicators. These indicators—one for each of the 539 players who played in the NBA in 2023—are one if the player’s team won more than half the games they played in and zero otherwise.

We’ll consider two point estimators. The first is the sample mean, \(\hat\theta=\bar{Y}\). And the second is the mean-with-prior-observations estimator \(\tilde{Y}_{100}\) we talked about in the last homework, using 100 prior observations from what we called ‘your sample’ in the Week 1 Homework.

Exercise 20.2  

For each of these point estimators:

  1. Plot the bootstrap sampling distribution and a 95% confidence interval calibrated using it.
  2. Plot an estimate of the normal approximation to the estimator’s sampling distribution and a second 95% confidence interval calibrated using that.
  3. Plot a third 95% confidence interval, this time calibrated using Markov’s inequality.
  4. Taking advantage of your knowledge of the population, plot the estimator’s actual sampling distribution and calculate the coverage probability of your three interval estimators.

Referring to your plots, comment on the behavior of your six2 interval estimators. If you had to do something like this again, e.g. using data from a different season, which would you choose? Why? Would it make a difference if your sample size were larger, e.g. \(n=400\) instead of \(n=100\)?

Style your plots as you see fit. Overlay the distributions or plot them side-by-side, use color, transparency, annotations, etc. Make it easy to see the stuff that comes up in your comments.

Note. Your interval estimates should not use any information about the population other than what’s in the sample. If you need a standard deviation or something like that, estimate it using the sample.



  1. If you’re not convinced, sketch the two functions on the same axes. Sketching usually helps.↩︎

  2. \(2\ \text{point estimators} \times 3\ \text{interval calibration methods}=6\ \text{interval estimators}\)↩︎