Probability, Statistics, and Random Processes

Definition of Probability

Kolmogorov Axioms of Probability:

  1. $P(A) \ge 0$ for every event $A$
  2. $P(\Omega) = 1$ (and hence $P(\phi) = 0$ and $P(A) = 1 - P(A^c)$)
  3. $P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$, where the $A_i$ are mutually exclusive

What is a Random Variable a map between? $X: \Omega \rightarrow \mathbb{R}$, a map from the sample space to the real line.

Random Process → Random Event → Random Variable → Probability Distribution

Role of Statistics: when the form of the distribution is known but not its parameter, statistics helps you figure out the parameter.

When does the classical definition of probability fail? When the set of possible outcomes is infinite (or the outcomes are not equally likely).

Fisher's Definition: $\underbrace{\text{long-run relative frequency}}_{\text{empirical probability}}$ is considered the definition of probability by R. A. Fisher.


$X$: number of goals scored by the home team
$X \sim \text{Poisson}(\lambda)$; $\lambda$: rate parameter, $0 < \lambda < \infty$
$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$

Here, $\lambda$ is the parameter of the model, and statistics is interested in finding it. The process of searching for $\lambda$ is called estimation.

We choose $\lambda$ in such a way that $P(X = k)$ from the model is very close to the empirical probability.
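As a rough sketch of this fitting idea (the goal counts below are made-up data, and using the sample mean as the candidate $\lambda$ is an assumption for illustration):

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical goal counts for the home team over a season (made-up data)
goals = np.array([0, 1, 1, 2, 0, 3, 1, 2, 2, 0, 1, 4, 1, 0, 2])

lam_hat = goals.mean()   # candidate lambda: the sample mean

ks = np.arange(goals.max() + 1)
empirical = np.array([(goals == k).mean() for k in ks])  # empirical relative frequencies
model = poisson.pmf(ks, lam_hat)                         # P(X = k) under the fitted model

for k, e, m in zip(ks, empirical, model):
    print(f"k={k}: empirical={e:.3f}, model={m:.3f}")
```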

Estimation Techniques

  1. Method of Moments (Works on Toy Examples)
  2. Likelihood Method (Powerful)
  3. Bayesian Method (A Method in ML)

Method of Moments: assume $X \sim p_{\theta}(x)$, where $\theta$ is the parameter (vector), $p$ is either a PMF or a PDF depending on the nature of $X$, and $\{x_1, x_2, x_3, \ldots, x_n\}$ are i.i.d. samples.

Then,
Sample mean: $\bar{x} = \frac{\sum_{i=1}^n x_i}{n} = m_1$
$m_2 = \frac{\sum_{i=1}^n x_i^2}{n}$, $\quad m_3 = \frac{\sum_{i=1}^n x_i^3}{n}$
Sample variance $= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n} = m_2 - m_1^2$

Population mean: $E[X] = \sum_{x=0}^{\infty} x\, p_{\theta}(x) = \mu_1$, $\quad E[X^2] = \sum_{x=0}^{\infty} x^2 p_{\theta}(x) = \mu_2$
Population variance: $V(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2 = \mu_2 - \mu_1^2$

The philosophy of the method of moments is that the sample moments should be close to the moments of the probability distribution $p_{\theta}(x)$.

Basically, set

$\mu_1(\theta) = m_1, \quad \mu_2(\theta) = m_2, \ldots$

and solve for $\theta$.
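A small worked sketch of this recipe (not from the notes): for a two-parameter Gamma$(k, \theta)$, $\mu_1 = k\theta$ and $\mu_2 - \mu_1^2 = k\theta^2$, so equating these to the sample moments and solving gives the estimates below. The simulated data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: i.i.d. samples from Gamma(shape=2, scale=3)
x = rng.gamma(shape=2.0, scale=3.0, size=5000)

m1 = x.mean()        # first sample moment
m2 = (x**2).mean()   # second sample moment
s2 = m2 - m1**2      # sample variance = m2 - m1^2

# Solve k*theta = m1 and k*theta^2 = s2 for (k, theta)
theta_hat = s2 / m1
k_hat = m1**2 / s2

print(k_hat, theta_hat)   # close to the true values 2 and 3
```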

MoM does not work when there are $k$ unknowns but the $k^{th}$ moment does not exist.

What does it mean for a moment to not exist? The summation (or integral, for continuous random variables) defining the population moment diverges.

Probability Distribution for Continuous Random Variable

Probability Density Function: $f(x)$ is a PDF such that

  1. $f(x) \ge 0$
  2. $\int_{\mathbb{R}} f(x)\, dx = 1$

A PDF is not a probability; it is a tool to $\underbrace{\text{evaluate probability}}_{P(a < X < b) = \int_a^b f(x)dx}$, whereas a $\overbrace{\text{PMF}}^{0 \le p(x) \le 1;\ \sum p(x) = 1}$ is itself a probability.

density $\times$ width = probability, so density $= \frac{\text{prob}}{\text{width}} = \frac{p_1}{2\delta} = \frac{f_1/n}{2\delta} = \frac{\text{relative frequency}}{\text{width}}$

(Picture the interval $[a, b]$ partitioned at points $x_1 < x_2 < x_3 < x_4$.)

$P(a < X < b) = d_1(x_1 - a) + d_2(x_2 - x_1) + d_3(x_3 - x_2) + \ldots = \sum_i d_i(x_i - x_{i-1})$
Approximate $d_i = f(x_i)$ and $w_i = x_i - x_{i-1}$. As $w_i \rightarrow 0$, the sum tends to $\int_a^b f(x)\, dx$.

If you have a function $g(x) \ge 0$ with $\int_{\mathbb{R}} g(x)\, dx = a < \infty$, then $f(x) = \frac{g(x)}{a}$ is a PDF; $g(x)$ is the kernel of the PDF and $a$ is the normalising constant.
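For instance (a numerical sketch, using the standard normal kernel as the example): $g(x) = e^{-x^2/2}$ integrates to $\sqrt{2\pi}$, which becomes the normalising constant.

```python
import numpy as np
from scipy.integrate import quad

g = lambda x: np.exp(-x**2 / 2)       # kernel of the standard normal PDF

a, _ = quad(g, -np.inf, np.inf)       # normalising constant, equals sqrt(2*pi)
f = lambda x: g(x) / a                # f is now a proper PDF

print(a, np.sqrt(2 * np.pi))          # both ~2.5066
print(quad(f, -np.inf, np.inf)[0])    # integrates to 1
```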

Cumulative Distribution Function

$\int_{-\infty}^a f(x)\, dx = P(X \le a) = F_X(a)$

Moment Generating Functions

$E[g(X)] = \underbrace{\sum g(x)P(x)}_{\text{discrete}} = \overbrace{\int g(x)f(x)\, dx}^{\text{continuous}}$

If $X$ is a random variable with PDF $f(x)$, then the MGF is
$E[e^{tX}] = \int e^{tx} f(x)\, dx$ (continuous) $= \sum e^{tx} P(x)$ (discrete) $= g(t)$
$g(t) = \int \left(1 + tx + \frac{t^2 x^2}{2!} + \ldots\right) f(x)\, dx$ (using the Taylor series of $e^{tx}$)
$= \int f(x)\, dx + t\int x f(x)\, dx + \frac{t^2}{2!}\int x^2 f(x)\, dx + \ldots$
$g(t) = 1 + tE[X] + \frac{t^2}{2!}E[X^2] + \ldots$
$g'(t) = E[X] + tE[X^2] + \frac{t^2}{2!}E[X^3] + \ldots$

Therefore, $g'(0) = E[X]$ and $g''(0) = E[X^2]$, and $V(X) = g''(0) - (g'(0))^2$

Why do we need the moments? I do not want to solve 4 integrations for finding the 4 properties of a distribution. I will solve 1 integration, namely E[etx]E[e^{tx}] and then differentiate to obtain 4 moments. Therefore, these moments help in uniquely identifying the distribution.
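A small symbolic sketch of this "differentiate instead of integrate" idea, using the known normal MGF $e^{t\mu + t^2\sigma^2/2}$ (see the RESULT further below); the symbol names are just for illustration:

```python
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# MGF of N(mu, sigma^2)
g = sp.exp(t * mu + t**2 * sigma**2 / 2)

EX  = sp.diff(g, t, 1).subs(t, 0)      # g'(0)  -> mu
EX2 = sp.diff(g, t, 2).subs(t, 0)      # g''(0) -> mu^2 + sigma^2
var = sp.simplify(EX2 - EX**2)         # sigma^2

print(EX, EX2, var)
```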

Raw Moments vs Central Moments? Raw moments: subtract "0" from $X$, i.e. $E[X^k]$. Central moments: subtract "$\mu$" from $X$, i.e. $E[(X - \mu)^k]$.

  • Variance is the 2nd central moment and skewness is the (standardised) 3rd central moment

Cauchy Distribution: $f(x) = \frac{1}{\pi}\frac{1}{1+x^2}$, where $-\infty < x < \infty$

The moments of this distribution do not exist. It is also the t-distribution with 1 degree of freedom.
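A quick simulation (illustrative, with arbitrary sample sizes) of what a non-existent mean looks like in practice: the running average of standard Cauchy samples never settles down.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_cauchy(100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

# Unlike a distribution with a finite mean, these values keep jumping around
for n in (100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])
```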

Sampling Distribution

Population: students of CMI. $X$ = number of books ordered by a student. We are interested in finding $E[X] = \mu$. Define a random sample of size $n$: $D = \{x_1, x_2, x_3, \ldots, x_n\}$, with $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. Conduct this experiment many times and plot the means; this is your sampling distribution of the sample mean.

Intention? Find $\mu$. The sampling distribution gives a sense of how far the sample mean is from the hypothetical true mean. It helps quantify the uncertainty in the data, and also helps in comparing two different estimators.

We want to figure out the sampling distribution of the sample mean from a single sample. Suppose $X_1, X_2, X_3, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$.

$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$

We want to find $M_{\bar{X}}(t)$. Let $S = \sum_{i=1}^n X_i$.
$M_S(t) = E[e^{tS}] = E[e^{t\sum_{i=1}^n X_i}] = E\left[\prod_{i=1}^n e^{tX_i}\right] = \prod_i E[e^{tX_i}] = \prod_i M_{X_i}(t) = \prod_i e^{t\mu + \frac{t^2\sigma^2}{2}} = e^{tn\mu + \frac{t^2 n\sigma^2}{2}}$
Therefore, $S \sim N(n\mu, n\sigma^2)$. We know $\bar{X} = \frac{S}{n}$, so
$M_{\bar{X}}(t) = E[e^{t\bar{X}}] = E[e^{t\frac{S}{n}}] = E[e^{\frac{t}{n}S}] = e^{\frac{t}{n}n\mu + \frac{(t/n)^2 n\sigma^2}{2}} = e^{t\mu + \frac{t^2\sigma^2}{2n}}$
Hence $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.
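A short simulation (parameter values chosen arbitrarily) that checks $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ by repeating the sampling many times:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 10.0, 3.0, 25, 20_000

# Many samples of size n from N(mu, sigma^2); keep each sample mean
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(xbars.mean(), mu)              # mean of the sample means ~ mu
print(xbars.var(), sigma**2 / n)     # variance of the sample means ~ sigma^2 / n
```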

Central Limit Theorem

$X_1, X_2, X_3, \ldots, X_n \overset{\text{iid}}{\sim} f(x)$, where $f(x)$ is an appropriate probability function with $E[X_i] = \mu$ and $V[X_i] = \sigma^2 < \infty$. What is the distribution of $\bar{X}$?
$M_{\bar{X}}(t) = E[e^{t\bar{X}}] = \prod_i E[e^{\frac{t}{n}X_i}]$. Since the $X_i$ are i.i.d., replace $i$ with 1: $M_{\bar{X}}(t) = \left(E[e^{\frac{t}{n}X_1}]\right)^n$.
Find the MGF of $X_1$ and plug it in (for $t$ near 0):
$M_{X_1}(t) = E[e^{tX_1}] = E\left[1 + tX_1 + \frac{t^2 X_1^2}{2} + \ldots\right] = 1 + t\mu + \frac{t^2 E[X_1^2]}{2} + O(t^3) = 1 + t\mu + \frac{t^2(\sigma^2 + \mu^2)}{2} + O(t^3)$
Therefore, $M_{\bar{X}}(t) = \left[M_{X_1}\left(\frac{t}{n}\right)\right]^n$.

RESULT: If $X \sim N(\mu, \sigma^2)$, then $M_X(t) = E[e^{tX}] = e^{t\mu + \frac{t^2\sigma^2}{2}}$. If $X \sim N(0, \sigma^2)$, then $M_X(t) = e^{\frac{t^2\sigma^2}{2}}$. If $X \sim N(0, 1)$, then $M_X(t) = e^{\frac{t^2}{2}}$.

Then, $E[\bar{X}] = E\left[\frac{1}{n}\sum X_i\right] = \frac{1}{n}\sum E[X_i] = \frac{1}{n}\cdot n\mu = \mu$ and $V[\bar{X}] = V\left[\frac{1}{n}\sum X_i\right] = \frac{1}{n^2}\sum V[X_i] = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}$

Take $Z = \bar{X} - \mu$. Then $M_Z(t) = E[e^{tZ}] = E\left[1 + tZ + \frac{t^2 Z^2}{2} + O(t^3)\right] = 1 + tE[Z] + \frac{t^2}{2}E[Z^2] + O(t^3) = 1 + 0 + \frac{t^2}{2}\cdot\frac{\sigma^2}{n} + O(t^3)$
We know $(1 + \frac{x}{n})^n \approx e^x$ when $x$ is around 0 and $n \rightarrow \infty$. Therefore,
$1 + 0 + \frac{t^2}{2}\cdot\frac{\sigma^2}{n} + O(t^3) \approx e^{\frac{t^2}{2}\cdot\frac{\sigma^2}{n}}$
So $Z = \bar{X} - \mu \approx N\left(0, \frac{\sigma^2}{n}\right)$ as $n \rightarrow \infty$, i.e. $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \approx N(0, 1)$ as $n \rightarrow \infty$.

Lindeberg-Levy CLT

Let $X_1, X_2, X_3, \ldots, X_n \overset{\text{iid}}{\sim} f(x)$ with $E[X_1] = \mu$ and $V(X_1) = \sigma^2 < \infty$. Then $Z_n = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \approx N(0, 1)$ as $n \rightarrow \infty$.

Centralise first, for quicker convergence. The CLT fails for the Cauchy distribution, but Cauchy-like data is rarely seen in real life.
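A CLT sanity check (a sketch with exponential data, which is clearly non-normal; the rate and sample size are arbitrary): the standardised sample means behave approximately like $N(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 20_000
rate = 2.0                     # Exponential(rate): mu = 1/rate, sigma = 1/rate

x = rng.exponential(scale=1 / rate, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - 1 / rate) / (1 / rate)

print(z.mean(), z.std())               # ~0 and ~1
print(np.mean(np.abs(z) < 1.96))       # ~0.95, as for a standard normal
```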

Bernoulli

Suppose $X_1, X_2, X_3, \ldots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(p)$, $0 < p < 1$, where $E[X_i] = p$ and $V[X_i] = p(1-p) < \infty$. Then $\bar{X} = \frac{S}{n} = \hat{p}$, where $S$ is the number of successes.
$E[\hat{p}] = E\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n}(np) = p$ and $V[\hat{p}] = V\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}$
By the CLT, $\frac{\sqrt{n}(\hat{p} - p)}{\sqrt{p(1-p)}} \rightarrow N(0, 1)$ as $n \rightarrow \infty$. For large samples, the sampling distribution of the sample proportion approximately follows a Gaussian distribution.

Gamma

Even though $\bar{X}$ of Gamma random variables is again Gamma, the CLT still holds: after a certain $n$, the Gamma distribution starts to behave like a Gaussian.

A serious criticism of textbook statistical inference relying on the CLT is that it requires a large sample size. Solution? Bootstrap statistics.

Important Results

  1. $Y \sim N(\mu, \sigma^2)$, then $Z = \frac{Y - \mu}{\sigma} \sim N(0, 1)$ and $Z^2 \sim \chi^2_{(1)}$
  2. If $Z_1, Z_2, \ldots, Z_n \overset{\text{iid}}{\sim} N(0, 1)$, then $S = \sum_{i=1}^n Z_i^2 \sim \chi^2_{(n)}$
  3. If $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} N(\mu, \sigma^2)$, then
     1. $Z_n = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$
     2. $U = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{(n-1)}$, where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
     3. $Z_n$ and $U$ are independent of each other
     4. $\frac{\sqrt{n}(\bar{x} - \mu)}{s} \sim t_{n-1}$

In part (4) of result 3, replacing $\sigma$ with its estimate $s$ moves the distribution away from the normal to a $t$.
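A quick simulation check of result 3, part (4) (sample size and parameters arbitrary): with $\sigma$ replaced by $s$, the statistic has heavier tails than $N(0, 1)$ and matches $t_{n-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, n, reps = 0.0, 2.0, 8, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
t_stat = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

print(np.mean(t_stat > 2.0))            # empirical tail probability
print(1 - stats.t.cdf(2.0, df=n - 1))   # t_{n-1} tail: matches the line above
print(1 - stats.norm.cdf(2.0))          # normal tail: noticeably smaller
```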

Statistical Inference

$\underbrace{X_1, X_2, \ldots, X_n}_{\text{dataset}} \overset{\text{iid}}{\sim} F_{\theta}(x)$; $\theta$ is an unknown population parameter. Based on this dataset, make an educated guess about $\theta$.

In data science, people are mainly interested in:

  • I: Predictive Modelling
  • II: Statistical Inference
    • Point Estimation / Training in ML
    • Testing of Hypotheses
    • Confidence Interval

Estimator

$T_n = T(X_1, X_2, \ldots, X_n)$: any process you come up with to make an educated guess about $\theta$. For example, $\hat{\theta} = \arg\max_{\theta \in \mathcal{H}} L(\theta \mid D)$ is the MLE. The estimator has its own sampling distribution: $T_n = T(X_1, X_2, \ldots, X_n) \sim G_{\theta}(t)$, the CDF of the sampling distribution of $T_n$.

If $E[T_n] = \theta$, then we call $T_n$ an $\underline{\text{unbiased estimator}}$. If $E[T_n] = \theta + \delta$ or $\theta - \delta$ (with $\delta \neq 0$), then we call $T_n$ a $\underline{\text{biased estimator}}$.

Problem

  1. θ\theta is unknown
  2. We need to learn or estimate it from the dataset

More Examples of Estimators

  1. $T = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ | Sample mean, an estimator of the population mean $E[X] = \mu$
  2. $T = S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$ | Sample variance, which estimates the population variance $V[X] = \sigma^2$
  3. $T = S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ | Another estimator of the population variance $\sigma^2$

Which is better among (2) and (3)?

Competing Estimators

If $T_1$ and $T_2$ are two competing estimators of $\theta$, which one should we prefer? We compare their sampling distributions.

If $T_2$'s sampling distribution is more concentrated around $\theta$ than $T_1$'s, we clearly prefer $T_2$.

We compare: $P(\theta - \delta < T_1 < \theta + \delta) < P(\theta - \delta < T_2 < \theta + \delta)$

Which corresponds to comparing the probability mass under the sampling distributions: $\int_{\theta - \delta}^{\theta + \delta} g_\theta(t_1)\, dt_1 \quad \text{vs} \quad \int_{\theta - \delta}^{\theta + \delta} h_\theta(t_2)\, dt_2$

But in practice we do not know the sampling distributions of $T_1$ and $T_2$.

Mean Squared Error (MSE) as a Criterion

Instead, we work with something more concrete: the Mean Squared Error (MSE), $\text{MSE} = \mathbb{E}[(T - \theta)^2]$. If $\text{MSE}(T_1) < \text{MSE}(T_2)$, then $T_1$ has, on average, less error, and so we prefer $T_1$ over $T_2$.

We can decompose the MSE as follows: $\mathbb{E}[(T - \theta)^2] = \mathbb{E}[(T - \mathbb{E}[T] + \mathbb{E}[T] - \theta)^2]$

Expanding this:

$\mathbb{E}[(T - \mathbb{E}[T])^2] + (\mathbb{E}[T] - \theta)^2 = \text{Var}(T) + \left(\text{Bias}(T)\right)^2$
This is the bias-variance tradeoff.

Even if the bias of T2T_2 is more, if its variance is very small, we might still choose it. This tradeoff is captured in the MSE decomposition.

Concrete Example

Let $X_1, X_2, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$

Two estimators for the population variance $\sigma^2$:

  1. $S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$
  2. $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$

Consider,

$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, so $\mathbb{E}\left[\frac{(n-1)S^2}{\sigma^2}\right] = n - 1 \Rightarrow \mathbb{E}[S^2] = \sigma^2$

So, $S^2$ is an unbiased estimator of $\sigma^2$.

$\text{Var}\left(\frac{(n-1)S^2}{\sigma^2}\right) = 2(n-1) \Rightarrow \text{Var}(S^2) = \frac{2\sigma^4}{n-1}$. Since the bias is 0: $\text{MSE}(S^2) = \text{Var}(S^2) = \frac{2\sigma^4}{n-1}$

Whereas,

$S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n}S^2 \Rightarrow \mathbb{E}[S_n^2] = \mathbb{E}\left[\frac{n-1}{n}S^2\right] = \frac{n-1}{n}\sigma^2$

Thus, $S_n^2$ is a biased estimator of $\sigma^2$, with bias $-\frac{\sigma^2}{n}$.

$\text{Var}(S_n^2) = \text{Var}\left(\frac{n-1}{n}S^2\right) = \left(\frac{n-1}{n}\right)^2 \text{Var}(S^2) = \left(\frac{n-1}{n}\right)^2 \cdot \frac{2\sigma^4}{n-1} = \frac{2(n-1)\sigma^4}{n^2}$

$\text{MSE}(S_n^2) = \text{Var}(S_n^2) + \left(\text{Bias}(S_n^2)\right)^2 = \frac{2(n-1)\sigma^4}{n^2} + \left(\frac{\sigma^2}{n}\right)^2 = \frac{2(n-1)\sigma^4 + \sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2}$

We want to show: $\frac{2n-1}{n^2} < \frac{2}{n-1}$. Multiply both sides by $n^2(n-1)$ to eliminate denominators: $(2n-1)(n-1) < 2n^2$. Expanding the left side: $(2n-1)(n-1) = 2n^2 - 2n - n + 1 = 2n^2 - 3n + 1$.

Compare: $2n^2 - 3n + 1 < 2n^2 \Rightarrow -3n + 1 < 0$, which is true for all $n > \frac{1}{3}$. ✅ So the inequality is proved.

Therefore, we found an estimator with a smaller MSE than the minimum variance unbiased estimator.
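A simulation sketch of the comparison above (normal data with arbitrary $\mu$, $\sigma$, and $n$): the biased estimator $S_n^2$ indeed has the smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 0.0, 1.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
s2  = x.var(axis=1, ddof=1)   # unbiased S^2
sn2 = x.var(axis=1, ddof=0)   # biased S_n^2

print(np.mean((s2  - sigma**2) ** 2), 2 * sigma**4 / (n - 1))          # ~0.222
print(np.mean((sn2 - sigma**2) ** 2), (2 * n - 1) * sigma**4 / n**2)   # ~0.190, smaller
```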

Information

$\mathcal{D} = \{X_1, X_2, \ldots, X_n\} \overset{\text{i.i.d.}}{\sim} f_\theta(x)$

  • $\theta \in \mathcal{H}$: unknown parameter
  • $X \in \mathbb{R}$

Likelihood Principle: $\mathcal{L}(\theta \mid \mathbf{x}) = \prod_{i=1}^n f(x_i \mid \theta)$

MLE: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta \in \mathcal{H}} L(\theta \mid x)$

  1. $\frac{\partial}{\partial\theta}\log L(\theta \mid x) \overset{\text{set}}{=} 0$ to obtain the $\theta$ that maximises the likelihood function.
  2. $\frac{\partial^2}{\partial\theta^2}\log L(\theta \mid x)$ captures how sensitive the likelihood function is to a nudge in $\theta$.

Fisher Information

$I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log L(\theta \mid x)\right] = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log L(\theta \mid x)\right)^2\right]$

Example:
Let
$D = \{X_1, X_2, \ldots, X_n\} \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$

We want to check whether $I_D(\lambda) = I_{\bar{X}}(\lambda)$, i.e. whether the information due to the full data and the information due to $\bar{X}$ are the same.

$\log L(\lambda \mid X) = -n\lambda + \log\lambda \sum X_i - \sum \log(X_i!)$
$\frac{\partial}{\partial\lambda}\log L(\lambda \mid X) = -n + \frac{1}{\lambda}\sum X_i$
$\frac{\partial^2}{\partial\lambda^2}\log L(\lambda \mid X) = -\frac{1}{\lambda^2}\sum X_i$
$I_D(\lambda) = \mathbb{E}\left[-\frac{\partial^2}{\partial\lambda^2}\log L(\lambda \mid X)\right] = \frac{1}{\lambda^2}\mathbb{E}\left[\sum X_i\right] = \frac{1}{\lambda^2}\cdot n\lambda = \frac{n}{\lambda}$

If $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$, then
$S = \sum_{i=1}^n X_i \sim \text{Poisson}(n\lambda)$

PMF:

$f(s \mid \lambda) = \frac{e^{-n\lambda}(n\lambda)^s}{s!}$
$\log f(s \mid \lambda) = -n\lambda + s\log(n\lambda) - \log s!$
$\frac{\partial}{\partial\lambda}\log f(s \mid \lambda) = -n + \frac{s}{\lambda}$
$\frac{\partial^2}{\partial\lambda^2}\log f(s \mid \lambda) = -\frac{s}{\lambda^2}$
$I_S(\lambda) = -\mathbb{E}\left[\frac{\partial^2}{\partial\lambda^2}\log f(s \mid \lambda)\right] = \frac{\mathbb{E}[S]}{\lambda^2} = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}$
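A small numerical check (arbitrary $\lambda$ and $n$) that the two expressions for the Fisher information agree and equal $\frac{n}{\lambda}$ for Poisson data:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 4.0, 30, 50_000

x = rng.poisson(lam, size=(reps, n))

score = -n + x.sum(axis=1) / lam                         # d/dlambda log L(lambda | X)
info_from_score = np.mean(score**2)                      # E[(score)^2]
info_from_curvature = np.mean(x.sum(axis=1) / lam**2)    # E[-d^2/dlambda^2 log L]

print(info_from_score, info_from_curvature, n / lam)     # all ~7.5
```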

So, $S$ and $\bar{X}$ are sufficient statistics of $D$ because
$I_D = I_S$

We want to calculate the information of $\bar{X}$ and show that it is sufficient.

Theorem (Fisher):
Let $T = f(S)$, where $f$ is a one-to-one function.
Then $T$ is also a sufficient statistic if $S$ is a sufficient statistic.

Imagine you have 100 thermometers measuring the same temperature (with Poisson noise). Each gives you one reading. But if someone tells you the total sum or average, you can figure out the temperature just as well as if you had all 100 readings. So the sum/average is all you need — it’s sufficient. And you get the same precision in your estimate (same Fisher information).

MVUE & Cramér-Rao Inequality

Suppose $T = T(x)$ is an unbiased estimator of the real-valued parametric function $\mathcal{T}(\theta)$, that is, $\mathbb{E}[T] = \mathcal{T}(\theta)$ for all $\theta \in \mathcal{H}$.

Assume $\frac{d\mathcal{T}(\theta)}{d\theta} = \mathcal{T}'(\theta)$ exists and is finite. Then $V(T) \geq \frac{[\mathcal{T}'(\theta)]^2}{n\,\mathbb{E}_{\theta}\left[\left(\frac{\partial}{\partial\theta}\log f_{\theta}(x_i)\right)^2\right]} = \frac{[\mathcal{T}'(\theta)]^2}{I(\theta)}$
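A worked instance (following the Poisson example above): for $X_i \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$ and $\mathcal{T}(\lambda) = \lambda$, we have $\mathcal{T}'(\lambda) = 1$ and $I(\lambda) = \frac{n}{\lambda}$, so the bound says $V(T) \ge \frac{\lambda}{n}$. Since $V(\bar{X}) = \frac{\lambda}{n}$, the sample mean attains the bound and is therefore the MVUE of $\lambda$.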

Statistical Consistency

Population: $\{X_1, X_2, X_3, \ldots, X_N\}$. Sample: $\{x_1, x_2, x_3, \ldots, x_n\}$

As $n \rightarrow N$, the sample (mean) should tend to the population (mean).

As $n \rightarrow N$, $\bar{x} \rightarrow \mu$ (finite population); as $n \rightarrow \infty$, $\bar{x} \rightarrow \mu$ (to prove this, we use probability inequalities).

Markov Inequality: if $X$ is a non-negative random variable and $a > 0$, then $P(X \ge a) \le \frac{E[X]}{a}$

Chebyshev’s Inequality

Let $X$ be any random variable with finite mean $\mu = \mathbb{E}[X]$ and finite variance $\sigma^2 = \operatorname{Var}(X) < \infty$. Then

$\Pr\bigl(|X-\mu| \ge k\sigma\bigr) \le \frac{1}{k^{2}}$

Equivalently, $\Pr\bigl(-k\sigma \le X - \mu \le k\sigma\bigr) \ge 1 - \frac{1}{k^{2}}$, i.e. $\Pr\bigl(\mu - k\sigma \le X \le \mu + k\sigma\bigr) \ge 1 - \frac{1}{k^{2}}$

Example:
For $k = 3$ (within $3\sigma$), the probability is at least $1 - \tfrac{1}{9} = \tfrac{8}{9}$, i.e. about 88.9%.
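A quick empirical check of the bound (using exponential data with mean and standard deviation 1; the choice of distribution is arbitrary, since Chebyshev holds for any finite-variance distribution):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=1_000_000)   # mu = 1, sigma = 1

for k in (2, 3, 4):
    tail = np.mean(np.abs(x - 1.0) >= k * 1.0)   # P(|X - mu| >= k*sigma)
    print(k, tail, 1 / k**2)                     # empirical tail vs Chebyshev bound
```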

Weak Law of Large Numbers (WLLN): denote the sample mean by $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Using Chebyshev on $\bar{X}$:

$\Pr\bigl(|\bar{X}-\mu| > k\bigr) \le \frac{\operatorname{Var}(\bar{X})}{k^{2}} = \frac{\sigma^{2}}{n k^{2}}$

Hence, as $n \to \infty$, $\bar{X} \xrightarrow{\text{in probability}} \mu$


Convergence of the Sample Variance to the Population Variance

Define the (biased) sample variance $s^{2} = \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - \bar{X}\bigr)^{2}$

Our goal is to show

$s^{2} \xrightarrow{\text{in probability}} \sigma^{2}$

A Chebyshev-style bound:

$\Pr\bigl(|s^{2}-\sigma^{2}| \ge \varepsilon\bigr) \le \frac{\operatorname{Var}(s^{2})}{\varepsilon^{2}}$

Sketch of the Variance Calculation

  1. Decompose each squared term

    $(X_i - \bar{X})^{2} = (X_i - \mu + \mu - \bar{X})^{2} = (X_i-\mu)^{2} + (\mu-\bar{X})^{2} + 2(X_i-\mu)(\mu-\bar{X})$
  2. Take expectations

    $\mathbb{E}[s^{2}] = \frac{1}{n}\sum_i \mathbb{E}\bigl[(X_i-\bar{X})^{2}\bigr] = \sigma^{2} - \operatorname{Var}(\bar{X}) = \sigma^{2} - \tfrac{\sigma^{2}}{n} \xrightarrow{n\to\infty} \sigma^{2}$
  3. Result: a similar (fourth-moment) calculation gives

    $\operatorname{Var}(s^{2}) \propto \frac{1}{n}, \qquad\text{so}\qquad \operatorname{Var}(s^{2}) \xrightarrow{n\to\infty} 0$

Since the variance of s2s^{2} tends to zero as nn grows, the probability bound above forces s2s^{2} to converge in probability to σ2\sigma^{2}.


Hypothesis Testing

$D = \{X_1, X_2, \ldots, X_n\} \overset{\text{iid}}{\sim} \text{Poisson}(\lambda)$. $H_0: \lambda \leq \lambda_0\ (= 5)$ vs $H_1: \lambda > \lambda_0\ (= 5)$

Test 1. Test statistic: $\bar{x}$. Reject the null hypothesis if $\bar{x} > c_1$.

Test 2. Test statistic: sample variance $s^2$. Reject $H_0$ if $s^2 > c_2$.

Test 3. Test statistic: sample median $m$. Reject $H_0$ if $m > c_3$.

$\Pr(\text{Type I error}) \leq \alpha \Rightarrow \Pr(\text{reject } H_0 \text{ when } H_0 \text{ is true}) \leq \alpha \Rightarrow \Pr(T > c \mid H_0 \text{ is true}) \leq \alpha$

Under $H_0$:

  1. Simulate $D^* = \{x_1^{*}, x_2^{*}, \ldots, x_n^{*}\} \sim \text{Poisson}(\lambda_0)$
  2. Calculate $\bar{x}^{*}, s^{2*}, m^{*}$
  3. Repeat steps 1 and 2 $M$ times
  4. Draw a histogram of $\{\bar{x}_1^{*}, \bar{x}_2^{*}, \ldots, \bar{x}_M^{*}\}$ (and likewise of $\{s_1^{2*}, \ldots, s_M^{2*}\}$, etc.) and find $c$ on the histogram such that the area under the curve to the right of $c$ is at most $\alpha$ (see the sketch below).
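A minimal sketch of this recipe for Test 1 (the values of $n$, $M$, and $\alpha$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
lam0, n, M, alpha = 5.0, 40, 20_000, 0.05

# Null distribution of the test statistic (the sample mean) via simulation
xbar_star = rng.poisson(lam0, size=(M, n)).mean(axis=1)

# Critical value: the (1 - alpha) quantile of the simulated null distribution
c1 = np.quantile(xbar_star, 1 - alpha)
print(c1)                          # reject H0 if the observed xbar exceeds c1

print(np.mean(xbar_star > c1))     # simulated Type I error rate, ~alpha
```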

We would now like to compare the tests against each other to figure out which test has the greatest power.

Type-II Error

$\Pr(\text{Type II error}) = \Pr(\text{do not reject } H_0 \text{ when } H_1 \text{ is true})$

Power $= 1 - \Pr(\text{Type II error})$


Joint, Marginal Distributions

$f_i = \#(a < x_i < b \text{ and } c < y_i < d)$
$\frac{f_i}{n} = \text{rf}(a < x_i < b \text{ and } c < y_i < d)$, the relative frequency
$\text{rfd} = \frac{\text{rf}}{\text{area of the box}} = \frac{\text{rf}}{(b-a)(d-c)} = \frac{\text{rf}}{\Delta x \Delta y}$
$\text{Joint probability density} = \frac{\text{probability}}{\text{area}}$

$f(x, y)$ is the joint PDF such that (1) $f(x, y) \geq 0$ for all $(x, y) \in \mathbb{R}^2$, and (2) $\iint_{\mathbb{R}^2} f(x, y)\, dx\, dy = 1$

$f(x)$ is the marginal PDF of $X$, obtained by integrating out the other variable: (1) $f(x) = \int_y f(x, y)\, dy$ and, likewise, (2) $f(y) = \int_x f(x, y)\, dx$

How do you explain the fitted line on a scatter plot from a probability perspective?

Conditional Expectation of ‘y’ given ‘x’

  1. The line formed by joining the expectations of conditional distributions of yy given xx is called the regression line/function or prediction line/function
  2. If the joint PDF is a bivariate (or multivariate) Gaussian distribution, then the resulting regression function is a straight line

$\underbrace{\mathbb{E}[y \mid x]}_{\alpha + \beta x} = \int y f(y \mid x)\, dy$. If $(x, y)$ are assumed to be jointly normal, then $y$ can be written as $\alpha + \beta x + \epsilon$, i.e. a straight line.
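A simulation sketch of this point (bivariate normal with correlation 0.8; the parameters are arbitrary): binning on $x$ and averaging $y$ within each bin traces out a straight regression line.

```python
import numpy as np

rng = np.random.default_rng(9)

mean = [0.0, 0.0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]
x, y = rng.multivariate_normal(mean, cov, size=200_000).T

# Conditional mean of y within bins of x: increases linearly with slope ~0.8
bins = np.linspace(-2, 2, 9)
idx = np.digitize(x, bins)
for k in range(1, len(bins)):
    centre = bins[k - 1:k + 1].mean()
    sel = y[idx == k]
    print(f"x ~ {centre:+.2f}   E[y|x] ~ {sel.mean():+.3f}")
```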

Correlation

The Degree of Linear Association

The correlation between $x$ and $y$ whose scatter traces the upper half of a circle will be close to zero, since correlation can only capture the linear relationship between two variables.

$S_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ (sample covariance)

$r_{xy} = \frac{s_{xy}}{s_x s_y}$ (sample correlation); $\rho_{xy} = \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sigma_X \sigma_Y}$ (population correlation)

$-1 \leq r_{xy} \leq 1$ and $-1 \leq \rho_{xy} \leq 1$

Randomised controlled trials, due to R. A. Fisher, are currently the only available way of proving causality.

If you solve OLS, then $\hat{\beta} = r\frac{s_y}{s_x}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
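A short numerical check of these formulas on synthetic data (the true intercept 2 and slope 3 are made up):

```python
import numpy as np

rng = np.random.default_rng(10)

x = rng.normal(size=1_000)
y = 2 + 3 * x + rng.normal(scale=0.5, size=x.size)

r = np.corrcoef(x, y)[0, 1]
beta_hat = r * y.std(ddof=1) / x.std(ddof=1)   # beta_hat = r * s_y / s_x
alpha_hat = y.mean() - beta_hat * x.mean()     # alpha_hat = ybar - beta_hat * xbar

print(alpha_hat, beta_hat)       # ~2 and ~3
print(np.polyfit(x, y, 1))       # [slope, intercept] from least squares: same line
```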