Hypothesis Testing for IB Math Applications and Interpretation


This page focuses on hypothesis testing, including null and alternative hypotheses, test statistics, significance level, critical values, critical regions (HL), type I/type II errors (HL), unbiased estimates of population variance (HL), and confidence intervals (HL). In terms of syllabus, this means SL 4.11, AHL 4.14 (unbiased estimates only), AHL 4.15, AHL 4.16, AHL 4.17, AHL 4.18. This page does not cover other statistics or probability concepts and techniques.

Last edited 2026-05-19: Clarification on 2-sample vs paired, on paired t-test H0 notation; introduced example using InvP on Casio for Poisson type I errors; clarified that writing down null distribution is required in the HL exact tests; added paragraph on 2-tail $z$-test critical values

Concepts and presentation

As a simplification, the $t$-test and the HL tests are about the mean, and the chi-squared ($\chi^2$) test is about variance. It’s possible to perform the $t$-test and the HL tests using summary statistics such as mean and sample variance (HL), but we need the full frequency data for $\chi^2$.

Our null hypothesis H0 involves some population parameter, or in the case of the $\chi^2$-test, a fact about the population. The alternative hypothesis typically involves an inequality. If H1 is $\text{parameter} > k$ or $\text{parameter} < k$ then it is a one-tailed test; if it is $\text{parameter} \neq k$ then it is a two-tailed test. When possible, state your H0 and H1 using symbols, as shown later. For example “mean is 5” is not a valid null hypothesis as it does not state “population”; $\bar x = 5$ is not valid either as it does not use a population parameter; instead we should write $\mu = 5$, as $\mu$ means “population mean”.

We compute a test statistic using H0 and a sample. If our H0 is correct, then the test statistic will follow some distribution.

$$\text{[test statistic]} \sim \text{[distribution]}$$

For larger samples in the $t$-test when the data are not normally distributed, and large samples in the $\chi^2$-test, we have

$$\text{[test statistic]} \to \text{[distribution]}, \text{ as } n\to\infty$$

meaning that the distributions are approximate but suitable for sufficiently large $n$. This is discussed in the respective (HL) sections.

We reject H0 if the test statistic is unlikely to follow the distribution, with probability less than the significance level $\alpha$. More on that later.

Conditions to reject H0

The following statements are equivalent; all are true or all are false depending on the sample.

  • there is sufficient evidence to reject H0 in favor of H1
  • $p \leq \alpha$, ie $p$ is no greater than the significance level
  • test statistic is “unexpected” assuming H0 is true
  • test statistic is at least as extreme as the critical value(s)
  • test statistic is in the critical region, ie one or both tails (HL)
  • null hypothesis parameter $\mu_0$ is outside the confidence interval (HL, for two-tailed t-test and two-tailed z-test only)

For chi-squared test, extreme means more positive. For other tests in the course, extreme is in the direction of the H1 (or both directions in a two-tailed test).

For test statistics, SL students only have to know how to calculate $\chi^2_{calc}$ using GDC, and reject H0 if $\chi^2_{calc}$ exceeds the critical value.

| H1 | conditions to reject H0 | notes |
|---|---|---|
| Not independent or not following the model (for $\chi^2$-test) | $\chi^2_{calc} \geq$ critical value | |
| parameter $> k$ | (HL) test statistic $\geq$ critical value | 1-tail test (upper) |
| parameter $< k$ | (HL) test statistic $\leq$ critical value | 1-tail test (lower) |
| parameter $\neq k$ | (HL) test statistic outside confidence interval | 2-tail test |

The above is true when restricted to tests in IB Math AI HL. Not true in general.

Usually, $p$-values are easier because we reject H0 when $p <$ significance level (typically set at $0.01$, $0.05$, or $0.10$). The $p$-value is the conditional probability of a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. IB sometimes avoids stating the significance level, in which case reject H0 if $p < 0.01$, and fail to reject H0 if $p > 0.10$.


The conclusion has two parts: a reason and an outcome (decision). The reason is an inequality comparing either the $p$-value against the significance level, or the test statistic against the critical value. The outcome should reject or fail to reject H0.

It is incorrect to say “accept H0”, because we haven’t proved it to be true. Rather it remains merely “plausible”. Always say sufficient or insufficient evidence to reject H0, regardless of how the question is phrased. If the question says “in context”, then also rephrase the problem statement.

The typical workflow is

  • H0, H1, with units (eg kg, cm, etc.) if applicable;
  • (HL) the distribution assuming H0 is true (for $z$, binomial, Poisson only), “the null distribution”;
  • degrees of freedom (for $\chi^2$ only);
  • $p$-value or test statistic;
  • an inequality (reason); and
  • an outcome.

Include all these elements if the question just says to “test at the […] significance level”.

As all of these steps depend on you choosing the right test, triple check you are using the correct test before you proceed.

$t$-test

t-test is for checking the mean of either a large sample, or of normally distributed data points with unknown variance.

Extra material slightly beyond HL:

When sampling $n$ independent observations from $\mathrm N(\mu, \sigma^2)$, ie normal distribution with mean $\mu$ and variance $\sigma^2$, we can calculate a sample mean $\bar X$ and an unbiased estimate of the population variance $S_{n-1}^2$, such that

$$\frac{\bar X - \mu}{S_{n-1} / \sqrt n} \sim t_{n-1}$$

where the $t$-statistic on the left follows a $t$-distribution with $n-1$ degrees of freedom. GDC computes the statistic from H0 and the sample, then finds the $p$-value under the $t$-distribution.

A $t$-distribution is like the normal distribution, but with heavier tails and a lower peak. You can see graphs at Wikipedia: t-distribution.

As a consequence of Central Limit Theorem (HL), when sampling from any distribution with finite mean and variance, we have

$$\frac{\bar X - \mu}{S_{n-1} / \sqrt n} \to t_{n-1}, \text{ as } n\to\infty$$

As such, we can use the $t$-test for large samples, regardless of what the underlying distribution is.

In the HL exam, you need either $n > 30$ (for an approximate test) or normally distributed data with unknown variance (for an exact test) to use the $t$-test. Furthermore, SL students will always be given the data for a $t$-test, and not just the summary statistics. Nonetheless, it is still possible at SL for IB to give the sample mean, ask you to find a missing value, then ask you to run a $t$-test.

| type | usage | degrees of freedom $\nu$ | null hypothesis H0 |
|---|---|---|---|
| 1-sample | sample mean vs benchmark | $n-1$ | $\mu =$ [value] |
| pooled 2-sample | sample A mean vs sample B mean, same $\sigma$ | $n_A + n_B - 2$ | $\mu_A = \mu_B$ |
| paired (HL) | aspect A vs aspect B of the same sample (eg before vs after), same $\sigma$ | $n-1$ | $\mu_A = \mu_B$ |
| correlation coefficient (HL) | whether two normal distributions are linearly correlated | $n-2$ | $\rho = 0$ |

The alternative hypothesis H1 depends on the problem context and involves $<$, $>$, or $\neq$. All of these are strict inequalities. All above tests may use any of the three inequalities for H1.

You probably will not be assessed on $t$-test degrees of freedom.

In the (pooled) 2-sample $t$-test, the two samples are independent. In the paired $t$-test, there is only one sample; we find the difference between the two dependent data lists, before doing a 1-sample test.

1-sample $t$-test

The 1-sample $t$-test checks if the population mean is equal to a pre-determined benchmark.

Example: data from SPEC AI Paper 1 SL Q9 [Maximum mark: 6]

Ms Calhoun measures the heights of students in her mathematics class. She is interested to see if the mean height of male students, $\mu_1$, is less than 150 cm.

| Male height (cm) | 150 | 148 | 143 | 152 | 151 | 149 | 147 |
| Female height (cm) | 148 | 152 | 154 | 147 | 146 | 153 | 152 | 150 |

At the 10% level of significance, a t-test was used to test this. The data is assumed to be normally distributed and the standard deviations are equal between the two groups.

(a) State the null and alternative hypotheses. [2]

$$H_0: \mu_1 = 150 \text{ cm} \\ H_1: \mu_1 < 150 \text{ cm}$$

For 1-sample $t$-tests, always use $\mu$ and not words even if the question does not mention the quantity $\mu$.

(b) Calculate the p-value for this test. [2]

$p \approx 0.127$ \qed
  1. Enter the data in stat 1:Edit... L₁.

    male student heights entered in L₁
  2. In stat (TESTS) select 2:T-Test.... Select Data for input type. Enter the null hypothesis, list, frequency (for grouped data, which we don’t have in this question), and the alternative hypothesis. Highlight Calculate and press enter.

    T-Test. Input: Data. mu0: 150. List: L₁. Freq: 1. mu: < mu0
    Input screen for 1-sample t-test
  3. The screen shows the alternative hypothesis you inputted (not the conclusion), the $t$ test statistic, the $p$-value, and summary statistics. $n$ is the sample size, not the degrees of freedom.

    T-Test. mu<150. t = -1.26. p=0.127. mean x = 148.57. Sx = 2.9921. n = 7
    1 sample results. First line is H1, regardless if it is accepted or rejected

(c) State, giving a reason, whether there is sufficient evidence to claim that the male students are on average shorter than 150 cm. [2]

As $p \approx 0.127 > 0.10$, there is insufficient evidence to claim the male population average is below 150 cm \qed
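As a cross-check away from the GDC, the statistic and lower-tail $p$-value above can be reproduced with a short Python sketch. It uses only the standard library; the helper names `t_pdf` and `t_cdf` are illustrative, and the $p$-value comes from numerically integrating the $t$-density with Simpson's rule rather than from any built-in.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about 0 (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    return 0.5 + area if t >= 0 else 0.5 - area

male = [150, 148, 143, 152, 151, 149, 147]
mu0 = 150  # benchmark from H0
n = len(male)

# t = (sample mean - mu0) / (s_{n-1} / sqrt(n))
t_stat = (statistics.mean(male) - mu0) / (statistics.stdev(male) / math.sqrt(n))
p_lower = t_cdf(t_stat, n - 1)  # H1 is mu < 150, so use the lower tail

print(round(t_stat, 2), round(p_lower, 3))  # agrees with the GDC screen: -1.26 0.127
```

On the exam the GDC is still the expected tool; the sketch only shows where the two numbers come from.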

pooled 2-sample $t$-test

The pooled 2-sample $t$-test assumes that the population variances are equal, and checks whether the two population means are also equal.

Example: SPEC AI Paper 1 SL Q9 [Maximum mark: 6]

Ms Calhoun measures the heights of students in her mathematics class. She is also interested to see if the mean height of male students, $\mu_1$, is the same as the mean height of female students, $\mu_2$. The information is recorded in the table.

| Male height (cm) | 150 | 148 | 143 | 152 | 151 | 149 | 147 |
| Female height (cm) | 148 | 152 | 154 | 147 | 146 | 153 | 152 | 150 |

At the 10% level of significance, a t-test was used to compare the means of the two groups. The data is assumed to be normally distributed and the standard deviations are equal between the two groups.

(a) State the null and alternative hypotheses for this new test. [2]

$$H_0: \mu_1 = \mu_2 \\ H_1: \mu_1 \neq \mu_2$$

(b) Calculate the p-value for this new test. [2]

$p \approx 0.296$ \qed
  1. Enter the data in stat 1:Edit... L₁ and L₂.

    male student heights entered in L₁
  2. In stat (TESTS) select 4:2-SampTTest.... Select Data for input type. Enter the two lists, frequencies (for grouped data, which we don’t have in this question), and the alternative hypothesis. Always choose Pooled. Highlight Calculate and press enter.

    T-Test. Input: Data. List1: L₁. List2: L₂. Freq1:1. Freq2:1. mu1: not equal to mu2.
    Data entry for 2-sample t-test. Always use pooled on the exam.
  3. The screen shows the alternative hypothesis you inputted (not the conclusion), the $t$ test statistic, the $p$-value, degrees of freedom ($7 + 8 - 2 = 13$), and summary statistics.

    T-Test. mu1 not equal to mu2. t = -1.089. p=0.2957. df=13. mean x1 = 148.57. mean x2=150.25. Sx1 = 2.9921. Sx2 = 2.9641

(c) State, giving a reason, whether Ms Calhoun should accept the null hypothesis. [2]

As $0.296 > 0.10$, there is insufficient evidence to reject H0 \qed
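The pooled statistic can also be computed by hand. The sketch below uses the usual pooled-variance formula (assumed here, since the page leaves the arithmetic to the GDC) and reproduces the $t$ value and degrees of freedom on the screen; the variable names are illustrative.

```python
import math
import statistics

male = [150, 148, 143, 152, 151, 149, 147]
female = [148, 152, 154, 147, 146, 153, 152, 150]
n1, n2 = len(male), len(female)

# Pooled variance: both sample variances weighted by their degrees of freedom
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * statistics.variance(male)
              + (n2 - 1) * statistics.variance(female)) / df

# Standard error of the difference in sample means under equal variances
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (statistics.mean(male) - statistics.mean(female)) / std_err

print(round(t_stat, 3), df)  # -1.089 13
```

The two-tailed $p$-value is then the probability, under a $t$-distribution with 13 degrees of freedom, of a statistic at least this far from zero in either direction, which is what the GDC reports.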

paired $t$-test (HL)

In a paired $t$-test, the same sample is studied either in two aspects, or two instances in time. To perform this test, we find the difference of the two equally-long lists and perform a 1-sample $t$-test on the difference.

Do not state hypotheses that imply causation, as the paired $t$-test is only about observed changes.

Example: Specimen Paper 3 HL Q1

IB World School A wants to evaluate their teaching.

A group of eight students were randomly selected. They were given a standardized test at the start of the course and a prediction for total IB points was made based on that test; this was then compared to their points total at the end of the course.

Previous results indicate that both the predictions from the standardized tests and the final IB points can be modelled by a normal distribution.

It can be assumed that:

  • the standardized test is a valid method for predicting the final IB points
  • that variations from the prediction can be explained through the circumstances of the student or school

School A also gives each student a score for effort in each subject. This effort score is based on a scale of 1 to 5 where 5 is regarded as outstanding effort.

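| Student number | Gender | Predicted IB points | Final IB points | Average effort score |
|---|---|---|---|---|
| 1 | male | 43.2 | 44 | 4.4 |
| 2 | male | 36.5 | 34 | 4.2 |
| 3 | female | 37.1 | 38 | 4.7 |
| 4 | male | 30.9 | 28 | 4.3 |
| 5 | male | 41.1 | 39 | 3.9 |
| 6 | female | 35.1 | 39 | 4.9 |
| 7 | male | 36.4 | 40 | 4.9 |
| 8 | male | 38.2 | 38 | 4.3 |
| Mean | | 37.31 | 37.5 | 4.45 |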

(d) Use a paired t-test to determine whether there is significant evidence that the students in school A have improved their IB points since the start of the course. [4]

H0: $\mu_\text{predicted} = \mu_\text{final}$

H1: $\mu_\text{predicted} < \mu_\text{final}$

$p \approx 0.423$

There is insufficient evidence of improved IB points \qed

  1. Enter the before/after data in stat 1:Edit... L₁ and L₂. You should input the before data in L₁, and after data in L₂.

    old marks in L₁. new marks in L₂
    data entry with before in L₁ and after in L₂
  2. In L₃ header, set it as L₂ - L₁. enter to populate the list.

  3. Follow the same procedure as 1-sample test, using L₃ for the list. Set null hypothesis to 0.

    T-Test. Input: Data. mu0: 0. List: L₃. Freq: 1. mu: > mu0
    Input screen for 1-sample t-test
    T-Test. mu>0. t = 0.20158. p=0.42299. mean x = 0.1875. Sx = 2.6308. n = 8
    1 sample results. First line is H1, regardless if it is accepted or rejected

Note that the mean difference is $0.1875$, which matches the given difference $37.5 - 37.31 = 0.19$. IB will sometimes give such info so you can catch data entry errors, if any.
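The "difference list, then 1-sample test" idea can be sketched in a few lines of Python. This mirrors the L₃ = L₂ - L₁ step and reproduces the mean difference and $t$ statistic from the results screen; it stops short of the $p$-value, which the GDC supplies.

```python
import math
import statistics

predicted = [43.2, 36.5, 37.1, 30.9, 41.1, 35.1, 36.4, 38.2]
final = [44, 34, 38, 28, 39, 39, 40, 38]

# Same as L3 = L2 - L1 on the GDC: after minus before, student by student
diff = [f - p for f, p in zip(final, predicted)]
n = len(diff)

mean_diff = statistics.mean(diff)
# 1-sample t statistic on the differences, with H0: mean difference = 0
t_stat = mean_diff / (statistics.stdev(diff) / math.sqrt(n))

print(round(mean_diff, 4), round(t_stat, 3))  # 0.1875 0.202
```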

correlation $t$-test (HL)

The correlation $t$-test checks whether, given the sample size and the measured $r$, we can conclude that the correlation coefficient $\rho$ between two variables is non-zero. The variables must be normally distributed in order to use this test.

$r$ and $\rho$ only measure linear relationships. This test is often examined together with linear regression. If we fail to reject the null hypothesis, there is insufficient evidence to justify a linear relationship.

Example: Specimen Paper 3 HL Q1

This example started in the previous section

It is claimed that the effort put in by a student is an important factor in improving upon their predicted IB points.

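| Student number | Gender | Predicted IB points | Final IB points | Average effort score |
|---|---|---|---|---|
| 1 | male | 43.2 | 44 | 4.4 |
| 2 | male | 36.5 | 34 | 4.2 |
| 3 | female | 37.1 | 38 | 4.7 |
| 4 | male | 30.9 | 28 | 4.3 |
| 5 | male | 41.1 | 39 | 3.9 |
| 6 | female | 35.1 | 39 | 4.9 |
| 7 | male | 36.4 | 40 | 4.9 |
| 8 | male | 38.2 | 38 | 4.3 |
| Mean | | 37.31 | 37.5 | 4.45 |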

(f) (i) Perform a test on the data from school A to show it is reasonable to assume a linear relationship between effort scores and improvements in IB points. You may assume effort scores follow a normal distribution.

(ii) Hence, find the expected improvement between predicted and final points for an increase of one unit in effort grades, giving your answer to one decimal place.


H0: $\rho = 0$

H1: $\rho > 0$

$p \approx 0.00157$. There is sufficient evidence that effort is an important factor in IB points improvement \qed

gradient (change in improvement over change in effort) is approx. $6.6$ \qed

  1. Input the two variables in stat 1:Edit... L₁ and L₂. Remember which one is which.

  2. In stat (TESTS), choose F:LinRegTTest.... Input your independent (what you change) followed by dependent (what you observe) variables. Store the regression equation in alpha trace Y1 or a different function.

    TI 84 Plus CE Menu showing Tests and F: LinRegTTest
    LinRegTTest. Xlist: L₂. Ylist:L₁ Freq: 1. beta and rho: > 0. RegEQ: Y1.
    Effort L₂ in Xlist because part (ii) wants change in grade over change in effort. Otherwise if you just want the p-value, doesn't matter which order we put for XList or YList.
  3. The results show the model, the alternative hypothesis (not the conclusion), the test statistic, $p$-value, degrees of freedom ($8 - 2 = 6$), and regression line constant and gradient.

    LinRegTTest. y=a+bx. beta > 0 and rho > 0. t = 4.756. p = 0.001569. df = 6. a = -29.167. b = 6.5966
    First 1-2 lines are always alternative hypothesis, regardless whether it is accepted or not.
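For readers who want to see where the LinRegTTest numbers come from, here is a hedged sketch. It assumes the standard relation $t = r\sqrt{(n-2)/(1-r^2)}$ between the sample correlation and the test statistic, which the page does not state explicitly, and computes the least-squares gradient from the same sums.

```python
import math

effort = [4.4, 4.2, 4.7, 4.3, 3.9, 4.9, 4.9, 4.3]
predicted = [43.2, 36.5, 37.1, 30.9, 41.1, 35.1, 36.4, 38.2]
final = [44, 34, 38, 28, 39, 39, 40, 38]
improvement = [f - p for f, p in zip(final, predicted)]

n = len(effort)
mean_x = sum(effort) / n
mean_y = sum(improvement) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in effort)
syy = sum((y - mean_y) ** 2 for y in improvement)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(effort, improvement))

r = sxy / math.sqrt(sxx * syy)                   # sample correlation coefficient
t_stat = r * math.sqrt((n - 2) / (1 - r * r))    # test statistic with n - 2 df
gradient = sxy / sxx                             # b in y = a + bx
intercept = mean_y - gradient * mean_x           # a in y = a + bx

print(round(r, 3), round(t_stat, 2), round(gradient, 1), round(intercept, 1))
```

Swapping the roles of effort and improvement leaves `r` and `t_stat` unchanged but changes the gradient, which is why the order of Xlist and Ylist matters for part (ii) only.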

chi-squared test

Chi-squared test checks if categorical data involving frequencies is distributed as expected. The categories must be non-overlapping and cover the entire set of possible values. Categories may be qualitative (eg mode of transportation) or quantitative (eg weight).

Extra material slightly beyond HL:

For categorical data of sample size $n$, with $k$ mutually exclusive classes, if $O_i$ and $E_i$ are respectively the observed and expected frequencies of class $i$, and $O_i \in \mathbb N, \ E_i \in \mathbb R^+$, then,

$$\sum_{i=1}^k \frac{\left(O_i - E_i\right)^2}{E_i} \to \chi^2_{k-1}, \text{ as } n\to\infty$$

The summation on the left approaches a $\chi^2$-distribution with $k-1$ degrees of freedom. A $\chi^2$-distribution is only defined for positive $x$. You can see graphs at Wikipedia: chi squared-distribution.

Chi-squared test is always an approximate test. In HL, the expected frequencies for all categories must exceed 5, and you need to merge the smallest classes as necessary.

| type | usage | degrees of freedom $\nu$ | null hypothesis H0 |
|---|---|---|---|
| independence | if row and column variables are independent | $(\text{rows} - 1)(\text{cols} - 1)$ | [row var.] is independent of [column var.] |
| goodness of fit | if observed freq. match the predicted | $k - 1$, with $k$ classes | freq of [x] are distributed according to [model] |
| goodness of fit (HL) | if observed freq. match the predicted | $k - 1 - p$, with $k$ classes; model using $p$ parameters | freq of [x] are distributed according to [model] |

assuming all rows and columns can have different frequencies

Note that for independence the GDC can calculate the expected frequencies, but for goodness of fit we need to supply the expected frequencies as well as the degrees of freedom to the calculator. Independence uses a contingency table, and the data define/impact the expected values; whereas in goodness of fit, the expected is separate from the observed.

independence

Example: AI Specimen P1 SL Q6

As part of a study into healthy lifestyles, Jing visited Surrey Hills University. Jing recorded a person’s position in the university and how frequently they ate a salad. Results are shown in the table.

Salad meals per week

| | 0 | 1-2 | 3-4 | > 4 |
|---|---|---|---|---|
| Student | 45 | 26 | 18 | 6 |
| Professors | 15 | 8 | 5 | 12 |
| Staff and Administration | 16 | 13 | 10 | 6 |

Jing conducted a $\chi^2$ test for independence at a $5\%$ level of significance.

(a) State the null hypothesis. [1]

H0: Number of salads eaten in a week is independent from the person’s position \qed

(b) Calculate the p-value for this test. [2]

$p \approx 0.0201$ \qed
  1. In 2nd x⁻¹, arrow for EDIT. Select 1:[A]. Set the dimensions in number of rows and columns. For our example, 3 by 4.

    The tabular data entered as a 3 by 4 matrix
  2. In stat (TESTS), choose C:χ²-Test... Set Observed: [A] and Expected: [B]. The matrices are available from 2nd x⁻¹ Names. [B] should be empty.

    chi-squared test. Observed: [A]. Expected: [B].
  3. The result is shown with a chi-squared test statistic, $p$-value, and degrees of freedom ($(3-1)(4-1) = 6$).

    chi squared statistic: 15.0187. p = 0.020112. df = 6

    Matrix [B] should now be populated with the expected frequencies.

(c) State, giving a reason, whether the null hypothesis should be accepted. [2]

Since $0.0201 < 0.05$, there is sufficient evidence to reject H0 \qed
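The expected frequencies that the GDC stores in [B], and the resulting statistic, can be reproduced as follows. This is a minimal sketch: the helper `chi2_upper_tail_even_df` is an illustrative name, and it relies on a closed form for the upper tail that only holds when the degrees of freedom are even, which happens to be the case here (6).

```python
import math

observed = [
    [45, 26, 18, 6],    # Student
    [15, 8, 5, 12],     # Professors
    [16, 13, 10, 6],    # Staff and Administration
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under independence: expected = row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for X ~ chi-squared with an EVEN number of degrees of freedom."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

p_value = chi2_upper_tail_even_df(chi2, df)
print(round(chi2, 4), df, round(p_value, 4))  # 15.0187 6 0.0201
```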

(d) [BONUS] Find the smallest $\chi^2$ test statistic that would reject the null hypothesis. [2]

The degrees of freedom is $6$. We need $\mathrm{P}(X \geq \chi^2_\text{crit}) = 0.05$ where

$$X \sim \chi^2_6$$

Graph the $\chi^2$ cumulative distribution function (cdf), then numerically solve for the intersection with $0.95$, as $\chi^2$ is always upper tail.

The critical value is $12.6$ \qed

So another way part (c) can be asked is that we are given $12.6$ as the critical value, and see that the test statistic of $15.0 > 12.6$, meaning we reject H0 in favor of H1.

  1. Graph the chi-squared cdf and the line $y = 0.95$. Use 2nd vars, 8:χ²cdf(.

    chi squared cdf. lower:0. upper:X. df:6. Paste
    Y= screen. Y1=chi squared cdf(0,X,6). Y2=0.95
  2. Solve graphically for the intersection.

    graph of Y1 and Y2 from x = 0 to 15. y from 0 to 1. Intersection. X = 12.592. Y = 0.95
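The graphical intersection above is a numerical root-find, so it can be sketched as a bisection. As before, the upper-tail helper is an illustrative name and only valid for an even number of degrees of freedom; `critical_value` and its search bounds are likewise assumptions made for this sketch.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for X ~ chi-squared with an EVEN number of degrees of freedom."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

def critical_value(alpha, df, lo=0.0, hi=100.0):
    """Bisection for x with P(X >= x) = alpha; the upper tail decreases as x grows."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_upper_tail_even_df(mid, df) > alpha:
            lo = mid   # tail still too large, so the critical value is further right
        else:
            hi = mid
    return (lo + hi) / 2

crit = critical_value(0.05, 6)
print(round(crit, 3))  # 12.592, the same intersection found on the graph
```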

goodness of fit

Example: SPEC AI P2 HL Q2 / SL Q2. [Maximum mark: 12] Slugworth Candy Company sell a variety pack of colourful, shaped sweets.

According to manufacturer specifications, the colours in each variety pack should be distributed as follows.

| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Percentage (%) | 15 | 25 | 20 | 20 | 10 | 10 |

Mr Slugworth opens a pack of 80 sweets and records the frequency of each colour.

| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Observed Frequency | 10 | 20 | 16 | 18 | 12 | 4 |

To investigate if the sample is consistent with manufacturer specifications, Mr Slugworth conducts a $\chi^2$ goodness of fit test. The test is carried out at a $5\%$ significance level.

(b) Write down the null hypothesis for this test. [1]

The colours are distributed according to manufacturer specs \qed

(c) Copy and complete the following table in your answer booklet. [2]

| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Expected Frequency | | | | | | |

We need to convert the expected percentages to expected frequencies. The recommended way is to store the percentages in a list, then multiply the list by the sample size, ie $80$, as given in the question.

| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Expected Frequency | 12 | 20 | 16 | 16 | 8 | 8 |

We got whole number expected frequencies by chance; decimal expected frequencies should be kept and not rounded.

Input the observed frequencies in stat 1:Edit... L₁, and expected percentages in L₂. Go to L₃ header and set it to 80L₂.

Observed frequencies in L1. Expected percentages in L2.
Set L₃ equal to 80L₂
L₃ now populated with expected frequencies

(d) Write down the number of degrees of freedom. [1]

$6 - 1 = 5$ \qed

(e) Find the p-value for the test. [2]

$p \approx 0.469$ \qed
  1. In stat (TESTS), choose D:χ²GOF-Test.... Input the observed and expected frequency lists, the degrees of freedom, and Calculate.

    chi squared GOF-Test. Observed: L₁. Expected:L₃ df:5. Calculate
  2. The results show the test statistic, $p$-value, degrees of freedom, and the contributions (CNTRB) from each class to the test statistic.

    chi squared GOF-Test. chi squared = 4.5833. p = 0.46881. df=5 CNTRB=0.333, 0, 0 ...

(f) State the conclusion of the test. Give a reason for your answer. [2]

$0.469 > 0.05$, so there is insufficient evidence to reject H0 \qed
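The list arithmetic for this question (percentages to expected frequencies, then per-class contributions) can be sketched in Python. The $p$-value is left to the GDC here, because the simple closed form for the $\chi^2$ tail only applies to even degrees of freedom and this test has 5.

```python
percentages = [15, 25, 20, 20, 10, 10]   # manufacturer specification, in %
observed = [10, 20, 16, 18, 12, 4]       # Brown, Red, Green, Orange, Yellow, Purple

n = sum(observed)                         # 80 sweets in the pack
expected = [pct * n / 100 for pct in percentages]

# Each class contributes (O - E)^2 / E, matching CNTRB on the GDC
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)
df = len(observed) - 1                    # no parameters estimated from the data

print(expected)
print(round(chi2, 4), df)  # 4.5833 5
```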

goodness of fit with parametric model (HL)

Same as above but we need to subtract the number of estimated parameters in our model, from the degrees of freedom. The model must return a finite number of non-overlapping categories. If the categories are quantitative, the model can be normal distribution, with groupings provided or outlined; binomial distribution; Poisson distribution; or another discrete distribution to be given in the question.

Example: Adapted from May 2023 P2 TZ1 HL Q5

Goran is interested in the number of sightings of a particular bird each week in the 50 weeks following the first day of September. He collects some data which is shown in the table.

| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 | More than 5 |
|---|---|---|---|---|---|---|---|
| Number of weeks | 8 | 16 | 13 | 8 | 3 | 2 | 0 |

Goran believes that the data follows a Poisson distribution. Goran decides to test at the 5% significance level to see if his belief is correct. His null hypothesis is $X \sim \mathrm{Po}(m)$, where the random variable, $X$, is defined as the number of sightings per week.

Goran estimates parameter $m$ to be the mean of the sample, $1.76$.

| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 or More |
|---|---|---|---|---|---|---|
| Expected frequencies | | | | | | |

(c) Copy the table and fill in the expected frequencies for sightings per week in the 50 weeks after the first day of September [7]

The expected frequency for exactly $x$ sightings is

$$E_x = \mathrm P(X = x) \cdot 50$$

For the “5 or more” class, we want

$$E_{5+} = \mathrm P(X \geq 5) \cdot 50 = (1 - \mathrm P(X \leq 4)) \cdot 50$$

using whichever form is easier on your GDC.

Answer:

| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 or More |
|---|---|---|---|---|---|---|
| Expected frequencies | 8.60 | 15.1 | 13.3 | 7.82 | 3.44 | 1.68 |
  1. In lists stat 1:Edit..., insert a list using 2nd stat OPS choose 5:seq( to define a sequence. This allows us to evaluate a function over a list. Specifically, we want D:poissonpdf( from 2nd vars and poissonpdf(1.76,X)*50. We should go from 0 to 4, for the first five columns.

  2. In last row in L₂, enter (1-poissoncdf(1.76,4))*50

(d) State a reason why Goran should combine groups to conduct his significance test. [1]

Some expected frequencies are less than 5 \qed

To combine values, we can use L₂(5) for instance to get the fifth element. We set the new L₂(5) equal to L₂(5) + L₂(6). And then delete L₂(6).
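The expected frequencies and the merging step can be checked with a short sketch. `poisson_pmf` is an illustrative helper, and the merge loop is a simplification that only combines classes from the right-hand end, which is all this question needs; the observed frequencies would be merged the same way.

```python
import math

def poisson_pmf(x, m):
    """P(X = x) for X ~ Po(m)."""
    return math.exp(-m) * m ** x / math.factorial(x)

m, weeks = 1.76, 50

# Classes 0..4 use P(X = x) * 50; "5 or more" takes whatever probability is left
expected = [poisson_pmf(x, m) * weeks for x in range(5)]
expected.append(weeks - sum(expected))

# Combine the last class into its neighbour while its expected frequency is below 5
merged = expected[:]
while len(merged) > 1 and merged[-1] < 5:
    tail = merged.pop()
    merged[-1] += tail

print([round(e, 2) for e in expected])
print([round(e, 2) for e in merged])
```

With these numbers the last two classes combine into a single "4 or more" class, leaving five classes in total.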

(e) Write down the degrees of freedom for the test. [1]

The formula is $k - 1 - p$, with $k = 5$ classes after combining groups, and $p = 1$ because we estimated 1 parameter $m$.

$5 - 1 - 1 = 3$ \qed

(f) Find the p-value for the test. [2]

Using 3 degrees of freedom, we get $p \approx 0.991$ \qed

(g) State the conclusion of the test. Justify your answer. [2]

$0.991 > 0.05$, so there is insufficient evidence to reject H0 \qed

Type I and type II errors (HL)

Type I and type II errors are properties of the test, and do not use anything from our sample. Compute the critical value before finding the probabilities of such errors.

The probability of a type I error is the conditional probability that we reject H0 even though it is true. For continuous tests including $t$, $\chi^2$, and $z$, this is the same as the significance level. For discrete tests including binomial and Poisson, this is at most the significance level, and usually smaller.

The probability of a type II error is the conditional probability that we fail to reject H0, assuming the parameter is actually some other value given in the question. Type II errors will only be asked for $z$, binomial, and Poisson tests.

Some examples are shown along the following tests.

Exact tests (HL)

Previously, both $t$ and $\chi^2$ tests are usually approximations that work for large samples. The following tests work for small samples as well, but require the population distribution to be known except for one parameter.

The tests ask the question: “if the last model parameter is the value from null hypothesis, what is the probability that the data is at least as extreme as the sample?” and answers with a $p$-value. This differs from regular distribution questions, in which we know all parameters.

| type | usage | null hypothesis H0 |
|---|---|---|
| $z$-test | $t$-test but known $\sigma$ | $\mu =$ [value] |
| binomial test | fraction (proportion) of population that meets some description | $\phi =$ [value] |
| Poisson test | long term average counts of some Poisson process in an interval | $m =$ [value] |

$\mu$ and $m$ can be used without defining the variable. For the binomial test, use the given variable from the question, or if none was given, define $\phi$ as the population proportion belonging to the specified category.

In theory, all the $t$-tests can also be $z$-tests if the population variance $\sigma^2$ is known.

Your GDC has built-in test statistic and $p$-value functions for the $z$-test. For binomial and Poisson tests, the test statistic is given or needs only simple arithmetic, and we can use the built-in binomcdf or poissoncdf to find $p$. There is a built-in to find the critical value of the binomial test, but not for Poisson, for which you will need to solve over the integers.

There is an explicit A1 for writing down the null distribution (the distribution involving the parameter from H0) for these tests.

$z$-test for population mean (HL)

The $z$-test is for checking if the sample mean of independent and identically distributed normal observations is as expected.

When sampling $n$ independent observations from a normal distribution $X \sim \mathrm N\left(\mu, \sigma^2 \right)$, the sample mean $\bar{X}$ follows a normal distribution with the same mean but variance scaled by $\frac{1}{n}$. Under H0,

$$\bar{X} \sim \mathrm N\left(\mu, \frac{\sigma^2}{n}\right)$$

You are required to be able to derive this.
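One possible outline of that derivation, assuming the standard results that expectation is linear, that variances of independent variables add, and that a linear combination of independent normal variables is itself normal:

```latex
\begin{aligned}
\bar X &= \frac{1}{n}\sum_{i=1}^n X_i, \qquad X_i \sim \mathrm N(\mu, \sigma^2) \text{ independent} \\
\mathrm E(\bar X) &= \frac{1}{n}\sum_{i=1}^n \mathrm E(X_i) = \frac{1}{n}\cdot n\mu = \mu \\
\mathrm{Var}(\bar X) &= \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}
\end{aligned}
```

Since $\bar X$ is a linear combination of independent normal variables, it is normal with this mean and variance.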

Extra material slightly beyond syllabus:

Alternatively, this is the same as

$$\frac{\bar{X} - \mu}{\sigma / \sqrt n} \sim \mathrm N(0, 1)$$

where $\mathrm{N}(0, 1)$ is the standard normal distribution, often denoted by $Z$.

Also recall the extra material from earlier:

$$\frac{\bar X - \mu}{S_{n-1} / \sqrt n} \sim t_{n-1}$$

When comparing to the $t$-test, we notice that the $z$-test requires a known $\sigma$ while the $t$-test does not. Both tests are exact with independent sampling from a normal distribution. Similarly, by Central Limit Theorem, both tests can be approximate for large samples when the distribution is not normal. In Math AI HL, we use the $t$-test when we do not know $\sigma$, and the $z$-test when we do.

Example: Practice AI P2 HL Q6 (from AI Teacher Support Material)

The masses in kilograms of melons produced by a farm can be modelled by a normal distribution with a mean of $2.6$ kg and a standard deviation of $0.5$ kg.

One year due to favourable weather conditions it is thought that the mean mass of the melons has increased. The owner of the farm decides to take a random sample of $16$ melons to test this hypothesis at the 5% significance level, assuming the standard deviation of the masses of the melons has not changed.

(c) Write down the null and alternative hypotheses for the test. [1]

H0: $\mu = 2.6 \text{ kg}$

H1: $\mu > 2.6 \text{ kg}$

(d) Find the critical region for this test. [4]

Let $\bar X$ be the sample mean in kg. Under H0,

$$\bar X \sim \mathrm N\left(2.6, \frac{0.5^2}{16}\right) \quad \text{required on the exam}$$

Remember that on most GDC, we input the standard deviation, but to show our work, we use variance.

Since H1 is $\mu > 2.6$, we want $\mathrm P(\bar X \geq x_{crit}) = 0.05$ and solve for $x_{crit}$, the critical value.

$$x_{crit} \approx 2.81$$

The critical region is $\bar X \geq 2.81 \text{ kg}$ \qed

Use 2nd vars 3:invNorm(.

TI-84 Plus non-CE users, and TI Nspire users should use area: 0.95 as there is no option for RIGHT tail.
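The same inverse-normal lookup can be sketched with Python's standard library, which, like the Nspire, works with the area to the left. The one-tailed value is the one this question asks for; the two-tailed pair at the end is only an illustration on the same null distribution and is not part of the exam question.

```python
from statistics import NormalDist

# Null distribution of the sample mean: N(2.6, 0.5^2 / 16), entered with its standard deviation
null_dist = NormalDist(mu=2.6, sigma=0.5 / 16 ** 0.5)

# Upper-tail test at 5%: area 0.95 lies to the left of the critical value
upper_crit = null_dist.inv_cdf(1 - 0.05)
print(round(upper_crit, 2))  # 2.81

# Illustration only: a two-tailed test at 5% puts 2.5% in each tail
low = null_dist.inv_cdf(0.025)
high = 2 * null_dist.mean - low   # mirror the lower value about the mean
print(round(low, 3), round(high, 3))
```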

2-tail critical values for the $z$-test are required. On the Nspire, for a 5% significance level, find area = 0.025 and area = 0.975 separately. Both TI 84 Plus CE and Casio have built-in two-tailed InvNorm / InvN. Casio only returns the lower critical value, and you need to compute $2\mu - \text{low}$ for the upper critical value. The critical region would be

$$\bar{X} \leq \text{low} \ \cup \ \bar X \geq \text{high}$$

Binomial test for population proportion (HL)

The binomial test is for checking if the observed percentage is as expected. The test statistic, $X$, is the number of measurements belonging to the specified category.

It can be seen as an exact version of the approximate $\chi^2$ test for two categories. Under H0,

$$X \sim \mathrm B(n, \phi)$$

Since $p$ primarily means $p$-value, IB will give you a variable for the proportion. If not given, define $\phi$ as the population proportion.

Example: May 2025 TZ2 Paper 2 HL 5

A zoologist collects a sample of cane beetles. He measures their length and categorizes them as “small” meaning from 10 to 12mm long, “medium” meaning from 12 to 16mm long and “large” meaning from 16 to 18mm long. He also notes their sex and records the frequencies in the following table.

Length, x mm

| Sex | Small (10 < x ≤ 12) | Medium (12 < x ≤ 16) | Large (16 < x ≤ 18) |
|---|---|---|---|
| Female | 42 | 25 | 19 |
| Male | 61 | 27 | 12 |

(a) Find how many cane beetles are in the zoologist’s sample. [1]

Ultimately we need not only the total number of cane beetles but also the number of males. The method below helps avoid certain data entry errors.

In general, for long sums, use lists with 1-Var Stats (or 2-Var Stats for two lists), or a spreadsheet, as opposed to using lots of plus signs. You should use 2 columns for the 2 categories, male and female. We are spending at most 15 extra seconds in order to make any data entry mistakes easy to spot. (It’s the same approach for grouped data / interval data. You may want to use a similar approach for Riemann sums, expected values, etc.)

There are 86 females, 100 males, and 186 in total \qed

  1. In stat 1:Edit..., enter the female data into L₁, and male data into L₂

  2. In stat CALC 2:2-Var Stats, use our lists. We read off Σx=86 and Σy=100
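The same two-list idea can be sketched in Python, with one list per sex so that each entry can be checked against the table. The list names are chosen for illustration.

```python
# Frequencies in the order small, medium, large, one list per sex
female = [42, 25, 19]
male = [61, 27, 12]

total_female = sum(female)
total_male = sum(male)
total = total_female + total_male

print(total_female, total_male, total)  # 86 100 186
```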


Let $\phi$ be the population proportion of cane beetles that are male.

(b) Test, at the 5% significance level, the hypothesis that more than 45% of cane beetles are male. Write the null and alternative hypotheses. State the $p$-value of your test and write your conclusion in context. [5]

The question gives a variable for the population proportion, so we can use it in our null and alternative hypotheses.

H0: $\phi = 0.45$

H1: $\phi > 0.45$

Note that proportions are just decimals between 0 and 1 and have no units.

Let $X$ be the observed number of males. Under H0,

$$X \sim \mathrm B(186, 0.45) \quad \text{required on the exam}$$

The critical region is in the direction of H1: $\phi > 0.45$, so we have

$$\begin{aligned} p &= \mathrm P(X \geq 100) \\ p &= 1 - \mathrm P(X \leq 99) \ \text{ (if needed on your GDC)} \\ p &\approx 0.0101 < 0.05 \end{aligned}$$

Reject H0; there is sufficient evidence that over 45% of cane beetles are male \qed

On the TI-84 Plus (CE), we must use $1 - \mathrm P(X \leq 99)$. Use B:binomcdf from 2nd vars.
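As a cross-check away from the GDC, the $p$-value can be sketched in Python with the standard library only. `binom_cdf` is a helper written for this illustration; it plays the role of binomcdf.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, phi0, observed_males = 186, 0.45, 100

# Upper-tail p-value: P(X >= 100) = 1 - P(X <= 99)
p_value = 1 - binom_cdf(observed_males - 1, n, phi0)

print(round(p_value, 4))  # the worked solution above quotes 0.0101
print("reject H0" if p_value < 0.05 else "do not reject H0")
```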


Alternative method if the question asked for the critical region, and not the $p$-value

[Null and alternative hypotheses same as above; $X$ same as above.]

Our GDC has a built-in inverse binomial for this. The returned value is the first one at which the cumulative probability crosses the entered area, so it is always excluded from the critical region, whether the test is lower-tailed or upper-tailed.

critical region: $X > 95$, critical value is $96$

$100 > 95$, so reject H0; there is sufficient evidence that over 45% of cane beetles are male \qed

For a lower-tail test, use area 0.05, and $X <$ [answer] for the critical region. The critical value would then be [answer] - 1.

In 2nd vars C:invBinom(, use

area:0.95
trials:186
p:0.45

The type I error rate is

$$\mathrm P(X > 95) = 1 - \mathrm P(X \leq 95) \approx 0.0413$$

For binomial and Poisson tests, the type I error rate is at most the significance level, and usually just below it, because the test statistic only takes whole-number values.
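The critical region and type I error rate above can be reproduced with a short Python sketch. `inv_binom` is a hypothetical helper, written on the assumption that the GDC's inverse binomial returns the smallest value whose cumulative probability is at least the entered area.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def inv_binom(area, n, p):
    """Smallest k with P(X <= k) >= area (assumed behaviour of the GDC built-in)."""
    k = 0
    while binom_cdf(k, n, p) < area:
        k += 1
    return k

n, phi0 = 186, 0.45
k = inv_binom(0.95, n, phi0)          # upper-tail test at 5%, so area = 0.95
type_one = 1 - binom_cdf(k, n, phi0)  # P(X > k) under H0

print(k)                   # the critical region is X > k
print(round(type_one, 4))  # stays below 0.05
```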

Poisson distribution (HL)

A Poisson process is, loosely, a “rare event” that occurs a finite number of times over any continuous time interval. $m$ is the expected number of independent occurrences over a fixed duration (eg 1 hour). When $X$ follows a Poisson distribution with parameter $m$,

$$X \sim \mathrm{Po}(m)$$

$X$ describes the number of occurrences over a particular interval (eg noon to 1 pm).

We have

$$m = \mathrm E(X) = \mathrm{Var}(X)$$

Extra material slightly beyond AI HL:

The source of randomness is that the events occur at random times. The probability of a long wait between events decreases exponentially with the length of the wait. When the events are independent, the wait times are also independent. The number of events in any periods of equal length can be modelled by the same Poisson distribution.

The sum of two independent Poisson random variables is another Poisson random variable, with parameter equal to the sum of the parameters.

$$X_1 \sim \mathrm{Po}(m_1), \text{ and } X_2 \sim \mathrm{Po}(m_2) \implies X_1 + X_2 \sim \mathrm{Po}(m_1 + m_2)$$

A special case is that if the duration is multiplied by $k$, then the new parameter is $km$.

Example: The passage of each type of vehicle at an intersection is modelled by a Poisson distribution. On average 20 cars pass the intersection every 10 minutes, and on average 1 bus passes every 6 minutes. Find the probability that at least 70 vehicles pass in 25 minutes.


In 25 minutes, we expect $\frac{25}{10}(20) = 50$ cars, and $\frac{25}{6}(1) = \frac{25}{6}$ buses. Let $V$ be the number of vehicles in a 25 minute interval. Assuming cars and buses pass independently,

$$V \sim \mathrm{Po}\left(50 + \frac{25}{6}\right)$$

We want $\mathrm P(V \geq 70)$. This is the same as $1 - \mathrm P(V \leq 69)$, which can be calculated on our GDC.

$$1 - \mathrm P(V \leq 69) \approx 0.0219$$ \qed

In 2nd vars use E:poissoncdf to find a sum of Poisson distribution probabilities. Use an upper bound of 69 vehicles.

We can imagine a scenario where we have an H0 of $m = 50+\frac{25}{6}$. Here $p \approx 0.0219 < 0.05$, so at the 5% significance level we would reject H0 in favor of an H1 of $m > 50 + \frac{25}{6}$.
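The vehicle calculation can be sketched in Python as follows. `poisson_cdf` is a helper written for this illustration that plays the role of poissoncdf; it builds each probability from the previous one.

```python
from math import exp

def poisson_cdf(k, m):
    """P(X <= k) for X ~ Po(m), built up term by term."""
    term = exp(-m)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= m / i  # P(X = i) obtained from P(X = i - 1)
        total += term
    return total

m_cars = 25 / 10 * 20  # expected cars in 25 minutes
m_buses = 25 / 6 * 1   # expected buses in 25 minutes
m = m_cars + m_buses   # parameters add for independent Poisson variables

prob = 1 - poisson_cdf(69, m)  # P(V >= 70)
print(round(prob, 4))          # the worked solution above quotes 0.0219
```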

Poisson test for population mean (HL)

The Poisson test checks whether $X$, the observed number of occurrences, is consistent with the expected mean. The Poisson test is exact. Under H0,

$$X \sim \mathrm{Po}(m)$$

Example: Specimen AI P1 HL 16

The number of fish that can be caught in one hour from a particular lake can be modelled by a Poisson distribution.

The owner of the lake, Emily, states in her advertising that the average number of fish caught in an hour is three.

Tom, a keen fisherman, is not convinced and thinks it is less than three. He decides to set up the following test. Tom will fish for one hour and if he catches fewer than two fish he will reject Emily’s claim.

(a) State a suitable null and alternative hypotheses for Tom’s test. [1]

H0: $m = 3$ (per hour)

H1: $m < 3$ (per hour)

(b) Find the probability of a Type I error. [2]

Under H0,

$$X \sim \mathrm{Po}(3) \quad \text{required on the exam}$$

The critical region is $X < 2$. The probability of a type I error is

$$\mathrm P(X < 2) = \mathrm P(X \leq 1) \approx 0.199$$ \qed

For GDC instructions scroll up to the previous section.


The average number of fish caught in an hour is actually 2.5.

(c) Find the probability of a Type II error. [3]

Under the actual mean $m = 2.5$,

$$Y \sim \mathrm{Po}(2.5) \quad \text{required on the exam}$$

A type II error means failing to reject H0 when the actual mean is 2.5, so we need the probability of the opposite (complement) of the critical region under this actual mean.

$$\mathrm P(Y \geq 2) = 1 - \mathrm P(Y \leq 1) \approx 0.713$$ \qed

For GDC instructions, see the earlier Poisson distribution section.
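Both error probabilities for Tom's one-hour test can be checked with a short Python sketch. `poisson_cdf` is a helper written for this illustration.

```python
from math import exp

def poisson_cdf(k, m):
    """P(X <= k) for X ~ Po(m), built up term by term."""
    term = exp(-m)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= m / i
        total += term
    return total

# Critical region: X < 2, ie X <= 1
type_one = poisson_cdf(1, 3)        # reject H0 although m = 3 is true
type_two = 1 - poisson_cdf(1, 2.5)  # fail to reject although m is really 2.5

print(round(type_one, 3))  # 0.199
print(round(type_two, 3))  # 0.713
```

The two probabilities use different means: the type I error is computed under H0, the type II error under the actual mean.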

(d) Tom tests for five hours against the original null and alternative hypotheses, where the hourly average is 3 vs less than 3. What is the maximum number of fish caught in five hours that makes the probability of making a Type I error less than 5%?

In five hours, under H0

$$W \sim \mathrm{Po}(15) \quad \text{required on the exam}$$

To be consistent with the built-in approach for the binomial test, we find the first value at which the cumulative probability exceeds the significance level (or 100% minus the significance level, for upper-tail tests). The critical region then starts from the next more extreme value.

The cumulative probability first exceeds 0.05 at $W = 9$. The maximum number of fish that keeps the Type I error under 5% is 8 fish \qed

  1. In y=, enter poissoncdf(15, X), using 2nd vars E:poissoncdf(

    [Screenshot: poissoncdf data entry, with μ = 15 and x value = X]

  2. Since the mean is small, we can just check the table starting from 0.

    [Screenshot: table of values. X = 8 gives 0.037446; X = 9 gives 0.0699]

On Casio, go to MENU 2.STAT (2. Statistics). In DIST, POISN, InvP, enter Area: 0.05, μ: 15.

Inverse Poisson
Data: Variable 
Area:0.05 
μ: 15 
Save Res: None 
Execute
Inverse Poisson. xInv=9

For left tail, the critical region is less than the xInv; for right tail, the critical region is greater than the xInv.
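The table search for the five-hour test can be sketched in Python. `poisson_cdf` is a helper written for this illustration, and the loop simply walks up the table until the next cumulative probability would reach 5%.

```python
from math import exp

def poisson_cdf(k, m):
    """P(X <= k) for X ~ Po(m), built up term by term."""
    term = exp(-m)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= m / i
        total += term
    return total

m0 = 15  # five hours at 3 fish per hour under H0

# Largest c with P(W <= c) < 0.05, for a lower-tail test
c = 0
while poisson_cdf(c + 1, m0) < 0.05:
    c += 1

print(c)                             # 8
print(round(poisson_cdf(c, m0), 6))  # 0.037446, as in the GDC table
```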


Example:

The city planning committee told the mayor that on average 1729 rideshares are requested daily, and that the number of requests per day can be modeled using a Poisson distribution. The mayor decides to test this at the 1% significance level, using the next day’s data, to see if the average is higher.

a) State the null and alternative hypotheses. [1]

H0: $m = 1729$ (rides daily)

H1: $m > 1729$ (rides daily)

b) Find the critical region. [3]

Under H0,

$$R \sim \mathrm{Po}(1729) \quad \text{required on the exam}$$

When solving over large integers, it is recommended to graph the function first (rounding or flooring X so that only whole numbers are used), before confirming with the table of values.

$$\mathrm P(R \leq r) \geq 0.99 \text{ when } r \geq 1826$$

The critical region is $R > 1826$ or, equivalently, $R \geq 1827$ \qed

  1. In y=, define a poissoncdf( but using round(X) for the x value. This is found in math NUM 2:round(. Also define Y2 = 0.99 for the upper-tail 1% significance level.

  2. Using appropriate window settings in window, press graph, then trace near the intersection point. Here, we can estimate that x is between 1729 and 1850 when setting the window.

  3. In 2nd window, start the table just below the intersection point. Press 2nd graph to view the table.

On Casio, go to MENU 2.STAT (2. Statistics). In DIST, POISN, InvP, enter Area: 0.99, μ: 1729.

Inverse Poisson
Data: Variable 
Area:0.99 
μ: 1729 
Save Res: None 
Execute
Warning. Area:0.99. xInv:1826. Area-0.01. *xInv:1815
The calculator reminds us that if 0.99 were inexact, the answer would be different. We can ignore the warning here.
Inverse Poisson. xInv=1826

For left tail, the critical region is less than the xInv; for right tail, the critical region is greater than the xInv.
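For a mean this large, a computer check needs some care because $e^{-1729}$ is too small for ordinary floating point. The sketch below, in standard-library Python, works with logarithms of each term instead. `poisson_cdf` is a helper written for this illustration.

```python
from math import exp, lgamma, log

def poisson_cdf(k, m):
    """P(X <= k) for X ~ Po(m); each term is evaluated via its logarithm
    so that a large mean does not make every term underflow."""
    return sum(exp(i * log(m) - m - lgamma(i + 1)) for i in range(k + 1))

m0 = 1729

# Upper-tail test at 1%: find the smallest r with P(R <= r) >= 0.99.
# Starting at the mean is safe because P(R <= 1729) is well below 0.99.
r = m0
while poisson_cdf(r, m0) < 0.99:
    r += 1

print(r)  # the critical region is R > r
```

The loop is the same search as the graph-then-table method, just done exhaustively.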

The next day, the mayor’s assistant tallied requests from the city’s top 5 most popular rideshare apps and identified 1819 requests.

c) Using this data, state the conclusion of the test and provide a reason. [2]

As 1819 is less extreme than 1827, there is insufficient evidence to reject H0 \qed

d) Identify two ways content validity plays a role in the results of the test. [2]

  • If the market share of the top 5 apps is low, then the actual number of requests may well exceed 1827.
  • The city planning committee and the different apps may be using different aggregation methods or definitions of “requests”, including possible requests of rides on a later day \qed

Unbiased estimates of population variance (HL)

Different authors use different definitions and notations which makes this simple topic really confusing.

In IB, sample variance, $s_n^2$, and sample standard deviation, $s_n$, refer to

$$s_n^2 = \frac1n \sum_{i=1}^n \left(X_i - \bar X\right)^2$$

and $s_{n-1}^2$, the unbiased estimate of the population variance, refers to

$$s_{n-1}^2 = \frac1{n-1} \sum_{i=1}^n \left(X_i - \bar X\right)^2 = \frac{n}{n-1}s_n^2$$

It is so named because $s_{n-1}^2$, averaged across many samples, approaches the population variance. Statistics is all about finding out information about the population, so this quantity is very useful.

However, outside of IB, this latter quantity can also be referred to as sample variance, and can also be written as $s_n^2$. Furthermore, the calculator may use σx for IB’s $s_n$ and sn for IB’s $s_{n-1}$. So keep that in mind as you use different math resources.

Example: Spec AI P1 HL Q9

A manager wishes to check the mean weight of flour put into bags in his factory. He randomly samples 10 bags and finds the mean weight is 1.478 kg and the standard deviation of the sample is 0.0196 kg.

(a) Find $s_{n-1}$ for this sample. [2]

$$s_{n-1} = \sqrt{\frac{n}{n-1} s_n^2} = \sqrt{\frac{n}{n-1}} \cdot s_n \approx 0.0207$$ \qed

[Screenshot: √(10/9 × 0.0196²) stored in S, giving 0.020660214]
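The conversion from $s_n$ to $s_{n-1}$ is a one-line calculation. A minimal Python sketch, with variable names chosen for illustration:

```python
from math import sqrt

n = 10
s_n = 0.0196  # standard deviation of the sample, as given in the question

# s_{n-1}^2 = n / (n - 1) * s_n^2, so take the square root of both sides
s_n_minus_1 = sqrt(n / (n - 1)) * s_n

print(round(s_n_minus_1, 4))  # 0.0207
```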

Confidence interval (HL)

If we take a large number of independent samples from the same population and construct a 95% confidence interval from each one (each corresponding to a two-tailed $t$- or $z$-test), about 95% of those intervals are expected to contain the true population mean ($\mu$). Similar properties hold for 90% and 99% confidence intervals.

It is wrong to say that there is a 95% probability that the true mean lies in a particular 95% confidence interval. The true mean is not probabilistic, ie a given interval either contains $\mu$ or it does not. The 95% only makes sense across many different intervals.

When the hypothesized mean $\mu_0$ from H0 (which is not necessarily the true mean) lies outside the confidence interval, we reject H0 in a two-tailed test at the matching significance level (eg 5% for a 95% interval).

Extra material slightly beyond AI HL:

The basis for that is that the confidence interval is an algebraic rearrangement of

[lower critical value < test statistic < upper critical value]

for a 2-tailed test in which H0 is not rejected. If $\mu_0$ lies outside the confidence interval, it means the test statistic was in the critical region.
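As a sketch of that rearrangement for the $t$-test, write $t^*$ for the upper critical value (a symbol introduced here for illustration, not IB notation). Multiplying through by $s_{n-1}/\sqrt n$ and isolating $\mu_0$ gives the interval:

```latex
-t^* < \frac{\bar x - \mu_0}{s_{n-1}/\sqrt{n}} < t^*
\iff
\bar x - t^* \frac{s_{n-1}}{\sqrt{n}} < \mu_0 < \bar x + t^* \frac{s_{n-1}}{\sqrt{n}}
```

The left side says the test statistic is outside the critical region; the right side says $\mu_0$ is inside the confidence interval.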

Example: Spec AI P1 HL Q9 example started in the previous section

A manager wishes to check the mean weight of flour put into bags in his factory. He randomly samples 10 bags and finds the mean weight is 1.478 kg and the standard deviation of the sample is 0.0196 kg.

(b) Find a 95% confidence interval for the population mean, giving your answer to 4 significant figures.

$$1.463 \leq \mu \leq 1.493$$ \qed

In stat, arrow to TESTS, then select either 7:ZInterval (when the population $\sigma$ is known) or 8:TInterval (when the population $\sigma$ is unknown). Here the population $\sigma$ is unknown, so we use TInterval.

[Screenshot: TInterval data entry. Input: Stats, x̄: 1.478, Sx: S, n: 10, C-Level: 0.95]

[Screenshot: TInterval result. (1.4632, 1.4928), x̄ = 1.478, Sx = 0.020660214, n = 10]

Note that S here is the variable stored in the previous section. This is a way on most calculators to reuse values without re-entering them.
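The interval can also be reproduced by hand from $\bar x \pm t^* \cdot s_{n-1}/\sqrt n$. The Python sketch below assumes the critical value $t^* \approx 2.262157$ for 9 degrees of freedom, taken from a standard $t$-table rather than computed, since the standard library has no inverse $t$ function. The variable names are chosen for illustration.

```python
from math import sqrt

n = 10
x_bar = 1.478
s_n = 0.0196
s_n_minus_1 = sqrt(n / (n - 1)) * s_n  # unbiased estimate, as in part (a)

# Assumed value: 97.5th percentile of the t-distribution with n - 1 = 9
# degrees of freedom, copied from a t-table
t_star = 2.262157

half_width = t_star * s_n_minus_1 / sqrt(n)
lower, upper = x_bar - half_width, x_bar + half_width

print(round(lower, 4), round(upper, 4))  # 1.4632 1.4928
```

Rounded to 4 significant figures this gives 1.463 and 1.493, matching the TInterval output.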