Hypothesis Testing for IB Math Applications and Interpretation
This page focuses on hypothesis testing, including null and alternative hypotheses, test statistics, significance level, critical values, critical regions (HL), type I/type II errors (HL), unbiased estimates of population variance (HL), and confidence intervals (HL). In terms of syllabus, this means SL 4.11, AHL 4.14 (unbiased estimates only), AHL 4.15, AHL 4.16, AHL 4.17, AHL 4.18. This page does not cover other statistics or probability concepts and techniques.
Last edited 2026-05-19: Clarification on 2-sample vs paired, on paired t-test H0 notation; introduced example using InvP on Casio for Poisson type I errors; clarified that writing down null distribution is required in the HL exact tests; added paragraph on 2-tail $z$-test critical values
Contents
- Concepts and presentation
- $t$-test
- chi-squared test
- Type I and type II errors (HL)
- Exact tests (HL)
- Unbiased estimates of population variance (HL)
- Confidence interval (HL)
Concepts and presentation
As a simplification, the $t$-test and the HL tests are about the mean, and the chi-squared ($\chi^2$) test is about variance. It’s possible to perform $t$ and the HL tests using summary statistics such as mean and sample variance (HL), but we need the full frequency data for $\chi^2$.
Our null hypothesis H0 involves some population parameter, or in the case of the $\chi^2$-test, a fact about the population. The alternative hypothesis typically involves an inequality. If H1 is $<$ or $>$ then it is a one-tailed test; if it is $\neq$ then it is a two-tailed test. When possible, state your H0 and H1 using symbols, as shown later. For example “mean is 5” is not a valid null hypothesis as it does not state “population”; $\bar{x} = 5$ is not valid either as it does not use a population parameter; instead we should write $\mu = 5$, as $\mu$ means “population mean”.
We compute a test statistic using H0 and a sample. If our H0 is correct, then the test statistic will follow some distribution.
For larger samples in the $t$-test when the data are not normally distributed, and for large samples in the $\chi^2$-test, the test statistic only approximately follows its distribution. The approximation is suitable for sufficiently large $n$. This is discussed in the respective (HL) sections.
We reject H0 if the test statistic is unlikely to follow the distribution, with probability less than the significance level $\alpha$. More on that later.
Conditions to reject H0
The following statements are equivalent; all are true or all are false depending on the sample.
- there is sufficient evidence to reject H0 in favor of H1
- $p < \alpha$, ie the $p$-value is less than the significance level
- test statistic is “unexpected” assuming H0 is true
- test statistic is at least as extreme as the critical value(s)
- test statistic is in the critical region, ie one or both tails (HL)
- null hypothesis parameter is outside the confidence interval (HL, for two-tailed t-test and two-tailed z-test only)
For chi-squared test, extreme means more positive. For other tests in the course, extreme is in the direction of the H1 (or both directions in a two-tailed test).
For test statistics, SL students only have to know how to calculate the $\chi^2$ statistic using GDC, and reject H0 if it exceeds the critical value.
| H1 | conditions to reject H0 | notes |
|---|---|---|
| Not independent or not following the model (for $\chi^2$-test) | $\chi^2$ statistic $>$ critical value | |
| parameter $>$ value | (HL) test statistic $\ge$ critical value | 1-tail test (upper) |
| parameter $<$ value | (HL) test statistic $\le$ critical value | 1-tail test (lower) |
| parameter $\neq$ value | (HL) test statistic outside confidence interval | 2-tail test |
Usually, $p$-values are easier because we reject H0 when $p <$ significance level (typically set at 1%, 5%, or 10%). The $p$-value is the conditional probability of a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. IB sometimes avoids stating the significance level, in which case reject H0 if $p$ is below all of the typical levels, and fail to reject H0 if $p$ is above all of them.
The conclusion has two parts: a reason and an outcome (decision). The reason is an inequality comparing either -value against significance level, or test statistic against critical value. The outcome should reject or fail to reject H0.
It is incorrect to say “accept H0”, because we haven’t proved it to be true. Rather it remains merely “plausible”. Always say sufficient or insufficient evidence to reject H0, regardless of how the question is phrased. If the question says “in context”, then also rephrase the problem statement.
The typical workflow is
- H0, H1, with units (eg kg, cm, etc.) if applicable;
- (HL) the distribution assuming H0 is true (for $z$, binomial, Poisson only), “the null distribution”;
- degrees of freedom (for $\chi^2$ only);
- $p$-value or test statistic;
- an inequality (reason); and
- an outcome.
Include all these elements if the question just says to “test at the […] significance level”.
As all of these steps depend on you choosing the right test, triple check you are using the correct test before you proceed.
$t$-test
t-test is for checking the mean of either a large sample, or of normally distributed data points with unknown variance.
Extra material slightly beyond HL:
When sampling $n$ independent observations from $N(\mu, \sigma^2)$, ie normal distribution with mean $\mu$ and variance $\sigma^2$, we can calculate a sample mean $\bar{x}$ and an unbiased estimate for population variance $s_{n-1}^2$, such that
$$\frac{\bar{x} - \mu}{s_{n-1} / \sqrt{n}} \sim t_{n-1}$$
where the $t$-statistic on the left follows a $t$-distribution with $n - 1$ degrees of freedom. GDC computes the statistic from H0 and the sample, then finds the $p$-value under the $t$-distribution.
A $t$-distribution is like the normal distribution, but with heavier tails and a lower center. You can see graphs at Wikipedia: t-distribution.
As a consequence of the Central Limit Theorem (HL), when sampling from any distribution with finite mean and variance, we have, approximately for large $n$,
$$\frac{\bar{x} - \mu}{s_{n-1} / \sqrt{n}} \overset{\text{approx}}{\sim} t_{n-1}$$
As such, we can use the $t$-test for large samples, regardless of what the underlying distribution is.
In the HL exam, to use the $t$-test you need either a large sample (for an approximate test) or normally distributed data with unknown variance (for an exact test). Furthermore, SL students will always be given the data for a $t$-test, and not just the summary statistics. Nonetheless, it is still possible at SL for IB to give the sample mean, ask you to find a missing value, then ask you to run a $t$-test.
| type | usage | degrees of freedom | null hypothesis H0 |
|---|---|---|---|
| 1-sample | sample mean vs benchmark | $n - 1$ | $\mu =$ [value] |
| pooled 2-sample | sample A mean vs sample B mean, same $\sigma$ | $n_1 + n_2 - 2$ | $\mu_1 = \mu_2$ |
| paired (HL) | aspect A vs aspect B of the same sample (eg before vs after), same $n$ | $n - 1$ | $\mu_D = 0$ |
| correlation coefficient (HL) | whether two normal distributions are linearly correlated | $n - 2$ | $\rho = 0$ |
The alternative hypothesis H1 depends on the problem context and involves $<$, $>$, or $\neq$. All of these are strict inequalities. All above tests may use any of the three inequalities for H1.
You probably will not be assessed on $t$-test degrees of freedom.
In the (pooled) 2-sample $t$-test, the two samples are independent. In the paired $t$-test, there is only one sample; we find the difference between the two dependent data lists, before doing a 1-sample test.
1-sample $t$-test
The 1-sample $t$-test checks if the population mean is equal to a pre-determined benchmark.
Example: data from SPEC AI Paper 1 SL Q9 [Maximum mark: 6]
Ms Calhoun measures the heights of students in her mathematics class. She is interested to see if the mean height of male students, $\mu$, is less than 150 cm.
| Male height (cm) | ||||||||
|---|---|---|---|---|---|---|---|---|
| Female height (cm) |
At the 10% level of significance, a t-test was used to test this. The data is assumed to be normally distributed and the standard deviations are equal between the two groups.
(a) State the null and alternative hypotheses. [2]
H0: $\mu = 150$ cm
H1: $\mu < 150$ cm
For 1-sample $t$-tests, always use $\mu$ and not words even if the question does not mention the quantity $\mu$.
(b) Calculate the p-value for this test. [2]
Enter the data in stat 1:Edit... L₁.
[Figure: male student heights entered in L₁]
In stat ◀ (TESTS) select 2:T-Test.... Select Data for input type. Enter the null hypothesis, list, frequency (for grouped data, which we don’t have in this question), and the alternative hypothesis. Calculate and enter.
[Figure: input screen for 1-sample t-test]
The screen shows the alternative hypothesis you inputted (not the conclusion), test statistic, the $p$-value, and summary statistics. $n$ is the sample size, not the degrees of freedom.
[Figure: 1-sample results. First line is H1, regardless of whether it is accepted or rejected]
(c) State, giving a reason, whether there is sufficient evidence to claim that the male students are on average shorter than 150 cm. [2]
As $p > 0.10$, there is insufficient evidence to claim the male population average is below 150 cm
pooled 2-sample $t$-test
The pooled 2-sample $t$-test assumes that the population variances are equal, and checks whether the two population means are also equal.
Example: SPEC AI Paper 1 SL Q9 [Maximum mark: 6]
Ms Calhoun measures the heights of students in her mathematics class. She is also interested to see if the mean height of male students, $\mu_1$, is the same as the mean height of female students, $\mu_2$. The information is recorded in the table.
| Male height (cm) | ||||||||
|---|---|---|---|---|---|---|---|---|
| Female height (cm) |
At the 10% level of significance, a t-test was used to compare the means of the two groups. The data is assumed to be normally distributed and the standard deviations are equal between the two groups.
(a) State the null and alternative hypotheses for this new test. [2]
H0: $\mu_1 = \mu_2$
H1: $\mu_1 \neq \mu_2$
(b) Calculate the p-value for this new test. [2]
Enter the data in stat 1:Edit... L₁ and L₂.
[Figure: male student heights entered in L₁]
In stat ◀ (TESTS) select 4:2-SampTTest.... Select Data for input type. Enter the null hypothesis, the two lists, frequencies (for grouped data, which we don’t have in this question), and the alternative hypothesis. Always choose Pooled. Calculate and enter.
[Figure: data entry for 2-sample t-test. Always use pooled on the exam.]
The screen shows the alternative hypothesis you inputted (not the conclusion), test statistic, the $p$-value, degrees of freedom (df), and summary statistics.
(c) State, giving a reason, whether Ms Calhoun should accept the null hypothesis. [2]
As $p > 0.10$, there is insufficient evidence to reject H0
paired $t$-test (HL)
In a paired $t$-test, the same sample is studied either in two aspects, or at two instances in time. To perform this test, we find the difference of the two equally-long lists and perform a 1-sample $t$-test on the difference.
Do not state hypotheses that imply causation, as the paired $t$-test is only about observed changes.
Example: Specimen Paper 3 HL Q1
IB World School A wants to evaluate their teaching.
A group of eight students were randomly selected. They were given a standardized test at the start of the course and a prediction for total IB points was made based on that test; this was then compared to their points total at the end of the course.
Previous results indicate that both the predictions from the standardized tests and the final IB points can be modelled by a normal distribution.
It can be assumed that:
- the standardized test is a valid method for predicting the final IB points
- that variations from the prediction can be explained through the circumstances of the student or school
School A also gives each student a score for effort in each subject. This effort score is based on a scale of 1 to 5 where 5 is regarded as outstanding effort.
| Student number | Gender | Predicted IB points | Final IB points | Average effort score |
|---|---|---|---|---|
| 1 | male | 43.2 | 44 | 4.4 |
| 2 | male | 36.5 | 34 | 4.2 |
| 3 | female | 37.1 | 38 | 4.7 |
| 4 | male | 30.9 | 28 | 4.3 |
| 5 | male | 41.1 | 39 | 3.9 |
| 6 | female | 35.1 | 39 | 4.9 |
| 7 | male | 36.4 | 40 | 4.9 |
| 8 | male | 38.2 | 38 | 4.3 |
| Mean | 37.31 | 37.5 | 4.45 |
(d) Use a paired t-test to determine whether there is significant evidence that the students in school A have improved their IB points since the start of the course. [4]
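Let $D$ be the final minus the predicted IB points, with population mean $\mu_D$.
H0: $\mu_D = 0$
H1: $\mu_D > 0$
There is insufficient evidence of improved IB points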
Enter the before/after data in stat 1:Edit... L₁ and L₂. You should input the before data in L₁, and after data in L₂.
[Figure: data entry with before in L₁ and after in L₂]
In the L₃ header, set it as L₂ - L₁. Press enter to populate the list.
Follow the same procedure as the 1-sample test, using L₃ for the list. Set the null hypothesis to 0.
[Figure: input screen for 1-sample t-test]
[Figure: 1-sample results. First line is H1, regardless of whether it is accepted or rejected]
Note that the mean difference is $0.1875$, which matches the given difference $37.5 - 37.31 = 0.19$. IB sometimes will give such info so you can catch some of your data entry errors, if any.
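The paired procedure can also be checked off the calculator. The following is a minimal Python sketch (not part of the original GDC workflow) that uses only the standard library and the data from the table above. It computes the list of differences, the mean difference, and the $t$-statistic; it stops short of the $p$-value, which still needs a $t$-distribution from a GDC or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

# Data copied from the table above (school A, eight students).
predicted = [43.2, 36.5, 37.1, 30.9, 41.1, 35.1, 36.4, 38.2]
final = [44, 34, 38, 28, 39, 39, 40, 38]

# Paired test: reduce to ONE list of differences, like L3 = L2 - L1 on the GDC.
diffs = [after - before for before, after in zip(predicted, final)]

n = len(diffs)
mean_diff = mean(diffs)
s = stdev(diffs)  # sample standard deviation (divides by n - 1)

# 1-sample t-statistic on the differences; H0 says the mean difference is 0.
t_stat = (mean_diff - 0) / (s / sqrt(n))

print(f"n = {n}, mean difference = {mean_diff:.4f}")
print(f"t statistic = {t_stat:.3f} with {n - 1} degrees of freedom")
```

A small positive $t$-statistic like this one is consistent with the conclusion above that there is insufficient evidence of improvement.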
correlation $t$-test (HL)
The correlation $t$-test checks, given the sample size $n$ and the measured $r$, whether we can conclude that the population correlation coefficient $\rho$ between two variables is non-zero. The variables must be normally distributed in order to use this test.
$r$ and $\rho$ are only for linear relationships. This question is often tested together with linear regression. If we fail to reject the null hypothesis, it means there is insufficient evidence to justify a linear relationship.
Example: Specimen Paper 3 HL Q1
This example started in the previous section
It is claimed that the effort put in by a student is an important factor in improving upon their predicted IB points.
| Student number | Gender | Predicted IB points | Final IB points | Average effort score |
|---|---|---|---|---|
| 1 | male | 43.2 | 44 | 4.4 |
| 2 | male | 36.5 | 34 | 4.2 |
| 3 | female | 37.1 | 38 | 4.7 |
| 4 | male | 30.9 | 28 | 4.3 |
| 5 | male | 41.1 | 39 | 3.9 |
| 6 | female | 35.1 | 39 | 4.9 |
| 7 | male | 36.4 | 40 | 4.9 |
| 8 | male | 38.2 | 38 | 4.3 |
| Mean | 37.31 | 37.5 | 4.45 |
(f) (i) Perform a test on the data from school A to show it is reasonable to assume a linear relationship between effort scores and improvements in IB points. You may assume effort scores follow a normal distribution.
(ii) Hence, find the expected improvement between predicted and final points for an increase of one unit in effort grades, giving your answer to one decimal place.
H0:
H1:
. There is sufficient evidence that effort is an important factor in IB points improvement
gradient (change in improvement over change in effort) is approx.
Input the two variables in stat 1:Edit... L₁ and L₂. Remember which one is which.
In stat ◀ (TESTS), choose F:LinRegTTest.... Input your independent (what you change) followed by dependent (what you observe) variables. Store the regression equation in alpha trace Y1 or a different function.
[Figure: effort L₂ in Xlist because part (ii) wants change in grade over change in effort. If you just want the p-value, the order of XList and YList doesn't matter.]
The results show the model, the alternative hypothesis (not the conclusion), the test statistic, $p$-value, degrees of freedom (df), and regression line constant and gradient.
[Figure: first 1-2 lines are always the alternative hypothesis, regardless of whether it is accepted or not.]
chi-squared test
The chi-squared test checks if categorical data involving frequencies is distributed as expected. The categories must be non-overlapping and cover the entire set of possible values. Categories may be qualitative (eg mode of transportation) or quantitative (eg weight).
Extra material slightly beyond HL:
For categorical data of sample size $n$, with $k$ mutually exclusive classes, if $O_i, E_i$ are respectively the observed and expected frequencies of class $i$, and $\sum O_i = \sum E_i = n$, then,
$$\sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \overset{\text{approx}}{\sim} \chi^2_{k-1}$$
The summation at the left approaches a $\chi^2$-distribution of $k - 1$ degrees of freedom. A $\chi^2$-distribution is only defined for positive $x$. You can see graphs at Wikipedia: chi squared-distribution.
Chi-squared test is always an approximate test. In HL, the expected frequencies for all categories must exceed 5, and you need to merge the smallest classes as necessary.
| type | usage | degrees of freedom | null hypothesis H0 |
|---|---|---|---|
| independence | if row and column variables are independent | $(r - 1)(c - 1)$ with $r$ rows and $c$ columns † | [row var.] is independent of [column var.] |
| goodness of fit | if observed freq. match the predicted | $k - 1$, with $k$ classes | freq of [x] are distributed according to [model] |
| goodness of fit (HL) | if observed freq. match the predicted | $k - 1 - m$, with $k$ classes; model using $m$ estimated parameters | freq of [x] are distributed according to [model] |
Note that for independence the GDC can calculate the expected frequencies, but for goodness-of-fit we need to supply to the calculator the expected frequencies as well as the degrees of freedom. Independence uses a contingency table, and the data define the expected values; whereas in goodness-of-fit, the expected is separate from the observed.
independence
Example: AI Specimen P1 SL Q6
As part of a study into healthy lifestyles, Jing visited Surrey Hills University. Jing recorded a person’s position in the university and how frequently they ate a salad. Results are shown in the table.
| Salad meals per week | ||||
|---|---|---|---|---|
| 0 | 1-2 | 3-4 | > 4 | |
| Student | 45 | 26 | 18 | 6 |
| Professors | 15 | 8 | 5 | 12 |
| Staff and Administration | 16 | 13 | 10 | 6 |
Jing conducted a test for independence at a level of significance.
(a) State the null hypothesis. [1]
H0: Number of salads eaten in a week is independent from the person’s position
(b) Calculate the p-value for this test. [2]
In 2nd x⁻¹, ◀ arrow for EDIT. Select 1:[A]. Set the dimensions in number of rows and columns. For our example, 3 by 4.
In stat ◀ (TESTS), choose C:χ²-Test.... Set Observed: [A] and Expected: [B]. The matrices are available from 2nd x⁻¹ Names. [B] should be empty.
The result is shown with a chi-squared test statistic, $p$-value, and degrees of freedom (df).
Matrix [B] should now be populated with the expected frequencies.
(c) State, giving a reason, whether the null hypothesis should be accepted. [2]
Since the $p$-value is less than the significance level, there is sufficient evidence to reject H0
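To see what the GDC does behind the χ²-Test screen, here is a rough Python sketch using only the standard library. It builds the expected frequencies from the row and column totals of the salad table, then sums the contributions. The helper `chi2_upper_tail_even_df` is a name made up for this illustration; it uses a closed form for the upper tail that is valid only when the degrees of freedom are even, which happens to be the case here.

```python
from math import exp, factorial

# Observed frequencies from the table above.
observed = [
    [45, 26, 18, 6],   # Student
    [15, 8, 5, 12],    # Professors
    [16, 13, 10, 6],   # Staff and Administration
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(chi-squared > x). Illustrative helper: valid only for EVEN df."""
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))


p_value = chi2_upper_tail_even_df(chi2, df)
print(f"chi-squared = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The `expected` matrix plays the role of matrix [B] on the calculator, so it can be used to check that no expected frequency is too small.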
(d) [BONUS] Find the smallest test statistic that would reject the null hypothesis. [2]
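The degrees of freedom is $(3 - 1)(4 - 1) = 6$. We need the critical value $c$ where $P(\chi^2_6 \le c) = 1 -$ significance level.
Graph the $\chi^2$ distribution cumulative distribution function (cdf), then numerically solve for the intersection with the horizontal line at $1 -$ significance level, as $\chi^2$ is always upper tail.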
The critical value is
So another way part c) can be asked is that we can be given as the critical value, and see that the test statistic of meaning we reject H0 in favor of H1.
Graph the chi-squared cdf and the horizontal line at $1 -$ significance level. Use 2nd vars, 8:χ²cdf(.
Solve graphically for the intersection.
goodness of fit
Example: SPEC AI P2 HL Q2 / SL Q2. [Maximum mark: 12] Slugworth Candy Company sell a variety pack of colourful, shaped sweets.
According to manufacturer specifications, the colours in each variety pack should be distributed as follows.
| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Percentage (%) | 15 | 25 | 20 | 20 | 10 | 10 |
Mr Slugworth opens a pack of 80 sweets and records the frequency of each colour.
| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Observed Frequency | 10 | 20 | 16 | 18 | 12 | 4 |
To investigate if the sample is consistent with manufacturer specifications, Mr Slugworth conducts a goodness of fit test. The test is carried out at a significance level.
(b) Write down the null hypothesis for this test. [1]
The colours are distributed according to manufacturer specs
(c) Copy and complete the following table in your answer booklet. [2]
| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Expected Frequency |
We need to convert the expected percentages to expected frequencies. The recommended way is to store the percentages in a list, then multiply the list by the sample size, ie $80$, as given in the question.
| Colour | Brown | Red | Green | Orange | Yellow | Purple |
|---|---|---|---|---|---|---|
| Expected Frequency | 12 | 20 | 16 | 16 | 8 | 8 |
We got whole number expected frequencies by chance; decimal expected frequencies should be kept and not rounded.
Input the observed frequencies in stat 1:Edit... L₁, and expected percentages in L₂. Go to L₃ header and set it to 80L₂.
(d) Write down the number of degrees of freedom. [1]
$6 - 1 = 5$
(e) Find the p-value for the test. [2]
In stat ◀ (TESTS), choose D:χ²GOF-Test.... Input the observed and expected frequency lists, the degrees of freedom, and Calculate.
The results show the test statistic, $p$-value, degrees of freedom, and the contributions (CNTRB) from each class to the test statistic.
(f) State the conclusion of the test. Give a reason for your answer. [2]
$p \approx 0.469$, which exceeds the significance level, so there is insufficient evidence to reject H0
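The whole goodness of fit calculation is short enough to reproduce by hand. Below is a hedged Python sketch (standard library only) that converts the percentages into expected frequencies, sums the contributions, and finds the $p$-value. The function `chi2_upper_tail_df5` is a helper invented for this example; it hard-codes a closed form that only works for 5 degrees of freedom, so it is not a general replacement for χ²cdf.

```python
from math import erfc, exp, pi, sqrt

# Manufacturer percentages and observed counts from the tables above.
percentages = [15, 25, 20, 20, 10, 10]
observed = [10, 20, 16, 18, 12, 4]
n = sum(observed)  # 80 sweets

# Expected frequency = percentage of the sample size (do not round).
expected = [pct / 100 * n for pct in percentages]

# Each class contributes (O - E)^2 / E, like CNTRB on the GDC.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)
df = len(observed) - 1  # no parameters were estimated from the data


def chi2_upper_tail_df5(x):
    """P(chi-squared > x) for exactly 5 degrees of freedom (illustrative)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2) * (1 + x / 3)


p_value = chi2_upper_tail_df5(chi2)
print(f"chi-squared = {chi2:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Looking at `contributions` shows that the yellow and purple classes account for most of the test statistic.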
goodness of fit with parametric model (HL)
Same as above but we need to subtract the number of estimated parameters in our model, from the degrees of freedom. The model must return a finite number of non-overlapping categories. If the categories are quantitative, the model can be normal distribution, with groupings provided or outlined; binomial distribution; Poisson distribution; or another discrete distribution to be given in the question.
Example: Adapted from May 2023 P2 TZ1 HL Q5
Goran is interested in the number of sightings of a particular bird each week in the 50 weeks following the first day of September. He collects some data which is shown in the table.
| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 | More than 5 |
|---|---|---|---|---|---|---|---|
| Number of weeks | 8 | 16 | 13 | 8 | 3 | 2 | 0 |
Goran believes that the data follows a Poisson distribution. Goran decides to test at the 5% significance level to see if his belief is correct. His null hypothesis is $X \sim \text{Po}(\lambda)$, where the random variable, $X$, is defined as the number of sightings per week.
Goran estimates parameter $\lambda$ to be the mean of the sample, $\bar{x} = 1.76$.
| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 or More |
|---|---|---|---|---|---|---|
| Expected frequencies |
(c) Copy the table and fill in the expected frequencies for sightings per week in the 50 weeks after the first day of September [7]
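The expected frequency for exactly $k$ sightings is
$$50 \times P(X = k)$$
For the “5 or more” class, we want
$$50 \times P(X \ge 5) = 50 \times (1 - P(X \le 4))$$
depending on whichever form is easier on your GDC.
Answer (to 3 significant figures):
| Number of sightings | 0 | 1 | 2 | 3 | 4 | 5 or More |
|---|---|---|---|---|---|---|
| Expected frequencies | 8.60 | 15.1 | 13.3 | 7.82 | 3.44 | 1.68 |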
In lists stat 1:Edit..., insert a list using 2nd stat OPS, choose 5:seq( to define a sequence. This allows us to evaluate a function over a list. Specifically, we want D:poissonpdf( from 2nd vars and poissonpdf(1.76,X)*50. We should go from 0 to 4, for the first five columns.
In the last row in L₂, enter (1-poissoncdf(1.76,4))*50
(d) State a reason why Goran should combine groups to conduct his significance test. [1]
Some expected frequencies are less than 5
To combine values, we can use L₂(5) for instance to get the fifth element. We set the new L₂(5) equal to L₂(5) + L₂(6). And then delete L₂(6).
(e) Write down the degrees of freedom for the test. [1]
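The formula is number of classes $- 1 -$ number of estimated parameters, and here it is $5 - 1 - 1 = 3$ because we have 5 classes after combining and we estimated 1 parameter, $\lambda$.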
(f) Find the p-value for the test. [2]
Using 3 degrees of freedom, we get $p \approx 0.991$
(g) State the conclusion of the test. Justify your answer. [2]
$p > 0.05$, insufficient evidence to reject H0
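The steps in parts (c) to (f) can be sketched in Python with the standard library. The code below is only an illustration under the same assumptions as the worked answer: the parameter is estimated by the sample mean, the last two classes are merged because their expected frequencies are below 5, and one extra degree of freedom is lost for the estimated parameter. The helper `chi2_upper_tail_df3` is invented for this example and hard-codes a closed form valid only for 3 degrees of freedom.

```python
from math import erfc, exp, factorial, pi, sqrt

# Observed data from the table above ("more than 5" had 0 weeks).
sightings = [0, 1, 2, 3, 4, 5]
weeks = [8, 16, 13, 8, 3, 2]
n_weeks = sum(weeks)  # 50

# Estimate the Poisson parameter with the sample mean.
lam = sum(k * w for k, w in zip(sightings, weeks)) / n_weeks


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam ** k / factorial(k)


# Expected frequencies for 0..4, then "5 or more" as the remainder.
expected = [n_weeks * poisson_pmf(k, lam) for k in range(5)]
expected.append(n_weeks - sum(expected))

# Merge the classes "4" and "5 or more" since both expected values are below 5.
observed_merged = weeks[:4] + [sum(weeks[4:])]
expected_merged = expected[:4] + [sum(expected[4:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_merged, expected_merged))
df = len(observed_merged) - 1 - 1  # one parameter (lambda) was estimated


def chi2_upper_tail_df3(x):
    """P(chi-squared > x) for exactly 3 degrees of freedom (illustrative)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


p_value = chi2_upper_tail_df3(chi2)
print([round(e, 2) for e in expected])
print(f"chi-squared = {chi2:.4f}, df = {df}, p-value = {p_value:.3f}")
```

A $p$-value this close to 1 means the observed counts are very close to what the fitted Poisson model predicts.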
Type I and type II errors (HL)
Type I and type II errors are properties of the test, and do not use anything from our sample. Compute the critical value before finding such errors.
Type I error is the conditional probability that we reject H0 even though it is true. For continuous tests including $t$, $\chi^2$, and $z$, this is the same as the significance level. For discrete tests including binomial and Poisson, this is smaller than the significance level.
Type II error is the conditional probability that we fail to reject H0, assuming the parameter is actually some other value given in the question. Type II errors will only be asked for $z$, binomial, and Poisson tests.
Some examples are shown along the following tests.
Exact tests (HL)
Previously, both $t$ and $\chi^2$ tests are usually approximations that work for large samples. The following work for small samples as well, but require the population distribution to be known except for one parameter.
The tests ask the question: “if the last model parameter is the value from null hypothesis, what is the probability that the data is at least as extreme as the sample?” and answer with a $p$-value. This differs from regular distribution questions in which we know all parameters.
| type | usage | null hypothesis H0 |
|---|---|---|
| $z$-test | $t$-test but $\sigma$ known | $\mu =$ [value] |
| binomial test | fraction (proportion) of population that meets some description | $p =$ [value] |
| Poisson test | long term average counts of some Poisson process in an interval | $\lambda =$ [value] |
$\mu$ and $\lambda$ can be used without defining the variable. For the binomial test, use the given variable from the question, or if none were given, define $p$ as the population proportion belonging to the specified category.
In theory, all the $t$-tests can also be $z$-tests if the population variance is known.
Your GDC has built-in test statistic and $p$-value functions for the $z$-test. For binomial and Poisson tests, the test statistic is given or is simple arithmetic, and we can use the built-in binomcdf or poissoncdf to find the $p$-value. There is a built-in to find the critical value of the binomial test, but not for Poisson, for which you will need to solve over the integers.
There is an explicit A1 for writing down the null distribution (the distribution involving the parameter from H0) for these tests.
$z$-test for population mean (HL)
z-test is for checking if the sample mean of independent and identical normal distributions is as expected.
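When sampling $n$ independent observations from a normal distribution $N(\mu, \sigma^2)$, the sample mean $\bar{X}$ follows a normal distribution with the same mean but variance scaled by $\frac{1}{n}$. Under H0, with $\mu_0$ being the value of the mean stated in H0,
$$\bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right)$$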
You are required to be able to derive this.
Extra material slightly beyond syllabus:
Alternatively, this is the same as
$$\frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)$$
where $N(0, 1)$ is the standard normal distribution, often denoted by $Z$.
Also recall the bonus material from earlier:
$$\frac{\bar{x} - \mu}{s_{n-1} / \sqrt{n}} \sim t_{n-1}$$
When comparing to the $t$-test, we notice that the $z$-test requires a known $\sigma$ while the $t$-test does not. Both tests are exact with independent sampling from a normal distribution. Similarly, by the Central Limit Theorem, both tests can be approximate for large samples when the distribution is not normal. In Math AI HL, we use the $t$-test when we do not know $\sigma$, and the $z$-test when we do.
Example: Practice AI P2 HL Q6 (from AI Teacher Support Material)
The masses in kilograms of melons produced by a farm can be modelled by a normal distribution with a mean of kg and a standard deviation of kg.
One year due to favourable weather conditions it is thought that the mean mass of the melons has increased. The owner of the farm decides to take a random sample of melons to test this hypothesis at the 5% significance level, assuming the standard deviation of the masses of the melons has not changed.
(c) Write down the null and alternative hypotheses for the test. [1]
H0:
H1:
(d) Find the critical region for this test. [4]
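Let $\bar{X}$ be the sample mean in kg. Under H0, $\bar{X}$ follows a normal distribution with the mean from H0 and variance $\frac{\sigma^2}{n}$, using the values from the question.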
Remember that on most GDC, we input the standard deviation, but to show our work, we use variance.
Since H1 is upper-tailed (the mean has increased), we want $P(\bar{X} > c) = 0.05$ and solve for $c$, the critical value.
The critical region is
Use 2nd vars 3:invNorm(.
TI-84 Plus non-CE users, and TI Nspire users should use area: 0.95 as there is no option for RIGHT tail.
2-tail critical values for the $z$-test are required. Nspire users, for a 5% significance level, would find area = 0.025 and area = 0.975 separately. Both TI 84 Plus CE and Casio have built-in two-tailed InvNorm / InvN. Casio only returns the lower critical value, and you need to reflect it about the mean for the upper critical value. The critical region would be $\bar{X} \le$ lower critical value or $\bar{X} \ge$ upper critical value.
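For readers who want to check invNorm results without a GDC, here is a small Python sketch using `statistics.NormalDist` from the standard library. The numbers `mu0`, `sigma`, and `n` are hypothetical values chosen only for illustration; they are not the values from the melon question. As on most GDCs, the second argument is the standard deviation of the sample mean, not its variance.

```python
from math import sqrt
from statistics import NormalDist

# HYPOTHETICAL inputs for illustration only (not from the exam question).
mu0 = 10.0    # mean stated in H0
sigma = 2.0   # known population standard deviation
n = 16        # sample size
alpha = 0.05  # significance level

# Null distribution of the sample mean: N(mu0, sigma^2 / n).
null_dist = NormalDist(mu0, sigma / sqrt(n))

# Upper-tail test: critical region is sample mean > c, with area 0.95 to the left.
upper_critical = null_dist.inv_cdf(1 - alpha)

# Two-tail test: split alpha between the tails.
lower_two_tail = null_dist.inv_cdf(alpha / 2)
upper_two_tail = 2 * mu0 - lower_two_tail  # reflect about the mean

print(f"1-tail (upper) critical value: {upper_critical:.3f}")
print(f"2-tail critical values: {lower_two_tail:.3f} and {upper_two_tail:.3f}")
```

The reflection step mirrors what Casio users do by hand when the calculator only returns the lower critical value.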
Binomial test for population proportion (HL)
The binomial test is for checking if the observed percentage is as expected. The test statistic, $X$, is the number of measurements belonging to the specified category.
It can be seen as an exact version of the approximate $\chi^2$ test for two categories. Under H0, with $n$ trials and $p_0$ being the proportion stated in H0,
$$X \sim B(n, p_0)$$
Since $p$ primarily means $p$-value, IB will give you a variable for the proportion. If not given, define $p$ as the population proportion.
Example: May 2025 TZ2 Paper 2 HL 5
A zoologist collects a sample of cane beetles. He measures their length and categorizes them as “small” meaning from 10 to 12mm long, “medium” meaning from 12 to 16mm long and “large” meaning from 16 to 18mm long. He also notes their sex and records the frequencies in the following table.
Length, x mm, by sex:
| Sex | Small 10 < x ≤ 12 | Medium 12 < x ≤ 16 | Large 16 < x ≤ 18 |
|---|---|---|---|
| Female | 42 | 25 | 19 |
| Male | 61 | 27 | 12 |
(a) Find how many cane beetles are in the zoologist’s sample. [1]
Ultimately we need not only total cane beetles, but also the male ones. I want to present a method that can avoid certain data entry errors.
In general for long sums, use list and 1-var stats, or spreadsheet, as opposed to using lots of plus signs. You should use 2 columns for the 2 categories, male and female. We are spending at most 15 extra seconds, in order to make any data entry mistakes easy to spot. (It’s the same approach for grouped data / interval data. You may want to use a similar approach for Riemann sums, expected values, etc.)
There are 86 females, 100 males, and 186 in total
In stat 1:Edit..., enter the female data into L₁, and male data into L₂
In stat CALC 2:2-Var Stats, use our lists. We read off Σx=86 and Σy=100
Let $p$ be the population proportion of cane beetles that are male.
(b) Test, at the 5% significance level, the hypothesis that more than 45% of cane beetles are male. Write the null and alternative hypotheses. State the $p$-value of your test and write your conclusion in context. [5]
They gave a variable for population proportion so we can use $p$ in our null and alternative hypotheses.
H0: $p = 0.45$
H1: $p > 0.45$
Note that proportions are just decimals between 0 and 1 and have no units.
Let $X$ be the observed number of males. Under H0, $X \sim B(186, 0.45)$
The critical region is in the direction of H1: $p > 0.45$, so the $p$-value is $P(X \ge 100)$, which is below $0.05$
Reject H0; sufficient evidence that over 45% cane beetles are male
On the TI-84 Plus (CE), we must use $P(X \ge 100) = 1 - P(X \le 99)$. Use B:binomcdf from 2nd vars.
Alternative method if question asked for critical region, and not $p$-value
[Null and alternative hypotheses same as above; null distribution same as above.]
Our GDC actually has a built-in for this. The answer will always cross the significance level, so the value is always excluded from the critical region, no matter if lower or upper tail test.
critical region: , critical value is
The observed $X = 100$ lies in the critical region, so reject H0; sufficient evidence that over 45% of cane beetles are male
If lower tail, use area 0.05, and $X <$ [answer] for the critical region. And the critical value would be [answer] - 1.
In 2nd vars C:invBinom(, use
area: 0.95
trials: 186
p: 0.45
The type I error rate is
For binomial and Poisson tests, the type I error is always (just) below the significance level.
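Both methods for the cane beetle test can be reproduced with a short Python sketch using only the standard library. The loop below is an assumption about how the built-in inverse binomial behaves, based on the description above: it finds the first count whose cumulative probability reaches 0.95, and the critical region starts one count higher. The function names are made up for this example.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p0, alpha = 186, 0.45, 0.05  # null distribution B(186, 0.45)
x_obs = 100                     # observed number of males

# Upper-tail p-value: P(X >= 100) = 1 - P(X <= 99).
p_value = 1 - binom_cdf(x_obs - 1, n, p0)

# Mimic the inverse binomial with area 0.95: first k with P(X <= k) >= 0.95.
k = 0
while binom_cdf(k, n, p0) < 1 - alpha:
    k += 1
critical_value = k + 1  # k itself is excluded from the critical region

# Type I error: probability of landing in the critical region when H0 is true.
type_i_error = 1 - binom_cdf(critical_value - 1, n, p0)

print(f"p-value = {p_value:.4f}")
print(f"critical region: X >= {critical_value}")
print(f"type I error = {type_i_error:.4f}")
```

Because the distribution is discrete, `type_i_error` comes out just under 0.05 rather than exactly equal to it.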
Poisson distribution (HL)
The Poisson process is loosely a “rare event” that occurs a finite number of times over any continuous time interval. $\lambda$ is the expected number of independent occurrences over a fixed duration (eg 1 hour). When $X$ follows a Poisson distribution with parameter $\lambda$, written $X \sim \text{Po}(\lambda)$, $X$
describes the number of occurrences over a particular interval (eg noon to 1 pm).
We have $E(X) = \text{Var}(X) = \lambda$.
Extra material slightly beyond AI HL:
The source of randomness is that the Poisson processes occur at random times. The probability of long wait-times decreases exponentially. This makes it so that when the processes are independent, the wait times are also independent. The number of Poisson processes in all equally-long periods can be modelled by the same Poisson distribution.
The sum of two independent Poisson distributions is another Poisson with parameter equal to the sum of the parameters.
A special case is that if the duration multiplies by $k$, then the new parameter is $k\lambda$.
Example: The passage of each type of vehicles at an intersection is modelled by a Poisson distribution. On average every minutes, cars pass the intersection, and on average every minutes, bus pass the intersection. Find the probability that at least vehicles pass in minutes.
In minutes, we expect cars, and buses. Let be the number of vehicles in a minute interval.
We want . This is same as , which can be calculated on our GDC.
In 2nd vars use E:poissoncdf to find a sum of Poisson distribution probabilities. Use upper bound of 69 buses.
We can imagine a scenario where we have an H0 of . so at 5% significance level we would reject H0 in favor of H1 of .
Poisson test for population mean (HL)
The Poisson test checks whether the observed number of occurrences is consistent with what H0 predicts. The Poisson test is exact. Under H0,
Example: Specimen AI P1 HL 16
The number of fish that can be caught in one hour from a particular lake can be modelled by a Poisson distribution.
The owner of the lake, Emily, states in her advertising that the average number of fish caught in an hour is three.
Tom, a keen fisherman, is not convinced and thinks it is less than three. He decides to set up the following test. Tom will fish for one hour and if he catches fewer than two fish he will reject Emily’s claim.
(a) State a suitable null and alternative hypotheses for Tom’s test. [1]
H0: $\lambda = 3$ (per hour)
H1: $\lambda < 3$ (per hour)
(b) Find the probability of a Type I error. [2]
Under H0, $X \sim \text{Po}(3)$
The critical region is $X \le 1$. The probability of a type I error is $P(X \le 1 \mid \lambda = 3) = 0.199$
For GDC instructions scroll up to the previous section.
The average number of fish caught in an hour is actually .
(c) Find the probability of a Type II error. [3]
Under ,
A type II error means failing to reject H0 when the mean is actually this value, so the inequality we need is the complement of the critical region.
For GDC instructions, see the earlier Poisson distribution section.
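Parts (b) and (c) can be checked with a short Python sketch. The critical region "fewer than two fish" and the H0 mean of 3 come from the question; the true mean of 2.5 passed to `type_ii` below is only an illustrative value, and the function names are made up.

```python
from math import exp, factorial

def poisson_cdf(x, mean):
    """P(X <= x) for X ~ Po(mean)."""
    return sum(exp(-mean) * mean**k / factorial(k) for k in range(0, x + 1))

critical_upper = 1  # Tom rejects H0 when he catches fewer than two fish

# Type I error: landing in the critical region while H0 (mean 3) is true.
type_i = poisson_cdf(critical_upper, 3)

def type_ii(true_mean):
    """P(fail to reject H0) when the mean is really true_mean."""
    return 1 - poisson_cdf(critical_upper, true_mean)

print(round(type_i, 3))        # 0.199
print(round(type_ii(2.5), 3))  # 0.713 (2.5 is an illustrative true mean)
```

The two probabilities use the same cutoff but different means: type I uses the H0 mean, type II uses the actual mean.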
(d) Tom tests for five hours against the original null and alternative hypotheses, where the hourly average is 3 vs less than 3. What is the maximum number of fish caught in five hours that makes the probability of making a Type I error less than $0.05$?
In five hours, under H0, $X \sim \text{Po}(15)$
To be consistent with the built-in approach for binomial, we find the first value whose cumulative probability exceeds the significance level (or 100% - significance level, for upper-tail tests); the critical region then starts from the next more extreme value.
The cumulative probability first exceeds 0.05 at $x = 9$, since $P(X \le 8) = 0.0374$ and $P(X \le 9) = 0.0699$. The maximum number of fish that keeps the type I error under 5% is $8$ fish.
In y=, enter poissoncdf(15, X), using 2nd vars E:poissoncdf(.
[Screenshot: poissoncdf, data entry]
Since the mean is small, we can just check the table starting from 0.
[Screenshot: poissoncdf, result]
On Casio, go to MENU 2.STAT (2. Statistics). In DIST, POISN, InvP, enter Area: 0.05, μ: 15.
For a left-tail test, the critical region is the values less than xInv; for a right-tail test, the critical region is the values greater than xInv.
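The table lookup for part (d) can be sketched in Python as follows. The helper `inv_poisson` is a made-up name and is only assumed to behave like the calculator's inverse function (smallest x whose cumulative probability reaches the area).

```python
from math import exp, factorial

def poisson_cdf(x, mean):
    """P(X <= x) for X ~ Po(mean)."""
    return sum(exp(-mean) * mean**k / factorial(k) for k in range(0, x + 1))

def inv_poisson(area, mean):
    """Smallest x with P(X <= x) >= area (assumed to mirror InvP)."""
    x = 0
    while poisson_cdf(x, mean) < area:
        x += 1
    return x

x_inv = inv_poisson(0.05, 15)  # first value where the cdf reaches 0.05
critical_value = x_inv - 1     # left tail: critical region is below x_inv

print(x_inv, critical_value)                      # 9 8
print(round(poisson_cdf(critical_value, 15), 4))  # 0.0374
```

So catching 8 or fewer fish in five hours keeps the type I error under 5%, while allowing 9 would not.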
Example:
The city planning committee told the mayor that on average 1729 rideshares are requested per day, and that the number of requests per day can be modelled using a Poisson distribution. The mayor decides to test this at the 1% significance level, using the next day’s data, to see if the average is higher.
a) State the null and alternative hypotheses. [1]
H0: $\lambda = 1729$ (rides daily)
H1: $\lambda > 1729$ (rides daily)
b) Find the critical region. [3]
Under H0, $X \sim \text{Po}(1729)$
When solving over large integers, it is recommended to graph the function with X rounded or floored, before confirming with the table of values.
The critical region is or
In y=, define a poissoncdf( but using round(X) for the x value. This is found in math NUM 2:round(. Also define Y2 = 0.99 for the upper-tail 1% significance level.
Using appropriate window settings in window, press graph, then trace near the intersection point. Here, we can estimate that x is between 1729 and 1850 when setting the window.
In 2nd window, start the table just below the intersection point. Press 2nd graph to view the table.
On Casio, go to MENU 2.STAT (2. Statistics). In DIST, POISN, InvP, enter Area: 0.99, μ: 1729.
For a left-tail test, the critical region is the values less than xInv; for a right-tail test, the critical region is the values greater than xInv.
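For a mean this large, a direct formula with factorials overflows, so the sketch below works in log space. It is only an illustration of the search the calculator performs: `inv_poisson` is a made-up helper, assumed to behave like the inverse Poisson function (smallest x whose cumulative probability reaches the area).

```python
from math import exp, lgamma, log

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Po(mean), computed in log space for large means."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

def inv_poisson(area, mean):
    """Return (x, cdf) for the smallest x with P(X <= x) >= area."""
    cumulative = 0.0
    x = 0
    while True:
        cumulative += poisson_pmf(x, mean)
        if cumulative >= area:
            return x, cumulative
        x += 1

mean, alpha = 1729, 0.01

# Upper-tail test at 1%: look up area 0.99.
x_inv, cdf_at_x_inv = inv_poisson(1 - alpha, mean)

# Right tail: the critical region is the values greater than x_inv.
critical_value = x_inv + 1
type_i_error = 1 - cdf_at_x_inv

print(critical_value, round(type_i_error, 4))
```

The reported type I error should sit just under 0.01, in line with the graph-and-table approach above.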
The next day, the mayor’s assistant tallied requests from the city’s top 5 most popular rideshare apps and identified requests.
c) Using this data, state the conclusion of the test and provide a reason. [2]
As is less extreme than , there is insufficient evidence to reject H0
d) Identify two ways content validity plays a role in the results of the test. [2]
- If the market share of the top 5 apps is low, then the actual number of requests may well exceed .
- The city planning committee and the different apps may be using different aggregation methods or definitions of “requests”, including possibly counting requests for rides on a later day.
Unbiased estimates of population variance (HL)
Different authors use different definitions and notations which makes this simple topic really confusing.
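In IB, sample variance, $s_n^2$, and sample standard deviation, $s_n$, refer to
$s_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s_n = \sqrt{s_n^2}$
and $s_{n-1}^2$, the unbiased estimate of population variance, refers to
$s_{n-1}^2 = \frac{n}{n-1}\,s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$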
It is so named because its average across many samples equals the population variance. Statistics is all about finding out information about the population, so this quantity is very useful.
However, outside of IB, this latter quantity can also be referred to as sample variance, and can also be written as $s^2$. Furthermore, the calculator may use σx for IB’s $s_n$ and sx for IB’s $s_{n-1}$. So keep that in mind as you use different math resources.
Example: Spec AI P1 HL Q9
A manager wishes to check the mean weight of flour put into bags in his factory. He randomly samples bags and finds the mean weight is kg and the standard deviation of the sample is kg.
(a) Find $s_{n-1}$ for this sample. [2]
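The conversion from the sample standard deviation to the unbiased estimate can be sketched in Python. The data list below is hypothetical (it is not the flour data from the question), and `unbiased_sd` is a made-up helper name. Python's `statistics.pstdev` divides by n and `statistics.stdev` divides by n - 1, so they play the roles of the two calculator outputs.

```python
from math import sqrt
from statistics import pstdev, stdev

def unbiased_sd(sample_sd, n):
    """Convert s_n into s_(n-1) using s_(n-1)^2 = n / (n - 1) * s_n^2."""
    return sqrt(n / (n - 1)) * sample_sd

# Hypothetical bag weights in kg, for illustration only.
data = [1.02, 0.98, 1.01, 0.97, 1.03, 0.99]

s_n = pstdev(data)                        # divides by n
s_n_minus_1 = unbiased_sd(s_n, len(data))

# The conversion agrees with computing the n - 1 version directly.
print(round(s_n, 4), round(s_n_minus_1, 4), round(stdev(data), 4))
```

When a question gives only the summary values, as in the flour example, only the `unbiased_sd` step is needed.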
Confidence interval (HL)
For a large number of independent, two-tailed $z$- or $t$-tests based on the same population, 95% of the constructed 95% confidence intervals are expected to contain the true population mean ($\mu$). Similar properties hold for 90% and 99% confidence intervals.
It is wrong to say that there is a 95% probability that the true mean lies in a given 95% confidence interval. The true mean is not probabilistic, ie the same interval either always contains or never contains $\mu$. We need many different intervals to express this idea of 95%.
When the hypothesized mean from H0 (which is not necessarily the true mean) lies outside the confidence interval, we opt to reject H0 in a two-tailed test.
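The "95% of intervals" idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is normal with a known standard deviation (so a z-interval applies), and the true mean 50, standard deviation 4, and sample size 25 are all hypothetical. The function names are made up.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sigma / sqrt(len(sample))
    centre = fmean(sample)
    return centre - margin, centre + margin

def coverage(true_mean, sigma, n, trials, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials

print(coverage(true_mean=50, sigma=4, n=25, trials=4000))
```

Each individual interval either contains 50 or it does not; it is the proportion across many intervals that should land close to 0.95.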
Extra material slightly beyond AI HL:
The basis for that is that the confidence interval is an algebraic rearrangement of
[lower critical value < test statistic < upper critical value]
for a 2-tailed test in which H0 is not rejected. If the hypothesized mean lies outside the confidence interval, it means the test statistic was in the critical region.
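The equivalence between "hypothesized mean outside the interval" and "reject H0" can be checked numerically. The sketch below assumes a z-test with known population standard deviation; the sample mean 50.9, standard deviation 4, and sample size 25 are hypothetical, and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist

def z_test_and_interval(sample_mean, sigma, n, mu_0, level=0.95):
    """Return (reject H0?, is mu_0 outside the confidence interval?)."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    standard_error = sigma / sqrt(n)

    # Two-tailed test: reject when the test statistic is beyond +/- z_star.
    z = (sample_mean - mu_0) / standard_error
    reject = abs(z) > z_star

    # Confidence interval built from the same sample.
    low = sample_mean - z_star * standard_error
    high = sample_mean + z_star * standard_error
    outside = not (low <= mu_0 <= high)
    return reject, outside

for mu_0 in [48, 49, 50, 51, 52, 53]:
    reject, outside = z_test_and_interval(50.9, 4, 25, mu_0)
    print(mu_0, reject, outside)
```

For every hypothesized mean in the loop, the two columns agree, which is what the rearrangement argument predicts.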
Example: Spec AI P1 HL Q9 example started in the previous section
A manager wishes to check the mean weight of flour put into bags in his factory. He randomly samples bags and finds the mean weight is kg and the standard deviation of the sample is kg.
(b) Find a confidence interval for the population mean, giving your answer to significant figures.
In stat ◀ arrow for TESTS, select either 7:ZInterval (when the population standard deviation is known) or 8:TInterval (when it is unknown).
Note that S here is a variable stored in the previous section. On most calculators, this is a way to reuse values without re-entering them.