Chi-squared tests and coefficient of correlation

We already used the $\chi^2$ test for hypothesis tests about variances. The same distribution is also used for goodness-of-fit tests and for testing contingency (cross) tables.
We explain both applications with examples.

Goodness-of-fit test

The goodness-of-fit test is applied to data produced by a multinomial experiment, a generalization of the binomial experiment that describes one population of data. The experiment consists of a fixed number $n$ of trials. Each trial has one of $k$ possible outcomes, called cells. The probability $p_i$ that a trial falls into cell $i$ remains constant from trial to trial. Our usual notion of probabilities holds, namely:

$\displaystyle{p_1+ p_2+\cdots+ p_k=1}$

and each trial is independent of the other trials.

We test whether there is sufficient evidence to reject a specified set of values for the probabilities $p_i$.
The null hypothesis is:

$\displaystyle{H_0: p_1=a_1,p_2=a_2,\cdots,p_k=a_k}$

where

$\displaystyle{a_1,a_2,\cdots,a_k}$ are the values we want to test, and:

$H_1:$ At least one $p_i$ is not equal to its specified value.

Market shares of companies A and B

Two companies, A and B, have the following market shares:

A: 45%, B: 40%, Others: 15%.

After a marketing campaign a random sample of 200 customers showed that:

102 customers preferred A;

82 preferred B; and

16 preferred other companies.

Can we infer at the 5% significance level that customer preferences have changed from their levels before the marketing campaign was launched?

We compare the market shares before and after the campaign to see whether there is a difference. Under the null hypothesis the parameters equal the market shares before the campaign. That is,

$H_0:p_1=0.45,p_2=0.40, p_3=0.15$

$H_1:$ at least one $p_i$ is not equal to its specified value.

If the null hypothesis is true, the expected numbers of customers selecting brand A, brand B, and others, shown next to the observed numbers, are:

$E_1=200(0.45)= 90$ ; $O_1=102$

$E_2=200(0.40)=80$ ; $O_2=82$

$E_3=200(0.15)=30$ ; $O_3=16$

$E_i:$ expected number of customers.

$O_i:$ observed number of customers (marketing analysis)

The $\chi^2$ (chi-squared) goodness-of-fit test statistic is given by:

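$\displaystyle{\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}=\frac{(102-90)^2}{90}+\frac{(82-80)^2}{80}+\frac{(16-30)^2}{30}=1.60+0.05+6.53=8.18}$

Note.

If $O_i-E_i=0$ for all $i$ then $\chi^2=0$

If $|O_i-E_i|\gg0$ for some $i$ then $\chi^2\gg0$

Large values of $\chi^2$ therefore count as evidence against $H_0$.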

If $H_0$ is true, this statistic is (approximately) $\chi^2$ distributed with $k-1$ degrees of freedom, provided the sample size is large. The rejection region is: $\displaystyle{\chi^2>\chi^2_{k-1;\alpha}}$ .

In this example $k=3$ and $\alpha=0.05$, so the rejection region is $\displaystyle{\chi^2>\chi^2_{3-1;0.05}=5.9915}$ .

The test statistic is $\displaystyle{\chi^2=8.18}$ which falls into the rejection region, so we reject $H_0$ in favor of $H_1$ .

Conclusion: there is sufficient evidence to infer that the proportions have changed since the marketing campaign was launched.
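The computation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative. It relies on the fact that for 2 degrees of freedom the upper-tail probability of the $\chi^2$ distribution is $e^{-x/2}$, so the critical value is $-2\ln\alpha$. That shortcut holds for 2 degrees of freedom only.

```python
import math

observed = [102, 82, 16]          # sample counts for A, B, others
null_probs = [0.45, 0.40, 0.15]   # market shares under H0
alpha = 0.05

n = sum(observed)
expected = [n * p for p in null_probs]

# Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Closed forms that are valid for df = k - 1 = 2 only:
# P(X > x) = exp(-x / 2), hence the critical value is -2 * ln(alpha).
critical = -2.0 * math.log(alpha)
p_value = math.exp(-chi2_stat / 2.0)

print(f"chi2 = {chi2_stat:.2f}, critical value = {critical:.4f}, p-value = {p_value:.4f}")
print("reject H0" if chi2_stat > critical else "do not reject H0")
```

For other numbers of cells a statistics library is the more practical route, for example `scipy.stats.chisquare`, assuming SciPy is available.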

The goodness-of-fit test has one requirement: the sample size must be large enough that the expected value for each cell is 5 or more. If the expected frequency in a cell is less than 5, combine that cell with other cells until the condition is satisfied.
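One simple way to apply this rule is to pool all cells with a small expected frequency into a single cell. The sketch below is only an illustration: the function name `pool_small_cells` and the counts are hypothetical and not taken from the text. Which cells are sensibly combined depends on the application, and if only one cell is too small it has to be merged with a neighbouring cell instead.

```python
def pool_small_cells(observed, expected, min_expected=5.0):
    """Merge every cell whose expected count is below min_expected into one pooled cell."""
    kept_obs, kept_exp = [], []
    pooled_obs, pooled_exp = 0, 0.0
    for o, e in zip(observed, expected):
        if e < min_expected:
            pooled_obs += o
            pooled_exp += e
        else:
            kept_obs.append(o)
            kept_exp.append(e)
    if pooled_exp > 0:
        # The pooled cell is appended last; it may still be below the
        # threshold, in which case further merging is needed.
        kept_obs.append(pooled_obs)
        kept_exp.append(pooled_exp)
    return kept_obs, kept_exp


# Hypothetical counts, for illustration only.
obs = [48, 35, 10, 4, 3]
exp = [50.0, 30.0, 12.0, 4.5, 3.5]

pooled_obs, pooled_exp = pool_small_cells(obs, exp)
print(pooled_obs, pooled_exp)
print("degrees of freedom after pooling:", len(pooled_obs) - 1)
```

After pooling, the degrees of freedom are based on the reduced number of cells.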

Contingency (cross) tables

Again we explain this topic by an example.

An MBA program was experiencing problems scheduling their courses. The demand for the program's optional courses and majors was quite variable from one year to the next. An investigator believed that the problem may be that the undergraduate degree affects the choice of a major.

Therefore, he took a random sample of last year's MBA students and recorded the undergraduate degree and the major selected in the graduate program. The undergraduate degrees were BA (Bachelor of Arts), BEng (Bachelor of Engineering), BBA (Bachelor of Business Administration), and others.

There are three possible majors for the MBA students: accounting, finance, and marketing. Can he conclude at the 5% significance level that the undergraduate degree affects the choice of the major? In other words, are the undergraduate degree and the major independent?

$H_0:$ MBA major and Undergraduate degree are independent;
$H_1:$ They are not independent.

From the sample we construct a table with the observed and expected values of each (degree, major) cell. These values are listed, cell by cell, in the table at the end of this document.

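How do we compute the expected values? Under $H_0$ the expected frequency of a cell is its row total times its column total, divided by the sample size. As an example we take the cell (BA, Accounting), with row total 60, column total 61 and sample size 152:

$\displaystyle{E_{\text{BA, Accounting}}=\frac{60}{152}\cdot 61=24.08}$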

The test statistic is:

$\displaystyle{\chi^2=\sum_{i}\frac{(O_i-E_i)^2}{E_i}=\frac{(31-24.08)^2}{24.08}+\frac{(13-17.37)^2}{17.37}+\cdots=14.70}$

The rejection region is:

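$\displaystyle{\chi^2>\chi^2_{(r-1)(c-1);\alpha}=\chi^2_{6;0.05}=12.59}$

where $r=4$ is the number of rows (undergraduate degrees) and $c=3$ is the number of columns (majors).

$\chi^2=14.70$ falls into the rejection region, so we reject $H_0$: there is sufficient evidence that the undergraduate degree and the MBA major are not independent.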

Of course, Excel gives the same results.
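The same computation can be sketched in Python with the standard library only. The observed counts are those of the example, arranged row by row. Which degree each row stands for and which major each column stands for is an assumption based on the order in which they are named in the text. The helper `chi2_sf_even_df` is an illustrative name; it uses a closed form for the upper-tail probability that holds only when the number of degrees of freedom is even, as it is here (6).

```python
import math

# Observed counts. Assumed layout: rows = BA, BEng, BBA, other;
# columns = accounting, finance, marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-squared variable; even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


p_value = chi2_sf_even_df(chi2_stat, df)
print(f"E(BA, Accounting) = {expected[0][0]:.2f}")
print(f"chi2 = {chi2_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

Comparing the p-value with $\alpha$ is equivalent to comparing $\chi^2$ with the critical value. With SciPy installed, `scipy.stats.chi2_contingency` performs the whole test for any table size.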

Coefficient of correlation

Sometimes we want to know whether there is a linear relationship between two population variables. For this we use the population coefficient of correlation $\rho$ :

$\displaystyle{\rho=\frac{\sigma_{xy}}{\sigma_x\sigma_y}}$

Recall: $\displaystyle{\rho\in{[-1,+1]}}$

$\rho=-1:$ perfect negative linear relation

$\rho=+1:$ perfect positive linear relation

$\rho=0:$ no linear relation

Because $\rho$ is a population parameter, we estimate its value from sample data with the sample coefficient of correlation, where $s_{xy}$ is the sample covariance and $s_x$ and $s_y$ are the sample standard deviations:

$\displaystyle{r=\frac{s_{xy}}{s_xs_y}}$
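As an illustration, the sample coefficient of correlation can be computed directly from its definition. The sketch below uses only the Python standard library; the function name `sample_correlation` and the data are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import mean


def sample_correlation(x, y):
    """Return r = s_xy / (s_x * s_y) for two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = mean(x), mean(y)
    # Sample covariance and sample standard deviations (divisor n - 1).
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 3]

r = sample_correlation(x, y)
print(f"r = {r:.4f}")
```

The divisor $n-1$ cancels in the ratio, so the same value of $r$ results if all three quantities are computed with divisor $n$.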

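The test statistic for testing $H_0:\rho=0$ is:

$\displaystyle{t=r\sqrt{\frac{n-2}{1-r^2}}}$

Under $H_0$ this statistic is Student $t$ distributed with $\nu=n-2$ degrees of freedom, provided the two variables are bivariate normally distributed.

The alternative hypothesis is $H_1:\rho>0$ or $H_1:\rho<0$ or $H_1:\rho\neq0$

The corresponding rejection region is:

$t>t_{\nu;\alpha}$ or $t<-t_{\nu;\alpha}$ or $|t|>t_{\nu;\alpha/2}$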

Keller (p. 655) gives an example with data on the price and the odometer readings of trade-in cars. Are these two variables correlated?

The correlation is computed and we define the null and alternative hypotheses. We get:

$\displaystyle{H_0: \rho=0; H_1: \rho\neq0; n=100; r=-0.8052; \alpha=0.05}$

The test statistic is:

$\displaystyle{t=r\sqrt{\frac{n-2}{1-r^2}}=-0.8052\sqrt{\frac{100-2}{1-(-0.8052)^2}}=-13.44}$

The rejection region is:

$\displaystyle{|t|>t_{98;0.025}\approx1.984}$

The test statistic $t=-13.44$ falls into the rejection region, so we reject the null hypothesis. There is overwhelming evidence that odometer readings and price are correlated.
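The arithmetic of this test is easy to check in Python. This is a minimal sketch: the function name `correlation_t_statistic` is illustrative, and the critical value 1.984 is taken from the text rather than computed, because the standard library has no Student $t$ quantile function.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than 2 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


r, n = -0.8052, 100          # values from the example
t_critical = 1.984           # t_{98; 0.025}, taken from the text

t_stat = correlation_t_statistic(r, n)
reject = abs(t_stat) > t_critical   # two-sided test

print(f"t = {t_stat:.2f}, reject H0: {reject}")
```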

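Computation of the $\chi^2$ statistic for the contingency table of the MBA example (cells listed row by row):

Observed    Expected    $(O-E)^2/E$
31          24.08       1.99
13          17.37       1.10
16          18.55       0.35
8           12.44       1.58
16          8.97        5.51
7           9.59        0.70
12          15.65       0.85
10          11.29       0.15
17          12.06       2.02
10          8.83        0.16
5           6.37        0.29
7           6.80        0.01
Total                   14.71
Rejection level         12.59

So, reject $H_0$. The total of 14.71 is the sum of the rounded terms; without intermediate rounding the statistic is 14.70.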