Chapter 8 Homework: tests

8.1 Simulations: confidence interval for \(\mu\)

  • Create a random sample of size \(n=50\) drawn from a Gaussian distribution \(\mathcal{N}(175,100)\). Compute a confidence interval for \(\mu\).
  • Create 1000 sample of size \(n=50\). For each sample, compute a confidence interval for \(\mu\) and the width of the confidence interval. Count how many times the true value \(\mu\) is inside the confidence interval. Compute the mean width of the intervals.
  • Create a function that takes as input an integer \(n\) and that perform the previous operation. Keep the results in a table (with 2 columns) varying sample size from 10 à 500 (by 10). Comment.

8.2 Missing values, introduction

The aim is to assess different strategies to handle missing values: deletion, mean imputation, regression imputation, stochastic regression imputation. Then, to tackle the issues of single imputation, multiple imputation will be used.

  1. Generate bivariate data with \(n = 100\) drawn from a Gaussian distribution with \(\mu_y = \mu_x = 125\), standard deviation \(\sigma_y = \sigma_x = 25\) and correlation \(\rho = 0.6\). You should use the rmvnorm function. We note \(\theta = \mu_y\).

  2. Plot the data.

  3. Introduce \(73\%\) of missing completely at random (MCAR) values on one variable (\(y\)).

A very popular approach to handle missing values consists in replacing the missing entries by plausible values to get a completed data set that can be analyzed by any statistical method. The most popular strategy consists of imputing with the mean of the observed values. But you may want to take into account the relationship between variables and impute with regression for instance. However, in order to preserve as well as possible the joint and marginal distributions you may prefer imputing by stochastic regression which means adding to the imputation by regression a random draw from a Gaussian distribution with mean zero and variance equals to the residuals variance. Let us assess these approaches.

  1. For each strategy (deletion, mean imputation, regression imputation, stochastic regression imputation), compute the empirical mean of \(y\) (\(\hat \theta = \bar y\)), its standard deviation (\(\hat \sigma_y\)), the correlation between \(x\) and \(y\) (\(r(X,Y)\)), a confidence interval for \(\mu_y\) and the width of the confidence interval.

  2. Plot the results

  3. Repeat steps 1, 3 and 4, 10000 times (generating data, generating missing entries, imputing values, computing some stats) and compute the bias of \(\hat \theta\), the coverage of the confidence interval for \(\mu_y\) as well as the average of the width of the confidence intervals. Comment.

  4. What is your explanation regarding the previous results. What could be done?

  5. Repeat all the steps but for missing values generated under the mechanism Missing At Random MAR or Missing non at Random MNAR. MAR means that the probability to have a missing entry does not depend on the value itself but may depend on values of other variables (ex: the probability to have missing entries on the income does not depend on the income but may depend on the age; older people may be less incline to reveal their income - but within an age cluster the probability to have missing values is the same for each individual). MNAR means that the probability to have a missing entry depends on the value itself (rich people are less encline to reveal their income). For MAR, put missing values on \(y\) when values for \(x \leq 140\); for MNAR, put missing values on \(y\) when values for \(y \leq 140\). Comment the results.

One main references on missing values is the book of Little & Rubin (2002) Statistical Analysis with missing data.

8.3 Microarray analysis

The data we consider concerns 43 brain tumors of 4 different types defined by the standard world health organization (WHO) classification (O, oligodendrogliomas; A, astrocytomas; OA, mixed oligo-astrocytomas and GBM, glioblastomas) described by data both at the transcriptome level (with expression data) and at the genome level (with CGH data). More precisely, there are 356 continuous variables for the microarray data and 76 continuous variables for the CGH data. One aim of the study is to see whether the genes are expressed differently (underexpress or overexpress) depending on the tumor types.
A deep analysis of this data can be found in this publication where they also aim at comparing the information brought by the microarray data and the CGH data.

  • Import the data that can be found here: http://factominer.free.fr/docs/gene.csv
  • Create a new variable with two categories GBM (which are the dangerous tumors) and Others (for lower grades tumors).
  • Compare the means of the genes according to these two tumor types. You could output a data table, with the mean of each gene according to the two tumors types, the \(p\)-value associated to the test and where genes are sorted according to their \(p\)-values.

8.4 Chi-squares test - Goodness of fit.

We want to test, the fit between the sample \((N_1,\dots,N_m)\) of size \(n=N_1+\dots+N_m\) to a Multinomiale distribution \({\cal M}(n;p_1,\dots,p_m)\) with \(p_i\) known. The test is \[H_0:\ (N_1,\dots,N_m)\sim{\cal M}(n;p_1,\dots,p_m)\hspace{1cm}\mbox{vs}\hspace{1cm}H_1:\ (N_1,\dots,N_m)\nsim{\cal M}(n;p_1,\dots,p_m).\] The statistic \[T_n=n\sum_{i=1}^m\frac{\left(\frac{N_{i}}{n}-p_i\right)^2}{p_i}=\sum_{i=1}^m\frac{(N_{i}-np_i)^2}{np_i},\] measure the distance between observed values and theoritical frequencies. The test is given in Section 7.5.1.

We give the number of births in the USA relatively to the day of the week. The mean number of births per day for 1997 (source [National Vital Statistics Report] (http://www.cdc.gov/nchs/products/nvsr.htm)) are the following: \(N_{monday}=10861\), \(N_{tuesday}=12104\), \(N_{wednesday}=11723\), \(N_{thursday}=11631\), \(N_{friday}=11640\), \(N_{saturday}=8670\), \(N_{sunday}=7778\).

  • Are births uniform on days of the week ?

  • Should we use the mean number of births per day or the total number of births per day to apply the test ?

8.5 RV and dCov coefficients

Exam 2016

library(mvtnorm)
library(FactoMineR)
library(energy)
library(ggplot2)

Consider two random vectors \(X\) in \(\mathbb{R}^p\) and \(Y\) in \(\mathbb{R}^q\). Our aim in this exercise is to study and test the association between these two vectors. We will introduce two ways of measuring the association between two random vectors: the RV coefficient and the dCov coefficient. The RV coefficient is restricted to the study of linear relations, while the dCov coefficient extends to a nonlinear setting.

The RV coefficient is defined as follows. Let \(\Sigma_{XY}\) denote the population covariance matrix between \(X\) and \(Y\) and \({\operatorname{tr}}\) the trace operator. Let us define the following correlation coefficient between \(X\) and \(Y\):

\[ \rho V(X,Y) = \frac{{\operatorname{tr}}(\Sigma_{XY}\Sigma_{YX})}{\sqrt{{\operatorname{tr}}(\Sigma^2_{XX}){\operatorname{tr}}(\Sigma^2_{YY})}} \]

Note that for \(p=q=1\), \(\rho V=\rho^2\) the square of the standard correlation coefficient. In a statistical context, we consider \(n\) independent realizations of the random vectors stored in matrices \({\mathbf{X}}_{n\times p}\) and \({\mathbf{Y}}_{n\times q}\). Denoting \(S_{{\mathbf{X}}{\mathbf{Y}}}=1/(n-1) \times {\mathbf{X}}' {\mathbf{Y}}\) the empirical covariance matrix between \({\mathbf{X}}\) and \({\mathbf{Y}}\), the \(\rho V\) coefficient can be consistently estimated by:

\[ {\operatorname{RV}}({\mathbf{X}},{\mathbf{Y}}) = \frac{{\operatorname{tr}}(S_{{\mathbf{X}}{\mathbf{Y}}}S_{{\mathbf{Y}}{\mathbf{X}}})}{\sqrt{{\operatorname{tr}}(S^2_{{\mathbf{X}}{\mathbf{X}}}){\operatorname{tr}}(S^2_{{\mathbf{Y}}{\mathbf{Y}}})}} \]

To interpret the RV coefficient, we define \({\mathbf{W}}_{\mathbf{X}}= {\mathbf{X}}{\mathbf{X}}'\) and \({\mathbf{W}}_{\mathbf{Y}}= {\mathbf{Y}}{\mathbf{Y}}'\).

(Q1) Write the RV coefficient with \({\mathbf{W}}_{\mathbf{X}}\), \({\mathbf{W}}_{\mathbf{Y}}\) and the Hilbert-Schmidt inner product between matrices. Show that \(0 \leq {\operatorname{RV}}\leq 1\) and interpret the case of \({\operatorname{RV}}=0\).

As with the ordinary correlation coefficient, a high value of the RV coefficient does not necessarily mean there is a significant relationship between the two sets of measurements. The next paragraph aims at showing numerically that the RV coefficient depends on the dimensions \(n,p,q\).

8.5.1 Expected values of the RV coefficient with respect to the dimensions

To compute numerically the RV coefficient, we can use the function coeffRV of the package FactoMineR. Besides, the function rmvnorm of the package mvtnorm enables to simulate multivariate gaussian variables.

(R1) Let \(n=25\) and \(p=q=5\). Generate matrices \({\mathbf{X}}\) and \({\mathbf{Y}}\) by drawing observations from independent gaussian distributions with mean \(\mu=(0)_{1 \times p}\) and covariance matrix \(\text{Id}_{p \times p}\). Compute the value of the RV coefficient. Repeat the process 100 times and take the quantile at 95% of this empirical distribution (under the null hypothesis of no linear relationship) of RV coefficient. The aim is now to vary \(n\) and \(p=q\) and to perform the same operation for for \(n=25,30,35,50,70,100\) and \(p=q=5,10,20,50,100\). Create a matrix to gather your results, i.e. which contains the values of these quantiles of RV coefficient for the different size.

(Q2) Comment in two or three sentences the obtained values of the RV coefficient.

8.5.2 Permutation test for wine data

From the results of the preceding section (RV coefficient), a testing procedure to test the significance of the association between \(X\) and \(Y\) is necessary. One usually sets up the hypothesis test by taking

  • \(H_0\), \(\rho V=0\) there is no linear relationship between the two sets
  • \(H_1\), \(\rho V>0\) there is a linear relationship between the two sets

The fact that \(\rho V=0\) (which corresponds to the population covariance matrix \(\Sigma_{XY}=0\)) does not necessarily imply independence between X and Y (except when they are multivariate normal), only the absence of a linear relationship between them.

In this section, we focus on a permutation test. Repeated permutation of the rows of one matrix and computation of the statistic such as the RV coefficient provides the distribution of the RV coefficient under \(H_0\).

The data we are working on have already been introduced in the classes (Wine tasting). It is a sensory analysis where judges assessed wines. We worked on a data set where the judges were experts (Expert_wine.csv). In addition, we have at disposal data where the wines have been described by Students (Student_wine.csv) and usual consumers (Consumer_wine.csv). We want to globally compare the analysis of the wines between these different judges, i.e. to see if the wines are described in the same way by the different panelists (if wines which are close for Experts are also close for Students).

(R2) Import the data sets and keep only the quantitative variables. Compute the RV coefficients between the different pairs of the judges (Experts, Consumers, Students). Comment the results.

For a couple of matrices, \({\mathbf{X}}_1\) and \({\mathbf{X}}_2\) (for instance Experts, Consumers), using the sample function, produce 10000 permutations of the rows of one matrix say \({\mathbf{X}}_1\) and compute each time the RV coefficient between the matrix with permuted rows \({\mathbf{X}}_1^{perm}\) and \({\mathbf{X}}_2\), (perm=1, …, 10000). Plot the histogram of the empirical distribution of the RV coefficient you obtain. Position the observed value of the RV coefficient calculated between the initial matrices \({\mathbf{X}}_1\) and \({\mathbf{X}}_2\). Calculate the \(p\)-value using this empirical distribution, and compare it to the \(p\)-value given by the coeffRV function. The \(p\)-value is defined as the proportion of the values that are greater or equal to the observed coefficient.

(Q3) Explain in two sentences why this procedure makes sense to test the null hypothesis of no association.

8.5.3 Power of the RV and dCov tests in the case of a linear relation

The dCov coefficient enables to detect relationships between two variables in a nonlinear setting. The definition and properties are beyond the scope of this exercice, but the dCov coefficient is equal to zero, if and only if there is independence between the random vectors.

Several testing procedures exist for the RV and dCov coefficients, based on the asymptotic distribution or on permutation tests (as we have seen in the preceding paragraph). Numerically, they are implemented in the coeffRV function for the RV coefficient and in the dcov.test function of the package energy for the dCov coefficient.

We want to compare the power of these tests in the case of a linear relation. Let us denote by \(\mathbb{1}\) the matrix full of \(1\).

(R3) For \(p=q=5\) and \(n=25,30,35,50,70,100\), simulate a gaussian vector \((X,Y)\) of dimension \(p+q=10\), of mean \(0\) and covariance matrix \(0.1 \times \mathbb{1}_{10 \times 10} + 0.9 \times \text{Id}_{10 \times 10}\). For 1000 simulations, calculate the \(p\)-value of the RV and dCov tests respectively. Count the number of times the \(p\)-value is less than \(0.1\). The frequencies obtained respectively for the RV and dCov tests are approximations of the power of the tests. Plot the powers of these tests as a function of \(n\).

(Q4) Comment in two or three sentences the plot.

8.5.4 Power of the RV and dCov tests in the case of a nonlinear relation

The same steps can be done in the case of a nonlinear relation between \(X\) and \(Y\).

(R4) For \(p=q=10\) and \(n=25,30,35,50,70,100\), simulate a gaussian vector \(X\) of dimension \(p=10\), of mean \(0\) and covariance matrix \(\text{Id}_{10 \times 10}\). Then set \(Y=\log(X^2)\). For 1000 simulations, calculate the \(p\)-value of the RV and dCov tests respectively. Count the number of times the \(p\)-value is less than \(0.1\). The frequencies obtained respectively for the RV and dCov tests are approximations of the power of the tests. Plot the powers of these tests in function of \(n\).

(Q5) Comment in two or three sentences the plot.