The discussion will use R as the only programming language
You can use Python, Java, C++, etc for the homework assignments
I will not response and offer help when you choose different languages other than R

Install R and Rstudio

R: https://cran.r-project.org
Rstudio: https://www.rstudio.com/products/rstudio/

Please install both software into your computer.

Before the discussion

For beginner: Please follow any basic tutorial online if you haven’t got a chance to use R, it will not take longer than 1 hour usually.

STAT 479 is a class design for Data Science, in this class you will know what is a real world data science question, and learn how to think, how to solve the real question, which tools you need to use.

DO NOT EXPECT

Teach you basic R grammar
Help you debug without original codes
Reply email during the weekend

We Will

Teach you some useful R packages to solve problem
Help you if we have a simple run-able code from you
Reply email during working hours in week days
Give you more resource of learning material
Reply on piazza

Read Data

Our data file name data_complete.rdata, don’t use double click !!! to open the data, it is a text file!

data <- read.table(file = "../../blsdata/data_complete.rdata", header = TRUE)

Missing Value, How to handle it?

From Wikipedia

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit (“subject”). Some items are more likely to generate a nonresponse than others: for example items about private subjects such as income. Attrition (“Dropout”) is a type of missingness that can occur in longitudinal studies - for instance studying development where a measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the test ends and one or more measurements are missing.

Solution 1: Delete all missing records

Make the data to complete-case analysis, it is also most statistical software packages do automatically. Which means omitting any case with a missing value on any of the variables.

For example:

set.seed(100)
X <- matrix(rnorm(100), ncol=1)
Y <- 5+ 2*X + rnorm(100, 0, 1)
Y1 <- Y2 <- Y
index1 <- (Y > 6.5)
index2 <- sample(sum(index1), 1:length(Y))

Y1[index1] <- NA # Missing not at random
Y2[index2] <- NA # Missing at random
lm(Y ~ X)$coef

## (Intercept)           X 
##    5.011448    1.894633

lm(Y2 ~ X)$coef

## (Intercept)           X 
##    5.008890    1.894401

lm(Y2[-index2] ~ X[-index2])$coef

## (Intercept)  X[-index2] 
##    5.008890    1.894401

The approach assumes that the missing data are missing completely at random (MCAR), i.e., the omitted cases are likely to be random sub samples.

When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR. In the case of MCAR, the missingness of data is unrelated to any study variable: thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.

Complete-case analysis is not recommended for two reasons:

Omitting a possibly substantial number of individuals will cause a large amount of information to be discarded and lower the effective sample size of the data.
More worrisome is that dropping the cases with missing values on one or more variables can lead to serious biases in both estimation and inference if the discarded cases are not a random sub sample of the observed data.

# index1 <- Y > 6.5 # Missing if Y value greater than 6.5
# index2 <- sample(sum(index1), 1:length(Y)) 
# sample(x, size, replace = FALSE, prob = NULL) use help(sample)
# Y1[index1] <- NA # Missing not at random
# Y2[index2] <- NA # Missing at random
lm(Y2 ~ X)$coef

## (Intercept)           X 
##    5.008890    1.894401

lm(Y1 ~ X)$coef

## (Intercept)           X 
##    4.809966    1.634907

Solution 2: Single Imputation

### Let us use a subset of our homework data, use your own datafile path
original_data <- read.table(file = "../../blsdata/data_complete.rdata", header = TRUE)
INTRDVX_index <- which(colnames(original_data) == "INTRDVX")
data_use <- original_data[1:200, c(1:80, INTRDVX_index)]
dim(na.omit(data_use))

## [1]  0 81

An alternative answer to the missing-data problem is to consider some form of imputation, the practice of filling in missing data with plausible values.

Methods that impute the missing values have the advantage that, unlike in complete-case analysis, observed values in the incomplete cases are retained. On the surface, it looks like imputation will solve the missing-data problem and enable the investigator to progress normally.

Use mean, median or mode to impute missing value is this type of method.

Usually we consider use mean or median to impute continuous variable, and mode to impute categorical variable.

library(randomForest)
data_impute_roughfix <- na.roughfix(data_use)

na.roughfix

Solution 3: Multiple Imputation

The another way to deal with missing values is suggested by Rubin (1987) known as multiple imputation. This is a Monte Carlo technique in which the missing values are replaced by \(m > 1\) simulated versions, where m is typically small (say \(3-10\)). Instead of filling in a single value for each missing value, multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. Each of the simulated complete data sets is analysed separately. Then the results are combined to produce estimates and confidence intervals that account for missing value uncertainty. It is actually always give you larger confidence intervals.

However, one should always bear in mind that the imputed values are not real measurements. # mice package

We use R package mice to do multiple imputation.

Missing Data Patterns

mice package also includes several functions for identifying the missing data pattern(s) present in the dataset.

library(mice)
### number of the missing values
md.pattern(data_use[, 1:8]) # for display purpose

##    DIRACC_ AGE_REF AGE_REF_ AGE2_ AS_COMP1 AS_C_MP1 DIRACC AGE2    
## 91       1       1        1     1        1        1      1    1   0
##  3       1       1        1     1        1        1      0    1   1
## 98       1       1        1     1        1        1      1    0   1
##  8       1       1        1     1        1        1      0    0   2
##          0       0        0     0        0        0     11  106 117

A matrix with ncol(x)+1 columns, in which each row corresponds to a missing data pattern (1=observed, 0=missing). Rows and columns are sorted in increasing amounts of missing information. The last column and row contain row and column counts, respectively.

The first columns store how many cases in the situation. The last columns shows There are 91 (out of 200) rows that are complete, AGE2 has 106 records which are missing. The first columns indication how many cases in the situation.

Another way to study the pattern involves calculating the number of observations per patterns for all pairs of variables. A pair of variables can have exactly four missingness patterns: both variables are observed (pattern rr), the first variable is observed and the second variable is missing (pattern rm), the first variable is missing and the second variable is observed (pattern mr), and both are missing (pattern mm).

p <- md.pairs(data_use[, 1:3])
p

## $rr
##         DIRACC DIRACC_ AGE_REF
## DIRACC     189     189     189
## DIRACC_    189     200     200
## AGE_REF    189     200     200
## 
## $rm
##         DIRACC DIRACC_ AGE_REF
## DIRACC       0       0       0
## DIRACC_     11       0       0
## AGE_REF     11       0       0
## 
## $mr
##         DIRACC DIRACC_ AGE_REF
## DIRACC       0      11      11
## DIRACC_      0       0       0
## AGE_REF      0       0       0
## 
## $mm
##         DIRACC DIRACC_ AGE_REF
## DIRACC      11       0       0
## DIRACC_      0       0       0
## AGE_REF      0       0       0

Imputation

The mice function will help us do imputations. The default method of imputation in the “mice” package is “pmm” and the default number of imputations is 5. If you would like to change the default number you use the following argument.

tempdata <- mice(data_use, m = 5, maxit = 5, seed = 500)

## 
##  iter imp variable
##   1   1  AGE2  BATHRMQ

## Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.75733e-16

Maybe one way?

When we use mice, in every regression, mice treat imputing variable as the outcome, all of others as predictor. How about choose some of the variables as predictor?

predictorMatrix in mice function

A square matrix of size ncol(data) containing 0/1 data specifying the set of predictors to be used for each target column. Rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of ‘1’ means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero. The default for predictorMatrix is that all other columns are used as predictors (sometimes called massive imputation). Note: For two-level imputation codes ‘2’ and ‘-2’ are also allowed.

data_temp <- data_use

## missing_ratio and unqiue_ratio are my personal functions, maybe not helpful :)

missing_temp <- apply(data_temp, 2, missing_ratio) 
unique_temp <- apply(data_temp, 2, unique_ratio) 

index <- (missing_temp < 0.2) * unique_temp  # suppose I choose variables which not has more than 20% missing also delete all the variables which has unique value

p <- ncol(data_temp)
impute_matrix <- matrix(0, p, p)
impute_matrix[,index] <- 1
diag(impute_matrix) <- 0

impute_data <- mice(data_temp, m = 5, maxit = 5, method = "pmm", seed = 500, predictorMatrix = impute_matrix)

The output states that, as we requested, 5 imputed datasets were created. Our two variables with missing values were imputed using “pmm”, and of course you can choose different methods. The predictor matrix tells us which variables in the dataset were used to produce predicted values for matching.

completeddata <- complete(x = impute_data, action = 1)
completeddata[1:10, 1:5]

##    DIRACC DIRACC_ AGE_REF AGE_REF_ AGE2
## 1       1       D      82        D   87
## 2       1       D      69        D   71
## 3       1       D      22        D   43
## 4       1       D      57        D   87
## 5       1       D      45        D   43
## 6       1       D      37        D   31
## 7       1       D      29        D   31
## 8       1       D      53        D   59
## 9       1       D      64        D   87
## 10      1       D      46        D   87

The missing values have been replaced with the imputed values in the first of the 5 datasets. If you wish to use another one, just change action = 2 in the complete() function.

Maybe another? missForest

library(missForest)
forest_impute <- missForest(data_temp)

We can also use this package to get all the imputation!!!

One more thing…

Description file for GUIDE

sapply(data_temp, class)

##    DIRACC   DIRACC_   AGE_REF  AGE_REF_      AGE2     AGE2_  AS_COMP1 
## "integer"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##  AS_C_MP1  AS_COMP2  AS_C_MP2  AS_COMP3  AS_C_MP3  AS_COMP4  AS_C_MP4 
##  "factor" "integer"  "factor" "integer"  "factor" "integer"  "factor" 
##  AS_COMP5  AS_C_MP5   BATHRMQ  BATHRMQ_  BEDROOMQ  BEDR_OMQ  BLS_URBN 
## "integer"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##  BUILDING  BUIL_ING  CUTENURE  CUTE_URE  EARNCOMP  EARN_OMP  EDUC_REF 
## "integer"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##  EDUC0REF    EDUCA2   EDUCA2_  FAM_SIZE  FAM__IZE  FAM_TYPE  FAM__YPE 
##  "factor" "integer"  "factor" "integer"  "factor" "integer"  "factor" 
##  FAMTFEDX  FAMT_EDX  FEDRFNDX  FEDR_NDX   FEDTAXX  FEDTAXX_  FGOVRETX 
## "integer"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##  FGOV_ETX  FINCATAX  FINCAT_X  FINCBTAX  FINCBT_X  FINDRETX  FIND_ETX 
##  "factor" "integer"  "factor" "integer"  "factor" "integer"  "factor" 
##  FINLWT21  FJSSDEDX  FJSS_EDX  FPRIPENX  FPRI_ENX   FRRDEDX  FRRDEDX_ 
## "numeric" "integer"  "factor" "integer"  "factor" "integer"  "factor" 
##  FRRETIRX  FRRE_IRX  FSALARYX  FSAL_RYX   FSLTAXX  FSLTAXX_     FSSIX 
## "integer"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##    FSSIX_  HLFBATHQ  HLFB_THQ  INC_HRS1  INC__RS1  INC_HRS2  INC__RS2 
##  "factor" "integer"  "factor" "integer"  "factor" "integer"  "factor" 
##  INC_RANK  INC__ANK  INCNONW1  INCN_NW1  INCNONW2  INCN_NW2  INCOMEY1 
## "numeric"  "factor" "integer"  "factor" "integer"  "factor" "integer" 
##  INCO_EY1  INCOMEY2  INCO_EY2   INTRDVX 
##  "factor" "integer"  "factor" "integer"

k = ncol(data_temp)
intrdvx_index <- which(colnames(data_temp) == "INTRDVX")
role <- rep("n", k)
for(j in 1:k) {
  if(class(data_temp[, j]) == "factor") role[j] = "c"
}

role2 <- ifelse(sapply(data_temp, class) == "factor", "c", "n")
role[intrdvx_index] = role2[intrdvx_index] = "d"
write.csv(data_temp, file = "data_temp.csv", row.names = FALSE)
write("data_temp.csv", file = "data.dsc")
write("NA", file="data.dsc", append=TRUE) # NA
write("2", file="data.dsc", append=TRUE) #
write.table(cbind(1:k, colnames(data_temp), role),
            file="data.dsc", append=TRUE, row.names=FALSE, col.names=FALSE, quote=FALSE)

cat data.dsc

data_temp.csv
NA
2
1 DIRACC n
2 DIRACC_ c
3 AGE_REF n
4 AGE_REF_ c
5 AGE2 n
6 AGE2_ c
7 AS_COMP1 n
8 AS_C_MP1 c
9 AS_COMP2 n
10 AS_C_MP2 c
11 AS_COMP3 n
12 AS_C_MP3 c
13 AS_COMP4 n
14 AS_C_MP4 c
15 AS_COMP5 n
16 AS_C_MP5 c
17 BATHRMQ n
18 BATHRMQ_ c
19 BEDROOMQ n
20 BEDR_OMQ c
21 BLS_URBN n
22 BUILDING n
23 BUIL_ING c
24 CUTENURE n
25 CUTE_URE c
26 EARNCOMP n
27 EARN_OMP c
28 EDUC_REF n
29 EDUC0REF c
30 EDUCA2 n
31 EDUCA2_ c
32 FAM_SIZE n
33 FAM__IZE c
34 FAM_TYPE n
35 FAM__YPE c
36 FAMTFEDX n
37 FAMT_EDX c
38 FEDRFNDX n
39 FEDR_NDX c
40 FEDTAXX n
41 FEDTAXX_ c
42 FGOVRETX n
43 FGOV_ETX c
44 FINCATAX n
45 FINCAT_X c
46 FINCBTAX n
47 FINCBT_X c
48 FINDRETX n
49 FIND_ETX c
50 FINLWT21 n
51 FJSSDEDX n
52 FJSS_EDX c
53 FPRIPENX n
54 FPRI_ENX c
55 FRRDEDX n
56 FRRDEDX_ c
57 FRRETIRX n
58 FRRE_IRX c
59 FSALARYX n
60 FSAL_RYX c
61 FSLTAXX n
62 FSLTAXX_ c
63 FSSIX n
64 FSSIX_ c
65 HLFBATHQ n
66 HLFB_THQ c
67 INC_HRS1 n
68 INC__RS1 c
69 INC_HRS2 n
70 INC__RS2 c
71 INC_RANK n
72 INC__ANK c
73 INCNONW1 n
74 INCN_NW1 c
75 INCNONW2 n
76 INCN_NW2 c
77 INCOMEY1 n
78 INCO_EY1 c
79 INCOMEY2 n
80 INCO_EY2 c
81 INTRDVX d

Discussion 1

Peigen Zhou