Please install both software into your computer.
For beginner: Please follow any basic tutorial online if you haven’t got a chance to use R, it will not take longer than 1 hour usually.
STAT 479 is a class design for Data Science, in this class you will know what is a real world data science question, and learn how to think, how to solve the real question, which tools you need to use.
DO NOT EXPECT
We Will
Our data file name data_complete.rdata
, don’t use double click !!! to open the data, it is a text file!
data <- read.table(file = "../../blsdata/data_complete.rdata", header = TRUE)
From Wikipedia
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit (“subject”). Some items are more likely to generate a nonresponse than others: for example items about private subjects such as income. Attrition (“Dropout”) is a type of missingness that can occur in longitudinal studies - for instance studying development where a measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the test ends and one or more measurements are missing.
Make the data to complete-case analysis, it is also most statistical software packages do automatically. Which means omitting any case with a missing value on any of the variables.
For example:
set.seed(100)
X <- matrix(rnorm(100), ncol=1)
Y <- 5+ 2*X + rnorm(100, 0, 1)
Y1 <- Y2 <- Y
index1 <- (Y > 6.5)
index2 <- sample(sum(index1), 1:length(Y))
Y1[index1] <- NA # Missing not at random
Y2[index2] <- NA # Missing at random
lm(Y ~ X)$coef
## (Intercept) X
## 5.011448 1.894633
lm(Y2 ~ X)$coef
## (Intercept) X
## 5.008890 1.894401
lm(Y2[-index2] ~ X[-index2])$coef
## (Intercept) X[-index2]
## 5.008890 1.894401
The approach assumes that the missing data are missing completely at random (MCAR), i.e., the omitted cases are likely to be random sub samples.
When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR. In the case of MCAR, the missingness of data is unrelated to any study variable: thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.
Complete-case analysis is not recommended for two reasons:
# index1 <- Y > 6.5 # Missing if Y value greater than 6.5
# index2 <- sample(sum(index1), 1:length(Y))
# sample(x, size, replace = FALSE, prob = NULL) use help(sample)
# Y1[index1] <- NA # Missing not at random
# Y2[index2] <- NA # Missing at random
lm(Y2 ~ X)$coef
## (Intercept) X
## 5.008890 1.894401
lm(Y1 ~ X)$coef
## (Intercept) X
## 4.809966 1.634907
### Let us use a subset of our homework data, use your own datafile path
original_data <- read.table(file = "../../blsdata/data_complete.rdata", header = TRUE)
INTRDVX_index <- which(colnames(original_data) == "INTRDVX")
data_use <- original_data[1:200, c(1:80, INTRDVX_index)]
dim(na.omit(data_use))
## [1] 0 81
An alternative answer to the missing-data problem is to consider some form of imputation, the practice of filling in missing data with plausible values.
Methods that impute the missing values have the advantage that, unlike in complete-case analysis, observed values in the incomplete cases are retained. On the surface, it looks like imputation will solve the missing-data problem and enable the investigator to progress normally.
Use mean, median or mode to impute missing value is this type of method.
Usually we consider use mean or median to impute continuous variable, and mode to impute categorical variable.
library(randomForest)
data_impute_roughfix <- na.roughfix(data_use)
The another way to deal with missing values is suggested by Rubin (1987) known as multiple imputation. This is a Monte Carlo technique in which the missing values are replaced by \(m > 1\) simulated versions, where m is typically small (say \(3-10\)). Instead of filling in a single value for each missing value, multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. Each of the simulated complete data sets is analysed separately. Then the results are combined to produce estimates and confidence intervals that account for missing value uncertainty. It is actually always give you larger confidence intervals.
However, one should always bear in mind that the imputed values are not real measurements. # mice
package
We use R package mice
to do multiple imputation.
mice
package also includes several functions for identifying the missing data pattern(s) present in the dataset.
library(mice)
### number of the missing values
md.pattern(data_use[, 1:8]) # for display purpose
## DIRACC_ AGE_REF AGE_REF_ AGE2_ AS_COMP1 AS_C_MP1 DIRACC AGE2
## 91 1 1 1 1 1 1 1 1 0
## 3 1 1 1 1 1 1 0 1 1
## 98 1 1 1 1 1 1 1 0 1
## 8 1 1 1 1 1 1 0 0 2
## 0 0 0 0 0 0 11 106 117
A matrix with ncol(x)+1 columns, in which each row corresponds to a missing data pattern (1=observed, 0=missing). Rows and columns are sorted in increasing amounts of missing information. The last column and row contain row and column counts, respectively.
The first columns store how many cases in the situation. The last columns shows There are 91 (out of 200) rows that are complete, AGE2 has 106 records which are missing. The first columns indication how many cases in the situation.
Another way to study the pattern involves calculating the number of observations per patterns for all pairs of variables. A pair of variables can have exactly four missingness patterns: both variables are observed (pattern rr), the first variable is observed and the second variable is missing (pattern rm), the first variable is missing and the second variable is observed (pattern mr), and both are missing (pattern mm).
p <- md.pairs(data_use[, 1:3])
p
## $rr
## DIRACC DIRACC_ AGE_REF
## DIRACC 189 189 189
## DIRACC_ 189 200 200
## AGE_REF 189 200 200
##
## $rm
## DIRACC DIRACC_ AGE_REF
## DIRACC 0 0 0
## DIRACC_ 11 0 0
## AGE_REF 11 0 0
##
## $mr
## DIRACC DIRACC_ AGE_REF
## DIRACC 0 11 11
## DIRACC_ 0 0 0
## AGE_REF 0 0 0
##
## $mm
## DIRACC DIRACC_ AGE_REF
## DIRACC 11 0 0
## DIRACC_ 0 0 0
## AGE_REF 0 0 0
The mice
function will help us do imputations. The default method of imputation in the “mice” package is “pmm” and the default number of imputations is 5. If you would like to change the default number you use the following argument.
tempdata <- mice(data_use, m = 5, maxit = 5, seed = 500)
##
## iter imp variable
## 1 1 AGE2 BATHRMQ
## Error in solve.default(xtx + diag(pen)): system is computationally singular: reciprocal condition number = 1.75733e-16
When we use mice
, in every regression, mice
treat imputing variable as the outcome, all of others as predictor. How about choose some of the variables as predictor?
predictorMatrix
in mice
function
A square matrix of size ncol(data) containing 0/1 data specifying the set of predictors to be used for each target column. Rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of ‘1’ means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero. The default for predictorMatrix is that all other columns are used as predictors (sometimes called massive imputation). Note: For two-level imputation codes ‘2’ and ‘-2’ are also allowed.
data_temp <- data_use
## missing_ratio and unqiue_ratio are my personal functions, maybe not helpful :)
missing_temp <- apply(data_temp, 2, missing_ratio)
unique_temp <- apply(data_temp, 2, unique_ratio)
index <- (missing_temp < 0.2) * unique_temp # suppose I choose variables which not has more than 20% missing also delete all the variables which has unique value
p <- ncol(data_temp)
impute_matrix <- matrix(0, p, p)
impute_matrix[,index] <- 1
diag(impute_matrix) <- 0
impute_data <- mice(data_temp, m = 5, maxit = 5, method = "pmm", seed = 500, predictorMatrix = impute_matrix)
The output states that, as we requested, 5 imputed datasets were created. Our two variables with missing values were imputed using “pmm”, and of course you can choose different methods. The predictor matrix tells us which variables in the dataset were used to produce predicted values for matching.
completeddata <- complete(x = impute_data, action = 1)
completeddata[1:10, 1:5]
## DIRACC DIRACC_ AGE_REF AGE_REF_ AGE2
## 1 1 D 82 D 87
## 2 1 D 69 D 71
## 3 1 D 22 D 43
## 4 1 D 57 D 87
## 5 1 D 45 D 43
## 6 1 D 37 D 31
## 7 1 D 29 D 31
## 8 1 D 53 D 59
## 9 1 D 64 D 87
## 10 1 D 46 D 87
The missing values have been replaced with the imputed values in the first of the 5 datasets. If you wish to use another one, just change action = 2
in the complete() function.
library(missForest)
forest_impute <- missForest(data_temp)
We can also use this package to get all the imputation!!!
Description file for GUIDE
sapply(data_temp, class)
## DIRACC DIRACC_ AGE_REF AGE_REF_ AGE2 AGE2_ AS_COMP1
## "integer" "factor" "integer" "factor" "integer" "factor" "integer"
## AS_C_MP1 AS_COMP2 AS_C_MP2 AS_COMP3 AS_C_MP3 AS_COMP4 AS_C_MP4
## "factor" "integer" "factor" "integer" "factor" "integer" "factor"
## AS_COMP5 AS_C_MP5 BATHRMQ BATHRMQ_ BEDROOMQ BEDR_OMQ BLS_URBN
## "integer" "factor" "integer" "factor" "integer" "factor" "integer"
## BUILDING BUIL_ING CUTENURE CUTE_URE EARNCOMP EARN_OMP EDUC_REF
## "integer" "factor" "integer" "factor" "integer" "factor" "integer"
## EDUC0REF EDUCA2 EDUCA2_ FAM_SIZE FAM__IZE FAM_TYPE FAM__YPE
## "factor" "integer" "factor" "integer" "factor" "integer" "factor"
## FAMTFEDX FAMT_EDX FEDRFNDX FEDR_NDX FEDTAXX FEDTAXX_ FGOVRETX
## "integer" "factor" "integer" "factor" "integer" "factor" "integer"
## FGOV_ETX FINCATAX FINCAT_X FINCBTAX FINCBT_X FINDRETX FIND_ETX
## "factor" "integer" "factor" "integer" "factor" "integer" "factor"
## FINLWT21 FJSSDEDX FJSS_EDX FPRIPENX FPRI_ENX FRRDEDX FRRDEDX_
## "numeric" "integer" "factor" "integer" "factor" "integer" "factor"
## FRRETIRX FRRE_IRX FSALARYX FSAL_RYX FSLTAXX FSLTAXX_ FSSIX
## "integer" "factor" "integer" "factor" "integer" "factor" "integer"
## FSSIX_ HLFBATHQ HLFB_THQ INC_HRS1 INC__RS1 INC_HRS2 INC__RS2
## "factor" "integer" "factor" "integer" "factor" "integer" "factor"
## INC_RANK INC__ANK INCNONW1 INCN_NW1 INCNONW2 INCN_NW2 INCOMEY1
## "numeric" "factor" "integer" "factor" "integer" "factor" "integer"
## INCO_EY1 INCOMEY2 INCO_EY2 INTRDVX
## "factor" "integer" "factor" "integer"
k = ncol(data_temp)
intrdvx_index <- which(colnames(data_temp) == "INTRDVX")
role <- rep("n", k)
for(j in 1:k) {
if(class(data_temp[, j]) == "factor") role[j] = "c"
}
role2 <- ifelse(sapply(data_temp, class) == "factor", "c", "n")
role[intrdvx_index] = role2[intrdvx_index] = "d"
write.csv(data_temp, file = "data_temp.csv", row.names = FALSE)
write("data_temp.csv", file = "data.dsc")
write("NA", file="data.dsc", append=TRUE) # NA
write("2", file="data.dsc", append=TRUE) #
write.table(cbind(1:k, colnames(data_temp), role),
file="data.dsc", append=TRUE, row.names=FALSE, col.names=FALSE, quote=FALSE)
cat data.dsc
data_temp.csv
NA
2
1 DIRACC n
2 DIRACC_ c
3 AGE_REF n
4 AGE_REF_ c
5 AGE2 n
6 AGE2_ c
7 AS_COMP1 n
8 AS_C_MP1 c
9 AS_COMP2 n
10 AS_C_MP2 c
11 AS_COMP3 n
12 AS_C_MP3 c
13 AS_COMP4 n
14 AS_C_MP4 c
15 AS_COMP5 n
16 AS_C_MP5 c
17 BATHRMQ n
18 BATHRMQ_ c
19 BEDROOMQ n
20 BEDR_OMQ c
21 BLS_URBN n
22 BUILDING n
23 BUIL_ING c
24 CUTENURE n
25 CUTE_URE c
26 EARNCOMP n
27 EARN_OMP c
28 EDUC_REF n
29 EDUC0REF c
30 EDUCA2 n
31 EDUCA2_ c
32 FAM_SIZE n
33 FAM__IZE c
34 FAM_TYPE n
35 FAM__YPE c
36 FAMTFEDX n
37 FAMT_EDX c
38 FEDRFNDX n
39 FEDR_NDX c
40 FEDTAXX n
41 FEDTAXX_ c
42 FGOVRETX n
43 FGOV_ETX c
44 FINCATAX n
45 FINCAT_X c
46 FINCBTAX n
47 FINCBT_X c
48 FINDRETX n
49 FIND_ETX c
50 FINLWT21 n
51 FJSSDEDX n
52 FJSS_EDX c
53 FPRIPENX n
54 FPRI_ENX c
55 FRRDEDX n
56 FRRDEDX_ c
57 FRRETIRX n
58 FRRE_IRX c
59 FSALARYX n
60 FSAL_RYX c
61 FSLTAXX n
62 FSLTAXX_ c
63 FSSIX n
64 FSSIX_ c
65 HLFBATHQ n
66 HLFB_THQ c
67 INC_HRS1 n
68 INC__RS1 c
69 INC_HRS2 n
70 INC__RS2 c
71 INC_RANK n
72 INC__ANK c
73 INCNONW1 n
74 INCN_NW1 c
75 INCNONW2 n
76 INCN_NW2 c
77 INCOMEY1 n
78 INCO_EY1 c
79 INCOMEY2 n
80 INCO_EY2 c
81 INTRDVX d