R Statistics | Missing Values

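R represents a missing value with the uppercase constant NA (not available). Missing values should be excluded from analysis, so they need to be coded as NA before any analysis begins.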

Coding missing values

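In practice, 99 or 999 is often used as the code for a missing value when data are entered. For illustration, we continue with class_new.RData and add 2 records that contain missing values to the existing 10 students: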


      > load(file="c:/Users/USER/downloads/class_new.RData")

      > Alicia<-data.frame("Alicia", "F", 180, 999, 74) # add a student, Alicia, whose English score is missing

      > names(Alicia)<-c("name", "gender", "height", "english", "math") # name Alicia's variables

      > Wu<-data.frame("Wu", 999, 175, 68 ,75) # add another student, Wu, whose gender is unknown

      > names(Wu)<-c("name", "gender", "height", "english", "math") # name Wu's variables

      > class_new<-rbind(class_new, Alicia, Wu) # append Alicia and Wu to class_new

      Warning message:
      In `[<-.factor`(`*tmp*`, ri, value = 999) :
        invalid factor level, NA generated
      > class_new

           name gender height english math
      1    Anne      F    171      81   60
      2   Ariel      F    168      90   89
      3   Bruce      M    174      66   58
      4    Emma      F    160      73   82
      5  George      M    170      77   64
      6   Kevin      M    188      96   90
      7   Lewis      M    182      97   88
      8   Linda      F    158      61   48
      9   Roger      M    168      86   85
      10  Sarah      F    156      89   93
      11 Alicia      F    180     999   74
      12     Wu   <NA>    175      68   75

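Recode Alicia's English score of 999 as NA. Wu's gender already became NA during rbind(), because gender is a factor and 999 is not one of its levels (hence the warning above), so the second assignment below changes nothing and is shown only for completeness: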


     > class_new$english[class_new$english==999]<-NA

     > class_new$gender[class_new$gender==999]<-NA

     > class_new

          name gender height english math
     1    Anne      F    171      81   60
     2   Ariel      F    168      90   89
     3   Bruce      M    174      66   58
     4    Emma      F    160      73   82
     5  George      M    170      77   64
     6   Kevin      M    188      96   90
     7   Lewis      M    182      97   88
     8   Linda      F    158      61   48
     9   Roger      M    168      86   85
     10  Sarah      F    156      89   93
     11 Alicia      F    180      NA   74
     12     Wu   <NA>    175      68   75
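
When several columns share the same sentinel code, the recoding can be done in one pass. The following is a minimal sketch on a small made-up data frame; `scores` and the helper `recode_sentinel()` are hypothetical names introduced for illustration and are not part of class_new.RData. The helper touches only numeric columns, and it defaults to 999 alone because 99 could be a legitimate exam score.

```r
# Hypothetical toy data: 999 marks a missing value in the numeric columns.
scores <- data.frame(
  name    = c("A", "B", "C"),
  english = c(81, 999, 68),
  math    = c(60, 74, 999),
  stringsAsFactors = FALSE
)

# Replace every occurrence of a sentinel code with NA, column by column.
# Only numeric columns are changed; other columns are returned untouched.
recode_sentinel <- function(df, codes = 999) {
  df[] <- lapply(df, function(col) {
    if (is.numeric(col)) col[col %in% codes] <- NA
    col
  })
  df
}

scores_clean <- recode_sentinel(scores, codes = 999)
print(scores_clean)
```

Passing the codes explicitly keeps the decision about which values count as missing visible at the call site.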

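Detecting missing values

is.na() quickly checks whether the data contain missing values:


     > sum(is.na(class_new)) # how many missing values are there in the whole data set?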

     [1] 2


     > is.na(class_new) # check the whole data set

         name gender height english  math
     1  FALSE  FALSE  FALSE   FALSE FALSE
     2  FALSE  FALSE  FALSE   FALSE FALSE
     3  FALSE  FALSE  FALSE   FALSE FALSE
     4  FALSE  FALSE  FALSE   FALSE FALSE
     5  FALSE  FALSE  FALSE   FALSE FALSE
     6  FALSE  FALSE  FALSE   FALSE FALSE
     7  FALSE  FALSE  FALSE   FALSE FALSE
     8  FALSE  FALSE  FALSE   FALSE FALSE
     9  FALSE  FALSE  FALSE   FALSE FALSE
     10 FALSE  FALSE  FALSE   FALSE FALSE
     11 FALSE  FALSE  FALSE    TRUE FALSE
     12 FALSE   TRUE  FALSE   FALSE FALSE


     > is.na(class_new$english) # check the single variable english

      [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE


     > is.na(class_new$gender) # check the single variable gender

      [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
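
For a larger data set, a per-column count is usually easier to read than the full TRUE/FALSE matrix. The sketch below uses a hypothetical toy data frame `df` (not the class data) and only base R functions.

```r
# Hypothetical toy data with one NA in each of two columns.
df <- data.frame(
  gender  = factor(c("F", "M", NA)),
  english = c(81, NA, 68),
  math    = c(60, 74, 75)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(df))

# Row and column index of every missing value.
na_positions <- which(is.na(df), arr.ind = TRUE)

print(na_per_column)
print(na_positions)
```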

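complete.cases() checks each case (row) instead. It returns TRUE when a case is complete and FALSE when it contains a missing value. Here the last 2 cases contain missing values: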


     > complete.cases(class_new)

      [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

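Used as the row index in [row, column] subsetting, complete.cases() also removes every case that contains a missing value. Note that when many values are missing, this can discard a large share of the cases.


     > class_new_complete<-class_new[complete.cases(class_new),]

     > class_new_complete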

          name gender height english math
     1    Anne      F    171      81   60
     2   Ariel      F    168      90   89
     3   Bruce      M    174      66   58
     4    Emma      F    160      73   82
     5  George      M    170      77   64
     6   Kevin      M    188      96   90
     7   Lewis      M    182      97   88
     8   Linda      F    158      61   48
     9   Roger      M    168      86   85
     10  Sarah      F    156      89   93
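
Before dropping rows, it can help to check how many cases would be lost and whether restricting the check to the columns an analysis actually uses keeps more of them. This is a minimal sketch on a hypothetical toy data frame `df`; the column names only mirror the example.

```r
# Hypothetical toy data: row 2 lacks english, row 3 lacks gender.
df <- data.frame(
  gender  = factor(c("F", "M", NA, "F")),
  english = c(81, NA, 68, 90),
  math    = c(60, 74, 75, 89)
)

# Completeness over every column.
ok_all        <- complete.cases(df)
n_dropped     <- sum(!ok_all)    # number of cases that would be removed
share_dropped <- mean(!ok_all)   # proportion of cases that would be removed

# Completeness over the score columns only; row 3 is kept in this case.
ok_scores <- complete.cases(df[, c("english", "math")])
df_scores <- df[ok_scores, ]

print(n_dropped)
print(share_dropped)
print(df_scores)
```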

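Excluding missing values

Coding missing values as NA makes it possible to leave them out at analysis time, so they do not distort the statistics. Many R summary functions, such as mean(), accept the argument na.rm=TRUE for this purpose. Taking the English scores as an example, the mean cannot be computed until the missing value is excluded; once it is, the class average is available.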


     > mean(class_new$english) # the mean cannot be computed while the data contain a missing value

     [1] NA
     > mean(class_new$english, na.rm=TRUE) # exclude missing values

     [1] 80.36364
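
The same argument can be passed through sapply() to summarize several columns at once. The sketch below uses a hypothetical two-column data frame and also reports how many observed values each mean is based on, since that number differs from column to column once NA values are dropped.

```r
# Hypothetical toy data: english has one missing value, math has none.
df <- data.frame(
  english = c(81, NA, 68),
  math    = c(60, 74, 75)
)

# Extra arguments to sapply() are forwarded to mean().
col_means <- sapply(df, mean, na.rm = TRUE)

# Number of non-missing values behind each mean.
n_used <- colSums(!is.na(df))

print(col_means)
print(n_used)
```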

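Deleting missing values

When missing values are present, many calculations return NA or fail. One workable response is to delete the cases that contain missing values. R provides 4 built-in functions for handling NA:

na.fail()	Returns the data unchanged if it contains no NA; otherwise signals an error
na.omit()	Removes the cases containing NA and returns the remaining data
na.exclude()	Removes the cases containing NA like na.omit(), but records them so that model functions such as residuals() and predict() pad their results with NA back to the original length
na.pass()	Returns the original data whether or not it contains NA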


     > na.fail(class_new)

     Error in na.fail.default(class_new) : missing values in object
     > na.omit(class_new)

          name gender height english math
     1    Anne      F    171      81   60
     2   Ariel      F    168      90   89
     3   Bruce      M    174      66   58
     4    Emma      F    160      73   82
     5  George      M    170      77   64
     6   Kevin      M    188      96   90
     7   Lewis      M    182      97   88
     8   Linda      F    158      61   48
     9   Roger      M    168      86   85
     10  Sarah      F    156      89   93
     > na.exclude(class_new)

          name gender height english math
     1    Anne      F    171      81   60
     2   Ariel      F    168      90   89
     3   Bruce      M    174      66   58
     4    Emma      F    160      73   82
     5  George      M    170      77   64
     6   Kevin      M    188      96   90
     7   Lewis      M    182      97   88
     8   Linda      F    158      61   48
     9   Roger      M    168      86   85
     10  Sarah      F    156      89   93
     > na.pass(class_new)

          name gender height english math
     1    Anne      F    171      81   60
     2   Ariel      F    168      90   89
     3   Bruce      M    174      66   58
     4    Emma      F    160      73   82
     5  George      M    170      77   64
     6   Kevin      M    188      96   90
     7   Lewis      M    182      97   88
     8   Linda      F    158      61   48
     9   Roger      M    168      86   85
     10  Sarah      F    156      89   93
     11 Alicia      F    180      NA   74
     12     Wu   <NA>    175      68   75
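
On a data frame, na.omit() and na.exclude() print the same rows, so the difference between them is easy to miss. It shows up when the functions are used as the na.action of a model. The sketch below fits the same regression twice on a hypothetical toy data set (`df`, `x`, and `y` are made up for illustration) and compares the residuals.

```r
# Hypothetical toy data: the second observation has a missing response.
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, NA, 6.2, 7.9, 10.1)
)

fit_omit    <- lm(y ~ x, data = df, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = df, na.action = na.exclude)

# With na.omit the dropped case simply disappears from the residuals.
res_omit <- residuals(fit_omit)

# With na.exclude the residuals are padded with NA at the dropped position,
# so they line up with the rows of the original data frame.
res_exclude <- residuals(fit_exclude)

print(length(res_omit))
print(length(res_exclude))
print(res_exclude)
```

The fitted coefficients are the same in both cases; only the shape of the returned residuals and predictions differs.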

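Predicting missing values

Missing values do complicate analysis. Is there a better option than deletion? There is: the missing values can be estimated from the observed data and filled in. Common approaches include the mean, the median, and the K-nearest neighbors algorithm (K-Nearest Neighbours, KNN).

Mean and median

The logic behind the mean and the median is simple. Both summarize the distribution of all observed cases, which makes them a relatively safe and conservative choice for filling in a missing value.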


      > mean_english<-mean(class_new$english, na.rm=TRUE) # compute the class mean of the English scores

      > class_new[11,4]<-round(mean_english) # round the mean and store it in row 11, column 4, which is Alicia's English score

      > class_new

           name gender height  english math
      1    Anne      F    171       81   60
      2   Ariel      F    168       90   89
      3   Bruce      M    174       66   58
      4    Emma      F    160       73   82
      5  George      M    170       77   64
      6   Kevin      M    188       96   90
      7   Lewis      M    182       97   88
      8   Linda      F    158       61   48
      9   Roger      M    168       86   85
      10  Sarah      F    156       89   93
      11 Alicia      F    180       80   74
      12     Wu   <NA>    175       68   75

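Using the median of the 11 observed English scores instead (that is, before the mean was filled in) gives 81, which is close to the mean.


      > median(class_new$english, na.rm=TRUE) # median of the observed scores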

      [1] 81
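
Both fill-in rules can be wrapped in one small function. The sketch below uses the 12 English scores from the example, with Alicia's score (position 11) still missing; `impute_central()` is a hypothetical helper name, not a base R function.

```r
# English scores from the example; Alicia's score (11th) is missing.
english <- c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89, NA, 68)

# Fill every NA with a central value computed from the observed values only.
impute_central <- function(x, fun = mean) {
  fill <- fun(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

english_mean   <- impute_central(english, mean)
english_median <- impute_central(english, median)

print(round(english_mean[11], 2))   # 80.36
print(english_median[11])           # 81
```

Because every missing value receives the same number, this kind of fill-in shrinks the spread of the variable, which is worth keeping in mind when many values are missing.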

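K-nearest neighbors: caret package

The idea behind K-nearest neighbors is not hard to grasp from its name. It resembles the proverb "one who stays near vermilion turns red, one who stays near ink turns black": the more alike two people are on some conditions, the more alike they tend to be in other respects. You and your neighbors tend to have similar economic status, similar areas of activity, and similar living conditions. In other words, your characteristics can be roughly predicted from those of your neighbors.

Using K-nearest neighbors to predict a missing value is straightforward. Take the class scores as an example. We know the gender, English score, and math score of a group of students and want to predict Alicia's English score. First we ask which classmates are closest to Alicia in gender and math score. Then we take the mean or a weighted mean of those classmates' English scores as Alicia's score. That is, we find the nearest neighbors and use their data to predict the unknown value.

caret and DMwR2 are two R packages commonly used for KNN imputation. caret takes more steps, but it makes the KNN calculation easy to follow; the main advantage of DMwR2 is convenience, since it fills in all missing values in one call. We start with caret.

caret depends on the lattice and ggplot2 packages; library(caret) attaches them automatically, so loading them explicitly first is optional. The example below uses class_new with Alicia's English score still coded as NA, that is, before the mean was filled in. After loading the packages, prepare the data by removing the cases that contain missing values: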


      > library(lattice) # load the lattice package

      > library(ggplot2) # load the ggplot2 package

      > library(caret) # load the caret package

      > library(dplyr) # load the dplyr package

      > class_new_complete<-class_new[complete.cases(class_new),] # remove the cases that contain missing values

      > attach(class_new_complete)

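With the preparation done, call preProcess() from caret to set up the KNN imputation. We assume that name and height are unrelated to exam scores, so only gender, English score, and math score are passed in. Note that preProcess() ignores non-numeric columns, so gender, a factor, does not enter the distance calculation, as the output below shows. The 3 neighbors closest to the case with the missing value (k=3) are used.

The choice of K determines how many neighbors each case is compared with and therefore affects the final result. A common rule of thumb is K=√n, where n is the number of cases, with K preferably odd. An odd K avoids tied votes when neighbors are used to assign a case to a group.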
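
The rule of thumb can be written as a small helper. This is only a sketch of the heuristic described above; `choose_k()` is a hypothetical name, and rounding up to the next odd number is one possible convention, not something prescribed by caret or DMwR2.

```r
# Pick K close to sqrt(n) and make it odd.
choose_k <- function(n) {
  k <- max(1, round(sqrt(n)))
  if (k %% 2 == 0) k <- k + 1   # bump an even value up to the next odd number
  k
}

print(choose_k(10))   # sqrt(10) is about 3.16, so K = 3
print(choose_k(40))   # sqrt(40) is about 6.32, rounded to 6, then bumped to 7
```

With the 10 complete cases in this example the helper returns 3, the value used in the call below. K must not exceed the number of complete cases available as neighbors.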


      > knn<-preProcess(class_new_complete %>% dplyr::select(gender, english, math), method="knnImpute", k=3, knnSummary=mean) # select gender, english, and math and set up KNN with k=3

      > knn

      Created from 10 samples and 3 variables # the data contain 10 cases and 3 variables
      

      Pre-processing:
        - centered (2) # 2 variables (english, math) are centered
        - ignored (1) # 1 variable (gender) is ignored
        - 3 nearest neighbor imputation (2) # k=3
        - scaled (2) # 2 variables (english, math) are scaled to unit standard deviation

The means and standard deviations that preProcess() stored for standardization:


     > knn$mean

     english    math 
        81.6    75.7 
     > knn$std

      english     math 
     12.25833 16.40495

Predict Alicia's English score:


     > class_predict<-predict(knn, class_new, na.action=na.pass) # impute the missing scores with KNN

     > class_predict

          name gender height     english        math
     1    Anne      F    171 -0.04894631 -0.95702850
     2   Ariel      F    168  0.68524829  0.81073115
     3   Bruce      M    174 -1.27260397 -1.07894296
     4    Emma      F    160 -0.70156372  0.38403055
     5  George      M    170 -0.37525502 -0.71319958
     6   Kevin      M    188  1.17471135  0.87168838
     7   Lewis      M    182  1.25628853  0.74977392
     8   Linda      F    158 -1.68048985 -1.68851526
     9   Roger      M    168  0.35893958  0.56690223
     10  Sarah      F    156  0.60367111  1.05456007
     11 Alicia      F    180 -0.23929305 -0.10362729
     12     Wu   <NA>    175 -1.10944961 -0.04267006

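The imputed values are returned as standardized scores. Converting them back with x = z × standard deviation + mean gives a predicted English score of 79 for Alicia:


     > class_predict_eng<-round(class_predict[,4]*12.25833+81.6) # convert the English scores back to the original scale and round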

     > class_predict_eng

      [1] 81 90 66 73 77 96 97 61 86 89 79 68
     > class_predict_math<-round((class_predict %>% dplyr::select(math))*16.40495+75.7) # convert the math scores back to the original scale and round

     > class_predict_math

        math
     1    60
     2    89
     3    58
     4    82
     5    64
     6    90
     7    88
     8    48
     9    85
     10   93
     11   74
     12   75
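
To see where the 79 comes from, the calculation can be retraced by hand in base R. The sketch below is not caret's implementation. It assumes that, because gender is ignored and Alicia's English score is missing, the distance to each complete case is computed on the standardized math score alone, and that the 3 nearest neighbors' English scores are averaged (knnSummary=mean). The data are the 10 complete cases from the example.

```r
# The 10 complete cases from the example.
complete <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)
alicia_math <- 74
k <- 3

# Standardize math with the mean and SD of the complete cases.
z_math   <- (complete$math - mean(complete$math)) / sd(complete$math)
z_alicia <- (alicia_math - mean(complete$math)) / sd(complete$math)

# With a single predictor, Euclidean distance is an absolute difference.
d <- abs(z_math - z_alicia)
nearest <- order(d)[seq_len(k)]

neighbours        <- complete$name[nearest]
predicted_english <- mean(complete$english[nearest])

print(neighbours)                 # Emma, George, Roger
print(round(predicted_english))   # 79
```

Under these assumptions the three nearest classmates are Emma, George, and Roger, and the mean of their English scores (73, 77, 86) is about 78.67, which agrees with the back-transformed caret value above.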

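K-nearest neighbors: DMwR2 package

Compared with the step-by-step caret workflow, knnImputation() in DMwR2 fills in all missing values in a single call.

The meth argument of knnImputation(), either meth="weighAvg" or meth="median", selects how the neighbors' values are combined when predicting numeric or nominal variables.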


     > library(DMwR2)

     > class_new_DMwR2<-knnImputation(class_new, k=3)

     > class_new_DMwR2

          name gender height  english math 
     1    Anne      F    171 81.00000   60
     2   Ariel      F    168 90.00000   89
     3   Bruce      M    174 66.00000   58
     4    Emma      F    160 73.00000   82
     5  George      M    170 77.00000   64
     6   Kevin      M    188 96.00000   90
     7   Lewis      M    182 97.00000   88
     8   Linda      F    158 61.00000   48
     9   Roger      M    168 86.00000   85
     10  Sarah      F    156 89.00000   93
     11 Alicia      F    180 72.37666   74
     12     Wu      M    175 68.00000   75

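In this output, DMwR2 predicts that Wu is male and that Alicia's English score is 72, which differs somewhat from the 79 predicted with caret. Some difference is expected, since the two calls were given different sets of variables, and the packages may also weight the neighbors differently.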
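
A distance-weighted average is one way neighbors can be combined instead of a plain mean. The sketch below only illustrates the idea. The neighbor values and distances are made up, and the weighting exp(-distance) is an assumption about how a "weighAvg"-style rule might behave; it should be checked against the DMwR2 documentation and is not expected to reproduce the 72.37666 above.

```r
# Hypothetical neighbor values and distances (not taken from the example data).
neighbour_values <- c(73, 77, 86)
neighbour_dist   <- c(0.5, 0.6, 0.7)

# Assumed weighting: closer neighbors receive larger weights.
w <- exp(-neighbour_dist)

weighted_estimate <- weighted.mean(neighbour_values, w)
simple_estimate   <- mean(neighbour_values)

print(weighted_estimate)
print(simple_estimate)
```

Here the closest neighbors have the lower scores, so the weighted estimate falls below the simple mean.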
