R Statistics | Data Filtering

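When a dataset is large, filtering makes it easier to extract just the data you want to analyze. In SPSS and Excel this is done with the if function; in SAS, with keep or drop. R's approach is comparatively more intuitive.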

Filtering Cases (Rows)

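R stores data as a data frame, which means it identifies data by row and column, much like cells in Excel, where every position is defined by a row and a column. R data therefore always follows the R x C (row by column) principle: by specifying a row index and a column index, you can quickly pull out the data you want.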

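Under the R x C principle, [6:10,] selects cases 6 through 10. No column is specified, so the space after the comma is left blank. Since class_new contains ten students, this returns the last five:


     > load(file="c:/Users/USER/downloads/class_new.RData")

     > class_new[6:10,] #select cases 6 through 10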
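
The file class_new.RData is not distributed with this article, so the sketch below rebuilds a stand-in data frame from the values printed in the tables further down. The column types (plain character and numeric vectors) are an assumption and may differ from the original file. In a ten-row data frame the last five rows are rows 6 through 10, and the sketch compares positional indexing with the base R function tail().

```r
# Stand-in for class_new, rebuilt from the tables printed in this article.
# Assumption: the original .RData file may store these columns with other types.
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

last5_by_index <- class_new[6:10, ]   # rows 6 to 10, all columns
last5_by_tail  <- tail(class_new, 5)  # last five rows, whatever the row count

# Compare the two results
all.equal(last5_by_index, last5_by_tail)
```

tail() has the advantage of not hard-coding the row count, which helps when the number of cases is not known in advance.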

From the last five students, select the males:


     > class_last5<-class_new[6:10,] #select the last 5 students and assign the result to class_last5

     > class_last5[class_last5$gender=="M",] #from the last 5 students, select the males

Use "&" to build a compound condition, here selecting male students who are 170 cm or taller:


     > class_last5[class_last5$gender=="M" & class_last5$height>=170,] #select male students at least 170 cm tall

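You can also combine "%in%" with c() to select cases whose values match any item in a set; the matching cases do not need to be contiguous:


     > class_new[class_new$math %in% c(93,89,82,58,48),] #select students whose math score is 93, 89, 82, 58, or 48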
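
The row filters above work because the expression inside the brackets evaluates to a logical vector with one TRUE or FALSE per row. The following sketch makes that visible using plain vectors whose values are copied from the tables printed later in this article; the variable names is_tall_male and in_set are illustrative only.

```r
# Values copied from the class_new tables shown in this article
gender <- c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F")
height <- c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156)
math   <- c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)

# "&" combines two logical vectors element by element
is_tall_male <- gender == "M" & height >= 170
is_tall_male          # one TRUE/FALSE per case
which(is_tall_male)   # positions of the cases that pass the filter

# "%in%" tests each value for membership in the set on the right
in_set <- math %in% c(93, 89, 82, 58, 48)
sum(in_set)           # number of matching cases
```

Putting such a vector before the comma in class_new[ , ] keeps exactly the rows where it is TRUE.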

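To drop a variable (column) altogether, assign NULL to it. The change is permanent, and later examples still rely on math, so the example below works on a copy:


     > class_copy<-class_new #work on a copy so class_new keeps its math column

     > class_copy$math<-NULL #remove the math variable from the copy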

Filtering Variables (Columns)

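Selecting variables follows the same R x C principle: [,4:5] selects the 4th through 5th variables. This time the space before the comma is blank because no row is specified. This returns the English and math scores:


     > class_new[,4:5] #select the English and math scores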

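Select the name, English, and math variables:


    > class_new[,c(1,4,5)] #select the name, English, and math variables

    > class_new[,c("name","english","math")] #alternative: refer to the variables by name

    > class_new[,c(-2,-3)] #alternative: exclude variables by position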
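
As a quick check that the three forms above are interchangeable, the sketch below applies each one to a stand-in data frame rebuilt from the tables printed in this article (its column types are an assumption) and compares the results. It also shows drop = FALSE, a standard argument of `[` that keeps the result a data frame when only one column is selected.

```r
# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

by_position  <- class_new[, c(1, 4, 5)]                    # by column number
by_name      <- class_new[, c("name", "english", "math")]  # by column name
by_exclusion <- class_new[, c(-2, -3)]                     # drop columns 2 and 3

all.equal(by_position, by_name)
all.equal(by_position, by_exclusion)

# Selecting a single column returns a plain vector unless drop = FALSE is given
math_vector <- class_new[, "math"]
math_frame  <- class_new[, "math", drop = FALSE]
is.data.frame(math_vector)
is.data.frame(math_frame)
```

Selecting by name tends to be the most robust of the three, since it keeps working if the column order changes.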

Filtering Cases and Variables Together

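subset() offers a more convenient way to filter cases and variables in one step. Inside subset(), columns can be referred to by name without the class_new$ prefix. For example, to select female students who scored 80 or above in both English and math, use "&" to require that all conditions hold (intersection):


     > excellent_female<-subset(class_new, gender=="F" & english>=80 & math>=80, select=c(1,4,5))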

     > excellent_female

         name english math
     2  Ariel      90   89
     10 Sarah      89   93

To select female students who scored 80 or above in at least one of English and math, use "|" so that either condition suffices (union):


     > good_female<-subset(class_new, gender=="F" & (english>=80 | math>=80), select=c(1,4,5))

     > good_female

         name english math
     1   Anne      81   60
     2  Ariel      90   89
     4   Emma      73   82
     10 Sarah      89   93
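
For comparison, the sketch below writes the same "both scores at least 80" filter once with subset() and once with bracket indexing, using a stand-in data frame rebuilt from the tables printed in this article (column types are an assumption). The names with_subset and with_brackets are illustrative.

```r
# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

# subset(): row condition and column selection in one call
with_subset <- subset(class_new,
                      gender == "F" & english >= 80 & math >= 80,
                      select = c(name, english, math))

# Bracket indexing: rows before the comma, columns after it
keep_rows <- class_new$gender == "F" &
  class_new$english >= 80 & class_new$math >= 80
with_brackets <- class_new[keep_rows, c("name", "english", "math")]

# Compare the two results
all.equal(with_subset, with_brackets)
```

One difference worth knowing: when the condition evaluates to NA for a row, subset() drops that row, whereas bracket indexing returns a row of NAs. The stand-in data has no missing values, so the two agree here.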

Random Sampling

sample() draws a random sample directly from a variable:


     > sample(class_new$name, size=6, replace=FALSE)

     [1] Emma   Roger  Lewis  Kevin  Anne   George
     Levels: Ariel Kevin Lewis Sarah George Linda Anne Emma Roger Bruce

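You can also sample whole cases at random from the data. For example, draw 6 cases with replacement; a case drawn more than once gets a suffixed row name such as 7.1:


     > class_new[sample(1:nrow(class_new), size=6, replace=TRUE),]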

           name gender height english math
     8    Linda      F    158      61   48
     10   Sarah      F    156      89   93
     2    Ariel      F    168      90   89
     5   George      M    170      77   64
     7    Lewis      M    182      97   88
     7.1  Lewis      M    182      97   88

Draw 6 cases at random without replacement:


    > class_new[sample(1:nrow(class_new), size=6, replace=FALSE),]

          name gender height english math
    1     Anne      F    171      81   60
    7    Lewis      M    182      97   88
    3    Bruce      M    174      66   58
    5   George      M    170      77   64
    10   Sarah      F    156      89   93
    2    Ariel      F    168      90   89
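
Because sample() is random, the outputs above will differ on every run. If a reproducible draw is needed, base R's set.seed() can be called first. The sketch below is a minimal illustration; the seed value 123 is arbitrary, and 1:10 stands in for the row numbers of a ten-case data set.

```r
# Fix the random number generator state, then draw 6 of 10 row numbers
set.seed(123)
first_draw <- sample(1:10, size = 6, replace = FALSE)

# Resetting the same seed reproduces the same draw
set.seed(123)
second_draw <- sample(1:10, size = 6, replace = FALSE)

identical(first_draw, second_draw)

# Without replacement, no row number can appear twice
anyDuplicated(first_draw) == 0
```

The resulting vector can be used as a row index, for example class_new[first_draw, ].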

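The dplyr Package

No discussion of data filtering would be complete without dplyr, the package developed by Hadley Wickham. dplyr provides functions for selecting, filtering, grouping, and sorting, which make managing data in R far more flexible.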

dplyr includes the following functions:

- select(): select columns

- filter(): select cases (rows)

- mutate(): add new columns

- group_by(): group the data

- arrange(): sort

- rename(): rename variables

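To use these functions, install the dplyr package once and then load it in each session:


     > install.packages("dplyr") #one-time installation

     > library(dplyr) #load the package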

select()

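select() picks the columns you want to analyze. The usage is select(data, selection1, selection2), and it can be combined with helpers such as starts_with(), ends_with(), contains(), matches(), and num_range() to specify which columns to keep.


       > select(class_new, name, english) #select the name and English score columns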

            name english
       1    Anne      81
       2   Ariel      90
       3   Bruce      66
       4    Emma      73
       5  George      77
       6   Kevin      96
       7   Lewis      97
       8   Linda      61
       9   Roger      86
       10  Sarah      89
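
The selection helpers mentioned above match columns by name rather than by position. The sketch below tries a few of them on a stand-in data frame rebuilt from the tables printed in this article (column types are an assumption); it requires the dplyr package to be installed.

```r
library(dplyr)

# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

# Columns whose names start with "e"
starts_e <- select(class_new, starts_with("e"))
names(starts_e)

# Columns whose names end with "h"
ends_h <- select(class_new, ends_with("h"))
names(ends_h)

# A minus sign excludes columns instead of keeping them
no_demographics <- select(class_new, -gender, -height)
names(no_demographics)
```

Helpers are most useful on wide data sets where related columns share a prefix or suffix.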

filter()

filter() is called the same way as select(), and you can also compute expressions directly inside filter().


      > filter(class_new, english>80) #select cases with an English score above 80

         name gender height english math
      1  Anne      F    171      81   60
      2 Ariel      F    168      90   89
      3 Kevin      M    188      96   90
      4 Lewis      M    182      97   88
      5 Roger      M    168      86   85
      6 Sarah      F    156      89   93
      > filter(class_new, english>80 & math>90) #select cases with English above 80 and math above 90

         name gender height english math
      1 Sarah      F    156      89   93
      > filter(class_new, ((english+math)/2)<60) #select cases whose English-math average is below the passing mark of 60

         name gender height english math
      1 Linda      F    158      61   48
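
In filter(), conditions separated by commas are combined with a logical AND, so they behave like "&". The sketch below checks this on a stand-in data frame rebuilt from the tables printed in this article (column types are an assumption); it requires the dplyr package.

```r
library(dplyr)

# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

with_ampersand <- filter(class_new, english > 80 & math > 90)
with_commas    <- filter(class_new, english > 80, math > 90)

# Compare the two results
all.equal(with_ampersand, with_commas)

# "|" is still needed for OR conditions
either_high <- filter(class_new, english > 95 | math > 90)
either_high$name
```

Note that filter() renumbers the rows of its result from 1, unlike bracket indexing, which keeps the original row names.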

mutate()

mutate() adds new variables.


      > Average<-mutate(class_new, Average=((english+math)/2)) #add the average score

      > Average #print the result

           name gender height english math Average
      1    Anne      F    171      81   60    70.5
      2   Ariel      F    168      90   89    89.5
      3   Bruce      M    174      66   58    62.0
      4    Emma      F    160      73   82    77.5
      5  George      M    170      77   64    70.5
      6   Kevin      M    188      96   90    93.0
      7   Lewis      M    182      97   88    92.5
      8   Linda      F    158      61   48    54.5
      9   Roger      M    168      86   85    85.5
      10  Sarah      F    156      89   93    91.0
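
A single mutate() call can create several variables, and later ones may refer to earlier ones. The sketch below adds an average and then a pass/fail flag based on it, using the passing mark of 60 from the filter() example. The data frame is a stand-in rebuilt from the tables printed in this article, and the column names average and passed are illustrative. It requires the dplyr package.

```r
library(dplyr)

# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

with_flags <- mutate(class_new,
                     average = (english + math) / 2,  # new column
                     passed  = average >= 60)         # uses the column just created

# Students who did not reach the passing mark
with_flags$name[!with_flags$passed]
```

mutate() returns a new data frame; class_new itself is unchanged unless the result is assigned back to it.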

group_by()

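group_by() is usually used together with summarize() and the %>% pipe to process data by group. The call below simply lists math within each gender group. Note that summarize() is normally given a summary function such as mean(), and that returning more than one row per group is deprecated as of dplyr 1.1.0 in favor of reframe().


      > class_new %>% group_by(gender) %>% summarize(math)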

      `summarise()` has grouped output by 'gender'. You can override using the `.groups` argument.
      # A tibble: 10 x 2
      # Groups:   gender [2]
         gender  math
       1 F         60
       2 F         89
       3 F         82
       4 F         48
       5 F         93
       6 M         58
       7 M         64
       8 M         90
       9 M         88
      10 M         85
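
A more typical use of group_by() with summarize() reduces each group to a single row. The sketch below computes the number of students and the mean math score per gender on a stand-in data frame rebuilt from the tables printed in this article; the output column names n and mean_math are illustrative. It requires the dplyr package.

```r
library(dplyr)

# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

math_by_gender <- class_new %>%
  group_by(gender) %>%               # one group per gender
  summarize(n = n(),                 # number of students in the group
            mean_math = mean(math))  # average math score in the group

math_by_gender
```

The result has one row per group, which is usually what a grouped summary is meant to produce.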

arrange()

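arrange() sorts the data. The default is ascending order; wrap a variable in desc() to sort in descending order.


      > arrange(class_new, height) #sort by height, ascending by default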

           name gender height english math
      1   Sarah      F    156      89   93
      2   Linda      F    158      61   48
      3    Emma      F    160      73   82
      4   Ariel      F    168      90   89
      5   Roger      M    168      86   85
      6  George      M    170      77   64
      7    Anne      F    171      81   60
      8   Bruce      M    174      66   58
      9   Lewis      M    182      97   88
      10  Kevin      M    188      96   90
      > arrange(class_new, desc(height)) #sort by height, descending

           name gender height english math
      1   Kevin      M    188      96   90
      2   Lewis      M    182      97   88
      3   Bruce      M    174      66   58
      4    Anne      F    171      81   60
      5  George      M    170      77   64
      6   Ariel      F    168      90   89
      7   Roger      M    168      86   85
      8    Emma      F    160      73   82
      9   Linda      F    158      61   48
      10  Sarah      F    156      89   93
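
arrange() also accepts several sort keys; later keys break ties in earlier ones. The sketch below sorts by gender first and then by height from tallest to shortest within each gender, on a stand-in data frame rebuilt from the tables printed in this article (column types are an assumption). It requires the dplyr package.

```r
library(dplyr)

# Stand-in for class_new, rebuilt from the tables printed in this article
class_new <- data.frame(
  name    = c("Anne", "Ariel", "Bruce", "Emma", "George",
              "Kevin", "Lewis", "Linda", "Roger", "Sarah"),
  gender  = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "F"),
  height  = c(171, 168, 174, 160, 170, 188, 182, 158, 168, 156),
  english = c(81, 90, 66, 73, 77, 96, 97, 61, 86, 89),
  math    = c(60, 89, 58, 82, 64, 90, 88, 48, 85, 93)
)

# First key: gender (ascending); second key: height (descending)
by_gender_height <- arrange(class_new, gender, desc(height))
by_gender_height[, c("name", "gender", "height")]
```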

rename()

rename(data, new_name=old_name) renames a variable.


      > rename(class_new, cm=height)

           name gender  cm english math
      1    Anne      F 171      81   60
      2   Ariel      F 168      90   89
      3   Bruce      M 174      66   58
      4    Emma      F 160      73   82
      5  George      M 170      77   64
      6   Kevin      M 188      96   90
      7   Lewis      M 182      97   88
      8   Linda      F 158      61   48
      9   Roger      M 168      86   85
      10  Sarah      F 156      89   93
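
Like the other dplyr verbs, rename() returns a modified copy and leaves the original data frame untouched, so the result must be assigned if the new name is to persist. The sketch below illustrates this on a small stand-in data frame with a few rows taken from the tables above; the name class_cm is illustrative. It requires the dplyr package.

```r
library(dplyr)

# A few rows taken from the class_new tables in this article
class_new <- data.frame(
  name   = c("Anne", "Ariel", "Bruce"),
  gender = c("F", "F", "M"),
  height = c(171, 168, 174)
)

# Assign the result to keep the renamed column
class_cm <- rename(class_new, cm = height)

names(class_new)  # original names are unchanged
names(class_cm)   # height now appears as cm
```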
