处理 R 中的缺失、无效和重复数据

免费领取国家顶级域名（.cn/.中国）

欢迎加入国家域名信源站点导航

站点智能：AI搭建 AI辅助运营独立站，把生意做到全世界

介绍

数据科学的一个重要组成部分是清理数据并为预测建模做好准备。与数据清理相关的最常见问题是处理缺失数据、无效记录和重复值。

在本指南中，您将了解统计编程语言 R 中处理缺失、无效和重复数据的技术。

数据

在本指南中，我们将使用包含 600 个观测值和 12 个变量的虚构贷款申请数据集：

1. UID——申请人的唯一标识符

2. Marital_status - 申请人是否已婚（“是”）或未婚（“否”）

3. 家属- 申请人的家属人数

4. Is_graduate - 申请人是否为毕业生（“是”）或不是（“否”）

5. 收入- 申请人的年收入（美元）

6. Loan_amount - 提交申请的贷款金额（美元）

7. Term_months - 贷款期限

8. Credit_score - 申请人的信用评分是良好（“满意”）还是不良好（“不满意”）

9. Approval_status - 贷款申请是否被批准（“1”）或未批准（“0”）

10. 年龄- 申请人的年龄（岁）

11. 性别- 申请人是男性（“M”）还是女性（“F”）

12. 目的- 申请贷款的目的

让我们首先加载所需的库和数据。

      library(readr)

dat <- read_csv("data_cleaning.csv")
glimpse(dat)
    

输出：

      Observations: 600
Variables: 12
$ UID             <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status  <chr> "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Y...
$ Dependents      <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", ...
$ Income          <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount     <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months     <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age             <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex             <chr> "M", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M",...
$ Purpose         <chr> "Education", "Travel", "Personal", "Business", "Educat...
    

输出显示数据集有六个数值变量（标记为“int”），其余六个是分类变量（标记为“chr”）。我们将使用下面的代码行将这些变量转换为“因子”变量（“UID”变量除外）。

      names <- c(2,4,8,9,11,12)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
    

输出：

      Observations: 600
Variables: 12
$ UID             <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status  <fct> No, Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes...
$ Dependents      <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Y...
$ Income          <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount     <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months     <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age             <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex             <fct> M, M, F, F, F, M, M, M, M, M, M, M, M, F, F, M, M, M, ...
$ Purpose         <fct> Education, Travel, Personal, Business, Education, Educ...
    

我们现在准备执行以下部分中的数据清理步骤。

重复值

第一步是检查重复记录，这是现实世界数据中最常见的错误之一。重复记录会增加计算时间并降低模型准确性，因此必须删除。在我们的数据集中，“UID”是唯一标识符变量，将用于删除重复记录。下面的第一行代码使用duplicated()函数查找重复项，而第二行则打印重复项的数量。

      dup_records <- duplicated(dat$UID)
sum(dup_records)
    

输出：

      1] 3

The output shows that there are three duplicate records. We will drop these records using the first line of code below. The second line prints the dimension of the resulting dataset — 597 observations and 12 variables.

      dat <- dat[!duplicated(dat$UID), ]

dim(dat)

Output:

      1] 597  12

Invalid Values

When we looked at the data using the glimpse() function in the previous section, we realized that the age variable has incorrect entries. Let’s look at the summary of the age variable.

      summary(dat$Age)

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -12.00   36.50   51.00   49.03   61.00   76.00       2
    

The output shows that the minimum value of the variable 'Age' is -12. This is impossible and brings us to the next common problem in real world datasets: the presence of inaccurate records. It is safe to assume that for loan applications, the minimum age should be 18 years. This means that we will remove records of applicants below 18 years of age.

The first two of lines of code below give us the number of records in the dataset for which the age is below 18 years. The number of such records is seven, and they are removed with the third line of code. The fourth line prints the dimensions of the new data — 590 observations and 12 variables.

Finally, we relook at the summary of the age variable. This shows that the range of age is now 23 to 76 years, indicating that the correction has been made.

      age_18 <- dat[which(dat$Age<18),]
dim(age_18)

dat <- dat[-which(dat$Age<18),]
dim(dat)

summary(dat$Age)
    

Output:

      1] 590  12
 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  23.00   37.00   51.00   49.72   61.00   76.00       2
    

It was easy to detect incorrect entries in the age variable. In other cases, invalid values are in the form of outliers. Let’s look at an example of the 'Income' variable. The summary of the variable shows that the minimum and maximum income levels are USD 136700, and 3321001, respectively. This is a highly skewed range, indicating some extreme values. To better understand the distribution, we use the quantile function, which gives us the first to hundredth percentile values of the variable in the sequence of unit percentile.

      summary(dat$Income)

quantile(dat$Income,seq(0,1,0.01))

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 136700  386700  512800  687874  775300 3321001


    0%      1%      2%      3%      4%      5%      6%      7%      8%      9% 
 136700  136700  210676  240000  244288  254540  263684  274664  286108  295100 
    10%     11%     12%     13%     14%     15%     16%     17%     18%     19% 
 301560  311100  317700  320000  329100  332220  333300  335288  344400  346700 
    20%     21%     22%     23%     24%     25%     26%     27%     28%     29% 
 352860  358800  363084  371396  383332  386700  391172  397980  401508  405556 
    30%     31%     32%     33%     34%     35%     36%     37%     38%     39% 
 410220  412724  421052  422244  424804  431960  437016  444400  448620  454972 
    40%     41%     42%     43%     44%     45%     46%     47%     48%     49% 
 458920  465068  468448  476468  479696  486180  491380  495548  500000  506956 
    50%     51%     52%     53%     54%     55%     56%     57%     58%     59% 
 512800  515552  523184  532012  536480  551820  555504  563080  572852  577700 
    60%     61%     62%     63%     64%     65%     66%     67%     68%     69% 
 585380  607584  611276  620300  625904  633300  648308  656708  666700  683156 
    70%     71%     72%     73%     74%     75%     76%     77%     78%     79% 
 700000  721040  733300  753968  761484  775300  788132  800000  807740  821696 
    80%     81%     82%     83%     84%     85%     86%     87%     88%     89% 
 834660  853300  880008  914712  963752 1010680 1058180 1111100 1149276 1219460 
    90%     91%     92%     93%     94%     95%     96%     97%     98%     99% 
1262060 1333300 1392412 1502676 1664032 1944400 2064768 2223884 2608396 3197268 
   100% 
3321001
    

We can remove the outliers using the method described in the previous section. We can also address them through a different method of flooring and capping the extreme values. The first line of code below does the flooring of the lower outliers at the first percentile value, i.e., USD 136700. Similarly, the second line performs the capping of the higher outliers at the 99th percentile value, i.e., USD 3321001.

The third line of code prints the new summary of the variable, indicating that the correction has been done.

      dat$Income[which(dat$Income<136700)]<- 136700

dat$Income[which(dat$Income > 3321001)]<- 3321001

summary(dat$Income)

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 136700  386175  508650  685301  772650 3321001
    

Missing Values

Missing value treatment is the most common data cleaning step performed in a data science project. The line of code below prints the number of missing values in each of the variables.

      sapply(dat, function(x) sum(is.na(x)))

Output:

      UID  Marital_status      Dependents     Is_graduate          Income 
              0               0               3               6               0 
    Loan_amount     Term_months    Credit_score approval_status             Age 
              7               6               0               0               2 
            Sex         Purpose 
              0               0
    

The output above shows the presence of missing values across the variables, most of which are numerical variables, except 'Is_graduate', which is a categorical variable.

Missing Value Imputation for Numerical Variables

为数值变量估算值最常用的技术是用平均值或中值替换缺失值。在下面的代码行中，我们用中值替换“Loan_amount”中的缺失值，而用平均值替换“Term_months”中的缺失值。输出显示缺失值已被估算。

      dat$Loan_amount[is.na(dat$Loan_amount)] <- median(dat$Loan_amount, na.rm = TRUE)
table(is.na(dat$Loan_amount))

dat$Term_months[is.na(dat$Term_months)] <- mean(dat$Term_months, na.rm = TRUE)
table(is.na(dat$Term_months))
    

输出：

      FALSE 
  590 

FALSE 
  590
    

分类变量的缺失值插补

对于分类变量，了解频率分布很重要，可以使用下面的代码行打印出来。

      table(dat$Is_graduate)

输出：

      No Yes 
127 457
    

输出显示大多数申请人都是毕业生，并标有“是”标签。下面的代码行将缺失值替换为频率最高的标签“是”。

      dat$Is_graduate[is.na(dat$Is_graduate)] <- "Yes"

table(dat$Is_graduate)

输出：

      No Yes 
127 463
    

_{免责声明：本内容来源于第三方作者授权、网友推荐或互联网整理，旨在为广大用户提供学习与参考之用。所有文本和图片版权归原创网站或作者本人所有，其观点并不代表本站立场。如有任何版权侵犯或转载不当之情况，请与我们取得联系，我们将尽快进行相关处理与修改。感谢您的理解与支持！}

_查看原文