Encoding Data with R

Published: 2026-06-25 05:35:04   Source: Hangzhou E-commerce Research Institute (杭州电子商务研究院)   Views: 11

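Introduction

R offers several powerful machine learning algorithms. To make full use of them, however, the data must first be converted into the format they require. A common step in doing so is encoding the data, which can improve the computational power and efficiency of the algorithms. In this guide, you will learn different techniques for encoding data with R.

Data

In this guide, we will use a fictitious dataset of loan applications containing 600 observations and 10 variables: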

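  1. Marital_status: whether the applicant is married (“Yes”) or not (“No”)

  2. Dependents: number of dependents of the applicant

  3. Is_graduate: whether the applicant is a graduate (“Yes”) or not (“No”)

  4. Income: annual income of the applicant (in USD)

  5. Loan_amount: loan amount (in USD) for which the application was submitted

  6. Credit_score: whether the applicant's credit score is good (“Satisfactory”) or not (“Not_satisfactory”)

  7. approval_status: whether the loan application was approved (“1”) or not (“0”)

  8. Age: the applicant's age in years

  9. Sex: whether the applicant is male (“M”) or female (“F”)

  10. Purpose: purpose of applying for the loan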

Let's start by loading the required libraries and the data.

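      # Load plyr before dplyr to avoid function masking problems
      library(plyr)
      library(readr)
      library(dplyr)
      library(caret)

      # Read the loan application data
      dat <- read_csv("data_eng.csv")

      # Inspect the column types and the first few values
      glimpse(dat)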
    

Output:

      Observations: 600
      Variables: 10
      $ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
      $ Dependents      <int> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
      $ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
      $ Income          <int> 298500, 315500, 295100, 319300, 333300, 277700, 332100...
      $ Loan_amount     <int> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64000...
      $ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
      $ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
      $ Age             <int> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54, 54...
      $ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
      $ Purpose         <chr> "Wedding", "Wedding", "Wedding", "Wedding", "Wedding",...
    

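The output shows that the dataset has five integer variables (labeled int) and five character variables (labeled chr). Note that approval_status is stored as an integer even though it represents a binary category. We are now ready to carry out the encoding steps.

There are several ways to encode categorical variables, and the right choice depends on the distribution of labels in the variable and on the end objective. The following sections cover the most widely used techniques for encoding categorical variables.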
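
Before choosing an encoding, it can help to check how many distinct labels each character column has and how they are distributed. The sketch below does this on a small made-up data frame; `toy` and its values are illustrative assumptions, not the loan data.

```r
# Hypothetical toy data standing in for a few columns of the loan dataset
toy <- data.frame(
  Sex     = c("M", "F", "M", "M", "F"),
  Purpose = c("Wedding", "Travel", "Education", "Travel", "Travel"),
  Age     = c(34, 51, 27, 45, 62),
  stringsAsFactors = FALSE
)

# Identify the character (categorical) columns
is_chr <- vapply(toy, is.character, logical(1))
cat_cols <- names(toy)[is_chr]

# Number of distinct labels per categorical column
n_levels <- vapply(toy[cat_cols], function(x) length(unique(x)), integer(1))
print(n_levels)

# Frequency of each label in one column
purpose_counts <- table(toy$Purpose)
print(purpose_counts)
```

A column with two labels is a natural candidate for a 0/1 code, while a column with many labels usually calls for one indicator column per label.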

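Label Encoding

In simple terms, label encoding is the process of replacing the different levels of a categorical variable with dummy numbers. For instance, the variable Credit_score has two levels, “Satisfactory” and “Not_satisfactory”. These can be encoded as 1 and 0, respectively. The first line of code below performs this task, while the second line prints a table of the encoded levels.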

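      # Encode "Satisfactory" as 1 and every other level as 0
      dat$Credit_score <- ifelse(dat$Credit_score == "Satisfactory", 1, 0)

      # Count the observations in each encoded level
      table(dat$Credit_score)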
    

Output:

        0   1
      128 472
    

The above output shows that the label encoding is done. This is easy when the categorical variable has two levels, as with Credit_score. With more than two labels, integer codes are less intuitive, because they impose an arbitrary order on the levels. For example, the Purpose variable has six levels, as can be seen from the output below.

      table(dat$Purpose)
    

Output:

       Business Education Furniture  Personal    Travel   Wedding
             43       191        38       166       123        39
    

In such cases, one-hot encoding is preferred.
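
To see why integer labels are awkward for more than two levels, the sketch below label-encodes a small hypothetical vector of loan purposes with base R's `factor()`. The vector `purpose` is made up for illustration. The codes simply follow the alphabetical order of the levels, which carries no real meaning.

```r
# Hypothetical sample of loan purposes (not the actual dataset)
purpose <- c("Wedding", "Travel", "Education", "Travel", "Business")

# factor() sorts the unique labels alphabetically to form the levels
purpose_factor <- factor(purpose)
levels(purpose_factor)

# as.integer() returns the position of each value's level, i.e. its label code
purpose_code <- as.integer(purpose_factor)
print(data.frame(purpose, purpose_code))
```

A model that treats `purpose_code` as numeric would read Wedding (4) as greater than Business (1), an ordering the data does not support.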

One-Hot Encoding

In this technique, one-hot (dummy) encoding is applied to the features, creating a binary column for each category level. The resulting indicator matrix is sparse in the sense that most of its entries are zero. In each dummy variable, the value 1 represents the presence of the level in the observation, while the value 0 represents its absence.
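
As a minimal illustration of the idea, base R's `model.matrix()` can build the indicator columns for a small hypothetical vector (`purpose` below is made up and is not the loan data). Removing the intercept with `- 1` yields one column per level, while keeping the intercept makes the first level a baseline, which leaves n - 1 dummy columns.

```r
# Hypothetical sample of loan purposes with three distinct levels
purpose <- factor(c("Wedding", "Travel", "Education", "Travel"))

# One indicator column per level (no intercept)
one_hot <- model.matrix(~ purpose - 1)
print(one_hot)

# With the intercept kept, the first level ("Education") becomes the baseline;
# dropping the intercept column leaves n - 1 indicator columns
reduced <- model.matrix(~ purpose)[, -1, drop = FALSE]
print(reduced)
```

In the full set, every row contains exactly one 1. In the reduced set, a row of all zeros identifies the baseline level.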

We will apply this technique to all the remaining categorical variables. The first line of code below loads the caret package, while the second line uses the dummyVars() function to define the dummy variable encoding. The dummyVars() function works on the categorical variables. Note that the second line sets the argument fullRank to TRUE, which creates n-1 columns for a categorical variable with n unique levels.

The third line uses the output of dummyVars() to transform the dataset, dat, so that all the categorical variables are encoded as numerical variables. The fourth line of code prints the structure of the resulting data, dat_transformed, which confirms that one-hot encoding is complete.

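      library(caret)

      # Define the encoding; fullRank = TRUE drops one level per variable
      dmy <- dummyVars(" ~ .", data = dat, fullRank = TRUE)
      # Apply the encoding and convert the resulting matrix to a data frame
      dat_transformed <- data.frame(predict(dmy, newdata = dat))

      glimpse(dat_transformed)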
    

Output:

      Observations: 600
      Variables: 14
      $ Marital_status.Yes <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...
      $ Dependents         <dbl> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, ...
      $ Is_graduate.Yes    <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ...
      $ Income             <dbl> 298500, 315500, 295100, 319300, 333300, 277700, 332...
      $ Loan_amount        <dbl> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64...
      $ Credit_score       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
      $ approval_status.1  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
      $ Age                <dbl> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54,...
      $ Sex.M              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
      $ Purpose.Education  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
      $ Purpose.Furniture  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
      $ Purpose.Personal   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
      $ Purpose.Travel     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
      $ Purpose.Wedding    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    

Encoding Continuous (or Numeric) Variables

In the previous sections, we learned how to encode categorical variables. However, it is sometimes useful to encode numerical variables as well. For example, some implementations of the Naive Bayes algorithm expect categorical predictors, in which case numerical variables must be encoded first. This is also called binning.

We will consider the Income variable as an example. Let’s look at the summary statistics of this variable.

      summary(dat$Income)
    

Output:

         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       133300  384975  508350  706302  766100 8444900
    

The values of Income range from $133,300 to about $8.44 million, and the mean ($706,302) is well above the median ($508,350), which shows that the distribution is right-skewed. An additional benefit of binning is that it also takes care of outliers. Let’s create three levels of the variable Income: “Low” for incomes at or below the first quartile ($384,975), “High” for incomes above the third quartile ($766,100), and “Mid50” for the middle 50 percent of the income distribution.

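The first step is to create a vector of these cut-off points, which is done in the first line of code below. The second line names the resulting bins. The third line uses the cut() function to split the vector according to the cut-off points. Finally, we use the summary() function to compare the original Income variable with the binned Income_new variable.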

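      # Cut-off points: first and third quartiles of Income
      bins <- c(-Inf, 384975, 766100, Inf)

      # One label per interval defined by the cut-off points
      bin_names <- c("Low", "Mid50", "High")

      # cut() uses right-closed intervals by default, e.g. (-Inf, 384975]
      dat$Income_new <- cut(dat$Income, breaks = bins, labels = bin_names)

      summary(dat$Income)

      summary(dat$Income_new)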
    

Output:

         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       133300  384975  508350  706302  766100 8444900

        Low Mid50  High
        150   301   149
    

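The above output shows that the variable has been binned. It is also possible to create the bin cut-offs automatically, as shown in the code below. In this case, we create 5 bins of approximately equal width for the variable Age.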

      # breaks = 5 asks cut() for five intervals of roughly equal width
      dat$Age_new <- cut(dat$Age, breaks = 5,
                         labels = c("Bin1", "Bin2", "Bin3", "Bin4", "Bin5"))

      summary(dat$Age)

      summary(dat$Age_new)
    

Output:

         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        22.00   36.00   50.00   49.31   61.00   76.00

      Bin1 Bin2 Bin3 Bin4 Bin5
       108  117  114  162   99
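
The Income cut-off points used earlier coincide with the first and third quartiles reported by summary(). As a sketch, such cut-offs can be computed with `quantile()` instead of being typed by hand. The `income` vector below is simulated for illustration and is not the loan data.

```r
set.seed(42)
# Simulated right-skewed incomes (hypothetical values)
income <- round(rlnorm(200, meanlog = 13, sdlog = 0.6))

# First and third quartiles serve as the cut-off points
q <- quantile(income, probs = c(0.25, 0.75))
bins <- c(-Inf, q[[1]], q[[2]], Inf)

# cut() uses right-closed intervals by default: (-Inf, Q1], (Q1, Q3], (Q3, Inf]
income_bin <- cut(income, breaks = bins, labels = c("Low", "Mid50", "High"))
table(income_bin)
```

Computing the quartiles from the data keeps the bins in step with the variable if the dataset changes, whereas hard-coded cut-offs would need to be updated manually.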
    

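Conclusion

In this guide, you have learned methods of encoding data with R. You have applied these techniques to both quantitative and qualitative variables. Depending on the objective of your project, you can apply any or all of these encoding techniques. To learn more about data science with R, please refer to the following guides: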

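  1. Interpreting Data Using Descriptive Statistics with R

  2. Interpreting Data Using Statistical Models with R

  3. Time Series Forecasting Using R

  4. Hypothesis Testing - Interpreting Data with Statistical Models

  5. Machine Learning with Text Data Using R

  6. Visualization of Text Data Using Word Cloud in R

  7. Exploring Data Visually with R

  8. Reshaping Data with R

  9. Working with Data Types in R

  10. Splitting and Combining Data with R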
