Learning Progress: 40.77%.
- https://bookdown.org/max/FES/
- https://github.com/topepo/FES
Preface
Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance. This lack of performance may be due to a simple to explain, but difficult to pinpoint, cause: the relevant predictors that were collected are represented in a way that makes it hard for models to achieve good performance. Key relationships that are not directly available as predictors may be between the response and:
- a transformation of a predictor,
- an interaction of two or more predictors such as a product or ratio,
- a functional relationship among predictors, or
- an equivalent re-representation of a predictor.
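As a hedged illustration, the Python sketch below constructs the first two kinds of re-representation with pandas; the data frame and column names are invented for the example, not drawn from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical predictors; the column names are illustrative only.
df = pd.DataFrame({
    "income": [42_000, 58_000, 130_000, 75_000],
    "debt":   [10_000, 30_000,  20_000, 40_000],
    "age":    [25, 38, 52, 44],
})

# A transformation of a predictor: log-income may relate to the
# response more simply than raw income.
df["log_income"] = np.log(df["income"])

# Interactions of two predictors, as a product and as a ratio.
df["age_x_income"]   = df["age"] * df["income"]
df["debt_to_income"] = df["debt"] / df["income"]

print(df.head())
```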
Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. The engineering connotation implies that we know the steps to take to fix poor performance and to guide predictive improvement. However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance. This process, too, can lead to overfitting due to the vast number of alternative predictor representations. So appropriate care must be taken to avoid overfitting during the predictor creation process.
The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice. In the end, we hope that these tools and our experience will help you generate better models.
1 Introduction
Whether the model will be used for inference or estimation (or, on rare occasions, both), there are important characteristics to consider. Parsimony (or simplicity) is a key consideration. Simple models are generally preferable to complex models, especially when inference is the goal.
The problem, however, is that accuracy should not be seriously sacrificed for the sake of simplicity. A simple model might be easy to interpret but would not succeed if it does not maintain an acceptable level of faithfulness to the data; if a model is only 50% accurate, should it be used to make inferences or predictions? Complexity is usually the solution to poor accuracy. By using additional parameters or by using a model that is inherently nonlinear, we might improve accuracy, but interpretability will likely suffer greatly. This trade-off is a key consideration for model building.
The goal of this book is to help practitioners build better models by focusing on the predictors. “Better” depends on the context of the problem but most likely involves the following factors: accuracy, simplicity, and robustness.
1.1 Important Concepts
While models can overfit to the data points, such as with the housing data shown above, feature selection techniques can overfit to the predictors. This occurs when a variable appears relevant in the current data set but shows no real relationship with the outcome once new data are collected. The risk of this type of overfitting is especially dangerous when the number of data points, denoted as \(n\), is small and the number of potential predictors (\(p\)) is very large. As with overfitting to the data points, this problem can be mitigated using a methodology that will show a warning when this is occurring.
Both supervised and unsupervised analyses are susceptible to overfitting, but supervised analyses are particularly prone to discovering erroneous patterns in the data for predicting the outcome. In short, we can use these techniques to create a self-fulfilling predictive prophecy.
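One way to see the problem, sketched below in Python under invented assumptions: with small \(n\), very large \(p\), and predictors that are pure noise, screening features on the full data set before cross-validation produces optimistic accuracy, whereas performing the screening inside each resample does not. The univariate screen and the value k=20 are choices made for the demonstration, not the book's prescription.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 50, 5000                      # small n, very large p
X = rng.normal(size=(n, p))          # pure noise predictors
y = rng.integers(0, 2, size=n)       # outcome unrelated to X

# Wrong: screen predictors on the full data set, then resample.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# Better: put the screening step inside the resampling loop.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection outside CV: {biased:.2f}")  # optimistically high
print(f"selection inside CV:  {honest:.2f}")  # near chance (0.5)
```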
Models can also be evaluated in terms of variance and bias (Geman, Bienenstock, and Doursat 1992). A model has high variance if small changes to the underlying data used to estimate the parameters cause a sizable change in those parameters (or in the structure of the model). For example, the sample mean of a set of data points has higher variance than the sample median. The latter uses only the values in the center of the data distribution and, for this reason, it is insensitive to moderate changes in the values. A few examples of models with low variance are linear regression, logistic regression, and partial least squares. High-variance models include those that strongly rely on individual data points to define their parameters such as classification or regression trees, nearest neighbor models, and neural networks. To contrast low-variance and high-variance models, consider linear regression and, alternatively, nearest neighbor models. Linear regression uses all of the data to estimate slope parameters and, while it can be sensitive to outliers, it is much less sensitive than a nearest neighbor model.
Model bias reflects the ability of a model to conform to the underlying theoretical structure of the data. A low-bias model is one that can be highly flexible and has the capacity to fit a variety of different shapes and patterns. A high-bias model would be unable to estimate values close to their true theoretical counterparts. Linear methods often have high bias since, without modification, they cannot describe nonlinear patterns in the predictor variables. Tree-based models, support vector machines, neural networks, and others can be very adaptable to the data and have low bias.
As one might expect, model bias and variance can often be in opposition to one another; in order to achieve low bias, models tend to demonstrate high variance (and vice versa). The variance-bias trade-off is a common theme in statistics. In many cases, models have parameters that control the flexibility of the model and thus affect the variance and bias properties of the results.
As previously described, simplicity is an important characteristic of a model. One method of creating a low-variance, low-bias model is to augment a low-variance model with appropriate representations of the data to decrease the bias.
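To make the variance contrast concrete, the hedged Python sketch below bootstraps a simulated sine-plus-noise data set and measures how much the prediction at a fixed point moves across resamples for a linear regression versus a 1-nearest-neighbor model; the simulated data and the single-neighbor setting are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=100)
x_new = np.array([[5.0]])            # point at which to compare predictions

preds = {"linear": [], "1-NN": []}
for _ in range(200):                 # bootstrap resamples of the training data
    idx = rng.integers(0, 100, size=100)
    preds["linear"].append(
        LinearRegression().fit(x[idx], y[idx]).predict(x_new)[0])
    preds["1-NN"].append(
        KNeighborsRegressor(n_neighbors=1).fit(x[idx], y[idx]).predict(x_new)[0])

for name, p in preds.items():
    print(f"{name:6s} prediction std. dev.: {np.std(p):.3f}")
# The 1-NN predictions vary far more across resamples (high variance),
# while the linear fit is stable but cannot follow the sine shape (high bias).
```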
1.2 A More Complex Example
The overall points that should be understood from this demonstration are:
- When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem. The process is more likely to be a campaign of trial and error to achieve the best results.
- The effect of feature sets can be much larger than the effect of different models.
- The interplay between models and features is complex and somewhat unpredictable.
- With the right set of predictors, it is common that many different types of models can achieve the same level of performance. Initially, the linear models had the worst performance but, in the end, showed some of the best performance.
1.3 Feature Selection
When conducting a search for a subset of variables, it is important to realize that there may not be a unique set of predictors that will produce the best performance. There is often a compensatory effect where, when one seemingly important variable is removed, the model adjusts using the remaining variables. This is especially true when there is some degree of correlation between the explanatory variables or when a low-bias model is used. For this reason, feature selection should not be used as a formal method of determining feature significance. More traditional inferential statistical approaches are a better solution for appraising the contribution of a predictor to the underlying model or to the data set.
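The compensatory effect is easy to simulate, as in the hedged Python sketch below: two predictors are strongly correlated, and dropping the one that truly drives the response barely changes cross-validated performance because its correlated partner stands in for it. All data and coefficients are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)    # strongly correlated with x1
x3 = rng.normal(size=n)
y = 2 * x1 + x3 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([x1, x2, x3])
X_drop = np.column_stack([x2, x3])           # remove the "important" x1

r2_full = cross_val_score(LinearRegression(), X_full, y, cv=5).mean()
r2_drop = cross_val_score(LinearRegression(), X_drop, y, cv=5).mean()
print(f"with x1: {r2_full:.3f}  without x1: {r2_drop:.3f}")
# The scores are nearly identical: x2 compensates for the removed x1,
# so survival in a subset search says little about a predictor's
# individual significance.
```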
2 Illustrative Example: Predicting Risk of Ischemic Stroke
2.1 Other Considerations
Our primary point in this short tour is to illustrate that spending a little more time (and sometimes a lot more time) investigating predictors and relationships among predictors can help to improve model predictivity. This is especially true when marginal gains in predictive performance can have significant benefits.
3 A Review of the Predictive Modeling Process
3.1 Measuring Performance
During the initial phase of model building, a good strategy for data sets with two classes is to focus on the AUC statistics from these curves instead of metrics based on hard class predictions. Once a reasonable model is found, the ROC or precision-recall curves can be carefully examined to find a reasonable cutoff for the data and then qualitative prediction metrics can be used.
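As a hedged sketch of this workflow with scikit-learn: compare candidate models on ROC and precision-recall AUC first, then derive a cutoff from the chosen model's ROC curve. The simulated data and the use of Youden's J statistic to pick the cutoff are illustrative choices, not the book's prescription.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression().fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Early on, compare models on curve-based summaries of the probabilities.
print("ROC AUC:", roc_auc_score(y_te, probs))
print("PR  AUC:", average_precision_score(y_te, probs))

# Once a reasonable model is found, examine the ROC curve to pick a
# cutoff; maximizing Youden's J (TPR - FPR) is one simple heuristic.
fpr, tpr, thresholds = roc_curve(y_te, probs)
cutoff = thresholds[np.argmax(tpr - fpr)]
hard_classes = (probs >= cutoff).astype(int)
```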
3.2 Data Splitting
There are a number of ways to split the data into training and testing sets. The most common approach is to use some version of random sampling. Completely random sampling is a straightforward strategy to implement and usually protects the process from being biased towards any characteristic of the data. However, this approach can be problematic when the response is not evenly distributed across the outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome. For classification models, this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of the data.
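A short Python sketch of both stratified schemes, assuming scikit-learn's `train_test_split`; the simulated class frequencies and the quartile binning via `pandas.qcut` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Classification: stratify on the class labels directly.
y = rng.choice(["event", "no_event"], size=200, p=[0.15, 0.85])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Regression: build artificial strata from the quartiles of the
# numeric outcome and stratify on those bins.
y_num = rng.gamma(shape=2.0, size=200)
bins = pd.qcut(y_num, q=4, labels=False)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_num, test_size=0.25, stratify=bins, random_state=0)
```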
3.3 Model Optimization and Tuning
When there are many tuning parameters associated with a model, there are several ways to proceed. First, a multidimensional grid search can be conducted, where a grid of candidate parameter combinations is evaluated. In some cases, this can be very inefficient. Another approach is to define a range of possible values for each parameter and to randomly sample the multidimensional space enough times to cover a reasonable amount (Bergstra and Bengio 2012). This random search grid can then be resampled in the same way as a more traditional grid. This procedure can be very beneficial when there are a large number of tuning parameters and there is no a priori notion of which values should be used. A large grid may be inefficient to search, especially if the profile has a fairly stable pattern with little change over some range of the parameter. Neural networks, gradient boosting machines, and other models can effectively be tuned using this approach.
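A hedged scikit-learn sketch of the two strategies applied to a gradient boosting model; the parameter names, ranges, and the 25-candidate budget are choices made for the example.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
gbm = GradientBoostingClassifier(random_state=0)

# Multidimensional grid search: every combination is resampled.
grid = GridSearchCV(
    gbm,
    param_grid={"learning_rate": [0.01, 0.1], "max_depth": [2, 3, 5]},
    cv=5,
).fit(X, y)

# Random search: sample the parameter space instead of enumerating it.
rand = RandomizedSearchCV(
    gbm,
    param_distributions={
        "learning_rate": loguniform(1e-3, 0.3),
        "max_depth": randint(2, 8),
        "n_estimators": randint(50, 300),
    },
    n_iter=25,                # number of random candidates to resample
    cv=5,
    random_state=0,
).fit(X, y)

print(grid.best_params_)
print(rand.best_params_)
```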
4 Exploratory Visualizations
One of the first steps of the exploratory data process when the ultimate purpose is to predict a response is to create visualizations that help elucidate knowledge of the response and then to uncover relationships between the predictors and the response. Therefore our visualizations should start with the response, understanding the characteristics of its distribution, and then to build outward from that with the additional information provided in the predictors. Knowledge about the response can be gained by creating a histogram or box plot. This simple visualization will reveal the amount of variation in the response and if the response was generated by a process that has unusual characteristics that must be investigated further. Next, we can move on to exploring relationships among the predictors and between predictors and the response. Important characteristics can be identified by examining
- scatter plots of individual predictors and the response,
- a pairwise correlation plot among the predictors,
- a projection of high-dimensional predictors into a lower dimensional space,
- line plots for time-based predictors,
- the first few levels of a regression or classification tree,
- a heat map across the samples and predictors, or
- mosaic plots for examining associations among categorical variables.
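A minimal matplotlib sketch of the first few items in this list (a response histogram, a predictor-response scatter plot, and a pairwise correlation heat map), using an invented housing-style data frame; the column names and simulated values are illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data; 'price' plays the role of the response.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 400, 300),
    "age":  rng.integers(1, 80, 300),
})
df["price"] = 50 * df["sqft"] - 300 * df["age"] + rng.normal(0, 20_000, 300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Start with the response: a histogram shows its spread and shape.
axes[0].hist(df["price"], bins=30)
axes[0].set_title("Response distribution")

# Scatter plot of an individual predictor versus the response.
axes[1].scatter(df["sqft"], df["price"], s=8)
axes[1].set_title("Predictor vs. response")

# Pairwise correlations among the variables.
im = axes[2].imshow(df.corr(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(3))
axes[2].set_xticklabels(df.columns)
axes[2].set_yticks(range(3))
axes[2].set_yticklabels(df.columns)
fig.colorbar(im, ax=axes[2])

plt.tight_layout()
plt.show()
```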
4.1 Visualizations for Numeric Data
It is possible to visualize five or six dimensions of data in a two-dimensional figure by using colors, shapes, and faceting. But almost any data set today contains many more than just a handful of variables. Being able to visualize many dimensions in the physical space that we can actually see is crucial to understanding the data and to understanding if there are characteristics of the data that point to the need for feature engineering. One way to condense many dimensions into just two or three is to use projection techniques such as principal components analysis (PCA), partial least squares (PLS), or multidimensional scaling (MDS).
Principal components analysis finds combinations of the variables that best summarize the variability in the original data (Dillon and Goldstein 1984). The combinations are a simpler representation of the data and often identify underlying characteristics within the data that will help guide the feature engineering process.
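A hedged sketch of such a projection with scikit-learn's PCA; the random predictor matrix and the choices to standardize first and keep two components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # stand-in for a wide predictor matrix

# Put predictors on a common scale before PCA: components chase
# variance and would otherwise be dominated by large-scale columns.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)     # the 2-D projection to plot

print("variance explained:", pca.explained_variance_ratio_)
```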
4.2 Visualizations for Categorical Data
The point of this discussion is not that summary statistics with confidence intervals are always the solution to a visualization problem. The takeaway message is that each graph should have a clearly defined hypothesis and that this hypothesis is shown concisely in a way that allows the reader to make quick and informative judgments based on the data.
4.3 Summary
Advanced predictive modeling and machine learning techniques offer the allure of being able to extract complex relationships between predictors and the response with little effort by the analyst. This hands-off approach to modeling will only put the analyst at a disadvantage. Spending time visualizing the response, predictors, relationships among the predictors, and relationships between predictors and the response can only lead to better understandings of the data. Moreover, this knowledge may provide crucial insights as to what features may be missing in the data and may need to be included to improve a model’s predictive performance.
5 Encoding Categorical Predictors
5.1 Encoding Predictors with Many Categories
One potential issue is that resampling might exclude some of the rarer categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that will cause an error. Moreover, the model will not be able to provide a relevant prediction for new samples that contain this predictor. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.
The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach but it may be difficult to know a priori what terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side-effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.
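A hedged pandas/scikit-learn sketch of this first approach: build the full set of dummy variables, then drop any column that is constant within a given resample. The tiny data frame and the use of `VarianceThreshold` are illustrative assumptions; note that the surviving set of columns can differ from resample to resample, as described above.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical training data with a rare category ('C') that a
# resample might miss entirely.
train = pd.DataFrame({"city": ["A", "A", "B", "B", "C"],
                      "y":    [1, 0, 1, 0, 1]})

# Full set of dummy variables.
dummies = pd.get_dummies(train[["city"]])

# Suppose a resample contains no 'C' rows: its dummy column is all zeros.
resample = dummies.iloc[[0, 1, 2, 3]]

# Remove zero-variance columns for this resample.
vt = VarianceThreshold(threshold=0.0)
kept = resample.loc[:, vt.fit(resample).get_support()]
print(kept.columns.tolist())   # 'city_C' is dropped for this resample
```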