Feature Engineering and Selection

A Practical Approach for Predictive Models



Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance. This lack of performance may have a cause that is simple to explain but difficult to pinpoint: relevant predictors were collected, but they are represented in a way that makes it hard for models to achieve good performance. Key relationships that are not directly available as predictors may exist between the response and:

  • a transformation of a predictor,
  • an interaction of two or more predictors such as a product or ratio,
  • a functional relationship among predictors, or
  • an equivalent re-representation of a predictor.
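These re-representations can be sketched concretely. In the snippet below, the predictor names (`income`, `debt`, `age`) and the specific derived features are purely illustrative, not examples from the text:

```python
import math

# Hypothetical predictors for a single sample; names are illustrative.
row = {"income": 52000.0, "debt": 13000.0, "age": 41.0}

features = dict(row)
# Transformation of a predictor: the log often linearizes skewed data.
features["log_income"] = math.log(row["income"])
# Interaction of two predictors as a ratio: debt-to-income captures a
# joint effect that neither predictor expresses alone.
features["debt_to_income"] = row["debt"] / row["income"]
# Equivalent re-representation: the same information on another scale.
features["age_decades"] = row["age"] / 10.0
```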

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. The engineering connotation implies that we know the steps to take to fix poor performance and to guide predictive improvement. However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance. This process, too, can lead to overfitting due to the vast number of alternative predictor representations. So appropriate care must be taken to avoid overfitting during the predictor creation process.

The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice. In the end, we hope that these tools and our experience will help you generate better models.





1 Introduction

Whether the model will be used for inference or estimation (or in rare occasions, both), there are important characteristics to consider. Parsimony (or simplicity) is a key consideration. Simple models are generally preferable to complex models, especially when inference is the goal.

The problem, however, is that accuracy should not be seriously sacrificed for the sake of simplicity. A simple model might be easy to interpret, but it will not succeed if it does not maintain an acceptable level of faithfulness to the data; if a model is only 50% accurate, should it be used to make inferences or predictions? Complexity is usually the solution to poor accuracy. By using additional parameters or by using a model that is inherently nonlinear, we might improve accuracy, but interpretability will likely suffer greatly. This trade-off is a key consideration for model building.

The goal of this book is to help practitioners build better models by focusing on the predictors. “Better” depends on the context of the problem but most likely involves the following factors: accuracy, simplicity, and robustness.




1.1 Important Concepts

While models can overfit to the data points, such as with the housing data shown above, feature selection techniques can overfit to the predictors. This occurs when a variable appears relevant in the current data set but shows no real relationship with the outcome once new data are collected. The risk of this type of overfitting is especially dangerous when the number of data points, denoted as \(n\), is small and the number of potential predictors (\(p\)) is very large. As with overfitting to the data points, this problem can be mitigated using a methodology that will show a warning when this is occurring.
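The small-\(n\), large-\(p\) danger is easy to demonstrate: with many candidate predictors and few samples, some pure-noise predictor will look strongly related to the outcome by chance alone. A self-contained sketch (the sizes and seed are arbitrary):

```python
import random

random.seed(7)
n, p = 20, 500  # few samples, many candidate predictors

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Outcome and predictors are all independent noise: nothing is relevant.
outcome = [random.gauss(0, 1) for _ in range(n)]
noise_predictors = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]

# Yet the best-looking noise predictor shows a sizable correlation,
# which naive selection would mistake for signal.
best_abs_corr = max(abs(pearson(x, outcome)) for x in noise_predictors)
```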

Both supervised and unsupervised analyses are susceptible to overfitting, but supervised analyses are particularly inclined to discovering erroneous patterns in the data for predicting the outcome. In short, we can use these techniques to create a self-fulfilling predictive prophecy.



Models can also be evaluated in terms of variance and bias (Geman, Bienenstock, and Doursat 1992). A model has high variance if small changes to the underlying data used to estimate the parameters cause a sizable change in those parameters (or in the structure of the model). For example, the sample mean of a set of data points has higher variance than the sample median. The latter uses only the values in the center of the data distribution and, for this reason, it is insensitive to moderate changes in the values. A few examples of models with low variance are linear regression, logistic regression, and partial least squares. High-variance models include those that strongly rely on individual data points to define their parameters such as classification or regression trees, nearest neighbor models, and neural networks. To contrast low-variance and high-variance models, consider linear regression and, alternatively, nearest neighbor models. Linear regression uses all of the data to estimate slope parameters and, while it can be sensitive to outliers, it is much less sensitive than a nearest neighbor model.
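The mean-versus-median contrast can be illustrated numerically: perturbing a single extreme value moves the sample mean substantially while leaving the median untouched. A minimal sketch with made-up data:

```python
import statistics

data = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0]
perturbed = data[:-1] + [60.0]  # the largest point becomes an outlier

# The mean uses every value, so it shifts by a large amount...
mean_shift = abs(statistics.mean(perturbed) - statistics.mean(data))
# ...while the median, driven by the center of the distribution,
# does not move at all.
median_shift = abs(statistics.median(perturbed) - statistics.median(data))
```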

Model bias reflects the ability of a model to conform to the underlying theoretical structure of the data. A low-bias model is one that can be highly flexible and has the capacity to fit a variety of different shapes and patterns. A high-bias model would be unable to estimate values close to their true theoretical counterparts. Linear methods often have high bias since, without modification, they cannot describe nonlinear patterns in the predictor variables. Tree-based models, support vector machines, neural networks, and others can be very adaptable to the data and have low bias.

As one might expect, model bias and variance can often be in opposition to one another; in order to achieve low bias, models tend to demonstrate high variance (and vice versa). The variance-bias trade-off is a common theme in statistics. In many cases, models have parameters that control the flexibility of the model and thus affect the variance and bias properties of the results.

As previously described, simplicity is an important characteristic of a model. One method of creating a low-variance, low-bias model is to augment a low-variance model with appropriate representations of the data to decrease the bias.




1.2 A More Complex Example

The overall points that should be understood from this demonstration are:

  1. When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem. The process is more likely to be a campaign of trial and error to achieve the best results.
  2. The effect of feature sets can be much larger than the effect of different models.
  3. The interplay between models and features is complex and somewhat unpredictable.
  4. With the right set of predictors, it is common that many different types of models can achieve the same level of performance. Initially, the linear models had the worst performance but, in the end, showed some of the best performance.



1.3 Feature Selection

When conducting a search for a subset of variables, it is important to realize that there may not be a unique set of predictors that will produce the best performance. There is often a compensatory effect where, when one seemingly important variable is removed, the model adjusts using the remaining variables. This is especially true when there is some degree of correlation between the explanatory variables or when a low-bias model is used. For this reason, feature selection should not be used as a formal method of determining feature significance. More traditional inferential statistical approaches are a better solution for appraising the contribution of a predictor to the underlying model or to the data set.


2 Illustrative Example: Predicting Risk of Ischemic Stroke

2.1 Other Considerations

Our primary point in this short tour is to illustrate that spending a little more time (and sometimes a lot more time) investigating predictors and relationships among predictors can help to improve model predictivity. This is especially true when marginal gains in predictive performance can have significant benefits.


3 A Review of the Predictive Modeling Process

3.1 Measuring Performance

During the initial phase of model building, a good strategy for data sets with two classes is to focus on the AUC statistics from these curves instead of metrics based on hard class predictions. Once a reasonable model is found, the ROC or precision-recall curves can be carefully examined to find a reasonable cutoff for the data and then qualitative prediction metrics can be used.
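The appeal of the AUC here is that it depends only on the ranks of the predicted scores, not on any cutoff. A minimal sketch of computing it directly, via the probability that a random positive outranks a random negative (brute force over pairs, fine for small data):

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive sample is scored above a randomly chosen negative one,
    with ties counting one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that no class cutoff appears anywhere; choosing one only becomes necessary later, when hard class predictions are required.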


3.2 Data Splitting

There are a number of ways to split the data into training and testing sets. The most common approach is to use some version of random sampling. Completely random sampling is a straightforward strategy to implement and usually protects the process from being biased towards any characteristic of the data. However this approach can be problematic when the response is not evenly distributed across the outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome. For classification models, this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of the data.
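A stratified split for classification can be sketched in a few lines: sample within each class separately so that class frequencies are roughly preserved in both sets. The function name and defaults below are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=42):
    """Return (train_idx, test_idx), sampling at random within each
    class so the outcome distribution is similar in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)
```

For a numeric outcome, the same routine could be reused after binning the response into artificial strata (e.g., its quartiles) and passing the bin labels.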


3.3 Model Optimization and Tuning

When there are many tuning parameters associated with a model, there are several ways to proceed. First, a multidimensional grid search can be conducted, where a grid of candidate parameter combinations is evaluated. In some cases, this can be very inefficient. Another approach is to define a range of possible values for each parameter and to randomly sample the multidimensional space enough times to cover a reasonable amount (Bergstra and Bengio 2012). This random search grid can then be resampled in the same way as a more traditional grid. This procedure can be very beneficial when there are a large number of tuning parameters and there is no a priori notion of which values should be used. A large grid may be inefficient to search, especially if the profile has a fairly stable pattern with little change over some range of the parameter. Neural networks, gradient boosting machines, and other models can effectively be tuned using this approach.
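A random search can be sketched as follows. The parameter names and ranges are illustrative (loosely boosted-tree flavored), not prescribed by the text:

```python
import math
import random

# Hypothetical tuning ranges: each parameter gets an interval, and
# candidates are drawn at random from the joint space.
space = {
    "learning_rate": (0.001, 0.3),  # drawn on a log scale below
    "depth": (2, 10),               # integer-valued
    "subsample": (0.5, 1.0),
}

def sample_candidate(rng):
    lr_lo, lr_hi = space["learning_rate"]
    return {
        # Sampling the exponent uniformly spreads candidates evenly
        # across orders of magnitude.
        "learning_rate": 10 ** rng.uniform(math.log10(lr_lo), math.log10(lr_hi)),
        "depth": rng.randint(*space["depth"]),
        "subsample": rng.uniform(*space["subsample"]),
    }

rng = random.Random(0)
# Each candidate would then be resampled just like a point on a
# conventional grid.
candidates = [sample_candidate(rng) for _ in range(25)]
```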


4 Exploratory Visualizations

One of the first steps of the exploratory data process when the ultimate purpose is to predict a response is to create visualizations that help elucidate knowledge of the response and then to uncover relationships between the predictors and the response. Therefore our visualizations should start with the response, understanding the characteristics of its distribution, and then to build outward from that with the additional information provided in the predictors. Knowledge about the response can be gained by creating a histogram or box plot. This simple visualization will reveal the amount of variation in the response and if the response was generated by a process that has unusual characteristics that must be investigated further. Next, we can move on to exploring relationships among the predictors and between predictors and the response. Important characteristics can be identified by examining

  • scatter plots of individual predictors and the response,
  • a pairwise correlation plot among the predictors,
  • a projection of high-dimensional predictors into a lower dimensional space,
  • line plots for time-based predictors,
  • the first few levels of a regression or classification tree,
  • a heat map across the samples and predictors, or
  • mosaic plots for examining associations among categorical variables.



4.1 Visualizations for Numeric Data

It is possible to visualize five or six dimensions of data in a two-dimensional figure by using colors, shapes, and faceting. But almost any data set today contains many more than just a handful of variables. Being able to visualize many dimensions in the physical space that we can actually see is crucial to understanding the data and to understanding if there are characteristics of the data that point to the need for feature engineering. One way to condense many dimensions into just two or three is to use projection techniques such as principal components analysis (PCA), partial least squares (PLS), or multidimensional scaling (MDS).

Principal components analysis finds combinations of the variables that best summarize the variability in the original data (Dillon and Goldstein 1984). The combinations are a simpler representation of the data and often identify underlying characteristics within the data that will help guide the feature engineering process.
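For just two predictors, the leading principal component can be computed in closed form from the 2×2 sample covariance matrix. A toy sketch with made-up data lying mostly along the diagonal, so the first principal axis should point near (1, 1)/√2:

```python
import math

# Toy 2-D data, strongly correlated along y = x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample variances and covariance (denominator n - 1).
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Orientation of the leading eigenvector of the covariance matrix:
# the major-axis angle of a 2x2 symmetric matrix [[sxx, sxy], [sxy, syy]].
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))  # unit-length first component
```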



4.2 Visualizations for Categorical Data

The point of this discussion is not that summary statistics with confidence intervals are always the solution to a visualization problem. The takeaway message is that each graph should have a clearly defined hypothesis and that this hypothesis is shown concisely in a way that allows the reader to make quick and informative judgments based on the data.


4.3 Summary

Advanced predictive modeling and machine learning techniques offer the allure of being able to extract complex relationships between predictors and the response with little effort by the analyst. This hands-off approach to modeling will only put the analyst at a disadvantage. Spending time visualizing the response, predictors, relationships among the predictors, and relationships between predictors and the response can only lead to better understandings of the data. Moreover, this knowledge may provide crucial insights as to what features may be missing in the data and may need to be included to improve a model’s predictive performance.


5 Encoding Categorical Predictors

5.1 Encoding Predictors with Many Categories

One potential issue is that resampling might exclude some of the rarer categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that will cause an error. Moreover, the model will not be able to provide a relevant prediction for new samples that contain this predictor. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.

The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach but it may be difficult to know a priori what terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side-effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.
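A sketch of this first approach: build the full dummy set over all known levels, then drop any column with a single unique value (the zero-variance columns produced when a rare level is absent from the current resample). The helper name and data are illustrative:

```python
def dummy_encode(values, levels):
    """One 0/1 column per known level (the full dummy-variable set)."""
    return {lvl: [int(v == lvl) for v in values] for lvl in levels}

# Level "C" exists in the full data but is absent from this resample,
# so its dummy column is all zeros: a zero-variance predictor.
analysis_set = ["A", "B", "A", "B", "A"]
cols = dummy_encode(analysis_set, levels=["A", "B", "C"])

# Keep only columns that show some variation.
kept = {name: col for name, col in cols.items() if len(set(col)) > 1}
```

Because the filter is applied within each resample, different resamples may keep different columns, which is the varying-parameter-count side effect described above.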


