Feature Engineering: The Skill That Matters More Than Algorithm Choice — Encoding, Scaling, Creating Features, Handling Dates, Text Features, and Feature Selection

Here is a secret that no ML tutorial tells you upfront: the algorithm you choose matters far less than the features you give it. XGBoost with bad features loses to Logistic Regression with great features. Every time.

In every previous post, we used clean, pre-built DataFrames. The credit scores were already numbers. The categories were already separate columns. The dates were already formatted. In the real world, none of that is true. You get raw data — customer names mixed with IDs, dates as strings, categories as text, prices in different currencies, and missing values everywhere.

Feature engineering is the process of turning raw data into features that a machine learning model can actually learn from. It is 80% of the work in a real ML project and the single biggest determinant of model quality.

Think of features as ingredients in a recipe. The algorithm is the oven. You can have the best oven in the world (XGBoost), but if you put in raw, unwashed, uncut ingredients (bad features), the meal will be terrible. Feature engineering is the prep work — washing, cutting, seasoning, marinating — that makes the difference between a mediocre dish and a restaurant-quality meal. Same oven, different prep, completely different result.

What Is Feature Engineering?
Why Features Matter More Than Algorithms
The Feature Engineering Workflow
Part 1: Handling Raw Data Types
Encoding Categorical Variables
Label Encoding
One-Hot Encoding
Ordinal Encoding
Target Encoding (Mean Encoding)
When to Use Which Encoding
Scaling Numerical Features
StandardScaler (Z-Score)
MinMaxScaler (0 to 1)
RobustScaler (Handles Outliers)
When to Use Which Scaler
Part 2: Creating New Features
Date and Time Features
Extracting Components (Year, Month, Day, Hour)
Cyclical Encoding (Day of Week, Month)
Time Since Events (Recency)
Is Weekend, Is Holiday, Is Business Hours
Mathematical Features
Ratios and Proportions
Differences and Changes
Aggregated Statistics (Mean, Max, Count)
Interaction Features
Binning (Discretization)
Equal-Width vs Equal-Frequency
Domain-Specific Bins
Text Features
Text Length, Word Count
Contains Keyword (Binary)
TF-IDF for Text Classification
Part 3: Handling Missing Values
Drop vs Impute Decision
Mean, Median, Mode Imputation
Indicator Column Strategy
Part 4: Handling Outliers
Detection (IQR, Z-Score)
Treatment (Cap, Transform, Remove)
Part 5: Feature Selection
Filter Methods (Correlation, Variance)
Wrapper Methods (Recursive Feature Elimination)
Embedded Methods (XGBoost Feature Importance)
Dropping Low-Importance Features
Real-World Scenario 1: Credit Risk (Banking)
Real-World Scenario 2: Customer Churn (Telecom)
Real-World Scenario 3: Demand Forecasting (Retail)
Real-World Scenario 4: Fraud Detection (Payments)
The Complete Feature Engineering Pipeline (Code)
Common Mistakes
Interview Questions
Wrapping Up

What Is Feature Engineering?

Raw data:
  customer_name: "Naveen Vuppula"
  signup_date: "2024-01-15"
  last_purchase: "2026-05-20"
  total_orders: 47
  total_spent: 12500.00
  country: "Canada"
  plan: "Premium"

After feature engineering:
  name_length: 15                          ← Text feature
  months_since_signup: 28                  ← Date feature (recency)
  days_since_last_purchase: 18             ← Date feature (recency)
  avg_order_value: 265.96                  ← Ratio (total_spent / total_orders)
  orders_per_month: 1.68                   ← Ratio (total_orders / months_since_signup)
  is_premium: 1                            ← Binary encoding
  country_Canada: 1                        ← One-hot encoding
  is_high_value: 1                         ← Binned (total_spent > 10000)
  signup_month: 1                          ← Extracted from date
  signup_dayofweek: 0                      ← Extracted from date (Monday)

The model never sees "Naveen" or "2024-01-15" — it sees NUMBERS that capture the MEANING.

Why Features Matter More Than Algorithms

Experiment (same data, same test set):

Logistic Regression + GREAT features:
  → avg_order_value, orders_per_month, days_since_last_purchase, is_premium
  → Accuracy: 88%, F1: 0.86

XGBoost + RAW features:
  → total_orders, total_spent, signup_date (as string), plan (as string)
  → Accuracy: 82%, F1: 0.79

Simple model + great features BEATS complex model + raw features.
The features carry the signal. The algorithm amplifies it.

The Feature Engineering Workflow

Step 1: Understand the data (EDA — exploratory data analysis)
Step 2: Handle missing values (drop or impute)
Step 3: Encode categorical variables (text → numbers)
Step 4: Create new features (ratios, dates, interactions)
Step 5: Scale numerical features (if needed for the algorithm)
Step 6: Handle outliers (cap or transform)
Step 7: Select the best features (drop low-importance ones)
Step 8: Train model → evaluate → iterate on features

Part 1: Handling Raw Data Types

Encoding Categorical Variables

Models need NUMBERS. Categorical columns (text) must be converted.

Label Encoding

Assign a unique integer to each category:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['plan_encoded'] = le.fit_transform(df['plan'])

# plan        → plan_encoded
# "Basic"     → 0
# "Premium"   → 1
# "Enterprise"→ 2

Use when: Ordinal categories (where order matters) OR tree-based models (XGBoost, Random Forest — they handle numeric categories well).

Do NOT use when: Linear/Logistic Regression — it would interpret “Enterprise=2” as twice “Premium=1,” which is meaningless.

One-Hot Encoding

Create a separate binary column for each category:

df_encoded = pd.get_dummies(df, columns=['plan'], prefix='plan', drop_first=True)

# Original: plan = "Premium"
# After:    plan_Premium = 1, plan_Enterprise = 0

# Original: plan = "Enterprise"
# After:    plan_Premium = 0, plan_Enterprise = 1

# Original: plan = "Basic" (dropped — the baseline)
# After:    plan_Premium = 0, plan_Enterprise = 0

Use when: Nominal categories (no order — colors, countries, product types) AND Linear/Logistic Regression.

Watch out: High-cardinality columns (1000 cities → 1000 columns). Use target encoding instead.

Ordinal Encoding

Map categories to integers reflecting their ORDER:

from sklearn.preprocessing import OrdinalEncoder

education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order)
df['education_encoded'] = oe.fit_transform(df[['education']])

# High School → 0, Bachelor → 1, Master → 2, PhD → 3
# The ORDER is meaningful: PhD > Master > Bachelor > High School

Use when: Categories have a natural order (education level, satisfaction rating, size S/M/L/XL).

Target Encoding (Mean Encoding)

Replace each category with the MEAN of the target for that category:

# For each city, calculate the average churn rate
city_means = df.groupby('city')['churned'].mean()
df['city_encoded'] = df['city'].map(city_means)

# city          → city_encoded (avg churn rate)
# "Toronto"     → 0.15 (15% of Toronto customers churned)
# "Vancouver"   → 0.22 (22% of Vancouver customers churned)
# "Montreal"    → 0.18

Use when: High-cardinality categorical features (hundreds of cities, thousands of product IDs). Reduces to one column instead of hundreds.

Watch out: Data leakage — calculate on training set only, apply to test set. Use cross-validation encoding in practice.

When to Use Which Encoding

Encoding	Best For	Columns Created	Models
Label	Ordinal + tree models	1	XGBoost, Random Forest
One-Hot	Nominal, low cardinality (< 20 categories)	N-1	Linear, Logistic, Neural Nets
Ordinal	Ordered categories	1	All models
Target	High cardinality (100+ categories)	1	All models (with care)

Scaling Numerical Features

StandardScaler (Z-Score)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['income', 'age', 'debt']] = scaler.fit_transform(df[['income', 'age', 'debt']])

# Before: income = 75000, age = 35, debt = 15000
# After:  income = 0.52, age = -0.31, debt = -0.45
# Mean = 0, Std = 1 for each column

Use when: Logistic Regression, Linear Regression, Neural Networks, SVM, KNN — algorithms that use DISTANCE or GRADIENT.

MinMaxScaler (0 to 1)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['income', 'age']] = scaler.fit_transform(df[['income', 'age']])

# Before: income ranges 20000-150000
# After:  income ranges 0.0-1.0

Use when: Neural Networks, algorithms that expect bounded input.

RobustScaler (Handles Outliers)

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # Uses median and IQR instead of mean and std
df[['income']] = scaler.fit_transform(df[['income']])

# Outliers (income = $10M) do not skew the scaling

When to Use Which Scaler

Scaler	Best For	Handles Outliers	Range
StandardScaler	General purpose, Logistic/Linear Reg	❌	~(-3, +3)
MinMaxScaler	Neural Networks, bounded input	❌	(0, 1)
RobustScaler	Data with outliers	✅	Variable
None	XGBoost, Random Forest, Decision Trees	N/A	Trees do not need scaling

Critical: Tree-based models (XGBoost, Random Forest) do NOT need scaling. They split on thresholds, not distances. Scaling wastes time and adds no benefit.

Part 2: Creating New Features

Date and Time Features

Extracting Components (Year, Month, Day, Hour)

df['signup_date'] = pd.to_datetime(df['signup_date'])

df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_day'] = df['signup_date'].dt.day
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek    # Monday=0, Sunday=6
df['signup_hour'] = df['signup_date'].dt.hour               # For datetime columns
df['signup_quarter'] = df['signup_date'].dt.quarter
df['signup_weekofyear'] = df['signup_date'].dt.isocalendar().week

Cyclical Encoding (Day of Week, Month)

Day 7 (Sunday) is close to Day 1 (Monday), but numerically they are far apart (7 vs 1). Cyclical encoding fixes this:

import numpy as np

# Month (1-12) → sin/cos encoding
df['month_sin'] = np.sin(2 * np.pi * df['signup_month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['signup_month'] / 12)

# Day of week (0-6) → sin/cos encoding
df['dow_sin'] = np.sin(2 * np.pi * df['signup_dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['signup_dayofweek'] / 7)

# Now December (12) and January (1) are close in the feature space

Time Since Events (Recency)

today = pd.Timestamp.now()

df['days_since_signup'] = (today - df['signup_date']).dt.days
df['days_since_last_purchase'] = (today - df['last_purchase_date']).dt.days
df['days_since_last_login'] = (today - df['last_login_date']).dt.days

# Months instead of days (less noisy)
df['months_since_signup'] = df['days_since_signup'] / 30.44

Is Weekend, Is Holiday, Is Business Hours

df['is_weekend'] = df['signup_date'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_month_start'] = df['signup_date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['signup_date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['signup_date'].dt.is_quarter_end.astype(int)

# Business hours (9 AM - 5 PM on weekdays)
df['is_business_hours'] = (
    (df['event_time'].dt.hour.between(9, 16)) &
    (~df['event_time'].dt.dayofweek.isin([5, 6]))
).astype(int)

Mathematical Features

Ratios and Proportions

# Revenue per order (intensity)
df['avg_order_value'] = df['total_revenue'] / df['total_orders'].replace(0, 1)

# Orders per month (velocity)
df['orders_per_month'] = df['total_orders'] / df['months_since_signup'].replace(0, 1)

# Debt to income ratio (risk indicator)
df['debt_to_income'] = df['total_debt'] / df['annual_income'].replace(0, 1)

# Credit utilization
df['credit_utilization'] = df['credit_balance'] / df['credit_limit'].replace(0, 1)

# Return rate
df['return_rate'] = df['returns'] / df['total_orders'].replace(0, 1)

Differences and Changes

# Revenue growth
df['revenue_change'] = df['revenue_this_month'] - df['revenue_last_month']
df['revenue_growth_pct'] = df['revenue_change'] / df['revenue_last_month'].replace(0, 1)

# Price difference from average
df['price_vs_avg'] = df['price'] - df['category_avg_price']
df['price_premium_pct'] = (df['price'] - df['category_avg_price']) / df['category_avg_price'].replace(0, 1)

Aggregated Statistics (Mean, Max, Count)

# Customer-level aggregations from transaction data
customer_features = transactions.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_spent=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    max_order_value=('amount', 'max'),
    min_order_value=('amount', 'min'),
    std_order_value=('amount', 'std'),
    unique_products=('product_id', 'nunique'),
    unique_categories=('category', 'nunique'),
    first_order=('order_date', 'min'),
    last_order=('order_date', 'max'),
).reset_index()

# Merge back to customer DataFrame
df = df.merge(customer_features, on='customer_id', how='left')

Interaction Features

# Multiply two features (captures non-linear interactions)
df['income_x_credit'] = df['income'] * df['credit_score']
df['age_x_tenure'] = df['age'] * df['employment_years']
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# Polynomial features (automated)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(df[['income', 'credit_score', 'age']])

Binning (Discretization)

Equal-Width vs Equal-Frequency

# EQUAL-WIDTH BINS: same range per bin (0-20, 20-40, 40-60, ...)
# Use pd.cut — divides the value RANGE into equal intervals
df['age_equal_width'] = pd.cut(df['age'], bins=5)
# Bins: (18-28], (28-38], (38-48], (48-58], (58-68]
# Problem: if most people are 25-35, those bins have many records and others are empty

# EQUAL-FREQUENCY BINS: same NUMBER of records per bin
# Use pd.qcut — divides by quantiles so each bin has ~equal count
df['age_equal_freq'] = pd.qcut(df['age'], q=5, labels=['Q1','Q2','Q3','Q4','Q5'])
# Each bin has exactly 20% of the records
# Better for skewed distributions (income, transaction amounts)

# When to use which:
# Equal-width:     uniform distributions, domain-specific ranges (credit tiers)
# Equal-frequency: skewed distributions (income, prices), ensures no empty bins

Domain-Specific Bins

# Age groups
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 35, 45, 55, 65, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'])

# Income tiers
df['income_tier'] = pd.cut(df['income'],
    bins=[0, 30000, 60000, 100000, 200000, float('inf')],
    labels=['Low', 'Medium', 'High', 'Very High', 'Ultra High'])

# Credit score ranges (industry standard)
df['credit_tier'] = pd.cut(df['credit_score'],
    bins=[300, 580, 670, 740, 800, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'])

Text Features

Text Length, Word Count

df['name_length'] = df['name'].str.len()
df['name_word_count'] = df['name'].str.split().str.len()
df['has_middle_name'] = (df['name'].str.split().str.len() > 2).astype(int)
df['email_domain'] = df['email'].str.split('@').str[1]
df['is_gmail'] = (df['email_domain'] == 'gmail.com').astype(int)
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()

Contains Keyword (Binary)

df['mentions_urgent'] = df['message'].str.contains('urgent|asap|emergency', case=False, regex=True).astype(int)
df['mentions_cancel'] = df['message'].str.contains('cancel|refund|return', case=False, regex=True).astype(int)

TF-IDF for Text Classification

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['review_text'])

# Convert to DataFrame and merge
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df = pd.concat([df, tfidf_df], axis=1)

Part 3: Handling Missing Values

Drop vs Impute Decision

# Step 1: Check the extent of missing data
print(df.isnull().sum())
print(f"Missing %:\n{df.isnull().mean() * 100}")

# Decision framework:
# < 5% missing rows  → DROP the rows (minimal data loss)
# > 50% missing column → DROP the column (too sparse to be useful)
# 5-50% missing → IMPUTE (fill with a reasonable value)
# Missingness is systematic → CREATE an indicator column (the gap IS a signal)

# Drop rows (if < 5% missing in a critical column)
df_clean = df.dropna(subset=['critical_column'])

# Drop columns (if > 50% missing)
cols_to_drop = df.columns[df.isnull().mean() > 0.5]
df = df.drop(columns=cols_to_drop)
print(f"Dropped sparse columns: {list(cols_to_drop)}")

Mean, Median, Mode Imputation

from sklearn.impute import SimpleImputer

# Median imputation (numerical — robust to outliers, preferred over mean)
num_imputer = SimpleImputer(strategy='median')
df[['income', 'age']] = num_imputer.fit_transform(df[['income', 'age']])

# Mean imputation (numerical — sensitive to outliers)
mean_imputer = SimpleImputer(strategy='mean')
df[['credit_score']] = mean_imputer.fit_transform(df[['credit_score']])

# Mode imputation (categorical — fill with most frequent value)
df['city'].fillna(df['city'].mode()[0], inplace=True)

# When to use which:
# Median: default for numerical (robust to outliers like income $10M)
# Mean: when data is symmetric with no outliers
# Mode: for categorical features (fill with the most common category)

Indicator Column Strategy

# The FACT that a value is missing is often a feature itself
# Create a binary flag BEFORE imputing

df['income_missing'] = df['income'].isnull().astype(int)
df['income'].fillna(df['income'].median(), inplace=True)

df['employment_years_missing'] = df['employment_years'].isnull().astype(int)
df['employment_years'].fillna(df['employment_years'].median(), inplace=True)

# Now the model gets BOTH:
# 1. The imputed value (reasonable estimate)
# 2. The flag that it WAS missing (the signal)

# Why this is powerful:
# In credit scoring, a missing income field often means the applicant
# did NOT disclose it — which itself is a risk signal.
# The income_missing feature captures this pattern.
# XGBoost can learn: "income_missing=1 → higher default probability"

Part 4: Handling Outliers

Detection (IQR, Z-Score)

# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['income'] < lower) | (df['income'] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['income'].dropna()))
outliers = df[z_scores > 3]  # More than 3 standard deviations

Treatment (Cap, Transform, Remove)

# Cap at percentiles (winsorize)
df['income_capped'] = df['income'].clip(lower=df['income'].quantile(0.01),
                                          upper=df['income'].quantile(0.99))

# Log transform (reduces skew, compresses outliers)
df['income_log'] = np.log1p(df['income'])  # log(1 + x) handles zeros

# Remove (only when outliers are clearly data errors)
df = df[df['age'] > 0]        # Negative age = data error
df = df[df['age'] < 120]      # 120+ years = data error

# Leave outliers for tree models (XGBoost handles them naturally)
# Cap outliers for linear models (they are sensitive to extreme values)

Part 5: Feature Selection

Filter Methods (Correlation, Variance)

# Remove features highly correlated with each other (redundant)
corr_matrix = df.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_cols = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.90)]
print(f"Dropping highly correlated: {high_corr_cols}")
df = df.drop(columns=high_corr_cols)

# Remove low-variance features (nearly constant)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  # Remove features with < 1% variance
X_selected = selector.fit_transform(X)

Wrapper Methods (Recursive Feature Elimination)

from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, random_state=42)
rfe = RFE(model, n_features_to_select=15)  # Keep top 15 features
rfe.fit(X_train, y_train)

selected_features = X_train.columns[rfe.support_]
print(f"Selected features: {list(selected_features)}")

Embedded Methods (XGBoost Feature Importance)

import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=200, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Get feature importance
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 15 features:")
for _, row in importance.head(15).iterrows():
    bar = '█' * int(row['importance'] * 100)
    print(f"  {row['feature']:30s}: {row['importance']:.4f} {bar}")

Dropping Low-Importance Features

# After training XGBoost, identify and remove features that add noise

# Step 1: Find features with near-zero importance
low_importance = importance[importance['importance'] < 0.01]['feature'].tolist()
print(f"Low-importance features ({len(low_importance)}): {low_importance}")

# Step 2: Drop them
X_train_slim = X_train.drop(columns=low_importance)
X_test_slim = X_test.drop(columns=low_importance)

# Step 3: Retrain and compare
model_slim = xgb.XGBClassifier(n_estimators=200, max_depth=5, random_state=42)
model_slim.fit(X_train_slim, y_train)

from sklearn.metrics import f1_score
f1_all = f1_score(y_test, model.predict(X_test))
f1_slim = f1_score(y_test, model_slim.predict(X_test_slim))
print(f"F1 with all {len(X_train.columns)} features:  {f1_all:.4f}")
print(f"F1 with top {len(X_train_slim.columns)} features: {f1_slim:.4f}")

# Typical result: same or BETTER F1 with fewer features
# Fewer features = faster training, simpler model, less overfitting

The 80/20 rule of features: In most datasets, 20% of features carry 80% of the predictive power. Dropping the bottom 80% often maintains accuracy while cutting training time and complexity significantly.

Real-World Scenario 1: Credit Risk (Banking)

# Raw data: customer_id, income, age, employment_years, loan_amount, credit_score, city, education
# Target: default (0/1)

# Engineered features:
df['debt_to_income'] = df['loan_amount'] / df['income'].replace(0, 1)
df['income_per_year_employed'] = df['income'] / df['employment_years'].replace(0, 1)
df['loan_to_credit_ratio'] = df['loan_amount'] / (df['credit_score'] * 100)
df['credit_tier'] = pd.cut(df['credit_score'], bins=[300,580,670,740,800,850],
                            labels=['Poor','Fair','Good','VGood','Excellent'])
df['is_high_risk_city'] = df['city'].isin(['CityA', 'CityB']).astype(int)
df['education_encoded'] = df['education'].map({'High School':0,'Bachelor':1,'Master':2,'PhD':3})
df['income_x_credit'] = df['income'] * df['credit_score']

# Result: F1 improved from 0.72 (raw features) to 0.86 (engineered features)

Real-World Scenario 2: Customer Churn (Telecom)

# Engineered features from usage data:
df['calls_per_month'] = df['total_calls'] / df['tenure_months'].replace(0, 1)
df['avg_call_duration'] = df['total_minutes'] / df['total_calls'].replace(0, 1)
df['data_per_month'] = df['total_data_gb'] / df['tenure_months'].replace(0, 1)
df['support_calls_ratio'] = df['support_calls'] / df['total_calls'].replace(0, 1)
df['days_since_last_call'] = (today - df['last_call_date']).dt.days
df['contract_remaining_months'] = (df['contract_end'] - today).dt.days / 30.44
df['is_month_to_month'] = (df['contract_type'] == 'month-to-month').astype(int)
df['overage_count'] = df['times_over_limit']
df['payment_delay_avg'] = df['total_late_days'] / df['total_invoices'].replace(0, 1)

# The features that matter most for churn:
# 1. days_since_last_call (recency)
# 2. contract_remaining_months (commitment)
# 3. support_calls_ratio (frustration signal)
# 4. payment_delay_avg (financial stress)

Real-World Scenario 3: Demand Forecasting (Retail)

# Date-based features (most important for forecasting):
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_payday'] = df['date'].dt.day.isin([1, 15]).astype(int)
df['quarter'] = df['date'].dt.quarter
df['week_of_year'] = df['date'].dt.isocalendar().week

# Lag features (what happened in the past predicts the future):
df['sales_lag_1d'] = df.groupby('product_id')['sales'].shift(1)
df['sales_lag_7d'] = df.groupby('product_id')['sales'].shift(7)
df['sales_lag_30d'] = df.groupby('product_id')['sales'].shift(30)

# Rolling averages:
df['sales_rolling_7d'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(7, min_periods=1).mean())
df['sales_rolling_30d'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(30, min_periods=1).mean())

# Trend:
df['sales_trend_7d'] = df['sales_lag_1d'] - df['sales_lag_7d']  # Rising or falling?

Real-World Scenario 4: Fraud Detection (Payments)

# Transaction-level features:
df['amount_vs_avg'] = df['amount'] / df.groupby('customer_id')['amount'].transform('mean')
df['is_high_amount'] = (df['amount'] > df['amount'].quantile(0.99)).astype(int)
df['is_round_amount'] = (df['amount'] % 100 == 0).astype(int)
df['hour_of_day'] = df['timestamp'].dt.hour
df['is_night'] = df['hour_of_day'].between(0, 5).astype(int)

# Velocity features (how fast is the customer transacting):
df['txn_count_1h'] = df.groupby('customer_id')['timestamp'].transform(
    lambda x: x.rolling('1H').count())
df['txn_count_24h'] = df.groupby('customer_id')['timestamp'].transform(
    lambda x: x.rolling('24H').count())

# Geographic anomaly:
df['is_new_country'] = (df['country'] != df.groupby('customer_id')['country'].shift(1)).astype(int)

# The features that catch fraud:
# 1. amount_vs_avg (unusually large transaction for this customer)
# 2. txn_count_1h (velocity spike — many transactions in a short time)
# 3. is_new_country (geographic anomaly)
# 4. is_night (transactions at unusual hours)

The Complete Feature Engineering Pipeline (Code)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import xgboost as xgb
from sklearn.metrics import f1_score, classification_report

# Step 1: Load data
df = pd.read_csv("customer_data.csv")

# Step 2: Handle missing values
df['income_missing'] = df['income'].isnull().astype(int)
num_imputer = SimpleImputer(strategy='median')
df[['income', 'age', 'credit_score']] = num_imputer.fit_transform(
    df[['income', 'age', 'credit_score']])
df['city'].fillna('Unknown', inplace=True)

# Step 3: Encode categoricals
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
df['education_encoded'] = df['education'].map(
    {'High School':0, 'Bachelor':1, 'Master':2, 'PhD':3})
df = pd.get_dummies(df, columns=['plan'], drop_first=True)

# Step 4: Create new features
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['avg_order_value'] = df['total_spent'] / df['total_orders'].replace(0, 1)
df['orders_per_month'] = df['total_orders'] / (df['days_since_signup'] / 30.44).replace(0, 1)
df['debt_to_income'] = df['debt'] / df['income'].replace(0, 1)
df['income_x_credit'] = df['income'] * df['credit_score']

# Step 5: Handle outliers (cap at 1st/99th percentile)
for col in ['income', 'total_spent', 'debt']:
    df[col] = df[col].clip(lower=df[col].quantile(0.01), upper=df[col].quantile(0.99))

# Step 6: Select features
feature_cols = ['income', 'age', 'credit_score', 'days_since_signup',
                'avg_order_value', 'orders_per_month', 'debt_to_income',
                'income_x_credit', 'income_missing', 'education_encoded',
                'city_encoded'] + [c for c in df.columns if c.startswith('plan_')]

X = df[feature_cols]
y = df['churned']

# Step 7: Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=42, stratify=y)

# Step 8: Train (XGBoost — no scaling needed)
model = xgb.XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                            random_state=42, scale_pos_weight=len(y[y==0])/len(y[y==1]))
model.fit(X_train, y_train)

# Step 9: Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Step 10: Feature importance (iterate — drop low importance, re-train)
importance = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nFeature Importance:")
for feat, imp in importance.items():
    print(f"  {feat:30s}: {imp:.4f} {'█' * int(imp * 100)}")

Common Mistakes

Scaling features for XGBoost/Random Forest — tree-based models do not need scaling. Scaling wastes compute and adds no benefit. Scale only for Linear/Logistic Regression, SVM, KNN, Neural Networks.
One-hot encoding high-cardinality features — 10,000 cities → 10,000 columns. Use target encoding or label encoding instead.
Data leakage through target encoding — calculating the target mean on the FULL dataset (including test set) leaks information. Always calculate on training set only, then apply to test set.
Not creating recency features from dates — raw dates are useless to a model. days_since_last_purchase is one of the most predictive features in customer analytics.
Ignoring missing value indicators — the FACT that a value is missing is often a feature itself. Create column_missing binary indicators before imputing.
Feature engineering AFTER splitting — fit scalers and encoders on TRAINING data only. Apply (transform) to test data. Fitting on the full dataset leaks test information.
Too many features without selection — 500 features create noise. Use feature importance, correlation analysis, or RFE to select the top 20-50 that matter.

Interview Questions

Q: Why does feature engineering matter more than algorithm choice? A: Features carry the signal that algorithms amplify. XGBoost with raw features (total_orders, total_spent as-is) cannot discover that avg_order_value = total_spent / total_orders is the real predictor. By creating meaningful features (ratios, recency, interactions), you give ANY algorithm the signal it needs. Simple models with great features consistently beat complex models with raw features.

Q: When do you use one-hot encoding vs label encoding? A: One-hot for nominal categories (no order — colors, cities) with low cardinality (< 20 categories), especially for linear models. Label encoding for ordinal categories (education level, rating) or when using tree-based models (which handle numeric categories naturally). Target encoding for high-cardinality features (1000+ categories).

Q: How do you handle missing values? A: Three strategies: drop rows (if < 5% missing and not systematic), impute with median/mode (fills gaps), or create a missing indicator column (captures the signal of missingness) then impute. The indicator column strategy is powerful because missingness itself can be predictive — a missing income field in credit scoring signals risk.

Q: What are lag features and why are they important for time series? A: Lag features are past values of the target used as input features. sales_lag_7d is the sales value from 7 days ago. They capture temporal patterns: “what happened last week predicts this week.” Combined with rolling averages (7-day, 30-day mean), they are the most important features in demand forecasting, financial prediction, and any time-dependent model.

Q: How do you prevent data leakage in feature engineering? A: Fit all transformations (scalers, encoders, imputers) on TRAINING data only. Apply (transform) to test data using the training-set parameters. For target encoding, calculate means on training folds only (use cross-validation encoding). For time series, never use future data to create features — all lag and rolling features must look backward only.

Wrapping Up

Feature engineering is where domain knowledge meets data science. The algorithm is the engine, but features are the fuel. Raw data is crude oil — useful but not usable. Feature engineering refines it into high-octane fuel that powers accurate predictions.

The four highest-impact feature types: recency features (days since last event), ratio features (value per unit), aggregation features (customer-level statistics from transaction-level data), and interaction features (combinations that capture non-linear relationships). Master these four and you cover 80% of real-world feature engineering.

← Previous: Model Evaluation Deep Dive AI/ML (6/9) Next: Fine-Tuning LLMs →

Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Table of Contents