Feature Engineering: The Skill That Matters More Than Algorithm Choice — Encoding, Scaling, Creating Features, Handling Dates, Text Features, and Feature Selection

Feature Engineering: The Skill That Matters More Than Algorithm Choice — Encoding, Scaling, Creating Features, Handling Dates, Text Features, and Feature Selection

Here is a secret that no ML tutorial tells you upfront: the algorithm you choose matters far less than the features you give it. XGBoost with bad features loses to Logistic Regression with great features. Every time.

In every previous post, we used clean, pre-built DataFrames. The credit scores were already numbers. The categories were already separate columns. The dates were already formatted. In the real world, none of that is true. You get raw data — customer names mixed with IDs, dates as strings, categories as text, prices in different currencies, and missing values everywhere.

Feature engineering is the process of turning raw data into features that a machine learning model can actually learn from. It is 80% of the work in a real ML project and the single biggest determinant of model quality.

Think of features as ingredients in a recipe. The algorithm is the oven. You can have the best oven in the world (XGBoost), but if you put in raw, unwashed, uncut ingredients (bad features), the meal will be terrible. Feature engineering is the prep work — washing, cutting, seasoning, marinating — that makes the difference between a mediocre dish and a restaurant-quality meal. Same oven, different prep, completely different result.

Table of Contents

  • What Is Feature Engineering?
  • Why Features Matter More Than Algorithms
  • The Feature Engineering Workflow
  • Part 1: Handling Raw Data Types
  • Encoding Categorical Variables
  • Label Encoding
  • One-Hot Encoding
  • Ordinal Encoding
  • Target Encoding (Mean Encoding)
  • When to Use Which Encoding
  • Scaling Numerical Features
  • StandardScaler (Z-Score)
  • MinMaxScaler (0 to 1)
  • RobustScaler (Handles Outliers)
  • When to Use Which Scaler
  • Part 2: Creating New Features
  • Date and Time Features
  • Extracting Components (Year, Month, Day, Hour)
  • Cyclical Encoding (Day of Week, Month)
  • Time Since Events (Recency)
  • Is Weekend, Is Holiday, Is Business Hours
  • Mathematical Features
  • Ratios and Proportions
  • Differences and Changes
  • Aggregated Statistics (Mean, Max, Count)
  • Interaction Features
  • Binning (Discretization)
  • Equal-Width vs Equal-Frequency
  • Domain-Specific Bins
  • Text Features
  • Text Length, Word Count
  • Contains Keyword (Binary)
  • TF-IDF for Text Classification
  • Part 3: Handling Missing Values
  • Drop vs Impute Decision
  • Mean, Median, Mode Imputation
  • Indicator Column Strategy
  • Part 4: Handling Outliers
  • Detection (IQR, Z-Score)
  • Treatment (Cap, Transform, Remove)
  • Part 5: Feature Selection
  • Filter Methods (Correlation, Variance)
  • Wrapper Methods (Recursive Feature Elimination)
  • Embedded Methods (XGBoost Feature Importance)
  • Dropping Low-Importance Features
  • Real-World Scenario 1: Credit Risk (Banking)
  • Real-World Scenario 2: Customer Churn (Telecom)
  • Real-World Scenario 3: Demand Forecasting (Retail)
  • Real-World Scenario 4: Fraud Detection (Payments)
  • The Complete Feature Engineering Pipeline (Code)
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is Feature Engineering?

Raw data:
  customer_name: "Naveen Vuppula"
  signup_date: "2024-01-15"
  last_purchase: "2026-05-20"
  total_orders: 47
  total_spent: 12500.00
  country: "Canada"
  plan: "Premium"

After feature engineering:
  name_length: 15                          ← Text feature
  months_since_signup: 28                  ← Date feature (recency)
  days_since_last_purchase: 18             ← Date feature (recency)
  avg_order_value: 265.96                  ← Ratio (total_spent / total_orders)
  orders_per_month: 1.68                   ← Ratio (total_orders / months_since_signup)
  is_premium: 1                            ← Binary encoding
  country_Canada: 1                        ← One-hot encoding
  is_high_value: 1                         ← Binned (total_spent > 10000)
  signup_month: 1                          ← Extracted from date
  signup_dayofweek: 0                      ← Extracted from date (Monday)

The model never sees "Naveen" or "2024-01-15" — it sees NUMBERS that capture the MEANING.

Why Features Matter More Than Algorithms

Experiment (same data, same test set):

Logistic Regression + GREAT features:
  → avg_order_value, orders_per_month, days_since_last_purchase, is_premium
  → Accuracy: 88%, F1: 0.86

XGBoost + RAW features:
  → total_orders, total_spent, signup_date (as string), plan (as string)
  → Accuracy: 82%, F1: 0.79

Simple model + great features BEATS complex model + raw features.
The features carry the signal. The algorithm amplifies it.

The Feature Engineering Workflow

Step 1: Understand the data (EDA — exploratory data analysis)
Step 2: Handle missing values (drop or impute)
Step 3: Encode categorical variables (text → numbers)
Step 4: Create new features (ratios, dates, interactions)
Step 5: Scale numerical features (if needed for the algorithm)
Step 6: Handle outliers (cap or transform)
Step 7: Select the best features (drop low-importance ones)
Step 8: Train model → evaluate → iterate on features

Part 1: Handling Raw Data Types

Encoding Categorical Variables

Models need NUMBERS. Categorical columns (text) must be converted.

Label Encoding

Assign a unique integer to each category:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['plan_encoded'] = le.fit_transform(df['plan'])

# plan        → plan_encoded
# "Basic"     → 0
# "Premium"   → 1
# "Enterprise"→ 2

Use when: Ordinal categories (where order matters) OR tree-based models (XGBoost, Random Forest — they handle numeric categories well).

Do NOT use when: Linear/Logistic Regression — it would interpret “Enterprise=2” as twice “Premium=1,” which is meaningless.

One-Hot Encoding

Create a separate binary column for each category:

df_encoded = pd.get_dummies(df, columns=['plan'], prefix='plan', drop_first=True)

# Original: plan = "Premium"
# After:    plan_Premium = 1, plan_Enterprise = 0

# Original: plan = "Enterprise"
# After:    plan_Premium = 0, plan_Enterprise = 1

# Original: plan = "Basic" (dropped — the baseline)
# After:    plan_Premium = 0, plan_Enterprise = 0

Use when: Nominal categories (no order — colors, countries, product types) AND Linear/Logistic Regression.

Watch out: High-cardinality columns (1000 cities → 1000 columns). Use target encoding instead.

Ordinal Encoding

Map categories to integers reflecting their ORDER:

from sklearn.preprocessing import OrdinalEncoder

education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order)
df['education_encoded'] = oe.fit_transform(df[['education']])

# High School → 0, Bachelor → 1, Master → 2, PhD → 3
# The ORDER is meaningful: PhD > Master > Bachelor > High School

Use when: Categories have a natural order (education level, satisfaction rating, size S/M/L/XL).

Target Encoding (Mean Encoding)

Replace each category with the MEAN of the target for that category:

# For each city, calculate the average churn rate
city_means = df.groupby('city')['churned'].mean()
df['city_encoded'] = df['city'].map(city_means)

# city          → city_encoded (avg churn rate)
# "Toronto"     → 0.15 (15% of Toronto customers churned)
# "Vancouver"   → 0.22 (22% of Vancouver customers churned)
# "Montreal"    → 0.18

Use when: High-cardinality categorical features (hundreds of cities, thousands of product IDs). Reduces to one column instead of hundreds.

Watch out: Data leakage — calculate on training set only, apply to test set. Use cross-validation encoding in practice.

When to Use Which Encoding

Encoding Best For Columns Created Models
Label Ordinal + tree models 1 XGBoost, Random Forest
One-Hot Nominal, low cardinality (< 20 categories) N-1 Linear, Logistic, Neural Nets
Ordinal Ordered categories 1 All models
Target High cardinality (100+ categories) 1 All models (with care)

Scaling Numerical Features

StandardScaler (Z-Score Normalization)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['income', 'age', 'debt']] = scaler.fit_transform(df[['income', 'age', 'debt']])

# Before: income = 75000, age = 35, debt = 15000
# After:  income = 0.52, age = -0.31, debt = -0.45
# Mean = 0, Std = 1 for each column

Use when: Logistic Regression, Linear Regression, Neural Networks, SVM, KNN — algorithms that use DISTANCE or GRADIENT.

MinMaxScaler (0 to 1)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['income', 'age']] = scaler.fit_transform(df[['income', 'age']])

# Before: income ranges 20000-150000
# After:  income ranges 0.0-1.0

Use when: Neural Networks, algorithms that expect bounded input.

RobustScaler (Handles Outliers)

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # Uses median and IQR instead of mean and std
df[['income']] = scaler.fit_transform(df[['income']])

# Outliers (income = $10M) do not skew the scaling

When to Use Which Scaler

Scaler Best For Handles Outliers Range
StandardScaler General purpose, Logistic/Linear Reg ~(-3, +3)
MinMaxScaler Neural Networks, bounded input (0, 1)
RobustScaler Data with outliers Variable
None XGBoost, Random Forest, Decision Trees N/A Trees do not need scaling

Critical: Tree-based models (XGBoost, Random Forest) do NOT need scaling. They split on thresholds, not distances. Scaling wastes time and adds no benefit.


Part 2: Creating New Features

Date and Time Features

Extracting Components

df['signup_date'] = pd.to_datetime(df['signup_date'])

df['signup_year'] = df['signup_date'].dt.year
df['signup_month'] = df['signup_date'].dt.month
df['signup_day'] = df['signup_date'].dt.day
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek    # Monday=0, Sunday=6
df['signup_hour'] = df['signup_date'].dt.hour               # For datetime columns
df['signup_quarter'] = df['signup_date'].dt.quarter
df['signup_weekofyear'] = df['signup_date'].dt.isocalendar().week

Cyclical Encoding (Day of Week, Month)

Day 7 (Sunday) is close to Day 1 (Monday), but numerically they are far apart (7 vs 1). Cyclical encoding fixes this:

import numpy as np

# Month (1-12) → sin/cos encoding
df['month_sin'] = np.sin(2 * np.pi * df['signup_month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['signup_month'] / 12)

# Day of week (0-6) → sin/cos encoding
df['dow_sin'] = np.sin(2 * np.pi * df['signup_dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['signup_dayofweek'] / 7)

# Now December (12) and January (1) are close in the feature space

Time Since Events (Recency)

today = pd.Timestamp.now()

df['days_since_signup'] = (today - df['signup_date']).dt.days
df['days_since_last_purchase'] = (today - df['last_purchase_date']).dt.days
df['days_since_last_login'] = (today - df['last_login_date']).dt.days

# Months instead of days (less noisy)
df['months_since_signup'] = df['days_since_signup'] / 30.44

Is Weekend, Is Holiday, Is Business Hours

df['is_weekend'] = df['signup_date'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_month_start'] = df['signup_date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['signup_date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['signup_date'].dt.is_quarter_end.astype(int)

# Business hours (9 AM - 5 PM on weekdays)
df['is_business_hours'] = (
    (df['event_time'].dt.hour.between(9, 16)) &
    (~df['event_time'].dt.dayofweek.isin([5, 6]))
).astype(int)

Mathematical Features

Ratios and Proportions

# Revenue per order (intensity)
df['avg_order_value'] = df['total_revenue'] / df['total_orders'].replace(0, 1)

# Orders per month (velocity)
df['orders_per_month'] = df['total_orders'] / df['months_since_signup'].replace(0, 1)

# Debt to income ratio (risk indicator)
df['debt_to_income'] = df['total_debt'] / df['annual_income'].replace(0, 1)

# Credit utilization
df['credit_utilization'] = df['credit_balance'] / df['credit_limit'].replace(0, 1)

# Return rate
df['return_rate'] = df['returns'] / df['total_orders'].replace(0, 1)

Differences and Changes

# Revenue growth
df['revenue_change'] = df['revenue_this_month'] - df['revenue_last_month']
df['revenue_growth_pct'] = df['revenue_change'] / df['revenue_last_month'].replace(0, 1)

# Price difference from average
df['price_vs_avg'] = df['price'] - df['category_avg_price']
df['price_premium_pct'] = (df['price'] - df['category_avg_price']) / df['category_avg_price'].replace(0, 1)

Aggregated Statistics

# Customer-level aggregations from transaction data
customer_features = transactions.groupby('customer_id').agg(
    total_orders=('order_id', 'count'),
    total_spent=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    max_order_value=('amount', 'max'),
    min_order_value=('amount', 'min'),
    std_order_value=('amount', 'std'),
    unique_products=('product_id', 'nunique'),
    unique_categories=('category', 'nunique'),
    first_order=('order_date', 'min'),
    last_order=('order_date', 'max'),
).reset_index()

# Merge back to customer DataFrame
df = df.merge(customer_features, on='customer_id', how='left')

Interaction Features

# Multiply two features (captures non-linear interactions)
df['income_x_credit'] = df['income'] * df['credit_score']
df['age_x_tenure'] = df['age'] * df['employment_years']
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# Polynomial features (automated)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(df[['income', 'credit_score', 'age']])

Binning (Discretization)

Domain-Specific Bins

# Age groups
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 35, 45, 55, 65, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'])

# Income tiers
df['income_tier'] = pd.cut(df['income'],
    bins=[0, 30000, 60000, 100000, 200000, float('inf')],
    labels=['Low', 'Medium', 'High', 'Very High', 'Ultra High'])

# Credit score ranges (industry standard)
df['credit_tier'] = pd.cut(df['credit_score'],
    bins=[300, 580, 670, 740, 800, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'])

# Equal-frequency bins (same number of records in each bin)
df['income_quantile'] = pd.qcut(df['income'], q=5, labels=['Q1','Q2','Q3','Q4','Q5'])

Text Features

Basic Text Features

df['name_length'] = df['name'].str.len()
df['name_word_count'] = df['name'].str.split().str.len()
df['has_middle_name'] = (df['name'].str.split().str.len() > 2).astype(int)
df['email_domain'] = df['email'].str.split('@').str[1]
df['is_gmail'] = (df['email_domain'] == 'gmail.com').astype(int)
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()

Contains Keyword (Binary)

df['mentions_urgent'] = df['message'].str.contains('urgent|asap|emergency', case=False, regex=True).astype(int)
df['mentions_cancel'] = df['message'].str.contains('cancel|refund|return', case=False, regex=True).astype(int)

TF-IDF for Text Classification

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['review_text'])

# Convert to DataFrame and merge
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df = pd.concat([df, tfidf_df], axis=1)

Part 3: Handling Missing Values

# Check missing values
print(df.isnull().sum())
print(f"Missing %:
{df.isnull().mean() * 100}")

# Strategy 1: Drop rows (if < 5% missing)
df_clean = df.dropna(subset=['critical_column'])

# Strategy 2: Drop columns (if > 50% missing)
cols_to_drop = df.columns[df.isnull().mean() > 0.5]
df = df.drop(columns=cols_to_drop)

# Strategy 3: Impute with median (numerical — robust to outliers)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df[['income', 'age']] = imputer.fit_transform(df[['income', 'age']])

# Strategy 4: Impute with mode (categorical)
df['city'].fillna(df['city'].mode()[0], inplace=True)

# Strategy 5: Indicator column (missingness itself is a signal!)
df['income_missing'] = df['income'].isnull().astype(int)
df['income'].fillna(df['income'].median(), inplace=True)
# Now the model knows: income was missing AND gets the imputed value

The indicator column strategy is powerful. In credit scoring, a missing income field often means the applicant did not disclose it — which itself is a risk signal. The income_missing feature captures this.


Part 4: Handling Outliers

Detection

# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['income'] < lower) | (df['income'] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['income'].dropna()))
outliers = df[z_scores > 3]  # More than 3 standard deviations

Treatment

# Cap at percentiles (winsorize)
df['income_capped'] = df['income'].clip(lower=df['income'].quantile(0.01),
                                          upper=df['income'].quantile(0.99))

# Log transform (reduces skew, compresses outliers)
df['income_log'] = np.log1p(df['income'])  # log(1 + x) handles zeros

# Leave outliers for tree models (XGBoost handles them naturally)
# Cap outliers for linear models (they are sensitive to extreme values)

Part 5: Feature Selection

Filter Methods (Correlation)

# Remove features highly correlated with each other (redundant)
corr_matrix = df.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
high_corr_cols = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.90)]
print(f"Dropping highly correlated: {high_corr_cols}")
df = df.drop(columns=high_corr_cols)

# Remove low-variance features (nearly constant)
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)  # Remove features with < 1% variance
X_selected = selector.fit_transform(X)

Wrapper Methods (Recursive Feature Elimination)

from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, random_state=42)
rfe = RFE(model, n_features_to_select=15)  # Keep top 15 features
rfe.fit(X_train, y_train)

selected_features = X_train.columns[rfe.support_]
print(f"Selected features: {list(selected_features)}")

Embedded Methods (XGBoost Feature Importance)

import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=200, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Get feature importance
importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 15 features:")
for _, row in importance.head(15).iterrows():
    bar = '█' * int(row['importance'] * 100)
    print(f"  {row['feature']:30s}: {row['importance']:.4f} {bar}")

# Drop features with zero or near-zero importance
low_importance = importance[importance['importance'] < 0.01]['feature'].tolist()
X_train = X_train.drop(columns=low_importance)
X_test = X_test.drop(columns=low_importance)

Real-World Scenario 1: Credit Risk (Banking)

# Raw data: customer_id, income, age, employment_years, loan_amount, credit_score, city, education
# Target: default (0/1)

# Engineered features:
df['debt_to_income'] = df['loan_amount'] / df['income'].replace(0, 1)
df['income_per_year_employed'] = df['income'] / df['employment_years'].replace(0, 1)
df['loan_to_credit_ratio'] = df['loan_amount'] / (df['credit_score'] * 100)
df['credit_tier'] = pd.cut(df['credit_score'], bins=[300,580,670,740,800,850],
                            labels=['Poor','Fair','Good','VGood','Excellent'])
df['is_high_risk_city'] = df['city'].isin(['CityA', 'CityB']).astype(int)
df['education_encoded'] = df['education'].map({'High School':0,'Bachelor':1,'Master':2,'PhD':3})
df['income_x_credit'] = df['income'] * df['credit_score']

# Result: F1 improved from 0.72 (raw features) to 0.86 (engineered features)

Real-World Scenario 2: Customer Churn (Telecom)

# Engineered features from usage data:
df['calls_per_month'] = df['total_calls'] / df['tenure_months'].replace(0, 1)
df['avg_call_duration'] = df['total_minutes'] / df['total_calls'].replace(0, 1)
df['data_per_month'] = df['total_data_gb'] / df['tenure_months'].replace(0, 1)
df['support_calls_ratio'] = df['support_calls'] / df['total_calls'].replace(0, 1)
df['days_since_last_call'] = (today - df['last_call_date']).dt.days
df['contract_remaining_months'] = (df['contract_end'] - today).dt.days / 30.44
df['is_month_to_month'] = (df['contract_type'] == 'month-to-month').astype(int)
df['overage_count'] = df['times_over_limit']
df['payment_delay_avg'] = df['total_late_days'] / df['total_invoices'].replace(0, 1)

# The features that matter most for churn:
# 1. days_since_last_call (recency)
# 2. contract_remaining_months (commitment)
# 3. support_calls_ratio (frustration signal)
# 4. payment_delay_avg (financial stress)

Real-World Scenario 3: Demand Forecasting (Retail)

# Date-based features (most important for forecasting):
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_payday'] = df['date'].dt.day.isin([1, 15]).astype(int)
df['quarter'] = df['date'].dt.quarter
df['week_of_year'] = df['date'].dt.isocalendar().week

# Lag features (what happened in the past predicts the future):
df['sales_lag_1d'] = df.groupby('product_id')['sales'].shift(1)
df['sales_lag_7d'] = df.groupby('product_id')['sales'].shift(7)
df['sales_lag_30d'] = df.groupby('product_id')['sales'].shift(30)

# Rolling averages:
df['sales_rolling_7d'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(7, min_periods=1).mean())
df['sales_rolling_30d'] = df.groupby('product_id')['sales'].transform(
    lambda x: x.rolling(30, min_periods=1).mean())

# Trend:
df['sales_trend_7d'] = df['sales_lag_1d'] - df['sales_lag_7d']  # Rising or falling?

Real-World Scenario 4: Fraud Detection (Payments)

# Transaction-level features:
df['amount_vs_avg'] = df['amount'] / df.groupby('customer_id')['amount'].transform('mean')
df['is_high_amount'] = (df['amount'] > df['amount'].quantile(0.99)).astype(int)
df['is_round_amount'] = (df['amount'] % 100 == 0).astype(int)
df['hour_of_day'] = df['timestamp'].dt.hour
df['is_night'] = df['hour_of_day'].between(0, 5).astype(int)

# Velocity features (how fast is the customer transacting):
df['txn_count_1h'] = df.groupby('customer_id')['timestamp'].transform(
    lambda x: x.rolling('1H').count())
df['txn_count_24h'] = df.groupby('customer_id')['timestamp'].transform(
    lambda x: x.rolling('24H').count())

# Geographic anomaly:
df['is_new_country'] = (df['country'] != df.groupby('customer_id')['country'].shift(1)).astype(int)

# The features that catch fraud:
# 1. amount_vs_avg (unusually large transaction for this customer)
# 2. txn_count_1h (velocity spike — many transactions in a short time)
# 3. is_new_country (geographic anomaly)
# 4. is_night (transactions at unusual hours)

The Complete Feature Engineering Pipeline (Code)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import xgboost as xgb
from sklearn.metrics import f1_score, classification_report

# Step 1: Load data
df = pd.read_csv("customer_data.csv")

# Step 2: Handle missing values
df['income_missing'] = df['income'].isnull().astype(int)
num_imputer = SimpleImputer(strategy='median')
df[['income', 'age', 'credit_score']] = num_imputer.fit_transform(
    df[['income', 'age', 'credit_score']])
df['city'].fillna('Unknown', inplace=True)

# Step 3: Encode categoricals
le = LabelEncoder()
df['city_encoded'] = le.fit_transform(df['city'])
df['education_encoded'] = df['education'].map(
    {'High School':0, 'Bachelor':1, 'Master':2, 'PhD':3})
df = pd.get_dummies(df, columns=['plan'], drop_first=True)

# Step 4: Create new features
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['avg_order_value'] = df['total_spent'] / df['total_orders'].replace(0, 1)
df['orders_per_month'] = df['total_orders'] / (df['days_since_signup'] / 30.44).replace(0, 1)
df['debt_to_income'] = df['debt'] / df['income'].replace(0, 1)
df['income_x_credit'] = df['income'] * df['credit_score']

# Step 5: Handle outliers (cap at 1st/99th percentile)
for col in ['income', 'total_spent', 'debt']:
    df[col] = df[col].clip(lower=df[col].quantile(0.01), upper=df[col].quantile(0.99))

# Step 6: Select features
feature_cols = ['income', 'age', 'credit_score', 'days_since_signup',
                'avg_order_value', 'orders_per_month', 'debt_to_income',
                'income_x_credit', 'income_missing', 'education_encoded',
                'city_encoded'] + [c for c in df.columns if c.startswith('plan_')]

X = df[feature_cols]
y = df['churned']

# Step 7: Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=42, stratify=y)

# Step 8: Train (XGBoost — no scaling needed)
model = xgb.XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                            random_state=42, scale_pos_weight=len(y[y==0])/len(y[y==1]))
model.fit(X_train, y_train)

# Step 9: Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Step 10: Feature importance (iterate — drop low importance, re-train)
importance = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("
Feature Importance:")
for feat, imp in importance.items():
    print(f"  {feat:30s}: {imp:.4f} {'█' * int(imp * 100)}")

Common Mistakes

  1. Scaling features for XGBoost/Random Forest — tree-based models do not need scaling. Scaling wastes compute and adds no benefit. Scale only for Linear/Logistic Regression, SVM, KNN, Neural Networks.

  2. One-hot encoding high-cardinality features — 10,000 cities → 10,000 columns. Use target encoding or label encoding instead.

  3. Data leakage through target encoding — calculating the target mean on the FULL dataset (including test set) leaks information. Always calculate on training set only, then apply to test set.

  4. Not creating recency features from dates — raw dates are useless to a model. days_since_last_purchase is one of the most predictive features in customer analytics.

  5. Ignoring missing value indicators — the FACT that a value is missing is often a feature itself. Create column_missing binary indicators before imputing.

  6. Feature engineering AFTER splitting — fit scalers and encoders on TRAINING data only. Apply (transform) to test data. Fitting on the full dataset leaks test information.

  7. Too many features without selection — 500 features create noise. Use feature importance, correlation analysis, or RFE to select the top 20-50 that matter.

Interview Questions

Q: Why does feature engineering matter more than algorithm choice? A: Features carry the signal that algorithms amplify. XGBoost with raw features (total_orders, total_spent as-is) cannot discover that avg_order_value = total_spent / total_orders is the real predictor. By creating meaningful features (ratios, recency, interactions), you give ANY algorithm the signal it needs. Simple models with great features consistently beat complex models with raw features.

Q: When do you use one-hot encoding vs label encoding? A: One-hot for nominal categories (no order — colors, cities) with low cardinality (< 20 categories), especially for linear models. Label encoding for ordinal categories (education level, rating) or when using tree-based models (which handle numeric categories naturally). Target encoding for high-cardinality features (1000+ categories).

Q: How do you handle missing values? A: Three strategies: drop rows (if < 5% missing and not systematic), impute with median/mode (fills gaps), or create a missing indicator column (captures the signal of missingness) then impute. The indicator column strategy is powerful because missingness itself can be predictive — a missing income field in credit scoring signals risk.

Q: What are lag features and why are they important for time series? A: Lag features are past values of the target used as input features. sales_lag_7d is the sales value from 7 days ago. They capture temporal patterns: “what happened last week predicts this week.” Combined with rolling averages (7-day, 30-day mean), they are the most important features in demand forecasting, financial prediction, and any time-dependent model.

Q: How do you prevent data leakage in feature engineering? A: Fit all transformations (scalers, encoders, imputers) on TRAINING data only. Apply (transform) to test data using the training-set parameters. For target encoding, calculate means on training folds only (use cross-validation encoding). For time series, never use future data to create features — all lag and rolling features must look backward only.

Wrapping Up

Feature engineering is where domain knowledge meets data science. The algorithm is the engine, but features are the fuel. Raw data is crude oil — useful but not usable. Feature engineering refines it into high-octane fuel that powers accurate predictions.

The four highest-impact feature types: recency features (days since last event), ratio features (value per unit), aggregation features (customer-level statistics from transaction-level data), and interaction features (combinations that capture non-linear relationships). Master these four and you cover 80% of real-world feature engineering.

Related posts:Model Evaluation Deep DiveXGBoost and Gradient BoostingDecision Trees and Random ForestsLinear and Logistic RegressionData Quality Framework


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link