AWS S3 for Data Engineers: Buckets, Storage Classes, IAM, and Data Lake Patterns
If Azure has ADLS Gen2, AWS has Amazon S3 — the most widely used cloud storage service in the world. S3 is the foundation of the AWS data ecosystem. Every data lake, every Spark job, every Athena query, every Glue pipeline reads from or writes to S3.
Whether you are an Azure engineer exploring AWS or building your first AWS data platform, understanding S3 deeply is essential. This guide covers S3 from a data engineering perspective — not just the basics, but the patterns, access controls, cost optimization, and architecture decisions you need in production.
Table of Contents
- What Is Amazon S3?
- S3 Hierarchy: Buckets and Objects
- S3 vs Azure Blob Storage vs ADLS Gen2
- Storage Classes and Cost Optimization
- IAM and Access Control for S3
- S3 Bucket Policies vs IAM Policies
- Encryption
- Versioning and Lifecycle Rules
- S3 for Data Lakes: Folder Structure
- Working with S3 in Python (boto3)
- S3 with AWS Data Services
- S3 Event Notifications
- Performance Optimization
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Amazon S3?
Amazon Simple Storage Service (S3) is an object storage service that stores unlimited data as objects inside buckets. It was launched in 2006 and became the de facto standard for cloud storage.
Key characteristics:
- Unlimited storage — no capacity planning needed
- 11 nines durability (99.999999999%) — designed so that data loss is vanishingly unlikely
- Pay per GB stored + per request
- Global namespace — bucket names must be unique across ALL AWS accounts worldwide
- HTTP/HTTPS access — every object has a URL
- Integrated with everything — Athena, Glue, EMR, Redshift, Lambda, SageMaker
S3 Hierarchy: Buckets and Objects
S3 has a flat hierarchy (similar to Azure Blob Storage):
Bucket: my-company-datalake
|-- bronze/customers/2026/04/07/part-00000.parquet (Object)
|-- bronze/orders/2026/04/07/part-00000.parquet (Object)
|-- silver/customers_cleaned/data.parquet (Object)
|-- gold/fact_sales/data.parquet (Object)
Buckets
- A container for objects (like Azure containers)
- Name must be globally unique across all AWS accounts (e.g., naveen-datalake-prod)
- Created in a specific AWS region
- Naming rules: 3-63 characters, lowercase, no underscores, must start with letter/number
Objects
- The actual files (like Azure blobs)
- Identified by a key (the full path including “folders”)
- Maximum size: 5 TB per object
- Metadata: content type, custom headers, tags
“Folders” Are Virtual
Just like Azure Blob Storage, S3 folders are not real. An object key of bronze/customers/data.parquet has a prefix bronze/customers/ that the console displays as folders.
Key difference from ADLS Gen2: S3 does not have a hierarchical namespace option. Renames and directory operations require copying every object. This is one reason table formats like Delta Lake and Iceberg exist — they manage file-level operations efficiently on top of S3.
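Since there is no atomic rename, moving a "folder" means copying every object to its new key and deleting the original. A minimal boto3 sketch (bucket and prefix names are placeholders):

```python
def remap_key(key, old_prefix, new_prefix):
    """Rewrite an object key from one prefix to another (pure string logic)."""
    if not key.startswith(old_prefix):
        raise ValueError(f"{key!r} does not start with {old_prefix!r}")
    return new_prefix + key[len(old_prefix):]

def rename_prefix(bucket, old_prefix, new_prefix):
    """'Rename' a virtual folder: copy each object, then delete the original.
    This costs one copy + one delete API call per object -- no atomic rename."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=old_prefix):
        for obj in page.get("Contents", []):
            new_key = remap_key(obj["Key"], old_prefix, new_prefix)
            s3.copy_object(CopySource={"Bucket": bucket, "Key": obj["Key"]},
                           Bucket=bucket, Key=new_key)
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
```

On a prefix with millions of objects this gets slow and expensive, which is exactly the gap table formats fill.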
S3 vs Azure Blob Storage vs ADLS Gen2
| Feature | S3 | Azure Blob Storage | ADLS Gen2 |
|---|---|---|---|
| Namespace | Flat | Flat | Hierarchical |
| Rename folder | Copy + delete each object | Copy + delete each blob | Atomic operation |
| Max object size | 5 TB | 190.7 TB (block blob) | 190.7 TB |
| Bucket/Container naming | Globally unique | Unique per account | Unique per account |
| Storage tiers | Standard, IA, Glacier, Deep Archive | Hot, Cool, Cold, Archive | Hot, Cool, Cold, Archive |
| Access control | IAM + Bucket Policies + ACLs | RBAC + SAS tokens | RBAC + POSIX ACLs |
| Event triggers | S3 Event Notifications + EventBridge | Event Grid | Event Grid |
| Query in place | Athena, S3 Select | Synapse Serverless | Synapse Serverless |
| Typical use | AWS data lakes | General Azure storage | Azure data lakes |
Storage Classes and Cost Optimization
S3 offers multiple storage classes based on access frequency:
| Storage Class | Use Case | Retrieval | Cost (per GB/month) |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Instant | ~$0.023 |
| S3 Intelligent-Tiering | Unknown access patterns | Instant | ~$0.023 + monitoring fee |
| S3 Standard-IA | Infrequent (30+ days) | Instant | ~$0.0125 |
| S3 One Zone-IA | Infrequent, non-critical | Instant | ~$0.01 |
| S3 Glacier Instant | Archive with instant access | Instant | ~$0.004 |
| S3 Glacier Flexible | Archive (minutes to hours) | Minutes to 12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term archive | 12-48 hours | ~$0.00099 |
For Data Engineering
- Pipeline output (current month): S3 Standard
- Historical data (3-12 months): S3 Standard-IA
- Compliance archive (1+ year): Glacier Flexible or Deep Archive
- Unknown patterns: S3 Intelligent-Tiering (automatically moves data between tiers)
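The storage class can be chosen at write time via upload_file's ExtraArgs, so historical backfills can skip S3 Standard entirely. A small sketch (bucket and key names are illustrative):

```python
def upload_args(storage_class="STANDARD_IA"):
    """ExtraArgs dict for boto3 upload_file: pick the storage class at write
    time (e.g. STANDARD_IA, GLACIER, DEEP_ARCHIVE, INTELLIGENT_TIERING)."""
    return {"StorageClass": storage_class}

def upload_historical(local_path, bucket, key):
    """Upload a backfill file straight into Standard-IA."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key,
                   ExtraArgs=upload_args("STANDARD_IA"))
```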
Lifecycle Rules
Automate storage class transitions:
{
"Rules": [{
"ID": "archive-old-data",
"Status": "Enabled",
"Filter": {"Prefix": "bronze/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 730}
}]
}
This moves bronze data to cheaper tiers over time and deletes after 2 years.
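The same rule can be applied programmatically with boto3's put_bucket_lifecycle_configuration, which is handy when buckets are provisioned from pipeline code. A sketch, assuming a placeholder bucket name:

```python
# The same lifecycle rule as above, expressed as a Python dict
LIFECYCLE = {
    "Rules": [{
        "ID": "archive-old-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "bronze/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 730},
    }]
}

def apply_lifecycle(bucket):
    """Attach the lifecycle configuration to a bucket."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE)
```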
IAM and Access Control for S3
IAM Policies (User/Role Level)
Attach policies to IAM users, groups, or roles:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-datalake",
"arn:aws:s3:::my-datalake/*"
]
}]
}
Common S3 IAM Actions
| Action | What It Allows |
|---|---|
| s3:GetObject | Download/read objects |
| s3:PutObject | Upload/write objects |
| s3:DeleteObject | Delete objects |
| s3:ListBucket | List objects in a bucket |
| s3:GetBucketLocation | Get bucket region |
| s3:* | Full S3 access (avoid in production) |
IAM Roles (For Services)
Instead of giving AWS credentials to services, assign an IAM role:
- Glue job assumes a role with S3 read/write access
- Lambda function assumes a role with S3 access
- EMR cluster assumes a role to read/write the data lake
This is the AWS equivalent of Azure Managed Identity — no credentials to manage.
S3 Bucket Policies vs IAM Policies
| Aspect | IAM Policy | Bucket Policy |
|---|---|---|
| Attached to | User, group, or role | The bucket itself |
| Scope | Controls what the identity can do | Controls who can access the bucket |
| Cross-account | Cannot grant cross-account by itself | Can grant access to other AWS accounts |
| Use case | “This user can read from these buckets” | “This bucket allows these users/accounts” |
Best practice: Use IAM policies for your own team. Use bucket policies for cross-account access and public access control.
Blocking Public Access
Every S3 bucket should have public access blocked:
aws s3api put-public-access-block --bucket my-datalake --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
This is enabled by default for new buckets, but always verify.
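Verification can be scripted with boto3's get_public_access_block; the helper below flags a bucket unless all four settings are enabled (bucket name is a placeholder):

```python
def is_fully_blocked(config):
    """True only if all four public-access-block flags are enabled.
    `config` matches the 'PublicAccessBlockConfiguration' response shape."""
    flags = ("BlockPublicAcls", "IgnorePublicAcls",
             "BlockPublicPolicy", "RestrictPublicBuckets")
    return all(config.get(f) for f in flags)

def check_bucket(bucket):
    """Fetch a bucket's public-access-block settings and verify them."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    resp = s3.get_public_access_block(Bucket=bucket)
    return is_fully_blocked(resp["PublicAccessBlockConfiguration"])
```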
Encryption
Server-Side Encryption (SSE)
| Type | Key Management | Use Case |
|---|---|---|
| SSE-S3 | AWS manages keys | Default, simplest option |
| SSE-KMS | AWS KMS (you control keys) | Compliance, audit trail |
| SSE-C | Customer provides keys | Full key control |
Recommendation: Use SSE-S3 for most data engineering. Use SSE-KMS when compliance requires key audit trails.
Enable default encryption on the bucket:
aws s3api put-bucket-encryption --bucket my-datalake --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
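Individual uploads can also request SSE-KMS explicitly through ExtraArgs, which is useful when one prefix needs a customer-managed key while the bucket default stays SSE-S3. A sketch with placeholder names:

```python
def kms_extra_args(kms_key_id):
    """ExtraArgs forcing SSE-KMS with a specific customer-managed key."""
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}

def upload_encrypted(local_path, bucket, key, kms_key_id):
    """Upload one object encrypted with the given KMS key."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key,
                   ExtraArgs=kms_extra_args(kms_key_id))
```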
In-Transit Encryption
S3 supports HTTPS by default. Enforce it with a bucket policy:
{
"Statement": [{
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::my-datalake/*",
"Condition": {"Bool": {"aws:SecureTransport": "false"}}
}]
}
Versioning and Lifecycle Rules
Versioning
Keeps every version of every object. If you overwrite a file, the old version is preserved.
aws s3api put-bucket-versioning --bucket my-datalake --versioning-configuration Status=Enabled
Use case: Accidental deletion protection, audit trails, rollback capability.
Cost warning: Every version takes storage space. Combine with lifecycle rules to delete old versions.
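On a versioned bucket, a plain delete only adds a delete marker; removing that marker restores the object. A boto3 sketch (bucket and key names are placeholders):

```python
def latest_delete_marker(markers):
    """Pick the current ('IsLatest') delete marker from the 'DeleteMarkers'
    list returned by list_object_versions, or None if there isn't one."""
    for m in markers:
        if m.get("IsLatest"):
            return m
    return None

def undelete(bucket, key):
    """Restore a soft-deleted object by removing its latest delete marker."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    marker = latest_delete_marker(resp.get("DeleteMarkers", []))
    if marker is None or marker["Key"] != key:
        return False  # nothing to undelete
    # Deleting the marker itself makes the previous version current again
    s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
    return True
```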
S3 for Data Lakes: Folder Structure
Standard data lake layout on S3:
s3://company-datalake/
|-- bronze/ (Raw ingestion)
| |-- source_system/
| | |-- table_name/
| | |-- year=2026/
| | |-- month=04/
| | |-- day=07/
| | |-- part-00000.parquet
|
|-- silver/ (Cleaned)
| |-- customers_cleaned/
| |-- orders_standardized/
|
|-- gold/ (Business-ready)
| |-- dim_customer/
| |-- fact_sales/
|
|-- scripts/ (ETL code)
|-- config/ (Pipeline configs)
|-- logs/ (Pipeline logs)
Hive-Style Partitioning
s3://datalake/bronze/orders/year=2026/month=04/day=07/data.parquet
Athena and Spark automatically recognize key=value patterns as partitions, enabling partition pruning in queries.
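Partition paths like this are easy to build in pipeline code. A small helper (prefix, table, and file names are illustrative):

```python
from datetime import date

def partition_key(prefix, table, d, filename):
    """Build a Hive-style partitioned S3 key: key=value pairs act as the
    'folders' that Athena and Spark recognize as partitions."""
    return (f"{prefix}/{table}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")

# partition_key("bronze", "orders", date(2026, 4, 7), "data.parquet")
# -> "bronze/orders/year=2026/month=04/day=07/data.parquet"
```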
Working with S3 in Python (boto3)
import boto3
s3 = boto3.client('s3')
# Upload file
s3.upload_file('local_file.csv', 'my-bucket', 'bronze/data/file.csv')
# Download file
s3.download_file('my-bucket', 'bronze/data/file.csv', 'local_file.csv')
# List objects with prefix
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='bronze/customers/')
for obj in response.get('Contents', []):
print(f"{obj['Key']} - {obj['Size']} bytes - {obj['LastModified']}")
# Read CSV directly into pandas
import pandas as pd
import io
obj = s3.get_object(Bucket='my-bucket', Key='data/customers.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
# Read Parquet from S3
df = pd.read_parquet('s3://my-bucket/data/customers.parquet')
# Write Parquet to S3
df.to_parquet('s3://my-bucket/output/customers.parquet')
# Delete object
s3.delete_object(Bucket='my-bucket', Key='temp/old_file.csv')
# Copy object
s3.copy_object(
CopySource={'Bucket': 'source-bucket', 'Key': 'data/file.parquet'},
Bucket='dest-bucket',
Key='data/file.parquet'
)
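One caveat with the listing call above: list_objects_v2 returns at most 1,000 keys per request. For larger prefixes, use a paginator, as in this sketch that totals the bytes stored under a prefix (names are placeholders):

```python
def total_bytes(pages):
    """Sum object sizes across list_objects_v2 pages. Each page is a dict
    with an optional 'Contents' list of {'Key', 'Size', ...} entries."""
    return sum(obj["Size"]
               for page in pages
               for obj in page.get("Contents", []))

def prefix_size(bucket, prefix):
    """Total size in bytes of every object under a prefix."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=prefix)
    return total_bytes(pages)
```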
S3 with AWS Data Services
| Service | How It Uses S3 |
|---|---|
| AWS Glue | Reads/writes data lake files, stores ETL scripts |
| Amazon Athena | Queries Parquet/CSV/JSON directly in S3 (no loading needed) |
| Amazon Redshift | COPY command loads from S3, UNLOAD exports to S3 |
| Amazon EMR | Spark reads/writes S3 as the data lake |
| AWS Lambda | Triggered by S3 events, processes uploaded files |
| Amazon SageMaker | Reads training data from S3, writes models to S3 |
| AWS Data Pipeline | Orchestrates data movement to/from S3 (legacy service, now in maintenance mode) |
S3 Event Notifications
Trigger actions when files are uploaded:
# S3 event -> Lambda function
# When a file lands in bronze/, trigger processing
Configure in S3 bucket settings:
{
"LambdaFunctionConfigurations": [{
"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456:function:process-upload",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{"Name": "prefix", "Value": "bronze/"},
{"Name": "suffix", "Value": ".parquet"}
]
}
}
}]
}
This triggers a Lambda function whenever a .parquet file is uploaded to the bronze/ prefix.
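On the Lambda side, the handler receives the bucket and key inside event["Records"]. A minimal sketch (the real processing step is left as a comment):

```python
import urllib.parse

def lambda_handler(event, context):
    """Minimal handler for an S3 ObjectCreated notification event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes keys in the event payload (spaces arrive as '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append((bucket, key))
        # ... real processing would go here (e.g. start a Glue job)
    return {"processed": len(processed)}
```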
Performance Optimization
- Use Parquet — columnar compression reduces data scanned by 80-90%
- Partition data — Hive-style partitions enable query pruning
- Right-size files — aim for 128 MB to 1 GB per file. Too many small files hurt performance.
- Use S3 Transfer Acceleration for cross-region uploads
- Multipart upload for files larger than 100 MB (boto3 handles this automatically)
- Use S3 Select to filter data server-side before downloading
- Prefix distribution — distribute objects across different prefixes to avoid throttling on high-request workloads
The Small Files Problem
Many Spark/Glue jobs produce thousands of tiny files. Each file requires a separate API call to read, slowing queries dramatically.
Solutions:
- Coalesce Spark output: df.coalesce(10).write.parquet(path)
- Use compaction jobs to merge small files periodically
- Use Delta Lake or Iceberg, which handle file management automatically
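A quick way to spot the problem is to measure what fraction of objects under a prefix fall below a size threshold. A boto3 sketch (the 128 MB threshold and all names are illustrative):

```python
SMALL_FILE_BYTES = 128 * 1024 * 1024  # objects below ~128 MB count as "small"

def small_file_ratio(objects):
    """Fraction of objects under the threshold. `objects` are dicts with a
    'Size' key, as returned in list_objects_v2 'Contents'."""
    sizes = [o["Size"] for o in objects]
    if not sizes:
        return 0.0
    return sum(s < SMALL_FILE_BYTES for s in sizes) / len(sizes)

def audit_prefix(bucket, prefix):
    """Compute the small-file ratio for everything under a prefix."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    objs = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        objs.extend(page.get("Contents", []))
    return small_file_ratio(objs)
```

A ratio near 1.0 on a heavily queried prefix is a strong signal that a compaction job would pay off.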
Common Mistakes
- Using s3:* in IAM policies — too broad. Grant only the specific actions needed.
- Not blocking public access — data breaches from public S3 buckets make headlines regularly.
- Storing credentials in code — use IAM roles for services, not access keys.
- Ignoring storage classes — leaving old data in S3 Standard wastes money.
- No lifecycle rules — terabytes of temp files accumulating forever.
- Not enabling versioning — one accidental delete and the data is gone.
- Too many small files — kills Athena and Spark query performance.
- Not encrypting — enable default encryption on every bucket.
Interview Questions
Q: What is Amazon S3? A: S3 is an object storage service that stores unlimited data as objects inside buckets. It offers 11 nines durability, multiple storage classes for cost optimization, and integrates with every AWS data service.
Q: What is the difference between S3 and ADLS Gen2? A: Both are cloud object storage for data lakes. ADLS Gen2 has a hierarchical namespace (real directories, atomic rename), while S3 has a flat namespace (virtual folders). S3 uses IAM policies and bucket policies for access control; ADLS Gen2 uses RBAC and POSIX ACLs. Both support Parquet, partitioning, and query-in-place.
Q: How do you secure an S3 bucket? A: Block public access, enable default encryption (SSE-S3 or SSE-KMS), enforce HTTPS with bucket policy, use IAM roles instead of access keys, enable versioning, and apply least-privilege IAM policies.
Q: What are S3 storage classes? A: S3 Standard (frequent access), Standard-IA (infrequent), One Zone-IA (non-critical infrequent), Glacier Instant (archive with instant access), Glacier Flexible (minutes-hours retrieval), and Deep Archive (12-48 hours). Use lifecycle rules to transition automatically.
Q: How would you design a data lake on S3? A: Use the Bronze/Silver/Gold pattern with Hive-style partitioning (year/month/day). Store raw data in Bronze as Parquet, clean in Silver, and aggregate in Gold. Use lifecycle rules for cost management, Athena for querying, and Glue for ETL.
Q: What is the small files problem in S3? A: Too many small files (under 128 MB) cause performance issues because each file requires a separate API call. Solutions include coalescing Spark output, running compaction jobs, and using Delta Lake or Iceberg for automatic file management.
Wrapping Up
S3 is to AWS what ADLS Gen2 is to Azure — the foundation of the data lake. Understanding buckets, storage classes, IAM, encryption, and the Bronze/Silver/Gold pattern gives you the foundation to build data platforms on either cloud.
If you already know Azure storage, the concepts translate directly. The main differences are naming (buckets vs containers), access control (IAM policies vs RBAC), and namespace (flat vs hierarchical).
Related posts:
- Azure Blob Storage Guide
- ADLS Gen2 Complete Guide
- Parquet vs CSV vs JSON
- Python for Data Engineers
- Building a REST API with FastAPI on AWS Lambda
If this guide helped you understand S3, share it with someone building on AWS. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.