AWS S3 for Data Engineers: Buckets, Storage Classes, IAM, and Data Lake Patterns
If Azure has ADLS Gen2, AWS has Amazon S3 — the most widely used cloud storage service in the world. S3 is the foundation of the AWS data ecosystem. Every data lake, every Spark job, every Athena query, every Glue pipeline reads from or writes to S3.
Whether you are an Azure engineer exploring AWS or building your first AWS data platform, understanding S3 deeply is essential. This guide covers S3 from a data engineering perspective — not just the basics, but the patterns, access controls, cost optimization, and architecture decisions you need in production.
Table of Contents
- What Is Amazon S3?
- S3 Hierarchy: Buckets and Objects
- S3 vs Azure Blob Storage vs ADLS Gen2
- Storage Classes and Cost Optimization
- IAM and Access Control for S3
- S3 Bucket Policies vs IAM Policies
- Encryption
- Versioning and Lifecycle Rules
- S3 for Data Lakes: Folder Structure
- Working with S3 in Python (boto3)
- S3 with AWS Data Services
- S3 Event Notifications
- Performance Optimization
- Common Mistakes
- Interview Questions
- Wrapping Up
What Is Amazon S3?
Amazon Simple Storage Service (S3) is an object storage service that stores unlimited data as objects inside buckets. It was launched in 2006 and became the de facto standard for cloud storage.
Key characteristics:
- Unlimited storage — no capacity planning needed
- 11 nines durability (99.999999999%) — designed so that data loss is vanishingly unlikely
- Pay per GB stored + per request
- Global namespace — bucket names must be unique across ALL AWS accounts worldwide
- HTTP/HTTPS access — every object has a URL
- Integrated with everything — Athena, Glue, EMR, Redshift, Lambda, SageMaker
S3 Hierarchy: Buckets and Objects
S3 has a flat hierarchy (similar to Azure Blob Storage):
Bucket: my-company-datalake
|-- bronze/customers/2026/04/07/part-00000.parquet (Object)
|-- bronze/orders/2026/04/07/part-00000.parquet (Object)
|-- silver/customers_cleaned/data.parquet (Object)
|-- gold/fact_sales/data.parquet (Object)
Buckets
- A container for objects (like Azure containers)
- Name must be globally unique across all AWS accounts (e.g., naveen-datalake-prod)
- Created in a specific AWS region
- Naming rules: 3-63 characters, lowercase, no underscores, must start with letter/number
Objects
- The actual files (like Azure blobs)
- Identified by a key (the full path including “folders”)
- Maximum size: 5 TB per object
- Metadata: content type, custom headers, tags
“Folders” Are Virtual
Just like Azure Blob Storage, S3 folders are not real. An object key of bronze/customers/data.parquet has a prefix bronze/customers/ that the console displays as folders.
Key difference from ADLS Gen2: S3 does not have a hierarchical namespace option. Renames and directory operations require copying every object. This is one reason table formats like Delta Lake and Iceberg exist — they manage file-level operations efficiently on top of S3.
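Since there is no atomic rename, moving a "folder" means copying every object to its new key and deleting the original. A minimal boto3 sketch (bucket and prefix names are placeholders):

```python
def remap_key(key, old_prefix, new_prefix):
    """Rewrite an object key from one prefix to another (pure string logic)."""
    if not key.startswith(old_prefix):
        raise ValueError(f"{key!r} does not start with {old_prefix!r}")
    return new_prefix + key[len(old_prefix):]

def rename_prefix(bucket, old_prefix, new_prefix):
    """'Rename' a virtual folder: copy each object, then delete the original.
    This costs one copy + one delete API call per object -- no atomic rename."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=old_prefix):
        for obj in page.get("Contents", []):
            new_key = remap_key(obj["Key"], old_prefix, new_prefix)
            s3.copy_object(CopySource={"Bucket": bucket, "Key": obj["Key"]},
                           Bucket=bucket, Key=new_key)
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
```

On a prefix with millions of objects this gets slow and expensive, which is exactly the gap table formats fill.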
S3 vs Azure Blob Storage vs ADLS Gen2
| Feature | S3 | Azure Blob Storage | ADLS Gen2 |
|---|---|---|---|
| Namespace | Flat | Flat | Hierarchical |
| Rename folder | Copy + delete each object | Copy + delete each blob | Atomic operation |
| Max object size | 5 TB | 190.7 TB (block blob) | 190.7 TB |
| Bucket/Container naming | Globally unique | Unique per account | Unique per account |
| Storage tiers | Standard, IA, Glacier, Deep Archive | Hot, Cool, Cold, Archive | Hot, Cool, Cold, Archive |
| Access control | IAM + Bucket Policies + ACLs | RBAC + SAS tokens | RBAC + POSIX ACLs |
| Event triggers | S3 Event Notifications + EventBridge | Event Grid | Event Grid |
| Query in place | Athena, S3 Select | Synapse Serverless | Synapse Serverless |
| Typical use | AWS data lakes | General Azure storage | Azure data lakes |
Storage Classes and Cost Optimization
S3 offers multiple storage classes based on access frequency:
| Storage Class | Use Case | Retrieval | Cost (per GB/month) |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Instant | ~$0.023 |
| S3 Intelligent-Tiering | Unknown access patterns | Instant | ~$0.023 + monitoring fee |
| S3 Standard-IA | Infrequent (30+ days) | Instant | ~$0.0125 |
| S3 One Zone-IA | Infrequent, non-critical | Instant | ~$0.01 |
| S3 Glacier Instant | Archive with instant access | Instant | ~$0.004 |
| S3 Glacier Flexible | Archive (minutes to hours) | Minutes to 12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term archive | 12-48 hours | ~$0.00099 |
For Data Engineering
- Pipeline output (current month): S3 Standard
- Historical data (3-12 months): S3 Standard-IA
- Compliance archive (1+ year): Glacier Flexible or Deep Archive
- Unknown patterns: S3 Intelligent-Tiering (automatically moves data between tiers)
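The storage class can be chosen at write time via upload_file's ExtraArgs, so historical backfills can skip S3 Standard entirely. A small sketch (bucket and key names are illustrative):

```python
def upload_args(storage_class="STANDARD_IA"):
    """ExtraArgs dict for boto3 upload_file: pick the storage class at write
    time (e.g. STANDARD_IA, GLACIER, DEEP_ARCHIVE, INTELLIGENT_TIERING)."""
    return {"StorageClass": storage_class}

def upload_historical(local_path, bucket, key):
    """Upload a backfill file straight into Standard-IA."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key,
                   ExtraArgs=upload_args("STANDARD_IA"))
```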
Lifecycle Rules
Automate storage class transitions:
{
"Rules": [{
"ID": "archive-old-data",
"Status": "Enabled",
"Filter": {"Prefix": "bronze/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 730}
}]
}
This moves bronze data to cheaper tiers over time and deletes after 2 years.
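The same rule can be applied programmatically with boto3's put_bucket_lifecycle_configuration, which is handy when buckets are provisioned from pipeline code. A sketch, assuming a placeholder bucket name:

```python
# The same lifecycle rule as above, expressed as a Python dict
LIFECYCLE = {
    "Rules": [{
        "ID": "archive-old-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "bronze/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 730},
    }]
}

def apply_lifecycle(bucket):
    """Attach the lifecycle configuration to a bucket."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE)
```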
IAM and Access Control for S3
IAM Policies (User/Role Level)
Attach policies to IAM users, groups, or roles:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-datalake",
"arn:aws:s3:::my-datalake/*"
]
}]
}
Common S3 IAM Actions
| Action | What It Allows |
|---|---|
| s3:GetObject | Download/read objects |
| s3:PutObject | Upload/write objects |
| s3:DeleteObject | Delete objects |
| s3:ListBucket | List objects in a bucket |
| s3:GetBucketLocation | Get bucket region |
| s3:* | Full S3 access (avoid in production) |
IAM Roles (For Services)
Instead of giving AWS credentials to services, assign an IAM role:
- Glue job assumes a role with S3 read/write access
- Lambda function assumes a role with S3 access
- EMR cluster assumes a role to read/write the data lake
This is the AWS equivalent of Azure Managed Identity — no credentials to manage.
S3 Bucket Policies vs IAM Policies
| Aspect | IAM Policy | Bucket Policy |
|---|---|---|
| Attached to | User, group, or role | The bucket itself |
| Scope | Controls what the identity can do | Controls who can access the bucket |
| Cross-account | Cannot grant cross-account by itself | Can grant access to other AWS accounts |
| Use case | “This user can read from these buckets” | “This bucket allows these users/accounts” |
Best practice: Use IAM policies for your own team. Use bucket policies for cross-account access and public access control.
Blocking Public Access
Every S3 bucket should have public access blocked:
aws s3api put-public-access-block --bucket my-datalake --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
This is enabled by default for new buckets, but always verify.
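Verification can be scripted with boto3's get_public_access_block; the helper below flags a bucket unless all four settings are enabled (bucket name is a placeholder):

```python
def is_fully_blocked(config):
    """True only if all four public-access-block flags are enabled.
    `config` matches the 'PublicAccessBlockConfiguration' response shape."""
    flags = ("BlockPublicAcls", "IgnorePublicAcls",
             "BlockPublicPolicy", "RestrictPublicBuckets")
    return all(config.get(f) for f in flags)

def check_bucket(bucket):
    """Fetch a bucket's public-access-block settings and verify them."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    resp = s3.get_public_access_block(Bucket=bucket)
    return is_fully_blocked(resp["PublicAccessBlockConfiguration"])
```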
Encryption
Server-Side Encryption (SSE)
| Type | Key Management | Use Case |
|---|---|---|
| SSE-S3 | AWS manages keys | Default, simplest option |
| SSE-KMS | AWS KMS (you control keys) | Compliance, audit trail |
| SSE-C | Customer provides keys | Full key control |
Recommendation: Use SSE-S3 for most data engineering. Use SSE-KMS when compliance requires key audit trails.
Enable default encryption on the bucket:
aws s3api put-bucket-encryption --bucket my-datalake --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
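Individual uploads can also request SSE-KMS explicitly through ExtraArgs, which is useful when one prefix needs a customer-managed key while the bucket default stays SSE-S3. A sketch with placeholder names:

```python
def kms_extra_args(kms_key_id):
    """ExtraArgs forcing SSE-KMS with a specific customer-managed key."""
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}

def upload_encrypted(local_path, bucket, key, kms_key_id):
    """Upload one object encrypted with the given KMS key."""
    import boto3  # needed only for the AWS call
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key,
                   ExtraArgs=kms_extra_args(kms_key_id))
```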
In-Transit Encryption
S3 supports HTTPS by default. Enforce it with a bucket policy:
{
"Statement": [{
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::my-datalake/*",
"Condition": {"Bool": {"aws:SecureTransport": "false"}}
}]
}
Versioning and Lifecycle Rules
Versioning
Keeps every version of every object. If you overwrite a file, the old version is preserved.
aws s3api put-bucket-versioning --bucket my-datalake --versioning-configuration Status=Enabled
Use case: Accidental deletion protection, audit trails, rollback capability.
Cost warning: Every version takes storage space. Combine with lifecycle rules to delete old versions.
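On a versioned bucket, a plain delete only adds a delete marker; removing that marker restores the object. A boto3 sketch (bucket and key names are placeholders):

```python
def latest_delete_marker(markers):
    """Pick the current ('IsLatest') delete marker from the 'DeleteMarkers'
    list returned by list_object_versions, or None if there isn't one."""
    for m in markers:
        if m.get("IsLatest"):
            return m
    return None

def undelete(bucket, key):
    """Restore a soft-deleted object by removing its latest delete marker."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    marker = latest_delete_marker(resp.get("DeleteMarkers", []))
    if marker is None or marker["Key"] != key:
        return False  # nothing to undelete
    # Deleting the marker itself makes the previous version current again
    s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
    return True
```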
S3 for Data Lakes: Folder Structure
Standard data lake layout on S3:
s3://company-datalake/
|-- bronze/ (Raw ingestion)
| |-- source_system/
| | |-- table_name/
| | |-- year=2026/
| | |-- month=04/
| | |-- day=07/
| | |-- part-00000.parquet
|
|-- silver/ (Cleaned)
| |-- customers_cleaned/
| |-- orders_standardized/
|
|-- gold/ (Business-ready)
| |-- dim_customer/
| |-- fact_sales/
|
|-- scripts/ (ETL code)
|-- config/ (Pipeline configs)
|-- logs/ (Pipeline logs)
Hive-Style Partitioning
s3://datalake/bronze/orders/year=2026/month=04/day=07/data.parquet
Athena and Spark automatically recognize key=value patterns as partitions, enabling partition pruning in queries.
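Partition paths like this are easy to build in pipeline code. A small helper (prefix, table, and file names are illustrative):

```python
from datetime import date

def partition_key(prefix, table, d, filename):
    """Build a Hive-style partitioned S3 key: key=value pairs act as the
    'folders' that Athena and Spark recognize as partitions."""
    return (f"{prefix}/{table}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")

# partition_key("bronze", "orders", date(2026, 4, 7), "data.parquet")
# -> "bronze/orders/year=2026/month=04/day=07/data.parquet"
```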
Working with S3 in Python (boto3)
import boto3
s3 = boto3.client('s3')
# Upload file
s3.upload_file('local_file.csv', 'my-bucket', 'bronze/data/file.csv')
# Download file
s3.download_file('my-bucket', 'bronze/data/file.csv', 'local_file.csv')
# List objects with prefix
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='bronze/customers/')
for obj in response.get('Contents', []):
print(f"{obj['Key']} - {obj['Size']} bytes - {obj['LastModified']}")
# Read CSV directly into pandas
import pandas as pd
import io
obj = s3.get_object(Bucket='my-bucket', Key='data/customers.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
# Read Parquet from S3
df = pd.read_parquet('s3://my-bucket/data/customers.parquet')
# Write Parquet to S3
df.to_parquet('s3://my-bucket/output/customers.parquet')
# Delete object
s3.delete_object(Bucket='my-bucket', Key='temp/old_file.csv')
# Copy object
s3.copy_object(
CopySource={'Bucket': 'source-bucket', 'Key': 'data/file.parquet'},
Bucket='dest-bucket',
Key='data/file.parquet'
)
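One caveat with the listing call above: list_objects_v2 returns at most 1,000 keys per request. For larger prefixes, use a paginator, as in this sketch that totals the bytes stored under a prefix (names are placeholders):

```python
def total_bytes(pages):
    """Sum object sizes across list_objects_v2 pages. Each page is a dict
    with an optional 'Contents' list of {'Key', 'Size', ...} entries."""
    return sum(obj["Size"]
               for page in pages
               for obj in page.get("Contents", []))

def prefix_size(bucket, prefix):
    """Total size in bytes of every object under a prefix."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=prefix)
    return total_bytes(pages)
```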
S3 with AWS Data Services
| Service | How It Uses S3 |
|---|---|
| AWS Glue | Reads/writes data lake files, stores ETL scripts |
| Amazon Athena | Queries Parquet/CSV/JSON directly in S3 (no loading needed) |
| Amazon Redshift | COPY command loads from S3, UNLOAD exports to S3 |
| Amazon EMR | Spark reads/writes S3 as the data lake |
| AWS Lambda | Triggered by S3 events, processes uploaded files |
| Amazon SageMaker | Reads training data from S3, writes models to S3 |
| AWS Data Pipeline | Orchestrates data movement to/from S3 (legacy service, now in maintenance mode) |
S3 Event Notifications
Trigger actions when files are uploaded:
# S3 event -> Lambda function
# When a file lands in bronze/, trigger processing
Configure in S3 bucket settings:
{
"LambdaFunctionConfigurations": [{
"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456:function:process-upload",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{"Name": "prefix", "Value": "bronze/"},
{"Name": "suffix", "Value": ".parquet"}
]
}
}
}]
}
This triggers a Lambda function whenever a .parquet file is uploaded to the bronze/ prefix.
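On the Lambda side, the handler receives the bucket and key inside event["Records"]. A minimal sketch (the real processing step is left as a comment):

```python
import urllib.parse

def lambda_handler(event, context):
    """Minimal handler for an S3 ObjectCreated notification event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes keys in the event payload (spaces arrive as '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append((bucket, key))
        # ... real processing would go here (e.g. start a Glue job)
    return {"processed": len(processed)}
```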
Performance Optimization
- Use Parquet — columnar compression reduces data scanned by 80-90%
- Partition data — Hive-style partitions enable query pruning
- Right-size files — aim for 128 MB to 1 GB per file. Too many small files hurt performance.
- Use S3 Transfer Acceleration for cross-region uploads
- Multipart upload for files larger than 100 MB (boto3 handles this automatically)
- Use S3 Select to filter data server-side before downloading
- Prefix distribution — distribute objects across different prefixes to avoid throttling on high-request workloads
The Small Files Problem
Many Spark/Glue jobs produce thousands of tiny files. Each file requires a separate API call to read, slowing queries dramatically.
Solutions:
- Coalesce Spark output: df.coalesce(10).write.parquet(path)
- Use compaction jobs to merge small files periodically
- Use Delta Lake or Iceberg, which handle file management automatically
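A quick way to spot the problem is to measure what fraction of objects under a prefix fall below a size threshold. A boto3 sketch (the 128 MB threshold and all names are illustrative):

```python
SMALL_FILE_BYTES = 128 * 1024 * 1024  # objects below ~128 MB count as "small"

def small_file_ratio(objects):
    """Fraction of objects under the threshold. `objects` are dicts with a
    'Size' key, as returned in list_objects_v2 'Contents'."""
    sizes = [o["Size"] for o in objects]
    if not sizes:
        return 0.0
    return sum(s < SMALL_FILE_BYTES for s in sizes) / len(sizes)

def audit_prefix(bucket, prefix):
    """Compute the small-file ratio for everything under a prefix."""
    import boto3  # needed only for the AWS calls
    s3 = boto3.client("s3")
    objs = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        objs.extend(page.get("Contents", []))
    return small_file_ratio(objs)
```

A ratio near 1.0 on a heavily queried prefix is a strong signal that a compaction job would pay off.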
Common Mistakes
- Using s3:* in IAM policies — too broad. Grant only the specific actions needed.
- Not blocking public access — data breaches from public S3 buckets make headlines regularly.
- Storing credentials in code — use IAM roles for services, not access keys.
- Ignoring storage classes — leaving old data in S3 Standard wastes money.
- No lifecycle rules — terabytes of temp files accumulating forever.
- Not enabling versioning — one accidental delete and the data is gone.
- Too many small files — kills Athena and Spark query performance.
- Not encrypting — enable default encryption on every bucket.
Interview Questions
Q: What is Amazon S3? A: S3 is an object storage service that stores unlimited data as objects inside buckets. It offers 11 nines durability, multiple storage classes for cost optimization, and integrates with every AWS data service.
Q: What is the difference between S3 and ADLS Gen2? A: Both are cloud object storage for data lakes. ADLS Gen2 has a hierarchical namespace (real directories, atomic rename), while S3 has a flat namespace (virtual folders). S3 uses IAM policies and bucket policies for access control; ADLS Gen2 uses RBAC and POSIX ACLs. Both support Parquet, partitioning, and query-in-place.
Q: How do you secure an S3 bucket? A: Block public access, enable default encryption (SSE-S3 or SSE-KMS), enforce HTTPS with bucket policy, use IAM roles instead of access keys, enable versioning, and apply least-privilege IAM policies.
Q: What are S3 storage classes? A: S3 Standard (frequent access), Standard-IA (infrequent), One Zone-IA (non-critical infrequent), Glacier Instant (archive with instant access), Glacier Flexible (minutes-hours retrieval), and Deep Archive (12-48 hours). Use lifecycle rules to transition automatically.
Q: How would you design a data lake on S3? A: Use the Bronze/Silver/Gold pattern with Hive-style partitioning (year/month/day). Store raw data in Bronze as Parquet, clean in Silver, and aggregate in Gold. Use lifecycle rules for cost management, Athena for querying, and Glue for ETL.
Q: What is the small files problem in S3? A: Too many small files (under 128 MB) cause performance issues because each file requires a separate API call. Solutions include coalescing Spark output, running compaction jobs, and using Delta Lake or Iceberg for automatic file management.
Wrapping Up
S3 is to AWS what ADLS Gen2 is to Azure — the foundation of the data lake. Understanding buckets, storage classes, IAM, encryption, and the Bronze/Silver/Gold pattern gives you the foundation to build data platforms on either cloud.
If you already know Azure storage, the concepts translate directly. The main differences are naming (buckets vs containers), access control (IAM policies vs RBAC), and namespace (flat vs hierarchical).
Related posts:
- Azure Blob Storage Guide
- ADLS Gen2 Complete Guide
- Parquet vs CSV vs JSON
- Python for Data Engineers
- Building a REST API with FastAPI on AWS Lambda
If this guide helped you understand S3, share it with someone building on AWS. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.