AWS S3 for Data Engineers: Buckets, Storage Classes, IAM, and Data Lake Patterns

If Azure has ADLS Gen2, AWS has Amazon S3 — the most widely used cloud storage service in the world. S3 is the foundation of the AWS data ecosystem. Every data lake, every Spark job, every Athena query, every Glue pipeline reads from or writes to S3.

Whether you are an Azure engineer exploring AWS or building your first AWS data platform, understanding S3 deeply is essential. This guide covers S3 from a data engineering perspective — not just the basics, but the patterns, access controls, cost optimization, and architecture decisions you need in production.

Table of Contents

  • What Is Amazon S3?
  • S3 Hierarchy: Buckets and Objects
  • S3 vs Azure Blob Storage vs ADLS Gen2
  • Storage Classes and Cost Optimization
  • IAM and Access Control for S3
  • S3 Bucket Policies vs IAM Policies
  • Encryption
  • Versioning and Lifecycle Rules
  • S3 for Data Lakes: Folder Structure
  • Working with S3 in Python (boto3)
  • S3 with AWS Data Services
  • S3 Event Notifications
  • Performance Optimization
  • Common Mistakes
  • Interview Questions
  • Wrapping Up

What Is Amazon S3?

Amazon Simple Storage Service (S3) is an object storage service that stores unlimited data as objects inside buckets. It was launched in 2006 and became the de facto standard for cloud storage.

Key characteristics:

  • Unlimited storage — no capacity planning needed
  • 11 nines durability (99.999999999%) — losing an object is extraordinarily unlikely
  • Pay per GB stored + per request
  • Global namespace — bucket names must be unique across ALL AWS accounts worldwide
  • HTTP/HTTPS access — every object has a URL
  • Integrated with everything — Athena, Glue, EMR, Redshift, Lambda, SageMaker

S3 Hierarchy: Buckets and Objects

S3 has a flat hierarchy (similar to Azure Blob Storage):

Bucket: my-company-datalake
  |-- bronze/customers/2026/04/07/part-00000.parquet   (Object)
  |-- bronze/orders/2026/04/07/part-00000.parquet      (Object)
  |-- silver/customers_cleaned/data.parquet             (Object)
  |-- gold/fact_sales/data.parquet                      (Object)

Buckets

  • A container for objects (like Azure containers)
  • Name must be globally unique across all AWS accounts (e.g., naveen-datalake-prod)
  • Created in a specific AWS region
  • Naming rules: 3-63 characters, lowercase letters, numbers, hyphens, and dots; no underscores; must start and end with a letter or number

Objects

  • The actual files (like Azure blobs)
  • Identified by a key (the full path including “folders”)
  • Maximum size: 5 TB per object
  • Metadata: content type, custom headers, tags

“Folders” Are Virtual

Just like Azure Blob Storage, S3 folders are not real. An object key of bronze/customers/data.parquet has a prefix bronze/customers/ that the console displays as folders.
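To make the flatness concrete, here is a small sketch (helper names and keys are illustrative, not from the original post). The pure function derives the "folders" the console would display from plain object keys; boto3's list_objects_v2 with Delimiter='/' does the same grouping server-side via CommonPrefixes.

```python
def virtual_folders(keys):
    """Derive the 'folders' the S3 console would display from flat object keys.
    There are no directory objects -- only shared key prefixes."""
    return sorted({key.rsplit("/", 1)[0] + "/" for key in keys if "/" in key})


def list_console_folders(bucket, prefix="", s3=None):
    """Server-side equivalent: Delimiter='/' makes S3 group keys into CommonPrefixes."""
    if s3 is None:
        import boto3  # imported lazily so the pure helper above has no dependency
        s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
    return [cp["Prefix"] for cp in resp.get("CommonPrefixes", [])]


print(virtual_folders([
    "bronze/customers/2026/04/07/part-00000.parquet",
    "bronze/orders/2026/04/07/part-00000.parquet",
]))
# → ['bronze/customers/2026/04/07/', 'bronze/orders/2026/04/07/']
```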

Key difference from ADLS Gen2: S3 does not have a hierarchical namespace option. Renames and directory operations require copying every object and deleting the originals. This is one reason table formats like Delta Lake and Iceberg exist — they manage file-level operations efficiently on top of S3.
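A "rename" on S3 is therefore copy-then-delete, repeated for every object under a prefix. The sketch below (function names and the injectable client are illustrative) shows why this is slow and not atomic: a failure between the two calls leaves the object under both keys, or neither.

```python
def rename_object(bucket, old_key, new_key, s3=None):
    """'Rename' on S3 is really copy-then-delete: two calls, not atomic."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    s3.copy_object(CopySource={"Bucket": bucket, "Key": old_key},
                   Bucket=bucket, Key=new_key)
    s3.delete_object(Bucket=bucket, Key=old_key)


def rename_prefix(bucket, old_prefix, new_prefix, s3=None):
    """Renaming a 'folder' means repeating that for every object under the prefix."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
        for obj in page.get("Contents", []):
            new_key = new_prefix + obj["Key"][len(old_prefix):]
            rename_object(bucket, obj["Key"], new_key, s3=s3)
```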

S3 vs Azure Blob Storage vs ADLS Gen2

| Feature | S3 | Azure Blob Storage | ADLS Gen2 |
| --- | --- | --- | --- |
| Namespace | Flat | Flat | Hierarchical |
| Rename folder | Copy + delete each object | Copy + delete each blob | Atomic operation |
| Max object size | 5 TB | 190.7 TB (block blob) | 190.7 TB |
| Bucket/Container naming | Globally unique | Unique per account | Unique per account |
| Storage tiers | Standard, IA, Glacier, Deep Archive | Hot, Cool, Cold, Archive | Hot, Cool, Cold, Archive |
| Access control | IAM + Bucket Policies + ACLs | RBAC + SAS tokens | RBAC + POSIX ACLs |
| Event triggers | S3 Event Notifications + EventBridge | Event Grid | Event Grid |
| Query in place | Athena, S3 Select | Synapse Serverless | Synapse Serverless |
| Typical use | AWS data lakes | General Azure storage | Azure data lakes |

Storage Classes and Cost Optimization

S3 offers multiple storage classes based on access frequency:

| Storage Class | Use Case | Retrieval | Cost (per GB/month) |
| --- | --- | --- | --- |
| S3 Standard | Frequently accessed data | Instant | ~$0.023 |
| S3 Intelligent-Tiering | Unknown access patterns | Instant | ~$0.023 + monitoring fee |
| S3 Standard-IA | Infrequent (30+ days) | Instant | ~$0.0125 |
| S3 One Zone-IA | Infrequent, non-critical | Instant | ~$0.01 |
| S3 Glacier Instant Retrieval | Archive with instant access | Instant | ~$0.004 |
| S3 Glacier Flexible Retrieval | Archive (minutes to hours) | Minutes to 12 hours | ~$0.0036 |
| S3 Glacier Deep Archive | Long-term archive | 12-48 hours | ~$0.00099 |

For Data Engineering

  • Pipeline output (current month): S3 Standard
  • Historical data (3-12 months): S3 Standard-IA
  • Compliance archive (1+ year): Glacier Flexible or Deep Archive
  • Unknown patterns: S3 Intelligent-Tiering (automatically moves data between tiers)

Lifecycle Rules

Automate storage class transitions:

{
    "Rules": [{
        "ID": "archive-old-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "bronze/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
        ],
        "Expiration": {"Days": 730}
    }]
}

This moves bronze data to cheaper tiers over time and deletes after 2 years.
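The same configuration can be applied programmatically. This sketch (rule ID, bucket, and helper names are illustrative) builds a rule dict matching the JSON above and pushes it with boto3's put_bucket_lifecycle_configuration:

```python
def archive_rule(prefix, transitions, expire_days=None, rule_id="archive-old-data"):
    """Build one lifecycle rule: a list of (days, storage_class) transitions
    plus an optional expiration, mirroring the JSON shown above."""
    rule = {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": d, "StorageClass": sc} for d, sc in transitions],
    }
    if expire_days is not None:
        rule["Expiration"] = {"Days": expire_days}
    return rule


def apply_lifecycle(bucket, rules, s3=None):
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules})


rule = archive_rule("bronze/",
                    [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")],
                    expire_days=730)
# apply_lifecycle("my-datalake", [rule])  # requires AWS credentials
```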

IAM and Access Control for S3

IAM Policies (User/Role Level)

Attach policies to IAM users, groups, or roles:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:ListBucket"
        ],
        "Resource": [
            "arn:aws:s3:::my-datalake",
            "arn:aws:s3:::my-datalake/*"
        ]
    }]
}

Common S3 IAM Actions

| Action | What It Allows |
| --- | --- |
| s3:GetObject | Download/read objects |
| s3:PutObject | Upload/write objects |
| s3:DeleteObject | Delete objects |
| s3:ListBucket | List objects in a bucket |
| s3:GetBucketLocation | Get bucket region |
| s3:* | Full S3 access (avoid in production) |

IAM Roles (For Services)

Instead of giving AWS credentials to services, assign an IAM role:

  • Glue job assumes a role with S3 read/write access
  • Lambda function assumes a role with S3 access
  • EMR cluster assumes a role to read/write the data lake

This is the AWS equivalent of Azure Managed Identity — no credentials to manage.
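For completeness, here is a sketch of the underlying mechanism: a caller asks STS for short-lived credentials for a role (the role ARN, session name, and injectable client are illustrative). On Glue, Lambda, or EMR you normally never write this, because the service assumes its execution role for you and boto3 picks up the credentials automatically.

```python
def role_credentials(role_arn, session_name="etl-session", sts=None):
    """Fetch short-lived credentials for an IAM role via STS.
    The returned dict can seed boto3.client('s3', aws_access_key_id=...,
    aws_secret_access_key=..., aws_session_token=...)."""
    if sts is None:
        import boto3
        sts = boto3.client("sts")
    return sts.assume_role(RoleArn=role_arn,
                           RoleSessionName=session_name)["Credentials"]
```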

S3 Bucket Policies vs IAM Policies

| Aspect | IAM Policy | Bucket Policy |
| --- | --- | --- |
| Attached to | User, group, or role | The bucket itself |
| Scope | Controls what the identity can do | Controls who can access the bucket |
| Cross-account | Cannot grant cross-account by itself | Can grant access to other AWS accounts |
| Use case | "This user can read from these buckets" | "This bucket allows these users/accounts" |

Best practice: Use IAM policies for your own team. Use bucket policies for cross-account access and public access control.

Blocking Public Access

Every S3 bucket should have public access blocked:

aws s3api put-public-access-block \
    --bucket my-datalake \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

This is enabled by default for new buckets, but always verify.

Encryption

Server-Side Encryption (SSE)

| Type | Key Management | Use Case |
| --- | --- | --- |
| SSE-S3 | AWS manages keys | Default, simplest option |
| SSE-KMS | AWS KMS (you control keys) | Compliance, audit trail |
| SSE-C | Customer provides keys | Full key control |

Recommendation: Use SSE-S3 for most data engineering. Use SSE-KMS when compliance requires key audit trails.

Enable default encryption on the bucket:

aws s3api put-bucket-encryption \
    --bucket my-datalake \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

In-Transit Encryption

S3 supports HTTPS by default. Enforce it with a bucket policy:

{
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": "arn:aws:s3:::my-datalake/*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}}
    }]
}
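Applying that policy programmatically looks like this (a sketch; the bucket name and helper names are illustrative). Note the builder adds the bucket ARN itself alongside the /* resource, an assumption on my part so that list operations are also denied over plain HTTP:

```python
import json


def https_only_policy(bucket):
    """Build a deny-insecure-transport bucket policy like the JSON above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket (list calls) and its objects
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


def enforce_https(bucket, s3=None):
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket=bucket,
                         Policy=json.dumps(https_only_policy(bucket)))
```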

Versioning and Lifecycle Rules

Versioning

Keeps every version of every object. If you overwrite a file, the old version is preserved.

aws s3api put-bucket-versioning \
    --bucket my-datalake \
    --versioning-configuration Status=Enabled

Use case: Accidental deletion protection, audit trails, rollback capability.

Cost warning: Every version takes storage space. Combine with lifecycle rules to delete old versions.
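Recovery works because deleting a versioned object without a version ID only adds a delete marker on top. A sketch of "undelete" (function name and injectable client are illustrative): remove the latest delete marker and the previous version becomes current again.

```python
def undelete(bucket, key, s3=None):
    """Restore a versioned object that was 'deleted' by removing its
    latest delete marker. Returns the marker's version ID, or None
    if no delete marker was found."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in resp.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting a specific version ID removes the marker itself
            s3.delete_object(Bucket=bucket, Key=key,
                             VersionId=marker["VersionId"])
            return marker["VersionId"]
    return None
```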

S3 for Data Lakes: Folder Structure

Standard data lake layout on S3:

s3://company-datalake/
  |-- bronze/                          (Raw ingestion)
  |   |-- source_system/
  |   |   |-- table_name/
  |   |       |-- year=2026/
  |   |           |-- month=04/
  |   |               |-- day=07/
  |   |                   |-- part-00000.parquet
  |
  |-- silver/                          (Cleaned)
  |   |-- customers_cleaned/
  |   |-- orders_standardized/
  |
  |-- gold/                            (Business-ready)
  |   |-- dim_customer/
  |   |-- fact_sales/
  |
  |-- scripts/                         (ETL code)
  |-- config/                          (Pipeline configs)
  |-- logs/                            (Pipeline logs)

Hive-Style Partitioning

s3://datalake/bronze/orders/year=2026/month=04/day=07/data.parquet

Athena and Spark automatically recognize key=value patterns as partitions, enabling partition pruning in queries.
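A daily pipeline typically computes its partition prefix from the run date. A minimal sketch, assuming the year=/month=/day= layout above (the helper name is illustrative):

```python
from datetime import date


def partition_prefix(table_root, d):
    """Build a Hive-style partition prefix for one date, zero-padding
    month and day to match the layout shown above."""
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"


prefix = partition_prefix("s3://datalake/bronze/orders", date(2026, 4, 7))
print(prefix)
# → s3://datalake/bronze/orders/year=2026/month=04/day=07/

# With pyarrow installed, pandas can then prune to this slice, e.g.:
# pd.read_parquet("s3://datalake/bronze/orders/",
#                 filters=[("year", "=", 2026), ("month", "=", 4), ("day", "=", 7)])
```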

Working with S3 in Python (boto3)

import boto3

s3 = boto3.client('s3')

# Upload file
s3.upload_file('local_file.csv', 'my-bucket', 'bronze/data/file.csv')

# Download file
s3.download_file('my-bucket', 'bronze/data/file.csv', 'local_file.csv')

# List objects with prefix
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='bronze/customers/')
for obj in response.get('Contents', []):
    print(f"{obj['Key']} - {obj['Size']} bytes - {obj['LastModified']}")

# Read CSV directly into pandas
import pandas as pd
import io

obj = s3.get_object(Bucket='my-bucket', Key='data/customers.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

# Read Parquet from S3
df = pd.read_parquet('s3://my-bucket/data/customers.parquet')

# Write Parquet to S3
df.to_parquet('s3://my-bucket/output/customers.parquet')

# Delete object
s3.delete_object(Bucket='my-bucket', Key='temp/old_file.csv')

# Copy object
s3.copy_object(
    CopySource={'Bucket': 'source-bucket', 'Key': 'data/file.parquet'},
    Bucket='dest-bucket',
    Key='data/file.parquet'
)
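One caveat worth knowing about the listing call above: list_objects_v2 returns at most 1,000 keys per request. For data lake prefixes with many files, use a paginator, which drives the ContinuationToken loop for you (the helper name and injectable client here are illustrative):

```python
def all_keys(bucket, prefix="", s3=None):
    """Yield every key under a prefix; a paginator handles the
    1,000-keys-per-call limit of list_objects_v2 transparently."""
    if s3 is None:
        import boto3
        s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```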

S3 with AWS Data Services

| Service | How It Uses S3 |
| --- | --- |
| AWS Glue | Reads/writes data lake files, stores ETL scripts |
| Amazon Athena | Queries Parquet/CSV/JSON directly in S3 (no loading needed) |
| Amazon Redshift | COPY command loads from S3, UNLOAD exports to S3 |
| Amazon EMR | Spark reads/writes S3 as the data lake |
| AWS Lambda | Triggered by S3 events, processes uploaded files |
| Amazon SageMaker | Reads training data from S3, writes models to S3 |
| AWS Data Pipeline | Orchestrates data movement to/from S3 |
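To show the query-in-place pattern, here is a sketch of running an Athena query from Python (the database, result location, and helper names are illustrative assumptions). Athena is asynchronous: you start a query, then poll its state; results land in S3 as CSV.

```python
import time


def run_athena_query(sql, database, output_location, athena=None):
    """Start an Athena query and poll until it finishes.
    Returns (query_execution_id, final_state)."""
    if athena is None:
        import boto3
        athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(1)  # still QUEUED or RUNNING
```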

S3 Event Notifications

Trigger actions when files are uploaded:

# S3 event -> Lambda function
# When a file lands in bronze/, trigger processing

Configure in S3 bucket settings:

{
    "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456:function:process-upload",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
            "Key": {
                "FilterRules": [
                    {"Name": "prefix", "Value": "bronze/"},
                    {"Name": "suffix", "Value": ".parquet"}
                ]
            }
        }
    }]
}

This triggers a Lambda function whenever a .parquet file is uploaded to the bronze/ prefix.
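On the Lambda side, the handler receives the event as a dict of Records. A minimal sketch of the receiving end (the handler body past the parsing is left as a placeholder): note that object keys arrive URL-encoded, so spaces come through as plus signs and must be decoded.

```python
import urllib.parse


def handler(event, context=None):
    """Minimal Lambda handler for an S3 event notification: extract the
    bucket and (URL-decoded) key from each record."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # ...download and process the file here...
        processed.append(f"s3://{bucket}/{key}")
    return processed
```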

Performance Optimization

  1. Use Parquet — columnar compression reduces data scanned by 80-90%
  2. Partition data — Hive-style partitions enable query pruning
  3. Right-size files — aim for 128 MB to 1 GB per file. Too many small files hurt performance.
  4. Use S3 Transfer Acceleration for cross-region uploads
  5. Multipart upload for large files (AWS recommends it above 100 MB; boto3's upload_file switches to it automatically above a configurable threshold)
  6. Use S3 Select to filter data server-side before downloading
  7. Prefix distribution — distribute objects across different prefixes to avoid throttling on high-request workloads

The Small Files Problem

Many Spark/Glue jobs produce thousands of tiny files. Each file requires a separate API call to read, slowing queries dramatically.

Solutions:

  • Coalesce Spark output: df.coalesce(10).write.parquet(path)
  • Run compaction jobs to merge small files periodically
  • Use Delta Lake or Iceberg, which handle file management automatically
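The planning half of a compaction job is plain bookkeeping. This sketch (function name and the 512 MB target are illustrative assumptions) groups small files into batches of roughly equal total size; each batch would then be read and rewritten as a single file by Spark or pandas.

```python
def plan_compaction(objects, target_bytes=512 * 1024 * 1024):
    """Group (key, size_bytes) pairs into batches whose total size is
    roughly target_bytes. Each batch becomes one output file when a
    compaction job merges its inputs."""
    batches, batch, size = [], [], 0
    for key, nbytes in objects:
        if batch and size + nbytes > target_bytes:
            batches.append(batch)
            batch, size = [], 0
        batch.append(key)
        size += nbytes
    if batch:
        batches.append(batch)
    return batches
```

The merge step itself is then a read-concat-write over each batch; with s3fs installed, pandas can do it directly against s3:// paths.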

Common Mistakes

  1. Using s3:* in IAM policies — too broad. Grant only the specific actions needed.
  2. Not blocking public access — data breaches from public S3 buckets make headlines regularly.
  3. Storing credentials in code — use IAM roles for services, not access keys.
  4. Ignoring storage classes — leaving old data in S3 Standard wastes money.
  5. No lifecycle rules — terabytes of temp files accumulating forever.
  6. Not enabling versioning — one accidental delete and the data is gone.
  7. Too many small files — kills Athena and Spark query performance.
  8. Not encrypting — enable default encryption on every bucket.

Interview Questions

Q: What is Amazon S3?
A: S3 is an object storage service that stores unlimited data as objects inside buckets. It offers 11 nines durability, multiple storage classes for cost optimization, and integrates with every AWS data service.

Q: What is the difference between S3 and ADLS Gen2?
A: Both are cloud object storage for data lakes. ADLS Gen2 has a hierarchical namespace (real directories, atomic rename), while S3 has a flat namespace (virtual folders). S3 uses IAM policies and bucket policies for access control; ADLS Gen2 uses RBAC and POSIX ACLs. Both support Parquet, partitioning, and query-in-place.

Q: How do you secure an S3 bucket?
A: Block public access, enable default encryption (SSE-S3 or SSE-KMS), enforce HTTPS with a bucket policy, use IAM roles instead of access keys, enable versioning, and apply least-privilege IAM policies.

Q: What are S3 storage classes?
A: S3 Standard (frequent access), Standard-IA (infrequent), One Zone-IA (non-critical infrequent), Glacier Instant Retrieval (archive with instant access), Glacier Flexible Retrieval (minutes-to-hours retrieval), and Deep Archive (12-48 hours). Use lifecycle rules to transition automatically.

Q: How would you design a data lake on S3?
A: Use the Bronze/Silver/Gold pattern with Hive-style partitioning (year/month/day). Store raw data in Bronze as Parquet, clean in Silver, and aggregate in Gold. Use lifecycle rules for cost management, Athena for querying, and Glue for ETL.

Q: What is the small files problem in S3?
A: Too many small files (under 128 MB) cause performance issues because each file requires a separate API call. Solutions include coalescing Spark output, running compaction jobs, and using Delta Lake or Iceberg for automatic file management.

Wrapping Up

S3 is to AWS what ADLS Gen2 is to Azure — the foundation of the data lake. Understanding buckets, storage classes, IAM, encryption, and the Bronze/Silver/Gold pattern gives you the foundation to build data platforms on either cloud.

If you already know Azure storage, the concepts translate directly. The main differences are naming (buckets vs containers), access control (IAM policies vs RBAC), and namespace (flat vs hierarchical).

Related posts:

  • Azure Blob Storage Guide
  • ADLS Gen2 Complete Guide
  • Parquet vs CSV vs JSON
  • Python for Data Engineers
  • Building a REST API with FastAPI on AWS Lambda

If this guide helped you understand S3, share it with someone building on AWS. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
