Azure Fundamentals for Data Engineers: Subscriptions, Resource Groups, IAM, and Everything You Need Before Starting a Project
Before you create your first Azure Data Factory pipeline or provision an ADLS Gen2 storage account, you need to understand the foundation that everything sits on. Subscriptions, resource groups, IAM roles, regions, tags, cost management — these are not glamorous topics, but getting them wrong causes real problems.
I have seen data engineers who can build complex metadata-driven pipelines but cannot explain why their colleague cannot access the storage account they just created. Or why their pipeline fails with a permissions error even though “they are an admin.” Or why their Azure bill doubled last month.
This post covers every Azure fundamental a data engineer needs to know before starting a project. Think of it as the prerequisite knowledge that Azure tutorials skip because they assume you already know it.
Table of Contents
- The Azure Hierarchy: How Everything Is Organized
- Azure Active Directory (Entra ID)
- Subscriptions
- Resource Groups
- Resources
- Regions and Availability Zones
- IAM: Identity and Access Management
- The Most Important RBAC Roles for Data Engineers
- Managed Identity
- Azure Key Vault
- Tags and Resource Organization
- Cost Management and Budgets
- Networking Basics for Data Engineers
- Azure CLI and PowerShell
- Setting Up a New Data Engineering Project: The Checklist
- Common Mistakes New Data Engineers Make
- Interview Questions
- Wrapping Up
The Azure Hierarchy: How Everything Is Organized
Azure resources are organized in a strict hierarchy:
Azure Active Directory (Entra ID) Tenant
|
|-- Management Groups (optional, for large organizations)
    |
    |-- Subscriptions (billing boundary)
        |
        |-- Resource Groups (logical containers)
            |
            |-- Resources (the actual services)
                Examples: Storage Account, Data Factory,
                SQL Database, Synapse Workspace, Key Vault
Every Azure resource lives inside a Resource Group, which lives inside a Subscription, which lives inside a Tenant. Understanding this hierarchy is essential because permissions, billing, and policies are applied at each level and inherited downward.
Azure Active Directory (Entra ID)
What It Is
Azure Active Directory (renamed Microsoft Entra ID in 2023) is Azure's identity and access management service. It is the directory that stores:
- Users — people with email addresses (naveen@company.com)
- Groups — collections of users (DataEngineers, Admins)
- Service Principals — identities for applications and automation
- Managed Identities — identities assigned to Azure resources (like ADF)
Tenant
A Tenant is a single, dedicated instance of Entra ID. Most organizations have one tenant, although large or merged companies sometimes run several. When you sign up for Azure with your company email, your account belongs to your company’s tenant.
Tenant: contoso.onmicrosoft.com
|-- Users: naveen@contoso.com, alice@contoso.com
|-- Groups: DataEngineers, CloudAdmins
|-- Service Principals: sp-github-deploy, sp-databricks
|-- Managed Identities: naveen-datafactory-de, naveen-synapse-ws
Why It Matters for Data Engineers
Every time you:
- Grant someone access to a storage account: you are adding an Entra ID user to an RBAC role
- Create a linked service with Managed Identity: ADF uses its Entra ID identity to authenticate
- Set up CI/CD with GitHub Actions: a Service Principal in Entra ID authenticates the deployment
- Connect to Azure SQL with Azure AD auth: your Entra ID identity is verified
You cannot escape Entra ID. It is the backbone of all authentication and authorization in Azure.
Subscriptions
What It Is
A Subscription is the billing container for Azure resources. Every resource you create lives in a subscription, and all costs for that resource are charged to that subscription.
Why Multiple Subscriptions?
Organizations typically have multiple subscriptions to separate:
Subscription: Dev/Test ($500/month budget)
|-- Resource Groups for development workloads
|-- Lower-tier resources (smaller VMs, basic storage)
|-- Developer access
Subscription: Production ($5,000/month budget)
|-- Resource Groups for production workloads
|-- Higher-tier resources (premium storage, dedicated pools)
|-- Restricted access (only DevOps team)
Subscription: Sandbox ($200/month budget)
|-- Personal experimentation
|-- Temporary resources
|-- Auto-delete policies
Subscription Limits
Every subscription has limits (quotas) on resources:
| Resource | Default Limit |
|---|---|
| Storage accounts per region | 250 |
| Resource groups per subscription | 980 |
| VMs per region | 25,000 |
| Azure Data Factories per subscription | 50 |
| SQL databases per server | 5,000 |
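Some of these limits can be raised through an Azure support request, while others are fixed. Defaults also change over time, so confirm the current numbers in the Azure subscription and service limits documentation before you design around them.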
Subscription ID
Every subscription has a unique GUID. You will use this frequently:
# Find your subscription ID
az account show --query id -o tsv
# Set a specific subscription as default
az account set --subscription "your-subscription-id"
Many CLI commands and ARM templates require the subscription ID.
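To see where the subscription ID ends up, here is a minimal Python sketch that splits an Azure resource ID into its parts. The `/subscriptions/.../resourceGroups/.../providers/...` layout is the standard ARM format; the function name `parse_resource_id` and the all-zero GUID are made up for illustration, and nested child resources are not handled.

```python
from typing import Dict, Optional


def parse_resource_id(resource_id: str) -> Dict[str, Optional[str]]:
    """Split an ARM resource ID into subscription, resource group, and resource parts."""
    parts = resource_id.strip("/").split("/")
    # ARM segment keywords are case-insensitive, so compare them in lowercase.
    lowered = [p.lower() for p in parts]
    result: Dict[str, Optional[str]] = {
        "subscription_id": None,
        "resource_group": None,
        "provider": None,
        "resource_type": None,
        "resource_name": None,
    }
    if len(parts) >= 2 and lowered[0] == "subscriptions":
        result["subscription_id"] = parts[1]
    if len(parts) >= 4 and lowered[2] == "resourcegroups":
        result["resource_group"] = parts[3]
    if len(parts) >= 8 and lowered[4] == "providers":
        result["provider"] = parts[5]
        result["resource_type"] = parts[6]
        result["resource_name"] = parts[7]
    return result


# Hypothetical subscription GUID, used only for this example.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-dataplatform-dev"
    "/providers/Microsoft.Storage/storageAccounts/naveendatalake"
)
parsed = parse_resource_id(example_id)
print(parsed["resource_group"])  # rg-dataplatform-dev
print(parsed["resource_name"])   # naveendatalake
```

The same path format is what you pass as a scope when listing or creating role assignments, which is why the subscription ID shows up so often.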
Resource Groups
What It Is
A Resource Group is a logical container that holds related Azure resources. It is the primary way to organize, manage, and control access to resources.
How to Organize Resource Groups
Option A: By project
rg-project-alpha
|-- adf-project-alpha
|-- storage-project-alpha
|-- sqldb-project-alpha
|-- synapse-project-alpha
rg-project-beta
|-- adf-project-beta
|-- storage-project-beta
Option B: By environment
rg-dev
|-- adf-dev, storage-dev, sqldb-dev
rg-uat
|-- adf-uat, storage-uat, sqldb-uat
rg-prod
|-- adf-prod, storage-prod, sqldb-prod
Option C: By service type
rg-data-engineering
|-- All ADF, Synapse, storage resources
rg-databases
|-- All SQL databases
rg-networking
|-- VNets, private endpoints, NSGs
Most enterprise teams use a combination: rg-{project}-{environment} like rg-dataplatform-prod.
Key Rules
- A resource can only belong to one resource group
- Resource groups have a region but resources inside can be in any region
- Deleting a resource group deletes EVERYTHING inside it — be very careful
- RBAC roles assigned at the resource group level are inherited by all resources inside
- Tags on the resource group can be inherited by resources (with Azure Policy)
When to Delete a Resource Group
Deleting the resource group is the fastest way to clean up a sandbox or test environment:
# WARNING: Deletes everything inside the resource group
az group delete --name rg-sandbox-test --yes --no-wait
This is why sandbox/dev resources should be in separate resource groups from production.
Resources
What It Is
A Resource is an individual Azure service instance — the actual thing you use. Examples:
| Resource Type | Example Name | What It Does |
|---|---|---|
| Storage Account | naveendatalake | Stores blobs, files, tables, queues |
| Data Factory | naveen-datafactory-de | Runs data pipelines |
| Synapse Workspace | naveen-synapse-ws | Analytics platform |
| SQL Database | AdventureWorksLT | Relational database |
| Key Vault | naveen-keyvault | Stores secrets and certificates |
| Virtual Machine | vm-selfhosted-ir | Runs Self-Hosted IR |
Naming Conventions
Azure resource names have rules and best practices:
| Resource | Naming Rule | Example |
|---|---|---|
| Storage Account | Lowercase letters and numbers only, 3-24 chars, globally unique | naveendatalake |
| Data Factory | 3-63 chars, letters/numbers/hyphens | naveen-datafactory-de |
| Resource Group | 1-90 chars, most characters allowed | rg-dataplatform-prod |
| Key Vault | 3-24 chars, letters/numbers/hyphens | kv-dataplatform-prod |
| SQL Server | 1-63 chars, lowercase, hyphens allowed | sql-dataplatform-prod |
Best practice naming pattern: {resource-type}-{project}-{environment}
rg-dataplatform-dev (resource group)
adf-dataplatform-dev (data factory)
stdataplatformdev (storage account: hyphens are not allowed, so the separators are dropped)
kv-dataplatform-dev (key vault)
sql-dataplatform-dev (sql server)
syn-dataplatform-dev (synapse workspace)
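The naming pattern and the rules in the table can be encoded in a small helper. This is a sketch under stated assumptions: the regular expressions are simplified versions of the table rows (the real rules add details such as start and end characters), and the names `PREFIXES`, `build_name`, and `is_valid_name` are hypothetical.

```python
import re

# Prefixes follow the examples above; treat them as a team convention, not an Azure rule.
PREFIXES = {
    "resource_group": "rg",
    "data_factory": "adf",
    "storage_account": "st",
    "key_vault": "kv",
    "sql_server": "sql",
    "synapse_workspace": "syn",
}

# Simplified versions of the naming rules from the table above.
NAME_RULES = {
    "storage_account": re.compile(r"^[a-z0-9]{3,24}$"),
    "data_factory": re.compile(r"^[A-Za-z0-9-]{3,63}$"),
    "key_vault": re.compile(r"^[A-Za-z0-9-]{3,24}$"),
    "sql_server": re.compile(r"^[a-z0-9-]{1,63}$"),
}


def build_name(resource_type: str, project: str, environment: str) -> str:
    """Build {resource-type}-{project}-{environment}, dropping hyphens for storage accounts."""
    parts = [PREFIXES[resource_type], project.lower(), environment.lower()]
    separator = "" if resource_type == "storage_account" else "-"
    return separator.join(parts)


def is_valid_name(resource_type: str, name: str) -> bool:
    """Check a name against the simplified rule for its type (True when no rule is listed)."""
    rule = NAME_RULES.get(resource_type)
    return True if rule is None else bool(rule.match(name))


print(build_name("resource_group", "dataplatform", "dev"))      # rg-dataplatform-dev
print(build_name("storage_account", "dataplatform", "dev"))     # stdataplatformdev
print(is_valid_name("storage_account", "st-dataplatform-dev"))  # False
```

Generating names from one function keeps every environment consistent and catches the storage account hyphen problem before a deployment fails.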
Regions and Availability Zones
Regions
Azure has 60+ regions worldwide. A region is a geographic location with one or more data centers.
Common regions for data engineering:
| Region | Code | Use Case |
|---|---|---|
| East US | eastus | Default for North America |
| East US 2 | eastus2 | Secondary NA region |
| Canada Central | canadacentral | Canadian data residency |
| West Europe | westeurope | European workloads |
| North Europe | northeurope | European secondary |
| Central India | centralindia | Indian workloads |
| Southeast Asia | southeastasia | APAC workloads |
Region Selection Matters
Performance: Place related resources in the same region. Cross-region data transfer is slower.
GOOD: ADF (East US) + ADLS Gen2 (East US) + SQL (East US)
BAD: ADF (East US) + ADLS Gen2 (West Europe) + SQL (Central India)
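Cost: Prices vary by region, so compare your main services in the Azure pricing calculator. East US is often among the cheaper North American regions, but that is not guaranteed for every service.
Compliance: Some regulations require or strongly favor keeping data in specific geographies:
- GDPR: restricts transfers of personal data outside the EU/EEA, so EU regions are the usual choice
- Canadian privacy laws: may require Canada Central or Canada East
- Indian regulations: may require Central India or South India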
Availability: Not all services are available in all regions. Check Azure Products by Region before choosing.
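A quick way to catch the GOOD versus BAD layout shown earlier is to group a project's resources by region. The sketch below is illustrative only: the inventory is a hand-written dictionary, and `group_by_region` and `is_colocated` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_region(resources: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each region code to the resource names deployed there."""
    regions: Dict[str, List[str]] = defaultdict(list)
    for name, region in resources.items():
        regions[region].append(name)
    return dict(regions)


def is_colocated(resources: Dict[str, str]) -> bool:
    """True when every resource sits in the same region."""
    return len(set(resources.values())) <= 1


good = {"adf": "eastus", "adls": "eastus", "sql": "eastus"}
bad = {"adf": "eastus", "adls": "westeurope", "sql": "centralindia"}

print(is_colocated(good))    # True
print(is_colocated(bad))     # False
print(group_by_region(bad))  # one entry per region in use
```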
Availability Zones
Within a region, Availability Zones are physically separate data centers with independent power, cooling, and networking. They provide high availability:
East US Region
|-- Zone 1 (Data Center A)
|-- Zone 2 (Data Center B)
|-- Zone 3 (Data Center C)
Zone-redundant storage (ZRS) replicates data across all three zones. If one data center fails, your data is still accessible.
IAM: Identity and Access Management
What It Is
IAM (Identity and Access Management) controls who can do what on which Azure resources. It uses Role-Based Access Control (RBAC) where you assign roles to identities at a specific scope.
The RBAC Model
Who (Identity) + What (Role) + Where (Scope)
Example:
- Who: naveen@company.com (user identity)
- What: Storage Blob Data Contributor (role)
- Where: naveendatalake storage account (scope)
- Result: Naveen can read, write, and delete blobs in that storage account
Scope Hierarchy
Roles can be assigned at different levels, and permissions inherit downward:
Management Group (broadest)
|-- Subscription
    |-- Resource Group
        |-- Resource (most specific)
A role assigned at the Subscription level applies to ALL resource groups and resources in that subscription. A role assigned at a specific resource only applies to that resource.
Best practice: Assign roles at the most specific scope possible. Do not give Subscription-level Contributor to someone who only needs access to one storage account.
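Scope inheritance can be modeled as a prefix check on resource IDs: an assignment applies to a resource when its scope is that resource or one of its ancestors. The following is a simplified sketch, not how Azure evaluates access internally. It ignores management groups and deny assignments, and `RoleAssignment`, `applies_to`, `effective_roles`, and the all-zero GUID are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RoleAssignment:
    principal: str  # who
    role: str       # what
    scope: str      # where, as an ARM path


def applies_to(assignment_scope: str, resource_id: str) -> bool:
    """An assignment covers a resource when its scope is the resource or an ancestor of it."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # Appending "/" stops rg-dev from matching rg-dev2.
    return target == scope or target.startswith(scope + "/")


def effective_roles(principal: str, resource_id: str,
                    assignments: List[RoleAssignment]) -> List[str]:
    """Collect every role the principal holds on the resource, including inherited ones."""
    return [a.role for a in assignments
            if a.principal == principal and applies_to(a.scope, resource_id)]


SUB = "/subscriptions/00000000-0000-0000-0000-000000000000"  # hypothetical GUID
RG_DEV = SUB + "/resourceGroups/rg-dataplatform-dev"
STORAGE_DEV = RG_DEV + "/providers/Microsoft.Storage/storageAccounts/naveendatalake"
STORAGE_PROD = (SUB + "/resourceGroups/rg-dataplatform-prod"
                "/providers/Microsoft.Storage/storageAccounts/naveendatalakeprod")

assignments = [
    RoleAssignment("naveen@company.com", "Reader", SUB),
    RoleAssignment("naveen@company.com", "Storage Blob Data Contributor", STORAGE_DEV),
    RoleAssignment("alice@company.com", "Contributor", RG_DEV),
]

print(effective_roles("naveen@company.com", STORAGE_DEV, assignments))
print(effective_roles("naveen@company.com", STORAGE_PROD, assignments))
print(effective_roles("alice@company.com", STORAGE_PROD, assignments))
```

The subscription-level Reader role follows Naveen into every resource group, while the narrowly scoped data role stays on the one storage account. That difference is the practical argument for assigning at the most specific scope.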
Built-In Role Types
Azure has 100+ built-in roles, but they fall into four categories:
| Category | Example Roles | What They Control |
|---|---|---|
| Owner | Owner | Full access + can assign roles to others |
| Contributor | Contributor | Full access except role assignments |
| Reader | Reader | View-only access |
| Specialized | Storage Blob Data Contributor, SQL DB Contributor | Specific service access |
How to Assign a Role
- Navigate to the resource (or resource group/subscription)
- Click Access Control (IAM)
- Click + Add role assignment
- Select the role (e.g., Storage Blob Data Contributor)
- Click Next
- Select members (users, groups, or managed identities)
- Click Review + assign
The Most Important RBAC Roles for Data Engineers
| Role | What It Allows | When You Need It |
|---|---|---|
| Storage Blob Data Reader | Read blobs | Synapse Serverless querying ADLS |
| Storage Blob Data Contributor | Read + write + delete blobs | ADF/Synapse writing pipeline output to ADLS |
| Storage Blob Data Owner | Full blob control + ACL management | Admin managing data lake permissions |
| Contributor (on Resource Group) | Create/manage all resources in the RG | Setting up ADF, Synapse, storage in a dev environment |
| Data Factory Contributor | Manage ADF resources (not data access) | Building pipelines in ADF |
| SQL DB Contributor | Manage SQL databases | Creating and configuring databases |
| Key Vault Secrets User | Read secrets from Key Vault | ADF linked services reading connection strings |
| Synapse Contributor | Manage Synapse workspace resources | Building pipelines and notebooks in Synapse |
The #1 IAM Mistake Data Engineers Make
Assigning Storage Account Contributor when they need Storage Blob Data Contributor:
Storage Account Contributor:
- Can manage the storage account (keys, settings, networking)
- Cannot read or write blob data through Entra ID (RBAC) authorization, although it can list the account keys
Storage Blob Data Contributor:
- Can read, write, and delete blob data
- Cannot manage the storage account itself
These are completely different roles. Your ADF managed identity needs Storage Blob Data Contributor to write Parquet files. If you assign Storage Account Contributor, the pipeline fails with a permissions error.
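The reason the two roles behave so differently is that Azure role definitions keep management operations (`actions`) separate from data operations (`dataActions`). The sketch below models that split with heavily abridged, hand-written role definitions. The operation strings follow Azure's naming, but the lists are incomplete and the `allows` function is a hypothetical simplification of the real evaluation.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Abridged role definitions written by hand for this sketch; the real ones contain more entries.
ROLES: Dict[str, Dict[str, List[str]]] = {
    "Storage Account Contributor": {
        "actions": ["Microsoft.Storage/storageAccounts/*"],
        "dataActions": [],
    },
    "Storage Blob Data Contributor": {
        "actions": ["Microsoft.Storage/storageAccounts/blobServices/containers/read"],
        "dataActions": [
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
        ],
    },
}


def allows(role: str, operation: str, data_plane: bool) -> bool:
    """Check an operation against the matching list: dataActions for data, actions otherwise."""
    patterns = ROLES[role]["dataActions" if data_plane else "actions"]
    return any(fnmatchcase(operation.lower(), p.lower()) for p in patterns)


BLOB_WRITE = "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write"
LIST_KEYS = "Microsoft.Storage/storageAccounts/listKeys/action"

print(allows("Storage Account Contributor", BLOB_WRITE, data_plane=True))    # False
print(allows("Storage Blob Data Contributor", BLOB_WRITE, data_plane=True))  # True
print(allows("Storage Account Contributor", LIST_KEYS, data_plane=False))    # True
```

Even though the wildcard in Storage Account Contributor looks like it covers everything under the storage account, it sits in `actions`, so it never grants a blob write made with a managed identity.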
Managed Identity
What It Is
A Managed Identity is an Azure-managed identity automatically created for certain Azure services. It eliminates the need for passwords, connection strings, or API keys in your pipeline configurations.
Two Types
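System-assigned: Enabled directly on a resource and tied to its lifecycle, so it is deleted when the resource is deleted. Synapse workspaces, and Data Factories created in the portal, get one automatically; on many other services you switch it on yourself.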
Azure Data Factory: naveen-datafactory-de
System-assigned Managed Identity: naveen-datafactory-de (same name)
Object ID: xxxx-xxxx-xxxx
Synapse Workspace: naveen-synapse-ws
System-assigned Managed Identity: naveen-synapse-ws (same name)
Object ID: yyyy-yyyy-yyyy
User-assigned: Created independently and can be shared across multiple resources.
Why Managed Identity Matters
Without Managed Identity (the old way):
Linked Service: LS_ADLS_Gen2
Authentication: Account Key
Key: "xK3j8d9f...long secret string..."
Problem: Key is stored in ADF, can be leaked, must be rotated
With Managed Identity (the recommended way):
Linked Service: LS_ADLS_Gen2
Authentication: Managed Identity
No passwords, no keys, nothing to leak
Azure handles authentication behind the scenes
How to Use Managed Identity
- Create your ADF or Synapse workspace (managed identity is created automatically)
- Go to the target resource (e.g., Storage Account)
- Access Control (IAM) > + Add role assignment
- Assign the role (e.g., Storage Blob Data Contributor)
- Select member: search for your ADF/Synapse workspace name
- Save
Now your linked service can authenticate using Managed Identity — no passwords needed.
Azure Key Vault
What It Is
Azure Key Vault is a centralized secrets management service. It securely stores:
- Secrets — connection strings, passwords, API keys
- Certificates — SSL/TLS certificates
- Keys — encryption keys
Why Data Engineers Need It
Instead of hardcoding database passwords in your ADF linked services:
BAD: Linked Service with password "MyP@ssw0rd123" hardcoded
GOOD: Linked Service referencing Key Vault secret "sql-connection-password"
If the password changes, you update Key Vault once. All linked services that reference it automatically get the new password.
Setting Up Key Vault for ADF
- Create a Key Vault: kv-dataplatform-dev
- Add a secret: Name = sql-admin-password, Value = MyP@ssw0rd123
- Grant ADF access: assign Key Vault Secrets User role to the ADF managed identity
- In ADF, create a linked service to Key Vault: LS_KeyVault
- In your SQL linked service, instead of typing the password:
  - Select Azure Key Vault
  - Reference: LS_KeyVault
  - Secret name: sql-admin-password
Per-Environment Key Vaults
Each environment should have its own Key Vault:
kv-dataplatform-dev --> dev SQL password, dev storage key
kv-dataplatform-uat --> uat SQL password, uat storage key
kv-dataplatform-prod --> prod SQL password, prod storage key
When CI/CD deploys to prod, it references kv-dataplatform-prod which has the production secrets. No secrets are shared across environments.
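As a rough illustration of how the per-environment pattern plays out, the sketch below derives the vault name for each environment and builds the JSON fragment that an ADF linked service stores in place of a literal password. The `AzureKeyVaultSecret` reference shape follows ADF's linked service JSON, and the vault URI format assumes the Azure public cloud. The helper names are hypothetical.

```python
import json
from typing import Any, Dict


def key_vault_name(project: str, environment: str) -> str:
    """Name the vault for one environment, following kv-{project}-{environment}."""
    return f"kv-{project}-{environment}"


def secret_uri(vault_name: str, secret_name: str) -> str:
    """Build the URI of a secret (Azure public cloud DNS suffix assumed)."""
    return f"https://{vault_name}.vault.azure.net/secrets/{secret_name}"


def key_vault_secret_reference(linked_service: str, secret_name: str) -> Dict[str, Any]:
    """Build the fragment a linked service uses instead of a hardcoded password."""
    return {
        "type": "AzureKeyVaultSecret",
        "store": {"referenceName": linked_service, "type": "LinkedServiceReference"},
        "secretName": secret_name,
    }


for env in ("dev", "uat", "prod"):
    vault = key_vault_name("dataplatform", env)
    print(secret_uri(vault, "sql-admin-password"))

# The reference itself is identical in every environment; only the vault behind LS_KeyVault changes.
print(json.dumps(key_vault_secret_reference("LS_KeyVault", "sql-admin-password"), indent=2))
```

Because the pipeline JSON only names the secret, promoting it from dev to prod means changing where LS_KeyVault points, not editing any password.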
Tags and Resource Organization
What Are Tags
Tags are key-value pairs you attach to resources for organization, cost tracking, and governance.
Resource: naveen-datafactory-de
Tags:
Environment: Development
Project: DataPlatform
Owner: naveen@company.com
CostCenter: DE-001
ManagedBy: Terraform
Why Tags Matter
Cost tracking: Filter your Azure bill by tag to see costs per project:
Project: DataPlatform --> $450/month
Project: MLPipeline --> $200/month
Project: WebApp --> $150/month
Governance: Azure Policy can enforce tags:
- “Every resource must have an Environment tag”
- “Every resource must have a CostCenter tag”
Automation: Scripts can target resources by tag:
# Delete all resources tagged as Environment=Sandbox
az resource list --tag Environment=Sandbox --query "[].id" -o tsv | xargs az resource delete --ids
Recommended Tags for Data Engineering
| Tag Key | Example Values | Purpose |
|---|---|---|
| Environment | dev, uat, prod, sandbox | Identify the environment |
| Project | DataPlatform, Analytics | Group by project |
| Owner | naveen@company.com | Who is responsible |
| CostCenter | DE-001 | Charge to the right budget |
| ManagedBy | manual, terraform, bicep | How the resource was created |
| DataClassification | public, internal, confidential | Security classification |
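A tagging standard is only useful if someone checks it. The sketch below audits a list of resources for missing tags and totals cost per tag value. The resource dictionaries loosely mirror the `name` and `tags` fields in `az resource list -o json` output, but the `monthly_cost` field, the `orphaned-storage` resource, and the dollar figures are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence

# A subset of the recommended tags, treated here as mandatory (an assumption of this sketch).
REQUIRED_TAGS = ("Environment", "Project", "Owner", "CostCenter")


def missing_tags(resource: Dict[str, Any], required: Sequence[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag keys that are absent or empty on a resource."""
    tags = resource.get("tags") or {}  # untagged resources may carry null instead of {}
    return [key for key in required if not tags.get(key)]


def cost_by_tag(resources: List[Dict[str, Any]], tag_key: str) -> Dict[str, float]:
    """Sum the hypothetical monthly_cost field per value of one tag."""
    totals: Dict[str, float] = defaultdict(float)
    for resource in resources:
        value = (resource.get("tags") or {}).get(tag_key, "untagged")
        totals[value] += resource.get("monthly_cost", 0.0)
    return dict(totals)


resources = [
    {"name": "naveen-datafactory-de", "monthly_cost": 450.0,
     "tags": {"Environment": "dev", "Project": "DataPlatform",
              "Owner": "naveen@company.com", "CostCenter": "DE-001"}},
    {"name": "vm-selfhosted-ir", "monthly_cost": 200.0,
     "tags": {"Environment": "dev", "Project": "MLPipeline"}},
    {"name": "orphaned-storage", "monthly_cost": 150.0, "tags": None},
]

for r in resources:
    print(r["name"], "missing:", missing_tags(r))
print(cost_by_tag(resources, "Project"))
```

The "untagged" bucket is the number to watch: it is the share of the bill nobody can explain.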
Cost Management and Budgets
Understanding Azure Costs
Azure billing has multiple components:
| Component | How It Is Charged |
|---|---|
| Storage | Per GB stored per month + per operation (read/write) |
| Compute | Per hour/minute of VM or serverless execution |
| Data transfer | Free within same region, charged across regions |
| ADF pipelines | Per activity run + per DIU-hour for data movement |
| Synapse | Per activity run + per DWU-hour for SQL pools + per vCore-hour for Spark |
Setting Up Budget Alerts
- Go to Cost Management + Billing > Budgets
- Click + Add
- Set the budget amount (e.g., $500/month)
- Set alert thresholds (e.g., 50%, 75%, 100%)
- Configure email notifications
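- Optionally attach an action group that triggers automation (for example, a runbook that shuts resources down) at 100%. A budget on its own only sends alerts; it does not stop spending.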
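The threshold logic behind those alerts is simple percentage arithmetic. Here is a minimal sketch using the example budget and thresholds from the steps above. The function name `crossed_thresholds` and the spend figures are hypothetical.

```python
from typing import List, Sequence


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[int] = (50, 75, 100)) -> List[int]:
    """Return every alert threshold (in percent) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = spend / budget * 100
    return [t for t in sorted(thresholds) if percent_used >= t]


print(crossed_thresholds(200, 500))  # []            (40% of budget)
print(crossed_thresholds(390, 500))  # [50, 75]      (78% of budget)
print(crossed_thresholds(520, 500))  # [50, 75, 100] (104% of budget)
```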
Cost Saving Tips for Data Engineers
- Stop Azure-SSIS IR when not running packages (saves $600+/month)
- Use Serverless SQL in Synapse instead of Dedicated SQL Pool for ad-hoc queries
- Right-size Data Flow clusters — do not use 272 cores when 8 is enough
- Use Lifecycle Management to move old ADLS data to Cool/Archive tiers
- Delete unused resources — orphaned storage accounts, test VMs, unused databases
- Use reserved capacity for production resources you run 24/7 (up to 72% savings)
- Pause Synapse Dedicated SQL Pools when idle (manually or on a schedule) and enable auto-pause on Spark pools
Networking Basics for Data Engineers
Public vs Private Access
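By default, most Azure data services expose a public endpoint that is reachable from the internet, subject to authentication and firewall rules. For production, restrict access to private endpoints:
Public endpoint: naveendatalake.blob.core.windows.net resolves to a public IP (reachable from anywhere)
Private endpoint: the same hostname resolves, via naveendatalake.privatelink.blob.core.windows.net, to a private IP inside your VNet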
Virtual Network (VNet)
A VNet is your private network in Azure. Resources inside a VNet can communicate privately without going through the internet.
Private Endpoints
A Private Endpoint gives an Azure resource a private IP address inside your VNet:
Before (public): ADF --> internet --> Storage Account (public IP)
After (private): ADF --> VNet --> Storage Account (private IP, no internet)
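When you suspect a private endpoint is not being used, a useful first check is whether the storage hostname resolves to a private IP address. The sketch below only classifies an address you already have. Both IPs are arbitrary examples, not the real addresses of any storage account, and the function names are hypothetical.

```python
import ipaddress


def is_private_address(ip: str) -> bool:
    """True for addresses in private ranges such as 10.0.0.0/8."""
    return ipaddress.ip_address(ip).is_private


def describe_resolution(hostname: str, resolved_ip: str) -> str:
    """Label a DNS result as going through a private or a public endpoint."""
    kind = "private endpoint" if is_private_address(resolved_ip) else "public endpoint"
    return f"{hostname} -> {resolved_ip} ({kind})"


HOST = "naveendatalake.blob.core.windows.net"
print(describe_resolution(HOST, "10.0.1.4"))   # as seen from inside the VNet
print(describe_resolution(HOST, "20.60.0.1"))  # as seen from the public internet
```

In practice you would feed in the result of `socket.gethostbyname(HOST)`, run from the machine that hosts your integration runtime, since the answer depends on where the lookup happens.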
When Data Engineers Need Networking Knowledge
- Setting up Self-Hosted IR (needs outbound internet access)
- Configuring Managed VNet for Azure IR (private data access)
- Creating Private Endpoints for storage, SQL, Key Vault
- Understanding why a pipeline fails with “not authorized” (might be a network issue, not IAM)
Azure CLI and PowerShell
Azure CLI Essentials
Every data engineer should know these commands:
# Login
az login
# List subscriptions
az account list -o table
# Set default subscription
az account set --subscription "subscription-name-or-id"
# List resource groups
az group list -o table
# List resources in a resource group
az resource list --resource-group rg-dataplatform-dev -o table
# Create a resource group
az group create --name rg-test --location eastus
# Delete a resource group (DANGEROUS)
az group delete --name rg-test --yes
# List storage accounts
az storage account list -o table
# Get ADF details (uses the datafactory CLI extension)
az datafactory show --resource-group rg-dev --factory-name adf-dev
# List role assignments at resource group scope
az role assignment list --scope "/subscriptions/{sub-id}/resourceGroups/{rg}"
Azure PowerShell Equivalents
# Login
Connect-AzAccount
# List subscriptions
Get-AzSubscription
# Set subscription
Set-AzContext -SubscriptionId "subscription-id"
# List resource groups
Get-AzResourceGroup | Format-Table
# Create resource group
New-AzResourceGroup -Name rg-test -Location eastus
Setting Up a New Data Engineering Project: The Checklist
When starting a new data engineering project on Azure, follow this order:
Phase 1: Foundation
- [ ] Identify the subscription to use (or request a new one)
- [ ] Create a resource group with proper naming convention
- [ ] Decide on the region (consider compliance, cost, performance)
- [ ] Set up tagging standards (Environment, Project, Owner, CostCenter)
- [ ] Create a budget with email alerts
Phase 2: Security
- [ ] Create an Azure Key Vault for secrets
- [ ] Define IAM roles — who needs access to what
- [ ] Set up Entra ID groups (DataEngineers, DataAdmins) instead of assigning roles to individuals
- [ ] Decide on Managed Identity vs key-based authentication for services
Phase 3: Storage
- [ ] Create ADLS Gen2 storage account (enable hierarchical namespace)
- [ ] Create containers (bronze/silver/gold or raw/curated/analytics)
- [ ] Assign Storage Blob Data Contributor to ADF/Synapse managed identity
- [ ] Set up Lifecycle Management policies for cost optimization
Phase 4: Data Integration
- [ ] Create Azure Data Factory or Synapse Workspace
- [ ] Connect to GitHub or Azure DevOps for source control
- [ ] Create linked services (SQL, ADLS, Key Vault)
- [ ] Create parameterized datasets for reusability
- [ ] Build metadata-driven pipelines (Lookup, ForEach, Copy)
Phase 5: Operations
- [ ] Set up CI/CD with GitHub Actions or Azure DevOps
- [ ] Configure monitoring and alerts for pipeline failures
- [ ] Document naming conventions and standards
- [ ] Create a runbook for common operations (restart IR, rerun pipeline, etc.)
Common Mistakes New Data Engineers Make
- Assigning Owner role to everyone: use the least privilege principle. Most people need Contributor or a specific role, not Owner.
- Putting dev and prod resources in the same resource group: always separate by environment.
- Hardcoding passwords in linked services: use Key Vault and Managed Identity.
- Ignoring cost management: set up budget alerts on day one, not after the surprise bill.
- Creating resources in random regions: pick one region for all related resources.
- Not using tags: you will regret it when trying to understand your Azure bill 6 months later.
- Confusing Storage Account Contributor with Storage Blob Data Contributor: the most common IAM error.
- Not connecting ADF to Git: every change should be version controlled from the start.
- Deleting the resource group instead of individual resources: always double-check which resource group you are deleting.
- Forgetting to stop expensive resources: Azure-SSIS IR, Synapse SQL Pools, and VMs cost money when running idle.
Interview Questions
Q: What is the Azure resource hierarchy? A: Tenant (Entra ID) contains Subscriptions, which contain Resource Groups, which contain Resources. Permissions inherit downward — a role assigned at the subscription level applies to all resource groups and resources inside it.
Q: What is the difference between a Subscription and a Resource Group? A: A Subscription is a billing boundary — costs are tracked per subscription. A Resource Group is a logical container for related resources within a subscription. Organizations use multiple subscriptions to separate billing and access (dev vs prod).
Q: What is Managed Identity and why should you use it? A: Managed Identity is an Azure-managed identity for services like ADF and Synapse. It eliminates the need for passwords or keys in linked services. Azure handles authentication automatically. It is more secure (no credentials to leak) and easier to manage (no key rotation).
Q: What RBAC role does ADF need to write to ADLS Gen2? A: Storage Blob Data Contributor. Not Storage Account Contributor (which manages the account but cannot access blob data) and not Contributor (which is too broad).
Q: How do you organize resources for a multi-environment project? A: Separate resource groups per environment (rg-project-dev, rg-project-uat, rg-project-prod). Each environment has its own Key Vault, its own storage account, and its own ADF/Synapse workspace. Only the dev workspace is connected to Git. UAT and prod are deployed via CI/CD.
Q: What is Azure Key Vault and how do ADF linked services use it? A: Key Vault is a centralized secrets store. ADF creates a linked service to Key Vault, then other linked services reference Key Vault secrets instead of hardcoding passwords. The ADF managed identity needs Key Vault Secrets User role to read secrets.
Wrapping Up
Azure fundamentals are not exciting, but they are essential. Understanding subscriptions, resource groups, IAM, Managed Identity, and Key Vault before building pipelines saves you hours of debugging permissions errors and surprise bills.
Think of these fundamentals as the foundation of a house. Nobody admires the foundation, but without it, everything above collapses.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.