Azure Fundamentals for Data Engineers: Subscriptions, Resource Groups, IAM, and Everything You Need Before Starting a Project

Before you create your first Azure Data Factory pipeline or provision an ADLS Gen2 storage account, you need to understand the foundation that everything sits on. Subscriptions, resource groups, IAM roles, regions, tags, cost management — these are not glamorous topics, but getting them wrong causes real problems.

I have seen data engineers who can build complex metadata-driven pipelines but cannot explain why their colleague cannot access the storage account they just created. Or why their pipeline fails with a permissions error even though “they are an admin.” Or why their Azure bill doubled last month.

This post covers every Azure fundamental a data engineer needs to know before starting a project. Think of it as the prerequisite knowledge that Azure tutorials skip because they assume you already know it.

Table of Contents

  • The Azure Hierarchy: How Everything Is Organized
  • Azure Active Directory (Entra ID)
  • Subscriptions
  • Resource Groups
  • Resources
  • Regions and Availability Zones
  • IAM: Identity and Access Management
  • The Most Important RBAC Roles for Data Engineers
  • Managed Identity
  • Azure Key Vault
  • Tags and Resource Organization
  • Cost Management and Budgets
  • Networking Basics for Data Engineers
  • Azure CLI and PowerShell
  • Setting Up a New Data Engineering Project: The Checklist
  • Common Mistakes New Data Engineers Make
  • Interview Questions
  • Wrapping Up

The Azure Hierarchy: How Everything Is Organized

Azure resources are organized in a strict hierarchy:

Azure Active Directory (Entra ID) Tenant
  |
  |-- Management Groups (optional, for large organizations)
      |
      |-- Subscriptions (billing boundary)
          |
          |-- Resource Groups (logical containers)
              |
              |-- Resources (the actual services)
                  Examples: Storage Account, Data Factory,
                  SQL Database, Synapse Workspace, Key Vault

Every Azure resource lives inside a Resource Group, which lives inside a Subscription, which lives inside a Tenant. Understanding this hierarchy is essential because permissions, billing, and policies are applied at each level and inherited downward.
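The hierarchy is visible in every resource ID, which is simply the path from subscription to resource group to resource. Here is a minimal shell sketch, assuming two hypothetical helper functions (rg_scope and resource_id) and a placeholder subscription GUID:

```shell
# Hypothetical helpers: build ARM IDs from the hierarchy levels.
# The subscription GUID below is a placeholder, not a real subscription.

# Resource group scope: /subscriptions/<sub>/resourceGroups/<rg>
rg_scope() {
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

# Full resource ID: <rg scope>/providers/<namespace>/<type>/<name>
resource_id() {
  printf '%s/providers/%s/%s\n' "$(rg_scope "$1" "$2")" "$3" "$4"
}

SUB_ID="00000000-0000-0000-0000-000000000000"
resource_id "$SUB_ID" rg-dataplatform-dev Microsoft.Storage/storageAccounts naveendatalake
```

Strings of this shape are what the --scope and --ids arguments of the Azure CLI expect.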

Azure Active Directory (Entra ID)

What It Is

Azure Active Directory (recently renamed to Microsoft Entra ID) is the identity and access management service. It is the directory that stores:

  • Users — people with email addresses (naveen@company.com)
  • Groups — collections of users (DataEngineers, Admins)
  • Service Principals — identities for applications and automation
  • Managed Identities — identities assigned to Azure resources (like ADF)

Tenant

A Tenant is a single instance of Entra ID. Most organizations have exactly one tenant, although large enterprises sometimes run several. When you sign up for Azure with your company email, your account belongs to your company’s tenant.

Tenant: contoso.onmicrosoft.com
  |-- Users: naveen@contoso.com, alice@contoso.com
  |-- Groups: DataEngineers, CloudAdmins
  |-- Service Principals: sp-github-deploy, sp-databricks
  |-- Managed Identities: naveen-datafactory-de, naveen-synapse-ws

Why It Matters for Data Engineers

Every time you do one of the following, Entra ID is involved:

  • Grant someone access to a storage account: you are adding an Entra ID user to an RBAC role
  • Create a linked service with Managed Identity: ADF uses its Entra ID identity to authenticate
  • Set up CI/CD with GitHub Actions: a Service Principal in Entra ID authenticates the deployment
  • Connect to Azure SQL with Azure AD auth: your Entra ID identity is verified

You cannot escape Entra ID. It is the backbone of all authentication and authorization in Azure.

Subscriptions

What It Is

A Subscription is the billing container for Azure resources. Every resource you create lives in a subscription, and all costs for that resource are charged to that subscription.

Why Multiple Subscriptions?

Organizations typically have multiple subscriptions to separate:

Subscription: Dev/Test ($500/month budget)
  |-- Resource Groups for development workloads
  |-- Lower-tier resources (smaller VMs, basic storage)
  |-- Developer access

Subscription: Production ($5,000/month budget)
  |-- Resource Groups for production workloads
  |-- Higher-tier resources (premium storage, dedicated pools)
  |-- Restricted access (only DevOps team)

Subscription: Sandbox ($200/month budget)
  |-- Personal experimentation
  |-- Temporary resources
  |-- Auto-delete policies

Subscription Limits

Every subscription has limits (quotas) on resources:

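Resource                               Default Limit
-------------------------------------  -------------
Storage accounts per region            250
Resource groups per subscription       980
VMs per region                         25,000
Azure Data Factories per subscription  800
SQL databases per server               5,000

These figures are point-in-time defaults that Microsoft revises periodically, so check the current Azure subscription and service limits documentation before relying on them. Many of them can be raised through a support request, but knowing they exist prevents surprises.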

Subscription ID

Every subscription has a unique GUID. You will use this frequently:

# Find your subscription ID
az account show --query id -o tsv

# Set a specific subscription as default
az account set --subscription "your-subscription-id"

Many CLI commands and ARM templates require the subscription ID.

Resource Groups

What It Is

A Resource Group is a logical container that holds related Azure resources. It is the primary way to organize, manage, and control access to resources.

How to Organize Resource Groups

Option A: By project

rg-project-alpha
  |-- adf-project-alpha
  |-- storage-project-alpha
  |-- sqldb-project-alpha
  |-- synapse-project-alpha

rg-project-beta
  |-- adf-project-beta
  |-- storage-project-beta

Option B: By environment

rg-dev
  |-- adf-dev, storage-dev, sqldb-dev

rg-uat
  |-- adf-uat, storage-uat, sqldb-uat

rg-prod
  |-- adf-prod, storage-prod, sqldb-prod

Option C: By service type

rg-data-engineering
  |-- All ADF, Synapse, storage resources

rg-databases
  |-- All SQL databases

rg-networking
  |-- VNets, private endpoints, NSGs

Most enterprise teams use a combination: rg-{project}-{environment} like rg-dataplatform-prod.

Key Rules

  • A resource can only belong to one resource group
  • Resource groups have a region but resources inside can be in any region
  • Deleting a resource group deletes EVERYTHING inside it — be very careful
  • RBAC roles assigned at the resource group level are inherited by all resources inside
  • Tags on the resource group can be inherited by resources (with Azure Policy)

When to Delete a Resource Group

This is the fastest way to clean up:

# WARNING: Deletes everything inside the resource group
az group delete --name rg-sandbox-test --yes --no-wait

This is why sandbox/dev resources should be in separate resource groups from production.

Resources

What It Is

A Resource is an individual Azure service instance — the actual thing you use. Examples:

Resource Type      Example Name           What It Does
-----------------  ---------------------  -----------------------------------
Storage Account    naveendatalake         Stores blobs, files, tables, queues
Data Factory       naveen-datafactory-de  Runs data pipelines
Synapse Workspace  naveen-synapse-ws      Analytics platform
SQL Database       AdventureWorksLT       Relational database
Key Vault          naveen-keyvault        Stores secrets and certificates
Virtual Machine    vm-selfhosted-ir       Runs Self-Hosted IR

Naming Conventions

Azure resource names have rules and best practices:

Resource         Naming Rule                                        Example
---------------  -------------------------------------------------  ---------------------
Storage Account  3-24 chars, lowercase letters/numbers, no hyphens  naveendatalake
Data Factory     3-63 chars, letters/numbers/hyphens                naveen-datafactory-de
Resource Group   1-90 chars, most characters allowed                rg-dataplatform-prod
Key Vault        3-24 chars, letters/numbers/hyphens                kv-dataplatform-prod
SQL Server       1-63 chars, lowercase, hyphens allowed             sql-dataplatform-prod

Best practice naming pattern: {resource-type}-{project}-{environment}

rg-dataplatform-dev          (resource group)
adf-dataplatform-dev         (data factory)
stdataplatformdev            (storage account -- no hyphens allowed)
kv-dataplatform-dev          (key vault)
sql-dataplatform-dev         (sql server)
syn-dataplatform-dev         (synapse workspace)
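A naming rule is easy to break by accident, so it can help to check names before you try to create anything. This is a minimal sketch under stated assumptions: is_valid_storage_name and make_name are hypothetical helpers, and only the storage account rule from the table above is checked.

```shell
# Hypothetical helpers for the {resource-type}-{project}-{environment} pattern.
LC_ALL=C  # make [a-z] match ASCII lowercase letters only

# Succeeds (exit 0) only if $1 is 3-24 lowercase letters or digits.
is_valid_storage_name() {
  case "$1" in
    ''|*[!a-z0-9]*) return 1 ;;
  esac
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 24 ]
}

# Build a name from prefix, project, environment; storage accounts drop the hyphens.
make_name() {
  if [ "$1" = "st" ]; then
    printf '%s%s%s\n' "$1" "$2" "$3"
  else
    printf '%s-%s-%s\n' "$1" "$2" "$3"
  fi
}

make_name adf dataplatform dev
make_name st dataplatform dev
is_valid_storage_name "st-dataplatform-dev" || echo "st-dataplatform-dev: invalid storage account name"
```

Storage account names must also be globally unique, which a local check like this cannot tell you; the Azure CLI has az storage account check-name for that.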

Regions and Availability Zones

Regions

Azure has 60+ regions worldwide. A region is a geographic location with one or more data centers.

Common regions for data engineering:

Region          Code           Use Case
--------------  -------------  -------------------------
East US         eastus         Default for North America
East US 2       eastus2        Secondary NA region
Canada Central  canadacentral  Canadian data residency
West Europe     westeurope     European workloads
North Europe    northeurope    European secondary
Central India   centralindia   Indian workloads
Southeast Asia  southeastasia  APAC workloads

Region Selection Matters

Performance: Place related resources in the same region. Cross-region data transfer is slower.

GOOD: ADF (East US) + ADLS Gen2 (East US) + SQL (East US)
BAD:  ADF (East US) + ADLS Gen2 (West Europe) + SQL (Central India)

Cost: Prices vary by region. East US is often among the lower-priced regions in North America, but confirm with the Azure pricing calculator for the specific services you plan to use.

Compliance: Some regulations restrict where data may be stored or transferred:

  • GDPR: restricts transfers of personal data outside the EU/EEA, so many teams keep that data in EU regions
  • Canadian privacy laws: may require Canada Central or Canada East
  • Indian regulations: may require Central India or South India

Availability: Not all services are available in all regions. Check Azure Products by Region before choosing.

Availability Zones

Within a region, Availability Zones are physically separate data centers with independent power, cooling, and networking. They provide high availability:

East US Region
  |-- Zone 1 (Data Center A)
  |-- Zone 2 (Data Center B)
  |-- Zone 3 (Data Center C)

Zone-redundant storage (ZRS) replicates data synchronously across three availability zones in the region. If one zone fails, your data is still accessible.

IAM: Identity and Access Management

What It Is

IAM (Identity and Access Management) controls who can do what on which Azure resources. It uses Role-Based Access Control (RBAC) where you assign roles to identities at a specific scope.

The RBAC Model

Who (Identity)  +  What (Role)  +  Where (Scope)

Example:

  • Who: naveen@company.com (user identity)
  • What: Storage Blob Data Contributor (role)
  • Where: naveendatalake storage account (scope)
  • Result: Naveen can read, write, and delete blobs in that storage account

Scope Hierarchy

Roles can be assigned at different levels, and permissions inherit downward:

Management Group (broadest)
  |-- Subscription
      |-- Resource Group
          |-- Resource (most specific)

A role assigned at the Subscription level applies to ALL resource groups and resources in that subscription. A role assigned at a specific resource only applies to that resource.

Best practice: Assign roles at the most specific scope possible. Do not give Subscription-level Contributor to someone who only needs access to one storage account.
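From the subscription level down, inheritance can be pictured as a prefix check on resource IDs: a role assigned at a scope reaches everything whose ID starts with that scope. The sketch below is a simplified model, not how Azure evaluates access internally. scope_covers is a hypothetical helper, the IDs are placeholders, and management group scopes are left out because their IDs follow a different format.

```shell
# Hypothetical helper: does a role assigned at scope $1 reach resource ID $2?
# Modelled as a path-prefix check; the comparison is case-sensitive,
# which is a simplification.
scope_covers() {
  case "$2" in
    "$1"|"$1"/*) return 0 ;;
    *) return 1 ;;
  esac
}

SUB="/subscriptions/00000000-0000-0000-0000-000000000000"
RG="$SUB/resourceGroups/rg-dataplatform-dev"
LAKE="$RG/providers/Microsoft.Storage/storageAccounts/naveendatalake"

scope_covers "$SUB" "$LAKE" && echo "subscription-level role reaches the storage account"
scope_covers "$LAKE" "$RG" || echo "resource-level role does not reach the resource group"
```

This is also why the narrowest scope is the safest choice: a role granted on one storage account cannot leak sideways to a sibling resource.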

Built-In Role Types

Azure has 100+ built-in roles, but they fall into four categories:

Category     Example Roles                                      What They Control
-----------  -------------------------------------------------  ----------------------------------------
Owner        Owner                                              Full access + can assign roles to others
Contributor  Contributor                                        Full access except role assignments
Reader       Reader                                             View-only access
Specialized  Storage Blob Data Contributor, SQL DB Contributor  Specific service access

How to Assign a Role

  1. Navigate to the resource (or resource group/subscription)
  2. Click Access Control (IAM)
  3. Click + Add role assignment
  4. Select the role (e.g., Storage Blob Data Contributor)
  5. Click Next
  6. Select members (users, groups, or managed identities)
  7. Click Review + assign

The Most Important RBAC Roles for Data Engineers

Role                             What It Allows                          When You Need It
-------------------------------  --------------------------------------  ----------------------------------------------------
Storage Blob Data Reader         Read blobs                              Synapse Serverless querying ADLS
Storage Blob Data Contributor    Read + write + delete blobs             ADF/Synapse writing pipeline output to ADLS
Storage Blob Data Owner          Full blob control + ACL management      Admin managing data lake permissions
Contributor (on Resource Group)  Create/manage all resources in the RG   Setting up ADF, Synapse, storage in a dev environment
Data Factory Contributor         Manage ADF resources (not data access)  Building pipelines in ADF
SQL DB Contributor               Manage SQL databases                    Creating and configuring databases
Key Vault Secrets User           Read secrets from Key Vault             ADF linked services reading connection strings
Synapse Contributor              Manage Synapse workspace resources      Building pipelines and notebooks in Synapse

The #1 IAM Mistake Data Engineers Make

Assigning Storage Account Contributor when they need Storage Blob Data Contributor:

Storage Account Contributor:
  - Can manage the storage account (keys, settings, networking)
  - CANNOT read or write blob data through Entra ID (RBAC) authorization

Storage Blob Data Contributor:
  - Can read, write, and delete blob data
  - Cannot manage the storage account itself

These are completely different roles. Your ADF managed identity needs Storage Blob Data Contributor to write Parquet files. If you assign Storage Account Contributor, the pipeline fails with a permissions error.

Managed Identity

What It Is

A Managed Identity is an Azure-managed identity automatically created for certain Azure services. It eliminates the need for passwords, connection strings, or API keys in your pipeline configurations.

Two Types

System-assigned: Enabled on a single resource and tied to it (ADF and Synapse workspaces get one automatically when they are created). It shares the lifecycle of the resource and is deleted when the resource is deleted.

Azure Data Factory: naveen-datafactory-de
  System-assigned Managed Identity: naveen-datafactory-de (same name)
  Object ID: xxxx-xxxx-xxxx

Synapse Workspace: naveen-synapse-ws
  System-assigned Managed Identity: naveen-synapse-ws (same name)
  Object ID: yyyy-yyyy-yyyy

User-assigned: Created independently and can be shared across multiple resources.

Why Managed Identity Matters

Without Managed Identity (the old way):

Linked Service: LS_ADLS_Gen2
  Authentication: Account Key
  Key: "xK3j8d9f...long secret string..."
  Problem: Key is stored in ADF, can be leaked, must be rotated

With Managed Identity (the recommended way):

Linked Service: LS_ADLS_Gen2
  Authentication: Managed Identity
  No passwords, no keys, nothing to leak
  Azure handles authentication behind the scenes

How to Use Managed Identity

  1. Create your ADF or Synapse workspace (managed identity is created automatically)
  2. Go to the target resource (e.g., Storage Account)
  3. Access Control (IAM) > + Add role assignment
  4. Assign the role (e.g., Storage Blob Data Contributor)
  5. Select member: search for your ADF/Synapse workspace name
  6. Save

Now your linked service can authenticate using Managed Identity — no passwords needed.
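The portal steps above can also be scripted with the Azure CLI. This is a sketch under stated assumptions rather than a definitive script: grant_blob_contributor is a hypothetical wrapper, the resource names in the usage line are placeholders, and az datafactory requires the datafactory CLI extension. Nothing runs until you call the function.

```shell
# Sketch only: assigns Storage Blob Data Contributor on one storage account
# to a Data Factory's system-assigned managed identity.
# Arguments: <resource-group> <factory-name> <storage-account-name>
grant_blob_contributor() {
  rg="$1"; factory="$2"; storage="$3"

  # Object ID of the factory's system-assigned managed identity
  principal_id=$(az datafactory show --resource-group "$rg" --name "$factory" \
    --query identity.principalId -o tsv) || return 1

  # Scope the assignment to the single storage account, not the whole RG
  scope=$(az storage account show --resource-group "$rg" --name "$storage" \
    --query id -o tsv) || return 1

  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Contributor" \
    --scope "$scope"
}

# Usage, after `az login`:
#   grant_blob_contributor rg-dataplatform-dev adf-dataplatform-dev stdataplatformdev
```

Role assignments can take a few minutes to propagate, so a pipeline run immediately afterwards may still fail once before it succeeds.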

Azure Key Vault

What It Is

Azure Key Vault is a centralized secrets management service. It securely stores:

  • Secrets — connection strings, passwords, API keys
  • Certificates — SSL/TLS certificates
  • Keys — encryption keys

Why Data Engineers Need It

Instead of hardcoding database passwords in your ADF linked services:

BAD: Linked Service with password "MyP@ssw0rd123" hardcoded
GOOD: Linked Service referencing Key Vault secret "sql-admin-password"

If the password changes, you update Key Vault once. All linked services that reference it automatically get the new password.

Setting Up Key Vault for ADF

  1. Create a Key Vault: kv-dataplatform-dev
  2. Add a secret: Name = sql-admin-password, Value = MyP@ssw0rd123
  3. Grant ADF access: assign Key Vault Secrets User role to the ADF managed identity
  4. In ADF, create a linked service to Key Vault: LS_KeyVault
  5. In your SQL linked service, instead of typing the password:
       • Select Azure Key Vault
       • Reference: LS_KeyVault
       • Secret name: sql-admin-password

Per-Environment Key Vaults

Each environment should have its own Key Vault:

kv-dataplatform-dev   --> dev SQL password, dev storage key
kv-dataplatform-uat   --> uat SQL password, uat storage key
kv-dataplatform-prod  --> prod SQL password, prod storage key

When CI/CD deploys to prod, it references kv-dataplatform-prod which has the production secrets. No secrets are shared across environments.
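One way to keep environments from crossing over in a deployment script is to derive the vault name from the environment and refuse anything unexpected. This is a minimal sketch: vault_for_env is a hypothetical helper, and the vault names follow the kv-dataplatform-{environment} pattern shown above.

```shell
# Hypothetical helper: map an environment to its vault, rejecting unknown
# values so a typo can never fall through to the wrong vault.
vault_for_env() {
  case "$1" in
    dev|uat|prod) printf 'kv-dataplatform-%s\n' "$1" ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Usage in a deployment step, after `az login`
# (sql-admin-password is the secret name from the setup steps):
#   az keyvault secret show --vault-name "$(vault_for_env prod)" \
#     --name sql-admin-password --query value -o tsv
vault_for_env prod
```

Because the secret name is identical in every vault, only the environment argument changes between deployments.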

Tags and Resource Organization

What Are Tags

Tags are key-value pairs you attach to resources for organization, cost tracking, and governance.

Resource: naveen-datafactory-de
Tags:
  Environment: Development
  Project: DataPlatform
  Owner: naveen@company.com
  CostCenter: DE-001
  ManagedBy: Terraform

Why Tags Matter

Cost tracking: Filter your Azure bill by tag to see costs per project:

Project: DataPlatform --> $450/month
Project: MLPipeline   --> $200/month
Project: WebApp       --> $150/month

Governance: Azure Policy can enforce tags: – “Every resource must have an Environment tag” – “Every resource must have a CostCenter tag”

Automation: Scripts can target resources by tag:

# Delete all resources tagged as Environment=Sandbox
az resource list --tag Environment=Sandbox --query "[].id" -o tsv | xargs az resource delete --ids

Tags worth standardizing across a project:

Tag Key             Example Values                  Purpose
------------------  ------------------------------  ----------------------------
Environment         dev, uat, prod, sandbox         Identify the environment
Project             DataPlatform, Analytics         Group by project
Owner               naveen@company.com              Who is responsible
CostCenter          DE-001                          Charge to the right budget
ManagedBy           manual, terraform, bicep        How the resource was created
DataClassification  public, internal, confidential  Security classification
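Azure Policy is the right place to enforce tags, but a quick local check in a deployment script catches missing ones earlier. This is a minimal sketch: missing_tags is a hypothetical helper, and the required set in REQUIRED_TAGS is an assumption you would adapt to your own standard.

```shell
# Hypothetical pre-deployment check: report which required tag keys are
# missing from a list of Key=Value arguments.
REQUIRED_TAGS="Environment Project Owner CostCenter"

missing_tags() {
  missing=""
  for key in $REQUIRED_TAGS; do
    found=0
    for pair in "$@"; do
      case "$pair" in
        "$key"=*) found=1 ;;
      esac
    done
    [ "$found" -eq 1 ] || missing="$missing $key"
  done
  # Print the missing keys (if any) without the leading space
  printf '%s\n' "${missing# }"
  # Succeed only when nothing is missing
  [ -z "$missing" ]
}

missing_tags Environment=dev Project=DataPlatform || echo "add the tags listed above before deploying"
```

The Key=Value form used here is the same one that az group create --tags accepts.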

Cost Management and Budgets

Understanding Azure Costs

Azure billing has multiple components:

Component      How It Is Charged
-------------  ------------------------------------------------------------------------
Storage        Per GB stored per month + per operation (read/write)
Compute        Per hour/minute of VM or serverless execution
Data transfer  Free within same region, charged across regions
ADF pipelines  Per activity run + per DIU-hour for data movement
Synapse        Per activity run + per DWU-hour for SQL pools + per vCore-hour for Spark

Setting Up Budget Alerts

  1. Go to Cost Management + Billing > Budgets
  2. Click + Add
  3. Set the budget amount (e.g., $500/month)
  4. Set alert thresholds (e.g., 50%, 75%, 100%)
  5. Configure email notifications
  6. Optionally set an action group to auto-shutdown resources at 100%
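Thresholds are percentages of the budget, so it is worth working out the dollar amounts at which each alert will actually fire. This is a small arithmetic sketch with a hypothetical helper, alert_amounts, using the example budget and thresholds from the steps above:

```shell
# Hypothetical helper: dollar amount at which each alert threshold fires.
# Arguments: <monthly-budget> <percent> [<percent> ...]
# Integer arithmetic only, so amounts are rounded down to whole dollars.
alert_amounts() {
  budget="$1"; shift
  for pct in "$@"; do
    printf '%s%% -> $%s\n' "$pct" "$(( $budget * $pct / 100 ))"
  done
}

alert_amounts 500 50 75 100
```

For a $500 budget this prints $250, $375, and $500, which makes it easier to judge whether the first alert arrives early enough to act on.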

Cost Saving Tips for Data Engineers

  1. Stop Azure-SSIS IR when not running packages (saves $600+/month)
  2. Use Serverless SQL in Synapse instead of Dedicated SQL Pool for ad-hoc queries
  3. Right-size Data Flow clusters — do not use 272 cores when 8 is enough
  4. Use Lifecycle Management to move old ADLS data to Cool/Archive tiers
  5. Delete unused resources — orphaned storage accounts, test VMs, unused databases
  6. Use reserved capacity for production resources you run 24/7 (up to 72% savings)
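  7. Enable auto-pause on Synapse Spark pools, and pause Dedicated SQL Pools when they are idle (they have no built-in auto-pause, so schedule the pause with automation)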

Networking Basics for Data Engineers

Public vs Private Access

By default, Azure resources have public endpoints accessible from the internet. For production, you typically restrict access to a private endpoint:

Public endpoint:  naveendatalake.blob.core.windows.net resolves to a public IP (accessible from anywhere)
Private endpoint: the same hostname resolves, via naveendatalake.privatelink.blob.core.windows.net, to a private IP (only from your VNet)

Virtual Network (VNet)

A VNet is your private network in Azure. Resources inside a VNet can communicate privately without going through the internet.

Private Endpoints

A Private Endpoint gives an Azure resource a private IP address inside your VNet:

Before (public): ADF --> internet --> Storage Account (public IP)
After (private):  ADF --> VNet --> Storage Account (private IP, no internet)

When Data Engineers Need Networking Knowledge

  • Setting up Self-Hosted IR (needs outbound internet access)
  • Configuring Managed VNet for Azure IR (private data access)
  • Creating Private Endpoints for storage, SQL, Key Vault
  • Understanding why a pipeline fails with “not authorized” (might be a network issue, not IAM)

Azure CLI and PowerShell

Azure CLI Essentials

Every data engineer should know these commands:

# Login
az login

# List subscriptions
az account list -o table

# Set default subscription
az account set --subscription "subscription-name-or-id"

# List resource groups
az group list -o table

# List resources in a resource group
az resource list --resource-group rg-dataplatform-dev -o table

# Create a resource group
az group create --name rg-test --location eastus

# Delete a resource group (DANGEROUS)
az group delete --name rg-test --yes

# List storage accounts
az storage account list -o table

# Get ADF details (requires the datafactory CLI extension)
az datafactory show --resource-group rg-dev --factory-name adf-dev

# List role assignments on a resource group (replace the {placeholders})
az role assignment list --scope "/subscriptions/{sub-id}/resourceGroups/{rg}" -o table

Azure PowerShell Equivalents

# Login
Connect-AzAccount

# List subscriptions
Get-AzSubscription

# Set subscription
Set-AzContext -SubscriptionId "subscription-id"

# List resource groups
Get-AzResourceGroup | Format-Table

# Create resource group
New-AzResourceGroup -Name rg-test -Location eastus

Setting Up a New Data Engineering Project: The Checklist

When starting a new data engineering project on Azure, follow this order:

Phase 1: Foundation

  • [ ] Identify the subscription to use (or request a new one)
  • [ ] Create a resource group with proper naming convention
  • [ ] Decide on the region (consider compliance, cost, performance)
  • [ ] Set up tagging standards (Environment, Project, Owner, CostCenter)
  • [ ] Create a budget with email alerts

Phase 2: Security

  • [ ] Create an Azure Key Vault for secrets
  • [ ] Define IAM roles — who needs access to what
  • [ ] Set up Entra ID groups (DataEngineers, DataAdmins) instead of assigning roles to individuals
  • [ ] Decide on Managed Identity vs key-based authentication for services

Phase 3: Storage

  • [ ] Create ADLS Gen2 storage account (enable hierarchical namespace)
  • [ ] Create containers (bronze/silver/gold or raw/curated/analytics)
  • [ ] Assign Storage Blob Data Contributor to ADF/Synapse managed identity
  • [ ] Set up Lifecycle Management policies for cost optimization

Phase 4: Data Integration

  • [ ] Create Azure Data Factory or Synapse Workspace
  • [ ] Connect to GitHub or Azure DevOps for source control
  • [ ] Create linked services (SQL, ADLS, Key Vault)
  • [ ] Create parameterized datasets for reusability
  • [ ] Build metadata-driven pipelines (Lookup, ForEach, Copy)

Phase 5: Operations

  • [ ] Set up CI/CD with GitHub Actions or Azure DevOps
  • [ ] Configure monitoring and alerts for pipeline failures
  • [ ] Document naming conventions and standards
  • [ ] Create a runbook for common operations (restart IR, rerun pipeline, etc.)

Common Mistakes New Data Engineers Make

  1. Assigning Owner role to everyone — use the least privilege principle. Most people need Contributor or a specific role, not Owner.

  2. Putting dev and prod resources in the same resource group — always separate by environment.

  3. Hardcoding passwords in linked services — use Key Vault and Managed Identity.

  4. Ignoring cost management — set up budget alerts on day one, not after the surprise bill.

  5. Creating resources in random regions — pick one region for all related resources.

  6. Not using tags — you will regret it when trying to understand your Azure bill 6 months later.

  7. Confusing Storage Account Contributor with Storage Blob Data Contributor — the most common IAM error.

  8. Not connecting ADF to Git — every change should be version controlled from the start.

  9. Deleting the resource group instead of individual resources — always double-check which resource group you are deleting.

  10. Forgetting to stop expensive resources — Azure-SSIS IR, Synapse SQL Pools, and VMs cost money when running idle.

Interview Questions

Q: What is the Azure resource hierarchy?
A: Tenant (Entra ID) contains Subscriptions, which contain Resource Groups, which contain Resources. Permissions inherit downward: a role assigned at the subscription level applies to all resource groups and resources inside it.

Q: What is the difference between a Subscription and a Resource Group?
A: A Subscription is a billing boundary, so costs are tracked per subscription. A Resource Group is a logical container for related resources within a subscription. Organizations use multiple subscriptions to separate billing and access (dev vs prod).

Q: What is Managed Identity and why should you use it?
A: Managed Identity is an Azure-managed identity for services like ADF and Synapse. It eliminates the need for passwords or keys in linked services. Azure handles authentication automatically. It is more secure (no credentials to leak) and easier to manage (no key rotation).

Q: What RBAC role does ADF need to write to ADLS Gen2?
A: Storage Blob Data Contributor. Not Storage Account Contributor (which manages the account but cannot access blob data) and not Contributor (which is too broad).

Q: How do you organize resources for a multi-environment project?
A: Separate resource groups per environment (rg-project-dev, rg-project-uat, rg-project-prod). Each environment has its own Key Vault, its own storage account, and its own ADF/Synapse workspace. Only the dev workspace is connected to Git. UAT and prod are deployed via CI/CD.

Q: What is Azure Key Vault and how do ADF linked services use it?
A: Key Vault is a centralized secrets store. ADF creates a linked service to Key Vault, then other linked services reference Key Vault secrets instead of hardcoding passwords. The ADF managed identity needs Key Vault Secrets User role to read secrets.

Wrapping Up

Azure fundamentals are not exciting, but they are essential. Understanding subscriptions, resource groups, IAM, Managed Identity, and Key Vault before building pipelines saves you hours of debugging permissions errors and surprise bills.

Think of these fundamentals as the foundation of a house. Nobody admires the foundation, but without it, everything above collapses.

Related posts:

  • What is Azure Data Factory?
  • ADLS Gen2 Complete Guide
  • Azure Blob Storage Guide
  • CI/CD with GitHub
  • Integration Runtime Guide
  • Top 15 ADF Interview Questions

If this guide helped you understand the Azure landscape, share it with someone starting their Azure journey. Questions? Drop a comment below.


Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.
