Automating Data Lake Cleanup: Delete Old Files Using Get Metadata, ForEach, and If Condition in Azure Data Factory
Your data lake grows every day. Pipelines write new Parquet files, CSVs accumulate, and before you know it, you have terabytes of data that nobody has looked at in months. Storage costs creep up, queries slow down scanning old partitions, and your data lake turns into a data swamp.
The solution is automated cleanup — a pipeline that scans your ADLS Gen2 folders, checks the age of each file, and deletes anything older than a retention period (like 6 months).
In this guide, I will walk you through building this cleanup pipeline using three ADF/Synapse activities you might not have used before: Get Metadata (to list files and their properties), ForEach (to iterate over each file), and If Condition (to decide whether to delete or keep).
This is a real production pattern used in enterprise data platforms for data lifecycle management.
Table of Contents
- The Problem: Data Lake Bloat
- The Solution: Automated File Cleanup Pipeline
- Understanding the Get Metadata Activity
- Understanding the If Condition Activity
- Understanding the Delete Activity
- Pipeline Architecture
- Step 1: Create the Datasets
- Step 2: Build the Pipeline — Get Metadata
- Step 3: Add ForEach to Iterate Over Files
- Step 4: Inside ForEach — Get Each File Metadata
- Step 5: Inside ForEach — If Condition (Age Check)
- Step 6: Inside If Condition — Delete Activity
- Step 7: The Complete Pipeline Structure
- Step 8: Run and Test
- Understanding the Date Comparison Expression
- All Get Metadata Field List Options
- All If Condition Operators
- Making It Dynamic with Parameters
- Production Enhancements
- Scheduling the Cleanup
- Interview Questions
- Wrapping Up
The Problem: Data Lake Bloat
Consider a pipeline that writes daily Parquet files to ADLS Gen2:
datalake/
sqldb/
Customer/
2025/10/05/part-00000.snappy.parquet -- 6+ months old
2025/11/15/part-00000.snappy.parquet -- 5+ months old
2025/12/20/part-00000.snappy.parquet -- 4+ months old
2026/01/10/part-00000.snappy.parquet -- 3 months old
2026/02/15/part-00000.snappy.parquet -- 2 months old
2026/03/20/part-00000.snappy.parquet -- 2 weeks old
2026/04/05/part-00000.snappy.parquet -- today
Address/
... (same pattern, hundreds of files)
Product/
... (same pattern)
After a year of daily runs across 50 tables, you have 18,000+ files. Most of them are never queried, yet they add to your storage bill every month and slow down queries that scan the folder.
The Solution: Automated File Cleanup Pipeline
We will build a pipeline that:
- Scans a folder in ADLS Gen2 to get the list of all child items (files and subfolders)
- Iterates over each item using ForEach
- Checks the last modified date of each file using Get Metadata
- Compares the date to a threshold (6 months ago)
- Deletes the file if it is older than 6 months
- Keeps the file if it is newer
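The steps above can be sketched in plain Python (a minimal model, not ADF code; the item shapes mirror what Get Metadata returns, and `classify` is an illustrative name):

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # plays the role of the RetentionDays pipeline parameter

def classify(items, now=None):
    """Mimic the pipeline's decision logic: skip folders, delete files
    older than the retention threshold, keep newer files."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(days=RETENTION_DAYS)  # addDays(utcnow(), -180)
    to_delete, to_keep = [], []
    for item in items:
        if item["type"] != "File":            # CheckIfFile: skip folders
            continue
        if item["lastModified"] < threshold:  # CheckAge: older than threshold
            to_delete.append(item["name"])
        else:
            to_keep.append(item["name"])
    return to_delete, to_keep
```

The real pipeline makes the same two checks, just spread across two If Condition activities.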
Understanding the Get Metadata Activity
The Get Metadata activity retrieves information about files and folders in your data store. It is one of the most versatile but underused activities in ADF.
What It Can Return
You choose which “field list” items to retrieve:
| Field | What It Returns | Example |
|---|---|---|
| childItems | List of files and folders in a directory | [{name: "file1.parquet", type: "File"}, ...] |
| itemName | Name of the file or folder | "part-00000.snappy.parquet" |
| itemType | Whether it is a File or Folder | "File" or "Folder" |
| size | File size in bytes | 125000 |
| lastModified | Last modified timestamp | "2025-10-05T14:30:00Z" |
| created | Creation timestamp | "2025-10-05T14:30:00Z" |
| exists | Whether the file/folder exists | true or false |
| structure | Column schema of the file | [{name: "id", type: "Int32"}, ...] |
| columnCount | Number of columns | 15 |
Two Modes of Operation
Folder mode (childItems): When you point Get Metadata at a folder and request childItems, it returns the list of all files and subfolders inside that folder. This is how we scan for files to clean up.
File mode (lastModified, size, etc.): When you point Get Metadata at a specific file, it returns that file's properties. This is how we check when each file was last modified.
We use both modes in our pipeline — first to list files, then to check each file's age.
Understanding the If Condition Activity
The If Condition activity is the ADF equivalent of an if/else statement. It evaluates a boolean expression and executes different activities depending on the result.
If Condition: @greater(activity('GetFileInfo').output.lastModified, addDays(utcnow(), -180))
|
|-- True activities (file is NEWER than 6 months) --> do nothing (keep the file)
|
|-- False activities (file is OLDER than 6 months) --> Delete the file
Comparison Operators Available
| Operator | Expression | Example |
|---|---|---|
| Equals | @equals(a, b) | @equals(item().type, 'File') |
| Not equals | @not(equals(a, b)) | @not(equals(item().type, 'Folder')) |
| Greater than | @greater(a, b) | @greater(output.size, 1000000) |
| Greater or equal | @greaterOrEquals(a, b) | @greaterOrEquals(output.size, 0) |
| Less than | @less(a, b) | @less(output.size, 100) |
| Less or equal | @lessOrEquals(a, b) | @lessOrEquals(output.size, 1000) |
| And | @and(expr1, expr2) | @and(equals(a, 'File'), greater(b, 0)) |
| Or | @or(expr1, expr2) | @or(equals(a, 'csv'), equals(a, 'parquet')) |
| Not | @not(expr) | @not(equals(item().type, 'Folder')) |
| Contains | @contains(string, substr) | @contains(item().name, '.parquet') |
| StartsWith | @startsWith(string, prefix) | @startsWith(item().name, 'part-') |
| EndsWith | @endsWith(string, suffix) | @endsWith(item().name, '.csv') |
| Empty | @empty(value) | @empty(activity('Lookup').output.value) |
Understanding the Delete Activity
The Delete activity removes files or folders from supported data stores (ADLS Gen2, Blob Storage, SFTP, etc.).
Key settings:

- Dataset: points to the file or folder to delete
- Recursive: if deleting a folder, whether to delete its contents recursively
- Logging: can log deleted file names to a separate location for an audit trail
The Delete activity returns:

- output.deletedFiles — number of files deleted
- output.deletedFolders — number of folders deleted
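For a rough sense of what this pattern amounts to outside ADF, here is a sketch using the azure-storage-file-datalake Python SDK. The function takes a FileSystemClient (or any object with the same `get_paths`/`get_file_client` interface), and note that `get_paths()` lists recursively by default, unlike the one-level childItems scan. Treat it as an illustration under those assumptions, not production code:

```python
from datetime import datetime, timedelta, timezone

def delete_old_files(fs_client, folder, retention_days=180, dry_run=True):
    """Delete files under `folder` older than the retention window.

    `fs_client` is an azure.storage.filedatalake FileSystemClient.
    dry_run defaults to True as a safety net.
    Returns the list of paths that were (or would be) deleted.
    """
    threshold = datetime.now(timezone.utc) - timedelta(days=retention_days)
    deleted = []
    for path in fs_client.get_paths(path=folder):
        if path.is_directory:               # skip subfolders
            continue
        if path.last_modified < threshold:  # older than retention window
            if not dry_run:
                fs_client.get_file_client(path.name).delete_file()
            deleted.append(path.name)
    return deleted
```

The ADF pipeline in this guide does the same thing declaratively, with the added benefits of built-in logging, monitoring, and scheduling.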
Pipeline Architecture
Here is the complete pipeline flow:
Pipeline: PL_Cleanup_OldFiles
|
|-- Get Metadata: "GetFolderContents"
| |-- Dataset: DS_ADLS_Folder (points to the target folder)
| |-- Field list: childItems
| |-- Output: array of {name, type} for each item in the folder
|
|-- ForEach: "ForEach_Item"
|-- Items: @activity('GetFolderContents').output.childItems
|-- Sequential: true (safer for delete operations)
|
|-- Inside ForEach:
|
|-- If Condition: "CheckIfFile"
| |-- Expression: @equals(item().type, 'File')
| |
| |-- True (it is a file):
| |
| |-- Get Metadata: "GetFileAge"
| | |-- Field list: lastModified
| | |-- Dataset: DS_ADLS_File (parameterized with item().name)
| |
| |-- If Condition: "CheckAge"
| |-- Expression: @less(
| | activity('GetFileAge').output.lastModified,
| | addDays(utcnow(), -180))
| |
| |-- True (older than 6 months):
| | |-- Delete: "DeleteOldFile"
| |
| |-- False (newer than 6 months):
| |-- (do nothing -- keep the file)
Step 1: Create the Datasets
You need two datasets:
A. DS_ADLS_Folder (Points to the folder to scan)
- Create dataset: ADLS Gen2 -> Binary (or any format; we are just listing contents)
- Name: `DS_ADLS_Folder`
- Linked service: your ADLS Gen2 linked service
- Parameters: `FolderPath` (String, no default)
- Connection tab:
  - Container: `datalake` (or your container name)
  - Directory: `@dataset().FolderPath`
  - File name: leave blank

B. DS_ADLS_File (Points to a specific file — parameterized)
- Create dataset: ADLS Gen2 -> Binary
- Name: `DS_ADLS_File`
- Linked service: same ADLS linked service
- Parameters: `FolderPath` (String, no default) and `FileName` (String, no default)
- Connection tab:
  - Container: `datalake`
  - Directory: `@dataset().FolderPath`
  - File name: `@dataset().FileName`
Step 2: Build the Pipeline — Get Metadata
- Create pipeline: `PL_Cleanup_OldFiles`
- Add pipeline parameters:
  - `TargetFolder` (String, default: `sqldb/Customer`)
  - `RetentionDays` (Int, default: `180`)
- Drag a Get Metadata activity onto the canvas
- Name: `GetFolderContents`
- Settings tab:
  - Dataset: `DS_ADLS_Folder`
  - FolderPath parameter: `@pipeline().parameters.TargetFolder`
  - Field list: click + New and select childItems
What This Returns
The Get Metadata activity outputs:
{
"childItems": [
{"name": "part-00000-abc.snappy.parquet", "type": "File"},
{"name": "part-00001-def.snappy.parquet", "type": "File"},
{"name": "_SUCCESS", "type": "File"},
{"name": "2026", "type": "Folder"}
]
}
Each item has a name and type (File or Folder).
Step 3: Add ForEach to Iterate Over Files
- Drag a ForEach activity onto the canvas
- Name: `ForEach_Item`
- Connect: green arrow from `GetFolderContents` to `ForEach_Item`
- Settings tab:
  - Sequential: checked (recommended for delete operations to avoid race conditions)
  - Items: `@activity('GetFolderContents').output.childItems`
Step 4: Inside ForEach — Check If File and Get Metadata
Double-click ForEach to open the inner canvas.
A. First If Condition: Is It a File?
We only want to process files, not subfolders:
- Drag an If Condition activity inside ForEach
- Name: `CheckIfFile`
- Activities tab:
  - Expression: `@equals(item().type, 'File')`
B. True Path — Get File Metadata
Inside the If Condition True activities:
- Drag a Get Metadata activity
- Name: `GetFileAge`
- Settings:
  - Dataset: `DS_ADLS_File`
  - FolderPath: `@pipeline().parameters.TargetFolder`
  - FileName: `@item().name`
  - Field list: select lastModified
This returns:
{
"lastModified": "2025-10-05T14:30:00Z",
"itemName": "part-00000-abc.snappy.parquet"
}
Step 5: Inside ForEach — If Condition (Age Check)
After GetFileAge, add another If Condition to check the age:
- Drag an If Condition activity (inside the True path, after GetFileAge)
- Name: `CheckAge`
- Expression:

@less(
  activity('GetFileAge').output.lastModified,
  addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
)
Breaking Down This Expression
@less(A, B) -- is A less than (older than) B?
A = activity('GetFileAge').output.lastModified
-- the file last modified date (e.g., "2025-10-05")
B = addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
-- today minus RetentionDays (e.g., 2026-04-05 minus 180 = 2025-10-07)
If the file was modified on 2025-10-05 and the threshold is 2025-10-07:
@less("2025-10-05", "2025-10-07") = TRUE
-- File is older than 6 months --> DELETE
If the file was modified on 2026-03-20:
@less("2026-03-20", "2025-10-07") = FALSE
-- File is newer than 6 months --> KEEP
Alternative simpler expression if you want to hardcode 180 days:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), -180)
)
Step 6: Inside If Condition — Delete Activity
In the CheckAge True path (file IS older than threshold):
- Drag a Delete activity
- Name: `DeleteOldFile`
- Settings:
  - Dataset: `DS_ADLS_File`
  - FolderPath: `@pipeline().parameters.TargetFolder`
  - FileName: `@item().name`
  - Enable logging: Yes (recommended; logs which files were deleted)
  - Log settings:
    - Linked service: your ADLS linked service
    - Path: `datalake/cleanup-logs` (creates a log of deleted files)

In the CheckAge False path (file is newer):

- Leave empty (do nothing; the file is kept)
Step 7: The Complete Pipeline Structure
GetFolderContents (Get Metadata -- list childItems)
|
v
ForEach_Item (iterate over each child item)
|
|-- CheckIfFile (If Condition: item().type == 'File')
|
|-- TRUE:
| |-- GetFileAge (Get Metadata -- lastModified)
| | |
| | v
| |-- CheckAge (If Condition: lastModified < 6 months ago)
| |
| |-- TRUE: DeleteOldFile (Delete activity)
| |-- FALSE: (do nothing -- keep file)
|
|-- FALSE: (skip folders)
Step 8: Run and Test
Test Setup
Before running on real data, create test files:
- Upload a file to your target folder with a recent date (keep this)
- Upload another file and manually set its properties to an old date, OR
- Point the pipeline at a folder with known old files
Debug Run
- Click Validate and fix any errors
- Click Debug
- Set pipeline parameters:
  - TargetFolder: `sqldb/Customer`
  - RetentionDays: `180`
- Watch the Monitor output
Expected Behavior
GetFolderContents --> Succeeded (found 7 items)
ForEach_Item --> Succeeded (7 iterations)
Iteration 1: part-00000-abc.parquet
CheckIfFile --> True (it is a file)
GetFileAge --> lastModified: 2025-10-05
CheckAge --> True (older than 6 months)
DeleteOldFile --> Succeeded (file deleted)
Iteration 2: part-00001-def.parquet
CheckIfFile --> True
GetFileAge --> lastModified: 2026-03-20
CheckAge --> False (newer than 6 months)
(no delete -- file kept)
Iteration 3: 2026 (folder)
CheckIfFile --> False (it is a folder)
(skipped)
Verify
Check your ADLS Gen2 folder — old files should be gone, recent files should remain.
Check the cleanup log at datalake/cleanup-logs/ — it contains a record of every deleted file.
Understanding the Date Comparison Expression
Let us break down the core expression piece by piece:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), -180)
)
| Part | What It Does | Example Value |
|---|---|---|
| utcnow() | Current UTC time | 2026-04-05T14:30:00Z |
| addDays(utcnow(), -180) | Subtract 180 days | 2025-10-07T14:30:00Z |
| activity('GetFileAge').output.lastModified | File last modified | 2025-10-05T14:30:00Z |
| @less(fileDate, threshold) | Is file date before threshold? | true (Oct 5 < Oct 7) |
The less() function compares the ISO-8601 timestamp strings lexicographically, which matches chronological order because the fields are fixed-width and run from most to least significant (year, month, day, time). An earlier date is therefore "less than" a later date, so @less(oldDate, recentDate) returns true.
Making Retention Dynamic
Instead of hardcoding -180, use a pipeline parameter:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
)
mul(-1, 180) = -180. This way you can change retention without editing the pipeline.
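You can verify this string-comparison behavior in a few lines of Python (the helper names `threshold_iso` and `is_older` are mine, and this assumes both timestamps are UTC and share the same format, as they do in the lastModified output):

```python
from datetime import datetime, timedelta, timezone

def threshold_iso(retention_days, now=None):
    """Compute the cutoff the way addDays(utcnow(), -RetentionDays) does,
    formatted like an ADF UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=retention_days)).strftime("%Y-%m-%dT%H:%M:%SZ")

def is_older(last_modified_iso, retention_days, now=None):
    """Plain string < on two same-format ISO-8601 UTC strings matches
    chronological order, which is what ADF's less() relies on here."""
    return last_modified_iso < threshold_iso(retention_days, now)
```

For example, with "today" fixed at 2026-04-05T14:30:00Z, the 180-day threshold comes out to 2025-10-07T14:30:00Z, so a file modified on 2025-10-05 tests as older and one modified on 2026-03-20 does not.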
All Get Metadata Field List Options
Here is every field you can request from Get Metadata:
| Field | Applies To | Returns |
|---|---|---|
| childItems | Folders | Array of {name, type} for each child |
| itemName | Files and Folders | Name of the item |
| itemType | Files and Folders | “File” or “Folder” |
| size | Files | Size in bytes |
| lastModified | Files and Folders | Last modified timestamp |
| created | Files and Folders | Creation timestamp |
| exists | Files and Folders | Boolean — does it exist? |
| structure | Files (structured) | Column schema array |
| columnCount | Files (structured) | Number of columns |
| contentMD5 | Files | MD5 hash of file content |
Combining Multiple Fields
You can request multiple fields in a single Get Metadata call:
Field list: childItems, itemName, itemType, lastModified
This is more efficient than making separate calls for each property.
All If Condition Operators
Complete reference of boolean expressions for If Condition:
Comparison
@equals(a, b) -- a == b
@not(equals(a, b)) -- a != b
@greater(a, b) -- a > b
@greaterOrEquals(a, b) -- a >= b
@less(a, b) -- a < b
@lessOrEquals(a, b) -- a <= b
Logical
@and(expr1, expr2) -- both true
@or(expr1, expr2) -- either true
@not(expr) -- negate
String
@contains(string, substr) -- string contains substr
@startsWith(string, prefix) -- string starts with prefix
@endsWith(string, suffix) -- string ends with suffix
@empty(value) -- value is null or empty
Practical Examples
-- Only process .parquet files
@and(equals(item().type, 'File'), endsWith(item().name, '.parquet'))
-- Skip hidden files (starting with _ or .)
@not(or(startsWith(item().name, '_'), startsWith(item().name, '.')))
-- File is larger than 1 MB
@greater(activity('GetFileAge').output.size, 1048576)
-- File is older than 90 days AND larger than 100 MB
@and(
less(activity('GetFileAge').output.lastModified, addDays(utcnow(), -90)),
greater(activity('GetFileAge').output.size, 104857600)
)
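The same filters can be modeled in Python over childItems-shaped dictionaries (the helper names are mine, for illustration only):

```python
def is_parquet_file(item):
    # @and(equals(item().type, 'File'), endsWith(item().name, '.parquet'))
    return item["type"] == "File" and item["name"].endswith(".parquet")

def is_hidden(item):
    # the check that @not(or(startsWith(item().name, '_'),
    # startsWith(item().name, '.'))) negates: True when the item IS hidden
    return item["name"].startswith(("_", "."))

items = [
    {"name": "part-00000.snappy.parquet", "type": "File"},
    {"name": "_SUCCESS", "type": "File"},
    {"name": "2026", "type": "Folder"},
]
# combine both filters, as the nested If Conditions do in the pipeline
visible_parquet = [i["name"] for i in items
                   if is_parquet_file(i) and not is_hidden(i)]
```

Only `part-00000.snappy.parquet` survives both filters: `_SUCCESS` is hidden and `2026` is a folder.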
Making It Dynamic with Parameters
To make the pipeline reusable for any folder, add parameters:
Pipeline Parameters:
TargetFolder (String) -- e.g., "sqldb/Customer"
RetentionDays (Int) -- e.g., 180
ContainerName (String) -- e.g., "datalake"
FileExtension (String) -- e.g., ".parquet" (optional filter)
Then call this pipeline from a parent pipeline that loops through multiple folders:
Parent Pipeline: PL_Cleanup_AllFolders
|-- Lookup: read list of folders to clean from a config table
|-- ForEach: iterate over folders
|-- Execute Pipeline: call PL_Cleanup_OldFiles with each folder
Parameters:
TargetFolder: @item().FolderName
RetentionDays: @item().RetentionDays
This way you can set different retention periods for different folders:
INSERT INTO cleanup_config VALUES ('sqldb/Customer', 180); -- 6 months
INSERT INTO cleanup_config VALUES ('sqldb/Logs', 30); -- 30 days
INSERT INTO cleanup_config VALUES ('sqldb/Archive', 365); -- 1 year
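A sketch of how the parent pipeline's Lookup + ForEach turns those config rows into per-folder cutoffs (the column names FolderName and RetentionDays match the parameters shown above; the dictionary shape is an assumption about what a Lookup would return):

```python
from datetime import datetime, timedelta, timezone

# Rows as a Lookup over the cleanup_config table might return them
cleanup_config = [
    {"FolderName": "sqldb/Customer", "RetentionDays": 180},
    {"FolderName": "sqldb/Logs",     "RetentionDays": 30},
    {"FolderName": "sqldb/Archive",  "RetentionDays": 365},
]

def thresholds(config, now=None):
    """Per-folder cutoff dates, as the parent ForEach would pass them
    down to PL_Cleanup_OldFiles via Execute Pipeline parameters."""
    now = now or datetime.now(timezone.utc)
    return {row["FolderName"]: now - timedelta(days=row["RetentionDays"])
            for row in config}
```

Each child-pipeline run then only sees its own TargetFolder and RetentionDays, so adding a new folder to the cleanup is a single config-table insert.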
Production Enhancements
1. Add Audit Logging
Log every deletion to an audit table:
INSERT INTO cleanup_audit (file_name, folder_path, file_date, deleted_at, retention_days)
VALUES (@fileName, @folderPath, @fileDate, GETDATE(), @retentionDays);
Add a Stored Procedure activity after the Delete activity to record what was deleted.
2. Dry Run Mode
Add a DryRun boolean pipeline parameter. When true, the pipeline logs which files WOULD be deleted but does not actually delete them:
If Condition expression:
@and(
less(activity('GetFileAge').output.lastModified, addDays(utcnow(), -180)),
not(pipeline().parameters.DryRun)
)
3. Handle Nested Folders
The basic pipeline only scans one level. For nested date-partitioned folders like Customer/2025/10/05/, you need recursive scanning:
Option A: Use the Delete activity with Recursive = true and a wildcard path.
Option B: Use an Execute Pipeline activity to invoke a cleanup pipeline per subfolder (ADF does not allow a pipeline to call itself directly, so recursion has to be simulated with a helper pipeline).
4. Email Notification
Add a Web activity at the end that calls a Logic App or sends an email with the cleanup summary:
Files scanned: 150
Files deleted: 45
Storage freed: 2.3 GB
5. Cost Optimization Integration
Combine this pipeline with Azure Storage Lifecycle Management for a two-tier approach:

- Lifecycle Management: automatically moves old files from Hot to Cool to Archive tier
- ADF Cleanup Pipeline: permanently deletes files past their retention period
Scheduling the Cleanup
Add a trigger to run weekly or monthly:
- Click Add trigger > New/Edit
- Create a Schedule trigger
- Name: `TR_Weekly_Cleanup`
- Recurrence: every 1 week, on Sunday at 2:00 AM
- Pass parameters:
  - TargetFolder: `sqldb/Customer`
  - RetentionDays: `180`
For multiple folders, create one trigger that calls the parent pipeline (PL_Cleanup_AllFolders).
Interview Questions
Q: How do you automate file cleanup in a data lake?
A: Build a pipeline using Get Metadata (to list files), ForEach (to iterate), Get Metadata again (to check the lastModified date), If Condition (to compare against the retention threshold), and Delete (to remove old files). Parameterize the folder path and retention days for reusability.
Q: What does the Get Metadata activity return for a folder?
A: When you request childItems, it returns an array of objects with name and type (File or Folder) for each item in the directory. You can also request lastModified, size, exists, structure, and columnCount.
Q: How do you compare dates in ADF expressions?
A: Use @less() or @greater() with date values. For example, @less(activity('GetFile').output.lastModified, addDays(utcnow(), -180)) checks if a file is older than 180 days. The addDays() function with a negative number subtracts days from the current time.
Q: What is the difference between Get Metadata and Lookup?
A: Lookup reads data from a table or query (SQL, CSV content). Get Metadata reads file/folder properties (name, size, lastModified, childItems) from a file system. Use Lookup for data content and Get Metadata for file system information.
Q: How do you handle the If Condition activity in ADF?
A: If Condition evaluates a boolean expression. It has True activities (executed when the expression is true) and False activities (executed when it is false). Common operators include equals(), greater(), less(), and(), or(), not(), contains(), startsWith(), and endsWith().
Wrapping Up
Data lake cleanup is not glamorous, but it is essential. Without it, storage costs grow linearly, queries slow down, and your data lake becomes unmanageable.
The Get Metadata + ForEach + If Condition + Delete pattern gives you a flexible, auditable, parameterizable cleanup solution that runs on a schedule. Combined with audit logging and dry run mode, you have a production-grade data lifecycle management tool.
Related posts:

- ADLS Gen2 Complete Guide
- Azure Blob Storage Guide
- Metadata-Driven Pipeline in ADF
- ADF Expressions Guide
- Common ADF/Synapse Errors
If this guide helped you clean up your data lake, share it with your team. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.