Automating Data Lake Cleanup: Delete Old Files Using Get Metadata, ForEach, and If Condition in Azure Data Factory
Your data lake grows every day. Pipelines write new Parquet files, CSVs accumulate, and before you know it, you have terabytes of data that nobody has looked at in months. Storage costs creep up, queries slow down scanning old partitions, and your data lake turns into a data swamp.
The solution is automated cleanup — a pipeline that scans your ADLS Gen2 folders, checks the age of each file, and deletes anything older than a retention period (like 6 months).
In this guide, I will walk you through building this cleanup pipeline using three ADF/Synapse activities you might not have used before: Get Metadata (to list files and their properties), ForEach (to iterate over each file), and If Condition (to decide whether to delete or keep).
This is a real production pattern used in enterprise data platforms for data lifecycle management.
Table of Contents
- The Problem: Data Lake Bloat
- The Solution: Automated File Cleanup Pipeline
- Understanding the Get Metadata Activity
- Understanding the If Condition Activity
- Understanding the Delete Activity
- Pipeline Architecture
- Step 1: Create the Datasets
- Step 2: Build the Pipeline — Get Metadata
- Step 3: Add ForEach to Iterate Over Files
- Step 4: Inside ForEach — Get Each File Metadata
- Step 5: Inside ForEach — If Condition (Age Check)
- Step 6: Inside If Condition — Delete Activity
- Step 7: The Complete Pipeline Structure
- Step 8: Run and Test
- Understanding the Date Comparison Expression
- All Get Metadata Field List Options
- All If Condition Operators
- Making It Dynamic with Parameters
- Production Enhancements
- Scheduling the Cleanup
- Interview Questions
- Wrapping Up
The Problem: Data Lake Bloat
Consider a pipeline that writes daily Parquet files to ADLS Gen2:
datalake/
sqldb/
Customer/
2025/10/05/part-00000.snappy.parquet -- 6+ months old
2025/11/15/part-00000.snappy.parquet -- 5+ months old
2025/12/20/part-00000.snappy.parquet -- 4+ months old
2026/01/10/part-00000.snappy.parquet -- 3 months old
2026/02/15/part-00000.snappy.parquet -- 2 months old
2026/03/20/part-00000.snappy.parquet -- 2 weeks old
2026/04/05/part-00000.snappy.parquet -- today
Address/
... (same pattern, hundreds of files)
Product/
... (same pattern)
After a year of daily runs across 50 tables, you have 18,000+ files. Most of them are never queried, yet they add to your storage bill every month and slow down queries that scan the folder.
The Solution: Automated File Cleanup Pipeline
We will build a pipeline that:
- Scans a folder in ADLS Gen2 to get the list of all child items (files and subfolders)
- Iterates over each item using ForEach
- Checks the last modified date of each file using Get Metadata
- Compares the date to a threshold (6 months ago)
- Deletes the file if it is older than 6 months
- Keeps the file if it is newer
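The steps above can be sketched in plain Python (a minimal model, not ADF code; the item shapes mirror what Get Metadata returns, and `classify` is an illustrative name):

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # plays the role of the RetentionDays pipeline parameter

def classify(items, now=None):
    """Mimic the pipeline's decision logic: skip folders, delete files
    older than the retention threshold, keep newer files."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(days=RETENTION_DAYS)  # addDays(utcnow(), -180)
    to_delete, to_keep = [], []
    for item in items:
        if item["type"] != "File":            # CheckIfFile: skip folders
            continue
        if item["lastModified"] < threshold:  # CheckAge: older than threshold
            to_delete.append(item["name"])
        else:
            to_keep.append(item["name"])
    return to_delete, to_keep
```

The real pipeline makes the same two checks, just spread across two If Condition activities.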
Understanding the Get Metadata Activity
The Get Metadata activity retrieves information about files and folders in your data store. It is one of the most versatile but underused activities in ADF.
What It Can Return
You choose which “field list” items to retrieve:
| Field | What It Returns | Example |
|---|---|---|
| childItems | List of files and folders in a directory | [{name: "file1.parquet", type: "File"}, ...] |
| itemName | Name of the file or folder | "part-00000.snappy.parquet" |
| itemType | Whether it is a File or Folder | "File" or "Folder" |
| size | File size in bytes | 125000 |
| lastModified | Last modified timestamp | "2025-10-05T14:30:00Z" |
| created | Creation timestamp | "2025-10-05T14:30:00Z" |
| exists | Whether the file/folder exists | true or false |
| structure | Column schema of the file | [{name: "id", type: "Int32"}, ...] |
| columnCount | Number of columns | 15 |
Two Modes of Operation
Folder mode (childItems): When you point Get Metadata at a folder and request childItems, it returns the list of all files and subfolders inside that folder. This is how we scan for files to clean up.
File mode (lastModified, size, etc.): When you point Get Metadata at a specific file, it returns that file's properties. This is how we check when each file was last modified.
We use both modes in our pipeline — first to list files, then to check each file's age.
Understanding the If Condition Activity
The If Condition activity is the ADF equivalent of an if/else statement. It evaluates a boolean expression and executes different activities depending on the result.
If Condition: @greater(activity('GetFileInfo').output.lastModified, addDays(utcnow(), -180))
|
|-- True activities (file is NEWER than 6 months) --> do nothing (keep the file)
|
|-- False activities (file is OLDER than 6 months) --> Delete the file
Comparison Operators Available
| Operator | Expression | Example |
|---|---|---|
| Equals | @equals(a, b) | @equals(item().type, 'File') |
| Not equals | @not(equals(a, b)) | @not(equals(item().type, 'Folder')) |
| Greater than | @greater(a, b) | @greater(output.size, 1000000) |
| Greater or equal | @greaterOrEquals(a, b) | @greaterOrEquals(output.size, 0) |
| Less than | @less(a, b) | @less(output.size, 100) |
| Less or equal | @lessOrEquals(a, b) | @lessOrEquals(output.size, 1000) |
| And | @and(expr1, expr2) | @and(equals(a, 'File'), greater(b, 0)) |
| Or | @or(expr1, expr2) | @or(equals(a, 'csv'), equals(a, 'parquet')) |
| Not | @not(expr) | @not(equals(item().type, 'Folder')) |
| Contains | @contains(string, substr) | @contains(item().name, '.parquet') |
| StartsWith | @startsWith(string, prefix) | @startsWith(item().name, 'part-') |
| EndsWith | @endsWith(string, suffix) | @endsWith(item().name, '.csv') |
| Empty | @empty(value) | @empty(activity('Lookup').output.value) |
Understanding the Delete Activity
The Delete activity removes files or folders from supported data stores (ADLS Gen2, Blob Storage, SFTP, etc.).
Key settings:

- Dataset: points to the file or folder to delete
- Recursive: if deleting a folder, whether to delete its contents recursively
- Logging: can log deleted file names to a separate location for an audit trail
The Delete activity returns:

- output.deletedFiles — number of files deleted
- output.deletedFolders — number of folders deleted
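For a rough sense of what this pattern amounts to outside ADF, here is a sketch using the azure-storage-file-datalake Python SDK. The function takes a FileSystemClient (or any object with the same `get_paths`/`get_file_client` interface), and note that `get_paths()` lists recursively by default, unlike the one-level childItems scan. Treat it as an illustration under those assumptions, not production code:

```python
from datetime import datetime, timedelta, timezone

def delete_old_files(fs_client, folder, retention_days=180, dry_run=True):
    """Delete files under `folder` older than the retention window.

    `fs_client` is an azure.storage.filedatalake FileSystemClient.
    dry_run defaults to True as a safety net.
    Returns the list of paths that were (or would be) deleted.
    """
    threshold = datetime.now(timezone.utc) - timedelta(days=retention_days)
    deleted = []
    for path in fs_client.get_paths(path=folder):
        if path.is_directory:               # skip subfolders
            continue
        if path.last_modified < threshold:  # older than retention window
            if not dry_run:
                fs_client.get_file_client(path.name).delete_file()
            deleted.append(path.name)
    return deleted
```

The ADF pipeline in this guide does the same thing declaratively, with the added benefits of built-in logging, monitoring, and scheduling.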
Pipeline Architecture
Here is the complete pipeline flow:
Pipeline: PL_Cleanup_OldFiles
|
|-- Get Metadata: "GetFolderContents"
| |-- Dataset: DS_ADLS_Folder (points to the target folder)
| |-- Field list: childItems
| |-- Output: array of {name, type} for each item in the folder
|
|-- ForEach: "ForEach_Item"
|-- Items: @activity('GetFolderContents').output.childItems
|-- Sequential: true (safer for delete operations)
|
|-- Inside ForEach:
|
|-- If Condition: "CheckIfFile"
| |-- Expression: @equals(item().type, 'File')
| |
| |-- True (it is a file):
| |
| |-- Get Metadata: "GetFileAge"
| | |-- Field list: lastModified
| | |-- Dataset: DS_ADLS_File (parameterized with item().name)
| |
| |-- If Condition: "CheckAge"
| |-- Expression: @less(
| | activity('GetFileAge').output.lastModified,
| | addDays(utcnow(), -180))
| |
| |-- True (older than 6 months):
| | |-- Delete: "DeleteOldFile"
| |
| |-- False (newer than 6 months):
| |-- (do nothing -- keep the file)
Step 1: Create the Datasets
You need two datasets:
A. DS_ADLS_Folder (Points to the folder to scan)
- Create dataset: ADLS Gen2 -> Binary (or any format; we are just listing contents)
- Name: `DS_ADLS_Folder`
- Linked service: your ADLS Gen2 linked service
- Parameters: `FolderPath` (String, no default)
- Connection tab:
  - Container: `datalake` (or your container name)
  - Directory: `@dataset().FolderPath`
  - File name: leave blank

B. DS_ADLS_File (Points to a specific file — parameterized)
- Create dataset: ADLS Gen2 -> Binary
- Name: `DS_ADLS_File`
- Linked service: same ADLS linked service
- Parameters: `FolderPath` (String, no default) and `FileName` (String, no default)
- Connection tab:
  - Container: `datalake`
  - Directory: `@dataset().FolderPath`
  - File name: `@dataset().FileName`
Step 2: Build the Pipeline — Get Metadata
- Create pipeline: `PL_Cleanup_OldFiles`
- Add pipeline parameters:
  - `TargetFolder` (String, default: `sqldb/Customer`)
  - `RetentionDays` (Int, default: `180`)
- Drag a Get Metadata activity onto the canvas
- Name: `GetFolderContents`
- Settings tab:
  - Dataset: `DS_ADLS_Folder`
  - FolderPath parameter: `@pipeline().parameters.TargetFolder`
  - Field list: click + New and select childItems
What This Returns
The Get Metadata activity outputs:
{
"childItems": [
{"name": "part-00000-abc.snappy.parquet", "type": "File"},
{"name": "part-00001-def.snappy.parquet", "type": "File"},
{"name": "_SUCCESS", "type": "File"},
{"name": "2026", "type": "Folder"}
]
}
Each item has a name and type (File or Folder).
Step 3: Add ForEach to Iterate Over Files
- Drag a ForEach activity onto the canvas
- Name: `ForEach_Item`
- Connect: green arrow from `GetFolderContents` to `ForEach_Item`
- Settings tab:
  - Sequential: checked (recommended for delete operations to avoid race conditions)
  - Items: `@activity('GetFolderContents').output.childItems`
Step 4: Inside ForEach — Check If File and Get Metadata
Double-click ForEach to open the inner canvas.
A. First If Condition: Is It a File?
We only want to process files, not subfolders:
- Drag an If Condition activity inside ForEach
- Name: `CheckIfFile`
- Activities tab:
  - Expression: `@equals(item().type, 'File')`
B. True Path — Get File Metadata
Inside the If Condition True activities:
- Drag a Get Metadata activity
- Name: `GetFileAge`
- Settings:
  - Dataset: `DS_ADLS_File`
  - FolderPath: `@pipeline().parameters.TargetFolder`
  - FileName: `@item().name`
  - Field list: select lastModified
This returns:
{
"lastModified": "2025-10-05T14:30:00Z",
"itemName": "part-00000-abc.snappy.parquet"
}
Step 5: Inside ForEach — If Condition (Age Check)
After GetFileAge, add another If Condition to check the age:
- Drag an If Condition activity (inside the True path, after GetFileAge)
- Name: `CheckAge`
- Expression:

@less(
  activity('GetFileAge').output.lastModified,
  addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
)
Breaking Down This Expression
@less(A, B) -- is A less than (older than) B?
A = activity('GetFileAge').output.lastModified
-- the file last modified date (e.g., "2025-10-05")
B = addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
-- today minus RetentionDays (e.g., 2026-04-05 minus 180 = 2025-10-07)
If the file was modified on 2025-10-05 and the threshold is 2025-10-07:
@less("2025-10-05", "2025-10-07") = TRUE
-- File is older than 6 months --> DELETE
If the file was modified on 2026-03-20:
@less("2026-03-20", "2025-10-07") = FALSE
-- File is newer than 6 months --> KEEP
Alternative simpler expression if you want to hardcode 180 days:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), -180)
)
Step 6: Inside If Condition — Delete Activity
In the CheckAge True path (file IS older than threshold):
- Drag a Delete activity
- Name: `DeleteOldFile`
- Settings:
  - Dataset: `DS_ADLS_File`
  - FolderPath: `@pipeline().parameters.TargetFolder`
  - FileName: `@item().name`
  - Enable logging: Yes (recommended; logs which files were deleted)
  - Log settings:
    - Linked service: your ADLS linked service
    - Path: `datalake/cleanup-logs` (creates a log of deleted files)

In the CheckAge False path (file is newer):

- Leave empty (do nothing; the file is kept)
Step 7: The Complete Pipeline Structure
GetFolderContents (Get Metadata -- list childItems)
|
v
ForEach_Item (iterate over each child item)
|
|-- CheckIfFile (If Condition: item().type == 'File')
|
|-- TRUE:
| |-- GetFileAge (Get Metadata -- lastModified)
| | |
| | v
| |-- CheckAge (If Condition: lastModified < 6 months ago)
| |
| |-- TRUE: DeleteOldFile (Delete activity)
| |-- FALSE: (do nothing -- keep file)
|
|-- FALSE: (skip folders)
Step 8: Run and Test
Test Setup
Before running on real data, create test files:
- Upload a file to your target folder with a recent date (keep this)
- Upload another file and manually set its properties to an old date, OR
- Point the pipeline at a folder with known old files
Debug Run
- Click Validate and fix any errors
- Click Debug
- Set pipeline parameters:
  - TargetFolder: `sqldb/Customer`
  - RetentionDays: `180`
- Watch the Monitor output
Expected Behavior
GetFolderContents --> Succeeded (found 7 items)
ForEach_Item --> Succeeded (7 iterations)
Iteration 1: part-00000-abc.parquet
CheckIfFile --> True (it is a file)
GetFileAge --> lastModified: 2025-10-05
CheckAge --> True (older than 6 months)
DeleteOldFile --> Succeeded (file deleted)
Iteration 2: part-00001-def.parquet
CheckIfFile --> True
GetFileAge --> lastModified: 2026-03-20
CheckAge --> False (newer than 6 months)
(no delete -- file kept)
Iteration 3: 2026 (folder)
CheckIfFile --> False (it is a folder)
(skipped)
Verify
Check your ADLS Gen2 folder — old files should be gone, recent files should remain.
Check the cleanup log at datalake/cleanup-logs/ — it contains a record of every deleted file.
Understanding the Date Comparison Expression
Let us break down the core expression piece by piece:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), -180)
)
| Part | What It Does | Example Value |
|---|---|---|
| utcnow() | Current UTC time | 2026-04-05T14:30:00Z |
| addDays(utcnow(), -180) | Subtract 180 days | 2025-10-07T14:30:00Z |
| activity('GetFileAge').output.lastModified | File last modified | 2025-10-05T14:30:00Z |
| @less(fileDate, threshold) | Is file date before threshold? | true (Oct 5 < Oct 7) |
The less() function compares the ISO-8601 timestamp strings lexicographically, which matches chronological order because the fields are fixed-width and run from most to least significant (year, month, day, time). An earlier date is therefore "less than" a later date, so @less(oldDate, recentDate) returns true.
Making Retention Dynamic
Instead of hardcoding -180, use a pipeline parameter:
@less(
activity('GetFileAge').output.lastModified,
addDays(utcnow(), mul(-1, pipeline().parameters.RetentionDays))
)
mul(-1, 180) = -180. This way you can change retention without editing the pipeline.
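You can verify this string-comparison behavior in a few lines of Python (the helper names `threshold_iso` and `is_older` are mine, and this assumes both timestamps are UTC and share the same format, as they do in the lastModified output):

```python
from datetime import datetime, timedelta, timezone

def threshold_iso(retention_days, now=None):
    """Compute the cutoff the way addDays(utcnow(), -RetentionDays) does,
    formatted like an ADF UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=retention_days)).strftime("%Y-%m-%dT%H:%M:%SZ")

def is_older(last_modified_iso, retention_days, now=None):
    """Plain string < on two same-format ISO-8601 UTC strings matches
    chronological order, which is what ADF's less() relies on here."""
    return last_modified_iso < threshold_iso(retention_days, now)
```

For example, with "today" fixed at 2026-04-05T14:30:00Z, the 180-day threshold comes out to 2025-10-07T14:30:00Z, so a file modified on 2025-10-05 tests as older and one modified on 2026-03-20 does not.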
All Get Metadata Field List Options
Here is every field you can request from Get Metadata:
| Field | Applies To | Returns |
|---|---|---|
| childItems | Folders | Array of {name, type} for each child |
| itemName | Files and Folders | Name of the item |
| itemType | Files and Folders | “File” or “Folder” |
| size | Files | Size in bytes |
| lastModified | Files and Folders | Last modified timestamp |
| created | Files and Folders | Creation timestamp |
| exists | Files and Folders | Boolean — does it exist? |
| structure | Files (structured) | Column schema array |
| columnCount | Files (structured) | Number of columns |
| contentMD5 | Files | MD5 hash of file content |
Combining Multiple Fields
You can request multiple fields in a single Get Metadata call:
Field list: childItems, itemName, itemType, lastModified
This is more efficient than making separate calls for each property.
All If Condition Operators
Complete reference of boolean expressions for If Condition:
Comparison
@equals(a, b) -- a == b
@not(equals(a, b)) -- a != b
@greater(a, b) -- a > b
@greaterOrEquals(a, b) -- a >= b
@less(a, b) -- a < b
@lessOrEquals(a, b) -- a <= b
Logical
@and(expr1, expr2) -- both true
@or(expr1, expr2) -- either true
@not(expr) -- negate
String
@contains(string, substr) -- string contains substr
@startsWith(string, prefix) -- string starts with prefix
@endsWith(string, suffix) -- string ends with suffix
@empty(value) -- value is null or empty
Practical Examples
-- Only process .parquet files
@and(equals(item().type, 'File'), endsWith(item().name, '.parquet'))
-- Skip hidden files (starting with _ or .)
@not(or(startsWith(item().name, '_'), startsWith(item().name, '.')))
-- File is larger than 1 MB
@greater(activity('GetFileAge').output.size, 1048576)
-- File is older than 90 days AND larger than 100 MB
@and(
less(activity('GetFileAge').output.lastModified, addDays(utcnow(), -90)),
greater(activity('GetFileAge').output.size, 104857600)
)
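The same filters can be modeled in Python over childItems-shaped dictionaries (the helper names are mine, for illustration only):

```python
def is_parquet_file(item):
    # @and(equals(item().type, 'File'), endsWith(item().name, '.parquet'))
    return item["type"] == "File" and item["name"].endswith(".parquet")

def is_hidden(item):
    # the check that @not(or(startsWith(item().name, '_'),
    # startsWith(item().name, '.'))) negates: True when the item IS hidden
    return item["name"].startswith(("_", "."))

items = [
    {"name": "part-00000.snappy.parquet", "type": "File"},
    {"name": "_SUCCESS", "type": "File"},
    {"name": "2026", "type": "Folder"},
]
# combine both filters, as the nested If Conditions do in the pipeline
visible_parquet = [i["name"] for i in items
                   if is_parquet_file(i) and not is_hidden(i)]
```

Only `part-00000.snappy.parquet` survives both filters: `_SUCCESS` is hidden and `2026` is a folder.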
Making It Dynamic with Parameters
To make the pipeline reusable for any folder, add parameters:
Pipeline Parameters:
TargetFolder (String) -- e.g., "sqldb/Customer"
RetentionDays (Int) -- e.g., 180
ContainerName (String) -- e.g., "datalake"
FileExtension (String) -- e.g., ".parquet" (optional filter)
Then call this pipeline from a parent pipeline that loops through multiple folders:
Parent Pipeline: PL_Cleanup_AllFolders
|-- Lookup: read list of folders to clean from a config table
|-- ForEach: iterate over folders
|-- Execute Pipeline: call PL_Cleanup_OldFiles with each folder
Parameters:
TargetFolder: @item().FolderName
RetentionDays: @item().RetentionDays
This way you can set different retention periods for different folders:
INSERT INTO cleanup_config VALUES ('sqldb/Customer', 180); -- 6 months
INSERT INTO cleanup_config VALUES ('sqldb/Logs', 30); -- 30 days
INSERT INTO cleanup_config VALUES ('sqldb/Archive', 365); -- 1 year
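A sketch of how the parent pipeline's Lookup + ForEach turns those config rows into per-folder cutoffs (the column names FolderName and RetentionDays match the parameters shown above; the dictionary shape is an assumption about what a Lookup would return):

```python
from datetime import datetime, timedelta, timezone

# Rows as a Lookup over the cleanup_config table might return them
cleanup_config = [
    {"FolderName": "sqldb/Customer", "RetentionDays": 180},
    {"FolderName": "sqldb/Logs",     "RetentionDays": 30},
    {"FolderName": "sqldb/Archive",  "RetentionDays": 365},
]

def thresholds(config, now=None):
    """Per-folder cutoff dates, as the parent ForEach would pass them
    down to PL_Cleanup_OldFiles via Execute Pipeline parameters."""
    now = now or datetime.now(timezone.utc)
    return {row["FolderName"]: now - timedelta(days=row["RetentionDays"])
            for row in config}
```

Each child-pipeline run then only sees its own TargetFolder and RetentionDays, so adding a new folder to the cleanup is a single config-table insert.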
Production Enhancements
1. Add Audit Logging
Log every deletion to an audit table:
INSERT INTO cleanup_audit (file_name, folder_path, file_date, deleted_at, retention_days)
VALUES (@fileName, @folderPath, @fileDate, GETDATE(), @retentionDays);
Add a Stored Procedure activity after the Delete activity to record what was deleted.
2. Dry Run Mode
Add a DryRun boolean pipeline parameter. When true, the pipeline logs which files WOULD be deleted but does not actually delete them:
If Condition expression:
@and(
less(activity('GetFileAge').output.lastModified, addDays(utcnow(), -180)),
not(pipeline().parameters.DryRun)
)
3. Handle Nested Folders
The basic pipeline only scans one level. For nested date-partitioned folders like Customer/2025/10/05/, you need recursive scanning:
Option A: Use the Delete activity with Recursive = true and a wildcard path.
Option B: Use an Execute Pipeline activity to invoke a cleanup pipeline per subfolder (ADF does not allow a pipeline to call itself directly, so recursion has to be simulated with a helper pipeline).
4. Email Notification
Add a Web activity at the end that calls a Logic App or sends an email with the cleanup summary:
Files scanned: 150
Files deleted: 45
Storage freed: 2.3 GB
5. Cost Optimization Integration
Combine this pipeline with Azure Storage Lifecycle Management for a two-tier approach:

- Lifecycle Management: automatically moves old files from Hot to Cool to Archive tier
- ADF Cleanup Pipeline: permanently deletes files past their retention period
Scheduling the Cleanup
Add a trigger to run weekly or monthly:
- Click Add trigger > New/Edit
- Create a Schedule trigger
- Name: `TR_Weekly_Cleanup`
- Recurrence: every 1 week, on Sunday at 2:00 AM
- Pass parameters:
  - TargetFolder: `sqldb/Customer`
  - RetentionDays: `180`
For multiple folders, create one trigger that calls the parent pipeline (PL_Cleanup_AllFolders).
Interview Questions
Q: How do you automate file cleanup in a data lake?
A: Build a pipeline using Get Metadata (to list files), ForEach (to iterate), Get Metadata again (to check the lastModified date), If Condition (to compare against the retention threshold), and Delete (to remove old files). Parameterize the folder path and retention days for reusability.
Q: What does the Get Metadata activity return for a folder?
A: When you request childItems, it returns an array of objects with name and type (File or Folder) for each item in the directory. You can also request lastModified, size, exists, structure, and columnCount.
Q: How do you compare dates in ADF expressions?
A: Use @less() or @greater() with date values. For example, @less(activity('GetFile').output.lastModified, addDays(utcnow(), -180)) checks if a file is older than 180 days. The addDays() function with a negative number subtracts days from the current time.
Q: What is the difference between Get Metadata and Lookup?
A: Lookup reads data from a table or query (SQL, CSV content). Get Metadata reads file/folder properties (name, size, lastModified, childItems) from a file system. Use Lookup for data content and Get Metadata for file system information.
Q: How do you handle the If Condition activity in ADF?
A: If Condition evaluates a boolean expression. It has True activities (executed when the expression is true) and False activities (executed when it is false). Common operators include equals(), greater(), less(), and(), or(), not(), contains(), startsWith(), and endsWith().
Wrapping Up
Data lake cleanup is not glamorous, but it is essential. Without it, storage costs grow linearly, queries slow down, and your data lake becomes unmanageable.
The Get Metadata + ForEach + If Condition + Delete pattern gives you a flexible, auditable, parameterizable cleanup solution that runs on a schedule. Combined with audit logging and dry run mode, you have a production-grade data lifecycle management tool.
Related posts:

- ADLS Gen2 Complete Guide
- Azure Blob Storage Guide
- Metadata-Driven Pipeline in ADF
- ADF Expressions Guide
- Common ADF/Synapse Errors
If this guide helped you clean up your data lake, share it with your team. Questions? Drop a comment below.
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.