Autonomous in the loop. Accountable at the gates. An agentic offering by Plainsight

Migrate ADF to Databricks.

Pipeline JSON becomes Lakeflow Jobs, Auto Loader ingestion and Spark transforms: rebuilt, not lifted, and proven against your data.

ADF logo Databricks logo

Why ADF to Databricks is a rebuild, not a lift

Azure Data Factory and Databricks solve overlapping problems with completely different machinery, and that mismatch is exactly why a naive copy of a factory onto the lakehouse fails. ADF is an orchestration-and-movement service: its unit of work is a pipeline of activities, its transforms run as Mapping Data Flows on a managed Spark cluster you never touch, and its compute lives in Integration Runtimes. Databricks is a lakehouse: Delta tables on object storage, governed by Unity Catalog, with Apache Spark as the engine you do control, orchestrated by Lakeflow Jobs and built either as Lakeflow Spark Declarative Pipelines or as your own framework. There is no button that turns one into the other. What moves cleanly is the intent of each pipeline; what has to be rebuilt is almost everything underneath it. The agent fleet treats the migration as exactly that: read the logic, re-express it on the target, and prove the result against your data before anyone signs off. Autonomous in the loop, accountable at the gates.

What breaks in a naive port

The first thing that breaks is the Copy activity. In ADF, “copy data” is a single declarative step that hides connector logic, type handling and incremental bookkeeping. On Databricks there is no monolithic copy, there is a deliberate ingest choice. A file landing zone becomes Auto Loader, whose cloudFiles source incrementally discovers new files and records their metadata in a checkpoint for exactly-once processing; a high-throughput SQL load becomes COPY INTO; a managed application or database source becomes a Lakeflow Connect connector. Pick the wrong one and you either reprocess files or miss them. Worse, ADF’s Copy is forgiving about schema, while Auto Loader infers types, so a column the legacy data flow quietly cast to decimal(18,2) can land as a string and silently corrupt every downstream aggregate. This is the single most common parity failure the fleet catches on this pair, which is why the iteration log above opens with it.

The second thing that breaks is the Mapping Data Flow. It looks portable because it already runs on Spark, but its visual components (Derived Column, Conditional/Alter Row, Window, Aggregate, and the Slowly Changing Dimension pattern) are an ADF abstraction, not Spark code you can lift. The fleet re-expresses the data flow’s semantics set-based. An Alter Row upsert becomes a Delta MERGE INTO; an aggregate becomes a GROUP BY; a Type 2 SCD becomes either AUTO CDC (create_auto_cdc_flow) inside a Lakeflow Spark Declarative Pipeline or a dbt snapshot with the timestamp strategy, which maintains dbt_valid_from, dbt_valid_to and dbt_scd_id over a mutable source. dbt’s merge incremental strategy (the Delta default) covers the SCD Type 1 “latest value wins” case; snapshots cover the Type 2 “keep the history” case. Choosing the wrong one loses history, and the loop’s row-count parity test is what surfaces it.

Rebuilding the control flow and the triggers

ADF’s dependsOn graph is the third thing that does not survive a copy. Each dependency carries one or more conditions (Succeeded, Failed, Skipped, Completed) and those edges encode the real error-handling logic of the estate, not just the happy path. The fleet rebuilds that graph two ways depending on flavor. In the your-framework flavor it becomes Lakeflow Jobs tasks with dependencies and conditional control flow, where a Failed edge turns into a run-if rule on a recovery task. In the dbt flavor the data dependencies become the dbt DAG via ref(), and the conditional and error branches move up into the orchestrator, because dbt is declarative SQL and runtime branching belongs in Lakeflow Jobs (or a Python model), never inside a model. Metadata-driven ForEach loops over a control table become a Lakeflow Jobs for_each task or a PySpark loop; the fleet resolves the parameter chain to describe what runs but never invents control-table values, flagging them for the design gate instead.

Triggers are the fourth. Schedule and storage-event triggers map onto Lakeflow Jobs scheduled and file-arrival triggers with little friction. The tumbling-window trigger is the awkward one: it runs fixed-size, contiguous, non-overlapping intervals, retains state, and supports backfill through windowStartTime / windowEndTime. The fleet reproduces that by parameterising the job over a window boundary and expressing backfill as bounded runs, or, in dbt, as a microbatch incremental keyed off an event-time column. The cadence and replay semantics are preserved even though the mechanism is rebuilt.

Connectivity is the dependency that does not move

Finally, the Integration Runtime. The Azure IR is just managed cloud compute and disappears into Databricks Spark without ceremony. The self-hosted IR is the one that matters: any Copy or data flow reaching an on-premises or network-restricted store through a SHIR is bound to that network path, its custom drivers and its firewall rules, and none of that migrates on its own. The fleet logs every SHIR-bound activity as an explicit risk item and re-plans the path (typically a Lakeflow Connect connector or a connector with the right network access landing the data as Delta) so modeling can start from a known bronze table. The Azure-SSIS IR is logged separately: native SSIS execution on Databricks is not a generally available lift-and-shift, so any lifted packages are rebuild candidates, never a runtime we claim on the target.

How the fleet proves it

Every claim above is settled the same way: the build-test-run loop. The builders generate the ingest, the transforms and the orchestration; the test agent runs schema checks, the dbt tests (unique, not_null, relationships) and row-count parity queries against the ADF baseline; and any failure goes back to the builders, not onto a list. The loop iterates until green (type drift pinned, SCD2 windows non-overlapping, parity delta zero) and only then does the work reach the promotion gate. If you are weighing the other destination, Migrate ADF to Microsoft Fabric runs the same fleet against Microsoft Fabric instead.

How the fleet runs it

ADF to Databricks, stage by stage.

For ADF → Databricks, the same five stages run with the agents that handle this pair named and linked below. Inside the loop they iterate on their own; your experts hold the three gates.

Build → Test → Operate iterates until green on Databricks

Orchestrated end-to-end by The Conductor , The Librarian and The Chronicler across every ADF → Databricks asset.

dbt or your framework

Built your way, on Databricks.

Take ADF to Databricks as an open-source dbt project (free to run), or built into the framework your team has trusted for years: same migration, your standards either way.

What gets generated

dbt
  • A dbt project (dbt_project.yml, models/, snapshots/, tests/, macros/, sources.yml) targeting a Databricks SQL warehouse via dbt-databricks
  • Staging and mart models replacing each Mapping Data Flow, with ref() rebuilding the dependsOn graph as the dbt DAG
  • Incremental models with incremental_strategy='merge' (the Delta default) for the per-key upserts your Copy/data-flow sinks performed
  • dbt snapshots (timestamp or check strategy) for any SCD Type 2 a data flow maintained, producing dbt_valid_from / dbt_valid_to
  • Generic and singular tests (unique, not_null, relationships, plus row-count parity tests) that the loop runs until green

Pattern mapping

Each ADF pattern, mapped to a deliberate Databricks target.

Source pattern
Databricks · dbt
Databricks · your framework
Copy activity (Blob/ADLS file landing zone)
Pair with a load step (Auto Loader / COPY INTO), then model the landed table with source() + staging models
Auto Loader cloudFiles stream into a Delta bronze table (or COPY INTO for thousands-of-files loads)
Copy activity from a managed source (Salesforce, SQL Server, ServiceNow)
Pair with a Lakeflow Connect ingest, then source() the landed table downstream
Lakeflow Connect fully-managed connector (built-in auth, CDC, schema evolution) into Delta
Mapping Data Flow (visual transform graph)
SELECT-based staging then mart models; materialized='table'/'view'/'incremental'
Lakeflow Spark Declarative Pipelines or a Spark SQL / PySpark notebook on Spark compute
Sink with upsert / Alter Row (insert+update)
incremental model, incremental_strategy='merge', unique_key=[...] (compiles to MERGE INTO)
MERGE INTO on a Delta target, or DeltaTable.merge in PySpark
Data-flow Slowly Changing Dimension (Type 2) pattern
dbt **snapshot** (timestamp or check strategy) → dbt_valid_from / dbt_valid_to / dbt_scd_id
AUTO CDC (create_auto_cdc_flow) in Lakeflow SDP, or a hand-built SCD2 MERGE INTO against a Delta dimension
dependsOn graph (Succeeded / Failed / Skipped / Completed edges)
dbt DAG from ref(); error/Failed branches move to the orchestrator (a Lakeflow Job step)
Lakeflow Jobs task dependencies with conditional control flow (run-if rules) reproducing each edge
ForEach over a control table (metadata-driven pipeline)
Jinja for compile-time fan-out across models; the loop itself lives in the orchestrator, not in SQL
Lakeflow Jobs for_each task, or a PySpark loop driving parameterised notebook runs
Schedule / tumbling-window / storage-event trigger
dbt run invoked on a schedule by the orchestrator
Lakeflow Jobs trigger (scheduled or file-arrival) with backfill expressed as bounded job runs
Self-hosted IR reaching an on-prem store
Flagged as a connectivity dependency; modeling starts from the landed Delta table
Re-planned network path: Lakeflow Connect / a connector with the right network access into Delta
Stored Procedure activity (Execute SQL on a DB)
Decompose into models + pre-/post-hooks; complex procedural logic to a Python model
Spark SQL statements, a SQL-scripting BEGIN...END block, or a notebook task in the job

Before & after

A real ADF snippet and what the builders generate on Databricks.

pipelines/IngestCustomers.json legacy
{
  "name": "IngestCustomers",
  "properties": {
    "activities": [
      {
        "name": "CopyCustomersToStage",
        "type": "Copy",
        "policy": { "timeout": "0.12:00:00", "retry": 0 },
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "ParquetSink" }
        },
        "inputs": [{ "referenceName": "ds_customers_csv", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "ds_stage_customers", "type": "DatasetReference" }]
      },
      {
        "name": "UpsertCustomerDim",
        "type": "MappingDataFlow",
        "dependsOn": [
          { "activity": "CopyCustomersToStage", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": { "dataFlow": { "referenceName": "df_scd2_customer" } }
      }
    ],
    "parameters": { "loadDate": { "type": "String" } }
  }
}
dbt
models/marts/dim_customer.sql + snapshots/customer_snapshot.sql dbt
-- snapshots/customer_snapshot.sql  (SCD Type 2, the data flow's job)
{% snapshot customer_snapshot %}
{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
    file_format='delta'
) }}
select * from {{ source('stage', 'customers') }}
{% endsnapshot %}

-- models/marts/dim_customer.sql  (current view over the history)
{{ config(materialized='table', file_format='delta') }}
select
    customer_id, name, segment, region,
    dbt_valid_from as valid_from,
    dbt_valid_to   as valid_to
from {{ ref('customer_snapshot') }}
The loop that closes itself

Iterated until green, and logged.

A failure on this pair doesn't go on a list. It goes back to the builders, and the loop re-runs until the suite passes.

adf-to-databricks · iteration log build-test-operate
iter 1 Iter 1, FAIL: Auto Loader inferred `amount` as STRING; legacy data flow cast it to decimal(18,2). Type drift on the bronze table.
iter 2 Iter 2, FAIL: SCD2 row counts off by 412. The data flow's late-arriving `updated_at` ties broke `sequence_by` ordering.
iter 3 Iter 3, FAIL: parity query found 7 customers with overlapping `valid_from`/`valid_to` windows after the tie fix.
iter 4 Iter 3 (rerun), GREEN: schema hint pins decimal, deterministic tie-break applied; dim_customer parity delta = 0 vs the ADF run.

Green on iteration 3: SCD2 history and ingest types match the ADF baseline, delta 0.

Questions for this migration

ADF → Databricks, answered.

ADF Copy activity isn't really a transform, what does it become on Databricks?

It splits into an explicit ingest step. A file landing zone (Copy from Blob or ADLS) becomes **Auto Loader** (the `cloudFiles` Structured Streaming source that incrementally processes new files and tracks discovered-file metadata in a checkpoint for exactly-once handling) or **`COPY INTO`** for simpler SQL-driven loads of thousands of files. A Copy from a managed source such as Salesforce or SQL Server maps to a **Lakeflow Connect** connector with built-in authentication, CDC and schema evolution. The dbt flavor does not ingest raw files itself: we pair it with one of those load steps and then `source()` the landed Delta table. The rule of thumb the fleet applies is Auto Loader for millions of files over time, `COPY INTO` for thousands.

How does a Mapping Data Flow translate, is it a one-to-one port?

No. A Mapping Data Flow is a visual transform graph that ADF executes on a managed Spark cluster behind the Azure IR, but its components (Derived Column, Alter Row, Window, the SCD pattern) have no literal Databricks equivalent. The fleet re-expresses the *semantics* set-based: in the your-framework flavor as Lakeflow Spark Declarative Pipelines or a Spark SQL / PySpark notebook, and in the dbt flavor as staging-to-mart models. An Alter Row upsert becomes a `MERGE INTO`; an aggregate becomes a `GROUP BY`; the SCD2 transform becomes `AUTO CDC` (the `create_auto_cdc_flow` API in `pyspark.pipelines`, which replaced the older `apply_changes`) or a dbt snapshot. Where the fleet finds a legacy `import dlt` pipeline it reads and upgrades it to the current API. Parity is proven against your data in the loop, not assumed from the diagram.

What happens to my self-hosted integration runtime when there's nowhere on-prem in Databricks?

A self-hosted IR (SHIR) is a connectivity dependency, not a portable artifact. Any Copy or data flow that reached an on-premises or network-restricted store through a SHIR is implicitly bound to that network path, its custom drivers and firewall rules, none of which moves on its own. The fleet records every SHIR-bound activity as an explicit risk item at the assessment gate, and the rebuild re-plans the path: a Lakeflow Connect connector or a connector with the right network access landing the data as Delta. Modeling then starts from the landed bronze table. We never claim the on-prem reach migrates automatically.

ADF triggers include tumbling windows with backfill, does Lakeflow Jobs cover that?

Lakeflow Jobs triggers cover scheduled and file-arrival starts directly, which map cleanly to ADF schedule and storage-event triggers. The tumbling-window trigger is different: it runs fixed-size, non-overlapping intervals, retains state and exposes `windowStartTime` / `windowEndTime` for backfill. The fleet reproduces that by parameterising the job over a window boundary and expressing backfill as a set of bounded job runs (or a `microbatch`-style incremental keyed off an event-time column in the dbt flavor). The intent (contiguous, replayable windows) is preserved; the mechanism is rebuilt rather than ported.

My ADF pipelines are metadata-driven, one generic pipeline looping a control table. How is that handled?

That is the hardest ADF shape because the behavior is data-dependent and not visible in any one JSON file. The fleet resolves the parameter chain (`@pipeline().parameters`, `@dataset().parameters`, chained `@activity('Name').output`) to describe what actually runs, and records where the effective shape depends on the control table. In the your-framework flavor the `ForEach` becomes a Lakeflow Jobs `for_each` task (or a PySpark loop driving parameterised notebook runs); in the dbt flavor the fan-out is generated with Jinja at compile time while the loop itself lives in the orchestrator. The fleet does not invent control-table values, it flags the dependency so your team supplies it at the design gate.

Does the Stored Procedure activity in my pipeline need a rewrite?

It depends what the procedure does. Simple set-based DML can land as Spark SQL statements or dbt pre-/post-hooks. Procedural logic (`IF`/`WHILE`, cursors, dynamic SQL) has more options now than it used to: Databricks supports SQL scripting (`BEGIN...END`, `IF`, `WHILE`, condition handlers) and `CREATE PROCEDURE` stored procedures in Unity Catalog, but several of these are gated behind recent Databricks Runtime versions, so the fleet confirms the customer's runtime before relying on them. Where pure SQL is insufficient the logic moves to a PySpark notebook (your-framework) or a Python model (dbt). The build-test-run loop proves the rewrite produces the same rows.

Let's talk

Ready to migrate ADF to Databricks?

Tell us about your ADF estate and we'll run the assessment, the Surveyor scores it before you commit to Databricks.

Plan my migration

A short form, no spam. We usually reply within one business day.

Plan my migration