Migrate Azure Synapse to Databricks.
Four Synapse workloads onto the lakehouse (Spark notebooks fit naturally, the SQL pools get re-derived as Delta) with three human gates and a loop that iterates until green.
Why Azure Synapse to Databricks is four migrations, not one
An Azure Synapse Analytics workspace looks like one product, but it’s a collaboration boundary that bundles four distinct workloads, each with its own engine: Synapse pipelines, Spark notebooks, dedicated SQL pools, and serverless SQL. Treating them as a single thing is the first mistake a naive lift-and-shift makes. The pipelines are the same JSON model as Azure Data Factory; the notebooks run on serverless Apache Spark pools and write Delta; the dedicated SQL pool is a massively parallel T-SQL warehouse spread across 60 distributions; and serverless SQL is an on-demand query engine that stores nothing and reads files in place. Moving the estate onto Databricks means handling all four, and the difficulty is wildly uneven across them. The agent fleet documents each workload on its own terms first, then rebuilds: autonomous in the loop, accountable at the three human gates of assessment, design, and promotion.
The good news for this pair is that one workload is a genuinely natural fit. Synapse Spark notebooks already run Apache Spark and already write Delta, so they map to Databricks notebooks more closely than to almost any other target. A Spark-notebook-heavy Synapse estate lands on the lakehouse with the least friction of any source we migrate. The hard parts are the SQL pools, where Synapse’s physical design has no Databricks equivalent and must be re-derived rather than copied.
What breaks in a naive lift-and-shift
The dedicated SQL pool is where the copy-paste instinct does the most damage. Its performance model is built entirely around the 60-distribution MPP architecture: you choose a HASH distribution column so co-located joins avoid data movement, you make staging tables ROUND_ROBIN for fast loads, you make small dimensions REPLICATED so a full copy is cached on every compute node, and you partition on a single column (usually a date) so you can do fast partition switching and partition elimination. On Databricks none of these concepts exist. Delta Lake on the lakehouse has no distribution layer; physical layout is handled by liquid clustering and OPTIMIZE, and the Spark optimizer decides broadcast joins on its own, so the REPLICATED choice simply evaporates. Porting DISTRIBUTION = HASH(...) as if it were meaningful Delta DDL produces tables that are at best ignored and at worst misleading. The fleet reads why each distribution and partition choice was made (the access patterns behind it) and re-derives a Delta layout from those patterns.
The dialect breaks too. Even though dedicated-pool T-SQL is already a reduced subset (an 8-level nesting cap instead of 32, no @@NESTLEVEL, no INSERT...EXECUTE, constrained scalar UDFs, blob-type limits that force dynamic SQL to be chunked and run with EXEC()), it is still T-SQL, not Spark SQL. ISNULL becomes coalesce/nvl, GETDATE() becomes current_timestamp(), TOP n becomes LIMIT n, and [bracketed] identifiers become backticks. IDENTITY(1,1) surrogate keys do not behave like Delta’s GENERATED ... AS IDENTITY under parallel writes. Every T-SQL function gets checked against the Spark SQL built-in set; some have no direct equivalent and need rewriting, and the fleet catches these in the loop, not in a hand-off document.
Serverless SQL has its own trap. Because it is compatible with Delta reader version 1 only, a Synapse logical data warehouse may have been built around that ceiling: no deletion vectors, no column renames, no v2 checkpoints. The lakehouse imposes no such limit, so the rebuild as Unity Catalog external tables and views over the same Delta usually gets simpler, but the fleet still has to recognise where the old constraint shaped the design. And the deepest risk is the coupling between workloads: in Synapse the real logic lives across the seams, where a Spark notebook lands Delta, serverless SQL exposes views over it, the dedicated pool holds the modeled warehouse, and pipelines orchestrate the lot. Type mismatches, Delta-version compatibility, and schema-sync delays all hide at those Spark → serverless → dedicated boundaries.
How the fleet handles it
The Documenter reads the logic, not the labels, across all four workloads and follows data across the boundaries rather than documenting each pool in isolation. That inventory (the knowledge base) is what a human signs off at the assessment gate. For Synapse pipelines, the activity graph and its dependsOn conditions (Succeeded, Failed, Skipped, Completed) are reconstructed and re-expressed as Lakeflow Jobs tasks, dependencies, and triggers. For notebooks, the close-fit Spark logic is ported while %run versus mssparkutils.notebook.run chaining and mssparkutils utility calls are reworked for Databricks. For the dedicated pool, layout is re-derived for Delta and SCD2 history that was hand-rolled in T-SQL becomes either dbt snapshots (with dbt_valid_from / dbt_valid_to) on the dbt flavor or an explicit Delta MERGE INTO on the your-framework flavor. For serverless SQL, query-in-place views become Unity Catalog tables and views.
Whichever flavor you choose, the builders generate, the Test agent runs the suite, and the build-test-run loop iterates until green: row counts, SCD2 valid-from/valid-to spans, and aggregate parity all have to match source before anything reaches the promotion gate. Counts come from config, and you get to inventory in minutes. If your destination is Microsoft’s own platform instead, Migrate Azure Synapse to Microsoft Fabric follows the same fleet and the same gates onto Fabric.
Azure Synapse to Databricks, stage by stage.
For Azure Synapse → Databricks, the same five stages run with the agents that handle this pair named and linked below. Inside the loop they iterate on their own; your experts hold the three gates.
Document
Orchestrated end-to-end by The Conductor , The Librarian and The Chronicler across every Azure Synapse → Databricks asset.
Built your way, on Databricks.
Take Azure Synapse to Databricks as an open-source dbt project (free to run), or built into the framework your team has trusted for years: same migration, your standards either way.
What gets generated
- A dbt project on a Databricks SQL warehouse via
dbt-databricks(+file_format: delta), withref()-built DAG,sources.yml, tests, macros and generated docs - Dimension and fact models as Delta tables, with
incrementalmodels usingincremental_strategy='merge'(compiles toMERGE INTO) onunique_key - dbt **snapshots** (timestamp or check strategy) for SCD Type 2, producing
dbt_valid_from/dbt_valid_tohistory in place of dedicated-pool history tables - Serverless-SQL views re-expressed as dbt
viewmodels over Unity Catalog tables;liquid_clustered_by/partition_bychosen for Delta layout
- Databricks notebooks (PySpark / Spark SQL) ported close-to-line from Synapse Spark notebooks, scheduled as tasks in **Lakeflow Jobs**
- Delta tables in Unity Catalog for the dedicated-pool warehouse, with
MERGE INTOupserts and SCD2 written explicitly against the dimension - **Lakeflow Spark Declarative Pipelines** with data-quality **expectations** where serverless views and notebook hand-offs become governed pipelines
- Unity Catalog external tables and views over Delta for serverless query-in-place; Lakeflow Jobs replacing Synapse pipeline orchestration
Pattern mapping
Each Azure Synapse pattern, mapped to a deliberate Databricks target.
table/incremental model on Delta; liquid_clustered_by or partition_by chosen for layout (distribution discarded)OPTIMIZE; the 60-distribution key is re-derived, not copiedtable); no distribution concept on DeltaCOPY INTO for the landing steptable model; broadcast handled by the Spark optimizer, not declaredincremental_strategy='insert_overwrite' (partition-aware) or replace_where for selective rewritesreplaceWhere / partition overwrite; OPTIMIZE and liquid clustering replace partition switchingOPENROWSET / external table over Delta in the lakeview/table model over a Unity Catalog table; raw files declared in sources.ymlcloudFiles (Auto Loader) for incremental file landingview model selecting from the registered Delta table (partitions resolved by the engine)mssparkutils → dbutils, %run include logic reworked as Lakeflow Jobs task dependenciesdependsOnref()) for transform order; the schedule runs dbt build as an orchestrated taskBEGIN...END, IF, WHILE) and CREATE PROCEDURE in Unity Catalog, or a PySpark notebook (confirm Runtime gates)Before & after
A real Azure Synapse snippet and what the builders generate on Databricks.
-- Synapse dedicated SQL pool: hash-distributed, date-partitioned SCD2 dimension
CREATE TABLE dbo.DimCustomer
(
CustomerKey BIGINT IDENTITY(1,1) NOT NULL,
CustomerId INT NOT NULL,
CustomerName NVARCHAR(200) NOT NULL,
Segment NVARCHAR(50) NOT NULL,
ValidFrom DATE NOT NULL,
ValidTo DATE NULL,
IsCurrent BIT NOT NULL
)
WITH
(
DISTRIBUTION = HASH(CustomerId),
CLUSTERED COLUMNSTORE INDEX,
PARTITION ( ValidFrom RANGE RIGHT FOR VALUES ('2024-01-01','2025-01-01') )
);
-- close changed rows, then insert new versions (hand-rolled SCD Type 2)
UPDATE tgt
SET tgt.ValidTo = stg.LoadDate, tgt.IsCurrent = 0
FROM dbo.DimCustomer tgt
INNER JOIN stg.Customer stg
ON tgt.CustomerId = stg.CustomerId AND tgt.IsCurrent = 1
WHERE tgt.Segment <> stg.Segment OR tgt.CustomerName <> stg.CustomerName;
{% snapshot dim_customer %}
{{
config(
target_schema='marts',
unique_key='customer_id',
strategy='check',
check_cols=['customer_name', 'segment'],
file_format='delta'
)
}}
-- Delta snapshot on Databricks: dbt maintains dbt_valid_from / dbt_valid_to.
-- The dedicated-pool HASH distribution and RANGE partition are dropped;
-- physical layout is a Delta concern (liquid_clustered_by), not a key choice.
select
customer_id,
customer_name,
segment
from {{ source('crm', 'customer') }}
{% endsnapshot %}
-- Databricks (Spark SQL) in a notebook task on Lakeflow Jobs.
-- SCD Type 2 written explicitly with Delta MERGE INTO against a
-- Unity Catalog table. No distribution / partition-switch concepts;
-- layout is liquid clustering, tuned later with OPTIMIZE.
MERGE INTO marts.dim_customer AS tgt
USING staging.customer AS stg
ON tgt.customer_id = stg.customer_id AND tgt.is_current = true
WHEN MATCHED
AND (tgt.segment <> stg.segment OR tgt.customer_name <> stg.customer_name)
THEN UPDATE SET tgt.valid_to = stg.load_date, tgt.is_current = false;
INSERT INTO marts.dim_customer
(customer_id, customer_name, segment, valid_from, valid_to, is_current)
SELECT stg.customer_id, stg.customer_name, stg.segment,
stg.load_date, NULL, true
FROM staging.customer stg
LEFT JOIN marts.dim_customer cur
ON cur.customer_id = stg.customer_id AND cur.is_current = true
WHERE cur.customer_id IS NULL
OR cur.segment <> stg.segment
OR cur.customer_name <> stg.customer_name;
Iterated until green, and logged.
A failure on this pair doesn't go on a list. It goes back to the builders, and the loop re-runs until the suite passes.
Green on iteration 4: full parity on the dedicated-pool warehouse and the Spark-notebook hand-offs, surrogate keys and SCD2 spans included.
Azure Synapse → Databricks, answered.
Do the dedicated SQL pool's distribution and partition choices carry over to Databricks?
No, and trying to copy them is the classic naive-migration mistake. A dedicated SQL pool spreads every table across 60 distributions, and you tune it by picking a `HASH` distribution column, a `ROUND_ROBIN` staging table, or a `REPLICATED` small dimension, plus single-column partitioning for partition switching. None of that maps to Delta. The lakehouse has no 60-distribution MPP layer: physical layout is a Delta concern handled by liquid clustering, `OPTIMIZE`, and partition pruning, and the Spark optimizer decides broadcast joins on its own (so `REPLICATED` has no analogue). The fleet re-derives the layout for Delta from the access patterns it documented; it never ports the distribution key as if it were portable.
Synapse Spark notebooks are already Spark, is that part basically free?
It is the closest fit of the four workloads, but not a no-op. Synapse notebooks run on serverless Apache Spark pools and write Delta, so the DataFrame and Spark SQL logic ports cleanly to Databricks notebooks. The work is in the seams: `mssparkutils` / `notebookutils` calls become `dbutils`; `%run` (which copies cells in and shares variable context) and `mssparkutils.notebook.run` (isolated, import-style) behave differently from Databricks notebook chaining, so the fleet re-expresses cross-notebook flow as task dependencies in Lakeflow Jobs; and `%%configure` session settings map to cluster / job configuration. A Spark-notebook-heavy estate genuinely lands on Databricks naturally, the fleet just reads the chaining semantics rather than assuming them.
How does serverless SQL pool's query-in-place model translate to the lakehouse?
Serverless SQL pool stores nothing: it queries files in place with `OPENROWSET`, external tables, and views, often built into a logical data warehouse over Delta in the lake. On Databricks the equivalent is Unity Catalog external tables and views over the same Delta, queried through Databricks SQL or Spark. One real gotcha the fleet flags: serverless SQL is compatible with Delta reader version 1 only, so deletion vectors, column renames, and v2 checkpoints were off the table; the lakehouse has no such ceiling, which usually simplifies the rebuild rather than complicating it.
Can dedicated-pool stored procedures move, given Synapse's reduced T-SQL dialect?
They move, but they are re-expressed, not ported verbatim. Dedicated-pool procedures already run a reduced dialect: 8-level nesting cap, no `@@NESTLEVEL`, no `INSERT...EXECUTE`, constrained scalar UDFs, so the imperative logic has to be read and rebuilt. On the dbt flavor it decomposes into ordered, idempotent models with pre/post-hooks, and anything genuinely procedural becomes a dbt Python model. On the your-framework flavor Databricks now offers SQL scripting (`BEGIN...END`, `IF`, `WHILE`, condition handlers) and `CREATE PROCEDURE` in Unity Catalog, but several of these are gated behind recent Databricks Runtime versions, so the fleet confirms the customer's Runtime before relying on them rather than assuming availability.
How does the fleet keep the Spark-to-SQL hand-offs from breaking on migration?
In Synapse the end-to-end logic only exists across the seams: a Spark notebook lands Delta, serverless SQL exposes views over it, the dedicated pool holds the modeled warehouse, and pipelines orchestrate all three. The Documenter follows the data across those Spark → serverless → dedicated boundaries instead of documenting each pool in isolation, so type mismatches, Delta-version compatibility, and schema-sync delays (the usual breakage points) are caught in the knowledge base, not in production. The build-test-run loop then proves the hand-offs hold: the Test agent checks parity end to end, and the loop iterates until green before anything reaches the promotion gate.
Keep exploring
Azure Synapse migration
How the Documenter reads Azure Synapse, with both destinations beyond the Databricks build on this page.
Migrate to Databricks
What the fleet builds when you take Azure Synapse to Databricks, in dbt or your framework.
Migrate Synapse to Fabric
The same source, the other destination, compare the fit.
Migrate SSIS to Databricks
Another source onto Databricks, run by the same fleet.
Migrate ADF to Databricks
Another source onto Databricks, run by the same fleet.
Ready to migrate Azure Synapse to Databricks?
Tell us about your Azure Synapse estate and we'll run the assessment, the Surveyor scores it before you commit to Databricks.
Plan my migrationA short form, no spam. We usually reply within one business day.