Migrate SSIS to Databricks.

Every `.dtsx` package read by the Documenter, rebuilt on Spark and Delta as dbt or in your framework: autonomous in the loop, accountable at the gates.

Plan my migration Migrate to Databricks

Why moving SSIS onto Databricks is a rebuild, not a port

An SSIS package is a .dtsx XML document describing a control flow, one or more data flows, connection managers, variables, parameters and configurations. What makes SSIS hard to move isn’t the task names. It’s the execution model underneath them. A Data Flow only exists inside a Data Flow Task, and each one spins up a separate instance of a buffer-based engine: rows stream through PipelineBuffer objects, a new buffer is allocated whenever an asynchronous transformation (Sort, Aggregate, Merge Join) is hit, and synchronous transformations reuse the buffer row by row. Databricks has no row-buffer model. Spark is a set-based, distributed engine over Delta Lake. So there is no general-availability lift-and-shift that runs the SSIS runtime on Databricks, and anyone who tells you otherwise is selling a runtime that does not exist. The work is to reproduce the package’s semantics (sorted inputs, error outputs, lineage-tracked columns, effective-dating) as Spark SQL or PySpark, not to emulate its buffers.

That is exactly what a naive lift-and-shift gets wrong. It treats the data flow as a black box that “moves data,” ports the visible SQL, and loses the parts that were never visible: the expression-driven ConnectionString on a connection manager, the variable whose value is computed at runtime because EvaluateAsExpression = True, the SSIS expression that casts to (DT_BOOL) where True is represented as -1 rather than Spark’s true. Each of those is a silent behaviour change waiting to surface as a wrong row count three weeks after go-live.

What the Documenter reads, and what the fleet builds

The SSIS Documenter parses the package XML, not the canvas. It reconstructs the control flow from precedence constraints, capturing the Value (Success / Failure / Completion), the EvalOp, and the LogicalAnd flag that decides whether multiple incoming constraints are AND-ed or OR-ed, so a branch like “run the load only if extract succeeded and validation did not fail” survives the move. It maps every Data Flow component, resolves SSIS expressions, captures variable and parameter scopes (variable names are case-sensitive, and parameters resolve through environment references that the .dtsx alone does not show), and isolates embedded T-SQL. All of it is written to the knowledge base as a functional spec, and the Librarian keeps that spec coherent across packages.

From that spec the builders generate one of two flavors. In dbt, each Data Flow becomes staging and mart models, plain SELECT statements that dbt materializes to Delta tables or views on a Databricks SQL warehouse via dbt-databricks. The per-row upserts that an OLE DB Command used to do become an incremental model with incremental_strategy='merge', the default on Delta, which compiles to MERGE INTO. The SSIS Slowly Changing Dimension transform becomes a dbt snapshot, adding dbt_valid_from and dbt_valid_to columns with the timestamp or check strategy. Run order comes free from the ref() DAG, which replaces precedence-constraint ordering.

In your framework, the same logic lands as native Databricks objects. Data Flows become PySpark or Spark SQL notebooks, or Lakeflow Spark Declarative Pipelines with streaming tables and materialized views. SCD2 is a Delta MERGE INTO that closes the open version and opens a new one. File landing zones that SSIS read with a Flat File source become Auto Loader (the cloudFiles source) for incremental ingestion, or COPY INTO for simpler SQL loads. Connection managers become Unity Catalog connections, external locations and volumes; precedence constraints become Lakeflow Jobs task dependencies with conditional control flow. Whatever lived inside a Script Task (opaque .NET that cannot be auto-converted) is rebuilt as a PySpark notebook and explicitly flagged for a human to read at the design gate.

Autonomous in the loop, accountable at the gates

Once the build exists, the Databricks Operator deploys it to DEV, runs the Workflows and pipelines for real, reads each failure and routes the fix back to the builders. That build-test-run loop re-runs on its own until the Databricks Test Agent’s suite passes and the Reconciler’s delta against the legacy dimension is zero or explained. The early failures are the predictable ones, GETDATE() left as T-SQL instead of current_timestamp(), an ISNULL() that should be coalesce(), a (DT_BOOL) cast that broke the active-flag because SSIS True is -1, an SCD2 valid_to boundary off by one. The fleet iterates through them; it does not hand you a list. What stays human are the three gates: your experts sign off the assessment, the target design, and the promotion. The agents are autonomous inside the loop; you are accountable at the gates.

If your target is Microsoft Fabric instead, the same Documenter output drives a different build, Migrate SSIS to Microsoft Fabric covers that pair.

How the fleet runs it

SSIS to Databricks, stage by stage.

For SSIS → Databricks, the same five stages run with the agents that handle this pair named and linked below. Inside the loop they iterate on their own; your experts hold the three gates.

Document

Architect

The Architect

Build

Test

Operate

Build → Test → Operate iterates until green on Databricks

Orchestrated end-to-end by The Conductor , The Librarian and The Chronicler across every SSIS → Databricks asset.

dbt or your framework

Built your way, on Databricks.

Take SSIS to Databricks as an open-source dbt project (free to run), or built into the framework your team has trusted for years: same migration, your standards either way.

What gets generated

dbt

Staging and mart models as SELECT statements on Databricks SQL warehouses, materialized to Delta tables and views
Incremental models with incremental_strategy='merge' (the default on Delta) for the per-row upserts SSIS did with OLE DB Command
dbt snapshots (timestamp or check strategy) for every SCD Type 2 the SSIS Slowly Changing Dimension transform used to maintain
Generic and singular tests for the parity the Test Agent asserts; sources.yml standing in for connection managers
Python models where a Script Task's .NET logic has no pure-SQL equivalent (flagged for review)

Pattern mapping

Each SSIS pattern, mapped to a deliberate Databricks target.

Source pattern

Databricks · dbt

Databricks · your framework

Data Flow Task (OLE DB source, transforms, destination)

Staging and mart models; materialized='table'/'view'/'incremental' on Delta

PySpark / Spark SQL notebook, or a Lakeflow Spark Declarative Pipeline (streaming tables / materialized views)

Slowly Changing Dimension transform (Type 2)

dbt snapshot with timestamp or check strategy, adding dbt_valid_from/dbt_valid_to

MERGE INTO on a Delta dimension closing and opening versions, or Lakeflow SDP change processing

OLE DB Command (per-row upsert)

Incremental model, incremental_strategy='merge' on unique_key (compiles to MERGE INTO)

MERGE INTO in Spark SQL, or DeltaTable.merge in PySpark

Execute SQL Task (set-based statement / proc)

dbt model, with pre-/post-hooks for ad-hoc DML the model can't express

Spark SQL statement in a notebook task, or a SQL-scripting BEGIN...END block / stored procedure

Script Task / Script Component (.NET, VSTA)

Not a dbt concept, a Python model where pure SQL is insufficient (flagged)

PySpark / Python notebook carrying the full programmatic logic (flagged for human review)

Precedence constraints (Success/Failure/Completion, LogicalAnd)

The dbt DAG built from ref() edges determines run order

Lakeflow Jobs tasks with dependencies and conditional control flow

Connection managers (OLE DB / Flat File, expression-driven)

Sources in sources.yml plus the warehouse connection profile

Unity Catalog connections / external locations / volumes; secrets in secret scopes

Flat File source over a landing folder

Paired load step, then modeled, dbt does not ingest raw files

Auto Loader (cloudFiles) incremental file ingestion, or COPY INTO

SQL Agent schedule / package-level orchestration

dbt invoked as a scheduled task by the orchestrator

Lakeflow Jobs (jobs, tasks, triggers, scheduled or file-arrival)

Before & after

A real SSIS snippet and what the builders generate on Databricks.

LoadDimCustomer.dtsx (Data Flow + SCD2, abridged) legacy

<!-- Data Flow Task: extract -> Slowly Changing Dimension (Type 2) -> insert/update -->
<component name="OLE DB Source" componentClassID="Microsoft.OLEDBSource">
  <property name="SqlCommand">SELECT CustomerKey, Name, City, Segment, UpdatedAt
                              FROM staging.Customer</property>
</component>
<component name="SCD Customer" componentClassID="Microsoft.SCD">
  <!-- business key -->
  <inputColumn name="CustomerKey" scdType="Key" />
  <!-- historical attributes -> new row version on change -->
  <inputColumn name="City"    scdType="Historical" />
  <inputColumn name="Segment" scdType="Historical" />
  <!-- changing attribute -> overwrite in place -->
  <inputColumn name="Name"    scdType="Changing" />
</component>
<!-- OLE DB Command issues a per-row UPDATE to close the prior version;
     precedence constraint (Success) then runs the dim insert. -->

dbt

snapshots/dim_customer_snapshot.sql dbt

{% snapshot dim_customer_snapshot %}
{{
  config(
    target_schema='marts',
    unique_key='customer_key',
    strategy='timestamp',
    updated_at='updated_at',
    file_format='delta'
  )
}}
-- The SCD2 transform becomes a dbt snapshot. dbt closes the old
-- row and opens a new one, maintaining dbt_valid_from / dbt_valid_to.
select
  customer_key,
  name,
  city,
  segment,
  updated_at
from {{ ref('stg_customer') }}
{% endsnapshot %}

The loop that closes itself

Iterated until green, and logged.

A failure on this pair doesn't go on a list. It goes back to the builders, and the loop re-runs until the suite passes.

ssis-to-databricks · iteration log build-test-operate

iter 1 ✗ Iteration 1, FAIL: SSIS expression cast `(DT_BOOL)` ported literally; SSIS True is -1, Spark boolean is true/false. Active-flag rows mismatched.

iter 2 ✗ Iteration 2, FAIL: `GETDATE()` and `ISNULL()` left as T-SQL; rewritten to `current_timestamp()` and `coalesce()`. Open-version count off by 412.

iter 3 ✗ Iteration 3, FAIL: SCD2 `valid_to` boundary off by one, old version not closed at the new `updated_at`; MERGE predicate corrected.

iter 4 ✓ Iteration 4, GREEN: row-count parity, SCD effective-dating, null and key integrity all pass. Reconciler delta 0.

Green on iteration 4: full SCD2 parity, Reconciler delta 0 against the legacy dimension.

Questions for this migration

SSIS → Databricks, answered.

What happens to a Script Task that runs custom .NET code?

Script Tasks are opaque .NET (C#/VB) authored in VSTA, with logic that has no declarative equivalent. They are one of the two highest-risk constructs in an SSIS estate. The Documenter reads the `ScriptMain` / `Main` entry point and records what it does; the builders re-implement it as a PySpark or Python notebook on Databricks, and the result is flagged for human review at the design gate. We never claim a Script Task is auto-converted: .NET Core/Standard references aren't even supported in VSTA, so the logic is rebuilt, not lifted.

Can I just lift-and-shift the SSIS runtime onto Databricks?

No, there is no GA lift-and-shift that runs native SSIS on Databricks. SSIS executes a buffer-based, row-by-row data-flow engine that simply does not exist on Spark. The fleet rebuilds each Data Flow as set-based Spark SQL or PySpark transformations; the legacy semantics (sorted inputs, error outputs, lineage-tracked columns) are reproduced, not the runtime. Treat anyone promising a runtime port with suspicion.

How does an SSIS Slowly Changing Dimension (Type 2) transform map to Databricks?

On the dbt flavor it becomes a dbt snapshot (`timestamp` strategy when there's a reliable `updated_at`, `check` strategy otherwise) which maintains `dbt_valid_from` / `dbt_valid_to` on a Delta table. On your-framework it becomes a Delta `MERGE INTO` that closes the open version and opens a new one, or a Lakeflow Spark Declarative Pipeline using change processing. A `merge` upsert alone is SCD Type 1 (overwrite); SCD2 history needs the snapshot or the explicit close/open MERGE.

Where do precedence constraints and connection managers go?

Precedence constraints (the Success/Failure/Completion edges with their LogicalAnd merge logic) become Lakeflow Jobs task dependencies and conditional control flow in your-framework, or are inferred from the `ref()` DAG in dbt, which runs models in topological order. Connection managers become Unity Catalog connections, external locations and volumes (with secrets in secret scopes), or `sources.yml` entries on the dbt side. The Documenter resolves expression-driven `ConnectionString` properties first, so dynamic connections aren't lost.

What about dynamic SQL built in an Execute SQL Task with `SQLSourceType = Variable`?

Dynamic SQL is the other top-risk construct: when the statement is assembled in a variable or a Script Task, the actual SQL isn't present statically in the package. The Documenter traces the variable construction to recover what runs, writes it to the knowledge base, and the builders re-express it, as Jinja templating at compile time on the dbt flavor, or `EXECUTE IMMEDIATE` / `spark.sql(...)` on your-framework. The parameterization and security model differs, so each case is reviewed rather than ported verbatim.

Ready to migrate SSIS to Databricks?

Tell us about your SSIS estate and we'll run the assessment, the Surveyor scores it before you commit to Databricks.

Plan my migration

A short form, no spam. We usually reply within one business day.