Skip to content
plainsight.pro

Databricks CI/CD Best Practices

Purpose

This guide provides best practices for designing robust CI/CD pipelines on Databricks, ensuring rapid and reliable deployment of data engineering and analytics workloads.

Overview

CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, ensuring code changes are integrated, tested, and deployed rapidly and reliably. Databricks provides a flexible framework supporting various CI/CD options shaped by organizational preferences, existing workflows, and specific technology environments.

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart LR
    A[Development] --> B[Version Control]
    B --> C[Automated Testing]
    C --> D[Build & Package]
    D --> E[Deploy to Staging]
    E --> F[Validation]
    F --> G[Deploy to Production]
    G --> H[Monitor]
    H --|Rollback if needed|--> E

Core Principles of CI/CD

Effective CI/CD pipelines share foundational principles regardless of implementation specifics. These universal best practices apply across organizational preferences, developer workflows, and cloud environments.

Version Control Everything

  • Store all artifacts in Git: Notebooks, scripts, infrastructure definitions (IaC), and job configurations
  • Use branching strategies: Implement Gitflow or similar strategies aligned with development, staging, and production environments
  • Track configuration changes: Maintain history of all deployment-related changes

Automate Testing

Test Type Tools Purpose
Unit Tests pytest (Python), ScalaTest (Scala) Validate business logic
Validation Tests Databricks CLI bundle validate Check notebook and workflow functionality
Integration Tests chispa for Spark DataFrames Test complete workflows and data pipelines

Testing Strategy

Implement multiple layers of testing to catch issues early in the development cycle.

Employ Infrastructure as Code (IaC)

  • Define infrastructure declaratively: Use Databricks Asset Bundles YAML or Terraform for clusters, jobs, and workspace configurations
  • Parameterize settings: Avoid hardcoding environment-specific values like cluster size and secrets
  • Version infrastructure changes: Track all infrastructure modifications in Git

Isolate Environments

Environment separation best practices: - Maintain separate workspaces for development, staging, and production - Choose tools that match your cloud ecosystem: - Azure: Azure DevOps + Databricks Asset Bundles or Terraform

Monitor and Automate Rollbacks

Track key metrics: - Deployment success rates - Job performance - Test coverage - Pipeline execution times

Implement automated rollback mechanisms for failed deployments to minimize downtime.

Unify Asset Management

Avoid Siloed Management

Use Databricks Asset Bundles to deploy code, jobs, and infrastructure as a single unit. Avoid managing notebooks, libraries, and workflows separately.

[!info] Authentication Recommendation Databricks recommends workload identity federation for CI/CD authentication. This eliminates the need for Databricks secrets, making it the most secure authentication method for automated flows.

Databricks Asset Bundles for CI/CD

Databricks Asset Bundles offer a unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for CI/CD pipelines.

Why Bundles?

By bundling code, workflows, and infrastructure into a single YAML-defined unit, bundles: - ✅ Simplify deployment - ✅ Ensure consistency across environments - ✅ Provide atomic deployments - ✅ Enable version control of entire stack

Workflow Adaptation

Different development backgrounds require different approaches to adopting bundles:

Developer Background Traditional Workflow Bundle Workflow
Java Build JARs with Maven/Gradle, test with JUnit Bundle code + infrastructure in YAML
Python Package wheels, test with pytest Bundle Python code + configs together
SQL Query validation, notebook management Bundle SQL files with pipeline definitions

Source Control Strategies

The first choice when implementing CI/CD is how to store and version source files. Two main approaches exist:

Structure:

databricks-dab-repo/
├── databricks.yml               # Bundle definition
├── resources/
│   ├── workflows/
│   │   ├── my_pipeline.yml      # YAML pipeline def
│   │   └── my_pipeline_job.yml  # YAML job def
│   ├── clusters/
│   │   ├── dev_cluster.yml      # Development cluster
│   │   └── prod_cluster.yml     # Production cluster
├── src/
│   ├── my_pipeline.ipynb        # Pipeline notebook
│   └── mypython.py              # Python modules
└── README.md

Pros: - ✅ All artifacts versioned together - ✅ Single PR updates both code and configuration - ✅ Simplified CI/CD pipeline

Cons: - ❌ Repository may become bloated - ❌ Coordinated releases required

Repository 1: Application Code

java-app-repo/
├── pom.xml                      # Maven configuration
├── src/
│   ├── main/
│   │   ├── java/                # Java source code
│   │   └── resources/           # Application resources
│   └── test/
│       ├── java/                # Unit tests
│       └── resources/           # Test resources
├── target/                      # Compiled JARs
└── README.md

Repository 2: Bundle Configuration

databricks-dab-repo/
├── databricks.yml               # Bundle definition
├── resources/
│   ├── jobs/
│   │   ├── my_java_job.yml      # Job definitions
│   │   └── my_other_job.yml
│   ├── clusters/
│   │   ├── dev_cluster.yml      # Cluster configs
│   │   └── prod_cluster.yml
└── README.md

Pros: - ✅ Team separation: Development vs. infrastructure management - ✅ Independent release cycles - ✅ Smaller, focused repositories

Cons: - ❌ Additional coordination required - ❌ Must ensure version compatibility

Version Artifacts

Always use versioned artifacts (Git commit hashes) when uploading to Databricks or external storage to ensure traceability and rollback capabilities.

Reference Compiled Artifacts in Bundles

Example databricks.yml referencing a versioned JAR:

resources:
  jobs:
    my_java_job:
      tasks:
        - task_key: process_data
          libraries:
            - jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar

Regardless of repository structure, follow this workflow:

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
    A[Code Change / PR] --> B[Compile & Test]
    B --> C{Tests Pass?}
    C -->|No| D[Fix Issues]
    D --> A
    C -->|Yes| E[Upload Artifact]
    E --> F[Validate Bundle]
    F --> G{Validation OK?}
    G -->|No| D
    G -->|Yes| H[Deploy to Staging]
    H --> I[Integration Tests]
    I --> J{Tests Pass?}
    J -->|No| K[Rollback]
    J -->|Yes| L[Deploy to Production]
    L --> M[Monitor]

    style B fill:#e1f5ff
    style E fill:#fff3cd
    style L fill:#d4edda
    style K fill:#f8d7da

Step 1: Compile and Test

Triggered on: Pull request or commit to main branch

Actions: 1. Compile code 2. Run unit tests 3. Generate versioned artifact (e.g., my-app-1.0.jar)

Step 2: Upload and Store Artifact

Store compiled files in Databricks Unity Catalog volume or artifact repository Azure Blob Storage or Databricks Unity Catalog volumes (recommended)

Versioning scheme examples:

/Volumes/artifacts/my-app-${{ github.sha }}.jar
dbfs:/mnt/artifacts/my-app-1.0.jar

Step 3: Validate Bundle

databricks bundle validate

This ensures: - YAML configuration correctness - No missing libraries - Proper resource references - Environment-specific parameters are valid

Catch Issues Early

Run validation during pull requests to catch misconfigurations before deployment.

Step 4: Deploy Bundle

databricks bundle deploy --target=staging

For production:

databricks bundle deploy --target=prod

Reference uploaded libraries in databricks.yml. See Databricks Asset Bundles library dependencies for details.

CI/CD for Machine Learning

ML projects introduce unique CI/CD challenges compared to traditional software development:

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart LR
    A[Data Engineering] --> B[Feature Engineering]
    B --> C[Model Training]
    C --> D[Model Validation]
    D --> E[Model Registry]
    E --> F[Staging Deployment]
    F --> G[A/B Testing]
    G --> H[Production Deployment]
    H --> I[Monitoring & Retraining]
    I --|Data Drift|--> C

    style A fill:#e1f5ff
    style C fill:#fff3cd
    style E fill:#d4edda
    style I fill:#f8d7da

ML-Specific Considerations

Challenge Databricks Solution
Multi-team coordination MLflow for experiment tracking, Delta Sharing for governance, Asset Bundles for IaC
Data & model versioning Delta Lake (ACID transactions, time travel), MLflow Model Registry (lineage)
Reproducibility Asset Bundles ensure atomic deployment across environments
Continuous retraining Lakeflow Jobs + MLflow + Data Quality Monitoring

MLOps Stacks Framework

MLOps Stacks combine: - Databricks Asset Bundles - Preconfigured CI/CD workflows - Modular ML project templates

Team responsibilities:

Team Responsibilities Bundle Components Artifacts
Data Engineers ETL pipelines, data quality Lakeflow YAML, cluster policies etl_pipeline.yml, feature_store_job.yml
Data Scientists Model training, validation MLflow Projects, notebooks train_model.yml, batch_inference_job.yml
MLOps Engineers Orchestration, monitoring Environment variables, dashboards databricks.yml, lakehouse_monitoring.yml

ML CI/CD Collaboration Workflow

  1. Data engineers commit ETL pipeline changes → automated schema validation → staging deployment
  2. Data scientists submit ML code → unit tests run → deploy to staging workspace for integration testing
  3. MLOps engineers review validation metrics → promote vetted models to production via MLflow Registry

CI/CD for SQL Developers

SQL developers using Databricks SQL to manage streaming tables and materialized views can leverage Git integration and CI/CD pipelines to streamline workflows.

SQL Workflow Best Practices

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
    A[Write SQL in Editor] --> B[Commit to Git Branch]
    B --> C[Pull Request]
    C --> D[Validate Syntax & Schema]
    D --> E{Valid?}
    E -->|No| F[Fix Issues]
    F --> A
    E -->|Yes| G[Merge to Main]
    G --> H[Deploy SQL Files]
    H --> I[Schedule Refreshes]
    I --> J[Monitor Performance]

    style A fill:#e1f5ff
    style G fill:#d4edda
    style J fill:#fff3cd

Version Control SQL Files

  • Store .sql files in Git using Databricks Git folders or external providers (GitHub, Azure DevOps)
  • Use branches for environment-specific changes (development, staging, production)
  • Integrate into CI/CD to automate deployment

Parameterize for Environment Isolation

Use variables in .sql files for dynamic resource references:

CREATE OR REFRESH STREAMING TABLE ${env}_sales_ingest 
AS SELECT * 
FROM read_files('s3://${env}-sales-data')

Schedule and Monitor Refreshes

  • Use SQL tasks in Databricks Jobs to schedule updates
  • Refresh materialized views:
    REFRESH MATERIALIZED VIEW view_name
    
  • Monitor refresh history using system tables

SQL CI/CD Workflow Steps

Stage Action
Develop Write and test .sql scripts in SQL editor, commit to Git branch
Validate During PR, validate syntax and schema compatibility via automated CI checks
Deploy Upon merge, deploy .sql scripts using CI/CD pipelines
Monitor Track query performance and data freshness with dashboards and alerts

CI/CD for Dashboard Developers

Databricks supports integrating dashboards into CI/CD workflows using Asset Bundles.

Benefits

  • ✅ Version-control dashboards for auditability
  • ✅ Automate deployments alongside jobs and pipelines
  • ✅ Reduce manual errors
  • ✅ Ensure consistent updates across environments

Dashboard CI/CD Implementation

1. Export existing dashboards as JSON:

databricks bundle generate

2. Configure bundle YAML:

resources:
  dashboards:
    sales_dashboard:
      display_name: 'Sales Dashboard'
      file_path: ./dashboards/sales_dashboard.lvdash.json
      warehouse_id: ${var.warehouse_id}

3. Store .lvdash.json files in Git to track changes and collaborate

4. Deploy dashboards via CI/CD:

name: Deploy Dashboard
run: databricks bundle deploy --target=prod
env:
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

5. Parameterize configurations using variables like ${var.warehouse_id} for SQL warehouses or data sources

Continuous Sync

  • Use bundle generate --watch to continuously sync local dashboard JSON files with Databricks UI changes
  • Force overwrite with --force flag during deployment if discrepancies occur

Dashboard Workflow

Maintain dashboard definitions in Git, deploy through automated pipelines, and use variables for environment-specific configurations.



This guide helps teams implement robust CI/CD practices on Databricks, accelerating data engineering and analytics initiatives while improving code quality and reducing deployment risks.