Databricks CI/CD Best Practices¶
Purpose
This guide provides best practices for designing robust CI/CD pipelines on Databricks, ensuring rapid and reliable deployment of data engineering and analytics workloads.
Overview¶
CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, ensuring code changes are integrated, tested, and deployed rapidly and reliably. Databricks provides a flexible framework supporting various CI/CD options shaped by organizational preferences, existing workflows, and specific technology environments.
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart LR
A[Development] --> B[Version Control]
B --> C[Automated Testing]
C --> D[Build & Package]
D --> E[Deploy to Staging]
E --> F[Validation]
F --> G[Deploy to Production]
G --> H[Monitor]
H --|Rollback if needed|--> E
Core Principles of CI/CD¶
Effective CI/CD pipelines share foundational principles regardless of implementation specifics. These universal best practices apply across organizational preferences, developer workflows, and cloud environments.
Version Control Everything¶
- Store all artifacts in Git: Notebooks, scripts, infrastructure definitions (IaC), and job configurations
- Use branching strategies: Implement Gitflow or similar strategies aligned with development, staging, and production environments
- Track configuration changes: Maintain history of all deployment-related changes
Automate Testing¶
| Test Type | Tools | Purpose |
|---|---|---|
| Unit Tests | pytest (Python), ScalaTest (Scala) |
Validate business logic |
| Validation Tests | Databricks CLI bundle validate |
Check notebook and workflow functionality |
| Integration Tests | chispa for Spark DataFrames |
Test complete workflows and data pipelines |
Testing Strategy
Implement multiple layers of testing to catch issues early in the development cycle.
Employ Infrastructure as Code (IaC)¶
- Define infrastructure declaratively: Use Databricks Asset Bundles YAML or Terraform for clusters, jobs, and workspace configurations
- Parameterize settings: Avoid hardcoding environment-specific values like cluster size and secrets
- Version infrastructure changes: Track all infrastructure modifications in Git
Isolate Environments¶
Environment separation best practices: - Maintain separate workspaces for development, staging, and production - Choose tools that match your cloud ecosystem: - Azure: Azure DevOps + Databricks Asset Bundles or Terraform
Monitor and Automate Rollbacks¶
Track key metrics: - Deployment success rates - Job performance - Test coverage - Pipeline execution times
Implement automated rollback mechanisms for failed deployments to minimize downtime.
Unify Asset Management¶
Avoid Siloed Management
Use Databricks Asset Bundles to deploy code, jobs, and infrastructure as a single unit. Avoid managing notebooks, libraries, and workflows separately.
[!info] Authentication Recommendation Databricks recommends workload identity federation for CI/CD authentication. This eliminates the need for Databricks secrets, making it the most secure authentication method for automated flows.
Databricks Asset Bundles for CI/CD¶
Databricks Asset Bundles offer a unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for CI/CD pipelines.
Why Bundles?¶
By bundling code, workflows, and infrastructure into a single YAML-defined unit, bundles: - ✅ Simplify deployment - ✅ Ensure consistency across environments - ✅ Provide atomic deployments - ✅ Enable version control of entire stack
Workflow Adaptation¶
Different development backgrounds require different approaches to adopting bundles:
| Developer Background | Traditional Workflow | Bundle Workflow |
|---|---|---|
| Java | Build JARs with Maven/Gradle, test with JUnit | Bundle code + infrastructure in YAML |
| Python | Package wheels, test with pytest | Bundle Python code + configs together |
| SQL | Query validation, notebook management | Bundle SQL files with pipeline definitions |
Source Control Strategies¶
The first choice when implementing CI/CD is how to store and version source files. Two main approaches exist:
Option 1: Single Repository (Recommended for Small Projects)¶
Structure:
databricks-dab-repo/
├── databricks.yml # Bundle definition
├── resources/
│ ├── workflows/
│ │ ├── my_pipeline.yml # YAML pipeline def
│ │ └── my_pipeline_job.yml # YAML job def
│ ├── clusters/
│ │ ├── dev_cluster.yml # Development cluster
│ │ └── prod_cluster.yml # Production cluster
├── src/
│ ├── my_pipeline.ipynb # Pipeline notebook
│ └── mypython.py # Python modules
└── README.md
Pros: - ✅ All artifacts versioned together - ✅ Single PR updates both code and configuration - ✅ Simplified CI/CD pipeline
Cons: - ❌ Repository may become bloated - ❌ Coordinated releases required
Option 2: Separate Repositories (Recommended for Large Teams)¶
Repository 1: Application Code
java-app-repo/
├── pom.xml # Maven configuration
├── src/
│ ├── main/
│ │ ├── java/ # Java source code
│ │ └── resources/ # Application resources
│ └── test/
│ ├── java/ # Unit tests
│ └── resources/ # Test resources
├── target/ # Compiled JARs
└── README.md
Repository 2: Bundle Configuration
databricks-dab-repo/
├── databricks.yml # Bundle definition
├── resources/
│ ├── jobs/
│ │ ├── my_java_job.yml # Job definitions
│ │ └── my_other_job.yml
│ ├── clusters/
│ │ ├── dev_cluster.yml # Cluster configs
│ │ └── prod_cluster.yml
└── README.md
Pros: - ✅ Team separation: Development vs. infrastructure management - ✅ Independent release cycles - ✅ Smaller, focused repositories
Cons: - ❌ Additional coordination required - ❌ Must ensure version compatibility
Version Artifacts
Always use versioned artifacts (Git commit hashes) when uploading to Databricks or external storage to ensure traceability and rollback capabilities.
Reference Compiled Artifacts in Bundles¶
Example databricks.yml referencing a versioned JAR:
resources:
jobs:
my_java_job:
tasks:
- task_key: process_data
libraries:
- jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar
Recommended CI/CD Workflow¶
Regardless of repository structure, follow this workflow:
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
A[Code Change / PR] --> B[Compile & Test]
B --> C{Tests Pass?}
C -->|No| D[Fix Issues]
D --> A
C -->|Yes| E[Upload Artifact]
E --> F[Validate Bundle]
F --> G{Validation OK?}
G -->|No| D
G -->|Yes| H[Deploy to Staging]
H --> I[Integration Tests]
I --> J{Tests Pass?}
J -->|No| K[Rollback]
J -->|Yes| L[Deploy to Production]
L --> M[Monitor]
style B fill:#e1f5ff
style E fill:#fff3cd
style L fill:#d4edda
style K fill:#f8d7da
Step 1: Compile and Test¶
Triggered on: Pull request or commit to main branch
Actions:
1. Compile code
2. Run unit tests
3. Generate versioned artifact (e.g., my-app-1.0.jar)
Step 2: Upload and Store Artifact¶
Store compiled files in Databricks Unity Catalog volume or artifact repository Azure Blob Storage or Databricks Unity Catalog volumes (recommended)
Versioning scheme examples:
Step 3: Validate Bundle¶
This ensures: - YAML configuration correctness - No missing libraries - Proper resource references - Environment-specific parameters are valid
Catch Issues Early
Run validation during pull requests to catch misconfigurations before deployment.
Step 4: Deploy Bundle¶
For production:
Reference uploaded libraries in databricks.yml. See Databricks Asset Bundles library dependencies for details.
CI/CD for Machine Learning¶
ML projects introduce unique CI/CD challenges compared to traditional software development:
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart LR
A[Data Engineering] --> B[Feature Engineering]
B --> C[Model Training]
C --> D[Model Validation]
D --> E[Model Registry]
E --> F[Staging Deployment]
F --> G[A/B Testing]
G --> H[Production Deployment]
H --> I[Monitoring & Retraining]
I --|Data Drift|--> C
style A fill:#e1f5ff
style C fill:#fff3cd
style E fill:#d4edda
style I fill:#f8d7da
ML-Specific Considerations¶
| Challenge | Databricks Solution |
|---|---|
| Multi-team coordination | MLflow for experiment tracking, Delta Sharing for governance, Asset Bundles for IaC |
| Data & model versioning | Delta Lake (ACID transactions, time travel), MLflow Model Registry (lineage) |
| Reproducibility | Asset Bundles ensure atomic deployment across environments |
| Continuous retraining | Lakeflow Jobs + MLflow + Data Quality Monitoring |
MLOps Stacks Framework¶
MLOps Stacks combine: - Databricks Asset Bundles - Preconfigured CI/CD workflows - Modular ML project templates
Team responsibilities:
| Team | Responsibilities | Bundle Components | Artifacts |
|---|---|---|---|
| Data Engineers | ETL pipelines, data quality | Lakeflow YAML, cluster policies | etl_pipeline.yml, feature_store_job.yml |
| Data Scientists | Model training, validation | MLflow Projects, notebooks | train_model.yml, batch_inference_job.yml |
| MLOps Engineers | Orchestration, monitoring | Environment variables, dashboards | databricks.yml, lakehouse_monitoring.yml |
ML CI/CD Collaboration Workflow¶
- Data engineers commit ETL pipeline changes → automated schema validation → staging deployment
- Data scientists submit ML code → unit tests run → deploy to staging workspace for integration testing
- MLOps engineers review validation metrics → promote vetted models to production via MLflow Registry
CI/CD for SQL Developers¶
SQL developers using Databricks SQL to manage streaming tables and materialized views can leverage Git integration and CI/CD pipelines to streamline workflows.
SQL Workflow Best Practices¶
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
A[Write SQL in Editor] --> B[Commit to Git Branch]
B --> C[Pull Request]
C --> D[Validate Syntax & Schema]
D --> E{Valid?}
E -->|No| F[Fix Issues]
F --> A
E -->|Yes| G[Merge to Main]
G --> H[Deploy SQL Files]
H --> I[Schedule Refreshes]
I --> J[Monitor Performance]
style A fill:#e1f5ff
style G fill:#d4edda
style J fill:#fff3cd
Version Control SQL Files¶
- Store
.sqlfiles in Git using Databricks Git folders or external providers (GitHub, Azure DevOps) - Use branches for environment-specific changes (development, staging, production)
- Integrate into CI/CD to automate deployment
Parameterize for Environment Isolation¶
Use variables in .sql files for dynamic resource references:
CREATE OR REFRESH STREAMING TABLE ${env}_sales_ingest
AS SELECT *
FROM read_files('s3://${env}-sales-data')
Schedule and Monitor Refreshes¶
- Use SQL tasks in Databricks Jobs to schedule updates
- Refresh materialized views:
- Monitor refresh history using system tables
SQL CI/CD Workflow Steps¶
| Stage | Action |
|---|---|
| Develop | Write and test .sql scripts in SQL editor, commit to Git branch |
| Validate | During PR, validate syntax and schema compatibility via automated CI checks |
| Deploy | Upon merge, deploy .sql scripts using CI/CD pipelines |
| Monitor | Track query performance and data freshness with dashboards and alerts |
CI/CD for Dashboard Developers¶
Databricks supports integrating dashboards into CI/CD workflows using Asset Bundles.
Benefits¶
- ✅ Version-control dashboards for auditability
- ✅ Automate deployments alongside jobs and pipelines
- ✅ Reduce manual errors
- ✅ Ensure consistent updates across environments
Dashboard CI/CD Implementation¶
1. Export existing dashboards as JSON:
2. Configure bundle YAML:
resources:
dashboards:
sales_dashboard:
display_name: 'Sales Dashboard'
file_path: ./dashboards/sales_dashboard.lvdash.json
warehouse_id: ${var.warehouse_id}
3. Store .lvdash.json files in Git to track changes and collaborate
4. Deploy dashboards via CI/CD:
name: Deploy Dashboard
run: databricks bundle deploy --target=prod
env:
DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
5. Parameterize configurations using variables like ${var.warehouse_id} for SQL warehouses or data sources
Continuous Sync¶
- Use
bundle generate --watchto continuously sync local dashboard JSON files with Databricks UI changes - Force overwrite with
--forceflag during deployment if discrepancies occur
Dashboard Workflow
Maintain dashboard definitions in Git, deploy through automated pipelines, and use variables for environment-specific configurations.
Related Pages¶
This guide helps teams implement robust CI/CD practices on Databricks, accelerating data engineering and analytics initiatives while improving code quality and reducing deployment risks.