Skip to content
plainsight.pro

How to choose your Compute?

Purpose

This guide helps you select the right compute resource in Databricks based on your workload type, team collaboration needs, and cost considerations.

Compute Types Overview

Quick Navigation

This section provides a high-level overview. Click the compute type names to jump to detailed guidance.

Compute Type Use Case Cost Efficiency Auto-termination Best For
SQL Warehouses SQL analytics & BI ⭐⭐⭐ High Yes Business analysts, SQL-based reporting
All-Purpose Clusters Interactive development ⭐⭐ Medium Yes Collaborative data engineering
Job Clusters Automated workloads ⭐⭐⭐ High Automatic Production pipelines, scheduled jobs
Single Node Clusters Light exploration ⭐⭐⭐ High Yes Quick data checks, small datasets
Personal Compute Individual development ⭐⭐ Medium Yes Solo experimentation (see All-Purpose)

Compute Decision Flow

Use this flow to identify the right compute type for your needs:

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
    A[What's your workload?] --> B{Interactive vs Automated?}
    B -->|Interactive Development| C{Collaboration needed?}
    B -->|Automated Jobs| D[Job Clusters]

    C -->|Yes - Team Development| E[Shared Clusters]
    C -->|No - Solo Exploration| F[Personal Clusters]

    E --> G{Workload Type?}
    F --> H{Workload Type?}
    D --> I{Workload Type?}

    G -->|SQL Analytics| J[SQL Warehouses]
    G -->|Data Engineering| K[All-Purpose Cluster<br/>Shared Access Mode]

    H -->|ML Experimentation| L[Single User Cluster<br/>with ML Runtime]
    H -->|Data Exploration| M[Single Node Cluster]

    I -->|ETL Pipeline| N[Job Cluster<br/>Autoscaling Enabled]
    I -->|ML Training| O[Job Cluster<br/>GPU-enabled]
    I -->|Streaming| P[Job Cluster<br/>Always-on]

    style B fill:#a8d8ff
    style C fill:#ffd4a3
    style G fill:#c9f0c9
    style H fill:#c9f0c9
    style I fill:#c9f0c9

    style J fill:#e8e8e8
    style K fill:#e8e8e8
    style L fill:#e8e8e8
    style M fill:#e8e8e8
    style N fill:#e8e8e8
    style O fill:#e8e8e8
    style P fill:#e8e8e8

Legend: πŸ”΅ Interactive vs Automated | 🟠 Collaboration | 🟒 Workload Type | βšͺ Your Result


Detailed Compute Guidance

The following sections provide in-depth information about each compute type, including configuration examples, best practices, and when to use them.

SQL Warehouses

When to use: - Running SQL queries for analytics - Connecting BI tools (Power BI, Tableau) - Ad-hoc data exploration by business users - Serving dashboards and reports

Configuration guidance:

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[SQL Warehouse Size] --> B[X-Small: <10 users]
    A --> C[Small: 10-20 users]
    A --> D[Medium: 20-50 users]
    A --> E[Large: 50+ users]

    style B fill:#e1f5e1
    style C fill:#fff4e1
    style D fill:#ffe1e1
    style E fill:#ffe1e1

Cost Optimization

Enable auto-stop after 10-15 minutes of inactivity. SQL Warehouses can start up quickly, so aggressive auto-stop settings minimize costs without impacting user experience.

Example use case:

-- Typical SQL Warehouse query for business reporting
SELECT 
    product_category,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT customer_id) as unique_customers
FROM sales_fact
WHERE order_date >= '2025-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC

All-Purpose Clusters

When to use: - Team collaboration on notebooks - Interactive data engineering work - Prototyping data pipelines - Shared development environment

Access modes:

Mode Use Case Language Support Libraries
Shared Team collaboration, production-ready Python, SQL, R Pre-approved only
Single User ML experimentation, full control Python, SQL, R, Scala Install anything
No Isolation Legacy support All Full access

Shared vs Single User

Shared access mode provides security isolation but restricts some libraries. Single User mode gives full flexibility but can't be shared with team members.

Configuration example:

# Cluster configuration for data engineering team
{
    "cluster_name": "DE-Team-Shared",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 8
    },
    "autotermination_minutes": 30,
    "data_security_mode": "USER_ISOLATION"
}

Job Clusters

When to use: - Production ETL pipelines - Scheduled data processing - Automated ML model training - Any automated, repeatable workflow

Key advantages: - Spin up for job execution only - Automatically terminated after completion - Most cost-effective for production workloads - Isolated from interactive development

Configuration strategy:

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart LR
    A[Job Cluster Design] --> B[Enable Autoscaling]
    A --> C[Right-size Workers]
    A --> D[Match Runtime to Job]

    B --> B1[Min: Handle base load]
    B --> B2[Max: Cap costs]

    C --> C1[CPU: ETL & transformations]
    C --> C2[Memory: Large aggregations]
    C --> C3[GPU: ML training]

    D --> D1[Standard: General purpose]
    D --> D2[ML: Model training]
    D --> D3[Photon: SQL-heavy jobs]

Job Cluster Best Practice

Always define job clusters in the job definition itself, not as a separate shared cluster. This ensures complete isolation and automatic cleanup.

Example job configuration:

# Optimal job cluster for daily ETL
{
    "name": "daily-sales-aggregation",
    "new_cluster": {
        "spark_version": "14.3.x-photon-scala2.12",
        "node_type_id": "Standard_D4ds_v5",
        "autoscale": {
            "min_workers": 2,
            "max_workers": 10
        },
        "enable_elastic_disk": true
    },
    "libraries": [
        {"pypi": {"package": "delta-spark==2.4.0"}}
    ]
}

Single Node Clusters

When to use: - Quick data exploration on small datasets (<100 GB) - Testing code snippets - Learning and experimentation - Personal data science notebooks

Limitations: - No distributed processing - Limited memory and CPU - Not suitable for production - Driver-only architecture

Size Limitations

Single node clusters work well for datasets that fit in memory. For larger datasets, switch to a multi-node cluster.

When it makes sense:

# Good use case: Small data exploration
df = spark.read.parquet("dbfs:/small-dataset")  # 10 MB
df.groupBy("category").count().show()

# Bad use case: Large data processing
# df = spark.read.parquet("dbfs:/huge-dataset")  # 500 GB
# This will fail on single node!

Decision Matrix

Use this matrix to quickly identify the right compute type:

Your Scenario Recommended Compute Why?
"I need to write SQL for a dashboard" SQL Warehouse Optimized for SQL, BI tool integration
"Our team is building a new pipeline together" All-Purpose (Shared) Collaboration with cost sharing
"I'm experimenting with ML models" All-Purpose (Single User) Full library access, GPU support
"This ETL runs every night at 2 AM" Job Cluster Most cost-effective for automation
"I want to quickly check a CSV file" Single Node Fast startup, low cost
"We're running production streaming" Job Cluster (Always-on) Reliability and isolation
"Analysts need self-service SQL access" SQL Warehouse Easy to use, cost-effective

Cost Optimization Strategies

Autoscaling Best Practices

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph TD
    A[Configure Autoscaling] --> B[Set Minimum Workers]
    A --> C[Set Maximum Workers]

    B --> D[Min = 2-3 for stable workloads]
    B --> E[Min = 1 for sporadic usage]

    C --> F[Max = 3x average load]
    C --> G[Max = Budget cap Γ· worker cost]

    style D fill:#e1f5e1
    style E fill:#e1f5e1
    style F fill:#e1f5e1
    style G fill:#ffe1e1

Auto-termination Guidelines

Compute Type Recommended Auto-stop Reasoning
SQL Warehouse 10-15 minutes Fast restart, analytics is bursty
All-Purpose (Active dev) 30-60 minutes Balance cost vs restart friction
All-Purpose (Occasional) 15-20 minutes Minimize idle time
Personal Compute 20 minutes Individual usage patterns
Job Cluster N/A Auto-terminates on completion

Monitor and Adjust

Review cluster usage metrics monthly. If clusters frequently restart due to aggressive auto-termination, increase the timeout slightly to improve user experience.


Photon Acceleration: When to Use It

Photon is Databricks' native vectorized query engine that accelerates SQL and DataFrame workloads. While it can provide 2-5x performance improvements, it comes with approximately 2x higher DBU costs. In practice, Photon is primarily valuable for workloads with heavy joins and aggregations. For most other use cases, the cost increase outweighs the performance benefits.

Decision Framework

%%{init: { "flowchart": { "useMaxWidth": true } } }%%
flowchart TD
    A[Evaluate Workload] --> B{Heavy Joins/Aggregations?}

    B -->|Yes - Many joins or complex aggregations| C{Dataset Size?}
    B -->|No - Simple transformations| D[❌ Standard Runtime<br/>Cost not justified]

    C -->|Large >100GB| E[βœ… Use Photon<br/>Performance gains justify cost]
    C -->|Small <50GB| F[❌ Standard Runtime<br/>Overhead exceeds benefit]

    style B fill:#a8d8ff
    style C fill:#ffd4a3
    style D fill:#ffe1e1
    style E fill:#e1f5e1
    style F fill:#ffe1e1

When to Enable Photon

Workload Characteristic Use Photon? Why?
Complex multi-table joins βœ… Yes Photon's vectorization excels at join operations
Heavy aggregations (GROUP BY) βœ… Yes Significant speedup for complex aggregations on large datasets
Star schema queries βœ… Yes Ideal for fact-dimension joins with aggregations
BI dashboards (SQL Warehouse) βœ… Default SQL Warehouses include Photon, optimized for analytical queries

When NOT to Use Photon

Workload Characteristic Use Standard Why?
Simple transformations (filters, selects) ❌ Standard Minimal joins/aggregations = cost not justified
ETL with mostly data movement ❌ Standard Reading/writing data doesn't benefit from Photon
Small to medium datasets (<50GB) ❌ Standard Performance gains too small to justify 2x cost
DLT pipelines without complex logic ❌ Standard Simple Landingβ†’ADS (Bronzeβ†’Silver-equivalent) transformations don't need Photon
Streaming without aggregations ❌ Standard Simple event processing doesn't benefit
ML training pipelines ❌ Standard ML libraries don't benefit from Photon (use GPU instead)
Development/testing environments ❌ Standard Cost not justified for non-production work
Custom Scala/Java code ❌ Standard Photon optimizes SQL/DataFrame operations only

Cost vs Performance Trade-off

Photon pricing: - Approximately 2x DBU cost compared to standard runtime - Performance improvements: 2-5x faster for SQL workloads

ROI calculation:

Standard runtime: 100 DBUs Γ— $0.40 = $40 (takes 60 minutes)
Photon runtime: 100 DBUs Γ— $0.80 = $80 (takes 20 minutes)

Cost increase: +$40 (+100%)
Time saved: -40 minutes (-67%)

Cost vs Benefit Reality

Photon doubles your DBU costs. Unless your workload performs heavy joins and aggregations on large datasets (>100GB), the 2x cost increase typically exceeds any time savings. Most ETL pipelines don't have enough join/aggregation complexity to justify Photon.

Photon Configuration Examples

Enable Photon on Job Cluster:

{
    "name": "daily-etl-with-photon",
    "new_cluster": {
        "spark_version": "14.3.x-photon-scala2.12",  # Note: photon in version
        "node_type_id": "Standard_D4ds_v5",
        "autoscale": {
            "min_workers": 2,
            "max_workers": 10
        }
    }
}

Photon in SQL Warehouse:

-- SQL Warehouses automatically use Photon
-- No configuration needed - it's enabled by default
SELECT 
    customer_segment,
    COUNT(*) as customer_count,
    SUM(lifetime_value) as total_value
FROM gold.customer_metrics  -- Gold-equivalent schema
GROUP BY customer_segment
ORDER BY total_value DESC

Photon in DLT Pipeline (only for heavy aggregations):

import dlt

# Use Photon only when you have complex joins and aggregations
@dlt.table(
    comment="Complex multi-table aggregation - Photon justified"
)
def gold_sales_summary():
    return (
        dlt.read("ads_sales")
            .join(dlt.read("ads_customers"), "customer_id")
            .join(dlt.read("ads_products"), "product_id")
            .join(dlt.read("ads_regions"), "region_id")
            .groupBy("region", "customer_segment", "product_category")
            .agg(
                sum("amount").alias("total_revenue"),
                avg("amount").alias("avg_order_value"),
                count("transaction_id").alias("transaction_count")
            )
    )

Best Practices

Practice Rationale
βœ… Use Photon only for heavy joins/aggregations The ONLY use case where cost is justified
βœ… Enable on SQL Warehouses for BI Already default, optimized for analytical queries
βœ… Require >100GB datasets before considering Small data doesn't benefit enough
βœ… Benchmark rigorously before committing Most workloads won't see ROI - prove it first
❌ Don't use Photon by default 2x cost increase requires clear justification
❌ Don't use for simple ETL pipelines Data movement and basic transforms don't benefit
❌ Don't use for development/testing Cost never justified for non-production
❌ Don't use without many joins/aggregations Core value proposition - without these, skip Photon

Anti-Patterns to Avoid

❌ Don't: Use All-Purpose clusters for production jobs βœ… Do: Use dedicated Job Clusters with autoscaling

❌ Don't: Leave interactive clusters running 24/7 βœ… Do: Enable auto-termination on all interactive clusters

❌ Don't: Create one massive shared cluster for everything βœ… Do: Use purpose-specific clusters for different workloads

❌ Don't: Use GPU instances for non-ML workloads βœ… Do: Match compute resources to actual requirements

❌ Don't: Set min_workers = max_workers on variable workloads βœ… Do: Enable autoscaling to match demand


Last updated: November 2025