Integrating ClickHouse with Databricks

ClickHouse Supported

The ClickHouse Spark connector works seamlessly with Databricks. This guide covers platform-specific setup, installation, and usage patterns for Databricks.

API Selection for Databricks

In Databricks, you must use the TableProvider API (format-based access). Unity Catalog blocks any attempt to register Spark catalogs, so the Catalog API is not available in Databricks environments.

Installation on Databricks

Option 1: Upload JAR via Databricks UI

  1. Build or download the runtime JAR:

    clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
    
  2. Navigate to your Databricks workspace:

    • Go to Compute → Select your cluster
    • Click the Libraries tab
    • Click Install New
    • Select Upload JAR
    • Upload the runtime JAR file
    • Click Install
  3. Restart the cluster to load the library

Option 2: Install via Databricks CLI

# Upload JAR to DBFS
databricks fs cp clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar \
  dbfs:/FileStore/jars/

# Install on cluster
databricks libraries install \
  --cluster-id <your-cluster-id> \
  --jar dbfs:/FileStore/jars/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar

Option 3: Maven Coordinates (Recommended)

In your cluster configuration, add the Maven coordinates:

com.clickhouse.spark:clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}

Using TableProvider API

Because Unity Catalog blocks attempts to register custom Spark catalogs, all examples below use the TableProvider API (format-based access).

Reading Data

# Read from ClickHouse using TableProvider API
df = spark.read \
    .format("clickhouse") \
    .option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "events") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .load()

# Schema is automatically inferred
df.display()
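
Once loaded, the result is a regular Spark DataFrame, so it can be registered as a temporary view and queried with Spark SQL from the notebook. A minimal sketch, reusing the df from the example above (the event_type column is only a hypothetical example, not part of the connector):

# Register the ClickHouse-backed DataFrame as a temporary view
df.createOrReplaceTempView("events")

# Query it with Spark SQL; event_type is a hypothetical column in the events table
spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").display()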

Writing Data

# Write to ClickHouse - the table is created automatically if it doesn't exist.
# order_by is required when the connector creates a new table.
# settings.allow_nullable_key is required on ClickHouse Cloud if the ORDER BY key contains nullable columns.
df.write \
    .format("clickhouse") \
    .option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "events_copy") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .option("order_by", "id") \  # Required: specify ORDER BY when creating a new table
    .option("settings.allow_nullable_key", "1") \  # Required for ClickHouse Cloud if ORDER BY has nullable columns
    .mode("append") \
    .save()
Note

This example assumes preconfigured secret scopes in Databricks. For setup instructions, see the Databricks Secret management documentation.

Databricks-Specific Considerations

Secret Management

Use Databricks secret scopes to securely store ClickHouse credentials:

# Access secrets
password = dbutils.secrets.get(scope="clickhouse", key="password")

For setup instructions, see the Databricks Secret management documentation.
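
Before wiring the secret into a job, you can check that the scope and key exist with dbutils; a minimal sketch, assuming the clickhouse scope used throughout this guide:

# List the available secret scopes and the keys in the "clickhouse" scope
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("clickhouse"))

# Secret values are redacted in notebook output; read them only via dbutils.secrets.get
password = dbutils.secrets.get(scope="clickhouse", key="password")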

ClickHouse Cloud Connection

When connecting to ClickHouse Cloud from Databricks:

  1. Use HTTPS protocol (protocol: https, http_port: 8443)
  2. Enable SSL (ssl: true)
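
Because the same connection options are repeated on every read and write, it can help to collect them in a single dictionary and apply them with options(). A sketch, assuming the placeholder host and the clickhouse secret scope from the earlier examples:

# Shared ClickHouse Cloud connection options (host and secret scope are placeholders)
clickhouse_options = {
    "host": "your-clickhouse-cloud-host.clickhouse.cloud",
    "protocol": "https",
    "http_port": "8443",
    "ssl": "true",
    "database": "default",
    "user": "default",
    "password": dbutils.secrets.get(scope="clickhouse", key="password"),
}

# Reuse the options for any table
df = spark.read.format("clickhouse").options(**clickhouse_options).option("table", "events").load()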

Performance Optimization

  • Set an appropriate write batch size via spark.clickhouse.write.batchSize (see the sketch after this list)
  • Consider using JSON format for VariantType data
  • Predicate pushdown is enabled by default, so supported filters are evaluated in ClickHouse rather than in Spark
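
For example, the write batch size mentioned above can typically be set through the Spark session configuration; a minimal sketch (the value is only an illustrative starting point, not a recommendation):

# Tune the connector's write batch size for this Spark session
# (100000 rows is only an illustrative value; adjust to your row width and cluster size)
spark.conf.set("spark.clickhouse.write.batchSize", "100000")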

Runtime Compatibility

  • Spark 3.3, 3.4, 3.5, 4.0: Fully supported
  • Scala 2.12, 2.13: Supported (Spark 4.0 requires Scala 2.13)
  • Python 3: Supported
  • Java 8, 17: Supported (Java 17+ required for Spark 4.0)

Examples

Complete Workflow Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark with ClickHouse connector
spark = SparkSession.builder \
    .config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0") \
    .getOrCreate()

# Read from ClickHouse
df = spark.read \
    .format("clickhouse") \
    .option("host", "your-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "source_table") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .load()

# Transform data
transformed_df = df.filter(col("status") == "active")

# Write to ClickHouse
transformed_df.write \
    .format("clickhouse") \
    .option("host", "your-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "target_table") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .option("order_by", "id") \
    .mode("append") \
    .save()
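
As a quick sanity check after the write, the target table can be read back with the same options and counted; a short sketch mirroring the read above:

# Read the target table back to verify the write
written_df = spark.read \
    .format("clickhouse") \
    .option("host", "your-host.clickhouse.cloud") \
    .option("protocol", "https") \
    .option("http_port", "8443") \
    .option("database", "default") \
    .option("table", "target_table") \
    .option("user", "default") \
    .option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
    .option("ssl", "true") \
    .load()

print(written_df.count())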