Integrating ClickHouse with Databricks
The ClickHouse Spark connector works seamlessly with Databricks. This guide covers platform-specific setup, installation, and usage patterns for Databricks.
API Selection for Databricks
In Databricks, you must use the TableProvider API (format-based access). Unity Catalog blocks any attempt to register Spark catalogs, so the Catalog API is not available in Databricks environments.
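With format-based access, every read and write names the data source explicitly and passes connection details as options. A minimal sketch (host and table are placeholders; complete examples follow below):
# All connection details travel with each read/write call rather than
# living in a registered catalog
df = spark.read.format("clickhouse") \
.option("host", "<host>") \
.option("database", "default") \
.option("table", "<table>") \
.load()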
Installation on Databricks
Option 1: Upload JAR via Databricks UI
1. Build or download the runtime JAR:
clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
2. Navigate to your Databricks workspace:
   - Go to Compute → Select your cluster
   - Click the Libraries tab
   - Click Install New
   - Select Upload → JAR
   - Upload the runtime JAR file
   - Click Install
3. Restart the cluster to load the library
Option 2: Install via Databricks CLI
# Upload JAR to DBFS
databricks fs cp clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar \
dbfs:/FileStore/jars/
# Install on cluster
databricks libraries install \
--cluster-id <your-cluster-id> \
--jar dbfs:/FileStore/jars/clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}-{{ stable_version }}.jar
Option 3: Maven Coordinates (Recommended)
In your cluster configuration, add the Maven coordinates:
com.clickhouse.spark:clickhouse-spark-runtime-{{ spark_binary_version }}_{{ scala_binary_version }}:{{ stable_version }}
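For example, for Spark 3.4 with Scala 2.12 and connector release 0.9.0 (the versions used in the workflow example below), the coordinate resolves to:
com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0
Databricks resolves and installs the artifact from Maven Central when the cluster starts, so no manual JAR upload is needed.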
Using the TableProvider API
As explained above, Unity Catalog prevents registering custom Spark catalogs, so all reads and writes in Databricks go through the format-based TableProvider API shown here.
Reading Data
Python:
# Read from ClickHouse using the TableProvider API
df = spark.read \
.format("clickhouse") \
.option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
.option("protocol", "https") \
.option("http_port", "8443") \
.option("database", "default") \
.option("table", "events") \
.option("user", "default") \
.option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
.option("ssl", "true") \
.load()
# Schema is automatically inferred
df.display()
Scala:
val df = spark.read
.format("clickhouse")
.option("host", "your-clickhouse-cloud-host.clickhouse.cloud")
.option("protocol", "https")
.option("http_port", "8443")
.option("database", "default")
.option("table", "events")
.option("user", "default")
.option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
.option("ssl", "true")
.load()
df.show()
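Reads benefit from column pruning and predicate pushdown (enabled by default), so narrowing the DataFrame early reduces how much data ClickHouse returns. A minimal sketch, assuming the events table has id and status columns:
# Only the selected columns are fetched, and supported filters are
# evaluated by ClickHouse rather than in Spark
active_events = df.select("id", "status").where("status = 'active'")
active_events.display()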
Writing Data
Python:
# Write to ClickHouse - the table is created automatically if it doesn't exist.
# order_by is required when the connector creates a new table, and
# settings.allow_nullable_key is required on ClickHouse Cloud if the
# ORDER BY key contains nullable columns.
df.write \
.format("clickhouse") \
.option("host", "your-clickhouse-cloud-host.clickhouse.cloud") \
.option("protocol", "https") \
.option("http_port", "8443") \
.option("database", "default") \
.option("table", "events_copy") \
.option("user", "default") \
.option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
.option("ssl", "true") \
.option("order_by", "id") \
.option("settings.allow_nullable_key", "1") \
.mode("append") \
.save()
Scala:
df.write
.format("clickhouse")
.option("host", "your-clickhouse-cloud-host.clickhouse.cloud")
.option("protocol", "https")
.option("http_port", "8443")
.option("database", "default")
.option("table", "events_copy")
.option("user", "default")
.option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
.option("ssl", "true")
.option("order_by", "id") // Required: specify ORDER BY when creating a new table
.option("settings.allow_nullable_key", "1") // Required for ClickHouse Cloud if ORDER BY has nullable columns
.mode("append")
.save()
Databricks-Specific Considerations
Secret Management
Use Databricks secret scopes to securely store ClickHouse credentials:
# Access secrets
password = dbutils.secrets.get(scope="clickhouse", key="password")
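If the scope doesn't exist yet, one way to create it is with the Databricks CLI (legacy v0.x syntax shown as a sketch; the exact commands differ in newer CLI versions):
# Create a secret scope and store the ClickHouse password in it
databricks secrets create-scope --scope clickhouse
databricks secrets put --scope clickhouse --key password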
For setup instructions, see the Databricks Secret management documentation.
ClickHouse Cloud Connection
When connecting to ClickHouse Cloud from Databricks:
- Use the HTTPS protocol (protocol: https, http_port: 8443)
- Enable SSL (ssl: true)
- Set an appropriate write batch size via spark.clickhouse.write.batchSize, as shown below
- Consider using JSON format for VariantType data
- Leave predicate pushdown enabled (the default) so filters are evaluated by ClickHouse
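For example, to raise the write batch size for the current session (100000 is an illustrative value; tune it for your workload and row width):
# Applies to all subsequent ClickHouse writes in this session
spark.conf.set("spark.clickhouse.write.batchSize", "100000")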
Runtime Compatibility
- Spark 3.3, 3.4, 3.5, 4.0: Fully supported
- Scala 2.12, 2.13: Supported (Spark 4.0 requires Scala 2.13)
- Python 3: Supported
- Java 8, 17: Supported (Java 17+ required for Spark 4.0)
Examples
Complete Workflow Example
Python:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# On Databricks the SparkSession already exists and the connector should be
# installed as a cluster library (see Installation above); the
# spark.jars.packages config only takes effect when the session is created
# outside Databricks.
spark = SparkSession.builder \
.config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0") \
.getOrCreate()
# Read from ClickHouse
df = spark.read \
.format("clickhouse") \
.option("host", "your-host.clickhouse.cloud") \
.option("protocol", "https") \
.option("http_port", "8443") \
.option("database", "default") \
.option("table", "source_table") \
.option("user", "default") \
.option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
.option("ssl", "true") \
.load()
# Transform data
transformed_df = df.filter(col("status") == "active")
# Write to ClickHouse
transformed_df.write \
.format("clickhouse") \
.option("host", "your-host.clickhouse.cloud") \
.option("protocol", "https") \
.option("http_port", "8443") \
.option("database", "default") \
.option("table", "target_table") \
.option("user", "default") \
.option("password", dbutils.secrets.get(scope="clickhouse", key="password")) \
.option("ssl", "true") \
.option("order_by", "id") \
.mode("append") \
.save()
Scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// On Databricks the SparkSession already exists and the connector should be
// installed as a cluster library (see Installation above); the
// spark.jars.packages config only takes effect when the session is created
// outside Databricks.
val spark = SparkSession.builder
.config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.4_2.12:0.9.0")
.getOrCreate()
// Read from ClickHouse
val df = spark.read
.format("clickhouse")
.option("host", "your-host.clickhouse.cloud")
.option("protocol", "https")
.option("http_port", "8443")
.option("database", "default")
.option("table", "source_table")
.option("user", "default")
.option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
.option("ssl", "true")
.load()
// Transform data
val transformedDF = df.filter(col("status") === "active")
// Write to ClickHouse
transformedDF.write
.format("clickhouse")
.option("host", "your-host.clickhouse.cloud")
.option("protocol", "https")
.option("http_port", "8443")
.option("database", "default")
.option("table", "target_table")
.option("user", "default")
.option("password", dbutils.secrets.get(scope="clickhouse", key="password"))
.option("ssl", "true")
.option("order_by", "id")
.mode("append")
.save()