Redpanda Iceberg Docker Compose Example

This lab provides a Docker Compose environment to help you quickly get started with Redpanda and its integration with Apache Iceberg. It shows how Redpanda, using an S3-compatible object store (MinIO) as its Tiered Storage backend, can write topic data in the Iceberg format, enabling seamless analytics workflows. The lab also includes a Spark environment for querying the Iceberg tables with SQL through a Jupyter Notebook interface.

In this setup, you will:

  • Produce data to Redpanda topics that are Iceberg-enabled.

  • Observe how Redpanda writes this data in Iceberg format to MinIO as the Tiered Storage backend.

  • Use Spark to query the Iceberg tables, demonstrating a complete pipeline from data production to querying.

This environment is ideal for experimenting with Redpanda’s Iceberg and Tiered Storage capabilities, enabling you to test end-to-end workflows for analytics and data lake architectures.

Prerequisites

You must have the following installed on your machine:

  • Docker and Docker Compose

  • rpk, the Redpanda CLI

  • Git, to clone this repository

Run the lab

  1. Clone this repository:

    git clone https://github.com/redpanda-data/redpanda-labs.git
  2. Change into the docker-compose/iceberg/ directory:

    cd redpanda-labs/docker-compose/iceberg
  3. Set the REDPANDA_VERSION environment variable to at least version 24.3.1. For all available versions, see the GitHub releases.

    For example:

    export REDPANDA_VERSION=25.1.1
  4. Set the REDPANDA_CONSOLE_VERSION environment variable to the version of Redpanda Console that you want to run. For all available versions, see the GitHub releases.

    You must use at least v3.0.0 of Redpanda Console to deploy this lab.

    For example:

    export REDPANDA_CONSOLE_VERSION=3.0.0
  5. Start the Docker Compose environment, which includes Redpanda, MinIO, Spark, and Jupyter Notebook:

    docker compose build && docker compose up
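
    To run the stack in the background instead, start it in detached mode and confirm that all containers are up:

    # Build and start the containers in the background
    docker compose up -d --build
    # List the containers and their status
    docker compose ps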
  6. Create and switch to a new rpk profile that connects to your Redpanda broker:

    rpk profile create docker-compose-iceberg --set=admin_api.addresses=localhost:19644 --set=brokers=localhost:19092 --set=schema_registry.addresses=localhost:18081
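
    To verify that the new profile can reach the broker, run a quick check:

    # Confirm connectivity to the Redpanda cluster through the active profile
    rpk cluster info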
  7. Create two topics with Iceberg enabled:

    rpk topic create key_value --topic-config=redpanda.iceberg.mode=key_value
    rpk topic create value_schema_id_prefix --topic-config=redpanda.iceberg.mode=value_schema_id_prefix
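
    To confirm that Iceberg mode is enabled on a topic, inspect its configuration and look for redpanda.iceberg.mode:

    # Print the topic's configuration properties
    rpk topic describe key_value --print-configs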
  8. Produce data to the key_value topic:

    echo "hello world" | rpk topic produce key_value --format='%k %v\n'
  9. Open Redpanda Console at http://localhost:8081/topics to see that the topics exist in Redpanda.

  10. Open MinIO at http://localhost:9001/browser to view your data stored in the S3-compatible object store.

    Login credentials:

    • Username: minio

    • Password: minio123

  11. Open the Jupyter Notebook server at http://localhost:8888. The notebook guides you through querying Iceberg tables created from Redpanda topics.

  12. Create a schema in the Schema Registry:

    rpk registry schema create value_schema_id_prefix-value --schema schema.avsc
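
    To check that the schema was registered, fetch it back from the Schema Registry:

    # Retrieve the latest schema for the subject
    rpk registry schema get value_schema_id_prefix-value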
  13. Produce data to the value_schema_id_prefix topic:

    echo '{"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}\n{"user_id":3333,"event_type":"SCROLL","ts":"2024-11-25T20:24:14.774Z"}\n{"user_id":7272,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:24:34.552Z"}' | rpk topic produce value_schema_id_prefix --format='%v\n' --schema-id=topic

When the data is committed, it becomes available in Iceberg format, and you can query the table lab.redpanda.value_schema_id_prefix in the Jupyter Notebook.
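
After the first Iceberg commit, you can also query the table from the command line without opening the notebook. For example (a sketch using the spark-sql CLI in the spark-iceberg container, described in the next section):

    docker exec -it spark-iceberg spark-sql -e "SELECT * FROM lab.redpanda.value_schema_id_prefix LIMIT 10;"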

Alternative query interfaces

While the notebook server is running, you can query Iceberg tables directly using Spark’s CLI tools instead of the Jupyter Notebook:

Spark Shell:

    docker exec -it spark-iceberg spark-shell

Spark SQL:

    docker exec -it spark-iceberg spark-sql

PySpark:

    docker exec -it spark-iceberg pyspark
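
Each of these tools can also run a query non-interactively. For example, Iceberg's standard snapshots metadata table lets you inspect the commit history of the topic's table (a sketch; the snapshots metadata table is part of the Iceberg spec):

    # Show when each Iceberg snapshot was committed and by what operation
    docker exec -it spark-iceberg spark-sql -e "SELECT committed_at, snapshot_id, operation FROM lab.redpanda.value_schema_id_prefix.snapshots;"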

Clean up

To shut down and delete the containers along with all your cluster data:

docker compose down -v
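
To also remove the rpk profile that you created for this lab:

rpk profile delete docker-compose-iceberg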