
Batch & Streaming Processing

Ontul supports distributed batch and streaming data processing through its SDK, enabling ETL pipelines, data transformations, and continuous streaming jobs within the same cluster used for interactive SQL.

Job Types

Batch Jobs

Batch jobs process a bounded dataset and complete when all data has been processed. Use cases include ETL pipelines, data migration, periodic aggregation, and data export.

  • Small batch jobs run with the Master as the driver
  • Large batch jobs are automatically delegated to a Worker as the driver, keeping the Master lightweight

Streaming Jobs

Streaming jobs process unbounded data continuously and run until explicitly stopped. Use cases include real-time ingestion from Kafka, CDC pipelines, and continuous ETL.

  • Streaming jobs are always delegated to a Worker as the driver
  • Support watermarks and window operations (tumbling and sliding windows)
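
The SDK's window API is not shown here, but the window semantics themselves can be illustrated in plain Python. The sketch below (all names and sizes are made up for illustration) computes which window(s) an event timestamp falls into: a tumbling window assigns each event to exactly one fixed, non-overlapping window, while a sliding window can assign it to several overlapping ones.

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, size: timedelta) -> datetime:
    """Start of the single fixed, non-overlapping window containing ts."""
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch) % size  # distance past the last window boundary
    return ts - offset

def sliding_windows(ts: datetime, size: timedelta, slide: timedelta) -> list[datetime]:
    """Starts of every window of length `size`, advancing by `slide`, that contains ts."""
    start = tumbling_window(ts, slide)  # latest slide-aligned start <= ts
    starts = []
    while start > ts - size:  # window [start, start + size) still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)
```

For example, with 10-minute windows sliding every 5 minutes, an event at 12:07 belongs to the windows starting at 12:00 and 12:05, whereas a 5-minute tumbling window assigns it only to 12:05.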

SDK

Ontul provides SDKs for programmatic data processing:

Java SDK

OntulSession session = OntulSession.builder()
    .master("localhost", 47470)
    .token("your-jwt-token")
    .build();

// Read from S3
DataFrame df = session.source(Source.s3("s3://bucket/data", "parquet")
    .connection("my-s3-conn"));

// SQL query
DataFrame result = session.sql("SELECT * FROM iceberg_catalog.db.my_table");

// Write to Iceberg
df.sink(Sink.table("iceberg_catalog.db.target_table"));

Python SDK

from ontul import OntulSession

session = OntulSession(host="localhost", port=47470, token="your-jwt-token")

# Execute SQL
result = session.sql("SELECT * FROM iceberg_catalog.db.my_table LIMIT 10")

# Get Pandas DataFrame
df = session.sql_pandas("SELECT * FROM iceberg_catalog.db.my_table")

Execution Modes

Client Mode

  • Interactive execution from applications or notebooks
  • No dependency upload required — SDK only
  • session.source() / df.sink() — results return to the client
  • Source and Sink configurations reference connection IDs or inline properties

Server Mode

  • Upload JARs with custom dependencies to the server
  • session.submit(MyJob.class) — jobs run asynchronously
  • Client can disconnect after submission
  • Status and log polling via REST API or Admin UI
  • Suitable for scheduling with workflow orchestrators (e.g., Airflow)
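
Since server-mode jobs run asynchronously, an orchestrator typically polls job status until a terminal state. The exact REST endpoints are not documented here, so this sketch injects the status lookup as a callable; the function name and defaults are illustrative, not part of the Ontul API.

```python
import time
from typing import Callable

# Terminal states from the job lifecycle: SUBMITTED -> RUNNING -> COMPLETED / FAILED / KILLED
TERMINAL_STATES = {"COMPLETED", "FAILED", "KILLED"}

def wait_for_job(fetch_status: Callable[[str], str], job_id: str,
                 interval_s: float = 5.0, timeout_s: float = 3600.0) -> str:
    """Poll fetch_status(job_id) until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_s}s")
        time.sleep(interval_s)
```

In an Airflow task, `fetch_status` would wrap whatever REST call the deployment exposes for job status; the polling logic stays the same.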

Job Management

Jobs are managed through the REST API and Admin UI:

  • Submit, kill, and monitor jobs
  • View job status (SUBMITTED → RUNNING → COMPLETED / FAILED / KILLED)
  • Access job logs in real-time
  • Job history stored locally or on S3
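
The lifecycle above implies a small state machine. The transition table below is an interpretation of it: the documented path is SUBMITTED → RUNNING → COMPLETED / FAILED / KILLED, and allowing a kill while still SUBMITTED is an assumption, not something this page states.

```python
# Allowed status transitions; killing a not-yet-running job is assumed, not documented.
TRANSITIONS: dict[str, set[str]] = {
    "SUBMITTED": {"RUNNING", "KILLED"},
    "RUNNING": {"COMPLETED", "FAILED", "KILLED"},
    "COMPLETED": set(),  # terminal
    "FAILED": set(),     # terminal
    "KILLED": set(),     # terminal
}

def is_valid_transition(src: str, dst: str) -> bool:
    """True if a job may move from status src to status dst."""
    return dst in TRANSITIONS.get(src, set())
```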

Driver Delegation

Ontul automatically determines where to run the driver for each job:

Job Type        Driver Location
------------    ---------------
Small batch     Master
Large batch     Worker
Streaming       Worker (always)

This keeps the Master responsive for query planning and administrative operations while offloading heavy computation to Workers.
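
The delegation rule above can be sketched as a simple decision function. How Ontul distinguishes "small" from "large" batch jobs is not specified on this page, so the byte threshold below is a made-up stand-in for whatever heuristic the Master actually applies.

```python
def driver_location(job_type: str, input_bytes: int,
                    small_batch_limit: int = 1 << 30) -> str:
    """Pick a driver location per the delegation table; the size threshold is hypothetical."""
    if job_type == "streaming":
        return "worker"  # streaming jobs are always delegated to a Worker
    if input_bytes <= small_batch_limit:
        return "master"  # small batch: the Master acts as the driver
    return "worker"      # large batch: delegated to a Worker, keeping the Master light
```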