
Batch & Streaming Processing

Ontul supports distributed batch and streaming data processing through its SDK, enabling ETL pipelines, data transformations, and continuous streaming jobs within the same cluster used for interactive SQL.

Job Types

Batch Jobs

Batch jobs process a bounded dataset and complete when all data has been processed. Use cases include ETL pipelines, data migration, periodic aggregation, and data export.

  • Small batch jobs run with the Master as the driver
  • Large batch jobs are automatically delegated to a Worker as the driver, keeping the Master lightweight

Streaming Jobs

Streaming jobs process unbounded data continuously and run until explicitly stopped. Use cases include real-time ingestion from Kafka, CDC pipelines, and continuous ETL.

  • Streaming jobs are always delegated to a Worker as the driver
  • Support watermarks and window operations (tumbling and sliding windows)
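
The SDK's window API is not shown here, but the window semantics themselves can be illustrated in plain Python. The sketch below (all names and sizes are made up for illustration) computes which window(s) an event timestamp falls into: a tumbling window assigns each event to exactly one fixed, non-overlapping window, while a sliding window can assign it to several overlapping ones.

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, size: timedelta) -> datetime:
    """Start of the single fixed, non-overlapping window containing ts."""
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch) % size  # distance past the last window boundary
    return ts - offset

def sliding_windows(ts: datetime, size: timedelta, slide: timedelta) -> list[datetime]:
    """Starts of every window of length `size`, advancing by `slide`, that contains ts."""
    start = tumbling_window(ts, slide)  # latest slide-aligned start <= ts
    starts = []
    while start > ts - size:  # window [start, start + size) still covers ts
        starts.append(start)
        start -= slide
    return sorted(starts)
```

For example, with 10-minute windows sliding every 5 minutes, an event at 12:07 belongs to the windows starting at 12:00 and 12:05, whereas a 5-minute tumbling window assigns it only to 12:05.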

SDK

Ontul provides SDKs for programmatic data processing:

Java SDK

OntulSession session = OntulSession.builder()
    .master("localhost", 47470)
    .token("your-jwt-token")
    .build();

// Read from S3
DataFrame df = session.source(Source.s3("s3://bucket/data", "parquet")
    .connection("my-s3-conn"));

// SQL query
DataFrame result = session.sql("SELECT * FROM iceberg_catalog.db.my_table");

// Write to Iceberg
df.sink(Sink.table("iceberg_catalog.db.target_table"));

Python SDK

from ontul import OntulSession

session = OntulSession(host="localhost", port=47470, token="your-jwt-token")

# Execute SQL
result = session.sql("SELECT * FROM iceberg_catalog.db.my_table LIMIT 10")

# Get Pandas DataFrame
df = session.sql_pandas("SELECT * FROM iceberg_catalog.db.my_table")

Execution Modes

Client Mode

  • Interactive execution from applications or notebooks
  • No dependency upload required — SDK only
  • session.source() / df.sink() — results return to the client
  • Source and Sink configurations reference connection IDs or inline properties

Server Mode

  • Upload JARs with custom dependencies to the server
  • session.submit(MyJob.class) — jobs run asynchronously
  • Client can disconnect after submission
  • Status and log polling via REST API or Admin UI
  • Suitable for scheduling with workflow orchestrators (e.g., Airflow)
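
Since server-mode jobs run asynchronously, an orchestrator typically polls job status until a terminal state. The exact REST endpoints are not documented here, so this sketch injects the status lookup as a callable; the function name and defaults are illustrative, not part of the Ontul API.

```python
import time
from typing import Callable

# Terminal states from the job lifecycle: SUBMITTED -> RUNNING -> COMPLETED / FAILED / KILLED
TERMINAL_STATES = {"COMPLETED", "FAILED", "KILLED"}

def wait_for_job(fetch_status: Callable[[str], str], job_id: str,
                 interval_s: float = 5.0, timeout_s: float = 3600.0) -> str:
    """Poll fetch_status(job_id) until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_s}s")
        time.sleep(interval_s)
```

In an Airflow task, `fetch_status` would wrap whatever REST call the deployment exposes for job status; the polling logic stays the same.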

Job Management

Jobs are managed through the REST API and Admin UI:

  • Submit, kill, and monitor jobs
  • View job status (SUBMITTED → RUNNING → COMPLETED / FAILED / KILLED)
  • Access job logs in real-time
  • Job history stored locally or on S3
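
The lifecycle above implies a small state machine. The transition table below is an interpretation of it: the documented path is SUBMITTED → RUNNING → COMPLETED / FAILED / KILLED, and allowing a kill while still SUBMITTED is an assumption, not something this page states.

```python
# Allowed status transitions; killing a not-yet-running job is assumed, not documented.
TRANSITIONS: dict[str, set[str]] = {
    "SUBMITTED": {"RUNNING", "KILLED"},
    "RUNNING": {"COMPLETED", "FAILED", "KILLED"},
    "COMPLETED": set(),  # terminal
    "FAILED": set(),     # terminal
    "KILLED": set(),     # terminal
}

def is_valid_transition(src: str, dst: str) -> bool:
    """True if a job may move from status src to status dst."""
    return dst in TRANSITIONS.get(src, set())
```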

Driver Delegation

Ontul automatically determines where to run the driver for each job:

Job Type        Driver Location
------------    ---------------
Small batch     Master
Large batch     Worker
Streaming       Worker (always)

This keeps the Master responsive for query planning and administrative operations while offloading heavy computation to Workers.
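
The delegation rule above can be sketched as a simple decision function. How Ontul distinguishes "small" from "large" batch jobs is not specified on this page, so the byte threshold below is a made-up stand-in for whatever heuristic the Master actually applies.

```python
def driver_location(job_type: str, input_bytes: int,
                    small_batch_limit: int = 1 << 30) -> str:
    """Pick a driver location per the delegation table; the size threshold is hypothetical."""
    if job_type == "streaming":
        return "worker"  # streaming jobs are always delegated to a Worker
    if input_bytes <= small_batch_limit:
        return "master"  # small batch: the Master acts as the driver
    return "worker"      # large batch: delegated to a Worker, keeping the Master light
```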