Arrow-Native Execution Engine

Ontul's execution engine is built entirely on Apache Arrow, processing all data in Arrow columnar format from ingestion to output. This eliminates unnecessary serialization/deserialization overhead and enables vectorized computation.

Arrow Columnar Format

All internal data is represented as Arrow RecordBatches. Data is converted to Arrow format once at the source (connector) and converted out once at the destination (client or sink). Everything in between — filtering, joining, aggregating, sorting — operates directly on Arrow vectors with zero-copy.

Vectorized Operator Pipeline

The execution engine implements a pull-based streaming operator pipeline:

ScanOperator: Reads data splits from connectors as Arrow RecordBatches
FilterOperator: Evaluates predicates on Arrow vectors
ProjectOperator: Computes expressions and selects columns
HashJoinOperator: Hash-based join with Arrow vector probing
SortMergeJoinOperator: Sort-merge join for large datasets
HashAggregateOperator: Hash-based aggregation on Arrow vectors
SortOperator: Sorts Arrow RecordBatches
TopNOperator: Efficient top-N selection without full sort
WindowOperator: Window functions (ROW_NUMBER, RANK, DENSE_RANK, COUNT, SUM, AVG, MIN, MAX) with PARTITION BY and ORDER BY
ExchangeOperator: Shuffles data between Workers via Arrow Flight

Each operator follows an open() → next() → close() lifecycle, streaming data through the pipeline without materializing full intermediate results.

Java-Native Implementation

The execution engine is implemented purely in Java using the Arrow Java libraries (arrow-vector, arrow-algorithm). There are no JNI dependencies, no native engines (Velox, DataFusion), and no external runtime requirements — ensuring portability and simplifying deployment.

Memory Management & Disk Spill

When operators exceed the memory threshold, intermediate results are spilled to local disk via SpillStore and automatically cleaned up after query completion. Worker local disk is used exclusively for temporary spill — Ontul is a processing engine, not a storage engine.

Data Transport

Arrow Flight is used for all bulk data transfer between nodes:

Query results from Workers to Master to Client
Shuffle data between Workers (Worker-to-Worker)
Source/Sink data transfers for jobs

All data stays in Arrow IPC binary format throughout — no JSON serialization, no base64 encoding.