OpenPX
Metadata-Driven Data Platform

Build Data Pipelines Once. Run Them Local Or Cloud.

OpenPX is one Rust engine with a visual designer, 18 composable operators, distributed execution, and a portable .opx dataset format. Design fast on a laptop, then scale to on-prem clusters or cloud object storage: same engine, same compiler, no rewrite.

18 Composable operators
4 UI apps in the suite
3 Run modes: laptop → cloud
0 Migration rewrites

Built For Real Data Teams

OpenPX is designed for organizations that need repeatable, portable pipeline execution across development, on-prem, and cloud environments while keeping one coherent architecture.

One Unified Runtime

A single Rust engine and operator model run in every edition, eliminating drift between dev, on-prem, and production.

Portable .opx Format

Datasets are Parquet parts plus a metadata descriptor with relative paths, so the same files move across environments untouched.
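Why relative paths make a dataset portable can be shown with a small sketch. It assumes a directory-style dataset and a hypothetical `Descriptor` struct with made-up part file names; the actual .opx descriptor layout is not specified here.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical descriptor (assumption): part paths are stored
/// relative to the dataset root, never as absolute locations.
struct Descriptor {
    parts: Vec<String>,
}

/// Resolve every relative part path against wherever the dataset
/// currently lives. Moving the dataset only changes `root`.
fn resolve_parts(root: &Path, desc: &Descriptor) -> Vec<PathBuf> {
    desc.parts.iter().map(|p| root.join(p)).collect()
}
```

Because the descriptor never records the root, copying the directory to another machine or mount point requires no edits to the metadata.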

Object Storage Ready

Read and write directly to S3, MinIO, Azure Blob, or GCS, with distributed workers for multi-node processing at scale.

Visual, API & CLI Control

Drive pipelines from the Designer canvas, the HTTP API, or the openpx CLI, whichever fits your team and automation.

Hybrid Deployment

Start on a laptop, stay air-gapped on-prem if you must, or move to cloud-native profiles with minimal process changes.

Deterministic By Design

No hidden shuffles and no unseeded randomness: equal keys always co-locate, and output is byte-identical across worker counts.
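The co-location guarantee depends on a hash that is fixed and unseeded. The sketch below uses FNV-1a purely as an illustration; which hash function OpenPX uses is not stated here, and `partition_for` is a hypothetical helper name.

```rust
/// FNV-1a (64-bit): a fixed, unseeded hash, so the same bytes
/// always produce the same value on every run and every host.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Map a key to a partition. Equal keys always land in the same
/// partition for a given partition count.
fn partition_for(key: &str, partitions: u64) -> u64 {
    assert!(partitions > 0, "partition count must be nonzero");
    fnv1a(key.as_bytes()) % partitions
}
```

A seeded or per-process hash (such as Rust's default `HashMap` hasher) would break this property, which is why the sketch avoids it.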

One Engine, Built In Rust

At the core is a pull-based execution engine on Apache Arrow, with an explicit compiler, pluggable storage, and a real distributed transport. Everything the platform does is metadata-driven and inspectable before a single row moves.

  • Arrow-native core. Apache Arrow in memory and Parquet on disk, with a deterministic type system across every operator.
  • Pluggable backends. A native compute backend plus an optional Polars lazy backend; consecutive same-backend stages are fused into one plan.
  • Inspectable compiler. YAML jobs and Designer graphs lower to a LogicalPlan and then a PhysicalPlan, with schema inference, cycle detection, and per-node diagnostics.
  • Explicit partitioning. HASH, RANGE, ROUND_ROBIN, ENTIRE, and SAME partitioners; the compiler inserts shuffles only where contracts require them.
  • Distributed transport. A conductor drives gRPC workers across hosts for partition-parallel execution, verified byte-identical from 1 to N workers.
  • Pluggable storage. One StorageProvider trait dispatches by URI scheme: local filesystem or S3 / MinIO / Azure / GCS object stores.
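Dispatching storage by URI scheme could look roughly like the sketch below. The trait here is an illustrative stand-in: the real StorageProvider trait's methods are not shown on this page, and the scheme strings (`file`, `s3`, `az`, `gs`), the `LocalFs` and `ObjectStore` types, and the `provider_for` function are all assumptions.

```rust
/// Illustrative stand-in only (assumption): the real trait's
/// methods are not documented here.
trait StorageProvider {
    fn name(&self) -> &'static str;
}

struct LocalFs;
struct ObjectStore {
    kind: &'static str,
}

impl StorageProvider for LocalFs {
    fn name(&self) -> &'static str {
        "local"
    }
}

impl StorageProvider for ObjectStore {
    fn name(&self) -> &'static str {
        self.kind
    }
}

/// Choose a provider from the URI scheme; unknown schemes are rejected.
fn provider_for(uri: &str) -> Result<Box<dyn StorageProvider>, String> {
    // Bare paths without a scheme are treated as local (assumption).
    let scheme = match uri.split_once("://") {
        Some((s, _)) => s,
        None => "file",
    };
    let provider: Box<dyn StorageProvider> = match scheme {
        "file" => Box::new(LocalFs),
        "s3" => Box::new(ObjectStore { kind: "s3" }),
        "az" => Box::new(ObjectStore { kind: "azure" }),
        "gs" => Box::new(ObjectStore { kind: "gcs" }),
        other => return Err(format!("unsupported scheme: {}", other)),
    };
    Ok(provider)
}
```

Keeping the dispatch in one function means the rest of a pipeline can hold a `Box<dyn StorageProvider>` without knowing which backend sits behind it.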

At A Glance

Language: Rust
In-memory format: Apache Arrow
Persisted format: Parquet · .opx
Execution: Pull-based, partition-parallel
Compute backends: Native · Polars
Distributed transport: gRPC (conductor + workers)
Storage: Local FS · S3 · MinIO · Azure · GCS
Interfaces: Designer · HTTP API · CLI

A Complete Operator Library

Eighteen composable stage types cover the full pipeline, from sources and transforms to joins, change-data-capture, fan-in/fan-out, and sinks. Every operator carries an explicit partition and schema contract, so behavior is predictable at design time.

Sources (Read)

Bring data in from files, object stores, and databases.

Parquet · CSV · .opx dataset · S3 / object store · Database

Transforms (Compute)

Reshape rows and columns, choosing native or Polars per stage.

Filter · Transform · Project · Modify · Aggregate · Sort · Sample · Remove Duplicates

Joins & Lookups (Combine)

Multi-branch equality joins, enrichment, and CDC.

Join (inner/left/right/full) · Hash · Sort-Merge · Broadcast · Lookup · Change Capture · Change Apply

Combine & Route (Fan-in / out)

Merge many inputs or split a stream into many branches.

Funnel · Merge (k-way) · Copy · Switch
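For readers unfamiliar with the term, a k-way merge combines several already-sorted inputs into one sorted output. The sketch below is a generic heap-based version over in-memory integer vectors; it is not the OpenPX Merge operator itself, and `k_way_merge` is a hypothetical name.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs into one sorted output.
/// Ties are broken by input index, so the result is deterministic.
fn k_way_merge(inputs: &[Vec<i64>]) -> Vec<i64> {
    // Min-heap of (value, input index, position within that input).
    let mut heap = BinaryHeap::new();
    for (i, input) in inputs.iter().enumerate() {
        if let Some(&v) = input.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(inputs.iter().map(Vec::len).sum());
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        out.push(v);
        // Refill the heap from the input that just yielded a value.
        if let Some(&next) = inputs[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}
```

The heap never holds more than one entry per input, so memory stays proportional to the number of inputs rather than the number of rows.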

Partitioning (Shuffle)

The one explicit barrier that moves rows across partitions.

HASH · RANGE · ROUND_ROBIN · ENTIRE · SAME

Sinks (Write)

Persist results, load tables, or tap a stream for preview.

Parquet / .opx writer · Database writer · Peek (preview tap)

Four Apps For The Full Lifecycle

A React application suite spans design, administration, operations, and governance, all wired to the same compile, run, and preview APIs.

Designer

Live

Visually build jobs on a canvas, configure stage properties, and compile against the real engine.

  • Drag-and-drop stage palette with live schema import
  • Real /compile with node-localized diagnostics
  • Sampled preview and an ad-hoc SQL query notebook
  • Shuffle Lens heat-matrix of actual row movement

Admin

Suite

Manage the platform: projects, users and roles, connections, and secrets.

  • Project and workspace administration
  • Role-based access and credential management
  • Connection and secret configuration

Director

Suite

Operate and monitor runs: phases, shuffles, logs, and schedules in one place.

  • Run detail with phase and shuffle breakdowns
  • Execution logs and worker assignment
  • Schedule and operations visibility

Quality

Suite

Govern data: profiling, quality rules, a glossary, and a lineage explorer.

  • Data profiling and quality rule definitions
  • Column-level lineage graph explorer
  • Business glossary and governance workflows

Governance & Observability, Built In

Because plans are metadata-driven, OpenPX can explain what a pipeline does, and what actually happened, without guesswork.

Column-Level Lineage

Track how every column is derived, propagated, or dropped through the expression graph.

Shuffle Lens

See a source-to-destination row matrix, bytes moved, and per-partition skew for every run.
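As a rough illustration of what such a view summarizes, the sketch below builds a source-to-destination row matrix from a list of row movements and derives a skew figure. The skew definition used here (busiest destination divided by the mean per destination) is one common choice and an assumption, as are the function names.

```rust
/// Count rows moved from each source partition to each destination
/// partition. Each entry in `moves` is (source, destination) for one
/// row; indices must be within `sources` and `dests`.
fn row_matrix(moves: &[(usize, usize)], sources: usize, dests: usize) -> Vec<Vec<u64>> {
    let mut m = vec![vec![0u64; dests]; sources];
    for &(src, dst) in moves {
        m[src][dst] += 1;
    }
    m
}

/// Skew as (rows in the busiest destination) / (mean rows per
/// destination). 1.0 means perfectly balanced. Assumed definition.
fn skew(matrix: &[Vec<u64>]) -> f64 {
    let dests = matrix.first().map_or(0, |row| row.len());
    let totals: Vec<u64> = (0..dests)
        .map(|d| matrix.iter().map(|row| row[d]).sum::<u64>())
        .collect();
    let total: u64 = totals.iter().sum();
    if total == 0 {
        return 1.0; // nothing moved, so nothing is skewed
    }
    let max = *totals.iter().max().unwrap() as f64;
    max / (total as f64 / dests as f64)
}
```

With four rows where three land in one of two destinations, the mean is 2 rows per destination and the skew is 1.5.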

Run Reports

Trace logs, phases, worker assignments, and per-stage metrics for full run transparency.

Schema Inference

Infer Parquet schemas and validate fail-closed at compile time, so mismatches surface early.
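"Fail-closed" here means that anything not explicitly matching is treated as an error rather than silently accepted. A minimal sketch of that idea follows, with a hypothetical three-variant `DataType` enum and a hypothetical `validate_schema` function; the real type system and diagnostics are richer than this.

```rust
/// Hypothetical, deliberately tiny type set (assumption).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Float64,
}

/// Fail-closed check: a missing column, an unexpected column, or a
/// type mismatch is an error. All problems are collected so they
/// can be reported together.
fn validate_schema(
    expected: &[(&str, DataType)],
    actual: &[(&str, DataType)],
) -> Result<(), Vec<String>> {
    let mut errors = Vec::new();
    for (name, ty) in expected {
        match actual.iter().find(|(n, _)| n == name) {
            None => errors.push(format!("missing column `{}`", name)),
            Some((_, got)) if got != ty => errors.push(format!(
                "column `{}`: expected {:?}, found {:?}",
                name, ty, got
            )),
            Some(_) => {}
        }
    }
    for (name, _) in actual {
        if !expected.iter().any(|(n, _)| n == name) {
            errors.push(format!("unexpected column `{}`", name));
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Running a check like this before any rows move is what lets mismatches surface at compile time instead of partway through a run.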

Query Notebook

Run read-only ad-hoc SQL over any dataset to inspect and validate results in place.

Job Registry

Submit and track jobs and runs, backed optionally by Postgres for durable history.

Connectors & Formats

Read from and write to the sources your data already lives in.

Parquet · CSV · .opx datasets · Amazon S3 · MinIO · Azure Blob · Google Cloud Storage · PostgreSQL · MySQL · SQLite · Local filesystem

Choose Your Operating Model

OpenPX editions are intentionally compatible so teams can match regulatory, operational, and scale requirements without re-platforming.

OpenPX Local

Self-managed edition for developer velocity, on-prem operations, and air-gapped environments.

  • Disk-based .opx datasets for controlled environments
  • On-prem multi-node gRPC clusters over shared storage
  • Docker Compose stack: API, control plane, and metadata Postgres
  • Ideal for regulated and private infrastructure

OpenPX Cloud

Cloud-native edition for distributed execution and object-storage-centric data operations.

  • S3, MinIO, Azure, and GCS object storage support
  • Distributed conductor + worker pools over gRPC
  • Terraform IaC for AWS ECS Fargate (ARM64), ECR, and CloudWatch
  • Same .opx format, so datasets interchange with Local

From Pilot To Production

OpenPX helps teams move from initial proof-of-concept to production rollout through a staged but consistent execution model.

Phase 1: Design jobs and validate outcomes locally.
Phase 2: Standardize workflows with shared dataset contracts.
Phase 3: Adopt distributed execution as throughput grows.
Phase 4: Operationalize governance, monitoring, and scale.

Plan Your OpenPX Rollout

Whether you are modernizing ETL, launching internal data products, or standardizing hybrid data operations, OpenPX can provide a single execution foundation across environments.