Apache Cloudberry
Self-hosted MPP database for advanced analytics, built on a modern PostgreSQL kernel.
A PostgreSQL-based MPP database, honestly reviewed. Not for everyone — but free if you need it.
TL;DR
- What it is: An open-source Massively Parallel Processing (MPP) database forked from Greenplum Database, built on a PostgreSQL 14.4 kernel and donated to the Apache Software Foundation [README].
- Who it’s for: Data engineers and DBAs running large-scale analytics or data warehouse workloads who previously paid for Greenplum or a managed service like Redshift. Not a tool for non-technical founders.
- Cost savings: Redshift, BigQuery, and Snowflake pricing starts affordable and scales to thousands per month at serious data volumes. Cloudberry’s Apache 2.0 license means you can run it on your own hardware for the cost of electricity and ops time [README].
- Key strength: Genuine Apache 2.0 license — not “fair-code,” not source-available, not BSL. You can fork, embed, redistribute, and build commercial products on top without calling a lawyer [README].
- Key weakness: Only 1,192 GitHub stars and currently in Apache “Incubating” status, meaning it hasn’t been fully vetted and endorsed as a top-level ASF project [README][homepage]. Self-hosting an MPP cluster requires multiple machines, serious Linux expertise, and ongoing operational burden — this isn’t a VPS-and-docker-compose situation.
What is Apache Cloudberry
Apache Cloudberry is a Massively Parallel Processing (MPP) database. MPP means query execution is split across many nodes at once: the coordinator node receives your SQL, breaks it into pieces, and dispatches work to segment nodes that run it in parallel. For analytical queries scanning billions of rows across terabytes of data, this architecture can be far faster than a single-node PostgreSQL instance, because each segment scans only its own slice of the data.
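To make the coordinator/segment split concrete, here is a toy Python simulation of a distributed `SUM ... GROUP BY`. It is only a sketch of the general scatter-gather idea, not Cloudberry's actual planner or hash function: the segment count, the CRC32 stand-in hash, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stand-in hash)."""
    return crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Place each (customer, amount) row on exactly one segment."""
    segments = [[] for _ in range(num_segments)]
    for customer, amount in rows:
        segments[segment_for(customer, num_segments)].append((customer, amount))
    return segments


def segment_sum(local_rows):
    """Work done on one segment: a partial SUM(amount) GROUP BY customer."""
    partial = defaultdict(int)
    for customer, amount in local_rows:
        partial[customer] += amount
    return dict(partial)


def coordinator_query(segments):
    """Dispatch the partial aggregate to every segment, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_sum, segments))
    merged = defaultdict(int)
    for partial in partials:
        for customer, total in partial.items():
            merged[customer] += total
    return dict(merged)


rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
totals = coordinator_query(distribute(rows))
```

Each segment only ever touches the rows that hashed to it, and the coordinator does nothing but merge small partial results. That division of labor is where the parallel speedup comes from.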
The project was created by the original developers of Greenplum Database, VMware’s open-source MPP system based on PostgreSQL. When Broadcom acquired VMware in 2023 and started signaling reduced investment in open-source Greenplum, the community forked the codebase and contributed it to the Apache Software Foundation under the name Cloudberry. The GitHub description is unambiguous: “One advanced and mature open-source MPP database. Open source alternative to Greenplum Database” [README].
The PostgreSQL lineage matters practically. Cloudberry ships with a PostgreSQL 14.4 kernel, which is significantly newer than older Greenplum releases. This means better SQL compatibility, more modern extensions, and a broader ecosystem of PostgreSQL-native tooling. If you know how to write PostgreSQL queries and use psql, you can talk to Cloudberry [docs][README].
As of this review, Cloudberry sits at 1,192 GitHub stars — a number that reflects its narrow audience (enterprise data warehouse operators) rather than the quality of the project itself.
Why people choose it
The clearest use case is Greenplum migration. The official FAQ is direct: “One goal of Apache Cloudberry (Incubating) is to be compatible with Greenplum to let users use Cloudberry the way they use Greenplum. You can migrate from Greenplum to Cloudberry using gpbackup or other migration tools” [homepage]. If your organization has an existing Greenplum investment and wants to avoid Broadcom’s commercial licensing direction, Cloudberry is the obvious off-ramp.
The second use case is avoiding managed data warehouse pricing. Redshift, BigQuery, and Snowflake all have pricing models that start reasonable and compound as data volume, query frequency, and compute usage grow. Organizations storing and querying multiple terabytes of operational or analytical data regularly see bills in the thousands per month. Cloudberry running on bare-metal or on-premises hardware replaces that recurring cost with capital expense and operational overhead [README].
The third driver is the Apache 2.0 license. For companies building internal data platforms or embedding analytics capabilities into products, the license terms matter. Apache 2.0 permits commercial use, modification, distribution, and sublicensing without copyleft requirements. Competing commercial data warehouses are closed-source. Some newer open-source alternatives use Business Source License (BSL) or other forms that restrict commercial use. Apache 2.0 has none of those restrictions [README].
What’s notably absent from the available sources: independent third-party benchmark comparisons, user reviews on Trustpilot or G2, or community commentary on Reddit or Hacker News. The sources available for this review are primarily official Apache Cloudberry documentation pages [1][2][3][4][5]. This is its own signal: the project has a small but specialized audience that doesn’t generate the kind of casual web commentary that tools like n8n or Activepieces attract.
Features
From the official documentation and README:
Core MPP engine:
- Coordinator/segment architecture — SQL dispatched to parallel segment nodes [README]
- External tables with gpfdist for high-performance parallel data loading [4][5]
- COPY command with single-row error isolation mode and configurable reject limits [3]
- Partitioned tables for time-series and range-based data management [docs]
- Replicated table distribution for reference data [3]
- Full SQL compatibility via PostgreSQL 14.4 kernel [README][homepage]
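The single-row error isolation idea from the COPY bullet above can be sketched in a few lines of Python. This simulates the behavior for one segment only. The function name, the two-column row format, and the rule that the load aborts once the reject count reaches the limit are assumptions for illustration, so check the COPY reference [3] for the exact semantics.

```python
class RejectLimitReached(Exception):
    """Raised when too many malformed rows are seen; the load is abandoned."""


def load_with_error_isolation(lines, reject_limit):
    """Parse 'id,amount' lines, setting malformed rows aside one at a time.

    Assumption: the load aborts once the number of rejected rows
    reaches reject_limit.
    """
    good, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            raw_id, raw_amount = line.split(",")
            good.append((int(raw_id), float(raw_amount)))
        except ValueError as exc:
            # A bad row is recorded instead of failing the whole load.
            rejected.append((line_no, line, str(exc)))
            if len(rejected) >= reject_limit:
                raise RejectLimitReached(
                    f"{len(rejected)} rejected rows reached the limit of {reject_limit}"
                ) from exc
    return good, rejected


sample = ["1,9.50", "2,not-a-number", "3,4.25", "broken line", "5,1.00"]
loaded, errors = load_with_error_isolation(sample, reject_limit=3)
```

With a limit of 3, the two bad rows are set aside and the three good rows load. Lowering the limit to 2 makes the same input fail as a whole.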
Analytics capabilities:
- Window functions, CTEs, lateral joins — standard PostgreSQL analytical SQL [docs]
- PostGIS spatial extensions [docs]
- PXF (Platform Extension Framework) for querying external data sources — HDFS, S3, Hive, HBase [README]
- Support for AI/ML workloads (described in general terms on the homepage, specifics not detailed in available sources) [homepage]
Security and auth:
- pg_hba.conf-based client authentication — local socket, TCP/IP, SSL-enforced connections [1][2]
- Role-based access control via PostgreSQL’s role system [1][2]
- SSL/TLS for client connections with hostssl configuration [1][2]
- gpfdist supports SSL with gpfdists:// protocol [4][5]
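For readers who have not edited pg_hba.conf before, the sketch below renders a few records in the standard PostgreSQL column order (TYPE, DATABASE, USER, ADDRESS, METHOD). The helper function is hypothetical, it accepts only a subset of the connection types PostgreSQL knows, and the role names, database name, and subnet are made up. Treat the output as an illustration rather than a recommended policy.

```python
def hba_line(conn_type, database, user, address, method):
    """Format one pg_hba.conf record: TYPE DATABASE USER ADDRESS METHOD."""
    allowed_types = {"local", "host", "hostssl", "hostnossl"}  # subset only
    if conn_type not in allowed_types:
        raise ValueError(f"unknown connection type: {conn_type}")
    if conn_type == "local":
        # Unix-socket records have no ADDRESS column.
        fields = [conn_type, database, user, method]
    else:
        fields = [conn_type, database, user, address, method]
    return "  ".join(fields)


rules = [
    # Hypothetical admin role connecting over the local socket.
    hba_line("local", "all", "gpadmin", None, "peer"),
    # Hypothetical reporting role: TCP/IP allowed only when SSL is used.
    hba_line("hostssl", "analytics", "report_user", "10.0.0.0/24", "scram-sha-256"),
    # Refuse unencrypted TCP/IP connections from anywhere.
    hba_line("hostnossl", "all", "all", "0.0.0.0/0", "reject"),
]
pg_hba_fragment = "\n".join(rules)
```

Records are matched top to bottom and the first match wins, so the order of the list matters as much as its contents.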
Operational tooling:
- gpfdist: parallel file distribution service for bulk load/unload operations [4][5]
- gpbackup: backup utility (separate repository) [README]
- Docker-based sandbox for evaluation [README]
- Helm charts and Linux build guides for RHEL/Rocky/Ubuntu [README]
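As a rough picture of how gpfdist fits into loading, the Python helper below assembles a Greenplum-style `CREATE EXTERNAL TABLE` statement that points at a gpfdist endpoint. The helper, table name, host, port, and file pattern are all hypothetical, and identifiers are not quoted or validated. It is a sketch of the statement's shape, so confirm the clauses against the gpfdist and external table documentation [4][5] for your version.

```python
def external_table_ddl(table, columns, host, port, pattern, use_ssl=False):
    """Build a CREATE EXTERNAL TABLE statement that reads CSV files via gpfdist."""
    # gpfdists:// is the SSL variant of the gpfdist:// protocol.
    scheme = "gpfdists" if use_ssl else "gpfdist"
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_sql})\n"
        f"LOCATION ('{scheme}://{host}:{port}/{pattern}')\n"
        f"FORMAT 'CSV';"
    )


ddl = external_table_ddl(
    table="ext_sales",                      # hypothetical table name
    columns=[("sale_id", "int"), ("amount", "numeric")],
    host="etl1",                            # hypothetical ETL host running gpfdist
    port=8081,
    pattern="sales_*.csv",
    use_ssl=True,
)
```

The external table holds no data itself. Segments pull the files from the gpfdist service in parallel when you run something like `INSERT INTO sales SELECT * FROM ext_sales`.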
Ecosystem repositories:
- cloudberry-backup: backup utility
- cloudberry-pxf: Platform Extension Framework
- cloudberry-go-libs: Go client libraries [README]
What I could not verify from available sources: specific benchmark numbers, performance comparisons with Greenplum or Redshift, or user-reported production scale (number of nodes, data volumes, query performance at scale).
Pricing: SaaS vs self-hosted math
Apache Cloudberry has no commercial pricing tier. It is entirely free software under Apache 2.0 [README]. The cost comparison is therefore Cloudberry-on-your-infrastructure versus a managed data warehouse service.
Managed data warehouse pricing (public rates, no direct source in scraped data): Cloud data warehouse pricing varies widely by provider, compute tier, and storage volume. Exact current rates were not available in the scraped sources for this review — consult the provider pricing pages directly before making a decision. What is consistently true: at multi-terabyte scale with sustained query load, managed service bills in the hundreds to thousands of dollars per month are common.
Self-hosted Cloudberry:
- Software: $0 (Apache 2.0) [README]
- Hardware or cloud compute: depends entirely on your cluster size — an MPP deployment is not a single VPS. A minimal evaluation cluster might run 1 coordinator + 2 segment hosts. Production deployments that justify MPP typically run significantly more nodes.
- Storage: your cost — block storage, NAS, or S3-compatible object storage via PXF [README]
- Operations: a meaningful time investment for a DBA familiar with PostgreSQL and Linux. This is not a fire-and-forget deployment.
The honest math: if you’re running less than 1TB of analytical data with moderate query load, a single well-tuned PostgreSQL instance (or even Supabase) is cheaper to operate than an MPP cluster. Cloudberry’s economics only become compelling when you have data volumes and query patterns that genuinely require parallelism across multiple nodes — or when you’re migrating from Greenplum and want license continuity [README][homepage].
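The comparison above comes down to simple arithmetic, sketched here in Python. Every number in the example is a made-up placeholder, not a quoted price from any provider or from the sources for this review. Substitute your own managed bill, node cost, and ops time.

```python
def self_hosted_monthly_cost(nodes, node_cost, ops_hours, hourly_rate):
    """Monthly cost of running the cluster yourself: hardware plus people."""
    return nodes * node_cost + ops_hours * hourly_rate


def monthly_savings(managed_bill, nodes, node_cost, ops_hours, hourly_rate):
    """Positive means self-hosting is cheaper; negative means it costs more."""
    return managed_bill - self_hosted_monthly_cost(
        nodes, node_cost, ops_hours, hourly_rate
    )


# Hypothetical cluster: 1 coordinator + 2 segment hosts at $300 each per month,
# plus 20 hours of DBA time at $75 per hour.
cluster = dict(nodes=3, node_cost=300, ops_hours=20, hourly_rate=75)

large_workload = monthly_savings(managed_bill=4000, **cluster)  # hypothetical bill
small_workload = monthly_savings(managed_bill=600, **cluster)   # hypothetical bill
```

With these placeholder inputs the cluster costs $2,400 per month, so it saves $1,600 against a $4,000 managed bill and loses $1,800 against a $600 one. Ops time is usually the term people leave out, and it is often the one that decides the result.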
Deployment reality check
The README documents two installation paths: building from source on Linux (RHEL/Rocky Linux or Ubuntu) following the deployment guides, or spinning up a Docker-based sandbox for evaluation [README].
The sandbox path is genuinely quick — it’s a single Docker command to get a running instance for exploration. The production path is a different story.
What a real deployment requires:
- Multiple Linux hosts (coordinator + at least 2 segment nodes for meaningful parallelism) [README]
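- Local storage on each segment host (the Greenplum-derived design is shared-nothing, so segments do not read from a common disk) and a reliable network between all nodes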
- PostgreSQL-level familiarity — you configure authentication, roles, and connection handling through pg_hba.conf and the PostgreSQL configuration system [1][2]
- Understanding of external table design and gpfdist for loading data at scale [3][4][5]
- A DBA or data engineer who can manage PostgreSQL-derived systems
What can go sideways:
- The project is still in Apache Incubating status. Incubation doesn’t mean the code is bad — it means the governance structures, communication processes, and infrastructure are still being evaluated by the ASF. Projects can stay in incubation for years; some graduate, some are retired [homepage].
- Documentation is split across versions (1.x docs are explicitly marked as “no longer actively maintained” [1][3][4]) — you need to track which version you’re running and read the corresponding docs carefully.
- The gpfdist parallel loader requires its own operational understanding — it runs as a separate service on ETL hosts and requires proper configuration for SSL, timeouts, and multi-threading [4][5].
- Community support channels are Slack, GitHub Discussions, and mailing lists — enterprise-grade support (SLAs, dedicated support engineers) is not available from the project itself [README][homepage].
Realistic estimate for a data engineer who knows PostgreSQL: 1–2 days to get a functioning multi-node cluster on Linux. For a developer new to MPP databases: expect a week to understand the architecture, configure the cluster correctly, and test data loading patterns.
Pros and Cons
Pros
- Genuine Apache 2.0 license. The most permissive license in its class — use it commercially, fork it, embed it, sublicense it without restriction [README]. Competing open-source alternatives often carry more restrictive terms.
- PostgreSQL 14.4 kernel. More modern than older Greenplum releases, meaning better SQL compatibility and access to newer PostgreSQL extensions [README][homepage].
- Greenplum migration path. Explicitly compatible with Greenplum. If you’re escaping Broadcom’s licensing direction, Cloudberry is the documented path out [homepage].
- Mature codebase. Built on years of Greenplum development — this is not a greenfield project. The core MPP execution engine is battle-tested in production at scale [README].
- PXF for external data. Can query S3, HDFS, Hive, and other external systems without loading data in — useful for data lake architectures [README].
- Docker sandbox for evaluation. Low-friction way to explore features before committing to infrastructure [README].
- Active Apache incubation. The ASF umbrella provides vendor-neutral governance, and projects that graduate become top-level ASF projects run by their own communities [homepage].
Cons
- 1,192 GitHub stars. Low adoption relative to its category. Narrow audience, limited community resources, fewer Stack Overflow answers to your edge-case questions [README].
- Incubating status. The project has not yet been validated by the ASF as a mature, self-governing project. Future direction is uncertain until graduation [homepage].
- No commercial support tier. If your data warehouse goes down at 2am, you’re on your own — or paying a third-party consultant [homepage].
- MPP clusters are not simple to operate. This is not a $10/month VPS setup. Multi-node Linux clusters with storage on every host and careful network configuration are required [README].
- Documentation fragmentation. Actively maintained docs (2.x), deprecated docs (1.x), and unreleased next-version docs exist simultaneously — easy to accidentally follow the wrong version [1][2].
- Limited third-party ecosystem commentary. No significant presence on Trustpilot, G2, or community forums at review time. Makes independent validation difficult.
- The “AI/ML workloads” pitch is vague. The homepage and README mention AI/ML support without specifics about what this means in practice [homepage][README]. Likely refers to running SQL-based ML queries or connecting via external tools, not native model training.
Who should use this / who shouldn’t
Use Apache Cloudberry if:
- You’re running Greenplum today and want to migrate off Broadcom licensing without rewriting your data pipelines.
- You have multi-terabyte analytical workloads currently running on a managed service like Redshift or BigQuery and the monthly bill has crossed into four figures.
- You have the in-house DBA expertise to run a PostgreSQL-derived MPP cluster.
- You need Apache 2.0 licensing for a product you’re building or embedding — no copyleft restrictions, no commercial use limitations.
- You’re a data platform team at a mid-size company that wants data warehouse control without vendor dependency.
Skip it (stay on managed cloud) if:
- You’re a solo founder or small team without a dedicated data engineer. The operational overhead will consume more time than the cost savings justify.
- Your data volumes are under 1TB. A well-tuned single PostgreSQL instance with read replicas handles that range more cheaply and with less complexity.
- You need 24/7 enterprise support with SLA guarantees — Cloudberry’s support is community-driven only.
- Your compliance team won’t approve self-hosted infrastructure for sensitive data.
Skip it (use ClickHouse instead) if:
- You want an open-source analytical database with a significantly larger community (30K+ stars), managed cloud options, excellent documentation, and better modern query performance on typical OLAP workloads.
Skip it (use DuckDB instead) if:
- Your analytical workloads run on a single machine. DuckDB is dramatically simpler, performs exceptionally well on single-node hardware, and has a thriving ecosystem.
Alternatives worth considering
- Greenplum Database — the commercial upstream. If you need enterprise support from a vendor, Greenplum is still available, though future investment under Broadcom is unclear.
- ClickHouse — open-source OLAP database (Apache 2.0) with 30K+ GitHub stars, a managed cloud offering, and excellent single-table analytical query performance. More accessible than MPP deployment.
- Apache Doris — an MPP database that, unlike Cloudberry, has already graduated to a top-level ASF project. More active community (9K+ stars), real-time analytics focus.
- Redshift — AWS’s managed MPP database. Expensive at scale but fully managed and deeply integrated with the AWS ecosystem.
- BigQuery — Google’s serverless data warehouse. No infrastructure to manage, serverless pricing, strong SQL support. The gold standard for teams that don’t want to think about clusters.
- Snowflake — the expensive benchmark. Excellent usability, strong ecosystem, premium pricing.
- DuckDB — SQLite for analytics. Handles hundreds of gigabytes on a laptop. If you don’t genuinely need multiple nodes, start here.
- PostgreSQL + Citus — horizontal scale-out via the Citus extension if you’re already deep in the PostgreSQL ecosystem.
For the typical unsubbed.co reader — a founder escaping SaaS bills — the honest advice is to look at ClickHouse or DuckDB before Apache Cloudberry. Cloudberry fills a specific gap (Greenplum replacement, Apache 2.0 MPP at large scale) that most small organizations don’t have.
Bottom line
Apache Cloudberry does what it says: it’s a mature, Apache 2.0-licensed MPP database that’s the open-source continuation of Greenplum, built on a newer PostgreSQL kernel, and usable as a data warehouse for large-scale analytics workloads. The license is clean, the codebase is real, and the Greenplum migration story is legitimate.
But this isn’t a tool for non-technical founders looking to escape a SaaS bill. It’s a tool for organizations with Greenplum expertise, multi-terabyte datasets, and data engineers who can operate a PostgreSQL-derived cluster. The 1,192 GitHub stars and thin third-party review presence reflect that narrow audience accurately. If you’re in that audience — Greenplum shops looking for an exit ramp, data platform teams that want Apache 2.0 licensing, analytics-at-scale operations that want to stop writing Redshift checks — Cloudberry is worth serious evaluation. Everyone else should look at ClickHouse, DuckDB, or a managed service first.
Sources
- [1] Configure Client Authentication (1.x) | Apache Cloudberry (Incubating). https://cloudberry.apache.org/docs/1.x/security/client-auth/
- [2] Configure Client Authentication (Next) | Apache Cloudberry (Incubating). https://cloudberry.apache.org/docs/next/security/client-auth/
- [3] COPY | Apache Cloudberry (Incubating). https://cloudberry.apache.org/docs/1.x/sql-stmts/copy/
- [4] gpfdist (1.x) | Apache Cloudberry (Incubating). https://cloudberry.apache.org/docs/1.x/sys-utilities/gpfdist/
- [5] gpfdist (Next) | Apache Cloudberry (Incubating). https://cloudberry.apache.org/docs/next/sys-utilities/gpfdist/
Primary sources:
- GitHub repository: https://github.com/apache/cloudberry (1,192 stars, Apache 2.0 license)
- Official website: https://cloudberry.apache.org
- Documentation: https://cloudberry.apache.org/docs