Apache Airflow

Apache Airflow is the industry-standard platform for programmatically authoring, scheduling, and monitoring data pipelines and workflows as Python code.

Open-source workflow orchestration, honestly reviewed. What you actually get when you move beyond cron jobs and shell scripts.

TL;DR

  • What it is: Open-source (Apache 2.0) workflow orchestration platform where pipelines are defined as Python code, not config files or drag-and-drop UI [1][4].
  • Who it’s for: Data engineers, ML teams, and infrastructure engineers who need to schedule, monitor, and manage complex multi-step workflows — ETL pipelines, ML model training, data transformations [4][5].
  • Cost savings: AWS MWAA (managed Airflow) starts around $0.49/hour per environment (~$350/mo at minimum), Google Cloud Composer runs $0.34–$0.55/hour per environment. Self-hosted Airflow on a $20–40/mo VPS eliminates that bill entirely [4].
  • Key strength: De facto standard for data workflow orchestration. 320 million downloads in 2024 — roughly 10× the nearest competitor [1]. When you search for a solution to a data pipeline problem, there’s almost certainly an Airflow answer.
  • Key weakness: It is not for non-technical people. Workflows are Python code. Setup for production is genuinely complex. Resource tuning is necessary. If you don’t have a data engineer on staff, this tool will frustrate you [1].

What is Apache Airflow

Apache Airflow is a platform for defining, scheduling, and monitoring workflows entirely in Python code. The core abstraction is the DAG (Directed Acyclic Graph) — a Python file that describes tasks and the dependencies between them. When you tell Airflow to run a DAG, it executes the tasks in the right order, retries failures, logs everything, and shows you a visual representation of what happened [2][4].
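
To make that concrete, here is a minimal sketch of a DAG file in the Airflow 2.x style. The dag_id, commands, and retry settings are invented for illustration:

```python
# Minimal DAG sketch (Airflow 2.x imports). dag_id, commands, and retry
# settings below are illustrative, not taken from the review.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_etl",                  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                     # cron expressions also accepted
    catchup=False,
    default_args={
        "retries": 3,                      # automatic retry on failure
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True, # configurable backoff
    },
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load           # dependencies define the graph
```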

It was built at Airbnb in 2014 to handle increasingly complicated data workflows that had outgrown cron jobs. In 2016 it entered the Apache Incubator, and in 2019 it graduated as an Apache Software Foundation Top-Level Project [3]. At that point it was already deployed across 200+ organizations including Adobe, Etsy, Google, and Twitter [3]. Today, 500+ organizations are listed in the official user registry, and the project sits at 44,726 GitHub stars [1][2].

What separates Airflow from visual ETL tools and simpler schedulers is the code-first philosophy. Pipelines aren’t configured in a UI — they’re written in Python, version-controlled in git, reviewed in pull requests, and deployed like software. This means you can use loops, conditionals, environment variables, and the full Python standard library to generate tasks dynamically [4][5]. You can define a DAG that creates 50 parallel tasks based on a database query result, something that’s impractical or impossible in GUI-based tools.
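
A hedged sketch of that dynamic-generation pattern follows. The table list is hard-coded here, where a real DAG might build it from a query result or a config file:

```python
# Sketch: one task per table, generated in a loop. TABLES is hard-coded
# here; a real DAG might read it from a database or config at parse time.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "users", "payments"]   # stand-in for a runtime lookup

def sync_table(table_name: str) -> None:
    print(f"syncing {table_name}")         # real extraction logic goes here

with DAG(
    dag_id="sync_all_tables",              # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        PythonOperator(
            task_id=f"sync_{table}",       # sync_orders, sync_users, ...
            python_callable=sync_table,
            op_args=[table],
        )
```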

Airflow 3.0 added event-driven scheduling, DAG versioning, and the Task SDK — a Python-native interface for defining tasks that’s decoupled from Airflow internals, making DAGs more forward-compatible across Airflow versions [1][website].
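
A sketch of the decorator-based style the Task SDK promotes. The airflow.sdk import path assumes Airflow 3; on 2.x, the equivalent TaskFlow decorators live under airflow.decorators:

```python
# Decorator-style sketch. The airflow.sdk import assumes Airflow 3's
# Task SDK; verify the path against the docs for your version.
from datetime import datetime

from airflow.sdk import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]                   # return values travel via XCom

    @task
    def load(rows: list[int]) -> None:
        print(f"loaded {len(rows)} rows")

    load(extract())                        # the call chain defines the dependency

example_pipeline()
```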

The project is maintained by a community of core committers under the Apache Software Foundation, with Astronomer (the company) being the primary commercial sponsor and the one that runs the annual user survey [website].


Why people choose it

The single most honest way to describe Airflow’s position: it is the default choice, not a considered choice. When a data engineering team needs workflow orchestration, they reach for Airflow first because it’s what everyone already knows, the documentation is deep, Stack Overflow is full of answers, and provider packages exist for every cloud service they’re likely to use [1].

The Datamation review [1] puts it bluntly: Airflow is the “de facto tool for data engineering” according to VentureBeat’s 2025 assessment, with 92% of users saying they’d recommend it. That’s an unusually high satisfaction number for a tool that reviewers in the same sentence describe as having a steep learning curve and complex production setup. The implication is that the people who stuck with it find it worth the pain.

Code as configuration. The Python-native DAG model is genuinely powerful and the reason engineers prefer it over alternatives. Dynamic pipeline generation — creating tasks in a loop based on runtime data — is native. Version control of pipeline logic comes for free. When something breaks, you debug Python, not a proprietary config format [4][5].

Integration breadth. Airflow has provider packages for AWS, GCP, Azure, and dozens of third-party services [1][4][website]. This isn’t a small integration catalog — it’s the accumulated work of the data infrastructure community over a decade. If you’re orchestrating jobs that touch Snowflake, BigQuery, Redshift, S3, GCS, Databricks, dbt, and Kubernetes, there are maintained provider packages for all of them.

Production maturity. When Airflow runs at scale, it runs at serious scale. Companies running thousands of DAGs with millions of task instances per month are not unusual [1]. The scheduler has been battle-tested in production at organizations far larger than any startup reader of this article.

AI workloads. 30% of organizations surveyed are now using Airflow for AI initiatives, up 24% year-over-year [1]. Orchestrating ML training jobs, model evaluation pipelines, and data preprocessing workflows is a natural fit — the same DAG model applies.


Features

Based on the official website, documentation, and third-party sources:

Core orchestration:

  • DAGs defined in Python with full language support — loops, conditionals, functions, imports [4][5]
  • TaskFlow API for decorator-based task definition, introduced to reduce boilerplate [4]
  • Directed Acyclic Graph execution with dependency management [2][4]
  • Cron and event-driven scheduling (Airflow 3.0) [1][website]
  • DAG versioning (Airflow 3.0) [1]
  • Automatic retry with configurable backoff [2]
  • XCom for passing data between tasks [2] (see the sketch after this list)
  • Jinja templating for parameterization [website]
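
A sketch combining two features from the list above: Jinja templating in a templated field, and an XCom value passed between tasks. All names are illustrative:

```python
# Sketch of Jinja templating plus XCom. make_id's return value lands in
# XCom; the BashOperator pulls it back inside a templated field.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def compute_batch_id(**context) -> str:
    return f"batch-{context['ds']}"        # pushed to XCom automatically

with DAG(
    dag_id="templating_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    make_id = PythonOperator(task_id="make_id", python_callable=compute_batch_id)

    # {{ ds }} is the run's logical date; ti.xcom_pull reads the upstream value.
    announce = BashOperator(
        task_id="announce",
        bash_command="echo run {{ ds }} id {{ ti.xcom_pull(task_ids='make_id') }}",
    )

    make_id >> announce
```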

Monitoring and visibility:

  • Web UI for monitoring DAG runs, task states, and logs [2][4][website]
  • Grid view, graph view, and task instance history [website]
  • Task log streaming and historical log access
  • SLA monitoring and alerting
  • Backfill support for re-running historical periods [website]

Infrastructure:

  • Pluggable executor model: LocalExecutor (single machine), CeleryExecutor (distributed), KubernetesExecutor (k8s-native) [website]
  • Official Docker image and Docker Compose setup [website]
  • Official Helm chart for Kubernetes deployments [website]
  • Python API client for programmatic management [website]
  • airflowctl — new REST API-driven CLI that doesn’t require database access [website]
  • Task SDK for decoupled task authoring [website]

Integrations (provider packages):

  • AWS: S3, RDS, Redshift, EMR, MWAA, Lambda, Glue, and more
  • GCP: BigQuery, Cloud Storage, Dataflow, Composer, Cloud Functions
  • Azure: Data Factory, Blob Storage, Azure ML
  • Databases: PostgreSQL, MySQL, Snowflake, Databricks
  • Tools: dbt, Spark, Kubernetes, Docker, HTTP operators [4][website] (a usage sketch follows this list)
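
As a taste of the provider packages, here is a hedged sketch that writes a string to S3 with the Amazon provider's S3Hook. It assumes apache-airflow-providers-amazon is installed; the bucket, key layout, and connection id are invented:

```python
# Hedged provider sketch: writing a string to S3 via the Amazon provider's
# S3Hook. Bucket name, key layout, and connection id are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(**context) -> None:
    hook = S3Hook(aws_conn_id="aws_default")   # connection set up in UI or env
    hook.load_string(
        string_data=f"report for {context['ds']}",
        key=f"reports/{context['ds']}.txt",    # hypothetical key layout
        bucket_name="my-data-bucket",          # hypothetical bucket
        replace=True,
    )

with DAG(
    dag_id="s3_upload_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="upload_report", python_callable=upload_report)
```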

Pricing: SaaS vs self-hosted math

Self-hosted (open source):

  • Software license: $0 (Apache 2.0)
  • Minimum VPS to run it on: $20–40/mo (Airflow needs more resources than lightweight tools — the scheduler, webserver, worker, and database are separate processes)
  • For production with CeleryExecutor or KubernetesExecutor: $60–150/mo depending on workload

Managed Airflow (if you don’t want to operate it yourself):

  • AWS MWAA (Managed Workflows for Apache Airflow): Priced per environment-hour plus worker usage. A small environment runs approximately $0.49/hour = ~$354/month at minimum, plus worker costs on top. Large environments run $1.97/hour = ~$1,420/month [4].
  • Google Cloud Composer: $0.34–$0.55/hour per environment depending on tier, plus underlying GKE node costs. Budget $250–600/month for a typical small deployment [4].
  • Astronomer (commercial Airflow): Managed cloud offering with enterprise support. Pricing on request; targets larger data teams.

The math: If you’re currently on AWS MWAA at minimum ($350/mo), self-hosting on a $40/mo VPS saves roughly $3,700/year — but you absorb the operations overhead yourself. If you have a data engineer who knows Airflow, that trade-off is obvious. If you’re trying to set this up from scratch without experience, the managed services save you weeks of configuration pain.

There is no meaningful free tier in managed Airflow — you pay from the moment the environment is running, even if no DAGs are executing. This is a significant hidden cost for small teams that spin up MWAA “just to try it” and forget to shut it down.


Deployment reality check

This is where Airflow’s reputation takes the hit it deserves. The Datamation review [1] specifically calls out “complex setup for production” and “resource management tuning needed” as the top challenges. This is not marketing hedging — it’s an accurate description.

What you actually need for a production-ready self-hosted Airflow:

  • A server with at least 4GB RAM, preferably 8GB (the scheduler alone is memory-hungry)
  • PostgreSQL or MySQL for the metadata database (SQLite is only for local testing)
  • Redis if you use CeleryExecutor for distributed task execution
  • A reverse proxy (nginx or Caddy) with HTTPS
  • A domain name
  • Docker and docker-compose, or a Kubernetes cluster for the Helm chart path
  • SMTP setup for alerting
  • Persistent storage for logs

What the official Docker Compose quickstart actually gets you: A working local environment, not a production-ready deployment. The docker-compose.yaml in the official docs is labeled “for development” and will tell you so in the comments.

What can go sideways:

  • The scheduler can become a bottleneck if you have hundreds of DAGs. Tuning core.max_active_runs_per_dag, core.parallelism, and executor pool sizes is not optional at scale [1].
  • DAG parsing errors need careful handling. A syntax error in one Python file knocks out every DAG that file defines, and expensive code at module top level slows parsing across the board (see the sketch after this list).
  • The default SQLite backend will silently fail under concurrent load. Production requires PostgreSQL from day one.
  • Upgrading Airflow versions, especially major versions, requires database migrations that can take hours on large metadata databases.
  • Airflow 3.0 is a breaking change from 2.x with a new API and Task SDK — if you’re on 2.x and targeting 3.0, budget real migration time [1].
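
The standard defense against the parsing problem is keeping anything expensive or failure-prone out of module top level, since the scheduler imports every DAG file on each parse. A hedged sketch; the URL and requests dependency are invented:

```python
# Parse-safety sketch: the scheduler imports every DAG file repeatedly, so
# expensive or failure-prone work belongs inside task callables, not at
# module top level.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# BAD at top level: rows = requests.get("https://api.example.com").json()
# Any exception or slow call here hits every parse of this file.

def fetch_rows() -> None:
    # GOOD: network calls and heavy imports run only when the task runs.
    import requests                        # hypothetical dependency
    rows = requests.get("https://api.example.com/rows", timeout=30).json()
    print(f"fetched {len(rows)} rows")

with DAG(
    dag_id="parse_safe_demo",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_rows", python_callable=fetch_rows)
```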

Realistic time estimate for an experienced DevOps engineer: 4–8 hours for a production-ready Docker deployment with PostgreSQL, Redis, Celery workers, nginx, and SSL. For a data engineer who’s new to ops: 1–2 days including troubleshooting. For a non-technical founder: this is not a DIY project — budget for professional deployment help.


Pros and Cons

Pros

  • Industry standard. 44,726 stars, 500+ organizations, 320M downloads in 2024 [1][2]. The documentation is deep, the community is large, and every data infrastructure question has been asked and answered somewhere.
  • Code-first pipeline definition. Python DAGs version-control naturally, generate tasks dynamically, and compose with the full Python ecosystem [4][5]. This is the correct architecture for complex workflows.
  • Integration coverage. Provider packages for every major cloud service, database, and data tool [4][website]. You are unlikely to hit a service it can’t reach.
  • Apache 2.0 license. Genuinely permissive — no commercial restrictions, no “fair-code” caveats, no “you can self-host but not resell” clauses.
  • Production track record. Runs at scale in some of the largest data infrastructure deployments in the world [1][3]. The scheduler, retry logic, and monitoring have been hardened over a decade.
  • Airflow 3.0 improvements. Event-driven scheduling and the Task SDK address the two biggest architectural criticisms of older versions [1][website].
  • Managed service options. If you don’t want to operate it, AWS MWAA, Google Cloud Composer, and Astronomer all exist [4].
  • A 92% would-recommend rate among surveyed users [1].

Cons

  • Not for non-technical teams. There is no meaningful drag-and-drop interface. Workflows are Python files. This is a tool for engineers [1][4].
  • Complex production setup. The gap between “runs on my laptop” and “runs reliably in production” is wide and requires real infrastructure knowledge [1].
  • Resource-hungry. The scheduler, webserver, and workers are separate processes with real memory requirements. You need more server than you think.
  • Steep learning curve. DAG concepts, executor types, XCom, operators vs. sensors vs. hooks — there’s a substantial mental model to build before you’re productive [1].
  • DAG parsing is fragile. A Python import error in one file takes down every DAG that file defines, and one misbehaving file can degrade parsing for the rest.
  • Upgrade pain. Major version migrations are non-trivial, especially on large deployments. Airflow 3.0 is a meaningful breaking change from 2.x [1].
  • Overkill for simple use cases. If you need to run three scripts on a schedule with email on failure, a simpler tool (or even cron with a monitoring wrapper) will serve you better and cost less time.
  • Managed services are expensive. If you’re not self-hosting, AWS MWAA starts at ~$350/month — more than most SaaS alternatives — just for the environment [4].

Who should use this / who shouldn’t

Use Apache Airflow if:

  • You have a data engineering team (or a technically capable solo engineer) who will own the pipelines.
  • Your workflows involve multiple dependent steps, external systems, retry logic, and need observable history.
  • You’re orchestrating ETL pipelines, ML training jobs, data quality checks, or infrastructure automation at meaningful scale.
  • You want the widest possible provider coverage and a large community to draw on.
  • You need Python’s full expressiveness for dynamic pipeline generation — creating 100 parallel tasks based on a database table, for example.
  • You’re already in AWS or GCP and willing to use MWAA or Cloud Composer to avoid operations overhead.

Skip it if you’re a non-technical founder:

  • There is no path to “I’ll just click through a setup wizard.” This tool requires Python and infrastructure knowledge.
  • If you want Zapier-style automation for connecting SaaS apps, look at Activepieces or n8n.
  • If you want scheduled scripts with monitoring and you have one or two engineers, look at Prefect or Temporal as lower-overhead alternatives.

Skip it if your workflows are simple:

  • Running five scripts every night with email on failure doesn’t need Airflow’s scheduler architecture.
  • Simple scheduled tasks: use cron + a monitoring tool like Healthchecks.io.
  • Small data teams wanting Airflow’s workflow model with less ops overhead: consider Prefect (Cloud free tier exists, open-source core) or Dagster.

Skip it if setup cost is the constraint:

  • If you can’t budget 2–3 days of engineering time for initial deployment and ongoing maintenance, the managed alternatives (MWAA, Composer) become cheaper in practice — even at their higher dollar cost.

Alternatives worth considering

  • Prefect — Python-native, similar DAG model, dramatically simpler setup, generous free cloud tier. Preferred by teams that want Airflow’s philosophy without Airflow’s operational complexity. Less mature provider ecosystem.
  • Dagster — Asset-centric orchestration (think about data assets, not tasks). Excellent for data engineering teams invested in data lineage and observability. Steeper initial learning curve than Prefect.
  • n8n — Visual flow builder, open-source, self-hostable. Right tool if your team is not writing Python and you need a trigger-action automation model. Wrong tool for complex DAG-style data pipelines.
  • Temporal — Workflow orchestration for long-running business processes, not data pipelines. Better choice if you’re orchestrating application workflows (order processing, approval chains) rather than data jobs.
  • Kestra — YAML-based workflows, built-in UI, growing provider catalog. Closer to Airflow’s use case with a lower setup bar. Younger project, smaller community.
  • AWS MWAA / Google Cloud Composer / Microsoft Fabric — Managed Airflow. Pay a premium to not operate it yourself. Correct choice for teams that need Airflow compatibility without DevOps overhead [4].
  • Activepieces / Zapier — Not real alternatives for data engineering. If you’re comparing these to Airflow, you have a different use case than Airflow targets.

Bottom line

Apache Airflow is the right answer to a specific question: “How do I orchestrate complex, multi-step data workflows in production, with dependency management, retry logic, monitoring, and the full power of Python?” For that question, nothing else has its combination of maturity, community, and integration coverage. 320 million downloads in a single year and a 92% recommendation rate from a user base that openly acknowledges the complexity — that’s not marketing, that’s a tool that earns its keep [1].

The honest caveat: Airflow is not for non-technical founders, not for simple automation tasks, and not for teams that want a working setup in an afternoon. It is an infrastructure tool with an infrastructure tool’s setup demands. If you have a data engineer who’ll own it, the cost-per-workflow on a self-hosted instance is effectively zero, and the operational leverage is enormous. If you don’t, one of the managed services — or a simpler alternative like Prefect — will serve you better.

For teams already paying for AWS MWAA or Google Cloud Composer and willing to absorb the ops work themselves, self-hosting saves real money ($3,000–$15,000/year depending on environment size) for a one-time investment of a week of setup time.


Sources

  1. Datamation, “Apache Airflow Review”. https://www.datamation.com/applications/apache-airflow-review/
  2. facts.dev, “Apache Airflow project details”. https://www.facts.dev/p/apache-airflow/
  3. SD Times, “Apache Airflow is now a Top-Level Project”. https://sdtimes.com/data/apache-airflow-is-now-a-top-level-project/
  4. Rost Glukhov / glukhov.org, “Apache Airflow for MLOps and ETL — Description, Benefits and Examples”. https://www.glukhov.org/post/2025/06/apache-airflow/
  5. Rost Glukhov / glukhov.org, “Apache Airflow integrations reference”. https://www.glukhov.org/data-infrastructure/integrations/apache-airflow/
