Webarchive
Self-hosted archiving & preservation tool that provides a lightweight _wayback machine_, creating HTML and PDF files from your bookmarks.
Self-hosted web archiving, honestly reviewed. No marketing fluff, just what you get when you run it yourself.
TL;DR
- What it is: A lightweight, open-source (BSD-3-Clause) personal web archiver written in Go — save web pages as PDF, single-file HTML, or raw headers [README].
- Who it’s for: Developers or technically-inclined solo users who want the absolute smallest footprint for personal web archiving and are comfortable with a REST API and minimal UI.
- Cost savings: Tools like ArchiveBox have managed hosting starting at €49/month [5]. Webarchive self-hosts on any server that can run a Go binary or Docker container — cost is whatever your VPS costs.
- Key strength: Dead-simple architecture, no database server dependency by default, REST API out of the box, one-click deploys available for AWS/DigitalOcean/Render [README].
- Key weakness: 188 GitHub stars, no auth, no multi-user support, no tags, and a dependency on wkhtmltopdf, a project that’s been in maintenance mode for years. This is early-stage software with a long roadmap ahead of it.
What is Webarchive
Webarchive (GitHub: derfenix/webarchive) describes itself plainly as “own webarchive service — aimed to be a simple, fast and easy-to-use webarchive for personal or home-net usage” [README]. That framing is accurate and refreshingly honest — it’s not trying to be ArchiveBox, it’s not trying to index the internet. It’s a Go binary that accepts URLs via a REST API and saves them in one or more formats: a PDF render, a single-file HTML snapshot, or raw HTTP response headers.
The project is BSD-3-Clause licensed, which is about as permissive as it gets: you can run it, modify it, and embed it in commercial software, provided you keep the copyright notice and don’t use the authors’ names to endorse derived products [README].
On GitHub it sits at 188 stars. That puts it firmly in the “indie side project” tier — not a dead project, but also not a mature community tool. There’s no website, no documentation site, no user forum. The README is the documentation.
The core value proposition: if you’ve ever lost a webpage you needed because the site went down or moved, and you want a simple self-hosted way to never have that problem again, Webarchive is one answer — particularly if you prefer Go’s lightweight runtime over Python-heavy alternatives.
Why people choose it
There’s no direct third-party review of derfenix/webarchive specifically — at 188 stars it hasn’t attracted the blogger attention that ArchiveBox (19,100+ stars [5]) or Hoarder have. But the Reddit thread that drives people to this category is instructive: users want “something where it will save a snapshot of the website (including those that require logins) + ideally support social media like Twitter and YouTube” [4].
Webarchive addresses the first half of that — it saves page snapshots — and doesn’t address the second half at all. That’s the honest starting point for evaluating it.
People gravitating toward Webarchive specifically tend to cite one or more of these reasons:
Minimal dependency surface. ArchiveBox requires Python 3 and a dozen apt packages. Hoarder spins up three containers including a headless Chrome instance [1]. Webarchive is a Go binary. If you want a small, auditable codebase you can actually read, Go is a better story than a Python monolith.
REST-first design. The primary interface is an HTTP API. You POST a URL with format preferences, get back an ID, and later GET the result. This makes Webarchive straightforward to integrate into scripts, browser extensions, or other workflows without touching a UI [README].
No external database required by default. The default storage is file-based (DB_PATH environment variable). You don’t need to run a Postgres or Redis sidecar to get started, unlike Hoarder which requires both [1].
The alternatives feel like overkill for simple use. The SOSSE review [3] notes RAM usage spiking to 5.5GB during crawling, idle usage around 2.8GB — that’s a meaningful server footprint for what is ultimately “save this webpage.” For personal use with occasional archiving needs, that’s a lot. Webarchive targets the user who doesn’t need a search engine, just a snapshot.
Features
Based on the README, here’s what Webarchive actually does today:
Archive formats:
- PDF: full-page render using wkhtmltopdf, with configurable orientation (landscape/portrait), grayscale, DPI (default 150), viewport (default 1280×720), zoom, and print vs. screen media type [README].
- Single-file HTML: saves the page plus all its embedded resources (CSS, JS, images) as one self-contained .html file [README].
- Headers: saves all HTTP response headers from the URL [README].
You can request multiple formats in a single archive job, so one POST can produce both a PDF and a headers dump simultaneously [README].
API:
- POST /api/v1/pages (submit a URL with desired formats)
- GET /api/v1/pages/:id (check status and get file IDs once processing completes)
- GET /api/v1/pages/:id/file/:file_id (retrieve the archived file)
- GET /api/v1/pages (list all stored pages) [README]
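The endpoints above can be exercised from a few lines of Python. This is a minimal sketch, not the project's own client: the base URL assumes a local instance on port 5001, and the JSON field names "url" and "formats" (along with format values such as "pdf" and "headers") are assumptions to be checked against the README before use.

```python
import json
import urllib.request

# Assumption: a local instance listening on port 5001; adjust for your deployment.
BASE_URL = "http://localhost:5001/api/v1"


def build_submit_request(url, formats):
    """Build the POST /api/v1/pages request for a new archive job.

    The JSON field names "url" and "formats" are assumptions, not confirmed API.
    """
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/pages",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def page_url(page_id):
    """URL for GET /api/v1/pages/:id (status and file IDs)."""
    return f"{BASE_URL}/pages/{page_id}"


def file_url(page_id, file_id):
    """URL for GET /api/v1/pages/:id/file/:file_id (the archived file)."""
    return f"{BASE_URL}/pages/{page_id}/file/{file_id}"


def send(request):
    """Send a request and decode the body, assuming the API replies with JSON."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request; call send(req) against a running instance.
    req = build_submit_request("https://example.com", ["pdf", "headers"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running server; a real script would call send() on the submit request, then poll page_url() until the job reports its file IDs.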
Web UI:
There is a built-in UI (UI_ENABLED=true by default) with a configurable URL prefix and theme [README]. The README lists “Basic web UI” as unchecked in the roadmap, suggesting the current UI is functional but unfinished.
Deployment:
- Docker Compose: docker compose up -d webarchive
- Direct Go build: go run ./cmd/server/main.go
- One-click deploy buttons for AWS (CloudFormation), DigitalOcean (App Platform), and Render via DeployStack.io [README]
What’s notably missing (from the roadmap):
- Authentication of any kind — the API and UI are wide open
- Multi-user support
- Tags or categories
- SQL database backend (currently file-based only)
- Markdown export
- Saving page to HTML with separate resource files (separate from single-file mode) [README]
Pricing: SaaS vs self-hosted math
There’s no commercial offering from derfenix/webarchive. The software is free and self-hosted.
The relevant pricing comparison is against the category of managed web archiving:
Managed ArchiveBox (Stellar Hosted) [5]:
- Standard: €49/month — 10GB storage, unlimited pages/users
- Premium: €99/month — 100GB storage, SSO
- Custom: from €149/month with unlimited storage (€25 per 100GB add-on)
Self-hosting ArchiveBox:
- Free software + ~$5–10/mo VPS
Self-hosting Webarchive:
- Free software + any server that runs Docker or a Go binary
- Storage is file-based and local, so the main ongoing cost is disk space
If you’re currently paying for a managed archiving service or leaning toward paying €49/month for hosted ArchiveBox, self-hosting either tool eliminates that bill. Webarchive’s lighter footprint means it should run comfortably on a $5 Hetzner VPS or even a Raspberry Pi on your home network; the latter fits its stated “personal or home-net usage” positioning [README].
The honest savings math: €49–99/month managed versus $5–10/month for a VPS works out to roughly 470–1,130 per year saved, treating euros and dollars as roughly interchangeable. That’s the ceiling. In practice most personal users don’t need managed archiving at all, so the real comparison is “free and self-hosted” versus “free and self-hosted with a more mature tool.”
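The range can be checked with a few lines of arithmetic. The sketch below uses the plan and VPS prices quoted above and, as a simplifying assumption, ignores the euro/dollar difference; the function name is illustrative only.

```python
def annual_savings(managed_monthly, vps_monthly):
    """Yearly savings from replacing a managed plan with a VPS (same currency assumed)."""
    return (managed_monthly - vps_monthly) * 12


# Cheapest managed plan against the priciest VPS gives the low end;
# priciest listed plan against the cheapest VPS gives the high end.
low = annual_savings(managed_monthly=49, vps_monthly=10)
high = annual_savings(managed_monthly=99, vps_monthly=5)
print(f"Estimated savings: {low} to {high} per year")
# prints "Estimated savings: 468 to 1128 per year"
```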
Deployment reality check
What you need:
- A machine running Linux, macOS, or anything Go compiles for
- Docker (if using the compose path)
- wkhtmltopdf installed and in $PATH if you want PDF output; without it, PDF archiving fails silently
- A reverse proxy (Caddy, nginx) if you expose it externally
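Because a missing wkhtmltopdf reportedly fails without any error, a quick preflight check before relying on PDF output can save debugging time. The following is a sketch using only the Python standard library; the function names are hypothetical, and the check has to run in the same environment as Webarchive itself (inside the container, if you use Docker).

```python
import shutil


def find_binary(name="wkhtmltopdf"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def preflight(name="wkhtmltopdf"):
    """Return a human-readable status line for the PDF dependency."""
    path = find_binary(name)
    if path is None:
        return f"{name} not found on PATH: PDF archiving will not work"
    return f"{name} found at {path}"


if __name__ == "__main__":
    print(preflight())
```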
The wkhtmltopdf problem. This is the biggest operational concern with Webarchive. wkhtmltopdf uses a patched Qt WebKit and has been in maintenance mode since 2020, with the project explicitly noting it’s no longer actively developed. On modern Linux distributions, installing it requires either an old .deb package from GitHub releases or a workaround involving libssl compatibility shims. It’s not in most distribution package managers in a current form. If PDF is your primary reason to use Webarchive, this dependency is a real friction point that won’t get easier over time.
The API is unauthenticated. The roadmap explicitly lists “Optional authentication” as a future item [README]. Today there is none. The README’s positioning as “personal or home-net usage” signals you’re not meant to expose this to the public internet. If you’re running it on a VPS behind a firewall with port 5001 locked down, that’s fine. If you expose it publicly without a reverse proxy implementing auth, anyone can submit archive jobs to your server.
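To make concrete what an external auth layer has to do, here is a minimal sketch of an HTTP Basic credential check of the kind a reverse proxy performs before forwarding a request. It is illustrative only and not part of Webarchive; the function name and credentials are hypothetical, and in practice you would rely on the built-in auth features of Caddy or nginx rather than hand-rolled code.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Check an HTTP Basic Authorization header against one expected credential pair."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes (both are ValueError subclasses).
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking match position through timing differences.
    user_ok = hmac.compare_digest(user.encode("utf-8"), expected_user.encode("utf-8"))
    password_ok = hmac.compare_digest(
        password.encode("utf-8"), expected_password.encode("utf-8")
    )
    return user_ok and password_ok
```

Requests that fail this check should be rejected with a 401 before they ever reach port 5001.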
Setup time estimate: 15–30 minutes for a developer comfortable with Docker. The compose file and environment variable configuration are simple [README]. The friction is all in getting wkhtmltopdf installed correctly on your host, which can add another 30–60 minutes if you hit compatibility issues.
One-click deploy reality check: The DeployStack integrations for AWS/DigitalOcean/Render [README] lower the barrier for non-developers, but these platforms add cost — DigitalOcean App Platform starts at $5–12/month even for a small container — and you still need wkhtmltopdf available in the container environment. The one-click buttons are a nice touch but won’t fully automate a working PDF-capable deployment for a non-technical user.
Pros and Cons
Pros
- BSD-3-Clause license — as permissive as open-source gets. Self-host, fork, embed in your product, no lawyers needed [README].
- Written in Go — single binary, small memory footprint, easy to compile and audit. No Python dependency hell.
- REST API first: clean integration point for automation. If you’re building a personal tool that saves pages programmatically, you curl to save and curl to retrieve [README].
- Multiple output formats in one request: submit a URL once, get PDF + single-file HTML + headers [README].
- No external database by default — file-based storage means no Postgres/Redis sidecar to manage [README].
- One-click deploy paths for AWS, DigitalOcean, Render via DeployStack.io [README].
- Configurable PDF rendering — DPI, zoom, viewport, grayscale, orientation — more render control than some competing tools [README].
Cons
- 188 GitHub stars, no website, no documentation site — this is a small indie project. Longevity risk is real. If the maintainer moves on, there’s no community to fork-forward.
- wkhtmltopdf dependency: deprecated upstream, increasingly hard to install on modern Linux. PDF support is load-bearing for many use cases, and this dependency is technical debt you inherit [README].
- Zero authentication: no API keys, no passwords, no OAuth. Listed as a future roadmap item [README]. Not production-safe without an external auth layer.
- No full-text search — you can list pages and retrieve files, but you can’t search the archived content. Hoarder offers full-text search using MeiliSearch [1][2], which Webarchive has no equivalent of.
- No AI tagging or summarization — Hoarder integrates with OpenAI/local LLMs to auto-tag and summarize archived content [1][2]. Webarchive is a pure archival tool with no intelligence layer.
- No social media / login-gated archiving — can’t archive pages that require authentication. The Reddit thread [4] identifies this as a top user need; Webarchive doesn’t address it. Hoarder handles it via the SingleFile browser extension [2].
- UI is minimal/unfinished — “Basic web UI” is listed as incomplete in the roadmap [README]. The configuration options suggest it exists but expect it to be spartan.
- No mobile support: Hoarder has an Android app [2]. Webarchive offers only its REST API and the basic web UI.
- Single-user only — multi-user access is a roadmap item, not a current feature [README].
Who should use this / who shouldn’t
Use Webarchive if:
- You’re a developer who wants a minimal REST API for programmatic web archiving and doesn’t need search, auth, or AI features.
- You want to run something on a home server or Raspberry Pi with the smallest possible footprint.
- You primarily need page snapshots in PDF or single-file HTML and don’t need JavaScript-heavy pages handled perfectly.
- You’re evaluating the Go codebase as a base for building your own archiving tool.
- You want BSD-3 licensing with no commercial strings.
Skip it (use Hoarder instead) if:
- You want full-text search across your archive [1][2].
- You want auto-tagging and AI-generated summaries [1][2].
- You want a mobile app to capture links on the go [2].
- You need to archive login-gated or paywalled content via SingleFile integration [2].
- You want a finished, polished product rather than a work-in-progress.
Skip it (use ArchiveBox instead) if:
- You need to bulk-import browser history, Pocket exports, or Pinboard bookmarks [5].
- You want a mature tool with a large community (19,100+ stars vs. 188 [5][README]).
- You need multiple archive backends — ArchiveBox saves WARC, PDF, HTML, screenshots, and more.
- Your organization needs a managed hosted option with SLAs [5].
Skip it (use SOSSE instead) if:
- You need recurring crawls and scheduled archiving with change detection [3].
- You want full-text search over dynamically rendered JavaScript pages [3].
- You have the server resources to spare (SOSSE idles at ~2.8GB RAM [3]) and need enterprise-like crawl management.
Alternatives worth considering
The self-hosted web archiving space has several more mature options:
- ArchiveBox — the category leader at 19,100+ stars [5]. Accepts URLs, browser history, Pocket/Pinboard exports. Saves WARC, PDF, HTML, screenshots, and more. Managed hosting available from €49/month [5]. Python-based, heavier footprint. Actively maintained with a real community. This is what most users should default to.
- Hoarder — fastest-growing option in the space [1][2]. AI auto-tagging, full-text search, SingleFile browser integration for login-gated content, Android app, three-container Docker setup [1]. Aimed at the “bookmark everything” use case. Best choice if you want intelligence features.
- SOSSE — Selenium-based, handles JavaScript-heavy pages, scheduled recurring crawls, generates Atom feeds [3]. Heavier (2.8GB RAM idle [3]) but more capable for systematic crawling. Best for users who want to monitor sites for changes.
- Wallabag — the veteran read-later/archive tool. Simpler than ArchiveBox, focused on article reading and archiving rather than full-page snapshots. Has mobile apps. More mature than Webarchive, but narrower in scope.
- SingleFile — technically a browser extension, not a server. Saves full-page HTML directly from your browser including login-gated content [2]. Works well combined with a server-side tool for storage.
For a non-technical founder evaluating options: Hoarder or ArchiveBox are the practical choices. Webarchive’s lack of auth, limited UI, and small community make it a developer experiment rather than a production recommendation right now.
Bottom line
Webarchive is an honest, minimal tool that does exactly what it says: save a URL to PDF, single-file HTML, and/or raw headers via a REST API. The BSD-3 license, Go codebase, and zero-external-database default make it appealing if you want a lightweight foundation you can build on or audit yourself. But it’s 188 stars for a reason — no auth, no search, no AI, no mobile, a deprecated PDF dependency, and a roadmap that acknowledges the UI isn’t finished yet. If you want personal web archiving that works today without engineering it yourself, Hoarder or ArchiveBox will serve you better. Webarchive is a project to bookmark on GitHub and revisit in a year, not one to bet your archiving workflow on today.
Sources
1. Roo’s View, “Hoarder – a self hosted link collection and web archive” (2024). https://lowtek.ca/roo/2024/hoarder-a-self-hosted-link-collection-and-web-archive/
2. James Ravenscroft, Brainsteam, “Building a Personal Archive With Hoarder” (February 15, 2025). https://brainsteam.co.uk/2025/2/15/personal-archive-hoarder/
3. noted.lol, “SOSSE: Open-Source, Self-Hosted Digital Archiving & Search Engine Solution”. https://noted.lol/sosse/
4. r/selfhosted, “Is there a self-hosted alternative to WebArchive?” (Reddit, 2024). https://www.reddit.com/r/selfhosted/comments/1gnusy8/is_there_a_selfhosted_alternative_to_webarchive/
5. Stellar Hosted, “ArchiveBox hosting - Managed, fast and secure”. https://www.stellarhosted.com/archivebox/
Primary sources:
- GitHub repository and README: https://github.com/derfenix/webarchive (188 stars, BSD-3-Clause license)