How to Archive the Original HTML of a Webpage (Not Just the Text)

GYevhen · 8 min read

There's a real difference between saving an article and archiving a webpage's HTML. Most "save" tools hand you the readable text: the words, maybe the images, with the clutter stripped out. That's perfect for reading. Sometimes, though, you need the actual thing, the original markup and structure that a developer or archivist cares about. Maybe you're preserving evidence, debugging a scraper, or keeping a faithful record of how a page really looked under the hood.

This guide is about that second job: capturing the original HTML, not a cleaned-up version of it.

Extracted Text vs. Original HTML

A quick mental model before we start.

Extracted content is what reader views and read-later apps produce. They run the page through a parser, drop the nav bars, ads, and scripts, then keep the article. Clean and searchable, but not the original source.

Original HTML is the raw markup your browser actually worked with: every tag, attribute, inline style, and embedded resource reference.

If you just need to read it later, extracted text is great. If you need fidelity for forensics, scraping, or a faithful archive, you want the original HTML. Match the tool to the job.
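To make the distinction concrete, here is a minimal sketch using only Python's standard library. The sample page and the `TextOnly` class are invented for illustration, and real reader views use far smarter heuristics than "keep whatever is inside `<article>`". It only shows that the tags, attributes, and scripts do not survive extraction.

```python
from html.parser import HTMLParser

# Hypothetical sample page, standing in for a real download.
ORIGINAL_HTML = """<html><head><title>Demo</title>
<script>track("pageview")</script></head>
<body><nav><a href="/">Home</a></nav>
<article class="post" data-id="42"><h1>Hello</h1><p>Body text.</p></article>
</body></html>"""


class TextOnly(HTMLParser):
    """Crude stand-in for a reader view: keep text inside <article> only."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        # Keep non-empty text, but only while inside the article element.
        if self.in_article and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


extracted = extract_text(ORIGINAL_HTML)
print(extracted)                       # Hello Body text.
print("data-id" in ORIGINAL_HTML)      # True: the original keeps attributes
print("data-id" in extracted)          # False: the extraction drops them
```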

Method 1: Browser "Save As" — Complete vs. HTML Only

The built-in option. Press Ctrl/Cmd + S and you'll usually see two relevant choices. "Webpage, HTML Only" saves just the .html document, the raw markup with no images or CSS. It's tiny, but the page looks unstyled when you reopen it. "Webpage, Complete" saves the .html plus a sibling folder of images, stylesheets, and scripts so it renders properly offline.
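If you're curious what an "HTML Only" save leaves behind, a rough sketch like the one below lists the sub-resources a page points at. The `AssetRefs` class, the `find_asset_refs` helper, and the sample markup are all hypothetical, and it only looks at images, scripts, and stylesheets, so treat it as a quick check and not a complete inventory.

```python
from html.parser import HTMLParser


class AssetRefs(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets a page references."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.refs.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # rel can hold several space-separated tokens.
            rel_tokens = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel_tokens:
                self.refs.append(attrs["href"])


def find_asset_refs(html: str) -> list:
    parser = AssetRefs()
    parser.feed(html)
    parser.close()
    return parser.refs


# Hypothetical markup standing in for a saved .html file.
SAMPLE = """<head><link rel="stylesheet" href="/css/site.css"></head>
<body><img src="/img/hero.png" alt="">
<script src="/js/app.js"></script>
<a href="/about">About</a></body>"""

for ref in find_asset_refs(SAMPLE):
    print(ref)
```

Every URL it prints is something the "HTML Only" file still needs from the network (or from a sibling folder) to render as it did originally. Ordinary links such as `/about` are ignored on purpose.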

Note: "Save As" captures the HTML after your browser has loaded it, so it usually includes JavaScript-rendered content. But it saves the rendered DOM, not necessarily the exact bytes the server sent, and the details vary by browser and save option. That's fine for most archiving. For true source fidelity, see wget below.
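If you want to know whether a saved copy really matches what the server sent, comparing checksums is one simple option. This is a sketch under assumptions: the file names are placeholders, and the two byte strings stand in for a raw download and a browser-saved file that you would normally already have on disk.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True only if both files are byte-for-byte identical."""
    return sha256_of(a) == sha256_of(b)


with tempfile.TemporaryDirectory() as tmp:
    server_copy = Path(tmp, "server.html")   # stand-in for a raw download
    browser_copy = Path(tmp, "saved.html")   # stand-in for a "Save As" file
    server_copy.write_bytes(b'<div id="app"></div>')
    browser_copy.write_bytes(b'<div id="app"><p>Rendered by JS</p></div>')
    identical = same_bytes(server_copy, browser_copy)

print(identical)  # False: the rendered DOM differs from the served bytes
```

A mismatch doesn't mean the save is bad. It just tells you the file is a rendered snapshot and not the original response.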

Method 2: SingleFile — original page, one portable file

SingleFile is a free browser extension that captures the fully rendered page and inlines every asset (images, fonts, CSS) into a single self-contained .html file. No fragile assets folder, no broken links when you move it. It's the most convenient way to keep a faithful, openable copy of a page as it appeared. I reach for it whenever I want to archive a specific page on the spot: click the icon, get one file, done.

Method 3: wget — for the real source bytes (and whole sites)

If you want the HTML as the server actually sent it, before JavaScript runs, command-line wget is the classic tool:

# Save one page with everything needed to display it offline
wget -p -k https://example.com/article

# Mirror a whole section of a site
wget --mirror -p -k -np https://example.com/docs/

-p grabs page requisites (images, CSS, JS), -k rewrites links so the copy works offline, --mirror recurses, and -np keeps it from climbing to parent directories.

The tradeoff: wget fetches the raw server response, so pages that build themselves with client-side JavaScript may come down nearly empty. For those, SingleFile or "Save As" capture the rendered page instead.
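A quick way to spot that "nearly empty" case is to count how much visible text the downloaded file contains. The sketch below is a heuristic, not a definitive test: the `looks_like_js_shell` name and the 200-character threshold are assumptions I picked for illustration, and you'd want to tune them for the sites you archive.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Count non-whitespace characters outside script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chars += len("".join(data.split()))


def visible_char_count(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chars


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """Guess whether a page has too little text to be the real content."""
    return visible_char_count(html) < min_chars


# Hypothetical examples: a client-rendered shell and a server-rendered page.
shell_page = (
    '<html><head><title>App</title><script src="/bundle.js"></script></head>'
    '<body><div id="root"></div>'
    "<noscript>You need to enable JavaScript to run this app.</noscript>"
    "</body></html>"
)
full_page = "<html><body><p>" + "word " * 100 + "</p></body></html>"

print(looks_like_js_shell(shell_page))  # True
print(looks_like_js_shell(full_page))   # False
```

If a wget download trips this check, that's a hint to re-capture the page with SingleFile or "Save As" instead.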

Method 4: WARC — the archivist's standard

For serious, long-term preservation there's WARC (Web ARChive), the format the Internet Archive uses. A WARC file stores the full HTTP request and response, headers, timestamps, and every resource, so a page can be replayed faithfully years later. Tools like wget --warc-file=..., Webrecorder/ArchiveWeb.page, and wpull produce them.
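To give a feel for what "the full HTTP request and response" means on disk, here is a simplified illustration of a single WARC response record: a block of named header fields, a blank line, then the captured HTTP response. The helper names and the sample response are hypothetical, and this is not a spec-complete writer (no request records, digests, or compression), so use one of the tools above for real archives.

```python
import uuid
from datetime import datetime, timezone


def build_response_record(url: str, http_response: bytes) -> bytes:
    """Wrap a captured HTTP response in a minimal WARC 'response' record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + http_response + b"\r\n\r\n"


def parse_record(record: bytes):
    """Split a single record back into its header fields and payload."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# Hypothetical captured response: status line, headers, then the HTML body.
sample_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hi</body></html>"
)

record = build_response_record("https://example.com/article", sample_response)
fields, payload = parse_record(record)
print(fields["WARC-Type"], fields["WARC-Target-URI"])
print(payload == sample_response)  # True
```

The point is that the HTTP headers and a capture timestamp travel with the HTML, which is what lets replay tools reconstruct the page and lets you show when it was fetched.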

WARC is overkill for "I want to keep this one article." It earns its weight when fidelity and provenance matter: legal evidence, research datasets, or institutional archives where you need to prove exactly what was served and when.

Pick Your Tool

| Tool | What you get | Renders JS | Best for |
| --- | --- | --- | --- |
| Save As (HTML Only) | Raw markup, no assets | Yes (DOM) | A quick look at the source |
| Save As (Complete) | HTML + assets folder | Yes (DOM) | Offline viewing |
| SingleFile | One self-contained .html | Yes (DOM) | Portable faithful copies |
| wget | Server-sent HTML / mirrors | No | Source bytes, whole sites |
| WARC tools | Full HTTP exchange | Depends | Preservation, evidence |

Where Gleamr Fits

Gleamr's role here is narrow, because it's a read-later app and not an HTML archiver. Gleamr saves the extracted article content (clean text and media, fully searchable), which is the right call for reading and finding things later. It also preserves the original page HTML when it's available, powering an "Original View" so you can fall back to the source layout.

It is not a full-fidelity WARC archiver. It won't store the complete HTTP exchange or guarantee byte-for-byte source for every page. If raw-HTML fidelity is your goal, use SingleFile, wget, or a WARC tool. If you want a clean, searchable library you'll actually use day to day, that's the read-later app's lane.

Frequently Asked Questions

What's the difference between saving a page's text and archiving its HTML?

Saving the text (what reader views and read-later apps do) keeps the readable article but discards the original markup, scripts, and structure. Archiving the HTML keeps the raw source, every tag and attribute, which matters for developers, scrapers, and faithful preservation.

How do I save the original HTML of a webpage?

Use SingleFile to capture the rendered page as one self-contained .html file, or run wget -p -k <url> on the command line to download the server's HTML plus its assets. For the truest source, wget fetches the raw response before JavaScript runs.

Why does my saved HTML look empty or broken?

Two common causes. Either the page builds its content with JavaScript, so a raw wget gets an empty shell (use SingleFile or "Save As" instead), or you saved "HTML Only" without the asset folder, so styling and images are missing. Use "Webpage, Complete" or SingleFile to keep everything together.

What is a WARC file?

WARC (Web ARChive) is a standard format that stores the full HTTP request and response, including headers, timestamps, and every resource, so a page can be faithfully replayed later. It's what major web archives use and is the gold standard when provenance matters.


Need raw HTML? Use SingleFile or wget. Need a clean, searchable reading library instead? That's Gleamr: 10 free articles, full-text search, and JSON export to start. See also how to permanently save a webpage.
