How To Extract Your Website’s URLs from Archive.org (Wayback Machine)


There are occasions where a client may come to you following a CMS or domain migration which has resulted in a ranking or traffic loss. This can be a difficult situation to remedy when you are unable to find any previous sitemap.xml files or older Screaming Frog crawls.

If some pages had high traffic, sales, or lead-generation value, they may be lost altogether. If some pages had a large number of inbound links, the value of those links (measured in PageRank, link equity, Trust Flow, and so on) would be lost entirely too.

Without full knowledge of the website’s former site structure and the URLs within it, there could be a lot of value lost to dead-end 404 pages.

Having run into the same situation ourselves recently, we had to figure out a solution (with a large helping hand from Liam Delahunty. Thanks, Liam!) which we’d now like to pass on to you.

Using Archive.org Data

Archive.org, or the Wayback Machine as it’s more commonly known, is a web crawler and indexing system that archives the internet’s web pages for historical reference. It’s a cool tool which allows us to take a peek at what Google looked like when it was still in Beta back in 1998, for example.

As it crawls a large percentage of the internet, it’s highly likely that your website has been crawled by its web crawler. By retrieving this publicly available data we can piece together a rough idea of what the pre-migration website’s site structure may have been.

The data is freely available to use, and Archive.org provides a brief outline of how the API may be accessed and used here.

Not being an API-wielding specialist myself, in the following process I’ll be falling back on a classic copy-and-paste approach which search engine optimisation specialists of any skill level can use.

Example txt log file on archive.org

How To Extract Old URLs from Archive.org

1. Locate your website’s JSON or TXT file

Start by navigating to the following URL, replacing the example.com placeholder with your own website’s root domain.

For JSON format:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=json

For TXT format:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt

If you need to limit the time frame of the crawl, you can add from= and to= parameters to the end of the URL to narrow the range. Both accept timestamps in the following format, and a partial value such as just the year also works.

yyyyMMddhhmmss

Example:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&from=2010&to=2018

You can also add a limit parameter and decrease or increase its value to match your needs.

Example:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&limit=999999

You can find a full rundown of the available filtering options here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#filtering
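
If you would rather script this step, a minimal Python sketch along these lines pulls the same TXT output together with the parameters above (example.com and the output file name, archive_org_rows.txt, are placeholders to swap for your own):

    import requests  # third-party HTTP library: pip install requests

    # Placeholder domain: swap example.com for your own root domain.
    params = {
        "url": "example.com*",   # the trailing * matches every captured URL under the domain
        "output": "txt",
        "from": "2010",          # optional: yyyyMMddhhmmss timestamps, partial values allowed
        "to": "2018",
        "limit": "999999",
    }

    response = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
    response.raise_for_status()

    # Save the raw, space-delimited rows so they can be pasted into a spreadsheet
    # or cleaned up in the later steps.
    with open("archive_org_rows.txt", "w") as f:
        f.write(response.text)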

2. Paste into your spreadsheet and separate into columns

Copy the entire text of the loaded page and paste the results into a spreadsheet. In this instance, we’re using Google Sheets.

Select the entire range of data and use the “Split text into columns…” option in the “Data” menu of the toolbar. As we’re using the TXT format, choose the “space” delimiter to separate the data.

Extracted data from Archive.org split into columns
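
If you would rather do this step in code than in Sheets, the equivalent of the split is to break each space-delimited row into its fields. A minimal sketch, assuming the archive_org_rows.txt file saved in the step 1 sketch:

    # Each CDX row is one space-delimited line; splitting it mirrors the
    # "Split text into columns..." step in Google Sheets.
    with open("archive_org_rows.txt") as f:
        rows = [line.split() for line in f if line.strip()]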

3. Remove columns leaving only the URLs

Delete all of the unneeded columns to leave only the URLs. The URLs will usually be in Column C (the CDX “original” field).
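
The scripted equivalent, continuing from the rows list in the sketch above, keeps only the third field of each row:

    # Column C of the spreadsheet is the third CDX field, the originally captured URL.
    urls = [row[2] for row in rows if len(row) > 2]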

4. Use Find and Replace to remove :80 from URLs

Select the column of URLs and use the “Find and Replace” function to locate the text “:80” and replace it with nothing (leave the replacement text box empty). This will tidy up all of the URLs, sometimes removing tens of thousands of instances of “:80”, the explicit default HTTP port recorded in many captures.

Find and replace used on archive.org data
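
In code, the same tidy-up is a plain string replacement, continuing from the urls list above and mirroring the Find and Replace step:

    # Remove the explicit ":80" default-port suffix, just as the spreadsheet
    # Find and Replace does.
    urls = [u.replace(":80", "") for u in urls]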

5. Use =UNIQUE formula to remove duplicates

In a separate column, use the UNIQUE formula, for example =UNIQUE(A:A), to remove the duplicates from the first column, leaving only unique URLs to check for 3XX, 4XX, and 5XX status codes.

Using the unique formula in Google Sheets to clean the archive.org data
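
The scripted counterpart of =UNIQUE, continuing from the same urls list, de-duplicates while keeping the original order:

    # dict.fromkeys() preserves insertion order, so this is the scripted
    # equivalent of =UNIQUE(A:A).
    unique_urls = list(dict.fromkeys(urls))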

6. Crawl URLs using Screaming Frog and extract report for review

Copy your final list of URLs, open Screaming Frog and switch it to List mode, then paste in your gathered URLs.

Export your completed crawl as a CSV and copy/paste the data into another tab of your spreadsheet. At this point, you can either remove all columns except for the URL and Status Code columns, or you can use a VLOOKUP to populate the corresponding status codes for your original list.

You can now filter this complete list of URLs to find 404 pages or redirect chains.

A spreadsheet showing hundreds of redirects and 404 pages
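
If you do not have Screaming Frog to hand, a rough scripted alternative (not part of the original process) is to request each URL and record its status code. This is only a sketch: it assumes the unique_urls list from step 5, reports 3XX responses rather than following them, and ignores the crawl-politeness settings a proper crawler would give you.

    import csv
    import requests  # pip install requests

    results = []
    for url in unique_urls:  # unique_urls: the de-duplicated list from step 5
        try:
            # allow_redirects=False so 3XX responses are reported, not followed.
            status = requests.head(url, allow_redirects=False, timeout=10).status_code
        except requests.RequestException:
            status = "error"
        results.append((url, status))

    # Write a two-column CSV that can be filtered in a spreadsheet,
    # like the Screaming Frog export.
    with open("status_codes.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status Code"])
        writer.writerows(results)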

Other advantages and tips

This process can be enhanced further by gathering URLs from Google Analytics, going back as far as you can, and making sure to check any former URLs which may have been high-traffic or high-converting sales pages in the past.

Taking it another step further, you can find more URLs via other web crawlers such as Majestic, which also keeps a log of the URLs it has crawled. You can download these and add them to your total list before removing all the duplicates and crawling them.

It is also important to run this list of URLs through a tool like Majestic to see whether there are any backlinks pointing to the pages with 3XX, 4XX, and 5XX status codes, where link equity may be diluted or lost entirely.

This process can also be used for link building. By following the same process for your competitors’ websites you may find pages with 4XX status codes that have backlinks pointing to them. You could use the Wayback Machine to see what these pages used to be, then recreate and improve on their old content, without copying anything from the original, before reaching out to the linking domains to suggest your new content as a replacement for the broken link.

And that’s it. A simple process for gathering the URLs of an old website long forgotten or recently migrated.

Again, thanks to Liam Delahunty for the guidance through this process. It’s entirely his brainchild.
