How To Extract Your Website’s URLs from Archive.org (Wayback Machine)


There are occasions where a client may come to you following a CMS or domain migration which has resulted in a ranking or traffic loss. This can be a difficult situation to remedy when you are unable to find any previous sitemap.xml files or older Screaming Frog crawls.

If some pages had high traffic, sales, or lead-generation value, they may be lost altogether. If some pages had a large number of inbound links, the value of those links (measured in PageRank, link equity, Trust Flow, and so on) would be lost entirely too.

Without full knowledge of the website’s former site structure and the URLs within it, there could be a lot of value lost to dead-end 404 pages.

Having run into the same situation ourselves recently, we had to figure out a solution (with a large helping hand from Liam Delahunty. Thanks, Liam!) which we’d now like to pass on to you.

Using Archive.org Data

Archive.org, or the Wayback Machine as it’s more commonly known, is a web crawler and indexing system that archives the internet’s web pages for historical reference. It’s a cool tool which allows us to take a peek at what Google looked like when it was still in Beta back in 1998, for example.

As it crawls a large percentage of the internet, it’s highly likely that your website has been crawled by its web crawler. By retrieving this publicly available data we can piece together a rough idea of what the pre-migration website’s site structure may have been.

The data is freely available to use, and Archive.org provides a brief outline of how the API may be accessed and used here.

Not being an API-wielding specialist myself, in the following process I’ll be falling back on a classic copy-and-paste approach which search engine optimisation specialists of any skill level can use.

Example txt log file on archive.org

How To Extract Old URLs from Archive.org

1. Locate your website’s JSON or TXT file

Start by navigating to the following URL, replacing the example.com placeholder with your own website’s root domain.

For JSON format:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=json

For TXT format:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt

If you need to limit the time frame of the crawl, you can add from= and to= parameters to the end of the URL to narrow the range. Both accept timestamps in the following format, and a partial value such as just the year also works.

yyyyMMddhhmmss

Example:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&from=2010&to=2018

You can also add a limit parameter and decrease or increase its value to match your needs.

Example:
http://web.archive.org/cdx/search/cdx?url=example.com*&output=txt&limit=999999

You can find a full rundown of the available filtering options here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#filtering
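
If you would rather script this step, a minimal Python sketch along these lines pulls the same TXT output together with the parameters above (example.com and the output file name, archive_org_rows.txt, are placeholders to swap for your own):

    import requests  # third-party HTTP library: pip install requests

    # Placeholder domain: swap example.com for your own root domain.
    params = {
        "url": "example.com*",   # the trailing * matches every captured URL under the domain
        "output": "txt",
        "from": "2010",          # optional: yyyyMMddhhmmss timestamps, partial values allowed
        "to": "2018",
        "limit": "999999",
    }

    response = requests.get("http://web.archive.org/cdx/search/cdx", params=params)
    response.raise_for_status()

    # Save the raw, space-delimited rows so they can be pasted into a spreadsheet
    # or cleaned up in the later steps.
    with open("archive_org_rows.txt", "w") as f:
        f.write(response.text)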

2. Paste into your spreadsheet and separate into columns

Copy the entire text of the loaded page and paste the results into a spreadsheet. In this instance, we’re using Google Sheets.

Select the entire range of data and use the “Split text into columns…” option in the “Data” menu of the toolbar. As we’re using the TXT format, choose the “space” delimiter to separate the data.

Extracted data from Archive.org split into columns
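
If you would rather do this step in code than in Sheets, the equivalent of the split is to break each space-delimited row into its fields. A minimal sketch, assuming the archive_org_rows.txt file saved in the step 1 sketch:

    # Each CDX row is one space-delimited line; splitting it mirrors the
    # "Split text into columns..." step in Google Sheets.
    with open("archive_org_rows.txt") as f:
        rows = [line.split() for line in f if line.strip()]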

3. Remove columns leaving only the URLs

Delete all of the unneeded columns to leave only the URLs. The URLs will usually be in Column C (the CDX “original” field).
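
The scripted equivalent, continuing from the rows list in the sketch above, keeps only the third field of each row:

    # Column C of the spreadsheet is the third CDX field, the originally captured URL.
    urls = [row[2] for row in rows if len(row) > 2]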

4. Use Find and Replace to remove :80 from URLs

Select the column of URLs and use the “Find and Replace” function to locate the text “:80” and replace it with nothing (leave the replacement text box empty). This will tidy up all of the URLs, sometimes removing tens of thousands of instances of “:80”, the explicit default HTTP port recorded in many captures.

Find and replace used on archive.org data
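
In code, the same tidy-up is a plain string replacement, continuing from the urls list above and mirroring the Find and Replace step:

    # Remove the explicit ":80" default-port suffix, just as the spreadsheet
    # Find and Replace does.
    urls = [u.replace(":80", "") for u in urls]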

5. Use =UNIQUE formula to remove duplicates

In a separate column, use the UNIQUE formula, for example =UNIQUE(A:A), to remove the duplicates from the first column, leaving only unique URLs to check for 3XX, 4XX, and 5XX status codes.

Using the unique formula in Google Sheets to clean the archive.org data
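
The scripted counterpart of =UNIQUE, continuing from the same urls list, de-duplicates while keeping the original order:

    # dict.fromkeys() preserves insertion order, so this is the scripted
    # equivalent of =UNIQUE(A:A).
    unique_urls = list(dict.fromkeys(urls))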

6. Crawl URLs using Screaming Frog and extract report for review

Copy your final list of URLs, open Screaming Frog and switch it to List mode, then paste in your gathered URLs.

Export your completed crawl as a CSV and copy/paste the data into another tab of your spreadsheet. At this point, you can either remove all columns except for the URL and Status Code columns, or you can use a VLOOKUP to populate the corresponding status codes for your original list.

You can now filter this complete list of URLs to find 404 pages or redirect chains.

A spreadsheet showing hundreds of redirects and 404 pages
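
If you do not have Screaming Frog to hand, a rough scripted alternative (not part of the original process) is to request each URL and record its status code. This is only a sketch: it assumes the unique_urls list from step 5, reports 3XX responses rather than following them, and ignores the crawl-politeness settings a proper crawler would give you.

    import csv
    import requests  # pip install requests

    results = []
    for url in unique_urls:  # unique_urls: the de-duplicated list from step 5
        try:
            # allow_redirects=False so 3XX responses are reported, not followed.
            status = requests.head(url, allow_redirects=False, timeout=10).status_code
        except requests.RequestException:
            status = "error"
        results.append((url, status))

    # Write a two-column CSV that can be filtered in a spreadsheet,
    # like the Screaming Frog export.
    with open("status_codes.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status Code"])
        writer.writerows(results)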

Other advantages and tips

This process can be enhanced further by gathering URLs from Google Analytics, going back as far as you can, and making sure to check any former URLs which may have been high-traffic or high-converting sales pages in the past.

Taking it another step further, you can find more URLs via other web crawlers such as Majestic, which also keeps a log of the URLs it has crawled. You can download these and add them to your total list before removing all the duplicates and crawling them.

It is also important to run this list of URLs through a tool like Majestic to see whether there are any backlinks pointing to the pages with 3XX, 4XX, and 5XX status codes, where link equity may be diluted or lost entirely.

This process can also be used for link building. By following the same process for your competitors’ websites you may find pages with 4XX status codes that have backlinks pointing to them. You could use the Wayback Machine to see what these pages used to be, then recreate and improve on their old content, without copying anything from the original, before reaching out to the linking domains to suggest your new content as a replacement for the broken link.

And that’s it. A simple process for gathering the URLs of an old website long forgotten or recently migrated.

Again, thanks to Liam Delahunty for the guidance through this process. It’s entirely his brainchild.
