
Resurrect an offline blog with Python

Downloading an offline but cached blog with sitemap XML and Python. Try it for your lost blog(s). 🕵

My late teens were influenced by Mike Cernovich's Danger & Play blog. I said as much when he once commented here.

Recently www.dangerandplay.com went offline. There's a book on Amazon collecting its bigger posts, but the site had over a thousand; most of that content surely didn't make it in.

Disclaimer: some of D&P is "controversial," so here's the obligatory statement that I don't agree with all of it.

Here's how I downloaded and resurrected the whole thing with Python code and some luck. Hopefully it helps if you're attempting something similar.

The luck part was this: there were cached sitemaps with links to all the cached posts.

Let's walk through the process, which started on Google.

[Image: dp-site-google]

Further down the first results page, Google points to the site's "archives" page.

[Image: dp-archives]

Defunct: the page loaded its content dynamically via a WordPress plugin, so nothing survives there. But look closer at the results: we found a sitemap index that barely made Google.

[Image: dp-sitemap-index]

And that points to post sitemaps. Nice.

This is a good time to admit I don't know why this "www2" cache exists. Chalk it up to Cloudflare and decent traffic. Hopefully your target has that going for it too.

Anyway here's a post sitemap.

[Image: sitemap-xml]
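For reference, a post sitemap is plain XML following the sitemaps.org protocol: a `<urlset>` of `<url>` entries, each carrying a post's address in a `<loc>` tag. An illustrative fragment (the URL and date are stand-ins, not real entries from the site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.dangerandplay.com/some-post/</loc>
    <lastmod>2016-01-01T00:00:00+00:00</lastmod>
  </url>
  <!-- ...one <url> entry per post... -->
</urlset>
```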

In Python we'll script a request to each link and save the response HTML.

Yes, there are libraries for scraping and crawling, but they're a lot of overhead for this straightforward task. I ended up with this code.
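The original embed isn't reproduced here, but a minimal sketch of the approach looks like this, using only the standard library. The sitemap URL and output directory are placeholders, not the exact ones used; substitute the cached sitemaps you found.

```python
import os
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical inputs -- swap in the cached post sitemaps you discovered.
SITEMAP_URLS = ["https://www2.dangerandplay.com/post-sitemap1.xml"]
OUT_DIR = "posts"

# Sitemaps live in this XML namespace per the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def post_urls(sitemap_xml):
    """Pull every <loc> entry (one per post) out of a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]


def slug_for(url):
    """Turn a post URL into a flat filename, e.g. /a/b/ -> a-b.html."""
    path = urlparse(url).path.strip("/")
    return (path.replace("/", "-") or "index") + ".html"


def mirror(sitemap_urls, out_dir):
    """Fetch each sitemap, then fetch and save every post it lists."""
    os.makedirs(out_dir, exist_ok=True)
    for sitemap_url in sitemap_urls:
        sitemap_xml = urlopen(sitemap_url, timeout=30).read()
        for url in post_urls(sitemap_xml):
            html = urlopen(url, timeout=30).read()
            with open(os.path.join(out_dir, slug_for(url)), "wb") as f:
                f.write(html)
            time.sleep(0.5)  # be polite to the cache


# mirror(SITEMAP_URLS, OUT_DIR)  # uncomment to run the download
```

The sleep between requests keeps the crawl gentle; saving raw bytes avoids any encoding guesswork until you actually need to read the HTML.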

Could've saved a dependency by using lxml directly, but I'm not familiar with it off-hand. This took maybe five minutes to write. Fork it if you're inclined.

The run took under 10 minutes and the result is as expected.

[Image: dp-html-files]

Ta-da. Happy hacking. 🐱‍💻