Resurrect an offline blog with Python

by Randy Gingeleski


Downloading an offline but cached blog with sitemap XML and Python. Try it for your lost blog(s). 🕵


My late teens were influenced by Mike Cernovich’s Danger & Play blog. I said about as much when he once commented here.

Recently www.dangerandplay.com went offline. There’s a book on Amazon collecting its bigger posts, but the site had over a thousand, so surely not all of that content made it.

Disclaimer: some of D&P is “controversial,” so here’s the obligatory vague statement that I don’t agree with all of it.

Here’s how I downloaded and resurrected the whole thing with Python code and some luck. Hopefully it helps with something similar.

The luck part was this - there were (cached) sitemaps with links to all (cached) posts.

Let’s walk through the process, which started on Google.

Further down the first page of results, Google points to the site’s “archives” page.

Defunct - it loaded dynamically via a WordPress plugin. But look at that, we’ve found a sitemap that didn’t make it into Google’s index.

And that points to post sitemaps. Nice.
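For reference, a sitemap index is just plain XML pointing at the other sitemaps - roughly this shape (trimmed down here, though those two post-sitemap URLs are the real ones used later):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www2.dangerandplay.com/post-sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www2.dangerandplay.com/post-sitemap2.xml</loc>
  </sitemap>
</sitemapindex>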

This is a good time to say I don’t know why this “www2” cache occurred. Chalk it up to Cloudflare and decent traffic. Hopefully your target has that going for it too.

Anyway here’s a post sitemap.
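It’s the same idea one level down - a <urlset> of <loc> entries, one per post. Roughly like this (the slugs below are made up for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www2.dangerandplay.com/some-post-slug/</loc>
  </url>
  <url>
    <loc>http://www2.dangerandplay.com/another-post-slug/</loc>
  </url>
</urlset>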

In Python we’ll script a request to each link and save the response HTML.

Yes, there are libraries for scraping and crawling. But they’re a lot of overhead for this straightforward task. I ended up with this code.

from bs4 import BeautifulSoup
import requests

def get_all_links(r):
    # Pull every <loc> URL out of a sitemap response
    all_links = []
    soup = BeautifulSoup(r.content, 'lxml')
    for url in soup.find_all('loc'):
        all_links.append(url.string)
    return all_links

def crawl_and_save(target_url, current_depth=1, max_depth=2):
    r = requests.get(target_url)
    if current_depth < max_depth:
        # Still at the sitemap level - follow every link it lists
        current_depth += 1
        links = get_all_links(r)
        for link in links:
            print(link)
            crawl_and_save(link, current_depth)
    else:
        # At the post level - save the HTML, named after the URL slug
        # (WordPress permalinks end in a slash, hence the [-2])
        filename = target_url.split('/')[-2] + '.html'
        with open(filename, mode='wb') as file:
            file.write(r.content)

if __name__ == '__main__':
    crawl_and_save('http://www2.dangerandplay.com/post-sitemap1.xml')
    crawl_and_save('http://www2.dangerandplay.com/post-sitemap2.xml')

grab_all_dp.py

Could’ve saved a dependency by using lxml directly, but I’m not familiar with it off-hand. This took like 5 minutes to write. Fork it if you’re inclined.
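If you did want to drop BeautifulSoup, the <loc> extraction could probably be done with lxml alone - something like this sketch (not what I actually ran), swapped in for get_all_links():

from lxml import etree

def get_all_links(r):
    # Parse the raw sitemap bytes and grab every <loc> text node,
    # matching on local-name() so the sitemap namespace doesn't matter
    tree = etree.fromstring(r.content)
    return tree.xpath('//*[local-name()="loc"]/text()')

The rest of the script would stay the same.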

Runtime seemed to be under 10 minutes and the result’s as expected - a pile of .html files, one per post.

Ta-da. Happy hacking. 🐱‍💻