Resurrect an offline blog with Python
by Randy Gingeleski
Downloading an offline but cached blog with sitemap XML and Python. Try it for your lost blog(s). 🕵
My late teens were influenced by Mike Cernovich’s Danger & Play blog. I said about as much when he commented here.
www.dangerandplay.com went offline. There’s a book on Amazon collecting its bigger posts, but the site had over a thousand. Most of that content surely didn’t make it in.
Disclaimer: some of D&P is “controversial,” so here’s the vague statement that I don’t agree with all of it.
Here’s how I downloaded/resurrected the whole thing with some Python code and some luck. Hopefully this helps with something similar.
The luck part was this: there were (cached) sitemaps with links to all the (cached) posts.
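If you’re hunting for the same kind of luck on another dead blog, a quick probe of common sitemap paths is a sensible first step. This is just a sketch - the candidate paths are typical WordPress/SEO-plugin defaults, and you’d swap in your own target host:

```python
import requests

# Common sitemap locations used by WordPress and its SEO plugins.
CANDIDATES = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/post-sitemap.xml',
    '/post-sitemap1.xml',
]

def find_sitemaps(base_url):
    """Return the candidate sitemap URLs that respond with HTTP 200."""
    found = []
    for path in CANDIDATES:
        try:
            r = requests.get(base_url + path, timeout=10)
        except requests.RequestException:
            # Host unreachable or URL malformed - just skip it.
            continue
        if r.status_code == 200:
            found.append(base_url + path)
    return found
```

Check the live domain first, then cached mirrors (archive.org, or a stray subdomain like the “www2” below).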
Let’s walk through the process, which started on Google.
Further down the first page, it points to “archives”.
Defunct - the archive page loaded dynamically via a WordPress plugin. But check it out: we found a sitemap that didn’t make it into Google’s index.
And that points to post sitemaps. Nice.
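For reference, a WordPress sitemap index is roughly the snippet below - the tags follow the standard sitemap protocol, and the child URLs shown are the two post sitemaps crawled later in this post:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www2.dangerandplay.com/post-sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://www2.dangerandplay.com/post-sitemap2.xml</loc></sitemap>
</sitemapindex>
```

Each post sitemap then lists the individual post URLs in `<url><loc>` entries - those are what we’ll actually download.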
This is a good time to say I don’t know why this “www2” cache exists. Chalk it up to Cloudflare and decent traffic. Hopefully your target has that going for it too.
Anyway here’s a post sitemap.
In Python we’ll script a request to each link and save the response HTML.
Yes, there are libraries for scraping and crawling, but they’re a lot of overhead for this straightforward task. I ended up with this code.
```python
from bs4 import BeautifulSoup
import requests


def get_all_links(r):
    """Collect every <loc> URL from a sitemap response."""
    all_links = []
    soup = BeautifulSoup(r.content, 'lxml')
    for url in soup.findAll('loc'):
        all_links.append(url.string)
    return all_links


def crawl_and_save(target_url, current_depth=1, max_depth=2):
    r = requests.get(target_url)
    if current_depth < max_depth:
        # Still on a sitemap: follow each linked URL one level down.
        current_depth += 1
        links = get_all_links(r)
        for link in links:
            print(link)
            crawl_and_save(link, current_depth)
    else:
        # At a post: save the raw HTML, named after the URL slug.
        filename = target_url.split('/')[-2] + '.html'
        with open(filename, mode='wb') as file:
            file.write(r.content)


if __name__ == '__main__':
    crawl_and_save('http://www2.dangerandplay.com/post-sitemap1.xml')
    crawl_and_save('http://www2.dangerandplay.com/post-sitemap2.xml')
```
Could’ve saved a dependency by using lxml directly, but I’m not familiar with it off-hand. This took about 5 minutes to write. Fork it if you’re inclined.
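For the curious, here’s a sketch of what the lxml-only version of the link extraction might look like - the sample XML is a hypothetical stand-in for one of the post sitemaps, and the XPath trick just sidesteps the sitemap namespace:

```python
from lxml import etree

# Hypothetical stand-in for a cached post sitemap.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www2.dangerandplay.com/some-post/</loc></url>
  <url><loc>http://www2.dangerandplay.com/another-post/</loc></url>
</urlset>"""

def get_all_links_lxml(content):
    """Pull every <loc> URL out of sitemap XML, namespace-agnostic."""
    root = etree.fromstring(content)
    return root.xpath('//*[local-name()="loc"]/text()')

print(get_all_links_lxml(sample))
```

Same output as the BeautifulSoup version, one fewer dependency.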
Runtime was under 10 minutes, and the result’s as expected.
Ta-da. Happy hacking. 🐱💻