Minasan, Watashiwa Wawan Desu...

NurCell Movies

Saturday, February 5, 2011

Python web crawler code - use at your own risk


Big changes to the crawler code:

- Switched from urllib, which left sockets open and created memory leaks, crashes and other computer higgledy-piggledy, to httplib.
- Now fetching mime-type and using it to separate images from text pages.
- Better URL handling.
- Cleaner output - removes domain name from output for smaller, easier to handle files.
- Switch to crawl only the current site, or to check other sites, too.
- Checks for URLs previously crawled and marks them as such, but still notes them. That will provide you a complete list of all links on every page, without slowing the crawl.
- Better connection management for faster crawls.
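
To make a few of those changes concrete, here is a minimal sketch of mime-type sorting, URL clean-up, domain stripping and the "seen it already" check. It is not the CMCrawler code: every function name here (`classify`, `normalize`, `is_internal`, `strip_domain`, `record_links`) is hypothetical, the "new"/"seen" labels are an assumption about how previously crawled URLs might be marked, and it is written for Python 3 using only the standard library.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def classify(content_type):
    """Return 'image', 'page' or 'other' for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime.startswith("image/"):
        return "image"
    if mime in ("text/html", "application/xhtml+xml"):
        return "page"
    return "other"


def normalize(base_url, href):
    """Resolve href against the page it appeared on and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def is_internal(url, domain):
    """True when url lives on the domain being crawled."""
    return urlsplit(url).hostname == domain.lower()


def strip_domain(url, domain):
    """Shorten internal URLs to path (plus query) for smaller output."""
    parts = urlsplit(url)
    if parts.hostname != domain.lower():
        return url
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def record_links(page_url, hrefs, domain, seen, queue, stay_within=True):
    """Note every link on a page, but queue only URLs not seen before."""
    rows = []
    for href in hrefs:
        url = normalize(page_url, href)
        already = url in seen
        if not already:
            seen.add(url)
            # Hypothetical switch: crawl only this site, or other sites too.
            if is_internal(url, domain) or not stay_within:
                queue.append(url)
        rows.append((strip_domain(page_url, domain),
                     strip_domain(url, domain),
                     "seen" if already else "new"))
    return rows
```

Because `seen` is a set, the repeat check is a constant-time lookup, which is one way to keep a complete link list for every page without slowing the crawl.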

Things that still bug me:

- The script will time out if the target site times out. Need a way to have it stop gracefully.
- Still not multithreaded.
- Not storing in a database. That's to keep the script simple and portable, but at some point it'll have to change.
- Needs a pretty interface. Working on that next.

Download the code (and contribute to the project by improving the code!) here:

[ CMCrawler - an open source Python web crawler ]
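
For the timeout problem in the list above, one possible approach (a sketch only, not what the script currently does) is to give each connection a timeout and treat a failure as "skip this URL" instead of letting the exception end the crawl. The example assumes Python 3, where httplib is named `http.client`. The function name `fetch_head`, the 10-second default and the format of the "skipped" line are all assumptions made for illustration.

```python
import http.client
import socket


def fetch_head(host, path="/", port=80, timeout=10.0):
    """Return (status, content_type) for a HEAD request, or None on failure."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Content-Type", "")
    except (socket.timeout, OSError, http.client.HTTPException) as err:
        # Timeouts, refused connections and malformed replies all land here,
        # so one slow or broken host is reported and skipped, not fatal.
        print("skipped %s%s\t%s" % (host, path, err.__class__.__name__))
        return None
    finally:
        # Always release the socket, whether or not the request succeeded.
        conn.close()
```

A crawl loop could then call `fetch_head(...)`, check for `None`, and move on to the next URL in its queue.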


This is a command-line Python script. It doesn't get much uglier, just so ya know. But it's fast, lightweight and the output is easy to mash for generating XML sitemaps, checking for 404 errors on your site, or just getting a sense of a site's layout.
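
As an illustration of that kind of mashing, here is a small Python 3 sketch that turns tab-delimited crawl output into a list of 404s and a bare-bones XML sitemap. The column layout is an assumption, not the script's documented format: the sketch expects each row to hold a path and then an HTTP status code. The function names (`read_rows`, `broken_links`, `sitemap_xml`) are hypothetical too, so adjust the indexes to match the real output.

```python
import csv
import io
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def read_rows(text):
    """Parse tab-delimited crawl output into (path, status) tuples."""
    # Assumed layout: column 0 is the path, column 1 is the status code.
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(row[0], int(row[1])) for row in reader if len(row) >= 2]


def broken_links(rows):
    """Return the unique paths that came back 404, sorted."""
    return sorted({path for path, status in rows if status == 404})


def sitemap_xml(rows, domain):
    """Build a minimal sitemap from the paths that returned 200."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in sorted({p for p, status in rows if status == 200}):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = "http://" + domain + path
    return ET.tostring(urlset, encoding="unicode")
```

Since the output drops the domain name, `sitemap_xml` takes the domain as an argument and puts it back on each path.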


As a speed reference: It averages 90 seconds to crawl 700 or so pages, which works out to roughly 8 pages per second. It is single-threaded (at the moment).


You must have Python installed. If you don't, or don't know how to install it, frankly I don't suggest you mess with this just now. It's not a mature-enough program yet.


You also need one library that doesn't come standard with Python: The fantastic BeautifulSoup library. It's worth the effort, and without it, writing this crawler would have reduced me to a damp, gibbering lump of flesh under my desk.


Finally, you need to know how to use the command line on your computer, just a little bit.

1. Download the code.
2. Extract the compressed archive to your hard drive. Put it wherever you want - just make sure you remember the location.
3. Start up your command-line client. On my Mac I use the trusty BASH shell.
4. Navigate to the folder where you put the script.
5. Type python cmcrawler.py [domain to crawl] [stay within domain]. Domain to crawl is your site's domain, without the leading 'http://'. Stay within domain is a '0' for 'stay within this domain' or a '1' for 'crawl everything'. For God's sake, stick with 0 for now, OK?
6. The script will spit out the results of the crawl, as they happen. The results are tab-delimited, so you can easily cut-and-paste them into a text editor or Excel.
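
To show what those two arguments mean, here is a hedged Python sketch of how a script like this might read them. It is not taken from cmcrawler.py: `parse_args` is a hypothetical helper, and stripping a pasted 'http://' is an extra convenience the real script may not offer.

```python
def parse_args(argv):
    """Return (domain, stay_within) from a cmcrawler-style argument list."""
    if len(argv) != 3 or argv[2] not in ("0", "1"):
        raise SystemExit("usage: python cmcrawler.py [domain to crawl] [0|1]")
    domain = argv[1].strip().lower()
    # The instructions ask for a bare domain; tolerate a pasted scheme anyway.
    for prefix in ("http://", "https://"):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
    domain = domain.rstrip("/")
    # '0' means stay within this domain, '1' means crawl everything.
    stay_within = argv[2] == "0"
    return domain, stay_within
```

With this sketch, `parse_args(["cmcrawler.py", "example.com", "0"])` gives `("example.com", True)`, the recommended "stay within this domain" mode.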

A few folks at #seochat last night asked for the code from a Python-driven web crawler I'm working on, so here it is, in a GitHub repository.


I'm just warning you: There is some ugly stuff in that code. This was the very first Python code I wrote. Ever. It does all the horrible things developers do when learning a new platform.


I'll update it as often as I can. The code is totally, 100% free for everyone to use. There are a few conditions though:

- You can't use this for a commercial project without talking to me first.
- Please improve it! Check out the issues page on GitHub and see what you can do. Send me feedback.
- You are not permitted to laugh at my lack of coding-fu.

Enjoy.


PS: This crawler is really me hacking together great libraries other people wrote. I get no credit for anything that works.


