A Simple Web Bot with Requests and BeautifulSoup
Tuesday, February 21, 2012
Today I helped a colleague debug a web bot written in Java. Since I haven’t really worked with Java for a few years, I thought it would be easier for me to reproduce (and solve) the problem with Requests and BeautifulSoup. (I’ve actually been looking for an opportunity to try Requests out for a while, since I’ve heard so many good things about it.)
And what can I say: I was blown away by how easy it was to implement a simple web bot that filled out a form and grabbed a huge table of data for me.
I cannot show you exactly that web bot, but here’s how you could search my website for “Python” and get the headings of the resulting posts:
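# BeautifulSoup 3 import path; with Beautiful Soup 4 this is `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup
import requests

# Search URL with a placeholder for the query term
url = 'http://stefan.sofa-rockers.org/search/?q=%(q)s'
payload = {
    'q': 'Python',
}

# Fetch the result page and parse the returned HTML
r = requests.get(url % payload)
soup = BeautifulSoup(r.text)

# Collect the text of every <h2 class="post_title"> heading
titles = [h2.text for h2 in soup.findAll('h2', attrs={'class': 'post_title'})]
for t in titles:
    print(t)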
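The snippet above uses the BeautifulSoup 3 import and fills in the query string by hand. For anyone on current library versions, here is a minimal sketch of the same search with Beautiful Soup 4 (`bs4`) and the `params` argument of `requests.get`, which lets Requests URL-encode the query. The names `SEARCH_URL`, `extract_titles` and `search_titles` and the 10-second timeout are my own choices for illustration, and the sketch assumes the result page still marks its headings as `<h2 class="post_title">`.

```python
import requests
from bs4 import BeautifulSoup

# Search endpoint from the post, without the hand-built query string
SEARCH_URL = 'http://stefan.sofa-rockers.org/search/'


def extract_titles(html):
    """Return the text of every <h2 class="post_title"> in an HTML string."""
    # Name the parser explicitly so the result does not depend on
    # which optional parsers (lxml, html5lib) happen to be installed.
    soup = BeautifulSoup(html, 'html.parser')
    return [h2.get_text(strip=True)
            for h2 in soup.find_all('h2', class_='post_title')]


def search_titles(query, timeout=10):
    """Search the site for `query` and return the headings of the results."""
    # params= makes Requests build and URL-encode "?q=<query>" for us
    response = requests.get(SEARCH_URL, params={'q': query}, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return extract_titles(response.text)
```

Calling `search_titles('Python')` and printing each item would then mirror the loop above. Keeping the parsing in its own function means it can be tried on a saved HTML file without touching the network.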
Comment
karl on February 21, 2012 at 16:59:
This is more of an HTML grabber, a compact and elegant one. Maybe you could replace BeautifulSoup with lxml and the html5parser. :)
But the more interesting part would be the code that makes it a real crawling bot: the management of the URI queue. How did you do it?
Stefan on February 21, 2012 at 17:51:
Yepp, you’re right. It’s more like a grabber than like a bot. It also doesn’t crawl. It just fills out a form and downloads the results. However, I was surprised how easy this was. :-)
I’ll take a look at lxml and html5parser when I do my next awesome web-grabber-bot :-)
Renier on February 22, 2012 at 14:51:
Check out http://www.scrapy.org. I know it can also be used together with BeautifulSoup, and it is really amazingly powerful.