« Check Python site-packages for Updates Designing and Testing PyZMQ Applications – Part 3 »
.

A Simple Web Bot with Requests and BeautifulSoup

Today I helped a colleague debugging a web bot written in Java. Since I did’t really work with Java since a few years, I thought it would be easier for me to reproduce (and solve) the problem with Requests and BeautifulSoup. (I’ve actually been looking for an opportunity to try Requests out for a while, since I’ve heard so much good about it.)

And what can I say—I was blown away by how easy it was to implement a simple web bot that filled out a form and grabbed a huge table of data for me.

I cannot show you exactly that web bot, but here’s how you could search my website for “Python” and get the headings of the resulting posts:

from BeautifulSoup import BeautifulSoup
import requests

url = 'http://stefan.sofa-rockers.org/search/?q=%(q)s'
payload = {
    'q': 'Python',
}
r = requests.get(url % payload)

soup = BeautifulSoup(r.text)
titles = [h2.text for h2 in soup.findAll('h2', attrs={'class': 'post_title'})]

for t in titles:
    print(t)

Comment

  1. karl on February 21, 2012 at 16:59:

    This is more an HTML grabber. a compact and elegant one. Maybe you could replace BeautifulSoup with lxml and the html5parser. :)

    But more interesting would be the part of the code which makes it a real crawling bot. The management of the URIs queue. How did you do it?

  2. Stefan on February 21, 2012 at 17:51:

    Yepp, you’re right. It’s more like a grabber than like a bot. It also doesn’t crawl. It just fills out a form and downloads the results. However, I was surprised how easy this was. :-)

    I’ll take a look at lxml and html5parser when I do my next awesome web-grabber-bot :-)

  3. Renier on February 22, 2012 at 14:51:

    Check out http://www.scrapy.org , I know it also uses BeautifulSoup, and it is really amazingly powerful.

New comment

Required
Required, but not displayed
Optional
Format using ReStructuredText (Quickref)
  • *emphasis*, **strong**, ``inline code``
  • Blockquotes: indent each line to be quoted
  • Links w/ description: `Description <URL>`_
  • Highlighted code blocks: See here
captcha Please prove that you are human.