We assume you are using Eclipse and starting from a fresh project. In this example we will fetch and parse the HTML page for Triple J Unearthed's Top 100 charts.
Adding lxml to Google App Engine
The first thing we need to do is add the lxml library to our app.yaml configuration file. In your Eclipse project add a new file called app.yaml and add the following:

    application: almightynassar
    version: 1
    runtime: python27
    api_version: 1
    threadsafe: true

    handlers:
    - url: /.*
      script: triplej.app

    libraries:
    - name: lxml
      version: latest

Most of these fields were covered in Getting Started, but we now have a new field: libraries. This is where we declare the third-party libraries that GAE bundles with the Python 2.7 runtime but does not load by default.
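Before deploying, it can help to confirm that lxml is importable in your local Python environment, since the development server typically relies on a locally installed copy of the libraries declared in app.yaml. A minimal sketch, assuming lxml is already installed locally (the function name lxml_version_string is made up for this example):

```python
# Quick local sanity check: is lxml importable, and which version is it?
from lxml import etree


def lxml_version_string():
    # etree.LXML_VERSION is a tuple of integers describing the lxml release
    return ".".join(str(part) for part in etree.LXML_VERSION)


print("lxml version: " + lxml_version_string())
```

If the import fails here, the handlers later in this post will most likely fail locally in the same way.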
Using lxml
Create a new file called triplej.py and add the following code:

    # The webapp2 framework
    import webapp2
    # lxml parser for XML and HTML
    from lxml import etree
    # The URL Fetch library
    from google.appengine.api import urlfetch


    # Fetches an HTML document and parses it
    class MainPage(webapp2.RequestHandler):
        # Respond to an HTTP GET request
        def get(self):
            # Grabs the HTML (the fetch result holds the body in .content)
            page = urlfetch.fetch('http://www.triplejunearthed.com/Charts/')
            # Parses the HTML
            tree = etree.HTML(page.content)
            # Converts the DOM into a string
            result = etree.tostring(tree, pretty_print=True, method="html")
            # Output the results onto the screen
            self.response.out.write(str(result))


    # Create our application instance that maps the root to our
    # MainPage handler
    app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

If you run this code you will notice that all it does is download the HTML page, parse it, and output the page much as it was downloaded (minus all the images and CSS styling). Nothing impressive, but it proves the concept works. Now on to something a little more beefy.
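To see what the parse and serialize steps do without deploying to App Engine or fetching anything over the network, the same two lxml calls can be tried on an inline string. A minimal sketch, assuming lxml is installed locally; the SAMPLE markup and the roundtrip helper are invented for illustration and are not part of the handler above:

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page; note the unclosed <p>
SAMPLE = b"<html><body><h1>Top 100</h1><p>Unclosed paragraph</body></html>"


def roundtrip(markup):
    # Parse the (possibly sloppy) HTML into an element tree
    tree = etree.HTML(markup)
    # Serialize the tree back to HTML, as the handler does
    return etree.tostring(tree, pretty_print=True, method="html")


output = roundtrip(SAMPLE)
print(output)
```

The unclosed paragraph in the sample comes back with a closing tag, which shows that the output is a re-serialized tree rather than a byte-for-byte copy of the input.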
Parsing, Extracting and Cleaning the HTML
In this example we will parse the page, extract only the chart from the Triple J Unearthed website, and clean up its markup. Replace the triplej.py code with the following:

    # The webapp2 framework
    import webapp2
    # lxml parser for HTML
    from lxml import html
    # HTML cleaner
    from lxml.html.clean import Cleaner
    # The URL Fetch library
    from google.appengine.api import urlfetch


    # Fetches an HTML document, then extracts and cleans the chart
    class MainPage(webapp2.RequestHandler):
        # Respond to an HTTP GET request
        def get(self):
            # Grabs the HTML
            url = 'http://www.triplejunearthed.com/Charts/'
            website = urlfetch.fetch(url)
            # Saves our content as a string
            page = str(website.content)
            # Parses the HTML
            tree = html.fromstring(page)
            # The ID string of the table element we want
            # NOTE: This is bound to change!!! Double check the HTML source first!!!
            elementID = "ctl00_ctl00_ctl00_ctl00_MainBody_ContentPlaceHolder1_ContentPlaceHolder1_ContentPlaceHolder1_GridView1"
            # Configure the cleaner
            #
            # style: removes styling
            # links: removes <link> tags
            # add_nofollow: adds rel="nofollow" to anchor tags
            # page_structure: removes <html>, <head>, and <title> tags
            # safe_attrs_only: only allows safe element attributes
            # javascript: removes embedded javascript
            # scripts: removes <script> tags
            # kill_tags: removes the element and its content
            # remove_tags: removes only the element, but not its content
            #
            # There are more available. See the API reference for lxml
            cleaner = Cleaner(style=True, links=True, add_nofollow=True,
                              page_structure=True, safe_attrs_only=True,
                              javascript=True, scripts=True,
                              kill_tags=set(['img', 'th']),
                              remove_tags=['div'])
            # Grab only our chart (but scrub it clean first!)
            chart = cleaner.clean_html(tree.get_element_by_id(elementID))
            # Change all relative links into absolute links based on the url
            chart.make_links_absolute(url)
            # Converts the DOM element into a string
            result = html.tostring(chart)
            # Output the results onto the screen
            self.response.out.write(result)


    # Create our application instance that maps the root to our
    # MainPage handler
    app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

Running this code should result in a sanitized version of the Triple J Top 100 chart!
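The kill_tags and remove_tags options can be approximated by hand with the drop_tree() and drop_tag() methods that lxml.html elements provide, which makes the difference between the two easier to see. The sketch below is an illustration on a small inline table, not the Cleaner itself: the markup, the "chart" ID, the example.com base URL and the extract_chart helper are all assumptions made for the example. It also passes a default to get_element_by_id so that a changed ID yields None instead of raising KeyError.

```python
from lxml import html

# Hypothetical markup standing in for the downloaded charts page
SAMPLE = """
<html><body>
  <h1>Charts</h1>
  <table id="chart">
    <tr><th>Rank</th><th>Artist</th></tr>
    <tr><td>1</td><td><div><a href="/artist/1">Example Artist</a><img src="/a.png"></div></td></tr>
  </table>
</body></html>
"""
# Hypothetical base URL used to resolve relative links
BASE_URL = "http://www.example.com/Charts/"


def extract_chart(markup, element_id, base_url):
    tree = html.fromstring(markup)
    # Supplying a default avoids a KeyError when the ID is not found
    chart = tree.get_element_by_id(element_id, None)
    if chart is None:
        return None
    # Like kill_tags: remove the element together with its content
    for el in chart.xpath(".//img | .//th"):
        el.drop_tree()
    # Like remove_tags: remove the tag but keep its children and text
    for el in chart.xpath(".//div"):
        el.drop_tag()
    # Rewrite relative links against the base URL
    chart.make_links_absolute(base_url)
    return html.tostring(chart)


result = extract_chart(SAMPLE, "chart", BASE_URL)
print(result)
```

Unlike clean_html, which works on a copy of the element, drop_tree() and drop_tag() modify the parsed tree in place.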
References
- Google's own getting started with webapp and Python.
- The official webapp2 reference
- The Google developer resource for GAE
- Google App Engine FAQs
- lxml homepage
- Parsing with lxml reference
- 'Google App Engine, Python 2.7 and lxml' by MS-potilas
- YAML reference
- app.yaml reference
- The lxml cleaner reference
Thanks for the article. Installing lxml was a task.