Jun 15, 2012

Cron and Datastore in Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. If you are just starting out, I suggest you start with Getting Started and First App. If you want to parse XML or HTML files, please see my posts 'Parsing XML with Google App Engine: Python' or 'Parsing HTML with lxml and Google App Engine: Python'.

This example will use the code from my previous post 'Convert Twitter stream into an RSS feed'. If you don't understand why I did something, just pop on over to that link and see how I came up with the code originally.

I STRONGLY suggest you look up some of the references, such as how Google App Engine handles cron and the datastore, and how entities and keys work. Most of what you will need is provided in the references at the end of the post.

What our plan is...

Ok, let me briefly go over what my proposed system will do and how cron and GAE's Datastore fit in.

In a previous blog post I created a web app that connects to someone's Twitter feed and converts it into an RSS feed. A problem with this set-up was a noticeable lag (about 2-3 seconds) on every request while the app downloaded the stream, parsed it, inserted links and output an RSS XML file.

To solve this, I will create a cron script that runs in the background (I will also hide it behind an administration login so that arbitrary visitors cannot trigger it). This requires storing the generated feed in a persistent object, which Datastore conveniently supplies.
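The split between a slow background job and a fast read path can be sketched in plain Python. This is only an illustration under stated assumptions: the dict `store` stands in for the Datastore, and `build_feed`, `refresh_feed` and `serve_feed` are hypothetical names that are not part of App Engine.

```python
# A plain dict stands in for the Datastore in this sketch (assumption).
store = {}

def build_feed():
    # Placeholder for the slow work: download, parse, linkify, render.
    return "<rss version='2.0'><channel></channel></rss>"

def refresh_feed(key, builder, store):
    # What the cron job does: run the slow step and overwrite the stored copy.
    store[key] = builder()

def serve_feed(key, store):
    # What the front end does: a single lookup, no fetching or parsing.
    # Returns None if the job has never run for this key.
    return store.get(key)

refresh_feed("almightyolive", build_feed, store)
print(serve_feed("almightyolive", store))
```

The point of the design is that visitors only ever pay for the lookup; the 2-3 second cost is paid once per scheduled run instead of once per request.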

Now for the code....

The RSS object

We will first create a Python class that defines the entities we store in our database. Just create a file called entity.py and add the following code:

# Our datastore interface
from google.appengine.ext import db

# Our RSS entity object
class Rss(db.Model):
    feed = db.StringProperty()
    content = db.TextProperty()

Note that we import the db module; this is our interface to Datastore. If you want to know more about creating entities, I suggest you read the references provided at the end of the post.

The cron script

This script will do everything we did in our previous blog post, except that it stores the feed in an Rss entity and saves that entity to the Datastore instead of returning it to the browser. Create a file called cron.py and insert the following code:

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Our entity library
import entity

# Detects if it is a URL link and adds the HTML tags
def linkify(text):
    # If http is present, add the link tag
    if "http" in text:
        text = "<a href='" + text + "'>" + text + "</a>"
    # If @ is present, turn it into a twitter handle link
    elif "@" in text:
        text = "<a href='http://twitter.com/#!/" + text.split("@")[1] + "'>" + text
        text += "</a>"
    # If # is present, turn it into a twitter hashtag search link
    elif "#" in text:
        text = "<a href='https://twitter.com/#!/search/%23" + text.split("#")[1] + "'>" + text
        text += "</a>"

    return text

# Output the XML into an RSS feed
def outputRSS(xml):
    # Get the status list
    statuses = xml.getElementsByTagName("status")
   
    # Our return string
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>"
    outputString+= "\n\t\t<title>Almightyolive Twitter</title>\n\t\t"

    outputString+= "<link>https://twitter.com/#!/almightyolive</link>\n"
    outputString+= "\t\t<description>The twitter feed for the Almighty "
    outputString+= "Olive</description>"

   
    # Cycle through the statuses
    for status in statuses:
        # Get the fields of this status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        tweet = status.getElementsByTagName("id")[0].firstChild.data
       
        # Insert links into the text
        words = text.split()
       
        for i in range (len(words)):
            words[i] = linkify(words[i])
       
        # Recompile words
        text = " ".join(words)
       
        # Creates our output (minidom returns unicode, so no str() calls:
        # str() would fail on tweets containing non-ASCII characters)
        string = "\n\t\t<item>\n\t\t\t<title>" + date + "</title>\n"
        string+="\t\t\t<link>https://twitter.com/AlmightyOlive/status/" + tweet
        string+= "</link>\n\t\t\t<description>" + text + "</description>\n"
        string+= "\t\t</item>"

        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# OUR CRON SCRIPT PROPER!
#
# Grabs the XML
url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=almightyolive&count=10&trim_user=true')

# Parses the document
xml = parseString(url.content)

# Builds the RSS output
content = outputRSS(xml)

# Our RSS storage entity
rssStore = entity.Rss(key_name='almightyolive')

# Elements of our RSS
rssStore.feed = "almightyolive"
rssStore.content = content

# Stores our RSS Feed into the datastore
rssStore.put()

The functions linkify() and outputRSS() are the same as in the previous blog post, except that linkify() now also handles hashtags. The biggest difference is that MainPage and the webapp-specific code are replaced with a simple sequential script (which in practice is not unlike the body of MainPage).
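To check the linking logic without deploying anything, the same steps can be run against a hand-written status document. This is a minimal sketch: `SAMPLE` is made-up data shaped like the XML the script expects, and the `escape()` call is my own addition (the script above inserts the anchors into the description unescaped), shown here as one way to keep the HTML from being read as RSS child elements.

```python
from xml.dom.minidom import parseString
from xml.sax.saxutils import escape

def linkify(text):
    # Same branches as the cron script, with the hashtag branch splitting on "#".
    if "http" in text:
        text = "<a href='" + text + "'>" + text + "</a>"
    elif "@" in text:
        text = "<a href='http://twitter.com/#!/" + text.split("@")[1] + "'>" + text + "</a>"
    elif "#" in text:
        text = "<a href='https://twitter.com/#!/search/%23" + text.split("#")[1] + "'>" + text + "</a>"
    return text

# Hypothetical sample input, not real API output.
SAMPLE = (
    "<statuses><status>"
    "<created_at>Fri Jun 15 10:00:00 +0000 2012</created_at>"
    "<id>1</id>"
    "<text>Hello @bob see http://example.com #gae</text>"
    "</status></statuses>"
)

doc = parseString(SAMPLE)
status = doc.getElementsByTagName("status")[0]
text = status.getElementsByTagName("text")[0].firstChild.data

# Linkify word by word, then rejoin, as outputRSS() does.
linked = " ".join(linkify(word) for word in text.split())

# Assumption: entity-encode the HTML before placing it inside <description>.
item_description = escape(linked)

print(linked)
print(item_description)
```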

A brief explanation of the entity and datastore code:
  1. Create an rssStore object from the Rss class in our entity.py file. Note that we pass a key name of 'almightyolive', which is the unique identifier for this entity; putting an entity with the same key name again overwrites the stored one, so each cron run replaces the previous feed.
  2. Set the entity's properties, most importantly content, which holds the generated feed
  3. Call the put() method on our rssStore object to write it to the Datastore
And that's it!

The feed app

Now we need to create our front end to serve the RSS feed XML. Create a new file called feed.py and add the following:

# The webapp2 framework
import webapp2

# Our datastore interface
from google.appengine.ext import db

import entity

# Fetches a datastore object and displays it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-except statement
        try:
            # Get the key for an RSS entity called almightyolive
            feed_k = db.Key.from_path('Rss', 'almightyolive')

            # Retrieve object from datastore (None if it does not exist yet)
            feed = db.get(feed_k)

            if feed is None:
                # Nothing stored yet, e.g. the cron job has not run
                self.response.out.write("<html><body><p>Feed not available yet</p></body></html>")
            else:
                # Outputs the RSS
                self.response.out.write(feed.content)

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

Pretty simple, huh? Now onto the configuration files....
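The missing-entity case is worth isolating, because db.get() returns None when nothing is stored under the key (for example, before the cron job has run once). Here is a minimal sketch of that decision as a plain function; `render_feed` and `FakeFeed` are hypothetical names used only for illustration, and the 404 status and content types are my own choices rather than anything the original app sets.

```python
def render_feed(feed):
    # Returns (status code, content type, body) for a fetched entity or None.
    if feed is None:
        return 404, "text/plain", "Feed not generated yet"
    return 200, "application/rss+xml", feed.content

class FakeFeed(object):
    # Stand-in for an entity.Rss instance (assumption for this sketch).
    def __init__(self, content):
        self.content = content

print(render_feed(None))
print(render_feed(FakeFeed("<rss version='2.0'></rss>")))
```

Inside the handler, the returned content type would go into self.response.headers['Content-Type'] and the body into self.response.out.write().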

app.yaml and cron.yaml

Let's start with app.yaml first. Add the following:

application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: no

handlers:
- url: /cron
  script: cron.py
  login: admin
 
- url: /.*
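  script: feed.app

Note that the app is no longer marked threadsafe. This is because the /cron handler points at cron.py, a plain script that runs from top to bottom, rather than at a WSGI application object like feed.app, and the Python 2.7 runtime only accepts such script handlers when threadsafe is off. I also added a line to our cron handler: 'login: admin'. This restricts access to the URL to the administrators of the application, while App Engine's own cron service can still call it.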

Now create cron.yaml with the following:

cron:
- description: hourly feed refresh job
  url: /cron
  schedule: every 1 hours

And there you have it! Once you upload the app, the cron.py script will run every hour. You can request /cron manually first (logged in as an administrator) to populate the Datastore, and then view the RSS feed!

References
