Jun 26, 2012

Handling HTTP GET requests with webapp2 and Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. This particular blog post builds upon the code given in my previous posts 'URL Routing through WebApp2 in Google App Engine: Python' and 'Cron and Datastore in Google App Engine: Python', which in turn build upon my earlier work. If you don't understand parts of the code, I highly suggest you browse my earlier blog posts so you can understand some of the design decisions I have made.

A brief overview...

For those who are diving straight in, let me explain the old code and how I will update it:

I have a script feed.py that I have mapped using app.yaml. A cron job (configured by cron.yaml) connects to my Twitter account and converts my status updates into an RSS feed. It then stores the RSS feed in a Google Datastore object.

The feed script takes the Datastore object and displays it. We use another script (entity.py) to define the Datastore object.

We will now configure the system so that it can convert multiple Twitter accounts into RSS feeds. To display a particular RSS feed we will use an HTTP GET parameter.

The main application

We will create a file called feed.py. This script will be our controller; it receives the HTTP requests and maps them to handler classes. These classes then call other functions to perform the required tasks.

# The webapp2 framework
import webapp2

# Our datastore interface
from google.appengine.ext import db

# Our entity library
import entity

# Our XML2RSS library
import XML2RSS

# Fetches each Twitter feed, converts it to RSS, and stores it in the datastore
class Cron(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            XML2RSS.getTweets("almightyolive")
            XML2RSS.getTweets("founding")
            XML2RSS.getTweets("ABCNews24")
            XML2RSS.getTweets("SBSNews")
       
        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Serves a stored RSS feed, selected by the 'account' GET parameter
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Read the account name from the query string, e.g. /?account=almightyolive
            account = self.request.get('account')

            # Build the key for the stored feed and fetch it from the datastore
            feed_k = db.Key.from_path('Rss', account)
            feed = db.get(feed_k)
           
            # Outputs the RSS
            self.response.out.write(feed.content)

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs (Type or Value Error)</p></body></html>")
        except:
            self.response.out.write("<html><body><p>Unspecified Error</p></body></html>")

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/', MainPage),('/cron', Cron)], debug=True)
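
If you want to sanity-check the routing and the account parameter without opening a browser, webapp2 requests can be run against the application object directly. This is a minimal sketch, assuming the SDK's service stubs are available (e.g. via the testbed); the account name is just an example:

# A quick local test of the routing (a sketch; requires the GAE SDK stubs)
import webapp2
import feed

# Build a fake GET request carrying the 'account' parameter
request = webapp2.Request.blank('/?account=almightyolive')

# Run it through our WSGI application and inspect the result
response = request.get_response(feed.app)
print response.status
print response.body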

The XML2RSS script

As you may have noticed, the feed.py script makes reference to an XML2RSS object. This is a separate script that factors the XML-to-RSS conversion out into easy-to-call functions. Create a new file called XML2RSS.py and add the following:

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Our entity library
import entity

# Detects URLs, @handles, and #hashtags and wraps them in escaped HTML link tags
def linkify(text):
    # If 'http' is present, add the link tag
    if "http" in text:
        text = "&lt;a href='" + text + "'&gt;" + text + "&lt;/a&gt;"
    # If '@' is present, turn it into a Twitter handle link
    elif "@" in text:
        text = "&lt;a href='http://twitter.com/#!/" + text.split("@")[1] + "'&gt;" + text + "&lt;/a&gt;"
    # If '#' is present, link to a Twitter hashtag search
    elif "#" in text:
        text = "&lt;a href='https://twitter.com/#!/search/%23" + text.split("#")[1] + "'&gt;" + text + "&lt;/a&gt;"
       
    return text

# Converts the parsed XML into an RSS string for the given account
def outputRSS(xml, account):
    # Get the status list
    statuses = xml.getElementsByTagName("status")

    # Our return string
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>\n\t\t<title>Twitter: " + account + "</title>\n\t\t"
    outputString+= "<link>https://twitter.com/#!/" + account + "</link>\n\t\t<description>The twitter feed for " + account + "</description>"

    # Cycle through the statuses
    for status in statuses:
        # Gets the text, date, and id of each status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        tweet = status.getElementsByTagName("id")[0].firstChild.data
       
        # Insert links into the text
        words = text.split()
       
        for i in range (len(words)):
            words[i] = linkify(words[i])
       
        # Recompile words
        text = " ".join(words)
       
        # Creates our output
        string = "\n\t\t<item>\n\t\t\t<title>" + str(date) + "</title>\n\t\t\t<link>https://twitter.com/" + account + "/status/" + tweet + "</link>\n\t\t\t<description>" + str(text) + "</description>\n\t\t</item>"
        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# Our RSS storage function
def getTweets(account):
    # Grabs the XML
    url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=' + account + '&count=10&trim_user=true')
           
    # Parses the document
    xml = parseString(url.content)

    # Converts the XML into RSS
    content = outputRSS(xml, account)
   
    # Our RSS storage entity, keyed by the account name
    rssStore = entity.Rss(key_name=account)

    # Elements of our RSS
    rssStore.feed = account
    rssStore.content = content

    # Stores our RSS Feed into the datastore
    rssStore.put()
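
As a quick sanity check of linkify(), here is a sketch of what it should produce for the three word types it handles (importing XML2RSS needs the App Engine SDK on your path because of the urlfetch import):

# Expected behaviour of linkify() on the three word types it handles
import XML2RSS

print XML2RSS.linkify("http://example.com")
# &lt;a href='http://example.com'&gt;http://example.com&lt;/a&gt;

print XML2RSS.linkify("@almightyolive")
# &lt;a href='http://twitter.com/#!/almightyolive'&gt;@almightyolive&lt;/a&gt;

print XML2RSS.linkify("#python")
# &lt;a href='https://twitter.com/#!/search/%23python'&gt;#python&lt;/a&gt;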

The pieces to make it all work

If you have been following on from my previous work, then you should already have most of this code. I won't bother explaining it here because it is mostly self-explanatory.

app.yaml:
application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /cron
  script: feed.app
  login: admin
 
- url: /.*
  script: feed.app

cron.yaml:


cron:
- description: hourly feed refresh job
  url: /cron
  schedule: every 1 hours

entity.py:

# Our datastore interface
from google.appengine.ext import db

# Our RSS entity object
class Rss(db.Model):
    feed = db.StringProperty()
    content = db.TextProperty()

And that's it! You now have a fully functional application that just uses the webapp2 framework!

If you navigate to http://localhost:8080/?account=almightyolive you should now see the RSS feed. You can test whether your mapping works by navigating to http://localhost:8080/?account=founding; you should see the Founding Institute Twitter account instead!



Jun 25, 2012

URL Routing through WebApp2 in Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. This particular blog post builds upon the code given in my previous post Cron and Datastore in Google App Engine: Python, which in turn builds upon my earlier work. If you don't understand parts of the code, I highly suggest you browse my earlier blog posts so you can understand some of the design decisions I have made.

A brief overview...

For those who are diving straight in, let me explain the old code and how I will update it:

I have two scripts (feed.py and cron.py) that I have mapped using app.yaml. The cron script simply connects to my Twitter account and converts my status updates into an RSS feed. It then stores the RSS feed into a Google Datastore object.

The feed script takes the Datastore object and displays it. Both scripts use a third script (entity.py) to define the Datastore object.

Currently the set-up is not thread-safe because I have to use two different scripts to handle my incoming requests. The plan is to replace this set-up with one that is thread-safe. Effectively, we will be using the URL routing functionality provided by the webapp2 framework.

Combining the scripts

The first thing we will do is combine both cron.py and feed.py into one script. The following code should be saved to a file called feed.py:

# The webapp2 framework
import webapp2

# Our datastore interface
from google.appengine.ext import db

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Our entity library
import entity

# Detects if it is a URL link and adds the HTML tags
def linkify(text):
    # If 'http' is present, add the link tag
    if "http" in text:
        text = "&lt;a href='" + text + "'&gt;" + text + "&lt;/a&gt;"
    elif "@" in text:
        text = "&lt;a href='http://twitter.com/#!/" + text.split("@")[1] + "'&gt;" + text
        text+= "&lt;/a&gt;"
    elif "#" in text:
        text = "&lt;a href='https://twitter.com/#!/search/%23" + text.split("#")[1] + "'&gt;" + text + "&lt;/a&gt;"
       
    return text

# Converts the parsed XML into an RSS string
def outputRSS(xml):
    # Get the status list
    statuses = xml.getElementsByTagName("status")
   
    # Our return string
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>"
    outputString+= "\n\t\t<title>Almightyolive Twitter</title>\n\t\t"
    outputString+= "<link>https://twitter.com/#!/almightyolive</link>\n"
    outputString+= "\t\t<description>The twitter feed for the Almighty "
    outputString+= "Olive</description>"
   
    # Cycle through the statuses
    for status in statuses:
        # Gets the text, date, and id of each status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        tweet = status.getElementsByTagName("id")[0].firstChild.data
       
        # Insert links into the text
        words = text.split()
       
        for i in range (len(words)):
            words[i] = linkify(words[i])
       
        # Recompile words
        text = " ".join(words)
       
        # Creates our output
        string = "\n\t\t<item>\n\t\t\t<title>" + str(date) + "</title>\n"
        string+= "\t\t\t<link>https://twitter.com/AlmightyOlive/status/" + tweet
        string+= "</link>\n\t\t\t<description>" + str(text) + "</description>\n"
        string+= "\t\t</item>"
        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# Fetches the Twitter feed, converts it to RSS, and stores it in the datastore
class Cron(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Grabs the XML
            url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=almightyolive&count=10&trim_user=true')
           
            # Parses the document
            xml = parseString(url.content)

            content = outputRSS(xml)
            # Our RSS storage entity
            rssStore = entity.Rss(key_name='almightyolive')
           
            # Elements of our RSS
            rssStore.feed = "almightyolive"
            rssStore.content = content

            # Stores our RSS Feed into the datastore
            rssStore.put()
       
        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Serves the stored RSS feed from the datastore
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Build the key for the stored feed and fetch it from the datastore
            feed_k = db.Key.from_path('Rss', 'almightyolive')
            feed = db.get(feed_k)
           
            # Outputs the RSS
            self.response.out.write(feed.content)

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/', MainPage),('/cron', Cron)], debug=True)

The big changes are:
  • We have added a new class called Cron, which contains all of the loose code that used to live in cron.py
  • We have added a new URL mapping to our WSGI application. This hands any request for '/cron' over to our new Cron class (see the routing sketch below)
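
For completeness, here is a sketch of webapp2's richer routing, which we could use later to replace regex tuples with named URI templates. This is not part of this post's code, and FeedHandler is a hypothetical handler name:

# A sketch of webapp2 URI templates (FeedHandler is hypothetical)
import webapp2

class FeedHandler(webapp2.RequestHandler):
    def get(self, account):
        # 'account' is captured from the URL, e.g. /feed/almightyolive
        self.response.out.write('Feed requested: ' + account)

app = webapp2.WSGIApplication([
    webapp2.Route('/feed/<account>', handler=FeedHandler),
], debug=True)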

The pieces to make it all work

If you have been following on from my previous work, then you should already have most of this code. The only thing you need to touch is one line in app.yaml, which is to map /cron to our feed webapp.

app.yaml:
application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /cron
  script: feed.app
  login: admin
 
- url: /.*
  script: feed.app

cron.yaml:


cron:
- description: hourly feed refresh job
  url: /cron
  schedule: every 1 hours

entity.py:

# Our datastore interface
from google.appengine.ext import db

# Our RSS entity object
class Rss(db.Model):
    feed = db.StringProperty()
    content = db.TextProperty()

And that's it! You now have a fully functional application that just uses the webapp2 framework!


Jun 15, 2012

Cron and Datastore in Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. If you are just starting out I suggest you start reading Getting Started and First App. If you are after parsing XML or HTML files please see my posts 'Parsing XML with Google App Engine: Python' or 'Parsing HTML with lxml and Google App Engine: Python'.

This example will use the code from my previous post 'Convert Twitter stream into an RSS feed'. If you don't understand why I did something, just pop on over to that link and see how I came up with the code originally.

I STRONGLY suggest you read up on how Google App Engine handles cron and the Datastore, and on the basics of entities and keys, before diving in.

What our plan is...

Ok, let me briefly go over what my proposed system will do and how cron and GAE's Datastore fits in.

In a previous blog post I created a web app that would connect to someone's Twitter feed and convert it into an RSS feed. A problem with this set-up was that there was a significant lag (about 2-3 seconds) while the app downloaded the stream, parsed it, inserted links, and output an RSS XML file.

To solve this, I will create a cron script that runs in the background (I will also hide it behind an administration login so that random users cannot trigger it). This requires me to store the feed in a persistent object, which the Datastore conveniently supplies.

Now for the code....

The RSS object

We will first create a Python object that we will use to define the objects we store into our database. Just create a file called entity.py and add the following code:

# Our datastore interface
from google.appengine.ext import db

# Our RSS entity object
class Rss(db.Model):
    feed = db.StringProperty()
    content = db.TextProperty()

Note that we import the db module; this is our interface to the Datastore. If you want to know more about creating entities, I suggest you read the official Datastore documentation.
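
To get a feel for the round-trip before we wire it into cron, here is a minimal sketch of storing and fetching an Rss entity by key name ('almightyolive' is just an example key):

# A sketch of the db.Model round-trip used in this post
import entity

# Create an entity with a named key so we can find it again later
rss = entity.Rss(key_name='almightyolive')
rss.feed = 'almightyolive'
rss.content = '<rss>...</rss>'
rss.put()

# Fetch it back by the same key name
stored = entity.Rss.get_by_key_name('almightyolive')
if stored is not None:
    print stored.content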

The cron script

This script will do everything we did in our previous blog post, except it will store the feed in our RSS entity and push it into the Datastore. Create a file called cron.py and insert the following code:

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Our entity library
import entity

# Detects if it is a URL link and adds the HTML tags
def linkify(text):
    # If 'http' is present, add the link tag
    if "http" in text:
        text = "&lt;a href='" + text + "'&gt;" + text + "&lt;/a&gt;"
    # If '@' is present, turn it into a Twitter handle link
    elif "@" in text:
        text = "&lt;a href='http://twitter.com/#!/" + text.split("@")[1] + "'&gt;" + text
        text+= "&lt;/a&gt;"
    # Turn '#' words into Twitter hashtag search links
    elif "#" in text:
        text = "&lt;a href='https://twitter.com/#!/search/%23" + text.split("#")[1] + "'&gt;" + text
        text+= "&lt;/a&gt;"

    return text

# Output the XML into an RSS feed
def outputRSS(xml):
    # Get the status list
    statuses = xml.getElementsByTagName("status")
   
    # Our return string
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>"
    outputString+= "\n\t\t<title>Almightyolive Twitter</title>\n\t\t"
    outputString+= "<link>https://twitter.com/#!/almightyolive</link>\n"
    outputString+= "\t\t<description>The twitter feed for the Almighty "
    outputString+= "Olive</description>"

    # Cycle through the statuses
    for status in statuses:
        # Gets the text, date, and id of each status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        tweet = status.getElementsByTagName("id")[0].firstChild.data
       
        # Insert links into the text
        words = text.split()
       
        for i in range (len(words)):
            words[i] = linkify(words[i])
       
        # Recompile words
        text = " ".join(words)
       
        # Creates our output
        string = "\n\t\t<item>\n\t\t\t<title>" + str(date) + "</title>\n"
        string+="\t\t\t<link>https://twitter.com/AlmightyOlive/status/" + tweet
        string+= "</link>\n\t\t\t<description>" + str(text) + "</description>\n"
        string+= "\t\t</item>"

        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# OUR CRON SCRIPT PROPER!
#
# Grabs the XML
url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=almightyolive&count=10&trim_user=true')
           
# Parses the document
xml = parseString(url.content)

content = outputRSS(xml)
# Our RSS storage entity
rssStore = entity.Rss(key_name='almightyolive')

# Elements of our RSS
rssStore.feed = "almightyolive"
rssStore.content = content

# Stores our RSS Feed into the datastore
rssStore.put()

The functions linkify() and outputRSS() are exactly the same as in the previous blog post (with the addition of hashtag support to linkify()). The biggest difference is that we replace MainPage and the webapp-specific code with a simple sequential script (which, in actuality, is not unlike the content of MainPage).

A brief explanation of the entity and datastore code:
  1. Create the rssStore object as defined by the Rss class in our entity.py file. Note that we pass a key_name of 'almightyolive', which is our unique identifier for this object.
  2. Store our object values, especially our feed content
  3. Call the put() method on our rssStore object to push it into the Datastore
And that's it!

The feed app

Now we need to create our front-end to access the RSS feed xml. Create a new file called feed.py and add the following:

# The webapp2 framework
import webapp2

# Our datastore interface
from google.appengine.ext import db

import entity

# Fetches a datastore object and displays it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Get the key for an RSS entity called 'almightyolive'
            feed_k = db.Key.from_path('Rss', 'almightyolive')

            # Retrieve the object from the datastore
            feed = db.get(feed_k)
           
            # Outputs the RSS
            self.response.out.write(feed.content)

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

Pretty simple, huh? Now onto the configuration files....

app.yaml and cron.yaml

Let's start with app.yaml first. Add the following:

application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: no

handlers:
- url: /cron
  script: cron.py
  login: admin
 
- url: /.*
  script: feed.app

Note that this is no longer threadsafe; that is because our /cron handler now points at a plain CGI-style script (cron.py) rather than a WSGI application, and the python27 runtime only supports threadsafe: yes when every handler does. The separate handler entry also lets me add the line 'login: admin' for our cron handler, which restricts access to that URL to the administrators of the application.

Now create cron.yaml with the following:

cron:
- description: hourly feed refresh job
  url: /cron
  schedule: every 1 hours

And there you have it! Once you upload it, the cron.py script will run every hour (you can run it manually first to populate the database), and you can then view the RSS feed!


Jun 13, 2012

Real Time Operating Systems: Subject Notes

RTOS (Real Time Operating Systems) is a subject that was offered by the University of Technology, Sydney (UTS). These are some of my notes from that subject.

Pre-emption (or context switch)

  • Pre-emption is the ability of the Operating System to stop a currently scheduled task in favour of a higher priority task. Enables pre-emptive multi-tasking.
  • An interrupt generally denotes something that needs to be handled straight away. Normal scheduling is avoided until the interrupt is handled; in other words, interrupts generally cannot be pre-empted.
  • By making the scheduler pre-emptive, we make it more responsive to events. The downside is that it is more susceptible to race conditions, where the executing program modifies/uses data the pre-empted process has not finished using.
  • Pre-emptive multi-tasking can be compared to co-operative multi-tasking, where each process must yield its time to other processes. This requires every process to be co-operative and not hog the CPU.
  • A scheduler that can pre-empt a process during a system call is a pre-emptive kernel.
[Diagram: a simple flow of how pre-emption works. The red box is a low-priority thread that is pre-empted, the yellow and blue boxes are interrupts, and the green box is a higher-priority thread.]

Process

  • A process is an instance of a program in execution. The execution happens in a sequential fashion.
  • A process has at a minimum a copy of the program code, a Process Control Block (discussed later), a Stack (to keep track of active subroutines and events) and a Data Section or Heap. The data section includes the program code, process-specific data (such as inputs and outputs) and storage of intermediate data.
  • A process will start in the New state, then go to Ready (waiting to be assigned). It can then be put into either the Running (executing) or Waiting (waiting on I/O operations) states until it reaches the Terminated state.
    [Diagram: a visualisation of the process states]
  • The Process Control Block (PCB) contains the information associated with a particular process and its context. This includes the Process state (see above diagram), a Program Counter (holds the memory address of the next program instruction to execute), CPU Registers (for storage of process specific data), CPU Scheduling Information (so the CPU can make scheduling decisions), Memory-management information, Accounting information (how long the last run was, how much time accumulated, etc), and I/O status information.
  • When we wish to switch the executing process, the CPU will store the current state of the running process into a PCB and load up the state of the next process from their PCB. When the process has completed executing, the CPU will then store that process' PCB and reload the previous PCB. This is called Context Switching.
  • Context switching is considered overhead. It is time in which the CPU does nothing.
  • The queues used to store the details of processes are the Job Queue (all processes in the system), the Ready Queue (processes in main memory waiting to execute), and the Device Queue (processes waiting for an I/O device to respond).
  • A Scheduler can be either Long-term (selects processes to be brought into the ready queue, and does not need to be fast because it matches the speed of I/O devices) or Short-term (selects process from ready-queue to be executed, and needs to match the CPU speed)
  • Because the long-term scheduler controls which processes enter the ready queue, it determines the degree of multi-programming a computer can handle.
  • A process can either be considered an I/O bound process (does more I/O tasks) or a CPU bound process (does more computations)
  • Processes are generally created from a parent process, which in turn creates its own child processes. This creates a tree-hierarchy of processes.
  • Aspects of the relationship between a child process to its parent are:
    • Resource sharing: The parent and child can share all resources, a sub-set of resources, or no resources at all
    • Execution: The parent and child run concurrently, or the parent waits until the child terminates
    • Address space: The child is an exact duplicate of the parent, or the child has another program image loaded into it
    • Termination: The process can make a call to exit(), which makes the system deallocate process resources, or the parent can abort the child. Some operating systems do not allow the child to continue if the parent is terminated
  • Processes can communicate by either Message passing (which passes through the kernel) or through Shared Memory (much quicker, but can lead to race conditions). Message passing requires that a communication link is established, either physical or logical (see the sketch after this list).
  • Message passing can be Blocking/Synchronous (The sender must wait until the message is received or the receiver must wait until a message is sent) or Non-blocking/Asynchronous (both entities do not have to wait)
  • The Client-Server communication model uses Sockets (end-point for communication), Remote Procedure Call or RPC (abstract procedure calls across networked systems. Uses a Stub to locate and marshal/pack the parameters into a message) and Remote Method Invocation or RMI (similar to RPC but for Object-Oriented applications).
  • A Light Weight Process (LWP) is different to normal processes in that it shares some or all of its logical address space and system resources with other processes. It differs from threads (discussed later) in that it has its own Process Identifier and is fully controlled by the kernel (threads are controlled by the application). This set-up means that a LWP has less processing overhead.
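
As a concrete illustration of message passing, here is a sketch using Python's multiprocessing module rather than raw OS primitives: the queue carries messages through the kernel, so no memory is shared between the processes.

# A sketch of message passing between a parent and child process
from multiprocessing import Process, Queue

def worker(queue):
    # The child sends a message; it shares no memory with the parent
    queue.put('hello from the child process')

if __name__ == '__main__':
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    print queue.get()   # a blocking (synchronous) receive
    child.join()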

Threads

  • Threads are by-products of processes in that they share everything the parent process has. Each thread has its own stack and registers to store information, but it shares code and files with other threads and the parent process. This lessens the strain on resources.
  • Threads can be implemented through the User-level (via libraries) or the Kernel-level (direct kernel support)
  • There are three mapping schemes used to map user (process) threads to kernel threads.
    1. One-to-One scheme has one user thread to every kernel thread
    2. Many-to-One scheme has many user threads to every kernel thread
    3. Many-to-many scheme has many user threads running off many kernel threads
  • Java threads are handled by the Java Virtual Machine (JVM). To use Java threads you must extend the Thread class or implement the Runnable interface.
  • Terminating a thread involves one of two methods: Asynchronous (terminate the thread immediately) or Deferred (allow the thread to periodically check whether it should terminate); a deferred-termination sketch follows this list
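
A sketch of deferred termination using Python's user-level threading library (the thread polls a stop flag rather than being killed asynchronously):

# A sketch of deferred thread termination: the thread polls a stop flag
import threading
import time

stop_requested = threading.Event()

def work():
    # Periodically check whether termination has been requested
    while not stop_requested.is_set():
        time.sleep(0.1)   # stand-in for real work

t = threading.Thread(target=work)
t.start()
stop_requested.set()      # politely ask the thread to finish
t.join()
print 'thread terminated cleanly'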

CPU Scheduling

  • The Scheduler will select a process from the ready queue and allocate the CPU to that process.
  • The Dispatcher gives control of the CPU to the process selected by the short-term scheduler. It handles context switching and jumping to the correct location in the program code.
  • Dispatcher Latency is the term used to describe the time it takes for the dispatcher to stop one program and start another
  • The criteria that the scheduler uses to determine which process to select include:
    • CPU Utilization: process keeps the CPU busy
    • Throughput: number of processes that complete execution per time unit
    • Turnaround time: total time to execute a particular process
    • Waiting time: amount of time a process has been waiting in the queue
    • Response time: time from when the request was made until the first response is produced
  • Scheduling schemes include:
    • First Come First Served (FCFS): Tasks are executed in the order that they arrive. This means that a high-priority process will have to wait until other tasks are executed first. In addition, a long process will significantly slow down all other processes.
    • Shortest Job First (SJF): Each process will be associated with the estimated time of its next CPU burst. The shortest job at the time will be executed over other processes. This significantly reduces the average waiting time (see the sketch after this list).
    • Round Robin (RR): The CPU has a time quantum of q time units. Each process will be executed for that amount of time before being pre-empted. If there are n processes, each process will get 1/n of the CPU time in time chunks of q. No process waits more than (n-1)q time units; for example, with n = 4 and q = 20 ms, no process waits more than 60 ms. Note that q has to be large enough, otherwise there is too much context-switching overhead.
  • Priority Scheduling means associating a number with a process to aid in scheduling decisions. A problem with such solutions is starvation (where a low priority process never gets the chance to run). The solution is aging (increase the priority of a process over time).
  • Multi-level queues means splitting the Ready Queue into two:
    • Foreground are the interactive user applications. They generally require some form of user input. They typically use Round Robin scheduling.
    • Background are the batch jobs and system services. They do not require any sort of user input. They typically use FIFO or FCFS scheduling.
  • CPU scheduling becomes more complex when multiple CPUs are added. It needs to share the load.
  • Hard Real Time Scheduling means that critical processes are guaranteed to complete within a set period of time. Soft Real Time Scheduling only requires that critical processes receive priority over other processes
  • Thread scheduling can occur locally through libraries or globally through the kernel.
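
To make the waiting-time difference concrete, here is a small sketch comparing FCFS and SJF on the same (made-up) burst times:

# A sketch comparing average waiting time under FCFS and SJF
def average_waiting_time(bursts):
    waits, elapsed = [], 0
    for burst in bursts:
        waits.append(elapsed)   # time this process spent waiting in the queue
        elapsed += burst
    return sum(waits) / float(len(waits))

bursts = [24, 3, 3]   # CPU bursts in ms, in arrival order

print average_waiting_time(bursts)           # FCFS: (0 + 24 + 27) / 3 = 17.0
print average_waiting_time(sorted(bursts))   # SJF:  (0 + 3 + 6) / 3 = 3.0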

Process Synchronization

  • Unsynchronized concurrent access to shared data is not desirable, as it can result in race conditions and data inconsistency. Therefore we must implement mechanisms to ensure data remains consistent.
  • Terms used in process synchronization are:
    • Mutual Exclusion: When one process is operating in its critical section no other process can enter their critical section
    • Race condition: Where there is concurrent access to data and the result is dependent on the order of execution
    • Critical section: Section of code where the process accesses shared data
    • Entry section: Section of code where the process asks if it can move into the critical section
    • Exit section: Section of code where the process releases the shared data
    • Bounded waiting: A bound must exist on the number of times other processes are allowed to enter their critical sections after a process has made a request to enter its critical section.
    • Locks: Flags used to determine whether the data is in use. The entry section will acquire the lock, while the exit section will release it
    • Semaphore: A synchronization tool that can simply be an integer value with two methods: acquire and release. It can be a Counting semaphore (no bounds) or a Binary semaphore (also known as a mutex lock); see the sketch after this list
    • Monitors are a high-level abstraction that provide a convenient mechanism for process synchronization. Only one process may be active in a monitor at a time.
  • For semaphores to work, we must guarantee that no two processes can execute the acquire or release operations on the same semaphore at the same time. To avoid busy waiting, a semaphore can also maintain a wait list: a linked list of the processes blocked on it, which are woken on release.
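
A minimal sketch of the acquire/release pattern with a binary semaphore, using Python's threading module (the bank-balance data is made up):

# A sketch of a binary semaphore (mutex) guarding a critical section
import threading

semaphore = threading.Semaphore(1)   # binary semaphore, i.e. a mutex lock
balance = [100]                      # shared data

def withdraw(amount):
    semaphore.acquire()              # entry section
    try:
        balance[0] -= amount         # critical section: touches shared data
    finally:
        semaphore.release()          # exit section

threads = [threading.Thread(target=withdraw, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print balance[0]                     # always 50: no lost updates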

Main Memory

  • A program must be brought into memory and placed within a process to be run. Main Memory (generally RAM), Registers and Cache are the only storage elements that a CPU can directly access.
  • Memory address types are:
    • Logical Address: generated by the CPU and is an abstraction for user-level programs so that we do not have to specify the physical address at compile time
    • Physical Address: what is seen by the memory management unit. This is the address the RAM unit uses to refer to a physical element.
  • The Memory Management Unit (MMU) maps Logical to Physical addresses. It will add a relocation register to the logical address sent by a user program to determine a real physical address.
  • Dynamic Loading involves loading a routine only if it is needed. This saves system resources and can be implemented by the program, so no special support from the OS is required.
  • Dynamic Linking (or shared libraries) is a method where linking does not occur until execution. A small piece of code (known as a stub) is inserted into the program, which attempts to locate the required routine in memory. The stub will then replace itself with the address of the loaded routine
  • Swapping involves moving a process out of main memory and into secondary storage. It can later be moved back for continued execution. However, the system must maintain a ready queue of processes ready to run but are currently swapped out.
  • Physical memory is divided into fixed-size blocks called frames, while Logical memory is divided into blocks called pages. The OS will keep track of all free frames so that when a process requires n pages it will find n free frames and load the program. Note that while logical memory may seem contiguous to the process, it may actually be non-contiguous in physical memory. To facilitate this we need a page table to translate logical addresses to physical addresses.
  • Segmentation: a memory management scheme that acknowledges a process is a collection of segments. Segments are distinct from other segments, but are bounded within the process. In physical memory they can occupy the same frame.

Real Time Systems

  • A Real Time system requires that processes be finished before a certain deadline. Real-time does not mean Real-fast; it means that the system can respond to external environmental events almost instantly, seemingly running in "real-time".
  • An Embedded system is part of a larger system
  • A Safety critical system has catastrophic results in the case of failure
  • Hard real-time systems guarantee that processes will be completed before the deadline. Soft real-time systems only prioritize real-time tasks over other processes
  • Real-time systems are generally developed for a single purpose and have specific timing requirements. To achieve this, real-time systems are often developed using the System-on-Chip (SoC) strategy, which puts all the required hardware onto a single integrated circuit. This is in contrast to a Bus orientated system, which separates these components.

Jun 10, 2012

Convert Twitter into RSS Feed with Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. If you are just starting out I suggest you start reading Getting Started and First App. If you are after parsing XML or HTML files please see my posts 'Parsing XML with Google App Engine: Python' or 'Parsing HTML with lxml and Google App Engine: Python'.


Simple RSS syndication

We are going to assume you have already created a project (hint: you just need an app.yaml configuration file and a main.py file). If you don't know how, please refer to one of my earlier blog posts (above).

In your main.py file add the following:

# The webapp2 framework
import webapp2

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an XML document and parses it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Grabs the XML
            url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=almightyolive&count=10&trim_user=true')
           
            # Parses the document
            xml = parseString(url.content)
           
            # Outputs the RSS (note: no HTML wrapper; this response is pure XML)
            self.response.out.write(outputRSS(xml))

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Converts the parsed XML into an RSS string
def outputRSS(xml):
    # Get the status list
    statuses = xml.getElementsByTagName("status")
   
    # Set up our XML return
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>"
    outputString+= "\n\t\t<title>Almightyolive Twitter</title>\n\t\t"
    outputString+= "<link>https://twitter.com/#!/almightyolive</link>\n"
    outputString+= "\t\t<description>The twitter feed</description>"

    # Cycle through the statuses
    for status in statuses:
        # Gets the text and date of each status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        string = "\n\t\t<item>\n\t\t\t<title>" + str(date) + "</title>\n"
        string+= "\t\t\t<link>https://twitter.com/#!/almightyolive</link>\n\t\t"
        string+= "\t<description>" + str(text) + "</description>\n\t\t</item>"

        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

This is a very simple feed; you won't get any nice links and clicking on a particular item will just take you to the main feed. Now to tweak it just a little....


RSS with links

Now we will add a function that will add appropriate links to our tweets. Note that this is a really, really dumb function: it will apply to ANY instances of 'http' or '@' in a word, so it will accidentally affect emails or tweets about the HTTP protocol. I leave it up to you to fix the code if you don't want these things to happen.

Anyway, replace main.py with the following code:

# The webapp2 framework
import webapp2

# The minidom library for XML parsing
from xml.dom.minidom import parseString

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an XML document and parses it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # A try-catch statement
        try:
            # Grabs the XML
            url = urlfetch.fetch('https://api.twitter.com/1/statuses/user_timeline.xml?screen_name=almightyolive&count=10&trim_user=true')
           
            # Parses the document
            xml = parseString(url.content)
           
            # Outputs the RSS
            self.response.out.write(outputRSS(xml))

        # Our exception code
        except (TypeError, ValueError):
            self.response.out.write("<html><body><p>Invalid inputs</p></body></html>")

# Converts the parsed XML into an RSS string
def outputRSS(xml):
    # Get the status list
    statuses = xml.getElementsByTagName("status")
   
    # Our return string
    outputString = "<?xml version='1.0'?>\n<rss version='2.0'>\n\t<channel>"
    outputString+= "\n\t\t<title>Almightyolive Twitter</title>\n\t\t"
    outputString+= "<link>https://twitter.com/#!/almightyolive</link>\n"
    outputString+= "\t\t<description>The twitter feed for the Almighty "
    outputString+= "Olive</description>"

    # Cycle through the statuses
    for status in statuses:
        # Gets the text, date, and id of each status
        text = status.getElementsByTagName("text")[0].firstChild.data
        date = status.getElementsByTagName("created_at")[0].firstChild.data
        tweet = status.getElementsByTagName("id")[0].firstChild.data
       
        # Insert links into the text
        words = text.split()
       
        for i in range (len(words)):
            words[i] = linkify(words[i])
       
        # Recompile words
        text = " ".join(words)
       
        # Creates our output
        string = "\n\t\t<item>\n\t\t\t<title>" + str(date) + "</title>\n"
        string+= "\t\t\t<link>https://twitter.com/AlmightyOlive/status/" + tweet
        string+= "</link>\n\t\t\t<description>" + str(text) + "</description>\n"
        string+= "\t\t</item>"

        outputString+=string
       
    # Output string
    outputString += "\n\t</channel>\n</rss>"
    return outputString   

# Detects if it is a URL link and adds the HTML tags
def linkify(text):
    # If 'http' is present, add the link tag
    if "http" in text:
        text = "&lt;a href='" + text + "'&gt;" + text + "&lt;/a&gt;"
    # If '@' is present, turn it into a Twitter handle link
    elif "@" in text:
        text = "&lt;a href='http://twitter.com/#!/" + text.split("@")[1] + "'&gt;" + text + "&lt;/a&gt;"

    return text

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

And there we have it; the linkify() function will add links into our tweets to make them more usable and accessible!
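
If you do want to fix those false positives, here is a sketch of a stricter linkify() using the re module; it only matches words that begin with the interesting token (still simplistic, and the trailing-punctuation case is left as an exercise):

# A sketch of a stricter linkify() using anchored regular expressions
import re

def linkify(text):
    # Only link words that start with http(s)://, @ or #
    if re.match(r'https?://', text):
        return "&lt;a href='" + text + "'&gt;" + text + "&lt;/a&gt;"
    match = re.match(r'@(\w+)$', text)
    if match:
        return ("&lt;a href='http://twitter.com/#!/" + match.group(1) +
                "'&gt;" + text + "&lt;/a&gt;")
    match = re.match(r'#(\w+)$', text)
    if match:
        return ("&lt;a href='https://twitter.com/#!/search/%23" + match.group(1) +
                "'&gt;" + text + "&lt;/a&gt;")
    return text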


Jun 9, 2012

Parsing HTML with lxml and Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. If you are just starting out I suggest you start reading Getting Started and First App. If you are after parsing XML files please see my post 'Parsing XML with Google App Engine: Python'.

We are going to assume you will be using Eclipse and a fresh project. In this example we are going to use Triple J Unearthed's Top 100 charts HTML page to parse.

Adding lxml to Google App Engine

The first thing we need to do is add the lxml library to our app.yaml configuration file. In your Eclipse project add a new file called app.yaml and add the following:

application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: triplej.app

libraries:
- name: lxml
  version: latest

Most of these fields were covered in Getting Started, but we now have a new field: libraries. This is where we declare any third-party libraries that are not included in the default GAE Python environment.

Using lxml

Create a new file called triplej.py and add the following code:

# The webapp2 framework
import webapp2

# lxml parser for XML and HTML
from lxml import etree

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an HTML document and parses it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # Grabs the HTML
        url = urlfetch.fetch('http://www.triplejunearthed.com/Charts/')

        # Parses the HTML
        tree = etree.HTML(url.content)

        # Converts the DOM into a string
        result = etree.tostring(tree, pretty_print=True, method="html")

        # Output the results onto the screen
        self.response.out.write(str(result))
   
       
# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

If you run this code you will notice that all it does is download the HTML page, parse it, and then output the page exactly as it was downloaded (minus all the images and CSS styling). Nothing impressive, but we have proved the concept works. Now on to something a little more beefy....
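
Before the beefier example, note that the parsed tree is already queryable; here is a sketch that pulls out just the page title with XPath:

# A sketch of querying the parsed tree with XPath
from lxml import etree
from google.appengine.api import urlfetch

url = urlfetch.fetch('http://www.triplejunearthed.com/Charts/')
tree = etree.HTML(url.content)

# xpath() returns a list of matching text nodes
titles = tree.xpath('//title/text()')
if titles:
    print titles[0]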

Parsing, Extracting and Cleaning the HTML

In this example we will perform multiple functions that will only extract the chart from the Triple J Unearthed website. Replace the triplej.py code with the following:

# The webapp2 framework
import webapp2

# lxml parser for XML and HTML
from lxml import html

# HTML cleaner
from lxml.html.clean import Cleaner

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an HTML document, cleans it, and extracts the chart
class MainPage(webapp2.RequestHandler):
    # Respond to a HTTP GET request
    def get(self):
        # Grabs the HTML
        url = 'http://www.triplejunearthed.com/Charts/'
        website = urlfetch.fetch(url)
       
        # Saves our content as a string
        page = str(website.content)

        # Parses the HTML
        tree = html.fromstring(page)

        # The ID string of the table element we want
        # NOTE: This is bound to change!!! Double check the HTML source first!!!
        elementID = "ctl00_ctl00_ctl00_ctl00_MainBody_ContentPlaceHolder1_ContentPlaceHolder1_ContentPlaceHolder1_GridView1"
       
        # Grab the chart element
        #
        # style: removes styling
        # links: removes links
        # add_nofollow: adds rel="nofollow" to anchor tags
        # page_structure: removes <html>, <head>, and <title> tags
        # safe_attrs_only: only allows safe element attributes
        # javascript: removes embedded javascript
        # scripts: remove script tags
        # kill_tags: remove the element and content
        # remove_tags: remove only the element, but not the content
        #
        # There are more available. See the API reference for lxml
        cleaner = Cleaner(style=True, links=True, add_nofollow=True,
                          page_structure=True, safe_attrs_only=True,
                          javascript=True, scripts=True,
                          kill_tags=set(['img', 'th']),
                          remove_tags=['div'])

       
        # Grab only our chart (but scrub it clean first!)
        chart = cleaner.clean_html(tree.get_element_by_id(elementID))
       
        # Change all relative links into absolute links based on the url
        chart.make_links_absolute(url)
       
        # Converts the DOM element into a string
        result = html.tostring(chart)
       
        # Output the results onto the screen
        self.response.out.write(result)        
       
# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/*', MainPage)], debug=True)

Running this code should result in a sanitized version of the Triple J Top 100 chart!
