This Site's RSS Generator

This is an ancient post from 2013. I’m not using any of these now.

Previously, I used a simple Pandoc setup to create RSS. Markdown files were converted to plain, headerless HTML and collected together to build an XML file. The obvious drawback is that every HTML file has to be generated by Pandoc, and anything that doesn't follow that route does not appear in the feed.

However, once I began to use Org Mode for data analysis and other tasks, I stopped touching Pandoc. Org Mode has extensive facilities for exporting to HTML and other document formats, so I no longer needed Pandoc for this.

I figured RSS could be produced by parsing the HTML files after they are generated. This requires an HTML parser, but parsing is simple and parsers exist for practically every programming language. My previous RSS generator was in Python and I decided to modify it to fit my needs. I think producing RSS for a static HTML site is a common need, and I tried to solve the problem as simply as possible.

Let's begin with the ubiquitous shebang line, which tells the system to run the script with Python.

#!/usr/bin/env python 

The following are the imports for this script. Apart from PyRSS2Gen, all modules are part of the Python 2.7 standard library.

import argparse
import codecs
import os
import datetime
from HTMLParser import HTMLParser
import PyRSS2Gen as rssgen
import operator as op
import re
import subprocess as proc

I use Mercurial to track the site's files. I once considered using Mercurial's public API to check the status of files, but it proved to be overkill: only the modification time of files is needed, and retrieving it with a standard command line call is much simpler. Hence I removed the following imports for the time being.

# from mercurial import commands as cmd
# from mercurial import hg
# from mercurial import ui as hgui

The following function returns an HTML start tag as a string, given the tag name and its attributes. HTMLParser passes the attributes as a list of (name, value) pairs, and I use this function to convert them back to the usual HTML tags.

def make_tag(tag, attrs):
    content_list = [tag]
    for (k, v) in attrs:
        # HTMLParser reports valueless attributes (e.g. "disabled") as None
        content_list.append(k if v is None else "%s=\"%s\"" % (k, v))
    return "<" + " ".join(content_list) + ">"
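
To make the list form concrete, here is a small sketch of what HTMLParser hands to handle_starttag and what make_tag rebuilds from it. It is written for Python 3 (the module is called html.parser there, hence the import fallback). AttrRecorder is a hypothetical helper used only for this illustration, the sample markup is made up, and emitting valueless attributes as bare names is my assumption about the desired output.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


def make_tag(tag, attrs):
    # Self-contained copy for the example: attrs is a list of (name, value) pairs.
    parts = [tag]
    for name, value in attrs:
        # Assumption: valueless attributes (value is None) are written bare.
        parts.append(name if value is None else '%s="%s"' % (name, value))
    return "<" + " ".join(parts) + ">"


class AttrRecorder(HTMLParser):
    """Hypothetical helper: remembers the arguments of every start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<a href="post.html" class="ext">link</a><input disabled>')
recorder.close()

# Turn every recorded (tag, attrs) pair back into start-tag text.
rebuilt = [make_tag(tag, attrs) for tag, attrs in recorder.seen]
```

Here recorder.seen holds the raw (tag, attrs) pairs as the parser delivered them, and rebuilt holds the corresponding start tags as text.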

=TitleBodyExtractor= is an HTMLParser subclass. It collects the body of a page in a string and also keeps the title. These two are the only requirements. It might be possible to parse meta tags to get the publication date and author information as well, but I prefer to keep simple things simple.

class TitleBodyExtractor(HTMLParser): 

    def __init__(self): 
        HTMLParser.__init__(self)
        self.in_body = False
        self.in_title = False
        self.body = ""
        self.title = ""


    def handle_data(self, data): 
        if self.in_body: 
            self.body += data
        if self.in_title: 
            self.title += data

    def handle_starttag(self, tag, attrs): 
        if self.in_body: 
            self.body += make_tag(tag, attrs)

        if tag == "body": 
            self.in_body = True
        if tag == "title": 
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "body": 
            self.in_body = False

        if tag == "title": 
            self.in_title = False

        if self.in_body: 
            self.body += "</%s>" % (tag)
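
A quick way to see what the extractor produces is to feed it a small page. The sketch below repeats a condensed copy of the class so that it runs on its own (under Python 3, with an import fallback for 2.7). The sample page is made up.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


def make_tag(tag, attrs):
    return "<" + " ".join([tag] + ['%s="%s"' % (k, v) for (k, v) in attrs]) + ">"


class TitleBodyExtractor(HTMLParser):
    """Condensed copy of the class above, repeated so the example runs alone."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_body = False
        self.in_title = False
        self.body = ""
        self.title = ""

    def handle_data(self, data):
        if self.in_body:
            self.body += data
        if self.in_title:
            self.title += data

    def handle_starttag(self, tag, attrs):
        if self.in_body:
            self.body += make_tag(tag, attrs)
        if tag == "body":
            self.in_body = True
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        if tag == "title":
            self.in_title = False
        if self.in_body:
            self.body += "</%s>" % tag


page = ("<html><head><title>Hello</title></head>"
        "<body><p>Hi <b>there</b></p></body></html>")
extractor = TitleBodyExtractor()
extractor.feed(page)
extractor.close()   # flush anything the parser may still be buffering
```

After the call, extractor.title holds the text of the title element and extractor.body holds the markup found between the body tags, without the body tags themselves.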

Reading and writing a file in UTF-8 encoding is a common task. The following two functions retrieve and store the contents in UTF-8 using the codecs module.

def get_content(filename):
    f = codecs.open(filename, "r", "utf-8")
    cont = f.read()
    f.close()
    return cont

def write_content(filename, content):
    f = codecs.open(filename, "w", "utf-8")
    f.write(content)
    f.close()
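
The same two helpers can also be written with context managers, which close the file even when reading or writing raises. This is only a sketch of that variant, not the code the script uses. It relies on io.open from the standard library, and the temporary file and sample text are made up for the demonstration.

```python
import io
import os
import shutil
import tempfile


def get_content(filename):
    # io.open decodes to text; "with" closes the file even if read() fails.
    with io.open(filename, "r", encoding="utf-8") as f:
        return f.read()


def write_content(filename, content):
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(content)


# Round trip through a throwaway file with some non-ASCII characters.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.html")
sample_text = u"<p>\u00e7ay, \u015feker</p>"
write_content(sample_path, sample_text)
round_trip = get_content(sample_path)
shutil.rmtree(tmp_dir)
```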

Mercurial can run commands for a repository from outside that repository with the -R command line switch. However, it requires the exact path of the repository and does not accept a child path. The following function finds the repository path of a file by recursively checking whether its parent paths contain an .hg/ directory.

def get_repo_path(dir): 
    if dir == "/" or dir == "":
        return ""
    if os.path.exists(os.path.join(dir, ".hg")): 
        return dir
    else:
        return get_repo_path(os.path.dirname(dir))
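
The upward search can be tried out on a throwaway directory layout. The sketch below is an iterative variant of the same idea, not the function the script uses. find_repo_root and the directory names are made up for the illustration, and stopping when dirname no longer changes the path is my choice for detecting the filesystem root.

```python
import os
import shutil
import tempfile


def find_repo_root(path):
    """Return the closest ancestor of path (or path itself) containing .hg, else ""."""
    current = os.path.abspath(path)
    while True:
        if os.path.isdir(os.path.join(current, ".hg")):
            return current
        parent = os.path.dirname(current)
        if parent == current:   # reached the filesystem root
            return ""
        current = parent


# Build a throwaway layout: <tmp>/site/.hg and <tmp>/site/posts/2013/a.html
tmp_dir = os.path.realpath(tempfile.mkdtemp())
site = os.path.join(tmp_dir, "site")
os.makedirs(os.path.join(site, ".hg"))
os.makedirs(os.path.join(site, "posts", "2013"))
page = os.path.join(site, "posts", "2013", "a.html")
open(page, "w").close()

found = find_repo_root(page)   # walks up from the file to <tmp>/site
shutil.rmtree(tmp_dir)
```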

The =FileObject= class keeps the required data of an HTML file. It stores the path, the modification time, the body and the title. The body and the title are parsed lazily, on first access.

class FileObject: 
    def __init__(self, path, mtime): 
        self.path = path
        self.mtime = int(mtime) 
        self._body = ""
        self._title = ""

    def parse(self): 
        content = get_content(self.path)
        tbe = TitleBodyExtractor()
        tbe.feed(content)
        self._body = tbe.body
        self._title = tbe.title

    def body(self): 
        if self._body == "":
            self.parse()
        return self._body

    def title(self): 
        if self._title == "": 
            self.parse()
        return self._title

    def __str__(self):
        return str(self.path) + " " + str(self.mtime)
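
One detail of the lazy parsing: because the empty string doubles as the "not parsed yet" marker, a page whose title really is empty gets parsed again on every call to title(). A minimal sketch of the same idea with an explicit None marker follows. LazyPage, fake_loader and parse_count are hypothetical names introduced for this illustration and are not part of the script.

```python
class LazyPage(object):
    """Hypothetical stand-in for FileObject that parses at most once."""

    def __init__(self, path, loader):
        self.path = path
        self._loader = loader      # callable returning (title, body)
        self._parsed = None        # None means "not parsed yet"
        self.parse_count = 0       # only here to make the behaviour visible

    def _ensure_parsed(self):
        if self._parsed is None:
            self.parse_count += 1
            self._parsed = self._loader(self.path)
        return self._parsed

    def title(self):
        return self._ensure_parsed()[0]

    def body(self):
        return self._ensure_parsed()[1]


def fake_loader(path):
    # Pretend the page has an empty <title>, the case that defeats an "" marker.
    return ("", "<p>content of %s</p>" % path)


page = LazyPage("index.html", fake_loader)
first_title = page.title()
body = page.body()
second_title = page.title()
```

However often title() and body() are called, the loader runs only once, even though the title is empty.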

The modification time of a file should be retrieved from the Mercurial repository. The following function calls hg log with a specific template, then parses the date to get the last commit time of a file. If the file is not tracked in a repository, it simply returns the filesystem modification time, and if the file does not exist at all, it returns 0.

def get_mtime(full_path):
    if os.path.exists(full_path):
        repo_path = get_repo_path(full_path)
        if repo_path != "":
            logcmd = "/usr/bin/hg log -R %s --template='{date|hgdate}' -l 1 %s " % (repo_path, full_path)    
            # print logcmd

            proc_res = proc.check_output(logcmd, shell=True).split()
            if len(proc_res) > 0:
                filetime = int(proc_res[0])
            else:
                filetime = os.path.getmtime(full_path)
        else:
            filetime = os.path.getmtime(full_path)
        return filetime
    else:
        return 0
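
The hgdate filter prints two numbers, the Unix timestamp and the timezone offset in seconds, which is why the function keeps only the first field. The sketch below isolates that parsing step and also shows how the command could be built as an argument list, so that paths containing spaces need no shell quoting. build_hg_log_command and parse_hgdate are hypothetical helper names, and the sample output string and paths are made up.

```python
def build_hg_log_command(repo_path, file_path, hg="hg"):
    """Argument list for the last commit date of file_path (no shell involved)."""
    return [hg, "log", "-R", repo_path,
            "--template", "{date|hgdate}",
            "-l", "1", file_path]


def parse_hgdate(output):
    """Return the Unix timestamp from hgdate output, or None if it is empty."""
    fields = output.split()
    if not fields:
        return None            # nothing logged: caller falls back to getmtime
    return int(fields[0])


command = build_hg_log_command("/home/me/my site", "/home/me/my site/index.html")
timestamp = parse_hgdate("1365170400 -10800")
untracked = parse_hgdate("")
```

Such a list would be handed to subprocess.check_output without shell=True, and each element then reaches hg as a single argument regardless of the spaces in it.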

We need a list of files as FileObject objects, given the directory name and extension. The function also takes a repository path and skips the file and directory names that match a given regex.


    def file_list(dirname, extension, repo_path, exclude_regex=None):
        results = []
        # os.walk already descends into subdirectories, so no explicit
        # recursion is needed; recursing here would list files twice.
        for root, dirs, files in os.walk(dirname):
            if exclude_regex is not None:
                for d in dirs:
                    if re.match(exclude_regex, d):
                        print "Skipping", d
                # Prune excluded directories in place so os.walk skips them.
                dirs[:] = [d for d in dirs if not re.match(exclude_regex, d)]
            for f in files:
                if (exclude_regex is None or (not re.match(exclude_regex, f))) and f.endswith(extension):
                    fullname = os.path.join(root, f)
                    filetime = get_mtime(fullname)
                    results.append(FileObject(fullname, filetime))
        return results
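
Since os.walk already visits every subdirectory, the usual way to skip directories is to edit the dirs list in place (this works with the default top-down walk). The following is a minimal sketch of that technique on a throwaway directory tree. It returns plain paths instead of FileObject instances so that it runs on its own, and list_files and the sample file names are made up for the example.

```python
import os
import re
import shutil
import tempfile


def list_files(dirname, extension, exclude_regex=None):
    """Return paths under dirname ending in extension, skipping excluded names."""
    def excluded(name):
        return exclude_regex is not None and re.match(exclude_regex, name)

    results = []
    for root, dirs, files in os.walk(dirname):
        # Assigning to dirs[:] prunes the walk: os.walk will not enter these.
        dirs[:] = [d for d in dirs if not excluded(d)]
        for name in files:
            if name.endswith(extension) and not excluded(name):
                results.append(os.path.join(root, name))
    return sorted(results)


# Throwaway tree: index.html, posts/a.html, drafts/b.html, posts/notes.txt
tmp_dir = tempfile.mkdtemp()
for rel in ["index.html", "posts/a.html", "drafts/b.html", "posts/notes.txt"]:
    full = os.path.join(tmp_dir, *rel.split("/"))
    if not os.path.isdir(os.path.dirname(full)):
        os.makedirs(os.path.dirname(full))
    open(full, "w").close()

found = list_files(tmp_dir, ".html", re.compile(r"drafts"))
relative = [os.path.relpath(p, tmp_dir).replace(os.sep, "/") for p in found]
shutil.rmtree(tmp_dir)
```

The drafts directory is never entered, and notes.txt is dropped by the extension check.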

Given a local file in the site, we need to create a link that gives the URL of that file under the site's URL. The following function finds the path relative to an input directory and returns the complete URL by concatenating it to the site's URL.


    def make_link(site_url, input_dir, file_path): 
        rel_path = os.path.relpath(file_path, input_dir)
        return site_url + rel_path
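
A short, self-contained sketch of the mapping, with made-up paths. It differs from the function above in one respect, which is my own assumption rather than something the script does: the OS path separator is replaced with "/" so that the result stays a valid URL on systems where the separator differs.

```python
import os


def make_link(site_url, input_dir, file_path):
    rel_path = os.path.relpath(file_path, input_dir)
    # Assumption: URLs should use "/" even where the OS separator differs.
    return site_url + rel_path.replace(os.sep, "/")


link = make_link("http://example.com/",
                 "/home/me/site",
                 "/home/me/site/posts/rss.html")
```

The site URL is expected to end with a slash here, because the two parts are joined by plain string concatenation.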

Given a FileObject that points to an HTML file, we need a function that builds an RSS item from it. It obtains the URL of the file and fills in the rest from the attributes of the FileObject.


    def get_rss_item(file_object, input_dir, site_url):
        the_link = make_link(site_url, input_dir, file_object.path)
        item = rssgen.RSSItem(title=file_object.title(),
                              link=the_link,
                              description=file_object.body(),
                              guid=rssgen.Guid(the_link),
                              pubDate=datetime.datetime.fromtimestamp(file_object.mtime))
        return item
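
For readers unfamiliar with the feed format, this is roughly the element that such an item turns into. The sketch builds it by hand with the standard library instead of PyRSS2Gen, so it illustrates the RSS 2.0 item structure rather than the library's internals. build_item is a hypothetical name and the sample values are made up.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_item(title, link, body_html, mtime):
    """Build an RSS 2.0 <item> element by hand (stdlib stand-in for RSSItem)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = body_html   # escaped on output
    ET.SubElement(item, "guid").text = link
    # RSS 2.0 dates use the RFC 822 format; formatdate builds it from a timestamp.
    ET.SubElement(item, "pubDate").text = formatdate(mtime, usegmt=True)
    return item


item = build_item("Hello", "http://example.com/hello.html", "<p>Hi</p>", 0)
xml_text = ET.tostring(item).decode("ascii")
```

The HTML body travels inside the description element as escaped text, which is why the markup shows up as entities in xml_text.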

The function that creates the RSS feed from the files in a given directory is the topmost function. It takes all the variables that are set from the command line and returns the RSS object, or None when nothing needs to be regenerated.

It first lists all files with the given extension (.html by default) in the input directory. Then it compares the modification time of the RSS file with the modification times of the listed files. If there is no previous RSS file or there are newer HTML files, the RSS is generated again.

Note that an edit time can be set, so that the script does not consider files as new if they were modified within that many minutes. This prevents regenerating the feed too frequently during an edit session.

To generate the RSS, each file is passed to the previous function to get an RSS item, then these items are passed to RSS2 from PyRSS2Gen to get the resulting object.

    def generate_rss(input_dir, extension = ".html", output = "rss/rss.xml", site_title = "Title", site_description = "Description", max_items = 20, site_url = "http://example.com", edit_time=0, exclude_regex = None):
        if not site_url.endswith("/"): 
            site_url += "/"
        files = file_list(input_dir, extension, get_repo_path(input_dir), exclude_regex)
        rssmtime = get_mtime(output)
        # edit_time is given in minutes, modification times are in seconds
        files_up = [f for f in files if f.mtime >= (rssmtime + edit_time * 60)]
        if len(files_up) > 0: 
            print "Before", files_up
            files_up.sort(key=lambda x: x.mtime, reverse=True)
            print "After", files_up
            rss_items = [get_rss_item(fo, input_dir, site_url) for fo in files_up[:max_items]]
            rssobj = rssgen.RSS2(title = site_title,
                                 link = site_url,
                                 description = site_description,
                                 lastBuildDate = datetime.datetime.now(),
                                 items = rss_items)
            return rssobj
        return None
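
The selection step of generate_rss can be sketched on plain (path, mtime) pairs. This is my reading of the description, with the edit time converted from minutes to seconds before it is compared against timestamps. select_items and the sample timestamps are made up for the illustration.

```python
def select_items(files, rss_mtime, edit_minutes=0, max_items=20):
    """files is a list of (path, mtime) pairs with mtime in seconds.

    Returns the pairs to publish, newest first, or [] when nothing is new.
    """
    threshold = rss_mtime + edit_minutes * 60   # minutes -> seconds
    updated = [f for f in files if f[1] >= threshold]
    updated.sort(key=lambda f: f[1], reverse=True)
    return updated[:max_items]


pages = [("a.html", 1000), ("b.html", 5000), ("c.html", 3000), ("d.html", 1500)]

# No feed yet (mtime 0): everything counts as new, the two newest are kept.
no_feed_yet = select_items(pages, rss_mtime=0, max_items=2)
# Feed written at 1200 with a 10 minute edit time: threshold is 1800.
with_grace = select_items(pages, rss_mtime=1200, edit_minutes=10)
# Feed newer than every page: nothing to do.
up_to_date = select_items(pages, rss_mtime=9000)
```

An empty result corresponds to the None that the function returns when the feed is already up to date.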

The main function uses argparse to define the options. The input directory, site title, site URL and number of items are mandatory; the other options have sensible default values.

The function also builds the exclude_regex object that is passed to the file listing. The regex is compiled here from the supplied string, and all other functions use this compiled regex.

After generating the RSS, it writes the file with the write_xml method of the RSS object.


    def main():

        parser = argparse.ArgumentParser(description='Generate RSS feed from a set of HTML files')

        parser.add_argument('--input-dir', help="input directory", required=True)
        parser.add_argument("--extension", help="file extension to collect", default=".html")
        parser.add_argument("--output", help="output filename to write the results", default="rss/rss.xml")
        parser.add_argument('--title', help="title of the RSS feed", required=True)
        parser.add_argument("--description", help="site description", default="")
        parser.add_argument("--items", help="max items included in the feed", required=True, type=int)
        parser.add_argument("--site-url", help="site url of items", required=True)
        parser.add_argument("--exclude-regex", help="regex to set skipped files", default="")
        parser.add_argument("--edit-time", help="minutes to wait before putting an item into rss", default=0, type=int)

        args = vars(parser.parse_args())

        if args["exclude_regex"] == "": 
            exclude_regex = None
        else:
            exclude_regex = re.compile(args["exclude_regex"])

        rssresults = generate_rss(args["input_dir"],
                                  args["extension"],
                                  args["output"],
                                  args["title"],
                                  args["description"],
                                  args["items"],
                                  args["site_url"],
                                  args["edit_time"],
                                  exclude_regex)

        if rssresults is not None:
            # Close the output file explicitly once the feed is written.
            with open(args["output"], "w") as outfile:
                rssresults.write_xml(outfile)



    if __name__ == "__main__": 
        main()
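
To see how the command line maps onto the args dictionary, the parser can be fed a list instead of the real command line. The sketch below copies a subset of the options and parses a made-up invocation. The paths, title and regex are sample values only, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate RSS feed from a set of HTML files")
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output", default="rss/rss.xml")
    parser.add_argument("--title", required=True)
    parser.add_argument("--items", required=True, type=int)
    parser.add_argument("--site-url", required=True)
    parser.add_argument("--exclude-regex", default="")
    parser.add_argument("--edit-time", default=0, type=int)
    return parser


# The list plays the role of the arguments typed after the script name.
argv = ["--input-dir", "/home/me/site", "--title", "My Site",
        "--items", "10", "--site-url", "http://example.com",
        "--exclude-regex", "drafts|tmp"]
args = vars(build_parser().parse_args(argv))
```

Note that argparse turns the dashes in option names into underscores, which is why main reads args["input_dir"] and args["site_url"].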

You can get the resulting Python script from rss-generator.py