A couple of month ago the german linux magazin asked known guru's of a few different programming languages to solve a little programming task in their preferred language. In the article readers were encouraged to contribute their solution of the task and the redaktion would look them over, pick a few subjective winners and publish an article on the results.
I decided to have a shot at it using my language of choice, Python. Needless to say I didn't get a pick for the article.
Well I learned quite a bit about Python in the process, so I decided to post my solutions. Yes that's plural I couldn't resist in also making a shortest version of the script.
The task is actually very simple. Read an ASCII file that contains some text with footnotes and at the end of the file a list of footnotes and renumber the footnotes. The article and the list are separated by a single line with the text "@footnotes:". A refereence to a footnote is defined by square brackets surrounding a number i.e. [235]. The footnote list caontains one footnote per line defnied by square brackets sorrounding a number, a space and then the footnote till the end of the line. You either had to renumber the footnotes by appearance in the article or in the footnote list. If you sorted it by appearance in the artcle the list also nedded to be sorted by the new ID. For the germans or those capable of reading it, the original task description can be found in this online article on the linux magazin site.
One of the main things I learned is that handling lots of objects slower than using pure lists or dicts. Another thing is use the force. Where the force in this case is the standard library. Things like the key parameter on the sort method of lists make life so much easier.
First the short version. This'll completely fit in this post. Sans import and comments, it's five logical lines of code. Which can be reduced to four if the regex precompilation is removed. Since it's well documented I won't go into each line here just read the source. So without further ado, here is the code, I wrapped the lines to fit:
import sys, re # initialize the footnote id dictionary footnotes = dict() # compile the regex. not precompiling the regex will save 1 line of source # but increase runtime by about 20%. footnote_re = re.compile('\[(?P<id>\d+)\]') # read in the whole file. Make sure you have enough memory data = file(sys.argv[1], "r").read() # generate a new footnote ID, using a lambda function and the dicts # setdefault method. Then split the file at the separator (data, footnotes) = footnote_re.sub( lambda match: [%d]' %( footnotes.setdefault( data[match.start('id'):match.end('id')], len(footnotes) + 1 ) ), data ).split('\n@footnote:\n') # write the data, the separator and the footnote to stdout. First creating a # list of the footnotes, filtering it to make sure it only contains # footnotes. Then sort it using a lambda to extract the ID. sys.stdout.write(data + '\n@footnote:\n' + '\n'.join( sorted( filter( lambda f: footnote_re.match(f), footnotes.split('\n')), key = lambda a: int(footnote_re.findall(a)[0]) ) ) + '\n')
The second solution is a batteries included version. It uses getopt to read the command line parameters, has error handling and some sane usage instructions. This version can also handle files of arbitrary size.
To decrease runtime I implemented the file reading using an iterator class that reads the file in 1MB chunks, making sure that a chunk doesn't end in a footnote or inside the separator.
What follows is the next method of the iterator that handles the chunk reading.
def next(self): data = self.fp.read(self.chunk_size) if not data: raise StopIteration negated_sep_len = int(self.separator_len * -1) # we need to make sure that the data doesn't end in the separator # so we read from the end of the data and check wether the end of # data is a part of the separator. idx = -1 # we start with the last char and continue going forwards. while idx > negated_sep_len: if self.separator.startswith(data[idx:]): #chunk ends on part of the separator data += self.fp.read(1) # add another byte to the chunk idx -= 1 #move the index to compare one more byte else: # chunk doesn't end on part of the separator break # so we can stop # need to make sure that the data doesn't end exactly on the separator if data[negated_sep_len:] == self.separator: #read the footnotes to the end of file. self.footnote_data = self.fp.read() return data # and return the doc data. # split the read data on separator # to make sure we don't return part of the footnotes. try: (doc_data, footnote_data) = data.split(self.separator) except ValueError: doc_data = data footnote_data = None # if we have footnote data then read the rest if footnote_data is not None: self.footnote_data = footnote_data + self.fp.read() #return the rest of the document data appending the separator. doc_data = doc_data + self.separator else: # we don't have footnotes in this chunk #last check to make sure that the end of the # data isn't in a footnote pointer. while self.open_footnote_re.match(doc_data): doc_data += self.fp.read(1) #new data bytewise # and return the read data. return doc_data
Well those were the interesting bits of the scripts. If you want to have a look at the whole scripts. You can get the it here.
I had fun developing those two scripts and learned new ways of manipulating large lists.
I hope you found this article at least a little interesting.
