redcat

Introduction

For about four years now, Debian has distributed patches to its indexes of packages available for download -- so that rather than downloading the full lists (currently about 30MB uncompressed, or 6MB compressed) every time you want a package that's only just become available, you can download just the changes to the list (which after compression amount to only about 10kB every six hours).

See the link above for background, but the basic summary is that every time the archive is updated, a compressed "ed-style" diff is generated that lists each line that was changed in the original Packages or Sources index -- since those are text files, with stanzas sorted by package name, that works out very well. An additional index of the ed-style diffs is also constructed so apt can determine what it needs to download and apply to update its old indexes, and that's pretty much it.
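To make that concrete, here's a toy index and an ed-style diff against it, along with a naive applier. Both the file contents and the apply_ed helper are purely illustrative -- this isn't apt's code or the program below, just a sketch of what the format means.

```python
import re

# A toy "index" and an ed-style diff against it (both hypothetical).
old = ["Package: aalib", "Version: 1.0", "Package: zsh", "Version: 4.2"]
diff = ["4c", "Version: 4.3", ".", "2c", "Version: 1.1", "."]

def apply_ed(lines, commands):
    """Naively apply ed commands (listed in reverse line order, as
    diff --ed emits them) to a list of lines."""
    new = list(lines)
    i = 0
    while i < len(commands):
        m = re.match(r"(\d+)(?:,(\d+))?([acd])$", commands[i])
        first, last = int(m.group(1)), int(m.group(2) or m.group(1))
        cmd = m.group(3)
        i += 1
        added = []
        if cmd in "ac":
            while commands[i] != ".":
                added.append(commands[i])
                i += 1
            i += 1                       # skip the terminating "."
        if cmd == "a":
            new[first:first] = added     # append after line `first`
        elif cmd == "c":
            new[first-1:last] = added    # replace lines first..last
        else:
            new[first-1:last] = []       # delete lines first..last
    return new

apply_ed(old, diff)
# -> ["Package: aalib", "Version: 1.1", "Package: zsh", "Version: 4.3"]
```

Because the commands are listed from the end of the file backwards, earlier commands don't shift the line numbers that later commands refer to.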

Unfortunately, this turns out to be a bit slow for many people -- as attested to by apt bugs #383881 and #372712, probably among others. The problem is twofold: first, editing a 30MB file isn't trivial -- you need to read 30MB from the disk, write 30MB back to the disk, and if you want to avoid doing that multiple times, you need enough memory to cache 60MB (mmaps of both the original file and the modified file); and if you've got multiple entries in your sources.list, you need to multiply that out for each of them. That might be tolerable on a modern system, but it compounds when multiple pdiffs are required to bring your Packages file up to date: apt simply applies each pdiff in turn, pausing to download and decompress the next one each time.

It's possible, at least in theory, to do substantially better than this: downloading all the pdiff files necessary, manipulating them as necessary, and then applying them in a single pass over the original index should reduce the delays substantially. Unfortunately this isn't so easy in practice, as the manipulation necessary to apply a series of diffs in one pass isn't at all obvious.

This program aims to remedy that, providing a straightforward implementation of the necessary manipulations. It takes as input an ordered series of ed diffs, combines them internally, then either (a) outputs a single ed diff that has the same result as applying them all in sequence, or (b) applies the patches against stdin.

It's Python based, so here's the boilerplate:

#!/usr/bin/env python

# Copyright (c) 2009 Anthony Towns
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.

Dealing with diffs

We're going to start with a basic class representing a set of modifications that apply to a text file. Only two sorts of modification are permitted: inserting some lines, and deleting some lines. This matches what diff handles, and what is needed for this problem, though obviously these are nowhere near the only modifications that could potentially be made.

In order to manage the internal data structure a little more easily, we'll use the following "Struct" class which allows us to define classes with just a few fixed members and easily create and use them.

class Struct(object):
    __slots__ = ()
    def __init__(self, *args, **kwargs):
        d = dict(zip(self.__slots__, args))
        d.update(kwargs)
        for x in self.__slots__:
            setattr(self, x, d.get(x,None))

The data structure we use is important. It's essentially a single ordered array of changes in the form of an offset (how many lines after the end of the last change does this change begin), how many lines to delete (possibly zero), and what lines to add (possibly none).

class Change(Struct):
    __slots__ = ("offset", "del_cnt", "add_cnt", "add_lines")

In addition, since changes will generally be nearby, we'll keep a cache of the current absolute position in the file -- that is, the current change we're looking at, and the line number where the previous change was finished. It would be fantastic if we could cache this for every modification, but that would mean we'd have to update every line number after any insertion/deletion we make, so it's better to just use offsets.

class LineNrCache(Struct):
    __slots__ = ("change", "line")

class Modifications(object):
    def __init__(self):
        self.changes = []  # list of Change objects
        self.line_nr_cache = LineNrCache(change=0, line=0)

The main operation on this data structure will be inserts, like so:

    def insert(self, i, offset, del_cnt, add_lines):
        self.changes.insert(i,
            Change(offset, del_cnt, len(add_lines), add_lines) )

But before we can do anything useful with this, we need to be a bit cleverer as to how we manipulate it. The most important aspect is that we want to be able to easily make modifications at a particular line number -- so we'll create a function to prepare our data structure for precisely that. mod_for() will take a line number, and return an index into the array that's calibrated so that deletions and insertions happen at the right line number, based on all the current offsets (it will both start from line_nr_cache and update it, to be a little clever about what's going on).

Implementing the function is then a matter of incrementing/decrementing until either we discover we've never looked at a line number this high before, and have to extend our array, or we find a pre-existing modification that covers our line number and just have to deal with the three possible states we could have: we'd skipped past this line in the offset, this line is in the middle of text we'd added previously, or this is exactly where we started making a change previously.

    def mod_for(self, line):
        assert line >= 0

        i = self.line_nr_cache.change
        pos = self.line_nr_cache.line

        changes = self.changes # abbreviation

        assert (i == len(changes) == 0) or (0 <= i and i < len(changes))

        while True:
            assert 0 <= i and i <= len(changes)
            if i == len(changes):
                self.insert(i, line-pos, 0, [])
                break

            if pos + changes[i].offset + changes[i].add_cnt < line:
                pos += changes[i].offset + changes[i].add_cnt
                i += 1
                continue
            elif line < pos:
                assert i > 0
                i -= 1
                pos -= changes[i].offset + changes[i].add_cnt
                continue

            assert i == 0 or pos <= line
            assert line <= pos + changes[i].offset + changes[i].add_cnt
            if line < pos + changes[i].offset:
                changes[i].offset -= (line-pos)
                self.insert(i, line-pos, 0, [])
                break
            elif pos + changes[i].offset < line:
                split = line - pos - changes[i].offset
                self.insert(i+1, 0, 0, changes[i].add_lines[split:])
                changes[i].add_cnt = split 
                changes[i].add_lines = changes[i].add_lines[:split]
                i += 1
                break
            else: 
                assert line == pos + changes[i].offset
                break

        assert 0 <= i and i < len(changes)
        self.line_nr_cache.change = i
        self.line_nr_cache.line = line - changes[i].offset
        return i
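One delicate case above is when the target line falls in the middle of previously added text. The split in the second elif branch can be sketched in isolation -- split_change here is an illustrative stand-in using plain tuples, not part of the class:

```python
# A change is (offset, del_cnt, add_lines); `pos` is where the previous
# change finished. If the target line falls inside the added text, the
# change is split in two, the second half getting offset 0.
def split_change(change, pos, line):
    off, dels, added = change
    split = line - pos - off           # added lines that precede `line`
    assert 0 < split < len(added)
    return [(off, dels, added[:split]), (0, 0, added[split:])]

split_change((2, 1, ["x", "y"]), 0, 3)
# -> [(2, 1, ["x"]), (0, 0, ["y"])]
```

The deletion count stays with the first half, matching mod_for's behaviour of inserting a fresh empty change after the split point.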

Of course, endlessly splitting up our data structure is going to get annoying, but there's a simple observation we can make: if a modification's offset is zero, we can easily merge it with the previous modification. So when we manipulate the array, whether by calling mod_for or more manually, we'll make sure to check for this condition, and tidy up after ourselves.

    def merge_next(self, i):
        assert i+1 < len(self.changes) and self.changes[i+1].offset == 0

        c_i = self.line_nr_cache.change
        c_p = self.line_nr_cache.line
        if c_i == i+1:
            c_i -= 1
            c_p -= self.changes[i].offset + self.changes[i].add_cnt
            self.line_nr_cache.change = c_i
            self.line_nr_cache.line = c_p

        nxt = self.changes.pop(i+1)
        self.changes[i].del_cnt += nxt.del_cnt
        self.changes[i].add_cnt += nxt.add_cnt
        self.changes[i].add_lines = self.changes[i].add_lines + nxt.add_lines

So with all that established, we can now insert lines!

    def insert_lines(self, line, strs):
        mod = self.mod_for(line)

        self.changes[mod].add_cnt += len(strs)
        self.changes[mod].add_lines = strs + self.changes[mod].add_lines

        if mod > 0 and self.changes[mod].offset == 0:
            self.merge_next(mod-1)
            assert mod == len(self.changes) or self.changes[mod].offset > 0

        assert mod-1 <= 0 or self.changes[mod-1].offset > 0

The other half of our job is, of course, to remove lines. This is a little more complicated, because we might find ourselves (a) removing original lines, (b) removing added lines, (c) overlapping previously removed lines, or (d) crossing modification boundaries.

So we do this piecemeal. If our current block has added lines, these are the first to go. Having done that, if we want to remove more lines, we can, but only up to the offset to the next block. If we reduce that to zero, we need to merge the current block and the next block, at which point we can start again. (Of course, if there isn't a next block, we can remove as many lines as we like without worrying.) Anyway, iterating through that eventually gets enough lines deleted, and we can just merge with the previous block if necessary, and declare ourselves successfully completed.

    def remove_lines(self, line, cnt):
        mod = self.mod_for(line)

        changes = self.changes # abbreviation

        while cnt > 0:
            k = min(changes[mod].add_cnt, cnt)
            changes[mod].add_lines = changes[mod].add_lines[k:]
            changes[mod].add_cnt -= k
            cnt -= k

            if mod+1 < len(changes):
                k = min(changes[mod+1].offset, cnt)
                changes[mod].del_cnt += k
                changes[mod+1].offset -= k
                cnt -= k
                if changes[mod+1].offset == 0:
                    self.merge_next(mod)
            else:
                changes[mod].del_cnt += cnt
                cnt = 0

        if mod > 0 and changes[mod].offset == 0:
            self.merge_next(mod-1)
            assert mod-1 == 0 or changes[mod-1].offset > 0

Having gotten ourselves a list of modifications, applying them to a file is fairly easy: we copy some lines from input to output, skip some lines from the input, and then write the added lines to the output, and repeat.

    def apply_against_file(self, file, out):
        for change in self.changes:
            for _ in xrange(change.offset):
                out.write(file.readline())
            for _ in xrange(change.del_cnt):
                file.readline()
            assert change.add_cnt == len(change.add_lines)
            for line in change.add_lines:
                out.write(line + "\n")
        while True:
            line = file.readline()
            if line == "":
                break
            out.write(line)

Ed-style diffs

So far we've been pretty generic: we've defined a class with an insert_lines function and a remove_lines function, but not any sort of file format that represents the lines to be inserted or removed. As per our motivation, we want to deal with ed-style diffs. First comes parsing, which we'll use a generator for. We'll only accept commands like "0a" (append after the 0th line), "1,2c" (change lines 1 to 2 to the text that follows) and "1,2d" (delete lines 1 to 2 inclusive). Note that with this syntax it's not possible to express "add a line containing a single dot (.)". You would generally handle that by appending a line containing two dots and then using the "s/.//" command; however, lines with a single dot aren't valid in the Packages or Sources files we're dealing with, so we simply won't address this issue.

def red_components(file):
    cmdwanted = True
    lines = []
    for line in file:
        line = line.rstrip("\n")
        if cmdwanted:
            if line == "":
                continue

            line, cmd = line[:-1], line[-1]
            if "," in line:
                vals = line.split(",", 1)
            else:
                vals = line, line
            vals = [int(x) for x in vals]

            offset = vals[0]-1
            del_cnt = vals[1]-vals[0]+1

            if cmd == "d":
                yield (offset, del_cnt, [])
            elif cmd == "c":
                cmdwanted = False
            elif cmd == "a":
                offset += 1
                del_cnt = 0
                cmdwanted = False
            else:
                raise Exception("Could not parse ed command: \"%s%s\"" % 
                    (line,cmd))
        else:
            if line == ".":
                yield (offset, del_cnt, lines)
                lines = []
                cmdwanted = True
            else:
                lines.append(line)
    if not cmdwanted:
        raise Exception("ed input hit eof within text block")
    return
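To make the address arithmetic concrete, here's the same offset/del_cnt calculation condensed into a standalone helper (illustrative only -- the generator above does this inline):

```python
def parse_cmd(cmd):
    # Mirrors the arithmetic in red_components() for a bare command line.
    body, op = cmd[:-1], cmd[-1]
    if "," in body:
        first, last = (int(x) for x in body.split(",", 1))
    else:
        first = last = int(body)
    offset = first - 1          # 0-based line the command applies at
    del_cnt = last - first + 1  # "c" and "d" delete the addressed range
    if op == "a":
        offset += 1             # "a" appends *after* the addressed line...
        del_cnt = 0             # ...and deletes nothing
    return offset, del_cnt

parse_cmd("0a")    # (0, 0): insert at the very top
parse_cmd("1,2c")  # (0, 2): replace lines 1-2
parse_cmd("3d")    # (2, 1): delete line 3
```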

With that little bit of parsing out of the way, we can extend our previous class to deal with Ed Diffs explicitly:

class EdModifications(Modifications):
    def read_diff(self, file):
        for offset, del_cnt, add_lines in red_components(file):
            if del_cnt > 0:
                self.remove_lines(offset, del_cnt)
            if add_lines != []:
                self.insert_lines(offset, add_lines)
        return

In addition, we'd like to be able to output a cumulative diff. There are any number of valid ways to do this as far as ed is concerned, but the way diff does it, and the only way apt is willing to accept, is to have the changes listed in reverse order -- from the ones affecting the end of the file, to the start. The advantage this offers is that the line numbers specified in the ed file match the line numbers in the original file, which is kinda nifty.

Either way makes very little difference to us though. We can work out the old line numbers by simply counting the offsets (which was the text in the original file that makes it through unchanged) and the deleted text (which was also in the original file), and not counting the added text (which obviously wasn't in the original file). Iterating that way to the last modification lets us work out the line numbers we need, and then it's just a matter of iterating back to the start, and outputting in ed format.

(Note again that we don't take care to avoid lines with single dots, relying on them simply not appearing in the first place.)
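As a standalone sketch of that arithmetic, using plain (offset, del_cnt, add_lines) tuples in place of Change objects (and simplified to assume del_cnt is at most 1, so single-line addresses suffice):

```python
# Hypothetical change list: delete line 3, insert "x" after line 5.
changes = [(2, 1, []), (2, 0, ["x"])]

# Old-file line number just past the final change: only offsets and
# deletions consumed lines from the original file.
line = sum(off + dels for off, dels, _ in changes)

cmds = []
for off, dels, added in reversed(changes):
    line -= dels
    if dels and added:
        cmds.append("%dc" % (line + 1))
    elif dels:
        cmds.append("%dd" % (line + 1))
    elif added:
        cmds.append("%da" % line)   # "a" addresses the line *before* the insert
    line -= off

cmds  # ["5a", "3d"]: reverse order, addressed against the original file
```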

    def write_diff(self, out):
        line = 0
        for change in self.changes:
            line += change.offset + change.del_cnt

        for change in reversed(self.changes):
            line -= change.del_cnt

            if change.del_cnt <= 1:
                lines = "%d" % (line+1)
            else:
                lines = "%d,%d" % (line+1, line+change.del_cnt)
            if change.add_cnt == 0:
                if change.del_cnt == 0:
                    continue
                else:
                    out.write("%sd\n" % (lines))
            else:
                if change.del_cnt == 0:
                    out.write("%da\n" % (line))   # NB: not line+1
                else:
                    out.write("%sc\n" % (lines))
                for added_line in change.add_lines:
                    out.write("%s\n" % (added_line))
                out.write(".\n")
            line -= change.offset

main()

And that was our module. To demonstrate it we supply a simple main() function that accepts a series of ed diffs on the command line, and either outputs a cumulative diff, or acts as a filter on stdin applying all the patches in bulk.

def main():
    import sys, gzip

    if len(sys.argv) == 1 or (len(sys.argv) == 2 and sys.argv[1] == "-f"):
        print "Usage: %s [-f] <diff>..."
        print """
One of more diff files must be specified. Gzip compression is assumed if
.gz extension is used.

If -f is specified, patch will be applied to stdin, with result on stdout.
Otherwise, combined diff is produced on stdout."""
        sys.exit(0)

    if sys.argv[1] == "-f":
        just_dump_ed = False
        files = sys.argv[2:]
    else:
        just_dump_ed = True
        files = sys.argv[1:]

    diffs = EdModifications()
    for filename in files:
        if filename.endswith(".gz"):
            file = gzip.GzipFile(filename, "r")
        else:
            file = open(filename, "r")
        diffs.read_diff(file)

    if just_dump_ed:
        diffs.write_diff(sys.stdout)
    else:
        diffs.apply_against_file(sys.stdin, sys.stdout)

if __name__ == "__main__":
    main()