I just spent most of today, and quite a bit of yesterday, porting srpm, a program Stew wrote for the secteam. It builds a database of src.rpm files so you can run queries on source files, which is extremely useful for us in determining whether a vulnerable source file exists in any supported distro. For example, if there was a vulnerability in foo.c, we could do "srpm -q foo.c" to see if that source file exists anywhere; this helped us find a bunch of PDF-related programs that were vulnerable to issues in xpdf. An invaluable tool.

The problem was that it used sqlite, which is fine, but I've been thinking about a web frontend for it and wanted to make it more general (and due to PHP's sqlite module, I'd need both it and sqlite2 installed on Annvix). So I took the liberty of rewriting parts of it and making it use PEAR's MDB2 for database portability. The port is called rqs, and right now I've got it working with a PostgreSQL database, which is kinda nice.
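To give a rough idea of what a portable substring query over source files looks like, here is a minimal sketch in Python. The real tools are PHP (rqs uses PEAR MDB2), so this is not their code, and the table names, column names, and sample rows are all made up for illustration. The only backend-specific piece is the parameter placeholder, which is "?" for sqlite3 and would be "%s" for a PostgreSQL driver such as psycopg2.

```python
import sqlite3

# Hypothetical schema: one row per source file found inside a src.rpm.
# These table and column names are illustrative, not the real srpm/rqs schema.
SCHEMA = """
CREATE TABLE packages (id INTEGER PRIMARY KEY, tag TEXT, name TEXT);
CREATE TABLE files (id INTEGER PRIMARY KEY, package_id INTEGER, path TEXT);
"""


def find_files(conn, substring, placeholder="?"):
    """Return (tag, package, path) rows whose path contains `substring`.

    `conn` is any DB-API connection; `placeholder` is the driver's
    parameter marker ("?" for sqlite3, "%s" for psycopg2).
    """
    sql = (
        "SELECT p.tag, p.name, f.path "
        "FROM files f JOIN packages p ON p.id = f.package_id "
        "WHERE f.path LIKE " + placeholder + " "
        "ORDER BY p.tag, p.name, f.path"
    )
    cur = conn.cursor()
    # The wildcards live in the bound parameter, never in the SQL text.
    cur.execute(sql, ("%" + substring + "%",))
    return cur.fetchall()


def demo():
    """Build a tiny in-memory database with made-up rows and query it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?, ?)",
        [(1, "2007.1", "xpdf"), (2, "CS4", "glibc")],
    )
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [
            (1, 1, "xpdf-3.01/xpdf/Stream.cc"),
            (2, 2, "glibc-2.3.5/misc/libgen.h"),
            (3, 2, "glibc-2.3.5/misc/dirname.c"),
        ],
    )
    return find_files(conn, "libgen.h")


if __name__ == "__main__":
    for row in demo():
        print(row)
```

One caveat with a sketch like this: LIKE is case-insensitive for ASCII in sqlite by default but case-sensitive in PostgreSQL, so the same query can return different matches depending on the backend.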

The nice thing about the database abstraction stuff is that it doesn't seem to be slowing things down too much. With srpm (which uses straight PHP sqlite functions), I get:

$ time srpm -q libgen.h -C
39 matche(s) in database for substring (libgen.h)
3.31user 0.60system 0:04.15elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k

The same query with rqs results in:

% time ./rqs -q libgen.h -C 
rqs $Id: rqs 175 2007-05-04 03:30:24Z vdanen $
39 matche(s) in database for substring (libgen.h)
./rqs -q libgen.h -C  0.05s user 0.00s system 0% cpu 15.651 total

Going by elapsed time, that is 4.15s for srpm against 15.65s for rqs: about 11.5s more using database abstraction and PostgreSQL than using straight PHP sqlite functions and an sqlite database.

The two databases hold the same data: CS3, CS4, 2007, and 2007.1. According to rqs, that amounts to the following records:

 $Id: rqs 173 2007-05-03 18:58:34Z vdanen $

Database statistics:

Tag Records : 4               Package Records: 7791           
File Records: 4226875         Source Records : 40233 

The really interesting thing is importing. Stew told me that srpm took quite a while to import data, but I didn't realize how slow it was (I'd never used srpm to import anything before). To normalize the database I had to import the 2007.1 sources (and remove the 2006 sources from it). The resulting import of 4.5GB of src.rpm files via srpm took 4:33:52 (yes, 4.5hrs). With rqs, the import of the same src.rpm directory took 52:05, so just under an hour. Scary.
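To put those two import times side by side, here is a quick back-of-the-envelope calculation in Python. The only inputs are the figures quoted above (4:33:52, 52:05, and 4.5GB); the helper name and the choice of 1024 MB per GB are my own assumptions.

```python
def to_seconds(clock):
    """Convert an 'H:MM:SS' or 'MM:SS' string to a number of seconds."""
    total = 0
    for part in clock.split(":"):
        total = total * 60 + int(part)
    return total


srpm_s = to_seconds("4:33:52")  # srpm import time from the post
rqs_s = to_seconds("52:05")     # rqs import time from the post

# How many times faster the rqs import was.
speedup = srpm_s / rqs_s

# Rough throughput, assuming 4.5GB means 4.5 * 1024 MB.
size_mb = 4.5 * 1024
srpm_rate = size_mb / srpm_s
rqs_rate = size_mb / rqs_s

if __name__ == "__main__":
    print(f"srpm: {srpm_s}s, rqs: {rqs_s}s, speedup: {speedup:.2f}x")
    print(f"srpm: {srpm_rate:.2f} MB/s, rqs: {rqs_rate:.2f} MB/s")
```

That works out to 16432 seconds against 3125 seconds, so the rqs import was a little over five times faster.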

So now the next step is to write rqs' counterpart (rqp), which will do for binary rpms what rqs does for src.rpms. And then the web frontend for it as well.