So the RPM Query (rq) tool that I inherited as ‘srpm’ from Stew Benedict back in the day at Mandriva is now on fedorahosted.org. Unfortunately, it’s missing its history because I had a dickens of a time trying to extract it from my existing svn repo (I got it so just the relevant commits were in there, but the entire history of the other repo came along with it, which does me no good). You can check it out at https://fedorahosted.org/rq/wiki.
But the problem I’m having now, and maybe someone on the lazyweb can help me with this, is that it is absolutely freaking slow. Querying the database is fast enough when you’re searching for something, but doing the imports, and especially handling the updates, is a process of hours. For instance, having it manage the updates for Fedora 14 (the SRPMs) means 1311 package updates and 353 new packages being added. It’s been running since last night (I don’t know exactly when it started, but it has to be well over 10 hours ago) and it’s still in the “remove old package records” stage. That seems highly excessive to me.
But the database is big:
rqs 0.6 ($Id: rqs 425 2011-01-21 23:37:19Z [email protected] $)

Database statistics:
  Database => User: rq, Host: localhost, Database: rqs
  Data size => 9.96 GB
  Tag records:      5
  Package records:  21629
  File records:     9554598
  Source records:   76031
  Ctags records:    56300434
  Requires records: 100345
And what it is doing is first removing all of the old records (packages, files, ctags, requires, source files) before it can add the new ones. Of course, it does this one record at a time, table by table:
if self.type == 'source':
    tables = ('packages', 'sources', 'files', 'ctags', 'buildreqs')
    for rnum in to_remove:
        r_count = r_count + 1
        for table in tables:
            query = "DELETE FROM %s WHERE p_record = %d" % (table, rnum)
            result = self.db.do_query(query)
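One idea I’ve been toying with (untested, and assuming self.db.do_query() will take any SQL string I throw at it, which it clearly does above) is to batch the deletes: one DELETE with an IN list per table per chunk of records, instead of one DELETE per record per table:

if self.type == 'source':
    tables = ('packages', 'sources', 'files', 'ctags', 'buildreqs')
    chunk_size = 1000  # arbitrary; keep the IN list well under max_allowed_packet
    for i in range(0, len(to_remove), chunk_size):
        # one comma-separated list of record IDs per chunk
        id_list = ','.join(str(rnum) for rnum in to_remove[i:i + chunk_size])
        for table in tables:
            query = "DELETE FROM %s WHERE p_record IN (%s)" % (table, id_list)
            result = self.db.do_query(query)
    r_count = len(to_remove)

That would turn thousands of individual DELETE statements into a few per table, which has to help with the round-trip overhead at least.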
I think the problem is that I’ve reached the limit of my understanding of efficient MySQL, as I’ve never dealt with data sets this huge. What would be the best way to handle this? Would temporary tables work best? Would giving each rpm repo its own set of tables work better? That would logistically be more work, as I’d have to create SQL tables on the fly (and while I would like this thing to be portable between database types at some point, it currently isn’t — but this would have an impact when I do make it portable). The problem then is that if I want to search for ‘sshd.c’, for instance, I’d have to look across multiple tables (such as ‘rqs_f14_source’ and ‘rqs_rhel6_source’ instead of just ‘source’), and my concern is that might increase query time. Querying right now is pretty fast, and I don’t want to hurt that, but this import/update time is ridiculous.
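If I did split things up per tag, the cross-tag search could presumably be glued together with UNION ALL; a rough sketch (the s_file column name is made up, since that schema doesn’t exist yet):

-- hypothetical per-tag tables; s_file is an assumed column name
SELECT 'f14' AS tag, s.*
  FROM rqs_f14_source s
 WHERE s.s_file = 'sshd.c'
UNION ALL
SELECT 'rhel6' AS tag, s.*
  FROM rqs_rhel6_source s
 WHERE s.s_file = 'sshd.c';

Each branch hits its own smaller table and index, so it shouldn’t be dramatically slower than one big table, but it does mean generating the query from the list of known tags.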
But maybe that is the answer? Multiple tables, grouped by repo (or tag, as rq calls them), as opposed to the giant tables I have now? I mean, looking at the number of rows in the ctags table (56,300,434), this is a pretty huge dataset. Or can you delete from multiple tables at the same time? The p_record column is identical across tables: same name, same data.
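On that last question: MySQL does have a multi-table DELETE, which would look something like this (untested against rq’s schema, and note the LEFT JOINs multiply rows that share a p_record, so on tables this big it could actually end up slower than separate batched deletes):

-- delete the rows for a set of p_record values from all five tables at once;
-- the IDs here are placeholders
DELETE packages, sources, files, ctags, buildreqs
  FROM packages
  LEFT JOIN sources   ON sources.p_record   = packages.p_record
  LEFT JOIN files     ON files.p_record     = packages.p_record
  LEFT JOIN ctags     ON ctags.p_record     = packages.p_record
  LEFT JOIN buildreqs ON buildreqs.p_record = packages.p_record
 WHERE packages.p_record IN (101, 102, 103);

Either way, the first thing I should probably verify is whether p_record is actually indexed on the big tables: an unindexed DELETE ... WHERE p_record = N against 56 million ctags rows means a full table scan for every single record, which by itself would explain the hours.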
Any hints to speed this sucker up?