Important, 2005/12/04: this is an old pet project of mine (Marco) which I had to leave behind a couple of years ago (the first posting to the RULE website was on 2003/02/10, and a DAn was discussed on the RULE mailing list in the summer of 2002), but I am sure it could still be useful to any Linux distribution. Please let me know if you are interested in working on it, if you have found bugs, or if you know of similar projects. Thanks.
There is always a lot of talk among users of RPM-based distros about RPM dependency conflicts, packages forcing the installation of too many other packages, bloated “default” installs, and similar issues. Lately, apt has been ported to RPM in order to help in this area, and several tools (up2date, urpmi, you name it…) were already available.
What is still missing?
Personally, I am interested in these problems, but from two other points of view. First of all, I want something which helps BEFORE the installation, to create the best possible comps files and such. Nor do I want something that only knows about what is on two or three CD-ROMs: I want to put together my own mix of RPMs from stock RH CDs, findrpm.net, freshrpms, the self-made ones on my hard drive, etc., and look at that.
I want to feed that mix of packages to a program and then ask it questions like: find, among these 20 email clients, 15 browsers, 12 text editors, 20 news clients… (all end user applications), the combination requiring the smallest number of RPMs and HD megabytes. Another output I’d like from it is something like: you need to install this RPM, but only 50% of its files are actually needed by other packages, so you can safely remove the rest.
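The “smallest combination” question can be sketched as a brute-force search. The following is an illustrative Python sketch, not part of DAn (whose scripts are Perl): the package names, sizes and the `PACKAGES` layout are made up for the example, and trying every combination only scales to small candidate lists.

```python
from itertools import product

# Hypothetical data: package -> (size in MB, list of required packages).
PACKAGES = {
    "mailA": (2, ["libgui"]),
    "mailB": (1, ["libbig"]),
    "webA": (5, ["libgui"]),
    "webB": (4, ["libbig", "libssl"]),
    "libgui": (3, []),
    "libbig": (20, []),
    "libssl": (1, []),
}

def closure(names):
    """Return the given packages plus everything they transitively require."""
    seen = set()
    stack = list(names)
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue  # already counted; this also protects against cycles
        seen.add(pkg)
        stack.extend(PACKAGES[pkg][1])
    return seen

def cheapest_mix(categories):
    """Try one candidate per category; keep the mix with the smallest total
    size, using the number of packages as a tie-breaker."""
    best = None
    for combo in product(*categories.values()):
        needed = closure(combo)
        cost = (sum(PACKAGES[p][0] for p in needed), len(needed))
        if best is None or cost < best[0]:
            best = (cost, combo, needed)
    return best

categories = {"email": ["mailA", "mailB"], "browser": ["webA", "webB"]}
(cost_mb, count), combo, needed = cheapest_mix(categories)
print(combo, cost_mb, "MB in", count, "packages")
# prints: ('mailA', 'webA') 10 MB in 3 packages
```

Note that mailB is the smaller email client on its own, but it loses because it drags in the 20 MB library; shared dependencies are what make the combined question different from picking the smallest package in each category.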
The second use (the one I am most interested in) is as a packaging analysis tool. I want this thing to discover that installing X.rpm drags in 30 more megabytes of disk because it is the only package which requires Y.rpm, Z.rpm and W.rpm, while actually needing only one file from each of them. At that point, I can theoretically contact the maintainer and ask him to patch the spec file of X.rpm.
Similarly, if someBIGlib.rpm contains 100 files, but the first 50 are only needed by gcc and the others only by XFree86, or only for Japanese support, I could contact the maintainer and suggest splitting someBIGlib.rpm into two smaller pieces. If all users started to pester the developers for THESE reasons, the quality of any distribution might improve considerably. I have the feeling that cases like the examples above are quite frequent today. I can personally report that the Xemacs rpm for RH 7.2 on www.rpmfind.net gives a binary that will crash on an English system if the canna lib for Japanese fonts is not installed.
Last but not least, I want the thing to be easily portable to non-RPM systems, and to work even if the actual RPM files are not available (maybe from a CGI interface…).
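As a rough illustration of this kind of report, here is a Python sketch (not DAn code). It assumes that per-file information is available: `FILES` (what each package ships) and `NEEDS` (which individual files each client package really uses) are hypothetical tables, and plain RPM metadata may not be detailed enough to fill the second one.

```python
# Hypothetical data: files shipped by each package, and the individual
# files that other packages actually need.
FILES = {
    "Y": ["/usr/lib/liby.so", "/usr/bin/ytool", "/usr/share/y/data"],
    "Z": ["/usr/lib/libz2.so", "/usr/share/z/help"],
}
NEEDS = {
    "X": ["/usr/lib/liby.so", "/usr/lib/libz2.so"],
    "V": ["/usr/lib/liby.so"],
}

def usage_report(files, needs):
    """For each package, record which of its files are needed, by whom,
    and what fraction of the shipped files that represents."""
    report = {}
    for pkg, shipped in files.items():
        used = {}
        for client, wanted in needs.items():
            for path in wanted:
                if path in shipped:
                    used.setdefault(path, []).append(client)
        report[pkg] = (used, len(used) / len(shipped))
    return report

report = usage_report(FILES, NEEDS)
for pkg, (used, fraction) in report.items():
    print(f"{pkg}: {fraction:.0%} of files needed by others", used)
```

A package with a low fraction, or one whose needed files fall into groups used by disjoint sets of clients, would be a candidate for a note to its maintainer.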
Enough of this rambling, show me the code!
The alpha version of DAn is here on the website. It consists of two scripts: the first (which requires a Perl module) prepares a “database” to be used by the second, which does the actual work. You are encouraged to download it, try it and report back; before any report, however, you are also kindly encouraged to read the threads on the distro analyzer that appeared on the RULE list during July/August 2002. The DAn tar archive contains three files:
- rpm_gen_db
- rh7.3_all_cds
- rpm_analyze
The first one is a Perl script which queries all the RPM files found in one directory, using the Perl module RPM::Perlonly. In this way it works even if rpm and/or the rpm db are not installed, so one can put a RH CD in any WIN*/Solaris box and still use it.
Usage: rpm_gen_db <RPMDIR> <RPMDIRLABEL> > some_file
RPMDIR is any directory containing only RPM files. RPMDIRLABEL is a label (max 15 chars) associated with that directory, e.g. RH7.3_2nd_CD. The output of rpm_gen_db is valid Perl code instantiating a huge Perl hash, equivalent to the rpm database.
The second file in the archive, rh7.3_all_cds, is the result of running rpm_gen_db on all three valhalla CDs.
rpm_analyze does (will do, eventually) the real work. It reads and evaluates its first argument, and so it comes to know, through the %PACKAGES hash, all the available packages, their needs and relationships. Right now, it will print all the RPMs needed by the package given as second argument. Try “rpm_analyze rh7.3_all_cds bash” to see what happens.
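To make the idea concrete, here is a minimal Python sketch of the kind of lookup described above. It is not the actual rpm_analyze code, which is Perl: the layout of `PACKAGES`, the `provides`/`requires` keys and the sample capabilities are assumptions for illustration, not the real format written by rpm_gen_db.

```python
# Hypothetical stand-in for the %PACKAGES hash: each entry lists what the
# package PROVIDEs and REQUIREs (the capability names are made up).
PACKAGES = {
    "bash": {"provides": ["bash", "bash2"],
             "requires": ["libc.so.6", "libtermcap.so.2"]},
    "glibc": {"provides": ["libc.so.6"], "requires": ["basesystem"]},
    "libtermcap": {"provides": ["libtermcap.so.2"], "requires": ["libc.so.6"]},
    "basesystem": {"provides": ["basesystem"], "requires": []},
}

def build_provider_index(packages):
    """Map every capability to the list of packages that provide it."""
    index = {}
    for name, info in packages.items():
        for cap in info["provides"]:
            index.setdefault(cap, []).append(name)
    return index

def needed_by(package, packages):
    """Return (needed packages, unresolved capabilities) for one package."""
    index = build_provider_index(packages)
    needed, unresolved = set(), set()
    stack = [package]
    while stack:
        current = stack.pop()
        for cap in packages[current]["requires"]:
            providers = index.get(cap)
            if not providers:
                unresolved.add(cap)  # nobody in the mix provides this
                continue
            provider = providers[0]  # naive choice: first provider found
            # Visit each package only once, so circular dependencies
            # cannot make the loop run forever.
            if provider not in needed and provider != package:
                needed.add(provider)
                stack.append(provider)
    return needed, unresolved

needed, unresolved = needed_by("bash", PACKAGES)
print(sorted(needed), sorted(unresolved))
# prints: ['basesystem', 'glibc', 'libtermcap'] []
```

Anything that ends up in `unresolved` is either missing from the package mix or required under a name that nobody advertises.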
TODO
- eliminate circular dependencies
- make it do all the other things declared in the original announcement…
- deal with shortcomings in the packages themselves
EX: the current DAn version is in trouble because many packages require ‘/bin/bash’ (i.e. not a capability but a file!), which is obviously provided by the package bash. This one, however, doesn’t advertise it! It PROVIDES bash and bash2 when queried, not /bin/bash. The solution, still to be implemented, is to add to the %PACKAGES hash one ‘PROVIDE’ line for each file shipped by the RPM, but only if it is REQUIREd by at least one other package.
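The planned fix could look roughly like the following Python sketch. It has not been implemented in DAn and is not its Perl code; the `files` field and the sample entries are hypothetical, added only to show the rule “advertise a file only if somebody requires it”.

```python
# Hypothetical package records; "files" is an assumed extra field listing
# the paths shipped by each package.
PACKAGES = {
    "bash": {"provides": ["bash", "bash2"], "requires": [],
             "files": ["/bin/bash", "/usr/share/doc/bash/README"]},
    "initscripts": {"provides": ["initscripts"],
                    "requires": ["/bin/bash", "bash"],
                    "files": ["/etc/rc.d/rc"]},
}

def add_file_provides(packages):
    """Add a PROVIDE entry for each shipped file that some package REQUIREs."""
    # Requirements starting with "/" are treated as file paths.
    required_paths = {req
                      for info in packages.values()
                      for req in info["requires"]
                      if req.startswith("/")}
    for info in packages.values():
        for path in info["files"]:
            if path in required_paths and path not in info["provides"]:
                info["provides"].append(path)
    return packages

add_file_provides(PACKAGES)
print(PACKAGES["bash"]["provides"])
# prints: ['bash', 'bash2', '/bin/bash']
```

Restricting the new entries to required paths keeps the hash from growing by one line for every file in every package.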
What is the rationale for a GNU/Linux Distribution Analyzer?
When I first proposed the idea (in 2002 or even earlier, on the Mandrake 5.1 list) I got this question: there are databases, rpmdb, and other solutions, so why did you choose something as ugly and inefficient as the mammoth hash thing? My answer is that all those other solutions require that you have at least some combination of:
- the version of Red Hat to be analyzed installed
- rpm installed
- all the rpm files, on disk or CD
With this approach, instead:
- the thing can run as a CGI on any HTTP server, regardless of its OS, with much less disk space required
- everybody can evaluate whether a certain version of RH fits their needs without having to (try to) install it first
- the Perl file required by rpm_analyze can be generated by looking at rpm files, but also by querying rpmfind, Solaris or .deb packages, or even the makefiles in plain tgz archives. This means that one can find the optimal rpm combination for one’s needs even if those rpms don’t exist yet.
- the whole thing can be easily ported en masse to any other distribution (that’s why I called the hash “PACKAGES” instead of “RPMs”: REQUIRE and PROVIDE are very general concepts, aren’t they?)