mk/bulk/parallel.txt

# $Id: parallel.txt,v 1.7 2006/12/15 12:46:24 martti Exp $
#

These are my (<dmcmahill>) thoughts on how one would want a parallel
bulk build to work.


====================================================================
Single Machine Build Process
====================================================================

The current (as of 2003-03-16) bulk build system works in the
following manner:

1)  All installed packages are removed.

2)  Packages listed in the BULK_PREREQ variable are installed.  This
    must be done before step 3 as some packages (like xpkgwedge) can
    affect the dependencies of other packages when installed.

3)  Each package directory is visited and its explicitly listed
    dependencies are extracted and put in a 'dependstree' file.  The
    mk/bulk/tflat script is used to generate flattened dependencies
    for all packages from this dependstree file in both the up and
    down directions.  The result is a file 'dependsfile' which has one
    line per package that lists all build dependencies.  Additionally,
    a 'supportsfile' is created which has one line for each package
    and lists all packages which depend upon the listed package.
    Finally, tsort(1) is applied to the 'dependstree' file to
    determine the correct build order for the bulk build.  The build
    order is stored in a 'buildorder' file.  This is all achieved via
    the 'bulk-cache' top level target.  By extracting dependencies in
    this fashion, we avoid highly redundant recursive make calls.  For
    example, we no longer need to use a recursive make to find the
    dependencies for libtool thousands of times throughout the build.

4)  During the build, the 'buildorder' file is consulted to figure out
    which package should be built next.  Then to build the package,
    the following steps are taken:

    a)  Check for the existence of a '.broken' file in the package
    directory.  If this file exists, then the package is already
    broken for some reason so move on to the next package.

    b)  Remove all packages which are not needed to build the current
    package.  This dependency list is obtained from the 'dependsfile'
    created in step 3 and the BULK_PREREQ variable.

    c)  Install via pkg_add all packages which are needed to build the
    current package.  We are able to do this because we have been
    building our packages in a bottom up order so all dependencies
    should have been built.

    d)  Build and package the package.

    e)  If the package build fails, then we copy over the build log to
    a .broken file and in addition, we consult the 'supportsfile' and
    mark all packages which depend upon this one as broken by adding a
    line to their .broken files (creating them if needed).  By going
    ahead and marking these packages as broken, we avoid wasting time
    on them later.

    f)  Append the package directory name to the top level pkgsrc
    '.make' file to indicate that we have processed this package.

5)  Run the mk/bulk/post-build script to collect the summary and
    generate html pages and the email we've all seen.
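
The build order part of step 3 can be sketched roughly as follows.
This is only an illustration, not the bulk-cache target itself: the
pair-per-line layout of the 'dependstree' file, the package names, and
the position helper are all assumptions made for the example.

```shell
# Sketch only.  Assumed 'dependstree' layout: one "dependency dependent"
# pair per line.  The package names are examples.
workdir=$(mktemp -d)
dependstree="$workdir/dependstree"
buildorder="$workdir/buildorder"

cat > "$dependstree" <<'EOF'
devel/libtool devel/glib
devel/gmake devel/glib
devel/glib x11/gtk
EOF

# tsort(1) prints each name once, with the left-hand name of every
# pair ahead of the right-hand name, i.e. dependencies come first.
tsort "$dependstree" > "$buildorder"

# position PKG: hypothetical helper which prints the line number of
# PKG in the build order.
position() {
    grep -n -x -F "$1" "$buildorder" | cut -d: -f1
}
```

The relative order of packages which do not depend on each other (here
devel/libtool and devel/gmake) is left to tsort(1), so only the
"dependency before dependent" property should be relied upon.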

====================================================================
Single Machine Build Comments
====================================================================

There are several features of this approach that are worth mentioning
explicitly.

1)  Packages are built in the correct order.  We don't want to rebuild
    the gnome meta-pkg and then rebuild gnome-libs for example.

2)  Restarting the build is a cheap operation.  Remember that this
    build can take weeks or more.  In fact the 1.6 build took nearly 6
    weeks on a sparc 20!  If for some reason the build needs to be
    interrupted, it can be easily restarted because in step 4f we keep
    track of what has been built in a file.  The lines in the build
    script which control this are:

      for pkgdir in `cat $ORDERFILE` ; do
        if ! grep -q "^${pkgdir}\$" $BUILDLOG ; then
          (cd $pkgdir && \
             nice -n 20 ${BMAKE} USE_BULK_CACHE=yes bulk-package)
        fi
      done

    In addition to storing the progress to disk, the bulk cache files
    (the 'dependstree', 'dependsfile', 'supportsfile', and
    'buildorder' files) are stored on disk so they do not need to be
    recreated if a build is stopped and then restarted.

3)  By leaving packages installed and only deleting the ones which are
    not needed before each build, we reduce the amount of installing
    and deinstalling needed during the build.  For example, it is
    quite common to build several packages in a row which all need GNU
    make or perl.

4)  Using the 'supportsfile' to mark all packages which depend upon a
    package which has just failed to build can greatly reduce the time
    wasted on trying to build packages with known broken dependencies.
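
The marking of dependent packages described in item 4 might look
something like the sketch below.  It is not the code used by the bulk
build: the layout of the 'supportsfile' (package first, then the
packages which depend upon it), the mark_broken helper, the message
text, and the package names are assumptions for illustration.

```shell
# Sketch only.  Assumed 'supportsfile' layout: the first field is a
# package, the remaining fields are the packages which depend upon it.

# mark_broken PKGSRCDIR SUPPORTSFILE FAILEDPKG (hypothetical helper)
# Appends one line to the .broken file of every dependent package,
# creating the file if needed.
mark_broken() {
    awk -v pkg="$3" '$1 == pkg { for (i = 2; i <= NF; i++) print $i }' "$2" |
    while read -r dependent; do
        echo "dependency $3 failed to build" >> "$1/$dependent/.broken"
    done
}

# Demo on a throwaway tree with example package names.
pkgsrc=$(mktemp -d)
mkdir -p "$pkgsrc/devel/glib" "$pkgsrc/x11/gtk" "$pkgsrc/lang/perl5"
cat > "$pkgsrc/supportsfile" <<'EOF'
devel/libtool devel/glib x11/gtk
lang/perl5
EOF

mark_broken "$pkgsrc" "$pkgsrc/supportsfile" devel/libtool
```

After the call, devel/glib and x11/gtk each carry a .broken file and
would be skipped by step 4a, while lang/perl5 is untouched.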

====================================================================
Parallel Build Thoughts
====================================================================

To exploit multiple machines in an attempt to reduce the build time,
many of the same ideas used in the single machine build can still be
used.  My view of how a parallel build should work is detailed here.

master   == master machine.  This machine is in charge of directing
            the build and may or may not actively participate in it.
            In addition, this machine might not be of the same
            architecture or operating system as the slaves (unless it
            is to be used as a slave as well).

slave#x  == slave machine #x.  All slave machines are of the same
            MACHINE_ARCH and have the same operating system and access
            the same pkgsrc tree via NFS and access the same binary
            packages directory.

            If the master machine is also to be used as a build
            machine, then it is also considered a slave.

Prior to starting the build, the master directs one of the slaves to
extract the dependency information per steps 1-3 in the single machine
case.

The actual build should progress as follows:

1)  For each slave which needs a job, the master assigns a package to
    build based on the rule that only packages that have had all their
    dependencies built will be sent to slaves for compilation.

2)  When a slave finishes, the master either notes that the binary
    package is now available for use as a dependency _or_ notes
    failure and marks all packages which depend upon it as broken as
    in step 4e of the single machine build.

Each slave builds a package in the same way as it would in a single
machine build (steps 4a-d).
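
The rule in step 1 could be sketched like this.  Nothing here exists
in the bulk build today: the ready_packages helper, the 'built' and
'skip' state files, and the assumed 'dependsfile' layout (package
first, then all of its build dependencies) are hypothetical.

```shell
# Sketch only.  Assumed files, one entry per line unless noted:
#   dependsfile  "pkg dep1 dep2 ..." (flattened build dependencies)
#   built        packages whose binary package is available
#   skip         packages already assigned to a slave or marked broken

# ready_packages DEPENDSFILE BUILTFILE SKIPFILE (hypothetical helper)
# Prints every package which may be handed to a slave right now.
ready_packages() {
    awk '
        NF == 0              { next }
        FILENAME == ARGV[1]  { built[$1] = 1; next }
        FILENAME == ARGV[2]  { skip[$1] = 1; next }
        ($1 in built) || ($1 in skip) { next }
        {
            # Reject the package if any dependency is not built yet.
            for (i = 2; i <= NF; i++)
                if (!($i in built)) next
            print $1
        }
    ' "$2" "$3" "$1"
}

# Demo state with example package names.
state=$(mktemp -d)
cat > "$state/dependsfile" <<'EOF'
devel/libtool
devel/gmake
devel/glib devel/libtool devel/gmake
x11/gtk devel/glib devel/libtool devel/gmake
EOF
: > "$state/built"      # nothing finished yet
: > "$state/skip"       # nothing assigned or broken yet
```

With both state files empty only devel/libtool and devel/gmake are
offered.  Once those two are listed in 'built', devel/glib becomes
available, and x11/gtk follows only after devel/glib is built.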

====================================================================
Important Parallel Build Considerations
====================================================================

1)  Security.  Packages are installed as root prior to packaging.

2)  All state kept by the master should be stored to disk to
    facilitate restarting a build.  Remember this could take weeks so
    we don't want to have to start over.

3)  The master needs to be able to monitor all slaves for signs of
    life.  I.e., if a slave machine is simply shut off, the master
    should detect that it's no longer there and re-assign that slave's
    current job.

3a) The master must be able to distinguish between a slave failing to
    compile a package due to the package failing vs a
    network/power/disk/etc. failure.  The former causes the package to
    be marked as broken, the latter causes the slave to be marked as
    broken.

4)  Ability to add and remove slaves from the cluster during a build.
    Again, a build may take a long time so we want to add/remove
    slaves while the build is in progress.
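
One possible way to make the distinction in item 3a, assuming jobs are
sent out with ssh (which is only an idea at this point), is to look at
the exit status: ssh(1) returns the exit status of the remote command,
or 255 if ssh itself hit an error such as a failed connection.  The
classify_result helper and the result names below are hypothetical.

```shell
# Sketch only.  Maps the exit status of a job run through ssh onto the
# three outcomes the master cares about.

# classify_result STATUS (hypothetical helper)
classify_result() {
    case "$1" in
        0)   echo built ;;           # binary package is now available
        255) echo slave-failed ;;    # ssh error: re-assign the job
        *)   echo package-broken ;;  # build failed: mark dependents
    esac
}

# Intended use (not run here; $slave and $pkgdir are placeholders):
#   ssh "$slave" "cd $pkgdir && make bulk-package"
#   result=$(classify_result $?)
```

This is not watertight.  A remote command which itself exits with 255
would be mistaken for a slave failure, and a slave which hangs without
dropping the connection would need a separate timeout.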

====================================================================
Additional Thoughts
====================================================================

This is mostly related to using slaves which are not on a local
network.

-  maybe a hook could be put in place which rsync's the binary package
   tree between the binary package repository machine and the slave
   machine before and after each package is built?

-  security

-  support for Kerberos?

====================================================================
Implementation Thoughts
====================================================================

-  Can this all be written around using ssh to send out tasks?  How do
   we monitor slaves for signs of life?  How do we indicate 'build
   failed/build succeeded/slave failed' conditions?

-  Maybe we could have a file listing slaves and the master consults
   this each time it needs a slave.  That would make adding/removing
   slaves easy.  There would need to be another file to keep track of
   which slaves are busy (and with what).

-  Do we want to use something like pvm instead?  There is a
   p5-Parallel-Pvm package, and perl nicely deals with parsing some of
   these files and sorting dependencies, although I hate to add any
   extra dependencies to the build system.
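
The two-file idea above could be sketched as follows.  The file names
'slaves' and 'busy', their layouts, the helper names, and the host
names are all assumptions for illustration.

```shell
# Sketch only.  Assumed files:
#   slaves  one host name per line; edit it at any time to add or
#           remove slaves
#   busy    "host pkgdir" per line, one line per running job

# idle_slaves SLAVESFILE BUSYFILE: print slaves without a running job.
idle_slaves() {
    awk '
        FILENAME == ARGV[1]     { busy[$1] = 1; next }
        NF > 0 && !($1 in busy) { print $1 }
    ' "$2" "$1"
}

# mark_busy BUSYFILE HOST PKGDIR: record that HOST is building PKGDIR.
mark_busy() {
    echo "$2 $3" >> "$1"
}

# mark_idle BUSYFILE HOST: drop the record for HOST.
mark_idle() {
    awk -v host="$2" '$1 != host' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo with example host names.
cluster=$(mktemp -d)
printf '%s\n' build1 build2 build3 > "$cluster/slaves"
: > "$cluster/busy"
```

Because the 'slaves' file is re-read on every call, adding or removing
a line takes effect the next time the master looks for an idle slave.
Keeping 'busy' on disk also means a restarted master can see which
jobs were in flight.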