sys/vfs/hammer2/DESIGN

   1
   2                             HAMMER2 DESIGN DOCUMENT
   3
   4 * These features have been specified in the media structures.
   5
   6 * Implementation work has begun.
   7
   8 * A working filesystem with some features implemented is expected by July 2012.
   9
  10 * A fully functional filesystem with most (but not all) features is expected
  11   by the end of 2012.
  12
  13 * All elements of the filesystem have been designed except for the freemap
  14   (which isn't needed for initial work).  8MB per 2GB of filesystem
  15   storage has been reserved for the freemap.  The design of the freemap
  16   is expected to be completely specified by mid-year.
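
  As a rough illustration of the reservation arithmetic above, here is a
  minimal sketch.  It assumes (the text does not say) that a partially
  filled trailing 2GB zone still reserves a full 8MB.  The function and
  constant names are hypothetical.

```python
# Sketch of the "8MB per 2GB" freemap reservation described above.
# Assumption: a partial trailing 2GB zone still reserves a full 8MB.
MB = 1 << 20
GB = 1 << 30
ZONE_BYTES = 2 * GB
FREEMAP_BYTES_PER_ZONE = 8 * MB


def freemap_reserved_bytes(volume_bytes: int) -> int:
    """Total bytes reserved for the freemap on a volume of the given size."""
    zones = -(-volume_bytes // ZONE_BYTES)  # ceiling division
    return zones * FREEMAP_BYTES_PER_ZONE


def freemap_overhead_fraction() -> float:
    """Fraction of each zone set aside for the freemap (1/256)."""
    return FREEMAP_BYTES_PER_ZONE / ZONE_BYTES
```

  Under that assumption the reservation costs 1/256 of the media, a bit
  under 0.4%.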
  17
  18 * This is my only project this year.  I'm not going to be doing any major
  19   kernel bug hunting this year.
  20
  21                                 Feature List
  22
  23 * Multiple roots (allowing snapshots to be mounted).  This is implemented
  24   via the super-root concept.  When mounting a HAMMER2 filesystem you specify
  25   a device path and a directory name in the super-root.
  26
  27 * HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  28   in the tree can be configured as a PFS, causing all elements recursively
  29   underneath that directory to become a part of that PFS.
  30
  31 * Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  32   show up in the super-root.  It is possible to snapshot a subdirectory
  33   and then later snapshot a parent of that subdirectory... really there are
  34   no limitations here.
  35
  36 * Directory sub-hierarchy based quotas and space and inode usage tracking.
  37   Any directory sub-tree, whether at a mount point or not, tracks aggregate
  38   inode use and data space use.  This is stored in the directory inode all
  39   the way up the chain.
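
  The "all the way up the chain" accounting can be sketched as follows.
  This is only an illustration of the idea, not the media format: the
  DirInode class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirInode:
    """Hypothetical in-memory stand-in for a directory inode."""
    name: str
    parent: Optional["DirInode"] = None
    inode_count: int = 0  # aggregate inodes at or below this directory
    data_bytes: int = 0   # aggregate data space at or below this directory


def account(directory: DirInode, inodes: int, nbytes: int) -> None:
    """Apply a usage delta to a directory and every ancestor above it."""
    node = directory
    while node is not None:
        node.inode_count += inodes
        node.data_bytes += nbytes
        node = node.parent
```

  Because every ancestor carries the aggregate, a quota check at any
  directory is a single comparison against that directory's own counters.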
  40
  41 * Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  42   block-oriented and copy-on-write each blockref tracks both direct
  43   modifications to the referenced data via (modify_tid) and indirect
  44   modifications to the referenced data or any sub-tree via (mirror_tid).
  45   This makes it possible to do an incremental scan of meta-data that covers
  46   only changes made since the mirror_tid recorded in a prior-run.
  47
  48   This feature is also intended to be used to locate recently allocated
  49   blocks and thus be able to fix up the freemap after a crash.
  50
  51   HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  52   that HAMMER2 does not keep track of 'deleted' records.  Instead any
  53   recursion by the mirroring code which finds that (modify_tid) has
  54   been updated must also send the direct block table or indirect block
  55   table state it winds up recursing through so the target can check
  56   similar key ranges and locate elements to be deleted.  This can be
  57   avoided if the mirroring stream is mostly caught up in that very recent
  58   deletions will be cached in memory and can be queried, allowing shorter
  59   record deletions to be passed in the stream instead.
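
  The incremental scan described above might look roughly like this.  It
  is a sketch under simplifying assumptions: blockrefs are modelled as an
  in-memory tree, and the BlockRef class and changed_since() name are
  hypothetical.  Only the roles of modify_tid and mirror_tid come from
  the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class BlockRef:
    """Hypothetical in-memory model of a blockref and what it references."""
    key: str
    modify_tid: int  # last direct modification of the referenced data
    mirror_tid: int  # newest modification anywhere at or below this ref
    children: List["BlockRef"] = field(default_factory=list)


def changed_since(ref: BlockRef, last_tid: int) -> Iterator[BlockRef]:
    """Yield blockrefs directly modified after last_tid."""
    if ref.mirror_tid <= last_tid:
        return  # nothing at or below this point changed: prune the sub-tree
    if ref.modify_tid > last_tid:
        yield ref
    for child in ref.children:
        yield from changed_since(child, last_tid)
```

  The pruning test on mirror_tid is what keeps the scan proportional to
  the amount of change rather than to the size of the filesystem.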
  60
  61 * Will support multiple compression algorithms configured on subdirectory
  62   tree basis and on a file basis.  Up to 64K block compression will be used.
  63   Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1,
  64   4:1, 8:1, etc) will work in this scheme because physical block allocations
  65   in HAMMER2 are always power-of-2.
  66
  67   Compression algorithm #0 will mean no compression and no zero-checking.
  68   Compression algorithm #1 will mean zero-checking but no other compression.
  69   Real compression will be supported starting with algorithm 2.
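
  To see why only near-power-of-2 ratios pay off, consider this sketch.
  The 64K logical block comes from the text.  The 1K minimum allocation
  and the function name are assumptions made for the example.

```python
def physical_size(compressed_len: int,
                  logical_len: int = 65536,
                  min_alloc: int = 1024) -> int:
    """Bytes actually allocated for a compressed block.

    The compressed length is rounded up to a power of 2.  If that does not
    beat the logical size, the block is stored uncompressed.
    min_alloc is a hypothetical minimum allocation size.
    """
    assert compressed_len >= 1
    rounded = 1 << (compressed_len - 1).bit_length()  # next power of 2
    rounded = max(rounded, min_alloc)
    return rounded if rounded < logical_len else logical_len
```

  A 64K block that compresses to 30000 bytes lands in a 32K allocation
  (2:1), while one that compresses to 40000 bytes rounds back up to 64K
  and saves nothing.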
  70
  71 * Zero detection on write (writing all-zeros), which requires the data
  72   buffer to be scanned, will be supported as compression algorithm #1.
  73   This allows the writing of 0's to create holes and will be the default
  74   compression algorithm for HAMMER2.
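
  A minimal sketch of the zero-check, assuming a hole is represented by
  simply not allocating a block (modelled here as None).  The function
  name is hypothetical.

```python
from typing import Optional


def block_to_write(buf: bytes) -> Optional[bytes]:
    """Return None (a hole) if buf is entirely zero, else the buffer."""
    if buf.count(0) == len(buf):  # full scan of the data buffer
        return None
    return buf
```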
  75
  76 * Copies support for redundancy.  The media blockref structure would
  77   have become too bloated but I found a clean way to do copies using the
  78   blockset structure (which is a set of 8 fully associative blockref's).
  79
  80   The design is such that the filesystem should be able to function at
  81   full speed even if disks are pulled or inserted, as long as at least one
  82   good copy is present.  A background task will be needed to resynchronize
  83   missing copies (or remove excessive copies in the case where the copies
  84   value is reduced on a live filesystem).
  85
  86 * Intended to be clusterable, with a multi-master protocol under design
  87   but not expected to be fully operational until mid-2013.  The media
  88   format for HAMMER1 was less conducive to logical clustering than I had
  89   hoped so I was never able to get that aspect of my personal goals
  90   working with HAMMER1.  HAMMER2 effectively solves the issues that cropped
  91   up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
  92   file/directory hierarchy, making cache coherency very difficult).
  93
  94 * Hardlinks will be supported.  All other standard features will be supported
  95   too of course.  Hardlinks in this sort of filesystem require significant
  96   work.
  97
  98 * The media blockref structure is now large enough to support up to a 192-bit
  99   check value, which would typically be a cryptographic hash of some sort.
 100   Multiple check value algorithms will be supported with the default being
 101   a simple 32-bit iSCSI CRC.
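
  The iSCSI CRC is CRC-32C (Castagnoli polynomial).  For reference, a
  bit-at-a-time sketch follows.  A real implementation would be
  table-driven or use a hardware instruction.  How the check value is
  laid out in the blockref is not shown here.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """CRC-32C (Castagnoli), bitwise, reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit, folding in the polynomial on a carry.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

  The standard check input b"123456789" yields 0xE3069283.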
 102
 103 * Fully verified deduplication will be supported and automatic (and
 104   necessary in many respects).
 105
 106 * Non-verified de-duplication will be supported as a configurable option on
 107   a file or subdirectory tree.  Non-verified deduplication would use the
 108   largest available check code (192 bits) and not bother to verify data
 109   matches during the dedup pass, which is necessary on extremely large
 110   filesystems with a great deal of deduplicable data (as otherwise a large
 111   chunk of the media would have to be read to implement the dedup).
 112
 113   This feature is intended only for those files where occasional corruption
 114   is ok, such as in a large data store of farmed web content.
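
  The difference between the two dedup modes can be sketched as below.
  This is an illustration only.  SHA-256 truncated to 24 bytes stands in
  for a 192-bit check code, and DedupStore is a hypothetical in-memory
  model, not the on-media design.

```python
import hashlib
from typing import Dict


def check_code(data: bytes) -> bytes:
    """Stand-in 192-bit check value (assumption: truncated SHA-256)."""
    return hashlib.sha256(data).digest()[:24]


class DedupStore:
    """Hypothetical block store that dedups on write."""

    def __init__(self, verify: bool = True) -> None:
        self.verify = verify
        self.blocks: Dict[int, bytes] = {}   # block number -> data
        self.by_code: Dict[bytes, int] = {}  # check code -> block number

    def write(self, data: bytes) -> int:
        """Store data and return the block number that now holds it."""
        code = check_code(data)
        existing = self.by_code.get(code)
        if existing is not None:
            # Verified mode reads the candidate back and compares the bytes.
            # Non-verified mode trusts the check code alone.
            if not self.verify or self.blocks[existing] == data:
                return existing
        blockno = len(self.blocks)
        self.blocks[blockno] = data
        self.by_code.setdefault(code, blockno)
        return blockno
```

  The read-back in verified mode is exactly the media I/O that the
  non-verified option avoids.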
 115
 116                                 GENERAL DESIGN
 117
 118 HAMMER2 generally implements a copy-on-write block design for the filesystem,
 119 which is very different from HAMMER1's B-Tree design.  Because the design
 120 is copy-on-write it can be trivially snapshotted simply by referencing an
 121 existing block, and because the media structures logically match a standard
 122 filesystem directory/file hierarchy snapshots and other similar operations
 123 can be trivially performed on an entire subdirectory tree at any level in
 124 the filesystem.
 125
 126 The copy-on-write nature of the filesystem implies that any modification
 127 whatsoever will have to eventually synchronize new disk blocks all the way
 128 to the super-root of the filesystem and the volume header itself.  This forms
 129 the basis for crash recovery.  All disk writes are to new blocks except for
 130 the volume header, thus allowing all writes to run concurrently except for
 131 the volume header update at the end.
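
A toy model of this propagation, using nested dictionaries in place of
disk blocks, is sketched below.  The cow_write() helper is hypothetical
and ignores flushing and the volume header.  It only shows that a write
creates new blocks along one path to the root while everything else is
shared, which is also why a snapshot is just a retained old root.

```python
from typing import Tuple


def cow_write(root: dict, path: Tuple[str, ...], data: bytes) -> dict:
    """Return a new root with path set to data.  The old root is untouched."""
    name, rest = path[0], path[1:]
    new_dir = dict(root)  # shallow copy: only this one "block" is new
    if rest:
        new_dir[name] = cow_write(root.get(name, {}), rest, data)
    else:
        new_dir[name] = data
    return new_dir
```

Keeping a reference to the root from before the write preserves the old
contents, and untouched sub-trees are shared between both roots.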
 132
 133 Clearly this method requires intermediate modifications to the chain to be
 134 cached so multiple modifications can be aggregated prior to being
 135 synchronized.  One advantage, however, is that the cache can be flushed at
 136 any time WITHOUT having to allocate yet another new block when further
 137 modifications are made as long as the volume header has not yet been flushed.
 138 This means that buffer cache overhead is very well bounded and can handle
 139 filesystem operations of any complexity even on boxes with very small amounts
 140 of physical memory.
 141
 142 I intend to implement a shortcut to make fsync()'s run fast, and that is to
 143 allow deep updates to blockrefs to shortcut to auxiliary space in the
 144 volume header to satisfy the fsync requirement.  The related blockref is
 145 then recorded when the filesystem is mounted after a crash and the update
 146 chain is reconstituted when a matching blockref is encountered again during
 147 normal operation of the filesystem.
 148
 149 Basically this means that no real work needs to be done at mount-time
 150 even after a crash.
 151
 152 Directories are hashed, and another major design element is that directory
 153 entries ARE INODES.  They are one and the same.  In addition to directory
 154 entries being inodes the data for very small files (512 bytes or smaller)
 155 can be directly embedded in the inode (overloaded onto the same space that
 156 the direct blockref array uses).  This should result in very high
 157 performance.
 158
 159 Inode numbers are not spatially referenced, which complicates NFS servers
 160 but doesn't complicate anything else.  The inode number is stored in the
 161 inode itself, an absolutely necessary feature in order to support the
 162 hugely flexible snapshots that we want to have in HAMMER2.
 163
 164                                   HARDLINKS
 165
 166 Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
 167 a spatial reference to the inode number.  We do not want to have to have
 168 an index of inode numbers for any basic HAMMER2 feature if we can help it.
 169
 170 Hardlinks are handled by placing the inode for a multiply-hardlinked file
 171 in the closest common parent directory.  If "a/x" and "a/y" are hardlinked
 172 the inode for the hardlinked file will be placed in directory "a", e.g.
 173 "a/3239944", but it will be invisible and will be in an out-of-band namespace.
 174 The directory entries "a/x" and "a/y" will be given the same inode number
 175 but in fact just be placemarks that cause HAMMER2 to recurse upwards through
 176 the directory tree to find the invisible inode number.
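
Locating the closest common parent directory for a set of hardlinked
names can be sketched as follows.  Paths are plain '/'-separated strings
relative to the mount, and the function name is hypothetical.

```python
from typing import List


def closest_common_parent(paths: List[str]) -> str:
    """Deepest directory that contains every given directory entry."""
    dirs = [p.split("/")[:-1] for p in paths]  # drop the final name
    common = dirs[0]
    for d in dirs[1:]:
        n = 0
        while n < len(common) and n < len(d) and common[n] == d[n]:
            n += 1
        common = common[:n]
    return "/".join(common)  # "" means the root of the mount
```

For "a/x" and "a/y" this gives "a", matching the example above.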
 177
 178 Because directories are hashed and a different namespace (hash key range)
 179 is used for hardlinked inodes, standard directory scans are able to trivially
 180 skip this invisible namespace and inode-specific lookups can restrict their
 181 lookup to within this space.
 182
 183 The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
 184 trivial.  Basically the inode media structure is copied as needed to break-up
 185 or re-form the standard directory entry/inode.  There are no backpointers in
 186 HAMMER2 and no reference counts on the blocks (see FREEMAP NOTES below), so
 187 it is an utterly trivial operation.
 188
 189                                 FREEMAP NOTES
 190
 191 In order to implement fast snapshots (and writable snapshots for that
 192 matter), HAMMER2 does NOT ref-count allocations.  The freemap which
 193 is still under design just won't do that.  All the freemap does is
 194 keep track of 100% free blocks.
 195
 196 This not only trivializes all the snapshot features it also trivializes
 197 hardlink handling and solves the problem of keeping the freemap synchronized
 198 in the event of a crash.  Now all we have to do after a crash is make
 199 sure blocks allocated before the freemap was flushed are properly
 200 marked as allocated in the allocmap.  This is a trivial exercise using the
 201 same algorithm the mirror streaming code uses (which is very similar to
 202 HAMMER1)... an incremental meta-data scan that covers only the blocks that
 203 might have been allocated between the last allocation map sync and now.
 204
 205 Thus the freemap does not have to be synchronized during a fsync().
 206
 207 The complexity is in figuring out what can be freed... that is, when one
 208 can mark blocks in the freemap as being free.  HAMMER2 implements this as
 209 a background task which essentially must scan available meta-data to
 210 determine which blocks are not being referenced.
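
In its simplest form that background task is a reachability scan.  The
sketch below assumes a full scan from every root (including snapshots)
and models blockrefs as a dictionary of block number to referenced block
numbers.  All names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def reachable_blocks(roots: Iterable[int],
                     refs: Dict[int, List[int]]) -> Set[int]:
    """All blocks referenced, directly or indirectly, from any root."""
    seen: Set[int] = set()
    stack = list(roots)
    while stack:
        blk = stack.pop()
        if blk in seen:
            continue  # shared by several roots or parents, visit once
        seen.add(blk)
        stack.extend(refs.get(blk, []))
    return seen


def freeable_blocks(allocated: Set[int],
                    roots: Iterable[int],
                    refs: Dict[int, List[int]]) -> Set[int]:
    """Allocated blocks that nothing references any more."""
    return allocated - reachable_blocks(roots, refs)
```

Note that a block shared by two roots needs no reference count.  It
simply stays reachable until the last root referencing it goes away.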
 211
 212 Part of the ongoing design work is finding ways to reduce the scope of this
 213 meta-data scan so the entire filesystem's meta-data does not need to be
 214 scanned (though in tests with HAMMER1, even full meta-data scans have
 215 turned out to be fairly low cost).  In other words, it's an area that we
 216 can continue to improve on as the filesystem matures.  Not only that, but
 217 we can completely change the freemap algorithms without creating
 218 incompatibilities (at worst simply having to require that a R+W mount do
 219 a full meta-data scan when upgrading or downgrading the freemap algorithm).
 220
 221                                   CLUSTERING
 222
 223 Clustering, as always, is the most difficult bit but we have some advantages
 224 with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
 225 structures generally follow the kernel's filesystem hierarchy.  Second,
 226 HAMMER2's writable snapshots make it possible to implement several forms
 227 of multi-master clustering.
 228
 229 The general mechanics for most of the multi-master clustering implementations
 230 will be as follows:
 231
 232     (a) Use the copies mechanism to specify all elements of the cluster,
 233         both local and remote (networked).
 234
 235     (b) The core synchronization state operates just as it does for copies,
 236         simply requiring a fully-flushed ack from the remote in order to
 237         mark the blocks as having been fully synchronized.
 238
 239         The mirror_tid may be used to locate these blocks, allowing the
 240         synchronization state to be updated on the fly at a much later
 241         time without requiring the state to be maintained in-memory.
 242         (also for crash recovery resynchronization purposes).
 243
 244     (c) Data/meta-data can be retrieved from those copies which are marked
 245         as being synchronized, with priority given to the local storage
 246         relative to any given physical machine.
 247
 248         This means that e.g. even in a master-slave orientation the slave
 249         may be able to satisfy a request from a program when the slave
 250         happens to be the local storage.
 251
 252     (d) Transaction id synchronization between all elements of the cluster,
 253         typically through masking (assigning a cluster number using the low
 254         3 bits of the transaction id).
 255
 256     (e) General access (synchronized or otherwise) may require cache
 257         coherency mechanisms to run over the network.
 258
 259         Implementing cache coherency is a major complexity issue.
 260
 261     (f) General access (synchronized or otherwise) may require quorum
 262         agreement, using the synchronization flags in the blockrefs
 263         to determine whether agreement has been reached.
 264
 265         Implementing quorum voting is a major complexity issue.
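
The masking in (d) might be sketched like this, assuming the low 3 bits
carry the cluster number and the remaining bits form a counter.  The
helper names are hypothetical.

```python
CLUSTER_BITS = 3
CLUSTER_MASK = (1 << CLUSTER_BITS) - 1


def next_tid(last_tid: int, node: int) -> int:
    """Smallest tid above last_tid whose low 3 bits equal node."""
    assert 0 <= node <= CLUSTER_MASK
    tid = (last_tid & ~CLUSTER_MASK) | node
    if tid <= last_tid:
        tid += 1 << CLUSTER_BITS
    return tid


def tid_node(tid: int) -> int:
    """Cluster number encoded in a transaction id."""
    return tid & CLUSTER_MASK
```

With this layout two nodes can never mint the same transaction id, and
any node can advance past the highest tid it has seen from a peer.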
 266
 267 There are lots of ways to implement multi-master environments using the
 268 above core features but the implementation is going to be fairly complex
 269 even with HAMMER2's feature set.
 270
 271 Keep in mind that modifications propagate all the way to the super-root
 272 and volume header, so in any clustered arrangement the use of (modify_tid)
 273 and (mirror_tid) is critical in determining the synchronization state of
 274 portion(s) of the filesystem.
 275
 276 Specifically, since any modification propagates to the root the (mirror_tid)
 277 in higher level directories is going to be in a constant state of flux.  This
 278 state of flux DOES NOT invalidate the cache state for these higher levels
 279 of directories.  Instead, the (modify_tid) is used on a node-by-node basis
 280 to determine cache state at any given level, and (mirror_tid) is used to
 281 determine whether any recursively underlying state is desynchronized.
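
Put as code, the two tests are independent.  This is a sketch only: the
Tids class is a hypothetical pairing of the two fields, with "cached"
being what a node remembers and "current" what the blockref now holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tids:
    modify_tid: int
    mirror_tid: int


def node_cache_valid(cached: Tids, current: Tids) -> bool:
    """The node itself is unchanged if its modify_tid did not move."""
    return cached.modify_tid == current.modify_tid


def subtree_in_sync(cached: Tids, current: Tids) -> bool:
    """Nothing underneath changed if mirror_tid did not move."""
    return cached.mirror_tid == current.mirror_tid
```

A high-level directory will routinely pass the first test while failing
the second, which is the "constant state of flux" case described above.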
 282
 283 * Simple semi-synchronized multi-master environment.
 284
 285     In this environment all nodes are considered masters and modifications
 286     can be made on any of them, and then propagate to the others
 287     asynchronously via HAMMER2 mirror streams.  One difference here is
 288     that the kernel can activate these userland-managed streams automatically
 289     when the copies configuration is used to specify the cluster.
 290
 291     The only type of conflict which isn't readily resolvable by comparing
 292     the (modify_tid) is when file data is updated.  In this case user
 293     intervention might be required but, theoretically, it should be
 294     possible to automate most merges using a multi-way patch and, if not,
 295     choosing one and creating backup copies of the others to allow the
 296     user or sysop to resolve the conflict later.
 297
 298 * Simple fully synchronized fail-over environment.
 299
 300     In this environment there is one designated master and the remaining
 301     nodes are slaves.  If the master fails all remaining nodes agree on a
 302     new master, possibly with the requirement that a quorum be achieved
 303     (if you don't want to allow the cluster to split).
 304
 305     If network splits are allowed then each sub-cluster operates in this
 306     mode but recombining the clusters reverts to the first algorithm.
 307     If not allowed whoever no longer has a quorum will be forced to stall.
 308
 309     In this environment the current designated master is responsible for
 310     managing locks for modifying operations.  The designated master will
 311     proactively tell the other nodes to mark the blocks related to the
 312     modifying operation as no longer being synchronized while any local
 313     data at the node that acquired the lock (master or slave) remains
 314     marked as being synchronized.
 315
 316     The node that successfully gets the lock then issues the modifying
 317     operation to both its local copy and to the master, marking the
 318     master as being desynchronized until the master acknowledges receipt.
 319
 320     In this environment any node can access data from local storage if
 321     the designated master copy is marked synchronized AND its (modify_tid)
 322     matches the slave copy's (modify_tid).
 323
 324     However, if a slave disconnects from the master then reconnects the
 325     slave will have lost the master's desynchronization stream and must
 326     mark its root blockref for the master copy HAMMER2_BREF_DESYNCCHLD as
 327     well as clear the SYNC1/SYNC2 bits.  Setting DESYNCCHLD forces on-demand
 328     recursive reverification that the master and slave are (or are not) in
 329     sync in order to reestablish on the slave the synchronization state of
 330     the master.
 331
 332     That might be a bit confusing but the whole point here is to allow
 333     read accesses to the filesystem to be satisfied by any node in a
 334     multi-master cluster, not just by the current designated master.
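
    The local-read rule and the reconnect handling above can be sketched
    as follows.  The flag names follow the text but the bit values are
    hypothetical, as are the function names.

```python
# Bit values are assumptions for illustration only.
BREF_SYNC1 = 0x01
BREF_SYNC2 = 0x02
BREF_DESYNCCHLD = 0x04


def can_read_locally(master_synced: bool,
                     master_modify_tid: int,
                     local_modify_tid: int) -> bool:
    """Local storage may satisfy a read only if both conditions hold."""
    return master_synced and master_modify_tid == local_modify_tid


def flags_after_reconnect(flags: int) -> int:
    """Clear SYNC1/SYNC2 and set DESYNCCHLD on the root blockref."""
    return (flags & ~(BREF_SYNC1 | BREF_SYNC2)) | BREF_DESYNCCHLD
```

    After a reconnect every read falls back to the master until the
    on-demand reverification marks the relevant blocks synchronized again.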
 335
 336 * Fully cache coherent and synchronized multi-master environment.
 337
 338     In this environment a quorum is required to perform any modifying
 339     action.  All nodes are masters (there is no 'designated' master)
 340     and all nodes connect to all other nodes in a cross-bar.
 341
 342     The quorum is specified by copies setup in the root volume configuration.
 343     A quorum of nodes in the cluster must agree on the copies configuration.
 344     If they do not the cluster cannot proceed to mount.  Any other nodes
 345     not in the quorum which are in the cluster which disagree with the
 346     configuration will inherit the copies configuration from the quorum.
 347
 348     Any modifying action will initiate a lock request locally to all nodes
 349     in the cluster.  The modifying action is allowed to proceed the instant
 350     a quorum of nodes respond in the affirmative (even if some have not
 351     yet responded or are down).  The modifying action is considered complete
 352     once the two-phase commit protocol succeeds.  The modifying action
 353     typically creates and commits a temporary snapshot on at least a quorum
 354     of masters as phase-1 and then ties the snapshot back into the main
 355     mount as phase-2.
 356
 357     These locks are cache-coherency locks and may be passively maintained
 358     in order to aggregate multiple operations under the same lock and thus
 359     under the same transaction from the point of view of the rest of the
 360     quorum.
 361
 362     A lock request which interferes with a passively maintained lock will
 363     force the two-phase commit protocol to complete and then transfer
 364     ownership to the requesting entity, thus avoiding having to deal with
 365     deadlock protocols at this point in the state machine.
 366
 367     Since any node can initiate concurrent lock requests to many other nodes
 368     it is possible to deadlock.  When two nodes initiate conflicting lock
 369     requests to the cluster the one achieving the quorum basically wins and
 370     the other is forced to retry (go back one paragraph).  In this situation
 371     no deadlock will occur.
 372
 373     If three or more nodes initiate conflicting lock requests to the
 374     cluster a deadlock can occur whereby none of the nodes achieve a quorum.
 375     In this case every node will know which of the other nodes was granted
 376     the lock(s).  Deadlock resolution then proceeds simultaneously on the
 377     three nodes (since they have the same information), whereby the lock
 378     holders on the losing end of the algorithm transfer their locks to one
 379     of the other nodes.  The lock state and knowledge of the lock state is
 380     updated in real time on all nodes until a quorum is achieved.
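
    The quorum test and one possible deadlock resolution rule are
    sketched below.  The text only requires that all nodes reach the same
    answer from the same information.  The specific rule used here (most
    grants wins, ties go to the lowest node name) is an assumption, as
    are the function names.

```python
from typing import Dict


def has_quorum(grants: int, cluster_size: int) -> bool:
    """True when strictly more than half of the cluster has agreed."""
    return grants > cluster_size // 2


def resolve(grants: Dict[str, int], cluster_size: int) -> str:
    """Pick the lock winner from a map of requester -> grants received.

    Every node runs this on identical input and so picks the same winner.
    """
    for requester, count in grants.items():
        if has_quorum(count, cluster_size):
            return requester
    # Deadlock: nobody reached a quorum.  Hypothetical deterministic rule.
    return min(grants, key=lambda r: (-grants[r], r))
```

    The losing requesters would then transfer their grants to the winner
    until it holds a quorum.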
 381
 382 * Fully cache coherent and synchronized multi-master environment with
 383   passive read locking.
 384
 385     This is a more complex form of clustering than the previous form.
 386     Take the previous form and add the ability to passively hold SHARED
 387     locks in addition to the EXCLUSIVE locks the previous form is able
 388     to hold.
 389
 390     The advantage of being able to passively hold a shared lock on a sub-tree
 391     (locks can be held on single nodes or entire sub-trees) is that it is
 392     then possible for all nodes to validate a node (modify_tid) or entire
 393     sub-tree (mirror_tid) with a very short network transaction and then
 394     satisfy a large number of requests from local storage.
 395
 396 * Fully cache coherent and synchronized multi-master environment with
 397   passive read locking and slave-only nodes.
 398
 399     This is the MOST complex form of clustering we intend to support.
 400     In a multi-master environment requiring a quorum of masters to operate
 401     we implement all of the above plus ALSO allow additional nodes to be
 402     added to the cluster as slave-only nodes.
 403
 404     The difference between a slave-only node and setting up a manual
 405     mirror-stream from the cluster to a read-only snapshot on another
 406     HAMMER2 filesystem is that the slave-only node will be fully
 407     cache coherent with either the cluster proper (if connected to a quorum
 408     of masters), or to one or more other nodes in the cluster (if not
 409     connected to a quorum of masters), EVEN if the slave itself is not
 410     completely caught up.
 411
 412     So if the slave-only cluster node is connected to the rest of the cluster
 413     over a slow connection you basically get a combination of local disk
 414     speeds for any data that is locally in sync and network-limited speeds
 415     for any data that is not locally in sync.
 416
 417     Slave-only cluster nodes run a standard mirror-stream in the background
 418     to pull in the data as quickly as possible.
 419
 420     This is in constrast to a manual mirror-stream to a read-only
 421     snapshot (basically a simple slave), which has no ability to bypass
 422     the local storage to handle out-of-date requests (in fact has no ability
 423     to detect that the local storage is out-of-date anyway).
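
    The read path on a slave-only node can be pictured with this sketch.
    Dictionaries stand in for local storage and for the rest of the
    cluster, and the set of in-sync keys stands in for whatever the
    coherency protocol has validated.  All names are hypothetical.

```python
from typing import Dict, Set, Tuple


def read_block(key: str,
               local: Dict[str, bytes],
               remote: Dict[str, bytes],
               in_sync: Set[str]) -> Tuple[bytes, str]:
    """Serve from local storage when validated, else go to the cluster."""
    if key in in_sync and key in local:
        return local[key], "local"
    return remote[key], "network"
```

    A simple slave lacks the in_sync knowledge entirely, so it would
    always take the first branch and could return stale data.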