sys/vfs/hammer2/DESIGN

   1
   2                             HAMMER2 DESIGN DOCUMENT
   3
   4                                 Matthew Dillon
   5                              dillon@backplane.com
   6
   7                                03-Apr-2015 (v3)
   8                                14-May-2013 (v2)
   9                                08-Feb-2012 (v1)
  10
  11                         Current Status as of document date
  12
  13 * Filesystem Core       - operational
  14   - bulkfree            - operational
  15   - Compression         - operational
  16   - Snapshots           - operational
  17   - Deduper             - specced
  18   - Subhierarchy quotas - specced
  19   - Logical Encryption  - not specced yet
  20   - Copies              - not specced yet
  21   - fsync bypass        - not specced yet
  22
  23 * Clustering core
  24   - Network msg core    - operational
  25   - Network blk device  - operational
  26   - Error handling      - under development
  27   - Quorum Protocol     - under development
  28   - Synchronization     - under development
  29   - Transaction replay  - not specced yet
  30   - Cache coherency     - not specced yet
  31
  32                                     Feature List
  33
  34 * Block topology (both the main topology and the freemap) use a copy-on-write
  35   design.  Media-level block frees are delayed and flushes rotate between
  36   4 volume headers (maxes out at 4 if the filesystem is > ~8GB).  Flushes
  37   will allocate new blocks up to the root in order to propagate block table
  38   changes and transaction ids.
  39
  40 * Incremental update scans are trivial by design.
  41
  42 * Multiple roots, with many features.  This is implemented via the super-root
  43   concept.  When mounting a HAMMER2 filesystem you specify a device path and
  44   a directory name in the super-root.  (HAMMER1 had only one root).
  45
  46 * All cluster types and multiple PFSs (belonging to the same or different
  47   clusters) can be mixed on one physical filesystem.
  48
  49   This allows independent cluster components to be configured within a
  50   single formatted H2 filesystem.  Each component is a super-root entry,
  51   a cluster identifier, and a unique identifier.  The network protocl
  52   integrates the component into the cluster when it is created
  53
  54 * Roots are really no different from snapshots (HAMMER1 distinguished between
  55   its root mount and its PFS's.  HAMMER2 does not).
  56
  57 * Snapshots are writable (in HAMMER1 snapshots were read-only).
  58
  59 * Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
  60   both explicit and fine-grained/automatic.  HAMMER2 does not implement
  61   automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
  62   can create fine-grained snapshots if you desire.
  63
  64 * HAMMER2 formalizes a synchronization point for the flush, does a pre-flush
  65   that does not update the volume root, then waits for all running modifying
  66   operations to complete to memory (not to disk) while temporarily stalling
  67   new modifying operation initiations.  The final flush is then executed.
  68
  69   At the moment we do not allow concurrent modifying operations during the
  70   final flush phase.  Ultimately I would like to, but doing so can be complex.
  71
  72 * HAMMER2 flushes and synchronization points do not bisect VOPs (system calls).
  73   (HAMMER1 flushes could wind up bisecting VOPs).  This means the H2 flushes
  74   leave the filesystem in a far more consistent state than H1 flushes did.
  75
  76 * Directory sub-hierarchy-based quotas for space and inode usage tracking.
  77   Any directory can be used.
  78
  79 * Low memory footprint.  Except for the volume header, the buffer cache
  80   is completely asynchronous and dirty buffers can be retired by the OS
  81   directly to backing store with no further interactions with the filesystem.
  82
  83 * Background synchronization and mirroring occurs at the logical level.
  84   When a failure occurs or a normal validation scan comes up with
  85   discrepancies, the synchronization thread will use the quorum to figure
  86   out which information is not correct and update accordingly.
  87
  88 * Support for multiple compression algorithms configured on subdirectory
  89   tree basis and on a file basis.  Block compression up to 64KB will be used.
  90   Only compression ratios at powers of 2 that are at least 2:1 (e.g. 2:1,
  91   4:1, 8:1, etc) will work in this scheme because physical block allocations
  92   in HAMMER2 are always power-of-2.  Modest compression can be achieved with
  93   low overhead, is turned on by default, and is compatible with deduplication.
  94
  95 * Encryption.  Whole-disk encryption is supported by another layer, but I
  96   intend to give H2 an encryption feature at the logical layer which works
  97   approximately as follows:
  98
  99   - Encryption controlled by the client on an inode/sub-tree basis.
 100   - Server has no visibility to decrypted data.
 101   - Encrypt filenames in directory entries.  Since the filename[] array
 102     is 256 bytes wide, client can add random bytes after the normal
 103     terminator to make it virtually impossible for an attacker to figure
 104     out the filename.
 105   - Encrypt file size and most inode contents.
 106   - Encrypt file data (holes are not encrypted).
 107   - Encryption occurs after compression, with random filler.
 108   - Check codes calculated after encryption & compression (not before).
 109
 110   - Blockrefs are not encrypted.
 111   - Directory and File Topology is not encrypted.
 112   - Encryption is not sub-topology validation.  Client would have to keep
 113     track of that itself.  Server or other clients can still e.g. remove
 114     files, rename, etc.
 115
 116   In particular, note that even though the file size field can be encrypted,
 117   the server does have visibility on the block topology and thus has a pretty
 118   good idea how big the file is.  However, a client could add junk blocks
 119   at the end of a file to make this less apparent, at the cost of space.
 120
 121   If a client really wants a fully validated H2-encrypted space the easiest
 122   solution is to format a filesystem within an encrypted file by treating it
 123   as a block device, but I digress.
 124
 125 * Zero detection on write (writing all-zeros), which requires the data
 126   buffer to be scanned, is fully supported.  This allows the writing of 0's
 127   to create holes.
 128
 129 * Copies support for redundancy within a single physical filesystem.
 130   Up to 256 physical disks and/or partitions can be ganged to form a
 131   single physical filesystem.  If you use a disk or RAID aggregation
 132   layer then the actual number of physical disks that can be associated
 133   with a single H2 filesystem is unbounded.
 134
 135   H2 handles this by interleaving the 64-bit address space 256-ways on
 136   level-1 (2GB) boundaries.  Thus, a single physical disk from H2's point
 137   of view can be sized up to 2^56 which is 64 Petabytes, and the total
 138   filesystem size can be up to 16 Exabytes.
 139
 140   Copies support is implemented by having multiple blockref entries for
 141   the same key, each with a different copyid.  The copyid represents which
 142   of the 256 slots is used.  Meta-data is also subject to the copies
 143   mechanism.  However, for both meta-data and data, each copy should be
 144   identical so the check fields in the blockref for all copies should wind
 145   up being the same, and any valid copy can be used by the block-level
 146   hammer2_chain code to access the filesystem.  File accesses will attempt
 147   to use the same copy.  If an I/O read error occurs, a different copy will
 148   be chosen.  Modifying operations must update all copies and/or create
 149   new copies as needed.  If a write error occurs on a copy and other copies
 150   are available, the errored target will be taken offline.
 151
 152   It is possible to configure H2 to write out fewer copies on-write and then
 153   use a background scan to beef-up the number of copies to improve real-time
 154   throughput.
 155
 156 * MESI Cache coherency for multi-master/multi-client clustering operations.
 157   The servers hosting the MASTERs are also responsible for keeping track of
 158   the cache state.
 159
 160 * Hardlinks and softlinks are supported.  Hardlinks are somewhat complex to
 161   deal with and there is still an edge case.  I am trying to avoid storing
 162   the hardlinks at the root level because that messes up my concept for
 163   sub-tree quotas and is unnecessarily burdensome in terms of SMP collisions
 164   under heavy loads.
 165
 166 * The media blockref structure is now large enough to support up to a 192-bit
 167   check value, which would typically be a cryptographic hash of some sort.
 168   Multiple check value algorithms will be supported with the default being
 169   a simple 32-bit iSCSI CRC.
 170
 171 * Fully verified deduplication will be supported and automatic (and
 172   necessary in many respects).
 173
 174 * Unverified de-duplication will be supported as a configurable option on a
 175   file or subdirectory tree.  Unverified deduplication must use the largest
 176   available check code (192 bits).  It will not verify that data content with
 177   the same check code is actually identical during the dedup pass, resulting
 178   in approximately 100x to 1000x the deduplication performance but at the cost
 179   of potentially corrupting some data.
 180
 181   The Unverified dedup feature is intended only for those files where
 182   occassional corruption is ok, such as in a web-crawler data store or
 183   other situations where the data content is not critically important
 184   or can be externally recovered if it becomes corrupt.
 185
 186                                 GENERAL DESIGN
 187
 188 HAMMER2 generally implements a copy-on-write block design for the filesystem,
 189 which is very different from HAMMER1's B-Tree design.  Because the design
 190 is copy-on-write it can be trivially snapshotted simply by referencing an
 191 existing block, and because the media structures logically match a standard
 192 filesystem directory/file hierarchy snapshots and other similar operations
 193 can be trivially performed on an entire subdirectory tree at any level in
 194 the filesystem.
 195
 196 The copy-on-write design implements a block table in a radix-tree format,
 197 with a small 8x fan-out in the volume header and inode and a large 256x or
 198 1024x fan-out for indirect blocks.  The table is built bottom-up.
 199 Intermediate radii are only created when necessary so small files will use
 200 much shallower radix block trees.  The inode itself can accomodate files
 201 up 512KB (65536x8).  Directories also use a radix block table and directory
 202 inodes can accomodate up to 8 entries before pushing an indirect radix block.
 203
 204 The copy-on-write nature of the filesystem implies that any modification
 205 whatsoever will have to eventually synchronize new disk blocks all the way
 206 to the super-root of the filesystem and the volume header itself.  This forms
 207 the basis for crash recovery and also ensures that recovery occurs on a
 208 completed high-level transaction boundary.  All disk writes are to new blocks
 209 except for the volume header (which cycles through 4 copies), thus allowing
 210 all writes to run asynchronously and concurrently prior to and during a flush,
 211 and then just doing a final synchronization and volume header update at the
 212 end.  Many of HAMMER2s features are enabled by this core design feature.
 213
 214 Clearly this method requires intermediate modifications to the chain to be
 215 cached so multiple modifications can be aggregated prior to being
 216 synchronized.  One advantage, however, is that the normal buffer cache can
 217 be used and intermediate elements can be retired to disk by H2 or the OS
 218 at any time.  This means that HAMMER2 has very resource overhead from the
 219 point of view of the operating system.  Unlike HAMMER1 which had to lock
 220 dirty buffers in memory for long periods of time, HAMMER2 has no such
 221 requirement.
 222
 223 Buffer cache overhead is very well bounded and can handle filesystem
 224 operations of any complexity, even on boxes with very small amounts
 225 of physical memory.  Buffer cache overhead is significantly lower with H2
 226 than with H1 (and orders of magnitude lower than ZFS).
 227
 228 At some point I intend to implement a shortcut to make fsync()'s run fast,
 229 and that is to allow deep updates to blockrefs to shortcut to auxillary
 230 space in the volume header to satisfy the fsync requirement.  The related
 231 blockref is then recorded when the filesystem is mounted after a crash and
 232 the update chain is reconstituted when a matching blockref is encountered
 233 again during normal operation of the filesystem.
 234
 235                             MIRROR_TID, MODIFY_TID
 236
 237 In HAMMER2, the core block reference is 64-byte structure called a blockref.
 238 The blockref contains various bits of information including the 64-bit radix
 239 key (typically a directory hash if a directory entry, inode number if a
 240 hidden hardlink target, or file offset if a file block), 64-bit data offset
 241 with the physical block size radix encoded in it (physical block size can be
 242 different from logical block size due to compression), two 64-bit transaction
 243 ids, type information, and 192 bits worth of check data for the block being
 244 reference which can be a simple CRC or stronger HASH.
 245
 246 Both mirror_tid and modify_tid propagate upward from the change point all the
 247 way to the root, but serve different purposes and work in slightly different
 248 ways.
 249
 250 mirror_tid - This is a media-centric (as in physical disk partition)
 251              transaction id which tracks media-level updates.
 252
 253              Whenever any block in the media topology is modified, its
 254              mirror_tid is updated with the flush id and will propagate
 255              upward during the flush all the way to the volume header.
 256
 257              mirror_tid is monotonic.
 258
 259 modify_tid - This is a cluster-centric (as in across all the nodes used
 260              to build a cluster) transaction id which tracks filesystem-level
 261              updates.
 262
 263              modify_tid is updated when the front-end of the filesystem makes
 264              a change to an inode or data block.  It will also propagate
 265              upward, stopping at the root of the PFS (the mount point for
 266              the cluster).
 267
 268 The major difference between mirror_tid and modify_tid is that for any given
 269 element in the topology residing on different nodes.  e.g. file "x" on node 1
 270 and file "x" on node 2, if the files are synchronized with each other they
 271 will have the same modify_tid on a block-by-block basis, and a single check
 272 of the inode's modify_tid is sufficient to determine that the files are fully
 273 synchronized and identical.  These same inodes and representitive blocks will
 274 have very different mirror_tids because the nodes will reside on different
 275 physical media.
 276
 277 I noted above that modify_tids also propagate upward, but not in all cases.
 278 A node which is undergoing SYNCHRONIZATION only updates the modify_tid of
 279 a block when it has determined that the block and its entire sub-block
 280 hierarchy has been synchronized to that point.
 281
 282 The synchronization code updates an out-of-sync node bottom-up and will
 283 definitely set modify_tid as it goes, but media flushes can occur at any
 284 time and these flushes will use mirror_tid for flush and freemap management.
 285 The mirror_tid for each flush propagates upward to the volume header on each
 286 flush.
 287
 288 * The synchronization code is able to determine that a sub-tree is
 289   synchronized simply by observing the modify_tid at the root of the sub-tree,
 290   on a directory-by-directory basis.
 291
 292 * The synchronization code is able to do an incremental update of an
 293   out-of-sync node simply by skipping elements with matching modify_tids.
 294
 295 * The synchronization code can be interrupted and restarted at any time,
 296   and is able to pick up where it left off with very little overhead.
 297
 298 * The synchronization code does not inhibit media flushes.  Media flushes
 299   can occur (and must occur) while synchronization is ongoing.
 300
 301 There are several other stored transaction ids in HAMMER2.  There is a
 302 separate freemap_tid in the volume header that is used to allow freemap
 303 flushes to be deferred, and inodes have an attr_tid and a dirent_tid which
 304 tracks attribute changes and (for directories) create/rename/delete changes.
 305 The inode TIDs are used as an aid for the cache coherency subsystem.
 306
 307 Remember that since this is a copy-on-write filesystem, we can propagate
 308 a considerable amount of information up the tree to the volume header
 309 without adding to the I/O we already have to do.
 310
 311                             DIRECTORIES AND INODES
 312
 313 Directories are hashed, and another major design element is that directory
 314 entries ARE INODES.  They are one and the same, with a special placemarker
 315 for hardlinks.  Inodes are 1KB.
 316
 317 Half of the inode structure (512 bytes) is used to hold top-level blockrefs
 318 to the radix block tree representing the file contents.  Files which are
 319 less than or equal to 512 bytes in size will simply store the file contents
 320 in this area instead of a blockref array.  So files <= 512 bytes take only
 321 1KB of space inclusive of the inode.
 322
 323 Inode numbers are not spatially referenced, which complicates NFS servers
 324 but doesn't complicate anything else.  The inode number is stored in the
 325 inode itself, an absolute necessity required to properly support HAMMER2s
 326 hugely flexible snapshots.
 327
 328                                     RECOVERY
 329
 330 H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
 331 tracks a separate transaction id in the volume header.
 332
 333 On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
 334 volume header from the 4 copies available (if the filesystem is big enough,
 335 e.g. > ~10GB or so, there will be 4 copies of the volume header).
 336
 337 HAMMER2 will then run an incremental scan of the topology for mirror_tid
 338 transaction ids between the last freemap flush and the current topology in
 339 order to update the freemap.  Because this scan is incremental the
 340 worst-case time to run the scan is the time it takes to scan the meta-data
 341 for all changes made between the last freemap flush and the last topology
 342 flush.
 343
 344 The filesystem is then ready for use.
 345
 346                             DISK I/O OPTIMIZATIONS
 347
 348 The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
 349 by the freemap is zoned and has a tendancy to collect inodes, small data,
 350 indirect blocks, and larger data blocks into separate segments.  The idea is
 351 to greatly improve I/O performance (particularly by laying inodes down next
 352 to each other which has a huge effect on directory scans).
 353
 354 The current implementation of HAMMER2 implements a fixed block size of 64KB
 355 in order to allow aliasing of hammer2_dio's in its IO subsystem.  This way
 356 we don't have to worry about matching the buffer cache / DIO cache to the
 357 variable block size of underlying elements.
 358
 359 HAMMER2 also allows OS support for ganging buffers together into even
 360 larger blocks for I/O, OS-supported read-ahead, and other performance
 361 features typically provided by the OS at the block-level.
 362
 363                                   HARDLINKS
 364
 365 Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
 366 a spatial reference to the inode number.  We do not want to have to have
 367 an index of inode numbers for any basic HAMMER2 feature if we can help it.
 368
 369 Hardlinks are handled by placing the inode for a multiply-hardlinked file
 370 in the closest common parent directory.  If "a/x" and "a/y" are hardlinked
 371 the inode for the hardlinked file will be placed in directory "a", e.g.
 372 "a/<inode_number>".  The actual file inode will be an invisible out-of-band
 373 entry in the directory.  The directory entries "a/x" and "a/y" will be given
 374 the same inode number but in fact they are only placemarks that cause
 375 HAMMER2 to recurse upwards through the directory tree to find the invisible
 376 real inode.
 377
 378 Because directories are hashed and a different namespace (hash key range)
 379 is used for hardlinked inodes, standard directory scans are able to trivially
 380 skip this invisible namespace and inode-specific lookups can restrict their
 381 lookup to within this space.  No linear scans are needed.
 382
 383                                 FREEMAP NOTES
 384
 385 The freemap is stored in the reserved blocks situated in the ~4MB reserved
 386 area at the baes of every ~2GB level-1 zone.  The current implementation
 387 reserves 8 copies of every freemap block and cycles through them in order
 388 to make the freemap operate in a copy-on-write fashion.
 389
 390     - Freemap is copy-on-write.
 391     - Freemap operations are transactional, same as everything else.
 392     - All backup volume headers are consistent on-mount.
 393
 394 The Freemap is organized using the same radix blockmap algorithm used for
 395 files and directories, but with fixed radix values.  For a maximally-sized
 396 filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
 397 but the top-level is embedded in the volume header so insofar as performance
 398 goes it is really just a 4-level blockmap.
 399
 400 The freemap radix allocation mechanism is also the same, meaning that it is
 401 bottom-up and will not allocate unnecessary intermediate levels for smaller
 402 filesystems.  A 16GB filesystem uses a 2-level blockmap (volume header + one
 403 level), a 16TB filesystem uses a 3-level blockmap (volume header + two levels),
 404 and so forth.
 405
 406 The Freemap has bitmap granularity down to 16KB and a linear iterator that
 407 can linearly allocate space down to 1KB.  Due to fragmentation it is possible
 408 for the linear allocator to become marginalized, but it is relatively easy
 409 to for a reallocation of small blocks every once in a while (like once a year
 410 if you care at all) and once the old data cycles out of the snapshots, or you
 411 also rewrite the snapshots (which you can do), the freemap should wind up
 412 relatively optimal again.  Generally speaking I believe that algorithms can
 413 be developed to make this a non-problem without requiring any media structure
 414 changes.
 415
 416 In order to implement fast snapshots (and writable snapshots for that
 417 matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
 418 keep track of 100% free blocks plus some extra bits for staging the bulkfree
 419 scan.  The lack of ref-counting makes it possible to:
 420
 421     - Completely trivialize HAMMER2s snapshot operations.
 422     - Allows any volume header backup to be used trivially.
 423     - Allows whole sub-trees to be destroyed without having to scan them.
 424     - Simplifies normal crash recovery operations.
 425     - Simplifies catastrophic recovery operations.
 426
 427 Normal crash recovery is simply a matter of doing an incremental scan
 428 of the topology between the last flushed freemap TID and the last flushed
 429 topology TID.  This usually takes only a few seconds and allows:
 430
 431     - Freemap flushes to be be deferred for any number of topology flush
 432       cycles.
 433     - Does not have to be flushed for fsync, reducing fsync overhead.
 434
 435                                 FREEMAP - BULKFREE
 436
 437 Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
 438 Blocks are first marked as being possibly free and then finalized in the
 439 second scan.  Live filesystem operations are allowed to run during these
 440 scans and any freemap block that is allocated or adjusted after the first
 441 scan will simply be re-marked as allocated and the second scan will not
 442 transition it to being free.
 443
 444 The cost of not doing ref-count tracking is that HAMMER2 must perform two
 445 bulkfree scans of the meta-data to determine which blocks can actually be
 446 freed.  This can be complicated by the volume header backups and snapshots
 447 which cause the same meta-data topology to be scanned over and over again,
 448 but mitigated somewhat by keeping a cache of higher-level nodes to detect
 449 when we would scan a sub-topology that we have already scanned.  Due to the
 450 copy-on-write nature of the filesystem, such detection is easy to implement.
 451
 452 Part of the ongoing design work is finding ways to reduce the scope of this
 453 meta-data scan so the entire filesystem's meta-data does not need to be
 454 scanned (though in tests with HAMMER1, even full meta-data scans have
 455 turned out to be fairly low cost).  In other words, its an area where
 456 improvements can be made without any media format changes.
 457
 458 Another advantage of operating the freemap like this is that some future
 459 version of HAMMER2 might decide to completely change how the freemap works
 460 and would be able to make the change with relatively low downtime.
 461
 462                                   CLUSTERING
 463
 464 Clustering, as always, is the most difficult bit but we have some advantages
 465 with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
 466 structures generally follow the kernel's filesystem hiearchy which allows
 467 cluster operations to use topology cache and lock state.  Second,
 468 HAMMER2's writable snapshots make it possible to implement several forms
 469 of multi-master clustering.
 470
 471 The mount device path you specify serves to bootstrap your entry into
 472 the cluster.  This is typically local media.  It can even be a ram-disk
 473 that only contains placemarkers that help HAMMER2 connect to a fully
 474 networked cluster.
 475
 476 With HAMMER2 you mount a directory entry under the super-root.  This entry
 477 will contain a cluster identifier that helps HAMMER2 identify and integrate
 478 with the nodes making up the cluster.  HAMMER2 will automatically integrate
 479 *all* entries under the super-root when you mount one of them.  You have to
 480 mount at least one for HAMMER2 to integrate the block device in the larger
 481 cluster.  This mount will typically be a SOFT_MASTER, DUMMY, SLAVE, or CACHE
 482 mount that simply serves to cause hammer to integrate the rest of the
 483 represented cluster.  ALL CLUSTER ELEMENTS ARE TREATED ACCORDING TO TYPE
 484 NO MATTER WHICH ONE YOU MOUNT.
 485
 486 For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
 487 which can be mounted in order to make the rest of the elements under the
 488 super-root available to the network.  (In a prior specification I emplaced
 489 the cluster connections in the volume header's configuration space but I no
 490 longer do that).
 491
 492 Connecting to the wider networked cluster involves setting up the /etc/hammer2
 493 directory with appropriate IP addresses and keys.  The user-mode hammer2
 494 service daemon maintains the connections and performs graph operations
 495 via libdmsg.
 496
 497 Node types within the cluster:
 498
 499     DUMMY       - Used as a local placeholder (typically in ramdisk)
 500     CACHE       - Used as a local placeholder and cache (typically on a SSD)
 501     SLAVE       - A SLAVE in the cluster, can source data on quorum agreement.
 502     MASTER      - A MASTER in the cluster, can source and sink data on quorum
 503                   agreement.
 504     SOFT_SLAVE  - A SLAVE in the cluster, can source data locally without
 505                   quorum agreement (must be directly mounted).
 506     SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster.  Can source
 507                   and sink data locally without quorum agreement, intended to
 508                   be synchronized with the real MASTERs when connectivity
 509                   allows.  Operations are not coherent with the real MASTERS
 510                   even when they are available.
 511
 512     NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
 513           SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
 514           synchronized against current masters.
 515
 516     NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
 517           by giving it a unique cluster id, taking it out of the cluster that
 518           originally spawned it.
 519
 520 There are four major protocols:
 521
 522     Quorum protocol
 523
 524         This protocol is used between MASTER nodes to vote on operations
 525         and resolve deadlocks.
 526
 527         This protocol is used between SOFT_MASTER nodes in a sub-cluster
 528         to vote on operations, resolve deadlocks, determine what the latest
 529         transaction id for an element is, and to perform commits.
 530
 531     Cache sub-protocol
 532
 533         This is the MESI sub-protocol which runs under the Quorum
 534         protocol.  This protocol is used to maintain cache state for
 535         sub-trees to ensure that operations remain cache coherent.
 536
 537         Depending on administrative rights this protocol may or may
 538         not allow a leaf node in the cluster to hold a cache element
 539         indefinitely.  The administrative controller may preemptively
 540         downgrade a leaf with insufficient administrative rights
 541         without giving it a chance to synchronize any modified state
 542         back to the cluster.
 543
 544     Proxy protocol
 545
 546         The Quorum and Cache protocols only operate between MASTER
 547         and SOFT_MASTER nodes.  All other node types must use the
 548         Proxy protocol to perform similar actions.  This protocol
 549         differs in that proxy requests are typically sent to just
 550         one adjacent node and that node then maintains state and
 551         forwards the request or performs the required operation.
 552         When the link is lost to the proxy, the proxy automatically
 553         forwards a deletion of the state to the other nodes based on
 554         what it has recorded.
 555
 556         If a leaf has insufficient administrative rights it may not
 557         be allowed to actually initiate a quorum operation and may only
 558         be allowed to maintain partial MESI cache state or perhaps none
 559         at all (since cache state can block other machines in the
 560         cluster).  Instead a leaf with insufficient rights will have to
 561         make due with a preemptive loss of cache state and any allowed
 562         modifying operations will have to be forwarded to the proxy which
 563         continues forwarding it until a node with sufficient administrative
 564         rights is encountered.
 565
 566         To reduce issues and give the cluster more breath, sub-clusters
 567         made up of SOFT_MASTERs can be formed in order to provide full
 568         cache coherent within a subset of machines and yet still tie them
 569         into a greater cluster that they normally would not have such
 570         access to.  This effectively makes it possible to create a two
 571         or three-tier fan-out of groups of machines which are cache-coherent
 572         within the group, but perhaps not between groups, and use other
 573         means to synchronize between the groups.
 574
 575     Media protocol
 576
 577         This is basically the physical media protocol.
 578
 579                        MASTER & SLAVE SYNCHRONIZATION
 580
 581 With HAMMER2 I really want to be hard-nosed about the consistency of the
 582 filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
 583 to guarantee consistency we take advantage of the copy-on-write nature of
 584 the filesystem by forking consistent nodes and using the forked copy as the
 585 source for synchronization.
 586
 587 Similarly, the target for synchronization is not updated on the fly but instead
 588 is also forked and the forked copy is updated.  When synchronization is
 589 complete, forked sources can be thrown away and forked copies can replace
 590 the original synchronization target.
 591
 592 This may seem complex, but 'forking a copy' is actually a virtually free
 593 operation.  The top-level inode (under the super-root), on-media, is simply
 594 copied to a new inode and poof, we have an unchanging snapshot to work with.
 595
 596         - Making a snapshot is fast... almost instantanious.
 597
 598         - Snapshots are used for various purposes, including synchronization
 599           of out-of-date nodes.
 600
 601         - A snapshot can be converted into a MASTER or some other PFS type.
 602
 603         - A snapshot can be forked off from its parent cluster entirely and
 604           turned into its own writable filesystem, either as a single MASTER
 605           or this can be done across the cluster by forking a quorum+ of
 606           existing MASTERs and transfering them all to a new cluster id.
 607
 608 More complex is reintegrating the target once the synchronization is complete.
 609 For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
 610 However, if the SLAVE is mounted and not optioned as a static mount (that is
 611 the mounter wants to see updates as they are synchronized), a reconciliation
 612 must occur on the live mount to clean up the vnode, inode, and chain caches
 613 and shift any remaining vnodes over to the updated copy.
 614
 615         - A mounted SLAVE can track updates made to the SLAVE but the
 616           actual mechanism is that the SLAVE PFS is replaced with an
 617           updated copy, typically every 30-60 seconds.
 618
 619 Reintegrating a MASTER which has fallen out of the quorum due to being out
 620 of date is also somewhat more complex.  The same updating mechanic is used,
 621 we actually have to throw the 'old' MASTER away once the new one has been
 622 updated.  However if the cluster is undergoing heavy modifications the
 623 updated MASTER will be out of date almost the instant its source is
 624 snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
 625 interaction.
 626
 627         - If a MASTER is really out of date we can run one or more
 628           synchronization passes concurrent with modifying operations.
 629           The quorum can remain live.
 630
 631         - A final synchronization pass is required with quorum operations
 632           blocked to reintegrate the now up-to-date MASTER into the cluster.
 633
 634
 635                                 QUORUM OPERATIONS
 636
 637 Quorum operations can be broken down into HARD BLOCK operations and NETWORK
 638 operations.  If your MASTERs are all local mounts, then failures and
 639 sequencing is easy to deal with.
 640
 641 Quorum operations on a networked cluster are more complex.  The problems:
 642
 643     - Masters cannot rely on clients to moderate quorum transactions.
 644       Apart from the reliance being unsafe, the client could also
 645       lose contact with one or more masters during the transaction and
 646       leave one or more masters out-of-sync without the master(s) knowing
 647       they are out of sync.
 648
 649     - When many clients are present, we do not want a flakey network
 650       link from one to cause one or more masters to go out of
 651       synchronization and potentially stall the whole works.
 652
 653     - Normal hammer2 mounts allow a virtually unlimited number of modifying
 654       transactions between actual flushes.  The media flush rolls everything
 655       up into a single transaction id per flush.  Detection of 'missing'
 656       transactions in a concurrent multi-client setup when one or more client
 657       temporarily loses connectivity is thus difficult.
 658
 659     - Clients have a limited amount of time to reconnect to a cluster after
 660       a network disconnect before their MESI cache states are lost.
 661
 662     - Clients may proceed with several transactions before knowing for sure
 663       that earlier transactions were completely successful.  Performance is
 664       important, we won't be waiting for a full quorum-verified synchronous
 665       flush to media before allowing a system call to return.
 666
 667     - Masters can decide that a client's MESI cache states were lost (i.e.
 668       that the transaction was too slow) as well.
 669
 670 The solutions (for modifying transactions):
 671
 672     - Masters handle quorum confirmation amongst themselves and do not rely
 673       on the client for that purpose.
 674
 675     - A client can connect to one or more masters regardless of the size of
 676       the quorum and can submit modifying operations to a single master if
 677       desired.  The master will take care of the rest.
 678
 679       A client must still validate the quorum (and obtain MESI cache states)
 680       when doing read-only operations in order to present the correct data
 681       to the user process for the VOP.
 682
 683     - Masters will run a 2-phase commit amongst themselves, often concurrent
 684       with other non-conflicting transactions, and will serialize operations
 685       and/or enforce synchronization points for 2-phase completion on
 686       serialized transactions from the same client or when cache state
 687       ownership is shifted from one client to another.
 688
 689     - Clients will usually allow operations to run asynchronously and return
 690       from system calls more or less ASAP once they own the necessary cache
 691       coherency locks.  The client can select the validation mode to wait for
 692       with mount options:
 693
 694       (1) Fully async           (mount -o async)
 695       (2) Wait for phase-1 ack  (mount)
 696       (3) Wait for phase-2 ack  (mount -o sync)         (fsync - wait p2ack)
 697       (4) Wait for flush        (mount -o sync)         (fsync - wait flush)
 698
 699       Modifying system calls cannot be told to wait for a full media
 700       flush, as full media flushes are prohibitively expensive.  You
 701       still have to fsync().
 702
 703       The fsync wait mode for network links can be selected, either to
 704       return after the phase-2 ack or to return after the media flush.
 705       The default is to wait for the phase-2 ack, which at least guarantees
 706       that a network failure after that point will not disrupt operations
 707       issued before the fsync.
 708
 709     - Clients must adjust the chain state for modifying operations prior to
 710       releasing chain locks / returning from the system call, even if the
 711       masters have not finished the transaction.  A late failure by the
 712       cluster will result in desynchronized state which requires erroring
 713       out the whole filesystem or resynchronizing somehow.
 714
 715     - Clients can opt to keep a record of transactions through the phase-2
 716       ack or the actual media flush on the masters.
 717
 718       However, replaying/revalidating the log cannot necessarily guarantee
 719       success.  If the masters lose synchronization due to network issues
 720       between masters (or if the client was mounted fully-async), or if enough
 721       masters crash simultaniously such that a quorum fails to flush even
 722       after the phase-2 ack, then it is possible that by the time a client
 723       is able to replay/revalidate, some other client has squeeded in and
 724       committed something that would conflict.
 725
 726       If the client crashes it works similarly to a crash with a local storage
 727       mount... many dirty buffers might be lost.  And the same happens in
 728       the cluster case.
 729
 730                                 TRANSACTION LOG
 731
 732 Keeping a short-term transaction log, much less being able to properly replay
 733 it, is fraught with difficulty and I've made it a separate development task.
 734
 735