sys/vfs/hammer2/DESIGN

   1
   2                             HAMMER2 DESIGN DOCUMENT
   3
   4                                 Matthew Dillon
   5                              dillon@backplane.com
   6
   7                                09-Jul-2016 (v4)
   8                                03-Apr-2015 (v3)
   9                                14-May-2013 (v2)
  10                                08-Feb-2012 (v1)
  11
  12                         Current Status as of document date
  13
  14 * Filesystem Core       - operational
  15   - bulkfree            - operational
  16   - Compression         - operational
  17   - Snapshots           - operational
  18   - Deduper             - live operational, batch specced
  19   - Subhierarchy quotas - (may have to be discarded)
  20   - Logical Encryption  - not specced yet
  21   - Copies              - not specced yet
  22   - fsync bypass        - not specced yet
  23
  24 * Clustering core
  25   - Network msg core    - operational
  26   - Network blk device  - operational
  27   - Error handling      - under development
  28   - Quorum Protocol     - under development
  29   - Synchronization     - under development
  30   - Transaction replay  - not specced yet
  31   - Cache coherency     - not specced yet
  32
  33                                     Feature List
  34
  35 * Block topology (both the main topology and the freemap) use a copy-on-write
  36   design.  Media-level block frees are delayed and flushes rotate between
  37   4 volume headers (maxes out at 4 if the filesystem is > ~8GB).  Flushes
  38   will allocate new blocks up to the root in order to propagate block table
  39   changes and transaction ids.
  40
  41 * Incremental synchronization is queueless and trivial by design.
  42
  43 * Multiple roots, with many features.  This is implemented via the super-root
  44   concept.  When mounting a HAMMER2 filesystem you specify a device path and
  45   a directory name in the super-root.  (HAMMER1 had only one root).
  46
  47 * All cluster types and multiple PFSs (belonging to the same or different
  48   clusters) can be mixed on one physical filesystem.
  49
  50   This allows independent cluster components to be configured within a
  51   single formatted H2 filesystem.  Each component is a super-root entry,
  52   a cluster identifier, and a unique identifier.  The network protocl
  53   integrates the component into the cluster when it is created
  54
  55 * Roots are really no different from snapshots (HAMMER1 distinguished between
  56   its root mount and its PFS's.  HAMMER2 does not).
  57
  58 * I/O and chain locking thread separation.  I/O stalls and lock stalls can
  59   cause any filesystem which purports to operate over multiple physical and
  60   network devices to implode.  HAMMER2 incorporates a frontend/backend design
  61   which separates media operations into support threads and allows the
  62   frontend to validate the cluster, proceed with an operation, and disconnect
  63   any remaining running operation even when backend ops have not completed
  64   on all nodes.  This allows the frontend to return 'early' (so to speak).
  65
  66 * Early return on best data-path supported by virtue of the above.  In a
  67   multi-master system, frontend ops will issue I/O on all cluster elements
  68   concurrently and will return the instant incoming data validates the
  69   cluster.
  70
  71 * Snapshots are writable (in HAMMER1 snapshots were read-only).
  72
  73 * Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
  74   both explicit and fine-grained/automatic.  HAMMER2 does not implement
  75   automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
  76   can create fine-grained snapshots if you desire.
  77
  78 * HAMMER2 formalizes a synchronization point for the flush, does a pre-flush
  79   that does not update the volume root, then waits for all running modifying
  80   operations to complete to memory (not to disk) while temporarily stalling
  81   new modifying operation initiations.  The final flush is then executed.
  82
  83   At the moment we do not allow concurrent modifying operations during the
  84   final flush phase.  Ultimately I would like to, but doing so can be complex.
  85
  86 * HAMMER2 flushes and synchronization points do not bisect VOPs (system calls).
  87   (HAMMER1 flushes could wind up bisecting VOPs).  This means the H2 flushes
  88   leave the filesystem in a far more consistent state than H1 flushes did.
  89
  90 * Directory sub-hierarchy-based quotas for space and inode usage tracking.
  91   Any directory can be used.
  92
  93 * Low memory footprint.  Except for the volume header, the buffer cache
  94   is completely asynchronous and dirty buffers can be retired by the OS
  95   directly to backing store with no further interactions with the filesystem.
  96
  97 * Background synchronization and mirroring occurs at the logical level.
  98   When a failure occurs or a normal validation scan comes up with
  99   discrepancies, the synchronization thread will use the quorum to figure
 100   out which information is not correct and update accordingly.
 101
 102 * Support for multiple compression algorithms configured on a subdirectory
 103   tree basis and on a file basis.  Block compression up to 64KB will be used.
 104   Only compression ratios at powers of 2 that are at least 2:1 (e.g. 2:1,
 105   4:1, 8:1, etc) will work in this scheme because physical block allocations
 106   in HAMMER2 are always power-of-2.  Modest compression can be achieved with
 107   low overhead, is turned on by default, and is compatible with deduplication.
 108
 109 * Encryption.  Whole-disk encryption is supported by another layer, but I
 110   intend to give H2 an encryption feature at the logical layer which works
 111   approximately as follows:
 112
 113   - Encryption controlled by the client on an inode/sub-tree basis.
 114   - Server has no visibility to decrypted data.
 115   - Encrypt filenames in directory entries.  Since the filename[] array
 116     is 256 bytes wide, client can add random bytes after the normal
 117     terminator to make it virtually impossible for an attacker to figure
 118     out the filename.
 119   - Encrypt file size and most inode contents.
 120   - Encrypt file data (holes are not encrypted).
 121   - Encryption occurs after compression, with random filler.
 122   - Check codes calculated after encryption & compression (not before).
 123
 124   - Blockrefs are not encrypted.
 125   - Directory and File Topology is not encrypted.
 126   - Encryption is not sub-topology validation.  Client would have to keep
 127     track of that itself.  Server or other clients can still e.g. remove
 128     files, rename, etc.
 129
 130   In particular, note that even though the file size field can be encrypted,
 131   the server does have visibility on the block topology and thus has a pretty
 132   good idea how big the file is.  However, a client could add junk blocks
 133   at the end of a file to make this less apparent, at the cost of space.
 134
 135   If a client really wants a fully validated H2-encrypted space the easiest
 136   solution is to format a filesystem within an encrypted file by treating it
 137   as a block device, but I digress.
 138
 139 * Zero detection on write (writing all-zeros), which requires the data
 140   buffer to be scanned, is fully supported.  This allows the writing of 0's
 141   to create holes.
 142
 143 * Allow sector overwrite (avoid copy-on-write) under certain circumstances.
 144   This is allowed on file data blocks if the file check mode is set to NONE,
 145   as long as the data block's modify_tid does not violate the last snapshot
 146   taken (if it does, a copy is made and overwrites are allowed on the copy
 147   until the next snapshot).
 148
 149 * Copies support for redundancy within a single physical filesystem.
 150   Up to 256 physical disks and/or partitions can be ganged to form a
 151   single physical filesystem.  If you use a disk or RAID aggregation
 152   layer then the actual number of physical disks that can be associated
 153   with a single H2 filesystem is unbounded.
 154
 155   H2 puts an 8-bit copyid in the blockref structure to represent potentially
 156   multiple copies of a block.  The copyid corresponds to a configuration
 157   specification in the volume header.  The full algorithm has not been
 158   specced yet.
 159
 160   Copies support is implemented by having multiple blockref entries for
 161   the same key, each with a different copyid.  The copyid represents which
 162   of the 256 slots is used.  Meta-data is also subject to the copies
 163   mechanism.  However, for both meta-data and data, each copy should be
 164   identical so the check fields in the blockref for all copies should wind
 165   up being the same, and any valid copy can be used by the block-level
 166   hammer2_chain code to access the filesystem.  File accesses will attempt
 167   to use the same copy.  If an I/O read error occurs, a different copy will
 168   be chosen.  Modifying operations must update all copies and/or create
 169   new copies as needed.  If a write error occurs on a copy and other copies
 170   are available, the errored target will be taken offline.
 171
 172   It is possible to configure H2 to write out fewer copies on-write and then
 173   use a background scan to beef-up the number of copies to improve real-time
 174   throughput.
 175
 176 * MESI Cache coherency for multi-master/multi-client clustering operations.
 177   The servers hosting the MASTERs are also responsible for keeping track of
 178   the cache state.
 179
 180 * Hardlinks and softlinks are supported.  Hardlinks are somewhat complex to
 181   deal with and there is still an edge case.  I am trying to avoid storing
 182   the hardlinks at the root level because that messes up my concept for
 183   sub-tree quotas and is unnecessarily burdensome in terms of SMP collisions
 184   under heavy loads.
 185
 186 * The media blockref structure is now large enough to support up to a 192-bit
 187   check value, which would typically be a cryptographic hash of some sort.
 188   Multiple check value algorithms will be supported with the default being
 189   a simple 32-bit iSCSI CRC.
 190
 191 * Fully verified deduplication will be supported and automatic (and
 192   necessary in many respects).
 193
 194 * Unverified de-duplication will be supported as a configurable option on a
 195   file or subdirectory tree.  Unverified deduplication must use the largest
 196   available check code (192 bits).  It will not verify that data content with
 197   the same check code is actually identical during the dedup pass, resulting
 198   in approximately 100x to 1000x the deduplication performance but at the cost
 199   of potentially corrupting some data.
 200
 201   The Unverified dedup feature is intended only for those files where
 202   occassional corruption is ok, such as in a web-crawler data store or
 203   other situations where the data content is not critically important
 204   or can be externally recovered if it becomes corrupt.
 205
 206                                 GENERAL DESIGN
 207
 208 HAMMER2 generally implements a copy-on-write block design for the filesystem,
 209 which is very different from HAMMER1's B-Tree design.  Because the design
 210 is copy-on-write it can be trivially snapshotted simply by referencing an
 211 existing block, and because the media structures logically match a standard
 212 filesystem directory/file hierarchy snapshots and other similar operations
 213 can be trivially performed on an entire subdirectory tree at any level in
 214 the filesystem.
 215
 216 The copy-on-write design implements a block table in a radix-tree format,
 217 with a small 8x fan-out in the volume header and inode and a large 256x or
 218 1024x fan-out for indirect blocks.  The table is built bottom-up.
 219 Intermediate radii are only created when necessary so small files will use
 220 much shallower radix block trees.  The inode itself can accomodate files
 221 up 512KB (65536x8).  Directories also use a radix block table and directory
 222 inodes can accomodate up to 8 entries before pushing an indirect radix block.
 223
 224 The copy-on-write nature of the filesystem implies that any modification
 225 whatsoever will have to eventually synchronize new disk blocks all the way
 226 to the super-root of the filesystem and the volume header itself.  This forms
 227 the basis for crash recovery and also ensures that recovery occurs on a
 228 completed high-level transaction boundary.  All disk writes are to new blocks
 229 except for the volume header (which cycles through 4 copies), thus allowing
 230 all writes to run asynchronously and concurrently prior to and during a flush,
 231 and then just doing a final synchronization and volume header update at the
 232 end.  Many of HAMMER2s features are enabled by this core design feature.
 233
 234 Clearly this method requires intermediate modifications to the chain to be
 235 cached so multiple modifications can be aggregated prior to being
 236 synchronized.  One advantage, however, is that the normal buffer cache can
 237 be used and intermediate elements can be retired to disk by H2 or the OS
 238 at any time.  This means that HAMMER2 has very low resource overhead from the
 239 point of view of the operating system.  Unlike HAMMER1 which had to lock
 240 dirty buffers in memory for long periods of time, HAMMER2 has no such
 241 requirement.
 242
 243 Buffer cache overhead is very well bounded and can handle filesystem
 244 operations of any complexity, even on boxes with very small amounts
 245 of physical memory.  Buffer cache overhead is significantly lower with H2
 246 than with H1 (and orders of magnitude lower than ZFS).
 247
 248 At some point I intend to implement a shortcut to make fsync()'s run fast,
 249 and that is to allow deep updates to blockrefs to shortcut to auxillary
 250 space in the volume header to satisfy the fsync requirement.  The related
 251 blockref is then recorded when the filesystem is mounted after a crash and
 252 the update chain is reconstituted when a matching blockref is encountered
 253 again during normal operation of the filesystem.
 254
 255                         MIRROR_TID, MODIFY_TID, UPDATE_TID
 256
 257 In HAMMER2, the core block reference is 128-byte structure called a blockref.
 258 The blockref contains various bits of information including the 64-bit radix
 259 key (typically a directory hash if a directory entry, inode number if a
 260 hidden hardlink target, or file offset if a file block), 64-bit data offset
 261 with the physical block size radix encoded in it (physical block size can be
 262 different from logical block size due to compression), three 64-bit
 263 transaction ids, type information, and up to 512 bits worth of check data
 264 for the block being reference which can be anything from a simple CRC to
 265 a strong cryptographic hash.
 266
 267 mirror_tid - This is a media-centric (as in physical disk partition)
 268              transaction id which tracks media-level updates.  The mirror_tid
 269              can be different at the same point on different nodes in a
 270              cluster.
 271
 272              Whenever any block in the media topology is modified, its
 273              mirror_tid is updated with the flush id and will propagate
 274              upward during the flush all the way to the volume header.
 275
 276              mirror_tid is monotonic.  It is primarily used for on-mount
 277              recovery and volume root validation.  The name is historical
 278              from H1, it is not used for nominal mirroring.
 279
 280 modify_tid - This is a cluster-centric (as in across all the nodes used
 281              to build a cluster) transaction id which tracks filesystem-level
 282              updates.
 283
 284              modify_tid is updated when the front-end of the filesystem makes
 285              a change to an inode or data block.  It does NOT propagate upward
 286              during a flush.
 287
 288 update_tid - This is a cluster synchronization transaction id.  Modifications
 289              made to the topology will clear this field to 0 as they propagate
 290              up to the root.  This gives the synchronizer an easy way to
 291              determine what needs revalidation.
 292
 293              The synchronizer revalidates the cluster bottom-up by validating
 294              a sub-topology and propagating the highest modify_tid in the
 295              validated sub-topology up via the update_tid field.
 296
 297              Update to this field may be optimized by the HAMMER2 VFS to
 298              avoid the double-transition.
 299
 300 The synchronization code updates an out-of-sync node bottom-up and will
 301 dynamically set update_tid as it goes, but media flushes can occur at any
 302 time and these flushes will use mirror_tid for flush and freemap management.
 303 The mirror_tid for each flush propagates upward to the volume header on each
 304 flush.  modify_tid is set for any chains modified by a cluster op but does
 305 not propagate up, instead serving as a seed for update_tid.
 306
 307 * The synchronization code is able to determine that a sub-tree is
 308   synchronized simply by observing the update_tid at the root of the sub-tree,
 309   on an inode-by-inode basis and also on a data-block-by-data-block basis.
 310
 311 * The synchronization code is able to do an incremental update of an
 312   out-of-sync node simply by skipping elements with a matching update_tid
 313   (when not 0).
 314
 315 * The synchronization code can be interrupted and restarted at any time,
 316   and is able to pick up where it left off with very little overhead.
 317
 318 * The synchronization code does not inhibit media flushes.  Media flushes
 319   can occur (and must occur) while synchronization is ongoing.
 320
 321 There are several other stored transaction ids in HAMMER2.  There is a
 322 separate freemap_tid in the volume header that is used to allow freemap
 323 flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
 324 conjuction with CHECK_NONE to allow blocks without a check code which do
 325 not violate the most recent snapshot to be overwritten in-place.
 326
 327 Remember that since this is a copy-on-write filesystem, we can propagate
 328 a considerable amount of information up the tree to the volume header
 329 without adding to the I/O we already have to do.
 330
 331                             DIRECTORIES AND INODES
 332
 333 Directories are hashed, and another major design element is that directory
 334 entries ARE inodes.  They are one and the same, with a special placemarker
 335 for hardlinks.  Inodes are 1KB.
 336
 337 Hardlinks are implemented with placemarkers as directory entries which simply
 338 represent the inode number.  The actual file resides in a parent directory
 339 that is common to all hardlinks to that file.  If the hardlinks are all within
 340 a single directory, the actual hardlink inode is in that directory.  The
 341 hardlink target, as we call it, is a hidden directory entry in a common parent
 342 whos key is basically just the inode number itself, so lookups are fast.
 343
 344 Half of the inode structure (512 bytes) is used to hold top-level blockrefs
 345 to the radix block tree representing the file contents.  Files which are
 346 less than or equal to 512 bytes in size will simply store the file contents
 347 in this area instead of a blockref array.  So files <= 512 bytes take only
 348 1KB of space inclusive of the inode.
 349
 350 Inode numbers are not spatially referenced, which complicates NFS servers
 351 but doesn't complicate anything else.  The inode number is stored in the
 352 inode itself, an absolute necessity required to properly support HAMMER2s
 353 hugely flexible snapshots.  I would like to support NFS services but it
 354 would require (probably) a lookaside index in the root for inode lookups
 355 and might not happen quickly.
 356
 357                                     RECOVERY
 358
 359 H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
 360 tracks a separate transaction id (via mirror_tid) in the volume header.
 361
 362 On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
 363 volume header from the 4 copies available (if the filesystem is big enough,
 364 e.g. > ~10GB or so, there will be 4 copies of the volume header).
 365
 366 HAMMER2 will then run an incremental scan of the topology for mirror_tid
 367 transaction ids between the last freemap flush tid and the last topology
 368 flush tid in order to synchronize the freemap.  Because this scan is
 369 incremental the time it takes to run will be relatively short and well-bounded
 370 at mount-time.  This is NOT fsck.  Freemap flushes can be avoided for any
 371 number of normal topology flushes but should still occur frequently enough
 372 to avoid long recovery times in case of a crash.
 373
 374 The filesystem is then ready for use.
 375
 376                             DISK I/O OPTIMIZATIONS
 377
 378 The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
 379 by the freemap is zoned and has a tendancy to collect inodes, small data,
 380 indirect blocks, and larger data blocks into separate segments.  The idea is
 381 to greatly improve I/O performance (particularly by laying inodes down next
 382 to each other which has a huge effect on directory scans).
 383
 384 The current implementation of HAMMER2 implements a fixed block size of 64KB
 385 in order to allow the mapping of hammer2_dio's in its IO subsystem to
 386 conumers that might desire different sizes.  This way we don't have to
 387 worry about matching the buffer cache / DIO cache to the variable block
 388 size of underlying elements.
 389
 390 The biggest issue we are avoiding by having a fixed 64KB I/O size is not
 391 actually to help nominal front-end access issue but instead to reduce the
 392 complexity when blocks are freed and reused for another purpose.  HAMMER1
 393 had to have specialized code to check for and invalidate buffer cache buffers
 394 in the free/reuse case.  HAMMER2 does not need such code.
 395
 396 That said, HAMMER2 places no major restrictions on mixing block sizes within
 397 a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
 398 a 64KB boundary.  The soft restrictions the block allocator puts in place
 399 exist primarily for performance reasons (i.e. try to collect 1K inodes
 400 together).  The 2MB freemap zone granularity should work very well in this
 401 regard.
 402
 403 HAMMER2 also allows OS support for ganging buffers together into even
 404 larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
 405 OS-driven asynchronous retirement, and other performance features typically
 406 provided by the OS at the block-level to ensure smooth system operation.
 407
 408 By avoiding wiring buffers/memory and allowing these features to run normally,
 409 HAMMER2 winds up with very low OS overhead.
 410
 411                                 FREEMAP NOTES
 412
 413 The freemap is stored in the reserved blocks situated in the ~4MB reserved
 414 area at the baes of every ~1GB level-1 zone.  The current implementation
 415 reserves 8 copies of every freemap block and cycles through them in order
 416 to make the freemap operate in a copy-on-write fashion.
 417
 418     - Freemap is copy-on-write.
 419     - Freemap operations are transactional, same as everything else.
 420     - All backup volume headers are consistent on-mount.
 421
 422 The Freemap is organized using the same radix blockmap algorithm used for
 423 files and directories, but with fixed radix values.  For a maximally-sized
 424 filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
 425 but the top-level is embedded in the volume header so insofar as performance
 426 goes it is really just a 4-level blockmap.
 427
 428 The freemap radix allocation mechanism is also the same, meaning that it is
 429 bottom-up and will not allocate unnecessary intermediate levels for smaller
 430 filesystems.  The number of blockmap levels not including the volume header
 431 for various filesystem sizes is as follows:
 432
 433         up-to           #of freemap levels
 434         1GB             1-level
 435         256GB           2-level
 436         64TB            3-level
 437         16PB            4-level
 438         4EB             5-level
 439         16EB            6-level
 440
 441 The Freemap has bitmap granularity down to 16KB and a linear iterator that
 442 can linearly allocate space down to 1KB.  Due to fragmentation it is possible
 443 for the linear allocator to become marginalized, but it is relatively easy
 444 to for a reallocation of small blocks every once in a while (like once a year
 445 if you care at all) and once the old data cycles out of the snapshots, or you
 446 also rewrite the snapshots (which you can do), the freemap should wind up
 447 relatively optimal again.  Generally speaking I believe that algorithms can
 448 be developed to make this a non-problem without requiring any media structure
 449 changes.
 450
 451 In order to implement fast snapshots (and writable snapshots for that
 452 matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
 453 keep track of 100% free blocks plus some extra bits for staging the bulkfree
 454 scan.  The lack of ref-counting makes it possible to:
 455
 456     - Completely trivialize HAMMER2s snapshot operations.
 457     - Allows any volume header backup to be used trivially.
 458     - Allows whole sub-trees to be destroyed without having to scan them.
 459     - Simplifies normal crash recovery operations.
 460     - Simplifies catastrophic recovery operations.
 461
 462 Normal crash recovery is simply a matter of doing an incremental scan
 463 of the topology between the last flushed freemap TID and the last flushed
 464 topology TID.  This usually takes only a few seconds and allows:
 465
 466     - Freemap flushes to be be deferred for any number of topology flush
 467       cycles.
 468     - Does not have to be flushed for fsync, reducing fsync overhead.
 469
 470                                 FREEMAP - BULKFREE
 471
 472 Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
 473 Blocks are first marked as being possibly free and then finalized in the
 474 second scan.  Live filesystem operations are allowed to run during these
 475 scans and any freemap block that is allocated or adjusted after the first
 476 scan will simply be re-marked as allocated and the second scan will not
 477 transition it to being free.
 478
 479 The cost of not doing ref-count tracking is that HAMMER2 must perform two
 480 bulkfree scans of the meta-data to determine which blocks can actually be
 481 freed.  This can be complicated by the volume header backups and snapshots
 482 which cause the same meta-data topology to be scanned over and over again,
 483 but mitigated somewhat by keeping a cache of higher-level nodes to detect
 484 when we would scan a sub-topology that we have already scanned.  Due to the
 485 copy-on-write nature of the filesystem, such detection is easy to implement.
 486
 487 Part of the ongoing design work is finding ways to reduce the scope of this
 488 meta-data scan so the entire filesystem's meta-data does not need to be
 489 scanned (though in tests with HAMMER1, even full meta-data scans have
 490 turned out to be fairly low cost).  In other words, its an area where
 491 improvements can be made without any media format changes.
 492
 493 Another advantage of operating the freemap like this is that some future
 494 version of HAMMER2 might decide to completely change how the freemap works
 495 and would be able to make the change with relatively low downtime.
 496
 497                                   CLUSTERING
 498
 499 Clustering, as always, is the most difficult bit but we have some advantages
 500 with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
 501 structures generally follow the kernel's filesystem hiearchy which allows
 502 cluster operations to use topology cache and lock state.  Second,
 503 HAMMER2's writable snapshots make it possible to implement several forms
 504 of multi-master clustering.
 505
 506 The mount device path you specify serves to bootstrap your entry into
 507 the cluster.  This is typically local media.  It can even be a ram-disk
 508 that only contains placemarkers that help HAMMER2 connect to a fully
 509 networked cluster.
 510
 511 With HAMMER2 you mount a directory entry under the super-root.  This entry
 512 will contain a cluster identifier that helps HAMMER2 identify and integrate
 513 with the nodes making up the cluster.  HAMMER2 will automatically integrate
 514 *all* entries under the super-root when you mount one of them.  You have to
 515 mount at least one for HAMMER2 to integrate the block device in the larger
 516 cluster.
 517
 518 For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
 519 which can be mounted in order to make the rest of the elements under the
 520 super-root available to the network.  (In a prior specification I emplaced
 521 the cluster connections in the volume header's configuration space but I no
 522 longer do that).
 523
 524 Connecting to the wider networked cluster involves setting up the /etc/hammer2
 525 directory with appropriate IP addresses and keys.  The user-mode hammer2
 526 service daemon maintains the connections and performs graph operations
 527 via libdmsg.
 528
 529 Node types within the cluster:
 530
 531     DUMMY       - Used as a local placeholder (typically in ramdisk)
 532     CACHE       - Used as a local placeholder and cache (typically on a SSD)
 533     SLAVE       - A SLAVE in the cluster, can source data on quorum agreement.
 534     MASTER      - A MASTER in the cluster, can source and sink data on quorum
 535                   agreement.
 536     SOFT_SLAVE  - A SLAVE in the cluster, can source data locally without
 537                   quorum agreement (must be directly mounted).
 538     SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster.  Can source
 539                   and sink data locally without quorum agreement, intended to
 540                   be synchronized with the real MASTERs when connectivity
 541                   allows.  Operations are not coherent with the real MASTERS
 542                   even when they are available.
 543
 544     NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
 545           SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
 546           synchronized against current masters.
 547
 548     NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
 549           by giving it a unique cluster id, taking it out of the cluster that
 550           originally spawned it.
 551
 552 There are four major protocols:
 553
 554     Quorum protocol
 555
 556         This protocol is used between MASTER nodes to vote on operations
 557         and resolve deadlocks.
 558
 559         This protocol is used between SOFT_MASTER nodes in a sub-cluster
 560         to vote on operations, resolve deadlocks, determine what the latest
 561         transaction id for an element is, and to perform commits.
 562
 563     Cache sub-protocol
 564
 565         This is the MESI sub-protocol which runs under the Quorum
 566         protocol.  This protocol is used to maintain cache state for
 567         sub-trees to ensure that operations remain cache coherent.
 568
 569         Depending on administrative rights this protocol may or may
 570         not allow a leaf node in the cluster to hold a cache element
 571         indefinitely.  The administrative controller may preemptively
 572         downgrade a leaf with insufficient administrative rights
 573         without giving it a chance to synchronize any modified state
 574         back to the cluster.
 575
 576     Proxy protocol
 577
 578         The Quorum and Cache protocols only operate between MASTER
 579         and SOFT_MASTER nodes.  All other node types must use the
 580         Proxy protocol to perform similar actions.  This protocol
 581         differs in that proxy requests are typically sent to just
 582         one adjacent node and that node then maintains state and
 583         forwards the request or performs the required operation.
 584         When the link is lost to the proxy, the proxy automatically
 585         forwards a deletion of the state to the other nodes based on
 586         what it has recorded.
 587
 588         If a leaf has insufficient administrative rights it may not
 589         be allowed to actually initiate a quorum operation and may only
 590         be allowed to maintain partial MESI cache state or perhaps none
 591         at all (since cache state can block other machines in the
 592         cluster).  Instead a leaf with insufficient rights will have to
 593         make due with a preemptive loss of cache state and any allowed
 594         modifying operations will have to be forwarded to the proxy which
 595         continues forwarding it until a node with sufficient administrative
 596         rights is encountered.
 597
 598         To reduce issues and give the cluster more breath, sub-clusters
 599         made up of SOFT_MASTERs can be formed in order to provide full
 600         cache coherent within a subset of machines and yet still tie them
 601         into a greater cluster that they normally would not have such
 602         access to.  This effectively makes it possible to create a two
 603         or three-tier fan-out of groups of machines which are cache-coherent
 604         within the group, but perhaps not between groups, and use other
 605         means to synchronize between the groups.
 606
 607     Media protocol
 608
 609         This is basically the physical media protocol.
 610
 611                        MASTER & SLAVE SYNCHRONIZATION
 612
 613 With HAMMER2 I really want to be hard-nosed about the consistency of the
 614 filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
 615 to guarantee consistency we take advantage of the copy-on-write nature of
 616 the filesystem by forking consistent nodes and using the forked copy as the
 617 source for synchronization.
 618
 619 Similarly, the target for synchronization is not updated on the fly but instead
 620 is also forked and the forked copy is updated.  When synchronization is
 621 complete, forked sources can be thrown away and forked copies can replace
 622 the original synchronization target.
 623
 624 This may seem complex, but 'forking a copy' is actually a virtually free
 625 operation.  The top-level inode (under the super-root), on-media, is simply
 626 copied to a new inode and poof, we have an unchanging snapshot to work with.
 627
 628         - Making a snapshot is fast... almost instantanious.
 629
 630         - Snapshots are used for various purposes, including synchronization
 631           of out-of-date nodes.
 632
 633         - A snapshot can be converted into a MASTER or some other PFS type.
 634
 635         - A snapshot can be forked off from its parent cluster entirely and
 636           turned into its own writable filesystem, either as a single MASTER
 637           or this can be done across the cluster by forking a quorum+ of
 638           existing MASTERs and transfering them all to a new cluster id.
 639
 640 More complex is reintegrating the target once the synchronization is complete.
 641 For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
 642 However, if the SLAVE is mounted and not optioned as a static mount (that is
 643 the mounter wants to see updates as they are synchronized), a reconciliation
 644 must occur on the live mount to clean up the vnode, inode, and chain caches
 645 and shift any remaining vnodes over to the updated copy.
 646
 647         - A mounted SLAVE can track updates made to the SLAVE but the
 648           actual mechanism is that the SLAVE PFS is replaced with an
 649           updated copy, typically every 30-60 seconds.
 650
 651 Reintegrating a MASTER which has fallen out of the quorum due to being out
 652 of date is also somewhat more complex.  The same updating mechanic is used,
 653 we actually have to throw the 'old' MASTER away once the new one has been
 654 updated.  However if the cluster is undergoing heavy modifications the
 655 updated MASTER will be out of date almost the instant its source is
 656 snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
 657 interaction.
 658
 659         - If a MASTER is really out of date we can run one or more
 660           synchronization passes concurrent with modifying operations.
 661           The quorum can remain live.
 662
 663         - A final synchronization pass is required with quorum operations
 664           blocked to reintegrate the now up-to-date MASTER into the cluster.
 665
 666
 667                                 QUORUM OPERATIONS
 668
 669 Quorum operations can be broken down into HARD BLOCK operations and NETWORK
 670 operations.  If your MASTERs are all local mounts, then failures and
 671 sequencing is easy to deal with.
 672
 673 Quorum operations on a networked cluster are more complex.  The problems:
 674
 675     - Masters cannot rely on clients to moderate quorum transactions.
 676       Apart from the reliance being unsafe, the client could also
 677       lose contact with one or more masters during the transaction and
 678       leave one or more masters out-of-sync without the master(s) knowing
 679       they are out of sync.
 680
 681     - When many clients are present, we do not want a flakey network
 682       link from one to cause one or more masters to go out of
 683       synchronization and potentially stall the whole works.
 684
 685     - Normal hammer2 mounts allow a virtually unlimited number of modifying
 686       transactions between actual flushes.  The media flush rolls everything
 687       up into a single transaction id per flush.  Detection of 'missing'
 688       transactions in a concurrent multi-client setup when one or more client
 689       temporarily loses connectivity is thus difficult.
 690
 691     - Clients have a limited amount of time to reconnect to a cluster after
 692       a network disconnect before their MESI cache states are lost.
 693
 694     - Clients may proceed with several transactions before knowing for sure
 695       that earlier transactions were completely successful.  Performance is
 696       important, we won't be waiting for a full quorum-verified synchronous
 697       flush to media before allowing a system call to return.
 698
 699     - Masters can decide that a client's MESI cache states were lost (i.e.
 700       that the transaction was too slow) as well.
 701
 702 The solutions (for modifying transactions):
 703
 704     - Masters handle quorum confirmation amongst themselves and do not rely
 705       on the client for that purpose.
 706
 707     - A client can connect to one or more masters regardless of the size of
 708       the quorum and can submit modifying operations to a single master if
 709       desired.  The master will take care of the rest.
 710
 711       A client must still validate the quorum (and obtain MESI cache states)
 712       when doing read-only operations in order to present the correct data
 713       to the user process for the VOP.
 714
 715     - Masters will run a 2-phase commit amongst themselves, often concurrent
 716       with other non-conflicting transactions, and will serialize operations
 717       and/or enforce synchronization points for 2-phase completion on
 718       serialized transactions from the same client or when cache state
 719       ownership is shifted from one client to another.
 720
 721     - Clients will usually allow operations to run asynchronously and return
 722       from system calls more or less ASAP once they own the necessary cache
 723       coherency locks.  The client can select the validation mode to wait for
 724       with mount options:
 725
 726       (1) Fully async           (mount -o async)
 727       (2) Wait for phase-1 ack  (mount)
 728       (3) Wait for phase-2 ack  (mount -o sync)         (fsync - wait p2ack)
 729       (4) Wait for flush        (mount -o sync)         (fsync - wait flush)
 730
 731       Modifying system calls cannot be told to wait for a full media
 732       flush, as full media flushes are prohibitively expensive.  You
 733       still have to fsync().
 734
 735       The fsync wait mode for network links can be selected, either to
 736       return after the phase-2 ack or to return after the media flush.
 737       The default is to wait for the phase-2 ack, which at least guarantees
 738       that a network failure after that point will not disrupt operations
 739       issued before the fsync.
 740
 741     - Clients must adjust the chain state for modifying operations prior to
 742       releasing chain locks / returning from the system call, even if the
 743       masters have not finished the transaction.  A late failure by the
 744       cluster will result in desynchronized state which requires erroring
 745       out the whole filesystem or resynchronizing somehow.
 746
 747     - Clients can opt to keep a record of transactions through the phase-2
 748       ack or the actual media flush on the masters.
 749
 750       However, replaying/revalidating the log cannot necessarily guarantee
 751       success.  If the masters lose synchronization due to network issues
 752       between masters (or if the client was mounted fully-async), or if enough
 753       masters crash simultaniously such that a quorum fails to flush even
 754       after the phase-2 ack, then it is possible that by the time a client
 755       is able to replay/revalidate, some other client has squeeded in and
 756       committed something that would conflict.
 757
 758       If the client crashes it works similarly to a crash with a local storage
 759       mount... many dirty buffers might be lost.  And the same happens in
 760       the cluster case.
 761
 762                                 TRANSACTION LOG
 763
 764 Keeping a short-term transaction log, much less being able to properly replay
 765 it, is fraught with difficulty and I've made it a separate development task.
 766
 767