sys/vfs/hammer2/DESIGN

   1
   2                             HAMMER2 DESIGN DOCUMENT
   3
   4                                 Matthew Dillon
   5                              dillon@backplane.com
   6
   7                                24-Jul-2017 (v5)
   8                                09-Jul-2016 (v4)
   9                                03-Apr-2015 (v3)
  10                                14-May-2013 (v2)
  11                                08-Feb-2012 (v1)
  12
  13                         Current Status as of document date
  14
  15 * Filesystem Core       - operational
  16   - bulkfree            - operational
  17   - Compression         - operational
  18   - Snapshots           - operational
  19   - Deduper             - live operational, batch specced
  20   - Subhierarchy quotas - scrapped (still possible on a limited basis)
  21   - Logical Encryption  - not specced yet
  22   - Copies              - not specced yet
  23   - fsync bypass        - not specced yet
  24
  25 * Clustering core
  26   - Network msg core    - operational
  27   - Network blk device  - operational
  28   - Error handling      - under development
  29   - Quorum Protocol     - under development
  30   - Synchronization     - under development
  31   - Transaction replay  - not specced yet
  32   - Cache coherency     - not specced yet
  33
  34                             Intended Feature List
  35                        (not all features yet implemented)
  36
  37 * Standard filesystem semantics with full hardlink and softlink support.
  38
  39 * Filesystems can be constructed across multiple nodes.  Each low-level
  40   H2 device can accomodate nodes belonging to multiple cluster components
  41   as well as nodes that are simply local to the device or machine.
  42
  43 * A dynamic radix tree on each formatted device is used to index the
  44   topology, with 64-bit keys.  Elements can be ranged in powers of 2.
  45   The tree is built efficiently, bottom-up.  Indirect blocks are only
  46   created when a layer fills up.
  47
  48 * Utilizes a copy-on-write block mechanic for both the main topology
  49   and the freemap.  Media-level block frees are delayed and flushes rotate
  50   between 4 volume headers (maxes out at 4 if the filesystem is > ~8GB).
  51   Flushes will allocate new blocks up to the root in order to propagate
  52   block table changes and transaction ids.  Recovery will choose the most
  53   recent valid volume root and can thus work around failures which cause
  54   partial volume header writes.
  55
  56 * Utilizes a fat blockref structure (128 bytes) which can store up to
  57   64 bytes (512 bits) of check code data.  Defaults to simpler 64-bit
  58   hashes.
  59
  60 * 1024 byte fat inode structure also contains helpful meta-data for
  61   debugging catastrophic recovery, up to 512 bytes of direct-data for
  62   small files, or 4 indirect blocks (instead of the direct-data) as
  63   the data root.
  64
  65   Inodes are stored as hidden elements under each node directory in the
  66   super-root.  H2 originally tried to embed inodes in the directories in
  67   which they resides, or in the nearest common parent directory when multiple
  68   hardlinks were present, but this wound up being too difficult to get
  69   right and made NFS support impossible (would have required adding more
  70   complexity to index inode numbers, which I didn't want to do).
  71
  72 * Directory entries are indexed in the radix tree just like everything
  73   else based on a hash of the filename plus an iterator to deal with
  74   collisions nicely.  Directory entries with filenames <= 64 bytes
  75   fit entirely within the 128-byte blockref structure without requiring
  76   a data reference for highly optimal directory scans.  In the optimal
  77   case the directory entry simply uses the 64-byte check field to store
  78   the filename (since there is no data reference).
  79
  80   Directory entries record inode number, file type, and filename.
  81
  82 * The top-level for each formatted device is called the super-root.  The
  83   top-level cannot be directly mounted.  Instead, it contains inodes
  84   representing pieces of nodes in the cluster which can be individually
  85   mounted.
  86
  87   This means that H2 filesystems can create multiple roots if desired.
  88   In fact, snapshots are stored as discretely mountable nodes at this
  89   level, so it is possible to boot from or mount root from a snapshot.
  90
  91   (HAMMER1 had only one root, but multiple 'PFS's, but they weren't nearly
  92   as flexible).
  93
  94   Each formatted H2 device can hold pieces of or whole nodes belonging
  95   to multiple filesystems.  This allows independent cluster components to
  96   be configured within a single formatted H2 filesystem.  Each component is
  97   a super-root entry, a cluster identifier, and a unique identifier.  The
  98   network protocl integrates the component into the cluster when it is
  99   created
 100
 101 * Snapshots are implemented as nodes under the super-root.  Snapshots
 102   are writable and easy to create (in HAMMER1 snapshots were read-only).
 103   However, HAMMER2 snapshots are not as fine-grained as HAMMER1 snapshots,
 104   and also not automatic.  They must be explicitly created but are cheap
 105   enough that fine-grained H2 snapshots can be created on a schedule if
 106   desired.
 107
 108 * Utilizes a Frontend-Backend operational design with multiple dedicated
 109   threads for each cluster component.  This allows the frontend to dispatch
 110   parallel ops to backend components and then finalize the frontend operation
 111   the instant a sufficient number of components agree (depending on the
 112   replication mode), even if other nodes in the cluster are stalled.
 113
 114   H2 can deal with stalled backend threads without stalling the frontend.
 115
 116 * Flush handling is really difficult to deal with because we want to
 117   utilize the system's normal buffer cache for efficiency.  This means
 118   that some flush elements (dirty data buffer cache buffers) are not
 119   necessarily in sync with dirty meta-data tracked by the filesystem code.
 120
 121   If H2 locks the entire filesystem during a flush, then many front-end
 122   operations can wind up stalling for a very long time (depending on how
 123   much dirty data the filesystem and operating system lets build up).
 124
 125   Currently HAMMER2 tries to deal with by allowing for an almost-fully
 126   asynchronous flush.  Essentially, everything related to data and meta-data
 127   except the volume header itself can be flushed asynchronously.  This
 128   means it can also be flushed concurrently with front-end operations.
 129
 130   In order to make the 'final' flush of the volume header itself meaningful,
 131   the flush code will first attempt to asynchronously flush all pending
 132   buffers and meta-data, then will lock the filesystem and do a second
 133   flush of anything that slipped through while the first flush was running.
 134   And then will flush the volume header.
 135
 136   CURRENT STATUS: This work is still in progress and there are still
 137   stall issues in the handling of flushes.
 138
 139 * Low memory footprint.  Except for the volume header, the buffer cache
 140   is completely asynchronous and dirty buffers can be retired by the OS
 141   directly to backing store with no further interactions with the filesystem.
 142
 143 * Compression support.  Multiple algorithms are supported and can be
 144   configured on a subdirectory hierarchy or individual file basis.
 145   Block compression up to 64KB will be used.  Only compression ratios at
 146   powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in
 147   this scheme because physical block allocations in HAMMER2 are always
 148   power-of-2.  Modest compression can be achieved with low overhead, is
 149   turned on by default, and is compatible with deduplication.
 150
 151 * De-duplication support.  HAMMER2 uses a relatively simple freemap
 152   scheme that allows the filesystem to discard block references
 153   asynchronously, and the same scheme allows essentially unlimited
 154   references to the same data block in the hierarchy.  Thus, both live
 155   de-duplication and bulk deduplication is relatively easy to implement.
 156
 157 * Zero detection on write (writing all-zeros), which requires the data
 158   buffer to be scanned, is fully supported.  This allows the writing of 0's
 159   to create holes.
 160
 161   Generally speaking pre-writing zerod blocks to reserve space doesn't work
 162   well on copy-on-write filesystems.  However, if both compression and
 163   check codes are disabled on a file, H2 will also disable zero-detection,
 164   allow pre-reservation of file blocks (by pre-zeroing), and allow data
 165   overwrites to write to the same sector.
 166
 167 * In addition to the above, sector overwrites (avoiding the copy-on-write)
 168   are also allowed when multiple writes to the same block occur in-between
 169   flush operations.
 170
 171 * Incremental synchronization via highest-transaction id propagation
 172   within the radix tree.  This is a queueless, incremental design.
 173
 174   CURRENT STATUS: Due to the flat inode hierarchy now being employed,
 175   the current synchronization code which silently recurses indirect nodes
 176   will be inefficient due to the fact that all the inodes are at the
 177   same logical level in the topology.  To fix this, the code will need
 178   to explicitly iterate indirect nodes and keep track of the related
 179   key ranges to match them up on an indirect-block basis, which would
 180   be incredibly efficient.
 181
 182 * Background synchronization and mirroring occurs at the logical layer
 183   rather than the physical layer.  This allows cluster components to
 184   have differing storage arrangements.
 185
 186   In addition, this mechanism will fully correct any out of sync nodes
 187   in the cluster as long as a sufficient number of other nodes agree on
 188   what the proper state should be.
 189
 190                         DESIGN PENDING ON THESE FEATURES
 191
 192 * Encryption.  Whole-disk encryption is supported by another layer, but I
 193   intend to give H2 an encryption feature at the logical layer which works
 194   approximately as follows:
 195
 196   - Encryption controlled by the client on an inode/sub-tree basis.
 197   - Server has no visibility to decrypted data.
 198   - Encrypt filenames in directory entries.  Since the filename[] array
 199     is 256 bytes wide, client can add random bytes after the normal
 200     terminator to make it virtually impossible for an attacker to figure
 201     out the filename.
 202   - Encrypt file size and most inode contents.
 203   - Encrypt file data (holes are not encrypted).
 204   - Encryption occurs after compression, with random filler.
 205   - Check codes calculated after encryption & compression (not before).
 206
 207   - Blockrefs are not encrypted.
 208   - Directory and File Topology is not encrypted.
 209   - Encryption is not sub-topology validation.  Client would have to keep
 210     track of that itself.  Server or other clients can still e.g. remove
 211     files, rename, etc.
 212
 213   In particular, note that even though the file size field can be encrypted,
 214   the server does have visibility on the block topology and thus has a pretty
 215   good idea how big the file is.  However, a client could add junk blocks
 216   at the end of a file to make this less apparent, at the cost of space.
 217
 218   If a client really wants a fully validated H2-encrypted space the easiest
 219   solution is to format a filesystem within an encrypted file by treating it
 220   as a block device, but I digress.
 221
 222 * Device ganging, copies for redundancy, and file splitting.
 223
 224   Device ganging - The idea here is not to gang devices into a single
 225   physical volume but to instead format each device independently
 226   and allow crossover-references in the blockref to other devices in
 227   the set.
 228
 229   One of the things we want to accomplish is to ensure that a failed
 230   device does not prevent access to radix tree elements in other devices
 231   in the gang, and that the failed device can be reconstructed.  To do
 232   this, each device implements complete reachability from the node root
 233   to all elements underneath it.  When a device fails, the sychronization
 234   code can theoretically reconstruct the missing material in other
 235   devices making up the gang.  New devices can be added to the gang and
 236   existing devices can be removed from the gang.
 237
 238   Redundant copies - This is actually a fairly tough problem.  The
 239   solution I would like to implement is to use the device ganging feature
 240   to also implement redundancy, that way if a device fails within the
 241   gang there's a good chance that it can still remain completely functional
 242   without having to resynchronize.  But making this work is difficult to say
 243   the least.
 244
 245 * MESI Cache coherency for multi-master/multi-client clustering operations.
 246   The servers hosting the MASTERs are also responsible for keeping track of
 247   the cache state.
 248
 249   This is a feature that we would need to implement coherent cross-machine
 250   multi-threading and migration.
 251
 252 * Implement unverified de-duplication (where only the check code is tested,
 253   avoiding having to actually read data blocks to calculate a de-duplication.
 254   This would make use of the blockref structure's widest check field
 255   (512 bits).
 256
 257   Out of necessity this type of feature would be settable on a file or
 258   recursive directory tree basis, but should only be used when the data
 259   is throw-away or can be reconstructed since data corruption (mismatched
 260   duplicates with the same hash) is still possible even with a 512-bit
 261   check code.
 262
 263   The Unverified dedup feature is intended only for those files where
 264   occassional corruption is ok, such as in a web-crawler data store or
 265   other situations where the data content is not critically important
 266   or can be externally recovered if it becomes corrupt.
 267
 268                                 GENERAL DESIGN
 269
 270 HAMMER2 generally implements a copy-on-write block design for the filesystem,
 271 which is very different from HAMMER1's B-Tree design.  Because the design
 272 is copy-on-write it can be trivially snapshotted simply by making a copy
 273 of the block table we desire to snapshot.  Snapshotting the root inode
 274 effectively snapshots the entire filesystem, whereas snapshotting a file
 275 inode only snapshots that one file.  Snapshotting a directory inode is
 276 generally unhelpful since it only contains directory entries and the
 277 underlying files are not arranged under it in the radix tree.
 278
 279 The copy-on-write design implements a block table as a radix-tree,
 280 with a small fan-out in the volume header and inode (typically 4x) and
 281 a large fan-out for indirect blocks (typically 128x and 512x depending).
 282 The table is built bottom-up.  Intermediate radii are only created when
 283 necessary so small files and directories will have a much shallower radix
 284 tree.
 285
 286 HAMMER2 implements several space optimizations:
 287
 288   1. Directory entries with filenames <= 64 characters will fit entirely
 289      in the 128-byte blockref structure and do not require additional data
 290      block references.  Since blockrefs are the core elements making up
 291      block tables, most directories should have good locality of reference
 292      for directory scans.
 293
 294   2. Inodes embed 4 blockrefs, so files up to 256KB and directories with
 295      up to four directory entries (not including "." or "..") can be
 296      accomodated without requiring any indirecct blocks.
 297
 298   3. Indirect blocks can be sized to any power of two up to 65536 bytes,
 299      and H2 typically uses 16384 and 65536 bytes.  The smaller size is
 300      used for initial indirect blocks to reduce storage overhead for
 301      medium-sized files and directories.
 302
 303   4. The File inode itself can directly hold the data for small
 304      files <= 512 bytes in size.
 305
 306   5. The last block in a file will have a storage allocation in powers
 307      of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
 308      512 bytes but less than 64KB will not waste a full 64KB block.
 309
 310   6. When compression is enabled, small physical blocks will be allocated
 311      when possible.  However, only reductions in powers of 2 are supported.
 312      So if a 64KB data block can be compressed to (16KB+1) to 32KB, then
 313      a 32KB block will be used.  This gives H2 modest compression at very
 314      low cost without too much added complexity.
 315
 316   7. Live de-dup will attempt to share data blocks when file copying is
 317      detected, significantly reducing actual physical writes to storage
 318      and the storage used.  Bulk de-dup (when implemented), will catch
 319      other cases of de-duplication.
 320
 321 Directories contain directory entries which are indexed using a hash of
 322 their filename.  The hash is carefully designed to maintain some natural
 323 sort ordering.
 324
 325 The copy-on-write nature of the filesystem means that any modification
 326 whatsoever will have to eventually synchronize new disk blocks all the way
 327 to the super-root of the filesystem and the volume header itself.  This forms
 328 the basis for crash recovery and also ensures that recovery occurs on a
 329 completed high-level transaction boundary.  All disk writes are to new blocks
 330 except for the volume header (which cycles through 4 copies), thus allowing
 331 all writes to run asynchronously and concurrently prior to and during a flush,
 332 and then just doing a final synchronization and volume header update at the
 333 end.  Many of HAMMER2s features are enabled by this core design feature.
 334
 335 The Freemap is also implemented using a radix tree via a set of pre-reserved
 336 blocks (approximately 4MB for every 2GB of storage), and also cycles through
 337 multiple copies to ensure that crash recovery can restore the state of the
 338 filesystem quickly at mount time.
 339
 340 HAMMER2 tries to maintain a small footprint and one way it does this is
 341 by using the normal buffer cache for data and meta-data, and allowing the
 342 kernel to asynchronously flush device buffers at any time (even during
 343 synchronization).  The volume root is flushed separately, separated from
 344 the asynchronous flushes by a synchronizing BUF_CMD_FLUSH op.  This means
 345 that HAMMER2 has very low resource overhead from the point of view of the
 346 operating system and is very much unlike HAMMER1 which had to lock dirty
 347 buffers into memory for long periods of time.  HAMMER2 has no such
 348 requirement.
 349
 350 Buffer cache overhead is very well bounded and can handle filesystem
 351 operations of any complexity, even on boxes with very small amounts
 352 of physical memory.  Buffer cache overhead is significantly lower with H2
 353 than with H1 (and orders of magnitude lower than ZFS).
 354
 355 At some point I intend to implement a shortcut to make fsync()'s run fast,
 356 and that is to allow deep updates to blockrefs to shortcut to auxillary
 357 space in the volume header to satisfy the fsync requirement.  The related
 358 blockref is then recorded when the filesystem is mounted after a crash and
 359 the update chain is reconstituted when a matching blockref is encountered
 360 again during normal operation of the filesystem.
 361
 362                         MIRROR_TID, MODIFY_TID, UPDATE_TID
 363
 364 In HAMMER2, the core block reference is a 128-byte structure called a blockref.
 365 The blockref contains various bits of information including the 64-bit radix
 366 key (typically a directory hash if a directory entry, inode number if a
 367 hidden hardlink target, or file offset if a file block), number of significant
 368 key bits for ranged recursion of indirect blocks, a 64-bit device seek that
 369 encodes the radix of the physical block size in the low bits (physical block
 370 size can be different from logical block size due to compression),
 371 three 64-bit transaction ids, type information, and up to 512 bits worth
 372 of check data for the block being reference which can be anything from
 373 a simple CRC to a strong cryptographic hash.
 374
 375 mirror_tid - This is a media-centric (as in physical disk partition)
 376              transaction id which tracks media-level updates.  The mirror_tid
 377              can be different at the same point on different nodes in a
 378              cluster.
 379
 380              Whenever any block in the media topology is modified, its
 381              mirror_tid is updated with the flush id and will propagate
 382              upward during the flush all the way to the volume header.
 383
 384              mirror_tid is monotonic.  It is primarily used for on-mount
 385              recovery and volume root validation.  The name is historical
 386              from H1, it is not used for nominal mirroring.
 387
 388 modify_tid - This is a cluster-centric (as in across all the nodes used
 389              to build a cluster) transaction id which tracks filesystem-level
 390              updates.
 391
 392              modify_tid is updated when the front-end of the filesystem makes
 393              a change to an inode or data block.  It does NOT propagate upward
 394              during a flush.
 395
 396 update_tid - This is a cluster synchronization transaction id.  Modifications
 397              made to the topology will clear this field to 0 as they propagate
 398              up to the root.  This gives the synchronizer an easy way to
 399              determine what needs revalidation.
 400
 401              The synchronizer revalidates the cluster bottom-up by validating
 402              a sub-topology and propagating the highest modify_tid in the
 403              validated sub-topology up via the update_tid field.
 404
 405              Update to this field may be optimized by the HAMMER2 VFS to
 406              avoid the double-transition.
 407
 408 The synchronization code updates an out-of-sync node bottom-up and will
 409 dynamically set update_tid as it goes, but media flushes can occur at any
 410 time and these flushes will use mirror_tid for flush and freemap management.
 411 The mirror_tid for each flush propagates upward to the volume header on each
 412 flush.  modify_tid is set for any chains modified by a cluster op but does
 413 not propagate up, instead serving as a seed for update_tid.
 414
 415 * The synchronization code is able to determine that a sub-tree is
 416   synchronized simply by observing the update_tid at the root of the sub-tree,
 417   on an inode-by-inode basis and also on a data-block-by-data-block basis.
 418
 419 * The synchronization code is able to do an incremental update of an
 420   out-of-sync node simply by skipping elements with a matching update_tid
 421   (when not 0).
 422
 423 * The synchronization code can be interrupted and restarted at any time,
 424   and is able to pick up where it left off with very low overhead.
 425
 426 * The synchronization code does not inhibit media flushes.  Media flushes
 427   can occur (and must occur) while synchronization is ongoing.
 428
 429 There are several other stored transaction ids in HAMMER2.  There is a
 430 separate freemap_tid in the volume header that is used to allow freemap
 431 flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
 432 conjuction with CHECK_NONE to allow blocks without a check code which do
 433 not violate the most recent snapshot to be overwritten in-place.
 434
 435 Remember that since this is a copy-on-write filesystem, we can propagate
 436 a considerable amount of information up the tree to the volume header
 437 without adding to the I/O we already have to do.
 438
 439                             DIRECTORIES AND INODES
 440
 441 Directories are hashed.  In HAMMER2, a directory can contain a mix of
 442 directory entries AND embedded inodes.  In the first iteration of HAMMER2
 443 I tried really hard to embed inodes (since most files are not usually
 444 hardlinked), but this created huge problems for NFS exports.  At the
 445 moment the super-root directory utilizes embedded inodes and filesystems
 446 under the super-root typically use normal directory entries.  The real
 447 inodes are in the mounted root directory as hidden entries.
 448
 449 However, I reserve the right to implement embedded inodes within a
 450 mount to create export domains which can serve as mini-roots.  Such
 451 mini-roots would be able to have their own quotas and would be separately
 452 snapshottable, but would also have to be exported separately from the
 453 primary mount they exist under.
 454
 455 Hardlinks are implemented normally, with directory entries and the maintenance
 456 of a nlinks count in the target inode.
 457
 458                                     RECOVERY
 459
 460 H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
 461 tracks a separate transaction id (via mirror_tid) in the volume header.
 462
 463 On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
 464 volume header from the 4 copies available (if the filesystem is big enough,
 465 e.g. > ~10GB or so, there will be 4 copies of the volume header).
 466
 467 HAMMER2 will then run an incremental scan of the topology for mirror_tid
 468 transaction ids between the last freemap flush tid and the last topology
 469 flush tid in order to synchronize the freemap.  Because this scan is
 470 incremental the time it takes to run will be relatively short and well-bounded
 471 at mount-time.  This is NOT an fsck.  Freemap flushes can be avoided for any
 472 number of normal topology flushes but should still occur frequently enough
 473 to avoid long recovery times in case of a crash.
 474
 475 The filesystem is then ready for use.
 476
 477                             DISK I/O OPTIMIZATIONS
 478
 479 The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
 480 by the freemap is zoned and has a tendancy to collect inodes, small data,
 481 indirect blocks, and larger data blocks into separate segments.  The idea is
 482 to greatly improve I/O performance (particularly by laying inodes down next
 483 to each other which has a huge effect on directory scans).
 484
 485 The current implementation of HAMMER2 implements a fixed I/O block size
 486 of 64KB in order to allow the mapping of hammer2_dio's in its IO subsystem
 487 to conumers that might desire different sizes.  This way we don't have to
 488 worry about matching the buffer cache / DIO cache to the variable block
 489 size of underlying elements.  In addition, 64KB I/Os allow compatibility
 490 with physical sector sizes up to 64KB in the underlying physical storage
 491 with no change in the byte-by-byte format of the filesystem.
 492
 493 The biggest issue we are avoiding by having a fixed 64KB I/O size is not
 494 actually to help nominal front-end access issue but instead to reduce the
 495 complexity of having to deal with mixed block sizes in the buffer cache,
 496 particularly when blocks are freed and then later reused with a different
 497 block size.  HAMMER1 had to have specialized code to check for and
 498 invalidate buffer cache buffers in the free/reuse case.  HAMMER2 does not
 499 need such code.
 500
 501 That said, HAMMER2 places no major restrictions on mixing block sizes within
 502 a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
 503 a 64KB boundary.  The soft restrictions the block allocator puts in place
 504 exist primarily for performance reasons (i.e. try to collect 1K inodes
 505 together).  The 2MB freemap zone granularity should work very well in this
 506 regard.
 507
 508 HAMMER2 also allows OS support for ganging buffers together into even
 509 larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
 510 OS-driven asynchronous retirement, and other performance features typically
 511 provided by the OS at the block-level to ensure smooth system operation.
 512
 513 By avoiding wiring buffers/memory and allowing these features to run normally,
 514 HAMMER2 winds up with very low OS overhead.
 515
 516                                 FREEMAP NOTES
 517
 518 The freemap is stored in the reserved blocks situated in the ~4MB reserved
 519 area at the baes of every ~1GB level-1 zone.  The current implementation
 520 reserves 8 copies of every freemap block and cycles through them in order
 521 to make the freemap operate in a copy-on-write fashion.
 522
 523     - Freemap is copy-on-write.
 524     - Freemap operations are transactional, same as everything else.
 525     - All backup volume headers are consistent on-mount.
 526
 527 The Freemap is organized using the same radix blockmap algorithm used for
 528 files and directories, but with fixed radix values.  For a maximally-sized
 529 filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
 530 but the top-level is embedded in the volume header so insofar as performance
 531 goes it is really just a 4-level blockmap.
 532
 533 The freemap radix allocation mechanism is also the same, meaning that it is
 534 bottom-up and will not allocate unnecessary intermediate levels for smaller
 535 filesystems.  The number of blockmap levels not including the volume header
 536 for various filesystem sizes is as follows:
 537
 538         up-to           #of freemap levels
 539         1GB             1-level
 540         256GB           2-level
 541         64TB            3-level
 542         16PB            4-level
 543         4EB             5-level
 544         16EB            6-level
 545
 546 The Freemap has bitmap granularity down to 16KB and a linear iterator that
 547 can linearly allocate space down to 1KB.  Due to fragmentation it is possible
 548 for the linear allocator to become marginalized, but it is relatively easy
 549 to for a reallocation of small blocks every once in a while (like once a year
 550 if you care at all) and once the old data cycles out of the snapshots, or you
 551 also rewrite the snapshots (which you can do), the freemap should wind up
 552 relatively optimal again.  Generally speaking I believe that algorithms can
 553 be developed to make this a non-problem without requiring any media structure
 554 changes.
 555
 556 In order to implement fast snapshots (and writable snapshots for that
 557 matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
 558 keep track of 100% free blocks plus some extra bits for staging the bulkfree
 559 scan.  The lack of ref-counting makes it possible to:
 560
 561     - Completely trivialize HAMMER2s snapshot operations.
 562     - Allows any volume header backup to be used trivially.
 563     - Allows whole sub-trees to be destroyed without having to scan them.
 564     - Simplifies normal crash recovery operations.
 565     - Simplifies catastrophic recovery operations.
 566
 567 Normal crash recovery is simply a matter of doing an incremental scan
 568 of the topology between the last flushed freemap TID and the last flushed
 569 topology TID.  This usually takes only a few seconds and allows:
 570
 571     - Freemap flushes to be be deferred for any number of topology flush
 572       cycles.
 573     - Does not have to be flushed for fsync, reducing fsync overhead.
 574
 575                                 FREEMAP - BULKFREE
 576
 577 Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
 578 Blocks are first marked as being possibly free and then finalized in the
 579 second scan.  Live filesystem operations are allowed to run during these
 580 scans and any freemap block that is allocated or adjusted after the first
 581 scan will simply be re-marked as allocated and the second scan will not
 582 transition it to being free.
 583
 584 The cost of not doing ref-count tracking is that HAMMER2 must perform two
 585 bulkfree scans of the meta-data to determine which blocks can actually be
 586 freed.  This can be complicated by the volume header backups and snapshots
 587 which cause the same meta-data topology to be scanned over and over again,
 588 but mitigated somewhat by keeping a cache of higher-level nodes to detect
 589 when we would scan a sub-topology that we have already scanned.  Due to the
 590 copy-on-write nature of the filesystem, such detection is easy to implement.
 591
 592 Part of the ongoing design work is finding ways to reduce the scope of this
 593 meta-data scan so the entire filesystem's meta-data does not need to be
 594 scanned (though in tests with HAMMER1, even full meta-data scans have
 595 turned out to be fairly low cost).  In other words, its an area where
 596 improvements can be made without any media format changes.
 597
 598 Another advantage of operating the freemap like this is that some future
 599 version of HAMMER2 might decide to completely change how the freemap works
 600 and would be able to make the change with relatively low downtime.
 601
 602                                   CLUSTERING
 603
 604 Clustering, as always, is the most difficult bit but we have some advantages
 605 with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
 606 structures generally follow the kernel's filesystem hiearchy which allows
 607 cluster operations to use topology cache and lock state.  Second,
 608 HAMMER2's writable snapshots make it possible to implement several forms
 609 of multi-master clustering.
 610
 611 The mount device path you specify serves to bootstrap your entry into
 612 the cluster.  This is typically local media.  It can even be a ram-disk
 613 that only contains placemarkers that help HAMMER2 connect to a fully
 614 networked cluster.
 615
 616 With HAMMER2 you mount a directory entry under the super-root.  This entry
 617 will contain a cluster identifier that helps HAMMER2 identify and integrate
 618 with the nodes making up the cluster.  HAMMER2 will automatically integrate
 619 *all* entries under the super-root when you mount one of them.  You have to
 620 mount at least one for HAMMER2 to integrate the block device in the larger
 621 cluster.
 622
 623 For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
 624 which can be mounted in order to make the rest of the elements under the
 625 super-root available to the network.  (In a prior specification I emplaced
 626 the cluster connections in the volume header's configuration space but I no
 627 longer do that).
 628
 629 Connecting to the wider networked cluster involves setting up the /etc/hammer2
 630 directory with appropriate IP addresses and keys.  The user-mode hammer2
 631 service daemon maintains the connections and performs graph operations
 632 via libdmsg.
 633
 634 Node types within the cluster:
 635
 636     DUMMY       - Used as a local placeholder (typically in ramdisk)
 637     CACHE       - Used as a local placeholder and cache (typically on a SSD)
 638     SLAVE       - A SLAVE in the cluster, can source data on quorum agreement.
 639     MASTER      - A MASTER in the cluster, can source and sink data on quorum
 640                   agreement.
 641     SOFT_SLAVE  - A SLAVE in the cluster, can source data locally without
 642                   quorum agreement (must be directly mounted).
 643     SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster.  Can source
 644                   and sink data locally without quorum agreement, intended to
 645                   be synchronized with the real MASTERs when connectivity
 646                   allows.  Operations are not coherent with the real MASTERS
 647                   even when they are available.
 648
 649     NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
 650           SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
 651           synchronized against current masters.
 652
 653     NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
 654           by giving it a unique cluster id, taking it out of the cluster that
 655           originally spawned it.
 656
 657 There are four major protocols:
 658
 659     Quorum protocol
 660
 661         This protocol is used between MASTER nodes to vote on operations
 662         and resolve deadlocks.
 663
 664         This protocol is used between SOFT_MASTER nodes in a sub-cluster
 665         to vote on operations, resolve deadlocks, determine what the latest
 666         transaction id for an element is, and to perform commits.
 667
 668     Cache sub-protocol
 669
 670         This is the MESI sub-protocol which runs under the Quorum
 671         protocol.  This protocol is used to maintain cache state for
 672         sub-trees to ensure that operations remain cache coherent.
 673
 674         Depending on administrative rights this protocol may or may
 675         not allow a leaf node in the cluster to hold a cache element
 676         indefinitely.  The administrative controller may preemptively
 677         downgrade a leaf with insufficient administrative rights
 678         without giving it a chance to synchronize any modified state
 679         back to the cluster.
 680
 681     Proxy protocol
 682
 683         The Quorum and Cache protocols only operate between MASTER
 684         and SOFT_MASTER nodes.  All other node types must use the
 685         Proxy protocol to perform similar actions.  This protocol
 686         differs in that proxy requests are typically sent to just
 687         one adjacent node and that node then maintains state and
 688         forwards the request or performs the required operation.
 689         When the link is lost to the proxy, the proxy automatically
 690         forwards a deletion of the state to the other nodes based on
 691         what it has recorded.
 692
 693         If a leaf has insufficient administrative rights it may not
 694         be allowed to actually initiate a quorum operation and may only
 695         be allowed to maintain partial MESI cache state or perhaps none
 696         at all (since cache state can block other machines in the
 697         cluster).  Instead a leaf with insufficient rights will have to
 698         make due with a preemptive loss of cache state and any allowed
 699         modifying operations will have to be forwarded to the proxy which
 700         continues forwarding it until a node with sufficient administrative
 701         rights is encountered.
 702
 703         To reduce issues and give the cluster more breath, sub-clusters
 704         made up of SOFT_MASTERs can be formed in order to provide full
 705         cache coherent within a subset of machines and yet still tie them
 706         into a greater cluster that they normally would not have such
 707         access to.  This effectively makes it possible to create a two
 708         or three-tier fan-out of groups of machines which are cache-coherent
 709         within the group, but perhaps not between groups, and use other
 710         means to synchronize between the groups.
 711
 712     Media protocol
 713
 714         This is basically the physical media protocol.
 715
 716                        MASTER & SLAVE SYNCHRONIZATION
 717
 718 With HAMMER2 I really want to be hard-nosed about the consistency of the
 719 filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
 720 to guarantee consistency we take advantage of the copy-on-write nature of
 721 the filesystem by forking consistent nodes and using the forked copy as the
 722 source for synchronization.
 723
 724 Similarly, the target for synchronization is not updated on the fly but instead
 725 is also forked and the forked copy is updated.  When synchronization is
 726 complete, forked sources can be thrown away and forked copies can replace
 727 the original synchronization target.
 728
 729 This may seem complex, but 'forking a copy' is actually a virtually free
 730 operation.  The top-level inode (under the super-root), on-media, is simply
 731 copied to a new inode and poof, we have an unchanging snapshot to work with.
 732
 733         - Making a snapshot is fast... almost instantanious.
 734
 735         - Snapshots are used for various purposes, including synchronization
 736           of out-of-date nodes.
 737
 738         - A snapshot can be converted into a MASTER or some other PFS type.
 739
 740         - A snapshot can be forked off from its parent cluster entirely and
 741           turned into its own writable filesystem, either as a single MASTER
 742           or this can be done across the cluster by forking a quorum+ of
 743           existing MASTERs and transfering them all to a new cluster id.
 744
 745 More complex is reintegrating the target once the synchronization is complete.
 746 For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
 747 However, if the SLAVE is mounted and not optioned as a static mount (that is
 748 the mounter wants to see updates as they are synchronized), a reconciliation
 749 must occur on the live mount to clean up the vnode, inode, and chain caches
 750 and shift any remaining vnodes over to the updated copy.
 751
 752         - A mounted SLAVE can track updates made to the SLAVE but the
 753           actual mechanism is that the SLAVE PFS is replaced with an
 754           updated copy, typically every 30-60 seconds.
 755
 756 Reintegrating a MASTER which has fallen out of the quorum due to being out
 757 of date is also somewhat more complex.  The same updating mechanic is used,
 758 we actually have to throw the 'old' MASTER away once the new one has been
 759 updated.  However if the cluster is undergoing heavy modifications the
 760 updated MASTER will be out of date almost the instant its source is
 761 snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
 762 interaction.
 763
 764         - If a MASTER is really out of date we can run one or more
 765           synchronization passes concurrent with modifying operations.
 766           The quorum can remain live.
 767
 768         - A final synchronization pass is required with quorum operations
 769           blocked to reintegrate the now up-to-date MASTER into the cluster.
 770
 771
 772                                 QUORUM OPERATIONS
 773
 774 Quorum operations can be broken down into HARD BLOCK operations and NETWORK
 775 operations.  If your MASTERs are all local mounts, then failures and
 776 sequencing is easy to deal with.
 777
 778 Quorum operations on a networked cluster are more complex.  The problems:
 779
 780     - Masters cannot rely on clients to moderate quorum transactions.
 781       Apart from the reliance being unsafe, the client could also
 782       lose contact with one or more masters during the transaction and
 783       leave one or more masters out-of-sync without the master(s) knowing
 784       they are out of sync.
 785
 786     - When many clients are present, we do not want a flakey network
 787       link from one to cause one or more masters to go out of
 788       synchronization and potentially stall the whole works.
 789
 790     - Normal hammer2 mounts allow a virtually unlimited number of modifying
 791       transactions between actual flushes.  The media flush rolls everything
 792       up into a single transaction id per flush.  Detection of 'missing'
 793       transactions in a concurrent multi-client setup when one or more client
 794       temporarily loses connectivity is thus difficult.
 795
 796     - Clients have a limited amount of time to reconnect to a cluster after
 797       a network disconnect before their MESI cache states are lost.
 798
 799     - Clients may proceed with several transactions before knowing for sure
 800       that earlier transactions were completely successful.  Performance is
 801       important, we won't be waiting for a full quorum-verified synchronous
 802       flush to media before allowing a system call to return.
 803
 804     - Masters can decide that a client's MESI cache states were lost (i.e.
 805       that the transaction was too slow) as well.
 806
 807 The solutions (for modifying transactions):
 808
 809     - Masters handle quorum confirmation amongst themselves and do not rely
 810       on the client for that purpose.
 811
 812     - A client can connect to one or more masters regardless of the size of
 813       the quorum and can submit modifying operations to a single master if
 814       desired.  The master will take care of the rest.
 815
 816       A client must still validate the quorum (and obtain MESI cache states)
 817       when doing read-only operations in order to present the correct data
 818       to the user process for the VOP.
 819
 820     - Masters will run a 2-phase commit amongst themselves, often concurrent
 821       with other non-conflicting transactions, and will serialize operations
 822       and/or enforce synchronization points for 2-phase completion on
 823       serialized transactions from the same client or when cache state
 824       ownership is shifted from one client to another.
 825
 826     - Clients will usually allow operations to run asynchronously and return
 827       from system calls more or less ASAP once they own the necessary cache
 828       coherency locks.  The client can select the validation mode to wait for
 829       with mount options:
 830
 831       (1) Fully async           (mount -o async)
 832       (2) Wait for phase-1 ack  (mount)
 833       (3) Wait for phase-2 ack  (mount -o sync)         (fsync - wait p2ack)
 834       (4) Wait for flush        (mount -o sync)         (fsync - wait flush)
 835
 836       Modifying system calls cannot be told to wait for a full media
 837       flush, as full media flushes are prohibitively expensive.  You
 838       still have to fsync().
 839
 840       The fsync wait mode for network links can be selected, either to
 841       return after the phase-2 ack or to return after the media flush.
 842       The default is to wait for the phase-2 ack, which at least guarantees
 843       that a network failure after that point will not disrupt operations
 844       issued before the fsync.
 845
 846     - Clients must adjust the chain state for modifying operations prior to
 847       releasing chain locks / returning from the system call, even if the
 848       masters have not finished the transaction.  A late failure by the
 849       cluster will result in desynchronized state which requires erroring
 850       out the whole filesystem or resynchronizing somehow.
 851
 852     - Clients can opt to keep a record of transactions through the phase-2
 853       ack or the actual media flush on the masters.
 854
 855       However, replaying/revalidating the log cannot necessarily guarantee
 856       success.  If the masters lose synchronization due to network issues
 857       between masters (or if the client was mounted fully-async), or if enough
 858       masters crash simultaniously such that a quorum fails to flush even
 859       after the phase-2 ack, then it is possible that by the time a client
 860       is able to replay/revalidate, some other client has squeeded in and
 861       committed something that would conflict.
 862
 863       If the client crashes it works similarly to a crash with a local storage
 864       mount... many dirty buffers might be lost.  And the same happens in
 865       the cluster case.
 866
 867                                 TRANSACTION LOG
 868
 869 Keeping a short-term transaction log, much less being able to properly replay
 870 it, is fraught with difficulty and I've made it a separate development task.
 871 For now HAMMER2 does not have one.
 872