sys/vfs/hammer2/DESIGN

   1
   2                             HAMMER2 DESIGN DOCUMENT
   3
   4                                 Matthew Dillon
   5                                  08-Feb-2012
   6                              dillon@backplane.com
   7
   8 * These features have been speced in the media structures.
   9
  10 * Implementation work has begun.
  11
  12 * A working filesystem with some features implemented is expected by July 2012.
  13
  14 * A fully functional filesystem with most (but not all) features is expected
  15   by the end of 2012.
  16
  17 * All elements of the filesystem have been designed except for the freemap
  18   (which isn't needed for initial work).  8MB per 2GB of filesystem
  19   storage has been reserved for the freemap.  The design of the freemap
  20   is expected to be completely speced by mid-year.
  21
  22 * This is my only project this year.  I'm not going to be doing any major
  23   kernel bug hunting this year.
  24
  25                                 Feature List
  26
  27 * Multiple roots (allowing snapshots to be mounted).  This is implemented
  28   via the super-root concept.  When mounting a HAMMER2 filesystem you specify
  29   a device path and a directory name in the super-root.
  30
  31 * HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  32   in the tree can be configured as a PFS, causing all elements recursively
  33   underneath that directory to become a part of that PFS.
  34
  35 * Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  36   show up in the super-root.  It is possible to snapshot a subdirectory
  37   and then later snapshot a parent of that subdirectory... really there are
  38   no limitations here.
  39
  40 * Directory sub-hierarchy based quotas and space and inode usage tracking.
  41   Any directory sub-tree, whether at a mount point or not, tracks aggregate
  42   inode use and data space use.  This is stored in the directory inode all
  43   the way up the chain.
  44
  45 * Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  46   block-oriented and copy-on-write each blockref tracks both direct
  47   modifications to the referenced data via (modify_tid) and indirect
  48   modifications to the referenced data or any sub-tree via (mirror_tid).
  49   This makes it possible to do an incremental scan of meta-data that covers
  50   only changes made since the mirror_tid recorded in a prior-run.
  51
  52   This feature is also intended to be used to locate recently allocated
  53   blocks and thus be able to fixup the freemap after a crash.
  54
  55   HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  56   that HAMMER2 does not keep track of 'deleted' records.  Instead any
  57   recursion by the mirroring code which finds that (modify_tid) has
  58   been updated must also send the direct block table or indirect block
  59   table state it winds up recursing through so the target can check
  60   similar key ranges and locate elements to be deleted.  This can be
  61   avoided if the mirroring stream is mostly caught up in that very recent
  62   deletions will be cached in memory and can be queried, allowing shorter
  63   record deletions to be passed in the stream instead.
  64
  65 * Will support multiple compression algorithms configured on subdirectory
  66   tree basis and on a file basis.  Up to 64K block compression will be used.
  67   Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1,
  68   4:1, 8:1, etc) will work in this scheme because physical block allocations
  69   in HAMMER2 are always power-of-2.
  70
  71   Compression algorithm #0 will mean no compression and no zero-checking.
  72   Compression algorithm #1 will mean zero-checking but no other compression.
  73   Real compression will be supported starting with algorithm 2.
  74
  75 * Zero detection on write (writing all-zeros), which requires the data
  76   buffer to be scanned, will be supported as compression algorithm #1.
  77   This allows the writing of 0's to create holes and will be the default
  78   compression algorithm for HAMMER2.
  79
  80 * Copies support for redundancy.  Each copy has its own blockref.  The
  81   blockrefs representing the copies must exist within the same blockset
  82   (set of 8 blockrefs), though I may relax this requirement in the
  83   implementation.
  84
  85   The design is such that the filesystem should be able to function at
  86   full speed even if disks are pulled or inserted, as long as at least one
  87   good copy is present.  A background task will be needed to resynchronize
  88   missing copies (or remove excessive copies in the case where the copies
  89   value is reduced on a live filesystem).
  90
  91   Copies are specified using the same copyinfo[] array that is used to
  92   specify cluster interconnections for PFS's.
  93
  94 * Clusterable with MESI cache coherency and dynamic granularity.
  95   The media format for HAMMER1 was less condusive to logical clustering
  96   than I had hoped so I was never able to get that aspect of my personal goals
  97   working with HAMMER1.  HAMMER2 effectively solves the issues that cropped
  98   up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical
  99   file/directory hierarchy, making cache coherency very difficult).
 100
 101 * Hardlinks will be supported.  All other standard features will be supported
 102   too of course.  Hardlinks in this sort of filesystem require significant
 103   work.
 104
 105 * The media blockref structure is now large enough to support up to a 192-bit
 106   check value, which would typically be a cryptographic hash of some sort.
 107   Multiple check value algorithms will be supported with the default being
 108   a simple 32-bit iSCSI CRC.
 109
 110 * Fully verified deduplication will be supported and automatic (and
 111   necessary in many respects).
 112
 113 * Non-verified de-duplication will be supported as a configurable option on
 114   a file or subdirectory tree.  Non-verified deduplication would use the
 115   largest available check code (192 bits) and not bother to verify data
 116   matches during the dedup pass, which is necessary on extremely large
 117   filesystems with a great deal of deduplicable data (as otherwise a large
 118   chunk of the media would have to be read to implement the dedup).
 119
 120   This feature is intended only for those files where occassional corruption
 121   is ok, such as in a large data store of farmed web content.
 122
 123                                 GENERAL DESIGN
 124
 125 HAMMER2 generally implements a copy-on-write block design for the filesystem,
 126 which is very different from HAMMER1's B-Tree design.  Because the design
 127 is copy-on-write it can be trivially snapshotted simply by referencing an
 128 existing block, and because the media structures logically match a standard
 129 filesystem directory/file hierarchy snapshots and other similar operations
 130 can be trivially performed on an entire subdirectory tree at any level in
 131 the filesystem.
 132
 133 The copy-on-write nature of the filesystem implies that any modification
 134 whatsoever will have to eventually synchronize new disk blocks all the way
 135 to the super-root of the filesystem and the volume header itself.  This forms
 136 the basis for crash recovery.  All disk writes are to new blocks except for
 137 the volume header, thus allowing all writes to run concurrently except for
 138 the volume header update at the end.
 139
 140 Clearly this method requires intermediate modifications to the chain to be
 141 cached so multiple modifications can be aggregated prior to being
 142 synchronized.  One advantage, however, is that the cache can be flushed at
 143 any time WITHOUT having to allocate yet another new block when further
 144 modifications are made as long as the volume header has not yet been flushed.
 145 This means that buffer cache overhead is very well bounded and can handle
 146 filesystem operations of any complexity even on boxes with very small amounts
 147 of physical memory.
 148
 149 I intend to implement a shortcut to make fsync()'s run fast, and that is to
 150 allow deep updates to blockrefs to shortcut to auxillary space in the
 151 volume header to satisfy the fsync requirement.  The related blockref is
 152 then recorded when the filesystem is mounted after a crash and the update
 153 chain is reconstituted when a matching blockref is encountered again during
 154 normal operation of the filesystem.
 155
 156 Basically this means that no real work needs to be done at mount-time
 157 even after a crash.
 158
 159 Directories are hashed, and another major design element is that directory
 160 entries ARE INODES.  They are one and the same.  In addition to directory
 161 entries being inodes the data for very small files (512 bytes or smaller)
 162 can be directly embedded in the inode (overloaded onto the same space that
 163 the direct blockref array uses).  This should result in very high
 164 performance.
 165
 166 Inode numbers are not spatially referenced, which complicates NFS servers
 167 but doesn't complicate anything else.  The inode number is stored in the
 168 inode itself, an absolutely necessary feature in order to support the
 169 hugely flexible snapshots that we want to have in HAMMER2.
 170
 171                                   HARDLINKS
 172
 173 Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
 174 a spatial reference to the inode number.  We do not want to have to have
 175 an index of inode numbers for any basic HAMMER2 feature if we can help it.
 176
 177 Hardlinks are handled by placing the inode for a multiply-hardlinked file
 178 in the closest common parent directory.  If "a/x" and "a/y" are hardlinked
 179 the inode for the hardlinked file will be placed in directory "a", e.g.
 180 "a/3239944", but it will be invisible and will be in an out-of-band namespace.
 181 The directory entries "a/x" and "a/y" will be given the same inode number
 182 but in fact just be placemarks that cause HAMMER2 to recurse upwards through
 183 the directory tree to find the invisible inode number.
 184
 185 Because directories are hashed and a different namespace (hash key range)
 186 is used for hardlinked inodes, standard directory scans are able to trivially
 187 skip this invisible namespace and inode-specific lookups can restrict their
 188 lookup to within this space.
 189
 190 The nature of snapshotting makes handling link-count 2->1 and 1->2 cases
 191 trivial.  Basically the inode media structure is copied as needed to break-up
 192 or re-form the standard directory entry/inode.  There are no backpointers in
 193 HAMMER2 and no reference counts on the blocks (see FREEMAP NOTES below), so
 194 it is an utterly trivial operation.
 195
 196                                 FREEMAP NOTES
 197
 198 In order to implement fast snapshots (and writable snapshots for that
 199 matter), HAMMER2 does NOT ref-count allocations.  The freemap which
 200 is still under design just won't do that.  All the freemap does is
 201 keep track of 100% free blocks.
 202
 203 This not only trivializes all the snapshot features it also trivializes
 204 hardlink handling and solves the problem of keeping the freemap sychronized
 205 in the event of a crash.  Now all we have to do after a crash is make
 206 sure blocks allocated before the freemap was flushed are properly
 207 marked as allocated in the allocmap.  This is a trivial exercise using the
 208 same algorithm the mirror streaming code uses (which is very similar to
 209 HAMMER1)... an incremental meta-data scan that covers only the blocks that
 210 might have been allocated between the last allocation map sync and now.
 211
 212 Thus the freemap does not have to be synchronized during a fsync().
 213
 214 The complexity is in figuring out what can be freed... that is, when one
 215 can mark blocks in the freemap as being free.  HAMMER2 implements this as
 216 a background task which essentially must scan available meta-data to
 217 determine which blocks are not being referenced.
 218
 219 Part of the ongoing design work is finding ways to reduce the scope of this
 220 meta-data scan so the entire filesystem's meta-data does not need to be
 221 scanned (though in tests with HAMMER1, even full meta-data scans have
 222 turned out to be fairly low cost).  In other words, its an area that we
 223 can continue to improve on as the filesystem matures.  Not only that, but
 224 we can completely change the freemap algorithms without creating
 225 incompatibilities (at worse simply having to require that a R+W mount do
 226 a full meta-data scan when upgrading or downgrading the freemap algorithm).
 227
 228                                   CLUSTERING
 229
 230 Clustering, as always, is the most difficult bit but we have some advantages
 231 with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
 232 structures generally follow the kernel's filesystem hiearchy.  Second,
 233 HAMMER2's writable snapshots make it possible to implement several forms
 234 of multi-master clustering.
 235
 236 The mount device path you specify serves to bootstrap your entry into
 237 the cluster.  This can be local media or directly specify a network
 238 cluster connection (or several).  When a local media mount is used the
 239 volume header is scanned for local copies and the best volume header is
 240 selected from all available copies.  Multiple devices may be specified for
 241 redundancy.
 242
 243 The volume header on local media also contains cluster connection
 244 specifications keyed by super-root pfsid.  Network connections are
 245 maintained to all targets.  ALL ELEMENTS ARE TREATED ACCORDING TO TYPE
 246 NO MATTER WHICH ONE YOU MOUNT FROM.
 247
 248 The actual networked cluster may be far larger than the elements you list
 249 in the hammer2_copy_data[] array, but your machine will only make direct
 250 connections as specified by the array.
 251
 252 In the simplest case you simply network a few machines together as ring 0
 253 masters and each client connects directly to all the masters (and/or are
 254 the masters themselves).  Thus any quorum operation is straight-forward.
 255 These master nodes are labeled 'ring 0'.
 256
 257 If you have too many clients to reasonably connect directly you set up
 258 sub-clusters as satellites.  This is called 'ring 1'.  Ring 1 may contain
 259 several sub-clusters.  A client then connects to all the nodes in a
 260 particular sub-cluster (typically 3).  The quorum protocol runs as per
 261 normal except that once the operation is resolved against the sub-cluster
 262 an aggregation must be resolved against the master nodes (ring 0).  The
 263 sub-cluster does this for the client... all the client sees is the normal
 264 quorum operation against the sub-cluster.
 265
 266 Since each node in the sub-cluster connects to all master nodes we get
 267 a multiplication.  If we set a reasonable upper limit of, say, 256
 268 connections at each master node then ring 1 may contain 85 sub-clusters x 3
 269 nodes in each sub-cluster.
 270
 271 In the most complex case when one wishes to support potentially millions
 272 of clients then further fan-out is required into ring 2, ring 3, and
 273 so forth.  However, each sub-cluster in ring 2 must only connect to
 274 1 sub-cluster in ring 1 (otherwise the cache state will become mightily
 275 confused).  Using reasonable metrics this will allow ring 2 to contain
 276 85 * 85 = 7225 sub-clusters.  At this point you could have 1000 clients
 277 connect to each sub-cluster and support 7.2 million clients, but if that
 278 isn't enough going to another ring will support 61M clients, and so forth.
 279
 280 Each ring imposes additional latencies for cache operations but the key
 281 to making this work efficiently is that the satellite clusters can negotiate
 282 coarse-grained cache coherency locks with the next lower ring and then
 283 fan-out finer-grained locks to the next higher ring.  Since caching can
 284 occur anywhere (including on the connecting client), it is the cache
 285 coherency lock that ultimately dictates efficiency and allows a client
 286 (or satellite) to access large amoutns of data from local storage.
 287
 288 Modifying operations, particularly commits, also have higher latencies
 289 when multiple rings are in use.  In this situation it is possible to
 290 short-cut localized operations by having competing clients connect to
 291 to sub-clusters which are near each other topologically... having the
 292 competing clients connect to the same sub-cluster would be the most optimal.
 293
 294 In addition, sub-clusters (typically in ring 1) can act in SOFT_MASTER mode
 295 which allows the sub-cluster to acknowledge a full commit within its own
 296 quorum only, and then resolve asynchronously to the masters in ring 0.
 297
 298 The nodes in these intermediate rings can be pure proxies with only memory
 299 caches, use local media for persistent cache, or use local media to
 300 completely slave the filesystem.
 301
 302     ADMIN       - Media does not participate, administrative proxy only
 303     CLIENT      - Media does not participate, client only
 304     CACHE       - Media only acts as a persistent cache
 305     COPY        - Media only acts as a local copy
 306     SLAVE       - Media is a RO slave that can be mounted RW
 307
 308     SOFT_SLAVE  - This is a SLAVE which can become writable when
 309                   the quorum is not available, but is not guaranteed
 310                   to be able to be merged back when the quorum becomes
 311                   available again.  Elements which cannot be merged
 312                   back remain localized and writable until manual
 313                   or scripted intervention recombines them.
 314
 315     SOFT_MASTER - Similar to the above but can form a sub-cluster
 316                   and run the quorum protocol within the sub-cluster
 317                   to serve machines that connect to the sub-cluster
 318                   when the master cluster is not available.
 319
 320                   The SOFT_MASTER nodes in a sub-cluster must be
 321                   fully interconnected with each other.
 322
 323     MASTER      - This is a MASTER node in the quorum protocol.
 324
 325                   The MASTER nodes in a cluster must be fully
 326                   interconnected with each other.
 327
 328 There are four major protocols:
 329
 330     Quorum protocol
 331
 332         This protocol is used between MASTER nodes to vote on operations
 333         and resolve deadlocks.
 334
 335         This protocol is used between SOFT_MASTER nodes in a sub-cluster
 336         to vote on operations, resolve deadlocks, determine what the latest
 337         transaction id for an element is, and to perform commits.
 338
 339     Cache sub-protocol
 340
 341         This is the MESI sub-protocol which runs under the Quorum
 342         protocol.  This protocol is used to maintain cache state for
 343         sub-trees to ensure that operations remain cache coherent.
 344
 345         Depending on administrative rights this protocol may or may
 346         not allow a leaf node in the cluster to hold a cache element
 347         indefinitely.  The administrative controller may preemptively
 348         downgrade a leaf with insufficient administrative rights
 349         without giving it a chance to synchronize any modified state
 350         back to the cluster.
 351
 352     Proxy protocol
 353
 354         The Quorum and Cache protocols only operate between MASTER
 355         and SOFT_MASTER nodes.  All other node types must use the
 356         Proxy protocol to perform similar actions.  This protocol
 357         differs in that proxy requests are typically sent to just
 358         one adjacent node and that node then maintains state and
 359         forwards the request or performs the required operation.
 360         When the link is lost to the proxy, the proxy automatically
 361         forwards a deletion of the state to the other nodes based on
 362         what it has recorded.
 363
 364         If a leaf has insufficient administrative rights it may not
 365         be allowed to actually initiate a quorum operation and may only
 366         be allowed to maintain partial MESI cache state or perhaps none
 367         at all (since cache state can block other machines in the
 368         cluster).  Instead a leaf with insufficient rights will have to
 369         make due with a preemptive loss of cache state and any allowed
 370         modifying operations will have to be forwarded to the proxy which
 371         continues forwarding it until a node with sufficient administrative
 372         rights is encountered.
 373
 374         To reduce issues and give the cluster more breath, sub-clusters
 375         made up of SOFT_MASTERs can be formed in order to provide full
 376         cache coherent within a subset of machines and yet still tie them
 377         into a greater cluster that they normally would not have such
 378         access to.  This effectively makes it possible to create a two
 379         or three-tier fan-out of groups of machines which are cache-coherent
 380         within the group, but perhaps not between groups, and use other
 381         means to synchronize between the groups.
 382
 383     Media protocol
 384
 385         This is basically the physical media protocol.
 386
 387 There are lots of ways to implement multi-master environments using the
 388 above core features but the implementation is going to be fairly complex
 389 even with HAMMER2's feature set.
 390
 391 Keep in mind that modifications propagate all the way to the super-root
 392 and volume header, so in any clustered arrangement the use of (modify_tid)
 393 and (mirror_tid) is critical in determining the synchronization state of
 394 portion(s) of the filesystem.
 395
 396 Specifically, since any modification propagates to the root the (mirror_tid)
 397 in higher level directories is going to be in a constant state of flux.  This
 398 state of flux DOES NOT invalidate the cache state for these higher levels
 399 of directories.  Instead, the (modify_tid) is used on a node-by-node basis
 400 to determine cache state at any given level, and (mirror_tid) is used to
 401 determine whether any recursively underlying state is desynchronized.
 402 The inode structure also has two additional transaction ids used to optimize
 403 path lookups, stat, and directory lookup/scan operations.