| Commit | Line | Data |
|---|---|---|
| 5a9a531c MD |
1 | |
| 2 | HAMMER2 DESIGN DOCUMENT | |
| 3 | ||
| 05af5bd1 MD |
4 | Matthew Dillon |
| 5 | 08-Feb-2012 | |
| 6 | dillon@backplane.com | |
| 7 | ||
| 5a9a531c MD |
8 | * These features have been speced in the media structures. |
| 9 | ||
| 10 | * Implementation work has begun. | |
| 11 | ||
| 12 | * A working filesystem with some features implemented is expected by July 2012. | |
| 13 | ||
| 14 | * A fully functional filesystem with most (but not all) features is expected | |
| 15 | by the end of 2012. | |
| 16 | ||
| 17 | * All elements of the filesystem have been designed except for the freemap | |
| 18 | (which isn't needed for initial work). 8MB per 2GB of filesystem | |
| 19 | storage has been reserved for the freemap. The design of the freemap | |
| 20 | is expected to be completely speced by mid-year. | |
| 21 | ||
| 22 | * This is my only project this year. I'm not going to be doing any major | |
| 23 | kernel bug hunting this year. | |
| 24 | ||
| 25 | Feature List | |
| 26 | ||
| 27 | * Multiple roots (allowing snapshots to be mounted). This is implemented | |
| 28 | via the super-root concept. When mounting a HAMMER2 filesystem you specify | |
| 29 | a device path and a directory name in the super-root. | |
| 30 | ||
| 31 | * HAMMER1 had PFS's. HAMMER2 does not. Instead, in HAMMER2 any directory | |
| 32 | in the tree can be configured as a PFS, causing all elements recursively | |
| 33 | underneath that directory to become a part of that PFS. | |
| 34 | ||
| 35 | * Writable snapshots. Any subdirectory tree can be snapshotted. Snapshots | |
| 36 | show up in the super-root. It is possible to snapshot a subdirectory | |
| 37 | and then later snapshot a parent of that subdirectory... really there are | |
| 38 | no limitations here. | |
| 39 | ||
| 40 | * Directory sub-hierarchy based quotas and space and inode usage tracking. | |
| 41 | Any directory sub-tree, whether at a mount point or not, tracks aggregate | |
| 42 | inode use and data space use. This is stored in the directory inode all | |
| 43 | the way up the chain. | |
| 44 | ||
| 45 | * Incremental queueless mirroring / mirroring-streams. Because HAMMER2 is | |
| 46 | block-oriented and copy-on-write each blockref tracks both direct | |
| 47 | modifications to the referenced data via (modify_tid) and indirect | |
| 48 | modifications to the referenced data or any sub-tree via (mirror_tid). | |
| 49 | This makes it possible to do an incremental scan of meta-data that covers | |
| 50 | only changes made since the mirror_tid recorded in a prior-run. | |
| 51 | ||
| 52 | This feature is also intended to be used to locate recently allocated | |
| 53 | blocks and thus be able to fixup the freemap after a crash. | |
| 54 | ||
| 55 | HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in | |
| 56 | that HAMMER2 does not keep track of 'deleted' records. Instead any | |
| 57 | recursion by the mirroring code which finds that (modify_tid) has | |
| 58 | been updated must also send the direct block table or indirect block | |
| 59 | table state it winds up recursing through so the target can check | |
| 60 | similar key ranges and locate elements to be deleted. This can be | |
| 61 | avoided if the mirroring stream is mostly caught up in that very recent | |
| 62 | deletions will be cached in memory and can be queried, allowing shorter | |
| 63 | record deletions to be passed in the stream instead. | |
| 64 | ||
| 65 | * Will support multiple compression algorithms configured on subdirectory | |
| 66 | tree basis and on a file basis. Up to 64K block compression will be used. | |
| 67 | Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1, | |
| 68 | 4:1, 8:1, etc) will work in this scheme because physical block allocations | |
| 69 | in HAMMER2 are always power-of-2. | |
| 70 | ||
| 71 | Compression algorithm #0 will mean no compression and no zero-checking. | |
| 72 | Compression algorithm #1 will mean zero-checking but no other compression. | |
| 73 | Real compression will be supported starting with algorithm 2. | |
| 74 | ||
| 75 | * Zero detection on write (writing all-zeros), which requires the data | |
| 76 | buffer to be scanned, will be supported as compression algorithm #1. | |
| 77 | This allows the writing of 0's to create holes and will be the default | |
| 78 | compression algorithm for HAMMER2. | |
| 79 | ||
| 80 | * Copies support for redundancy. The media blockref structure would | |
| 81 | have become too bloated but I found a clean way to do copies using the | |
| 82 | blockset structure (which is a set of 8 fully associative blockref's). | |
| 83 | ||
| 84 | The design is such that the filesystem should be able to function at | |
| 85 | full speed even if disks are pulled or inserted, as long as at least one | |
| 86 | good copy is present. A background task will be needed to resynchronize | |
| 87 | missing copies (or remove excessive copies in the case where the copies | |
| 88 | value is reduced on a live filesystem). | |
| 89 | ||
| 2910a90c MD |
90 | * Clusterable with MESI cache coherency and dynamic granularity. |
| 91 | The media format for HAMMER1 was less condusive to logical clustering | |
| 92 | than I had hoped so I was never able to get that aspect of my personal goals | |
| 5a9a531c MD |
93 | working with HAMMER1. HAMMER2 effectively solves the issues that cropped |
| 94 | up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the logical | |
| 95 | file/directory hierarchy, making cache coherency very difficult). | |
| 96 | ||
| 97 | * Hardlinks will be supported. All other standard features will be supported | |
| 98 | too of course. Hardlinks in this sort of filesystem require significant | |
| 99 | work. | |
| 100 | ||
| 101 | * The media blockref structure is now large enough to support up to a 192-bit | |
| 102 | check value, which would typically be a cryptographic hash of some sort. | |
| 103 | Multiple check value algorithms will be supported with the default being | |
| 104 | a simple 32-bit iSCSI CRC. | |
| 105 | ||
| 106 | * Fully verified deduplication will be supported and automatic (and | |
| 107 | necessary in many respects). | |
| 108 | ||
| 109 | * Non-verified de-duplication will be supported as a configurable option on | |
| 110 | a file or subdirectory tree. Non-verified deduplication would use the | |
| 111 | largest available check code (192 bits) and not bother to verify data | |
| 112 | matches during the dedup pass, which is necessary on extremely large | |
| 113 | filesystems with a great deal of deduplicable data (as otherwise a large | |
| 114 | chunk of the media would have to be read to implement the dedup). | |
| 115 | ||
| 116 | This feature is intended only for those files where occassional corruption | |
| 117 | is ok, such as in a large data store of farmed web content. | |
| 118 | ||
| 119 | GENERAL DESIGN | |
| 120 | ||
| 121 | HAMMER2 generally implements a copy-on-write block design for the filesystem, | |
| 122 | which is very different from HAMMER1's B-Tree design. Because the design | |
| 123 | is copy-on-write it can be trivially snapshotted simply by referencing an | |
| 124 | existing block, and because the media structures logically match a standard | |
| 125 | filesystem directory/file hierarchy snapshots and other similar operations | |
| 126 | can be trivially performed on an entire subdirectory tree at any level in | |
| 127 | the filesystem. | |
| 128 | ||
| 129 | The copy-on-write nature of the filesystem implies that any modification | |
| 130 | whatsoever will have to eventually synchronize new disk blocks all the way | |
| 131 | to the super-root of the filesystem and the volume header itself. This forms | |
| 132 | the basis for crash recovery. All disk writes are to new blocks except for | |
| 133 | the volume header, thus allowing all writes to run concurrently except for | |
| 134 | the volume header update at the end. | |
| 135 | ||
| 136 | Clearly this method requires intermediate modifications to the chain to be | |
| 137 | cached so multiple modifications can be aggregated prior to being | |
| 138 | synchronized. One advantage, however, is that the cache can be flushed at | |
| 139 | any time WITHOUT having to allocate yet another new block when further | |
| 140 | modifications are made as long as the volume header has not yet been flushed. | |
| 141 | This means that buffer cache overhead is very well bounded and can handle | |
| 142 | filesystem operations of any complexity even on boxes with very small amounts | |
| 143 | of physical memory. | |
| 144 | ||
| 145 | I intend to implement a shortcut to make fsync()'s run fast, and that is to | |
| 146 | allow deep updates to blockrefs to shortcut to auxillary space in the | |
| 147 | volume header to satisfy the fsync requirement. The related blockref is | |
| 148 | then recorded when the filesystem is mounted after a crash and the update | |
| 149 | chain is reconstituted when a matching blockref is encountered again during | |
| 150 | normal operation of the filesystem. | |
| 151 | ||
| 152 | Basically this means that no real work needs to be done at mount-time | |
| 153 | even after a crash. | |
| 154 | ||
| 155 | Directories are hashed, and another major design element is that directory | |
| 156 | entries ARE INODES. They are one and the same. In addition to directory | |
| 157 | entries being inodes the data for very small files (512 bytes or smaller) | |
| 158 | can be directly embedded in the inode (overloaded onto the same space that | |
| 159 | the direct blockref array uses). This should result in very high | |
| 160 | performance. | |
| 161 | ||
| 162 | Inode numbers are not spatially referenced, which complicates NFS servers | |
| 163 | but doesn't complicate anything else. The inode number is stored in the | |
| 164 | inode itself, an absolutely necessary feature in order to support the | |
| 165 | hugely flexible snapshots that we want to have in HAMMER2. | |
| 166 | ||
| 167 | HARDLINKS | |
| 168 | ||
| 169 | Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of | |
| 170 | a spatial reference to the inode number. We do not want to have to have | |
| 171 | an index of inode numbers for any basic HAMMER2 feature if we can help it. | |
| 172 | ||
| 173 | Hardlinks are handled by placing the inode for a multiply-hardlinked file | |
| 174 | in the closest common parent directory. If "a/x" and "a/y" are hardlinked | |
| 175 | the inode for the hardlinked file will be placed in directory "a", e.g. | |
| 176 | "a/3239944", but it will be invisible and will be in an out-of-band namespace. | |
| 177 | The directory entries "a/x" and "a/y" will be given the same inode number | |
| 178 | but in fact just be placemarks that cause HAMMER2 to recurse upwards through | |
| 179 | the directory tree to find the invisible inode number. | |
| 180 | ||
| 181 | Because directories are hashed and a different namespace (hash key range) | |
| 182 | is used for hardlinked inodes, standard directory scans are able to trivially | |
| 183 | skip this invisible namespace and inode-specific lookups can restrict their | |
| 184 | lookup to within this space. | |
| 185 | ||
| 186 | The nature of snapshotting makes handling link-count 2->1 and 1->2 cases | |
| 187 | trivial. Basically the inode media structure is copied as needed to break-up | |
| 188 | or re-form the standard directory entry/inode. There are no backpointers in | |
| 189 | HAMMER2 and no reference counts on the blocks (see FREEMAP NOTES below), so | |
| 190 | it is an utterly trivial operation. | |
| 191 | ||
| 192 | FREEMAP NOTES | |
| 193 | ||
| 194 | In order to implement fast snapshots (and writable snapshots for that | |
| 195 | matter), HAMMER2 does NOT ref-count allocations. The freemap which | |
| 196 | is still under design just won't do that. All the freemap does is | |
| 197 | keep track of 100% free blocks. | |
| 198 | ||
| 199 | This not only trivializes all the snapshot features it also trivializes | |
| 200 | hardlink handling and solves the problem of keeping the freemap sychronized | |
| 201 | in the event of a crash. Now all we have to do after a crash is make | |
| 202 | sure blocks allocated before the freemap was flushed are properly | |
| 203 | marked as allocated in the allocmap. This is a trivial exercise using the | |
| 204 | same algorithm the mirror streaming code uses (which is very similar to | |
| 205 | HAMMER1)... an incremental meta-data scan that covers only the blocks that | |
| 206 | might have been allocated between the last allocation map sync and now. | |
| 207 | ||
| 208 | Thus the freemap does not have to be synchronized during a fsync(). | |
| 209 | ||
| 210 | The complexity is in figuring out what can be freed... that is, when one | |
| 211 | can mark blocks in the freemap as being free. HAMMER2 implements this as | |
| 212 | a background task which essentially must scan available meta-data to | |
| 213 | determine which blocks are not being referenced. | |
| 214 | ||
| 215 | Part of the ongoing design work is finding ways to reduce the scope of this | |
| 216 | meta-data scan so the entire filesystem's meta-data does not need to be | |
| 217 | scanned (though in tests with HAMMER1, even full meta-data scans have | |
| 218 | turned out to be fairly low cost). In other words, its an area that we | |
| 219 | can continue to improve on as the filesystem matures. Not only that, but | |
| 220 | we can completely change the freemap algorithms without creating | |
| 221 | incompatibilities (at worse simply having to require that a R+W mount do | |
| 222 | a full meta-data scan when upgrading or downgrading the freemap algorithm). | |
| 223 | ||
| 224 | CLUSTERING | |
| 225 | ||
| 226 | Clustering, as always, is the most difficult bit but we have some advantages | |
| 227 | with HAMMER2 that we did not have with HAMMER1. First, HAMMER2's media | |
| 228 | structures generally follow the kernel's filesystem hiearchy. Second, | |
| 229 | HAMMER2's writable snapshots make it possible to implement several forms | |
| 230 | of multi-master clustering. | |
| 231 | ||
| 2910a90c MD |
232 | This is important: The mount device path you specify serves to bootstrap |
| 233 | your entry into the cluster, but your mount will make active connections | |
| 234 | to ALL copy elements in the hammer2_copy_data[] array (stored in the volume | |
| 235 | header) which match the PFSID of the directory in the super-root that you | |
| 236 | specified. The local media path does not have to be mentioned in this | |
| 237 | array but becomes part of the cluster based on its type and access | |
| 238 | rights. ALL ELEMENTS ARE TREATED ACCORDING TO TYPE NO MATTER WHICH ONE | |
| 239 | YOU MOUNT FROM. | |
| 240 | ||
| 241 | The actual cluster may be far larger than the elements you list in the | |
| 242 | hammer2_copy_data[] array. You list only the elements you wish to | |
| 243 | directly connect to and you are able to access the rest of the cluster | |
| 244 | indirectly through those connections. | |
| 245 | ||
| 246 | All nodes in the cluster may act as administrative proxies. All nodes | |
| 247 | in the cluster, including your mount point, are classified as one of the | |
| 248 | following as specified in the inode's structure: | |
| 249 | ||
| 250 | ADMIN - Media does not participate, administrative proxy only | |
| 251 | CACHE - Media only acts as a persistent cache | |
| 252 | COPY - Media only acts as a local copy | |
| 253 | SLAVE - Media is a RO slave that can be mounted RW | |
| 254 | ||
| 255 | SOFT_SLAVE - This is a SLAVE which can become writable when | |
| 256 | the quorum is not available, but is not guaranteed | |
| 257 | to be able to be merged back when the quorum becomes | |
| 258 | available again. Elements which cannot be merged | |
| 259 | back remain localized and writable until manual | |
| 260 | or scripted intervention recombines them. | |
| 261 | ||
| 262 | SOFT_MASTER - Similar to the above but can form a sub-cluster | |
| 263 | and run the quorum protocol within the sub-cluster | |
| 264 | to serve machines that connect to the sub-cluster | |
| 265 | when the master cluster is not available. | |
| 266 | ||
| 267 | The SOFT_MASTER nodes in a sub-cluster must be | |
| 268 | fully interconnected with each other. | |
| 269 | ||
| 270 | MASTER - This is a MASTER node in the quorum protocol. | |
| 271 | ||
| 272 | The MASTER nodes in a cluster must be fully | |
| 273 | interconnected with each other. | |
| 274 | ||
| 275 | There are four major protocols: | |
| 276 | ||
| 277 | Quorum protocol | |
| 278 | ||
| 279 | This protocol is used between MASTER nodes to vote on operations | |
| 280 | and resolve deadlocks. | |
| 281 | ||
| 282 | This protocol is used between SOFT_MASTER nodes in a sub-cluster | |
| 283 | to vote on operations, resolve deadlocks, determine what the latest | |
| 284 | transaction id for an element is, and to perform commits. | |
| 285 | ||
| 286 | Cache sub-protocol | |
| 287 | ||
| 288 | This is the MESI sub-protocol which runs under the Quorum | |
| 289 | protocol. This protocol is used to maintain cache state for | |
| 290 | sub-trees to ensure that operations remain cache coherent. | |
| 291 | ||
| 292 | Depending on administrative rights this protocol may or may | |
| 293 | not allow a leaf node in the cluster to hold a cache element | |
| 294 | indefinitely. The administrative controller may preemptively | |
| 295 | downgrade a leaf with insufficient administrative rights | |
| 296 | without giving it a chance to synchronize any modified state | |
| 297 | back to the cluster. | |
| 298 | ||
| 299 | Proxy protocol | |
| 300 | ||
| 301 | The Quorum and Cache protocols only operate between MASTER | |
| 302 | and SOFT_MASTER nodes. All other node types must use the | |
| 303 | Proxy protocol to perform similar actions. This protocol | |
| 304 | differs in that proxy requests are typically sent to just | |
| 305 | one adjacent node and that node then maintains state and | |
| 306 | forwards the request or performs the required operation. | |
| 307 | When the link is lost to the proxy, the proxy automatically | |
| 308 | forwards a deletion of the state to the other nodes based on | |
| 309 | what it has recorded. | |
| 310 | ||
| 311 | If a leaf has insufficient administrative rights it may not | |
| 312 | be allowed to actually initiate a quorum operation and may only | |
| 313 | be allowed to maintain partial MESI cache state or perhaps none | |
| 314 | at all (since cache state can block other machines in the | |
| 315 | cluster). Instead a leaf with insufficient rights will have to | |
| 316 | make due with a preemptive loss of cache state and any allowed | |
| 317 | modifying operations will have to be forwarded to the proxy which | |
| 318 | continues forwarding it until a node with sufficient administrative | |
| 319 | rights is encountered. | |
| 320 | ||
| 321 | To reduce issues and give the cluster more breath, sub-clusters | |
| 322 | made up of SOFT_MASTERs can be formed in order to provide full | |
| 323 | cache coherent within a subset of machines and yet still tie them | |
| 324 | into a greater cluster that they normally would not have such | |
| 325 | access to. This effectively makes it possible to create a two | |
| 326 | or three-tier fan-out of groups of machines which are cache-coherent | |
| 327 | within the group, but perhaps not between groups, and use other | |
| 328 | means to synchronize between the groups. | |
| 329 | ||
| 330 | Media protocol | |
| 331 | ||
| 332 | This is basically the physical media protocol. | |
| 5a9a531c MD |
333 | |
| 334 | There are lots of ways to implement multi-master environments using the | |
| 335 | above core features but the implementation is going to be fairly complex | |
| 336 | even with HAMMER2's feature set. | |
| 337 | ||
| 338 | Keep in mind that modifications propagate all the way to the super-root | |
| 339 | and volume header, so in any clustered arrangement the use of (modify_tid) | |
| 340 | and (mirror_tid) is critical in determining the synchronization state of | |
| 341 | portion(s) of the filesystem. | |
| 342 | ||
| 343 | Specifically, since any modification propagates to the root the (mirror_tid) | |
| 344 | in higher level directories is going to be in a constant state of flux. This | |
| 345 | state of flux DOES NOT invalidate the cache state for these higher levels | |
| 346 | of directories. Instead, the (modify_tid) is used on a node-by-node basis | |
| 347 | to determine cache state at any given level, and (mirror_tid) is used to | |
| 348 | determine whether any recursively underlying state is desynchronized. | |
| 2910a90c MD |
349 | The inode structure also has two additional transaction ids used to optimize |
| 350 | path lookups, stat, and directory lookup/scan operations. |