/*
 * Copyright (c) 2013-2015 The DragonFly Project.  All rights reserved.
 *
 * This code is derived from software contributed to The DragonFly Project
 * by Matthew Dillon <dillon@dragonflybsd.org>
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 * 3. Neither the name of The DragonFly Project nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific, prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
/*
 * The cluster module collects multiple chains representing the same
 * information from different nodes into a single entity.  It allows direct
 * access to media data as long as it is not blockref array data (which
 * will obviously have to be different at each node).
 *
 * This module also handles I/O dispatch, status rollup, and various
 * mastership arrangements including quorum operations.  It effectively
 * presents one topology to the vnops layer.
 *
 * Many of the API calls mimic chain API calls but operate on clusters
 * instead of chains.  Please see hammer2_chain.c for more complete code
 * documentation of the API functions.
 *
 * WARNING! This module is *extremely* complex.  It must issue asynchronous
 *          locks and I/O, do quorum and/or master-slave processing, and
 *          it must operate properly even if some nodes are broken (which
 *          can also mean indefinite locks).
 *
 * Cluster operations can be broken down into three pieces:
 *
 * (1) Chain locking and data retrieval.
 *              hammer2_cluster_lock()
 *              hammer2_cluster_parent()
 *
 *      - Most complex functions, quorum management on transaction ids.
 *
 *      - Locking and data accesses must be internally asynchronous.
 *
 *      - Validate and manage cache coherency primitives (cache state
 *        is stored in chain topologies but must be validated by these
 *        functions).
 *
 * (2) Lookups and Scans
 *              hammer2_cluster_lookup()
 *              hammer2_cluster_next()
 *
 *      - Depend on locking & data retrieval functions, but still complex.
 *
 *      - Must do quorum management on transaction ids.
 *
 *      - Lookup and iteration ops must be internally asynchronous.
 *
 * (3) Modifying Operations
 *              hammer2_cluster_create()
 *              hammer2_cluster_rename()
 *              hammer2_cluster_delete()
 *              hammer2_cluster_modify()
 *              hammer2_cluster_modsync()
 *
 *      - Can usually punt on failures, operation continues unless quorum
 *        is lost.  If quorum is lost, must wait for resynchronization
 *        (depending on the management mode).
 *
 *      - Must disconnect node on failures (also not flush), remount, and
 *        resynchronize.
 *
 *      - Network links (via kdmsg) are relatively easy to issue as the
 *        complex underworkings of hammer2_chain.c don't have to be messed
 *        with (the protocol is at a higher level than block-level).
 *
 *      - Multiple local disk nodes (i.e. block devices) are another matter.
 *        Chain operations have to be dispatched to per-node threads (xN)
 *        because we can't asynchronize potentially very complex chain
 *        operations in hammer2_chain.c (it would be a huge mess).
 *
 *        (these threads are also used to terminate incoming kdmsg ops from
 *        other nodes)
 *
 *      - Single-node filesystems do not use threads and will simply call
 *        hammer2_chain.c functions directly.  This short-cut is handled
 *        at the base of each cluster function.
 */
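The single-node short-cut described in the last point can be sketched in isolation. The structures and names below (`toy_chain`, `toy_cluster`, `toy_cluster_op`) are hypothetical stand-ins, not the real hammer2 types or kernel API: a one-chain cluster calls the chain-level operation directly, while a multi-chain cluster dispatches the operation to each node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for hammer2_chain_t / hammer2_cluster_t. */
struct toy_chain {
        int node_id;
        int ops;                /* how many ops this node has seen */
};

struct toy_cluster {
        int nchains;
        struct toy_chain *array[8];
};

/* Per-chain work, standing in for a hammer2_chain.c call. */
static void
toy_chain_op(struct toy_chain *chain)
{
        chain->ops++;
}

/*
 * Cluster-level op: a single-node cluster short-cuts directly to the
 * chain call; a multi-node cluster would dispatch to per-node threads,
 * modeled here as a simple loop over the chain array.
 */
static void
toy_cluster_op(struct toy_cluster *cluster)
{
        int i;

        if (cluster->nchains == 1) {
                toy_chain_op(cluster->array[0]);        /* short-cut */
                return;
        }
        for (i = 0; i < cluster->nchains; ++i) {        /* per-node dispatch */
                if (cluster->array[i])
                        toy_chain_op(cluster->array[i]);
        }
}
```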
#include <sys/cdefs.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/types.h>
#include <sys/lock.h>
#include <sys/uuid.h>
/*
 * Returns TRUE if any chain in the cluster needs to be resized.
 */
hammer2_cluster_need_resize(hammer2_cluster_t *cluster, int bytes)
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                if (chain && chain->bytes != bytes)

hammer2_cluster_type(hammer2_cluster_t *cluster)
        return(cluster->focus->bref.type);

hammer2_cluster_modified(hammer2_cluster_t *cluster)
        return((cluster->focus->flags & HAMMER2_CHAIN_MODIFIED) != 0);

/*
 * Return a bref representative of the cluster.  Any data offset is removed
 * (since it would only be applicable to a particular chain in the cluster).
 *
 * However, the radix portion of data_off is used for many purposes and will
 */
hammer2_cluster_bref(hammer2_cluster_t *cluster, hammer2_blockref_t *bref)
        *bref = cluster->focus->bref;
        bref->data_off &= HAMMER2_OFF_MASK_RADIX;
/*
 * Return non-zero if the chain representing an inode has been flagged
 * as having been unlinked.  Allows the vnode reclaim to avoid loading
 * the inode data from disk, e.g. on unmount or when recycling old, clean
 * vnodes.
 */
hammer2_cluster_isunlinked(hammer2_cluster_t *cluster)
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        flags |= chain->flags;
        return (flags & HAMMER2_CHAIN_UNLINKED);
hammer2_cluster_set_chainflags(hammer2_cluster_t *cluster, uint32_t flags)
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        atomic_set_int(&chain->flags, flags);

hammer2_cluster_clr_chainflags(hammer2_cluster_t *cluster, uint32_t flags)
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        atomic_clear_int(&chain->flags, flags);

hammer2_cluster_setflush(hammer2_trans_t *trans, hammer2_cluster_t *cluster)
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_setflush(trans, chain);

hammer2_cluster_setmethod_check(hammer2_trans_t *trans,
                                hammer2_cluster_t *cluster,
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        KKASSERT(chain->flags & HAMMER2_CHAIN_MODIFIED);
                        chain->bref.methods &= ~HAMMER2_ENC_CHECK(-1);
                        chain->bref.methods |= HAMMER2_ENC_CHECK(check_algo);
/*
 * Create a cluster with one ref from the specified chain.  The chain
 * is not further referenced.  The caller typically supplies a locked
 * chain and transfers ownership to the cluster.
 *
 * The returned cluster will be focused on the chain (strictly speaking,
 * the focus should be NULL if the chain is not locked but we do not check
 * for this condition).
 */
hammer2_cluster_from_chain(hammer2_chain_t *chain)
        hammer2_cluster_t *cluster;

        cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
        cluster->array[0].chain = chain;
        cluster->nchains = 1;
        cluster->focus = chain;
        cluster->pmp = chain->pmp;
        cluster->flags = HAMMER2_CLUSTER_LOCKED |
                         HAMMER2_CLUSTER_WRHARD |
                         HAMMER2_CLUSTER_RDHARD |
                         HAMMER2_CLUSTER_MSYNCED |
                         HAMMER2_CLUSTER_SSYNCED;
/*
 * Allocates a cluster and its underlying chain structures.  The underlying
 * chains will be locked.  The cluster and underlying chains will have one
 * ref and will be focused on the first chain.
 *
 * XXX focus on first chain.
 */
hammer2_cluster_alloc(hammer2_pfs_t *pmp,
                      hammer2_trans_t *trans, hammer2_blockref_t *bref)
        hammer2_cluster_t *cluster;
        hammer2_cluster_t *rcluster;
        hammer2_chain_t *chain;
        hammer2_chain_t *rchain;

        u_int bytes = 1U << (int)(bref->data_off & HAMMER2_OFF_MASK_RADIX);

        KKASSERT(pmp != NULL);

        /*
         * Construct the appropriate system structure.
         */
        case HAMMER2_BREF_TYPE_INODE:
        case HAMMER2_BREF_TYPE_INDIRECT:
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
        case HAMMER2_BREF_TYPE_DATA:
        case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                /*
                 * Chains are really only associated with the hmp but we
                 * maintain a pmp association for per-mount memory tracking
                 * purposes.  The pmp can be NULL.
                 */
        case HAMMER2_BREF_TYPE_VOLUME:
        case HAMMER2_BREF_TYPE_FREEMAP:
                panic("hammer2_cluster_alloc volume type illegal for op");
                panic("hammer2_cluster_alloc: unrecognized blockref type: %d",

        cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
        cluster->flags = HAMMER2_CLUSTER_LOCKED;

        rcluster = &pmp->iroot->cluster;
        for (i = 0; i < rcluster->nchains; ++i) {
                rchain = rcluster->array[i].chain;
                chain = hammer2_chain_alloc(rchain->hmp, pmp, trans, bref);
                chain->hmp = rchain->hmp;
                chain->bytes = bytes;
                chain->flags |= HAMMER2_CHAIN_ALLOCATED;

                /*
                 * NOTE: When loading a chain from backing store or creating a
                 *       snapshot, trans will be NULL and the caller is
                 *       responsible for setting these fields.
                 */
                cluster->array[i].chain = chain;
        cluster->nchains = i;
        cluster->focus = cluster->array[0].chain;
/*
 * Add a reference to a cluster.
 *
 * We must also ref the underlying chains in order to allow ref/unlock
 * sequences to later re-lock.
 */
hammer2_cluster_ref(hammer2_cluster_t *cluster)
        hammer2_chain_t *chain;

        atomic_add_int(&cluster->refs, 1);
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_ref(chain);

/*
 * Drop the caller's reference to the cluster.  When the ref count drops to
 * zero this function frees the cluster and drops all underlying chains.
 *
 * In-progress read I/Os are typically detached from the cluster once the
 * first one returns (the remaining stay attached to the DIOs but are then
 * ignored and drop naturally).
 */
hammer2_cluster_drop(hammer2_cluster_t *cluster)
        hammer2_chain_t *chain;

        KKASSERT(cluster->refs > 0);
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_drop(chain);
                        if (cluster->refs == 1)
                                cluster->array[i].chain = NULL;

        if (atomic_fetchadd_int(&cluster->refs, -1) == 1) {
                cluster->focus = NULL;          /* safety XXX chg to assert */
                kfree(cluster, M_HAMMER2);
                /* cluster is invalid */

hammer2_cluster_wait(hammer2_cluster_t *cluster)
        tsleep(cluster->focus, 0, "h2clcw", 1);
/*
 * Lock and ref a cluster.  This adds a ref to the cluster and its chains
 * and then locks them.
 *
 * The act of locking a cluster sets its focus if not already set.
 *
 * The chains making up the cluster may be narrowed down based on quorum
 * acceptability, and if RESOLVE_RDONLY is specified the chains can be
 * narrowed down to a single chain as long as the entire subtopology is known
 * to be intact.  So, for example, we can narrow a read-only op to a single
 * fast SLAVE but if we focus a CACHE chain we must still retain at least
 * a SLAVE to ensure that the subtopology can be accessed.
 *
 * RESOLVE_RDONLY operations are effectively as-of so the quorum does not need
 * to be maintained once the topology is validated as-of the top level of
 *
 * If a failure occurs the operation must be aborted by higher-level code and
 */
hammer2_cluster_lock(hammer2_cluster_t *cluster, int how)
        hammer2_chain_t *chain;
        hammer2_tid_t quorum_tid;

        KKASSERT(pmp != NULL || cluster->nchains == 0);

        /* cannot be on inode-embedded cluster template, must be on copy */
        KKASSERT((cluster->flags & HAMMER2_CLUSTER_INODE) == 0);
        if (cluster->flags & HAMMER2_CLUSTER_LOCKED) {
                kprintf("hammer2_cluster_lock: cluster %p already locked!\n",
        KKASSERT(cluster->focus == NULL);

        atomic_set_int(&cluster->flags, HAMMER2_CLUSTER_LOCKED);

        /*
         * Calculate the quorum count.
         */
        nquorum = pmp ? pmp->pfs_nmasters / 2 + 1 : 0;

        if ((how & HAMMER2_RESOLVE_NOREF) == 0)
                atomic_add_int(&cluster->refs, 1);
        /*
         * Lock chains and accumulate statistics.
         */
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                error = hammer2_chain_lock(chain, how);
                        kprintf("hammer2_cluster_lock: cluster %p index %d "
                                "lock failure\n", cluster, i);
                        cluster->array[i].chain = NULL;
                        hammer2_chain_drop(chain);

                /*
                 * We can resolve some things on this array pass but other
                 * things will require a full-accounting of available
                 */
                switch (cluster->pmp->pfs_types[i]) {
                case HAMMER2_PFSTYPE_MASTER:
                        if (quorum_tid < chain->bref.mirror_tid ||
                                quorum_tid = chain->bref.mirror_tid;
                        } else if (quorum_tid == chain->bref.mirror_tid) {
                case HAMMER2_PFSTYPE_SLAVE:
                case HAMMER2_PFSTYPE_SOFT_MASTER:
                        nflags |= HAMMER2_CLUSTER_WRSOFT;
                        nflags |= HAMMER2_CLUSTER_RDSOFT;
                case HAMMER2_PFSTYPE_SOFT_SLAVE:
                        nflags |= HAMMER2_CLUSTER_RDSOFT;
                case HAMMER2_PFSTYPE_SUPROOT:
                        /*
                         * Degenerate cluster representing the super-root
                         * topology on a single device.
                         */
                        nflags |= HAMMER2_CLUSTER_WRHARD;
                        nflags |= HAMMER2_CLUSTER_RDHARD;
                        cluster->focus = chain;

        /*
         * Pass 2 - accumulate statistics.
         */
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                switch (cluster->pmp->pfs_types[i]) {
                case HAMMER2_PFSTYPE_MASTER:
                        /*
                         * We must have enough up-to-date masters to reach
                         * a quorum and the master mirror_tid must match
                         * the quorum's mirror_tid.
                         */
                        if (nmasters >= nquorum &&
                            quorum_tid == chain->bref.mirror_tid) {
                                nflags |= HAMMER2_CLUSTER_WRHARD;
                                nflags |= HAMMER2_CLUSTER_RDHARD;
                                if (cluster->focus == NULL ||
                                    focus_pfs_type == HAMMER2_PFSTYPE_SLAVE) {
                                        focus_pfs_type = HAMMER2_PFSTYPE_MASTER;
                                        cluster->focus = chain;
                case HAMMER2_PFSTYPE_SLAVE:
                        /*
                         * We must have enough up-to-date masters to reach
                         * a quorum and the slave mirror_tid must match the
                         * quorum's mirror_tid.
                         */
                        if (nmasters >= nquorum &&
                            quorum_tid == chain->bref.mirror_tid) {
                                nflags |= HAMMER2_CLUSTER_RDHARD;
                                if (cluster->focus == NULL) {
                                        focus_pfs_type = HAMMER2_PFSTYPE_SLAVE;
                                        cluster->focus = chain;
                case HAMMER2_PFSTYPE_SOFT_MASTER:
                        /*
                         * Directly mounted soft master always wins.  There
                         * should be only one.
                         */
                        KKASSERT(focus_pfs_type != HAMMER2_PFSTYPE_SOFT_MASTER);
                        cluster->focus = chain;
                        focus_pfs_type = HAMMER2_PFSTYPE_SOFT_MASTER;
                case HAMMER2_PFSTYPE_SOFT_SLAVE:
                        /*
                         * Directly mounted soft slave always wins.  There
                         * should be only one.
                         */
                        KKASSERT(focus_pfs_type != HAMMER2_PFSTYPE_SOFT_SLAVE);
                        if (focus_pfs_type != HAMMER2_PFSTYPE_SOFT_MASTER) {
                                cluster->focus = chain;
                                focus_pfs_type = HAMMER2_PFSTYPE_SOFT_SLAVE;

        /*
         * Set SSYNCED or MSYNCED for slaves and masters respectively if
         * all available nodes (even if 0 are available) are fully
         * synchronized.  This is used by the synchronization thread to
         * determine if there is work it could potentially accomplish.
         */
        if (nslaves == ttlslaves)
                nflags |= HAMMER2_CLUSTER_SSYNCED;
        if (nmasters == ttlmasters)
                nflags |= HAMMER2_CLUSTER_MSYNCED;

        /*
         * Determine if the cluster was successfully locked for the
         * requested operation and generate an error code.  The cluster
         * will not be locked (or ref'd) if an error is returned.
         */
        atomic_set_int(&cluster->flags, nflags);
        atomic_clear_int(&cluster->flags, HAMMER2_CLUSTER_ZFLAGS & ~nflags);
        if (how & HAMMER2_RESOLVE_RDONLY) {
                if ((nflags & HAMMER2_CLUSTER_RDOK) == 0)
                if ((nflags & HAMMER2_CLUSTER_WROK) == 0)
                kprintf("hammer2_cluster_lock: cluster %p lock failed %d\n",
        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_unlock(chain);
        if ((how & HAMMER2_RESOLVE_NOREF) == 0)
                atomic_add_int(&cluster->refs, -1);
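In standalone form, the quorum arithmetic used above (`nquorum = pfs_nmasters / 2 + 1`, with masters voting via their `bref.mirror_tid`) reduces to roughly the sketch below. `quorum_ok` and its two-pass election are a hypothetical distillation, not the kernel code, which additionally tracks slaves, soft nodes, and focus selection.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical distillation of the quorum test in hammer2_cluster_lock():
 * a strict majority of masters must agree on the winning mirror_tid.
 */
static int
quorum_ok(const uint64_t *master_tids, int nmasters)
{
        int nquorum = nmasters / 2 + 1;         /* strict majority */
        uint64_t quorum_tid = 0;
        int agree = 0;
        int i;

        /* Pass 1: the highest mirror_tid wins the election. */
        for (i = 0; i < nmasters; ++i) {
                if (master_tids[i] > quorum_tid)
                        quorum_tid = master_tids[i];
        }
        /* Pass 2: count masters that are up to date with the winner. */
        for (i = 0; i < nmasters; ++i) {
                if (master_tids[i] == quorum_tid)
                        ++agree;
        }
        return (agree >= nquorum);
}
```

A three-master cluster thus stays writable with one stale node but not with two, and a single-node cluster trivially has quorum with itself.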
/*
 * Replace the contents of dst with src, adding a reference to src's chains
 * but not adding any additional locks.
 *
 * dst is assumed to already have a ref and any chains present in dst are
 * assumed to be locked and will be unlocked.
 *
 * If the chains in src are locked, only one of (src) or (dst) should be
 * considered locked by the caller after return, not both.
 */
hammer2_cluster_replace(hammer2_cluster_t *dst, hammer2_cluster_t *src)
        hammer2_chain_t *chain;
        hammer2_chain_t *tmp;

        KKASSERT(dst->refs == 1);

        for (i = 0; i < src->nchains; ++i) {
                chain = src->array[i].chain;
                        hammer2_chain_ref(chain);
                if (i < dst->nchains &&
                    (tmp = dst->array[i].chain) != NULL) {
                        hammer2_chain_unlock(tmp);
                dst->array[i].chain = chain;
                if (dst->focus == NULL)
        while (i < dst->nchains) {
                chain = dst->array[i].chain;
                        hammer2_chain_unlock(chain);
                dst->array[i].chain = NULL;
        dst->nchains = src->nchains;
/*
 * Replace the contents of the locked destination with the contents of the
 * locked source.  The destination must have one ref.
 *
 * Returns with the destination still with one ref and the copied chains
 * with an additional lock (representing their state on the destination).
 * The original chains associated with the destination are unlocked.
 *
 * From the point of view of the caller, both src and dst are locked on
 * call and remain locked on return.
 *
 * XXX adjust flag state
 */
hammer2_cluster_replace_locked(hammer2_cluster_t *dst, hammer2_cluster_t *src)
        hammer2_chain_t *chain;
        hammer2_chain_t *tmp;

        KKASSERT(dst->refs == 1);

        for (i = 0; i < src->nchains; ++i) {
                chain = src->array[i].chain;
                        hammer2_chain_lock(chain, 0);
                if (i < dst->nchains &&
                    (tmp = dst->array[i].chain) != NULL) {
                        hammer2_chain_unlock(tmp);
                dst->array[i].chain = chain;
        while (i < dst->nchains) {
                chain = dst->array[i].chain;
                        hammer2_chain_unlock(chain);
                dst->array[i].chain = NULL;
        dst->nchains = src->nchains;
        dst->flags = src->flags;
        dst->focus = src->focus;
/*
 * Copy a cluster, returning a ref'd cluster.  All underlying chains
 * are also ref'd, but not locked.
 *
 * The cluster focus is not set because the cluster is not yet locked
 * (and the originating cluster does not have to be locked either).
 */
hammer2_cluster_copy(hammer2_cluster_t *ocluster)
        hammer2_pfs_t *pmp = ocluster->pmp;
        hammer2_cluster_t *ncluster;
        hammer2_chain_t *chain;

        ncluster = kmalloc(sizeof(*ncluster), M_HAMMER2, M_WAITOK | M_ZERO);
        ncluster->nchains = ocluster->nchains;
        ncluster->flags = 0;                    /* cluster not locked */

        for (i = 0; i < ocluster->nchains; ++i) {
                chain = ocluster->array[i].chain;
                ncluster->array[i].chain = chain;
                        hammer2_chain_ref(chain);
/*
 * Unlock and deref a cluster.  The cluster is destroyed if this is the
 * last ref.
 */
hammer2_cluster_unlock(hammer2_cluster_t *cluster)
        hammer2_chain_t *chain;

        if ((cluster->flags & HAMMER2_CLUSTER_LOCKED) == 0) {
                kprintf("hammer2_cluster_unlock: cluster %p not locked\n",
        /* KKASSERT(cluster->flags & HAMMER2_CLUSTER_LOCKED); */
        KKASSERT(cluster->refs > 0);
        atomic_clear_int(&cluster->flags, HAMMER2_CLUSTER_LOCKED);

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_unlock(chain);
                        if (cluster->refs == 1)
                                cluster->array[i].chain = NULL; /* safety */
        cluster->focus = NULL;

        if (atomic_fetchadd_int(&cluster->refs, -1) == 1) {
                kfree(cluster, M_HAMMER2);
                /* cluster = NULL; safety */
/*
 * Resize the cluster's physical storage allocation in-place.  This may
 * replace the cluster's chains.
 */
hammer2_cluster_resize(hammer2_trans_t *trans, hammer2_inode_t *ip,
                       hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
                       int nradix, int flags)
        hammer2_chain_t *chain;

        KKASSERT(cparent->pmp == cluster->pmp);         /* can be NULL */
        KKASSERT(cparent->nchains == cluster->nchains);

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        KKASSERT(cparent->array[i].chain);
                        hammer2_chain_resize(trans, ip,
                                             cparent->array[i].chain, chain,
/*
 * Set an inode's cluster modified, marking the related chains RW and
 * duplicating them if necessary.
 *
 * The passed-in chain is a localized copy of the chain previously acquired
 * when the inode was locked (and possibly replaced in the meantime), and
 * must also be updated.  In fact, we update it first and then synchronize
 * the inode's cluster cache.
 */
hammer2_inode_data_t *
hammer2_cluster_modify_ip(hammer2_trans_t *trans, hammer2_inode_t *ip,
                          hammer2_cluster_t *cluster, int flags)
        atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
        hammer2_cluster_modify(trans, cluster, flags);
        hammer2_inode_repoint(ip, NULL, cluster);
        return (&hammer2_cluster_wdata(cluster)->ipdata);
/*
 * Adjust the cluster's chains to allow modification and adjust the
 * focus.  Data will be accessible on return.
 */
hammer2_cluster_modify(hammer2_trans_t *trans, hammer2_cluster_t *cluster,
        hammer2_chain_t *chain;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        hammer2_chain_modify(trans, chain, flags);
/*
 * Synchronize modifications from the focus to other chains in a cluster.
 * Convenient because nominal API users can just modify the contents of the
 * focus (at least for non-blockref data).
 *
 * Nominal front-end operations only edit non-block-table data in a single
 * chain.  This code copies such modifications to the other chains in the
 * cluster.  Blocktable modifications are handled on a chain-by-chain basis
 * by both the frontend and the backend and will explode in fireworks if
 */
hammer2_cluster_modsync(hammer2_cluster_t *cluster)
        hammer2_chain_t *focus;
        hammer2_chain_t *scan;
        const hammer2_inode_data_t *ripdata;
        hammer2_inode_data_t *wipdata;

        focus = cluster->focus;
        KKASSERT(focus->flags & HAMMER2_CHAIN_MODIFIED);

        for (i = 0; i < cluster->nchains; ++i) {
                scan = cluster->array[i].chain;
                if (scan == NULL || scan == focus)
                KKASSERT(scan->flags & HAMMER2_CHAIN_MODIFIED);
                KKASSERT(focus->bytes == scan->bytes &&
                         focus->bref.type == scan->bref.type);
                switch(focus->bref.type) {
                case HAMMER2_BREF_TYPE_INODE:
                        ripdata = &focus->data->ipdata;
                        wipdata = &scan->data->ipdata;
                        if ((ripdata->op_flags &
                             HAMMER2_OPFLAG_DIRECTDATA) == 0) {
                                bcopy(ripdata, wipdata,
                                      offsetof(hammer2_inode_data_t, u));
                        /* fall through to full copy */
                case HAMMER2_BREF_TYPE_DATA:
                        bcopy(focus->data, scan->data, focus->bytes);
                case HAMMER2_BREF_TYPE_FREEMAP_NODE:
                case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
                case HAMMER2_BREF_TYPE_FREEMAP:
                case HAMMER2_BREF_TYPE_VOLUME:
                        panic("hammer2_cluster_modsync: illegal node type");
                        panic("hammer2_cluster_modsync: unknown node type");
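Stripped of the chain bookkeeping, the replication modsync performs amounts to copying the focus node's buffer over each sibling's buffer. `toy_modsync` and its fixed-size buffers below are hypothetical stand-ins for the chain data, not the kernel structures:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the copy loop in hammer2_cluster_modsync(). */
#define TOY_NCHAINS 3
#define TOY_BYTES   8

/*
 * Replicate the focus node's (non-blocktable) data to every other node
 * in the cluster; the focus itself already holds the modification.
 */
static void
toy_modsync(char bufs[TOY_NCHAINS][TOY_BYTES], int focus)
{
        int i;

        for (i = 0; i < TOY_NCHAINS; ++i) {
                if (i == focus)
                        continue;       /* skip the already-modified focus */
                memcpy(bufs[i], bufs[focus], TOY_BYTES);
        }
}
```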
/*
 * Lookup initialization/completion API
 */
hammer2_cluster_lookup_init(hammer2_cluster_t *cparent, int flags)
        hammer2_cluster_t *cluster;

        cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
        cluster->pmp = cparent->pmp;            /* can be NULL */
        cluster->flags = 0;                     /* cluster not locked (yet) */
        /* cluster->focus = NULL; already null */

        for (i = 0; i < cparent->nchains; ++i)
                cluster->array[i].chain = cparent->array[i].chain;
        cluster->nchains = cparent->nchains;

        /*
         * Independently lock (this will also give cluster 1 ref)
         */
        if (flags & HAMMER2_LOOKUP_SHARED) {
                hammer2_cluster_lock(cluster, HAMMER2_RESOLVE_ALWAYS |
                                              HAMMER2_RESOLVE_SHARED);
        } else {
                hammer2_cluster_lock(cluster, HAMMER2_RESOLVE_ALWAYS);
        }

hammer2_cluster_lookup_done(hammer2_cluster_t *cparent)
        hammer2_cluster_unlock(cparent);
/*
 * Locate first match or overlap under parent, return a new cluster
 */
hammer2_cluster_lookup(hammer2_cluster_t *cparent, hammer2_key_t *key_nextp,
                       hammer2_key_t key_beg, hammer2_key_t key_end, int flags)
        hammer2_cluster_t *cluster;
        hammer2_chain_t *chain;
        hammer2_key_t key_accum;
        hammer2_key_t key_next;
        hammer2_key_t bref_key;

        pmp = cparent->pmp;                     /* can be NULL */
        key_accum = *key_nextp;

        cluster = kmalloc(sizeof(*cluster), M_HAMMER2, M_WAITOK | M_ZERO);
        cluster->pmp = pmp;                     /* can be NULL */
        if ((flags & HAMMER2_LOOKUP_NOLOCK) == 0)
                cluster->flags |= HAMMER2_CLUSTER_LOCKED;

        for (i = 0; i < cparent->nchains; ++i) {
                key_next = *key_nextp;
                if (cparent->array[i].chain == NULL) {
                chain = hammer2_chain_lookup(&cparent->array[i].chain,
                                             &cparent->array[i].cache_index,
                cluster->array[i].chain = chain;
                if (chain == NULL) {
                        int ddflag = (chain->bref.type ==
                                      HAMMER2_BREF_TYPE_INODE);

                        /*
                         * Set default focus.
                         */
                        if (cluster->focus == NULL) {
                                bref_type = chain->bref.type;
                                bref_key = chain->bref.key;
                                bref_keybits = chain->bref.keybits;
                                bytes = chain->bytes;
                                cluster->ddflag = ddflag;
                                cluster->focus = chain;

                        /*
                         * Override default focus to follow the parent.
                         */
                        if (cparent->focus == cparent->array[i].chain)
                                cluster->focus = chain;

                        KKASSERT(bref_type == chain->bref.type);
                        KKASSERT(bref_key == chain->bref.key);
                        KKASSERT(bref_keybits == chain->bref.keybits);
                        KKASSERT(bytes == chain->bytes);
                        KKASSERT(cluster->ddflag == ddflag);
                if (key_accum > key_next)
                        key_accum = key_next;
        *key_nextp = key_accum;
        cluster->nchains = i;

        if (null_count == i) {
                hammer2_cluster_drop(cluster);
/*
 * Locate next match or overlap under parent, replace cluster
 */
hammer2_cluster_next(hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
                     hammer2_key_t *key_nextp,
                     hammer2_key_t key_beg, hammer2_key_t key_end, int flags)
        hammer2_chain_t *chain;
        hammer2_key_t key_accum;
        hammer2_key_t key_next;
        hammer2_key_t bref_key;

        key_accum = *key_nextp;
        cluster->focus = NULL;
        cparent->focus = NULL;
        cluster->ddflag = 0;

        for (i = 0; i < cparent->nchains; ++i) {
                key_next = *key_nextp;
                chain = cluster->array[i].chain;
                if (chain == NULL) {
                if (cparent->array[i].chain == NULL) {
                        if (flags & HAMMER2_LOOKUP_NOLOCK)
                                hammer2_chain_drop(chain);
                        else
                                hammer2_chain_unlock(chain);
                chain = hammer2_chain_next(&cparent->array[i].chain, chain,
                                           &key_next, key_beg, key_end,
                                           &cparent->array[i].cache_index,
                cluster->array[i].chain = chain;
                if (chain == NULL) {
                        int ddflag = (chain->bref.type ==
                                      HAMMER2_BREF_TYPE_INODE);
                        if (cluster->focus == NULL) {
                                bref_type = chain->bref.type;
                                bref_key = chain->bref.key;
                                bref_keybits = chain->bref.keybits;
                                bytes = chain->bytes;
                                cluster->ddflag = ddflag;
                                cluster->focus = chain;

                        /*
                         * Override default focus to follow the parent.
                         */
                        if (cparent->focus == cparent->array[i].chain)
                                cluster->focus = chain;

                        KKASSERT(bref_type == chain->bref.type);
                        KKASSERT(bref_key == chain->bref.key);
                        KKASSERT(bref_keybits == chain->bref.keybits);
                        KKASSERT(bytes == chain->bytes);
                        KKASSERT(cluster->ddflag == ddflag);
                if (key_accum > key_next)
                        key_accum = key_next;
        cluster->nchains = i;

        if (null_count == i) {
                hammer2_cluster_drop(cluster);
/*
 * XXX initial NULL cluster needs reworking (pass **clusterp ?)
 *
 * The raw scan function is similar to lookup/next but does not seek to a key.
 * Blockrefs are iterated via first_chain = (parent, NULL) and
 * next_chain = (parent, chain).
 *
 * The passed-in parent must be locked and its data resolved.  The returned
 * chain will be locked.  Pass chain == NULL to acquire the first sub-chain
 * under parent and then iterate with the passed-in chain (which this
 * function will unlock).
 */
hammer2_cluster_scan(hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
        hammer2_chain_t *chain;

        for (i = 0; i < cparent->nchains; ++i) {
                chain = cluster->array[i].chain;
                if (chain == NULL) {
                if (cparent->array[i].chain == NULL) {
                        if (flags & HAMMER2_LOOKUP_NOLOCK)
                                hammer2_chain_drop(chain);
                        else
                                hammer2_chain_unlock(chain);
                chain = hammer2_chain_scan(cparent->array[i].chain, chain,
                                           &cparent->array[i].cache_index,
                cluster->array[i].chain = chain;

        if (null_count == i) {
                hammer2_cluster_drop(cluster);
/*
 * Create a new cluster using the specified key
 */
hammer2_cluster_create(hammer2_trans_t *trans, hammer2_cluster_t *cparent,
                       hammer2_cluster_t **clusterp,
                       hammer2_key_t key, int keybits,
                       int type, size_t bytes, int flags)
        hammer2_cluster_t *cluster;

        pmp = trans->pmp;                       /* can be NULL */

        if ((cluster = *clusterp) == NULL) {
                cluster = kmalloc(sizeof(*cluster), M_HAMMER2,
                                  M_WAITOK | M_ZERO);
                cluster->pmp = pmp;             /* can be NULL */
                cluster->flags = HAMMER2_CLUSTER_LOCKED;
        cluster->focus = NULL;

        /*
         * NOTE: cluster->array[] entries can initially be NULL.  If
         *       *clusterp is supplied, skip NULL entries, otherwise
         *       create new chains.
         */
        for (i = 0; i < cparent->nchains; ++i) {
                if (*clusterp && cluster->array[i].chain == NULL) {
                error = hammer2_chain_create(trans, &cparent->array[i].chain,
                                             &cluster->array[i].chain, pmp,
                                             type, bytes, flags);
                KKASSERT(error == 0);
                if (cluster->focus == NULL)
                        cluster->focus = cluster->array[i].chain;
                if (cparent->focus == cparent->array[i].chain)
                        cluster->focus = cluster->array[i].chain;
        cluster->nchains = i;
        *clusterp = cluster;
/*
 * Rename a cluster to a new parent.
 *
 * WARNING! Unlike hammer2_chain_rename(), only the key and keybits fields
 *          are used from a passed-in non-NULL bref pointer.  All other fields
 *          are extracted from the original chain for each chain in the
 */
hammer2_cluster_rename(hammer2_trans_t *trans, hammer2_blockref_t *bref,
                       hammer2_cluster_t *cparent, hammer2_cluster_t *cluster,
        hammer2_chain_t *chain;
        hammer2_blockref_t xbref;

        cluster->focus = NULL;
        cparent->focus = NULL;

        for (i = 0; i < cluster->nchains; ++i) {
                chain = cluster->array[i].chain;
                        xbref = chain->bref;
                        xbref.key = bref->key;
                        xbref.keybits = bref->keybits;
                        hammer2_chain_rename(trans, &xbref,
                                             &cparent->array[i].chain,
                        hammer2_chain_rename(trans, NULL,
                                             &cparent->array[i].chain,
                KKASSERT(cluster->array[i].chain == chain); /*remove*/
/*
 * Mark a cluster deleted
 */
hammer2_cluster_delete(hammer2_trans_t *trans, hammer2_cluster_t *cparent,
                       hammer2_cluster_t *cluster, int flags)
        hammer2_chain_t *chain;
        hammer2_chain_t *parent;

        if (cparent == NULL) {
                kprintf("cparent is NULL\n");

        for (i = 0; i < cluster->nchains; ++i) {
                parent = (i < cparent->nchains) ?
                         cparent->array[i].chain : NULL;
                chain = cluster->array[i].chain;
                if (chain->parent != parent) {
                        kprintf("hammer2_cluster_delete: parent "
                                "mismatch chain=%p parent=%p against=%p\n",
                                chain, chain->parent, parent);
                hammer2_chain_delete(trans, parent, chain, flags);
/*
 * Create a snapshot of the specified {parent, ochain} with the specified
 * label.  The originating hammer2_inode must be exclusively locked for
 * safety.
 *
 * The ioctl code has already synced the filesystem.
 */
int
hammer2_cluster_snapshot(hammer2_trans_t *trans, hammer2_cluster_t *ocluster,
		       hammer2_ioc_pfs_t *pfs)
{
	hammer2_mount_t *hmp;
	hammer2_cluster_t *ncluster;
	const hammer2_inode_data_t *ripdata;
	hammer2_inode_data_t *wipdata;
	hammer2_chain_t *nchain;
	hammer2_inode_t *nip;
	size_t name_len;
	hammer2_key_t lhc;
	struct vattr vat;
	uuid_t opfs_clid;
	int error;
	int i;

	kprintf("snapshot %s\n", pfs->name);

	name_len = strlen(pfs->name);
	lhc = hammer2_dirhash(pfs->name, name_len);

	/*
	 * Get the clid
	 */
	ripdata = &hammer2_cluster_rdata(ocluster)->ipdata;
	opfs_clid = ripdata->pfs_clid;
	hmp = ocluster->focus->hmp;	/* XXX find synchronized local disk */

	/*
	 * Create the snapshot directory under the super-root
	 *
	 * Set PFS type, generate a unique filesystem id, and generate
	 * a cluster id.  Use the same clid when snapshotting a PFS root,
	 * which theoretically allows the snapshot to be used as part of
	 * the same cluster (perhaps as a cache).
	 *
	 * Copy the (flushed) blockref array.  Theoretically we could use
	 * chain_duplicate() but it becomes difficult to disentangle
	 * the shared core so for now just brute-force it.
	 */
	VATTR_NULL(&vat);
	vat.va_type = VDIR;
	vat.va_mode = 0755;
	nip = hammer2_inode_create(trans, hmp->spmp->iroot, &vat,
				   proc0.p_ucred, pfs->name, name_len,
				   &ncluster,
				   HAMMER2_INSERT_PFSROOT, &error);

	if (nip) {
		wipdata = hammer2_cluster_modify_ip(trans, nip, ncluster, 0);
		wipdata->pfs_type = HAMMER2_PFSTYPE_MASTER;
		wipdata->pfs_subtype = HAMMER2_PFSSUBTYPE_SNAPSHOT;
		wipdata->op_flags |= HAMMER2_OPFLAG_PFSROOT;
		kern_uuidgen(&wipdata->pfs_fsid, 1);

		/*
		 * Give the snapshot its own private cluster.  As a snapshot
		 * no further synchronization with the original cluster will
		 * be needed.
		 */
#if 0
		if (ocluster->focus->flags & HAMMER2_CHAIN_PFSBOUNDARY)
			wipdata->pfs_clid = opfs_clid;
		else
			kern_uuidgen(&wipdata->pfs_clid, 1);
#else
		kern_uuidgen(&wipdata->pfs_clid, 1);
#endif

		for (i = 0; i < ncluster->nchains; ++i) {
			nchain = ncluster->array[i].chain;
			if (nchain)
				nchain->bref.flags |=
					HAMMER2_BREF_FLAG_PFSROOT;
		}
#if 0
		/* XXX can't set this unless we do an explicit flush, which
		   we also need a pmp assigned to do, else the flush code
		   won't flush ncluster because it thinks it is crossing a
		   flush boundary */
		hammer2_cluster_set_chainflags(ncluster,
					       HAMMER2_CHAIN_PFSBOUNDARY);
#endif

		/* XXX hack blockset copy */
		/* XXX doesn't work with real cluster */
		KKASSERT(ocluster->nchains == 1);
		wipdata->u.blockset = ripdata->u.blockset;
		hammer2_cluster_modsync(ncluster);
		for (i = 0; i < ncluster->nchains; ++i) {
			nchain = ncluster->array[i].chain;
			if (nchain)
				hammer2_flush(trans, nchain);
		}
		hammer2_inode_unlock_ex(nip, ncluster);
	}
	return (error);
}

 * Return locked parent cluster given a locked child.  The child remains
 * locked on return.  The new parent's focus follows the child's focus
 * and the parent is always resolved.
 */
hammer2_cluster_t *
hammer2_cluster_parent(hammer2_cluster_t *cluster)
{
	hammer2_cluster_t *cparent;
	int i;

	cparent = hammer2_cluster_copy(cluster);

	for (i = 0; i < cparent->nchains; ++i) {
		hammer2_chain_t *chain;
		hammer2_chain_t *rchain;

		/*
		 * Calculate parent for each element.  Old chain has an extra
		 * ref for cparent but the lock remains with cluster.
		 */
		chain = cparent->array[i].chain;
		if (chain == NULL)
			continue;
		while ((rchain = chain->parent) != NULL) {
			hammer2_chain_ref(rchain);
			hammer2_chain_unlock(chain);
			hammer2_chain_lock(rchain, HAMMER2_RESOLVE_ALWAYS);
			hammer2_chain_lock(chain, HAMMER2_RESOLVE_ALWAYS);
			hammer2_chain_drop(rchain);
			if (chain->parent == rchain)
				break;
			hammer2_chain_unlock(rchain);
		}
		if (cluster->focus == chain)
			cparent->focus = rchain;
		cparent->array[i].chain = rchain;
		hammer2_chain_drop(chain);
	}
	cparent->flags |= HAMMER2_CLUSTER_LOCKED;

	return (cparent);
}
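The parent-acquisition loop above can be sketched in isolation. The following is a minimal illustration (not hammer2 code; the `node` type and `lock_parent()` name are invented) of the lock-ordering rule it enforces: locks are taken parent-before-child, so a thread holding only the child must drop the child's lock, lock the candidate parent, relock the child, and then verify that the child's parent pointer still references the node it just locked, retrying otherwise. The real code additionally holds a reference across the unlock so the parent cannot be freed out from under it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {
	pthread_mutex_t lock;
	struct node *parent;	/* may be changed by other threads */
};

/*
 * Lock the parent of an already-locked child without violating the
 * parent-before-child lock order.  Returns the locked parent (the child
 * stays locked), or NULL if the child is a root.
 */
static struct node *
lock_parent(struct node *child)
{
	struct node *par;

	while ((par = child->parent) != NULL) {
		pthread_mutex_unlock(&child->lock);	/* drop child */
		pthread_mutex_lock(&par->lock);		/* parent first */
		pthread_mutex_lock(&child->lock);	/* then child */
		if (child->parent == par)		/* still our parent? */
			break;
		pthread_mutex_unlock(&par->lock);	/* raced, retry */
	}
	return par;
}
```

The re-check after relocking is what makes the dance safe: between the unlock and the relock another thread may have reparented the child, in which case the stale candidate is unlocked and the loop retries against the new parent pointer.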
/************************************************************************
 ************************************************************************
 *
 * WARNING! blockref[] array data is not universal.  These functions should
 *	    only be used to access universal data.
 *
 * NOTE!    The rdata call will wait for at least one of the chain I/Os to
 *	    complete if necessary.  The I/Os should have already been
 *	    initiated by the cluster_lock/chain_lock operation.
 *
 *	    The cluster must already be in a modified state before wdata
 *	    is called.  The data will already be available for this case.
 */
const hammer2_media_data_t *
hammer2_cluster_rdata(hammer2_cluster_t *cluster)
{
	return(cluster->focus->data);
}

hammer2_media_data_t *
hammer2_cluster_wdata(hammer2_cluster_t *cluster)
{
	KKASSERT(hammer2_cluster_modified(cluster));
	return(cluster->focus->data);
}
/*
 * Load cluster data asynchronously with callback.
 *
 * The callback is made for the first validated data found, or NULL
 * if no valid data is available.
 *
 * NOTE! The cluster structure is either unique or serialized (e.g. embedded
 *	 in the inode with an exclusive lock held), the chain structure may be
 *	 shared.
 */
void
hammer2_cluster_load_async(hammer2_cluster_t *cluster,
			   void (*callback)(hammer2_iocb_t *iocb), void *ptr)
{
	hammer2_chain_t *chain;
	hammer2_iocb_t *iocb;
	hammer2_mount_t *hmp;
	hammer2_blockref_t *bref;
	int i;

	/*
	 * Try to find a chain whose data is already resolved.  If none can
	 * be found, start with the first chain.
	 */
	for (i = 0; i < cluster->nchains; ++i) {
		chain = cluster->array[i].chain;
		if (chain && chain->data)
			break;
	}
	if (i == cluster->nchains) {
		chain = cluster->array[0].chain;
		i = 0;
	}

	iocb = &cluster->iocb;
	iocb->callback = callback;
	iocb->dio = NULL;		/* for already-validated case */
	iocb->cluster = cluster;
	iocb->chain = chain;
	iocb->ptr = ptr;
	iocb->lbase = (off_t)i;
	iocb->flags = 0;
	iocb->error = 0;

	/*
	 * Data already validated
	 */
	if (chain->data) {
		callback(iocb);
		return;
	}

	/*
	 * We must resolve to a device buffer, either by issuing I/O or
	 * by creating a zero-fill element.  We do not mark the buffer
	 * dirty when creating a zero-fill element (the hammer2_chain_modify()
	 * API must still be used to do that).
	 *
	 * The device buffer is variable-sized in powers of 2 down
	 * to HAMMER2_MIN_ALLOC (typically 1K).  A 64K physical storage
	 * chunk always contains buffers of the same size. (XXX)
	 *
	 * The minimum physical IO size may be larger than the variable
	 * block size.
	 *
	 * XXX TODO - handle HAMMER2_CHAIN_INITIAL for case where chain->bytes
	 *	      matches hammer2_devblksize()?  Or does the freemap's
	 *	      pre-zeroing handle the case for us?
	 */
	bref = &chain->bref;
	hmp = chain->hmp;

#if 0
	/* handled by callback? <- TODO XXX even needed for loads? */
	/*
	 * The getblk() optimization for a 100% overwrite can only be used
	 * if the physical block size matches the request.
	 */
	if ((chain->flags & HAMMER2_CHAIN_INITIAL) &&
	    chain->bytes == hammer2_devblksize(chain->bytes)) {
		error = hammer2_io_new(hmp, bref->data_off, chain->bytes, &dio);
		KKASSERT(error == 0);
		iocb->dio = dio;
		callback(iocb);
		return;
	}
#endif

	/*
	 * Otherwise issue a read
	 */
	hammer2_adjreadcounter(&chain->bref, chain->bytes);
	hammer2_io_getblk(hmp, bref->data_off, chain->bytes, iocb);
}
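The sizing rule described above can be sketched outside the kernel. This is an illustrative fragment (not hammer2 code; the constants and function names are invented stand-ins for HAMMER2_MIN_ALLOC and hammer2_devblksize()): device buffers come in powers of 2 from a minimum allocation size up to a maximum, and the getblk()-style zero-fill optimization may skip the media read only when a freshly-initialized block exactly matches the physical buffer size, so there is no stale neighboring data that a read would otherwise have to bring in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_ALLOC	1024	/* assumed HAMMER2_MIN_ALLOC analogue */
#define MAX_DEVBLK	65536	/* assumed largest device buffer */

/*
 * Round a request up to the power-of-2 device buffer size, clamped
 * between MIN_ALLOC and MAX_DEVBLK.
 */
static size_t
devblksize(size_t bytes)
{
	size_t size = MIN_ALLOC;

	while (size < bytes && size < MAX_DEVBLK)
		size <<= 1;
	return size;
}

/*
 * A 100% overwrite of a freshly-initialized block may start from a
 * zero-fill buffer (skipping the read) only if the logical block size
 * exactly matches the physical device buffer size.
 */
static bool
can_skip_read(size_t chain_bytes, bool initial)
{
	return initial && chain_bytes == devblksize(chain_bytes);
}
```

For example, under these assumed constants a 4K initial block maps to a 4K device buffer and may be zero-filled, while a 512-byte block lives inside a 1K device buffer and still requires the read.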
/************************************************************************
 ************************************************************************
 *
 * A node failure can occur for numerous reasons.
 *
 *	- A read I/O may fail
 *	- A write I/O may fail
 *	- An unexpected chain might be found (or be missing)
 *	- A node might disconnect temporarily and reconnect later
 *	  (for example, a USB stick could get pulled, or a node might
 *	  be programmatically disconnected).
 *	- A node might run out of space during a modifying operation.
 *
 * When a read failure or an unexpected chain state is found, the chain and
 * parent chain at the failure point for the nodes involved (the nodes
 * which we determine to be in error) are flagged as failed and removed
 * from the cluster.  The node itself is allowed to remain active.  The
 * highest common point (usually a parent chain) is queued to the
 * resynchronization thread for action.
 *
 * When a write I/O fails or a node runs out of space, we first adjust
 * as if a read failure occurred but we further disable flushes on the
 * ENTIRE node.  Concurrent modifying transactions are allowed to complete
 * but any new modifying transactions will automatically remove the node
 * from consideration in all related cluster structures and not generate
 * any new modified chains.  The ROOT chain for the failed node(s) is queued
 * to the resynchronization thread for action.
 *
 * A temporary disconnect is handled as if a write failure occurred.
 *
 * Any of these failures might or might not stall related high level VNOPS,
 * depending on what has failed, what nodes remain, the type of cluster,
 * and the operating state of the cluster.
 *
 * FLUSH ON WRITE-DISABLED NODES
 *
 * A flush on a write-disabled node is not allowed to write anything because
 * we cannot safely update the mirror_tid anywhere on the failed node.  The
 * synchronization thread uses mirror_tid to calculate incremental resyncs.
 * Dirty meta-data related to the failed node is thrown away.
 *
 * Dirty buffer cache buffers and inodes are only thrown away if they can be
 * retired... that is, if the filesystem still has enough nodes to complete
 * the operation.
 */
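The failure policy above distinguishes two severities per node, which can be sketched as a small status-rollup exercise. This is an illustrative fragment only (not hammer2 code; the enum, struct, and function names are invented): a read failure removes the affected chains but leaves the node active, while a write failure, ENOSPC, or disconnect write-disables the entire node so new modifying transactions drop it from consideration, and an operation can proceed only while enough non-disabled nodes remain.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-node health as described in the comments above (names invented). */
enum node_state {
	NODE_OK,		/* fully operational */
	NODE_READ_FAILED,	/* chains dropped, node still active */
	NODE_WRITE_DISABLED	/* write failure / ENOSPC / disconnect */
};

struct cnode {
	enum node_state state;
};

/* Count nodes still usable by a new modifying transaction. */
static int
writable_nodes(const struct cnode *nodes, int nnodes)
{
	int n = 0;

	for (int i = 0; i < nnodes; ++i) {
		if (nodes[i].state != NODE_WRITE_DISABLED)
			++n;
	}
	return n;
}

/*
 * Simple majority check standing in for "enough nodes to complete the
 * operation" (the real policy depends on the cluster type).
 */
static bool
have_quorum(const struct cnode *nodes, int nnodes)
{
	return writable_nodes(nodes, nnodes) * 2 > nnodes;
}
```

Note how a read-failed node still counts toward the writable set, matching the text: only write-class failures remove a node from consideration for new modifying transactions.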
/************************************************************************
 *			SYNCHRONIZATION THREAD				*
 ************************************************************************
 *
 * This thread is responsible for [re]synchronizing the cluster representing
 * a PFS.  Any out-of-sync or failed node starts this thread on a
 * node-by-node basis when the failure is detected.
 *
 * Clusters needing resynchronization are queued at the highest point
 * where the parent on the failed node is still valid, or a special
 * incremental scan from the ROOT is queued if no parent exists.  This
 * thread is also responsible for waiting for reconnections of the failed
 * node if the cause was due to a disconnect, and waiting for space to be
 * freed up if the cause was due to running out of space.
 *
 * If the cause is due to a node running out of space, this thread will also
 * remove older (unlocked) snapshots to make new space, recover space, and
 * then start resynchronization.
 *
 * Each resynchronization pass virtually snapshots the PFS on the good nodes
 * and synchronizes using that snapshot against the target node.  This
 * ensures a consistent chain topology and also avoids interference between
 * the resynchronization thread and frontend operations.
 *
 * Since these are per-node threads it is possible to resynchronize several