hammer2 - bulkfree work, rip-up cluster sync.
author Matthew Dillon <dillon@apollo.backplane.com>
Fri, 28 Aug 2015 20:49:47 +0000 (13:49 -0700)
committer Matthew Dillon <dillon@apollo.backplane.com>
Fri, 28 Aug 2015 21:08:38 +0000 (14:08 -0700)
* bulkfree no longer attempts to flush.  Instead it deals with races against
  the live filesystem by refusing to free blocks in L1 freemap chains that
  have been modified since the last sync.  This is a temporary workaround.

* No longer propagate modify_tid during a flush.  modify_tid is now used
  as a localized but cluster-aware TID (whereas mirror_tid is only localized
  to a cluster node).

* Start work on adding an update_tid to the blockref.  This will ultimately
  be used by the cluster synchronization code instead of modify_tid.

* Adjust the DESIGN document for the new synchronization concept.

sys/vfs/hammer2/DESIGN
sys/vfs/hammer2/hammer2.h
sys/vfs/hammer2/hammer2_bulkscan.c
sys/vfs/hammer2/hammer2_chain.c
sys/vfs/hammer2/hammer2_disk.h
sys/vfs/hammer2/hammer2_flush.c
sys/vfs/hammer2/hammer2_inode.c
sys/vfs/hammer2/hammer2_ioctl.c
sys/vfs/hammer2/hammer2_vfsops.c
sys/vfs/hammer2/hammer2_vnops.c

index 25158a7..4583406 100644
@@ -245,65 +245,65 @@ blockref is then recorded when the filesystem is mounted after a crash and
 the update chain is reconstituted when a matching blockref is encountered
 again during normal operation of the filesystem.
 
-                           MIRROR_TID, MODIFY_TID
+                       MIRROR_TID, MODIFY_TID, UPDATE_TID
 
-In HAMMER2, the core block reference is 64-byte structure called a blockref.
+In HAMMER2, the core block reference is a 128-byte structure called a blockref.
 The blockref contains various bits of information including the 64-bit radix
 key (typically a directory hash if a directory entry, inode number if a
 hidden hardlink target, or file offset if a file block), 64-bit data offset
 with the physical block size radix encoded in it (physical block size can be
-different from logical block size due to compression), two 64-bit transaction
-ids, type information, and 192 bits worth of check data for the block being
-reference which can be a simple CRC or stronger HASH.
-
-Both mirror_tid and modify_tid propagate upward from the change point all the
-way to the root, but serve different purposes and work in slightly different
-ways.
+different from logical block size due to compression), three 64-bit
+transaction ids, type information, and up to 512 bits worth of check data
+for the block being referenced, which can be anything from a simple CRC to
+a strong cryptographic hash.
 
 mirror_tid - This is a media-centric (as in physical disk partition)
-            transaction id which tracks media-level updates.
+            transaction id which tracks media-level updates.  The mirror_tid
+            can be different at the same point on different nodes in a
+            cluster.
 
             Whenever any block in the media topology is modified, its
             mirror_tid is updated with the flush id and will propagate
             upward during the flush all the way to the volume header.
 
-            mirror_tid is monotonic.
+            mirror_tid is monotonic.  It is primarily used for on-mount
+            recovery and volume root validation.  The name is historical
+            from H1; it is not used for nominal mirroring.
 
 modify_tid - This is a cluster-centric (as in across all the nodes used
             to build a cluster) transaction id which tracks filesystem-level
             updates.
 
             modify_tid is updated when the front-end of the filesystem makes
-            a change to an inode or data block.  It will also propagate
-            upward, stopping at the root of the PFS (the mount point for
-            the cluster).
-
-The major difference between mirror_tid and modify_tid is that for any given
-element in the topology residing on different nodes.  e.g. file "x" on node 1
-and file "x" on node 2, if the files are synchronized with each other they
-will have the same modify_tid on a block-by-block basis, and a single check
-of the inode's modify_tid is sufficient to determine that the files are fully
-synchronized and identical.  These same inodes and representitive blocks will
-have very different mirror_tids because the nodes will reside on different
-physical media.
-
-I noted above that modify_tids also propagate upward, but not in all cases.
-A node which is undergoing SYNCHRONIZATION only updates the modify_tid of
-a block when it has determined that the block and its entire sub-block
-hierarchy has been synchronized to that point.
+            a change to an inode or data block.  It does NOT propagate upward
+            during a flush.
+
+update_tid - This is a cluster synchronization transaction id.  Modifications
+            made to the topology will clear this field to 0 as they propagate
+            up to the root.  This gives the synchronizer an easy way to
+            determine what needs revalidation.
+
+            The synchronizer revalidates the cluster bottom-up by validating
+            a sub-topology and propagating the highest modify_tid in the
+            validated sub-topology up via the update_tid field.
+
+            Updates to this field may be optimized by the HAMMER2 VFS to
+            avoid the double-transition.
 
 The synchronization code updates an out-of-sync node bottom-up and will
-definitely set modify_tid as it goes, but media flushes can occur at any
+dynamically set update_tid as it goes, but media flushes can occur at any
 time and these flushes will use mirror_tid for flush and freemap management.
 The mirror_tid for each flush propagates upward to the volume header on each
-flush.
+flush.  modify_tid is set for any chains modified by a cluster op but does
+not propagate up, instead serving as a seed for update_tid.
 
 * The synchronization code is able to determine that a sub-tree is
-  synchronized simply by observing the modify_tid at the root of the sub-tree,
-  on a directory-by-directory basis.
+  synchronized simply by observing the update_tid at the root of the sub-tree,
+  on an inode-by-inode basis and also on a data-block-by-data-block basis.
 
 * The synchronization code is able to do an incremental update of an
-  out-of-sync node simply by skipping elements with matching modify_tids.
+  out-of-sync node simply by skipping elements with a matching update_tid
+  (when not 0).
 
 * The synchronization code can be interrupted and restarted at any time,
   and is able to pick up where it left off with very little overhead.
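The incremental-update rule described above reduces to a small predicate. The
following is an illustrative standalone sketch, not actual HAMMER2 code; the
helper name and signature are invented here:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hammer2_tid_t;

/*
 * Sketch of the synchronizer's skip test.  update_tid == 0 means a
 * modification propagated up and cleared the field, so the sub-tree must
 * be revalidated.  A non-zero update_tid that covers the source's
 * modify_tid means the sub-tree is already synchronized and is skipped.
 */
static int
subtree_needs_sync(hammer2_tid_t update_tid, hammer2_tid_t src_modify_tid)
{
	if (update_tid == 0)		/* cleared by a modification */
		return 1;
	return update_tid < src_modify_tid;
}
```

This is why the synchronizer can be interrupted and restarted cheaply: the
predicate is evaluated per sub-tree root, so already-validated branches cost
a single comparison on restart.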
index 74c17e2..68c76b1 100644
@@ -476,7 +476,6 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_MODIFY_OPTDATA         0x00000002      /* data can be NULL */
 #define HAMMER2_MODIFY_NO_MODIFY_TID   0x00000004
 #define HAMMER2_MODIFY_UNUSED0008      0x00000008
-#define HAMMER2_MODIFY_NOREALLOC       0x00000010
 
 /*
  * Flags passed to hammer2_chain_lock()
@@ -768,6 +767,10 @@ typedef struct hammer2_trans hammer2_trans_t;
 #define HAMMER2_FREEMAP_HEUR           (HAMMER2_FREEMAP_HEUR_NRADIX * \
                                         HAMMER2_FREEMAP_HEUR_TYPES)
 
+#define HAMMER2_FLUSH_TOP              0x0001
+#define HAMMER2_FLUSH_ALL              0x0002
+
+
 /*
  * Hammer2 support thread element.
  *
@@ -1317,7 +1320,7 @@ int hammer2_inode_connect(hammer2_inode_t *dip, hammer2_inode_t *ip,
                        hammer2_key_t lhc);
 hammer2_inode_t *hammer2_inode_common_parent(hammer2_inode_t *fdip,
                        hammer2_inode_t *tdip);
-void hammer2_inode_fsync(hammer2_inode_t *ip);
+void hammer2_inode_chain_sync(hammer2_inode_t *ip);
 int hammer2_inode_unlink_finisher(hammer2_inode_t *ip, int isopen);
 void hammer2_inode_install_hidden(hammer2_pfs_t *pmp);
 
@@ -1382,8 +1385,6 @@ void hammer2_chain_rename(hammer2_blockref_t *bref,
                                hammer2_tid_t mtid, int flags);
 void hammer2_chain_delete(hammer2_chain_t *parent, hammer2_chain_t *chain,
                                hammer2_tid_t mtid, int flags);
-void hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop);
-void hammer2_delayed_flush(hammer2_chain_t *chain);
 void hammer2_chain_setflush(hammer2_chain_t *chain);
 void hammer2_chain_countbrefs(hammer2_chain_t *chain,
                                hammer2_blockref_t *base, int count);
@@ -1405,6 +1406,13 @@ void hammer2_base_insert(hammer2_chain_t *chain,
                                hammer2_blockref_t *base, int count,
                                int *cache_indexp, hammer2_chain_t *child);
 
+/*
+ * hammer2_flush.c
+ */
+void hammer2_flush(hammer2_chain_t *chain, int istop);
+void hammer2_flush_quick(hammer2_dev_t *hmp);
+void hammer2_delayed_flush(hammer2_chain_t *chain);
+
 /*
  * hammer2_trans.c
  */
@@ -1486,7 +1494,7 @@ void hammer2_xop_scanall(hammer2_xop_t *xop, int clidx);
 void hammer2_xop_lookup(hammer2_xop_t *xop, int clidx);
 void hammer2_inode_xop_create(hammer2_xop_t *xop, int clidx);
 void hammer2_inode_xop_destroy(hammer2_xop_t *xop, int clidx);
-void hammer2_inode_xop_fsync(hammer2_xop_t *xop, int clidx);
+void hammer2_inode_xop_chain_sync(hammer2_xop_t *xop, int clidx);
 void hammer2_inode_xop_unlinkall(hammer2_xop_t *xop, int clidx);
 void hammer2_inode_xop_connect(hammer2_xop_t *xop, int clidx);
 void hammer2_inode_xop_flush(hammer2_xop_t *xop, int clidx);
index e461c1e..ce0599e 100644
@@ -165,7 +165,7 @@ hammer2_bulk_scan(hammer2_chain_t *parent,
  * Bulkfree algorithm
  *
  * Repeat {
- *     Chain flush (partial synchronization)
+ *     Chain flush (partial synchronization) XXX removed
  *     Scan the whole topology - build in-memory freemap (mark 11)
  *     Reconcile the in-memory freemap against the on-disk freemap.
  *             ondisk xx -> ondisk 11 (if allocated)
@@ -206,13 +206,15 @@ typedef struct hammer2_bulkfree_info {
        long                    count_linadjusts;
        hammer2_off_t           adj_free;
        hammer2_tid_t           mtid;
+       hammer2_tid_t           saved_mirror_tid;
        time_t                  save_time;
 } hammer2_bulkfree_info_t;
 
 static int h2_bulkfree_callback(hammer2_chain_t *chain, void *info);
 static void h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo);
 static void h2_bulkfree_sync_adjust(hammer2_bulkfree_info_t *cbinfo,
-                       hammer2_bmap_data_t *live, hammer2_bmap_data_t *bmap);
+                       hammer2_bmap_data_t *live, hammer2_bmap_data_t *bmap,
+                       int nofree);
 
 int
 hammer2_bulkfree_pass(hammer2_dev_t *hmp, hammer2_ioc_bulkfree_t *bfi)
@@ -230,11 +232,23 @@ hammer2_bulkfree_pass(hammer2_dev_t *hmp, hammer2_ioc_bulkfree_t *bfi)
         */
        lockmgr(&hmp->bulklk, LK_EXCLUSIVE);
 
+#if 0
        /*
-        * Flush-a-roonie.  A full filesystem flush is not needed
+        * XXX This has been removed.  Instead of trying to flush, which
+        * appears to have a ton of races against life chains even with
+        * the two-stage scan, we simply refuse to free any blocks
+        * related to freemap chains modified after the last filesystem
+        * sync.
+        *
+        * Do a quick flush so we can snapshot vchain for any blocks that
+        * have been allocated prior to this point.  We don't need to
+        * flush vnodes, logical buffers, or dirty inodes that have not
+        * allocated blocks yet.  We do not want to flush the device buffers
+        * nor do we want to flush the actual volume root to disk here,
+        * that is not needed to perform the snapshot.
         */
-
-       /* hammer2_vfs_sync(hmp->mp, MNT_WAIT); XXX */
+       hammer2_flush_quick(hmp);
+#endif
 
        /*
         * Setup for free pass
@@ -244,6 +258,7 @@ hammer2_bulkfree_pass(hammer2_dev_t *hmp, hammer2_ioc_bulkfree_t *bfi)
               ~(size_t)(HAMMER2_FREEMAP_LEVELN_PSIZE - 1);
        cbinfo.hmp = hmp;
        cbinfo.bmap = kmem_alloc_swapbacked(&cbinfo.kp, size);
+       cbinfo.saved_mirror_tid = hmp->voldata.mirror_tid;
 
        /*
         * Normalize start point to a 2GB boundary.  We operate on a
@@ -479,8 +494,8 @@ h2_bulkfree_callback(hammer2_chain_t *chain, void *info)
  * direct copy.  Instead the bitmaps must be compared:
  *
  *     In-memory       Live-freemap
- *        00             11 -> 10
- *                       10 -> 00
+ *        00             11 -> 10      (do nothing if live modified)
+ *                       10 -> 00      (do nothing if live modified)
  *        11             10 -> 11      handles race against live
  *                       ** -> 11      nominally warn of corruption
  * 
@@ -497,6 +512,7 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
        hammer2_chain_t *live_chain;
        int cache_index = -1;
        int bmapindex;
+       int nofree;
 
        kprintf("hammer2_bulkfree - range %016jx-%016jx\n",
                (intmax_t)cbinfo->sbase,
@@ -509,6 +525,7 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
        hammer2_chain_ref(live_parent);
        hammer2_chain_lock(live_parent, HAMMER2_RESOLVE_ALWAYS);
        live_chain = NULL;
+       nofree = 1;     /* safety */
 
        while (data_off < cbinfo->sstop) {
                /*
@@ -536,8 +553,22 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
                                            key + HAMMER2_FREEMAP_LEVEL1_MASK,
                                            &cache_index,
                                            HAMMER2_LOOKUP_ALWAYS);
-                       if (live_chain)
+                       /*
+                        * If recent allocations were made we avoid races by
+                        * not freeing any blocks.
+                        */
+                       if (live_chain) {
                                kprintf("live_chain %016jx\n", (intmax_t)key);
+                               if (live_chain->bref.mirror_tid >
+                                   cbinfo->saved_mirror_tid) {
+                                       kprintf("hammer2_bulkfree: "
+                                               "avoid %016jx\n",
+                                               data_off);
+                                       nofree = 1;
+                               } else {
+                                       nofree = 0;
+                               }
+                       }
                                        
                }
                if (live_chain == NULL) {
@@ -582,7 +613,7 @@ h2_bulkfree_sync(hammer2_bulkfree_info_t *cbinfo)
                        data_off, bmapindex, live->class, live->avail);
 
                hammer2_chain_modify(live_chain, cbinfo->mtid, 0);
-               h2_bulkfree_sync_adjust(cbinfo, live, bmap);
+               h2_bulkfree_sync_adjust(cbinfo, live, bmap, nofree);
 next:
                data_off += HAMMER2_FREEMAP_LEVEL0_SIZE;
                ++bmap;
@@ -597,10 +628,17 @@ next:
        }
 }
 
+/*
+ * Merge the bulkfree bitmap against the existing bitmap.
+ *
+ * If nofree is non-zero the merge will only mark free blocks as allocated
+ * and will refuse to free any blocks.
+ */
 static
 void
 h2_bulkfree_sync_adjust(hammer2_bulkfree_info_t *cbinfo,
-                       hammer2_bmap_data_t *live, hammer2_bmap_data_t *bmap)
+                       hammer2_bmap_data_t *live, hammer2_bmap_data_t *bmap,
+                       int nofree)
 {
        int bindex;
        int scount;
@@ -629,6 +667,8 @@ h2_bulkfree_sync_adjust(hammer2_bulkfree_info_t *cbinfo,
                                                "transition m=00/l=01\n");
                                        break;
                                case 2: /* 10 -> 00 */
+                                       if (nofree)
+                                               break;
                                        live->bitmapq[bindex] &=
                                            ~((hammer2_bitmap_t)2 << scount);
                                        live->avail +=
@@ -638,6 +678,8 @@ h2_bulkfree_sync_adjust(hammer2_bulkfree_info_t *cbinfo,
                                        ++cbinfo->count_10_00;
                                        break;
                                case 3: /* 11 -> 10 */
+                                       if (nofree)
+                                               break;
                                        live->bitmapq[bindex] &=
                                            ~((hammer2_bitmap_t)1 << scount);
                                        ++cbinfo->count_11_10;
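The bitmap transition table and the new nofree guard can be illustrated with
a standalone helper operating on a single 2-bit field. This is a sketch only,
with an invented name; it is not the real h2_bulkfree_sync_adjust(), which
also maintains availability counters and statistics:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hammer2_bitmap_t;

/*
 * Merge one 2-bit in-memory freemap entry against the live bitmap at bit
 * offset scount.  With nofree set, the 11->10 and 10->00 freeing
 * transitions are suppressed; only the mark-allocated direction (10->11,
 * handling the race against live) remains possible.
 */
static hammer2_bitmap_t
merge_bmap_entry(hammer2_bitmap_t live, int scount, int inmem, int nofree)
{
	int lv = (live >> scount) & 3;

	if (inmem == 0) {		/* in-memory says free */
		if (nofree)
			return live;	/* recent allocs: do nothing */
		if (lv == 3)		/* 11 -> 10 */
			live &= ~((hammer2_bitmap_t)1 << scount);
		else if (lv == 2)	/* 10 -> 00 */
			live &= ~((hammer2_bitmap_t)2 << scount);
	} else if (inmem == 3) {	/* in-memory says allocated */
		if (lv == 2)		/* 10 -> 11, race against live */
			live |= (hammer2_bitmap_t)1 << scount;
		/* other live states would nominally warn of corruption */
	}
	return live;
}
```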
index 65b8d3e..f99d8c5 100644
@@ -1059,9 +1059,15 @@ hammer2_chain_resize(hammer2_inode_t *ip,
        }
 }
 
+/*
+ * Set the chain modified so its data can be changed by the caller.
+ *
+ * Sets bref.modify_tid to mtid only if mtid != 0.  Note that bref.modify_tid
+ * is a CLC (cluster level change) field and is not updated by parent
+ * propagation during a flush.
+ */
 void
-hammer2_chain_modify(hammer2_chain_t *chain,
-                    hammer2_tid_t mtid, int flags)
+hammer2_chain_modify(hammer2_chain_t *chain, hammer2_tid_t mtid, int flags)
 {
        hammer2_blockref_t obref;
        hammer2_dev_t *hmp;
@@ -1115,12 +1121,12 @@ hammer2_chain_modify(hammer2_chain_t *chain,
         * The modification or re-modification requires an allocation and
         * possible COW.
         *
-        * We normally always allocate new storage here.  If storage exists
-        * and MODIFY_NOREALLOC is passed in, we do not allocate new storage.
+        * XXX can a chain already be marked MODIFIED without a data
+        * assignment?  If not, assert here instead of testing the case.
         */
        if (chain != &hmp->vchain && chain != &hmp->fchain) {
                if ((chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX) == 0 ||
-                    ((flags & HAMMER2_MODIFY_NOREALLOC) == 0 && newmod)
+                    newmod
                ) {
                        hammer2_freemap_alloc(chain, chain->bytes);
                        /* XXX failed allocation */
@@ -1129,14 +1135,14 @@ hammer2_chain_modify(hammer2_chain_t *chain,
 
        /*
         * Update mirror_tid and modify_tid.  modify_tid is only updated
-        * automatically by this function when used from the frontend.
-        * Flushes and synchronization adjust the flag manually.
+        * if not passed as zero (during flushes, parent propagation passes
+        * the value 0).
         *
         * NOTE: chain->pmp could be the device spmp.
         */
-       KKASSERT(mtid != 0);
        chain->bref.mirror_tid = hmp->voldata.mirror_tid + 1;
-       chain->bref.modify_tid = mtid;
+       if (mtid)
+               chain->bref.modify_tid = mtid;
 
        /*
         * Set BMAPUPD to tell the flush code that an existing blockmap entry
@@ -4273,7 +4279,7 @@ hammer2_chain_snapshot(hammer2_chain_t *chain, hammer2_ioc_pfs_t *pmp,
                /* XXX doesn't work with real cluster */
                wipdata->meta = nip->meta;
                wipdata->u.blockset = ripdata->u.blockset;
-               hammer2_flush(nchain, mtid, 1);
+               hammer2_flush(nchain, 1);
                hammer2_chain_unlock(nchain);
                hammer2_chain_drop(nchain);
                hammer2_inode_unlock(nip);
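The new mtid handling in hammer2_chain_modify() boils down to: mirror_tid
always advances with the pending flush, while modify_tid is only written when
the caller supplies a non-zero mtid. A minimal standalone sketch (the helper
name is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hammer2_tid_t;

/*
 * Condensed sketch of the TID update in hammer2_chain_modify().
 * mirror_tid tracks the pending media flush id; modify_tid is a
 * cluster-level-change (CLC) id and is left untouched when mtid == 0,
 * the value passed during flush parent-propagation.
 */
static void
chain_set_tids(hammer2_tid_t *mirror_tid, hammer2_tid_t *modify_tid,
	       hammer2_tid_t vol_mirror_tid, hammer2_tid_t mtid)
{
	*mirror_tid = vol_mirror_tid + 1;
	if (mtid)
		*modify_tid = mtid;
}
```

This is the mechanism that keeps modify_tid identical across cluster nodes
for synchronized files while mirror_tid diverges per physical medium.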
index 29cccfa..ac6230f 100644
@@ -533,38 +533,44 @@ typedef struct dmsg_lnk_hammer2_volconf dmsg_lnk_hammer2_volconf_t;
  * The primary feature a blockref represents is the ability to validate
  * the entire tree underneath it via its check code.  Any modification to
  * anything propagates up the blockref tree all the way to the root, replacing
- * the related blocks.  Propagations can shortcut to the volume root to
- * implement the 'fast syncing' feature but this only delays the eventual
- * propagation.
+ * the related blocks and compounding the generated check code.
  *
- * The check code can be a simple 32-bit iscsi code, a 64-bit crc,
- * or as complex as a 192 bit cryptographic hash.  192 bits is the maximum
- * supported check code size, which is not sufficient for unverified dedup
- * UNLESS one doesn't mind once-in-a-blue-moon data corruption (such as when
- * farming web data).  HAMMER2 has an unverified dedup feature for just this
- * purpose.
+ * The check code can be a simple 32-bit iscsi code, a 64-bit crc, or as
+ * complex as a 512 bit cryptographic hash.  I originally used a 64-byte
+ * blockref but later expanded it to 128 bytes to be able to support the
+ * larger check code as well as to embed statistics for quota operation.
+ *
+ * Simple check codes are not sufficient for unverified dedup.  Even with
+ * a maximally-sized check code, unverified dedup should only be used
+ * in subdirectory trees where you do not need 100% data integrity.
+ *
+ * Unverified dedup is deduping based on meta-data only without verifying
+ * that the data blocks are actually identical.  Verified dedup guarantees
+ * integrity but is a far more I/O-expensive operation.
  *
  * --
  *
+ * mirror_tid - per cluster node modified (propagated upward by flush)
+ * modify_tid - clc record modified (not propagated).
+ * update_tid - clc record updated (propagated upward on verification)
+ *
+ * CLC - Stands for 'Cluster Level Change', identifiers which are identical
+ *      within the topology across all cluster nodes (when fully
+ *      synchronized).
+ *
  * NOTE: The range of keys represented by the blockref is (key) to
  *      ((key) + (1LL << keybits) - 1).  HAMMER2 usually populates
  *      blocks bottom-up, inserting a new root when radix expansion
  *      is required.
  *
- * --
+ *                                 RESERVED FIELDS
+ *
+ * A number of blockref fields are reserved and should generally be set to
+ * 0 for future compatibility.
+ *
  *                             FUTURE BLOCKREF EXPANSION
  *
- * In order to implement a 256-bit content addressable index we want to
- * have a 256-bit key which essentially represents the cryptographic hash.
- * (so, 64-bit key + 192-bit crypto-hash or 256-bit key-is-the-hash +
- * 32-bit consistency check for indirect block layers).
- *
- * THIS IS POSSIBLE in a 64-byte blockref structure.  Of course, any number
- * of bits can be represented by sizing the blockref.  For the purposes of
- * HAMMER2 though my limit is 256 bits.  Not only that, but it will be an
- * optimal construction because H2 already uses a variably-sized radix to
- * pack the blockrefs at each level.  A 256-bit mechanic would allow us
- * to implement a content-addressable index.
+ * CONTENT ADDRESSABLE INDEXING (future) - Using a 256 or 512-bit check code.
  */
 struct hammer2_blockref {              /* MUST BE EXACTLY 64 BYTES */
        uint8_t         type;           /* type of underlying item */
@@ -577,11 +583,11 @@ struct hammer2_blockref {         /* MUST BE EXACTLY 64 BYTES */
        uint8_t         reserved07;
        hammer2_key_t   key;            /* key specification */
        hammer2_tid_t   mirror_tid;     /* media flush topology & freemap */
-       hammer2_tid_t   modify_tid;     /* cluster level change / flush */
+       hammer2_tid_t   modify_tid;     /* clc modify (not propagated) */
        hammer2_off_t   data_off;       /* low 6 bits is phys size (radix)*/
        hammer2_key_t   data_count;     /* statistics aggregation */
        hammer2_key_t   inode_count;    /* statistics aggregation */
-       hammer2_key_t   reserved38;
+       hammer2_tid_t   update_tid;     /* clc modify (propagated upward) */
        union {                         /* check info */
                char    buf[64];
                struct {
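The field layout above implies update_tid takes the slot formerly named
reserved38 (byte offset 0x38), with the 64-byte check union bringing the
blockref to 128 bytes total. A compile-time sketch, with the reserved bytes
between type and key collapsed into a pad and the check union reduced to a
plain buffer (illustrative, not the full on-disk definition):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t hammer2_tid_t;
typedef uint64_t hammer2_key_t;
typedef uint64_t hammer2_off_t;

/*
 * Reduced sketch of the 128-byte blockref.  Only field order and the
 * offsets of interest are preserved from the diff above.
 */
struct blockref_sketch {
	uint8_t		type;		/* type of underlying item */
	uint8_t		pad[7];		/* reserved/other header bytes */
	hammer2_key_t	key;		/* key specification */
	hammer2_tid_t	mirror_tid;	/* per-node, flush-propagated */
	hammer2_tid_t	modify_tid;	/* CLC, not propagated */
	hammer2_off_t	data_off;	/* low 6 bits is phys size radix */
	hammer2_key_t	data_count;	/* statistics aggregation */
	hammer2_key_t	inode_count;	/* statistics aggregation */
	hammer2_tid_t	update_tid;	/* CLC, was reserved38 */
	char		check[64];	/* up to 512-bit check code */
};
```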
index 6e6790c..30b5986 100644
@@ -69,7 +69,7 @@ struct hammer2_flush_info {
        int             depth;
        int             diddeferral;
        int             cache_index;
-       hammer2_tid_t   mtid;
+       int             flags;
        struct h2_flush_list flushq;
        hammer2_chain_t *debug;
 };
@@ -77,7 +77,7 @@ struct hammer2_flush_info {
 typedef struct hammer2_flush_info hammer2_flush_info_t;
 
 static void hammer2_flush_core(hammer2_flush_info_t *info,
-                               hammer2_chain_t *chain, int deleting);
+                               hammer2_chain_t *chain, int flags);
 static int hammer2_flush_recurse(hammer2_chain_t *child, void *data);
 
 /*
@@ -183,8 +183,14 @@ hammer2_trans_init(hammer2_pfs_t *pmp, uint32_t flags)
 
 /*
  * Start a sub-transaction, there is no 'subdone' function.  This will
- * issue a new modify_tid (mtid) for the current transaction and must
- * be called for each XOP when multiple XOPs are run in sequence.
+ * issue a new modify_tid (mtid) for the current transaction, which is a
+ * CLC (cluster level change) id and not a per-node id.
+ *
+ * This function must be called for each XOP when multiple XOPs are run in
+ * sequence within a transaction.
+ *
+ * Callers typically update the inode with the transaction mtid manually
+ * to enforce sequencing.
  */
 hammer2_tid_t
 hammer2_trans_sub(hammer2_pfs_t *pmp)
@@ -300,15 +306,17 @@ hammer2_delayed_flush(hammer2_chain_t *chain)
 
 /*
  * Flush the chain and all modified sub-chains through the specified
- * synchronization point, propagating parent chain modifications, modify_tid,
- * and mirror_tid updates back up as needed.
+ * synchronization point, propagating blockref updates back up.  As
+ * part of this propagation, mirror_tid and inode/data usage statistics
+ * propagates back upward.
+ *
+ * modify_tid (clc - cluster level change) is not propagated.
  *
- * Caller must have already vetted synchronization points to ensure they
- * are properly flushed.  Only snapshots and cluster flushes can create
- * these sorts of synchronization points.
+ * update_tid (clc) is used for validation and is not propagated by this
+ * function.
  *
  * This routine can be called from several places but the most important
- * is from VFS_SYNC.
+ * is from VFS_SYNC (frontend) via hammer2_inode_xop_flush (backend).
  *
  * chain is locked on call and will remain locked on return.  The chain's
  * UPDATE flag indicates that its parent's block table (which is not yet
@@ -316,7 +324,7 @@ hammer2_delayed_flush(hammer2_chain_t *chain)
  * the call if it was modified.
  */
 void
-hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop)
+hammer2_flush(hammer2_chain_t *chain, int flags)
 {
        hammer2_chain_t *scan;
        hammer2_flush_info_t info;
@@ -334,7 +342,7 @@ hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop)
        bzero(&info, sizeof(info));
        TAILQ_INIT(&info.flushq);
        info.cache_index = -1;
-       info.mtid = mtid;
+       info.flags = flags & ~HAMMER2_FLUSH_TOP;
 
        /*
         * Calculate parent (can be NULL), if not NULL the flush core
@@ -383,7 +391,7 @@ hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop)
                        if (hammer2_debug & 0x0040)
                                kprintf("deferred flush %p\n", scan);
                        hammer2_chain_lock(scan, HAMMER2_RESOLVE_MAYBE);
-                       hammer2_flush(scan, mtid, 0);
+                       hammer2_flush(scan, flags & ~HAMMER2_FLUSH_TOP);
                        hammer2_chain_unlock(scan);
                        hammer2_chain_drop(scan);       /* ref from deferral */
                }
@@ -392,7 +400,7 @@ hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop)
                 * [re]flush chain.
                 */
                info.diddeferral = 0;
-               hammer2_flush_core(&info, chain, istop);
+               hammer2_flush_core(&info, chain, flags);
 
                /*
                 * Only loop if deep recursions have been deferred.
@@ -449,7 +457,7 @@ hammer2_flush(hammer2_chain_t *chain, hammer2_tid_t mtid, int istop)
  */
 static void
 hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
-                  int istop)
+                  int flags)
 {
        hammer2_chain_t *parent;
        hammer2_dev_t *hmp;
@@ -480,16 +488,9 @@ hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
                 * Already deferred.
                 */
                ++info->diddeferral;
-       } else if (info->depth == HAMMER2_FLUSH_DEPTH_LIMIT) {
-               /*
-                * Recursion depth reached.
-                */
-               KKASSERT((chain->flags & HAMMER2_CHAIN_DELAYED) == 0);
-               hammer2_chain_ref(chain);
-               TAILQ_INSERT_TAIL(&info->flushq, chain, flush_node);
-               atomic_set_int(&chain->flags, HAMMER2_CHAIN_DEFERRED);
-               ++info->diddeferral;
-       } else if ((chain->flags & HAMMER2_CHAIN_PFSBOUNDARY) && istop == 0) {
+       } else if ((chain->flags & HAMMER2_CHAIN_PFSBOUNDARY) &&
+                  (flags & HAMMER2_FLUSH_ALL) == 0 &&
+                  (flags & HAMMER2_FLUSH_TOP) == 0) {
                /*
                 * We do not recurse through PFSROOTs.  PFSROOT flushes are
                 * handled by the related pmp's (whether mounted or not,
@@ -499,18 +500,25 @@ hammer2_flush_core(hammer2_flush_info_t *info, hammer2_chain_t *chain,
                 * table updates in their parent (which IS part of our flush).
                 *
                 * Note that the volume root, vchain, does not set this flag.
+                * Note the logic here requires that this test be done before
+                * the depth-limit test, else it might become the top on a
+                * flushq iteration.
                 */
                ;
+       } else if (info->depth == HAMMER2_FLUSH_DEPTH_LIMIT) {
+               /*
+                * Recursion depth reached.
+                */
+               KKASSERT((chain->flags & HAMMER2_CHAIN_DELAYED) == 0);
+               hammer2_chain_ref(chain);
+               TAILQ_INSERT_TAIL(&info->flushq, chain, flush_node);
+               atomic_set_int(&chain->flags, HAMMER2_CHAIN_DEFERRED);
+               ++info->diddeferral;
        } else if (chain->flags & HAMMER2_CHAIN_ONFLUSH) {
                /*
                 * Downward recursion search (actual flush occurs bottom-up).
                 * pre-clear ONFLUSH.  It can get set again due to races,
                 * which we want so the scan finds us again in the next flush.
-                * These races can also include 
-                *
-                * Flush recursions stop at PFSROOT boundaries.  Each PFS
-                * must be individually flushed and then the root must
-                * be flushed.
                 */
                atomic_clear_int(&chain->flags, HAMMER2_CHAIN_ONFLUSH);
                info->parent = chain;
@@ -869,7 +877,7 @@ again:
                 * We are updating the parent's blockmap, the parent must
                 * be set modified.
                 */
-               hammer2_chain_modify(parent, info->mtid, 0);
+               hammer2_chain_modify(parent, 0, 0);
                if (parent->bref.modify_tid < chain->bref.modify_tid)
                        parent->bref.modify_tid = chain->bref.modify_tid;
 
@@ -1002,13 +1010,13 @@ hammer2_flush_recurse(hammer2_chain_t *child, void *data)
         */
        if (child->flags & HAMMER2_CHAIN_FLUSH_MASK) {
                ++info->depth;
-               hammer2_flush_core(info, child, 0);
+               hammer2_flush_core(info, child, info->flags);
                --info->depth;
        } else if (hammer2_debug & 0x200) {
                if (info->debug == NULL)
                        info->debug = child;
                ++info->depth;
-               hammer2_flush_core(info, child, 0);
+               hammer2_flush_core(info, child, info->flags);
                --info->depth;
                if (info->debug == child)
                        info->debug = NULL;
@@ -1026,6 +1034,39 @@ hammer2_flush_recurse(hammer2_chain_t *child, void *data)
        return (0);
 }
 
+/*
+ * flush helper (direct)
+ *
+ * Quickly flushes any dirty chains for a device.  This will update our
+ * concept of the volume root but does NOT flush the actual volume root
+ * and does not flush dirty device buffers.
+ *
+ * This function is primarily used by the bulkfree code to allow it to
+ * create a snapshot for the pass.  It doesn't care about any pending
+ * work (dirty vnodes, dirty inodes, dirty logical buffers) for which blocks
+ * have not yet been allocated.
+ */
+void
+hammer2_flush_quick(hammer2_dev_t *hmp)
+{
+       hammer2_chain_t *chain;
+
+       hammer2_trans_init(hmp->spmp, HAMMER2_TRANS_ISFLUSH);
+
+       hammer2_chain_ref(&hmp->vchain);
+       hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
+       if (hmp->vchain.flags & HAMMER2_CHAIN_FLUSH_MASK) {
+               chain = &hmp->vchain;
+               hammer2_flush(chain, HAMMER2_FLUSH_TOP |
+                                    HAMMER2_FLUSH_ALL);
+               KKASSERT(chain == &hmp->vchain);
+       }
+       hammer2_chain_unlock(&hmp->vchain);
+       hammer2_chain_drop(&hmp->vchain);
+
+       hammer2_trans_done(hmp->spmp);  /* spmp trans */
+}
+
 /*
  * flush helper (backend threaded)
  *
@@ -1052,7 +1093,7 @@ hammer2_inode_xop_flush(hammer2_xop_t *arg, int clindex)
        if (chain) {
                hmp = chain->hmp;
                if (chain->flags & HAMMER2_CHAIN_FLUSH_MASK) {
-                       hammer2_flush(chain, xop->head.mtid, 1);
+                       hammer2_flush(chain, HAMMER2_FLUSH_TOP);
                        parent = chain->parent;
                        KKASSERT(chain->pmp != parent->pmp);
                        hammer2_chain_setflush(parent);
@@ -1081,7 +1122,8 @@ hammer2_inode_xop_flush(hammer2_xop_t *arg, int clindex)
        /*
         * spmp transaction.  The super-root is never directly mounted so
         * there shouldn't be any vnodes, let alone any dirty vnodes
-        * associated with it.
+        * associated with it, so we shouldn't have to mess around with any
+        * vnode flushes here.
         */
        hammer2_trans_init(hmp->spmp, HAMMER2_TRANS_ISFLUSH);
 
@@ -1105,7 +1147,7 @@ hammer2_inode_xop_flush(hammer2_xop_t *arg, int clindex)
                 */
                hammer2_voldata_modify(hmp);
                chain = &hmp->fchain;
-               hammer2_flush(chain, xop->head.mtid, 1);
+               hammer2_flush(chain, HAMMER2_FLUSH_TOP);
                KKASSERT(chain == &hmp->fchain);
        }
        hammer2_chain_unlock(&hmp->fchain);
@@ -1116,7 +1158,7 @@ hammer2_inode_xop_flush(hammer2_xop_t *arg, int clindex)
        hammer2_chain_lock(&hmp->vchain, HAMMER2_RESOLVE_ALWAYS);
        if (hmp->vchain.flags & HAMMER2_CHAIN_FLUSH_MASK) {
                chain = &hmp->vchain;
-               hammer2_flush(chain, xop->head.mtid, 1);
+               hammer2_flush(chain, HAMMER2_FLUSH_TOP);
                KKASSERT(chain == &hmp->vchain);
        }
        hammer2_chain_unlock(&hmp->vchain);
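The new hammer2_flush_quick() above only performs a flush when the volume root chain actually has a FLUSH flag pending, so a clean device costs little more than the lock round-trip. A minimal self-contained sketch of that flush-only-if-dirty pattern (struct chain, CHAIN_FLUSH_MASK, and quick_flush here are simplified stand-ins, not the real HAMMER2 definitions):

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified stand-in for the dirty-flag test in hammer2_flush_quick(). */
#define CHAIN_FLUSH_MASK 0x3u

struct chain {
	atomic_uint flags;
	int	    flushes;	/* counts flushes performed by this model */
};

static void
quick_flush(struct chain *vchain)
{
	/* hammer2_trans_init(spmp, TRANS_ISFLUSH) would bracket this */
	if (atomic_load(&vchain->flags) & CHAIN_FLUSH_MASK) {
		++vchain->flushes;
		atomic_fetch_and(&vchain->flags, ~CHAIN_FLUSH_MASK);
	}
	/* hammer2_trans_done(spmp) */
}
```

A clean chain passes through without flushing; a dirty one is flushed once and its flush flags cleared, mirroring the vchain handling in the real helper.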
index 075e1b7..d680904 100644 (file)
@@ -1301,7 +1301,7 @@ hammer2_inode_modify(hammer2_inode_t *ip)
  * Called with a locked inode inside a transaction.
  */
 void
-hammer2_inode_fsync(hammer2_inode_t *ip)
+hammer2_inode_chain_sync(hammer2_inode_t *ip)
 {
        if (ip->flags & (HAMMER2_INODE_RESIZED | HAMMER2_INODE_MODIFIED)) {
                hammer2_xop_fsync_t *xop;
@@ -1324,7 +1324,7 @@ hammer2_inode_fsync(hammer2_inode_t *ip)
 
                atomic_clear_int(&ip->flags, HAMMER2_INODE_RESIZED |
                                             HAMMER2_INODE_MODIFIED);
-               hammer2_xop_start(&xop->head, hammer2_inode_xop_fsync);
+               hammer2_xop_start(&xop->head, hammer2_inode_xop_chain_sync);
                error = hammer2_xop_collect(&xop->head, 0);
                hammer2_xop_retire(&xop->head, HAMMER2_XOPMASK_VOP);
                if (error == ENOENT)
@@ -1629,8 +1629,11 @@ fail:
        }
 }
 
+/*
+ * Synchronize the in-memory inode with the chain.
+ */
 void
-hammer2_inode_xop_fsync(hammer2_xop_t *arg, int clindex)
+hammer2_inode_xop_chain_sync(hammer2_xop_t *arg, int clindex)
 {
        hammer2_xop_fsync_t *xop = &arg->xop_fsync;
        hammer2_chain_t *parent;
@@ -1689,6 +1692,18 @@ hammer2_inode_xop_fsync(hammer2_xop_t *arg, int clindex)
                                                   HAMMER2_LOOKUP_NODATA |
                                                   HAMMER2_LOOKUP_NODIRECT);
                }
+
+               /*
+                * Reset parent to point at the inode for the following code, if necessary.
+                */
+               if (parent->bref.type != HAMMER2_BREF_TYPE_INODE) {
+                       hammer2_chain_unlock(parent);
+                       hammer2_chain_drop(parent);
+                       parent = hammer2_inode_chain(xop->head.ip1, clindex,
+                                                    HAMMER2_RESOLVE_ALWAYS);
+                       kprintf("hammer2: TRUNCATE RESET on '%s'\n",
+                               parent->data->ipdata.filename);
+               }
        }
 
        /*
index a210f7a..3956fe3 100644 (file)
@@ -656,7 +656,7 @@ hammer2_ioctl_pfs_create(hammer2_inode_t *ip, void *data)
                 */
                hammer2_inode_ref(nip);
                hammer2_inode_unlock(nip);
-               hammer2_inode_fsync(nip);
+               hammer2_inode_chain_sync(nip);
                hammer2_inode_drop(nip);
        }
        hammer2_trans_done(hmp->spmp);
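The rename from hammer2_inode_fsync() to hammer2_inode_chain_sync() used throughout this commit emphasizes that the call pushes the in-memory inode meta-data into the backing chain and clears MODIFIED/RESIZED; it is not a media fsync. A hypothetical model of that contract (struct model_inode and chain_sync are illustrative only):

```c
#include <assert.h>

/* Illustrative flags; not the real HAMMER2_INODE_* values. */
#define INODE_MODIFIED 0x1u
#define INODE_RESIZED  0x2u

struct model_inode {
	unsigned flags;
	long	 meta_size;	/* in-memory inode meta-data */
	long	 chain_size;	/* what the backing chain records */
};

static void
chain_sync(struct model_inode *ip)
{
	if (ip->flags & (INODE_MODIFIED | INODE_RESIZED)) {
		ip->chain_size = ip->meta_size;	/* push meta into chain */
		ip->flags &= ~(INODE_MODIFIED | INODE_RESIZED);
	}
}
```

If neither flag is set, the chain is left untouched, which is why the callers in this commit test the flags before invoking the sync.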
index 478a6d8..7832ca1 100644 (file)
@@ -1896,7 +1896,7 @@ hammer2_recovery_scan(hammer2_dev_t *hmp, hammer2_chain_t *parent,
                 */
                if ((chain->bref.flags & HAMMER2_BREF_FLAG_PFSROOT) &&
                    (chain->flags & HAMMER2_CHAIN_ONFLUSH)) {
-                       hammer2_flush(chain, info->mtid, 1);
+                       hammer2_flush(chain, HAMMER2_FLUSH_TOP);
                }
                chain = hammer2_chain_scan(parent, chain, &cache_index,
                                           HAMMER2_LOOKUP_NODATA);
@@ -2050,7 +2050,12 @@ hammer2_sync_scan2(struct mount *mp, struct vnode *vp, void *data)
        if ((ip->flags & HAMMER2_INODE_MODIFIED) ||
            !RB_EMPTY(&vp->v_rbdirty_tree)) {
                vfsync(vp, info->waitfor, 1, NULL, NULL);
-               hammer2_inode_fsync(ip);
+               if (ip->flags & (HAMMER2_INODE_RESIZED |
+                                HAMMER2_INODE_MODIFIED)) {
+                       hammer2_inode_lock(ip, 0);
+                       hammer2_inode_chain_sync(ip);
+                       hammer2_inode_unlock(ip);
+               }
        }
        if ((ip->flags & HAMMER2_INODE_MODIFIED) == 0 &&
            RB_EMPTY(&vp->v_rbdirty_tree)) {
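The hammer2_sync_scan2() change above re-tests RESIZED/MODIFIED after vfsync() returns, so inodes that came out clean never pay for hammer2_inode_lock()/chain_sync(). A self-contained sketch of that re-check-before-lock pattern (the struct and counters are illustrative only):

```c
#include <assert.h>

/* Illustrative flags; not the real HAMMER2_INODE_* values. */
#define INODE_MODIFIED 0x1u
#define INODE_RESIZED  0x2u

struct scan_inode {
	unsigned flags;
	int	 locks_taken;	/* models hammer2_inode_lock() calls */
	int	 syncs_done;	/* models hammer2_inode_chain_sync() calls */
};

static void
sync_scan(struct scan_inode *ip)
{
	/* vfsync(vp, ...) ran before this point; re-check the flags */
	if (ip->flags & (INODE_RESIZED | INODE_MODIFIED)) {
		++ip->locks_taken;	/* hammer2_inode_lock(ip, 0)  */
		++ip->syncs_done;	/* hammer2_inode_chain_sync() */
		ip->flags &= ~(INODE_RESIZED | INODE_MODIFIED);
		/* hammer2_inode_unlock(ip) */
	}
}
```

Only dirty inodes take the lock, which keeps the per-vnode scan cheap on a mostly-clean mount.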
index 5838dd6..1046ef9 100644 (file)
@@ -228,7 +228,7 @@ hammer2_vop_fsync(struct vop_fsync_args *ap)
         */
        hammer2_inode_lock(ip, 0);
        if (ip->flags & HAMMER2_INODE_MODIFIED)
-               hammer2_inode_fsync(ip);
+               hammer2_inode_chain_sync(ip);
        hammer2_inode_unlock(ip);
        hammer2_trans_done(ip->pmp);
 
@@ -460,7 +460,7 @@ done:
         * block table.
         */
        if (ip->flags & HAMMER2_INODE_RESIZED)
-               hammer2_inode_fsync(ip);
+               hammer2_inode_chain_sync(ip);
 
        /*
         * Cleanup.
@@ -972,14 +972,14 @@ hammer2_write_file(hammer2_inode_t *ip, struct uio *uio,
                hammer2_mtx_ex(&ip->lock);
                hammer2_truncate_file(ip, old_eof);
                if (ip->flags & HAMMER2_INODE_MODIFIED)
-                       hammer2_inode_fsync(ip);
+                       hammer2_inode_chain_sync(ip);
                hammer2_mtx_unlock(&ip->lock);
        } else if (modified) {
                hammer2_mtx_ex(&ip->lock);
                hammer2_inode_modify(ip);
                hammer2_update_time(&ip->meta.mtime);
                if (ip->flags & HAMMER2_INODE_MODIFIED)
-                       hammer2_inode_fsync(ip);
+                       hammer2_inode_chain_sync(ip);
                hammer2_mtx_unlock(&ip->lock);
                hammer2_knote(ip->vp, kflags);
        }
@@ -1034,8 +1034,9 @@ hammer2_truncate_file(hammer2_inode_t *ip, hammer2_key_t nsize)
  *
  * Even though the file size is changing, we do not have to set the
  * INODE_RESIZED bit unless the file size crosses the EMBEDDED_BYTES
- * boundary.  When this occurs a hammer2_inode_fsync() is required
- * to prepare the inode cluster's indirect block table.
+ * boundary.  When this occurs a hammer2_inode_chain_sync() is required
+ * to prepare the inode cluster's indirect block table, otherwise
+ * async execution of the strategy code will implode on us.
  *
  * WARNING! Assumes that the kernel interlocks size changes at the
  *         vnode level.
@@ -1060,8 +1061,10 @@ hammer2_extend_file(hammer2_inode_t *ip, hammer2_key_t nsize)
        ip->meta.size = nsize;
        atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
 
-       if (osize <= HAMMER2_EMBEDDED_BYTES && nsize > HAMMER2_EMBEDDED_BYTES)
+       if (osize <= HAMMER2_EMBEDDED_BYTES && nsize > HAMMER2_EMBEDDED_BYTES) {
                atomic_set_int(&ip->flags, HAMMER2_INODE_RESIZED);
+               hammer2_inode_chain_sync(ip);
+       }
 
        hammer2_mtx_unlock(&ip->lock);
        if (ip->vp) {
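The RESIZED test in hammer2_extend_file() above fires only when the size crosses the embedded-data boundary, which is when the inode cluster's indirect block table must be set up before the async strategy code runs. A sketch of the boundary predicate; the 512 matches HAMMER2_EMBEDDED_BYTES as an assumption, so check hammer2_disk.h for the authoritative value:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of HAMMER2_EMBEDDED_BYTES; see hammer2_disk.h. */
#define EMBEDDED_BYTES 512u

/*
 * True only when an extend moves the file from embedded inode data to
 * block-table storage, i.e. when INODE_RESIZED (and the immediate
 * chain_sync added by this commit) is actually required.
 */
static int
crosses_embedded(uint64_t osize, uint64_t nsize)
{
	return (osize <= EMBEDDED_BYTES && nsize > EMBEDDED_BYTES);
}
```

Extends that stay entirely below or entirely above the boundary skip the RESIZED path, matching the `osize <= ... && nsize > ...` test in the diff.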