hammer2 - Merge Daniel Flores's HAMMER2 GSOC project into the main tree
author Matthew Dillon <dillon@apollo.backplane.com>
Thu, 19 Sep 2013 02:56:07 +0000 (19:56 -0700)
committer Matthew Dillon <dillon@apollo.backplane.com>
Thu, 19 Sep 2013 02:56:07 +0000 (19:56 -0700)
* This merge contains work primarily by Daniel Flores, with additional
  work by Matthew Dillon.  Daniel's work focused on adding the
  compression infrastructure while my work focused on changing HAMMER2's
  I/O infrastructure to work with the compression code.

  Daniel's work involved adding the compression functions and heuristics,
  modifying mainly vnops and vfsops to use them, adding the new buffer
  cache write thread, and adding the new hammer2 utility directives and
  related ioctls.

  My work involved changing the H2 I/O infrastructure to always double-buffer
  (i.e. logical buffers vs device buffers) because device buffers can now
  wind up being a different size than the related logical buffers.  I also
  had to make changes to the hammer2_mount and hammer2_pfsmount mechanics
  and other things to prevent deadlocks.

    Daniel's Work

* Add the hammer2 setcomp directive which sets the compression mode on a
  directory or file.  If applied to a directory, the compression mode is
  inherited by any new files or directories created under that directory.
  Pre-existing subdirectories and files are not affected.

  The directive has a recursive option which descends the hierarchy and
  sets the mode on everything underneath.
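
  The mode byte stored in the inode's comp_algo field packs the zlib level
  above the algorithm id, mirroring the "comp_method = level << 4;
  comp_method += HAMMER2_COMP_ZLIB" arithmetic in cmd_setcomp.c.  A minimal
  standalone sketch of that packing (the constant values here are
  illustrative stand-ins, not quoted from hammer2_disk.h):

```c
#include <assert.h>

/*
 * Illustrative stand-ins for the HAMMER2_COMP_* constants; the real
 * definitions live in hammer2_disk.h and are assumed here.
 */
#define COMP_NONE       0
#define COMP_AUTOZERO   1
#define COMP_LZ4        2
#define COMP_ZLIB       3

/* Pack: zlib level in the high nibble, algorithm id in the low nibble. */
static int
comp_encode(int algo, int level)
{
	return ((level << 4) | algo);
}

/* Unpack the algorithm id. */
static int
comp_algo_of(int comp_method)
{
	return (comp_method & 0x0F);
}

/* Unpack the level (only meaningful for COMP_ZLIB). */
static int
comp_level_of(int comp_method)
{
	return ((comp_method >> 4) & 0x0F);
}
```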

* Add wthread_bioq and related fields to hammer2_mount to support a
  buffer cache buffer writing thread.  This thread is responsible for
  calculating compression sizes, allocating device buffer blocks, and
  compressing logical buffers into the device buffers.
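
  In userland terms the pattern is a mutex-protected queue drained by a
  dedicated thread.  The sketch below uses pthreads and a linked stack
  purely for illustration; the kernel side uses a bioq, mtx, and FIFO
  ordering, and none of the names below are the kernel's:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified stand-in for the wthread_bioq pattern: producers enqueue
 * buffers, a single writer thread dequeues and "compresses" them.
 * (This sketch is LIFO for brevity; the real bioq is FIFO.) */
struct buf { struct buf *next; int len; };

static struct buf *queue_head;
static pthread_mutex_t queue_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cv = PTHREAD_COND_INITIALIZER;
static int queue_done;
static int processed;

static void
enqueue(struct buf *bp)
{
	pthread_mutex_lock(&queue_mtx);
	bp->next = queue_head;
	queue_head = bp;
	pthread_cond_signal(&queue_cv);
	pthread_mutex_unlock(&queue_mtx);
}

static void *
write_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&queue_mtx);
	for (;;) {
		while (queue_head == NULL && !queue_done)
			pthread_cond_wait(&queue_cv, &queue_mtx);
		if (queue_head == NULL)		/* done and fully drained */
			break;
		struct buf *bp = queue_head;
		queue_head = bp->next;
		pthread_mutex_unlock(&queue_mtx);
		++processed;	/* compression + device write would go here */
		pthread_mutex_lock(&queue_mtx);
	}
	pthread_mutex_unlock(&queue_mtx);
	return (NULL);
}
```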

* Implement HAMMER2_COMP_AUTOZERO, HAMMER2_COMP_LZ4, and HAMMER2_COMP_ZLIB
  compression modes.  AUTOZERO is the zero-block detection code.  LZ4 does
  zero-block detection and otherwise compresses with LZ4, and ZLIB does
  zero-block detection and otherwise compresses with zlib deflate.

  This work entails a ton of new files imported from the LZ4 and ZLIB
  projects plus lots of wiring.

  The new files had to be cleaned up considerably, as well, since they
  were originally intended for userland.
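
  All three modes share the zero-block check: if a logical buffer is
  entirely zero, no device block needs to be allocated for it at all.
  A hedged standalone sketch of such a check (not the kernel's actual
  routine):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return non-zero if the buffer is entirely zero.  Comparing the
 * buffer against itself shifted by one byte works because the buffer
 * is all-zero iff buf[0] == 0 and buf[i] == buf[i+1] for every i.
 */
static int
is_zero_block(const unsigned char *buf, size_t len)
{
	if (len == 0)
		return (1);
	return (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}
```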

* Move synchronous device buffer handling out of hammer2_vop_write() and
  into the support thread.  Numerous procedures were moved out of
  hammer2_vnops.c and into hammer2_vfsops.c as well.

  This greatly simplifies hammer2_vop_write() as well as the truncate and
  extend code, and improves the critical-path performance for write()
  (at least until the buffer cache fills up or gets too far behind).

* Implement semi-synchronous decompression callbacks for read I/O and
  read-ahead I/O.

* Add HAMMER2IOC_INODE_COMP_REC_SET and HAMMER2IOC_INODE_COMP_REC_SET2
  ioctls to support the setcomp directive.

  Matthew's Work

* The hammer2_inode structure now copies additional fields from the inode
  data, allowing the inode data to be deallocated after use.

* Due to the way the buffer cache now operates, multiple deletions of the
  same chain key can occur within the same transaction.  Adjust the RBTREE
  compare code to handle the case.
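
  The adjustment can be sketched standalone as follows (field and flag
  names shadow the diff's hammer2_chain_cmp(); this is a simplified
  illustration, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define CHAIN_DELETED	0x0001		/* stand-in for HAMMER2_CHAIN_DELETED */

struct chain {
	uint64_t	key;
	uint64_t	delete_tid;
	unsigned	flags;
};

/*
 * Order primarily by key and delete_tid.  When both chains are marked
 * DELETED (multiple deletions of the same key within one transaction),
 * fall back to pointer identity so both can coexist in the RB tree.
 */
static int
chain_cmp(const struct chain *c1, const struct chain *c2)
{
	if (c1->key < c2->key)
		return (-1);
	if (c1->key > c2->key)
		return (1);
	if (c1->delete_tid < c2->delete_tid)
		return (-1);
	if (c1->delete_tid > c2->delete_tid)
		return (1);
	if ((c1->flags & CHAIN_DELETED) && (c2->flags & CHAIN_DELETED)) {
		if (c1 < c2)
			return (-1);
		if (c1 > c2)
			return (1);
	}
	return (0);
}
```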

* Track chain structure use on a pfsmount-by-pfsmount basis for current
  and future management of the kmalloc pools used by hammer2.

* Rework the way inodes are locked to track chain modifications.

* Rewrite hammer2_chain_delete_duplicate().

* Rewrite hammer2_trans_init() and the flush code primarily to fix
  deadlocks in the flush synchronization mechanics.

* Interlock very low level chain operations with a spin lock instead
  of the full-blown chain lock to deal with potential deadlocks and
  fix a few SMP races.

* For the moment make all logical buffers 64KB.  This is not efficient for
  small files and will be changed back at some point, but it is necessary
  for efficient compression right now.
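
  With fixed 64KB logical buffers, mapping a file offset to its logical
  buffer base is a simple mask.  A hedged sketch of the arithmetic (the
  real helper is hammer2_calc_logical(), whose exact behavior is not
  reproduced here):

```c
#include <assert.h>
#include <stdint.h>

#define LBUFSIZE	65536ULL	/* fixed logical buffer size (64KB) */
#define LBUFMASK	(LBUFSIZE - 1)

/*
 * Round a file offset down to the base of its 64KB logical buffer.
 * The device buffer backing it may end up a different (smaller) size
 * once compression shrinks the data, which is why I/O is now always
 * double-buffered.
 */
static uint64_t
logical_base(uint64_t uoff)
{
	return (uoff & ~LBUFMASK);
}
```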

* Implement an asynchronous buffer cache callback feature.

* Use a localized size field in the hammer2_inode structure for frontend
  operations, including extend and truncate, to avoid conflicts with
  backend flushes.  This way the inode data is modified only during the
  flush, and not before, where it might interfere with previously staged
  flushes.

37 files changed:
sbin/hammer2/Makefile
sbin/hammer2/cmd_setcomp.c [new file with mode: 0644]
sbin/hammer2/cmd_stat.c
sbin/hammer2/hammer2.h
sbin/hammer2/main.c
sbin/hammer2/print_inode.c [new file with mode: 0644]
sys/vfs/hammer2/Makefile
sys/vfs/hammer2/hammer2.h
sys/vfs/hammer2/hammer2_chain.c
sys/vfs/hammer2/hammer2_disk.h
sys/vfs/hammer2/hammer2_flush.c
sys/vfs/hammer2/hammer2_freemap.c
sys/vfs/hammer2/hammer2_inode.c
sys/vfs/hammer2/hammer2_ioctl.c
sys/vfs/hammer2/hammer2_ioctl.h
sys/vfs/hammer2/hammer2_lz4.c [new file with mode: 0644]
sys/vfs/hammer2/hammer2_lz4.h [new file with mode: 0644]
sys/vfs/hammer2/hammer2_lz4_encoder.h [new file with mode: 0644]
sys/vfs/hammer2/hammer2_subr.c
sys/vfs/hammer2/hammer2_vfsops.c
sys/vfs/hammer2/hammer2_vnops.c
sys/vfs/hammer2/zlib/hammer2_zlib.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_adler32.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_deflate.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_deflate.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inffast.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inffast.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inffixed.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inflate.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inflate.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inftrees.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_inftrees.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_trees.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_trees.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_zconf.h [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_zutil.c [new file with mode: 0644]
sys/vfs/hammer2/zlib/hammer2_zlib_zutil.h [new file with mode: 0644]

index a083d6b..755095f 100644 (file)
@@ -2,7 +2,7 @@ PROG=   hammer2
 SRCS=  main.c subs.c icrc.c
 SRCS+= cmd_remote.c cmd_snapshot.c cmd_pfs.c
 SRCS+= cmd_service.c cmd_leaf.c cmd_debug.c
-SRCS+= cmd_rsa.c cmd_stat.c
+SRCS+= cmd_rsa.c cmd_stat.c cmd_setcomp.c print_inode.c
 #MAN=  hammer2.8
 NOMAN= TRUE
 DEBUG_FLAGS=-g
diff --git a/sbin/hammer2/cmd_setcomp.c b/sbin/hammer2/cmd_setcomp.c
new file mode 100644 (file)
index 0000000..1c0b140
--- /dev/null
@@ -0,0 +1,219 @@
+/*
+ * Copyright (c) 2013 The DragonFly Project.  All rights reserved.
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ * 3. Neither the name of The DragonFly Project nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific, prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
+ * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "hammer2.h"
+int
+cmd_setcomp(char* comp_string, char* file_string)
+{
+       int comp_method;
+       if (strcmp(comp_string, "0") == 0) {
+               printf("Will turn off compression on directory/file %s\n", file_string);
+               comp_method = HAMMER2_COMP_NONE;
+       } else if (strcmp(comp_string, "1") == 0) {
+               printf("Will set zero-checking compression on directory/file %s.\n",
+                       file_string);
+               comp_method = HAMMER2_COMP_AUTOZERO;
+       } else if (strcmp(comp_string, "2") == 0) {
+               printf("Will set LZ4 compression on directory/file %s.\n", file_string);
+               comp_method = HAMMER2_COMP_LZ4;
+       } else if (strcmp(comp_string, "3:6") == 0) {
+               printf("Will set ZLIB level 6 compression on directory/file %s.\n", file_string);
+               comp_method = 6 << 4;
+               comp_method += HAMMER2_COMP_ZLIB;
+       } else if (strcmp(comp_string, "3") == 0 || strcmp(comp_string, "3:7") == 0) {
+               printf("Will set ZLIB level 7 (default) compression on directory/file %s.\n", file_string);
+               comp_method = 7 << 4;
+               comp_method += HAMMER2_COMP_ZLIB;
+       } else if (strcmp(comp_string, "3:8") == 0) {
+               printf("Will set ZLIB level 8 compression on directory/file %s.\n", file_string);
+               comp_method = 8 << 4;
+               comp_method += HAMMER2_COMP_ZLIB;
+       } else if (strcmp(comp_string, "3:9") == 0) {
+               printf("Will set ZLIB level 9 compression on directory/file %s.\n", file_string);
+               printf("CAUTION: May be extremely slow on large amounts of data.\n");
+               comp_method = 9 << 4;
+               comp_method += HAMMER2_COMP_ZLIB;
+       } else if (strcmp(comp_string, "3:5") == 0 || strcmp(comp_string, "3:4") == 0 ||
+                               strcmp(comp_string, "3:3") == 0 || strcmp(comp_string, "3:2") == 0 ||
+                               strcmp(comp_string, "3:1") == 0) {
+               printf("ZLIB compression levels below 6 are not supported,\n");
+               printf("please use LZ4 (setcomp 2) for fast compression instead.\n");
+               return 1;
+       }
+       else {
+               printf("ERROR: Unknown compression method.\n");
+               return 1;
+       }
+       int fd = hammer2_ioctl_handle(file_string);
+       hammer2_ioc_inode_t inode;
+       int res = ioctl(fd, HAMMER2IOC_INODE_GET, &inode);
+       if (res < 0) {
+               fprintf(stderr, "ERROR before setting the mode: %s\n",
+                       strerror(errno));
+               return 3;
+       }
+       inode.ip_data.comp_algo = comp_method & 0x0FF;
+       res = ioctl(fd, HAMMER2IOC_INODE_SET, &inode);
+       if (res < 0) {
+               if (errno != EINVAL) {
+                       fprintf(stderr, "ERROR after trying to set the mode: %s\n",
+                               strerror(errno));
+                       return 3;
+               }
+       }
+       close(fd);
+       return 0;
+}
+
+int
+cmd_setcomp_recursive(char* option_string, char* comp_string, char* file_string)
+{
+       int ecode = 0;
+       int set_files;
+       if (strcmp(option_string, "-r") == 0) {
+               set_files = 0;
+       }
+       else if (strcmp(option_string, "-rf") == 0) {
+               set_files = 1;
+       }
+       else {
+               printf("setcomp: Unrecognized option.\n");
+               exit(1);
+       }
+       int comp_method;
+       if (strcmp(comp_string, "0") == 0) {
+               printf("Will turn off compression on directory/file %s\n", file_string);
+               comp_method = HAMMER2_COMP_NONE;
+       } else if (strcmp(comp_string, "1") == 0) {
+               printf("Will set zero-checking compression on directory/file %s.\n", file_string);
+               comp_method = HAMMER2_COMP_AUTOZERO;
+       } else if (strcmp(comp_string, "2") == 0) {
+               printf("Will set LZ4 compression on directory/file %s.\n", file_string);
+               comp_method = HAMMER2_COMP_LZ4;
+       } else if (strcmp(comp_string, "3") == 0) {
+               printf("Will set ZLIB (slowest) compression on directory/file %s.\n", file_string);
+               comp_method = HAMMER2_COMP_ZLIB;
+       }
+       else {
+               printf("Unknown compression method.\n");
+               return 1;
+       }
+       int fd = hammer2_ioctl_handle(file_string);
+       hammer2_ioc_inode_t inode;
+       int res = ioctl(fd, HAMMER2IOC_INODE_GET, &inode);
+       if (res < 0) {
+               fprintf(stderr, "ERROR before setting the mode: %s\n", strerror(errno));
+               return 3;
+       }
+       if (inode.ip_data.type != HAMMER2_OBJTYPE_DIRECTORY) {
+               printf("setcomp: the specified object is not a directory, nothing changed.\n");
+               return 1;
+       }
+       printf("Attention: recursive compression mode setting requested, this may take a while...\n");
+       ecode = setcomp_recursive_call(file_string, comp_method, set_files);
+       inode.ip_data.comp_algo = comp_method;
+       res = ioctl(fd, HAMMER2IOC_INODE_SET, &inode);
+       if (res < 0) {
+               if (errno != EINVAL) {
+                       fprintf(stderr, "ERROR after trying to set the mode: %s\n", strerror(errno));
+                       return 3;
+               }
+       }
+       close(fd);
+       return ecode;
+}
+
+int
+setcomp_recursive_call(char *directory, int comp_method, int set_files)
+{
+       int ecode = 0;
+       DIR *dir;
+       if ((dir = opendir(directory)) == NULL) {
+               fprintf(stderr, "ERROR while trying to set the mode recursively: %s\n",
+                       strerror(errno));
+               return 3;
+       }
+       struct dirent *dent;
+       int length;
+       length = strlen(directory);
+       char name[HAMMER2_INODE_MAXNAME];
+       strcpy(name, directory);
+       name[length] = '/';
+       ++length;
+       errno = 0;
+       dent = readdir(dir);
+       while (dent != NULL && ecode == 0) {
+               if ((strcmp(dent->d_name, ".") != 0) &&
+                   (strcmp(dent->d_name, "..") != 0)) {
+                       strncpy(name + length, dent->d_name, HAMMER2_INODE_MAXNAME -
+                               length);
+                       int fd = hammer2_ioctl_handle(name);
+                       hammer2_ioc_inode_t inode;
+                       int res = ioctl(fd, HAMMER2IOC_INODE_GET, &inode);
+                       if (res < 0) {
+                               fprintf(stderr, "ERROR during recursion: %s\n",
+                                       strerror(errno));
+                               return 3;
+                       }
+                       if (inode.ip_data.type == HAMMER2_OBJTYPE_DIRECTORY) {
+                               ecode = setcomp_recursive_call(name, comp_method, set_files);
+                               inode.ip_data.comp_algo = comp_method;
+                               res = ioctl(fd, HAMMER2IOC_INODE_SET, &inode);
+                       }
+                       else {
+                               if (set_files == 1 && inode.ip_data.type ==
+                                               HAMMER2_OBJTYPE_REGFILE) {
+                                       inode.ip_data.comp_algo = comp_method;
+                                       res = ioctl(fd, HAMMER2IOC_INODE_SET, &inode);
+                               }
+                       }
+                       if (res < 0) {
+                               if (errno != EINVAL) {
+                                       fprintf(stderr, "ERROR during recursion after trying "
+                                               "to set the mode: %s\n",
+                                               strerror(errno));
+                                       return 3;
+                               }
+                       }
+                       close(fd);
+               }
+               errno = 0;      /* we must set errno to 0 before readdir() */
+               dent = readdir(dir);
+       }
+       closedir(dir);
+       if (errno != 0) {
+               fprintf(stderr, "ERROR during iteration: %s\n", strerror(errno));
+               return 3;
+       }
+       return ecode;
+}
index 1a50ed9..faa3363 100644 (file)
@@ -58,7 +58,7 @@ cmd_stat(int ac, const char **av)
        }
        if (w < 16)
                w = 16;
-       printf("%-*.*s ncp  data-use inode-use kaddr\n", w, w, "PATH");
+       printf("%-*.*s ncp  data-use inode-use comp kaddr\n", w, w, "PATH");
        for (i = 0; i < ac; ++i) {
                if ((fd = open(av[i], O_RDONLY)) < 0) {
                        fprintf(stderr, "%s: %s\n", av[i], strerror(errno));
@@ -75,6 +75,7 @@ cmd_stat(int ac, const char **av)
                printf("%9s ", sizetostr(ino.ip_data.data_count));
                printf("%9s ", sizetostr(ino.ip_data.inode_count));
                printf("%p ", ino.kdata);
+               printf("%02x ", ino.ip_data.comp_algo);
                if (ino.ip_data.data_quota || ino.ip_data.inode_quota) {
                        printf(" quota ");
                        printf("%12s", sizetostr(ino.ip_data.data_quota));
index e537293..f4ddf00 100644 (file)
@@ -50,6 +50,7 @@
 #include <sys/udev.h>
 #include <sys/diskslice.h>
 #include <dmsg.h>
+#include <dirent.h>
 
 #include <netinet/in.h>
 #include <netinet/tcp.h>
@@ -135,6 +136,9 @@ int cmd_show(const char *devpath, int dofreemap);
 int cmd_rsainit(const char *dir_path);
 int cmd_rsaenc(const char **keys, int nkeys);
 int cmd_rsadec(const char **keys, int nkeys);
+int cmd_setcomp(char* comp_string, char* file_string);
+int cmd_setcomp_recursive(char* option_string, char* comp_string,
+       char* file_string);
 
 /*
  * Misc functions
@@ -150,3 +154,6 @@ uint32_t hammer2_icrc32(const void *buf, size_t size);
 uint32_t hammer2_icrc32c(const void *buf, size_t size, uint32_t crc);
 
 void hammer2_shell_parse(dmsg_msg_t *msg);
+int setcomp_recursive_call(char *directory, int comp_method,
+       int set_files);
+void print_inode(char* inode_string);
index d622250..19a66cb 100644 (file)
@@ -330,6 +330,25 @@ main(int ac, char **av)
                } else {
                        cmd_show(av[1], 1);
                }
+       } else if (strcmp(av[0], "setcomp") == 0) {
+               if (ac < 3 || ac > 4) {
+                       fprintf(stderr, "setcomp: requires compression method and "
+                               "directory/file path\n");
+                       usage(1);
+               } else {
+                       if (ac == 3) //no option specified, no recursion by default
+                               ecode = cmd_setcomp(av[1], av[2]);
+                       else
+                               ecode = cmd_setcomp_recursive(av[1], av[2], av[3]);
+                       if (ecode == 0) printf("Compression mode set.\n");                      
+               }
+       } else if (strcmp(av[0], "printinode") == 0) {
+               if (ac != 2) {
+                       fprintf(stderr, "printinode: requires directory/file path\n");
+                       usage(1);
+               }
+               else
+                       print_inode(av[1]);
        } else {
                fprintf(stderr, "Unrecognized command: %s\n", av[0]);
                usage(1);
@@ -376,6 +395,7 @@ usage(int code)
                "    rsainit            Initialize rsa fields\n"
                "    show devpath       Raw hammer2 media dump\n"
                "    freemap devpath    Raw hammer2 media dump\n"
+               "    setcomp comp_algo directory   Sets compression with comp_algo (0-3) on a directory\n"
        );
        exit(code);
 }
diff --git a/sbin/hammer2/print_inode.c b/sbin/hammer2/print_inode.c
new file mode 100644 (file)
index 0000000..7a749c8
--- /dev/null
@@ -0,0 +1,81 @@
+/*
+ * Copyright (c) 2013 The DragonFly Project.  All rights reserved.
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ * 3. Neither the name of The DragonFly Project nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific, prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
+ * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
+ * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "hammer2.h"
+
+void 
+print_inode(char* inode_string)
+{
+       printf("Printing the inode's contents of directory/file %s\n", inode_string);
+       int fd = hammer2_ioctl_handle(inode_string);
+       if (fd != -1) {
+               hammer2_ioc_inode_t inode;
+               int res = ioctl(fd, HAMMER2IOC_INODE_GET, &inode);
+               hammer2_inode_data_t inode_data;
+               inode_data = inode.ip_data;
+               printf("Got res = %d\n", res);
+               printf("Printing inode data.\n");
+               /*printf("version = %d\n", inode_data.version);
+               printf("uflags = %d\n", inode_data.uflags);
+               printf("rmajor = %d\n", inode_data.rmajor);
+               printf("rminor = %d\n", inode_data.rminor);
+               printf("ctime = %u !\n", (unsigned int)inode_data.ctime);
+               printf("mtime = %u !\n", (unsigned int)inode_data.mtime);*/
+               printf("type = %d\n", inode_data.type);
+               printf("op_flags = %d\n", inode_data.op_flags);
+               /*printf("cap_flags = %d\n", inode_data.cap_flags);
+               printf("mode = %d\n", inode_data.mode);
+               printf("inum = %u !\n", (unsigned int)inode_data.inum);
+               printf("size = %u !\n", (unsigned int)inode_data.size),*/
+               printf("name_key = %u !\n", (unsigned int)inode_data.name_key);
+               /*printf("name_len = %d\n", inode_data.name_len);
+               printf("ncopies = %d\n", inode_data.ncopies);*/
+               printf("comp_algo = %d\n", inode_data.comp_algo);
+               if (inode_data.op_flags != HAMMER2_OPFLAG_DIRECTDATA) {
+                       int i;
+                       for (i = 0; i < HAMMER2_SET_COUNT; ++i) {
+                               if (inode_data.u.blockset.blockref[i].type != HAMMER2_BREF_TYPE_EMPTY) {
+                                       printf("blockrefs %d type = %d\n", i, inode_data.u.blockset.blockref[i].type);
+                                       printf("blockrefs %d methods = %d\n", i, inode_data.u.blockset.blockref[i].methods);
+                                       printf("blockrefs %d copyid = %d\n", i, inode_data.u.blockset.blockref[i].copyid);
+                                       printf("blockrefs %d flags = %d\n", i, inode_data.u.blockset.blockref[i].flags);
+                                       printf("blockrefs %d key = %u !\n", i, (unsigned int)inode_data.u.blockset.blockref[i].key);
+                               }
+                               else
+                                       printf("blockrefs %d is empty.\n", i);
+                               }
+                       }
+               else {
+                       printf("This inode has data instead of blockrefs.\n");
+               }
+       }
+}
index 888e7de..6f0f2d2 100644 (file)
@@ -1,12 +1,17 @@
 # Makefile for hammer2 vfs
 #
 #
-.PATH: ${.CURDIR}
+.PATH: ${.CURDIR} ${.CURDIR}/zlib
 
 CFLAGS+= -DINVARIANTS -DSMP
 KMOD=  hammer2
 SRCS=  hammer2_vfsops.c hammer2_vnops.c hammer2_inode.c hammer2_ccms.c
 SRCS+= hammer2_chain.c hammer2_flush.c hammer2_freemap.c
 SRCS+= hammer2_ioctl.c hammer2_msgops.c hammer2_subr.c
+SRCS+=  hammer2_lz4.c
+SRCS+=  hammer2_zlib_adler32.c hammer2_zlib_deflate.c
+SRCS+=  hammer2_zlib_inffast.c hammer2_zlib_inflate.c
+SRCS+=  hammer2_zlib_inftrees.c hammer2_zlib_trees.c
+SRCS+=  hammer2_zlib_zutil.c
 
 .include <bsd.kmod.mk>
index 9096549..63a1d3f 100644 (file)
@@ -63,6 +63,8 @@
 #include <sys/buf2.h>
 #include <sys/signal2.h>
 #include <sys/dmsg.h>
+#include <sys/mutex.h>
+#include <sys/mutex2.h>
 
 #include "hammer2_disk.h"
 #include "hammer2_mount.h"
@@ -144,6 +146,7 @@ struct hammer2_chain {
        struct hammer2_chain    *next_parent;
        struct hammer2_state    *state;         /* if active cache msg */
        struct hammer2_mount    *hmp;
+       struct hammer2_pfsmount *pmp;           /* can be NULL */
 
        hammer2_tid_t   modify_tid;             /* snapshot/flush filter */
        hammer2_tid_t   delete_tid;
@@ -193,6 +196,7 @@ RB_PROTOTYPE(hammer2_chain_tree, hammer2_chain, rbnode, hammer2_chain_cmp);
 #define HAMMER2_CHAIN_ONRBTREE         0x00004000      /* on parent RB tree */
 #define HAMMER2_CHAIN_SNAPSHOT         0x00008000      /* snapshot special */
 #define HAMMER2_CHAIN_EMBEDDED         0x00010000      /* embedded data */
+#define HAMMER2_CHAIN_HARDLINK         0x00020000      /* converted to hardlink */
 
 /*
  * Flags passed to hammer2_chain_lookup() and hammer2_chain_next()
@@ -305,6 +309,8 @@ struct hammer2_inode {
        hammer2_tid_t           inum;
        u_int                   flags;
        u_int                   refs;           /* +vpref, +flushref */
+       hammer2_off_t           size;
+       uint64_t                mtime;
 };
 
 typedef struct hammer2_inode hammer2_inode_t;
@@ -313,6 +319,8 @@ typedef struct hammer2_inode hammer2_inode_t;
 #define HAMMER2_INODE_SROOT            0x0002  /* kmalloc special case */
 #define HAMMER2_INODE_RENAME_INPROG    0x0004
 #define HAMMER2_INODE_ONRBTREE         0x0008
+#define HAMMER2_INODE_RESIZED          0x0010
+#define HAMMER2_INODE_MTIME            0x0020
 
 int hammer2_inode_cmp(hammer2_inode_t *ip1, hammer2_inode_t *ip2);
 RB_PROTOTYPE2(hammer2_inode_tree, hammer2_inode, rbnode, hammer2_inode_cmp,
@@ -373,8 +381,9 @@ struct hammer2_trans {
 
 typedef struct hammer2_trans hammer2_trans_t;
 
-#define HAMMER2_TRANS_ISFLUSH          0x0001
+#define HAMMER2_TRANS_ISFLUSH          0x0001  /* formal flush */
 #define HAMMER2_TRANS_RESTRICTED       0x0002  /* snapshot flush restrict */
+#define HAMMER2_TRANS_BUFCACHE         0x0004  /* from bioq strategy write */
 
 #define HAMMER2_FREEMAP_HEUR_NRADIX    4       /* pwr 2 PBUFRADIX-MINIORADIX */
 #define HAMMER2_FREEMAP_HEUR_TYPES     8
@@ -411,6 +420,9 @@ struct hammer2_mount {
        int             volhdrno;       /* last volhdrno written */
        hammer2_volume_data_t voldata;
        hammer2_volume_data_t volsync;  /* synchronized voldata */
+       struct bio_queue_head wthread_bioq; /* bio queue for write thread */
+       struct mtx wthread_mtx;     /* mutex for write thread */
+       int     wthread_destroy;    /* to control the write thread */
 };
 
 typedef struct hammer2_mount hammer2_mount_t;
@@ -448,6 +460,9 @@ struct hammer2_pfsmount {
        kdmsg_iocom_t           iocom;
        struct spinlock         inum_spin;      /* inumber lookup */
        struct hammer2_inode_tree inum_tree;
+       long                    inmem_inodes;
+       long                    inmem_chains;
+       int                     inmem_waiting;
 };
 
 typedef struct hammer2_pfsmount hammer2_pfsmount_t;
@@ -535,7 +550,7 @@ hammer2_chain_refactor_test(hammer2_chain_t *chain, int traverse_hlink)
        }
        if (traverse_hlink &&
            chain->bref.type == HAMMER2_BREF_TYPE_INODE &&
-           chain->data->ipdata.type == HAMMER2_OBJTYPE_HARDLINK &&
+           (chain->flags & HAMMER2_CHAIN_HARDLINK) &&
            chain->next_parent &&
            (chain->next_parent->flags & HAMMER2_CHAIN_SNAPSHOT) == 0) {
                return(1);
@@ -572,6 +587,14 @@ extern long hammer2_ioa_indr_write;
 extern long hammer2_ioa_fmap_write;
 extern long hammer2_ioa_volu_write;
 
+extern struct objcache *cache_buffer_read;
+extern struct objcache *cache_buffer_write;
+
+extern int destroy;
+extern int write_thread_wakeup;
+
+extern mtx_t thread_protect;
+
 /*
  * hammer2_subr.c
  */
@@ -606,6 +629,7 @@ int hammer2_getradix(size_t bytes);
 
 int hammer2_calc_logical(hammer2_inode_t *ip, hammer2_off_t uoff,
                         hammer2_key_t *lbasep, hammer2_key_t *leofp);
+int hammer2_calc_physical(hammer2_inode_t *ip, hammer2_key_t lbase);
 void hammer2_update_time(uint64_t *timep);
 
 /*
@@ -635,7 +659,8 @@ int hammer2_inode_connect(hammer2_trans_t *trans, int hlink,
                        const uint8_t *name, size_t name_len);
 hammer2_inode_t *hammer2_inode_common_parent(hammer2_inode_t *fdip,
                        hammer2_inode_t *tdip);
-
+void hammer2_inode_fsync(hammer2_trans_t *trans, hammer2_inode_t *ip,
+                       hammer2_chain_t **parentp);
 int hammer2_unlink_file(hammer2_trans_t *trans, hammer2_inode_t *dip,
                        const uint8_t *name, size_t name_len, int isdir,
                        int *hlinkp);
@@ -651,9 +676,8 @@ int hammer2_hardlink_find(hammer2_inode_t *dip,
  * hammer2_chain.c
  */
 void hammer2_modify_volume(hammer2_mount_t *hmp);
-hammer2_chain_t *hammer2_chain_alloc(hammer2_mount_t *hmp,
-                               hammer2_trans_t *trans,
-                               hammer2_blockref_t *bref);
+hammer2_chain_t *hammer2_chain_alloc(hammer2_mount_t *hmp, hammer2_pfsmount_t *pmp,
+                               hammer2_trans_t *trans, hammer2_blockref_t *bref);
 void hammer2_chain_core_alloc(hammer2_chain_t *chain,
                                hammer2_chain_core_t *core);
 void hammer2_chain_ref(hammer2_chain_t *chain);
@@ -670,7 +694,6 @@ hammer2_inode_data_t *hammer2_chain_modify_ip(hammer2_trans_t *trans,
                                hammer2_inode_t *ip, hammer2_chain_t **chainp,
                                int flags);
 void hammer2_chain_resize(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                               struct buf *bp,
                                hammer2_chain_t *parent,
                                hammer2_chain_t **chainp,
                                int nradix, int flags);
@@ -713,6 +736,9 @@ void hammer2_chain_flush(hammer2_trans_t *trans, hammer2_chain_t *chain);
 void hammer2_chain_commit(hammer2_trans_t *trans, hammer2_chain_t *chain);
 void hammer2_chain_setsubmod(hammer2_trans_t *trans, hammer2_chain_t *chain);
 
+void hammer2_chain_memory_wait(hammer2_pfsmount_t *pmp);
+void hammer2_chain_memory_wakeup(hammer2_pfsmount_t *pmp);
+
 /*
  * hammer2_trans.c
  */
index 692091a..c187f19 100644 (file)
@@ -72,6 +72,7 @@ static int hammer2_indirect_optimize; /* XXX SYSCTL */
 static hammer2_chain_t *hammer2_chain_create_indirect(
                hammer2_trans_t *trans, hammer2_chain_t *parent,
                hammer2_key_t key, int keybits, int for_type, int *errorp);
+static void hammer2_chain_drop_data(hammer2_chain_t *chain, int lastdrop);
 static void adjreadcounter(hammer2_blockref_t *bref, size_t bytes);
 
 /*
@@ -97,6 +98,20 @@ hammer2_chain_cmp(hammer2_chain_t *chain1, hammer2_chain_t *chain2)
                return(-1);
        if (chain1->delete_tid > chain2->delete_tid)
                return(1);
+
+       /*
+        * Multiple deletions in the same transaction are possible.  We
+        * still need to detect SMP races on _get() so only do this
+        * conditionally.
+        */
+       if ((chain1->flags & HAMMER2_CHAIN_DELETED) &&
+           (chain2->flags & HAMMER2_CHAIN_DELETED)) {
+               if (chain1 < chain2)
+                       return(-1);
+               if (chain1 > chain2)
+                       return(1);
+       }
+
        return(0);
 }
 
@@ -150,8 +165,8 @@ hammer2_chain_setsubmod(hammer2_trans_t *trans, hammer2_chain_t *chain)
  * NOTE: Returns a referenced but unlocked (because there is no core) chain.
  */
 hammer2_chain_t *
-hammer2_chain_alloc(hammer2_mount_t *hmp, hammer2_trans_t *trans,
-                   hammer2_blockref_t *bref)
+hammer2_chain_alloc(hammer2_mount_t *hmp, hammer2_pfsmount_t *pmp,
+                   hammer2_trans_t *trans, hammer2_blockref_t *bref)
 {
        hammer2_chain_t *chain;
        u_int bytes = 1U << (int)(bref->data_off & HAMMER2_OFF_MASK_RADIX);
@@ -165,7 +180,16 @@ hammer2_chain_alloc(hammer2_mount_t *hmp, hammer2_trans_t *trans,
        case HAMMER2_BREF_TYPE_FREEMAP_NODE:
        case HAMMER2_BREF_TYPE_DATA:
        case HAMMER2_BREF_TYPE_FREEMAP_LEAF:
+               /*
+                * Chains are really only associated with the hmp but we maintain
+                * a pmp association for per-mount memory tracking purposes.  The
+                * pmp can be NULL.
+                */
                chain = kmalloc(sizeof(*chain), hmp->mchain, M_WAITOK | M_ZERO);
+               if (pmp) {
+                       chain->pmp = pmp;
+                       atomic_add_long(&pmp->inmem_chains, 1);
+               }
                break;
        case HAMMER2_BREF_TYPE_VOLUME:
        case HAMMER2_BREF_TYPE_FREEMAP:
@@ -289,6 +313,7 @@ static
 hammer2_chain_t *
 hammer2_chain_lastdrop(hammer2_chain_t *chain)
 {
+       hammer2_pfsmount_t *pmp;
        hammer2_mount_t *hmp;
        hammer2_chain_core_t *above;
        hammer2_chain_core_t *core;
@@ -315,6 +340,7 @@ hammer2_chain_lastdrop(hammer2_chain_t *chain)
        }
 
        hmp = chain->hmp;
+       pmp = chain->pmp;       /* can be NULL */
        rdrop1 = NULL;
        rdrop2 = NULL;
 
@@ -419,10 +445,41 @@ hammer2_chain_lastdrop(hammer2_chain_t *chain)
        KKASSERT((chain->flags & (HAMMER2_CHAIN_MOVED |
                                  HAMMER2_CHAIN_MODIFIED)) == 0);
 
+       hammer2_chain_drop_data(chain, 1);
+
+       KKASSERT(chain->bp == NULL);
+       chain->hmp = NULL;
+
+       if (chain->flags & HAMMER2_CHAIN_ALLOCATED) {
+               chain->flags &= ~HAMMER2_CHAIN_ALLOCATED;
+               kfree(chain, hmp->mchain);
+               if (pmp) {
+                       atomic_add_long(&pmp->inmem_chains, -1);
+                       hammer2_chain_memory_wakeup(pmp);
+               }
+       }
+       if (rdrop1 && rdrop2) {
+               hammer2_chain_drop(rdrop1);
+               return(rdrop2);
+       } else if (rdrop1)
+               return(rdrop1);
+       else
+               return(rdrop2);
+}
+
+/*
+ * Dispose of chain->data on either the last lock release or the last drop.
+ */
+static void
+hammer2_chain_drop_data(hammer2_chain_t *chain, int lastdrop)
+{
+       hammer2_mount_t *hmp = chain->hmp;
+
        switch(chain->bref.type) {
        case HAMMER2_BREF_TYPE_VOLUME:
        case HAMMER2_BREF_TYPE_FREEMAP:
-               chain->data = NULL;
+               if (lastdrop)
+                       chain->data = NULL;
                break;
        case HAMMER2_BREF_TYPE_INODE:
                if (chain->data) {
@@ -440,23 +497,9 @@ hammer2_chain_lastdrop(hammer2_chain_t *chain)
                KKASSERT(chain->data == NULL);
                break;
        }
-
-       KKASSERT(chain->bp == NULL);
-       chain->hmp = NULL;
-
-       if (chain->flags & HAMMER2_CHAIN_ALLOCATED) {
-               chain->flags &= ~HAMMER2_CHAIN_ALLOCATED;
-               kfree(chain, hmp->mchain);
-       }
-       if (rdrop1 && rdrop2) {
-               hammer2_chain_drop(rdrop1);
-               return(rdrop2);
-       } else if (rdrop1)
-               return(rdrop1);
-       else
-               return(rdrop2);
 }
 
 /*
  * Ref and lock a chain element, acquiring its data with I/O if necessary,
  * and specify how you would like the data to be resolved.
@@ -901,6 +944,8 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
         */
        if (chain->bp == NULL) {
                atomic_clear_int(&chain->flags, HAMMER2_CHAIN_DIRTYBP);
+               if ((chain->flags & HAMMER2_CHAIN_MODIFIED) == 0)
+                       hammer2_chain_drop_data(chain, 0);
                ccms_thread_unlock_upgraded(&core->cst, ostate);
                hammer2_chain_drop(chain);
                return;
@@ -1027,7 +1072,6 @@ hammer2_chain_unlock(hammer2_chain_t *chain)
  */
 void
 hammer2_chain_resize(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                    struct buf *bp,
                     hammer2_chain_t *parent, hammer2_chain_t **chainp,
                     int nradix, int flags)
 {
@@ -1466,8 +1510,19 @@ hammer2_chain_find_callback(hammer2_chain_t *child, void *data)
        struct hammer2_chain_find_info *info = data;
 
        if (info->delete_tid < child->delete_tid) {
-               info->delete_tid = child->delete_tid;
-               info->best = child;
+               /*
+                * Normally the child with the larger delete_tid (which would
+                * be MAX_TID if the child is not deleted) wins.  However, if
+                * the child was deleted AND flushed (DELETED set and MOVED
+                * no longer set), the parent bref is now valid and we don't
+                * want the child to improperly shadow it.
+                */
+               if ((child->flags &
+                    (HAMMER2_CHAIN_DELETED | HAMMER2_CHAIN_MOVED)) !=
+                   HAMMER2_CHAIN_DELETED) {
+                       info->delete_tid = child->delete_tid;
+                       info->best = child;
+               }
        }
        return(0);
 }
@@ -1608,7 +1663,7 @@ retry:
         *
         * The locking operation we do later will issue I/O to read it.
         */
-       chain = hammer2_chain_alloc(hmp, NULL, bref);
+       chain = hammer2_chain_alloc(hmp, parent->pmp, NULL, bref);
        hammer2_chain_core_alloc(chain, NULL);  /* ref'd chain returned */
 
        /*
@@ -2312,7 +2367,7 @@ hammer2_chain_create(hammer2_trans_t *trans, hammer2_chain_t **parentp,
                dummy.keybits = keybits;
                dummy.data_off = hammer2_getradix(bytes);
                dummy.methods = parent->bref.methods;
-               chain = hammer2_chain_alloc(hmp, trans, &dummy);
+               chain = hammer2_chain_alloc(hmp, parent->pmp, trans, &dummy);
                hammer2_chain_core_alloc(chain, NULL);
 
                atomic_set_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
@@ -2591,7 +2646,7 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
        hmp = ochain->hmp;
        if (bref == NULL)
                bref = &ochain->bref;
-       nchain = hammer2_chain_alloc(hmp, trans, bref);
+       nchain = hammer2_chain_alloc(hmp, ochain->pmp, trans, bref);
        hammer2_chain_core_alloc(nchain, ochain->core);
        bytes = (hammer2_off_t)1 <<
                (int)(bref->data_off & HAMMER2_OFF_MASK_RADIX);
@@ -2749,6 +2804,10 @@ hammer2_chain_duplicate(hammer2_trans_t *trans, hammer2_chain_t *parent, int i,
  * locked parent.  (*chainp) is marked DELETED and atomically replaced
  * with a duplicate.  Atomicy is at the very-fine spin-lock level in
  * order to ensure that lookups do not race us.
+ *
+ * If the input chain is already marked deleted the duplicated chain will
+ * also be marked deleted.  This case can occur when an inode is removed
+ * from the filesystem but programs still have an open descriptor to it.
  */
 void
 hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
@@ -2762,12 +2821,23 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        int oflags;
        void *odata;
 
+       ochain = *chainp;
+       oflags = ochain->flags;
+       hmp = ochain->hmp;
+
+       /*
+        * Shortcut the DELETED case if possible (only when delete_tid
+        * already matches the transaction id).
+        */
+       if ((oflags & HAMMER2_CHAIN_DELETED) &&
+           ochain->delete_tid == trans->sync_tid) {
+               return;
+       }
+
        /*
         * First create a duplicate of the chain structure
         */
-       ochain = *chainp;
-       hmp = ochain->hmp;
-       nchain = hammer2_chain_alloc(hmp, trans, &ochain->bref);    /* 1 ref */
+       nchain = hammer2_chain_alloc(hmp, ochain->pmp, trans, &ochain->bref);    /* 1 ref */
        if (flags & HAMMER2_DELDUP_RECORE)
                hammer2_chain_core_alloc(nchain, NULL);
        else
@@ -2782,9 +2852,14 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
        nchain->inode_count += ochain->inode_count;
 
        /*
-        * Lock nchain and insert into ochain's core hierarchy, marking
-        * ochain DELETED at the same time.  Having both chains locked
-        * is extremely important for atomicy.
+        * Lock nchain so both chains are now locked (extremely important
+        * for atomicity).  Mark ochain deleted, reinsert it into the
+        * topology, and insert nchain, all in one go.
+        *
+        * If the ochain is already deleted it is left alone and nchain
+        * is inserted into the topology as a deleted chain.  This is
+        * important because it allows ongoing operations to be executed
+        * on a deleted inode which still has open descriptors.
         */
        hammer2_chain_lock(nchain, HAMMER2_RESOLVE_NEVER);
        hammer2_chain_dup_fixup(ochain, nchain);
@@ -2792,18 +2867,31 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
 
        nchain->index = ochain->index;
 
+       KKASSERT(ochain->flags & HAMMER2_CHAIN_ONRBTREE);
        spin_lock(&above->cst.spin);
-       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_ONRBTREE);
-       ochain->delete_tid = trans->sync_tid;
+       KKASSERT(ochain->flags & HAMMER2_CHAIN_ONRBTREE);
+
+       if (oflags & HAMMER2_CHAIN_DELETED) {
+               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_DELETED);
+               nchain->delete_tid = trans->sync_tid;
+       } else {
+               RB_REMOVE(hammer2_chain_tree, &above->rbtree, ochain);
+               ochain->delete_tid = trans->sync_tid;
+               atomic_set_int(&ochain->flags, HAMMER2_CHAIN_DELETED);
+               if (RB_INSERT(hammer2_chain_tree, &above->rbtree, ochain))
+                       panic("chain_delete: reinsertion failed %p", ochain);
+       }
+
        nchain->above = above;
-       atomic_set_int(&ochain->flags, HAMMER2_CHAIN_DELETED);
+       atomic_set_int(&nchain->flags, HAMMER2_CHAIN_ONRBTREE);
+       if (RB_INSERT(hammer2_chain_tree, &above->rbtree, nchain)) {
+               panic("hammer2_chain_delete_duplicate: collision");
+       }
+
        if ((ochain->flags & HAMMER2_CHAIN_MOVED) == 0) {
                hammer2_chain_ref(ochain);
                atomic_set_int(&ochain->flags, HAMMER2_CHAIN_MOVED);
        }
-       if (RB_INSERT(hammer2_chain_tree, &above->rbtree, nchain)) {
-               panic("hammer2_chain_delete_duplicate: collision");
-       }
        spin_unlock(&above->cst.spin);
 
        /*
@@ -2811,7 +2899,6 @@ hammer2_chain_delete_duplicate(hammer2_trans_t *trans, hammer2_chain_t **chainp,
         * case (data == NULL) to catch any extra locks that might have been
         * present, then transfer state to nchain.
         */
-       oflags = ochain->flags;
        odata = ochain->data;
        hammer2_chain_unlock(ochain);   /* replacing ochain */
        KKASSERT(ochain->bref.type == HAMMER2_BREF_TYPE_INODE ||
@@ -3210,7 +3297,7 @@ hammer2_chain_create_indirect(hammer2_trans_t *trans, hammer2_chain_t *parent,
        dummy.bref.data_off = hammer2_getradix(nbytes);
        dummy.bref.methods = parent->bref.methods;
 
-       ichain = hammer2_chain_alloc(hmp, trans, &dummy.bref);
+       ichain = hammer2_chain_alloc(hmp, parent->pmp, trans, &dummy.bref);
        atomic_set_int(&ichain->flags, HAMMER2_CHAIN_INITIAL);
        hammer2_chain_core_alloc(ichain, NULL);
        icore = ichain->core;
@@ -3627,13 +3714,19 @@ hammer2_chain_delete(hammer2_trans_t *trans, hammer2_chain_t *chain, int flags)
         * We need the spinlock on the core whose RBTREE contains chain
         * to protect against races.
         */
+       KKASSERT(chain->flags & HAMMER2_CHAIN_ONRBTREE);
        spin_lock(&chain->above->cst.spin);
+
+       RB_REMOVE(hammer2_chain_tree, &chain->above->rbtree, chain);
+       chain->delete_tid = trans->sync_tid;
        atomic_set_int(&chain->flags, HAMMER2_CHAIN_DELETED);
+       if (RB_INSERT(hammer2_chain_tree, &chain->above->rbtree, chain))
+               panic("chain_delete: reinsertion failed %p", chain);
+
        if ((chain->flags & HAMMER2_CHAIN_MOVED) == 0) {
                hammer2_chain_ref(chain);
                atomic_set_int(&chain->flags, HAMMER2_CHAIN_MOVED);
        }
-       chain->delete_tid = trans->sync_tid;
        spin_unlock(&chain->above->cst.spin);
 
        /*
@@ -3652,6 +3745,42 @@ hammer2_chain_wait(hammer2_chain_t *chain)
        tsleep(chain, 0, "chnflw", 1);
 }
 
+/*
+ * Manage excessive memory resource use for chain and related
+ * structures.
+ */
+void
+hammer2_chain_memory_wait(hammer2_pfsmount_t *pmp)
+{
+#if 0
+       while (pmp->inmem_chains > desiredvnodes / 10 &&
+              pmp->inmem_chains > pmp->mp->mnt_nvnodelistsize * 2) {
+               kprintf("w");
+               speedup_syncer();
+               pmp->inmem_waiting = 1;
+               tsleep(&pmp->inmem_waiting, 0, "chnmem", hz);
+       }
+#endif
+#if 0
+       if (pmp->inmem_chains > desiredvnodes / 10 &&
+           pmp->inmem_chains > pmp->mp->mnt_nvnodelistsize * 7 / 4) {
+               speedup_syncer();
+       }
+#endif
+}
+
+void
+hammer2_chain_memory_wakeup(hammer2_pfsmount_t *pmp)
+{
+       if (pmp->inmem_waiting &&
+           (pmp->inmem_chains <= desiredvnodes / 10 ||
+            pmp->inmem_chains <= pmp->mp->mnt_nvnodelistsize * 2)) {
+               kprintf("s");
+               pmp->inmem_waiting = 0;
+               wakeup(&pmp->inmem_waiting);
+       }
+}
+
 static
 void
 adjreadcounter(hammer2_blockref_t *bref, size_t bytes)
index 5e93f3e..49b3118 100644 (file)
@@ -467,7 +467,9 @@ typedef struct hammer2_blockref hammer2_blockref_t;
 #define HAMMER2_DEC_COMP(n)            ((n) & 15)
 
 #define HAMMER2_COMP_NONE              0
-#define HAMMER2_COMP_AUTOZERO          1
+#define HAMMER2_COMP_AUTOZERO          1
+#define HAMMER2_COMP_LZ4               2
+#define HAMMER2_COMP_ZLIB              3
 
 
 /*
index c46045c..c06d435 100644 (file)
@@ -103,6 +103,10 @@ hammer2_updatestats(hammer2_flush_info_t *info, hammer2_blockref_t *bref,
  * with that flush.  They only have to wait for transactions prior to the
  * flush trans to complete before they unstall.
  *
+ * WARNING! Transaction ids are only allocated when the transaction becomes
+ *         active, which allows other transactions to insert ahead of us
+ *         if we are forced to block (only bioq transactions do that).
+ *
  * WARNING! Modifications to the root volume cannot dup the root volume
  *         header to handle synchronization points, so alloc_tid can
  *         wind up (harmlessly) more advanced on flush.
@@ -124,30 +128,54 @@ hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfsmount_t *pmp, int flags)
        hmp = cluster->hmp;
 
        hammer2_voldata_lock(hmp);
-       trans->sync_tid = hmp->voldata.alloc_tid++;
        trans->flags = flags;
        trans->td = curthread;
-       TAILQ_INSERT_TAIL(&hmp->transq, trans, entry);
 
        if (flags & HAMMER2_TRANS_ISFLUSH) {
+               /*
+                * If multiple flushes are trying to run we have to
+                * wait until it is our turn, then set curflush to
+                * indicate that a flush is now pending (but not
+                * necessarily active yet).
+                *
+                * NOTE: Do not set trans->blocked here.
+                */
+               ++hmp->flushcnt;
+               while (hmp->curflush != NULL) {
+                       lksleep(&hmp->curflush, &hmp->voldatalk,
+                               0, "h2multf", hz);
+               }
+               hmp->curflush = trans;
+               TAILQ_INSERT_TAIL(&hmp->transq, trans, entry);
+
                /*
                 * If we are a flush we have to wait for all transactions
                 * prior to our flush synchronization point to complete
                 * before we can start our flush.
+                *
+                * Most importantly, this includes bioq flushes.
+                *
+                * NOTE: Do not set trans->blocked here.
                 */
-               ++hmp->flushcnt;
-               if (hmp->curflush == NULL) {
-                       hmp->curflush = trans;
-                       hmp->topo_flush_tid = trans->sync_tid;
-               }
                while (TAILQ_FIRST(&hmp->transq) != trans) {
                        lksleep(&trans->sync_tid, &hmp->voldatalk,
                                0, "h2syncw", hz);
                }
 
+               /*
+                * Don't assign sync_tid until we become the running
+                * flush.  topo_flush_tid is used to control when
+                * chain modifications in concurrent transactions are
+                * required to delete-duplicate (so as not to disturb
+                * the state of what is being currently flushed).
+                */
+               trans->sync_tid = hmp->voldata.alloc_tid++;
+               hmp->topo_flush_tid = trans->sync_tid;
+
                /*
                 * Once we become the running flush we can wakeup anyone
-                * who blocked on us.
+                * who blocked on us, up to the next flush.  That is,
+                * our flush can run concurrently with frontend operations.
                 */
                scan = trans;
                while ((scan = TAILQ_NEXT(scan, entry)) != NULL) {
@@ -158,25 +186,45 @@ hammer2_trans_init(hammer2_trans_t *trans, hammer2_pfsmount_t *pmp, int flags)
                        scan->blocked = 0;
                        wakeup(&scan->blocked);
                }
+       } else if ((flags & HAMMER2_TRANS_BUFCACHE) && hmp->curflush) {
+               /*
+                * We cannot block if we are the bioq thread.  When a
+                * flush is not pending we can operate normally but
+                * if a flush IS pending the bioq thread's transaction
+                * must be placed either before or after curflush.
+                *
+                * If the current flush is waiting the bioq thread's
+                * transaction is placed before.  If it is running the
+                * bioq thread's transaction is placed after.
+                */
+               scan = TAILQ_FIRST(&hmp->transq);
+               if (scan != hmp->curflush) {
+                       TAILQ_INSERT_BEFORE(hmp->curflush, trans, entry);
+               } else {
+                       TAILQ_INSERT_TAIL(&hmp->transq, trans, entry);
+               }
+               trans->sync_tid = hmp->voldata.alloc_tid++;
        } else {
                /*
-                * If we are not a flush but our sync_tid is after a
-                * stalled flush, we have to wait until that flush unstalls
-                * (that is, all transactions prior to that flush complete),
-                * but then we can run concurrently with that flush.
+                * If this is a normal transaction and not a flush, or
+                * if this is a bioq transaction and no flush is pending,
+                * we can queue normally.
                 *
-                * (flushcnt check only good as pre-condition, otherwise it
-                *  may represent elements queued after us after we block).
+                * Normal transactions must block while a pending flush is
+                * waiting for prior transactions to complete.  Once the
+                * pending flush becomes active we can run concurrently
+                * with it.
                 */
-               if (hmp->flushcnt > 1 ||
-                   (hmp->curflush &&
-                    TAILQ_FIRST(&hmp->transq) != hmp->curflush)) {
+               TAILQ_INSERT_TAIL(&hmp->transq, trans, entry);
+               scan = TAILQ_FIRST(&hmp->transq);
+               if (hmp->curflush && hmp->curflush != scan) {
                        trans->blocked = 1;
                        while (trans->blocked) {
                                lksleep(&trans->blocked, &hmp->voldatalk,
                                        0, "h2trans", hz);
                        }
                }
+               trans->sync_tid = hmp->voldata.alloc_tid++;
        }
        hammer2_voldata_unlock(hmp, 0);
 }
@@ -194,24 +242,16 @@ hammer2_trans_done(hammer2_trans_t *trans)
        hammer2_voldata_lock(hmp);
        TAILQ_REMOVE(&hmp->transq, trans, entry);
        if (trans->flags & HAMMER2_TRANS_ISFLUSH) {
-               /*
-                * If we were a flush we have to adjust curflush to the
-                * next flush.
-                *
-                * flush_tid is used to partition copy-on-write operations
-                * (mostly duplicate-on-modify ops), which is what allows
-                * us to execute a flush concurrent with modifying operations
-                * with higher TIDs.
-                */
                --hmp->flushcnt;
                if (hmp->flushcnt) {
-                       TAILQ_FOREACH(scan, &hmp->transq, entry) {
-                               if (scan->flags & HAMMER2_TRANS_ISFLUSH)
-                                       break;
-                       }
-                       KKASSERT(scan);
-                       hmp->curflush = scan;
-                       hmp->topo_flush_tid = scan->sync_tid;
+                       /*
+                        * If we were a flush then wakeup anyone waiting on
+                        * curflush (i.e. other flushes that want to run).
+                        * Leave topo_flush_tid set (I think we could probably
+                        * clear it to zero here).
+                        */
+                       hmp->curflush = NULL;
+                       wakeup(&hmp->curflush);
                } else {
                        /*
                         * Theoretically we don't have to clear flush_tid
index d4d4bb0..ec2e121 100644 (file)
@@ -744,6 +744,14 @@ hammer2_freemap_free(hammer2_trans_t *trans, hammer2_mount_t *hmp,
        bytes = (size_t)1 << radix;
        class = (bref->type << 8) | hammer2_devblkradix(radix);
 
+       /*
+        * We can't free data allocated by newfs_hammer2.
+        * Assert validity.
+        */
+       if (data_off < hmp->voldata.allocator_beg)
+               return;
+       KKASSERT((data_off & HAMMER2_ZONE_MASK64) >= HAMMER2_ZONE_SEG);
+
        /*
         * Lookup the level1 freemap chain.  The chain must exist.
         */
index c7cc9f5..a21fd4b 100644 (file)
@@ -83,8 +83,8 @@ hammer2_inode_lock_ex(hammer2_inode_t *ip)
         */
 again:
        chain = ip->chain;
+       spin_lock(&chain->core->cst.spin);
        if (hammer2_chain_refactor_test(chain, 1)) {
-               spin_lock(&chain->core->cst.spin);
                while (hammer2_chain_refactor_test(chain, 1))
                        chain = chain->next_parent;
                if (ip->chain != chain) {
@@ -95,6 +95,8 @@ again:
                } else {
                        spin_unlock(&chain->core->cst.spin);
                }
+       } else {
+               spin_unlock(&chain->core->cst.spin);
        }
 
        KKASSERT(chain != NULL);        /* for now */
@@ -277,6 +279,7 @@ hammer2_inode_drop(hammer2_inode_t *ip)
                                        KKASSERT((ip->flags &
                                                  HAMMER2_INODE_SROOT) == 0);
                                        kfree(ip, pmp->minode);
+                                       atomic_add_long(&pmp->inmem_inodes, -1);
                                } else {
                                        KKASSERT(ip->flags &
                                                 HAMMER2_INODE_SROOT);
@@ -482,11 +485,15 @@ again:
         */
        if (pmp) {
                nip = kmalloc(sizeof(*nip), pmp->minode, M_WAITOK | M_ZERO);
+               atomic_add_long(&pmp->inmem_inodes, 1);
+               hammer2_chain_memory_wakeup(pmp);
        } else {
                nip = kmalloc(sizeof(*nip), M_HAMMER2, M_WAITOK | M_ZERO);
                nip->flags = HAMMER2_INODE_SROOT;
        }
        nip->inum = chain->data->ipdata.inum;
+       nip->size = chain->data->ipdata.size;
+       nip->mtime = chain->data->ipdata.mtime;
        hammer2_inode_repoint(nip, NULL, chain);
        nip->pip = dip;                         /* can be NULL */
        if (dip)
@@ -583,12 +590,11 @@ retry:
                chain = NULL;
                ++lhc;
        }
-       if (error == 0) {
-               error = hammer2_chain_create(trans, &parent, &chain,
+       /* hammer2_chain_create() sets the new chain's bref from the parent */
+       if (error == 0)
+               error = hammer2_chain_create(trans, &parent, &chain,
                                             lhc, 0,
                                             HAMMER2_BREF_TYPE_INODE,
                                             HAMMER2_INODE_BYTES);
-       }
 
        /*
         * Cleanup and handle retries.
@@ -619,7 +625,7 @@ retry:
         */
        chain->data->ipdata.inum = trans->sync_tid;
        nip = hammer2_inode_get(dip->pmp, dip, chain);
-       nipdata = &chain->data->ipdata;
+       nipdata = &chain->data->ipdata; /* chain's embedded inode data */
 
        if (vap) {
                KKASSERT(trans->inodes_created == 0);
@@ -630,6 +636,11 @@ retry:
                nipdata->type = HAMMER2_OBJTYPE_DIRECTORY;
                nipdata->inum = 1;
        }
+
+       /* Inherit parent's inode compression mode. */
+       nipdata->comp_algo = dipdata->comp_algo;
+       nipdata->reserved85 = 0;
+
        nipdata->version = HAMMER2_INODE_VERSION_ONE;
        hammer2_update_time(&nipdata->ctime);
        nipdata->mtime = nipdata->ctime;
@@ -951,6 +962,7 @@ hammer2_inode_connect(hammer2_trans_t *trans, int hlink,
                                     HAMMER2_MODIFY_ASSERTNOCOPY);
                KKASSERT(name_len < HAMMER2_INODE_MAXNAME);
                ipdata = &nchain->data->ipdata;
+               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_HARDLINK);
                bcopy(name, ipdata->filename, name_len);
                ipdata->name_key = lhc;
                ipdata->name_len = name_len;
@@ -1033,6 +1045,12 @@ hammer2_inode_repoint(hammer2_inode_t *ip, hammer2_inode_t *pip,
        if (ochain)
                hammer2_chain_drop(ochain);
 
+       /*
+        * Flag hardlink chains so the refactor test can detect them.
+        */
+       if (nchain && nchain->data &&
+           nchain->data->ipdata.type == HAMMER2_OBJTYPE_HARDLINK)
+               atomic_set_int(&nchain->flags, HAMMER2_CHAIN_HARDLINK);
+
        /*
         * Repoint ip->pip if requested (non-NULL pip).
         */
@@ -1315,6 +1333,7 @@ hammer2_hardlink_consolidate(hammer2_trans_t *trans, hammer2_inode_t *ip,
                        hammer2_chain_modify(trans, &chain, 0);
                        hammer2_chain_delete_duplicate(trans, &chain,
                                                       HAMMER2_DELDUP_RECORE);
+                       atomic_set_int(&chain->flags, HAMMER2_CHAIN_HARDLINK);
                        ipdata = &chain->data->ipdata;
                        ipdata->target_type = ipdata->type;
                        ipdata->type = HAMMER2_OBJTYPE_HARDLINK;
@@ -1507,3 +1526,74 @@ hammer2_inode_common_parent(hammer2_inode_t *fdip, hammer2_inode_t *tdip)
        /* NOT REACHED */
        return(NULL);
 }
+
+/*
+ * Synchronize the inode's frontend state with the chain state prior
+ * to any explicit flush of the inode or any strategy write call.
+ *
+ * Called with a locked inode.
+ */
+void
+hammer2_inode_fsync(hammer2_trans_t *trans, hammer2_inode_t *ip,
+                   hammer2_chain_t **chainp)
+{
+       hammer2_inode_data_t *ipdata;
+       hammer2_chain_t *parent;
+       hammer2_chain_t *chain;
+       hammer2_key_t lbase;
+
+       ipdata = &ip->chain->data->ipdata;
+
+       if (ip->flags & HAMMER2_INODE_MTIME) {
+               ipdata = hammer2_chain_modify_ip(trans, ip, chainp, 0);
+               atomic_clear_int(&ip->flags, HAMMER2_INODE_MTIME);
+               ipdata->mtime = ip->mtime;
+       }
+       if ((ip->flags & HAMMER2_INODE_RESIZED) && ip->size < ipdata->size) {
+               ipdata = hammer2_chain_modify_ip(trans, ip, chainp, 0);
+               ipdata->size = ip->size;
+               atomic_clear_int(&ip->flags, HAMMER2_INODE_RESIZED);
+
+               /*
+                * We must delete any chains beyond the EOF.  The chain
+                * straddling the EOF will be pending in the bioq.
+                */
+               lbase = (ipdata->size + HAMMER2_PBUFMASK64) &
+                       ~HAMMER2_PBUFMASK64;
+               parent = hammer2_chain_lookup_init(ip->chain, 0);
+               chain = hammer2_chain_lookup(&parent,
+                                            lbase, (hammer2_key_t)-1,
+                                            HAMMER2_LOOKUP_NODATA);
+               while (chain) {
+                       /*
+                        * Degenerate embedded case, nothing to loop on
+                        */
+                       if (chain->bref.type == HAMMER2_BREF_TYPE_INODE) {
+                               hammer2_chain_unlock(chain);
+                               break;
+                       }
+                       if (chain->bref.type == HAMMER2_BREF_TYPE_DATA) {
+                               hammer2_chain_delete(trans, chain, 0);
+                       }
+                       chain = hammer2_chain_next(&parent, chain,
+                                                  lbase, (hammer2_key_t)-1,
+                                                  HAMMER2_LOOKUP_NODATA);
+               }
+               hammer2_chain_lookup_done(parent);
+       } else
+       if ((ip->flags & HAMMER2_INODE_RESIZED) && ip->size > ipdata->size) {
+               ipdata = hammer2_chain_modify_ip(trans, ip, chainp, 0);
+               ipdata->size = ip->size;
+               atomic_clear_int(&ip->flags, HAMMER2_INODE_RESIZED);
+
+               /*
+                * When resizing larger we may not have any direct-data
+                * available.
+                */
+               if ((ipdata->op_flags & HAMMER2_OPFLAG_DIRECTDATA) &&
+                   ip->size > HAMMER2_EMBEDDED_BYTES) {
+                       ipdata->op_flags &= ~HAMMER2_OPFLAG_DIRECTDATA;
+                       bzero(&ipdata->u.blockset, sizeof(ipdata->u.blockset));
+               }
+       }
+}
index cb8718f..1de50f7 100644 (file)
@@ -57,6 +57,9 @@ static int hammer2_ioctl_pfs_snapshot(hammer2_inode_t *ip, void *data);
 static int hammer2_ioctl_pfs_delete(hammer2_inode_t *ip, void *data);
 static int hammer2_ioctl_inode_get(hammer2_inode_t *ip, void *data);
 static int hammer2_ioctl_inode_set(hammer2_inode_t *ip, void *data);
+//static int hammer2_ioctl_inode_comp_set(hammer2_inode_t *ip, void *data);
+//static int hammer2_ioctl_inode_comp_rec_set(hammer2_inode_t *ip, void *data);
+//static int hammer2_ioctl_inode_comp_rec_set2(hammer2_inode_t *ip, void *data);
 
 int
 hammer2_ioctl(hammer2_inode_t *ip, u_long com, void *data, int fflag,
@@ -129,6 +132,15 @@ hammer2_ioctl(hammer2_inode_t *ip, u_long com, void *data, int fflag,
                if (error == 0)
                        error = hammer2_ioctl_inode_set(ip, data);
                break;
+       /*case HAMMER2IOC_INODE_COMP_SET:
+               error = hammer2_ioctl_inode_comp_set(ip, data);
+               break;
+       case HAMMER2IOC_INODE_COMP_REC_SET:
+               error = hammer2_ioctl_inode_comp_rec_set(ip, data);
+               break;
+       case HAMMER2IOC_INODE_COMP_REC_SET2:
+               error = hammer2_ioctl_inode_comp_rec_set2(ip, data);
+               break;*/
        default:
                error = EOPNOTSUPP;
                break;
@@ -562,20 +574,48 @@ hammer2_ioctl_inode_get(hammer2_inode_t *ip, void *data)
        return (0);
 }
 
+// Original implementation, kept for reference:
+/*static int
+hammer2_ioctl_inode_set(hammer2_inode_t *ip, void *data)
+{
+       hammer2_ioc_inode_t *ino = data;
+       int error = EINVAL;
+
+       if (ino->flags & HAMMER2IOC_INODE_FLAG_IQUOTA) {
+       }
+       if (ino->flags & HAMMER2IOC_INODE_FLAG_DQUOTA) {
+       }
+       if (ino->flags & HAMMER2IOC_INODE_FLAG_COPIES) {
+       }
+
+       return (error);
+}*/
+
+// Use this set function instead of dedicated ioctls for the time being.
 static int
 hammer2_ioctl_inode_set(hammer2_inode_t *ip, void *data)
 {
+       hammer2_inode_data_t *ipdata;
        hammer2_ioc_inode_t *ino = data;
        hammer2_chain_t *parent;
+       hammer2_trans_t trans;
        int error = EINVAL;
 
+       hammer2_trans_init(&trans, ip->pmp, 0);
        parent = hammer2_inode_lock_ex(ip);
+       ipdata = hammer2_chain_modify_ip(&trans, ip, &parent,
+                                                 HAMMER2_MODIFY_ASSERTNOCOPY);
+       ip->chain->data->ipdata = ino->ip_data;
+       ino->kdata = ip;
+       
+       /*Ignore those flags for now...*/
        if (ino->flags & HAMMER2IOC_INODE_FLAG_IQUOTA) {
        }
        if (ino->flags & HAMMER2IOC_INODE_FLAG_DQUOTA) {
        }
        if (ino->flags & HAMMER2IOC_INODE_FLAG_COPIES) {
        }
+       hammer2_trans_done(&trans);
        hammer2_inode_unlock_ex(ip, parent);
 
        return (error);
index b518ea3..0e8932b 100644 (file)
@@ -140,4 +140,8 @@ typedef struct hammer2_ioc_inode hammer2_ioc_inode_t;
 #define HAMMER2IOC_INODE_GET   _IOWR('h', 86, struct hammer2_ioc_inode)
 #define HAMMER2IOC_INODE_SET   _IOWR('h', 87, struct hammer2_ioc_inode)
 
+/*#define HAMMER2IOC_INODE_COMP_SET    _IOWR('h', 88, struct hammer2_ioc_inode) //set compression mode on inode
+#define HAMMER2IOC_INODE_COMP_REC_SET  _IOWR('h', 89, struct hammer2_ioc_inode)
+#define HAMMER2IOC_INODE_COMP_REC_SET2 _IOWR('h', 90, struct hammer2_ioc_inode)*/
+
 #endif
diff --git a/sys/vfs/hammer2/hammer2_lz4.c b/sys/vfs/hammer2/hammer2_lz4.c
new file mode 100644 (file)
index 0000000..90b7e41
--- /dev/null
@@ -0,0 +1,526 @@
+/*
+   LZ4 - Fast LZ compression algorithm
+   Copyright (C) 2011-2013, Yann Collet.
+   BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+
+       * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+       * Redistributions in binary form must reproduce the above
+   copyright notice, this list of conditions and the following disclaimer
+   in the documentation and/or other materials provided with the
+   distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+   You can contact the author at :
+   - LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
+   - LZ4 source repository : http://code.google.com/p/lz4/
+*/
+
+/*
+Note : this source file requires "hammer2_lz4_encoder.h"
+*/
+
+//**************************************
+// Tuning parameters
+//**************************************
+// MEMORY_USAGE :
+// Memory usage formula : N->2^N Bytes (examples : 10 -> 1KB; 12 -> 4KB;
+// 16 -> 64KB; 20 -> 1MB; etc.)
+// Increasing memory usage improves compression ratio
+// Reduced memory usage can improve speed, due to cache effect
+// Default value is 14, for 16KB, which nicely fits into Intel x86 L1 cache
+#define MEMORY_USAGE 14
+
+// HEAPMODE :
+// Selects how the default compression function allocates memory for its
+// hash table: on the stack (0: default, fastest) or on the heap
+// (1: requires memory allocation via malloc).
+// Default allocation strategy is to use stack (HEAPMODE 0)
+// Note : explicit functions *_stack* and *_heap* are unaffected by this setting
+#define HEAPMODE 1
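The MEMORY_USAGE formula above (N -> 2^N bytes) is easy to sanity-check; `hashtable_bytes()` below is a throwaway helper for illustration, not part of the LZ4 source:

```c
#include <assert.h>
#include <stddef.h>

/* MEMORY_USAGE selects the hash table size as 2^N bytes, so the default
 * of 14 yields a 16KB table that fits an x86 L1 data cache. */
static size_t hashtable_bytes(int memory_usage)
{
    return (size_t)1 << memory_usage;
}
```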
+
+// BIG_ENDIAN_NATIVE_BUT_INCOMPATIBLE :
+// This will provide a small boost to performance for big-endian CPUs,
+// but the resulting compressed stream will be incompatible with little-endian CPUs.
+// You can set this option to 1 in situations where data will remain within
+// a closed environment.
+// This option is useless on little-endian CPUs (such as x86)
+//#define BIG_ENDIAN_NATIVE_BUT_INCOMPATIBLE 1
+
+
+//**************************************
+// CPU Feature Detection
+//**************************************
+// 32 or 64 bits ?
+#if (defined(__x86_64__) || defined(_M_X64))   // Detects 64 bits mode
+#  define LZ4_ARCH64 1
+#else
+#  define LZ4_ARCH64 0
+#endif
+
+//This reduced library code is only Little Endian compatible,
+//if the need arises, please look for the appropriate defines in the
+//original complete LZ4 library.
+//Same is true for unaligned memory access which is enabled by default,
+//hardware bit count, also enabled by default, and Microsoft/Visual
+//Studio compilers.
+
+//**************************************
+// Compiler Options
+//**************************************
+#if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L   // C99
+/* "restrict" is a known keyword */
+#else
+#  define restrict // Disable restrict
+#endif
+
+#define GCC_VERSION (__GNUC__ * 100 + __GNUC_MINOR__)
+
+#if (GCC_VERSION >= 302) || (__INTEL_COMPILER >= 800) || defined(__clang__)
+#  define expect(expr,value)    (__builtin_expect ((expr),(value)) )
+#else
+#  define expect(expr,value)    (expr)
+#endif
+
+#define likely(expr)     expect((expr) != 0, 1)
+#define unlikely(expr)   expect((expr) != 0, 0)
+
+
+//**************************************
+// Includes
+//**************************************
+#include <sys/malloc.h> //for malloc macros
+#include "hammer2.h"
+#include "hammer2_lz4.h"
+
+
+//Declaration for kmalloc functions
+MALLOC_DECLARE(C_HASHTABLE);
+MALLOC_DEFINE(C_HASHTABLE, "comphashtable",
+       "A hash table used by LZ4 compression function.");
+
+
+//**************************************
+// Basic Types
+//**************************************
+#if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L   // C99
+# include <stdint.h>
+  typedef uint8_t  BYTE;
+  typedef uint16_t U16;
+  typedef uint32_t U32;
+  typedef  int32_t S32;
+  typedef uint64_t U64;
+#else
+  typedef unsigned char       BYTE;
+  typedef unsigned short      U16;
+  typedef unsigned int        U32;
+  typedef   signed int        S32;
+  typedef unsigned long long  U64;
+#endif
+
+#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)
+#  define _PACKED __attribute__ ((packed))
+#else
+#  define _PACKED
+#endif
+
+#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
+#  pragma pack(push, 1)
+#endif
+
+typedef struct _U16_S { U16 v; } _PACKED U16_S;
+typedef struct _U32_S { U32 v; } _PACKED U32_S;
+typedef struct _U64_S { U64 v; } _PACKED U64_S;
+
+#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
+#  pragma pack(pop)
+#endif
+
+#define A64(x) (((U64_S *)(x))->v)
+#define A32(x) (((U32_S *)(x))->v)
+#define A16(x) (((U16_S *)(x))->v)
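The A16/A32/A64 accessors above read potentially unaligned words through packed single-member structs. A standalone sketch of the same trick, assuming a GCC/Clang compiler (the `__attribute__((packed))` path the code above takes when `__GNUC__` is defined):

```c
#include <assert.h>
#include <stdint.h>

/* Packed single-member struct: tells the compiler the pointee may be
 * unaligned, so it emits a safe (byte-wise if necessary) access. */
typedef struct { uint32_t v; } __attribute__((packed)) U32_S;
#define A32(x) (((U32_S *)(x))->v)

/* Read a 32-bit word from an arbitrary (possibly odd) byte offset. */
static uint32_t read_u32(uint8_t *p)
{
    return A32(p);
}
```

Writing through `A32()` and reading the value back works regardless of host endianness, which is what the test below relies on.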
+
+
+//**************************************
+// Constants
+//**************************************
+#define HASHTABLESIZE (1 << MEMORY_USAGE)
+
+#define MINMATCH 4
+
+#define COPYLENGTH 8
+#define LASTLITERALS 5
+#define MFLIMIT (COPYLENGTH+MINMATCH)
+#define MINLENGTH (MFLIMIT+1)
+
+#define LZ4_64KLIMIT ((1<<16) + (MFLIMIT-1))
+#define SKIPSTRENGTH 6     
+// Increasing this value will make the compression run slower on 
+// incompressible data
+
+#define MAXD_LOG 16
+#define MAX_DISTANCE ((1 << MAXD_LOG) - 1)
+
+#define ML_BITS  4
+#define ML_MASK  ((1U<<ML_BITS)-1)
+#define RUN_BITS (8-ML_BITS)
+#define RUN_MASK ((1U<<RUN_BITS)-1)
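The ML_BITS/RUN_BITS constants above split LZ4's token byte into a 4-bit literal-run length (high nibble) and a 4-bit match length (low nibble), each saturating at 15 to signal that extension bytes follow. A sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define ML_BITS  4
#define ML_MASK  ((1U << ML_BITS) - 1)      /* 15 */
#define RUN_BITS (8 - ML_BITS)
#define RUN_MASK ((1U << RUN_BITS) - 1)     /* 15 */

/* Pack a literal-run length and a match length into one token byte.
 * Values >= 15 saturate at 15 and continue in extension bytes. */
static uint8_t make_token(unsigned litlen, unsigned matchlen)
{
    unsigned lit = litlen   < RUN_MASK ? litlen   : RUN_MASK;
    unsigned mat = matchlen < ML_MASK  ? matchlen : ML_MASK;
    return (uint8_t)((lit << ML_BITS) | mat);
}

static unsigned token_litlen(uint8_t token)   { return token >> ML_BITS; }
static unsigned token_matchlen(uint8_t token) { return token & ML_MASK;  }
```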
+
+
+//**************************************
+// Architecture-specific macros
+//**************************************
+#if LZ4_ARCH64   // 64-bit
+#  define STEPSIZE 8
+#  define UARCH U64
+#  define AARCH A64
+#  define LZ4_COPYSTEP(s,d)       A64(d) = A64(s); d+=8; s+=8;
+#  define LZ4_COPYPACKET(s,d)     LZ4_COPYSTEP(s,d)
+#  define LZ4_SECURECOPY(s,d,e)   if (d<e) LZ4_WILDCOPY(s,d,e)
+#  define HTYPE                   U32
+#  define INITBASE(base)          BYTE* base = ip
+#else      // 32-bit
+#  define STEPSIZE 4
+#  define UARCH U32
+#  define AARCH A32
+#  define LZ4_COPYSTEP(s,d)       A32(d) = A32(s); d+=4; s+=4;
+#  define LZ4_COPYPACKET(s,d)     LZ4_COPYSTEP(s,d); LZ4_COPYSTEP(s,d);
+#  define LZ4_SECURECOPY          LZ4_WILDCOPY
+#  define HTYPE                   BYTE*
+#  define INITBASE(base)          int base = 0
+#endif
+
+#if (defined(LZ4_BIG_ENDIAN) && !defined(BIG_ENDIAN_NATIVE_BUT_INCOMPATIBLE))
+#  define LZ4_READ_LITTLEENDIAN_16(d,s,p) { U16 v = A16(p); \
+                                            v = lz4_bswap16(v); \
+                                            d = (s) - v; }
+#  define LZ4_WRITE_LITTLEENDIAN_16(p,i)  { U16 v = (U16)(i); \
+                                            v = lz4_bswap16(v); \
+                                            A16(p) = v; \
+                                            p+=2; }
+#else      // Little Endian
+#  define LZ4_READ_LITTLEENDIAN_16(d,s,p) { d = (s) - A16(p); }
+#  define LZ4_WRITE_LITTLEENDIAN_16(p,v)  { A16(p) = v; p+=2; }
+#endif
+
+
+//**************************************
+// Macros
+//**************************************
+#define LZ4_WILDCOPY(s,d,e)     do { LZ4_COPYPACKET(s,d) } while (d<e);
+#define LZ4_BLINDCOPY(s,d,l)    { BYTE* e=(d)+(l); LZ4_WILDCOPY(s,d,e); d=e; }
+
+
+//****************************
+// Private functions
+//****************************
+#if LZ4_ARCH64
+
+static
+inline
+int
+LZ4_NbCommonBytes (register U64 val)
+{
+#if defined(LZ4_BIG_ENDIAN)
+    #if defined(_MSC_VER) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    unsigned long r = 0;
+    _BitScanReverse64( &r, val );
+    return (int)(r>>3);
+    #elif defined(__GNUC__) && (GCC_VERSION >= 304) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    return (__builtin_clzll(val) >> 3);
+    #else
+    int r;
+    if (!(val>>32)) { r=4; } else { r=0; val>>=32; }
+    if (!(val>>16)) { r+=2; val>>=8; } else { val>>=24; }
+    r += (!val);
+    return r;
+    #endif
+#else
+    #if defined(_MSC_VER) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    unsigned long r = 0;
+    _BitScanForward64( &r, val );
+    return (int)(r>>3);
+    #elif defined(__GNUC__) && (GCC_VERSION >= 304) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    return (__builtin_ctzll(val) >> 3);
+    #else
+    static int DeBruijnBytePos[64] = { 
+               0, 0, 0, 0, 0, 1, 1, 2, 0, 3,
+               1, 3, 1, 4, 2, 7, 0, 2, 3, 6,
+               1, 5, 3, 5, 1, 3, 4, 4, 2, 5,
+               6, 7, 7, 0, 1, 2, 3, 3, 4, 6,
+               2, 6, 5, 5, 3, 4, 5, 6, 7, 1,
+               2, 4, 6, 4, 4, 5, 7, 2, 6, 5,
+               7, 6, 7, 7 };
+    return DeBruijnBytePos[((U64)((val & -val) * 0x0218A392CDABBD3F)) >> 58];
+    #endif
+#endif
+}
+
+#else
+
+static
+inline
+int
+LZ4_NbCommonBytes (register U32 val)
+{
+#if defined(LZ4_BIG_ENDIAN)
+#  if defined(_MSC_VER) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    unsigned long r = 0;
+    _BitScanReverse( &r, val );
+    return (int)(r>>3);
+#  elif defined(__GNUC__) && (GCC_VERSION >= 304) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    return (__builtin_clz(val) >> 3);
+#  else
+    int r;
+    if (!(val>>16)) { r=2; val>>=8; } else { r=0; val>>=24; }
+    r += (!val);
+    return r;
+#  endif
+#else
+#  if defined(_MSC_VER) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    unsigned long r;
+    _BitScanForward( &r, val );
+    return (int)(r>>3);
+#  elif defined(__GNUC__) && (GCC_VERSION >= 304) && !defined(LZ4_FORCE_SW_BITCOUNT)
+    return (__builtin_ctz(val) >> 3);
+#  else
+    static int DeBruijnBytePos[32] = { 
+               0, 0, 3, 0, 3, 1, 3, 0, 3, 2,
+               2, 1, 3, 2, 0, 1, 3, 3, 1, 2,
+               2, 2, 2, 0, 3, 1, 2, 0, 1, 0,
+               1, 1 };
+    return DeBruijnBytePos[((U32)((val & -(S32)val) * 0x077CB531U)) >> 27];
+#  endif
+#endif
+}
+
+#endif
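On little-endian input, LZ4_NbCommonBytes() turns the XOR of two words into a count of equal low-order bytes: the byte position of the lowest set bit. The sketch below lifts the 32-bit De Bruijn software-fallback path from the code above so it can be exercised standalone:

```c
#include <assert.h>
#include <stdint.h>

/* Byte position of the lowest set bit, via the same De Bruijn multiply
 * used in LZ4_NbCommonBytes()'s 32-bit software fallback. */
static int nb_common_bytes32(uint32_t diff)
{
    static const int DeBruijnBytePos[32] = {
        0, 0, 3, 0, 3, 1, 3, 0, 3, 2,
        2, 1, 3, 2, 0, 1, 3, 3, 1, 2,
        2, 2, 2, 0, 3, 1, 2, 0, 1, 0,
        1, 1
    };
    /* diff & -diff isolates the lowest set bit; the multiply and shift
     * map each of the 32 possible bit positions to a table index. */
    return DeBruijnBytePos[((uint32_t)((diff & -(int32_t)diff) * 0x077CB531U)) >> 27];
}
```

Feeding it `a ^ b` for two unequal words gives the length of their common little-endian byte prefix, which is how the match-extension loop uses it.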
+
+
+
+//******************************
+// Compression functions
+//******************************
+
+#include "hammer2_lz4_encoder.h"
+
+/*
+void* LZ4_create(void);
+int LZ4_free(void* ctx);
+
+Used to allocate and free the hashTable memory
+used by the LZ4_compress_heap* family of functions.
+LZ4_create() returns NULL if memory allocation fails.
+*/
+void*
+LZ4_create(void)
+{
+       return kmalloc(HASHTABLESIZE, C_HASHTABLE, M_INTWAIT);
+}
+
+int
+LZ4_free(void* ctx)
+{
+       kfree(ctx, C_HASHTABLE);
+       return 0;
+}
+
+int
+LZ4_compress_limitedOutput(char* source, char* dest, int inputSize, int maxOutputSize)
+{
+    void* ctx = LZ4_create();
+    int result;
+    if (ctx == NULL) return 0;    // Failed allocation => compression not done
+    if (inputSize < LZ4_64KLIMIT)
+        result = LZ4_compress64k_heap_limitedOutput(ctx, source, dest,
+                       inputSize, maxOutputSize);
+    else result = LZ4_compress_heap_limitedOutput(ctx, source, dest,
+                       inputSize, maxOutputSize);
+    LZ4_free(ctx);
+    return result;
+}
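A return of 0 from LZ4_compress_limitedOutput() signals either a failed allocation or output that would not fit in `maxOutputSize`, so a caller would typically fall back to storing the block uncompressed. A sketch of that caller-side convention; `fake_compress()` and `store_block()` are hypothetical stand-ins, not the real API:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for LZ4_compress_limitedOutput(): succeeds only when the
 * (pretend) compressed size fits within maxOutputSize. */
static int fake_compress(const char *src, char *dst, int inputSize,
                         int maxOutputSize)
{
    int clen = inputSize / 2;            /* pretend a 2:1 ratio */
    if (clen == 0 || clen > maxOutputSize)
        return 0;                        /* 0 => compression not done */
    memcpy(dst, src, clen);              /* placeholder payload */
    return clen;
}

/* Caller convention: on 0, fall back to storing the data uncompressed. */
static int store_block(const char *src, char *dst, int inputSize,
                       int maxOutputSize)
{
    int clen = fake_compress(src, dst, inputSize, maxOutputSize);
    if (clen == 0) {
        if (inputSize > maxOutputSize)
            return -1;                   /* caller must supply a full block */
        memcpy(dst, src, inputSize);
        return inputSize;                /* stored uncompressed */
    }
    return clen;
}
```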
+
+
+//****************************
+// Decompression functions
+//****************************
+
+typedef enum { noPrefix = 0, withPrefix = 1 } prefix64k_directive;
+typedef enum { endOnOutputSize = 0, endOnInputSize = 1 } end_directive;
+typedef enum { full = 0, partial = 1 } exit_directive;
+
+
+// This generic decompression function covers all use cases.
+// It shall be instantiated several times, using different sets of directives.
+// Note that it is essential this generic function is really inlined,
+// in order to remove useless branches during compilation optimisation.
+static
+inline
+int LZ4_decompress_generic(
+                 char* source,
+                 char* dest,
+                 int inputSize,
+                 int outputSize,
+                 // outputSize must be != 0; if endOnInput==endOnInputSize,
+                 // this value is the max size of the output buffer.
+
+                 int endOnInput,         // endOnOutputSize, endOnInputSize
+                 int prefix64k,          // noPrefix, withPrefix
+                 int partialDecoding,    // full, partial
+                 int targetOutputSize    // only used if partialDecoding==partial
+                 )
+{
+    // Local Variables
+    BYTE* restrict ip = (BYTE*) source;
+    BYTE* ref;
+    BYTE* iend = ip + inputSize;
+
+    BYTE* op = (BYTE*) dest;
+    BYTE* oend = op + outputSize;
+    BYTE* cpy;
+    BYTE* oexit = op + targetOutputSize;
+
+    size_t dec32table[] = {0, 3, 2, 3, 0, 0, 0, 0};
+#if LZ4_ARCH64
+    size_t dec64table[] = {0, 0, 0, (size_t)-1, 0, 1, 2, 3};
+#endif
+
+
+    // Special case
+    if ((partialDecoding) && (oexit> oend-MFLIMIT)) oexit = oend-MFLIMIT;
+    // targetOutputSize too large, better decode everything
+    if unlikely(outputSize==0) goto _output_error;
+    // Empty output buffer
+
+
+    // Main Loop
+    while (1)
+    {
+        unsigned token;
+        size_t length;
+
+        // get runlength
+        token = *ip++;
+        if ((length=(token>>ML_BITS)) == RUN_MASK)  
+        { 
+            unsigned s=255; 
+            while (((endOnInput)?ip<iend:1) && (s==255)) 
+            { 
+                s = *ip++; 
+                length += s; 
+            } 
+        }
+
+        // copy literals
+        cpy = op+length;
+        if (((endOnInput) && ((cpy>(partialDecoding?oexit:oend-MFLIMIT)) 
+                       || (ip+length>iend-(2+1+LASTLITERALS))) )
+            || ((!endOnInput) && (cpy>oend-COPYLENGTH)))
+        {
+            if (partialDecoding)
+            {
+                if (cpy > oend) goto _output_error;
+                // Error : write attempt beyond end of output buffer
+                if ((endOnInput) && (ip+length > iend)) goto _output_error;
+                // Error : read attempt beyond end of input buffer
+            }
+            else
+            {
+                if ((!endOnInput) && (cpy != oend)) goto _output_error;
+                // Error : block decoding must stop exactly there,
+                // due to parsing restrictions
+                if ((endOnInput) && ((ip+length != iend) || (cpy > oend))) 
+                                       goto _output_error;
+                                       // Error : not enough place for another match (min 4) + 5 literals
+            }
+            memcpy(op, ip, length);
+            ip += length;
+            op += length;
+            break;
+            // Necessarily EOF, due to parsing restrictions
+        }
+        LZ4_WILDCOPY(ip, op, cpy); ip -= (op-cpy); op = cpy;
+
+        // get offset
+        LZ4_READ_LITTLEENDIAN_16(ref,cpy,ip); ip+=2;
+        if ((prefix64k==noPrefix) && unlikely(ref < (BYTE*)dest))
+                       goto _output_error;   // Error : offset outside destination buffer
+
+        // get matchlength
+        if ((length=(token&ML_MASK)) == ML_MASK) 
+        { 
+            while (endOnInput ? ip<iend-(LASTLITERALS+1) : 1)
+            // A minimum nb of input bytes must remain for LASTLITERALS + token
+            { 
+                unsigned s = *ip++; 
+                length += s; 
+                if (s==255) continue; 
+                break; 
+            } 
+        }
+
+        // copy repeated sequence
+        if unlikely((op-ref)<STEPSIZE)
+        {
+#if LZ4_ARCH64
+            size_t dec64 = dec64table[op-ref];
+#else
+            const size_t dec64 = 0;
+#endif
+            op[0] = ref[0];
+            op[1] = ref[1];
+            op[2] = ref[2];
+            op[3] = ref[3];
+            op += 4, ref += 4; ref -= dec32table[op-ref];
+            A32(op) = A32(ref); 
+            op += STEPSIZE-4; ref -= dec64;
+        } else { LZ4_COPYSTEP(ref,op); }
+        cpy = op + length - (STEPSIZE-4);
+
+        if unlikely(cpy>oend-(COPYLENGTH)-(STEPSIZE-4))
+        {
+            if (cpy > oend-LASTLITERALS) goto _output_error;
+            // Error : last 5 bytes must be literals
+            LZ4_SECURECOPY(ref, op, (oend-COPYLENGTH));
+            while(op<cpy) *op++=*ref++;
+            op=cpy;
+            continue;
+        }
+        LZ4_WILDCOPY(ref, op, cpy);
+        op=cpy;   // correction
+    }
+
+    // end of decoding
+    if (endOnInput)
+       return (int) (((char*)op)-dest);     // Nb of output bytes decoded
+    else
+       return (int) (((char*)ip)-source);   // Nb of input bytes read
+
+    // Overflow error detected
+_output_error:
+    return (int) (-(((char*)ip)-source))-1;
+}
+
+
+int
+LZ4_decompress_safe(char* source, char* dest, int inputSize, int maxOutputSize)
+{
+    return LZ4_decompress_generic(source, dest, inputSize, maxOutputSize,
+                                                       endOnInputSize, noPrefix, full, 0);
+}
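The length coding used throughout the loop above (a 4-bit field of 15 followed by 255-valued extension bytes, then the remainder) can be exercised in isolation; `encode_runlen()` and `decode_runlen()` are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUN_MASK 15U

/* Encode a literal-run length as LZ4 does: a 4-bit field that saturates
 * at 15, then 255-valued extension bytes, then the remainder. Returns
 * the number of extension bytes written. */
static size_t encode_runlen(unsigned len, uint8_t *nibble, uint8_t *ext)
{
    size_t n = 0;
    if (len < RUN_MASK) {
        *nibble = (uint8_t)len;
        return 0;
    }
    *nibble = RUN_MASK;
    len -= RUN_MASK;
    while (len >= 255) { ext[n++] = 255; len -= 255; }
    ext[n++] = (uint8_t)len;
    return n;
}

/* Decode the same coding; *adv reports how many extension bytes were read. */
static unsigned decode_runlen(uint8_t nibble, const uint8_t *ext, size_t *adv)
{
    unsigned len = nibble;
    *adv = 0;
    if (nibble == RUN_MASK) {
        uint8_t s;
        do { s = ext[(*adv)++]; len += s; } while (s == 255);
    }
    return len;
}
```

Note that a length of exactly 15 still needs one extension byte (value 0), matching the `s==255` continuation test in the decoder loop above.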
diff --git a/sys/vfs/hammer2/hammer2_lz4.h b/sys/vfs/hammer2/hammer2_lz4.h
new file mode 100644 (file)
index 0000000..f81dfe0
--- /dev/null
@@ -0,0 +1,93 @@
+/*
+   LZ4 - Fast LZ compression algorithm
+   Header File
+   Copyright (C) 2011-2013, Yann Collet.
+   BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+
+       * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+       * Redistributions in binary form must reproduce the above
+   copyright notice, this list of conditions and the following disclaimer
+   in the documentation and/or other materials provided with the
+   distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+   You can contact the author at :
+   - LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
+   - LZ4 source repository : http://code.google.com/p/lz4/
+*/
+#pragma once
+
+#if defined (__cplusplus)
+extern "C" {
+#endif
+
+
+//**************************************
+// Compiler Options
+//**************************************
+//Should go here if they are needed
+
+//****************************
+// Simple Functions
+//****************************
+
+int LZ4_decompress_safe (char* source, char* dest, int inputSize,
+                                               int maxOutputSize);
+
+/*
+LZ4_decompress_safe() :
+    maxOutputSize : 
+     is the size of the destination buffer (which must be already allocated)
+    return :
+     the number of bytes decoded in the destination buffer
+     (necessarily <= maxOutputSize)
+     If the source stream is malformed or too large, the function will
+     stop decoding and return a negative result.
+     This function is protected against any kind of buffer overflow attempt
+     (it never writes outside of the output buffer, and never reads outside
+     of the input buffer). It is therefore protected against malicious data packets.
+*/
+
+
+//****************************
+// Advanced Functions
+//****************************
+
+int LZ4_compress_limitedOutput(char* source, char* dest, int inputSize,
+                                               int maxOutputSize);
+
+/*
+LZ4_compress_limitedOutput() :
+    Compress 'inputSize' bytes from 'source' into an output buffer 'dest'
+    of maximum size 'maxOutputSize'.
+    If it cannot achieve it, compression will stop, and result of
+    the function will be zero.
+    This function never writes outside of provided output buffer.
+
+    inputSize  : 
+     Max supported value is ~1.9GB
+    maxOutputSize : 
+     is the size of the destination buffer (which must be already allocated)
+    return :
+     the number of bytes written in buffer 'dest' or 0 if the compression fails
+*/
+
+#if defined (__cplusplus)
+}
+#endif
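The header leaves output-buffer sizing to the caller. The upstream LZ4 library provides LZ4_compressBound() for the worst case; the formula below reproduces that upstream bound as an assumption, since this trimmed header does not ship it:

```c
#include <assert.h>

/* Worst-case compressed size for an input of isize bytes; mirrors the
 * LZ4_COMPRESSBOUND() formula from the upstream LZ4 library (not part
 * of this trimmed header). Sizing dest this way ensures
 * LZ4_compress_limitedOutput() cannot fail for lack of output space. */
static int lz4_compress_bound(int isize)
{
    return isize + isize / 255 + 16;
}
```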
diff --git a/sys/vfs/hammer2/hammer2_lz4_encoder.h b/sys/vfs/hammer2/hammer2_lz4_encoder.h
new file mode 100644 (file)
index 0000000..cedabb8
--- /dev/null
@@ -0,0 +1,467 @@
+/*
+   LZ4 Encoder - Part of LZ4 compression algorithm
+   Copyright (C) 2011-2013, Yann Collet.
+   BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions are
+   met:
+
+       * Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+       * Redistributions in binary form must reproduce the above
+   copyright notice, this list of conditions and the following disclaimer
+   in the documentation and/or other materials provided with the
+   distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+   You can contact the author at :
+   - LZ4 homepage : http://fastcompression.blogspot.com/p/lz4.html
+   - LZ4 source repository : http://code.google.com/p/lz4/
+*/
+
+/* hammer2_lz4_encoder.h must be included into hammer2_lz4.c
+   The objective of this file is to create a single LZ4 compression function
+   source which will be instantiated multiple times with minor variations
+   depending on a set of #define.
+*/
+
+void*
+LZ4_create(void);
+int
+LZ4_free(void* ctx);
+
+int
+LZ4_compress_heap_limitedOutput(
+                 void* ctx,
+                 char* source,
+                 char* dest,
+                 int inputSize,
+                 int maxOutputSize);
+                 
+int
+LZ4_compress64k_heap_limitedOutput(
+                 void* ctx,
+                 char* source,
+                 char* dest,
+                 int inputSize,
+                 int maxOutputSize);
+
+
+//****************************
+// Local definitions
+//****************************
+
+#ifdef COMPRESS_64K
+#  define HASHLOG (MEMORY_USAGE-1)
+#  define CURRENT_H_TYPE U16
+#  define CURRENTBASE(base) BYTE* base = ip
+#else
+#  define HASHLOG (MEMORY_USAGE-2)
+#  define CURRENT_H_TYPE HTYPE
+#  define CURRENTBASE(base) INITBASE(base)
+#endif
+
+#define HASHTABLE_NBCELLS  (1U<<HASHLOG)
+#define LZ4_HASH(i)        (((i) * 2654435761U) >> ((MINMATCH*8)-HASHLOG))
+#define LZ4_HASHVALUE(p)   LZ4_HASH(A32(p))
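LZ4_HASH() above is multiplicative (Fibonacci) hashing: multiply the 4 input bytes, read as one word, by 2654435761 (roughly 2^32 divided by the golden ratio) and keep the top HASHLOG bits. A standalone sketch, with HASHLOG fixed at 12 here (the MEMORY_USAGE-2 case from the definitions above):

```c
#include <assert.h>
#include <stdint.h>

#define MINMATCH 4
#define HASHLOG  12                    /* MEMORY_USAGE - 2 in this header */
#define NBCELLS  (1U << HASHLOG)

/* Multiplicative hash of a 4-byte sequence, as in LZ4_HASH(): the
 * multiply mixes all input bits into the top bits, and the shift keeps
 * exactly HASHLOG of them, so the result always indexes the table. */
static uint32_t lz4_hash(uint32_t seq)
{
    return (seq * 2654435761U) >> ((MINMATCH * 8) - HASHLOG);
}
```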
+
+
+
+//****************************
+// Function code
+//****************************
+
+int
+LZ4_compress_heap_limitedOutput(
+                 void* ctx,
+                 char* source,
+                 char* dest,
+                 int inputSize,
+                 int maxOutputSize)
+{
+    CURRENT_H_TYPE* HashTable = (CURRENT_H_TYPE*)ctx;
+
+    BYTE* ip = (BYTE*) source;
+    CURRENTBASE(base);
+    BYTE* anchor = ip;
+    BYTE* iend = ip + inputSize;
+    BYTE* mflimit = iend - MFLIMIT;
+#define matchlimit (iend - LASTLITERALS)
+
+    BYTE* op = (BYTE*) dest;
+    BYTE* oend = op + maxOutputSize;
+
+    int length;
+    int skipStrength = SKIPSTRENGTH;
+    U32 forwardH;
+
+
+    // Init
+    if (inputSize<MINLENGTH) goto _last_literals;
+    
+    memset((void*)HashTable, 0, HASHTABLESIZE);
+
+    // First Byte
+    HashTable[LZ4_HASHVALUE(ip)] = (CURRENT_H_TYPE)(ip - base);
+    ip++;
+    forwardH = LZ4_HASHVALUE(ip);
+
+    // Main Loop
+    for ( ; ; )
+    {
+        int findMatchAttempts = (1U << skipStrength) + 3;
+        BYTE* forwardIp = ip;
+        BYTE* ref;
+        BYTE* token;
+
+        // Find a match
+        do {
+            U32 h = forwardH;
+            int step = findMatchAttempts++ >> skipStrength;
+            ip = forwardIp;
+            forwardIp = ip + step;
+
+            if unlikely(forwardIp > mflimit) { 
+                               goto _last_literals; 
+                       }
+
+            forwardH = LZ4_HASHVALUE(forwardIp);
+            ref = base + HashTable[h];
+            HashTable[h] = (CURRENT_H_TYPE)(ip - base);
+
+        } while ((ref < ip - MAX_DISTANCE) || (A32(ref) != A32(ip)));
+
+        // Catch up
+        while ((ip>anchor) && (ref>(BYTE*)source) && unlikely(ip[-1]==ref[-1])) { 
+                       ip--;
+                       ref--;
+               }
+
+        // Encode Literal length
+        length = (int)(ip - anchor);
+        token = op++;
+
+        if unlikely(op + length + (2 + 1 + LASTLITERALS) + (length>>8) > oend)
+                       return 0;   // Check output limit
+
+        if (length>=(int)RUN_MASK) 
+        { 
+            int len = length-RUN_MASK; 
+            *token=(RUN_MASK<<ML_BITS); 
+            for(; len >= 255 ; len-=255) 
+                               *op++ = 255; 
+            *op++ = (BYTE)len; 
+        }
+        else *token = (BYTE)(length<<ML_BITS);
+
+        // Copy Literals
+        LZ4_BLINDCOPY(anchor, op, length);
+
+_next_match:
+        // Encode Offset
+        LZ4_WRITE_LITTLEENDIAN_16(op,(U16)(ip-ref));
+
+        // Start Counting
+        ip+=MINMATCH; ref+=MINMATCH;    // MinMatch already verified
+        anchor = ip;
+        while likely(ip<matchlimit-(STEPSIZE-1))
+        {
+            UARCH diff = AARCH(ref) ^ AARCH(ip);
+            if (!diff) {
+                               ip+=STEPSIZE;
+                               ref+=STEPSIZE; 
+                               continue; 
+                       }
+            ip += LZ4_NbCommonBytes(diff);
+            goto _endCount;
+        }
+        if (LZ4_ARCH64) if ((ip<(matchlimit-3)) && (A32(ref) == A32(ip))) {
+                       ip+=4;
+                       ref+=4;
+               }
+        if ((ip<(matchlimit-1)) && (A16(ref) == A16(ip))) {
+                       ip+=2;
+                       ref+=2;
+               }
+        if ((ip<matchlimit) && (*ref == *ip)) 
+                       ip++;
+_endCount:
+
+        // Encode MatchLength
+        length = (int)(ip - anchor);
+
+        if unlikely(op + (1 + LASTLITERALS) + (length>>8) > oend) 
+                       return 0;    // Check output limit
+
+        if (length>=(int)ML_MASK) 
+        { 
+            *token += ML_MASK; 
+            length -= ML_MASK; 
+            for (; length > 509 ; length-=510) {
+                               *op++ = 255;
+                               *op++ = 255;
+                       } 
+            if (length >= 255) {
+                               length-=255;
+                               *op++ = 255;
+                       } 
+            *op++ = (BYTE)length; 
+        }
+        else *token += (BYTE)length;
+
+        // Test end of chunk
+        if (ip > mflimit) {
+                       anchor = ip;
+                       break;
+               }
+
+        // Fill table
+        HashTable[LZ4_HASHVALUE(ip-2)] = (CURRENT_H_TYPE)(ip - 2 - base);
+
+        // Test next position
+        ref = base + HashTable[LZ4_HASHVALUE(ip)];
+        HashTable[LZ4_HASHVALUE(ip)] = (CURRENT_H_TYPE)(ip - base);
+        if ((ref >= ip - MAX_DISTANCE) && (A32(ref) == A32(ip))) {
+                       token = op++;
+                       *token=0;
+                       goto _next_match;
+               }
+
+        // Prepare next loop
+        anchor = ip++;
+        forwardH = LZ4_HASHVALUE(ip);
+    }
+
+_last_literals:
+    // Encode Last Literals
+    {
+        int lastRun = (int)(iend - anchor);
+
+        if (((char*)op - dest) + lastRun + 1 + ((lastRun+255-RUN_MASK)/255) > (U32)maxOutputSize)
+                       return 0;  // Check output limit
+
+        if (lastRun>=(int)RUN_MASK) {
+                       *op++=(RUN_MASK<<ML_BITS);
+                       lastRun-=RUN_MASK;
+                       for(; lastRun >= 255 ; lastRun-=255)
+                               *op++ = 255;
+                       *op++ = (BYTE) lastRun;
+               }
+        else *op++ = (BYTE)(lastRun<<ML_BITS);
+        memcpy(op, anchor, iend - anchor);
+        op += iend-anchor;
+    }
+
+    // End
+    return (int) (((char*)op)-dest);
+}
+
+int
+LZ4_compress64k_heap_limitedOutput(
+                 void* ctx,
+                 char* source,
+                 char* dest,
+                 int inputSize,
+                 int maxOutputSize)
+{
+    CURRENT_H_TYPE* HashTable = (CURRENT_H_TYPE*)ctx;
+
+    BYTE* ip = (BYTE*) source;
+    CURRENTBASE(base);
+    BYTE* anchor = ip;
+    BYTE* iend = ip + inputSize;
+    BYTE* mflimit = iend - MFLIMIT;
+#define matchlimit (iend - LASTLITERALS)
+
+    BYTE* op = (BYTE*) dest;
+    BYTE* oend = op + maxOutputSize;
+
+    int length;
+    int skipStrength = SKIPSTRENGTH;
+    U32 forwardH;
+
+
+    // Init
+    if (inputSize<MINLENGTH) goto _last_literals;
+
+    memset((void*)HashTable, 0, HASHTABLESIZE);
+
+    // First Byte
+    HashTable[LZ4_HASHVALUE(ip)] = (CURRENT_H_TYPE)(ip - base);
+    ip++;
+    forwardH = LZ4_HASHVALUE(ip);
+
+    // Main Loop
+    for ( ; ; )
+    {
+        int findMatchAttempts = (1U << skipStrength) + 3;
+        BYTE* forwardIp = ip;
+        BYTE* ref;
+        BYTE* token;
+
+        // Find a match
+        do {
+            U32 h = forwardH;
+            int step = findMatchAttempts++ >> skipStrength;
+            ip = forwardIp;
+            forwardIp = ip + step;
+
+            if unlikely(forwardIp > mflimit) { 
+                               goto _last_literals; 
+                       }
+
+            forwardH = LZ4_HASHVALUE(forwardIp);
+            ref = base + HashTable[h];
+            HashTable[h] = (CURRENT_H_TYPE)(ip - base);
+
+        } while ((ref < ip - MAX_DISTANCE) || (A32(ref) != A32(ip)));
+
+        // Catch up
+        while ((ip>anchor) && (ref>(BYTE*)source) && unlikely(ip[-1]==ref[-1])) { 
+                       ip--;
+                       ref--;
+               }
+
+        // Encode Literal length
+        length = (int)(ip - anchor);
+        token = op++;
+
+        if unlikely(op + length + (2 + 1 + LASTLITERALS) + (length>>8) > oend)
+                       return 0;   // Check output limit
+
+        if (length>=(int)RUN_MASK) 
+        { 
+            int len = length-RUN_MASK; 
+            *token=(RUN_MASK<<ML_BITS); 
+            for(; len >= 255 ; len-=255) 
+                               *op++ = 255; 
+            *op++ = (BYTE)len; 
+        }
+        else *token = (BYTE)(length<<ML_BITS);
+
+        // Copy Literals
+        LZ4_BLINDCOPY(anchor, op, length);
+
+_next_match:
+        // Encode Offset
+        LZ4_WRITE_LITTLEENDIAN_16(op,(U16)(ip-ref));
+
+        // Start Counting
+        ip+=MINMATCH; ref+=MINMATCH;    // MinMatch already verified
+        anchor = ip;
+        while likely(ip<matchlimit-(STEPSIZE-1))
+        {
+            UARCH diff = AARCH(ref) ^ AARCH(ip);
+            if (!diff) {
+                               ip+=STEPSIZE;
+                               ref+=STEPSIZE; 
+                               continue; 
+                       }
+            ip += LZ4_NbCommonBytes(diff);
+            goto _endCount;
+        }
+        if (LZ4_ARCH64) if ((ip<(matchlimit-3)) && (A32(ref) == A32(ip))) {
+                       ip+=4;
+                       ref+=4;
+               }
+        if ((ip<(matchlimit-1)) && (A16(ref) == A16(ip))) {
+                       ip+=2;
+                       ref+=2;
+               }
+        if ((ip<matchlimit) && (*ref == *ip)) 
+                       ip++;
+_endCount:
+
+        // Encode MatchLength
+        length = (int)(ip - anchor);
+
+        if unlikely(op + (1 + LASTLITERALS) + (length>>8) > oend) 
+                       return 0;    // Check output limit
+
+        if (length>=(int)ML_MASK) 
+        { 
+            *token += ML_MASK; 
+            length -= ML_MASK; 
+            for (; length > 509 ; length-=510) {
+                               *op++ = 255;
+                               *op++ = 255;
+                       } 
+            if (length >= 255) {
+                               length-=255;
+                               *op++ = 255;
+                       } 
+            *op++ = (BYTE)length; 
+        }
+        else *token += (BYTE)length;
+
+        // Test end of chunk
+        if (ip > mflimit) {
+                       anchor = ip;
+                       break;
+               }
+
+        // Fill table
+        HashTable[LZ4_HASHVALUE(ip-2)] = (CURRENT_H_TYPE)(ip - 2 - base);
+
+        // Test next position
+        ref = base + HashTable[LZ4_HASHVALUE(ip)];
+        HashTable[LZ4_HASHVALUE(ip)] = (CURRENT_H_TYPE)(ip - base);
+        if ((ref >= ip - MAX_DISTANCE) && (A32(ref) == A32(ip))) {
+                       token = op++;
+                       *token=0;
+                       goto _next_match;
+               }
+
+        // Prepare next loop
+        anchor = ip++;
+        forwardH = LZ4_HASHVALUE(ip);
+    }
+
+_last_literals:
+    // Encode Last Literals
+    {
+        int lastRun = (int)(iend - anchor);
+
+        if (((char*)op - dest) + lastRun + 1 + 
+                       ((lastRun+255-RUN_MASK)/255) > (U32)maxOutputSize)
+                       return 0;  // Check output limit
+
+        if (lastRun>=(int)RUN_MASK) {
+                       *op++=(RUN_MASK<<ML_BITS);
+                       lastRun-=RUN_MASK;
+                       for(; lastRun >= 255 ; lastRun-=255)
+                               *op++ = 255;
+                       *op++ = (BYTE) lastRun;
+               }
+        else *op++ = (BYTE)(lastRun<<ML_BITS);
+        memcpy(op, anchor, iend - anchor);
+        op += iend-anchor;
+    }
+
+    // End
+    return (int) (((char*)op)-dest);
+}
+
+//****************************
+// Clean defines
+//****************************
+
+// Locally Generated
+#undef HASHLOG
+#undef HASHTABLE_NBCELLS
+#undef LZ4_HASH
+#undef LZ4_HASHVALUE
+#undef CURRENT_H_TYPE
+#undef CURRENTBASE
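The length coding used throughout the compressor above follows the LZ4 block format: the token's high nibble holds the literal-run length, saturating at RUN_MASK, with any excess spilled as 255-valued continuation bytes plus one final byte. A standalone sketch of that encoding (not the kernel code; the constants mirror the LZ4 format):

```c
#include <stddef.h>

#define ML_BITS  4
#define RUN_MASK ((1U << (8 - ML_BITS)) - 1)    /* 15 */

/*
 * Encode a literal-run length the way the loops above do: the high
 * nibble of the token holds min(length, RUN_MASK); if it saturates,
 * the remainder is emitted as 255-valued bytes plus one final byte.
 * Returns the number of extra bytes written after the token.
 */
static size_t
lz4_encode_litlen(int length, unsigned char *token, unsigned char *out)
{
    size_t n = 0;

    if (length >= (int)RUN_MASK) {
        int len = length - RUN_MASK;

        *token = RUN_MASK << ML_BITS;
        for (; len >= 255; len -= 255)
            out[n++] = 255;
        out[n++] = (unsigned char)len;
    } else {
        *token = (unsigned char)(length << ML_BITS);
    }
    return n;
}
```

A decoder recovers the length as `RUN_MASK` plus the sum of the continuation bytes, which is why the encoder keeps emitting 255 until the remainder drops below it.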
index 39ff93f..6aa9614 100644 (file)
@@ -339,26 +339,72 @@ hammer2_getradix(size_t bytes)
 
 /*
  * ip must be locked sh/ex
+ *
+ * Use 16KB logical buffers for file blocks <= 1MB and 64KB logical buffers
+ * otherwise.  The write code may utilize smaller device buffers when
+ * compressing or handling the EOF case, but is not able to coalesce smaller
+ * logical buffers into larger device buffers.
+ *
+ * For now this means that even large files will have a bunch of 16KB blocks
+ * at the beginning of the file.  On the plus side this tends to cause small
+ * files to cluster together in the freemap.
  */
 int
 hammer2_calc_logical(hammer2_inode_t *ip, hammer2_off_t uoff,
                     hammer2_key_t *lbasep, hammer2_key_t *leofp)
 {
-       hammer2_inode_data_t *ipdata = &ip->chain->data->ipdata;
-       int radix;
-
-       *lbasep = uoff & ~HAMMER2_PBUFMASK64;
-       *leofp = ipdata->size & ~HAMMER2_PBUFMASK64;
-       KKASSERT(*lbasep <= *leofp);
-       if (*lbasep == *leofp /*&& *leofp < 1024 * 1024*/) {
-               radix = hammer2_getradix((size_t)(ipdata->size - *leofp));
-               if (radix < HAMMER2_MINIORADIX)
-                       radix = HAMMER2_MINIORADIX;
-               *leofp += 1U << radix;
-               return (1U << radix);
+#if 0
+       if (uoff < (hammer2_off_t)1024 * 1024) {
+               if (lbasep)
+                       *lbasep = uoff & ~HAMMER2_LBUFMASK64;
+               if (leofp) {
+                       if (ip->size > (hammer2_key_t)1024 * 1024)
+                               *leofp = (hammer2_key_t)1024 * 1024;
+                       else
+                               *leofp = (ip->size + HAMMER2_LBUFMASK64) &
+                                        ~HAMMER2_LBUFMASK64;
+               }
+               return (HAMMER2_LBUFSIZE);
        } else {
+#endif
+               if (lbasep)
+                       *lbasep = uoff & ~HAMMER2_PBUFMASK64;
+               if (leofp) {
+                       *leofp = (ip->size + HAMMER2_PBUFMASK64) &
+                                ~HAMMER2_PBUFMASK64;
+               }
                return (HAMMER2_PBUFSIZE);
+#if 0
        }
+#endif
+}
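The comment above describes the intended sizing policy, though the 16KB path is compiled out (`#if 0`) in the merged code, so every offset currently gets a 64KB logical buffer. A sketch of the described rule (buffer-size values are assumptions based on the 16KB/64KB description, not the actual HAMMER2 macro values):

```c
#define LBUFSIZE 16384   /* assumed stand-in for HAMMER2_LBUFSIZE */
#define PBUFSIZE 65536   /* assumed stand-in for HAMMER2_PBUFSIZE */

/*
 * Sketch of the sizing rule described in the comment above: 16KB
 * logical buffers for file offsets below 1MB, 64KB beyond that.
 * (The merged code #if 0's the 16KB path and always returns 64KB.)
 */
static int
logical_blksize(unsigned long long uoff)
{
    return (uoff < 1024ULL * 1024) ? LBUFSIZE : PBUFSIZE;
}
```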
+
+/*
+ * Calculate the physical block size.  pblksize <= lblksize.  Primarily
+ * used to calculate a smaller physical block for the logical block
+ * containing the file EOF.
+ *
+ * Returns 0 if the requested base offset is beyond the file EOF.
+ */
+int
+hammer2_calc_physical(hammer2_inode_t *ip, hammer2_key_t lbase)
+{
+       int lblksize;
+       int pblksize;
+       int eofbytes;
+
+       lblksize = hammer2_calc_logical(ip, lbase, NULL, NULL);
+       if (lbase + lblksize <= ip->chain->data->ipdata.size)
+               return (lblksize);
+       if (lbase >= ip->chain->data->ipdata.size)
+               return (0);
+       eofbytes = (int)(ip->chain->data->ipdata.size - lbase);
+       pblksize = lblksize;
+       while (pblksize >= eofbytes && pblksize >= HAMMER2_MIN_ALLOC)
+               pblksize >>= 1;
+       pblksize <<= 1;
+
+       return (pblksize);
 }
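The shift loop in hammer2_calc_physical() computes the smallest power-of-2 block that still covers the bytes valid at EOF. Extracted as a standalone sketch (the HAMMER2_MIN_ALLOC value here is an assumption for illustration):

```c
#define HAMMER2_MIN_ALLOC 64    /* assumed minimum allocation size */

/*
 * The EOF sizing loop above, extracted: halve the logical block size
 * until it no longer covers the valid bytes, then step back up one
 * power of two.  The result is the smallest power-of-2 block size
 * >= eofbytes, floored at the minimum allocation size.
 */
static int
calc_eof_pblksize(int lblksize, int eofbytes)
{
    int pblksize = lblksize;

    while (pblksize >= eofbytes && pblksize >= HAMMER2_MIN_ALLOC)
        pblksize >>= 1;
    pblksize <<= 1;
    return pblksize;
}
```

For example, a 16KB logical block holding only 100 valid bytes at EOF is backed by a 128-byte physical block, while a completely full block keeps its full size.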
 
 void
index 7ab3b4b..c96d951 100644 (file)
@@ -3,6 +3,7 @@
  *
  * This code is derived from software contributed to The DragonFly Project
  * by Matthew Dillon <dillon@backplane.com>
+ * by Daniel Flores (GSOC 2013 - mentored by Matthew Dillon, compression)
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
 #include <sys/vfsops.h>
 #include <sys/sysctl.h>
 #include <sys/socket.h>
+#include <sys/objcache.h>
+
+#include <sys/proc.h>
+#include <sys/namei.h>
+#include <sys/mountctl.h>
+#include <sys/dirent.h>
+#include <sys/uio.h>
+
+#include <sys/mutex.h>
+#include <sys/mutex2.h>
 
 #include "hammer2.h"
 #include "hammer2_disk.h"
 #include "hammer2_mount.h"
 
+#include "hammer2.h"
+#include "hammer2_lz4.h"
+
+#include "zlib/hammer2_zlib.h"
+
 #define REPORT_REFS_ERRORS 1   /* XXX remove me */
 
+MALLOC_DEFINE(M_OBJCACHE, "objcache", "Object Cache");
+
 struct hammer2_sync_info {
        hammer2_trans_t trans;
        int error;
@@ -84,6 +102,18 @@ long hammer2_ioa_meta_write;
 long hammer2_ioa_indr_write;
 long hammer2_ioa_volu_write;
 
+MALLOC_DECLARE(C_BUFFER);
+MALLOC_DEFINE(C_BUFFER, "compbuffer", "Buffer used for compression.");
+
+MALLOC_DECLARE(D_BUFFER);
+MALLOC_DEFINE(D_BUFFER, "decompbuffer", "Buffer used for decompression.");
+
+MALLOC_DECLARE(W_BIOQUEUE);
+MALLOC_DEFINE(W_BIOQUEUE, "wbioqueue", "Writing bio queue.");
+
+MALLOC_DECLARE(W_MTX);
+MALLOC_DEFINE(W_MTX, "wmutex", "Mutex for write thread.");
+
 SYSCTL_NODE(_vfs, OID_AUTO, hammer2, CTLFLAG_RW, 0, "HAMMER2 filesystem");
 
 SYSCTL_INT(_vfs_hammer2, OID_AUTO, debug, CTLFLAG_RW,
@@ -138,6 +168,7 @@ SYSCTL_LONG(_vfs_hammer2, OID_AUTO, ioa_volu_write, CTLFLAG_RW,
           &hammer2_ioa_volu_write, 0, "");
 
 static int hammer2_vfs_init(struct vfsconf *conf);
+static int hammer2_vfs_uninit(struct vfsconf *vfsp);
 static int hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                                struct ucred *cred);
 static int hammer2_remount(struct mount *, char *, struct vnode *,
@@ -161,14 +192,49 @@ static int hammer2_install_volume_header(hammer2_mount_t *hmp);
 static int hammer2_sync_scan1(struct mount *mp, struct vnode *vp, void *data);
 static int hammer2_sync_scan2(struct mount *mp, struct vnode *vp, void *data);
 
+static void hammer2_write_thread(void *arg);
+
+/*
+ * Compression support functions for the write thread,
+ * adapted from hammer2_vnops.c.
+ */
+static void hammer2_write_file_core_t(struct buf *bp, hammer2_trans_t *trans,
+                               hammer2_inode_t *ip,
+                               hammer2_inode_data_t *ipdata,
+                               hammer2_chain_t **parentp,
+                               hammer2_key_t lbase, int ioflag, int pblksize,
+                               int *errorp);
+static void hammer2_compress_and_write_t(struct buf *bp, hammer2_trans_t *trans,
+                               hammer2_inode_t *ip,
+                               hammer2_inode_data_t *ipdata,
+                               hammer2_chain_t **parentp,
+                               hammer2_key_t lbase, int ioflag,
+                               int pblksize, int *errorp, int comp_method);
+static void hammer2_zero_check_and_write_t(struct buf *bp,
+                               hammer2_trans_t *trans, hammer2_inode_t *ip,
+                               hammer2_inode_data_t *ipdata,
+                               hammer2_chain_t **parentp,
+                               hammer2_key_t lbase,
+                               int ioflag, int pblksize, int* error);
+static int test_block_not_zeros_t(char *buf, size_t bytes);
+static void zero_write_t(struct buf *bp, hammer2_trans_t *trans,
+                               hammer2_inode_t *ip,
+                               hammer2_inode_data_t *ipdata,
+                               hammer2_chain_t **parentp, 
+                               hammer2_key_t lbase);
+static void hammer2_write_bp_t(hammer2_chain_t *chain, struct buf *bp,
+                               int ioflag, int pblksize);
+
 static int hammer2_rcvdmsg(kdmsg_msg_t *msg);
 static void hammer2_autodmsg(kdmsg_msg_t *msg);
 
+
 /*
  * HAMMER2 vfs operations.
  */
 static struct vfsops hammer2_vfsops = {
        .vfs_init       = hammer2_vfs_init,
+       .vfs_uninit     = hammer2_vfs_uninit,
        .vfs_sync       = hammer2_vfs_sync,
        .vfs_mount      = hammer2_vfs_mount,
        .vfs_unmount    = hammer2_vfs_unmount,
@@ -190,6 +256,9 @@ static
 int
 hammer2_vfs_init(struct vfsconf *conf)
 {
+       static struct objcache_malloc_args margs_read;
+       static struct objcache_malloc_args margs_write;
+
        int error;
 
        error = 0;
@@ -203,6 +272,19 @@ hammer2_vfs_init(struct vfsconf *conf)
 
        if (error)
                kprintf("HAMMER2 structure size mismatch; cannot continue.\n");
+       
+       margs_read.objsize = 65536;
+       margs_read.mtype = D_BUFFER;
+       
+       margs_write.objsize = 32768;
+       margs_write.mtype = C_BUFFER;
+       
+       cache_buffer_read = objcache_create(margs_read.mtype->ks_shortdesc,
+                               0, 1, NULL, NULL, NULL, objcache_malloc_alloc,
+                               objcache_malloc_free, &margs_read);
+       cache_buffer_write = objcache_create(margs_write.mtype->ks_shortdesc,
+                               0, 1, NULL, NULL, NULL, objcache_malloc_alloc,
+                               objcache_malloc_free, &margs_write);
 
        lockinit(&hammer2_mntlk, "mntlk", 0, 0);
        TAILQ_INIT(&hammer2_mntlist);
@@ -210,6 +292,15 @@ hammer2_vfs_init(struct vfsconf *conf)
        return (error);
 }
 
+static
+int
+hammer2_vfs_uninit(struct vfsconf *vfsp __unused)
+{
+       objcache_destroy(cache_buffer_read);
+       objcache_destroy(cache_buffer_write);
+       return 0;
+}
+
 /*
 * Mount or remount HAMMER2 filesystem from physical media
  *
@@ -258,6 +349,7 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
        dev = NULL;
        label = NULL;
        devvp = NULL;
+       
 
        kprintf("hammer2_mount\n");
 
@@ -437,6 +529,16 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
                hammer2_inode_ref(hmp->sroot);  /* for hmp->sroot */
                hammer2_inode_unlock_ex(hmp->sroot, schain);
                schain = NULL;
+               
+               mtx_init(&hmp->wthread_mtx);
+               bioq_init(&hmp->wthread_bioq);
+               hmp->wthread_destroy = 0;
+       
+               /*
+                * Launch threads.
+                */
+               lwkt_create(hammer2_write_thread, hmp,
+                               NULL, NULL, 0, -1, "hammer2-write");
        }
 
        /*
@@ -466,7 +568,7 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
        ccms_domain_init(&pmp->ccms_dom);
        ++hmp->pmp_count;
        lockmgr(&hammer2_mntlk, LK_RELEASE);
-       kprintf("hammer2_mount hmp=%p pmpcnt=%d\n", hmp, hmp->pmp_count);
+       kprintf("hammer2_mount hmp=%p pmp=%p pmpcnt=%d\n", hmp, pmp, hmp->pmp_count);
 
        mp->mnt_flag = MNT_LOCAL;
        mp->mnt_kern_flag |= MNTK_ALL_MPSAFE;   /* all entry pts are SMP */
@@ -532,6 +634,11 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
        pmp->mount_cluster->rchain = rchain;    /* left held & unlocked */
        pmp->iroot = hammer2_inode_get(pmp, NULL, rchain);
        hammer2_inode_ref(pmp->iroot);          /* ref for pmp->iroot */
+
+       KKASSERT(rchain->pmp == NULL);          /* bootstrap the tracking pmp for rchain */
+       rchain->pmp = pmp;
+       atomic_add_long(&pmp->inmem_chains, 1);
+
        hammer2_inode_unlock_ex(pmp->iroot, rchain);
 
        kprintf("iroot %p\n", pmp->iroot);
@@ -567,10 +674,594 @@ hammer2_vfs_mount(struct mount *mp, char *path, caddr_t data,
         * Initial statfs to prime mnt_stat.
         */
        hammer2_vfs_statfs(mp, &mp->mnt_stat, cred);
-
+       
        return 0;
 }
 
+/*
+ * Handle the bioq for strategy writes.
+ */
+static
+void
+hammer2_write_thread(void *arg)
+{
+       hammer2_mount_t* hmp;
+       struct bio *bio;
+       struct buf *bp;
+       hammer2_trans_t trans;
+       struct vnode *vp;
+       hammer2_inode_t *last_ip;
+       hammer2_inode_t *ip;
+       hammer2_chain_t *parent;
+       hammer2_chain_t **parentp; // to match existing function signatures
+       hammer2_inode_data_t *ipdata;
+       hammer2_key_t lbase;
+       int lblksize;
+       int pblksize;
+       int error;
+       
+       hmp = arg;
+       
+       mtx_lock(&hmp->wthread_mtx);
+       while (hmp->wthread_destroy == 0) {
+               if (bioq_first(&hmp->wthread_bioq) == NULL) {
+                       mtxsleep(&hmp->wthread_bioq, &hmp->wthread_mtx,
+                                0, "h2bioqw", 0);
+               }
+               last_ip = NULL;
+               parent = NULL;
+               parentp = &parent;
+
+               while ((bio = bioq_takefirst(&hmp->wthread_bioq)) != NULL) {
+                       mtx_unlock(&hmp->wthread_mtx);
+                       
+                       error = 0;
+                       bp = bio->bio_buf;
+                       vp = bp->b_vp;
+                       ip = VTOI(vp);
+
+                       /*
+                        * Cache transaction for multi-buffer flush efficiency.
+                        * Lock the ip separately for each buffer to allow
+                        * interleaving with frontend writes.
+                        */
+                       if (last_ip != ip) {
+                               if (last_ip)
+                                       hammer2_trans_done(&trans);
+                               hammer2_trans_init(&trans, ip->pmp,
+                                                  HAMMER2_TRANS_BUFCACHE);
+                               last_ip = ip;
+                       }
+                       parent = hammer2_inode_lock_ex(ip);
+
+                       /*
+                        * Inode is modified, flush size and mtime changes
+                        * to ensure that the file size remains consistent
+                        * with the buffers being flushed.
+                        */
+                       if (ip->flags & (HAMMER2_INODE_RESIZED |
+                                        HAMMER2_INODE_MTIME)) {
+                               hammer2_inode_fsync(&trans, ip, parentp);
+                       }
+                       ipdata = hammer2_chain_modify_ip(&trans, ip,
+                                                        parentp, 0);
+                       lblksize = hammer2_calc_logical(ip, bio->bio_offset,
+                                                       &lbase, NULL);
+                       pblksize = hammer2_calc_physical(ip, lbase);
+                       hammer2_write_file_core_t(bp, &trans, ip, ipdata,
+                                               parentp,
+                                               lbase, IO_ASYNC,
+                                               pblksize, &error);
+                       hammer2_inode_unlock_ex(ip, parent);
+                       if (error) {
+                               kprintf("An error occurred in the write thread.\n");
+                               break;
+                       }
+                       biodone(bio);
+                       mtx_lock(&hmp->wthread_mtx);
+               }
+
+               /*
+                * Clean out transaction cache
+                */
+               if (last_ip)
+                       hammer2_trans_done(&trans);
+       }
+       hmp->wthread_destroy = -1;
+       wakeup(&hmp->wthread_destroy);
+       
+       mtx_unlock(&hmp->wthread_mtx);
+}
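The loop above is a standard mutex-protected work queue: sleep while the bioq is empty, and drop the lock while servicing each buffer so producers can keep queuing. A userland sketch of the same discipline using pthreads (all names hypothetical; the kernel code uses mtxsleep()/wakeup() rather than a condition variable, and processes each item under a transaction):

```c
#include <pthread.h>
#include <stdlib.h>

struct work {
    struct work *next;
    int payload;
};

struct workq {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    struct work    *head;
    int             destroy;
};

/*
 * Consumer loop shaped like hammer2_write_thread: hold the mutex only
 * while inspecting the queue, drop it while processing an item, and
 * drain any remaining work before honoring a destroy request.
 */
static void *
writer_thread(void *arg)
{
    struct workq *q = arg;
    struct work *w;

    pthread_mutex_lock(&q->mtx);
    for (;;) {
        while ((w = q->head) != NULL) {
            q->head = w->next;
            pthread_mutex_unlock(&q->mtx);
            /* ... process w (compress and write in the real code) ... */
            free(w);
            pthread_mutex_lock(&q->mtx);
        }
        if (q->destroy)
            break;
        pthread_cond_wait(&q->cv, &q->mtx);
    }
    pthread_mutex_unlock(&q->mtx);
    return NULL;
}
```

The key property, shared with the kernel loop, is that the queue is only touched under the mutex, so a producer signaling after enqueueing can never race a sleeping consumer into a lost wakeup.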
+
+/*
+ * From hammer2_vnops.c.
+ * Physical block assignment function.
+ */
+static
+hammer2_chain_t *
+hammer2_assign_physical(hammer2_trans_t *trans,
+                       hammer2_inode_t *ip, hammer2_chain_t **parentp,
+                       hammer2_key_t lbase, int pblksize, int *errorp)
+{
+       hammer2_chain_t *parent;
+       hammer2_chain_t *chain;
+       hammer2_off_t pbase;
+       int pradix = hammer2_getradix(pblksize);
+
+       /*
+        * Locate the chain associated with lbase, return a locked chain.
+        * However, do not instantiate any data reference (which utilizes a
+        * device buffer) because we will be using direct IO via the
+        * logical buffer cache buffer.
+        */
+       *errorp = 0;
+retry:
+       parent = *parentp;
+       hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS); /* extra lock */
+       chain = hammer2_chain_lookup(&parent,
+                                    lbase, lbase,
+                                    HAMMER2_LOOKUP_NODATA);
+
+       if (chain == NULL) {
+               /*
+                * We found a hole, create a new chain entry.
+                *
+                * NOTE: DATA chains are created without device backing
+                *       store (nor do we want any).
+                */
+               *errorp = hammer2_chain_create(trans, &parent, &chain,
+                                              lbase, HAMMER2_PBUFRADIX,
+                                              HAMMER2_BREF_TYPE_DATA,
+                                              pblksize);
+               if (chain == NULL) {
+                       hammer2_chain_lookup_done(parent);
+                       panic("hammer2_chain_create: par=%p error=%d\n",
+                               parent, *errorp);
+                       goto retry;
+               }
+
+               pbase = chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX;
+               /*ip->delta_dcount += pblksize;*/
+       } else {
+               switch (chain->bref.type) {
+               case HAMMER2_BREF_TYPE_INODE:
+                       /*
+                        * The data is embedded in the inode.  The
+                        * caller is responsible for marking the inode
+                        * modified and copying the data to the embedded
+                        * area.
+                        */
+                       pbase = NOOFFSET;
+                       break;
+               case HAMMER2_BREF_TYPE_DATA:
+                       if (chain->bytes != pblksize) {
+                               hammer2_chain_resize(trans, ip,
+                                                    parent, &chain,
+                                                    pradix,
+                                                    HAMMER2_MODIFY_OPTDATA);
+                       }
+                       hammer2_chain_modify(trans, &chain,
+                                            HAMMER2_MODIFY_OPTDATA);
+                       pbase = chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX;
+                       break;
+               default:
+                       panic("hammer2_assign_physical: bad type");
+                       /* NOT REACHED */
+                       pbase = NOOFFSET;
+                       break;
+               }
+       }
+
+       /*
+        * Cleanup.  If chain wound up being the inode (i.e. DIRECTDATA),
+        * we might have to replace *parentp.
+        */
+       hammer2_chain_lookup_done(parent);
+       if (chain) {
+               if (*parentp != chain &&
+                   (*parentp)->core == chain->core) {
+                       parent = *parentp;
+                       *parentp = chain;               /* eats lock */
+                       hammer2_chain_unlock(parent);
+                       hammer2_chain_lock(chain, 0);   /* need another */
+               }
+               /* else chain already locked for return */
+       }
+       return (chain);
+}
+
+/* 
+ * From hammer2_vnops.c.
+ * The core write function which determines which path to take
+ * depending on compression settings.
+ */
+static
+void
+hammer2_write_file_core_t(struct buf *bp, hammer2_trans_t *trans,
+                       hammer2_inode_t *ip, hammer2_inode_data_t *ipdata,
+                       hammer2_chain_t **parentp,
+                       hammer2_key_t lbase, int ioflag, int pblksize,
+                       int *errorp)
+{
+       hammer2_chain_t *chain;
+       if (ipdata->comp_algo > HAMMER2_COMP_AUTOZERO) {
+               hammer2_compress_and_write_t(bp, trans, ip,
+                                          ipdata, parentp,
+                                          lbase, ioflag,
+                                          pblksize, errorp, ipdata->comp_algo);
+       } else if (ipdata->comp_algo == HAMMER2_COMP_AUTOZERO) {
+               hammer2_zero_check_and_write_t(bp, trans, ip,
+                                   ipdata, parentp, lbase,
+                                   ioflag, pblksize, errorp);
+       } else {
+               /*
+                * We have to assign physical storage to the buffer
+                * we intend to dirty or write now to avoid deadlocks
+                * in the strategy code later.
+                *
+                * This can return NOOFFSET for inode-embedded data.
+                * The strategy code will take care of it in that case.
+                */
+               chain = hammer2_assign_physical(trans, ip, parentp,
+                                               lbase, pblksize,
+                                               errorp);
+               hammer2_write_bp_t(chain, bp, ioflag, pblksize);
+               if (chain)
+                       hammer2_chain_unlock(chain);
+       }
+       ipdata = &ip->chain->data->ipdata;      /* reload */
+}
+
+/*
+ * From hammer2_vnops.c
+ * Generic function that performs the compression in the compressed
+ * write path.  The compression algorithm is determined by the settings
+ * obtained from the inode.
+ */
+static
+void
+hammer2_compress_and_write_t(struct buf *bp, hammer2_trans_t *trans,
+       hammer2_inode_t *ip, hammer2_inode_data_t *ipdata,
+       hammer2_chain_t **parentp,
+       hammer2_key_t lbase, int ioflag, int pblksize,
+       int *errorp, int comp_method)
+{
+       hammer2_chain_t *chain;
+
+       if (test_block_not_zeros_t(bp->b_data, pblksize)) {
+               int compressed_size = 0;
+               int compressed_block_size;
+               char *compressed_buffer = NULL; //to avoid a compiler warning
+
+               KKASSERT(pblksize / 2 <= 32768);
+               
+               if (ipdata->reserved85 < 8 || (ipdata->reserved85 & 7) == 0) {
+                       if ((comp_method & 0x0F) == HAMMER2_COMP_LZ4) {
+                               //kprintf("LZ4 compression activated.\n");
+                               compressed_buffer = objcache_get(cache_buffer_write, M_INTWAIT);
+                               compressed_size = LZ4_compress_limitedOutput(bp->b_data,
+                                   &compressed_buffer[sizeof(int)], pblksize,
+                                   pblksize / 2 - sizeof(int));
+                               *(int *)compressed_buffer = compressed_size;
+                               if (compressed_size)
+                                       compressed_size += sizeof(int); /* our added overhead */
+                               //kprintf("Compressed size = %d.\n", compressed_size);
+                       } else if ((comp_method & 0x0F) == HAMMER2_COMP_ZLIB) {
+                               int comp_level = (comp_method >> 4) & 0x0F;
+                               z_stream strm_compress;
+                               int ret;
+                           //kprintf("ZLIB compression activated, level %d.\n", comp_level);
+
+                               ret = deflateInit(&strm_compress, comp_level);
+                               if (ret != Z_OK)
+                                       kprintf("HAMMER2 ZLIB: fatal error on deflateInit.\n");
+                               
+                               compressed_buffer = objcache_get(cache_buffer_write, M_INTWAIT);
+                               strm_compress.next_in = bp->b_data;
+                               strm_compress.avail_in = pblksize;
+                               strm_compress.next_out = compressed_buffer;
+                               strm_compress.avail_out = pblksize / 2;
+                               ret = deflate(&strm_compress, Z_FINISH);
+                               if (ret == Z_STREAM_END) {
+                                       compressed_size = pblksize / 2 - strm_compress.avail_out;
+                               } else {
+                                       compressed_size = 0;
+                               }
+                               ret = deflateEnd(&strm_compress);
+                               //kprintf("Compressed size = %d.\n", compressed_size);
+                       }
+                       else {
+                               kprintf("Error: Unknown compression method.\n");
+                               kprintf("Comp_method = %d.\n", comp_method);
+                               //And the block will be written uncompressed...
+                       }
+               }
+               if (compressed_size == 0) { //compression failed or turned off
+                       compressed_block_size = pblksize;       /* safety */
+                       ++(ipdata->reserved85);
+                       if (ipdata->reserved85 == 255) { //protection against overflows
+                               ipdata->reserved85 = 8;
+                       }
+               } else {
+                       ipdata->reserved85 = 0;
+                       if (compressed_size <= 1024) {
+                               compressed_block_size = 1024;
+                       } else if (compressed_size <= 2048) {
+                               compressed_block_size = 2048;
+                       } else if (compressed_size <= 4096) {
+                               compressed_block_size = 4096;
+                       } else if (compressed_size <= 8192) {
+                               compressed_block_size = 8192;
+                       } else if (compressed_size <= 16384) {
+                               compressed_block_size = 16384;
+                       } else if (compressed_size <= 32768) {
+                               compressed_block_size = 32768;
+                       } else {
+                               panic("WRITE PATH: Weird compressed_size value.\n");
+                               compressed_block_size = pblksize;       /* NOT REACHED */
+                       }
+               }
+
+               chain = hammer2_assign_physical(trans, ip, parentp,
+                                               lbase, compressed_block_size,
+                                               errorp);
+               ipdata = &ip->chain->data->ipdata;      /* RELOAD */
+                       
+               if (*errorp) {
+                       kprintf("WRITE PATH: An error occurred while "
+                               "assigning physical space.\n");
+                       KKASSERT(chain == NULL);
+               } else {
+                       /* Get device offset */
+                       hammer2_off_t pbase;
+                       hammer2_off_t pmask;
+                       hammer2_off_t peof;
+                       size_t boff;
+                       size_t psize;
+                       struct buf *dbp;
+                       
+                       KKASSERT(chain->flags & HAMMER2_CHAIN_MODIFIED);
+                       
+                       switch(chain->bref.type) {
+                       case HAMMER2_BREF_TYPE_INODE:
+                               KKASSERT(chain->data->ipdata.op_flags &
+                                       HAMMER2_OPFLAG_DIRECTDATA);
+                               KKASSERT(bp->b_loffset == 0);
+                               bcopy(bp->b_data, chain->data->ipdata.u.data,
+                                       HAMMER2_EMBEDDED_BYTES);
+                               break;
+                       case HAMMER2_BREF_TYPE_DATA:                            
+                               psize = hammer2_devblksize(chain->bytes);
+                               pmask = (hammer2_off_t)psize - 1;
+                               pbase = chain->bref.data_off & ~pmask;
+                               boff = chain->bref.data_off & (HAMMER2_OFF_MASK & pmask);
+                               peof = (pbase + HAMMER2_SEGMASK64) & ~HAMMER2_SEGMASK64;
+                               int temp_check = HAMMER2_DEC_CHECK(chain->bref.methods);
+
+                               /*
+                                * Optimize out the read-before-write if possible.
+                                */
+                               if (compressed_block_size == psize) {
+                                       dbp = getblk(chain->hmp->devvp, pbase, psize, 0, 0);
+                               } else {
+                                       *errorp = bread(chain->hmp->devvp, pbase, psize, &dbp);
+                                       if (*errorp) {
+                                               kprintf("WRITE PATH: An error occurred in bread().\n");
+                                               break;
+                                       }
+                               }
+
+                               /*
+                                * When loading the block make sure we don't leave garbage
+                                * after the compressed data.
+                                */
+                               if (compressed_size) {
+                                       chain->bref.methods = HAMMER2_ENC_COMP(comp_method) +
+                                                             HAMMER2_ENC_CHECK(temp_check);
+                                       bcopy(compressed_buffer, dbp->b_data + boff,
+                                             compressed_size);
+                                       if (compressed_size != compressed_block_size) {
+                                               bzero(dbp->b_data + boff + compressed_size,
+                                                     compressed_block_size - compressed_size);
+                                       }
+                               } else {
+                                       chain->bref.methods = HAMMER2_ENC_COMP(HAMMER2_COMP_NONE) +
+                                                             HAMMER2_ENC_CHECK(temp_check);
+                                       bcopy(bp->b_data, dbp->b_data + boff, pblksize);
+                               }
+
+                               /*
+                                * Device buffer is now valid, chain is no
+                                * longer in the initial state.
+                                */
+                               atomic_clear_int(&chain->flags,
+                                                HAMMER2_CHAIN_INITIAL);
+
+                               /* Now write the related bdp. */
+                               if (ioflag & IO_SYNC) {
+                                       /*
+                                        * Synchronous I/O requested.
+                                        */
+                                       bwrite(dbp);
+                               /*
+                               } else if ((ioflag & IO_DIRECT) && loff + n == pblksize) {
+                                       bdwrite(dbp);
+                               */
+                               } else if (ioflag & IO_ASYNC) {
+                                       bawrite(dbp);
+                               } else if (hammer2_cluster_enable) {
+                                       cluster_write(dbp, peof, HAMMER2_PBUFSIZE, 4/*XXX*/);
+                               } else {
+                                       bdwrite(dbp);
+                               }
+                               break;
+                       default:
+                               panic("hammer2_write_bp_t: bad chain type %d\n",
+                                       chain->bref.type);
+                       /* NOT REACHED */
+                               break;
+                       }
+                       
+                       hammer2_chain_unlock(chain);
+               }
+               if (compressed_buffer)
+                       objcache_put(cache_buffer_write, compressed_buffer);
+       } else {
+               zero_write_t(bp, trans, ip, ipdata, parentp, lbase);
+       }
+}
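The compressed write path above rounds compressed_size up to a power-of-two physical block size between 1KB and 32KB, then zeroes the slack between the end of the compressed data and the end of the allocated device block so stale bytes never reach the media. Those two steps can be sketched in userland C as follows (helper names are hypothetical illustrations, not HAMMER2 APIs):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Round a compressed size up to the smallest supported physical block
 * size, mirroring the 1K..32K ladder in the write path.  Returns 0 for
 * sizes beyond the largest bucket (the kernel code panics there).
 */
static size_t
round_up_block(size_t csize)
{
	size_t bucket;

	for (bucket = 1024; bucket <= 32768; bucket <<= 1) {
		if (csize <= bucket)
			return (bucket);
	}
	return (0);		/* caller treats this as an error */
}

/*
 * Copy the compressed payload into the device block and zero the
 * remainder, so no garbage follows the compressed data on disk.
 */
static void
fill_device_block(char *dblock, size_t bsize, const char *comp, size_t csize)
{
	assert(csize <= bsize);
	memcpy(dblock, comp, csize);
	if (csize < bsize)
		memset(dblock + csize, 0, bsize - csize);
}
```

For example, a 1500-byte compressed result lands in a 2048-byte bucket, and a 3-byte payload in a 1024-byte bucket leaves bytes 3..1023 zeroed.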
+
+/*
+ * Function that performs zero-checking and writing without compression;
+ * it corresponds to the default zero-checking path.
+ */
+static
+void
+hammer2_zero_check_and_write_t(struct buf *bp, hammer2_trans_t *trans,
+       hammer2_inode_t *ip, hammer2_inode_data_t *ipdata,
+       hammer2_chain_t **parentp,
+       hammer2_key_t lbase, int ioflag, int pblksize, int *errorp)
+{
+       hammer2_chain_t *chain;
+
+       if (test_block_not_zeros_t(bp->b_data, pblksize)) {
+               chain = hammer2_assign_physical(trans, ip, parentp,
+                                               lbase, pblksize, errorp);
+               hammer2_write_bp_t(chain, bp, ioflag, pblksize);
+               if (chain)
+                       hammer2_chain_unlock(chain);
+       } else {
+               zero_write_t(bp, trans, ip, ipdata, parentp, lbase);
+       }
+}
+
+/*
+ * Test whether a block of data contains only zeros.  Returns 0 if the
+ * block is all-zero, 1 otherwise.
+ */
+static
+int
+test_block_not_zeros_t(char *buf, size_t bytes)
+{
+       size_t i;
+
+       for (i = 0; i < bytes; i += sizeof(long)) {
+               if (*(long *)(buf + i) != 0)
+                       return (1);
+       }
+       return (0);
+}
+
+/*
+ * Function to "write" a block that contains only zeros.
+ */
+static
+void
+zero_write_t(struct buf *bp, hammer2_trans_t *trans, hammer2_inode_t *ip,
+       hammer2_inode_data_t *ipdata, hammer2_chain_t **parentp,
+       hammer2_key_t lbase)
+{
+       hammer2_chain_t *parent;
+       hammer2_chain_t *chain;
+
+       parent = hammer2_chain_lookup_init(*parentp, 0);
+
+       chain = hammer2_chain_lookup(&parent, lbase, lbase,
+                                    HAMMER2_LOOKUP_NODATA);
+       if (chain) {
+               if (chain->bref.type == HAMMER2_BREF_TYPE_INODE) {
+                       bzero(chain->data->ipdata.u.data,
+                             HAMMER2_EMBEDDED_BYTES);
+               } else {
+                       hammer2_chain_delete(trans, chain, 0);
+               }
+               hammer2_chain_unlock(chain);
+       }
+       hammer2_chain_lookup_done(parent);
+}
+
+/*
+ * Function to write the data as-is, without performing any sort of
+ * compression.  It is used in the no-compression path and in the
+ * default zero-checking path.
+ */
+static
+void
+hammer2_write_bp_t(hammer2_chain_t *chain, struct buf *bp, int ioflag,
+                               int pblksize)
+{
+       hammer2_off_t pbase;
+       hammer2_off_t pmask;
+       hammer2_off_t peof;
+       struct buf *dbp;
+       size_t boff;
+       size_t psize;
+       int error;
+       int temp_check = HAMMER2_DEC_CHECK(chain->bref.methods);
+
+       KKASSERT(chain->flags & HAMMER2_CHAIN_MODIFIED);
+
+       switch(chain->bref.type) {
+       case HAMMER2_BREF_TYPE_INODE:
+               KKASSERT(chain->data->ipdata.op_flags &
+                        HAMMER2_OPFLAG_DIRECTDATA);
+               KKASSERT(bp->b_loffset == 0);
+               bcopy(bp->b_data, chain->data->ipdata.u.data,
+                     HAMMER2_EMBEDDED_BYTES);
+               break;
+       case HAMMER2_BREF_TYPE_DATA:
+               psize = hammer2_devblksize(chain->bytes);
+               pmask = (hammer2_off_t)psize - 1;
+               pbase = chain->bref.data_off & ~pmask;
+               boff = chain->bref.data_off & (HAMMER2_OFF_MASK & pmask);
+               peof = (pbase + HAMMER2_SEGMASK64) & ~HAMMER2_SEGMASK64;
+
+               if (psize == pblksize) {
+                       dbp = getblk(chain->hmp->devvp, pbase,
+                               psize, 0, 0);
+               } else {
+                       error = bread(chain->hmp->devvp, pbase, psize, &dbp);
+                       if (error) {
+                               kprintf("WRITE PATH: An error occurred in bread().\n");
+                               break;
+                       }
+               }
+
+               chain->bref.methods = HAMMER2_ENC_COMP(HAMMER2_COMP_NONE) +
+                                     HAMMER2_ENC_CHECK(temp_check);
+               bcopy(bp->b_data, dbp->b_data + boff, chain->bytes);
+               
+               /*
+                * Device buffer is now valid, chain is no
+                * longer in the initial state.
+                */
+               atomic_clear_int(&chain->flags, HAMMER2_CHAIN_INITIAL);
+
+               if (ioflag & IO_SYNC) {
+                       /*
+                        * Synchronous I/O requested.
+                        */
+                       bwrite(dbp);
+               /*
+               } else if ((ioflag & IO_DIRECT) && loff + n == pblksize) {
+                       bdwrite(dbp);
+               */
+               } else if (ioflag & IO_ASYNC) {
+                       bawrite(dbp);
+               } else if (hammer2_cluster_enable) {
+                       cluster_write(dbp, peof, HAMMER2_PBUFSIZE, 4/*XXX*/);
+               } else {
+                       bdwrite(dbp);
+               }
+               break;
+       default:
+               panic("hammer2_write_bp_t: bad chain type %d\n",
+                     chain->bref.type);
+               /* NOT REACHED */
+               break;
+       }
+}
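Both write helpers end with the same four-way choice of how to push the device buffer out: synchronous, asynchronous, clustered, or delayed. The decision order can be modeled as a small pure function (the flag macros here are stand-ins for the kernel's IO_SYNC/IO_ASYNC; in the kernel the branches call bwrite(), bawrite(), cluster_write(), and bdwrite() respectively):

```c
#include <assert.h>

#define IO_SYNC_FLAG	0x1	/* stand-in for the kernel's IO_SYNC */
#define IO_ASYNC_FLAG	0x2	/* stand-in for the kernel's IO_ASYNC */

enum write_strategy {
	WS_SYNC,	/* bwrite(): synchronous, caller waits for I/O */
	WS_ASYNC,	/* bawrite(): start the I/O now, don't wait */
	WS_CLUSTER,	/* cluster_write(): gather into larger I/Os */
	WS_DELAYED	/* bdwrite(): mark dirty, write later */
};

/* Mirrors the if/else ladder at the end of the write helpers. */
static enum write_strategy
pick_write_strategy(int ioflag, int cluster_enable)
{
	if (ioflag & IO_SYNC_FLAG)
		return (WS_SYNC);
	if (ioflag & IO_ASYNC_FLAG)
		return (WS_ASYNC);
	if (cluster_enable)
		return (WS_CLUSTER);
	return (WS_DELAYED);
}
```

Note that IO_SYNC wins over everything else, and clustering is only attempted when the hammer2_cluster_enable tunable is set.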
+
 static
 int
 hammer2_remount(struct mount *mp, char *path, struct vnode *devvp,
@@ -623,10 +1314,11 @@ hammer2_vfs_unmount(struct mount *mp, int mntflags)
         * to synchronize against HAMMER2_CHAIN_MODIFIED_AUX.
         */
        hammer2_voldata_lock(hmp);
-       if (hmp->vchain.flags & (HAMMER2_CHAIN_MODIFIED |
-                                HAMMER2_CHAIN_SUBMODIFIED)) {
+       if ((hmp->vchain.flags | hmp->fchain.flags) &
+           (HAMMER2_CHAIN_MODIFIED | HAMMER2_CHAIN_SUBMODIFIED)) {
                hammer2_voldata_unlock(hmp, 0);
                hammer2_vfs_sync(mp, MNT_WAIT);
+               hammer2_vfs_sync(mp, MNT_WAIT);
        } else {
                hammer2_voldata_unlock(hmp, 0);
        }
@@ -740,6 +1432,15 @@ hammer2_vfs_unmount(struct mount *mp, int mntflags)
        kfree(cluster, M_HAMMER2);
        kfree(pmp, M_HAMMER2);
        if (hmp->pmp_count == 0) {
+               mtx_lock(&hmp->wthread_mtx);
+               hmp->wthread_destroy = 1;
+               wakeup(&hmp->wthread_bioq);
+               while (hmp->wthread_destroy != -1) {
+                       mtxsleep(&hmp->wthread_destroy, &hmp->wthread_mtx, 0,
+                               "umount-sleep", 0);
+               }
+               mtx_unlock(&hmp->wthread_mtx);
+               
                TAILQ_REMOVE(&hammer2_mntlist, hmp, mntentry);
                kmalloc_destroy(&hmp->mchain);
                kfree(hmp, M_HAMMER2);
@@ -859,7 +1560,15 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
 
        pmp = MPTOPMP(mp);
 
-       flags = VMSC_GETVP;
+       /*
+        * We can't acquire locks on existing vnodes while in a transaction
+        * without risking a deadlock.  This assumes that vfsync() can be
+        * called without the vnode locked (which it can in DragonFly).
+        * Otherwise we'd have to implement a multi-pass or flag the lock
+        * failures and retry.
+        */
+       /*flags = VMSC_GETVP;*/
+       flags = 0;
        if (waitfor & MNT_LAZY)
                flags |= VMSC_ONEPASS;
 
@@ -895,7 +1604,7 @@ hammer2_vfs_sync(struct mount *mp, int waitfor)
        }
        hammer2_chain_unlock(&hmp->vchain);
 
-#if 0
+#if 1
        /*
         * Rollup flush.  The fsyncs above basically just flushed
         * data blocks.  The flush below gets all the meta-data.
@@ -1009,11 +1718,14 @@ hammer2_sync_scan2(struct mount *mp, struct vnode *vp, void *data)
        /*
         * VOP_FSYNC will start a new transaction so replicate some code
         * here to do it inline (see hammer2_vop_fsync()).
+        *
+        * WARNING: The vfsync interacts with the buffer cache and might
+        *          block, we can't hold the inode lock at that time.
         */
-       parent = hammer2_inode_lock_ex(ip);
        atomic_clear_int(&ip->flags, HAMMER2_INODE_MODIFIED);
        if (ip->vp)
                vfsync(ip->vp, MNT_NOWAIT, 1, NULL, NULL);
+       parent = hammer2_inode_lock_ex(ip);
        hammer2_chain_flush(&info->trans, parent);
        hammer2_inode_unlock_ex(ip, parent);
        error = 0;
index f963ee7..e11484c 100644 (file)
@@ -4,6 +4,7 @@
  * This code is derived from software contributed to The DragonFly Project
  * by Matthew Dillon <dillon@dragonflybsd.org>
  * by Venkatesh Srinivas <vsrinivas@dragonflybsd.org>
+ * by Daniel Flores (GSOC 2013 - mentored by Matthew Dillon, compression) 
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
 #include <sys/mountctl.h>
 #include <sys/dirent.h>
 #include <sys/uio.h>
+#include <sys/objcache.h>
 
 #include "hammer2.h"
+#include "hammer2_lz4.h"
+
+#include "zlib/hammer2_zlib.h"
 
 #define ZFOFFSET       (-2LL)
 
 static int hammer2_read_file(hammer2_inode_t *ip, struct uio *uio,
                                int seqcount);
-static int hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                               hammer2_chain_t **parentp,
-                               struct uio *uio, int ioflag, int seqcount);
-static void hammer2_write_bp(hammer2_chain_t *chain, struct buf *bp,
-                               int ioflag);
-static hammer2_chain_t *hammer2_assign_physical(hammer2_trans_t *trans,
-                               hammer2_inode_t *ip, hammer2_chain_t **parentp,
-                               hammer2_key_t lbase, int lblksize,
-                               int *errorp);
-static void hammer2_extend_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                               hammer2_chain_t **parentp, hammer2_key_t nsize);
-static void hammer2_truncate_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                               hammer2_chain_t **parentp, hammer2_key_t nsize);
+static int hammer2_write_file(hammer2_inode_t *ip, struct uio *uio,
+                               int ioflag, int seqcount);
+static void hammer2_extend_file(hammer2_inode_t *ip, hammer2_key_t nsize);
+static void hammer2_truncate_file(hammer2_inode_t *ip, hammer2_key_t nsize);
+static void hammer2_decompress_LZ4_callback(struct bio *bio);
+static void hammer2_decompress_ZLIB_callback(struct bio *bio);
+
+struct objcache *cache_buffer_read;
+struct objcache *cache_buffer_write;
+
+/*
+ * Callback used in the read path when a block is compressed with LZ4.
+ */
+static
+void
+hammer2_decompress_LZ4_callback(struct bio *bio)
+{
+       struct buf *bp = bio->bio_buf;
+       struct buf *obp;
+       struct bio *obio;
+       int loff;
+
+       /*
+        * If BIO_DONE is already set the device buffer was already
+        * fully valid (B_CACHE).  If it is not set then I/O was issued
+        * and we have to run I/O completion as the last bio.
+        *
+        * Nobody is waiting for our device I/O to complete, we are
+        * responsible for bqrelse()ing it which means we also have to do
+        * the equivalent of biowait() and clear BIO_DONE (which breadcb()
+        * may have set).
+        *
+        * Any preexisting device buffer should match the requested size,
+        * but due to bigblock recycling and other factors there is some
+        * fragility there, so we assert that the device buffer covers
+        * the request.
+        */
+       if ((bio->bio_flags & BIO_DONE) == 0)
+               bpdone(bp, 0);
+       bio->bio_flags &= ~(BIO_DONE | BIO_SYNC);
+
+       obio = bio->bio_caller_info1.ptr;
+       obp = obio->bio_buf;
+       loff = obio->bio_caller_info3.value;
+
+       if (bp->b_flags & B_ERROR) {
+               obp->b_flags |= B_ERROR;
+               obp->b_error = bp->b_error;
+       } else if (obio->bio_caller_info2.index &&
+                  obio->bio_caller_info1.uvalue32 !=
+                   crc32(bp->b_data, bp->b_bufsize)) {
+               obp->b_flags |= B_ERROR;
+               obp->b_error = EIO;
+       } else {
+               KKASSERT(obp->b_bufsize <= 65536);
+               
+               char *buffer;
+               char *compressed_buffer;
+               int *compressed_size;
+               
+               buffer = bp->b_data + loff;
+               compressed_size = (int*)buffer;
+               compressed_buffer = objcache_get(cache_buffer_read, M_INTWAIT);
+               KKASSERT((unsigned int)*compressed_size <= 65536);
+               int result = LZ4_decompress_safe(&buffer[sizeof(int)],
+                       compressed_buffer, *compressed_size, obp->b_bufsize);
+               if (result < 0) {
+                       kprintf("READ PATH: Error during decompression. "
+                               "bio %016jx/%d loff=%d\n",
+                               (intmax_t)bio->bio_offset, bio->bio_buf->b_bufsize, loff);
+                       /* make sure it isn't random garbage */
+                       bzero(compressed_buffer, obp->b_bufsize);
+               }
+               KKASSERT(result <= obp->b_bufsize);
+               bcopy(compressed_buffer, obp->b_data, obp->b_bufsize);
+               if (result < obp->b_bufsize)
+                       bzero(obp->b_data + result, obp->b_bufsize - result);
+               objcache_put(cache_buffer_read, compressed_buffer);
+               obp->b_resid = 0;
+               obp->b_flags |= B_AGE;
+       }
+       biodone(obio);
+       bqrelse(bp);
+}
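On disk, an LZ4 block stores a native int byte count followed by the compressed payload; the callback above reads that header, sanity-checks it against the 64KB logical-buffer limit, and zero-fills any shortfall after decompression so a short result does not leak garbage. A hedged userland sketch of the header parse and tail fill (helper names are hypothetical; the LZ4_decompress_safe() call itself is elided):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define H2_MAX_LBUF	65536	/* matches the KKASSERT(<= 65536) checks */

/*
 * Parse the leading int length at offset loff in a raw device block
 * and return a pointer to the compressed payload, or NULL if the
 * stored size is implausible.
 */
static const char *
lz4_frame_payload(const char *dblock, size_t loff, int *csizep)
{
	const char *p = dblock + loff;
	int csize;

	memcpy(&csize, p, sizeof(csize));	/* leading int, native endian */
	if (csize <= 0 || csize > H2_MAX_LBUF)
		return (NULL);
	*csizep = csize;
	return (p + sizeof(int));
}

/*
 * After decompression produced 'result' bytes into a bufsize-byte
 * logical buffer, zero the tail (the kernel also bzero()s the whole
 * buffer when decompression fails outright).
 */
static void
zero_fill_tail(char *obuf, int result, int bufsize)
{
	if (result < bufsize)
		memset(obuf + result, 0, bufsize - result);
}
```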
+
+/*
+ * Callback used in the read path when a block is compressed with ZLIB.
+ * It is almost identical to the LZ4 callback, so in theory the two could
+ * be unified, but we did not want to change the bio structure for that.
+ */
+static
+void
+hammer2_decompress_ZLIB_callback(struct bio *bio)
+{
+       struct buf *bp = bio->bio_buf;
+       struct buf *obp;
+       struct bio *obio;
+       int loff;
+
+       /*
+        * If BIO_DONE is already set the device buffer was already
+        * fully valid (B_CACHE).  If it is not set then I/O was issued
+        * and we have to run I/O completion as the last bio.
+        *
+        * Nobody is waiting for our device I/O to complete, we are
+        * responsible for bqrelse()ing it which means we also have to do
+        * the equivalent of biowait() and clear BIO_DONE (which breadcb()
+        * may have set).
+        *
+        * Any preexisting device buffer should match the requested size,
+        * but due to bigblock recycling and other factors there is some
+        * fragility there, so we assert that the device buffer covers
+        * the request.
+        */
+       if ((bio->bio_flags & BIO_DONE) == 0)
+               bpdone(bp, 0);
+       bio->bio_flags &= ~(BIO_DONE | BIO_SYNC);
+
+       obio = bio->bio_caller_info1.ptr;
+       obp = obio->bio_buf;
+       loff = obio->bio_caller_info3.value;
+
+       if (bp->b_flags & B_ERROR) {
+               obp->b_flags |= B_ERROR;
+               obp->b_error = bp->b_error;
+       } else if (obio->bio_caller_info2.index &&
+                  obio->bio_caller_info1.uvalue32 !=
+                   crc32(bp->b_data, bp->b_bufsize)) {
+               obp->b_flags |= B_ERROR;
+               obp->b_error = EIO;
+       } else {
+               KKASSERT(obp->b_bufsize <= 65536);
+               
+               char *buffer;
+               char *compressed_buffer;
+               int ret;
+               
+               z_stream strm_decompress;
+
+               strm_decompress.avail_in = 0;
+               strm_decompress.next_in = Z_NULL;
+               
+               ret = inflateInit(&strm_decompress);
+               
+               if (ret != Z_OK)
+                       kprintf("HAMMER2 ZLIB: Fatal error in inflateInit.\n");
+               
+               buffer = bp->b_data + loff;
+               compressed_buffer = objcache_get(cache_buffer_read, M_INTWAIT);
+               strm_decompress.next_in = buffer;
+               strm_decompress.avail_in = bp->b_bufsize - loff; //bp->b_bufsize?
+               strm_decompress.next_out = compressed_buffer;
+               strm_decompress.avail_out = obp->b_bufsize;
+               
+               ret = inflate(&strm_decompress, Z_FINISH);
+               if (ret != Z_STREAM_END) {
+                       kprintf("HAMMER2 ZLIB: Fatal error during decompression.\n");
+                       bzero(compressed_buffer, obp->b_bufsize);
+               }
+               bcopy(compressed_buffer, obp->b_data, obp->b_bufsize);
+               int result = obp->b_bufsize - strm_decompress.avail_out;
+               if (result < obp->b_bufsize)
+                       bzero(obp->b_data + result, strm_decompress.avail_out);
+               objcache_put(cache_buffer_read, compressed_buffer);
+               obp->b_resid = 0;
+               obp->b_flags |= B_AGE;
+               ret = inflateEnd(&strm_decompress);
+       }
+       biodone(obio);
+       bqrelse(bp);
+}
 
 static __inline
 void
@@ -216,9 +379,12 @@ hammer2_vop_fsync(struct vop_fsync_args *ap)
        vp = ap->a_vp;
        ip = VTOI(vp);
 
+       /*
+        * WARNING: The vfsync interacts with the buffer cache and might
+        *          block, we can't hold the inode lock and we can't
+        *          have a flush transaction pending.
+        */
        hammer2_trans_init(&trans, ip->pmp, HAMMER2_TRANS_ISFLUSH);
-       chain = hammer2_inode_lock_ex(ip);
-
        vfsync(vp, ap->a_waitfor, 1, NULL, NULL);
 
        /*
@@ -229,7 +395,11 @@ hammer2_vop_fsync(struct vop_fsync_args *ap)
         * which call this function will eventually call chain_flush
         * on the volume root as a catch-all, which is far more optimal.
         */
+       chain = hammer2_inode_lock_ex(ip);
        atomic_clear_int(&ip->flags, HAMMER2_INODE_MODIFIED);
+       if (ip->flags & (HAMMER2_INODE_RESIZED|HAMMER2_INODE_MTIME))
+               hammer2_inode_fsync(&trans, ip, &chain);
+
        if (ap->a_flags & VOP_FSYNC_SYSCALL) {
                hammer2_chain_flush(&trans, chain);
        }
@@ -288,7 +458,7 @@ hammer2_vop_getattr(struct vop_getattr_args *ap)
        vap->va_gid = hammer2_to_unix_xid(&ipdata->gid);
        vap->va_rmajor = 0;
        vap->va_rminor = 0;
-       vap->va_size = ipdata->size;
+       vap->va_size = ip->size;        /* protected by shared lock */
        vap->va_blocksize = HAMMER2_PBUFSIZE;
        vap->va_flags = ipdata->uflags;
        hammer2_time_to_timespec(ipdata->ctime, &vap->va_ctime);
@@ -332,6 +502,7 @@ hammer2_vop_setattr(struct vop_setattr_args *ap)
        if (ip->pmp->ronly)
                return(EROFS);
 
+       hammer2_chain_memory_wait(ip->pmp);
        hammer2_trans_init(&trans, ip->pmp, 0);
        chain = hammer2_inode_lock_ex(ip);
        ipdata = &chain->data->ipdata;
@@ -394,18 +565,18 @@ hammer2_vop_setattr(struct vop_setattr_args *ap)
        /*
         * Resize the file
         */
-       if (vap->va_size != VNOVAL && ipdata->size != vap->va_size) {
+       if (vap->va_size != VNOVAL && ip->size != vap->va_size) {
                switch(vp->v_type) {
                case VREG:
-                       if (vap->va_size == ipdata->size)
+                       if (vap->va_size == ip->size)
                                break;
-                       if (vap->va_size < ipdata->size) {
-                               hammer2_truncate_file(&trans, ip,
-                                                     &chain, vap->va_size);
+                       hammer2_inode_unlock_ex(ip, chain);
+                       if (vap->va_size < ip->size) {
+                               hammer2_truncate_file(ip, vap->va_size);
                        } else {
-                               hammer2_extend_file(&trans, ip,
-                                                   &chain, vap->va_size);
+                               hammer2_extend_file(ip, vap->va_size);
                        }
+                       chain = hammer2_inode_lock_ex(ip);
                        ipdata = &chain->data->ipdata; /* RELOAD */
                        domtime = 1;
                        break;
@@ -441,6 +612,13 @@ hammer2_vop_setattr(struct vop_setattr_args *ap)
                        kflags |= NOTE_ATTRIB;
                }
        }
+
+       /*
+        * If a truncation occurred we must call inode_fsync() now in order
+        * to trim the related data chains, otherwise a later expansion can
+        * cause havoc.
+        */
+       hammer2_inode_fsync(&trans, ip, &chain);
 done:
        hammer2_inode_unlock_ex(ip, chain);
        hammer2_trans_done(&trans);
@@ -682,7 +860,6 @@ hammer2_vop_write(struct vop_write_args *ap)
 {
        hammer2_inode_t *ip;
        hammer2_trans_t trans;
-       hammer2_chain_t *parent;
        thread_t td;
        struct vnode *vp;
        struct uio *uio;
@@ -722,17 +899,11 @@ hammer2_vop_write(struct vop_write_args *ap)
        bigwrite = (uio->uio_resid > 100 * 1024 * 1024);
 
        /*
-        * ip must be locked if extending the file.
-        * ip must be locked to avoid racing a truncation.
-        *
-        * ip must be marked modified, particularly because the write
-        * might wind up being copied into the embedded data area.
+        * The transaction interlocks against flush initiations
+        * (note: it will run concurrently with the actual flush).
         */
        hammer2_trans_init(&trans, ip->pmp, 0);
-       parent = hammer2_inode_lock_ex(ip);
-       error = hammer2_write_file(&trans, ip, &parent,
-                                  uio, ap->a_ioflag, seqcount);
-       hammer2_inode_unlock_ex(ip, parent);
+       error = hammer2_write_file(ip, uio, ap->a_ioflag, seqcount);
        hammer2_trans_done(&trans);
 
        return (error);
@@ -749,7 +920,6 @@ int
 hammer2_read_file(hammer2_inode_t *ip, struct uio *uio, int seqcount)
 {
        hammer2_off_t size;
-       hammer2_chain_t *parent;
        struct buf *bp;
        int error;
 
@@ -758,8 +928,9 @@ hammer2_read_file(hammer2_inode_t *ip, struct uio *uio, int seqcount)
        /*
         * UIO read loop.
         */
-       parent = hammer2_inode_lock_sh(ip);
-       size = ip->chain->data->ipdata.size;
+       ccms_thread_lock(&ip->topo_cst, CCMS_STATE_EXCLUSIVE);
+       size = ip->size;
+       ccms_thread_unlock(&ip->topo_cst);
 
        while (uio->uio_resid > 0 && uio->uio_offset < size) {
                hammer2_key_t lbase;
@@ -787,43 +958,35 @@ hammer2_read_file(hammer2_inode_t *ip, struct uio *uio, int seqcount)
                uiomove((char *)bp->b_data + loff, n, uio);
                bqrelse(bp);
        }
-       hammer2_inode_unlock_sh(ip, parent);
        return (error);
 }
 
 /*
- * Called with a locked (ip) to do the underlying write to a file or
- * to build the symlink target.
+ * Write to the file represented by the inode via the logical buffer cache.
+ * The inode may represent a regular file or a symlink.
+ *
+ * The inode must not be locked.
  */
 static
 int
-hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                  hammer2_chain_t **parentp,
+hammer2_write_file(hammer2_inode_t *ip,
                   struct uio *uio, int ioflag, int seqcount)
 {
-       hammer2_inode_data_t *ipdata;
        hammer2_key_t old_eof;
+       hammer2_key_t new_eof;
        struct buf *bp;
        int kflags;
        int error;
-       int modified = 0;
+       int modified;
 
        /*
         * Setup if append
         */
-       ipdata = hammer2_chain_modify_ip(trans, ip, parentp, 0);
+       ccms_thread_lock(&ip->topo_cst, CCMS_STATE_EXCLUSIVE);
        if (ioflag & IO_APPEND)
-               uio->uio_offset = ipdata->size;
-       kflags = 0;
-       error = 0;
-
-       /*
-        * vfs_sync visibility.  Interlocked by the inode ex lock so we
-        * shouldn't have to reassert it multiple times if the ip->chain
-        * is modified/flushed multiple times during the write, except
-        * when we release/reacquire the inode ex lock.
-        */
-       atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
+               uio->uio_offset = ip->size;
+       old_eof = ip->size;
+       ccms_thread_unlock(&ip->topo_cst);
 
        /*
         * Extend the file if necessary.  If the write fails at some point
@@ -833,46 +996,36 @@ hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
         * Doing this now makes it easier to calculate buffer sizes in
         * the loop.
         */
-       KKASSERT(ipdata->type != HAMMER2_OBJTYPE_HARDLINK);
-       old_eof = ipdata->size;
-       if (uio->uio_offset + uio->uio_resid > ipdata->size) {
+       kflags = 0;
+       error = 0;
+       modified = 0;
+
+       if (uio->uio_offset + uio->uio_resid > old_eof) {
+               new_eof = uio->uio_offset + uio->uio_resid;
                modified = 1;
-               hammer2_extend_file(trans, ip, parentp,
-                                   uio->uio_offset + uio->uio_resid);
-               ipdata = &ip->chain->data->ipdata;      /* RELOAD */
+               hammer2_extend_file(ip, new_eof);
                kflags |= NOTE_EXTEND;
+       } else {
+               new_eof = old_eof;
        }
-       KKASSERT(ipdata->type != HAMMER2_OBJTYPE_HARDLINK);
-
+
        /*
         * UIO write loop
         */
        while (uio->uio_resid > 0) {
-               hammer2_chain_t *chain;
                hammer2_key_t lbase;
-               hammer2_key_t leof;
                int trivial;
                int lblksize;
                int loff;
                int n;
+               int rem_size;
 
                /*
                 * Don't allow the buffer build to blow out the buffer
                 * cache.
                 */
-               if ((ioflag & IO_RECURSE) == 0) {
-                       /*
-                        * XXX should try to leave this unlocked through
-                        *      the whole loop
-                        */
-                       hammer2_inode_unlock_ex(ip, *parentp);
+               if ((ioflag & IO_RECURSE) == 0)
                        bwillwrite(HAMMER2_PBUFSIZE);
-                       *parentp = hammer2_inode_lock_ex(ip);
-                       atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
-                       ipdata = &ip->chain->data->ipdata;      /* reload */
-               }
-
-               /* XXX bigwrite & signal check test */
 
                /*
                 * This nominally tells us how much we can cluster and
@@ -881,8 +1034,17 @@ hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
                 * block at a time.
                 */
                lblksize = hammer2_calc_logical(ip, uio->uio_offset,
-                                               &lbase, &leof);
+                                               &lbase, NULL);
                loff = (int)(uio->uio_offset - lbase);
+
+               if (uio->uio_resid < lblksize)
+                       rem_size = (int)uio->uio_resid;
+               else
+                       rem_size = 0;
+
+               KKASSERT(lblksize <= 65536);
 
                /*
                 * Calculate bytes to copy this transfer and whether the
@@ -892,8 +1054,7 @@ hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
                n = lblksize - loff;
                if (n > uio->uio_resid) {
                        n = uio->uio_resid;
-                       if (loff == lbase &&
-                           uio->uio_offset + n == ipdata->size)
+                       if (loff == lbase && uio->uio_offset + n == new_eof)
                                trivial = 1;
                } else if (loff == 0) {
                        trivial = 1;
@@ -945,518 +1106,84 @@ hammer2_write_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
                /*
                 * Ok, copy the data in
                 */
-               hammer2_inode_unlock_ex(ip, *parentp);
                error = uiomove(bp->b_data + loff, n, uio);
-               *parentp = hammer2_inode_lock_ex(ip);
-               atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
-               ipdata = &ip->chain->data->ipdata;      /* reload */
                kflags |= NOTE_WRITE;
                modified = 1;
                if (error) {
                        brelse(bp);
                        break;
                }
-
-               /*
-                * We have to assign physical storage to the buffer we intend
-                * to dirty or write now to avoid deadlocks in the strategy
-                * code later.
-                *
-                * This can return NOOFFSET for inode-embedded data.  The
-                * strategy code will take care of it in that case.
-                */
-               chain = hammer2_assign_physical(trans, ip, parentp,
-                                               lbase, lblksize, &error);
-               ipdata = &ip->chain->data->ipdata;      /* RELOAD */
-
-               if (error) {
-                       KKASSERT(chain == NULL);
-                       brelse(bp);
+               bdwrite(bp);
+               if (error)
                        break;
-               }
-
-               /* XXX update ip_data.mtime */
-
-               /*
-                * Once we dirty a buffer any cached offset becomes invalid.
-                *
-                * NOTE: For cluster_write() always use the trailing block
-                *       size, which is HAMMER2_PBUFSIZE.  lblksize is the
-                *       eof-straddling blocksize and is incorrect.
-                */
-               bp->b_flags |= B_AGE;
-               hammer2_write_bp(chain, bp, ioflag);
-               hammer2_chain_unlock(chain);
        }
 
        /*
         * Cleanup.  If we extended the file EOF but failed to write it all
         * the way through, the entire write is a failure and we have to
         * back up.
         */
-       if (error && ipdata->size != old_eof) {
-               hammer2_truncate_file(trans, ip, parentp, old_eof);
-               ipdata = &ip->chain->data->ipdata;      /* RELOAD */
+       if (error && new_eof != old_eof) {
+               hammer2_truncate_file(ip, old_eof);
        } else if (modified) {
-               ipdata = hammer2_chain_modify_ip(trans, ip, parentp, 0);
-               hammer2_update_time(&ipdata->mtime);
+               ccms_thread_lock(&ip->topo_cst, CCMS_STATE_EXCLUSIVE);
+               hammer2_update_time(&ip->mtime);
+               atomic_set_int(&ip->flags, HAMMER2_INODE_MTIME);
+               ccms_thread_unlock(&ip->topo_cst);
        }
+       atomic_set_int(&ip->flags, HAMMER2_INODE_MODIFIED);
        hammer2_knote(ip->vp, kflags);
 
        return error;
 }
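The UIO write loop above derives each transfer from the logical block geometry: lbase is the block-aligned base of the current offset, loff the offset within that block, n the bytes this pass can copy, and trivial flags blocks that need no read-before-write. A minimal userland sketch of that arithmetic, assuming a fixed 64KiB logical block size (the real code gets lblksize from hammer2_calc_logical(), which can vary near the front of small files) and using loff == 0 for the starts-at-block-base test:

```c
#include <stdint.h>

#define LBLKSIZE 65536          /* fixed logical block size for this sketch */

/*
 * Compute the logical block base (lbase), the offset within that block
 * (loff), and the byte count for this pass (n).  *trivialp is set when
 * the block need not be read first: the copy either starts at the block
 * base and fills it, or starts at the base and runs exactly to the new
 * EOF.
 */
static int
calc_transfer(uint64_t uio_offset, uint64_t uio_resid, uint64_t new_eof,
              uint64_t *lbasep, int *loffp, int *trivialp)
{
        uint64_t lbase = uio_offset & ~(uint64_t)(LBLKSIZE - 1);
        int loff = (int)(uio_offset - lbase);
        int n = LBLKSIZE - loff;
        int trivial = 0;

        if ((uint64_t)n > uio_resid) {
                n = (int)uio_resid;
                if (loff == 0 && uio_offset + (uint64_t)n == new_eof)
                        trivial = 1;
        } else if (loff == 0) {
                trivial = 1;
        }
        *lbasep = lbase;
        *loffp = loff;
        *trivialp = trivial;
        return n;
}
```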
 
 /*
- * Write the logical file bp out.
+ * Truncate the size of a file.  The inode must not be locked.
  */
 static
 void
-hammer2_write_bp(hammer2_chain_t *chain, struct buf *bp, int ioflag)
-{
-       hammer2_off_t pbase;
-       hammer2_off_t pmask;
-       hammer2_off_t peof;
-       struct buf *dbp;
-       size_t boff;
-       size_t psize;
-
-       KKASSERT(chain->flags & HAMMER2_CHAIN_MODIFIED);
-
-       switch(chain->bref.type) {
-       case HAMMER2_BREF_TYPE_INODE:
-               KKASSERT(chain->data->ipdata.op_flags &
-                        HAMMER2_OPFLAG_DIRECTDATA);
-               KKASSERT(bp->b_loffset == 0);
-               bcopy(bp->b_data, chain->data->ipdata.u.data,
-                     HAMMER2_EMBEDDED_BYTES);
-               break;
-       case HAMMER2_BREF_TYPE_DATA:
-               psize = hammer2_devblksize(chain->bytes);
-               pmask = (hammer2_off_t)psize - 1;
-               pbase = chain->bref.data_off & ~pmask;
-               boff = chain->bref.data_off & (HAMMER2_OFF_MASK & pmask);
-               peof = (pbase + HAMMER2_SEGMASK64) & ~HAMMER2_SEGMASK64;
-
-               dbp = getblk(chain->hmp->devvp, pbase, psize, 0, 0);
-               bcopy(bp->b_data, dbp->b_data + boff, chain->bytes);
-
-               if (ioflag & IO_SYNC) {
-                       /*
-                        * Synchronous I/O requested.
-                        */
-                       bwrite(dbp);
-               /*
-               } else if ((ioflag & IO_DIRECT) && loff + n == lblksize) {
-                       bdwrite(dbp);
-               */
-               } else if (ioflag & IO_ASYNC) {
-                       bawrite(dbp);
-               } else if (hammer2_cluster_enable) {
-                       cluster_write(dbp, peof, HAMMER2_PBUFSIZE, 4/*XXX*/);
-               } else {
-                       bdwrite(dbp);
-               }
-               break;
-       default:
-               panic("hammer2_write_bp: bad chain type %d\n",
-                     chain->bref.type);
-               /* NOT REACHED */
-               break;
-       }
-       bqrelse(bp);
-}
-
-/*
- * Assign physical storage to a logical block.  This function creates the
- * related meta-data chains representing the data blocks and marks them
- * MODIFIED.  We could mark them MOVED instead but ultimately I need to
- * XXX code the flusher to check that the related logical buffer is
- * flushed.
- *
- * NOOFFSET is returned if the data is inode-embedded.  In this case the
- * strategy code will simply bcopy() the data into the inode.
- *
- * The inode's delta_dcount is adjusted.
- */
-static
-hammer2_chain_t *
-hammer2_assign_physical(hammer2_trans_t *trans,
-                       hammer2_inode_t *ip, hammer2_chain_t **parentp,
-                       hammer2_key_t lbase, int lblksize, int *errorp)
+hammer2_truncate_file(hammer2_inode_t *ip, hammer2_key_t nsize)
 {
-       hammer2_chain_t *parent;
-       hammer2_chain_t *chain;
-       hammer2_off_t pbase;
-
-       /*
-        * Locate the chain associated with lbase, return a locked chain.
-        * However, do not instantiate any data reference (which utilizes a
-        * device buffer) because we will be using direct IO via the
-        * logical buffer cache buffer.
-        */
-       *errorp = 0;
-retry:
-       parent = *parentp;
-       hammer2_chain_lock(parent, HAMMER2_RESOLVE_ALWAYS); /* extra lock */
-       chain = hammer2_chain_lookup(&parent,
-                                    lbase, lbase,
-                                    HAMMER2_LOOKUP_NODATA);
-
-       if (chain == NULL) {
-               /*
-                * We found a hole, create a new chain entry.
-                *
-                * NOTE: DATA chains are created without device backing
-                *       store (nor do we want any).
-                */
-               *errorp = hammer2_chain_create(trans, &parent, &chain,
-                                              lbase, HAMMER2_PBUFRADIX,
-                                              HAMMER2_BREF_TYPE_DATA,
-                                              lblksize);
-               if (chain == NULL) {
-                       hammer2_chain_lookup_done(parent);
-                       panic("hammer2_chain_create: par=%p error=%d\n",
-                               parent, *errorp);
-                       goto retry;
-               }
-
-               pbase = chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX;
-               /*ip->delta_dcount += lblksize;*/
-       } else {
-               switch (chain->bref.type) {
-               case HAMMER2_BREF_TYPE_INODE:
-                       /*
-                        * The data is embedded in the inode.  The
-                        * caller is responsible for marking the inode
-                        * modified and copying the data to the embedded
-                        * area.
-                        */
-                       pbase = NOOFFSET;
-                       break;
-               case HAMMER2_BREF_TYPE_DATA:
-                       if (chain->bytes != lblksize) {
-                               panic("hammer2_assign_physical: "
-                                     "size mismatch %d/%d\n",
-                                     lblksize, chain->bytes);
-                       }
-                       hammer2_chain_modify(trans, &chain,
-                                            HAMMER2_MODIFY_OPTDATA);
-                       pbase = chain->bref.data_off & ~HAMMER2_OFF_MASK_RADIX;
-                       break;
-               default:
-                       panic("hammer2_assign_physical: bad type");
-                       /* NOT REACHED */
-                       pbase = NOOFFSET;
-                       break;
-               }
-       }
-
-       /*
-        * Cleanup.  If chain wound up being the inode (i.e. DIRECTDATA),
-        * we might have to replace *parentp.
-        */
-       hammer2_chain_lookup_done(parent);
-       if (chain) {
-               if (*parentp != chain &&
-                   (*parentp)->core == chain->core) {
-                       parent = *parentp;
-                       *parentp = chain;               /* eats lock */
-                       hammer2_chain_unlock(parent);
-                       hammer2_chain_lock(chain, 0);   /* need another */
-               }
-               /* else chain already locked for return */
-       }
-       return (chain);
-}
-
-/*
- * Truncate the size of a file.
- *
- * This routine adjusts ipdata->size smaller, destroying any related
- * data beyond the new EOF and potentially resizing the block straddling
- * the EOF.
- *
- * The inode must be locked.
- */
-static
-void
-hammer2_truncate_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                     hammer2_chain_t **parentp, hammer2_key_t nsize)
-{
-       hammer2_inode_data_t *ipdata;
-       hammer2_chain_t *parent;
-       hammer2_chain_t *chain;
        hammer2_key_t lbase;
-       hammer2_key_t leof;
-       struct buf *bp;
-       int loff;
-       int error;
-       int oblksize;
        int nblksize;
 
-       bp = NULL;
-       error = 0;
-       ipdata = hammer2_chain_modify_ip(trans, ip, parentp, 0);
-
-       /*
-        * Destroy any logical buffer cache buffers beyond the file EOF.
-        *
-        * We call nvtruncbuf() w/ trivial == 1 to prevent it from messing
-        * around with the buffer straddling EOF, because we need to assign
-        * a new physical offset to it.
-        */
-       if (ip->vp) {
-               nvtruncbuf(ip->vp, nsize,
-                          HAMMER2_PBUFSIZE, (int)nsize & HAMMER2_PBUFMASK,
-                          1);
-       }
-
-       /*
-        * Setup for lookup/search
-        */
-       parent = hammer2_chain_lookup_init(ip->chain, 0);
-
-       /*
-        * Handle the case where a chain/logical-buffer straddles the new
-        * EOF.  We told nvtruncbuf() above not to mess with the logical
-        * buffer straddling the EOF because we need to reassign its storage
-        * and can't let the strategy code do it for us.
-        */
-       loff = (int)nsize & HAMMER2_PBUFMASK;
-       if (loff && ip->vp) {
-               oblksize = hammer2_calc_logical(ip, nsize, &lbase, &leof);
-               error = bread(ip->vp, lbase, oblksize, &bp);
-               KKASSERT(error == 0);
-       }
-       ipdata->size = nsize;
-       nblksize = hammer2_calc_logical(ip, nsize, &lbase, &leof);
-
-       /*
-        * Fixup the chain element.  If we have a logical buffer in-hand
-        * we don't want to create a conflicting device buffer.
-        */
-       if (loff && bp) {
-               chain = hammer2_chain_lookup(&parent, lbase, lbase,
-                                            HAMMER2_LOOKUP_NODATA);
-               if (chain) {
-                       switch(chain->bref.type) {
-                       case HAMMER2_BREF_TYPE_DATA:
-                               hammer2_chain_resize(trans, ip, bp,
-                                            parent, &chain,
-                                            hammer2_getradix(nblksize),
-                                            HAMMER2_MODIFY_OPTDATA);
-                               allocbuf(bp, nblksize);
-                               bzero(bp->b_data + loff, nblksize - loff);
-                               bp->b_bio2.bio_caller_info1.ptr = chain->hmp;
-                               bp->b_bio2.bio_offset = chain->bref.data_off &
-                                                       HAMMER2_OFF_MASK;
-                               break;
-                       case HAMMER2_BREF_TYPE_INODE:
-                               allocbuf(bp, nblksize);
-                               bzero(bp->b_data + loff, nblksize - loff);
-                               bp->b_bio2.bio_caller_info1.ptr = NULL;
-                               bp->b_bio2.bio_offset = NOOFFSET;
-                               break;
-                       default:
-                               panic("hammer2_truncate_file: bad type");
-                               break;
-                       }
-                       hammer2_write_bp(chain, bp, 0);
-                       hammer2_chain_unlock(chain);
-               } else {
-                       /*
-                        * Destroy clean buffer w/ wrong buffer size.  Retain
-                        * backing store.
-                        */
-                       bp->b_flags |= B_RELBUF;
-                       KKASSERT(bp->b_bio2.bio_offset == NOOFFSET);
-                       KKASSERT((bp->b_flags & B_DIRTY) == 0);
-                       bqrelse(bp);
-               }
-       } else if (loff) {
-               /*
-                * WARNING: This utilizes a device buffer for the data.
-                *
-                * This case should not occur because file truncations without
-                * a vnode (and hence no logical buffer cache) should only
-                * always truncate to 0-length.
-                */
-               panic("hammer2_truncate_file: non-zero truncation, no-vnode");
-       }
-
-       /*
-        * Clean up any fragmentory VM pages now that we have properly
-        * resized the straddling buffer.  These pages are no longer
-        * part of the buffer.
-        */
        if (ip->vp) {
+               nblksize = hammer2_calc_logical(ip, nsize, &lbase, NULL);
                nvtruncbuf(ip->vp, nsize,
                           nblksize, (int)nsize & (nblksize - 1),
-                          1);
-       }
-
-       /*
-        * Destroy any physical blocks after the new EOF point.
-        */
-       lbase = (nsize + HAMMER2_PBUFMASK64) & ~HAMMER2_PBUFMASK64;
-       chain = hammer2_chain_lookup(&parent,
-                                    lbase, (hammer2_key_t)-1,
-                                    HAMMER2_LOOKUP_NODATA);
-       while (chain) {
-               /*
-                * Degenerate embedded data case, nothing to loop on.
-                */
-               if (chain->bref.type == HAMMER2_BREF_TYPE_INODE) {
-                       hammer2_chain_unlock(chain);
-                       break;
-               }
-
-               /*
-                * Delete physical data blocks past the file EOF.
-                */
-               if (chain->bref.type == HAMMER2_BREF_TYPE_DATA) {
-                       /*ip->delta_dcount -= chain->bytes;*/
-                       hammer2_chain_delete(trans, chain, 0);
-               }
-               /* XXX check parent if empty indirect block & delete */
-               chain = hammer2_chain_next(&parent, chain,
-                                          lbase, (hammer2_key_t)-1,
-                                          HAMMER2_LOOKUP_NODATA);
+                          0);
        }
-       hammer2_chain_lookup_done(parent);
+       ccms_thread_lock(&ip->topo_cst, CCMS_STATE_EXCLUSIVE);
+       ip->size = nsize;
+       atomic_set_int(&ip->flags, HAMMER2_INODE_RESIZED);
+       ccms_thread_unlock(&ip->topo_cst);
 }
 
 /*
- * Extend the size of a file.  The inode must be locked.
- *
- * We may have to resize the block straddling the old EOF.
+ * Extend the size of a file.  The inode must not be locked.
  */
 static
 void
-hammer2_extend_file(hammer2_trans_t *trans, hammer2_inode_t *ip,
-                   hammer2_chain_t **parentp, hammer2_key_t nsize)
+hammer2_extend_file(hammer2_inode_t *ip, hammer2_key_t nsize)
 {
-       hammer2_inode_data_t *ipdata;
-       hammer2_chain_t *parent;
-       hammer2_chain_t *chain;
-       struct buf *bp;
+       hammer2_key_t lbase;
        hammer2_key_t osize;
-       hammer2_key_t obase;
-       hammer2_key_t nbase;
-       hammer2_key_t leof;
        int oblksize;
        int nblksize;
-       int nradix;
-       int error;
 
-       KKASSERT(ip->vp);
+       ccms_thread_lock(&ip->topo_cst, CCMS_STATE_EXCLUSIVE);
+       osize = ip->size;
+       ip->size = nsize;
+       ccms_thread_unlock(&ip->topo_cst);
 
-       ipdata = hammer2_chain_modify_ip(trans, ip, parentp, 0);
-
-       /*
-        * Nothing to do if the direct-data case is still intact
-        */
-       if ((ipdata->op_flags & HAMMER2_OPFLAG_DIRECTDATA) &&
-           nsize <= HAMMER2_EMBEDDED_BYTES) {
-               ipdata->size = nsize;
+       if (ip->vp) {
+               oblksize = hammer2_calc_logical(ip, osize, &lbase, NULL);
+               nblksize = hammer2_calc_logical(ip, nsize, &lbase, NULL);
                nvextendbuf(ip->vp,
-                           ipdata->size, nsize,
-                           0, HAMMER2_EMBEDDED_BYTES,
-                           0, (int)nsize,
-                           1);
-               /* ipdata = &ip->chain->data->ipdata; RELOAD */
-               return;
-       }
-
-       /*
-        * Calculate the blocksize at the original EOF and resize the block
-        * if necessary.  Adjust the file size in the inode.
-        */
-       osize = ipdata->size;
-       oblksize = hammer2_calc_logical(ip, osize, &obase, &leof);
-       ipdata->size = nsize;
-       nblksize = hammer2_calc_logical(ip, osize, &nbase, &leof);
-
-       /*
-        * Do all required vnode operations, but do not mess with the
-        * buffer straddling the orignal EOF.
-        */
-       nvextendbuf(ip->vp,
-                   ipdata->size, nsize,
-                   0, nblksize,
-                   0, (int)nsize & HAMMER2_PBUFMASK,
-                   1);
-       ipdata = &ip->chain->data->ipdata;
-
-       /*
-        * Early return if we have no more work to do.
-        */
-       if (obase == nbase && oblksize == nblksize &&
-           (ipdata->op_flags & HAMMER2_OPFLAG_DIRECTDATA) == 0) {
-               return;
-       }
-
-       /*
-        * We have work to do, including possibly resizing the buffer
-        * at the previous EOF point and turning off DIRECTDATA mode.
-        */
-       bp = NULL;
-       if (((int)osize & HAMMER2_PBUFMASK)) {
-               error = bread(ip->vp, obase, oblksize, &bp);
-               KKASSERT(error == 0);
-       }
-
-       /*
-        * Disable direct-data mode by loading up a buffer cache buffer
-        * with the data, then converting the inode data area into the
-        * inode indirect block array area.
-        */
-       if (ipdata->op_flags & HAMMER2_OPFLAG_DIRECTDATA) {
-               ipdata->op_flags &= ~HAMMER2_OPFLAG_DIRECTDATA;
-               bzero(&ipdata->u.blockset, sizeof(ipdata->u.blockset));
-       }
-
-       /*
-        * Resize the chain element at the old EOF.
-        */
-       if (((int)osize & HAMMER2_PBUFMASK)) {
-retry:
-               error = 0;
-               parent = hammer2_chain_lookup_init(ip->chain, 0);
-               nradix = hammer2_getradix(nblksize);
-
-               chain = hammer2_chain_lookup(&parent,
-                                            obase, obase,
-                                            HAMMER2_LOOKUP_NODATA);
-               if (chain == NULL) {
-                       error = hammer2_chain_create(trans, &parent, &chain,
-                                                    obase, nblksize,
-                                                    HAMMER2_BREF_TYPE_DATA,
-                                                    nblksize);
-                       if (chain == NULL) {
-                               hammer2_chain_lookup_done(parent);
-                               panic("hammer2_chain_create: par=%p error=%d\n",
-                                       parent, error);
-                               goto retry;
-                       }
-                       /*ip->delta_dcount += nblksize;*/
-               } else {
-                       KKASSERT(chain->bref.type == HAMMER2_BREF_TYPE_DATA);
-                       hammer2_chain_resize(trans, ip, bp,
-                                            parent, &chain,
-                                            nradix,
-                                            HAMMER2_MODIFY_OPTDATA);
-               }
-               if (obase != nbase) {
-                       if (oblksize != HAMMER2_PBUFSIZE)
-                               allocbuf(bp, HAMMER2_PBUFSIZE);
-               } else {
-                       if (oblksize != nblksize)
-                               allocbuf(bp, nblksize);
-               }
-               hammer2_write_bp(chain, bp, 0);
-               hammer2_chain_unlock(chain);
-               hammer2_chain_lookup_done(parent);
+                           osize, nsize,
+                           oblksize, nblksize,
+                           -1, -1, 0);
        }
+       atomic_set_int(&ip->flags, HAMMER2_INODE_RESIZED);
 }
 
 static
@@ -1630,6 +1357,7 @@ hammer2_vop_nmkdir(struct vop_nmkdir_args *ap)
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
 
+       hammer2_chain_memory_wait(dip->pmp);
        hammer2_trans_init(&trans, dip->pmp, 0);
        nip = hammer2_inode_create(&trans, dip, ap->a_vap, ap->a_cred,
                                   name, name_len, &chain, &error);
@@ -1827,6 +1555,7 @@ hammer2_vop_nlink(struct vop_nlink_args *ap)
         * returned chain is locked.
         */
        ip = VTOI(ap->a_vp);
+       hammer2_chain_memory_wait(ip->pmp);
        hammer2_trans_init(&trans, ip->pmp, 0);
 
        chain = hammer2_inode_lock_ex(ip);
@@ -1883,6 +1612,7 @@ hammer2_vop_ncreate(struct vop_ncreate_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
+       hammer2_chain_memory_wait(dip->pmp);
        hammer2_trans_init(&trans, dip->pmp, 0);
 
        nip = hammer2_inode_create(&trans, dip, ap->a_vap, ap->a_cred,
@@ -1918,7 +1648,7 @@ hammer2_vop_nsymlink(struct vop_nsymlink_args *ap)
        const uint8_t *name;
        size_t name_len;
        int error;
-
+
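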
        dip = VTOI(ap->a_dvp);
        if (dip->pmp->ronly)
                return (EROFS);
@@ -1926,6 +1656,7 @@ hammer2_vop_nsymlink(struct vop_nsymlink_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
+       hammer2_chain_memory_wait(dip->pmp);
        hammer2_trans_init(&trans, dip->pmp, 0);
 
        ap->a_vap->va_type = VLNK;      /* enforce type */
@@ -1957,7 +1688,10 @@ hammer2_vop_nsymlink(struct vop_nsymlink_args *ap)
                                 HAMMER2_OPFLAG_DIRECTDATA);
                        bcopy(ap->a_target, nipdata->u.data, bytes);
                        nipdata->size = bytes;
+                       nip->size = bytes;
+                       hammer2_inode_unlock_ex(nip, nparent);
                } else {
+                       hammer2_inode_unlock_ex(nip, nparent);
                        bzero(&auio, sizeof(auio));
                        bzero(&aiov, sizeof(aiov));
                        auio.uio_iov = &aiov;
@@ -1968,14 +1702,14 @@ hammer2_vop_nsymlink(struct vop_nsymlink_args *ap)
                        auio.uio_td = curthread;
                        aiov.iov_base = ap->a_target;
                        aiov.iov_len = bytes;
-                       error = hammer2_write_file(&trans, nip, &nparent,
-                                                  &auio, IO_APPEND, 0);
+                       error = hammer2_write_file(nip, &auio, IO_APPEND, 0);
                        nipdata = &nip->chain->data->ipdata; /* RELOAD */
                        /* XXX handle error */
                        error = 0;
                }
+       } else {
+               hammer2_inode_unlock_ex(nip, nparent);
        }
-       hammer2_inode_unlock_ex(nip, nparent);
        hammer2_trans_done(&trans);
 
        /*
@@ -2010,6 +1744,7 @@ hammer2_vop_nremove(struct vop_nremove_args *ap)
        ncp = ap->a_nch->ncp;
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
+       hammer2_chain_memory_wait(dip->pmp);
        hammer2_trans_init(&trans, dip->pmp, 0);
        error = hammer2_unlink_file(&trans, dip, name, name_len, 0, NULL);
        hammer2_trans_done(&trans);
@@ -2041,6 +1776,7 @@ hammer2_vop_nrmdir(struct vop_nrmdir_args *ap)
        name = ncp->nc_name;
        name_len = ncp->nc_nlen;
 
+       hammer2_chain_memory_wait(dip->pmp);
        hammer2_trans_init(&trans, dip->pmp, 0);
        error = hammer2_unlink_file(&trans, dip, name, name_len, 1, NULL);
        hammer2_trans_done(&trans);
@@ -2090,6 +1826,7 @@ hammer2_vop_nrename(struct vop_nrename_args *ap)
        tname = tncp->nc_name;
        tname_len = tncp->nc_nlen;
 
+       hammer2_chain_memory_wait(tdip->pmp);
        hammer2_trans_init(&trans, tdip->pmp, 0);
 
        /*
@@ -2230,6 +1967,7 @@ hammer2_strategy_read(struct vop_strategy_args *ap)
        hammer2_chain_t *parent;
        hammer2_chain_t *chain;
        hammer2_key_t lbase;
+       int loff;
 
        bio = ap->a_bio;
        bp = bio->bio_buf;
@@ -2270,16 +2008,60 @@ hammer2_strategy_read(struct vop_strategy_args *ap)
                 *
                 * XXX direct-IO shortcut could go here XXX.
                 */
-               hammer2_chain_load_async(chain, hammer2_strategy_read_callback,
-                                        nbio);
+               if (HAMMER2_DEC_COMP(chain->bref.methods) ==
+                   HAMMER2_COMP_LZ4) {
+                       /*
+                        * Block compression is determined by bref.methods value.
+                        */
+                       hammer2_blockref_t *bref;
+                       hammer2_off_t pbase;
+                       hammer2_off_t pmask;
+                       size_t psize;
+
+                       bref = &chain->bref;
+                       psize = hammer2_devblksize(chain->bytes);
+                       pmask = (hammer2_off_t)psize - 1;
+                       pbase = bref->data_off & ~pmask;
+                       loff = (int)((bref->data_off &
+                                     ~HAMMER2_OFF_MASK_RADIX) - pbase);
+                       nbio->bio_caller_info3.value = loff;
+                       breadcb(chain->hmp->devvp, pbase, psize,
+                               hammer2_decompress_LZ4_callback, nbio);
+                       /* XXX async read dev blk not protected by chain lk */
+                       hammer2_chain_unlock(chain);
+               } else if (HAMMER2_DEC_COMP(chain->bref.methods) ==
+                          HAMMER2_COMP_ZLIB) {
+                       hammer2_blockref_t *bref;
+                       hammer2_off_t pbase;
+                       hammer2_off_t pmask;
+                       size_t psize;
+
+                       bref = &chain->bref;
+                       psize = hammer2_devblksize(chain->bytes);
+                       pmask = (hammer2_off_t)psize - 1;
+                       pbase = bref->data_off & ~pmask;
+                       loff = (int)((bref->data_off &
+                                     ~HAMMER2_OFF_MASK_RADIX) - pbase);
+                       nbio->bio_caller_info3.value = loff;
+                       breadcb(chain->hmp->devvp, pbase, psize,
+                               hammer2_decompress_ZLIB_callback, nbio);
+                       /* XXX async read dev blk not protected by chain lk */
+                       hammer2_chain_unlock(chain);
+               } else {
+                       hammer2_chain_load_async(chain,
+                                                hammer2_strategy_read_callback,
+                                                nbio);
+               }
        } else {
-               panic("hammer2_strategy_read: unknown bref type");
+               panic("READ PATH: hammer2_strategy_read: unknown bref type");
                chain = NULL;
        }
        hammer2_inode_unlock_sh(ip, parent);
        return (0);
 }
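The two compressed-read branches decompose bref.data_off identically: the low HAMMER2_OFF_MASK_RADIX bits encode the block's size radix, and the remaining bits form a byte address that is split into an aligned device-buffer base (pbase) and the offset within that buffer (loff). A standalone sketch of the decomposition, assuming 6 radix bits and a caller-supplied power-of-two device buffer size:

```c
#include <stdint.h>

/* Low 6 bits of data_off carry the size radix in this sketch. */
#define OFF_MASK_RADIX  0x3FULL

/*
 * Split data_off into the aligned device-buffer base and the byte
 * offset within that buffer.  psize must be a power of two larger
 * than the radix mask.
 */
static void
decompose_data_off(uint64_t data_off, uint64_t psize,
                   uint64_t *pbasep, int *loffp)
{
        uint64_t pmask = psize - 1;

        *pbasep = data_off & ~pmask;
        *loffp = (int)((data_off & ~OFF_MASK_RADIX) - *pbasep);
}
```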
 
+/*
+ * Read callback for blocks that are not compressed.
+ */
 static
 void
 hammer2_strategy_read_callback(hammer2_chain_t *chain, struct buf *dbp,
@@ -2306,7 +2088,12 @@ hammer2_strategy_read_callback(hammer2_chain_t *chain, struct buf *dbp,
                 *
                 * XXX direct-IO shortcut could go here XXX.
                 */
-               bcopy(data, bp->b_data, bp->b_bcount);
+               KKASSERT(chain->bytes <= bp->b_bcount);
+               bcopy(data, bp->b_data, chain->bytes);
+               if (chain->bytes < bp->b_bcount) {
+                       bzero(bp->b_data + chain->bytes,
+                             bp->b_bcount - chain->bytes);
+               }
                bp->b_flags |= B_NOTMETA;
                bp->b_resid = 0;
                bp->b_error = 0;
@@ -2316,57 +2103,39 @@ hammer2_strategy_read_callback(hammer2_chain_t *chain, struct buf *dbp,
                if (dbp)
                        bqrelse(dbp);
                panic("hammer2_strategy_read: unknown bref type");
-               chain = NULL;
+               /*hammer2_chain_unlock(chain);*/
+               /*chain = NULL;*/
        }
 }
 
 static
 int
 hammer2_strategy_write(struct vop_strategy_args *ap)
-{
-       KKASSERT(0);
-#if 0
-       struct buf *bp;
-       struct bio *bio;
-       struct bio *nbio;
-       hammer2_chain_t *chain;
+{
+       /*
+        * XXX temporary because all write handling is currently
+        * in the vop_write path (which is incorrect and won't catch
+        * certain file modifications via mmap()).  What we need
+        * to do is have the strategy_write code queue the bio to
+        * one or more support threads which will do the complex
+        * logical->physical work and have the vop_write path just do
+        * normal operations on the logical buffer.
+        */
        hammer2_mount_t *hmp;
+       struct bio *bio;
+       struct buf *bp;
        hammer2_inode_t *ip;
-
+       
        bio = ap->a_bio;
        bp = bio->bio_buf;
        ip = VTOI(ap->a_vp);
-       nbio = push_bio(bio);
-
-       KKASSERT((bio->bio_offset & HAMMER2_PBUFMASK64) == 0);
-       KKASSERT(nbio->bio_offset != 0 && nbio->bio_offset != ZFOFFSET);
-
-       if (nbio->bio_offset == NOOFFSET) {
-               /*
-                * The data is embedded in the inode.  Note that strategy
-                * calls for embedded data are synchronous in order to
-                * ensure that ip->chain is stable.  Chain modification
-                * status is handled by the caller.
-                */
-               KKASSERT(ip->chain->flags & HAMMER2_CHAIN_MODIFIED);
-               KKASSERT(bio->bio_offset == 0);
-               KKASSERT(ip->chain && ip->chain->data);
-               chain = ip->chain;
-               bcopy(bp->b_data, chain->data->ipdata.u.data,
-                     HAMMER2_EMBEDDED_BYTES);
-               bp->b_resid = 0;
-               bp->b_error = 0;
-               biodone(nbio);
-       } else {
-               /*
-                * Forward direct IO to the device
-                */
-               hmp = nbio->bio_caller_info1.ptr;
-               KKASSERT(hmp);
-               vn_strategy(hmp->devvp, nbio);
-       }
-       return (0);
-#endif
+       hmp = ip->pmp->mount_cluster->hmp;
+       
+       mtx_lock(&hmp->wthread_mtx);
+       bioq_insert_tail(&hmp->wthread_bioq, ap->a_bio);
+       wakeup(&hmp->wthread_bioq);
+       mtx_unlock(&hmp->wthread_mtx);
+       return(0);
 }
 
 /*
@@ -2453,3 +2222,4 @@ struct vop_ops hammer2_spec_vops = {
 struct vop_ops hammer2_fifo_vops = {
 
 };
+
diff --git a/sys/vfs/hammer2/zlib/hammer2_zlib.h b/sys/vfs/hammer2/zlib/hammer2_zlib.h
new file mode 100644 (file)
index 0000000..a7261d5
--- /dev/null
@@ -0,0 +1,551 @@
+/* zlib.h -- interface of the 'zlib' general purpose compression library
+  version 1.2.8, April 28th, 2013
+
+  Copyright (C) 1995-2013 Jean-loup Gailly and Mark Adler
+
+  This software is provided 'as-is', without any express or implied
+  warranty.  In no event will the authors be held liable for any damages
+  arising from the use of this software.
+
+  Permission is granted to anyone to use this software for any purpose,
+  including commercial applications, and to alter it and redistribute it
+  freely, subject to the following restrictions:
+
+  1. The origin of this software must not be misrepresented; you must not
+     claim that you wrote the original software. If you use this software
+     in a product, an acknowledgment in the product documentation would be
+     appreciated but is not required.
+  2. Altered source versions must be plainly marked as such, and must not be
+     misrepresented as being the original software.
+  3. This notice may not be removed or altered from any source distribution.
+
+  Jean-loup Gailly        Mark Adler
+  jloup@gzip.org          madler@alumni.caltech.edu
+
+
+  The data format used by the zlib library is described by RFCs (Request for
+  Comments) 1950 to 1952 in the files http://tools.ietf.org/html/rfc1950
+  (zlib format), rfc1951 (deflate format) and rfc1952 (gzip format).
+*/
+
+#ifndef ZLIB_H
+#define ZLIB_H
+
+//#include "zconf.h"
+
+#include "hammer2_zlib_zconf.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define ZLIB_VERSION "1.2.8"
+#define ZLIB_VERNUM 0x1280
+#define ZLIB_VER_MAJOR 1
+#define ZLIB_VER_MINOR 2
+#define ZLIB_VER_REVISION 8
+#define ZLIB_VER_SUBREVISION 0
+
+/*
+    The 'zlib' compression library provides in-memory compression and
+  decompression functions, including integrity checks of the uncompressed data.
+  This version of the library supports only one compression method (deflation)
+  but other algorithms will be added later and will have the same stream
+  interface.
+
+    Compression can be done in a single step if the buffers are large enough,
+  or can be done by repeated calls of the compression function.  In the latter
+  case, the application must provide more input and/or consume the output
+  (providing more output space) before each call.
+
+    The compressed data format used by default by the in-memory functions is
+  the zlib format, which is a zlib wrapper documented in RFC 1950, wrapped
+  around a deflate stream, which is itself documented in RFC 1951.
+
+    The library also supports reading and writing files in gzip (.gz) format
+  with an interface similar to that of stdio using the functions that start
+  with "gz".  The gzip format is different from the zlib format.  gzip is a
+  gzip wrapper, documented in RFC 1952, wrapped around a deflate stream.
+
+    This library can optionally read and write gzip streams in memory as well.
+
+    The zlib format was designed to be compact and fast for use in memory
+  and on communications channels.  The gzip format was designed for single-
+  file compression on file systems, has a larger header than zlib to maintain
+  directory information, and uses a different, slower check method than zlib.
+
+    The library does not install any signal handler.  The decoder checks
+  the consistency of the compressed data, so the library should never crash
+  even in case of corrupted input.
+*/
+
+struct internal_state;
+
+typedef struct z_stream_s {
+    z_const Bytef *next_in;     /* next input byte */
+    uInt     avail_in;  /* number of bytes available at next_in */
+    uLong    total_in;  /* total number of input bytes read so far */
+
+    Bytef    *next_out; /* next output byte should be put there */
+    uInt     avail_out; /* remaining free space at next_out */
+    uLong    total_out; /* total number of bytes output so far */
+
+    z_const char *msg;  /* last error message, NULL if no error */
+    struct internal_state FAR *state; /* not visible by applications */
+
+    int     data_type;  /* best guess about the data type: binary or text */
+    uLong   adler;      /* adler32 value of the uncompressed data */
+    uLong   reserved;   /* reserved for future use */
+} z_stream;
+
+typedef z_stream FAR *z_streamp;
+
+/*
+     The application must update next_in and avail_in when avail_in has dropped
+   to zero.  It must update next_out and avail_out when avail_out has dropped
+   to zero.  The application must initialize zalloc, zfree and opaque before
+   calling the init function.  All other fields are set by the compression
+   library and must not be updated by the application.
+
+     The opaque value provided by the application will be passed as the first
+   parameter for calls of zalloc and zfree.  This can be useful for custom
+   memory management.  The compression library attaches no meaning to the
+   opaque value.
+
+     zalloc must return Z_NULL if there is not enough memory for the object.
+   If zlib is used in a multi-threaded application, zalloc and zfree must be
+   thread safe.
+
+     On 16-bit systems, the functions zalloc and zfree must be able to allocate
+   exactly 65536 bytes, but will not be required to allocate more than this if
+   the symbol MAXSEG_64K is defined (see zconf.h).  WARNING: On MSDOS, pointers
+   returned by zalloc for objects of exactly 65536 bytes *must* have their
+   offset normalized to zero.  The default allocation function provided by this
+   library ensures this (see zutil.c).  To reduce memory requirements and avoid
+   any allocation of 64K objects, at the expense of compression ratio, compile
+   the library with -DMAX_WBITS=14 (see zconf.h).
+
+     The fields total_in and total_out can be used for statistics or progress
+   reports.  After compression, total_in holds the total size of the
+   uncompressed data and may be saved for use in the decompressor (particularly
+   if the decompressor wants to decompress everything in a single step).
+*/
+
+                        /* constants */
+
+#define Z_NO_FLUSH      0
+#define Z_PARTIAL_FLUSH 1
+#define Z_SYNC_FLUSH    2
+#define Z_FULL_FLUSH    3
+#define Z_FINISH        4
+#define Z_BLOCK         5
+#define Z_TREES         6
+/* Allowed flush values; see deflate() and inflate() below for details */
+
+#define Z_OK            0
+#define Z_STREAM_END    1
+#define Z_NEED_DICT     2
+#define Z_ERRNO        (-1)
+#define Z_STREAM_ERROR (-2)
+#define Z_DATA_ERROR   (-3)
+#define Z_MEM_ERROR    (-4)
+#define Z_BUF_ERROR    (-5)
+#define Z_VERSION_ERROR (-6)
+/* Return codes for the compression/decompression functions. Negative values
+ * are errors, positive values are used for special but normal events.
+ */
+
+#define Z_NO_COMPRESSION         0
+#define Z_BEST_SPEED             1
+#define Z_BEST_COMPRESSION       9
+#define Z_DEFAULT_COMPRESSION  (-1)
+/* compression levels */
+
+#define Z_FILTERED            1
+#define Z_HUFFMAN_ONLY        2
+#define Z_RLE                 3
+#define Z_FIXED               4
+#define Z_DEFAULT_STRATEGY    0
+/* compression strategy; see deflateInit2() below for details */
+
+#define Z_BINARY   0
+#define Z_TEXT     1
+#define Z_ASCII    Z_TEXT   /* for compatibility with 1.2.2 and earlier */
+#define Z_UNKNOWN  2
+/* Possible values of the data_type field (though see inflate()) */
+
+#define Z_DEFLATED   8
+/* The deflate compression method (the only one supported in this version) */
+
+#define Z_NULL  0  /* for initializing zalloc, zfree, opaque */
+
+#define zlib_version zlibVersion()
+/* for compatibility with versions < 1.0.2 */
+
+
+                        /* basic functions */
+
+//ZEXTERN const char * ZEXPORT zlibVersion OF((void));
+/* The application can compare zlibVersion and ZLIB_VERSION for consistency.
+   If the first character differs, the library code actually used is not
+   compatible with the zlib.h header file used by the application.  This check
+   is automatically made by deflateInit and inflateInit.
+ */
+
+int deflateInit(z_streamp strm, int level);
+
+/*
+ZEXTERN int ZEXPORT deflateInit OF((z_streamp strm, int level));
+
+     Initializes the internal stream state for compression.  The fields
+   zalloc, zfree and opaque must be initialized before by the caller.  If
+   zalloc and zfree are set to Z_NULL, deflateInit updates them to use default
+   allocation functions.
+
+     The compression level must be Z_DEFAULT_COMPRESSION, or between 0 and 9:
+   1 gives best speed, 9 gives best compression, 0 gives no compression at all
+   (the input data is simply copied a block at a time).  Z_DEFAULT_COMPRESSION
+   requests a default compromise between speed and compression (currently
+   equivalent to level 6).
+
+     deflateInit returns Z_OK if success, Z_MEM_ERROR if there was not enough
+   memory, Z_STREAM_ERROR if level is not a valid compression level, or
+   Z_VERSION_ERROR if the zlib library version (zlib_version) is incompatible
+   with the version assumed by the caller (ZLIB_VERSION).  msg is set to null
+   if there is no error message.  deflateInit does not perform any compression:
+   this will be done by deflate().
+*/
+
+
+int deflate(z_streamp strm, int flush);
+/*
+    deflate compresses as much data as possible, and stops when the input
+  buffer becomes empty or the output buffer becomes full.  It may introduce
+  some output latency (reading input without producing any output) except when
+  forced to flush.
+
+    The detailed semantics are as follows.  deflate performs one or both of the
+  following actions:
+
+  - Compress more input starting at next_in and update next_in and avail_in
+    accordingly.  If not all input can be processed (because there is not
+    enough room in the output buffer), next_in and avail_in are updated and
+    processing will resume at this point for the next call of deflate().
+
+  - Provide more output starting at next_out and update next_out and avail_out
+    accordingly.  This action is forced if the parameter flush is non zero.
+    Forcing flush frequently degrades the compression ratio, so this parameter
+    should be set only when necessary (in interactive applications).  Some
+    output may be provided even if flush is not set.
+
+    Before the call of deflate(), the application should ensure that at least
+  one of the actions is possible, by providing more input and/or consuming more
+  output, and updating avail_in or avail_out accordingly; avail_out should
+  never be zero before the call.  The application can consume the compressed
+  output when it wants, for example when the output buffer is full (avail_out
+  == 0), or after each call of deflate().  If deflate returns Z_OK and with
+  zero avail_out, it must be called again after making room in the output
+  buffer because there might be more output pending.
+
+    Normally the parameter flush is set to Z_NO_FLUSH, which allows deflate to
+  decide how much data to accumulate before producing output, in order to
+  maximize compression.
+
+    If the parameter flush is set to Z_SYNC_FLUSH, all pending output is
+  flushed to the output buffer and the output is aligned on a byte boundary, so
+  that the decompressor can get all input data available so far.  (In
+  particular avail_in is zero after the call if enough output space has been
+  provided before the call.) Flushing may degrade compression for some
+  compression algorithms and so it should be used only when necessary.  This
+  completes the current deflate block and follows it with an empty stored block
+  that is three bits plus filler bits to the next byte, followed by four bytes
+  (00 00 ff ff).
+
+    If flush is set to Z_PARTIAL_FLUSH, all pending output is flushed to the
+  output buffer, but the output is not aligned to a byte boundary.  All of the
+  input data so far will be available to the decompressor, as for Z_SYNC_FLUSH.
+  This completes the current deflate block and follows it with an empty fixed
+  codes block that is 10 bits long.  This assures that enough bytes are output
+  in order for the decompressor to finish the block before the empty fixed code
+  block.
+
+    If flush is set to Z_BLOCK, a deflate block is completed and emitted, as
+  for Z_SYNC_FLUSH, but the output is not aligned on a byte boundary, and up to
+  seven bits of the current block are held to be written as the next byte after
+  the next deflate block is completed.  In this case, the decompressor may not
+  be provided enough bits at this point in order to complete decompression of
+  the data provided so far to the compressor.  It may need to wait for the next
+  block to be emitted.  This is for advanced applications that need to control
+  the emission of deflate blocks.
+
+    If flush is set to Z_FULL_FLUSH, all output is flushed as with
+  Z_SYNC_FLUSH, and the compression state is reset so that decompression can
+  restart from this point if previous compressed data has been damaged or if
+  random access is desired.  Using Z_FULL_FLUSH too often can seriously degrade
+  compression.
+
+    If deflate returns with avail_out == 0, this function must be called again
+  with the same value of the flush parameter and more output space (updated
+  avail_out), until the flush is complete (deflate returns with non-zero
+  avail_out).  In the case of a Z_FULL_FLUSH or Z_SYNC_FLUSH, make sure that
+  avail_out is greater than six to avoid repeated flush markers due to
+  avail_out == 0 on return.
+
+    If the parameter flush is set to Z_FINISH, pending input is processed,
+  pending output is flushed and deflate returns with Z_STREAM_END if there was
+  enough output space; if deflate returns with Z_OK, this function must be
+  called again with Z_FINISH and more output space (updated avail_out) but no
+  more input data, until it returns with Z_STREAM_END or an error.  After
+  deflate has returned Z_STREAM_END, the only possible operations on the stream
+  are deflateReset or deflateEnd.
+
+    Z_FINISH can be used immediately after deflateInit if all the compression
+  is to be done in a single step.  In this case, avail_out must be at least the
+  value returned by deflateBound (see below).  Then deflate is guaranteed to
+  return Z_STREAM_END.  If not enough output space is provided, deflate will
+  not return Z_STREAM_END, and it must be called again as described above.
+
+    deflate() sets strm->adler to the adler32 checksum of all input read
+  so far (that is, total_in bytes).
+
+    deflate() may update strm->data_type if it can make a good guess about
+  the input data type (Z_BINARY or Z_TEXT).  In doubt, the data is considered
+  binary.  This field is only for information purposes and does not affect the
+  compression algorithm in any manner.
+
+    deflate() returns Z_OK if some progress has been made (more input
+  processed or more output produced), Z_STREAM_END if all input has been
+  consumed and all output has been produced (only when flush is set to
+  Z_FINISH), Z_STREAM_ERROR if the stream state was inconsistent (for example
+  if next_in or next_out was Z_NULL), Z_BUF_ERROR if no progress is possible
+  (for example avail_in or avail_out was zero).  Note that Z_BUF_ERROR is not
+  fatal, and deflate() can be called again with more input and more output
+  space to continue compressing.
+*/
+
+
+int deflateEnd(z_streamp strm);
+/*
+     All dynamically allocated data structures for this stream are freed.
+   This function discards any unprocessed input and does not flush any pending
+   output.
+
+     deflateEnd returns Z_OK if success, Z_STREAM_ERROR if the
+   stream state was inconsistent, Z_DATA_ERROR if the stream was freed
+   prematurely (some input or output was discarded).  In the error case, msg
+   may be set but then points to a static string (which must not be
+   deallocated).
+*/
+
+int inflateInit(z_streamp strm);
+
+/*
+ZEXTERN int ZEXPORT inflateInit OF((z_streamp strm));
+
+     Initializes the internal stream state for decompression.  The fields
+   next_in, avail_in, zalloc, zfree and opaque must be initialized before by
+   the caller.  If next_in is not Z_NULL and avail_in is large enough (the
+   exact value depends on the compression method), inflateInit determines the
+   compression method from the zlib header and allocates all data structures
+   accordingly; otherwise the allocation will be deferred to the first call of
+   inflate.  If zalloc and zfree are set to Z_NULL, inflateInit updates them to
+   use default allocation functions.
+
+     inflateInit returns Z_OK if success, Z_MEM_ERROR if there was not enough
+   memory, Z_VERSION_ERROR if the zlib library version is incompatible with the
+   version assumed by the caller, or Z_STREAM_ERROR if the parameters are
+   invalid, such as a null pointer to the structure.  msg is set to null if
+   there is no error message.  inflateInit does not perform any decompression
+   apart from possibly reading the zlib header if present: actual decompression
+   will be done by inflate().  (So next_in and avail_in may be modified, but
+   next_out and avail_out are unused and unchanged.) The current implementation
+   of inflateInit() does not process any header information -- that is deferred
+   until inflate() is called.
+*/
+
+
+int inflate(z_streamp strm, int flush);
+/*
+    inflate decompresses as much data as possible, and stops when the input
+  buffer becomes empty or the output buffer becomes full.  It may introduce
+  some output latency (reading input without producing any output) except when
+  forced to flush.
+
+  The detailed semantics are as follows.  inflate performs one or both of the
+  following actions:
+
+  - Decompress more input starting at next_in and update next_in and avail_in
+    accordingly.  If not all input can be processed (because there is not
+    enough room in the output buffer), next_in is updated and processing will
+    resume at this point for the next call of inflate().
+
+  - Provide more output starting at next_out and update next_out and avail_out
+    accordingly.  inflate() provides as much output as possible, until there is
+    no more input data or no more space in the output buffer (see below about
+    the flush parameter).
+
+    Before the call of inflate(), the application should ensure that at least
+  one of the actions is possible, by providing more input and/or consuming more
+  output, and updating the next_* and avail_* values accordingly.  The
+  application can consume the uncompressed output when it wants, for example
+  when the output buffer is full (avail_out == 0), or after each call of
+  inflate().  If inflate returns Z_OK and with zero avail_out, it must be
+  called again after making room in the output buffer because there might be
+  more output pending.
+
+    The flush parameter of inflate() can be Z_NO_FLUSH, Z_SYNC_FLUSH, Z_FINISH,
+  Z_BLOCK, or Z_TREES.  Z_SYNC_FLUSH requests that inflate() flush as much
+  output as possible to the output buffer.  Z_BLOCK requests that inflate()
+  stop if and when it gets to the next deflate block boundary.  When decoding
+  the zlib or gzip format, this will cause inflate() to return immediately
+  after the header and before the first block.  When doing a raw inflate,
+  inflate() will go ahead and process the first block, and will return when it
+  gets to the end of that block, or when it runs out of data.
+
+    The Z_BLOCK option assists in appending to or combining deflate streams.
+  Also to assist in this, on return inflate() will set strm->data_type to the
+  number of unused bits in the last byte taken from strm->next_in, plus 64 if
+  inflate() is currently decoding the last block in the deflate stream, plus
+  128 if inflate() returned immediately after decoding an end-of-block code or
+  decoding the complete header up to just before the first byte of the deflate
+  stream.  The end-of-block will not be indicated until all of the uncompressed
+  data from that block has been written to strm->next_out.  The number of
+  unused bits may in general be greater than seven, except when bit 7 of
+  data_type is set, in which case the number of unused bits will be less than
+  eight.  data_type is set as noted here every time inflate() returns for all
+  flush options, and so can be used to determine the amount of currently
+  consumed input in bits.
+
+    The Z_TREES option behaves as Z_BLOCK does, but it also returns when the
+  end of each deflate block header is reached, before any actual data in that
+  block is decoded.  This allows the caller to determine the length of the
+  deflate block header for later use in random access within a deflate block.
+  256 is added to the value of strm->data_type when inflate() returns
+  immediately after reaching the end of the deflate block header.
+
+    inflate() should normally be called until it returns Z_STREAM_END or an
+  error.  However if all decompression is to be performed in a single step (a
+  single call of inflate), the parameter flush should be set to Z_FINISH.  In
+  this case all pending input is processed and all pending output is flushed;
+  avail_out must be large enough to hold all of the uncompressed data for the
+  operation to complete.  (The size of the uncompressed data may have been
+  saved by the compressor for this purpose.) The use of Z_FINISH is not
+  required to perform an inflation in one step.  However it may be used to
+  inform inflate that a faster approach can be used for the single inflate()
+  call.  Z_FINISH also informs inflate to not maintain a sliding window if the
+  stream completes, which reduces inflate's memory footprint.  If the stream
+  does not complete, either because not all of the stream is provided or not
+  enough output space is provided, then a sliding window will be allocated and
+  inflate() can be called again to continue the operation as if Z_NO_FLUSH had
+  been used.
+
+     In this implementation, inflate() always flushes as much output as
+  possible to the output buffer, and always uses the faster approach on the
+  first call.  So the effects of the flush parameter in this implementation are
+  on the return value of inflate() as noted below, when inflate() returns early
+  when Z_BLOCK or Z_TREES is used, and when inflate() avoids the allocation of
+  memory for a sliding window when Z_FINISH is used.
+
+     If a preset dictionary is needed after this call (see inflateSetDictionary
+  below), inflate sets strm->adler to the Adler-32 checksum of the dictionary
+  chosen by the compressor and returns Z_NEED_DICT; otherwise it sets
+  strm->adler to the Adler-32 checksum of all output produced so far (that is,
+  total_out bytes) and returns Z_OK, Z_STREAM_END or an error code as described
+  below.  At the end of the stream, inflate() checks that its computed adler32
+  checksum is equal to that saved by the compressor and returns Z_STREAM_END
+  only if the checksum is correct.
+
+    inflate() can decompress and check either zlib-wrapped or gzip-wrapped
+  deflate data.  The header type is detected automatically, if requested when
+  initializing with inflateInit2().  Any information contained in the gzip
+  header is not retained, so applications that need that information should
+  instead use raw inflate, see inflateInit2() below, or inflateBack() and
+  perform their own processing of the gzip header and trailer.  When processing
+  gzip-wrapped deflate data, strm->adler32 is set to the CRC-32 of the output
+  produced so far.  The CRC-32 is checked against the gzip trailer.
+
+    inflate() returns Z_OK if some progress has been made (more input processed
+  or more output produced), Z_STREAM_END if the end of the compressed data has
+  been reached and all uncompressed output has been produced, Z_NEED_DICT if a
+  preset dictionary is needed at this point, Z_DATA_ERROR if the input data was
+  corrupted (input stream not conforming to the zlib format or incorrect check
+  value), Z_STREAM_ERROR if the stream structure was inconsistent (for example
+  next_in or next_out was Z_NULL), Z_MEM_ERROR if there was not enough memory,
+  Z_BUF_ERROR if no progress is possible or if there was not enough room in the
+  output buffer when Z_FINISH is used.  Note that Z_BUF_ERROR is not fatal, and
+  inflate() can be called again with more input and more output space to
+  continue decompressing.  If Z_DATA_ERROR is returned, the application may
+  then call inflateSync() to look for a good compression block if a partial
+  recovery of the data is desired.
+*/
+
+
+int inflateEnd(z_streamp strm);
+/*
+     All dynamically allocated data structures for this stream are freed.
+   This function discards any unprocessed input and does not flush any pending
+   output.
+
+     inflateEnd returns Z_OK if success, Z_STREAM_ERROR if the stream state
+   was inconsistent.  In the error case, msg may be set but then points to a
+   static string (which must not be deallocated).
+*/
+
+                        /* checksum functions */
+
+/*
+     These functions are not related to compression but are exported
+   anyway because they might be useful in applications using the compression
+   library.
+*/
+
+uLong adler32(uLong adler, const Bytef *buf, uInt len);
+/*
+     Update a running Adler-32 checksum with the bytes buf[0..len-1] and
+   return the updated checksum.  If buf is Z_NULL, this function returns the
+   required initial value for the checksum.
+
+     An Adler-32 checksum is almost as reliable as a CRC-32 but can be
+   computed much faster.
+
+   Usage example:
+
+     uLong adler = adler32(0L, Z_NULL, 0);
+
+     while (read_buffer(buffer, length) != EOF) {
+       adler = adler32(adler, buffer, length);
+     }
+     if (adler != original_adler) error();
+*/
+
+                        /* various hacks, don't look :) */
+
+/* deflateInit and inflateInit are macros to allow checking the zlib version
+ * and the compiler's view of z_stream:
+ */
+int deflateInit_(z_streamp strm, int level,
+                                     const char *version, int stream_size);
+int inflateInit_(z_streamp strm,
+                                     const char *version, int stream_size);
+
+#define deflateInit(strm, level) \
+        deflateInit_((strm), (level), ZLIB_VERSION, (int)sizeof(z_stream))
+#define inflateInit(strm) \
+        inflateInit_((strm), ZLIB_VERSION, (int)sizeof(z_stream))
+#define deflateInit2(strm, level, method, windowBits, memLevel, strategy) \
+        deflateInit2_((strm),(level),(method),(windowBits),(memLevel),\
+                      (strategy), ZLIB_VERSION, (int)sizeof(z_stream))
+#define inflateInit2(strm, windowBits) \
+        inflateInit2_((strm), (windowBits), ZLIB_VERSION, \
+                      (int)sizeof(z_stream))
+
+/* hack for buggy compilers */
+#if !defined(ZUTIL_H) && !defined(NO_DUMMY_DECL)
+    struct internal_state {int dummy;};
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* ZLIB_H */
diff --git a/sys/vfs/hammer2/zlib/hammer2_zlib_adler32.c b/sys/vfs/hammer2/zlib/hammer2_zlib_adler32.c
new file mode 100644 (file)
index 0000000..437abc1
--- /dev/null
@@ -0,0 +1,175 @@
+/* adler32.c -- compute the Adler-32 checksum of a data stream
+ * Copyright (C) 1995-2011 Mark Adler
+ * For conditions of distribution and use, see copyright notice in zlib.h
+ */
+
+/* @(#) $Id$ */
+
+#include "hammer2_zlib_zutil.h"
+
+#define local static
+
+//local uLong adler32_combine_ (uLong adler1, uLong adler2, z_off64_t len2);
+
+#define BASE 65521      /* largest prime smaller than 65536 */
+#define NMAX 5552
+/* NMAX is the largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1 */
+
+#define DO1(buf,i)  {adler += (buf)[i]; sum2 += adler;}
+#define DO2(buf,i)  DO1(buf,i); DO1(buf,i+1);
+#define DO4(buf,i)  DO2(buf,i); DO2(buf,i+2);
+#define DO8(buf,i)  DO4(buf,i); DO4(buf,i+4);
+#define DO16(buf)   DO8(buf,0); DO8(buf,8);
+
+/* use NO_DIVIDE if your processor does not do division in hardware --
+   try it both ways to see which is faster */
+#ifdef NO_DIVIDE
+/* note that this assumes BASE is 65521, where 65536 % 65521 == 15
+   (thank you to John Reiser for pointing this out) */
+#  define CHOP(a) \
+    do { \
+        unsigned long tmp = a >> 16; \
+        a &= 0xffffUL; \
+        a += (tmp << 4) - tmp; \
+    } while (0)
+#  define MOD28(a) \
+    do { \
+        CHOP(a); \
+        if (a >= BASE) a -= BASE; \
+    } while (0)
+#  define MOD(a) \
+    do { \
+        CHOP(a); \
+        MOD28(a); \
+    } while (0)
+#  define MOD63(a) \
+    do { /* this assumes a is not negative */ \
+        z_off64_t tmp = a >> 32; \
+        a &= 0xffffffffL; \
+        a += (tmp << 8) - (tmp << 5) + tmp; \
+        tmp = a >> 16; \
+        a &= 0xffffL; \
+        a += (tmp << 4) - tmp; \
+        tmp = a >> 16; \
+        a &= 0xffffL; \
+        a += (tmp << 4) - tmp; \
+        if (a >= BASE) a -= BASE; \
+    } while (0)
+#else
+#  define MOD(a) a %= BASE
+#  define MOD28(a) a %= BASE
+#  define MOD63(a) a %= BASE
+#endif
+
+local uLong adler32_combine_(uLong adler1, uLong adler2, z_off64_t len2);
+uLong adler32_combine(uLong adler1, uLong adler2, z_off_t len2);
+
+/* ========================================================================= */
+uLong
+adler32(uLong adler, const Bytef *buf, uInt len)
+{
+    unsigned long sum2;
+    unsigned n;
+
+    /* split Adler-32 into component sums */
+    sum2 = (adler >> 16) & 0xffff;
+    adler &= 0xffff;
+
+    /* in case user likes doing a byte at a time, keep it fast */
+    if (len == 1) {
+        adler += buf[0];
+        if (adler >= BASE)
+            adler -= BASE;
+        sum2 += adler;
+        if (sum2 >= BASE)
+            sum2 -= BASE;
+        return adler | (sum2 << 16);
+    }
+
+    /* initial Adler-32 value (deferred check for len == 1 speed) */
+    if (buf == Z_NULL)
+        return 1L;
+
+    /* in case short lengths are provided, keep it somewhat fast */
+    if (len < 16) {
+        while (len--) {
+            adler += *buf++;
+            sum2 += adler;
+        }
+        if (adler >= BASE)
+            adler -= BASE;
+        MOD28(sum2);            /* only added so many BASE's */
+        return adler | (sum2 << 16);
+    }
+
+    /* do length NMAX blocks -- requires just one modulo operation */
+    while (len >= NMAX) {
+        len -= NMAX;
+        n = NMAX / 16;          /* NMAX is divisible by 16 */
+        do {
+            DO16(buf);          /* 16 sums unrolled */
+            buf += 16;
+        } while (--n);
+        MOD(adler);
+        MOD(sum2);
+    }
+
+    /* do remaining bytes (less than NMAX, still just one modulo) */
+    if (len) {                  /* avoid modulos if none remaining */
+        while (len >= 16) {
+            len -= 16;
+            DO16(buf);
+            buf += 16;
+        }
+        while (len--) {
+            adler += *buf++;
+            sum2 += adler;
+        }
+        MOD(adler);
+        MOD(sum2);
+    }
+
+    /* return recombined sums */
+    return adler | (sum2 << 16);
+}
+
+/* ========================================================================= */
+local
+uLong
+adler32_combine_(uLong adler1, uLong adler2, z_off64_t len2)
+{
+    unsigned long sum1;
+    unsigned long sum2;
+    unsigned rem;
+
+    /* for negative len, return invalid adler32 as a clue for debugging */
+    if (len2 < 0)
+        return 0xffffffffUL;
+
+    /* the derivation of this formula is left as an exercise for the reader */
+    MOD63(len2);                /* assumes len2 >= 0 */
+    rem = (unsigned)len2;
+    sum1 = adler1 & 0xffff;
+    sum2 = rem * sum1;
+    MOD(sum2);
+    sum1 += (adler2 & 0xffff) + BASE - 1;
+    sum2 += ((adler1 >> 16) & 0xffff) + ((adler2 >> 16) & 0xffff) + BASE - rem;
+    if (sum1 >= BASE) sum1 -= BASE;
+    if (sum1 >= BASE) sum1 -= BASE;
+    if (sum2 >= (BASE << 1)) sum2 -= (BASE << 1);
+    if (sum2 >= BASE) sum2 -= BASE;
+    return sum1 | (sum2 << 16);
+}
+
+/* ========================================================================= */
+uLong
+adler32_combine(uLong adler1, uLong adler2, z_off_t len2)
+{
+    return adler32_combine_(adler1, adler2, len2);
+}
+
+uLong
+adler32_combine64(uLong adler1, uLong adler2, z_off64_t len2)
+{
+    return adler32_combine_(adler1, adler2, len2);
+}
diff --git a/sys/vfs/hammer2/zlib/hammer2_zlib_deflate.c b/sys/vfs/hammer2/zlib/hammer2_zlib_deflate.c
new file mode 100644 (file)
index 0000000..4e7a719
--- /dev/null
@@ -0,0 +1,1210 @@
+/* deflate.c -- compress data using the deflation algorithm
+ * Copyright (C) 1995-2013 Jean-loup Gailly and Mark Adler
+ * For conditions of distribution and use, see copyright notice in zlib.h
+ */
+
+/*
+ *  ALGORITHM
+ *
+ *      The "deflation" process depends on being able to identify portions
+ *      of the input text which are identical to earlier input (within a
+ *      sliding window trailing behind the input currently being processed).
+ *
+ *      The most straightforward technique turns out to be the fastest for
+ *      most input files: try all possible matches and select the longest.
+ *      The key feature of this algorithm is that insertions into the string
+ *      dictionary are very simple and thus fast, and deletions are avoided
+ *      completely. Insertions are performed at each input character, whereas
+ *      string matches are performed only when the previous match ends. So it
+ *      is preferable to spend more time in matches to allow very fast string
+ *      insertions and avoid deletions. The matching algorithm for small
+ *      strings is inspired from that of Rabin & Karp. A brute force approach
+ *      is used to find longer strings when a small match has been found.
+ *      A similar algorithm is used in comic (by Jan-Mark Wams) and freeze
+ *      (by Leonid Broukhis).
+ *         A previous version of this file used a more sophisticated algorithm
+ *      (by Fiala and Greene) which is guaranteed to run in linear amortized
+ *      time, but has a larger average cost, uses more memory and is patented.
+ *      However the F&G algorithm may be faster for some highly redundant
+ *      files if the parameter max_chain_length (described below) is too large.
+ *
+ *  ACKNOWLEDGEMENTS
+ *
+ *      The idea of lazy evaluation of matches is due to Jan-Mark Wams, and
+ *      I found it in 'freeze' written by Leonid Broukhis.
+ *      Thanks to many people for bug reports and testing.
+ *
+ *  REFERENCES
+ *
+ *      Deutsch, L.P.,"DEFLATE Compressed Data Format Specification".
+ *      Available at http://tools.ietf.org/html/rfc1951
+ *
+ *      A description of the Rabin and Karp algorithm is given in the book
+ *         "Algorithms" by R. Sedgewick, Addison-Wesley, p252.
+ *
+ *      Fiala,E.R., and Greene,D.H.
+ *         Data Compression with Finite Windows, Comm.ACM, 32,4 (1989) 490-505
+ *
+ */
+
+/* @(#) $Id$ */
+
+#include "hammer2_zlib_deflate.h"
+#include "../hammer2.h"
+#include <sys/malloc.h>		/* for the kmalloc()/kfree() macros */
+
+MALLOC_DECLARE(C_ZLIB_BUFFER_DEFLATE);
+MALLOC_DEFINE(C_ZLIB_BUFFER_DEFLATE, "compzlibbufferdeflate",
+       "A private buffer used by zlib library for deflate function.");
+
+const char deflate_copyright[] =
+   " deflate 1.2.8 Copyright 1995-2013 Jean-loup Gailly and Mark Adler ";
+/*
+  If you use the zlib library in a product, an acknowledgment is welcome
+  in the documentation of your product. If for some reason you cannot
+  include such an acknowledgment, I would appreciate that you keep this
+  copyright string in the executable of your product.
+ */
+
+/* ===========================================================================
+ *  Function prototypes.
+ */
+typedef enum {
+    need_more,      /* block not completed, need more input or more output */
+    block_done,     /* block flush performed */
+    finish_started, /* finish started, need only more output at next deflate */
+    finish_done     /* finish done, accept no more input or output */
+} block_state;
+
+typedef block_state (*compress_func)(deflate_state *s, int flush);
+/* Compression function. Returns the block state after the call. */
+
+local void fill_window (deflate_state *s);
+#ifndef FASTEST
+local block_state deflate_slow(deflate_state *s, int flush);
+#endif
+local block_state deflate_rle(deflate_state *s, int flush);
+local block_state deflate_huff(deflate_state *s, int flush);
+local void lm_init(deflate_state *s);
+local void putShortMSB(deflate_state *s, uInt b);
+local void flush_pending(z_streamp strm);
+local int read_buf(z_streamp strm, Bytef *buf, unsigned size);
+#ifdef ASMV
+      void match_init(void); /* asm code initialization */
+      uInt longest_match(deflate_state *s, IPos cur_match);
+#else
+local uInt longest_match(deflate_state *s, IPos cur_match);
+#endif
+
+#ifdef DEBUG
+local  void check_match(deflate_state *s, IPos start, IPos match,
+                            int length);
+#endif
+
+int deflateInit2_(z_streamp strm, int level, int method, int windowBits,
+                  int memLevel, int strategy, const char *version,
+                  int stream_size);
+int deflateReset (z_streamp strm);
+int deflateResetKeep (z_streamp strm);
+
+/* ===========================================================================
+ * Local data
+ */
+
+#define NIL 0
+/* Tail of hash chains */
+
+#ifndef TOO_FAR
+#  define TOO_FAR 4096
+#endif
+/* Matches of length 3 are discarded if their distance exceeds TOO_FAR */
+
+/* Values for max_lazy_match, good_match and max_chain_length, depending on
+ * the desired pack level (0..9). The values given below have been tuned to
+ * exclude worst case performance for pathological files. Better values may be
+ * found for specific files.
+ */
+typedef struct config_s {
+   ush good_length; /* reduce lazy search above this match length */
+   ush max_lazy;    /* do not perform lazy search above this match length */
+   ush nice_length; /* quit search above this match length */
+   ush max_chain;
+   compress_func func;
+} config;
+
+local const config configuration_table[10] = {
+/*      good lazy nice chain */
+/* 0 */ {0,    0,  0,    0, deflate_slow/*deflate_stored*/},  /* store only */
+/* 1 */ {4,    4,  8,    4, deflate_slow/*deflate_fast*/}, /* max speed, no lazy matches */
+/* 2 */ {4,    5, 16,    8, deflate_slow/*deflate_fast*/},
+/* 3 */ {4,    6, 32,   32, deflate_slow/*deflate_fast*/},
+
+/* 4 */ {4,    4, 16,   16, deflate_slow},  /* lazy matches */
+/* 5 */ {8,   16, 32,   32, deflate_slow},
+/* 6 */ {8,   16, 128, 128, deflate_slow},
+/* 7 */ {8,   32, 128, 256, deflate_slow},
+/* 8 */ {32, 128, 258, 1024, deflate_slow},
+/* 9 */ {32, 258, 258, 4096, deflate_slow}}; /* max compression */
+
+/* Note: the deflate() code requires max_lazy >= MIN_MATCH and max_chain >= 4
+ * For deflate_fast() (levels <= 3) good is ignored and lazy has a different
+ * meaning.
+ */
+
+#define EQUAL 0
+/* result of memcmp for equal strings */
+
+#ifndef NO_DUMMY_DECL
+struct static_tree_desc_s {int dummy;}; /* for buggy compilers */
+#endif
+
+/* rank Z_BLOCK between Z_NO_FLUSH and Z_PARTIAL_FLUSH */
+#define RANK(f) (((f) << 1) - ((f) > 4 ? 9 : 0))
+
+/* ===========================================================================
+ * Update a hash value with the given input byte
+ * IN  assertion: all calls to UPDATE_HASH are made with consecutive
+ *    input characters, so that a running hash key can be computed from the
+ *    previous key instead of complete recalculation each time.
+ */
+#define UPDATE_HASH(s,h,c) (h = (((h)<<s->hash_shift) ^ (c)) & s->hash_mask)
+
+
+/* ===========================================================================
+ * Insert string str in the dictionary and set match_head to the previous head
+ * of the hash chain (the most recent string with same hash key). Return
+ * the previous length of the hash chain.
+ * If this file is compiled with -DFASTEST, the compression level is forced
+ * to 1, and no hash chains are maintained.
+ * IN  assertion: all calls to INSERT_STRING are made with consecutive
+ *    input characters and the first MIN_MATCH bytes of str are valid
+ *    (except for the last MIN_MATCH-1 bytes of the input file).
+ */
+#define INSERT_STRING(s, str, match_head) \
+   (UPDATE_HASH(s, s->ins_h, s->window[(str) + (MIN_MATCH-1)]), \
+    match_head = s->prev[(str) & s->w_mask] = s->head[s->ins_h], \
+    s->head[s->ins_h] = (Pos)(str))
+
+/* ===========================================================================
+ * Initialize the hash table (avoiding 64K overflow for 16 bit systems).
+ * prev[] will be initialized on the fly.
+ */
+#define CLEAR_HASH(s) \
+    s->head[s->hash_size-1] = NIL; \
+    zmemzero((Bytef *)s->head, (unsigned)(s->hash_size-1)*sizeof(*s->head));
+
+/* ========================================================================= */
+int
+deflateInit_(z_streamp strm, int level, const char *version, int stream_size)
+{
+    return deflateInit2_(strm, level, Z_DEFLATED, MAX_WBITS, DEF_MEM_LEVEL,
+                         Z_DEFAULT_STRATEGY, version, stream_size);
+    /* To do: ignore strm->next_in if we use it as window */
+}
+
+/* ========================================================================= */
+int
+deflateInit2_(z_streamp strm, int level, int method, int windowBits,
+       int memLevel, int strategy, const char *version, int stream_size)
+{
+    deflate_state *s;
+    int wrap = 1;
+    static const char my_version[] = ZLIB_VERSION;
+
+    ushf *overlay;
+    /* We overlay pending_buf and d_buf+l_buf. This works since the average
+     * output size for (length,distance) codes is <= 24 bits.
+     */
+
+    if (version == Z_NULL || version[0] != my_version[0] ||
+        stream_size != sizeof(z_stream)) {
+        return Z_VERSION_ERROR;
+    }
+    if (strm == Z_NULL) return Z_STREAM_ERROR;
+
+    strm->msg = Z_NULL;
+
+    if (level == Z_DEFAULT_COMPRESSION) level = 6;
+
+    if (windowBits < 0) { /* suppress zlib wrapper */
+        wrap = 0;
+        windowBits = -windowBits;
+    }
+    if (memLevel < 1 || memLevel > MAX_MEM_LEVEL || method != Z_DEFLATED ||
+        windowBits < 8 || windowBits > 15 || level < 0 || level > 9 ||
+        strategy < 0 || strategy > Z_FIXED) {
+        return Z_STREAM_ERROR;
+    }
+    if (windowBits == 8) windowBits = 9;  /* until 256-byte window bug fixed */
+    s = (deflate_state *) kmalloc(sizeof(*s), C_ZLIB_BUFFER_DEFLATE, M_INTWAIT);
+    if (s == Z_NULL) return Z_MEM_ERROR;
+    strm->state = (struct internal_state FAR *)s;
+    s->strm = strm;
+
+    s->wrap = wrap;
+    s->w_bits = windowBits;
+    s->w_size = 1 << s->w_bits;
+    s->w_mask = s->w_size - 1;
+
+    s->hash_bits = memLevel + 7;
+    s->hash_size = 1 << s->hash_bits;
+    s->hash_mask = s->hash_size - 1;
+    s->hash_shift =  ((s->hash_bits+MIN_MATCH-1)/MIN_MATCH);
+
+    s->window = (Bytef *) kmalloc((s->w_size)*2*sizeof(Byte), C_ZLIB_BUFFER_DEFLATE, M_INTWAIT);
+    s->prev   = (Posf *)  kmalloc((s->w_size)*sizeof(Pos), C_ZLIB_BUFFER_DEFLATE, M_INTWAIT);
+    s->head   = (Posf *)  kmalloc((s->hash_size)*sizeof(Pos), C_ZLIB_BUFFER_DEFLATE, M_INTWAIT);
+
+    s->high_water = 0;      /* nothing written to s->window yet */
+
+    s->lit_bufsize = 1 << (memLevel + 6); /* 16K elements by default */
+
+    overlay = (ushf *) kmalloc((s->lit_bufsize)*(sizeof(ush)+2), C_ZLIB_BUFFER_DEFLATE, M_INTWAIT);
+    s->pending_buf = (uchf *) overlay;
+    s->pending_buf_size = (ulg)s->lit_bufsize * (sizeof(ush)+2L);
+
+    if (s->window == Z_NULL || s->prev == Z_NULL || s->head == Z_NULL ||
+        s->pending_buf == Z_NULL) {
+        s->status = FINISH_STATE;
+        strm->msg = ERR_MSG(Z_MEM_ERROR);
+        deflateEnd (strm);
+        return Z_MEM_ERROR;
+    }
+    s->d_buf = overlay + s->lit_bufsize/sizeof(ush);
+    s->l_buf = s->pending_buf + (1+sizeof(ush))*s->lit_bufsize;
+
+    s->level = level;
+    s->strategy = strategy;
+    s->method = (Byte)method;
+
+    return deflateReset(strm);
+}
+
+/* ========================================================================= */
+int
+deflateResetKeep (z_streamp strm)
+{
+    deflate_state *s;
+
+    if (strm == Z_NULL || strm->state == Z_NULL) {
+        return Z_STREAM_ERROR;
+    }
+
+    strm->total_in = strm->total_out = 0;
+    strm->msg = Z_NULL; /* use zfree if we ever allocate msg dynamically */
+    strm->data_type = Z_UNKNOWN;
+
+    s = (deflate_state *)strm->state;
+    s->pending = 0;
+    s->pending_out = s->pending_buf;
+
+    if (s->wrap < 0) {
+        s->wrap = -s->wrap; /* was made negative by deflate(..., Z_FINISH); */
+    }
+    s->status = s->wrap ? INIT_STATE : BUSY_STATE;
+    strm->adler = adler32(0L, Z_NULL, 0);
+    s->last_flush = Z_NO_FLUSH;
+
+    _tr_init(s);
+
+    return Z_OK;
+}
+
+/* ========================================================================= */
+int
+deflateReset (z_streamp strm)
+{
+    int ret;
+
+    ret = deflateResetKeep(strm);
+    if (ret == Z_OK)
+        lm_init(strm->state);
+    return ret;
+}
+
+/* =========================================================================
+ * Put a short in the pending buffer. The 16-bit value is put in MSB order.
+ * IN assertion: the stream state is correct and there is enough room in
+ * pending_buf.
+ */
+local
+void
+putShortMSB (deflate_state *s, uInt b)
+{
+    put_byte(s, (Byte)(b >> 8));
+    put_byte(s, (Byte)(b & 0xff));
+}
+
+/* =========================================================================
+ * Flush as much pending output as possible. All deflate() output goes
+ * through this function so some applications may wish to modify it
+ * to avoid allocating a large strm->next_out buffer and copying into it.
+ * (See also read_buf()).
+ */
+local
+void
+flush_pending(z_streamp strm)
+{
+    unsigned len;
+    deflate_state *s = strm->state;
+
+    _tr_flush_bits(s);
+    len = s->pending;
+    if (len > strm->avail_out) len = strm->avail_out;
+    if (len == 0) return;
+
+    zmemcpy(strm->next_out, s->pending_out, len);
+    strm->next_out  += len;
+    s->pending_out  += len;
+    strm->total_out += len;
+    strm->avail_out  -= len;
+    s->pending -= len;
+    if (s->pending == 0) {
+        s->pending_out = s->pending_buf;
+    }
+}
+
+/* ========================================================================= */
+int
+deflate (z_streamp strm, int flush)
+{
+    int old_flush; /* value of flush param for previous deflate call */
+    deflate_state *s;
+
+    if (strm == Z_NULL || strm->state == Z_NULL ||
+        flush > Z_BLOCK || flush < 0) {
+        return Z_STREAM_ERROR;
+    }
+    s = strm->state;
+
+    if (strm->next_out == Z_NULL ||
+        (strm->next_in == Z_NULL && strm->avail_in != 0) ||
+        (s->status == FINISH_STATE && flush != Z_FINISH)) {
+        ERR_RETURN(strm, Z_STREAM_ERROR);
+    }
+    if (strm->avail_out == 0) ERR_RETURN(strm, Z_BUF_ERROR);
+
+    s->strm = strm; /* just in case */
+    old_flush = s->last_flush;
+    s->last_flush = flush;
+
+    /* Write the zlib header, but only once per stream.  deflateResetKeep()
+     * sets INIT_STATE only when wrap != 0; raw deflate (wrap == 0) starts
+     * in BUSY_STATE and gets no header.
+     */
+    if (s->status == INIT_STATE) {
+        uInt header = (Z_DEFLATED + ((s->w_bits-8)<<4)) << 8;
+        uInt level_flags;
+
+        if (s->strategy >= Z_HUFFMAN_ONLY || s->level < 2)
+            level_flags = 0;
+        else if (s->level < 6)
+            level_flags = 1;
+        else if (s->level == 6)
+            level_flags = 2;
+        else
+            level_flags = 3;
+        header |= (level_flags << 6);
+        if (s->strstart != 0) header |= PRESET_DICT;
+        header += 31 - (header % 31);
+
+        s->status = BUSY_STATE;
+        putShortMSB(s, header);
+
+        /* Save the adler32 of the preset dictionary: */
+        if (s->strstart != 0) {
+            putShortMSB(s, (uInt)(strm->adler >> 16));
+            putShortMSB(s, (uInt)(strm->adler & 0xffff));
+        }
+        strm->adler = adler32(0L, Z_NULL, 0);
+    }
+
+    /* Flush as much pending output as possible */
+    if (s->pending != 0) {
+        flush_pending(strm);
+        if (strm->avail_out == 0) {
+            /* Since avail_out is 0, deflate will be called again with
+             * more output space, but possibly with both pending and
+             * avail_in equal to zero. There won't be anything to do,
+             * but this is not an error situation so make sure we
+             * return OK instead of BUF_ERROR at next call of deflate:
+             */
+            s->last_flush = -1;
+            return Z_OK;
+        }
+
+    /* Make sure there is something to do and avoid duplicate consecutive
+     * flushes. For repeated and useless calls with Z_FINISH, we keep
+     * returning Z_STREAM_END instead of Z_BUF_ERROR.
+     */
+    } else if (strm->avail_in == 0 && RANK(flush) <= RANK(old_flush) &&
+               flush != Z_FINISH) {
+        ERR_RETURN(strm, Z_BUF_ERROR);
+    }
+
+    /* User must not provide more input after the first FINISH: */
+    if (s->status == FINISH_STATE && strm->avail_in != 0) {
+        ERR_RETURN(strm, Z_BUF_ERROR);
+    }
+
+    /* Start a new block or continue the current one.
+     */
+    if (strm->avail_in != 0 || s->lookahead != 0 ||
+        (flush != Z_NO_FLUSH && s->status != FINISH_STATE)) {
+        block_state bstate;
+
+        bstate = s->strategy == Z_HUFFMAN_ONLY ? deflate_huff(s, flush) :
+                    (s->strategy == Z_RLE ? deflate_rle(s, flush) :
+                        (*(configuration_table[s->level].func))(s, flush));
+
+        if (bstate == finish_started || bstate == finish_done) {
+            s->status = FINISH_STATE;
+        }
+        if (bstate == need_more || bstate == finish_started) {
+            if (strm->avail_out == 0) {
+                s->last_flush = -1; /* avoid BUF_ERROR next call, see above */
+            }
+            return Z_OK;
+            /* If flush != Z_NO_FLUSH && avail_out == 0, the next call
+             * of deflate should use the same flush parameter to make sure
+             * that the flush is complete. So we don't have to output an
+             * empty block here, this will be done at next call. This also
+             * ensures that for a very small output buffer, we emit at most
+             * one empty block.
+             */
+        }
+        if (bstate == block_done) {
+            if (flush == Z_PARTIAL_FLUSH) {
+                _tr_align(s);
+            } else if (flush != Z_BLOCK) { /* FULL_FLUSH or SYNC_FLUSH */
+                _tr_stored_block(s, (char*)0, 0L, 0);
+                /* For a full flush, this empty block will be recognized
+                 * as a special marker by inflate_sync().
+                 */
+                if (flush == Z_FULL_FLUSH) {
+                    CLEAR_HASH(s);             /* forget history */
+                    if (s->lookahead == 0) {
+                        s->strstart = 0;
+                        s->block_start = 0L;
+                        s->insert = 0;
+                    }
+                }
+            }
+            flush_pending(strm);
+            if (strm->avail_out == 0) {
+              s->last_flush = -1; /* avoid BUF_ERROR at next call, see above */
+              return Z_OK;
+            }
+        }
+    }
+    Assert(strm->avail_out > 0, "bug2");
+
+    if (flush != Z_FINISH) return Z_OK;
+    if (s->wrap <= 0) return Z_STREAM_END;
+
+    /* Write the trailer */
+    putShortMSB(s, (uInt)(strm->adler >> 16));
+    putShortMSB(s, (uInt)(strm->adler & 0xffff));
+    
+    flush_pending(strm);
+    /* If avail_out is zero, the application will call deflate again
+     * to flush the rest.
+     */
+    if (s->wrap > 0) s->wrap = -s->wrap; /* write the trailer only once! */
+    return s->pending != 0 ? Z_OK : Z_STREAM_END;
+}
+
+/* ========================================================================= */
+int
+deflateEnd (z_streamp strm)
+{
+    int status;
+
+    if (strm == Z_NULL || strm->state == Z_NULL) return Z_STREAM_ERROR;
+
+    status = strm->state->status;
+    if (status != INIT_STATE &&
+        status != EXTRA_STATE &&
+        status != NAME_STATE &&
+        status != COMMENT_STATE &&
+        status != HCRC_STATE &&
+        status != BUSY_STATE &&
+        status != FINISH_STATE) {
+      return Z_STREAM_ERROR;
+    }
+
+    /* Deallocate in reverse order of allocations: */
+    kfree(strm->state->pending_buf, C_ZLIB_BUFFER_DEFLATE);
+    kfree(strm->state->head, C_ZLIB_BUFFER_DEFLATE);
+    kfree(strm->state->prev, C_ZLIB_BUFFER_DEFLATE);
+    kfree(strm->state->window, C_ZLIB_BUFFER_DEFLATE);
+
+    kfree(strm->state, C_ZLIB_BUFFER_DEFLATE);
+    strm->state = Z_NULL;
+
+    return status == BUSY_STATE ? Z_DATA_ERROR : Z_OK;
+}
+
+/* ===========================================================================
+ * Read a new buffer from the current input stream, update the adler32
+ * and total number of bytes read.  All deflate() input goes through
+ * this function so some applications may wish to modify it to avoid
+ * allocating a large strm->next_in buffer and copying from it.
+ * (See also flush_pending()).
+ */
+local
+int
+read_buf(z_streamp strm, Bytef *buf, unsigned size)
+{
+    unsigned len = strm->avail_in;
+
+    if (len > size) len = size;
+    if (len == 0) return 0;
+
+    strm->avail_in  -= len;
+
+    zmemcpy(buf, strm->next_in, len);
+    if (strm->state->wrap == 1) {
+        strm->adler = adler32(strm->adler, buf, len);
+    }
+
+    strm->next_in  += len;
+    strm->total_in += len;
+
+    return (int)len;
+}
+
+/* ===========================================================================
+ * Initialize the "longest match" routines for a new zlib stream
+ */
+local
+void
+lm_init (deflate_state *s)
+{
+    s->window_size = (ulg)2L*s->w_size;
+
+    CLEAR_HASH(s);
+
+    /* Set the default configuration parameters:
+     */
+    s->max_lazy_match   = configuration_table[s->level].max_lazy;
+    s->good_match       = configuration_table[s->level].good_length;
+    s->nice_match       = configuration_table[s->level].nice_length;
+    s->max_chain_length = configuration_table[s->level].max_chain;
+
+    s->strstart = 0;
+    s->block_start = 0L;
+    s->lookahead = 0;
+    s->insert = 0;
+    s->match_length = s->prev_length = MIN_MATCH-1;
+    s->match_available = 0;
+    s->ins_h = 0;
+#ifndef FASTEST
+#ifdef ASMV
+    match_init(); /* initialize the asm code */
+#endif
+#endif
+}
+
+#ifndef FASTEST
+/* ===========================================================================
+ * Set match_start to the longest match starting at the given string and
+ * return its length. Matches shorter or equal to prev_length are discarded,
+ * in which case the result is equal to prev_length and match_start is
+ * garbage.
+ * IN assertions: cur_match is the head of the hash chain for the current
+ *   string (strstart) and its distance is <= MAX_DIST, and prev_length >= 1
+ * OUT assertion: the match length is not greater than s->lookahead.
+ */
+#ifndef ASMV
+/* For 80x86 and 680x0, an optimized version will be provided in match.asm or
+ * match.S. The code will be functionally equivalent.
+ */
+local
+uInt
+longest_match(deflate_state *s, IPos cur_match) /* cur_match = current match */
+{
+    unsigned chain_length = s->max_chain_length;/* max hash chain length */
+    register Bytef *scan = s->window + s->strstart; /* current string */
+    register Bytef *match;                       /* matched string */
+    register int len;                           /* length of current match */
+    int best_len = s->prev_length;              /* best match length so far */
+    int nice_match = s->nice_match;             /* stop if match long enough */
+    IPos limit = s->strstart > (IPos)MAX_DIST(s) ?
+        s->strstart - (IPos)MAX_DIST(s) : NIL;
+    /* Stop when cur_match becomes <= limit. To simplify the code,
+     * we prevent matches with the string of window index 0.
+     */
+    Posf *prev = s->prev;
+    uInt wmask = s->w_mask;
+
+#ifdef UNALIGNED_OK
+    /* Compare two bytes at a time. Note: this is not always beneficial.
+     * Try with and without -DUNALIGNED_OK to check.
+     */
+    register Bytef *strend = s->window + s->strstart + MAX_MATCH - 1;
+    register ush scan_start = *(ushf*)scan;
+    register ush scan_end   = *(ushf*)(scan+best_len-1);
+#else
+    register Bytef *strend = s->window + s->strstart + MAX_MATCH;
+    register Byte scan_end1  = scan[best_len-1];
+    register Byte scan_end   = scan[best_len];
+#endif
+
+    /* The code is optimized for HASH_BITS >= 8 and MAX_MATCH-2 multiple of 16.
+     * It is easy to get rid of this optimization if necessary.
+     */
+    Assert(s->hash_bits >= 8 && MAX_MATCH == 258, "Code too clever");
+
+    /* Do not waste too much time if we already have a good match: */
+    if (s->prev_length >= s->good_match) {
+        chain_length >>= 2;
+    }
+    /* Do not look for matches beyond the end of the input. This is necessary
+     * to make deflate deterministic.
+     */
+    if ((uInt)nice_match > s->lookahead) nice_match = s->lookahead;
+
+    Assert((ulg)s->strstart <= s->window_size-MIN_LOOKAHEAD, "need lookahead");
+
+    do {
+        Assert(cur_match < s->strstart, "no future");
+        match = s->window + cur_match;
+
+        /* Skip to next match if the match length cannot increase
+         * or if the match length is less than 2.  Note that the checks below
+         * for insufficient lookahead only occur occasionally for performance
+         * reasons.  Therefore uninitialized memory will be accessed, and
+         * conditional jumps will be made that depend on those values.
+         * However the length of the match is limited to the lookahead, so
+         * the output of deflate is not affected by the uninitialized values.
+         */
+#if (defined(UNALIGNED_OK) && MAX_MATCH == 258)
+        /* This code assumes sizeof(unsigned short) == 2. Do not use
+         * UNALIGNED_OK if your compiler uses a different size.
+         */
+        if (*(ushf*)(match+best_len-1) != scan_end ||
+            *(ushf*)match != scan_start) continue;
+
+        /* It is not necessary to compare scan[2] and match[2] since they are
+         * always equal when the other bytes match, given that the hash keys
+         * are equal and that HASH_BITS >= 8. Compare 2 bytes at a time at
+         * strstart+3, +5, ... up to strstart+257. We check for insufficient
+         * lookahead only every 4th comparison; the 128th check will be made
+         * at strstart+257. If MAX_MATCH-2 is not a multiple of 8, it is
+         * necessary to put more guard bytes at the end of the window, or
+         * to check more often for insufficient lookahead.
+         */
+        Assert(scan[2] == match[2], "scan[2]?");
+        scan++, match++;
+        do {
+        } while (*(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
+                 *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
+                 *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
+                 *(ushf*)(scan+=2) == *(ushf*)(match+=2) &&
+                 scan < strend);
+        /* The funny "do {}" generates better code on most compilers */
+
+        /* Here, scan <= window+strstart+257 */
+        Assert(scan <= s->window+(unsigned)(s->window_size-1), "wild scan");
+        if (*scan == *match) scan++;
+
+        len = (MAX_MATCH - 1) - (int)(strend-scan);
+        scan = strend - (MAX_MATCH-1);
+
+#else /* UNALIGNED_OK */
+
+        if (match[best_len]   != scan_end  ||
+            match[best_len-1] != scan_end1 ||
+            *match            != *scan     ||
+            *++match          != scan[1])      continue;
+
+        /* The check at best_len-1 can be removed because it will be made
+         * again later. (This heuristic is not always a win.)
+         * It is not necessary to compare scan[2] and match[2] since they
+         * are always equal when the other bytes match, given that
+         * the hash keys are equal and that HASH_BITS >= 8.
+         */
+        scan += 2, match++;
+        Assert(*scan == *match, "match[2]?");
+
+        /* We check for insufficient lookahead only every 8th comparison;
+         * the 256th check will be made at strstart+258.
+         */
+        do {
+        } while (*++scan == *++match && *++scan == *++match &&
+                 *++scan == *++match && *++scan == *++match &&
+                 *++scan == *++match && *++scan == *++match &&
+                 *++scan == *++match && *++scan == *++match &&
+                 scan < strend);
+
+        Assert(scan <= s->window+(unsigned)(s->window_size-1), "wild scan");
+
+        len = MAX_MATCH - (int)(strend - scan);
+        scan = strend - MAX_MATCH;
+
+#endif /* UNALIGNED_OK */
+
+        if (len > best_len) {
+            s->match_start = cur_match;
+            best_len = len;
+            if (len >= nice_match) break;
+#ifdef UNALIGNED_OK
+            scan_end = *(ushf*)(scan+best_len-1);
+#else
+            scan_end1  = scan[best_len-1];
+            scan_end   = scan[best_len];
+#endif
+        }
+    } while ((cur_match = prev[cur_match & wmask]) > limit
+             && --chain_length != 0);
+
+    if ((uInt)best_len <= s->lookahead) return (uInt)best_len;
+    return s->lookahead;
+}
+#endif /* ASMV */
+
+#endif /* FASTEST */
+
+#ifdef DEBUG
+/* ===========================================================================
+ * Check that the match at match_start is indeed a match.
+ */
+local
+void
+check_match(deflate_state *s, IPos start, IPos match, int length)
+{
+    /* check that the match is indeed a match */
+    if (zmemcmp(s->window + match,
+                s->window + start, length) != EQUAL) {
+        fprintf(stderr, " start %u, match %u, length %d\n",
+                start, match, length);
+        do {
+            fprintf(stderr, "%c%c", s->window[match++], s->window[start++]);
+        } while (--length != 0);
+        z_error("invalid match");
+    }
+    if (z_verbose > 1) {
+        fprintf(stderr,"\\[%d,%d]", start-match, length);
+        do { putc(s->window[start++], stderr); } while (--length != 0);
+    }
+}
+#else
+#  define check_match(s, start, match, length)
+#endif /* DEBUG */
+
+/* ===========================================================================
+ * Fill the window when the lookahead becomes insufficient.
+ * Updates strstart and lookahead.
+ *
+ * IN assertion: lookahead < MIN_LOOKAHEAD
+ * OUT assertions: strstart <= window_size-MIN_LOOKAHEAD
+ *    At least one byte has been read, or avail_in == 0; reads are
+ *    performed for at least two bytes (required for the zip translate_eol
+ *    option -- not supported here).
+ */
+local
+void
+fill_window(deflate_state *s)
+{
+    register unsigned n, m;
+    register Posf *p;
+    unsigned more;    /* Amount of free space at the end of the window. */
+    uInt wsize = s->w_size;
+
+    Assert(s->lookahead < MIN_LOOKAHEAD, "already enough lookahead");
+
+    do {
+        more = (unsigned)(s->window_size -(ulg)s->lookahead -(ulg)s->strstart);
+
+        /* Deal with !@#$% 64K limit: */
+        if (sizeof(int) <= 2) {
+            if (more == 0 && s->strstart == 0 && s->lookahead == 0) {
+                more = wsize;
+
+            } else if (more == (unsigned)(-1)) {
+                /* Very unlikely, but possible on 16 bit machine if
+                 * strstart == 0 && lookahead == 1 (input done a byte at time)
+                 */
+                more--;
+            }
+        }
+
+        /* If the window is almost full and there is insufficient lookahead,
+         * move the upper half to the lower one to make room in the upper half.
+         */
+        if (s->strstart >= wsize+MAX_DIST(s)) {
+
+            zmemcpy(s->window, s->window+wsize, (unsigned)wsize);
+            s->match_start -= wsize;
+            s->strstart    -= wsize; /* we now have strstart >= MAX_DIST */
+            s->block_start -= (long) wsize;
+
+            /* Slide the hash table (could be avoided with 32 bit values
+               at the expense of memory usage). We slide even when level == 0
+               to keep the hash table consistent if we switch back to level > 0
+               later. (Using level 0 permanently is not an optimal usage of
+               zlib, so we don't care about this pathological case.)
+             */
+            n = s->hash_size;
+            p = &s->head[n];
+            do {
+                m = *--p;
+                *p = (Pos)(m >= wsize ? m-wsize : NIL);
+            } while (--n);
+
+            n = wsize;
+#ifndef FASTEST
+            p = &s->prev[n];
+            do {
+                m = *--p;
+                *p = (Pos)(m >= wsize ? m-wsize : NIL);
+                /* If n is not on any hash chain, prev[n] is garbage but
+                 * its value will never be used.
+                 */
+            } while (--n);
+#endif
+            more += wsize;
+        }
+        if (s->strm->avail_in == 0) break;
+
+        /* If there was no sliding:
+         *    strstart <= WSIZE+MAX_DIST-1 && lookahead <= MIN_LOOKAHEAD - 1 &&
+         *    more == window_size - lookahead - strstart
+         * => more >= window_size - (MIN_LOOKAHEAD-1 + WSIZE + MAX_DIST-1)
+         * => more >= window_size - 2*WSIZE + 2
+         * In the BIG_MEM or MMAP case (not yet supported),
+         *   window_size == input_size + MIN_LOOKAHEAD  &&
+         *   strstart + s->lookahead <= input_size => more >= MIN_LOOKAHEAD.
+         * Otherwise, window_size == 2*WSIZE so more >= 2.
+         * If there was sliding, more >= WSIZE. So in all cases, more >= 2.
+         */
+        Assert(more >= 2, "more < 2");
+
+        n = read_buf(s->strm, s->window + s->strstart + s->lookahead, more);
+        s->lookahead += n;
+
+        /* Initialize the hash value now that we have some input: */
+        if (s->lookahead + s->insert >= MIN_MATCH) {
+            uInt str = s->strstart - s->insert;
+            s->ins_h = s->window[str];
+            UPDATE_HASH(s, s->ins_h, s->window[str + 1]);
+#if MIN_MATCH != 3
+            Call UPDATE_HASH() MIN_MATCH-3 more times
+#endif
+            while (s->insert) {
+                UPDATE_HASH(s, s->ins_h, s->window[str + MIN_MATCH-1]);
+#ifndef FASTEST
+                s->prev[str & s->w_mask] = s->head[s->ins_h];
+#endif
+                s->head[s->ins_h] = (Pos)str;
+                str++;
+                s->insert--;
+                if (s->lookahead + s->insert < MIN_MATCH)
+                    break;
+            }
+        }
+        /* If the whole input has less than MIN_MATCH bytes, ins_h is garbage,
+         * but this is not important since only literal bytes will be emitted.
+         */
+
+    } while (s->lookahead < MIN_LOOKAHEAD && s->strm->avail_in != 0);
+
+    /* If the WIN_INIT bytes after the end of the current data have never been
+     * written, then zero those bytes in order to avoid memory check reports of
+     * the use of uninitialized (or uninitialised as Julian writes) bytes by
+     * the longest match routines.  Update the high water mark for the next
+     * time through here.  WIN_INIT is set to MAX_MATCH since the longest match
+     * routines allow scanning to strstart + MAX_MATCH, ignoring lookahead.
+     */
+    if (s->high_water < s->window_size) {
+        ulg curr = s->strstart + (ulg)(s->lookahead);
+        ulg init;
+
+        if (s->high_water < curr) {
+            /* Previous high water mark below current data -- zero WIN_INIT
+             * bytes or up to end of window, whichever is less.
+             */
+            init = s->window_size - curr;
+            if (init > WIN_INIT)
+                init = WIN_INIT;
+            zmemzero(s->window + curr, (unsigned)init);
+            s->high_water = curr + init;
+        }
+        else if (s->high_water < (ulg)curr + WIN_INIT) {
+            /* High water mark at or above current data, but below current data
+             * plus WIN_INIT -- zero out to current data plus WIN_INIT, or up
+             * to end of window, whichever is less.
+             */
+            init = (ulg)curr + WIN_INIT - s->high_water;
+            if (init > s->window_size - s->high_water)
+                init = s->window_size - s->high_water;
+            zmemzero(s->window + s->high_water, (unsigned)init);
+            s->high_water += init;
+        }
+    }
+
+    Assert((ulg)s->strstart <= s->window_size - MIN_LOOKAHEAD,
+           "not enough room for search");
+}
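The hash-table slide performed above when the window moves down by `wsize` can be sketched in isolation: every stored index either gets rebased into the retained half or, if it referred to the discarded half, is clamped to `NIL`. This is a minimal standalone sketch with illustrative names (`slide_hash` is not the function used in this file):

```c
#include <assert.h>

#define NIL 0  /* tail-of-chain marker, as in deflate */

/* Rebase n hash-chain entries after the window slid down by wsize:
 * entries pointing into the discarded lower half become NIL, the
 * rest are shifted down by wsize. */
static void slide_hash(unsigned short *table, unsigned n, unsigned wsize)
{
    while (n--) {
        unsigned m = table[n];
        table[n] = (unsigned short)(m >= wsize ? m - wsize : NIL);
    }
}
```

For example, with `wsize == 32`, entries `{5, 40, 70}` become `{NIL, 8, 38}`: index 5 fell off the window, the others are rebased.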
+
+/* ===========================================================================
+ * Flush the current block, with given end-of-file flag.
+ * IN assertion: strstart is set to the end of the current match.
+ */
+#define FLUSH_BLOCK_ONLY(s, last) { \
+   _tr_flush_block(s, (s->block_start >= 0L ? \
+                   (charf *)&s->window[(unsigned)s->block_start] : \
+                   (charf *)Z_NULL), \
+                (ulg)((long)s->strstart - s->block_start), \
+                (last)); \
+   s->block_start = s->strstart; \
+   flush_pending(s->strm); \
+   Tracev((stderr,"[FLUSH]")); \
+}
+
+/* Same but force premature exit if necessary. */
+#define FLUSH_BLOCK(s, last) { \
+   FLUSH_BLOCK_ONLY(s, last); \
+   if (s->strm->avail_out == 0) return (last) ? finish_started : need_more; \
+}
+
+#ifndef FASTEST
+/* ===========================================================================
+ * Same as above, but achieves better compression. We use a lazy
+ * evaluation for matches: a match is finally adopted only if there is
+ * no better match at the next window position.
+ */
+local
+block_state
+deflate_slow(deflate_state *s, int flush)
+{
+    IPos hash_head;          /* head of hash chain */
+    int bflush;              /* set if current block must be flushed */
+
+    /* Process the input block. */
+    for (;;) {
+        /* Make sure that we always have enough lookahead, except
+         * at the end of the input file. We need MAX_MATCH bytes
+         * for the next match, plus MIN_MATCH bytes to insert the
+         * string following the next match.
+         */
+        if (s->lookahead < MIN_LOOKAHEAD) {
+            fill_window(s);
+            if (s->lookahead < MIN_LOOKAHEAD && flush == Z_NO_FLUSH) {
+                return need_more;
+            }
+            if (s->lookahead == 0) break; /* flush the current block */
+        }
+
+        /* Insert the string window[strstart .. strstart+2] in the
+         * dictionary, and set hash_head to the head of the hash chain:
+         */
+        hash_head = NIL;
+        if (s->lookahead >= MIN_MATCH) {
+            INSERT_STRING(s, s->strstart, hash_head);
+        }
+
+        /* Find the longest match, discarding those <= prev_length.
+         */
+        s->prev_length = s->match_length, s->prev_match = s->match_start;
+        s->match_length = MIN_MATCH-1;
+
+        if (hash_head != NIL && s->prev_length < s->max_lazy_match &&
+            s->strstart - hash_head <= MAX_DIST(s)) {
+            /* To simplify the code, we prevent matches with the string
+             * of window index 0 (in particular we have to avoid a match
+             * of the string with itself at the start of the input file).
+             */
+            s->match_length = longest_match (s, hash_head);
+            /* longest_match() sets match_start */
+
+            if (s->match_length <= 5 && (s->strategy == Z_FILTERED
+#if TOO_FAR <= 32767
+                || (s->match_length == MIN_MATCH &&
+                    s->strstart - s->match_start > TOO_FAR)
+#endif
+                )) {
+
+                /* If prev_match is also MIN_MATCH, match_start is garbage
+                 * but we will ignore the current match anyway.
+                 */
+                s->match_length = MIN_MATCH-1;
+            }
+        }
+        /* If there was a match at the previous step and the current
+         * match is not better, output the previous match:
+         */
+        if (s->prev_length >= MIN_MATCH && s->match_length <= s->prev_length) {
+            uInt max_insert = s->strstart + s->lookahead - MIN_MATCH;
+            /* Do not insert strings in hash table beyond this. */
+
+            check_match(s, s->strstart-1, s->prev_match, s->prev_length);
+
+            _tr_tally_dist(s, s->strstart -1 - s->prev_match,
+                           s->prev_length - MIN_MATCH, bflush);
+
+            /* Insert in hash table all strings up to the end of the match.
+             * strstart-1 and strstart are already inserted. If there is not
+             * enough lookahead, the last two strings are not inserted in
+             * the hash table.
+             */
+            s->lookahead -= s->prev_length-1;
+            s->prev_length -= 2;
+            do {
+                if (++s->strstart <= max_insert) {
+                    INSERT_STRING(s, s->strstart, hash_head);
+                }
+            } while (--s->prev_length != 0);
+            s->match_available = 0;
+            s->match_length = MIN_MATCH-1;
+            s->strstart++;
+
+            if (bflush) FLUSH_BLOCK(s, 0);
+
+        } else if (s->match_available) {
+            /* If there was no match at the previous position, output a
+             * single literal. If there was a match but the current match
+             * is longer, truncate the previous match to a single literal.
+             */
+            Tracevv((stderr,"%c", s->window[s->strstart-1]));
+            _tr_tally_lit(s, s->window[s->strstart-1], bflush);
+            if (bflush) {
+                FLUSH_BLOCK_ONLY(s, 0);
+            }
+            s->strstart++;
+            s->lookahead--;
+            if (s->strm->avail_out == 0) return need_more;
+        } else {
+            /* There is no previous match to compare with, wait for
+             * the next step to decide.
+             */
+            s->match_available = 1;
+            s->strstart++;
+            s->lookahead--;
+        }
+    }
+    Assert (flush != Z_NO_FLUSH, "no flush?");
+    if (s->match_available) {
+        Tracevv((stderr,"%c", s->window[s->strstart-1]));
+        _tr_tally_lit(s, s->window[s->strstart-1], bflush);
+        s->match_available = 0;
+    }
+    s->insert = s->strstart < MIN_MATCH-1 ? s->strstart : MIN_MATCH-1;
+    if (flush == Z_FINISH) {
+        FLUSH_BLOCK(s, 1);
+        return finish_done;
+    }
+    if (s->last_lit)
+        FLUSH_BLOCK(s, 0);
+    return block_done;
+}
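The core decision in `deflate_slow()` above reduces to one predicate: the match found at the previous position is emitted only when it meets the minimum length and the match at the current position is not longer. A minimal sketch of that rule (the helper name is illustrative, not part of zlib):

```c
#include <assert.h>

/* Lazy-evaluation rule from deflate_slow(): emit the previous match
 * only if it is long enough and the current match does not beat it. */
static int emit_previous(unsigned prev_length, unsigned cur_length,
                         unsigned min_match)
{
    return prev_length >= min_match && cur_length <= prev_length;
}
```

So a 4-byte match followed by a 3-byte match is emitted, while a 3-byte match followed by a 5-byte one is truncated to a literal and the longer match is tried next.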
+#endif /* FASTEST */
+
+/* ===========================================================================
+ * For Z_RLE, simply look for runs of bytes, generate matches only of distance
+ * one.  Do not maintain a hash table.  (It will be regenerated if this run of
+ * deflate switches away from Z_RLE.)
+ */
+local
+block_state
+deflate_rle(deflate_state *s, int flush)
+{
+    int bflush;             /* set if current block must be flushed */
+    uInt prev;              /* byte at distance one to match */
+    Bytef *scan, *strend;   /* scan goes up to strend for length of run */
+
+    for (;;) {
+        /* Make sure that we always have enough lookahead, except
+         * at the end of the input file. We need MAX_MATCH bytes
+         * for the longest run, plus one for the unrolled loop.
+         */
+        if (s->lookahead <= MAX_MATCH) {
+            fill_window(s);
+            if (s->lookahead <= MAX_MATCH && flush == Z_NO_FLUSH) {
+                return need_more;
+            }
+            if (s->lookahead == 0) break; /* flush the current block */
+        }
+
+        /* See how many times the previous byte repeats */
+        s->match_length = 0;
+        if (s->lookahead >= MIN_MATCH && s->strstart > 0) {
+            scan = s->window + s->strstart - 1;
+            prev = *scan;
+            if (prev == *++scan && prev == *++scan && prev == *++scan) {
+                strend = s->window + s->strstart + MAX_MATCH;
+                do {
+                } while (prev == *++scan && prev == *++scan &&
+                         prev == *++scan && prev == *++scan &&
+                         prev == *++scan && prev == *++scan &&
+                         prev == *++scan && prev == *++scan &&
+                         scan < strend);
+                s->match_length = MAX_MATCH - (int)(strend - scan);
+                if (s->match_length > s->lookahead)
+                    s->match_length = s->lookahead;
+            }
+            Assert(scan <= s->window+(uInt)(s->window_size-1), "wild scan");
+        }
+
+        /* Emit match if have run of MIN_MATCH or longer, else emit literal */
+        if (s->match_length >= MIN_MATCH) {
+            check_match(s, s->strstart, s->strstart - 1, s->match_length);
+
+            _tr_tally_dist(s, 1, s->match_length - MIN_MATCH, bflush);
+
+            s->lookahead -= s->match_length;
+            s->strstart += s->match_length;
+            s->match_length = 0;
+        } else {
+            /* No match, output a literal byte */
+            Tracevv((stderr,"%c", s->window[s->strstart]));
+            _tr_tally_lit (s, s->window[s->strstart], bflush);
+            s->lookahead--;
+            s->strstart++;
+        }
+        if (bflush) FLUSH_BLOCK(s, 0);
+    }
+    s->insert = 0;
+    if (flush == Z_FINISH) {
+        FLUSH_BLOCK(s, 1);
+        return finish_done;
+    }
+    if (s->last_lit)
+        FLUSH_BLOCK(s, 0);
+    return block_done;
+}
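The run scan in `deflate_rle()` above just counts how many upcoming bytes repeat the byte at distance one. A bounds-checked sketch of the same idea, without the unrolled loop (names are illustrative):

```c
#include <assert.h>

/* Count how many bytes at buf[pos..] equal the byte at buf[pos-1],
 * capped by the available lookahead and the maximum match length.
 * This mirrors the run scan in deflate_rle() without the unrolling. */
static int rle_run(const unsigned char *buf, int pos, int avail, int max_len)
{
    unsigned char prev = buf[pos - 1];
    int len = 0;

    while (len < avail && len < max_len && buf[pos + len] == prev)
        len++;
    return len;
}
```

On input `"aaaab"` starting at position 1 the run length is 3, which deflate would encode as a distance-one match once it reaches `MIN_MATCH`.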
+
+/* ===========================================================================
+ * For Z_HUFFMAN_ONLY, do not look for matches.  Do not maintain a hash table.
+ * (It will be regenerated if this run of deflate switches away from Huffman.)
+ */
+local
+block_state
+deflate_huff(deflate_state *s, int flush)
+{
+    int bflush;             /* set if current block must be flushed */
+
+    for (;;) {
+        /* Make sure that we have a literal to write. */
+        if (s->lookahead == 0) {
+            fill_window(s);
+            if (s->lookahead == 0) {
+                if (flush == Z_NO_FLUSH)
+                    return need_more;
+                break;      /* flush the current block */
+            }
+        }
+
+        /* Output a literal byte */
+        s->match_length = 0;
+        Tracevv((stderr,"%c", s->window[s->strstart]));
+        _tr_tally_lit (s, s->window[s->strstart], bflush);
+        s->lookahead--;
+        s->strstart++;
+        if (bflush) FLUSH_BLOCK(s, 0);
+    }
+    s->insert = 0;
+    if (flush == Z_FINISH) {
+        FLUSH_BLOCK(s, 1);
+        return finish_done;
+    }
+    if (s->last_lit)
+        FLUSH_BLOCK(s, 0);
+    return block_done;
+}
diff --git a/sys/vfs/hammer2/zlib/hammer2_zlib_deflate.h b/sys/vfs/hammer2/zlib/hammer2_zlib_deflate.h
new file mode 100644 (file)
index 0000000..25ad713
--- /dev/null
@@ -0,0 +1,337 @@
+/* deflate.h -- internal compression state
+ * Copyright (C) 1995-2012 Jean-loup Gailly
+ * For conditions of distribution and use, see copyright notice in zlib.h
+ */
+
+/* WARNING: this file should *not* be used by applications. It is
+   part of the implementation of the compression library and is
+   subject to change. Applications should only use zlib.h.
+ */
+
+/* @(#) $Id$ */
+
+#ifndef DEFLATE_H
+#define DEFLATE_H
+
+#include "hammer2_zlib_zutil.h"
+
+/* ===========================================================================
+ * Internal compression state.
+ */
+
+#define LENGTH_CODES 29
+/* number of length codes, not counting the special END_BLOCK code */
+
+#define LITERALS  256
+/* number of literal bytes 0..255 */
+
+#define L_CODES (LITERALS+1+LENGTH_CODES)
+/* number of Literal or Length codes, including the END_BLOCK code */
+
+#define D_CODES   30
+/* number of distance codes */
+
+#define BL_CODES  19
+/* number of codes used to transfer the bit lengths */
+
+#define HEAP_SIZE (2*L_CODES+1)
+/* maximum heap size */
+
+#define MAX_BITS 15
+/* All codes must not exceed MAX_BITS bits */
+
+#define Buf_size 16
+/* size of bit buffer in bi_buf */
+
+#define INIT_STATE    42
+#define EXTRA_STATE   69
+#define NAME_STATE    73
+#define COMMENT_STATE 91
+#define HCRC_STATE   103
+#define BUSY_STATE   113
+#define FINISH_STATE 666
+/* Stream status */
+
+
+/* Data structure describing a single value and its code string. */
+typedef struct ct_data_s {
+    union {
+        ush  freq;       /* frequency count */
+        ush  code;       /* bit string */
+    } fc;
+    union {
+        ush  dad;        /* father node in Huffman tree */
+        ush  len;        /* length of bit string */
+    } dl;
+} FAR ct_data;
+
+#define Freq fc.freq
+#define Code fc.code
+#define Dad  dl.dad
+#define Len  dl.len
+
+typedef struct static_tree_desc_s  static_tree_desc;
+
+typedef struct tree_desc_s {
+    ct_data *dyn_tree;           /* the dynamic tree */
+    int     max_code;            /* largest code with non zero frequency */
+    static_tree_desc *stat_desc; /* the corresponding static tree */
+} FAR tree_desc;
+
+typedef ush Pos;
+typedef Pos FAR Posf;
+typedef unsigned IPos;
+
+/* A Pos is an index in the character window. We use short instead of int to
+ * save space in the various tables. IPos is used only for parameter passing.
+ */
+
+typedef struct internal_state {
+    z_streamp strm;      /* pointer back to this zlib stream */
+    int   status;        /* as the name implies */
+    Bytef *pending_buf;  /* output still pending */
+    ulg   pending_buf_size; /* size of pending_buf */
+    Bytef *pending_out;  /* next pending byte to output to the stream */
+    uInt   pending;      /* nb of bytes in the pending buffer */
+    int   wrap;          /* bit 0 true for zlib, bit 1 true for gzip */
+    uInt   gzindex;      /* where in extra, name, or comment */
+    Byte  method;        /* can only be DEFLATED */
+    int   last_flush;    /* value of flush param for previous deflate call */
+
+                /* used by deflate.c: */
+
+    uInt  w_size;        /* LZ77 window size (32K by default) */
+    uInt  w_bits;        /* log2(w_size)  (8..16) */
+    uInt  w_mask;        /* w_size - 1 */
+
+    Bytef *window;
+    /* Sliding window. Input bytes are read into the second half of the window,
+     * and move to the first half later to keep a dictionary of at least wSize
+     * bytes. With this organization, matches are limited to a distance of
+     * wSize-MAX_MATCH bytes, but this ensures that IO is always
+     * performed with a length multiple of the block size. Also, it limits
+     * the window size to 64K, which is quite useful on MSDOS.
+     * To do: use the user input buffer as sliding window.
+     */
+
+    ulg window_size;
+    /* Actual size of window: 2*wSize, except when the user input buffer
+     * is directly used as sliding window.
+     */
+
+    Posf *prev;
+    /* Link to older string with same hash index. To limit the size of this
+     * array to 64K, this link is maintained only for the last 32K strings.
+     * An index in this array is thus a window index modulo 32K.
+     */
+
+    Posf *head; /* Heads of the hash chains or NIL. */
+
+    uInt  ins_h;          /* hash index of string to be inserted */
+    uInt  hash_size;      /* number of elements in hash table */
+    uInt  hash_bits;      /* log2(hash_size) */
+    uInt  hash_mask;      /* hash_size-1 */
+
+    uInt  hash_shift;
+    /* Number of bits by which ins_h must be shifted at each input
+     * step. It must be such that after MIN_MATCH steps, the oldest
+     * byte no longer takes part in the hash key, that is:
+     *   hash_shift * MIN_MATCH >= hash_bits
+     */
+
+    long block_start;
+    /* Window position at the beginning of the current output block. Gets
+     * negative when the window is moved backwards.
+     */
+
+    uInt match_length;           /* length of best match */
+    IPos prev_match;             /* previous match */
+    int match_available;         /* set if previous match exists */
+    uInt strstart;               /* start of string to insert */
+    uInt match_start;            /* start of matching string */
+    uInt lookahead;              /* number of valid bytes ahead in window */
+
+    uInt prev_length;
+    /* Length of the best match at previous step. Matches not greater than this
+     * are discarded. This is used in the lazy match evaluation.
+     */
+
+    uInt max_chain_length;
+    /* To speed up deflation, hash chains are never searched beyond this
+     * length.  A higher limit improves compression ratio but degrades the
+     * speed.
+     */
+
+    uInt max_lazy_match;
+    /* Attempt to find a better match only when the current match is strictly
+     * smaller than this value. This mechanism is used only for compression
+     * levels >= 4.
+     */
+#   define max_insert_length  max_lazy_match
+    /* Insert new strings in the hash table only if the match length is not
+     * greater than this length. This saves time but degrades compression.
+     * max_insert_length is used only for compression levels <= 3.
+     */
+
+    int level;    /* compression level (1..9) */
+    int strategy; /* favor or force Huffman coding*/
+
+    uInt good_match;
+    /* Use a faster search when the previous match is longer than this */
+
+    int nice_match; /* Stop searching when current match exceeds this */
+
+                /* used by trees.c: */
+    /* Didn't use ct_data typedef below to suppress compiler warning */
+    struct ct_data_s dyn_ltree[HEAP_SIZE];   /* literal and length tree */
+    struct ct_data_s dyn_dtree[2*D_CODES+1]; /* distance tree */
+    struct ct_data_s bl_tree[2*BL_CODES+1];  /* Huffman tree for bit lengths */
+
+    struct tree_desc_s l_desc;               /* desc. for literal tree */
+    struct tree_desc_s d_desc;               /* desc. for distance tree */
+    struct tree_desc_s bl_desc;              /* desc. for bit length tree */
+
+    ush bl_count[MAX_BITS+1];
+    /* number of codes at each bit length for an optimal tree */
+
+    int heap[2*L_CODES+1];      /* heap used to build the Huffman trees */
+    int heap_len;               /* number of elements in the heap */
+    int heap_max;               /* element of largest frequency */
+    /* The sons of heap[n] are heap[2*n] and heap[2*n+1]. heap[0] is not used.
+     * The same heap array is used to build all trees.
+     */
+
+    uch depth[2*L_CODES+1];
+    /* Depth of each subtree used as tie breaker for trees of equal frequency
+     */
+
+    uchf *l_buf;          /* buffer for literals or lengths */
+
+    uInt  lit_bufsize;
+    /* Size of match buffer for literals/lengths.  There are 4 reasons for
+     * limiting lit_bufsize to 64K:
+     *   - frequencies can be kept in 16 bit counters
+     *   - if compression is not successful for the first block, all input
+     *     data is still in the window so we can still emit a stored block even
+     *     when input comes from standard input.  (This can also be done for
+     *     all blocks if lit_bufsize is not greater than 32K.)
+     *   - if compression is not successful for a file smaller than 64K, we can
+     *     even emit a stored file instead of a stored block (saving 5 bytes).
+     *     This is applicable only for zip (not gzip or zlib).
+     *   - creating new Huffman trees less frequently may not provide fast
+     *     adaptation to changes in the input data statistics. (Take for
+     *     example a binary file with poorly compressible code followed by
+     *     a highly compressible string table.) Smaller buffer sizes give
+     *     fast adaptation but have of course the overhead of transmitting
+     *     trees more frequently.
+     *   - I can't count above 4
+     */
+
+    uInt last_lit;      /* running index in l_buf */
+
+    ushf *d_buf;
+    /* Buffer for distances. To simplify the code, d_buf and l_buf have
+     * the same number of elements. To use different lengths, an extra flag
+     * array would be necessary.
+     */
+
+    ulg opt_len;        /* bit length of current block with optimal trees */
+    ulg static_len;     /* bit length of current block with static trees */
+    uInt matches;       /* number of string matches in current block */
+    uInt insert;        /* bytes at end of window left to insert */
+
+#ifdef DEBUG
+    ulg compressed_len; /* total bit length of compressed file mod 2^32 */
+    ulg bits_sent;      /* bit length of compressed data sent mod 2^32 */
+#endif
+
+    ush bi_buf;
+    /* Output buffer. bits are inserted starting at the bottom (least
+     * significant bits).
+     */
+    int bi_valid;
+    /* Number of valid bits in bi_buf.  All bits above the last valid bit
+     * are always zero.
+     */
+
+    ulg high_water;
+    /* High water mark offset in window for initialized bytes -- bytes above
+     * this are set to zero in order to avoid memory check warnings when
+     * longest match routines access bytes past the input.  This is then
+     * updated to the new high water mark.
+     */
+
+} FAR deflate_state;
+
+/* Output a byte on the stream.
+ * IN assertion: there is enough room in pending_buf.
+ */
+#define put_byte(s, c) {s->pending_buf[s->pending++] = (c);}
+
+
+#define MIN_LOOKAHEAD (MAX_MATCH+MIN_MATCH+1)
+/* Minimum amount of lookahead, except at the end of the input file.
+ * See deflate.c for comments about the MIN_MATCH+1.
+ */
+
+#define MAX_DIST(s)  ((s)->w_size-MIN_LOOKAHEAD)
+/* In order to simplify the code, particularly on 16 bit machines, match
+ * distances are limited to MAX_DIST instead of WSIZE.
+ */
+
+#define WIN_INIT MAX_MATCH
+/* Number of bytes after end of data in window to initialize in order to avoid
+   memory checker errors from longest match routines */
+
+        /* in trees.c */
+void ZLIB_INTERNAL _tr_init(deflate_state *s);
+int ZLIB_INTERNAL _tr_tally(deflate_state *s, unsigned dist, unsigned lc);
+void ZLIB_INTERNAL _tr_flush_block(deflate_state *s, charf *buf,
+                        ulg stored_len, int last);
+void ZLIB_INTERNAL _tr_flush_bits(deflate_state *s);
+void ZLIB_INTERNAL _tr_align(deflate_state *s);
+void ZLIB_INTERNAL _tr_stored_block(deflate_state *s, charf *buf,
+                        ulg stored_len, int last);
+
+#define d_code(dist) \
+   ((dist) < 256 ? _dist_code[dist] : _dist_code[256+((dist)>>7)])
+/* Mapping from a distance to a distance code. dist is the distance - 1 and
+ * must not have side effects. _dist_code[256] and _dist_code[257] are never
+ * used.
+ */
+
+#ifndef DEBUG
+/* Inline versions of _tr_tally for speed: */
+
+#if defined(GEN_TREES_H) || !defined(STDC)
+  extern uch ZLIB_INTERNAL _length_code[];
+  extern uch ZLIB_INTERNAL _dist_code[];
+#else
+  extern const uch ZLIB_INTERNAL _length_code[];
+  extern const uch ZLIB_INTERNAL _dist_code[];
+#endif
+
+# define _tr_tally_lit(s, c, flush) \
+  { uch cc = (c); \
+    s->d_buf[s->last_lit] = 0; \
+    s->l_buf[s->last_lit++] = cc; \
+    s->dyn_ltree[cc].Freq++; \
+    flush = (s->last_lit == s->lit_bufsize-1); \
+   }
+# define _tr_tally_dist(s, distance, length, flush) \
+  { uch len = (length); \
+    ush dist = (distance); \
+    s->d_buf[s->last_lit] = dist; \
+    s->l_buf[s->last_lit++] = len; \
+    dist--; \
+    s->dyn_ltree[_length_code[len]+LITERALS+1].Freq++; \
+    s->dyn_dtree[d_code(dist)].Freq++; \
+    flush = (s->last_lit == s->lit_bufsize-1); \
+  }
+#else
+# define _tr_tally_lit(s, c, flush) flush = _tr_tally(s, 0, c)
+# define _tr_tally_dist(s, distance, length, flush) \
+              flush = _tr_tally(s, distance, length)
+#endif
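The `_tr_tally_*` macros above record each emitted symbol into two parallel buffers, with a zero distance marking a literal, and signal a block flush one entry before the buffer fills. A minimal sketch of that bookkeeping with a deliberately tiny buffer (sizes and names are illustrative only):

```c
#include <assert.h>

#define LIT_BUFSIZE 4  /* tiny for illustration; zlib uses up to 64K */

/* Parallel tally buffers, as in deflate_state: d_buf holds distances
 * (0 means "literal"), l_buf holds literals or match lengths. */
struct tally {
    unsigned short d_buf[LIT_BUFSIZE];
    unsigned char  l_buf[LIT_BUFSIZE];
    unsigned       last_lit;
};

/* Record one literal; return nonzero when the caller should flush the
 * current block, mirroring the flush test in _tr_tally_lit. */
static int tally_lit(struct tally *t, unsigned char c)
{
    t->d_buf[t->last_lit] = 0;          /* distance 0 => literal */
    t->l_buf[t->last_lit++] = c;
    return t->last_lit == LIT_BUFSIZE - 1;
}
```

With `LIT_BUFSIZE` of 4 the third literal triggers the flush signal, leaving one slot free for the symbol that may accompany the flush.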
+
+#endif /* DEFLATE_H */
diff --git a/sys/vfs/hammer2/zlib/hammer2_zlib_inffast.c b/sys/vfs/hammer2/zlib/hammer2_zlib_inffast.c
new file mode 100644 (file)
index 0000000..79f5ced
--- /dev/null
@@ -0,0 +1,340 @@
+/* inffast.c -- fast decoding
+ * Copyright (C) 1995-2008, 2010, 2013 Mark Adler
+ * For conditions of distribution and use, see copyright notice in zlib.h
+ */
+
+#include "hammer2_zlib_zutil.h"
+#include "hammer2_zlib_inftrees.h"
+#include "hammer2_zlib_inflate.h"
+#include "hammer2_zlib_inffast.h"
+
+#ifndef ASMINF
+
+/* Allow machine dependent optimization for post-increment or pre-increment.
+   Based on testing to date,
+   Pre-increment preferred for:
+   - PowerPC G3 (Adler)
+   - MIPS R5000 (Randers-Pehrson)
+   Post-increment preferred for:
+   - none
+   No measurable difference:
+   - Pentium III (Anderson)
+   - M68060 (Nikl)
+ */
+#ifdef POSTINC
+#  define OFF 0
+#  define PUP(a) *(a)++
+#else
+#  define OFF 1
+#  define PUP(a) *++(a)
+#endif
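The `OFF`/`PUP` pair above selects between post- and pre-increment pointer walks; inflate_fast() biases its pointers by `OFF` once up front so the same `PUP` expression works either way. A toy copy loop showing the default pre-increment form (the function name is illustrative):

```c
#include <assert.h>

#define OFF 1
#define PUP(a) *++(a)  /* pre-increment form, the default above */

/* dst and src point one byte BEFORE the data, matching the
 * "in = strm->next_in - OFF" setup in inflate_fast(). */
static void copy_pup(unsigned char *dst, const unsigned char *src,
                     unsigned len)
{
    while (len--)
        PUP(dst) = PUP(src);
}
```

The byte at index 0 of each buffer is a pad that is never touched; the first `PUP` lands on index 1, so callers pass buffers with one leading slack byte rather than forming an out-of-bounds pointer.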
+
+/*
+   Decode literal, length, and distance codes and write out the resulting
+   literal and match bytes until either not enough input or output is
+   available, an end-of-block is encountered, or a data error is encountered.
+   When large enough input and output buffers are supplied to inflate(), for
+   example, a 16K input buffer and a 64K output buffer, more than 95% of the
+   inflate execution time is spent in this routine.
+
+   Entry assumptions:
+
+        state->mode == LEN
+        strm->avail_in >= 6
+        strm->avail_out >= 258
+        start >= strm->avail_out
+        state->bits < 8
+
+   On return, state->mode is one of:
+
+        LEN -- ran out of enough output space or enough available input
+        TYPE -- reached end of block code, inflate() to interpret next block
+        BAD -- error in block data
+
+   Notes:
+
+    - The maximum input bits used by a length/distance pair is 15 bits for the
+      length code, 5 bits for the length extra, 15 bits for the distance code,
+      and 13 bits for the distance extra.  This totals 48 bits, or six bytes.
+      Therefore if strm->avail_in >= 6, then there is enough input to avoid
+      checking for available input while decoding.
+
+    - The maximum bytes that a single length/distance pair can output is 258
+      bytes, which is the maximum length that can be coded.  inflate_fast()
+      requires strm->avail_out >= 258 for each loop to avoid checking for
+      output space.
+ */
+void
+ZLIB_INTERNAL
+inflate_fast(z_streamp strm, unsigned start) /* inflate()'s starting value for strm->avail_out */
+{
+    struct inflate_state FAR *state;
+    z_const unsigned char FAR *in;      /* local strm->next_in */
+    z_const unsigned char FAR *last;    /* have enough input while in < last */
+    unsigned char FAR *out;     /* local strm->next_out */
+    unsigned char FAR *beg;     /* inflate()'s initial strm->next_out */
+    unsigned char FAR *end;     /* while out < end, enough space available */
+#ifdef INFLATE_STRICT
+    unsigned dmax;              /* maximum distance from zlib header */
+#endif
+    unsigned wsize;             /* window size or zero if not using window */
+    unsigned whave;             /* valid bytes in the window */
+    unsigned wnext;             /* window write index */
+    unsigned char FAR *window;  /* allocated sliding window, if wsize != 0 */
+    unsigned long hold;         /* local strm->hold */
+    unsigned bits;              /* local strm->bits */
+    code const FAR *lcode;      /* local strm->lencode */
+    code const FAR *dcode;      /* local strm->distcode */
+    unsigned lmask;             /* mask for first level of length codes */
+    unsigned dmask;             /* mask for first level of distance codes */
+    code here;                  /* retrieved table entry */
+    unsigned op;                /* code bits, operation, extra bits, or */
+                                /*  window position, window bytes to copy */
+    unsigned len;               /* match length, unused bytes */
+    unsigned dist;              /* match distance */
+    unsigned char FAR *from;    /* where to copy match from */
+
+    /* copy state to local variables */
+    state = (struct inflate_state FAR *)strm->state;
+    in = strm->next_in - OFF;
+    last = in + (strm->avail_in - 5);
+    out = strm->next_out - OFF;
+    beg = out - (start - strm->avail_out);
+    end = out + (strm->avail_out - 257);
+#ifdef INFLATE_STRICT
+    dmax = state->dmax;
+#endif
+    wsize = state->wsize;
+    whave = state->whave;
+    wnext = state->wnext;
+    window = state->window;
+    hold = state->hold;
+    bits = state->bits;
+    lcode = state->lencode;
+    dcode = state->distcode;
+    lmask = (1U << state->lenbits) - 1;
+    dmask = (1U << state->distbits) - 1;
+
+    /* decode literals and length/distances until end-of-block or not enough
+       input data or output space */
+    do {
+        if (bits < 15) {
+            hold += (unsigned long)(PUP(in)) << bits;
+            bits += 8;
+            hold += (unsigned long)(PUP(in)) << bits;
+            bits += 8;
+        }
+        here = lcode[hold & lmask];
+      dolen:
+        op = (unsigned)(here.bits);
+        hold >>= op;
+        bits -= op;
+        op = (unsigned)(here.op);
+        if (op == 0) {                          /* literal */
+            Tracevv((stderr, here.val >= 0x20 && here.val < 0x7f ?
+                    "inflate:         literal '%c'\n" :
+                    "inflate:         literal 0x%02x\n", here.val));
+            PUP(out) = (unsigned char)(here.val);
+        }
+        else if (op & 16) {                     /* length base */
+            len = (unsigned)(here.val);
+            op &= 15;                           /* number of extra bits */
+            if (op) {
+                if (bits < op) {
+                    hold += (unsigned long)(PUP(in)) << bits;
+                    bits += 8;
+                }
+                len += (unsigned)hold & ((1U << op) - 1);
+                hold >>= op;
+                bits -= op;
+            }
+            Tracevv((stderr, "inflate:         length %u\n", len));
+            if (bits < 15) {
+                hold += (unsigned long)(PUP(in)) << bits;
+                bits += 8;
+                hold += (unsigned long)(PUP(in)) << bits;
+                bits += 8;
+            }
+            here = dcode[hold & dmask];
+          dodist:
+            op = (unsigned)(here.bits);
+            hold >>= op;
+            bits -= op;
+            op = (unsigned)(here.op);
+            if (op & 16) {                      /* distance base */
+                dist = (unsigned)(here.val);
+                op &= 15;                       /* number of extra bits */
+                if (bits < op) {
+                    hold += (unsigned long)(PUP(in)) << bits;
+                    bits += 8;
+                    if (bits < op) {
+                        hold += (unsigned long)(PUP(in)) << bits;
+                        bits += 8;
+                    }
+                }
+                dist += (unsigned)hold & ((1U << op) - 1);
+#ifdef INFLATE_STRICT
+                if (dist > dmax) {
+                    strm->msg = (char *)"invalid distance too far back";
+                    state->mode = BAD;
+                    break;
+                }
+#endif
+                hold >>= op;
+                bits -= op;
+                Tracevv((stderr, "inflate:         distance %u\n", dist));
+                op = (unsigned)(out - beg);     /* max distance in output */
+                if (dist > op) {                /* see if copy from window */
+                    op = dist - op;             /* distance back in window */
+                    if (op > whave) {
+                        if (state->sane) {
+                            strm->msg =
+                                (char *)"invalid distance too far back";
+                            state->mode = BAD;
+                            break;
+                        }
+#ifdef INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR
+                        if (len <= op - whave) {
+                            do {
+                                PUP(out) = 0;
+                            } while (--len);
+                            continue;
+                        }
+                        len -= op - whave;
+                        do {
+                            PUP(out) = 0;
+                        } while (--op > whave);
+                        if (op == 0) {
+                            from = out - dist;
+                            do {
+                                PUP(out) = PUP(from);
+                            } while (--len);
+                            continue;
+                        }
+#endif
+                    }
+                    from = window - OFF;
+                    if (wnext == 0) {           /* very common case */
+                        from += wsize - op;
+                        if (op < len) {         /* some from window */
+                            len -= op;
+                            do {
+                                PUP(out) = PUP(from);
+                            } while (--op);
+                            from = out - dist;  /* rest from output */
+                        }
+                    }
+                    else if (wnext < op) {      /* wrap around window */
+                        from += wsize + wnext - op;
+                        op -= wnext;
+                        if (op < len) {         /* some from end of window */
+                            len -= op;
+                            do {
+                                PUP(out) = PUP(from);
+                            } while (--op);
+                            from = window - OFF;
+                            if (wnext < len) {  /* some from start of window */
+                                op = wnext;
+                                len -= op;
+                                do {
+                                    PUP(out) = PUP(from);
+                                } while (--op);
+                                from = out - dist;      /* rest from output */
+                            }
+                        }
+                    }
+                    else {                      /* contiguous in window */
+                        from += wnext - op;
+                        if (op < len) {         /* some from window */
+                            len -= op;
+                            do {
+                                PUP(out) = PUP(from);
+                            } while (--op);
+                            from = out - dist;  /* rest from output */