.\"
.\" Copyright (c) 2008
.\" The DragonFly Project. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in
.\" the documentation and/or other materials provided with the
.\" distribution.
.\" 3. Neither the name of The DragonFly Project nor the names of its
.\" contributors may be used to endorse or promote products derived
.\" from this software without specific, prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
.\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
.\" LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
.\" FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
.\" COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
.\" BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
.\" LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
.\" AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
.\" OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
.\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd August 14, 2012
.Dt HAMMER 5
.Os
.Sh NAME
.Nm HAMMER
.Nd HAMMER file system
.Sh SYNOPSIS
To compile this driver into the kernel,
place the following line in your
kernel configuration file:
.Bd -ragged -offset indent
.Cd "options HAMMER"
.Ed
.Pp
Alternatively, to load the driver as a
module at boot time, place the following line in
.Xr loader.conf 5 :
.Bd -literal -offset indent
hammer_load="YES"
.Ed
.Pp
To mount via
.Xr fstab 5 :
.Bd -literal -offset indent
/dev/ad0s1d[:/dev/ad1s1d:...] /mnt hammer rw 2 0
.Ed
.Sh DESCRIPTION
The
.Nm
file system provides facilities to store file system data onto disk devices
and is intended to replace
.Xr ffs 5
as the default file system for
.Dx .
.Pp
Among its features are instant crash recovery,
large file systems spanning multiple volumes,
data integrity checking,
data deduplication,
fine-grained history retention and snapshots,
pseudo-filesystems (PFSs),
mirroring capability and
an unlimited number of files and links.
.Pp
All functions related to managing
.Nm
file systems are provided by the
.Xr newfs_hammer 8 ,
.Xr mount_hammer 8 ,
.Xr hammer 8 ,
.Xr sysctl 8 ,
.Xr chflags 1 ,
and
.Xr undo 1
utilities.
.Pp
For a more detailed introduction refer to the paper and slides listed in the
.Sx SEE ALSO
section.
For some common usages of
.Nm
see the
.Sx EXAMPLES
section below.
.Pp
Description of
.Nm
features:
.Ss Instant Crash Recovery
After a non-graceful system shutdown,
.Nm
file systems will be brought back into a fully coherent state
when mounting the file system, usually within a few seconds.
.Pp
In the unlikely case that a
.Nm
mount fails due to redo recovery (stage 2 recovery) being corrupted, a
workaround to skip this stage can be applied by setting the following tunable:
.Bd -literal -offset indent
vfs.hammer.skip_redo=<value>
.Ed
.Pp
Possible values are:
.Bl -tag -width indent
.It 0
Run redo recovery normally and fail to mount in the case of error (default).
.It 1
Run redo recovery but continue mounting if an error appears.
.It 2
Completely bypass redo recovery.
.El
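.Pp
As a minimal sketch, assuming the tunable is read from the kernel environment
in the same way as other loader tunables, it could be set in
.Xr loader.conf 5
before attempting the mount again:
.Bd -literal -offset indent
# assumption: placed in /boot/loader.conf
# 1 = run redo recovery, but continue mounting if it reports an error
vfs.hammer.skip_redo="1"
.Ed
.Pp
Trying value 1 before value 2 keeps redo recovery enabled and only ignores
its errors; the setting is presumably best removed again once the file system
has been mounted and checked.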
.Pp
Related commands:
.Xr mount_hammer 8
.Ss Large File Systems & Multi Volume
A
.Nm
file system can be up to 1 Exabyte in size.
It can span up to 256 volumes;
each volume occupies a
.Dx
disk slice or partition, or another special file,
and can be up to 4096 TB in size.
The minimum recommended
.Nm
file system size is 50 GB.
For volumes over 2 TB in size
.Xr gpt 8
and
.Xr disklabel64 8
normally need to be used.
.Pp
Related
.Xr hammer 8
commands:
.Cm volume-add ,
.Cm volume-del ,
.Cm volume-list ;
see also
.Xr newfs_hammer 8
.Ss Data Integrity Checking
.Nm
places a strong focus on data integrity;
CRC checks are made for all major structures and data.
.Nm
snapshots implement features to make data integrity checking easier:
The atime and mtime fields are locked to the ctime
for files accessed via a snapshot.
The
.Fa st_dev
field is based on the PFS
.Ar shared-uuid
and not on any real device.
This means that archiving the contents of a snapshot with e.g.\&
.Xr tar 1
and piping it to something like
.Xr md5 1
will yield a consistent result.
The consistency is also retained on mirroring targets.
.Ss Data Deduplication
To save disk space, data deduplication can be used.
Data deduplication identifies data blocks which occur multiple times
and stores only one copy; multiple references are then made to this copy.
.Pp
Related
.Xr hammer 8
commands:
.Cm dedup ,
.Cm dedup-simulate ,
.Cm cleanup ,
.Cm config
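.Pp
A hedged sketch of a typical sequence, assuming a
.Nm
file system mounted on
.Pa /home
(the mount point is only an illustration):
the potential savings are estimated first, then the actual deduplication
is run.
.Bd -literal -offset indent
hammer dedup-simulate /home
hammer dedup /home
.Ed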
.Ss Transaction IDs
The
.Nm
file system uses 64-bit transaction ids to refer to historical
file or directory data.
Transaction ids used by
.Nm
are monotonically increasing over time.
In other words:
when a transaction is made,
.Nm
will always use higher transaction ids for following transactions.
A transaction id is given in hexadecimal format
.Li 0x%016llx ,
such as
.Li 0x00000001061a8ba6 .
.Pp
Related
.Xr hammer 8
commands:
.Cm snapshot ,
.Cm snap ,
.Cm snaplo ,
.Cm snapq ,
.Cm snapls ,
.Cm synctid
.Ss History & Snapshots
History metadata on the media is written with every sync operation, so that
by default the resolution of a file's history is 30-60 seconds until the next
prune operation.
Prior versions of files and directories are generally accessible by appending
.Ql @@
and a transaction id to the name.
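For instance, assuming a hypothetical file
.Pa /home/notes.txt
and the transaction id shown above, the version of the file as of that
transaction could be read with:
.Bd -literal -offset indent
# hypothetical file name; the transaction id must exist in the file's history
cat /home/notes.txt@@0x00000001061a8ba6
.Ed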
The common way of accessing history, however, is by taking snapshots.
.Pp
Snapshots are softlinks to prior versions of directories and their files.
Their data will be retained across prune operations for as long as the
softlink exists.
Removing the softlink enables the file system to reclaim the space
again upon the next prune & reblock operations.
In
.Nm
Version 3+ snapshots are also maintained as file system meta-data.
.Pp
Related
.Xr hammer 8
commands:
.Cm cleanup ,
.Cm history ,
.Cm snapshot ,
.Cm snap ,
.Cm snaplo ,
.Cm snapq ,
.Cm snaprm ,
.Cm snapls ,
.Cm config ,
.Cm viconfig ;
see also
.Xr undo 1
.Ss Pruning & Reblocking
Pruning is the act of deleting file system history.
By default only history used by the given snapshots
and history from after the latest snapshot will be retained.
By setting the per PFS parameter
.Cm prune-min ,
history is guaranteed to be saved for at least this time interval.
All other history is deleted.
Reblocking will reorder all elements and thus defragment the file system and
free space for reuse.
After pruning, a file system must be reblocked to recover all available space.
Reblocking is needed even when using the
.Cm nohistory
.Xr mount_hammer 8
option or
.Xr chflags 1
flag.
.Pp
Related
.Xr hammer 8
commands:
.Cm cleanup ,
.Cm snapshot ,
.Cm prune ,
.Cm prune-everything ,
.Cm rebalance ,
.Cm reblock ,
.Cm reblock-btree ,
.Cm reblock-inodes ,
.Cm reblock-dirs ,
.Cm reblock-data
.Ss Pseudo-Filesystems (PFSs)
A pseudo-filesystem, PFS for short, is a sub file system in a
.Nm
file system.
Each PFS has independent inode numbers.
All disk space in a
.Nm
file system is shared between all PFSs in it,
so each PFS is free to use all remaining space.
A
.Nm
file system supports up to 65536 PFSs.
The root of a
.Nm
file system is PFS# 0; it is called the root PFS and is always a master PFS.
.Pp
A PFS can be either master or slave.
Slaves are always read-only,
so they can't be updated by normal file operations, only by
.Xr hammer 8
operations like mirroring and pruning.
Upgrading slaves to masters and downgrading masters to slaves are supported.
.Pp
It is recommended to use a
.Nm null
mount to access a PFS, except for the root PFS;
this way no tools are confused by the PFS root being a symlink
and inodes not being unique across a
.Nm
file system.
.Pp
Many
.Xr hammer 8
operations operate per PFS;
this includes mirroring, offline deduping, pruning, reblocking and rebalancing.
.Pp
Related
.Xr hammer 8
commands:
.Cm pfs-master ,
.Cm pfs-slave ,
.Cm pfs-status ,
.Cm pfs-update ,
.Cm pfs-destroy ,
.Cm pfs-upgrade ,
.Cm pfs-downgrade ;
see also
.Xr mount_null 8
.Ss Mirroring
Mirroring is copying of all data in a file system, including snapshots
and other historical data.
In order to allow inode numbers to be duplicated on the slaves, the
.Nm
mirroring feature uses PFSs.
A master or slave PFS can be mirrored to a slave PFS.
That is, multiple slaves per master are supported for mirroring,
but multiple masters per slave are not.
.Pp
Related
.Xr hammer 8
commands:
.Cm mirror-copy ,
.Cm mirror-stream ,
.Cm mirror-read ,
.Cm mirror-read-stream ,
.Cm mirror-write ,
.Cm mirror-dump
.Ss Fsync Flush Modes
The
.Nm
file system implements several different
.Fn fsync
flush modes; the mode used is set via the
.Va vfs.hammer.flush_mode
sysctl, see
.Xr hammer 8
for details.
.Ss Unlimited Number of Files and Links
There is no limit on the number of files or links in a
.Nm
file system, apart from available disk space.
.Ss NFS Export
.Nm
file systems support NFS export.
NFS export of PFSs is done using
.Nm null
mounts (for a file or directory in the root PFS a
.Nm null
mount is not needed).
For example, to export the PFS
.Pa /hammer/pfs/data ,
create a
.Nm null
mount, e.g.\& to
.Pa /hammer/data
and export the latter path.
.Pp
Don't export a directory containing a PFS (e.g.\&
.Pa /hammer/pfs
above).
Only the
.Nm null
mount for a PFS root
(e.g.\&
.Pa /hammer/data
above) should be exported (a subdirectory may be escaped if exported).
.Ss File System Versions
As new features have been introduced to
.Nm
the version number has been bumped.
Each
.Nm
file system has a version, which can be upgraded to support new features.
.Pp
Related
.Xr hammer 8
commands:
.Cm version ,
.Cm version-upgrade ;
see also
.Xr newfs_hammer 8
.Sh EXAMPLES
.Ss Preparing the File System
To create and mount a
.Nm
file system use the
.Xr newfs_hammer 8
and
.Xr mount_hammer 8
commands.
Note that all
.Nm
file systems must have a unique name on a per-machine basis.
.Bd -literal -offset indent
newfs_hammer -L HOME /dev/ad0s1d
mount_hammer /dev/ad0s1d /home
.Ed
.Pp
Similarly, multi volume file systems can be created and mounted by
specifying additional arguments.
.Bd -literal -offset indent
newfs_hammer -L MULTIHOME /dev/ad0s1d /dev/ad1s1d
mount_hammer /dev/ad0s1d /dev/ad1s1d /home
.Ed
.Pp
Once created and mounted,
.Nm
file systems need periodic clean up (making snapshots, pruning and reblocking)
in order to have access to history and to keep the file system from filling up.
For this it is recommended to use the
.Xr hammer 8
.Cm cleanup
metacommand.
.Pp
By default,
.Dx
is set up to run
.Nm hammer Cm cleanup
nightly via
.Xr periodic 8 .
.Pp
It is also possible to perform these operations individually via
.Xr crontab 5 .
For example, to reblock the
.Pa /home
file system every night at 2:15 for up to 5 minutes:
.Bd -literal -offset indent
15 2 * * * hammer -c /var/run/HOME.reblock -t 300 reblock /home \e
    >/dev/null 2>&1
.Ed
.Ss Snapshots
The
.Xr hammer 8
utility's
.Cm snapshot
command provides several ways of taking snapshots.
They all assume a directory where snapshots are kept.
.Bd -literal -offset indent
mkdir /snaps
hammer snapshot /home /snaps/snap1
(...after some changes in /home...)
hammer snapshot /home /snaps/snap2
.Ed
.Pp
The softlinks in
.Pa /snaps
point to the state of the
.Pa /home
directory at the time each snapshot was taken, and could now be used to copy
the data somewhere else for backup purposes.
.Pp
By default,
.Dx
is set up to create nightly snapshots of all
.Nm
file systems via
.Xr periodic 8
and to keep them for 60 days.
.Ss Pruning
A snapshot directory is also the argument to the
.Xr hammer 8
.Cm prune
command which frees historical data from the file system that is not
pointed to by any snapshot link and is not from after the latest snapshot
and is older than
.Cm prune-min .
.Bd -literal -offset indent
rm /snaps/snap1
hammer prune /snaps
.Ed
.Ss Mirroring
Mirroring is set up using
.Nm
pseudo-filesystems (PFSs).
To associate the slave with the master its shared UUID should be set to
the master's shared UUID as output by the
.Nm hammer Cm pfs-master
command.
.Bd -literal -offset indent
hammer pfs-master /home/pfs/master
hammer pfs-slave /home/pfs/slave shared-uuid=<master's shared uuid>
.Ed
.Pp
The
.Pa /home/pfs/slave
link is unusable for as long as no mirroring operation has taken place.
.Pp
To mirror the master's data, either pipe a
.Cm mirror-read
command into a
.Cm mirror-write
or, as a short-cut, use the
.Cm mirror-copy
command (which works across a
.Xr ssh 1
connection as well).
The initial mirroring operation has to be done to the PFS path (as
.Xr mount_null 8
can't access it yet).
.Bd -literal -offset indent
hammer mirror-copy /home/pfs/master /home/pfs/slave
.Ed
.Pp
It is also possible to have the target PFS auto created
by just issuing the same
.Cm mirror-copy
command; if the target PFS doesn't exist you will be prompted
if you would like to create it.
You can even omit the prompting by using the
.Fl y
flag:
.Bd -literal -offset indent
hammer -y mirror-copy /home/pfs/master /home/pfs/slave
.Ed
.Pp
After this initial step a
.Nm null
mount can be set up for
.Pa /home/pfs/slave .
Further operations can use
.Nm null
mounts.
.Bd -literal -offset indent
mount_null /home/pfs/master /home/master
mount_null /home/pfs/slave /home/slave

hammer mirror-copy /home/master /home/slave
.Ed
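.Pp
Since
.Cm mirror-copy
also works across an
.Xr ssh 1
connection, a slave on another machine can be fed in a similar way.
The following is only a sketch: the user, host name and remote path are
hypothetical, and the
.Ar user@host:
prefix is assumed to follow the form documented in
.Xr hammer 8 .
.Bd -literal -offset indent
hammer mirror-copy /home/pfs/master backup@mirrorhost:/backup/pfs/slave
.Ed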
.Ss NFS Export
To NFS export from the
.Nm
file system
.Pa /hammer
the directory
.Pa /hammer/non-pfs
without PFSs, and the PFS
.Pa /hammer/pfs/data ,
the latter is
.Nm null
mounted to
.Pa /hammer/data .
.Pp
Add to
.Pa /etc/fstab
(see
.Xr fstab 5 ) :
.Bd -literal -offset indent
/hammer/pfs/data /hammer/data null rw
.Ed
.Pp
Add to
.Pa /etc/exports
(see
.Xr exports 5 ) :
.Bd -literal -offset indent
/hammer/non-pfs
/hammer/data
.Ed
.Sh DIAGNOSTICS
.Bl -diag
.It "hammer: System has insuffient buffers to rebalance the tree. nbuf < %d"
Rebalancing a
.Nm
PFS uses quite a bit of memory and
can't be done on low memory systems.
It has been reported to fail on 512MB systems.
Rebalancing isn't critical for
.Nm
file system operation;
it is done by
.Nm hammer
.Cm rebalance ,
often as part of
.Nm hammer
.Cm cleanup .
.El
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr md5 1 ,
.Xr tar 1 ,
.Xr undo 1 ,
.Xr exports 5 ,
.Xr ffs 5 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr gpt 8 ,
.Xr hammer 8 ,
.Xr mount_hammer 8 ,
.Xr mount_null 8 ,
.Xr newfs_hammer 8 ,
.Xr periodic 8 ,
.Xr sysctl 8
.Rs
.%A Matthew Dillon
.%D June 2008
.%O http://www.dragonflybsd.org/hammer/hammer.pdf
.%T "The HAMMER Filesystem"
.Re
.Rs
.%A Matthew Dillon
.%D October 2008
.%O http://www.dragonflybsd.org/hammer/nycbsdcon/
.%T "Slideshow from NYCBSDCon 2008"
.Re
.Rs
.%A Michael Neumann
.%D January 2010
.%O http://www.ntecs.de/sysarch09/HAMMER.pdf
.%T "Slideshow for a presentation held at KIT (http://www.kit.edu)"
.Re
.Sh FILESYSTEM PERFORMANCE
The
.Nm
file system has a front-end which processes VNOPS and issues necessary
block reads from disk, and a back-end which handles meta-data updates
on-media and performs all meta-data write operations.
Bulk file write operations are handled by the front-end.
Because
.Nm
defers meta-data updates, virtually no meta-data read operations will be
issued by the front-end while writing large amounts of data to the file system
or even when creating new files or directories, and even though the
kernel prioritizes reads over writes the fact that writes are cached by
the drive itself tends to lead to excessive priority given to writes.
.Pp
There are four bioq sysctls, shown below with default values,
which can be adjusted to give reads a higher priority:
.Bd -literal -offset indent
kern.bioq_reorder_minor_bytes: 262144
kern.bioq_reorder_burst_bytes: 3000000
kern.bioq_reorder_minor_interval: 5
kern.bioq_reorder_burst_interval: 60
.Ed
.Pp
If a higher read priority is desired it is recommended that the
.Va kern.bioq_reorder_minor_interval
be increased to 15, 30, or even 60, and the
.Va kern.bioq_reorder_burst_bytes
be decreased to 262144 or 524288.
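.Pp
As a sketch of how such a change might be made persistent, the two values
could be placed in
.Pa /etc/sysctl.conf
(assumed here to be processed at boot as usual; the numbers are simply the
smallest suggestions from the text above, not measured optima):
.Bd -literal -offset indent
# assumption: illustrative values, adjust to the workload
kern.bioq_reorder_minor_interval=15
kern.bioq_reorder_burst_bytes=262144
.Ed
.Pp
The same assignments can be tried at run time with
.Xr sysctl 8
before committing them to the file.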
.Sh HISTORY
The
.Nm
file system first appeared in
.Dx 1.11 .
.Sh AUTHORS
.An -nosplit
The
.Nm
file system was designed and implemented by
.An Matthew Dillon Aq dillon@backplane.com ,
data deduplication was added by
.An Ilya Dryomov .
This manual page was written by
.An Sascha Wildner
and updated by
.An Thomas Nikolajsen .
.Sh CAVEATS
Data deduplication is considered experimental.