share/man/man8/swapcache.8

   1 .\"
   2 .\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .Dd February 7, 2010
  13 .Dt SWAPCACHE 8
  14 .Os
  15 .Sh NAME
  16 .Nm swapcache
  17 .Nd a mechanism to use fast swap to cache filesystem data and meta-data
  18 .Sh SYNOPSIS
  19 .Cd sysctl vm.swapcache.accrate=100000
  20 .Cd sysctl vm.swapcache.maxfilesize=0
  21 .Cd sysctl vm.swapcache.maxburst=2000000000
  22 .Cd sysctl vm.swapcache.curburst=4000000000
  23 .Cd sysctl vm.swapcache.minburst=10000000
  24 .Cd sysctl vm.swapcache.read_enable=0
  25 .Cd sysctl vm.swapcache.meta_enable=0
  26 .Cd sysctl vm.swapcache.data_enable=0
  27 .Cd sysctl vm.swapcache.use_chflags=1
  28 .Cd sysctl vm.swapcache.maxlaunder=256
  29 .Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
  30 .Sh DESCRIPTION
  31 .Nm
  32 is a system capability which allows a solid state disk (SSD) in a swap
  33 space configuration to be used to cache clean filesystem data and meta-data
  34 in addition to its normal function of backing anonymous memory.
  35 .Pp
  36 Sysctls are used to manage operational parameters and can be adjusted at
  37 any time.
  38 Typically a large initial burst is desired after system boot,
  39 controlled by the initial
  40 .Va vm.swapcache.curburst
  41 parameter.
  42 This parameter is reduced as data is written to swap by the swapcache
  43 and increased at a rate specified by
  44 .Va vm.swapcache.accrate .
  45 Once this parameter reaches zero, write activity ceases until it has
  46 recovered sufficiently for write activity to resume.
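.Pp
The interaction between
.Va curburst
and
.Va accrate
described above can be sketched as a small model.
The shell function below is an illustrative simplification, not the kernel's implementation.
It steps once per second, uses a hypothetical constant write demand, and assumes that writing resumes once the budget has recovered to the
.Va minburst
threshold described under
.Sx PERFORMANCE TUNING .

```shell
# simulate_burst CURBURST ACCRATE MINBURST MAXBURST DEMAND SECONDS
# Simplified one-second-step model of the burst budget (an assumption,
# not the kernel code). Prints the total number of bytes written.
simulate_burst() {
    cur=$1; accrate=$2; minburst=$3; maxburst=$4; demand=$5; secs=$6
    writing=1; written=0; t=0
    while [ "$t" -lt "$secs" ]; do
        # The budget accumulates at accrate and is capped at maxburst.
        cur=$(( cur + accrate ))
        if [ "$cur" -gt "$maxburst" ]; then cur=$maxburst; fi
        # After a stall, resume only once the budget reaches minburst.
        if [ "$writing" -eq 0 ] && [ "$cur" -ge "$minburst" ]; then
            writing=1
        fi
        if [ "$writing" -eq 1 ]; then
            # Write the demanded amount, limited by the remaining budget.
            w=$demand
            if [ "$w" -gt "$cur" ]; then w=$cur; fi
            cur=$(( cur - w ))
            written=$(( written + w ))
            # Budget exhausted: write activity ceases.
            if [ "$cur" -le 0 ]; then writing=0; fi
        fi
        t=$(( t + 1 ))
    done
    echo "$written"
}

# Hypothetical small numbers chosen only to make the behaviour visible.
simulate_burst 1000 100 500 1000 300 10
```

In this model the bytes written can never exceed the initial burst plus
.Va accrate
times the elapsed seconds, which is why
.Va accrate
bounds the long-term average write bandwidth.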
  47 .Pp
  48 .Va vm.swapcache.meta_enable
  49 enables the writing of filesystem meta-data to the swapcache.
  50 Filesystem
  51 meta-data is any data which the filesystem accesses via the disk device
  52 using the buffer cache.
  53 Meta-data is cached globally regardless of file or directory flags.
  54 .Pp
  55 .Va vm.swapcache.data_enable
  56 enables the writing of clean filesystem file-data to the swapcache.
  57 Filesystem file-data is any data which the filesystem accesses via a
  58 regular file.
  59 In technical terms, this is data accessed when the buffer cache is used
  60 to access a regular file through its vnode.
  61 Do not blindly turn on this option; see the
  62 .Sx PERFORMANCE TUNING
  63 section for more information.
  64 .Pp
  65 .Va vm.swapcache.use_chflags
  66 enables the use of the
  67 .Va cache
  68 and
  69 .Va noscache
  70 .Xr chflags 1
  71 flags to control which files will be data-cached.
  72 If this sysctl is disabled and
  73 .Va data_enable
  74 is enabled, the system will ignore file flags and attempt to
  75 swapcache all regular files.
  76 .Pp
  77 .Va vm.swapcache.read_enable
  78 enables reading from the swapcache and should be set to 1 for normal
  79 operation.
  80 .Pp
  81 .Va vm.swapcache.maxfilesize
  82 controls which files are to be cached based on their size.
  83 If set to a non-zero value, only files smaller than the specified size
  84 will be cached.
  85 Larger files will not be cached.
  86 .Pp
  87 .Va vm.swapcache.maxlaunder
  88 controls the maximum number of clean VM pages which will be added to
  89 the swap cache and written out to swap on each poll.
  90 Swapcache polls ten times a second.
  91 .Pp
  92 .Va vm.swapcache.hysteresis
  93 controls how many pages swapcache waits to be added to the inactive page
  94 queue before continuing its scan.
  95 Once it decides to scan it continues subject to the above limitations
  96 until it reaches the end of the inactive page queue.
  97 This parameter is designed to make swapcache generate more bulky bursts
  98 to swap which helps SSDs reduce write amplification effects.
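.Pp
As a concrete starting point for the sysctls described in this section, the following sketch prints settings for a meta-data-only configuration.
It is an assumption about a reasonable conservative setup, not a recommendation from the authors; the function name is hypothetical, and appending the output to
.Pa /etc/sysctl.conf
is assumed to be how persistent sysctls are configured on the target system.

```shell
# Print sysctl.conf-style lines for a meta-data-only swapcache setup.
# Reading is enabled (needed for normal operation), meta-data caching
# is on, and file data caching is left off.
swapcache_meta_only_conf() {
    cat <<'EOF'
vm.swapcache.read_enable=1
vm.swapcache.meta_enable=1
vm.swapcache.data_enable=0
EOF
}

swapcache_meta_only_conf
```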
  99 .Sh PERFORMANCE TUNING
 100 Best operation is achieved when the active data set fits within the
 101 swapcache.
 102 .Pp
 103 .Bl -tag -width 4n -compact
 104 .It Va vm.swapcache.accrate
 105 This specifies the burst accumulation rate in bytes per second and
 106 ultimately controls the write bandwidth to swap averaged over a long
 107 period of time.
 108 This parameter must be carefully chosen to manage the write endurance of
 109 the SSD in order to avoid wearing it out too quickly.
 110 Even though SSDs have limited write endurance, there is massive
 111 cost/performance benefit to using one in a swapcache configuration.
 112 .Pp
 113 Let's use the Intel X25V 40GB MLC SATA SSD as an example.
 114 This device has a write endurance of approximately
 115 40TB (40 terabytes); as later notes explain,
 116 this is closer to a minimum value.
 117 Limiting the long term average bandwidth to 100KB/sec leads to no more
 118 than ~9GB/day of writes, which works out to an endurance of approximately 12 years.
 119 Endurance scales linearly with size.
 120 The 80GB version of this SSD
 121 will have a write endurance of approximately 80TB.
 122 .Pp
 123 MLC SSDs have a write endurance of roughly 1000-10000 times their capacity,
 124 while the lower density, higher-cost SLC SSDs manage roughly 10000-100000 times.
 125 MLC SSDs can be used for the swapcache (and swap) as long as the system
 126 manager is cognizant of their limitations.
 127 .Pp
 128 .It Va vm.swapcache.meta_enable
 129 Turning on just
 130 .Va meta_enable
 131 causes only filesystem meta-data to be cached and will result
 132 in very fast directory operations even over millions of inodes
 133 and even in the face of other invasive operations being run
 134 by other processes.
 135 .Pp
 136 For
 137 .Nm HAMMER
 138 filesystems meta-data includes the B-Tree, directory entries,
 139 and data related to tiny files.
 140 Approximately 6 GB of swapcache is needed
 141 for every 14 million or so inodes cached, effectively giving one the
 142 ability to cache all the meta-data in a multi-terabyte filesystem using
 143 a fairly small SSD.
 144 .Pp
 145 .It Va vm.swapcache.data_enable
 146 Turning on
 147 .Va data_enable
 148 (with or without other features) allows bulk file data to be cached.
 149 This feature is very useful for web server operation when the
 150 operational data set fits in swap.
 151 However, care must be taken to avoid thrashing the swapcache.
 152 In almost all cases you will want to leave chflags mode enabled
 153 and use 'chflags cache' on governing directories to control which
 154 directory subtrees file data should be cached for.
 155 .Pp
 156 Vnode recycling can also cause problems.
 157 32-bit systems are typically limited to 100,000 cached vnodes and
 158 64-bit systems are typically limited to around 400,000 cached vnodes.
 159 When operating on a filesystem containing a large number of files
 160 vnode recycling by the kernel will cause related swapcache data
 161 to be lost and also cause potential thrashing of the swapcache.
 162 Cache thrashing due to vnode recycling can occur whether chflags
 163 mode is used or not.
 164 .Pp
 165 To solve the thrashing problem you can turn on HAMMER's
 166 double buffering feature via
 167 .Va vfs.hammer.double_buffer .
 168 This causes HAMMER to cache file data via its block device.
 169 HAMMER cannot avoid also caching file data via individual vnodes
 170 but will try to expire the second copy more quickly (hence
 171 the name double buffer mode).
 172 The key point is that
 173 .Nm
 174 will only cache the data blocks via the block device when
 175 double_buffer mode is used, and since the block device is associated
 176 with the mount it will not get recycled.
 177 This allows the data for any number (potentially millions) of files to
 178 be cached.
 179 You should still use chflags mode to keep the size of the dataset
 180 being cached under 75% of configured swap space.
 181 .Pp
 182 Data caching is definitely more wasteful of the SSD's write durability
 183 than meta-data caching.
 184 If not carefully managed the swapcache may exhaust its burst and smack
 185 against the long term average bandwidth limit, causing the SSD to wear
 186 out at the maximum rate you programmed.
 187 Data caching is far less wasteful and more efficient
 188 if (on a 64-bit system only) you provide a sufficiently large SSD.
 189 .Pp
 190 When caching large data sets you may want to use a medium-sized SSD
 191 with good write performance instead of a small SSD to accommodate
 192 the higher burst write rate data caching incurs and to reduce
 193 interference between reading and writing.
 194 Write durability also tends to scale with larger SSDs, but keep in mind
 195 that newer flash technologies use smaller feature sizes on-chip
 196 which reduce the write durability of the chips, so pay careful attention
 197 to the type of flash employed by the SSD when making durability
 198 assumptions.
 199 For example, an Intel X25-V only has 40MB/s in write performance
 200 and burst writing by swapcache will seriously interfere with
 201 concurrent read operation on the SSD.
 202 The 80GB X25-M, on the other hand, has double the write performance.
 203 But the Intel 310 series SSDs use flash chips with a smaller feature
 204 size so an 80G 310 series SSD will wind up with a durability relatively
 205 close to that of the older 40G X25-V.
 206 .Pp
 207 When data caching is turned on you generally want swapcache's
 208 chflags mode enabled and use
 209 .Xr chflags 1
 210 with the
 211 .Va cache
 212 flag to enable data caching on a directory.
 213 This flag is tracked by the namecache and does not need to be
 214 recursively set in the directory tree.
 215 Simply setting the flag in a top level directory or mount point
 216 is usually sufficient.
 217 However, the flag does not track across mount points.
 218 A typical setup is something like this:
 219 .Pp
 220 .Dl chflags cache /etc /sbin /bin /usr /home
 221 .Dl chflags noscache /usr/obj
 222 .Pp
 223 It is possible to tell
 224 .Nm
 225 to ignore the cache flag by setting
 226 .Va vm.swapcache.use_chflags
 227 to zero, but it is not recommended because all regular files would then
 228 be cached regardless of their flags.
 229 .Pp
 230 Filesystems such as NFS which do not support flags generally
 231 have a
 232 .Va cache
 233 mount option which enables swapcache operation on the mount.
 234 .Pp
 235 .It Va vm.swapcache.maxfilesize
 236 This may be used to reduce cache thrashing when a focus on a small
 237 potentially fragmented filespace is desired, leaving the
 238 larger (more linearly accessed) files alone.
 239 .Pp
 240 .It Va vm.swapcache.minburst
 241 This controls hysteresis and prevents nickel-and-dime write bursting.
 242 Once
 243 .Va curburst
 244 drops to zero, writing to the swapcache ceases until it has recovered past
 245 .Va minburst .
 246 The idea here is to avoid creating a heavily fragmented swapcache where
 247 reading data from a file must alternate between the cache and the primary
 248 filesystem.
 249 Such alternation does not save disk seeks on the primary filesystem
 250 so we want to avoid doing small bursts.
 251 This parameter allows us to do larger bursts.
 252 The larger bursts also tend to improve SSD performance as the SSD itself
 253 can do a better job write-combining and erasing blocks.
 254 .Pp
 255 .It Va vm.swapcache.maxswappct
 256 This controls the maximum amount of swapspace
 257 .Nm
 258 may use, in percentage terms.
 259 The default is 75%, leaving the remaining 25% of swap available for normal
 260 paging operations.
 261 .El
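.Pp
The HAMMER meta-data sizing figure given under
.Va vm.swapcache.meta_enable
(approximately 6 GB of swapcache for every 14 million inodes) can be turned into a rough estimate.
The function below is a sketch that assumes the ratio scales linearly; its name is hypothetical and the result is only as accurate as that approximation.

```shell
# swapcache_meta_gb INODES
# Estimate the swapcache space (in GB, rounded up) needed to cache
# HAMMER meta-data for INODES inodes, assuming ~6GB per 14 million
# inodes and linear scaling.
swapcache_meta_gb() {
    echo $(( ($1 * 6 + 13999999) / 14000000 ))
}

swapcache_meta_gb 14000000     # the figure quoted in the text
swapcache_meta_gb 100000000    # a hypothetical 100 million inode filesystem
```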
 262 .Pp
 263 It is important to note that you should always use
 264 .Xr disklabel64 8
 265 to label your SSD.
 266 Disklabel64 will properly align the base of the
 267 partition space relative to the physical drive regardless of how badly
 268 aligned the fdisk slice is.
 269 This will significantly reduce write amplification and write combining
 270 inefficiencies on the SSD.
 271 .Pp
 272 Finally, interleaved swap (multiple SSDs) may be used to increase
 273 performance even further.
 274 A single SATA-II SSD is typically capable of reading 120-220MB/sec.
 275 Configuring two SSDs for your swap will
 276 improve aggregate swapcache read performance by 1.5x to 1.8x.
 277 In tests with two Intel 40GB SSDs, 300MB/sec was easily achieved.
 278 With two SATA-III SSDs it is possible to achieve 600MB/sec or better
 279 and well over 400MB/sec random-read performance (versus the ~3MB/sec
 280 random read performance a hard drive gives you).
 281 .Pp
 282 At this point you will be configuring more swap space than a 32-bit
 283 .Dx
 284 kernel can handle (due to KVM limitations).
 285 By default, 32-bit
 286 .Dx
 287 systems only support 32GB of configured swap and while this limit
 288 can be increased somewhat by using
 289 .Va kern.maxswzone
 290 in
 291 .Pa /boot/loader.conf
 292 (a setting of 96m == a maximum of 96GB of swap),
 293 you will quickly run out of KVM.
 294 Running a 64-bit system with its 512G maximum swap space default
 295 is preferable at that point.
 296 .Pp
 297 In addition there will be periods of time where the system is in
 298 steady state and not writing to the swapcache.
 299 During these periods
 300 .Va curburst
 301 will inch back up but will not exceed
 302 .Va maxburst .
 303 Thus the
 304 .Va maxburst
 305 value controls how large a repeated burst can be.
 306 Remember that
 307 .Va curburst
 308 dynamically tracks burst and will go up and down depending on write activity.
 309 .Pp
 310 A second bursting parameter called
 311 .Va vm.swapcache.minburst
 312 controls bursting when the maximum write bandwidth has been reached.
 313 When
 314 .Va curburst
 315 reaches zero, write activity ceases and
 316 .Va curburst
 317 is allowed to recover up to
 318 .Va minburst
 319 before write activity resumes.
 320 The recommended range for the
 321 .Va minburst
 322 parameter is 1MB to 50MB.
 323 This parameter has a relationship to
 324 how fragmented the swapcache gets when not in a steady state.
 325 Large bursts reduce fragmentation and reduce incidences of
 326 excessive seeking on the hard drive.
 327 If set too low the
 328 swapcache will become fragmented within a single regular file
 329 and the constant back-and-forth between the swapcache and the
 330 hard drive will result in excessive seeking on the hard drive.
 331 .Sh SWAPCACHE SIZE & MANAGEMENT
 332 The swapcache feature will use up to 75% of configured swap space
 333 by default.
 334 The remaining 25% is reserved for normal paging operation.
 335 The system operator should configure swap space of at least 4 times
 336 main memory and no less than 8GB.
 337 If a 40GB SSD is used the recommendation is to configure 16GB to 32GB of
 338 swap (note: 32-bit is limited to 32GB of swap by default, for 64-bit
 339 it is 512GB of swap), and to leave the remainder unwritten and unused.
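.Pp
The sizing rules above (at least 4 times main memory, no less than 8GB, with up to 75% used by swapcache by default) can be expressed as simple arithmetic.
The function names below are hypothetical, and the sketch ignores the 32-bit and 64-bit swap limits mentioned above.

```shell
# recommended_swap_gb RAM_GB
# Minimum suggested swap: 4 times main memory, but never below 8GB.
recommended_swap_gb() {
    swap=$(( $1 * 4 ))
    if [ "$swap" -lt 8 ]; then swap=8; fi
    echo "$swap"
}

# swapcache_share_gb SWAP_GB
# Portion of swap that swapcache may use at the default 75% limit.
swapcache_share_gb() {
    echo $(( $1 * 75 / 100 ))
}

recommended_swap_gb 1     # small machine: the 8GB floor applies
recommended_swap_gb 4     # 4GB of RAM
swapcache_share_gb 32     # 32GB of swap configured on a 40GB SSD
```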
 340 .Pp
 341 The
 342 .Va vm.swapcache.maxswappct
 343 sysctl may be used to change the default.
 344 You may have to change this default if you also use
 345 .Xr tmpfs 5 ,
 346 .Xr vn 4 ,
 347 or if you have not allocated enough swap for reasonable normal paging
 348 activity to occur (in which case you probably shouldn't be using
 349 .Nm
 350 anyway).
 351 .Pp
 352 If swapcache reaches the 75% limit it will begin tearing down swap
 353 in linear bursts by iterating through available VM objects, until
 354 swap space use drops to 70%.
 355 The tear-down is limited by the rate at
 356 which new data is written and this rate in turn is often limited by
 357 .Va vm.swapcache.accrate ,
 358 resulting in an orderly replacement of cached data and meta-data.
 359 The limit is typically only reached when doing full data+meta-data
 360 caching with no file size limitations and serving primarily large
 361 files, or (on a 64-bit system) bumping
 362 .Va kern.maxvnodes
 363 up to very high values.
 364 .Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
 365 This is not a function of
 366 .Nm
 367 per se but instead a normal function of the system.
 368 Most systems have
 369 sufficient memory that they do not need to page memory to swap.
 370 These types of systems are the ones best suited for MLC SSD
 371 configured swap running with a
 372 .Nm
 373 configuration.
 374 Systems which modestly page to swap, in the range of a few hundred
 375 megabytes a day worth of writing, are also well suited for MLC SSD
 376 configured swap.
 377 Desktops usually fall into this category even if they
 378 page out a bit more because swap activity is governed by the actions of
 379 a single person.
 380 .Pp
 381 Systems which page anonymous memory heavily when
 382 .Nm
 383 would otherwise be turned off are not usually well suited for MLC SSD
 384 configured swap.
 385 Heavy paging activity is not governed by
 386 .Nm
 387 bandwidth control parameters and can lead to excessive uncontrolled
 388 writing to the MLC SSD, causing premature wearout.
 389 You would have to use the lower density, more expensive SLC SSD
 390 technology (which has 10x the durability).
 391 This isn't to say that
 392 .Nm
 393 would be ineffective, just that the aggregate write bandwidth required
 394 to support the system would be too large for MLC flash technologies.
 395 .Pp
 396 With this caveat in mind, SSD based paging on systems with insufficient
 397 RAM can be extremely effective in extending the useful life of the system.
 398 For example, a system with a measly 192MB of RAM and SSD swap can run
 399 a -j 8 parallel build world in a little less than twice the time it
 400 would take if the system had 2GB of RAM, whereas it would take 5x to 10x
 401 as long with normal HD based swap.
 402 .Sh USING SWAPCACHE WITH NORMAL HARD DRIVES
 403 Although
 404 .Nm
 405 is designed to work with SSD-based storage it can also be used with
 406 HD-based storage as an aid for offloading the primary storage system.
 407 Here we need to make a distinction between using RAID for fanning out
 408 storage versus using RAID for redundancy.  There are numerous situations
 409 where RAID-based redundancy does not make sense.
 410 .Pp
 411 A good example would be in an environment where the servers themselves
 412 are redundant and can suffer a total failure without affecting
 413 ongoing operations.  When the primary storage requirements easily fit onto
 414 a single large-capacity drive it doesn't make a whole lot of sense to
 415 use RAID if your only desire is to improve performance.  If you had a farm
 416 of, say, 20 servers supporting the same facility adding RAID to each one
 417 would not accomplish anything other than to bloat your deployment and
 418 maintenance costs.
 419 .Pp
 420 In these sorts of situations it may be desirable and convenient to have
 421 the primary filesystem for each machine on a single large drive and then
 422 use the
 423 .Nm
 424 facility to offload the drive and make the machine more effective without
 425 actually distributing the filesystem itself across multiple drives.
 426 For the purposes of offloading, while a SSD would be the most effective
 427 from a performance standpoint, a second medium-sized HD with its much lower
 428 cost and higher capacity might actually be more cost-effective.
 429 .Pp
 430 In cases where you might desire to use
 431 .Nm
 432 with a normal hard drive you should probably consider running a 64-bit
 433 .Dx
 434 system instead of a 32-bit one.
 435 The 64-bit build is capable of supporting much larger swap configurations
 436 (upwards of 512G) and would be a more suitable match against a medium-sized
 437 HD.
 438 .Sh EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
 439 Modern SSDs keep track of space that has never been written to.
 440 This would also include space freed up via TRIM, but simply not
 441 touching a bit of storage in a factory fresh SSD works just as well.
 442 Once you touch (write to) the storage all bets are off, even if
 443 you reformat/repartition later.  It takes sending the SSD a
 444 whole-device TRIM command or special format command to take it back
 445 to its factory-fresh condition (sans wear already present).
 446 .Pp
 447 SSDs have wear leveling algorithms which are responsible for trying
 448 to even out the erase/write cycles across all flash cells in the
 449 storage.  The better a job the SSD can do the longer the SSD will
 450 remain usable.
 451 .Pp
 452 The more unused storage there is from the SSD's point of view the
 453 easier a time the SSD has running its wear leveling algorithms.
 454 Basically the wear leveling algorithm in a modern SSD (say Intel or OCZ)
 455 uses a combination of static and dynamic leveling.  Static is the
 456 best, allowing the SSD to reuse flash cells that have not been
 457 erased very much by moving static (unchanging) data out of them and
 458 into other cells that have more wear.  Dynamic wear leveling involves
 459 writing data to available flash cells and then marking the cells containing
 460 the previous copy of the data as being free/reusable.  Dynamic wear leveling
 461 is the worst kind but the easiest to implement.  Modern SSDs use a combination
 462 of both algorithms and also do write-combining.
 463 .Pp
 464 USB sticks often use only dynamic wear leveling and have short life spans
 465 because of that.
 466 .Pp
 467 In any case, any unused space in the SSD effectively makes the dynamic
 468 wear leveling the SSD does more efficient by giving the SSD more 'unused'
 469 space above and beyond the physical space it reserves beyond its stated
 470 storage capacity to cycle data through, so the SSD lasts longer in theory.
 471 .Pp
 472 Write-combining is a feature whereby the SSD is able to reduce write
 473 amplification effects by combining OS writes of smaller, discrete,
 474 non-contiguous logical sectors into a single contiguous 128KB physical
 475 flash block.
 476 .Pp
 477 On the flip side write-combining also results in more complex lookup tables
 478 which can become fragmented over time and reduce the SSD's read performance.
 479 Fragmentation can also occur when write-combined blocks are rewritten
 480 piecemeal.
 481 Modern SSDs can regain the lost performance by de-combining previously
 482 write-combined areas as part of their static wear leveling algorithm, but
 483 at the cost of extra write/erase cycles which slightly increase write
 484 amplification effects.
 485 Operating systems can also help maintain the SSD's performance by utilizing
 486 larger blocks.
 487 Write-combining results in a net-reduction
 488 of write-amplification effects but due to having to de-combine later and
 489 other fragmentary effects it isn't 100%.
 490 From testing with Intel devices write-amplification can be well controlled
 491 in the 2x-4x range with the OS doing 16K writes, versus a worst-case
 492 8x write-amplification with 16K blocks, 32x with 4K blocks, and a truly
 493 horrid worst-case with 512 byte blocks.
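.Pp
The worst-case figures above follow from dividing the physical flash block size by the OS write size.
The sketch below assumes a 128KB flash block and that every small write forces a full block to be rewritten (no write-combining at all); the function name is hypothetical.
By the same arithmetic the 512 byte case works out to 256x.

```shell
# worst_case_wa WRITE_BYTES
# Worst-case write amplification for a given OS write size, assuming a
# 128KB (131072 byte) flash block is rewritten for every write.
worst_case_wa() {
    echo $(( 131072 / $1 ))
}

for size in 16384 4096 512; do
    echo "$size byte writes: $(worst_case_wa "$size")x"
done
```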
 494 .Pp
 495 The
 496 .Dx
 497 .Nm
 498 feature utilizes 64K-128K writes and is specifically designed to minimize
 499 write amplification and write-combining stresses.
 500 In terms of placing an actual filesystem on the SSD, the
 501 .Dx
 502 .Xr hammer 8
 503 filesystem utilizes 16K blocks and is well behaved as long as you limit
 504 reblocking operations.
 505 For UFS you should create the filesystem with at least a 4K fragment
 506 size, versus the default 2K.
 507 Modern Windows filesystems use 4K clusters but it is unclear how SSD-friendly
 508 NTFS is.
 509 .Sh EXPLANATION OF FLASH CHIP FEATURE SIZE VS ERASE/REWRITE CYCLE DURABILITY
 510 Manufacturers continue to produce flash chips with smaller feature sizes.
 511 Smaller flash cells mean reduced erase/rewrite cycle durability which in
 512 turn reduces the durability of the SSD.
 513 .Pp
 514 The older 34nm flash typically had a durability of 10,000 erase cycles
 515 per cell while the newer 25nm flash is closer to 1000.  The newer flash uses
 516 larger ECCs and more sensitive voltage comparators on-chip to increase the
 517 durability closer to 3000 cycles.  Generally speaking you should assume a
 518 durability of around 1/3 for the same storage capacity using the new chips
 519 versus the older chips.  If you can squeeze out a 400TB durability from an
 520 older 40GB X25-V using 34nm technology then you should assume around a 400TB
 521 durability from a newer 120GB 310 series SSD using 25nm technology.
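.Pp
The comparison above is capacity multiplied by erase cycles per cell.
The sketch below computes that theoretical upper bound, assuming perfect wear leveling and no write amplification, which real firmware does not achieve; the function name is hypothetical.

```shell
# theoretical_tb CAPACITY_GB CYCLES
# Theoretical write durability in TB: capacity times erase cycles per
# cell, assuming perfect wear leveling and no write amplification.
theoretical_tb() {
    echo $(( $1 * $2 / 1000 ))
}

theoretical_tb 40 10000    # older 40GB drive, 34nm flash
theoretical_tb 120 3000    # newer 120GB drive, 25nm flash
```

The two results, 400TB and 360TB, are close enough that the text treats them as roughly equal.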
 522 .Sh WARNINGS
 523 I am going to repeat and expand a bit on SSD wear.
 524 Wear on SSDs is a function of the write durability of the cells,
 525 whether the SSD implements static or dynamic wear leveling (or both),
 526 write amplification effects when the OS does not issue write-aligned 128KB
 527 ops or when the SSD is unable to write-combine adjacent logical sectors,
 528 and the quality of the SSD's write-combining algorithm for non-adjacent sectors.
 529 In addition, some erase/rewrite activity occurs from cleanup
 530 operations the SSD performs as part of its static wear leveling algorithms
 531 and its write-decombining algorithms (necessary to maintain performance over
 532 time).  MLC flash uses 128KB physical write/erase blocks while SLC flash
 533 typically uses 64KB physical write/erase blocks.
 534 .Pp
 535 The algorithms the SSD implements in its firmware are probably the most
 536 important part of the device and a major differentiator between e.g. SATA
 537 and USB-based SSDs.  SATA form factor drives will universally be far superior
 538 to USB storage sticks.
 539 SSDs can also have wildly different wearout rates and wildly different
 540 performance curves over time.
 541 For example the performance of a SSD which does not implement
 542 write-decombining can seriously degrade over time as its lookup
 543 tables become severely fragmented.
 544 For the purposes of this manual page we are primarily using Intel and OCZ
 545 drives when describing performance and wear issues.
 546 .Pp
 547 .Nm
 548 parameters should be carefully chosen to avoid early wearout.
 549 For example, the Intel X25V 40GB SSD has a minimum write durability
 550 of 40TB and an actual durability that can be quite a bit higher.
 551 Generally speaking, you want to select parameters that will give you
 552 at least 10 years of service life.
 553 The most important parameter to control this is
 554 .Va vm.swapcache.accrate .
 555 .Nm
 556 uses a very conservative 100KB/sec default but even a small X25V
 557 can probably handle 300KB/sec of continuous writing and still last 10 years.
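.Pp
The service-life estimate above is the write durability divided by the sustained write rate.
The sketch below uses decimal units (1TB = 10^12 bytes) and integer arithmetic that rounds down; the function name is hypothetical, and the 200TB input is the figure measured with
.Nm
later in this section rather than a vendor rating.

```shell
# endurance_years TB BYTES_PER_SEC
# Whole years until TB terabytes have been written at a sustained rate
# of BYTES_PER_SEC (the unit used by vm.swapcache.accrate).
endurance_years() {
    total_bytes=$(( $1 * 1000000000000 ))
    seconds=$(( total_bytes / $2 ))
    echo $(( seconds / 86400 / 365 ))
}

endurance_years 40 100000     # 40TB rating at the 100KB/sec default
endurance_years 200 300000    # 200TB at 300KB/sec
```

With these inputs the function prints 12 and 21 respectively.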
 558 .Pp
 559 Depending on the wear leveling algorithm the drive uses, durability
 560 and performance can sometimes be improved by configuring less
 561 space (in a manufacturer-fresh drive) than the drive's probed capacity,
 562 for example by only using 32GB of a 40GB SSD.
 563 SSDs typically implement 10% more storage than advertised and
 564 use this storage to improve wear leveling.
 565 As cells begin to fail
 566 this overallotment slowly becomes part of the primary storage
 567 until it has been exhausted.
 568 After that the SSD has basically failed.
 569 Keep in mind that if you use a larger portion of the SSD's advertised
 570 storage the SSD will not know if/when you decide to use less unless
 571 appropriate TRIM commands are sent (if supported), or a low level
 572 factory erase is issued.
 573 .Pp
 574 .Nm smartctl
 575 (from pkgsrc's sysutils/smartmontools) may be used to retrieve
 576 the wear indicator from the drive.
 577 One usually runs something like
 578 .Ql smartctl -d sat -a /dev/daXX
 579 (for AHCI/SILI/SCSI), or
 580 .Ql smartctl -a /dev/adXX
 581 for NATA.
 582 Some SSDs
 583 (particularly the Intels) will brick the SATA port when smart operations
 584 are done while the drive is busy with normal activity, so the tool should
 585 only be run when the SSD is idle.
 586 .Pp
 587 ID 232 (0xe8) in the SMART data dump indicates available reserved
 588 space and ID 233 (0xe9) is the wear-out meter.
 589 Reserved space
 590 typically starts at 100 and decrements to 10, after which the SSD
 591 is considered to operate in a degraded mode.
 592 The wear-out meter typically starts at 99 and decrements to 0,
 593 after which the SSD has failed.
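.Pp
The two attributes can be picked out of the SMART dump with a small filter.
The sketch below assumes the usual smartmontools attribute table, in which the attribute ID is the first column and the normalized VALUE is the fourth; the function name and the output labels are hypothetical.

```shell
# ssd_wear: read smartctl output on stdin and print the normalized
# VALUE column for attribute 232 (reserved space) and 233 (wear-out).
ssd_wear() {
    awk '$1 == 232 { print "reserved_space=" $4 }
         $1 == 233 { print "wearout=" $4 }'
}

# Example (run only while the SSD is idle; the device name is a
# placeholder, as in the text above):
#   smartctl -d sat -a /dev/daXX | ssd_wear
```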
 594 .Pp
 595 .Nm
 596 tends to use large 64KB writes and tends to cluster multiple writes
 597 linearly.
 598 The SSD is able to take significant advantage of this
 599 and write amplification effects are greatly reduced.
 600 If we take a 40GB Intel X25V as an example the vendor specifies a write
 601 durability of approximately 40TB, but
 602 .Nm
 603 should be able to squeeze out upwards of 200TB due to the fairly optimal
 604 write clustering it does.
 605 The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
 606 per MLC cell, 40GB drive, with 34nm technology), but the firmware doesn't
 607 do perfect static wear leveling so the actual durability is less.
 608 In tests over several hundred days we have validated a write endurance
 609 greater than 200TB on the 40G Intel X25V using
 610 .Nm .
 611 .Pp
 612 In contrast, filesystems directly stored on a SSD could have
 613 fairly severe write amplification effects and will have durabilities
 614 ranging closer to the vendor-specified limit.
 615 .Pp
 616 Power-on hours, power cycles, and read operations do not really affect wear.
 617 There is something called read-disturb but it is unclear what sort of
 618 ratio would be needed.  Since the data is cached in ram and thus not
 619 re-read at a high rate there is no expectation of a practical effect.
 620 For all intents and purposes only write operations affect wear.
 621 .Pp
 622 SSDs with MLC-based flash technology are high-density, low-cost solutions
 623 with limited write durability.
 624 SLC-based flash technology is a low-density,
 625 higher-cost solution with 10x the write durability of MLC.
 626 The durability also scales with the amount of flash storage.
 627 SLC based flash is typically
 628 twice as expensive per gigabyte.
 629 From a cost perspective, SLC based flash
 630 is at least 5x more cost effective in situations where high write
 631 bandwidths are required (because it lasts 10x longer).
 632 MLC is at least 2x more cost effective in situations where high
 633 write bandwidth is not required.
 634 When wear calculations are in years, these differences become huge, but
 635 often the quantity of storage needed trumps the wear life so we expect most
 636 people will be using MLC.
 637 .Nm
 638 is usable with both technologies.
 639 .Sh SEE ALSO
 640 .Xr chflags 1 ,
 641 .Xr fstab 5 ,
 642 .Xr disklabel64 8 ,
 643 .Xr hammer 8 ,
 644 .Xr swapon 8
 645 .Sh HISTORY
 646 .Nm
 647 first appeared in
 648 .Dx 2.5 .
 649 .Sh AUTHORS
 650 .An Matthew Dillon