share/man/man8/swapcache.8

   1 .\"
   2 .\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .Dd February 7, 2010
  13 .Dt SWAPCACHE 8
  14 .Os
  15 .Sh NAME
  16 .Nm swapcache
  17 .Nd a mechanism to use fast swap to cache filesystem data and meta-data
  18 .Sh SYNOPSIS
  19 .Cd sysctl vm.swapcache.accrate=100000
  20 .Cd sysctl vm.swapcache.maxfilesize=0
  21 .Cd sysctl vm.swapcache.maxburst=2000000000
  22 .Cd sysctl vm.swapcache.curburst=4000000000
  23 .Cd sysctl vm.swapcache.minburst=10000000
  24 .Cd sysctl vm.swapcache.read_enable=0
  25 .Cd sysctl vm.swapcache.meta_enable=0
  26 .Cd sysctl vm.swapcache.data_enable=0
  27 .Cd sysctl vm.swapcache.use_chflags=1
  28 .Cd sysctl vm.swapcache.maxlaunder=256
  29 .Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
  30 .Sh DESCRIPTION
  31 .Nm
  32 is a system capability which allows a solid state disk (SSD) in a swap
  33 space configuration to be used to cache clean filesystem data and meta-data
  34 in addition to its normal function of backing anonymous memory.
  35 .Pp
  36 Sysctls are used to manage operational parameters and can be adjusted at
  37 any time.
  38 Typically a large initial burst is desired after system boot,
  39 controlled by the initial
  40 .Va vm.swapcache.curburst
  41 parameter.
  42 This parameter is reduced as data is written to swap by the swapcache
  43 and increased at a rate specified by
  44 .Va vm.swapcache.accrate .
  45 Once this parameter reaches zero, write activity ceases until it has
  46 recovered sufficiently for write activity to resume.
  47 .Pp
  48 .Va vm.swapcache.meta_enable
  49 enables the writing of filesystem meta-data to the swapcache.
  50 Filesystem
  51 meta-data is any data which the filesystem accesses via the disk device
  52 using the buffer cache.
  53 Meta-data is cached globally regardless of file or directory flags.
  54 .Pp
  55 .Va vm.swapcache.data_enable
  56 enables the writing of clean filesystem file-data to the swapcache.
  57 Filesystem file-data is any data which the filesystem accesses via a
  58 regular file.
  59 In technical terms, this is the case when the buffer cache is used to access
  60 a regular file through its vnode.
  61 Please do not blindly turn on this option; see the
  62 .Sx PERFORMANCE TUNING
  63 section for more information.
  64 .Pp
  65 .Va vm.swapcache.use_chflags
  66 enables the use of the
  67 .Va cache
  68 and
  69 .Va noscache
  70 .Xr chflags 1
  71 flags to control which files will be data-cached.
  72 If this sysctl is disabled and
  73 .Va data_enable
  74 is enabled, the system will ignore file flags and attempt to
  75 swapcache all regular files.
  76 .Pp
  77 .Va vm.swapcache.read_enable
  78 enables reading from the swapcache and should be set to 1 for normal
  79 operation.
  80 .Pp
  81 .Va vm.swapcache.maxfilesize
  82 controls which files are to be cached based on their size.
  83 If set to non-zero, only files smaller than the specified size
  84 will be cached.
  85 Larger files will not be cached.
  86 .Pp
  87 .Va vm.swapcache.maxlaunder
  88 controls the maximum number of clean VM pages which will be added to
  89 the swap cache and written out to swap on each poll.
  90 Swapcache polls ten times a second.
  91 .Pp
  92 .Va vm.swapcache.hysteresis
  93 controls how many pages swapcache waits to be added to the inactive page
  94 queue before continuing its scan.
  95 Once it decides to scan it continues subject to the above limitations
  96 until it reaches the end of the inactive page queue.
  97 This parameter is designed to make swapcache generate bulkier bursts
  98 to swap, which helps SSDs reduce write amplification effects.
  99 .Sh PERFORMANCE TUNING
 100 Best operation is achieved when the active data set fits within the
 101 swapcache.
 102 .Pp
 103 .Bl -tag -width 4n -compact
 104 .It Va vm.swapcache.accrate
 105 This specifies the burst accumulation rate in bytes per second and
 106 ultimately controls the write bandwidth to swap averaged over a long
 107 period of time.
 108 This parameter must be carefully chosen to manage the write endurance of
 109 the SSD in order to avoid wearing it out too quickly.
 110 Even though SSDs have limited write endurance, there is massive
 111 cost/performance benefit to using one in a swapcache configuration.
 112 .Pp
 113 Let's use the Intel X25V 40GB MLC SATA SSD as an example.
 114 This device has a write endurance of approximately
 115 40TB (40 terabytes), although, as noted later,
 116 this is closer to a minimum value.
 117 Limiting the long term average bandwidth to 100KB/sec leads to no more
 118 than ~9GB/day of writing, which works out to roughly a 12 year endurance.
 119 Endurance scales linearly with size.
 120 The 80GB version of this SSD
 121 will have a write endurance of approximately 80TB.
 122 .Pp
 123 MLC SSDs have a write endurance of 1000-10000x their capacity, while the lower density
 124 higher-cost SLC SSDs have a write endurance of approximately 10000-100000x.
 125 MLC SSDs can be used for the swapcache (and swap) as long as the system
 126 manager is cognizant of their limitations.
 127 .Pp
 128 .It Va vm.swapcache.meta_enable
 129 Turning on just
 130 .Va meta_enable
 131 causes only filesystem meta-data to be cached and will result
 132 in very fast directory operations even over millions of inodes
 133 and even in the face of other invasive operations being run
 134 by other processes.
 135 .Pp
 136 For
 137 .Nm HAMMER
 138 filesystems meta-data includes the B-Tree, directory entries,
 139 and data related to tiny files.
 140 Approximately 6 GB of swapcache is needed
 141 for every 14 million or so inodes cached, effectively giving one the
 142 ability to cache all the meta-data in a multi-terabyte filesystem using
 143 a fairly small SSD.
 144 .Pp
 145 .It Va vm.swapcache.data_enable
 146 Turning on
 147 .Va data_enable
 148 (with or without other features) allows bulk file data to be cached.
 149 This feature is very useful for web server operation when the
 150 operational data set fits in swap.
 151 The usefulness is somewhat mitigated by the maximum number
 152 of vnodes supported by the system via
 153 .Va kern.maxvnodes ,
 154 because the bulk data in the cache is lost when the related
 155 vnode is recycled.
 156 In this case it might be desirable to
 157 take the plunge into running a 64-bit kernel which can support
 158 far more vnodes.
 159 32-bit kernels have limited kernel virtual
 160 memory (KVM) and cannot reliably support more than around
 161 100,000 active vnodes.
 162 64-bit kernels can support 300,000+ active vnodes.
 163 .Pp
 164 Data caching is definitely more wasteful of the SSD's write durability
 165 than meta-data caching.
 166 The swapcache may exhaust its burst and smack against the long term
 167 average bandwidth limit, causing the SSD to wear out at the maximum rate
 168 you programmed.
 169 Data caching is far less wasteful and more efficient
 170 if (on a 64-bit system only) you provide a sufficiently large SSD and
 171 increase
 172 .Va kern.maxvnodes
 173 to cover the entire directory topology being served.
 174 Each vnode requires about 1KB of physical RAM.
 175 .Pp
 176 Due to the higher SSD write rate you may want to use a
 177 medium-sized SSD with good write performance to reduce interference
 178 between reading and writing.
 179 Write durability also scales with larger SSDs.
 180 For example, an Intel X25-V only has 40MB/s in write performance
 181 and burst writing by swapcache will seriously interfere with
 182 concurrent read operation on the SSD.
 183 The 80GB X25-M, on the other hand, has double the write performance.
 184 .Pp
 185 When data caching is turned on you generally want to use
 186 .Xr chflags 1
 187 with the
 188 .Va cache
 189 flag to enable data caching on a directory.
 190 This flag is tracked by the namecache and does not need to be
 191 recursively set in the directory tree.
 192 Simply setting the flag in a top level directory or mount point
 193 is usually sufficient.
 194 However, the flag does not track across mount points.
 195 A typical setup is something like this:
 196 .Pp
 197 .Dl chflags cache /etc /sbin /bin /usr /home
 198 .Dl chflags noscache /usr/obj
 199 .Pp
 200 If that doesn't work you can turn off
 201 .Va vm.swapcache.use_chflags
 202 entirely and not bother with any
 203 .Nm chflag Ns 'ing .
 204 .Pp
 205 Filesystems such as NFS which do not support flags generally
 206 have a
 207 .Va cache
 208 mount option which enables swapcache operation on the mount.
 209 .Pp
 210 .It Va vm.swapcache.maxfilesize
 211 This may be used to reduce cache thrashing when a focus on a small
 212 potentially fragmented filespace is desired, leaving the
 213 larger files alone.
 214 .Pp
 215 .It Va vm.swapcache.minburst
 216 This controls hysteresis and prevents nickel-and-dime write bursting.
 217 Once
 218 .Va curburst
 219 drops to zero, writing to the swapcache ceases until it has recovered past
 220 .Va minburst .
 221 The idea here is to avoid creating a heavily fragmented swapcache where
 222 reading data from a file must alternate between the cache and the primary
 223 filesystem.
 224 Doing so does not save disk seeks on the primary filesystem
 225 so we want to avoid doing small bursts.
 226 This parameter allows us to do larger bursts.
 227 The larger bursts also tend to improve SSD performance as the SSD itself
 228 can do a better job write-combining and erasing blocks.
 229 .Pp
 230 .It Va vm.swapcache.maxswappct
 231 This controls the maximum amount of swapspace
 232 .Nm
 233 may use, in percentage terms.
 234 .El
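.Pp
The sizing figures quoted in the list above (roughly 6GB of swapcache per
14 million HAMMER inodes, and about 1KB of RAM per vnode) lend themselves to a
quick back-of-the-envelope estimate.
The following Python sketch is illustrative only: the function names are
hypothetical, and it assumes the approximate ratios from this page scale linearly.

```python
# Rough sizing helpers built from the approximate figures in this page.
# Both ratios are estimates and are assumed (not guaranteed) to scale linearly.
GB = 1024 ** 3
KB = 1024


def metadata_swap_bytes(inodes, bytes_per_14m_inodes=6 * GB):
    """Estimate swapcache space needed to cache meta-data for `inodes` inodes."""
    return inodes * bytes_per_14m_inodes / 14_000_000


def vnode_ram_bytes(maxvnodes, bytes_per_vnode=1 * KB):
    """Estimate physical RAM consumed by `maxvnodes` vnodes."""
    return maxvnodes * bytes_per_vnode


if __name__ == "__main__":
    inodes = 50_000_000
    print("meta-data swap (GB):", round(metadata_swap_bytes(inodes) / GB, 1))
    print("RAM for 300000 vnodes (MB):", vnode_ram_bytes(300_000) / (1024 ** 2))
```

Estimates like these are only a starting point; the actual footprint depends
on the filesystem and the workload.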
 235 .Pp
 236 It is important to note that you should always use
 237 .Xr disklabel64 8
 238 to label your SSD.
 239 Disklabel64 will properly align the base of the
 240 partition space relative to the physical drive regardless of how badly
 241 aligned the fdisk slice is.
 242 This will significantly reduce write amplification and write combining
 243 inefficiencies on the SSD.
 244 .Pp
 245 Finally, interleaved swap (multiple SSDs) may be used to increase
 246 performance even further.
 247 A single SATA SSD is typically capable of reading 120-220MB/sec.
 248 Configuring two SSDs for your swap will
 249 improve aggregate swapcache read performance by 1.5x to 1.8x.
 250 In tests with two Intel 40GB SSDs, 300MB/sec was easily achieved.
 251 .Pp
 252 At this point you will be configuring more swap space than a 32-bit
 253 .Dx
 254 kernel can handle (due to KVM limitations).
 255 By default, 32-bit
 256 .Dx
 257 systems only support 32GB of configured swap and while this limit
 258 can be increased somewhat in
 259 .Pa /boot/loader.conf
 260 you should really be using a 64-bit
 261 .Dx
 262 kernel instead.
 263 64-bit systems support up to 512GB of swap by default
 264 and can be boosted to up to 8TB if you are really crazy and have enough RAM.
 265 Each 1GB of swap requires around 1MB of physical memory to manage it so
 266 the practical limit is more around 1TB of swap.
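.Pp
The memory overhead mentioned above (around 1MB of physical memory per 1GB of
swap) can be checked with a few lines of Python.
This is a minimal sketch with a hypothetical helper name, assuming the 1MB-per-1GB
figure holds across the whole range.

```python
# Approximate RAM needed to manage a given amount of configured swap,
# using the "around 1MB per 1GB of swap" figure quoted in this page.
def swap_management_ram_mb(swap_gb, mb_per_gb=1.0):
    """Return the estimated management overhead in megabytes."""
    return swap_gb * mb_per_gb


if __name__ == "__main__":
    for swap_gb in (512, 1024, 8192):  # 512GB default, 1TB, 8TB maximum
        print(swap_gb, "GB swap ->", swap_management_ram_mb(swap_gb), "MB RAM")
```

By this estimate the 8TB maximum would tie up about 8GB of RAM just to manage
swap, which helps explain why the practical limit is placed much lower.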
 267 .Pp
 268 Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
 269 Even though a 1TB configuration might not be cost effective, storage levels
 270 more in the 100-200GB range certainly are.
 271 If the machine has only a 1GigE
 272 ethernet (100MB/s), there's no point configuring it for more SSD bandwidth.
 273 A single SSD of the desired size would be sufficient.
 274 .Sh INITIAL BURSTING & REPEATED BURSTING
 275 Even though the average write bandwidth is limited it is desirable
 276 to have a large initial burst after boot to load the cache.
 277 .Va curburst
 278 is initialized to 4GB by default and you can force rebursting
 279 by adjusting it with a sysctl.
 280 Remember that
 281 .Va curburst
 282 dynamically tracks burst and will go up and down depending on write activity.
 283 .Pp
 284 In addition there will be periods of time where the system is in
 285 steady state and not writing to the swapcache.
 286 During these periods
 287 .Va curburst
 288 will inch back up but will not exceed
 289 .Va maxburst .
 290 Thus the
 291 .Va maxburst
 292 value controls how large a repeated burst can be.
 293 .Pp
 294 A second bursting parameter called
 295 .Va vm.swapcache.minburst
 296 controls bursting when the maximum write bandwidth has been reached.
 297 When
 298 .Va curburst
 299 reaches zero, write activity ceases and
 300 .Va curburst
 301 is allowed to recover up to
 302 .Va minburst
 303 before write activity resumes.
 304 The recommended range for the
 305 .Va minburst
 306 parameter is 1MB to 50MB.
 307 This parameter has a relationship to
 308 how fragmented the swapcache gets when not in a steady state.
 309 Large bursts reduce fragmentation and reduce incidences of
 310 excessive seeking on the hard drive.
 311 If set too low the
 312 swapcache will become fragmented within a single regular file
 313 and the constant back-and-forth between the swapcache and the
 314 hard drive will result in excessive seeking on the hard drive.
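.Pp
The interaction between curburst, accrate, maxburst and minburst described in
this section can be illustrated with a small simulation.
This is a simplified model written for this page, not the kernel's
implementation: it steps once per second, and it assumes that accumulation
never raises curburst past maxburst but leaves a larger initial value untouched.
The function name and the demand parameter are hypothetical.

```python
# Simplified model of swapcache write bursting (NOT the kernel code).
# Assumptions: one step per second; accumulation stops at maxburst;
# after curburst hits zero, writing resumes once it recovers to minburst.
def simulate(seconds, demand, curburst=4_000_000_000, accrate=100_000,
             maxburst=2_000_000_000, minburst=10_000_000):
    """Yield (second, bytes_written, curburst) for each simulated second.

    demand is the number of bytes per second swapcache would like to write.
    """
    writing = True
    for t in range(seconds):
        # Accumulate write credit at accrate, capped at maxburst.
        if curburst < maxburst:
            curburst = min(curburst + accrate, maxburst)
        # Once exhausted, stay idle until the credit recovers to minburst.
        if not writing and curburst >= minburst:
            writing = True
        written = 0
        if writing:
            written = min(demand, curburst)
            curburst -= written
            if curburst <= 0:
                writing = False
        yield t, written, curburst


if __name__ == "__main__":
    # Small numbers so the stop/recover cycle is visible in a few steps.
    for step in simulate(10, demand=400_000, curburst=1_000_000,
                         accrate=100_000, maxburst=2_000_000,
                         minburst=500_000):
        print(step)
```

In this model a larger minburst yields fewer, longer write bursts, which is
the behavior this section recommends for limiting swapcache fragmentation.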
 315 .Sh SWAPCACHE SIZE & MANAGEMENT
 316 The swapcache feature will use up to 75% of configured swap space
 317 by default.
 318 The remaining 25% is reserved for normal paging operation.
 319 The system operator should configure swap space at least 4 times the size
 320 of main memory, and no less than 8GB.
 321 If a 40GB SSD is used the recommendation is to configure 16GB to 32GB of
 322 swap (note: 32-bit is limited to 32GB of swap by default, for 64-bit
 323 it is 512GB of swap), and to leave the remainder unwritten and unused.
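.Pp
The sizing rules above (swap at least 4 times main memory, never less than 8GB,
with up to 75% of it usable by swapcache by default) can be sketched as follows.
The helper names are hypothetical and the code simply restates the
recommendations in this section.

```python
# Swap sizing rules of thumb from this section, expressed as helpers.
def recommended_swap_gb(ram_gb, min_swap_gb=8):
    """At least 4x main memory, and never less than 8GB."""
    return max(4 * ram_gb, min_swap_gb)


def swapcache_share_gb(swap_gb, maxswappct=75):
    """Portion of swap that swapcache may use (75% by default)."""
    return swap_gb * maxswappct / 100


if __name__ == "__main__":
    for ram_gb in (1, 4, 8):
        swap = recommended_swap_gb(ram_gb)
        print(ram_gb, "GB RAM ->", swap, "GB swap,",
              swapcache_share_gb(swap), "GB for swapcache")
```

The remainder of the swap space stays available for normal paging.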
 324 .Pp
 325 The
 326 .Va vm.swapcache.maxswappct
 327 sysctl may be used to change the default.
 328 You may have to change this default if you also use
 329 .Xr tmpfs 5 ,
 330 .Xr vn 4 ,
 331 or if you have not allocated enough swap for reasonable normal paging
 332 activity to occur (in which case you probably shouldn't be using
 333 .Nm
 334 anyway).
 335 .Pp
 336 If swapcache reaches the 75% limit it will begin tearing down swap
 337 in linear bursts by iterating through available VM objects, until
 338 swap space use drops to 70%.
 339 The tear-down is limited by the rate at
 340 which new data is written and this rate in turn is often limited by
 341 .Va vm.swapcache.accrate ,
 342 resulting in an orderly replacement of cached data and meta-data.
 343 The limit is typically only reached when doing full data+meta-data
 344 caching with no file size limitations and serving primarily large
 345 files, or (on a 64-bit system) bumping
 346 .Va kern.maxvnodes
 347 up to very high values.
 348 .Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
 349 This is not a function of
 350 .Nm
 351 per se but instead a normal function of the system.
 352 Most systems have
 353 sufficient memory that they do not need to page memory to swap.
 354 These types of systems are the ones best suited for MLC SSD
 355 configured swap running with a
 356 .Nm
 357 configuration.
 358 Systems which modestly page to swap, in the range of a few hundred
 359 megabytes a day worth of writing, are also well suited for MLC SSD
 360 configured swap.
 361 Desktops usually fall into this category even if they
 362 page out a bit more because swap activity is governed by the actions of
 363 a single person.
 364 .Pp
 365 Systems which page anonymous memory heavily when
 366 .Nm
 367 would otherwise be turned off are not usually well suited for MLC SSD
 368 configured swap.
 369 Heavy paging activity is not governed by
 370 .Nm
 371 bandwidth control parameters and can lead to excessive uncontrolled
 372 writing to the MLC SSD, causing premature wearout.
 373 You would have to use the lower density, more expensive SLC SSD
 374 technology (which has 10x the durability).
 375 This isn't to say that
 376 .Nm
 377 would be ineffective, just that the aggregate write bandwidth required
 378 to support the system would be too large for MLC flash technologies.
 379 .Pp
 380 With this caveat in mind, SSD based paging on systems with insufficient
 381 RAM can be extremely effective in extending the useful life of the system.
 382 For example, a system with a measly 192MB of RAM and SSD swap can run
 383 a -j 8 parallel build world in a little less than twice the time it
 384 would take if the system had 2GB of RAM, whereas it would take 5x to 10x
 385 as long with normal HD based swap.
 386 .Sh WARNINGS
 387 I am going to repeat and expand a bit on SSD wear.
 388 Wear on SSDs is a function of the write durability of the cells,
 389 whether the SSD implements static or dynamic wear leveling, and
 390 write amplification effects based on the type of write activity.
 391 Write amplification occurs due to wasted space when the SSD must
 392 erase and rewrite the underlying flash blocks.
 393 E.g.\& MLC flash uses 128KB erase/write blocks.
 394 .Pp
 395 .Nm
 396 parameters should be carefully chosen to avoid early wearout.
 397 For example, the Intel X25V 40GB SSD has a minimum write durability
 398 of 40TB and an actual durability that can be quite a bit higher.
 399 Generally speaking, you want to select parameters that will give you
 400 at least 10 years of service life.
 401 The most important parameter to control this is
 402 .Va vm.swapcache.accrate .
 403 .Nm
 404 uses a very conservative 100KB/sec default but even a small X25V
 405 can probably handle 300KB/sec of continuous writing and still last 10 years.
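.Pp
The service-life estimates above follow from simple arithmetic on the write
rate and the drive's write durability.
The sketch below uses decimal units (1KB = 1000 bytes, 1TB = 10^12 bytes) and
hypothetical function names; it assumes the drive is written continuously at
the full accrate, which is the worst case.

```python
# Service-life arithmetic for a given vm.swapcache.accrate (bytes/second).
# Assumes continuous writing at the full rate and decimal units.
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365


def daily_write_gb(accrate):
    """Gigabytes written per day at a sustained rate of accrate bytes/sec."""
    return accrate * SECONDS_PER_DAY / 1e9


def endurance_years(durability_tb, accrate):
    """Years until durability_tb terabytes have been written."""
    bytes_per_year = accrate * SECONDS_PER_DAY * DAYS_PER_YEAR
    return durability_tb * 1e12 / bytes_per_year


def required_durability_tb(accrate, years):
    """Terabytes of write durability needed to sustain accrate for years."""
    return accrate * SECONDS_PER_DAY * DAYS_PER_YEAR * years / 1e12


if __name__ == "__main__":
    print("GB/day at 100KB/sec:", daily_write_gb(100_000))
    print("years, 40TB at 100KB/sec:", round(endurance_years(40, 100_000), 1))
    print("TB needed, 300KB/sec for 10 years:",
          round(required_durability_tb(300_000, 10), 1))
```

Note that at 300KB/sec a 10 year life requires noticeably more than the 40TB
minimum, so that figure relies on the actual durability being higher than
the vendor's minimum.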
 406 .Pp
 407 Depending on the wear leveling algorithm the drive uses, durability
 408 and performance can sometimes be improved by configuring less
 409 space (in a manufacturer-fresh drive) than the drive's probed capacity.
 410 For example, one might use only 32GB of a 40GB SSD.
 411 SSDs typically implement 10% more storage than advertised and
 412 use this storage to improve wear leveling.
 413 As cells begin to fail
 414 this overallotment slowly becomes part of the primary storage
 415 until it has been exhausted.
 416 After that the SSD has basically failed.
 417 Keep in mind that if you use a larger portion of the SSD's advertised
 418 storage the SSD will not know if/when you decide to use less unless
 419 appropriate TRIM commands are sent (if supported), or a low level
 420 factory erase is issued.
 421 .Pp
 422 The swapcache is designed for use with SSDs configured as swap and
 423 will generally not improve performance when a normal hard drive is used
 424 for swap.
 425 .Pp
 426 .Nm smartctl
 427 (from pkgsrc's sysutils/smartmontools) may be used to retrieve
 428 the wear indicator from the drive.
 429 One usually runs something like
 430 .Ql smartctl -d sat -a /dev/daXX
 431 (for AHCI/SILI/SCSI), or
 432 .Ql smartctl -a /dev/adXX
 433 for NATA.
 434 Some SSDs
 435 (particularly the Intels) will brick the SATA port when smart operations
 436 are done while the drive is busy with normal activity, so the tool should
 437 only be run when the SSD is idle.
 438 .Pp
 439 ID 232 (0xe8) in the SMART data dump indicates available reserved
 440 space and ID 233 (0xe9) is the wear-out meter.
 441 Reserved space
 442 typically starts at 100 and decrements to 10, after which the SSD
 443 is considered to operate in a degraded mode.
 444 The wear-out meter typically starts at 99 and decrements to 0,
 445 after which the SSD has failed.
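.Pp
The interpretation of these two attributes can be summarized in a few lines of
Python.
This is a sketch only: the function name is hypothetical, the values must be
read from the smartctl output by hand or by other means, and treating a
reserved-space value of exactly 10 as degraded is an assumption.

```python
# Classify SSD health from SMART attribute 232 (available reserved space)
# and attribute 233 (wear-out meter), using the typical ranges given above.
# The exact boundary handling (<= 10, <= 0) is an assumption.
def assess_ssd(reserved_space, wearout):
    """Return 'failed', 'degraded' or 'ok' for the given normalized values."""
    if wearout <= 0:
        return "failed"      # wear-out meter has run down to 0
    if reserved_space <= 10:
        return "degraded"    # reserved space is nearly exhausted
    return "ok"


if __name__ == "__main__":
    print(assess_ssd(reserved_space=100, wearout=99))
    print(assess_ssd(reserved_space=10, wearout=50))
    print(assess_ssd(reserved_space=50, wearout=0))
```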
 446 .Pp
 447 .Nm
 448 tends to use large 64KB writes and tends to cluster multiple writes
 449 linearly.
 450 The SSD is able to take significant advantage of this
 451 and write amplification effects are greatly reduced.
 452 If we take a 40GB Intel X25V as an example the vendor specifies a write
 453 durability of approximately 40TB, but
 454 .Nm
 455 should be able to squeeze out upwards of 200TB due to the fairly optimal
 456 write clustering it does.
 457 The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
 458 per MLC cell, 40GB drive), but the firmware doesn't do perfect static
 459 wear leveling so the actual durability is less.
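.Pp
The durability figures in this paragraph are related by simple multiplication:
capacity times erase cycles gives the theoretical limit.
The sketch below reproduces the numbers quoted here using hypothetical helper
names and decimal units (1TB = 1000GB).

```python
# Relate drive capacity, erase cycles per cell and total write durability.
def theoretical_durability_tb(capacity_gb, erase_cycles):
    """Upper bound assuming perfect wear leveling and no write amplification."""
    return capacity_gb * erase_cycles / 1000


def effective_cycles(durability_tb, capacity_gb):
    """Full-drive rewrites implied by a given total write durability."""
    return durability_tb * 1000 / capacity_gb


if __name__ == "__main__":
    print("theoretical TB:", theoretical_durability_tb(40, 10_000))
    print("cycles at vendor 40TB:", effective_cycles(40, 40))
    print("cycles at 200TB:", effective_cycles(200, 40))
```

Seen this way, the vendor's 40TB figure corresponds to about one tenth of the
theoretical cycle count, and the 200TB estimate to about half of it.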
 460 .Pp
 461 In contrast, most filesystems directly stored on an SSD have
 462 fairly severe write amplification effects and will have durabilities
 463 ranging closer to the vendor-specified limit.
 464 Power-on hours, power cycles, and read operations do not really affect wear.
 465 .Pp
 466 SSDs with MLC-based flash technology are high-density, low-cost solutions
 467 with limited write durability.
 468 SLC-based flash technology is a low-density,
 469 higher-cost solution with 10x the write durability of MLC.
 470 The durability also scales with the amount of flash storage.
 471 SLC based flash is typically
 472 twice as expensive per gigabyte.
 473 From a cost perspective, SLC based flash
 474 is at least 5x more cost effective in situations where high write
 475 bandwidths are required (because it lasts 10x longer).
 476 MLC is at least 2x more cost effective in situations where high
 477 write bandwidth is not required.
 478 When wear calculations are in years, these differences become huge, but
 479 often the quantity of storage needed trumps the wear life so we expect most
 480 people will be using MLC.
 481 .Nm
 482 is usable with both technologies.
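.Pp
The cost-effectiveness comparison above can be worked through numerically.
This sketch takes the ratios from this section as given (SLC at twice the
price per gigabyte and 10x the write durability of MLC); the function name and
the normalized prices are illustrative assumptions, not market data.

```python
# Compare MLC and SLC by relative cost per unit of data written.
# Prices are normalized: MLC = 1.0 per GB, SLC = 2.0 per GB (from the text).
def cost_per_unit_written(price_per_gb, durability_multiplier):
    """Relative cost of each unit written over the life of the flash."""
    return price_per_gb / durability_multiplier


if __name__ == "__main__":
    mlc = cost_per_unit_written(price_per_gb=1.0, durability_multiplier=1)
    slc = cost_per_unit_written(price_per_gb=2.0, durability_multiplier=10)
    # When write bandwidth dominates, compare cost per unit written.
    print("SLC advantage for heavy writing:", round(mlc / slc, 2))
    # When capacity dominates, compare the price per gigabyte directly.
    print("MLC advantage for capacity:", 2.0 / 1.0)
```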
 483 .Sh SEE ALSO
 484 .Xr chflags 1 ,
 485 .Xr fstab 5 ,
 486 .Xr disklabel64 8 ,
 487 .Xr swapon 8
 488 .Sh HISTORY
 489 .Nm
 490 first appeared in
 491 .Dx 2.5 .
 492 .Sh AUTHORS
 493 .An Matthew Dillon