share/man/man8/swapcache.8

   1 .\"
   2 .\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .Dd February 7, 2010
  13 .Dt SWAPCACHE 8
  14 .Os
  15 .Sh NAME
  16 .Nm swapcache
  17 .Nd a
  18 mechanism which allows the system to use fast swap to cache filesystem
  19 data and meta-data.
  20 .Sh SYNOPSIS
  21 .Cd sysctl vm.swapcache.accrate=100000
  22 .Cd sysctl vm.swapcache.maxfilesize=0
  23 .Cd sysctl vm.swapcache.maxburst=2000000000
  24 .Cd sysctl vm.swapcache.curburst=4000000000
  25 .Cd sysctl vm.swapcache.minburst=10000000
  26 .Cd sysctl vm.swapcache.read_enable=0
  27 .Cd sysctl vm.swapcache.meta_enable=0
  28 .Cd sysctl vm.swapcache.data_enable=0
  29 .Cd sysctl vm.swapcache.use_chflags=1
  30 .Cd sysctl vm.swapcache.maxlaunder=256
  31 .Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
  32 .Sh DESCRIPTION
  33 .Nm
  34 is a system capability which allows a solid state disk (SSD) in a swap
  35 space configuration to be used to cache clean filesystem data and meta-data
  36 in addition to its normal function of backing anonymous memory.
  37 .Pp
  38 Sysctls are used to manage operational parameters and can be adjusted at
  39 any time.  Typically a large initial burst is desired after system boot,
  40 controlled by the initial
  41 .Cd vm.swapcache.curburst
  42 parameter.
  43 This parameter is reduced as data is written to swap by the swapcache
  44 and increased at a rate specified by
  45 .Cd vm.swapcache.accrate .
  46 Once this parameter reaches zero, write activity ceases until it has
  47 recovered sufficiently for write activity to resume.
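.Pp
The burst accounting described above can be illustrated with a rough
sketch.
It assumes a steady, hypothetical write rate and ignores every other
limit; the function name is invented for this example and is not part
of the system.
.Bd -literal -offset indent
```shell
# Seconds until curburst is exhausted while data is written faster
# than accrate refills it.  Arguments are bytes and bytes per second.
burst_seconds() {
    curburst=$1 write_rate=$2 accrate=$3
    if [ "$write_rate" -le "$accrate" ]; then
        echo "never"    # the refill keeps up with the writes
        return 0
    fi
    echo $(( curburst / (write_rate - accrate) ))
}

# Defaults from SYNOPSIS with an assumed 10MB/sec write rate.
burst_seconds 4000000000 10000000 100000
```
.Ed
.Pp
Under these assumptions the initial 4GB burst would last roughly 404
seconds before write activity ceases.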
  48 .Pp
  49 .Cd vm.swapcache.meta_enable
  50 enables the writing of filesystem meta-data to the swapcache.  Filesystem
  51 meta-data is any data which the filesystem accesses via the disk device
  52 using the buffer cache.  Meta-data is cached globally regardless of file
  53 or directory flags.
  54 .Pp
  55 .Cd vm.swapcache.data_enable
  56 enables the writing of clean filesystem file data to the swapcache.
  57 Filesystem file data is any data which the filesystem accesses via a
  58 regular file.  In technical terms, this is when the buffer cache is used
  59 to access a regular file through its vnode.
  60 Please do not blindly turn on this option; see the PERFORMANCE TUNING
  61 section for more information.
  62 .Pp
  63 .Cd vm.swapcache.use_chflags
  64 enables the use of the
  65 .Cm cache
  66 and
  67 .Cm noscache
  68 .Xr chflags 1
  69 flags to control which files will be data-cached.
  70 If this sysctl is disabled and data_enable is enabled,
  71 the system will ignore file flags and attempt to swapcache all
  72 regular files.
  73 .Pp
  74 .Cd vm.swapcache.read_enable
  75 enables reading from the swapcache and should be set to 1 for normal
  76 operation.
  77 .Pp
  78 .Cd vm.swapcache.maxfilesize
  79 controls which files are to be cached based on their size.
  80 If set to a non-zero value, only files smaller than the specified size
  81 will be cached.  Larger files will not be cached.
  82 .Pp
  83 .Cd vm.swapcache.maxlaunder
  84 controls the maximum number of clean VM pages which will be added to
  85 the swap cache and written out to swap on each poll.
  86 Swapcache polls ten times a second.
  87 .Pp
  88 .Cd vm.swapcache.hysteresis
  89 controls how many pages swapcache waits to be added to the inactive page
  90 queue before continuing its scan.  Once it decides to scan it continues
  91 subject to the above limitations until it reaches the end of the inactive
  92 page queue.
  93 This parameter is designed to make swapcache generate more bulky bursts
  94 to swap which helps SSDs reduce write amplification effects.
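.Pp
The
.Cd vm.swapcache.maxlaunder
limit described above implies a ceiling on how fast pages can be pushed
to swap.
The sketch below works it out from the ten polls per second stated in
the text; the 4096 byte page size is an assumption, and the function
name is hypothetical.
.Bd -literal -offset indent
```shell
# Upper bound, in bytes per second, implied by maxlaunder.
# The page size defaults to an assumed 4096 bytes.
launder_ceiling() {
    maxlaunder=$1
    page_size=${2:-4096}
    polls_per_sec=10    # from the maxlaunder description
    echo $(( maxlaunder * polls_per_sec * page_size ))
}

launder_ceiling 256
```
.Ed
.Pp
With the default of 256 pages this comes to about 10MB/sec, far above
the default long term rate, so the burst parameters are normally the
tighter constraint.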
  95 .Sh PERFORMANCE TUNING
  96 Best operation is achieved when the active data set fits within the
  97 swapcache.
  98 .Pp
  99 .Bl -tag -width 4n -compact
 100 .It Cd vm.swapcache.accrate
 101 This specifies the burst accumulation rate in bytes per second and
 102 ultimately controls the write bandwidth to swap averaged over a long
 103 period of time.
 104 This parameter must be carefully chosen to manage the write endurance of
 105 the SSD in order to avoid wearing it out too quickly.
 106 Even though SSDs have limited write endurance, there is massive
 107 cost/performance benefit to using one in a swapcache configuration.
 108 .Pp
 109 Let's use the Intel X25V 40G MLC SATA SSD as an example.  This device
 110 has a write endurance of approximately
 111 40TB (40 terabytes), though as noted later
 112 this is closer to a minimum value.
 113 Limiting the long term average bandwidth to 100K/sec leads to no more
 114 than ~9G/day of writing, which works out to an endurance of
 115 approximately 12 years.
 116 Endurance scales linearly with size.  The 80G version of this SSD
 117 will have a write endurance of approximately 80TB.
 118 .Pp
 119 MLC SSDs have a write endurance of roughly 1000-10000 times their
 120 capacity, while the lower-density, higher-cost SLC SSDs reach
 121 approximately 10000-100000 times.  MLC SSDs can be used for the swapcache
 122 (and swap) as long as the system manager is cognizant of their limitations.
 123 .Pp
 124 .It Cd vm.swapcache.meta_enable
 125 Turning on just
 126 .Cd meta_enable
 127 causes only filesystem meta-data to be cached and will result
 128 in very fast directory operations even over millions of inodes
 129 and even in the face of other invasive operations being run
 130 by other processes.
 131 .Pp
 132 For HAMMER filesystems meta-data includes the B-Tree, directory entries,
 133 and data related to tiny files.  Approximately 6 GB of swapcache is needed
 134 for every 14 million or so inodes cached, effectively giving one the
 135 ability to cache all the meta-data in a multi-terabyte filesystem using
 136 a fairly small SSD.
 137 .Pp
 138 .It Cd vm.swapcache.data_enable
 139 Turning on
 140 .Cd data_enable
 141 (with or without other features) allows bulk file data to be
 142 cached.
 143 This feature is very useful for web server operation when the
 144 operational data set fits in swap.
 145 The usefulness is somewhat mitigated by the maximum number
 146 of vnodes supported by the system via
 147 .Cd kern.maxvnodes ,
 148 because the bulk data in the cache is lost when the related
 149 vnode is recycled.  In this case it might be desirable to
 150 take the plunge into running a 64-bit kernel which can support
 151 far more vnodes.  32-bit kernels have limited kernel virtual
 152 memory (KVM) and cannot reliably support more than around
 153 100,000 active vnodes.  64-bit kernels can support 300,000+
 154 active vnodes.
 155 .Pp
 156 Data caching is definitely more wasteful of SSD write bandwidth
 157 than meta-data caching.  It doesn't hurt performance per se,
 158 but may cause the
 159 .Nm
 160 to exhaust its burst and smack against the long term average
 161 bandwidth limit, causing the SSD to wear out at the maximum rate you
 162 programmed.  Data caching is far less wasteful and more efficient
 163 if (on a 64-bit system only) you provide a sufficiently large SSD and
 164 increase
 165 .Cd kern.maxvnodes
 166 to cover the entire directory topology being served.
 167 Each vnode requires about 1K of physical ram.
 168 .Pp
 169 When data caching is turned on you generally want to use
 170 .Xr chflags 1
 171 with the
 172 .Cm cache
 173 flag to enable data caching on a directory.
 174 This flag is tracked by the namecache and does not need to be
 175 recursively set in the directory tree.
 176 Simply setting the flag in a top level directory or mount point
 177 is usually sufficient.
 178 However, the flag does not track across mount points.
 179 A typical setup is something like this:
 180 .Pp
 181 .Dl chflags cache /etc /sbin /bin /usr /home
 182 .Dl chflags noscache /usr/obj
 183 .Pp
 184 If that doesn't work you can turn off
 185 .Cd vm.swapcache.use_chflags
 186 entirely and not bother with any chflagging.
 187 .Pp
 188 Filesystems such as NFS which do not support flags generally
 189 have a
 190 .Cd cache
 191 mount option which enables swapcache operation on the mount.
 192 .Pp
 193 .It Cd vm.swapcache.maxfilesize
 194 This may be used to reduce cache thrashing when a focus on a small
 195 potentially fragmented filespace is desired, leaving the
 196 larger files alone.
 197 .Pp
 198 .It Cd vm.swapcache.minburst
 199 This controls hysteresis and prevents nickel-and-dime write bursting.
 200 Once
 201 .Cd curburst
 202 drops to zero, writing to the swapcache ceases until it has recovered
 203 past
 204 .Cd minburst .
 205 The idea here is to avoid creating a heavily fragmented swapcache where
 206 reading data from a file must alternate between the cache and the primary
 207 filesystem.  Doing so does not save disk seeks on the primary filesystem
 208 so we want to avoid doing small bursts.  This parameter allows us to do
 209 larger bursts.
 210 The larger bursts also tend to improve SSD performance as the SSD itself
 211 can do a better job write-combining and erasing blocks.
 212 .Pp
 213 .It Cd vm.swapcache.maxswappct
 214 This controls the maximum amount of swapspace
 215 .Nm
 216 may use, in percentage terms.
 217 .El
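.Pp
The endurance arithmetic given for
.Cd vm.swapcache.accrate
above can be checked with a short sketch.
It assumes decimal units (40TB taken as 40000000000000 bytes) and
writing at exactly the long term rate; the function names are
hypothetical.
.Bd -literal -offset indent
```shell
# Bytes written per day at a given long term rate in bytes/sec.
daily_bytes() {
    echo $(( $1 * 86400 ))
}

# Whole years until endurance_bytes is consumed at accrate bytes/sec.
endurance_years() {
    endurance_bytes=$1 accrate=$2
    echo $(( endurance_bytes / $(daily_bytes "$accrate") / 365 ))
}

daily_bytes 100000                      # bytes per day at 100K/sec
endurance_years 40000000000000 100000   # 40TB device at 100K/sec
```
.Ed
.Pp
This prints 8640000000 (a little under 9G/day) and 12, matching the
figures quoted for the 40G device.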
 218 .Pp
 219 It is important to note that you should always use
 220 .Xr disklabel64 8
 221 to label your SSD.  Disklabel64 will properly align the base of the
 222 partition space relative to the physical drive regardless of how badly
 223 aligned the fdisk slice is.
 224 This will significantly reduce write amplification and write combining
 225 inefficiencies on the SSD.
 226 .Pp
 227 Finally, interleaved swap (multiple SSDs) may be used to increase
 228 performance even further.  A single SATA SSD is typically capable of
 229 reading 120-220MB/sec.  Configuring two SSDs for your swap will
 230 improve aggregate swapcache read performance by 1.5x to 1.8x.
 231 In tests with two Intel 40G SSDs, 300MB/sec was easily achieved.
 232 .Pp
 233 At this point you will be configuring more swap space than a 32 bit
 234 .Dx
 235 kernel can handle (due to KVM limitations).  By default, 32 bit
 236 .Dx
 237 systems only support 32G of configured swap and while this limit
 238 can be increased somewhat in
 239 .Pa /boot/loader.conf
 240 you should really be using a 64-bit
 241 .Dx
 242 kernel instead.  64-bit systems support up to 512G of swap by default
 243 and can be boosted to up to 8TB if you are really crazy and have enough ram.
 244 Each 1GB of swap requires around 1MB of physical memory to manage it so
 245 the practical limit is more around 1TB of swap.
 246 .Pp
 247 Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
 248 Even though a 1TB configuration might not be cost effective, storage levels
 249 more in the 100-200G range certainly are.  If the machine has only a 1GigE
 250 ethernet (100MB/s) there's no point configuring it for more SSD bandwidth.
 251 A single SSD of the desired size would be sufficient.
 252 .Sh INITIAL BURSTING & REPEATED BURSTING
 253 Even though the average write bandwidth is limited, it is desirable
 254 to have a large initial burst after boot to load the cache.
 255 .Cd curburst
 256 is initialized to 4GB by default and you can force rebursting
 257 by adjusting it with a sysctl.
 258 Remember that
 259 .Cd curburst
 260 dynamically tracks burst and will go up and down with write activity.
 261 .Pp
 262 In addition there will be periods of time where the system is in
 263 steady state and not writing to the swapcache.  During these periods
 264 .Cd curburst
 265 will inch back up but will not exceed
 266 .Cd maxburst .
 267 Thus the
 268 .Cd maxburst
 269 value controls how large a repeated burst can be.
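.Pp
As a rough illustration of how long a repeated burst takes to build up,
the sketch below computes the idle time needed for
.Cd curburst
to climb from zero to a target value at the accumulation rate.
It assumes no writes occur during the recovery; the function name is
hypothetical.
.Bd -literal -offset indent
```shell
# Seconds of idle time for curburst to rise from zero to target
# bytes at accrate bytes/sec, rounded up.
recovery_seconds() {
    target=$1 accrate=$2
    echo $(( (target + accrate - 1) / accrate ))
}

# Default maxburst and accrate from SYNOPSIS.
recovery_seconds 2000000000 100000
```
.Ed
.Pp
With the defaults a full repeated burst needs 20000 seconds, or about
five and a half hours, of accumulated idle time.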
 270 .Pp
 271 A second bursting parameter called
 272 .Cd vm.swapcache.minburst
 273 controls bursting when the maximum write bandwidth has been reached.
 274 When
 275 .Cd curburst
 276 reaches zero, write activity ceases and
 277 .Cd curburst
 278 is allowed to recover up to
 279 .Cd minburst
 280 before write activity resumes.  The recommended range for the
 281 .Cd minburst
 282 parameter is 1MB to 50MB.  This parameter has a relationship to
 283 how fragmented the swapcache gets when not in a steady state.
 284 Large bursts reduce fragmentation and reduce incidences of
 285 excessive seeking on the hard drive.  If set too low the
 286 swapcache will become fragmented within a single regular file
 287 and the constant back-and-forth between the swapcache and the
 288 hard drive will result in excessive seeking on the hard drive.
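.Pp
The stop-and-resume behaviour can be sketched as a toy model stepped
once per second.
This is only an illustration of the hysteresis described here, not the
kernel logic: the constant write demand, the one second step and the
function name are all assumptions, and the
.Cd maxburst
cap is left out.
.Bd -literal -offset indent
```shell
# Print "second curburst writing" for each simulated second.
# Writing stops when curburst hits zero and resumes only once
# curburst has recovered to minburst.
simulate() {
    curburst=$1 minburst=$2 accrate=$3 demand=$4 ticks=$5
    writing=1
    t=0
    while [ "$t" -lt "$ticks" ]; do
        curburst=$(( curburst + accrate ))
        if [ "$writing" -eq 1 ]; then
            curburst=$(( curburst - demand ))
            if [ "$curburst" -le 0 ]; then
                curburst=0
                writing=0
            fi
        elif [ "$curburst" -ge "$minburst" ]; then
            writing=1
        fi
        t=$(( t + 1 ))
        echo "$t $curburst $writing"
    done
}

# Small made-up numbers so the whole cycle fits in six steps.
simulate 250 300 100 200 6
```
.Ed
.Pp
In this run writing stops at second 3 and resumes at second 6, once
the budget has climbed back to the minburst value of 300.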
 289 .Sh SWAPCACHE SIZE & MANAGEMENT
 290 The swapcache feature will use up to 75% of configured swap space
 291 by default.
 292 The remaining 25% is reserved for normal paging operation.
 293 The system operator should configure swap space at least 4 times the
 294 size of main memory and no less than 8G.
 295 If a 40G SSD is used the recommendation is to configure 16G to 32G of
 296 swap (note: 32-bit is limited to 32G of swap by default, for 64-bit
 297 it is 512G of swap), and to leave the remainder unwritten and unused.
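.Pp
The sizing rule above can be written as a small helper.
This is a sketch of the stated recommendation only (4 times main memory,
never below 8G); the function name is hypothetical and the result still
has to fit the SSD and the kernel's swap limit.
.Bd -literal -offset indent
```shell
# Minimum recommended swap in gigabytes for a given amount of
# main memory in gigabytes.
recommended_swap_gb() {
    ram_gb=$1
    swap_gb=$(( ram_gb * 4 ))
    if [ "$swap_gb" -lt 8 ]; then
        swap_gb=8
    fi
    echo "$swap_gb"
}

recommended_swap_gb 1
recommended_swap_gb 4
```
.Ed
.Pp
A machine with 1G of memory lands on the 8G floor, while one with 4G
gets 16G, the low end of the range suggested for a 40G SSD.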
 298 .Pp
 299 The
 300 .Cd vm.swapcache.maxswappct
 301 sysctl may be used to change the default.
 302 You may have to change this default if you also use
 303 .Xr tmpfs 5 ,
 304 .Xr vn 4 ,
 305 or if you have not allocated enough swap for reasonable normal paging
 306 activity to occur (in which case you probably shouldn't be using
 307 .Nm
 308 anyway).
 309 .Pp
 310 If swapcache reaches the 75% limit it will begin tearing down swap
 311 in linear bursts by iterating through available VM objects, until
 312 swap space use drops to 70%.  The tear-down is limited by the rate at
 313 which new data is written and this rate in turn is often limited
 314 by
 315 .Cd vm.swapcache.accrate ,
 316 resulting in an orderly replacement of cached data and meta-data.
 317 The limit is typically only reached when doing full data+meta-data
 318 caching with no file size limitations and serving primarily large
 319 files, or (on a 64-bit system) bumping kern.maxvnodes up to very
 320 high values.
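.Pp
To make the two thresholds concrete, the sketch below converts the
default 75% limit and the 70% tear-down target into megabytes for a
given amount of configured swap.
The percentages are the defaults quoted above; how the lower threshold
moves when the limit is changed is not covered here, and the function
name is hypothetical.
.Bd -literal -offset indent
```shell
# Print the swapcache limit and the tear-down target, in megabytes,
# for swap_mb megabytes of configured swap (default percentages).
swapcache_limits() {
    swap_mb=$1
    echo "$(( swap_mb * 75 / 100 )) $(( swap_mb * 70 / 100 ))"
}

# 32G of configured swap.
swapcache_limits 32768
```
.Ed
.Pp
For 32G of swap this prints 24576 and 22937, so tear-down starts at
24G of use and stops a little above 22G.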
 321 .Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
 322 This is not a function of
 323 .Nm
 324 per se but instead a normal function of the system.  Most systems have
 325 sufficient memory that they do not need to page memory to swap.  These
 326 types of systems are the ones best suited for MLC SSD configured swap
 327 running with a
 328 .Nm
 329 configuration.
 330 Systems which modestly page to swap, in the range of a few hundred
 331 megabytes a day worth of writing, are also well suited for MLC SSD
 332 configured swap.  Desktops usually fall into this category even if they
 333 page out a bit more because swap activity is governed by the actions of
 334 a single person.
 335 .Pp
 336 Systems which page anonymous memory heavily when
 337 .Nm
 338 would otherwise be turned off are not usually well suited for MLC SSD
 339 configured swap.  Heavy paging activity is not governed by
 340 .Nm
 341 bandwidth control parameters and can lead to excessive uncontrolled
 342 writing to the MLC SSD, causing premature wearout.  You would have to
 343 use the lower density, more expensive SLC SSD technology (which has 10x
 344 the durability).  This isn't to say that
 345 .Nm
 346 would be ineffective, just that the aggregate write bandwidth required
 347 to support the system would be too large for MLC flash technologies.
 348 .Pp
 349 With this caveat in mind, SSD based paging on systems with insufficient
 350 ram can be extremely effective in extending the useful life of the system.
 351 For example, a system with a measly 192MB of ram and SSD swap can run
 352 a -j 8 parallel build world in a little less than twice the time it
 353 would take if the system had 2G of ram, whereas it would take 5x to 10x
 354 as long with normal HD based swap.
 355 .Sh WARNINGS
 356 I am going to repeat and expand a bit on SSD wear.
 357 Wear on SSDs is a function of the write durability of the cells,
 358 whether the SSD implements static or dynamic wear leveling, and
 359 write amplification effects based on the type of write activity.
 360 Write amplification occurs due to wasted space when the SSD must
 361 erase and rewrite the underlying flash blocks.  For example, MLC flash
 362 uses 128KB erase/write blocks.
 363 .Pp
 364 .Nm
 365 parameters should be carefully chosen to avoid early wearout.
 366 For example, the Intel X25V 40G SSD has a minimum write durability
 367 of 40TB and an actual durability that can be quite a bit higher.
 368 Generally speaking, you want to select parameters that will give you
 369 at least 10 years of service life.
 370 The most important parameter to control this is
 371 .Cd vm.swapcache.accrate .
 372 .Nm
 373 uses a very conservative 100KB/sec default but even a small X25V
 374 can probably handle 300KB/sec of continuous writing and still last
 375 10 years.
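.Pp
Working backwards from a target service life gives a starting point for
.Cd vm.swapcache.accrate .
The sketch below assumes decimal units, 365 day years and continuous
writing at the long term rate; the function name is hypothetical.
.Bd -literal -offset indent
```shell
# Long term write rate, in bytes/sec, that consumes endurance_bytes
# over the given number of years.
accrate_for_life() {
    endurance_bytes=$1 years=$2
    echo $(( endurance_bytes / (years * 365 * 86400) ))
}

# 40TB minimum durability spread over 10 years.
accrate_for_life 40000000000000 10
```
.Ed
.Pp
This prints 126839, so the 100KB/sec default stays within the 40TB
minimum over 10 years.
Sustaining 300KB/sec for 10 years writes about 94.6TB, which relies on
the actual durability being well above that minimum.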
 376 .Pp
 377 Depending on the wear leveling algorithm the drive uses, durability
 378 and performance can sometimes be improved by configuring less
 379 space (in a manufacturer-fresh drive) than the drive's probed capacity,
 380 for example by only using 32G of a 40G SSD.
 381 SSDs typically implement 10% more storage than advertised and
 382 use this storage to improve wear leveling.  As cells begin to fail
 383 this overallotment slowly becomes part of the primary storage
 384 until it has been exhausted.  After that the SSD has basically failed.
 385 Keep in mind that if you use a larger portion of the SSD's advertised
 386 storage the SSD will not know if/when you decide to use less unless
 387 appropriate TRIM commands are sent (if supported), or a low level
 388 factory erase is issued.
 389 .Pp
 390 The swapcache is designed for use with SSDs configured as swap and
 391 will generally not improve performance when a normal hard drive is used
 392 for swap.
 393 .Pp
 394 .Nm smartctl
 395 (from pkgsrc's sysutils/smartmontools) may be used to retrieve
 396 the wear indicator from the drive.
 397 One usually runs something like 'smartctl -d sat -a /dev/daXX'
 398 (for AHCI/SILI/SCSI), or 'smartctl -a /dev/adXX' for NATA.  Some SSDs
 399 (particularly the Intels) will brick the SATA port when smart operations
 400 are done while the drive is busy with normal activity, so the tool should
 401 only be run when the SSD is idle.
 402 .Pp
 403 ID 232 (0xe8) in the SMART data dump indicates available reserved
 404 space and ID 233 (0xe9) is the wear-out meter.  Reserved space
 405 typically starts at 100 and decrements to 10, after which the SSD
 406 is considered to operate in a degraded mode.  The wear-out meter
 407 typically starts at 99 and decrements to 0, after which the SSD
 408 has failed.
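.Pp
The two attributes can be picked out of the SMART dump with a small
filter.
This is a sketch that assumes the usual smartctl attribute table layout,
with the ID in the first column and the normalized value in the fourth;
the function name and the labels it prints are invented for this
example.
.Bd -literal -offset indent
```shell
# Read a smartctl attribute table on stdin and print the normalized
# value for ID 232 (reserved space) and ID 233 (wear-out meter).
wear_report() {
    awk '$1 == 232 { print "reserved_space", $4 + 0 }
         $1 == 233 { print "wearout_meter", $4 + 0 }'
}

# Typical use, on an idle drive only:
#   smartctl -d sat -a /dev/daXX | wear_report
```
.Ed
.Pp
Adding zero strips the leading zeros smartctl prints, so a value shown
as 099 is reported as 99.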
 409 .Pp
 410 .Nm
 411 tends to use large 64K writes and tends to cluster multiple writes
 412 linearly.  The SSD is able to take significant advantage of this
 413 and write amplification effects are greatly reduced.  If we
 414 take a 40G Intel X25V as an example the vendor specifies a write
 415 durability of approximately 40TB, but
 416 .Nm
 417 should be able to squeeze out upwards of 200TB due to the fairly optimal
 418 write clustering it does.
 419 The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
 420 per MLC cell, 40G drive), but the firmware doesn't do perfect static
 421 wear leveling so the actual durability is less.
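.Pp
The theoretical figure follows directly from capacity and cycle count,
and comparing it with the other two numbers shows how much headroom each
one leaves.
The sketch below uses decimal units and whole number division; the
function names are hypothetical and the ratio is only a rough indicator,
not a measured write amplification.
.Bd -literal -offset indent
```shell
# Total writable terabytes if every cell reached its rated cycles.
theoretical_tb() {
    capacity_gb=$1 cycles=$2
    echo $(( capacity_gb * cycles / 1000 ))
}

# How many times larger the theoretical figure is than another one.
overhead_factor() {
    echo $(( $1 / $2 ))
}

limit=$(theoretical_tb 40 10000)
echo "$limit"
overhead_factor "$limit" 40     # versus the vendor figure
overhead_factor "$limit" 200    # versus the swapcache estimate
```
.Ed
.Pp
This prints 400, 10 and 2: the vendor figure allows for a tenfold loss
against the theoretical limit, while the swapcache estimate assumes
only a twofold loss.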
 422 .Pp
 423 In contrast, most filesystems directly stored on a SSD have
 424 fairly severe write amplification effects and will have durabilities
 425 ranging closer to the vendor-specified limit.
 426 Power-on hours, power cycles, and read operations do not really affect
 427 wear.
 428 .Pp
 429 SSDs with MLC-based flash technology are high-density, low-cost solutions
 430 with limited write durability.  SLC-based flash technology is a low-density,
 431 higher-cost solution with 10x the write durability of MLC.  The durability
 432 also scales with the amount of flash storage.  SLC based flash is typically
 433 twice as expensive per gigabyte.  From a cost perspective, SLC based flash
 434 is at least 5x more cost effective in situations where high write
 435 bandwidths are required (because it lasts 10x longer).  MLC is at least
 436 2x more cost effective in situations where high write bandwidth is not
 437 required.
 438 When wear calculations are in years, these differences become huge, but
 439 often the quantity of storage needed trumps the wear life so we expect most
 440 people will be using MLC.
 441 .Nm
 442 is usable with both technologies.
 443 .Sh SEE ALSO
 444 .Xr swapon 8 ,
 445 .Xr disklabel64 8 ,
 446 .Xr fstab 5
 447 .Sh HISTORY
 448 .Nm
 449 first appeared in
 450 .Dx 2.5 .
 451 .Sh AUTHORS
 452 .An Matthew Dillon