.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism which allows the system to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS (defaults shown)
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.use_chflags=1
.Cd sysctl vm.swapcache.maxlaunder=256
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Cd vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Cd vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for write activity to resume.
.Pp
.Cd vm.swapcache.meta_enable
enables the writing of filesystem meta-data to the swapcache.
Filesystem meta-data is any data which the filesystem accesses via the
disk device using the buffer cache.
Meta-data is cached globally regardless of file or directory flags.
.Pp
.Cd vm.swapcache.data_enable
enables the writing of filesystem file data to the swapcache.
Filesystem file data is any data which the filesystem accesses via a
regular file.
In technical terms, this is when the buffer cache is used to access a
regular file through its vnode.
Please do not blindly turn on this option;
see the PERFORMANCE TUNING section for more information.
.Pp
.Cd vm.swapcache.use_chflags
enables the use of the
.Cm cache
and
.Cm noscache
.Xr chflags 1
flags to control which files will be data-cached.
If this sysctl is disabled and
.Cd data_enable
is enabled,
the system will ignore file flags and attempt to swapcache all
regular files.
.Pp
.Cd vm.swapcache.read_enable
enables reading from the swapcache and should be set to 1 for normal
operation.
.Pp
.Cd vm.swapcache.maxfilesize
controls which files are to be cached based on their size.
If set to a non-zero value, only files smaller than the specified size
will be cached.
Larger files will not be cached.
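.Pp
The way
.Cd data_enable ,
.Cd use_chflags
and
.Cd maxfilesize
combine can be sketched as a small predicate.
This is only an illustration of the rules described above, not the
kernel's code.
The function name and its arguments are hypothetical, and the sketch
assumes that
.Cm noscache
takes precedence over
.Cm cache .
.Bd -literal -offset indent
```python
def data_cache_eligible(size, has_cache_flag, has_noscache_flag,
                        data_enable=0, use_chflags=1, maxfilesize=0):
    """Return True if a regular file's data would be a caching candidate."""
    if not data_enable:
        return False  # file data caching is switched off entirely
    if maxfilesize and size >= maxfilesize:
        return False  # only files smaller than maxfilesize qualify
    if not use_chflags:
        return True   # flags are ignored: every regular file qualifies
    # Assumption: noscache overrides a cache flag set higher in the tree.
    return has_cache_flag and not has_noscache_flag
```
.Ed
.Pp
With the defaults shown in the SYNOPSIS the predicate is always false,
because
.Cd data_enable
is 0.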
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Cd vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is a massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the Intel X25V 40G MLC SATA SSD as an example.
This device has a write endurance of approximately 40TB (40 terabytes).
Limiting the long term average bandwidth to 100K/sec leads to no more
than ~9G/day of writing, which works out to an endurance of
approximately 12 years.
Endurance scales linearly with size.
The 80G version of this SSD
will have a write endurance of approximately 80TB.
.Pp
MLC SSDs have a write endurance of approximately 1000x their capacity,
while the lower density, higher-cost SLC SSDs have a write endurance of
approximately 10000x their capacity.
MLC SSDs can be used for the swapcache (and swap)
as long as the system manager is cognizant of their limitations.
.Pp
.It Cd vm.swapcache.meta_enable
Turning on just
.Cd meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
.It Cd vm.swapcache.data_enable
Turning on
.Cd data_enable
(with or without other features) allows bulk file data to be
cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
The usefulness is somewhat limited by the maximum number
of vnodes supported by the system via
.Cd kern.maxvnodes ,
because the bulk data in the cache is lost when the related
vnode is recycled.
In this case it might be desirable to
take the plunge into running a 64-bit kernel which can support
far more vnodes.
32-bit kernels have limited kernel virtual
memory (KVM) and cannot reliably support more than around
100,000 active vnodes.
64-bit kernels can support 300,000+
active vnodes.
.Pp
Data caching is definitely more wasteful of SSD write bandwidth
than meta-data caching.
It doesn't hurt performance per se, but may cause
.Nm
to exhaust its burst and run up against the long term average
bandwidth limit, causing the SSD to wear out at the maximum rate you
programmed.
Data caching is far less wasteful and more efficient
if (on a 64-bit system only) you provide a sufficiently large SSD and
increase
.Cd kern.maxvnodes
to cover the entire directory topology being served.
Each vnode requires about 1K of physical RAM.
.Pp
When data caching is turned on you generally want to use
.Xr chflags 1
with the
.Cm cache
flag to enable data caching on a directory.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory is sufficient.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
.Pp
.It Cd vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the
larger files alone.
.Pp
.It Cd vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Cd curburst
drops to zero, writing to the swapcache ceases until it has recovered
past
.Cd minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Cd vm.swapcache.maxswappct
This controls the maximum amount of swap space
.Nm
may use, in percentage terms.
.El
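.Pp
The endurance arithmetic in the
.Cd vm.swapcache.accrate
entry above can be reproduced with a few lines of Python.
This is a sketch under stated assumptions: the function names are
hypothetical, decimal units are assumed (1TB = 10^12 bytes), and the
SSD is assumed to be written at exactly the long term average rate.
.Bd -literal -offset indent
```python
SECONDS_PER_DAY = 86400
TB = 10 ** 12  # assumption: decimal terabytes


def daily_write_bytes(accrate):
    """Bytes written per day at a sustained rate of accrate bytes/sec."""
    return accrate * SECONDS_PER_DAY


def endurance_years(endurance_bytes, accrate):
    """Years until endurance_bytes have been written at accrate bytes/sec."""
    days = endurance_bytes / daily_write_bytes(accrate)
    return days / 365.0


print(daily_write_bytes(100000))                   # 8640000000
print(round(endurance_years(40 * TB, 100000), 1))  # 12.7
```
.Ed
.Pp
Doubling the endurance figure, as with the 80G device, doubles the
result.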
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
performance even further.
A single SATA SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40G SSDs, 300MB/sec was easily achieved.
.Pp
At this point you will be configuring more swap space than a 32-bit
.Dx
kernel can handle (due to KVM limitations).
By default, 32-bit
.Dx
systems only support 32G of configured swap and while this limit
can be increased somewhat in
.Pa /boot/loader.conf
you should really be using a 64-bit
.Dx
kernel instead.
64-bit systems support up to 512G of swap by default
and can be boosted to up to 8TB if you are really crazy and have enough RAM.
Each 1GB of swap requires around 1MB of physical memory to manage it so
the practical limit is closer to 1TB of swap.
.Pp
Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
Even though a 1TB configuration might not be cost effective, storage levels
more in the 100-200G range certainly are.
If the machine has only a 1GigE
ethernet (100MB/s) there's no point configuring it for more SSD bandwidth.
A single SSD of the desired size would be sufficient.
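.Pp
The two rules of thumb above (about 1MB of RAM per 1GB of swap, and no
benefit from SSD bandwidth beyond what the network can carry) can be
written down as follows.
The function names are hypothetical and the figures are the
approximations quoted in this section, not measured values.
.Bd -literal -offset indent
```python
def swap_management_ram_mb(swap_gb, mb_per_gb=1):
    """Approximate RAM (MB) needed to manage swap_gb gigabytes of swap."""
    return swap_gb * mb_per_gb


def useful_read_mb_s(ssd_mb_s, network_mb_s):
    """Read bandwidth a network-bound server can actually make use of."""
    return min(ssd_mb_s, network_mb_s)


print(swap_management_ram_mb(512))   # 512
print(useful_read_mb_s(300, 100))    # 100
```
.Ed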
.Sh INITIAL BURSTING & REPEATED BURSTING
Even though the average write bandwidth is limited it is desirable
to have a large initial burst after boot to load the cache.
.Cd curburst
is initialized to 4GB by default and you can force rebursting
by adjusting it with a sysctl.
Remember that
.Cd curburst
dynamically tracks the remaining burst and will go up and down
depending on write activity.
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.
During these periods
.Cd curburst
will inch back up but will not exceed
.Cd maxburst .
Thus the
.Cd maxburst
value controls how large a repeated burst can be.
.Pp
A second bursting parameter called
.Cd vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Cd curburst
reaches zero, write activity ceases and
.Cd curburst
is allowed to recover up to
.Cd minburst
before write activity resumes.
The recommended range for the
.Cd minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
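.Pp
The interaction of
.Cd curburst ,
.Cd accrate ,
.Cd minburst
and
.Cd maxburst
resembles a token bucket with hysteresis.
The following toy model steps once per second and is only a sketch of
the behavior described in this section, not the kernel's algorithm.
The function name, the one-second step, the constant write demand, and
the choice to resume as soon as
.Cd curburst
is at least
.Cd minburst
are all assumptions.
.Bd -literal -offset indent
```python
def simulate(curburst, accrate, minburst, maxburst, demand, seconds):
    """Return (bytes written in each second, final curburst)."""
    writing = curburst > 0
    written = []
    for _ in range(seconds):
        # The burst allowance refills at accrate, capped at maxburst.
        curburst = min(curburst + accrate, maxburst)
        # Hysteresis: once stopped, wait until minburst has accumulated.
        if not writing and curburst >= minburst:
            writing = True
        amount = 0
        if writing:
            amount = min(demand, curburst)
            curburst -= amount
            if curburst <= 0:
                writing = False  # allowance exhausted, writes cease
        written.append(amount)
    return written, curburst


# Small numbers chosen to make the pattern visible.
print(simulate(30, 10, 40, 100, 50, 8))
# ([40, 0, 0, 0, 40, 0, 0, 0], 30)
```
.Ed
.Pp
In this run writes arrive in bursts of 40 separated by idle seconds
rather than as a trickle of 10 per second, which is the effect
.Cd minburst
is meant to produce.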
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operation.
The system operator should configure swap space of at least 4 times
main memory and no less than 8G.
If a 40G SSD is used the recommendation is to configure 16G to 32G of
swap (note: 32-bit is limited to 32G of swap by default, for 64-bit
it is 512G of swap).
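.Pp
The sizing guidance above can be expressed as two small helpers.
This is a sketch only; the function names are hypothetical and the
numbers are the defaults and recommendations quoted in this section.
.Bd -literal -offset indent
```python
def recommended_min_swap_gb(ram_gb):
    """At least 4x main memory, and never less than 8G."""
    return max(4 * ram_gb, 8)


def swapcache_share_gb(swap_gb, maxswappct=75):
    """Portion of configured swap the swapcache may occupy."""
    return swap_gb * maxswappct / 100.0


print(recommended_min_swap_gb(1))   # 8
print(recommended_min_swap_gb(4))   # 16
print(swapcache_share_gb(32))       # 24.0
```
.Ed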
.Pp
The
.Cd vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited
by
.Cd vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or (on a 64-bit system) bumping
.Cd kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These
types of systems are the ones best suited for MLC SSD configured swap
running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the MLC SSD, causing premature wearout.
You would have to
use the lower density, more expensive SLC SSD technology (which has 10x
the durability).
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system would be too large for MLC flash technologies.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2G of RAM, whereas it would take 5x to 10x
as long with normal HD based swap.
.Sh WARNINGS
SSDs have limited durability and
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40G SSD has a nominal 40TB (terabyte)
write durability.
Generally speaking, you want to select parameters that will give you
at least 5 years of service life.
10 years is a good compromise.
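.Pp
Working backwards from a target service life gives an upper bound for
.Cd vm.swapcache.accrate .
The helper below is a sketch with a hypothetical name; it assumes
decimal terabytes, 365-day years, and that swapcache writes are the
only significant writes to the SSD.
.Bd -literal -offset indent
```python
TB = 10 ** 12  # assumption: decimal terabytes


def max_accrate(endurance_bytes, years):
    """Largest sustained bytes/sec that still lasts the given years."""
    seconds = years * 365 * 86400
    return endurance_bytes // seconds


# A 40TB device and a 10 year target allow roughly 126K/sec, so the
# default accrate of 100000 stays under that bound.
print(max_accrate(40 * TB, 10))
```
.Ed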
.Pp
Durability typically scales with size and also depends on the
wear-leveling algorithm used by the device.
Durability can often
be improved by configuring less space (in a manufacturer-fresh drive)
than the drive's capacity, for example, by only using 32G of a 40G
SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
The swapcache is designed for use with SSDs configured as swap and
will generally not improve performance when a normal hard drive is used
for swap.
.Pp
.Xr smartctl 8
(from pkgsrc's sysutils/smartmontools) may be used to retrieve
the wear indicator from the drive.
One usually runs something like 'smartctl -d sat -a /dev/daXX'
(for AHCI/SILI/SCSI), or 'smartctl -a /dev/adXX' for NATA.
Many SSDs
will brick the SATA port when SMART operations are done while the drive
is busy with normal activity, so the tool should only be run when the
SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter
typically starts at 99 and decrements to 0, after which the SSD
has failed.
Wear on SSDs is a function only of the write durability which is
essentially just the total aggregate sectors written.
.Nm
tends to use large 64K writes and operates in a bursty fashion,
both of which the SSD is able to take significant advantage of.
Power-on hours, power cycles, and read operations do not really affect wear.
.Pp
SSDs with MLC-based flash technology are high-density, low-cost solutions
with limited write durability.
SLC-based flash technology is a low-density,
higher-cost solution with 10x the write durability of MLC.
The durability
also scales with the amount of flash storage, with SLC-based flash typically
twice as expensive per gigabyte.
From a cost perspective, SLC-based flash
is at least 5x more cost effective in situations where high write
bandwidths are required (lasting 10x longer).
MLC is at least 2x more
cost effective in situations where high write bandwidth is not required.
When wear calculations are in years, these differences become huge.
.Nm
is usable with both technologies.
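.Pp
The interpretation of SMART IDs 232 and 233 given above can be
summarized in a small classifier.
This is a sketch: the function name and the returned labels are
hypothetical, the thresholds are the typical values quoted in this
section, and treating a value that has just reached its threshold as
already past it is an assumption.
Individual drives may report these attributes differently.
.Bd -literal -offset indent
```python
def classify_ssd(reserved_space, wear_meter):
    """Classify an SSD from SMART ID 232 (reserved) and ID 233 (wear)."""
    if wear_meter <= 0:
        return "failed"    # wear-out meter has run down to 0
    if reserved_space <= 10:
        return "degraded"  # reserved space has dropped to its floor
    return "ok"


print(classify_ssd(100, 99))  # ok
print(classify_ssd(10, 50))   # degraded
print(classify_ssd(5, 0))     # failed
```
.Ed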
.Sh SEE ALSO
.Xr fstab 5 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon