| 1 | .\" Copyright (c) 2001 Matthew Dillon. Terms and conditions are those of |
| 2 | .\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in |
| 3 | .\" the source tree. |
| 4 | .\" |
| 5 | .\" $FreeBSD: src/share/man/man7/tuning.7,v 1.1.2.30 2002/12/17 19:32:08 dillon Exp $ |
| 6 | .\" $DragonFly: src/share/man/man7/tuning.7,v 1.15 2007/09/14 23:47:53 swildner Exp $ |
| 7 | .\" |
| 8 | .Dd March 4, 2007 |
| 9 | .Dt TUNING 7 |
| 10 | .Os |
| 11 | .Sh NAME |
| 12 | .Nm tuning |
| 13 | .Nd performance tuning under |
| 14 | .Dx |
| 15 | .Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP |
| 16 | When using |
| 17 | .Xr disklabel 8 |
| 18 | or the |
| 19 | .Dx |
| 20 | installer |
to lay out your filesystems on a hard disk, it is important to remember
| 22 | that hard drives can transfer data much more quickly from outer tracks |
| 23 | than they can from inner tracks. |
| 24 | To take advantage of this you should |
| 25 | try to pack your smaller filesystems and swap closer to the outer tracks, |
| 26 | follow with the larger filesystems, and end with the largest filesystems. |
| 27 | It is also important to size system standard filesystems such that you |
| 28 | will not be forced to resize them later as you scale the machine up. |
| 29 | I usually create, in order, a 128M root, 1G swap, 128M |
| 30 | .Pa /var , |
| 31 | 128M |
| 32 | .Pa /var/tmp , |
| 33 | 3G |
| 34 | .Pa /usr , |
| 35 | and use any remaining space for |
| 36 | .Pa /home . |
| 37 | .Pp |
| 38 | You should typically size your swap space to approximately 2x main memory. |
| 39 | If you do not have a lot of RAM, though, you will generally want a lot |
| 40 | more swap. |
| 41 | It is not recommended that you configure any less than |
| 42 | 256M of swap on a system and you should keep in mind future memory |
| 43 | expansion when sizing the swap partition. |
| 44 | The kernel's VM paging algorithms are tuned to perform best when there is |
| 45 | at least 2x swap versus main memory. |
| 46 | Configuring too little swap can lead |
| 47 | to inefficiencies in the VM page scanning code as well as create issues |
| 48 | later on if you add more memory to your machine. |
| 49 | Finally, on larger systems |
| 50 | with multiple SCSI disks (or multiple IDE disks operating on different |
| 51 | controllers), we strongly recommend that you configure swap on each drive |
| 52 | (up to four drives). |
| 53 | The swap partitions on the drives should be approximately the same size. |
| 54 | The kernel can handle arbitrary sizes but |
| 55 | internal data structures scale to 4 times the largest swap partition. |
| 56 | Keeping |
| 57 | the swap partitions near the same size will allow the kernel to optimally |
| 58 | stripe swap space across the N disks. |
Do not worry about overdoing it a
little; swap space is the saving grace of
| 61 | .Ux |
| 62 | and even if you do not normally use much swap, it can give you more time to |
| 63 | recover from a runaway program before being forced to reboot. |
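.Pp
For example, a machine with two disks might list one swap partition per
drive in
.Xr fstab 5
(the device names here are only illustrative):
.Bd -literal -offset indent
/dev/ad0s1b  none  swap  sw  0  0
/dev/ad2s1b  none  swap  sw  0  0
.Ed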
| 64 | .Pp |
| 65 | How you size your |
| 66 | .Pa /var |
| 67 | partition depends heavily on what you intend to use the machine for. |
| 68 | This |
| 69 | partition is primarily used to hold mailboxes, the print spool, and log |
| 70 | files. |
| 71 | Some people even make |
| 72 | .Pa /var/log |
| 73 | its own partition (but except for extreme cases it is not worth the waste |
| 74 | of a partition ID). |
| 75 | If your machine is intended to act as a mail |
| 76 | or print server, |
| 77 | or you are running a heavily visited web server, you should consider |
| 78 | creating a much larger partition \(en perhaps a gig or more. |
| 79 | It is very easy |
| 80 | to underestimate log file storage requirements. |
| 81 | .Pp |
| 82 | Sizing |
| 83 | .Pa /var/tmp |
| 84 | depends on the kind of temporary file usage you think you will need. |
| 85 | 128M is |
| 86 | the minimum we recommend. |
| 87 | Also note that the |
| 88 | .Dx |
| 89 | installer will create a |
| 90 | .Pa /tmp |
| 91 | directory. |
| 92 | Dedicating a partition for temporary file storage is important for |
| 93 | two reasons: first, it reduces the possibility of filesystem corruption |
| 94 | in a crash, and second it reduces the chance of a runaway process that |
| 95 | fills up |
| 96 | .Oo Pa /var Oc Ns Pa /tmp |
| 97 | from blowing up more critical subsystems (mail, |
| 98 | logging, etc). |
| 99 | Filling up |
| 100 | .Oo Pa /var Oc Ns Pa /tmp |
| 101 | is a very common problem to have. |
| 102 | .Pp |
| 103 | In the old days there were differences between |
| 104 | .Pa /tmp |
| 105 | and |
| 106 | .Pa /var/tmp , |
| 107 | but the introduction of |
| 108 | .Pa /var |
| 109 | (and |
| 110 | .Pa /var/tmp ) |
led to massive confusion
among program writers, so today programs haphazardly use one or the
other, and thus no real distinction can be made between the two.
| 114 | So it makes sense to have just one temporary directory and |
| 115 | softlink to it from the other tmp directory locations. |
| 116 | However you handle |
| 117 | .Pa /tmp , |
| 118 | the one thing you do not want to do is leave it sitting |
| 119 | on the root partition where it might cause root to fill up or possibly |
| 120 | corrupt root in a crash/reboot situation. |
| 121 | .Pp |
| 122 | The |
| 123 | .Pa /usr |
| 124 | partition holds the bulk of the files required to support the system and |
| 125 | a subdirectory within it called |
| 126 | .Pa /usr/pkg |
| 127 | holds the bulk of the files installed from the |
| 128 | .Xr pkgsrc 7 |
| 129 | collection. |
| 130 | If you do not use |
| 131 | .Xr pkgsrc 7 |
| 132 | all that much and do not intend to keep system source |
| 133 | .Pq Pa /usr/src |
| 134 | on the machine, you can get away with |
| 135 | a 1 gigabyte |
| 136 | .Pa /usr |
| 137 | partition. |
| 138 | However, if you install a lot of packages |
| 139 | (especially window managers and Linux-emulated binaries), we recommend |
| 140 | at least a 2 gigabyte |
| 141 | .Pa /usr |
| 142 | and if you also intend to keep system source |
| 143 | on the machine, we recommend a 3 gigabyte |
| 144 | .Pa /usr . |
Do not underestimate the
amount of space you will need in this partition; it can creep up and
surprise you!
| 148 | .Pp |
| 149 | The |
| 150 | .Pa /home |
| 151 | partition is typically used to hold user-specific data. |
| 152 | I usually size it to the remainder of the disk. |
| 153 | .Pp |
| 154 | Why partition at all? |
| 155 | Why not create one big |
| 156 | .Pa / |
| 157 | partition and be done with it? |
| 158 | Then I do not have to worry about undersizing things! |
| 159 | Well, there are several reasons this is not a good idea. |
| 160 | First, |
| 161 | each partition has different operational characteristics and separating them |
| 162 | allows the filesystem to tune itself to those characteristics. |
| 163 | For example, |
| 164 | the root and |
| 165 | .Pa /usr |
| 166 | partitions are read-mostly, with very little writing, while |
| 167 | a lot of reading and writing could occur in |
| 168 | .Pa /var |
| 169 | and |
| 170 | .Pa /var/tmp . |
By properly
partitioning your system, fragmentation introduced in the smaller, more
heavily write-loaded partitions will not bleed over into the mostly-read
| 174 | partitions. |
| 175 | Additionally, keeping the write-loaded partitions closer to |
| 176 | the edge of the disk (i.e. before the really big partitions instead of after |
| 177 | in the partition table) will increase I/O performance in the partitions |
| 178 | where you need it the most. |
| 179 | Now it is true that you might also need I/O |
| 180 | performance in the larger partitions, but they are so large that shifting |
| 181 | them more towards the edge of the disk will not lead to a significant |
| 182 | performance improvement whereas moving |
| 183 | .Pa /var |
| 184 | to the edge can have a huge impact. |
| 185 | Finally, there are safety concerns. |
| 186 | Having a small neat root partition that |
| 187 | is essentially read-only gives it a greater chance of surviving a bad crash |
| 188 | intact. |
| 189 | .Pp |
| 190 | Properly partitioning your system also allows you to tune |
| 191 | .Xr newfs 8 , |
| 192 | and |
| 193 | .Xr tunefs 8 |
| 194 | parameters. |
| 195 | Tuning |
| 196 | .Xr newfs 8 |
| 197 | requires more experience but can lead to significant improvements in |
| 198 | performance. |
| 199 | There are three parameters that are relatively safe to tune: |
| 200 | .Em blocksize , bytes/i-node , |
| 201 | and |
| 202 | .Em cylinders/group . |
| 203 | .Pp |
| 204 | .Dx |
| 205 | performs best when using 8K or 16K filesystem block sizes. |
| 206 | The default filesystem block size is 16K, |
| 207 | which provides best performance for most applications, |
| 208 | with the exception of those that perform random access on large files |
| 209 | (such as database server software). |
| 210 | Such applications tend to perform better with a smaller block size, |
| 211 | although modern disk characteristics are such that the performance |
| 212 | gain from using a smaller block size may not be worth consideration. |
| 213 | Using a block size larger than 16K |
| 214 | can cause fragmentation of the buffer cache and |
| 215 | lead to lower performance. |
| 216 | .Pp |
| 217 | The defaults may be unsuitable |
| 218 | for a filesystem that requires a very large number of i-nodes |
| 219 | or is intended to hold a large number of very small files. |
| 220 | Such a filesystem should be created with an 8K or 4K block size. |
| 221 | This also requires you to specify a smaller |
| 222 | fragment size. |
| 223 | We recommend always using a fragment size that is \(18 |
| 224 | the block size (less testing has been done on other fragment size factors). |
| 225 | The |
| 226 | .Xr newfs 8 |
| 227 | options for this would be |
| 228 | .Dq Li "newfs -f 1024 -b 8192 ..." . |
| 229 | .Pp |
| 230 | If a large partition is intended to be used to hold fewer, larger files, such |
| 231 | as database files, you can increase the |
| 232 | .Em bytes/i-node |
| 233 | ratio which reduces the number of i-nodes (maximum number of files and |
| 234 | directories that can be created) for that partition. |
| 235 | Decreasing the number |
| 236 | of i-nodes in a filesystem can greatly reduce |
| 237 | .Xr fsck 8 |
| 238 | recovery times after a crash. |
| 239 | Do not use this option |
| 240 | unless you are actually storing large files on the partition, because if you |
| 241 | overcompensate you can wind up with a filesystem that has lots of free |
| 242 | space remaining but cannot accommodate any more files. |
| 243 | Using 32768, 65536, or 262144 bytes/i-node is recommended. |
| 244 | You can go higher but |
| 245 | it will have only incremental effects on |
| 246 | .Xr fsck 8 |
| 247 | recovery times. |
| 248 | For example, |
| 249 | .Dq Li "newfs -i 32768 ..." . |
| 250 | .Pp |
| 251 | .Xr tunefs 8 |
| 252 | may be used to further tune a filesystem. |
| 253 | This command can be run in |
| 254 | single-user mode without having to reformat the filesystem. |
| 255 | However, this is possibly the most abused program in the system. |
| 256 | Many people attempt to |
| 257 | increase available filesystem space by setting the min-free percentage to 0. |
| 258 | This can lead to severe filesystem fragmentation and we do not recommend |
| 259 | that you do this. |
| 260 | Really the only |
| 261 | .Xr tunefs 8 |
| 262 | option worthwhile here is turning on |
| 263 | .Em softupdates |
| 264 | with |
| 265 | .Dq Li "tunefs -n enable /filesystem" . |
| 266 | (Note: in |
| 267 | .Dx , |
| 268 | softupdates can be turned on using the |
| 269 | .Fl U |
| 270 | option to |
| 271 | .Xr newfs 8 , |
and the
| 273 | .Dx |
| 274 | installer will typically enable softupdates automatically for |
| 275 | non-root filesystems). |
| 276 | Softupdates drastically improves meta-data performance, mainly file |
| 277 | creation and deletion. |
| 278 | We recommend enabling softupdates on most filesystems; however, there |
| 279 | are two limitations to softupdates that you should be aware of when |
| 280 | determining whether to use it on a filesystem. |
| 281 | First, softupdates guarantees filesystem consistency in the |
| 282 | case of a crash but could very easily be several seconds (even a minute!) |
| 283 | behind on pending writes to the physical disk. |
| 284 | If you crash you may lose more work |
| 285 | than otherwise. |
| 286 | Secondly, softupdates delays the freeing of filesystem |
| 287 | blocks. |
| 288 | If you have a filesystem (such as the root filesystem) which is |
| 289 | close to full, doing a major update of it, e.g.\& |
| 290 | .Dq Li "make installworld" , |
| 291 | can run it out of space and cause the update to fail. |
| 292 | For this reason, softupdates will not be enabled on the root filesystem |
during a typical install.
There is no loss of performance since the root filesystem is rarely
written to.
| 295 | .Pp |
| 296 | A number of run-time |
| 297 | .Xr mount 8 |
| 298 | options exist that can help you tune the system. |
| 299 | The most obvious and most dangerous one is |
| 300 | .Cm async . |
| 301 | Do not ever use it; it is far too dangerous. |
| 302 | A less dangerous and more |
| 303 | useful |
| 304 | .Xr mount 8 |
| 305 | option is called |
| 306 | .Cm noatime . |
| 307 | .Ux |
| 308 | filesystems normally update the last-accessed time of a file or |
| 309 | directory whenever it is accessed. |
| 310 | This operation is handled in |
| 311 | .Dx |
| 312 | with a delayed write and normally does not create a burden on the system. |
| 313 | However, if your system is accessing a huge number of files on a continuing |
basis, the buffer cache can wind up getting polluted with atime updates,
| 315 | creating a burden on the system. |
| 316 | For example, if you are running a heavily |
| 317 | loaded web site, or a news server with lots of readers, you might want to |
| 318 | consider turning off atime updates on your larger partitions with this |
| 319 | .Xr mount 8 |
| 320 | option. |
| 321 | However, you should not gratuitously turn off atime |
| 322 | updates everywhere. |
| 323 | For example, the |
| 324 | .Pa /var |
| 325 | filesystem customarily |
| 326 | holds mailboxes, and atime (in combination with mtime) is used to |
| 327 | determine whether a mailbox has new mail. |
| 328 | You might as well leave |
| 329 | atime turned on for mostly read-only partitions such as |
| 330 | .Pa / |
| 331 | and |
| 332 | .Pa /usr |
| 333 | as well. |
| 334 | This is especially useful for |
| 335 | .Pa / |
| 336 | since some system utilities |
| 337 | use the atime field for reporting. |
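.Pp
As a sketch, a large, busy
.Pa /home
filesystem could be mounted with atime updates disabled via an
.Xr fstab 5
entry such as this one (the device name is illustrative):
.Bd -literal -offset indent
/dev/ad0s1g  /home  ufs  rw,noatime  2  2
.Ed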
| 338 | .Sh STRIPING DISKS |
| 339 | In larger systems you can stripe partitions from several drives together |
| 340 | to create a much larger overall partition. |
| 341 | Striping can also improve |
| 342 | the performance of a filesystem by splitting I/O operations across two |
| 343 | or more disks. |
| 344 | The |
| 345 | .Xr vinum 8 |
| 346 | and |
| 347 | .Xr ccdconfig 8 |
| 348 | utilities may be used to create simple striped filesystems. |
| 349 | Generally |
| 350 | speaking, striping smaller partitions such as the root and |
| 351 | .Pa /var/tmp , |
| 352 | or essentially read-only partitions such as |
.Pa /usr ,
| 354 | is a complete waste of time. |
| 355 | You should only stripe partitions that require serious I/O performance, |
| 356 | typically |
| 357 | .Pa /var , /home , |
| 358 | or custom partitions used to hold databases and web pages. |
| 359 | Choosing the proper stripe size is also |
| 360 | important. |
| 361 | Filesystems tend to store meta-data on power-of-2 boundaries |
| 362 | and you usually want to reduce seeking rather than increase seeking. |
| 363 | This |
| 364 | means you want to use a large off-center stripe size such as 1152 sectors |
| 365 | so sequential I/O does not seek both disks and so meta-data is distributed |
| 366 | across both disks rather than concentrated on a single disk. |
| 367 | If |
| 368 | you really need to get sophisticated, we recommend using a real hardware |
| 369 | RAID controller from the list of |
| 370 | .Dx |
| 371 | supported controllers. |
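.Pp
As an illustrative sketch (device names are examples, and the label
written here may need manual adjustment), two partitions could be
striped with the 1152-sector interleave suggested above using
.Xr ccdconfig 8 :
.Bd -literal -offset indent
ccdconfig ccd0 1152 0 /dev/ad0s1e /dev/ad2s1e
disklabel -r -w ccd0 auto
newfs /dev/ccd0c
.Ed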
| 372 | .Sh SYSCTL TUNING |
| 373 | .Xr sysctl 8 |
| 374 | variables permit system behavior to be monitored and controlled at |
| 375 | run-time. |
| 376 | Some sysctls simply report on the behavior of the system; others allow |
| 377 | the system behavior to be modified; |
| 378 | some may be set at boot time using |
| 379 | .Xr rc.conf 5 , |
| 380 | but most will be set via |
| 381 | .Xr sysctl.conf 5 . |
| 382 | There are several hundred sysctls in the system, including many that appear |
| 383 | to be candidates for tuning but actually are not. |
| 384 | In this document we will only cover the ones that have the greatest effect |
| 385 | on the system. |
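.Pp
For example, a tunable value can be inspected and changed at run-time
with
.Xr sysctl 8 ,
and set at every boot from
.Xr sysctl.conf 5
(the value shown is illustrative):
.Bd -literal -offset indent
sysctl kern.maxfiles                 # display the current value
sysctl kern.maxfiles=20000           # change it at run-time
echo 'kern.maxfiles=20000' >> /etc/sysctl.conf
.Ed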
| 386 | .Pp |
| 387 | The |
| 388 | .Va kern.ipc.shm_use_phys |
| 389 | sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on). |
| 390 | Setting |
| 391 | this parameter to 1 will cause all System V shared memory segments to be |
| 392 | mapped to unpageable physical RAM. |
| 393 | This feature only has an effect if you |
| 394 | are either (A) mapping small amounts of shared memory across many (hundreds) |
| 395 | of processes, or (B) mapping large amounts of shared memory across any |
| 396 | number of processes. |
| 397 | This feature allows the kernel to remove a great deal |
| 398 | of internal memory management page-tracking overhead at the cost of wiring |
| 399 | the shared memory into core, making it unswappable. |
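.Pp
For example, a dedicated database host mapping large System V shared
memory segments might set, in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
kern.ipc.shm_use_phys=1
.Ed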
| 400 | .Pp |
| 401 | The |
| 402 | .Va vfs.write_behind |
sysctl defaults to 1 (on).
This tells the filesystem to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.
The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.
However, this may stall processes and under certain circumstances you may
wish to turn it off.
| 409 | .Pp |
| 410 | The |
| 411 | .Va vfs.hirunningspace |
| 412 | sysctl determines how much outstanding write I/O may be queued to |
disk controllers system wide at any given instant.
The default is
usually sufficient but on machines with lots of disks you may want to bump
it up to four or five megabytes.
Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.
Do not set this value arbitrarily high!
Also,
higher write queueing values may add latency to reads occurring at the same
time.
| 420 | .Pp |
| 421 | There are various other buffer-cache and VM page cache related sysctls. |
| 422 | We do not recommend modifying these values. |
| 423 | As of |
| 424 | .Fx 4.3 , |
| 425 | the VM system does an extremely good job tuning itself. |
| 426 | .Pp |
| 427 | The |
| 428 | .Va net.inet.tcp.sendspace |
| 429 | and |
| 430 | .Va net.inet.tcp.recvspace |
| 431 | sysctls are of particular interest if you are running network intensive |
| 432 | applications. |
| 433 | They control the amount of send and receive buffer space |
| 434 | allowed for any given TCP connection. |
| 435 | The default sending buffer is 32K; the default receiving buffer |
| 436 | is 64K. |
| 437 | You can often |
| 438 | improve bandwidth utilization by increasing the default at the cost of |
| 439 | eating up more kernel memory for each connection. |
| 440 | We do not recommend |
| 441 | increasing the defaults if you are serving hundreds or thousands of |
| 442 | simultaneous connections because it is possible to quickly run the system |
| 443 | out of memory due to stalled connections building up. |
But if you need
high bandwidth over a smaller number of connections, especially if you have
| 446 | gigabit Ethernet, increasing these defaults can make a huge difference. |
| 447 | You can adjust the buffer size for incoming and outgoing data separately. |
| 448 | For example, if your machine is primarily doing web serving you may want |
| 449 | to decrease the recvspace in order to be able to increase the |
| 450 | sendspace without eating too much kernel memory. |
| 451 | Note that the routing table (see |
| 452 | .Xr route 8 ) |
| 453 | can be used to introduce route-specific send and receive buffer size |
| 454 | defaults. |
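.Pp
As a sketch, a web server that mostly sends data might shift buffer
space toward sending with
.Xr sysctl.conf 5
entries such as these (the values are illustrative):
.Bd -literal -offset indent
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=16384
.Ed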
| 455 | .Pp |
| 456 | As an additional management tool you can use pipes in your |
| 457 | firewall rules (see |
| 458 | .Xr ipfw 8 ) |
| 459 | to limit the bandwidth going to or from particular IP blocks or ports. |
| 460 | For example, if you have a T1 you might want to limit your web traffic |
| 461 | to 70% of the T1's bandwidth in order to leave the remainder available |
| 462 | for mail and interactive use. |
| 463 | Normally a heavily loaded web server |
| 464 | will not introduce significant latencies into other services even if |
| 465 | the network link is maxed out, but enforcing a limit can smooth things |
| 466 | out and lead to longer term stability. |
| 467 | Many people also enforce artificial |
| 468 | bandwidth limitations in order to ensure that they are not charged for |
| 469 | using too much bandwidth. |
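.Pp
As an illustrative sketch (a real ruleset will differ), outgoing web
traffic could be limited to roughly 70% of a T1 (about 1080Kbit/s):
.Bd -literal -offset indent
ipfw pipe 1 config bw 1080Kbit/s
ipfw add pipe 1 tcp from any 80 to any out
.Ed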
| 470 | .Pp |
Setting the send or receive TCP buffer to values larger than 65535 will result
| 472 | in a marginal performance improvement unless both hosts support the window |
| 473 | scaling extension of the TCP protocol, which is controlled by the |
| 474 | .Va net.inet.tcp.rfc1323 |
| 475 | sysctl. |
| 476 | These extensions should be enabled and the TCP buffer size should be set |
| 477 | to a value larger than 65536 in order to obtain good performance from |
| 478 | certain types of network links; specifically, gigabit WAN links and |
| 479 | high-latency satellite links. |
| 480 | RFC1323 support is enabled by default. |
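.Pp
For example, to pair window scaling with buffers larger than 64K on
such a link (the buffer sizes are illustrative):
.Bd -literal -offset indent
net.inet.tcp.rfc1323=1
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
.Ed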
| 481 | .Pp |
| 482 | The |
| 483 | .Va net.inet.tcp.always_keepalive |
| 484 | sysctl determines whether or not the TCP implementation should attempt |
| 485 | to detect dead TCP connections by intermittently delivering |
| 486 | .Dq keepalives |
| 487 | on the connection. |
By default, this is disabled for all applications; only applications
that specifically request keepalives will use them.
| 490 | In most environments, TCP keepalives will improve the management of |
| 491 | system state by expiring dead TCP connections, particularly for |
| 492 | systems serving dialup users who may not always terminate individual |
| 493 | TCP connections before disconnecting from the network. |
| 494 | However, in some environments, temporary network outages may be |
| 495 | incorrectly identified as dead sessions, resulting in unexpectedly |
| 496 | terminated TCP connections. |
| 497 | In such environments, setting the sysctl to 0 may reduce the occurrence of |
| 498 | TCP session disconnections. |
| 499 | .Pp |
| 500 | The |
| 501 | .Va net.inet.tcp.delayed_ack |
TCP feature is largely misunderstood.
Historically speaking, this feature
was designed to allow the acknowledgement to transmitted data to be returned
along with the response.
For example, when you type over a remote shell
the acknowledgement to the character you send can be returned along with the
data representing the echo of the character.
With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.
This same concept also
applies to any interactive protocol (e.g.\& SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.
The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.
Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.
While we are not certain, we believe that the several FAQs related to
packages such as SAMBA and SQUID which advise turning off delayed acks
are referring to this slow-start issue.
| 519 | .Pp |
| 520 | The |
| 521 | .Va net.inet.tcp.inflight_enable |
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.
This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.
If you enable this option you should also be sure to set
.Va net.inet.tcp.inflight_debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.
Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.
With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.
However, note that this feature
only affects data transmission (uploading / server-side).
It does not
affect data reception (downloading).
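.Pp
A typical production configuration following the advice above would be:
.Bd -literal -offset indent
net.inet.tcp.inflight_enable=1
net.inet.tcp.inflight_debug=0
net.inet.tcp.inflight_min=6144
.Ed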
| 542 | .Pp |
| 543 | Adjusting |
| 544 | .Va net.inet.tcp.inflight_stab |
| 545 | is not recommended. |
This parameter defaults to 20, representing 2 maximal packets added
to the bandwidth delay product window calculation.
The additional
window is required to stabilize the algorithm and improve responsiveness
to changing conditions, but it can also result in higher ping times
over slow links (though still much lower than you would get without
the inflight algorithm).
In such cases you may
wish to try reducing this parameter to 15, 10, or 5, and you may also
have to reduce
.Va net.inet.tcp.inflight_min
(for example, to 3500) to get the desired effect.
Reducing these parameters
should be done as a last resort only.
| 557 | .Pp |
| 558 | The |
| 559 | .Va net.inet.ip.portrange.* |
| 560 | sysctls control the port number ranges automatically bound to TCP and UDP |
sockets.
There are three ranges: a low range, a default range, and a
high range, selectable via an
.Dv IP_PORTRANGE
.Xr setsockopt 2
call.
Most
network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000 respectively.
Bound port ranges are
used for outgoing connections and it is possible to run the system out
of ports under certain circumstances.
This most commonly occurs when you are
running a heavily loaded web proxy.
The port range is not an issue
when running servers which handle mainly incoming connections, such as a
normal web server, or which have a limited number of outgoing connections,
such as a mail relay.
For situations where you may run yourself out of
ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.
A value of 10000, 20000, or 30000 may be reasonable.
You should
also consider firewall effects when changing the port range.
Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.
For this reason
we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
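.Pp
For example, to expand the default range modestly:
.Bd -literal -offset indent
net.inet.ip.portrange.last=20000
.Ed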
| 583 | .Pp |
| 584 | The |
| 585 | .Va kern.ipc.somaxconn |
| 586 | sysctl limits the size of the listen queue for accepting new TCP connections. |
| 587 | The default value of 128 is typically too low for robust handling of new |
| 588 | connections in a heavily loaded web server environment. |
| 589 | For such environments, |
| 590 | we recommend increasing this value to 1024 or higher. |
| 591 | The service daemon |
| 592 | may itself limit the listen queue size (e.g.\& |
| 593 | .Xr sendmail 8 , |
| 594 | apache) but will |
| 595 | often have a directive in its configuration file to adjust the queue size up. |
| 596 | Larger listen queues also do a better job of fending off denial of service |
| 597 | attacks. |
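.Pp
For example, in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
kern.ipc.somaxconn=1024
.Ed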
| 598 | .Pp |
| 599 | The |
| 600 | .Va kern.maxfiles |
| 601 | sysctl determines how many open files the system supports. |
| 602 | The default is |
| 603 | typically a few thousand but you may need to bump this up to ten or twenty |
| 604 | thousand if you are running databases or large descriptor-heavy daemons. |
| 605 | The read-only |
| 606 | .Va kern.openfiles |
| 607 | sysctl may be interrogated to determine the current number of open files |
| 608 | on the system. |
| 609 | .Pp |
| 610 | The |
| 611 | .Va vm.swap_idle_enabled |
| 612 | sysctl is useful in large multi-user systems where you have lots of users |
| 613 | entering and leaving the system and lots of idle processes. |
| 614 | Such systems |
| 615 | tend to generate a great deal of continuous pressure on free memory reserves. |
| 616 | Turning this feature on and adjusting the swapout hysteresis (in idle |
| 617 | seconds) via |
| 618 | .Va vm.swap_idle_threshold1 |
| 619 | and |
| 620 | .Va vm.swap_idle_threshold2 |
| 621 | allows you to depress the priority of pages associated with idle processes |
more quickly than the normal pageout algorithm.
| 623 | This gives a helping hand |
| 624 | to the pageout daemon. |
| 625 | Do not turn this option on unless you need it, |
| 626 | because the tradeoff you are making is to essentially pre-page memory sooner |
| 627 | rather than later, eating more swap and disk bandwidth. |
| 628 | In a small system |
| 629 | this option will have a detrimental effect but in a large system that is |
| 630 | already doing moderate paging this option allows the VM system to stage |
| 631 | whole processes into and out of memory more easily. |
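.Pp
A sketch of such a configuration, with illustrative hysteresis values:
.Bd -literal -offset indent
vm.swap_idle_enabled=1
vm.swap_idle_threshold1=2      # idle seconds (illustrative)
vm.swap_idle_threshold2=10     # idle seconds (illustrative)
.Ed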
| 632 | .Sh LOADER TUNABLES |
| 633 | Some aspects of the system behavior may not be tunable at runtime because |
| 634 | memory allocations they perform must occur early in the boot process. |
| 635 | To change loader tunables, you must set their values in |
| 636 | .Xr loader.conf 5 |
| 637 | and reboot the system. |
| 638 | .Pp |
| 639 | .Va kern.maxusers |
| 640 | controls the scaling of a number of static system tables, including defaults |
| 641 | for the maximum number of open files, sizing of network memory resources, etc. |
| 642 | On |
| 643 | .Dx , |
| 644 | .Va kern.maxusers |
| 645 | is automatically sized at boot based on the amount of memory available in |
| 646 | the system, and may be determined at run-time by inspecting the value of the |
| 647 | read-only |
| 648 | .Va kern.maxusers |
| 649 | sysctl. |
| 650 | Some sites will require larger or smaller values of |
| 651 | .Va kern.maxusers |
| 652 | and may set it as a loader tunable; values of 64, 128, and 256 are not |
| 653 | uncommon. |
| 654 | We do not recommend going above 256 unless you need a huge number |
| 655 | of file descriptors; many of the tunable values set to their defaults by |
| 656 | .Va kern.maxusers |
| 657 | may be individually overridden at boot-time or run-time as described |
| 658 | elsewhere in this document. |
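.Pp
For example, in
.Xr loader.conf 5 :
.Bd -literal -offset indent
kern.maxusers="256"
.Ed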
| 659 | .Pp |
| 660 | The |
| 661 | .Va kern.dfldsiz |
| 662 | and |
| 663 | .Va kern.dflssiz |
| 664 | tunables set the default soft limits for process data and stack size |
| 665 | respectively. |
| 666 | Processes may increase these up to the hard limits by calling |
| 667 | .Xr setrlimit 2 . |
| 668 | The |
| 669 | .Va kern.maxdsiz , |
| 670 | .Va kern.maxssiz , |
| 671 | and |
| 672 | .Va kern.maxtsiz |
| 673 | tunables set the hard limits for process data, stack, and text size |
| 674 | respectively; processes may not exceed these limits. |
| 675 | The |
| 676 | .Va kern.sgrowsiz |
| 677 | tunable controls how much the stack segment will grow when a process |
| 678 | needs to allocate more stack. |
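.Pp
For example, to raise the hard limit on process data size to 1GB, one
might set (an illustrative value, in bytes):
.Bd -literal -offset indent
kern.maxdsiz="1073741824"
.Ed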
| 679 | .Pp |
| 680 | .Va kern.ipc.nmbclusters |
| 681 | may be adjusted to increase the number of network mbufs the system is |
| 682 | willing to allocate. |
| 683 | Each cluster represents approximately 2K of memory, |
| 684 | so a value of 1024 represents 2M of kernel memory reserved for network |
| 685 | buffers. |
| 686 | You can do a simple calculation to figure out how many you need. |
| 687 | If you have a web server which maxes out at 1000 simultaneous connections, |
| 688 | and each connection eats a 16K receive and 16K send buffer, you need |
| 689 | approximately 32MB worth of network buffers to deal with it. |
A good rule of
thumb is to multiply by 2, so 32MB x 2 = 64MB; at 2K per cluster that
gives 64MB / 2K = 32768 clusters.
| 692 | So for this case |
| 693 | you would want to set |
| 694 | .Va kern.ipc.nmbclusters |
| 695 | to 32768. |
| 696 | We recommend values between |
1024 and 4096 for machines with moderate amounts of memory, and between 4096
| 698 | and 32768 for machines with greater amounts of memory. |
Under no circumstances
should you specify an arbitrarily high value for this parameter; it could
lead to a boot-time crash.
| 702 | The |
| 703 | .Fl m |
| 704 | option to |
| 705 | .Xr netstat 1 |
| 706 | may be used to observe network cluster use. |
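.Pp
Continuing the example above, the busy web server would set, in
.Xr loader.conf 5 :
.Bd -literal -offset indent
kern.ipc.nmbclusters="32768"
.Ed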
| 707 | .Pp |
| 708 | More and more programs are using the |
| 709 | .Xr sendfile 2 |
| 710 | system call to transmit files over the network. |
| 711 | The |
| 712 | .Va kern.ipc.nsfbufs |
| 713 | sysctl controls the number of filesystem buffers |
| 714 | .Xr sendfile 2 |
| 715 | is allowed to use to perform its work. |
| 716 | This parameter nominally scales |
| 717 | with |
| 718 | .Va kern.maxusers |
| 719 | so you should not need to modify this parameter except under extreme |
| 720 | circumstances. |
| 721 | .Sh KERNEL CONFIG TUNING |
| 722 | There are a number of kernel options that you may have to fiddle with in |
| 723 | a large-scale system. |
| 724 | In order to change these options you need to be |
| 725 | able to compile a new kernel from source. |
| 726 | The |
| 727 | .Xr config 8 |
| 728 | manual page and the handbook are good starting points for learning how to |
| 729 | do this. |
| 730 | Generally the first thing you do when creating your own custom |
| 731 | kernel is to strip out all the drivers and services you do not use. |
| 732 | Removing things like |
| 733 | .Dv INET6 |
| 734 | and drivers you do not have will reduce the size of your kernel, sometimes |
| 735 | by a megabyte or more, leaving more memory available for applications. |
| 736 | .Pp |
| 737 | .Dv SCSI_DELAY |
| 738 | may be used to reduce system boot times. |
| 739 | The default is fairly high and |
| 740 | can be responsible for 15+ seconds of delay in the boot process. |
| 741 | Reducing |
| 742 | .Dv SCSI_DELAY |
| 743 | to 5 seconds usually works (especially with modern drives). |
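.Pp
For example, in the kernel configuration file (the value is in
milliseconds):
.Bd -literal -offset indent
options SCSI_DELAY=5000     # 5 second SCSI bus settle delay
.Ed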
| 744 | .Pp |
| 745 | There are a number of |
| 746 | .Dv *_CPU |
| 747 | options that can be commented out. |
| 748 | If you only want the kernel to run |
| 749 | on a Pentium class CPU, you can easily remove |
| 750 | .Dv I386_CPU |
| 751 | and |
| 752 | .Dv I486_CPU , |
| 753 | but only remove |
| 754 | .Dv I586_CPU |
| 755 | if you are sure your CPU is being recognized as a Pentium II or better. |
| 756 | Some clones may be recognized as a Pentium or even a 486 and not be able |
| 757 | to boot without those options. |
| 758 | If it works, great! |
| 759 | The operating system |
will be able to better use higher-end CPU features for MMU, task switching,
| 761 | timebase, and even device operations. |
| 762 | Additionally, higher-end CPUs support |
| 763 | 4MB MMU pages, which the kernel uses to map the kernel itself into memory, |
| 764 | increasing its efficiency under heavy syscall loads. |
| 765 | .Sh IDE WRITE CACHING |
| 766 | .Fx 4.3 |
| 767 | flirted with turning off IDE write caching. |
| 768 | This reduced write bandwidth |
| 769 | to IDE disks but was considered necessary due to serious data consistency |
| 770 | issues introduced by hard drive vendors. |
| 771 | Basically the problem is that |
| 772 | IDE drives lie about when a write completes. |
| 773 | With IDE write caching turned |
| 774 | on, IDE hard drives will not only write data to disk out of order, they |
| 775 | will sometimes delay some of the blocks indefinitely under heavy disk |
| 776 | load. |
| 777 | A crash or power failure can result in serious filesystem |
| 778 | corruption. |
| 779 | So our default was changed to be safe. |
| 780 | Unfortunately, the |
| 781 | result was such a huge loss in performance that we caved in and changed the |
| 782 | default back to on after the release. |
| 783 | You should check the default on |
| 784 | your system by observing the |
| 785 | .Va hw.ata.wc |
| 786 | sysctl variable. |
| 787 | If IDE write caching is turned off, you can turn it back |
| 788 | on by setting the |
| 789 | .Va hw.ata.wc |
| 790 | loader tunable to 1. |
| 791 | More information on tuning the ATA driver system may be found in the |
| 792 | .Xr ata 4 |
| 793 | man page. |
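.Pp
For example, in
.Xr loader.conf 5 :
.Bd -literal -offset indent
hw.ata.wc="1"
.Ed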
| 794 | .Pp |
| 795 | There is a new experimental feature for IDE hard drives called |
| 796 | .Va hw.ata.tags |
| 797 | (you also set this in the boot loader) which allows write caching to be safely |
| 798 | turned on. |
| 799 | This brings SCSI tagging features to IDE drives. |
| 800 | As of this |
| 801 | writing only IBM DPTA and DTLA drives support the feature. |
| 802 | Warning! |
| 803 | These |
| 804 | drives apparently have quality control problems and I do not recommend |
| 805 | purchasing them at this time. |
| 806 | If you need performance, go with SCSI. |
| 807 | .Sh CPU, MEMORY, DISK, NETWORK |
| 808 | The type of tuning you do depends heavily on where your system begins to |
| 809 | bottleneck as load increases. |
| 810 | If your system runs out of CPU (idle times |
| 811 | are perpetually 0%) then you need to consider upgrading the CPU or moving to |
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
| 813 | programs that are causing the load and try to optimize them. |
| 814 | If your system |
| 815 | is paging to swap a lot you need to consider adding more memory. |
| 816 | If your |
| 817 | system is saturating the disk you typically see high CPU idle times and |
| 818 | total disk saturation. |
| 819 | .Xr systat 1 |
| 820 | can be used to monitor this. |
| 821 | There are many solutions to saturated disks: |
| 822 | increasing memory for caching, mirroring disks, distributing operations across |
| 823 | several machines, and so forth. |
| 824 | If disk performance is an issue and you |
| 825 | are using IDE drives, switching to SCSI can help a great deal. |
| 826 | While modern |
| 827 | IDE drives compare with SCSI in raw sequential bandwidth, the moment you |
| 828 | start seeking around the disk SCSI drives usually win. |
| 829 | .Pp |
| 830 | Finally, you might run out of network suds. |
| 831 | The first line of defense for |
| 832 | improving network performance is to make sure you are using switches instead |
| 833 | of hubs, especially these days where switches are almost as cheap. |
| 834 | Hubs |
| 835 | have severe problems under heavy loads due to collision backoff and one bad |
| 836 | host can severely degrade the entire LAN. |
| 837 | Second, optimize the network path |
| 838 | as much as possible. |
| 839 | For example, in |
| 840 | .Xr firewall 7 |
| 841 | we describe a firewall protecting internal hosts with a topology where |
| 842 | the externally visible hosts are not routed through it. |
| 843 | Use 100BaseT rather |
| 844 | than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs. |
| 845 | Most bottlenecks occur at the WAN link (e.g.\& |
| 846 | modem, T1, DSL, whatever). |
| 847 | If expanding the link is not an option it may be possible to use the |
| 848 | .Xr dummynet 4 |
| 849 | feature to implement peak shaving or other forms of traffic shaping to |
| 850 | prevent the overloaded service (such as web services) from affecting other |
| 851 | services (such as email), or vice versa. |
| 852 | In home installations this could |
| 853 | be used to give interactive traffic (your browser, |
| 854 | .Xr ssh 1 |
| 855 | logins) priority |
| 856 | over services you export from your box (web services, email). |
| 857 | .Sh SEE ALSO |
| 858 | .Xr netstat 1 , |
| 859 | .Xr systat 1 , |
| 860 | .Xr ata 4 , |
| 861 | .Xr dummynet 4 , |
.Xr fstab 5 ,
.Xr loader.conf 5 ,
.Xr login.conf 5 ,
| 863 | .Xr rc.conf 5 , |
| 864 | .Xr sysctl.conf 5 , |
| 865 | .Xr firewall 7 , |
| 866 | .Xr hier 7 , |
| 867 | .Xr boot 8 , |
| 868 | .Xr ccdconfig 8 , |
| 869 | .Xr config 8 , |
| 870 | .Xr disklabel 8 , |
| 871 | .Xr fsck 8 , |
| 872 | .Xr ifconfig 8 , |
| 873 | .Xr ipfw 8 , |
| 874 | .Xr loader 8 , |
| 875 | .Xr mount 8 , |
| 876 | .Xr newfs 8 , |
| 877 | .Xr route 8 , |
| 878 | .Xr sysctl 8 , |
| 879 | .Xr tunefs 8 , |
| 880 | .Xr vinum 8 |
| 881 | .Sh HISTORY |
| 882 | The |
| 883 | .Nm |
| 884 | manual page was originally written by |
| 885 | .An Matthew Dillon |
| 886 | and first appeared |
| 887 | in |
| 888 | .Fx 4.3 , |
| 889 | May 2001. |