.\" Copyright (c) 2001 Matthew Dillon. Terms and conditions are those of
.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
.\" the source tree.
.\"
.\" $FreeBSD: src/share/man/man7/tuning.7,v 1.1.2.30 2002/12/17 19:32:08 dillon Exp $
.\" $DragonFly: src/share/man/man7/tuning.7,v 1.9 2006/10/15 00:04:45 swildner Exp $
.\"
.Dd May 11, 2006
.Dt TUNING 7
.Os
.Sh NAME
.Nm tuning
.Nd performance tuning under
.Dx
.Sh SYSTEM SETUP - DISKLABEL, NEWFS, TUNEFS, SWAP
When using
.Xr disklabel 8
or the
.Dx
installer
to lay out your filesystems on a hard disk it is important to remember
that hard drives can transfer data much more quickly from outer tracks
than they can from inner tracks.
To take advantage of this you should
try to pack your smaller filesystems and swap closer to the outer tracks,
follow with the larger filesystems, and end with the largest filesystems.
It is also important to size system standard filesystems such that you
will not be forced to resize them later as you scale the machine up.
I usually create, in order, a 128M root, 1G swap, 128M
.Pa /var ,
128M
.Pa /var/tmp ,
3G
.Pa /usr ,
and use any remaining space for
.Pa /home .
.Pp
You should typically size your swap space to approximately 2x main memory.
If you do not have a lot of RAM, though, you will generally want a lot
more swap.
It is not recommended that you configure any less than
256M of swap on a system and you should keep in mind future memory
expansion when sizing the swap partition.
The kernel's VM paging algorithms are tuned to perform best when there is
at least 2x swap versus main memory.
Configuring too little swap can lead
to inefficiencies in the VM page scanning code as well as create issues
later on if you add more memory to your machine.
Finally, on larger systems
with multiple SCSI disks (or multiple IDE disks operating on different
controllers), we strongly recommend that you configure swap on each drive
(up to four drives).
The swap partitions on the drives should be approximately the same size.
The kernel can handle arbitrary sizes but
internal data structures scale to 4 times the largest swap partition.
Keeping
the swap partitions near the same size will allow the kernel to optimally
stripe swap space across the N disks.
Do not worry about overdoing it a
little; swap space is the saving grace of
.Ux
and even if you do not normally use much swap, it can give you more time to
recover from a runaway program before being forced to reboot.
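.Pp
The swap sizing rules above can be sketched as a small calculation.
This is only an illustration of the arithmetic, not a tool shipped with
the system; the function name plan_swap and its parameters are
hypothetical, and the rounding of each per-drive share is an assumption.

```python
MIB = 1024 * 1024


def plan_swap(ram_bytes, drives=1, min_swap=256 * MIB, max_drives=4):
    """Return a list of equal per-drive swap sizes in bytes.

    Rules taken from the text: about 2x main memory, never less than
    256M in total, spread over at most four drives in equal shares.
    """
    if ram_bytes <= 0 or drives < 1:
        raise ValueError("ram_bytes must be positive and drives at least 1")
    total = max(2 * ram_bytes, min_swap)
    used_drives = min(drives, max_drives)
    # Round each share up so the shares together cover the target
    # (assumption: a slightly oversized swap is harmless).
    per_drive = -(-total // used_drives)
    return [per_drive] * used_drives


if __name__ == "__main__":
    for size in plan_swap(512 * MIB, drives=2):
        print(size // MIB, "MiB")
```

Equal shares are used because the text notes that the kernel stripes swap
best when the partitions are close to the same size.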
.Pp
How you size your
.Pa /var
partition depends heavily on what you intend to use the machine for.
This
partition is primarily used to hold mailboxes, the print spool, and log
files.
Some people even make
.Pa /var/log
its own partition (but except for extreme cases it is not worth the waste
of a partition ID).
If your machine is intended to act as a mail
or print server,
or you are running a heavily visited web server, you should consider
creating a much larger partition \(en perhaps a gig or more.
It is very easy
to underestimate log file storage requirements.
.Pp
Sizing
.Pa /var/tmp
depends on the kind of temporary file usage you think you will need.
128M is
the minimum we recommend.
Also note that the
.Dx
installer will create a
.Pa /tmp
directory.
Dedicating a partition for temporary file storage is important for
two reasons: first, it reduces the possibility of filesystem corruption
in a crash, and second it reduces the chance of a runaway process that
fills up
.Oo Pa /var Oc Ns Pa /tmp
from blowing up more critical subsystems (mail,
logging, etc).
Filling up
.Oo Pa /var Oc Ns Pa /tmp
is a very common problem to have.
.Pp
In the old days there were differences between
.Pa /tmp
and
.Pa /var/tmp ,
but the introduction of
.Pa /var
(and
.Pa /var/tmp )
led to massive confusion
by program writers so today programs haphazardly use one or the
other and thus no real distinction can be made between the two.
So it makes sense to have just one temporary directory and
softlink to it from the other tmp directory locations.
However you handle
.Pa /tmp ,
the one thing you do not want to do is leave it sitting
on the root partition where it might cause root to fill up or possibly
corrupt root in a crash/reboot situation.
.Pp
The
.Pa /usr
partition holds the bulk of the files required to support the system and
a subdirectory within it called
.Pa /usr/pkg
holds the bulk of the files installed from the
pkgsrc collection.
If you do not use pkgsrc all that much and do not intend to keep
system source
.Pq Pa /usr/src
on the machine, you can get away with
a 1 gigabyte
.Pa /usr
partition.
However, if you install a lot of packages
(especially window managers and Linux-emulated binaries), we recommend
at least a 2 gigabyte
.Pa /usr
and if you also intend to keep system source
on the machine, we recommend a 3 gigabyte
.Pa /usr .
Do not underestimate the
amount of space you will need in this partition; it can creep up and
surprise you!
.Pp
The
.Pa /home
partition is typically used to hold user-specific data.
I usually size it to the remainder of the disk.
.Pp
Why partition at all?
Why not create one big
.Pa /
partition and be done with it?
Then I do not have to worry about undersizing things!
Well, there are several reasons this is not a good idea.
First,
each partition has different operational characteristics and separating them
allows the filesystem to tune itself to those characteristics.
For example,
the root and
.Pa /usr
partitions are read-mostly, with very little writing, while
a lot of reading and writing could occur in
.Pa /var
and
.Pa /var/tmp .
By properly
partitioning your system, fragmentation introduced in the smaller, more
heavily write-loaded partitions will not bleed over into the mostly-read
partitions.
Additionally, keeping the write-loaded partitions closer to
the edge of the disk (i.e. before the really big partitions instead of after
in the partition table) will increase I/O performance in the partitions
where you need it the most.
Now it is true that you might also need I/O
performance in the larger partitions, but they are so large that shifting
them more towards the edge of the disk will not lead to a significant
performance improvement whereas moving
.Pa /var
to the edge can have a huge impact.
Finally, there are safety concerns.
Having a small neat root partition that
is essentially read-only gives it a greater chance of surviving a bad crash
intact.
.Pp
Properly partitioning your system also allows you to tune
.Xr newfs 8
and
.Xr tunefs 8
parameters.
Tuning
.Xr newfs 8
requires more experience but can lead to significant improvements in
performance.
There are three parameters that are relatively safe to tune:
.Em blocksize , bytes/i-node ,
and
.Em cylinders/group .
.Pp
.Dx
performs best when using 8K or 16K filesystem block sizes.
The default filesystem block size is 16K,
which provides best performance for most applications,
with the exception of those that perform random access on large files
(such as database server software).
Such applications tend to perform better with a smaller block size,
although modern disk characteristics are such that the performance
gain from using a smaller block size may not be worth consideration.
Using a block size larger than 16K
can cause fragmentation of the buffer cache and
lead to lower performance.
.Pp
The defaults may be unsuitable
for a filesystem that requires a very large number of i-nodes
or is intended to hold a large number of very small files.
Such a filesystem should be created with an 8K or 4K block size.
This also requires you to specify a smaller
fragment size.
We recommend always using a fragment size that is \(18
the block size (less testing has been done on other fragment size factors).
The
.Xr newfs 8
options for this would be
.Dq Li "newfs -f 1024 -b 8192 ..." .
.Pp
If a large partition is intended to be used to hold fewer, larger files, such
as database files, you can increase the
.Em bytes/i-node
ratio which reduces the number of i-nodes (maximum number of files and
directories that can be created) for that partition.
Decreasing the number
of i-nodes in a filesystem can greatly reduce
.Xr fsck 8
recovery times after a crash.
Do not use this option
unless you are actually storing large files on the partition, because if you
overcompensate you can wind up with a filesystem that has lots of free
space remaining but cannot accommodate any more files.
Using 32768, 65536, or 262144 bytes/i-node is recommended.
You can go higher but
it will have only incremental effects on
.Xr fsck 8
recovery times.
For example,
.Dq Li "newfs -i 32768 ..." .
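.Pp
To get a feel for the numbers, the following Python sketch works out the
fragment size for a given block size and a rough i-node count for a given
bytes/i-node ratio.
The helper names are hypothetical, and the i-node figure is only an
approximation (partition size divided by the ratio); the filesystem itself
may round the count differently.

```python
GIB = 1024 ** 3


def fragment_size(block_size, factor=8):
    """Fragment size as 1/8 of the block size, the ratio the text recommends."""
    if block_size % factor:
        raise ValueError("block size must be a multiple of the factor")
    return block_size // factor


def approx_inodes(partition_bytes, bytes_per_inode):
    """Rough i-node count: one i-node per bytes_per_inode bytes of space."""
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return partition_bytes // bytes_per_inode


if __name__ == "__main__":
    print("fragment size for 8K blocks:", fragment_size(8192))
    # Hypothetical 100 GiB partition, at each density the text recommends.
    for density in (32768, 65536, 262144):
        print(density, "bytes/i-node ->", approx_inodes(100 * GIB, density))
```

An 8192-byte block gives the 1024-byte fragment used in the
newfs example above, and each step up in bytes/i-node cuts the number of
i-nodes that fsck has to examine.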
.Pp
.Xr tunefs 8
may be used to further tune a filesystem.
This command can be run in
single-user mode without having to reformat the filesystem.
However, this is possibly the most abused program in the system.
Many people attempt to
increase available filesystem space by setting the min-free percentage to 0.
This can lead to severe filesystem fragmentation and we do not recommend
that you do this.
Really the only
.Xr tunefs 8
option worthwhile here is turning on
.Em softupdates
with
.Dq Li "tunefs -n enable /filesystem" .
(Note: in
.Dx ,
softupdates can be turned on using the
.Fl U
option to
.Xr newfs 8 ,
and the
.Dx
installer will typically enable softupdates automatically for
non-root filesystems).
Softupdates drastically improves meta-data performance, mainly file
creation and deletion.
We recommend enabling softupdates on most filesystems; however, there
are two limitations to softupdates that you should be aware of when
determining whether to use it on a filesystem.
First, softupdates guarantees filesystem consistency in the
case of a crash but could very easily be several seconds (even a minute!)
behind on pending writes to the physical disk.
If you crash you may lose more work
than otherwise.
Secondly, softupdates delays the freeing of filesystem
blocks.
If you have a filesystem (such as the root filesystem) which is
close to full, doing a major update of it, e.g.\&
.Dq Li "make installworld" ,
can run it out of space and cause the update to fail.
For this reason, softupdates will not be enabled on the root filesystem
during a typical install.
There is no loss of performance since the root
filesystem is rarely written to.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more
useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
This operation is handled in
.Dx
with a delayed write and normally does not create a burden on the system.
However, if your system is accessing a huge number of files on a continuing
basis the buffer cache can wind up getting polluted with atime updates,
creating a burden on the system.
For example, if you are running a heavily
loaded web site, or a news server with lots of readers, you might want to
consider turning off atime updates on your larger partitions with this
.Xr mount 8
option.
However, you should not gratuitously turn off atime
updates everywhere.
For example, the
.Pa /var
filesystem customarily
holds mailboxes, and atime (in combination with mtime) is used to
determine whether a mailbox has new mail.
You might as well leave
atime turned on for mostly read-only partitions such as
.Pa /
and
.Pa /usr
as well.
This is especially useful for
.Pa /
since some system utilities
use the atime field for reporting.
299Do not ever use it; it is far too dangerous.
300A less dangerous and more
301useful
302.Xr mount 8
303option is called
304.Cm noatime .
305.Ux
306filesystems normally update the last-accessed time of a file or
307directory whenever it is accessed.
308This operation is handled in
9bb2a92d 309.Dx
984263bc
MD
310with a delayed write and normally does not create a burden on the system.
311However, if your system is accessing a huge number of files on a continuing
312basis the buffer cache can wind up getting polluted with atime updates,
313creating a burden on the system.
314For example, if you are running a heavily
315loaded web site, or a news server with lots of readers, you might want to
316consider turning off atime updates on your larger partitions with this
317.Xr mount 8
318option.
319However, you should not gratuitously turn off atime
320updates everywhere.
321For example, the
322.Pa /var
323filesystem customarily
324holds mailboxes, and atime (in combination with mtime) is used to
325determine whether a mailbox has new mail.
326You might as well leave
327atime turned on for mostly read-only partitions such as
328.Pa /
329and
330.Pa /usr
331as well.
332This is especially useful for
333.Pa /
334since some system utilities
335use the atime field for reporting.
336.Sh STRIPING DISKS
337In larger systems you can stripe partitions from several drives together
338to create a much larger overall partition.
339Striping can also improve
340the performance of a filesystem by splitting I/O operations across two
341or more disks.
342The
343.Xr vinum 8
344and
345.Xr ccdconfig 8
346utilities may be used to create simple striped filesystems.
347Generally
348speaking, striping smaller partitions such as the root and
349.Pa /var/tmp ,
350or essentially read-only partitions such as
351.Pa /usr
352is a complete waste of time.
353You should only stripe partitions that require serious I/O performance,
354typically
355.Pa /var , /home ,
356or custom partitions used to hold databases and web pages.
357Choosing the proper stripe size is also
358important.
359Filesystems tend to store meta-data on power-of-2 boundaries
360and you usually want to reduce seeking rather than increase seeking.
361This
362means you want to use a large off-center stripe size such as 1152 sectors
363so sequential I/O does not seek both disks and so meta-data is distributed
364across both disks rather than concentrated on a single disk.
365If
366you really need to get sophisticated, we recommend using a real hardware
367RAID controller from the list of
9bb2a92d 368.Dx
984263bc
MD
369supported controllers.
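.Pp
The effect of an off-center stripe size can be illustrated with a small
model.
The sketch below assumes a plain two-disk stripe in which consecutive
stripes alternate between disks, and it assumes, purely for illustration,
that meta-data sits at every 2048-sector boundary; the function names and
the 2048-sector spacing are hypothetical and do not describe any real
filesystem layout.

```python
def disk_for_sector(sector, stripe_sectors, ndisks=2):
    """Disk index holding a sector in a simple round-robin stripe."""
    return (sector // stripe_sectors) % ndisks


def disks_hit(stripe_sectors, boundary=2048, count=16, ndisks=2):
    """Sorted list of disks touched by meta-data placed at each boundary.

    boundary is an assumed power-of-2 spacing, used only for illustration.
    """
    hits = {
        disk_for_sector(i * boundary, stripe_sectors, ndisks)
        for i in range(count)
    }
    return sorted(hits)


if __name__ == "__main__":
    print("1024-sector stripe:", disks_hit(1024))
    print("1152-sector stripe:", disks_hit(1152))
```

In this model a power-of-2 stripe of 1024 sectors puts every boundary on
the same disk, while the 1152-sector stripe spreads the boundaries over
both disks.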
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 0 (off) and may be set to 0 (off) or 1 (on).
Setting
this parameter to 1 will cause all System V shared memory segments to be
mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).
This tells the filesystem to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.
The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.
However,
this may stall processes and under certain circumstances you may wish to turn
it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system wide at any given instant.
The default is
usually sufficient but on machines with lots of disks you may want to bump
it up to four or five megabytes.
Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.
Do not set this value arbitrarily high!
Also,
higher write queueing values may add latency to reads occurring at the same
time.
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
The default sending buffer is 32K; the default receiving buffer
is 64K.
You can often
improve bandwidth utilization by increasing the default at the cost of
eating up more kernel memory for each connection.
We do not recommend
increasing the defaults if you are serving hundreds or thousands of
simultaneous connections because it is possible to quickly run the system
out of memory due to stalled connections building up.
But if you need
high bandwidth over a smaller number of connections, especially if you have
gigabit Ethernet, increasing these defaults can make a huge difference.
You can adjust the buffer size for incoming and outgoing data separately.
For example, if your machine is primarily doing web serving you may want
to decrease the recvspace in order to be able to increase the
sendspace without eating too much kernel memory.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
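.Pp
The T1 example works out as follows.
This Python sketch assumes the nominal T1 line rate of 1.544 Mbit/s, which
the text does not state, and the function name is hypothetical; the result
is the bandwidth figure you would give to a pipe, and the exact rule
syntax should be taken from the ipfw manual page.

```python
# Nominal T1 line rate in bits per second (assumption, not from the text).
T1_BITS_PER_SEC = 1_544_000


def capped_rate_kbit(link_bits_per_sec, percent):
    """Bandwidth cap in whole Kbit/s for a percentage of the link rate."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    # Integer arithmetic avoids floating point rounding surprises.
    return link_bits_per_sec * percent // 100 // 1000


if __name__ == "__main__":
    print("web traffic cap:", capped_rate_kbit(T1_BITS_PER_SEC, 70), "Kbit/s")
```

Rounding down to whole Kbit/s keeps the cap slightly under the 70% target
rather than over it.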
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will
result in only a marginal performance improvement unless both hosts support
the window scaling extension of the TCP protocol, which is controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC1323 support is enabled by default.
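.Pp
Whether a link needs buffers above the 65535 byte limit can be estimated
from its bandwidth delay product (bandwidth multiplied by round trip time).
The following Python sketch is a back-of-the-envelope aid only; the
function names and the sample link figures are hypothetical and are not
measurements.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth delay product in bytes for a link speed and round trip time."""
    # bits/s * ms / 1000 gives bits in flight; divide by 8 for bytes.
    return bandwidth_bits_per_sec * rtt_ms // 8000


def needs_window_scaling(buffer_bytes):
    """True if the buffer exceeds what an unscaled TCP window can express."""
    return buffer_bytes > 65535


if __name__ == "__main__":
    # Illustrative links: (name, bits per second, round trip time in ms).
    links = (
        ("100 Mbit LAN", 100_000_000, 1),
        ("gigabit WAN", 1_000_000_000, 50),
        ("T1 over satellite", 1_544_000, 600),
    )
    for name, bps, rtt in links:
        size = bdp_bytes(bps, rtt)
        print(name, size, "bytes, window scaling:", needs_window_scaling(size))
```

Under these assumed figures the LAN fits well within an unscaled window,
while the WAN and satellite links do not, which matches the link types the
text singles out.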
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is disabled for all applications; only applications
that specifically request keepalives will use them.
In most environments, TCP keepalives will improve the management of
system state by expiring dead TCP connections, particularly for
systems serving dialup users who may not always terminate individual
TCP connections before disconnecting from the network.
However, in some environments, temporary network outages may be
incorrectly identified as dead sessions, resulting in unexpectedly
terminated TCP connections.
In such environments, setting the sysctl to 0 may reduce the occurrence of
TCP session disconnections.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.
Historically speaking this feature
was designed to allow the acknowledgement to transmitted data to be returned
along with the response.
For example, when you type over a remote shell
the acknowledgement to the character you send can be returned along with the
data representing the echo of the character.
With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.
This same concept also
applies to any interactive protocol (e.g. SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.
The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.
Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.
While we are not sure, we believe that
the several FAQs related to packages such as SAMBA and SQUID which advise
turning off delayed acks may be referring to the slow-start issue.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.
This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.
If
you enable this option you should also be sure to set
.Va net.inet.tcp.inflight_debug
to 0 (disable debugging), and for production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.
Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.
With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.
However, note that this feature
only affects data transmission (uploading / server-side).
It does not
affect data reception (downloading).
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 20, representing 2 maximal packets added
to the bandwidth delay product window calculation.
The additional
window is required to stabilize the algorithm and improve responsiveness
to changing conditions, but it can also result in higher ping times
over slow links (though still much lower than you would get without
the inflight algorithm).
In such cases you may
wish to try reducing this parameter to 15, 10, or 5, and you may also
have to reduce
.Va net.inet.tcp.inflight_min
(for example, to 3500) to get the desired effect.
Reducing these parameters
should be done as a last resort only.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.
There are three ranges: a low range, a default range, and a
high range, selectable via an IP_PORTRANGE setsockopt() call.
Most
network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000 respectively.
Bound port ranges are
used for outgoing connections and it is possible to run the system out
of ports under certain circumstances.
This most commonly occurs when you are
running a heavily loaded web proxy.
The port range is not an issue
when running servers which handle mainly incoming connections such as a
normal web server, or have a limited number of outgoing connections such
as a mail relay.
For situations where you may run yourself out of
ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.
A value of 10000 or 20000 or 30000 may be reasonable.
You should
also consider firewall effects when changing the port range.
Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.
For this reason
we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
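.Pp
The number of ports available for outgoing connections follows directly
from the two sysctl values.
The Python sketch below only does that arithmetic; the function name is
hypothetical, and it assumes both ends of the range are usable.

```python
def ports_in_range(first, last):
    """Number of ports in an inclusive range from first to last."""
    if not (0 < first <= last <= 65535):
        raise ValueError("expected 0 < first <= last <= 65535")
    return last - first + 1


if __name__ == "__main__":
    # Defaults from the text, then the suggested larger upper bounds.
    for last in (5000, 10000, 20000, 30000):
        print("1024 -", last, ":", ports_in_range(1024, last), "ports")
```

With the defaults there are fewer than four thousand ports to draw from,
which is why a busy web proxy can exhaust them.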
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically a few thousand but you may need to bump this up to ten or twenty
thousand if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand
to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
.Pp
.Va kern.ipc.nmbclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
You can do a simple calculation to figure out how many you need.
If you have a web server which maxes out at 1000 simultaneous connections,
and each connection eats a 16K receive and 16K send buffer, you need
approximately 32MB worth of network buffers to deal with it.
A good rule of
thumb is to multiply by 2, giving 64MB, and 64MB divided by 2K per cluster
is 32768.
So for this case
you would want to set
.Va kern.ipc.nmbclusters
to 32768.
We recommend values between
1024 and 4096 for machines with moderate amounts of memory, and between 4096
and 32768 for machines with greater amounts of memory.
Under no circumstances
should you specify an arbitrarily high value for this parameter; it could
lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
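.Pp
The calculation above can be written out as a short Python sketch.
The function name and its parameters are hypothetical, and rounding the
buffer demand up to a whole megabyte is an assumption made here so that
the result matches the rounded 32MB figure used in the text.

```python
KIB = 1024
MIB = 1024 * KIB


def nmbclusters_estimate(connections, recv_buf, send_buf,
                         safety=2, cluster_bytes=2 * KIB):
    """Estimate kern.ipc.nmbclusters from the peak connection count.

    Follows the rule of thumb in the text: total buffer demand,
    multiplied by 2, divided by roughly 2K per cluster.
    """
    demand = connections * (recv_buf + send_buf)
    # Round the demand up to a whole MiB (assumption: mirrors the text,
    # which treats 1000 x 32K as approximately 32MB).
    demand_mib = -(-demand // MIB)
    return demand_mib * MIB * safety // cluster_bytes


if __name__ == "__main__":
    # The web server example from the text: 1000 connections, 16K each way.
    print(nmbclusters_estimate(1000, 16 * KIB, 16 * KIB))
```

Treat the result as a starting point and compare it against the cluster
use actually reported by netstat.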
.Pp
More and more programs are using the
.Xr sendfile 2
system call to transmit files over the network.
The
.Va kern.ipc.nsfbufs
sysctl controls the number of filesystem buffers
.Xr sendfile 2
is allowed to use to perform its work.
This parameter nominally scales
with
.Va kern.maxusers
so you should not need to modify this parameter except under extreme
circumstances.
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
.Dv SCSI_DELAY
may be used to reduce system boot times.
The default is fairly high and
can be responsible for 15+ seconds of delay in the boot process.
Reducing
.Dv SCSI_DELAY
to 5 seconds usually works (especially with modern drives).
.Pp
There are a number of
.Dv *_CPU
options that can be commented out.
If you only want the kernel to run
on a Pentium class CPU, you can easily remove
.Dv I386_CPU
and
.Dv I486_CPU ,
but only remove
.Dv I586_CPU
if you are sure your CPU is being recognized as a Pentium II or better.
Some clones may be recognized as a Pentium or even a 486 and not be able
to boot without those options.
If it works, great!
The operating system
will be able to make better use of higher-end CPU features for MMU,
task switching, timebase, and even device operations.
Additionally, higher-end CPUs support
4MB MMU pages, which the kernel uses to map the kernel itself into memory,
increasing its efficiency under heavy syscall loads.
.Sh IDE WRITE CACHING
.Fx 4.3
flirted with turning off IDE write caching.
This reduced write bandwidth
to IDE disks but was considered necessary due to serious data consistency
issues introduced by hard drive vendors.
Basically the problem is that
IDE drives lie about when a write completes.
With IDE write caching turned
on, IDE hard drives will not only write data to disk out of order, they
will sometimes delay some of the blocks indefinitely under heavy disk
load.
A crash or power failure can result in serious filesystem
corruption.
So our default was changed to be safe.
Unfortunately, the
result was such a huge loss in performance that we caved in and changed the
default back to on after the release.
You should check the default on
your system by observing the
.Va hw.ata.wc
sysctl variable.
If IDE write caching is turned off, you can turn it back
on by setting the
.Va hw.ata.wc
loader tunable to 1.
More information on tuning the ATA driver system may be found in the
.Xr ata 4
man page.
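.Pp
As a minimal sketch of the check and the change described above, assuming
the loader reads its tunables from /boot/loader.conf on your system and that
you are running as root:

```shell
# Show the current IDE write-caching setting (1 = on, 0 = off).
sysctl hw.ata.wc

# hw.ata.wc is a loader tunable, so it takes effect at the next boot.
# Append the setting to the loader configuration, then reboot.
echo 'hw.ata.wc="1"' >> /boot/loader.conf
```

Check that the file does not already contain a conflicting
hw.ata.wc line before appending.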
.Pp
There is a new experimental feature for IDE hard drives called
.Va hw.ata.tags
(you also set this in the boot loader) which allows write caching to be safely
turned on.
This brings SCSI tagging features to IDE drives.
As of this
writing only IBM DPTA and DTLA drives support the feature.
Warning!
These
drives apparently have quality control problems and I do not recommend
purchasing them at this time.
If you need performance, go with SCSI.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system
is paging to swap a lot you need to consider adding more memory.
If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
If disk performance is an issue and you
are using IDE drives, switching to SCSI can help a great deal.
While modern
IDE drives compare with SCSI in raw sequential bandwidth, the moment you
start seeking around the disk SCSI drives usually win.
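.Pp
One way to watch for these bottlenecks, sketched here with a one-second
refresh interval chosen only for illustration:

```shell
# Combined view: CPU idle percentage, paging activity and per-disk load.
# Idle stuck at 0% suggests a CPU limit; steady swap paging suggests
# a memory limit.
systat -vmstat 1

# Per-disk throughput view. High CPU idle together with a pegged disk
# suggests the disk is the bottleneck.
systat -iostat 1
```

Both displays are interactive, so run them one at a time and leave each
with the q key.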
.Pp
Finally, you might run out of network bandwidth.
The first line of defense for
improving network performance is to make sure you are using switches instead
of hubs, especially these days, when switches are almost as cheap.
Hubs
have severe problems under heavy loads due to collision backoff and one bad
host can severely degrade the entire LAN.
Second, optimize the network path
as much as possible.
For example, in
.Xr firewall 7
we describe a firewall protecting internal hosts with a topology where
the externally visible hosts are not routed through it.
Use 100BaseT rather
than 10BaseT, or use 1000BaseT rather than 100BaseT, depending on your needs.
Most bottlenecks occur at the WAN link (e.g.\&
modem, T1, DSL, whatever).
If expanding the link is not an option it may be possible to use the
.Xr dummynet 4
feature to implement peak shaving or other forms of traffic shaping to
prevent the overloaded service (such as web services) from affecting other
services (such as email), or vice versa.
In home installations this could
be used to give interactive traffic (your browser,
.Xr ssh 1
logins) priority
over services you export from your box (web services, email).
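.Pp
A rough sketch of such peak shaving with
.Xr ipfw 8
and
.Xr dummynet 4
follows.
The pipe number, rule number and bandwidth figure are illustrative
assumptions, and the example presumes that ipfw and dummynet are already
enabled on the machine and that it is run as root.

```shell
# Create a pipe that caps traffic at 512 Kbit/s (illustrative figure).
ipfw pipe 1 config bw 512Kbit/s

# Send outgoing web-server replies (TCP source port 80) through the pipe
# so that they cannot crowd out other services on the WAN link.
ipfw add 1000 pipe 1 tcp from any 80 to any out
```

Place the pipe rule ahead of any rule that would otherwise accept the same
packets, since rule order determines which rule a packet matches first.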
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr systat 1 ,
.Xr ata 4 ,
.Xr dummynet 4 ,
.Xr login.conf 5 ,
.Xr rc.conf 5 ,
.Xr sysctl.conf 5 ,
.Xr firewall 7 ,
.Xr hier 7 ,
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr fsck 8 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr mount 8 ,
.Xr newfs 8 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr tunefs 8 ,
.Xr vinum 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.