share/doc/papers/sysperf/4.t

   1 .\" Copyright (c) 1985 The Regents of the University of California.
   2 .\" All rights reserved.
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .\" 3. All advertising materials mentioning features or use of this software
  13 .\"    must display the following acknowledgement:
  14 .\"     This product includes software developed by the University of
  15 .\"     California, Berkeley and its contributors.
  16 .\" 4. Neither the name of the University nor the names of its contributors
  17 .\"    may be used to endorse or promote products derived from this software
  18 .\"    without specific prior written permission.
  19 .\"
  20 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  21 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  22 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  23 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  24 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  25 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  26 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  27 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  28 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  29 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  30 .\" SUCH DAMAGE.
  31 .\"
  32 .\"     @(#)4.t 5.1 (Berkeley) 4/17/91
  33 .\"
  34 .ds RH Performance Improvements
  35 .NH
  36 Performance Improvements
  37 .PP
  38 This section outlines the changes made to the system
  39 since the 4.2BSD distribution.
  40 The changes reported here were made in response
  41 to the problems described in Section 3.
  42 The improvements fall into two major classes;
  43 changes to the kernel that are described in this section,
  44 and changes to the system libraries and utilities that are
  45 described in the following section.
  46 .NH 2
  47 Performance Improvements in the Kernel
  48 .PP
  49 Our goal has been to optimize system performance
  50 for our general timesharing environment.
  51 Since most sites running 4.2BSD have been forced to take
  52 advantage of declining
  53 memory costs rather than replace their existing machines with
  54 ones that are more powerful, we have
  55 chosen to optimize running time at the expense of memory.
  56 This tradeoff may need to be reconsidered for personal workstations
  57 that have smaller memories and higher latency disks.
  58 Decreases in the running time of the system may be unnoticeable
  59 because of higher paging rates incurred by a larger kernel.
  60 Where possible, we have allowed the size of caches to be controlled
  61 so that systems with limited memory may reduce them as appropriate.
  62 .NH 3
  63 Name Cacheing
  64 .PP
  65 Our initial profiling studies showed that more than one quarter
  66 of the time in the system was spent in the
  67 pathname translation routine, \fInamei\fP,
  68 translating path names to inodes\u\s-21\s0\d\**.
  69 .FS
  70 \** \u\s-21\s0\d Inode is an abbreviation for ``Index node''.
  71 Each file on the system is described by an inode;
  72 the inode maintains access permissions, and an array of pointers to
  73 the disk blocks that hold the data associated with the file.
  74 .FE
  75 An inspection of \fInamei\fP shows that
  76 it consists of two nested loops.
  77 The outer loop is traversed once per pathname component.
  78 The inner loop performs a linear search through a directory looking
  79 for a particular pathname component.
  80 .PP
  81 Our first idea was to reduce the number of iterations
  82 around the inner loop of \fInamei\fP by observing that many programs
  83 step through a directory performing an operation on each entry in turn.
  84 To improve performance for processes doing directory scans,
  85 the system keeps track of the directory offset of the last component of the
  86 most recently translated path name for each process.
  87 If the next name the process requests is in the same directory,
  88 the search is started from the offset that the previous name was found
  89 (instead of from the beginning of the directory).
  90 Changing directories invalidates the cache, as
  91 does modifying the directory.
  92 For programs that step sequentially through a directory with
  93 .EQ
  94 delim $$
  95 .EN
  96 $N$ files, search time decreases from $O ( N sup 2 )$ to $O(N)$.
  97 .EQ
  98 delim off
  99 .EN
 100 .PP
 101 The cost of the cache is about 20 lines of code
 102 (about 0.2 kilobytes)
 103 and 16 bytes per process, with the cached data
 104 stored in a process's \fIuser\fP vector.
 105 .PP
 106 As a quick benchmark to verify the maximum effectiveness of the
 107 cache we ran ``ls \-l''
 108 on a directory containing 600 files.
 109 Before the per-process cache this command
 110 used 22.3 seconds of system time.
 111 After adding the cache the program used the same amount
 112 of user time, but the system time dropped to 3.3 seconds.
 113 .PP
 114 This change prompted our rerunning a profiled system
 115 on a machine containing the new \fInamei\fP.
 116 The results showed that the time in \fInamei\fP
 117 dropped by only 2.6 ms/call and
 118 still accounted for 36% of the system call time,
 119 18% of the kernel, or about 10% of all the machine cycles.
 120 This amounted to a drop in system time from 57% to about 55%.
 121 The results are shown in Table 9.
 122 .KF
 123 .DS L
 124 .TS
 125 center box;
 126 l r r.
 127 part    time    % of kernel
 128 _
 129 self    11.0 ms/call    9.2%
 130 child   10.6 ms/call    8.9%
 131 _
 132 total   21.6 ms/call    18.1%
 133 .TE
 134 .ce
 135 Table 9. Call times for \fInamei\fP with per-process cache.
 136 .DE
 137 .KE
 138 .PP
 139 The small performance improvement
 140 was caused by a low cache hit ratio.
 141 Although the cache was 90% effective when hit,
 142 it was only usable on about 25% of the names being translated.
 143 An additional reason for the small improvement was that
 144 although the amount of time spent in \fInamei\fP itself
 145 decreased substantially,
 146 more time was spent in the routines that it called
 147 since each directory had to be accessed twice;
 148 once to search from the middle to the end,
 149 and once to search from the beginning to the middle.
 150 .PP
 151 Frequent requests for a small set of names are best handled
 152 with a cache of recent name translations\**.
 153 .FS
 154 \** The cache is keyed on a name and the
 155 inode and device number of the directory that contains it.
 156 Associated with each entry is a pointer to the corresponding
 157 entry in the inode table.
 158 .FE
 159 This has the effect of eliminating the inner loop of \fInamei\fP.
 160 For each path name component,
 161 \fInamei\fP first looks in its cache of recent translations
 162 for the needed name.
 163 If it exists, the directory search can be completely eliminated.
 164 .PP
 165 The system already maintained a cache of recently accessed inodes,
 166 so the initial name cache
 167 maintained a simple name-inode association that was used to
 168 check each component of a path name during name translations.
 169 We considered implementing the cache by tagging each inode
 170 with its most recently translated name,
 171 but eventually decided to have a separate data structure that
 172 kept names with pointers to the inode table.
 173 Tagging inodes has two drawbacks;
 174 many inodes such as those associated with login ports remain in
 175 the inode table for a long period of time, but are never looked
 176 up by name.
 177 Other inodes, such as those describing directories are looked up
 178 frequently by many different names (\fIe.g.\fP ``..'').
 179 By keeping a separate table of names, the cache can
 180 truly reflect the most recently used names.
 181 An added benefit is that the table can be sized independently
 182 of the inode table, so that machines with small amounts of memory
 183 can reduce the size of the cache (or even eliminate it)
 184 without modifying the inode table structure.
 185 .PP
 186 Another issue to be considered is how the name cache should
 187 hold references to the inode table.
 188 Normally processes hold ``hard references'' by incrementing the
 189 reference count in the inode they reference.
 190 Since the system reuses only inodes with zero reference counts,
 191 a hard reference insures that the inode pointer will remain valid.
 192 However, if the name cache holds hard references,
 193 it is limited to some fraction of the size of the inode table,
 194 since some inodes must be left free for new files.
 195 It also makes it impossible for other parts of the kernel
 196 to verify sole use of a device or file.
 197 These reasons made it impractical to use hard references
 198 without affecting the behavior of the inode caching scheme.
 199 Thus, we chose instead to keep ``soft references'' protected
 200 by a \fIcapability\fP \- a 32-bit number
 201 guaranteed to be unique\u\s-22\s0\d \**.
 202 .FS
 203 \** \u\s-22\s0\d When all the numbers have been exhausted, all outstanding
 204 capabilities are purged and numbering starts over from scratch.
 205 Purging is possible as all capabilities are easily found in kernel memory.
 206 .FE
 207 When an entry is made in the name cache,
 208 the capability of its inode is copied to the name cache entry.
 209 When an inode is reused it is issued a new capability.
 210 When a name cache hit occurs,
 211 the capability of the name cache entry is compared
 212 with the capability of the inode that it references.
 213 If the capabilities do not match, the name cache entry is invalid.
 214 Since the name cache holds only soft references,
 215 it may be sized independent of the size of the inode table.
 216 A final benefit of using capabilities is that all
 217 cached names for an inode may be invalidated without
 218 searching through the entire cache;
 219 instead all you need to do is assign a new capability to the inode.
 220 .PP
 221 The cost of the name cache is about 200 lines of code
 222 (about 1.2 kilobytes)
 223 and 48 bytes per cache entry.
 224 Depending on the size of the system,
 225 about 200 to 1000 entries will normally be configured,
 226 using 10-50 kilobytes of physical memory.
 227 The name cache is resident in memory at all times.
 228 .PP
 229 After adding the system wide name cache we reran ``ls \-l''
 230 on the same directory.
 231 The user time remained the same,
 232 however the system time rose slightly to 3.7 seconds.
 233 This was not surprising as \fInamei\fP
 234 now had to maintain the cache,
 235 but was never able to make any use of it.
 236 .PP
 237 Another profiled system was created and measurements
 238 were collected over a 17 hour period.  These measurements
 239 showed a 13 ms/call decrease in \fInamei\fP, with
 240 \fInamei\fP accounting for only 26% of the system call time,
 241 13% of the time in the kernel,
 242 or about 7% of all the machine cycles.
 243 System time dropped from 55% to about 49%.
 244 The results are shown in Table 10.
 245 .KF
 246 .DS L
 247 .TS
 248 center box;
 249 l r r.
 250 part    time    % of kernel
 251 _
 252 self    4.2 ms/call     6.2%
 253 child   4.4 ms/call     6.6%
 254 _
 255 total   8.6 ms/call     12.8%
 256 .TE
 257 .ce
 258 Table 10.  Call times for \fInamei\fP with both caches.
 259 .DE
 260 .KE
 261 .PP
 262 On our general time sharing systems we find that during the twelve
 263 hour period from 8AM to 8PM the system does 500,000 to 1,000,000
 264 name translations.
 265 Statistics on the performance of both caches show that
 266 the large performance improvement is
 267 caused by the high hit ratio.
 268 The name cache has a hit rate of 70%-80%;
 269 the directory offset cache gets a hit rate of 5%-15%.
 270 The combined hit rate of the two caches almost always adds up to 85%.
 271 With the addition of the two caches,
 272 the percentage of system time devoted to name translation has
 273 dropped from 25% to less than 13%.
 274 While the system wide cache reduces both the amount of time in
 275 the routines that \fInamei\fP calls as well as \fInamei\fP itself
 276 (since fewer directories need to be accessed or searched),
 277 it is interesting to note that the actual percentage of system
 278 time spent in \fInamei\fP itself increases even though the
 279 actual time per call decreases.
 280 This is because less total time is being spent in the kernel,
 281 hence a smaller absolute time becomes a larger total percentage.
 282 .NH 3
 283 Intelligent Auto Siloing
 284 .PP
 285 Most terminal input hardware can run in two modes:
 286 it can either generate an interrupt each time a character is received,
 287 or collect characters in a silo that the system then periodically drains.
 288 To provide quick response for interactive input and flow control,
 289 a silo must be checked 30 to 50 times per second.
 290 Ascii terminals normally exhibit
 291 an input rate of less than 30 characters per second.
 292 At this input rate
 293 they are most efficiently handled with interrupt per character mode,
 294 since this generates fewer interrupts than draining the input silos
 295 of the terminal multiplexors at each clock interrupt.
 296 When input is being generated by another machine
 297 or a malfunctioning terminal connection, however,
 298 the input rate is usually more than 50 characters per second.
 299 It is more efficient to use a device's silo input mode,
 300 since this generates fewer interrupts than handling each character
 301 as a separate interrupt.
 302 Since a given dialup port may switch between uucp logins and user logins,
 303 it is impossible to statically select the most efficient input mode to use.
 304 .PP
 305 We therefore changed the terminal multiplexor handlers
 306 to dynamically choose between the use of the silo and the use of
 307 per-character interrupts.
 308 At low input rates the handler processes characters on an
 309 interrupt basis, avoiding the overhead
 310 of checking each interface on each clock interrupt.
 311 During periods of sustained input, the handler enables the silo
 312 and starts a timer to drain input.
 313 This timer runs less frequently than the clock interrupts,
 314 and is used only when there is a substantial amount of input.
 315 The transition from using silos to an interrupt per character is
 316 damped to minimize the number of transitions with bursty traffic
 317 (such as in network communication).
 318 Input characters serve to flush the silo, preventing long latency.
 319 By switching between these two modes of operation dynamically,
 320 the overhead of checking the silos is incurred only
 321 when necessary.
 322 .PP
 323 In addition to the savings in the terminal handlers,
 324 the clock interrupt routine is no longer required to schedule
 325 a software interrupt after each hardware interrupt to drain the silos.
 326 The software-interrupt level portion of the clock routine is only
 327 needed when timers expire or the current user process is collecting
 328 an execution profile.
 329 Thus, the number of interrupts attributable to clock processing
 330 is substantially reduced.
 331 .NH 3
 332 Process Table Management
 333 .PP
 334 As systems have grown larger, the size of the process table
 335 has grown far past 200 entries.
 336 With large tables, linear searches must be eliminated
 337 from any frequently used facility.
 338 The kernel process table is now multi-threaded to allow selective searching
 339 of active and zombie processes.
 340 A third list threads unused process table slots.
 341 Free slots can be obtained in constant time by taking one
 342 from the front of the free list.
 343 The number of processes used by a given user may be computed by scanning
 344 only the active list.
 345 Since the 4.2BSD release,
 346 the kernel maintained linked lists of the descendents of each process.
 347 This linkage is now exploited when dealing with process exit;
 348 parents seeking the exit status of children now avoid linear search
 349 of the process table, but examine only their direct descendents.
 350 In addition, the previous algorithm for finding all descendents of an exiting
 351 process used multiple linear scans of the process table.
 352 This has been changed to follow the links between child process and siblings.
 353 .PP
 354 When forking a new process,
 355 the system must assign it a unique process identifier.
 356 The system previously scanned the entire process table each time it created
 357 a new process to locate an identifier that was not already in use.
 358 Now, to avoid scanning the process table for each new process,
 359 the system computes a range of unused identifiers
 360 that can be directly assigned.
 361 Only when the set of identifiers is exhausted is another process table
 362 scan required.
 363 .NH 3
 364 Scheduling
 365 .PP
 366 Previously the scheduler scanned the entire process table
 367 once per second to recompute process priorities.
 368 Processes that had run for their entire time slice had their
 369 priority lowered.
 370 Processes that had not used their time slice, or that had
 371 been sleeping for the past second had their priority raised.
 372 On systems running many processes,
 373 the scheduler represented nearly 20% of the system time.
 374 To reduce this overhead,
 375 the scheduler has been changed to consider only
 376 runnable processes when recomputing priorities.
 377 To insure that processes sleeping for more than a second
 378 still get their appropriate priority boost,
 379 their priority is recomputed when they are placed back on the run queue.
 380 Since the set of runnable process is typically only a small fraction
 381 of the total number of processes on the system,
 382 the cost of invoking the scheduler drops proportionally.
 383 .NH 3
 384 Clock Handling
 385 .PP
 386 The hardware clock interrupts the processor 100 times per second
 387 at high priority.
 388 As most of the clock-based events need not be done at high priority,
 389 the system schedules a lower priority software interrupt to do the less
 390 time-critical events such as cpu scheduling and timeout processing.
 391 Often there are no such events, and the software interrupt handler
 392 finds nothing to do and returns.
 393 The high priority event now checks to see if there are low priority
 394 events to process;
 395 if there is nothing to do, the software interrupt is not requested.
 396 Often, the high priority interrupt occurs during a period when the
 397 machine had been running at low priority.
 398 Rather than posting a software interrupt that would occur as
 399 soon as it returns,
 400 the hardware clock interrupt handler simply lowers the processor priority
 401 and calls the software clock routines directly.
 402 Between these two optimizations, nearly 80 of the 100 software
 403 interrupts per second can be eliminated.
 404 .NH 3
 405 File System
 406 .PP
 407 The file system uses a large block size, typically 4096 or 8192 bytes.
 408 To allow small files to be stored efficiently, the large blocks can
 409 be broken into smaller fragments, typically multiples of 1024 bytes.
 410 To minimize the number of full-sized blocks that must be broken
 411 into fragments, the file system uses a best fit strategy.
 412 Programs that slowly grow files using write of 1024 bytes or less
 413 can force the file system to copy the data to
 414 successively larger and larger fragments until it finally
 415 grows to a full sized block.
 416 The file system still uses a best fit strategy the first time
 417 a fragment is written.
 418 However, the first time that the file system is forced to copy a growing
 419 fragment it places it at the beginning of a full sized block.
 420 Continued growth can be accommodated without further copying
 421 by using up the rest of the block.
 422 If the file ceases to grow, the rest of the block is still
 423 available for holding other fragments.
 424 .PP
 425 When creating a new file name,
 426 the entire directory in which it will reside must be scanned
 427 to insure that the name does not already exist.
 428 For large directories, this scan is time consuming.
 429 Because there was no provision for shortening directories,
 430 a directory that is once over-filled will increase the cost
 431 of file creation even after the over-filling is corrected.
 432 Thus, for example, a congested uucp connection can leave a legacy long
 433 after it is cleared up.
 434 To alleviate the problem, the system now deletes empty blocks
 435 that it finds at the end of a directory while doing a complete
 436 scan to create a new name.
 437 .NH 3
 438 Network
 439 .PP
 440 The default amount of buffer space allocated for stream sockets (including
 441 pipes) has been increased to 4096 bytes.
 442 Stream sockets and pipes now return their buffer sizes in the block size field
 443 of the stat structure.
 444 This information allows the standard I/O library to use more optimal buffering.
 445 Unix domain stream sockets also return a dummy device and inode number
 446 in the stat structure to increase compatibility
 447 with other pipe implementations.
 448 The TCP maximum segment size is calculated according to the destination
 449 and interface in use; non-local connections use a more conservative size
 450 for long-haul networks.
 451 .PP
 452 On multiply-homed hosts, the local address bound by TCP now always corresponds
 453 to the interface that will be used in transmitting data packets for the
 454 connection.
 455 Several bugs in the calculation of round trip timing have been corrected.
 456 TCP now switches to an alternate gateway when an existing route fails,
 457 or when an ICMP redirect message is received.
 458 ICMP source quench messages are used to throttle the transmission
 459 rate of TCP streams by temporarily creating an artificially small
 460 send window, and retransmissions send only a single packet
 461 rather than resending all queued data.
 462 A send policy has been implemented
 463 that decreases the number of small packets outstanding
 464 for network terminal traffic [Nagle84],
 465 providing additional reduction of network congestion.
 466 The overhead of packet routing has been decreased by changes in the routing
 467 code and by caching the most recently used route for each datagram socket.
 468 .PP
 469 The buffer management strategy implemented by \fIsosend\fP has been
 470 changed to make better use of the increased size of the socket buffers
 471 and a better tuned delayed acknowledgement algorithm.
 472 Routing has been modified to include a one element cache of the last
 473 route computed.
 474 Multiple messages send with the same destination now require less processing.
 475 Performance deteriorates because of load in
 476 either the sender host, receiver host, or ether.
 477 Also, any CPU contention degrades substantially
 478 the throughput achievable by user processes [Cabrera85].
 479 We have observed empty VAX 11/750s using up to 90% of their cycles
 480 transmitting network messages.
 481 .NH 3
 482 Exec
 483 .PP
 484 When \fIexec\fP-ing a new process, the kernel creates the new
 485 program's argument list by copying the arguments and environment
 486 from the parent process's address space into the system, then back out
 487 again onto the stack of the newly created process.
 488 These two copy operations were done one byte at a time, but
 489 are now done a string at a time.
 490 This optimization reduced the time to process
 491 an argument list by a factor of ten;
 492 the average time to do an \fIexec\fP call decreased by 25%.
 493 .NH 3
 494 Context Switching
 495 .PP
 496 The kernel used to post a software event when it wanted to force
 497 a process to be rescheduled.
 498 Often the process would be rescheduled for other reasons before
 499 exiting the kernel, delaying the event trap.
 500 At some later time the process would again
 501 be selected to run and would complete its pending system call,
 502 finally causing the event to take place.
 503 The event would cause the scheduler to be invoked a second time
 504 selecting the same process to run.
 505 The fix to this problem is to cancel any software reschedule
 506 events when saving a process context.
 507 This change doubles the speed with which processes
 508 can synchronize using pipes or signals.
 509 .NH 3
 510 Setjmp/Longjmp
 511 .PP
 512 The kernel routine \fIsetjmp\fP, that saves the current system
 513 context in preparation for a non-local goto used to save many more
 514 registers than necessary under most circumstances.
 515 By trimming its operation to save only the minimum state required,
 516 the overhead for system calls decreased by an average of 13%.
 517 .NH 3
 518 Compensating for Lack of Compiler Technology
 519 .PP
 520 The current compilers available for C do not
 521 do any significant optimization.
 522 Good optimizing compilers are unlikely to be built;
 523 the C language is not well suited to optimization
 524 because of its rampant use of unbound pointers.
 525 Thus, many classical optimizations such as common subexpression
 526 analysis and selection of register variables must be done
 527 by hand using ``exterior'' knowledge of when such optimizations are safe.
 528 .PP
 529 Another optimization usually done by optimizing compilers
 530 is inline expansion of small or frequently used routines.
 531 In past Berkeley systems this has been done by using \fIsed\fP to
 532 run over the assembly language and replace calls to small
 533 routines with the code for the body of the routine, often
 534 a single VAX instruction.
 535 While this optimization eliminated the cost of the subroutine
 536 call and return,
 537 it did not eliminate the pushing and popping of several arguments
 538 to the routine.
 539 The \fIsed\fP script has been replaced by a more intelligent expander,
 540 \fIinline\fP, that merges the pushes and pops into moves to registers.
 541 For example, if the C code
 542 .DS
 543 if (scanc(map[i], 1, 47, i - 63))
 544 .DE
 545 is compiled into assembly language it generates the code shown
 546 in the left hand column of Table 11.
 547 The \fIsed\fP inline expander changes this code to that
 548 shown in the middle column.
 549 The newer optimizer eliminates most of the stack
 550 operations to generate the code shown in the right hand column.
 551 .KF
 552 .TS
 553 center, box;
 554 c s s s s s
 555 c s | c s | c s
 556 l l | l l | l l.
 557 Alternative C Language Code Optimizations
 558 _
 559 cc      sed     inline
 560 _
 561 subl3   $64,_i,\-(sp)   subl3   $64,_i,\-(sp)   subl3   $64,_i,r5
 562 pushl   $47     pushl   $47     movl    $47,r4
 563 pushl   $1      pushl   $1      pushl   $1
 564 mull2   $16,_i,r3       mull2   $16,_i,r3       mull2   $16,_i,r3
 565 pushl   \-56(fp)[r3]    pushl   \-56(fp)[r3]    movl    \-56(fp)[r3],r2
 566 calls   $4,_scanc       movl    (sp)+,r5        movl    (sp)+,r3
 567 tstl    r0      movl    (sp)+,r4        scanc   r2,(r3),(r4),r5
 568 jeql    L7      movl    (sp)+,r3        tstl    r0
 569                 movl    (sp)+,r2        jeql    L7
 570                 scanc   r2,(r3),(r4),r5
 571                 tstl    r0
 572                 jeql    L7
 573 .TE
 574 .ce
 575 Table 11. Alternative inline code expansions.
 576 .KE
 577 .PP
 578 Another optimization involved reevaluating
 579 existing data structures in the context of the current system.
 580 For example, disk buffer hashing was implemented when the system
 581 typically had thirty to fifty buffers.
 582 Most systems today have 200 to 1000 buffers.
 583 Consequently, most of the hash chains contained
 584 ten to a hundred buffers each!
 585 The running time of the low level buffer management primitives was
 586 dramatically improved simply by enlarging the size of the hash table.
 587 .NH 2
 588 Improvements to Libraries and Utilities
 589 .PP
 590 Intuitively, changes to the kernel would seem to have the greatest
 591 payoff since they affect all programs that run on the system.
 592 However, the kernel has been tuned many times before, so the
 593 opportunity for significant improvement was small.
 594 By contrast, many of the libraries and utilities had never been tuned.
 595 For example, we found utilities that spent 90% of their
 596 running time doing single character read system calls.
 597 Changing the utility to use the standard I/O library cut the
 598 running time by a factor of five!
 599 Thus, while most of our time has been spent tuning the kernel,
 600 more than half of the speedups are because of improvements in
 601 other parts of the system.
 602 Some of the more dramatic changes are described in the following
 603 subsections.
 604 .NH 3
 605 Hashed Databases
 606 .PP
 607 UNIX provides a set of database management routines, \fIdbm\fP,
 608 that can be used to speed lookups in large data files
 609 with an external hashed index file.
 610 The original version of dbm was designed to work with only one
 611 database at a time.  These routines were generalized to handle
 612 multiple database files, enabling them to be used in rewrites
 613 of the password and host file lookup routines.  The new routines
 614 used to access the password file significantly improve the running
 615 time of many important programs such as the mail subsystem,
 616 the C-shell (in doing tilde expansion), \fIls \-l\fP, etc.
 617 .NH 3
 618 Buffered I/O
 619 .PP
 620 The new filesystem with its larger block sizes allows better
 621 performance, but it is possible to degrade system performance
 622 by performing numerous small transfers rather than using
 623 appropriately-sized buffers.
 624 The standard I/O library
 625 automatically determines the optimal buffer size for each file.
 626 Some C library routines and commonly-used programs use low-level
 627 I/O or their own buffering, however.
 628 Several important utilities that did not use the standard I/O library
 629 and were buffering I/O using the old optimal buffer size,
 630 1Kbytes; the programs were changed to buffer I/O according to the
 631 optimal file system blocksize.
 632 These include the editor, the assembler, loader, remote file copy,
 633 the text formatting programs, and the C compiler.
 634 .PP
 635 The standard error output has traditionally been unbuffered
 636 to prevent delay in presenting the output to the user,
 637 and to prevent it from being lost if buffers are not flushed.
 638 The inordinate expense of sending single-byte packets through
 639 the network led us to impose a buffering scheme on the standard
 640 error stream.
 641 Within a single call to \fIfprintf\fP, all output is buffered temporarily.
 642 Before the call returns, all output is flushed and the stream is again
 643 marked unbuffered.
 644 As before, the normal block or line buffering mechanisms can be used
 645 instead of the default behavior.
 646 .PP
 647 It is possible for programs with good intentions to unintentionally
 648 defeat the standard I/O library's choice of I/O buffer size by using
 649 the \fIsetbuf\fP call to assign an output buffer.
 650 Because of portability requirements, the default buffer size provided
 651 by \fIsetbuf\fP is 1024 bytes; this can lead, once again, to added
 652 overhead.
 653 One such program with this problem was \fIcat\fP;
 654 there are undoubtedly other standard system utilities with similar problems
 655 as the system has changed much since they were originally written.
 656 .NH 3
 657 Mail System
 658 .PP
 659 The problems discussed in section 3.1.1 prompted significant work
 660 on the entire mail system.  The first problem identified was a bug
 661 in the \fIsyslog\fP program.  The mail delivery program, \fIsendmail\fP
 662 logs all mail transactions through this process with the 4.2BSD interprocess
 663 communication facilities.  \fISyslog\fP then records the information in
 664 a log file.  Unfortunately, \fIsyslog\fP was performing a \fIsync\fP
 665 operation after each message it received, whether it was logged to a file
 666 or not.  This wreaked havoc on the effectiveness of the
 667 buffer cache and explained, to a large
 668 extent, why sending mail to large distribution lists generated such a
 669 heavy load on the system (one syslog message was generated for each
 670 message recipient causing almost a continuous sequence of sync operations).
 671 .PP
 672 The hashed data base files were
 673 installed in all mail programs, resulting in a order of magnitude
 674 speedup on large distribution lists.  The code in \fI/bin/mail\fP
 675 that notifies the \fIcomsat\fP program when mail has been delivered to
 676 a user was changed to cache host table lookups, resulting in a similar
 677 speedup on large distribution lists.
 678 .PP
 679 Next, the file locking facilities
 680 provided in 4.2BSD, \fIflock\fP\|(2), were used in place of the old
 681 locking mechanism.
 682 The mail system previously used \fIlink\fP and \fIunlink\fP in
 683 implementing file locking primitives.
 684 Because these operations usually modify the contents of directories
 685 they require synchronous disk operations and cannot take
 686 advantage of the name cache maintained by the system.
 687 Unlink requires that the entry be found in the directory so that
 688 it can be removed;
 689 link requires that the directory be scanned to insure that the name
 690 does not already exist.
 691 By contrast the advisory locking facility in 4.2BSD is
 692 efficient because it is all done with in-memory tables.
 693 Thus, the mail system was modified to use the file locking primitives.
 694 This yielded another 10% cut in the basic overhead of delivering mail.
 695 Extensive profiling and tuning of \fIsendmail\fP and
 696 compiling it without debugging code reduced the overhead by another 20%.
 697 .NH 3
 698 Network Servers
 699 .PP
 700 With the introduction of the network facilities in 4.2BSD,
 701 a myriad of services became available, each of which
 702 required its own daemon process.
 703 Many of these daemons were rarely if ever used,
 704 yet they lay asleep in the process table consuming
 705 system resources and generally slowing down response.
 706 Rather than having many servers started at boot time, a single server,
 707 \fIinetd\fP was substituted.
 708 This process reads a simple configuration file
 709 that specifies the services the system is willing to support
 710 and listens for service requests on each service's Internet port.
 711 When a client requests service the appropriate server is created
 712 and passed a service connection as its standard input.  Servers
 713 that require the identity of their client may use the \fIgetpeername\fP
 714 system call; likewise \fIgetsockname\fP may be used to find out
 715 a server's local address without consulting data base files.
 716 This scheme is attractive for several reasons:
 717 .IP \(bu 3
 718 it eliminates
 719 as many as a dozen processes, easing system overhead and
 720 allowing the file and text tables to be made smaller,
 721 .IP \(bu 3
 722 servers need not contain the code required to handle connection
 723 queueing, simplifying the programs, and
 724 .IP \(bu 3
 725 installing and replacing servers becomes simpler.
 726 .PP
 727 With an increased numbers of networks, both local and external to Berkeley,
 728 we found that the overhead of the routing process was becoming
 729 inordinately high.
 730 Several changes were made in the routing daemon to reduce this load.
 731 Routes to external networks are no longer exchanged by routers
 732 on the internal machines, only a route to a default gateway.
 733 This reduces the amount of network traffic and the time required
 734 to process routing messages.
 735 In addition, the routing daemon was profiled
 736 and functions responsible for large amounts
 737 of time were optimized.
 738 The major changes were a faster hashing scheme,
 739 and inline expansions of the ubiquitous byte-swapping functions.
 740 .PP
 741 Under certain circumstances, when output was blocked,
 742 attempts by the remote login process
 743 to send output to the user were rejected by the system,
 744 although a prior \fIselect\fP call had indicated that data could be sent.
 745 This resulted in continuous attempts to write the data until the remote
 746 user restarted output.
 747 This problem was initially avoided in the remote login handler,
 748 and the original problem in the kernel has since been corrected.
 749 .NH 3
 750 The C Run-time Library
 751 .PP
 752 Several people have found poorly tuned code
 753 in frequently used routines in the C library [Lankford84].
 754 In particular the running time of the string routines can be
 755 cut in half by rewriting them using the VAX string instructions.
 756 The memory allocation routines have been tuned to waste less
 757 memory for memory allocations with sizes that are a power of two.
 758 Certain library routines that did file input in one-character reads
 759 have been corrected.
 760 Other library routines including \fIfread\fP and \fIfwrite\fP
 761 have been rewritten for efficiency.
 762 .NH 3
 763 Csh
 764 .PP
 765 The C-shell was converted to run on 4.2BSD by
 766 writing a set of routines to simulate the old jobs library.
 767 While this provided a functioning C-shell,
 768 it was grossly inefficient, generating up
 769 to twenty system calls per prompt.
 770 The C-shell has been modified to use the new signal
 771 facilities directly,
 772 cutting the number of system calls per prompt in half.
 773 Additional tuning was done with the help of profiling
 774 to cut the cost of frequently used facilities.