| 1 | # @(#)TOUR 8.1 (Berkeley) 5/31/93 |
| 2 | # $FreeBSD: head/bin/sh/TOUR 253650 2013-07-25 15:08:41Z jilles $ |
| 3 | |
| 4 | NOTE -- This is the original TOUR paper distributed with ash and |
| 5 | does not represent the current state of the shell. It is included |
| 6 | anyway because it explains how the shell is structured, |
| 7 | but be warned that things have changed -- the current shell is |
| 8 | still under development. |
| 9 | |
| 10 | ================================================================ |
| 11 | |
| 12 | A Tour through Ash |
| 13 | |
| 14 | Copyright 1989 by Kenneth Almquist. |
| 15 | |
| 16 | |
| 17 | DIRECTORIES: The subdirectory bltin contains commands which can |
| 18 | be compiled stand-alone. The rest of the source is in the main |
| 19 | ash directory. |
| 20 | |
| 21 | SOURCE CODE GENERATORS: Files whose names begin with "mk" are |
| 22 | programs that generate source code. A complete list of these |
| 23 | programs is: |
| 24 | |
| 25 |     program      input files   generates |
| 26 |     -------      -----------   --------- |
| 27 |     mkbuiltins   builtins      builtins.h builtins.c |
| 28 |     mknodes      nodetypes     nodes.h nodes.c |
| 29 |     mksyntax     -             syntax.h syntax.c |
| 30 |     mktokens     -             token.h |
| 31 | |
| 32 | There are undoubtedly too many of these. |
| 33 | |
| 34 | EXCEPTIONS: Code for dealing with exceptions appears in |
| 35 | exceptions.c. The C language doesn't include exception handling, |
| 36 | so I implement it using setjmp and longjmp. The global variable |
| 37 | exception contains the type of exception. EXERROR is raised by |
| 38 | calling error. EXINT is an interrupt. |
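
A minimal sketch of exception handling built on setjmp and longjmp, as
described above. The names jmploc, handler, raise_exception and
run_protected, and the numeric values of EXINT and EXERROR, are
assumptions made for illustration; they are not taken from the ash source.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

enum { EXINT = 0, EXERROR = 1 };      /* assumed values */

struct jmploc { jmp_buf loc; };

static struct jmploc *handler;        /* innermost active handler */
static volatile int exception;        /* type of the exception being raised */

/* Record the exception type and jump to the innermost handler. */
static void raise_exception(int e)
{
    assert(handler != NULL);
    exception = e;
    longjmp(handler->loc, 1);
}

/* Run fn with a handler installed.  Returns -1 if fn returned
   normally, otherwise the type of the exception it raised. */
static int run_protected(void (*fn)(void))
{
    struct jmploc here;
    struct jmploc *saved = handler;   /* not modified after setjmp */

    if (setjmp(here.loc) != 0) {
        handler = saved;              /* unwind to the outer handler */
        return exception;
    }
    handler = &here;
    fn();
    handler = saved;
    return -1;
}

/* Two stand-in commands for demonstration. */
static void failing_command(void) { raise_exception(EXERROR); }
static void quiet_command(void) { }
```

Calling run_protected(failing_command) yields EXERROR, while
run_protected(quiet_command) yields -1. Handlers nest because each call
saves and restores the previous handler pointer.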
| 39 | |
| 40 | INTERRUPTS: In an interactive shell, an interrupt will cause an |
| 41 | EXINT exception to return to the main command loop. (Exception: |
| 42 | EXINT is not raised if the user traps interrupts using the trap |
| 43 | command.) The INTOFF and INTON macros (defined in exception.h) |
| 44 | provide uninterruptible critical sections. Between the execution |
| 45 | of INTOFF and the execution of INTON, interrupt signals will be |
| 46 | held for later delivery. INTOFF and INTON can be nested. |
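
The nesting behaviour can be sketched with a depth counter and a pending
flag. This is an illustration under assumptions, not the contents of the
real header: the variable names, the deliver routine and the delivered
counter are hypothetical.

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t suppressint;  /* nesting depth of INTOFF */
static volatile sig_atomic_t intpending;   /* interrupt arrived while held */
static int delivered;                      /* interrupts acted on (demo only) */

/* Stand-in for raising EXINT. */
static void deliver(void)
{
    intpending = 0;
    delivered++;
}

#define INTOFF (suppressint++)
#define INTON                                           \
    do {                                                \
        assert(suppressint > 0);                        \
        if (--suppressint == 0 && intpending)           \
            deliver();                                  \
    } while (0)

/* What the SIGINT handler would do: hold the interrupt inside a
   critical section, act on it immediately otherwise. */
static void on_interrupt(void)
{
    if (suppressint > 0)
        intpending = 1;
    else
        deliver();
}
```

An interrupt that arrives between two nested INTOFF calls is held until
the outermost INTON brings the depth back to zero.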
| 47 | |
| 48 | MEMALLOC.C: Memalloc.c defines versions of malloc and realloc |
| 49 | which call error when there is no memory left. It also defines a |
| 50 | stack oriented memory allocation scheme. Allocating off a stack |
| 51 | is probably more efficient than allocation using malloc, but the |
| 52 | big advantage is that when an exception occurs all we have to do |
| 53 | to free up the memory in use at the time of the exception is to |
| 54 | restore the stack pointer. The stack is implemented using a |
| 55 | linked list of blocks. |
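
A rough sketch of a stack allocator built from a linked list of blocks,
with a mark taken before work starts and a restore that frees everything
allocated since. The names st_alloc, st_mark and st_restore and the block
size are assumptions; abort stands in for the call to error.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SIZE 512                 /* assumed; a multiple of the unit */

union align { long long ll; long double ld; void *p; };

struct block { struct block *prev; char *space; size_t size; size_t used; };
struct mark  { struct block *blk; size_t used; };

static struct block *top;              /* block currently being filled */

/* Allocate n bytes from the stack, adding a block when the top is full. */
static void *st_alloc(size_t n)
{
    size_t unit = sizeof(union align);
    void *p;

    n = (n + unit - 1) / unit * unit;  /* keep allocations aligned */
    if (top == NULL || top->size - top->used < n) {
        struct block *b = malloc(sizeof *b);
        size_t size = n > BLOCK_SIZE ? n : BLOCK_SIZE;
        if (b == NULL || (b->space = malloc(size)) == NULL)
            abort();                   /* the shell would call error here */
        b->prev = top;
        b->size = size;
        b->used = 0;
        top = b;
    }
    p = top->space + top->used;
    top->used += n;
    return p;
}

/* Remember the current stack position. */
static struct mark st_mark(void)
{
    struct mark m;
    m.blk = top;
    m.used = top != NULL ? top->used : 0;
    return m;
}

/* Free everything allocated since the mark was taken. */
static void st_restore(struct mark m)
{
    while (top != m.blk) {
        struct block *b = top;
        top = b->prev;
        free(b->space);
        free(b);
    }
    if (top != NULL) {
        assert(top->used >= m.used);
        top->used = m.used;
    }
}
```

An exception handler only needs the mark taken before the failed
operation; one call to st_restore releases every allocation made after it.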
| 56 | |
| 57 | STPUTC: If the stack were contiguous, it would be easy to store |
| 58 | strings on the stack without knowing in advance how long the |
| 59 | string was going to be: |
| 60 |     p = stackptr; |
| 61 |     *p++ = c; /* repeated as many times as needed */ |
| 62 |     stackptr = p; |
| 63 | The following three macros (defined in memalloc.h) perform these |
| 64 | operations, but grow the stack if you run off the end: |
| 65 |     STARTSTACKSTR(p); |
| 66 |     STPUTC(c, p); /* repeated as many times as needed */ |
| 67 |     grabstackstr(p); |
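
The idea behind these macros can be sketched with a single growable
buffer. This is a simplification, not the memalloc.h definitions: the
real macros work on the block stack, and the names STR_START, STR_PUTC,
str_start and str_grow are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static char *strbuf;                   /* start of the string being built */
static size_t strcap;                  /* bytes available in strbuf */

/* Make sure a buffer exists and return its start. */
static char *str_start(void)
{
    if (strbuf == NULL) {
        strcap = 8;
        strbuf = malloc(strcap);
        if (strbuf == NULL)
            abort();
    }
    return strbuf;
}

/* Called when p has reached the end of the buffer: double the
   buffer and return the equivalent position in the new one. */
static char *str_grow(char *p)
{
    size_t len = (size_t)(p - strbuf);
    char *nb;

    assert(p == strbuf + strcap);
    nb = realloc(strbuf, strcap * 2);
    if (nb == NULL)
        abort();
    strbuf = nb;
    strcap *= 2;
    return strbuf + len;
}

#define STR_START(p)    ((p) = str_start())
#define STR_PUTC(c, p)                                  \
    do {                                                \
        if ((p) == strbuf + strcap)                     \
            (p) = str_grow(p);                          \
        *(p)++ = (char)(c);                             \
    } while (0)
```

The caller keeps p in a local variable, so the common case is one
comparison and one store; the function call happens only when the buffer
is full.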
| 68 | |
| 69 | We now start a top-down look at the code: |
| 70 | |
| 71 | MAIN.C: The main routine performs some initialization, executes |
| 72 | the user's profile if necessary, and calls cmdloop. Cmdloop |
| 73 | repeatedly parses and executes commands. |
| 74 | |
| 75 | OPTIONS.C: This file contains the option processing code. It is |
| 76 | called from main to parse the shell arguments when the shell is |
| 77 | invoked, and it also contains the set builtin. The -i and -m op- |
| 78 | tions (the latter turns on job control) require changes in signal |
| 79 | handling. The routines setjobctl (in jobs.c) and setinteractive |
| 80 | (in trap.c) are called to handle changes to these options. |
| 81 | |
| 82 | PARSING: The parser code is all in parser.c. A recursive des- |
| 83 | cent parser is used. Syntax tables (generated by mksyntax) are |
| 84 | used to classify characters during lexical analysis. There are |
| 85 | four tables: one for normal use, one for use when inside single |
| 86 | quotes and dollar single quotes, one for use when inside double |
| 87 | quotes and one for use in arithmetic. The tables are machine |
| 88 | dependent because they are indexed by character variables and |
| 89 | the range of a char varies from machine to machine. |
| 90 | |
| 91 | PARSE OUTPUT: The output of the parser consists of a tree of |
| 92 | nodes. The various types of nodes are defined in the file |
| 93 | nodetypes. |
| 94 | |
| 95 | Nodes of type NARG are used to represent both words and the con- |
| 96 | tents of here documents. An early version of ash kept the con- |
| 97 | tents of here documents in temporary files, but keeping here do- |
| 98 | cuments in memory typically results in significantly better per- |
| 99 | formance. It would have been nice to make it an option to use |
| 100 | temporary files for here documents, for the benefit of small |
| 101 | machines, but the code to keep track of when to delete the tem- |
| 102 | porary files was complex and I never fixed all the bugs in it. |
| 103 | (AT&T has been maintaining the Bourne shell for more than ten |
| 104 | years, and to the best of my knowledge they still haven't gotten |
| 105 | it to handle temporary files correctly in obscure cases.) |
| 106 | |
| 107 | The text field of a NARG structure points to the text of the |
| 108 | word. The text consists of ordinary characters and a number of |
| 109 | special codes defined in parser.h. The special codes are: |
| 110 | |
| 111 |     CTLVAR              Variable substitution |
| 112 |     CTLENDVAR           End of variable substitution |
| 113 |     CTLBACKQ            Command substitution |
| 114 |     CTLBACKQ|CTLQUOTE   Command substitution inside double quotes |
| 115 |     CTLESC              Escape next character |
| 116 | |
| 117 | A variable substitution contains the following elements: |
| 118 | |
| 119 | CTLVAR type name '=' [ alternative-text CTLENDVAR ] |
| 120 | |
| 121 | The type field is a single character specifying the type of sub- |
| 122 | stitution. The possible types are: |
| 123 | |
| 124 |     VSNORMAL            $var |
| 125 |     VSMINUS             ${var-text} |
| 126 |     VSMINUS|VSNUL       ${var:-text} |
| 127 |     VSPLUS              ${var+text} |
| 128 |     VSPLUS|VSNUL        ${var:+text} |
| 129 |     VSQUESTION          ${var?text} |
| 130 |     VSQUESTION|VSNUL    ${var:?text} |
| 131 |     VSASSIGN            ${var=text} |
| 132 |     VSASSIGN|VSNUL      ${var:=text} |
| 133 | |
| 134 | In addition, the type field will have the VSQUOTE flag set if the |
| 135 | variable is enclosed in double quotes. The name of the variable |
| 136 | comes next, terminated by an equals sign. If the type is not |
| 137 | VSNORMAL, then the text field in the substitution follows, ter- |
| 138 | minated by a CTLENDVAR byte. |
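
To make the layout concrete, here is a sketch of a decoder for one
encoded substitution. The numeric values of CTLVAR, CTLENDVAR and the VS
constants, the VSTYPE mask, and the varsub structure are all assumptions
chosen for the example; the real values live in parser.h. Nested
substitutions inside the alternative text are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define CTLVAR     0x82
#define CTLENDVAR  0x83
#define VSTYPE     0x0f                /* mask selecting the basic type */
#define VSNUL      0x10                /* the ":" forms */
#define VSQUOTE    0x80                /* inside double quotes */
#define VSNORMAL   1
#define VSMINUS    2

struct varsub {
    int type;                          /* VSNORMAL, VSMINUS, ... */
    int nul;                           /* nonzero for the ":" forms */
    int quoted;                        /* nonzero if VSQUOTE was set */
    char name[32];
    char text[64];                     /* alternative text, if any */
};

/* Decode "CTLVAR type name '=' [ text CTLENDVAR ]" starting at p.
   Returns a pointer just past the substitution, or NULL if the
   input is malformed or does not fit the fixed-size fields. */
static const unsigned char *
decode_varsub(const unsigned char *p, struct varsub *out)
{
    size_t n;

    assert(p != NULL && out != NULL);
    if (*p++ != CTLVAR || *p == '\0')
        return NULL;
    out->type = *p & VSTYPE;
    out->nul = (*p & VSNUL) != 0;
    out->quoted = (*p & VSQUOTE) != 0;
    p++;
    for (n = 0; *p != '='; p++) {
        if (*p == '\0' || n + 1 >= sizeof out->name)
            return NULL;
        out->name[n++] = (char)*p;
    }
    out->name[n] = '\0';
    p++;                               /* skip the '=' */
    out->text[0] = '\0';
    if (out->type == VSNORMAL)
        return p;                      /* no text field follows */
    for (n = 0; *p != CTLENDVAR; p++) {
        if (*p == '\0' || n + 1 >= sizeof out->text)
            return NULL;
        out->text[n++] = (char)*p;
    }
    out->text[n] = '\0';
    return p + 1;                      /* skip CTLENDVAR */
}
```

Under these assumed values, ${HOME:-/tmp} would be stored as the bytes
CTLVAR, VSMINUS|VSNUL, "HOME=", "/tmp", CTLENDVAR.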
| 139 | |
| 140 | Commands in back quotes are parsed and stored in a linked list. |
| 141 | The locations of these commands in the string are indicated by |
| 142 | CTLBACKQ and CTLBACKQ|CTLQUOTE characters, depending upon whether |
| 143 | the back quotes were enclosed in double quotes. |
| 144 | |
| 145 | The character CTLESC escapes the next character, so that in case |
| 146 | any of the CTL characters mentioned above appear in the input, |
| 147 | they can be passed through transparently. CTLESC is also used to |
| 148 | escape '*', '?', '[', and '!' characters which were quoted by the |
| 149 | user and thus should not be used for file name generation. |
| 150 | |
| 151 | CTLESC characters have proved to be particularly tricky to get |
| 152 | right. In the case of here documents which are not subject to |
| 153 | variable and command substitution, the parser doesn't insert any |
| 154 | CTLESC characters to begin with (so the contents of the text |
| 155 | field can be written without any processing). Other here docu- |
| 156 | ments, and words which are not subject to splitting and file name |
| 157 | generation, have the CTLESC characters removed during the vari- |
| 158 | able and command substitution phase. Words which are subject to |
| 159 | splitting and file name generation have the CTLESC characters re- |
| 160 | moved as part of the file name phase. |
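
Removing the escapes is a single in-place pass over the string. The
sketch below assumes a value for CTLESC and uses the hypothetical name
strip_escapes; it is not the routine from the ash source.

```c
#include <assert.h>
#include <stddef.h>

#define CTLESC 0x81                    /* assumed value */

/* Remove each CTLESC byte in place, keeping the byte it protects.
   An escaped CTLESC therefore survives as one literal byte. */
static void strip_escapes(unsigned char *s)
{
    unsigned char *d = s;

    assert(s != NULL);
    while (*s != '\0') {
        if (*s == CTLESC && s[1] != '\0')
            s++;                       /* drop the escape itself */
        *d++ = *s++;
    }
    *d = '\0';
}
```

The output is never longer than the input, so no allocation is needed.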
| 161 | |
| 162 | EXECUTION: Command execution is handled by the following files: |
| 163 |     eval.c     The top level routines. |
| 164 |     redir.c    Code to handle redirection of input and output. |
| 165 |     jobs.c     Code to handle forking, waiting, and job control. |
| 166 |     exec.c     Code to do path searches and the actual exec sys call. |
| 167 |     expand.c   Code to evaluate arguments. |
| 168 |     var.c      Maintains the variable symbol table. Called from expand.c. |
| 169 | |
| 170 | EVAL.C: Evaltree recursively executes a parse tree. The exit |
| 171 | status is returned in the global variable exitstatus. The alter- |
| 172 | native entry evalbackcmd is called to evaluate commands in back |
| 173 | quotes. It saves the result in memory if the command is a buil- |
| 174 | tin; otherwise it forks off a child to execute the command and |
| 175 | connects the standard output of the child to a pipe. |
| 176 | |
| 177 | JOBS.C: To create a process, you call makejob to return a job |
| 178 | structure, and then call forkshell (passing the job structure as |
| 179 | an argument) to create the process. Waitforjob waits for a job |
| 180 | to complete. These routines take care of process groups if job |
| 181 | control is defined. |
| 182 | |
| 183 | REDIR.C: Ash allows file descriptors to be redirected and then |
| 184 | restored without forking off a child process. This is |
| 185 | accomplished by duplicating the original file descriptors. The |
| 186 | redirtab structure records where the file descriptors have been |
| 187 | duplicated to. |
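
The save-and-restore technique can be sketched with the POSIX calls
fcntl, dup2 and close. The saved_fd structure and the two function names
are hypothetical stand-ins for what redirtab records, and the choice of
10 as the lowest descriptor for the copy is an assumption. The sketch
does not handle a descriptor that was closed to begin with.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

struct saved_fd {
    int fd;                            /* descriptor that was redirected */
    int copy;                          /* where the original now lives */
};

/* Make fd refer to whatever newfd refers to, first keeping a copy
   of the original on a descriptor numbered 10 or above. */
static int redirect_fd(int fd, int newfd, struct saved_fd *save)
{
    assert(save != NULL);
    save->fd = fd;
    save->copy = fcntl(fd, F_DUPFD, 10);
    if (save->copy < 0)
        return -1;
    if (dup2(newfd, fd) < 0) {
        close(save->copy);
        return -1;
    }
    return 0;
}

/* Undo redirect_fd: move the saved copy back and release it. */
static void restore_fd(const struct saved_fd *save)
{
    dup2(save->copy, save->fd);
    close(save->copy);
}
```

Because the original descriptor is only moved aside, a builtin can run
with redirected output in the shell's own process and the shell can
carry on afterwards with its descriptors intact.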
| 188 | |
| 189 | EXEC.C: The routine find_command locates a command, and enters |
| 190 | the command in the hash table if it is not already there. The |
| 191 | third argument specifies whether it is to print an error message |
| 192 | if the command is not found. (When a pipeline is set up, |
| 193 | find_command is called for all the commands in the pipeline |
| 194 | before any forking is done, so as to get the commands into the |
| 195 | hash table of the parent process. But to make command hashing as |
| 196 | transparent as possible, we silently ignore errors at that point |
| 197 | and only print error messages if the command cannot be found |
| 198 | later.) |
| 199 | |
| 200 | The routine shellexec is the interface to the exec system call. |
| 201 | |
| 202 | EXPAND.C: Arguments are processed in three passes. The first |
| 203 | (performed by the routine argstr) performs variable and command |
| 204 | substitution. The second (ifsbreakup) performs word splitting |
| 205 | and the third (expandmeta) performs file name generation. |
| 206 | |
| 207 | VAR.C: Variables are stored in a hash table. Probably we should |
| 208 | switch to extensible hashing. The variable name is stored in the |
| 209 | same string as the value (using the format "name=value") so that |
| 210 | no string copying is needed to create the environment of a com- |
| 211 | mand. Variables which the shell references internally are preal- |
| 212 | located so that the shell can reference the values of these vari- |
| 213 | ables without doing a lookup. |
| 214 | |
| 215 | When a program is run, the code in eval.c sticks any environment |
| 216 | variables which precede the command (as in "PATH=xxx command") in |
| 217 | the variable table as the simplest way to strip duplicates, and |
| 218 | then calls "environment" to get the value of the environment. |
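
A small sketch of the "name=value" storage scheme. Storing a variable a
second time replaces the earlier entry, which is how duplicates get
stripped, and building the environment copies only pointers. The table
size, the hash function and the names setvar_eq and lookupvar are
assumptions; only "environment" is a name taken from the text above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 31                    /* assumed table size */

struct var { struct var *next; char *text; };   /* text is "name=value" */

static struct var *table[NBUCKETS];
static int nvars;

static size_t namelen(const char *s) { return strcspn(s, "="); }

static unsigned hash(const char *s, size_t n)
{
    unsigned h = 0;
    while (n-- > 0)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Store a "name=value" string, replacing any existing entry. */
static void setvar_eq(const char *text)
{
    size_t n = namelen(text);
    struct var **vpp = &table[hash(text, n)], *v;
    char *copy;

    assert(text[n] == '=');
    copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        abort();
    strcpy(copy, text);
    for (v = *vpp; v != NULL; v = v->next)
        if (namelen(v->text) == n && memcmp(v->text, text, n) == 0) {
            free(v->text);
            v->text = copy;
            return;
        }
    v = malloc(sizeof *v);
    if (v == NULL)
        abort();
    v->text = copy;
    v->next = *vpp;
    *vpp = v;
    nvars++;
}

/* Return the value of name, or NULL if it is not set. */
static const char *lookupvar(const char *name)
{
    size_t n = strlen(name);
    struct var *v;

    for (v = table[hash(name, n)]; v != NULL; v = v->next)
        if (namelen(v->text) == n && memcmp(v->text, name, n) == 0)
            return v->text + n + 1;
    return NULL;
}

/* Build a NULL-terminated envp array; the strings are shared. */
static char **environment(void)
{
    char **env = malloc((size_t)(nvars + 1) * sizeof *env), **ep = env;
    struct var *v;
    int i;

    if (env == NULL)
        abort();
    for (i = 0; i < NBUCKETS; i++)
        for (v = table[i]; v != NULL; v = v->next)
            *ep++ = v->text;
    *ep = NULL;
    return env;
}
```

The array returned by environment can be handed directly to an exec
call; the caller frees only the array, never the strings.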
| 219 | |
| 220 | BUILTIN COMMANDS: The procedures for handling these are |
| 221 | scattered throughout the code, depending on which location |
| 222 | appears most appropriate. They can be recognized because their |
| 223 | names always end in "cmd". The mapping from names to procedures |
| 224 | is specified in the file builtins, which is processed by the |
| 225 | mkbuiltins command. |
| 226 | |
| 227 | A builtin command is invoked with argc and argv set up like a |
| 228 | normal program. A builtin command is allowed to overwrite its |
| 229 | arguments. Builtin routines can call nextopt to do option pars- |
| 230 | ing. This is kind of like getopt, but you don't pass argc and |
| 231 | argv to it. Builtin routines can also call error. This routine |
| 232 | normally terminates the shell (or returns to the main command |
| 233 | loop if the shell is interactive), but when called from a builtin |
| 234 | command it causes the builtin command to terminate with an exit |
| 235 | status of 2. |
| 236 | |
| 237 | The directory bltin contains commands which can be compiled |
| 238 | independently but can also be built into the shell for efficiency |
| 239 | reasons. The makefile in this directory compiles these programs |
| 240 | in the normal fashion (so that they can be run regardless of |
| 241 | whether the invoker is ash), but also creates a library named |
| 242 | bltinlib.a which can be linked with ash. The header file bltin.h |
| 243 | takes care of most of the differences between the ash and the |
| 244 | stand-alone environment. The user should call the main routine |
| 245 | "main", and #define main to be the name of the routine to use |
| 246 | when the program is linked into ash. This #define should appear |
| 247 | before bltin.h is included; bltin.h will #undef main if the pro- |
| 248 | gram is to be compiled stand-alone. |
| 249 | |
| 250 | CD.C: This file defines the cd and pwd builtins. |
| 251 | |
| 252 | SIGNALS: Trap.c implements the trap command. The routine |
| 253 | setsignal figures out what action should be taken when a signal |
| 254 | is received and invokes the signal system call to set the signal |
| 255 | action appropriately. When a signal that a user has set a trap for |
| 256 | is caught, the routine "onsig" sets a flag. The routine dotrap |
| 257 | is called at appropriate points to actually handle the signal. |
| 258 | When an interrupt is caught and no trap has been set for that |
| 259 | signal, the routine "onint" in error.c is called. |
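
The split between catching a signal and acting on it can be sketched as
follows. The handler only sets flags; the trap action runs later, from a
point where it is safe. MAXSIG, the gotsig array, the callback argument
to dotrap and the record helper are assumptions for this example, not
the trap.c definitions.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#define MAXSIG 64                      /* assumed upper bound */

static volatile sig_atomic_t gotsig[MAXSIG + 1];
static volatile sig_atomic_t pendingsigs;

/* Signal handler: note that the signal arrived and do nothing else. */
static void onsig(int signo)
{
    if (signo > 0 && signo <= MAXSIG) {
        gotsig[signo] = 1;
        pendingsigs = 1;
    }
}

/* Called at safe points: run the action for every signal that has
   arrived since the last call. */
static void dotrap(void (*action)(int))
{
    int i;

    assert(action != NULL);
    while (pendingsigs) {
        pendingsigs = 0;
        for (i = 1; i <= MAXSIG; i++)
            if (gotsig[i]) {
                gotsig[i] = 0;
                action(i);
            }
    }
}

/* Example action: remember the last signal handled. */
static int last_sig;
static void record(int signo) { last_sig = signo; }
```

Deferring the work this way keeps the handler restricted to stores into
sig_atomic_t variables, and lets the trap action itself run arbitrary
shell code.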
| 260 | |
| 261 | OUTPUT: Ash uses its own output routines. There are three |
| 262 | output structures allocated. "Output" represents the standard |
| 263 | output, "errout" the standard error, and "memout" contains output |
| 264 | which is to be stored in memory. This last is used when a |
| 265 | builtin command appears in backquotes, to allow its output to be |
| 266 | collected without doing any I/O through the UNIX operating system. |
| 267 | The variables out1 and out2 normally point to output and errout, |
| 268 | respectively, but they are set to point to memout when |
| 269 | appropriate inside backquotes. |
| 270 | |
| 271 | INPUT: The basic input routine is pgetc, which reads from the |
| 272 | current input file. There is a stack of input files; the current |
| 273 | input file is the top file on this stack. The code allows the |
| 274 | input to come from a string rather than a file. (This is for the |
| 275 | -c option and the "." and eval builtin commands.) The global |
| 276 | variable plinno is saved and restored when files are pushed and |
| 277 | popped from the stack. The parser routines store the number of |
| 278 | the current line in this variable. |
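
A sketch of the input stack for the string case, showing plinno being
saved on push and restored on pop. The input structure and the names
pushstring and popinput are assumptions here. As a simplification, this
pgetc counts newlines itself, whereas the text above says the parser
routines maintain the line number.

```c
#include <assert.h>
#include <stdio.h>

struct input {
    struct input *prev;                /* input we will return to */
    const char *next;                  /* next character to read */
    int saved_plinno;                  /* line number of the outer input */
};

static struct input *curinput;         /* top of the input stack */
static int plinno = 1;                 /* current line number */

/* Start reading from the string s; line numbering restarts at 1. */
static void pushstring(struct input *in, const char *s)
{
    in->prev = curinput;
    in->next = s;
    in->saved_plinno = plinno;
    plinno = 1;
    curinput = in;
}

/* Return to the previous input and its line number. */
static void popinput(void)
{
    assert(curinput != NULL);
    plinno = curinput->saved_plinno;
    curinput = curinput->prev;
}

/* Read one character from the current input, or EOF at its end. */
static int pgetc(void)
{
    int c;

    if (curinput == NULL || *curinput->next == '\0')
        return EOF;
    c = (unsigned char)*curinput->next++;
    if (c == '\n')
        plinno++;
    return c;
}
```

Pushing a string in the middle of another input, as the eval builtin
does, leaves the outer input's position and line number untouched until
the pop.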
| 279 | |
| 280 | DEBUGGING: If DEBUG is defined in shell.h, then the shell will |
| 281 | write debugging information to the file $HOME/trace. Most of |
| 282 | this is done using the TRACE macro, which takes a set of printf |
| 283 | arguments inside two sets of parentheses. Example: |
| 284 | "TRACE(("n=%d\n", n))". The double parentheses are necessary |
| 285 | because the preprocessor can't handle macros with a variable |
| 286 | number of arguments. Defining DEBUG also causes the shell to |
| 287 | generate a core dump if it is sent a quit signal. The tracing |
| 288 | code is in show.c. |
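
The double-parenthesis trick can be sketched like this. The macro takes
one argument, the parenthesized printf argument list, and pastes it
after the name of a variadic function. The tracefile variable and the
body of trace are assumptions rather than the contents of show.c, and
DEBUG is defined inline here only to keep the example self-contained.

```c
#define DEBUG 1                        /* normally set in shell.h */

#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static FILE *tracefile;                /* where trace output goes */

/* printf-style output to the trace file; a no-op if none is open. */
static void trace(const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    if (tracefile == NULL)
        return;
    va_start(ap, fmt);
    vfprintf(tracefile, fmt, ap);
    va_end(ap);
    fflush(tracefile);
}

#ifdef DEBUG
#define TRACE(args) trace args         /* TRACE(("x")) becomes trace ("x") */
#else
#define TRACE(args) ((void)0)          /* compiles away entirely */
#endif
```

With DEBUG undefined, the whole call including its arguments disappears
at compile time, so tracing costs nothing in a production build.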