1 \input texinfo @c -*-texinfo-*-
2 @c %**start of header (This is for running Texinfo on a region.)
4 @settitle The GNU Awk User's Guide
5 @c %**end of header (This is for running Texinfo on a region.)
7 @c inside ifinfo for older versions of texinfo.tex
9 @c I hope this is the right category
10 @dircategory Programming Languages
12 * Gawk: (gawk). A Text Scanning and Processing Language.
16 @c @set xref-automatic-section-title
19 @c The following information should be updated here only!
20 @c This sets the edition of the document, the version of gawk it
21 @c applies to, and when the document was updated.
22 @set TITLE Effective AWK Programming
23 @set SUBTITLE A User's Guide for GNU Awk
25 @set EDITION 1.0.@value{PATCHLEVEL}
27 @set UPDATE-MONTH July, 2000
32 @set DOCUMENT Info file
36 Some comments on the layout for TeX.
37 1. Use at least texinfo.tex 2.159. It contains fixes that
38 are needed to get the footings for draft mode to not appear.
39 2. I have done A LOT of work to make this look good. There are `@page' commands
40 and use of `@group ... @end group' in a number of places. If you muck
41 with anything, it's your responsibility not to break the layout.
44 @c merge the function and variable indexes into the concept index
54 @c If "finalout" is commented out, the printed output will show
55 @c black boxes that mark lines that are too long. Thus, it is
56 @c unwise to comment it out when running a master in case there are
57 @c overfulls which are deemed okay.
71 This file documents @code{awk}, a program that you can use to select
72 particular records in a file and perform operations upon them.
74 This is Edition @value{EDITION} of @cite{@value{TITLE}},
75 for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation of AWK.
77 Copyright (C) 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.
79 Permission is granted to make and distribute verbatim copies of
80 this manual provided the copyright notice and this permission notice
81 are preserved on all copies.
84 Permission is granted to process this file through TeX and print the
85 results, provided the printed document carries copying permission
86 notice identical to this one except for the removal of this paragraph
87 (this paragraph not being relevant to the printed manual).
90 Permission is granted to copy and distribute modified versions of this
91 manual under the conditions for verbatim copying, provided that the entire
92 resulting derived work is distributed under the terms of a permission
93 notice identical to this one.
95 Permission is granted to copy and distribute translations of this manual
96 into another language, under the above conditions for modified versions,
97 except that this permission notice may be stated in a translation approved
101 @setchapternewpage odd
105 @subtitle @value{SUBTITLE}
106 @subtitle Edition @value{EDITION}
107 @subtitle @value{UPDATE-MONTH}
108 @author Arnold D. Robbins
111 @author Based on @cite{The GAWK Manual},
112 @author by Robbins, Close, Rubin, and Stallman
115 @c Include the Distribution inside the titlepage environment so
116 @c that headings are turned off. Headings on and off do not work.
119 @vskip 0pt plus 1filll
121 The programs and applications presented in this book have been
122 included for their instructional value. They have been tested with care,
123 but are not guaranteed for any particular purpose. The publisher does not
124 offer any warranties or representations, nor does it accept any
125 liabilities with respect to the programs or applications.
128 UNIX is a registered trademark of X/Open, Ltd. @*
129 Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
130 trademark of Microsoft Corporation in the United States and other
132 Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
133 or trademarks of Atari Corporation. @*
134 DEC, Digital, OpenVMS, ULTRIX, and VMS, are trademarks of Digital Equipment
137 ``To boldly go where no man has gone before'' is a
138 Registered Trademark of Paramount Pictures Corporation. @*
139 @c sorry, i couldn't resist
141 Copyright @copyright{} 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.
144 This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
145 for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU implementation of AWK.
150 Free Software Foundation @*
151 59 Temple Place --- Suite 330 @*
152 Boston, MA 02111-1307 USA @*
153 Phone: +1-617-542-5942 @*
154 Fax: +1-617-542-2652 @*
155 Email: @code{gnu@@gnu.org} @*
156 URL: @code{http://www.gnu.org/} @*
159 @c this ISBN can change!
160 @c This one is correct for gawk 3.0 and edition 1.0 from the FSF
161 ISBN 1-882114-26-4 @*
163 Permission is granted to make and distribute verbatim copies of
164 this manual provided the copyright notice and this permission notice
165 are preserved on all copies.
167 Permission is granted to copy and distribute modified versions of this
168 manual under the conditions for verbatim copying, provided that the entire
169 resulting derived work is distributed under the terms of a permission
170 notice identical to this one.
172 Permission is granted to copy and distribute translations of this manual
173 into another language, under the above conditions for modified versions,
174 except that this permission notice may be stated in a translation approved
177 Cover art by Etienne Suvasa.
180 @c Thanks to Bob Chassell for directions on doing dedications.
186 @center @i{To Miriam, for making me complete.}
188 @center @i{To Chana, for the joy you bring us.}
190 @center @i{To Rivka, for the exponential increase.}
192 @center @i{To Nachum, for the added dimension.}
194 @center @i{To Malka, for the new beginning.}
203 @evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
204 @oddheading @| @| @strong{@thischapter}@ @ @ @thispage
206 @evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
207 @oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
212 @node Top, Preface, (dir), (dir)
213 @top General Introduction
214 @c Preface or Licensing nodes should come right after the Top
215 @c node, in `unnumbered' sections, then the chapter, `What is gawk'.
217 This file documents @code{awk}, a program that you can use to select
218 particular records in a file and perform operations upon them.
220 This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
221 for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation @*
227 * Preface:: What this @value{DOCUMENT} is about; brief
228 history and acknowledgements.
229 * What Is Awk:: What is the @code{awk} language; using this
231 * Getting Started:: A basic introduction to using @code{awk}. How
232 to run an @code{awk} program. Command line
234 * One-liners:: Short, sample @code{awk} programs.
235 * Regexp:: All about matching things using regular
237 * Reading Files:: How to read files and manipulate fields.
238 * Printing:: How to print using @code{awk}. Describes the
239 @code{print} and @code{printf} statements.
240 Also describes redirection of output.
241 * Expressions:: Expressions are the basic building blocks of
243 * Patterns and Actions:: Overviews of patterns and actions.
244 * Statements:: The various control statements are described
246 * Built-in Variables:: Built-in Variables
247 * Arrays:: The description and use of arrays. Also
248 includes array-oriented control statements.
249 * Built-in:: The built-in functions are summarized here.
250 * User-defined:: User-defined functions are described in
252 * Invoking Gawk:: How to run @code{gawk}.
253 * Library Functions:: A Library of @code{awk} Functions.
254 * Sample Programs:: Many @code{awk} programs with complete
256 * Language History:: The evolution of the @code{awk} language.
257 * Gawk Summary:: @code{gawk} Options and Language Summary.
258 * Installation:: Installing @code{gawk} under various operating
260 * Notes:: Something about the implementation of
262 * Glossary:: An explanation of some unfamiliar terms.
263 * Copying:: Your right to copy and distribute @code{gawk}.
264 * Index:: Concept and Variable Index.
266 * History:: The history of @code{gawk} and @code{awk}.
267 * Manual History:: Brief history of the GNU project and this
269 * Acknowledgements:: Acknowledgements.
270 * This Manual:: Using this @value{DOCUMENT}. Includes sample
271 input files that you can use.
272 * Conventions:: Typographical Conventions.
273 * Sample Data Files:: Sample data files for use in the @code{awk}
274 programs illustrated in this @value{DOCUMENT}.
275 * Names:: What name to use to find @code{awk}.
276 * Running gawk:: How to run @code{gawk} programs; includes
278 * One-shot:: Running a short throw-away @code{awk} program.
279 * Read Terminal:: Using no input files (input from terminal
281 * Long:: Putting permanent @code{awk} programs in
283 * Executable Scripts:: Making self-contained @code{awk} programs.
284 * Comments:: Adding documentation to @code{gawk} programs.
285 * Very Simple:: A very simple example.
286 * Two Rules:: A less simple one-line example with two rules.
287 * More Complex:: A more complex example.
288 * Statements/Lines:: Subdividing or combining statements into
290 * Other Features:: Other Features of @code{awk}.
291 * When:: When to use @code{gawk} and when to use other
293 * Regexp Usage:: How to Use Regular Expressions.
294 * Escape Sequences:: How to write non-printing characters.
295 * Regexp Operators:: Regular Expression Operators.
296 * GNU Regexp Operators:: Operators specific to GNU software.
297 * Case-sensitivity:: How to do case-insensitive matching.
298 * Leftmost Longest:: How much text matches.
299 * Computed Regexps:: Using Dynamic Regexps.
300 * Records:: Controlling how data is split into records.
301 * Fields:: An introduction to fields.
302 * Non-Constant Fields:: Non-constant Field Numbers.
303 * Changing Fields:: Changing the Contents of a Field.
304 * Field Separators:: The field separator and how to change it.
305 * Basic Field Splitting:: How fields are split with single characters or
307 * Regexp Field Splitting:: Using regexps as the field separator.
308 * Single Character Fields:: Making each character a separate field.
309 * Command Line Field Separator:: Setting @code{FS} from the command line.
310 * Field Splitting Summary:: Some final points and a summary table.
311 * Constant Size:: Reading constant width data.
312 * Multiple Line:: Reading multi-line records.
313 * Getline:: Reading files under explicit program control
314 using the @code{getline} function.
315 * Getline Intro:: Introduction to the @code{getline} function.
316 * Plain Getline:: Using @code{getline} with no arguments.
317 * Getline/Variable:: Using @code{getline} into a variable.
318 * Getline/File:: Using @code{getline} from a file.
319 * Getline/Variable/File:: Using @code{getline} into a variable from a
321 * Getline/Pipe:: Using @code{getline} from a pipe.
322 * Getline/Variable/Pipe:: Using @code{getline} into a variable from a
324 * Getline Summary:: Summary Of @code{getline} Variants.
325 * Print:: The @code{print} statement.
326 * Print Examples:: Simple examples of @code{print} statements.
327 * Output Separators:: The output separators and how to change them.
328 * OFMT:: Controlling Numeric Output With @code{print}.
329 * Printf:: The @code{printf} statement.
330 * Basic Printf:: Syntax of the @code{printf} statement.
331 * Control Letters:: Format-control letters.
332 * Format Modifiers:: Format-specification modifiers.
333 * Printf Examples:: Several examples.
334 * Redirection:: How to redirect output to multiple files and
336 * Special Files:: File name interpretation in @code{gawk}.
337 @code{gawk} allows access to inherited file
339 * Close Files And Pipes:: Closing Input and Output Files and Pipes.
340 * Constants:: String, numeric, and regexp constants.
341 * Scalar Constants:: Numeric and string constants.
342 * Regexp Constants:: Regular Expression constants.
343 * Using Constant Regexps:: When and how to use a regexp constant.
344 * Variables:: Variables give names to values for later use.
345 * Using Variables:: Using variables in your programs.
346 * Assignment Options:: Setting variables on the command line and a
347 summary of command line syntax. This is an
348 advanced method of input.
349 * Conversion:: The conversion of strings to numbers and vice
351 * Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
353 * Concatenation:: Concatenating strings.
354 * Assignment Ops:: Changing the value of a variable or a field.
355 * Increment Ops:: Incrementing the numeric value of a variable.
356 * Truth Values:: What is ``true'' and what is ``false''.
357 * Typing and Comparison:: How variables acquire types, and how this
358 affects comparison of numbers and strings with
360 * Boolean Ops:: Combining comparison expressions using boolean
361 operators @samp{||} (``or''), @samp{&&}
362 (``and'') and @samp{!} (``not'').
363 * Conditional Exp:: Conditional expressions select between two
364 subexpressions under control of a third
366 * Function Calls:: A function call is an expression.
367 * Precedence:: How various operators nest.
368 * Pattern Overview:: What goes into a pattern.
369 * Kinds of Patterns:: A list of all kinds of patterns.
370 * Regexp Patterns:: Using regexps as patterns.
371 * Expression Patterns:: Any expression can be used as a pattern.
372 * Ranges:: Pairs of patterns specify record ranges.
373 * BEGIN/END:: Specifying initialization and cleanup rules.
374 * Using BEGIN/END:: How and why to use BEGIN/END rules.
375 * I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
376 * Empty:: The empty pattern, which matches every record.
377 * Action Overview:: What goes into an action.
378 * If Statement:: Conditionally execute some @code{awk}
380 * While Statement:: Loop until some condition is satisfied.
381 * Do Statement:: Do specified action while looping until some
382 condition is satisfied.
383 * For Statement:: Another looping statement, that provides
384 initialization and increment clauses.
385 * Break Statement:: Immediately exit the innermost enclosing loop.
386 * Continue Statement:: Skip to the end of the innermost enclosing
388 * Next Statement:: Stop processing the current input record.
389 * Nextfile Statement:: Stop processing the current file.
390 * Exit Statement:: Stop execution of @code{awk}.
391 * User-modified:: Built-in variables that you change to control
393 * Auto-set:: Built-in variables where @code{awk} gives you
395 * ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
396 * Array Intro:: Introduction to Arrays
397 * Reference to Elements:: How to examine one element of an array.
398 * Assigning Elements:: How to change an element of an array.
399 * Array Example:: Basic Example of an Array
400 * Scanning an Array:: A variation of the @code{for} statement. It
401 loops through the indices of an array's
403 * Delete:: The @code{delete} statement removes an element
405 * Numeric Array Subscripts:: How to use numbers as subscripts in
407 * Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
408 * Multi-dimensional:: Emulating multi-dimensional arrays in
410 * Multi-scanning:: Scanning multi-dimensional arrays.
411 * Calling Built-in:: How to call built-in functions.
412 * Numeric Functions:: Functions that work with numbers, including
413 @code{int}, @code{sin} and @code{rand}.
414 * String Functions:: Functions for string manipulation, such as
415 @code{split}, @code{match}, and
417 * I/O Functions:: Functions for files and shell commands.
418 * Time Functions:: Functions for dealing with time stamps.
419 * Definition Syntax:: How to write definitions and what they mean.
420 * Function Example:: An example function definition and what it
422 * Function Caveats:: Things to watch out for.
423 * Return Statement:: Specifying the value a function returns.
424 * Options:: Command line options and their meanings.
425 * Other Arguments:: Input file names and variable assignments.
426 * AWKPATH Variable:: Searching directories for @code{awk} programs.
427 * Obsolete:: Obsolete Options and/or features.
428 * Undocumented:: Undocumented Options and Features.
429 * Known Bugs:: Known Bugs in @code{gawk}.
430 * Portability Notes:: What to do if you don't have @code{gawk}.
431 * Nextfile Function:: Two implementations of a @code{nextfile}
433 * Assert Function:: A function for assertions in @code{awk}
435 * Round Function:: A function for rounding if @code{sprintf} does
437 * Ordinal Functions:: Functions for using characters as numbers and
439 * Join Function:: A function to join an array into a string.
440 * Mktime Function:: A function to turn a date into a timestamp.
441 * Gettimeofday Function:: A function to get formatted times.
442 * Filetrans Function:: A function for handling data file transitions.
443 * Getopt Function:: A function for processing command line
445 * Passwd Functions:: Functions for getting user information.
446 * Group Functions:: Functions for getting group information.
447 * Library Names:: How to best name private global variables in
449 * Clones:: Clones of common utilities.
450 * Cut Program:: The @code{cut} utility.
451 * Egrep Program:: The @code{egrep} utility.
452 * Id Program:: The @code{id} utility.
453 * Split Program:: The @code{split} utility.
454 * Tee Program:: The @code{tee} utility.
455 * Uniq Program:: The @code{uniq} utility.
456 * Wc Program:: The @code{wc} utility.
457 * Miscellaneous Programs:: Some interesting @code{awk} programs.
458 * Dupword Program:: Finding duplicated words in a document.
459 * Alarm Program:: An alarm clock.
460 * Translate Program:: A program similar to the @code{tr} utility.
461 * Labels Program:: Printing mailing labels.
462 * Word Sorting:: A program to produce a word usage count.
463 * History Sorting:: Eliminating duplicate entries from a history
465 * Extract Program:: Pulling out programs from Texinfo source
467 * Simple Sed:: A Simple Stream Editor.
468 * Igawk Program:: A wrapper for @code{awk} that includes files.
469 * V7/SVR3.1:: The major changes between V7 and System V
471 * SVR4:: Minor changes between System V Releases 3.1
473 * POSIX:: New features from the POSIX standard.
474 * BTL:: New features from the Bell Laboratories
475 version of @code{awk}.
476 * POSIX/GNU:: The extensions in @code{gawk} not in POSIX
478 * Command Line Summary:: Recapitulation of the command line.
479 * Language Summary:: A terse review of the language.
480 * Variables/Fields:: Variables, fields, and arrays.
481 * Fields Summary:: Input field splitting.
482 * Built-in Summary:: @code{awk}'s built-in variables.
483 * Arrays Summary:: Using arrays.
484 * Data Type Summary:: Values in @code{awk} are numbers or strings.
485 * Rules Summary:: Patterns and Actions, and their component
487 * Pattern Summary:: Quick overview of patterns.
488 * Regexp Summary:: Quick overview of regular expressions.
489 * Actions Summary:: Quick overview of actions.
490 * Operator Summary:: @code{awk} operators.
491 * Control Flow Summary:: The control statements.
492 * I/O Summary:: The I/O statements.
493 * Printf Summary:: A summary of @code{printf}.
494 * Special File Summary:: Special file names interpreted internally.
495 * Built-in Functions Summary:: Built-in numeric and string functions.
496 * Time Functions Summary:: Built-in time functions.
497 * String Constants Summary:: Escape sequences in strings.
498 * Functions Summary:: Defining and calling functions.
499 * Historical Features:: Some undocumented but supported ``features''.
500 * Gawk Distribution:: What is in the @code{gawk} distribution.
501 * Getting:: How to get the distribution.
502 * Extracting:: How to extract the distribution.
503 * Distribution contents:: What is in the distribution.
504 * Unix Installation:: Installing @code{gawk} under various versions
506 * Quick Installation:: Compiling @code{gawk} under Unix.
507 * Configuration Philosophy:: How it's all supposed to work.
508 * VMS Installation:: Installing @code{gawk} on VMS.
509 * VMS Compilation:: How to compile @code{gawk} under VMS.
510 * VMS Installation Details:: How to install @code{gawk} under VMS.
511 * VMS Running:: How to run @code{gawk} under VMS.
512 * VMS POSIX:: Alternate instructions for VMS POSIX.
513 * PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
515 * Atari Installation:: Installing @code{gawk} on the Atari ST.
516 * Atari Compiling:: Compiling @code{gawk} on Atari
517 * Atari Using:: Running @code{gawk} on Atari
518 * Amiga Installation:: Installing @code{gawk} on an Amiga.
519 * Bugs:: Reporting Problems and Bugs.
520 * Other Versions:: Other freely available @code{awk}
522 * Compatibility Mode:: How to disable certain @code{gawk} extensions.
523 * Additions:: Making Additions To @code{gawk}.
524 * Adding Code:: Adding code to the main body of @code{gawk}.
525 * New Ports:: Porting @code{gawk} to a new operating system.
526 * Future Extensions:: New features that may be implemented one day.
527 * Improvements:: Suggestions for improvements by volunteers.
531 @c dedication for Info file
533 @center To Miriam, for making me complete.
535 @center To Chana, for the joy you bring us.
537 @center To Rivka, for the exponential increase.
539 @center To Nachum, for the added dimension.
541 @center To Malka, for the new beginning.
544 @node Preface, What Is Awk, Top, Top
547 @c I saw a comment somewhere that the preface should describe the book itself,
548 @c and the introduction should describe what the book covers.
550 This @value{DOCUMENT} teaches you about the @code{awk} language and
551 how you can use it effectively. You should already be familiar with basic
552 system commands, such as @code{cat} and @code{ls},@footnote{These commands
553 are available on POSIX compliant systems, as well as on traditional Unix
554 based systems. If you are using some other operating system, you still need to
555 be familiar with the ideas of I/O redirection and pipes.} and basic shell
556 facilities, such as Input/Output (I/O) redirection and pipes.
558 Implementations of the @code{awk} language are available for many different
559 computing environments. This @value{DOCUMENT}, while describing the @code{awk} language
560 in general, also describes a particular implementation of @code{awk} called
561 @code{gawk} (which stands for ``GNU Awk''). @code{gawk} runs on a broad range
562 of Unix systems, ranging from 80386 PC-based computers, up through large scale
563 systems, such as Crays. @code{gawk} has also been ported to MS-DOS and
564 OS/2 PC's, Atari and Amiga micro-computers, and VMS.
567 * History:: The history of @code{gawk} and @code{awk}.
568 * Manual History:: Brief history of the GNU project and this
570 * Acknowledgements:: Acknowledgements.
573 @node History, Manual History, Preface, Preface
574 @unnumberedsec History of @code{awk} and @code{gawk}
577 @cindex history of @code{awk}
579 @cindex Weinberger, Peter
580 @cindex Kernighan, Brian
581 @cindex old @code{awk}
582 @cindex new @code{awk}
583 The name @code{awk} comes from the initials of its designers: Alfred V.@:
584 Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
585 @code{awk} was written in 1977 at AT&T Bell Laboratories.
586 In 1985 a new version made the programming
587 language more powerful, introducing user-defined functions, multiple input
588 streams, and computed regular expressions.
589 This new version became generally available with Unix System V Release 3.1.
590 The version in System V Release 4 added some new features and also cleaned
591 up the behavior in some of the ``dark corners'' of the language.
592 The specification for @code{awk} in the POSIX Command Language
593 and Utilities standard further clarified the language based on feedback
594 from both the @code{gawk} designers, and the original Bell Labs @code{awk}
597 The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
598 and Jay Fenlason, with advice from Richard Stallman. John Woods
599 contributed parts of the code as well. In 1988 and 1989, David Trueman, with
600 help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
601 with the newer @code{awk}. Current development focuses on bug fixes,
602 performance improvements, standards compliance, and occasionally, new features.
604 @node Manual History, Acknowledgements, History, Preface
605 @unnumberedsec The GNU Project and This Book
607 @cindex Free Software Foundation
608 @cindex Stallman, Richard
609 The Free Software Foundation (FSF) is a non-profit organization dedicated
610 to the production and distribution of freely distributable software.
611 It was founded by Richard M.@: Stallman, the author of the original
612 Emacs editor. GNU Emacs is the most widely used version of Emacs today.
615 The GNU project is an on-going effort on the part of the Free Software
616 Foundation to create a complete, freely distributable, POSIX compliant
617 computing environment. (GNU stands for ``GNU's not Unix''.)
618 The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
619 source code for their software is always available to the end user. A
620 copy of the GPL is included for your reference
621 (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
622 The GPL applies to the C language source code for @code{gawk}.
624 A shell, an editor (Emacs), highly portable optimizing C, C++, and
625 Objective-C compilers, a symbolic debugger, and dozens of large and
626 small utilities (such as @code{gawk}), have all been completed and are
627 freely available. As of this writing (early 1997), the GNU operating
628 system kernel (the HURD), has been released, but is still in an early
629 stage of development.
634 Until the GNU operating system is more fully developed, you should
635 consider using Linux, a freely distributable, Unix-like operating
636 system for 80386, DEC Alpha, Sun SPARC and other systems. There are
637 many books on Linux. One freely available one is @cite{Linux
638 Installation and Getting Started}, by Matt Welsh.
639 Many Linux distributions are available, often in computer stores or
640 bundled on CD-ROM with books about Linux.
641 (There are three other freely available, Unix-like operating systems for
642 80386 and other systems, NetBSD, FreeBSD,and OpenBSD. All are based on the
643 4.4-Lite Berkeley Software Distribution, and they use recent versions
644 of @code{gawk} for their versions of @code{awk}.)
647 This @value{DOCUMENT} you are reading now is actually free. The
648 information in it is freely available to anyone, the machine readable
649 source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
650 may take this @value{DOCUMENT} to a copying machine and make as many
651 copies of it as they like. (Take a moment to check the copying
652 permissions on the Copyright page.)
654 If you paid money for this @value{DOCUMENT}, what you actually paid for
655 was the @value{DOCUMENT}'s nice printing and binding, and the
656 publisher's associated costs to produce it. We have made an effort to
657 keep these costs reasonable; most people would prefer a bound book to
658 over 330 pages of photo-copied text that would then have to be held in
659 a loose-leaf binder (not to mention the time and labor involved in
660 doing the copying). The same is true of producing this
661 @value{DOCUMENT} from the machine readable source; the retail price is
662 only slightly more than the cost per page of printing it
666 This @value{DOCUMENT} itself has gone through several previous,
667 preliminary editions. I started working on a preliminary draft of
668 @cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
669 Stallman in the fall of 1988.
670 It was around 90 pages long, and barely described the original, ``old''
671 version of @code{awk}. After substantial revision, the first version of
672 the @cite{The GAWK Manual} to be released was Edition 0.11 Beta in
673 October of 1989. The manual then underwent more substantial revision
674 for Edition 0.13 of December 1991.
675 David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
676 of the manual for Edition 0.13.
677 That edition was published by the
678 FSF as a bound book early in 1992. Since then there have been several
679 minor revisions, notably Edition 0.14 of November 1992 that was published
680 by the FSF in January of 1993, and Edition 0.16 of August 1993.
682 Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
683 of @cite{The GAWK Manual}, with much additional material.
684 The FSF and I agree that I am now the primary author.
685 I also felt that it needed a more descriptive title.
687 @cite{@value{TITLE}} will undoubtedly continue to evolve.
688 An electronic version
689 comes with the @code{gawk} distribution from the FSF.
690 If you find an error in this @value{DOCUMENT}, please report it!
691 @xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
692 problem reports electronically, or write to me in care of the FSF.
694 @node Acknowledgements, , Manual History, Preface
695 @unnumberedsec Acknowledgements
697 @cindex Stallman, Richard
698 I would like to acknowledge Richard M.@: Stallman, for his vision of a
699 better world, and for his courage in founding the FSF and starting the
702 The initial draft of @cite{The GAWK Manual} had the following acknowledgements:
705 Many people need to be thanked for their assistance in producing this
706 manual. Jay Fenlason contributed many ideas and sample programs. Richard
707 Mlynarik and Robert Chassell gave helpful comments on drafts of this
708 manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@:
709 Pierce of the Chemistry Department at UC San Diego, pinpointed several
710 issues relevant both to @code{awk} implementation and to this manual, that
711 would otherwise have escaped us.
714 The following people provided many helpful comments on Edition 0.13 of
715 @cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,
716 Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,
717 and Michal Jaegermann.
719 The following people provided many helpful comments for Edition 1.0 of
720 @cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel
721 Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.
722 Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik
723 updated their respective sections for Edition 1.0.
725 Robert J.@: Chassell provided much valuable advice on
726 the use of Texinfo. He also deserves special thanks for
727 convincing me @emph{not} to title this @value{DOCUMENT}
728 @cite{How To Gawk Politely}.
729 Karl Berry helped significantly with the @TeX{} part of Texinfo.
731 @cindex Trueman, David
732 David Trueman deserves special credit; he has done a yeoman job
733 of evolving @code{gawk} so that it performs well, and without bugs.
734 Although he is no longer involved with @code{gawk},
735 working with him on this project was a significant pleasure.
737 @cindex Deifik, Scott
738 @cindex Hankerson, Darrel
739 @cindex Rommel, Kai Uwe
741 @cindex Jaegermann, Michal
742 Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal
743 Jaegermann (in no particular order) are long time members of the
744 @code{gawk} ``crack portability team.'' Without their hard work and
745 help, @code{gawk} would not be nearly the fine program it is today. It
746 has been and continues to be a pleasure working with this team of fine
749 @cindex Friedl, Jeffrey
750 Jeffrey Friedl provided invaluable help in tracking down a number
751 of last minute problems with regular expressions in @code{gawk} 3.0.
753 @cindex Kernighan, Brian
754 David and I would like to thank Brian Kernighan of Bell Labs for
755 invaluable assistance during the testing and debugging of @code{gawk}, and for
756 help in clarifying numerous points about the language. We could not have
757 done nearly as good a job on either @code{gawk} or its documentation without
761 I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:
762 Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
763 time in their homes, which allowed me to make significant progress on
764 this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC
765 contributed in a very important way by loaning me his laptop Linux
766 system, not once, but twice, allowing me to do a lot of work while
769 @cindex Robbins, Miriam
770 Finally, I must thank my wonderful wife, Miriam, for her patience through
771 the many versions of this project, for her proof-reading,
772 and for sharing me with the computer.
773 I would like to thank my parents for their love, and for the grace with
774 which they raised and educated me.
775 I also must acknowledge my gratitude to G-d, for the many opportunities
776 He has sent my way, as well as for the gifts He has given me with which to
777 take advantage of those opportunities.
785 Stuff still not covered anywhere:
787 Integer vs. floating point
788 Hex vs. octal vs. decimal
789 Interpreter vs compiler
793 @node What Is Awk, Getting Started, Preface, Top
794 @chapter Introduction
796 If you are like many computer users, you would frequently like to make
797 changes in various text files wherever certain patterns appear, or
798 extract data from parts of certain lines while discarding the rest. To
799 write a program to do this in a language such as C or Pascal is a
800 time-consuming inconvenience that may take many lines of code. The job
801 may be easier with @code{awk}.
803 The @code{awk} utility interprets a special-purpose programming language
804 that makes it possible to handle simple data-reformatting jobs
805 with just a few lines of code.
807 The GNU implementation of @code{awk} is called @code{gawk}; it is fully
808 upward compatible with the System V Release 4 version of
809 @code{awk}. @code{gawk} is also upward compatible with the POSIX
810 specification of the @code{awk} language. This means that all
811 properly written @code{awk} programs should work with @code{gawk}.
812 Thus, we usually don't distinguish between @code{gawk} and other @code{awk}
815 @cindex uses of @code{awk}
816 Using @code{awk} you can:
820 manage small, personal databases
829 produce indexes, and perform other document preparation tasks
832 even experiment with algorithms that can be adapted later to other computer
837 * This Manual:: Using this @value{DOCUMENT}. Includes sample
838 input files that you can use.
839 * Conventions:: Typographical Conventions.
840 * Sample Data Files:: Sample data files for use in the @code{awk}
841 programs illustrated in this @value{DOCUMENT}.
844 @node This Manual, Conventions, What Is Awk, What Is Awk
845 @section Using This Book
846 @cindex book, using this
847 @cindex using this book
848 @cindex language, @code{awk}
849 @cindex program, @code{awk}
851 @cindex @code{awk} language
852 @cindex @code{awk} program
855 The term @code{awk} refers to a particular program, and to the language you
856 use to tell this program what to do. When we need to be careful, we call
857 the program ``the @code{awk} utility'' and the language ``the @code{awk}
858 language.'' The term @code{gawk} refers to a version of @code{awk} developed
859 as part the GNU project. The purpose of this @value{DOCUMENT} is to explain
860 both the @code{awk} language and how to run the @code{awk} utility.
862 The main purpose of the @value{DOCUMENT} is to explain the features
863 of @code{awk}, as defined in the POSIX standard. It does so in the context
864 of one particular implementation, @code{gawk}. While doing so, it will also
865 attempt to describe important differences between @code{gawk} and other
866 @code{awk} implementations. Finally, any @code{gawk} features that
867 are not in the POSIX standard for @code{awk} will be noted.
870 This @value{DOCUMENT} has the difficult task of being both tutorial and reference.
871 If you are a novice, feel free to skip over details that seem too complex.
872 You should also ignore the many cross references; they are for the
873 expert user, and for the on-line Info version of the document.
876 The term @dfn{@code{awk} program} refers to a program written by you in
877 the @code{awk} programming language.
879 @xref{Getting Started, ,Getting Started with @code{awk}}, for the bare
880 essentials you need to know to start using @code{awk}.
882 Some useful ``one-liners'' are included to give you a feel for the
883 @code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).
885 Many sample @code{awk} programs have been provided for you
886 (@pxref{Library Functions, ,A Library of @code{awk} Functions}; also
887 @pxref{Sample Programs, ,Practical @code{awk} Programs}).
889 The entire @code{awk} language is summarized for quick reference in
890 @ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need
891 to refresh your memory about a particular feature.
893 If you find terms that you aren't familiar with, try looking them
894 up in the glossary (@pxref{Glossary}).
896 Most of the time complete @code{awk} programs are used as examples, but in
897 some of the more advanced sections, only the part of the @code{awk} program
898 that illustrates the concept being described is shown.
900 While this @value{DOCUMENT} is aimed principally at people who have not been
902 to @code{awk}, there is a lot of information here that even the @code{awk}
903 expert should find useful. In particular, the description of POSIX
904 @code{awk}, and the example programs in
905 @ref{Library Functions, ,A Library of @code{awk} Functions}, and
906 @ref{Sample Programs, ,Practical @code{awk} Programs},
907 should be of interest.
909 @c fakenode --- for prepinfo
910 @unnumberedsubsec Dark Corners
912 @i{Who opened that window shade?!?}
917 @cindex d.c., see ``dark corner''
919 Until the POSIX standard (and @cite{The Gawk Manual}),
920 many features of @code{awk} were either poorly documented, or not
921 documented at all. Descriptions of such features
922 (often called ``dark corners'') are noted in this @value{DOCUMENT} with
924 They also appear in the index under the heading ``dark corner.''
926 @node Conventions, Sample Data Files, This Manual, What Is Awk
927 @section Typographical Conventions
929 This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.
930 A single Texinfo source file is used to produce both the printed and on-line
931 versions of the documentation.
933 Because of this, the typographical conventions
934 are slightly different than in other books you may have read.
937 This section briefly documents the typographical conventions used in Texinfo.
940 Examples you would type at the command line are preceded by the common
941 shell primary and secondary prompts, @samp{$} and @samp{>}.
942 Output from the command is preceded by the glyph ``@print{}''.
943 This typically represents the command's standard output.
944 Error messages, and other output on the command's standard error, are preceded
945 by the glyph ``@error{}''. For example:
950 @print{} hi on stdout
951 $ echo hello on stderr 1>&2
952 @error{} hello on stderr
957 In the text, command names appear in @code{this font}, while code segments
958 appear in the same font and quoted, @samp{like this}. Some things will
959 be emphasized @emph{like this}, and if a point needs to be made
960 strongly, it will be done @strong{like this}. The first occurrence of
961 a new term is usually its @dfn{definition}, and appears in the same
962 font as the previous occurrence of ``definition'' in this sentence.
963 File names are indicated like this: @file{/path/to/ourfile}.
966 Characters that you type at the keyboard look @kbd{like this}. In particular,
967 there are special characters called ``control characters.'' These are
968 characters that you type by holding down both the @kbd{CONTROL} key and
969 another key, at the same time. For example, a @kbd{Control-d} is typed
970 by first pressing and holding the @kbd{CONTROL} key, next
971 pressing the @kbd{d} key, and finally releasing both keys.
973 @node Sample Data Files, , Conventions, What Is Awk
974 @section Data Files for the Examples
976 @cindex input file, sample
977 @cindex sample input file
978 @cindex @file{BBS-list} file
979 Many of the examples in this @value{DOCUMENT} take their input from two sample
980 data files. The first, called @file{BBS-list}, represents a list of
981 computer bulletin board systems together with information about those systems.
982 The second data file, called @file{inventory-shipped}, contains
983 information about shipments on a monthly basis. In both files,
984 each line is considered to be one @dfn{record}.
986 In the file @file{BBS-list}, each record contains the name of a computer
987 bulletin board, its phone number, the board's baud rate(s), and a code for
988 the number of hours it is operational. An @samp{A} in the last column
989 means the board operates 24 hours a day. A @samp{B} in the last
990 column means the board operates evening and weekend hours, only. A
991 @samp{C} means the board operates only on weekends.
993 @c 2e: Update the baud rates to reflect today's faster modems
996 @c system mkdir eg/lib
997 @c system mkdir eg/data
998 @c system mkdir eg/prog
999 @c system mkdir eg/misc
1000 @c file eg/data/BBS-list
1001 aardvark 555-5553 1200/300 B
1002 alpo-net 555-3412 2400/1200/300 A
1003 barfly 555-7685 1200/300 A
1004 bites 555-1675 2400/1200/300 A
1005 camelot 555-0542 300 C
1006 core 555-2912 1200/300 C
1007 fooey 555-1234 2400/1200/300 B
1008 foot 555-6699 1200/300 B
1009 macfoo 555-6480 1200/300 A
1010 sdace 555-3430 2400/1200/300 A
1011 sabafoo 555-2127 1200/300 C
1015 @cindex @file{inventory-shipped} file
1016 The second data file, called @file{inventory-shipped}, represents
1017 information about shipments during the year.
1018 Each record contains the month of the year, the number
1019 of green crates shipped, the number of red boxes shipped, the number of
1020 orange bags shipped, and the number of blue packages shipped,
1021 respectively. There are 16 entries, covering the 12 months of one year
1022 and four months of the next year.
1025 @c file eg/data/inventory-shipped
1047 If you are reading this in GNU Emacs using Info, you can copy the regions
1048 of text showing these sample files into your own test files. This way you
1049 can try out the examples shown in the remainder of this document. You do
1050 this by using the command @kbd{M-x write-region} to copy text from the Info
1051 file into a file for use with @code{awk}
1052 (@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
1053 for more information). Using this information, create your own
1054 @file{BBS-list} and @file{inventory-shipped} files, and practice what you
1055 learn in this @value{DOCUMENT}.
1057 If you are using the stand-alone version of Info,
1058 see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
1059 for an @code{awk} program that will extract these data files from
1060 @file{gawk.texi}, the Texinfo source file for this Info file.
1063 @node Getting Started, One-liners, What Is Awk, Top
1064 @chapter Getting Started with @code{awk}
1065 @cindex script, definition of
1066 @cindex rule, definition of
1067 @cindex program, definition of
1068 @cindex basic function of @code{awk}
1070 The basic function of @code{awk} is to search files for lines (or other
1071 units of text) that contain certain patterns. When a line matches one
1072 of the patterns, @code{awk} performs specified actions on that line.
1073 @code{awk} keeps processing input lines in this way until the end of the
1074 input files are reached.
1076 @cindex data-driven languages
1077 @cindex procedural languages
1078 @cindex language, data-driven
1079 @cindex language, procedural
1080 Programs in @code{awk} are different from programs in most other languages,
1081 because @code{awk} programs are @dfn{data-driven}; that is, you describe
1082 the data you wish to work with, and then what to do when you find it.
1083 Most other languages are @dfn{procedural}; you have to describe, in great
1084 detail, every step the program is to take. When working with procedural
1085 languages, it is usually much
1086 harder to clearly describe the data your program will process.
1087 For this reason, @code{awk} programs are often refreshingly easy to both
1090 @cindex program, definition of
1091 @cindex rule, definition of
1092 When you run @code{awk}, you specify an @code{awk} @dfn{program} that
1093 tells @code{awk} what to do. The program consists of a series of
1094 @dfn{rules}. (It may also contain @dfn{function definitions},
1095 an advanced feature which we will ignore for now.
1096 @xref{User-defined, ,User-defined Functions}.) Each rule specifies one
1097 pattern to search for, and one action to perform when that pattern is found.
1099 Syntactically, a rule consists of a pattern followed by an action. The
1100 action is enclosed in curly braces to separate it from the pattern.
1101 Rules are usually separated by newlines. Therefore, an @code{awk}
1102 program looks like this:
1105 @var{pattern} @{ @var{action} @}
1106 @var{pattern} @{ @var{action} @}
1111 * Names:: What name to use to find @code{awk}.
1112 * Running gawk:: How to run @code{gawk} programs; includes
1113 command line syntax.
1114 * Very Simple:: A very simple example.
1115 * Two Rules:: A less simple one-line example with two rules.
1116 * More Complex:: A more complex example.
1117 * Statements/Lines:: Subdividing or combining statements into
1119 * Other Features:: Other Features of @code{awk}.
1120 * When:: When to use @code{gawk} and when to use other
1124 @node Names, Running gawk , Getting Started, Getting Started
1125 @section A Rose By Any Other Name
1127 @cindex old @code{awk} vs. new @code{awk}
1128 @cindex new @code{awk} vs. old @code{awk}
1129 The @code{awk} language has evolved over the years. Full details are
1130 provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.
1131 The language described in this @value{DOCUMENT}
1132 is often referred to as ``new @code{awk}.''
1134 Because of this, many systems have multiple
1135 versions of @code{awk}.
1136 Some systems have an @code{awk} utility that implements the
1137 original version of the @code{awk} language, and a @code{nawk} utility
1138 for the new version. Others have an @code{oawk} for the ``old @code{awk}''
1139 language, and plain @code{awk} for the new one. Still others only
1140 have one version, usually the new one.@footnote{Often, these systems
1141 use @code{gawk} for their @code{awk} implementation!}
1143 All in all, this makes it difficult for you to know which version of
1144 @code{awk} you should run when writing your programs. The best advice
1145 we can give here is to check your local documentation. Look for @code{awk},
1146 @code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you
1147 will have some version of new @code{awk} on your system, and that is what
1148 you should use when running your programs. (Of course, if you're reading
1149 this @value{DOCUMENT}, chances are good that you have @code{gawk}!)
1151 Throughout this @value{DOCUMENT}, whenever we refer to a language feature
1152 that should be available in any complete implementation of POSIX @code{awk},
1153 we simply use the term @code{awk}. When referring to a feature that is
1154 specific to the GNU implementation, we use the term @code{gawk}.
1156 @node Running gawk, Very Simple, Names, Getting Started
1157 @section How to Run @code{awk} Programs
1159 @cindex command line formats
1160 @cindex running @code{awk} programs
1161 There are several ways to run an @code{awk} program. If the program is
1162 short, it is easiest to include it in the command that runs @code{awk},
1166 awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
1170 where @var{program} consists of a series of patterns and actions, as
1172 (The reason for the single quotes is described below, in
1173 @ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)
1175 When the program is long, it is usually more convenient to put it in a file
1176 and run it with a command like this:
1179 awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
1183 * One-shot:: Running a short throw-away @code{awk} program.
1184 * Read Terminal:: Using no input files (input from terminal
1186 * Long:: Putting permanent @code{awk} programs in
1188 * Executable Scripts:: Making self-contained @code{awk} programs.
1189 * Comments:: Adding documentation to @code{gawk} programs.
1192 @node One-shot, Read Terminal, Running gawk, Running gawk
1193 @subsection One-shot Throw-away @code{awk} Programs
1195 Once you are familiar with @code{awk}, you will often type in simple
1196 programs the moment you want to use them. Then you can write the
1197 program as the first argument of the @code{awk} command, like this:
1200 awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
1204 where @var{program} consists of a series of @var{patterns} and
1205 @var{actions}, as described earlier.
1207 @cindex single quotes, why needed
1208 This command format instructs the @dfn{shell}, or command interpreter,
1209 to start @code{awk} and use the @var{program} to process records in the
1210 input file(s). There are single quotes around @var{program} so that
1211 the shell doesn't interpret any @code{awk} characters as special shell
1212 characters. They also cause the shell to treat all of @var{program} as
1213 a single argument for @code{awk} and allow @var{program} to be more
1216 This format is also useful for running short or medium-sized @code{awk}
1217 programs from shell scripts, because it avoids the need for a separate
1218 file for the @code{awk} program. A self-contained shell script is more
1219 reliable since there are no other files to misplace.
1221 @ref{One-liners, , Useful One Line Programs}, presents several short,
1222 self-contained programs.
1224 As an interesting side point, the command
1227 awk '/foo/' @var{files} @dots{}
1231 is essentially the same as
1233 @cindex @code{egrep}
1235 egrep foo @var{files} @dots{}
1238 @node Read Terminal, Long, One-shot, Running gawk
1239 @subsection Running @code{awk} without Input Files
1241 @cindex standard input
1242 @cindex input, standard
1243 You can also run @code{awk} without any input files. If you type the
1251 then @code{awk} applies the @var{program} to the @dfn{standard input},
1252 which usually means whatever you type on the terminal. This continues
1253 until you indicate end-of-file by typing @kbd{Control-d}.
1254 (On other operating systems, the end-of-file character may be different.
1255 For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)
1257 For example, the following program prints a friendly piece of advice
1258 (from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),
1259 to keep you from worrying about the complexities of computer programming
1260 (@samp{BEGIN} is a feature we haven't discussed yet).
1263 $ awk "BEGIN @{ print \"Don't Panic!\" @}"
1264 @print{} Don't Panic!
1267 @cindex quoting, shell
1268 @cindex shell quoting
1269 This program does not read any input. The @samp{\} before each of the
1270 inner double quotes is necessary because of the shell's quoting rules,
1271 in particular because it mixes both single quotes and double quotes.
1273 This next simple @code{awk} program
1274 emulates the @code{cat} utility; it copies whatever you type at the
1275 keyboard to its standard output. (Why this works is explained shortly.)
1279 Now is the time for all good men
1280 @print{} Now is the time for all good men
1281 to come to the aid of their country.
1282 @print{} to come to the aid of their country.
1283 Four score and seven years ago, ...
1284 @print{} Four score and seven years ago, ...
1286 @print{} What, me worry?
1290 @node Long, Executable Scripts, Read Terminal, Running gawk
1291 @subsection Running Long Programs
1293 @cindex running long programs
1294 @cindex @code{-f} option
1295 @cindex program file
1296 @cindex file, @code{awk} program
1297 Sometimes your @code{awk} programs can be very long. In this case it is
1298 more convenient to put the program into a separate file. To tell
1299 @code{awk} to use that file for its program, you type:
1302 awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
1305 The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program
1306 from the file @var{source-file}. Any file name can be used for
1307 @var{source-file}. For example, you could put the program:
1310 BEGIN @{ print "Don't Panic!" @}
1314 into the file @file{advice}. Then this command:
1321 does the same thing as this one:
1324 awk "BEGIN @{ print \"Don't Panic!\" @}"
1327 @cindex quoting, shell
1328 @cindex shell quoting
1330 which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).
1331 Note that you don't usually need single quotes around the file name that you
1332 specify with @samp{-f}, because most file names don't contain any of the shell's
1333 special characters. Notice that in @file{advice}, the @code{awk}
1334 program did not have single quotes around it. The quotes are only needed
1335 for programs that are provided on the @code{awk} command line.
1337 If you want to identify your @code{awk} program files clearly as such,
1338 you can add the extension @file{.awk} to the file name. This doesn't
1339 affect the execution of the @code{awk} program, but it does make
1340 ``housekeeping'' easier.
1342 @node Executable Scripts, Comments, Long, Running gawk
1343 @subsection Executable @code{awk} Programs
1344 @cindex executable scripts
1345 @cindex scripts, executable
1346 @cindex self contained programs
1347 @cindex program, self contained
1348 @cindex @code{#!} (executable scripts)
1350 Once you have learned @code{awk}, you may want to write self-contained
1351 @code{awk} scripts, using the @samp{#!} script mechanism. You can do
1352 this on many Unix systems@footnote{The @samp{#!} mechanism works on
1354 Unix systems derived from Berkeley Unix, System V Release 4, and some System
1355 V Release 3 systems.} (and someday on the GNU system).
1357 For example, you could update the file @file{advice} to look like this:
1362 BEGIN @{ print "Don't Panic!" @}
1366 After making this file executable (with the @code{chmod} utility), you
1367 can simply type @samp{advice}
1368 at the shell, and the system will arrange to run @code{awk}@footnote{The
1369 line beginning with @samp{#!} lists the full file name of an interpreter
1370 to be run, and an optional initial command line argument to pass to that
1371 interpreter. The operating system then runs the interpreter with the given
1372 argument and the full argument list of the executed program. The first argument
1373 in the list is the full file name of the @code{awk} program. The rest of the
1374 argument list will either be options to @code{awk}, or data files,
1375 or both.} as if you had typed @samp{awk -f advice}.
1380 @print{} Don't Panic!
1385 Self-contained @code{awk} scripts are useful when you want to write a
1386 program which users can invoke without their having to know that the program is
1387 written in @code{awk}.
1389 @strong{Caution:} You should not put more than one argument on the @samp{#!}
1390 line after the path to @code{awk}. This will not work. The operating system
1391 treats the rest of the line as a single agument, and passes it to @code{awk}.
1392 Doing this will lead to confusing behavior: most likely a usage diagnostic
1393 of some sort from @code{awk}.
1395 @cindex shell scripts
1396 @cindex scripts, shell
1397 Some older systems do not support the @samp{#!} mechanism. You can get a
1398 similar effect using a regular shell script. It would look something
1402 : The colon ensures execution by the standard shell.
1403 awk '@var{program}' "$@@"
1406 Using this technique, it is @emph{vital} to enclose the @var{program} in
1407 single quotes to protect it from interpretation by the shell. If you
1408 omit the quotes, only a shell wizard can predict the results.
1410 The @code{"$@@"} causes the shell to forward all the command line
1411 arguments to the @code{awk} program, without interpretation. The first
1412 line, which starts with a colon, is used so that this shell script will
1413 work even if invoked by a user who uses the C shell. (Not all older systems
1414 obey this convention, but many do.)
1416 @c Someday: (See @cite{The Bourne Again Shell}, by ??.)
1418 @node Comments, , Executable Scripts, Running gawk
1419 @subsection Comments in @code{awk} Programs
1420 @cindex @code{#} (comment)
1422 @cindex use of comments
1423 @cindex documenting @code{awk} programs
1424 @cindex programs, documenting
1426 A @dfn{comment} is some text that is included in a program for the sake
1427 of human readers; it is not really part of the program. Comments
1428 can explain what the program does, and how it works. Nearly all
1429 programming languages have provisions for comments, because programs are
1430 typically hard to understand without their extra help.
1432 In the @code{awk} language, a comment starts with the sharp sign
1433 character, @samp{#}, and continues to the end of the line.
1434 The @samp{#} does not have to be the first character on the line. The
1435 @code{awk} language ignores the rest of a line following a sharp sign.
1436 For example, we could have put the following into @file{advice}:
1439 # This program prints a nice friendly message. It helps
1440 # keep novice users from being afraid of the computer.
1441 BEGIN @{ print "Don't Panic!" @}
1444 You can put comment lines into keyboard-composed throw-away @code{awk}
1445 programs also, but this usually isn't very useful; the purpose of a
1446 comment is to help you or another person understand the program at
1449 @strong{Caution:} As mentioned in
1450 @ref{One-shot, ,One-shot Throw-away @code{awk} Programs},
1451 you can enclose small to medium programs in single quotes, in order to keep
1452 your shell scripts self-contained. When doing so, @emph{don't} put
1453 an apostrophe (i.e., a single quote) into a comment (or anywhere else
1454 in your program). The shell will interpret the quote as the closing
1455 quote for the entire program. As a result, usually the shell will
1456 print a message about mismatched quotes, and if @code{awk} actually
1457 runs, it will probably print strange messages about syntax errors.
1461 awk 'BEGIN @{ print "hello" @} # let's be cute'
1464 @node Very Simple, Two Rules, Running gawk, Getting Started
1465 @section A Very Simple Example
1467 The following command runs a simple @code{awk} program that searches the
1468 input file @file{BBS-list} for the string of characters: @samp{foo}. (A
1469 string of characters is usually called a @dfn{string}.
1470 The term @dfn{string} is perhaps based on similar usage in English, such
1471 as ``a string of pearls,'' or, ``a string of cars in a train.'')
1474 awk '/foo/ @{ print $0 @}' BBS-list
1478 When lines containing @samp{foo} are found, they are printed, because
1479 @w{@samp{print $0}} means print the current line. (Just @samp{print} by
1480 itself means the same thing, so we could have written that
1483 You will notice that slashes, @samp{/}, surround the string @samp{foo}
1484 in the @code{awk} program. The slashes indicate that @samp{foo}
1485 is a pattern to search for. This type of pattern is called a
1486 @dfn{regular expression}, and is covered in more detail later
1487 (@pxref{Regexp, ,Regular Expressions}).
1488 The pattern is allowed to match parts of words.
1490 single-quotes around the @code{awk} program so that the shell won't
1491 interpret any of it as special shell characters.
1493 Here is what this program prints:
1497 $ awk '/foo/ @{ print $0 @}' BBS-list
1498 @print{} fooey 555-1234 2400/1200/300 B
1499 @print{} foot 555-6699 1200/300 B
1500 @print{} macfoo 555-6480 1200/300 A
1501 @print{} sabafoo 555-2127 1200/300 C
1505 @cindex action, default
1506 @cindex pattern, default
1507 @cindex default action
1508 @cindex default pattern
1509 In an @code{awk} rule, either the pattern or the action can be omitted,
1510 but not both. If the pattern is omitted, then the action is performed
1511 for @emph{every} input line. If the action is omitted, the default
1512 action is to print all lines that match the pattern.
1514 @cindex empty action
1515 @cindex action, empty
1516 Thus, we could leave out the action (the @code{print} statement and the curly
1517 braces) in the above example, and the result would be the same: all
1518 lines matching the pattern @samp{foo} would be printed. By comparison,
1519 omitting the @code{print} statement but retaining the curly braces makes an
1520 empty action that does nothing; then no lines would be printed.
1522 @node Two Rules, More Complex, Very Simple, Getting Started
1523 @section An Example with Two Rules
1524 @cindex how @code{awk} works
1526 The @code{awk} utility reads the input files one line at a
1527 time. For each line, @code{awk} tries the patterns of each of the rules.
1528 If several patterns match then several actions are run, in the order in
1529 which they appear in the @code{awk} program. If no patterns match, then
1532 After processing all the rules (perhaps none) that match the line,
1533 @code{awk} reads the next line (however,
1534 @pxref{Next Statement, ,The @code{next} Statement},
1535 and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
1536 This continues until the end of the file is reached.
1538 For example, the @code{awk} program:
1546 contains two rules. The first rule has the string @samp{12} as the
1547 pattern and @samp{print $0} as the action. The second rule has the
1548 string @samp{21} as the pattern and also has @samp{print $0} as the
1549 action. Each rule's action is enclosed in its own pair of braces.
1551 This @code{awk} program prints every line that contains the string
1552 @samp{12} @emph{or} the string @samp{21}. If a line contains both
1553 strings, it is printed twice, once by each rule.
1555 This is what happens if we run this program on our two sample data files,
1556 @file{BBS-list} and @file{inventory-shipped}, as shown here:
1559 $ awk '/12/ @{ print $0 @}
1560 > /21/ @{ print $0 @}' BBS-list inventory-shipped
1561 @print{} aardvark 555-5553 1200/300 B
1562 @print{} alpo-net 555-3412 2400/1200/300 A
1563 @print{} barfly 555-7685 1200/300 A
1564 @print{} bites 555-1675 2400/1200/300 A
1565 @print{} core 555-2912 1200/300 C
1566 @print{} fooey 555-1234 2400/1200/300 B
1567 @print{} foot 555-6699 1200/300 B
1568 @print{} macfoo 555-6480 1200/300 A
1569 @print{} sdace 555-3430 2400/1200/300 A
1570 @print{} sabafoo 555-2127 1200/300 C
1571 @print{} sabafoo 555-2127 1200/300 C
1572 @print{} Jan 21 36 64 620
1573 @print{} Apr 21 70 74 514
1577 Note how the line in @file{BBS-list} beginning with @samp{sabafoo}
1578 was printed twice, once for each rule.
1580 @node More Complex, Statements/Lines, Two Rules, Getting Started
1581 @section A More Complex Example
1584 We have to use ls -lg here to get portable output across Unix systems.
1585 The POSIX ls matches this behavior too. Sigh.
1587 Here is an example to give you an idea of what typical @code{awk}
1588 programs do. This example shows how @code{awk} can be used to
1589 summarize, select, and rearrange the output of another utility. It uses
1590 features that haven't been covered yet, so don't worry if you don't
1591 understand all the details.
1594 ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
1595 END @{ print sum @}'
1598 @cindex @code{csh}, backslash continuation
1599 @cindex backslash continuation in @code{csh}
1600 This command prints the total number of bytes in all the files in the
1601 current directory that were last modified in November (of any year).
1602 (In the C shell you would need to type a semicolon and then a backslash
1603 at the end of the first line; in a POSIX-compliant shell, such as the
1604 Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example
1607 FIXME: how can users tell what shell they are running? Need a footnote
1608 or something, but getting into this is a distraction.
1611 The @w{@samp{ls -lg}} part of this example is a system command that gives
1612 you a listing of the files in a directory, including file size and the date
1613 the file was last modified. Its output looks like this:
1616 -rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
1617 -rw-r--r-- 1 arnold user 10809 Nov 7 13:03 gawk.h
1618 -rw-r--r-- 1 arnold user 983 Apr 13 12:14 gawk.tab.h
1619 -rw-r--r-- 1 arnold user 31869 Jun 15 12:20 gawk.y
1620 -rw-r--r-- 1 arnold user 22414 Nov 7 13:03 gawk1.c
1621 -rw-r--r-- 1 arnold user 37455 Nov 7 13:03 gawk2.c
1622 -rw-r--r-- 1 arnold user 27511 Dec 9 13:07 gawk3.c
1623 -rw-r--r-- 1 arnold user 7989 Nov 7 13:03 gawk4.c
1627 The first field contains read-write permissions, the second field contains
1628 the number of links to the file, and the third field identifies the owner of
1629 the file. The fourth field identifies the group of the file.
1630 The fifth field contains the size of the file in bytes. The
1631 sixth, seventh and eighth fields contain the month, day, and time,
1632 respectively, that the file was last modified. Finally, the ninth field
1633 contains the name of the file.
1635 @cindex automatic initialization
1636 @cindex initialization, automatic
1637 The @samp{$6 == "Nov"} in our @code{awk} program is an expression that
1638 tests whether the sixth field of the output from @w{@samp{ls -lg}}
1639 matches the string @samp{Nov}. Each time a line has the string
1640 @samp{Nov} for its sixth field, the action @samp{sum += $5} is
1641 performed. This adds the fifth field (the file size) to the variable
1642 @code{sum}. As a result, when @code{awk} has finished reading all the
1643 input lines, @code{sum} is the sum of the sizes of files whose
1644 lines matched the pattern. (This works because @code{awk} variables
1645 are automatically initialized to zero.)
1647 After the last line of output from @code{ls} has been processed, the
1648 @code{END} rule is executed, and the value of @code{sum} is
1649 printed. In this example, the value of @code{sum} would be 80600.
1651 These more advanced @code{awk} techniques are covered in later sections
1652 (@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more
1653 advanced @code{awk} programming, you have to know how @code{awk} interprets
1654 your input and displays your output. By manipulating fields and using
1655 @code{print} statements, you can produce some very useful and impressive
1658 @node Statements/Lines, Other Features, More Complex, Getting Started
1659 @section @code{awk} Statements Versus Lines
1663 Most often, each line in an @code{awk} program is a separate statement or
1664 separate rule, like this:
1667 awk '/12/ @{ print $0 @}
1668 /21/ @{ print $0 @}' BBS-list inventory-shipped
1671 However, @code{gawk} will ignore newlines after any of the following:
1674 , @{ ? : || && do else
1678 A newline at any other point is considered the end of the statement.
1679 (Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
1680 extension. The @samp{?} and @samp{:} referred to here is the
1681 three operand conditional expression described in
1682 @ref{Conditional Exp, ,Conditional Expressions}.)
1684 @cindex backslash continuation
1685 @cindex continuation of lines
1686 @cindex line continuation
1687 If you would like to split a single statement into two lines at a point
1688 where a newline would terminate it, you can @dfn{continue} it by ending the
1689 first line with a backslash character, @samp{\}. The backslash must be
1690 the final character on the line to be recognized as a continuation
1691 character. This is allowed absolutely anywhere in the statement, even
1692 in the middle of a string or regular expression. For example:
1695 awk '/This regular expression is too long, so continue it\
1696 on the next line/ @{ print $1 @}'
1700 @cindex portability issues
1701 We have generally not used backslash continuation in the sample programs
1702 in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the
1703 length of a line, it is never strictly necessary; it just makes programs
1704 more readable. For this same reason, as well as for clarity, we have
1705 kept most statements short in the sample programs presented throughout
1706 the @value{DOCUMENT}. Backslash continuation is most useful when your
1707 @code{awk} program is in a separate source file, instead of typed in on
1708 the command line. You should also note that many @code{awk}
1709 implementations are more particular about where you may use backslash
1710 continuation. For example, they may not allow you to split a string
1711 constant using backslash continuation. Thus, for maximal portability of
1712 your @code{awk} programs, it is best not to split your lines in the
1713 middle of a regular expression or a string.
1715 @cindex @code{csh}, backslash continuation
1716 @cindex backslash continuation in @code{csh}
1717 @strong{Caution: backslash continuation does not work as described above
1718 with the C shell.} Continuation with backslash works for @code{awk}
1719 programs in files, and also for one-shot programs @emph{provided} you
1720 are using a POSIX-compliant shell, such as the Bourne shell or Bash, the
1721 GNU Bourne-Again shell. But the C shell (@code{csh}) behaves
1722 differently! There, you must use two backslashes in a row, followed by
1723 a newline. Note also that when using the C shell, @emph{every} newline
1724 in your awk program must be escaped with a backslash. To illustrate:
1731 @print{} hello, world
1735 Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
1736 prompts, analogous to the standard shell's @samp{$} and @samp{>}.
1738 @code{awk} is a line-oriented language. Each rule's action has to
1739 begin on the same line as the pattern. To have the pattern and action
1740 on separate lines, you @emph{must} use backslash continuation---there
1743 @cindex backslash continuation and comments
1744 @cindex comments and backslash continuation
1745 Note that backslash continuation and comments do not mix. As soon
1746 as @code{awk} sees the @samp{#} that starts a comment, it ignores
1747 @emph{everything} on the rest of the line. For example:
1751 $ gawk 'BEGIN @{ print "dont panic" # a friendly \
1754 @error{} gawk: cmd. line:2: BEGIN rule
1755 @error{} gawk: cmd. line:2: ^ parse error
1760 Here, it looks like the backslash would continue the comment onto the
1761 next line. However, the backslash-newline combination is never even
1762 noticed, since it is ``hidden'' inside the comment. Thus, the
1763 @samp{BEGIN} is noted as a syntax error.
1765 @cindex multiple statements on one line
1766 When @code{awk} statements within one rule are short, you might want to put
1767 more than one of them on a line. You do this by separating the statements
1768 with a semicolon, @samp{;}.
1770 This also applies to the rules themselves.
1771 Thus, the previous program could have been written:
1774 /12/ @{ print $0 @} ; /21/ @{ print $0 @}
1778 @strong{Note:} the requirement that rules on the same line must be
1779 separated with a semicolon was not in the original @code{awk}
1780 language; it was added for consistency with the treatment of statements
1783 @node Other Features, When, Statements/Lines, Getting Started
1784 @section Other Features of @code{awk}
1786 The @code{awk} language provides a number of predefined, or built-in variables, which
1787 your programs can use to get information from @code{awk}. There are other
1788 variables your program can set to control how @code{awk} processes your
1791 In addition, @code{awk} provides a number of built-in functions for doing
1792 common computational and string related operations.
1794 As we develop our presentation of the @code{awk} language, we introduce
1795 most of the variables and many of the functions. They are defined
1796 systematically in @ref{Built-in Variables}, and
1797 @ref{Built-in, ,Built-in Functions}.
1799 @node When, , Other Features, Getting Started
1800 @section When to Use @code{awk}
1802 @cindex when to use @code{awk}
1803 @cindex applications of @code{awk}
1804 You might wonder how @code{awk} might be useful for you. Using
1805 utility programs, advanced patterns, field separators, arithmetic
1806 statements, and other selection criteria, you can produce much more
1807 complex output. The @code{awk} language is very useful for producing
1808 reports from large amounts of raw data, such as summarizing information
1809 from the output of other utility programs like @code{ls}.
1810 (@xref{More Complex, ,A More Complex Example}.)
1812 Programs written with @code{awk} are usually much smaller than they would
1813 be in other languages. This makes @code{awk} programs easy to compose and
1814 use. Often, @code{awk} programs can be quickly composed at your terminal,
1815 used once, and thrown away. Since @code{awk} programs are interpreted, you
1816 can avoid the (usually lengthy) compilation part of the typical
1817 edit-compile-test-debug cycle of software development.
1819 Complex programs have been written in @code{awk}, including a complete
1820 retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
1821 more information) and a microcode assembler for a special purpose Prolog
1822 computer. However, @code{awk}'s capabilities are strained by tasks of
1825 If you find yourself writing @code{awk} scripts of more than, say, a few
1826 hundred lines, you might consider using a different programming
1827 language. Emacs Lisp is a good choice if you need sophisticated string
1828 or pattern matching capabilities. The shell is also good at string and
1829 pattern matching; in addition, it allows powerful use of the system
1830 utilities. More conventional languages, such as C, C++, and Lisp, offer
1831 better facilities for system programming and for managing the complexity
1832 of large programs. Programs in these languages may require more lines
1833 of source code than the equivalent @code{awk} programs, but they are
1834 easier to maintain and usually run more efficiently.
1836 @node One-liners, Regexp, Getting Started, Top
1837 @chapter Useful One Line Programs
1840 Many useful @code{awk} programs are short, just a line or two. Here is a
1841 collection of useful, short programs to get you started. Some of these
1842 programs contain constructs that haven't been covered yet. The description
1843 of the program will give you a good idea of what is going on, but please
1844 read the rest of the @value{DOCUMENT} to become an @code{awk} expert!
1846 Most of the examples use a data file named @file{data}. This is just a
1847 placeholder; if you were to use these programs yourself, you would substitute
1848 your own file names for @file{data}.
1851 Since you are reading this in Info, each line of the example code is
1852 enclosed in quotes, to represent text that you would type literally.
1853 The examples themselves represent shell commands that use single quotes
1854 to keep the shell from interpreting the contents of the program.
1855 When reading the examples, focus on the text between the open and close
1860 @item awk '@{ if (length($0) > max) max = length($0) @}
1861 @itemx @ @ @ @ @ END @{ print max @}' data
1862 This program prints the length of the longest input line.
1864 @item awk 'length($0) > 80' data
1865 This program prints every line that is longer than 80 characters. The sole
1866 rule has a relational expression as its pattern, and has no action (so the
1867 default action, printing the record, is used).
1869 @item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
1870 @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
1871 This program prints the length of the longest line in @file{data}. The input
1872 is processed by the @code{expand} program to change tabs into spaces,
1873 so the widths compared are actually the right-margin columns.
1875 @item awk 'NF > 0' data
1876 This program prints every line that has at least one field. This is an
1877 easy way to delete blank lines from a file (or rather, to create a new
1878 file similar to the old file but from which the blank lines have been
1881 @c Karl Berry points out that new users probably don't want to see
1882 @c multiple ways to do things, just the `best' way. He's probably
1883 @c right. At some point it might be worth adding something about there
1884 @c often being multiple ways to do things in awk, but for now we'll
1885 @c just take this one out.
1887 @item awk '@{ if (NF > 0) print @}' data
1888 This program also prints every line that has at least one field. Here we
1889 allow the rule to match every line, and then decide in the action whether
1893 @item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
1894 @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
1895 This program prints seven random numbers from zero to 100, inclusive.
1897 @item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
1898 This program prints the total number of bytes used by @var{files}.
1900 @item ls -lg @var{files} | awk '@{ x += $5 @}
1901 @itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
1902 This program prints the total number of kilobytes used by @var{files}.
1904 @item awk -F: '@{ print $1 @}' /etc/passwd | sort
1905 This program prints a sorted list of the login names of all users.
1907 @item awk 'END @{ print NR @}' data
1908 This program counts lines in a file.
1910 @item awk 'NR % 2 == 0' data
1911 This program prints the even numbered lines in the data file.
1912 If you were to use the expression @samp{NR % 2 == 1} instead,
1913 it would print the odd numbered lines.
1916 @node Regexp, Reading Files, One-liners, Top
1917 @chapter Regular Expressions
1918 @cindex pattern, regular expressions
1920 @cindex regular expression
1921 @cindex regular expressions as patterns
1923 A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
1925 Because regular expressions are such a fundamental part of @code{awk}
1926 programming, their format and use deserve a separate chapter.
1928 A regular expression enclosed in slashes (@samp{/})
1929 is an @code{awk} pattern that matches every input record whose text
1930 belongs to that set.
1932 The simplest regular expression is a sequence of letters, numbers, or
1933 both. Such a regexp matches any string that contains that sequence.
1934 Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
1935 Therefore, the pattern @code{/foo/} matches any input record containing
1936 the three characters @samp{foo}, @emph{anywhere} in the record. Other
1937 kinds of regexps let you specify more complicated classes of strings.
1940 Initially, the examples will be simple. As we explain more about how
1941 regular expressions work, we will present more complicated examples.
1945 * Regexp Usage:: How to Use Regular Expressions.
1946 * Escape Sequences:: How to write non-printing characters.
1947 * Regexp Operators:: Regular Expression Operators.
1948 * GNU Regexp Operators:: Operators specific to GNU software.
1949 * Case-sensitivity:: How to do case-insensitive matching.
1950 * Leftmost Longest:: How much text matches.
1951 * Computed Regexps:: Using Dynamic Regexps.
1954 @node Regexp Usage, Escape Sequences, Regexp, Regexp
1955 @section How to Use Regular Expressions
1957 A regular expression can be used as a pattern by enclosing it in
1958 slashes. Then the regular expression is tested against the
1959 entire text of each record. (Normally, it only needs
1960 to match some part of the text in order to succeed.) For example, this
1961 prints the second field of each record that contains the three
1962 characters @samp{foo} anywhere in it:
1966 $ awk '/foo/ @{ print $2 @}' BBS-list
1974 @cindex regexp matching operators
1975 @cindex string-matching operators
1976 @cindex operators, string-matching
1977 @cindex operators, regexp matching
1978 @cindex regexp match/non-match operators
1979 @cindex @code{~} operator
1980 @cindex @code{!~} operator
1981 Regular expressions can also be used in matching expressions. These
1982 expressions allow you to specify the string to match against; it need
1983 not be the entire current input record. The two operators, @samp{~}
1984 and @samp{!~}, perform regular expression comparisons. Expressions
1985 using these operators can be used as patterns or in @code{if},
1986 @code{while}, @code{for}, and @code{do} statements.
1988 @c adding this xref in TeX screws up the formatting too much
1989 (@xref{Statements, ,Control Statements in Actions}.)
1993 @item @var{exp} ~ /@var{regexp}/
1994 This is true if the expression @var{exp} (taken as a string)
1995 is matched by @var{regexp}. The following example matches, or selects,
1996 all input records with the upper-case letter @samp{J} somewhere in the
2001 $ awk '$1 ~ /J/' inventory-shipped
2002 @print{} Jan 13 25 15 115
2003 @print{} Jun 31 42 75 492
2004 @print{} Jul 24 34 67 436
2005 @print{} Jan 21 36 64 620
2012 awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
2015 @item @var{exp} !~ /@var{regexp}/
2016 This is true if the expression @var{exp} (taken as a character string)
2017 is @emph{not} matched by @var{regexp}. The following example matches,
2018 or selects, all input records whose first field @emph{does not} contain
2019 the upper-case letter @samp{J}:
2023 $ awk '$1 !~ /J/' inventory-shipped
2024 @print{} Feb 15 32 24 226
2025 @print{} Mar 15 24 34 228
2026 @print{} Apr 31 52 63 420
2027 @print{} May 16 34 29 208
2033 @cindex regexp constant
2034 When a regexp is written enclosed in slashes, like @code{/foo/}, we call it
2035 a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and
2036 @code{"foo"} is a string constant.
2038 @node Escape Sequences, Regexp Operators, Regexp Usage, Regexp
2039 @section Escape Sequences
2041 @cindex escape sequence notation
2042 Some characters cannot be included literally in string constants
2043 (@code{"foo"}) or regexp constants (@code{/foo/}). You represent them
2044 instead with @dfn{escape sequences}, which are character sequences
2045 beginning with a backslash (@samp{\}).
2047 One use of an escape sequence is to include a double-quote character in
2048 a string constant. Since a plain double-quote would end the string, you
2049 must use @samp{\"} to represent an actual double-quote character as a
2050 part of the string. For example:
2053 $ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
2054 @print{} He said "hi!" to her.
2057 The backslash character itself is another character that cannot be
2058 included normally; you write @samp{\\} to put one backslash in the
2059 string or regexp. Thus, the string whose contents are the two characters
2060 @samp{"} and @samp{\} must be written @code{"\"\\"}.
2062 Another use of backslash is to represent unprintable characters
2063 such as tab or newline. While there is nothing to stop you from entering most
2064 unprintable characters directly in a string constant or regexp constant,
2067 Here is a table of all the escape sequences used in @code{awk}, and
2068 what they represent. Unless noted otherwise, all of these escape
2069 sequences apply to both string constants and regexp constants.
2074 A literal backslash, @samp{\}.
2076 @cindex @code{awk} language, V.4 version
2078 The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).
2081 Backspace, @kbd{Control-h}, ASCII code 8 (BS).
2084 Formfeed, @kbd{Control-l}, ASCII code 12 (FF).
2087 Newline, @kbd{Control-j}, ASCII code 10 (LF).
2090 Carriage return, @kbd{Control-m}, ASCII code 13 (CR).
2093 Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).
2095 @cindex @code{awk} language, V.4 version
2097 Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).
2100 The octal value @var{nnn}, where @var{nnn} are one to three digits
2101 between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
2102 (escape) character is @samp{\033}.
2104 @cindex @code{awk} language, V.4 version
2105 @cindex @code{awk} language, POSIX version
2106 @cindex POSIX @code{awk}
2107 @item \x@var{hh}@dots{}
2108 The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
2109 digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
2110 @samp{a} through @samp{f}). Like the same construct in ANSI C, the escape
2111 sequence continues until the first non-hexadecimal digit is seen. However,
2112 using more than two hexadecimal digits produces undefined results. (The
2113 @samp{\x} escape sequence is not allowed in POSIX @code{awk}.)
2116 A literal slash (necessary for regexp constants only).
2117 You use this when you wish to write a regexp
2118 constant that contains a slash. Since the regexp is delimited by
2119 slashes, you need to escape the slash that is part of the pattern,
2120 in order to tell @code{awk} to keep processing the rest of the regexp.
2123 A literal double-quote (necessary for string constants only).
2124 You use this when you wish to write a string
2125 constant that contains a double-quote. Since the string is delimited by
2126 double-quotes, you need to escape the quote that is part of the string,
2127 in order to tell @code{awk} to keep processing the rest of the string.
2131 In @code{gawk}, there are additional two character sequences that begin
2132 with backslash that have special meaning in regexps.
2133 @xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2135 In a string constant,
2136 what happens if you place a backslash before something that is not one of
2137 the characters listed above? POSIX @code{awk} purposely leaves this case
2138 undefined. There are two choices.
2142 Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do.
2143 For example, @code{"a\qc"} is the same as @code{"aqc"}.
2146 Leave the backslash alone. Some other @code{awk} implementations do this.
2147 In such implementations, @code{"a\qc"} is the same as if you had typed
2151 In a regexp, a backslash before any character that is not in the above table,
2153 @ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},
2154 means that the next character should be taken literally, even if it would
2155 normally be a regexp operator. E.g., @code{/a\+b/} matches the three
2156 characters @samp{a+b}.
2158 @cindex portability issues
2159 For complete portability, do not use a backslash before any character not
2160 listed in the table above.
2162 Another interesting question arises. Suppose you use an octal or hexadecimal
2163 escape to represent a regexp metacharacter
2164 (@pxref{Regexp Operators, , Regular Expression Operators}).
2165 Does @code{awk} treat the character as a literal character, or as a regexp
2169 It turns out that historically, such characters were taken literally (d.c.).
2170 However, the POSIX standard indicates that they should be treated
2171 as real metacharacters, and this is what @code{gawk} does.
2172 However, in compatibility mode (@pxref{Options, ,Command Line Options}),
2173 @code{gawk} treats the characters represented by octal and hexadecimal
2174 escape sequences literally when used in regexp constants. Thus,
2175 @code{/a\52b/} is equivalent to @code{/a\*b/}.
2181 The escape sequences in the table above are always processed first,
2182 for both string constants and regexp constants. This happens very early,
2183 as soon as @code{awk} reads your program.
2186 @code{gawk} processes both regexp constants and dynamic regexps
2187 (@pxref{Computed Regexps, ,Using Dynamic Regexps}),
2188 for the special operators listed in
2189 @ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.
2192 A backslash before any other character means to treat that character
2196 @node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
2197 @section Regular Expression Operators
2198 @cindex metacharacters
2199 @cindex regular expression metacharacters
2200 @cindex regexp operators
2202 You can combine regular expressions with the following characters,
2203 called @dfn{regular expression operators}, or @dfn{metacharacters}, to
2204 increase the power and versatility of regular expressions.
2206 The escape sequences described
2210 in @ref{Escape Sequences},
2211 are valid inside a regexp. They are introduced by a @samp{\}. They
2212 are recognized and converted into the corresponding real characters as
2213 the very first step in processing regexps.
2215 Here is a table of metacharacters. All characters that are not escape
2216 sequences and that are not listed in the table stand for themselves.
2220 This is used to suppress the special meaning of a character when
2221 matching. For example:
2228 matches the character @samp{$}.
2232 @cindex anchors in regexps
2233 @cindex regexp, anchors
2235 This matches the beginning of a string. For example:
2242 matches the @samp{@@chapter} at the beginning of a string, and can be used
2243 to identify chapter beginnings in Texinfo source files.
2244 The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
2245 matching only at the beginning of the string.
2247 It is important to realize that @samp{^} does not match the beginning of
2248 a line embedded in a string. In this example the condition is not true:
2251 if ("line1\nLINE 2" ~ /^L/) @dots{}
2255 This is similar to @samp{^}, but it matches only at the end of a string.
2263 matches a record that ends with a @samp{p}. The @samp{$} is also an anchor,
2264 and also does not match the end of a line embedded in a string. In this
2265 example the condition is not true:
2268 if ("line1\nLINE 2" ~ /1$/) @dots{}
2272 The period, or dot, matches any single character,
2273 @emph{including} the newline character. For example:
2280 matches any single character followed by a @samp{P} in a string. Using
2281 concatenation we can make a regular expression like @samp{U.A}, which
2282 matches any three-character sequence that begins with @samp{U} and ends
2285 @cindex @code{awk} language, POSIX version
2286 @cindex POSIX @code{awk}
2287 In strict POSIX mode (@pxref{Options, ,Command Line Options}),
2288 @samp{.} does not match the @sc{nul}
2289 character, which is a character with all bits equal to zero.
2290 Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
2291 may not be able to match the @sc{nul} character.
2294 2e: Add stuff that character list is the POSIX terminology. In other
2295 literature known as character set or character class.
2298 @cindex character list
2300 This is called a @dfn{character list}. It matches any @emph{one} of the
2301 characters that are enclosed in the square brackets. For example:
2308 matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a
2311 Ranges of characters are indicated by using a hyphen between the beginning
2312 and ending characters, and enclosing the whole thing in brackets. For
2321 Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a
2322 common way to express the idea of ``all alphanumeric characters.''
2324 To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a
2325 character list, put a @samp{\} in front of it. For example:
2332 matches either @samp{d}, or @samp{]}.
2334 @cindex @code{egrep}
2335 This treatment of @samp{\} in character lists
2336 is compatible with other @code{awk}
2337 implementations, and is also mandated by POSIX.
2338 The regular expressions in @code{awk} are a superset
2339 of the POSIX specification for Extended Regular Expressions (EREs).
2340 POSIX EREs are based on the regular expressions accepted by the
2341 traditional @code{egrep} utility.
2343 @cindex character classes
2344 @cindex @code{awk} language, POSIX version
2345 @cindex POSIX @code{awk}
2346 @dfn{Character classes} are a new feature introduced in the POSIX standard.
2347 A character class is a special notation for describing
2348 lists of characters that have a specific attribute, but where the
2349 actual characters themselves can vary from country to country and/or
2350 from character set to character set. For example, the notion of what
2351 is an alphabetic character differs in the USA and in France.
2353 A character class is only valid in a regexp @emph{inside} the
2354 brackets of a character list. Character classes consist of @samp{[:},
2355 a keyword denoting the class, and @samp{:]}. Here are the character
2356 classes defined by the POSIX standard.
2360 Alphanumeric characters.
2363 Alphabetic characters.
2366 Space and tab characters.
2375 Characters that are printable and are also visible.
2376 (A space is printable, but not visible, while an @samp{a} is both.)
2379 Lower-case alphabetic characters.
2382 Printable characters (characters that are not control characters.)
2385 Punctuation characters (characters that are not letter, digits,
2386 control characters, or space characters).
2389 Space characters (such as space, tab, and formfeed, to name a few).
2392 Upper-case alphabetic characters.
2395 Characters that are hexadecimal digits.
2398 For example, before the POSIX standard, to match alphanumeric
2399 characters, you had to write @code{/[A-Za-z0-9]/}. If your
2400 character set had other alphabetic characters in it, this would not
2401 match them. With the POSIX character classes, you can write
2402 @code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
2403 and numeric characters in your character set.
2405 @cindex collating elements
2406 Two additional special sequences can appear in character lists.
2407 These apply to non-ASCII character sets, which can have single symbols
2408 (called @dfn{collating elements}) that are represented with more than one
2409 character, as well as several characters that are equivalent for
2410 @dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e''
2411 and a grave-accented ``@`e'' are equivalent.)
2414 @cindex collating symbols
2415 @item Collating Symbols
2416 A @dfn{collating symbol} is a multi-character collating element enclosed in
2417 @samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
2418 then @code{[[.ch.]]} is a regexp that matches this collating element, while
2419 @code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.
2421 @cindex equivalence classes
2422 @item Equivalence Classes
2423 An @dfn{equivalence class} is a locale-specific name for a list of
2424 characters that are equivalent. The name is enclosed in
2425 @samp{[=} and @samp{=]}.
2426 For example, the name @samp{e} might be used to represent all of
2427 ``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e]]} is a regexp
2428 that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}.
2431 These features are very valuable in non-English speaking locales.
2433 @strong{Caution:} The library functions that @code{gawk} uses for regular
2434 expression matching currently only recognize POSIX character classes;
2435 they do not recognize collating symbols or equivalence classes.
2436 @c maybe one day ...
2438 @cindex complemented character list
2439 @cindex character list, complemented
2441 This is a @dfn{complemented character list}. The first character after
2442 the @samp{[} @emph{must} be a @samp{^}. It matches any characters
2443 @emph{except} those in the square brackets. For example:
2450 matches any character that is not a digit.
2453 This is the @dfn{alternation operator}, and it is used to specify
2454 alternatives. For example:
2461 matches any string that matches either @samp{^P} or @samp{[0-9]}. This
2462 means it matches any string that starts with @samp{P} or contains a digit.
2464 The alternation applies to the largest possible regexps on either side.
2465 In other words, @samp{|} has the lowest precedence of all the regular
2466 expression operators.
2469 Parentheses are used for grouping in regular expressions as in
2470 arithmetic. They can be used to concatenate regular expressions
2471 containing the alternation operator, @samp{|}. For example,
2472 @samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and
2473 @samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.)
2476 This symbol means that the preceding regular expression is to be
2477 repeated as many times as necessary to find a match. For example:
2484 applies the @samp{*} symbol to the preceding @samp{h} and looks for matches
2485 of one @samp{p} followed by any number of @samp{h}s. This will also match
2486 just @samp{p} if no @samp{h}s are present.
2488 The @samp{*} repeats the @emph{smallest} possible preceding expression.
2489 (Use parentheses if you wish to repeat a larger expression.) It finds
2490 as many repetitions as possible. For example:
2493 awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample
2497 prints every record in @file{sample} containing a string of the form
2498 @samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.
2499 Notice the escaping of the parentheses by preceding them
2503 This symbol is similar to @samp{*}, but the preceding expression must be
2504 matched at least once. This means that:
2511 would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas
2512 @samp{wh*y} would match all three of these strings. This is a simpler
2513 way of writing the last @samp{*} example:
2516 awk '/\(c[ad]+r x\)/ @{ print @}' sample
2520 This symbol is similar to @samp{*}, but the preceding expression can be
2521 matched either once or not at all. For example:
2528 will match @samp{fed} and @samp{fd}, but nothing else.
2530 @cindex @code{awk} language, POSIX version
2531 @cindex POSIX @code{awk}
2532 @cindex interval expressions
2535 @itemx @{@var{n},@var{m}@}
2536 One or two numbers inside braces denote an @dfn{interval expression}.
2537 If there is one number in the braces, the preceding regexp is repeated
2539 If there are two numbers separated by a comma, the preceding regexp is
2540 repeated @var{n} to @var{m} times.
2541 If there is one number followed by a comma, then the preceding regexp
2542 is repeated at least @var{n} times.
2546 matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.
2549 matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.
2552 matches @samp{whhy} or @samp{whhhy}, and so on.
2555 Interval expressions were not traditionally available in @code{awk}.
2556 As part of the POSIX standard they were added, to make @code{awk}
2557 and @code{egrep} consistent with each other.
2559 However, since old programs may use @samp{@{} and @samp{@}} in regexp
2560 constants, by default @code{gawk} does @emph{not} match interval expressions
2561 in regexps. If either @samp{--posix} or @samp{--re-interval} are specified
2562 (@pxref{Options, , Command Line Options}), then interval expressions
2563 are allowed in regexps.
2566 @cindex precedence, regexp operators
2567 @cindex regexp operators, precedence of
2568 In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
2569 as well as the braces @samp{@{} and @samp{@}},
2571 the highest precedence, followed by concatenation, and finally by @samp{|}.
2572 As in arithmetic, parentheses can change how operators are grouped.
2574 If @code{gawk} is in compatibility mode
2575 (@pxref{Options, ,Command Line Options}),
2576 character classes and interval expressions are not available in
2577 regular expressions.
2586 discusses the GNU-specific regexp operators, and provides
2587 more detail concerning how command line options affect the way @code{gawk}
2588 interprets the characters in regular expressions.
2590 @node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp
2591 @section Additional Regexp Operators Only in @code{gawk}
2593 @c This section adapted from the regex-0.12 manual
2595 @cindex regexp operators, GNU specific
2596 GNU software that deals with regular expressions provides a number of
2597 additional regexp operators. These operators are described in this
2598 section, and are specific to @code{gawk}; they are not available in other
2599 @code{awk} implementations.
2601 @cindex word, regexp definition of
2602 Most of the additional operators are for dealing with word matching.
2603 For our purposes, a @dfn{word} is a sequence of one or more letters, digits,
2604 or underscores (@samp{_}).
2607 @cindex @code{\w} regexp operator
2609 This operator matches any word-constituent character, i.e.@: any
2610 letter, digit, or underscore. Think of it as a short-hand for
2611 @c @w{@code{[A-Za-z0-9_]}} or
2612 @w{@code{[[:alnum:]_]}}.
2614 @cindex @code{\W} regexp operator
2616 This operator matches any character that is not word-constituent.
2617 Think of it as a short-hand for
2618 @c @w{@code{[^A-Za-z0-9_]}} or
2619 @w{@code{[^[:alnum:]_]}}.
2621 @cindex @code{\<} regexp operator
2623 This operator matches the empty string at the beginning of a word.
2624 For example, @code{/\<away/} matches @samp{away}, but not
2627 @cindex @code{\>} regexp operator
2629 This operator matches the empty string at the end of a word.
2630 For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}.
2632 @cindex @code{\y} regexp operator
2633 @cindex word boundaries, matching
2635 This operator matches the empty string at either the beginning or the
2636 end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y}
2637 matches either @samp{ball} or @samp{balls} as a separate word.
2639 @cindex @code{\B} regexp operator
2641 This operator matches the empty string within a word. In other words,
2642 @samp{\B} matches the empty string that occurs between two
2643 word-constituent characters. For example,
2644 @code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}.
2645 @samp{\B} is essentially the opposite of @samp{\y}.
2648 There are two other operators that work on buffers. In Emacs, a
2649 @dfn{buffer} is, naturally, an Emacs buffer. For other programs, the
2650 regexp library routines that @code{gawk} uses consider the entire
2651 string to be matched as the buffer.
2653 For @code{awk}, since @samp{^} and @samp{$} always work in terms
2654 of the beginning and end of strings, these operators don't add any
2655 new capabilities. They are provided for compatibility with other GNU
2658 @cindex buffer matching operators
2660 @cindex @code{\`} regexp operator
2662 This operator matches the empty string at the
2663 beginning of the buffer.
2665 @cindex @code{\'} regexp operator
2667 This operator matches the empty string at the
2671 In other GNU software, the word boundary operator is @samp{\b}. However,
2672 that conflicts with the @code{awk} language's definition of @samp{\b}
2673 as backspace, so @code{gawk} uses a different letter.
2675 An alternative method would have been to require two backslashes in the
2676 GNU operators, but this was deemed to be too confusing, and the current
2677 method of using @samp{\y} for the GNU @samp{\b} appears to be the
2678 lesser of two evils.
2680 @c NOTE!!! Keep this in sync with the same table in the summary appendix!
2681 @cindex regexp, effect of command line options
2682 The various command line options
2683 (@pxref{Options, ,Command Line Options})
2684 control how @code{gawk} interprets characters in regexps.
2688 In the default case, @code{gawk} provides all the facilities of
2689 POSIX regexps and the GNU regexp operators described
2694 in @ref{Regexp Operators, ,Regular Expression Operators}.
2696 However, interval expressions are not supported.
2698 @item @code{--posix}
2699 Only POSIX regexps are supported, the GNU operators are not special
2700 (e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
2703 @item @code{--traditional}
2704 Traditional Unix @code{awk} regexps are matched. The GNU operators
2705 are not special, interval expressions are not available, and neither
2706 are the POSIX character classes (@code{[[:alnum:]]} and so on).
2707 Characters described by octal and hexadecimal escape sequences are
2708 treated literally, even if they represent regexp metacharacters.
2710 @item @code{--re-interval}
2711 Allow interval expressions in regexps, even if @samp{--traditional}
2715 @node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
2716 @section Case-sensitivity in Matching
2718 @cindex case sensitivity
2719 @cindex ignoring case
2720 Case is normally significant in regular expressions, both when matching
2721 ordinary characters (i.e.@: not metacharacters), and inside character
2722 sets. Thus a @samp{w} in a regular expression matches only a lower-case
2723 @samp{w} and not an upper-case @samp{W}.
2725 The simplest way to do a case-independent match is to use a character
2726 list: @samp{[Ww]}. However, this can be cumbersome if you need to use it
2727 often; and it can make the regular expressions harder to
2728 read. There are two alternatives that you might prefer.
2730 One way to do a case-insensitive match at a particular point in the
2731 program is to convert the data to a single case, using the
2732 @code{tolower} or @code{toupper} built-in string functions (which we
2733 haven't discussed yet;
2734 @pxref{String Functions, ,Built-in Functions for String Manipulation}).
2738 tolower($1) ~ /foo/ @{ @dots{} @}
2742 converts the first field to lower-case before matching against it.
2743 This will work in any POSIX-compliant implementation of @code{awk}.
2745 @cindex differences between @code{gawk} and @code{awk}
2746 @cindex @code{~} operator
2747 @cindex @code{!~} operator
2749 Another method, specific to @code{gawk}, is to set the variable
2750 @code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}).
2751 When @code{IGNORECASE} is not zero, @emph{all} regexp and string
2752 operations ignore case. Changing the value of
2753 @code{IGNORECASE} dynamically controls the case sensitivity of your
2754 program as it runs. Case is significant by default because
2755 @code{IGNORECASE} (like most variables) is initialized to zero.
2760 if (x ~ /ab/) @dots{} # this test will fail
2765 if (x ~ /ab/) @dots{} # now it will succeed
2769 In general, you cannot use @code{IGNORECASE} to make certain rules
2770 case-insensitive and other rules case-sensitive, because there is no way
2771 to set @code{IGNORECASE} just for the pattern of a particular rule.
2773 This isn't quite true. Consider:
2775 IGNORECASE=1 && /foObAr/ { .... }
2776 IGNORECASE=0 || /foobar/ { .... }
2778 But that's pretty bad style and I don't want to get into it at this
2781 To do this, you must use character lists or @code{tolower}. However, one
2782 thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
2783 or off dynamically for all the rules at once.
2785 @code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
2786 (@pxref{Other Arguments, ,Other Command Line Arguments}; also
2787 @pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
2788 Setting @code{IGNORECASE} from the command line is a way to make
2789 a program case-insensitive without having to edit it.
2791 Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
2792 only affected regexp operations. It did not affect string comparison
2793 with @samp{==}, @samp{!=}, and so on.
2794 Beginning with version 3.0, both regexp and string comparison
2795 operations are affected by @code{IGNORECASE}.
2799 Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
2800 and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
2801 character set. This character set is a superset of the traditional 128
2802 ASCII characters, that also provides a number of characters suitable
2803 for use with European languages.
2805 A pure ASCII character set can be used instead if @code{gawk} is compiled
2806 with @samp{-DUSE_PURE_ASCII}.
2809 The value of @code{IGNORECASE} has no effect if @code{gawk} is in
2810 compatibility mode (@pxref{Options, ,Command Line Options}).
2811 Case is always significant in compatibility mode.
2813 @node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp
2814 @section How Much Text Matches?
2816 @cindex leftmost longest match
2817 @cindex matching, leftmost longest
2818 Consider the following example:
2821 echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
2824 This example uses the @code{sub} function (which we haven't discussed yet,
2825 @pxref{String Functions, ,Built-in Functions for String Manipulation})
2826 to make a change to the input record. Here, the regexp @code{/a+/}
2827 indicates ``one or more @samp{a} characters,'' and the replacement
2830 The input contains four @samp{a} characters. What will the output be?
2831 In other words, how many is ``one or more''---will @code{awk} match two,
2832 three, or all four @samp{a} characters?
2834 The answer is, @code{awk} (and POSIX) regular expressions always match
2835 the leftmost, @emph{longest} sequence of input characters that can
2836 match. Thus, in this example, all four @samp{a} characters are
2837 replaced with @samp{<A>}.
2840 $ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}'
2844 For simple match/no-match tests, this is not so important. But when doing
2845 text matching and substitutions with the @code{match}, @code{sub}, @code{gsub},
2846 and @code{gensub} functions, it is very important.
2848 @xref{String Functions, ,Built-in Functions for String Manipulation},
2849 for more information on these functions.
2851 Understanding this principle is also important for regexp-based record
2852 and field splitting (@pxref{Records, ,How Input is Split into Records},
2853 and also @pxref{Field Separators, ,Specifying How Fields are Separated}).
2855 @node Computed Regexps, , Leftmost Longest, Regexp
2856 @section Using Dynamic Regexps
2858 @cindex computed regular expressions
2859 @cindex regular expressions, computed
2860 @cindex dynamic regular expressions
2861 @cindex regexp, dynamic
2862 @cindex @code{~} operator
2863 @cindex @code{!~} operator
2864 The right hand side of a @samp{~} or @samp{!~} operator need not be a
2865 regexp constant (i.e.@: a string of characters between slashes). It may
2866 be any expression. The expression is evaluated, and converted if
2867 necessary to a string; the contents of the string are used as the
2868 regexp. A regexp that is computed in this way is called a @dfn{dynamic
2869 regexp}. For example:
2872 BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @}
2873 $0 ~ identifier_regexp @{ print @}
2877 sets @code{identifier_regexp} to a regexp that describes @code{awk}
2878 variable names, and tests if the input record matches this regexp.
2881 Do we want to use "^[A-Za-z_][A-Za-z_0-9]*$" to restrict the entire
2882 record to just identifiers? Doing that also would disrupt the flow of
2886 @strong{Caution:} When using the @samp{~} and @samp{!~}
2887 operators, there is a difference between a regexp constant
2888 enclosed in slashes, and a string constant enclosed in double quotes.
2889 If you are going to use a string constant, you have to understand that
2890 the string is in essence scanned @emph{twice}; the first time when
2891 @code{awk} reads your program, and the second time when it goes to
2892 match the string on the left-hand side of the operator with the pattern
2893 on the right. This is true of any string valued expression (such as
2894 @code{identifier_regexp} above), not just string constants.
2896 @cindex regexp constants, difference between slashes and quotes
2897 What difference does it make if the string is
2898 scanned twice? The answer has to do with escape sequences, and particularly
2899 with backslashes. To get a backslash into a regular expression inside a
2900 string, you have to type two backslashes.
2902 For example, @code{/\*/} is a regexp constant for a literal @samp{*}.
2903 Only one backslash is needed. To do the same thing with a string,
2904 you would have to type @code{"\\*"}. The first backslash escapes the
2905 second one, so that the string actually contains the
2906 two characters @samp{\} and @samp{*}.
2908 @cindex common mistakes
2909 @cindex mistakes, common
2910 @cindex errors, common
2911 Given that you can use both regexp and string constants to describe
2912 regular expressions, which should you use? The answer is ``regexp
2913 constants,'' for several reasons.
2917 String constants are more complicated to write, and
2918 more difficult to read. Using regexp constants makes your programs
2919 less error-prone. Not understanding the difference between the two
2920 kinds of constants is a common source of errors.
2923 It is also more efficient to use regexp constants: @code{awk} can note
2924 that you have supplied a regexp and store it internally in a form that
2925 makes pattern matching more efficient. When using a string constant,
2926 @code{awk} must first convert the string into this internal form, and
2927 then perform the pattern matching.
2930 Using regexp constants is better style; it shows clearly that you
2931 intend a regexp match.
2934 @node Reading Files, Printing, Regexp, Top
2935 @chapter Reading Input Files
2937 @cindex reading files
2939 @cindex standard input
2941 In the typical @code{awk} program, all input is read either from the
2942 standard input (by default the keyboard, but often a pipe from another
2943 command) or from files whose names you specify on the @code{awk} command
2944 line. If you specify input files, @code{awk} reads them in order, reading
2945 all the data from one before going on to the next. The name of the current
2946 input file can be found in the built-in variable @code{FILENAME}
2947 (@pxref{Built-in Variables}).
2949 The input is read in units called @dfn{records}, and processed by the
2950 rules of your program one record at a time.
2951 By default, each record is one line. Each
2952 record is automatically split into chunks called @dfn{fields}.
2953 This makes it more convenient for programs to work on the parts of a record.
2955 On rare occasions you will need to use the @code{getline} command.
2956 The @code{getline} command is valuable, both because it
2957 can do explicit input from any number of files, and because the files
2958 used with it do not have to be named on the @code{awk} command line
2959 (@pxref{Getline, ,Explicit Input with @code{getline}}).
2962 * Records:: Controlling how data is split into records.
2963 * Fields:: An introduction to fields.
2964 * Non-Constant Fields:: Non-constant Field Numbers.
2965 * Changing Fields:: Changing the Contents of a Field.
2966 * Field Separators:: The field separator and how to change it.
2967 * Constant Size:: Reading constant width data.
2968 * Multiple Line:: Reading multi-line records.
2969 * Getline:: Reading files under explicit program control
2970 using the @code{getline} function.
2973 @node Records, Fields, Reading Files, Reading Files
2974 @section How Input is Split into Records
2976 @cindex record separator, @code{RS}
2977 @cindex changing the record separator
2978 @cindex record, definition of
2980 The @code{awk} utility divides the input for your @code{awk}
2981 program into records and fields.
2982 Records are separated by a character called the @dfn{record separator}.
2983 By default, the record separator is the newline character.
2984 This is why records are, by default, single lines.
2985 You can use a different character for the record separator by
2986 assigning the character to the built-in variable @code{RS}.
2988 You can change the value of @code{RS} in the @code{awk} program,
2989 like any other variable, with the
2990 assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
2991 The new record-separator character should be enclosed in quotation marks,
2993 a string constant. Often the right time to do this is at the beginning
2994 of execution, before any input has been processed, so that the very
2995 first record will be read with the proper separator. To do this, use
2996 the special @code{BEGIN} pattern
2997 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For
3001 awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
3005 changes the value of @code{RS} to @code{"/"}, before reading any input.
3006 This is a string whose first character is a slash; as a result, records
3007 are separated by slashes. Then the input file is read, and the second
3008 rule in the @code{awk} program (the action with no pattern) prints each
3009 record. Since each @code{print} statement adds a newline at the end of
3010 its output, the effect of this @code{awk} program is to copy the input
3011 with each slash changed to a newline. Here are the results of running
3012 the program on @file{BBS-list}:
3016 $ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list
3017 @print{} aardvark 555-5553 1200
3019 @print{} alpo-net 555-3412 2400
3022 @print{} barfly 555-7685 1200
3024 @print{} bites 555-1675 2400
3027 @print{} camelot 555-0542 300 C
3028 @print{} core 555-2912 1200
3030 @print{} fooey 555-1234 2400
3033 @print{} foot 555-6699 1200
3035 @print{} macfoo 555-6480 1200
3037 @print{} sdace 555-3430 2400
3040 @print{} sabafoo 555-2127 1200
3047 Note that the entry for the @samp{camelot} BBS is not split.
3048 In the original data file
3049 (@pxref{Sample Data Files, , Data Files for the Examples}),
3050 the line looks like this:
3053 camelot 555-0542 300 C
3057 It only has one baud rate; there are no slashes in the record.
3059 Another way to change the record separator is on the command line,
3060 using the variable-assignment feature
3061 (@pxref{Other Arguments, ,Other Command Line Arguments}).
3064 awk '@{ print $0 @}' RS="/" BBS-list
3068 This sets @code{RS} to @samp{/} before processing @file{BBS-list}.
3070 Using an unusual character such as @samp{/} for the record separator
3071 produces correct behavior in the vast majority of cases. However,
3072 the following (extreme) pipeline prints a surprising @samp{1}. There
3073 is one field, consisting of a newline. The value of the built-in
3074 variable @code{NF} is the number of fields in the current record.
3078 $ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}'
3085 Reaching the end of an input file terminates the current input record,
3086 even if the last character in the file is not the character in @code{RS}
3089 @cindex empty string
3090 The empty string, @code{""} (a string of no characters), has a special meaning
3091 as the value of @code{RS}: it means that records are separated
3092 by one or more blank lines, and nothing else.
3093 @xref{Multiple Line, ,Multiple-Line Records}, for more details.
3095 If you change the value of @code{RS} in the middle of an @code{awk} run,
3096 the new value is used to delimit subsequent records, but the record
3097 currently being processed (and records already processed) are not
3101 @cindex record terminator, @code{RT}
3102 @cindex terminator, record
3103 @cindex differences between @code{gawk} and @code{awk}
3104 After the end of the record has been determined, @code{gawk}
3105 sets the variable @code{RT} to the text in the input that matched
3108 @cindex regular expressions as record separators
3109 The value of @code{RS} is in fact not limited to a one-character
3110 string. It can be any regular expression
3111 (@pxref{Regexp, ,Regular Expressions}).
3112 In general, each record
3113 ends at the next string that matches the regular expression; the next
3114 record starts at the end of the matching string. This general rule is
3115 actually at work in the usual case, where @code{RS} contains just a
3116 newline: a record ends at the beginning of the next matching string (the
3117 next newline in the input) and the following record starts just after
3118 the end of this string (at the first character of the following line).
3119 The newline, since it matches @code{RS}, is not part of either record.
3121 When @code{RS} is a single character, @code{RT} will
3122 contain the same single character. However, when @code{RS} is a
3123 regular expression, then @code{RT} becomes more useful; it contains
3124 the actual input text that matched the regular expression.
3126 The following example illustrates both of these features.
3127 It sets @code{RS} equal to a regular expression that
3128 matches either a newline, or a series of one or more upper-case letters
3129 with optional leading and/or trailing white space
3130 (@pxref{Regexp, , Regular Expressions}).
3133 $ echo record 1 AAAA record 2 BBBB record 3 |
3134 > gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @}
3135 > @{ print "Record =", $0, "and RT =", RT @}'
3136 @print{} Record = record 1 and RT = AAAA
3137 @print{} Record = record 2 and RT = BBBB
3138 @print{} Record = record 3 and RT =
3143 The final line of output has an extra blank line. This is because the
3144 value of @code{RT} is a newline, and then the @code{print} statement
3145 supplies its own terminating newline.
3147 @xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example
3148 of @code{RS} as a regexp and @code{RT}.
3150 @cindex differences between @code{gawk} and @code{awk}
3151 The use of @code{RS} as a regular expression and the @code{RT}
3152 variable are @code{gawk} extensions; they are not available in
3154 (@pxref{Options, ,Command Line Options}).
3155 In compatibility mode, only the first character of the value of
3156 @code{RS} is used to determine the end of the record.
3158 @cindex number of records, @code{NR}, @code{FNR}
3161 The @code{awk} utility keeps track of the number of records that have
3162 been read so far from the current input file. This value is stored in a
3163 built-in variable called @code{FNR}. It is reset to zero when a new
3164 file is started. Another built-in variable, @code{NR}, is the total
3165 number of input records read so far from all data files. It starts at zero
3166 but is never automatically reset to zero.
3168 @node Fields, Non-Constant Fields, Records, Reading Files
3169 @section Examining Fields
3171 @cindex examining fields
3173 @cindex accessing fields
3174 When @code{awk} reads an input record, the record is
3175 automatically separated or @dfn{parsed} by the interpreter into chunks
3176 called @dfn{fields}. By default, fields are separated by whitespace,
3177 like words in a line.
3178 Whitespace in @code{awk} means any string of one or more spaces,
3179 tabs or newlines;@footnote{In POSIX @code{awk}, newlines are not
3180 considered whitespace for separating fields.} other characters such as
3181 formfeed, and so on, that are
3182 considered whitespace by other languages are @emph{not} considered
3183 whitespace by @code{awk}.
3185 The purpose of fields is to make it more convenient for you to refer to
3186 these pieces of the record. You don't have to use them---you can
3187 operate on the whole record if you wish---but fields are what make
3188 simple @code{awk} programs so powerful.
3190 @cindex @code{$} (field operator)
3191 @cindex field operator @code{$}
3192 To refer to a field in an @code{awk} program, you use a dollar-sign,
3193 @samp{$}, followed by the number of the field you want. Thus, @code{$1}
3194 refers to the first field, @code{$2} to the second, and so on. For
3195 example, suppose the following is a line of input:
3198 This seems like a pretty nice example.
3202 Here the first field, or @code{$1}, is @samp{This}; the second field, or
3203 @code{$2}, is @samp{seems}; and so on. Note that the last field,
3204 @code{$7}, is @samp{example.}. Because there is no space between the
3205 @samp{e} and the @samp{.}, the period is considered part of the seventh
3209 @cindex number of fields, @code{NF}
3210 @code{NF} is a built-in variable whose value
3211 is the number of fields in the current record.
3212 @code{awk} updates the value of @code{NF} automatically, each time
3215 No matter how many fields there are, the last field in a record can be
3216 represented by @code{$NF}. So, in the example above, @code{$NF} would
3217 be the same as @code{$7}, which is @samp{example.}. Why this works is
3218 explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}).
3219 If you try to reference a field beyond the last one, such as @code{$8}
3220 when the record has only seven fields, you get the empty string.
3221 @c the empty string acts like 0 in some contexts, but I don't want to
3222 @c get into that here....
3224 @code{$0}, which looks like a reference to the ``zeroth'' field, is
3225 a special case: it represents the whole input record. @code{$0} is
3226 used when you are not interested in fields.
3230 Here are some more examples:
3234 $ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list
3235 @print{} fooey 555-1234 2400/1200/300 B
3236 @print{} foot 555-6699 1200/300 B
3237 @print{} macfoo 555-6480 1200/300 A
3238 @print{} sabafoo 555-2127 1200/300 C
3243 This example prints each record in the file @file{BBS-list} whose first
3244 field contains the string @samp{foo}. The operator @samp{~} is called a
3245 @dfn{matching operator}
3246 (@pxref{Regexp Usage, , How to Use Regular Expressions});
3247 it tests whether a string (here, the field @code{$1}) matches a given regular
3250 By contrast, the following example
3251 looks for @samp{foo} in @emph{the entire record} and prints the first
3252 field and the last field for each input record containing a
3257 $ awk '/foo/ @{ print $1, $NF @}' BBS-list
3265 @node Non-Constant Fields, Changing Fields, Fields, Reading Files
3266 @section Non-constant Field Numbers
3268 The number of a field does not need to be a constant. Any expression in
3269 the @code{awk} language can be used after a @samp{$} to refer to a
3270 field. The value of the expression specifies the field number. If the
3271 value is a string, rather than a number, it is converted to a number.
3272 Consider this example:
3275 awk '@{ print $NR @}'
3279 Recall that @code{NR} is the number of records read so far: one in the
3280 first record, two in the second, etc. So this example prints the first
3281 field of the first record, the second field of the second record, and so
3282 on. For the twentieth record, field number 20 is printed; most likely,
3283 the record has fewer than 20 fields, so this prints a blank line.
3285 Here is another example of using expressions as field numbers:
3288 awk '@{ print $(2*2) @}' BBS-list
3291 @code{awk} must evaluate the expression @samp{(2*2)} and use
3292 its value as the number of the field to print. The @samp{*} sign
3293 represents multiplication, so the expression @samp{2*2} evaluates to four.
3294 The parentheses are used so that the multiplication is done before the
3295 @samp{$} operation; they are necessary whenever there is a binary
3296 operator in the field-number expression. This example, then, prints the
3297 hours of operation (the fourth field) for every line of the file
3298 @file{BBS-list}. (All of the @code{awk} operators are listed, in
3299 order of decreasing precedence, in
3300 @ref{Precedence, , Operator Precedence (How Operators Nest)}.)
3302 If the field number you compute is zero, you get the entire record.
3303 Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field
3304 numbers are not allowed; trying to reference one will usually terminate
3305 your running @code{awk} program. (The POSIX standard does not define
3306 what happens when you reference a negative field number. @code{gawk}
3307 will notice this and terminate your program. Other @code{awk}
3308 implementations may behave differently.)
3310 As mentioned in @ref{Fields, ,Examining Fields},
3311 the number of fields in the current record is stored in the built-in
3312 variable @code{NF} (also @pxref{Built-in Variables}). The expression
3313 @code{$NF} is not a special feature: it is the direct consequence of
3314 evaluating @code{NF} and using its value as a field number.
3316 @node Changing Fields, Field Separators, Non-Constant Fields, Reading Files
3317 @section Changing the Contents of a Field
3319 @cindex field, changing contents of
3320 @cindex changing contents of a field
3321 @cindex assignment to fields
3322 You can change the contents of a field as seen by @code{awk} within an
3323 @code{awk} program; this changes what @code{awk} perceives as the
3324 current input record. (The actual input is untouched; @code{awk} @emph{never}
3325 modifies the input file.)
3327 Consider this example and its output:
3331 $ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped
3340 The @samp{-} sign represents subtraction, so this program reassigns
3341 field three, @code{$3}, to be the value of field two minus ten,
3342 @samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.)
3343 Then field two, and the new value for field three, are printed.
3345 In order for this to work, the text in field @code{$2} must make sense
3346 as a number; the string of characters must be converted to a number in
3347 order for the computer to do arithmetic on it. The number resulting
3348 from the subtraction is converted back to a string of characters which
3349 then becomes field three.
3350 @xref{Conversion, ,Conversion of Strings and Numbers}.
3352 When you change the value of a field (as perceived by @code{awk}), the
3353 text of the input record is recalculated to contain the new field where
3354 the old one was. Therefore, @code{$0} changes to reflect the altered
3355 field. Thus, this program
3356 prints a copy of the input file, with 10 subtracted from the second
3361 $ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped
3362 @print{} Jan 3 25 15 115
3363 @print{} Feb 5 32 24 226
3364 @print{} Mar 5 24 34 228
3369 You can also assign contents to fields that are out of range. For
3373 $ awk '@{ $6 = ($5 + $4 + $3 + $2)
3374 > print $6 @}' inventory-shipped
3382 We've just created @code{$6}, whose value is the sum of fields
3383 @code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign
3384 represents addition. For the file @file{inventory-shipped}, @code{$6}
3385 represents the total number of parcels shipped for a particular month.
3387 Creating a new field changes @code{awk}'s internal copy of the current
3388 input record---the value of @code{$0}. Thus, if you do @samp{print $0}
3389 after adding a field, the record printed includes the new field, with
3390 the appropriate number of field separators between it and the previously
3393 This recomputation affects and is affected by
3394 @code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}),
3395 and by a feature that has not been discussed yet,
3396 the @dfn{output field separator}, @code{OFS},
3397 which is used to separate the fields (@pxref{Output Separators}).
3398 For example, the value of @code{NF} is set to the number of the highest
3401 Note, however, that merely @emph{referencing} an out-of-range field
3402 does @emph{not} change the value of either @code{$0} or @code{NF}.
3403 Referencing an out-of-range field only produces an empty string. For
3408 print "can't happen"
3410 print "everything is normal"
3414 should print @samp{everything is normal}, because @code{NF+1} is certain
3415 to be out of range. (@xref{If Statement, ,The @code{if}-@code{else} Statement},
3416 for more information about @code{awk}'s @code{if-else} statements.
3417 @xref{Typing and Comparison, ,Variable Typing and Comparison Expressions},
3418 for more information about the @samp{!=} operator.)
3420 It is important to note that making an assignment to an existing field
3422 value of @code{$0}, but will not change the value of @code{NF},
3423 even when you assign the empty string to a field. For example:
3427 $ echo a b c d | awk '@{ OFS = ":"; $2 = ""
3428 > print $0; print NF @}'
3435 The field is still there; it just has an empty value. You can tell
3436 because there are two colons in a row.
3438 This example shows what happens if you create a new field.
3441 $ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
3442 > print $0; print NF @}'
3443 @print{} a::c:d::new
3448 The intervening field, @code{$5} is created with an empty value
3449 (indicated by the second pair of adjacent colons),
3450 and @code{NF} is updated with the value six.
3452 Finally, decrementing @code{NF} will lose the values of the fields
3453 after the new value of @code{NF}, and @code{$0} will be recomputed.
3457 $ echo a b c d e f | ../gawk '@{ print "NF =", NF;
3458 > NF = 3; print $0 @}'
3463 @node Field Separators, Constant Size, Changing Fields, Reading Files
3464 @section Specifying How Fields are Separated
3466 This section is rather long; it describes one of the most fundamental
3467 operations in @code{awk}.
3470 * Basic Field Splitting:: How fields are split with single characters
3472 * Regexp Field Splitting:: Using regexps as the field separator.
3473 * Single Character Fields:: Making each character a separate field.
3474 * Command Line Field Separator:: Setting @code{FS} from the command line.
3475 * Field Splitting Summary:: Some final points and a summary table.
3478 @node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators
3479 @subsection The Basics of Field Separating
3481 @cindex fields, separating
3482 @cindex field separator, @code{FS}
3484 The @dfn{field separator}, which is either a single character or a regular
3485 expression, controls the way @code{awk} splits an input record into fields.
3486 @code{awk} scans the input record for character sequences that
3487 match the separator; the fields themselves are the text between the matches.
3489 In the examples below, we use the bullet symbol ``@bullet{}'' to represent
3490 spaces in the output.
3492 If the field separator is @samp{oo}, then the following line:
3499 would be split into three fields: @samp{m}, @samp{@bullet{}g} and
3500 @samp{@bullet{}gai@bullet{}pan}.
3501 Note the leading spaces in the values of the second and third fields.
3503 @cindex common mistakes
3504 @cindex mistakes, common
3505 @cindex errors, common
3506 The field separator is represented by the built-in variable @code{FS}.
3507 Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS}
3508 which is used by the POSIX compatible shells (such as the Bourne shell,
3509 @code{sh}, or the GNU Bourne-Again Shell, Bash).
3511 You can change the value of @code{FS} in the @code{awk} program with the
3512 assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
3513 Often the right time to do this is at the beginning of execution,
3514 before any input has been processed, so that the very first record
3515 will be read with the proper separator. To do this, use the special
3516 @code{BEGIN} pattern
3517 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
3518 For example, here we set the value of @code{FS} to the string
3522 awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
3526 Given the input line,
3529 John Q. Smith, 29 Oak St., Walamazoo, MI 42139
3533 this @code{awk} program extracts and prints the string
3534 @samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
3536 @cindex field separator, choice of
3537 @cindex regular expressions as field separators
3538 Sometimes your input data will contain separator characters that don't
3539 separate fields the way you thought they would. For instance, the
3540 person's name in the example we just used might have a title or
3541 suffix attached, such as @samp{John Q. Smith, LXIX}. From input
3542 containing such a name:
3545 John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
3549 @c careful of an overfull hbox here!
3550 the above program would extract @samp{@bullet{}LXIX}, instead of
3551 @samp{@bullet{}29@bullet{}Oak@bullet{}St.}.
3552 If you were expecting the program to print the
3553 address, you would be surprised. The moral is: choose your data layout and
3554 separator characters carefully to prevent such problems.
3557 As you know, normally,
3562 fields are separated by whitespace sequences
3563 (spaces, tabs and newlines), not by single spaces: two spaces in a row do not
3564 delimit an empty field. The default value of the field separator @code{FS}
3565 is a string containing a single space, @w{@code{" "}}. If this value were
3566 interpreted in the usual way, each space character would separate
3567 fields, so two spaces in a row would make an empty field between them.
3568 The reason this does not happen is that a single space as the value of
3569 @code{FS} is a special case: it is taken to specify the default manner
3570 of delimiting fields.
3572 If @code{FS} is any other single character, such as @code{","}, then
3573 each occurrence of that character separates two fields. Two consecutive
3574 occurrences delimit an empty field. If the character occurs at the
3575 beginning or the end of the line, that too delimits an empty field. The
3576 space character is the only single character which does not follow these
3579 @node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators
3580 @subsection Using Regular Expressions to Separate Fields
3589 discussed the use of single characters or simple strings as the
3591 More generally, the value of @code{FS} may be a string containing any
3592 regular expression. In this case, each match in the record for the regular
3593 expression separates fields. For example, the assignment:
3600 makes every area of an input line that consists of a comma followed by a
3601 space and a tab, into a field separator. (@samp{\t}
3602 is an @dfn{escape sequence} that stands for a tab;
3603 @pxref{Escape Sequences},
3604 for the complete list of similar escape sequences.)
3606 For a less trivial example of a regular expression, suppose you want
3607 single spaces to separate fields the way single commas were used above.
3608 You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right
3609 bracket). This regular expression matches a single space and nothing else
3610 (@pxref{Regexp, ,Regular Expressions}).
3612 There is an important difference between the two cases of @samp{FS = @w{" "}}
3613 (a single space) and @samp{FS = @w{"[ \t\n]+"}} (left bracket, space,
3614 backslash, ``t'', backslash, ``n'', right bracket, which is a regular
3615 expression matching one or more spaces, tabs, or newlines). For both
3616 values of @code{FS}, fields are separated by runs of spaces, tabs
3617 and/or newlines. However, when the value of @code{FS} is @w{@code{"
3618 "}}, @code{awk} will first strip leading and trailing whitespace from
3619 the record, and then decide where the fields are.
3621 For example, the following pipeline prints @samp{b}:
3625 $ echo ' a b c d ' | awk '@{ print $2 @}'
3631 However, this pipeline prints @samp{a} (note the extra spaces around
3635 $ echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
3642 @cindex empty string
3643 In this case, the first field is @dfn{null}, or empty.
3645 The stripping of leading and trailing whitespace also comes into
3646 play whenever @code{$0} is recomputed. For instance, study this pipeline:
3649 $ echo ' a b c d' | awk '@{ print; $2 = $2; print @}'
3655 The first @code{print} statement prints the record as it was read,
3656 with leading whitespace intact. The assignment to @code{$2} rebuilds
3657 @code{$0} by concatenating @code{$1} through @code{$NF} together,
3658 separated by the value of @code{OFS}. Since the leading whitespace
3659 was ignored when finding @code{$1}, it is not part of the new @code{$0}.
3660 Finally, the last @code{print} statement prints the new @code{$0}.
3662 @node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators
3663 @subsection Making Each Character a Separate Field
3665 @cindex differences between @code{gawk} and @code{awk}
3666 @cindex single character fields
3667 There are times when you may want to examine each character
3668 of a record separately. In @code{gawk}, this is easy to do, you
3669 simply assign the null string (@code{""}) to @code{FS}. In this case,
3670 each individual character in the record will become a separate field.
3675 $ echo a b | gawk 'BEGIN @{ FS = "" @}
3677 > for (i = 1; i <= NF; i = i + 1)
3678 > print "Field", i, "is", $i
3680 @print{} Field 1 is a
3682 @print{} Field 3 is b
3687 Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.
3688 In this case, Unix @code{awk} would simply treat the entire record
3689 as only having one field (d.c.). In compatibility mode
3690 (@pxref{Options, ,Command Line Options}),
3691 if @code{FS} is the null string, then @code{gawk} will also
3694 @node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators
3695 @subsection Setting @code{FS} from the Command Line
3696 @cindex @code{-F} option
3697 @cindex field separator, on command line
3698 @cindex command line, setting @code{FS} on
3700 @code{FS} can be set on the command line. You use the @samp{-F} option to
3704 awk -F, '@var{program}' @var{input-files}
3708 sets @code{FS} to be the @samp{,} character. Notice that the option uses
3709 a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file
3710 containing an @code{awk} program. Case is significant in command line options:
3711 the @samp{-F} and @samp{-f} options have nothing to do with each other.
3712 You can use both options at the same time to set the @code{FS} variable
3713 @emph{and} get an @code{awk} program from a file.
3715 The value used for the argument to @samp{-F} is processed in exactly the
3716 same way as assignments to the built-in variable @code{FS}. This means that
3717 if the field separator contains special characters, they must be escaped
3718 appropriately. For example, to use a @samp{\} as the field separator, you
3723 awk -F\\\\ '@dots{}' files @dots{}
3727 Since @samp{\} is used for quoting in the shell, @code{awk} will see
3728 @samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape
3729 characters (@pxref{Escape Sequences}), finally yielding
3730 a single @samp{\} to be used for the field separator.
3732 @cindex historical features
3733 As a special case, in compatibility mode
3734 (@pxref{Options, ,Command Line Options}), if the
3735 argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab
3736 character. This is because if you type @samp{-F\t} at the shell,
3737 without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you
3738 really want your fields to be separated with tabs, and not @samp{t}s.
3739 Use @samp{-v FS="t"} on the command line if you really do want to separate
3740 your fields with @samp{t}s
3741 (@pxref{Options, ,Command Line Options}).
3743 For example, let's use an @code{awk} program file called @file{baud.awk}
3744 that contains the pattern @code{/300/}, and the action @samp{print $1}.
3745 Here is the program:
3748 /300/ @{ print $1 @}
3751 Let's also set @code{FS} to be the @samp{-} character, and run the
3752 program on the file @file{BBS-list}. The following command prints a
3753 list of the names of the bulletin boards that operate at 300 baud and
3754 the first three digits of their phone numbers:
3756 @c tweaked to make the tex output look better in @smallbook
3759 $ awk -F- -f baud.awk BBS-list
3760 @print{} aardvark 555
3767 @print{} camelot 555
3773 @print{} sabafoo 555
3778 Note the second line of output. In the original file
3779 (@pxref{Sample Data Files, ,Data Files for the Examples}),
3780 the second line looked like this:
3783 alpo-net 555-3412 2400/1200/300 A
3786 The @samp{-} as part of the system's name was used as the field
3787 separator, instead of the @samp{-} in the phone number that was
3788 originally intended. This demonstrates why you have to be careful in
3789 choosing your field and record separators.
3791 On many Unix systems, each user has a separate entry in the system password
3792 file, one line per user. The information in these lines is separated
3793 by colons. The first field is the user's logon name, and the second is
3794 the user's encrypted password. A password file entry might look like this:
3797 arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
3800 The following program searches the system password file, and prints
3801 the entries for users who have no password:
3804 awk -F: '$2 == ""' /etc/passwd
3807 @node Field Splitting Summary, , Command Line Field Separator, Field Separators
3808 @subsection Field Splitting Summary
3810 @cindex @code{awk} language, POSIX version
3811 @cindex POSIX @code{awk}
3812 According to the POSIX standard, @code{awk} is supposed to behave
3813 as if each record is split into fields at the time that it is read.
3814 In particular, this means that you can change the value of @code{FS}
3815 after a record is read, and the value of the fields (i.e.@: how they were split)
3816 should reflect the old value of @code{FS}, not the new one.
3819 @cindex @code{sed} utility
3820 @cindex stream editor
3821 However, many implementations of @code{awk} do not work this way. Instead,
3822 they defer splitting the fields until a field is actually
3823 referenced. The fields will be split
3824 using the @emph{current} value of @code{FS}! (d.c.)
3825 This behavior can be difficult
3826 to diagnose. The following example illustrates the difference
3827 between the two methods.
3828 (The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.''
3829 Its behavior is also defined by the POSIX standard.}
3830 command prints just the first line of @file{/etc/passwd}.)
3833 sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}'
3844 on an incorrect implementation of @code{awk}, while @code{gawk}
3845 will print something like
3848 root:nSijPlPhZZwgE:0:0:Root:/:
3851 The following table summarizes how fields are split, based on the
3852 value of @code{FS}. (@samp{==} means ``is equal to.'')
3857 Fields are separated by runs of whitespace. Leading and trailing
3858 whitespace are ignored. This is the default.
3860 @item FS == @var{any other single character}
3861 Fields are separated by each occurrence of the character. Multiple
3862 successive occurrences delimit empty fields, as do leading and
3863 trailing occurrences.
3864 The character can even be a regexp metacharacter; it does not need
3867 @item FS == @var{regexp}
3868 Fields are separated by occurrences of characters that match @var{regexp}.
3869 Leading and trailing matches of @var{regexp} delimit empty fields.
3872 Each individual character in the record becomes a separate field.
3876 @node Constant Size, Multiple Line, Field Separators, Reading Files
3877 @section Reading Fixed-width Data
3879 (This section discusses an advanced, experimental feature. If you are
3880 a novice @code{awk} user, you may wish to skip it on the first reading.)
3882 @code{gawk} version 2.13 introduced a new facility for dealing with
3883 fixed-width fields with no distinctive field separator. Data of this
3884 nature arises, for example, in the input for old FORTRAN programs where
3885 numbers are run together; or in the output of programs that did not
3886 anticipate the use of their output as input for other programs.
3888 An example of the latter is a table where all the columns are lined up by
3889 the use of a variable number of spaces and @emph{empty fields are just
3890 spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS}
3891 will not work well in this case. Although a portable @code{awk} program
3892 can use a series of @code{substr} calls on @code{$0}
3893 (@pxref{String Functions, ,Built-in Functions for String Manipulation}),
3894 this is awkward and inefficient for a large number of fields.
3896 The splitting of an input record into fixed-width fields is specified by
3897 assigning a string containing space-separated numbers to the built-in
3898 variable @code{FIELDWIDTHS}. Each number specifies the width of the field
3899 @emph{including} columns between fields. If you want to ignore the columns
3900 between fields, you can specify the width as a separate field that is
3901 subsequently ignored.
3903 The following data is the output of the Unix @code{w} utility. It is useful
3904 to illustrate the use of @code{FIELDWIDTHS}.
3908 10:06pm up 21 days, 14:04, 23 users
3909 User tty login@ idle JCPU PCPU what
3910 hzuo ttyV0 8:58pm 9 5 vi p24.tex
3911 hzang ttyV3 6:37pm 50 -csh
3912 eklye ttyV5 9:53pm 7 1 em thes.tex
3913 dportein ttyV6 8:17pm 1:47 -csh
3914 gierd ttyD3 10:00pm 1 elm
3915 dave ttyD4 9:47pm 4 4 w
3916 brent ttyp0 26Jun91 4:46 26:46 4:41 bash
3917 dave ttyq4 26Jun9115days 46 46 wnewmail
3921 The following program takes the above input, converts the idle time to
3922 number of seconds and prints out the first two fields and the calculated
3923 idle time. (This program uses a number of @code{awk} features that
3924 haven't been introduced yet.)
3927 BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @}
3930 sub(/^ */, "", idle) # strip leading spaces
3936 idle = t[1] * 60 + t[2]
3941 idle *= 24 * 60 * 60
3948 Here is the result of running the program on the data:
3961 Another (possibly more practical) example of fixed-width input data
3962 would be the input from a deck of balloting cards. In some parts of
3963 the United States, voters mark their choices by punching holes in computer
3964 cards. These cards are then processed to count the votes for any particular
3965 candidate or on any particular issue. Since a voter may choose not to
3966 vote on some issue, any column on the card may be empty. An @code{awk}
3967 program for processing such data could use the @code{FIELDWIDTHS} feature
3968 to simplify reading the data. (Of course, getting @code{gawk} to run on
3969 a system with card readers is another story!)
3972 Exercise: Write a ballot card reading program
3975 Assigning a value to @code{FS} causes @code{gawk} to return to using
3976 @code{FS} for field splitting. Use @samp{FS = FS} to make this happen,
3977 without having to know the current value of @code{FS}.
3979 This feature is still experimental, and may evolve over time.
3980 Note that in particular, @code{gawk} does not attempt to verify
3981 the sanity of the values used in the value of @code{FIELDWIDTHS}.
3983 @node Multiple Line, Getline, Constant Size, Reading Files
3984 @section Multiple-Line Records
3986 @cindex multiple line records
3987 @cindex input, multiple line records
3988 @cindex reading files, multiple line records
3989 @cindex records, multiple line
3990 In some data bases, a single line cannot conveniently hold all the
3991 information in one entry. In such cases, you can use multi-line
3994 The first step in doing this is to choose your data format: when records
3995 are not defined as single lines, how do you want to define them?
3996 What should separate records?
3998 One technique is to use an unusual character or string to separate
3999 records. For example, you could use the formfeed character (written
4000 @samp{\f} in @code{awk}, as in C) to separate them, making each record
4001 a page of the file. To do this, just set the variable @code{RS} to
4002 @code{"\f"} (a string containing the formfeed character). Any
4003 other character could equally well be used, as long as it won't be part
4004 of the data in a record.
4006 Another technique is to have blank lines separate records. By a special
4007 dispensation, an empty string as the value of @code{RS} indicates that
4008 records are separated by one or more blank lines. If you set @code{RS}
4009 to the empty string, a record always ends at the first blank line
4010 encountered. And the next record doesn't start until the first non-blank
4011 line that follows---no matter how many blank lines appear in a row, they
4012 are considered one record-separator.
4014 @cindex leftmost longest match
4015 @cindex matching, leftmost longest
4016 You can achieve the same effect as @samp{RS = ""} by assigning the
4017 string @code{"\n\n+"} to @code{RS}. This regexp matches the newline
4018 at the end of the record, and one or more blank lines after the record.
4019 In addition, a regular expression always matches the longest possible
4020 sequence when there is a choice
4021 (@pxref{Leftmost Longest, ,How Much Text Matches?}).
4022 So the next record doesn't start until
4023 the first non-blank line that follows---no matter how many blank lines
4024 appear in a row, they are considered one record-separator.
4027 There is an important difference between @samp{RS = ""} and
4028 @samp{RS = "\n\n+"}. In the first case, leading newlines in the input
4029 data file are ignored, and if a file ends without extra blank lines
4030 after the last record, the final newline is removed from the record.
4031 In the second case, this special processing is not done (d.c.).
4033 Now that the input is separated into records, the second step is to
4034 separate the fields in the record. One way to do this is to divide each
4035 of the lines into fields in the normal manner. This happens by default
4036 as the result of a special feature: when @code{RS} is set to the empty
4037 string, the newline character @emph{always} acts as a field separator.
4038 This is in addition to whatever field separations result from @code{FS}.
4040 The original motivation for this special exception was probably to provide
4041 useful behavior in the default case (i.e.@: @code{FS} is equal
4042 to @w{@code{" "}}). This feature can be a problem if you really don't
4043 want the newline character to separate fields, since there is no way to
4044 prevent it. However, you can work around this by using the @code{split}
4045 function to break up the record manually
4046 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
4048 Another way to separate fields is to
4049 put each field on a separate line: to do this, just set the
4050 variable @code{FS} to the string @code{"\n"}. (This simple regular
4051 expression matches a single newline.)
4053 A practical example of a data file organized this way might be a mailing
4054 list, where each entry is separated by blank lines. If we have a mailing
4055 list in a file named @file{addresses}, that looks like this:
4062 Anywhere, SE 12345-6789
4065 456 Tree-lined Avenue
4066 Smallville, MW 98765-4321
4071 A simple program to process this file would look like this:
4075 # addrs.awk --- simple mailing list program
4077 # Records are separated by blank lines.
4078 # Each line is one field.
4079 BEGIN @{ RS = "" ; FS = "\n" @}
4082 print "Name is:", $1
4083 print "Address is:", $2
4084 print "City and State are:", $3
4090 Running the program produces the following output:
4094 $ awk -f addrs.awk addresses
4095 @print{} Name is: Jane Doe
4096 @print{} Address is: 123 Main Street
4097 @print{} City and State are: Anywhere, SE 12345-6789
4101 @print{} Name is: John Smith
4102 @print{} Address is: 456 Tree-lined Avenue
4103 @print{} City and State are: Smallville, MW 98765-4321
4109 @xref{Labels Program, ,Printing Mailing Labels}, for a more realistic
4110 program that deals with address lists.
4112 The following table summarizes how records are split, based on the
4113 value of @code{RS}. (@samp{==} means ``is equal to.'')
4118 Records are separated by the newline character (@samp{\n}). In effect,
4119 every line in the data file is a separate record, including blank lines.
4120 This is the default.
4122 @item RS == @var{any single character}
4123 Records are separated by each occurrence of the character. Multiple
4124 successive occurrences delimit empty records.
4127 Records are separated by runs of blank lines. The newline character
4128 always serves as a field separator, in addition to whatever value
4129 @code{FS} may have. Leading and trailing newlines in a file are ignored.
4131 @item RS == @var{regexp}
4132 Records are separated by occurrences of characters that match @var{regexp}.
4133 Leading and trailing matches of @var{regexp} delimit empty records.
4138 In all cases, @code{gawk} sets @code{RT} to the input text that matched the
4139 value specified by @code{RS}.
4141 @node Getline, , Multiple Line, Reading Files
4142 @section Explicit Input with @code{getline}
4145 @cindex input, explicit
4146 @cindex explicit input
4147 @cindex input, @code{getline} command
4148 @cindex reading files, @code{getline} command
4149 So far we have been getting our input data from @code{awk}'s main
4150 input stream---either the standard input (usually your terminal, sometimes
4151 the output from another program) or from the
4152 files specified on the command line. The @code{awk} language has a
4153 special built-in command called @code{getline} that
4154 can be used to read input under your explicit control.
4157 * Getline Intro:: Introduction to the @code{getline} function.
4158 * Plain Getline:: Using @code{getline} with no arguments.
4159 * Getline/Variable:: Using @code{getline} into a variable.
4160 * Getline/File:: Using @code{getline} from a file.
4161 * Getline/Variable/File:: Using @code{getline} into a variable from a
4163 * Getline/Pipe:: Using @code{getline} from a pipe.
4164 * Getline/Variable/Pipe:: Using @code{getline} into a variable from a
4166 * Getline Summary:: Summary Of @code{getline} Variants.
4169 @node Getline Intro, Plain Getline, Getline, Getline
4170 @subsection Introduction to @code{getline}
4172 This command is used in several different ways, and should @emph{not} be
4173 used by beginners. It is covered here because this is the chapter on input.
4174 The examples that follow the explanation of the @code{getline} command
4175 include material that has not been covered yet. Therefore, come back
4176 and study the @code{getline} command @emph{after} you have reviewed the
4177 rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works.
4180 @cindex differences between @code{gawk} and @code{awk}
4181 @cindex @code{getline}, return values
4182 @code{getline} returns one if it finds a record, and zero if the end of the
4183 file is encountered. If there is some error in getting a record, such
4184 as a file that cannot be opened, then @code{getline} returns @minus{}1.
4185 In this case, @code{gawk} sets the variable @code{ERRNO} to a string
4186 describing the error that occurred.
4188 In the following examples, @var{command} stands for a string value that
4189 represents a shell command.
4191 @node Plain Getline, Getline/Variable, Getline Intro, Getline
4192 @subsection Using @code{getline} with No Arguments
4194 The @code{getline} command can be used without arguments to read input
4195 from the current input file. All it does in this case is read the next
4196 input record and split it up into fields. This is useful if you've
4197 finished processing the current record, but you want to do some special
4198 processing @emph{right now} on the next record. Here's an
4204 if ((t = index($0, "/*")) != 0) @{
4205 # value will be "" if t is 1
4206 tmp = substr($0, 1, t - 1)
4207 u = index(substr($0, t + 2), "*/")
4209 if (getline <= 0) @{
4210 m = "unexpected EOF or error"
4212 print m > "/dev/stderr"
4220 # substr expression will be "" if */
4221 # occurred at end of line
4222 $0 = tmp substr($0, t + u + 3)
4229 This @code{awk} program deletes all C-style comments, @samp{/* @dots{}
4230 */}, from the input. By replacing the @samp{print $0} with other
4231 statements, you could perform more complicated processing on the
4232 decommented input, like searching for matches of a regular
4233 expression. This program has a subtle problem---it does not work if one
4234 comment ends and another begins on the same line.
4238 write a program that does handle multiple comments on the line.
4241 This form of the @code{getline} command sets @code{NF} (the number of
4242 fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of
4243 records read so far; @pxref{Records, ,How Input is Split into Records}),
4244 @code{FNR} (the number of records read from this input file), and the
4248 @strong{Note:} the new value of @code{$0} is used in testing
4249 the patterns of any subsequent rules. The original value
4250 of @code{$0} that triggered the rule which executed @code{getline}
4252 By contrast, the @code{next} statement reads a new record
4253 but immediately begins processing it normally, starting with the first
4254 rule in the program. @xref{Next Statement, ,The @code{next} Statement}.
4256 @node Getline/Variable, Getline/File, Plain Getline, Getline
4257 @subsection Using @code{getline} Into a Variable
4259 You can use @samp{getline @var{var}} to read the next record from
4260 @code{awk}'s input into the variable @var{var}. No other processing is
4263 For example, suppose the next line is a comment, or a special string,
4264 and you want to read it, without triggering
4265 any rules. This form of @code{getline} allows you to read that line
4266 and store it in a variable so that the main
4267 read-a-line-and-check-each-rule loop of @code{awk} never sees it.
4269 The following example swaps every two lines of input. For example, given:
4294 if ((getline tmp) > 0) @{
4303 The @code{getline} command used in this way sets only the variables
4304 @code{NR} and @code{FNR} (and of course, @var{var}). The record is not
4305 split into fields, so the values of the fields (including @code{$0}) and
4306 the value of @code{NF} do not change.
4308 @node Getline/File, Getline/Variable/File, Getline/Variable, Getline
4309 @subsection Using @code{getline} from a File
4311 @cindex input redirection
4312 @cindex redirection of input
4313 Use @samp{getline < @var{file}} to read
4314 the next record from the file
4315 @var{file}. Here @var{file} is a string-valued expression that
4316 specifies the file name. @samp{< @var{file}} is called a @dfn{redirection}
4317 since it directs input to come from a different place.
4319 For example, the following
4320 program reads its input record from the file @file{secondary.input} when it
4321 encounters a first field with a value equal to 10 in the current input
4328 getline < "secondary.input"
4336 Since the main input stream is not used, the values of @code{NR} and
4337 @code{FNR} are not changed. But the record read is split into fields in
4338 the normal manner, so the values of @code{$0} and other fields are
4339 changed. So is the value of @code{NF}.
4341 @c Thanks to Paul Eggert for initial wording here
4342 According to POSIX, @samp{getline < @var{expression}} is ambiguous if
4343 @var{expression} contains unparenthesized operators other than
4344 @samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
4345 because the concatenation operator is not parenthesized, and you should
4346 write it as @samp{getline < (dir "/" file)} if you want your program
4347 to be portable to other @code{awk} implementations.
4349 @node Getline/Variable/File, Getline/Pipe, Getline/File, Getline
4350 @subsection Using @code{getline} Into a Variable from a File
4352 Use @samp{getline @var{var} < @var{file}} to read input
4354 @var{file} and put it in the variable @var{var}. As above, @var{file}
4355 is a string-valued expression that specifies the file from which to read.
4357 In this version of @code{getline}, none of the built-in variables are
4358 changed, and the record is not split into fields. The only variable
4359 changed is @var{var}.
4362 @c Thanks to Paul Eggert for initial wording here
4363 According to POSIX, @samp{getline @var{var} < @var{expression}} is ambiguous if
4364 @var{expression} contains unparenthesized operators other than
4365 @samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
4366 because the concatenation operator is not parenthesized, and you should
4367 write it as @samp{getline < (dir "/" file)} if you want your program
4368 to be portable to other @code{awk} implementations.
4371 For example, the following program copies all the input files to the
4372 output, except for records that say @w{@samp{@@include @var{filename}}}.
4373 Such a record is replaced by the contents of the file
4379 if (NF == 2 && $1 == "@@include") @{
4380 while ((getline line < $2) > 0)
4389 Note here how the name of the extra input file is not built into
4390 the program; it is taken directly from the data, from the second field on
4391 the @samp{@@include} line.
4393 The @code{close} function is called to ensure that if two identical
4394 @samp{@@include} lines appear in the input, the entire specified file is
4396 @xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
4398 One deficiency of this program is that it does not process nested
4399 @samp{@@include} statements
4400 (@samp{@@include} statements in included files)
4401 the way a true macro preprocessor would.
4402 @xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program
4403 that does handle nested @samp{@@include} statements.
4405 @node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline
4406 @subsection Using @code{getline} from a Pipe
4408 @cindex input pipeline
4409 @cindex pipeline, input
4410 You can pipe the output of a command into @code{getline}, using
4411 @samp{@var{command} | getline}. In
4412 this case, the string @var{command} is run as a shell command and its output
4413 is piped into @code{awk} to be used as input. This form of @code{getline}
4414 reads one record at a time from the pipe.
4416 For example, the following program copies its input to its output, except for
4417 lines that begin with @samp{@@execute}, which are replaced by the output
4418 produced by running the rest of the line as a shell command:
4423 if ($1 == "@@execute") @{
4424 tmp = substr($0, 10)
4425 while ((tmp | getline) > 0)
4435 The @code{close} function is called to ensure that if two identical
4436 @samp{@@execute} lines appear in the input, the command is run for
4438 @xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.
4440 @c This example is unrealistic, since you could just use system
4455 the program might produce:
4462 arnold ttyv0 Jul 13 14:22
4463 miriam ttyp0 Jul 13 14:23 (murphy:0)
4464 bill ttyp1 Jul 13 14:23 (murphy:0)
4470 Notice that this program ran the command @code{who} and printed the result.
4471 (If you try this program yourself, you will of course get different results,
4472 showing you who is logged in on your system.)
4474 This variation of @code{getline} splits the record into fields, sets the
4475 value of @code{NF} and recomputes the value of @code{$0}. The values of
4476 @code{NR} and @code{FNR} are not changed.
4478 @c Thanks to Paul Eggert for initial wording here
4479 According to POSIX, @samp{@var{expression} | getline} is ambiguous if
4480 @var{expression} contains unparenthesized operators other than
4481 @samp{$}; for example, @samp{"echo " "date" | getline} is ambiguous
4482 because the concatenation operator is not parenthesized, and you should
4483 write it as @samp{("echo " "date") | getline} if you want your program
4484 to be portable to other @code{awk} implementations.
4485 (It happens that @code{gawk} gets it right, but you should not
4486 rely on this. Parentheses make it easier to read, anyway.)
4488 @node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline
4489 @subsection Using @code{getline} Into a Variable from a Pipe
4491 When you use @samp{@var{command} | getline @var{var}}, the
4492 output of the command @var{command} is sent through a pipe to
4493 @code{getline} and into the variable @var{var}. For example, the
4494 following program reads the current date and time into the variable
4495 @code{current_time}, using the @code{date} utility, and then
4501 "date" | getline current_time
4503 print "Report printed on " current_time
4508 In this version of @code{getline}, none of the built-in variables are
4509 changed, and the record is not split into fields.
4512 @c Thanks to Paul Eggert for initial wording here
4513 According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
4514 @var{expression} contains unparenthesized operators other than
4515 @samp{$}; for example, @samp{"echo " "date" | getline @var{var}} is ambiguous
4516 because the concatenation operator is not parenthesized, and you should
4517 write it as @samp{("echo " "date") | getline @var{var}} if you want your
4518 program to be portable to other @code{awk} implementations.
4519 (It happens that @code{gawk} gets it right, but you should not
4520 rely on this. Parentheses make it easier to read, anyway.)
4523 @node Getline Summary, , Getline/Variable/Pipe, Getline
4524 @subsection Summary of @code{getline} Variants
4526 With all the forms of @code{getline}, even though @code{$0} and @code{NF},
4527 may be updated, the record will not be tested against all the patterns
4528 in the @code{awk} program, in the way that would happen if the record
4529 were read normally by the main processing loop of @code{awk}. However
4530 the new record is tested against any subsequent rules.
4532 @cindex differences between @code{gawk} and @code{awk}
4534 @cindex implementation limits
4535 Many @code{awk} implementations limit the number of pipelines an @code{awk}
4536 program may have open to just one! In @code{gawk}, there is no such limit.
4537 You can open as many pipelines as the underlying operating system will
4542 @cindex @code{getline}, setting @code{FILENAME}
4543 @cindex @code{FILENAME}, being set by @code{getline}
4544 An interesting side-effect occurs if you use @code{getline} (without a
4545 redirection) inside a @code{BEGIN} rule. Since an unredirected @code{getline}
4546 reads from the command line data files, the first @code{getline} command
4547 causes @code{awk} to set the value of @code{FILENAME}. Normally,
4548 @code{FILENAME} does not have a value inside @code{BEGIN} rules, since you
4549 have not yet started to process the command line data files (d.c.).
4550 (@xref{BEGIN/END, , The @code{BEGIN} and @code{END} Special Patterns},
4551 also @pxref{Auto-set, , Built-in Variables that Convey Information}.)
4553 The following table summarizes the six variants of @code{getline},
4554 listing which built-in variables are set by each one.
4559 sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}.
4561 @item getline @var{var}
4562 sets @var{var}, @code{FNR}, and @code{NR}.
4564 @item getline < @var{file}
4565 sets @code{$0}, and @code{NF}.
4567 @item getline @var{var} < @var{file}
4570 @item @var{command} | getline
4571 sets @code{$0}, and @code{NF}.
4573 @item @var{command} | getline @var{var}
4578 @node Printing, Expressions, Reading Files, Top
4579 @chapter Printing Output
4583 One of the most common actions is to @dfn{print}, or output,
4584 some or all of the input. You use the @code{print} statement
4585 for simple output. You use the @code{printf} statement
4586 for fancier formatting. Both are described in this chapter.
4589 * Print:: The @code{print} statement.
4590 * Print Examples:: Simple examples of @code{print} statements.
4591 * Output Separators:: The output separators and how to change them.
4592 * OFMT:: Controlling Numeric Output With @code{print}.
4593 * Printf:: The @code{printf} statement.
4594 * Redirection:: How to redirect output to multiple files and
4596 * Special Files:: File name interpretation in @code{gawk}.
4597 @code{gawk} allows access to inherited file
4599 * Close Files And Pipes:: Closing Input and Output Files and Pipes.
4602 @node Print, Print Examples, Printing, Printing
4603 @section The @code{print} Statement
4604 @cindex @code{print} statement
4606 The @code{print} statement does output with simple, standardized
4607 formatting. You specify only the strings or numbers to be printed, in a
4608 list separated by commas. They are output, separated by single spaces,
4609 followed by a newline. The statement looks like this:
4612 print @var{item1}, @var{item2}, @dots{}
4616 The entire list of items may optionally be enclosed in parentheses. The
4617 parentheses are necessary if any of the item expressions uses the @samp{>}
4618 relational operator; otherwise it could be confused with a redirection
4619 (@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
4621 The items to be printed can be constant strings or numbers, fields of the
4622 current record (such as @code{$1}), variables, or any @code{awk}
4624 Numeric values are converted to strings, and then printed.
4626 The @code{print} statement is completely general for
4627 computing @emph{what} values to print. However, with two exceptions,
4628 you cannot specify @emph{how} to print them---how many
4629 columns, whether to use exponential notation or not, and so on.
4630 (For the exceptions, @pxref{Output Separators}, and
4631 @ref{OFMT, ,Controlling Numeric Output with @code{print}}.)
4632 For that, you need the @code{printf} statement
4633 (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
4635 The simple statement @samp{print} with no items is equivalent to
4636 @samp{print $0}: it prints the entire current record. To print a blank
4637 line, use @samp{print ""}, where @code{""} is the empty string.
4639 To print a fixed piece of text, use a string constant such as
4640 @w{@code{"Don't Panic"}} as one item. If you forget to use the
4641 double-quote characters, your text will be taken as an @code{awk}
4642 expression, and you will probably get an error. Keep in mind that a
4643 space is printed between any two items.
4645 Each @code{print} statement makes at least one line of output. But it
4646 isn't limited to one line. If an item value is a string that contains a
4647 newline, the newline is output along with the rest of the string. A
4648 single @code{print} can make any number of lines this way.
4650 @node Print Examples, Output Separators, Print, Printing
4651 @section Examples of @code{print} Statements
4653 Here is an example of printing a string that contains embedded newlines
4654 (the @samp{\n} is an escape sequence, used to represent the newline
4655 character; @pxref{Escape Sequences}):
4659 $ awk 'BEGIN @{ print "line one\nline two\nline three" @}'
4666 Here is an example that prints the first two fields of each input record,
4667 with a space between them:
4671 $ awk '@{ print $1, $2 @}' inventory-shipped
4679 @cindex common mistakes
4680 @cindex mistakes, common
4681 @cindex errors, common
4682 A common mistake in using the @code{print} statement is to omit the comma
4683 between two items. This often has the effect of making the items run
4684 together in the output, with no space. The reason for this is that
4685 juxtaposing two string expressions in @code{awk} means to concatenate
4686 them. Here is the same program, without the comma:
4690 $ awk '@{ print $1 $2 @}' inventory-shipped
4698 To someone unfamiliar with the file @file{inventory-shipped}, neither
4699 example's output makes much sense. A heading line at the beginning
4700 would make it clearer. Let's add some headings to our table of months
4701 (@code{$1}) and green crates shipped (@code{$2}). We do this using the
4702 @code{BEGIN} pattern
4703 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
4704 to force the headings to be printed only once:
4707 awk 'BEGIN @{ print "Month Crates"
4708 print "----- ------" @}
4709 @{ print $1, $2 @}' inventory-shipped
4713 Did you already guess what happens? When run, the program prints
4728 The headings and the table data don't line up! We can fix this by printing
4729 some spaces between the two fields:
4732 awk 'BEGIN @{ print "Month Crates"
4733 print "----- ------" @}
4734 @{ print $1, " ", $2 @}' inventory-shipped
4737 You can imagine that this way of lining up columns can get pretty
4738 complicated when you have many columns to fix. Counting spaces for two
4739 or three columns can be simple, but more than this and you can get
4740 lost quite easily. This is why the @code{printf} statement was
4741 created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing});
4742 one of its specialties is lining up columns of data.
4744 @cindex line continuation
4746 you can continue either a @code{print} or @code{printf} statement simply
4747 by putting a newline after any comma
4748 (@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
4750 @node Output Separators, OFMT, Print Examples, Printing
4751 @section Output Separators
4753 @cindex output field separator, @code{OFS}
4754 @cindex output record separator, @code{ORS}
4757 As mentioned previously, a @code{print} statement contains a list
4758 of items, separated by commas. In the output, the items are normally
4759 separated by single spaces. This need not be the case; a
4760 single space is only the default. You can specify any string of
4761 characters to use as the @dfn{output field separator} by setting the
4762 built-in variable @code{OFS}. The initial value of this variable
4763 is the string @w{@code{" "}}, that is, a single space.
4765 The output from an entire @code{print} statement is called an
4766 @dfn{output record}. Each @code{print} statement outputs one output
4767 record and then outputs a string called the @dfn{output record separator}.
4768 The built-in variable @code{ORS} specifies this string. The initial
4769 value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline
4770 character; thus, normally each @code{print} statement makes a separate line.
4772 You can change how output fields and records are separated by assigning
4773 new values to the variables @code{OFS} and/or @code{ORS}. The usual
4774 place to do this is in the @code{BEGIN} rule
4775 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so
4776 that it happens before any input is processed. You may also do this
4777 with assignments on the command line, before the names of your input
4778 files, or using the @samp{-v} command line option
4779 (@pxref{Options, ,Command Line Options}).
4785 awk 'BEGIN @{ print "Month Crates"
4786 print "----- ------" @}
4787 @{ print $1, " ", $2 @}' inventory-shipped
4789 program by using a new value of @code{OFS}.
4792 The following example prints the first and second fields of each input
4793 record separated by a semicolon, with a blank line added after each
4798 $ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
4799 > @{ print $1, $2 @}' BBS-list
4800 @print{} aardvark;555-5553
4802 @print{} alpo-net;555-3412
4804 @print{} barfly;555-7685
4809 If the value of @code{ORS} does not contain a newline, all your output
4810 will be run together on a single line, unless you output newlines some
4813 @node OFMT, Printf, Output Separators, Printing
4814 @section Controlling Numeric Output with @code{print}
4816 @cindex numeric output format
4817 @cindex format, numeric output
4818 @cindex output format specifier, @code{OFMT}
4819 When you use the @code{print} statement to print numeric values,
4820 @code{awk} internally converts the number to a string of characters,
4821 and prints that string. @code{awk} uses the @code{sprintf} function
4822 to do this conversion
4823 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
4824 For now, it suffices to say that the @code{sprintf}
4825 function accepts a @dfn{format specification} that tells it how to format
4826 numbers (or strings), and that there are a number of different ways in which
4827 numbers can be formatted. The different format specifications are discussed
4829 @ref{Control Letters, , Format-Control Letters}.
4831 The built-in variable @code{OFMT} contains the default format specification
4832 that @code{print} uses with @code{sprintf} when it wants to convert a
4833 number to a string for printing.
4834 The default value of @code{OFMT} is @code{"%.6g"}.
4835 By supplying different format specifications
4836 as the value of @code{OFMT}, you can change how @code{print} will print
4837 your numbers. As a brief example:
4842 > OFMT = "%.0f" # print numbers as integers (rounds)
4850 @cindex @code{awk} language, POSIX version
4851 @cindex POSIX @code{awk}
4852 According to the POSIX standard, @code{awk}'s behavior will be undefined
4853 if @code{OFMT} contains anything but a floating point conversion specification
4856 @node Printf, Redirection, OFMT, Printing
4857 @section Using @code{printf} Statements for Fancier Printing
4858 @cindex formatted output
4859 @cindex output, formatted
4861 If you want more precise control over the output format than
4862 @code{print} gives you, use @code{printf}. With @code{printf} you can
4863 specify the width to use for each item, and you can specify various
4864 formatting choices for numbers (such as what radix to use, whether to
4865 print an exponent, whether to print a sign, and how many digits to print
4866 after the decimal point). You do this by supplying a string, called
4867 the @dfn{format string}, which controls how and where to print the other
4871 * Basic Printf:: Syntax of the @code{printf} statement.
4872 * Control Letters:: Format-control letters.
4873 * Format Modifiers:: Format-specification modifiers.
4874 * Printf Examples:: Several examples.
4877 @node Basic Printf, Control Letters, Printf, Printf
4878 @subsection Introduction to the @code{printf} Statement
4880 @cindex @code{printf} statement, syntax of
4881 The @code{printf} statement looks like this:
4884 printf @var{format}, @var{item1}, @var{item2}, @dots{}
4888 The entire list of arguments may optionally be enclosed in parentheses. The
4889 parentheses are necessary if any of the item expressions use the @samp{>}
4890 relational operator; otherwise it could be confused with a redirection
4891 (@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).
4893 @cindex format string
4894 The difference between @code{printf} and @code{print} is the @var{format}
4895 argument. This is an expression whose value is taken as a string; it
4896 specifies how to output each of the other arguments. It is called
4897 the @dfn{format string}.
4899 The format string is very similar to that in the ANSI C library function
4900 @code{printf}. Most of @var{format} is text to be output verbatim.
4901 Scattered among this text are @dfn{format specifiers}, one per item.
4902 Each format specifier says to output the next item in the argument list
4903 at that place in the format.
4905 The @code{printf} statement does not automatically append a newline to its
4906 output. It outputs only what the format string specifies. So if you want
4907 a newline, you must include one in the format string. The output separator
4908 variables @code{OFS} and @code{ORS} have no effect on @code{printf}
4909 statements. For example:
4914 ORS = "\nOUCH!\n"; OFS = "!"
4915 msg = "Don't Panic!"; printf "%s\n", msg
4920 This program still prints the familiar @samp{Don't Panic!} message.
4922 @node Control Letters, Format Modifiers, Basic Printf, Printf
4923 @subsection Format-Control Letters
4924 @cindex @code{printf}, format-control characters
4925 @cindex format specifier
4927 A format specifier starts with the character @samp{%} and ends with a
4928 @dfn{format-control letter}; it tells the @code{printf} statement how
4929 to output one item. (If you actually want to output a @samp{%}, write
4930 @samp{%%}.) The format-control letter specifies what kind of value to
4931 print. The rest of the format specifier is made up of optional
4932 @dfn{modifiers} which are parameters to use, such as the field width.
4934 Here is a list of the format-control letters:
4938 This prints a number as an ASCII character. Thus, @samp{printf "%c",
4939 65} outputs the letter @samp{A}. The output for a string value is
4940 the first character of the string.
4944 These are equivalent. They both print a decimal integer.
4945 The @samp{%i} specification is for compatibility with ANSI C.
4949 This prints a number in scientific (exponential) notation.
4953 printf "%4.3e\n", 1950
4957 prints @samp{1.950e+03}, with a total of four significant figures of
4958 which three follow the decimal point. The @samp{4.3} are modifiers,
4959 discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.
4962 This prints a number in floating point notation.
4966 printf "%4.3f", 1950
4970 prints @samp{1950.000}, with a total of four significant figures of
4971 which three follow the decimal point. The @samp{4.3} are modifiers,
4976 This prints a number in either scientific notation or floating point
4977 notation, whichever uses fewer characters. If the result is printed in
4978 scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.
4981 This prints an unsigned octal integer.
4982 (In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};
4983 the decimal number eight is represented as @samp{10} in octal.)
4986 This prints a string.
4989 This prints an unsigned decimal number.
4990 (This format is of marginal use, since all numbers in @code{awk}
4991 are floating point. It is provided primarily for compatibility
4996 This prints an unsigned hexadecimal integer.
4997 (In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}
4998 and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents
4999 the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}
5000 instead of @samp{a} through @samp{f}.
5003 This isn't really a format-control letter, but it does have a meaning
5004 when used after a @samp{%}: the sequence @samp{%%} outputs one
5005 @samp{%}. It does not consume an argument, and it ignores any
5010 When using the integer format-control letters for values that are outside
5011 the range of a C @code{long} integer, @code{gawk} will switch to the
5012 @samp{%g} format specifier. Other versions of @code{awk} may print
5013 invalid values, or do something else entirely (d.c.).
5015 @node Format Modifiers, Printf Examples, Control Letters, Printf
5016 @subsection Modifiers for @code{printf} Formats
5018 @cindex @code{printf}, modifiers
5019 @cindex modifiers (in format specifiers)
5020 A format specification can also include @dfn{modifiers} that can control
5021 how much of the item's value is printed and how much space it gets. The
5022 modifiers come between the @samp{%} and the format-control letter.
5023 In the examples below, we use the bullet symbol ``@bullet{}'' to represent
5024 spaces in the output. Here are the possible modifiers, in the order in
5025 which they may appear:
5029 The minus sign, used before the width modifier (see below),
5030 says to left-justify
5031 the argument within its specified width. Normally the argument
5032 is printed right-justified in the specified width. Thus,
5035 printf "%-4s", "foo"
5039 prints @samp{foo@bullet{}}.
5042 For numeric conversions, prefix positive values with a space, and
5043 negative values with a minus sign.
5046 The plus sign, used before the width modifier (see below),
5047 says to always supply a sign for numeric conversions, even if the data
5048 to be formatted is positive. The @samp{+} overrides the space modifier.
5051 Use an ``alternate form'' for certain control letters.
5052 For @samp{%o}, supply a leading zero.
5053 For @samp{%x}, and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
5055 For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a
5057 For @samp{%g}, and @samp{%G}, trailing zeros are not removed from the result.
5061 A leading @samp{0} (zero) acts as a flag, that indicates output should be
5062 padded with zeros instead of spaces.
5063 This applies even to non-numeric output formats (d.c.).
5064 This flag only has an effect when the field width is wider than the
5065 value to be printed.
5068 This is a number specifying the desired minimum width of a field. Inserting any
5069 number between the @samp{%} sign and the format control character forces the
5070 field to be expanded to this width. The default way to do this is to
5071 pad with spaces on the left. For example,
5078 prints @samp{@bullet{}foo}.
5080 The value of @var{width} is a minimum width, not a maximum. If the item
5081 value requires more than @var{width} characters, it can be as wide as
5085 printf "%4s", "foobar"
5089 prints @samp{foobar}.
5091 Preceding the @var{width} with a minus sign causes the output to be
5092 padded with spaces on the right, instead of on the left.
5095 This is a number that specifies the precision to use when printing.
5096 For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
5097 number of digits you want printed to the right of the decimal point.
5098 For the @samp{g}, and @samp{G} formats, it specifies the maximum number
5099 of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
5100 @samp{x}, and @samp{X} formats, it specifies the minimum number of
5101 digits to print. For a string, it specifies the maximum number of
5102 characters from the string that should be printed. Thus,
5105 printf "%.4s", "foobar"
5112 The C library @code{printf}'s dynamic @var{width} and @var{prec}
5113 capability (for example, @code{"%*.*s"}) is supported. Instead of
5114 supplying explicit @var{width} and/or @var{prec} values in the format
5115 string, you pass them in the argument list. For example:
5121 printf "%*.*s\n", w, p, s
5125 is exactly equivalent to
5133 Both programs output @samp{@w{@bullet{}@bullet{}abc}}.
5135 Earlier versions of @code{awk} did not support this capability.
5136 If you must use such a version, you may simulate this feature by using
5137 concatenation to build up the format string, like so:
5143 printf "%" w "." p "s\n", s
5147 This is not particularly easy to read, but it does work.
5149 @cindex @code{awk} language, POSIX version
5150 @cindex POSIX @code{awk}
5151 C programmers may be used to supplying additional @samp{l} and @samp{h}
5152 flags in @code{printf} format strings. These are not valid in @code{awk}.
5153 Most @code{awk} implementations silently ignore these flags.
5154 If @samp{--lint} is provided on the command line
5155 (@pxref{Options, ,Command Line Options}),
5156 @code{gawk} will warn about their use. If @samp{--posix} is supplied,
5157 their use is a fatal error.
5159 @node Printf Examples, , Format Modifiers, Printf
5160 @subsection Examples Using @code{printf}
5162 Here is how to use @code{printf} to make an aligned table:
5165 awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5169 prints the names of bulletin boards (@code{$1}) of the file
5170 @file{BBS-list} as a string of 10 characters, left justified. It also
5171 prints the phone numbers (@code{$2}) afterward on the line. This
5172 produces an aligned two-column table of names and phone numbers:
5176 $ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5177 @print{} aardvark 555-5553
5178 @print{} alpo-net 555-3412
5179 @print{} barfly 555-7685
5180 @print{} bites 555-1675
5181 @print{} camelot 555-0542
5182 @print{} core 555-2912
5183 @print{} fooey 555-1234
5184 @print{} foot 555-6699
5185 @print{} macfoo 555-6480
5186 @print{} sdace 555-3430
5187 @print{} sabafoo 555-2127
5191 Did you notice that we did not specify that the phone numbers be printed
5192 as numbers? They had to be printed as strings because the numbers are
5193 separated by a dash.
5194 If we had tried to print the phone numbers as numbers, all we would have
5195 gotten would have been the first three digits, @samp{555}.
5196 This would have been pretty confusing.
5198 We did not specify a width for the phone numbers because they are the
5199 last things on their lines. We don't need to put spaces after them.
5201 We could make our table look even nicer by adding headings to the tops
5202 of the columns. To do this, we use the @code{BEGIN} pattern
5203 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
5204 to force the header to be printed only once, at the beginning of
5205 the @code{awk} program:
5209 awk 'BEGIN @{ print "Name Number"
5210 print "---- ------" @}
5211 @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5215 Did you notice that we mixed @code{print} and @code{printf} statements in
5216 the above example? We could have used just @code{printf} statements to get
5221 awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
5222 printf "%-10s %s\n", "----", "------" @}
5223 @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
5228 By printing each column heading with the same format specification
5229 used for the elements of the column, we have made sure that the headings
5230 are aligned just like the columns.
5232 The fact that the same format specification is used three times can be
5233 emphasized by storing it in a variable, like this:
5237 awk 'BEGIN @{ format = "%-10s %s\n"
5238 printf format, "Name", "Number"
5239 printf format, "----", "------" @}
5240 @{ printf format, $1, $2 @}' BBS-list
5245 See if you can use the @code{printf} statement to line up the headings and
5246 table data for our @file{inventory-shipped} example covered earlier in the
5247 section on the @code{print} statement
5248 (@pxref{Print, ,The @code{print} Statement}).
5250 @node Redirection, Special Files, Printf, Printing
5251 @section Redirecting Output of @code{print} and @code{printf}
5253 @cindex output redirection
5254 @cindex redirection of output
5255 So far we have been dealing only with output that prints to the standard
5256 output, usually your terminal. Both @code{print} and @code{printf} can
5257 also send their output to other places.
5258 This is called @dfn{redirection}.
5260 A redirection appears after the @code{print} or @code{printf} statement.
5261 Redirections in @code{awk} are written just like redirections in shell
5262 commands, except that they are written inside the @code{awk} program.
5264 There are three forms of output redirection: output to a file,
5265 output appended to a file, and output through a pipe to another
5267 They are all shown for
5268 the @code{print} statement, but they work identically for @code{printf}
5272 @item print @var{items} > @var{output-file}
5273 This type of redirection prints the items into the output file
5274 @var{output-file}. The file name @var{output-file} can be any
5275 expression. Its value is changed to a string and then used as a
5276 file name (@pxref{Expressions}).
5278 When this type of redirection is used, the @var{output-file} is erased
5279 before the first output is written to it. Subsequent writes
5280 to the same @var{output-file} do not
5281 erase @var{output-file}, but append to it. If @var{output-file} does
5282 not exist, then it is created.
5284 For example, here is how an @code{awk} program can write a list of
5285 BBS names to a file @file{name-list} and a list of phone numbers to a
5286 file @file{phone-list}. Each output file contains one name or number
5291 $ awk '@{ print $2 > "phone-list"
5292 > print $1 > "name-list" @}' BBS-list
5308 @item print @var{items} >> @var{output-file}
5309 This type of redirection prints the items into the pre-existing output file
5310 @var{output-file}. The difference between this and the
5311 single-@samp{>} redirection is that the old contents (if any) of
5312 @var{output-file} are not erased. Instead, the @code{awk} output is
5313 appended to the file.
5314 If @var{output-file} does not exist, then it is created.
5316 @cindex pipes for output
5317 @cindex output, piping
5318 @item print @var{items} | @var{command}
5319 It is also possible to send output to another program through a pipe
5321 file. This type of redirection opens a pipe to @var{command} and writes
5322 the values of @var{items} through this pipe, to another process created
5323 to execute @var{command}.
5325 The redirection argument @var{command} is actually an @code{awk}
5326 expression. Its value is converted to a string, whose contents give the
5327 shell command to be run.
5329 For example, this produces two files, one unsorted list of BBS names
5330 and one list sorted in reverse alphabetical order:
5333 awk '@{ print $1 > "names.unsorted"
5334 command = "sort -r > names.sorted"
5335 print $1 | command @}' BBS-list
5338 Here the unsorted list is written with an ordinary redirection while
5339 the sorted list is written by piping through the @code{sort} utility.
5341 This example uses redirection to mail a message to a mailing
5342 list @samp{bug-system}. This might be useful when trouble is encountered
5343 in an @code{awk} script run periodically for system maintenance.
5346 report = "mail bug-system"
5347 print "Awk script failed:", $0 | report
5348 m = ("at record number " FNR " of " FILENAME)
5353 The message is built using string concatenation and saved in the variable
5354 @code{m}. It is then sent down the pipeline to the @code{mail} program.
5356 We call the @code{close} function here because it's a good idea to close
5357 the pipe as soon as all the intended output has been sent to it.
5358 @xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
5359 for more information
5360 on this. This example also illustrates the use of a variable to represent
5361 a @var{file} or @var{command}: it is not necessary to always
5362 use a string constant. Using a variable is generally a good idea,
5363 since @code{awk} requires you to spell the string value identically
5367 Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system
5368 to open a file or pipe only if the particular @var{file} or @var{command}
5369 you've specified has not already been written to by your program, or if
5370 it has been closed since it was last written to.
5372 @cindex differences between @code{gawk} and @code{awk}
5374 @cindex implementation limits
5376 As mentioned earlier
5377 (@pxref{Getline Summary, , Summary of @code{getline} Variants}),
5383 @code{awk} implementations limit the number of pipelines an @code{awk}
5384 program may have open to just one! In @code{gawk}, there is no such limit.
5385 You can open as many pipelines as the underlying operating system will
5388 @node Special Files, Close Files And Pipes , Redirection, Printing
5389 @section Special File Names in @code{gawk}
5390 @cindex standard input
5391 @cindex standard output
5392 @cindex standard error output
5393 @cindex file descriptors
5395 Running programs conventionally have three input and output streams
5396 already available to them for reading and writing. These are known as
5397 the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
5398 output}. These streams are, by default, connected to your terminal, but
5399 they are often redirected with the shell, via the @samp{<}, @samp{<<},
5400 @samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error
5401 is typically used for writing error messages; the reason we have two separate
5402 streams, standard output and standard error, is so that they can be
5403 redirected separately.
5405 @cindex differences between @code{gawk} and @code{awk}
5406 In other implementations of @code{awk}, the only way to write an error
5407 message to standard error in an @code{awk} program is as follows:
5410 print "Serious error detected!" | "cat 1>&2"
5414 This works by opening a pipeline to a shell command which can access the
5415 standard error stream which it inherits from the @code{awk} process.
5416 This is far from elegant, and is also inefficient, since it requires a
5417 separate process. So people writing @code{awk} programs often
5418 neglect to do this. Instead, they send the error messages to the
5419 terminal, like this:
5423 print "Serious error detected!" > "/dev/tty"
5428 This usually has the same effect, but not always: although the
5429 standard error stream is usually the terminal, it can be redirected, and
5430 when that happens, writing to the terminal is not correct. In fact, if
5431 @code{awk} is run from a background job, it may not have a terminal at all.
5432 Then opening @file{/dev/tty} will fail.
5434 @code{gawk} provides special file names for accessing the three standard
5435 streams. When you redirect input or output in @code{gawk}, if the file name
5436 matches one of these special names, then @code{gawk} directly uses the
5437 stream it stands for.
5439 @cindex @file{/dev/stdin}
5440 @cindex @file{/dev/stdout}
5441 @cindex @file{/dev/stderr}
5442 @cindex @file{/dev/fd}
5446 The standard input (file descriptor 0).
5449 The standard output (file descriptor 1).
5452 The standard error output (file descriptor 2).
5454 @item /dev/fd/@var{N}
5455 The file associated with file descriptor @var{N}. Such a file must have
5456 been opened by the program initiating the @code{awk} execution (typically
5457 the shell). Unless you take special pains in the shell from which
5458 you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.
5462 The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
5463 are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
5464 respectively, but they are more self-explanatory.
5466 The proper way to write an error message in a @code{gawk} program
5467 is to use @file{/dev/stderr}, like this:
5470 print "Serious error detected!" > "/dev/stderr"
5473 @code{gawk} also provides special file names that give access to information
5474 about the running @code{gawk} process. Each of these ``files'' provides
5475 a single record of information. To read them more than once, you must
5476 first close them with the @code{close} function
5477 (@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).
5480 @cindex process information
5481 @cindex @file{/dev/pid}
5482 @cindex @file{/dev/pgrpid}
5483 @cindex @file{/dev/ppid}
5484 @cindex @file{/dev/user}
5488 Reading this file returns the process ID of the current process,
5489 in decimal, terminated with a newline.
5492 Reading this file returns the parent process ID of the current process,
5493 in decimal, terminated with a newline.
5496 Reading this file returns the process group ID of the current process,
5497 in decimal, terminated with a newline.
5500 Reading this file returns a single record terminated with a newline.
5501 The fields are separated with spaces. The fields represent the
5502 following information:
5506 The return value of the @code{getuid} system call
5507 (the real user ID number).
5510 The return value of the @code{geteuid} system call
5511 (the effective user ID number).
5514 The return value of the @code{getgid} system call
5515 (the real group ID number).
5518 The return value of the @code{getegid} system call
5519 (the effective group ID number).
5522 If there are any additional fields, they are the group IDs returned by
5523 @code{getgroups} system call.
5524 (Multiple groups may not be supported on all systems.)
5528 These special file names may be used on the command line as data
5529 files, as well as for I/O redirections within an @code{awk} program.
5530 They may not be used as source files with the @samp{-f} option.
5532 Recognition of these special file names is disabled if @code{gawk} is in
5533 compatibility mode (@pxref{Options, ,Command Line Options}).
5535 @strong{Caution}: Unless your system actually has a @file{/dev/fd} directory
5536 (or any of the other above listed special files),
5537 the interpretation of these file names is done by @code{gawk} itself.
5538 For example, using @samp{/dev/fd/4} for output will actually write on
5539 file descriptor 4, and not on a new file descriptor that was @code{dup}'ed
5540 from file descriptor 4. Most of the time this does not matter; however, it
5541 is important to @emph{not} close any of the files related to file descriptors
5542 0, 1, and 2. If you do close one of these files, unpredictable behavior
5545 The special files that provide process-related information will disappear
5546 in a future version of @code{gawk}.
5547 @xref{Future Extensions, ,Probable Future Extensions}.
5549 @node Close Files And Pipes, , Special Files, Printing
5550 @section Closing Input and Output Files and Pipes
5551 @cindex closing input files and pipes
5552 @cindex closing output files and pipes
5555 If the same file name or the same shell command is used with
5557 (@pxref{Getline, ,Explicit Input with @code{getline}})
5558 more than once during the execution of an @code{awk}
5559 program, the file is opened (or the command is executed) only the first time.
5560 At that time, the first record of input is read from that file or command.
5561 The next time the same file or command is used in @code{getline}, another
5562 record is read from it, and so on.
5564 Similarly, when a file or pipe is opened for output, the file name or command
5566 it is remembered by @code{awk} and subsequent writes to the same file or
5567 command are appended to the previous writes. The file or pipe stays
5568 open until @code{awk} exits.
5570 This implies that if you want to start reading the same file again from
5571 the beginning, or if you want to rerun a shell command (rather than
5572 reading more output from the command), you must take special steps.
5573 What you must do is use the @code{close} function, as follows:
5576 close(@var{filename})
5583 close(@var{command})
5586 The argument @var{filename} or @var{command} can be any expression. Its
5587 value must @emph{exactly} match the string that was used to open the file or
5588 start the command (spaces and other ``irrelevant'' characters
5589 included). For example, if you open a pipe with this:
5592 "sort -r names" | getline foo
5596 then you must close it with this:
5599 close("sort -r names")
5602 Once this function call is executed, the next @code{getline} from that
5603 file or command, or the next @code{print} or @code{printf} to that
5604 file or command, will reopen the file or rerun the command.
5606 Because the expression that you use to close a file or pipeline must
5607 exactly match the expression used to open the file or run the command,
5608 it is good practice to use a variable to store the file name or command.
5609 The previous example would become
5612 sortcom = "sort -r names"
5613 sortcom | getline foo
5619 This helps avoid hard-to-find typographical errors in your @code{awk}
5622 Here are some reasons why you might need to close an output file:
5626 To write a file and read it back later on in the same @code{awk}
5627 program. Close the file when you are finished writing it; then
5628 you can start reading it with @code{getline}.
5631 To write numerous files, successively, in the same @code{awk}
5632 program. If you don't close the files, eventually you may exceed a
5633 system limit on the number of open files in one process. So close
5634 each one when you are finished writing it.
5637 To make a command finish. When you redirect output through a pipe,
5638 the command reading the pipe normally continues to try to read input
5639 as long as the pipe is open. Often this means the command cannot
5640 really do its work until the pipe is closed. For example, if you
5641 redirect output to the @code{mail} program, the message is not
5642 actually sent until the pipe is closed.
5647 To run the same program a second time, with the same arguments.
5648 This is not the same thing as giving more input to the first run!
5650 For example, suppose you pipe output to the @code{mail} program. If you
5651 output several lines redirected to this pipe without closing it, they make
5652 a single message of several lines. By contrast, if you close the pipe
5653 after each line of output, then each line makes a separate message.
5657 @cindex differences between @code{gawk} and @code{awk}
5658 @code{close} returns a value of zero if the close succeeded.
5659 Otherwise, the value will be non-zero.
5660 In this case, @code{gawk} sets the variable @code{ERRNO} to a string
5661 describing the error that occurred.
5663 @cindex differences between @code{gawk} and @code{awk}
5664 @cindex portability issues
5665 If you use more files than the system allows you to have open,
5666 @code{gawk} will attempt to multiplex the available open files among
5667 your data files. @code{gawk}'s ability to do this depends upon the
5668 facilities of your operating system: it may not always work. It is
5669 therefore both good practice and good portability advice to always
5670 use @code{close} on your files when you are done with them.
5672 @node Expressions, Patterns and Actions, Printing, Top
5673 @chapter Expressions
5676 Expressions are the basic building blocks of @code{awk} patterns
5677 and actions. An expression evaluates to a value, which you can print, test,
5678 store in a variable or pass to a function. Additionally, an expression
5679 can assign a new value to a variable or a field, with an assignment operator.
5681 An expression can serve as a pattern or action statement on its own.
5683 statements contain one or more expressions which specify data on which to
5684 operate. As in other languages, expressions in @code{awk} include
5685 variables, array references, constants, and function calls, as well as
5686 combinations of these with various operators.
5689 * Constants:: String, numeric, and regexp constants.
5690 * Using Constant Regexps:: When and how to use a regexp constant.
5691 * Variables:: Variables give names to values for later use.
5692 * Conversion:: The conversion of strings to numbers and vice
5694 * Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-},
5696 * Concatenation:: Concatenating strings.
5697 * Assignment Ops:: Changing the value of a variable or a field.
5698 * Increment Ops:: Incrementing the numeric value of a variable.
5699 * Truth Values:: What is ``true'' and what is ``false''.
5700 * Typing and Comparison:: How variables acquire types, and how this
5701 affects comparison of numbers and strings with
5703 * Boolean Ops:: Combining comparison expressions using boolean
5704 operators @samp{||} (``or''), @samp{&&}
5705 (``and'') and @samp{!} (``not'').
5706 * Conditional Exp:: Conditional expressions select between two
5707 subexpressions under control of a third
5709 * Function Calls:: A function call is an expression.
5710 * Precedence:: How various operators nest.
5713 @node Constants, Using Constant Regexps, Expressions, Expressions
5714 @section Constant Expressions
5715 @cindex constants, types of
5716 @cindex string constants
5718 The simplest type of expression is the @dfn{constant}, which always has
5719 the same value. There are three types of constants: numeric constants,
5720 string constants, and regular expression constants.
5723 * Scalar Constants:: Numeric and string constants.
5724 * Regexp Constants:: Regular Expression constants.
5727 @node Scalar Constants, Regexp Constants, Constants, Constants
5728 @subsection Numeric and String Constants
5730 @cindex numeric constant
5731 @cindex numeric value
5732 A @dfn{numeric constant} stands for a number. This number can be an
5733 integer, a decimal fraction, or a number in scientific (exponential)
5734 notation.@footnote{The internal representation uses double-precision
5735 floating point numbers. If you don't know what that means, then don't
5736 worry about it.} Here are some examples of numeric constants, which all
5737 have the same value:
5745 A string constant consists of a sequence of characters enclosed in
5746 double-quote marks. For example:
5753 @cindex differences between @code{gawk} and @code{awk}
5754 represents the string whose contents are @samp{parrot}. Strings in
5755 @code{gawk} can be of any length and they can contain any of the possible
5756 eight-bit ASCII characters including ASCII NUL (character code zero).
5758 implementations may have difficulty with some character codes.
5760 @node Regexp Constants, , Scalar Constants, Constants
5761 @subsection Regular Expression Constants
5763 @cindex @code{~} operator
5764 @cindex @code{!~} operator
5765 A regexp constant is a regular expression description enclosed in
5766 slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
5767 @code{awk} programs are constant, but the @samp{~} and @samp{!~}
5768 matching operators can also match computed or ``dynamic'' regexps
5769 (which are just ordinary strings or variables that contain a regexp).
5771 @node Using Constant Regexps, Variables, Constants, Expressions
5772 @section Using Regular Expression Constants
5774 When used on the right hand side of the @samp{~} or @samp{!~}
5775 operators, a regexp constant merely stands for the regexp that is to be
5779 Regexp constants (such as @code{/foo/}) may be used like simple expressions.
5781 regexp constant appears by itself, it has the same meaning as if it appeared
5782 in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)
5783 (@pxref{Expression Patterns, ,Expressions as Patterns}).
5784 This means that the two code segments,
5787 if ($0 ~ /barfly/ || $0 ~ /camelot/)
5795 if (/barfly/ || /camelot/)
5800 are exactly equivalent.
5802 One rather bizarre consequence of this rule is that the following
5803 boolean expression is valid, but does not do what the user probably
5807 # note that /foo/ is on the left of the ~
5808 if (/foo/ ~ $1) print "found foo"
5812 This code is ``obviously'' testing @code{$1} for a match against the regexp
5813 @code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means
5814 @samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
5815 against the regexp @code{/foo/}. The result will be either zero or one,
5816 depending upon the success or failure of the match. Then match that result
5817 against the first field in the record.
5819 Since it is unlikely that you would ever really wish to make this kind of
5820 test, @code{gawk} will issue a warning when it sees this construct in
5823 Another consequence of this rule is that the assignment statement
5830 will assign either zero or one to the variable @code{matches}, depending
5831 upon the contents of the current input record.
5833 This feature of the language was never well documented until the
5834 POSIX specification.
5836 @cindex differences between @code{gawk} and @code{awk}
5838 Constant regular expressions are also used as the first argument for
5839 the @code{gensub}, @code{sub} and @code{gsub} functions, and as the
5840 second argument of the @code{match} function
5841 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
5842 Modern implementations of @code{awk}, including @code{gawk}, allow
5843 the third argument of @code{split} to be a regexp constant, while some
5844 older implementations do not (d.c.).
5846 This can lead to confusion when attempting to use regexp constants
5847 as arguments to user defined functions
5848 (@pxref{User-defined, , User-defined Functions}).
5853 function mysub(pat, repl, str, global)
5856 gsub(pat, repl, str)
5866 text = "hi! hi yourself!"
5867 mysub(/hi/, "howdy", text, 1)
5873 In this example, the programmer wishes to pass a regexp constant to the
5874 user-defined function @code{mysub}, which will in turn pass it on to
5875 either @code{sub} or @code{gsub}. However, what really happens is that
5876 the @code{pat} parameter will be either one or zero, depending upon whether
5877 or not @code{$0} matches @code{/hi/}.
5879 As it is unlikely that you would ever really wish to pass a truth value
5880 in this way, @code{gawk} will issue a warning when it sees a regexp
5881 constant used as a parameter to a user-defined function.
5883 @node Variables, Conversion, Using Constant Regexps, Expressions
5886 Variables are ways of storing values at one point in your program for
5887 use later in another part of your program. You can manipulate them
5888 entirely within your program text, and you can also assign values to
5889 them on the @code{awk} command line.
5892 * Using Variables:: Using variables in your programs.
5893 * Assignment Options:: Setting variables on the command line and a
5894 summary of command line syntax. This is an
5895 advanced method of input.
5898 @node Using Variables, Assignment Options, Variables, Variables
5899 @subsection Using Variables in a Program
5901 @cindex variables, user-defined
5902 @cindex user-defined variables
5903 Variables let you give names to values and refer to them later. You have
5904 already seen variables in many of the examples. The name of a variable
5905 must be a sequence of letters, digits and underscores, but it may not begin
5906 with a digit. Case is significant in variable names; @code{a} and @code{A}
5907 are distinct variables.
5909 A variable name is a valid expression by itself; it represents the
5910 variable's current value. Variables are given new values with
5911 @dfn{assignment operators}, @dfn{increment operators} and
5912 @dfn{decrement operators}.
5913 @xref{Assignment Ops, ,Assignment Expressions}.
5915 A few variables have special built-in meanings, such as @code{FS}, the
5916 field separator, and @code{NF}, the number of fields in the current
5917 input record. @xref{Built-in Variables}, for a list of them. These
5918 built-in variables can be used and assigned just like all other
5919 variables, but their values are also used or changed automatically by
5920 @code{awk}. All built-in variables names are entirely upper-case.
5922 Variables in @code{awk} can be assigned either numeric or string
5923 values. By default, variables are initialized to the empty string, which
5924 is zero if converted to a number. There is no need to
5925 ``initialize'' each variable explicitly in @code{awk},
5926 the way you would in C and in most other traditional languages.
5928 @node Assignment Options, , Using Variables, Variables
5929 @subsection Assigning Variables on the Command Line
5931 You can set any @code{awk} variable by including a @dfn{variable assignment}
5932 among the arguments on the command line when you invoke @code{awk}
5933 (@pxref{Other Arguments, ,Other Command Line Arguments}). Such an assignment has
5937 @var{variable}=@var{text}
5941 With it, you can set a variable either at the beginning of the
5942 @code{awk} run or in between input files.
5944 If you precede the assignment with the @samp{-v} option, like this:
5947 -v @var{variable}=@var{text}
5951 then the variable is set at the very beginning, before even the
5952 @code{BEGIN} rules are run. The @samp{-v} option and its assignment
5953 must precede all the file name arguments, as well as the program text.
5954 (@xref{Options, ,Command Line Options}, for more information about
5955 the @samp{-v} option.)
5957 Otherwise, the variable assignment is performed at a time determined by
5958 its position among the input file arguments: after the processing of the
5959 preceding input file argument. For example:
5962 awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5966 prints the value of field number @code{n} for all input records. Before
5967 the first file is read, the command line sets the variable @code{n}
5968 equal to four. This causes the fourth field to be printed in lines from
5969 the file @file{inventory-shipped}. After the first file has finished,
5970 but before the second file is started, @code{n} is set to two, so that the
5971 second field is printed in lines from @file{BBS-list}.
5975 $ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
5985 Command line arguments are made available for explicit examination by
5986 the @code{awk} program in an array named @code{ARGV}
5987 (@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).
5990 @code{awk} processes the values of command line assignments for escape
5991 sequences (d.c.) (@pxref{Escape Sequences}).
5993 @node Conversion, Arithmetic Ops, Variables, Expressions
5994 @section Conversion of Strings and Numbers
5996 @cindex conversion of strings and numbers
5997 Strings are converted to numbers, and numbers to strings, if the context
5998 of the @code{awk} program demands it. For example, if the value of
5999 either @code{foo} or @code{bar} in the expression @samp{foo + bar}
6000 happens to be a string, it is converted to a number before the addition
6001 is performed. If numeric values appear in string concatenation, they
6002 are converted to strings. Consider this:
6006 print (two three) + 4
6010 This prints the (numeric) value 27. The numeric values of
6011 the variables @code{two} and @code{three} are converted to strings and
6012 concatenated together, and the resulting string is converted back to the
6013 number 23, to which four is then added.
6016 @cindex empty string
6017 @cindex type conversion
6018 If, for some reason, you need to force a number to be converted to a
6019 string, concatenate the empty string, @code{""}, with that number.
6020 To force a string to be converted to a number, add zero to that string.
6022 A string is converted to a number by interpreting any numeric prefix
6023 of the string as numerals:
6024 @code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
6025 has a numeric value of 25.
6026 Strings that can't be interpreted as valid numbers are converted to
6030 The exact manner in which numbers are converted into strings is controlled
6031 by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
6032 Numbers are converted using the @code{sprintf} function
6033 (@pxref{String Functions, ,Built-in Functions for String Manipulation})
6034 with @code{CONVFMT} as the format
6037 @code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with
6038 at least six significant digits. For some applications you will want to
6039 change it to specify more precision. On most modern machines, you must
6040 print 17 digits to capture a floating point number's value exactly.
6042 Strange results can happen if you set @code{CONVFMT} to a string that doesn't
6043 tell @code{sprintf} how to format floating point numbers in a useful way.
6044 For example, if you forget the @samp{%} in the format, all numbers will be
6045 converted to the same constant string.
6048 As a special case, if a number is an integer, then the result of converting
6049 it to a string is @emph{always} an integer, no matter what the value of
6050 @code{CONVFMT} may be. Given the following code fragment:
6059 @code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).
6061 @cindex @code{awk} language, POSIX version
6062 @cindex POSIX @code{awk}
6064 Prior to the POSIX standard, @code{awk} specified that the value
6065 of @code{OFMT} was used for converting numbers to strings. @code{OFMT}
6066 specifies the output format to use when printing numbers with @code{print}.
6067 @code{CONVFMT} was introduced in order to separate the semantics of
6068 conversion from the semantics of printing. Both @code{CONVFMT} and
6069 @code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
6070 of cases, old @code{awk} programs will not change their behavior.
6071 However, this use of @code{OFMT} is something to keep in mind if you must
6072 port your program to other implementations of @code{awk}; we recommend
6073 that instead of changing your programs, you just port @code{gawk} itself!
6074 @xref{Print, ,The @code{print} Statement},
6075 for more information on the @code{print} statement.
6077 @node Arithmetic Ops, Concatenation, Conversion, Expressions
6078 @section Arithmetic Operators
6079 @cindex arithmetic operators
6080 @cindex operators, arithmetic
6083 @cindex multiplication
6087 @cindex exponentiation
6089 The @code{awk} language uses the common arithmetic operators when
6090 evaluating expressions. All of these arithmetic operators follow normal
6091 precedence rules, and work as you would expect them to. Arithmetic
6092 operations are evaluated using double precision floating point, which
6093 has the usual problems of inexactness and exceptions.@footnote{David
6094 Goldberg, @uref{http://www.validgh.com/goldberg/paper.ps, @cite{What Every
6095 Computer Scientist Should Know About Floating-point Arithmetic}},
6096 @cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48.}
6098 Here is a file @file{grades} containing a list of student names and
6099 three test scores per student (it's a small class):
6108 This programs takes the file @file{grades}, and prints the average
6112 $ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
6113 > print $1, avg @}' grades
6116 @print{} Chris 84.3333
6119 This table lists the arithmetic operators in @code{awk}, in order from
6120 highest precedence to lowest:
6128 Unary plus. The expression is converted to a number.
6130 @cindex @code{awk} language, POSIX version
6131 @cindex POSIX @code{awk}
6132 @item @var{x} ^ @var{y}
6133 @itemx @var{x} ** @var{y}
6134 Exponentiation: @var{x} raised to the @var{y} power. @samp{2 ^ 3} has
6135 the value eight. The character sequence @samp{**} is equivalent to
6136 @samp{^}. (The POSIX standard only specifies the use of @samp{^}
6137 for exponentiation.)
6139 @item @var{x} * @var{y}
6142 @item @var{x} / @var{y}
6143 Division. Since all numbers in @code{awk} are
6144 floating point numbers, the result is not rounded to an integer: @samp{3 / 4}
6147 @item @var{x} % @var{y}
6148 @cindex differences between @code{gawk} and @code{awk}
6149 Remainder. The quotient is rounded toward zero to an integer,
6150 multiplied by @var{y} and this result is subtracted from @var{x}.
6151 This operation is sometimes known as ``trunc-mod.'' The following
6152 relation always holds:
6155 b * int(a / b) + (a % b) == a
6158 One possibly undesirable effect of this definition of remainder is that
6159 @code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus,
6165 In other @code{awk} implementations, the signedness of the remainder
6166 may be machine dependent.
6167 @c !!! what does posix say?
6169 @item @var{x} + @var{y}
6172 @item @var{x} - @var{y}
6177 For maximum portability, do not use the @samp{**} operator.
6179 Unary plus and minus have the same precedence,
6180 the multiplication operators all have the same precedence, and
6181 addition and subtraction have the same precedence.
6183 @node Concatenation, Assignment Ops, Arithmetic Ops, Expressions
6184 @section String Concatenation
6185 @cindex Kernighan, Brian
6187 @i{It seemed like a good idea at the time.}
6192 @cindex string operators
6193 @cindex operators, string
6194 @cindex concatenation
6195 There is only one string operation: concatenation. It does not have a
6196 specific operator to represent it. Instead, concatenation is performed by
6197 writing expressions next to one another, with no operator. For example:
6201 $ awk '@{ print "Field number one: " $1 @}' BBS-list
6202 @print{} Field number one: aardvark
6203 @print{} Field number one: alpo-net
6208 Without the space in the string constant after the @samp{:}, the line
6209 would run together. For example:
6213 $ awk '@{ print "Field number one:" $1 @}' BBS-list
6214 @print{} Field number one:aardvark
6215 @print{} Field number one:alpo-net
6220 Since string concatenation does not have an explicit operator, it is
6221 often necessary to insure that it happens where you want it to by
6222 using parentheses to enclose
6223 the items to be concatenated. For example, the
6224 following code fragment does not concatenate @code{file} and @code{name}
6225 as you might expect:
6231 print "something meaningful" > file name
6236 It is necessary to use the following:
6239 print "something meaningful" > (file name)
6242 We recommend that you use parentheses around concatenation in all but the
6243 most common contexts (such as on the right-hand side of @samp{=}).
6245 @node Assignment Ops, Increment Ops, Concatenation, Expressions
6246 @section Assignment Expressions
6247 @cindex assignment operators
6248 @cindex operators, assignment
6249 @cindex expression, assignment
6251 An @dfn{assignment} is an expression that stores a new value into a
6252 variable. For example, let's assign the value one to the variable
6259 After this expression is executed, the variable @code{z} has the value one.
6260 Whatever old value @code{z} had before the assignment is forgotten.
6262 Assignments can store string values also. For example, this would store
6263 the value @code{"this food is good"} in the variable @code{message}:
6268 message = "this " thing " is " predicate
6272 (This also illustrates string concatenation.)
6274 The @samp{=} sign is called an @dfn{assignment operator}. It is the
6275 simplest assignment operator because the value of the right-hand
6276 operand is stored unchanged.
6279 Most operators (addition, concatenation, and so on) have no effect
6280 except to compute a value. If you ignore the value, you might as well
6281 not use the operator. An assignment operator is different; it does
6282 produce a value, but even if you ignore the value, the assignment still
6283 makes itself felt through the alteration of the variable. We call this
6284 a @dfn{side effect}.
6288 The left-hand operand of an assignment need not be a variable
6289 (@pxref{Variables}); it can also be a field
6290 (@pxref{Changing Fields, ,Changing the Contents of a Field}) or
6291 an array element (@pxref{Arrays, ,Arrays in @code{awk}}).
6292 These are all called @dfn{lvalues},
6293 which means they can appear on the left-hand side of an assignment operator.
6294 The right-hand operand may be any expression; it produces the new value
6295 which the assignment stores in the specified variable, field or array
6296 element. (Such values are called @dfn{rvalues}).
6298 @cindex types of variables
6299 It is important to note that variables do @emph{not} have permanent types.
6300 The type of a variable is simply the type of whatever value it happens
6301 to hold at the moment. In the following program fragment, the variable
6302 @code{foo} has a numeric value at first, and a string value later on:
6314 When the second assignment gives @code{foo} a string value, the fact that
6315 it previously had a numeric value is forgotten.
6317 String values that do not begin with a digit have a numeric value of
6318 zero. After executing this code, the value of @code{foo} is five:
6326 (Note that using a variable as a number and then later as a string can
6327 be confusing and is poor programming style. The above examples illustrate how
6328 @code{awk} works, @emph{not} how you should write your own programs!)
6330 An assignment is an expression, so it has a value: the same value that
6331 is assigned. Thus, @samp{z = 1} as an expression has the value one.
6332 One consequence of this is that you can write multiple assignments together:
6339 stores the value zero in all three variables. It does this because the
6340 value of @samp{z = 0}, which is zero, is stored into @code{y}, and then
6341 the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.
6343 You can use an assignment anywhere an expression is called for. For
6344 example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one
6345 and then test whether @code{x} equals one. But this style tends to make
6346 programs hard to read; except in a one-shot program, you should
6347 not use such nesting of assignments.
6349 Aside from @samp{=}, there are several other assignment operators that
6350 do arithmetic with the old value of the variable. For example, the
6351 operator @samp{+=} computes a new value by adding the right-hand value
6352 to the old value of the variable. Thus, the following assignment adds
6353 five to the value of @code{foo}:
6360 This is equivalent to the following:
6367 Use whichever one makes the meaning of your program clearer.
6369 There are situations where using @samp{+=} (or any assignment operator)
6370 is @emph{not} the same as simply repeating the left-hand operand in the
6371 right-hand expression. For example:
6376 # Thanks to Pat Rankin for this example
6382 bar[rand()] = bar[rand()] + 5
6390 The indices of @code{bar} are guaranteed to be different, because
6391 @code{rand} will return different values each time it is called.
6392 (Arrays and the @code{rand} function haven't been covered yet.
6393 @xref{Arrays, ,Arrays in @code{awk}},
6394 and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information).
6395 This example illustrates an important fact about the assignment
6396 operators: the left-hand expression is only evaluated @emph{once}.
6398 It is also up to the implementation as to which expression is evaluated
6399 first, the left-hand one or the right-hand one.
6400 Consider this example:
6408 The value of @code{a[3]} could be either two or four.
6410 Here is a table of the arithmetic assignment operators. In each
6411 case, the right-hand operand is an expression whose value is converted
6416 @item @var{lvalue} += @var{increment}
6417 Adds @var{increment} to the value of @var{lvalue} to make the new value
6420 @item @var{lvalue} -= @var{decrement}
6421 Subtracts @var{decrement} from the value of @var{lvalue}.
6423 @item @var{lvalue} *= @var{coefficient}
6424 Multiplies the value of @var{lvalue} by @var{coefficient}.
6426 @item @var{lvalue} /= @var{divisor}
6427 Divides the value of @var{lvalue} by @var{divisor}.
6429 @item @var{lvalue} %= @var{modulus}
6430 Sets @var{lvalue} to its remainder by @var{modulus}.
6432 @cindex @code{awk} language, POSIX version
6433 @cindex POSIX @code{awk}
6434 @item @var{lvalue} ^= @var{power}
6435 @itemx @var{lvalue} **= @var{power}
6436 Raises @var{lvalue} to the power @var{power}.
6437 (Only the @samp{^=} operator is specified by POSIX.)
6441 For maximum portability, do not use the @samp{**=} operator.
6443 @node Increment Ops, Truth Values, Assignment Ops, Expressions
6444 @section Increment and Decrement Operators
6446 @cindex increment operators
6447 @cindex operators, increment
6448 @dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
6449 a variable by one. You could do the same thing with an assignment operator, so
6450 the increment operators add no power to the @code{awk} language; but they
6451 are convenient abbreviations for very common operations.
6453 The operator to add one is written @samp{++}. It can be used to increment
6454 a variable either before or after taking its value.
6456 To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds
6457 one to the value of @var{v} and that new value is also the value of this
6458 expression. The assignment expression @samp{@var{v} += 1} is completely
6461 Writing the @samp{++} after the variable specifies post-increment. This
6462 increments the variable value just the same; the difference is that the
6463 value of the increment expression itself is the variable's @emph{old}
6464 value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
6465 has the value four, but it changes the value of @code{foo} to five.
6467 The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo
6468 += 1) - 1}. It is not perfectly equivalent because all numbers in
6469 @code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does
6470 not necessarily equal @code{foo}. But the difference is minute as
6471 long as you stick to numbers that are fairly small (less than 10e12).
6473 Any lvalue can be incremented. Fields and array elements are incremented
6474 just like variables. (Use @samp{$(i++)} when you wish to do a field reference
6475 and a variable increment at the same time. The parentheses are necessary
6476 because of the precedence of the field reference operator, @samp{$}.)
6478 @cindex decrement operators
6479 @cindex operators, decrement
6480 The decrement operator @samp{--} works just like @samp{++} except that
6481 it subtracts one instead of adding. Like @samp{++}, it can be used before
6482 the lvalue to pre-decrement or after it to post-decrement.
6484 Here is a summary of increment and decrement expressions.
6488 @item ++@var{lvalue}
6489 This expression increments @var{lvalue} and the new value becomes the
6490 value of the expression.
6492 @item @var{lvalue}++
6493 This expression increments @var{lvalue}, but
6494 the value of the expression is the @emph{old} value of @var{lvalue}.
6496 @item --@var{lvalue}
6497 Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
6498 decrements @var{lvalue} and delivers the value that results.
6500 @item @var{lvalue}--
6501 Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
6502 decrements @var{lvalue}. The value of the expression is the @emph{old}
6503 value of @var{lvalue}.
6507 @node Truth Values, Typing and Comparison, Increment Ops, Expressions
6508 @section True and False in @code{awk}
6509 @cindex truth values
6510 @cindex logical true
6511 @cindex logical false
6513 Many programming languages have a special representation for the concepts
6514 of ``true'' and ``false.'' Such languages usually use the special
6515 constants @code{true} and @code{false}, or perhaps their upper-case
6519 @cindex empty string
6520 @code{awk} is different. It borrows a very simple concept of true and
6521 false from C. In @code{awk}, any non-zero numeric value, @emph{or} any
6522 non-empty string value is true. Any other value (zero or the null
6523 string, @code{""}) is false. The following program will print @samp{A strange
6524 truth value} three times:
6530 print "A strange truth value"
6531 if ("Four Score And Seven Years Ago")
6532 print "A strange truth value"
6534 print "A strange truth value"
6540 There is a surprising consequence of the ``non-zero or non-null'' rule:
6541 The string constant @code{"0"} is actually true, since it is non-null (d.c.).
6543 @node Typing and Comparison, Boolean Ops, Truth Values, Expressions
6544 @section Variable Typing and Comparison Expressions
6545 @cindex comparison expressions
6546 @cindex expression, comparison
6547 @cindex expression, matching
6548 @cindex relational operators
6549 @cindex operators, relational
6550 @cindex regexp match/non-match operators
6551 @cindex variable typing
6552 @cindex types of variables
6553 @c 2e: consider splitting this section into subsections
6555 @i{The Guide is definitive. Reality is frequently inaccurate.}
6556 The Hitchhiker's Guide to the Galaxy
6560 Unlike other programming languages, @code{awk} variables do not have a
6561 fixed type. Instead, they can be either a number or a string, depending
6562 upon the value that is assigned to them.
6564 @cindex numeric string
6565 The 1992 POSIX standard introduced
6566 the concept of a @dfn{numeric string}, which is simply a string that looks
6567 like a number, for example, @code{@w{" +2"}}. This concept is used
6568 for determining the type of a variable.
6570 The type of the variable is important, since the types of two variables
6571 determine how they are compared.
6573 In @code{gawk}, variable typing follows these rules.
6577 A numeric literal or the result of a numeric operation has the @var{numeric}
6581 A string literal or the result of a string operation has the @var{string}
6585 Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
6586 @code{ENVIRON} elements and the
6587 elements of an array created by @code{split} that are numeric strings
6588 have the @var{strnum} attribute. Otherwise, they have the @var{string}
6590 Uninitialized variables also have the @var{strnum} attribute.
6593 Attributes propagate across assignments, but are not changed by
6595 @c (Although a use may cause the entity to acquire an additional
6596 @c value such that it has both a numeric and string value -- this leaves the
6597 @c attribute unchanged.)
6598 @c This is important but not relevant
6601 The last rule is particularly important. In the following program,
6602 @code{a} has numeric type, even though it is later used in a string
6608 b = a " is a cute number"
6613 When two operands are compared, either string comparison or numeric comparison
6614 may be used, depending on the attributes of the operands, according to the
6615 following, symmetric, matrix:
6617 @c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
6620 \vbox{\bigskip % space above the table (about 1 linespace)
6621 % Because we have vertical rules, we can't let TeX insert interline space
6625 % Define the table template. & separates columns, and \cr ends the
6626 % template (and each row). # is replaced by the text of that entry on
6627 % each row. The template for the first column breaks down like this:
6628 % \strut -- a way to make each line have the height and depth
6629 % of a normal line of type, since we turned off interline spacing.
6630 % \hfil -- infinite glue; has the effect of right-justifying in this case.
6631 % # -- replaced by the text (for instance, `STRNUM', in the last row).
6632 % \quad -- about the width of an `M'. Just separates the columns.
6634 % The second column (\vrule#) is what generates the vertical rule that
6637 % The doubled && before the next entry means `repeat the following
6638 % template as many times as necessary on each line' -- in our case, twice.
6640 % The template itself, \quad#\hfil, left-justifies with a little space before.
6642 \halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
6643 &&STRING &NUMERIC &STRNUM\cr
6644 % The \omit tells TeX to skip inserting the template for this column on
6645 % this particular row. In this case, we only want a little extra space
6646 % to separate the heading row from the rule below it. the depth 2pt --
6647 % `\vrule depth 2pt' is that little space.
6649 % This is the horizontal rule below the heading. Since it has nothing to
6650 % do with the columns of the table, we use \noalign to get it in there.
6652 % Like above, this time a little more space.
6654 % The remaining rows have nothing special about them.
6655 STRING &&string &string &string\cr
6656 NUMERIC &&string &numeric &numeric\cr
6657 STRNUM &&string &numeric &numeric\cr
6662 +----------------------------------------------
6663 | STRING NUMERIC STRNUM
6664 --------+----------------------------------------------
6666 STRING | string string string
6668 NUMERIC | string numeric numeric
6670 STRNUM | string numeric numeric
6671 --------+----------------------------------------------
6675 The basic idea is that user input that looks numeric, and @emph{only}
6676 user input, should be treated as numeric, even though it is actually
6677 made of characters, and is therefore also a string.
6679 @dfn{Comparison expressions} compare strings or numbers for
6680 relationships such as equality. They are written using @dfn{relational
6681 operators}, which are a superset of those in C. Here is a table of
6684 @cindex relational operators
6685 @cindex operators, relational
6686 @cindex @code{<} operator
6687 @cindex @code{<=} operator
6688 @cindex @code{>} operator
6689 @cindex @code{>=} operator
6690 @cindex @code{==} operator
6691 @cindex @code{!=} operator
6692 @cindex @code{~} operator
6693 @cindex @code{!~} operator
6694 @cindex @code{in} operator
6697 @item @var{x} < @var{y}
6698 True if @var{x} is less than @var{y}.
6700 @item @var{x} <= @var{y}
6701 True if @var{x} is less than or equal to @var{y}.
6703 @item @var{x} > @var{y}
6704 True if @var{x} is greater than @var{y}.
6706 @item @var{x} >= @var{y}
6707 True if @var{x} is greater than or equal to @var{y}.
6709 @item @var{x} == @var{y}
6710 True if @var{x} is equal to @var{y}.
6712 @item @var{x} != @var{y}
6713 True if @var{x} is not equal to @var{y}.
6715 @item @var{x} ~ @var{y}
6716 True if the string @var{x} matches the regexp denoted by @var{y}.
6718 @item @var{x} !~ @var{y}
6719 True if the string @var{x} does not match the regexp denoted by @var{y}.
6721 @item @var{subscript} in @var{array}
6722 True if the array @var{array} has an element with the subscript @var{subscript}.
6726 Comparison expressions have the value one if true and zero if false.
6728 When comparing operands of mixed types, numeric operands are converted
6729 to strings using the value of @code{CONVFMT}
6730 (@pxref{Conversion, ,Conversion of Strings and Numbers}).
6732 Strings are compared
6733 by comparing the first character of each, then the second character of each,
6734 and so on. Thus @code{"10"} is less than @code{"9"}. If there are two
6735 strings where one is a prefix of the other, the shorter string is less than
6736 the longer one. Thus @code{"abc"} is less than @code{"abcd"}.
6738 @cindex common mistakes
6739 @cindex mistakes, common
6740 @cindex errors, common
6741 It is very easy to accidentally mistype the @samp{==} operator, and
6742 leave off one of the @samp{=}s. The result is still valid @code{awk}
6743 code, but the program will not do what you mean:
6746 if (a = b) # oops! should be a == b
6753 Unless @code{b} happens to be zero or the null string, the @code{if}
6754 part of the test will always succeed. Because the operators are
6755 so similar, this kind of error is very difficult to spot when
6756 scanning the source code.
6758 Here are some sample expressions, how @code{gawk} compares them, and what
6759 the result of the comparison is.
6763 numeric comparison (true)
6765 @item "abc" >= "xyz"
6766 string comparison (false)
6769 string comparison (true)
6772 string comparison (true)
6774 @item a = 2; b = "2"
6776 string comparison (true)
6778 @item a = 2; b = " +2"
6780 string comparison (false)
6787 $ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
6793 the result is @samp{false} since both @code{$1} and @code{$2} are numeric
6794 strings and thus both have the @var{strnum} attribute,
6795 dictating a numeric comparison.
6797 The purpose of the comparison rules and the use of numeric strings is
6798 to attempt to produce the behavior that is ``least surprising,'' while
6799 still ``doing the right thing.''
6801 @cindex comparisons, string vs. regexp
6802 @cindex string comparison vs. regexp comparison
6803 @cindex regexp comparison vs. string comparison
6804 String comparisons and regular expression comparisons are very different.
6812 has the value of one, or is true, if the variable @code{x}
6813 is precisely @samp{foo}. By contrast,
6820 has the value one if @code{x} contains @samp{foo}, such as
6821 @code{"Oh, what a fool am I!"}.
6823 The right hand operand of the @samp{~} and @samp{!~} operators may be
6824 either a regexp constant (@code{/@dots{}/}), or an ordinary
6825 expression, in which case the value of the expression as a string is used as a
6826 dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also
6827 @pxref{Computed Regexps, ,Using Dynamic Regexps}).
6829 @cindex regexp as expression
6830 In recent implementations of @code{awk}, a constant regular
6831 expression in slashes by itself is also an expression. The regexp
6832 @code{/@var{regexp}/} is an abbreviation for this comparison expression:
6838 One special place where @code{/foo/} is @emph{not} an abbreviation for
6839 @samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or
6841 @xref{Using Constant Regexps, ,Using Regular Expression Constants},
6842 where this is discussed in more detail.
6844 @c This paragraph has been here since day 1, and has always bothered
6845 @c me, especially since the expression doesn't really make a lot of
6846 @c sense. So, just take it out.
6848 In some contexts it may be necessary to write parentheses around the
6849 regexp to avoid confusing the @code{gawk} parser. For example,
6850 @samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))
6851 > threshold} parses properly.
6854 @node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions
6855 @section Boolean Expressions
6856 @cindex expression, boolean
6857 @cindex boolean expressions
6858 @cindex operators, boolean
6859 @cindex boolean operators
6860 @cindex logical operations
6861 @cindex operations, logical
6862 @cindex short-circuit operators
6863 @cindex operators, short-circuit
6864 @cindex and operator
6866 @cindex not operator
6867 @cindex @code{&&} operator
6868 @cindex @code{||} operator
6869 @cindex @code{!} operator
6871 A @dfn{boolean expression} is a combination of comparison expressions or
6872 matching expressions, using the boolean operators ``or''
6873 (@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
6874 parentheses to control nesting. The truth value of the boolean expression is
6875 computed by combining the truth values of the component expressions.
6876 Boolean expressions are also referred to as @dfn{logical expressions}.
6877 The terms are equivalent.
6879 Boolean expressions can be used wherever comparison and matching
6880 expressions can be used. They can be used in @code{if}, @code{while},
6881 @code{do} and @code{for} statements
6882 (@pxref{Statements, ,Control Statements in Actions}).
6883 They have numeric values (one if true, zero if false), which come into play
6884 if the result of the boolean expression is stored in a variable, or
6887 In addition, every boolean expression is also a valid pattern, so
6888 you can use one as a pattern to control the execution of rules.
6890 Here are descriptions of the three boolean operators, with examples.
6894 @item @var{boolean1} && @var{boolean2}
6895 True if both @var{boolean1} and @var{boolean2} are true. For example,
6896 the following statement prints the current input record if it contains
6897 both @samp{2400} and @samp{foo}.
6900 if ($0 ~ /2400/ && $0 ~ /foo/) print
6903 The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6904 is true. This can make a difference when @var{boolean2} contains
6905 expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&
6906 ($2 == bar++)}, the variable @code{bar} is not incremented if there is
6907 no @samp{foo} in the record.
6909 @item @var{boolean1} || @var{boolean2}
6910 True if at least one of @var{boolean1} or @var{boolean2} is true.
6911 For example, the following statement prints all records in the input
6912 that contain @emph{either} @samp{2400} or
6913 @samp{foo}, or both.
6916 if ($0 ~ /2400/ || $0 ~ /foo/) print
6919 The subexpression @var{boolean2} is evaluated only if @var{boolean1}
6920 is false. This can make a difference when @var{boolean2} contains
6921 expressions that have side effects.
6923 @item ! @var{boolean}
6924 True if @var{boolean} is false. For example, the following program prints
6925 all records in the input file @file{BBS-list} that do @emph{not} contain the
6928 @c A better example would be `if (! (subscript in array)) ...' but we
6929 @c haven't done anything with arrays or `in' yet. Sigh.
6931 awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
6936 The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
6937 operators because of the way they work. Evaluation of the full expression
6938 is ``short-circuited'' if the result can be determined part way through
6941 @cindex line continuation
6942 You can continue a statement that uses @samp{&&} or @samp{||} simply
6943 by putting a newline after them. But you cannot put a newline in front
6944 of either of these operators without using backslash continuation
6945 (@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
6947 The actual value of an expression using the @samp{!} operator will be
6948 either one or zero, depending upon the truth value of the expression it
6951 The @samp{!} operator is often useful for changing the sense of a flag
6952 variable from false to true and back again. For example, the following
6953 program is one way to print lines in between special bracketing lines:
6956 $1 == "START" @{ interested = ! interested @}
6957 interested == 1 @{ print @}
6958 $1 == "END" @{ interested = ! interested @}
6962 The variable @code{interested}, like all @code{awk} variables, starts
6963 out initialized to zero, which is also false. When a line is seen whose
6964 first field is @samp{START}, the value of @code{interested} is toggled
6965 to true, using @samp{!}. The next rule prints lines as long as
6966 @code{interested} is true. When a line is seen whose first field is
6967 @samp{END}, @code{interested} is toggled back to false.
6969 We should discuss using `next' in the two rules that toggle the
6970 variable, to avoid printing the bracketing lines, but that's more
6971 distraction than really needed.
6974 @node Conditional Exp, Function Calls, Boolean Ops, Expressions
6975 @section Conditional Expressions
6976 @cindex conditional expression
6977 @cindex expression, conditional
6979 A @dfn{conditional expression} is a special kind of expression with
6980 three operands. It allows you to use one expression's value to select
6981 one of two other expressions.
6983 The conditional expression is the same as in the C language:
6986 @var{selector} ? @var{if-true-exp} : @var{if-false-exp}
6990 There are three subexpressions. The first, @var{selector}, is always
6991 computed first. If it is ``true'' (not zero and not null) then
6992 @var{if-true-exp} is computed next and its value becomes the value of
6993 the whole expression. Otherwise, @var{if-false-exp} is computed next
6994 and its value becomes the value of the whole expression.
6996 For example, this expression produces the absolute value of @code{x}:
7002 Each time the conditional expression is computed, exactly one of
7003 @var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
7004 This is important when the expressions have side effects. For example,
7005 this conditional expression examines element @code{i} of either array
7006 @code{a} or array @code{b}, and increments @code{i}.
7009 x == y ? a[i++] : b[i++]
7013 This is guaranteed to increment @code{i} exactly once, because each time
7014 only one of the two increment expressions is executed,
7015 and the other is not.
7016 @xref{Arrays, ,Arrays in @code{awk}},
7017 for more information about arrays.
7019 @cindex differences between @code{gawk} and @code{awk}
7020 @cindex line continuation
7021 As a minor @code{gawk} extension,
7022 you can continue a statement that uses @samp{?:} simply
7023 by putting a newline after either character.
7024 However, you cannot put a newline in front
7025 of either character without using backslash continuation
7026 (@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
7027 If @samp{--posix} is specified
7028 (@pxref{Options, , Command Line Options}), then this extension is disabled.
7030 @node Function Calls, Precedence, Conditional Exp, Expressions
7031 @section Function Calls
7032 @cindex function call
7033 @cindex calling a function
7035 A @dfn{function} is a name for a particular calculation. Because it has
7036 a name, you can ask for it by name at any point in the program. For
7037 example, the function @code{sqrt} computes the square root of a number.
7039 A fixed set of functions are @dfn{built-in}, which means they are
7040 available in every @code{awk} program. The @code{sqrt} function is one
7041 of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in
7042 functions and their descriptions. In addition, you can define your own
7043 functions for use in your program.
7044 @xref{User-defined, ,User-defined Functions}, for how to do this.
7046 @cindex arguments in function call
7047 The way to use a function is with a @dfn{function call} expression,
7048 which consists of the function name followed immediately by a list of
7049 @dfn{arguments} in parentheses. The arguments are expressions which
7050 provide the raw materials for the function's calculations.
7051 When there is more than one argument, they are separated by commas. If
7052 there are no arguments, write just @samp{()} after the function name.
7053 Here are some examples:
7056 sqrt(x^2 + y^2) @i{one argument}
7057 atan2(y, x) @i{two arguments}
7058 rand() @i{no arguments}
7061 @strong{Do not put any space between the function name and the
7062 open-parenthesis!} A user-defined function name looks just like the name of
7063 a variable, and space would make the expression look like concatenation
7064 of a variable with an expression inside parentheses. Space before the
7065 parenthesis is harmless with built-in functions, but it is best not to get
7066 into the habit of using space to avoid mistakes with user-defined
7069 Each function expects a particular number of arguments. For example, the
7070 @code{sqrt} function must be called with a single argument, the number
7071 to take the square root of:
7074 sqrt(@var{argument})
7077 Some of the built-in functions allow you to omit the final argument.
7078 If you do so, they use a reasonable default.
7079 @xref{Built-in, ,Built-in Functions}, for full details. If arguments
7080 are omitted in calls to user-defined functions, then those arguments are
7081 treated as local variables, initialized to the empty string
7082 (@pxref{User-defined, ,User-defined Functions}).
7084 Like every other expression, the function call has a value, which is
7085 computed by the function based on the arguments you give it. In this
7086 example, the value of @samp{sqrt(@var{argument})} is the square root of
7087 @var{argument}. A function can also have side effects, such as assigning
7088 values to certain variables or doing I/O.
7090 Here is a command to read numbers, one number per line, and print the
7091 square root of each one:
7095 $ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
7097 @print{} The square root of 1 is 1
7099 @print{} The square root of 3 is 1.73205
7101 @print{} The square root of 5 is 2.23607
7106 @node Precedence, , Function Calls, Expressions
7107 @section Operator Precedence (How Operators Nest)
7109 @cindex operator precedence
7111 @dfn{Operator precedence} determines how operators are grouped, when
7112 different operators appear close by in one expression. For example,
7113 @samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
7114 means to multiply @code{b} and @code{c}, and then add @code{a} to the
7115 product (i.e.@: @samp{a + (b * c)}).
7117 You can overrule the precedence of the operators by using parentheses.
7118 You can think of the precedence rules as saying where the
7119 parentheses are assumed to be if you do not write parentheses yourself. In
7120 fact, it is wise to always use parentheses whenever you have an unusual
7121 combination of operators, because other people who read the program may
7122 not remember what the precedence is in this case. You might forget,
7123 too; then you could make a mistake. Explicit parentheses will help prevent
7126 When operators of equal precedence are used together, the leftmost
7127 operator groups first, except for the assignment, conditional and
7128 exponentiation operators, which group in the opposite order.
7129 Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and
7130 @samp{a = b = c} groups as @samp{a = (b = c)}.
7132 The precedence of prefix unary operators does not matter as long as only
7133 unary operators are involved, because there is only one way to interpret
7134 them---innermost first. Thus, @samp{$++i} means @samp{$(++i)} and
7135 @samp{++$x} means @samp{++($x)}. However, when another operator follows
7136 the operand, then the precedence of the unary operators can matter.
7137 Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
7138 @samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}
7139 while @samp{$} has higher precedence.
7141 Here is a table of @code{awk}'s operators, in order from highest
7142 precedence to lowest:
7144 @c use @code in the items, looks better in TeX w/o all the quotes
7153 Increment, decrement.
7155 @cindex @code{awk} language, POSIX version
7156 @cindex POSIX @code{awk}
7158 Exponentiation. These operators group right-to-left.
7159 (The @samp{**} operator is not specified by POSIX.)
7162 Unary plus, minus, logical ``not''.
7165 Multiplication, division, modulus.
7168 Addition, subtraction.
7170 @item @r{Concatenation}
7171 No special token is used to indicate concatenation.
7172 The operands are simply written side by side.
7176 Relational, and redirection.
7177 The relational operators and the redirections have the same precedence
7178 level. Characters such as @samp{>} serve both as relationals and as
7179 redirections; the context distinguishes between the two meanings.
7181 Note that the I/O redirection operators in @code{print} and @code{printf}
7182 statements belong to the statement level, not to expressions. The
7183 redirection does not produce an expression which could be the operand of
7184 another operator. As a result, it does not make sense to use a
7185 redirection operator near another operator of lower precedence, without
7186 parentheses. Such combinations, for example @samp{print foo > a ? b : c},
7187 result in syntax errors.
7188 The correct way to write this statement is @samp{print foo > (a ? b : c)}.
7191 Matching, non-matching.
7203 Conditional. This operator groups right-to-left.
7205 @cindex @code{awk} language, POSIX version
7206 @cindex POSIX @code{awk}
7209 Assignment. These operators group right-to-left.
7210 (The @samp{**=} operator is not specified by POSIX.)
7213 @node Patterns and Actions, Statements, Expressions, Top
7214 @chapter Patterns and Actions
7215 @cindex pattern, definition of
7217 As you have already seen, each @code{awk} statement consists of
7218 a pattern with an associated action. This chapter describes how
7219 you build patterns and actions.
7222 * Pattern Overview:: What goes into a pattern.
7223 * Action Overview:: What goes into an action.
7226 @node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions
7227 @section Pattern Elements
7229 Patterns in @code{awk} control the execution of rules: a rule is
7230 executed when its pattern matches the current input record. This
7231 section explains all about how to write patterns.
7234 * Kinds of Patterns:: A list of all kinds of patterns.
7235 * Regexp Patterns:: Using regexps as patterns.
7236 * Expression Patterns:: Any expression can be used as a pattern.
7237 * Ranges:: Pairs of patterns specify record ranges.
7238 * BEGIN/END:: Specifying initialization and cleanup rules.
7239 * Empty:: The empty pattern, which matches every record.
7242 @node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview
7243 @subsection Kinds of Patterns
7244 @cindex patterns, types of
7246 Here is a summary of the types of patterns supported in @code{awk}.
7249 @item /@var{regular expression}/
7250 A regular expression as a pattern. It matches when the text of the
7251 input record fits the regular expression.
7252 (@xref{Regexp, ,Regular Expressions}.)
7254 @item @var{expression}
7255 A single expression. It matches when its value
7256 is non-zero (if a number) or non-null (if a string).
7257 (@xref{Expression Patterns, ,Expressions as Patterns}.)
7259 @item @var{pat1}, @var{pat2}
7260 A pair of patterns separated by a comma, specifying a range of records.
7261 The range includes both the initial record that matches @var{pat1}, and
7262 the final record that matches @var{pat2}.
7263 (@xref{Ranges, ,Specifying Record Ranges with Patterns}.)
7267 Special patterns for you to supply start-up or clean-up actions for your
7269 (@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)
7272 The empty pattern matches every input record.
7273 (@xref{Empty, ,The Empty Pattern}.)
7276 @node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview
7277 @subsection Regular Expressions as Patterns
7279 We have been using regular expressions as patterns since our early examples.
7280 This kind of pattern is simply a regexp constant in the pattern part of
7281 a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
7282 The pattern matches when the input record matches the regexp.
7286 /foo|bar|baz/ @{ buzzwords++ @}
7287 END @{ print buzzwords, "buzzwords seen" @}
7290 @node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview
7291 @subsection Expressions as Patterns
7293 Any @code{awk} expression is valid as an @code{awk} pattern.
7294 Then the pattern matches if the expression's value is non-zero (if a
7295 number) or non-null (if a string).
7297 The expression is reevaluated each time the rule is tested against a new
7298 input record. If the expression uses fields such as @code{$1}, the
7299 value depends directly on the new input record's text; otherwise, it
7300 depends only on what has happened so far in the execution of the
7301 @code{awk} program, but that may still be useful.
7303 A very common kind of expression used as a pattern is the comparison
7304 expression, using the comparison operators described in
7305 @ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
7307 Regexp matching and non-matching are also very common expressions.
7308 The left operand of the @samp{~} and @samp{!~} operators is a string.
7309 The right operand is either a constant regular expression enclosed in
7310 slashes (@code{/@var{regexp}/}), or any expression, whose string value
7311 is used as a dynamic regular expression
7312 (@pxref{Computed Regexps, , Using Dynamic Regexps}).
7314 The following example prints the second field of each input record
7315 whose first field is precisely @samp{foo}.
7318 $ awk '$1 == "foo" @{ print $2 @}' BBS-list
7322 (There is no output, since there is no BBS site named ``foo''.)
7323 Contrast this with the following regular expression match, which would
7324 accept any record with a first field that contains @samp{foo}:
7328 $ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
7336 Boolean expressions are also commonly used as patterns.
7338 matches an input record depends on whether its subexpressions match.
7340 For example, the following command prints all records in
7341 @file{BBS-list} that contain both @samp{2400} and @samp{foo}.
7344 $ awk '/2400/ && /foo/' BBS-list
7345 @print{} fooey 555-1234 2400/1200/300 B
7348 The following command prints all records in
7349 @file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or
7354 $ awk '/2400/ || /foo/' BBS-list
7355 @print{} alpo-net 555-3412 2400/1200/300 A
7356 @print{} bites 555-1675 2400/1200/300 A
7357 @print{} fooey 555-1234 2400/1200/300 B
7358 @print{} foot 555-6699 1200/300 B
7359 @print{} macfoo 555-6480 1200/300 A
7360 @print{} sdace 555-3430 2400/1200/300 A
7361 @print{} sabafoo 555-2127 1200/300 C
7365 The following command prints all records in
7366 @file{BBS-list} that do @emph{not} contain the string @samp{foo}.
7370 $ awk '! /foo/' BBS-list
7371 @print{} aardvark 555-5553 1200/300 B
7372 @print{} alpo-net 555-3412 2400/1200/300 A
7373 @print{} barfly 555-7685 1200/300 A
7374 @print{} bites 555-1675 2400/1200/300 A
7375 @print{} camelot 555-0542 300 C
7376 @print{} core 555-2912 1200/300 C
7377 @print{} sdace 555-3430 2400/1200/300 A
7381 The subexpressions of a boolean operator in a pattern can be constant regular
7382 expressions, comparisons, or any other @code{awk} expressions. Range
7383 patterns are not expressions, so they cannot appear inside boolean
7384 patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
7385 which never match any input record, are not expressions and cannot
7386 appear inside boolean patterns.
7388 A regexp constant as a pattern is also a special case of an expression
7389 pattern. @code{/foo/} as an expression has the value one if @samp{foo}
7390 appears in the current input record; thus, as a pattern, @code{/foo/}
7391 matches any record containing @samp{foo}.
7393 @node Ranges, BEGIN/END, Expression Patterns, Pattern Overview
7394 @subsection Specifying Record Ranges with Patterns
7396 @cindex range pattern
7397 @cindex pattern, range
7398 @cindex matching ranges of lines
7399 A @dfn{range pattern} is made of two patterns separated by a comma, of
7400 the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of
7401 consecutive input records. The first pattern, @var{begpat}, controls
7402 where the range begins, and the second one, @var{endpat}, controls where
7403 it ends. For example,
7406 awk '$1 == "on", $1 == "off"'
7410 prints every record between @samp{on}/@samp{off} pairs, inclusive.
7412 A range pattern starts out by matching @var{begpat}
7413 against every input record; when a record matches @var{begpat}, the
7414 range pattern becomes @dfn{turned on}. The range pattern matches this
7415 record. As long as it stays turned on, it automatically matches every
7416 input record read. It also matches @var{endpat} against
7417 every input record; when that succeeds, the range pattern is turned
7418 off again for the following record. Then it goes back to checking
7419 @var{begpat} against each record.
7421 The record that turns on the range pattern and the one that turns it
7422 off both match the range pattern. If you don't want to operate on
7423 these records, you can write @code{if} statements in the rule's action
7424 to distinguish them from the records you are interested in.
7426 It is possible for a pattern to be turned both on and off by the same
7427 record, if the record satisfies both conditions. Then the action is
7428 executed for just that record.
7430 For example, suppose you have text between two identical markers (say
7431 the @samp{%} symbol) that you wish to ignore. You might try to
7432 combine a range pattern that describes the delimited text with the
7433 @code{next} statement
7434 (not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),
7435 which causes @code{awk} to skip any further processing of the current
7436 record and start over again with the next input record. Such a program
7437 would look like this:
7440 /^%$/,/^%$/ @{ next @}
7445 @cindex skipping lines between markers
7446 This program fails because the range pattern is both turned on and turned off
7447 by the first line with just a @samp{%} on it. To accomplish this task, you
7448 must write the program this way, using a flag:
7451 /^%$/ @{ skip = ! skip; next @}
7452 skip == 1 @{ next @} # skip lines with `skip' set
7455 Note that in a range pattern, the @samp{,} has the lowest precedence
7456 (is evaluated last) of all the operators. Thus, for example, the
7457 following program attempts to combine a range pattern with another,
7461 echo Yes | awk '/1/,/2/ || /Yes/'
7464 The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.
7465 However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
7466 This cannot be changed or worked around; range patterns do not combine
7467 with other patterns.
7469 @node BEGIN/END, Empty, Ranges, Pattern Overview
7470 @subsection The @code{BEGIN} and @code{END} Special Patterns
7472 @cindex @code{BEGIN} special pattern
7473 @cindex pattern, @code{BEGIN}
7474 @cindex @code{END} special pattern
7475 @cindex pattern, @code{END}
7476 @code{BEGIN} and @code{END} are special patterns. They are not used to
7477 match input records. Rather, they supply start-up or
7478 clean-up actions for your @code{awk} script.
7481 * Using BEGIN/END:: How and why to use BEGIN/END rules.
7482 * I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
7485 @node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END
7486 @subsubsection Startup and Cleanup Actions
7488 A @code{BEGIN} rule is executed, once, before the first input record
7489 has been read. An @code{END} rule is executed, once, after all the
7490 input has been read. For example:
7495 > BEGIN @{ print "Analysis of \"foo\"" @}
7497 > END @{ print "\"foo\" appears " n " times." @}' BBS-list
7498 @print{} Analysis of "foo"
7499 @print{} "foo" appears 4 times.
7503 This program finds the number of records in the input file @file{BBS-list}
7504 that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
7505 for the report. There is no need to use the @code{BEGIN} rule to
7506 initialize the counter @code{n} to zero, as @code{awk} does this
7507 automatically (@pxref{Variables}).
7509 The second rule increments the variable @code{n} every time a
7510 record containing the pattern @samp{foo} is read. The @code{END} rule
7511 prints the value of @code{n} at the end of the run.
7513 The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
7514 or with boolean operators (indeed, they cannot be used with any operators).
7516 An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}
7517 rules. They are executed in the order they appear, all the @code{BEGIN}
7518 rules at start-up and all the @code{END} rules at termination.
7519 @code{BEGIN} and @code{END} rules may be intermixed with other rules.
7520 This feature was added in the 1987 version of @code{awk}, and is included
7521 in the POSIX standard. The original (1978) version of @code{awk}
7522 required you to put the @code{BEGIN} rule at the beginning of the
7523 program, and the @code{END} rule at the end, and only allowed one of
7524 each. This is no longer required, but it is a good idea in terms of
7525 program organization and readability.
7527 Multiple @code{BEGIN} and @code{END} rules are useful for writing
7528 library functions, since each library file can have its own @code{BEGIN} and/or
7529 @code{END} rule to do its own initialization and/or cleanup. Note that
7530 the order in which library functions are named on the command line
7531 controls the order in which their @code{BEGIN} and @code{END} rules are
7532 executed. Therefore you have to be careful to write such rules in
7533 library files so that the order in which they are executed doesn't matter.
7534 @xref{Options, ,Command Line Options}, for more information on
7535 using library functions.
7536 @xref{Library Functions, ,A Library of @code{awk} Functions},
7537 for a number of useful library functions.
7540 If an @code{awk} program only has a @code{BEGIN} rule, and no other
7541 rules, then the program exits after the @code{BEGIN} rule has been run.
7542 (The original version of @code{awk} used to keep reading and ignoring input
7543 until end of file was seen.) However, if an @code{END} rule exists,
7544 then the input will be read, even if there are no other rules in
7545 the program. This is necessary in case the @code{END} rule checks the
7546 @code{FNR} and @code{NR} variables (d.c.).
7548 @code{BEGIN} and @code{END} rules must have actions; there is no default
7549 action for these rules since there is no current record when they run.
7551 @node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END
7552 @subsubsection Input/Output from @code{BEGIN} and @code{END} Rules
7554 @cindex I/O from @code{BEGIN} and @code{END}
7555 There are several (sometimes subtle) issues involved when doing I/O
7556 from a @code{BEGIN} or @code{END} rule.
7558 The first has to do with the value of @code{$0} in a @code{BEGIN}
7559 rule. Since @code{BEGIN} rules are executed before any input is read,
7560 there simply is no input record, and therefore no fields, when
7561 executing @code{BEGIN} rules. References to @code{$0} and the fields
7562 yield a null string or zero, depending upon the context. One way
7563 to give @code{$0} a real value is to execute a @code{getline} command
7564 without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).
7565 Another way is to simply assign a value to it.
7567 @cindex differences between @code{gawk} and @code{awk}
7568 The second point is similar to the first, but from the other direction.
7569 Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?
7570 Traditionally, due largely to implementation issues, @code{$0} and
7571 @code{NF} were @emph{undefined} inside an @code{END} rule.
7572 The POSIX standard specified that @code{NF} was available in an @code{END}
7573 rule, containing the number of fields from the last input record.
7574 Due most probably to an oversight, the standard does not say that @code{$0}
7575 is also preserved, although logically one would think that it should be.
7576 In fact, @code{gawk} does preserve the value of @code{$0} for use in
7577 @code{END} rules. Be aware, however, that Unix @code{awk}, and possibly
7578 other implementations, do not.
7580 The third point follows from the first two. What is the meaning of
7581 @samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is
7582 the same as always, @samp{print $0}. If @code{$0} is the null string,
7583 then this prints an empty line. Many long time @code{awk} programmers
7584 use @samp{print} in @code{BEGIN} and @code{END} rules, to mean
7585 @samp{@w{print ""}}, relying on @code{$0} being null. While you might
7586 generally get away with this in @code{BEGIN} rules, in @code{gawk} at
7587 least, it is a very bad idea in @code{END} rules. It is also poor
7588 style, since if you want an empty line in the output, you
7589 should say so explicitly in your program.
7591 @node Empty, , BEGIN/END, Pattern Overview
7592 @subsection The Empty Pattern
7594 @cindex empty pattern
7595 @cindex pattern, empty
7596 An empty (i.e.@: non-existent) pattern is considered to match @emph{every}
7597 input record. For example, the program:
7600 awk '@{ print $1 @}' BBS-list
7604 prints the first field of every record.
7606 @node Action Overview, , Pattern Overview, Patterns and Actions
7607 @section Overview of Actions
7608 @cindex action, definition of
7609 @cindex curly braces
7610 @cindex action, curly braces
7611 @cindex action, separating statements
7613 An @code{awk} program or script consists of a series of
7614 rules and function definitions, interspersed. (Functions are
7615 described later. @xref{User-defined, ,User-defined Functions}.)
7617 A rule contains a pattern and an action, either of which (but not
7619 omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do
7620 once a match for the pattern is found. Thus, in outline, an @code{awk}
7621 program generally looks like this:
7624 @r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7625 @r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
7627 function @var{name}(@var{args}) @{ @dots{} @}
7631 An action consists of one or more @code{awk} @dfn{statements}, enclosed
7632 in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one
7633 thing to be done. The statements are separated by newlines or
7636 The curly braces around an action must be used even if the action
7637 contains only one statement, or even if it contains no statements at
7638 all. However, if you omit the action entirely, omit the curly braces as
7639 well. An omitted action is equivalent to @samp{@{ print $0 @}}.
7642 /foo/ @{ @} # match foo, do nothing - empty action
7643 /foo/ # match foo, print the record - omitted action
7646 Here are the kinds of statements supported in @code{awk}:
7650 Expressions, which can call functions or assign values to variables
7651 (@pxref{Expressions}). Executing
7652 this kind of statement simply computes the value of the expression.
7653 This is useful when the expression has side effects
7654 (@pxref{Assignment Ops, ,Assignment Expressions}).
7657 Control statements, which specify the control flow of @code{awk}
7658 programs. The @code{awk} language gives you C-like constructs
7659 (@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
7660 special ones (@pxref{Statements, ,Control Statements in Actions}).
7663 Compound statements, which consist of one or more statements enclosed in
7664 curly braces. A compound statement is used in order to put several
7665 statements together in the body of an @code{if}, @code{while}, @code{do}
7666 or @code{for} statement.
7669 Input statements, using the @code{getline} command
7670 (@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}
7671 statement (@pxref{Next Statement, ,The @code{next} Statement}),
7672 and the @code{nextfile} statement
7673 (@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
7676 Output statements, @code{print} and @code{printf}.
7677 @xref{Printing, ,Printing Output}.
7680 Deletion statements, for deleting array elements.
7681 @xref{Delete, ,The @code{delete} Statement}.
7685 The next chapter covers control statements in detail.
7688 @node Statements, Built-in Variables, Patterns and Actions, Top
7689 @chapter Control Statements in Actions
7690 @cindex control statement
7692 @dfn{Control statements} such as @code{if}, @code{while}, and so on
7693 control the flow of execution in @code{awk} programs. Most of the
7694 control statements in @code{awk} are patterned on similar statements in
7697 All the control statements start with special keywords such as @code{if}
7698 and @code{while}, to distinguish them from simple expressions.
7700 @cindex compound statement
7701 @cindex statement, compound
7702 Many control statements contain other statements; for example, the
7703 @code{if} statement contains another statement which may or may not be
7704 executed. The contained statement is called the @dfn{body}. If you
7705 want to include more than one statement in the body, group them into a
7706 single @dfn{compound statement} with curly braces, separating them with
7707 newlines or semicolons.
7710 * If Statement:: Conditionally execute some @code{awk}
7712 * While Statement:: Loop until some condition is satisfied.
7713 * Do Statement:: Do specified action while looping until some
7714 condition is satisfied.
7715 * For Statement:: Another looping statement, that provides
7716 initialization and increment clauses.
7717 * Break Statement:: Immediately exit the innermost enclosing loop.
7718 * Continue Statement:: Skip to the end of the innermost enclosing
7720 * Next Statement:: Stop processing the current input record.
7721 * Nextfile Statement:: Stop processing the current file.
7722 * Exit Statement:: Stop execution of @code{awk}.
7725 @node If Statement, While Statement, Statements, Statements
7726 @section The @code{if}-@code{else} Statement
7728 @cindex @code{if}-@code{else} statement
7729 The @code{if}-@code{else} statement is @code{awk}'s decision-making
7730 statement. It looks like this:
7733 if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
7737 The @var{condition} is an expression that controls what the rest of the
7738 statement will do. If @var{condition} is true, @var{then-body} is
7739 executed; otherwise, @var{else-body} is executed.
7740 The @code{else} part of the statement is
7741 optional. The condition is considered false if its value is zero or
7742 the null string, and true otherwise.
7753 In this example, if the expression @samp{x % 2 == 0} is true (that is,
7754 the value of @code{x} is evenly divisible by two), then the first @code{print}
7755 statement is executed, otherwise the second @code{print} statement is
7758 If the @code{else} appears on the same line as @var{then-body}, and
7759 @var{then-body} is not a compound statement (i.e.@: not surrounded by
7760 curly braces), then a semicolon must separate @var{then-body} from
7761 @code{else}. To illustrate this, let's rewrite the previous example:
7764 if (x % 2 == 0) print "x is even"; else
7769 If you forget the @samp{;}, @code{awk} won't be able to interpret the
7770 statement, and you will get a syntax error.
7772 We would not actually write this example this way, because a human
7773 reader might fail to see the @code{else} if it were not the first thing
7776 @node While Statement, Do Statement, If Statement, Statements
7777 @section The @code{while} Statement
7778 @cindex @code{while} statement
7780 @cindex body of a loop
7782 In programming, a @dfn{loop} means a part of a program that can
7783 be executed two or more times in succession.
7785 The @code{while} statement is the simplest looping statement in
7786 @code{awk}. It repeatedly executes a statement as long as a condition is
7787 true. It looks like this:
7790 while (@var{condition})
7795 Here @var{body} is a statement that we call the @dfn{body} of the loop,
7796 and @var{condition} is an expression that controls how long the loop
7799 The first thing the @code{while} statement does is test @var{condition}.
7800 If @var{condition} is true, it executes the statement @var{body}.
7802 (The @var{condition} is true when the value
7803 is not zero and not a null string.)
7805 After @var{body} has been executed,
7806 @var{condition} is tested again, and if it is still true, @var{body} is
7807 executed again. This process repeats until @var{condition} is no longer
7808 true. If @var{condition} is initially false, the body of the loop is
7809 never executed, and @code{awk} continues with the statement following
7812 This example prints the first three fields of each record, one per line.
7820 @}' inventory-shipped
7824 Here the body of the loop is a compound statement enclosed in braces,
7825 containing two statements.
7827 The loop works like this: first, the value of @code{i} is set to one.
7828 Then, the @code{while} tests whether @code{i} is less than or equal to
7829 three. This is true when @code{i} equals one, so the @code{i}-th
7830 field is printed. Then the @samp{i++} increments the value of @code{i}
7831 and the loop repeats. The loop terminates when @code{i} reaches four.
7833 As you can see, a newline is not required between the condition and the
7834 body; but using one makes the program clearer unless the body is a
7835 compound statement or is very simple. The newline after the open-brace
7836 that begins the compound statement is not required either, but the
7837 program would be harder to read without it.
7839 @node Do Statement, For Statement, While Statement, Statements
7840 @section The @code{do}-@code{while} Statement
7842 The @code{do} loop is a variation of the @code{while} looping statement.
7843 The @code{do} loop executes the @var{body} once, and then repeats @var{body}
7844 as long as @var{condition} is true. It looks like this:
7850 while (@var{condition})
7854 Even if @var{condition} is false at the start, @var{body} is executed at
7855 least once (and only once, unless executing @var{body} makes
7856 @var{condition} true). Contrast this with the corresponding
7857 @code{while} statement:
7860 while (@var{condition})
7865 This statement does not execute @var{body} even once if @var{condition}
7866 is false to begin with.
7868 Here is an example of a @code{do} statement:
7880 This program prints each input record ten times. It isn't a very
7881 realistic example, since in this case an ordinary @code{while} would do
7882 just as well. But this reflects actual experience; there is only
7883 occasionally a real use for a @code{do} statement.
7885 @node For Statement, Break Statement, Do Statement, Statements
7886 @section The @code{for} Statement
7887 @cindex @code{for} statement
7889 The @code{for} statement makes it more convenient to count iterations of a
7890 loop. The general form of the @code{for} statement looks like this:
7893 for (@var{initialization}; @var{condition}; @var{increment})
7898 The @var{initialization}, @var{condition} and @var{increment} parts are
7899 arbitrary @code{awk} expressions, and @var{body} stands for any
7900 @code{awk} statement.
7902 The @code{for} statement starts by executing @var{initialization}.
7904 as @var{condition} is true, it repeatedly executes @var{body} and then
7905 @var{increment}. Typically @var{initialization} sets a variable to
7906 either zero or one, @var{increment} adds one to it, and @var{condition}
7907 compares it against the desired number of iterations.
7909 Here is an example of a @code{for} statement:
7913 awk '@{ for (i = 1; i <= 3; i++)
7915 @}' inventory-shipped
7920 This prints the first three fields of each input record, one field per
7923 You cannot set more than one variable in the
7924 @var{initialization} part unless you use a multiple assignment statement
7925 such as @samp{x = y = 0}, which is possible only if all the initial values
7926 are equal. (But you can initialize additional variables by writing
7927 their assignments as separate statements preceding the @code{for} loop.)
7929 The same is true of the @var{increment} part; to increment additional
7930 variables, you must write separate statements at the end of the loop.
7931 The C compound expression, using C's comma operator, would be useful in
7932 this context, but it is not supported in @code{awk}.
7934 Most often, @var{increment} is an increment expression, as in the
7935 example above. But this is not required; it can be any expression
7936 whatever. For example, this statement prints all the powers of two
7937 between one and 100:
7940 for (i = 1; i <= 100; i *= 2)
7944 Any of the three expressions in the parentheses following the @code{for} may
7945 be omitted if there is nothing to be done there. Thus, @w{@samp{for (; x
7946 > 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
7947 @var{condition} is omitted, it is treated as @var{true}, effectively
7948 yielding an @dfn{infinite loop} (i.e.@: a loop that will never
7951 In most cases, a @code{for} loop is an abbreviation for a @code{while}
7952 loop, as shown here:
7955 @var{initialization}
7956 while (@var{condition}) @{
7963 The only exception is when the @code{continue} statement
7964 (@pxref{Continue Statement, ,The @code{continue} Statement}) is used
7965 inside the loop; changing a @code{for} statement to a @code{while}
7966 statement in this way can change the effect of the @code{continue}
7967 statement inside the loop.
7969 There is an alternate version of the @code{for} loop, for iterating over
7970 all the indices of an array:
7974 @var{do something with} array[i]
7978 @xref{Scanning an Array, ,Scanning All Elements of an Array},
7979 for more information on this version of the @code{for} loop.
7981 The @code{awk} language has a @code{for} statement in addition to a
7982 @code{while} statement because often a @code{for} loop is both less work to
7983 type and more natural to think of. Counting the number of iterations is
7984 very common in loops. It can be easier to think of this counting as part
7985 of looping rather than as something to do inside the loop.
7987 The next section has more complicated examples of @code{for} loops.
7989 @node Break Statement, Continue Statement, For Statement, Statements
7990 @section The @code{break} Statement
7991 @cindex @code{break} statement
7992 @cindex loops, exiting
7994 The @code{break} statement jumps out of the innermost @code{for},
7995 @code{while}, or @code{do} loop that encloses it. The
7996 following example finds the smallest divisor of any integer, and also
7997 identifies prime numbers:
8000 awk '# find smallest divisor of num
8003 for (div = 2; div*div <= num; div++)
8008 printf "Smallest divisor of %d is %d\n", num, div
8010 printf "%d is prime\n", num
8014 When the remainder is zero in the first @code{if} statement, @code{awk}
8015 immediately @dfn{breaks out} of the containing @code{for} loop. This means
8016 that @code{awk} proceeds immediately to the statement following the loop
8017 and continues processing. (This is very different from the @code{exit}
8018 statement which stops the entire @code{awk} program.
8019 @xref{Exit Statement, ,The @code{exit} Statement}.)
8021 Here is another program equivalent to the previous one. It illustrates how
8022 the @var{condition} of a @code{for} or @code{while} could just as well be
8023 replaced with a @code{break} inside an @code{if}:
8027 awk '# find smallest divisor of num
8029 for (div = 2; ; div++) @{
8030 if (num % div == 0) @{
8031 printf "Smallest divisor of %d is %d\n", num, div
8034 if (div*div > num) @{
8035 printf "%d is prime\n", num
8043 @cindex @code{break}, outside of loops
8044 @cindex historical features
8045 @cindex @code{awk} language, POSIX version
8046 @cindex POSIX @code{awk}
8048 As described above, the @code{break} statement has no meaning when
8049 used outside the body of a loop. However, although it was never documented,
8050 historical implementations of @code{awk} have treated the @code{break}
8051 statement outside of a loop as if it were a @code{next} statement
8052 (@pxref{Next Statement, ,The @code{next} Statement}).
8053 Recent versions of Unix @code{awk} no longer allow this usage.
8054 @code{gawk} will support this use of @code{break} only if @samp{--traditional}
8055 has been specified on the command line
8056 (@pxref{Options, ,Command Line Options}).
8057 Otherwise, it will be treated as an error, since the POSIX standard
8058 specifies that @code{break} should only be used inside the body of a
8061 @node Continue Statement, Next Statement, Break Statement, Statements
8062 @section The @code{continue} Statement
8064 @cindex @code{continue} statement
8065 The @code{continue} statement, like @code{break}, is used only inside
8066 @code{for}, @code{while}, and @code{do} loops. It skips
8067 over the rest of the loop body, causing the next cycle around the loop
8068 to begin immediately. Contrast this with @code{break}, which jumps out
8069 of the loop altogether.
8071 @c The point of this program was to illustrate the use of continue with
8072 @c a while loop. But Karl Berry points out that that is done adequately
8073 @c below, and that this example is very un-awk-like. So for now, we'll
8076 In Texinfo source files, text that the author wishes to ignore can be
8077 enclosed between lines that start with @samp{@@ignore} and end with
8078 @samp{@atend ignore}. Here is a program that strips out lines between
8079 @samp{@@ignore} and @samp{@atend ignore} pairs.
8083 while (getline > 0) @{
8086 else if (/^@@end[ \t]+ignore/) @{
8097 When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).
8098 When @samp{@atend ignore} is seen, the flag is reset to zero (false). As long
8099 as the flag is true, the input record is not printed, because the
8100 @code{continue} restarts the @code{while} loop, skipping over the @code{print}
8104 @c How could this program be written to make better use of the awk language?
8107 The @code{continue} statement in a @code{for} loop directs @code{awk} to
8108 skip the rest of the body of the loop, and resume execution with the
8109 increment-expression of the @code{for} statement. The following program
8110 illustrates this fact:
8114 for (x = 0; x <= 20; x++) @{
8124 This program prints all the numbers from zero to 20, except for five, for
8125 which the @code{printf} is skipped. Since the increment @samp{x++}
8126 is not skipped, @code{x} does not remain stuck at five. Contrast the
8127 @code{for} loop above with this @code{while} loop:
8143 This program loops forever once @code{x} gets to five.
8145 @cindex @code{continue}, outside of loops
8146 @cindex historical features
8147 @cindex @code{awk} language, POSIX version
8148 @cindex POSIX @code{awk}
8150 As described above, the @code{continue} statement has no meaning when
8151 used outside the body of a loop. However, although it was never documented,
8152 historical implementations of @code{awk} have treated the @code{continue}
8153 statement outside of a loop as if it were a @code{next} statement
8154 (@pxref{Next Statement, ,The @code{next} Statement}).
8155 Recent versions of Unix @code{awk} no longer allow this usage.
8156 @code{gawk} will support this use of @code{continue} only if
8157 @samp{--traditional} has been specified on the command line
8158 (@pxref{Options, ,Command Line Options}).
8159 Otherwise, it will be treated as an error, since the POSIX standard
8160 specifies that @code{continue} should only be used inside the body of a
8163 @node Next Statement, Nextfile Statement, Continue Statement, Statements
8164 @section The @code{next} Statement
8165 @cindex @code{next} statement
8167 The @code{next} statement forces @code{awk} to immediately stop processing
8168 the current record and go on to the next record. This means that no
8169 further rules are executed for the current record. The rest of the
8170 current rule's action is not executed either.
8172 Contrast this with the effect of the @code{getline} function
8173 (@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes
8174 @code{awk} to read the next record immediately, but it does not alter the
8175 flow of control in any way. So the rest of the current action executes
8176 with a new input record.
8178 At the highest level, @code{awk} program execution is a loop that reads
8179 an input record and then tests each rule's pattern against it. If you
8180 think of this loop as a @code{for} statement whose body contains the
8181 rules, then the @code{next} statement is analogous to a @code{continue}
8182 statement: it skips to the end of the body of this implicit loop, and
8183 executes the increment (which reads another record).
8185 For example, if your @code{awk} program works only on records with four
8186 fields, and you don't want it to fail when given bad input, you might
8187 use this rule near the beginning of the program:
8192 err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR)
8193 print err > "/dev/stderr"
8200 so that the following rules will not see the bad record. The error
8201 message is redirected to the standard error output stream, as error
8202 messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}.
8204 @cindex @code{awk} language, POSIX version
8205 @cindex POSIX @code{awk}
8206 According to the POSIX standard, the behavior is undefined if
8207 the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
8208 @code{gawk} will treat it as a syntax error.
8209 Although POSIX permits it,
8210 some other @code{awk} implementations don't allow the @code{next}
8211 statement inside function bodies
8212 (@pxref{User-defined, ,User-defined Functions}).
8213 Just as any other @code{next} statement, a @code{next} inside a
8214 function body reads the next record and starts processing it with the
8215 first rule in the program.
8217 If the @code{next} statement causes the end of the input to be reached,
8218 then the code in any @code{END} rules will be executed.
8219 @xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8221 @cindex @code{next}, inside a user-defined function
8222 @strong{Caution:} Some @code{awk} implementations generate a run-time
8223 error if you use the @code{next} statement inside a user-defined function
8224 (@pxref{User-defined, , User-defined Functions}).
8225 @code{gawk} does not have this problem.
8227 @node Nextfile Statement, Exit Statement, Next Statement, Statements
8228 @section The @code{nextfile} Statement
8229 @cindex @code{nextfile} statement
8230 @cindex differences between @code{gawk} and @code{awk}
8232 @code{gawk} provides the @code{nextfile} statement,
8233 which is similar to the @code{next} statement.
8234 However, instead of abandoning processing of the current record, the
8235 @code{nextfile} statement instructs @code{gawk} to stop processing the
8238 Upon execution of the @code{nextfile} statement, @code{FILENAME} is
8239 updated to the name of the next data file listed on the command line,
8240 @code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
8241 starts over with the first rule in the progam. @xref{Built-in Variables}.
8243 If the @code{nextfile} statement causes the end of the input to be reached,
8244 then the code in any @code{END} rules will be executed.
8245 @xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
8247 The @code{nextfile} statement is a @code{gawk} extension; it is not
8248 (currently) available in any other @code{awk} implementation.
8249 @xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
8250 for a user-defined function you can use to simulate the @code{nextfile}
8253 The @code{nextfile} statement would be useful if you have many data
8254 files to process, and you expect that you
8255 would not want to process every record in every file.
8256 Normally, in order to move on to
8257 the next data file, you would have to continue scanning the unwanted
8258 records. The @code{nextfile} statement accomplishes this much more
8261 @cindex @code{next file} statement
8262 @strong{Caution:} Versions of @code{gawk} prior to 3.0 used two
8263 words (@samp{next file}) for the @code{nextfile} statement. This was
8264 changed in 3.0 to one word, since the treatment of @samp{file} was
8265 inconsistent. When it appeared after @code{next}, it was a keyword.
8266 Otherwise, it was a regular identifier. The old usage is still
8267 accepted. However, @code{gawk} will generate a warning message, and
8268 support for @code{next file} will eventually be discontinued in a
8269 future version of @code{gawk}.
8271 @node Exit Statement, , Nextfile Statement, Statements
8272 @section The @code{exit} Statement
8274 @cindex @code{exit} statement
8275 The @code{exit} statement causes @code{awk} to immediately stop
8276 executing the current rule and to stop processing input; any remaining input
8277 is ignored. It looks like this:
8280 exit @r{[}@var{return code}@r{]}
8283 If an @code{exit} statement is executed from a @code{BEGIN} rule the
8284 program stops processing everything immediately. No input records are
8285 read. However, if an @code{END} rule is present, it is executed
8286 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
8288 If @code{exit} is used as part of an @code{END} rule, it causes
8289 the program to stop immediately.
8291 An @code{exit} statement that is not part
8292 of a @code{BEGIN} or @code{END} rule stops the execution of any further
8293 automatic rules for the current record, skips reading any remaining input
8294 records, and executes
8295 the @code{END} rule if there is one.
8297 If you do not want the @code{END} rule to do its job in this case, you
8298 can set a variable to non-zero before the @code{exit} statement, and check
8299 that variable in the @code{END} rule.
8300 @xref{Assert Function, ,Assertions},
8301 for an example that does this.
8304 If an argument is supplied to @code{exit}, its value is used as the exit
8305 status code for the @code{awk} process. If no argument is supplied,
8306 @code{exit} returns status zero (success). In the case where an argument
8307 is supplied to a first @code{exit} statement, and then @code{exit} is
8308 called a second time with no argument, the previously supplied exit value
8311 For example, let's say you've discovered an error condition you really
8312 don't know how to handle. Conventionally, programs report this by
8313 exiting with a non-zero status. Your @code{awk} program can do this
8314 using an @code{exit} statement with a non-zero argument. Here is an
8320 if (("date" | getline date_now) <= 0) @{
8321 print "Can't get system date" > "/dev/stderr"
8324 print "current date is", date_now
8330 @node Built-in Variables, Arrays, Statements, Top
8331 @chapter Built-in Variables
8332 @cindex built-in variables
8334 Most @code{awk} variables are available for you to use for your own
8335 purposes; they never change except when your program assigns values to
8336 them, and never affect anything except when your program examines them.
8337 However, a few variables in @code{awk} have special built-in meanings.
8338 Some of them @code{awk} examines automatically, so that they enable you
8339 to tell @code{awk} how to do certain things. Others are set
8340 automatically by @code{awk}, so that they carry information from the
8341 internal workings of @code{awk} to your program.
8343 This chapter documents all the built-in variables of @code{gawk}. Most
8344 of them are also documented in the chapters describing their areas of
8348 * User-modified:: Built-in variables that you change to control
8350 * Auto-set:: Built-in variables where @code{awk} gives you
8352 * ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}.
8355 @node User-modified, Auto-set, Built-in Variables, Built-in Variables
8356 @section Built-in Variables that Control @code{awk}
8357 @cindex built-in variables, user modifiable
8359 This is an alphabetical list of the variables which you can change to
8360 control how @code{awk} does certain things. Those variables that are
8361 specific to @code{gawk} are marked with an asterisk, @samp{*}.
8365 @cindex @code{awk} language, POSIX version
8366 @cindex POSIX @code{awk}
8368 This string controls conversion of numbers to
8369 strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
8370 It works by being passed, in effect, as the first argument to the
8371 @code{sprintf} function
8372 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8373 Its default value is @code{"%.6g"}.
8374 @code{CONVFMT} was introduced by the POSIX standard.
8378 This is a space separated list of columns that tells @code{gawk}
8379 how to split input with fixed, columnar boundaries. It is an
8380 experimental feature. Assigning to @code{FIELDWIDTHS}
8381 overrides the use of @code{FS} for field splitting.
8382 @xref{Constant Size, ,Reading Fixed-width Data}, for more information.
8384 If @code{gawk} is in compatibility mode
8385 (@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}
8386 has no special meaning, and field splitting operations are done based
8387 exclusively on the value of @code{FS}.
8391 @code{FS} is the input field separator
8392 (@pxref{Field Separators, ,Specifying How Fields are Separated}).
8393 The value is a single-character string or a multi-character regular
8394 expression that matches the separations between fields in an input
8395 record. If the value is the null string (@code{""}), then each
8396 character in the record becomes a separate field.
8398 The default value is @w{@code{" "}}, a string consisting of a single
8399 space. As a special exception, this value means that any
8400 sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
8401 POSIX @code{awk}, newline does not count as whitespace.} It also causes
8402 spaces, tabs, and newlines at the beginning and end of a record to be ignored.
8404 You can set the value of @code{FS} on the command line using the
8408 awk -F, '@var{program}' @var{input-files}
8411 If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,
8412 assigning a value to @code{FS} will cause @code{gawk} to return to
8413 the normal, @code{FS}-based, field splitting. An easy way to do this
8414 is to simply say @samp{FS = FS}, perhaps with an explanatory comment.
8418 If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,
8419 and all regular expression matching are case-independent. Thus, regexp
8420 matching with @samp{~} and @samp{!~}, and the @code{gensub},
8421 @code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}
8422 functions, record termination with @code{RS}, and field splitting with
8423 @code{FS} all ignore case when doing their particular regexp operations.
8424 The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
8425 @xref{Case-sensitivity, ,Case-sensitivity in Matching}.
8427 If @code{gawk} is in compatibility mode
8428 (@pxref{Options, ,Command Line Options}),
8429 then @code{IGNORECASE} has no special meaning, and string
8430 and regexp operations are always case-sensitive.
8434 This string controls conversion of numbers to
8435 strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
8436 printing with the @code{print} statement. It works by being passed, in
8437 effect, as the first argument to the @code{sprintf} function
8438 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8439 Its default value is @code{"%.6g"}. Earlier versions of @code{awk}
8440 also used @code{OFMT} to specify the format for converting numbers to
8441 strings in general expressions; this is now done by @code{CONVFMT}.
8445 This is the output field separator (@pxref{Output Separators}). It is
8446 output between the fields output by a @code{print} statement. Its
8447 default value is @w{@code{" "}}, a string consisting of a single space.
8451 This is the output record separator. It is output at the end of every
8452 @code{print} statement. Its default value is @code{"\n"}.
8453 (@xref{Output Separators}.)
8457 This is @code{awk}'s input record separator. Its default value is a string
8458 containing a single newline character, which means that an input record
8459 consists of a single line of text.
8460 It can also be the null string, in which case records are separated by
8461 runs of blank lines, or a regexp, in which case records are separated by
8462 matches of the regexp in the input text.
8463 (@xref{Records, ,How Input is Split into Records}.)
8467 @code{SUBSEP} is the subscript separator. It has the default value of
8468 @code{"\034"}, and is used to separate the parts of the indices of a
8469 multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
8470 really accesses @code{foo["A\034B"]}
8471 (@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
8474 @node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
8475 @section Built-in Variables that Convey Information
8476 @cindex built-in variables, convey information
8478 This is an alphabetical list of the variables that are set
8479 automatically by @code{awk} on certain occasions in order to provide
8480 information to your program. Those variables that are specific to
8481 @code{gawk} are marked with an asterisk, @samp{*}.
8488 The command-line arguments available to @code{awk} programs are stored in
8489 an array called @code{ARGV}. @code{ARGC} is the number of command-line
8490 arguments present. @xref{Other Arguments, ,Other Command Line Arguments}.
8491 Unlike most @code{awk} arrays,
8492 @code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example:
8497 > for (i = 0; i < ARGC; i++)
8499 > @}' inventory-shipped BBS-list
8501 @print{} inventory-shipped
8507 In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8508 contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8509 @code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
8510 index of the last element in @code{ARGV}, since the elements are numbered
8513 The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
8514 the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
8515 method of accessing command line arguments.
8516 @xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
8517 about how @code{awk} uses these variables.
8521 The index in @code{ARGV} of the current file being processed.
8522 Every time @code{gawk} opens a new data file for processing, it sets
8523 @code{ARGIND} to the index in @code{ARGV} of the file name.
8524 When @code{gawk} is processing the input files, it is always
8525 true that @samp{FILENAME == ARGV[ARGIND]}.
8527 This variable is useful in file processing; it allows you to tell how far
8528 along you are in the list of data files, and to distinguish between
8529 successive instances of the same filename on the command line.
8531 While you can change the value of @code{ARGIND} within your @code{awk}
8532 program, @code{gawk} will automatically set it to a new value when the
8533 next file is opened.
8535 This variable is a @code{gawk} extension. In other @code{awk} implementations,
8536 or if @code{gawk} is in compatibility mode
8537 (@pxref{Options, ,Command Line Options}),
8542 An associative array that contains the values of the environment. The array
8543 indices are the environment variable names; the values are the values of
8544 the particular environment variables. For example,
8545 @code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
8546 does not affect the environment passed on to any programs that
8547 @code{awk} may spawn via redirection or the @code{system} function.
8548 (In a future version of @code{gawk}, it may do so.)
8550 Some operating systems may not have environment variables.
8551 On such systems, the @code{ENVIRON} array is empty (except for
8552 @w{@code{ENVIRON["AWKPATH"]}}).
8556 If a system error occurs either doing a redirection for @code{getline},
8557 during a read for @code{getline}, or during a @code{close} operation,
8558 then @code{ERRNO} will contain a string describing the error.
8560 This variable is a @code{gawk} extension. In other @code{awk} implementations,
8561 or if @code{gawk} is in compatibility mode
8562 (@pxref{Options, ,Command Line Options}),
8568 This is the name of the file that @code{awk} is currently reading.
8569 When no data files are listed on the command line, @code{awk} reads
8570 from the standard input, and @code{FILENAME} is set to @code{"-"}.
8571 @code{FILENAME} is changed each time a new file is read
8572 (@pxref{Reading Files, ,Reading Input Files}).
8573 Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
8574 @code{""}, since there are no input files being processed
8575 yet.@footnote{Some early implementations of Unix @code{awk} initialized
8576 @code{FILENAME} to @code{"-"}, even if there were data files to be
8577 processed. This behavior was incorrect, and should not be relied
8578 upon in your programs.} (d.c.)
8582 @code{FNR} is the current record number in the current file. @code{FNR} is
8583 incremented each time a new record is read
8584 (@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized
8585 to zero each time a new input file is started.
8589 @code{NF} is the number of fields in the current input record.
8590 @code{NF} is set each time a new record is read, when a new field is
8591 created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).
8595 This is the number of input records @code{awk} has processed since
8596 the beginning of the program's execution
8597 (@pxref{Records, ,How Input is Split into Records}).
8598 @code{NR} is set each time a new record is read.
8602 @code{RLENGTH} is the length of the substring matched by the
8603 @code{match} function
8604 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8605 @code{RLENGTH} is set by invoking the @code{match} function. Its value
8606 is the length of the matched string, or @minus{}1 if no match was found.
8610 @code{RSTART} is the start-index in characters of the substring matched by the
8611 @code{match} function
8612 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
8613 @code{RSTART} is set by invoking the @code{match} function. Its value
8614 is the position of the string where the matched substring starts, or zero
8615 if no match was found.
8619 @code{RT} is set each time a record is read. It contains the input text
8620 that matched the text denoted by @code{RS}, the record separator.
8622 This variable is a @code{gawk} extension. In other @code{awk} implementations,
8623 or if @code{gawk} is in compatibility mode
8624 (@pxref{Options, ,Command Line Options}),
8629 A side note about @code{NR} and @code{FNR}.
8630 @code{awk} simply increments both of these variables
8631 each time it reads a record, instead of setting them to the absolute
8632 value of the number of records read. This means that your program can
8633 change these variables, and their new values will be incremented for
8634 each record (d.c.). For example:
8641 > 4' | awk 'NR == 2 @{ NR = 17 @}
8651 Before @code{FNR} was added to the @code{awk} language
8652 (@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
8653 many @code{awk} programs used this feature to track the number of
8654 records in a file by resetting @code{NR} to zero when @code{FILENAME}
8657 @node ARGC and ARGV, , Auto-set, Built-in Variables
8658 @section Using @code{ARGC} and @code{ARGV}
8660 In @ref{Auto-set, , Built-in Variables that Convey Information},
8661 you saw this program describing the information contained in @code{ARGC}
8667 > for (i = 0; i < ARGC; i++)
8669 > @}' inventory-shipped BBS-list
8671 @print{} inventory-shipped
8677 In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
8678 contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
8681 Notice that the @code{awk} program is not entered in @code{ARGV}. The
8682 other special command line options, with their arguments, are also not
8683 entered. This includes variable assignments done with the @samp{-v}
8684 option (@pxref{Options, ,Command Line Options}).
8685 Normal variable assignments on the command line @emph{are}
8686 treated as arguments, and do show up in the @code{ARGV} array.
8691 @print{} printf "A=%d, B=%d\n", A, B
8692 @print{} for (i = 0; i < ARGC; i++)
8693 @print{} printf "\tARGV[%d] = %s\n", i, ARGV[i]
8695 @print{} END @{ printf "A=%d, B=%d\n", A, B @}
8696 $ awk -v A=1 -f showargs.awk B=2 /dev/null
8698 @print{} ARGV[0] = awk
8699 @print{} ARGV[1] = B=2
8700 @print{} ARGV[2] = /dev/null
8704 Your program can alter @code{ARGC} and the elements of @code{ARGV}.
8705 Each time @code{awk} reaches the end of an input file, it uses the next
8706 element of @code{ARGV} as the name of the next input file. By storing a
8707 different string there, your program can change which files are read.
8708 You can use @code{"-"} to represent the standard input. By storing
8709 additional elements and incrementing @code{ARGC} you can cause
8710 additional files to be read.
8712 If you decrease the value of @code{ARGC}, that eliminates input files
8713 from the end of the list. By recording the old value of @code{ARGC}
8714 elsewhere, your program can treat the eliminated arguments as
8715 something other than file names.
8717 To eliminate a file from the middle of the list, store the null string
8718 (@code{""}) into @code{ARGV} in place of the file's name. As a
8719 special feature, @code{awk} ignores file names that have been
8720 replaced with the null string.
8721 You may also use the @code{delete} statement to remove elements from
8722 @code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).
8724 All of these actions are typically done from the @code{BEGIN} rule,
8725 before actual processing of the input begins.
8726 @xref{Split Program, ,Splitting a Large File Into Pieces}, and see
8727 @ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
8728 of each way of removing elements from @code{ARGV}.
8730 The following fragment processes @code{ARGV} in order to examine, and
8731 then remove, command line options.
8736 for (i = 1; i < ARGC; i++) @{
8737 if (ARGV[i] == "-v")
8739 else if (ARGV[i] == "-d")
8743 else if (ARGV[i] ~ /^-?/) @{
8744 e = sprintf("%s: unrecognized option -- %c",
8745 ARGV[0], substr(ARGV[i], 1, ,1))
8746 print e > "/dev/stderr"
8755 To actually get the options into the @code{awk} program, you have to
8756 end the @code{awk} options with @samp{--}, and then supply your options,
8760 awk -f myprog -- -v -d file1 file2 @dots{}
8763 @cindex differences between @code{gawk} and @code{awk}
8764 This is not necessary in @code{gawk}: Unless @samp{--posix} has been
8765 specified, @code{gawk} silently puts any unrecognized options into
8766 @code{ARGV} for the @code{awk} program to deal with.
8769 sees an unknown option, @code{gawk} stops looking for other options it might
8770 otherwise recognize. The above example with @code{gawk} would be:
8773 gawk -f myprog -d -v file1 file2 @dots{}
8777 Since @samp{-d} is not a valid @code{gawk} option, the following @samp{-v}
8778 is passed on to the @code{awk} program.
8780 @node Arrays, Built-in, Built-in Variables, Top
8781 @chapter Arrays in @code{awk}
8783 An @dfn{array} is a table of values, called @dfn{elements}. The
8784 elements of an array are distinguished by their indices. @dfn{Indices}
8785 may be either numbers or strings. @code{awk} maintains a single set
8786 of names that may be used for naming variables, arrays and functions
8787 (@pxref{User-defined, ,User-defined Functions}).
8788 Thus, you cannot have a variable and an array with the same name in the
8789 same @code{awk} program.
8792 * Array Intro:: Introduction to Arrays
8793 * Reference to Elements:: How to examine one element of an array.
8794 * Assigning Elements:: How to change an element of an array.
8795 * Array Example:: Basic Example of an Array
8796 * Scanning an Array:: A variation of the @code{for} statement. It
8797 loops through the indices of an array's
8799 * Delete:: The @code{delete} statement removes an element
8801 * Numeric Array Subscripts:: How to use numbers as subscripts in
8803 * Uninitialized Subscripts:: Using Uninitialized variables as subscripts.
8804 * Multi-dimensional:: Emulating multi-dimensional arrays in
8806 * Multi-scanning:: Scanning multi-dimensional arrays.
8807 * Array Efficiency:: Implementation-specific tips.
8810 @node Array Intro, Reference to Elements, Arrays, Arrays
8811 @section Introduction to Arrays
8814 The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
8815 of related strings or numbers.
8817 Every @code{awk} array must have a name. Array names have the same
8818 syntax as variable names; any valid variable name would also be a valid
8819 array name. But you cannot use one name in both ways (as an array and
8820 as a variable) in one @code{awk} program.
8822 Arrays in @code{awk} superficially resemble arrays in other programming
8823 languages; but there are fundamental differences. In @code{awk}, you
8824 don't need to specify the size of an array before you start to use it.
8825 Additionally, any number or string in @code{awk} may be used as an
8826 array index, not just consecutive integers.
8828 In most other languages, you have to @dfn{declare} an array and specify
8829 how many elements or components it contains. In such languages, the
8830 declaration causes a contiguous block of memory to be allocated for that
8831 many elements. An index in the array usually must be a positive integer; for
8832 example, the index zero specifies the first element in the array, which is
8833 actually stored at the beginning of the block of memory. Index one
8834 specifies the second element, which is stored in memory right after the
8835 first element, and so on. It is impossible to add more elements to the
8836 array, because it has room for only as many elements as you declared.
8837 (Some languages allow arbitrary starting and ending indices,
8838 e.g., @samp{15 .. 27}, but the size of the array is still fixed when
8839 the array is declared.)
8841 A contiguous array of four elements might look like this,
8842 conceptually, if the element values are eight, @code{"foo"},
8846 @c from Karl Berry, much thanks for the help.
8848 \bigskip % space above the table (about 1 linespace)
8850 \newdimen\width \width = 1.5cm
8851 \newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
8853 \halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
8854 \noalign{\hrule width\hwidth}
8855 &&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
8856 \noalign{\hrule width\hwidth}
8857 \noalign{\smallskip}
8858 &\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr
8865 +---------+---------+--------+---------+
8866 | 8 | "foo" | "" | 30 | @r{value}
8867 +---------+---------+--------+---------+
8873 Only the values are stored; the indices are implicit from the order of
8874 the values. Eight is the value at index zero, because eight appears in the
8875 position with zero elements before it.
8877 @cindex arrays, definition of
8878 @cindex associative arrays
8879 @cindex arrays, associative
8880 Arrays in @code{awk} are different: they are @dfn{associative}. This means
8881 that each array is a collection of pairs: an index, and its corresponding
8882 array element value:
8885 @r{Element} 4 @r{Value} 30
8886 @r{Element} 2 @r{Value} "foo"
8887 @r{Element} 1 @r{Value} 8
8888 @r{Element} 3 @r{Value} ""
8892 We have shown the pairs in jumbled order because their order is irrelevant.
8894 One advantage of associative arrays is that new pairs can be added
8895 at any time. For example, suppose we add to the above array a tenth element
8896 whose value is @w{@code{"number ten"}}. The result is this:
8899 @r{Element} 10 @r{Value} "number ten"
8900 @r{Element} 4 @r{Value} 30
8901 @r{Element} 2 @r{Value} "foo"
8902 @r{Element} 1 @r{Value} 8
8903 @r{Element} 3 @r{Value} ""
8907 @cindex sparse arrays
8908 @cindex arrays, sparse
8909 Now the array is @dfn{sparse}, which just means some indices are missing:
8910 it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.
8911 @c ok, I should spell out the above, but ...
8913 Another consequence of associative arrays is that the indices don't
8914 have to be positive integers. Any number, or even a string, can be
8915 an index. For example, here is an array which translates words from
8916 English into French:
8919 @r{Element} "dog" @r{Value} "chien"
8920 @r{Element} "cat" @r{Value} "chat"
8921 @r{Element} "one" @r{Value} "un"
8922 @r{Element} 1 @r{Value} "un"
8926 Here we decided to translate the number one in both spelled-out and
8927 numeric form---thus illustrating that a single array can have both
8928 numbers and strings as indices.
8929 (In fact, array subscripts are always strings; this is discussed
8931 @ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.)
8933 @cindex Array subscripts and @code{IGNORECASE}
8934 @cindex @code{IGNORECASE} and array subscripts
8936 The value of @code{IGNORECASE} has no effect upon array subscripting.
8937 You must use the exact same string value to retrieve an array element
8938 as you used to store it.
8940 When @code{awk} creates an array for you, e.g., with the @code{split}
8942 that array's indices are consecutive integers starting at one.
8943 (@xref{String Functions, ,Built-in Functions for String Manipulation}.)
8945 @node Reference to Elements, Assigning Elements, Array Intro, Arrays
8946 @section Referring to an Array Element
8947 @cindex array reference
8948 @cindex element of array
8949 @cindex reference to array
8951 The principal way of using an array is to refer to one of its elements.
8952 An array reference is an expression which looks like this:
8955 @var{array}[@var{index}]
8959 Here, @var{array} is the name of an array. The expression @var{index} is
8960 the index of the element of the array that you want.
8962 The value of the array reference is the current value of that array
8963 element. For example, @code{foo[4.3]} is an expression for the element
8964 of array @code{foo} at index @samp{4.3}.
8966 If you refer to an array element that has no recorded value, the value
8967 of the reference is @code{""}, the null string. This includes elements
8968 to which you have not assigned any value, and elements that have been
8969 deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference
8970 automatically creates that array element, with the null string as its value.
8971 (In some cases, this is unfortunate, because it might waste memory inside
8974 @cindex arrays, presence of elements
8975 @cindex arrays, the @code{in} operator
8976 You can find out if an element exists in an array at a certain index with
8980 @var{index} in @var{array}
8984 This expression tests whether or not the particular index exists,
8985 without the side effect of creating that element if it is not present.
8986 The expression has the value one (true) if @code{@var{array}[@var{index}]}
8987 exists, and zero (false) if it does not exist.
8989 For example, to test whether the array @code{frequencies} contains the
8990 index @samp{2}, you could write this statement:
8993 if (2 in frequencies)
8994 print "Subscript 2 is present."
8997 Note that this is @emph{not} a test of whether or not the array
8998 @code{frequencies} contains an element whose @emph{value} is two.
8999 (There is no way to do that except to scan all the elements.) Also, this
9000 @emph{does not} create @code{frequencies[2]}, while the following
9001 (incorrect) alternative would do so:
9004 if (frequencies[2] != "")
9005 print "Subscript 2 is present."
9008 @node Assigning Elements, Array Example, Reference to Elements, Arrays
9009 @section Assigning Array Elements
9010 @cindex array assignment
9011 @cindex element assignment
9013 Array elements are lvalues: they can be assigned values just like
9014 @code{awk} variables:
9017 @var{array}[@var{subscript}] = @var{value}
9021 Here @var{array} is the name of your array. The expression
9022 @var{subscript} is the index of the element of the array that you want
9023 to assign a value. The expression @var{value} is the value you are
9024 assigning to that element of the array.
9026 @node Array Example, Scanning an Array, Assigning Elements, Arrays
9027 @section Basic Array Example
9029 The following program takes a list of lines, each beginning with a line
9030 number, and prints them out in order of line number. The line numbers are
9031 not in order, however, when they are first read: they are scrambled. This
9032 program sorts the lines by making an array using the line numbers as
9033 subscripts. It then prints out the lines in sorted order of their numbers.
9034 It is a very simple program, and gets confused if it encounters repeated
9035 numbers, gaps, or lines that don't begin with a number.
9039 @c file eg/misc/arraymax.awk
9048 for (x = 1; x <= max; x++)
9054 The first rule keeps track of the largest line number seen so far;
9055 it also stores each line into the array @code{arr}, at an index that
9056 is the line's number.
9058 The second rule runs after all the input has been read, to print out
9061 When this program is run with the following input:
9065 @c file eg/misc/arraymax.data
9067 2 Who are you? The new number two!
9068 4 . . . And four on the floor
9069 1 Who is number one?
9079 1 Who is number one?
9080 2 Who are you? The new number two!
9082 4 . . . And four on the floor
9086 If a line number is repeated, the last line with a given number overrides
9089 Gaps in the line numbers can be handled with an easy improvement to the
9090 program's @code{END} rule:
9094 for (x = 1; x <= max; x++)
9100 @node Scanning an Array, Delete, Array Example, Arrays
9101 @section Scanning All Elements of an Array
9102 @cindex @code{for (x in @dots{})}
9103 @cindex arrays, special @code{for} statement
9104 @cindex scanning an array
9106 In programs that use arrays, you often need a loop that executes
9107 once for each element of an array. In other languages, where arrays are
9108 contiguous and indices are limited to positive integers, this is
9110 find all the valid indices by counting from the lowest index
9111 up to the highest. This
9112 technique won't do the job in @code{awk}, since any number or string
9113 can be an array index. So @code{awk} has a special kind of @code{for}
9114 statement for scanning an array:
9117 for (@var{var} in @var{array})
9122 This loop executes @var{body} once for each index in @var{array} that your
9123 program has previously used, with the
9124 variable @var{var} set to that index.
9126 Here is a program that uses this form of the @code{for} statement. The
9127 first rule scans the input records and notes which words appear (at
9128 least once) in the input, by storing a one into the array @code{used} with
9129 the word as index. The second rule scans the elements of @code{used} to
9130 find all the distinct words that appear in the input. It prints each
9131 word that is more than 10 characters long, and also prints the number of
9132 such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information
9133 on the built-in function @code{length}.
9136 # Record a 1 for each word that is used at least once.
9138 for (i = 1; i <= NF; i++)
9142 # Find number of distinct words more than 10 characters long.
9145 if (length(x) > 10) @{
9149 print num_long_words, "words longer than 10 characters"
9154 @xref{Word Sorting, ,Generating Word Usage Counts},
9155 for a more detailed example of this type.
9157 The order in which elements of the array are accessed by this statement
9158 is determined by the internal arrangement of the array elements within
9159 @code{awk} and cannot be controlled or changed. This can lead to
9160 problems if new elements are added to @var{array} by statements in
9161 the loop body; you cannot predict whether or not the @code{for} loop will
9162 reach them. Similarly, changing @var{var} inside the loop may produce
9163 strange results. It is best to avoid such things.
9165 @node Delete, Numeric Array Subscripts, Scanning an Array, Arrays
9166 @section The @code{delete} Statement
9167 @cindex @code{delete} statement
9168 @cindex deleting elements of arrays
9169 @cindex removing elements of arrays
9170 @cindex arrays, deleting an element
9172 You can remove an individual element of an array using the @code{delete}
9176 delete @var{array}[@var{index}]
9179 Once you have deleted an array element, you can no longer obtain any
9180 value the element once had. It is as if you had never referred
9181 to it and had never given it any value.
9183 Here is an example of deleting elements in an array:
9186 for (i in frequencies)
9187 delete frequencies[i]
9191 This example removes all the elements from the array @code{frequencies}.
9193 If you delete an element, a subsequent @code{for} statement to scan the array
9194 will not report that element, and the @code{in} operator to check for
9195 the presence of that element will return zero (i.e.@: false):
9200 print "This will never be printed"
9203 It is important to note that deleting an element is @emph{not} the
9204 same as assigning it a null value (the empty string, @code{""}).
9209 print "This is printed, even though foo[4] is empty"
9212 It is not an error to delete an element that does not exist.
9214 @cindex arrays, deleting entire contents
9215 @cindex deleting entire arrays
9216 @cindex differences between @code{gawk} and @code{awk}
9217 You can delete all the elements of an array with a single statement,
9218 by leaving off the subscript in the @code{delete} statement.
9224 This ability is a @code{gawk} extension; it is not available in
9225 compatibility mode (@pxref{Options, ,Command Line Options}).
9227 Using this version of the @code{delete} statement is about three times
9228 more efficient than the equivalent loop that deletes each element one
9231 @cindex portability issues
9232 The following statement provides a portable, but non-obvious way to clear
9235 @cindex Brennan, Michael
9238 # thanks to Michael Brennan for pointing this out
9243 The @code{split} function
9244 (@pxref{String Functions, ,Built-in Functions for String Manipulation})
9245 clears out the target array first. This call asks it to split
9246 apart the null string. Since there is no data to split out, the
9247 function simply clears the array and then returns.
9249 @strong{Caution:} Deleting an array does not change its type; you cannot
9250 delete an array and then use the array's name as a scalar. For
9251 example, this will not work:
9254 a[1] = 3; delete a; a = 3
9257 @node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays
9258 @section Using Numbers to Subscript Arrays
9260 An important aspect of arrays to remember is that @emph{array subscripts
9261 are always strings}. If you use a numeric value as a subscript,
9262 it will be converted to a string value before it is used for subscripting
9263 (@pxref{Conversion, ,Conversion of Strings and Numbers}).
9265 @cindex conversions, during subscripting
9266 @cindex numbers, used as subscripts
9268 This means that the value of the built-in variable @code{CONVFMT} can potentially
9269 affect how your program accesses elements of an array. For example:
9277 printf "%s is in data\n", xyz
9279 printf "%s is not in data\n", xyz
9284 This prints @samp{12.15 is not in data}. The first statement gives
9285 @code{xyz} a numeric value. Assigning to
9286 @code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
9287 (using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),
9288 and assigns one to @code{data["12.153"]}. The program then changes
9289 the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
9290 string value from @code{xyz}, this time @code{"12.15"}, since the value of
9291 @code{CONVFMT} only allows two significant digits. This test fails,
9292 since @code{"12.15"} is a different string from @code{"12.153"}.
9294 According to the rules for conversions
9295 (@pxref{Conversion, ,Conversion of Strings and Numbers}), integer
9296 values are always converted to strings as integers, no matter what the
9297 value of @code{CONVFMT} may happen to be. So the usual case of:
9300 for (i = 1; i <= maxsub; i++)
9301 @i{do something with} array[i]
9305 will work, no matter what the value of @code{CONVFMT}.
9307 Like many things in @code{awk}, the majority of the time things work
9308 as you would expect them to work. But it is useful to have a precise
9309 knowledge of the actual rules, since sometimes they can have a subtle
9310 effect on your programs.
9312 @node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays
9313 @section Using Uninitialized Variables as Subscripts
9315 @cindex uninitialized variables, as array subscripts
9316 @cindex array subscripts, uninitialized variables
9317 Suppose you want to print your input data in reverse order.
9318 A reasonable attempt at a program to do so (with some test
9319 data) might look like this:
9325 > line 3' | awk '@{ l[lines] = $0; ++lines @}
9327 > for (i = lines-1; i >= 0; --i)
9335 Unfortunately, the very first line of input data did not come out in the
9338 At first glance, this program should have worked. The variable @code{lines}
9339 is uninitialized, and uninitialized variables have the numeric value zero.
9340 So, @code{awk} should have printed the value of @code{l[0]}.
9342 The issue here is that subscripts for @code{awk} arrays are @strong{always}
9343 strings. And uninitialized variables, when used as strings, have the
9344 value @code{""}, not zero. Thus, @samp{line 1} ended up stored in
9347 The following version of the program works correctly:
9350 @{ l[lines++] = $0 @}
9352 for (i = lines - 1; i >= 0; --i)
9357 Here, the @samp{++} forces @code{lines} to be numeric, thus making
9358 the ``old value'' numeric zero, which is then converted to @code{"0"}
9359 as the array subscript.
9361 @cindex null string, as array subscript
9363 As we have just seen, even though it is somewhat unusual, the null string
9364 (@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided
9365 on the command line (@pxref{Options, ,Command Line Options}),
9366 @code{gawk} will warn about the use of the null string as a subscript.
9368 @node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays
9369 @section Multi-dimensional Arrays
9371 @cindex subscripts in arrays
9372 @cindex arrays, multi-dimensional subscripts
9373 @cindex multi-dimensional subscripts
9374 A multi-dimensional array is an array in which an element is identified
9375 by a sequence of indices, instead of a single index. For example, a
9376 two-dimensional array requires two indices. The usual way (in most
9377 languages, including @code{awk}) to refer to an element of a
9378 two-dimensional array named @code{grid} is with
9379 @code{grid[@var{x},@var{y}]}.
9382 Multi-dimensional arrays are supported in @code{awk} through
9383 concatenation of indices into one string. What happens is that
9384 @code{awk} converts the indices into strings
9385 (@pxref{Conversion, ,Conversion of Strings and Numbers}) and
9386 concatenates them together, with a separator between them. This creates
9387 a single string that describes the values of the separate indices. The
9388 combined string is used as a single index into an ordinary,
9389 one-dimensional array. The separator used is the value of the built-in
9390 variable @code{SUBSEP}.
9392 For example, suppose we evaluate the expression @samp{foo[5,12] = "value"}
9393 when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are
9394 converted to strings and
9395 concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus,
9396 the array element @code{foo["5@@12"]} is set to @code{"value"}.
9398 Once the element's value is stored, @code{awk} has no record of whether
9399 it was stored with a single index or a sequence of indices. The two
9400 expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always
9403 The default value of @code{SUBSEP} is the string @code{"\034"},
9404 which contains a non-printing character that is unlikely to appear in an
9405 @code{awk} program or in most input data.
9407 The usefulness of choosing an unlikely character comes from the fact
9408 that index values that contain a string matching @code{SUBSEP} lead to
9409 combined strings that are ambiguous. Suppose that @code{SUBSEP} were
9410 @code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a",
9411 "b@@c"]}} would be indistinguishable because both would actually be
9412 stored as @samp{foo["a@@b@@c"]}.
9414 You can test whether a particular index-sequence exists in a
9415 ``multi-dimensional'' array with the same operator @samp{in} used for single
9416 dimensional arrays. Instead of a single index as the left-hand operand,
9417 write the whole sequence of indices, separated by commas, in
9421 (@var{subscript1}, @var{subscript2}, @dots{}) in @var{array}
9424 The following example treats its input as a two-dimensional array of
9425 fields; it rotates this array 90 degrees clockwise and prints the
9426 result. It assumes that all lines have the same number of
9435 for (x = 1; x <= NF; x++)
9442 for (x = 1; x <= max_nf; x++) @{
9443 for (y = max_nr; y >= 1; --y)
9444 printf("%s ", vector[x, y])
9452 When given the input:
9477 @node Multi-scanning, Array Efficiency, Multi-dimensional, Arrays
9478 @section Scanning Multi-dimensional Arrays
9480 There is no special @code{for} statement for scanning a
9481 ``multi-dimensional'' array; there cannot be one, because in truth there
9482 are no multi-dimensional arrays or elements; there is only a
9483 multi-dimensional @emph{way of accessing} an array.
9485 However, if your program has an array that is always accessed as
9486 multi-dimensional, you can get the effect of scanning it by combining
9487 the scanning @code{for} statement
9488 (@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the
9489 @code{split} built-in function
9490 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
9494 for (combined in array) @{
9495 split(combined, separate, SUBSEP)
9501 This sets @code{combined} to
9502 each concatenated, combined index in the array, and splits it
9503 into the individual indices by breaking it apart where the value of
9504 @code{SUBSEP} appears. The split-out indices become the elements of
9505 the array @code{separate}.
9507 Thus, suppose you have previously stored a value in @code{array[1, "foo"]};
9508 then an element with index @code{"1\034foo"} exists in
9509 @code{array}. (Recall that the default value of @code{SUBSEP} is
9510 the character with code 034.) Sooner or later the @code{for} statement
9511 will find that index and do an iteration with @code{combined} set to
9512 @code{"1\034foo"}. Then the @code{split} function is called as
9516 split("1\034foo", separate, "\034")
9520 The result of this is to set @code{separate[1]} to @code{"1"} and
9521 @code{separate[2]} to @code{"foo"}. Presto, the original sequence of
9522 separate indices has been recovered.
9524 @node Array Efficiency, , Multi-scanning, Arrays
9525 @section Using Array Memory Efficiently
9527 This section applies just to @code{gawk}.
9529 It is often useful to use the same bit of data as an index
9530 into multiple arrays.
9531 Due to the way @code{gawk} implements associative arrays,
9532 when you need to use input data as an index for multiple
9533 arrays, it is much more effecient to assign the input field
9534 to a separate variable, and then use that variable as the index.
9542 seniority[name]++ # better than seniority[$1]++
9543 kids[name] = nkids # better than kids[$1] = nkids
9547 Using separate variables with mnemonic names for the input fields
9548 makes programs more readable, in any case.
9549 It is an eventual goal to make @code{gawk}'s array indexing as efficient
9550 as possible, no matter what the source of the index value.
9552 @node Built-in, User-defined, Arrays, Top
9553 @chapter Built-in Functions
9555 @c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
9556 @cindex built-in functions
9557 @dfn{Built-in} functions are functions that are always available for
9558 your @code{awk} program to call. This chapter defines all the built-in
9559 functions in @code{awk}; some of them are mentioned in other sections,
9560 but they are summarized here for your convenience. (You can also define
9561 new functions yourself. @xref{User-defined, ,User-defined Functions}.)
9564 * Calling Built-in:: How to call built-in functions.
9565 * Numeric Functions:: Functions that work with numbers, including
9566 @code{int}, @code{sin} and @code{rand}.
9567 * String Functions:: Functions for string manipulation, such as
9568 @code{split}, @code{match}, and
9570 * I/O Functions:: Functions for files and shell commands.
9571 * Time Functions:: Functions for dealing with time stamps.
9574 @node Calling Built-in, Numeric Functions, Built-in, Built-in
9575 @section Calling Built-in Functions
9577 To call a built-in function, write the name of the function followed
9578 by arguments in parentheses. For example, @samp{atan2(y + z, 1)}
9579 is a call to the function @code{atan2}, with two arguments.
9581 Whitespace is ignored between the built-in function name and the
9582 open-parenthesis, but we recommend that you avoid using whitespace
9583 there. User-defined functions do not permit whitespace in this way, and
9584 you will find it easier to avoid mistakes by following a simple
9585 convention which always works: no whitespace after a function name.
9587 @cindex differences between @code{gawk} and @code{awk}
9588 Each built-in function accepts a certain number of arguments.
9589 In some cases, arguments can be omitted. The defaults for omitted
9590 arguments vary from function to function and are described under the
9591 individual functions. In some @code{awk} implementations, extra
9592 arguments given to built-in functions are ignored. However, in @code{gawk},
9593 it is a fatal error to give extra arguments to a built-in function.
9595 When a function is called, expressions that create the function's actual
9596 parameters are evaluated completely before the function call is performed.
9597 For example, in the code fragment:
9605 the variable @code{i} is set to five before @code{sqrt} is called
9606 with a value of four for its actual parameter.
9608 @cindex evaluation, order of
9609 @cindex order of evaluation
9610 The order of evaluation of the expressions used for the function's
9611 parameters is undefined. Thus, you should not write programs that
9612 assume that parameters are evaluated from left to right or from
9613 right to left. For example,
9617 j = atan2(i++, i *= 2)
9620 If the order of evaluation is left to right, then @code{i} first becomes
9621 six, and then 12, and @code{atan2} is called with the two arguments six
9622 and 12. But if the order of evaluation is right to left, @code{i}
9623 first becomes 10, and then 11, and @code{atan2} is called with the
9624 two arguments 11 and 10.
9626 @node Numeric Functions, String Functions, Calling Built-in, Built-in
9627 @section Numeric Built-in Functions
9629 Here is a full list of built-in functions that work with numbers.
9630 Optional parameters are enclosed in square brackets (``['' and ``]'').
9635 This produces the nearest integer to @var{x}, located between @var{x} and zero,
9636 truncated toward zero.
9638 For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}
9639 is @minus{}3, and @code{int(-3)} is @minus{}3 as well.
9643 This gives you the positive square root of @var{x}. It reports an error
9644 if @var{x} is negative. Thus, @code{sqrt(4)} is two.
9648 This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports
9649 an error if @var{x} is out of range. The range of values @var{x} can have
9650 depends on your machine's floating point representation.
9654 This gives you the natural logarithm of @var{x}, if @var{x} is positive;
9655 otherwise, it reports an error.
9659 This gives you the sine of @var{x}, with @var{x} in radians.
9663 This gives you the cosine of @var{x}, with @var{x} in radians.
9665 @item atan2(@var{y}, @var{x})
9667 This gives you the arctangent of @code{@var{y} / @var{x}} in radians.
9671 This gives you a random number. The values of @code{rand} are
9672 uniformly-distributed between zero and one.
9673 The value is never zero and never one.
9675 Often you want random integers instead. Here is a user-defined function
9676 you can use to obtain a random non-negative integer less than @var{n}:
9679 function randint(n) @{
9680 return int(n * rand())
9685 The multiplication produces a random number greater than zero and less
9686 than @code{n}. We then make it an integer (using @code{int}) between zero
9687 and @code{n} @minus{} 1, inclusive.
9689 Here is an example where a similar function is used to produce
9690 random integers between one and @var{n}. This program
9691 prints a new random number for each input record.
9696 # Function to roll a simulated die.
9697 function roll(n) @{ return 1 + int(rand() * n) @}
9701 # Roll 3 six-sided dice and
9702 # print total number of points.
9704 printf("%d points\n",
9705 roll(6)+roll(6)+roll(6))
9710 @cindex seed for random numbers
9711 @cindex random numbers, seed of
9712 @comment MAWK uses a different seed each time.
9713 @strong{Caution:} In most @code{awk} implementations, including @code{gawk},
9714 @code{rand} starts generating numbers from the same
9715 starting number, or @dfn{seed}, each time you run @code{awk}. Thus,
9716 a program will generate the same results each time you run it.
9717 The numbers are random within one @code{awk} run, but predictable
9718 from run to run. This is convenient for debugging, but if you want
9719 a program to do different things each time it is used, you must change
9720 the seed to a value that will be different in each run. To do this,
9723 @item srand(@r{[}@var{x}@r{]})
9725 The function @code{srand} sets the starting point, or seed,
9726 for generating random numbers to the value @var{x}.
9728 Each seed value leads to a particular sequence of random
9729 numbers.@footnote{Computer generated random numbers really are not truly
9730 random. They are technically known as ``pseudo-random.'' This means
9731 that while the numbers in a sequence appear to be random, you can in
9732 fact generate the same sequence of random numbers over and over again.}
9733 Thus, if you set the seed to the same value a second time, you will get
9734 the same sequence of random numbers again.
9736 If you omit the argument @var{x}, as in @code{srand()}, then the current
9737 date and time of day are used for a seed. This is the way to get random
9738 numbers that are truly unpredictable.
9740 The return value of @code{srand} is the previous seed. This makes it
9741 easy to keep track of the seeds for use in consistently reproducing
9742 sequences of random numbers.
9745 @node String Functions, I/O Functions, Numeric Functions, Built-in
9746 @section Built-in Functions for String Manipulation
9748 The functions in this section look at or change the text of one or more
9750 Optional parameters are enclosed in square brackets (``['' and ``]'').
9753 @item index(@var{in}, @var{find})
9755 This searches the string @var{in} for the first occurrence of the string
9756 @var{find}, and returns the position in characters where that occurrence
9757 begins in the string @var{in}. For example:
9760 $ awk 'BEGIN @{ print index("peanut", "an") @}'
9765 If @var{find} is not found, @code{index} returns zero.
9766 (Remember that string indices in @code{awk} start at one.)
9768 @item length(@r{[}@var{string}@r{]})
9770 This gives you the number of characters in @var{string}. If
9771 @var{string} is a number, the length of the digit string representing
9772 that number is returned. For example, @code{length("abcde")} is five. By
9773 contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 =
9774 525, and 525 is then converted to the string @code{"525"}, which has
9777 If no argument is supplied, @code{length} returns the length of @code{$0}.
9779 @cindex historical features
9780 @cindex portability issues
9781 @cindex @code{awk} language, POSIX version
9782 @cindex POSIX @code{awk}
9783 In older versions of @code{awk}, you could call the @code{length} function
9784 without any parentheses. Doing so is marked as ``deprecated'' in the
9785 POSIX standard. This means that while you can do this in your
9786 programs, it is a feature that can eventually be removed from a future
9787 version of the standard. Therefore, for maximal portability of your
9788 @code{awk} programs, you should always supply the parentheses.
9790 @item match(@var{string}, @var{regexp})
9792 The @code{match} function searches the string, @var{string}, for the
9793 longest, leftmost substring matched by the regular expression,
9794 @var{regexp}. It returns the character position, or @dfn{index}, of
9795 where that substring begins (one, if it starts at the beginning of
9796 @var{string}). If no match is found, it returns zero.
9800 The @code{match} function sets the built-in variable @code{RSTART} to
9801 the index. It also sets the built-in variable @code{RLENGTH} to the
9802 length in characters of the matched substring. If no match is found,
9803 @code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.
9809 @c file eg/misc/findpat.sh
9814 where = match($0, regex)
9816 print "Match of", regex, "found at", \
9825 This program looks for lines that match the regular expression stored in
9826 the variable @code{regex}. This regular expression can be changed. If the
9827 first word on a line is @samp{FIND}, @code{regex} is changed to be the
9828 second word on that line. Therefore, given:
9831 @c file eg/misc/findpat.data
9834 but not very quickly
9837 This line is property of Reality Engineering Co.
9846 Match of ru+n found at 12 in My program runs
9847 Match of Melvin found at 1 in Melvin was here.
9850 @item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
9852 This divides @var{string} into pieces separated by @var{fieldsep},
9853 and stores the pieces in @var{array}. The first piece is stored in
9854 @code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
9855 forth. The string value of the third argument, @var{fieldsep}, is
9856 a regexp describing where to split @var{string} (much as @code{FS} can
9857 be a regexp describing where to split input records). If
9858 the @var{fieldsep} is omitted, the value of @code{FS} is used.
9859 @code{split} returns the number of elements created.
9861 The @code{split} function splits strings into pieces in a
9862 manner similar to the way input lines are split into fields. For example:
9865 split("cul-de-sac", a, "-")
9869 splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
9870 separator. It sets the contents of the array @code{a} as follows:
9879 The value returned by this call to @code{split} is three.
9881 As with input field-splitting, when the value of @var{fieldsep} is
9882 @w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
9883 are separated by runs of whitespace.
9885 @cindex differences between @code{gawk} and @code{awk}
9886 Also as with input field-splitting, if @var{fieldsep} is the null string, each
9887 individual character in the string is split into its own array element.
9888 (This is a @code{gawk}-specific extension.)
9891 Recent implementations of @code{awk}, including @code{gawk}, allow
9892 the third argument to be a regexp constant (@code{/abc/}), as well as a
9893 string (d.c.). The POSIX standard allows this as well.
9895 Before splitting the string, @code{split} deletes any previously existing
9896 elements in the array @var{array} (d.c.).
9898 If @var{string} does not match @var{fieldsep} at all, @var{array} will have
9899 one element. The value of that element will be the original
9902 @item sprintf(@var{format}, @var{expression1},@dots{})
9904 This returns (without printing) the string that @code{printf} would
9905 have printed out with the same arguments
9906 (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
9910 sprintf("pi = %.2f (approx.)", 22/7)
9914 returns the string @w{@code{"pi = 3.14 (approx.)"}}.
9917 2e: For sub, gsub, and gensub, either here or in the "how much matches"
9918 section, we need some explanation that it is possible to match the
9919 null string when using closures like *. E.g.,
9921 $ echo abc | awk '{ gsub(/m*/, "X"); print }'
9924 Although this makes a certain amount of sense, it can be very
9928 @item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
9930 The @code{sub} function alters the value of @var{target}.
9931 It searches this value, which is treated as a string, for the
9932 leftmost longest substring matched by the regular expression, @var{regexp},
9933 extending this match as far as possible. Then the entire string is
9934 changed by replacing the matched text with @var{replacement}.
9935 The modified string becomes the new value of @var{target}.
9937 This function is peculiar because @var{target} is not simply
9938 used to compute a value, and not just any expression will do: it
9939 must be a variable, field or array element, so that @code{sub} can
9940 store a modified value there. If this argument is omitted, then the
9941 default is to use and alter @code{$0}.
9946 str = "water, water, everywhere"
9947 sub(/at/, "ith", str)
9951 sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
9952 leftmost, longest occurrence of @samp{at} with @samp{ith}.
9954 The @code{sub} function returns the number of substitutions made (either
9957 If the special character @samp{&} appears in @var{replacement}, it
9958 stands for the precise substring that was matched by @var{regexp}. (If
9959 the regexp can match more than one string, then this precise substring
9960 may vary.) For example:
9963 awk '@{ sub(/candidate/, "& and his wife"); print @}'
9967 changes the first occurrence of @samp{candidate} to @samp{candidate
9968 and his wife} on each input line.
9970 Here is another example:
9975 sub(/a+/, "C&C", str)
9982 This shows how @samp{&} can represent a non-constant string, and also
9983 illustrates the ``leftmost, longest'' rule in regexp matching
9984 (@pxref{Leftmost Longest, ,How Much Text Matches?}).
9986 The effect of this special character (@samp{&}) can be turned off by putting a
9987 backslash before it in the string. As usual, to insert one backslash in
9988 the string, you must write two backslashes. Therefore, write @samp{\\&}
9989 in a string constant to include a literal @samp{&} in the replacement.
9990 For example, here is how to replace the first @samp{|} on each line with
9994 awk '@{ sub(/\|/, "\\&"); print @}'
9997 @cindex @code{sub}, third argument of
9998 @cindex @code{gsub}, third argument of
9999 @strong{Note:} As mentioned above, the third argument to @code{sub} must
10000 be a variable, field or array reference.
10001 Some versions of @code{awk} allow the third argument to
10002 be an expression which is not an lvalue. In such a case, @code{sub}
10003 would still search for the pattern and return zero or one, but the result of
10004 the substitution (if any) would be thrown away because there is no place
10005 to put it. Such versions of @code{awk} accept expressions like
10009 sub(/USA/, "United States", "the USA and Canada")
10013 For historical compatibility, @code{gawk} will accept erroneous code,
10014 such as in the above example. However, using any other non-changeable
10015 object as the third parameter will cause a fatal error, and your program
10018 Finally, if the @var{regexp} is not a regexp constant, it is converted into a
10019 string and then the value of that string is treated as the regexp to match.
10021 @item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
10023 This is similar to the @code{sub} function, except @code{gsub} replaces
10024 @emph{all} of the longest, leftmost, @emph{non-overlapping} matching
10025 substrings it can find. The @samp{g} in @code{gsub} stands for
10026 ``global,'' which means replace everywhere. For example:
10029 awk '@{ gsub(/Britain/, "United Kingdom"); print @}'
10033 replaces all occurrences of the string @samp{Britain} with @samp{United
10034 Kingdom} for all input records.
10036 The @code{gsub} function returns the number of substitutions made. If
10037 the variable to be searched and altered, @var{target}, is
10038 omitted, then the entire input record, @code{$0}, is used.
10040 As in @code{sub}, the characters @samp{&} and @samp{\} are special,
10041 and the third argument must be an lvalue.
10045 @item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})
10047 @code{gensub} is a general substitution function. Like @code{sub} and
10048 @code{gsub}, it searches the target string @var{target} for matches of
10049 the regular expression @var{regexp}. Unlike @code{sub} and
10050 @code{gsub}, the modified string is returned as the result of the
10051 function, and the original target string is @emph{not} changed. If
10052 @var{how} is a string beginning with @samp{g} or @samp{G}, then it
10053 replaces all matches of @var{regexp} with @var{replacement}.
10054 Otherwise, @var{how} is a number indicating which match of @var{regexp}
10055 to replace. If no @var{target} is supplied, @code{$0} is used instead.
10057 @code{gensub} provides an additional feature that is not available
10058 in @code{sub} or @code{gsub}: the ability to specify components of
10059 a regexp in the replacement text. This is done by using parentheses
10060 in the regexp to mark the components, and then specifying @samp{\@var{n}}
10061 in the replacement text, where @var{n} is a digit from one to nine.
10069 > b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
10077 As described above for @code{sub}, you must type two backslashes in order
10078 to get one into the string.
10080 In the replacement text, the sequence @samp{\0} represents the entire
10081 matched text, as does the character @samp{&}.
10083 This example shows how you can use the third argument to control
10084 which match of the regexp should be changed.
10087 $ echo a b c a b c |
10088 > gawk '@{ print gensub(/a/, "AA", 2) @}'
10089 @print{} a b c AA b c
10092 In this case, @code{$0} is used as the default target string.
10093 @code{gensub} returns the new string as its result, which is
10094 passed directly to @code{print} for printing.
10096 If the @var{how} argument is a string that does not begin with @samp{g} or
10097 @samp{G}, or if it is a number that is less than zero, only one
10098 substitution is performed.
10100 If @var{regexp} does not match @var{target}, @code{gensub}'s return value
10101 is the original, unchanged value of @var{target}.
10103 @cindex differences between @code{gawk} and @code{awk}
10104 @code{gensub} is a @code{gawk} extension; it is not available
10105 in compatibility mode (@pxref{Options, ,Command Line Options}).
10107 @item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
10109 This returns a @var{length}-character-long substring of @var{string},
10110 starting at character number @var{start}. The first character of a
10111 string is character number one. For example,
10112 @code{substr("washington", 5, 3)} returns @code{"ing"}.
10114 If @var{length} is not present, this function returns the whole suffix of
10115 @var{string} that begins at character number @var{start}. For example,
10116 @code{substr("washington", 5)} returns @code{"ington"}. The whole
10117 suffix is also returned
10118 if @var{length} is greater than the number of characters remaining
10119 in the string, counting from character number @var{start}.
10121 @strong{Note:} The string returned by @code{substr} @emph{cannot} be
10122 assigned to. Thus, it is a mistake to attempt to change a portion of
10123 a string, like this:
10127 # try to get "abCDEf", won't work
10128 substr(string, 3, 3) = "CDE"
10132 or to use @code{substr} as the third agument of @code{sub} or @code{gsub}:
10135 gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG
10138 @cindex case conversion
10139 @cindex conversion of case
10140 @item tolower(@var{string})
10142 This returns a copy of @var{string}, with each upper-case character
10143 in the string replaced with its corresponding lower-case character.
10144 Non-alphabetic characters are left unchanged. For example,
10145 @code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
10147 @item toupper(@var{string})
10149 This returns a copy of @var{string}, with each lower-case character
10150 in the string replaced with its corresponding upper-case character.
10151 Non-alphabetic characters are left unchanged. For example,
10152 @code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
10155 @c fakenode --- for prepinfo
10156 @subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}
10158 @cindex escape processing, @code{sub} et. al.
10159 When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal
10160 backslashes and ampersands into the replacement text, you need to remember
10161 that there are several levels of @dfn{escape processing} going on.
10163 First, there is the @dfn{lexical} level, which is when @code{awk} reads
10164 your program, and builds an internal copy of your program that can
10167 Then there is the run-time level, when @code{awk} actually scans the
10168 replacement string to determine what to generate.
10170 At both levels, @code{awk} looks for a defined set of characters that
10171 can come after a backslash. At the lexical level, it looks for the
10172 escape sequences listed in @ref{Escape Sequences}.
10173 Thus, for every @samp{\} that @code{awk} will process at the run-time
10174 level, you type two @samp{\}s at the lexical level.
10175 When a character that is not valid for an escape sequence follows the
10176 @samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial
10177 @samp{\}, and put the following character into the string. Thus, for
10178 example, @code{"a\qb"} is treated as @code{"aqb"}.
10180 At the run-time level, the various functions handle sequences of
10181 @samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
10183 Historically, the @code{sub} and @code{gsub} functions treated the two
10184 character sequence @samp{\&} specially; this sequence was replaced in
10185 the generated text with a single @samp{&}. Any other @samp{\} within
10186 the @var{replacement} string that did not precede an @samp{&} was passed
10187 through unchanged. To illustrate with a table:
10189 @c Thank to Karl Berry for help with the TeX stuff.
10192 % This table has lots of &'s and \'s, so unspecialize them.
10193 \catcode`\& = \other \catcode`\\ = \other
10194 % But then we need character for escape and tab.
10196 @halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10197 You type!@code{sub} sees!@code{sub} generates@cr
10198 @hrulefill!@hrulefill!@hrulefill@cr
10199 @code{\&}! @code{&}!the matched text@cr
10200 @code{\\&}! @code{\&}!a literal @samp{&}@cr
10201 @code{\\\&}! @code{\&}!a literal @samp{&}@cr
10202 @code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr
10203 @code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr
10204 @code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr
10205 @code{\\q}! @code{\q}!a literal @samp{\q}@cr
10211 You type @code{sub} sees @code{sub} generates
10212 -------- ---------- ---------------
10213 @code{\&} @code{&} the matched text
10214 @code{\\&} @code{\&} a literal @samp{&}
10215 @code{\\\&} @code{\&} a literal @samp{&}
10216 @code{\\\\&} @code{\\&} a literal @samp{\&}
10217 @code{\\\\\&} @code{\\&} a literal @samp{\&}
10218 @code{\\\\\\&} @code{\\\&} a literal @samp{\\&}
10219 @code{\\q} @code{\q} a literal @samp{\q}
10224 This table shows both the lexical level processing, where
10225 an odd number of backslashes becomes an even number at the run time level,
10226 and the run-time processing done by @code{sub}.
10227 (For the sake of simplicity, the rest of the tables below only show the
10228 case of even numbers of @samp{\}s entered at the lexical level.)
10230 The problem with the historical approach is that there is no way to get
10231 a literal @samp{\} followed by the matched text.
10233 @cindex @code{awk} language, POSIX version
10234 @cindex POSIX @code{awk}
10235 The 1992 POSIX standard attempted to fix this problem. The standard
10236 says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
10237 after the @samp{\}. If either one follows a @samp{\}, that character is
10238 output literally. The interpretation of @samp{\} and @samp{&} then becomes
10241 @c thanks to Karl Berry for formatting this table
10244 % This table has lots of &'s and \'s, so unspecialize them.
10245 \catcode`\& = \other \catcode`\\ = \other
10246 % But then we need character for escape and tab.
10248 @halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10249 You type!@code{sub} sees!@code{sub} generates@cr
10250 @hrulefill!@hrulefill!@hrulefill@cr
10251 @code{&}! @code{&}!the matched text@cr
10252 @code{\\&}! @code{\&}!a literal @samp{&}@cr
10253 @code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
10254 @code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
10260 You type @code{sub} sees @code{sub} generates
10261 -------- ---------- ---------------
10262 @code{&} @code{&} the matched text
10263 @code{\\&} @code{\&} a literal @samp{&}
10264 @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
10265 @code{\\\\\\&} @code{\\\&} a literal @samp{\&}
10270 This would appear to solve the problem.
10271 Unfortunately, the phrasing of the standard is unusual. It
10272 says, in effect, that @samp{\} turns off the special meaning of any
10273 following character, but that for anything other than @samp{\} and @samp{&},
10274 such special meaning is undefined. This wording leads to two problems.
10278 Backslashes must now be doubled in the @var{replacement} string, breaking
10279 historical @code{awk} programs.
10282 To make sure that an @code{awk} program is portable, @emph{every} character
10283 in the @var{replacement} string must be preceded with a
10284 backslash.@footnote{This consequence was certainly unintended.}
10285 @c I can say that, 'cause I was involved in making this change
10288 The POSIX standard is under revision.@footnote{As of @value{UPDATE-MONTH},
10289 with final approval and publication as part of the Austin Group
10290 Standards hopefully sometime in 2001.}
10291 Because of the above problems, proposed text for the revised standard
10292 reverts to rules that correspond more closely to the original existing
10293 practice. The proposed rules have special cases that make it possible
10294 to produce a @samp{\} preceding the matched text.
10298 % This table has lots of &'s and \'s, so unspecialize them.
10299 \catcode`\& = \other \catcode`\\ = \other
10300 % But then we need character for escape and tab.
10302 @halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10303 You type!@code{sub} sees!@code{sub} generates@cr
10304 @hrulefill!@hrulefill!@hrulefill@cr
10305 @code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
10306 @code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr
10307 @code{\\&}! @code{\&}!a literal @samp{&}@cr
10308 @code{\\q}! @code{\q}!a literal @samp{\q}@cr
10314 You type @code{sub} sees @code{sub} generates
10315 -------- ---------- ---------------
10316 @code{\\\\\\&} @code{\\\&} a literal @samp{\&}
10317 @code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text
10318 @code{\\&} @code{\&} a literal @samp{&}
10319 @code{\\q} @code{\q} a literal @samp{\q}
10323 In a nutshell, at the run-time level, there are now three special sequences
10324 of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,
10325 there was only one. However, as in the historical case, any @samp{\} that
10326 is not part of one of these three sequences is not special, and appears
10327 in the output literally.
10329 @code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and
10331 @c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
10332 Whether these proposed rules will actually become codified into the
10333 standard is unknown at this point. Subsequent @code{gawk} releases will
10334 track the standard and implement whatever the final version specifies;
10335 this @value{DOCUMENT} will be updated as well.
10337 The rules for @code{gensub} are considerably simpler. At the run-time
10338 level, whenever @code{gawk} sees a @samp{\}, if the following character
10339 is a digit, then the text that matched the corresponding parenthesized
10340 subexpression is placed in the generated output. Otherwise,
10341 no matter what the character after the @samp{\} is, that character will
10342 appear in the generated text, and the @samp{\} will not.
10346 % This table has lots of &'s and \'s, so unspecialize them.
10347 \catcode`\& = \other \catcode`\\ = \other
10348 % But then we need character for escape and tab.
10350 @halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
10351 You type!@code{gensub} sees!@code{gensub} generates@cr
10352 @hrulefill!@hrulefill!@hrulefill@cr
10353 @code{&}! @code{&}!the matched text@cr
10354 @code{\\&}! @code{\&}!a literal @samp{&}@cr
10355 @code{\\\\}! @code{\\}!a literal @samp{\}@cr
10356 @code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr
10357 @code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr
10358 @code{\\q}! @code{\q}!a literal @samp{q}@cr
10364 You type @code{gensub} sees @code{gensub} generates
10365 -------- ------------- ------------------
10366 @code{&} @code{&} the matched text
10367 @code{\\&} @code{\&} a literal @samp{&}
10368 @code{\\\\} @code{\\} a literal @samp{\}
10369 @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text
10370 @code{\\\\\\&} @code{\\\&} a literal @samp{\&}
10371 @code{\\q} @code{\q} a literal @samp{q}
10375 Because of the complexity of the lexical and run-time level processing,
10376 and the special cases for @code{sub} and @code{gsub},
10377 we recommend the use of @code{gawk} and @code{gensub} for when you have
10378 to do substitutions.
10380 @node I/O Functions, Time Functions, String Functions, Built-in
10381 @section Built-in Functions for Input/Output
10383 The following functions are related to Input/Output (I/O).
10384 Optional parameters are enclosed in square brackets (``['' and ``]'').
10387 @item close(@var{filename})
10389 Close the file @var{filename}, for input or output. The argument may
10390 alternatively be a shell command that was used for redirecting to or
10391 from a pipe; then the pipe is closed.
10392 @xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
10393 for more information.
10395 @item fflush(@r{[}@var{filename}@r{]})
10397 @cindex portability issues
10398 @cindex flushing buffers
10399 @cindex buffers, flushing
10400 @cindex buffering output
10401 @cindex output, buffering
10402 Flush any buffered output associated @var{filename}, which is either a
10403 file opened for writing, or a shell command for redirecting output to
10406 Many utility programs will @dfn{buffer} their output; they save information
10407 to be written to a disk file or terminal in memory, until there is enough
10408 for it to be worthwhile to send the data to the ouput device.
10409 This is often more efficient than writing
10410 every little bit of information as soon as it is ready. However, sometimes
10411 it is necessary to force a program to @dfn{flush} its buffers; that is,
10412 write the information to its destination, even if a buffer is not full.
10413 This is the purpose of the @code{fflush} function; @code{gawk} too
10414 buffers its output, and the @code{fflush} function can be used to force
10415 @code{gawk} to flush its buffers.
10417 @code{fflush} is a recent (1994) addition to the Bell Labs research
10418 version of @code{awk}; it is not part of the POSIX standard, and will
10419 not be available if @samp{--posix} has been specified on the command
10420 line (@pxref{Options, ,Command Line Options}).
10422 @code{gawk} extends the @code{fflush} function in two ways. The first
10423 is to allow no argument at all. In this case, the buffer for the
10424 standard output is flushed. The second way is to allow the null string
10425 (@w{@code{""}}) as the argument. In this case, the buffers for
10426 @emph{all} open output files and pipes are flushed.
10428 @code{fflush} returns zero if the buffer was successfully flushed,
10429 and nonzero otherwise.
10431 @item system(@var{command})
10433 @cindex interaction, @code{awk} and other programs
10434 The @code{system} function allows the user to execute operating system commands
10435 and then return to the @code{awk} program. The @code{system} function
10436 executes the command given by the string @var{command}. It returns, as
10437 its value, the status returned by the command that was executed.
10439 For example, if the following fragment of code is put in your @code{awk}
10444 system("date | mail -s 'awk run done' root")
10449 the system administrator will be sent mail when the @code{awk} program
10450 finishes processing input and begins its end-of-input processing.
10452 Note that redirecting @code{print} or @code{printf} into a pipe is often
10453 enough to accomplish your task. If you need to run many commands, it
10454 will be more efficient to simply print them to a pipe to the shell:
10457 while (@var{more stuff to do})
10458 print @var{command} | "/bin/sh"
10463 However, if your @code{awk}
10464 program is interactive, @code{system} is useful for cranking up large
10465 self-contained programs, such as a shell or an editor.
10467 Some operating systems cannot implement the @code{system} function.
10468 @code{system} causes a fatal error if it is not supported.
10471 @c fakenode --- for prepinfo
10472 @subheading Interactive vs. Non-Interactive Buffering
10473 @cindex buffering, interactive vs. non-interactive
10474 @cindex buffering, non-interactive vs. interactive
10475 @cindex interactive buffering vs. non-interactive
10476 @cindex non-interactive buffering vs. interactive
10478 As a side point, buffering issues can be even more confusing depending
10479 upon whether or not your program is @dfn{interactive}, i.e., communicating
10480 with a user sitting at a keyboard.@footnote{A program is interactive
10481 if the standard output is connected
10482 to a terminal device.}
10484 Interactive programs generally @dfn{line buffer} their output; they
10485 write out every line. Non-interactive programs wait until they have
10486 a full buffer, which may be many lines of output.
10488 @c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
10489 @c motivating me to write this section.
10490 Here is an example of the difference.
10493 $ awk '@{ print $1 + $2 @}'
10502 Each line of output is printed immediately. Compare that behavior
10506 $ awk '@{ print $1 + $2 @}' | cat
10515 Here, no output is printed until after the @kbd{Control-d} is typed, since
10516 it is all buffered, and sent down the pipe to @code{cat} in one shot.
10518 @c fakenode --- for prepinfo
10519 @subheading Controlling Output Buffering with @code{system}
10520 @cindex flushing buffers
10521 @cindex buffers, flushing
10522 @cindex buffering output
10523 @cindex output, buffering
10525 The @code{fflush} function provides explicit control over output buffering for
10526 individual files and pipes. However, its use is not portable to many other
10527 @code{awk} implementations. An alternative method to flush output
10528 buffers is by calling @code{system} with a null string as its argument:
10531 system("") # flush output
10535 @code{gawk} treats this use of the @code{system} function as a special
10536 case, and is smart enough not to run a shell (or other command
10537 interpreter) with the empty command. Therefore, with @code{gawk}, this
10538 idiom is not only useful, it is efficient. While this method should work
10539 with other @code{awk} implementations, it will not necessarily avoid
10540 starting an unnecessary shell. (Other implementations may only
10541 flush the buffer associated with the standard output, and not necessarily
10542 all buffered output.)
10544 If you think about what a programmer expects, it makes sense that
10545 @code{system} should flush any pending output. The following program:
10549 print "first print"
10550 system("echo system echo")
10551 print "second print"
10573 If @code{awk} did not flush its buffers before calling @code{system}, the
10574 latter (undesirable) output is what you would see.
10576 @node Time Functions, , I/O Functions, Built-in
10577 @section Functions for Dealing with Time Stamps
10580 @cindex time of day
10581 A common use for @code{awk} programs is the processing of log files
10582 containing time stamp information, indicating when a
10583 particular log record was written. Many programs log their time stamp
10584 in the form returned by the @code{time} system call, which is the
10585 number of seconds since a particular epoch. On POSIX systems,
10586 it is the number of seconds since Midnight, January 1, 1970, UTC.
10588 In order to make it easier to process such log files, and to produce
10589 useful reports, @code{gawk} provides two functions for working with time
10590 stamps. Both of these are @code{gawk} extensions; they are not specified
10591 in the POSIX standard, nor are they in any other known version
10594 Optional parameters are enclosed in square brackets (``['' and ``]'').
10599 This function returns the current time as the number of seconds since
10600 the system epoch. On POSIX systems, this is the number of seconds
10601 since Midnight, January 1, 1970, UTC. It may be a different number on
10604 @item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]})
10606 This function returns a string. It is similar to the function of the
10607 same name in ANSI C. The time specified by @var{timestamp} is used to
10608 produce a string, based on the contents of the @var{format} string.
10609 The @var{timestamp} is in the same format as the value returned by the
10610 @code{systime} function. If no @var{timestamp} argument is supplied,
10611 @code{gawk} will use the current time of day as the time stamp.
10612 If no @var{format} argument is supplied, @code{strftime} uses
10613 @code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces
10614 output (almost) equivalent to that of the @code{date} utility.
10615 (Versions of @code{gawk} prior to 3.0 require the @var{format} argument.)
10618 The @code{systime} function allows you to compare a time stamp from a
10619 log file with the current time of day. In particular, it is easy to
10620 determine how long ago a particular record was logged. It also allows
10621 you to produce log records using the ``seconds since the epoch'' format.
10623 The @code{strftime} function allows you to easily turn a time stamp
10624 into human-readable information. It is similar in nature to the @code{sprintf}
10626 (@pxref{String Functions, ,Built-in Functions for String Manipulation}),
10627 in that it copies non-format specification characters verbatim to the
10628 returned string, while substituting date and time values for format
10629 specifications in the @var{format} string.
10631 @code{strftime} is guaranteed by the ANSI C standard to support
10632 the following date format specifications:
10636 The locale's abbreviated weekday name.
10639 The locale's full weekday name.
10642 The locale's abbreviated month name.
10645 The locale's full month name.
10648 The locale's ``appropriate'' date and time representation.
10651 The day of the month as a decimal number (01--31).
10654 The hour (24-hour clock) as a decimal number (00--23).
10657 The hour (12-hour clock) as a decimal number (01--12).
10660 The day of the year as a decimal number (001--366).
10663 The month as a decimal number (01--12).
10666 The minute as a decimal number (00--59).
10669 The locale's equivalent of the AM/PM designations associated
10670 with a 12-hour clock.
10673 The second as a decimal number (00--60).@footnote{Occasionally there are
10674 minutes in a year with a leap second, which is why the
10675 seconds can go up to 60.}
10678 The week number of the year (the first Sunday as the first day of week one)
10679 as a decimal number (00--53).
10682 The weekday as a decimal number (0--6). Sunday is day zero.
10685 The week number of the year (the first Monday as the first day of week one)
10686 as a decimal number (00--53).
10689 The locale's ``appropriate'' date representation.
10692 The locale's ``appropriate'' time representation.
10695 The year without century as a decimal number (00--99).
10698 The year with century as a decimal number (e.g., 1995).
10701 The time zone name or abbreviation, or no characters if
10702 no time zone is determinable.
10705 A literal @samp{%}.
10708 If a conversion specifier is not one of the above, the behavior is
10709 undefined.@footnote{This is because ANSI C leaves the
10710 behavior of the C version of @code{strftime} undefined, and @code{gawk}
10711 will use the system's version of @code{strftime} if it's there.
10712 Typically, the conversion specifier will either not appear in the
10713 returned string, or it will appear literally.}
10715 @cindex locale, definition of
10716 Informally, a @dfn{locale} is the geographic place in which a program
10717 is meant to run. For example, a common way to abbreviate the date
10718 September 4, 1991 in the United States would be ``9/4/91''.
10719 In many countries in Europe, however, it would be abbreviated ``4.9.91''.
10720 Thus, the @samp{%x} specification in a @code{"US"} locale might produce
10721 @samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce
10722 @samp{4.9.91}. The ANSI C standard defines a default @code{"C"}
10723 locale, which is an environment that is typical of what most C programmers
10726 A public-domain C version of @code{strftime} is supplied with @code{gawk}
10727 for systems that are not yet fully ANSI-compliant. If that version is
10728 used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}),
10729 then the following additional format specifications are available:
10733 Equivalent to specifying @samp{%m/%d/%y}.
10736 The day of the month, padded with a space if it is only one digit.
10739 Equivalent to @samp{%b}, above.
10742 A newline character (ASCII LF).
10745 Equivalent to specifying @samp{%I:%M:%S %p}.
10748 Equivalent to specifying @samp{%H:%M}.
10751 Equivalent to specifying @samp{%H:%M:%S}.
10757 The hour (24-hour clock) as a decimal number (0-23).
10758 Single digit numbers are padded with a space.
10761 The hour (12-hour clock) as a decimal number (1-12).
10762 Single digit numbers are padded with a space.
10765 The century, as a number between 00 and 99.
10768 The weekday as a decimal number
10773 The week number of the year (the first Monday as the first
10774 day of week one) as a decimal number (01--53).
10775 The method for determining the week number is as specified by ISO 8601
10776 (to wit: if the week containing January 1 has four or more days in the
10777 new year, then it is week one, otherwise it is week 53 of the previous year
10778 and the next week is week one).
10781 The year with century of the ISO week number, as a decimal number.
10783 For example, January 1, 1993, is in week 53 of 1992. Thus, the year
10784 of its ISO week number is 1992, even though its year is 1993.
10785 Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year
10786 of its ISO week number is 1974, even though its year is 1973.
10789 The year without century of the ISO week number, as a decimal number (00--99).
10791 @item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI
10792 @itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
10793 These are ``alternate representations'' for the specifications
10794 that use only the second letter (@samp{%c}, @samp{%C}, and so on).
10795 They are recognized, but their normal representations are
10796 used.@footnote{If you don't understand any of this, don't worry about
10797 it; these facilities are meant to make it easier to ``internationalize''
10799 (These facilitate compliance with the POSIX @code{date} utility.)
10802 The date in VMS format (e.g., 20-JUN-1991).
10807 The timezone offset in a +HHMM format (e.g., the format necessary to
10808 produce RFC-822/RFC-1036 date headers).
10811 This example is an @code{awk} implementation of the POSIX
10812 @code{date} utility. Normally, the @code{date} utility prints the
10813 current date and time of day in a well known format. However, if you
10814 provide an argument to it that begins with a @samp{+}, @code{date}
10815 will copy non-format specifier characters to the standard output, and
10816 will interpret the current time according to the format specifiers in
10817 the string. For example:
10820 $ date '+Today is %A, %B %d, %Y.'
10821 @print{} Today is Thursday, July 11, 1991.
10824 Here is the @code{gawk} version of the @code{date} utility.
10825 It has a shell ``wrapper'', to handle the @samp{-u} option,
10826 which requires that @code{date} run as if the time zone
10833 # date --- approximate the P1003.2 'date' command
10836 -u) TZ=GMT0 # use UTC
10844 format = "%a %b %d %H:%M:%S %Z %Y"
10851 else if (ARGC == 2) @{
10853 if (format ~ /^\+/)
10854 format = substr(format, 2) # remove leading +
10856 print strftime(format)
10862 @node User-defined, Invoking Gawk, Built-in, Top
10863 @chapter User-defined Functions
10865 @cindex user-defined functions
10866 @cindex functions, user-defined
10867 Complicated @code{awk} programs can often be simplified by defining
10868 your own functions. User-defined functions can be called just like
10869 built-in ones (@pxref{Function Calls}), but it is up to you to define
10870 them---to tell @code{awk} what they should do.
10873 * Definition Syntax:: How to write definitions and what they mean.
10874 * Function Example:: An example function definition and what it
10876 * Function Caveats:: Things to watch out for.
10877 * Return Statement:: Specifying the value a function returns.
10880 @node Definition Syntax, Function Example, User-defined, User-defined
10881 @section Function Definition Syntax
10882 @cindex defining functions
10883 @cindex function definition
10885 Definitions of functions can appear anywhere between the rules of an
10886 @code{awk} program. Thus, the general form of an @code{awk} program is
10887 extended to include sequences of rules @emph{and} user-defined function
10889 There is no need in @code{awk} to put the definition of a function
10890 before all uses of the function. This is because @code{awk} reads the
10891 entire program before starting to execute any of it.
10893 The definition of a function named @var{name} looks like this:
10896 function @var{name}(@var{parameter-list})
10898 @var{body-of-function}
10902 @cindex names, use of
10905 @var{name} is the name of the function to be defined. A valid function
10906 name is like a valid variable name: a sequence of letters, digits and
10907 underscores, not starting with a digit.
10908 Within a single @code{awk} program, any particular name can only be
10909 used as a variable, array or function.
10911 @var{parameter-list} is a list of the function's arguments and local
10912 variable names, separated by commas. When the function is called,
10913 the argument names are used to hold the argument values given in
10914 the call. The local variables are initialized to the empty string.
10915 A function cannot have two parameters with the same name.
10917 The @var{body-of-function} consists of @code{awk} statements. It is the
10918 most important part of the definition, because it says what the function
10919 should actually @emph{do}. The argument names exist to give the body a
10920 way to talk about the arguments; local variables, to give the body
10921 places to keep temporary values.
10923 Argument names are not distinguished syntactically from local variable
10924 names; instead, the number of arguments supplied when the function is
10925 called determines how many argument variables there are. Thus, if three
10926 argument values are given, the first three names in @var{parameter-list}
10927 are arguments, and the rest are local variables.
10929 It follows that if the number of arguments is not the same in all calls
10930 to the function, some of the names in @var{parameter-list} may be
10931 arguments on some occasions and local variables on others. Another
10932 way to think of this is that omitted arguments default to the
10935 Usually when you write a function you know how many names you intend to
10936 use for arguments and how many you intend to use as local variables. It is
10937 conventional to place some extra space between the arguments and
10938 the local variables, to document how your function is supposed to be used.
10940 @cindex variable shadowing
10941 During execution of the function body, the arguments and local variable
10942 values hide or @dfn{shadow} any variables of the same names used in the
10943 rest of the program. The shadowed variables are not accessible in the
10944 function definition, because there is no way to name them while their
10945 names have been taken away for the local variables. All other variables
10946 used in the @code{awk} program can be referenced or set normally in the
10949 The arguments and local variables last only as long as the function body
10950 is executing. Once the body finishes, you can once again access the
10951 variables that were shadowed while the function was running.
10953 @cindex recursive function
10954 @cindex function, recursive
10955 The function body can contain expressions which call functions. They
10956 can even call this function, either directly or by way of another
10957 function. When this happens, we say the function is @dfn{recursive}.
10959 @cindex @code{awk} language, POSIX version
10960 @cindex POSIX @code{awk}
10961 In many @code{awk} implementations, including @code{gawk},
10962 the keyword @code{function} may be
10963 abbreviated @code{func}. However, POSIX only specifies the use of
10964 the keyword @code{function}. This actually has some practical implications.
10965 If @code{gawk} is in POSIX-compatibility mode
10966 (@pxref{Options, ,Command Line Options}), then the following
10967 statement will @emph{not} define a function:
10970 func foo() @{ a = sqrt($1) ; print a @}
10974 Instead it defines a rule that, for each record, concatenates the value
10975 of the variable @samp{func} with the return value of the function @samp{foo}.
10976 If the resulting string is non-null, the action is executed.
10977 This is probably not what was desired. (@code{awk} accepts this input as
10978 syntactically valid, since functions may be used before they are defined
10979 in @code{awk} programs.)
10981 @cindex portability issues
10982 To ensure that your @code{awk} programs are portable, always use the
10983 keyword @code{function} when defining a function.
10985 @node Function Example, Function Caveats, Definition Syntax, User-defined
10986 @section Function Definition Examples
10988 Here is an example of a user-defined function, called @code{myprint}, that
10989 takes a number and prints it in a specific format.
10992 function myprint(num)
10994 printf "%6.3g\n", num
10999 To illustrate, here is an @code{awk} rule which uses our @code{myprint}
11003 $3 > 0 @{ myprint($3) @}
11007 This program prints, in our special format, all the third fields that
11008 contain a positive number in our input. Therefore, when given:
11013 9.10 11.12 -13.14 15.16
11014 17.18 19.20 21.22 23.24
11019 this program, using our function to format the results, prints:
11026 This function deletes all the elements in an array.
11029 function delarray(a, i)
11036 When working with arrays, it is often necessary to delete all the elements
11037 in an array and start over with a new list of elements
11038 (@pxref{Delete, ,The @code{delete} Statement}).
11040 to repeat this loop everywhere in your program that you need to clear out
11041 an array, your program can just call @code{delarray}.
11042 (This guarantees portability. The usage @samp{delete @var{array}} to delete
11043 the contents of an entire array is a non-standard extension.)
11045 Here is an example of a recursive function. It takes a string
11046 as an input parameter, and returns the string in backwards order.
11049 function rev(str, start)
11054 return (substr(str, start, 1) rev(str, start - 1))
11058 If this function is in a file named @file{rev.awk}, we can test it
11062 $ echo "Don't Panic!" |
11063 > gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk
11064 @print{} !cinaP t'noD
11067 Here is an example that uses the built-in function @code{strftime}.
11068 (@xref{Time Functions, ,Functions for Dealing with Time Stamps},
11069 for more information on @code{strftime}.)
11070 The C @code{ctime} function takes a timestamp and returns it in a string,
11071 formatted in a well known fashion. Here is an @code{awk} version:
11074 @c file eg/lib/ctime.awk
11077 # awk version of C ctime(3) function
11080 function ctime(ts, format)
11082 format = "%a %b %d %H:%M:%S %Z %Y"
11084 ts = systime() # use current time as default
11085 return strftime(format, ts)
11091 @node Function Caveats, Return Statement, Function Example, User-defined
11092 @section Calling User-defined Functions
11094 @cindex call by value
11095 @cindex call by reference
11096 @cindex calling a function
11097 @cindex function call
11098 @dfn{Calling a function} means causing the function to run and do its job.
11099 A function call is an expression, and its value is the value returned by
11102 A function call consists of the function name followed by the arguments
11103 in parentheses. What you write in the call for the arguments are
11104 @code{awk} expressions; each time the call is executed, these
11105 expressions are evaluated, and the values are the actual arguments. For
11106 example, here is a call to @code{foo} with three arguments (the first
11107 being a string concatenation):
11110 foo(x y, "lose", 4 * z)
11113 @strong{Caution:} whitespace characters (spaces and tabs) are not allowed
11114 between the function name and the open-parenthesis of the argument list.
11115 If you write whitespace by mistake, @code{awk} might think that you mean
11116 to concatenate a variable with an expression in parentheses. However, it
11117 notices that you used a function name and not a variable name, and reports
11120 @cindex call by value
11121 When a function is called, it is given a @emph{copy} of the values of
11122 its arguments. This is known as @dfn{call by value}. The caller may use
11123 a variable as the expression for the argument, but the called function
11124 does not know this: it only knows what value the argument had. For
11125 example, if you write this code:
11133 then you should not think of the argument to @code{myfunc} as being
11134 ``the variable @code{foo}.'' Instead, think of the argument as the
11135 string value, @code{"bar"}.
11137 If the function @code{myfunc} alters the values of its local variables,
11138 this has no effect on any other variables. Thus, if @code{myfunc}
11143 function myfunc(str)
11153 to change its first argument variable @code{str}, this @emph{does not}
11154 change the value of @code{foo} in the caller. The role of @code{foo} in
11155 calling @code{myfunc} ended when its value, @code{"bar"}, was computed.
11156 If @code{str} also exists outside of @code{myfunc}, the function body
11157 cannot alter this outer value, because it is shadowed during the
11158 execution of @code{myfunc} and cannot be seen or changed from there.
11160 @cindex call by reference
11161 However, when arrays are the parameters to functions, they are @emph{not}
11162 copied. Instead, the array itself is made available for direct manipulation
11163 by the function. This is usually called @dfn{call by reference}.
11164 Changes made to an array parameter inside the body of a function @emph{are}
11165 visible outside that function.
11167 This can be @strong{very} dangerous if you do not watch what you are
11168 doing. For example:
11171 @emph{This can be very dangerous if you do not watch what you are
11172 doing.} For example:
11177 function changeit(array, ind, nvalue)
11179 array[ind] = nvalue
11184 a[1] = 1; a[2] = 2; a[3] = 3
11185 changeit(a, 2, "two")
11186 printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
11192 This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
11193 @code{changeit} stores @code{"two"} in the second element of @code{a}.
11195 @cindex undefined functions
11196 @cindex functions, undefined
11197 Some @code{awk} implementations allow you to call a function that
11198 has not been defined, and only report a problem at run-time when the
11199 program actually tries to call the function. For example:
11209 function bar() @{ @dots{} @}
11210 # note that `foo' is not defined
11215 Since the @samp{if} statement will never be true, it is not really a
11216 problem that @code{foo} has not been defined. Usually though, it is a
11217 problem if a program calls an undefined function.
11220 At one point, I had gawk dieing on this, but later decided that this might
11221 break old programs and/or test suites.
11224 If @samp{--lint} has been specified
11225 (@pxref{Options, ,Command Line Options}),
11226 @code{gawk} will report about calls to undefined functions.
11228 Some @code{awk} implementations generate a run-time
11229 error if you use the @code{next} statement
11230 (@pxref{Next Statement, , The @code{next} Statement})
11231 inside a user-defined function.
11232 @code{gawk} does not have this problem.
11234 @node Return Statement, , Function Caveats, User-defined
11235 @section The @code{return} Statement
11236 @cindex @code{return} statement
11238 The body of a user-defined function can contain a @code{return} statement.
11239 This statement returns control to the rest of the @code{awk} program. It
11240 can also be used to return a value for use in the rest of the @code{awk}
11241 program. It looks like this:
11244 return @r{[}@var{expression}@r{]}
11247 The @var{expression} part is optional. If it is omitted, then the returned
11248 value is undefined and, therefore, unpredictable.
11250 A @code{return} statement with no value expression is assumed at the end of
11251 every function definition. So if control reaches the end of the function
11252 body, then the function returns an unpredictable value. @code{awk}
11253 will @emph{not} warn you if you use the return value of such a function.
11255 Sometimes, you want to write a function for what it does, not for
11256 what it returns. Such a function corresponds to a @code{void} function
11257 in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not
11258 return any value; you should simply bear in mind that if you use the return
11259 value of such a function, you do so at your own risk.
11261 Here is an example of a user-defined function that returns a value
11262 for the largest number among the elements of an array:
11266 function maxelt(vec, i, ret)
11269 if (ret == "" || vec[i] > ret)
11278 You call @code{maxelt} with one argument, which is an array name. The local
11279 variables @code{i} and @code{ret} are not intended to be arguments;
11280 while there is nothing to stop you from passing two or three arguments
11281 to @code{maxelt}, the results would be strange. The extra space before
11282 @code{i} in the function parameter list indicates that @code{i} and
11283 @code{ret} are not supposed to be arguments. This is a convention that
11284 you should follow when you define functions.
11286 Here is a program that uses our @code{maxelt} function. It loads an
11287 array, calls @code{maxelt}, and then reports the maximum number in that
11293 function maxelt(vec, i, ret)
11296 if (ret == "" || vec[i] > ret)
11304 # Load all fields of each record into nums.
11306 for(i = 1; i <= NF; i++)
11316 Given the following input:
11322 256 291 1396 2962 100
11329 our program tells us (predictably) that @code{99385} is the largest number
11332 @node Invoking Gawk, Library Functions, User-defined, Top
11333 @chapter Running @code{awk}
11334 @cindex command line
11335 @cindex invocation of @code{gawk}
11336 @cindex arguments, command line
11337 @cindex options, command line
11338 @cindex long options
11339 @cindex options, long
11341 There are two ways to run @code{awk}: with an explicit program, or with
11342 one or more program files. Here are templates for both of them; items
11343 enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional.
11345 Besides traditional one-letter POSIX-style options, @code{gawk} also
11346 supports GNU long options.
11349 awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
11350 awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
11353 @cindex empty program
11354 @cindex dark corner
11355 It is possible to invoke @code{awk} with an empty program:
11358 $ awk '' datafile1 datafile2
11362 Doing so makes little sense though; @code{awk} will simply exit
11363 silently when given an empty program (d.c.). If @samp{--lint} has
11364 been specified on the command line, @code{gawk} will issue a
11365 warning that the program is empty.
11368 * Options:: Command line options and their meanings.
11369 * Other Arguments:: Input file names and variable assignments.
11370 * AWKPATH Variable:: Searching directories for @code{awk} programs.
11371 * Obsolete:: Obsolete Options and/or features.
11372 * Undocumented:: Undocumented Options and Features.
11373 * Known Bugs:: Known Bugs in @code{gawk}.
11376 @node Options, Other Arguments, Invoking Gawk, Invoking Gawk
11377 @section Command Line Options
11379 Options begin with a dash, and consist of a single character.
11380 GNU style long options consist of two dashes and a keyword.
11381 The keyword can be abbreviated, as long the abbreviation allows the option
11382 to be uniquely identified. If the option takes an argument, then the
11383 keyword is either immediately followed by an equals sign (@samp{=}) and the
11384 argument's value, or the keyword and the argument's value are separated
11385 by whitespace. For brevity, the discussion below only refers to the
11386 traditional short options; however the long and short options are
11387 interchangeable in all contexts.
11389 Each long option for @code{gawk} has a corresponding
11390 POSIX-style option. The options and their meanings are as follows:
11394 @itemx --field-separator @var{fs}
11395 @cindex @code{-F} option
11396 @cindex @code{--field-separator} option
11397 Sets the @code{FS} variable to @var{fs}
11398 (@pxref{Field Separators, ,Specifying How Fields are Separated}).
11400 @item -f @var{source-file}
11401 @itemx --file @var{source-file}
11402 @cindex @code{-f} option
11403 @cindex @code{--file} option
11404 Indicates that the @code{awk} program is to be found in @var{source-file}
11405 instead of in the first non-option argument.
11407 @item -v @var{var}=@var{val}
11408 @itemx --assign @var{var}=@var{val}
11409 @cindex @code{-v} option
11410 @cindex @code{--assign} option
11411 Sets the variable @var{var} to the value @var{val} @strong{before}
11412 execution of the program begins. Such variable values are available
11413 inside the @code{BEGIN} rule
11414 (@pxref{Other Arguments, ,Other Command Line Arguments}).
11416 The @samp{-v} option can only set one variable, but you can use
11417 it more than once, setting another variable each time, like this:
11418 @samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.
11420 @strong{Caution:} Using @samp{-v} to set the values of the builtin
11421 variables may lead to suprising results. @code{awk} will reset the
11422 values of those variables as it needs to, possibly ignoring any
11423 predefined value you may have given.
11425 @item -mf @var{NNN}
11426 @itemx -mr @var{NNN}
11427 Set various memory limits to the value @var{NNN}. The @samp{f} flag sets
11428 the maximum number of fields, and the @samp{r} flag sets the maximum
11429 record size. These two flags and the @samp{-m} option are from the
11430 Bell Labs research version of Unix @code{awk}. They are provided
11431 for compatibility, but otherwise ignored by
11432 @code{gawk}, since @code{gawk} has no predefined limits.
11434 @item -W @var{gawk-opt}
11435 @cindex @code{-W} option
11436 Following the POSIX standard, options that are implementation
11437 specific are supplied as arguments to the @samp{-W} option. These options
11438 also have corresponding GNU style long options.
11442 Signals the end of the command line options. The following arguments
11443 are not treated as options even if they begin with @samp{-}. This
11444 interpretation of @samp{--} follows the POSIX argument parsing
11447 This is useful if you have file names that start with @samp{-},
11448 or in shell scripts, if you have file names that will be specified
11449 by the user which could start with @samp{-}.
11452 The following @code{gawk}-specific options are available:
11455 @item -W traditional
11457 @itemx --traditional
11459 @cindex @code{--compat} option
11460 @cindex @code{--traditional} option
11461 @cindex compatibility mode
11462 Specifies @dfn{compatibility mode}, in which the GNU extensions to
11463 the @code{awk} language are disabled, so that @code{gawk} behaves just
11464 like the Bell Labs research version of Unix @code{awk}.
11465 @samp{--traditional} is the preferred form of this option.
11466 @xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
11467 which summarizes the extensions. Also see
11468 @ref{Compatibility Mode, ,Downward Compatibility and Debugging}.
11471 @itemx -W copyright
11474 @cindex @code{--copyleft} option
11475 @cindex @code{--copyright} option
11476 Print the short version of the General Public License, and then exit.
11477 This option may disappear in a future version of @code{gawk}.
11483 @cindex @code{--help} option
11484 @cindex @code{--usage} option
11485 Print a ``usage'' message summarizing the short and long style options
11486 that @code{gawk} accepts, and then exit.
11490 @cindex @code{--lint} option
11491 Warn about constructs that are dubious or non-portable to
11492 other @code{awk} implementations.
11493 Some warnings are issued when @code{gawk} first reads your program. Others
11494 are issued at run-time, as your program executes.
11498 @cindex @code{--lint-old} option
11499 Warn about constructs that are not available in
11500 the original Version 7 Unix version of @code{awk}
11501 (@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
11505 @cindex @code{--posix} option
11507 Operate in strict POSIX mode. This disables all @code{gawk}
11508 extensions (just like @samp{--traditional}), and adds the following additional
11511 @c IMPORTANT! Keep this list in sync with the one in node POSIX
11515 @code{\x} escape sequences are not recognized
11516 (@pxref{Escape Sequences}).
11519 Newlines do not act as whitespace to separate fields when @code{FS} is
11520 equal to a single space.
11523 The synonym @code{func} for the keyword @code{function} is not
11524 recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
11527 The operators @samp{**} and @samp{**=} cannot be used in
11528 place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
11529 and also @pxref{Assignment Ops, ,Assignment Expressions}).
11532 Specifying @samp{-Ft} on the command line does not set the value
11533 of @code{FS} to be a single tab character
11534 (@pxref{Field Separators, ,Specifying How Fields are Separated}).
11537 The @code{fflush} built-in function is not supported
11538 (@pxref{I/O Functions, , Built-in Functions for Input/Output}).
11541 If you supply both @samp{--traditional} and @samp{--posix} on the
11542 command line, @samp{--posix} will take precedence. @code{gawk}
11543 will also issue a warning if both options are supplied.
11545 @item -W re-interval
11546 @itemx --re-interval
11547 Allow interval expressions
11548 (@pxref{Regexp Operators, , Regular Expression Operators}),
11550 Because interval expressions were traditionally not available in @code{awk},
11551 @code{gawk} does not provide them by default. This prevents old @code{awk}
11552 programs from breaking.
11554 @item -W source @var{program-text}
11555 @itemx --source @var{program-text}
11556 @cindex @code{--source} option
11557 Program source code is taken from the @var{program-text}. This option
11558 allows you to mix source code in files with source
11559 code that you enter on the command line. This is particularly useful
11560 when you have library functions that you wish to use from your command line
11561 programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
11565 @cindex @code{--version} option
11566 Prints version information for this particular copy of @code{gawk}.
11567 This allows you to determine if your copy of @code{gawk} is up to date
11568 with respect to whatever the Free Software Foundation is currently
11570 It is also useful for bug reports
11571 (@pxref{Bugs, , Reporting Problems and Bugs}).
11574 Any other options are flagged as invalid with a warning message, but
11575 are otherwise ignored.
11577 In compatibility mode, as a special case, if the value of @var{fs} supplied
11578 to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
11579 character (@code{"\t"}). This is only true for @samp{--traditional}, and not
11581 (@pxref{Field Separators, ,Specifying How Fields are Separated}).
11583 The @samp{-f} option may be used more than once on the command line.
11584 If it is, @code{awk} reads its program source from all of the named files, as
11585 if they had been concatenated together into one big file. This is
11586 useful for creating libraries of @code{awk} functions. Useful functions
11587 can be written once, and then retrieved from a standard place, instead
11588 of having to be included into each individual program.
11590 You can type in a program at the terminal and still use library functions,
11591 by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal
11592 to use as part of the @code{awk} program. After typing your program,
11593 type @kbd{Control-d} (the end-of-file character) to terminate it.
11594 (You may also use @samp{-f -} to read program source from the standard
11595 input, but then you will not be able to also use the standard input as a
11598 Because it is clumsy using the standard @code{awk} mechanisms to mix source
11599 file and command line @code{awk} programs, @code{gawk} provides the
11600 @samp{--source} option. This does not require you to pre-empt the standard
11601 input for your source code, and allows you to easily mix command line
11602 and library source code
11603 (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
11605 If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
11606 will use the first non-option command line argument as the text of the
11607 program source code.
11609 @cindex @code{POSIXLY_CORRECT} environment variable
11610 @cindex environment variable, @code{POSIXLY_CORRECT}
11611 If the environment variable @code{POSIXLY_CORRECT} exists,
11612 then @code{gawk} will behave in strict POSIX mode, exactly as if
11613 you had supplied the @samp{--posix} command line option.
11614 Many GNU programs look for this environment variable to turn on
11615 strict POSIX mode. If you supply @samp{--lint} on the command line,
11616 and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
11617 then it will print a warning message indicating that POSIX
11620 You would typically set this variable in your shell's startup file.
11621 For a Bourne compatible shell (such as Bash), you would add these
11622 lines to the @file{.profile} file in your home directory.
11626 POSIXLY_CORRECT=true
11627 export POSIXLY_CORRECT
11631 For a @code{csh} compatible shell,@footnote{Not recommended.}
11632 you would add this line to the @file{.login} file in your home directory.
11635 setenv POSIXLY_CORRECT true
11638 @node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
11639 @section Other Command Line Arguments
11641 Any additional arguments on the command line are normally treated as
11642 input files to be processed in the order specified. However, an
11643 argument that has the form @code{@var{var}=@var{value}}, assigns
11644 the value @var{value} to the variable @var{var}---it does not specify a
11649 All these arguments are made available to your @code{awk} program in the
11650 @code{ARGV} array (@pxref{Built-in Variables}). Command line options
11651 and the program text (if present) are omitted from @code{ARGV}.
11652 All other arguments, including variable assignments, are
11653 included. As each element of @code{ARGV} is processed, @code{gawk}
11654 sets the variable @code{ARGIND} to the index in @code{ARGV} of the
11657 The distinction between file name arguments and variable-assignment
11658 arguments is made when @code{awk} is about to open the next input file.
11659 At that point in execution, it checks the ``file name'' to see whether
11660 it is really a variable assignment; if so, @code{awk} sets the variable
11661 instead of reading a file.
11663 Therefore, the variables actually receive the given values after all
11664 previously specified files have been read. In particular, the values of
11665 variables assigned in this fashion are @emph{not} available inside a
11667 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),
11668 since such rules are run before @code{awk} begins scanning the argument list.
11670 @cindex dark corner
11671 The variable values given on the command line are processed for escape
11672 sequences (d.c.) (@pxref{Escape Sequences}).
11674 In some earlier implementations of @code{awk}, when a variable assignment
11675 occurred before any file names, the assignment would happen @emph{before}
11676 the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus
11677 inconsistent; some command line assignments were available inside the
11678 @code{BEGIN} rule, while others were not. However,
11679 some applications came to depend
11680 upon this ``feature.'' When @code{awk} was changed to be more consistent,
11681 the @samp{-v} option was added to accommodate applications that depended
11682 upon the old behavior.
11684 The variable assignment feature is most useful for assigning to variables
11685 such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
11686 output formats, before scanning the data files. It is also useful for
11687 controlling state if multiple passes are needed over a data file. For
11690 @cindex multiple passes over data
11691 @cindex passes, multiple
11693 awk 'pass == 1 @{ @var{pass 1 stuff} @}
11694 pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
11697 Given the variable assignment feature, the @samp{-F} option for setting
11698 the value of @code{FS} is not
11699 strictly necessary. It remains for historical compatibility.
11701 @node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk
11702 @section The @code{AWKPATH} Environment Variable
11703 @cindex @code{AWKPATH} environment variable
11704 @cindex environment variable, @code{AWKPATH}
11705 @cindex search path
11706 @cindex directory search
11707 @cindex path, search
11708 @cindex differences between @code{gawk} and @code{awk}
11710 The previous section described how @code{awk} program files can be named
11711 on the command line with the @samp{-f} option. In most @code{awk}
11712 implementations, you must supply a precise path name for each program
11713 file, unless the file is in the current directory.
11715 @cindex search path, for source files
11716 But in @code{gawk}, if the file name supplied to the @samp{-f} option
11717 does not contain a @samp{/}, then @code{gawk} searches a list of
11718 directories (called the @dfn{search path}), one by one, looking for a
11719 file with the specified name.
11721 The search path is a string consisting of directory names
11722 separated by colons. @code{gawk} gets its search path from the
11723 @code{AWKPATH} environment variable. If that variable does not exist,
11724 @code{gawk} uses a default path, which is
11725 @samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}
11726 may use a different directory; it
11727 will depend upon how @code{gawk} was built and installed. The actual
11728 directory will be the value of @samp{$(datadir)} generated when
11729 @code{gawk} was configured. You probably don't need to worry about this
11730 though.} (Programs written for use by
11731 system administrators should use an @code{AWKPATH} variable that
11732 does not include the current directory, @file{.}.)
11734 The search path feature is particularly useful for building up libraries
11735 of useful @code{awk} functions. The library files can be placed in a
11736 standard directory that is in the default path, and then specified on
11737 the command line with a short file name. Otherwise, the full file name
11738 would have to be typed for each file.
11740 By using both the @samp{--source} and @samp{-f} options, your command line
11741 @code{awk} programs can use facilities in @code{awk} library files.
11742 @xref{Library Functions, , A Library of @code{awk} Functions}.
11744 Path searching is not done if @code{gawk} is in compatibility mode.
11745 This is true for both @samp{--traditional} and @samp{--posix}.
11746 @xref{Options, ,Command Line Options}.
11748 @strong{Note:} if you want files in the current directory to be found,
11749 you must include the current directory in the path, either by including
11750 @file{.} explicitly in the path, or by writing a null entry in the
11751 path. (A null entry is indicated by starting or ending the path with a
11752 colon, or by placing two colons next to each other (@samp{::}).) If the
11753 current directory is not included in the path, then files cannot be
11754 found in the current directory. This path search mechanism is identical
11756 @c someday, @cite{The Bourne Again Shell}....
11758 Starting with version 3.0, if @code{AWKPATH} is not defined in the
11759 environment, @code{gawk} will place its default search path into
11760 @code{ENVIRON["AWKPATH"]}. This makes it easy to determine
11761 the actual search path @code{gawk} will use.
11763 @node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk
11764 @section Obsolete Options and/or Features
11766 @cindex deprecated options
11767 @cindex obsolete options
11768 @cindex deprecated features
11769 @cindex obsolete features
11770 This section describes features and/or command line options from
11771 previous releases of @code{gawk} that are either not available in the
11772 current version, or that are still supported but deprecated (meaning that
11773 they will @emph{not} be in the next release).
11775 @c update this section for each release!
11777 For version @value{VERSION}.@value{PATCHLEVEL} of @code{gawk}, there are no
11778 command line options
11779 or other deprecated features from the previous version of @code{gawk}.
11786 is thus essentially a place holder,
11787 in case some option becomes obsolete in a future version of @code{gawk}.
11790 @c This is pretty old news...
11791 The public-domain version of @code{strftime} that is distributed with
11792 @code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier
11793 that used to generate the date in VMS format was changed to @samp{%v}.
11794 This is because the POSIX standard for the @code{date} utility now
11795 specifies a @samp{%V} conversion specifier.
11796 @xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.
11799 @node Undocumented, Known Bugs, Obsolete, Invoking Gawk
11800 @section Undocumented Options and Features
11801 @cindex undocumented features
11803 @i{Use the Source, Luke!}
11808 This section intentionally left blank.
11810 @c Read The Source, Luke!
11813 @c If these came out in the Info file or TeX document, then they wouldn't
11814 @c be undocumented, would they?
11816 @code{gawk} has one undocumented option:
11821 Print the message @code{"awk: bailing out near line 1"} and dump core.
11822 This option was inspired by the common behavior of very early versions of
11823 Unix @code{awk}, and by a t--shirt.
11826 Early versions of @code{awk} used to not require any separator (either
11827 a newline or @samp{;}) between the rules in @code{awk} programs. Thus,
11828 it was common to see one-line programs like:
11831 awk '@{ sum += $1 @} END @{ print sum @}'
11834 @code{gawk} actually supports this, but it is purposely undocumented
11835 since it is considered bad style. The correct way to write such a program
11839 awk '@{ sum += $1 @} ; END @{ print sum @}'
11846 awk '@{ sum += $1 @}
11847 END @{ print sum @}' data
11851 @xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller
11856 @node Known Bugs, , Undocumented, Invoking Gawk
11857 @section Known Bugs in @code{gawk}
11858 @cindex bugs, known in @code{gawk}
11863 The @samp{-F} option for changing the value of @code{FS}
11864 (@pxref{Options, ,Command Line Options})
11865 is not necessary given the command line variable
11866 assignment feature; it remains only for backwards compatibility.
11869 If your system actually has support for @file{/dev/fd} and the
11870 associated @file{/dev/stdin}, @file{/dev/stdout}, and
11871 @file{/dev/stderr} files, you may get different output from @code{gawk}
11872 than you would get on a system without those files. When @code{gawk}
11873 interprets these files internally, it synchronizes output to the
11874 standard output with output to @file{/dev/stdout}, while on a system
11875 with those files, the output is actually to different open files
11876 (@pxref{Special Files, ,Special File Names in @code{gawk}}).
11879 Syntactically invalid single character programs tend to overflow
11880 the parse stack, generating a rather unhelpful message. Such programs
11881 are surprisingly difficult to diagnose in the completely general case,
11882 and the effort to do so really is not worth it.
11885 @node Library Functions, Sample Programs, Invoking Gawk, Top
11886 @chapter A Library of @code{awk} Functions
11888 @c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
11889 This chapter presents a library of useful @code{awk} functions. The
11890 sample programs presented later
11891 (@pxref{Sample Programs, ,Practical @code{awk} Programs})
11892 use these functions.
11893 The functions are presented here in a progression from simple to complex.
11895 @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
11896 presents a program that you can use to extract the source code for
11897 these example library functions and programs from the Texinfo source
11898 for this @value{DOCUMENT}.
11899 (This has already been done as part of the @code{gawk} distribution.)
11901 If you have written one or more useful, general purpose @code{awk} functions,
11902 and would like to contribute them for a subsequent edition of this @value{DOCUMENT},
11903 please contact the author. @xref{Bugs, ,Reporting Problems and Bugs},
11904 for information on doing this. Don't just send code, as you will be
11905 required to either place your code in the public domain,
11906 publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
11907 or assign the copyright in it to the Free Software Foundation.
11910 * Portability Notes:: What to do if you don't have @code{gawk}.
11911 * Nextfile Function:: Two implementations of a @code{nextfile}
11913 * Assert Function:: A function for assertions in @code{awk}
11915 * Round Function:: A function for rounding if @code{sprintf} does
11916 not do it correctly.
11917 * Ordinal Functions:: Functions for using characters as numbers and
11919 * Join Function:: A function to join an array into a string.
11920 * Mktime Function:: A function to turn a date into a timestamp.
11921 * Gettimeofday Function:: A function to get formatted times.
11922 * Filetrans Function:: A function for handling data file transitions.
11923 * Getopt Function:: A function for processing command line
11925 * Passwd Functions:: Functions for getting user information.
11926 * Group Functions:: Functions for getting group information.
11927 * Library Names:: How to best name private global variables in
11931 @node Portability Notes, Nextfile Function, Library Functions, Library Functions
11932 @section Simulating @code{gawk}-specific Features
11933 @cindex portability issues
11935 The programs in this chapter and in
11936 @ref{Sample Programs, ,Practical @code{awk} Programs},
11937 freely use features that are specific to @code{gawk}.
11938 This section briefly discusses how you can rewrite these programs for
11939 different implementations of @code{awk}.
11941 Diagnostic error messages are sent to @file{/dev/stderr}.
11942 Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system
11943 does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.
11945 A number of programs use @code{nextfile}
11946 (@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),
11947 to skip any remaining input in the input file.
11948 @ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
11949 shows you how to write a function that will do the same thing.
11951 Finally, some of the programs choose to ignore upper-case and lower-case
11952 distinctions in their input. They do this by assigning one to @code{IGNORECASE}.
11953 You can achieve the same effect by adding the following rule to the
11954 beginning of the program:
11958 @{ $0 = tolower($0) @}
11962 Also, verify that all regexp and string constants used in
11963 comparisons only use lower-case letters.
11965 @node Nextfile Function, Assert Function, Portability Notes, Library Functions
11966 @section Implementing @code{nextfile} as a Function
11968 @cindex skipping input files
11969 @cindex input files, skipping
11970 The @code{nextfile} statement presented in
11971 @ref{Nextfile Statement, ,The @code{nextfile} Statement},
11972 is a @code{gawk}-specific extension. It is not available in other
11973 implementations of @code{awk}. This section shows two versions of a
11974 @code{nextfile} function that you can use to simulate @code{gawk}'s
11975 @code{nextfile} statement if you cannot use @code{gawk}.
11977 Here is a first attempt at writing a @code{nextfile} function.
11981 # nextfile --- skip remaining records in current file
11983 # this should be read in before the "main" awk program
11985 function nextfile() @{ _abandon_ = FILENAME; next @}
11987 _abandon_ == FILENAME @{ next @}
11991 This file should be included before the main program, because it supplies
11992 a rule that must be executed first. This rule compares the current data
11993 file's name (which is always in the @code{FILENAME} variable) to a private
11994 variable named @code{_abandon_}. If the file name matches, then the action
11995 part of the rule executes a @code{next} statement, to go on to the next
11996 record. (The use of @samp{_} in the variable name is a convention.
11997 It is discussed more fully in
11998 @ref{Library Names, , Naming Library Function Global Variables}.)
12000 The use of the @code{next} statement effectively creates a loop that reads
12001 all the records from the current data file.
12002 Eventually, the end of the file is reached, and
12003 a new data file is opened, changing the value of @code{FILENAME}.
12004 Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
12005 fails, and execution continues with the first rule of the ``real'' program.
12007 The @code{nextfile} function itself simply sets the value of @code{_abandon_}
12008 and then executes a @code{next} statement to start the loop
12009 going.@footnote{Some implementations of @code{awk} do not allow you to
12010 execute @code{next} from within a function body. Some other work-around
12011 will be necessary if you use such a version.}
12012 @c mawk is what we're talking about.
12014 This initial version has a subtle problem. What happens if the same data
12015 file is listed @emph{twice} on the command line, one right after the other,
12016 or even with just a variable assignment between the two occurrences of
12019 @c @findex nextfile
12020 @c do it this way, since all the indices are merged
12021 @cindex @code{nextfile} function
12023 this code will skip right through the file, a second time, even though
12024 it should stop when it gets to the end of the first occurrence.
12025 Here is a second version of @code{nextfile} that remedies this problem.
12028 @c file eg/lib/nextfile.awk
12029 # nextfile --- skip remaining records in current file
12030 # correctly handle successive occurrences of the same file
12031 # Arnold Robbins, arnold@@gnu.org, Public Domain
12034 # this should be read in before the "main" awk program
12036 function nextfile() @{ _abandon_ = FILENAME; next @}
12039 _abandon_ == FILENAME @{
12049 The @code{nextfile} function has not changed. It sets @code{_abandon_}
12050 equal to the current file name and then executes a @code{next} satement.
12051 The @code{next} statement reads the next record and increments @code{FNR},
12052 so @code{FNR} is guaranteed to have a value of at least two.
12053 However, if @code{nextfile} is called for the last record in the file,
12054 then @code{awk} will close the current data file and move on to the next
12055 one. Upon doing so, @code{FILENAME} will be set to the name of the new file,
12056 and @code{FNR} will be reset to one. If this next file is the same as
12057 the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.
12058 However, @code{FNR} will be equal to one, telling us that this is a new
12059 occurrence of the file, and not the one we were reading when the
12060 @code{nextfile} function was executed. In that case, @code{_abandon_}
12061 is reset to the empty string, so that further executions of this rule
12062 will fail (until the next time that @code{nextfile} is called).
12064 If @code{FNR} is not one, then we are still in the original data file,
12065 and the program executes a @code{next} statement to skip through it.
12067 An important question to ask at this point is: ``Given that the
12068 functionality of @code{nextfile} can be provided with a library file,
12069 why is it built into @code{gawk}?'' This is an important question. Adding
12070 features for little reason leads to larger, slower programs that are
12071 harder to maintain.
12073 The answer is that building @code{nextfile} into @code{gawk} provides
12074 significant gains in efficiency. If the @code{nextfile} function is executed
12075 at the beginning of a large data file, @code{awk} still has to scan the entire
12076 file, splitting it up into records, just to skip over it. The built-in
12077 @code{nextfile} can simply close the file immediately and proceed to the
12078 next one, saving a lot of time. This is particularly important in
12079 @code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:
12080 they spend most of their time doing input and output, instead of performing
12083 @node Assert Function, Round Function, Nextfile Function, Library Functions
12084 @section Assertions
12087 @cindex @code{assert}, C version
12088 When writing large programs, it is often useful to be able to know
12089 that a condition or set of conditions is true. Before proceeding with a
12090 particular computation, you make a statement about what you believe to be
12091 the case. Such a statement is known as an
12092 ``assertion.'' The C language provides an @code{<assert.h>} header file
12093 and corresponding @code{assert} macro that the programmer can use to make
12094 assertions. If an assertion fails, the @code{assert} macro arranges to
12095 print a diagnostic message describing the condition that should have
12096 been true but was not, and then it kills the program. In C, using
12097 @code{assert} looks this:
12102 #include <assert.h>
12104 int myfunc(int a, double b)
12106 assert(a <= 5 && b >= 17);
12111 If the assertion failed, the program would print a message similar to
12115 prog.c:5: assertion failed: a <= 5 && b >= 17
12119 The ANSI C language makes it possible to turn the condition into a string for use
12120 in printing the diagnostic message. This is not possible in @code{awk}, so
12121 this @code{assert} function also requires a string version of the condition
12122 that is being tested.
12126 @c file eg/lib/assert.awk
12127 # assert --- assert that a condition is true. Otherwise exit.
12128 # Arnold Robbins, arnold@@gnu.org, Public Domain
12131 function assert(condition, string)
12133 if (! condition) @{
12134 printf("%s:%d: assertion failed: %s\n",
12135 FILENAME, FNR, string) > "/dev/stderr"
12149 The @code{assert} function tests the @code{condition} parameter. If it
12150 is false, it prints a message to standard error, using the @code{string}
12151 parameter to describe the failed condition. It then sets the variable
12152 @code{_assert_exit} to one, and executes the @code{exit} statement.
12153 The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
12154 rules finds @code{_assert_exit} to be true, then it exits immediately.
12156 The purpose of the @code{END} rule with its test is to
12157 keep any other @code{END} rules from running. When an assertion fails, the
12158 program should exit immediately.
12159 If no assertions fail, then @code{_assert_exit} will still be
12160 false when the @code{END} rule is run normally, and the rest of the
12161 program's @code{END} rules will execute.
12162 For all of this to work correctly, @file{assert.awk} must be the
12163 first source file read by @code{awk}.
12167 You would use this function in your programs this way:
12170 function myfunc(a, b)
12172 assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
12178 If the assertion failed, you would see a message like this:
12181 mydata:1357: assertion failed: a <= 5 && b >= 17
12184 There is a problem with this version of @code{assert}, that it may not
12185 be possible to work around with standard @code{awk}.
12186 An @code{END} rule is automatically added
12187 to the program calling @code{assert}. Normally, if a program consists
12188 of just a @code{BEGIN} rule, the input files and/or standard input are
12189 not read. However, now that the program has an @code{END} rule, @code{awk}
12190 will attempt to read the input data files, or standard input
12191 (@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),
12192 most likely causing the program to hang, waiting for input.
12194 @node Round Function, Ordinal Functions, Assert Function, Library Functions
12195 @section Rounding Numbers
12198 The way @code{printf} and @code{sprintf}
12199 (@pxref{Printf, , Using @code{printf} Statements for Fancier Printing})
12200 do rounding will often depend
12201 upon the system's C @code{sprintf} subroutine.
12203 @code{sprintf} rounding is ``unbiased,'' which means it doesn't always
12204 round a trailing @samp{.5} up, contrary to naive expectations. In unbiased
12205 rounding, @samp{.5} rounds to even, rather than always up, so 1.5 rounds to
12206 2 but 4.5 rounds to 4.
12207 The result is that if you are using a format that does
12208 rounding (e.g., @code{"%.0f"}) you should check what your system does.
12209 The following function does traditional rounding;
12210 it might be useful if your awk's @code{printf} does unbiased rounding.
12214 @c file eg/lib/round.awk
12215 # round --- do normal rounding
12217 # Arnold Robbins, arnold@@gnu.org, August, 1996
12220 function round(x, ival, aval, fraction)
12222 ival = int(x) # integer part, int() truncates
12224 # see if fractional part
12225 if (ival == x) # no fraction
12229 aval = -x # absolute value
12231 fraction = aval - ival
12233 if (fraction >= .5)
12234 return int(x) - 1 # -2.5 --> -3
12236 return int(x) # -2.3 --> -2
12239 fraction = x - ival
12240 if (fraction >= .5)
12248 @{ print $0, round($0) @}
12252 @node Ordinal Functions, Join Function, Round Function, Library Functions
12253 @section Translating Between Characters and Numbers
12255 @cindex numeric character values
12256 @cindex values of characters as numbers
12257 One commercial implementation of @code{awk} supplies a built-in function,
12258 @code{ord}, which takes a character and returns the numeric value for that
12259 character in the machine's character set. If the string passed to
12260 @code{ord} has more than one character, only the first one is used.
12262 The inverse of this function is @code{chr} (from the function of the same
12263 name in Pascal), which takes a number and returns the corresponding character.
12265 Both functions can be written very nicely in @code{awk}; there is no real
12266 reason to build them into the @code{awk} interpreter.
12272 @c file eg/lib/ord.awk
12273 # ord.awk --- do ord and chr
12275 # Global identifiers:
12276 # _ord_: numerical values indexed by characters
12277 # _ord_init: function to initialize _ord_
12283 # 20 July, 1992, revised
12285 BEGIN @{ _ord_init() @}
12290 @c file eg/lib/ord.awk
12291 function _ord_init( low, high, i, t)
12293 low = sprintf("%c", 7) # BEL is ascii 7
12294 if (low == "\a") @{ # regular ascii
12297 @} else if (sprintf("%c", 128 + 7) == "\a") @{
12298 # ascii, mark parity
12301 @} else @{ # ebcdic(!)
12306 for (i = low; i <= high; i++) @{
12307 t = sprintf("%c", i)
12315 @cindex character sets
12316 @cindex character encodings
12319 @cindex mark parity
12320 Some explanation of the numbers used by @code{chr} is worthwhile.
12321 The most prominent character set in use today is ASCII. Although an
12322 eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only
12323 defines characters that use the values from zero to 127.@footnote{ASCII
12324 has been extended in many countries to use the values from 128 to 255
12325 for country-specific characters. If your system uses these extensions,
12326 you can simplify @code{_ord_init} to simply loop from zero to 255.}
12327 At least one computer manufacturer that we know of
12329 uses ASCII, but with mark parity, meaning that the leftmost bit in the byte
12330 is always one. What this means is that on those systems, characters
12331 have numeric values from 128 to 255.
12332 Finally, large mainframe systems use the EBCDIC character set, which
12333 uses all 256 values.
12334 While there are other character sets in use on some older systems,
12335 they are not really worth worrying about.
12339 @c file eg/lib/ord.awk
12340 function ord(str, c)
12342 # only first character is of interest
12343 c = substr(str, 1, 1)
12350 @c file eg/lib/ord.awk
12353 # force c to be numeric by adding 0
12354 return sprintf("%c", c + 0)
12360 @c file eg/lib/ord.awk
12361 #### test code ####
12365 # printf("enter a character: ")
12366 # if (getline var <= 0)
12368 # printf("ord(%s) = %d\n", var, ord(var))
12375 An obvious improvement to these functions would be to move the code for the
12376 @code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was
12377 written this way initially for ease of development.
12379 There is a ``test program'' in a @code{BEGIN} rule, for testing the
12380 function. It is commented out for production use.
12382 @node Join Function, Mktime Function, Ordinal Functions, Library Functions
12383 @section Merging an Array Into a String
12385 @cindex merging strings
12386 When doing string processing, it is often useful to be able to join
12387 all the strings in an array into one long string. The following function,
12388 @code{join}, accomplishes this task. It is used later in several of
12389 the application programs
12390 (@pxref{Sample Programs, ,Practical @code{awk} Programs}).
12392 Good function design is important; this function needs to be general, but it
12393 should also have a reasonable default behavior. It is called with an array
12394 and the beginning and ending indices of the elements in the array to be
12395 merged. This assumes that the array indices are numeric---a reasonable
12396 assumption since the array was likely created with @code{split}
12397 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
12402 @c file eg/lib/join.awk
12403 # join.awk --- join an array into a string
12404 # Arnold Robbins, arnold@@gnu.org, Public Domain
12407 function join(array, start, end, sep, result, i)
12411 else if (sep == SUBSEP) # magic value
12413 result = array[start]
12414 for (i = start + 1; i <= end; i++)
12415 result = result sep array[i]
12422 An optional additional argument is the separator to use when joining the
12423 strings back together. If the caller supplies a non-empty value,
12424 @code{join} uses it. If it is not supplied, it will have a null
12425 value. In this case, @code{join} uses a single blank as a default
12426 separator for the strings. If the value is equal to @code{SUBSEP},
12427 then @code{join} joins the strings with no separator between them.
12428 @code{SUBSEP} serves as a ``magic'' value to indicate that there should
12429 be no separation between the component strings.
12431 It would be nice if @code{awk} had an assignment operator for concatenation.
12432 The lack of an explicit operator for concatenation makes string operations
12433 more difficult than they really need to be.
12435 @node Mktime Function, Gettimeofday Function, Join Function, Library Functions
12436 @section Turning Dates Into Timestamps
12438 The @code{systime} function built in to @code{gawk}
12439 returns the current time of day as
12440 a timestamp in ``seconds since the Epoch.'' This timestamp
12441 can be converted into a printable date of almost infinitely variable
12442 format using the built-in @code{strftime} function.
12443 (For more information on @code{systime} and @code{strftime},
12444 @pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)
12446 @cindex converting dates to timestamps
12447 @cindex dates, converting to timestamps
12448 @cindex timestamps, converting from dates
12449 An interesting but difficult problem is to convert a readable representation
12450 of a date back into a timestamp. The ANSI C library provides a @code{mktime}
12451 function that does the basic job, converting a canonical representation of a
12452 date into a timestamp.
12454 It would appear at first glance that @code{gawk} would have to supply a
12455 @code{mktime} built-in function that was simply a ``hook'' to the C language
12456 version. In fact though, @code{mktime} can be implemented entirely in
12457 @code{awk}.@footnote{@value{UPDATE-MONTH}: Actually, I was mistaken when
12458 I wrote this. The version presented here doesn't always work correctly,
12459 and the next major version of @code{gawk} will provide @code{mktime}
12460 as a built-in function.}
12463 Here is a version of @code{mktime} for @code{awk}. It takes a simple
12464 representation of the date and time, and converts it into a timestamp.
12466 The code is presented here intermixed with explanatory prose. In
12467 @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
12468 you will see how the Texinfo source file for this @value{DOCUMENT}
12469 can be processed to extract the code into a single source file.
12471 The program begins with a descriptive comment and a @code{BEGIN} rule
12472 that initializes a table @code{_tm_months}. This table is a two-dimensional
12473 array that has the lengths of the months. The first index is zero for
12474 regular years, and one for leap years. The values are the same for all the
12475 months in both kinds of years, except for February; thus the use of multiple
12480 @c file eg/lib/mktime.awk
12481 # mktime.awk --- convert a canonical date representation
12483 # Arnold Robbins, arnold@@gnu.org, Public Domain
12488 # Initialize table of month lengths
12489 _tm_months[0,1] = _tm_months[1,1] = 31
12490 _tm_months[0,2] = 28; _tm_months[1,2] = 29
12491 _tm_months[0,3] = _tm_months[1,3] = 31
12492 _tm_months[0,4] = _tm_months[1,4] = 30
12493 _tm_months[0,5] = _tm_months[1,5] = 31
12494 _tm_months[0,6] = _tm_months[1,6] = 30
12495 _tm_months[0,7] = _tm_months[1,7] = 31
12496 _tm_months[0,8] = _tm_months[1,8] = 31
12497 _tm_months[0,9] = _tm_months[1,9] = 30
12498 _tm_months[0,10] = _tm_months[1,10] = 31
12499 _tm_months[0,11] = _tm_months[1,11] = 30
12500 _tm_months[0,12] = _tm_months[1,12] = 31
12506 The benefit of merging multiple @code{BEGIN} rules
12507 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
12508 is particularly clear when writing library files. Functions in library
12509 files can cleanly initialize their own private data and also provide clean-up
12510 actions in private @code{END} rules.
12512 The next function is a simple one that computes whether a given year is or
12513 is not a leap year. If a year is evenly divisible by four, but not evenly
12514 divisible by 100, or if it is evenly divisible by 400, then it is a leap
12515 year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
12516 @c Change this after the year 2000 to ``2000 was'' (:-)
12521 @c file eg/lib/mktime.awk
12522 # decide if a year is a leap year
12523 function _tm_isleap(year, ret)
12525 ret = (year % 4 == 0 && year % 100 != 0) ||
12534 This function is only used a few times in this file, and its computation
12535 could have been written @dfn{in-line} (at the point where it's used).
12536 Making it a separate function made the original development easier, and also
12537 avoids the possibility of typing errors when duplicating the code in
12540 The next function is more interesting. It does most of the work of
12541 generating a timestamp, which is converting a date and time into some number
12542 of seconds since the Epoch. The caller passes an array (rather
12543 imaginatively named @code{a}) containing six
12544 values: the year including century, the month as a number between one and 12,
12545 the day of the month, the hour as a number between zero and 23, the minute in
12546 the hour, and the seconds within the minute.
12548 The function uses several local variables to precompute the number of
12549 seconds in an hour, seconds in a day, and seconds in a year. Often,
12550 similar C code simply writes out the expression in-line, expecting the
12551 compiler to do @dfn{constant folding}. E.g., most C compilers would
12552 turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
12553 it every time at run time. Precomputing these values makes the
12554 function more efficient.
12559 @c file eg/lib/mktime.awk
12560 # convert a date into seconds
12561 function _tm_addup(a, total, yearsecs, daysecs,
12565 daysecs = 24 * hoursecs
12566 yearsecs = 365 * daysecs
12568 total = (a[1] - 1970) * yearsecs
12571 # extra day for leap years
12572 for (i = 1970; i < a[1]; i++)
12578 j = _tm_isleap(a[1])
12579 for (i = 1; i < a[2]; i++)
12580 total += _tm_months[j, i] * daysecs
12583 total += (a[3] - 1) * daysecs
12584 total += a[4] * hoursecs
12594 The function starts with a first approximation of all the seconds between
12595 Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
12596 It may be different on other systems.} and the beginning of the current
12597 year. It then goes through all those years, and for every leap year,
12598 adds an additional day's worth of seconds.
12600 The variable @code{j} holds either one or zero, if the current year is or is not
12602 For every month in the current year prior to the current month, it adds
12603 the number of seconds in the month, using the appropriate entry in the
12604 @code{_tm_months} array.
12606 Finally, it adds in the seconds for the number of days prior to the current
12607 day, and the number of hours, minutes, and seconds in the current day.
12609 The result is a count of seconds since January 1, 1970. This value is not
12610 yet what is needed though. The reason why is described shortly.
12612 The main @code{mktime} function takes a single character string argument.
12613 This string is a representation of a date and time in a ``canonical''
12614 (fixed) form. This string should be
12615 @code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.
12620 @c file eg/lib/mktime.awk
12621 # mktime --- convert a date into seconds,
12622 # compensate for time zone
12624 function mktime(str, res1, res2, a, b, i, j, t, diff)
12626 i = split(str, a, " ") # don't rely on FS
12638 a[2] < 1 || a[2] > 12 ||
12639 a[3] < 1 || a[3] > 31 ||
12640 a[4] < 0 || a[4] > 23 ||
12641 a[5] < 0 || a[5] > 59 ||
12642 a[6] < 0 || a[6] > 60 )
12646 res1 = _tm_addup(a)
12647 t = strftime("%Y %m %d %H %M %S", res1)
12650 printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"
12653 res2 = _tm_addup(b)
12658 printf("diff = %d seconds\n", diff) > "/dev/stderr"
12668 The function first splits the string into an array, using spaces and tabs as
12669 separators. If there are not six elements in the array, it returns an
12670 error, signaled as the value @minus{}1.
12671 Next, it forces each element of the array to be numeric, by adding zero to it.
12672 The following @samp{if} statement then makes sure that each element is
12673 within an allowable range. (This checking could be extended further, e.g.,
12674 to make sure that the day of the month is within the correct range for the
12675 particular month supplied.) All of this is essentially preliminary set-up
12676 and error checking.
12678 Recall that @code{_tm_addup} generated a value in seconds since Midnight,
12679 January 1, 1970. This value is not directly usable as the result we want,
12680 @emph{since the calculation does not account for the local timezone}. In other
12681 words, the value represents the count in seconds since the Epoch, but only
12682 for UTC (Universal Coordinated Time). If the local timezone is east or west
12683 of UTC, then some number of hours should be either added to, or subtracted from
12684 the resulting timestamp.
12686 For example, 6:23 p.m. in Atlanta, Georgia (USA), is normally five hours west
12687 of (behind) UTC. It is only four hours behind UTC if daylight savings
12689 If you are calling @code{mktime} in Atlanta, with the argument
12690 @code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
12691 for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to
12692 add another four hours worth of seconds to the result.
12694 How can @code{mktime} determine how far away it is from UTC? This is
12695 surprisingly easy. The returned timestamp represents the time passed to
12696 @code{mktime} @emph{as UTC}. This timestamp can be fed back to
12697 @code{strftime}, which will format it as a @emph{local} time; i.e.@: as
12698 if it already had the UTC difference added in to it. This is done by
12699 giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
12700 argument. It returns the computed timestamp in the original string
12701 format. The result represents a time that accounts for the UTC
12702 difference. When the new time is converted back to a timestamp, the
12703 difference between the two timestamps is the difference (in seconds)
12704 between the local timezone and UTC. This difference is then added back
12705 to the original result. An example demonstrating this is presented below.
12707 Finally, there is a ``main'' program for testing the function.
12710 @c there used to be a blank line after the getline,
12711 @c squished out for page formatting reasons
12713 @c file eg/lib/mktime.awk
12716 printf "Enter date as yyyy mm dd hh mm ss: "
12717 getline _tm_test_date
12718 t = mktime(_tm_test_date)
12719 r = strftime("%Y %m %d %H %M %S", t)
12720 printf "Got back (%s)\n", r
12727 The entire program uses two variables that can be set on the command
12728 line to control debugging output and to enable the test in the final
12729 @code{BEGIN} rule. Here is the result of a test run. (Note that debugging
12730 output is to standard error, and test output is to standard output.)
12734 $ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
12735 @print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
12736 @error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
12737 @error{} diff = 14400 seconds
12738 @print{} Got back (1993 05 23 15 35 10)
12742 The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
12744 of debugging output shows the resulting time as UTC---four hours ahead of
12745 the local time zone. The second line shows that the difference is 14400
12746 seconds, which is four hours. (The difference is only four hours, since
12747 daylight savings time is in effect during May.)
12748 The final line of test output shows that the timezone compensation
12749 algorithm works; the returned time is the same as the entered time.
12751 This program does not solve the general problem of turning an arbitrary date
12752 representation into a timestamp. That problem is very involved. However,
12753 the @code{mktime} function provides a foundation upon which to build. Other
12754 software can convert month names into numeric months, and AM/PM times into
12755 24-hour clocks, to generate the ``canonical'' format that @code{mktime}
12758 @node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
12759 @section Managing the Time of Day
12761 @cindex formatted timestamps
12762 @cindex timestamps, formatted
12763 The @code{systime} and @code{strftime} functions described in
12764 @ref{Time Functions, ,Functions for Dealing with Time Stamps},
12765 provide the minimum functionality necessary for dealing with the time of day
12766 in human readable form. While @code{strftime} is extensive, the control
12767 formats are not necessarily easy to remember or intuitively obvious when
12770 The following function, @code{gettimeofday}, populates a user-supplied array
12771 with pre-formatted time information. It returns a string with the current
12772 time formatted in the same way as the @code{date} utility.
12774 @findex gettimeofday
12777 @c file eg/lib/gettime.awk
12778 # gettimeofday --- get the time of day in a usable format
12779 # Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
12781 # Returns a string in the format of output of date(1)
12782 # Populates the array argument time with individual values:
12783 # time["second"] -- seconds (0 - 59)
12784 # time["minute"] -- minutes (0 - 59)
12785 # time["hour"] -- hours (0 - 23)
12786 # time["althour"] -- hours (0 - 12)
12787 # time["monthday"] -- day of month (1 - 31)
12788 # time["month"] -- month of year (1 - 12)
12789 # time["monthname"] -- name of the month
12790 # time["shortmonth"] -- short name of the month
12791 # time["year"] -- year within century (0 - 99)
12792 # time["fullyear"] -- year with century (19xx or 20xx)
12793 # time["weekday"] -- day of week (Sunday = 0)
12794 # time["altweekday"] -- day of week (Monday = 0)
12795 # time["weeknum"] -- week number, Sunday first day
12796 # time["altweeknum"] -- week number, Monday first day
12797 # time["dayname"] -- name of weekday
12798 # time["shortdayname"] -- short name of weekday
12799 # time["yearday"] -- day of year (0 - 365)
12800 # time["timezone"] -- abbreviation of timezone name
12801 # time["ampm"] -- AM or PM designation
12803 function gettimeofday(time, ret, now, i)
12805 # get time once, avoids unnecessary system calls
12808 # return date(1)-style output
12809 ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)
12811 # clear out target array
12815 # fill in values, force numeric values to be
12816 # numeric by adding 0
12817 time["second"] = strftime("%S", now) + 0
12818 time["minute"] = strftime("%M", now) + 0
12819 time["hour"] = strftime("%H", now) + 0
12820 time["althour"] = strftime("%I", now) + 0
12821 time["monthday"] = strftime("%d", now) + 0
12822 time["month"] = strftime("%m", now) + 0
12823 time["monthname"] = strftime("%B", now)
12824 time["shortmonth"] = strftime("%b", now)
12825 time["year"] = strftime("%y", now) + 0
12826 time["fullyear"] = strftime("%Y", now) + 0
12827 time["weekday"] = strftime("%w", now) + 0
12828 time["altweekday"] = strftime("%u", now) + 0
12829 time["dayname"] = strftime("%A", now)
12830 time["shortdayname"] = strftime("%a", now)
12831 time["yearday"] = strftime("%j", now) + 0
12832 time["timezone"] = strftime("%Z", now)
12833 time["ampm"] = strftime("%p", now)
12834 time["weeknum"] = strftime("%U", now) + 0
12835 time["altweeknum"] = strftime("%W", now) + 0
12842 The string indices are easier to use and read than the various formats
12843 required by @code{strftime}. The @code{alarm} program presented in
12844 @ref{Alarm Program, ,An Alarm Clock Program},
12845 uses this function.
12848 The @code{gettimeofday} function is presented above as it was written. A
12849 more general design for this function would have allowed the user to supply
12850 an optional timestamp value that would have been used instead of the current
12853 @node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
12854 @section Noting Data File Boundaries
12856 @cindex per file initialization and clean-up
12857 The @code{BEGIN} and @code{END} rules are each executed exactly once, at
12858 the beginning and end respectively of your @code{awk} program
12859 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
12860 We (the @code{gawk} authors) once had a user who mistakenly thought that the
12861 @code{BEGIN} rule was executed at the beginning of each data file and the
12862 @code{END} rule was executed at the end of each data file. When informed
12863 that this was not the case, the user requested that we add new special
12864 patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
12865 would have the desired behavior. He even supplied us the code to do so.
12867 However, after a little thought, I came up with the following library program.
12868 It arranges to call two user-supplied functions, @code{beginfile} and
12869 @code{endfile}, at the beginning and end of each data file.
12870 Besides solving the problem in only nine(!) lines of code, it does so
12871 @emph{portably}; this will work with any implementation of @code{awk}.
12877 # Give the user a hook for filename transitions
12879 # The user must supply functions beginfile() and endfile()
12880 # that each take the name of the file being started or
12881 # finished, respectively.
12883 # Arnold Robbins, arnold@@gnu.org, January 1992
12886 FILENAME != _oldfilename \
12888 if (_oldfilename != "")
12889 endfile(_oldfilename)
12890 _oldfilename = FILENAME
12891 beginfile(FILENAME)
12894 END @{ endfile(FILENAME) @}
12898 This file must be loaded before the user's ``main'' program, so that the
12899 rule it supplies will be executed first.
12901 This rule relies on @code{awk}'s @code{FILENAME} variable that
12902 automatically changes for each new data file. The current file name is
12903 saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does
12904 not equal @code{_oldfilename}, then a new data file is being processed, and
12905 it is necessary to call @code{endfile} for the old file. Since
12906 @code{endfile} should only be called if a file has been processed, the
12907 program first checks to make sure that @code{_oldfilename} is not the null
12908 string. The program then assigns the current file name to
12909 @code{_oldfilename}, and calls @code{beginfile} for the file.
12910 Since, like all @code{awk} variables, @code{_oldfilename} will be
12911 initialized to the null string, this rule executes correctly even for the
12914 The program also supplies an @code{END} rule, to do the final processing for
12915 the last file. Since this @code{END} rule comes before any @code{END} rules
12916 supplied in the ``main'' program, @code{endfile} will be called first. Once
12917 again the value of multiple @code{BEGIN} and @code{END} rules should be clear.
12921 This version has same problem as the first version of @code{nextfile}
12922 (@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
12923 If the same data file occurs twice in a row on command line, then
12924 @code{endfile} and @code{beginfile} will not be executed at the end of the
12925 first pass and at the beginning of the second pass.
12926 This version solves the problem.
12930 @c file eg/lib/ftrans.awk
12931 # ftrans.awk --- handle data file transitions
12933 # user supplies beginfile() and endfile() functions
12935 # Arnold Robbins, arnold@@gnu.org, November 1992
12939 if (_filename_ != "")
12940 endfile(_filename_)
12941 _filename_ = FILENAME
12942 beginfile(FILENAME)
12945 END @{ endfile(_filename_) @}
12950 In @ref{Wc Program, ,Counting Things},
12951 you will see how this library function can be used, and
12952 how it simplifies writing the main program.
12954 @node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
12955 @section Processing Command Line Options
12957 @cindex @code{getopt}, C version
12958 @cindex processing arguments
12959 @cindex argument processing
12960 Most utilities on POSIX compatible systems take options or ``switches'' on
12961 the command line that can be used to change the way a program behaves.
12962 @code{awk} is an example of such a program
12963 (@pxref{Options, ,Command Line Options}).
12964 Often, options take @dfn{arguments}, data that the program needs to
12965 correctly obey the command line option. For example, @code{awk}'s
12966 @samp{-F} option requires a string to use as the field separator.
12967 The first occurrence on the command line of either @samp{--} or a
12968 string that does not begin with @samp{-} ends the options.
12970 Most Unix systems provide a C function named @code{getopt} for processing
12971 command line arguments. The programmer provides a string describing the one
12972 letter options. If an option requires an argument, it is followed in the
12973 string with a colon. @code{getopt} is also passed the
12974 count and values of the command line arguments, and is called in a loop.
12975 @code{getopt} processes the command line arguments for option letters.
12976 Each time around the loop, it returns a single character representing the
12977 next option letter that it found, or @samp{?} if it found an invalid option.
12978 When it returns @minus{}1, there are no options left on the command line.
12980 When using @code{getopt}, options that do not take arguments can be
12981 grouped together. Furthermore, options that take arguments require that the
12982 argument be present. The argument can immediately follow the option letter,
12983 or it can be a separate command line argument.
12985 Given a hypothetical program that takes
12986 three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
12987 @samp{-b} requires an argument, all of the following are valid ways of
12988 invoking the program:
12992 prog -a -b foo -c data1 data2 data3
12993 prog -ac -bfoo -- data1 data2 data3
12994 prog -acbfoo data1 data2 data3
12998 Notice that when the argument is grouped with its option, the rest of
12999 the command line argument is considered to be the option's argument.
13000 In the above example, @samp{-acbfoo} indicates that all of the
13001 @samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
13002 and that @samp{foo} is the argument to the @samp{-b} option.
13004 @code{getopt} provides four external variables that the programmer can use.
13008 The index in the argument value array (@code{argv}) where the first
13009 non-option command line argument can be found.
13012 The string value of the argument to an option.
13015 Usually @code{getopt} prints an error message when it finds an invalid
13016 option. Setting @code{opterr} to zero disables this feature. (An
13017 application might wish to print its own error message.)
13020 The letter representing the command line option.
13021 While not usually documented, most versions supply this variable.
13024 The following C fragment shows how @code{getopt} might process command line
13025 arguments for @code{awk}.
13030 main(int argc, char *argv[])
13033 /* print our own message */
13037 while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
13039 case 'f': /* file */
13042 case 'F': /* field separator */
13045 case 'v': /* variable assignment */
13048 case 'W': /* extension */
13062 As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
13063 function to process both normal and GNU-style long options
13064 (@pxref{Options, ,Command Line Options}).
13066 The abstraction provided by @code{getopt} is very useful, and would be quite
13067 handy in @code{awk} programs as well. Here is an @code{awk} version of
13068 @code{getopt}. This function highlights one of the greatest weaknesses in
13069 @code{awk}, which is that it is very poor at manipulating single characters.
13070 Repeated calls to @code{substr} are necessary for accessing individual
13071 characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
13073 The discussion walks through the code a bit at a time.
13077 @c file eg/lib/getopt.awk
13078 # getopt --- do C library getopt(3) function in awk
13083 # Initial version: March, 1991
13084 # Revised: May, 1993
13087 # External variables:
13088 # Optind -- index of ARGV for first non-option argument
13089 # Optarg -- string value of argument to current option
13090 # Opterr -- if non-zero, print our own diagnostic
13091 # Optopt -- current option letter
13095 # -1 at end of options
13096 # ? for unrecognized option
13097 # <c> a character representing the current option
13100 # _opti index in multi-flag option, e.g., -abc
13105 The function starts out with some documentation: who wrote the code,
13106 and when it was revised, followed by a list of the global variables it uses,
13107 what the return values are and what they mean, and any global variables that
13108 are ``private'' to this library function. Such documentation is essential
13109 for any program, and particularly for library functions.
13114 @c file eg/lib/getopt.awk
13115 function getopt(argc, argv, options, optl, thisopt, i)
13117 optl = length(options)
13118 if (optl == 0) # no options given
13121 if (argv[Optind] == "--") @{ # all done
13125 @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
13133 The function first checks that it was indeed called with a string of options
13134 (the @code{options} parameter). If @code{options} has a zero length,
13135 @code{getopt} immediately returns @minus{}1.
13137 The next thing to check for is the end of the options. A @samp{--} ends the
13138 command line options, as does any command line argument that does not begin
13139 with a @samp{-}. @code{Optind} is used to step through the array of command
13140 line arguments; it retains its value across calls to @code{getopt}, since it
13141 is a global variable.
13143 The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
13144 perhaps a bit of overkill; it checks for a @samp{-} followed by anything
13145 that is not whitespace and not a colon.
13146 If the current command line argument does not match this pattern,
13147 it is not an option, and it ends option processing.
13151 @c file eg/lib/getopt.awk
13154 thisopt = substr(argv[Optind], _opti, 1)
13156 i = index(options, thisopt)
13159 printf("%c -- invalid option\n",
13160 thisopt) > "/dev/stderr"
13161 if (_opti >= length(argv[Optind])) @{
13172 The @code{_opti} variable tracks the position in the current command line
13173 argument (@code{argv[Optind]}). In the case that multiple options were
13174 grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
13175 to return them to the user one at a time.
13177 If @code{_opti} is equal to zero, it is set to two, the index in the string
13178 of the next character to look at (we skip the @samp{-}, which is at position
13179 one). The variable @code{thisopt} holds the character, obtained with
13180 @code{substr}. It is saved in @code{Optopt} for the main program to use.
13182 If @code{thisopt} is not in the @code{options} string, then it is an
13183 invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error
13184 message on the standard error that is similar to the message from the C
13185 version of @code{getopt}.
13187 Since the option is invalid, it is necessary to skip it and move on to the
13188 next option character. If @code{_opti} is greater than or equal to the
13189 length of the current command line argument, then it is necessary to move on
13190 to the next one, so @code{Optind} is incremented and @code{_opti} is reset
13191 to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
13194 In any case, since the option was invalid, @code{getopt} returns @samp{?}.
13195 The main program can examine @code{Optopt} if it needs to know what the
13196 invalid option letter actually was.
13200 @c file eg/lib/getopt.awk
13201 if (substr(options, i + 1, 1) == ":") @{
13202 # get option argument
13203 if (length(substr(argv[Optind], _opti + 1)) > 0)
13204 Optarg = substr(argv[Optind], _opti + 1)
13206 Optarg = argv[++Optind]
13214 If the option requires an argument, the option letter is followed by a colon
13215 in the @code{options} string. If there are remaining characters in the
13216 current command line argument (@code{argv[Optind]}), then the rest of that
13217 string is assigned to @code{Optarg}. Otherwise, the next command line
13218 argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,
13219 @code{_opti} is reset to zero, since there are no more characters left to
13220 examine in the current command line argument.
13224 @c file eg/lib/getopt.awk
13225 if (_opti == 0 || _opti >= length(argv[Optind])) @{
13236 Finally, if @code{_opti} is either zero or greater than the length of the
13237 current command line argument, it means this element in @code{argv} is
13238 through being processed, so @code{Optind} is incremented to point to the
13239 next element in @code{argv}. If neither condition is true, then only
13240 @code{_opti} is incremented, so that the next option letter can be processed
13241 on the next call to @code{getopt}.
13245 @c file eg/lib/getopt.awk
13247 Opterr = 1 # default is to diagnose
13248 Optind = 1 # skip ARGV[0]
13251 if (_getopt_test) @{
13252 while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
13253 printf("c = <%c>, optarg = <%s>\n",
13255 printf("non-option arguments:\n")
13256 for (; Optind < ARGC; Optind++)
13257 printf("\tARGV[%d] = <%s>\n",
13258 Optind, ARGV[Optind])
13265 The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
13266 @code{Opterr} is set to one, since the default behavior is for @code{getopt}
13267 to print a diagnostic message upon seeing an invalid option. @code{Optind}
13268 is set to one, since there's no reason to look at the program name, which is
13271 The rest of the @code{BEGIN} rule is a simple test program. Here is the
13272 result of two sample runs of the test program.
13276 $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
13277 @print{} c = <a>, optarg = <>
13278 @print{} c = <c>, optarg = <>
13279 @print{} c = <b>, optarg = <ARG>
13280 @print{} non-option arguments:
13281 @print{} ARGV[3] = <bax>
13282 @print{} ARGV[4] = <-x>
13286 $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
13287 @print{} c = <a>, optarg = <>
13288 @error{} x -- invalid option
13289 @print{} c = <?>, optarg = <>
13290 @print{} non-option arguments:
13291 @print{} ARGV[4] = <xyz>
13292 @print{} ARGV[5] = <abc>
13296 The first @samp{--} terminates the arguments to @code{awk}, so that it does
13297 not try to interpret the @samp{-a} etc. as its own options.
13299 Several of the sample programs presented in
13300 @ref{Sample Programs, ,Practical @code{awk} Programs},
13301 use @code{getopt} to process their arguments.
13303 @node Passwd Functions, Group Functions, Getopt Function, Library Functions
13304 @section Reading the User Database
13306 @cindex @file{/dev/user}
13307 The @file{/dev/user} special file
13308 (@pxref{Special Files, ,Special File Names in @code{gawk}})
13309 provides access to the current user's real and effective user and group id
13310 numbers, and if available, the user's supplementary group set.
13311 However, since these are numbers, they do not provide very useful
13312 information to the average user. There needs to be some way to find the
13313 user information associated with the user and group numbers. This
13314 section presents a suite of functions for retrieving information from the
13315 user database. @xref{Group Functions, ,Reading the Group Database},
13316 for a similar suite that retrieves information from the group database.
13318 @cindex @code{getpwent}, C version
13319 @cindex user information
13320 @cindex login information
13321 @cindex account information
13322 @cindex password file
13323 The POSIX standard does not define the file where user information is
13324 kept. Instead, it provides the @code{<pwd.h>} header file
13325 and several C language subroutines for obtaining user information.
13326 The primary function is @code{getpwent}, for ``get password entry.''
13327 The ``password'' comes from the original user database file,
13328 @file{/etc/passwd}, which kept user information, along with the
13329 encrypted passwords (hence the name).
13331 While an @code{awk} program could simply read @file{/etc/passwd} directly
13332 (the format is well known), because of the way password
13333 files are handled on networked systems,
13334 this file may not contain complete information about the system's set of users.
13336 @cindex @code{pwcat} program
13337 To be sure of being
13338 able to produce a readable, complete version of the user database, it is
13339 necessary to write a small C program that calls @code{getpwent}.
13340 @code{getpwent} is defined to return a pointer to a @code{struct passwd}.
13341 Each time it is called, it returns the next entry in the database.
13342 When there are no more entries, it returns @code{NULL}, the null pointer.
13343 When this happens, the C program should call @code{endpwent} to close the
13345 Here is @code{pwcat}, a C program that ``cats'' the password database.
13350 @c file eg/lib/pwcat.c
13354 * Generate a printable version of the password database
13372 while ((p = getpwent()) != NULL)
13373 printf("%s:%s:%d:%d:%s:%s:%s\n",
13374 p->pw_name, p->pw_passwd, p->pw_uid,
13375 p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);
13384 If you don't understand C, don't worry about it.
13385 The output from @code{pwcat} is the user database, in the traditional
13386 @file{/etc/passwd} format of colon-separated fields. The fields are:
13390 The user's login name.
13392 @item Encrypted password
13393 The user's encrypted password. This may not be available on some systems.
13396 The user's numeric user-id number.
13399 The user's numeric group-id number.
13402 The user's full name, and perhaps other information associated with the
13405 @item Home directory
13406 The user's login, or ``home'' directory (familiar to shell programmers as
13410 The program that will be run when the user logs in. This is usually a
13411 shell, such as Bash (the Gnu Bourne-Again shell).
13414 Here are a few lines representative of @code{pwcat}'s output.
13419 @print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
13420 @print{} nobody:*:65534:65534::/:
13421 @print{} daemon:*:1:1::/:
13422 @print{} sys:*:2:2::/:/bin/csh
13423 @print{} bin:*:3:3::/bin:
13424 @print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
13425 @print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
13426 @print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
13431 With that introduction, here is a group of functions for getting user
13432 information. There are several functions here, corresponding to the C
13433 functions of the same name.
13437 @c file eg/lib/passwdawk.in
13439 # passwd.awk --- access password file information
13440 # Arnold Robbins, arnold@@gnu.org, Public Domain
13444 # tailor this to suit your system
13445 _pw_awklib = "/usr/local/libexec/awk/"
13450 function _pw_init( oldfs, oldrs, olddol0, pwcat)
13459 pwcat = _pw_awklib "pwcat"
13460 while ((pwcat | getline) > 0) @{
13461 _pw_byname[$1] = $0
13463 _pw_bycount[++_pw_total] = $0
13476 The @code{BEGIN} rule sets a private variable to the directory where
13477 @code{pwcat} is stored. Since it is used to help out an @code{awk} library
13478 routine, we have chosen to put it in @file{/usr/local/libexec/awk}.
13479 You might want it to be in a different directory on your system.
13481 The function @code{_pw_init} keeps three copies of the user information
13482 in three associative arrays. The arrays are indexed by user name
13483 (@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of
13484 occurrence (@code{_pw_bycount}).
13486 The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only
13487 needs to be called once.
13489 Since this function uses @code{getline} to read information from
13490 @code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and
13491 @code{$0}. Doing so is necessary, since these functions could be called
13492 from anywhere within a user's program, and the user may have his or her
13493 own values for @code{FS} and @code{RS}.
13495 Problem, what if FIELDWIDTHS is in use? Sigh.
13498 The main part of the function uses a loop to read database lines, split
13499 the line into fields, and then store the line into each array as necessary.
13500 When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
13501 setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and
13502 @code{$0}. The use of @code{@w{_pw_count}} will be explained below.
13507 @c file eg/lib/passwdawk.in
13508 function getpwnam(name)
13511 if (name in _pw_byname)
13512 return _pw_byname[name]
13519 The @code{getpwnam} function takes a user name as a string argument. If that
13520 user is in the database, it returns the appropriate line. Otherwise it
13521 returns the null string.
13526 @c file eg/lib/passwdawk.in
13527 function getpwuid(uid)
13530 if (uid in _pw_byuid)
13531 return _pw_byuid[uid]
13539 the @code{getpwuid} function takes a user-id number argument. If that
13540 user number is in the database, it returns the appropriate line. Otherwise it
13541 returns the null string.
13546 @c file eg/lib/passwdawk.in
13547 function getpwent()
13550 if (_pw_count < _pw_total)
13551 return _pw_bycount[++_pw_count]
13558 The @code{getpwent} function simply steps through the database, one entry at
13559 a time. It uses @code{_pw_count} to track its current position in the
13560 @code{_pw_bycount} array.
13565 @c file eg/lib/passwdawk.in
13566 function endpwent()
13574 The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
13575 subsequent calls to @code{getpwent} will start over again.
13577 A conscious design decision in this suite is that each subroutine calls
13578 @code{@w{_pw_init}} to initialize the database arrays. The overhead of running
13579 a separate process to generate the user database, and the I/O to scan it,
13580 will only be incurred if the user's main program actually calls one of these
13581 functions. If this library file is loaded along with a user's program, but
13582 none of the routines are ever called, then there is no extra run-time overhead.
13583 (The alternative would be to move the body of @code{@w{_pw_init}} into a
13584 @code{BEGIN} rule, which would always run @code{pwcat}. This simplifies the
13585 code but runs an extra process that may never be needed.)
13587 In turn, calling @code{_pw_init} is not too expensive, since the
13588 @code{_pw_inited} variable keeps the program from reading the data more than
13589 once. If you are worried about squeezing every last cycle out of your
13590 @code{awk} program, the check of @code{_pw_inited} could be moved out of
13591 @code{_pw_init} and duplicated in all the other functions. In practice,
13592 this is not necessary, since most @code{awk} programs are I/O bound, and it
13593 would clutter up the code.
13595 The @code{id} program in @ref{Id Program, ,Printing Out User Information},
13596 uses these functions.
13598 @node Group Functions, Library Names, Passwd Functions, Library Functions
13599 @section Reading the Group Database
13601 @cindex @code{getgrent}, C version
13602 @cindex group information
13603 @cindex account information
13605 Much of the discussion presented in
13606 @ref{Passwd Functions, ,Reading the User Database},
13607 applies to the group database as well. Although there has traditionally
13608 been a well known file, @file{/etc/group}, in a well known format, the POSIX
13609 standard only provides a set of C library routines
13610 (@code{<grp.h>} and @code{getgrent})
13611 for accessing the information.
13612 Even though this file may exist, it likely does not have
13613 complete information. Therefore, as with the user database, it is necessary
13614 to have a small C program that generates the group database as its output.
13616 @cindex @code{grcat} program
13617 Here is @code{grcat}, a C program that ``cats'' the group database.
13622 @c file eg/lib/grcat.c
13626 * Generate a printable version of the group database
13628 * Arnold Robbins, arnold@@gnu.org
13647 while ((g = getgrent()) != NULL) @{
13648 printf("%s:%s:%d:", g->gr_name, g->gr_passwd,
13651 for (i = 0; g->gr_mem[i] != NULL; i++) @{
13652 printf("%s", g->gr_mem[i]);
13653 if (g->gr_mem[i+1] != NULL)
13665 Each line in the group database represent one group. The fields are
13666 separated with colons, and represent the following information.
13670 The name of the group.
13672 @item Group Password
13673 The encrypted group password. In practice, this field is never used. It is
13674 usually empty, or set to @samp{*}.
13676 @item Group ID Number
13677 The numeric group-id number. This number should be unique within the file.
13679 @item Group Member List
13680 A comma-separated list of user names. These users are members of the group.
13681 Most Unix systems allow users to be members of several groups
13682 simultaneously. If your system does, then reading @file{/dev/user} will
13683 return those group-id numbers in @code{$5} through @code{$NF}.
13684 (Note that @file{/dev/user} is a @code{gawk} extension;
13685 @pxref{Special Files, ,Special File Names in @code{gawk}}.)
13688 Here is what running @code{grcat} might produce:
13693 @print{} wheel:*:0:arnold
13694 @print{} nogroup:*:65534:
13695 @print{} daemon:*:1:
13697 @print{} staff:*:10:arnold,miriam,andy
13698 @print{} other:*:20:
13703 Here are the functions for obtaining information from the group database.
13704 There are several, modeled after the C library functions of the same names.
13709 @c file eg/lib/groupawk.in
13710 # group.awk --- functions for dealing with the group file
13711 # Arnold Robbins, arnold@@gnu.org, Public Domain
13716 # Change to suit your system
13717 _gr_awklib = "/usr/local/libexec/awk/"
13723 @c file eg/lib/groupawk.in
13724 function _gr_init( oldfs, oldrs, olddol0, grcat, n, a, i)
13739 grcat = _gr_awklib "grcat"
13740 while ((grcat | getline) > 0) @{
13741 if ($1 in _gr_byname)
13742 _gr_byname[$1] = _gr_byname[$1] "," $4
13744 _gr_byname[$1] = $0
13745 if ($3 in _gr_bygid)
13746 _gr_bygid[$3] = _gr_bygid[$3] "," $4
13750 n = split($4, a, "[ \t]*,[ \t]*")
13753 for (i = 1; i <= n; i++)
13754 if (a[i] in _gr_groupsbyuser)
13755 _gr_groupsbyuser[a[i]] = \
13756 _gr_groupsbyuser[a[i]] " " $1
13758 _gr_groupsbyuser[a[i]] = $1
13762 _gr_bycount[++_gr_count] = $0
13777 The @code{BEGIN} rule sets a private variable to the directory where
13778 @code{grcat} is stored. Since it is used to help out an @code{awk} library
13779 routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
13780 want it to be in a different directory on your system.
13782 These routines follow the same general outline as the user database routines
13783 (@pxref{Passwd Functions, ,Reading the User Database}).
13784 The @code{@w{_gr_inited}} variable is used to
13785 ensure that the database is scanned no more than once.
13786 The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
13787 @code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
13788 scanning the group information.
13790 The group information is stored is several associative arrays.
13791 The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
13792 (@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
13793 There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
13794 that is a space separated list of groups that each user belongs to.
13796 Unlike the user database, it is possible to have multiple records in the
13797 database for the same group. This is common when a group has a large number
13798 of members. Such a pair of entries might look like:
13801 tvpeople:*:101:johny,jay,arsenio
13802 tvpeople:*:101:david,conan,tom,joan
13805 For this reason, @code{_gr_init} looks to see if a group name or
13806 group-id number has already been seen. If it has, then the user names are
13807 simply concatenated onto the previous list of users. (There is actually a
13808 subtle problem with the code presented above. Suppose that
13809 the first time there were no names. This code adds the names with
13810 a leading comma. It also doesn't check that there is a @code{$4}.)
13812 Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores
13813 @code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero
13814 (it is used later), and makes @code{_gr_inited} non-zero.
13819 @c file eg/lib/groupawk.in
13820 function getgrnam(group)
13823 if (group in _gr_byname)
13824 return _gr_byname[group]
13831 The @code{getgrnam} function takes a group name as its argument, and if that
13832 group exists, it is returned. Otherwise, @code{getgrnam} returns the null
13838 @c file eg/lib/groupawk.in
13839 function getgrgid(gid)
13842 if (gid in _gr_bygid)
13843 return _gr_bygid[gid]
13850 The @code{getgrgid} function is similar, it takes a numeric group-id, and
13851 looks up the information associated with that group-id.
13856 @c file eg/lib/groupawk.in
13857 function getgruser(user)
13860 if (user in _gr_groupsbyuser)
13861 return _gr_groupsbyuser[user]
13868 The @code{getgruser} function does not have a C counterpart. It takes a
13869 user name, and returns the list of groups that have the user as a member.
13874 @c file eg/lib/groupawk.in
13875 function getgrent()
13878 if (++_gr_count in _gr_bycount)
13879 return _gr_bycount[_gr_count]
13886 The @code{getgrent} function steps through the database one entry at a time.
13887 It uses @code{_gr_count} to track its position in the list.
13892 @c file eg/lib/groupawk.in
13893 function endgrent()
13901 @code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can
13904 As with the user database routines, each function calls @code{_gr_init} to
13905 initialize the arrays. Doing so only incurs the extra overhead of running
13906 @code{grcat} if these functions are used (as opposed to moving the body of
13907 @code{_gr_init} into a @code{BEGIN} rule).
13909 Most of the work is in scanning the database and building the various
13910 associative arrays. The functions that the user calls are themselves very
13911 simple, relying on @code{awk}'s associative arrays to do work.
13913 The @code{id} program in @ref{Id Program, ,Printing Out User Information},
13914 uses these functions.
13916 @node Library Names, , Group Functions, Library Functions
13917 @section Naming Library Function Global Variables
13919 @cindex namespace issues in @code{awk}
13920 @cindex documenting @code{awk} programs
13921 @cindex programs, documenting
13922 Due to the way the @code{awk} language evolved, variables are either
13923 @dfn{global} (usable by the entire program), or @dfn{local} (usable just by
13924 a specific function). There is no intermediate state analogous to
13925 @code{static} variables in C.
13927 Library functions often need to have global variables that they can use to
13928 preserve state information between calls to the function. For example,
13929 @code{getopt}'s variable @code{_opti}
13930 (@pxref{Getopt Function, ,Processing Command Line Options}),
13931 and the @code{_tm_months} array used by @code{mktime}
13932 (@pxref{Mktime Function, ,Turning Dates Into Timestamps}).
13933 Such variables are called @dfn{private}, since the only functions that need to
13934 use them are the ones in the library.
13936 When writing a library function, you should try to choose names for your
13937 private variables so that they will not conflict with any variables used by
13938 either another library function or a user's main program. For example, a
13939 name like @samp{i} or @samp{j} is not a good choice, since user programs
13940 often use variable names like these for their own purposes.
13942 The example programs shown in this chapter all start the names of their
13943 private variables with an underscore (@samp{_}). Users generally don't use
13944 leading underscores in their variable names, so this convention immediately
13945 decreases the chances that the variable name will be accidentally shared
13946 with the user's program.
13948 In addition, several of the library functions use a prefix that helps
13949 indicate what function or set of functions uses the variables. For example,
13950 @code{_tm_months} in @code{mktime}
13951 (@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and
13952 @code{_pw_byname} in the user data base routines
13953 (@pxref{Passwd Functions, ,Reading the User Database}).
13954 This convention is recommended, since it even further decreases the chance
13955 of inadvertent conflict among variable names.
13956 Note that this convention can be used equally well both for variable names
13957 and for private function names too.
13959 While I could have re-written all the library routines to use this
13960 convention, I did not do so, in order to show how my own @code{awk}
13961 programming style has evolved, and to provide some basis for this
13964 As a final note on variable naming, if a function makes global variables
13965 available for use by a main program, it is a good convention to start that
13966 variable's name with a capital letter.
13967 For example, @code{getopt}'s @code{Opterr} and @code{Optind} variables
13968 (@pxref{Getopt Function, ,Processing Command Line Options}).
13969 The leading capital letter indicates that it is global, while the fact that
13970 the variable name is not all capital letters indicates that the variable is
13971 not one of @code{awk}'s built-in variables, like @code{FS}.
13973 It is also important that @emph{all} variables in library functions
13974 that do not need to save state are in fact declared local. If this is
13975 not done, the variable could accidentally be used in the user's program,
13976 leading to bugs that are very difficult to track down.
13979 function lib_func(x, y, l1, l2)
13982 @var{use variable} some_var # some_var could be local
13983 @dots{} # but is not by oversight
13988 A different convention, common in the Tcl community, is to use a single
13989 associative array to hold the values needed by the library function(s), or
13990 ``package.'' This significantly decreases the number of actual global names
13991 in use. For example, the functions described in
13992 @ref{Passwd Functions, , Reading the User Database},
13993 might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}},
13994 @code{@w{PW_data["count"]}} and @code{@w{PW_data["awklib"]}}, instead of
13995 @code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
13996 and @code{@w{_pw_count}}.
13998 The conventions presented in this section are exactly that, conventions. You
13999 are not required to write your programs this way, we merely recommend that
14002 @node Sample Programs, Language History, Library Functions, Top
14003 @chapter Practical @code{awk} Programs
14005 This chapter presents a potpourri of @code{awk} programs for your reading
14008 There are two sections. The first presents @code{awk}
14009 versions of several common POSIX utilities.
14010 The second is a grab-bag of interesting programs.
14013 Many of these programs use the library functions presented in
14014 @ref{Library Functions, ,A Library of @code{awk} Functions}.
14017 * Clones:: Clones of common utilities.
14018 * Miscellaneous Programs:: Some interesting @code{awk} programs.
14021 @node Clones, Miscellaneous Programs, Sample Programs, Sample Programs
14022 @section Re-inventing Wheels for Fun and Profit
14024 This section presents a number of POSIX utilities that are implemented in
14025 @code{awk}. Re-inventing these programs in @code{awk} is often enjoyable,
14026 since the algorithms can be very clearly expressed, and usually the code is
14027 very concise and simple. This is true because @code{awk} does so much for you.
14029 It should be noted that these programs are not necessarily intended to
14030 replace the installed versions on your system. Instead, their
14031 purpose is to illustrate @code{awk} language programming for ``real world''
14034 The programs are presented in alphabetical order.
14037 * Cut Program:: The @code{cut} utility.
14038 * Egrep Program:: The @code{egrep} utility.
14039 * Id Program:: The @code{id} utility.
14040 * Split Program:: The @code{split} utility.
14041 * Tee Program:: The @code{tee} utility.
14042 * Uniq Program:: The @code{uniq} utility.
14043 * Wc Program:: The @code{wc} utility.
14046 @node Cut Program, Egrep Program, Clones, Clones
14047 @subsection Cutting Out Fields and Columns
14049 @cindex @code{cut} utility
14050 The @code{cut} utility selects, or ``cuts,'' either characters or fields
14052 input and sends them to its standard output. @code{cut} can cut out either
14053 a list of characters, or a list of fields. By default, fields are separated
14054 by tabs, but you may supply a command line option to change the field
14055 @dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition
14056 of fields is less general than @code{awk}'s.
14058 A common use of @code{cut} might be to pull out just the login name of
14059 logged-on users from the output of @code{who}. For example, the following
14060 pipeline generates a sorted, unique list of the logged on users:
14063 who | cut -c1-8 | sort | uniq
14066 The options for @code{cut} are:
14069 @item -c @var{list}
14070 Use @var{list} as the list of characters to cut out. Items within the list
14071 may be separated by commas, and ranges of characters can be separated with
14072 dashes. The list @samp{1-8,15,22-35} specifies characters one through
14073 eight, 15, and 22 through 35.
14075 @item -f @var{list}
14076 Use @var{list} as the list of fields to cut out.
14078 @item -d @var{delim}
14079 Use @var{delim} as the field separator character instead of the tab
14083 Suppress printing of lines that do not contain the field delimiter.
14086 The @code{awk} implementation of @code{cut} uses the @code{getopt} library
14087 function (@pxref{Getopt Function, ,Processing Command Line Options}),
14088 and the @code{join} library function
14089 (@pxref{Join Function, ,Merging an Array Into a String}).
14091 The program begins with a comment describing the options and a @code{usage}
14092 function which prints out a usage message and exits. @code{usage} is called
14093 if invalid arguments are supplied.
14098 @c file eg/prog/cut.awk
14099 # cut.awk --- implement cut in awk
14100 # Arnold Robbins, arnold@@gnu.org, Public Domain
14104 # -f list Cut fields
14105 # -d c Field delimiter character
14106 # -c list Cut characters
14108 # -s Suppress lines without the delimiter character
14110 function usage( e1, e2)
14112 e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
14113 e2 = "usage: cut [-c list] [files...]"
14114 print e1 > "/dev/stderr"
14115 print e2 > "/dev/stderr"
14123 The variables @code{e1} and @code{e2} are used so that the function
14132 Next comes a @code{BEGIN} rule that parses the command line options.
14133 It sets @code{FS} to a single tab character, since that is @code{cut}'s
14134 default field separator. The output field separator is also set to be the
14135 same as the input field separator. Then @code{getopt} is used to step
14136 through the command line options. One or the other of the variables
14137 @code{by_fields} or @code{by_chars} is set to true, to indicate that
14138 processing should be done by fields or by characters respectively.
14139 When cutting by characters, the output field separator is set to the null
14144 @c file eg/prog/cut.awk
14147 FS = "\t" # default
14149 while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
14153 @} else if (c == "c") @{
14158 @} else if (c == "d") @{
14159 if (length(Optarg) > 1) @{
14160 printf("Using first character of %s" \
14161 " for delimiter\n", Optarg) > "/dev/stderr"
14162 Optarg = substr(Optarg, 1, 1)
14166 if (FS == " ") # defeat awk semantics
14168 @} else if (c == "s")
14175 for (i = 1; i < Optind; i++)
14181 Special care is taken when the field delimiter is a space. Using
14182 @code{@w{" "}} (a single space) for the value of @code{FS} is
14183 incorrect---@code{awk} would
14184 separate fields with runs of spaces, tabs and/or newlines, and we want them to be
14185 separated with individual spaces. Also, note that after @code{getopt} is
14186 through, we have to clear out all the elements of @code{ARGV} from one to
14187 @code{Optind}, so that @code{awk} will not try to process the command line
14188 options as file names.
14190 After dealing with the command line options, the program verifies that the
14191 options make sense. Only one or the other of @samp{-c} and @samp{-f} should
14192 be used, and both require a field list. Then either @code{set_fieldlist} or
14193 @code{set_charlist} is called to pull apart the list of fields or
14198 @c file eg/prog/cut.awk
14199 if (by_fields && by_chars)
14202 if (by_fields == 0 && by_chars == 0)
14203 by_fields = 1 # default
14205 if (fieldlist == "") @{
14206 print "cut: needs list for -c or -f" > "/dev/stderr"
14220 Here is @code{set_fieldlist}. It first splits the field list apart
14221 at the commas, into an array. Then, for each element of the array, it
14222 looks to see if it is actually a range, and if so splits it apart. The range
14223 is verified to make sure the first number is smaller than the second.
14224 Each number in the list is added to the @code{flist} array, which simply
14225 lists the fields that will be printed.
14226 Normal field splitting is used.
14227 The program lets @code{awk}
14228 handle the job of doing the field splitting.
14232 @c file eg/prog/cut.awk
14233 function set_fieldlist( n, m, i, j, k, f, g)
14235 n = split(fieldlist, f, ",")
14236 j = 1 # index in flist
14237 for (i = 1; i <= n; i++) @{
14238 if (index(f[i], "-") != 0) @{ # a range
14239 m = split(f[i], g, "-")
14240 if (m != 2 || g[1] >= g[2]) @{
14241 printf("bad field list: %s\n",
14242 f[i]) > "/dev/stderr"
14245 for (k = g[1]; k <= g[2]; k++)
14256 The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
14257 The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable
14258 (@pxref{Constant Size, ,Reading Fixed-width Data}),
14259 which describes constant width input. When using a character list, that is
14260 exactly what we have.
14262 Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
14263 fields that need to be printed. We have to keep track of the fields to be
14264 printed, and also the intervening characters that have to be skipped.
14265 For example, suppose you wanted characters one through eight, 15, and
14266 22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value
14267 for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}. This gives us five
14268 fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.
14269 The intermediate fields are ``filler,'' stuff in between the desired data.
14271 @code{flist} lists the fields to be printed, and @code{t} tracks the
14272 complete field list, including filler fields.
14276 @c file eg/prog/cut.awk
14277 function set_charlist( field, i, j, f, g, t,
14280 field = 1 # count total fields
14281 n = split(fieldlist, f, ",")
14282 j = 1 # index in flist
14283 for (i = 1; i <= n; i++) @{
14284 if (index(f[i], "-") != 0) @{ # range
14285 m = split(f[i], g, "-")
14286 if (m != 2 || g[1] >= g[2]) @{
14287 printf("bad character list: %s\n",
14288 f[i]) > "/dev/stderr"
14291 len = g[2] - g[1] + 1
14292 if (g[1] > 1) # compute length of filler
14293 filler = g[1] - last - 1
14297 t[field++] = filler
14298 t[field++] = len # length of field
14300 flist[j++] = field - 1
14303 filler = f[i] - last - 1
14307 t[field++] = filler
14310 flist[j++] = field - 1
14314 FIELDWIDTHS = join(t, 1, field - 1)
14321 Here is the rule that actually processes the data. If the @samp{-s} option
14322 was given, then @code{suppress} will be true. The first @code{if} statement
14323 makes sure that the input record does have the field separator. If
14324 @code{cut} is processing fields, @code{suppress} is true, and the field
14325 separator character is not in the record, then the record is skipped.
14327 If the record is valid, then at this point, @code{gawk} has split the data
14328 into fields, either using the character in @code{FS} or using fixed-length
14329 fields and @code{FIELDWIDTHS}. The loop goes through the list of fields
14330 that should be printed. If the corresponding field has data in it, it is
14331 printed. If the next field also has data, then the separator character is
14332 written out in between the fields.
14334 @c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below
14338 @c file eg/prog/cut.awk
14340 if (by_fields && suppress && $0 !~ FS)
14343 for (i = 1; i <= nfields; i++) @{
14344 if ($flist[i] != "") @{
14345 printf "%s", $flist[i]
14346 if (i < nfields && $flist[i+1] != "")
14356 This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}
14357 variable to do the character-based cutting. While it would be possible in
14358 other @code{awk} implementations to use @code{substr}
14359 (@pxref{String Functions, ,Built-in Functions for String Manipulation}),
14360 it would also be extremely painful to do so.
14361 The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
14362 of picking the input line apart by characters.
14364 @node Egrep Program, Id Program, Cut Program, Clones
14365 @subsection Searching for Regular Expressions in Files
14367 @cindex @code{egrep} utility
14368 The @code{egrep} utility searches files for patterns. It uses regular
14369 expressions that are almost identical to those available in @code{awk}
14370 (@pxref{Regexp Constants, ,Regular Expression Constants}). It is used this way:
14373 egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
14376 The @var{pattern} is a regexp.
14377 In typical usage, the regexp is quoted to prevent the shell from expanding
14378 any of the special characters as file name wildcards.
14379 Normally, @code{egrep} prints the
14380 lines that matched. If multiple file names are provided on the command
14381 line, each output line is preceded by the name of the file and a colon.
14387 Print out a count of the lines that matched the pattern, instead of the
14391 Be silent. No output is produced, and the exit value indicates whether
14392 or not the pattern was matched.
14395 Invert the sense of the test. @code{egrep} prints the lines that do
14396 @emph{not} match the pattern, and exits successfully if the pattern was not
14400 Ignore case distinctions in both the pattern and the input data.
14403 Only print the names of the files that matched, not the lines that matched.
14405 @item -e @var{pattern}
14406 Use @var{pattern} as the regexp to match. The purpose of the @samp{-e}
14407 option is to allow patterns that start with a @samp{-}.
14410 This version uses the @code{getopt} library function
14411 (@pxref{Getopt Function, ,Processing Command Line Options}),
14412 and the file transition library program
14413 (@pxref{Filetrans Function, ,Noting Data File Boundaries}).
14415 The program begins with a descriptive comment, and then a @code{BEGIN} rule
14416 that processes the command line arguments with @code{getopt}. The @samp{-i}
14417 (ignore case) option is particularly easy with @code{gawk}; we just use the
14418 @code{IGNORECASE} built in variable
14419 (@pxref{Built-in Variables}).
14424 @c file eg/prog/egrep.awk
14425 # egrep.awk --- simulate egrep in awk
14426 # Arnold Robbins, arnold@@gnu.org, Public Domain
14430 # -c count of lines
14431 # -s silent - use exit value
14432 # -v invert test, success if no match
14434 # -l print filenames only
14435 # -e argument is pattern
14438 while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
14458 Next comes the code that handles the @code{egrep} specific behavior. If no
14459 pattern was supplied with @samp{-e}, the first non-option on the command
14460 line is used. The @code{awk} command line arguments up to @code{ARGV[Optind]}
14461 are cleared, so that @code{awk} won't try to process them as files. If no
14462 files were specified, the standard input is used, and if multiple files were
14463 specified, we make sure to note this so that the file names can precede the
14464 matched lines in the output.
14466 The last two lines are commented out, since they are not needed in
14467 @code{gawk}. They should be uncommented if you have to use another version
14472 @c file eg/prog/egrep.awk
14474 pattern = ARGV[Optind++]
14476 for (i = 1; i < Optind; i++)
14478 if (Optind >= ARGC) @{
14481 @} else if (ARGC - Optind > 1)
14485 # pattern = tolower(pattern)
14491 The next set of lines should be uncommented if you are not using
14492 @code{gawk}. This rule translates all the characters in the input line
14493 into lower-case if the @samp{-i} option was specified. The rule is
14494 commented out since it is not necessary with @code{gawk}.
14495 @c bug: if a match happens, we output the translated line, not the original
14499 @c file eg/prog/egrep.awk
14508 The @code{beginfile} function is called by the rule in @file{ftrans.awk}
14509 when each new file is processed. In this case, it is very simple; all it
14510 does is initialize a variable @code{fcount} to zero. @code{fcount} tracks
14511 how many lines in the current file matched the pattern.
14515 @c file eg/prog/egrep.awk
14516 function beginfile(junk)
14524 The @code{endfile} function is called after each file has been processed.
14525 It is used only when the user wants a count of the number of lines that
14526 matched. @code{no_print} will be true only if the exit status is desired.
14527 @code{count_only} will be true if line counts are desired. @code{egrep}
14528 will therefore only print line counts if printing and counting are enabled.
14529 The output format must be adjusted depending upon the number of files to be
14530 processed. Finally, @code{fcount} is added to @code{total}, so that we
14531 know how many lines altogether matched the pattern.
14535 @c file eg/prog/egrep.awk
14536 function endfile(file)
14538 if (! no_print && count_only)
14540 print file ":" fcount
14550 This rule does most of the work of matching lines. The variable
14551 @code{matches} will be true if the line matched the pattern. If the user
14552 wants lines that did not match, the sense of the @code{matches} is inverted
14553 using the @samp{!} operator. @code{fcount} is incremented with the value of
14554 @code{matches}, which will be either one or zero, depending upon a
14555 successful or unsuccessful match. If the line did not match, the
14556 @code{next} statement just moves on to the next record.
14558 There are several optimizations for performance in the following few lines
14559 of code. If the user only wants exit status (@code{no_print} is true), and
14560 we don't have to count lines, then it is enough to know that one line in
14561 this file matched, and we can skip on to the next file with @code{nextfile}.
14562 Along similar lines, if we are only printing file names, and we
14563 don't need to count lines, we can print the file name, and then skip to the
14564 next file with @code{nextfile}.
14566 Finally, each line is printed, with a leading filename and colon if
14570 2e: note, probably better to recode the last few lines as
14571 if (! count_only) @{
14575 if (filenames_only) @{
14581 print FILENAME ":" $0
14589 @c file eg/prog/egrep.awk
14591 matches = ($0 ~ pattern)
14593 matches = ! matches
14595 fcount += matches # 1 or 0
14600 if (no_print && ! count_only)
14603 if (filenames_only && ! count_only) @{
14608 if (do_filenames && ! count_only)
14609 print FILENAME ":" $0
14611 else if (! count_only)
14619 @c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.
14621 The @code{END} rule takes care of producing the correct exit status. If
14622 there were no matches, the exit status is one, otherwise it is zero.
14626 @c file eg/prog/egrep.awk
14637 The @code{usage} function prints a usage message in case of invalid options
14642 @c file eg/prog/egrep.awk
14645 e = "Usage: egrep [-csvil] [-e pat] [files ...]"
14646 print e > "/dev/stderr"
14653 The variable @code{e} is used so that the function fits nicely
14654 on the printed page.
14656 @cindex backslash continuation
14657 Just a note on programming style. You may have noticed that the @code{END}
14658 rule uses backslash continuation, with the open brace on a line by
14659 itself. This is so that it more closely resembles the way functions
14660 are written. Many of the examples
14664 use this style. You can decide for yourself if you like writing
14665 your @code{BEGIN} and @code{END} rules this way,
14668 @node Id Program, Split Program, Egrep Program, Clones
14669 @subsection Printing Out User Information
14671 @cindex @code{id} utility
14672 The @code{id} utility lists a user's real and effective user-id numbers,
14673 real and effective group-id numbers, and the user's group set, if any.
14674 @code{id} will only print the effective user-id and group-id if they are
14675 different from the real ones. If possible, @code{id} will also supply the
14676 corresponding user and group names. The output might look like this:
14680 @print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
14683 This information is exactly what is provided by @code{gawk}'s
14684 @file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).
14685 However, the @code{id} utility provides a more palatable output than just a
14688 Here is a simple version of @code{id} written in @code{awk}.
14689 It uses the user database library functions
14690 (@pxref{Passwd Functions, ,Reading the User Database}),
14691 and the group database library functions
14692 (@pxref{Group Functions, ,Reading the Group Database}).
14694 The program is fairly straightforward. All the work is done in the
14695 @code{BEGIN} rule. The user and group id numbers are obtained from
14696 @file{/dev/user}. If there is no support for @file{/dev/user}, the program
14699 The code is repetitive. The entry in the user database for the real user-id
14700 number is split into parts at the @samp{:}. The name is the first field.
14701 Similar code is used for the effective user-id number, and the group
14707 @c file eg/prog/id.awk
14708 # id.awk --- implement id in awk
14709 # Arnold Robbins, arnold@@gnu.org, Public Domain
14713 # uid=12(foo) euid=34(bar) gid=3(baz) \
14714 # egid=5(blat) groups=9(nine),2(two),1(one)
14718 if ((getline < "/dev/user") < 0) @{
14719 err = "id: no /dev/user support - cannot run"
14720 print err > "/dev/stderr"
14730 printf("uid=%d", uid)
14735 printf("(%s)", a[1])
14739 if (euid != uid) @{
14740 printf(" euid=%d", euid)
14741 pw = getpwuid(euid)
14744 printf("(%s)", a[1])
14748 printf(" gid=%d", gid)
14752 printf("(%s)", a[1])
14755 if (egid != gid) @{
14756 printf(" egid=%d", egid)
14757 pw = getgrgid(egid)
14760 printf("(%s)", a[1])
14765 printf(" groups=");
14766 for (i = 5; i <= NF; i++) @{
14771 printf("(%s)", a[1])
14787 The POSIX version of @code{id} takes arguments that control which
14788 information is printed. Modify this version to accept the same
14789 arguments and perform in the same way.
14792 @node Split Program, Tee Program, Id Program, Clones
14793 @subsection Splitting a Large File Into Pieces
14795 @cindex @code{split} utility
14796 The @code{split} program splits large text files into smaller pieces. By default,
14797 the output files are named @file{xaa}, @file{xab}, and so on. Each file has
14798 1000 lines in it, with the likely exception of the last file. To change the
14799 number of lines in each file, you supply a number on the command line
14800 preceded with a minus, e.g., @samp{-500} for files with 500 lines in them
14801 instead of 1000. To change the name of the output files to something like
14802 @file{myfileaa}, @file{myfileab}, and so on, you supply an additional
14803 argument that specifies the filename.
14805 Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and
14806 @code{chr} functions presented in
14807 @ref{Ordinal Functions, ,Translating Between Characters and Numbers}.
14809 The program first sets its defaults, and then tests to make sure there are
14810 not too many arguments. It then looks at each argument in turn. The
14811 first argument could be a minus followed by a number. If it is, this happens
14812 to look like a negative number, so it is made positive, and that is the
14813 count of lines. The data file name is skipped over, and the final argument
14814 is used as the prefix for the output file names.
14819 @c file eg/prog/split.awk
14820 # split.awk --- do split in awk
14821 # Arnold Robbins, arnold@@gnu.org, Public Domain
14824 # usage: split [-num] [file] [outname]
14827 outfile = "x" # default
14833 if (ARGV[i] ~ /^-[0-9]+$/) @{
14838 # test argv in case reading from stdin instead of file
14840 i++ # skip data file name
14847 out = (outfile s1 s2)
14853 The next rule does most of the work. @code{tcount} (temporary count) tracks
14854 how many lines have been printed to the output file so far. If it is greater
14855 than @code{count}, it is time to close the current file and start a new one.
14856 @code{s1} and @code{s2} track the current suffixes for the file name. If
14857 they are both @samp{z}, the file is just too big. Otherwise, @code{s1}
14858 moves to the next letter in the alphabet and @code{s2} starts over again at
14863 @c file eg/prog/split.awk
14865 if (++tcount > count) @{
14869 printf("split: %s is too large to split\n", \
14870 FILENAME) > "/dev/stderr"
14873 s1 = chr(ord(s1) + 1)
14876 s2 = chr(ord(s2) + 1)
14877 out = (outfile s1 s2)
14886 The @code{usage} function simply prints an error message and exits.
14890 @c file eg/prog/split.awk
14893 e = "usage: split [-num] [file] [outname]"
14894 print e > "/dev/stderr"
14902 The variable @code{e} is used so that the function
14911 This program is a bit sloppy; it relies on @code{awk} to close the last file
14912 for it automatically, instead of doing it in an @code{END} rule.
14914 @node Tee Program, Uniq Program, Split Program, Clones
14915 @subsection Duplicating Output Into Multiple Files
14917 @cindex @code{tee} utility
14918 The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies
14919 its standard input to its standard output, and also duplicates it to the
14920 files named on the command line. Its usage is:
14923 tee @r{[}-a@r{]} file @dots{}
14926 The @samp{-a} option tells @code{tee} to append to the named files, instead of
14927 truncating them and starting over.
14929 The @code{BEGIN} rule first makes a copy of all the command line arguments,
14930 into an array named @code{copy}.
14931 @code{ARGV[0]} is not copied, since it is not needed.
14932 @code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to
14933 process each file named in @code{ARGV} as input data.
14935 If the first argument is @samp{-a}, then the flag variable
14936 @code{append} is set to true, and both @code{ARGV[1]} and
14937 @code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file
14938 names were supplied, and @code{tee} prints a usage message and exits.
14939 Finally, @code{awk} is forced to read the standard input by setting
14940 @code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.
14942 @c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.
14947 @c file eg/prog/tee.awk
14948 # tee.awk --- tee in awk
14949 # Arnold Robbins, arnold@@gnu.org, Public Domain
14951 # Revised December 1995
14957 for (i = 1; i < ARGC; i++)
14962 if (ARGV[1] == "-a") @{
14971 print "usage: tee [-a] file ..." > "/dev/stderr"
14983 The single rule does all the work. Since there is no pattern, it is
14984 executed for each line of input. The body of the rule simply prints the
14985 line into each file on the command line, and then to the standard output.
14989 @c file eg/prog/tee.awk
14991 # moving the if outside the loop makes it run faster
15004 It would have been possible to code the loop this way:
15015 This is more concise, but it is also less efficient. The @samp{if} is
15016 tested for each record and for each output file. By duplicating the loop
15017 body, the @samp{if} is only tested once for each input record. If there are
15018 @var{N} input records and @var{M} input files, the first method only
15019 executes @var{N} @samp{if} statements, while the second would execute
15020 @var{N}@code{*}@var{M} @samp{if} statements.
15022 Finally, the @code{END} rule cleans up, by closing all the output files.
15026 @c file eg/prog/tee.awk
15036 @node Uniq Program, Wc Program, Tee Program, Clones
15037 @subsection Printing Non-duplicated Lines of Text
15039 @cindex @code{uniq} utility
15040 The @code{uniq} utility reads sorted lines of data on its standard input,
15041 and (by default) removes duplicate lines. In other words, only unique lines
15042 are printed, hence the name. @code{uniq} has a number of options. The usage is:
15045 uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
15048 The option meanings are:
15052 Only print repeated lines.
15055 Only print non-repeated lines.
15058 Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated
15059 and non-repeated lines are counted.
15062 Skip @var{n} fields before comparing lines. The definition of fields
15063 is similar to @code{awk}'s default: non-whitespace characters separated
15064 by runs of spaces and/or tabs.
15067 Skip @var{n} characters before comparing lines. Any fields specified with
15068 @samp{-@var{n}} are skipped first.
15070 @item @var{input file}
15071 Data is read from the input file named on the command line, instead of from
15072 the standard input.
15074 @item @var{output file}
15075 The generated output is sent to the named output file, instead of to the
15079 Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
15082 Here is an @code{awk} implementation of @code{uniq}. It uses the
15083 @code{getopt} library function
15084 (@pxref{Getopt Function, ,Processing Command Line Options}),
15085 and the @code{join} library function
15086 (@pxref{Join Function, ,Merging an Array Into a String}).
15088 The program begins with a @code{usage} function and then a brief outline of
15089 the options and their meanings in a comment.
15091 The @code{BEGIN} rule deals with the command line arguments and options. It
15092 uses a trick to get @code{getopt} to handle options of the form @samp{-25},
15093 treating such an option as the option letter @samp{2} with an argument of
15094 @samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
15095 like a number), @code{Optarg} is
15096 concatenated with the option digit, and then result is added to zero to make
15097 it into a number. If there is only one digit in the option, then
15098 @code{Optarg} is not needed, and @code{Optind} must be decremented so that
15099 @code{getopt} will process it next time. This code is admittedly a bit
15102 If no options were supplied, then the default is taken, to print both
15103 repeated and non-repeated lines. The output file, if provided, is assigned
15104 to @code{outputfile}. Earlier, @code{outputfile} was initialized to the
15105 standard output, @file{/dev/stdout}.
15109 @c file eg/prog/uniq.awk
15110 # uniq.awk --- do uniq in awk
15111 # Arnold Robbins, arnold@@gnu.org, Public Domain
15117 e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
15118 print e > "/dev/stderr"
15123 # -c count lines. overrides -d and -u
15124 # -d only repeated lines
15125 # -u only non-repeated lines
15127 # +n skip n characters, skip fields first
15132 outputfile = "/dev/stdout"
15133 opts = "udc0:1:2:3:4:5:6:7:8:9:"
15134 while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
15136 non_repeated_only++
15141 else if (index("0123456789", c) != 0) @{
15142 # getopt requires args to options
15143 # this messes us up for things like -5
15144 if (Optarg ~ /^[0-9]+$/)
15145 fcount = (c Optarg) + 0
15156 if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
15157 charcount = substr(ARGV[Optind], 2) + 0
15161 for (i = 1; i < Optind; i++)
15164 if (repeated_only == 0 && non_repeated_only == 0)
15165 repeated_only = non_repeated_only = 1
15167 if (ARGC - Optind == 2) @{
15168 outputfile = ARGV[ARGC - 1]
15169 ARGV[ARGC - 1] = ""
15175 The following function, @code{are_equal}, compares the current line,
15177 previous line, @code{last}. It handles skipping fields and characters.
15179 If no field count and no character count were specified, @code{are_equal}
15180 simply returns one or zero depending upon the result of a simple string
15181 comparison of @code{last} and @code{$0}. Otherwise, things get more
15184 If fields have to be skipped, each line is broken into an array using
15186 (@pxref{String Functions, ,Built-in Functions for String Manipulation}),
15187 and then the desired fields are joined back into a line using @code{join}.
15188 The joined lines are stored in @code{clast} and @code{cline}.
15189 If no fields are skipped, @code{clast} and @code{cline} are set to
15190 @code{last} and @code{$0} respectively.
15192 Finally, if characters are skipped, @code{substr} is used to strip off the
15193 leading @code{charcount} characters in @code{clast} and @code{cline}. The
15194 two strings are then compared, and @code{are_equal} returns the result.
15198 @c file eg/prog/uniq.awk
15199 function are_equal( n, m, clast, cline, alast, aline)
15201 if (fcount == 0 && charcount == 0)
15202 return (last == $0)
15205 n = split(last, alast)
15206 m = split($0, aline)
15207 clast = join(alast, fcount+1, n)
15208 cline = join(aline, fcount+1, m)
15214 clast = substr(clast, charcount + 1)
15215 cline = substr(cline, charcount + 1)
15218 return (clast == cline)
15224 The following two rules are the body of the program. The first one is
15225 executed only for the very first line of data. It sets @code{last} equal to
15226 @code{$0}, so that subsequent lines of text have something to be compared to.
15228 The second rule does the work. The variable @code{equal} will be one or zero
15229 depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
15230 is counting repeated lines, then the @code{count} variable is incremented if
15231 the lines are equal. Otherwise the line is printed and @code{count} is
15232 reset, since the two lines are not equal.
15234 If @code{uniq} is not counting, @code{count} is incremented if the lines are
15235 equal. Otherwise, if @code{uniq} is counting repeated lines, and more than
15236 one line has been seen, or if @code{uniq} is counting non-repeated lines,
15237 and only one line has been seen, then the line is printed, and @code{count}
15240 Finally, similar logic is used in the @code{END} rule to print the final
15241 line of input data.
15245 @c file eg/prog/uniq.awk
15254 equal = are_equal()
15256 if (do_count) @{ # overrides -d and -u
15260 printf("%4d %s\n", count, last) > outputfile
15270 if ((repeated_only && count > 1) ||
15271 (non_repeated_only && count == 1))
15272 print last > outputfile
15281 printf("%4d %s\n", count, last) > outputfile
15282 else if ((repeated_only && count > 1) ||
15283 (non_repeated_only && count == 1))
15284 print last > outputfile
15291 @node Wc Program, , Uniq Program, Clones
15292 @subsection Counting Things
15294 @cindex @code{wc} utility
15295 The @code{wc} (word count) utility counts lines, words, and characters in
15296 one or more input files. Its usage is:
15299 wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
15302 If no files are specified on the command line, @code{wc} reads its standard
15303 input. If there are multiple files, it will also print total counts for all
15304 the files. The options and their meanings are:
15312 A ``word'' is a contiguous sequence of non-whitespace characters, separated
15313 by spaces and/or tabs. Happily, this is the normal way @code{awk} separates
15314 fields in its input data.
15317 Only count characters.
15320 Implementing @code{wc} in @code{awk} is particularly elegant, since
15321 @code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
15322 fields) and counts them, it counts lines (i.e.@: records) for us, and it can
15323 easily tell us how long a line is.
15325 This version uses the @code{getopt} library function
15326 (@pxref{Getopt Function, ,Processing Command Line Options}),
15327 and the file transition functions
15328 (@pxref{Filetrans Function, ,Noting Data File Boundaries}).
15330 This version has one major difference from traditional versions of @code{wc}.
15331 Our version always prints the counts in the order lines, words,
15332 and characters. Traditional versions note the order of the @samp{-l},
15333 @samp{-w}, and @samp{-c} options on the command line, and print the counts
15336 The @code{BEGIN} rule does the argument processing.
15337 The variable @code{print_total} will
15338 be true if more than one file was named on the command line.
15343 @c file eg/prog/wc.awk
15344 # wc.awk --- count lines, words, characters
15345 # Arnold Robbins, arnold@@gnu.org, Public Domain
15349 # -l only count lines
15350 # -w only count words
15351 # -c only count characters
15353 # Default is to count lines, words, characters
15356 # let getopt print a message about
15357 # invalid options. we ignore them
15358 while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
15366 for (i = 1; i < Optind; i++)
15369 # if no options, do all
15370 if (! do_lines && ! do_words && ! do_chars)
15371 do_lines = do_words = do_chars = 1
15373 print_total = (ARGC - i > 2)
15379 The @code{beginfile} function is simple; it just resets the counts of lines,
15380 words, and characters to zero, and saves the current file name in
15383 The @code{endfile} function adds the current file's numbers to the running
15384 totals of lines, words, and characters. It then prints out those numbers
15385 for the file that was just read. It relies on @code{beginfile} to reset the
15386 numbers for the following data file.
15389 @c left brace on line with `function' because of page breaking
15390 @c file eg/prog/wc.awk
15392 function beginfile(file) @{
15393 chars = lines = words = 0
15398 function endfile(file)
15404 printf "\t%d", lines
15406 printf "\t%d", words
15408 printf "\t%d", chars
15409 printf "\t%s\n", fname
15414 There is one rule that is executed for each line. It adds the length of the
15415 record to @code{chars}. It has to add one, since the newline character
15416 separating records (the value of @code{RS}) is not part of the record
15417 itself. @code{lines} is incremented for each line read, and @code{words} is
15418 incremented by the value of @code{NF}, the number of ``words'' on this
15419 line.@footnote{Examine the code in
15420 @ref{Filetrans Function, ,Noting Data File Boundaries}.
15421 Why must @code{wc} use a separate @code{lines} variable, instead of using
15422 the value of @code{FNR} in @code{endfile}?}
15424 Finally, the @code{END} rule simply prints the totals for all the files.
15428 @c file eg/prog/wc.awk
15431 chars += length($0) + 1 # get newline
15437 if (print_total) @{
15439 printf "\t%d", tlines
15441 printf "\t%d", twords
15443 printf "\t%d", tchars
15451 @node Miscellaneous Programs, , Clones, Sample Programs
15452 @section A Grab Bag of @code{awk} Programs
15454 This section is a large ``grab bag'' of miscellaneous programs.
15455 We hope you find them both interesting and enjoyable.
15458 * Dupword Program:: Finding duplicated words in a document.
15459 * Alarm Program:: An alarm clock.
15460 * Translate Program:: A program similar to the @code{tr} utility.
15461 * Labels Program:: Printing mailing labels.
15462 * Word Sorting:: A program to produce a word usage count.
15463 * History Sorting:: Eliminating duplicate entries from a history
15465 * Extract Program:: Pulling out programs from Texinfo source
15467 * Simple Sed:: A Simple Stream Editor.
15468 * Igawk Program:: A wrapper for @code{awk} that includes files.
15471 @node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
15472 @subsection Finding Duplicated Words in a Document
15474 A common error when writing large amounts of prose is to accidentally
15475 duplicate words. Often you will see this in text as something like ``the
15476 the program does the following @dots{}.'' When the text is on-line, often
15477 the duplicated words occur at the end of one line and the beginning of
15478 another, making them very difficult to spot.
15481 This program, @file{dupword.awk}, scans through a file one line at a time,
15482 and looks for adjacent occurrences of the same word. It also saves the last
15483 word on a line (in the variable @code{prev}) for comparison with the first
15484 word on the next line.
15486 The first two statements make sure that the line is all lower-case, so that,
15488 ``The'' and ``the'' compare equal to each other. The second statement
15489 removes all non-alphanumeric and non-whitespace characters from the line, so
15490 that punctuation does not affect the comparison either. This sometimes
15491 leads to reports of duplicated words that really are different, but this is
15494 @c FIXME: add check for $i != ""
15495 @findex dupword.awk
15498 @c file eg/prog/dupword.awk
15499 # dupword --- find duplicate words in text
15500 # Arnold Robbins, arnold@@gnu.org, Public Domain
15505 gsub(/[^A-Za-z0-9 \t]/, "");
15507 printf("%s:%d: duplicate %s\n",
15509 for (i = 2; i <= NF; i++)
15511 printf("%s:%d: duplicate %s\n",
15519 @node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
15520 @subsection An Alarm Clock Program
15522 The following program is a simple ``alarm clock'' program.
15523 You give it a time of day, and an optional message. At the given time,
15524 it prints the message on the standard output. In addition, you can give it
15525 the number of times to repeat the message, and also a delay between
15528 This program uses the @code{gettimeofday} function from
15529 @ref{Gettimeofday Function, ,Managing the Time of Day}.
15531 All the work is done in the @code{BEGIN} rule. The first part is argument
15532 checking and setting of defaults; the delay, the count, and the message to
15533 print. If the user supplied a message, but it does not contain the ASCII BEL
15534 character (known as the ``alert'' character, @samp{\a}), then it is added to
15535 the message. (On many systems, printing the ASCII BEL generates some sort
15536 of audible alert. Thus, when the alarm goes off, the system calls attention
15537 to itself, in case the user is not looking at their computer or terminal.)
15542 @c file eg/prog/alarm.awk
15543 # alarm --- set an alarm
15544 # Arnold Robbins, arnold@@gnu.org, Public Domain
15547 # usage: alarm time [ "message" [ count [ delay ] ] ]
15551 # Initial argument sanity checking
15552 usage1 = "usage: alarm time ['message' [count [delay]]]"
15553 usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])
15556 print usage > "/dev/stderr"
15558 @} else if (ARGC == 5) @{
15559 delay = ARGV[4] + 0
15560 count = ARGV[3] + 0
15562 @} else if (ARGC == 4) @{
15563 count = ARGV[3] + 0
15565 @} else if (ARGC == 3) @{
15567 @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
15568 print usage1 > "/dev/stderr"
15569 print usage2 > "/dev/stderr"
15573 # set defaults for once we reach the desired time
15575 delay = 180 # 3 minutes
15580 message = sprintf("\aIt is now %s!\a", ARGV[1])
15581 else if (index(message, "\a") == 0)
15582 message = "\a" message "\a"
15587 The next section of code turns the alarm time into hours and minutes,
15588 and converts it if necessary to a 24-hour clock. Then it turns that
15589 time into a count of the seconds since midnight. Next it turns the current
15590 time into a count of seconds since midnight. The difference between the two
15591 is how long to wait before setting off the alarm.
15595 @c file eg/prog/alarm.awk
15596 # split up dest time
15597 split(ARGV[1], atime, ":")
15598 hour = atime[1] + 0 # force numeric
15599 minute = atime[2] + 0 # force numeric
15601 # get current broken down time
15604 # if time given is 12-hour hours and it's after that
15605 # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
15606 # then add 12 to real hour
15607 if (hour < 12 && now["hour"] > hour)
15610 # set target time in seconds since midnight
15611 target = (hour * 60 * 60) + (minute * 60)
15613 # get current time in seconds since midnight
15614 current = (now["hour"] * 60 * 60) + \
15615 (now["minute"] * 60) + now["second"]
15617 # how long to sleep for
15618 naptime = target - current
15619 if (naptime <= 0) @{
15620 print "time is in the past!" > "/dev/stderr"
15627 Finally, the program uses the @code{system} function
15628 (@pxref{I/O Functions, ,Built-in Functions for Input/Output})
15629 to call the @code{sleep} utility. The @code{sleep} utility simply pauses
15630 for the given number of seconds. If the exit status is not zero,
15631 the program assumes that @code{sleep} was interrupted, and exits. If
15632 @code{sleep} exited with an OK status (zero), then the program prints the
15633 message in a loop, again using @code{sleep} to delay for however many
15634 seconds are necessary.
15637 @c file eg/prog/alarm.awk
15639 # zzzzzz..... go away if interrupted
15640 if (system(sprintf("sleep %d", naptime)) != 0)
15645 command = sprintf("sleep %d", delay)
15646 for (i = 1; i <= count; i++) @{
15648 # if sleep command interrupted, go away
15649 if (system(command) != 0)
15658 @node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
15659 @subsection Transliterating Characters
15661 The system @code{tr} utility transliterates characters. For example, it is
15662 often used to map upper-case letters into lower-case, for further
15666 @var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
15669 You give @code{tr} two lists of characters enclosed in square brackets.
15670 Usually, the lists are quoted to keep the shell from attempting to do a
15671 filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
15672 does not require that the lists be enclosed in square brackets and quoted.
15673 This is a feature.} When processing the input, the
15674 first character in the first list is replaced with the first character in the
15675 second list, the second character in the first list is replaced with the
15676 second character in the second list, and so on.
15677 If there are more characters in the ``from'' list than in the ``to'' list,
15678 the last character of the ``to'' list is used for the remaining characters
15679 in the ``from'' list.
15682 @c early or mid-1989!
15683 a user proposed to us that we add a transliteration function to @code{gawk}.
15684 Being opposed to ``creeping featurism,'' I wrote the following program to
15685 prove that character transliteration could be done with a user-level
15686 function. This program is not as complete as the system @code{tr} utility,
15687 but it will do most of the job.
15689 The @code{translate} program demonstrates one of the few weaknesses of
15691 @code{awk}: dealing with individual characters is very painful, requiring
15692 repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
15694 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
15695 program was written before @code{gawk} acquired the ability to
15696 split each character in a string into separate array elements.
15697 How might you use this new feature to simplify the program?}
15699 There are two functions. The first, @code{stranslate}, takes three
15704 A list of characters to translate from.
15707 A list of characters to translate to.
15710 The string to do the translation on.
15713 Associative arrays make the translation part fairly easy. @code{t_ar} holds
15714 the ``to'' characters, indexed by the ``from'' characters. Then a simple
15715 loop goes through @code{from}, one character at a time. For each character
15716 in @code{from}, if the character appears in @code{target}, @code{gsub}
15717 is used to change it to the corresponding @code{to} character.
15719 The @code{translate} function simply calls @code{stranslate} using @code{$0}
15720 as the target. The main program sets two global variables, @code{FROM} and
15721 @code{TO}, from the command line, and then changes @code{ARGV} so that
15722 @code{awk} will read from the standard input.
15724 Finally, the processing rule simply calls @code{translate} for each record.
15726 @findex translate.awk
15729 @c file eg/prog/translate.awk
15730 # translate --- do tr like stuff
15731 # Arnold Robbins, arnold@@gnu.org, Public Domain
15734 # bugs: does not handle things like: tr A-Z a-z, it has
15735 # to be spelled out. However, if `to' is shorter than `from',
15736 # the last character in `to' is used for the rest of `from'.
15738 function stranslate(from, to, target, lf, lt, t_ar, i, c)
15742 for (i = 1; i <= lt; i++)
15743 t_ar[substr(from, i, 1)] = substr(to, i, 1)
15745 for (; i <= lf; i++)
15746 t_ar[substr(from, i, 1)] = substr(to, lt, 1)
15747 for (i = 1; i <= lf; i++) @{
15748 c = substr(from, i, 1)
15749 if (index(target, c) > 0)
15750 gsub(c, t_ar[c], target)
15755 function translate(from, to)
15757 return $0 = stranslate(from, to, $0)
15764 print "usage: translate from to" > "/dev/stderr"
15775 translate(FROM, TO)
15782 While it is possible to do character transliteration in a user-level
15783 function, it is not necessarily efficient, and we started to consider adding
15784 a built-in function. However, shortly after writing this program, we learned
15785 that the System V Release 4 @code{awk} had added the @code{toupper} and
15786 @code{tolower} functions. These functions handle the vast majority of the
15787 cases where character transliteration is necessary, and so we chose to
15788 simply add those functions to @code{gawk} as well, and then leave well
15791 An obvious improvement to this program would be to set up the
15792 @code{t_ar} array only once, in a @code{BEGIN} rule. However, this
15793 assumes that the ``from'' and ``to'' lists
15794 will never change throughout the lifetime of the program.
15796 @node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
15797 @subsection Printing Mailing Labels
15799 Here is a ``real world''@footnote{``Real world'' is defined as
15800 ``a program actually used to get something done.''}
15801 program. This script reads lists of names and
15802 addresses, and generates mailing labels. Each page of labels has 20 labels
15803 on it, two across and ten down. The addresses are guaranteed to be no more
15804 than five lines of data. Each address is separated from the next by a blank
15807 The basic idea is to read 20 labels worth of data. Each line of each label
15808 is stored in the @code{line} array. The single rule takes care of filling
15809 the @code{line} array and printing the page when 20 labels have been read.
15811 The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
15812 @code{awk} will split records at blank lines
15813 (@pxref{Records, ,How Input is Split into Records}).
15814 It sets @code{MAXLINES} to 100, since @code{MAXLINE} is the maximum number
15815 of lines on the page (20 * 5 = 100).
15817 Most of the work is done in the @code{printpage} function.
15818 The label lines are stored sequentially in the @code{line} array. But they
15819 have to be printed horizontally; @code{line[1]} next to @code{line[6]},
15820 @code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
15821 accomplish this. The outer loop, controlled by @code{i}, steps through
15822 every 10 lines of data; this is each row of labels. The inner loop,
15823 controlled by @code{j}, goes through the lines within the row.
15824 As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
15825 the row, and @samp{i+j+5} is the entry next to it. The output ends up
15826 looking something like this:
15836 As a final note, at lines 21 and 61, an extra blank line is printed, to keep
15837 the output lined up on the labels. This is dependent on the particular
15838 brand of labels in use when the program was written. You will also note
15839 that there are two blank lines at the top and two blank lines at the bottom.
15841 The @code{END} rule arranges to flush the final page of labels; there may
15842 not have been an even multiple of 20 labels in the data.
15847 @c file eg/prog/labels.awk
15849 # Arnold Robbins, arnold@@gnu.org, Public Domain
15852 # Program to print labels. Each label is 5 lines of data
15853 # that may have blank lines. The label sheets have 2
15854 # blank lines at the top and 2 at the bottom.
15856 BEGIN @{ RS = "" ; MAXLINES = 100 @}
15858 function printpage( i, j)
15863 printf "\n\n" # header
15865 for (i = 1; i <= Nlines; i += 10) @{
15866 if (i == 21 || i == 61)
15868 for (j = 0; j < 5; j++) @{
15869 if (i + j > MAXLINES)
15871 printf " %-41s %s\n", line[i+j], line[i+j+5]
15876 printf "\n\n" # footer
15884 if (Count >= 20) @{
15889 n = split($0, a, "\n")
15890 for (i = 1; i <= n; i++)
15891 line[++Nlines] = a[i]
15892 for (; i <= 5; i++)
15893 line[++Nlines] = ""
15905 @node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
15906 @subsection Generating Word Usage Counts
15908 The following @code{awk} program prints
15909 the number of occurrences of each word in its input. It illustrates the
15910 associative nature of @code{awk} arrays by using strings as subscripts. It
15911 also demonstrates the @samp{for @var{x} in @var{array}} construction.
15912 Finally, it shows how @code{awk} can be used in conjunction with other
15913 utility programs to do a useful task of some complexity with a minimum of
15914 effort. Some explanations follow the program listing.
15918 # Print list of word frequencies
15920 for (i = 1; i <= NF; i++)
15927 printf "%s\t%d\n", word, freq[word]
15932 The first thing to notice about this program is that it has two rules. The
15933 first rule, because it has an empty pattern, is executed on every line of
15934 the input. It uses @code{awk}'s field-accessing mechanism
15935 (@pxref{Fields, ,Examining Fields}) to pick out the individual words from
15936 the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
15937 to know how many fields are available.
15939 For each input word, an element of the array @code{freq} is incremented to
15940 reflect that the word has been seen an additional time.
15942 The second rule, because it has the pattern @code{END}, is not executed
15943 until the input has been exhausted. It prints out the contents of the
15944 @code{freq} table that has been built up inside the first action.
15946 This program has several problems that would prevent it from being
15947 useful by itself on real text files:
15951 Words are detected using the @code{awk} convention that fields are
15952 separated by whitespace and that other characters in the input (except
15953 newlines) don't have any special meaning to @code{awk}. This means that
15954 punctuation characters count as part of words.
15957 The @code{awk} language considers upper- and lower-case characters to be
15958 distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated
15959 as the same word. This is undesirable since, in normal text, words
15960 are capitalized if they begin sentences, and a frequency analyzer should not
15961 be sensitive to capitalization.
15964 The output does not come out in any useful order. You're more likely to be
15965 interested in which words occur most frequently, or having an alphabetized
15966 table of how frequently each word occurs.
15969 The way to solve these problems is to use some of the more advanced
15970 features of the @code{awk} language. First, we use @code{tolower} to remove
15971 case distinctions. Next, we use @code{gsub} to remove punctuation
15972 characters. Finally, we use the system @code{sort} utility to process the
15973 output of the @code{awk} script. Here is the new version of
15976 @findex wordfreq.sh
15978 @c file eg/prog/wordfreq.awk
15979 # Print list of word frequencies
15981 $0 = tolower($0) # remove case distinctions
15982 gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation
15983 for (i = 1; i <= NF; i++)
15991 printf "%s\t%d\n", word, freq[word]
15996 Assuming we have saved this program in a file named @file{wordfreq.awk},
15997 and that the data is in @file{file1}, the following pipeline
16000 awk -f wordfreq.awk file1 | sort +1 -nr
16004 produces a table of the words appearing in @file{file1} in order of
16005 decreasing frequency.
16007 The @code{awk} program suitably massages the data and produces a word
16008 frequency table, which is not ordered.
16010 The @code{awk} script's output is then sorted by the @code{sort} utility and
16011 printed on the terminal. The options given to @code{sort} in this example
16012 specify to sort using the second field of each input line (skipping one field),
16013 that the sort keys should be treated as numeric quantities (otherwise
16014 @samp{15} would come before @samp{5}), and that the sorting should be done
16015 in descending (reverse) order.
16017 We could have even done the @code{sort} from within the program, by
16018 changing the @code{END} action to:
16021 @c file eg/prog/wordfreq.awk
16023 sort = "sort +1 -nr"
16025 printf "%s\t%d\n", word, freq[word] | sort
16031 You would have to use this way of sorting on systems that do not
16034 See the general operating system documentation for more information on how
16035 to use the @code{sort} program.
16037 @node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
16038 @subsection Removing Duplicates from Unsorted Text
16040 The @code{uniq} program
16041 (@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text}),
16042 removes duplicate lines from @emph{sorted} data.
16044 Suppose, however, you need to remove duplicate lines from a data file, but
16045 that you wish to preserve the order the lines are in? A good example of
16046 this might be a shell history file. The history file keeps a copy of all
16047 the commands you have entered, and it is not unusual to repeat a command
16048 several times in a row. Occasionally you might wish to compact the history
16049 by removing duplicate entries. Yet it is desirable to maintain the order
16050 of the original commands.
16052 This simple program does the job. It uses two arrays. The @code{data}
16053 array is indexed by the text of each line.
16054 For each line, @code{data[$0]} is incremented.
16056 If a particular line has not
16057 been seen before, then @code{data[$0]} will be zero.
16058 In that case, the text of the line is stored in @code{lines[count]}.
16059 Each element of @code{lines} is a unique command, and the indices of
16060 @code{lines} indicate the order in which those lines were encountered.
16061 The @code{END} rule simply prints out the lines, in order.
16063 @cindex Rakitzis, Byron
16064 @findex histsort.awk
16067 @c file eg/prog/histsort.awk
16068 # histsort.awk --- compact a shell history file
16069 # Arnold Robbins, arnold@@gnu.org, Public Domain
16072 # Thanks to Byron Rakitzis for the general idea
16074 if (data[$0]++ == 0)
16075 lines[++count] = $0
16079 for (i = 1; i <= count; i++)
16086 This program also provides a foundation for generating other useful
16087 information. For example, using the following @code{print} satement in the
16088 @code{END} rule would indicate how often a particular command was used.
16091 print data[lines[i]], lines[i]
16094 This works because @code{data[$0]} was incremented each time a line was
16097 @node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
16098 @subsection Extracting Programs from Texinfo Source Files
16101 Both this chapter and the previous chapter
16102 (@ref{Library Functions, ,A Library of @code{awk} Functions}),
16103 present a large number of @code{awk} programs.
16107 @ref{Library Functions, ,A Library of @code{awk} Functions},
16108 and @ref{Sample Programs, ,Practical @code{awk} Programs},
16109 are the top level nodes for a large number of @code{awk} programs.
16111 If you wish to experiment with these programs, it is tedious to have to type
16112 them in by hand. Here we present a program that can extract parts of a
16113 Texinfo input file into separate files.
16115 This @value{DOCUMENT} is written in Texinfo, the GNU project's document
16116 formatting language. A single Texinfo source file can be used to produce both
16117 printed and on-line documentation.
16119 Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
16120 available from the Free Software Foundation.
16123 The Texinfo language is described fully, starting with
16124 @ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
16127 For our purposes, it is enough to know three things about Texinfo input
16132 The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
16133 or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source
16134 files as @samp{@@@@}.
16137 Comments start with either @samp{@@c} or @samp{@@comment}.
16138 The file extraction program will work by using special comments that start
16139 at the beginning of a line.
16142 Example text that should not be split across a page boundary is bracketed
16143 between lines containing @samp{@@group} and @samp{@@end group} commands.
16146 The following program, @file{extract.awk}, reads through a Texinfo source
16147 file, and does two things, based on the special comments.
16148 Upon seeing @samp{@w{@@c system @dots{}}},
16149 it runs a command, by extracting the command text from the
16150 control line and passing it on to the @code{system} function
16151 (@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16152 Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
16153 the file @var{filename}, until @samp{@@c endfile} is encountered.
16154 The rules in @file{extract.awk} will match either @samp{@@c} or
16155 @samp{@@comment} by letting the @samp{omment} part be optional.
16156 Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
16157 @file{extract.awk} uses the @code{join} library function
16158 (@pxref{Join Function, ,Merging an Array Into a String}).
16160 The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
16161 (@file{gawk.texi}) have all been bracketed inside @samp{file},
16162 and @samp{endfile} lines. The @code{gawk} distribution uses a copy of
16163 @file{extract.awk} to extract the sample
16164 programs and install many of them in a standard directory, where
16165 @code{gawk} can find them.
16166 The Texinfo file looks something like this:
16170 This program has a @@code@{BEGIN@} block,
16171 which prints a nice message:
16174 @@c file examples/messages.awk
16175 BEGIN @@@{ print "Don't panic!" @@@}
16179 It also prints some final advice:
16182 @@c file examples/messages.awk
16183 END @@@{ print "Always avoid bored archeologists!" @@@}
16189 @file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
16190 mixed upper-case and lower-case letters in the directives won't matter.
16192 The first rule handles calling @code{system}, checking that a command was
16193 given (@code{NF} is at least three), and also checking that the command
16194 exited with a zero exit status, signifying OK.
16196 @findex extract.awk
16199 @c file eg/prog/extract.awk
16200 # extract.awk --- extract files and run programs
16201 # from texinfo files
16202 # Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
16204 BEGIN @{ IGNORECASE = 1 @}
16207 /^@@c(omment)?[ \t]+system/ \
16210 e = (FILENAME ":" FNR)
16211 e = (e ": badly formed `system' line")
16212 print e > "/dev/stderr"
16219 e = (FILENAME ":" FNR)
16220 e = (e ": warning: system returned " stat)
16221 print e > "/dev/stderr"
16229 The variable @code{e} is used so that the function
16238 The second rule handles moving data into files. It verifies that a file
16239 name was given in the directive. If the file named is not the current file,
16240 then the current file is closed. This means that an @samp{@@c endfile} was
16241 not given for that file. (We should probably print a diagnostic in this
16242 case, although at the moment we do not.)
16244 The @samp{for} loop does the work. It reads lines using @code{getline}
16245 (@pxref{Getline, ,Explicit Input with @code{getline}}).
16246 For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
16247 function. If the line is an ``endfile'' line, then it breaks out of
16249 If the line is an @samp{@@group} or @samp{@@end group} line, then it
16250 ignores it, and goes on to the next line.
16251 (These Texinfo control lines keep blocks of code together on one page;
16252 unfortunately, @TeX{} isn't always smart enough to do things exactly right,
16253 and we have to give it some advice.)
16255 Most of the work is in the following few lines. If the line has no @samp{@@}
16256 symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be
16259 To remove the @samp{@@} symbols, the line is split into separate elements of
16260 the array @code{a}, using the @code{split} function
16261 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16262 Each element of @code{a} that is empty indicates two successive @samp{@@}
16263 symbols in the original line. For each two empty elements (@samp{@@@@} in
16264 the original file), we have to add back in a single @samp{@@} symbol.
16266 When the processing of the array is finished, @code{join} is called with the
16267 value of @code{SUBSEP}, to rejoin the pieces back into a single
16268 line. That line is then printed to the output file.
16272 @c file eg/prog/extract.awk
16274 /^@@c(omment)?[ \t]+file/ \
16277 e = (FILENAME ":" FNR ": badly formed `file' line")
16278 print e > "/dev/stderr"
16282 if ($3 != curfile) @{
16289 if ((getline line) <= 0)
16291 if (line ~ /^@@c(omment)?[ \t]+endfile/)
16293 else if (line ~ /^@@(end[ \t]+)?group/)
16295 if (index(line, "@@") == 0) @{
16296 print line > curfile
16299 n = split(line, a, "@@")
16301 # if a[1] == "", means leading @@,
16302 # don't add one back in.
16304 for (i = 2; i <= n; i++) @{
16305 if (a[i] == "") @{ # was an @@@@
16311 print join(a, 1, n, SUBSEP) > curfile
16318 An important thing to note is the use of the @samp{>} redirection.
16319 Output done with @samp{>} only opens the file once; it stays open and
16320 subsequent output is appended to the file
16321 (@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
16322 This allows us to easily mix program text and explanatory prose for the same
16323 sample source file (as has been done here!) without any hassle. The file is
16324 only closed when a new data file name is encountered, or at the end of the
16327 Finally, the function @code{@w{unexpected_eof}} prints an appropriate
16328 error message and then exits.
16330 The @code{END} rule handles the final cleanup, closing the open file.
16333 @c file eg/prog/extract.awk
16335 function unexpected_eof()
16337 printf("%s:%d: unexpected EOF or error\n", \
16338 FILENAME, FNR) > "/dev/stderr"
16350 @node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
16351 @subsection A Simple Stream Editor
16353 @cindex @code{sed} utility
16354 The @code{sed} utility is a ``stream editor,'' a program that reads a
16355 stream of data, makes changes to it, and passes the modified data on.
16356 It is often used to make global changes to a large file, or to a stream
16357 of data generated by a pipeline of commands.
16359 While @code{sed} is a complicated program in its own right, its most common
16360 use is to perform global substitutions in the middle of a pipeline:
16363 command1 < orig.data | sed 's/old/new/g' | command2 > result
16366 Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
16367 @samp{old} on each input line, and replace it with the text @samp{new},
16368 globally (i.e.@: all the occurrences on a line). This is similar to
16369 @code{awk}'s @code{gsub} function
16370 (@pxref{String Functions, , Built-in Functions for String Manipulation}).
16372 The following program, @file{awksed.awk}, accepts at least two command line
16373 arguments; the pattern to look for and the text to replace it with. Any
16374 additional arguments are treated as data file names to process. If none
16375 are provided, the standard input is used.
16377 @cindex Brennan, Michael
16378 @cindex @code{awksed}
16379 @cindex simple stream editor
16380 @cindex stream editor, simple
16383 @c file eg/prog/awksed.awk
16384 # awksed.awk --- do s/foo/bar/g using just print
16385 # Thanks to Michael Brennan for the idea
16387 # Arnold Robbins, arnold@@gnu.org, Public Domain
16392 print "usage: awksed pat repl [files...]" > "/dev/stderr"
16398 # validate arguments
16406 # don't use arguments as files
16407 ARGV[1] = ARGV[2] = ""
16410 # look ma, no hands!
16421 The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
16422 and on the setting of @code{RT} to the actual text that terminated the
16423 record (@pxref{Records, ,How Input is Split into Records}).
16425 The idea is to have @code{RS} be the pattern to look for. @code{gawk}
16426 will automatically set @code{$0} to the text between matches of the pattern.
16427 This is text that we wish to keep, unmodified. Then, by setting @code{ORS}
16428 to the replacement text, a simple @code{print} statement will output the
16429 text we wish to keep, followed by the replacement text.
16431 There is one wrinkle to this scheme, which is what to do if the last record
16432 doesn't end with text that matches @code{RS}? Using a @code{print}
16433 statement unconditionally prints the replacement text, which is not correct.
16435 However, if the file did not end in text that matches @code{RS}, @code{RT}
16436 will be set to the null string. In this case, we can print @code{$0} using
16438 (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
16440 The @code{BEGIN} rule handles the setup, checking for the right number
16441 of arguments, and calling @code{usage} if there is a problem. Then it sets
16442 @code{RS} and @code{ORS} from the command line arguments, and sets
16443 @code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
16444 not be treated as file names
16445 (@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).
16447 The @code{usage} function prints an error message and exits.
16449 Finally, the single rule handles the printing scheme outlined above,
16450 using @code{print} or @code{printf} as appropriate, depending upon the
16451 value of @code{RT}.
16454 Exercise, compare the performance of this version with the more
16460 ARGV[1] = ARGV[2] = ""
16463 { gsub(pat, repl); print }
16465 Exercise: what are the advantages and disadvantages of this version vs. sed?
16466 Advantage: egrep regexps
16468 Disadvantage: no & in replacement text
16473 @node Igawk Program, , Simple Sed, Miscellaneous Programs
16474 @subsection An Easy Way to Use Library Functions
16476 Using library functions in @code{awk} can be very beneficial. It
16477 encourages code re-use and the writing of general functions. Programs are
16478 smaller, and therefore clearer.
16479 However, using library functions is only easy when writing @code{awk}
16480 programs; it is painful when running them, requiring multiple @samp{-f}
16481 options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
16482 environment variable and the ability to put @code{awk} functions into a
16483 library directory (@pxref{Options, ,Command Line Options}).
16485 It would be nice to be able to write programs like so:
16488 # library functions
16489 @@include getopt.awk
16495 while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
16501 The following program, @file{igawk.sh}, provides this service.
16502 It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
16503 and also allows @dfn{nested} includes; i.e.@: a file that has been included
16504 with @samp{@@include} can contain further @samp{@@include} statements.
16505 @code{igawk} will make an effort to only include files once, so that nested
16506 includes don't accidentally include a library function twice.
16508 @code{igawk} should behave externally just like @code{gawk}. This means it
16509 should accept all of @code{gawk}'s command line arguments, including the
16510 ability to have multiple source files specified via @samp{-f}, and the
16511 ability to mix command line and library source files.
16513 The program is written using the POSIX Shell (@code{sh}) command language.
16514 The way the program works is as follows:
16518 Loop through the arguments, saving anything that doesn't represent
16519 @code{awk} source code for later, when the expanded program is run.
16522 For any arguments that do represent @code{awk} text, put the arguments into
16523 a temporary file that will be expanded. There are two cases.
16527 Literal text, provided with @samp{--source} or @samp{--source=}. This
16528 text is just echoed directly. The @code{echo} program will automatically
16529 supply a trailing newline.
16532 File names provided with @samp{-f}. We use a neat trick, and echo
16533 @samp{@@include @var{filename}} into the temporary file. Since the file
16534 inclusion program will work the way @code{gawk} does, this will get the text
16535 of the file included into the program at the correct point.
16539 Run an @code{awk} program (naturally) over the temporary file to expand
16540 @samp{@@include} statements. The expanded program is placed in a second
16544 Run the expanded program with @code{gawk} and any other original command line
16545 arguments that the user supplied (such as the data file names).
16548 The initial part of the program turns on shell tracing if the first
16549 argument was @samp{debug}. Otherwise, a shell @code{trap} statement
16550 arranges to clean up any temporary files on program exit or upon an
16553 @c 2e: For the temp file handling, go with Darrel's ig=${TMP:-/tmp}/igs.$$
16554 @c 2e: or something as similar as possible.
16556 The next part loops through all the command line arguments.
16557 There are several cases of interest.
16561 This ends the arguments to @code{igawk}. Anything else should be passed on
16562 to the user's @code{awk} program without being evaluated.
16565 This indicates that the next option is specific to @code{gawk}. To make
16566 argument processing easier, the @samp{-W} is appended to the front of the
16567 remaining arguments and the loop continues. (This is an @code{sh}
16568 programming trick. Don't worry about it if you are not familiar with
16573 These are saved and passed on to @code{gawk}.
16579 The file name is saved to the temporary file @file{/tmp/ig.s.$$} with an
16580 @samp{@@include} statement.
16581 The @code{sed} utility is used to remove the leading option part of the
16582 argument (e.g., @samp{--file=}).
16587 The source text is echoed into @file{/tmp/ig.s.$$}.
16591 @code{igawk} prints its version number, and runs @samp{gawk --version}
16592 to get the @code{gawk} version information, and then exits.
16595 If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
16596 or @samp{-Wsource}, were supplied, then the first non-option argument
16597 should be the @code{awk} program. If there are no command line
16598 arguments left, @code{igawk} prints an error message and exits.
16599 Otherwise, the first argument is echoed into @file{/tmp/ig.s.$$}.
16601 In any case, after the arguments have been processed,
16602 @file{/tmp/ig.s.$$} contains the complete text of the original @code{awk}
16605 The @samp{$$} in @code{sh} represents the current process ID number.
16606 It is often used in shell programs to generate unique temporary file
16607 names. This allows multiple users to run @code{igawk} without worrying
16608 that the temporary file names will clash.
16610 @cindex @code{sed} utility
16611 Here's the program:
16616 @c file eg/prog/igawk.sh
16619 # igawk --- like gawk but do @@include processing
16620 # Arnold Robbins, arnold@@gnu.org, Public Domain
16623 if [ "$1" = debug ]
16628 # cleanup on exit, hangup, interrupt, quit, termination
16629 trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15
16632 while [ $# -ne 0 ] # loop over arguments
16641 -[vF]) opts="$opts $1 '$2'"
16644 -[vF]*) opts="$opts '$1'" ;;
16646 -f) echo @@include "$2" >> /tmp/ig.s.$$
16650 -f*) f=`echo "$1" | sed 's/-f//'`
16651 echo @@include "$f" >> /tmp/ig.s.$$ ;;
16654 -?file=*) # -Wfile or --file
16655 f=`echo "$1" | sed 's/-.file=//'`
16656 echo @@include "$f" >> /tmp/ig.s.$$ ;;
16658 -?file) # get arg, $2
16659 echo @@include "$2" >> /tmp/ig.s.$$
16662 -?source=*) # -Wsource or --source
16663 t=`echo "$1" | sed 's/-.source=//'`
16664 echo "$t" >> /tmp/ig.s.$$ ;;
16666 -?source) # get arg, $2
16667 echo "$2" >> /tmp/ig.s.$$
16671 echo igawk: version 1.0 1>&2
16675 -[W-]*) opts="$opts '$1'" ;;
16682 if [ ! -s /tmp/ig.s.$$ ]
16686 echo igawk: no program! 1>&2
16689 echo "$1" > /tmp/ig.s.$$
16694 # at this point, /tmp/ig.s.$$ has the program
16699 The @code{awk} program to process @samp{@@include} directives reads through
16700 the program, one line at a time using @code{getline}
16701 (@pxref{Getline, ,Explicit Input with @code{getline}}).
16702 The input file names and @samp{@@include} statements are managed using a
16703 stack. As each @samp{@@include} is encountered, the current file name is
16704 ``pushed'' onto the stack, and the file named in the @samp{@@include}
16706 the current file name. As each file is finished, the stack is ``popped,''
16707 and the previous input file becomes the current input file again.
16708 The process is started by making the original file the first one on the
16711 The @code{pathto} function does the work of finding the full path to a
16712 file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
16713 environment variable
16714 (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
16715 If a file name has a @samp{/} in it, no path search
16716 is done. Otherwise, the file name is concatenated with the name of each
16717 directory in the path, and an attempt is made to open the generated file
16718 name. The only way in @code{awk} to test if a file can be read is to go
16719 ahead and try to read it with @code{getline}; that is what @code{pathto}
16720 does.@footnote{On some very old versions of @code{awk}, the test
16721 @samp{getline junk < t} can loop forever if the file exists but is empty.
16723 If the file can be read, it is closed, and the file name is
16726 An alternative way to test for the file's existence would be to call
16727 @samp{system("test -r " t)}, which uses the @code{test} utility to
16728 see if the file exists and is readable. The disadvantage to this method
16729 is that it requires creating an extra process, and can thus be slightly
16734 @c file eg/prog/igawk.sh
16736 # process @@include directives
16740 @c file eg/prog/igawk.sh
16741 function pathto(file, i, t, junk)
16743 if (index(file, "/") != 0)
16746 for (i = 1; i <= ndirs; i++) @{
16747 t = (pathlist[i] "/" file)
16748 if ((getline junk < t) > 0) @{
16760 The main program is contained inside one @code{BEGIN} rule. The first thing it
16761 does is set up the @code{pathlist} array that @code{pathto} uses. After
16762 splitting the path on @samp{:}, null elements are replaced with @code{"."},
16763 which represents the current directory.
16767 @c file eg/prog/igawk.sh
16769 path = ENVIRON["AWKPATH"]
16770 ndirs = split(path, pathlist, ":")
16771 for (i = 1; i <= ndirs; i++) @{
16772 if (pathlist[i] == "")
16779 The stack is initialized with @code{ARGV[1]}, which will be @file{/tmp/ig.s.$$}.
16780 The main loop comes next. Input lines are read in succession. Lines that
16781 do not start with @samp{@@include} are printed verbatim.
16783 If the line does start with @samp{@@include}, the file name is in @code{$2}.
16784 @code{pathto} is called to generate the full path. If it could not, then we
16785 print an error message and continue.
16787 The next thing to check is if the file has been included already. The
16788 @code{processed} array is indexed by the full file name of each included
16789 file, and it tracks this information for us. If the file has been
16790 seen, a warning message is printed. Otherwise, the new file name is
16791 pushed onto the stack and processing continues.
16793 Finally, when @code{getline} encounters the end of the input file, the file
16794 is closed and the stack is popped. When @code{stackptr} is less than zero,
16795 the program is done.
16799 @c file eg/prog/igawk.sh
16801 input[stackptr] = ARGV[1] # ARGV[1] is first file
16803 for (; stackptr >= 0; stackptr--) @{
16804 while ((getline < input[stackptr]) > 0) @{
16805 if (tolower($1) != "@@include") @{
16810 if (fpath == "") @{
16811 printf("igawk:%s:%d: cannot find %s\n", \
16812 input[stackptr], FNR, $2) > "/dev/stderr"
16816 if (! (fpath in processed)) @{
16817 processed[fpath] = input[stackptr]
16818 input[++stackptr] = fpath
16820 print $2, "included in", input[stackptr], \
16821 "already included in", \
16822 processed[fpath] > "/dev/stderr"
16826 close(input[stackptr])
16828 @}' /tmp/ig.s.$$ > /tmp/ig.e.$$
16834 The last step is to call @code{gawk} with the expanded program and the original
16835 options and command line arguments that the user supplied. @code{gawk}'s
16836 exit status is passed back on to @code{igawk}'s calling program.
16838 @c this causes more problems than it solves, so leave it out.
16840 The special file @file{/dev/null} is passed as a data file to @code{gawk}
16841 to handle an interesting case. Suppose that the user's program only has
16842 a @code{BEGIN} rule, and there are no data files to read. The program should exit without reading any data
16843 files. However, suppose that an included library file defines an @code{END}
16844 rule of its own. In this case, @code{gawk} will hang, reading standard
16845 input. In order to avoid this, @file{/dev/null} is explicitly to the
16846 command line. Reading from @file{/dev/null} always returns an immediate
16847 end of file indication.
16849 @c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
16854 @c file eg/prog/igawk.sh
16855 eval gawk -f /tmp/ig.e.$$ $opts -- "$@@"
16862 This version of @code{igawk} represents my third attempt at this program.
16863 There are three key simplifications that made the program work better.
16867 Using @samp{@@include} even for the files named with @samp{-f} makes building
16868 the initial collected @code{awk} program much simpler; all the
16869 @samp{@@include} processing can be done once.
16872 The @code{pathto} function doesn't try to save the line read with
16873 @code{getline} when testing for the file's accessibility. Trying to save
16874 this line for use with the main program complicates things considerably.
16875 @c what problem does this engender though - exercise
16876 @c answer, reading from "-" or /dev/stdin
16879 Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
16880 place. It is not necessary to call out to a separate loop for processing
16881 nested @samp{@@include} statements.
16884 Also, this program illustrates that it is often worthwhile to combine
16885 @code{sh} and @code{awk} programming together. You can usually accomplish
16886 quite a lot, without having to resort to low-level programming in C or C++, and it
16887 is frequently easier to do certain kinds of string and argument manipulation
16888 using the shell than it is in @code{awk}.
16890 Finally, @code{igawk} shows that it is not always necessary to add new
16891 features to a program; they can often be layered on top. With @code{igawk},
16892 there is no real reason to build @samp{@@include} processing into
16893 @code{gawk} itself.
16895 As an additional example of this, consider the idea of having two
16896 files in a directory in the search path.
16900 This file would contain a set of default library functions, such
16901 as @code{getopt} and @code{assert}.
16904 This file would contain library functions that are specific to a site or
16905 installation, i.e.@: locally developed functions.
16906 Having a separate file allows @file{default.awk} to change with
16907 new @code{gawk} releases, without requiring the system administrator to
16908 update it each time by adding the local functions.
16912 @c Karl Berry, karl@ileaf.com, 10/95
16913 suggested that @code{gawk} be modified to automatically read these files
16914 upon startup. Instead, it would be very simple to modify @code{igawk}
16915 to do this. Since @code{igawk} can process nested @samp{@@include}
16916 directives, @file{default.awk} could simply contain @samp{@@include}
16917 statements for the desired library functions.
16919 @c Exercise: make this change
16921 @node Language History, Gawk Summary, Sample Programs, Top
16922 @chapter The Evolution of the @code{awk} Language
16924 This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
16925 the POSIX specification. Many @code{awk} users are only familiar
16926 with the original @code{awk} implementation in Version 7 Unix.
16927 (This implementation was the basis for @code{awk} in Berkeley Unix,
16928 through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
16929 for its version of @code{awk}.) This chapter briefly describes the
16930 evolution of the @code{awk} language, with cross references to other parts
16931 of the @value{DOCUMENT} where you can find more information.
16934 * V7/SVR3.1:: The major changes between V7 and System V
16936 * SVR4:: Minor changes between System V Releases 3.1
16938 * POSIX:: New features from the POSIX standard.
16939 * BTL:: New features from the Bell Laboratories
16940 version of @code{awk}.
16941 * POSIX/GNU:: The extensions in @code{gawk} not in POSIX
16945 @node V7/SVR3.1, SVR4, Language History, Language History
16946 @section Major Changes between V7 and SVR3.1
16948 The @code{awk} language evolved considerably between the release of
16949 Version 7 Unix (1978) and the new version first made generally available in
16950 System V Release 3.1 (1987). This section summarizes the changes, with
16951 cross-references to further details.
16955 The requirement for @samp{;} to separate rules on a line
16956 (@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
16959 User-defined functions, and the @code{return} statement
16960 (@pxref{User-defined, ,User-defined Functions}).
16963 The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).
16966 The @code{do}-@code{while} statement
16967 (@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).
16970 The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
16971 @code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).
16974 The built-in functions @code{gsub}, @code{sub}, and @code{match}
16975 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
16978 The built-in functions @code{close}, and @code{system}
16979 (@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
16982 The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
16983 and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).
16986 The conditional expression using the ternary operator @samp{?:}
16987 (@pxref{Conditional Exp, ,Conditional Expressions}).
16990 The exponentiation operator @samp{^}
16991 (@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
16992 form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).
16995 C-compatible operator precedence, which breaks some old @code{awk}
16996 programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).
16999 Regexps as the value of @code{FS}
17000 (@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
17001 third argument to the @code{split} function
17002 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
17005 Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
17006 (@pxref{Regexp Usage, ,How to Use Regular Expressions}).
17009 The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
17010 (@pxref{Escape Sequences}).
17011 (Some vendors have updated their old versions of @code{awk} to
17012 recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
17013 something you can rely on.)
17016 Redirection of input for the @code{getline} function
17017 (@pxref{Getline, ,Explicit Input with @code{getline}}).
17020 Multiple @code{BEGIN} and @code{END} rules
17021 (@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
17024 Multi-dimensional arrays
17025 (@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
17028 @node SVR4, POSIX, V7/SVR3.1, Language History
17029 @section Changes between SVR3.1 and SVR4
17031 @cindex @code{awk} language, V.4 version
17032 The System V Release 4 version of Unix @code{awk} added these features
17033 (some of which originated in @code{gawk}):
17037 The @code{ENVIRON} variable (@pxref{Built-in Variables}).
17040 Multiple @samp{-f} options on the command line
17041 (@pxref{Options, ,Command Line Options}).
17044 The @samp{-v} option for assigning variables before program execution begins
17045 (@pxref{Options, ,Command Line Options}).
17048 The @samp{--} option for terminating command line options.
17051 The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
17052 (@pxref{Escape Sequences}).
17055 A defined return value for the @code{srand} built-in function
17056 (@pxref{Numeric Functions, ,Numeric Built-in Functions}).
17059 The @code{toupper} and @code{tolower} built-in string functions
17060 for case translation
17061 (@pxref{String Functions, ,Built-in Functions for String Manipulation}).
17064 A cleaner specification for the @samp{%c} format-control letter in the
17065 @code{printf} function
17066 (@pxref{Control Letters, ,Format-Control Letters}).
17069 The ability to dynamically pass the field width and precision (@code{"%*.*d"})
17070 in the argument list of the @code{printf} function
17071 (@pxref{Control Letters, ,Format-Control Letters}).
17074 The use of regexp constants such as @code{/foo/} as expressions, where
17075 they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
17076 (@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
17079 @node POSIX, BTL, SVR4, Language History
17080 @section Changes between SVR4 and POSIX @code{awk}
17082 The POSIX Command Language and Utilities standard for @code{awk}
17083 introduced the following changes into the language:
17087 The use of @samp{-W} for implementation-specific options.
17090 The use of @code{CONVFMT} for controlling the conversion of numbers
17091 to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
17094 The concept of a numeric string, and tighter comparison rules to go
17095 with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).
17098 More complete documentation of many of the previously undocumented
17099 features of the language.
17102 The following common extensions are not permitted by the POSIX
17105 @c IMPORTANT! Keep this list in sync with the one in node Options
17109 @code{\x} escape sequences are not recognized
17110 (@pxref{Escape Sequences}).
17113 Newlines do not act as whitespace to separate fields when @code{FS} is
17114 equal to a single space.
17117 The synonym @code{func} for the keyword @code{function} is not
17118 recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).
17121 The operators @samp{**} and @samp{**=} cannot be used in
17122 place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
17123 and also @pxref{Assignment Ops, ,Assignment Expressions}).
17126 Specifying @samp{-Ft} on the command line does not set the value
17127 of @code{FS} to be a single tab character
17128 (@pxref{Field Separators, ,Specifying How Fields are Separated}).
17131 The @code{fflush} built-in function is not supported
17132 (@pxref{I/O Functions, , Built-in Functions for Input/Output}).
17135 @node BTL, POSIX/GNU, POSIX, Language History
17136 @section Extensions in the Bell Laboratories @code{awk}
17138 @cindex Kernighan, Brian
17139 Brian Kernighan, one of the original designers of Unix @code{awk},
17140 has made his version available via anonymous @code{ftp}
17141 (@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
17142 This section describes extensions in his version of @code{awk} that are
17143 not in POSIX @code{awk}.
17147 The @samp{-mf @var{NNN}} and @samp{-mr @var{NNN}} command line options
17148 to set the maximum number of fields, and the maximum
17149 record size, respectively
17150 (@pxref{Options, ,Command Line Options}).
17153 The @code{fflush} built-in function for flushing buffered output
17154 (@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
17158 The @code{SYMTAB} array, that allows access to the internal symbol
17159 table of @code{awk}. This feature is not documented, largely because
17160 it is somewhat shakily implemented. For instance, you cannot access arrays
17161 or array elements through it.
17165 @node POSIX/GNU, , BTL, Language History
17166 @section Extensions in @code{gawk} Not in POSIX @code{awk}
17168 @cindex compatibility mode
17169 The GNU implementation, @code{gawk}, adds a number of features.
17170 This sections lists them in the order they were added to @code{gawk}.
17171 They can all be disabled with either the @samp{--traditional} or
17172 @samp{--posix} options
17173 (@pxref{Options, ,Command Line Options}).
17175 Version 2.10 of @code{gawk} introduced these features:
17179 The @code{AWKPATH} environment variable for specifying a path search for
17180 the @samp{-f} command line option
17181 (@pxref{Options, ,Command Line Options}).
17184 The @code{IGNORECASE} variable and its effects
17185 (@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
17188 The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
17189 @file{/dev/fd/@var{n}} file name interpretation
17190 (@pxref{Special Files, ,Special File Names in @code{gawk}}).
17193 Version 2.13 of @code{gawk} introduced these features:
17197 The @code{FIELDWIDTHS} variable and its effects
17198 (@pxref{Constant Size, ,Reading Fixed-width Data}).
17201 The @code{systime} and @code{strftime} built-in functions for obtaining
17202 and printing time stamps
17203 (@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).
17206 The @samp{-W lint} option to provide source code and run time error
17207 and portability checking
17208 (@pxref{Options, ,Command Line Options}).
17211 The @samp{-W compat} option to turn off these extensions
17212 (@pxref{Options, ,Command Line Options}).
17215 The @samp{-W posix} option for full POSIX compliance
17216 (@pxref{Options, ,Command Line Options}).
17219 Version 2.14 of @code{gawk} introduced these features:
17223 The @code{next file} statement for skipping to the next data file
17224 (@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
17227 Version 2.15 of @code{gawk} introduced these features:
17231 The @code{ARGIND} variable, that tracks the movement of @code{FILENAME}
17232 through @code{ARGV} (@pxref{Built-in Variables}).
17235 The @code{ERRNO} variable, that contains the system error message when
17236 @code{getline} returns @minus{}1, or when @code{close} fails
17237 (@pxref{Built-in Variables}).
17240 The ability to use GNU-style long named options that start with @samp{--}
17241 (@pxref{Options, ,Command Line Options}).
17244 The @samp{--source} option for mixing command line and library
17246 (@pxref{Options, ,Command Line Options}).
17249 The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
17250 @file{/dev/user} file name interpretation
17251 (@pxref{Special Files, ,Special File Names in @code{gawk}}).
17254 Version 3.0 of @code{gawk} introduced these features:
17258 The @code{next file} statement became @code{nextfile}
17259 (@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
17262 The @samp{--lint-old} option to
17263 warn about constructs that are not available in
17264 the original Version 7 Unix version of @code{awk}
17265 (@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).
17268 The @samp{--traditional} option was added as a better name for
17269 @samp{--compat} (@pxref{Options, ,Command Line Options}).
17272 The ability for @code{FS} to be a null string, and for the third
17273 argument to @code{split} to be the null string
17274 (@pxref{Single Character Fields, , Making Each Character a Separate Field}).
17277 The ability for @code{RS} to be a regexp
17278 (@pxref{Records, , How Input is Split into Records}).
17281 The @code{RT} variable
17282 (@pxref{Records, , How Input is Split into Records}).
17285 The @code{gensub} function for more powerful text manipulation
17286 (@pxref{String Functions, , Built-in Functions for String Manipulation}).
17289 The @code{strftime} function acquired a default time format,
17290 allowing it to be called with no arguments
17291 (@pxref{Time Functions, , Functions for Dealing with Time Stamps}).
17294 Full support for both POSIX and GNU regexps
17295 (@pxref{Regexp, , Regular Expressions}).
17298 The @samp{--re-interval} option to provide interval expressions in regexps
17299 (@pxref{Regexp Operators, , Regular Expression Operators}).
17302 @code{IGNORECASE} changed, now applying to string comparison as well
17303 as regexp operations
17304 (@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).
17307 The @samp{-m} option and the @code{fflush} function from the
17308 Bell Labs research version of @code{awk}
17309 (@pxref{Options, ,Command Line Options}; also
17310 @pxref{I/O Functions, ,Built-in Functions for Input/Output}).
17313 The use of GNU Autoconf to control the configuration process
17314 (@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).
17318 (@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).
17320 @c XXX ADD MORE STUFF HERE
17324 @node Gawk Summary, Installation, Language History, Top
17325 @appendix @code{gawk} Summary
17327 This appendix provides a brief summary of the @code{gawk} command line and the
17328 @code{awk} language. It is designed to serve as ``quick reference.'' It is
17329 therefore terse, but complete.
17332 * Command Line Summary:: Recapitulation of the command line.
17333 * Language Summary:: A terse review of the language.
17334 * Variables/Fields:: Variables, fields, and arrays.
17335 * Rules Summary:: Patterns and Actions, and their component
17337 * Actions Summary:: Quick overview of actions.
17338 * Functions Summary:: Defining and calling functions.
17339 * Historical Features:: Some undocumented but supported ``features''.
17342 @node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
17343 @appendixsec Command Line Options Summary
17345 The command line consists of options to @code{gawk} itself, the
17346 @code{awk} program text (if not supplied via the @samp{-f} option), and
17347 values to be made available in the @code{ARGC} and @code{ARGV}
17348 predefined @code{awk} variables:
17351 gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
17352 gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
17355 The options that @code{gawk} accepts are:
17359 @itemx --field-separator @var{fs}
17360 Use @var{fs} for the input field separator (the value of the @code{FS}
17361 predefined variable).
17363 @item -f @var{program-file}
17364 @itemx --file @var{program-file}
17365 Read the @code{awk} program source from the file @var{program-file}, instead
17366 of from the first command line argument.
17368 @item -mf @var{NNN}
17369 @itemx -mr @var{NNN}
17370 The @samp{f} flag sets
17371 the maximum number of fields, and the @samp{r} flag sets the maximum
17372 record size. These options are ignored by @code{gawk}, since @code{gawk}
17373 has no predefined limits; they are only for compatibility with the
17374 Bell Labs research version of Unix @code{awk}.
17376 @item -v @var{var}=@var{val}
17377 @itemx --assign @var{var}=@var{val}
17378 Assign the variable @var{var} the value @var{val} before program execution
17381 @item -W traditional
17383 @itemx --traditional
17385 Use compatibility mode, in which @code{gawk} extensions are turned
17389 @itemx -W copyright
17392 Print the short version of the General Public License on the standard
17393 output, and exit. This option may disappear in a future version of @code{gawk}.
17399 Print a relatively short summary of the available options on the standard
17404 Give warnings about dubious or non-portable @code{awk} constructs.
17408 Warn about constructs that are not available in
17409 the original Version 7 Unix version of @code{awk}.
17413 Use POSIX compatibility mode, in which @code{gawk} extensions
17414 are turned off and additional restrictions apply.
17416 @item -W re-interval
17417 @itemx --re-interval
17418 Allow interval expressions
17419 (@pxref{Regexp Operators, , Regular Expression Operators}),
17422 @item -W source=@var{program-text}
17423 @itemx --source @var{program-text}
17424 Use @var{program-text} as @code{awk} program source code. This option allows
17425 mixing command line source code with source code from files, and is
17426 particularly useful for mixing command line programs with library functions.
17430 Print version information for this particular copy of @code{gawk} on the error
17434 Signal the end of options. This is useful to allow further arguments to the
17435 @code{awk} program itself to start with a @samp{-}. This is mainly for
17436 consistency with POSIX argument parsing conventions.
17439 Any other options are flagged as invalid, but are otherwise ignored.
17440 @xref{Options, ,Command Line Options}, for more details.
17442 @node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
17443 @appendixsec Language Summary
17445 An @code{awk} program consists of a sequence of zero or more pattern-action
17446 statements and optional function definitions. One or the other of the
17447 pattern and action may be omitted.
17450 @var{pattern} @{ @var{action statements} @}
17452 @{ @var{action statements} @}
17454 function @var{name}(@var{parameter list}) @{ @var{action statements} @}
17457 @code{gawk} first reads the program source from the
17458 @var{program-file}(s), if specified, or from the first non-option
17459 argument on the command line. The @samp{-f} option may be used multiple
17460 times on the command line. @code{gawk} reads the program text from all
17461 the @var{program-file} files, effectively concatenating them in the
17462 order they are specified. This is useful for building libraries of
17463 @code{awk} functions, without having to include them in each new
17464 @code{awk} program that uses them. To use a library function in a file
17465 from a program typed in on the command line, specify
17466 @samp{--source '@var{program}'}, and type your program in between the single
17468 @xref{Options, ,Command Line Options}.
17470 The environment variable @code{AWKPATH} specifies a search path to use
17471 when finding source files named with the @samp{-f} option. The default
17473 @samp{.:/usr/local/share/awk}@footnote{The path may use a directory
17474 other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
17475 was built and installed.} is used if @code{AWKPATH} is not set.
17476 If a file name given to the @samp{-f} option contains a @samp{/} character,
17477 no path search is performed.
17478 @xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
17480 @code{gawk} compiles the program into an internal form, and then proceeds to
17481 read each file named in the @code{ARGV} array.
17482 The initial values of @code{ARGV} come from the command line arguments.
17483 If there are no files named
17484 on the command line, @code{gawk} reads the standard input.
17486 If a ``file'' named on the command line has the form
17487 @samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
17488 variable @var{var} is assigned the value @var{val}.
17489 If any of the files have a value that is the null string, that
17490 element in the list is skipped.
17492 For each record in the input, @code{gawk} tests to see if it matches any
17493 @var{pattern} in the @code{awk} program. For each pattern that the record
17494 matches, the associated @var{action} is executed.
17496 @node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
17497 @appendixsec Variables and Fields
17499 @code{awk} variables are not declared; they come into existence when they are
17500 first used. Their values are either floating-point numbers or strings.
17501 @code{awk} also has one-dimensional arrays; multiple-dimensional arrays
17502 may be simulated. There are several predefined variables that
17503 @code{awk} sets as a program runs; these are summarized below.
17506 * Fields Summary:: Input field splitting.
17507 * Built-in Summary:: @code{awk}'s built-in variables.
17508 * Arrays Summary:: Using arrays.
17509 * Data Type Summary:: Values in @code{awk} are numbers or strings.
17512 @node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
17513 @appendixsubsec Fields
17515 As each input line is read, @code{gawk} splits the line into
17516 @var{fields}, using the value of the @code{FS} variable as the field
17517 separator. If @code{FS} is a single character, fields are separated by
17518 that character. Otherwise, @code{FS} is expected to be a full regular
17519 expression. In the special case that @code{FS} is a single space,
17520 fields are separated by runs of spaces, tabs and/or newlines.@footnote{In
17521 POSIX @code{awk}, newline does not separate fields.}
17522 If @code{FS} is the null string (@code{""}), then each individual
17523 character in the record becomes a separate field.
17524 Note that the value
17525 of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})
17526 also affects how fields are split when @code{FS} is a regular expression.
17528 Each field in the input line may be referenced by its position, @code{$1},
17529 @code{$2}, and so on. @code{$0} is the whole line. The value of a field may
17530 be assigned to as well. Field numbers need not be constants:
17538 prints the fifth field in the input line. The variable @code{NF} is set to
17539 the total number of fields in the input line.
17541 References to non-existent fields (i.e.@: fields after @code{$NF}) return
17542 the null string. However, assigning to a non-existent field (e.g.,
17543 @code{$(NF+2) = 5}) increases the value of @code{NF}, creates any
17544 intervening fields with the null string as their value, and causes the
17545 value of @code{$0} to be recomputed, with the fields being separated by
17546 the value of @code{OFS}.
17547 Decrementing @code{NF} causes the values of fields past the new value to
17548 be lost, and the value of @code{$0} to be recomputed, with the fields being
17549 separated by the value of @code{OFS}.
17550 @xref{Reading Files, ,Reading Input Files}.
17552 @node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields
17553 @appendixsubsec Built-in Variables
17555 @code{gawk}'s built-in variables are:
17559 The number of elements in @code{ARGV}. See below for what is actually
17560 included in @code{ARGV}.
17563 The index in @code{ARGV} of the current file being processed.
17564 When @code{gawk} is processing the input data files,
17565 it is always true that @samp{FILENAME == ARGV[ARGIND]}.
17568 The array of command line arguments. The array is indexed from zero to
17569 @code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and
17570 the contents of @code{ARGV}
17571 can control the files used for data. A null-valued element in
17572 @code{ARGV} is ignored. @code{ARGV} does not include the options to
17573 @code{awk} or the text of the @code{awk} program itself.
17576 The conversion format to use when converting numbers to strings.
17579 A space separated list of numbers describing the fixed-width input data.
17582 An array of environment variable values. The array
17583 is indexed by variable name, each element being the value of that
17584 variable. Thus, the environment variable @code{HOME} is
17585 @code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}.
17587 Changing this array does not affect the environment seen by programs
17588 which @code{gawk} spawns via redirection or the @code{system} function.
17589 (This may change in a future version of @code{gawk}.)
17591 Some operating systems do not have environment variables.
17592 The @code{ENVIRON} array is empty when running on these systems.
17595 The system error message when an error occurs using @code{getline}
17599 The name of the current input file. If no files are specified on the command
17600 line, the value of @code{FILENAME} is the null string.
17603 The input record number in the current input file.
17606 The input field separator, a space by default.
17609 The case-sensitivity flag for string comparisons and regular expression
17610 operations. If @code{IGNORECASE} has a non-zero value, then pattern
17611 matching in rules, record separating with @code{RS}, field splitting
17612 with @code{FS}, regular expression matching with @samp{~} and
17613 @samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},
17614 @code{match}, @code{split} and @code{sub} built-in functions all
17615 ignore case when doing regular expression operations, and all string
17616 comparisons are done ignoring case.
17617 The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
17620 The number of fields in the current input record.
17623 The total number of input records seen so far.
17626 The output format for numbers for the @code{print} statement,
17627 @code{"%.6g"} by default.
17630 The output field separator, a space by default.
17633 The output record separator, by default a newline.
17636 The input record separator, by default a newline.
17637 If @code{RS} is set to the null string, then records are separated by
17638 blank lines. When @code{RS} is set to the null string, then the newline
17639 character always acts as a field separator, in addition to whatever value
17640 @code{FS} may have. If @code{RS} is set to a multi-character
17641 string, it denotes a regexp; input text matching the regexp
17645 The input text that matched the text denoted by @code{RS},
17646 the record separator.
17649 The index of the first character last matched by @code{match}; zero if no match.
17652 The length of the string last matched by @code{match}; @minus{}1 if no match.
17655 The string used to separate multiple subscripts in array elements, by
17656 default @code{"\034"}.
17659 @xref{Built-in Variables}, for more information.
17661 @node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields
17662 @appendixsubsec Arrays
17664 Arrays are subscripted with an expression between square brackets
17665 (@samp{[} and @samp{]}). Array subscripts are @emph{always} strings;
17666 numbers are converted to strings as necessary, following the standard
17668 (@pxref{Conversion, ,Conversion of Strings and Numbers}).
17670 If you use multiple expressions separated by commas inside the square
17671 brackets, then the array subscript is a string consisting of the
17672 concatenation of the individual subscript values, converted to strings,
17673 separated by the subscript separator (the value of @code{SUBSEP}).
17675 The special operator @code{in} may be used in a conditional context
17676 to see if an array has an index consisting of a particular value.
17683 If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}}
17684 to test for existence of an element.
17686 The @code{in} construct may also be used in a @code{for} loop to iterate
17687 over all the elements of an array.
17688 @xref{Scanning an Array, ,Scanning All Elements of an Array}.
17690 You can remove an element from an array using the @code{delete} statement.
17692 You can clear an entire array using @samp{delete @var{array}}.
17694 @xref{Arrays, ,Arrays in @code{awk}}.
17696 @node Data Type Summary, , Arrays Summary, Variables/Fields
17697 @appendixsubsec Data Types
17699 The value of an @code{awk} expression is always either a number
17702 Some contexts (such as arithmetic operators) require numeric
17703 values. They convert strings to numbers by interpreting the text
17704 of the string as a number. If the string does not look like a
17705 number, it converts to zero.
17707 Other contexts (such as concatenation) require string values.
17708 They convert numbers to strings by effectively printing them
17709 with @code{sprintf}.
17710 @xref{Conversion, ,Conversion of Strings and Numbers}, for the details.
17712 To force conversion of a string value to a number, simply add zero
17713 to it. If the value you start with is already a number, this
17714 does not change it.
17716 To force conversion of a numeric value to a string, concatenate it with
17719 Comparisons are done numerically if both operands are numeric, or if
17720 one is numeric and the other is a numeric string. Otherwise one or
17721 both operands are converted to strings and a string comparison is
17722 performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV}
17723 elements, @code{ENVIRON} elements and the elements of an array created
17724 by @code{split} are the only items that can be numeric strings. String
17725 constants, such as @code{"3.1415927"} are not numeric strings, they are
17726 string constants. The full rules for comparisons are described in
17727 @ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
17729 Uninitialized variables have the string value @code{""} (the null, or
17730 empty, string). In contexts where a number is required, this is
17731 equivalent to zero.
17733 @xref{Variables}, for more information on variable naming and initialization;
17734 @pxref{Conversion, ,Conversion of Strings and Numbers}, for more information
17735 on how variable values are interpreted.
17737 @node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary
17738 @appendixsec Patterns
17741 * Pattern Summary:: Quick overview of patterns.
17742 * Regexp Summary:: Quick overview of regular expressions.
17745 An @code{awk} program is mostly composed of rules, each consisting of a
17746 pattern followed by an action. The action is enclosed in @samp{@{} and
17747 @samp{@}}. Either the pattern may be missing, or the action may be
17748 missing, but not both. If the pattern is missing, the
17749 action is executed for every input record. A missing action is
17750 equivalent to @samp{@w{@{ print @}}}, which prints the entire line.
17752 @c These paragraphs repeated for both patterns and actions. I don't
17753 @c like this, but I also don't see any way around it. Update both copies
17754 @c if they need fixing.
17755 Comments begin with the @samp{#} character, and continue until the end of the
17756 line. Blank lines may be used to separate statements. Statements normally
17757 end with a newline; however, this is not the case for lines ending in a
17758 @samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
17759 ending in @code{do} or @code{else} also have their statements automatically
17760 continued on the following line. In other cases, a line can be continued by
17761 ending it with a @samp{\}, in which case the newline is ignored.
17763 Multiple statements may be put on one line by separating each one with
17765 This applies to both the statements within the action part of a rule (the
17766 usual case), and to the rule statements.
17768 @xref{Comments, ,Comments in @code{awk} Programs}, for information on
17769 @code{awk}'s commenting convention;
17770 @pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17771 description of the line continuation mechanism in @code{awk}.
17773 @node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary
17774 @appendixsubsec Pattern Summary
17776 @code{awk} patterns may be one of the following:
17779 /@var{regular expression}/
17780 @var{relational expression}
17781 @var{pattern} && @var{pattern}
17782 @var{pattern} || @var{pattern}
17783 @var{pattern} ? @var{pattern} : @var{pattern}
17786 @var{pattern1}, @var{pattern2}
17791 @code{BEGIN} and @code{END} are two special kinds of patterns that are not
17792 tested against the input. The action parts of all @code{BEGIN} rules are
17793 concatenated as if all the statements had been written in a single @code{BEGIN}
17794 rule. They are executed before any of the input is read. Similarly, all the
17795 @code{END} rules are concatenated, and executed when all the input is exhausted (or
17796 when an @code{exit} statement is executed). @code{BEGIN} and @code{END}
17797 patterns cannot be combined with other patterns in pattern expressions.
17798 @code{BEGIN} and @code{END} rules cannot have missing action parts.
17800 For @code{/@var{regular-expression}/} patterns, the associated statement is
17801 executed for each input record that matches the regular expression. Regular
17802 expressions are summarized below.
17804 A @var{relational expression} may use any of the operators defined below in
17805 the section on actions. These generally test whether certain fields match
17806 certain regular expressions.
17808 The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,''
17809 logical ``or,'' and logical ``not,'' respectively, as in C. They do
17810 short-circuit evaluation, also as in C, and are used for combining more
17811 primitive pattern expressions. As in most languages, parentheses may be
17812 used to change the order of evaluation.
17814 The @samp{?:} operator is like the same operator in C. If the first
17815 pattern matches, then the second pattern is matched against the input
17816 record; otherwise, the third is matched. Only one of the second and
17817 third patterns is matched.
17819 The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a
17820 range pattern. It matches all input lines starting with a line that
17821 matches @var{pattern1}, and continuing until a line that matches
17822 @var{pattern2}, inclusive. A range pattern cannot be used as an operand
17823 of any of the pattern operators.
17825 @xref{Pattern Overview, ,Pattern Elements}.
17827 @node Regexp Summary, , Pattern Summary, Rules Summary
17828 @appendixsubsec Regular Expressions
17830 Regular expressions are based on POSIX EREs (extended regular expressions).
17831 The escape sequences allowed in string constants are also valid in
17832 regular expressions (@pxref{Escape Sequences}).
17833 Regexps are composed of characters as follows:
17837 matches the character @var{c} (assuming @var{c} is none of the characters
17841 matches the literal character @var{c}.
17844 matches any character, @emph{including} newline.
17845 In strict POSIX mode, @samp{.} does not match the @sc{nul}
17846 character, which is a character with all bits equal to zero.
17849 matches the beginning of a string.
17852 matches the end of a string.
17854 @item [@var{abc}@dots{}]
17855 matches any of the characters @var{abc}@dots{} (character list).
17857 @item [[:@var{class}:]]
17858 matches any character in the character class @var{class}. Allowable classes
17859 are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl},
17860 @code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct},
17861 @code{space}, @code{upper}, and @code{xdigit}.
17863 @item [[.@var{symbol}.]]
17864 matches the multi-character collating symbol @var{symbol}.
17865 @code{gawk} does not currently support collating symbols.
17867 @item [[=@var{classname}=]]
17868 matches any of the equivalent characters in the current locale named by the
17869 equivalence class @var{classname}.
17870 @code{gawk} does not currently support equivalence classes.
17872 @item [^@var{abc}@dots{}]
17873 matches any character except @var{abc}@dots{} (negated
17876 @item @var{r1}|@var{r2}
17877 matches either @var{r1} or @var{r2} (alternation).
17880 matches @var{r1}, and then @var{r2} (concatenation).
17883 matches one or more @var{r}'s.
17886 matches zero or more @var{r}'s.
17889 matches zero or one @var{r}'s.
17892 matches @var{r} (grouping).
17894 @item @var{r}@{@var{n}@}
17895 @itemx @var{r}@{@var{n},@}
17896 @itemx @var{r}@{@var{n},@var{m}@}
17897 matches at least @var{n}, @var{n} to any number, or @var{n} to @var{m}
17898 occurrences of @var{r} (interval expressions).
17901 matches the empty string at either the beginning or the
17905 matches the empty string within a word.
17908 matches the empty string at the beginning of a word.
17911 matches the empty string at the end of a word.
17914 matches any word-constituent character (alphanumeric characters and
17918 matches any character that is not word-constituent.
17921 matches the empty string at the beginning of a buffer (same as a string
17925 matches the empty string at the end of a buffer.
17928 The various command line options
17929 control how @code{gawk} interprets characters in regexps.
17931 @c NOTE!!! Keep this in sync with the same table in the regexp chapter!
17934 In the default case, @code{gawk} provide all the facilities of
17935 POSIX regexps and the GNU regexp operators described above.
17936 However, interval expressions are not supported.
17938 @item @code{--posix}
17939 Only POSIX regexps are supported, the GNU operators are not special
17940 (e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
17943 @item @code{--traditional}
17944 Traditional Unix @code{awk} regexps are matched. The GNU operators
17945 are not special, interval expressions are not available, and neither
17946 are the POSIX character classes (@code{[[:alnum:]]} and so on).
17947 Characters described by octal and hexadecimal escape sequences are
17948 treated literally, even if they represent regexp metacharacters.
17950 @item @code{--re-interval}
17951 Allow interval expressions in regexps, even if @samp{--traditional}
17955 @xref{Regexp, ,Regular Expressions}.
17957 @node Actions Summary, Functions Summary, Rules Summary, Gawk Summary
17958 @appendixsec Actions
17960 Action statements are enclosed in braces, @samp{@{} and @samp{@}}.
17961 A missing action statement is equivalent to @samp{@w{@{ print @}}}.
17963 Action statements consist of the usual assignment, conditional, and looping
17964 statements found in most languages. The operators, control statements,
17965 and Input/Output statements available are similar to those in C.
17967 @c These paragraphs repeated for both patterns and actions. I don't
17968 @c like this, but I also don't see any way around it. Update both copies
17969 @c if they need fixing.
17970 Comments begin with the @samp{#} character, and continue until the end of the
17971 line. Blank lines may be used to separate statements. Statements normally
17972 end with a newline; however, this is not the case for lines ending in a
17973 @samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines
17974 ending in @code{do} or @code{else} also have their statements automatically
17975 continued on the following line. In other cases, a line can be continued by
17976 ending it with a @samp{\}, in which case the newline is ignored.
17978 Multiple statements may be put on one line by separating each one with
17980 This applies to both the statements within the action part of a rule (the
17981 usual case), and to the rule statements.
17983 @xref{Comments, ,Comments in @code{awk} Programs}, for information on
17984 @code{awk}'s commenting convention;
17985 @pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a
17986 description of the line continuation mechanism in @code{awk}.
17989 * Operator Summary:: @code{awk} operators.
17990 * Control Flow Summary:: The control statements.
17991 * I/O Summary:: The I/O statements.
17992 * Printf Summary:: A summary of @code{printf}.
17993 * Special File Summary:: Special file names interpreted internally.
17994 * Built-in Functions Summary:: Built-in numeric and string functions.
17995 * Time Functions Summary:: Built-in time functions.
17996 * String Constants Summary:: Escape sequences in strings.
17999 @node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary
18000 @appendixsubsec Operators
18002 The operators in @code{awk}, in order of decreasing precedence, are:
18012 Increment and decrement, both prefix and postfix.
18015 Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment
18016 operator, but they are not specified in the POSIX standard).
18019 Unary plus, unary minus, and logical negation.
18022 Multiplication, division, and modulus.
18025 Addition and subtraction.
18028 String concatenation.
18030 @item < <= > >= != ==
18031 The usual relational operators.
18034 Regular expression match, negated match.
18046 A conditional expression. This has the form @samp{@var{expr1} ?
18047 @var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the
18048 expression is @var{expr2}; otherwise it is @var{expr3}. Only one of
18049 @var{expr2} and @var{expr3} is evaluated.
18051 @item = += -= *= /= %= ^=
18052 Assignment. Both absolute assignment (@code{@var{var}=@var{value}})
18053 and operator assignment (the other forms) are supported.
18056 @xref{Expressions}.
18058 @node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary
18059 @appendixsubsec Control Statements
18061 The control statements are as follows:
18064 if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]}
18065 while (@var{condition}) @var{statement}
18066 do @var{statement} while (@var{condition})
18067 for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement}
18068 for (@var{var} in @var{array}) @var{statement}
18071 delete @var{array}[@var{index}]
18073 exit @r{[} @var{expression} @r{]}
18074 @{ @var{statements} @}
18077 @xref{Statements, ,Control Statements in Actions}.
18079 @node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
18080 @appendixsubsec I/O Statements
18082 The Input/Output statements are as follows:
18086 Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.
18087 @xref{Getline, ,Explicit Input with @code{getline}}.
18089 @item getline <@var{file}
18090 Set @code{$0} from next record of @var{file}; set @code{NF}.
18092 @item getline @var{var}
18093 Set @var{var} from next input record; set @code{NR}, @code{FNR}.
18095 @item getline @var{var} <@var{file}
18096 Set @var{var} from next record of @var{file}.
18098 @item @var{command} | getline
18099 Run @var{command}, piping its output into @code{getline}; sets @code{$0},
18100 @code{NF}, @code{NR}.
18102 @item @var{command} | getline @code{var}
18103 Run @var{command}, piping its output into @code{getline}; sets @var{var}.
18106 Stop processing the current input record. The next input record is read and
18107 processing starts over with the first pattern in the @code{awk} program.
18108 If the end of the input data is reached, the @code{END} rule(s), if any,
18110 @xref{Next Statement, ,The @code{next} Statement}.
18113 Stop processing the current input file. The next input record read comes
18114 from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one,
18115 @code{ARGIND} is incremented,
18116 and processing starts over with the first pattern in the @code{awk} program.
18117 If the end of the input data is reached, the @code{END} rule(s), if any,
18119 Earlier versions of @code{gawk} used @samp{next file}; this usage is still
18120 supported, but is considered to be deprecated.
18121 @xref{Nextfile Statement, ,The @code{nextfile} Statement}.
18124 Prints the current record.
18125 @xref{Printing, ,Printing Output}.
18127 @item print @var{expr-list}
18128 Prints expressions.
18130 @item print @var{expr-list} > @var{file}
18131 Prints expressions to @var{file}. If @var{file} does not exist, it is
18132 created. If it does exist, its contents are deleted the first time the
18133 @code{print} is executed.
18135 @item print @var{expr-list} >> @var{file}
18136 Prints expressions to @var{file}. The previous contents of @var{file}
18137 are retained, and the output of @code{print} is appended to the file.
18139 @item print @var{expr-list} | @var{command}
18140 Prints expressions, sending the output down a pipe to @var{command}.
18141 The pipeline to the command stays open until the @code{close} function
18144 @item printf @var{fmt}, @var{expr-list}
18147 @item printf @var{fmt}, @var{expr-list} > @var{file}
18148 Format and print to @var{file}. If @var{file} does not exist, it is
18149 created. If it does exist, its contents are deleted the first time the
18150 @code{printf} is executed.
18152 @item printf @var{fmt}, @var{expr-list} >> @var{file}
18153 Format and print to @var{file}. The previous contents of @var{file}
18154 are retained, and the output of @code{printf} is appended to the file.
18156 @item printf @var{fmt}, @var{expr-list} | @var{command}
18157 Format and print, sending the output down a pipe to @var{command}.
18158 The pipeline to the command stays open until the @code{close} function
18162 @code{getline} returns zero on end of file, and @minus{}1 on an error.
18163 In the event of an error, @code{getline} will set @code{ERRNO} to
18164 the value of a system-dependent string that describes the error.
18166 @node Printf Summary, Special File Summary, I/O Summary, Actions Summary
18167 @appendixsubsec @code{printf} Summary
18169 Conversion specification have the form
18170 @code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.
18172 Items in brackets are optional.
18174 The @code{awk} @code{printf} statement and @code{sprintf} function
18175 accept the following conversion specification formats:
18179 An ASCII character. If the argument used for @samp{%c} is numeric, it is
18180 treated as a character and printed. Otherwise, the argument is assumed to
18181 be a string, and the only first character of that string is printed.
18185 A decimal number (the integer part).
18189 A floating point number of the form
18190 @samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.
18191 The @samp{%E} format uses @samp{E} instead of @samp{e}.
18194 A floating point number of the form
18195 @r{[}@code{-}@r{]}@code{ddd.dddddd}.
18199 Use either the @samp{%e} or @samp{%f} formats, whichever produces a shorter
18200 string, with non-significant zeros suppressed.
18201 @samp{%G} will use @samp{%E} instead of @samp{%e}.
18204 An unsigned octal number (also an integer).
18207 An unsigned decimal number (again, an integer).
18210 A character string.
18214 An unsigned hexadecimal number (an integer).
18215 The @samp{%X} format uses @samp{A} through @samp{F} instead of
18216 @samp{a} through @samp{f} for decimal 10 through 15.
18219 A single @samp{%} character; no argument is converted.
18222 There are optional, additional parameters that may lie between the @samp{%}
18223 and the control letter:
18227 The expression should be left-justified within its field.
18230 For numeric conversions, prefix positive values with a space, and
18231 negative values with a minus sign.
18234 The plus sign, used before the width modifier (see below),
18235 says to always supply a sign for numeric conversions, even if the data
18236 to be formatted is positive. The @samp{+} overrides the space modifier.
18239 Use an ``alternate form'' for certain control letters.
18240 For @samp{o}, supply a leading zero.
18241 For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for
18243 For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a
18245 For @samp{g}, and @samp{G}, trailing zeros are not removed from the result.
18248 A leading @samp{0} (zero) acts as a flag, that indicates output should be
18249 padded with zeros instead of spaces.
18250 This applies even to non-numeric output formats.
18251 This flag only has an effect when the field width is wider than the
18252 value to be printed.
18255 The field should be padded to this width. The field is normally padded
18256 with spaces. If the @samp{0} flag has been used, it is padded with zeros.
18259 A number that specifies the precision to use when printing.
18260 For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
18261 number of digits you want printed to the right of the decimal point.
18262 For the @samp{g}, and @samp{G} formats, it specifies the maximum number
18263 of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
18264 @samp{x}, and @samp{X} formats, it specifies the minimum number of
18265 digits to print. For the @samp{s} format, it specifies the maximum number of
18266 characters from the string that should be printed.
18269 Either or both of the @var{width} and @var{prec} values may be specified
18270 as @samp{*}. In that case, the particular value is taken from the argument
18273 @xref{Printf, ,Using @code{printf} Statements for Fancier Printing}.
18275 @node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
18276 @appendixsubsec Special File Names
18278 When doing I/O redirection from either @code{print} or @code{printf} into a
18279 file, or via @code{getline} from a file, @code{gawk} recognizes certain special
18280 file names internally. These file names allow access to open file descriptors
18281 inherited from @code{gawk}'s parent process (usually the shell). The
18286 The standard input.
18289 The standard output.
18292 The standard error output.
18294 @item /dev/fd/@var{n}
18295 The file denoted by the open file descriptor @var{n}.
18298 In addition, reading the following files provides process related information
18299 about the running @code{gawk} program. All returned records are terminated
18304 Returns the process ID of the current process.
18307 Returns the parent process ID of the current process.
18310 Returns the process group ID of the current process.
18313 At least four space-separated fields, containing the return values of
18314 the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
18316 If there are any additional fields, they are the group IDs returned by
18317 @code{getgroups} system call.
18318 (Multiple groups may not be supported on all systems.)
18322 These file names may also be used on the command line to name data files.
18323 These file names are only recognized internally if you do not
18324 actually have files with these names on your system.
18326 @xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
18327 provides the motivation for this feature.
18329 @node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
18330 @appendixsubsec Built-in Functions
18332 @code{awk} provides a number of built-in functions for performing
18333 numeric operations, string related operations, and I/O related operations.
18337 The built-in arithmetic functions are:
18340 @item atan2(@var{y}, @var{x})
18341 the arctangent of @var{y/x} in radians.
18343 @item cos(@var{expr})
18344 the cosine of @var{expr}, which is in radians.
18346 @item exp(@var{expr})
18347 the exponential function (@code{e ^ @var{expr}}).
18349 @item int(@var{expr})
18350 truncates to integer.
18352 @item log(@var{expr})
18353 the natural logarithm of @code{expr}.
18356 a random number between zero and one.
18358 @item sin(@var{expr})
18359 the sine of @var{expr}, which is in radians.
18361 @item sqrt(@var{expr})
18362 the square root function.
18364 @item srand(@r{[}@var{expr}@r{]})
18365 use @var{expr} as a new seed for the random number generator. If no @var{expr}
18366 is provided, the time of day is used. The return value is the previous
18367 seed for the random number generator.
18370 @code{awk} has the following built-in string functions:
18373 @item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
18374 If @var{how} is a string beginning with @samp{g} or @samp{G}, then
18375 replace each match of @var{regex} in @var{target} with @var{subst}.
18376 Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
18377 supplied, use @code{$0}. The return value is the changed string; the
18378 original @var{target} is not modified. Within @var{subst},
18379 @samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
18380 indicate the text that matched the @var{n}'th parenthesized
18382 This function is @code{gawk}-specific.
18384 @item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
18385 for each substring matching the regular expression @var{regex} in the string
18386 @var{target}, substitute the string @var{subst}, and return the number of
18387 substitutions. If @var{target} is not supplied, use @code{$0}.
18389 @item index(@var{str}, @var{search})
18390 returns the index of the string @var{search} in the string @var{str}, or
18392 @var{search} is not present.
18394 @item length(@r{[}@var{str}@r{]})
18395 returns the length of the string @var{str}. The length of @code{$0}
18396 is returned if no argument is supplied.
18398 @item match(@var{str}, @var{regex})
18399 returns the position in @var{str} where the regular expression @var{regex}
18400 occurs, or zero if @var{regex} is not present, and sets the values of
18401 @code{RSTART} and @code{RLENGTH}.
18403 @item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]})
18404 splits the string @var{str} into the array @var{arr} on the regular expression
18405 @var{regex}, and returns the number of elements. If @var{regex} is omitted,
18406 @code{FS} is used instead. @var{regex} can be the null string, causing
18407 each character to be placed into its own array element.
18408 The array @var{arr} is cleared first.
18410 @item sprintf(@var{fmt}, @var{expr-list})
18411 prints @var{expr-list} according to @var{fmt}, and returns the resulting string.
18413 @item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]})
18414 just like @code{gsub}, but only the first matching substring is replaced.
18416 @item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]})
18417 returns the @var{len}-character substring of @var{str} starting at @var{index}.
18418 If @var{len} is omitted, the rest of @var{str} is used.
18420 @item tolower(@var{str})
18421 returns a copy of the string @var{str}, with all the upper-case characters in
18422 @var{str} translated to their corresponding lower-case counterparts.
18423 Non-alphabetic characters are left unchanged.
18425 @item toupper(@var{str})
18426 returns a copy of the string @var{str}, with all the lower-case characters in
18427 @var{str} translated to their corresponding upper-case counterparts.
18428 Non-alphabetic characters are left unchanged.
18431 The I/O related functions are:
18434 @item close(@var{expr})
18435 Close the open file or pipe denoted by @var{expr}.
18437 @item fflush(@r{[}@var{expr}@r{]})
18438 Flush any buffered output for the output file or pipe denoted by @var{expr}.
18439 If @var{expr} is omitted, standard output is flushed.
18440 If @var{expr} is the null string (@code{""}), all output buffers are flushed.
18442 @item system(@var{cmd-line})
18443 Execute the command @var{cmd-line}, and return the exit status.
18444 If your operating system does not support @code{system}, calling it will
18445 generate a fatal error.
18447 @samp{system("")} can be used to force @code{awk} to flush any pending
18448 output. This is more portable, but less obvious, than calling @code{fflush}.
18451 @node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary
18452 @appendixsubsec Time Functions
18454 The following two functions are available for getting the current
18455 time of day, and for formatting time stamps.
18456 They are specific to @code{gawk}.
18460 returns the current time of day as the number of seconds since a particular
18461 epoch (Midnight, January 1, 1970 UTC, on POSIX systems).
18463 @item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]})
18464 formats @var{timestamp} according to the specification in @var{format}.
18465 The current time of day is used if no @var{timestamp} is supplied.
18466 A default format equivalent to the output of the @code{date} utility is used if
18467 no @var{format} is supplied.
18468 @xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
18469 details on the conversion specifiers that @code{strftime} accepts.
18473 @xref{Built-in, ,Built-in Functions}, for a description of all of
18474 @code{awk}'s built-in functions.
18477 @node String Constants Summary, , Time Functions Summary, Actions Summary
18478 @appendixsubsec String Constants
18480 String constants in @code{awk} are sequences of characters enclosed
18481 in double quotes (@code{"}). Within strings, certain @dfn{escape sequences}
18482 are recognized, as in C. These are:
18486 A literal backslash.
18489 The ``alert'' character; usually the ASCII BEL character.
18509 @item \x@var{hex digits}
18510 The character represented by the string of hexadecimal digits following
18511 the @samp{\x}. As in ANSI C, all following hexadecimal digits are
18512 considered part of the escape sequence. E.g., @code{"\x1B"} is a
18513 string containing the ASCII ESC (escape) character. (The @samp{\x}
18514 escape sequence is not in POSIX @code{awk}.)
18517 The character represented by the one, two, or three digit sequence of octal
18518 digits. Thus, @code{"\033"} is also a string containing the ASCII ESC
18519 (escape) character.
18522 The literal character @var{c}, if @var{c} is not one of the above.
18525 The escape sequences may also be used inside constant regular expressions
18526 (e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace
18529 @xref{Escape Sequences}.
18531 @node Functions Summary, Historical Features, Actions Summary, Gawk Summary
18532 @appendixsec User-defined Functions
18534 Functions in @code{awk} are defined as follows:
18537 function @var{name}(@var{parameter list}) @{ @var{statements} @}
18540 Actual parameters supplied in the function call are used to instantiate
18541 the formal parameters declared in the function. Arrays are passed by
18542 reference, other variables are passed by value.
18544 If there are fewer arguments passed than there are names in @var{parameter-list},
18545 the extra names are given the null string as their value. Extra names have the
18546 effect of local variables.
18548 The open-parenthesis in a function call of a user-defined function must
18549 immediately follow the function name, without any intervening white space.
18550 This is to avoid a syntactic ambiguity with the concatenation operator.
18552 The word @code{func} may be used in place of @code{function} (but not in
18555 Use the @code{return} statement to return a value from a function.
18557 @xref{User-defined, ,User-defined Functions}.
18559 @node Historical Features, , Functions Summary, Gawk Summary
18560 @appendixsec Historical Features
18562 @cindex historical features
18563 There are two features of historical @code{awk} implementations that
18564 @code{gawk} supports.
18566 First, it is possible to call the @code{length} built-in function not only
18567 with no arguments, but even without parentheses!
18574 is the same as either of
18585 $ echo abcdef | awk '@{ print length @}'
18590 This feature is marked as ``deprecated'' in the POSIX standard, and
18591 @code{gawk} will issue a warning about its use if @samp{--lint} is
18592 specified on the command line.
18593 (The ability to use @code{length} this way was actually an accident of the
18594 original Unix @code{awk} implementation. If any built-in function used
18595 @code{$0} as its default argument, it was possible to call that function
18596 without the parentheses. In particular, it was common practice to use
18597 the @code{length} function in this fashion, and this usage was documented
18598 in the @code{awk} manual page.)
18600 The other historical feature is the use of either the @code{break} statement,
18601 or the @code{continue} statement
18602 outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional
18603 @code{awk} implementations have treated such usage as equivalent to the
18604 @code{next} statement. More recent versions of Unix @code{awk} do not allow
18605 it. @code{gawk} supports this usage if @samp{--traditional} has been
18608 @xref{Options, ,Command Line Options}, for more information about the
18609 @samp{--posix} and @samp{--lint} options.
18611 @node Installation, Notes, Gawk Summary, Top
18612 @appendix Installing @code{gawk}
18614 This appendix provides instructions for installing @code{gawk} on the
18615 various platforms that are supported by the developers. The primary
18616 developers support Unix (and one day, GNU), while the other ports were
18617 contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk}
18618 distribution lists the electronic mail addresses of the people who did
18619 the respective ports, and they are also provided in
18620 @ref{Bugs, , Reporting Problems and Bugs}.
18623 * Gawk Distribution:: What is in the @code{gawk} distribution.
18624 * Unix Installation:: Installing @code{gawk} under various versions
18626 * VMS Installation:: Installing @code{gawk} on VMS.
18627 * PC Installation:: Installing and Compiling @code{gawk} on MS-DOS
18629 * Atari Installation:: Installing @code{gawk} on the Atari ST.
18630 * Amiga Installation:: Installing @code{gawk} on an Amiga.
18631 * Bugs:: Reporting Problems and Bugs.
18632 * Other Versions:: Other freely available @code{awk}
18636 @node Gawk Distribution, Unix Installation, Installation, Installation
18637 @appendixsec The @code{gawk} Distribution
18639 This section first describes how to get the @code{gawk}
18640 distribution, how to extract it, and then what is in the various files and
18644 * Getting:: How to get the distribution.
18645 * Extracting:: How to extract the distribution.
18646 * Distribution contents:: What is in the distribution.
18649 @node Getting, Extracting, Gawk Distribution, Gawk Distribution
18650 @appendixsubsec Getting the @code{gawk} Distribution
18651 @cindex getting @code{gawk}
18652 @cindex anonymous @code{ftp}
18653 @cindex @code{ftp}, anonymous
18654 @cindex Free Software Foundation
18655 There are three ways you can get GNU software.
18659 You can copy it from someone else who already has it.
18661 @cindex Free Software Foundation
18663 You can order @code{gawk} directly from the Free Software Foundation.
18664 Software distributions are available for Unix, MS-DOS, and VMS, on
18665 tape and CD-ROM. The address is:
18668 Free Software Foundation @*
18669 59 Temple Place---Suite 330 @*
18670 Boston, MA 02111-1307 USA @*
18671 Phone: +1-617-542-5942 @*
18672 Fax (including Japan): +1-617-542-2652 @*
18673 Email: @code{gnu@@gnu.org} @*
18674 URL: @code{http://www.gnu.org/} @*
18678 Ordering from the FSF directly contributes to the support of the foundation
18679 and to the production of more free software.
18682 You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
18683 @code{gnudist.gnu.org}, in the directory @file{/gnu/gawk}.
18685 Here is a list of alternate @code{ftp} sites from which you can obtain GNU
18686 software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the
18687 @var{directory} indicates the directory where GNU software is kept.
18688 You should use a site that is geographically close to you.
18693 @item cair-archive.kaist.ac.kr:/pub/gnu
18694 @itemx ftp.cs.titech.ac.jp
18695 @itemx ftp.nectec.or.th:/pub/mirrors/gnu
18696 @itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
18703 @item archie.au:/gnu
18704 (@code{archie.oz} or @code{archie.oz.au} for ACSnet)
18709 @item ftp.sun.ac.za:/pub/gnu
18714 @item ftp.technion.ac.il:/pub/unsupported/gnu
18719 @item archive.eu.net
18720 @itemx ftp.denet.dk
18721 @itemx ftp.eunet.ch
18722 @itemx ftp.funet.fi:/pub/gnu
18723 @itemx ftp.ieunet.ie:pub/gnu
18724 @itemx ftp.informatik.rwth-aachen.de:/pub/gnu
18725 @itemx ftp.informatik.tu-muenchen.de
18726 @itemx ftp.luth.se:/pub/unix/gnu
18727 @itemx ftp.mcc.ac.uk
18728 @itemx ftp.stacken.kth.se
18729 @itemx ftp.sunet.se:/pub/gnu
18730 @itemx ftp.univ-lyon1.fr:pub/gnu
18731 @itemx ftp.win.tue.nl:/pub/gnu
18732 @itemx irisa.irisa.fr:/pub/gnu
18734 @itemx nic.switch.ch:/mirror/gnu
18735 @itemx src.doc.ic.ac.uk:/gnu
18736 @itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
18739 @item South America:
18741 @item ftp.inf.utfsm.cl:/pub/gnu
18742 @itemx ftp.unicamp.br:/pub/gnu
18745 @item Western Canada:
18747 @item ftp.cs.ubc.ca:/mirror2/gnu
18752 @item col.hp.com:/mirrors/gnu
18753 @itemx f.ms.uky.edu:/pub3/gnu
18754 @itemx ftp.cc.gatech.edu:/pub/gnu
18755 @itemx ftp.cs.columbia.edu:/archives/gnu/prep
18756 @itemx ftp.digex.net:/pub/gnu
18757 @itemx ftp.hawaii.edu:/mirrors/gnu
18758 @itemx ftp.kpc.com:/pub/mirror/gnu
18763 @item USA (continued):
18765 @itemx ftp.uu.net:/systems/gnu
18766 @itemx gatekeeper.dec.com:/pub/GNU
18767 @itemx jaguar.utah.edu:/gnustuff
18768 @itemx labrea.stanford.edu
18769 @itemx mrcnext.cso.uiuc.edu:/pub/gnu
18770 @itemx vixen.cso.uiuc.edu:/gnu
18771 @itemx wuarchive.wustl.edu:/systems/gnu
18776 @node Extracting, Distribution contents, Getting, Gawk Distribution
18777 @appendixsubsec Extracting the Distribution
18778 @code{gawk} is distributed as a @code{tar} file compressed with the
18779 GNU Zip program, @code{gzip}.
18781 Once you have the distribution (for example,
18782 @file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}), first use @code{gzip} to expand the
18783 file, and then use @code{tar} to extract it. You can use the following
18784 pipeline to produce the @code{gawk} distribution:
18787 # Under System V, add 'o' to the tar flags
18788 gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
18792 This will create a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current
18795 The distribution file name is of the form
18796 @file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
18797 The @var{V} represents the major version of @code{gawk},
18798 the @var{R} represents the current release of version @var{V}, and
18799 the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
18800 been fixed in the release. The current patch level is @value{PATCHLEVEL},
18802 retrieving distributions, you should get the version with the highest
18803 version, release, and patch level. (Note that release levels greater than
18804 or equal to 90 denote ``beta,'' or non-production software; you may not wish
18805 to retrieve such a version unless you don't mind experimenting.)
18807 If you are not on a Unix system, you will need to make other arrangements
18808 for getting and extracting the @code{gawk} distribution. You should consult
18811 @node Distribution contents, , Extracting, Gawk Distribution
18812 @appendixsubsec Contents of the @code{gawk} Distribution
18814 The @code{gawk} distribution has a number of C source files,
18815 documentation files,
18816 subdirectories and files related to the configuration process
18817 (@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
18818 and several subdirectories related to different, non-Unix,
18822 @item various @samp{.c}, @samp{.y}, and @samp{.h} files
18823 These files are the actual @code{gawk} source code.
18828 @itemx README_d/README.*
18829 Descriptive files: @file{README} for @code{gawk} under Unix, and the
18830 rest for the various hardware and software combinations.
18833 A file providing an overview of the configuration and installation process.
18836 A list of systems to which @code{gawk} has been ported, and which
18837 have successfully run the test suite.
18839 @item ACKNOWLEDGMENT
18840 A list of the people who contributed major parts of the code or documentation.
18843 A detailed list of source code changes as bugs are fixed or improvements made.
18846 A list of changes to @code{gawk} since the last release or patch.
18849 The GNU General Public License.
18852 A brief list of features and/or changes being contemplated for future
18853 releases, with some indication of the time frame for the feature, based
18857 A list of those factors that limit @code{gawk}'s performance.
18858 Most of these depend on the hardware or operating system software, and
18859 are not limits in @code{gawk} itself.
18862 A description of one area where the POSIX standard for @code{awk} is
18863 incorrect, and how @code{gawk} handles the problem.
18866 A file describing known problems with the current release.
18868 @cindex artificial intelligence, using @code{gawk}
18869 @cindex AI programming, using @code{gawk}
18870 @item doc/awkforai.txt
18871 A short article describing why @code{gawk} is a good language for
18872 AI (Artificial Intelligence) programming.
18874 @item doc/README.card
18875 @itemx doc/ad.block
18876 @itemx doc/awkcard.in
18877 @itemx doc/cardfonts
18880 @itemx doc/no.colors
18881 @itemx doc/setter.outline
18882 The @code{troff} source for a five-color @code{awk} reference card.
18883 A modern version of @code{troff}, such as GNU Troff (@code{groff}) is
18884 needed to produce the color version. See the file @file{README.card}
18885 for instructions if you have an older @code{troff}.
18888 The @code{troff} source for a manual page describing @code{gawk}.
18889 This is distributed for the convenience of Unix users.
18891 @item doc/gawk.texi
18892 The Texinfo source file for this @value{DOCUMENT}.
18893 It should be processed with @TeX{} to produce a printed document, and
18894 with @code{makeinfo} to produce an Info file.
18896 @item doc/gawk.info
18897 The generated Info file for this @value{DOCUMENT}.
18900 The @code{troff} source for a manual page describing the @code{igawk}
18901 program presented in
18902 @ref{Igawk Program, ,An Easy Way to Use Library Functions}.
18904 @item doc/Makefile.in
18905 The input file used during the configuration process to generate the
18906 actual @file{Makefile} for creating the documentation.
18912 @itemx configure.in
18916 These files and subdirectory are used when configuring @code{gawk}
18917 for various Unix systems. They are explained in detail in
18918 @ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.
18920 @item awklib/extract.awk
18921 @itemx awklib/Makefile.in
18922 The @file{awklib} directory contains a copy of @file{extract.awk}
18923 (@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
18924 which can be used to extract the sample programs from the Texinfo
18925 source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
18926 @code{configure} uses to generate a @file{Makefile}.
18927 As part of the process of building @code{gawk}, the library functions from
18928 @ref{Library Functions, , A Library of @code{awk} Functions},
18929 and the @code{igawk} program from
18930 @ref{Igawk Program, , An Easy Way to Use Library Functions},
18931 are extracted into ready to use files.
18932 They are installed as part of the installation process.
18935 Files needed for building @code{gawk} on an Atari ST.
18936 @xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.
18939 Files needed for building @code{gawk} under MS-DOS and OS/2.
18940 @xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.
18943 Files needed for building @code{gawk} under VMS.
18944 @xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.
18948 @code{gawk}. You can use @samp{make check} from the top level @code{gawk}
18949 directory to run your version of @code{gawk} against the test suite.
18950 If @code{gawk} successfully passes @samp{make check} then you can
18951 be confident of a successful port.
18954 @node Unix Installation, VMS Installation, Gawk Distribution, Installation
18955 @appendixsec Compiling and Installing @code{gawk} on Unix
18957 Usually, you can compile and install @code{gawk} by typing only two
18958 commands. However, if you do use an unusual system, you may need
18959 to configure @code{gawk} for your system yourself.
18962 * Quick Installation:: Compiling @code{gawk} under Unix.
18963 * Configuration Philosophy:: How it's all supposed to work.
18966 @node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
18967 @appendixsubsec Compiling @code{gawk} for Unix
18969 @cindex installation, unix
18970 After you have extracted the @code{gawk} distribution, @code{cd}
18971 to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software,
18972 @code{gawk} is configured
18973 automatically for your Unix system by running the @code{configure} program.
18974 This program is a Bourne shell script that was generated automatically using
18975 GNU @code{autoconf}.
18977 (The @code{autoconf} software is
18979 @cite{Autoconf---Generating Automatic Configuration Scripts},
18980 which is available from the Free Software Foundation.)
18983 (The @code{autoconf} software is described fully starting with
18984 @ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
18987 To configure @code{gawk}, simply run @code{configure}:
18993 This produces a @file{Makefile} and @file{config.h} tailored to your system.
18994 The @file{config.h} file describes various facts about your system.
18995 You may wish to edit the @file{Makefile} to
18996 change the @code{CFLAGS} variable, which controls
18997 the command line options that are passed to the C compiler (such as
18998 optimization levels, or compiling for debugging).
19000 Alternatively, you can add your own values for most @code{make}
19001 variables, such as @code{CC} and @code{CFLAGS}, on the command line when
19002 running @code{configure}:
19005 CC=cc CFLAGS=-g sh ./configure
19009 See the file @file{INSTALL} in the @code{gawk} distribution for
19012 After you have run @code{configure}, and possibly edited the @file{Makefile},
19020 and shortly thereafter, you should have an executable version of @code{gawk}.
19021 That's all there is to it!
19022 (If these steps do not work, please send in a bug report;
19023 @pxref{Bugs, ,Reporting Problems and Bugs}.)
19025 @node Configuration Philosophy, , Quick Installation, Unix Installation
19026 @appendixsubsec The Configuration Process
19028 @cindex configuring @code{gawk}
19029 (This section is of interest only if you know something about using the
19030 C language and the Unix operating system.)
19032 The source code for @code{gawk} generally attempts to adhere to formal
19033 standards wherever possible. This means that @code{gawk} uses library
19034 routines that are specified by the ANSI C standard and by the POSIX
19035 operating system interface standard. When using an ANSI C compiler,
19036 function prototypes are used to help improve the compile-time checking.
19038 Many Unix systems do not support all of either the ANSI or the
19039 POSIX standards. The @file{missing} subdirectory in the @code{gawk}
19040 distribution contains replacement versions of those subroutines that are
19041 most likely to be missing.
19043 The @file{config.h} file that is created by the @code{configure} program
19044 contains definitions that describe features of the particular operating
19045 system where you are attempting to compile @code{gawk}. The three things
19046 described by this file are what header files are available, so that
19047 they can be correctly included,
19048 what (supposedly) standard functions are actually available in your C
19050 other miscellaneous facts about your
19051 variant of Unix. For example, there may not be an @code{st_blksize}
19052 element in the @code{stat} structure. In this case @samp{HAVE_ST_BLKSIZE}
19053 would be undefined.
19055 @cindex @code{custom.h} configuration file
19056 It is possible for your C compiler to lie to @code{configure}. It may
19057 do so by not exiting with an error when a library function is not
19058 available. To get around this, you can edit the file @file{custom.h}.
19059 Use an @samp{#ifdef} that is appropriate for your system, and either
19060 @code{#define} any constants that @code{configure} should have defined but
19061 didn't, or @code{#undef} any constants that @code{configure} defined and
19062 should not have. @file{custom.h} is automatically included by
19065 It is also possible that the @code{configure} program generated by
19067 will not work on your system in some other fashion. If you do have a problem,
19069 @file{configure.in} is the input for @code{autoconf}. You may be able to
19070 change this file, and generate a new version of @code{configure} that will
19071 work on your system. @xref{Bugs, ,Reporting Problems and Bugs}, for
19072 information on how to report problems in configuring @code{gawk}. The same
19073 mechanism may be used to send in updates to @file{configure.in} and/or
19076 @node VMS Installation, PC Installation, Unix Installation, Installation
19077 @appendixsec How to Compile and Install @code{gawk} on VMS
19079 @c based on material from Pat Rankin <rankin@eql.caltech.edu>
19081 @cindex installation, vms
19082 This section describes how to compile and install @code{gawk} under VMS.
19085 * VMS Compilation:: How to compile @code{gawk} under VMS.
19086 * VMS Installation Details:: How to install @code{gawk} under VMS.
19087 * VMS Running:: How to run @code{gawk} under VMS.
19088 * VMS POSIX:: Alternate instructions for VMS POSIX.
19091 @node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation
19092 @appendixsubsec Compiling @code{gawk} on VMS
19094 To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that
19095 will issue all the necessary @code{CC} and @code{LINK} commands, and there is
19096 also a @file{Makefile} for use with the @code{MMS} utility. From the source
19097 directory, use either
19100 $ @@[.VMS]VMSBUILD.COM
19107 $ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
19110 Depending upon which C compiler you are using, follow one of the sets
19111 of instructions in this table:
19115 Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use
19116 @code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.
19119 You must have Version 2.3 or 2.4; older ones won't work. Edit either
19120 @file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
19121 For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
19122 Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
19123 and comment out or delete the two lines @samp{#define __STDC__ 0} and
19124 @samp{#define VAXC_BUILTINS} near the end.
19127 Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
19128 from those for VAX C V2.x, but equally straightforward. No changes to
19129 @file{config.h} should be needed.
19132 Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
19133 No changes to @file{config.h} should be needed.
19136 @code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
19137 GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.
19139 @node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation
19140 @appendixsubsec Installing @code{gawk} on VMS
19142 To install @code{gawk}, all you need is a ``foreign'' command, which is
19143 a @code{DCL} symbol whose value begins with a dollar sign. For example:
19146 $ GAWK :== $disk1:[gnubin]GAWK
19150 (Substitute the actual location of @code{gawk.exe} for
19151 @samp{$disk1:[gnubin]}.) The symbol should be placed in the
19152 @file{login.com} of any user who wishes to run @code{gawk},
19153 so that it will be defined every time the user logs on.
19154 Alternatively, the symbol may be placed in the system-wide
19155 @file{sylogin.com} procedure, which will allow all users
19156 to run @code{gawk}.
19158 Optionally, the help entry can be loaded into a VMS help library:
19161 $ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
19165 (You may want to substitute a site-specific help library rather than
19166 the standard VMS library @samp{HELPLIB}.) After loading the help text,
19173 will provide information about both the @code{gawk} implementation and the
19174 @code{awk} programming language.
19176 The logical name @samp{AWK_LIBRARY} can designate a default location
19177 for @code{awk} program files. For the @samp{-f} option, if the specified
19178 filename has no device or directory path information in it, @code{gawk}
19179 will look in the current directory first, then in the directory specified
19180 by the translation of @samp{AWK_LIBRARY} if the file was not found.
19181 If after searching in both directories, the file still is not found,
19182 then @code{gawk} appends the suffix @samp{.awk} to the filename and the
19183 file search will be re-tried. If @samp{AWK_LIBRARY} is not defined, that
19184 portion of the file search will fail benignly.
19186 @node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation
19187 @appendixsubsec Running @code{gawk} on VMS
19189 Command line parsing and quoting conventions are significantly different
19190 on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
19191 changes. They @emph{are} minor though, and all @code{awk} programs
19192 should run correctly.
19194 Here are a couple of trivial tests:
19197 $ gawk -- "BEGIN @{print ""Hello, World!""@}"
19198 $ gawk -"W" version
19199 ! could also be -"W version" or "-W version"
19203 Note that upper-case and mixed-case text must be quoted.
19205 The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition
19206 to the original shell-style interface (see the help entry for details).
19207 One side-effect of dual command line parsing is that if there is only a
19208 single parameter (as in the quoted string program above), the command
19209 becomes ambiguous. To work around this, the normally optional @samp{--}
19210 flag is required to force Unix style rather than @code{DCL} parsing. If any
19211 other dash-type options (or multiple parameters such as data files to be
19212 processed) are present, there is no ambiguity and @samp{--} can be omitted.
19214 The default search path when looking for @code{awk} program files specified
19215 by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
19216 name @samp{AWKPATH} can be used to override this default. The format
19217 of @samp{AWKPATH} is a comma-separated list of directory specifications.
19218 When defining it, the value should be quoted so that it retains a single
19219 translation, and not a multi-translation @code{RMS} searchlist.
19221 @node VMS POSIX, , VMS Running, VMS Installation
19222 @appendixsubsec Building and Using @code{gawk} on VMS POSIX
19224 Ignore the instructions above, although @file{vms/gawk.hlp} should still
19225 be made available in a help library. The source tree should be unpacked
19226 into a container file subsystem rather than into the ordinary VMS file
19227 system. Make sure that the two scripts, @file{configure} and
19228 @file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if
19229 necessary. Then execute the following two commands:
19233 psx> CC=vms/posix-cc.sh configure
19234 psx> make CC=c89 gawk
19239 The first command will construct files @file{config.h} and @file{Makefile} out
19240 of templates, using a script to make the C compiler fit @code{configure}'s
19241 expectations. The second command will compile and link @code{gawk} using
19242 the C compiler directly; ignore any warnings from @code{make} about being
19243 unable to redefine @code{CC}. @code{configure} will take a very long
19244 time to execute, but at least it provides incremental feedback as it
19247 This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.
19249 Once built, @code{gawk} will work like any other shell utility. Unlike
19250 the normal VMS port of @code{gawk}, no special command line manipulation is
19251 needed in the VMS POSIX environment.
19253 @c Rewritten by Scott Deifik <scottd@amgen.com>
19254 @c and Darrel Hankerson <hankedr@mail.auburn.edu>
19255 @node PC Installation, Atari Installation, VMS Installation, Installation
19256 @appendixsec MS-DOS and OS/2 Installation and Compilation
19258 @cindex installation, MS-DOS and OS/2
19259 If you have received a binary distribution prepared by the DOS
19260 maintainers, then @code{gawk} and the necessary support files will appear
19261 under the @file{gnu} directory, with executables in @file{gnu/bin},
19262 libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
19263 This is designed for easy installation to a @file{/gnu} directory on your
19264 drive, but the files can be installed anywhere provided @code{AWKPATH} is
19265 set properly. Regardless of the installation directory, the first line of
19266 @file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
19269 The binary distribution will contain a separate file describing the
19270 contents. In particular, it may include more than one version of the
19271 @code{gawk} executable. OS/2 binary distributions may have a
19272 different arrangement, but installation is similar.
19274 The OS/2 and MS-DOS versions of @code{gawk} search for program files as
19275 described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
19276 However, semicolons (rather than colons) separate elements
19277 in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,
19278 then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.
19280 An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS
19281 or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.
19282 Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a
19283 @code{ksh} clone and GNU Bash are available for OS/2. The file
19284 @file{README_d/README.pc} in the @code{gawk} distribution contains
19285 information on these shells. Users of Stewartson's shell on DOS should
19286 examine its documentation on handling of command-lines. In particular,
19287 the setting for @code{gawk} in the shell configuration may need to be
19288 changed, and the @code{ignoretype} option may also be of interest.
19290 @code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools
19291 from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).
19292 Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file
19293 @file{README_d/README.pc} in the @code{gawk} distribution contains additional
19294 notes, and @file{pc/Makefile} contains important notes on compilation options.
19296 To build @code{gawk}, copy the files in the @file{pc} directory (@emph{except}
19297 for @file{ChangeLog}) to the
19298 directory with the rest of the @code{gawk} sources. The @file{Makefile}
19299 contains a configuration section with comments, and may need to be
19300 edited in order to work with your @code{make} utility.
19302 The @file{Makefile} contains a number of targets for building various MS-DOS
19303 and OS/2 versions. A list of targets will be printed if the @code{make}
19304 command is given without a target. As an example, to build @code{gawk}
19305 using the DJGPP tools, enter @samp{make djgpp}.
19307 Using @code{make} to run the standard tests and to install @code{gawk}
19308 requires additional Unix-like tools, including @code{sh}, @code{sed}, and
19309 @code{cp}. In order to run the tests, the @file{test/*.ok} files may need to
19310 be converted so that they have the usual DOS-style end-of-line markers. Most
19311 of the tests will work properly with Stewartson's shell along with the
19312 companion utilities or appropriate GNU utilities. However, some editing of
19313 @file{test/Makefile} is required. It is recommended that the file
19314 @file{pc/Makefile.tst} be copied to @file{test/Makefile} as a
19315 replacement. Details can be found in @file{README_d/README.pc}.
19317 @node Atari Installation, Amiga Installation, PC Installation, Installation
19318 @appendixsec Installing @code{gawk} on the Atari ST
19320 @c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>
19323 @cindex installation, atari
19324 There are no substantial differences when installing @code{gawk} on
19325 various Atari models. Compiled @code{gawk} executables do not require
19326 a large amount of memory with most @code{awk} programs and should run on all
19327 Motorola processor based models (called further ST, even if that is not
19330 In order to use @code{gawk}, you need to have a shell, either text or
19331 graphics, that does not map all the characters of a command line to
19332 upper-case. Maintaining case distinction in option flags is very
19333 important (@pxref{Options, ,Command Line Options}).
19334 These days this is the default, and it may only be a problem for some
19335 very old machines. If your system does not preserve the case of option
19336 flags, you will need to upgrade your tools. Support for I/O
19337 redirection is necessary to make it easy to import @code{awk} programs
19338 from other environments. Pipes are nice to have, but not vital.
19341 * Atari Compiling:: Compiling @code{gawk} on Atari
19342 * Atari Using:: Running @code{gawk} on Atari
19345 @node Atari Compiling, Atari Using, Atari Installation, Atari Installation
19346 @appendixsubsec Compiling @code{gawk} on the Atari ST
19348 A proper compilation of @code{gawk} sources when @code{sizeof(int)}
19349 differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial
19350 port was done with @code{gcc}. You may actually prefer executables
19351 where @code{int}s are four bytes wide, but the other variant works as well.
19353 You may need quite a bit of memory when trying to recompile the @code{gawk}
19354 sources, as some source files (@file{regex.c} in particular) are quite
19355 big. If you run out of memory compiling such a file, try reducing the
19356 optimization level for this particular file; this may help.
19359 With a reasonable shell (Bash will do), and in particular if you run
19360 Linux, MiNT or a similar operating system, you have a pretty good
19361 chance that the @code{configure} utility will succeed. Otherwise
19362 sample versions of @file{config.h} and @file{Makefile.st} are given in the
19363 @file{atari} subdirectory and can be edited and copied to the
19364 corresponding files in the main source directory. Even if
19365 @code{configure} produced something, it might be advisable to compare
19366 its results with the sample versions and possibly make adjustments.
19368 Some @code{gawk} source code fragments depend on a preprocessor define
19369 @samp{atarist}. This basically assumes the TOS environment with @code{gcc}.
19370 Modify these sections as appropriate if they are not right for your
19371 environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in
19372 @ref{Atari Using, ,Running @code{gawk} on the Atari ST}.
19374 As shipped, the sample @file{config.h} claims that the @code{system}
19375 function is missing from the libraries, which is not true, and an
19376 alternative implementation of this function is provided in
19377 @file{atari/system.c}. Depending upon your particular combination of
19378 shell and operating system, you may wish to change the file to indicate
19379 that @code{system} is available.
19381 @node Atari Using, , Atari Compiling, Atari Installation
19382 @appendixsubsec Running @code{gawk} on the Atari ST
19384 An executable version of @code{gawk} should be placed, as usual,
19385 anywhere in your @code{PATH} where your shell can find it.
19387 While executing, @code{gawk} creates a number of temporary files. When
19388 using @code{gcc} libraries for TOS, @code{gawk} looks for either of
19389 the environment variables @code{TEMP} or @code{TMPDIR}, in that order.
19390 If either one is found, its value is assumed to be a directory for
19391 temporary files. This directory must exist, and if you can spare the
19392 memory, it is a good idea to put it on a RAM drive. If neither
19393 @code{TEMP} nor @code{TMPDIR} are found, then @code{gawk} uses the
19394 current directory for its temporary files.
19396 The ST version of @code{gawk} searches for its program files as described in
19397 @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
19398 The default value for the @code{AWKPATH} variable is taken from
19399 @code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS
19400 @file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
19401 @code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
19402 modified by explicitly setting @code{AWKPATH} to whatever you wish.
19403 Note that colons cannot be used on the ST to separate elements in the
19404 @code{AWKPATH} variable, since they have another, reserved, meaning.
19405 Instead, you must use a comma to separate elements in the path. When
19406 recompiling, the separating character can be modified by initializing
19407 the @code{envsep} variable in @file{atari/gawkmisc.atr} to another
19410 Although @code{awk} allows great flexibility in doing I/O redirections
19411 from within a program, this facility should be used with care on the ST
19412 running under TOS. In some circumstances the OS routines for file
19413 handle pool processing lose track of certain events, causing the
19414 computer to crash, and requiring a reboot. Often a warm reboot is
19415 sufficient. Fortunately, this happens infrequently, and in rather
19416 esoteric situations. In particular, avoid having one part of an
19417 @code{awk} program using @code{print} statements explicitly redirected
19418 to @code{"/dev/stdout"}, while other @code{print} statements use the
19419 default standard output, and a calling shell has redirected standard
19422 When @code{gawk} is compiled with the ST version of @code{gcc} and its
19423 usual libraries, it will accept both @samp{/} and @samp{\} as path separators.
19424 While this is convenient, it should be remembered that this removes one,
19425 technically valid, character (@samp{/}) from your file names, and that
19426 it may create problems for external programs, called via the @code{system}
19427 function, which may not support this convention. Whenever it is possible
19428 that a file created by @code{gawk} will be used by some other program,
19429 use only backslashes. Also remember that in @code{awk}, backslashes in
19430 strings have to be doubled in order to get literal backslashes
19431 (@pxref{Escape Sequences}).
19433 @node Amiga Installation, Bugs, Atari Installation, Installation
19434 @appendixsec Installing @code{gawk} on an Amiga
19437 @cindex installation, amiga
19438 You can install @code{gawk} on an Amiga system using a Unix emulation
19439 environment available via anonymous @code{ftp} from
19440 @code{ftp.ninemoons.com} in the directory @file{pub/ade/current}.
19441 This includes a shell based on @code{pdksh}. The primary component of
19442 this environment is a Unix emulation library, @file{ixemul.lib}.
19443 @c could really use more background here, who wrote this, etc.
19445 A more complete distribution for the Amiga is available on
19446 the Geek Gadgets CD-ROM from:
19450 1840 E. Warner Road #105-265 @*
19451 Tempe, AZ 85284 USA @*
19452 US Toll Free: (800) 804-0833 @*
19453 Phone: +1-602-491-0442 @*
19454 FAX: +1-602-491-0048 @*
19455 Email: @code{info@@ninemoons.com} @*
19456 WWW: @code{http://www.ninemoons.com} @*
19457 Anonymous @code{ftp} site: @code{ftp.ninemoons.com} @*
19460 Once you have the distribution, you can configure @code{gawk} simply by
19461 running @code{configure}:
19464 configure -v m68k-amigaos
19467 Then run @code{make}, and you should be all set!
19468 (If these steps do not work, please send in a bug report;
19469 @pxref{Bugs, ,Reporting Problems and Bugs}.)
19471 @node Bugs, Other Versions, Amiga Installation, Installation
19472 @appendixsec Reporting Problems and Bugs
19474 @i{There is nothing more dangerous than a bored archeologist.}
19475 The Hitchhiker's Guide to the Galaxy
19476 @c the radio show, not the book. :-)
19480 If you have problems with @code{gawk} or think that you have found a bug,
19481 please report it to the developers; we cannot promise to do anything
19482 but we might well want to fix it.
19484 Before reporting a bug, make sure you have actually found a real bug.
19485 Carefully reread the documentation and see if it really says you can do
19486 what you're trying to do. If it's not clear whether you should be able
19487 to do something or not, report that too; it's a bug in the documentation!
19489 Before reporting a bug or trying to fix it yourself, try to isolate it
19490 to the smallest possible @code{awk} program and input data file that
19491 reproduces the problem. Then send us the program and data file,
19492 some idea of what kind of Unix system you're using, and the exact results
19493 @code{gawk} gave you. Also say what you expected to occur; this will help
19494 us decide whether the problem was really in the documentation.
19496 Once you have a precise problem, send email to @email{bug-gawk@@gnu.org}.
19498 Please include the version number of @code{gawk} you are using.
19499 You can get this information with the command @samp{gawk --version}.
19500 Using this address will automatically send a carbon copy of your
19501 mail to Arnold Robbins. If necessary, he can be reached directly at
19502 @email{arnold@@gnu.org}.
19504 @cindex @code{comp.lang.awk}
19505 @strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by
19506 posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
19507 While the @code{gawk} developers do occasionally read this newsgroup,
19508 there is no guarantee that we will see your posting. The steps described
19509 above are the official, recognized ways for reporting bugs.
19511 Non-bug suggestions are always welcome as well. If you have questions
19512 about things that are unclear in the documentation or are just obscure
19513 features, ask Arnold Robbins; he will try to help you out, although he
19514 may not have the time to fix the problem. You can send him electronic
19515 mail at the Internet address above.
19517 If you find bugs in one of the non-Unix ports of @code{gawk}, please send
19518 an electronic mail message to the person who maintains that port. They
19519 are listed below, and also in the @file{README} file in the @code{gawk}
19520 distribution. Information in the @file{README} file should be considered
19521 authoritative if it conflicts with this @value{DOCUMENT}.
19523 @c NEEDED for looks
19525 The people maintaining the non-Unix ports of @code{gawk} are:
19527 @cindex Deifik, Scott
19529 @cindex Hankerson, Darrel
19530 @cindex Jaegermann, Michal
19531 @cindex Rankin, Pat
19532 @cindex Rommel, Kai Uwe
19535 Scott Deifik, @samp{scottd@@amgen.com}, and
19536 Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.
19539 Kai Uwe Rommel, @samp{rommel@@ars.de}.
19542 Pat Rankin, @samp{rankin@@eql.caltech.edu}.
19545 Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.
19548 Fred Fish, @samp{fnf@@ninemoons.com}.
19551 If your bug is also reproducible under Unix, please send copies of your
19552 report to the general GNU bug list, as well as to Arnold Robbins, at the
19553 addresses listed above.
19555 @node Other Versions, , Bugs, Installation
19556 @appendixsec Other Freely Available @code{awk} Implementations
19557 @cindex Brennan, Michael
19559 From: emory!amc.com!brennan (Michael Brennan)
19560 Subject: C++ comments in awk programs
19561 To: arnold@gnu.ai.mit.edu (Arnold Robbins)
19562 Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)
19566 @i{It's kind of fun to put comments like this in your awk code.}
19567 @code{// Do C++ comments work? answer: yes! of course}
19572 There are two other freely available @code{awk} implementations.
19573 This section briefly describes where to get them.
19576 @cindex Kernighan, Brian
19577 @cindex anonymous @code{ftp}
19578 @cindex @code{ftp}, anonymous
19579 @item Unix @code{awk}
19580 Brian Kernighan has been able to make his implementation of
19581 @code{awk} freely available. You can get it via anonymous @code{ftp}
19582 to the host @code{@w{netlib.bell-labs.com}}. Change directory to
19583 @file{/netlib/research}. Use ``binary'' or ``image'' mode, and
19584 retrieve @file{awk.bundle.gz}.
19586 This is a shell archive that has been compressed with the GNU @code{gzip}
19587 utility. It can be uncompressed with the @code{gunzip} utility.
19589 You can also retrieve this version via the World Wide Web from his
19590 @uref{http://cm.bell-labs.com/who/bwk, home page}.
19592 This version requires an ANSI C compiler; GCC (the GNU C compiler)
19593 works quite nicely.
19595 @cindex Brennan, Michael
19596 @cindex @code{mawk}
19598 Michael Brennan has written an independent implementation of @code{awk},
19599 called @code{mawk}. It is available under the GPL
19600 (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
19601 just as @code{gawk} is.
19603 You can get it via anonymous @code{ftp} to the host
19604 @code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}.
19605 Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
19606 (or the latest version that is there).
19608 @code{gunzip} may be used to decompress this file. Installation
19609 is similar to @code{gawk}'s
19610 (@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).
19613 @node Notes, Glossary, Installation, Top
19614 @appendix Implementation Notes
19616 This appendix contains information mainly of interest to implementors and
19617 maintainers of @code{gawk}. Everything in it applies specifically to
19618 @code{gawk}, and not to other implementations.
19621 * Compatibility Mode:: How to disable certain @code{gawk} extensions.
19622 * Additions:: Making Additions To @code{gawk}.
19623 * Future Extensions:: New features that may be implemented one day.
19624 * Improvements:: Suggestions for improvements by volunteers.
19627 @node Compatibility Mode, Additions, Notes, Notes
19628 @appendixsec Downward Compatibility and Debugging
19630 @xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
19631 for a summary of the GNU extensions to the @code{awk} language and program.
19632 All of these features can be turned off by invoking @code{gawk} with the
19633 @samp{--traditional} option, or with the @samp{--posix} option.
19635 If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
19636 is one more option available on the command line:
19639 @item -W parsedebug
19640 @itemx --parsedebug
19641 Print out the parse stack information as the program is being parsed.
19644 This option is intended only for serious @code{gawk} developers,
19645 and not for the casual user. It probably has not even been compiled into
19646 your version of @code{gawk}, since it slows down execution.
19648 @node Additions, Future Extensions, Compatibility Mode, Notes
19649 @appendixsec Making Additions to @code{gawk}
19651 If you should find that you wish to enhance @code{gawk} in a significant
19652 fashion, you are perfectly free to do so. That is the point of having
19653 free software; the source code is available, and you are free to change
19654 it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19656 This section discusses the ways you might wish to change @code{gawk},
19657 and any considerations you should bear in mind.
19660 * Adding Code:: Adding code to the main body of @code{gawk}.
19661 * New Ports:: Porting @code{gawk} to a new operating system.
19664 @node Adding Code, New Ports, Additions, Additions
19665 @appendixsubsec Adding New Features
19667 @cindex adding new features
19668 @cindex features, adding
19669 You are free to add any new features you like to @code{gawk}.
19670 However, if you want your changes to be incorporated into the @code{gawk}
19671 distribution, there are several steps that you need to take in order to
19672 make it possible for me to include your changes.
19676 Get the latest version.
19677 It is much easier for me to integrate changes if they are relative to
19678 the most recent distributed version of @code{gawk}. If your version of
19679 @code{gawk} is very old, I may not be able to integrate them at all.
19680 @xref{Getting, ,Getting the @code{gawk} Distribution},
19681 for information on getting the latest version of @code{gawk}.
19685 Follow the @cite{GNU Coding Standards}.
19688 See @inforef{Top, , Version, standards, GNU Coding Standards}.
19690 This document describes how GNU software should be written. If you haven't
19691 read it, please do so, preferably @emph{before} starting to modify @code{gawk}.
19692 (The @cite{GNU Coding Standards} are available as part of the Autoconf
19693 distribution, from the FSF.)
19695 @cindex @code{gawk} coding style
19696 @cindex coding style used in @code{gawk}
19698 Use the @code{gawk} coding style.
19699 The C code for @code{gawk} follows the instructions in the
19700 @cite{GNU Coding Standards}, with minor exceptions. The code is formatted
19701 using the traditional ``K&R'' style, particularly as regards the placement
19702 of braces and the use of tabs. In brief, the coding rules for @code{gawk}
19707 Use old style (non-prototype) function headers when defining functions.
19710 Put the name of the function at the beginning of its own line.
19713 Put the return type of the function, even if it is @code{int}, on the
19714 line above the line with the name and arguments of the function.
19717 The declarations for the function arguments should not be indented.
19720 Put spaces around parentheses used in control structures
19721 (@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
19722 and @code{return}).
19725 Do not put spaces in front of parentheses used in function calls.
19728 Put spaces around all C operators, and after commas in function calls.
19731 Do not use the comma operator to produce multiple side-effects, except
19732 in @code{for} loop initialization and increment parts, and in macro bodies.
19735 Use real tabs for indenting, not spaces.
19738 Use the ``K&R'' brace layout style.
19741 Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
19742 @code{if}, @code{while} and @code{for} statements, and in the @code{case}s
19743 of @code{switch} statements, instead of just the
19744 plain pointer or character value.
19747 Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
19748 and the character constant @code{'\0'} where appropriate, instead of @code{1}
19752 Provide one-line descriptive comments for each function.
19755 Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.
19758 Do not use the @code{alloca} function for allocating memory off the stack.
19759 Its use causes more portability trouble than the minor benefit of not having
19760 to free the storage. Instead, use @code{malloc} and @code{free}.
19763 If I have to reformat your code to follow the coding style used in
19764 @code{gawk}, I may not bother.
19767 Be prepared to sign the appropriate paperwork.
19768 In order for the FSF to distribute your changes, you must either place
19769 those changes in the public domain, and submit a signed statement to that
19770 effect, or assign the copyright in your changes to the FSF.
19771 Both of these actions are easy to do, and @emph{many} people have done so
19772 already. If you have questions, please contact me
19773 (@pxref{Bugs, , Reporting Problems and Bugs}),
19774 or @code{gnu@@gnu.org}.
19777 Update the documentation.
19778 Along with your new code, please supply new sections and or chapters
19779 for this @value{DOCUMENT}. If at all possible, please use real
19780 Texinfo, instead of just supplying unformatted ASCII text (although
19781 even that is better than no documentation at all).
19782 Conventions to be followed in @cite{@value{TITLE}} are provided
19783 after the @samp{@@bye} at the end of the Texinfo source file.
19784 If possible, please update the man page as well.
19786 You will also have to sign paperwork for your documentation changes.
19789 Submit changes as context diffs or unified diffs.
19790 Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
19791 the original @code{gawk} source tree with your version.
19792 (I find context diffs to be more readable, but unified diffs are
19794 I recommend using the GNU version of @code{diff}.
19795 Send the output produced by either run of @code{diff} to me when you
19796 submit your changes.
19797 @xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
19800 Using this format makes it easy for me to apply your changes to the
19801 master version of the @code{gawk} source code (using @code{patch}).
19802 If I have to apply the changes manually, using a text editor, I may
19803 not do so, particularly if there are lots of changes.
19806 Include an entry for the @file{ChangeLog} file with your submission.
19807 This further helps minimize the amount of work I have to do,
19808 making it easier for me to accept patches.
19811 Although this sounds like a lot of work, please remember that while you
19812 may write the new code, I have to maintain it and support it, and if it
19813 isn't possible for me to do that with a minimum of extra work, then I
19817 @node New Ports, , Adding Code, Additions
19818 @appendixsubsec Porting @code{gawk} to a New Operating System
19820 @cindex porting @code{gawk}
19821 If you wish to port @code{gawk} to a new operating system, there are
19822 several steps to follow.
19826 Follow the guidelines in
19827 @ref{Adding Code, ,Adding New Features},
19828 concerning coding style, submission of diffs, and so on.
19831 When doing a port, bear in mind that your code must co-exist peacefully
19832 with the rest of @code{gawk}, and the other ports. Avoid gratuitous
19833 changes to the system-independent parts of the code. If at all possible,
19834 avoid sprinkling @samp{#ifdef}s just for your port throughout the
19837 If the changes needed for a particular system affect too much of the
19838 code, I probably will not accept them. In such a case, you will, of course,
19839 be able to distribute your changes on your own, as long as you comply
19841 (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
19844 A number of the files that come with @code{gawk} are maintained by other
19845 people at the Free Software Foundation. Thus, you should not change them
19846 unless it is for a very good reason. I.e.@: changes are not out of the
19847 question, but changes to these files will be scrutinized extra carefully.
19848 The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
19849 @file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
19850 @file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.
19853 Be willing to continue to maintain the port.
19854 Non-Unix operating systems are supported by volunteers who maintain
19855 the code needed to compile and run @code{gawk} on their systems. If no-one
19856 volunteers to maintain a port, that port becomes unsupported, and it may
19857 be necessary to remove it from the distribution.
19860 Supply an appropriate @file{gawkmisc.???} file.
19861 Each port has its own @file{gawkmisc.???} that implements certain
19862 operating system specific functions. This is cleaner than a plethora of
19863 @samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
19864 the main source directory includes the appropriate
19865 @file{gawkmisc.???} file from each subdirectory.
19866 Be sure to update it as well.
19868 Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
19869 or operating system for the port. For example, @file{pc/gawkmisc.pc} and
19870 @file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
19871 @file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
19872 into the main subdirectory, without accidentally destroying the real
19873 @file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS
19877 Supply a @file{Makefile} and any other C source and header files that are
19878 necessary for your operating system. All your code should be in a
19879 separate subdirectory, with a name that is the same as, or reminiscent
19880 of, either your operating system or the computer system. If possible,
19881 try to structure things so that it is not necessary to move files out
19882 of the subdirectory into the main source directory. If that is not
19883 possible, then be sure to avoid using names for your files that
19884 duplicate the names of files in the main source directory.
19887 Update the documentation.
19888 Please write a section (or sections) for this @value{DOCUMENT} describing the
19889 installation and compilation steps needed to install and/or compile
19890 @code{gawk} for your system.
19893 Be prepared to sign the appropriate paperwork.
19894 In order for the FSF to distribute your code, you must either place
19895 your code in the public domain, and submit a signed statement to that
19896 effect, or assign the copyright in your code to the FSF.
19898 Both of these actions are easy to do, and @emph{many} people have done so
19899 already. If you have questions, please contact me, or
19900 @code{gnu@@gnu.org}.
19904 Following these steps will make it much easier to integrate your changes
19905 into @code{gawk}, and have them co-exist happily with the code for other
19906 operating systems that is already there.
19908 In the code that you supply, and that you maintain, feel free to use a
19909 coding style and brace layout that suits your taste.
19911 @node Future Extensions, Improvements, Additions, Notes
19912 @appendixsec Probable Future Extensions
19914 From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
19915 Return-Path: <emory!scalpel.netlabs.com!lwall>
19916 Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
19917 To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
19918 Subject: Re: May I quote you?
19919 In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
19920 <m0tAHPQ-00014MC@skeeve.atl.ga.us>
19921 Date: Tue, 31 Oct 95 09:32:46 -0800
19922 From: Larry Wall <emory!scalpel.netlabs.com!lwall>
19924 : Greetings. I am working on the release of gawk 3.0. Part of it will be a
19925 : thoroughly updated manual. One of the sections deals with planned future
19926 : extensions and enhancements. I have the following at the beginning
19930 : @cindex Wall, Larry
19932 : @i{AWK is a language similar to PERL, only considerably more elegant.} @*
19939 : Before I actually release this for publication, I wanted to get your
19940 : permission to quote you. (Hopefully, in the spirit of much of GNU, the
19941 : implied humor is visible... :-)
19943 I think that would be fine.
19948 @cindex Wall, Larry
19950 @i{AWK is a language similar to PERL, only considerably more elegant.}
19958 This section briefly lists extensions and possible improvements
19959 that indicate the directions we are
19960 currently considering for @code{gawk}. The file @file{FUTURES} in the
19961 @code{gawk} distributions lists these extensions as well.
19963 This is a list of probable future changes that will be usable by the
19964 @code{awk} language programmer.
19966 @c these are ordered by likelihood
19969 The GNU project is starting to support multiple languages.
19970 It will at least be possible to make @code{gawk} print its warnings and
19971 error messages in languages other than English.
19972 It may be possible for @code{awk} programs to also use the multiple
19973 language facilities, separate from @code{gawk} itself.
19976 It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.
19978 @item A @code{PROCINFO} Array
19979 The special files that provide process-related information
19980 (@pxref{Special Files, ,Special File Names in @code{gawk}})
19981 will be superseded by a @code{PROCINFO} array that would provide the same
19982 information, in an easier to access fashion.
19984 @item More @code{lint} warnings
19985 There are more things that could be checked for portability.
19987 @item Control of subprocess environment
19988 Changes made in @code{gawk} to the array @code{ENVIRON} may be
19989 propagated to subprocesses run by @code{gawk}.
19992 @item @code{RECLEN} variable for fixed length records
19993 Along with @code{FIELDWIDTHS}, this would speed up the processing of
19994 fixed-length records.
19996 @item A @code{restart} keyword
19997 After modifying @code{$0}, @code{restart} would restart the pattern
19998 matching loop, without reading a new record from the input.
20000 @item A @samp{|&} redirection
20001 The @samp{|&} redirection, in place of @samp{|}, would open a two-way
20002 pipeline for communication with a sub-process (via @code{getline} and
20003 @code{print} and @code{printf}).
20005 @item Function valued variables
20006 It would be possible to assign the name of a user-defined or built-in
20007 function to a regular @code{awk} variable, and then call the function
20008 indirectly, by using the regular variable. This would make it possible
20009 to write general purpose sorting and comparing routines, for example,
20010 by simply passing the name of one function into another.
20012 @item A built-in @code{stat} function
20013 The @code{stat} function would provide an easy-to-use hook to the
20014 @code{stat} system call so that @code{awk} programs could determine information
20017 @item A built-in @code{ftw} function
20018 Combined with function valued variables and the @code{stat} function,
20019 @code{ftw} (file tree walk) would make it easy for an @code{awk} program
20020 to walk an entire file tree.
20024 This is a list of probable improvements that will make @code{gawk}
20028 @item An Improved Version of @code{dfa}
20029 The @code{dfa} pattern matcher from GNU @code{grep} has some
20030 problems. Either a new version or a fixed one will deal with some
20031 important regexp matching issues.
20033 @item Use of GNU @code{malloc}
20034 The GNU version of @code{malloc} could potentially speed up @code{gawk},
20035 since it relies heavily on the use of dynamic memory allocation.
20039 @node Improvements, , Future Extensions, Notes
20040 @appendixsec Suggestions for Improvements
20042 Here are some projects that would-be @code{gawk} hackers might like to take
20043 on. They vary in size from a few days to a few weeks of programming,
20044 depending on which one you choose and how fast a programmer you are. Please
20045 send any improvements you write to the maintainers at the GNU project.
20046 @xref{Adding Code, , Adding New Features},
20047 for guidelines to follow when adding new features to @code{gawk}.
20048 @xref{Bugs, ,Reporting Problems and Bugs}, for information on
20049 contacting the maintainers.
20053 Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like)
20054 parser to convert the script given it into a syntax tree; the syntax
20055 tree is then executed by a simple recursive evaluator. This method incurs
20056 a lot of overhead, since the recursive evaluator performs many procedure
20057 calls to do even the simplest things.
20059 It should be possible for @code{gawk} to convert the script's parse tree
20060 into a C program which the user would then compile, using the normal
20061 C compiler and a special @code{gawk} library to provide all the needed
20062 functions (regexps, fields, associative arrays, type coercion, and so
20065 An easier possibility might be for an intermediate phase of @code{awk} to
20066 convert the parse tree into a linear byte code form like the one used
20067 in GNU Emacs Lisp. The recursive evaluator would then be replaced by
20068 a straight line byte code interpreter that would be intermediate in speed
20069 between running a compiled program and doing what @code{gawk} does
20073 The programs in the test suite could use documenting in this @value{DOCUMENT}.
20076 See the @file{FUTURES} file for more ideas. Contact us if you would
20077 seriously like to tackle any of the items listed there.
20080 @node Glossary, Copying, Notes, Top
20085 A series of @code{awk} statements attached to a rule. If the rule's
20086 pattern matches an input record, @code{awk} executes the
20087 rule's action. Actions are always enclosed in curly braces.
20088 @xref{Action Overview, ,Overview of Actions}.
20090 @item Amazing @code{awk} Assembler
20091 Henry Spencer at the University of Toronto wrote a retargetable assembler
20092 completely as @code{awk} scripts. It is thousands of lines long, including
20093 machine descriptions for several eight-bit microcomputers.
20094 It is a good example of a
20095 program that would have been better written in another language.
20097 @item Amazingly Workable Formatter (@code{awf})
20098 Henry Spencer at the University of Toronto wrote a formatter that accepts
20099 a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting
20100 commands, using @code{awk} and @code{sh}.
20103 The American National Standards Institute. This organization produces
20104 many standards, among them the standards for the C and C++ programming
20108 An @code{awk} expression that changes the value of some @code{awk}
20109 variable or data object. An object that you can assign to is called an
20110 @dfn{lvalue}. The assigned values are called @dfn{rvalues}.
20111 @xref{Assignment Ops, ,Assignment Expressions}.
20113 @item @code{awk} Language
20114 The language in which @code{awk} programs are written.
20116 @item @code{awk} Program
20117 An @code{awk} program consists of a series of @dfn{patterns} and
20118 @dfn{actions}, collectively known as @dfn{rules}. For each input record
20119 given to the program, the program's rules are all processed in turn.
20120 @code{awk} programs may also contain function definitions.
20122 @item @code{awk} Script
20123 Another name for an @code{awk} program.
20126 The GNU version of the standard shell (the Bourne-Again shell).
20127 See ``Bourne Shell.''
20130 See ``Bulletin Board System.''
20132 @item Boolean Expression
20133 Named after the English mathematician Boole. See ``Logical Expression.''
20136 The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
20137 originally written by Steven R.@: Bourne.
20138 Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
20139 generally upwardly compatible with the Bourne shell.
20141 @item Built-in Function
20142 The @code{awk} language provides built-in functions that perform various
20143 numerical, time stamp related, and string computations. Examples are
20144 @code{sqrt} (for the square root of a number) and @code{substr} (for a
20145 substring of a string). @xref{Built-in, ,Built-in Functions}.
20147 @item Built-in Variable
20148 @code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
20149 @code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
20150 @code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
20151 @code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP},
20152 are the variables that have special meaning to @code{awk}.
20153 Changing some of them affects @code{awk}'s running environment.
20154 Several of these variables are specific to @code{gawk}.
20155 @xref{Built-in Variables}.
20158 See ``Curly Braces.''
20160 @item Bulletin Board System
20161 A computer system allowing users to log in and read and/or leave messages
20162 for other users of the system, much like leaving paper notes on a bulletin
20166 The system programming language that most GNU software is written in. The
20167 @code{awk} programming language has C-like syntax, and this @value{DOCUMENT}
20168 points out similarities between @code{awk} and C when appropriate.
20171 @cindex ISO Latin-1
20172 @item Character Set
20173 The set of numeric codes used by a computer system to represent the
20174 characters (letters, numbers, punctuation, etc.) of a particular country
20175 or place. The most common character set in use today is ASCII (American
20176 Standard Code for Information Interchange). Many European
20177 countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1).
20180 A preprocessor for @code{pic} that reads descriptions of molecules
20181 and produces @code{pic} input for drawing them. It was written in @code{awk}
20182 by Brian Kernighan and Jon Bentley, and is available from
20183 @email{@w{netlib@@research.bell-labs.com}}.
20185 @item Compound Statement
20186 A series of @code{awk} statements, enclosed in curly braces. Compound
20187 statements may be nested.
20188 @xref{Statements, ,Control Statements in Actions}.
20190 @item Concatenation
20191 Concatenating two strings means sticking them together, one after another,
20192 giving a new string. For example, the string @samp{foo} concatenated with
20193 the string @samp{bar} gives the string @samp{foobar}.
20194 @xref{Concatenation, ,String Concatenation}.
20196 @item Conditional Expression
20197 An expression using the @samp{?:} ternary operator, such as
20198 @samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression
20199 @var{expr1} is evaluated; if the result is true, the value of the whole
20200 expression is the value of @var{expr2}, otherwise the value is
20201 @var{expr3}. In either case, only one of @var{expr2} and @var{expr3}
20202 is evaluated. @xref{Conditional Exp, ,Conditional Expressions}.
20204 @item Comparison Expression
20205 A relation that is either true or false, such as @samp{(a < b)}.
20206 Comparison expressions are used in @code{if}, @code{while}, @code{do},
20208 statements, and in patterns to select which input records to process.
20209 @xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.
20212 The characters @samp{@{} and @samp{@}}. Curly braces are used in
20213 @code{awk} for delimiting actions, compound statements, and function
20217 An area in the language where specifications often were (or still
20218 are) not clear, leading to unexpected or undesirable behavior.
20219 Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the
20220 text, and are indexed under the heading ``dark corner.''
20223 These are numbers and strings of characters. Numbers are converted into
20224 strings and vice versa, as needed.
20225 @xref{Conversion, ,Conversion of Strings and Numbers}.
20227 @item Double Precision
20228 An internal representation of numbers that can have fractional parts.
20229 Double precision numbers keep track of more digits than do single precision
20230 numbers, but operations on them are more expensive. This is the way
20231 @code{awk} stores numeric values. It is the C type @code{double}.
20233 @item Dynamic Regular Expression
20234 A dynamic regular expression is a regular expression written as an
20235 ordinary expression. It could be a string constant, such as
20236 @code{"foo"}, but it may also be an expression whose value can vary.
20237 @xref{Computed Regexps, , Using Dynamic Regexps}.
20240 A collection of strings, of the form @var{name@code{=}val}, that each
20241 program has available to it. Users generally place values into the
20242 environment in order to provide information to various programs. Typical
20243 examples are the environment variables @code{HOME} and @code{PATH}.
20246 See ``Null String.''
20248 @item Escape Sequences
20249 A special sequence of characters used for describing non-printing
20250 characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
20251 ESC (escape) character. @xref{Escape Sequences}.
20254 When @code{awk} reads an input record, it splits the record into pieces
20255 separated by whitespace (or by a separator regexp which you can
20256 change by setting the built-in variable @code{FS}). Such pieces are
20257 called fields. If the pieces are of fixed length, you can use the built-in
20258 variable @code{FIELDWIDTHS} to describe their lengths.
20259 @xref{Field Separators, ,Specifying How Fields are Separated},
20261 @xref{Constant Size, , Reading Fixed-width Data}.
20263 @item Floating Point Number
20264 Often referred to in mathematical terms as a ``rational'' number, this is
20265 just a number that can have a fractional part.
20266 See ``Double Precision'' and ``Single Precision.''
20269 Format strings are used to control the appearance of output in the
20270 @code{printf} statement. Also, data conversions from numbers to strings
20271 are controlled by the format string contained in the built-in variable
20272 @code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.
20275 A specialized group of statements used to encapsulate general
20276 or program-specific tasks. @code{awk} has a number of built-in
20277 functions, and also allows you to define your own.
20278 @xref{Built-in, ,Built-in Functions},
20279 and @ref{User-defined, ,User-defined Functions}.
20282 See ``Free Software Foundation.''
20284 @item Free Software Foundation
20285 A non-profit organization dedicated
20286 to the production and distribution of freely distributable software.
20287 It was founded by Richard M.@: Stallman, the author of the original
20288 Emacs editor. GNU Emacs is the most widely used version of Emacs today.
20291 The GNU implementation of @code{awk}.
20293 @item General Public License
20294 This document describes the terms under which @code{gawk} and its source
20295 code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE})
20298 ``GNU's not Unix''. An on-going project of the Free Software Foundation
20299 to create a complete, freely distributable, POSIX-compliant computing
20303 See ``General Public License.''
20306 Base 16 notation, where the digits are @code{0}-@code{9} and
20307 @code{A}-@code{F}, with @samp{A}
20308 representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15.
20309 Hexadecimal numbers are written in C using a leading @samp{0x},
20310 to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2).
20313 Abbreviation for ``Input/Output,'' the act of moving data into and/or
20314 out of a running program.
20317 A single chunk of data read in by @code{awk}. Usually, an @code{awk} input
20318 record consists of one line of text.
20319 @xref{Records, ,How Input is Split into Records}.
20322 A whole number, i.e.@: a number that does not have a fractional part.
20325 In the @code{awk} language, a keyword is a word that has special
20326 meaning. Keywords are reserved and may not be used as variable names.
20328 @code{gawk}'s keywords are:
20334 @code{do@dots{}while},
20336 @code{for@dots{}in},
20346 @item Logical Expression
20347 An expression using the operators for logic, AND, OR, and NOT, written
20348 @samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean
20349 expressions, after the mathematician who pioneered this kind of
20350 mathematical logic.
20353 An expression that can appear on the left side of an assignment
20354 operator. In most languages, lvalues can be variables or array
20355 elements. In @code{awk}, a field designator can also be used as an
20359 A string with no characters in it. It is represented explicitly in
20360 @code{awk} programs by placing two double-quote characters next to
20361 each other (@code{""}). It can appear in input data by having two successive
20362 occurrences of the field separator appear next to each other.
20365 A numeric valued data object. The @code{gawk} implementation uses double
20366 precision floating point to represent numbers.
20367 Very old @code{awk} implementations use single precision floating
20371 Base-eight notation, where the digits are @code{0}-@code{7}.
20372 Octal numbers are written in C using a leading @samp{0},
20373 to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3).
20376 Patterns tell @code{awk} which input records are interesting to which
20379 A pattern is an arbitrary conditional expression against which input is
20380 tested. If the condition is satisfied, the pattern is said to @dfn{match}
20381 the input record. A typical pattern might compare the input record against
20382 a regular expression. @xref{Pattern Overview, ,Pattern Elements}.
20385 The name for a series of standards being developed by the IEEE
20386 that specify a Portable Operating System interface. The ``IX'' denotes
20387 the Unix heritage of these standards. The main standard of interest for
20388 @code{awk} users is
20389 @cite{IEEE Standard for Information Technology, Standard 1003.2-1992,
20390 Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}.
20391 Informally, this standard is often referred to as simply ``P1003.2.''
20394 Variables and/or functions that are meant for use exclusively by library
20395 functions, and not for the main @code{awk} program. Special care must be
20396 taken when naming such variables and functions.
20397 @xref{Library Names, , Naming Library Function Global Variables}.
20399 @item Range (of input lines)
20400 A sequence of consecutive lines from the input file. A pattern
20401 can specify ranges of input lines for @code{awk} to process, or it can
20402 specify single lines. @xref{Pattern Overview, ,Pattern Elements}.
20405 When a function calls itself, either directly or indirectly.
20406 If this isn't clear, refer to the entry for ``recursion.''
20409 Redirection means performing input from other than the standard input
20410 stream, or output to other than the standard output stream.
20412 You can redirect the output of the @code{print} and @code{printf} statements
20413 to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|}
20414 operators. You can redirect input to the @code{getline} statement using
20415 the @samp{<} and @samp{|} operators.
20416 @xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}},
20417 and @ref{Getline, ,Explicit Input with @code{getline}}.
20420 Short for @dfn{regular expression}. A regexp is a pattern that denotes a
20421 set of strings, possibly an infinite set. For example, the regexp
20422 @samp{R.*xp} matches any string starting with the letter @samp{R}
20423 and ending with the letters @samp{xp}. In @code{awk}, regexps are
20424 used in patterns and in conditional expressions. Regexps may contain
20425 escape sequences. @xref{Regexp, ,Regular Expressions}.
20427 @item Regular Expression
20430 @item Regular Expression Constant
20431 A regular expression constant is a regular expression written within
20432 slashes, such as @code{/foo/}. This regular expression is chosen
20433 when you write the @code{awk} program, and cannot be changed doing
20434 its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}.
20437 A segment of an @code{awk} program that specifies how to process single
20438 input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
20439 @code{awk} reads an input record; then, for each rule, if the input record
20440 satisfies the rule's pattern, @code{awk} executes the rule's action.
20441 Otherwise, the rule does nothing for that input record.
20444 A value that can appear on the right side of an assignment operator.
20445 In @code{awk}, essentially every expression has a value. These values
20449 See ``Stream Editor.''
20451 @item Short-Circuit
20452 The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
20453 If the value of the entire expression can be deduced from evaluating just
20454 the left-hand side of these operators, the right-hand side will not
20456 (@pxref{Boolean Ops, ,Boolean Expressions}).
20459 A side effect occurs when an expression has an effect aside from merely
20460 producing a value. Assignment expressions, increment and decrement
20461 expressions and function calls have side effects.
20462 @xref{Assignment Ops, ,Assignment Expressions}.
20464 @item Single Precision
20465 An internal representation of numbers that can have fractional parts.
20466 Single precision numbers keep track of fewer digits than do double precision
20467 numbers, but operations on them are less expensive in terms of CPU time.
20468 This is the type used by some very old versions of @code{awk} to store
20469 numeric values. It is the C type @code{float}.
20472 The character generated by hitting the space bar on the keyboard.
20475 A file name interpreted internally by @code{gawk}, instead of being handed
20476 directly to the underlying operating system. For example, @file{/dev/stderr}.
20477 @xref{Special Files, ,Special File Names in @code{gawk}}.
20479 @item Stream Editor
20480 A program that reads records from an input stream and processes them one
20481 or more at a time. This is in contrast with batch programs, which may
20482 expect to read their input files in entirety before starting to do
20483 anything, and with interactive programs, which require input from the
20487 A datum consisting of a sequence of characters, such as @samp{I am a
20488 string}. Constant strings are written with double-quotes in the
20489 @code{awk} language, and may contain escape sequences.
20490 @xref{Escape Sequences}.
20493 The character generated by hitting the @kbd{TAB} key on the keyboard.
20494 It usually expands to up to eight spaces upon output.
20497 A computer operating system originally developed in the early 1970's at
20498 AT&T Bell Laboratories. It initially became popular in universities around
20499 the world, and later moved into commercial evnironments as a software
20500 development system and network server system. There are many commercial
20501 versions of Unix, as well as several work-alike systems whose source code
20502 is freely available (such as Linux, NetBSD, and FreeBSD).
20505 A sequence of space, tab, or newline characters occurring inside an input
20506 record or a string.
20509 @node Copying, Index, Glossary, Top
20510 @unnumbered GNU GENERAL PUBLIC LICENSE
20511 @center Version 2, June 1991
20514 Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
20515 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA
20517 Everyone is permitted to copy and distribute verbatim copies
20518 of this license document, but changing it is not allowed.
20521 @c fakenode --- for prepinfo
20522 @unnumberedsec Preamble
20524 The licenses for most software are designed to take away your
20525 freedom to share and change it. By contrast, the GNU General Public
20526 License is intended to guarantee your freedom to share and change free
20527 software---to make sure the software is free for all its users. This
20528 General Public License applies to most of the Free Software
20529 Foundation's software and to any other program whose authors commit to
20530 using it. (Some other Free Software Foundation software is covered by
20531 the GNU Library General Public License instead.) You can apply it to
20532 your programs, too.
20534 When we speak of free software, we are referring to freedom, not
20535 price. Our General Public Licenses are designed to make sure that you
20536 have the freedom to distribute copies of free software (and charge for
20537 this service if you wish), that you receive source code or can get it
20538 if you want it, that you can change the software or use pieces of it
20539 in new free programs; and that you know you can do these things.
20541 To protect your rights, we need to make restrictions that forbid
20542 anyone to deny you these rights or to ask you to surrender the rights.
20543 These restrictions translate to certain responsibilities for you if you
20544 distribute copies of the software, or if you modify it.
20546 For example, if you distribute copies of such a program, whether
20547 gratis or for a fee, you must give the recipients all the rights that
20548 you have. You must make sure that they, too, receive or can get the
20549 source code. And you must show them these terms so they know their
20552 We protect your rights with two steps: (1) copyright the software, and
20553 (2) offer you this license which gives you legal permission to copy,
20554 distribute and/or modify the software.
20556 Also, for each author's protection and ours, we want to make certain
20557 that everyone understands that there is no warranty for this free
20558 software. If the software is modified by someone else and passed on, we
20559 want its recipients to know that what they have is not the original, so
20560 that any problems introduced by others will not reflect on the original
20561 authors' reputations.
20563 Finally, any free program is threatened constantly by software
20564 patents. We wish to avoid the danger that redistributors of a free
20565 program will individually obtain patent licenses, in effect making the
20566 program proprietary. To prevent this, we have made it clear that any
20567 patent must be licensed for everyone's free use or not licensed at all.
20569 The precise terms and conditions for copying, distribution and
20570 modification follow.
20573 @c fakenode --- for prepinfo
20574 @unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20577 @center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
20582 This License applies to any program or other work which contains
20583 a notice placed by the copyright holder saying it may be distributed
20584 under the terms of this General Public License. The ``Program'', below,
20585 refers to any such program or work, and a ``work based on the Program''
20586 means either the Program or any derivative work under copyright law:
20587 that is to say, a work containing the Program or a portion of it,
20588 either verbatim or with modifications and/or translated into another
20589 language. (Hereinafter, translation is included without limitation in
20590 the term ``modification''.) Each licensee is addressed as ``you''.
20592 Activities other than copying, distribution and modification are not
20593 covered by this License; they are outside its scope. The act of
20594 running the Program is not restricted, and the output from the Program
20595 is covered only if its contents constitute a work based on the
20596 Program (independent of having been made by running the Program).
20597 Whether that is true depends on what the Program does.
20600 You may copy and distribute verbatim copies of the Program's
20601 source code as you receive it, in any medium, provided that you
20602 conspicuously and appropriately publish on each copy an appropriate
20603 copyright notice and disclaimer of warranty; keep intact all the
20604 notices that refer to this License and to the absence of any warranty;
20605 and give any other recipients of the Program a copy of this License
20606 along with the Program.
20608 You may charge a fee for the physical act of transferring a copy, and
20609 you may at your option offer warranty protection in exchange for a fee.
20612 You may modify your copy or copies of the Program or any portion
20613 of it, thus forming a work based on the Program, and copy and
20614 distribute such modifications or work under the terms of Section 1
20615 above, provided that you also meet all of these conditions:
20619 You must cause the modified files to carry prominent notices
20620 stating that you changed the files and the date of any change.
20623 You must cause any work that you distribute or publish, that in
20624 whole or in part contains or is derived from the Program or any
20625 part thereof, to be licensed as a whole at no charge to all third
20626 parties under the terms of this License.
20629 If the modified program normally reads commands interactively
20630 when run, you must cause it, when started running for such
20631 interactive use in the most ordinary way, to print or display an
20632 announcement including an appropriate copyright notice and a
20633 notice that there is no warranty (or else, saying that you provide
20634 a warranty) and that users may redistribute the program under
20635 these conditions, and telling the user how to view a copy of this
20636 License. (Exception: if the Program itself is interactive but
20637 does not normally print such an announcement, your work based on
20638 the Program is not required to print an announcement.)
20641 These requirements apply to the modified work as a whole. If
20642 identifiable sections of that work are not derived from the Program,
20643 and can be reasonably considered independent and separate works in
20644 themselves, then this License, and its terms, do not apply to those
20645 sections when you distribute them as separate works. But when you
20646 distribute the same sections as part of a whole which is a work based
20647 on the Program, the distribution of the whole must be on the terms of
20648 this License, whose permissions for other licensees extend to the
20649 entire whole, and thus to each and every part regardless of who wrote it.
20651 Thus, it is not the intent of this section to claim rights or contest
20652 your rights to work written entirely by you; rather, the intent is to
20653 exercise the right to control the distribution of derivative or
20654 collective works based on the Program.
20656 In addition, mere aggregation of another work not based on the Program
20657 with the Program (or with a work based on the Program) on a volume of
20658 a storage or distribution medium does not bring the other work under
20659 the scope of this License.
20662 You may copy and distribute the Program (or a work based on it,
20663 under Section 2) in object code or executable form under the terms of
20664 Sections 1 and 2 above provided that you also do one of the following:
20668 Accompany it with the complete corresponding machine-readable
20669 source code, which must be distributed under the terms of Sections
20670 1 and 2 above on a medium customarily used for software interchange; or,
20673 Accompany it with a written offer, valid for at least three
20674 years, to give any third party, for a charge no more than your
20675 cost of physically performing source distribution, a complete
20676 machine-readable copy of the corresponding source code, to be
20677 distributed under the terms of Sections 1 and 2 above on a medium
20678 customarily used for software interchange; or,
20681 Accompany it with the information you received as to the offer
20682 to distribute corresponding source code. (This alternative is
20683 allowed only for non-commercial distribution and only if you
20684 received the program in object code or executable form with such
20685 an offer, in accord with Subsection b above.)
20688 The source code for a work means the preferred form of the work for
20689 making modifications to it. For an executable work, complete source
20690 code means all the source code for all modules it contains, plus any
20691 associated interface definition files, plus the scripts used to
20692 control compilation and installation of the executable. However, as a
20693 special exception, the source code distributed need not include
20694 anything that is normally distributed (in either source or binary
20695 form) with the major components (compiler, kernel, and so on) of the
20696 operating system on which the executable runs, unless that component
20697 itself accompanies the executable.
20699 If distribution of executable or object code is made by offering
20700 access to copy from a designated place, then offering equivalent
20701 access to copy the source code from the same place counts as
20702 distribution of the source code, even though third parties are not
20703 compelled to copy the source along with the object code.
20706 You may not copy, modify, sublicense, or distribute the Program
20707 except as expressly provided under this License. Any attempt
20708 otherwise to copy, modify, sublicense or distribute the Program is
20709 void, and will automatically terminate your rights under this License.
20710 However, parties who have received copies, or rights, from you under
20711 this License will not have their licenses terminated so long as such
20712 parties remain in full compliance.
20715 You are not required to accept this License, since you have not
20716 signed it. However, nothing else grants you permission to modify or
20717 distribute the Program or its derivative works. These actions are
20718 prohibited by law if you do not accept this License. Therefore, by
20719 modifying or distributing the Program (or any work based on the
20720 Program), you indicate your acceptance of this License to do so, and
20721 all its terms and conditions for copying, distributing or modifying
20722 the Program or works based on it.
20725 Each time you redistribute the Program (or any work based on the
20726 Program), the recipient automatically receives a license from the
20727 original licensor to copy, distribute or modify the Program subject to
20728 these terms and conditions. You may not impose any further
20729 restrictions on the recipients' exercise of the rights granted herein.
20730 You are not responsible for enforcing compliance by third parties to
20734 If, as a consequence of a court judgment or allegation of patent
20735 infringement or for any other reason (not limited to patent issues),
20736 conditions are imposed on you (whether by court order, agreement or
20737 otherwise) that contradict the conditions of this License, they do not
20738 excuse you from the conditions of this License. If you cannot
20739 distribute so as to satisfy simultaneously your obligations under this
20740 License and any other pertinent obligations, then as a consequence you
20741 may not distribute the Program at all. For example, if a patent
20742 license would not permit royalty-free redistribution of the Program by
20743 all those who receive copies directly or indirectly through you, then
20744 the only way you could satisfy both it and this License would be to
20745 refrain entirely from distribution of the Program.
20747 If any portion of this section is held invalid or unenforceable under
20748 any particular circumstance, the balance of the section is intended to
20749 apply and the section as a whole is intended to apply in other
20752 It is not the purpose of this section to induce you to infringe any
20753 patents or other property right claims or to contest validity of any
20754 such claims; this section has the sole purpose of protecting the
20755 integrity of the free software distribution system, which is
20756 implemented by public license practices. Many people have made
20757 generous contributions to the wide range of software distributed
20758 through that system in reliance on consistent application of that
20759 system; it is up to the author/donor to decide if he or she is willing
20760 to distribute software through any other system and a licensee cannot
20761 impose that choice.
20763 This section is intended to make thoroughly clear what is believed to
20764 be a consequence of the rest of this License.
20767 If the distribution and/or use of the Program is restricted in
20768 certain countries either by patents or by copyrighted interfaces, the
20769 original copyright holder who places the Program under this License
20770 may add an explicit geographical distribution limitation excluding
20771 those countries, so that distribution is permitted only in or among
20772 countries not thus excluded. In such case, this License incorporates
20773 the limitation as if written in the body of this License.
20776 The Free Software Foundation may publish revised and/or new versions
20777 of the General Public License from time to time. Such new versions will
20778 be similar in spirit to the present version, but may differ in detail to
20779 address new problems or concerns.
20781 Each version is given a distinguishing version number. If the Program
20782 specifies a version number of this License which applies to it and ``any
20783 later version'', you have the option of following the terms and conditions
20784 either of that version or of any later version published by the Free
20785 Software Foundation. If the Program does not specify a version number of
20786 this License, you may choose any version ever published by the Free Software
20790 If you wish to incorporate parts of the Program into other free
20791 programs whose distribution conditions are different, write to the author
20792 to ask for permission. For software which is copyrighted by the Free
20793 Software Foundation, write to the Free Software Foundation; we sometimes
20794 make exceptions for this. Our decision will be guided by the two goals
20795 of preserving the free status of all derivatives of our free software and
20796 of promoting the sharing and reuse of software generally.
20799 @c fakenode --- for prepinfo
20800 @heading NO WARRANTY
20803 @center NO WARRANTY
20807 BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
20808 FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
20809 OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
20810 PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
20811 OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
20812 MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
20813 TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
20814 PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
20815 REPAIR OR CORRECTION.
20818 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
20819 WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
20820 REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
20821 INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
20822 OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
20823 TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
20824 YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
20825 PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
20826 POSSIBILITY OF SUCH DAMAGES.
20830 @c fakenode --- for prepinfo
20831 @heading END OF TERMS AND CONDITIONS
20834 @center END OF TERMS AND CONDITIONS
20838 @c fakenode --- for prepinfo
20839 @unnumberedsec How to Apply These Terms to Your New Programs
20841 If you develop a new program, and you want it to be of the greatest
20842 possible use to the public, the best way to achieve this is to make it
20843 free software which everyone can redistribute and change under these terms.
20845 To do so, attach the following notices to the program. It is safest
20846 to attach them to the start of each source file to most effectively
20847 convey the exclusion of warranty; and each file should have at least
20848 the ``copyright'' line and a pointer to where the full notice is found.
20851 @var{one line to give the program's name and an idea of what it does.}
20852 Copyright (C) @var{year} @var{name of author}
20854 This program is free software; you can redistribute it and/or
20855 modify it under the terms of the GNU General Public License
20856 as published by the Free Software Foundation; either version 2
20857 of the License, or (at your option) any later version.
20859 This program is distributed in the hope that it will be useful,
20860 but WITHOUT ANY WARRANTY; without even the implied warranty of
20861 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
20862 GNU General Public License for more details.
20864 You should have received a copy of the GNU General Public License
20865 along with this program; if not, write to the Free Software
20866 Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
20869 Also add information on how to contact you by electronic and paper mail.
20871 If the program is interactive, make it output a short notice like this
20872 when it starts in an interactive mode:
20875 Gnomovision version 69, Copyright (C) @var{year} @var{name of author}
20876 Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
20877 type `show w'. This is free software, and you are welcome
20878 to redistribute it under certain conditions; type `show c'
20882 The hypothetical commands @samp{show w} and @samp{show c} should show
20883 the appropriate parts of the General Public License. Of course, the
20884 commands you use may be called something other than @samp{show w} and
20885 @samp{show c}; they could even be mouse-clicks or menu items---whatever
20886 suits your program.
20888 You should also get your employer (if you work as a programmer) or your
20889 school, if any, to sign a ``copyright disclaimer'' for the program, if
20890 necessary. Here is a sample; alter the names:
20894 Yoyodyne, Inc., hereby disclaims all copyright
20895 interest in the program `Gnomovision'
20896 (which makes passes at compilers) written
20899 @var{signature of Ty Coon}, 1 April 1989
20900 Ty Coon, President of Vice
20904 This General Public License does not permit incorporating your program into
20905 proprietary programs. If your program is a subroutine library, you may
20906 consider it more useful to permit linking proprietary applications with the
20907 library. If this is what you want to do, use the GNU Library General
20908 Public License instead of this License.
20910 @node Index, , Copying, Top
20922 Robert J. Chassell points out that awk programs should have some indication
20923 of how to use them. It would be useful to perhaps have a "programming
20924 style" section of the manual that would include this and other tips.
20926 2. The default AWKPATH search path should be configurable via `configure'
20927 The default and how this changes needs to be documented.
20929 Consistency issues:
20930 /.../ regexps are in @code, not @samp
20931 ".." strings are in @code, not @samp
20932 no @print before @dots
20933 values of expressions in the text (@code{x} has the value 15),
20934 should be in roman, not @code
20935 Use tab and not TAB
20936 Use ESC and not ESCAPE
20937 Use space and not blank to describe the space bar's character
20938 The term "blank" is thus basically reserved for "blank lines" etc.
20939 The `(d.c.)' should appear inside the closing `.' of a sentence
20940 It should come before (pxref{...})
20941 " " should have an @w{} around it
20942 Use "non-" everywhere
20943 Use @code{ftp} when talking about anonymous ftp
20944 Use upper-case and lower-case, not "upper case" and "lower case"
20945 Use alphanumeric, not alpha-numeric
20946 Use --foo, not -Wfoo when describing long options
20947 Use findex for all programs and functions in the example chapters
20948 Use "Bell Laboratories", but not "Bell Labs".
20949 Use "behavior" instead of "behaviour".
20950 Use "zeros" instead of "zeroes".
20951 Use "Input/Output", not "input/output". Also "I/O", not "i/o".
20952 Use @code{do}, and not @code{do}-@code{while}, except where
20953 actually discussing the do-while.
20954 The words "a", "and", "as", "between", "for", "from", "in", "of",
20955 "on", "that", "the", "to", "with", and "without",
20956 should not be capitalized in @chapter, @section etc.
20957 "Into" and "How" should.
20958 Search for @dfn; make sure important items are also indexed.
20959 "e.g." should always be followed by a comma.
20960 "i.e." should never be followed by a comma, and should be followed
20962 The numbers zero through ten should be spelled out, except when
20963 talking about file descriptor numbers. > 10 and < 0, it's
20965 In tables, put command line options in @code, while in the text,
20967 When using @strong, use "Note:" or "Caution:" with colons and
20968 not exclamation points. Do not surround the paragraphs
20969 with @quotation ... @end quotation.
20971 Date: Wed, 13 Apr 94 15:20:52 -0400
20972 From: rsm@gnu.ai.mit.edu (Richard Stallman)
20973 To: gnu-prog@gnu.ai.mit.edu
20974 Subject: A reminder: no pathnames in GNU
20976 It's a GNU convention to use the term "file name" for the name of a
20977 file, never "pathname". We use the term "path" for search paths,
20978 which are lists of file names. Using it for a single file name as
20979 well is potentially confusing to users.
20981 So please check any documentation you maintain, if you think you might
20982 have used "pathname".
20984 Note that "file name" should be two words when it appears as ordinary
20985 text. It's ok as one word when it's a metasyntactic variable, though.
20989 Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
20990 E.g., a length of 0 or -1 or something. May be "n"?
20992 Make FIELDWIDTHS be an array?
20994 What if FIELDWIDTHS has invalid values in it?