contrib/gperf/doc/gperf.texi

   1 \input texinfo @c -*- texinfo -*-
   2 @c %**start of header
   3 @setfilename gperf.info
   4 @settitle Perfect Hash Function Generator
   5 @c @setchapternewpage odd
   6 @c %**end of header
   7
   8 @c some day we should @include version.texi instead of defining
   9 @c these values at hand.
  10 @set UPDATED 26 September 2000
  11 @set EDITION 2.7.2
  12 @set VERSION 2.7.2
  13 @c ---------------------
  14
  15 @c remove the black boxes generated in the GPL appendix.
  16 @finalout
  17
  18 @c Merge functions into the concept index
  19 @syncodeindex fn cp
  20 @c @synindex pg cp
  21
  22 @dircategory Programming Tools
  23 @direntry
  24 * Gperf: (gperf).                Perfect Hash Function Generator.
  25 @end direntry
  26
  27 @ifinfo
  28 This file documents the features of the GNU Perfect Hash Function
  29 Generator @value{VERSION}.
  30
  31 Copyright @copyright{} 1989-2000 Free Software Foundation, Inc.
  32
  33 Permission is granted to make and distribute verbatim copies of this
  34 manual provided the copyright notice and this permission notice are
  35 preserved on all copies.
  36
  37 @ignore
  38 Permission is granted to process this file through TeX and print the
  39 results, provided the printed document carries a copying permission
  40 notice identical to this one except for the removal of this paragraph
  41 (this paragraph not being relevant to the printed manual).
  42
  43 @end ignore
  44
  45 Permission is granted to copy and distribute modified versions of this
  46 manual under the conditions for verbatim copying, provided also that the
  47 section entitled ``GNU General Public License'' is included exactly as
  48 in the original, and provided that the entire resulting derived work is
  49 distributed under the terms of a permission notice identical to this
  50 one.
  51
  52 Permission is granted to copy and distribute translations of this manual
  53 into another language, under the above conditions for modified versions,
  54 except that the section entitled ``GNU General Public License'' and this
  55 permission notice may be included in translations approved by the Free
  56 Software Foundation instead of in the original English.
  57
  58 @end ifinfo
  59
  60 @titlepage
  61 @title User's Guide to @code{gperf} @value{VERSION}
  62 @subtitle The GNU Perfect Hash Function Generator
  63 @subtitle Edition @value{EDITION}, @value{UPDATED}
  64 @author Douglas C. Schmidt
  65
  66 @page
  67 @vskip 0pt plus 1filll
  68 Copyright @copyright{} 1989-2000 Free Software Foundation, Inc.
  69
  70
  71 Permission is granted to make and distribute verbatim copies of
  72 this manual provided the copyright notice and this permission notice
  73 are preserved on all copies.
  74
  75 Permission is granted to copy and distribute modified versions of this
  76 manual under the conditions for verbatim copying, provided also that the
  77 section entitled ``GNU General Public License'' is included
  78 exactly as in the original, and provided that the entire resulting
  79 derived work is distributed under the terms of a permission notice
  80 identical to this one.
  81
  82 Permission is granted to copy and distribute translations of this manual
  83 into another language, under the above conditions for modified versions,
  84 except that the section entitled ``GNU General Public License'' may be
  85 included in a translation approved by the author instead of in the
  86 original English.
  87 @end titlepage
  88
  89 @ifinfo
  90 @node Top, Copying, (dir), (dir)
  91 @top Introduction
  92
  93 This manual documents the GNU @code{gperf} perfect hash function generator
  94 utility, focusing on its features and how to use them, and how to report
  95 bugs.
  96
  97 @menu
  98 * Copying::                     GNU @code{gperf} General Public License says
  99                                 how you can copy and share @code{gperf}.
 100 * Contributors::                People who have contributed to @code{gperf}.
 101 * Motivation::                  Static search structures and GNU GPERF.
 102 * Search Structures::           Static search structures and GNU @code{gperf}
 103 * Description::                 High-level discussion of how GPERF functions.
 104 * Options::                     A description of options to the program.
 105 * Bugs::                        Known bugs and limitations with GPERF.
 106 * Projects::                    Things still left to do.
 107 * Implementation::              Implementation Details for GNU GPERF.
 108 * Bibliography::                Material Referenced in this Report.
 109
 110 * Concept Index::
 111
 112 @detailmenu --- The Detailed Node Listing ---
 113
 114 High-Level Description of GNU @code{gperf}
 115
 116 * Input Format::                Input Format to @code{gperf}
 117 * Output Format::               Output Format for Generated C Code with @code{gperf}
 118 * Binary Strings::              Use of NUL characters
 119
 120 Input Format to @code{gperf}
 121
 122 * Declarations::                @code{struct} Declarations and C Code Inclusion.
 123 * Keywords::                    Format for Keyword Entries.
 124 * Functions::                   Including Additional C Functions.
 125
 126 Invoking @code{gperf}
 127
 128 * Input Details::               Options that affect Interpretation of the Input File
 129 * Output Language::             Specifying the Language for the Output Code
 130 * Output Details::              Fine tuning Details in the Output Code
 131 * Algorithmic Details::         Changing the Algorithms employed by @code{gperf}
 132 * Verbosity::                   Informative Output
 133
 134 @end detailmenu
 135 @end menu
 136
 137 @end ifinfo
 138
 139 @node Copying, Contributors, Top, Top
 140 @unnumbered GNU GENERAL PUBLIC LICENSE
 141 @include gpl.texinfo
 142
 143 @node Contributors, Motivation, Copying, Top
 144 @unnumbered Contributors to GNU @code{gperf} Utility
 145
 146 @itemize @bullet
 147 @item
 148 @cindex Bugs
 149 The GNU @code{gperf} perfect hash function generator utility was
 150 originally written in GNU C++ by Douglas C. Schmidt.  It is now also
 151 available in a highly-portable ``old-style'' C version.  The general
 152 idea for the perfect hash function generator was inspired by Keith
 153 Bostic's algorithm written in C, and distributed to net.sources around
 154 1984.  The current program is a heavily modified, enhanced, and extended
 155 implementation of Keith's basic idea, created at the University of
 156 California, Irvine.  Bugs, patches, and suggestions should be reported
 157 to both @code{<bug-gnu-utils@@gnu.org>} and
 158 @code{<gperf-bugs@@lists.sourceforge.net>}.
 159
 160 @item
 161 Special thanks is extended to Michael Tiemann and Doug Lea, for
 162 providing a useful compiler, and for giving me a forum to exhibit my
 163 creation.
 164
 165 In addition, Adam de Boor and Nels Olson provided many tips and insights
 166 that greatly helped improve the quality and functionality of @code{gperf}.
 167
 168 @item
 169 A testsuite was added by Bruno Haible. He also rewrote the output
 170 routines for better reliability.
 171 @end itemize
 172
 173 @node Motivation, Search Structures, Contributors, Top
 174 @chapter Introduction
 175
 176 @code{gperf} is a perfect hash function generator written in C++.  It
 177 transforms an @var{n} element user-specified keyword set @var{W} into a
 178 perfect hash function @var{F}.  @var{F} uniquely maps keywords in
 179 @var{W} onto the range 0..@var{k}, where @var{k} >= @var{n}.  If @var{k}
 180 = @var{n} then @var{F} is a @emph{minimal} perfect hash function.
 181 @code{gperf} generates a 0..@var{k} element static lookup table and a
 182 pair of C functions.  These functions determine whether a given
 183 character string @var{s} occurs in @var{W}, using at most one probe into
 184 the lookup table.
 185
 186 @code{gperf} currently generates the reserved keyword recognizer for
 187 lexical analyzers in several production and research compilers and
 188 language processing tools, including GNU C, GNU C++, GNU Pascal, GNU
 189 Modula 3, and GNU indent.  Complete C++ source code for @code{gperf} is
 190 available via anonymous ftp from @code{ftp://ftp.gnu.org/pub/gnu/gperf/}.
 191 A paper describing @code{gperf}'s design and implementation in greater
 192 detail is available in the Second USENIX C++ Conference proceedings.
 193
 194 @node Search Structures, Description, Motivation, Top
 195 @chapter Static search structures and GNU @code{gperf}
 196 @cindex Static search structure
 197
 198 A @dfn{static search structure} is an Abstract Data Type with certain
 199 fundamental operations, e.g., @emph{initialize}, @emph{insert},
 200 and @emph{retrieve}.  Conceptually, all insertions occur before any
 201 retrievals.  In practice, @code{gperf} generates a @code{static} array
 202 containing search set keywords and any associated attributes specified
 203 by the user.  Thus, there is essentially no execution-time cost for the
 204 insertions.  It is a useful data structure for representing @emph{static
 205 search sets}.  Static search sets occur frequently in software system
 206 applications.  Typical static search sets include compiler reserved
 207 words, assembler instruction opcodes, and built-in shell interpreter
 208 commands.  Search set members, called @dfn{keywords}, are inserted into
 209 the structure only once, usually during program initialization, and are
 210 not generally modified at run-time.
 211
 212 Numerous static search structure implementations exist, e.g.,
 213 arrays, linked lists, binary search trees, digital search tries, and
 214 hash tables.  Different approaches offer trade-offs between space
 215 utilization and search time efficiency.  For example, an @var{n} element
 216 sorted array is space efficient, though the average-case time
 217 complexity for retrieval operations using binary search is
 218 proportional to log @var{n}.  Conversely, hash table implementations
 219 often locate a table entry in constant time, but typically impose
 220 additional memory overhead and exhibit poor worst case performance.
 221
 222 @cindex Minimal perfect hash functions
 223 @emph{Minimal perfect hash functions} provide an optimal solution for a
 224 particular class of static search sets.  A minimal perfect hash
 225 function is defined by two properties:
 226
 227 @itemize @bullet
 228 @item
 229 It allows keyword recognition in a static search set using at most
 230 @emph{one} probe into the hash table.  This represents the ``perfect''
 231 property.
 232 @item
 233 The actual memory allocated to store the keywords is precisely large
 234 enough for the keyword set, and @emph{no larger}.  This is the
 235 ``minimal'' property.
 236 @end itemize
 237
 238 For most applications it is far easier to generate @emph{perfect} hash
 239 functions than @emph{minimal perfect} hash functions.  Moreover,
 240 non-minimal perfect hash functions frequently execute faster than
 241 minimal ones in practice.  This phenomena occurs since searching a
 242 sparse keyword table increases the probability of locating a ``null''
 243 entry, thereby reducing string comparisons.  @code{gperf}'s default
 244 behavior generates @emph{near-minimal} perfect hash functions for
 245 keyword sets.  However, @code{gperf} provides many options that permit
 246 user control over the degree of minimality and perfection.
 247
 248 Static search sets often exhibit relative stability over time.  For
 249 example, Ada's 63 reserved words have remained constant for nearly a
 250 decade.  It is therefore frequently worthwhile to expend concerted
 251 effort building an optimal search structure @emph{once}, if it
 252 subsequently receives heavy use multiple times.  @code{gperf} removes
 253 the drudgery associated with constructing time- and space-efficient
 254 search structures by hand.  It has proven a useful and practical tool
 255 for serious programming projects.  Output from @code{gperf} is currently
 256 used in several production and research compilers, including GNU C, GNU
 257 C++, GNU Pascal, and GNU Modula 3.  The latter two compilers are not yet
 258 part of the official GNU distribution.  Each compiler utilizes
 259 @code{gperf} to automatically generate static search structures that
 260 efficiently identify their respective reserved keywords.
 261
 262 @node Description, Options, Search Structures, Top
 263 @chapter High-Level Description of GNU @code{gperf}
 264
 265 @menu
 266 * Input Format::                Input Format to @code{gperf}
 267 * Output Format::               Output Format for Generated C Code with @code{gperf}
 268 * Binary Strings::              Use of NUL characters
 269 @end menu
 270
 271 The perfect hash function generator @code{gperf} reads a set of
 272 ``keywords'' from a @dfn{keyfile} (or from the standard input by
 273 default).  It attempts to derive a perfect hashing function that
 274 recognizes a member of the @dfn{static keyword set} with at most a
 275 single probe into the lookup table.  If @code{gperf} succeeds in
 276 generating such a function it produces a pair of C source code routines
 277 that perform hashing and table lookup recognition.  All generated C code
 278 is directed to the standard output.  Command-line options described
 279 below allow you to modify the input and output format to @code{gperf}.
 280
 281 By default, @code{gperf} attempts to produce time-efficient code, with
 282 less emphasis on efficient space utilization.  However, several options
 283 exist that permit trading-off execution time for storage space and vice
 284 versa.  In particular, expanding the generated table size produces a
 285 sparse search structure, generally yielding faster searches.
 286 Conversely, you can direct @code{gperf} to utilize a C @code{switch}
 287 statement scheme that minimizes data space storage size.  Furthermore,
 288 using a C @code{switch} may actually speed up the keyword retrieval time
 289 somewhat.  Actual results depend on your C compiler, of course.
 290
 291 In general, @code{gperf} assigns values to the characters it is using
 292 for hashing until some set of values gives each keyword a unique value.
 293 A helpful heuristic is that the larger the hash value range, the easier
 294 it is for @code{gperf} to find and generate a perfect hash function.
 295 Experimentation is the key to getting the most from @code{gperf}.
 296
 297 @node Input Format, Output Format, Description, Description
 298 @section Input Format to @code{gperf}
 299 @cindex Format
 300 @cindex Declaration section
 301 @cindex Keywords section
 302 @cindex Functions section
 303 You can control the input keyfile format by varying certain command-line
 304 arguments, in particular the @samp{-t} option.  The input's appearance
 305 is similar to GNU utilities @code{flex} and @code{bison} (or UNIX
 306 utilities @code{lex} and @code{yacc}).  Here's an outline of the general
 307 format:
 308
 309 @example
 310 @group
 311 declarations
 312 %%
 313 keywords
 314 %%
 315 functions
 316 @end group
 317 @end example
 318
 319 @emph{Unlike} @code{flex} or @code{bison}, all sections of
 320 @code{gperf}'s input are optional.  The following sections describe the
 321 input format for each section.
 322
 323 @menu
 324 * Declarations::                @code{struct} Declarations and C Code Inclusion.
 325 * Keywords::                    Format for Keyword Entries.
 326 * Functions::                   Including Additional C Functions.
 327 @end menu
 328
 329 @node Declarations, Keywords, Input Format, Input Format
 330 @subsection @code{struct} Declarations and C Code Inclusion
 331
 332 The keyword input file optionally contains a section for including
 333 arbitrary C declarations and definitions, as well as provisions for
 334 providing a user-supplied @code{struct}.  If the @samp{-t} option
 335 @emph{is} enabled, you @emph{must} provide a C @code{struct} as the last
 336 component in the declaration section from the keyfile file.  The first
 337 field in this struct must be a @code{char *} or @code{const char *}
 338 identifier called @samp{name}, although it is possible to modify this
 339 field's name with the @samp{-K} option described below.
 340
 341 Here is a simple example, using months of the year and their attributes as
 342 input:
 343
 344 @example
 345 @group
 346 struct months @{ char *name; int number; int days; int leap_days; @};
 347 %%
 348 january,   1, 31, 31
 349 february,  2, 28, 29
 350 march,     3, 31, 31
 351 april,     4, 30, 30
 352 may,       5, 31, 31
 353 june,      6, 30, 30
 354 july,      7, 31, 31
 355 august,    8, 31, 31
 356 september, 9, 30, 30
 357 october,  10, 31, 31
 358 november, 11, 30, 30
 359 december, 12, 31, 31
 360 @end group
 361 @end example
 362
 363 @cindex @samp{%%}
 364 Separating the @code{struct} declaration from the list of keywords and
 365 other fields are a pair of consecutive percent signs, @samp{%%},
 366 appearing left justified in the first column, as in the UNIX utility
 367 @code{lex}.
 368
 369 @cindex @samp{%@{}
 370 @cindex @samp{%@}}
 371 Using a syntax similar to GNU utilities @code{flex} and @code{bison}, it
 372 is possible to directly include C source text and comments verbatim into
 373 the generated output file.  This is accomplished by enclosing the region
 374 inside left-justified surrounding @samp{%@{}, @samp{%@}} pairs.  Here is
 375 an input fragment based on the previous example that illustrates this
 376 feature:
 377
 378 @example
 379 @group
 380 %@{
 381 #include <assert.h>
 382 /* This section of code is inserted directly into the output. */
 383 int return_month_days (struct months *months, int is_leap_year);
 384 %@}
 385 struct months @{ char *name; int number; int days; int leap_days; @};
 386 %%
 387 january,   1, 31, 31
 388 february,  2, 28, 29
 389 march,     3, 31, 31
 390 ...
 391 @end group
 392 @end example
 393
 394 It is possible to omit the declaration section entirely.  In this case
 395 the keyfile begins directly with the first keyword line, e.g.:
 396
 397 @example
 398 @group
 399 january,   1, 31, 31
 400 february,  2, 28, 29
 401 march,     3, 31, 31
 402 april,     4, 30, 30
 403 ...
 404 @end group
 405 @end example
 406
 407 @node Keywords, Functions, Declarations, Input Format
 408 @subsection Format for Keyword Entries
 409
 410 The second keyfile format section contains lines of keywords and any
 411 associated attributes you might supply.  A line beginning with @samp{#}
 412 in the first column is considered a comment.  Everything following the
 413 @samp{#} is ignored, up to and including the following newline.
 414
 415 The first field of each non-comment line is always the key itself.  It
 416 can be given in two ways: as a simple name, i.e., without surrounding
 417 string quotation marks, or as a string enclosed in double-quotes, in
 418 C syntax, possibly with backslash escapes like @code{\"} or @code{\234}
 419 or @code{\xa8}. In either case, it must start right at the beginning
 420 of the line, without leading whitespace.
 421 In this context, a ``field'' is considered to extend up to, but
 422 not include, the first blank, comma, or newline.  Here is a simple
 423 example taken from a partial list of C reserved words:
 424
 425 @example
 426 @group
 427 # These are a few C reserved words, see the c.gperf file
 428 # for a complete list of ANSI C reserved words.
 429 unsigned
 430 sizeof
 431 switch
 432 signed
 433 if
 434 default
 435 for
 436 while
 437 return
 438 @end group
 439 @end example
 440
 441 Note that unlike @code{flex} or @code{bison} the first @samp{%%} marker
 442 may be elided if the declaration section is empty.
 443
 444 Additional fields may optionally follow the leading keyword.  Fields
 445 should be separated by commas, and terminate at the end of line.  What
 446 these fields mean is entirely up to you; they are used to initialize the
 447 elements of the user-defined @code{struct} provided by you in the
 448 declaration section.  If the @samp{-t} option is @emph{not} enabled
 449 these fields are simply ignored.  All previous examples except the last
 450 one contain keyword attributes.
 451
 452 @node Functions,  , Keywords, Input Format
 453 @subsection Including Additional C Functions
 454
 455 The optional third section also corresponds closely with conventions
 456 found in @code{flex} and @code{bison}.  All text in this section,
 457 starting at the final @samp{%%} and extending to the end of the input
 458 file, is included verbatim into the generated output file.  Naturally,
 459 it is your responsibility to ensure that the code contained in this
 460 section is valid C.
 461
 462 @node Output Format, Binary Strings, Input Format, Description
 463 @section Output Format for Generated C Code with @code{gperf}
 464 @cindex hash table
 465
 466 Several options control how the generated C code appears on the standard
 467 output.  Two C function are generated.  They are called @code{hash} and
 468 @code{in_word_set}, although you may modify their names with a command-line
 469 option.  Both functions require two arguments, a string, @code{char *}
 470 @var{str}, and a length parameter, @code{int} @var{len}.  Their default
 471 function prototypes are as follows:
 472
 473 @deftypefun {unsigned int} hash (const char * @var{str}, unsigned int @var{len})
 474 By default, the generated @code{hash} function returns an integer value
 475 created by adding @var{len} to several user-specified @var{str} key
 476 positions indexed into an @dfn{associated values} table stored in a
 477 local static array.  The associated values table is constructed
 478 internally by @code{gperf} and later output as a static local C array
 479 called @samp{hash_table}; its meaning and properties are described below
 480 (@pxref{Implementation}). The relevant key positions are specified via
 481 the @samp{-k} option when running @code{gperf}, as detailed in the
 482 @emph{Options} section below(@pxref{Options}).
 483 @end deftypefun
 484
 485 @deftypefun {} in_word_set (const char * @var{str}, unsigned int @var{len})
 486 If @var{str} is in the keyword set, returns a pointer to that
 487 keyword. More exactly, if the option @samp{-t} was given, it returns
 488 a pointer to the matching keyword's structure. Otherwise it returns
 489 @code{NULL}.
 490 @end deftypefun
 491
 492 If the option @samp{-c} is not used, @var{str} must be a NUL terminated
 493 string of exactly length @var{len}. If @samp{-c} is used, @var{str} must
 494 simply be an array of @var{len} characters and does not need to be NUL
 495 terminated.
 496
 497 The code generated for these two functions is affected by the following
 498 options:
 499
 500 @table @samp
 501 @item -t
 502 @itemx --struct-type
 503 Make use of the user-defined @code{struct}.
 504
 505 @item -S @var{total-switch-statements}
 506 @itemx --switch=@var{total-switch-statements}
 507 @cindex @code{switch}
 508 Generate 1 or more C @code{switch} statement rather than use a large,
 509 (and potentially sparse) static array.  Although the exact time and
 510 space savings of this approach vary according to your C compiler's
 511 degree of optimization, this method often results in smaller and faster
 512 code.
 513 @end table
 514
 515 If the @samp{-t} and @samp{-S} options are omitted, the default action
 516 is to generate a @code{char *} array containing the keys, together with
 517 additional null strings used for padding the array.  By experimenting
 518 with the various input and output options, and timing the resulting C
 519 code, you can determine the best option choices for different keyword
 520 set characteristics.
 521
 522 @node Binary Strings,  , Output Format, Description
 523 @section Use of NUL characters
 524 @cindex NUL
 525
 526 By default, the code generated by @code{gperf} operates on zero
 527 terminated strings, the usual representation of strings in C. This means
 528 that the keywords in the input file must not contain NUL characters,
 529 and the @var{str} argument passed to @code{hash} or @code{in_word_set}
 530 must be NUL terminated and have exactly length @var{len}.
 531
 532 If option @samp{-c} is used, then the @var{str} argument does not need
 533 to be NUL terminated. The code generated by @code{gperf} will only
 534 access the first @var{len}, not @var{len+1}, bytes starting at @var{str}.
 535 However, the keywords in the input file still must not contain NUL
 536 characters.
 537
 538 If option @samp{-l} is used, then the hash table performs binary
 539 comparison. The keywords in the input file may contain NUL characters,
 540 written in string syntax as @code{\000} or @code{\x00}, and the code
 541 generated by @code{gperf} will treat NUL like any other character.
 542 Also, in this case the @samp{-c} option is ignored.
 543
 544 @node Options, Bugs, Description, Top
 545 @chapter Invoking @code{gperf}
 546
 547 There are @emph{many} options to @code{gperf}.  They were added to make
 548 the program more convenient for use with real applications.  ``On-line''
 549 help is readily available via the @samp{-h} option.  Here is the
 550 complete list of options.
 551
 552 @menu
 553 * Input Details::               Options that affect Interpretation of the Input File
 554 * Output Language::             Specifying the Language for the Output Code
 555 * Output Details::              Fine tuning Details in the Output Code
 556 * Algorithmic Details::         Changing the Algorithms employed by @code{gperf}
 557 * Verbosity::                   Informative Output
 558 @end menu
 559
 560 @node Input Details, Output Language, Options, Options
 561 @section Options that affect Interpretation of the Input File
 562
 563 @table @samp
 564 @item -e @var{keyword-delimiter-list}
 565 @itemx --delimiters=@var{keyword-delimiter-list}
 566 @cindex Delimiters
 567 Allows the user to provide a string containing delimiters used to
 568 separate keywords from their attributes.  The default is ",\n".  This
 569 option is essential if you want to use keywords that have embedded
 570 commas or newlines.  One useful trick is to use -e'TAB', where TAB is
 571 the literal tab character.
 572
 573 @item -t
 574 @itemx --struct-type
 575 Allows you to include a @code{struct} type declaration for generated
 576 code.  Any text before a pair of consecutive @samp{%%} is considered
 577 part of the type declaration.  Keywords and additional fields may follow
 578 this, one group of fields per line.  A set of examples for generating
 579 perfect hash tables and functions for Ada, C, C++, Pascal, Modula 2,
 580 Modula 3 and JavaScript reserved words are distributed with this release.
 581 @end table
 582
 583 @node Output Language, Output Details, Input Details, Options
 584 @section Options to specify the Language for the Output Code
 585
 586 @table @samp
 587 @item -L @var{generated-language-name}
 588 @itemx --language=@var{generated-language-name}
 589 Instructs @code{gperf} to generate code in the language specified by the
 590 option's argument.  Languages handled are currently:
 591
 592 @table @samp
 593 @item KR-C
 594 Old-style K&R C. This language is understood by old-style C compilers and
 595 ANSI C compilers, but ANSI C compilers may flag warnings (or even errors)
 596 because of lacking @samp{const}.
 597
 598 @item C
 599 Common C. This language is understood by ANSI C compilers, and also by
 600 old-style C compilers, provided that you @code{#define const} to empty
 601 for compilers which don't know about this keyword.
 602
 603 @item ANSI-C
 604 ANSI C. This language is understood by ANSI C compilers and C++ compilers.
 605
 606 @item C++
 607 C++. This language is understood by C++ compilers.
 608 @end table
 609
 610 The default is C.
 611
 612 @item -a
 613 This option is supported for compatibility with previous releases of
 614 @code{gperf}. It does not do anything.
 615
 616 @item -g
 617 This option is supported for compatibility with previous releases of
 618 @code{gperf}. It does not do anything.
 619 @end table
 620
 621 @node Output Details, Algorithmic Details, Output Language, Options
 622 @section Options for fine tuning Details in the Output Code
 623
 624 @table @samp
 625 @item -K @var{key-name}
 626 @itemx --slot-name=@var{key-name}
 627 @cindex Slot name
 628 This option is only useful when option @samp{-t} has been given.
 629 By default, the program assumes the structure component identifier for
 630 the keyword is @samp{name}.  This option allows an arbitrary choice of
 631 identifier for this component, although it still must occur as the first
 632 field in your supplied @code{struct}.
 633
 634 @item -F @var{initializers}
 635 @itemx --initializer-suffix=@var{initializers}
 636 @cindex Initializers
 637 This option is only useful when option @samp{-t} has been given.
 638 It permits to specify initializers for the structure members following
 639 @var{key name} in empty hash table entries.  The list of initializers
 640 should start with a comma.  By default, the emitted code will
 641 zero-initialize structure members following @var{key name}.
 642
 643 @item -H @var{hash-function-name}
 644 @itemx --hash-fn-name=@var{hash-function-name}
 645 Allows you to specify the name for the generated hash function.  Default
 646 name is @samp{hash}.  This option permits the use of two hash tables in
 647 the same file.
 648
 649 @item -N @var{lookup-function-name}
 650 @itemx --lookup-fn-name=@var{lookup-function-name}
 651 Allows you to specify the name for the generated lookup function.
 652 Default name is @samp{in_word_set}.  This option permits completely
 653 automatic generation of perfect hash functions, especially when multiple
 654 generated hash functions are used in the same application.
 655
 656 @item -Z @var{class-name}
 657 @itemx --class-name=@var{class-name}
 658 @cindex Class name
 659 This option is only useful when option @samp{-L C++} has been given.  It
 660 allows you to specify the name of generated C++ class.  Default name is
 661 @code{Perfect_Hash}.
 662
 663 @item -7
 664 @itemx --seven-bit
 665 This option specifies that all strings that will be passed as arguments
 666 to the generated hash function and the generated lookup function will
 667 solely consist of 7-bit ASCII characters (characters in the range 0..127).
 668 (Note that the ANSI C functions @code{isalnum} and @code{isgraph} do
 669 @emph{not} guarantee that a character is in this range. Only an explicit
 670 test like @samp{c >= 'A' && c <= 'Z'} guarantees this.) This was the
 671 default in versions of @code{gperf} earlier than 2.7; now the default is
 672 to assume 8-bit characters.
 673
 674 @item -c
 675 @itemx --compare-strncmp
 676 Generates C code that uses the @code{strncmp} function to perform
 677 string comparisons.  The default action is to use @code{strcmp}.
 678
 679 @item -C
 680 @itemx --readonly-tables
 681 Makes the contents of all generated lookup tables constant, i.e.,
 682 ``readonly''.  Many compilers can generate more efficient code for this
 683 by putting the tables in readonly memory.
 684
 685 @item -E
 686 @itemx  --enum
 687 Define constant values using an enum local to the lookup function rather
 688 than with #defines.  This also means that different lookup functions can
 689 reside in the same file.  Thanks to James Clark @code{<jjc@@ai.mit.edu>}.
 690
 691 @item -I
 692 @itemx --includes
 693 Include the necessary system include file, @code{<string.h>}, at the
 694 beginning of the code.  By default, this is not done; the user must
 695 include this header file himself to allow compilation of the code.
 696
 697 @item -G
 698 @itemx --global
 699 Generate the static table of keywords as a static global variable,
 700 rather than hiding it inside of the lookup function (which is the
 701 default behavior).
 702
 703 @item -W @var{hash-table-array-name}
 704 @itemx --word-array-name=@var{hash-table-array-name}
 705 @cindex Array name
 706 Allows you to specify the name for the generated array containing the
 707 hash table.  Default name is @samp{wordlist}.  This option permits the
 708 use of two hash tables in the same file, even when the option @samp{-G}
 709 is given.
 710
 711 @item -S @var{total-switch-statements}
 712 @itemx --switch=@var{total-switch-statements}
 713 @cindex @code{switch}
 714 Causes the generated C code to use a @code{switch} statement scheme,
 715 rather than an array lookup table.  This can lead to a reduction in both
 716 time and space requirements for some keyfiles.  The argument to this
 717 option determines how many @code{switch} statements are generated. A
 718 value of 1 generates 1 @code{switch} containing all the elements, a
 719 value of 2 generates 2 tables with 1/2 the elements in each
 720 @code{switch}, etc.  This is useful since many C compilers cannot
 721 correctly generate code for large @code{switch} statements. This option
 722 was inspired in part by Keith Bostic's original C program.
 723
 724 @item -T
 725 @itemx --omit-struct-type
 726 Prevents the transfer of the type declaration to the output file.  Use
 727 this option if the type is already defined elsewhere.
 728
 729 @item -p
 730 This option is supported for compatibility with previous releases of
 731 @code{gperf}. It does not do anything.
 732 @end table
 733
 734 @node Algorithmic Details, Verbosity, Output Details, Options
 735 @section Options for changing the Algorithms employed by @code{gperf}
 736
 737 @table @samp
 738 @item -k @var{keys}
 739 @itemx --key-positions=@var{keys}
 740 Allows selection of the character key positions used in the keywords'
 741 hash function. The allowable choices range between 1-126, inclusive.
 742 The positions are separated by commas, e.g., @samp{-k 9,4,13,14};
 743 ranges may be used, e.g., @samp{-k 2-7}; and positions may occur
 744 in any order.  Furthermore, the meta-character '*' causes the generated
 745 hash function to consider @strong{all} character positions in each key,
 746 whereas '$' instructs the hash function to use the ``final character''
 747 of a key (this is the only way to use a character position greater than
 748 126, incidentally).
 749
 750 For instance, the option @samp{-k 1,2,4,6-10,'$'} generates a hash
 751 function that considers positions 1,2,4,6,7,8,9,10, plus the last
 752 character in each key (which may differ for each key, obviously).  Keys
 753 with length less than the indicated key positions work properly, since
 754 selected key positions exceeding the key length are simply not
 755 referenced in the hash function.
 756
 757 @item -l
 758 @itemx --compare-strlen
 759 Compare key lengths before trying a string comparison.  This might cut
 760 down on the number of string comparisons made during the lookup, since
 761 keys with different lengths are never compared via @code{strcmp}.
 762 However, using @samp{-l} might greatly increase the size of the
 763 generated C code if the lookup table range is large (which implies that
 764 the switch option @samp{-S} is not enabled), since the length table
 765 contains as many elements as there are entries in the lookup table.
 766 This option is mandatory for binary comparisons (@pxref{Binary Strings}).
 767
 768 @item -D
 769 @itemx --duplicates
 770 @cindex Duplicates
 771 Handle keywords whose key position sets hash to duplicate values.
 772 Duplicate hash values occur for two reasons:
 773
 774 @itemize @bullet
 775 @item
 776 Since @code{gperf} does not backtrack it is possible for it to process
 777 all your input keywords without finding a unique mapping for each word.
 778 However, frequently only a very small number of duplicates occur, and
 779 the majority of keys still require one probe into the table.
 780
 781 @item
 782 Sometimes a set of keys may have the same names, but possess different
 783 attributes.  With the -D option @code{gperf} treats all these keys as
 784 part of an equivalence class and generates a perfect hash function with
 785 multiple comparisons for duplicate keys.  It is up to you to completely
 786 disambiguate the keywords by modifying the generated C code.  However,
 787 @code{gperf} helps you out by organizing the output.
 788 @end itemize
 789
 790 Option @samp{-D} is extremely useful for certain large or highly
 791 redundant keyword sets, e.g., assembler instruction opcodes.
 792 Using this option usually means that the generated hash function is no
 793 longer perfect.  On the other hand, it permits @code{gperf} to work on
 794 keyword sets that it otherwise could not handle.
 795
 796 @item -f @var{iteration-amount}
 797 @itemx --fast=@var{iteration-amount}
 798 Generate the perfect hash function ``fast''.  This decreases
 799 @code{gperf}'s running time at the cost of minimizing generated
 800 table-size.  The iteration amount represents the number of times to
 801 iterate when resolving a collision.  `0' means iterate by the number of
 802 keywords.  This option is probably most useful when used in conjunction
 803 with options @samp{-D} and/or @samp{-S} for @emph{large} keyword sets.
 804
 805 @item -i @var{initial-value}
 806 @itemx --initial-asso=@var{initial-value}
 807 Provides an initial @var{value} for the associate values array.  Default
 808 is 0.  Increasing the initial value helps inflate the final table size,
 809 possibly leading to more time efficient keyword lookups.  Note that this
 810 option is not particularly useful when @samp{-S} is used.  Also,
 811 @samp{-i} is overridden when the @samp{-r} option is used.
 812
 813 @item -j @var{jump-value}
 814 @itemx --jump=@var{jump-value}
 815 @cindex Jump value
 816 Affects the ``jump value'', i.e., how far to advance the associated
 817 character value upon collisions.  @var{Jump-value} is rounded up to an
 818 odd number, the default is 5.  If the @var{jump-value} is 0 @code{gperf}
 819 jumps by random amounts.
 820
 821 @item -n
 822 @itemx --no-strlen
 823 Instructs the generator not to include the length of a keyword when
 824 computing its hash value.  This may save a few assembly instructions in
 825 the generated lookup table.
 826
 827 @item -o
 828 @itemx --occurrence-sort
 829 Reorders the keywords by sorting the keywords so that frequently
 830 occuring key position set components appear first.  A second reordering
 831 pass follows so that keys with ``already determined values'' are placed
 832 towards the front of the keylist.  This may decrease the time required
 833 to generate a perfect hash function for many keyword sets, and also
 834 produce more minimal perfect hash functions.  The reason for this is
 835 that the reordering helps prune the search time by handling inevitable
 836 collisions early in the search process.  On the other hand, if the
 837 number of keywords is @emph{very} large using @samp{-o} may
 838 @emph{increase} @code{gperf}'s execution time, since collisions will
 839 begin earlier and continue throughout the remainder of keyword
 840 processing.  See Cichelli's paper from the January 1980 Communications
 841 of the ACM for details.
 842
 843 @item -r
 844 @itemx --random
 845 Utilizes randomness to initialize the associated values table.  This
 846 frequently generates solutions faster than using deterministic
 847 initialization (which starts all associated values at 0).  Furthermore,
 848 using the randomization option generally increases the size of the
 849 table.  If @code{gperf} has difficultly with a certain keyword set try using
 850 @samp{-r} or @samp{-D}.
 851
 852 @item -s @var{size-multiple}
 853 @itemx --size-multiple=@var{size-multiple}
 854 Affects the size of the generated hash table.  The numeric argument for
 855 this option indicates ``how many times larger or smaller'' the maximum
 856 associated value range should be, in relationship to the number of keys.
 857 If the @var{size-multiple} is negative the maximum associated value is
 858 calculated by @emph{dividing} it into the total number of keys.  For
 859 example, a value of 3 means ``allow the maximum associated value to be
 860 about 3 times larger than the number of input keys''.
 861
 862 Conversely, a value of -3 means ``allow the maximum associated value to
 863 be about 3 times smaller than the number of input keys''.  Negative
 864 values are useful for limiting the overall size of the generated hash
 865 table, though this usually increases the number of duplicate hash
 866 values.
 867
 868 If `generate switch' option @samp{-S} is @emph{not} enabled, the maximum
 869 associated value influences the static array table size, and a larger
 870 table should decrease the time required for an unsuccessful search, at
 871 the expense of extra table space.
 872
 873 The default value is 1, thus the default maximum associated value about
 874 the same size as the number of keys (for efficiency, the maximum
 875 associated value is always rounded up to a power of 2).  The actual
 876 table size may vary somewhat, since this technique is essentially a
 877 heuristic.  In particular, setting this value too high slows down
 878 @code{gperf}'s runtime, since it must search through a much larger range
 879 of values.  Judicious use of the @samp{-f} option helps alleviate this
 880 overhead, however.
 881 @end table
 882
 883 @node Verbosity,  , Algorithmic Details, Options
 884 @section Informative Output
 885
 886 @table @samp
 887 @item -h
 888 @itemx --help
 889 Prints a short summary on the meaning of each program option.  Aborts
 890 further program execution.
 891
 892 @item -v
 893 @itemx --version
 894 Prints out the current version number.
 895
 896 @item -d
 897 @itemx --debug
 898 Enables the debugging option.  This produces verbose diagnostics to
 899 ``standard error'' when @code{gperf} is executing.  It is useful both for
 900 maintaining the program and for determining whether a given set of
 901 options is actually speeding up the search for a solution.  Some useful
 902 information is dumped at the end of the program when the @samp{-d}
 903 option is enabled.
 904 @end table
 905
 906 @node Bugs, Projects, Options, Top
 907 @chapter Known Bugs and Limitations with @code{gperf}
 908
 909 The following are some limitations with the current release of
 910 @code{gperf}:
 911
 912 @itemize @bullet
 913 @item
 914 The @code{gperf} utility is tuned to execute quickly, and works quickly
 915 for small to medium size data sets (around 1000 keywords).  It is
 916 extremely useful for maintaining perfect hash functions for compiler
 917 keyword sets.  Several recent enhancements now enable @code{gperf} to
 918 work efficiently on much larger keyword sets (over 15,000 keywords).
 919 When processing large keyword sets it helps greatly to have over 8 megs
 920 of RAM.
 921
 922 However, since @code{gperf} does not backtrack no guaranteed solution
 923 occurs on every run.  On the other hand, it is usually easy to obtain a
 924 solution by varying the option parameters.  In particular, try the
 925 @samp{-r} option, and also try changing the default arguments to the
 926 @samp{-s} and @samp{-j} options.  To @emph{guarantee} a solution, use
 927 the @samp{-D} and @samp{-S} options, although the final results are not
 928 likely to be a @emph{perfect} hash function anymore!  Finally, use the
 929 @samp{-f} option if you want @code{gperf} to generate the perfect hash
 930 function @emph{fast}, with less emphasis on making it minimal.
 931
 932 @item
 933 The size of the generate static keyword array can get @emph{extremely}
 934 large if the input keyword file is large or if the keywords are quite
 935 similar.  This tends to slow down the compilation of the generated C
 936 code, and @emph{greatly} inflates the object code size.  If this
 937 situation occurs, consider using the @samp{-S} option to reduce data
 938 size, potentially increasing keyword recognition time a negligible
 939 amount.  Since many C compilers cannot correctly generated code for
 940 large switch statements it is important to qualify the @var{-S} option
 941 with an appropriate numerical argument that controls the number of
 942 switch statements generated.
 943
 944 @item
 945 The maximum number of key positions selected for a given key has an
 946 arbitrary limit of 126.  This restriction should be removed, and if
 947 anyone considers this a problem write me and let me know so I can remove
 948 the constraint.
 949 @end itemize
 950
 951 @node Projects, Implementation, Bugs, Top
 952 @chapter Things Still Left to Do
 953
 954 It should be ``relatively'' easy to replace the current perfect hash
 955 function algorithm with a more exhaustive approach; the perfect hash
 956 module is essential independent from other program modules.  Additional
 957 worthwhile improvements include:
 958
 959 @itemize @bullet
 960 @item
 961 Make the algorithm more robust.  At present, the program halts with an
 962 error diagnostic if it can't find a direct solution and the @samp{-D}
 963 option is not enabled.  A more comprehensive, albeit computationally
 964 expensive, approach would employ backtracking or enable alternative
 965 options and retry.  It's not clear how helpful this would be, in
 966 general, since most search sets are rather small in practice.
 967
 968 @item
 969 Another useful extension involves modifying the program to generate
 970 ``minimal'' perfect hash functions (under certain circumstances, the
 971 current version can be rather extravagant in the generated table size).
 972 Again, this is mostly of theoretical interest, since a sparse table
 973 often produces faster lookups, and use of the @samp{-S} @code{switch}
 974 option can minimize the data size, at the expense of slightly longer
 975 lookups (note that the gcc compiler generally produces good code for
 976 @code{switch} statements, reducing the need for more complex schemes).
 977
 978 @item
 979 In addition to improving the algorithm, it would also be useful to
 980 generate a C++ class or Ada package as the code output, in addition to
 981 the current C routines.
 982 @end itemize
 983
 984 @node Implementation, Bibliography, Projects, Top
 985 @chapter Implementation Details of GNU @code{gperf}
 986
 987 A paper describing the high-level description of the data structures and
 988 algorithms used to implement @code{gperf} will soon be available.  This
 989 paper is useful not only from a maintenance and enhancement perspective,
 990 but also because they demonstrate several clever and useful programming
 991 techniques, e.g., `Iteration Number' boolean arrays, double
 992 hashing, a ``safe'' and efficient method for reading arbitrarily long
 993 input from a file, and a provably optimal algorithm for simultaneously
 994 determining both the minimum and maximum elements in a list.
 995
 996 @page
 997
 998 @node Bibliography, Concept Index, Implementation, Top
 999 @chapter Bibliography
1000
1001 [1] Chang, C.C.: @i{A Scheme for Constructing Ordered Minimal Perfect
1002 Hashing Functions} Information Sciences 39(1986), 187-195.
1003
1004 [2] Cichelli, Richard J. @i{Author's Response to ``On Cichelli's Minimal Perfect Hash
1005 Functions Method''} Communications of the ACM, 23, 12(December 1980), 729.
1006
1007 [3] Cichelli, Richard J. @i{Minimal Perfect Hash Functions Made Simple}
1008 Communications of the ACM, 23, 1(January 1980), 17-19.
1009
1010 [4] Cook, C. R. and Oldehoeft, R.R. @i{A Letter Oriented Minimal
1011 Perfect Hashing Function} SIGPLAN Notices, 17, 9(September 1982), 18-27.
1012
1013 [5] Cormack, G. V. and Horspool, R. N. S. and Kaiserwerth, M.
1014 @i{Practical Perfect Hashing} Computer Journal, 28, 1(January 1985), 54-58.
1015
1016 [6] Jaeschke, G. @i{Reciprocal Hashing: A Method for Generating Minimal
1017 Perfect Hashing Functions} Communications of the ACM, 24, 12(December
1018 1981), 829-833.
1019
1020 [7] Jaeschke, G. and Osterburg, G. @i{On Cichelli's Minimal Perfect
1021 Hash Functions Method} Communications of the ACM, 23, 12(December 1980),
1022 728-729.
1023
1024 [8] Sager, Thomas J. @i{A Polynomial Time Generator for Minimal Perfect
1025 Hash Functions} Communications of the ACM, 28, 5(December 1985), 523-532
1026
1027 [9] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator}
1028 Second USENIX C++ Conference Proceedings, April 1990.
1029
1030 [10] Sebesta, R.W. and Taylor, M.A. @i{Minimal Perfect Hash Functions
1031 for Reserved Word Lists}  SIGPLAN Notices, 20, 12(September 1985), 47-53.
1032
1033 [11] Sprugnoli, R. @i{Perfect Hashing Functions: A Single Probe
1034 Retrieving Method for Static Sets} Communications of the ACM, 20
1035 11(November 1977), 841-850.
1036
1037 [12] Stallman, Richard M. @i{Using and Porting GNU CC} Free Software Foundation,
1038 1988.
1039
1040 [13] Stroustrup, Bjarne @i{The C++ Programming Language.} Addison-Wesley, 1986.
1041
1042 [14] Tiemann, Michael D. @i{User's Guide to GNU C++} Free Software
1043 Foundation, 1989.
1044
1045 @node Concept Index,  , Bibliography, Top
1046 @unnumbered Concept Index
1047
1048 @printindex cp
1049
1050 @contents
1051 @bye