contrib/nvi/regex/regex.3

   1 .\"     $NetBSD: regex.3,v 1.1.1.2 2008/05/18 14:31:38 aymeric Exp $
   2 .\"
   3 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
   4 .\" Copyright (c) 1992, 1993, 1994
   5 .\"     The Regents of the University of California.  All rights reserved.
   6 .\"
   7 .\" This code is derived from software contributed to Berkeley by
   8 .\" Henry Spencer of the University of Toronto.
   9 .\"
  10 .\" Redistribution and use in source and binary forms, with or without
  11 .\" modification, are permitted provided that the following conditions
  12 .\" are met:
  13 .\" 1. Redistributions of source code must retain the above copyright
  14 .\"    notice, this list of conditions and the following disclaimer.
  15 .\" 2. Redistributions in binary form must reproduce the above copyright
  16 .\"    notice, this list of conditions and the following disclaimer in the
  17 .\"    documentation and/or other materials provided with the distribution.
  18 .\" 3. All advertising materials mentioning features or use of this software
  19 .\"    must display the following acknowledgement:
  20 .\"     This product includes software developed by the University of
  21 .\"     California, Berkeley and its contributors.
  22 .\" 4. Neither the name of the University nor the names of its contributors
  23 .\"    may be used to endorse or promote products derived from this software
  24 .\"    without specific prior written permission.
  25 .\"
  26 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  27 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  28 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  29 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  30 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  31 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  32 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  33 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  34 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  35 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  36 .\" SUCH DAMAGE.
  37 .\"
  38 .\"     @(#)regex.3     8.2 (Berkeley) 3/16/94
  39 .\"
  40 .TH REGEX 3 "March 16, 1994"
  41 .de ZR
  42 .\" one other place knows this name:  the SEE ALSO section
  43 .IR re_format (7) \\$1
  44 ..
  45 .SH NAME
  46 regcomp, regexec, regerror, regfree \- regular-expression library
  47 .SH SYNOPSIS
  48 .ft B
  49 .\".na
  50 #include <sys/types.h>
  51 .br
  52 #include <regex.h>
  53 .HP 10
  54 int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
  55 .HP
  56 int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
  57 size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
  58 .HP
  59 size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
  60 char\ *errbuf, size_t\ errbuf_size);
  61 .HP
  62 void\ regfree(regex_t\ *preg);
  63 .\".ad
  64 .ft
  65 .SH DESCRIPTION
  66 These routines implement POSIX 1003.2 regular expressions (``RE''s);
  67 see
  68 .ZR .
  69 .I Regcomp
  70 compiles an RE written as a string into an internal form,
  71 .I regexec
  72 matches that internal form against a string and reports results,
  73 .I regerror
  74 transforms error codes from either into human-readable messages,
  75 and
  76 .I regfree
  77 frees any dynamically-allocated storage used by the internal form
  78 of an RE.
  79 .PP
  80 The header
  81 .I <regex.h>
  82 declares two structure types,
  83 .I regex_t
  84 and
  85 .IR regmatch_t ,
  86 the former for compiled internal forms and the latter for match reporting.
  87 It also declares the four functions,
  88 a type
  89 .IR regoff_t ,
  90 and a number of constants with names starting with ``REG_''.
  91 .PP
  92 .I Regcomp
  93 compiles the regular expression contained in the
  94 .I pattern
  95 string,
  96 subject to the flags in
  97 .IR cflags ,
  98 and places the results in the
  99 .I regex_t
 100 structure pointed to by
 101 .IR preg .
 102 .I Cflags
 103 is the bitwise OR of zero or more of the following flags:
 104 .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
 105 Compile modern (``extended'') REs,
 106 rather than the obsolete (``basic'') REs that
 107 are the default.
 108 .IP REG_BASIC
 109 This is a synonym for 0,
 110 provided as a counterpart to REG_EXTENDED to improve readability.
 111 .IP REG_NOSPEC
 112 Compile with recognition of all special characters turned off.
 113 All characters are thus considered ordinary,
 114 so the ``RE'' is a literal string.
 115 This is an extension,
 116 compatible with but not specified by POSIX 1003.2,
 117 and should be used with
 118 caution in software intended to be portable to other systems.
 119 REG_EXTENDED and REG_NOSPEC may not be used
 120 in the same call to
 121 .IR regcomp .
 122 .IP REG_ICASE
 123 Compile for matching that ignores upper/lower case distinctions.
 124 See
 125 .ZR .
 126 .IP REG_NOSUB
 127 Compile for matching that need only report success or failure,
 128 not what was matched.
 129 .IP REG_NEWLINE
 130 Compile for newline-sensitive matching.
 131 By default, newline is a completely ordinary character with no special
 132 meaning in either REs or strings.
 133 With this flag,
 134 `[^' bracket expressions and `.' never match newline,
 135 a `^' anchor matches the null string after any newline in the string
 136 in addition to its normal function,
 137 and the `$' anchor matches the null string before any newline in the
 138 string in addition to its normal function.
 139 .IP REG_PEND
 140 The regular expression ends,
 141 not at the first NUL,
 142 but just before the character pointed to by the
 143 .I re_endp
 144 member of the structure pointed to by
 145 .IR preg .
 146 The
 147 .I re_endp
 148 member is of type
 149 .IR const\ char\ * .
 150 This flag permits inclusion of NULs in the RE;
 151 they are considered ordinary characters.
 152 This is an extension,
 153 compatible with but not specified by POSIX 1003.2,
 154 and should be used with
 155 caution in software intended to be portable to other systems.
 156 .PP
 157 When successful,
 158 .I regcomp
 159 returns 0 and fills in the structure pointed to by
 160 .IR preg .
 161 One member of that structure
 162 (other than
 163 .IR re_endp )
 164 is publicized:
 165 .IR re_nsub ,
 166 of type
 167 .IR size_t ,
 168 contains the number of parenthesized subexpressions within the RE
 169 (except that the value of this member is undefined if the
 170 REG_NOSUB flag was used).
 171 If
 172 .I regcomp
 173 fails, it returns a non-zero error code;
 174 see DIAGNOSTICS.
 175 .PP
 176 .I Regexec
 177 matches the compiled RE pointed to by
 178 .I preg
 179 against the
 180 .IR string ,
 181 subject to the flags in
 182 .IR eflags ,
 183 and reports results using
 184 .IR nmatch ,
 185 .IR pmatch ,
 186 and the returned value.
 187 The RE must have been compiled by a previous invocation of
 188 .IR regcomp .
 189 The compiled form is not altered during execution of
 190 .IR regexec ,
 191 so a single compiled RE can be used simultaneously by multiple threads.
 192 .PP
 193 By default,
 194 the NUL-terminated string pointed to by
 195 .I string
 196 is considered to be the text of an entire line, minus any terminating
 197 newline.
 198 The
 199 .I eflags
 200 argument is the bitwise OR of zero or more of the following flags:
 201 .IP REG_NOTBOL \w'REG_STARTEND'u+2n
 202 The first character of
 203 the string
 204 is not the beginning of a line, so the `^' anchor should not match before it.
 205 This does not affect the behavior of newlines under REG_NEWLINE.
 206 .IP REG_NOTEOL
 207 The NUL terminating
 208 the string
 209 does not end a line, so the `$' anchor should not match before it.
 210 This does not affect the behavior of newlines under REG_NEWLINE.
 211 .IP REG_STARTEND
 212 The string is considered to start at
 213 \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
 214 and to have a terminating NUL located at
 215 \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
 216 (there need not actually be a NUL at that location),
 217 regardless of the value of
 218 .IR nmatch .
 219 See below for the definition of
 220 .IR pmatch
 221 and
 222 .IR nmatch .
 223 This is an extension,
 224 compatible with but not specified by POSIX 1003.2,
 225 and should be used with
 226 caution in software intended to be portable to other systems.
 227 Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
 228 REG_STARTEND affects only the location of the string,
 229 not how it is matched.
 230 .PP
 231 See
 232 .ZR
 233 for a discussion of what is matched in situations where an RE or a
 234 portion thereof could match any of several substrings of
 235 .IR string .
 236 .PP
 237 Normally,
 238 .I regexec
 239 returns 0 for success and the non-zero code REG_NOMATCH for failure.
 240 Other non-zero error codes may be returned in exceptional situations;
 241 see DIAGNOSTICS.
 242 .PP
 243 If REG_NOSUB was specified in the compilation of the RE,
 244 or if
 245 .I nmatch
 246 is 0,
 247 .I regexec
 248 ignores the
 249 .I pmatch
 250 argument (but see below for the case where REG_STARTEND is specified).
 251 Otherwise,
 252 .I pmatch
 253 points to an array of
 254 .I nmatch
 255 structures of type
 256 .IR regmatch_t .
 257 Such a structure has at least the members
 258 .I rm_so
 259 and
 260 .IR rm_eo ,
 261 both of type
 262 .I regoff_t
 263 (a signed arithmetic type at least as large as an
 264 .I off_t
 265 and a
 266 .IR ssize_t ),
 267 containing respectively the offset of the first character of a substring
 268 and the offset of the first character after the end of the substring.
 269 Offsets are measured from the beginning of the
 270 .I string
 271 argument given to
 272 .IR regexec .
 273 An empty substring is denoted by equal offsets,
 274 both indicating the character following the empty substring.
 275 .PP
 276 The 0th member of the
 277 .I pmatch
 278 array is filled in to indicate what substring of
 279 .I string
 280 was matched by the entire RE.
 281 Remaining members report what substring was matched by parenthesized
 282 subexpressions within the RE;
 283 member
 284 .I i
 285 reports subexpression
 286 .IR i ,
 287 with subexpressions counted (starting at 1) by the order of their opening
 288 parentheses in the RE, left to right.
 289 Unused entries in the array\(emcorresponding either to subexpressions that
 290 did not participate in the match at all, or to subexpressions that do not
 291 exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
 292 .I rm_so
 293 and
 294 .I rm_eo
 295 set to \-1.
 296 If a subexpression participated in the match several times,
 297 the reported substring is the last one it matched.
 298 (Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
 299 the parenthesized subexpression matches each of the three `b's and then
 300 an infinite number of empty strings following the last `b',
 301 so the reported substring is one of the empties.)
 302 .PP
 303 If REG_STARTEND is specified,
 304 .I pmatch
 305 must point to at least one
 306 .I regmatch_t
 307 (even if
 308 .I nmatch
 309 is 0 or REG_NOSUB was specified),
 310 to hold the input offsets for REG_STARTEND.
 311 Use for output is still entirely controlled by
 312 .IR nmatch ;
 313 if
 314 .I nmatch
 315 is 0 or REG_NOSUB was specified,
 316 the value of
 317 .IR pmatch [0]
 318 will not be changed by a successful
 319 .IR regexec .
 320 .PP
 321 .I Regerror
 322 maps a non-zero
 323 .I errcode
 324 from either
 325 .I regcomp
 326 or
 327 .I regexec
 328 to a human-readable, printable message.
 329 If
 330 .I preg
 331 is non-NULL,
 332 the error code should have arisen from use of
 333 the
 334 .I regex_t
 335 pointed to by
 336 .IR preg ,
 337 and if the error code came from
 338 .IR regcomp ,
 339 it should have been the result from the most recent
 340 .I regcomp
 341 using that
 342 .IR regex_t .
 343 .RI ( Regerror
 344 may be able to supply a more detailed message using information
 345 from the
 346 .IR regex_t .)
 347 .I Regerror
 348 places the NUL-terminated message into the buffer pointed to by
 349 .IR errbuf ,
 350 limiting the length (including the NUL) to at most
 351 .I errbuf_size
 352 bytes.
 353 If the whole message won't fit,
 354 as much of it as will fit before the terminating NUL is supplied.
 355 In any case,
 356 the returned value is the size of buffer needed to hold the whole
 357 message (including terminating NUL).
 358 If
 359 .I errbuf_size
 360 is 0,
 361 .I errbuf
 362 is ignored but the return value is still correct.
 363 .PP
 364 If the
 365 .I errcode
 366 given to
 367 .I regerror
 368 is first ORed with REG_ITOA,
 369 the ``message'' that results is the printable name of the error code,
 370 e.g. ``REG_NOMATCH'',
 371 rather than an explanation thereof.
 372 If
 373 .I errcode
 374 is REG_ATOI,
 375 then
 376 .I preg
 377 shall be non-NULL and the
 378 .I re_endp
 379 member of the structure it points to
 380 must point to the printable name of an error code;
 381 in this case, the result in
 382 .I errbuf
 383 is the decimal digits of
 384 the numeric value of the error code
 385 (0 if the name is not recognized).
 386 REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
 387 they are extensions,
 388 compatible with but not specified by POSIX 1003.2,
 389 and should be used with
 390 caution in software intended to be portable to other systems.
 391 Be warned also that they are considered experimental and changes are possible.
 392 .PP
 393 .I Regfree
 394 frees any dynamically-allocated storage associated with the compiled RE
 395 pointed to by
 396 .IR preg .
 397 The remaining
 398 .I regex_t
 399 is no longer a valid compiled RE
 400 and the effect of supplying it to
 401 .I regexec
 402 or
 403 .I regerror
 404 is undefined.
 405 .PP
 406 None of these functions references global variables except for tables
 407 of constants;
 408 all are safe for use from multiple threads if the arguments are safe.
 409 .SH IMPLEMENTATION CHOICES
 410 There are a number of decisions that 1003.2 leaves up to the implementor,
 411 either by explicitly saying ``undefined'' or by virtue of them being
 412 forbidden by the RE grammar.
 413 This implementation treats them as follows.
 414 .PP
 415 See
 416 .ZR
 417 for a discussion of the definition of case-independent matching.
 418 .PP
 419 There is no particular limit on the length of REs,
 420 except insofar as memory is limited.
 421 Memory usage is approximately linear in RE size, and largely insensitive
 422 to RE complexity, except for bounded repetitions.
 423 See BUGS for one short RE using them
 424 that will run almost any system out of memory.
 425 .PP
 426 A backslashed character other than one specifically given a magic meaning
 427 by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
 428 is taken as an ordinary character.
 429 .PP
 430 Any unmatched [ is a REG_EBRACK error.
 431 .PP
 432 Equivalence classes cannot begin or end bracket-expression ranges.
 433 The endpoint of one range cannot begin another.
 434 .PP
 435 RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
 436 .PP
 437 A repetition operator (?, *, +, or bounds) cannot follow another
 438 repetition operator.
 439 A repetition operator cannot begin an expression or subexpression
 440 or follow `^' or `|'.
 441 .PP
 442 `|' cannot appear first or last in a (sub)expression or after another `|',
 443 i.e. an operand of `|' cannot be an empty subexpression.
 444 An empty parenthesized subexpression, `()', is legal and matches an
 445 empty (sub)string.
 446 An empty string is not a legal RE.
 447 .PP
 448 A `{' followed by a digit is considered the beginning of bounds for a
 449 bounded repetition, which must then follow the syntax for bounds.
 450 A `{' \fInot\fR followed by a digit is considered an ordinary character.
 451 .PP
 452 `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
 453 REs are anchors, not ordinary characters.
 454 .SH SEE ALSO
 455 grep(1), re_format(7)
 456 .PP
 457 POSIX 1003.2, sections 2.8 (Regular Expression Notation)
 458 and
 459 B.5 (C Binding for Regular Expression Matching).
 460 .SH DIAGNOSTICS
 461 Non-zero error codes from
 462 .I regcomp
 463 and
 464 .I regexec
 465 include the following:
 466 .PP
 467 .nf
 468 .ta \w'REG_ECOLLATE'u+3n
 469 REG_NOMATCH     regexec() failed to match
 470 REG_BADPAT      invalid regular expression
 471 REG_ECOLLATE    invalid collating element
 472 REG_ECTYPE      invalid character class
 473 REG_EESCAPE     \e applied to unescapable character
 474 REG_ESUBREG     invalid backreference number
 475 REG_EBRACK      brackets [ ] not balanced
 476 REG_EPAREN      parentheses ( ) not balanced
 477 REG_EBRACE      braces { } not balanced
 478 REG_BADBR       invalid repetition count(s) in { }
 479 REG_ERANGE      invalid character range in [ ]
 480 REG_ESPACE      ran out of memory
 481 REG_BADRPT      ?, *, or + operand invalid
 482 REG_EMPTY       empty (sub)expression
 483 REG_ASSERT      ``can't happen''\(emyou found a bug
 484 REG_INVARG      invalid argument, e.g. negative-length string
 485 .fi
 486 .SH HISTORY
 487 Originally written by Henry Spencer at University of Toronto.
 488 Altered for inclusion in the 4.4BSD distribution.
 489 .SH BUGS
 490 This is an alpha release with known defects.
 491 Please report problems.
 492 .PP
 493 There is one known functionality bug.
 494 The implementation of internationalization is incomplete:
 495 the locale is always assumed to be the default one of 1003.2,
 496 and only the collating elements etc. of that locale are available.
 497 .PP
 498 The back-reference code is subtle and doubts linger about its correctness
 499 in complex cases.
 500 .PP
 501 .I Regexec
 502 performance is poor.
 503 This will improve with later releases.
 504 .I Nmatch
 505 exceeding 0 is expensive;
 506 .I nmatch
 507 exceeding 1 is worse.
 508 .I Regexec
 509 is largely insensitive to RE complexity \fIexcept\fR that back
 510 references are massively expensive.
 511 RE length does matter; in particular, there is a strong speed bonus
 512 for keeping RE length under about 30 characters,
 513 with most special characters counting roughly double.
 514 .PP
 515 .I Regcomp
 516 implements bounded repetitions by macro expansion,
 517 which is costly in time and space if counts are large
 518 or bounded repetitions are nested.
 519 An RE like, say,
 520 `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
 521 will (eventually) run almost any existing machine out of swap space.
 522 .PP
 523 There are suspected problems with response to obscure error conditions.
 524 Notably,
 525 certain kinds of internal overflow,
 526 produced only by truly enormous REs or by multiply nested bounded repetitions,
 527 are probably not handled well.
 528 .PP
 529 Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
 530 a special character only in the presence of a previous unmatched `('.
 531 This can't be fixed until the spec is fixed.
 532 .PP
 533 The standard's definition of back references is vague.
 534 For example, does
 535 `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
 536 Until the standard is clarified,
 537 behavior in such cases should not be relied on.
 538 .PP
 539 The implementation of word-boundary matching is a bit of a kludge,
 540 and bugs may lurk in combinations of word-boundary matching and anchoring.