| 1 | This is flex.info, produced by makeinfo version 4.5 from flex.texi.
|
|---|
| 2 |
|
|---|
| 3 | INFO-DIR-SECTION Programming
|
|---|
| 4 | START-INFO-DIR-ENTRY
|
|---|
| 5 | * flex: (flex). Fast lexical analyzer generator (lex replacement).
|
|---|
| 6 | END-INFO-DIR-ENTRY
|
|---|
| 7 |
|
|---|
| 8 |
|
|---|
| 9 | The flex manual is placed under the same licensing conditions as the
|
|---|
| 10 | rest of flex:
|
|---|
| 11 |
|
|---|
| 12 | Copyright (C) 1990, 1997 The Regents of the University of California.
|
|---|
| 13 | All rights reserved.
|
|---|
| 14 |
|
|---|
| 15 | This code is derived from software contributed to Berkeley by Vern
|
|---|
| 16 | Paxson.
|
|---|
| 17 |
|
|---|
| 18 | The United States Government has rights in this work pursuant to
|
|---|
| 19 | contract no. DE-AC03-76SF00098 between the United States Department of
|
|---|
| 20 | Energy and the University of California.
|
|---|
| 21 |
|
|---|
| 22 | Redistribution and use in source and binary forms, with or without
|
|---|
| 23 | modification, are permitted provided that the following conditions are
|
|---|
| 24 | met:
|
|---|
| 25 |
|
|---|
| 26 | 1. Redistributions of source code must retain the above copyright
|
|---|
| 27 | notice, this list of conditions and the following disclaimer.
|
|---|
| 28 |
|
|---|
| 29 | 2. Redistributions in binary form must reproduce the above copyright
|
|---|
| 30 | notice, this list of conditions and the following disclaimer in the
|
|---|
| 31 | documentation and/or other materials provided with the
|
|---|
| 32 | distribution.
|
|---|
| 33 | Neither the name of the University nor the names of its contributors
|
|---|
| 34 | may be used to endorse or promote products derived from this software
|
|---|
| 35 | without specific prior written permission.
|
|---|
| 36 |
|
|---|
| 37 | THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
|
|---|
| 38 | WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
|
|---|
| 39 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
|---|
| 40 |
|
|---|
| 41 | File: flex.info, Node: Top, Next: Copyright, Prev: (dir), Up: (dir)
|
|---|
| 42 |
|
|---|
| 43 | flex
|
|---|
| 44 | ****
|
|---|
| 45 |
|
|---|
| 46 | This manual describes `flex', a tool for generating programs that
|
|---|
| 47 | perform pattern-matching on text. The manual includes both tutorial and
|
|---|
| 48 | reference sections.
|
|---|
| 49 |
|
|---|
| 50 | This edition of `The flex Manual' documents `flex' version 2.5.33.
|
|---|
| 51 | It was last updated on 20 February 2006.
|
|---|
| 52 |
|
|---|
| 53 | * Menu:
|
|---|
| 54 |
|
|---|
| 55 | * Copyright::
|
|---|
| 56 | * Reporting Bugs::
|
|---|
| 57 | * Introduction::
|
|---|
| 58 | * Simple Examples::
|
|---|
| 59 | * Format::
|
|---|
| 60 | * Patterns::
|
|---|
| 61 | * Matching::
|
|---|
| 62 | * Actions::
|
|---|
| 63 | * Generated Scanner::
|
|---|
| 64 | * Start Conditions::
|
|---|
| 65 | * Multiple Input Buffers::
|
|---|
| 66 | * EOF::
|
|---|
| 67 | * Misc Macros::
|
|---|
| 68 | * User Values::
|
|---|
| 69 | * Yacc::
|
|---|
| 70 | * Scanner Options::
|
|---|
| 71 | * Performance::
|
|---|
| 72 | * Cxx::
|
|---|
| 73 | * Reentrant::
|
|---|
| 74 | * Lex and Posix::
|
|---|
| 75 | * Memory Management::
|
|---|
| 76 | * Serialized Tables::
|
|---|
| 77 | * Diagnostics::
|
|---|
| 78 | * Limitations::
|
|---|
| 79 | * Bibliography::
|
|---|
| 80 | * FAQ::
|
|---|
| 81 | * Appendices::
|
|---|
| 82 | * Indices::
|
|---|
| 83 |
|
|---|
| 84 | --- The Detailed Node Listing ---
|
|---|
| 85 |
|
|---|
| 86 | Format of the Input File
|
|---|
| 87 |
|
|---|
| 88 | * Definitions Section::
|
|---|
| 89 | * Rules Section::
|
|---|
| 90 | * User Code Section::
|
|---|
| 91 | * Comments in the Input::
|
|---|
| 92 |
|
|---|
| 93 | Scanner Options
|
|---|
| 94 |
|
|---|
| 95 | * Options for Specifing Filenames::
|
|---|
| 96 | * Options Affecting Scanner Behavior::
|
|---|
| 97 | * Code-Level And API Options::
|
|---|
| 98 | * Options for Scanner Speed and Size::
|
|---|
| 99 | * Debugging Options::
|
|---|
| 100 | * Miscellaneous Options::
|
|---|
| 101 |
|
|---|
| 102 | Reentrant C Scanners
|
|---|
| 103 |
|
|---|
| 104 | * Reentrant Uses::
|
|---|
| 105 | * Reentrant Overview::
|
|---|
| 106 | * Reentrant Example::
|
|---|
| 107 | * Reentrant Detail::
|
|---|
| 108 | * Reentrant Functions::
|
|---|
| 109 |
|
|---|
| 110 | The Reentrant API in Detail
|
|---|
| 111 |
|
|---|
| 112 | * Specify Reentrant::
|
|---|
| 113 | * Extra Reentrant Argument::
|
|---|
| 114 | * Global Replacement::
|
|---|
| 115 | * Init and Destroy Functions::
|
|---|
| 116 | * Accessor Methods::
|
|---|
| 117 | * Extra Data::
|
|---|
| 118 | * About yyscan_t::
|
|---|
| 119 |
|
|---|
| 120 | Memory Management
|
|---|
| 121 |
|
|---|
| 122 | * The Default Memory Management::
|
|---|
| 123 | * Overriding The Default Memory Management::
|
|---|
| 124 | * A Note About yytext And Memory::
|
|---|
| 125 |
|
|---|
| 126 | Serialized Tables
|
|---|
| 127 |
|
|---|
| 128 | * Creating Serialized Tables::
|
|---|
| 129 | * Loading and Unloading Serialized Tables::
|
|---|
| 130 | * Tables File Format::
|
|---|
| 131 |
|
|---|
| 132 | FAQ
|
|---|
| 133 |
|
|---|
| 134 | * When was flex born?::
|
|---|
| 135 | * How do I expand \ escape sequences in C-style quoted strings?::
|
|---|
| 136 | * Why do flex scanners call fileno if it is not ANSI compatible?::
|
|---|
| 137 | * Does flex support recursive pattern definitions?::
|
|---|
| 138 | * How do I skip huge chunks of input (tens of megabytes) while using flex?::
|
|---|
| 139 | * Flex is not matching my patterns in the same order that I defined them.::
|
|---|
| 140 | * My actions are executing out of order or sometimes not at all.::
|
|---|
| 141 | * How can I have multiple input sources feed into the same scanner at the same time?::
|
|---|
| 142 | * Can I build nested parsers that work with the same input file?::
|
|---|
| 143 | * How can I match text only at the end of a file?::
|
|---|
| 144 | * How can I make REJECT cascade across start condition boundaries?::
|
|---|
| 145 | * Why cant I use fast or full tables with interactive mode?::
|
|---|
| 146 | * How much faster is -F or -f than -C?::
|
|---|
| 147 | * If I have a simple grammar cant I just parse it with flex?::
|
|---|
| 148 | * Why doesnt yyrestart() set the start state back to INITIAL?::
|
|---|
| 149 | * How can I match C-style comments?::
|
|---|
| 150 | * The period isnt working the way I expected.::
|
|---|
| 151 | * Can I get the flex manual in another format?::
|
|---|
| 152 | * Does there exist a "faster" NDFA->DFA algorithm?::
|
|---|
| 153 | * How does flex compile the DFA so quickly?::
|
|---|
| 154 | * How can I use more than 8192 rules?::
|
|---|
| 155 | * How do I abandon a file in the middle of a scan and switch to a new file?::
|
|---|
| 156 | * How do I execute code only during initialization (only before the first scan)?::
|
|---|
| 157 | * How do I execute code at termination?::
|
|---|
| 158 | * Where else can I find help?::
|
|---|
| 159 | * Can I include comments in the "rules" section of the file?::
|
|---|
| 160 | * I get an error about undefined yywrap().::
|
|---|
| 161 | * How can I change the matching pattern at run time?::
|
|---|
| 162 | * How can I expand macros in the input?::
|
|---|
| 163 | * How can I build a two-pass scanner?::
|
|---|
| 164 | * How do I match any string not matched in the preceding rules?::
|
|---|
| 165 | * I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
|
|---|
| 166 | * Is there a way to make flex treat NULL like a regular character?::
|
|---|
| 167 | * Whenever flex can not match the input it says "flex scanner jammed".::
|
|---|
| 168 | * Why doesnt flex have non-greedy operators like perl does?::
|
|---|
| 169 | * Memory leak - 16386 bytes allocated by malloc.::
|
|---|
| 170 | * How do I track the byte offset for lseek()?::
|
|---|
| 171 | * How do I use my own I/O classes in a C++ scanner?::
|
|---|
| 172 | * How do I skip as many chars as possible?::
|
|---|
| 173 | * deleteme00::
|
|---|
| 174 | * Are certain equivalent patterns faster than others?::
|
|---|
| 175 | * Is backing up a big deal?::
|
|---|
| 176 | * Can I fake multi-byte character support?::
|
|---|
| 177 | * deleteme01::
|
|---|
| 178 | * Can you discuss some flex internals?::
|
|---|
| 179 | * unput() messes up yy_at_bol::
|
|---|
| 180 | * The | operator is not doing what I want::
|
|---|
| 181 | * Why can't flex understand this variable trailing context pattern?::
|
|---|
| 182 | * The ^ operator isn't working::
|
|---|
| 183 | * Trailing context is getting confused with trailing optional patterns::
|
|---|
| 184 | * Is flex GNU or not?::
|
|---|
| 185 | * ERASEME53::
|
|---|
| 186 | * I need to scan if-then-else blocks and while loops::
|
|---|
| 187 | * ERASEME55::
|
|---|
| 188 | * ERASEME56::
|
|---|
| 189 | * ERASEME57::
|
|---|
| 190 | * Is there a repository for flex scanners?::
|
|---|
| 191 | * How can I conditionally compile or preprocess my flex input file?::
|
|---|
| 192 | * Where can I find grammars for lex and yacc?::
|
|---|
| 193 | * I get an end-of-buffer message for each character scanned.::
|
|---|
| 194 | * unnamed-faq-62::
|
|---|
| 195 | * unnamed-faq-63::
|
|---|
| 196 | * unnamed-faq-64::
|
|---|
| 197 | * unnamed-faq-65::
|
|---|
| 198 | * unnamed-faq-66::
|
|---|
| 199 | * unnamed-faq-67::
|
|---|
| 200 | * unnamed-faq-68::
|
|---|
| 201 | * unnamed-faq-69::
|
|---|
| 202 | * unnamed-faq-70::
|
|---|
| 203 | * unnamed-faq-71::
|
|---|
| 204 | * unnamed-faq-72::
|
|---|
| 205 | * unnamed-faq-73::
|
|---|
| 206 | * unnamed-faq-74::
|
|---|
| 207 | * unnamed-faq-75::
|
|---|
| 208 | * unnamed-faq-76::
|
|---|
| 209 | * unnamed-faq-77::
|
|---|
| 210 | * unnamed-faq-78::
|
|---|
| 211 | * unnamed-faq-79::
|
|---|
| 212 | * unnamed-faq-80::
|
|---|
| 213 | * unnamed-faq-81::
|
|---|
| 214 | * unnamed-faq-82::
|
|---|
| 215 | * unnamed-faq-83::
|
|---|
| 216 | * unnamed-faq-84::
|
|---|
| 217 | * unnamed-faq-85::
|
|---|
| 218 | * unnamed-faq-86::
|
|---|
| 219 | * unnamed-faq-87::
|
|---|
| 220 | * unnamed-faq-88::
|
|---|
| 221 | * unnamed-faq-90::
|
|---|
| 222 | * unnamed-faq-91::
|
|---|
| 223 | * unnamed-faq-92::
|
|---|
| 224 | * unnamed-faq-93::
|
|---|
| 225 | * unnamed-faq-94::
|
|---|
| 226 | * unnamed-faq-95::
|
|---|
| 227 | * unnamed-faq-96::
|
|---|
| 228 | * unnamed-faq-97::
|
|---|
| 229 | * unnamed-faq-98::
|
|---|
| 230 | * unnamed-faq-99::
|
|---|
| 231 | * unnamed-faq-100::
|
|---|
| 232 | * unnamed-faq-101::
|
|---|
| 233 | * What is the difference between YYLEX_PARAM and YY_DECL?::
|
|---|
| 234 | * Why do I get "conflicting types for yylex" error?::
|
|---|
| 235 | * How do I access the values set in a Flex action from within a Bison action?::
|
|---|
| 236 |
|
|---|
| 237 | Appendices
|
|---|
| 238 |
|
|---|
| 239 | * Makefiles and Flex::
|
|---|
| 240 | * Bison Bridge::
|
|---|
| 241 | * M4 Dependency::
|
|---|
| 242 |
|
|---|
| 243 | Indices
|
|---|
| 244 |
|
|---|
| 245 | * Concept Index::
|
|---|
| 246 | * Index of Functions and Macros::
|
|---|
| 247 | * Index of Variables::
|
|---|
| 248 | * Index of Data Types::
|
|---|
| 249 | * Index of Hooks::
|
|---|
| 250 | * Index of Scanner Options::
|
|---|
| 251 |
|
|---|
| 252 |
|
|---|
| 253 | File: flex.info, Node: Copyright, Next: Reporting Bugs, Prev: Top, Up: Top
|
|---|
| 254 |
|
|---|
| 255 | Copyright
|
|---|
| 256 | *********
|
|---|
| 257 |
|
|---|
| 258 |
|
|---|
| 259 | The flex manual is placed under the same licensing conditions as the
|
|---|
| 260 | rest of flex:
|
|---|
| 261 |
|
|---|
| 262 | Copyright (C) 1990, 1997 The Regents of the University of California.
|
|---|
| 263 | All rights reserved.
|
|---|
| 264 |
|
|---|
| 265 | This code is derived from software contributed to Berkeley by Vern
|
|---|
| 266 | Paxson.
|
|---|
| 267 |
|
|---|
| 268 | The United States Government has rights in this work pursuant to
|
|---|
| 269 | contract no. DE-AC03-76SF00098 between the United States Department of
|
|---|
| 270 | Energy and the University of California.
|
|---|
| 271 |
|
|---|
| 272 | Redistribution and use in source and binary forms, with or without
|
|---|
| 273 | modification, are permitted provided that the following conditions are
|
|---|
| 274 | met:
|
|---|
| 275 |
|
|---|
| 276 | 1. Redistributions of source code must retain the above copyright
|
|---|
| 277 | notice, this list of conditions and the following disclaimer.
|
|---|
| 278 |
|
|---|
| 279 | 2. Redistributions in binary form must reproduce the above copyright
|
|---|
| 280 | notice, this list of conditions and the following disclaimer in the
|
|---|
| 281 | documentation and/or other materials provided with the
|
|---|
| 282 | distribution.
|
|---|
| 283 | Neither the name of the University nor the names of its contributors
|
|---|
| 284 | may be used to endorse or promote products derived from this software
|
|---|
| 285 | without specific prior written permission.
|
|---|
| 286 |
|
|---|
| 287 | THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
|
|---|
| 288 | WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
|
|---|
| 289 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
|---|
| 290 |
|
|---|
| 291 | File: flex.info, Node: Reporting Bugs, Next: Introduction, Prev: Copyright, Up: Top
|
|---|
| 292 |
|
|---|
| 293 | Reporting Bugs
|
|---|
| 294 | **************
|
|---|
| 295 |
|
|---|
| 296 | If you have problems with `flex' or think you have found a bug,
|
|---|
| 297 | please send mail detailing your problem to
|
|---|
| 298 | <[email protected]>. Patches are always welcome.
|
|---|
| 299 |
|
|---|
| 300 |
|
|---|
| 301 | File: flex.info, Node: Introduction, Next: Simple Examples, Prev: Reporting Bugs, Up: Top
|
|---|
| 302 |
|
|---|
| 303 | Introduction
|
|---|
| 304 | ************
|
|---|
| 305 |
|
|---|
| 306 | `flex' is a tool for generating "scanners". A scanner is a program
|
|---|
| 307 | which recognizes lexical patterns in text. The `flex' program reads
|
|---|
| 308 | the given input files, or its standard input if no file names are
|
|---|
| 309 | given, for a description of a scanner to generate. The description is
|
|---|
| 310 | in the form of pairs of regular expressions and C code, called "rules".
|
|---|
| 311 | `flex' generates as output a C source file, `lex.yy.c' by default,
|
|---|
| 312 | which defines a routine `yylex()'. This file can be compiled and
|
|---|
| 313 | linked with the flex runtime library to produce an executable. When
|
|---|
| 314 | the executable is run, it analyzes its input for occurrences of the
|
|---|
| 315 | regular expressions. Whenever it finds one, it executes the
|
|---|
| 316 | corresponding C code.
|
|---|
| 317 |
|
|---|
| 318 |
|
|---|
| 319 | File: flex.info, Node: Simple Examples, Next: Format, Prev: Introduction, Up: Top
|
|---|
| 320 |
|
|---|
| 321 | Some Simple Examples
|
|---|
| 322 | ********************
|
|---|
| 323 |
|
|---|
| 324 | First some simple examples to get the flavor of how one uses `flex'.
|
|---|
| 325 |
|
|---|
| 326 | The following `flex' input specifies a scanner which, when it
|
|---|
| 327 | encounters the string `username', will replace it with the user's login
|
|---|
| 328 | name:
|
|---|
| 329 |
|
|---|
| 330 |
|
|---|
| 331 | %%
|
|---|
| 332 | username printf( "%s", getlogin() );
|
|---|
| 333 |
|
|---|
| 334 | By default, any text not matched by a `flex' scanner is copied to
|
|---|
| 335 | the output, so the net effect of this scanner is to copy its input file
|
|---|
| 336 | to its output with each occurrence of `username' expanded. In this
|
|---|
| 337 | input, there is just one rule. `username' is the "pattern" and the
|
|---|
| 338 | `printf' is the "action". The `%%' symbol marks the beginning of the
|
|---|
| 339 | rules.
|
|---|
| 340 |
|
|---|
| 341 | Here's another simple example:
|
|---|
| 342 |
|
|---|
| 343 |
|
|---|
| 344 | int num_lines = 0, num_chars = 0;
|
|---|
| 345 |
|
|---|
| 346 | %%
|
|---|
| 347 | \n ++num_lines; ++num_chars;
|
|---|
| 348 | . ++num_chars;
|
|---|
| 349 |
|
|---|
| 350 | %%
|
|---|
| 351 | int main( void )
|
|---|
| 352 | {
|
|---|
| 353 | yylex();
|
|---|
| 354 | printf( "# of lines = %d, # of chars = %d\n",
|
|---|
| 355 | num_lines, num_chars );
|
|---|
| 356 | }
|
|---|
| 357 |
|
|---|
| 358 | This scanner counts the number of characters and the number of lines
|
|---|
| 359 | in its input. It produces no output other than the final report on the
|
|---|
| 360 | character and line counts. The first line declares two globals,
|
|---|
| 361 | `num_lines' and `num_chars', which are accessible both inside `yylex()'
|
|---|
| 362 | and in the `main()' routine declared after the second `%%'. There are
|
|---|
| 363 | two rules, one which matches a newline (`\n') and increments both the
|
|---|
| 364 | line count and the character count, and one which matches any character
|
|---|
| 365 | other than a newline (indicated by the `.' regular expression).
|
|---|
| 366 |
|
|---|
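As a rough illustration of what the two rules above do, here is a hand-written C sketch that applies them to a string held in memory. It is not the code `flex' generates; `struct counts' and `count_text()' are hypothetical names used only for this sketch. Unlike a real scanner, it stops at the first NUL byte.

```c
/* Hypothetical helper, not part of flex: applies the two rules by hand. */
struct counts {
    int num_lines;
    int num_chars;
};

struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n') {
            /* rule for \n: count a line and a character */
            ++c.num_lines;
            ++c.num_chars;
        } else {
            /* rule for . : any character other than a newline */
            ++c.num_chars;
        }
    }
    return c;
}
```

Every input character is consumed by exactly one of the two rules, which is why the scanner itself copies nothing to the output.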
| 367 | A somewhat more complicated example:
|
|---|
| 368 |
|
|---|
| 369 |
|
|---|
| 370 | /* scanner for a toy Pascal-like language */
|
|---|
| 371 |
|
|---|
| 372 | %{
|
|---|
| 373 | /* need this for the call to atof() below */
|
|---|
| 374 | #include <math.h>
|
|---|
| 375 | %}
|
|---|
| 376 |
|
|---|
| 377 | DIGIT [0-9]
|
|---|
| 378 | ID [a-z][a-z0-9]*
|
|---|
| 379 |
|
|---|
| 380 | %%
|
|---|
| 381 |
|
|---|
| 382 | {DIGIT}+ {
|
|---|
| 383 | printf( "An integer: %s (%d)\n", yytext,
|
|---|
| 384 | atoi( yytext ) );
|
|---|
| 385 | }
|
|---|
| 386 |
|
|---|
| 387 | {DIGIT}+"."{DIGIT}* {
|
|---|
| 388 | printf( "A float: %s (%g)\n", yytext,
|
|---|
| 389 | atof( yytext ) );
|
|---|
| 390 | }
|
|---|
| 391 |
|
|---|
| 392 | if|then|begin|end|procedure|function {
|
|---|
| 393 | printf( "A keyword: %s\n", yytext );
|
|---|
| 394 | }
|
|---|
| 395 |
|
|---|
| 396 | {ID} printf( "An identifier: %s\n", yytext );
|
|---|
| 397 |
|
|---|
| 398 | "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
|
|---|
| 399 |
|
|---|
| 400 | "{"[^}\n]*"}" /* eat up one-line comments */
|
|---|
| 401 |
|
|---|
| 402 | [ \t\n]+ /* eat up whitespace */
|
|---|
| 403 |
|
|---|
| 404 | . printf( "Unrecognized character: %s\n", yytext );
|
|---|
| 405 |
|
|---|
| 406 | %%
|
|---|
| 407 |
|
|---|
| 408 | int main( int argc, char **argv )
|
|---|
| 411 | {
|
|---|
| 412 | ++argv, --argc; /* skip over program name */
|
|---|
| 413 | if ( argc > 0 )
|
|---|
| 414 | yyin = fopen( argv[0], "r" );
|
|---|
| 415 | else
|
|---|
| 416 | yyin = stdin;
|
|---|
| 417 |
|
|---|
| 418 | yylex();
|
|---|
| 419 | }
|
|---|
| 420 |
|
|---|
| 421 | This is the beginnings of a simple scanner for a language like
|
|---|
| 422 | Pascal. It identifies different types of "tokens" and reports on what
|
|---|
| 423 | it has seen.
|
|---|
| 424 |
|
|---|
| 425 | The details of this example will be explained in the following
|
|---|
| 426 | sections.
|
|---|
| 427 |
|
|---|
| 428 |
|
|---|
| 429 | File: flex.info, Node: Format, Next: Patterns, Prev: Simple Examples, Up: Top
|
|---|
| 430 |
|
|---|
| 431 | Format of the Input File
|
|---|
| 432 | ************************
|
|---|
| 433 |
|
|---|
| 434 | The `flex' input file consists of three sections, separated by a
|
|---|
| 435 | line containing only `%%'.
|
|---|
| 436 |
|
|---|
| 437 |
|
|---|
| 438 | definitions
|
|---|
| 439 | %%
|
|---|
| 440 | rules
|
|---|
| 441 | %%
|
|---|
| 442 | user code
|
|---|
| 443 |
|
|---|
| 444 | * Menu:
|
|---|
| 445 |
|
|---|
| 446 | * Definitions Section::
|
|---|
| 447 | * Rules Section::
|
|---|
| 448 | * User Code Section::
|
|---|
| 449 | * Comments in the Input::
|
|---|
| 450 |
|
|---|
| 451 |
|
|---|
| 452 | File: flex.info, Node: Definitions Section, Next: Rules Section, Prev: Format, Up: Format
|
|---|
| 453 |
|
|---|
| 454 | Format of the Definitions Section
|
|---|
| 455 | =================================
|
|---|
| 456 |
|
|---|
| 457 | The "definitions section" contains declarations of simple "name"
|
|---|
| 458 | definitions to simplify the scanner specification, and declarations of
|
|---|
| 459 | "start conditions", which are explained in a later section.
|
|---|
| 460 |
|
|---|
| 461 | Name definitions have the form:
|
|---|
| 462 |
|
|---|
| 463 |
|
|---|
| 464 | name definition
|
|---|
| 465 |
|
|---|
| 466 | The `name' is a word beginning with a letter or an underscore (`_')
|
|---|
| 467 | followed by zero or more letters, digits, `_', or `-' (dash). The
|
|---|
| 468 | definition is taken to begin at the first non-whitespace character
|
|---|
| 469 | following the name and continuing to the end of the line. The
|
|---|
| 470 | definition can subsequently be referred to using `{name}', which will
|
|---|
| 471 | expand to `(definition)'. For example,
|
|---|
| 472 |
|
|---|
| 473 |
|
|---|
| 474 | DIGIT [0-9]
|
|---|
| 475 | ID [a-z][a-z0-9]*
|
|---|
| 476 |
|
|---|
| 477 | defines `DIGIT' to be a regular expression which matches a single
|
|---|
| 478 | digit, and `ID' to be a regular expression which matches a letter
|
|---|
| 479 | followed by zero-or-more letters-or-digits. A subsequent reference to
|
|---|
| 480 |
|
|---|
| 481 |
|
|---|
| 482 | {DIGIT}+"."{DIGIT}*
|
|---|
| 483 |
|
|---|
| 484 | is identical to
|
|---|
| 485 |
|
|---|
| 486 |
|
|---|
| 487 | ([0-9])+"."([0-9])*
|
|---|
| 488 |
|
|---|
| 489 | and matches one-or-more digits followed by a `.' followed by
|
|---|
| 490 | zero-or-more digits.
|
|---|
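The expansion rule described above, where `{name}' becomes `(definition)', can be sketched in C as a plain text substitution. This is only an illustration under simplifying assumptions: the table `defs' and the functions `lookup()' and `expand_names()' are hypothetical, and the sketch ignores quoting and character classes, which a real implementation must respect.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table of name definitions, mirroring the example above. */
struct name_def {
    const char *name;
    const char *definition;
};

static const struct name_def defs[] = {
    { "DIGIT", "[0-9]" },
    { "ID",    "[a-z][a-z0-9]*" },
};

/* Look up a name of length len; returns NULL when it is not defined. */
static const char *lookup(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof defs / sizeof defs[0]; ++i) {
        if (strlen(defs[i].name) == len &&
            strncmp(defs[i].name, name, len) == 0)
            return defs[i].definition;
    }
    return NULL;
}

/* Copy pattern into out (capacity cap), replacing each known {name}
   with (definition).  Returns 0 on success, -1 if out is too small. */
int expand_names(const char *pattern, char *out, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    while (*pattern != '\0') {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *def = close
            ? lookup(pattern + 1, (size_t)(close - pattern - 1))
            : NULL;

        if (def != NULL) {
            /* known name: emit the definition wrapped in parentheses */
            int n = snprintf(out + used, cap - used, "(%s)", def);
            if (n < 0 || (size_t)n >= cap - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            /* anything else, including r{2,5}, is copied unchanged */
            if (used + 1 >= cap)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The parentheses matter: without them, a following operator such as `+' would apply only to the last element of the definition rather than to the definition as a whole.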
| 491 |
|
|---|
| 492 | An unindented comment (i.e., a line beginning with `/*') is copied
|
|---|
| 493 | verbatim to the output up to the next `*/'.
|
|---|
| 494 |
|
|---|
| 495 | Any _indented_ text or text enclosed in `%{' and `%}' is also copied
|
|---|
| 496 | verbatim to the output (with the %{ and %} symbols removed). The %{
|
|---|
| 497 | and %} symbols must appear unindented on lines by themselves.
|
|---|
| 498 |
|
|---|
| 499 | A `%top' block is similar to a `%{' ... `%}' block, except that the
|
|---|
| 500 | code in a `%top' block is relocated to the _top_ of the generated file,
|
|---|
| 501 | before any flex definitions (1). The `%top' block is useful when you
|
|---|
| 502 | want certain preprocessor macros to be defined or certain files to be
|
|---|
| 503 | included before the generated code. The single characters, `{' and
|
|---|
| 504 | `}' are used to delimit the `%top' block, as shown in the example below:
|
|---|
| 505 |
|
|---|
| 506 |
|
|---|
| 507 | %top{
|
|---|
| 508 | /* This code goes at the "top" of the generated file. */
|
|---|
| 509 | #include <stdint.h>
|
|---|
| 510 | #include <inttypes.h>
|
|---|
| 511 | }
|
|---|
| 512 |
|
|---|
| 513 | Multiple `%top' blocks are allowed, and their order is preserved.
|
|---|
| 514 |
|
|---|
| 515 | ---------- Footnotes ----------
|
|---|
| 516 |
|
|---|
| 517 | (1) Actually, `yyIN_HEADER' is defined before the `%top' block.
|
|---|
| 518 |
|
|---|
| 519 |
|
|---|
| 520 | File: flex.info, Node: Rules Section, Next: User Code Section, Prev: Definitions Section, Up: Format
|
|---|
| 521 |
|
|---|
| 522 | Format of the Rules Section
|
|---|
| 523 | ===========================
|
|---|
| 524 |
|
|---|
| 525 | The "rules" section of the `flex' input contains a series of rules
|
|---|
| 526 | of the form:
|
|---|
| 527 |
|
|---|
| 528 |
|
|---|
| 529 | pattern action
|
|---|
| 530 |
|
|---|
| 531 | where the pattern must be unindented and the action must begin on
|
|---|
| 532 | the same line. *Note Patterns::, for a further description of patterns
|
|---|
| 533 | and actions.
|
|---|
| 534 |
|
|---|
| 535 | In the rules section, any indented or %{ %} enclosed text appearing
|
|---|
| 536 | before the first rule may be used to declare variables which are local
|
|---|
| 537 | to the scanning routine and (after the declarations) code which is to be
|
|---|
| 538 | executed whenever the scanning routine is entered. Other indented or
|
|---|
| 539 | %{ %} text in the rule section is still copied to the output, but its
|
|---|
| 540 | meaning is not well-defined and it may well cause compile-time errors
|
|---|
| 541 | (this feature is present for POSIX compliance. *Note Lex and Posix::,
|
|---|
| 542 | for other such features).
|
|---|
| 543 |
|
|---|
| 544 | Any _indented_ text or text enclosed in `%{' and `%}' is copied
|
|---|
| 545 | verbatim to the output (with the %{ and %} symbols removed). The %{
|
|---|
| 546 | and %} symbols must appear unindented on lines by themselves.
|
|---|
| 547 |
|
|---|
| 548 |
|
|---|
| 549 | File: flex.info, Node: User Code Section, Next: Comments in the Input, Prev: Rules Section, Up: Format
|
|---|
| 550 |
|
|---|
| 551 | Format of the User Code Section
|
|---|
| 552 | ===============================
|
|---|
| 553 |
|
|---|
| 554 | The user code section is simply copied to `lex.yy.c' verbatim. It
|
|---|
| 555 | is used for companion routines which call or are called by the scanner.
|
|---|
| 556 | The presence of this section is optional; if it is missing, the second
|
|---|
| 557 | `%%' in the input file may be skipped, too.
|
|---|
| 558 |
|
|---|
| 559 |
|
|---|
| 560 | File: flex.info, Node: Comments in the Input, Prev: User Code Section, Up: Format
|
|---|
| 561 |
|
|---|
| 562 | Comments in the Input
|
|---|
| 563 | =====================
|
|---|
| 564 |
|
|---|
| 565 | Flex supports C-style comments, that is, anything between /* and */
|
|---|
| 566 | is considered a comment. Whenever flex encounters a comment, it copies
|
|---|
| 567 | the entire comment verbatim to the generated source code. Comments may
|
|---|
| 568 | appear just about anywhere, but with the following exceptions:
|
|---|
| 569 |
|
|---|
| 570 | * Comments may not appear in the Rules Section wherever flex is
|
|---|
| 571 | expecting a regular expression. This means comments may not appear
|
|---|
| 572 | at the beginning of a line, or immediately following a list of
|
|---|
| 573 | scanner states.
|
|---|
| 574 |
|
|---|
| 575 | * Comments may not appear on an `%option' line in the Definitions
|
|---|
| 576 | Section.
|
|---|
| 577 |
|
|---|
| 578 | If you want to follow a simple rule, then always begin a comment on a
|
|---|
| 579 | new line, with one or more whitespace characters before the initial
|
|---|
| 580 | `/*'. This rule will work anywhere in the input file.
|
|---|
| 581 |
|
|---|
| 582 | All the comments in the following example are valid:
|
|---|
| 583 |
|
|---|
| 584 |
|
|---|
| 585 | %{
|
|---|
| 586 | /* code block */
|
|---|
| 587 | %}
|
|---|
| 588 |
|
|---|
| 589 | /* Definitions Section */
|
|---|
| 590 | %x STATE_X
|
|---|
| 591 |
|
|---|
| 592 | %%
|
|---|
| 593 | /* Rules Section */
|
|---|
| 594 | ruleA /* after regex */ { /* code block */ } /* after code block */
|
|---|
| 595 | /* Rules Section (indented) */
|
|---|
| 596 | <STATE_X>{
|
|---|
| 597 | ruleC ECHO;
|
|---|
| 598 | ruleD ECHO;
|
|---|
| 599 | %{
|
|---|
| 600 | /* code block */
|
|---|
| 601 | %}
|
|---|
| 602 | }
|
|---|
| 603 | %%
|
|---|
| 604 | /* User Code Section */
|
|---|
| 605 |
|
|---|
| 606 |
|
|---|
| 607 | File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top
|
|---|
| 608 |
|
|---|
| 609 | Patterns
|
|---|
| 610 | ********
|
|---|
| 611 |
|
|---|
| 612 | The patterns in the input (see *Note Rules Section::) are written
|
|---|
| 613 | using an extended set of regular expressions. These are:
|
|---|
| 614 |
|
|---|
| 615 | `x'
|
|---|
| 616 | match the character 'x'
|
|---|
| 617 |
|
|---|
| 618 | `.'
|
|---|
| 619 | any character (byte) except newline
|
|---|
| 620 |
|
|---|
| 621 | `[xyz]'
|
|---|
| 622 | a "character class"; in this case, the pattern matches either an
|
|---|
| 623 | 'x', a 'y', or a 'z'
|
|---|
| 624 |
|
|---|
| 625 | `[abj-oZ]'
|
|---|
| 626 | a "character class" with a range in it; matches an 'a', a 'b', any
|
|---|
| 627 | letter from 'j' through 'o', or a 'Z'
|
|---|
| 628 |
|
|---|
| 629 | `[^A-Z]'
|
|---|
| 630 | a "negated character class", i.e., any character but those in the
|
|---|
| 631 | class. In this case, any character EXCEPT an uppercase letter.
|
|---|
| 632 |
|
|---|
| 633 | `[^A-Z\n]'
|
|---|
| 634 | any character EXCEPT an uppercase letter or a newline
|
|---|
| 635 |
|
|---|
| 636 | `r*'
|
|---|
| 637 | zero or more r's, where r is any regular expression
|
|---|
| 638 |
|
|---|
| 639 | `r+'
|
|---|
| 640 | one or more r's
|
|---|
| 641 |
|
|---|
| 642 | `r?'
|
|---|
| 643 | zero or one r's (that is, "an optional r")
|
|---|
| 644 |
|
|---|
| 645 | `r{2,5}'
|
|---|
| 646 | anywhere from two to five r's
|
|---|
| 647 |
|
|---|
| 648 | `r{2,}'
|
|---|
| 649 | two or more r's
|
|---|
| 650 |
|
|---|
| 651 | `r{4}'
|
|---|
| 652 | exactly 4 r's
|
|---|
| 653 |
|
|---|
| 654 | `{name}'
|
|---|
| 655 | the expansion of the `name' definition (*note Format::).
|
|---|
| 656 |
|
|---|
| 657 | `"[xyz]\"foo"'
|
|---|
| 658 | the literal string: `[xyz]"foo'
|
|---|
| 659 |
|
|---|
| 660 | `\X'
|
|---|
| 661 | if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
|
|---|
| 662 | interpretation of `\x'. Otherwise, a literal `X' (used to escape
|
|---|
| 663 | operators such as `*')
|
|---|
| 664 |
|
|---|
| 665 | `\0'
|
|---|
| 666 | a NUL character (ASCII code 0)
|
|---|
| 667 |
|
|---|
| 668 | `\123'
|
|---|
| 669 | the character with octal value 123
|
|---|
| 670 |
|
|---|
| 671 | `\x2a'
|
|---|
| 672 | the character with hexadecimal value 2a
|
|---|
| 673 |
|
|---|
| 674 | `(r)'
|
|---|
| 675 | match an `r'; parentheses are used to override precedence (see
|
|---|
| 676 | below)
|
|---|
| 677 |
|
|---|
| 678 | `rs'
|
|---|
| 679 | the regular expression `r' followed by the regular expression `s';
|
|---|
| 680 | called "concatenation"
|
|---|
| 681 |
|
|---|
| 682 | `r|s'
|
|---|
| 683 | either an `r' or an `s'
|
|---|
| 684 |
|
|---|
| 685 | `r/s'
|
|---|
| 686 | an `r' but only if it is followed by an `s'. The text matched by
|
|---|
| 687 | `s' is included when determining whether this rule is the longest
|
|---|
| 688 | match, but is then returned to the input before the action is
|
|---|
| 689 | executed. So the action only sees the text matched by `r'. This
|
|---|
| 690 | type of pattern is called "trailing context". (There are some
|
|---|
| 691 | combinations of `r/s' that flex cannot match correctly. *Note
|
|---|
| 692 | Limitations::, regarding dangerous trailing context.)
|
|---|
| 693 |
|
|---|
| 694 | `^r'
|
|---|
| 695 | an `r', but only at the beginning of a line (i.e., when just
|
|---|
| 696 | starting to scan, or right after a newline has been scanned).
|
|---|
| 697 |
|
|---|
| 698 | `r$'
|
|---|
| 699 | an `r', but only at the end of a line (i.e., just before a
|
|---|
| 700 | newline). Equivalent to `r/\n'.
|
|---|
| 701 |
|
|---|
| 702 | Note that `flex''s notion of "newline" is exactly whatever the C
|
|---|
| 703 | compiler used to compile `flex' interprets `\n' as; in particular,
|
|---|
| 704 | on some DOS systems you must either filter out `\r's in the input
|
|---|
| 705 | yourself, or explicitly use `r/\r\n' for `r$'.
|
|---|
| 706 |
|
|---|
| 707 | `<s>r'
|
|---|
| 708 | an `r', but only in start condition `s' (see *Note Start
|
|---|
| 709 | Conditions:: for discussion of start conditions).
|
|---|
| 710 |
|
|---|
| 711 | `<s1,s2,s3>r'
|
|---|
| 712 | same, but in any of start conditions `s1', `s2', or `s3'.
|
|---|
| 713 |
|
|---|
| 714 | `<*>r'
|
|---|
| 715 | an `r' in any start condition, even an exclusive one.
|
|---|
| 716 |
|
|---|
| 717 | `<<EOF>>'
|
|---|
| 718 | an end-of-file.
|
|---|
| 719 |
|
|---|
| 720 | `<s1,s2><<EOF>>'
|
|---|
| 721 | an end-of-file when in start condition `s1' or `s2'
|
|---|
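The `r$' entry above notes that on some DOS systems you may have to filter out `\r's in the input yourself. One possible way to do that before the text reaches the scanner is sketched below in C. The function `strip_cr()' is a hypothetical helper, not part of `flex', and it assumes the input is already in a buffer of known length.

```c
#include <stddef.h>

/* Remove every '\r' that is immediately followed by '\n', in place.
   Returns the new length of the buffer contents. */
size_t strip_cr(char *buf, size_t len)
{
    size_t out = 0;

    for (size_t in = 0; in < len; ++in) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;            /* drop the CR of a CR-LF pair */
        buf[out++] = buf[in];
    }
    return out;
}
```

A lone `\r' that is not followed by `\n' is kept, so only line endings are affected. After filtering, a pattern written as `r$' sees a plain newline at the end of each line.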
| 722 |
|
|---|
| 723 | Note that inside of a character class, all regular expression
|
|---|
| 724 | operators lose their special meaning except escape (`\') and the
|
|---|
| 725 | character class operators, `-', `]', and, at the beginning of the
|
|---|
| 726 | class, `^'.
|
|---|
| 727 |
|
|---|
| 728 | The regular expressions listed above are grouped according to
|
|---|
| 729 | precedence, from highest precedence at the top to lowest at the bottom.
|
|---|
| 730 | Those grouped together have equal precedence (see special note on the
|
|---|
| 731 | precedence of the repeat operator, `{}', under the documentation for
|
|---|
| 732 | the `--posix' POSIX compliance option). For example,
|
|---|
| 733 |
|
|---|
| 734 |
|
|---|
| 735 | foo|bar*
|
|---|
| 736 |
|
|---|
| 737 | is the same as
|
|---|
| 738 |
|
|---|
| 739 |
|
|---|
| 740 | (foo)|(ba(r*))
|
|---|
| 741 |
|
|---|
| 742 | since the `*' operator has higher precedence than concatenation, and
|
|---|
| 743 | concatenation higher than alternation (`|'). This pattern therefore
|
|---|
| 744 | matches _either_ the string `foo' _or_ the string `ba' followed by
|
|---|
| 745 | zero-or-more `r''s. To match `foo' or zero-or-more repetitions of the
|
|---|
| 746 | string `bar', use:
|
|---|
| 747 |
|
|---|
| 748 |
|
|---|
| 749 | foo|(bar)*
|
|---|
| 750 |
|
|---|
| 751 | And to match a sequence of zero or more repetitions of `foo' and
|
|---|
| 752 | `bar':
|
|---|
| 753 |
|
|---|
| 754 |
|
|---|
| 755 | (foo|bar)*
|
|---|
| 756 |
|
|---|
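The grouping described above can be checked outside of flex. The following is a minimal sketch, assuming a POSIX system: it uses the C library's `regcomp()' and `regexec()' with extended regular expressions, which give `*', concatenation, and `|' the same relative precedence for these three patterns. The helper name `full_match' is hypothetical and is not part of flex.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if `pattern` (a POSIX extended regular expression) matches
 * the whole of `s`, 0 if it does not, and -1 if the pattern cannot be
 * compiled.  Hypothetical helper for illustration only. */
int full_match(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Wrap the pattern in ^( )$ so the anchors apply to the pattern as
     * a whole, whatever operators it contains. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, `foo|bar*' accepts `barrr' but not `barbar', `foo|(bar)*' accepts `barbar' but not `barrr', and `(foo|bar)*' accepts `foobarfoo'.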
| 757 | In addition to characters and ranges of characters, character classes
|
|---|
| 758 | can also contain "character class expressions". These are expressions
|
|---|
| 759 | enclosed inside `[:' and `:]' delimiters (which themselves must appear
|
|---|
| 760 | between the `[' and `]' of the character class; other elements may
|
|---|
| 761 | occur inside the character class, too). The valid expressions are:
|
|---|
| 762 |
|
|---|
| 763 |
|
|---|
| 764 | [:alnum:] [:alpha:] [:blank:]
|
|---|
| 765 | [:cntrl:] [:digit:] [:graph:]
|
|---|
| 766 | [:lower:] [:print:] [:punct:]
|
|---|
| 767 | [:space:] [:upper:] [:xdigit:]
|
|---|
| 768 |
|
|---|
| 769 | These expressions all designate a set of characters equivalent to the
|
|---|
| 770 | corresponding standard C `isXXX' function. For example, `[:alnum:]'
|
|---|
| 771 | designates those characters for which `isalnum()' returns true - i.e.,
|
|---|
| 772 | any alphabetic or numeric character. Some systems don't provide
|
|---|
| 773 | `isblank()', so flex defines `[:blank:]' as a blank or a tab.
|
|---|
| 774 |
|
|---|
| 775 | For example, the following character classes are all equivalent:
|
|---|
| 776 |
|
|---|
| 777 |
|
|---|
| 778 | [[:alnum:]]
|
|---|
| 779 | [[:alpha:][:digit:]]
|
|---|
| 780 | [[:alpha:][0-9]]
|
|---|
| 781 | [a-zA-Z0-9]
|
|---|
| 782 |
|
|---|
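The correspondence with the C `isXXX' functions can be checked directly. The sketch below assumes the default "C" locale and an ASCII character set; all function names in it are hypothetical. It compares `isalnum()' against the two other spellings of the class shown above, and spells out the blank-or-tab definition of `[:blank:]'.

```c
#include <ctype.h>

/* The definition of [:blank:] given in the text: a blank or a tab. */
int is_blank_fallback(int c)
{
    return c == ' ' || c == '\t';
}

/* [[:alpha:][:digit:]] spelled with the C predicates. */
int in_alpha_or_digit(int c)
{
    return isalpha(c) || isdigit(c);
}

/* [a-zA-Z0-9] spelled as explicit ranges (assumes ASCII, where the
 * letters of each case are contiguous). */
int in_explicit_ranges(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Counts the character codes in 0..255 on which the three spellings
 * of the class disagree with one another. */
int count_disagreements(void)
{
    int c, bad = 0;

    for (c = 0; c < 256; c++) {
        int a = isalnum(c) != 0;
        int b = in_alpha_or_digit(c) != 0;
        int d = in_explicit_ranges(c);
        if (a != b || a != d)
            bad++;
    }
    return bad;
}
```

Under the stated assumptions `count_disagreements()' finds no differences. In other locales `isalnum()' may accept additional characters, in which case the explicit range is no longer equivalent.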
| 783 | Some notes on patterns are in order.
|
|---|
| 784 |
|
|---|
| 785 | * If your scanner is case-insensitive (the `-i' flag), then
|
|---|
| 786 | `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.
|
|---|
| 787 |
|
|---|
| 788 | * Character classes with ranges, such as `[A-z]', should be used with
|
|---|
| 789 | caution in a case-insensitive scanner if the range spans upper or
|
|---|
| 790 | lowercase characters. Flex does not know if you want to fold all
|
|---|
| 791 | upper and lowercase characters together, or if you want the
|
|---|
| 792 | literal numeric range specified (with no case folding). When in
|
|---|
| 793 | doubt, flex will assume that you meant the literal numeric range,
|
|---|
| 794 | and will issue a warning. The exception to this rule is a
|
|---|
| 795 | character range such as `[a-z]' or `[S-W]' where it is obvious
|
|---|
| 796 | that you want case-folding to occur. Here are some examples with
|
|---|
| 797 | the `-i' flag enabled:
|
|---|
| 798 |
|
|---|
| 799 | Range    Result     Literal Range        Alternate Range
|
|---|
| 800 | `[a-t]'  ok         `[a-tA-T]'
|
|---|
| 801 | `[A-T]'  ok         `[a-tA-T]'
|
|---|
| 802 | `[A-t]'  ambiguous  `[A-Z\[\\\]_`a-t]'  `[a-tA-T]'
|
|---|
| 803 | `[_-{]'  ambiguous  `[_`a-z{]'          `[_`a-zA-Z{]'
|
|---|
| 804 | `[@-C]'  ambiguous  `[@ABC]'            `[@A-Z\[\\\]_`abc]'
|
|---|
| 805 |
|
|---|
| 806 | * A negated character class such as the example `[^A-Z]' above
|
|---|
| 807 | _will_ match a newline unless `\n' (or an equivalent escape
|
|---|
| 808 | sequence) is one of the characters explicitly present in the
|
|---|
| 809 | negated character class (e.g., `[^A-Z\n]'). This is unlike how
|
|---|
| 810 | many other regular expression tools treat negated character
|
|---|
| 811 | classes, but unfortunately the inconsistency is historically
|
|---|
| 812 | entrenched. Matching newlines means that a pattern like `[^"]*'
|
|---|
| 813 | can match the entire input unless there's another quote in the
|
|---|
| 814 | input.
|
|---|
| 815 |
|
|---|
| 816 | * A rule can have at most one instance of trailing context (the `/'
|
|---|
| 817 | operator or the `$' operator). The start condition, `^', and
|
|---|
| 818 | `<<EOF>>' patterns can only occur at the beginning of a pattern
|
|---|
| 819 | and, like `/' and `$', cannot be grouped inside
|
|---|
| 820 | parentheses. A `^' which does not occur at the beginning of a
|
|---|
| 821 | rule or a `$' which does not occur at the end of a rule loses its
|
|---|
| 822 | special properties and is treated as a normal character.
|
|---|
| 823 |
|
|---|
| 824 | * The following are invalid:
|
|---|
| 825 |
|
|---|
| 826 |
|
|---|
| 827 | foo/bar$
|
|---|
| 828 | <sc1>foo<sc2>bar
|
|---|
| 829 |
|
|---|
| 830 | Note that the first of these can be written `foo/bar\n'.
|
|---|
| 831 |
|
|---|
| 832 | * The following will result in `$' or `^' being treated as a normal
|
|---|
| 833 | character:
|
|---|
| 834 |
|
|---|
| 835 |
|
|---|
| 836 | foo|(bar$)
|
|---|
| 837 | foo|^bar
|
|---|
| 838 |
|
|---|
| 839 | If the desired meaning is a `foo' or a
|
|---|
| 840 | `bar'-followed-by-a-newline, the following could be used (the
|
|---|
| 841 | special `|' action is explained below, *note Actions::):
|
|---|
| 842 |
|
|---|
| 843 |
|
|---|
| 844 | foo |
|
|---|
| 845 | bar$ /* action goes here */
|
|---|
| 846 |
|
|---|
| 847 | A similar trick will work for matching a `foo' or a
|
|---|
| 848 | `bar'-at-the-beginning-of-a-line.
|
|---|
| 849 |
|
|---|
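The "literal numeric range" mentioned in the note on case-insensitive scanners is simply every character code between the two endpoints. The following sketch, which assumes ASCII, expands such a range so that it can be compared with the `Literal Range' column for `[@-C]' and `[_-{]'. The name `expand_range' is hypothetical.

```c
#include <stddef.h>

/* Writes every character from `lo` to `hi` (inclusive, by numeric
 * code) into `out` and NUL-terminates it.  `out` must have room for
 * hi - lo + 2 bytes.  Returns the number of characters written;
 * nothing is written besides the terminator when lo > hi. */
size_t expand_range(unsigned char lo, unsigned char hi, char *out)
{
    size_t n = 0;
    unsigned int c;

    for (c = lo; c <= hi; c++)
        out[n++] = (char)c;
    out[n] = '\0';
    return n;
}
```

For `@' through `C' this yields `@ABC', with no lowercase letters, which is why flex treats the range as ambiguous in a case-insensitive scanner.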
| 850 |
|
|---|
| 851 | File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top
|
|---|
| 852 |
|
|---|
| 853 | How the Input Is Matched
|
|---|
| 854 | ************************
|
|---|
| 855 |
|
|---|
| 856 | When the generated scanner is run, it analyzes its input looking for
|
|---|
| 857 | strings which match any of its patterns. If it finds more than one
|
|---|
| 858 | match, it takes the one matching the most text (for trailing context
|
|---|
| 859 | rules, this includes the length of the trailing part, even though it
|
|---|
| 860 | will then be returned to the input). If it finds two or more matches of
|
|---|
| 861 | the same length, the rule listed first in the `flex' input file is
|
|---|
| 862 | chosen.
|
|---|
| 863 |
|
|---|
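The two selection rules above (longest match first, then earliest rule) can be sketched in plain C. This is only an illustration of the selection policy, not of how the generated scanner is implemented; the rule functions and `pick_rule' are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical "rule" reports how many leading characters of `s`
 * it matches; 0 means no match. */
typedef size_t (*rule_fn)(const char *s);

/* Rule 0: the keyword `if'. */
size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Rule 1: one or more lowercase letters. */
size_t match_ident(const char *s)
{
    size_t n = 0;

    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the index of the winning rule and stores the match length
 * in *len, or returns -1 if no rule matches (the case the default
 * rule would handle). */
int pick_rule(const rule_fn *rules, size_t nrules,
              const char *input, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < nrules; i++) {
        size_t n = rules[i](input);
        /* Strictly greater: on a tie the earlier rule is kept. */
        if (n > best_len) {
            best = (int)i;
            best_len = n;
        }
    }
    *len = best_len;
    return best;
}
```

With the rules listed in the order `match_if', `match_ident', the input `if(' selects the keyword rule (both match two characters, and the keyword is listed first), while `iffy' selects the identifier rule because it matches more text.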
| 864 | Once the match is determined, the text corresponding to the match
|
|---|
| 865 | (called the "token") is made available in the global character pointer
|
|---|
| 866 | `yytext', and its length in the global integer `yyleng'. The "action"
|
|---|
| 867 | corresponding to the matched pattern is then executed (*note
|
|---|
| 868 | Actions::), and then the remaining input is scanned for another match.
|
|---|
| 869 |
|
|---|
| 870 | If no match is found, then the "default rule" is executed: the next
|
|---|
| 871 | character in the input is considered matched and copied to the standard
|
|---|
| 872 | output. Thus, the simplest valid `flex' input is:
|
|---|
| 873 |
|
|---|
| 874 |
|
|---|
| 875 | %%
|
|---|
| 876 |
|
|---|
| 877 | which generates a scanner that simply copies its input (one
|
|---|
| 878 | character at a time) to its output.
|
|---|
| 879 |
|
|---|
| 880 | Note that `yytext' can be defined in two different ways: either as a
|
|---|
| 881 | character _pointer_ or as a character _array_. You can control which
|
|---|
| 882 | definition `flex' uses by including one of the special directives
|
|---|
| 883 | `%pointer' or `%array' in the first (definitions) section of your flex
|
|---|
| 884 | input. The default is `%pointer', unless you use the `-l' lex
|
|---|
| 885 | compatibility option, in which case `yytext' will be an array. The
|
|---|
| 886 | advantage of using `%pointer' is substantially faster scanning and no
|
|---|
| 887 | buffer overflow when matching very large tokens (unless you run out of
|
|---|
| 888 | dynamic memory). The disadvantage is that you are restricted in how
|
|---|
| 889 | your actions can modify `yytext' (*note Actions::), and calls to the
|
|---|
| 890 | `unput()' function destroy the present contents of `yytext', which can
|
|---|
| 891 | be a considerable porting headache when moving between different `lex'
|
|---|
| 892 | versions.
|
|---|
| 893 |
|
|---|
| 894 | The advantage of `%array' is that you can then modify `yytext' to
|
|---|
| 895 | your heart's content, and calls to `unput()' do not destroy `yytext'
|
|---|
| 896 | (*note Actions::). Furthermore, existing `lex' programs sometimes
|
|---|
| 897 | access `yytext' externally using declarations of the form:
|
|---|
| 898 |
|
|---|
| 899 |
|
|---|
| 900 | extern char yytext[];
|
|---|
| 901 |
|
|---|
| 902 | This definition is erroneous when used with `%pointer', but correct
|
|---|
| 903 | for `%array'.
|
|---|
| 904 |
|
|---|
| 905 | The `%array' declaration defines `yytext' to be an array of `YYLMAX'
|
|---|
| 906 | characters, which defaults to a fairly large value. You can change the
|
|---|
| 907 | size by simply #define'ing `YYLMAX' to a different value in the first
|
|---|
| 908 | section of your `flex' input. As mentioned above, with `%pointer'
|
|---|
| 909 | `yytext' grows dynamically to accommodate large tokens. While this means
|
|---|
| 910 | your `%pointer' scanner can accommodate very large tokens (such as
|
|---|
| 911 | matching entire blocks of comments), bear in mind that each time the
|
|---|
| 912 | scanner must resize `yytext' it also must rescan the entire token from
|
|---|
| 913 | the beginning, so matching such tokens can prove slow. `yytext'
|
|---|
| 914 | presently does _not_ dynamically grow if a call to `unput()' results in
|
|---|
| 915 | too much text being pushed back; instead, a run-time error results.
|
|---|
| 916 |
|
|---|
| 917 | Also note that you cannot use `%array' with C++ scanner classes
|
|---|
| 918 | (*note Cxx::).
|
|---|
| 919 |
|
|---|
| 920 |
|
|---|
| 921 | File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top
|
|---|
| 922 |
|
|---|
| 923 | Actions
|
|---|
| 924 | *******
|
|---|
| 925 |
|
|---|
| 926 | Each pattern in a rule has a corresponding "action", which can be
|
|---|
| 927 | any arbitrary C statement. The pattern ends at the first non-escaped
|
|---|
| 928 | whitespace character; the remainder of the line is its action. If the
|
|---|
| 929 | action is empty, then when the pattern is matched the input token is
|
|---|
| 930 | simply discarded. For example, here is the specification for a program
|
|---|
| 931 | which deletes all occurrences of `zap me' from its input:
|
|---|
| 932 |
|
|---|
| 933 |
|
|---|
| 934 | %%
|
|---|
| 935 | "zap me"
|
|---|
| 936 |
|
|---|
| 937 | This example will copy all other characters in the input to the
|
|---|
| 938 | output since they will be matched by the default rule.
|
|---|
| 939 |
|
|---|
| 940 | Here is a program which compresses multiple blanks and tabs down to a
|
|---|
| 941 | single blank, and throws away whitespace found at the end of a line:
|
|---|
| 942 |
|
|---|
| 943 |
|
|---|
| 944 | %%
|
|---|
| 945 | [ \t]+ putchar( ' ' );
|
|---|
| 946 | [ \t]+$ /* ignore this token */
|
|---|
| 947 |
|
|---|
| 948 | If the action contains a `{', then the action spans till the
|
|---|
| 949 | balancing `}' is found, and the action may cross multiple lines.
|
|---|
| 950 | `flex' knows about C strings and comments and won't be fooled by braces
|
|---|
| 951 | found within them, but also allows actions to begin with `%{' and will
|
|---|
| 952 | consider the action to be all the text up to the next `%}' (regardless
|
|---|
| 953 | of ordinary braces inside the action).
|
|---|
| 954 |
|
|---|
| 955 | An action consisting solely of a vertical bar (`|') means "same as
|
|---|
| 956 | the action for the next rule". See below for an illustration.
|
|---|
| 957 |
|
|---|
| 958 | Actions can include arbitrary C code, including `return' statements
|
|---|
| 959 | to return a value to whatever routine called `yylex()'. Each time
|
|---|
| 960 | `yylex()' is called it continues processing tokens from where it last
|
|---|
| 961 | left off until it either reaches the end of the file or executes a
|
|---|
| 962 | return.
|
|---|
| 963 |
|
|---|
| 964 | Actions are free to modify `yytext' except for lengthening it
|
|---|
| 965 | (adding characters to its end; these will overwrite later characters in
|
|---|
| 966 | the input stream). This however does not apply when using `%array'
|
|---|
| 967 | (*note Matching::). In that case, `yytext' may be freely modified in
|
|---|
| 968 | any way.
|
|---|
| 969 |
|
|---|
| 970 | Actions are free to modify `yyleng' except they should not do so if
|
|---|
| 971 | the action also includes use of `yymore()' (see below).
|
|---|
| 972 |
|
|---|
| 973 | There are a number of special directives which can be included
|
|---|
| 974 | within an action:
|
|---|
| 975 |
|
|---|
| 976 | `ECHO'
|
|---|
| 977 | copies `yytext' to the scanner's output.
|
|---|
| 978 |
|
|---|
| 979 | `BEGIN'
|
|---|
| 980 | followed by the name of a start condition places the scanner in the
|
|---|
| 981 | corresponding start condition (see below).
|
|---|
| 982 |
|
|---|
| 983 | `REJECT'
|
|---|
| 984 | directs the scanner to proceed on to the "second best" rule which
|
|---|
| 985 | matched the input (or a prefix of the input). The rule is chosen
|
|---|
| 986 | as described above in *Note Matching::, and `yytext' and `yyleng'
|
|---|
| 987 | set up appropriately. It may either be one which matched as much
|
|---|
| 988 | text as the originally chosen rule but came later in the `flex'
|
|---|
| 989 | input file, or one which matched less text. For example, the
|
|---|
| 990 | following will both count the words in the input and call the
|
|---|
| 991 | routine `special()' whenever `frob' is seen:
|
|---|
| 992 |
|
|---|
| 993 |
|
|---|
| 994 | int word_count = 0;
|
|---|
| 995 | %%
|
|---|
| 996 |
|
|---|
| 997 | frob special(); REJECT;
|
|---|
| 998 | [^ \t\n]+ ++word_count;
|
|---|
| 999 |
|
|---|
| 1000 | Without the `REJECT', any occurrences of `frob' in the input would
|
|---|
| 1001 | not be counted as words, since the scanner normally executes only
|
|---|
| 1002 | one action per token. Multiple uses of `REJECT' are allowed, each
|
|---|
| 1003 | one finding the next best choice to the currently active rule. For
|
|---|
| 1004 | example, when the following scanner scans the token `abcd', it will
|
|---|
| 1005 | write `abcdabcaba' to the output:
|
|---|
| 1006 |
|
|---|
| 1007 |
|
|---|
| 1008 | %%
|
|---|
| 1009 | a |
|
|---|
| 1010 | ab |
|
|---|
| 1011 | abc |
|
|---|
| 1012 | abcd ECHO; REJECT;
|
|---|
| 1013 | .|\n /* eat up any unmatched character */
|
|---|
| 1014 |
|
|---|
| 1015 | The first three rules share the fourth's action since they use the
|
|---|
| 1016 | special `|' action.
|
|---|
| 1017 |
|
|---|
| 1018 | `REJECT' is a particularly expensive feature in terms of scanner
|
|---|
| 1019 | performance; if it is used in _any_ of the scanner's actions it
|
|---|
| 1020 | will slow down _all_ of the scanner's matching. Furthermore,
|
|---|
| 1021 | `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
|
|---|
| 1022 | Scanner Options::).
|
|---|
| 1023 |
|
|---|
| 1024 | Note also that unlike the other special actions, `REJECT' is a
|
|---|
| 1025 | _branch_; code immediately following it in the action will _not_
|
|---|
| 1026 | be executed.
|
|---|
| 1027 |
|
|---|
| 1028 | `yymore()'
|
|---|
| 1029 | tells the scanner that the next time it matches a rule, the
|
|---|
| 1030 | corresponding token should be _appended_ onto the current value of
|
|---|
| 1031 | `yytext' rather than replacing it. For example, given the input
|
|---|
| 1032 | `mega-kludge' the following will write `mega-mega-kludge' to the
|
|---|
| 1033 | output:
|
|---|
| 1034 |
|
|---|
| 1035 |
|
|---|
| 1036 | %%
|
|---|
| 1037 | mega- ECHO; yymore();
|
|---|
| 1038 | kludge ECHO;
|
|---|
| 1039 |
|
|---|
| 1040 | First `mega-' is matched and echoed to the output. Then `kludge'
|
|---|
| 1041 | is matched, but the previous `mega-' is still hanging around at the
|
|---|
| 1042 | beginning of `yytext' so the `ECHO' for the `kludge' rule will
|
|---|
| 1043 | actually write `mega-kludge'.
|
|---|
| 1044 |
|
|---|
| 1045 | Two notes regarding use of `yymore()'. First, `yymore()' depends on
|
|---|
| 1046 | the value of `yyleng' correctly reflecting the size of the current
|
|---|
| 1047 | token, so you must not modify `yyleng' if you are using `yymore()'.
|
|---|
| 1048 | Second, the presence of `yymore()' in the scanner's action entails a
|
|---|
| 1049 | minor performance penalty in the scanner's matching speed.
|
|---|
| 1050 |
|
|---|
| 1051 | `yyless(n)' returns all but the first `n' characters of the current
|
|---|
| 1052 | token back to the input stream, where they will be rescanned when the
|
|---|
| 1053 | scanner looks for the next match. `yytext' and `yyleng' are adjusted
|
|---|
| 1054 | appropriately (e.g., `yyleng' will now be equal to `n'). For example,
|
|---|
| 1055 | on the input `foobar' the following will write out `foobarbar':
|
|---|
| 1056 |
|
|---|
| 1057 |
|
|---|
| 1058 | %%
|
|---|
| 1059 | foobar ECHO; yyless(3);
|
|---|
| 1060 | [a-z]+ ECHO;
|
|---|
| 1061 |
|
|---|
| 1062 | An argument of 0 to `yyless()' will cause the entire current input
|
|---|
| 1063 | string to be scanned again. Unless you've changed how the scanner will
|
|---|
| 1064 | subsequently process its input (using `BEGIN', for example), this will
|
|---|
| 1065 | result in an endless loop.
|
|---|
| 1066 |
|
|---|
| 1067 | Note that `yyless()' is a macro and can only be used in the flex
|
|---|
| 1068 | input file, not from other source files.
|
|---|
| 1069 |
|
|---|
| 1070 | `unput(c)' puts the character `c' back onto the input stream. It
|
|---|
| 1071 | will be the next character scanned. The following action will take the
|
|---|
| 1072 | current token and cause it to be rescanned enclosed in parentheses.
|
|---|
| 1073 |
|
|---|
| 1074 |
|
|---|
| 1075 | {
|
|---|
| 1076 | int i;
|
|---|
| 1077 | /* Copy yytext because unput() trashes yytext */
|
|---|
| 1078 | char *yycopy = strdup( yytext );
|
|---|
| 1079 | unput( ')' );
|
|---|
| 1080 | for ( i = yyleng - 1; i >= 0; --i )
|
|---|
| 1081 | unput( yycopy[i] );
|
|---|
| 1082 | unput( '(' );
|
|---|
| 1083 | free( yycopy );
|
|---|
| 1084 | }
|
|---|
| 1085 |
|
|---|
| 1086 | Note that since each `unput()' puts the given character back at the
|
|---|
| 1087 | _beginning_ of the input stream, pushing back strings must be done
|
|---|
| 1088 | back-to-front.
|
|---|
| 1089 |
|
|---|
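The back-to-front requirement follows from the pushed-back characters behaving like a stack. The sketch below models that with a small fixed-size stack; it is an illustration only, not flex's buffer code, and `push_back', `next_char', and `push_back_parenthesized' are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input: a stack of
 * pushed-back characters that is read before anything else. */
static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Like unput(c): `c` becomes the next character read.
 * Returns 0 on success, -1 if the stack is full. */
int push_back(char c)
{
    if (pushback_len == PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = c;
    return 0;
}

/* Returns the next pending character, or -1 if none is pending. */
int next_char(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char)pushback[--pushback_len];
}

/* Pushes back "(token)" so that it is later read in forward order:
 * the closing parenthesis goes first, then the token from its last
 * character to its first, then the opening parenthesis. */
int push_back_parenthesized(const char *token)
{
    size_t i = strlen(token);

    if (push_back(')') != 0)
        return -1;
    while (i > 0)
        if (push_back(token[--i]) != 0)
            return -1;
    return push_back('(');
}
```

After `push_back_parenthesized("abc")', successive calls to `next_char()' return `(', `a', `b', `c', and `)' in that order.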
| 1090 | An important potential problem when using `unput()' is that if you
|
|---|
| 1091 | are using `%pointer' (the default), a call to `unput()' _destroys_ the
|
|---|
| 1092 | contents of `yytext', starting with its rightmost character and
|
|---|
| 1093 | devouring one character to the left with each call. If you need the
|
|---|
| 1094 | value of `yytext' preserved after a call to `unput()' (as in the above
|
|---|
| 1095 | example), you must either first copy it elsewhere, or build your
|
|---|
| 1096 | scanner using `%array' instead (*note Matching::).
|
|---|
| 1097 |
|
|---|
| 1098 | Finally, note that you cannot put back `EOF' to attempt to mark the
|
|---|
| 1099 | input stream with an end-of-file.
|
|---|
| 1100 |
|
|---|
| 1101 | `input()' reads the next character from the input stream. For
|
|---|
| 1102 | example, the following is one way to eat up C comments:
|
|---|
| 1103 |
|
|---|
| 1104 |
|
|---|
| 1105 | %%
|
|---|
| 1106 | "/*" {
|
|---|
| 1107 | register int c;
|
|---|
| 1108 |
|
|---|
| 1109 | for ( ; ; )
|
|---|
| 1110 | {
|
|---|
| 1111 | while ( (c = input()) != '*' &&
|
|---|
| 1112 | c != EOF )
|
|---|
| 1113 | ; /* eat up text of comment */
|
|---|
| 1114 |
|
|---|
| 1115 | if ( c == '*' )
|
|---|
| 1116 | {
|
|---|
| 1117 | while ( (c = input()) == '*' )
|
|---|
| 1118 | ;
|
|---|
| 1119 | if ( c == '/' )
|
|---|
| 1120 | break; /* found the end */
|
|---|
| 1121 | }
|
|---|
| 1122 |
|
|---|
| 1123 | if ( c == EOF )
|
|---|
| 1124 | {
|
|---|
| 1125 | error( "EOF in comment" );
|
|---|
| 1126 | break;
|
|---|
| 1127 | }
|
|---|
| 1128 | }
|
|---|
| 1129 | }
|
|---|
| 1130 |
|
|---|
| 1131 | (Note that if the scanner is compiled using `C++', then `input()' is
|
|---|
| 1132 | instead referred to as `yyinput()', in order to avoid a name clash with
|
|---|
| 1133 | the `C++' stream by the name of `input'.)
|
|---|
| 1134 |
|
|---|
| 1135 | `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
|
|---|
| 1136 | the next time the scanner attempts to match a token, it will first
|
|---|
| 1137 | refill the buffer using `YY_INPUT()' (*note Generated Scanner::). This
|
|---|
| 1138 | action is a special case of the more general `yy_flush_buffer()'
|
|---|
| 1139 | function, described below (*note Multiple Input Buffers::).
|
|---|
| 1140 |
|
|---|
| 1141 | `yyterminate()' can be used in lieu of a return statement in an
|
|---|
| 1142 | action. It terminates the scanner and returns a 0 to the scanner's
|
|---|
| 1143 | caller, indicating "all done". By default, `yyterminate()' is also
|
|---|
| 1144 | called when an end-of-file is encountered. It is a macro and may be
|
|---|
| 1145 | redefined.
|
|---|
| 1146 |
|
|---|
| 1147 |
|
|---|
| 1148 | File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top
|
|---|
| 1149 |
|
|---|
| 1150 | The Generated Scanner
|
|---|
| 1151 | *********************
|
|---|
| 1152 |
|
|---|
| 1153 | The output of `flex' is the file `lex.yy.c', which contains the
|
|---|
| 1154 | scanning routine `yylex()', a number of tables used by it for matching
|
|---|
| 1155 | tokens, and a number of auxiliary routines and macros. By default,
|
|---|
| 1156 | `yylex()' is declared as follows:
|
|---|
| 1157 |
|
|---|
| 1158 |
|
|---|
| 1159 | int yylex()
|
|---|
| 1160 | {
|
|---|
| 1161 | ... various definitions and the actions in here ...
|
|---|
| 1162 | }
|
|---|
| 1163 |
|
|---|
| 1164 | (If your environment supports function prototypes, then it will be
|
|---|
| 1165 | `int yylex( void )'.) This definition may be changed by defining the
|
|---|
| 1166 | `YY_DECL' macro. For example, you could use:
|
|---|
| 1167 |
|
|---|
| 1168 |
|
|---|
| 1169 | #define YY_DECL float lexscan( a, b ) float a, b;
|
|---|
| 1170 |
|
|---|
| 1171 | to give the scanning routine the name `lexscan', returning a float,
|
|---|
| 1172 | and taking two floats as arguments. Note that if you give arguments to
|
|---|
| 1173 | the scanning routine using a K&R-style/non-prototyped function
|
|---|
| 1174 | declaration, you must terminate the definition with a semi-colon (;).
|
|---|
| 1175 |
|
|---|
| 1176 | `flex' generates `C99' function definitions by default. However, flex
|
|---|
| 1177 | does have the ability to generate obsolete, er, `traditional', function
|
|---|
| 1178 | definitions. This is to support bootstrapping gcc on old systems.
|
|---|
| 1179 | Unfortunately, traditional definitions prevent us from using any
|
|---|
| 1180 | standard data types smaller than int (such as short, char, or bool) as
|
|---|
| 1181 | function arguments. For this reason, future versions of `flex' may
|
|---|
| 1182 | generate standard C99 code only, leaving K&R-style functions to the
|
|---|
| 1183 | historians. Currently, if you do *not* want `C99' definitions, then
|
|---|
| 1184 | you must use `%option noansi-definitions'.
|
|---|
| 1185 |
|
|---|
| 1186 | Whenever `yylex()' is called, it scans tokens from the global input
|
|---|
| 1187 | file `yyin' (which defaults to stdin). It continues until it either
|
|---|
| 1188 | reaches an end-of-file (at which point it returns the value 0) or one
|
|---|
| 1189 | of its actions executes a `return' statement.
|
|---|
| 1190 |
|
|---|
| 1191 | If the scanner reaches an end-of-file, subsequent calls are undefined
|
|---|
| 1192 | unless either `yyin' is pointed at a new input file (in which case
|
|---|
| 1193 | scanning continues from that file), or `yyrestart()' is called.
|
|---|
| 1194 | `yyrestart()' takes one argument, a `FILE *' pointer (which can be
|
|---|
| 1195 | NULL, if you've set up `YY_INPUT' to scan from a source other than
|
|---|
| 1196 | `yyin'), and initializes `yyin' for scanning from that file.
|
|---|
| 1197 | Essentially there is no difference between just assigning `yyin' to a
|
|---|
| 1198 | new input file or using `yyrestart()' to do so; the latter is available
|
|---|
| 1199 | for compatibility with previous versions of `flex', and because it can
|
|---|
| 1200 | be used to switch input files in the middle of scanning. It can also
|
|---|
| 1201 | be used to throw away the current input buffer, by calling it with an
|
|---|
| 1202 | argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
|
|---|
| 1203 | (*note Actions::). Note that `yyrestart()' does _not_ reset the start
|
|---|
| 1204 | condition to `INITIAL' (*note Start Conditions::).
|
|---|
| 1205 |
|
|---|
| 1206 | If `yylex()' stops scanning due to executing a `return' statement in
|
|---|
| 1207 | one of the actions, the scanner may then be called again and it will
|
|---|
| 1208 | resume scanning where it left off.
|
|---|
| 1209 |
|
|---|
| 1210 | By default (and for purposes of efficiency), the scanner uses
|
|---|
| 1211 | block-reads rather than simple `getc()' calls to read characters from
|
|---|
| 1212 | `yyin'. The nature of how it gets its input can be controlled by
|
|---|
| 1213 | defining the `YY_INPUT' macro. The calling sequence for `YY_INPUT()'
|
|---|
| 1214 | is `YY_INPUT(buf,result,max_size)'. Its action is to place up to
|
|---|
| 1215 | `max_size' characters in the character array `buf' and return in the
|
|---|
| 1216 | integer variable `result' either the number of characters read or the
|
|---|
| 1217 | constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default
|
|---|
| 1218 | `YY_INPUT' reads from the global file-pointer `yyin'.
|
|---|
| 1219 |
|
|---|
| 1220 | Here is a sample definition of `YY_INPUT' (in the definitions
|
|---|
| 1221 | section of the input file):
|
|---|
| 1222 |
|
|---|
| 1223 |
|
|---|
| 1224 | %{
|
|---|
| 1225 | #define YY_INPUT(buf,result,max_size) \
|
|---|
| 1226 | { \
|
|---|
| 1227 | int c = getchar(); \
|
|---|
| 1228 | result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
|
|---|
| 1229 | }
|
|---|
| 1230 | %}
|
|---|
| 1231 |
|
|---|
| 1232 | This definition will change the input processing to occur one
|
|---|
| 1233 | character at a time.
|
|---|
| 1234 |
|
|---|
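The `YY_INPUT' contract (fill `buf' with up to `max_size' characters and report the count, or 0 at end of input) can also be met by a reader that is not backed by a file. The following is a minimal sketch of such a reader for an in-memory string; `struct string_source' and `read_chunk' are hypothetical names, and the sketch assumes that `YY_NULL' is 0 as stated above for Unix systems.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source. */
struct string_source {
    const char *data;   /* text to be scanned */
    size_t len;         /* total number of bytes in data */
    size_t pos;         /* number of bytes already handed out */
};

/* Copies up to `max_size` bytes of the remaining text into `buf` and
 * returns how many were copied.  Returns 0 once the text is used up,
 * which is the end-of-file indication. */
size_t read_chunk(struct string_source *src, char *buf, size_t max_size)
{
    size_t left = src->len - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}
```

A `YY_INPUT' definition could then assign the return value of `read_chunk()' to `result'. Unlike the `getchar()' version above, this hands the scanner a block of characters on each call.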
| 1235 | When the scanner receives an end-of-file indication from `YY_INPUT', it
|
|---|
| 1236 | then checks the `yywrap()' function. If `yywrap()' returns false
|
|---|
| 1237 | (zero), then it is assumed that the function has gone ahead and set up
|
|---|
| 1238 | `yyin' to point to another input file, and scanning continues. If it
|
|---|
| 1239 | returns true (non-zero), then the scanner terminates, returning 0 to
|
|---|
| 1240 | its caller. Note that in either case, the start condition remains
|
|---|
| 1241 | unchanged; it does _not_ revert to `INITIAL'.
|
|---|
| 1242 |
|
|---|
| 1243 | If you do not supply your own version of `yywrap()', then you must
|
|---|
| 1244 | either use `%option noyywrap' (in which case the scanner behaves as
|
|---|
| 1245 | though `yywrap()' returned 1), or you must link with `-lfl' to obtain
|
|---|
| 1246 | the default version of the routine, which always returns 1.
|
|---|
| 1247 |
|
|---|
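A user-supplied `yywrap()' that moves on to the next file in a list might look like the sketch below. It is written to stand alone, so it declares its own `yyin'; in a real scanner that variable is provided by the generated code and must not be defined again. The `remaining_files' list is a hypothetical name introduced for this example.

```c
#include <stdio.h>

/* Stand-ins for this sketch only.  In a generated scanner, `yyin` is
 * already defined; `remaining_files` is a hypothetical NULL-terminated
 * list of file names still to be scanned. */
FILE *yyin = NULL;
const char **remaining_files = NULL;

/* Returns 0 after pointing yyin at the next file that can be opened,
 * so that scanning continues; returns 1 when no files remain, so that
 * the scanner terminates. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin)
        fclose(yyin);
    yyin = NULL;

    while (remaining_files != NULL && *remaining_files != NULL) {
        yyin = fopen(*remaining_files++, "r");
        if (yyin != NULL)
            return 0;   /* more input is available */
    }
    return 1;           /* all done */
}
```

Names that cannot be opened are skipped silently here; a real program would probably want to report them.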
| 1248 | For scanning from in-memory buffers (e.g., scanning strings), see
|
|---|
| 1249 | *Note Scanning Strings:: and *Note Multiple Input Buffers::.
|
|---|
| 1250 |
|
|---|
| 1251 | The scanner writes its `ECHO' output to the `yyout' global (default,
|
|---|
| 1252 | `stdout'), which may be redefined by the user simply by assigning it to
|
|---|
| 1253 | some other `FILE' pointer.
|
|---|
| 1254 |
|
|---|