flex.info-1@ 3186

Last change on this file since 3186 was 3031, checked in by bird, 19 years ago
flex 2.5.33.
File size: 43.3 KB

1	This is flex.info, produced by makeinfo version 4.5 from flex.texi.
2
3	INFO-DIR-SECTION Programming
4	START-INFO-DIR-ENTRY
5	* flex: (flex). Fast lexical analyzer generator (lex replacement).
6	END-INFO-DIR-ENTRY
7
8
9	The flex manual is placed under the same licensing conditions as the
10	rest of flex:
11
12	Copyright (C) 1990, 1997 The Regents of the University of California.
13	All rights reserved.
14
15	This code is derived from software contributed to Berkeley by Vern
16	Paxson.
17
18	The United States Government has rights in this work pursuant to
19	contract no. DE-AC03-76SF00098 between the United States Department of
20	Energy and the University of California.
21
22	Redistribution and use in source and binary forms, with or without
23	modification, are permitted provided that the following conditions are
24	met:
25
26	1. Redistributions of source code must retain the above copyright
27	notice, this list of conditions and the following disclaimer.
28
29	2. Redistributions in binary form must reproduce the above copyright
30	notice, this list of conditions and the following disclaimer in the
31	documentation and/or other materials provided with the
32	distribution.
33	Neither the name of the University nor the names of its contributors
34	may be used to endorse or promote products derived from this software
35	without specific prior written permission.
36
37	THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
38	WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
39	MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
40
41	File: flex.info, Node: Top, Next: Copyright, Prev: (dir), Up: (dir)
42
43	flex
44	****
45
46	This manual describes `flex', a tool for generating programs that
47	perform pattern-matching on text. The manual includes both tutorial and
48	reference sections.
49
50	This edition of `The flex Manual' documents `flex' version 2.5.33.
51	It was last updated on 20 February 2006.
52
53	* Menu:
54
55	* Copyright::
56	* Reporting Bugs::
57	* Introduction::
58	* Simple Examples::
59	* Format::
60	* Patterns::
61	* Matching::
62	* Actions::
63	* Generated Scanner::
64	* Start Conditions::
65	* Multiple Input Buffers::
66	* EOF::
67	* Misc Macros::
68	* User Values::
69	* Yacc::
70	* Scanner Options::
71	* Performance::
72	* Cxx::
73	* Reentrant::
74	* Lex and Posix::
75	* Memory Management::
76	* Serialized Tables::
77	* Diagnostics::
78	* Limitations::
79	* Bibliography::
80	* FAQ::
81	* Appendices::
82	* Indices::
83
84	--- The Detailed Node Listing ---
85
86	Format of the Input File
87
88	* Definitions Section::
89	* Rules Section::
90	* User Code Section::
91	* Comments in the Input::
92
93	Scanner Options
94
95	* Options for Specifing Filenames::
96	* Options Affecting Scanner Behavior::
97	* Code-Level And API Options::
98	* Options for Scanner Speed and Size::
99	* Debugging Options::
100	* Miscellaneous Options::
101
102	Reentrant C Scanners
103
104	* Reentrant Uses::
105	* Reentrant Overview::
106	* Reentrant Example::
107	* Reentrant Detail::
108	* Reentrant Functions::
109
110	The Reentrant API in Detail
111
112	* Specify Reentrant::
113	* Extra Reentrant Argument::
114	* Global Replacement::
115	* Init and Destroy Functions::
116	* Accessor Methods::
117	* Extra Data::
118	* About yyscan_t::
119
120	Memory Management
121
122	* The Default Memory Management::
123	* Overriding The Default Memory Management::
124	* A Note About yytext And Memory::
125
126	Serialized Tables
127
128	* Creating Serialized Tables::
129	* Loading and Unloading Serialized Tables::
130	* Tables File Format::
131
132	FAQ
133
134	* When was flex born?::
135	* How do I expand \ escape sequences in C-style quoted strings?::
136	* Why do flex scanners call fileno if it is not ANSI compatible?::
137	* Does flex support recursive pattern definitions?::
138	* How do I skip huge chunks of input (tens of megabytes) while using flex?::
139	* Flex is not matching my patterns in the same order that I defined them.::
140	* My actions are executing out of order or sometimes not at all.::
141	* How can I have multiple input sources feed into the same scanner at the same time?::
142	* Can I build nested parsers that work with the same input file?::
143	* How can I match text only at the end of a file?::
144	* How can I make REJECT cascade across start condition boundaries?::
145	* Why cant I use fast or full tables with interactive mode?::
146	* How much faster is -F or -f than -C?::
147	* If I have a simple grammar cant I just parse it with flex?::
148	* Why doesnt yyrestart() set the start state back to INITIAL?::
149	* How can I match C-style comments?::
150	* The period isnt working the way I expected.::
151	* Can I get the flex manual in another format?::
152	* Does there exist a "faster" NDFA->DFA algorithm?::
153	* How does flex compile the DFA so quickly?::
154	* How can I use more than 8192 rules?::
155	* How do I abandon a file in the middle of a scan and switch to a new file?::
156	* How do I execute code only during initialization (only before the first scan)?::
157	* How do I execute code at termination?::
158	* Where else can I find help?::
159	* Can I include comments in the "rules" section of the file?::
160	* I get an error about undefined yywrap().::
161	* How can I change the matching pattern at run time?::
162	* How can I expand macros in the input?::
163	* How can I build a two-pass scanner?::
164	* How do I match any string not matched in the preceding rules?::
165	* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
166	* Is there a way to make flex treat NULL like a regular character?::
167	* Whenever flex can not match the input it says "flex scanner jammed".::
168	* Why doesnt flex have non-greedy operators like perl does?::
169	* Memory leak - 16386 bytes allocated by malloc.::
170	* How do I track the byte offset for lseek()?::
171	* How do I use my own I/O classes in a C++ scanner?::
172	* How do I skip as many chars as possible?::
173	* deleteme00::
174	* Are certain equivalent patterns faster than others?::
175	* Is backing up a big deal?::
176	* Can I fake multi-byte character support?::
177	* deleteme01::
178	* Can you discuss some flex internals?::
179	* unput() messes up yy_at_bol::
180	* The | operator is not doing what I want::
181	* Why can't flex understand this variable trailing context pattern?::
182	* The ^ operator isn't working::
183	* Trailing context is getting confused with trailing optional patterns::
184	* Is flex GNU or not?::
185	* ERASEME53::
186	* I need to scan if-then-else blocks and while loops::
187	* ERASEME55::
188	* ERASEME56::
189	* ERASEME57::
190	* Is there a repository for flex scanners?::
191	* How can I conditionally compile or preprocess my flex input file?::
192	* Where can I find grammars for lex and yacc?::
193	* I get an end-of-buffer message for each character scanned.::
194	* unnamed-faq-62::
195	* unnamed-faq-63::
196	* unnamed-faq-64::
197	* unnamed-faq-65::
198	* unnamed-faq-66::
199	* unnamed-faq-67::
200	* unnamed-faq-68::
201	* unnamed-faq-69::
202	* unnamed-faq-70::
203	* unnamed-faq-71::
204	* unnamed-faq-72::
205	* unnamed-faq-73::
206	* unnamed-faq-74::
207	* unnamed-faq-75::
208	* unnamed-faq-76::
209	* unnamed-faq-77::
210	* unnamed-faq-78::
211	* unnamed-faq-79::
212	* unnamed-faq-80::
213	* unnamed-faq-81::
214	* unnamed-faq-82::
215	* unnamed-faq-83::
216	* unnamed-faq-84::
217	* unnamed-faq-85::
218	* unnamed-faq-86::
219	* unnamed-faq-87::
220	* unnamed-faq-88::
221	* unnamed-faq-90::
222	* unnamed-faq-91::
223	* unnamed-faq-92::
224	* unnamed-faq-93::
225	* unnamed-faq-94::
226	* unnamed-faq-95::
227	* unnamed-faq-96::
228	* unnamed-faq-97::
229	* unnamed-faq-98::
230	* unnamed-faq-99::
231	* unnamed-faq-100::
232	* unnamed-faq-101::
233	* What is the difference between YYLEX_PARAM and YY_DECL?::
234	* Why do I get "conflicting types for yylex" error?::
235	* How do I access the values set in a Flex action from within a Bison action?::
236
237	Appendices
238
239	* Makefiles and Flex::
240	* Bison Bridge::
241	* M4 Dependency::
242
243	Indices
244
245	* Concept Index::
246	* Index of Functions and Macros::
247	* Index of Variables::
248	* Index of Data Types::
249	* Index of Hooks::
250	* Index of Scanner Options::
251
252
253	File: flex.info, Node: Copyright, Next: Reporting Bugs, Prev: Top, Up: Top
254
255	Copyright
256	*********
257
258
259	The flex manual is placed under the same licensing conditions as the
260	rest of flex:
261
262	Copyright (C) 1990, 1997 The Regents of the University of California.
263	All rights reserved.
264
265	This code is derived from software contributed to Berkeley by Vern
266	Paxson.
267
268	The United States Government has rights in this work pursuant to
269	contract no. DE-AC03-76SF00098 between the United States Department of
270	Energy and the University of California.
271
272	Redistribution and use in source and binary forms, with or without
273	modification, are permitted provided that the following conditions are
274	met:
275
276	1. Redistributions of source code must retain the above copyright
277	notice, this list of conditions and the following disclaimer.
278
279	2. Redistributions in binary form must reproduce the above copyright
280	notice, this list of conditions and the following disclaimer in the
281	documentation and/or other materials provided with the
282	distribution.
283	Neither the name of the University nor the names of its contributors
284	may be used to endorse or promote products derived from this software
285	without specific prior written permission.
286
287	THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
288	WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
289	MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
290
291	File: flex.info, Node: Reporting Bugs, Next: Introduction, Prev: Copyright, Up: Top
292
293	Reporting Bugs
294	**************
295
296	If you have problems with `flex' or think you have found a bug,
297	please send mail detailing your problem to
298	<[email protected]>. Patches are always welcome.
299
300
301	File: flex.info, Node: Introduction, Next: Simple Examples, Prev: Reporting Bugs, Up: Top
302
303	Introduction
304	************
305
306	`flex' is a tool for generating "scanners". A scanner is a program
307	which recognizes lexical patterns in text. The `flex' program reads
308	the given input files, or its standard input if no file names are
309	given, for a description of a scanner to generate. The description is
310	in the form of pairs of regular expressions and C code, called "rules".
311	`flex' generates as output a C source file, `lex.yy.c' by default,
312	which defines a routine `yylex()'. This file can be compiled and
313	linked with the flex runtime library to produce an executable. When
314	the executable is run, it analyzes its input for occurrences of the
315	regular expressions. Whenever it finds one, it executes the
316	corresponding C code.
317
318
319	File: flex.info, Node: Simple Examples, Next: Format, Prev: Introduction, Up: Top
320
321	Some Simple Examples
322	********************
323
324	First some simple examples to get the flavor of how one uses `flex'.
325
326	The following `flex' input specifies a scanner which, when it
327	encounters the string `username', will replace it with the user's login
328	name:
329
330
331	%%
332	username printf( "%s", getlogin() );
333
334	By default, any text not matched by a `flex' scanner is copied to
335	the output, so the net effect of this scanner is to copy its input file
336	to its output with each occurrence of `username' expanded. In this
337	input, there is just one rule. `username' is the "pattern" and the
338	`printf' is the "action". The `%%' symbol marks the beginning of the
339	rules.
340
341	Here's another simple example:
342
343
344	        int num_lines = 0, num_chars = 0;
345
346	%%
347	\n      ++num_lines; ++num_chars;
348	.       ++num_chars;
349
350	%%
351	int main( void )
352	        {
353	        yylex();
354	        printf( "# of lines = %d, # of chars = %d\n",
355	                num_lines, num_chars );
356	        }
357
358	This scanner counts the number of characters and the number of lines
359	in its input. It produces no output other than the final report on the
360	character and line counts. The first line declares two globals,
361	`num_lines' and `num_chars', which are accessible both inside `yylex()'
362	and in the `main()' routine declared after the second `%%'. There are
363	two rules, one which matches a newline (`\n') and increments both the
364	line count and the character count, and one which matches any character
365	other than a newline (indicated by the `.' regular expression).
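
For comparison, roughly the same counting can be written by hand in
plain C. This is only a sketch of what the two rules do, not the code
that `flex' generates; the names `struct counts' and `count_text' are
hypothetical and do not appear in `flex'.

```c
#include <stddef.h>

/* Hypothetical result type: mirrors num_lines and num_chars above. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Apply the two rules to each character of a NUL-terminated string:
 * a newline bumps both counters, any other character bumps only
 * num_chars. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0 };
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] == '\n')
            ++c.num_lines;
        ++c.num_chars;
    }
    return c;
}
```

Calling `count_text' on a string holding two newline-terminated lines
of two characters each gives two lines and six characters.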
366
367	A somewhat more complicated example:
368
369
370	/* scanner for a toy Pascal-like language */
371
372	%{
373	/* need this for the call to atof() below */
374	#include <math.h>
375	%}
376
377	DIGIT [0-9]
378	ID [a-z][a-z0-9]*
379
380	%%
381
382	{DIGIT}+ {
383	printf( "An integer: %s (%d)\n", yytext,
384	atoi( yytext ) );
385	}
386
387	{DIGIT}+"."{DIGIT}* {
388	printf( "A float: %s (%g)\n", yytext,
389	atof( yytext ) );
390	}
391
392	if|then|begin|end|procedure|function {
393	printf( "A keyword: %s\n", yytext );
394	}
395
396	{ID} printf( "An identifier: %s\n", yytext );
397
398	"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
399
400	"{"[^{}\n]*"}" /* eat up one-line comments */
401
402	[ \t\n]+ /* eat up whitespace */
403
404	. printf( "Unrecognized character: %s\n", yytext );
405
406	%%
407
408	int main( int argc, char **argv )
411	{
412	++argv, --argc; /* skip over program name */
413	if ( argc > 0 )
414	yyin = fopen( argv[0], "r" );
415	else
416	yyin = stdin;
417
418	yylex();
419	}
420
421	These are the beginnings of a simple scanner for a language like
422	Pascal. It identifies different types of "tokens" and reports on what
423	it has seen.
424
425	The details of this example will be explained in the following
426	sections.
427
428
429	File: flex.info, Node: Format, Next: Patterns, Prev: Simple Examples, Up: Top
430
431	Format of the Input File
432	************************
433
434	The `flex' input file consists of three sections, separated by a
435	line containing only `%%'.
436
437
438	definitions
439	%%
440	rules
441	%%
442	user code
443
444	* Menu:
445
446	* Definitions Section::
447	* Rules Section::
448	* User Code Section::
449	* Comments in the Input::
450
451
452	File: flex.info, Node: Definitions Section, Next: Rules Section, Prev: Format, Up: Format
453
454	Format of the Definitions Section
455	=================================
456
457	The "definitions section" contains declarations of simple "name"
458	definitions to simplify the scanner specification, and declarations of
459	"start conditions", which are explained in a later section.
460
461	Name definitions have the form:
462
463
464	name definition
465
466	The `name' is a word beginning with a letter or an underscore (`_')
467	followed by zero or more letters, digits, `_', or `-' (dash). The
468	definition is taken to begin at the first non-whitespace character
469	following the name and continuing to the end of the line. The
470	definition can subsequently be referred to using `{name}', which will
471	expand to `(definition)'. For example,
472
473
474	DIGIT [0-9]
475	ID [a-z][a-z0-9]*
476
477	Defines `DIGIT' to be a regular expression which matches a single
478	digit, and `ID' to be a regular expression which matches a letter
479	followed by zero-or-more letters-or-digits. A subsequent reference to
480
481
482	{DIGIT}+"."{DIGIT}*
483
484	is identical to
485
486
487	([0-9])+"."([0-9])*
488
489	and matches one-or-more digits followed by a `.' followed by
490	zero-or-more digits.
491
492	An unindented comment (i.e., a line beginning with `/*') is copied
493	verbatim to the output up to the next `*/'.
494
495	Any _indented_ text or text enclosed in `%{' and `%}' is also copied
496	verbatim to the output (with the %{ and %} symbols removed). The %{
497	and %} symbols must appear unindented on lines by themselves.
498
499	A `%top' block is similar to a `%{' ... `%}' block, except that the
500	code in a `%top' block is relocated to the _top_ of the generated file,
501	before any flex definitions (1). The `%top' block is useful when you
502	want certain preprocessor macros to be defined or certain files to be
503	included before the generated code. The single characters `{' and
504	`}' are used to delimit the `%top' block, as shown in the example below:
505
506
507	%top{
508	/* This code goes at the "top" of the generated file. */
509	#include <stdint.h>
510	#include <inttypes.h>
511	}
512
513	Multiple `%top' blocks are allowed, and their order is preserved.
514
515	---------- Footnotes ----------
516
517	(1) Actually, `yyIN_HEADER' is defined before the `%top' block.
518
519
520	File: flex.info, Node: Rules Section, Next: User Code Section, Prev: Definitions Section, Up: Format
521
522	Format of the Rules Section
523	===========================
524
525	The "rules" section of the `flex' input contains a series of rules
526	of the form:
527
528
529	pattern action
530
531	where the pattern must be unindented and the action must begin on
532	the same line. *Note Patterns::, for a further description of patterns
533	and actions.
534
535	In the rules section, any indented or %{ %} enclosed text appearing
536	before the first rule may be used to declare variables which are local
537	to the scanning routine and (after the declarations) code which is to be
538	executed whenever the scanning routine is entered. Other indented or
539	%{ %} text in the rule section is still copied to the output, but its
540	meaning is not well-defined and it may well cause compile-time errors
541	(this feature is present for POSIX compliance. *Note Lex and Posix::,
542	for other such features).
543
544	Any _indented_ text or text enclosed in `%{' and `%}' is copied
545	verbatim to the output (with the %{ and %} symbols removed). The %{
546	and %} symbols must appear unindented on lines by themselves.
547
548
549	File: flex.info, Node: User Code Section, Next: Comments in the Input, Prev: Rules Section, Up: Format
550
551	Format of the User Code Section
552	===============================
553
554	The user code section is simply copied to `lex.yy.c' verbatim. It
555	is used for companion routines which call or are called by the scanner.
556	The presence of this section is optional; if it is missing, the second
557	`%%' in the input file may be skipped, too.
558
559
560	File: flex.info, Node: Comments in the Input, Prev: User Code Section, Up: Format
561
562	Comments in the Input
563	=====================
564
565	Flex supports C-style comments, that is, anything between /* and */
566	is considered a comment. Whenever flex encounters a comment, it copies
567	the entire comment verbatim to the generated source code. Comments may
568	appear just about anywhere, but with the following exceptions:
569
570	* Comments may not appear in the Rules Section wherever flex is
571	expecting a regular expression. This means comments may not appear
572	at the beginning of a line, or immediately following a list of
573	scanner states.
574
575	* Comments may not appear on an `%option' line in the Definitions
576	Section.
577
578	If you want to follow a simple rule, then always begin a comment on a
579	new line, with one or more whitespace characters before the initial
580	`/*'. This rule will work anywhere in the input file.
581
582	All the comments in the following example are valid:
583
584
585	%{
586	/* code block */
587	%}
588
589	/* Definitions Section */
590	%x STATE_X
591
592	%%
593	    /* Rules Section */
594	ruleA   /* after regex */ { /* code block */ } /* after code block */
595	        /* Rules Section (indented) */
596	<STATE_X>{
597	ruleC ECHO;
598	ruleD ECHO;
599	%{
600	/* code block */
601	%}
602	}
603	%%
604	/* User Code Section */
605
606
607	File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top
608
609	Patterns
610	********
611
612	The patterns in the input (see *Note Rules Section::) are written
613	using an extended set of regular expressions. These are:
614
615	`x'
616	match the character 'x'
617
618	`.'
619	any character (byte) except newline
620
621	`[xyz]'
622	a "character class"; in this case, the pattern matches either an
623	'x', a 'y', or a 'z'
624
625	`[abj-oZ]'
626	a "character class" with a range in it; matches an 'a', a 'b', any
627	letter from 'j' through 'o', or a 'Z'
628
629	`[^A-Z]'
630	a "negated character class", i.e., any character but those in the
631	class. In this case, any character EXCEPT an uppercase letter.
632
633	`[^A-Z\n]'
634	any character EXCEPT an uppercase letter or a newline
635
636	`r*'
637	zero or more r's, where r is any regular expression
638
639	`r+'
640	one or more r's
641
642	`r?'
643	zero or one r's (that is, "an optional r")
644
645	`r{2,5}'
646	anywhere from two to five r's
647
648	`r{2,}'
649	two or more r's
650
651	`r{4}'
652	exactly 4 r's
653
654	`{name}'
655	the expansion of the `name' definition (*note Format::).
656
657	`"[xyz]\"foo"'
658	the literal string: `[xyz]"foo'
659
660	`\X'
661	if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
662	interpretation of `\x'. Otherwise, a literal `X' (used to escape
663	operators such as `*')
664
665	`\0'
666	a NUL character (ASCII code 0)
667
668	`\123'
669	the character with octal value 123
670
671	`\x2a'
672	the character with hexadecimal value 2a
673
674	`(r)'
675	match an `r'; parentheses are used to override precedence (see
676	below)
677
678	`rs'
679	the regular expression `r' followed by the regular expression `s';
680	called "concatenation"
681
682	`r|s'
683	either an `r' or an `s'
684
685	`r/s'
686	an `r' but only if it is followed by an `s'. The text matched by
687	`s' is included when determining whether this rule is the longest
688	match, but is then returned to the input before the action is
689	executed. So the action only sees the text matched by `r'. This
690	type of pattern is called "trailing context". (There are some
691	combinations of `r/s' that flex cannot match correctly. *Note
692	Limitations::, regarding dangerous trailing context.)
693
694	`^r'
695	an `r', but only at the beginning of a line (i.e., when just
696	starting to scan, or right after a newline has been scanned).
697
698	`r$'
699	an `r', but only at the end of a line (i.e., just before a
700	newline). Equivalent to `r/\n'.
701
702	Note that `flex''s notion of "newline" is exactly whatever the C
703	compiler used to compile `flex' interprets `\n' as; in particular,
704	on some DOS systems you must either filter out `\r's in the input
705	yourself, or explicitly use `r/\r\n' for `r$'.
706
707	`<s>r'
708	an `r', but only in start condition `s' (see *Note Start
709	Conditions:: for discussion of start conditions).
710
711	`<s1,s2,s3>r'
712	same, but in any of start conditions `s1', `s2', or `s3'.
713
714	`<*>r'
715	an `r' in any start condition, even an exclusive one.
716
717	`<<EOF>>'
718	an end-of-file.
719
720	`<s1,s2><<EOF>>'
721	an end-of-file when in start condition `s1' or `s2'
722
723	Note that inside of a character class, all regular expression
724	operators lose their special meaning except escape (`\') and the
725	character class operators, `-', `]', and, at the beginning of the
726	class, `^'.
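
The trailing-context entry `r/s' in the list above can be sketched in
plain C. The sketch assumes that `r' and `s' are literal strings
(`flex' accepts regular expressions), and the function name
`match_trailing_context' is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Emulate the trailing-context rule r/s for literal strings r and s.
 * Returns the number of characters the action would see in yytext,
 * i.e. strlen(r), when input starts with r immediately followed by s.
 * Returns 0 when the rule does not match. The text matched by s is
 * checked but not consumed. */
size_t match_trailing_context(const char *input, const char *r,
                              const char *s)
{
    size_t rlen = strlen(r), slen = strlen(s);
    if (strncmp(input, r, rlen) != 0)
        return 0;                 /* r itself does not match */
    if (strncmp(input + rlen, s, slen) != 0)
        return 0;                 /* r matches, but s does not follow */
    return rlen;                  /* only r is handed to the action */
}
```

With `r' set to `foo' and `s' set to a newline, this behaves like
`foo$': it reports three characters for `foo' at the end of a line and
no match for `foobar'.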
727
728	The regular expressions listed above are grouped according to
729	precedence, from highest precedence at the top to lowest at the bottom.
730	Those grouped together have equal precedence (see special note on the
731	precedence of the repeat operator, `{}', under the documentation for
732	the `--posix' POSIX compliance option). For example,
733
734
735	foo|bar*
736
737	is the same as
738
739
740	(foo)|(ba(r*))
741
742	since the `*' operator has higher precedence than concatenation, and
743	concatenation higher than alternation (`|'). This pattern therefore
744	matches _either_ the string `foo' _or_ the string `ba' followed by
745	zero-or-more `r''s. To match `foo' or zero-or-more repetitions of the
746	string `bar', use:
747
748
749	foo|(bar)*
750
751	And to match a sequence of zero or more repetitions of `foo' and
752	`bar':
753
754
755	(foo|bar)*
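
The difference between the first two groupings can be checked with two
small hand-written matchers. This is a sketch under a simplifying
assumption: each function tests whether the whole string matches,
whereas a `flex' scanner matches a prefix of its input. Both function
names are hypothetical.

```c
#include <string.h>

/* Whole-string match for  foo|bar*  which groups as  (foo)|(ba(r*)). */
int matches_foo_or_ba_rstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return 1;
    if (strncmp(s, "ba", 2) != 0)
        return 0;
    s += 2;
    while (*s == 'r')             /* zero or more trailing r's */
        s++;
    return *s == '\0';
}

/* Whole-string match for  foo|(bar)*  : foo, or zero or more "bar". */
int matches_foo_or_barstar(const char *s)
{
    if (strcmp(s, "foo") == 0)
        return 1;
    while (strncmp(s, "bar", 3) == 0)
        s += 3;                   /* consume one repetition of "bar" */
    return *s == '\0';
}
```

The string `barrr' is accepted only by the first function and `barbar'
only by the second, which is the effect of the added parentheses.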
756
757	In addition to characters and ranges of characters, character classes
758	can also contain "character class expressions". These are expressions
759	enclosed inside `[:' and `:]' delimiters (which themselves must appear
760	between the `[' and `]' of the character class; other elements may
761	occur inside the character class, too). The valid expressions are:
762
763
764	[:alnum:] [:alpha:] [:blank:]
765	[:cntrl:] [:digit:] [:graph:]
766	[:lower:] [:print:] [:punct:]
767	[:space:] [:upper:] [:xdigit:]
768
769	These expressions all designate a set of characters equivalent to the
770	corresponding standard C `isXXX' function. For example, `[:alnum:]'
771	designates those characters for which `isalnum()' returns true - i.e.,
772	any alphabetic or numeric character. Some systems don't provide
773	`isblank()', so flex defines `[:blank:]' as a blank or a tab.
774
775	For example, the following character classes are all equivalent:
776
777
778	[[:alnum:]]
779	[[:alpha:][:digit:]]
780	[[:alpha:][0-9]]
781	[a-zA-Z0-9]
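
The equivalence of these classes can be checked against the `isXXX'
functions directly. The sketch below assumes the default "C" locale and
an ASCII character set, and looks only at 7-bit characters; the function
names are hypothetical.

```c
#include <ctype.h>

/* Membership test for the literal class [a-zA-Z0-9]. */
static int in_literal_class(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Return 1 if, for every 7-bit character, three formulations agree:
 * isalnum(c), isalpha(c) || isdigit(c), and [a-zA-Z0-9]. */
int alnum_classes_agree(void)
{
    for (int c = 0; c < 128; c++) {
        int a = isalnum(c) != 0;
        int b = (isalpha(c) != 0) || (isdigit(c) != 0);
        int l = in_literal_class(c);
        if (a != b || b != l)
            return 0;
    }
    return 1;
}
```

In other locales `isalpha()' may accept additional characters, so the
literal class `[a-zA-Z0-9]' is not guaranteed to stay equivalent there.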
782
783	Some notes on patterns are in order.
784
785	* If your scanner is case-insensitive (the `-i' flag), then
786	`[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.
787
788	* Character classes with ranges, such as `[a-Z]', should be used with
789	caution in a case-insensitive scanner if the range spans upper or
790	lowercase characters. Flex does not know if you want to fold all
791	upper and lowercase characters together, or if you want the
792	literal numeric range specified (with no case folding). When in
793	doubt, flex will assume that you meant the literal numeric range,
794	and will issue a warning. The exception to this rule is a
795	character range such as `[a-z]' or `[S-W]' where it is obvious
796	that you want case-folding to occur. Here are some examples with
797	the `-i' flag enabled:
798
799	Range    Result     Literal Range        Alternate Range
800	`[a-t]'  ok         `[a-tA-T]'
801	`[A-T]'  ok         `[a-tA-T]'
802	`[A-t]'  ambiguous  `[A-Z\[\\\]_`a-t]'   `[a-tA-T]'
803	`[_-{]'  ambiguous  `[_`a-z{]'           `[_`a-zA-Z{]'
804	`[@-C]'  ambiguous  `[@ABC]'             `[@A-Z\[\\\]_`abc]'
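
One way to read the table: a range is unambiguous when both endpoints
are letters of the same case. The sketch below encodes that reading.
It is an assumption that happens to reproduce the five rows above, not
the check that `flex' performs internally, and the function name is
hypothetical.

```c
#include <ctype.h>

/* Return 1 when both endpoints are letters of the same case, as in
 * [a-z] or [S-W]; return 0 otherwise. This heuristic is an assumption
 * that reproduces the table's five rows, not flex's internal check. */
int range_is_unambiguous(char lo, char hi)
{
    unsigned char l = (unsigned char)lo, h = (unsigned char)hi;
    return (islower(l) && islower(h)) || (isupper(l) && isupper(h));
}
```

Under this reading `[a-t]' and `[A-T]' are fine, while `[A-t]', `[_-{]'
and `[@-C]' each have at least one endpoint that is not a letter of the
same case as the other.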
805
806	* A negated character class such as the example `[^A-Z]' above
807	_will_ match a newline unless `\n' (or an equivalent escape
808	sequence) is one of the characters explicitly present in the
809	negated character class (e.g., `[^A-Z\n]'). This is unlike how
810	many other regular expression tools treat negated character
811	classes, but unfortunately the inconsistency is historically
812	entrenched. Matching newlines means that a pattern like `[^"]*'
813	can match the entire input unless there's another quote in the
814	input.
815
816	* A rule can have at most one instance of trailing context (the `/'
817	operator or the `$' operator). The start condition, `^', and
818	`<<EOF>>' patterns can only occur at the beginning of a pattern
819	and, like `/' and `$', cannot be grouped inside
820	parentheses. A `^' which does not occur at the beginning of a
821	rule or a `$' which does not occur at the end of a rule loses its
822	special properties and is treated as a normal character.
823
824	* The following are invalid:
825
826
827	foo/bar$
828	<sc1>foo<sc2>bar
829
830	Note that the first of these can be written `foo/bar\n'.
831
832	* The following will result in `$' or `^' being treated as a normal
833	character:
834
835
836	foo|(bar$)
837	foo|^bar
838
839	If the desired meaning is a `foo' or a
840	`bar'-followed-by-a-newline, the following could be used (the
841	special `|' action is explained below, *note Actions::):
842
843
844	foo      |
845	bar$     /* action goes here */
846
847	A similar trick will work for matching a `foo' or a
848	`bar'-at-the-beginning-of-a-line.
849
850
851	File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top
852
853	How the Input Is Matched
854	************************
855
856	When the generated scanner is run, it analyzes its input looking for
857	strings which match any of its patterns. If it finds more than one
858	match, it takes the one matching the most text (for trailing context
859	rules, this includes the length of the trailing part, even though it
860	will then be returned to the input). If it finds two or more matches of
861	the same length, the rule listed first in the `flex' input file is
862	chosen.
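
The selection rule (longest match first, earliest rule on a tie) can be
sketched as follows. The sketch assumes the patterns are literal
strings, which keeps it short; real `flex' patterns are regular
expressions compiled into a DFA. The function name `choose_rule' is
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Pick the rule to fire at the start of input from an ordered list of
 * literal patterns. Returns the index of the chosen rule, or -1 if no
 * pattern matches. */
int choose_rule(const char *const patterns[], size_t npatterns,
                const char *input)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < npatterns; i++) {
        size_t len = strlen(patterns[i]);
        if (len > 0 && strncmp(input, patterns[i], len) == 0) {
            /* Strictly longer wins; on a tie the earlier rule is kept. */
            if (best < 0 || len > best_len) {
                best = (int)i;
                best_len = len;
            }
        }
    }
    return best;
}
```

With the rules `if', `iffy', `if' in that order, the input `iffy x'
selects the second rule (longest match), and the input `if x' selects
the first rule rather than the third (first listed wins the tie).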
863
864	Once the match is determined, the text corresponding to the match
865	(called the "token") is made available in the global character pointer
866	`yytext', and its length in the global integer `yyleng'. The "action"
867	corresponding to the matched pattern is then executed (*note
868	Actions::), and then the remaining input is scanned for another match.
869
870	If no match is found, then the "default rule" is executed: the next
871	character in the input is considered matched and copied to the standard
872	output. Thus, the simplest valid `flex' input is:
873
874
875	%%
876
877	which generates a scanner that simply copies its input (one
878	character at a time) to its output.
879
880	Note that `yytext' can be defined in two different ways: either as a
881	character _pointer_ or as a character _array_. You can control which
882	definition `flex' uses by including one of the special directives
883	`%pointer' or `%array' in the first (definitions) section of your flex
884	input. The default is `%pointer', unless you use the `-l' lex
885	compatibility option, in which case `yytext' will be an array. The
886	advantage of using `%pointer' is substantially faster scanning and no
887	buffer overflow when matching very large tokens (unless you run out of
888	dynamic memory). The disadvantage is that you are restricted in how
889	your actions can modify `yytext' (*note Actions::), and calls to the
890	`unput()' function destroy the present contents of `yytext', which can
891	be a considerable porting headache when moving between different `lex'
892	versions.
893
894	The advantage of `%array' is that you can then modify `yytext' to
895	your heart's content, and calls to `unput()' do not destroy `yytext'
896	(*note Actions::). Furthermore, existing `lex' programs sometimes
897	access `yytext' externally using declarations of the form:
898
899
900	extern char yytext[];
901
902	This definition is erroneous when used with `%pointer', but correct
903	for `%array'.
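
The reason is that, in C, an array and a pointer are different kinds of
object even though both can be indexed. The sketch below shows the
difference with `sizeof'; the variable names are hypothetical, and the
size 8192 is only an illustrative value.

```c
#include <stddef.h>

/* Like %array: the object is the character storage itself. */
static char text_as_array[8192];

/* Like %pointer: the object holds only the address of the storage. */
static char *text_as_pointer = text_as_array;

/* Size of the array object: the full storage. */
size_t array_object_size(void)   { return sizeof text_as_array; }

/* Size of the pointer object: one address, whatever it points to. */
size_t pointer_object_size(void) { return sizeof text_as_pointer; }
```

A file that declares `extern char yytext[];' while the scanner defines
a pointer would treat the bytes of the address as if they were the
token text, which is why the declaration is erroneous with `%pointer'.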
904
905	The `%array' declaration defines `yytext' to be an array of `YYLMAX'
906	characters, which defaults to a fairly large value. You can change the
907	size by simply #define'ing `YYLMAX' to a different value in the first
908	section of your `flex' input. As mentioned above, with `%pointer'
909	yytext grows dynamically to accommodate large tokens. While this means
910	your `%pointer' scanner can accommodate very large tokens (such as
911	matching entire blocks of comments), bear in mind that each time the
912	scanner must resize `yytext' it also must rescan the entire token from
913	the beginning, so matching such tokens can prove slow. `yytext'
914	presently does _not_ dynamically grow if a call to `unput()' results in
915	too much text being pushed back; instead, a run-time error results.
916
917	Also note that you cannot use `%array' with C++ scanner classes
918	(*note Cxx::).
919
920
921	File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top
922
923	Actions
924	*******
925
926	Each pattern in a rule has a corresponding "action", which can be
927	any arbitrary C statement. The pattern ends at the first non-escaped
928	whitespace character; the remainder of the line is its action. If the
929	action is empty, then when the pattern is matched the input token is
930	simply discarded. For example, here is the specification for a program
931	which deletes all occurrences of `zap me' from its input:
932
933
934	%%
935	"zap me"
936
937	This example will copy all other characters in the input to the
938	output since they will be matched by the default rule.
939
940	Here is a program which compresses multiple blanks and tabs down to a
941	single blank, and throws away whitespace found at the end of a line:
942
943
944	%%
945	[ \t]+ putchar( ' ' );
946	[ \t]+$ /* ignore this token */
947
948	If the action contains a `{', then the action spans till the
949	balancing `}' is found, and the action may cross multiple lines.
950	`flex' knows about C strings and comments and won't be fooled by braces
951	found within them, but also allows actions to begin with `%{' and will
952	consider the action to be all the text up to the next `%}' (regardless
953	of ordinary braces inside the action).
954
955	An action consisting solely of a vertical bar (`|') means "same as
956	the action for the next rule". See below for an illustration.
957
958	Actions can include arbitrary C code, including `return' statements
959	to return a value to whatever routine called `yylex()'. Each time
960	`yylex()' is called it continues processing tokens from where it last
961	left off until it either reaches the end of the file or executes a
962	return.
963
964	Actions are free to modify `yytext' except for lengthening it
965	(adding characters to its end; these will overwrite later characters in
966	the input stream). This however does not apply when using `%array'
967	(*note Matching::). In that case, `yytext' may be freely modified in
968	any way.
969
970	Actions are free to modify `yyleng' except they should not do so if
971	the action also includes use of `yymore()' (see below).
972
973	There are a number of special directives which can be included
974	within an action:
975
976	`ECHO'
977	copies yytext to the scanner's output.
978
979	`BEGIN'
980	followed by the name of a start condition places the scanner in the
981	corresponding start condition (see below).
982
983	`REJECT'
984	directs the scanner to proceed on to the "second best" rule which
985	matched the input (or a prefix of the input). The rule is chosen
986	as described above in *Note Matching::, and `yytext' and `yyleng'
987	set up appropriately. It may either be one which matched as much
988	text as the originally chosen rule but came later in the `flex'
989	input file, or one which matched less text. For example, the
990	following will both count the words in the input and call the
991	routine `special()' whenever `frob' is seen:
992
993
994	int word_count = 0;
995	%%
996
997	frob special(); REJECT;
998	[^ \t\n]+ ++word_count;
999
1000	Without the `REJECT', any occurrences of `frob' in the input would
1001	not be counted as words, since the scanner normally executes only
1002	one action per token. Multiple uses of `REJECT' are allowed, each
1003	one finding the next best choice to the currently active rule. For
1004	example, when the following scanner scans the token `abcd', it will
1005	write `abcdabcaba' to the output:
1006
1007
1008	%%
1009	a        |
1010	ab       |
1011	abc      |
1012	abcd     ECHO; REJECT;
1013	.|\n     /* eat up any unmatched character */
1014
1015	The first three rules share the fourth's action since they use the
1016	special `|' action.
1017
1018	`REJECT' is a particularly expensive feature in terms of scanner
1019	performance; if it is used in _any_ of the scanner's actions it
1020	will slow down _all_ of the scanner's matching. Furthermore,
1021	`REJECT' cannot be used with the `-Cf' or `-CF' options (*note
1022	Scanner Options::).
1023
1024	Note also that unlike the other special actions, `REJECT' is a
1025	_branch_; code immediately following it in the action will _not_
1026	be executed.
1027
1028	`yymore()'
1029	tells the scanner that the next time it matches a rule, the
1030	corresponding token should be _appended_ onto the current value of
1031	`yytext' rather than replacing it. For example, given the input
1032	`mega-kludge' the following will write `mega-mega-kludge' to the
1033	output:
1034
1035
1036	%%
1037	mega- ECHO; yymore();
1038	kludge ECHO;
1039
1040	First `mega-' is matched and echoed to the output. Then `kludge'
1041	is matched, but the previous `mega-' is still hanging around at the
1042	beginning of `yytext' so the `ECHO' for the `kludge' rule will
1043	actually write `mega-kludge'.
1044
1045	Two notes regarding use of `yymore()'. First, `yymore()' depends on
1046	the value of `yyleng' correctly reflecting the size of the current
1047	token, so you must not modify `yyleng' if you are using `yymore()'.
1048	Second, the presence of `yymore()' in the scanner's action entails a
1049	minor performance penalty in the scanner's matching speed.
1050
1051	`yyless(n)' returns all but the first `n' characters of the current
1052	token back to the input stream, where they will be rescanned when the
1053	scanner looks for the next match. `yytext' and `yyleng' are adjusted
1054	appropriately (e.g., `yyleng' will now be equal to `n'). For example,
1055	on the input `foobar' the following will write out `foobarbar':
1056
1057
1058	%%
1059	foobar ECHO; yyless(3);
1060	[a-z]+ ECHO;
1061
1062	An argument of 0 to `yyless()' will cause the entire current input
1063	string to be scanned again. Unless you've changed how the scanner will
1064	subsequently process its input (using `BEGIN', for example), this will
1065	result in an endless loop.
1066
1067	Note that `yyless()' is a macro and can only be used in the flex
1068	input file, not from other source files.
1069
1070	`unput(c)' puts the character `c' back onto the input stream. It
1071	will be the next character scanned. The following action will take the
1072	current token and cause it to be rescanned enclosed in parentheses.
1073
1074
1075	{
1076	int i;
1077	/* Copy yytext because unput() trashes yytext */
1078	char *yycopy = strdup( yytext );
1079	unput( ')' );
1080	for ( i = yyleng - 1; i >= 0; --i )
1081	    unput( yycopy[i] );
1082	unput( '(' );
1083	free( yycopy );
1084	}
1085
1086	Note that since each `unput()' puts the given character back at the
1087	_beginning_ of the input stream, pushing back strings must be done
1088	back-to-front.
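
As a rough illustration of why the order matters, here is a small
stand-alone C sketch (it does not use `flex' at all) that models
pushed-back input as a stack. The names `push_back()', `next_char()',
`push_back_string()' and `PUSHBACK_MAX' are invented for this sketch and
are not part of `flex'.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64   /* hypothetical capacity for this sketch */

static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Put one character at the front of the pending input, in the spirit
   of unput().  Returns 0 on success, -1 if the buffer is full. */
int push_back(char c)
{
    if (pushback_len == PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = c;
    return 0;
}

/* Read the next pending character, or -1 when nothing is pending. */
int next_char(void)
{
    if (pushback_len == 0)
        return -1;
    return (unsigned char) pushback[--pushback_len];
}

/* Push a whole string back so that it is read in its original order:
   the last character must go in first. */
int push_back_string(const char *s)
{
    size_t i = strlen(s);

    while (i > 0)
        if (push_back(s[--i]) != 0)
            return -1;
    return 0;
}
```

After `push_back_string("(abc)")', successive calls to `next_char()'
yield the characters in their original left-to-right order, because the
character pushed last is the one read first.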
1089
1090	An important potential problem when using `unput()' is that if you
1091	are using `%pointer' (the default), a call to `unput()' _destroys_ the
1092	contents of `yytext', starting with its rightmost character and
1093	devouring one character to the left with each call. If you need the
1094	value of `yytext' preserved after a call to `unput()' (as in the above
1095	example), you must either first copy it elsewhere, or build your
1096	scanner using `%array' instead (*note Matching::).
1097
1098	Finally, note that you cannot put back `EOF' to attempt to mark the
1099	input stream with an end-of-file.
1100
1101	`input()' reads the next character from the input stream. For
1102	example, the following is one way to eat up C comments:
1103
1104
1105	%%
1106	"/*"        {
1107	            register int c;
1108
1109	            for ( ; ; )
1110	                {
1111	                while ( (c = input()) != '*' &&
1112	                        c != EOF )
1113	                    ;    /* eat up text of comment */
1114
1115	                if ( c == '*' )
1116	                    {
1117	                    while ( (c = input()) == '*' )
1118	                        ;
1119	                    if ( c == '/' )
1120	                        break;    /* found the end */
1121	                    }
1122
1123	                if ( c == EOF )
1124	                    {
1125	                    error( "EOF in comment" );
1126	                    break;
1127	                    }
1128	                }
1129	            }
1130
1131	(Note that if the scanner is compiled using `C++', then `input()' is
1132	instead referred to as `yyinput()', in order to avoid a name clash with
1133	the `C++' stream by the name of `input'.)
1134
1135	`YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
1136	the next time the scanner attempts to match a token, it will first
1137	refill the buffer using `YY_INPUT()' (*note Generated Scanner::). This
1138	action is a special case of the more general `yy_flush_buffer()'
1139	function, described below (*note Multiple Input Buffers::).
1140
1141	`yyterminate()' can be used in lieu of a return statement in an
1142	action. It terminates the scanner and returns a 0 to the scanner's
1143	caller, indicating "all done". By default, `yyterminate()' is also
1144	called when an end-of-file is encountered. It is a macro and may be
1145	redefined.
1146
1147
1148	File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top
1149
1150	The Generated Scanner
1151	*********************
1152
1153	The output of `flex' is the file `lex.yy.c', which contains the
1154	scanning routine `yylex()', a number of tables used by it for matching
1155	tokens, and a number of auxiliary routines and macros. By default,
1156	`yylex()' is declared as follows:
1157
1158
1159	int yylex()
1160	{
1161	... various definitions and the actions in here ...
1162	}
1163
1164	(If your environment supports function prototypes, then it will be
1165	`int yylex( void )'.) This definition may be changed by defining the
1166	`YY_DECL' macro. For example, you could use:
1167
1168
1169	#define YY_DECL float lexscan( a, b ) float a, b;
1170
1171	to give the scanning routine the name `lexscan', returning a float,
1172	and taking two floats as arguments. Note that if you give arguments to
1173	the scanning routine using a K&R-style/non-prototyped function
1174	declaration, you must terminate the definition with a semi-colon (;).
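
For comparison, a prototype-style definition of the same macro might
look like the following sketch. The function body shown here is only a
placeholder so that the fragment compiles on its own; in a real scanner
`flex' emits the scanning code after `YY_DECL'.

```c
/* Prototype-style equivalent of the K&R example above.  The parameter
   types sit inside the parentheses, so no trailing semicolon is used. */
#define YY_DECL float lexscan(float a, float b)

/* Placeholder body for illustration only; the generated scanner would
   supply its own body here. */
YY_DECL
{
    return a + b;
}
```

The caller then invokes `lexscan( x, y )' wherever it would otherwise
have called `yylex()'.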
1175
1176	`flex' generates `C99' function definitions by default. However, `flex'
1177	does have the ability to generate obsolete, er, `traditional', function
1178	definitions. This is to support bootstrapping gcc on old systems.
1179	Unfortunately, traditional definitions prevent us from using any
1180	standard data types smaller than int (such as short, char, or bool) as
1181	function arguments. For this reason, future versions of `flex' may
1182	generate standard C99 code only, leaving K&R-style functions to the
1183	historians. Currently, if you do not want `C99' definitions, then
1184	you must use `%option noansi-definitions'.
1185
1186	Whenever `yylex()' is called, it scans tokens from the global input
1187	file `yyin' (which defaults to stdin). It continues until it either
1188	reaches an end-of-file (at which point it returns the value 0) or one
1189	of its actions executes a `return' statement.
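
A caller typically drives the scanner with a loop of the following
shape. This is only a sketch: the `yylex()' defined here is a stand-in
stub so that the fragment compiles without `flex', the token codes are
made up, and `count_tokens()' is a hypothetical helper.

```c
#include <stdio.h>

/* Stand-in for the generated scanner: hands out three made-up token
   codes and then 0, the value yylex() returns at end-of-file. */
int yylex(void)
{
    static const int tokens[] = { 258, 259, 258, 0 };
    static size_t next = 0;
    int tok = tokens[next];

    if (tok != 0)
        ++next;
    return tok;
}

/* Call yylex() until it reports end-of-file; return the token count. */
int count_tokens(void)
{
    int n = 0;
    int tok;

    while ((tok = yylex()) != 0) {
        printf("token %d\n", tok);
        ++n;
    }
    return n;
}
```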
1190
1191	If the scanner reaches an end-of-file, subsequent calls are undefined
1192	unless either `yyin' is pointed at a new input file (in which case
1193	scanning continues from that file), or `yyrestart()' is called.
1194	`yyrestart()' takes one argument, a `FILE *' pointer (which can be
1195	NULL, if you've set up `YY_INPUT' to scan from a source other than
1196	`yyin'), and initializes `yyin' for scanning from that file.
1197	Essentially there is no difference between just assigning `yyin' to a
1198	new input file or using `yyrestart()' to do so; the latter is available
1199	for compatibility with previous versions of `flex', and because it can
1200	be used to switch input files in the middle of scanning. It can also
1201	be used to throw away the current input buffer, by calling it with an
1202	argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
1203	(*note Actions::). Note that `yyrestart()' does _not_ reset the start
1204	condition to `INITIAL' (*note Start Conditions::).
1205
1206	If `yylex()' stops scanning due to executing a `return' statement in
1207	one of the actions, the scanner may then be called again and it will
1208	resume scanning where it left off.
1209
1210	By default (and for purposes of efficiency), the scanner uses
1211	block-reads rather than simple `getc()' calls to read characters from
1212	`yyin'. The nature of how it gets its input can be controlled by
1213	defining the `YY_INPUT' macro. The calling sequence for `YY_INPUT()'
1214	is `YY_INPUT(buf,result,max_size)'. Its action is to place up to
1215	`max_size' characters in the character array `buf' and return in the
1216	integer variable `result' either the number of characters read or the
1217	constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default
1218	`YY_INPUT' reads from the global file-pointer `yyin'.
1219
1220	Here is a sample definition of `YY_INPUT' (in the definitions
1221	section of the input file):
1222
1223
1224	%{
1225	#define YY_INPUT(buf,result,max_size) \
1226	    { \
1227	    int c = getchar(); \
1228	    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
1229	    }
1230	%}
1231
1232	This definition will change the input processing to occur one
1233	character at a time.
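
The same contract (fill `buf' with at most `max_size' characters and
report how many were stored, or 0 at end of input) can be met by a
helper that reads from a string in memory. The sketch below is plain C
and does not depend on `flex'; `set_source()' and `fill_from_string()'
are hypothetical names, and the `YY_INPUT' wiring shown in the comment
is an assumption about how one might connect it.

```c
#include <stddef.h>
#include <string.h>

static const char *src = "";   /* hypothetical: text still to be read */

/* Select the string that later calls will read from. */
void set_source(const char *text)
{
    src = text;
}

/* Copy up to max_size characters into buf and return how many were
   copied; a return value of 0 signals end of input. */
size_t fill_from_string(char *buf, size_t max_size)
{
    size_t n = strlen(src);

    if (n > max_size)
        n = max_size;
    memcpy(buf, src, n);
    src += n;
    return n;
}

/* Possible wiring in the definitions section (an assumption, untested
   against any particular flex version):

   #define YY_INPUT(buf,result,max_size) \
       { result = fill_from_string( buf, max_size ); }
*/
```

Unlike the one-character-at-a-time definition, this version hands the
scanner as many characters as it asks for on each call.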
1234
1235	When the scanner receives an end-of-file indication from `YY_INPUT', it
1236	then checks the `yywrap()' function. If `yywrap()' returns false
1237	(zero), then it is assumed that the function has gone ahead and set up
1238	`yyin' to point to another input file, and scanning continues. If it
1239	returns true (non-zero), then the scanner terminates, returning 0 to
1240	its caller. Note that in either case, the start condition remains
1241	unchanged; it does _not_ revert to `INITIAL'.
1242
1243	If you do not supply your own version of `yywrap()', then you must
1244	either use `%option noyywrap' (in which case the scanner behaves as
1245	though `yywrap()' returned 1), or you must link with `-lfl' to obtain
1246	the default version of the routine, which always returns 1.
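
A user-supplied `yywrap()' that moves on to further input files could
be sketched as follows. The list `pending_files' and the helper
`set_pending_files()' are hypothetical, and `yyin' is defined here only
so that the fragment compiles without `flex'; in a real scanner it is
provided by the generated code.

```c
#include <stdio.h>

/* Stand-in for the scanner's global input file.  In a real scanner,
   rely on the declaration in the generated code instead. */
FILE *yyin;

/* Hypothetical NULL-terminated list of file names still to be read. */
static const char **pending_files;

void set_pending_files(const char **files)
{
    pending_files = files;
}

/* Return 0 after pointing yyin at the next readable file, or 1 when
   no input remains and the scanner should terminate. */
int yywrap(void)
{
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *next = fopen(*pending_files++, "r");

        if (next != NULL) {
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = next;
            return 0;   /* more input: scanning continues */
        }
        /* Could not open this file; try the next one. */
    }
    return 1;           /* all done */
}
```

Files that cannot be opened are skipped silently in this sketch; a real
program would probably want to report them.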
1247
1248	For scanning from in-memory buffers (e.g., scanning strings), see
1249	*Note Scanning Strings::, and *Note Multiple Input Buffers::.
1250
1251	The scanner writes its `ECHO' output to the `yyout' global (default,
1252	`stdout'), which may be redefined by the user simply by assigning it to
1253	some other `FILE' pointer.
1254
