qtokenautomaton is a tokenizer generator: it generates a simple, Unicode-aware
tokenizer for C++ that uses the Qt API.

Introduction
=====================
QTokenAutomaton generates a C++ class that essentially has this interface:

class YourTokenizer
{
protected:
    enum Token
    {
        A,
        B,
        C,
        NoKeyword
    };

    static inline Token toToken(const QString &string);
    static inline Token toToken(const QStringRef &string);
    static Token toToken(const QChar *data, int length);
    static QString toString(Token token);
};

toToken() returns the enum value corresponding to the given string. This is
done with O(N) time complexity, where N is the length of the string. The
returned value can then be switched over efficiently. The alternatives, either
a long chain of if statements comparing one QString to several other QStrings,
or first inserting all strings into a hash, are less efficient.

For instance, the latter case of using a hash would involve, when excluding
the initial populating of the hash, O(N) + O(1): O(N) to compute the hash of
the string, plus O(1) for the lookup, where O(1) assumes a non-conflicting
hash lookup.

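To illustrate why the lookup is O(N) and how the result can be switched over,
here is a minimal hand-written sketch in standard C++. It is not actual
QTokenAutomaton output. It assumes three hypothetical keywords (body, span,
table), uses std::u16string_view in place of QString, and the names Token,
toToken and describe as written here are illustrative only.

```cpp
#include <string_view>

// Hypothetical token set; a generated tokenizer would take its names from
// the token file.
enum class Token { Body, Span, Table, NoKeyword };

// Sketch of an automaton-style lookup: branch on the length first, then on
// individual characters, so each input character is inspected at most once.
inline Token toToken(std::u16string_view s)
{
    switch (s.size()) {
    case 4:
        // The two four-letter keywords differ in their first character.
        if (s[0] == u'b')
            return s.substr(1) == u"ody" ? Token::Body : Token::NoKeyword;
        if (s[0] == u's')
            return s.substr(1) == u"pan" ? Token::Span : Token::NoKeyword;
        return Token::NoKeyword;
    case 5:
        return s == u"table" ? Token::Table : Token::NoKeyword;
    default:
        return Token::NoKeyword;
    }
}

// The caller can then dispatch with an ordinary switch statement.
inline const char *describe(Token token)
{
    switch (token) {
    case Token::Body:      return "document body";
    case Token::Span:      return "inline container";
    case Token::Table:     return "table";
    case Token::NoKeyword: break;
    }
    return "not a keyword";
}
```

With functions like these, a call such as describe(toToken(u"span")) reads the
input once and then dispatches on an integer value instead of repeating
string comparisons.
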
toString() returns the string that a given enum value represents. It is
implemented so that the strings are stored in an efficient manner.

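How the generated toString() stores its strings is not spelled out here. One
plausible layout, shown purely as an assumption, keeps all token strings in a
single constant character array and records an offset and a length per token,
so that no per-string objects are needed. The sketch below uses standard C++
and hypothetical keywords rather than Qt types; tokenChars, Slice and
tokenSlices are made-up names.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical token set, as in a generated tokenizer.
enum class Token { Body, Span, Table, NoKeyword };

// All keyword strings concatenated into one constant array (assumed layout).
inline constexpr char16_t tokenChars[] = u"bodyspantable";

struct Slice { std::size_t offset; std::size_t length; };

// One entry per keyword, indexed by the enum value.
inline constexpr Slice tokenSlices[] = {
    {0, 4},  // Body  -> "body"
    {4, 4},  // Span  -> "span"
    {8, 5},  // Table -> "table"
};

// Returns a view into the shared array; NoKeyword maps to an empty view.
inline std::u16string_view toString(Token token)
{
    if (token == Token::NoKeyword)
        return {};
    const Slice &slice = tokenSlices[static_cast<std::size_t>(token)];
    return std::u16string_view(tokenChars + slice.offset, slice.length);
}
```

In Qt code, a constant array like this could be wrapped without copying by
QString::fromRawData(); whether the generated class does so is an assumption.
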
A typical usage scenario is in combination with QXmlStreamReader. When parsing
a certain format, for instance XHTML, each element name (body, span, table and
so forth) typically needs special treatment. QTokenAutomaton conceptually cuts
the string comparisons down to one.

Beyond efficiency, QTokenAutomaton also increases type safety, since C++
identifiers are used instead of string literals.

Usage
=====================
To use it, proceed as follows:

1. Create a token file. Use exampleFile.xml as a template.

2. Make sure it is valid by validating it against qtokenautomaton.xsd. On
   Linux, this can be done by running
   `xmllint --noout --schema qtokenautomaton.xsd yourFile.xml`

3. Produce the C++ files by invoking the stylesheet with an XSL-T 2.0
   processor[1]. For instance, with Saxon, this would be:
   `java net.sf.saxon.Transform -xsl:qautomaton2cpp.xsl yourFile.xml`

4. Include the produced C++ files in your build system.


1. As of Qt 4.4, Qt has no support for XSL-T.