| 1 |
|
|---|
| 2 | =head1 NAME
|
|---|
| 3 |
|
|---|
| 4 | perlpodspec - Plain Old Documentation: format specification and notes
|
|---|
| 5 |
|
|---|
| 6 | =head1 DESCRIPTION
|
|---|
| 7 |
|
|---|
| 8 | This document provides detailed notes on the Pod markup language. Most
|
|---|
| 9 | people will only have to read L<perlpod|perlpod> to know how to write
|
|---|
| 10 | in Pod, but this document may answer some incidental questions to do
|
|---|
| 11 | with parsing and rendering Pod.
|
|---|
| 12 |
|
|---|
| 13 | In this document, "must" / "must not", "should" /
|
|---|
| 14 | "should not", and "may" have their conventional (cf. RFC 2119)
|
|---|
| 15 | meanings: "X must do Y" means that if X doesn't do Y, it's against
|
|---|
| 16 | this specification, and should really be fixed. "X should do Y"
|
|---|
| 17 | means that it's recommended, but X may fail to do Y, if there's a
|
|---|
| 18 | good reason. "X may do Y" is merely a note that X can do Y at
|
|---|
| 19 | will (although it is up to the reader to detect any connotation of
|
|---|
| 20 | "and I think it would be I<nice> if X did Y" versus "it wouldn't
|
|---|
| 21 | really I<bother> me if X did Y").
|
|---|
| 22 |
|
|---|
| 23 | Notably, when I say "the parser should do Y", the
|
|---|
| 24 | parser may fail to do Y, if the calling application explicitly
|
|---|
| 25 | requests that the parser I<not> do Y. I often phrase this as
|
|---|
| 26 | "the parser should, by default, do Y." This doesn't I<require>
|
|---|
| 27 | the parser to provide an option for turning off whatever
|
|---|
| 28 | feature Y is (like expanding tabs in verbatim paragraphs), although
|
|---|
| 29 | it implies that such an option I<may> be provided.
|
|---|
| 30 |
|
|---|
| 31 | =head1 Pod Definitions
|
|---|
| 32 |
|
|---|
| 33 | Pod is embedded in files, typically Perl source files -- although you
|
|---|
| 34 | can write a file that's nothing but Pod.
|
|---|
| 35 |
|
|---|
| 36 | A B<line> in a file consists of zero or more non-newline characters,
|
|---|
| 37 | terminated by either a newline or the end of the file.
|
|---|
| 38 |
|
|---|
| 39 | A B<newline sequence> is usually a platform-dependent concept, but
|
|---|
| 40 | Pod parsers should understand it to mean any of CR (ASCII 13), LF
|
|---|
| 41 | (ASCII 10), or a CRLF (ASCII 13 followed immediately by ASCII 10), in
|
|---|
| 42 | addition to any other system-specific meaning. The first CR/CRLF/LF
|
|---|
| 43 | sequence in the file may be used as the basis for identifying the
|
|---|
| 44 | newline sequence for parsing the rest of the file.
|
|---|
| 45 |
|
|---|
| 46 | A B<blank line> is a line consisting entirely of zero or more spaces
|
|---|
| 47 | (ASCII 32) or tabs (ASCII 9), and terminated by a newline or end-of-file.
|
|---|
| 48 | A B<non-blank line> is a line containing one or more characters other
|
|---|
| 49 | than space or tab (and terminated by a newline or end-of-file).
|
|---|
| 50 |
|
|---|
| 51 | (I<Note:> Many older Pod parsers did not accept a line consisting of
|
|---|
| 52 | spaces/tabs and then a newline as a blank line -- the only lines they
|
|---|
| 53 | considered blank were lines consisting of I<no characters at all>,
|
|---|
| 54 | terminated by a newline.)
|
|---|
| 55 |
|
|---|
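The definitions of line, newline sequence, and blank line above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any actual Pod parser: the names "split_lines" and "is_blank" are made up for this example, and the sketch accepts all three newline forms everywhere rather than keying on the first one seen in the file.

```python
import re

# CRLF is listed first so that it is consumed as one newline, not as CR then LF.
NEWLINE = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text into lines on CR, LF, or CRLF, dropping the terminators."""
    lines = NEWLINE.split(text)
    # A trailing newline leaves an empty final piece, which is not a line.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

def is_blank(line):
    """A blank line holds zero or more spaces (ASCII 32) or tabs (ASCII 9)."""
    return all(ch in " \t" for ch in line)
```

Under this sketch, a line of only spaces and tabs counts as blank, which is the behavior this document asks for and which many older parsers lacked.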
| 56 | B<Whitespace> is used in this document as a blanket term for spaces,
|
|---|
| 57 | tabs, and newline sequences. (By itself, this term usually refers
|
|---|
| 58 | to literal whitespace. That is, sequences of whitespace characters
|
|---|
| 59 | in Pod source, as opposed to "EE<lt>32>", which is a formatting
|
|---|
| 60 | code that I<denotes> a whitespace character.)
|
|---|
| 61 |
|
|---|
| 62 | A B<Pod parser> is a module meant for parsing Pod (regardless of
|
|---|
| 63 | whether this involves calling callbacks or building a parse tree or
|
|---|
| 64 | directly formatting it). A B<Pod formatter> (or B<Pod translator>)
|
|---|
| 65 | is a module or program that converts Pod to some other format (HTML,
|
|---|
| 66 | plaintext, TeX, PostScript, RTF). A B<Pod processor> might be a
|
|---|
| 67 | formatter or translator, or might be a program that does something
|
|---|
| 68 | else with the Pod (like wordcounting it, scanning for index points,
|
|---|
| 69 | etc.).
|
|---|
| 70 |
|
|---|
| 71 | Pod content is contained in B<Pod blocks>. A Pod block starts with a
|
|---|
| 72 | line that matches C<m/\A=[a-zA-Z]/>, and continues up to the next line
|
|---|
| 73 | that matches C<m/\A=cut/> -- or up to the end of the file, if there is
|
|---|
| 74 | no C<m/\A=cut/> line.
|
|---|
| 75 |
|
|---|
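A minimal Python sketch of the Pod block rule just given, assuming the input has already been split into lines. The function name "pod_blocks" is hypothetical, and two choices here are this example's own: each block is returned with its closing "=cut" line included, and a "=cut" found outside any block simply stops the scan (this document says elsewhere that a processor must halt in that case and should warn, which the sketch leaves out).

```python
import re

POD_START = re.compile(r"\A=[a-zA-Z]")   # pattern quoted from the text above
POD_CUT = re.compile(r"\A=cut")          # pattern quoted from the text above

def pod_blocks(lines):
    """Return a list of Pod blocks, each a list of lines."""
    blocks = []
    current = None                 # lines of the block being collected, if any
    for line in lines:
        if current is None:
            if POD_CUT.match(line):
                break              # "=cut" cannot start a block: halt parsing
            if POD_START.match(line):
                current = [line]
        else:
            current.append(line)
            if POD_CUT.match(line):
                blocks.append(current)
                current = None
    if current is not None:        # block ran to end of file without "=cut"
        blocks.append(current)
    return blocks
```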
| 76 | =for comment
|
|---|
| 77 | The current perlsyn says:
|
|---|
| 78 | [beginquote]
|
|---|
| 79 | Note that pod translators should look at only paragraphs beginning
|
|---|
| 80 | with a pod directive (it makes parsing easier), whereas the compiler
|
|---|
| 81 | actually knows to look for pod escapes even in the middle of a
|
|---|
| 82 | paragraph. This means that the following secret stuff will be ignored
|
|---|
| 83 | by both the compiler and the translators.
|
|---|
| 84 | $a=3;
|
|---|
| 85 | =secret stuff
|
|---|
| 86 | warn "Neither POD nor CODE!?"
|
|---|
| 87 | =cut back
|
|---|
| 88 | print "got $a\n";
|
|---|
| 89 | You probably shouldn't rely upon the warn() being podded out forever.
|
|---|
| 90 | Not all pod translators are well-behaved in this regard, and perhaps
|
|---|
| 91 | the compiler will become pickier.
|
|---|
| 92 | [endquote]
|
|---|
| 93 | I think that those paragraphs should just be removed; paragraph-based
|
|---|
| 94 | parsing seems to have been largely abandoned, because of the hassle
|
|---|
| 95 | with non-empty blank lines messing up what people meant by "paragraph".
|
|---|
| 96 | Even if the "it makes parsing easier" bit were especially true,
|
|---|
| 97 | it wouldn't be worth the confusion of having perl and pod2whatever
|
|---|
| 98 | actually disagree on what can constitute a Pod block.
|
|---|
| 99 |
|
|---|
| 100 | Within a Pod block, there are B<Pod paragraphs>. A Pod paragraph
|
|---|
| 101 | consists of non-blank lines of text, separated by one or more blank
|
|---|
| 102 | lines.
|
|---|
| 103 |
|
|---|
| 104 | For purposes of Pod processing, there are four types of paragraphs in
|
|---|
| 105 | a Pod block:
|
|---|
| 106 |
|
|---|
| 107 | =over
|
|---|
| 108 |
|
|---|
| 109 | =item *
|
|---|
| 110 |
|
|---|
| 111 | A command paragraph (also called a "directive"). The first line of
|
|---|
| 112 | this paragraph must match C<m/\A=[a-zA-Z]/>. Command paragraphs are
|
|---|
| 113 | typically one line, as in:
|
|---|
| 114 |
|
|---|
| 115 | =head1 NOTES
|
|---|
| 116 |
|
|---|
| 117 | =item *
|
|---|
| 118 |
|
|---|
| 119 | But they may span several (non-blank) lines:
|
|---|
| 120 |
|
|---|
| 121 | =for comment
|
|---|
| 122 | Hm, I wonder what it would look like if
|
|---|
| 123 | you tried to write a BNF for Pod from this.
|
|---|
| 124 |
|
|---|
| 125 | =head3 Dr. Strangelove, or: How I Learned to
|
|---|
| 126 | Stop Worrying and Love the Bomb
|
|---|
| 127 |
|
|---|
| 128 | I<Some> command paragraphs allow formatting codes in their content
|
|---|
| 129 | (i.e., after the part that matches C<m/\A=[a-zA-Z]\S*\s*/>), as in:
|
|---|
| 130 |
|
|---|
| 131 | =head1 Did You Remember to C<use strict;>?
|
|---|
| 132 |
|
|---|
| 133 | In other words, the Pod processing handler for "head1" will apply the
|
|---|
| 134 | same processing to "Did You Remember to CE<lt>use strict;>?" that it
|
|---|
| 135 | would to an ordinary paragraph -- i.e., formatting codes (like
|
|---|
| 136 | "CE<lt>...>") are parsed and presumably formatted appropriately, and
|
|---|
| 137 | whitespace in the form of literal spaces and/or tabs is not
|
|---|
| 138 | significant.
|
|---|
| 139 |
|
|---|
| 140 | =item *
|
|---|
| 141 |
|
|---|
| 142 | A B<verbatim paragraph>. The first line of this paragraph must be a
|
|---|
| 143 | literal space or tab, and this paragraph must not be inside a "=begin
|
|---|
| 144 | I<identifier>", ... "=end I<identifier>" sequence unless
|
|---|
| 145 | "I<identifier>" begins with a colon (":"). That is, if a paragraph
|
|---|
| 146 | starts with a literal space or tab, but I<is> inside a
|
|---|
| 147 | "=begin I<identifier>", ... "=end I<identifier>" region, then it's
|
|---|
| 148 | a data paragraph, unless "I<identifier>" begins with a colon.
|
|---|
| 149 |
|
|---|
| 150 | Whitespace I<is> significant in verbatim paragraphs (although, in
|
|---|
| 151 | processing, tabs are probably expanded).
|
|---|
| 152 |
|
|---|
| 153 | =item *
|
|---|
| 154 |
|
|---|
| 155 | An B<ordinary paragraph>. A paragraph is an ordinary paragraph
|
|---|
| 156 | if its first line matches neither C<m/\A=[a-zA-Z]/> nor
|
|---|
| 157 | C<m/\A[ \t]/>, I<and> if it's not inside a "=begin I<identifier>",
|
|---|
| 158 | ... "=end I<identifier>" sequence unless "I<identifier>" begins with
|
|---|
| 159 | a colon (":").
|
|---|
| 160 |
|
|---|
| 161 | =item *
|
|---|
| 162 |
|
|---|
| 163 | A B<data paragraph>. This is a paragraph that I<is> inside a "=begin
|
|---|
| 164 | I<identifier>" ... "=end I<identifier>" sequence where
|
|---|
| 165 | "I<identifier>" does I<not> begin with a literal colon (":"). In
|
|---|
| 166 | some sense, a data paragraph is not part of Pod at all (i.e.,
|
|---|
| 167 | effectively it's "out-of-band"), since it's not subject to most kinds
|
|---|
| 168 | of Pod parsing; but it is specified here, since Pod
|
|---|
| 169 | parsers need to be able to call an event for it, or store it in some
|
|---|
| 170 | form in a parse tree, or at least just parse I<around> it.
|
|---|
| 171 |
|
|---|
| 172 | =back
|
|---|
| 173 |
|
|---|
| 174 | For example: consider the following paragraphs:
|
|---|
| 175 |
|
|---|
| 176 | # <- that's the 0th column
|
|---|
| 177 |
|
|---|
| 178 | =head1 Foo
|
|---|
| 179 |
|
|---|
| 180 | Stuff
|
|---|
| 181 |
|
|---|
| 182 | $foo->bar
|
|---|
| 183 |
|
|---|
| 184 | =cut
|
|---|
| 185 |
|
|---|
| 186 | Here, "=head1 Foo" and "=cut" are command paragraphs because the first
|
|---|
| 187 | line of each matches C<m/\A=[a-zA-Z]/>. "I<[space][space]>$foo->bar"
|
|---|
| 188 | is a verbatim paragraph, because its first line starts with a literal
|
|---|
| 189 | whitespace character (and there's no "=begin"..."=end" region around).
|
|---|
| 190 |
|
|---|
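The four paragraph types can be told apart from a paragraph's first line plus the identifier of any enclosing "=begin" region. The following Python sketch is one way to express that decision; the function name "classify_paragraph" and its "region" parameter are assumptions made for this example only.

```python
import re

def classify_paragraph(first_line, region=None):
    """Classify a Pod paragraph by its first line.

    region is the identifier of the enclosing "=begin" region,
    or None when the paragraph is not inside such a region.
    """
    # Command paragraphs are recognized first, even inside a region.
    if re.match(r"\A=[a-zA-Z]", first_line):
        return "command"
    # Inside a region whose identifier lacks a leading colon,
    # every non-command paragraph is a data paragraph.
    if region is not None and not region.startswith(":"):
        return "data"
    if re.match(r"\A[ \t]", first_line):
        return "verbatim"
    return "ordinary"
```

For the example above, "=head1 Foo" and "=cut" classify as command paragraphs, "Stuff" as ordinary, and the indented "$foo->bar" as verbatim.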
| 191 | The "=begin I<identifier>" ... "=end I<identifier>" commands stop
|
|---|
| 192 | paragraphs that they surround from being parsed as ordinary or verbatim
|
|---|
| 193 | paragraphs, if I<identifier> doesn't begin with a colon. This
|
|---|
| 194 | is discussed in detail in the section
|
|---|
| 195 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>.
|
|---|
| 196 |
|
|---|
| 197 | =head1 Pod Commands
|
|---|
| 198 |
|
|---|
| 199 | This section is intended to supplement and clarify the discussion in
|
|---|
| 200 | L<perlpod/"Command Paragraph">. These are the currently recognized
|
|---|
| 201 | Pod commands:
|
|---|
| 202 |
|
|---|
| 203 | =over
|
|---|
| 204 |
|
|---|
| 205 | =item "=head1", "=head2", "=head3", "=head4"
|
|---|
| 206 |
|
|---|
| 207 | This command indicates that the text in the remainder of the paragraph
|
|---|
| 208 | is a heading. That text may contain formatting codes. Examples:
|
|---|
| 209 |
|
|---|
| 210 | =head1 Object Attributes
|
|---|
| 211 |
|
|---|
| 212 | =head3 What B<Not> to Do!
|
|---|
| 213 |
|
|---|
| 214 | =item "=pod"
|
|---|
| 215 |
|
|---|
| 216 | This command indicates that this paragraph begins a Pod block. (If we
|
|---|
| 217 | are already in the middle of a Pod block, this command has no effect at
|
|---|
| 218 | all.) If there is any text in this command paragraph after "=pod",
|
|---|
| 219 | it must be ignored. Examples:
|
|---|
| 220 |
|
|---|
| 221 | =pod
|
|---|
| 222 |
|
|---|
| 223 | This is a plain Pod paragraph.
|
|---|
| 224 |
|
|---|
| 225 | =pod This text is ignored.
|
|---|
| 226 |
|
|---|
| 227 | =item "=cut"
|
|---|
| 228 |
|
|---|
| 229 | This command indicates that this line is the end of this previously
|
|---|
| 230 | started Pod block. If there is any text after "=cut" on the line, it must be
|
|---|
| 231 | ignored. Examples:
|
|---|
| 232 |
|
|---|
| 233 | =cut
|
|---|
| 234 |
|
|---|
| 235 | =cut The documentation ends here.
|
|---|
| 236 |
|
|---|
| 237 | =cut
|
|---|
| 238 | # This is the first line of program text.
|
|---|
| 239 | sub foo { # This is the second.
|
|---|
| 240 |
|
|---|
| 241 | It is an error to try to I<start> a Pod block with a "=cut" command. In
|
|---|
| 242 | that case, the Pod processor must halt parsing of the input file, and
|
|---|
| 243 | must by default emit a warning.
|
|---|
| 244 |
|
|---|
| 245 | =item "=over"
|
|---|
| 246 |
|
|---|
| 247 | This command indicates that this is the start of a list/indent
|
|---|
| 248 | region. If there is any text following the "=over", it must consist
|
|---|
| 249 | of only a nonzero positive numeral. The semantics of this numeral are
|
|---|
| 250 | explained in the L</"About =over...=back Regions"> section, further
|
|---|
| 251 | below. Formatting codes are not expanded. Examples:
|
|---|
| 252 |
|
|---|
| 253 | =over 3
|
|---|
| 254 |
|
|---|
| 255 | =over 3.5
|
|---|
| 256 |
|
|---|
| 257 | =over
|
|---|
| 258 |
|
|---|
| 259 | =item "=item"
|
|---|
| 260 |
|
|---|
| 261 | This command indicates that an item in a list begins here. Formatting
|
|---|
| 262 | codes are processed. The semantics of the (optional) text in the
|
|---|
| 263 | remainder of this paragraph are
|
|---|
| 264 | explained in the L</"About =over...=back Regions"> section, further
|
|---|
| 265 | below. Examples:
|
|---|
| 266 |
|
|---|
| 267 | =item
|
|---|
| 268 |
|
|---|
| 269 | =item *
|
|---|
| 270 |
|
|---|
| 271 | =item *
|
|---|
| 272 |
|
|---|
| 273 | =item 14
|
|---|
| 274 |
|
|---|
| 275 | =item 3.
|
|---|
| 276 |
|
|---|
| 277 | =item C<< $thing->stuff(I<dodad>) >>
|
|---|
| 278 |
|
|---|
| 279 | =item For transporting us beyond seas to be tried for pretended
|
|---|
| 280 | offenses
|
|---|
| 281 |
|
|---|
| 282 | =item He is at this time transporting large armies of foreign
|
|---|
| 283 | mercenaries to complete the works of death, desolation and
|
|---|
| 284 | tyranny, already begun with circumstances of cruelty and perfidy
|
|---|
| 285 | scarcely paralleled in the most barbarous ages, and totally
|
|---|
| 286 | unworthy the head of a civilized nation.
|
|---|
| 287 |
|
|---|
| 288 | =item "=back"
|
|---|
| 289 |
|
|---|
| 290 | This command indicates that this is the end of the region begun
|
|---|
| 291 | by the most recent "=over" command. It permits no text after the
|
|---|
| 292 | "=back" command.
|
|---|
| 293 |
|
|---|
| 294 | =item "=begin formatname"
|
|---|
| 295 |
|
|---|
| 296 | This marks the following paragraphs (until the matching "=end
|
|---|
| 297 | formatname") as being for some special kind of processing. Unless
|
|---|
| 298 | "formatname" begins with a colon, the contained non-command
|
|---|
| 299 | paragraphs are data paragraphs. But if "formatname" I<does> begin
|
|---|
| 300 | with a colon, then non-command paragraphs are ordinary paragraphs
|
|---|
| 301 | or verbatim paragraphs. This is discussed in detail in the section
|
|---|
| 302 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>.
|
|---|
| 303 |
|
|---|
| 304 | It is advised that formatnames match the regexp
|
|---|
| 305 | C<m/\A:?[-a-zA-Z0-9_]+\z/>. Implementors should anticipate future
|
|---|
| 306 | expansion in the semantics and syntax of the first parameter
|
|---|
| 307 | to "=begin"/"=end"/"=for".
|
|---|
| 308 |
|
|---|
| 309 | =item "=end formatname"
|
|---|
| 310 |
|
|---|
| 311 | This marks the end of the region opened by the matching
|
|---|
| 312 | "=begin formatname" command. If "formatname" is not the formatname
|
|---|
| 313 | of the most recent open "=begin formatname" region, then this
|
|---|
| 314 | is an error, and must generate an error message. This
|
|---|
| 315 | is discussed in detail in the section
|
|---|
| 316 | L</About Data Paragraphs and "=beginE<sol>=end" Regions>.
|
|---|
| 317 |
|
|---|
| 318 | =item "=for formatname text..."
|
|---|
| 319 |
|
|---|
| 320 | This is synonymous with:
|
|---|
| 321 |
|
|---|
| 322 | =begin formatname
|
|---|
| 323 |
|
|---|
| 324 | text...
|
|---|
| 325 |
|
|---|
| 326 | =end formatname
|
|---|
| 327 |
|
|---|
| 328 | That is, it creates a region consisting of a single paragraph; that
|
|---|
| 329 | paragraph is to be treated as a normal paragraph if "formatname"
|
|---|
| 330 | begins with a ":"; if "formatname" I<doesn't> begin with a colon,
|
|---|
| 331 | then "text..." will constitute a data paragraph. There is no way
|
|---|
| 332 | to use "=for formatname text..." to express "text..." as a verbatim
|
|---|
| 333 | paragraph.
|
|---|
| 334 |
|
|---|
| 335 | =item "=encoding encodingname"
|
|---|
| 336 |
|
|---|
| 337 | This command, which should occur early in the document (at least
|
|---|
| 338 | before any non-US-ASCII data!), declares that this document is
|
|---|
| 339 | encoded in the encoding I<encodingname>, which must be
|
|---|
| 340 | an encoding name that L<Encode> recognizes. (Encode's list
|
|---|
| 341 | of supported encodings, in L<Encode::Supported>, is useful here.)
|
|---|
| 342 | If the Pod parser cannot decode the declared encoding, it
|
|---|
| 343 | should emit a warning and may abort parsing the document
|
|---|
| 344 | altogether.
|
|---|
| 345 |
|
|---|
| 346 | A document having more than one "=encoding" line should be
|
|---|
| 347 | considered an error. Pod processors may silently tolerate this if
|
|---|
| 348 | the not-first "=encoding" lines are just duplicates of the
|
|---|
| 349 | first one (e.g., if there's a "=encoding utf8" line, and later on
|
|---|
| 350 | another "=encoding utf8" line). But Pod processors should complain if
|
|---|
| 351 | there are contradictory "=encoding" lines in the same document
|
|---|
| 352 | (e.g., if there is a "=encoding utf8" early in the document and
|
|---|
| 353 | "=encoding big5" later). Pod processors that recognize BOMs
|
|---|
| 354 | may also complain if they see an "=encoding" line
|
|---|
| 355 | that contradicts the BOM (e.g., if a document with a UTF-16LE
|
|---|
| 356 | BOM has an "=encoding shiftjis" line).
|
|---|
| 357 |
|
|---|
| 358 | =back
|
|---|
| 359 |
|
|---|
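The rule for repeated "=encoding" lines can be sketched as a small check in Python. The function name "contradictory_encodings" is hypothetical. The comparison is a plain case-insensitive string match, which is an assumption of this sketch: a real processor would more likely resolve aliases (so that, say, two spellings of the same encoding count as duplicates).

```python
def contradictory_encodings(names):
    """Given the names from a document's "=encoding" lines, in order,
    return the later names that contradict the first one.

    Exact duplicates of the first name are tolerated silently;
    an empty result means there is nothing to complain about.
    """
    if not names:
        return []
    first = names[0].lower()
    return [name for name in names[1:] if name.lower() != first]
```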
| 360 | If a Pod processor sees any command other than the ones listed
|
|---|
| 361 | above (like "=head", or "=haed1", or "=stuff", or "=cuttlefish",
|
|---|
| 362 | or "=w123"), that processor must by default treat this as an
|
|---|
| 363 | error. It must not process the paragraph beginning with that
|
|---|
| 364 | command, must by default warn of this as an error, and may
|
|---|
| 365 | abort the parse. A Pod parser may allow a way for particular
|
|---|
| 366 | applications to add to the above list of known commands, and to
|
|---|
| 367 | stipulate, for each additional command, whether formatting
|
|---|
| 368 | codes should be processed.
|
|---|
| 369 |
|
|---|
| 370 | Future versions of this specification may add additional
|
|---|
| 371 | commands.
|
|---|
| 372 |
|
|---|
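As an illustration of the unknown-command rule, here is a Python sketch that pulls the command name out of a command paragraph's first line and checks it against the commands listed above. The names "KNOWN_COMMANDS", "command_name", and "is_known_command" are invented for this example, and the "extra" parameter stands in for whatever mechanism a parser might offer for registering additional commands.

```python
import re

# The commands listed in this section.
KNOWN_COMMANDS = {
    "head1", "head2", "head3", "head4",
    "pod", "cut", "over", "item", "back",
    "begin", "end", "for", "encoding",
}

def command_name(first_line):
    """Return the command name (without "="), or None if not a command."""
    match = re.match(r"\A=([a-zA-Z]\S*)", first_line)
    return match.group(1) if match else None

def is_known_command(first_line, extra=()):
    """True if the line starts with a listed command or one named in extra."""
    name = command_name(first_line)
    return name is not None and (name in KNOWN_COMMANDS or name in extra)
```

A processor built this way would warn and skip the paragraph whenever "is_known_command" returns false for a command paragraph.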
| 373 |
|
|---|
| 374 |
|
|---|
| 375 | =head1 Pod Formatting Codes
|
|---|
| 376 |
|
|---|
| 377 | (Note that in previous drafts of this document and of perlpod,
|
|---|
| 378 | formatting codes were referred to as "interior sequences", and
|
|---|
| 379 | this term may still be found in the documentation for Pod parsers,
|
|---|
| 380 | and in error messages from Pod processors.)
|
|---|
| 381 |
|
|---|
| 382 | There are two syntaxes for formatting codes:
|
|---|
| 383 |
|
|---|
| 384 | =over
|
|---|
| 385 |
|
|---|
| 386 | =item *
|
|---|
| 387 |
|
|---|
| 388 | A formatting code starts with a capital letter (just US-ASCII [A-Z])
|
|---|
| 389 | followed by a "<", any number of characters, and ending with the first
|
|---|
| 390 | matching ">". Examples:
|
|---|
| 391 |
|
|---|
| 392 | That's what I<you> think!
|
|---|
| 393 |
|
|---|
| 394 | What's C<dump()> for?
|
|---|
| 395 |
|
|---|
| 396 | X<C<chmod> and C<unlink()> Under Different Operating Systems>
|
|---|
| 397 |
|
|---|
| 398 | =item *
|
|---|
| 399 |
|
|---|
| 400 | A formatting code starts with a capital letter (just US-ASCII [A-Z])
|
|---|
| 401 | followed by two or more "<"'s, one or more whitespace characters,
|
|---|
| 402 | any number of characters, one or more whitespace characters,
|
|---|
| 403 | and ending with the first matching sequence of two or more ">"'s, where
|
|---|
| 404 | the number of ">"'s equals the number of "<"'s in the opening of this
|
|---|
| 405 | formatting code. Examples:
|
|---|
| 406 |
|
|---|
| 407 | That's what I<< you >> think!
|
|---|
| 408 |
|
|---|
| 409 | C<<< open(X, ">>thing.dat") || die $! >>>
|
|---|
| 410 |
|
|---|
| 411 | B<< $foo->bar(); >>
|
|---|
| 412 |
|
|---|
| 413 | With this syntax, the whitespace character(s) after the "CE<lt><<"
|
|---|
| 414 | and before the ">>>" (or whatever letter) are I<not> renderable -- they
|
|---|
| 415 | do not signify whitespace, but are merely part of the formatting codes
|
|---|
| 416 | themselves. That is, these are all synonymous:
|
|---|
| 417 |
|
|---|
| 418 | C<thing>
|
|---|
| 419 | C<< thing >>
|
|---|
| 420 | C<< thing >>
|
|---|
| 421 | C<<< thing >>>
|
|---|
| 422 | C<<<<
|
|---|
| 423 | thing
|
|---|
| 424 | >>>>
|
|---|
| 425 |
|
|---|
| 426 | and so on.
|
|---|
| 427 |
|
|---|
| 428 | =back
|
|---|
| 429 |
|
|---|
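To make the two delimiter syntaxes concrete, here is a deliberately limited Python sketch that recognizes a single formatting code at a given position and returns its letter, its content, and the index just past its closing delimiter. The name "parse_code" is hypothetical. It is not a full parser: in the single-bracket form it counts nested single-bracket codes only for bracket matching, and it does not recurse into the content.

```python
import re

OPEN = re.compile(r"([A-Z])(<+)")

def parse_code(text, start=0):
    """Parse one formatting code at text[start].

    Returns (letter, content, end_index), or None if there is no
    code at that position or it is never closed.
    """
    match = OPEN.match(text, start)
    if match is None:
        return None
    letter, angles = match.group(1), len(match.group(2))
    pos = match.end()
    # Second syntax: two or more "<", then whitespace, then content,
    # then whitespace and the same number of ">".
    if angles >= 2 and pos < len(text) and text[pos].isspace():
        close = re.compile(r"\s+" + ">" * angles).search(text, pos)
        if close is None:
            return None
        # The delimiting whitespace is not part of the content.
        return letter, text[pos:close.start()].strip(), close.end()
    # First syntax: a single "<", closed by the first matching ">".
    depth = 1
    i = start + 2
    while i < len(text):
        ch = text[i]
        if ch == "<" and "A" <= text[i - 1] <= "Z":
            depth += 1                      # a nested code opens
        elif ch == ">":
            depth -= 1
            if depth == 0:
                return letter, text[start + 2:i], i + 1
        i += 1
    return None
```

With this sketch, all of the "thing" variants listed above yield the same letter and content.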
| 430 | In parsing Pod, a notably tricky part is the correct parsing of
|
|---|
| 431 | (potentially nested!) formatting codes. Implementors should
|
|---|
| 432 | consult the code in the C<parse_text> routine in Pod::Parser as an
|
|---|
| 433 | example of a correct implementation.
|
|---|
| 434 |
|
|---|
| 435 | =over
|
|---|
| 436 |
|
|---|
| 437 | =item C<IE<lt>textE<gt>> -- italic text
|
|---|
| 438 |
|
|---|
| 439 | See the brief discussion in L<perlpod/"Formatting Codes">.
|
|---|
| 440 |
|
|---|
| 441 | =item C<BE<lt>textE<gt>> -- bold text
|
|---|
| 442 |
|
|---|
| 443 | See the brief discussion in L<perlpod/"Formatting Codes">.
|
|---|
| 444 |
|
|---|
| 445 | =item C<CE<lt>codeE<gt>> -- code text
|
|---|
| 446 |
|
|---|
| 447 | See the brief discussion in L<perlpod/"Formatting Codes">.
|
|---|
| 448 |
|
|---|
| 449 | =item C<FE<lt>filenameE<gt>> -- style for filenames
|
|---|
| 450 |
|
|---|
| 451 | See the brief discussion in L<perlpod/"Formatting Codes">.
|
|---|
| 452 |
|
|---|
| 453 | =item C<XE<lt>topic nameE<gt>> -- an index entry
|
|---|
| 454 |
|
|---|
| 455 | See the brief discussion in L<perlpod/"Formatting Codes">.
|
|---|
| 456 |
|
|---|
| 457 | This code is unusual in that most formatters completely discard
|
|---|
| 458 | this code and its content. Other formatters will render it with
|
|---|
| 459 | invisible codes that can be used in building an index of
|
|---|
| 460 | the current document.
|
|---|
| 461 |
|
|---|
| 462 | =item C<ZE<lt>E<gt>> -- a null (zero-effect) formatting code
|
|---|
| 463 |
|
|---|
| 464 | Discussed briefly in L<perlpod/"Formatting Codes">.
|
|---|
| 465 |
|
|---|
| 466 | This code is unusual in that it should have no content. That is,
|
|---|
| 467 | a processor may complain if it sees C<ZE<lt>potatoesE<gt>>. Whether
|
|---|
| 468 | or not it complains, the I<potatoes> text should be ignored.
|
|---|
| 469 |
|
|---|
| 470 | =item C<LE<lt>nameE<gt>> -- a hyperlink
|
|---|
| 471 |
|
|---|
| 472 | The complicated syntaxes of this code are discussed at length in
|
|---|
| 473 | L<perlpod/"Formatting Codes">, and implementation details are
|
|---|
| 474 | discussed below, in L</"About LE<lt>...E<gt> Codes">. Parsing the
|
|---|
| 475 | contents of LE<lt>content> is tricky. Notably, the content has to be
|
|---|
| 476 | checked for whether it looks like a URL, or whether it has to be split
|
|---|
| 477 | on literal "|" and/or "/" (in the right order!), and so on,
|
|---|
| 478 | I<before> EE<lt>...> codes are resolved.
|
|---|
| 479 |
|
|---|
| 480 | =item C<EE<lt>escapeE<gt>> -- a character escape
|
|---|
| 481 |
|
|---|
| 482 | See L<perlpod/"Formatting Codes">, and several points in
|
|---|
| 483 | L</Notes on Implementing Pod Processors>.
|
|---|
| 484 |
|
|---|
| 485 | =item C<SE<lt>textE<gt>> -- text contains non-breaking spaces
|
|---|
| 486 |
|
|---|
| 487 | This formatting code is syntactically simple, but semantically
|
|---|
| 488 | complex. What it means is that each space in the printable
|
|---|
| 489 | content of this code signifies a non-breaking space.
|
|---|
| 490 |
|
|---|
| 491 | Consider:
|
|---|
| 492 |
|
|---|
| 493 | C<$x ? $y : $z>
|
|---|
| 494 |
|
|---|
| 495 | S<C<$x ? $y : $z>>
|
|---|
| 496 |
|
|---|
| 497 | Both signify the monospace (c[ode] style) text consisting of
|
|---|
| 498 | "$x", one space, "?", one space, "$y", one space, ":", one space, "$z". The
|
|---|
| 499 | difference is that in the latter, with the S code, those spaces
|
|---|
| 500 | are not "normal" spaces, but instead are non-breaking spaces.
|
|---|
| 501 |
|
|---|
| 502 | =back
|
|---|
| 503 |
|
|---|
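The effect of the S code on printable content can be sketched in Python as a substitution of U+00A0 (NO-BREAK SPACE) for each space. The function name "apply_s_code" is made up for this example, and it assumes the content has already been reduced to its printable text; how a given output format spells a non-breaking space is up to the formatter.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def apply_s_code(printable_text):
    """Turn each space in the printable content into a non-breaking space."""
    return printable_text.replace(" ", NBSP)
```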
| 504 |
|
|---|
| 505 | If a Pod processor sees any formatting code other than the ones
|
|---|
| 506 | listed above (as in "NE<lt>...>", or "QE<lt>...>", etc.), that
|
|---|
| 507 | processor must by default treat this as an error.
|
|---|
| 508 | A Pod parser may allow a way for particular
|
|---|
| 509 | applications to add to the above list of known formatting codes;
|
|---|
| 510 | a Pod parser might even allow a way to stipulate, for each additional
|
|---|
| 511 | command, whether it requires some form of special processing, as
|
|---|
| 512 | LE<lt>...> does.
|
|---|
| 513 |
|
|---|
| 514 | Future versions of this specification may add additional
|
|---|
| 515 | formatting codes.
|
|---|
| 516 |
|
|---|
| 517 | Historical note: A few older Pod processors would not see a ">" as
|
|---|
| 518 | closing a "CE<lt>" code, if the ">" was immediately preceded by
|
|---|
| 519 | a "-". This was so that this:
|
|---|
| 520 |
|
|---|
| 521 | C<$foo->bar>
|
|---|
| 522 |
|
|---|
| 523 | would parse as equivalent to this:
|
|---|
| 524 |
|
|---|
| 525 | C<$foo-E<gt>bar>
|
|---|
| 526 |
|
|---|
| 527 | instead of as equivalent to a "C" formatting code containing
|
|---|
| 528 | only "$foo-", and then a "bar>" outside the "C" formatting code. This
|
|---|
| 529 | problem has since been solved by the addition of syntaxes like this:
|
|---|
| 530 |
|
|---|
| 531 | C<< $foo->bar >>
|
|---|
| 532 |
|
|---|
| 533 | Compliant parsers must not treat "->" as special.
|
|---|
| 534 |
|
|---|
| 535 | Formatting codes absolutely cannot span paragraphs. If a code is
|
|---|
| 536 | opened in one paragraph, and no closing code is found by the end of
|
|---|
| 537 | that paragraph, the Pod parser must close that formatting code,
|
|---|
| 538 | and should complain (as in "Unterminated I code in the paragraph
|
|---|
| 539 | starting at line 123: 'Time objects are not...'"). So these
|
|---|
| 540 | two paragraphs:
|
|---|
| 541 |
|
|---|
| 542 | I<I told you not to do this!
|
|---|
| 543 |
|
|---|
| 544 | Don't make me say it again!>
|
|---|
| 545 |
|
|---|
| 546 | ...must I<not> be parsed as two paragraphs in italics (with the I
|
|---|
| 547 | code starting in one paragraph and ending in another.) Instead,
|
|---|
| 548 | the first paragraph should generate a warning, but that aside, the
|
|---|
| 549 | above code must parse as if it were:
|
|---|
| 550 |
|
|---|
| 551 | I<I told you not to do this!>
|
|---|
| 552 |
|
|---|
| 553 | Don't make me say it again!E<gt>
|
|---|
| 554 |
|
|---|
| 555 | (In SGMLish jargon, all Pod commands are like block-level
|
|---|
| 556 | elements, whereas all Pod formatting codes are like inline-level
|
|---|
| 557 | elements.)
|
|---|
| 558 |
|
|---|
| 559 |
|
|---|
| 560 |
|
|---|
| 561 | =head1 Notes on Implementing Pod Processors
|
|---|
| 562 |
|
|---|
| 563 | The following is a long section of miscellaneous requirements
|
|---|
| 564 | and suggestions to do with Pod processing.
|
|---|
| 565 |
|
|---|
| 566 | =over
|
|---|
| 567 |
|
|---|
| 568 | =item *
|
|---|
| 569 |
|
|---|
| 570 | Pod formatters should tolerate lines in verbatim blocks that are of
|
|---|
| 571 | any length, even if that means having to break them (possibly several
|
|---|
| 572 | times, for very long lines) to avoid text running off the side of the
|
|---|
| 573 | page. Pod formatters may warn of such line-breaking. Such warnings
|
|---|
| 574 | are particularly appropriate for lines that are over 100 characters long, which
|
|---|
| 575 | are usually not intentional.
|
|---|
| 576 |
|
|---|
| 577 | =item *
|
|---|
| 578 |
|
|---|
| 579 | Pod parsers must recognize I<all> of the three well-known newline
|
|---|
| 580 | formats: CR, LF, and CRLF. See L<perlport|perlport>.
|
|---|
| 581 |
|
|---|
| 582 | =item *
|
|---|
| 583 |
|
|---|
| 584 | Pod parsers should accept input lines that are of any length.
|
|---|
| 585 |
|
|---|
| 586 | =item *
|
|---|
| 587 |
|
|---|
| 588 | Since Perl recognizes a Unicode Byte Order Mark at the start of files
|
|---|
| 589 | as signaling that the file is Unicode encoded as in UTF-16 (whether
|
|---|
| 590 | big-endian or little-endian) or UTF-8, Pod parsers should do the
|
|---|
| 591 | same. Otherwise, the character encoding should be understood as
|
|---|
| 592 | being UTF-8 if the first highbit byte sequence in the file seems
|
|---|
| 593 | valid as a UTF-8 sequence, or otherwise as Latin-1.
|
|---|
| 594 |
|
|---|
| 595 | Future versions of this specification may specify
|
|---|
| 596 | how Pod can accept other encodings. Presumably treatment of other
|
|---|
| 597 | encodings in Pod parsing would be as in XML parsing: whatever the
|
|---|
| 598 | encoding declared by a particular Pod file, content is to be
|
|---|
| 599 | stored in memory as Unicode characters.
|
|---|
| 600 |
|
|---|
| 601 | =item *
|
|---|
| 602 |
|
|---|
| 603 | The well-known Unicode Byte Order Marks are as follows: if the
|
|---|
| 604 | file begins with the two literal byte values 0xFE 0xFF, this is
|
|---|
| 605 | the BOM for big-endian UTF-16. If the file begins with the two
|
|---|
| 606 | literal byte values 0xFF 0xFE, this is the BOM for little-endian
|
|---|
| 607 | UTF-16. If the file begins with the three literal byte values
|
|---|
| 608 | 0xEF 0xBB 0xBF, this is the BOM for UTF-8.
|
|---|
| 609 |
|
|---|
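The three Byte Order Marks listed above translate directly into a prefix check on the raw bytes of the file. This Python sketch is illustrative only; the name "detect_bom" and its return convention (encoding name plus BOM length, or None and 0 when no BOM is present) are assumptions of the example.

```python
# (BOM bytes, encoding name) pairs, as listed in the text above.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),   # big-endian UTF-16
    (b"\xff\xfe", "UTF-16LE"),   # little-endian UTF-16
]

def detect_bom(data):
    """Return (encoding_name, bom_length) for the BOM at the start of
    data, or (None, 0) if data does not begin with a known BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would skip the reported number of bytes before decoding the rest of the file.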
| 610 | =for comment
|
|---|
| 611 | use bytes; print map sprintf(" 0x%02X", ord $_), split '', "\x{feff}";
|
|---|
| 612 | 0xEF 0xBB 0xBF
|
|---|
| 613 |
|
|---|
| 614 | =for comment
|
|---|
| 615 | If toke.c is modified to support UTF-32, add mention of those here.
|
|---|
| 616 |
|
|---|
| 617 | =item *
|
|---|
| 618 |
|
|---|
| 619 | A naive but sufficient heuristic for testing the first highbit
|
|---|
| 620 | byte-sequence in a BOM-less file (whether in code or in Pod!), to see
|
|---|
| 621 | whether that sequence is valid as UTF-8 (RFC 2279) is to check whether
|
|---|
| 622 | the first byte in the sequence is in the range 0xC2 - 0xFD
|
|---|
| 623 | I<and> whether the next byte is in the range
|
|---|
| 624 | 0x80 - 0xBF. If so, the parser may conclude that this file is in
|
|---|
| 625 | UTF-8, and all highbit sequences in the file should be assumed to
|
|---|
| 626 | be UTF-8. Otherwise the parser should treat the file as being
|
|---|
| 627 | in Latin-1. In the unlikely circumstance that the first highbit
|
|---|
| 628 | sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one
|
|---|
| 629 | can cater to our heuristic (as well as any more intelligent heuristic)
|
|---|
| 630 | by prefacing that line with a comment line containing a highbit
|
|---|
| 631 | sequence that is clearly I<not> valid as UTF-8. A line consisting
|
|---|
| 632 | of simply "#", an e-acute, and any non-highbit byte,
|
|---|
| 633 | is sufficient to establish this file's encoding.
|
|---|
| 634 |
|
|---|
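The heuristic described in this item can be sketched in Python as follows. The name "guess_encoding" is hypothetical, and returning None for a file with no highbit bytes at all is a choice made for this example. The sketch takes 0xC2 as the lowest acceptable lead byte, because 0xC0 and 0xC1 can only begin overlong two-byte forms, which are never valid UTF-8.

```python
def guess_encoding(data):
    """Guess the encoding of BOM-less bytes from the first highbit sequence.

    Returns "UTF-8", "Latin-1", or None when there are no highbit bytes.
    """
    for i, byte in enumerate(data):
        if byte >= 0x80:                      # first highbit byte found
            following = data[i + 1] if i + 1 < len(data) else None
            lead_ok = 0xC2 <= byte <= 0xFD
            cont_ok = following is not None and 0x80 <= following <= 0xBF
            return "UTF-8" if lead_ok and cont_ok else "Latin-1"
    return None
```

Note how the workaround described above behaves: a comment line made of "#", a Latin-1 e-acute (0xE9), and a non-highbit byte makes the first highbit sequence fail the continuation-byte test, so the file is taken as Latin-1.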
| 635 | =for comment
|
|---|
| 636 | If/WHEN some brave soul makes these heuristics into a generic
|
|---|
| 637 | text-file class (or PerlIO layer?), we can presumably delete
|
|---|
| 638 | mention of these icky details from this file, and can instead
|
|---|
| 639 | tell people to just use the appropriate class/layer.
|
|---|
| 640 | Auto-recognition of newline sequences would be another desirable
|
|---|
| 641 | feature of such a class/layer.
|
|---|
| 642 | HINT HINT HINT.
|
|---|
| 643 |
|
|---|
| 644 | =for comment
|
|---|
| 645 | "The probability that a string of characters
|
|---|
| 646 | in any other encoding appears as valid UTF-8 is low" - RFC2279
|
|---|
| 647 |
|
|---|
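The naive heuristic described in this item might look like the sketch below. The name C<guess_highbit_encoding> is made up for illustration, and the input is assumed to be the raw bytes of a BOM-less file. A file with no highbit bytes at all is reported as Latin-1 here, which is harmless since it is plain US-ASCII either way.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the naive heuristic described above.
# Takes the raw bytes of a BOM-less file; returns 'UTF-8' or 'Latin-1'.
sub guess_highbit_encoding {
    my ($bytes) = @_;

    # Find the first byte with the high bit set, plus the byte after it.
    if ($bytes =~ /([\x80-\xFF])(.?)/s) {
        my ($first, $next) = (ord($1), $2);
        return 'UTF-8'
            if $first >= 0xC0 && $first <= 0xFD
            && length($next)
            && ord($next) >= 0x80 && ord($next) <= 0xBF;
    }

    # No highbit bytes at all, or a first sequence that is not UTF-8.
    return 'Latin-1';
}
```

Note how a leading comment line of "#", an e-acute byte, and a space makes the first highbit sequence fail the second test, so the file is treated as Latin-1 regardless of what follows.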
| 648 | =item *
|
|---|
| 649 |
|
|---|
| 650 | This document's requirements and suggestions about encodings
|
|---|
| 651 | do not apply to Pod processors running on non-ASCII platforms,
|
|---|
| 652 | notably EBCDIC platforms.
|
|---|
| 653 |
|
|---|
| 654 | =item *
|
|---|
| 655 |
|
|---|
| 656 | Pod processors must treat a "=for [label] [content...]" paragraph as
|
|---|
| 657 | meaning the same thing as a "=begin [label]" paragraph, content, and
|
|---|
| 658 | an "=end [label]" paragraph. (The parser may conflate these two
|
|---|
| 659 | constructs, or may leave them distinct, in the expectation that the
|
|---|
| 660 | formatter will nevertheless treat them the same.)
|
|---|
| 661 |
|
|---|
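A parser that chooses to conflate the two constructs could do so with a rewrite along these lines. This is a sketch under assumptions: C<expand_for_paragraph> is a hypothetical name, and the paragraph is assumed to arrive as a single string with its trailing newlines already removed.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite one "=for" paragraph as the equivalent
# "=begin" / content / "=end" paragraphs.  Any other paragraph is
# returned unchanged.
sub expand_for_paragraph {
    my ($para) = @_;
    return ($para)
        unless $para =~ /\A=for[ \t]+(\S+)[ \t\n]*(.*)\z/s;
    my ($label, $content) = ($1, $2);
    return (
        "=begin $label",
        (length($content) ? ($content) : ()),   # content may be empty
        "=end $label",
    );
}
```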
| 662 | =item *
|
|---|
| 663 |
|
|---|
| 664 | When rendering Pod to a format that allows comments (i.e., to nearly
|
|---|
| 665 | any format other than plaintext), a Pod formatter must insert comment
|
|---|
| 666 | text identifying its name and version number, and the name and
|
|---|
| 667 | version numbers of any modules it might be using to process the Pod.
|
|---|
| 668 | Minimal examples:
|
|---|
| 669 |
|
|---|
| 670 | %% POD::Pod2PS v3.14159, using POD::Parser v1.92
|
|---|
| 671 |
|
|---|
| 672 | <!-- Pod::HTML v3.14159, using POD::Parser v1.92 -->
|
|---|
| 673 |
|
|---|
| 674 | {\doccomm generated by Pod::Tree::RTF 3.14159 using Pod::Tree 1.08}
|
|---|
| 675 |
|
|---|
| 676 | .\" Pod::Man version 3.14159, using POD::Parser version 1.92
|
|---|
| 677 |
|
|---|
| 678 | Formatters may also insert additional comments, including: the
|
|---|
| 679 | release date of the Pod formatter program, the contact address for
|
|---|
| 680 | the author(s) of the formatter, the current time, the name of the input
|
|---|
| 681 | file, the formatting options in effect, the version of Perl used, etc.
|
|---|
| 682 |
|
|---|
| 683 | Formatters may also choose to note errors/warnings as comments,
|
|---|
| 684 | besides or instead of emitting them otherwise (as in messages to
|
|---|
| 685 | STDERR, or C<die>ing).
|
|---|
| 686 |
|
|---|
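As a rough illustration of the identifying comment, a formatter might keep one comment template per output format. Everything named here (C<%COMMENT_TEMPLATE>, C<generator_comment>, and the format keys) is invented for the example.

```perl
use strict;
use warnings;

# Comment syntax per output format (assumed keys, for illustration only).
my %COMMENT_TEMPLATE = (
    html => '<!-- %s -->',
    ps   => '%%%% %s',
    roff => '.\" %s',
);

# Hypothetical helper: build the identifying comment for one format.
sub generator_comment {
    my ($format, $formatter, $fversion, $parser, $pversion) = @_;
    my $template = $COMMENT_TEMPLATE{$format}
        or return '';    # e.g. plaintext: no comment syntax available
    return sprintf $template,
        "$formatter v$fversion, using $parser v$pversion";
}
```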
| 687 | =item *
|
|---|
| 688 |
|
|---|
| 689 | Pod parsers I<may> emit warnings or error messages ("Unknown E code
|
|---|
| 690 | EE<lt>zslig>!") to STDERR (whether through printing to STDERR, or
|
|---|
| 691 | C<warn>ing/C<carp>ing, or C<die>ing/C<croak>ing), but I<must> allow
|
|---|
| 692 | suppressing all such STDERR output, and instead allow an option for
|
|---|
| 693 | reporting errors/warnings
|
|---|
| 694 | in some other way, whether by triggering a callback, or noting errors
|
|---|
| 695 | in some attribute of the document object, or some similarly unobtrusive
|
|---|
| 696 | mechanism -- or even by appending a "Pod Errors" section to the end of
|
|---|
| 697 | the parsed form of the document.
|
|---|
| 698 |
|
|---|
| 699 | =item *
|
|---|
| 700 |
|
|---|
| 701 | In cases of exceptionally aberrant documents, Pod parsers may abort the
|
|---|
| 702 | parse. Even then, using C<die>ing/C<croak>ing is to be avoided; where
|
|---|
| 703 | possible, the parser library may simply close the input file
|
|---|
| 704 | and add text like "*** Formatting Aborted ***" to the end of the
|
|---|
| 705 | (partial) in-memory document.
|
|---|
| 706 |
|
|---|
| 707 | =item *
|
|---|
| 708 |
|
|---|
| 709 | In paragraphs where formatting codes (like EE<lt>...>, BE<lt>...>)
|
|---|
| 710 | are understood (i.e., I<not> verbatim paragraphs, but I<including>
|
|---|
| 711 | ordinary paragraphs, and command paragraphs that produce renderable
|
|---|
| 712 | text, like "=head1"), literal whitespace should generally be considered
|
|---|
| 713 | "insignificant", in that one literal space has the same meaning as any
|
|---|
| 714 | (nonzero) number of literal spaces, literal newlines, and literal tabs
|
|---|
| 715 | (as long as this produces no blank lines, since those would terminate
|
|---|
| 716 | the paragraph). Pod parsers should compact literal whitespace in each
|
|---|
| 717 | processed paragraph, but may provide an option for overriding this
|
|---|
| 718 | (since some processing tasks do not require it), or may follow
|
|---|
| 719 | additional special rules (for example, specially treating
|
|---|
| 720 | period-space-space or period-newline sequences).
|
|---|
| 721 |
|
|---|
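Compacting literal whitespace can be as small as one substitution. The sketch below uses a hypothetical C<compact_whitespace> function and spells out the character class on purpose: only literal spaces, tabs, and newline bytes are compacted, so a non-breaking space (character 160) is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of literal whitespace in an
# ordinary (non-verbatim) paragraph into a single space.
sub compact_whitespace {
    my ($text) = @_;
    $text =~ s/[ \t\n\r]+/ /g;   # literal spaces, tabs, newlines only
    return $text;
}
```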
| 722 | =item *
|
|---|
| 723 |
|
|---|
| 724 | Pod parsers should not, by default, try to coerce apostrophe (') and
|
|---|
| 725 | quote (") into smart quotes (little 9's, 66's, 99's, etc.), nor try to
|
|---|
| 726 | turn backtick (`) into anything else but a single backtick character
|
|---|
| 727 | (distinct from an openquote character!), nor "--" into anything but
|
|---|
| 728 | two minus signs. They I<must never> do any of those things to text
|
|---|
| 729 | in CE<lt>...> formatting codes, and never I<ever> to text in verbatim
|
|---|
| 730 | paragraphs.
|
|---|
| 731 |
|
|---|
| 732 | =item *
|
|---|
| 733 |
|
|---|
| 734 | When rendering Pod to a format that has two kinds of hyphens (-), one
|
|---|
| 735 | that's a non-breaking hyphen, and another that's a breakable hyphen
|
|---|
| 736 | (as in "object-oriented", which can be split across lines as
|
|---|
| 737 | "object-", newline, "oriented"), formatters are encouraged to
|
|---|
| 738 | generally translate "-" to non-breaking hyphen, but may apply
|
|---|
| 739 | heuristics to convert some of these to breaking hyphens.
|
|---|
| 740 |
|
|---|
| 741 | =item *
|
|---|
| 742 |
|
|---|
| 743 | Pod formatters should make reasonable efforts to keep words of Perl
|
|---|
| 744 | code from being broken across lines. For example, "Foo::Bar" in some
|
|---|
| 745 | formatting systems is seen as eligible for being broken across lines
|
|---|
| 746 | as "Foo::" newline "Bar" or even "Foo::-" newline "Bar". This should
|
|---|
| 747 | be avoided where possible, either by disabling all line-breaking in
|
|---|
| 748 | mid-word, or by wrapping particular words with internal punctuation
|
|---|
| 749 | in "don't break this across lines" codes (which in some formats may
|
|---|
| 750 | not be a single code, but might be a matter of inserting non-breaking
|
|---|
| 751 | zero-width spaces between every pair of characters in a word).
|
|---|
| 752 |
|
|---|
| 753 | =item *
|
|---|
| 754 |
|
|---|
| 755 | Pod parsers should, by default, expand tabs in verbatim paragraphs as
|
|---|
| 756 | they are processed, before passing them to the formatter or other
|
|---|
| 757 | processor. Parsers may also allow an option for overriding this.
|
|---|
| 758 |
|
|---|
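Tab expansion in a verbatim paragraph could be sketched like this. C<expand_tabs> is a hypothetical name, and the default of 8 columns per tab stop is an assumption of this example, since the text above does not fix a width.

```perl
use strict;
use warnings;

# Hypothetical helper: expand tabs to spaces, line by line.
sub expand_tabs {
    my ($text, $tabstop) = @_;
    $tabstop ||= 8;   # assumed default; not mandated by the text above
    my @lines;
    for my $line (split /\n/, $text, -1) {
        # Replace the first remaining tab with enough spaces to reach
        # the next tab stop, and repeat until no tabs are left.
        1 while $line =~
            s/\A([^\t]*)\t/$1 . ' ' x ($tabstop - length($1) % $tabstop)/e;
        push @lines, $line;
    }
    return join "\n", @lines;
}
```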
| 759 | =item *
|
|---|
| 760 |
|
|---|
| 761 | Pod parsers should, by default, remove newlines from the end of
|
|---|
| 762 | ordinary and verbatim paragraphs before passing them to the
|
|---|
| 763 | formatter. For example, while the paragraph you're reading now
|
|---|
| 764 | could be considered, in Pod source, to end with (and contain)
|
|---|
| 765 | the newline(s) that end it, it should be processed as ending with
|
|---|
| 766 | (and containing) the period character that ends this sentence.
|
|---|
| 767 |
|
|---|
| 768 | =item *
|
|---|
| 769 |
|
|---|
| 770 | Pod parsers, when reporting errors, should make some effort to report
|
|---|
| 771 | an approximate line number ("Nested EE<lt>>'s in Paragraph #52, near
|
|---|
| 772 | line 633 of Thing/Foo.pm!"), instead of merely noting the paragraph
|
|---|
| 773 | number ("Nested EE<lt>>'s in Paragraph #52 of Thing/Foo.pm!"). Where
|
|---|
| 774 | this is problematic, the paragraph number should at least be
|
|---|
| 775 | accompanied by an excerpt from the paragraph ("Nested EE<lt>>'s in
|
|---|
| 776 | Paragraph #52 of Thing/Foo.pm, which begins 'Read/write accessor for
|
|---|
| 777 | the CE<lt>interest rate> attribute...'").
|
|---|
| 778 |
|
|---|
| 779 | =item *
|
|---|
| 780 |
|
|---|
| 781 | Pod parsers, when processing a series of verbatim paragraphs one
|
|---|
| 782 | after another, should consider them to be one large verbatim
|
|---|
| 783 | paragraph that happens to contain blank lines. I.e., these two
|
|---|
| 784 | lines, which have a blank line between them:
|
|---|
| 785 |
|
|---|
| 786 | use Foo;
|
|---|
| 787 |
|
|---|
| 788 | print Foo->VERSION
|
|---|
| 789 |
|
|---|
| 790 | should be unified into one paragraph ("\tuse Foo;\n\n\tprint
|
|---|
| 791 | Foo->VERSION") before being passed to the formatter or other
|
|---|
| 792 | processor. Parsers may also allow an option for overriding this.
|
|---|
| 793 |
|
|---|
| 794 | While this might be too cumbersome to implement in event-based Pod
|
|---|
| 795 | parsers, it is straightforward for parsers that return parse trees.
|
|---|
| 796 |
|
|---|
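For a parser that returns a list or tree of paragraphs, the unification might be sketched as below. The representation (array references of C<[type, text]>) and the name C<merge_verbatim> are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: merge runs of adjacent verbatim paragraphs.
# Each paragraph is assumed to be an array ref: [ $type, $text ].
sub merge_verbatim {
    my @merged;
    for my $para (@_) {
        my ($type, $text) = @$para;
        if ($type eq 'verbatim' && @merged && $merged[-1][0] eq 'verbatim') {
            # Rejoin with the blank line that separated them in the source.
            $merged[-1][1] .= "\n\n" . $text;
        }
        else {
            push @merged, [ $type, $text ];
        }
    }
    return @merged;
}
```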
| 797 | =item *
|
|---|
| 798 |
|
|---|
| 799 | Pod formatters, where feasible, are advised to avoid splitting short
|
|---|
| 800 | verbatim paragraphs (under twelve lines, say) across pages.
|
|---|
| 801 |
|
|---|
| 802 | =item *
|
|---|
| 803 |
|
|---|
| 804 | Pod parsers must treat a line with only spaces and/or tabs on it as a
|
|---|
| 805 | "blank line" such as separates paragraphs. (Some older parsers
|
|---|
| 806 | recognized only two adjacent newlines as a "blank line" but would not
|
|---|
| 807 | recognize a newline, a space, and a newline, as a blank line. This
|
|---|
| 808 | is noncompliant behavior.)
|
|---|
| 809 |
|
|---|
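A compliant paragraph splitter has to treat whitespace-only lines as separators. The following is a minimal sketch, assuming newlines have already been normalized to "\n"; C<split_paragraphs> is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: split Pod source into paragraphs.  A separator
# is a newline followed by one or more lines that hold nothing but
# spaces and/or tabs.
sub split_paragraphs {
    my ($pod) = @_;
    return grep { length } split /\n(?:[ \t]*\n)+/, $pod;
}
```

Splitting on C</\n\n+/> instead would reproduce the noncompliant behavior of the older parsers mentioned above.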
| 810 | =item *
|
|---|
| 811 |
|
|---|
| 812 | Authors of Pod formatters/processors should make every effort to
|
|---|
| 813 | avoid writing their own Pod parser. There are already several in
|
|---|
| 814 | CPAN, with a wide range of interface styles -- and one of them,
|
|---|
| 815 | Pod::Parser, comes with modern versions of Perl.
|
|---|
| 816 |
|
|---|
| 817 | =item *
|
|---|
| 818 |
|
|---|
| 819 | Characters in Pod documents may be conveyed either as literals, or by
|
|---|
| 820 | number in EE<lt>n> codes, or by an equivalent mnemonic, as in
|
|---|
| 821 | EE<lt>eacute> which is exactly equivalent to EE<lt>233>.
|
|---|
| 822 |
|
|---|
| 823 | Characters in the range 32-126 refer to those well known US-ASCII
|
|---|
| 824 | characters (also defined there by Unicode, with the same meaning),
|
|---|
| 825 | which all Pod formatters must render faithfully. Characters
|
|---|
| 826 | in the ranges 0-31 and 127-159 should not be used (neither as
|
|---|
| 827 | literals, nor as EE<lt>number> codes), except for the
|
|---|
| 828 | literal byte-sequences for newline (13, 13 10, or 10), and tab (9).
|
|---|
| 829 |
|
|---|
| 830 | Characters in the range 160-255 refer to Latin-1 characters (also
|
|---|
| 831 | defined there by Unicode, with the same meaning). Characters above
|
|---|
| 832 | 255 should be understood to refer to Unicode characters.
|
|---|
| 833 |
|
|---|
| 834 | =item *
|
|---|
| 835 |
|
|---|
| 836 | Be warned
|
|---|
| 837 | that some formatters cannot reliably render characters outside 32-126;
|
|---|
| 838 | and many are able to handle 32-126 and 160-255, but nothing above
|
|---|
| 839 | 255.
|
|---|
| 840 |
|
|---|
| 841 | =item *
|
|---|
| 842 |
|
|---|
| 843 | Besides the well-known "EE<lt>lt>" and "EE<lt>gt>" codes for
|
|---|
| 844 | less-than and greater-than, Pod parsers must understand "EE<lt>sol>"
|
|---|
| 845 | for "/" (solidus, slash), and "EE<lt>verbar>" for "|" (vertical bar,
|
|---|
| 846 | pipe). Pod parsers should also understand "EE<lt>lchevron>" and
|
|---|
| 847 | "EE<lt>rchevron>" as legacy codes for characters 171 and 187, i.e.,
|
|---|
| 848 | "left-pointing double angle quotation mark" = "left pointing
|
|---|
| 849 | guillemet" and "right-pointing double angle quotation mark" = "right
|
|---|
| 850 | pointing guillemet". (These look like little "<<" and ">>", and they
|
|---|
| 851 | are now preferably expressed with the HTML/XHTML codes "EE<lt>laquo>"
|
|---|
| 852 | and "EE<lt>raquo>".)
|
|---|
| 853 |
|
|---|
| 854 | =item *
|
|---|
| 855 |
|
|---|
| 856 | Pod parsers should understand all "EE<lt>html>" codes as defined
|
|---|
| 857 | in the entity declarations in the most recent XHTML specification at
|
|---|
| 858 | C<www.W3.org>. Pod parsers must understand at least the entities
|
|---|
| 859 | that define characters in the range 160-255 (Latin-1). Pod parsers,
|
|---|
| 860 | when faced with some unknown "EE<lt>I<identifier>>" code,
|
|---|
| 861 | shouldn't simply replace it with nullstring (by default, at least),
|
|---|
| 862 | but may pass it through as a string consisting of the literal characters
|
|---|
| 863 | E, less-than, I<identifier>, greater-than. Or Pod parsers may offer the
|
|---|
| 864 | alternative option of processing such unknown
|
|---|
| 865 | "EE<lt>I<identifier>>" codes by firing an event especially
|
|---|
| 866 | for such codes, or by adding a special node-type to the in-memory
|
|---|
| 867 | document tree. Such "EE<lt>I<identifier>>" may have special meaning
|
|---|
| 868 | to some processors, or some processors may choose to add them to
|
|---|
| 869 | a special error report.
|
|---|
| 870 |
|
|---|
| 871 | =item *
|
|---|
| 872 |
|
|---|
| 873 | Pod parsers must also support the XHTML codes "EE<lt>quot>" for
|
|---|
| 874 | character 34 (doublequote, "), "EE<lt>amp>" for character 38
|
|---|
| 875 | (ampersand, &), and "EE<lt>apos>" for character 39 (apostrophe, ').
|
|---|
| 876 |
|
|---|
| 877 | =item *
|
|---|
| 878 |
|
|---|
| 879 | Note that in all cases of "EE<lt>whatever>", I<whatever> (whether
|
|---|
| 880 | an htmlname, or a number in any base) must consist only of
|
|---|
| 881 | alphanumeric characters -- that is, I<whatever> must match
|
|---|
| 882 | C<m/\A\w+\z/>. So "EE<lt> 0 1 2 3 >" is invalid, because
|
|---|
| 883 | it contains spaces, which aren't alphanumeric characters. This
|
|---|
| 884 | presumably does not I<need> special treatment by a Pod processor;
|
|---|
| 885 | " 0 1 2 3 " doesn't look like a number in any base, so it would
|
|---|
| 886 | presumably be looked up in the table of HTML-like names. Since
|
|---|
| 887 | there isn't (and cannot be) an HTML-like entity called " 0 1 2 3 ",
|
|---|
| 888 | this will be treated as an error. However, Pod processors may
|
|---|
| 889 | treat "EE<lt> 0 1 2 3 >" or "EE<lt>e-acute>" as I<syntactically>
|
|---|
| 890 | invalid, potentially earning a different error message than the
|
|---|
| 891 | error message (or warning, or event) generated by a merely unknown
|
|---|
| 892 | (but theoretically valid) htmlname, as in "EE<lt>qacute>"
|
|---|
| 893 | [sic]. However, Pod parsers are not required to make this
|
|---|
| 894 | distinction.
|
|---|
| 895 |
|
|---|
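The distinction between syntactically invalid, unknown, and resolvable codes could be sketched as follows. The name table is deliberately tiny, C<resolve_e_code> is a hypothetical helper, and the numeric forms (leading "0x" for hex, leading "0" for octal, otherwise decimal) follow the description in L<perlpod|perlpod> rather than anything stated in this item.

```perl
use strict;
use warnings;

# A deliberately tiny name table; a real parser needs the full XHTML set.
my %NAME_TO_CODEPOINT = (
    lt => 60, gt => 62, sol => 47, verbar => 124,
    quot => 34, amp => 38, apos => 39,
    lchevron => 171, laquo => 171, rchevron => 187, raquo => 187,
    nbsp => 160, eacute => 233, euro => 0x20AC, infin => 0x221E,
);

# Hypothetical helper: classify the content of an E<...> code.
# Returns (char => $character), (unknown => $content),
# or (invalid => $content).
sub resolve_e_code {
    my ($content) = @_;
    return (invalid => $content) unless $content =~ /\A\w+\z/;

    # Numeric forms: 0x... is hex, 0... is octal, otherwise decimal.
    return (char => chr(hex($1)))
        if $content =~ /\A0[xX]([0-9a-fA-F]+)\z/;
    return (char => chr(oct($content))) if $content =~ /\A0[0-7]+\z/;
    return (char => chr($content))      if $content =~ /\A[0-9]+\z/;

    return exists($NAME_TO_CODEPOINT{$content})
        ? (char    => chr($NAME_TO_CODEPOINT{$content}))
        : (unknown => $content);
}
```

A parser that does not care about the distinction can simply treat C<invalid> and C<unknown> results the same way.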
| 896 | =item *
|
|---|
| 897 |
|
|---|
| 898 | Note that EE<lt>number> I<must not> be interpreted as simply
|
|---|
| 899 | "codepoint I<number> in the current/native character set". It always
|
|---|
| 900 | means only "the character represented by codepoint I<number> in
|
|---|
| 901 | Unicode." (This is identical to the semantics of &#I<number>; in XML.)
|
|---|
| 902 |
|
|---|
| 903 | This will likely require many formatters to have tables mapping from
|
|---|
| 904 | treatable Unicode codepoints (such as the "\xE9" for the e-acute
|
|---|
| 905 | character) to the escape sequences or codes necessary for conveying
|
|---|
| 906 | such sequences in the target output format. A converter to *roff
|
|---|
| 907 | would, for example, know that "\xE9" (whether conveyed literally, or via
|
|---|
| 908 | a EE<lt>...> sequence) is to be conveyed as "e\\*'".
|
|---|
| 909 | Similarly, a program rendering Pod in a Mac OS application window would
|
|---|
| 910 | presumably need to know that "\xE9" maps to codepoint 142 in the MacRoman
|
|---|
| 911 | encoding that (at time of writing) is native for Mac OS. Such
|
|---|
| 912 | Unicode2whatever mappings are presumably already widely available for
|
|---|
| 913 | common output formats. (Such mappings may be incomplete! Implementers
|
|---|
| 914 | are not expected to bend over backwards in an attempt to render
|
|---|
| 915 | Cherokee syllabics, Etruscan runes, Byzantine musical symbols, or any
|
|---|
| 916 | of the other weird things that Unicode can encode.) And
|
|---|
| 917 | if a Pod document uses a character not found in such a mapping, the
|
|---|
| 918 | formatter should consider it an unrenderable character.
|
|---|
| 919 |
|
|---|
| 920 | =item *
|
|---|
| 921 |
|
|---|
| 922 | If, surprisingly, the implementer of a Pod formatter can't find a
|
|---|
| 923 | satisfactory pre-existing table mapping from Unicode characters to
|
|---|
| 924 | escapes in the target format (e.g., a decent table of Unicode
|
|---|
| 925 | characters to *roff escapes), it will be necessary to build such a
|
|---|
| 926 | table. If you are in this circumstance, you should begin with the
|
|---|
| 927 | characters in the range 0x00A0 - 0x00FF, which is mostly the heavily
|
|---|
| 928 | used accented characters. Then proceed (as patience permits and
|
|---|
| 929 | fastidiousness compels) through the characters that the (X)HTML
|
|---|
| 930 | standards groups judged important enough to merit mnemonics
|
|---|
| 931 | for. These are declared in the (X)HTML specifications at the
|
|---|
| 932 | www.W3.org site. At time of writing (September 2001), the most recent
|
|---|
| 933 | entity declaration files are:
|
|---|
| 934 |
|
|---|
| 935 | http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
|
|---|
| 936 | http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
|
|---|
| 937 | http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
|
|---|
| 938 |
|
|---|
| 939 | Then you can progress through any remaining notable Unicode characters
|
|---|
| 940 | in the range 0x2000-0x204D (consult the character tables at
|
|---|
| 941 | www.unicode.org), and whatever else strikes your fancy. For example,
|
|---|
| 942 | in F<xhtml-symbol.ent>, there is the entry:
|
|---|
| 943 |
|
|---|
| 944 | <!ENTITY infin "&#8734;"> <!-- infinity, U+221E ISOtech -->
|
|---|
| 945 |
|
|---|
| 946 | While the mapping "infin" to the character "\x{221E}" will (hopefully)
|
|---|
| 947 | have been already handled by the Pod parser, the presence of the
|
|---|
| 948 | character in this file means that it's important enough to
|
|---|
| 949 | include in a formatter's table that maps from notable Unicode characters
|
|---|
| 950 | to the codes necessary for rendering them. So for a Unicode-to-*roff
|
|---|
| 951 | mapping, for example, this would merit the entry:
|
|---|
| 952 |
|
|---|
| 953 | "\x{221E}" => '\(in',
|
|---|
| 954 |
|
|---|
| 955 | It is eagerly hoped that in the future, increasing numbers of formats
|
|---|
| 956 | (and formatters) will support Unicode characters directly (as (X)HTML
|
|---|
| 957 | does with C<&infin;>, C<&#8734;>, or C<&#x221E;>), reducing the need
|
|---|
| 958 | for idiosyncratic mappings of Unicode-to-I<my_escapes>.
|
|---|
| 959 |
|
|---|
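A formatter's table and its use might be sketched like this. Only the two *roff entries mentioned in this document are included; the names C<%UNICODE_TO_ROFF> and C<roff_escape>, and the choice of "?" as a placeholder, are assumptions of the example.

```perl
use strict;
use warnings;

# Illustrative fragment of a Unicode-to-*roff table.  Only the two
# entries mentioned in the text are given; a real table is far larger.
my %UNICODE_TO_ROFF = (
    "\xE9"     => "e\\*'",
    "\x{221E}" => '\(in',
);

# Hypothetical helper: escape a string for *roff.  US-ASCII passes
# through; characters missing from the table are replaced by "?" and
# returned after the escaped text as the list of unrenderable characters.
sub roff_escape {
    my ($text) = @_;
    my @unrenderable;
    $text =~ s{([^\x00-\x7F])}{
        exists($UNICODE_TO_ROFF{$1})
            ? $UNICODE_TO_ROFF{$1}
            : do { push @unrenderable, $1; '?' }
    }ge;
    return ($text, @unrenderable);
}
```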
| 960 | =item *
|
|---|
| 961 |
|
|---|
| 962 | It is up to individual Pod formatters to display good judgment when
|
|---|
| 963 | confronted with an unrenderable character (which is distinct from an
|
|---|
| 964 | unknown EE<lt>thing> sequence that the parser couldn't resolve to
|
|---|
| 965 | anything, renderable or not). It is good practice to map Latin letters
|
|---|
| 966 | with diacritics (like "EE<lt>eacute>"/"EE<lt>233>") to the corresponding
|
|---|
| 967 | unaccented US-ASCII letters (like a simple character 101, "e"), but
|
|---|
| 968 | clearly this is often not feasible, and an unrenderable character may
|
|---|
| 969 | be represented as "?", or the like. In attempting a sane fallback
|
|---|
| 970 | (as from EE<lt>233> to "e"), Pod formatters may use the
|
|---|
| 971 | %Latin1Code_to_fallback table in L<Pod::Escapes|Pod::Escapes>, or
|
|---|
| 972 | L<Text::Unidecode|Text::Unidecode>, if available.
|
|---|
| 973 |
|
|---|
| 974 | For example, this Pod text:
|
|---|
| 975 |
|
|---|
| 976 | magic is enabled if you set C<$Currency> to 'E<euro>'.
|
|---|
| 977 |
|
|---|
| 978 | may be rendered as:
|
|---|
| 979 | "magic is enabled if you set C<$Currency> to 'I<?>'" or as
|
|---|
| 980 | "magic is enabled if you set C<$Currency> to 'B<[euro]>'", or as
|
|---|
| 981 | "magic is enabled if you set C<$Currency> to '[x20AC]'", etc.
|
|---|
| 982 |
|
|---|
| 983 | A Pod formatter may also note, in a comment or warning, a list of what
|
|---|
| 984 | unrenderable characters were encountered.
|
|---|
| 985 |
|
|---|
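One possible fallback policy is sketched below, for a formatter that can only emit US-ASCII. The three-entry C<%FALLBACK> table and the C<render_fallback> name are invented for the example; a real formatter would more likely consult the tables in L<Pod::Escapes|Pod::Escapes> or use L<Text::Unidecode|Text::Unidecode>.

```perl
use strict;
use warnings;

# Tiny illustrative table: code point => unaccented US-ASCII stand-in.
my %FALLBACK = (
    224 => 'a',   # a-grave
    233 => 'e',   # e-acute
    252 => 'u',   # u-umlaut
);

# Hypothetical helper: render one character for an ASCII-only format.
sub render_fallback {
    my ($char) = @_;
    my $code = ord($char);
    return $char if $code >= 32 && $code <= 126;        # renderable as-is
    return $FALLBACK{$code} if exists $FALLBACK{$code}; # sane stand-in
    return sprintf '[x%04X]', $code;                    # last resort
}
```

With this policy the euro sign in the example above comes out as "[x20AC]".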
| 986 | =item *
|
|---|
| 987 |
|
|---|
| 988 | EE<lt>...> may freely appear in any formatting code (other than
|
|---|
| 989 | in another EE<lt>...> or in a ZE<lt>>). That is, "XE<lt>The
|
|---|
| 990 | EE<lt>euro>1,000,000 Solution>" is valid, as is "LE<lt>The
|
|---|
| 991 | EE<lt>euro>1,000,000 Solution|Million::Euros>".
|
|---|
| 992 |
|
|---|
| 993 | =item *
|
|---|
| 994 |
|
|---|
| 995 | Some Pod formatters output to formats that implement non-breaking
|
|---|
| 996 | spaces as an individual character (which I'll call "NBSP"), and
|
|---|
| 997 | others output to formats that implement non-breaking spaces just as
|
|---|
| 998 | spaces wrapped in a "don't break this across lines" code. Note that
|
|---|
| 999 | at the level of Pod, both sorts of codes can occur: Pod can contain an
|
|---|
| 1000 | NBSP character (whether as a literal, or as a "EE<lt>160>" or
|
|---|
| 1001 | "EE<lt>nbsp>" code); and Pod can contain "SE<lt>foo
|
|---|
| 1002 | IE<lt>barE<gt> baz>" codes, where "mere spaces" (character 32) in
|
|---|
| 1003 | such codes are taken to represent non-breaking spaces. Pod
|
|---|
| 1004 | parsers should consider supporting the optional parsing of "SE<lt>foo
|
|---|
| 1005 | IE<lt>barE<gt> baz>" as if it were
|
|---|
| 1006 | "fooI<NBSP>IE<lt>barE<gt>I<NBSP>baz", and, going the other way, the
|
|---|
| 1007 | optional parsing of groups of words joined by NBSP's as if each group
|
|---|
| 1008 | were in a SE<lt>...> code, so that formatters may use the
|
|---|
| 1009 | representation that maps best to what the output format demands.
|
|---|
| 1010 |
|
|---|
| 1011 | =item *
|
|---|
| 1012 |
|
|---|
| 1013 | Some processors may find that the C<SE<lt>...E<gt>> code is easiest to
|
|---|
| 1014 | implement by replacing each space in the parse tree under the content
|
|---|
| 1015 | of the S, with an NBSP. But note: the replacement should apply I<not> to
|
|---|
| 1016 | spaces in I<all> text, but I<only> to spaces in I<printable> text. (This
|
|---|
| 1017 | distinction may or may not be evident in the particular tree/event
|
|---|
| 1018 | model implemented by the Pod parser.) For example, consider this
|
|---|
| 1019 | unusual case:
|
|---|
| 1020 |
|
|---|
| 1021 | S<L</Autoloaded Functions>>
|
|---|
| 1022 |
|
|---|
| 1023 | This means that the space in the middle of the visible link text must
|
|---|
| 1024 | not be broken across lines. In other words, it's the same as this:
|
|---|
| 1025 |
|
|---|
| 1026 | L<"AutoloadedE<160>Functions"/Autoloaded Functions>
|
|---|
| 1027 |
|
|---|
| 1028 | However, a misapplied space-to-NBSP replacement could (wrongly)
|
|---|
| 1029 | produce something equivalent to this:
|
|---|
| 1030 |
|
|---|
| 1031 | L<"AutoloadedE<160>Functions"/AutoloadedE<160>Functions>
|
|---|
| 1032 |
|
|---|
| 1033 | ...which is almost definitely not going to work as a hyperlink (assuming
|
|---|
| 1034 | this formatter outputs a format supporting hypertext).
|
|---|
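The "printable text only" rule can be illustrated with a toy tree model. Here a node is either a plain string (printable text) or a hash with C<code>, C<attr>, and C<children> keys; this model and the C<nbsp_printable> name are assumptions for the sketch, not the structure of any particular parser.

```perl
use strict;
use warnings;

my $NBSP = "\xA0";

# Hypothetical helper: apply S<...> semantics to a subtree.  Spaces in
# printable text become NBSP; attributes such as a link target are
# copied through untouched.
sub nbsp_printable {
    my ($node) = @_;
    if (ref $node) {
        return +{
            %$node,
            children => [
                map { nbsp_printable($_) } @{ $node->{children} || [] }
            ],
        };
    }
    (my $text = $node) =~ s/ /$NBSP/g;
    return $text;
}
```

Applied to the link from the example above, the visible text becomes non-breaking while the section name used as the link target keeps its ordinary space.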
|
|---|