| 1 | =head1 NAME
|
|---|
| 2 | X<data structure> X<complex data structure> X<struct>
|
|---|
| 3 |
|
|---|
| 4 | perldsc - Perl Data Structures Cookbook
|
|---|
| 5 |
|
|---|
| 6 | =head1 DESCRIPTION
|
|---|
| 7 |
|
|---|
| 8 | The single feature most sorely lacking in the Perl programming language
|
|---|
| 9 | prior to its 5.0 release was complex data structures. Even without direct
|
|---|
| 10 | language support, some valiant programmers did manage to emulate them, but
|
|---|
| 11 | it was hard work and not for the faint of heart. You could occasionally
|
|---|
| 12 | get away with the C<$m{$AoA,$b}> notation borrowed from B<awk> in which the
|
|---|
| 13 | keys are actually more like a single concatenated string C<"$AoA$b">, but
|
|---|
| 14 | traversal and sorting were difficult. More desperate programmers even
|
|---|
| 15 | hacked Perl's internal symbol table directly, a strategy that proved hard
|
|---|
| 16 | to develop and maintain--to put it mildly.
|
|---|
| 17 |
|
|---|
| 18 | The 5.0 release of Perl let us have complex data structures. You
|
|---|
| 19 | may now write something like this and all of a sudden, you'd have an array
|
|---|
| 20 | with three dimensions!
|
|---|
| 21 |
|
|---|
| 22 | for $x (1 .. 10) {
|
|---|
| 23 | for $y (1 .. 10) {
|
|---|
| 24 | for $z (1 .. 10) {
|
|---|
| 25 | $AoA[$x][$y][$z] =
|
|---|
| 26 | $x ** $y + $z;
|
|---|
| 27 | }
|
|---|
| 28 | }
|
|---|
| 29 | }
|
|---|
| 30 |
|
|---|
| 31 | Alas, however simple this may appear, underneath it's a much more
|
|---|
| 32 | elaborate construct than meets the eye!
|
|---|
| 33 |
|
|---|
| 34 | How do you print it out? Why can't you say just C<print @AoA>? How do
|
|---|
| 35 | you sort it? How can you pass it to a function or get one of these back
|
|---|
| 36 | from a function? Is it an object? Can you save it to disk to read
|
|---|
| 37 | back later? How do you access whole rows or columns of that matrix? Do
|
|---|
| 38 | all the values have to be numeric?
|
|---|
| 39 |
|
|---|
| 40 | As you see, it's quite easy to become confused. While some small portion
|
|---|
| 41 | of the blame for this can be attributed to the reference-based
|
|---|
| 42 | implementation, it's really more due to a lack of existing documentation with
|
|---|
| 43 | examples designed for the beginner.
|
|---|
| 44 |
|
|---|
| 45 | This document is meant to be a detailed but understandable treatment of the
|
|---|
| 46 | many different sorts of data structures you might want to develop. It
|
|---|
| 47 | should also serve as a cookbook of examples. That way, when you need to
|
|---|
| 48 | create one of these complex data structures, you can just pinch, pilfer, or
|
|---|
| 49 | purloin a drop-in example from here.
|
|---|
| 50 |
|
|---|
| 51 | Let's look at each of these possible constructs in detail. There are separate
|
|---|
| 52 | sections on each of the following:
|
|---|
| 53 |
|
|---|
| 54 | =over 5
|
|---|
| 55 |
|
|---|
| 56 | =item * arrays of arrays
|
|---|
| 57 |
|
|---|
| 58 | =item * hashes of arrays
|
|---|
| 59 |
|
|---|
| 60 | =item * arrays of hashes
|
|---|
| 61 |
|
|---|
| 62 | =item * hashes of hashes
|
|---|
| 63 |
|
|---|
| 64 | =item * more elaborate constructs
|
|---|
| 65 |
|
|---|
| 66 | =back
|
|---|
| 67 |
|
|---|
| 68 | But for now, let's look at general issues common to all
|
|---|
| 69 | these types of data structures.
|
|---|
| 70 |
|
|---|
| 71 | =head1 REFERENCES
|
|---|
| 72 | X<reference> X<dereference> X<dereferencing> X<pointer>
|
|---|
| 73 |
|
|---|
| 74 | The most important thing to understand about all data structures in Perl
|
|---|
| 75 | -- including multidimensional arrays--is that even though they might
|
|---|
| 76 | appear otherwise, Perl C<@ARRAY>s and C<%HASH>es are all internally
|
|---|
| 77 | one-dimensional. They can hold only scalar values (meaning a string,
|
|---|
| 78 | number, or a reference). They cannot directly contain other arrays or
|
|---|
| 79 | hashes, but instead contain I<references> to other arrays or hashes.
|
|---|
| 80 | X<multidimensional array> X<array, multidimensional>
|
|---|
| 81 |
|
|---|
| 82 | You can't use a reference to an array or hash in quite the same way that you
|
|---|
| 83 | would a real array or hash. For C or C++ programmers unused to
|
|---|
| 84 | distinguishing between arrays and pointers to the same, this can be
|
|---|
| 85 | confusing. If so, just think of it as the difference between a structure
|
|---|
| 86 | and a pointer to a structure.
|
|---|
| 87 |
|
|---|
| 88 | You can (and should) read more about references in the perlref(1) man
|
|---|
| 89 | page. Briefly, references are rather like pointers that know what they
|
|---|
| 90 | point to. (Objects are also a kind of reference, but we won't be needing
|
|---|
| 91 | them right away--if ever.) This means that when you have something which
|
|---|
| 92 | looks to you like an access to a two-or-more-dimensional array and/or hash,
|
|---|
| 93 | what's really going on is that the base type is
|
|---|
| 94 | merely a one-dimensional entity that contains references to the next
|
|---|
| 95 | level. It's just that you can I<use> it as though it were a
|
|---|
| 96 | two-dimensional one. This is actually the way almost all C
|
|---|
| 97 | multidimensional arrays work as well.
|
|---|
| 98 |
|
|---|
| 99 | $array[7][12] # array of arrays
|
|---|
| 100 | $array[7]{string} # array of hashes
|
|---|
| 101 | $hash{string}[7] # hash of arrays
|
|---|
| 102 | $hash{string}{'another string'} # hash of hashes
|
|---|
| 103 |
|
|---|
| 104 | Now, because the top level contains only references, if you try to print
|
|---|
| 105 | out your array in with a simple print() function, you'll get something
|
|---|
| 106 | that doesn't look very nice, like this:
|
|---|
| 107 |
|
|---|
| 108 | @AoA = ( [2, 3], [4, 5, 7], [0] );
|
|---|
| 109 | print $AoA[1][2];
|
|---|
| 110 | 7
|
|---|
| 111 | print @AoA;
|
|---|
| 112 | ARRAY(0x83c38)ARRAY(0x8b194)ARRAY(0x8b1d0)
|
|---|
| 113 |
|
|---|
| 114 |
|
|---|
| 115 | That's because Perl doesn't (ever) implicitly dereference your variables.
|
|---|
| 116 | If you want to get at the thing a reference is referring to, then you have
|
|---|
| 117 | to do this yourself using either prefix typing indicators, like
|
|---|
| 118 | C<${$blah}>, C<@{$blah}>, C<@{$blah[$i]}>, or else postfix pointer arrows,
|
|---|
| 119 | like C<$a-E<gt>[3]>, C<$h-E<gt>{fred}>, or even C<$ob-E<gt>method()-E<gt>[3]>.
|
|---|
| 120 |
|
|---|
| 121 | =head1 COMMON MISTAKES
|
|---|
| 122 |
|
|---|
| 123 | The two most common mistakes made in constructing something like
|
|---|
| 124 | an array of arrays is either accidentally counting the number of
|
|---|
| 125 | elements or else taking a reference to the same memory location
|
|---|
| 126 | repeatedly. Here's the case where you just get the count instead
|
|---|
| 127 | of a nested array:
|
|---|
| 128 |
|
|---|
| 129 | for $i (1..10) {
|
|---|
| 130 | @array = somefunc($i);
|
|---|
| 131 | $AoA[$i] = @array; # WRONG!
|
|---|
| 132 | }
|
|---|
| 133 |
|
|---|
| 134 | That's just the simple case of assigning an array to a scalar and getting
|
|---|
| 135 | its element count. If that's what you really and truly want, then you
|
|---|
| 136 | might do well to consider being a tad more explicit about it, like this:
|
|---|
| 137 |
|
|---|
| 138 | for $i (1..10) {
|
|---|
| 139 | @array = somefunc($i);
|
|---|
| 140 | $counts[$i] = scalar @array;
|
|---|
| 141 | }
|
|---|
| 142 |
|
|---|
| 143 | Here's the case of taking a reference to the same memory location
|
|---|
| 144 | again and again:
|
|---|
| 145 |
|
|---|
| 146 | for $i (1..10) {
|
|---|
| 147 | @array = somefunc($i);
|
|---|
| 148 | $AoA[$i] = \@array; # WRONG!
|
|---|
| 149 | }
|
|---|
| 150 |
|
|---|
| 151 | So, what's the big problem with that? It looks right, doesn't it?
|
|---|
| 152 | After all, I just told you that you need an array of references, so by
|
|---|
| 153 | golly, you've made me one!
|
|---|
| 154 |
|
|---|
| 155 | Unfortunately, while this is true, it's still broken. All the references
|
|---|
| 156 | in @AoA refer to the I<very same place>, and they will therefore all hold
|
|---|
| 157 | whatever was last in @array! It's similar to the problem demonstrated in
|
|---|
| 158 | the following C program:
|
|---|
| 159 |
|
|---|
| 160 | #include <pwd.h>
|
|---|
| 161 | main() {
|
|---|
| 162 | struct passwd *getpwnam(), *rp, *dp;
|
|---|
| 163 | rp = getpwnam("root");
|
|---|
| 164 | dp = getpwnam("daemon");
|
|---|
| 165 |
|
|---|
| 166 | printf("daemon name is %s\nroot name is %s\n",
|
|---|
| 167 | dp->pw_name, rp->pw_name);
|
|---|
| 168 | }
|
|---|
| 169 |
|
|---|
| 170 | Which will print
|
|---|
| 171 |
|
|---|
| 172 | daemon name is daemon
|
|---|
| 173 | root name is daemon
|
|---|
| 174 |
|
|---|
| 175 | The problem is that both C<rp> and C<dp> are pointers to the same location
|
|---|
| 176 | in memory! In C, you'd have to remember to malloc() yourself some new
|
|---|
| 177 | memory. In Perl, you'll want to use the array constructor C<[]> or the
|
|---|
| 178 | hash constructor C<{}> instead. Here's the right way to do the preceding
|
|---|
| 179 | broken code fragments:
|
|---|
| 180 | X<[]> X<{}>
|
|---|
| 181 |
|
|---|
| 182 | for $i (1..10) {
|
|---|
| 183 | @array = somefunc($i);
|
|---|
| 184 | $AoA[$i] = [ @array ];
|
|---|
| 185 | }
|
|---|
| 186 |
|
|---|
| 187 | The square brackets make a reference to a new array with a I<copy>
|
|---|
| 188 | of what's in @array at the time of the assignment. This is what
|
|---|
| 189 | you want.
|
|---|
| 190 |
|
|---|
| 191 | Note that this will produce something similar, but it's
|
|---|
| 192 | much harder to read:
|
|---|
| 193 |
|
|---|
| 194 | for $i (1..10) {
|
|---|
| 195 | @array = 0 .. $i;
|
|---|
| 196 | @{$AoA[$i]} = @array;
|
|---|
| 197 | }
|
|---|
| 198 |
|
|---|
| 199 | Is it the same? Well, maybe so--and maybe not. The subtle difference
|
|---|
| 200 | is that when you assign something in square brackets, you know for sure
|
|---|
| 201 | it's always a brand new reference with a new I<copy> of the data.
|
|---|
| 202 | Something else could be going on in this new case with the C<@{$AoA[$i]}}>
|
|---|
| 203 | dereference on the left-hand-side of the assignment. It all depends on
|
|---|
| 204 | whether C<$AoA[$i]> had been undefined to start with, or whether it
|
|---|
| 205 | already contained a reference. If you had already populated @AoA with
|
|---|
| 206 | references, as in
|
|---|
| 207 |
|
|---|
| 208 | $AoA[3] = \@another_array;
|
|---|
| 209 |
|
|---|
| 210 | Then the assignment with the indirection on the left-hand-side would
|
|---|
| 211 | use the existing reference that was already there:
|
|---|
| 212 |
|
|---|
| 213 | @{$AoA[3]} = @array;
|
|---|
| 214 |
|
|---|
| 215 | Of course, this I<would> have the "interesting" effect of clobbering
|
|---|
| 216 | @another_array. (Have you ever noticed how when a programmer says
|
|---|
| 217 | something is "interesting", that rather than meaning "intriguing",
|
|---|
| 218 | they're disturbingly more apt to mean that it's "annoying",
|
|---|
| 219 | "difficult", or both? :-)
|
|---|
| 220 |
|
|---|
| 221 | So just remember always to use the array or hash constructors with C<[]>
|
|---|
| 222 | or C<{}>, and you'll be fine, although it's not always optimally
|
|---|
| 223 | efficient.
|
|---|
| 224 |
|
|---|
| 225 | Surprisingly, the following dangerous-looking construct will
|
|---|
| 226 | actually work out fine:
|
|---|
| 227 |
|
|---|
| 228 | for $i (1..10) {
|
|---|
| 229 | my @array = somefunc($i);
|
|---|
| 230 | $AoA[$i] = \@array;
|
|---|
| 231 | }
|
|---|
| 232 |
|
|---|
| 233 | That's because my() is more of a run-time statement than it is a
|
|---|
| 234 | compile-time declaration I<per se>. This means that the my() variable is
|
|---|
| 235 | remade afresh each time through the loop. So even though it I<looks> as
|
|---|
| 236 | though you stored the same variable reference each time, you actually did
|
|---|
| 237 | not! This is a subtle distinction that can produce more efficient code at
|
|---|
| 238 | the risk of misleading all but the most experienced of programmers. So I
|
|---|
| 239 | usually advise against teaching it to beginners. In fact, except for
|
|---|
| 240 | passing arguments to functions, I seldom like to see the gimme-a-reference
|
|---|
| 241 | operator (backslash) used much at all in code. Instead, I advise
|
|---|
| 242 | beginners that they (and most of the rest of us) should try to use the
|
|---|
| 243 | much more easily understood constructors C<[]> and C<{}> instead of
|
|---|
| 244 | relying upon lexical (or dynamic) scoping and hidden reference-counting to
|
|---|
| 245 | do the right thing behind the scenes.
|
|---|
| 246 |
|
|---|
| 247 | In summary:
|
|---|
| 248 |
|
|---|
| 249 | $AoA[$i] = [ @array ]; # usually best
|
|---|
| 250 | $AoA[$i] = \@array; # perilous; just how my() was that array?
|
|---|
| 251 | @{ $AoA[$i] } = @array; # way too tricky for most programmers
|
|---|
| 252 |
|
|---|
| 253 |
|
|---|
| 254 | =head1 CAVEAT ON PRECEDENCE
|
|---|
| 255 | X<dereference, precedence> X<dereferencing, precedence>
|
|---|
| 256 |
|
|---|
| 257 | Speaking of things like C<@{$AoA[$i]}>, the following are actually the
|
|---|
| 258 | same thing:
|
|---|
| 259 | X<< -> >>
|
|---|
| 260 |
|
|---|
| 261 | $aref->[2][2] # clear
|
|---|
| 262 | $$aref[2][2] # confusing
|
|---|
| 263 |
|
|---|
| 264 | That's because Perl's precedence rules on its five prefix dereferencers
|
|---|
| 265 | (which look like someone swearing: C<$ @ * % &>) make them bind more
|
|---|
| 266 | tightly than the postfix subscripting brackets or braces! This will no
|
|---|
| 267 | doubt come as a great shock to the C or C++ programmer, who is quite
|
|---|
| 268 | accustomed to using C<*a[i]> to mean what's pointed to by the I<i'th>
|
|---|
| 269 | element of C<a>. That is, they first take the subscript, and only then
|
|---|
| 270 | dereference the thing at that subscript. That's fine in C, but this isn't C.
|
|---|
| 271 |
|
|---|
| 272 | The seemingly equivalent construct in Perl, C<$$aref[$i]> first does
|
|---|
| 273 | the deref of $aref, making it take $aref as a reference to an
|
|---|
| 274 | array, and then dereference that, and finally tell you the I<i'th> value
|
|---|
| 275 | of the array pointed to by $AoA. If you wanted the C notion, you'd have to
|
|---|
| 276 | write C<${$AoA[$i]}> to force the C<$AoA[$i]> to get evaluated first
|
|---|
| 277 | before the leading C<$> dereferencer.
|
|---|
| 278 |
|
|---|
| 279 | =head1 WHY YOU SHOULD ALWAYS C<use strict>
|
|---|
| 280 |
|
|---|
| 281 | If this is starting to sound scarier than it's worth, relax. Perl has
|
|---|
| 282 | some features to help you avoid its most common pitfalls. The best
|
|---|
| 283 | way to avoid getting confused is to start every program like this:
|
|---|
| 284 |
|
|---|
| 285 | #!/usr/bin/perl -w
|
|---|
| 286 | use strict;
|
|---|
| 287 |
|
|---|
| 288 | This way, you'll be forced to declare all your variables with my() and
|
|---|
| 289 | also disallow accidental "symbolic dereferencing". Therefore if you'd done
|
|---|
| 290 | this:
|
|---|
| 291 |
|
|---|
| 292 | my $aref = [
|
|---|
| 293 | [ "fred", "barney", "pebbles", "bambam", "dino", ],
|
|---|
| 294 | [ "homer", "bart", "marge", "maggie", ],
|
|---|
| 295 | [ "george", "jane", "elroy", "judy", ],
|
|---|
| 296 | ];
|
|---|
| 297 |
|
|---|
| 298 | print $aref[2][2];
|
|---|
| 299 |
|
|---|
| 300 | The compiler would immediately flag that as an error I<at compile time>,
|
|---|
| 301 | because you were accidentally accessing C<@aref>, an undeclared
|
|---|
| 302 | variable, and it would thereby remind you to write instead:
|
|---|
| 303 |
|
|---|
| 304 | print $aref->[2][2]
|
|---|
| 305 |
|
|---|
| 306 | =head1 DEBUGGING
|
|---|
| 307 | X<data structure, debugging> X<complex data structure, debugging>
|
|---|
| 308 | X<AoA, debugging> X<HoA, debugging> X<AoH, debugging> X<HoH, debugging>
|
|---|
| 309 | X<array of arrays, debugging> X<hash of arrays, debugging>
|
|---|
| 310 | X<array of hashes, debugging> X<hash of hashes, debugging>
|
|---|
| 311 |
|
|---|
| 312 | Before version 5.002, the standard Perl debugger didn't do a very nice job of
|
|---|
| 313 | printing out complex data structures. With 5.002 or above, the
|
|---|
| 314 | debugger includes several new features, including command line editing as
|
|---|
| 315 | well as the C<x> command to dump out complex data structures. For
|
|---|
| 316 | example, given the assignment to $AoA above, here's the debugger output:
|
|---|
| 317 |
|
|---|
| 318 | DB<1> x $AoA
|
|---|
| 319 | $AoA = ARRAY(0x13b5a0)
|
|---|
| 320 | 0 ARRAY(0x1f0a24)
|
|---|
| 321 | 0 'fred'
|
|---|
| 322 | 1 'barney'
|
|---|
| 323 | 2 'pebbles'
|
|---|
| 324 | 3 'bambam'
|
|---|
| 325 | 4 'dino'
|
|---|
| 326 | 1 ARRAY(0x13b558)
|
|---|
| 327 | 0 'homer'
|
|---|
| 328 | 1 'bart'
|
|---|
| 329 | 2 'marge'
|
|---|
| 330 | 3 'maggie'
|
|---|
| 331 | 2 ARRAY(0x13b540)
|
|---|
| 332 | 0 'george'
|
|---|
| 333 | 1 'jane'
|
|---|
| 334 | 2 'elroy'
|
|---|
| 335 | 3 'judy'
|
|---|
| 336 |
|
|---|
| 337 | =head1 CODE EXAMPLES
|
|---|
| 338 |
|
|---|
| 339 | Presented with little comment (these will get their own manpages someday)
|
|---|
| 340 | here are short code examples illustrating access of various
|
|---|
| 341 | types of data structures.
|
|---|
| 342 |
|
|---|
| 343 | =head1 ARRAYS OF ARRAYS
|
|---|
| 344 | X<array of arrays> X<AoA>
|
|---|
| 345 |
|
|---|
| 346 | =head2 Declaration of an ARRAY OF ARRAYS
|
|---|
| 347 |
|
|---|
|
|---|