1 | ptmalloc3 - a multi-thread malloc implementation
|
---|
2 | ================================================
|
---|
3 |
|
---|
4 | Wolfram Gloger ([email protected])
|
---|
5 |
|
---|
6 | Jan 2006
|
---|
7 |
|
---|
8 |
|
---|
9 | Thanks
|
---|
10 | ======
|
---|
11 |
|
---|
12 | This release was partly funded by Pixar Animation Studios. I would
|
---|
13 | like to thank David Baraff of Pixar for his support and Doug Lea
|
---|
14 | ([email protected]) for the great original malloc implementation.
|
---|
15 |
|
---|
16 |
|
---|
17 | Introduction
|
---|
18 | ============
|
---|
19 |
|
---|
20 | This package is a modified version of Doug Lea's malloc-2.8.3
|
---|
21 | implementation (available seperately from ftp://g.oswego.edu/pub/misc)
|
---|
22 | that I adapted for multiple threads, while trying to avoid lock
|
---|
23 | contention as much as possible.
|
---|
24 |
|
---|
25 | As part of the GNU C library, the source files may be available under
|
---|
26 | the GNU Library General Public License (see the comments in the
|
---|
27 | files). But as part of this stand-alone package, the code is also
|
---|
28 | available under the (probably less restrictive) conditions described
|
---|
29 | in the file 'COPYRIGHT'. In any case, there is no warranty whatsoever
|
---|
30 | for this package.
|
---|
31 |
|
---|
32 | The current distribution should be available from:
|
---|
33 |
|
---|
34 | http://www.malloc.de/malloc/ptmalloc3.tar.gz
|
---|
35 |
|
---|
36 |
|
---|
37 | Compilation
|
---|
38 | ===========
|
---|
39 |
|
---|
40 | It should be possible to build ptmalloc3 on any UN*X-like system that
|
---|
41 | implements the sbrk(), mmap(), munmap() and mprotect() calls. Since
|
---|
42 | there are now several source files, a library (libptmalloc3.a) is
|
---|
43 | generated. See the Makefile for examples of the compile-time options.
|
---|
44 |
|
---|
45 | Note that support for non-ANSI compilers is no longer there.
|
---|
46 |
|
---|
47 | Several example targets are provided in the Makefile:
|
---|
48 |
|
---|
49 | o Posix threads (pthreads), compile with "make posix"
|
---|
50 |
|
---|
51 | o Posix threads with explicit initialization, compile with
|
---|
52 | "make posix-explicit" (known to be required on HPUX)
|
---|
53 |
|
---|
54 | o Posix threads without "tsd data hack" (see below), compile with
|
---|
55 | "make posix-with-tsd"
|
---|
56 |
|
---|
57 | o Solaris threads, compile with "make solaris"
|
---|
58 |
|
---|
59 | o SGI sproc() threads, compile with "make sproc"
|
---|
60 |
|
---|
61 | o no threads, compile with "make nothreads" (currently out of order?)
|
---|
62 |
|
---|
63 | For Linux:
|
---|
64 |
|
---|
65 | o make "linux-pthread" (almost the same as "make posix") or
|
---|
66 | make "linux-shared"
|
---|
67 |
|
---|
68 | Note that some compilers need special flags for multi-threaded code,
|
---|
69 | e.g. with Solaris cc with Posix threads, one should use:
|
---|
70 |
|
---|
71 | % make posix SYS_FLAGS='-mt'
|
---|
72 |
|
---|
73 | Some additional targets, ending in `-libc', are also provided in the
|
---|
74 | Makefile, to compare performance of the test programs to the case when
|
---|
75 | linking with the standard malloc implementation in libc.
|
---|
76 |
|
---|
77 | A potential problem remains: If any of the system-specific functions
|
---|
78 | for getting/setting thread-specific data or for locking a mutex call
|
---|
79 | one of the malloc-related functions internally, the implementation
|
---|
80 | cannot work at all due to infinite recursion. One example seems to be
|
---|
81 | Solaris 2.4. I would like to hear if this problem occurs on other
|
---|
82 | systems, and whether similar workarounds could be applied.
|
---|
83 |
|
---|
84 | For Posix threads, too, an optional hack like that has been integrated
|
---|
85 | (activated when defining USE_TSD_DATA_HACK) which depends on
|
---|
86 | `pthread_t' being convertible to an integral type (which is of course
|
---|
87 | not generally guaranteed). USE_TSD_DATA_HACK is now the default
|
---|
88 | because I haven't yet found a non-glibc pthreads system where this
|
---|
89 | hack is _not_ needed.
|
---|
90 |
|
---|
91 | *NEW* and _important_: In (currently) one place in the ptmalloc3
|
---|
92 | source, a write memory barrier is needed, named
|
---|
93 | atomic_write_barrier(). This macro needs to be defined at the end of
|
---|
94 | malloc-machine.h. For gcc, a fallback in the form of a full memory
|
---|
95 | barrier is already defined, but you may need to add another definition
|
---|
96 | if you don't use gcc.
|
---|
97 |
|
---|
98 | Usage
|
---|
99 | =====
|
---|
100 |
|
---|
101 | Just link libptmalloc3 into your application.
|
---|
102 |
|
---|
103 | Some wicked systems (e.g. HPUX apparently) won't let malloc call _any_
|
---|
104 | thread-related functions before main(). On these systems,
|
---|
105 | USE_STARTER=2 must be defined during compilation (see "make
|
---|
106 | posix-explicit" above) and the global initialization function
|
---|
107 | ptmalloc_init() must be called explicitly, preferably at the start of
|
---|
108 | main().
|
---|
109 |
|
---|
110 | Otherwise, when using ptmalloc3, no special precautions are necessary.
|
---|
111 |
|
---|
112 | Link order is important
|
---|
113 | =======================
|
---|
114 |
|
---|
115 | On some systems, when overriding malloc and linking against shared
|
---|
116 | libraries, the link order becomes very important. E.g., when linking
|
---|
117 | C++ programs on Solaris with Solaris threads [this is probably now
|
---|
118 | obsolete], don't rely on libC being included by default, but instead
|
---|
119 | put `-lthread' behind `-lC' on the command line:
|
---|
120 |
|
---|
121 | CC ... libptmalloc3.a -lC -lthread
|
---|
122 |
|
---|
123 | This is because there are global constructors in libC that need
|
---|
124 | malloc/ptmalloc, which in turn needs to have the thread library to be
|
---|
125 | already initialized.
|
---|
126 |
|
---|
127 | Debugging hooks
|
---|
128 | ===============
|
---|
129 |
|
---|
130 | All calls to malloc(), realloc(), free() and memalign() are routed
|
---|
131 | through the global function pointers __malloc_hook, __realloc_hook,
|
---|
132 | __free_hook and __memalign_hook if they are not NULL (see the malloc.h
|
---|
133 | header file for declarations of these pointers). Therefore the malloc
|
---|
134 | implementation can be changed at runtime, if care is taken not to call
|
---|
135 | free() or realloc() on pointers obtained with a different
|
---|
136 | implementation than the one currently in effect. (The easiest way to
|
---|
137 | guarantee this is to set up the hooks before any malloc call, e.g.
|
---|
138 | with a function pointed to by the global variable
|
---|
139 | __malloc_initialize_hook).
|
---|
140 |
|
---|
141 | You can now also tune other malloc parameters (normally adjused via
|
---|
142 | mallopt() calls from the application) with environment variables:
|
---|
143 |
|
---|
144 | MALLOC_TRIM_THRESHOLD_ for deciding to shrink the heap (in bytes)
|
---|
145 |
|
---|
146 | MALLOC_GRANULARITY_ The unit for allocating and deallocating
|
---|
147 | MALLOC_TOP_PAD_ memory from the system. The default
|
---|
148 | is 64k and this parameter _must_ be a
|
---|
149 | power of 2.
|
---|
150 |
|
---|
151 | MALLOC_MMAP_THRESHOLD_ min. size for chunks allocated via
|
---|
152 | mmap() (in bytes)
|
---|
153 |
|
---|
154 | Tests
|
---|
155 | =====
|
---|
156 |
|
---|
157 | Two testing applications, t-test1 and t-test2, are included in this
|
---|
158 | source distribution. Both perform pseudo-random sequences of
|
---|
159 | allocations/frees, and can be given numeric arguments (all arguments
|
---|
160 | are optional):
|
---|
161 |
|
---|
162 | % t-test[12] <n-total> <n-parallel> <n-allocs> <size-max> <bins>
|
---|
163 |
|
---|
164 | n-total = total number of threads executed (default 10)
|
---|
165 | n-parallel = number of threads running in parallel (2)
|
---|
166 | n-allocs = number of malloc()'s / free()'s per thread (10000)
|
---|
167 | size-max = max. size requested with malloc() in bytes (10000)
|
---|
168 | bins = number of bins to maintain
|
---|
169 |
|
---|
170 | The first test `t-test1' maintains a completely seperate pool of
|
---|
171 | allocated bins for each thread, and should therefore show full
|
---|
172 | parallelism. On the other hand, `t-test2' creates only a single pool
|
---|
173 | of bins, and each thread randomly allocates/frees any bin. Some lock
|
---|
174 | contention is to be expected in this case, as the threads frequently
|
---|
175 | cross each others arena.
|
---|
176 |
|
---|
177 | Performance results from t-test1 should be quite repeatable, while the
|
---|
178 | behaviour of t-test2 depends on scheduling variations.
|
---|
179 |
|
---|
180 | Conclusion
|
---|
181 | ==========
|
---|
182 |
|
---|
183 | I'm always interested in performance data and feedback, just send mail
|
---|
184 | to [email protected].
|
---|
185 |
|
---|
186 | Good luck!
|
---|