Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 1 | # Example Investigation of a Heap Dump |
| 2 | |
| 3 | This document describes the steps taken to investigate a real memory leak |
| 4 | discovered by heap profiling in the wild. For investigators less familiar with |
| 5 | the code base, `Navigating the Stack Trace` should be enough information to |
| 6 | determine the relevant component, and to forward the bug to a component OWNER. |
| 7 | |
| 8 | ## Understanding the heap dump summary |
| 9 | |
| 10 | The opening comment of [Issue |
| 11 | 834033](https://bugs.chromium.org/p/chromium/issues/detail?id=834033) contains a |
| 12 | heap dump summary. The highlights are: |
| 13 | |
| 14 | * 315723 calls to malloc without corresponding call to free. |
| 15 | * 806MB of memory. |
| 16 | * The common stacktrace for all 315723 allocations. |
| 17 | |
| 18 | Usually, anything that uses over 10MB of memory is a red flag. With the |
| 19 | exception of large image resources, most code in Chrome should use much less |
| 20 | than 10MB. Anything that has over 100k allocations is also a red flag. |
| 21 | |
| 22 | ### Navigating the Stack Trace - Detailed Breakdown |
| 23 | |
| 24 | Let's take a look at the stack trace: |
| 25 | |
| 26 | ``` |
| 27 | profiling::(anonymous namespace)::HookAlloc(base::allocator::AllocatorDispatch const*, unsigned long, void*) |
| 28 | base::allocator::MallocZoneFunctionsToReplaceDefault()::$_1::__invoke(_malloc_zone_t*, unsigned long) |
| 29 | <???> |
| 30 | <???> |
| 31 | base::allocator::UncheckedMallocMac(unsigned long, void**) |
| 32 | sk_malloc_flags(unsigned long, unsigned int) |
| 33 | SkMallocPixelRef::MakeAllocate(SkImageInfo const&, unsigned long) |
| 34 | SkBitmap::tryAllocPixels(SkImageInfo const&, unsigned long) |
| 35 | IPC::ParamTraits<SkBitmap>::Read(base::Pickle const*, base::PickleIterator*, SkBitmap*) |
| 36 | ExtensionAction::ParseIconFromCanvasDictionary(base::DictionaryValue const&, gfx::ImageSkia*) |
| 37 | extensions::ExtensionActionSetIconFunction::RunExtensionAction() |
| 38 | extensions::ExtensionActionFunction::Run() |
| 39 | ExtensionFunction::RunWithValidation() |
| 40 | extensions::ExtensionFunctionDispatcher::DispatchWithCallbackInternal(ExtensionHostMsg_Request_Params const&, content::RenderFrameHost*, int, base::RepeatingCallback<void (ExtensionFunction::ResponseType, base::ListValue const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, extensions::functions::HistogramValue)> const&) |
| 41 | extensions::ExtensionFunctionDispatcher::Dispatch(ExtensionHostMsg_Request_Params const&, content::RenderFrameHost*, int) |
| 42 | bool IPC::MessageT<ExtensionHostMsg_Request_Meta, std::__1::tuple<ExtensionHostMsg_Request_Params>, void>::Dispatch<extensions::ExtensionWebContentsObserver, extensions::ExtensionWebContentsObserver, content::RenderFrameHost, void (extensions::ExtensionWebContentsObserver::*)(content::RenderFrameHost*, ExtensionHostMsg_Request_Params const&)>(IPC::Message const*, extensions::ExtensionWebContentsObserver*, extensions::ExtensionWebContentsObserver*, content::RenderFrameHost*, void (extensions::ExtensionWebContentsObserver::*)(content::RenderFrameHost*, ExtensionHostMsg_Request_Params const&)) |
| 43 | extensions::ExtensionWebContentsObserver::OnMessageReceived(IPC::Message const&, content::RenderFrameHost*) |
| 44 | extensions::ChromeExtensionWebContentsObserver::OnMessageReceived(IPC::Message const&, content::RenderFrameHost*) |
| 45 | content::WebContentsImpl::OnMessageReceived(content::RenderFrameHostImpl*, IPC::Message const&) |
| 46 | content::RenderFrameHostImpl::OnMessageReceived(IPC::Message const&) |
| 47 | IPC::ChannelProxy::Context::OnDispatchMessage(IPC::Message const&) |
| 48 | base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) |
| 49 | base::MessageLoop::RunTask(base::PendingTask*) |
| 50 | base::MessageLoop::DoWork() |
| 51 | base::MessagePumpCFRunLoopBase::RunWork() |
Avi Drissman | b94ce74 | 2023-08-15 15:50:52 | [diff] [blame] | 52 | base::apple::CallWithEHFrame(void () block_pointer) |
Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 53 | base::MessagePumpCFRunLoopBase::RunWorkSource(void*) |
| 54 | <???> |
| 55 | <???> |
| 56 | <???> |
| 57 | <???> |
| 58 | <???> |
| 59 | <???> |
| 60 | <???> |
| 61 | <???> |
| 62 | <???> |
| 63 | __71-[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:]_block_invoke |
Avi Drissman | b94ce74 | 2023-08-15 15:50:52 | [diff] [blame] | 64 | base::apple::CallWithEHFrame(void () block_pointer) |
Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 65 | -[BrowserCrApplication nextEventMatchingMask:untilDate:inMode:dequeue:] |
| 66 | <???> |
| 67 | base::MessagePumpNSApplication::DoRun(base::MessagePump::Delegate*) |
| 68 | base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) |
| 69 | <name omitted> |
| 70 | ChromeBrowserMainParts::MainMessageLoopRun(int*) |
| 71 | content::BrowserMainLoop::RunMainMessageLoopParts() |
| 72 | content::BrowserMainRunnerImpl::Run() |
Gabriel Charette | fbeeb1c | 2021-11-10 20:50:06 | [diff] [blame] | 73 | content::BrowserMain(content::MainFunctionParams) |
Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 74 | content::ContentMainRunnerImpl::Run() |
| 75 | service_manager::Main(service_manager::MainParams const&) |
| 76 | content::ContentMain(content::ContentMainParams const&) |
| 77 | ChromeMain |
| 78 | main |
| 79 | <???> |
| 80 | ``` |
| 81 | |
| 82 | The first step is to divide the stack trace into smaller segments to get a |
| 83 | better understanding of what's happening at the time of allocations. The best |
| 84 | way to do this is to segment by name space and/or function prefixes. |
| 85 | |
| 86 | ``` |
| 87 | profiling::(anonymous namespace)::HookAlloc(base::allocator::AllocatorDispatch const*, unsigned long, void*) |
| 88 | base::allocator::MallocZoneFunctionsToReplaceDefault()::$_1::__invoke(_malloc_zone_t*, unsigned long) |
| 89 | <???> |
| 90 | <???> |
| 91 | base::allocator::UncheckedMallocMac(unsigned long, void**) |
| 92 | ``` |
| 93 | |
| 94 | The top of each stack will always contain some `base` and/or `profiling` |
| 95 | code. This is the code responsible for allocating and recording the memory. |
| 96 | |
| 97 | ``` |
| 98 | sk_malloc_flags(unsigned long, unsigned int) |
| 99 | SkMallocPixelRef::MakeAllocate(SkImageInfo const&, unsigned long) |
| 100 | SkBitmap::tryAllocPixels(SkImageInfo const&, unsigned long) |
| 101 | ``` |
| 102 | |
| 103 | Next, we three 3 frames with the prefix `sk`. Searching for |
| 104 | `sk_malloc_flags` on |
| 105 | [codesearch](https://cs.chromium.org/search/?q=sk_malloc_flags&sq=package:chromium&type=cs) |
| 106 | reveals that the component is `third_party/skia`. Looking at the |
| 107 | [README](https://cs.chromium.org/chromium/src/third_party/skia/README) reveals |
| 108 | that Skia is a 2D graphics library. |
| 109 | |
| 110 | ``` |
| 111 | IPC::ParamTraits<SkBitmap>::Read(base::Pickle const*, base::PickleIterator*, SkBitmap*) |
| 112 | ``` |
| 113 | |
| 114 | Next we see a templated function called `Read` in the namespace `IPC`. |
| 115 | `IPC` stands for inter-process communication. This suggests that the |
| 116 | function is responsible for reading an IPC Message, perhaps concerning an |
| 117 | `SkBitmap`. |
| 118 | |
| 119 | ``` |
| 120 | ExtensionAction::ParseIconFromCanvasDictionary(base::DictionaryValue const&, gfx::ImageSkia*) |
| 121 | extensions::ExtensionActionSetIconFunction::RunExtensionAction() |
| 122 | extensions::ExtensionActionFunction::Run() |
| 123 | ExtensionFunction::RunWithValidation() |
| 124 | extensions::ExtensionFunctionDispatcher::DispatchWithCallbackInternal(ExtensionHostMsg_Request_Params const&, content::RenderFrameHost*, int, base::RepeatingCallback<void (ExtensionFunction::ResponseType, base::ListValue const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, extensions::functions::HistogramValue)> const&) |
| 125 | extensions::ExtensionFunctionDispatcher::Dispatch(ExtensionHostMsg_Request_Params const&, content::RenderFrameHost*, int) |
| 126 | bool IPC::MessageT<ExtensionHostMsg_Request_Meta, std::__1::tuple<ExtensionHostMsg_Request_Params>, void>::Dispatch<extensions::ExtensionWebContentsObserver, extensions::ExtensionWebContentsObserver, content::RenderFrameHost, void (extensions::ExtensionWebContentsObserver::*)(content::RenderFrameHost*, ExtensionHostMsg_Request_Params const&)>(IPC::Message const*, extensions::ExtensionWebContentsObserver*, extensions::ExtensionWebContentsObserver*, content::RenderFrameHost*, void (extensions::ExtensionWebContentsObserver::*)(content::RenderFrameHost*, ExtensionHostMsg_Request_Params const&)) |
| 127 | extensions::ExtensionWebContentsObserver::OnMessageReceived(IPC::Message const&, content::RenderFrameHost*) |
| 128 | extensions::ChromeExtensionWebContentsObserver::OnMessageReceived(IPC::Message const&, content::RenderFrameHost*) |
| 129 | ``` |
| 130 | |
| 131 | Next, we see many frames with the `extension` prefix. Extensions are exactly |
| 132 | what they sound like - Chrome extensions like AdBlock are used to modify the |
| 133 | behavior of the browser. |
| 134 | |
| 135 | ``` |
| 136 | content::WebContentsImpl::OnMessageReceived(content::RenderFrameHostImpl*, IPC::Message const&) |
| 137 | content::RenderFrameHostImpl::OnMessageReceived(IPC::Message const&) |
| 138 | ``` |
| 139 | |
| 140 | `content` is the name of code that glues together web code [like extensions] and |
| 141 | the rest of Chrome. |
| 142 | |
| 143 | ``` |
| 144 | IPC::ChannelProxy::Context::OnDispatchMessage(IPC::Message const&) |
| 145 | ``` |
| 146 | |
| 147 | More `IPC` code. |
| 148 | |
| 149 | ``` |
| 150 | base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) |
| 151 | base::MessageLoop::RunTask(base::PendingTask*) |
| 152 | base::MessageLoop::DoWork() |
| 153 | base::MessagePumpCFRunLoopBase::RunWork() |
Avi Drissman | b94ce74 | 2023-08-15 15:50:52 | [diff] [blame] | 154 | base::apple::CallWithEHFrame(void () block_pointer) |
Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 155 | base::MessagePumpCFRunLoopBase::RunWorkSource(void*) |
| 156 | ``` |
| 157 | |
| 158 | More `base` code. The bottom of most stack traces should go back to |
| 159 | `MessageLoop`, a primitive Chrome construct used to run tasks. |
| 160 | |
| 161 | ### Navigating the Stack Trace - Summary |
| 162 | |
| 163 | * The top and bottom of the stack should generally be the same and are not very |
| 164 | interesting. |
| 165 | * The prefixes of frames can be used to get a rough idea of the components |
| 166 | involved. |
| 167 | * Function names can be used to get a rough idea of what's going on. |
| 168 | |
| 169 | In this case, extension code is calling `ParseIconFromCanvasDictionary` - so |
| 170 | it's probably trying to parse an icon. This calls into Skia code. Given that |
| 171 | Skia is a 2D drawing library, and the function is `tryAllocPixels`, Skia is |
| 172 | allocating some pixels for the icon. This process is being repeated 315 thousand |
| 173 | times, and the icon is being leaked every time. |
| 174 | |
| 175 | |
| 176 | ## Diving into the code |
| 177 | |
| 178 | Now that we have a rough idea of what's happening, let's look at the code for |
| 179 | ParseIconFromCanvasDictionary. |
| 180 | |
| 181 | ```cpp |
| 182 | bool ExtensionAction::ParseIconFromCanvasDictionary( |
| 183 | const base::DictionaryValue& dict, |
| 184 | gfx::ImageSkia* icon) { |
| 185 | for (base::DictionaryValue::Iterator iter(dict); !iter.IsAtEnd(); |
| 186 | iter.Advance()) { |
| 187 | std::string binary_string64; |
| 188 | IPC::Message pickle; |
| 189 | if (iter.value().is_blob()) { |
| 190 | pickle = IPC::Message(iter.value().GetBlob().data(), |
| 191 | iter.value().GetBlob().size()); |
| 192 | } else if (iter.value().GetAsString(&binary_string64)) { |
| 193 | std::string binary_string; |
| 194 | if (!base::Base64Decode(binary_string64, &binary_string)) |
| 195 | return false; |
| 196 | pickle = IPC::Message(binary_string.c_str(), binary_string.length()); |
| 197 | } else { |
| 198 | continue; |
| 199 | } |
| 200 | base::PickleIterator pickle_iter(pickle); |
| 201 | SkBitmap bitmap; |
| 202 | if (!IPC::ReadParam(&pickle, &pickle_iter, &bitmap)) |
| 203 | return false; |
| 204 | CHECK(!bitmap.isNull()); |
| 205 | |
| 206 | // Chrome helpfully scales the provided icon(s), but let's not go overboard. |
| 207 | const int kActionIconMaxSize = 10 * ActionIconSize(); |
| 208 | if (bitmap.drawsNothing() || bitmap.width() > kActionIconMaxSize) |
| 209 | continue; |
| 210 | |
| 211 | float scale = static_cast<float>(bitmap.width()) / ActionIconSize(); |
| 212 | icon->AddRepresentation(gfx::ImageSkiaRep(bitmap, scale)); |
| 213 | } |
| 214 | return true; |
| 215 | } |
| 216 | ``` |
| 217 | |
| 218 | There's a lot going on here, but we can use the information we have to focus. |
| 219 | The leak happens in IPC::ReadParam, so the relevant lines are: |
| 220 | |
| 221 | ``` |
| 222 | SkBitmap bitmap; |
| 223 | if (!IPC::ReadParam(&pickle, &pickle_iter, &bitmap)) |
| 224 | return false; |
| 225 | ``` |
| 226 | |
| 227 | The `IPC` message is being decoded into `bitmap`. |
| 228 | |
| 229 | ``` |
| 230 | icon->AddRepresentation(gfx::ImageSkiaRep(bitmap, scale)); |
| 231 | ``` |
| 232 | Looking at subsequent consumers of `bitmap`, we see that it is being added as a |
| 233 | representation to `icon`. `icon` is an output parameter of this function, so we |
| 234 | have to look at the calling frame, |
| 235 | `ExtensionActionSetIconFunction::RunExtensionAction`. |
| 236 | |
| 237 | ``` |
| 238 | ExtensionFunction::ResponseAction |
| 239 | ExtensionActionSetIconFunction::RunExtensionAction() { |
| 240 | ... |
| 241 | EXTENSION_FUNCTION_VALIDATE( |
| 242 | ExtensionAction::ParseIconFromCanvasDictionary(*canvas_set, &icon)); |
| 243 | |
| 244 | if (icon.isNull()) |
| 245 | return RespondNow(Error("Icon invalid.")); |
| 246 | |
| 247 | extension_action_->SetIcon(tab_id_, gfx::Image(icon)); |
| 248 | ... |
| 249 | } |
| 250 | ``` |
| 251 | |
| 252 | In this case, I've already focused on the code that calls |
| 253 | `ParseIconFromCanvasDictionary`. Let's look at `SetIcon`. |
| 254 | |
| 255 | ``` |
| 256 | void ExtensionAction::SetIcon(int tab_id, const gfx::Image& image) { |
| 257 | SetValue(&icon_, tab_id, image); |
| 258 | } |
| 259 | ``` |
| 260 | |
| 261 | ``` |
| 262 | template<class T> |
| 263 | void SetValue(std::map<int, T>* map, int tab_id, const T& val) { |
| 264 | (*map)[tab_id] = val; |
| 265 | } |
| 266 | ``` |
| 267 | |
Erik Chen | dc47db1 | 2018-04-23 17:21:52 | [diff] [blame] | 268 | The icon is being added to a map `icon_`, with `tab_id` as the key. Ah ha! |
| 269 | Adding elements to a container [and never removing them] is one of the most |
| 270 | common sources of memory issues. |
| 271 | |
| 272 | There are two ways for this memory to be released - the container `icon_` can be |
| 273 | destroyed, or the element can be removed from the container. |
| 274 | |
| 275 | `icon_` is a member of `ExtensionAction`, whose documentation reads: |
| 276 | ``` |
| 277 | // ExtensionAction encapsulates the state of a browser action or page action. |
| 278 | // Instances can have both global and per-tab state. If a property does not have |
| 279 | // a per-tab value, the global value is used instead. |
| 280 | ``` |
| 281 | |
| 282 | This suggests that the lifetime of `icon_` is tied to the lifetime of the |
| 283 | ExtensionAction, which we can guess is tied to the lifetime of the Extension. As |
| 284 | long as the extension stays installed and enabled, `icon_` will not be |
| 285 | destroyed. |
| 286 | |
| 287 | Next, we use codesearch to look at all code that removes elements from `icon_`. |
| 288 | The only place that performs removal is |
Erik Chen | 527badd | 2018-04-18 21:37:59 | [diff] [blame] | 289 | |
| 290 | ``` |
| 291 | void ExtensionAction::ClearAllValuesForTab(int tab_id) { |
| 292 | ... |
| 293 | icon_.erase(tab_id); |
| 294 | ... |
| 295 | } |
| 296 | ``` |
| 297 | |
| 298 | This is called by `ExtensionActionAPI::ClearAllValuesForTab`, which is called by |
Erik Chen | dc47db1 | 2018-04-23 17:21:52 | [diff] [blame] | 299 | `TabHelper::DidFinishNavigation`. The name of this method suggests that each |
| 300 | time a tab is navigated, the previous tab-specific icon is cleared. However, |
| 301 | that means that if a tab is closed, then the icon is leaked forever. |