| # Investigating Out of Memory crashes |
| |
| A large fraction of process crashes in Chromium are due to Out Of Memory (OOM) |
| conditions. This page is meant to help Chromium developers understand the |
| resulting stack traces and investigate them. Note that some of the |
| documentation here only applies to Google Chrome, as it is specific to the way |
| Google's crash reporting infrastructure aggregates and reports crashes. |
| |
| Some of the following also assumes that the `malloc()` implementation is |
| PartitionAlloc, which, as of 2022, is the case on most platforms. |
| |
| [TOC] |
| |
| ## Identifying OOM crashes |
| |
| When a process crashes due to an Out Of Memory condition, this is usually |
| signaled by the presence of `base::internal::OnNoMemoryInternal()` on the stack. |
| |
| **Google Chrome only:** the crash reporting infrastructure tags these as "[Out |
| of Memory]" based on this and other function names. The full list is determined |
| in the (internal) crash server's code. |
| |
| Since Chromium configures its memory allocators to prefer crashing rather than |
| returning `nullptr`, an OOM crash can be triggered from anywhere in the code, |
| most commonly from within the allocator or from higher-level functions such as |
| `operator new` in C++. |
| |
| ## Distinguishing between underlying causes |
| |
| ### Different causes |
| |
| A process can reach an OOM condition for several reasons: |
| |
| * **The OS is truly out of memory**, regardless of how much memory the *current* |
| process is using |
| * **Some limit inside the OS is reached**. For instance, on Windows, there |
| exists a global "commit limit", which is the amount of memory that the system |
| can commit. Note that it is possible to commit more memory than what is |
| actually in use. This may also happen on Linux systems configured with no or |
| limited "overcommit", though the majority of systems don't have a limit. |
| * **Virtual address space exhaustion**. This is most likely to happen for relatively |
| large allocations, on 32 bit systems, where total addressable space is |
| typically 2GiB (most Windows systems), 3GiB (e.g. some Windows configurations, |
| Linux) or 4GiB (e.g. WoW64). However, it may also happen on 64 bit systems, |
| either due to: |
|   * Limited virtual addressable space in the CPU/OS. For instance most Android |
|     ARM64 systems have only 40 bits of address space as of 2022. |
|   * "Cage" exhaustion. This is most likely to happen with PartitionAlloc on 64 |
|     bit systems, where all allocations are grouped into a single contiguous |
|     virtual address space "cage". |
| * **Sandbox per-process memory limit**. For some process types (e.g. Renderers) |
| and on most platforms, the sandbox enforces a maximum per-process memory |
| limit. Given that this limit is typically set at the OS level, it may not be |
| distinguishable from e.g. commit limit exhaustion. |
| * **Excessive allocation size**. Some allocators (notably PartitionAlloc) |
| purposely limit the maximum allocation size. |
| |
| ### Identifying the cause |
| |
| In the case of PartitionAlloc, it is possible to distinguish some of these cases: |
| |
| * **Virtual address space exhaustion**. This is identified by the presence of |
| `PartitionOutOfMemoryMappingFailure()` on the stack. It means that the |
| allocator was unable to find enough address space, either for its internal |
| memory allocation unit size, or the requested size. Since memory is *not* |
| committed at this step, this signals an address space issue. |
| * **Commit failure**. This is identified by the presence of |
| `PartitionOutOfMemoryCommitFailure()` on the stack. This signals that either |
| an OS limit (e.g. the commit limit) or the sandbox limit has been reached. |
| * **Excessive allocation size**. Shown by `PartitionExcessiveAllocationSize()` |
| on the stack. |
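
The mapping above can be sketched as a small lookup over symbolized frame names. This is only an illustration: `OomCause` and `ClassifyOomStack()` are hypothetical names, not part of Chromium, and the sketch assumes the frames are already available as strings. It matches on the bare function names listed above so that it does not depend on any particular namespace.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical labels for the causes PartitionAlloc lets us tell apart.
enum class OomCause { kAddressSpace, kCommit, kExcessiveSize, kUnknown };

// Returns the cause associated with the first marker function found in
// `frames`, scanning from the top of the stack. Frames are expected to be
// symbolized function names, as shown by a debugger or crash dashboard.
OomCause ClassifyOomStack(const std::vector<std::string>& frames) {
  struct Marker {
    std::string_view symbol;
    OomCause cause;
  };
  static constexpr Marker kMarkers[] = {
      {"PartitionOutOfMemoryMappingFailure", OomCause::kAddressSpace},
      {"PartitionOutOfMemoryCommitFailure", OomCause::kCommit},
      {"PartitionExcessiveAllocationSize", OomCause::kExcessiveSize},
  };
  for (const std::string& frame : frames) {
    for (const Marker& marker : kMarkers) {
      if (frame.find(marker.symbol) != std::string::npos) {
        return marker.cause;
      }
    }
  }
  // No marker found: the stack alone does not identify the cause.
  return OomCause::kUnknown;
}
```

A stack without any of these markers stays `kUnknown`, which matches the text: only some of the causes can be distinguished from the stack.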
| |
| |
| ## What to do? |
| |
| ### Commit Limit Reached |
| |
| The process is "truly" out of memory, or the system is. Some amount of these |
| crashes is expected, and the crashing location is not necessarily the |
| culprit. Indeed, as a rough approximation, the failing allocation is more likely |
| to be from a component naturally allocating a lot of memory, e.g. V8 or |
| rendering. |
| |
| However, if there is a spike, and many stack traces come from an unusual |
| location (e.g. newly added code), this may signal a memory leak in the component |
| on the stack, or excessive temporary allocations. |
| |
| Also, if `PartitionAllocDirectMap()` is on the stack, the memory allocation was |
| large. It may come from a large buffer, potentially made worse by buffer |
| resizing. For instance, a `std::vector` typically grows its capacity |
| geometrically (often doubling it) when it runs out of capacity. In that case, |
| calling `reserve()` with the right size ahead of time may help. |
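
To make the effect of `reserve()` concrete, here is a minimal sketch that counts how often a vector changes capacity while being filled. `CountReallocations()` is a hypothetical helper written for this page, and the exact growth factor depends on the standard library in use.

```cpp
#include <cstddef>
#include <vector>

// Appends `count` integers to a vector and returns how many times its
// capacity changed along the way. Each change means a new, larger buffer was
// allocated while the old one was still alive, which raises peak memory use.
std::size_t CountReallocations(std::size_t count, bool reserve_first) {
  std::vector<int> values;
  if (reserve_first) {
    values.reserve(count);  // One allocation of the final size, up front.
  }
  std::size_t reallocations = 0;
  std::size_t last_capacity = values.capacity();
  for (std::size_t i = 0; i < count; ++i) {
    values.push_back(static_cast<int>(i));
    if (values.capacity() != last_capacity) {
      ++reallocations;
      last_capacity = values.capacity();
    }
  }
  return reallocations;
}
```

With `reserve_first` set, the function returns 0 because no growth happens inside the loop. Without it, the vector reallocates several times, and for a very large buffer each of those steps briefly needs both the old and the new buffer.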
| |
| ### Excessive allocation size |
| |
| Is the calling code expected to allocate more than 2GiB? Or is it an underflow |
| somewhere in the calling code? |
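
As an illustration of the underflow case, the sketch below shows how an unsigned subtraction can turn a small size into an enormous one, and one way to guard against it. Both function names are hypothetical and not taken from Chromium; in Chromium code, the checked arithmetic helpers in `base/numerics` are the more idiomatic way to do this.

```cpp
#include <cstddef>
#include <optional>

// Buggy pattern: if `header` is larger than `total`, the unsigned subtraction
// wraps around and yields a value close to the maximum of std::size_t.
std::size_t UncheckedPayloadSize(std::size_t total, std::size_t header) {
  return total - header;
}

// Guarded version: reports the error instead of returning a wrapped value.
std::optional<std::size_t> CheckedPayloadSize(std::size_t total,
                                              std::size_t header) {
  if (header > total) {
    return std::nullopt;
  }
  return total - header;
}
```

Passing the result of `UncheckedPayloadSize(4, 16)` to an allocation would request far more than any allocator limit, which is how a logic bug can show up as an excessive allocation size.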
| |
| ### Virtual address space |
| |
| On 32 bit systems, this is most likely to occur when overall memory usage is |
| high, or when the allocation size request is large. Is the calling code |
| allocating a very large buffer? |
| |
| ## Debugging |
| |
| ### General |
| |
| On Windows, the allocation size is added into the exception record. In Google |
| Chrome's crash dashboard, this is shown in "Parameter[0]" of the exception |
| info. On other operating systems, the allocation size is put on the stack before |
| crashing, and is thus visible in minidumps. |
| |
| ### PartitionAlloc and Google specific |
| |
| 1. Starting from a specific report, click on the bug icon to start a cloud lldb |
| instance. |
| 2. Locate the `PartitionRoot<true>::OutOfMemory()` frame on the stack and move to |
| it with `f <frame_number>` (`f 5` in the example below). |
| 3. Locate the stack addresses by printing the registers with `re re` (short for |
| `register read`). |
| 4. Show the stack content with `x <stack_pointer> <frame_pointer>`. |
| |
| Below is an example for a crash on x86_64: |
| |
| ``` |
| ( lizeb ) bt |
| * thread #1, stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x10c45912f) |
| * frame #0: 0x000000010c45912f Google Chrome Framework`base::internal::OnNoMemoryInternal(unsigned long) at memory.cc:62 |
| frame #1: 0x000000010c459149 Google Chrome Framework`base::TerminateBecauseOutOfMemory(unsigned long) at memory.cc:69 |
| frame #2: 0x000000010c4f39c6 Google Chrome Framework`OnNoMemory(unsigned long) at oom.cc:17 |
| frame #3: 0x000000010d7e5794 Google Chrome Framework`WTF::PartitionsOutOfMemoryUsing2G(unsigned long) at partitions.cc:281 |
| frame #4: 0x000000010d7e4d2c Google Chrome Framework`WTF::Partitions::HandleOutOfMemory(unsigned long) at partitions.cc:415 |
| frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521 |
| [...] |
| ( lizeb ) f 5 |
| frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521 |
| ( lizeb ) re re |
| General Purpose Registers: |
| rbp = 0x00007ffee7012c50 |
| rsp = 0x00007ffee7012bf0 |
| rip = 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) + 196 at partition_root.cc:522 |
| 21 registers were unavailable. |
| ( lizeb ) x 0x00007ffee7012bf0 0x00007ffee7012c50 |
| 0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00 va_size......... |
| 0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00 alloc. .--..... |
| 0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00 commit. ........ |
| 0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00 size. .. ..... |
| 0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00 ................ |
| 0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00 .. .....H"...... |
| ``` |
| |
| The results here can help the PartitionAlloc team to identify issues, as |
| important metrics from PartitionAlloc are saved on the stack, as shown above. |
| For instance, the `va_size` entry (virtual address space usage) reads, once |
| decoded from little endian, 0x07000000. |
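
The decoding can be checked with a short sketch. It assumes, based only on the dump above, that each 16-byte row holds an 8-byte ASCII tag followed by an 8-byte little-endian value. `DecodeLittleEndian64()` and `kVaSizeBytes` are names made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes 8 bytes stored least significant byte first into a 64-bit value.
std::uint64_t DecodeLittleEndian64(const std::uint8_t* bytes) {
  std::uint64_t value = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    value |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
  }
  return value;
}

// Second half of the first dump row, right after the "va_size" tag.
constexpr std::uint8_t kVaSizeBytes[8] = {0x00, 0x00, 0x00, 0x07,
                                          0x00, 0x00, 0x00, 0x00};
```

`DecodeLittleEndian64(kVaSizeBytes)` gives 0x07000000, i.e. 112 MiB. Applying the same function to the other rows gives 0x012d2d00 for `alloc`, 0x019da000 for `commit` and 0x00200000 (2 MiB) for `size`.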