Build your own .NET memory profiler in C# — Allocations (1/2)

Christophe Nasarre
Criteo R&D Blog
Apr 18, 2020


In a previous post, I explained how to get statistics about the .NET Garbage Collector, such as suspension time or generation sizes. But what if you need more details about your application's allocations, such as how many instances of a given type were allocated, for what cumulative size, or even the allocation rate? This post explains how to get access to such information by writing your own memory profiler. The next one will show how to collect the stack trace of each sampled allocation.

Introduction

I have already used commercial tools to get detailed information about the type instances allocated by an application: Visual Studio Profiler, dotTrace, ANTS Memory Profiler, or PerfView, to name a few. With these tools in mind, I started to look at the .NET profiler API documentation, and it reminded me of the first time I read about the .NET profiler API. It was in December 2001, in Matt Pietrek's MSDN Magazine article (I still have the paper version). When your application starts, based on an environment variable, the .NET Framework (and now .NET Core) runtime loads a profiler COM object that implements a specific ICorProfilerCallback interface (today, runtimes support the 9th version, ICorProfilerCallback9). The methods of this interface are called by the runtime at specific moments during the application lifetime. For example, the ObjectAllocated method is called each time an instance of a class is allocated: perfect for the job, but it requires going back to COM and writing native code. Don't be scared: I won't go that way :^)

However, if you would like more details about writing your own .NET profiler in C or C++, I recommend looking at the initial .NET Framework implementation of the Microsoft ClrProfiler and also Pavel Yosifovich's DotNext session "Writing a .NET Core cross-platform profiler in an hour", with the corresponding (more recent and cross-platform) source code.

Instead, several events emitted by the CLR provide the interesting details:
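
As a reference, here is how the corresponding CLR provider keywords could be combined with TraceEvent (a minimal sketch based on the real ClrTraceEventParser.Keywords flags; as discussed below, you would pick either GCSampledObjectAllocationHigh or GCSampledObjectAllocationLow, not both):

```csharp
// CLR provider keywords used in this post (from ClrTraceEventParser.Keywords)
var keywords =
    ClrTraceEventParser.Keywords.GC                               // AllocationTick events
    | ClrTraceEventParser.Keywords.GCSampledObjectAllocationHigh // sampled allocation events
    | ClrTraceEventParser.Keywords.GCHeapAndTypeNames            // type ID -> name mapping
    | ClrTraceEventParser.Keywords.Type;                          // TypeBulkType events
```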

The GCSampledObjectAllocation event payload provides a type ID instead of a plain text type name. To retrieve the type name from its ID, we need to listen to the TypeBulkType event that contains the mapping, as I described in my post about finalizers. This is why the last two keywords, GCHeapAndTypeNames and Type, are needed.

Remember that if both the GCSampledObjectAllocationLow and GCSampledObjectAllocationHigh keywords are set, an event will be received for EACH allocation. This could be a performance issue both for the monitored application and for the profiler. I would recommend starting with either Low or High (more on this later).

Last but not least, enabling at least one of these keywords also switches the CLR to "slower" allocators. This is why you should check that it does not impact your application's performance. These slower allocators are also used when your ICorProfilerCallback::Initialize method calls SetEventMask with the COR_PRF_ENABLE_OBJECT_ALLOCATED flag to receive allocation notifications.

When you use PerfView for a memory investigation, you rely on these events without knowing it. In the Collect/Run dialog, three checkboxes define how the memory profiling details are obtained:

  • .NET Alloc: uses a custom native C++ ICorProfilerCallback implementation (noticeable impact on the profiled application performance).
  • .NET SampAlloc: uses the same custom native profiler but with sampled events.
  • ETW .NET Alloc: uses GCSampledObjectAllocationHigh events.

In all cases, the profiled application needs to be started after the collection begins.

How to listen to allocation events

As I have already explained in previous posts, the Microsoft TraceEvent NuGet package helps you listen to CLR events. First, you create a TraceEventSession and set up the providers you want to receive events from:
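
Something along these lines (a minimal sketch; the session name is arbitrary and the keywords are the ones listed earlier):

```csharp
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

var session = new TraceEventSession("AllocationProfilingSession");
session.EnableProvider(
    ClrTraceEventParser.ProviderGuid,   // the CLR provider
    TraceEventLevel.Verbose,
    (ulong)(ClrTraceEventParser.Keywords.GCSampledObjectAllocationHigh |
            ClrTraceEventParser.Keywords.GCHeapAndTypeNames |
            ClrTraceEventParser.Keywords.Type |
            ClrTraceEventParser.Keywords.GC));
```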

Second, you set up the handlers for the events you are interested in:
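
For example (the handler names are mine):

```csharp
session.Source.Clr.TypeBulkType += OnTypeBulkType;                          // type ID -> name mapping
session.Source.Clr.GCSampledObjectAllocation += OnSampledObjectAllocation;  // sampled allocations
session.Source.Clr.GCAllocationTick += OnAllocationTick;                    // ~100 KB sampled allocations
```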

And lastly, the processing of received events is done in a dedicated thread until the session is disposed of:
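
A minimal sketch:

```csharp
// Source.Process() blocks until the session is disposed,
// so call it from a dedicated thread
var processingTask = Task.Run(() => session.Source.Process());

// ... later, when the profiling should stop:
session.Dispose();   // unblocks Process() and ends the task
```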

Now let’s see the difference between the two sets of events.

The AllocationTick way

My first idea was to use the AllocationTick event because it seemed easy: one sampled event with a size, a type name, and a LOH/ephemeral kind. However, the way this sampling works makes it impossible to get an exact per-type allocated size. Let's have a look at this list of events received from a WPF test application:

Small | 105444 : FreezableContextPair[]
Small | 111908 : FreezableContextPair[]
Small | 106720 : System.String
Small | 102488 : System.String
Small | 107028 : System.TimeSpan[]
Small | 106100 : System.String

All allocations were small (i.e. not in the LOH: < 85,000 bytes) and the second column gives the cumulative size of all allocations needed to reach the 100 KB sampling threshold, not just allocations of this particular type! There is no easy way to make a valid guess of the size of the specific last allocation for which we get the type name.

For example, the first array of FreezableContextPair triggered the event for a cumulative size of 105,444 bytes. But how big was this array? We don't know: it could have been 100,000 bytes if only 5,444 bytes were allocated before, or only 10,444 bytes if 95,000 were allocated before. It would have been so useful if the size of the last allocated object were passed in the event payload…

It is a little bit different (but not much better) for objects allocated in the LOH, because they have to be at least 85,000 bytes long. For example, let's allocate 4 byte arrays, each one 85,000 bytes long, and look at the corresponding events:

Large | 170064 : System.Byte[]
Large | 170064 : System.Byte[]

Two AllocationTick events are received, each with 170,064 as the cumulative size. It is still hard to figure out the size of the last allocated array: the only thing we know is that it was larger than (or equal to) 85,000 bytes because it was allocated in the LOH.

For larger objects, it might seem a little more accurate. Let's allocate 2 byte arrays, each one 110,000 bytes long:

Large | 195064 : System.Byte[]
Large | 110032 : System.Byte[]

There is an ~85,000-byte difference between the two events even though the same 110,000 bytes were allocated each time. You could subtract 85,000 bytes from the value to get an approximation of the size of the LOH-allocated object: the larger the allocation, the smaller the error. But still: the error could be up to 85,000 bytes…

So we won’t be able to rely on the size provided by the AllocationTick event; only the type name. In addition, you get a view of objects allocated in LOH. Maybe the other events will provide better results.

The GCSampledObjectAllocation way

When an object is allocated by the GC allocator, a GCSampledObjectAllocation event is emitted under one of the following conditions:

  • Both GCSampledObjectAllocationLow and GCSampledObjectAllocationHigh keywords are set on the CLR provider (i.e. every allocation is reported),
  • The object size is larger than 10,000 bytes,
  • At least 1,000 instances of the type have been allocated,
  • Just before the application exits, the current statistics for all types are flushed,
  • A complicated piece of code decides, based on the time since the last event and the type's allocation rate.

Picking one keyword or the other changes the minimum number of milliseconds between two events for a given type:

  • High (10 ms): up to 100 events / second
  • Low (200 ms): up to 5 events / second

You should use Low or High depending on the memory allocation workload of the monitored application, to avoid impacting the profiler too much (and even the monitored application's performance).

The interesting feature of these events is that, for a given type, the payload contains both the number of instances allocated since the last event and the cumulative size of these instances. Let's take the same allocation of 4 byte arrays, each 85,000 bytes long:

226 | 103616 : System.Byte[]
1 | 85012 : System.Byte[]
1 | 85012 : System.Byte[]
1 | 85012 : System.Byte[]

This time, we get the exact count in the first column (ObjectCountForTypeSample) and the exact cumulative size in the second column (TotalSizeForTypeSample). If the count is 1, we have the exact size of that allocation, and if it is bigger than 85,000 bytes, we know it has been allocated in the LOH. The same accuracy applies to the 2 byte arrays of 110,000 elements:

198 | 123552 : System.Byte[]
1 | 110012 : System.Byte[]

Sounds good. However, you have to remember that profiled applications need to be started after the session is created: it means that you can't write a tool that listens to a specific process ID, as you can with AllocationTick. Three dictionaries are used by PerProcessProfilingState to keep track of per-type allocations, type ID mappings, and process names:
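
Here is a minimal sketch of what such a state class could look like (the property names are assumptions, not necessarily the ones from the actual source code):

```csharp
public class PerProcessProfilingState
{
    // process ID -> per-type allocation statistics
    public Dictionary<int, ProcessAllocationInfo> Allocations { get; } =
        new Dictionary<int, ProcessAllocationInfo>();

    // process ID -> (type ID -> type name) mapping built from TypeBulkType events
    public Dictionary<int, Dictionary<ulong, string>> Types { get; } =
        new Dictionary<int, Dictionary<ulong, string>>();

    // process ID -> process name
    public Dictionary<int, string> Names { get; } =
        new Dictionary<int, string>();
}
```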

The SampledObjectAllocationMemoryProfiler class uses it to process the events:
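
Its skeleton could look like this (a sketch; only the fields discussed in this post are shown):

```csharp
using System.Diagnostics;
using Microsoft.Diagnostics.Tracing.Session;

public class SampledObjectAllocationMemoryProfiler
{
    private readonly int _currentPid;
    private readonly TraceEventSession _session;
    private readonly PerProcessProfilingState _processes;

    public SampledObjectAllocationMemoryProfiler(
        TraceEventSession session, PerProcessProfilingState processes)
    {
        // remember our own process ID so our own events can be skipped
        _currentPid = Process.GetCurrentProcess().Id;
        _session = session;
        _processes = processes;
    }

    // event handlers shown below
}
```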

The constructor of the profiler keeps track of its own process ID in _currentPid to skip its own events.

Gathering type mapping

The processing of TypeBulkType events is quite straightforward: store the type ID/name association into a per-process dictionary:
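
A sketch of the handler (FilterOutEvent and GetProcessTypesMapping are the helpers mentioned in the text; the payload properties are the real GCBulkTypeTraceData ones):

```csharp
private void OnTypeBulkType(GCBulkTypeTraceData data)
{
    if (FilterOutEvent(data)) return;  // skip events emitted by our own process

    // GetProcessTypesMapping returns the per-process dictionary
    // (and resolves the process name; see below)
    var types = GetProcessTypesMapping(data.ProcessID);
    for (int i = 0; i < data.Count; i++)
    {
        GCBulkTypeValues value = data.Values(i);
        types[value.TypeID] = value.TypeName;
    }
}
```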

Remember that I chose to skip events coming from the current process, detected by FilterOutEvent().

How to get process names

Even though each event contains the ID of the emitting process, it would be better to display its name instead. You could call Process.GetProcessById(pid).ProcessName when analyzing the details, but the process might be long gone by then.

Another solution would be to enable the kernel ETW provider and listen to the ProcessStart event: the ImageFileName field of the payload contains the process filename with its extension. However, this obviously does not work on Linux.

The easiest solution is to use GetProcessById, but only when you receive the first type mapping for a given process. This is the role of the AssociateProcess method called in GetProcessTypesMapping shown previously:
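
A minimal sketch of such a method (error handling simplified):

```csharp
private void AssociateProcess(int pid)
{
    if (_processes.Names.ContainsKey(pid)) return;  // already resolved

    try
    {
        _processes.Names[pid] = Process.GetProcessById(pid).ProcessName;
    }
    catch (ArgumentException)
    {
        // the process is already gone: keep the pid as the "name"
        _processes.Names[pid] = pid.ToString();
    }
}
```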

It is now time to process allocation events.

Collecting allocation details

The GCSampledObjectAllocationTraceData payload contains the count of instances allocated since the last event and their cumulative size. We just need to store them for the corresponding process:
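
A sketch of the handler (GetTypeName is a hypothetical helper that looks the type name up in the per-process mapping built from TypeBulkType events):

```csharp
private void OnSampledObjectAllocation(GCSampledObjectAllocationTraceData data)
{
    if (FilterOutEvent(data)) return;  // skip events emitted by our own process

    string typeName = GetTypeName(data.ProcessID, data.TypeID);
    AddAllocation(data.ProcessID, data.ObjectCountForTypeSample,
                  data.TotalSizeForTypeSample, typeName);
}
```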

The AddAllocation() helper method simply accumulates these numbers for a given type in the ProcessAllocationInfo associated with the related process.

Displaying the results

When the profiling session ends, it is easy to show the allocated count and size per type. The code uses a LINQ query to get the top allocations, sorted either by count or by size:
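
For example, sorting by cumulative size (a sketch assuming the per-type records expose Count, Size, and TypeName properties; the real property names may differ):

```csharp
var topAllocations = allocations
    .OrderByDescending(a => a.Size)   // or a.Count to sort by instance count
    .Take(10);

foreach (var allocation in topAllocations)
{
    Console.WriteLine($"{allocation.Count,9} {allocation.Size,12}  {allocation.TypeName}");
}
```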

Another usage could be in a long-running monitoring system that shows the allocation rate: a nice complement to the other GC metrics. However, compared to the other profilers, one important feature is missing: if an unexpected number of instances is created, how do you know which part of the code is responsible for the spike?

The next post will explain how to enhance such a sampled memory profiler with call stacks per sampled allocation.
