Thursday, October 24, 2013

Profiling Memory Usage: A Lamentation

Profiling memory usage is an exceedingly complex problem, one that Telemetry has tried to address in different ways over the years.  Much like performance profiling, the task of analyzing memory usage doesn't seem terribly difficult when looking at specific use cases.  However, once we try to generalize to a cross-platform implementation that handles every conceivable use case, it becomes nearly intractable.  There are two major problems that memory profiling has to solve: capturing memory event data, then digesting that data into something meaningful.

Collecting all the allocation and free events is simple conceptually, but rife with issues in practice.  There are two obvious methods for tracking memory events: explicit markup at the point of allocation, or hooking the system allocators.

Explicit markup requires the programmer to annotate every memory operation manually, much like using Telemetry with zones:

Foo *foo = malloc( sizeof( Foo ) );
tmAlloc( cx, foo, sizeof(*foo), "foo" );

For a language like C this is relatively straightforward, but it's tedious and error prone.

For C++ it's even worse, because the equivalent case:

Foo *foo = new Foo;
tmAlloc( cx, foo, sizeof(*foo), "foo" );

is potentially misleading, since sizeof( *foo ) does not account for any further allocations made while constructing its members (for example, member variables that are STL containers).
In both cases with explicit markup, any memory that's allocated outside of your control (incidental allocations within STL, external libraries, etc.) is effectively invisible.
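
To make the C++ case concrete, here is a minimal sketch of how much memory can hide behind a single marked-up new.  Everything named here (Foo, CountingAllocator, hiddenBytesForNewFoo) is hypothetical and invented for this illustration; none of it is part of Telemetry.  The counting allocator only exists so we can observe the container's allocation, which explicit markup at the call site would never see.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Bytes requested by member containers, i.e. memory the call site never sees.
static std::size_t g_hiddenBytes = 0;

// Hypothetical minimal allocator that counts what a container requests.
template <typename T>
struct CountingAllocator
{
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator( const CountingAllocator<U> & ) {}

    T *allocate( std::size_t n )
    {
        g_hiddenBytes += n * sizeof( T );
        return static_cast<T *>( ::operator new( n * sizeof( T ) ) );
    }
    void deallocate( T *p, std::size_t ) { ::operator delete( p ); }
};

template <typename T, typename U>
bool operator==( const CountingAllocator<T> &, const CountingAllocator<U> & ) { return true; }
template <typename T, typename U>
bool operator!=( const CountingAllocator<T> &, const CountingAllocator<U> & ) { return false; }

// Hypothetical type whose constructor allocates on the heap.
struct Foo
{
    std::vector<int, CountingAllocator<int>> values;
    Foo() : values( 1000 ) {}
};

// Returns the bytes allocated during construction that a markup call
// reporting sizeof( *foo ) would miss.
std::size_t hiddenBytesForNewFoo()
{
    const std::size_t before = g_hiddenBytes;
    Foo *foo = new Foo;   // explicit markup here would report only sizeof( *foo )
    assert( foo->values.size() == 1000 );
    const std::size_t hidden = g_hiddenBytes - before;
    delete foo;
    return hidden;
}
```

The object itself is only a few pointers wide, while the vector it owns requests storage for a thousand ints behind our back.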

At least with Telemetry's zones, missing markup still implies a lot about performance.  A gap in time still indicates that time was spent, just in an area we've yet to annotate:
[Figure: Even a missing zone can tell us something]
There's no equivalent way to infer memory usage with explicit markup.

So why don't we just trap the appropriate system-wide memory allocator calls?  On a single platform this may be feasible, but doing it across every system we support is impractical and, frankly, bad practice, since it can introduce subtle bugs and side effects.  And it doesn't even solve the problem fully: some programmers want memory usage analysis for their own allocators, and those may still require explicit markup.

There is a compromise solution, where you overload your allocators locally (trapping malloc and/or overloading operators new and delete) and mark things up there:

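#include <cstdlib>   // malloc, size_t
#include <new>       // std::bad_alloc

void *operator new( size_t s )
{
   void *p = malloc( s );
   if ( !p )
      throw std::bad_alloc();    // operator new must not return null
   tmAlloc( cx, p, s, "new" );   // report the allocation to Telemetry
   return p;
}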

This works, but we lose context about the allocation such as where it occurred or what it was being used for.  We could work around this by grabbing the callstack as well (slow) or by explicitly passing down additional contextual information (tedious), but neither of those solutions is ideal.
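
As a rough sketch of the "explicitly passing down contextual information" option, a placement form of operator new can carry the file and line of the call site.  The names AllocRecord, g_lastAlloc, lastTrackedAlloc and TRACKED_NEW are hypothetical, made up for this example; a real integration would forward the fields to the profiler rather than stash them in a global.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical record of the most recent tracked allocation.
struct AllocRecord
{
    void        *ptr;
    std::size_t  size;
    const char  *file;
    int          line;
};

static AllocRecord g_lastAlloc = { nullptr, 0, nullptr, 0 };

// Placement form of operator new that carries the call site along.
void *operator new( std::size_t s, const char *file, int line )
{
    assert( file != nullptr );
    void *p = ::operator new( s );   // throws std::bad_alloc on failure
    g_lastAlloc = AllocRecord{ p, s, file, line };
    return p;
}

// Matching placement delete; the runtime calls it only when a constructor
// throws after the allocation above succeeded.
void operator delete( void *p, const char *, int ) noexcept
{
    ::operator delete( p );
}

// Hypothetical convenience macro that captures the call site.
#define TRACKED_NEW new ( __FILE__, __LINE__ )

const AllocRecord &lastTrackedAlloc() { return g_lastAlloc; }
```

Because the placement form forwards to the global operator new, objects created with TRACKED_NEW are still released with an ordinary delete.  The tedium is plain to see, though: every allocation site has to opt in, and anything allocated inside a library still goes untracked.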

As you can see, simply gathering the data related to memory usage is a difficult problem in and of itself.  Currently Telemetry gives you two ways to do this: fine-grained control using explicit markup, or implicit context by using what we refer to as 'memory zones'.  That's a topic for another blog post though!

For now, let's assume that we magically have the ability to gather relevant memory events and some contextual information.  This brings us to our second problem: what do we do with that information?

Memory events define intervals of usage, e.g. from time A to time B we had N bytes allocated by system Z.  That's a lot of useful information, and it can feed all kinds of interesting analysis.  Of course, there's also a ton of noise in there that isn't particularly relevant or interesting.  Sometimes we care about specific allocations, but most of the time we only care about logical groups.  Even then, modern applications generate immense numbers of small, short-lived allocations that can clutter our view instead of contributing to it (but not always!).
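
Here is one way the pairing of events into intervals could look.  This is a sketch under assumptions, not Telemetry's implementation: MemEvent, Interval and buildIntervals are hypothetical names, and we assume the events arrive sorted by time and that an address identifies at most one live allocation at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical captured event: one allocation or one free.
struct MemEvent
{
    enum Kind { Alloc, Free };
    Kind           kind;
    std::uint64_t  time;      // timestamp in arbitrary ticks
    std::uintptr_t address;
    std::size_t    size;      // ignored for Free events
    std::string    system;    // logical owner, e.g. "audio"
};

// "From time begin to time end we had size bytes allocated by system."
struct Interval
{
    std::uint64_t begin;
    std::uint64_t end;
    std::size_t   size;
    std::string   system;
    bool          freed;      // false if still live when the capture ended
};

std::vector<Interval> buildIntervals( const std::vector<MemEvent> &events,
                                      std::uint64_t captureEnd )
{
    std::unordered_map<std::uintptr_t, MemEvent> live;
    std::vector<Interval> intervals;

    for ( const MemEvent &e : events )
    {
        if ( e.kind == MemEvent::Alloc )
        {
            live[e.address] = e;
            continue;
        }
        auto it = live.find( e.address );
        if ( it == live.end() )
            continue;   // free without a recorded alloc, skip it
        assert( e.time >= it->second.time );
        intervals.push_back( Interval{ it->second.time, e.time,
                                       it->second.size, it->second.system, true } );
        live.erase( it );
    }

    // Whatever is still live gets closed off at the end of the capture.
    for ( const auto &kv : live )
        intervals.push_back( Interval{ kv.second.time, captureEnd,
                                       kv.second.size, kv.second.system, false } );
    return intervals;
}
```

Even this toy version hints at the cost: every allocation needs a map entry while it is live, and every allocation ever made becomes a stored interval.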

Given these memory allocation intervals, we could theoretically generate a lot of fun analysis:
  • leak detection
  • allocation churn (rate of change of number of individual allocations)
  • allocated churn (rate of change of total memory allocated)
  • life spans of allocations (by file, location, size, system, etc.)
  • peak allocation over time, in total and by system
  • allocation usage over time by specific systems
  • distribution of allocations at a point in time
  • address space fragmentation at a point in time
  • topographical (treemap) visualization of memory allocation at a point in time
  • etc. etc.
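
Two of the simpler items on that list, peak allocation and leak candidates by system, can be sketched in a single pass over the events.  As before, MemEvent, Summary and summarize are hypothetical names for illustration only, and the sketch assumes time-ordered, well-formed events (no address is allocated twice without a free in between).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical captured event: one allocation or one free.
struct MemEvent
{
    bool           isAlloc;
    std::uint64_t  time;
    std::uintptr_t address;
    std::size_t    size;      // ignored for frees
    std::string    system;    // logical owner, e.g. "audio"
};

struct Summary
{
    std::size_t   peakBytes = 0;   // highest total of live bytes seen
    std::uint64_t peakTime  = 0;   // time of the allocation that reached it
    std::map<std::string, std::size_t> leakedBySystem;   // bytes never freed
};

Summary summarize( const std::vector<MemEvent> &events )
{
    std::unordered_map<std::uintptr_t, const MemEvent *> live;
    std::size_t liveBytes = 0;
    Summary s;

    for ( const MemEvent &e : events )
    {
        if ( e.isAlloc )
        {
            live[e.address] = &e;
            liveBytes += e.size;
            if ( liveBytes > s.peakBytes )
            {
                s.peakBytes = liveBytes;
                s.peakTime  = e.time;
            }
            continue;
        }
        auto it = live.find( e.address );
        if ( it == live.end() )
            continue;   // free without a recorded alloc, skip it
        assert( liveBytes >= it->second->size );
        liveBytes -= it->second->size;
        live.erase( it );
    }

    // Anything still live at the end of the capture is a leak candidate.
    for ( const auto &kv : live )
        s.leakedBySystem[kv.second->system] += kv.second->size;
    return s;
}
```

Note that "still live at the end of the capture" only makes an allocation a leak candidate; long-lived allocations that are released at shutdown look exactly the same in this view.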

Most programmers, however, are only interested in a subset of that information. Unfortunately all of the above analysis can be expensive and storage hungry, and it's often difficult to visualize side by side with plot or zone data.  It's a difficult problem to solve to everyone's satisfaction, but we've started taking what we think are some good first steps to make this happen.
