Thursday, October 24, 2013

Telemetry 2.0f released!


Telemetry 2.0f has been released!  This is primarily a bugfix and maintenance release.

Run-Time

  • Change: the autogenerated system/working set plot is now part of the (mem) group
  • Change: TMCT_FILE now defaults to emulating Telemetry 2 server

Server

  • Change: updated to most recent SQLite
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's cooking should not crash anymore
  • Bugfix: fixed string table bug where NULL strings were being set to the wrong value
  • Bugfix: fixed string id generation bug when generating plots
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Visualizer

  • Enhancement: ctrl-V now pastes into input text lines
  • Enhancement: TMPT_TIME_CYCLES plots now print in an appropriate human-readable time scale
  • Change: Changed "Cooking" to "Processing" to avoid idiomatic terminology
  • Bugfix: fixed plot LOD rendering bug
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's processing should not crash anymore
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Other

  • Change: PS4 dependency documentation updated

Experimental Feature: Memory Zones

As mentioned in a previous blog post, profiling memory can be a difficult problem, not just technically (acquiring the data) but also in terms of usability (what do we do with all that data?!)

Telemetry has always allowed you to tag and track individual allocations using the tmAlloc/tmFree APIs; however, this is often overkill (tracking all allocations); noisy (lots of stuff tracked we don't care about); expensive (lots of data); tedious (lots of markup); and, in the end, hard to parse (no clear structure to visualize over time).  It's still the right low-level API, since it puts control in the programmer's hands, but a cleaner mechanism on top of it has always been desirable yet hard to define.

Telemetry 2 is looking to provide that using an experimental new feature called memory zones.  These are just normal Telemetry zones with a new flag attached to them, TMZF_MEM_REGION_EXPERIMENTAL.  Specially annotated allocations that occur inside this zone are logically grouped and plotted (for now, the individual memory intervals are still recorded so that we can do more analysis later if we choose).

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;

    // note the empty description!
    tmAlloc( cx, f, sizeof(*f), "" );

    bar* b = new bar;

    // note the empty description!
    tmAlloc( cx, b, sizeof(*b), "" );
    ...
    delete b;
    delete f;
}


With the above code snippet, 'f' and 'b' would contribute to the 'MyFunc' bucket.  While useful on its own, this really starts to help when you overload operators new and delete to take advantage of it:

void *operator new( size_t s )
{
   void *p = malloc( s );

   // empty description means 'use mem zone'
   tmAlloc( cx, p, s, "" );
   return p;
}
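
For completeness, the matching operator delete should report the free via tmFree before releasing the storage.  The sketch below pairs the two overloads; note that the tmAlloc/tmFree macros and the cx handle here are simple stand-ins we defined so the snippet compiles on its own (the real declarations come from the Telemetry SDK):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-ins for the real Telemetry API so this sketch is self-contained;
// in real code cx, tmAlloc, and tmFree come from the Telemetry SDK.
static int g_live = 0; // live tracked allocation count (our stub state)
#define tmAlloc( cx, p, size, desc ) ( (void)(p), ++g_live )
#define tmFree( cx, p )              ( (void)(p), --g_live )
static int cx = 0; // hypothetical context handle

void *operator new( std::size_t s )
{
    void *p = malloc( s );
    tmAlloc( cx, p, s, "" ); // empty description means 'use mem zone'
    return p;
}

void operator delete( void *p ) noexcept
{
    tmFree( cx, p ); // report the free before releasing the storage
    free( p );
}
```

With both overloads in place, every new/delete pair reports a matching alloc/free event without any per-call-site markup.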


Then we can remove the tmAlloc calls from MyFunc:

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;
    bar* b = new bar;

    std::vector< foo > v;
    v.push_back(*f);
    ...
    delete b;
    delete f;
}


What happens now is that when 'foo' and 'bar' are allocated, our overloaded operator new will be within the MyFunc memory zone and accounted for properly.  Not only that, but any memory allocations trapped by our overloaded operator new (such as implicit ones by STL) are gathered as well, so in the above case any incidental allocations that result from manipulating the vector will be accounted for automatically.

In practice this means that you can now progressively mark up memory allocations exactly the same way you do with zones.  In fact, Telemetry will discard allocation and free events with empty descriptions if they occur outside of a mem zone, which reduces overhead considerably:

void MyFunc()
{
    tmEnter( cx, TMZF_MEM_REGION_EXPERIMENTAL, "tracked" );

    P *p = new P(); // overloaded operator new traps and tracks
    Q *q = new Q(); // overloaded operator new traps and tracks

    tmLeave( cx );
    R *r = new R(); // ignored on server (outside any mem zone!)
}
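
Conceptually, the server's bookkeeping for this rule can be modeled as a stack of open mem zones plus a per-zone byte total.  The sketch below is our simplified, hypothetical model of that logic, not the actual server implementation:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified model: a stack of open mem zones plus per-zone byte totals.
struct MemZoneModel {
    std::vector<std::string> zoneStack;        // open mem zones, innermost last
    std::map<std::string, size_t> bytesByZone; // attributed bytes per bucket

    void enter( const std::string &name ) { zoneStack.push_back( name ); }
    void leave() { zoneStack.pop_back(); }

    // Returns true if the allocation was attributed, false if discarded.
    bool alloc( size_t bytes, const std::string &desc ) {
        if ( desc.empty() ) {
            if ( zoneStack.empty() )
                return false;                       // empty desc outside a mem zone: discarded
            bytesByZone[zoneStack.back()] += bytes; // attribute to innermost zone
            return true;
        }
        bytesByZone[desc] += bytes;                 // explicit desc: bucketed by name
        return true;
    }
};
```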


The brute force option to get up and running would be to wrap your application's main() function in one giant memory zone.  While this would work, it would generate a lot of traffic and, as a result, be somewhat meaningless until you add more memory zones (the full hierarchy is preserved, so more zones = more detail).

A less intensive approach would be to progressively annotate "leafy" pieces of code and work upwards from there.  For example, start with your audio system or your level loader, see what it looks like, then go back and adjust your markup, try it again, etc.  This is very analogous to how zone markup is usually done with Telemetry.

What does this buy you?  For starters, a nice clean way of discarding/ignoring memory allocations, which can help keep traffic and total data storage requirements under control.

Second, you gain automatic plotting of memory usage by memory zone.  This lets you look at memory usage at a high level then, if necessary, zoom into areas of interest and look at individual memory allocations.

Plots showing memory usage of sMainLoop
If we were to zoom in a bit, we could see the usage pattern in more detail, specifically how usage ramps up and then drops back down (a pattern associated with temporaries).  Each data point encompasses dozens of samples.
Zoomed in we can see usage pattern in more detail
By dropping all the way down to the zone view, we can see the individual allocations.  Given that we're looking at hundreds of tiny allocations, tracking them individually would not have given us very much useful data.

Individual memory intervals are still visible
If this system proves popular, we plan on adding 'memory exclusion' zones, which would discard any memory events that occur inside them.  In addition, we'll add a configuration option to control the rate at which the server generates allocation plots (currently the threshold is around 16K).

Finally, we'll add a bunch of visualization features so that we can take advantage of this system, which is really what we and our customers want!


Profiling Memory Usage: A Lamentation

Profiling memory usage is an exceedingly complex problem, one that Telemetry has tried to address in different ways over the years.  Much like performance profiling, the task of analyzing memory usage doesn't seem terribly difficult when looking at specific use cases.  However once we try to generalize to a cross-platform implementation that handles every conceivable use case...it becomes nearly intractable.  There are two major problems that memory profiling has to solve: capturing memory event data, then somehow digesting that data into something meaningful.

Collecting all the allocation and free events is conceptually pretty simple, but rife with issues in practice.  There are two obvious methods for tracking memory events: explicit markup at the point of allocation, or hooking the system allocators.

Explicit markup requires the programmer to annotate every memory operation manually, much like using Telemetry with zones:

Foo *foo = malloc( sizeof( Foo ) );
tmAlloc( cx, foo, sizeof(*foo), "foo" );

For a language like C this is relatively straightforward, but it's tedious and error prone.
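
One common way to tame that tedium is a small wrapper that allocates and marks up in a single step, so the pointer, size, and description can't drift out of sync.  This is a hypothetical helper, not part of the Telemetry API, and the tmAlloc macro and cx handle below are stubs we defined so the sketch compiles on its own:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-in for tmAlloc so the sketch is self-contained; the real
// declaration comes from the Telemetry SDK.
static size_t g_reported = 0;
#define tmAlloc( cx, p, size, desc ) ( (void)(p), (void)(desc), g_reported += (size) )
static int cx = 0; // hypothetical context handle

// Hypothetical helper: allocate and mark up in one step, so each call
// site only states the size and description once.
static void *tracked_malloc( size_t size, const char *desc )
{
    void *p = malloc( size );
    tmAlloc( cx, p, size, desc );
    return p;
}
```

It cuts the per-site boilerplate in half, but every allocation still has to go through the wrapper, so it doesn't address the visibility problem below.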

For C++ it's even worse, because the equivalent case:

Foo *foo = new Foo;
tmAlloc( cx, foo, sizeof(*foo), "foo" );

is potentially misleading, since sizeof(*foo) does not take into account allocations that occur during member construction (such as member variables that are STL containers).

In both cases, any memory that's allocated outside of your control (incidental allocations within STL, external libraries, etc.) is effectively invisible to explicit markup.
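
A tiny self-contained illustration of why the sizeof approach underreports: sizeof is a compile-time constant, so heap storage owned by members is invisible to it no matter how large it grows:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Foo {
    std::vector<int> data; // the vector's heap storage lives outside the object
};
// sizeof(Foo) is fixed at compile time; it can never reflect what the
// vector allocates on the heap as it grows.
```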

At least with Telemetry's zones, missing markup still implies a lot about performance.  A gap in time still indicates that time was spent, just in an area we've yet to annotate:
Even a missing zone can tell us something
There's no equivalent way to infer memory usage with explicit markup.

So why don't we just trap the appropriate system-wide memory allocator calls?  On a single platform this may be feasible, but doing it across every system we support is impractical and, frankly, bad practice, since it can introduce subtle bugs and side effects.  And it doesn't even fully solve the problem, since some programmers want memory usage analysis for their own custom allocators, which still requires explicit markup.

There is a compromise solution, where you overload your allocators locally (trapping malloc and/or overloading operators new and delete) and mark things up there:

void *operator new( size_t s )
{
   void *p = malloc( s );
   tmAlloc( cx, p, s, "new" );
   return p;
}

This works, but we lose context about the allocation such as where it occurred or what it was being used for.  We could work around this by grabbing the callstack as well (slow) or by explicitly passing down additional contextual information (tedious), but neither of those solutions is ideal.
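
The "explicitly passing down additional contextual information" option might look like a placement-style operator new that takes a tag argument.  This is a hypothetical sketch (the tmAlloc macro and cx handle are stubs we defined so it compiles standalone), and it shows exactly why the approach is tedious: every call site must remember to pass the tag:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-in recorder for tmAlloc so this sketch is self-contained.
static const char *g_lastTag = "";
#define tmAlloc( cx, p, size, desc ) ( (void)(p), g_lastTag = (desc) )
static int cx = 0; // hypothetical context handle

// Tag-taking placement-style operator new: the context is explicit,
// but every call site has to remember to pass it down.
void *operator new( std::size_t s, const char *tag )
{
    void *p = std::malloc( s );
    tmAlloc( cx, p, s, tag );
    return p;
}

// Matching placement delete (invoked only if a constructor throws).
void operator delete( void *p, const char * ) noexcept { std::free( p ); }

struct Particle { float x, y, z; };
```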

As you can see, simply gathering the data related to memory usage is a difficult problem in and of itself.  Currently Telemetry gives you two ways to do this: fine grained control using explicit markup, or implicit context by using what we refer to as 'memory zones'.  That's a topic for another blog post though!

For now, let's assume that we magically have the ability to gather relevant memory events and some contextual information.  This brings us to our second problem: what do we do with that information?

Memory events define intervals of usage, e.g. from time A to time B we had N bytes allocated by system Z.  That's a lot of useful information right there, and it can be used to generate all kinds of interesting analysis.  Of course, there's also a ton of noise in there that isn't particularly relevant or interesting.  Sometimes we care about specific allocations, but most of the time we only care about logical groups, and even then modern applications generate immense numbers of small, short-lived allocations that can clutter our view instead of contributing to it (but not always!).

Given these memory allocation intervals, we could theoretically generate a lot of fun analysis:
  • leak detection
  • allocation churn (rate of change of number of individual allocations)
  • allocated churn (rate of change of total memory allocated)
  • life spans of allocations (by file, location, size, system, etc.)
  • peak allocation over time, in total and by system
  • allocation usage over time by specific systems
  • distribution of allocations at a point in time
  • address space fragmentation at a point in time
  • topographical (treemap) visualization of memory allocation at a point in time
  • etc. etc.
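
As one concrete example of what interval data enables, peak usage over time falls out of a simple event sweep over (start, end, size) intervals.  A simplified sketch, not the server's actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Interval { double start, end; std::int64_t bytes; };

// Sweep allocation/free events in time order, tracking live and peak bytes.
std::int64_t peakUsage( const std::vector<Interval> &ivals )
{
    std::vector<std::pair<double, std::int64_t>> events; // (time, +/- bytes)
    for ( const Interval &iv : ivals ) {
        events.push_back( { iv.start,  iv.bytes } );
        events.push_back( { iv.end,   -iv.bytes } );
    }
    // Lexicographic sort: at equal timestamps the negative (free) event
    // comes first, so intervals that merely touch don't double-count.
    std::sort( events.begin(), events.end() );

    std::int64_t live = 0, peak = 0;
    for ( const auto &e : events ) {
        live += e.second;
        peak = std::max( peak, live );
    }
    return peak;
}
```

The same sweep generalizes to churn (count events per unit time) and per-system usage (keep one running total per bucket).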

Most programmers, however, are only interested in a subset of that information. Unfortunately all of the above analysis can be expensive and storage hungry, and it's often difficult to visualize side by side with plot or zone data.  It's a difficult problem to solve to everyone's satisfaction, but we've started taking what we think are some good first steps to make this happen.

Tuesday, October 8, 2013

Telemetry 2.0e released!

Telemetry 2.0e is out the door!

NOTE: THERE HAS BEEN A DATABASE SCHEMA CHANGE, BACKUP YOUR DATA FILES BEFORE MIGRATING!

While we've tested the migration as much as possible, there's always a chance that things go awry during the update, so we highly recommend backing up your old data files before upgrading.  Copying the data files automatically is not practical due to the size of some customer data sets (hundreds of GBs!).

Run-Time

  • Enhancement: Android implementation now reports actual number of cores
  • Bugfix: fixed connection bug for PS3, 360 and WiiU
  • Enhancement: PS4 console name now reported correctly (requires SDK 1.020+)

Server

  • Enhancement: major performance enhancement with string retrievals (in particular you'll notice this while fetching messages)
  • Bugfix: fixed crash bug where the server erroneously tried to mark some data sets as 'done' while they were still live
  • Bugfix: cancelling "Profile Selection" was not working properly
  • Bugfix: -NaN and Inf were not handled properly by the server for plots

Visualizer

  • Enhancement: timeline section names now support up to four separate lines (via \n)
  • Enhancement: copying tooltips with the control key now works
  • Bugfix: enabling multiple plots then disabling some would not rescale the Y axis
  • Bugfix: middle click drag scrolling in plot view wasn't working
  • Enhancement: callstack rendering tooltips improved
  • Enhancement: callstack pre/post-guards added