Tuesday, February 4, 2014

RAD Telemetry 2.0i released!

Whoops, that was quick.  This is an interim bug fix release with some minor new features.  The big items are that plot generation was crashing (now fixed), and that you can now prefetch a lot more data per server fetch, which reduces latency when scrolling around in the zone view.

Run-Time

  • Change: QNX/ARM is now built with -fpic
  • Change: removed dependency on dbghelp.dll for Windows
  • Change: 'processing time' plot is now 'process time(ms)'

Server

  • Change: default zone pool is larger now, reducing need to resize continuously

Visualizer

  • Bugfix: plot generation was crashing
  • Change: generated plot names are shortened and now include 'by frame' or 'by time' unit
  • Enhancement: the prefetch window can be altered to reduce latency when scrolling around.  This is a slider on the Options page.
  • Enhancement: state tag backgrounds in zone view are more prominent
  • Bugfix: mem track button would disappear and stay gone
  • Bugfix: time scale calculations were off in certain rare instances due to quantization artifacts

Other

  • Change: updated docs on using regsvr32 to fix MS DIA SDK registration issues

Tuesday, January 28, 2014

RAD Telemetry 2.0h released!


Telemetry 2.0h has been released!  This is a collection of small fixes/updates from December through January.  Updated to support the latest console SDKs, implemented a bunch of customer requests, updated and clarified some documentation, etc.  This will hopefully be the basis for the latest 2.0 stable build so I can start working on 2.1.

NOTE: This requires a full rebuild!  The ABI has changed slightly to support the tmPlotAt functionality, so please remember to update your headers and libraries simultaneously and perform a full rebuild.

The only really significant gotcha is that TMZF_PROFILER_ONLY zones are now effectively discarded by the server, since they were a Telemetry 1.x-specific feature.  A similar feature may be forthcoming in Telemetry 2, but because some of the constraints have changed (we no longer have monolithic frames like we did in Telemetry 1.x), it'll have to be designed a little differently.

Run-Time

  • Enhancement: added tmPlotAt APIs so you can retroactively place a plot in time.  For example, you might want to accumulate some information but then plot it back where it's most relevant, not where it's available (see the sketch after this list).
  • Enhancement: updated to latest platform SDKs.
  • Bugfix: fixed XB1 checked build link errors
  • Change: 64-bit Linux static lib is now built with -fPIC
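
To make the idea concrete, here's a rough sketch of the "accumulate now, plot it back where it happened" pattern.  The tmPlotAt parameter order, the tm_uint64 type, and the tmFastTime() timestamp helper shown here are assumptions from memory, not a spec; check the 2.0h headers for the real signatures.

void UpdateParticles()
{
    // Grab a timestamp at the point where the value is most relevant...
    tm_uint64 when = tmFastTime();            // assumed timestamp helper

    int spawned = SpawnParticles();           // hypothetical game code

    // ...then emit the plot afterwards, placing it back at 'when' instead of "now".
    tmPlotAt( cx, when, (double)spawned, "particles/spawned" );   // parameter order assumed
}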

Server

  • Enhancement: server now does periodic flushing every 100ms instead of only flushing on tmTick processing
  • Change: TMZF_PROFILER_ONLY zones are now explicitly discarded

Visualizer

  • Enhancement: user-defined zone, timespan, and plot colors.  By putting "color: rrggbb" in your description, the Visualizer will color accordingly (see the sketch after this list).
  • Enhancement: added sort-by option in profiler results pane so you can sort by time or count
  • Enhancement: plot tooltip now shows the time offset from the start of the session (so, as in T1, you can now see how far into the session a frame is, not just its duration)
  • Enhancement: TMZF_IDLE zones are now more obvious
  • Enhancement: 'show idle zones' checkbox is back
  • Enhancement: selected region in zone view will display its duration in tooltip
  • Enhancement: 1ms tick intervals are now drawn in the zone view
  • Bugfix: profiler pane now shows correct counts for accumulation zones
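
As a concrete example of the color markup, a zone description carrying the tag might look like the snippet below.  The TMZF_NONE flag and the placement of the tag inside the string are assumptions; the release note above only specifies the "color: rrggbb" text itself.

void UpdatePhysics()
{
    // "color: rrggbb" in the description tells the Visualizer what color
    // to draw this zone with (orange, in this case).
    tmZone( cx, TMZF_NONE, "UpdatePhysics color: ff8800" );
    ...
}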

Other

  • Change: updated docs to clarify that TMZF_PROFILER_ONLY is a T1 specific feature

Wednesday, December 11, 2013

RAD Telemetry 2.0g released!


Telemetry 2.0g has been released!  Big performance increase during processing (~40% faster) and bug fixes.

Run-Time

  • Change: tmInitializeContext now prints a warning (checked build only) if the user-provided buffer is greater than 4MB.  Buffers that are too large can cause unnecessary slowdowns (see the sketch after this list).
  • Bugfix: PS3: issuing multiple tmOpen/tmClose calls after initiating the PPU listener thread would hang the run-time
  • Bugfix: Linux: intermittent hang bug that happened on a small number of systems when scanning processes during tmOpen
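
For reference, a context setup that stays comfortably under that threshold might look like the sketch below.  The tmInitializeContext parameter order is from memory, so verify it against your telemetry headers; the buffer size is just an example.

// Keep the Telemetry arena at or below 4MB; oversized buffers can actually
// cause unnecessary slowdowns (and the checked build will now warn about it).
static char s_telemetryBuffer[ 2 * 1024 * 1024 ];

HTELEMETRY cx = NULL;
tmInitializeContext( &cx, s_telemetryBuffer, sizeof( s_telemetryBuffer ) );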

Server

  • Enhancement: SQLite usage optimizations resulting in a significant speed up during cooking (~40%)
  • Change: TMZF_MEM_REGION_EXPERIMENTAL now uses path name for naming allocations if present (a la profiler names)
  • Change: shortened generated filenames in an attempt to stave off MAX_PATH issues on Windows
  • Change: autogenerated memory usage plots are signed 64-bit instead of unsigned
  • Change: dynamic string usage is now plotted to help narrow down bandwidth issues
  • Bugfix: string cache overflow bug would make some plots disappear if you had too many unique strings
  • Bugfix: sort order was incorrect for profiler queries, so important data was clipped off if there were a lot (>512) of results in a profiler fetch
  • Bugfix: fixed issues where spurious plots were generated for mem usage

Visualizer

  • Change: cooking now reports number of files left instead of the approximate percentage
  • Bugfix: mouse wheel no longer zooms if UI is in modal state
  • Bugfix: you could erroneously select live sessions by clicking the 'all' button
  • Bugfix: fixed rare crash bug in Visualizer that resulted from trying to access non-existent session data

Other

  • Change: TMS API changed to better support querying for cooking session duration
  • Change: updated UE3 integration docs with clarifications
  • Bugfix: updated trylock docs to indicate that they don't support recursive mutexes

Thursday, October 24, 2013

Telemetry 2.0f released!


Telemetry 2.0f has been released!  This is primarily a bugfix and maintenance release.

Run-Time

  • Change: the autogenerated system/working set plot is now part of (mem) group
  • Change: TMCT_FILE now defaults to emulating Telemetry 2 server

Server

  • Change: updated to most recent SQLite
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's cooking should not crash anymore
  • Bugfix: fixed a string table bug where NULL strings were being set to the wrong value
  • Bugfix: fixed string id generation bug when generating plots
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Visualizer

  • Enhancement: ctrl-V now pastes into input text lines
  • Enhancement: TMPT_TIME_CYCLES plots now print in appropriate human readable time scale
  • Change: Changed "Cooking" to "Processing" to avoid idiomatic terminology
  • Bugfix: fixed plot LOD rendering bug
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's processing should not crash anymore
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Other

  • Change: PS4 dependency documentation updated

Experimental Feature: Memory Zones

As mentioned in a previous blog post, profiling memory can be a difficult problem, not just technically (acquiring the data) but also in terms of usability (what do we do with all that data?!)

Telemetry has always allowed you to tag and track individual allocations using the tmAlloc/tmFree APIs; however, this is often overkill (tracking all allocations); noisy (lots of stuff tracked that we don't care about); expensive (lots of data); tedious (lots of markup); and, in the end, hard to parse (no clear structure to visualize over time).  It's still the right low-level API since it puts the control in the programmer's hands, but a cleaner mechanism on top of it has always been desirable yet hard to define.

Telemetry 2 is looking to provide that using an experimental new feature called memory zones.  These are just normal Telemetry zones with a new flag attached to them, TMZF_MEM_REGION_EXPERIMENTAL.  Specially annotated allocations that occur inside this zone are logically grouped and plotted (for now, the individual memory intervals are still recorded so that we can do more analysis later if we choose).

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;
    tmAlloc( cx, f, sizeof(*f), "" );   // note the empty description!

    bar* b = new bar;
    tmAlloc( cx, b, sizeof(*b), "" );   // note the empty description!
    ...
    delete b;
    delete f;
}


With the above code snippet, 'f' and 'b' would contribute to the 'MyFunc' bucket.  That's useful on its own, but where it really starts to help is when you overload operators new and delete to take advantage of this:

void *operator new( size_t s )
{
   void *p = malloc( s );

   // empty description means 'use mem zone'
   tmAlloc( cx, p, s, "" );
   return p;
}
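
The matching operator delete would hand the pointer back through tmFree so that the end of each allocation interval is recorded as well; a minimal sketch in the same spirit:

void operator delete( void *p )
{
   // close out the allocation interval before actually releasing the memory
   tmFree( cx, p );
   free( p );
}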


Then we can remove the tmAlloc calls from MyFunc:

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;
    bar* b = new bar;

    std::vector< foo > v;
    v.push_back(*f);
    ...
    delete b;
    delete f;
}


What happens now is that when 'foo' and 'bar' are allocated, our overloaded operator new is called while the MyFunc memory zone is active, so those allocations are accounted for properly.  Not only that, but any memory allocations trapped by our overloaded operator new (such as implicit ones made by the STL) are gathered as well, so in the above case any incidental allocations that result from manipulating the vector are accounted for automatically.

In practice this means that you can now progressively mark up memory allocations exactly the same way you do with zones.  In fact, Telemetry will discard allocation and free events with empty descriptors if they're outside of a mem zone, which reduces overhead considerably:

void MyFunc()
{
    tmEnter( cx, TMZF_MEM_REGION_EXPERIMENTAL, "tracked" );

    P *p = new P(); // overloaded operator new traps and tracks
    Q *q = new Q(); // overloaded operator new traps and tracks

    tmLeave( cx );
    R *r = new R(); // ignored on server (outside mem regions!)
}


The brute force way to get up and running would be to wrap one giant memory zone around your application's main() function.  While this would work, it would generate a lot of traffic and, as a result, be somewhat meaningless until you add more memory zones (the full hierarchy is preserved, so more zones = more detail).
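
In code, the brute force version is just the same markup at the very top level.  A sketch, assuming the context is already open and that InitTelemetry, ShutdownTelemetry, and RunGame are your own (hypothetical) entry points:

int main( int argc, char **argv )
{
    InitTelemetry();   // open cx, install the operator new/delete overloads, etc.

    tmEnter( cx, TMZF_MEM_REGION_EXPERIMENTAL, "app" );
    RunGame();         // every trapped allocation below here lands in the "app" bucket
    tmLeave( cx );

    ShutdownTelemetry();
    return 0;
}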

A less intensive approach would be to progressively annotate "leafy" pieces of code and work upwards from there.  For example, start with your audio system or your level loader, see what it looks like, then go back and adjust your markup, try it again, etc.  This is very analogous to how zone markup is usually done with Telemetry.

What does this buy you?  Well, you have a nice clean way of discarding/ignoring memory allocations for starters, which can help keep traffic and total data storage requirements under control. 

Second, you gain automatic plotting of memory usage by memory zone.  This lets you look at memory usage at a high level then, if necessary, zoom into areas of interest and look at individual memory allocations.

Plots showing memory usage of sMainLoop

If we were to zoom in a bit we could see the usage pattern in more detail, specifically how a lot of usage occurs and then drops back down (a pattern associated with temporaries).  The data points encompass dozens of samples.

Zoomed in we can see usage pattern in more detail

By dropping all the way down to the zone view, we can see the individual allocations.  Given that we're looking at hundreds of tiny allocations, tracking them individually would not have given us very much useful data.

Individual memory intervals are still visible

If this system proves popular we plan on adding 'memory exclusion' zones, which would discard any memory events that occur inside them.  In addition we'll add a configuration option to control the rate at which the server generates allocation plots.  Currently the threshold is around 16K.

Finally, we'll add a bunch of visualization features so that we can take advantage of this system, which is really what we and our customers want!


Profiling Memory Usage: A Lamentation

Profiling memory usage is an exceedingly complex problem, one that Telemetry has tried to address in different ways over the years.  Much like performance profiling, the task of analyzing memory usage doesn't seem terribly difficult when looking at specific use cases.  However, once we try to generalize to a cross-platform implementation that handles every conceivable use case...it becomes nearly intractable.  There are two major problems that memory profiling has to solve: capturing memory event data, then somehow digesting that data into something meaningful.

Collecting all the allocation and free events is conceptually pretty simple, but it's rife with issues in practice and far from trivial to get right.  There are two obvious methods for tracking memory events: explicit markup at the point of allocation, or hooking the system allocators.

Explicit markup requires the programmer to annotate every memory operation manually, much like using Telemetry with zones:

Foo *foo = malloc( sizeof( Foo ) );
tmAlloc( cx, foo, sizeof(*foo), "foo" );

For a language like C this is relatively straightforward, but it's tedious and error prone.

For C++ it's even worse, because the equivalent case:

Foo *foo = new Foo;
tmAlloc( cx, foo, sizeof(*foo), "foo" );

is potentially misleading, since sizeof(*foo) does not take into account allocations that happen during member construction (such as member variables that are STL containers).
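
To make that concrete, consider a hypothetical Foo whose constructor allocates through an STL member (Foo, samples, and Example are illustrative names, not from any Telemetry docs):

#include <vector>

struct Foo
{
    std::vector<int> samples;

    Foo() : samples( 1024 ) {}                 // the vector's buffer is a separate heap allocation
};

void Example()
{
    Foo *foo = new Foo;
    tmAlloc( cx, foo, sizeof(*foo), "foo" );   // records sizeof(Foo) only, not the ~4KB vector buffer
}
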
In both cases with explicit markup, any memory that's allocated outside of your control (incidental allocations within STL, external libraries, etc.) is effectively invisible.

At least with Telemetry's zones, missing markup still implies a lot about performance.  A gap in time still indicates that time was spent, just in an area we've yet to annotate:

Even a missing zone can tell us something

There's no equivalent way to infer memory usage with explicit markup.

So why don't we just trap the appropriate system-wide memory allocator calls?  On a single platform this may be feasible, but doing it across every system we support is impractical and, frankly, bad practice, since it can introduce subtle bugs and side effects.  And it doesn't even fully solve the problem, since some programmers may want to see memory usage analysis for their own allocators and thus may still require explicit markup.

There is a compromise solution, where you overload your allocators locally (trapping malloc and/or overloading operators new and delete) and mark things up there:

void *operator new( size_t s )
{
   void *p = malloc( s );
   tmAlloc( cx, p, s, "new" );
   return p;
}

This works, but we lose context about the allocation, such as where it occurred or what it was being used for.  We could work around this by grabbing the callstack as well (slow) or by explicitly passing down additional contextual information (tedious), but neither of those solutions is ideal.
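
The "pass context down" option usually ends up looking like a tagged overload.  Here's a hypothetical sketch (the extra tag parameter and the Example wrapper are illustrative, not a Telemetry API):

// Tagged allocation path: the call site passes a label down explicitly.
void *operator new( size_t s, const char *tag )
{
   void *p = malloc( s );
   tmAlloc( cx, p, s, tag );   // we keep the context...
   return p;
}

void Example()
{
   Foo *foo = new ( "renderer/foo" ) Foo;   // ...but every call site has to be edited
}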

As you can see, simply gathering the data related to memory usage is a difficult problem in and of itself.  Currently Telemetry gives you two ways to do this: fine-grained control using explicit markup, or implicit context by using what we refer to as 'memory zones'.  That's a topic for another blog post though!

For now, let's assume that we magically have the ability to gather relevant memory events and some contextual information.  This brings us to our second problem: what do we do with that information?

Memory events define intervals of usage, e.g. from time A to time B we had N bytes allocated by system Z.  That's actually a lot of useful information right there, which can be used to generate all kinds of interesting analysis.  Of course, there's also a ton of noise in there that maybe isn't particularly relevant or interesting.  Sometimes we care about specific allocations, but most of the time we only care about logical groups, and even then modern applications generate immense numbers of small, short-lived memory allocations that can clutter our view instead of contributing to it (but not always!).

Given these memory allocation intervals, we could theoretically generate a lot of fun analysis:
  • leak detection
  • allocation churn (rate of change of number of individual allocations)
  • allocated churn (rate of change of total memory allocated)
  • life spans of allocations (by file, location, size, system, etc.)
  • peak allocation over time, in total and by system
  • allocation usage over time by specific systems
  • distribution of allocations at a point in time
  • address space fragmentation at a point in time
  • topographical (treemap) visualization of memory allocation at a point in time
  • etc. etc.

Most programmers, however, are only interested in a subset of that information.  Unfortunately, all of the above analyses can be expensive and storage-hungry, and they're often difficult to visualize side by side with plot or zone data.  It's a difficult problem to solve to everyone's satisfaction, but we've started taking what we think are some good first steps to make this happen.

Tuesday, October 8, 2013

Telemetry 2.0e released!

Telemetry 2.0e is out the door!

NOTE: THERE HAS BEEN A DATABASE SCHEMA CHANGE, BACK UP YOUR DATA FILES BEFORE MIGRATING!

While we've tested the migration as much as possible, there's always a chance that things go awry during the update, so we highly recommend backing up your old data files before upgrading.  Copying the data files automatically isn't practical due to the size of some customer data sets (hundreds of GBs!).

Run-Time

  • Enhancement: Android implementation now reports actual number of cores
  • Bugfix: fixed connection bug for PS3, 360 and WiiU
  • Enhancement: PS4 console name now reported correctly (requires SDK 1.020+)

Server

  • Enhancement: major performance enhancement with string retrievals (in particular you'll notice this while fetching messages)
  • Bugfix: fixed a crash when the server erroneously tried to mark some data sets as 'done' while they were still live
  • Bugfix: cancelling "Profile Selection" was not working properly
  • Bugfix: -NaN and Inf were not handled properly by the server for plots

Visualizer

  • Enhancement: timeline section names now support up to four separate lines (via \n)
  • Enhancement: copying tooltips using the control key now works
  • Bugfix: enabling multiple plots then disabling some would not rescale the Y axis
  • Bugfix: middle click drag scrolling in plot view wasn't working
  • Enhancement: callstack rendering tooltips improved
  • Enhancement: callstack pre/post-guards added