Wednesday, December 11, 2013

RAD Telemetry 2.0g released!


Telemetry 2.0g has been released!  Big performance increase during processing (~40% faster) and bug fixes.

Run-Time

  • Change: tmInitializeContext now prints a warning (checked build only) if the user-provided buffer is larger than 4MB.  Buffers that are too large can cause unnecessary slowdowns.
  • Bugfix: PS3: issuing multiple tmOpen/tmClose calls after initiating the PPU listener thread would hang the run-time
  • Bugfix: Linux: intermittent hang bug that happened on a small number of systems when scanning processes during tmOpen

Server

  • Enhancement: SQLite usage optimizations resulting in a significant speed up during cooking (~40%)
  • Change: TMZF_MEM_REGION_EXPERIMENTAL now uses path name for naming allocations if present (a la profiler names)
  • Change: shortened generated filenames in an attempt to stave off MAX_PATH issues on Windows
  • Change: autogenerated memory usage plots are signed 64-bit instead of unsigned
  • Change: dynamic string usage is now plotted to help narrow down bandwidth issues
  • Bugfix: string cache overflow bug would make some plots disappear if you had too many unique strings
  • Bugfix: sort order was incorrect for profiler queries, so important data was clipped off if there were a lot (>512) of results in a profiler fetch
  • Bugfix: fixed issues where spurious plots were generated for mem usage

Visualizer

  • Change: cooking now reports number of files left instead of the approximate percentage
  • Bugfix: mouse wheel no longer zooms if UI is in modal state
  • Bugfix: you could erroneously select live sessions by clicking the 'all' button
  • Bugfix: fixed rare crash bug in Visualizer that resulted from trying to access non-existent session data

Other

  • Change: TMS API changed to better support querying for cooking session duration
  • Change: updated UE3 integration docs with clarifications
  • Bugfix: updated trylock docs to indicate that they don't support recursive mutexes

Thursday, October 24, 2013

Telemetry 2.0f released!


Telemetry 2.0f has been released!  This is primarily a bugfix and maintenance release.

Run-Time

  • Change: the autogenerated system/working set plot is now part of the (mem) group
  • Change: TMCT_FILE now defaults to emulating Telemetry 2 server

Server

  • Change: updated to most recent SQLite
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's cooking should not crash anymore
  • Bugfix: fixed string table bug where NULL strings were being set to the wrong value
  • Bugfix: fixed string id generation bug when generating plots
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Visualizer

  • Enhancement: ctrl-V now pastes into input text lines
  • Enhancement: TMPT_TIME_CYCLES plots now print in appropriate human readable time scale
  • Change: Changed "Cooking" to "Processing" to avoid idiomatic terminology
  • Bugfix: fixed plot LOD rendering bug
  • Bugfix: fixed plot gen crash bug
  • Bugfix: stopping server while it's processing should not crash anymore
  • Bugfix: plot gen 'by frame' now works correctly
  • Bugfix: fixed bug with symbolization

Other

  • Change: PS4 dependency documentation updated

Experimental Feature: Memory Zones

As mentioned in a previous blog post, profiling memory can be a difficult problem, not just technically (acquiring the data) but also in terms of usability (what do we do with all that data?!).

Telemetry has always allowed you to tag and track individual allocations using the tmAlloc/tmFree APIs; however, this is often overkill (tracking all allocations); noisy (lots of stuff tracked we don't care about); expensive (lots of data); tedious (lots of markup); and, in the end, hard to parse (no clear structure to visualize over time).  It's still the right low-level API since it puts control in the programmer's hands, but a cleaner mechanism on top of it has always been desirable yet hard to define.
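
For reference, the low-level markup looks something like this (a minimal sketch; we're assuming tmFree takes just the context and pointer, so check the docs for the exact arguments):

void *p = malloc( 1024 );
tmAlloc( cx, p, 1024, "scratch buffer" );
...
// unregister the allocation before actually releasing it
tmFree( cx, p );
free( p );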

Telemetry 2 is looking to provide that using an experimental new feature called memory zones.  These are just normal Telemetry zones with a new flag attached to them, TMZF_MEM_REGION_EXPERIMENTAL.  Specially annotated allocations that occur inside this zone are logically grouped and plotted (for now, the individual memory intervals are still recorded so that we can do more analysis later if we choose).

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;

    //note the empty description!
    tmAlloc( cx, f, sizeof(*f), "" );

    bar* b = new bar;
    //note the empty description!
    tmAlloc( cx, b, sizeof(*b), "" );
    ...
    delete b;
    delete f;
}


With the above code snippet, 'f' and 'b' would contribute to the 'MyFunc' bucket.  That's useful on its own, but where it really starts to help is when you overload operators new and delete to take advantage of this:

void *operator new( size_t s )
{
   void *p = malloc( s );

   // empty description means 'use mem zone'
   tmAlloc( cx, p, s, "" );
   return p;
}
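
A matching operator delete keeps the books balanced (a sketch; again, we're assuming tmFree takes just the context and pointer):

void operator delete( void *p )
{
   // unregister the allocation before actually releasing it
   tmFree( cx, p );
   free( p );
}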


Then we can remove the tmAlloc calls from MyFunc:

void MyFunc()
{
    tmZone( cx, TMZF_MEM_REGION_EXPERIMENTAL, "MyFunc" );
    ...
    foo* f = new foo;
    bar* b = new bar;

    std::vector< foo > v;
    v.push_back(*f);
    ...
    delete b;
    delete f;
}


What happens now is that when 'foo' and 'bar' are allocated, our overloaded operator new will be executing within the MyFunc memory zone, so they are accounted for properly.  Not only that, but any other memory allocations trapped by our overloaded operator new (such as implicit ones made by the STL) are gathered as well, so in the above case any incidental allocations that result from manipulating the vector will be accounted for automatically.

In practice this means that you can now progressively mark up memory allocations exactly the same way you do with zones.  In fact, Telemetry will discard allocation and free events with empty descriptions if they're outside of a mem zone, which reduces overhead considerably:

void MyFunc()
{
    tmEnter( cx, TMZF_MEM_REGION_EXPERIMENTAL, "tracked" );

    P *p = new P(); // overloaded operator new traps and tracks
    Q *q = new Q(); // overloaded operator new traps and tracks

    tmLeave( cx );
    R *r = new R(); // ignored on server (outside mem regions!)
}


The brute force option to get up and running would be to wrap your application's main() function in one giant memory zone.  While this would work, it would generate a lot of traffic and, as a result, be somewhat meaningless until you add more memory zones (the full hierarchy is preserved, so more zones = more detail).

A less intensive approach would be to progressively annotate "leafy" pieces of code and work upwards from there.  For example, start with your audio system or your level loader, see what it looks like, then go back and adjust your markup, try it again, etc.  This is very analogous to how zone markup is usually done with Telemetry.

What does this buy you?  Well, you have a nice clean way of discarding/ignoring memory allocations for starters, which can help keep traffic and total data storage requirements under control. 

Second, you gain automatic plotting of memory usage by memory zone.  This lets you look at memory usage at a high level then, if necessary, zoom into areas of interest and look at individual memory allocations.

Plots showing memory usage of sMainLoop
If we were to zoom in a bit we could see the usage pattern in more detail, specifically how a lot of usage occurs and then drops back down (a pattern associated with temporaries).  Each data point encompasses dozens of samples.
Zoomed in we can see usage pattern in more detail
By dropping all the way down to the zone view, we can see the individual allocations.  Given that we're looking at hundreds of tiny allocations, tracking them individually would not have given us very much useful data.

Individual memory intervals are still visible
If this system proves popular we plan on adding 'memory exclusion' zones, which would discard any memory events that occur inside them.  In addition we'll add a configuration option to control the rate at which the server generates allocation plots.  Currently the threshold is around 16K.

Finally, we'll add a bunch of visualization features so that we can take advantage of this system, which is really what we and our customers want!


Profiling Memory Usage: A Lamentation

Profiling memory usage is an exceedingly complex problem, one that Telemetry has tried to address in different ways over the years.  Much like performance profiling, the task of analyzing memory usage doesn't seem terribly difficult when looking at specific use cases.  However once we try to generalize to a cross-platform implementation that handles every conceivable use case...it becomes nearly intractable.  There are two major problems that memory profiling has to solve: capturing memory event data, then somehow digesting that data into something meaningful.

Collecting all the allocation and free events is conceptually simple but rife with issues in practice.  There are two obvious methods for tracking memory events: explicit markup at the point of allocation, or hooking the system allocators.

Explicit markup requires the programmer to annotate every memory operation manually, much like using Telemetry with zones:

Foo *foo = malloc( sizeof( Foo ) );
tmAlloc( cx, foo, sizeof(*foo), "foo" );

For a language like C this is relatively straightforward, but it's tedious and error prone.

For C++ it's even worse, because the equivalent case:

Foo *foo = new Foo;
tmAlloc( cx, foo, sizeof(*foo), "foo" );

is potentially misleading, since sizeof(*foo) does not take into account additional allocations made during member construction (such as member variables that are STL containers).

In both cases with explicit markup, any memory that's allocated outside of your control (incidental allocations within STL, external libraries, etc.) is effectively invisible.

At least with Telemetry's zones, missing markup still implies a lot about performance.  A gap in time still indicates that time was spent, just in an area we've yet to annotate:
Even a missing zone can tell us something
There's no equivalent way to infer memory usage with explicit markup.

So why don't we just trap the appropriate system-wide memory allocator calls?  On a single platform this may be feasible, but doing it across every system we support is impractical and, frankly, bad practice, since it can introduce subtle bugs and side effects.  And it doesn't even fully solve the problem, since some programmers want memory usage analysis for their own custom allocators, which still requires explicit markup.

There is a compromise solution, where you overload your allocators locally (trapping malloc and/or overloading operators new and delete) and mark things up there:

void *operator new( size_t s )
{
   void *p = malloc( s );
   tmAlloc( cx, p, s, "new" );
   return p;
}

This works, but we lose context about the allocation such as where it occurred or what it was being used for.  We could work around this by grabbing the callstack as well (slow) or by explicitly passing down additional contextual information (tedious), but neither of those solutions is ideal.
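
For the record, the 'tedious' option might look something like this hypothetical placement-form overload (the names here are illustrative):

void *operator new( size_t s, char const *kpDescription )
{
   void *p = malloc( s );
   tmAlloc( cx, p, s, kpDescription );
   return p;
}

// every call site now has to supply its own context:
Foo *foo = new ( "meshes/players" ) Foo;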

As you can see, simply gathering the data related to memory usage is a difficult problem in and of itself.  Currently Telemetry gives you two ways to do this: fine-grained control using explicit markup, or implicit context by using what we refer to as 'memory zones'.  That's a topic for another blog post though!

For now, let's assume that we magically have the ability to gather relevant memory events and some contextual information.  This brings us to our second problem: what do we do with that information?

Memory events define intervals of usage, e.g. from time A to time B we had N bytes allocated by system Z.  That's actually a lot of useful information right there which can be used to generate all kinds of interesting analysis.  Of course, there's also a ton of noise in there that maybe isn't particularly relevant or interesting.  Sometimes we care about specific allocations but most of the time we only care about logical groups, and even then modern applications generate immense amounts of small, short lived memory allocations that can clutter our view instead of contributing to it (but not always!).

Given these memory allocation intervals, we could theoretically generate a lot of fun analysis:
  • leak detection
  • allocation churn (rate of change of number of individual allocations)
  • allocated churn (rate of change of total memory allocated)
  • life spans of allocations (by file, location, size, system, etc.)
  • peak allocation over time, in total and by system
  • allocation usage over time by specific systems
  • distribution of allocations at a point in time
  • address space fragmentation at a point in time
  • topographical (treemap) visualization of memory allocation at a point in time
  • etc. etc.

Most programmers, however, are only interested in a subset of that information. Unfortunately all of the above analysis can be expensive and storage hungry, and it's often difficult to visualize side by side with plot or zone data.  It's a difficult problem to solve to everyone's satisfaction, but we've started taking what we think are some good first steps to make this happen.

Tuesday, October 8, 2013

Telemetry 2.0e released!

Telemetry 2.0e is out the door!

NOTE: THERE HAS BEEN A DATABASE SCHEMA CHANGE, BACKUP YOUR DATA FILES BEFORE MIGRATING!

While we've tested the migration as much as possible, there's always a chance that things go awry during the update, so we highly recommend backing up your old data files before upgrading.  Copying the data files automatically is not practical due to the size of some customer data sets (hundreds of GBs!).

Run-Time

  • Enhancement: Android implementation now reports actual number of cores
  • Bugfix: fixed connection bug for PS3, 360 and WiiU
  • Enhancement: PS4 console name now reported correctly (requires SDK 1.020+)

Server

  • Enhancement: major performance enhancement with string retrievals (in particular you'll notice this while fetching messages)
  • Bugfix: Crash bug when the server erroneously tried to mark some data sets as 'done' while they were still live
  • Bugfix: cancelling "Profile Selection" was not working properly
  • Bugfix: -NaN and Inf were not handled properly by the server for plots

Visualizer

  • Enhancement: timeline section names now support up to four separate lines (via \n)
  • Enhancement: copying tooltips with the control key now works
  • Bugfix: enabling multiple plots then disabling some would not rescale the Y axis
  • Bugfix: middle click drag scrolling in plot view wasn't working
  • Enhancement: callstack rendering tooltips improved
  • Enhancement: callstack pre/post-guards added

Friday, September 13, 2013

Telemetry 2.0d released!

Telemetry 2.0d is out the door!  This is an incremental release from 2.0c with the following changes and improvements:

Run-Time

  • Bugfix: closing Telemetry after it failed to open would sometimes hang
  • Bugfix: initial timestamp wasn't offset by start TSC
  • Bugfix: iOS crash bug fixed
  • Change: iOS and Android use 16K write buffers for TMCT_FILE instead of 1MB sizes

Server

  • Bugfix: plot LOD bug fix
  • Bugfix: fixed zone search query
  • Bugfix: fixed join logic to deal with 0 string column IDs
  • Bugfix: crashing when attempting to symbolize a session that is also loading
  • Bugfix: stale file handles no longer prevent deletion of sessions

Visualizer

  • Change: selection checkbox is disabled for live sessions
  • Enhancement: huge speed up to symbolization
  • Bugfix: mem track is visible again
  • Enhancement: search results now also dumped into a separate tab
  • Bugfix: zones with same name in different threads were shadowing each other in profiler pane
  • Bugfix: properly reports when a search returns no results
  • Bugfix: cleaned up thread offset / lock count display in lock tracks
  • Bugfix: plots were still showing up as available after deletion

Misc

  • Bugfix: bug in database export Python script
  • Other: fixed misc documentation errors

Tuesday, September 3, 2013

Telemetry 2: User Defined Timers

Introduced as part of Telemetry 2.0c, user-defined timers are a handy way to override Telemetry's internal timing code with your own.  This is an advanced feature and not something most customers will need, but in certain rare cases it can be helpful.  For example:
  • embedded systems with their own custom high performance timers
  • scripting systems that use their own timing mechanisms
The usage is extremely simple: you register your own timing function by calling a new API, tmStartupEx.  The caveats are that:
  1. It must be thread safe
  2. It must be fast
  3. It must return a 64-bit, monotonically increasing value
  4. It must have at least microsecond accuracy
  5. It must be synchronized across processors
Anything less than microsecond accuracy can play havoc with the Telemetry back end, and tmStartupEx will error out if it detects a low resolution timer.
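
On Linux, for example, a conforming timer could be built on clock_gettime with CLOCK_MONOTONIC, which is thread safe, monotonic, and synchronized across cores (a sketch only; see the tmStartupEx docs for how the function is registered):

#include <time.h>

TmU64 myFastTime( void )
{
    struct timespec ts;
    clock_gettime( CLOCK_MONOTONIC, &ts );

    // nanosecond-resolution monotonic counter packed into 64 bits
    return (TmU64)ts.tv_sec * 1000000000ULL + (TmU64)ts.tv_nsec;
}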

The final thing to watch out for is accumulation zones.  The accumulation APIs are just macros that wrap calls around Telemetry's default timer function, which means they are incompatible with a user-defined one.  To get around this you'll need to write your own accumulation zone enter/leave/get-start macros (just use the existing ones as templates), e.g.:

#define myGetAccumulationStart(cx) ((cx) ? myFastTime() : 0)
#define myEnterAccumulationZone(cx,z) { if (cx) (z) -= myFastTime(); }
#define myLeaveAccumulationZone(cx,z) { if (cx) (z) += myFastTime(); }
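
Usage then mirrors the standard accumulation pattern (hypothetical names; see the docs for tmEmitAccumulationZone's exact arguments when you emit the total):

TmU64 start = myGetAccumulationStart( cx );
TmU64 acc = 0;

for ( int i = 0; i < count; ++i )
{
    myEnterAccumulationZone( cx, acc );
    doSmallTask( i );   // the tiny chunk of work being accumulated
    myLeaveAccumulationZone( cx, acc );
}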

If you need any assistance with this feature, please don't hesitate to contact us!

RAD Telemetry 2.0c Released

Telemetry 2.0c has just shipped!  The big news is that we now have support for the Playstation VITA!

Run-Time

  • Added support for Playstation VITA
  • Installable user timers

Visualizer

  • message text is clearer
  • message text can be copied to the clipboard
  • timespans are now taller and colored based on name
  • timespans now print text inside the timespan
  • timeline section names named 'paused' are now rendered in red

Misc

  • added documentation about using tmLockName to name timespans

Thursday, August 29, 2013

Telemetry Power Tips #2: Show paused state with timeline section names

Telemetry's API allows you to pause and resume capture dynamically, allowing you to pick and choose when data is acquired*.  This is particularly helpful if you're running sessions for a long period of time and don't want to capture data continuously. 

When reviewing your data you'll want to spot paused areas of the capture.  With Telemetry 1.x this was handled automatically since pauses were latched to the end of frame boundaries, so 'empty' frames were drawn differently.  However since Telemetry 2 supports multiple frame sets this can get complex, so instead of rendering frames as paused or not, it makes more sense to mark a region of time as paused.

A simple mechanism for this is to use the tmSetTimelineSectionName API.  The name 'paused' is recognized by Telemetry and it will use a bright red background for those timeline sections.

tmSetTimelineSectionName( cx, "paused" );
tmPause( cx, 1 );
...
tmPause( cx, 0 );
tmSetTimelineSectionName( cx, current_state_name );

This gives you a capture with the paused sections clearly highlighted in red.




* Note that a more flexible method of managing event capture in real time is available, but requires a bit more application setup.

Tuesday, August 27, 2013

Telemetry Power Tips #1: Managing Real-time Event Volume

One of the most common questions that comes up from Telemetry users is "What if I don't want all this data all the time?" or, put another way, "What if I only want to collect some types of data, some of the time?"

While you can use the tmPause API to turn off events, it's a heavyweight solution since it shuts down almost all data collection.  A finer-grained solution that provides precise control over detail or coverage is often desired.

Simple Telemetry integrations have a single global HTELEMETRY context which is passed into the Telemetry markup; if it is NULL, the function is never called.  (Technically the context is passed into Telemetry macros, which check it before dispatching, saving a little performance in the NULL case.)

We can leverage this property by keeping a set of HTELEMETRY variables that alternate between NULL and the real Telemetry context.

For example, let's say you have separate markup for your rendering, game logic, physics, and audio, and you want to enable/disable profiling for each of those categories independently.

HTELEMETRY telemetry_context;
enum { TCAT_AUDIO, TCAT_RENDERING, TCAT_PHYSICS, TCAT_GAME, NUM_TCATS };
HTELEMETRY telemetry_categories[ NUM_TCATS ];

Your markup can then use the appropriate category:

void playSound( char const *kpSoundName )
{
   tmZone( telemetry_categories[ TCAT_AUDIO ], TMZF_NONE, "playSound %s", kpSoundName );
   // ... play audio
}

In the above code, if telemetry_categories[ TCAT_AUDIO ] is NULL then the zone is never created and sent.

Now you just hook up some method to control the categories (console command, hot keys, controller input, etc.) which sets those array entries appropriately:

void enableTelemetryCategory( int which, bool enable )
{
   telemetry_categories[ which ] = enable ? telemetry_context : NULL;
}

You can obviously extend this to other things like markup 'levels', with the default being very light markup that has low overhead and going all the way up to a dense markup setting that generates tens of thousands of events per frame.
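
A hypothetical sketch of that idea, reusing the same NULL-context trick:

enum { TLVL_LIGHT, TLVL_NORMAL, TLVL_DENSE, NUM_TLVLS };
HTELEMETRY telemetry_levels[ NUM_TLVLS ];

void setTelemetryLevel( int max_level )
{
   // enable every level at or below the requested verbosity
   for ( int i = 0; i < NUM_TLVLS; ++i )
      telemetry_levels[ i ] = ( i <= max_level ) ? telemetry_context : NULL;
}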

Just remember to avoid dangling (and invalid) references by clearing your arrays before you shut everything down!


Monday, August 26, 2013

RAD Telemetry 2.0b released!

Hot on the heels of the 2.0a release we now have 2.0b out the door.  Lots of little bug fixes and official platform support for Android and QNX are the highlights.
  • ENHANCEMENT: updated Android support, including support for TMCT_FILE
  • ENHANCEMENT: support for QNX 6.5 on ARM
  • ENHANCEMENT: support for Linux on ARM
  • BUGFIX: endianness bug regression from 2.0a
  • BUGFIX: zone view tooltips for out of range frametime plots now fixed
  • BUGFIX: zooming in beyond a 100-clock width would result in a white screen
  • CHANGE: visualizer uses term 'clocks' instead of 'cycles'
  • BUGFIX: zero-length zones are handled properly now
  • BUGFIX: events that were too closely spaced were being discarded
  • CHANGE: documentation updates

Tuesday, August 20, 2013

Telemetry now available for QNX!

Telemetry has become a vital tool for game developers looking to squeeze as much performance out of their code as possible, especially when running on game consoles such as 360, PS3, and WiiU.  Given that those platforms are just special cases of embedded systems, it's not surprising that we've had a lot of inquiries from non-gaming embedded software developers.

While Telemetry runs great on some operating systems that happen to exist in the embedded space, such as Android or Linux/ARM, for us to become really relevant we needed to support dedicated embedded operating systems.  With that in mind we've added support for one of the most popular embedded platforms out there, the QNX operating system.  We now have support for QNX 6.5 on x86 and ARM processors!

If you're interested in trying it out, email us at sales3@radgametools.com!

Thursday, August 15, 2013

Telemetry now supports Android!

Just a quick mention that Telemetry now supports Android OS, so if you're interested in performance visualization on Android, let us know!

Tuesday, August 13, 2013

RAD Telemetry 2.0a/1.1t shipped!

Man, it's so easy to just keep plugging away at something and thinking "we'll just release next week".  We've been doing that with Telemetry 2 for months now because there's always one more platform or feature to add, but we finally locked things down and have made our release official as of today!

So as of now Telemetry 2.0a is available, and so is the incremental run-time update to 1.1t (the Telemetry run-time is 1.1t for both Telemetry 2 and Telemetry 1.x, which is kind of confusing right now but that's the nature of having dual version support).

Telemetry 1.1t will (hopefully) be the last release of 1.1 and going forward all releases will be 2.x+ versions.

If you're looking to evaluate again or upgrade, please contact our sales guys at sales3@radgametools.com so we can hook you up!

Monday, June 3, 2013

Telemetry 2 Migration Notes #2: Pseudo Paths

Telemetry 1.x had an object identification system based on pseudo-paths; however, as new features were added the pathing became inconsistent.  For example, plots and memory allocations used paths like this:

tmPlot( cx, TMPT_NONE, 0, value, "enemies/active" );
tmAlloc( cx, r, size_r, "meshes/players/current" );

Whereas zones and messages used parentheses to delimit the paths from the leaves:

tmZone( cx, TMZF_NONE, "(game/render)draw mesh" );
tmMessage( cx, TMMF_SEVERITY_LOG, "(game/net) client connected!\n" );

With Telemetry 2 everything uses the parenthetical form, so the above would look like this:

tmPlot( cx, TMPT_NONE, 0, value, "(enemies)active" );
tmAlloc( cx, r, size_r, "(meshes/players)current" );
tmZone( cx, TMZF_NONE, "(game/render/draw mesh)" ); // will explain this in a moment
tmMessage( cx, TMMF_SEVERITY_LOG, "(game/net)client connected!\n" );

The part within parentheses is considered the path, and the part outside is the leaf.  Paths are not required, however.

The one potential problem area is zones, mostly due to a change in how we handle zone time collation for profiling.  With Telemetry 1 we used a somewhat hazy heuristic to decide which zones were combined into a single time category.  It worked for the most part, but it was opaque, and customers often didn't understand how or why it worked the way it did.

With Telemetry 2 the aggregation of zones for profiling purposes is now explicitly managed by you.  If there is an explicit path, then everything in that path will be combined.  So for example if there are different zones specified like this:

tmZone( ..., "(render/mesh) %s", mesh_name );

Then they will all be combined and displayed as "(render/mesh)" in the profile view.

If for some reason you didn't want different meshes combined together and instead wanted them shown individually, you would do:

tmZone( ..., "render mesh %s", mesh_name );

And now the profile view would show a separate row for each individual mesh_name that was rendered.

Monday, April 8, 2013

RAD Telemetry 1.1s shipped!

The big update is that we now have "out of the box" support for Linux/ARM, with some caveats.  But by and large if you're working on some kind of embedded Linux/ARM system with reasonable networking then Telemetry should be viable for you.

And of course we have our standard updates, bugfixes, and compatibility stuff!

Run-time

  • Enhancement: added 'R' (reverse) variants of callstack APIs
  • Enhancement: official support for ARM7/Linux
  • Enhancement: updated SDK/console support
  • Bugfix: Fixed networking bug on PS4
  • Bugfix: Emulated tools version was being reset at tmClose
  • Bugfix: warnings are now issued by tmSetLockStateMintime if you reuse the same buffer
  • Change: tmEmitAccumulationZone now sets location string of 'leave' to same as enter event

Visualizer

  • Bugfix: fixed timespan rendering bug

Misc

  • Changes: updated UE3 integration documentation
  • Changes: fixed examples
  • Changes: minor documentation changes

Monday, April 1, 2013

Telemetry 2 Beta Migration Notes

To make the migration from Telemetry 1.x to 2.0 a little easier, existing customers can provide us test data that we'll verify and, if all looks good, ship out the beta soon after.  The three steps necessary are to ensure you're on a recent version of Telemetry (1.1r or later), enable T2 compatibility, and send data to a file instead of a server.

First, verify what version you're running -- this can be seen in the Visualizer's "About" pane.  If you need a newer version, please contact us and we'll get it to you.

Second, enable T2 compatibility by adding the following line after initialization but before you call tmOpen:

TmI32 v = 2;
tmSetParameter( cx, TMP_EMULATED_SERVER_VERSION, &v );

Finally, specify TMCT_FILE instead of TMCT_TCP when calling tmOpen.  You should otherwise be able to run normally and when all done a series of TMDATA.000, TMDATA.001, etc. will be created (location depends on your platform).

Zip up those files and contact us for FTP access and we'll verify that they work!

One other migration note: for ideal compatibility during migration, you'll need to change how some path names are done for plots, messages and zones, but we'll get into that later.

Monday, March 25, 2013

Telemetry 2: Architectural Improvements

In a previous post I talked a bit about some of the issues we faced after releasing Telemetry, and specifically how some early assumptions and decisions ended up making Telemetry difficult to update with unanticipated feature requests.  Examples of these decisions and assumptions included:
  • data sets would be memory resident
  • customers would annotate their code the way we did
  • write bandwidth was more important than read bandwidth
  • users would use the Visualizer for accessing data instead of trying to get to it themselves
  • data was difficult to find
  • 'render frame' was a natural frame boundary for everyone
With Telemetry 2 we wanted to address these issues.  This meant:
  • on-demand data loading
  • fast random seek and search of data
  • open data format
  • extreme dynamic range in Visualization, from clock cycles all the way up to viewing hours of data
  • fast, reliable search
  • different 'frame sets' so you could look at "render frames" or "physics frames" or whatever
  • ...without breaking existing customers' Telemetry API integrations
We stopped saying, or even thinking, things like "Well, most of the time..[some convenient assumption]....so we can probably just do this the simple way".  Our goal with Telemetry 2 is to anticipate and support worst case scenarios as much as possible, without making the average case suffer.

SQLite

None of our goals were attainable without changing our data file format.  Telemetry's original format was designed for fast streaming writes to avoid backing up the run-time client (which has since been fixed by buffering to an intermediate set of data files).  And since we assumed that our data would be read and sent in big chunks, random seek was slow and clumsy.  In addition, the format was proprietary—not because we wanted to hide it, but because supporting customer access of data would have been a support nightmare.

A real database system was needed, but the conventional wisdom is that traditional databases are too slow for real-time writing of the enormous amounts of data we were generating.  Also, we had shied away from databases early on because it was one more thing for end users to configure and install.  Asking a potential customer to install and configure a MySQL service was just another barrier to entry.

With Telemetry 2 we re-examined these assumptions and decided to go with SQLite.  We ran some preliminary performance tests and discovered that, while slower than Telemetry 1.x (which is understandable, since Telemetry was just writing raw data as fast as it could), its write bandwidth on mechanical media was not unreasonable.  In addition, since SQLite is built as a library instead of a separate program, installation didn't have any extra steps for our customers.

Switching to SQLite cleaned up a tremendous amount of code.  Debugging was easier—we could just open up the SQLite command line client and run queries manually—and iterating on our schema was simple.  Simple data fetches were now a SELECT query instead of custom data file parsing code.
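
As a flavor of what that means in practice, a data fetch reduces to a few lines through the SQLite C API (the table and column names below are illustrative, not our actual schema):

#include <stdio.h>
#include <sqlite3.h>

static int print_row( void *unused, int argc, char **argv, char **col )
{
   for ( int i = 0; i < argc; ++i )
      printf( "%s=%s ", col[ i ], argv[ i ] ? argv[ i ] : "NULL" );
   printf( "\n" );
   return 0;
}

void dump_slow_zones( sqlite3 *db )
{
   // ten longest zones, straight from a query instead of a custom parser
   sqlite3_exec( db,
      "SELECT name, start_time, duration FROM zones "
      "ORDER BY duration DESC LIMIT 10;",
      print_row, NULL, NULL );
}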

And by using mostly standard SQL, we can add support for another database solution such as Microsoft SQL Server, Oracle, MySQL, or PostgreSQL.

Level of Detail

Level of detail (LOD) was not a consideration for Telemetry because it assumed sparse data sets focused on a single frame at a time.  Dense data sets where you'd also want to zoom out to hundreds of frames never seemed like a plausible situation.  Predictably, after Telemetry was released we started seeing immensely dense customer data sets that users were zooming way, way out on, blowing up memory usage and rendering performance.

I'm talking about hundreds of thousands of visible zones and hundreds of frames, something that the Visualizer's renderer just wasn't expecting.  As an example, the Visualizer had a very small amount of setup work to render each captured frame, maybe half a millisecond of overhead.  This was lost in the noise when rendering two to three frames of data.  With even 100 visible frames that overhead suddenly mattered, capping our frame rate at 20 Hz even without rendering anything!

The actual low level renderer also assumed on the order of a few thousand primitives.  Scaling past that by one or two orders of magnitude made it chug due to per-primitive overhead.

As a stopgap measure we did basic dynamic merging of small zones in the Visualizer, which improved performance dramatically.  Of course, this only encouraged customers to keep pushing the boundaries, so we needed a better long term solution.

While the client side LOD coalescing fixed rendering speed, it did nothing to address memory usage.  Since the LOD was handled by the Visualizer it was possible to have hundreds of thousands or even millions of zones in memory.  Add in all the other data and now you're looking at memory usage of many gigabytes.  This hurt the small minority of customers still running 32-bit operating systems, but it also impacted our customers trying to wrangle huge data sets (4+ GB).

Telemetry 2 addresses this by making level of detail a first class feature, covering all render types (not just zones).  LOD generation is handled by the server during capture, with a hierarchical LOD system reminiscent of texture MIP maps.  A significant amount of engineering work went into devising this system.  The dynamic range of data captures is massive, allowing users to zoom from clock cycles all the way out to hours.


Hundreds of thousands of zones spanning minutes of time
Zones under one microsecond in length

Hundreds of thousands of context switches
Context switches at the clock cycle level

User Defined Frame Boundaries

Telemetry's data visualization was centered around the notion of a 'frame', a unit of time consisting of one pass through the game loop ending in a graphics buffer swap.  For many games this was a logical unit of work, and when optimizing frame rate it was clearly the right focus.

However over time customers started focusing on non-rendering tasks such as physics or resource loading, where conceptually dividing execution on rendering boundaries didn't make sense.

To address this, Telemetry 2 introduced TMPF_FRAMETIME plots, which are like normal plots but are rendered as time intervals, allowing a programmer to create an arbitrary number of frame groups and then reorient the zone display around the group that matters.  Now the physics, job manager, and render programmers can all see the data the way they want to see it!

Full Text Search

SQLite provides full text search capabilities via its FTS4 extension, and we've taken advantage of this with Telemetry 2.  Text within events is searchable, so "needle in a haystack" issues are much more manageable.  For example, our Valve Team Fortress 2 dataset, which is 20GB and encompasses over 10 minutes of game play, captured an overtime event during play.  To find when this occurred I searched for "overtime*wav", which located the specific zone that played "sound\vo\announcer_overtime4.wav" in just a few seconds.

Backwards Compatibility

Early on we considered radically changing the Telemetry API to reflect some of the things we learned, but ultimately decided against it.  As a result, the Telemetry run-time is identical between Telemetry 1.1 and Telemetry 2—this means you can use the same binary and switch between Telemetry versions based solely on which server you're connecting to.  This allows existing customers to evaluate the switchover before committing any significant coding resources.

That said, there are some minor transition changes having to do with object paths, mostly because we're trying to make path specification consistent across the different API entry points.  Paths are now always in the form "(path/path2)leaf":

tmPlot( cx, ... "(game/AI)num_entities" );
tmMessage( cx, ... "(warnings)This is a warning" );
tmEnter( cx, ... "(renderMesh)%s", meshName );

There's a slight caveat with how we handle things when doing aggregate zone profiling (the old profiler track view), but we'll cover that in a later migration document.

Summary

Hopefully this post gives you some things to get excited about with Telemetry 2.  We can't wait to get this into everybody's hands, not just because it's a better experience and product, but because the core technology is so robust and scalable that adding new features and making enhancements will be significantly easier than with Telemetry 1.x.  And not just for us—by providing an open data format, customers can now track and mine data for their own internal use.

This will allow us to be more responsive to customer requests and also spend less time on support and bug fixes.

If you're interested in the beta, please drop us a line at sales3@radgametools.com and we'll arrange an evaluation!

Friday, March 8, 2013

Telemetry 2: A Preamble

Telemetry was released October 25, 2010.  In the intervening 2.5 years we've learned a lot about customer use cases and expectations for a product like this, and have taken the opportunity to revisit some of our assumptions and underpinnings for our upcoming release of Telemetry 2.

When Telemetry was designed there were no truly analogous products to compare it against.  Traditional profilers have significant limitations in both visualization and session sizes, tending to focus on the micro level (clock cycles, cache misses, time spent in aggregate in different areas) and short captures (a few seconds).  Telemetry operates at a higher granularity (millisecond+) for extended session captures (several minutes), illustrating the relationships between locks, memory, CPU time and application state.

While there's some overlap, the important thing to keep in mind is that Telemetry is trying to solve a fundamentally different type of problem than just "Where is my time being spent over this period of time?".  Instead it tries to answer questions like "What is the relationship between my game's state and performance over time?"  No other product (that I'm aware of) attempted to address that line of questioning.

The downside of this is that many of Telemetry's fundamental architectural decisions were based on best-guess assumptions, a combination of our own experiences and what our peers and other RAD customers thought they might like.  We had to make some significant assumptions and go with it.

In addition, Dresscode by Tuxedo Labs, the precursor to Telemetry, made its own assumptions about the nature of data capture and visualization that impacted the design.

Some of these assumptions have since proven to be limiting.  Not because the assumptions were dumb or ill-informed, but mostly because once you hand a tool to a user, how that tool is used is out of your control.  And if they decide to use your screwdriver as a hammer and it breaks, your screwdriver will still get the blame.

Here are some of the fundamental Telemetry assumptions we made that we have revisited in Telemetry 2:

Data Sets Would Be Memory Resident

Dresscode worked with data sets that were entirely in memory.  Dresscode's 32-bit architecture limited data sets to 1GB or less, which constrained session complexity and duration.  At the time this was a reasonable restriction since Dresscode's use cases tended to be simple (hundreds of zones per frame, and a few thousand frames at most).

In-core or out-of-core data management is a significant architectural decision that is non-trivial to change.  At the time we assumed that since the worst case with Dresscode was about 1GB, by migrating to 64-bit we could run in core even with "crazy" data sets several times that size.

Reality—and enthusiastic customers—disabused us of that notion.  We found that users could trivially blow past that assumed 1GB comfort zone.  Sixteen gigabyte and larger data sets were not unheard of.  Even on 64-bit Windows an overzealous data capture could bring Telemetry (or the machine) down.

This wasn't a fundamental implementation flaw in Telemetry so much as an unanticipated use case, born of my naivete (thinking users would stay within what we considered reasonable guidelines) and of our not enforcing hard limits by cutting off captures beyond some threshold.  Because those guidelines were unclear and easy to exceed in common use, this became an issue for a small subset of our users.

Average Events Per Second Was a Meaningful Metric

Part of Telemetry's power is that you can annotate your code however you want.  It's programmer driven profiling, so if you don't care about, say, your audio subsystem, then you don't mark it up.  And if you really care about your renderer, you can mark it up at a very fine granularity (sub-millisecond).

We used Telemetry quite a bit internally before releasing it, and we also had several beta testers using it, and based on that limited data we made assumptions about likely markup patterns in the wild.  Our feeling was that markup would be relatively even, with the occasional hierarchical spike where something was "Telemetrized" heavily.

In reality what we got were all kinds of markup patterns:
  • sparse zones, but dense plotting
  • uniformly dense zones within our average events per second parameters
  • average zone density with intermittent, short lived, highly dense zones (lasting less than a second), usually forming narrow but deep hierarchies.  This is usually a result of light initial markup followed by heavy emphasis on a problematic subsystem.
  • average zone density intermixed with massive zone density lasting many seconds or minutes (level loads, startup, and things like that)
  • ...and many other types
The distribution of Telemetry events over time can impact the run-time's performance.  Extremely dense activity can create locking overhead.  Long duration, dense activity can back up the network, especially on systems with slow networking such as the Nintendo WiiU and Microsoft Xbox 360.

Each of those cases impacted the run-time and server in different ways, especially when you factor in platform idiosyncrasies.  The problem was that many of our expectations were based on average events per second, not peak events per second.  It was a simple shorthand that let us estimate things like buffer sizes, network bandwidth, disk space requirements, and so on.

It's fine to take peak values and derive averages from them, but taking an average and inferring peak values from it is not as valid.

As a result some usage patterns would slam Telemetry pretty hard, even if the average data rate was within spec.

Data Writes Were More Important Than Data Reads

We knew that we would be generating gobs of data.  In some cases a gigabyte a minute (or more), depending on the degree of markup in a game.  Early versions of the server cooked incoming data as it arrived instead of buffering it, which could back up the client if we fell behind.  To maximize performance we used a custom binary format designed for write bandwidth.  Something like XML would have bloated our size by a factor of 20 at least, and at the time we were concerned that a traditional database would not be able to handle our write rates (since most database systems are optimized for reads, not writes).

This decision made sense early on, but two problems emerged.  The first is that a small but real minority of our users wanted access to Telemetry's data from their own tools.  Documenting the Telemetry file format, with the implicit promise that we wouldn't break it, was a daunting task.  In addition, supporting everyone else's file parsers would have been an onerous support headache.

The second problem was that our proprietary format wasn't indexed—in fact, it had the bare minimum amount of data in it to keep write speeds up, so random access was pretty much off the table.  We could grab big chunks of data very coarsely, but it was slow, especially if it was a live session.  This wasn't considered a significant issue because we assumed we'd be memory resident and could stream all the data from the server without any need for searching or seeking.

The lack of indexing made seeking and searching data very difficult.

Render Frames Were Natural Frame Boundaries

Telemetry operates with a fundamental unit of measure known as a frame.  This delineation is natural in games, where render speed is one of the most important performance concerns, but as more advanced games and non-gaming applications began using Telemetry, customers began to find it limiting.  For example, if you're writing an off-line data transformation tool, the 'render' frame is clearly not where you want to see work divisions.  And if you're working on a subsystem that is decoupled from your render speed, such as audio or physics, then you want to see your own subsystem frames as the primary division of work.

Summary

Telemetry's core assumptions about data were centered around the notion of grabbing big chunks of data and keeping it all in memory, and that this data could be in an opaque format that streamed quickly.  After a couple years of customer usage these assumptions have turned out to be limiting, so with Telemetry 2 we've revisited this core pillar of the technology.

Wednesday, February 20, 2013

RAD Telemetry 1.1r Shipped!

The big update is that we now have "out of the box" support for Linux/ARM, with some caveats.  But by and large if you're working on some kind of embedded Linux/ARM system with reasonable networking then Telemetry should be viable for you.

And of course we have our standard updates, bugfixes, and compatibility stuff!

Run-time

  • Enhancement: support for ARM/Linux
  • Enhancement: added TMP_EMULATED_SERVER_VERSION for forward compatibility
  • Enhancement: updated support for unannounced platforms
  • Bugfix: rare bug where Telemetry internal zones could arrive slightly out of order.  This didn't have a practical effect on the end user.
  • Bugfix: 360: rare bug where context switches could arrive out of order (oddly enough, unrelated to the previous bug).
  • Bugfix: 360: TMCT_FILE failed due to extra backslash
  • Bugfix: moved tmEnter timestamp fetch to after locking the write buffer to avoid potential out-of-order zones

Visualizer

  • Bugfix: fixed rare crash bug in aligned memory allocator
  • Bugfix: latent bug in string class fixed

Server

  • Bugfix: latent bug in string class fixed

Misc

  • Bugfix: fixed plot export bug in tmsym.  This is an unadvertised feature so it impacted only one customer.

Monday, January 14, 2013

RAD Telemetry 1.1q shipped

Telemetry 1.1q has shipped with bug fixes and additional support for unannounced platforms. Email us at sales3@radgametools.com if you want more information on upcoming platform support.

IMPORTANT: All Telemetry components must be upgraded simultaneously since there has been a change in the network protocol!

Run-Time
  • Bugfix: forward slashes in app name passed to tmOpen would cause hangs
  • Bugfix: repeated tmOpen/tmClose with context switches enabled on 360 or Windows would cause a hang in tmOpen
  • Change: protocol updated for forward compatibility
  • Enhancement: updated support for unannounced platforms
Server
  • Change: protocol updated for forward compatibility