Monday, March 25, 2013

Telemetry 2: Architectural Improvements

In a previous post I talked a bit about some of the issues we faced after releasing Telemetry, and specifically how some early assumptions and decisions ended up making Telemetry difficult to update with unanticipated feature requests.  Examples of these decisions and assumptions included:
  • data sets would be memory resident
  • customers would annotate their code the way we did
  • write bandwidth was more important than read bandwidth
  • users would use the Visualizer for accessing data instead of trying to get to it themselves
  • data was difficult to find
  • 'render frame' was a natural frame boundary for everyone
With Telemetry 2 we wanted to address these issues.  This meant:
  • on-demand data loading
  • fast random seek and search of data
  • open data format
  • extreme dynamic range in Visualization, from clock cycles all the way up to viewing hours of data
  • fast, reliable search
  • different 'frame sets' so you could look at "render frames" or "physics frames" or whatever
  • ...without breaking existing customers' Telemetry API integrations
We stopped saying, or even thinking, things like "Well, most of the time... [some convenient assumption]... so we can probably just do this the simple way".  Our goal with Telemetry 2 is to anticipate and support worst case scenarios as much as possible, without making the average case suffer.

SQLite

None of our goals were attainable without changing our data file format.  Telemetry's original format was designed for fast streaming writes to avoid backing up the run-time client (which has since been fixed by buffering to an intermediate set of data files).  And since we assumed that our data would be read and sent in big chunks, random seek was slow and clumsy.  In addition, the format was proprietary—not because we wanted to hide it, but because supporting customer access of data would have been a support nightmare.

A real database system was needed, but the conventional wisdom is that traditional databases are too slow for real-time writing of the enormous amounts of data we were generating.  Also, we had shied away from databases early on because it was one more thing for end users to configure and install.  Asking a potential customer to install and configure a MySQL service was just another barrier to entry.

With Telemetry 2 we re-examined these assumptions and decided to go with SQLite.  We ran some preliminary performance tests and discovered that, while slower than Telemetry 1.x (which is understandable, since Telemetry was just writing raw data as fast as it could), its write bandwidth on mechanical media was not unreasonable.  In addition, since SQLite is built as a library instead of a separate program, installation didn't have any extra steps for our customers.

Switching to SQLite cleaned up a tremendous amount of code.  Debugging was easier (we could just open up the SQLite command line client and run queries manually) and iterating on our schema was simple.  Simple data fetches were now a SELECT query instead of custom data file parsing code.

And by using mostly standard SQL, we can add support for another database solution such as Microsoft SQL Server, Oracle, MySQL, or PostgreSQL.

Level of Detail

Level of detail (LOD) was not a consideration for Telemetry because it assumed sparse data sets focused on a single frame at a time.  Dense data sets where you'd also want to zoom out to hundreds of frames never seemed like a plausible situation.  Predictably, after Telemetry was released we started seeing immensely dense customer datasets, with customers zooming way, way out on them and blowing up memory usage and rendering performance.

I'm talking about hundreds of thousands of visible zones and hundreds of frames, something that the Visualizer's renderer just wasn't expecting.  As an example, the Visualizer had a very small amount of setup work to render each captured frame, maybe half a millisecond of overhead.  This was lost in the noise when rendering two to three frames of data.  With even 100 visible frames that overhead suddenly mattered, capping our frame rate at 20 Hz even without rendering anything!
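The arithmetic behind that 20 Hz ceiling is simple enough to write down.  This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the half-millisecond figure is the rough estimate quoted above, not a measured constant.

```c
#include <assert.h>

/* Upper bound on the refresh rate (in Hz) when every visible frame costs a
   fixed amount of setup time, before any actual drawing happens.
   Hypothetical helper for illustration; not part of Telemetry. */
double max_refresh_hz(int visible_frames, double overhead_ms_per_frame)
{
    assert(visible_frames > 0 && overhead_ms_per_frame > 0.0);

    /* Total fixed cost per redraw, in milliseconds. */
    double total_ms = visible_frames * overhead_ms_per_frame;

    /* 1000 ms per second divided by the cost of one redraw. */
    return 1000.0 / total_ms;
}
```

With 100 visible frames at 0.5 ms each, the fixed cost is 50 ms per redraw, which works out to the 20 Hz cap.  With three visible frames the same overhead is 1.5 ms, which is why nobody noticed it earlier.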

The actual low level renderer also assumed on the order of a few thousand primitives.  Scaling past that by one or two orders of magnitude made it chug due to per-primitive overhead.

As a stop gap measure we did basic dynamic merging of small zones on the Visualizer, which improved performance dramatically.  Of course, this only encouraged customers to keep pushing the boundaries, so we needed a better long term solution.
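To give a feel for what merging small zones can look like, here is a minimal sketch.  It is not the Visualizer's actual code: the Zone struct, the function name, and the merge rule (fold neighboring zones together when each is shorter than a threshold and the gap between them is also under that threshold) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zone record: a time interval [start, end) in ticks. */
typedef struct {
    uint64_t start;
    uint64_t end;
} Zone;

/* Coalesces runs of neighboring small zones in place and returns the new
   zone count.  Input must be sorted by start time and non-overlapping
   (think: one depth level of one thread).  A zone counts as "small" when
   it is shorter than min_len, e.g. the number of ticks under one pixel. */
size_t merge_small_zones(Zone *zones, size_t count, uint64_t min_len)
{
    size_t out = 0;      /* number of zones written so far */
    int prev_small = 0;  /* is zones[out - 1] a run of small zones? */

    for (size_t i = 0; i < count; ++i) {
        Zone z = zones[i];
        assert(z.end >= z.start);
        int small = (z.end - z.start) < min_len;

        if (small && prev_small && z.start - zones[out - 1].end < min_len) {
            zones[out - 1].end = z.end;  /* extend the current run */
        } else {
            zones[out++] = z;            /* start a new output zone */
            prev_small = small;
        }
    }
    return out;
}
```

A dense run of tiny zones collapses into a single primitive, while zones big enough to see are left alone.  Note that this only reduces what gets drawn; the caller still had to have every original zone in memory first.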

While the client side LOD coalescing fixed rendering speed, it did nothing to address memory usage.  Since the LOD was handled by the Visualizer it was possible to have hundreds of thousands or even millions of zones in memory.  Add in all the other data and now you're looking at memory usage of many gigabytes.  This hurt the small minority of customers still running 32-bit operating systems, but it also impacted our customers trying to wrangle huge data sets (4+ GB).

Telemetry 2 addresses this by making level of detail a first class feature, covering all render types (not just zones).  LOD generation is handled by the server during capture, with a hierarchical LOD system reminiscent of texture MIP maps.  A significant amount of engineering work went into devising this system.  The dynamic range of data captures is massive, allowing users to zoom from clock cycles all the way out to hours.
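The MIP map analogy can be sketched in a few lines.  To be clear, this is not Telemetry 2's actual LOD scheme, which isn't described here.  It just assumes a hypothetical per-bucket summary (zone count and busy time) where each coarser level combines two adjacent buckets from the level below, the way each MIP level halves a texture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of everything that happened in one time bucket. */
typedef struct {
    uint32_t zone_count;  /* zones that fell into this bucket */
    uint64_t busy_ticks;  /* total ticks covered by those zones */
} Bucket;

/* Builds the next coarser level from src into dst.  Each dst bucket
   summarizes two adjacent src buckets; an odd trailing bucket is carried
   over alone.  dst needs room for (src_count + 1) / 2 buckets.
   Returns the number of buckets written. */
size_t lod_reduce(const Bucket *src, size_t src_count, Bucket *dst)
{
    size_t dst_count = (src_count + 1) / 2;
    for (size_t i = 0; i < dst_count; ++i) {
        Bucket b = src[2 * i];
        if (2 * i + 1 < src_count) {
            b.zone_count += src[2 * i + 1].zone_count;
            b.busy_ticks += src[2 * i + 1].busy_ticks;
        }
        dst[i] = b;
    }
    return dst_count;
}

/* Picks the coarsest level whose buckets are still no wider than one
   pixel.  Level 0 buckets are base_bucket_ticks wide and each level
   doubles that width. */
unsigned lod_level_for_view(uint64_t base_bucket_ticks, uint64_t ticks_per_pixel)
{
    assert(base_bucket_ticks > 0);
    unsigned level = 0;
    uint64_t width = base_bucket_ticks;
    while (width <= ticks_per_pixel / 2) {  /* doubling still fits in a pixel */
        width *= 2;
        ++level;
    }
    return level;
}
```

The appeal of a scheme like this is that the viewer only ever asks for the level that matches the current zoom, so the amount of data fetched depends on screen width rather than on how dense the capture is.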


[Image: Hundreds of thousands of zones spanning minutes of time]
[Image: Zones under one microsecond in length]

[Image: Hundreds of thousands of context switches]
[Image: Context switches at the clock cycle level]

User Defined Frame Boundaries

Telemetry's data visualization was centered around the notion of a 'frame', a unit of time consisting of one game loop ending with a graphics system buffer swap.  For many games this was a logical unit of work, and when optimizing frame rate this was clearly the right focus.

However, over time customers started focusing on non-rendering tasks such as physics or resource loading, where conceptually dividing execution on rendering boundaries didn't make sense.

To address this, Telemetry 2 introduced TMPF_FRAMETIME plots, which are like normal plots but are rendered as time intervals.  These let a programmer create an arbitrary number of frame groups and then reorient the zone display on the group that matters.  Now the physics, job manager, and render programmers can all see the data the way they want to see it!
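Conceptually, each frame group is just its own list of boundary times, and "reorienting" means asking which interval of the chosen group a given timestamp falls into.  The sketch below illustrates that idea with a binary search.  It does not use the Telemetry API, and the function name and data layout are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the frame containing time t, given the sorted start
   times of every frame in one frame group, or -1 if t comes before the
   first frame.  Hypothetical helper; not part of Telemetry. */
long frame_index_at(const uint64_t *frame_starts, size_t count, uint64_t t)
{
    assert(frame_starts != NULL || count == 0);

    /* Binary search for the first frame that starts after t. */
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (frame_starts[mid] <= t)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* The frame just before that one is the frame containing t. */
    return (long)lo - 1;
}
```

Given render frames starting at 0, 16, 33, and 50 and physics frames starting at 0, 10, 20, 30, 40, and 50, a zone at time 35 lands in render frame 2 but physics frame 3.  Same data, different grouping, depending on who is looking.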

Full Text Search

SQLite provides full text search capabilities via its FTS4 extension, and we've taken advantage of this with Telemetry 2.  Text within events is searchable, so "needle in a haystack" issues are much more manageable.  For example, our Valve Team Fortress 2 dataset, which is 20 GB and covers over 10 minutes of gameplay, contains an overtime event.  To find when this occurred I searched for "overtime*wav", which took a few seconds to find the specific zone that played "sound\vo\announcer_overtime4.wav".

Backwards Compatibility

Early on we considered radically changing the Telemetry API to reflect some of the things we learned, but eventually wisely decided against it.  As a result, the Telemetry run-time is identical between Telemetry 1.1 and Telemetry 2—this means you can use the same binary and switch between Telemetry versions based solely on which server you're connecting to.  This allows existing customers to evaluate the switchover before committing any significant coding resources.

That said, there are some minor transition changes having to do with object paths, mostly because we're trying to make path specification consistent across the different API entry points.  Paths are now always in the form "(path/path2)leaf":

tmPlot( cx, ... "(game/AI)num_entities" );
tmMessage( cx, ... "(warnings)This is a warning" );
tmEnter( cx, ... "(renderMesh)%s", meshName );
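If you have tooling that needs to pick these strings apart, the format is easy to split.  Here is a minimal sketch of a splitter for the "(path/path2)leaf" form.  The function is hypothetical and not part of the Telemetry API, and treating a string with no leading parenthesized path as a non-match is a choice made for this sketch, not a statement about what Telemetry accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits "(path/path2)leaf" into its path and leaf parts.
   Returns 1 on success, or 0 if s does not start with a parenthesized
   path or if either output buffer is too small.
   Hypothetical helper for illustration; not part of Telemetry. */
int split_object_path(const char *s, char *path, size_t path_size,
                      char *leaf, size_t leaf_size)
{
    assert(s != NULL && path != NULL && leaf != NULL);

    if (s[0] != '(')
        return 0;
    const char *close = strchr(s, ')');
    if (close == NULL)
        return 0;

    size_t path_len = (size_t)(close - (s + 1));
    size_t leaf_len = strlen(close + 1);
    if (path_len >= path_size || leaf_len >= leaf_size)
        return 0;

    memcpy(path, s + 1, path_len);
    path[path_len] = '\0';
    memcpy(leaf, close + 1, leaf_len + 1);  /* copies the terminator too */
    return 1;
}
```

Splitting "(game/AI)num_entities" this way yields the path "game/AI" and the leaf "num_entities".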

There's a slight caveat with how we handle things when doing aggregate zone profiling (the old profiler track view), but we'll cover that in a later migration document.

Summary

Hopefully this post gives you some things to get excited about with Telemetry 2.  We can't wait to get this into everybody's hands, not just because it's a better experience and product, but because the core technology is so robust and scalable that adding new features and making enhancements will be significantly easier than with Telemetry 1.x.  And not just for us—by providing an open data format, customers can now track and mine data for their own internal use.

This will allow us to be more responsive to customer requests and also spend less time on support and bug fixes.

If you're interested in the beta, please drop us a line at sales3@radgametools.com and we'll arrange an evaluation!
