RAD Telemetry: March 2013

Monday, March 25, 2013

Telemetry 2: Architectural Improvements

In a previous post I talked a bit about some of the issues we faced after releasing Telemetry, and specifically how some early assumptions and decisions ended up making Telemetry difficult to update with unanticipated feature requests. Examples of these decisions and assumptions included:

data sets would be memory resident
customers would annotate their code the way we did
write bandwidth was more important than read bandwidth
users would use the Visualizer for accessing data instead of trying to get to it themselves
data was difficult to find
'render frame' was a natural frame boundary for everyone

With Telemetry 2 we wanted to address these issues. This meant:

on-demand data loading
fast random seek and search of data
open data format
extreme dynamic range in Visualization, from clock cycles all the way up to viewing hours of data
fast, reliable search
different 'frame sets' so you could look at "render frames" or "physics frames" or whatever
...without breaking existing customer's Telemetry API integrations

We stopped saying, or even thinking, things like "Well, most of the time..[some convenient assumption]....so we can probably just do this the simple way". Our goal with Telemetry 2 is to anticipate and support worst case scenarios as much as possible, without making the average case suffer.

SQLite

None of our goals were attainable without changing our data file format. Telemetry's original format was designed for fast streaming writes to avoid backing up the run-time client (which has since been fixed by buffering to an intermediate set of data files). And since we assumed that our data would be read and sent in big chunks, random seek was slow and clumsy. In addition, the format was proprietary—not because we wanted to hide it, but because supporting customer access of data would have been a support nightmare.

A real database system was needed, but the conventional wisdom is that traditional databases are too slow for real-time writing of the enormous amounts of data we were generating. Also, we had shied away from databases early on because it was one more thing for end users to configure and install. Asking a potential customer to install and configure a MySQL service was just another barrier to entry.

With Telemetry 2 we re-examined these assumptions and decided to go with SQLite. We ran some preliminary performance tests and discovered that, while slower than Telemetry 1.x (which is understandable, since Telemetry was just writing raw data as fast it could), its write bandwidth on mechanical media was not unreasonable. In addition, since SQLite is built as a library instead of a separate program, installation didn't have any extra steps for our customers.

Switching to SQLite cleaned up a tremendous of code. Debugging was easier—we could just open up the SQLite command line client and run queries manually—and iterating on our schema was simple. Simple data fetches were now a SELECT query instead of custom data file parsing code.

And by using mostly standard SQL, we can add support for another database solution such as Microsoft SQL Server, Oracle, MySQL, or PostgreSQL.

Level of Detail

Level of detail (LOD) was not a consideration for Telemetry because it assumed sparse data sets focused on a single frame at a time. Dense data sets where you'd also want to zoom out to hundreds of frames never seemed like a plausible situation. Predictably, after Telemetry was released we started seeing immensely dense customer datasets that they were zooming way, way out, blowing up memory usage and rendering performance.

I'm talking about hundreds of thousands of visible zones and hundreds of frames, something that the Visualizer's renderer just wasn't expecting. As an example, the Visualizer had a very small amount of setup work to render each captured frame, maybe half a millisecond of overhead. This was lost in the noise when rendering two to three frames of data. With even 100 visible frames that overhead suddenly mattered, capping our frame rate at 20 Hz even without rendering anything!

The actual low level renderer also assumed on the order of a few thousand primitives. Scaling past that by one or two orders of magnitude made it chug due to per-primitive overhead.

As a stop gap measure we did basic dynamic merging of small zones on the Visualizer, which improved performance dramatically. Of course, this only encouraged customers to keep pushing the boundaries, so we needed a better long term solution.

While the client side LOD coalescing fixed rendering speed, it did nothing to address memory usage. Since the LOD was handled by the Visualizer it was possible to have hundreds of thousands or even millions of zones in memory. Add in all the other data and now you're looking at memory usage of many gigabytes. This hurt the small minority of customers still running 32-bit operating systems, but it also impacted our customers trying to wrangle huge data sets (4+ GB).
Telemetry 2 addresses this by making level of detail a first class feature, covering all render types (not just zones). LOD generation is handled by the server during capture, with a hierarchical LOD system reminiscent of texture MIP maps. A significant amount of engineering work went into devising this system. The dynamic range of data captures is massive, allowing users to zoom from clock cycles all the way out to hours.

Hundreds of thousands of zones spanning minutes of time

Zones under one microsecond in length

Hundreds of thousands of context switches

Context switches at the clock cycle level

User Defined Frame Boundaries

Telemetry's data visualization was centered around the notion of a 'frame', a unit of time consisting of one game loop ending a graphics system buffer swap. For many games this was a logical unit of work, and when optimizing frame rate this was clearly the right focus.

However over time customers started focusing on non-rendering tasks such as physics or resource loading, where conceptually dividing execution on rendering boundaries didn't make sense.

To address this Telemetry 2 introduced TMPF_FRAMETIME plots, which are like normal plots but are rendered as time intervals, allowing a programmer to create any arbitrary number of frame groups and then reorient the zone display on the group that matters. Now the physics, job manager, and render programmers can all see the data the way they want to see it!

Full Text Search

SQLite provides full text search capabilities via its FTS4 extension, and we've taken advantage of this with Telemetry 2. Text within events is searchable, so "needle in a haystack" issues are much more manageable. For example, our Valve Team Fortress 2 dataset, which is 20GB encompassing over 10 minutes of game play, had an overtime event in it while playing. To find when this occurred I searched for "overtime*wav", which found the specific zone that played "sound\vo\announcer_overtime4.wav" and took a few seconds to do so.

Backwards Compatibility

Early on we considered radically changing the Telemetry API to reflect some of the things we learned, but eventually wisely decided against it. As a result, the Telemetry run-time is identical between Telemetry 1.1 and Telemetry 2—this means you can use the same binary and switch between Telemetry versions based solely on which server you're connecting to. This allows existing customers to evaluate the switchover before committing any significant coding resources.

That said, there are some minor transition changes having to do with object paths, mostly because we're trying to make path specification consistent across the different API entry points. Paths are now always in the form "(path/path2)leaf":

tmPlot( cx, ... "(game/AI)num_entities" );
tmMessage( cx, ... "(warnings)This is a warning" );
tmEnter( cx, ... "(renderMesh)%s", meshName );

There's a slight caveat with how we handle things when doing aggregate zone profiling (the old profiler track view), but we'll cover that in a later migration document.

Summary

Hopefully this post gives you some things to get excited about with Telemetry 2. We can't wait to get this into everybody's hands, not just because it's a better experience and product, but because the core technology is so robust and scalable that adding new features and making enhancements will be significantly easier than with Telemetry 1.x. And not just for us—by providing an open data format, customers can now track and mine data for their own internal use.

This will allow us to be more responsive to customer requests and also spend less time on support and bug fixes.

If you're interested in the beta, please drop us a line at sales3@radgametools.com and we'll arrange an evaluation!

Wednesday, March 20, 2013

Nice words from Gaslamp Games!

One of our customers, indie sensation Gaslamp Games, posted a blog about their optimization work and mentioned Telemetry!

http://www.gaslampgames.com/2013/03/20/cpu-and-gpu-run-time-profiling-in-simulation-land/

Friday, March 8, 2013

Telemetry 2: A Preamble

Telemetry was released October 25, 2010. In the intervening 2.5 years we've learned a lot about customer use cases and expectations for a product like this, and have taken the opportunity to revisit some of our assumptions and underpinnings for our upcoming release of Telemetry 2.

When Telemetry was designed there were no real analogous products with which to compare. Traditional profilers have significant limitations in both visualization and session sizes, tending to focus on the micro level (clock cycles, cache misses, time spent in aggregate in different areas) and short captures (a few seconds). Telemetry operates at a higher granularity (millisecond+) for extended session captures (several minutes), illustrating the relationships between locks, memory, CPU time and application state.

While there's some overlap, the important thing to keep in mind is that Telemetry is trying to solve a fundamentally different type of problem than just "Where is my time being spent over this period of time?". Instead it tries to answer questions like "What is the relationship between my game's state and performance over time?" No other product (that I'm aware of) attempted to address that line of questioning.

The downside of this is that many of Telemetry's fundamental architectural decisions were based on best-guess assumptions, a combination of our own experiences and what our peers and other RAD customers thought they might like. We had to make some significant assumptions and go with it.

In addition, Dresscode by Tuxedo Labs, the precursor to Telemetry, made its own assumptions about the nature of data capture and visualization that impacted the design.

Some of these assumptions have since proven to be limiting. Not because the assumptions were dumb or ill-informed, but mostly because once you hand a tool to a user, how that tool is used is out of your control. And if they decide to use your screwdriver as a hammer and it breaks, your screwdriver will still get the blame.

Here are some of the fundamental Telemetry assumptions we made that we have revisited in Telemetry 2:

Data Sets Would Be Memory Resident

Dresscode worked with data sets that were entirely in memory. Dresscode's 32-bit architecture limited data sets to 1GB or less, which constrained session complexity and duration. At the time this was a reasonable restriction since Dresscode's use cases tended to be simple (hundreds of zones per frame, and a few thousand frames at most).

In-core or out-of-core data management is a significant architectural decision that is non-trivial to change. At the time we assumed that since the worst case with Dresscode was about 1GB, by migrating to 64-bit we could run in core even with "crazy" data sets several times that size.

Reality—and enthusiastic customers—disposed us of that notion. We found that users could trivially blow past that assumed 1GB comfort zone. Sixteen gigabyte and larger data sets were not unheard of. Even on 64-bit Windows an overzealous data capture could bring Telemetry (or the machine) down.

This wasn't a fundamental implementation flaw in Telemetry as much as an unanticipated use case that arose from my naivete (thinking users would stay within what we considered reasonable guidelines) and not enforcing hard limits by cutting off the capture beyond some threshold. But because these guidelines were unclear and/or easy to exceed in common use, it became an issue for a small subset of our users.

Average Events Per Second Was a Meaningful Metric

Part of Telemetry's power is that you can annotate your code however you want. It's programmer driven profiling, so if you don't care about, say, your audio subsystem, then you don't mark it up. And if you really care about your renderer, you can mark it up at a very fine granularity (sub-millisecond).

We used Telemetry quite a bit internally before releasing it, and we also had several beta testers using it, and based on that limited data we made assumptions about likely markup patterns in the wild. Our feeling was that markup would be relatively even, with the occasional hierarchical spike where something was "Telemetrized" heavily.

In reality what we got were all kinds of markup patterns:

sparse zones, but dense plotting
uniformly dense zones within our average events per second parameters
average zone density with intermittent, short lived, highly dense zones (lasting less than a second), usually forming narrow but deep hierarchies. This is usually a result of light initial markup followed by heavy emphasis on a problematic subsystem.
average zone density intermixed with massive zone density lasting many seconds or minutes (level loads, startup, and things like that)
...and many other types

The distribution of Telemetry events over time can impact the run-time's performance. Extremely dense activity can create locking overhead. Long duration, dense activity can back up the network, especially on systems with slow networking such as the Nintendo WiiU and Microsoft Xbox 360.

Each of those cases impacted the run time and server in different ways, especially when you factor in platform idiosyncracies. As a result many of our expectations were based on average events per second, not peak events per second. It was a simple shorthand that let us estimate things like buffer sizes, network bandwidth, disk space requirements, and so on.

It's fine to take peak values and estimate up to averages, but taking an average number and then inferring peak values is not as valid.

As a result some usage patterns would slam Telemetry pretty hard, even if the average data rate was within spec.

Data Writes Were More Important Than Data Reads

We knew that we would be generating gobs of data. In some cases a gigabyte a minute (or more), depending on the degree of markup in a game. Early versions of the server cooked incoming data as it arrived instead of buffering it, which could backup the client if we fell behind. To maximize performance we used a custom binary format designed for write bandwidth. Something like XML would have bloated our size by a factor of 20 at least, and at the time we were concerned that a traditional database would not be able to handle our write rates (since most database systems are optimized for reads, not writes).

This decision made sense early on, but two problems emerged. The first is that a small but real minority of our users wanted access to Telemetry's data from their own tools. Documenting the Telemetry file format and the implicit promise that we wouldn't break it was a daunting task. In addition, supporting everyone else's file parser would have been an onerous support headache.

The second problem was that our proprietary format wasn't indexed—in fact, it had the bare minimum amount of data in it to keep write speeds up, so random access was pretty much off the table. We could grab big chunks of data very coarsely, but it was slow, especially if it was a live session. This wasn't considered a significant issue because we assumed we'd be memory resident and could stream all the data from the server without any need for searching or seeking.

The lack of indexing made seeking and searching data very difficult.

Render Frames Were Natural Frame Boundaries

Telemetry operates with a fundamental unit of measure known as a frame. This delineation is natural in games, where render speed is one of the most important performance concerns, but with more advanced games and non-gaming applications using Telemetry customers began to find it limiting. For example, if you're writing an off-line data transformation tool, the 'render' frame is clearly not where you want to see work divisions. And if you're working on a subsystem that is decoupled from your render speed, such as audio or physics, then you want to see your own subsystem frames as the primary division of work.

Summary

Telemetry's core assumptions about data were centered around the notion of grabbing big chunks of data and keeping it all in memory, and that this data could be in a opaque format that streamed quickly. After a couple years of customer usage this assumption has turned out to be limiting, so with Telemetry 2 we've revisited this core pillar of the technology.