Friday, March 8, 2013

Telemetry 2: A Preamble

Telemetry was released October 25, 2010.  In the intervening 2.5 years we've learned a lot about customer use cases and expectations for a product like this, and have taken the opportunity to revisit some of our assumptions and underpinnings for our upcoming release of Telemetry 2.

When Telemetry was designed there were no truly analogous products to compare it against.  Traditional profilers have significant limitations in both visualization and session sizes, tending to focus on the micro level (clock cycles, cache misses, time spent in aggregate in different areas) and short captures (a few seconds).  Telemetry operates at a coarser granularity (a millisecond and up) over extended session captures (several minutes), illustrating the relationships between locks, memory, CPU time, and application state.

While there's some overlap, the important thing to keep in mind is that Telemetry is trying to solve a fundamentally different problem than just "Where is my time being spent over this period of time?"  Instead it tries to answer questions like "What is the relationship between my game's state and performance over time?"  No other product (that I'm aware of) has attempted to address that line of questioning.

The downside of this is that many of Telemetry's fundamental architectural decisions were based on best guesses: a combination of our own experiences and what our peers and other RAD customers thought they might like.  We had to make some significant assumptions and run with them.

In addition, Dresscode by Tuxedo Labs, the precursor to Telemetry, made its own assumptions about the nature of data capture and visualization that impacted the design.

Some of these assumptions have since proven to be limiting.  Not because the assumptions were dumb or ill-informed, but mostly because once you hand a tool to a user, how that tool is used is out of your control.  And if they decide to use your screwdriver as a hammer and it breaks, your screwdriver will still get the blame.

Here are some of the fundamental Telemetry assumptions we made that we have revisited in Telemetry 2:

Data Sets Would Be Memory Resident

Dresscode worked with data sets that were entirely in memory.  Dresscode's 32-bit architecture limited data sets to 1GB or less, which constrained session complexity and duration.  At the time this was a reasonable restriction since Dresscode's use cases tended to be simple (hundreds of zones per frame, and a few thousand frames at most).

Choosing between in-core and out-of-core data management is a significant architectural decision that is non-trivial to change later.  At the time we assumed that since the worst case with Dresscode was about 1GB, migrating to 64-bit would let us run in core even with "crazy" data sets several times that size.

Reality, and our enthusiastic customers, quickly disabused us of that notion.  We found that users could trivially blow past that assumed 1GB comfort zone.  Sixteen gigabyte and larger data sets were not unheard of.  Even on 64-bit Windows an overzealous data capture could bring Telemetry (or the machine) down.
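
To put rough numbers on it (the event rate and record size below are illustrative assumptions, not Telemetry's actual figures), a heavily marked-up title can outgrow an in-core budget surprisingly fast:

    #include <stdio.h>

    /* Back-of-envelope capture size.  The rate and record size are
     * illustrative assumptions, not Telemetry's actual numbers. */
    int main(void)
    {
        const double events_per_second = 200000.0;  /* heavily marked-up title */
        const double bytes_per_event   = 32.0;      /* assumed packed record size */
        const double session_minutes   = 30.0;      /* a long capture */

        double total = events_per_second * bytes_per_event * session_minutes * 60.0;
        printf("estimated capture size: %.1f GB\n", total / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

At those (plausible) rates a half-hour session lands around 10GB, well past anything you want to keep resident in a single process.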

This wasn't a fundamental implementation flaw in Telemetry so much as an unanticipated use case.  It arose from my naivete (assuming users would stay within what we considered reasonable guidelines) and from our choice not to enforce hard limits by cutting off a capture beyond some threshold.  Because those guidelines were unclear and easy to exceed in common use, this became an issue for a small subset of our users.

Average Events Per Second Was a Meaningful Metric

Part of Telemetry's power is that you can annotate your code however you want.  It's programmer-driven profiling, so if you don't care about, say, your audio subsystem, then you don't mark it up.  And if you really care about your renderer, you can mark it up at a very fine granularity (sub-millisecond).
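
For anyone who hasn't seen it, markup looks conceptually like the sketch below.  PROFILE_ZONE is a made-up stand-in for a Telemetry-style zone marker, not the real API; the point is simply that the programmer decides what gets instrumented and how finely.

    #include <stdio.h>

    /* Hypothetical stand-in for a Telemetry-style zone marker -- NOT the real
     * API.  Here it just logs the zone name; the real run-time would timestamp
     * the event and stream it to the server. */
    #define PROFILE_ZONE(name) printf("zone: %s\n", (name))

    /* A subsystem you care about: marked up at fine (sub-millisecond) granularity. */
    static void update_renderer(void)
    {
        PROFILE_ZONE("Renderer/Update");
        PROFILE_ZONE("Renderer/CullVisible");
        PROFILE_ZONE("Renderer/SubmitDrawCalls");
    }

    /* A subsystem you don't care about: no markup, so it never shows up. */
    static void update_audio(void)
    {
    }

    int main(void)
    {
        update_renderer();
        update_audio();
        return 0;
    }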

We used Telemetry quite a bit internally before releasing it, and several beta testers used it as well.  Based on that limited data we made assumptions about likely markup patterns in the wild.  Our feeling was that markup would be relatively even, with the occasional hierarchical spike where something was "Telemetrized" heavily.

In reality what we got were all kinds of markup patterns:
  • sparse zones, but dense plotting
  • uniformly dense zones within our average events per second parameters
  • average zone density with intermittent, short-lived, highly dense zones (lasting less than a second), usually forming narrow but deep hierarchies.  This is usually the result of light initial markup followed by heavy emphasis on a problematic subsystem.
  • average zone density intermixed with massive zone density lasting many seconds or minutes (level loads, startup, and things like that)
  • ...and many other types

The distribution of Telemetry events over time can impact the run-time's performance.  Extremely dense activity can create locking overhead.  Long duration, dense activity can back up the network, especially on systems with slow networking such as the Nintendo Wii U and Microsoft Xbox 360.

Each of those cases impacted the run-time and server in different ways, especially when you factor in platform idiosyncrasies.  Unfortunately, many of our expectations were based on average events per second, not peak events per second.  The average was a simple shorthand that let us estimate things like buffer sizes, network bandwidth, disk space requirements, and so on.

It's fine to take peak values and derive averages from them, but taking an average and inferring peak values from it is much less valid.

As a result some usage patterns would slam Telemetry pretty hard, even if the average data rate was within spec.
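
A contrived illustration of why that matters (every number below is invented for the example): size a buffer from the session's average event rate and a one-second burst, say a level load, blows straight through it.

    #include <stdio.h>

    /* Illustrative numbers only -- not Telemetry's real rates or buffer sizes. */
    int main(void)
    {
        const double avg_events_per_sec   = 20000.0;   /* whole-session average   */
        const double burst_events_per_sec = 500000.0;  /* one-second markup burst */
        const double bytes_per_event      = 32.0;      /* assumed record size     */

        /* Buffer sized from the average with a generous 2x safety margin. */
        double buffer_bytes = 2.0 * avg_events_per_sec * bytes_per_event;
        double burst_bytes  = burst_events_per_sec * bytes_per_event;

        printf("buffer sized from the average: %.2f MB\n", buffer_bytes / (1024.0 * 1024.0));
        printf("one second of burst data:      %.2f MB\n", burst_bytes  / (1024.0 * 1024.0));
        return 0;
    }

The average says everything fits comfortably; the peak is what actually hits the buffer.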

Data Writes Were More Important Than Data Reads

We knew that we would be generating gobs of data, in some cases a gigabyte a minute (or more), depending on the degree of markup in a game.  Early versions of the server cooked incoming data as it arrived instead of buffering it, which could back up the client if we fell behind.  To maximize performance we used a custom binary format designed for write bandwidth.  Something like XML would have bloated our size by a factor of 20 at least, and at the time we were concerned that a traditional database would not be able to handle our write rates (since most database systems are optimized for reads, not writes).
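
To make the trade-off concrete, here is a sketch of what a write-optimized record might look like.  The layout is invented for the example and is not Telemetry's actual file format; the point is that a packed, fixed-size record is a single cheap append, while the equivalent XML is several times larger and has to be formatted and parsed.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Invented example record -- NOT Telemetry's actual file format.  A packed,
     * fixed-size record can be appended with one fwrite and no per-event
     * formatting cost. */
    typedef struct {
        uint64_t timestamp;    /* ticks */
        uint32_t thread_id;
        uint32_t name_index;   /* index into a string table, not an inline string */
        uint8_t  kind;         /* enter, leave, plot, ... */
        uint8_t  pad[7];
    } EventRecord;             /* 24 bytes, vs. roughly 100 bytes for the same event
                                  as XML: <event t="..." tid="1" name="Renderer/Update"
                                  kind="enter"/> */

    int main(void)
    {
        FILE *f = fopen("events.bin", "wb");
        if (!f) return 1;

        EventRecord ev;
        memset(&ev, 0, sizeof(ev));
        ev.timestamp  = 123456789ull;
        ev.thread_id  = 1;
        ev.name_index = 42;
        ev.kind       = 0;

        fwrite(&ev, sizeof(ev), 1, f);   /* one cheap append per event */
        fclose(f);
        return 0;
    }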

This decision made sense early on, but two problems emerged.  The first was that a small but real minority of our users wanted access to Telemetry's data from their own tools.  Documenting the Telemetry file format (and implicitly promising that we wouldn't break it) was a daunting task, and supporting everyone else's file parsers would have been an onerous support headache.

The second problem was that our proprietary format wasn't indexed.  In fact, it had the bare minimum amount of data in it to keep write speeds up, so random access was pretty much off the table.  We could grab big chunks of data very coarsely, but it was slow, especially during a live session.  This wasn't considered a significant issue because we assumed we'd be memory resident and could stream all the data from the server without any need for searching or seeking.

Once the in-memory assumption fell apart, the lack of indexing made seeking and searching the data very difficult.
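
For context, "indexed" could mean something as simple as a table mapping frame numbers to file offsets, so a reader can seek straight to the part of the session it needs.  A minimal sketch, with an invented layout:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch of a frame index -- an invented layout, not anything
     * Telemetry actually wrote.  Each entry maps a frame number to the byte
     * offset of its first event, so a reader can fseek() directly to a frame
     * instead of scanning the whole capture. */
    typedef struct {
        uint64_t frame_number;
        uint64_t file_offset;   /* offset of the frame's first event record */
    } FrameIndexEntry;

    /* Binary search the index for a frame's starting offset. */
    static long find_frame_offset(const FrameIndexEntry *index, size_t count,
                                  uint64_t frame)
    {
        size_t lo = 0, hi = count;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (index[mid].frame_number < frame) lo = mid + 1;
            else                                 hi = mid;
        }
        if (lo < count && index[lo].frame_number == frame)
            return (long)index[lo].file_offset;
        return -1;  /* frame not captured */
    }

    int main(void)
    {
        FrameIndexEntry index[] = { {0, 0}, {1, 4096}, {2, 9216}, {3, 20480} };
        printf("frame 2 starts at offset %ld\n", find_frame_offset(index, 4, 2));
        return 0;
    }

Telemetry's original format had nothing like this, which is why coarse chunked reads were the only option.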

Render Frames Were Natural Frame Boundaries

Telemetry operates with a fundamental unit of measure known as a frame.  This delineation is natural in games, where render speed is one of the most important performance concerns, but as more advanced games and non-gaming applications adopted Telemetry, customers began to find it limiting.  For example, if you're writing an off-line data transformation tool, the 'render' frame is clearly not where you want to see work divisions.  And if you're working on a subsystem that is decoupled from your render speed, such as audio or physics, then you want to see your own subsystem frames as the primary division of work.
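
As a sketch of what per-subsystem frames might look like at the markup level (FRAME_TICK is a made-up marker, not the Telemetry API), each subsystem declares its own work boundary instead of everything being sliced at render-frame boundaries:

    #include <stdio.h>

    /* Made-up frame-boundary marker -- not the actual Telemetry API.  The idea
     * is that each subsystem gets its own notion of a "frame" rather than
     * everything being divided at render frames. */
    #define FRAME_TICK(section) printf("frame boundary: %s\n", (section))

    static void render_frame(void)     { FRAME_TICK("Render"); }
    static void physics_step(void)     { FRAME_TICK("Physics"); }  /* fixed timestep   */
    static void audio_mix_block(void)  { FRAME_TICK("Audio"); }    /* runs at own rate */

    int main(void)
    {
        /* Physics and audio tick at rates decoupled from rendering, so each
         * gets its own stream of frame boundaries. */
        for (int i = 0; i < 3; ++i) {
            render_frame();
            physics_step();
            physics_step();      /* e.g. two fixed steps per render frame */
        }
        audio_mix_block();
        return 0;
    }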

Summary

Telemetry's core assumptions about data centered on grabbing big chunks of data and keeping it all in memory, and on that data living in an opaque format that streamed quickly.  After a couple of years of customer usage these assumptions have turned out to be limiting, so with Telemetry 2 we've revisited this core pillar of the technology.
