Top five metrics mistakes in games

Here are the top five mistakes I’ve observed when a project tries to implement a metrics program. These are generalities extracted from multiple observations, and as such are intended to provide rule-of-thumb guidance, not rules chiseled in stone. On-the-ground conditions in any given project may require a metrics solution tailored to its specific needs.
Note that some very important metrics usually get up and running without much risk. For example, channeling user behavior metrics into the game design group is such an obviously mission-critical task that it will usually happen even if the game designer has to buy an SQL textbook. Thus few user behavior metrics appear among the five most common mistakes below.

Top five mistakes in metrics
One: No application of metrics in task assignments and project management.
Two: Failing to measure factors that affect team efficiency and delivery schedules.
Three: Raw, singleton data and manual distribution don’t work. You must automate the entire collection, aggregation and distribution cycle.
Four: Not having senior engineers involved in the architectural analysis, implementation design and growth of your metrics system will either cripple or kill your metrics project.
Five: Not using metrics generated via repeatable automated tests at the front end of the production pipeline to prevent defects from moving further down the production line.

ONE: No application of metrics in task assignments and project management.
a)    Without a measurable goal, it is unclear when a particular task is considered done, or rather, done well enough. The developer has little incentive to do more than get the task done in the minimum amount of time: people respond to the way their performance is measured. The level of completeness, stability, performance, scalability and other critical factors tends not to be addressed unless those factors are considered in the Measures of Success for a given task, or until they become a serious problem. This can result in a very high go-back cost: the time spent fixing defects in a module or in other, connected modules. To paraphrase one senior MMO engineer: “Using metrics in my task allowed me to significantly improve performance and remove some bottlenecks. But my question is, why would I ever use metrics again, unless it is out of the goodness of my heart? My manager did not specify anything beyond getting the feature to work; not how well it worked, or how stable it needed to be. So if I spend time improving my module via metrics, I have, in my manager’s eyes, achieved less work that week: I could’ve left my first task alone and gotten other tasks done instead of improving my first task.”
b)    Metrics also help to focus staff accurately on real problems, not perceived problems. For example, if a system is failing to scale, there are two paths to follow. The common approach is to gather the senior engineers together, have them argue for a while about what might be causing the problem, and then implement one of the educated guesses, hoping to get lucky. The other path is to place some metrics probes in the system that is failing to scale and then run a test (a minimal probe sketch appears after this list). With the resulting metrics, it is usually much easier to find where the problem is, implement a solution, and rerun the tests to see whether the scalability numbers have improved.
c)    Before we implemented an effective metrics system on TSO, engineers were tasked mostly by educated guessing: we had no way to observe what was going on inside our game and were thus trying to debug a large-scale, nondeterministic black box, with very little time remaining. Once we had effective metrics, server engineers were tasked mostly via metrics coming out of automated scale testing. Our production rate soared.
d)    Aggregated data also provides an excellent focusing tool. A Crash Aggregator can pull crash frequencies and locations per build to produce the number of crashes at specific code-file and line-number locations. Prioritization then becomes quite simple: if you know that bug 33 crashed 88 times in build 99, you know it is a more critical fix than bug 1, which crashed once in build 99.
e)    Lack of metrics-driven task assignment is particularly deadly in the iterative, highly agile world of game production, which has some pretty deep behaviors burned into how things are done. Further, agile development is sometimes the pretext for programmers to continually change what they want to build, on the fly. Gold-plated toilets in a two-story outhouse are often the result… Customer-driven metrics, task-driven metrics and team-efficiency metrics are good antidotes that keep teams focused.
f)    The risk of building necessarily partial implementations on the fly is that “something” is working by the due date, but it has only a fuzzy possibility of being correct. Further, the go-back costs are not accounted for in the schedule and thus become “found work” that adds unexpected time to the schedule. Of course, many features are experimental in nature: they may shift radically or may not make it into the final game, so it makes sense to build as little as possible until the system requirements are understood. This is still very addressable via metrics: as part of such experimental tasks, simply define the Measures of Success for the initial task as “implement core functionality only” and address the rest later.
g)    Example: when building an inventory system, you need to deliver enough of that system so that some play testing can be done, but you don’t need to cover all edge conditions up front. Instead, you define and build only the core functionality and deal with edge conditions later, when you will actually have firmer knowledge of how the system is used and what it is expected to do. Using Inventory as the example, the core functionality is simply <add item; remove item; show current items>. Completion of such features is easily tested and measured, and the features are thus easy to keep stable in the build and in gameplay. Similarly, once the final inventory requirements are known, the measurable conditions of “ready for alpha” or “ready for launch” are easy to define. In this case, with 30 items allowed in the inventory, the final acceptance metrics would be something like: delete one item, then test and measure that the inventory count goes down by one; verify that the item has actually been removed (from the user’s perspective); verify that all other items are still in the inventory; verify that adding a 31st item does not damage the existing inventory items and that an appropriate error message is given; verify that deleting a nonexistent item from the inventory returns failure and leaves all existing items intact; verify that deleting a nonexistent item from an empty inventory returns failure, and, following that, that adding a real item to the potentially corrupted empty inventory still works; and so on. (These acceptance metrics are sketched as an automated test after this list.)
h)    Metrics allow tying production and operation actions to the big three business metrics: cost of customer acquisition, cost of customer support and cost of customer retention. And if you can quantify an improvement you want to make in the game and track how it affects the big three business metrics, you can do what you need to do: no fuss, no muss.
i)    Finally, without project management using task completion metrics, identifying the current state of game completion and projecting long-term milestones are at best exercises in wishful thinking. This tends to result in projects that inch closer and closer to their launch date, with little actual idea of what will happen then, or even if the game will be complete by then. With early, accurate measures of completion, actions can be taken early enough to improve projects at risk: adding staff, cutting features or pushing back the release date. Without early, accurate measures, by the time the problem is detected it is too late to do anything about it.
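To make point (b) concrete, here is a minimal sketch of the kind of metrics probe I mean: a decorator that counts and times calls into a suspect subsystem during a scale test. This is illustrative Python, not code from any particular engine; all names here (probe, report, write_item) are hypothetical.

```python
import functools
import time
from collections import defaultdict

# Accumulated stats per probe point: name -> [call_count, total_seconds]
_probe_stats = defaultdict(lambda: [0, 0.0])

def probe(name):
    """Wrap a suspect function so that every call is counted and timed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                stats = _probe_stats[name]
                stats[0] += 1
                stats[1] += time.perf_counter() - start
        return wrapper
    return decorator

def report():
    """Dump call counts and average latency after a scale-test run."""
    for name, (calls, total) in sorted(_probe_stats.items()):
        print(f"{name}: {calls} calls, {1000 * total / calls:.2f} ms average")

# Hypothetical usage on a subsystem suspected of limiting scalability:
@probe("inventory.db_write")
def write_item(player_id, item):
    ...  # the real persistence code would go here
```

Run the scale test, call report(), fix the worst offender, then rerun: the before/after numbers tell you whether the fix actually worked, instead of leaving you to argue about it.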
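And here is how the inventory acceptance metrics from (g) might be expressed as a repeatable automated test. The Inventory class is a hypothetical stand-in for whatever the real system exposes; the capacity of 30 comes from the example above.

```python
# A minimal stand-in inventory with the core functionality only:
# add item, remove item, show current items.
class Inventory:
    CAPACITY = 30

    def __init__(self):
        self._items = []

    def add(self, item):
        if len(self._items) >= self.CAPACITY:
            return False  # reject the 31st item without corrupting state
        self._items.append(item)
        return True

    def remove(self, item):
        if item not in self._items:
            return False  # deleting a nonexistent item reports failure
        self._items.remove(item)
        return True

    def items(self):
        return list(self._items)

def test_inventory_acceptance():
    inv = Inventory()
    for i in range(30):
        assert inv.add(f"item-{i}")
    assert not inv.add("item-31")          # the 31st item is rejected...
    assert len(inv.items()) == 30          # ...and existing items are intact
    assert inv.remove("item-0")            # delete one item
    assert len(inv.items()) == 29          # count goes down by one
    assert "item-0" not in inv.items()     # the item is actually gone
    assert not inv.remove("no-such-item")  # nonexistent delete fails...
    assert len(inv.items()) == 29          # ...and leaves the items intact
    empty = Inventory()
    assert not empty.remove("anything")    # nonexistent delete, empty inventory
    assert empty.add("first")              # the empty inventory still works

test_inventory_acceptance()
print("inventory acceptance metrics: pass")
```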

TWO: Failing to measure factors that affect team efficiency and delivery schedules.
a)    Large teams building large, complex systems can be crippled by small problems: brittle code, non-scalable tools and lack of development stability. Even if individual developer efficiency drops by only 10%, a 100-person team loses the equivalent of ten full-time developers’ work, each and every week.
b)    Some such factors: build failure rate, build download time, build completion time, game load time, components with a high go-back cost, time from build start until the build is downloaded to QA, time until pass/fail data reaches production, server downtime, and so on. These and other critical-path delays not only slow production, they are also mission-critical problems in operations.
c)    Measuring bottlenecks in your content production pipeline can point to places where automation could be added to speed up production; if server stability, database incompatibility or broken builds are recurring bottlenecks, the engineering team then has an actionable task that will measurably improve content production. On TSO, we found that such bottlenecks, despite being widely known as problems, were not tagged as priority problems to solve! The management team was under tremendous pressure to build features and add content, and assigning resources to fix a fuzzily defined artist-tool issue instead of putting more pixels on the screen is a hard sell. So the problems were always dismissed: “oh, the build probably doesn’t fail often anyway”, “it probably doesn’t affect the team very much when it does”, or “oh, we probably won’t have another Perforce failure, we must’ve found them all by now”. But when we quantified the number of build failures in a week, multiplied by the size of the team and by how long it took people to resume forward motion, stabilizing the build became a top-priority problem (the arithmetic is sketched below). Lost team efficiency via a poor production environment is one of my favorite metrics. It has always resulted in tool improvements and a faster, more stable production cycle, one that makes it easier to project delivery times for large-scale systems. In a TSO postmortem, the senior development director stated that “[stabilizing the build] saved us.”
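The quantification itself is trivial; the point is doing it at all. Here is a sketch of the back-of-the-envelope arithmetic, with hypothetical placeholder numbers rather than TSO’s actual figures:

```python
# Hypothetical inputs -- substitute your own measurements.
failures_per_week = 4         # broken builds observed in a week
team_size = 100               # people blocked when the build breaks
hours_lost_per_failure = 1.5  # average time before forward motion resumes
loaded_cost_per_hour = 75     # rough fully-loaded cost of one hour, dollars

hours_lost = failures_per_week * team_size * hours_lost_per_failure
print(f"{hours_lost:.0f} person-hours lost per week")
print(f"${hours_lost * loaded_cost_per_hour:,.0f} lost per week")
# 4 failures x 100 people x 1.5 hours = 600 person-hours, roughly $45,000
# per week -- which is how a "fuzzily defined tool issue" becomes a
# top-priority engineering task.
```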

THREE: Raw, singleton data and manual distribution don’t work. You must automate the entire collection, aggregation and distribution cycle.
a)    Building a series of one-off metric systems that do not support the entire metrics collection/aggregation/distribution cycle is a path to duplicative dooms. One-off systems quickly rot, which is why you can find so many dead ones littering your code base; people hack together what they need for the moment and then they are done with it. And when you next need a number, you’re back at square one: the old hacks are dead so you hack in a new metrics ‘system’.
b)    One-off tools do not generally support correlation and aggregation across multiple databases; nor do they generally have team-wide distribution built in, or sophisticated visualization systems.
c)    One-off systems generate only a specific type of report, must be run by hand whenever the data is needed, and deliver to others by whim or by e-mail. In other words, the data is not actionable. To be actionable, a metrics report must capture specific data points before a given task is started and report any changes in those data points after the task has been completed. Such reports are “breadcrumbs” that quickly lead the developer to the problem and show when the problem is solved.
d)    A team-wide Metrics Dashboard helps improve the efficiency of developers by supplying real-time views of the most common and most critical reports. This also helps improve the efficiency of your build masters and senior engineers, who are continually distracted by questions such as “where’s my stuff in the build pipeline?” or “why is this <thingy> broken?”
e)    Lack of automation in a metrics system means somebody is going to have to continually perform a lot of data aggregation and communication tasks. This generally leads to people working with what they know: a simple, one-of-a-kind spreadsheet built and then discarded, or some incredibly complex spreadsheet into which someone enters massive amounts of data by hand before e-mailing the results. Sounds very real-time and accurate; a tool people would love to use, right?
f)    Your metrics system also needs to support calibration: you use the results in critical business decisions, so you need to know that the numbers are accurate. Removing nondeterministic factors can mean running hundreds of tests: aggregating multiple test runs, eliminating the outliers and averaging the middle-third results (sketched after this list). This is a typical function that the report builder tool needs to support.
g)    Using metrics in multiple areas also helps to prevent code rot: the system is always in use, and therefore will be kept up to date. Further, a stable, feature-rich metrics system reduces the incentive for engineers to create one-off metric systems, and thus prevents duplicative, wasted work. Finally, if the metrics system is on the production/operations critical path, not only will it remain active, it will be continually grown by the people using it in day-to-day tasks.
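Here is a minimal sketch of the calibration step from (f): aggregate many runs of the same test, discard the outliers, and average the middle third. The function is illustrative, not from any particular metrics library.

```python
def calibrated_value(samples):
    """Average the middle third of many runs of the same test,
    discarding the high and low outliers that nondeterminism produces."""
    if len(samples) < 3:
        raise ValueError("need at least 3 runs to trim to a middle third")
    ordered = sorted(samples)
    third = len(ordered) // 3
    middle = ordered[third : len(ordered) - third]
    return sum(middle) / len(middle)

# Hypothetical usage: nine runs of a load test reporting requests/second.
runs = [880, 912, 905, 340, 899, 923, 908, 1500, 901]
print(calibrated_value(runs))  # outliers 340 and 1500 never touch the average
```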

FOUR: Not having senior engineers involved in architectural analysis, implementation design and growth of your metrics system will either cripple or kill your metrics project.
a)    A metrics system capable of supporting a large-scale online game is a complex system in and of itself. A poor metrics tool will be a hard sell to a production team that has gotten along without metrics before, or the tool could be integrated into a project and then crack at the seams as the software and customer base scale up. Examples of tasks that are beyond junior engineers to complete without guidance: tailoring the system to meet on-the-fly priority requests; deciding which key metrics to capture; making the system flexible enough for easy addition of new metrics and rapid aggregation and calibration of new reports; making the system scalable; and building an easy user interface for the complex aggregation, calibration and new-report functions.
b)    In other words, the design and implementation of a mission-critical tool usually falls down the programmer pecking order to the people least likely to make the correct decisions: least likely to correctly implement a real-time report creation tool or a real-time report viewing tool on a massively scaled metrics database that imports data from multiple external sources, and least likely to correctly aggregate data from multiple, radically different databases.
c)    Example: correlating game data with data from CS or social networks can produce a profit/trouble ratio for customers, or detect bots and hackers (a minimal profit/trouble sketch appears after this list). One could correlate game features to network costs and suggest game changes to lower network costs, or network changes to strengthen gameplay. One could easily find the players who generate the highest revenue with the lowest hassle. One could easily expand the detection of a single hacker into finding the other hackers associated with them, or even different hackers using the same basic patterns. One could also correlate which game features create more and stronger social building blocks, and thus broader social networks: if you know most of your friends from playing an online game, the game is what you have in common, which strengthens customer retention.
d)    Finally, failing to collect “metrics on metrics” leaves you blind to how the team is using your metrics system: which features are popular, who uses which features, and what the response time is for users creating or viewing a report.
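As a sketch of the profit/trouble ratio from (c): join per-player revenue from the game’s billing database with per-player ticket counts from the CS database. The data and field names here are hypothetical; in practice this would be a correlation across two real, radically different databases.

```python
# Hypothetical extracts from two radically different databases.
revenue_by_player = {"ada": 240.0, "bob": 15.0, "cyn": 180.0}  # billing DB
tickets_by_player = {"ada": 1, "bob": 12, "cyn": 0}            # CS ticket DB

def profit_trouble_ratio(player):
    """Revenue per support ticket; the +1 avoids dividing by zero for
    players who have never filed a ticket."""
    return revenue_by_player[player] / (tickets_by_player.get(player, 0) + 1)

for p in sorted(revenue_by_player, key=profit_trouble_ratio, reverse=True):
    print(f"{p}: {profit_trouble_ratio(p):.1f}")
# cyn (high revenue, no tickets) ranks first; bob (low revenue, many
# tickets) ranks last -- the high-hassle, low-value customers stand out.
```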

FIVE: Not using metrics generated via repeatable automated tests at the front end of the production pipeline to prevent defects from moving further down the production line.
a)    The earlier you detect a defect, the better.
b)    The further you let a bug travel down the production pipeline, the more expensive and time-consuming it is. Bug verification, bug assignment, bug replication, bug tracking, bug fixing and fix verification generate expensive noise that hinders already busy people.
c)    The more defects you allow into your build, the more you affect the productivity of the entire team! A half hour of a junior engineer’s time can drop a bomb into the build that freezes your team for the hours it takes to find the defect, fix it and create a new build.
d)    Even worse, tracking down hard problems often requires your top technical people, who could otherwise be building useful systems! I measured a few such bugs on TSO: one little problem in the build consumed about 30 hours from five of the most expensive people on the team.
e)    Gating check-ins on metrics generated by repeatable automated tests prevents buggy code from burning team-wide time (a minimal check-in gate is sketched after this list).
f)    Many of the most valuable metrics in an online game can only be accurately produced via repeatable automated tests. Failure to integrate your metrics system with an automated testing system will at worst kill your project, and at best cost you time and money that you might not have.
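Finally, a minimal sketch of the check-in gate from (e): run the repeatable automated tests, compare the resulting metrics against agreed thresholds, and refuse the check-in on any regression. The harness and the threshold numbers are hypothetical stand-ins for whatever your pipeline actually measures.

```python
import sys

# Hypothetical thresholds agreed with the team; a real system would read
# these from the metrics database rather than hard-coding them.
THRESHOLDS = {
    "smoke_tests_passed": 1.0,   # fraction of smoke tests that must pass
    "max_load_time_secs": 45.0,  # game load time ceiling
    "max_frame_time_ms": 33.0,   # worst-case frame time ceiling
}

def run_automated_tests():
    """Stand-in for the real harness: returns metrics from one repeatable run."""
    return {"smoke_tests_passed": 1.0,
            "load_time_secs": 41.2,
            "frame_time_ms": 29.8}

def gate_checkin(metrics):
    """Compare a run's metrics against the thresholds; list any regressions."""
    failures = []
    if metrics["smoke_tests_passed"] < THRESHOLDS["smoke_tests_passed"]:
        failures.append("smoke tests failing")
    if metrics["load_time_secs"] > THRESHOLDS["max_load_time_secs"]:
        failures.append("load time regression")
    if metrics["frame_time_ms"] > THRESHOLDS["max_frame_time_ms"]:
        failures.append("frame time regression")
    return failures

if __name__ == "__main__":
    failures = gate_checkin(run_automated_tests())
    if failures:
        print("check-in rejected:", "; ".join(failures))
        sys.exit(1)  # the defect never enters the shared build
    print("check-in allowed")
```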
