I have a Seiko analog watch that needs repair. When I pull out the crown to set the time, the movement stops. This is normal. But when I push the crown back in, the movement doesn’t resume right away. It sometimes takes several minutes for it to start ticking again.
I love this watch. My wife gave it to me for our anniversary years ago, and I’ve worn it daily ever since. I’m definitely going to have it repaired. Administrivia isn’t my strong suit, though, so I haven’t gotten around to it yet.
Meanwhile, I keep wearing it. I’ve learned that if I just stop messing with it, I can still rely on it to keep time. Now, I just remember the offset between the actual time and my watch’s time and do the head math. This works because the watch is consistent, and I’ve become consistent in using it this way.
Consistency is a Virtue
Consistency is a virtue. It’s important in all aspects of life. It’s particularly important in software development.
Consistency is predictable. It sets reasonable expectations about what will be and reliably fulfills them. It reduces uncertainty and gives people greater confidence in a given outcome. This permits people to focus on more important problems and discover new results.
Consistency is repeatable. It sets an example for others to follow. It identifies a way of doing things that is already acceptable and understood. It reduces learning curves and lowers barriers to performing routine tasks. This clears the way for learning new methods and accomplishing greater goals.
Consistency sets a baseline. It defines a standard for the normal. This readily highlights anomalies and novelties for closer inspection and possible action.
Consistency reveals patterns. It identifies factors that can be filtered out of a set of problems or observations as unimportant. This makes discovery of new and interesting patterns easier.
Consistency promotes understanding. It provides a common frame of reference for communicating established concepts. It serves as a basis for comparison when reasoning about new concepts.
Consistency reduces cost. Wherever we apply consistency, we eventually become more efficient. We cut unnecessary complexity and streamline processes. This ultimately reduces the amount of effort and time we spend on unproductive activities.
Consistency in the Big Picture
Recall that software architecture is about “anything and everything related to the significant elements of a software system” (Software Architecture for Developers, by Simon Brown, 2015-04-01, http://leanpub.com/software-architecture-for-developers). Architecture is about the big picture, and it’s in the big picture where consistency has the greatest impact.
A good software architecture strives for consistency wherever it will pay off. Consider these architectural concerns:
Server logs are immensely powerful for understanding how servers work and for notifying us when something needs attention. But they can easily get out of hand, generating massive amounts of useless noise. Developers may disagree on what should be logged as a WARN. Others forget to convert their verbose INFO statements to DEBUG, spamming production log files. Still others write out ERROR messages that give no information to help with resolution. Developers systematically ignore automated alert emails that constantly report false alarms. When these things happen, logs waste valuable troubleshooting time.
Achieving consistency in logging pays off handsomely. Action items include:
- Use a single logging library throughout your code base. Chances are, your reasons for having more than one aren’t good enough.
- Make sure your logging API is easy to use. If you use a library with complicated semantics, wrap it with something simpler.
- Establish standard criteria for each logging level. Make them easy to remember and apply. As a rule of thumb, try using TRACE when you think DEBUG is right, and DEBUG when you think it should be INFO. Use NOTICE, if available, for infrequent status messages.
- Require that all ERROR messages be actionable using the information supplied in the message. If a file isn’t found, name it. If a parameter is invalid, name it. If an exception is thrown, show the stack. Treat a non-actionable error like a build failure: fix it immediately.
- Monitor logs continuously. Regularly report the frequency of each log message, factoring out unique IDs, time stamps, usernames and the like. Have the team set goals for an acceptable number of errors (zero is good), warnings, etc.
- Dog-food your logs. Establish a live support discipline as early as possible, treating errors in your dev environment as if they were production errors. Set up on-call assignments and rotate the entire team through it. Logging consistency will improve once developers have to rely on it to solve their problems.
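To make the frequency-report idea concrete, here is a minimal Python sketch. It assumes a simple one-line-per-message text format; the regular expressions and the `user=` field are my own illustrative assumptions, not a standard, so adapt them to whatever your logs actually look like.

```python
import re
from collections import Counter

# Hypothetical patterns for the variable parts of a log line.
# Adjust these to match your own log format.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")
_UUID = re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")
_USER = re.compile(r"user=\S+")
_NUMBER = re.compile(r"\b\d+\b")


def normalize(line: str) -> str:
    """Replace time stamps, unique IDs, usernames, and numbers with placeholders."""
    line = _TIMESTAMP.sub("<ts>", line)
    line = _UUID.sub("<uuid>", line)  # before _NUMBER, so all-digit groups stay part of the ID
    line = _USER.sub("user=<user>", line)
    line = _NUMBER.sub("<n>", line)
    return line.strip()


def message_frequencies(lines):
    """Count how often each normalized message occurs, ignoring blank lines."""
    return Counter(normalize(line) for line in lines if line.strip())
```

Feeding a day's worth of log lines to `message_frequencies` and printing `most_common(20)` gives the team a short list to set goals against.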
One more thing. If applicable, completely remove from your server logs all reporting of normal player activity, game balance, and other rich analytics data such as session start/stop, mission completion, combat activity, item use, trades and purchases, and anything else like that. These are very important and call for their own dedicated data collection and analysis system. They can’t be allowed to get lost in your server logs.
Error handling is another cross-cutting concern that demands consistency. Achieving this is harder than it is with logging, however. Some problems, technologies, and languages may need a specific error handling approach. Also, achieving consistency after development has already started presents greater challenges. This area deserves attention as early as possible. If you miss your chance, however, all is not lost. The benefits still far exceed the costs.
Bug prevention and better overall bug detection are the top benefits of consistent error handling. A close second is cleaner, easier-to-read code, with fewer special cases that “handle” errors in unique and exciting ways.
Error handling consistency requires that you:
- Understand the existing error semantics and features of the programming language(s) you’re using to implement your game. These could impose constraints or present performance challenges that you’ll have to deal with. Have the team agree on standard approaches to dealing with these.
- Do the same for any third-party libraries you’re using. Some high-performing or stateful systems such as physics engines may use specialized error reporting systems. Your team should decide whether and how to integrate with these early on.
- Decide on a policy for how the server will communicate errors to the game clients, or whether it should at all. Anything you do report could have significant player experience, customer service, or security implications.
- Prohibit the practice of silently handling and ignoring unspecified or untyped exceptions. This allows any possible error to go unchecked. At the very least, require a comment that clearly justifies each case. Supplement it with a log message if possible.
- Avoid throwing top-level exception types, such as System.Exception (.NET), java.lang.Exception (Java), or Exception (Python). This is laziness at best, and hides any useful information about the error from callers. Instead, use an appropriate built-in exception or write one specific to your code.
- Avoid using numeric return values to mean success or failure. The question of whether 0 means success or failure has caused much angst. Instead, return a boolean or throw an exception. Operating system shell commands tend to be fairly consistent about returning 0 to mean success, but libraries and internal functions are all over the map. Just don’t go there.
- If you must share a set of discrete error “codes” across a body of code, do so carefully. Context is key; don’t glom unrelated values into a single scope for convenience. Use enumeration types if possible. If not, declare a class or struct with constant members or something similar. Avoid global constants at all costs, preferring any type of namespace or other type-safe scoping mechanism you can use.
- If you use or implement HTTP APIs anywhere, decide early whether and how you will map HTTP status codes into your own error semantics. It seems like it would be easy, but special cases and ambiguity abound.
- Decide whether and how you will integrate error handling with logging. Should you log all exceptions? What data should you provide? Remember, all errors must be actionable!
- If you write systems that need to support retry and failover logic as part of error handling, consider using a purpose-built third-party library instead of rolling your own. One such library that seems promising is Polly (.NET). In a stateful real-time game simulation, edge cases in this space are not fun.
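A few of these points fit together in a small Python sketch: a specific exception type, error codes scoped in an enumeration, an actionable message, and a handler that catches only what it understands. The inventory domain and every name in it (`InventoryError`, `remove_item`, and so on) are hypothetical, chosen only for illustration.

```python
import logging
from enum import Enum

log = logging.getLogger("inventory")


class InventoryErrorCode(Enum):
    """Error codes scoped to one context instead of global constants."""
    ITEM_NOT_FOUND = "item_not_found"
    INSUFFICIENT_QUANTITY = "insufficient_quantity"


class InventoryError(Exception):
    """A specific exception type; callers can catch it without catching everything."""

    def __init__(self, code: InventoryErrorCode, message: str):
        super().__init__(message)
        self.code = code


def remove_item(inventory: dict, item_id: str, quantity: int) -> None:
    """Remove items or raise InventoryError with a message that names the problem."""
    if item_id not in inventory:
        raise InventoryError(
            InventoryErrorCode.ITEM_NOT_FOUND,
            f"item '{item_id}' is not in the inventory",
        )
    if inventory[item_id] < quantity:
        raise InventoryError(
            InventoryErrorCode.INSUFFICIENT_QUANTITY,
            f"item '{item_id}': requested {quantity}, only {inventory[item_id]} available",
        )
    inventory[item_id] -= quantity


def try_remove_item(inventory: dict, item_id: str, quantity: int) -> bool:
    """Handle only the expected error type; anything else propagates to the caller."""
    try:
        remove_item(inventory, item_id, quantity)
        return True
    except InventoryError as err:
        log.error("remove_item failed (%s): %s", err.code.value, err)
        return False
```

The point of the design is that nothing is swallowed silently: expected failures are logged with enough detail to act on, and unexpected ones still surface.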
Most of the suggestions above come from pain points I’ve experienced myself. It’s hard to achieve consistency with error handling, but it’s even more important to do so. You will have to think it through and decide what’s right for your situation.
Third-party libraries can save a ton of development time and reduce exposure to bugs. On most software projects, the biggest cost and risk area is developer time. Using third-party libraries to address this is often a win.
Downsides exist. With commercial libraries you have licensing and support costs. You also get limited access to source code for debugging or extending functionality. Open source libraries can help avoid this. However, they can also bring problems, such as incompatible license terms and lack of professional support.
Open source software is usually free and is easily modified. This boon also creates problems. Modern tools such as Maven and its repository, or Visual Studio with NuGet, make finding and playing with these libraries a breeze. Developers can experiment freely, sometimes leaving behind a wake of dependencies. These factors are a force multiplier for software entropy on your project. I once saw, in a single code base, 5 JSON libraries, 4 DB API libraries, 3 Redis client libraries, 3 HTTP client libraries, and 3 unit testing frameworks!
Such inconsistency causes confusion and uncertainty. It wastes precious developer time and adds risk to the project. As with inconsistent error handling, this is best addressed early, but bringing it under control will pay off whenever you manage it.
To achieve consistency in your use of third-party libraries:
- Keep libraries in an access-controlled version-managed repository. This should be your source control repo in most cases. If you use C/C++, put library binaries and source code in a top-level /extern folder. If you use NuGet, also add your /packages folder to source control. If you use Maven, then use a Maven repository manager, and create user roles to limit who can add new packages.
- Know what libraries your project uses. Be able to get an updated list of all libraries instantly. Use the reporting features of tools such as Maven’s Dependency Plugin or the NuGet Package Manager Console.
- Avoid creating the need for developers to manually install library distributions on their local development machines. Use project or environment variables to reference libraries in your repository. If possible, building and running your code with its library dependencies should need no special steps to bootstrap.
- Have team members explicitly communicate library additions and version changes. Also, encourage team members to raise concerns about library use when they have them. A quick message to the team via email or other team messaging medium is best because it creates a record of the decision, however informal.
- Promote a team ethic where personal preference takes a back seat to team consensus. Everyone has their biases, but teams must unite around their shared goals.
- Avoid directly integrating library source code into your code base. Prefer using a binary distribution if possible. If building from source is necessary, do it in a project that is separate from that of your main product(s). This holds whether you modify the library code or not.
- Include library use in your code reviews. Evaluate new library additions, and challenge any that duplicate features of existing libraries. Require that challenges be resolved before code check-in.
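One way to back up the review step is a small check that flags libraries serving the same purpose. The Python sketch below assumes you can already export a flat list of dependency names from your build tool; the purpose map is a hypothetical table the team would maintain by hand.

```python
from collections import defaultdict

# Hypothetical, team-maintained map from library name to the job it does.
LIBRARY_PURPOSE = {
    "jackson-databind": "json",
    "gson": "json",
    "json-simple": "json",
    "jedis": "redis-client",
    "lettuce-core": "redis-client",
    "junit": "unit-testing",
}


def find_overlaps(dependencies, purposes=LIBRARY_PURPOSE):
    """Return {purpose: [libraries]} for every purpose served by more than one library."""
    by_purpose = defaultdict(set)
    for name in dependencies:
        purpose = purposes.get(name)
        if purpose is not None:  # libraries missing from the map are ignored
            by_purpose[purpose].add(name)
    return {
        purpose: sorted(names)
        for purpose, names in by_purpose.items()
        if len(names) > 1
    }
```

Run against the project's dependency list in CI, a non-empty result becomes a prompt for the conversation described above rather than an automatic failure.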
Consistency in your use of third-party libraries reduces external dependencies, makes the software easier to understand, and reduces the risk of bugs.
I hope that most game development studios embrace some form of automated testing discipline. If you don’t, stop reading and go do that now. For the rest of us, automated tests are an essential tool. They catch and report errors early, when the cost to correct them is low. They help us understand how unfamiliar code functions. They reduce the chance we’ll introduce new bugs when we make changes.
Inconsistency in automated testing negates all this. Tests only work when they’re executed. Tests that run infrequently become obsolete, reporting false success or failure when they run. Broken tests that continue to run hide legitimate errors. Worse, they negate entire test suites by obscuring valid test results with error spam. Tests that fail intermittently create extra work when a failure coincides with some poor developer’s unrelated check-in.
Fortunately, achieving consistency in automated tests is usually straightforward. Nailing it early in development is good, but playing catch-up later on usually doesn’t cost much more. That is, unless you count the opportunity cost of not catching errors early, which can be significant.
Automated testing consistency requires that you:
- Run tests often, preferably whenever the code changes. If not that, then at least several times a day.
- Remove barriers to writing automated tests. Use proven frameworks like those of the xUnit family: NUnit, JUnit, CppUnit, PyUnit (Python’s unittest module), etc. Integrate these as seamlessly as possible into your development tool suite. Ensure that more experienced developers mentor those less familiar with testing.
- Minimize the amount of “special knowledge” required to write integration tests. Ideally, build your integration test framework so that it leverages your existing (unit) testing conventions and patterns, rather than inventing new ones.
- Ensure that tests run against the latest compiled code. Normally this means automatically running tests immediately after a continuous integration (CI) build completes.
- Ensure that tests that have dependencies on generated types (e.g. serialization classes, JSON documents, etc.) always use the latest versions of them.
- Minimize test execution time, especially for unit tests. Each unit test should take less than 1 second to complete. Developers tend to avoid tests that hold up their progress. Explicitly identify long-running tests as such, and segregate them so that developers can opt in to running them.
- Treat test failures like build failures. Suffer no broken test to exist through repeated failure reports. Automatically report test failures to interested parties immediately.
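Segregating long-running tests can be as simple as a decorator. Here is a minimal sketch using Python's unittest; the `RUN_SLOW_TESTS` environment variable and the `slow_test` name are my own assumptions, and most xUnit frameworks offer an equivalent category or tag mechanism.

```python
import os
import unittest


def slow_test(test_item):
    """Mark a test as long-running; it is skipped unless RUN_SLOW_TESTS=1.

    The variable is read when the decorator is applied, which is
    normally when the test module is imported.
    """
    enabled = os.environ.get("RUN_SLOW_TESTS") == "1"
    return unittest.skipUnless(
        enabled, "slow test; set RUN_SLOW_TESTS=1 to run"
    )(test_item)


class DamageFormulaTests(unittest.TestCase):
    def test_damage_is_never_negative(self):
        # Fast unit test: always runs.
        self.assertGreaterEqual(max(0, 5 - 10), 0)

    @slow_test
    def test_many_rounds_of_combat(self):
        # Stand-in for a long-running simulation: opt-in only.
        total = sum(max(0, 5 - (i % 10)) for i in range(100_000))
        self.assertGreater(total, 0)
```

Developers get the fast suite by default, while the CI server sets the variable and runs everything, so skipped tests are reported rather than forgotten.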
Inconsistent testing practices make automated testing much less valuable than it should be. This often causes teams or management to become skeptical of automated testing, placing less emphasis on it. Avoid this self-fulfilling prophecy by getting your testing consistency under control as soon as you can.
Concurrency is one of the most useful tools we have to make game servers perform well under load. It’s important for all multiplayer game types, and especially critical for MMO games. It’s also one of the harder aspects of server development to get right. It presents challenges at both the architecture and implementation levels.
An important architecture decision point is the choice between multi-processing and multi-threading concurrency. Each option drives more architecture decisions related to clustering, sharding, messaging, state management, synchronization, and the like. During implementation, developers must deal with threading and synchronization primitives, IPC, data consistency, and many other challenges. Because of these cascading concerns, teams must address this early on.
Consistency in concurrency management is not optional. Without it, developers devise their own expedient solutions when they need asynchronous behavior. Less experienced programmers create new threads when they need them and move on. More senior programmers differ on whether to create their own thread pool or use a built-in framework. Still others prefer careful synchronization using reader/writer locks because they’re good at it.
These one-off solutions turn the ordinary challenges of concurrency management into a minefield of problems. Data corruption, deadlocks, race conditions, and performance degradation are inevitable.
To avoid the heartache and despair that comes from inconsistent concurrency management, do these things:
- Decide the key concurrency management patterns and approaches you want to use in your project as early as possible. Communicate them to all programmers and get agreement on using them. Iterate on them, and keep communicating as changes arise during development.
- Make it a core team value that no programmer should have to create a new thread to perform asynchronous processing. Let this become a code smell that prompts a search for better alternatives. Make it an action item when discovered in code reviews.
- Prefer designs that support single-threaded application logic, relegating other threads mainly to handling I/O. Most modern operating systems support asynchronous I/O. Use it wherever possible.
- Prefer designs that use message passing and queues over thread synchronization for managing state changes.
- When your design requires multi-threading, consider using higher-level frameworks that use thread pools and work queues. These are often built into the operating system or runtime environment. Examples include Java’s ExecutorService API and Microsoft’s Task-based Asynchronous Pattern (TAP) in .NET.
- For distributed processing, consider using an actor framework such as Akka (Scala, Java) or Orleans (.NET). Actors implement logic as single-threaded operations, calling other actors exclusively via async messages. This hides concurrency and distributed processing details from application programmers. It allows programmers to focus on game logic without concern for an actor’s location, state, or synchronization.
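Here is a rough Python sketch of two of the ideas above: game state is owned by a single logic thread, and everything else talks to it through a message queue. The `GameWorld` class and its score-keeping are hypothetical stand-ins for real game logic, and the thread pool only simulates I/O handlers.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class GameWorld:
    """Owns all mutable game state; only the logic thread modifies it."""

    def __init__(self):
        self.scores = {}
        self._inbox = queue.Queue()
        # The one thread the infrastructure creates; game code never makes its own.
        self._thread = threading.Thread(target=self._run, name="game-logic")

    def start(self):
        self._thread.start()

    def post(self, message):
        """Thread-safe: any I/O thread may enqueue a (player, points) message."""
        self._inbox.put(message)

    def stop(self):
        """Enqueue a shutdown marker and wait for earlier messages to be processed."""
        self._inbox.put(None)
        self._thread.join()

    def _run(self):
        # Single-threaded application logic: no locks around the state itself.
        while True:
            message = self._inbox.get()
            if message is None:
                break
            player, points = message
            self.scores[player] = self.scores.get(player, 0) + points


def simulate_io_handler(world, player, events):
    """Stand-in for a network handler forwarding player events to the world."""
    for _ in range(events):
        world.post((player, 1))


if __name__ == "__main__":
    world = GameWorld()
    world.start()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for player in ("ann", "bob", "ann", "bob"):
            pool.submit(simulate_io_handler, world, player, 1000)
    world.stop()
    print(world.scores)
```

Because the queue is the only shared object, the score updates need no synchronization of their own, which is the kind of exposure-limiting the suggestions above are after.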
The easiest way to get consistency in concurrent processing is to limit developers’ exposure to it. These suggestions help your team focus on simple solutions where possible, and abstract away concurrency details when necessary. With this, the team can focus more clearly on game logic and functionality.
It’s for the Players
The architecture concerns I’ve discussed here all suffer in integrity when they lack consistency. There are others I didn’t have time to cover: development tools, data serialization, server life cycle management, configuration files, and more.
These days, most multiplayer online games are more of a service than just a product. They must run 24/7/365. We must think beyond shipping our game, to the point where we’ll be supporting a live service. If not addressed, inconsistency in any of these areas will seriously degrade the quality of your game’s software and increase its maintenance and support costs. This limits your ability to deliver the experience of fun your players expect. At this point the specter of inconsistency will exact its ultimate price.
Improving consistency in any of these areas is an obvious, if not always easy, strategic win for any multiplayer game project. Do it now.
What About My Watch?
My watch recently died completely. Its consistent, reliable performance allowed me to continue using it for a while, but then it succumbed. This is my fault. I failed to be consistent in another important area: maintenance. Now I really do have to take it in for service. 😳
What Are Your Struggles With Consistency?
I’d love to learn about the challenges you’ve had with consistency and how you overcame them. Please reply to this post with whatever war stories you’d care to share. Or, feel free to leave any suggestions or questions you might have.
Thanks for playing!