How do we distribute core game play load across multiple processes in a way that supports flexible scaling and efficient allocation of computing resources?
We are developing the server for a massively multiplayer online game with a distributed architecture. The game design seeks to create an immersive play experience by enabling thousands of players to interact with each other in a shared virtual world.
- The game design depends on having many players online at the same time.
- The game design calls for world maps that emphasize exploration, rich game play, and frequent and varied interaction with other players and NPCs.
- The game design specifies that when possible, map design and game play, rather than runtime performance, should decide the size of the player population on any given map.
- The game design may specify the use of ephemeral maps, where small groups of players choose to share in scripted experiences.
- The game server uses a distributed deployment architecture.
- The game server uses the Distributed Network Connections pattern to manage game client connections.
- The development plan allocates time and resources towards developing a flexible server configuration that supports scaling horizontally to meet increasing load demand. Dynamic scaling, while desirable, is not a requirement.
- The long-term product plan calls for incremental enhancement of the game through adding content and increasing the size of the virtual world.
- The business and operating plan seeks to increase capacity incrementally, by adding more machines instead of replacing them with more powerful ones.
Develop a Responsibility-Oriented Game Server strategy that defines server types that are responsible for specific subsets of game play functionality. Instances of the different server types collaborate to carry out end-to-end use cases.
Assign responsibilities to server types in a way similar to how you would for classes or other modular software units, but at a higher level of abstraction. Some suggestions to consider include:
- Use functional decomposition of high level game play activities to identify key operations.
- Write use cases for typical game play scenarios to identify the main behaviors exercised.
- Group related behaviors according to the type of operations performed and the data involved. These groups are candidate responsibilities. Identify one server type to own each candidate responsibility.
- Ensure that all functionality each server type will perform falls within the scope of its single high-level responsibility.
- Avoid repeated back-and-forth interactions between server types during a single use case.
- Avoid long request chains of dependent (i.e. synchronous) operations that span several server types.
- Prefer operations that are fairly self-contained and are truly asynchronous, where the result can be handled as a stand-alone event.
- Whenever possible, support multiple process instances of each server type. Limit use of singletons to cases where the cost or complexity of multiple instances is prohibitive.
- Iterate until your set of server types seems fairly stable. Be willing to revisit as requirements change or new technical decisions emerge.
Example Server Types
Each game implements different server types based on game design, technical, and operational factors. The pattern doesn’t call for specific types. The examples in the table below illustrate the pattern’s concepts without prescribing a specific implementation or claiming to represent a complete cluster.
|Server Type||Description||Game State|
|Gameplay||Performs all core game play operations, including game systems logic and game effects.||character stats, hit points, buffs/debuffs, game effects, experience points|
|AI||Performs all AI "thinking" that controls NPC behavior. Includes pathfinding, decision making, perception.||finite state machines, decision trees, target lists|
|Visibility||Uses world geometry and spatial partitioning data to compute whether objects can be seen from a given location in the virtual world.||position, line of sight, occlusion bodies, map geometry|
|Physics||Uses world geometry and static collision data to detect collisions between moving and static objects. Also simulates uncontrolled movement (e.g. falling) and enforces game rules for player movement.||position, collision bodies, map geometry, pathfinding graph|
|Directory||Maintains addressing information for game state within the cluster. For a given game object, resolves the process instance of each server type that has the given object.||game object addresses, lookup tables, node status|
It’s useful to encapsulate game state in a logical abstraction called a game object. All meaningful entities in the game, including player characters, NPCs, items, interactive world objects, and the like, are instances of the game object type.
The semantics of this abstraction vary by implementation, but the key property of the game object is its ID. This ID uniquely represents a game object instance throughout the entire game. The type of the ID is an implementation detail, but it should have a value space large enough to represent all game objects that could reasonably exist during the life of the game. In practice, standard UUIDs, GUIDs, or unsigned 128-bit integers usually work well.
Different server types use different types of game state. Therefore, a given type of game object might have different implementations across server types. A server type that deals with physics simulation would implement behavior and data related to position, orientation, movement, and collision detection. A server type that deals with player character customization would emphasize wearable objects, visual enhancements, and possibly stats, character class, and other properties.
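As a minimal sketch of this idea, the snippet below shows two server types holding different views of the same game object, linked only by a shared ID. The class names, fields, and the choice of a random UUID are assumptions made for illustration and are not prescribed by the pattern.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PhysicsGameObject:
    """Hypothetical Physics server view: spatial state only."""
    object_id: uuid.UUID
    position: tuple = (0.0, 0.0, 0.0)
    velocity: tuple = (0.0, 0.0, 0.0)


@dataclass
class GameplayGameObject:
    """Hypothetical Gameplay server view: stats and effects only."""
    object_id: uuid.UUID
    hit_points: int = 100
    buffs: list = field(default_factory=list)


def new_object_id() -> uuid.UUID:
    # A random 128-bit UUID; any sufficiently large ID space would do.
    return uuid.uuid4()


# The same ID identifies the object in every server type that hosts it.
object_id = new_object_id()
physics_view = PhysicsGameObject(object_id)
gameplay_view = GameplayGameObject(object_id)
```

Each server type keeps only the state it needs, while the shared ID lets events address the same logical entity anywhere in the cluster.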
The servers used by this pattern are stateful. Once loaded, game state stays in memory for repeated use until it’s no longer needed. Each game object instance exists in only one process of each server type at a given time. This reduces object creation overhead and helps ensure state consistency. Also consider preventing movement of game objects between processes. This makes locating objects more deterministic and eliminates certain race conditions.
Game objects interact via asynchronous events. This generally permits high concurrency. It also ensures that game objects interact consistently whether they are in the same process or in different processes. When objects in different processes interact, these events travel via inter-server messages to the appropriate server process, which dispatches them to the target game object.
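One possible way to picture event-based interaction is sketched below using Python's `asyncio`. The `Event`, `GameObject`, and `ServerProcess` names, the `damage` and `stop` event names, and the per-object inbox queue are all assumptions for illustration. Inter-server messaging is left out, so `dispatch` stands in for the point where a received message is handed to its target object.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Event:
    target_id: str
    name: str
    payload: dict


class GameObject:
    """Hypothetical game object that handles events one at a time."""

    def __init__(self, object_id: str):
        self.object_id = object_id
        self.hit_points = 100
        self.inbox = asyncio.Queue()

    async def run(self):
        # Process events sequentially, so no locking is needed on object state.
        while True:
            event = await self.inbox.get()
            if event.name == "stop":
                break
            if event.name == "damage":
                self.hit_points -= event.payload["amount"]


class ServerProcess:
    """Hypothetical server process that routes events to local objects."""

    def __init__(self):
        self.objects = {}

    def add(self, obj: GameObject):
        self.objects[obj.object_id] = obj

    def dispatch(self, event: Event):
        # Senders never touch the target's state; they only enqueue an event.
        self.objects[event.target_id].inbox.put_nowait(event)


async def demo() -> int:
    server = ServerProcess()
    npc = GameObject("npc-1")
    server.add(npc)
    task = asyncio.create_task(npc.run())
    server.dispatch(Event("npc-1", "damage", {"amount": 30}))
    server.dispatch(Event("npc-1", "stop", {}))
    await task
    return npc.hit_points
```

Running `asyncio.run(demo())` applies one damage event and then stops the object. The sender uses the same `dispatch` call whether the event originated locally or arrived in an inter-server message.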
The game must provide a facility for locating objects in the server cluster and routing events to them. Implementation details vary, but consider the following:
- Use a logical address to target game objects from within game server code. The logical address should include the target object’s ID. It should not include any physical host or routing information. It may include information about the server type containing the target game object. An example of a valid scheme might be <server_type>::<object_id>, but the scheme <host_address>:<port>::<object_id> would be invalid.
- Define a server type to act as a registry or directory of game object instances. This server should map the logical address of a game object to its physical address. It will be the authority for assigning objects to server process instances and locating them.
- Use consistent hashing, distributed hash tables (DHTs), or some combination to assign game objects to server process instances. Consistent hashing is easy to implement, fast, and deterministic. However, it’s prone to “hot spots” and doesn’t allow reallocating IDs at runtime. Distributed hash tables are more flexible, but more complex. They allow relocation of game objects at runtime and support more robust failure recovery. 1)McCaffrey, Caitie. “Building Scalable Stateful Services.” Strange Loop 2015. 27 Sept. 2015. Lecture. (slides: https://speakerdeck.com/caitiem20/building-scalable-stateful-services)
- Consider caching game object addresses and mappings locally in each server process to reduce round trips to the directory server. This is especially relevant when game objects don’t move between processes once they’ve been created.
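The suggestions above can be combined in a rough sketch like the following, which resolves a logical address of the form `<server_type>::<object_id>` to a process using a consistent hash ring and caches the result. The `HashRing` and `Directory` classes, the `host:port` strings, and the use of virtual nodes (a common way to soften hot spots) are assumptions for illustration. A real directory would run as its own server type rather than as an in-process class.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes for one server type."""

    def __init__(self, processes, replicas=64):
        self._ring = sorted(
            (_hash(f"{process}#{i}"), process)
            for process in processes
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    def locate(self, logical_address: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, _hash(logical_address))
        return self._ring[idx % len(self._ring)][1]


class Directory:
    """Maps logical addresses to physical addresses, with a local cache."""

    def __init__(self, processes_by_type):
        self._rings = {t: HashRing(ps) for t, ps in processes_by_type.items()}
        self._cache = {}

    def resolve(self, logical_address: str) -> str:
        if logical_address not in self._cache:
            server_type, _, _object_id = logical_address.partition("::")
            ring = self._rings[server_type]
            self._cache[logical_address] = ring.locate(logical_address)
        return self._cache[logical_address]


# Hypothetical cluster layout; the host names and ports are placeholders.
directory = Directory({"physics": ["host-a:7001", "host-b:7001"]})
physical = directory.resolve("physics::42")
```

Because the mapping is deterministic, every process that knows the same membership list computes the same answer. This is why caching works best when objects don't move after creation.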
One of the main benefits of this pattern is its support for efficiently scaling a game cluster. This typically means scaling horizontally instead of vertically. Specifically, this pattern supports adding new process instances of one or more server types to handle increasing load. Increased load ideally comes from adding more players, but it can also come from adding game features that demand more processing power. This pattern handles either case. The extra processes run on new or existing hardware or virtual nodes as needed.
Ideally, the number of process instances of each server type varies independently from that of other server types. This allows you to allocate computing resources to the parts of the game that require it. The key to achieving this is identifying the appropriate server types and their responsibilities.
Scaling a cluster means managing cluster membership, or the set of processes that run in the cluster. Static cluster membership requires manually adding or removing cluster members. Dynamic cluster membership scales automatically, based on load metrics and heuristics.
Static cluster membership is simplest to implement and understand. It requires maintaining a manifest of all the process instances in the cluster and the nodes on which they run. The easiest way to manage this is to deploy all runtime artifacts to all nodes, and use the manifest to control which servers to run on each node. The downside of this approach is that reconfiguring a cluster by reallocating server type instances or nodes usually requires a service interruption. Also, it’s not fault-tolerant when a machine fails, as replacing a node means updating all copies of the manifest.
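To make the manifest idea concrete, here is a small sketch in which every node receives the same manifest and starts only the processes listed under its own name. The node names, server type names, instance counts, and helper functions are hypothetical.

```python
# Hypothetical manifest: node name -> server type -> number of instances.
MANIFEST = {
    "node-01": {"gameplay": 2, "ai": 1},
    "node-02": {"physics": 2, "visibility": 1},
    "node-03": {"directory": 1, "gameplay": 1},
}


def processes_for(node: str, manifest=MANIFEST) -> list:
    """Return the process names a given node should launch."""
    return [
        f"{server_type}-{i}"
        for server_type, count in sorted(manifest.get(node, {}).items())
        for i in range(count)
    ]


def instance_counts(manifest=MANIFEST) -> dict:
    """Total instances of each server type across the whole cluster."""
    totals = {}
    for servers in manifest.values():
        for server_type, count in servers.items():
            totals[server_type] = totals.get(server_type, 0) + count
    return totals
```

Changing the cluster layout then means editing the manifest and redeploying it to every node, which is where the service interruption described above tends to come from.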
Dynamic cluster membership affords more flexibility and efficiency in responding to changing load, but it’s considerably more complex and requires specialized logic. A dynamic cluster membership system monitors the performance and health of the cluster, adding or removing process instances or machine nodes as needed. This system must know about the cluster’s different server types, their performance characteristics, and their processing requirements.
A simple way to do this is with a central authoritative server and data store. However, this creates a single point of failure that defeats the purpose of dynamic cluster management. Two generally accepted alternatives are a gossip protocol and a consensus-based solution.2)Subramaniyan, Rajagopal, Pirabhu Raman, Alan D. George, and Matthew Radlinski. “GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems.” Cluster Computing 9 (2006): 101-20. Print. 3)“How to Build a Highly Available System Using Consensus.” Microsoft Research. Microsoft. Web. 3 Dec. 2015. <http://research.microsoft.com/en-us/um/people/blampson/58-consensus/Acrobat.pdf>. These distributed solutions include facilities for robust failure handling and dynamic scaling.
When applying this pattern, you should consider these points:
- Start by implementing the manual approach. Nearly everything you learn from doing this, and most of what you implement, will still apply if you decide to go with a dynamic solution later.
- If you stay with a manual configuration approach, by all means build an automated system for deploying and applying changes. Sufficient tooling and automation can go a long way towards reducing the length of service outages due to cluster reconfiguration.
- Whether you go manual or automatic, consider using the actor model to implement your game servers. First, it’s a natural fit for the game object abstraction. Second, it’s a good way to get concurrency support with minimal complexity for developers. Finally, third-party actor frameworks exist that include support for dynamic cluster membership. Some examples include Microsoft’s Project Orleans (.NET) 4)“Microsoft Project Orleans Documentation.” Project Orleans GitHub Repository. Microsoft Research. Web. 2 Dec. 2015. <https://dotnet.github.io/orleans/Runtime-Implementation-Details/Cluster-Management>., Akka (Java/Scala) 5)“Cluster Specification.” Akka Documentation. Typesafe, Inc. Web. 2 Dec. 2015. <http://doc.akka.io/docs/akka/current/common/cluster.html>., and Akka.NET (.NET) 6)“Akka.Cluster Overview.” Akka.Cluster Overview. Getakka.net. Web. 2 Dec. 2015. <http://getakka.net/docs/clustering/cluster-overview>.
- If you decide to pursue the automatic approach, consider hosting the game with a cloud provider. It will probably be more cost-effective because you’ll pay only for the resources you use. Also, cloud hosting often comes with extensive automation support that you can leverage for even better dynamic capabilities.
Further discussion of specific dynamic scaling techniques is out of scope for this article. Some useful introductory articles on these topics include “Using Gossip Protocols For Failure Detection, Monitoring, Messaging And Other Good Things” and “Consensus: Reaching Agreement.”
Player state persists in a data store. However, the pattern deliberately avoids specifying a database architecture or technology. Consider the following suggestions when designing your persistence system:
- The persistence solution should generally optimize for frequent writes and infrequent reads. Game play generates frequent state changes, all of which must be saved to guarantee player data consistency. Because game objects don’t move between server processes, player state should only have to be loaded once during a given play session.
- The persistence solution should support distributed operation if possible, to scale appropriately with the rest of the cluster.
- A write-back in-memory cache can help to amortize write operations over time, which will reduce load on the data store at the risk of some data loss. This might be useful in cases where the storage technology may not scale as well as the rest of the cluster.
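The write-back idea in the last suggestion could look roughly like the sketch below. The `WriteBackCache` class, its `flush_threshold` parameter, and the dict-like backing store are assumptions for illustration. A production version would also flush on a timer and on shutdown to bound the window of possible data loss.

```python
class WriteBackCache:
    """Buffers writes in memory and flushes them to the store in batches."""

    def __init__(self, store, flush_threshold=100):
        self._store = store  # hypothetical stand-in: any dict-like data store
        self._dirty = {}     # pending writes not yet persisted
        self._flush_threshold = flush_threshold

    def write(self, key, value):
        # Repeated writes to the same key collapse into one pending entry.
        self._dirty[key] = value
        if len(self._dirty) >= self._flush_threshold:
            self.flush()

    def read(self, key):
        # Pending writes take priority over what the store currently holds.
        if key in self._dirty:
            return self._dirty[key]
        return self._store[key]

    def flush(self):
        # One batched update; unflushed entries are lost if the process dies.
        self._store.update(self._dirty)
        self._dirty.clear()
```

Several state changes to the same player between flushes cost the data store a single write. Anything still in the pending set when a process crashes is the data at risk.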
We have an MMO game server that distributes the core game play load across multiple server processes to support thousands of connected clients at one time. The game can scale horizontally to accommodate increasing player load and additional functionality. Assigning specific game play responsibilities to different server types permits different parts of the game to scale independently of one another. The architecture supports the use of either manual or automatic scaling techniques, as desired.
This pattern provides flexibility in scaling at the cost of complexity and development effort. It recognizes the long-term view of an MMO as a service with an ongoing responsibility to continually delight players.
This approach requires strong technical experience and leadership to guide early development. This is especially true of identifying server types and responsibilities and of implementing state management and object addressing features. This work should be front-loaded into the project plan, possibly deferring work on game play systems. Because of this, implementing this pattern requires maturity on the part of the leadership team.
That said, with the right leadership and planning, the pattern will likely permit more stable content and feature development later on. The modular architecture lends itself well to parallel development and rapid iteration, which can shorten development times. This might offset the perceived “delays” of early development.
The pattern provides the following benefits over less distributed patterns:
- Modular server design readily permits enhancements and additions to game play functionality.
- Server types and responsibilities permit efficient allocation of computing resources to the game play functionality that requires it.
- Horizontal scaling handles increasing player load with lower operational cost.
- You can start simple with manual cluster management and build a more dynamic system once the need arises.
- The pattern’s flexibility provides a better foundation for long-term maintenance and growth of the MMO service.
This pattern is probably a good one to use when:
- The development team and studio management are fairly experienced with MMO development in general.
- The development plan can tolerate more up-front engineering work that produces fewer working game play features early in the project.
- Team staffing allows for hiring engineers with experience in building scalable servers in addition to game system programmers.
- The business plan focuses on longer term ROI over faster time to market.
This pattern describes a general architectural approach to scaling an MMO server cluster. It presents certain concepts that are essential and others that are simply recommendations. For clarity, here’s a summary of the essential concepts:
- Different server types have specific game play responsibilities.
- Each game object exists in only one process instance of each server type at a given time.
- Functionality exists to find game objects within the cluster.
- Game objects interact via asynchronous events.
Anything not in the list above is most likely a recommendation. Some of the more important ones are:
- Game objects do not move between processes.
- Use the actor model for concurrency.
- Avoid implementing singleton processes.
- Rift (Trion Worlds)
- Lineage 7)This is an educated guess, based on my own observations. I can’t find any information to support or refute this. (NCsoft)
Note: I’m sure other games have used this pattern, or something like it. Please leave me a comment in the feedback box if you know of any others. ~matsaleh