Pattern: Responsibility-Oriented Game Server

Problem

How do we distribute core game play load across multiple processes in a way that supports flexible scaling and efficient allocation of computing resources?

Context

We are developing the server for a massively multiplayer online game with a distributed architecture. The game design seeks to create an immersive play experience by enabling thousands of players to interact with each other in a shared virtual world.

Forces

  1. The game design depends on having many players online at the same time.
  2. The game design calls for world maps that emphasize exploration, rich game play, and frequent and varied interaction with other players and NPCs.
  3. The game design specifies that when possible, map design and game play, rather than runtime performance, should decide the size of the player population on any given map.
  4. The game design may specify the use of ephemeral maps, where small groups of players choose to share in scripted experiences.
  5. The game server uses a distributed deployment architecture.
  6. The game server uses the Distributed Network Connections pattern to manage game client connections.
  7. The development plan allocates time and resources towards developing a flexible server configuration that supports scaling horizontally to meet increasing load demand. Dynamic scaling, while desirable, is not a requirement.
  8. The long-term product plan calls for incremental enhancement of the game through adding content and increasing the size of the virtual world.
  9. The business and operating plan seeks to increase capacity incrementally, by adding more machines instead of replacing them with more powerful ones.

Solution

Develop a Responsibility-Oriented Game Server strategy that defines server types that are responsible for specific subsets of game play functionality. Instances of the different server types collaborate to carry out end-to-end use cases.

Identifying Responsibilities

Assign responsibilities to server types in a way similar to how you would for classes or other modular software units, but at a higher level of abstraction. Some suggestions to consider include:

  • Use functional decomposition of high level game play activities to identify key operations.
  • Write use cases for typical game play scenarios to identify the main behaviors exercised.
  • Group related behaviors according to the type of operations performed and the data involved. These groups are candidate responsibilities. Identify one server type to own each candidate responsibility.
  • Ensure that all functionality each server type will perform falls within the scope of its single high-level responsibility.
  • Avoid repeated back-and-forth interactions between server types during a single use case.
  • Avoid long request chains of dependent (i.e. synchronous) operations that span several server types.
  • Prefer operations that are fairly self-contained and are truly asynchronous, where the result can be handled as a stand-alone event.
  • Whenever possible, support multiple process instances of each server type. Limit use of singletons to cases where the cost or complexity of multiple instances is prohibitive.
  • Iterate until your set of server types seems fairly stable. Be willing to revisit as requirements change or new technical decisions emerge.

Example Server Types

Each game implements different server types based on game design, technical, and operational factors. The pattern doesn’t call for specific types. The examples in the table below illustrate the pattern’s concepts without prescribing a specific implementation or claiming to represent a complete cluster.

| Server Type | Description | Game State |
|-------------|-------------|------------|
| Gameplay | Performs all core game play operations, including game systems logic and game effects. | character stats, hit points, buffs/debuffs, game effects, experience points |
| AI | Performs all AI "thinking" that controls NPC behavior. Includes pathfinding, decision making, perception. | finite state machines, decision trees, target lists |
| Visibility | Uses world geometry and spatial partitioning data to compute whether objects can be seen from a given location in the virtual world. | position, line of sight, occlusion bodies, map geometry |
| Physics | Uses world geometry and static collision data to detect collisions between moving and static objects. Also simulates uncontrolled movement (e.g. falling) and enforces game rules for player movement. | position, collision bodies, map geometry, pathfinding graph |
| Directory | Maintains addressing information for game state within the cluster. For a given game object, resolves the process instance of each server type that has the given object. | game object addresses, lookup tables, node status |
Responsibility-Oriented Game Server Pattern

State Management

It’s useful to encapsulate game state in a logical abstraction called a game object. All meaningful entities in the game, including player characters, NPCs, items, interactive world objects, and the like, are instances of the game object type.

The semantics of this abstraction vary by implementation, but the key property of the game object is its ID. This ID uniquely represents a game object instance throughout the entire game. The type of the ID is an implementation detail, but it should have a value space large enough to represent all game objects that could reasonably exist during the life of the game. In practice, standard UUIDs, GUIDs, or unsigned 128-bit integers usually work well.

Different server types use different types of game state. Therefore, a given type of game object might have different implementations across server types. A server type that deals with physics simulation would implement behavior and data related to position, orientation, movement, and collision detection. A server type that deals with player character customization would emphasize wearable objects, visual enhancements, and possibly stats, character class, and other properties.

The servers used by this pattern are stateful. Once loaded, game state stays in memory for repeated use until it’s no longer needed. Each game object instance exists in only one process of each server type at a given time. This reduces object creation overhead and helps ensure state consistency. Also consider preventing movement of game objects between processes. This makes locating objects more deterministic and eliminates certain race conditions.
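As a rough illustration of the game object abstraction, here is a minimal Python sketch. The class names (`GameObject`, `PhysicsObject`, `GameplayObject`) and their fields are hypothetical, chosen only to show how one ID can tie together different per-server-type implementations; a real game would define its own.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class GameObject:
    """Base abstraction: the ID is the only state every server type shares."""
    object_id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class PhysicsObject(GameObject):
    """Hypothetical view of a game object inside a physics server."""
    position: tuple = (0.0, 0.0, 0.0)
    velocity: tuple = (0.0, 0.0, 0.0)


@dataclass
class GameplayObject(GameObject):
    """Hypothetical view of the same game object inside a gameplay server."""
    hit_points: int = 100
    buffs: list = field(default_factory=list)


# The same ID identifies one logical game object in both server types.
shared_id = uuid.uuid4()
physics_view = PhysicsObject(object_id=shared_id, position=(10.0, 0.0, 5.0))
gameplay_view = GameplayObject(object_id=shared_id, hit_points=80)
```

Each server type holds only the state it needs, and the shared ID is what lets the rest of the cluster treat the two views as one game object.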

Addressing Objects

Game objects interact via asynchronous events. This generally permits high concurrency. It also ensures that game objects interact consistently whether they are in the same process or in different processes. When objects in different processes interact, these events travel via inter-server messages to the appropriate server process, which dispatches them to the target game object.
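One way to picture event-based interaction is the following Python sketch, built on the standard `asyncio` library. The `Event`, `GameObject`, and `ServerProcess` names are assumptions made for illustration, and the in-memory queue stands in for whatever inter-server messaging a real cluster would use.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Event:
    target_id: str   # ID of the game object that should receive the event
    name: str
    payload: dict


class GameObject:
    def __init__(self, object_id):
        self.object_id = object_id
        self.received = []

    async def handle(self, event):
        # A real handler would apply game rules; this one only records the event.
        self.received.append((event.name, event.payload))


class ServerProcess:
    """Hypothetical server process that owns objects and an inbound event queue."""

    def __init__(self):
        self.objects = {}
        self.inbox = asyncio.Queue()

    def add(self, obj):
        self.objects[obj.object_id] = obj

    async def send(self, event):
        # Local and remote senders would both end up here; a remote sender
        # would reach this queue through an inter-server message.
        await self.inbox.put(event)

    async def run_until_empty(self):
        while not self.inbox.empty():
            event = await self.inbox.get()
            target = self.objects.get(event.target_id)
            if target is not None:   # events for unknown objects are dropped
                await target.handle(event)


async def demo():
    server = ServerProcess()
    npc = GameObject("npc-1")
    server.add(npc)
    await server.send(Event("npc-1", "damage", {"amount": 12}))
    await server.send(Event("missing", "damage", {"amount": 1}))
    await server.run_until_empty()
    return npc.received


result = asyncio.run(demo())
```

The sender never calls the target object directly, so the calling code looks the same whether the target lives in the same process or a different one.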

The game must provide a facility for locating objects in the server cluster and routing events to them. Implementation details vary, but consider the following:

  • Use a logical address to target game objects from within game server code. The logical address should include the target object’s ID. It should not include any physical host or routing information. It may include information about the server type containing the target game object. An example of a valid scheme might be <server_type>::<object_id>, but the scheme <host_address>:<port>::<object_id> would be invalid.
  • Define a server type to act as a registry or directory of game object instances. This server should map the logical address of a game object to its physical address. It will be the authority for assigning objects to server process instances and locating them.
  • Use consistent hashing, distributed hash tables (DHT), or some combination to assign game objects to server process instances. Consistent hashing is easy to implement, fast, and deterministic. However, it’s prone to “hot spots” and doesn’t allow reallocating IDs at runtime. Distributed hash tables are more flexible, but more complex. They allow relocation of game objects at runtime and support more robust failure recovery. [1: McCaffrey, Caitie. “Building Scalable Stateful Services.” Strange Loop 2015. 27 Sept. 2015. Lecture. Slides: https://speakerdeck.com/caitiem20/building-scalable-stateful-services]
  • Consider caching game object addresses and mappings locally in each server process to reduce round trips to the directory server. This is especially relevant when game objects don’t move between processes once they’ve been created.
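The suggestions above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the `HashRing` and `AddressResolver` classes, the SHA-256 hash, the virtual-node count, and the host addresses are all hypothetical, and the resolver folds the directory lookup and the per-process cache into one object for brevity.

```python
import bisect
import hashlib


def make_address(server_type, object_id):
    """Build a logical address; it carries no host or port information."""
    return f"{server_type}::{object_id}"


def parse_address(address):
    server_type, object_id = address.split("::", 1)
    return server_type, object_id


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, processes, replicas=64):
        self._ring = sorted(
            (self._hash(f"{proc}#{i}"), proc)
            for proc in processes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def lookup(self, object_id):
        # First ring position clockwise from the object's hash, wrapping around.
        index = bisect.bisect_right(self._keys, self._hash(object_id)) % len(self._keys)
        return self._ring[index][1]


class AddressResolver:
    """Maps logical addresses to physical process addresses, with a local cache."""

    def __init__(self, rings):
        self._rings = rings   # one ring per server type
        self._cache = {}

    def resolve(self, address):
        if address not in self._cache:
            server_type, object_id = parse_address(address)
            self._cache[address] = self._rings[server_type].lookup(object_id)
        return self._cache[address]


physics_hosts = ["10.0.0.1:7001", "10.0.0.2:7001"]
resolver = AddressResolver({
    "physics": HashRing(physics_hosts),
    "gameplay": HashRing(["10.0.0.3:7002", "10.0.0.4:7002", "10.0.0.5:7002"]),
})
address = make_address("physics", "3f2c9a")
host = resolver.resolve(address)
```

Game code only ever sees `physics::3f2c9a`; the host and port appear at the routing layer. Caching the result is safe here because, in this sketch, objects never move once assigned.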

Scaling

One of the main benefits of this pattern is its support for efficiently scaling a game cluster. This typically means scaling horizontally instead of vertically. Specifically, this pattern supports adding new process instances of one or more server types to handle increasing load. Increased load ideally comes from adding more players, but it can also come from adding game features that demand more processing power. This pattern handles either case. The extra processes run on new or existing hardware or virtual nodes as needed.

Ideally, the number of process instances of each server type varies independently from that of other server types. This allows you to allocate computing resources to the parts of the game that require it. The key to achieving this is identifying the appropriate server types and their responsibilities.

Scaling a cluster means managing cluster membership, or the set of processes that run in the cluster. Static cluster membership requires manually adding or removing cluster members. Dynamic cluster membership scales automatically, based on load metrics and heuristics.

Static cluster membership is simplest to implement and understand. It requires maintaining a manifest of all the process instances in the cluster and the nodes on which they run. The easiest way to manage this is to deploy all runtime artifacts to all nodes, and use the manifest to control which servers to run on each node. The downside of this approach is that reconfiguring a cluster by reallocating server type instances or nodes usually requires a service interruption. Also, it’s not fault-tolerant when a machine fails, as replacing a node means updating all copies of the manifest.
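To make the manifest idea concrete, here is a small Python sketch. The manifest layout, node names, and function names are assumptions for illustration; the article does not prescribe a format.

```python
# Hypothetical manifest: node name -> {server type: number of process instances}.
MANIFEST = {
    "node-a": {"gameplay": 2, "ai": 1},
    "node-b": {"physics": 2, "visibility": 1},
    "node-c": {"gameplay": 1, "directory": 1},
}


def processes_for(node, manifest=MANIFEST):
    """Return the process instance names this node should start."""
    return [
        f"{server_type}-{node}-{index}"
        for server_type, count in sorted(manifest.get(node, {}).items())
        for index in range(count)
    ]


def instance_counts(manifest=MANIFEST):
    """Total process instances per server type across the whole cluster."""
    totals = {}
    for servers in manifest.values():
        for server_type, count in servers.items():
            totals[server_type] = totals.get(server_type, 0) + count
    return totals
```

With every runtime artifact deployed to every node, each node can call something like `processes_for("node-a")` at startup to decide what to launch. Scaling one server type then means editing its counts in the manifest and redeploying.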

Dynamic cluster membership affords more flexibility and efficiency in responding to changing load. But it’s considerably more complex and requires specialized logic. A dynamic cluster membership system monitors the performance and health of the cluster, adding or removing process instances or machine nodes as needed. This system must know about the cluster’s different server types, their performance characteristics, and their processing requirements.

A simple way to do this is with a central authoritative server and data store. However, this creates a single point of failure that defeats the purpose of dynamic cluster management. Two generally accepted approaches are to use a gossip protocol or a consensus-based solution. These distributed solutions include facilities for robust failure handling and dynamic scaling. [2: Subramaniyan, Rajagopal, Pirabhu Raman, Alan D. George, and Matthew Radlinski. “GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems.” Cluster Computing 9 (2006): 101-20. Print.] [3: “How to Build a Highly Available System Using Consensus.” Microsoft Research. Microsoft. Web. 3 Dec. 2015. <http://research.microsoft.com/en-us/um/people/blampson/58-consensus/Acrobat.pdf>.]

When applying this pattern, you should consider these points:

  • Start by implementing the manual approach. Nearly everything you learn from doing this, and most of what you implement, will still apply if you decide to go with a dynamic solution later.
  • If you stay with a manual configuration approach, by all means build an automated system for deploying and applying changes. Sufficient tooling and automation can go a long way towards reducing the length of service outages due to cluster reconfiguration.
  • Whether you go manual or automatic, consider using the actor model to implement your game servers. First, it’s a natural fit for the game object abstraction. Second, it’s a good way to get concurrency support with minimal complexity for developers. Finally, third-party actor frameworks exist that include support for dynamic cluster membership. Some examples include Microsoft’s Project Orleans (.NET) [4], Akka (Java/Scala) [5], and Akka.NET (.NET) [6]. [4: “Microsoft Project Orleans Documentation.” Project Orleans GitHub Repository. Microsoft Research. Web. 2 Dec. 2015. <https://dotnet.github.io/orleans/Runtime-Implementation-Details/Cluster-Management>.] [5: “Cluster Specification.” Akka Documentation. Typesafe, Inc. Web. 2 Dec. 2015. <http://doc.akka.io/docs/akka/current/common/cluster.html>.] [6: “Akka.Cluster Overview.” Akka.Cluster Overview. Getakka.net. Web. 2 Dec. 2015. <http://getakka.net/docs/clustering/cluster-overview>.]

  • If you decide to pursue the automatic approach, consider hosting the game with a cloud provider. It will probably be more cost-effective because you’ll pay only for the resources you use. Also, cloud hosting often comes with extensive automation support that you can leverage for even better dynamic capabilities.

Further discussion of specific dynamic scaling techniques is out of scope for this article. Some useful introductory articles on these topics include “Using Gossip Protocols For Failure Detection, Monitoring, Messaging And Other Good Things” and “Consensus: Reaching Agreement”.

Persistence

Player state persists in a data store. However, the pattern deliberately avoids specifying a database architecture or technology. Consider the following suggestions when designing your persistence system:

  • The persistence solution should generally optimize for frequent writes and infrequent reads. Game play generates frequent state changes, all of which must be saved to guarantee player data consistency. Because game objects don’t move between server processes, player state should only have to be loaded once during a given play session.
  • The persistence solution should support distributed operation if possible, to scale appropriately with the rest of the cluster.
  • A write-back in-memory cache can help to amortize write operations over time, which will reduce load on the data store at the risk of some data loss. This might be useful in cases where the storage technology may not scale as well as the rest of the cluster.
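The write-back idea in the last suggestion can be sketched as follows. This is a simplified Python illustration, not a production design: `WriteBackCache` is a hypothetical name, the backing store is assumed to be any dict-like object, and the flush trigger is a simple dirty-key count rather than a timer.

```python
class WriteBackCache:
    """Buffers writes in memory and flushes them to the store in batches."""

    def __init__(self, store, max_dirty=100):
        self._store = store          # any dict-like backing store (assumption)
        self._cache = {}
        self._dirty = set()
        self._max_dirty = max_dirty

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._store[key]
        return self._cache[key]

    def write(self, key, value):
        self._cache[key] = value
        self._dirty.add(key)
        if len(self._dirty) >= self._max_dirty:
            self.flush()

    def flush(self):
        # Repeated writes to one key collapse into a single store write.
        for key in self._dirty:
            self._store[key] = self._cache[key]
        self._dirty.clear()


store = {}
cache = WriteBackCache(store, max_dirty=3)
cache.write("player:1", {"hp": 90})
cache.write("player:1", {"hp": 75})   # overwrites the pending value
pending = dict(store)                  # snapshot taken before any flush
cache.flush()
```

Anything written but not yet flushed is lost if the process crashes, which is the data-loss risk mentioned above. A smaller `max_dirty` or a periodic flush narrows that window at the cost of more store writes.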

Resulting Context

We have an MMO game server that distributes the core game play load across multiple server processes to support thousands of connected clients at one time. The game can scale horizontally to accommodate increasing player load and additional functionality. Assigning specific game play responsibilities to different server types permits different parts of the game to scale independently of one another. The architecture supports the use of either manual or automatic scaling techniques, as desired.

Rationale

This pattern provides flexibility in scaling at the cost of complexity and development effort. It recognizes the long-term view of an MMO as a service with an ongoing responsibility to continually delight players.

This approach requires strong technical experience and leadership to guide early development. This is especially true with respect to identifying server types and responsibilities, and implementing state management and object addressing features. This work should be front-loaded into the project plan, possibly deferring work on game play systems. Because of this, implementing this pattern requires maturity on the part of the leadership team.

That said, with the right leadership and planning, the pattern will likely permit more stable content and feature development later on. The modular architecture lends itself well to parallel development and rapid iteration, which can shorten development times. This might offset the perceived “delays” of early development.

The pattern provides the following benefits over less distributed patterns:

  1. Modular server design readily permits enhancements and additions to game play functionality.
  2. Server types and responsibilities permit efficient allocation of computing resources to the game play functionality that requires it.
  3. Horizontal scaling handles increasing player load with lower operational cost.
  4. You can start simple with manual cluster management and build a more dynamic system once the need arises.
  5. The pattern’s flexibility provides a better foundation for long-term maintenance and growth of the MMO service.

This pattern is probably a good one to use when:

  1. The development team and studio management are fairly experienced with MMO development in general.
  2. The development plan can tolerate more up-front engineering work that produces fewer working game play features early in the project.
  3. Team staffing allows for hiring engineers with experience in building scalable servers in addition to game system programmers.
  4. The business plan focuses on longer term ROI over faster time to market.

Essential Concepts

This pattern describes a general architectural approach to scaling an MMO server cluster. It presents certain concepts that are essential and others that are simply recommendations. For clarity, here’s a summary of the essential concepts:

  • Different server types have specific game play responsibilities.
  • Game objects exist in a single server type process instance at a given time.
  • Functionality exists to find game objects within the cluster.
  • Game objects interact via asynchronous events.

Anything not in the list above is most likely a recommendation. Some of the more important ones are:

  • Game objects do not move between processes.
  • Use the actor model for concurrency.
  • Avoid implementing singleton processes.

Related Patterns

Known Uses

  • Rift (Trion Worlds)
  • Lineage (NCsoft) [7: This is an educated guess, based on my own observations. I can’t find any information to support or refute this.]

Note: I’m sure other games have used this pattern, or something like it. Please leave me a comment in the feedback box if you know of any others. ~matsaleh

 

Pattern Template: Coplien, James. “PLoP95_telecom – Info.” PLoP95_telecom – Info. James Coplien. Web. 30 Sept. 2015. .

 

13 thoughts on “Pattern: Responsibility-Oriented Game Server”

  1. This is an excellent series of articles, it did not have such information. I think a lot of beginners indie game developers will thank you.

    Write more)

    1. Thanks for your kind words! I do have more to write, just been a bit busy lately. I have one nearly done on seamless server boundaries that I’ll try to finish in the next week or so. Cheers!

  2. I wanted to jump in and say that I’m really enjoying this blog. Thanks!

    One thing I wanted to ask is how do you deal with large, expanding landscapes? It looks like you have them loaded in the Visibility and Physics nodes, which makes sense. It would seem when working with a large open world, having all maps loaded at once on a single node could decrease performance. How would you scale this?

    Also, I haven’t worked in this type of environment before, would sending this amount of network traffic between servers saturate the infrastructure?

    1. Hey there! Thanks so much for your comment and your interest. Sorry I’ve been slow moderating these lately, been focusing on some client work.

      Large, expanding landscapes are a real challenge, of course. The pattern I described doesn’t specifically address it, but I don’t think it prevents it either. You can certainly create a large enough map that it could exceed the ability of a single node to handle it.

      Since you mentioned “maps” (plural), you’re already hinting at one solution: simply break larger areas into smaller ones. Once you do that, it’s pretty straightforward to scale by adding visibility and physics nodes, and distributing the map data across the nodes according to their load characteristics. This pattern is designed to “scale out” in this way.

      You also mentioned a “large open world” however. If you’re talking about a seamless world with unrestricted movement and game play, well that’s one of the biggest challenges in MMO development. At some point, you’ll still have to chunk the world into smaller maps, so the scaling part on the server is more or less similar. Just distribute the chunked map data across the visibility and physics nodes.

      The harder issue here is creating the illusion of “seamlessness” in the world. That is, how do we move a character between maps without the player noticing? Or, even more challenging, how do we make characters interact with other objects across a map boundary?

      Fortunately, this pattern makes the game play part of this problem a little easier to solve. That’s because all game play activity is done by the gameplay servers, and once loaded into a process, player characters don’t move between processes. So, from that perspective, executing game interactions between characters on two different map chunks is no different than if they were on the same one. This is one of the strengths of this pattern over the “Map-Centric Game Server” pattern, which actually moves characters between processes when they move between maps.

      That leaves visibility and physics interactions across map boundaries. I haven’t had to solve this problem before, so I’ll have to make an educated guess. First, I would insist on creating uniform, square map chunks to keep it simple. Then, I’d identify an “overlap area”, a fixed-width band all around the map perimeter, such that objects in the overlap area of a map chunk are proxied on the adjacent map. Then, visibility and collision computations can be done between objects of the adjacent map and the proxied objects. Certainly this would be a complex problem to solve, and I don’t want to oversimplify it, but that’s how I’d start out.

      As for your question on inter-server network traffic, yes, you can definitely over-saturate your infrastructure! I’ve seen it! 🙂 You have to take this into account during your technical design, and try to group responsibilities in a way that minimizes inter-server traffic. Also, this pattern doesn’t specifically call for assigning one process per hardware/virtual node, so in some cases you can keep the traffic internal by assigning specific servers to talk to each other and grouping them on a single node. Here, your most important tool is a good load testing harness and performance metrics. On the project I worked on that used this pattern, we spent a few months testing and tuning our servers around this important problem.

      I hope my answers have been helpful, and thanks for visiting Engines of Delight!

      1. Thank you so much for your very detailed reply, I really appreciate it. I’m starting to understand the actor model a little better, and how to separate these tasks. I was more focused on a map orientated pattern, so it was hard to grasp at first 🙂 The Orleans documentation and Caitie McCaffrey’s presentation helped. I’m looking to CAF for C++, and going to start playing around with it. Thanks for introducing me!

    1. Thanks, and thanks for the link! I’d heard of Multiverse, but have not investigated it. I’ll check it out. Have you done any work with that platform or others that you’d like to discuss further?

      Cheers!

      1. I only read about Multiverse. They documented everything in the wiki, but it’s gone. However, I created a backup and uploaded it here https://drive.google.com/open?id=0B6saLquRxoPrWVl6UlpnUXZlVWs
        Multiverse is also the foundation of Atavism, an MMORPG framework for Unity3D http://atavismonline.com/

        I also gathered some information on Wildstar. They have 11 daemons in total (http://www.wildstar-online.com/en/news/lets-talk-about-servers/). I found the following ones in the beta patch notes: World Server (handles part of the worlds), Gateway Server/User Daemon (gateway to all other servers), Auth Server (authentication and permission), Econ Server (auction house), Regional Server (cross-realm chat), chat server, login server, …

        This is actually the first time someone has put a name on this pattern, so I’m very curious about it. I also think that this is the solution for handling a lot of players in close vicinity.

        In general, there are only a few MMORPGs that can handle thousands of players without instances. Apart from the ones that you mentioned, I can think of WoW and Vanguard. Do you know if they used this pattern or how they scale?

        1. Thanks very much for digging up those links!

          There are several ways to handle the problem of scaling MMO servers, of course. Instancing is one of them, and it’s relatively efficient and straightforward to implement. Another way is the time-tested method of “sharding” servers. That is, creating a copy of the entire virtual world to run on another set of machines. This is rather crude, but can be effective. It works well as long as players pick a single shard to play on. However, if they ever want to play on a different shard it’s cumbersome. The player must either create a completely new character, effectively starting from scratch, or – if the game supports it – request a character transfer to the new shard. The latter usually comes with delays and certain restrictions for the player.

          Most MMOs use the sharding technique because it’s easy to do. I’m pretty sure this includes WoW, but I don’t know about Vanguard. Three of the games I’ve worked on – Rift, Tabula Rasa, and Ultima Online, used this approach. I believe that Eve Online does not do this, IIRC.

          As for instancing, I’m not sure whether WoW or Vanguard do this or not.

          Again, thanks for visiting and contributing to Engines of Delight!

          Cheers

  3. I can’t believe that I found this article. Thanks so much, you helped me understand what I’ve wanted to understand for years.
