# Load Tests for gRPC Allocation Service

[Allocation Load Test](#allocation-load-test) and [Scenario Tests](#scenario-tests)
for testing the performance of the gRPC allocation service.

## Prerequisites

1. A [Kubernetes cluster](https://agones.dev/site/docs/installation/creating-cluster/) with [Agones](https://agones.dev/site/docs/installation/install-agones/)
   - We recommend installing Agones using the [Helm](https://agones.dev/site/docs/installation/install-agones/helm/) package manager.
   - If you are running in GCP, use a regional cluster instead of a zonal cluster to ensure high availability of the cluster control plane.
   - Use a dedicated node pool for the Agones controllers with multiple CPUs per node, e.g. `e2-standard-4`.
   - For the Allocation Load Test:
     - In the default node pool (where the game server pods are created), 75 nodes are required to make sure there is enough capacity for all game servers to move into the `Ready` state. When using a regional GKE cluster with three zones, that requires a configuration of 25 nodes per zone.
   - For the Scenario Tests:
     - See [Kubernetes Cluster Setup for Scenario Tests](#kubernetes-cluster-setup-for-scenario-tests)
2. A configured [Allocator Service](https://agones.dev/site/docs/advanced/allocator-service/)
   - The allocator service uses gRPC. To be able to call the service, TLS and mTLS have to be set up on both the server and the client.
3. (Optional) [Metrics](https://agones.dev/site/docs/guides/metrics/) for monitoring Agones workloads

# Allocation Load Test

This load test aims to validate the performance of the gRPC allocation service.

## Fleet Setting

We used the sample [fleet configuration](./fleet.yaml). We set the `automaticShutdownDelaySec` parameter to 600 so the simple-game-server game servers shut down 10 minutes after allocation.
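For reference, the relevant part of such a fleet spec looks roughly like the sketch below. The image reference and the exact way the shutdown delay is passed are assumptions here; [fleet.yaml](./fleet.yaml) is the authoritative configuration.

```yaml
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: load-test-fleet
spec:
  replicas: 4000
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            # placeholder image reference; see fleet.yaml for the real one
            image: example.com/simple-game-server:latest
            # shut the game server down 600 seconds (10 minutes) after allocation
            args: ["--automaticShutdownDelaySec=600"]
```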
This makes it easy to re-run the test without having to delete the game servers manually, and allows tests to run continuously.

## Running the test

```
kubectl apply -f ./fleet.yaml
```

Wait until the fleet shows 4000 ready game servers before running the allocation script:

```
kubectl get fleet
NAME              SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
load-test-fleet   Packed       4000      4000      0           4000    2m38s
```

You can use the provided runAllocation.sh script by providing two parameters:
- the number of clients (to perform allocations in parallel)
- the number of allocations per client

To make 4000 allocation calls, you can provide 40 and 100:

```
./runAllocation.sh 40 100
```

The script will print out the start and end date/time:

```
started: 2020-10-22 23:33:25.828724579 -0700 PDT m=+0.005921014
finished: 2020-10-22 23:34:18.381396416 -0700 PDT m=+52.558592912
```

If any errors occurred, the error messages are printed as well:

```
started: 2020-10-22 22:16:47.322731849 -0700 PDT m=+0.002953843
(failed(client=3,allocation=43): rpc error: code = Unknown desc = error updating allocated gameserver: Operation cannot be fulfilled on gameservers.agones.dev "simple-game-server-mlljx-g9crp": the object has been modified; please apply your changes to the latest version and try again
(failed(client=2,allocation=47): rpc error: code = Unknown desc = error updating allocated gameserver: Operation cannot be fulfilled on gameservers.agones.dev "simple-game-server-mlljx-rxflv": the object has been modified; please apply your changes to the latest version and try again
(failed(client=7,allocation=45): rpc error: code = Unknown desc = error updating allocated gameserver: Operation cannot be fulfilled on gameservers.agones.dev "simple-game-server-mlljx-x4khw": the object has been modified; please apply your changes to the latest version and try again
finished: 2020-10-22 22:17:18.822039094 -0700 PDT m=+31.502261092
```

You can use environment variables to override the defaults. For example, to run only a single test round:

```
TESTRUNSCOUNT=1 ./runAllocation.sh 40 10
```

# Scenario Tests

The scenario test allows you to generate a variable number of allocations to
your cluster over time, simulating a game where clients arrive in an unsteady
pattern. The game servers used in the test are configured to shut down after
being allocated, simulating the GameServer churn that is expected during
normal game play.

## Kubernetes Cluster Setup for Scenario Tests

For the scenario test to achieve high throughput, you can create multiple groups
of nodes in your cluster. During testing (on GKE), we created a node pool for
the Kubernetes system components (such as the metrics server and DNS servers), a
node pool for the Agones system components (as recommended in the installation
guide), and a node pool for the game servers.

On GKE, to restrict the Kubernetes system components to their own set of nodes,
you can create a node pool with the taint
`components.gke.io/gke-managed-components=true:NoExecute`.

To prevent the Kubernetes system components from running on the game server
node pool, that node pool was created with the taint
`scenario-test.io/game-servers=true:NoExecute`,
and the Agones system node pool used the standard taint
`agones.dev/agones-system=true:NoExecute`.

In addition, the GKE cluster was configured as a regional cluster to ensure high
availability of the cluster control plane.
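One operational detail implied by the taints above: any pod that should run on the tainted game-servers pool must carry a matching toleration, or Kubernetes will refuse to schedule it there (and `NoExecute` will evict it). A sketch of what that looks like in a pod spec; the actual fleet configuration may express this differently:

```yaml
# Toleration matching the scenario-test.io/game-servers=true:NoExecute taint;
# without it, game server pods cannot be scheduled onto (or remain on) that pool.
tolerations:
- key: "scenario-test.io/game-servers"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
```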
The following commands were used to construct a cluster for testing:

```bash
export REGION="us-west1"
export VERSION="1.23"

gcloud container clusters create scenario-test --cluster-version=$VERSION \
  --tags=game-server --scopes=gke-default --num-nodes=2 \
  --no-enable-autoupgrade --machine-type=n2-standard-2 \
  --region=$REGION --enable-ip-alias

gcloud container node-pools create kube-system --cluster=scenario-test \
  --no-enable-autoupgrade \
  --node-taints components.gke.io/gke-managed-components=true:NoExecute \
  --num-nodes=1 --machine-type=n2-standard-16 --region $REGION

gcloud container node-pools create agones-system --cluster=scenario-test \
  --no-enable-autoupgrade --node-taints agones.dev/agones-system=true:NoExecute \
  --node-labels agones.dev/agones-system=true --num-nodes=1 \
  --machine-type=n2-standard-16 --region $REGION

gcloud container node-pools create game-servers --cluster=scenario-test \
  --node-taints scenario-test.io/game-servers=true:NoExecute --num-nodes=1 \
  --machine-type n2-standard-2 --no-enable-autoupgrade \
  --region $REGION --tags=game-server --scopes=gke-default \
  --enable-autoscaling --max-nodes=300 --min-nodes=175
```

## Agones Modifications

For the scenario tests, we modified the Agones installation in a number of ways.

First, we made sure that the Agones pods would _only_ run in the Agones node
pool by changing the node affinity in the deployments for the controller,
allocator service, and ping service to
`requiredDuringSchedulingIgnoredDuringExecution`.

We also increased the resources for the controller and allocator service pods,
and made sure to specify both requests and limits (with equal values) so that
the pods were given the highest quality of service class, `Guaranteed`.
These configuration changes are captured in
[scenario-values.yaml](scenario-values.yaml) and can be applied during
installation using Helm:

```bash
helm install my-release --namespace agones-system -f scenario-values.yaml agones/agones --create-namespace
```

Alternatively, these changes can be applied to an existing Agones installation
by running [`./configure-agones.sh`](configure-agones.sh).

## Fleet Setting

We used the sample [fleet configuration](./scenario-fleet.yaml) and [fleet autoscaler configuration](./autoscaler.yaml).

To reduce pod churn in the system, the simple game servers are configured to
return themselves to `Ready` the first 10 times they are allocated, following
the [Reusing Allocated GameServers for more than one game
session](https://agones.dev/site/docs/integration-patterns/reusing-gameservers/)
integration pattern. After 10 simulated game sessions, the simple game servers
exit automatically. The fleet configuration above sets each game session to
last for 1 minute, representing a short game.

## Running the test

You can use the provided runScenario.sh script by providing one parameter: a
scenario file. The scenario file is a plain text file in which each line
represents a "scenario" that the program executes before moving on to the next
one. A scenario is a comma-separated triple of a duration, the number of
concurrent clients to use, and the interval (in milliseconds) between the
allocation requests submitted by each client. The program creates the desired
number of clients, and those clients send allocation requests to the allocator
service at the defined cadence for the scenario's duration. At the end of each
scenario, the program prints out statistics for that scenario.
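As a concrete illustration of that format, a scenario file can be written as below. The durations, client counts, and intervals are made-up values, and the exact duration syntax accepted by runScenario.sh should be checked against the script itself:

```shell
# Each line is: duration,concurrentClients,requestIntervalMs (values illustrative)
cat > my-scenario.txt <<'EOF'
10m,10,1000
10m,50,200
10m,10,1000
EOF

# The file is then passed as the single argument:
# ./runScenario.sh my-scenario.txt
wc -l < my-scenario.txt
```

The middle line simulates a spike: five times the clients, each sending five times as often.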
Two sample scenario files are included in this directory: one sends a constant
rate of allocations for the duration of the test, and the other sends a
variable number of allocations.

Since error counts are gathered per scenario, it's recommended to keep each
scenario short (e.g. 10 minutes) to narrow down the window in which errors
occurred, even if the allocation rate stays at the same level for longer than
10 minutes at a time.
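Following that recommendation, even a constant load can be expressed as several consecutive identical scenarios rather than one long one, so each 10-minute slice gets its own error counts. A hypothetical example (file name, client count, and interval are illustrative):

```shell
# Three identical 10-minute scenarios instead of a single 30-minute one:
# if errors appear, the per-scenario statistics localize them to a 10-minute window.
cat > constant-rate-split.txt <<'EOF'
10m,50,200
10m,50,200
10m,50,200
EOF
sort -u constant-rate-split.txt   # all three lines are identical
```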