
# PSR Developer Guide

This document describes how to develop PSR workers and scenarios that can be used to test specific Verrazzano areas.
Following is a summary of the steps needed:

1. Get familiar with the PSR tool, run the example and some scenarios.
2. Decide what component you want to test.
3. Decide what you want to test for your first scenario.
4. Decide what workers you need to implement your scenario use cases.
5. Implement a single worker and test it using Helm.
6. Create or update a scenario that includes the worker.
7. Test the scenario using the PSR CLI (psrctl).
8. Repeat steps 5-7 until the scenario is complete.
9. Update the README with your worker information.

## Prerequisites
- Read the [Verrazzano PSR README](./README.md) to get familiar with the PSR concepts and structure of the source code.
- A Kubernetes cluster with Verrazzano installed (full installation or the components you are testing).

## PSR Areas
Workers are organized into areas, where each area typically maps to one or more Verrazzano backend components. That isn't always
the case, however, as shown with the HTTP workers.  You can see the workers in the [workers](./backend/workers) package.
PSR scenarios are also grouped into areas.

The following area names are used in the source code and YAML configuration.
They are not exposed in metrics names; rather, each `worker.go` file specifies the metrics prefix, which is the long name.
For example, the OpenSearch worker uses the metric prefix `opensearch`.

1. argo - Argo
2. oam - OAM applications, Verrazzano application operator
3. cm - cert-manager
4. cluster - Verrazzano Cluster operator, multicluster
5. coh - Coherence
6. dns - ExternalDNS
7. jaeger - Jaeger
8. kc - Keycloak
9. http - HTTP tests
10. istio - Istio, Kiali
11. mysql - MySQL
12. nginx - NGINX Ingress Controller, AuthProxy
13. ops - OpenSearch, OpenSearchDashboards, Fluentd, VMO
14. prom - Prometheus stack, Grafana
15. rancher - Rancher
16. velero - Velero
17. wls - WebLogic

## Developing a worker
As mentioned in the README, a worker is the code that implements a single use case. For example, a worker might continuously
scale OpenSearch in and out.  The `DoWork` function is the code that actually does the work for one loop iteration, and
is called repeatedly by the `runner`.  `DoWork` does whatever it needs to do to perform the work, including blocking calls or
condition checks.
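
Conceptually, the contract between the `runner` and a worker looks something like the following simplified Go sketch.  This
is only an illustration of the shape of the contract; the real interface in [worker.go](./backend/spi/worker.go) has
additional methods and different signatures.

```
// Simplified, illustrative sketch of the worker contract; see
// backend/spi/worker.go for the authoritative interface.
package spi

type Worker interface {
	// PreconditionsMet returns true once the worker's dependencies
	// (e.g., the Verrazzano CR being ready) are satisfied.
	PreconditionsMet() (bool, error)

	// DoWork performs one loop iteration of work and is called
	// repeatedly by the runner. It may block or check conditions.
	DoWork() error
}
```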

### Worker Tips
Here is some important information to know about workers; much of it is repeated in the README.

1. Worker code runs in a backend pod.
2. The same backend pod has the code for all the workers, but only one worker is executing.
3. Workers can have multiple threads doing work (scale up).
4. Workers can have multiple replicas (scale out).
5. Workers are configured using environment variables.
6. Workers should only do one thing (e.g., query OpenSearch).
7. All workers should emit metrics.
8. Workers must wait for their dependencies before doing work (e.g., Verrazzano CR ready).
9. The worker `DoWork` function is called repeatedly in a loop by the `runner`.
10. Some workers must be run in an Istio enabled namespace (depending on what the worker does).
11. A worker might need additional Kubernetes resources to be created (e.g., AuthorizationPolicies).
12. Workers can be run as Kubernetes deployments or OAM apps (the default); this is specified at Helm install.
13. All workers run as cluster-admin.

### Worker Chart and Overrides
Workers are deployed using Helm. There is a single Helm chart for all workers, along with area-specific Helm subcharts.
Each worker specifies its value overrides in a YAML file, such as the environment variables needed to configure the
worker. If an area-specific subchart is needed, then it must be enabled in the override file.

The worker override YAML file is in `manifests/usecases/<area>/<worker>.yaml`.  The only required environment variable is
`PSR_WORKER_TYPE`. For example, see [usecases/opensearch/getlogs.yaml](./manifests/usecases/opensearch/getlogs.yaml):

```
global:
  envVars:
    PSR_WORKER_TYPE: ops-getlogs

# activate subchart
opensearch:
  enabled: true
```
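
At runtime, the backend consumes these overrides as plain environment variables.  The real workers declare their variables
through the osenv package (see [Backend code](#backend-code)), which handles defaults and required values; the snippet below
is only a minimal illustration of the idea using `os.Getenv` directly.

```
// Minimal illustration of reading worker configuration from env vars.
// The actual backend declares its env vars through the osenv package
// (./backend/osenv) rather than calling os.Getenv directly.
package main

import (
	"fmt"
	"os"
)

func main() {
	// PSR_WORKER_TYPE selects which worker the backend pod runs.
	workerType := os.Getenv("PSR_WORKER_TYPE")
	if workerType == "" {
		fmt.Fprintln(os.Stderr, "PSR_WORKER_TYPE is required")
		os.Exit(1)
	}
	fmt.Printf("starting worker %s\n", workerType)
}
```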

### Sample MySQL worker
To make this section easier to follow, we will describe creating a new MySQL worker that queries the MySQL database.
In general, when creating a worker, it is easiest to just copy an existing worker that does the same type of action (e.g., scale)
and modify it as needed for your component.  When it makes sense, common code should be factored out and reused by multiple workers.

### Creating a worker skeleton
Following are the first steps to implement a worker (a stubbed-out sketch follows the list):

1. Add a worker type named `WorkerTypeMysqlQuery = mysql-query` to [config.go](./backend/config/config.go).
2. Create a package named `mysql` in package [workers](./backend/workers).
3. Create a file `query.go` in the `mysql` package and do the following:
   1. Stub out the [worker interface](./backend/spi/worker.go) implementation in `query.go`.  You can copy the ops getlogs worker as a starting point.
   2. Change the const metrics prefix to `metricsPrefix = "mysql_query"`.
   3. Rename the `NewGetLogsWorker` function to `NewQueryWorker`.
   4. Change the `GetWorkerDesc` function to return information about the worker.
   5. Change the `DoWork` function to call `fmt.Println("hello mysql query worker")`.
4. Add your worker case to the `getWorker` function in [manager.go](./backend/workmanager/manager.go).
5. Add a directory named `mysql` to [usecases](./manifests/usecases).
6. Copy [usecases/opensearch/getlogs.yaml](./manifests/usecases/opensearch/getlogs.yaml) to a file named `usecases/mysql/query.yaml`.
7. Edit `query.yaml`:
   1. Change `PSR_WORKER_TYPE: ops-getlogs` to `PSR_WORKER_TYPE: mysql-query`.
   2. Remove the opensearch-authpol section.
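
Putting steps 1 through 4 together, the stubbed-out worker might look roughly like the sketch below.  The struct and
method signatures are simplified assumptions; copy the real getlogs worker for the exact interface that must be implemented.

```
// Illustrative stub of query.go; the real worker must implement the
// full interface in backend/spi/worker.go, whose signatures differ.
package mysql

import "fmt"

const metricsPrefix = "mysql_query"

type queryWorker struct{}

// NewQueryWorker creates the stubbed-out MySQL query worker instance.
func NewQueryWorker() (*queryWorker, error) {
	return &queryWorker{}, nil
}

// DoWork is called in a loop by the runner; the stub just prints.
func (w *queryWorker) DoWork() error {
	fmt.Println("hello mysql query worker")
	return nil
}
```

The new case in `getWorker` (step 4) then simply returns `mysql.NewQueryWorker()` for the `mysql-query` worker type.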

### Testing the worker skeleton
This section shows how to test the new worker in a Kind cluster.

1. Test the example worker first by building the image, loading it into the cluster, and running the example worker:
   1. Run `make run-example-k8s`.
   2. Take note of the image name:tag that is used with the `--set` override. For example, the output might show this:
      1. `helm upgrade --install psr manifests/charts/worker --set appType=k8s --set imageName=ghcr.io/verrazzano/psr-backend:local-4210a50`
2. Run `kubectl get pods` to see the example worker, and look at the pod logs to make sure it is logging.
3. Delete the example worker:
   1. `helm delete psr`
4. Run the mysql worker with the newly built image; an example image tag is shown below:
   1. `helm install psr manifests/charts/worker -f manifests/usecases/mysql/query.yaml --set appType=k8s --set imageName=ghcr.io/verrazzano/psr-backend:local-4210a50`
5. Look at the PSR mysql worker pod and make sure that it is logging `hello mysql query worker`.
6. Delete the mysql worker:
   1. `helm delete psr`

### Add worker specific charts
To function properly, certain workers need additional Kubernetes resources to be created.  Rather than having the worker create the
resources at runtime, you can use a subchart to create them. The subchart will be shared by all workers in an area.
Since the MySQL query worker needs to access MySQL directly within the cluster, it will need an Istio AuthorizationPolicy,
just like the OpenSearch workers do.  This section will show how to add the chart and use it in the use case YAML file.

1. Create a new subchart called `mysql`:
   1. Copy the opensearch chart from [manifests/charts/worker/charts/opensearch](./manifests/charts/worker/charts/opensearch) to
[manifests/charts/worker/charts/mysql](./manifests/charts/worker/charts/mysql).
   2. Create the authorizationpolicy.yaml file with the correct policy to access MySQL.
   3. Delete the existing opensearch policy YAML files.
2. Edit the [worker Chart.yaml](./manifests/charts/worker/Chart.yaml) file and add a dependency for the mysql chart:
```
dependencies:
  - name: mysql
    repository: file://../mysql
    version: 0.1.0
    condition: mysql.enabled
```
3. Edit the worker values.yaml file and add the following section:
```
# activate subchart
mysql:
  enabled: false
```
4. Edit [usecases/mysql/query.yaml](./manifests/usecases/mysql/query.yaml) and add the following section:
```
# activate subchart
mysql:
  enabled: true
```
5. You will need to install the chart in an Istio enabled namespace.
6. Test the chart in a Verrazzano installation using the same Helm command as previously, but also specify the namespace:
   1. `helm install psr manifests/charts/worker -n myns -f manifests/usecases/mysql/query.yaml --set appType=k8s --set imageName=ghcr.io/verrazzano/psr-backend:local-4210a50`

### Add metrics to worker
Worker metrics are very important because they let us track the progress and health of a worker.  Before implementing
the `DoWork` and `PreconditionsMet` functions, you should get metrics working.  The reason is that you will be able to
easily test your metrics by running your worker in an IDE, then opening your browser to http://localhost:9090/metrics.
Once you implement the real worker code (`DoWork`), you might need to run in an Istio enabled namespace and will need
to use Prometheus or Grafana to see the metrics.

The [runner](./backend/workmanager/runner.go) also emits metrics, such as loop count, so you don't need to emit the same metrics.

1. Modify the `workerMetrics` struct to add the metrics that the worker will emit (see the sketch after this list).
2. Modify the `NewQueryWorker` function to specify the metrics descriptors:
   1. Use a CounterValue metric if the value can never go down; otherwise, use GaugeValue or some other metric type.
   2. Don't specify the worker type prefix in the name field; that is automatically added to the metric name.
3. Modify the `GetMetricList` function to return the list of metrics.
4. Modify `DoWork` to update the metrics as work is done:
   1. You might have some metrics that you cannot implement until the full `DoWork` code is done.
   2. Metric access must be thread-safe; use the atomic package like the other workers do.
5. Test the worker using the Helm chart.
6. Access the Prometheus console and query the metrics.
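
As a rough sketch of steps 1 and 2 (the field names, metric name, and wiring below are simplified assumptions, not the
exact PSR types), thread-safe worker metrics might look like this:

```
// Illustrative sketch of thread-safe worker metrics. The real workers
// keep a workerMetrics struct plus metric descriptors, and update the
// values with the atomic package.
package mysql

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
)

type workerMetrics struct {
	queryCount int64 // total queries issued; a counter, so it never goes down
}

// Note that the metric name below omits the worker type prefix; the
// framework adds that prefix to the metric name automatically.
var queryCountDesc = prometheus.NewDesc(
	"query_count_total",
	"Total number of MySQL queries issued by the worker",
	nil, nil,
)

// incrementQueryCount uses the atomic package because the same worker
// instance is shared by all worker threads.
func (m *workerMetrics) incrementQueryCount() {
	atomic.AddInt64(&m.queryCount, 1)
}
```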

### Implement the remainder of the worker code
Implement the remaining worker code in `query.go`, specifically `PreconditionsMet` and `DoWork`.  Note that the query worker
doesn't need a Kubernetes client since it knows the MySQL service name. If your worker needs
to call the Kubernetes API server, then use the [k8sclient](./backend/pkg/k8sclient) package.  See how the
OpenSearch [scale](./backend/workers/opensearch/scale/scale.go) worker uses `k8sclient.NewPsrClient`.

1. Implement `NewQueryWorker` to create the worker instance.
2. Change the `GetEnvDescList` function to return the configuration environment variables that the worker needs:
   1. See the OpenSearch [scale](./backend/workers/opensearch/scale/scale.go) worker for an example.
3. Implement `DoWork` (a sketch follows below). This method should not log, but if it really needs to log, then use the throttled Verrazzano logging,
   such as `Progress` or `ErrorfThrottled`.
4. Test the worker using the Helm chart.

**NOTE** The same worker instance is shared across all worker threads.  There is currently no per-thread worker state.  Workers
that keep state, such as the scaling worker, normally only run in a single thread.
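
For illustration, here is a hedged sketch of what the finished `DoWork` might look like, assuming the metrics sketch above,
the standard database/sql package with the go-sql-driver/mysql driver, and a guessed in-cluster service name.  The real
worker's `DoWork` signature takes the PSR config and a Verrazzano logger.

```
// Illustrative sketch only: the DSN, service name, and signatures are
// assumptions, not the actual PSR implementation.
package mysql

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

// DoWork runs one query against the in-cluster MySQL service and
// records it in the worker metrics.
func (w *queryWorker) DoWork() error {
	// A real worker would build the DSN from env vars declared in
	// GetEnvDescList and hold the *sql.DB on the worker instance
	// instead of opening it on every iteration.
	db, err := sql.Open("mysql", "user:password@tcp(mysql.example.svc.cluster.local:3306)/mysql")
	if err != nil {
		return err
	}
	defer db.Close()

	var count int
	if err := db.QueryRow("SELECT COUNT(*) FROM information_schema.tables").Scan(&count); err != nil {
		return fmt.Errorf("mysql query failed: %v", err)
	}
	w.metrics.incrementQueryCount() // assumes queryWorker holds a workerMetrics field named metrics
	return nil
}
```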

## Creating a scenario
A scenario is a collection of use cases with a curated configuration that are run concurrently.  Typically,
you should restrict the scenario use cases to a single area, but that is not a strict requirement.  You can run multiple
scenarios concurrently, so creating a mixed-area scenario might not be necessary.  If you do decide to create a mixed-area scenario,
then create it in a directory called scenario/mixed.

### Scenario files
Scenarios are specified by a scenario.yaml file along with use case override files, one for each use case.
By convention, the files must be in the [scenarios](./manifests/scenarios) directory, structured as follows:
```
<area>/<scenario-name>/scenario.yaml
<area>/<scenario-name>/usecase-overrides/*
```
Use the long name for the area, e.g., opensearch instead of ops, or cert-manager instead of cm.
For example, the scenario to restart all OpenSearch tiers is in [restart-all-tiers](./manifests/scenarios/opensearch/restart-all-tiers).

### Scenario YAML file
The scenario YAML file describes the scenario, the use cases that comprise the scenario, and the use case overrides for the scenario.
Following is the OpenSearch [restart-all-tiers/scenario.yaml](./manifests/scenarios/opensearch/restart-all-tiers/scenario.yaml).
```
name: opensearch-restart-all-tiers
ID: ops-rat
description: |
  This is a scenario that restarts pods on all 3 OpenSearch tiers simultaneously
usecases:
  - usecasePath: opensearch/restart.yaml
    overrideFile: restart-master.yaml
    description: restarts master nodes
  - usecasePath: opensearch/restart.yaml
    overrideFile: restart-data.yaml
    description: restarts data nodes
  - usecasePath: opensearch/restart.yaml
    overrideFile: restart-ingest.yaml
    description: restarts ingest nodes
```
The top section has the scenario name, ID, and description.  The usecases section has all the use cases that will be run when
the scenario is started.  The `usecasePath` points to the built-in use case override file.  The `overrideFile` specifies the file
in the `usecase-overrides` directory which contains the scenario overrides for that specific use case.

Following is the [scenario override file](./manifests/scenarios/opensearch/restart-all-tiers/usecase-overrides/restart-data.yaml) for restarting the data tier.
```
global:
  envVars:
    PSR_WORKER_TYPE: ops-restart
    OPENSEARCH_TIER: data
    PSR_LOOP_SLEEP: 5s
```
This file specifies that the data tier should be restarted every 5 seconds.  The `PSR_WORKER_TYPE` is not really needed here, since it
is already in the use case file at [usecases/opensearch/restart.yaml](./manifests/usecases/opensearch/restart.yaml); however, it
doesn't hurt to have it for documentation purposes.

## Running a scenario
Scenarios are run by the PSR command line interface, `psrctl`.  The source code [manifests](./manifests) directory contains all
of the Helm charts, use case overrides, and scenario files.  These manifest files are built into the psrctl binary and accessed
internally at runtime, so the psrctl binary is self-contained and there is no need for the user to provide external files.  However,
you can override the scenario directory at runtime with the `-d` flag.  This allows you to modify and test scenarios without having
to rebuild `psrctl`.  See the `psrctl` help for details.

Scenarios always use OAM to deploy the workers; Kubernetes Deployments are not an option at this time.

### Specifying the backend image
If you build `psrctl` using make, the image tag is derived from the last commit ID.  If that image has not been uploaded to
ghcr.io, you will need to run `make docker-push`.  Since that image is private, you will need to provide a secret with the ghcr.io
credentials with `psrctl -p`.  If you want to override the image name, use `psrctl -w`.

If you want to use a local image and load it into a Kind cluster, then run `make kind-load-image` and specify that image
using `psrctl -w`.  This is the easiest way to develop and test a worker on Kind.

### Updating a running scenario
During development, you may want to update scenario override values while testing the scenario.  Currently, you
can update a running scenario with `psrctl -u`, but only the `usecase-overrides` files can be changed. If you need to change a chart
or scenario.yaml, then just restart the scenario.

### Sample psrctl commands
This section shows examples to run some built-in scenarios.

**Show the built-in scenarios**
```
psrctl explain
```

**Start the OpenSearch scenario ops-s2 in an Istio enabled namespace using a custom image**
```
kubectl create ns psrtest
kubectl label ns psrtest verrazzano-managed=true istio-injection=enabled
psrctl start -s ops-s2 -n psrtest -w ghcr.io/verrazzano/psr-backend:local-c2e911e
```

**Show the running scenarios in namespace psrtest, then across all namespaces**
```
psrctl list -n psrtest
psrctl list -A
```

**Update the running OpenSearch scenario ops-s2 with external scenario override files**
```
psrctl update -s ops-s2 -n psrtest -d ~/tmp/my-ops-s2
```

**Stop the running OpenSearch scenario ops-s2**
```
psrctl stop -s ops-s2 -n psrtest
```

## Source Code
The source code is organized into backend code, psrctl code, and manifest files.

### Backend code
The [backend](./backend) directory has the backend code, which consists of the following packages:
* config - configuration code
* metrics - metrics server for metrics generated by the workers
* osenv - package that allows workers to specify default and required env vars
* pkg - various packages needed by the workers
* spi - the worker interface
* workers - the various workers
* workmanager - the worker manager and runner

### Manifests
The [manifests/charts](./manifests/charts) directory has the Helm charts. There is a single worker chart for using
either OAM or plain Kubernetes resources to deploy the backend.  The default is OAM; to
deploy without OAM, use `--set appType=k8s`.  There is also a subchart for each area that requires one.

The [manifests/usecases](./manifests/usecases) directory has the Helm override files for every use
case. These files must contain the configuration, as key:value pairs, required by the worker.
The use cases are organized by area.

The [manifests/scenarios](./manifests/scenarios) directory has a directory for each scenario. Each scenario directory
contains the scenario.yaml file for the scenario, along with use case override files.
The scenarios are organized by area.

### Psrctl
The [psrctl](./psrctl) directory contains the command line interface along with support packages.  One
thing to note is that the [embed.go](./embed.go) file is needed by the psrctl code to access the built-in
manifests.  This file needs to be in the parent directory of psrctl.


## Summary
This document has all the information needed to create new workers and scenarios for any Verrazzano component.
When creating workers, it is easiest to use Helm to deploy and test the worker. Always start with a single worker
thread, then test with multiple threads and replicas.

When using Helm directly, you can deploy your worker as a Kubernetes Deployment or as an OAM application.
If you use a Deployment, you can start testing your stubbed-out worker without Verrazzano installed.
When you get further and need Verrazzano dependencies or want the worker metrics scraped, you can switch to OAM.
Your worker needs to run in the target cluster, or the worker metrics won't get scraped.

If your worker needs to access resources in the mesh, like OpenSearch, you will need to create AuthorizationPolicies
(via a subchart) and will need to deploy your worker to an Istio enabled namespace.  The existing OpenSearch
workers have this requirement, so you can use one of them as a starting point. Make sure you use Prometheus to
test your worker metrics.  Once the worker is running, you can create any custom scenario using YAML files as described earlier.
Finally, use the `psrctl` CLI to run and test your scenarios.