# Greenhouse Playbook

This is the playbook for Greenhouse. See also [the playbook index][playbooks].

TL;DR: Greenhouse is a [bazel] [remote build cache][remote-build-cache].

The [OWNERS][OWNERS] are a point of contact for more info.

For in-depth details about the project, see the [README][README].

## General Debugging

Greenhouse runs as a Kubernetes deployment.

For the [Kubernetes Project's Prow Deployment][prow-k8s-io] the exact spec is in
[deployment.yaml], and the deployment is in the "build cluster".

### Logs

First configure your local environment to point to the cluster hosting
greenhouse. <!--TODO: link to prow info for doing this on our deployment-->

The greenhouse pods should have the label `app=greenhouse`; you can view
the logs with `kubectl logs -l=app=greenhouse`.
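
For more context, a minimal sketch, assuming your kubeconfig already points at
the build cluster:

```sh
# Tail recent logs from all greenhouse pods and follow new output.
kubectl logs -l=app=greenhouse --tail=100 -f

# Or only look at the last hour of logs.
kubectl logs -l=app=greenhouse --since=1h
```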

The logs may also be stored in [Stackdriver] / the host cluster's logging
integration(s).

### Monitoring

Prometheus metrics are available for scraping at `/metrics`.
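
To eyeball the metrics locally, a minimal sketch, assuming the deployment is
named `greenhouse` (`<metrics-port>` is a placeholder; look up the real port
in [deployment.yaml]):

```sh
# Forward a local port to the greenhouse deployment.
kubectl port-forward deployment/greenhouse 8080:<metrics-port>

# In another terminal, fetch the metrics.
curl http://localhost:8080/metrics
```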

Note that periodically freeing disk space is expected, and will be highly
variable with the load from our build workloads.

Writing half a terabyte of cache entries and reaching the eviction threshold
in just 3 to 4 hours is not unusual under load (as of September 2019).

## Options

The following well-known options are available for dealing with greenhouse
service issues.

### Rolling Back

If you are running greenhouse without using the config in this repo
(and you likely are if you are not looking at prow.k8s.io ...) you will need
to roll back using whatever deployment mechanism that deployment uses.

For prow.k8s.io, if you think that Greenhouse is broken in some way, the
easiest way to roll it back is to check out this repo at a previous commit
and deploy it from that commit.

Deployment details are covered in the [README].
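
A minimal sketch of that rollback, assuming your deployment is managed with a
plain `kubectl apply` of the checked-in spec (`<known-good-commit>` is a
placeholder):

```sh
# Check out this repo at a commit where greenhouse was known to work.
git checkout <known-good-commit>

# Re-apply the greenhouse spec from that commit to the build cluster.
kubectl apply -f greenhouse/deployment.yaml
```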

### Cutting Access To The Cache

Cache users must explicitly configure bazel to use the cache and will fall
back to non-cached builds if the cache cannot be reached.

To force falling back, you can simply delete the `bazel-cache` service:

`kubectl delete service bazel-cache`

Once whatever issue necessitated this is resolved, you should reinstate the
service, which is defined in [service.yaml].
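
Reinstating it is a one-liner, assuming your checkout's [service.yaml] matches
what was deleted:

```sh
# Recreate the bazel-cache service from the checked-in spec.
kubectl apply -f greenhouse/service.yaml
```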

### Wiping The Cache Contents

Firstly: you should not do this! This is only necessary if there is a bad
bug in bazel related to caching for some reason.

If this does become the case and you are confident, you can do this fairly
trivially. However, it will be mildly disruptive, since it deletes files
that may currently be served.

You should only do this if you really think that somehow bazel
has bad state in the cache.

You should also consider removing the greenhouse service instead; jobs
will fall back to non-cached builds if the cache cannot be reached.

If you do decide to do this, this is how (a condensed sketch follows the list):

- Find the pod name with `kubectl get po -l=app=greenhouse`
- Obtain a shell in the pod with `kubectl exec -it <greenhouse-pod-name> -- /bin/sh`
- The data directory should be at `/data` for our deployment.
  - Verify this by inspecting `kubectl describe po -l=app=greenhouse`
- Once you are sure that you know where the data is stored, you can simply run
`cd /data && rm -rf ./*` from the `kubectl exec` shell you created above.
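
A condensed sketch of the above (`<greenhouse-pod-name>` is a placeholder for
the name found in the first step):

```sh
# Find the greenhouse pod.
kubectl get po -l=app=greenhouse

# Open a shell in it, substituting the real pod name.
kubectl exec -it <greenhouse-pod-name> -- /bin/sh

# Inside the pod: confirm this is the data directory, then wipe it.
cd /data && rm -rf ./*
```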

## Known Issues

Greenhouse has a relatively clean track record for approximately two years now.

I've probably just jinxed it.

There is, however, at least one known issue with Bazel caching in general that
may affect Greenhouse users at some point.

### Host Tool Tracking Is Limited

Bazel does not properly track toolchains on the host (like C++ compilers).

This issue may occur with bazel's machine-local cache (`$HOME/.cache/bazel/...`)
on your development machine.

To avoid this problem with our cache, we ask that users of greenhouse use some
additional tooling we built to ensure that the cache is used in a way that
includes a key hashed from the known host toolchains in use.

You can read more about how we do this in the [README], and the upstream issue
[bazel#4558].
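
To illustrate the idea only (this is not our actual tooling; the [README] has
the real details), a hypothetical sketch that namespaces cache entries under a
digest of the host toolchains:

```sh
# HYPOTHETICAL sketch: key the cache URL on a digest of the host C/C++
# toolchains so machines with different toolchains use disjoint cache entries.
# The cache host and port below are placeholders; see the README.
key="$(cat /usr/bin/gcc /usr/bin/g++ | sha256sum | cut -d' ' -f1)"
echo "build --remote_http_cache=http://bazel-cache.default:8080/${key}" >> ~/.bazelrc
```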

It is possible that in the future one of our cache users will depend on some
"host" toolchain that bazel does not track as an input, causing issues when
versions of the tool are switched and produce incompatible outputs.

This may be difficult to diagnose if you are not familiar with Bazel's output.
Consider asking for help from someone familiar with Bazel if you suspect this
issue.

<!--URLS-->
[OWNERS]: /greenhouse/OWNERS
[README]: /greenhouse/README.md
[playbooks]: /docs/playbooks/README.md
<!--Additional URLS-->
[bazel]: https://bazel.build/
[remote-build-cache]: https://docs.bazel.build/versions/master/remote-caching.html
[deployment.yaml]: /greenhouse/deployment.yaml
[service.yaml]: /greenhouse/service.yaml
[prow-k8s-io]: https://prow.k8s.io
[bazel#4558]: https://github.com/bazelbuild/bazel/issues/4558
[velodrome]: http://velodrome.k8s.io/dashboard/db/bazel-cache?refresh=1m&orgId=1
[Stackdriver]: https://cloud.google.com/stackdriver/