# Greenhouse Playbook

This is the playbook for Greenhouse. See also [the playbook index][playbooks].

TL;DR: Greenhouse is a [bazel] [remote build cache][remote-build-cache].

The [OWNERS][OWNERS] are a potential point of contact for more info.

For in-depth details about the project see the [README][README].

## General Debugging

Greenhouse runs as a Kubernetes deployment.

For the [Kubernetes Project's Prow Deployment][prow-k8s-io] the exact spec is in
[deployment.yaml], and the deployment is in the "build cluster".

### Logs

First configure your local environment to point to the cluster hosting
greenhouse. <!--TODO: link to prow info for doing this on our deployment-->

The greenhouse pods should have the label `app=greenhouse`; you can view
the logs with `kubectl logs -l=app=greenhouse`.

The logs may also be stored in [Stackdriver] / the host cluster's logging
integration(s).

### Monitoring

Prometheus metrics are available for scraping at `/metrics`.

Note that periodically freeing disk space is expected, and will be highly
variable with the load from our build workloads.

Writing half a terabyte of cache entries and reaching the eviction threshold
in just 3 to 4 hours is not unusual under load as of 9/10/2019.

## Options

The following well-known options are available for dealing with greenhouse
service issues.

### Rolling Back

If you are running greenhouse without using the config in this repo
(and you likely are if you are not looking at prow.k8s.io ...) you will need
to roll back using the specific deployment mechanism used in that deployment.

For prow.k8s.io, if you think that Greenhouse is broken in some way, the
easiest way to roll it back is to check out this repo at a previous commit and
deploy it from that commit.

Deployment details are covered in the [README].

### Cutting Access To The Cache

Cache users must explicitly configure bazel to use the cache and will fall
back to non-cached builds if the cache cannot be reached.

To force falling back, you can simply delete the `bazel-cache` service:

`kubectl delete service bazel-cache`

Once we've resolved whatever issue necessitated this, you should reinstate the
service, which is defined in [service.yaml].

### Wiping The Cache Contents

Firstly: you should not do this! This is only necessary if there is a bad
bug in bazel related to caching for some reason.

If this does become the case and you are confident, you can do this fairly
trivially. However, this will be mildly disruptive because it deletes files
that may currently be being served.

You should only do this if you really think that somehow bazel
has bad state in the cache.

You should also consider removing the greenhouse service instead; jobs
will fall back to non-cached builds if the cache cannot be reached.

If you do decide to do this, this is how (a consolidated sketch follows the
list):

- Find the pod name with `kubectl get po -l=app=greenhouse`
- Obtain a shell in the pod with `kubectl exec -it <greenhouse-pod-name> /bin/sh`
- The data directory should be at `/data` for our deployment.
  - Verify this by inspecting `kubectl describe po -l=app=greenhouse`
- Once you are sure that you know where the data is stored, you can simply run
  `cd /data && rm -rf ./*` from the `kubectl exec` shell you created above.
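
The steps above can also be run as a single sequence from your workstation.
This is a minimal sketch, assuming a single greenhouse pod and the `/data`
mount point used by our deployment:

```sh
# Find the greenhouse pod (assumes exactly one pod carries this label).
POD="$(kubectl get po -l=app=greenhouse -o name | head -n1)"

# Double-check where the cache data is actually mounted before deleting anything.
kubectl describe po -l=app=greenhouse | grep -i -A 2 data

# Wipe the cache contents. This is mildly disruptive: files that are currently
# being served may be deleted out from under their requests.
kubectl exec -it "${POD}" -- /bin/sh -c 'cd /data && rm -rf ./*'
```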
## Known Issues

Greenhouse has had a relatively clean track record for approximately two years
now.

I've probably just jinxed it.

There is, however, at least one known issue with Bazel caching in general that
may affect Greenhouse users at some point.

### Host Tool Tracking Is Limited

Bazel does not properly track toolchains on the host (like C++ compilers).

This issue may occur with bazel's machine-local cache (`$HOME/.cache/bazel/...`)
on your development machine.

To avoid this problem with our cache, we ask that users of greenhouse use some
additional tooling we built to ensure that the cache is used in a way that
includes a key hashed from the known host toolchains in use. A rough sketch of
this idea is included at the end of this playbook.

You can read more about how we do this in the [README], and the upstream issue
[bazel#4558].

It is possible that in the future one of our cache users will depend on some
"host" toolchain that bazel does not track as an input, causing issues when
versions of the tool are switched and produce incompatible outputs.

This may be difficult to diagnose if you are not familiar with Bazel's output.
Consider asking for help from someone familiar with Bazel if you suspect this
issue.

<!--URLS-->
[OWNERS]: /greenhouse/OWNERS
[README]: /greenhouse/README.md
[playbooks]: /docs/playbooks/README.md
<!--Additional URLS-->
[bazel]: https://bazel.build/
[remote-build-cache]: https://docs.bazel.build/versions/master/remote-caching.html
[deployment.yaml]: /greenhouse/deployment.yaml
[service.yaml]: /greenhouse/service.yaml
[prow-k8s-io]: https://prow.k8s.io
[bazel#4558]: https://github.com/bazelbuild/bazel/issues/4558
[velodrome]: http://velodrome.k8s.io/dashboard/db/bazel-cache?refresh=1m&orgId=1
[Stackdriver]: https://cloud.google.com/stackdriver/
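
The cache-key idea mentioned under "Host Tool Tracking Is Limited" can be
illustrated roughly as follows. This is only a sketch of the concept, not our
actual tooling (see the [README] for that); the service address
(`bazel-cache.default:8080`), the URL layout, and the set of host tools hashed
here are assumptions for illustration:

```sh
# Hash the version output of host tools that bazel does not track as inputs,
# so that switching host toolchains changes the remote cache namespace.
TOOLCHAIN_HASH="$( (gcc --version; g++ --version; python --version 2>&1) \
  | sha256sum | cut -d' ' -f1)"

# Point bazel at a cache URL that includes the hash in its path; builds made
# with different host toolchains then never share cache entries.
cat >> "${HOME}/.bazelrc" <<EOF
build --remote_http_cache=http://bazel-cache.default:8080/cache,${TOOLCHAIN_HASH}
EOF
```

Newer bazel releases spell the flag `--remote_cache` rather than
`--remote_http_cache`.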