# Greenhouse Playbook

This is the playbook for Greenhouse. See also [the playbook index][playbooks].

TL;DR: Greenhouse is a [bazel] [remote build cache][remote-build-cache].

The [OWNERS][OWNERS] are a point of contact for more info.

For in-depth details about the project, see the [README][README].

## General Debugging

Greenhouse runs as a Kubernetes deployment.

For the [Kubernetes Project's Prow Deployment][prow-k8s-io] the exact spec is in
[deployment.yaml], and the deployment is in the "build cluster".

### Logs

First configure your local environment to point to the cluster hosting
greenhouse. <!--TODO: link to prow info for doing this on our deployment-->

The greenhouse pods should have the label `app=greenhouse`; you can view
the logs with `kubectl logs -l=app=greenhouse`.
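
For more context, a minimal sketch, assuming your kubeconfig already points at
the build cluster:

```sh
# Tail recent logs from all greenhouse pods and follow new output.
kubectl logs -l=app=greenhouse --tail=100 -f

# Or only look at the last hour of logs.
kubectl logs -l=app=greenhouse --since=1h
```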

The logs may also be stored in [Stackdriver] / the host cluster's logging
integration(s).

### Monitoring

Prometheus metrics are available for scraping at `/metrics`.
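
To eyeball the metrics locally, a minimal sketch, assuming the deployment is
named `greenhouse` (`<metrics-port>` is a placeholder; look up the real port
in [deployment.yaml]):

```sh
# Forward a local port to the greenhouse deployment.
kubectl port-forward deployment/greenhouse 8080:<metrics-port>

# In another terminal, fetch the metrics.
curl http://localhost:8080/metrics
```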

Note that periodically freeing disk space is expected, and will be highly
variable with the load from our build workloads.

Writing half a terabyte of cache entries and reaching the eviction threshold
in just 3 to 4 hours is not unusual under load (as of September 2019).

## Options

The following well-known options are available for dealing with greenhouse
service issues.

### Rolling Back

If you are running greenhouse without using the config in this repo
(and you likely are if you are not looking at prow.k8s.io ...) you will need
to roll back using whatever deployment mechanism that deployment uses.

For prow.k8s.io, if you think that Greenhouse is broken in some way, the
easiest way to roll it back is to check out this repo at a previous commit
and deploy it from that commit.

Deployment details are covered in the [README].
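
A minimal sketch of that rollback, assuming your deployment is managed with a
plain `kubectl apply` of the checked-in spec (`<known-good-commit>` is a
placeholder):

```sh
# Check out this repo at a commit where greenhouse was known to work.
git checkout <known-good-commit>

# Re-apply the greenhouse spec from that commit to the build cluster.
kubectl apply -f greenhouse/deployment.yaml
```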

### Cutting Access To The Cache

Cache users must explicitly configure bazel to use the cache and will fall
back to non-cached builds if the cache cannot be reached.

To force falling back, you can simply delete the `bazel-cache` service:

`kubectl delete service bazel-cache`

Once whatever issue necessitated this is resolved, you should reinstate the
service, which is defined in [service.yaml].
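
Reinstating it is a one-liner, assuming your checkout's [service.yaml] matches
what was deleted:

```sh
# Recreate the bazel-cache service from the checked-in spec.
kubectl apply -f greenhouse/service.yaml
```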

### Wiping The Cache Contents

Firstly: you should not do this! This is only necessary if there is a bad
bug in bazel related to caching for some reason.

If this does become the case and you are confident, you can do this fairly
trivially. However, it will be mildly disruptive, since it deletes files
that may currently be served.

You should only do this if you really think that somehow bazel
has bad state in the cache.

You should also consider removing the greenhouse service instead; jobs
will fall back to non-cached builds if the cache cannot be reached.

If you do decide to do this, this is how (a condensed sketch follows the list):

- Find the pod name with `kubectl get po -l=app=greenhouse`
- Obtain a shell in the pod with `kubectl exec -it <greenhouse-pod-name> -- /bin/sh`
- The data directory should be at `/data` for our deployment.
  - Verify this by inspecting `kubectl describe po -l=app=greenhouse`
- Once you are sure that you know where the data is stored, you can simply run
`cd /data && rm -rf ./*` from the `kubectl exec` shell you created above.
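
A condensed sketch of the above (`<greenhouse-pod-name>` is a placeholder for
the name found in the first step):

```sh
# Find the greenhouse pod.
kubectl get po -l=app=greenhouse

# Open a shell in it, substituting the real pod name.
kubectl exec -it <greenhouse-pod-name> -- /bin/sh

# Inside the pod: confirm this is the data directory, then wipe it.
cd /data && rm -rf ./*
```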

## Known Issues

Greenhouse has a relatively clean track record for approximately two years now.

I've probably just jinxed it.

There is, however, at least one known issue with Bazel caching in general that
may affect Greenhouse users at some point.

### Host Tool Tracking Is Limited

Bazel does not properly track toolchains on the host (like C++ compilers).

This issue may occur with bazel's machine-local cache (`$HOME/.cache/bazel/...`)
on your development machine.

To avoid this problem with our cache, we ask that users of greenhouse use some
additional tooling we built to ensure that the cache is used in a way that
includes a key hashed from the known host toolchains in use.

You can read more about how we do this in the [README], and the upstream issue
[bazel#4558].
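
To illustrate the idea only (this is not our actual tooling; the [README] has
the real details), a hypothetical sketch that namespaces cache entries under a
digest of the host toolchains:

```sh
# HYPOTHETICAL sketch: key the cache URL on a digest of the host C/C++
# toolchains so machines with different toolchains use disjoint cache entries.
# The cache host and port below are placeholders; see the README.
key="$(cat /usr/bin/gcc /usr/bin/g++ | sha256sum | cut -d' ' -f1)"
echo "build --remote_http_cache=http://bazel-cache.default:8080/${key}" >> ~/.bazelrc
```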

It is possible that in the future one of our cache users will depend on some
"host" toolchain that bazel does not track as an input, causing issues when
versions of the tool are switched and produce incompatible outputs.

This may be difficult to diagnose if you are not familiar with Bazel's output.
Consider asking for help from someone familiar with Bazel if you suspect this
issue.

<!--URLS-->
[OWNERS]: /greenhouse/OWNERS
[README]: /greenhouse/README.md
[playbooks]: /docs/playbooks/README.md
<!--Additional URLS-->
[bazel]: https://bazel.build/
[remote-build-cache]: https://docs.bazel.build/versions/master/remote-caching.html
[deployment.yaml]: /greenhouse/deployment.yaml
[service.yaml]: /greenhouse/service.yaml
[prow-k8s-io]: https://prow.k8s.io
[bazel#4558]: https://github.com/bazelbuild/bazel/issues/4558
[velodrome]: http://velodrome.k8s.io/dashboard/db/bazel-cache?refresh=1m&orgId=1
[Stackdriver]: https://cloud.google.com/stackdriver/