<!--
    Written in the format prescribed by https://github.com/Financial-Times/runbook.md.
    Any future edits should abide by this format.
-->
# UPP - Publish Availability Monitor

For every piece of published content, polls the relevant endpoints to ascertain whether the publish was successful within the time limit set by the SLA. The resulting metrics are fed to Splunk.

## Code

publish-availability-monitor

## Primary URL

https://github.com/Financial-Times/publish-availability-monitor

## Service Tier

Platinum

## Lifecycle Stage

Production

## Host Platform

AWS

## Architecture

The Publish Availability Monitor listens for new content messages on the Kafka topic NativeCmsPublicationEvents, then polls (or, in the case of Notifications Push, listens to) the following endpoints until the content appears or the content SLA times out:

* The Content Public Read /content/{uuid} endpoint.
* The Content Notifications /content/notifications endpoint.
* The Content Notifications Push /content/notifications-push endpoint.

* The Pages Public Read /__public-pages-api/pages/{uuid} endpoint.
* The Pages Notifications /__page-notifications-rw/pages/notifications endpoint.
* The Pages Notification Push /pages/notifications-push endpoint.

* The Lists Public Read /__public-lists-api/lists/{uuid} endpoint.
* The Lists Notifications /__list-notifications-rw/lists/notifications endpoint.
* The Lists Notification Push /lists/notifications-push endpoint.

If two of the last ten pieces of content failed any of these checks, then the PAM healthcheck will
switch to RED and therefore cause the cluster healthchecks and good-to-go
endpoints to show RED.
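
The "two of the last ten" rule above can be pictured as a sliding window over recent check results. The following is a minimal sketch only, not the service's actual implementation: the names `resultWindow`, `record`, `failures` and `healthy` are hypothetical, and it assumes that "two" means "two or more" failures within the window.

```go
package main

import "fmt"

// resultWindow keeps the outcomes of the most recent publish checks.
// Hypothetical type for illustration; not taken from the service's code.
type resultWindow struct {
	size    int    // number of recent results to keep (ten in this runbook)
	results []bool // true means the check failed; oldest result first
}

func newResultWindow(size int) *resultWindow {
	return &resultWindow{size: size}
}

// record appends one outcome and drops the oldest once the window is full.
func (w *resultWindow) record(failed bool) {
	w.results = append(w.results, failed)
	if len(w.results) > w.size {
		w.results = w.results[1:]
	}
}

// failures counts the failed checks currently inside the window.
func (w *resultWindow) failures() int {
	n := 0
	for _, failed := range w.results {
		if failed {
			n++
		}
	}
	return n
}

// healthy reports false once the failure count reaches the threshold.
// Assumption: the runbook's "two of the last ten" means two or more.
func (w *resultWindow) healthy(threshold int) bool {
	return w.failures() < threshold
}

func main() {
	w := newResultWindow(10)
	// Ten publishes, of which the fourth and eighth fail their checks.
	for i := 0; i < 10; i++ {
		w.record(i == 3 || i == 7)
	}
	fmt.Println("failures:", w.failures(), "healthy:", w.healthy(2))
}
```

Under this model, a RED healthcheck clears on its own once enough successful publishes have pushed the failed ones out of the window.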

## Contains Personal Data

No

## Contains Sensitive Data

No

<!-- Placeholder - remove HTML comment markers to activate
## Can Download Personal Data
Choose Yes or No

...or delete this placeholder if not applicable to this system
-->

<!-- Placeholder - remove HTML comment markers to activate
## Can Contact Individuals
Choose Yes or No

...or delete this placeholder if not applicable to this system
-->

## Failover Architecture Type

ActiveActive

## Failover Process Type

PartiallyAutomated

## Failback Process Type

PartiallyAutomated

## Failover Details

The service is deployed in both Publishing clusters. The failover guide for the cluster is located [in the upp-docs failover guides](https://github.com/Financial-Times/upp-docs/tree/master/failover-guides/).

## Data Recovery Process Type

NotApplicable

## Data Recovery Details

The service does not store data, so it does not require any data recovery steps.

## Release Process Type

PartiallyAutomated

## Rollback Process Type

Manual

## Release Details

A release is triggered by creating a GitHub release, which is then picked up by a Jenkins multibranch pipeline. The Jenkins pipeline must be started manually in order for it to deploy the Helm package to the Kubernetes clusters.

<!-- Placeholder - remove HTML comment markers to activate
## Heroku Pipeline Name
Enter descriptive text satisfying the following:
This is the name of the Heroku pipeline for this system. If you don't have a pipeline, this is the name of the app in Heroku. A pipeline is a group of Heroku apps that share the same codebase where each app in a pipeline represents the different stages in a continuous delivery workflow, i.e. staging, production.

...or delete this placeholder if not applicable to this system
-->

## Key Management Process Type

NotApplicable

## Key Management Details

There is no key rotation procedure for this system.

## Monitoring

Pod health:

* <https://upp-prod-publish-eu.ft.com/__health/__pods-health?service-name=publish-availability-monitor>
* <https://upp-prod-publish-us.ft.com/__health/__pods-health?service-name=publish-availability-monitor>

## First Line Troubleshooting

Please DO NOT restart the Publish Availability Monitor until Second Line support have taken a look. Most errors are not caused by the service itself, and it holds useful debugging information in memory. If the application is running fine but the healthcheck is still alerting, there may be issues with the Kafka Proxy or Kafka itself. Check that these systems are up.

* [Publish availability monitor panic guide](https://sites.google.com/a/ft.com/universal-publishing/ops-guides/publish-availability-monitor-panic-guide)
* [First Line Troubleshooting guide](https://github.com/Financial-Times/upp-docs/tree/master/guides/ops/first-line-troubleshooting)

## Second Line Troubleshooting

Please refer to the GitHub repository README for troubleshooting information.