github.com/Financial-Times/publish-availability-monitor@v1.12.0/runbooks/runbook.md

<!--
    Written in the format prescribed by https://github.com/Financial-Times/runbook.md.
    Any future edits should abide by this format.
-->
# UPP - Publish Availability Monitor

For every piece of published content, the service polls the relevant endpoints to check whether the publish succeeded within the time limit set by the SLA. The resulting metrics are fed to Splunk.

## Code

publish-availability-monitor

## Primary URL

https://github.com/Financial-Times/publish-availability-monitor

## Service Tier

Platinum

## Lifecycle Stage

Production

## Host Platform

AWS

## Architecture

The Publish Availability Monitor listens for new content messages on the Kafka topic NativeCmsPublicationEvents and polls (or, in the case of Notifications Push, listens to) the following endpoints until the content appears or the content SLA times out:

* The Content Public Read /content/{uuid} endpoint.
* The Content Notifications /content/notifications endpoint.
* The Content Notifications Push /content/notifications-push endpoint.

* The Pages Public Read /__public-pages-api/pages/{uuid} endpoint.
* The Pages Notifications /__page-notifications-rw/pages/notifications endpoint.
* The Pages Notifications Push /pages/notifications-push endpoint.

* The Lists Public Read /__public-lists-api/lists/{uuid} endpoint.
* The Lists Notifications /__list-notifications-rw/lists/notifications endpoint.
* The Lists Notifications Push /lists/notifications-push endpoint.

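The poll-until-visible behaviour described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the service's actual implementation: `pollUntilVisible`, `checkFunc`, and `errSLAExceeded` are hypothetical names, and the check function stands in for an HTTP request to one of the endpoints listed above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// checkFunc reports whether the published content is visible yet.
// Hypothetical stand-in for a request to one of the read endpoints.
type checkFunc func() (bool, error)

// errSLAExceeded is returned when the SLA expires before the content appears.
var errSLAExceeded = errors.New("content did not appear within the SLA")

// pollUntilVisible runs check immediately and then once per interval,
// returning nil as soon as check reports true, or errSLAExceeded once
// the sla duration has elapsed.
func pollUntilVisible(check checkFunc, interval, sla time.Duration) error {
	deadline := time.NewTimer(sla)
	defer deadline.Stop()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		// A failed request is treated as "not visible yet" and retried.
		if ok, err := check(); err == nil && ok {
			return nil
		}
		select {
		case <-deadline.C:
			return errSLAExceeded
		case <-ticker.C:
		}
	}
}

func main() {
	attempts := 0
	// Hypothetical check that succeeds on the third attempt.
	check := func() (bool, error) {
		attempts++
		return attempts >= 3, nil
	}
	err := pollUntilVisible(check, 10*time.Millisecond, time.Second)
	fmt.Println(err, attempts) // prints "<nil> 3"
}
```

In a real monitor the interval and SLA would come from configuration per content type; the values in `main` are chosen only to keep the example fast.
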
If two of the last ten pieces of content fail any of these checks, the PAM healthcheck switches to RED, which in turn causes the cluster healthchecks and good-to-go endpoints to show RED.

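The two-in-ten rule can be pictured as a fixed-size window over recent publish outcomes. The sketch below is one possible way to implement such a rule, not a copy of the PAM code: `publishHistory`, `windowSize`, and `failureThreshold` are hypothetical names, and it assumes "two" means "two or more".

```go
package main

import "fmt"

const (
	windowSize       = 10 // number of most recent publishes considered
	failureThreshold = 2  // this many failures in the window means RED (assumed "or more")
)

// publishHistory is a ring buffer holding the outcome of recent publishes.
type publishHistory struct {
	failed [windowSize]bool
	next   int // index that the next outcome will overwrite
	count  int // number of outcomes recorded so far, capped at windowSize
}

// record stores one publish outcome, evicting the oldest when the window is full.
func (h *publishHistory) record(failed bool) {
	h.failed[h.next] = failed
	h.next = (h.next + 1) % windowSize
	if h.count < windowSize {
		h.count++
	}
}

// healthy reports whether fewer than failureThreshold of the recorded
// outcomes in the window are failures.
func (h *publishHistory) healthy() bool {
	failures := 0
	for i := 0; i < h.count; i++ {
		if h.failed[i] {
			failures++
		}
	}
	return failures < failureThreshold
}

func main() {
	h := &publishHistory{}
	for i := 0; i < 9; i++ {
		h.record(false)
	}
	h.record(true)
	fmt.Println(h.healthy()) // true: one failure in the last ten
	h.record(true)
	fmt.Println(h.healthy()) // false: two failures in the last ten
}
```

Because old outcomes are evicted as new ones arrive, the check recovers on its own once enough successful publishes push the failures out of the window.
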
## Contains Personal Data

No

## Contains Sensitive Data

No

<!-- Placeholder - remove HTML comment markers to activate
## Can Download Personal Data
Choose Yes or No

...or delete this placeholder if not applicable to this system
-->

<!-- Placeholder - remove HTML comment markers to activate
## Can Contact Individuals
Choose Yes or No

...or delete this placeholder if not applicable to this system
-->

## Failover Architecture Type

ActiveActive

## Failover Process Type

PartiallyAutomated

## Failback Process Type

PartiallyAutomated

## Failover Details

The service is deployed in both Publishing clusters. The failover guide for the cluster is in the [upp-docs failover guides](https://github.com/Financial-Times/upp-docs/tree/master/failover-guides/).

## Data Recovery Process Type

NotApplicable

## Data Recovery Details

The service does not store data, so it does not require any data recovery steps.

## Release Process Type

PartiallyAutomated

## Rollback Process Type

Manual

## Release Details

A release is triggered by creating a GitHub release, which is then picked up by a Jenkins multibranch pipeline. The Jenkins pipeline has to be started manually for it to deploy the Helm package to the Kubernetes clusters.

<!-- Placeholder - remove HTML comment markers to activate
## Heroku Pipeline Name
Enter descriptive text satisfying the following:
This is the name of the Heroku pipeline for this system. If you don't have a pipeline, this is the name of the app in Heroku. A pipeline is a group of Heroku apps that share the same codebase where each app in a pipeline represents the different stages in a continuous delivery workflow, i.e. staging, production.

...or delete this placeholder if not applicable to this system
-->

## Key Management Process Type

NotApplicable

## Key Management Details

There is no key rotation procedure for this system.

## Monitoring

Pod health:

*   <https://upp-prod-publish-eu.ft.com/__health/__pods-health?service-name=publish-availability-monitor>
*   <https://upp-prod-publish-us.ft.com/__health/__pods-health?service-name=publish-availability-monitor>

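Both pod-health links follow the same pattern, so they can be generated per cluster. A small sketch, assuming only the host names, path, and `service-name` query parameter shown above; `podHealthURL` is a hypothetical helper, not part of the service.

```go
package main

import (
	"fmt"
	"net/url"
)

// podHealthURL builds the pod-health link for a service on a given cluster host.
func podHealthURL(clusterHost, service string) string {
	u := url.URL{
		Scheme:   "https",
		Host:     clusterHost,
		Path:     "/__health/__pods-health",
		RawQuery: url.Values{"service-name": {service}}.Encode(),
	}
	return u.String()
}

func main() {
	hosts := []string{"upp-prod-publish-eu.ft.com", "upp-prod-publish-us.ft.com"}
	for _, host := range hosts {
		fmt.Println(podHealthURL(host, "publish-availability-monitor"))
	}
}
```
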
## First Line Troubleshooting

Please DO NOT restart the Publish Availability Monitor until Second Line support has taken a look. Most errors are not caused by the service itself, and it holds useful debugging information in memory that a restart would lose. If the application is running fine but the healthcheck is still alerting, there may be issues with the Kafka Proxy or with Kafka itself, so check whether these systems are up.

*   [Publish availability monitor panic guide](https://sites.google.com/a/ft.com/universal-publishing/ops-guides/publish-availability-monitor-panic-guide)
*   [First Line Troubleshooting guide](https://github.com/Financial-Times/upp-docs/tree/master/guides/ops/first-line-troubleshooting)

## Second Line Troubleshooting

Please refer to the GitHub repository README for troubleshooting information.