github.com/lukasheimann/cloudfoundrycli@v7.1.0+incompatible/doc/adr/0009-log-cache-clock-issues.md (about)

     1  # 9. Log Cache and Clock Issues
     2  
     3  Date: 2020-04-08
     4  
     5  ## Status
     6  
     7  Accepted
     8  
     9  ## Context
    10  
    11  _The issue motivating this decision, and any context that influences or constrains the decision._
    12  
    13  The CF CLI previously used the Traffic Controller to retrieve application logs.
    14  This was implemented as a long-lived WebSocket connection where logs were
    15  streamed to the client.
    16  
    17  With the migration to Log Cache we needed to preserve the experience of
    18  streaming logs, but built [on an
    19  API](https://github.com/cloudfoundry/log-cache-release/blob/f08a3081c029d133300b1d6cb5ea8ebbd2108874/src/README.md)
    20  where timestamped log envelopes are retrieved via HTTP requests.
    21  
    22  There were two main questions we needed to answer:
    23  
    24  1. What is the oldest log that should be shown?
    25     An application may have logs from a previous push already present in Log Cache
    26     that do not relate to the current staging operation. We would want to only
    27     show the logs for the current staging operation.
    28  2. What is the newest log that can be shown?
    29     Log Cache is eventually consistent, so very recent logs may be out of order
    30     or incomplete. We would only want to show logs once they have "settled" (i.e.
    31     logs that have been in Log Cache long enough to converge).
    32  
    33  We also needed to answer a question regarding `cf logs --recent`:
    34  
    35  3. Do we want to define a range of "settled" logs to show, or do we want to show everything currently in the Log Cache?
    36  
    37  ## Decision
    38  
    39  _The change that we're proposing or have agreed to implement._
    40  
    41  1. To determine the oldest log that will be shown: we will 'peek' at the latest log within
    42     Log Cache for the application and read from that point.
    43  2. To determine the newest log that will be shown: we will not show log envelopes where the
    44     timestamp is less than two seconds old.
    45  3. `cf logs --recent` will always show all logs currently in the Log Cache, regardless of whether they have "settled".
    46  
    47  ### 1. Peeking at the latest log
    48  
    49  The initial CLI implementation of Log Cache started reading from Log Cache at an
    50  offset based on the current client clock time. This was flawed because an
    51  incorrectly configured client clock would result in some unexpected behaviour:
    52  
    53  * If the client clock was ahead of the server then either the logs would not be
    54    shown at all, or there would be a lengthy delay before some of the logs would
    55    be shown
    56  * If the client clock was behind the server then logs relating to a previous
    57    operation might be shown
    58  
    59  In an attempt to decouple ourselves from the client clock time we instead 'peek'
    60  at the timestamp of the latest log envelope for the application and use that
    61  timestamp as our starting point.
    62  
    63  If there are no envelopes present for the application we will continue to retry
    64  until envelopes become available.
    65  
    66  ### 2. Delaying the output of new logs
    67  
    68  By default, the Log Cache client will only return log envelopes that have timestamps
    69  more than a second old (see the [code](https://github.com/cloudfoundry/log-cache-release/blob/f08a3081c029d133300b1d6cb5ea8ebbd2108874/src/pkg/client/walk.go#L174-L183)).
    70  This filtering mechanism is known as `WalkDelay` and is configurable.
    71  
    72  In our testing we found that the default of one second was not sufficient to allow Log Cache to
    73  "settle" and resulted in log loss on foundations with multiple Log Cache nodes.
    74  This is because multiple Log Cache nodes may be ingesting events for the same
    75  application, but a single Log Cache node hosts the cache for a given
    76  application. It's possible for us to see a newer log envelope timestamp and move
    77  our timestamp cursor forwards before we have received an earlier log envelope:
    78  https://www.pivotaltracker.com/story/show/171759407/comments/212391238.
    79  
    80  We decided to increase the `WalkDelay` to 2 seconds to give Log cache more time
    81  to "settle" in a multiple Log Cache node foundation.
    82  
    83  Note: an issue has been filed against the Log Cache client to increase the default
    84  walk delay: https://github.com/cloudfoundry/go-log-cache/issues/29
    85  
    86  ### 3. `cf logs --recent` behavior
    87  
    88  We decided to implement `cf logs --recent` using a single request to Log Cache instead of a multiple pass
    89  `Walk` implementation. This has better performance but means that there is no `WalkDelay` (i.e. the latest
    90  logs returned by the request may not be "settled"). There were three options we considered to address this:
    91  
    92  1. Do nothing. Simply return all of the logs that are currently in Log Cache.
    93  2. Only render logs that are more than two seconds old.
    94  3. Wait two seconds before showing all logs up to the point that the command was started.
    95  
    96  We decided to go with option 1. Option 2 would most likely break CATs and other automated tooling run against
    97  the `cf logs --recent` command due to the lag in logs appearing. Option 3 was a poor UX as it made the command
    98  two seconds slower and would also be missing logs that were emitted after the command started.
    99  
   100  See this [comment](https://www.pivotaltracker.com/story/show/172152836/comments/213375953) for more information.
   101  
   102  ## Consequences
   103  
   104  _What becomes easier or more difficult to do and any risks introduced by the change that will need to be mitigated._
   105  
   106  ### Peeking at the latest log
   107  
   108  * Peeking at the latest envelope for an application removes a dependency on the
   109    client clock, making us more tolerant to a situation where the client clock is
   110    not closely synchronised to the server
   111  * As each component that generates logs is responsible for assigning the
   112    timestamp, there is still potential for clock drift within the foundation to
   113    cause unexpected behavior.
   114  * The code is slightly more involved, but it seems like a good trade-off
   115  
   116  ### Delaying the output of new logs
   117  
   118  * We are momentarily delaying the output of new log envelopes to the user, with
   119    the advantage that by delaying we are able to output a more consistent view of
   120    the logs
   121  * The `WalkDelay` is implemented within the Log Cache client as a comparison of
   122    the log envelope timestamps against the client clock. It is therefore
   123    vulnerable to any misconfiguration of the client clock. We've considered
   124    strategies for removing this dependency on the client clock but have decided
   125    for now to wait for feedback from our users before attempting to make ourselves
   126    reliable in this situation. An issue has been filed against the Log Cache
   127    client: https://github.com/cloudfoundry/go-log-cache/issues/28
   128  
   129  ### `cf logs --recent` behavior
   130  
   131  * The most recent logs returned by this command may not be "settled". There may be some
   132    discrepancies between the output of two consecutive runs of this command.