github.com/smintz/nomad@v0.8.3/website/source/guides/operating-a-job/inspecting-state.html.md

---
layout: "guides"
page_title: "Inspecting State - Operating a Job"
sidebar_current: "guides-operating-a-job-inspecting-state"
description: |-
  Nomad exposes a number of tools and techniques for inspecting a running job.
  This is helpful in ensuring the job started successfully. Additionally, it
  can inform us of any errors that occurred while starting the job.
---

# Inspecting State

A successful job submission does not mean the job is running successfully.
This is the nature of a highly optimistic scheduler. A successful job submission
means only that the server was able to issue the proper scheduling commands. It
does not indicate the job is actually running. To verify the job is running, we
need to inspect its state.

This section uses the job named "docs" from the [previous
sections](/guides/operating-a-job/submitting-jobs.html), but these operations
and commands largely apply to all jobs in Nomad.

## Job Status

After a job is submitted, you can query the status of that job using the job
status command:

```text
$ nomad job status
ID    Type     Priority  Status
docs  service  50        running
```

At a high level, we can see that our job is currently running, but what does
"running" actually mean? By supplying the name of a job to the job status
command, we can ask Nomad for more detailed job information:

```text
$ nomad job status docs
ID          = docs
Name        = docs
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         3        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
04d9627d  42d788a3  a1f934c9  example     run      running   <timestamp>
e7b8d4f5  42d788a3  012ea79b  example     run      running   <timestamp>
5cbf23a1  42d788a3  1e1aa1e0  example     run      running   <timestamp>
```

Here we can see that there are three instances of this task running, each with
its own allocation. For more information on the `status` command, please see the
[CLI documentation for <tt>status</tt>](/docs/commands/status.html).

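For scripted checks, the Summary table above can be read programmatically. The
following is a minimal Python sketch, not part of Nomad: `parse_summary` and
`unhealthy_groups` are hypothetical helper names, and the code assumes the
plain-text layout shown in the sample output (columns separated by two or more
spaces, task group names without whitespace). That layout is meant for humans
and may differ between Nomad versions, so structured output such as Nomad's
HTTP API is probably a sturdier choice for anything long-lived.

```python
import re

# Sample text copied from the `nomad job status docs` output above.
SAMPLE = """\
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         3        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
04d9627d  42d788a3  a1f934c9  example     run      running   <timestamp>
"""


def parse_summary(status_output):
    """Return {task_group: {column: count}} parsed from the Summary table."""
    lines = [line.strip() for line in status_output.splitlines()]
    start = lines.index("Summary")  # raises ValueError if the section is missing
    # Header cells such as "Task Group" contain one space, so split on 2+.
    header = re.split(r"\s{2,}", lines[start + 1])
    summary = {}
    for line in lines[start + 2:]:
        if not line:
            break  # a blank line ends the table
        cells = line.split()
        summary[cells[0]] = {
            name: int(value) for name, value in zip(header[1:], cells[1:])
        }
    return summary


def unhealthy_groups(summary):
    """Return task groups that have queued, failed, or lost allocations."""
    return [
        group
        for group, counts in summary.items()
        if counts["Queued"] or counts["Failed"] or counts["Lost"]
    ]


if __name__ == "__main__":
    summary = parse_summary(SAMPLE)
    print(summary["example"]["Running"])
    print(unhealthy_groups(summary))
```

In practice the text would come from capturing the output of
`nomad job status docs` rather than from a hard-coded sample.
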
## Evaluation Status

You can think of an evaluation as a submission to the scheduler. The example
below shows status output for a job where some allocations were placed
successfully, but the cluster did not have enough resources to place all of the
desired allocations.

If we issue the status command with the `-evals` flag, we can see there is an
outstanding evaluation for this hypothetical job:

```text
$ nomad job status -evals docs
ID          = docs
Name        = docs
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Evaluations
ID        Priority  Triggered By  Status    Placement Failures
5744eb15  50        job-register  blocked   N/A - In Progress
8e38e6cf  50        job-register  complete  true

Placement Failure
Task Group "example":
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
12681940  8e38e6cf  4beef22f  example     run      running  <timestamp>
395c5882  8e38e6cf  4beef22f  example     run      running  <timestamp>
4d7c6f84  8e38e6cf  4beef22f  example     run      running  <timestamp>
843b07b8  8e38e6cf  4beef22f  example     run      running  <timestamp>
a8bc6d3e  8e38e6cf  4beef22f  example     run      running  <timestamp>
b0beb907  8e38e6cf  4beef22f  example     run      running  <timestamp>
da21c1fd  8e38e6cf  4beef22f  example     run      running  <timestamp>
```

In the above example we see that the job has a "blocked" evaluation that is in
progress. When Nomad cannot place all the desired allocations, it creates a
blocked evaluation that waits for more resources to become available.

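To spot blocked evaluations from a script, the Evaluations table can be read in
a similar way. This is a rough Python sketch under the assumption that the
table keeps the layout shown in the sample output; `parse_evaluations` and
`blocked_evals` are hypothetical names introduced here for illustration.

```python
import re

# Sample text copied from the `nomad job status -evals docs` output above.
SAMPLE = """\
Evaluations
ID        Priority  Triggered By  Status    Placement Failures
5744eb15  50        job-register  blocked   N/A - In Progress
8e38e6cf  50        job-register  complete  true

Placement Failure
Task Group "example":
"""


def parse_evaluations(status_output):
    """Return one dict per row of the Evaluations table, keyed by column name."""
    lines = [line.rstrip() for line in status_output.splitlines()]
    start = lines.index("Evaluations")  # raises ValueError if missing
    header = re.split(r"\s{2,}", lines[start + 1])
    rows = []
    for line in lines[start + 2:]:
        if not line:
            break  # a blank line ends the table
        # Cells may contain single spaces ("N/A - In Progress"), so split on 2+.
        cells = re.split(r"\s{2,}", line, maxsplit=len(header) - 1)
        rows.append(dict(zip(header, cells)))
    return rows


def blocked_evals(rows):
    """Return the IDs of evaluations still waiting for capacity."""
    return [row["ID"] for row in rows if row["Status"] == "blocked"]


if __name__ == "__main__":
    print(blocked_evals(parse_evaluations(SAMPLE)))
```

For the sample above, `blocked_evals` returns `["5744eb15"]`, the evaluation
that is waiting for more resources.
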
The `eval status` command enables us to examine any evaluation in more detail.
This is rarely necessary, but it can be useful to see why some of a job's
allocations were not placed. For example, if we run it on the completed
evaluation of the job named docs, which had a placement failure according to
the above output, we might see:

```text
$ nomad eval status 8e38e6cf
ID                 = 8e38e6cf
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = docs
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "example" (failed to place 3 allocations):
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Evaluation "5744eb15" waiting for additional capacity to place remainder
```

For more information on the `eval status` command, please see the [CLI
documentation for <tt>eval status</tt>](/docs/commands/eval-status.html).

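The Failed Placements section of the `eval status` output can also be condensed
into a small data structure, for example to feed an alert. The sketch below is
illustrative only: `summarize_failed_placements` is a hypothetical helper, and
the regular expressions assume the exact wording of the sample output above,
which may not hold for other Nomad versions.

```python
import re

# Sample text copied from the `nomad eval status 8e38e6cf` output above.
SAMPLE = '''\
Failed Placements
Task Group "example" (failed to place 3 allocations):
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Evaluation "5744eb15" waiting for additional capacity to place remainder
'''

GROUP_RE = re.compile(
    r'^Task Group "(?P<group>[^"]+)" '
    r"\(failed to place (?P<count>\d+) allocations?\):$"
)
BLOCKED_RE = re.compile(
    r'^Evaluation "(?P<eval_id>[^"]+)" waiting for additional capacity'
)


def summarize_failed_placements(eval_output):
    """Return (failures, blocked_eval_id) parsed from `eval status` text.

    failures maps task group -> {"unplaced": int, "reasons": [str, ...]}.
    blocked_eval_id is None when no follow-up evaluation is mentioned.
    """
    failures = {}
    blocked_eval_id = None
    current = None  # task group whose bullet list is being read
    for raw in eval_output.splitlines():
        line = raw.strip()
        group_match = GROUP_RE.match(line)
        if group_match:
            current = group_match.group("group")
            failures[current] = {
                "unplaced": int(group_match.group("count")),
                "reasons": [],
            }
        elif line.startswith("* ") and current is not None:
            failures[current]["reasons"].append(line[2:])
        else:
            current = None  # any other line ends the bullet list
            blocked_match = BLOCKED_RE.match(line)
            if blocked_match:
                blocked_eval_id = blocked_match.group("eval_id")
    return failures, blocked_eval_id


if __name__ == "__main__":
    print(summarize_failed_placements(SAMPLE))
```
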
## Allocation Status

You can think of an allocation as an instruction to schedule. Just like an
application or service, an allocation has logs and state. The `alloc status`
command gives us the most recent events that occurred for a task, its resource
usage, port allocations and more:

```text
$ nomad alloc status 04d9627d
ID            = 04d9627d
Eval ID       = 42d788a3
Name          = docs.example[2]
Node ID       = a1f934c9
Job ID        = docs
Client Status = running

Task "server" is "running"
Task Resources
CPU        Memory          Disk     IOPS  Addresses
0/100 MHz  728 KiB/10 MiB  300 MiB  0     http: 10.1.1.196:5678

Recent Events:
Time                   Type      Description
10/09/16 00:36:06 UTC  Started   Task started by client
10/09/16 00:36:05 UTC  Received  Task received by client
```

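The Memory cell in the Task Resources table reads as usage over limit. As a
small worked example, the Python sketch below converts such a cell to bytes and
a utilization ratio. The helper names `parse_memory` and `memory_utilization`
are hypothetical, and the unit table is an assumption based on the binary units
that appear in the sample (`KiB`, `MiB`); other outputs may use units not
listed here.

```python
import re

# Assumed binary unit multipliers; only KiB and MiB appear in the sample above.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

CELL_RE = re.compile(r"([\d.]+) (\w+)/([\d.]+) (\w+)")


def parse_memory(cell):
    """Parse a cell such as '728 KiB/10 MiB' into (used_bytes, limit_bytes)."""
    match = CELL_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized memory cell: {cell!r}")
    used, used_unit, limit, limit_unit = match.groups()
    return float(used) * UNITS[used_unit], float(limit) * UNITS[limit_unit]


def memory_utilization(cell):
    """Return used/limit as a fraction between 0 and 1 for a normal task."""
    used, limit = parse_memory(cell)
    return used / limit


if __name__ == "__main__":
    print(memory_utilization("728 KiB/10 MiB"))
```

For the sample allocation, 728 KiB out of 10 MiB (10240 KiB) works out to
roughly 7% of the memory limit.
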
The `alloc status` command is a good starting point for debugging an
application that did not start. Suppose a user meant to start a Docker
container from the image "redis:2.8", but accidentally typed a comma instead of
a period, giving "redis:2,8".

When the job is executed, it produces a failed allocation. The `alloc status`
command will give us the reason why:

```text
$ nomad alloc status 04d9627d
# ...

Recent Events:
Time                   Type            Description
06/28/16 15:50:22 UTC  Not Restarting  Error was unrecoverable
06/28/16 15:50:22 UTC  Driver Failure  failed to create image: Failed to pull `redis:2,8`: API error (500): invalid tag format
06/28/16 15:50:22 UTC  Received        Task received by client
```

Unfortunately, not all failures are as easy to debug. If the `alloc status`
command shows many restarts, there is likely an application-level issue during
startup. For example:

```text
$ nomad alloc status 04d9627d
# ...

Recent Events:
Time                   Type        Description
06/28/16 15:56:16 UTC  Restarting  Task restarting in 5.178426031s
06/28/16 15:56:16 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:56:16 UTC  Started     Task started by client
06/28/16 15:56:00 UTC  Restarting  Task restarting in 5.00123931s
06/28/16 15:56:00 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:55:59 UTC  Started     Task started by client
06/28/16 15:55:48 UTC  Received    Task received by client
```

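A restart loop like the one above can be detected mechanically by counting
event types in the Recent Events table. The Python sketch below is one possible
approach, assuming the three-column layout shown in the sample; `parse_events`,
`exit_codes`, and `looks_like_restart_loop` are hypothetical helpers, and the
threshold of two restarts is an arbitrary choice for illustration.

```python
import re
from collections import Counter

# Sample text copied from the `nomad alloc status` output above.
SAMPLE = '''\
Recent Events:
Time                   Type        Description
06/28/16 15:56:16 UTC  Restarting  Task restarting in 5.178426031s
06/28/16 15:56:16 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:56:16 UTC  Started     Task started by client
06/28/16 15:56:00 UTC  Restarting  Task restarting in 5.00123931s
06/28/16 15:56:00 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:55:59 UTC  Started     Task started by client
06/28/16 15:55:48 UTC  Received    Task received by client
'''


def parse_events(alloc_output):
    """Return the Recent Events rows as dicts, newest first (as printed)."""
    lines = [line.rstrip() for line in alloc_output.splitlines()]
    start = lines.index("Recent Events:")  # raises ValueError if missing
    events = []
    for line in lines[start + 2:]:  # skip the section title and column header
        if not line:
            break  # a blank line ends the table
        # Timestamps and descriptions contain single spaces, so split on 2+.
        time, event_type, description = re.split(r"\s{2,}", line, maxsplit=2)
        events.append({"time": time, "type": event_type, "description": description})
    return events


def exit_codes(events):
    """Return the exit codes reported by Terminated events."""
    codes = []
    for event in events:
        match = re.search(r"Exit Code: (\d+)", event["description"])
        if event["type"] == "Terminated" and match:
            codes.append(int(match.group(1)))
    return codes


def looks_like_restart_loop(events, threshold=2):
    """Heuristic: at least `threshold` Restarting events in the recent history."""
    return Counter(event["type"] for event in events)["Restarting"] >= threshold


if __name__ == "__main__":
    events = parse_events(SAMPLE)
    print(looks_like_restart_loop(events), exit_codes(events))
```

A repeated non-zero exit code, as in this sample, points at the application
rather than at scheduling.
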
To debug these failures, we need to use the `logs` command, which is
discussed in the [accessing logs](/guides/operating-a-job/accessing-logs.html)
section of this documentation.

For more information on the `alloc status` command, please see the [CLI
documentation for <tt>alloc status</tt>](/docs/commands/alloc/status.html).