
# Metadata Introduction

The metadata features within the studioml go runner are designed to allow authors of python and containerized applications to sequester attributes of experiments into JSON files that accompany experiment results.

Experiment accessioning and management is a major requirement for small and large teams in both research and commercial contexts.  Tasks run using studio go runner that generate conforming JSON output will have that output captured by the runner and stored as JSON blobs, or files, using the same storage endpoint as the experiment logs.  User workflows and downstream tools can subsequently retrieve these documents for any purpose, for example indexing and queries, or ETL.

# Experiment Metadata wrangling

## Storage organization

While hosting tasks the studioml runner will monitor the console output of each task and scrape any single line, well formed JSON fragments.  These fragments are gathered every time a checkpoint of the task occurs and are used to build a JSON document that will be placed into the '\_metadata' artifact location on the storage endpoint specified by the experimenter initiating the studioml task.
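
As an illustration, a python task need only print complete JSON documents, one per line, for the runner's scraper to pick them up.  The following is a minimal sketch of that convention; the names and values shown are hypothetical:

```
import json

def emit(fragment):
    # Print one complete JSON document per line and flush immediately;
    # lines that are not well formed JSON are ignored by the scraper.
    print(json.dumps(fragment), flush=True)

# Hypothetical progress reporting, for illustration only
emit({"experiment": {"name": "testExpr", "max_run_length": 24}})
for step in range(24):
    emit({"experiment": {"current_run_position": step}})
```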

The following figure shows the runtime layout of directories and files while an experiment is being run.  It captures an experiment during its second attempted run, the first attempt having failed.

```
└── 4f9ba63a64ec0618.1
    ├── _metadata
    │   ├── output-host-awsdev-1gew0R.log
    │   ├── output-host-awsdev-1gew1j.log
    │   ├── scrape-host-awsdev-1gew0R.json
    │   └── scrape-host-awsdev-1gew1j.json
    ├── _metrics
    ├── modeldir
    ├── output
    │   └── output
    ├── tb
    └── workspace
        ├── experiment_template.json
        └── metadata-test.py
```

The \_metadata artifact shows four files whose names are composed from the file type, the host key and host name, and an ID that can be sorted to reflect the time of creation.  These files allow the progress of the studioml task to be tracked across time and across different machines within a studioml cluster.

studioml applications can retrieve these files from the storage platform chosen by the experimenter and use them to query experiment results: the raw console output in the case of the 'output-host-xxxxxx-tttttt.log' files, and the JSON data emitted by the application in the case of the 'scrape-host-xxxxxx-tttttt.json' files.

If a bucket is used to store the experiment's output data then the metadata artifacts will be uploaded as individual blobs, or files, allowing them to be selectively indexed or downloaded.  Given the previous example, their keys will appear as follows:

```
metadata/output-host-awsdev-1gew0R.log
metadata/output-host-awsdev-1gew1j.log
metadata/scrape-host-awsdev-1gew0R.json
metadata/scrape-host-awsdev-1gew1j.json
```

The metadata artifact is treated as a folder style artifact consisting of multiple individual files, three files per run, named using the pod/host name on which the run was located.  The following example shows keys for artifacts from 2 attempted runs of an experiment, on hosts host-fe5917a and host-234c07a.

```
+ metadata
|
+--- output-host-fe5917a-1gKTNC.log
|
+--- runner-host-fe5917a-1gKTNC.log
|
+--- scrape-host-fe5917a-1gKTNC.json
|
+--- output-host-234c07a-1gKTNw.log
|
+--- runner-host-234c07a-1gKTNw.log
|
+--- scrape-host-234c07a-1gKTNw.json
```

Using individual objects, or files, allows independent uploads of experiment activity, enabling checksum based caching to be employed downstream while preserving atomic uploads for each host and experiment run combination.
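
Because each upload replaces a whole object, a downstream consumer can use the store's checksum to skip objects it has already seen.  The following is a minimal sketch using boto3 against an S3 compatible store; the in-memory cache is an illustrative assumption:

```
import boto3

s3 = boto3.client("s3")
etag_cache = {}  # object key -> ETag observed on the previous pass

def fetch_if_changed(bucket, key):
    # The ETag returned by a HEAD request acts as a checksum for the object,
    # so an unchanged ETag means the download can be skipped entirely.
    head = s3.head_object(Bucket=bucket, Key=key)
    if etag_cache.get(key) == head["ETag"]:
        return None  # unchanged since the last pass
    etag_cache[key] = head["ETag"]
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```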

The scrape files contain the metadata defined in the next subsection.

The trailing characters of the file names are significant: they encode the time at which the file was created, in seconds since 1970, using Base 62 so that the file names form a sortable chronology of file creation.  Refreshed files retain their original names when updated.
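
A minimal sketch of recovering the creation time from a file name follows.  The '0-9A-Za-z' alphabet used here is an assumption and should be confirmed against the runner's encoder; note that this ordering also makes the raw suffixes sort lexicographically in time order:

```
import string

# Assumed Base 62 alphabet, ordered so ASCII sorting matches numeric order.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def base62_decode(encoded):
    value = 0
    for ch in encoded:
        value = value * 62 + ALPHABET.index(ch)
    return value

def creation_time(file_name):
    # e.g. 'scrape-host-awsdev-1gew0R.json' -> seconds since 1970
    stem = file_name.rsplit(".", 1)[0]
    return base62_decode(stem.rsplit("-", 1)[1])
```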

Should annotations need to be injected into the scrape files by the experimenter's application, they must be added after the experiment has had its studioml.status tag updated to read 'completed'.  The application must however follow the rule that existing tags are never modified.  It is envisioned that an application such as a session server or completion service (project orchestration) responsible for an entire project would wait for experiments to complete by querying the scrape files.  Likewise, ETL tasks populating a downstream database can stream scrape results into that database, potentially adding a new tag on each extraction until the status reads completed and then performing a final extraction.

After completion the orchestrator can inspect the results in the scrape file and add information related to the entire experiment, and the standing of each individual experiment, to their respective scrape files.  Examples of values orchestration might add include model information such as a version number, or a mark of fitness for deployment, using application defined tags in the experiment section.  A sketch of this pattern appears below.
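
The following is a minimal sketch of such an orchestration step, polling a local copy of a scrape file until the status tag reads 'completed' and then appending an application defined tag.  The file path, poll interval, and the model_version tag are hypothetical; only new tags are added so that existing tags remain untouched:

```
import json
import time

def wait_and_tag(scrape_path, poll_seconds=30):
    # Poll the scrape document until the runner marks the experiment completed.
    while True:
        with open(scrape_path) as f:
            doc = json.load(f)
        if doc.get("studioml", {}).get("status") == "completed":
            break
        time.sleep(poll_seconds)
    # Append a new, hypothetical application defined tag; setdefault is used
    # so that an existing tag of the same name is never modified.
    doc.setdefault("experiment", {}).setdefault("model_version", "1.0.0")
    with open(scrape_path, "w") as f:
        json.dump(doc, f, indent=2)
```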

In some cases there may well be state that the ML project orchestration software wishes to save for checkpointing and other purposes that is not part of the studioml scope.  In these cases the 3rd party software can store this independently of the studioml ecosystem, possibly even on the same shared storage infrastructure, however this is orthogonal to the runners and studioml.  Examples of this type of data might include project, cost center, and customer data.  Once each experiment is complete the orchestration can also add such tags to the finished experiment to assist with downstream ETL and any queries that need supporting.

If there is metadata that would be needed to reproduce the experiment then it should be added as an artifact to the input files for the experiment, rather than waiting until the conclusion of the run to add it to the metadata artifact.

## JSON Document

JSON data scraped from the task's console output will be captured and checked by the runner for being well formed: valid JSON appearing on a single line.  The JSON data should be formatted as mergeable fragments as defined by RFC 7386, or as JSON patch directives as defined by RFC 6902.  Examples of each appear below:

```
{"experiment": {"name": "testExpr", "max_run_length": 24, "current_run_position": 16}}
[{"op": "replace", "path": "/experiment/current_run_position", "value": 20}]
{"experiment": {"completed": "true"}}
[{"op": "remove", "path": "/experiment/current_run_position"}]
```

As an application progresses it can continue to emit merge fragments and patching directives, updating the resulting document that the runner will checkpoint and so creating up-to-the-minute application state.

When the runner checkpoints a task, or when the task completes, the JSON fragments are processed in the order in which they appeared to create a single JSON document, stored alongside the output log using a prefix of 'metadata/' as described in the previous subsection.
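
The runner performs this folding in Go; the following python sketch illustrates equivalent semantics, using the third party jsonpatch package for RFC 6902 directive lists and a hand rolled RFC 7386 merge for fragments:

```
import json

import jsonpatch  # pip install jsonpatch

def merge(target, fragment):
    # RFC 7386 semantics: objects merge recursively and null removes a key.
    for key, value in fragment.items():
        if value is None:
            target.pop(key, None)
        elif isinstance(value, dict) and isinstance(target.get(key), dict):
            merge(target[key], value)
        else:
            target[key] = value
    return target

def fold(lines):
    doc = {}
    for line in lines:
        fragment = json.loads(line)
        if isinstance(fragment, list):
            # RFC 6902 patch directives arrive as a JSON array of operations.
            doc = jsonpatch.apply_patch(doc, fragment)
        else:
            doc = merge(doc, fragment)
    return doc
```

Feeding the four example lines above through such a fold yields a document whose experiment section has completed set and current_run_position removed.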

## runner JSON

JSON data is also produced by the runner itself when running python based workloads, detailing aspects of the runtime environment that can later be used by downstream tooling.

studioml data is gathered into a JSON map using studioml as the key.  User, or experiment, data is by convention added using an experiment key.  As an example, the studioml generated pip dependency tree is placed into the JSON as follows:

```
{
  "studioml": {
    "artifacts": {
      "_metadata": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/_metadata.tar",
      "_metrics": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/_metrics.tar",
      "modeldir": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/modeldir.tar",
      "output": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/output.tar",
      "tb": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/tb.tar",
      "workspace": "s3://127.0.0.1:40130/bgnauro3p3itfkp5iuqg/workspace.tar"
    },
    "experiment": {
      "key": "e5e90feb-a6e5-4668-b885-c1789f74ad23",
      "project": "goldengun"
    },
    "pipdeptree": [
      {
        "dependencies": [],
        "package": {
          "installed_version": "3.1.0",
          "package_name": "setuptools-scm",
          "key": "setuptools-scm"
        }
      },
      {
        "dependencies": [],
        "package": {
          "installed_version": "1.24",
          "package_name": "urllib3",
          "key": "urllib3"
        }
      },
...
      }
    ]
  },
...
}
```
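
As a consumption example, downstream tooling might recover the captured package versions from a scrape document.  A minimal sketch, assuming the document layout shown above:

```
import json

def package_versions(scrape_path):
    # Map each package key to its installed version using the runner
    # generated pip dependency tree inside the scrape document.
    with open(scrape_path) as f:
        doc = json.load(f)
    return {
        entry["package"]["key"]: entry["package"]["installed_version"]
        for entry in doc.get("studioml", {}).get("pipdeptree", [])
    }
```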

Application JSON output is added simply by sending JSON merge fragments, or JSON patch directives.  Should the application echo the following:

```
{"experiment": {"name": "dummy pass"}}
{"experiment": {"name": "Zaphod Beeblebrox"}}
```

the result would appear in the JSON file as:

```
{
  "experiment": {
    "name": "Zaphod Beeblebrox"
  },
  "studioml": {
  }
...
}
```

# Storage platforms and query capabilities

TBD

https://docs.aws.amazon.com/athena/latest/ug/work-with-data.html

# Downstream ETL and enterprise integration

When wrangling JSON documents the jq tool, https://stedolan.github.io/jq/, has proved invaluable.

The design of the metadata artifact allows the creation of downstream applications that extract data from a studioml data store, such as S3, while experiments are in one of two states:

1. experiments in flight
2. experiments that have ceased active processing

Performing ETL on experiments that have ceased processing can be implemented by having the ETL mark experiments as exported, using custom tags in the experiment block.  Any experiments without the exported tag can then be selected using either a JSON query engine or simple iteration over the scrape JSON files.  Using a query engine such as AWS Athena or Google Datastore is another method, employing S3 Select on the JSON studioml structure and its status field with a value of completed.  If a query engine is not available the file store or blob hierarchy can be traversed and the most recent run scrapes selected, using the last dash delimited portion of the file name as a sortable timestamp equivalent, then unmarshalling the JSON to check on the status.  A sketch of the iteration approach appears below.
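
A minimal sketch of the iteration approach against an S3 style store follows.  The bucket name and the exported tag are hypothetical, while the 'metadata/' prefix and status field follow the layout described earlier:

```
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "experiment-results"  # hypothetical bucket name

def completed_unexported_scrapes():
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="metadata/scrape-")
    # Sorting works because the trailing Base 62 suffix sorts chronologically.
    for key in sorted(obj["Key"] for obj in listing.get("Contents", [])):
        doc = json.loads(s3.get_object(Bucket=BUCKET, Key=key)["Body"].read())
        if doc.get("studioml", {}).get("status") != "completed":
            continue  # still in flight
        if "exported" in doc.get("experiment", {}):
            continue  # already handled by an earlier ETL pass
        yield key, doc
```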

If ETL processing is performed using a long lived daemon it can track experiments still in progress using a membership test against an in-memory data structure to exclude or include experiments for ETL.  An example of such a structure is the in-memory cuckoo filter, https://brilliant.org/wiki/cuckoo-filter/, which prevents unnecessary processing of JSON artifacts for experiments that have already completed, or which are no longer of interest.  If iteration is being used then the timestamp portion of the file name can also be used to exclude JSON scrapes that are too old to be relevant.  For storage platforms that record access and modification times there are further opportunities to avoid needless processing.
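
A minimal sketch of such a filter in python appears below, storing small fingerprints in a power of two table; a production daemon would likely reach for an existing library, and the sizing parameters are illustrative:

```
import random

class CuckooFilter:
    """Approximate membership filter; like a Bloom filter but supports deletion."""

    def __init__(self, capacity=1024, bucket_size=4, max_kicks=500):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.capacity = capacity
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(capacity)]

    def _fingerprint(self, item):
        return hash(item) % 255 + 1  # small, non-zero fingerprint

    def _alt(self, index, fp):
        # XOR keeps the alternate index in range because capacity is a power of two.
        return index ^ (hash(fp) % self.capacity)

    def add(self, item):
        fp = self._fingerprint(item)
        index = hash(item) % self.capacity
        for i in (index, self._alt(index, fp)):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict fingerprints until one fits.
        i = random.choice((index, self._alt(index, fp)))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is overloaded

    def __contains__(self, item):
        fp = self._fingerprint(item)
        index = hash(item) % self.capacity
        return fp in self.buckets[index] or fp in self.buckets[self._alt(index, fp)]
```

The daemon would add each experiment key as it finishes processing and use `key in filter` on later passes; lookups may return rare false positives but never false negatives.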

Copyright © 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.