github.com/sentienttechnologies/studio-go-runner@v0.0.0-20201118202441-6d21f2ced8ee/docs/interface.md

github.com/sentienttechnologies/studio-go-runner@v0.0.0-20201118202441-6d21f2ced8ee/docs/interface.md (about)

     1  # Interfacing and Integration
     2  
     3  This document describes the interface, and interchange format used between the StudioML client and runners that process StudioML experiments.
     4  
     5  <!--ts-->
     6  
     7  Table of Contents
     8  =================
     9  
    10  * [Interfacing and Integration](#interfacing-and-integration)
    11  * [Table of Contents](#table-of-contents)
    12    * [Introduction](#introduction)
    13    * [Audience](#audience)
    14    * [Runners](#runners)
    15    * [Queuing](#queuing)
    16    * [Experiment Lifecycle](#experiment-lifecycle)
    17    * [Payloads](#payloads)
    18      * [Encrypted payloads](#encrypted-payloads)
    19      * [Signed payloads](#signed-payloads)
    20      * [Field descriptions](#field-descriptions)
    21      * [experiment ↠ pythonver](#experiment--pythonver)
    22      * [experiment ↠ args](#experiment--args)
    23      * [experiment ↠ max_duration](#experiment--max_duration)
    24      * [experiment ↠ filename](#experiment--filename)
    25      * [experiment ↠ project](#experiment--project)
    26      * [experiment ↠ artifacts](#experiment--artifacts)
    27      * [experiment ↠ artifacts ↠ [label] ↠ bucket](#experiment--artifacts--label--bucket)
    28      * [experiment ↠ artifacts ↠ [label] ↠ key](#experiment--artifacts--label--key)
    29      * [experiment ↠ artifacts ↠ [label] ↠ qualified](#experiment--artifacts--label--qualified)
    30      * [experiment ↠ artifacts ↠ [label] ↠ mutable](#experiment--artifacts--label--mutable)
    31      * [experiment ↠ artifacts ↠ [label] ↠ unpack](#experiment--artifacts--label--unpack)
    32      * [experiment ↠ artifacts ↠ resources_needed](#experiment--artifacts--resources_needed)
    33      * [experiment ↠ artifacts ↠ pythonenv](#experiment--artifacts--pythonenv)
    34      * [experiment ↠ artifacts ↠  time added](#experiment--artifacts---time-added)
    35      * [experiment ↠ config](#experiment--config)
    36      * [experiment ↠ config ↠ experimentLifetime](#experiment--config--experimentlifetime)
    37      * [experiment ↠ config ↠ verbose](#experiment--config--verbose)
    38      * [experiment ↠ config ↠ saveWorkspaceFrequency](#experiment--config--saveworkspacefrequency)
    39      * [experiment ↠ config ↠ database](#experiment--config--database)
    40      * [experiment ↠ config ↠ database ↠ type](#experiment--config--database--type)
    41      * [experiment ↠ config ↠ database ↠ authentication](#experiment--config--database--authentication)
    42      * [experiment ↠ config ↠ database ↠ endpoint](#experiment--config--database--endpoint)
    43      * [experiment ↠ config ↠ database ↠ bucket](#experiment--config--database--bucket)
    44      * [experiment ↠ config ↠ storage](#experiment--config--storage)
    45      * [experiment ↠ config ↠ storage ↠ type](#experiment--config--storage--type)
    46      * [experiment ↠ config ↠ storage ↠ endpoint](#experiment--config--storage--endpoint)
    47      * [experiment ↠ config ↠ storage ↠ bucket](#experiment--config--storage--bucket)
    48      * [experiment ↠ config ↠ storage ↠ authentication](#experiment--config--storage--authentication)
    49      * [experiment ↠ config ↠ resources_needed](#experiment--config--resources_needed)
    50      * [experiment ↠ config ↠ resources_needed ↠ hdd](#experiment--config--resources_needed--hdd)
    51      * [experiment ↠ config ↠ resources_needed ↠ cpus](#experiment--config--resources_needed--cpus)
    52      * [experiment ↠ config ↠ resources_needed ↠ ram](#experiment--config--resources_needed--ram)
    53      * [experiment ↠ config ↠ resources_needed ↠ gpus](#experiment--config--resources_needed--gpus)
    54      * [experiment ↠ config ↠ resources_needed ↠ gpuMem](#experiment--config--resources_needed--gpumem)
    55      * [experiment ↠ config ↠ env](#experiment--config--env)
    56      * [experiment ↠ config ↠ cloud ↠ queue ↠ rmq](#experiment--config--cloud--queue--rmq)
    57  <!--te-->
    58  
    59  ## Introduction
    60  
    61  StudioML has two major modules.
    62  
    63  . The client, or front end, that shepherds experiments on behalf of users and packaging up experiments that are then placed on to a queue using json messages
    64  
    65  . The runner that receives json formatted messages on a message queue and then runs the experiment they describe
    66  
    67  There are other tools that StudioML offers for reporting and management of experiment artifacts that are not within the scope of this document.
    68  
    69  It is not yet within the scope of this document to describe how data outside of the queuing interface is stored and formatted.
    70  
    71  ## Audience
    72  
    73  This document is intended for developers who wish to implement runners to process StudioML work, or implement clients that generate work for StudioML runners.
    74  
    75  ## Runners
    76  
    77  This project implements a StudioML runner, however it is not specific to StudioML.  This runner could be used to deliver and execute and python code within a virtualenv that the runner supplies.
    78  
    79  Any standard runners can accept a standalone virtualenv with no associated container.  The go runner, this present project, has been extended to allow clients to also send work that has a Singularity container specified.
    80  
    81  In the first case, virtualenv only, the runner implcitly trusts that any work received is trusted and is not malicous.  In this mode the runner makes not attempt to protect the integrity of the host it is deployed into.
    82  
    83  In the second case if a container is specified it will be used to launch work and the runner will rely upon the container runtime to prevent leakage into the host.
    84  
    85  ## Queuing
    86  
    87  The StudioML eco system relies upon a message queue to buffer work being sent by the StudioML client to any arbitrary runner that is subscribed to the experimenters choosen queuing service.  StudioML support multiple queuing technologies including, AWS SQS, local file system, and RabbitMQ.  The reference implementation is RabbitMQ for the purposes of this present project.  The go runner project supports SQS, and RabbitMQ.
    88  
    89  Additional queuing technologies can be added if desired to the StudioML (https://github.com/studioml/studio.git), and go runner (https://github.com/SentientTechnologies/studio-go-runner.git) code bases and a pull request submitted.
    90  
    91  When using a queue the StudioML eco system relies upon a reliable, at-least-once, messaging system.  An additional requirement for queuing systems is that if the worker disappears, or work is not reclaimed by the worker as progress is made that the work is requeued by the broker automatically.
    92  
    93  ## Experiment Lifecycle
    94  
    95  If you have had a chance to run some of the example experiments within the StudioML github repository then you will have noticed a keras example.  The keras example is used to initiate a single experiment that queues work for a single runner and then immediately returns to the command line prompt without waiting for a result.  Experiments run in this way rely on the user to monitor their cloud storage bucket and look for the output.tar file in a directory named after their experiment.  For simple examples and tests this is a quick but manual way to work.
    96  
    97  In more complex experiments there might be multiple phases to a project that is being run.  Each experiment might represent an individual in for example evolutionary computation.  The python software running the project might want to send potentially hundreds of experiments, or individuals to the runners and then wait for these to complete.  Once complete it might select individuals that scored highly, using as one example a fitness screen.  The python StudioML client might then generate a new population that are then marshall individuals from the population into experiments, repeating this cycle potentially for days.
    98  
    99  To address the need for longer running experiments StudioML offers a number of python classes within the open source distribution that allows this style of longer running taining scenarios to be implemented by researchers and engineers.  The combination of completion service and session server classes can be used to create these long running StudioML compliant clients.
   100  
   101  Completion service based applications that use the StudioML classes generate work in exactly the same way as the CLI based 'studio run' command.  Session servers are an implementation of a completion service combined with logic that once experiments are queued will on a regular interval examine the cloud storage folders for returned archives that runners have rolled up when they either save experiment workspaces, or at the conclusion of the experiment find that the python experiment code had generated files in directories identified as a part of the queued job.  After the requisite numer of experiments are deemed to have finished based on the storage server bucket contents the session server can then examine the uploaded artifacts and determine their next set of training steps.
   102  
   103  ## Payloads
   104  
   105  The following figure shows an example of a job sent from the studioML front end to the runner.  The runner does not always make use of the entire set of json tags, typically a limited but consistent subset of tags are used.  This format is a clear text format, please see below for notes regarding the encrypted format.
   106  
   107  ```json
   108  {
   109    "experiment": {
   110      "status": "waiting",
   111      "time_finished": null,
   112      "git": null,
   113      "key": "1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92",
   114      "time_last_checkpoint": 1530054414.027222,
   115      "pythonver": "3.6",
   116      "metric": null,
   117      "args": [
   118        "10"
   119      ],
   120      "max_duration": "20m",
   121      "filename": "train_cifar10.py",
   122      "project": null,
   123      "artifacts": {
   124        "output": {
   125          "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output",
   126          "bucket": "kmutch-rmq",
   127          "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output.tar",
   128          "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output.tar",
   129          "mutable": true,
   130          "unpack": true
   131        },
   132        "_metrics": {
   133          "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics",
   134          "bucket": "kmutch-rmq",
   135          "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics.tar",
   136          "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics.tar",
   137          "mutable": true,
   138          "unpack": true
   139        },
   140        "modeldir": {
   141          "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir",
   142          "bucket": "kmutch-rmq",
   143          "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir.tar",
   144          "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir.tar",
   145          "mutable": true,
   146          "unpack": true
   147        },
   148        "workspace": {
   149          "local": "/home/kmutch/studio/examples/keras",
   150          "bucket": "kmutch-rmq",
   151          "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/blobstore/419411b17e9c851852735901a17bd6d20188cee30a0b589f1bf1ca5b487930b5.tar
   152  ",
   153          "key": "blobstore/419411b17e9c851852735901a17bd6d20188cee30a0b589f1bf1ca5b487930b5.tar",
   154          "mutable": false,
   155          "unpack": true
   156        },
   157        "tb": {
   158          "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb",
   159          "bucket": "kmutch-rmq",
   160          "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb.tar",
   161          "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb.tar",
   162          "mutable": true,
   163          "unpack": true
   164        }
   165       "info": {},
   166      "resources_needed": {
   167        "hdd": "3gb",
   168        "gpus": 1,
   169        "ram": "2gb",
   170        "cpus": 1,
   171        "gpuMem": "4gb"
   172      },
   173      "pythonenv": [
   174        "APScheduler==3.5.1",
   175        "argparse==1.2.1",
   176        "asn1crypto==0.24.0",
   177        "attrs==17.4.0",
   178        "autopep8==1.3.5",
   179        "awscli==1.15.4",
   180        "boto3==1.7.4",
   181        "botocore==1.10.4",
   182  ...
   183        "six==1.11.0",
   184        "sseclient==0.0.19",
   185        "-e git+https://github.com/SentientTechnologies/studio@685f4891764227a2e1ea5f7fc91b31dcf3557647#egg=studioml",
   186        "terminaltables==3.1.0",
   187        "timeout-decorator==0.4.0",
   188        "tzlocal==1.5.1",
   189        "uritemplate==3.0.0",
   190        "urllib3==1.22",
   191        "Werkzeug==0.14.1",
   192        "wheel==0.31.0",
   193        "wsgiref==0.1.2"
   194      ],
   195      "owner": "guest",
   196      "time_added": 1530054413.134781,
   197      "time_started": null
   198    },
   199    "config": {
   200      "optimizer": {
   201        "visualization": true,
   202        "load_checkpoint_file": null,
   203        "cmaes_config": {
   204          "load_best_only": false,
   205          "popsize": 100,
   206          "sigma0": 0.25
   207        },
   208        "termination_criterion": {
   209          "generation": 5,
   210          "fitness": 999,
   211          "skip_gen_timeout": 30,
   212          "skip_gen_thres": 1
   213        },
   214        },
   215        "result_dir": "~/Desktop/",
   216        "checkpoint_interval": 0
   217      },
   218      "verbose": "debug",
   219      "saveWorkspaceFrequency": "3m",
   220      "database": {
   221        "type": "s3",
   222        "authentication": "none",
   223        "endpoint": "http://s3-us-west-2.amazonaws.com",
   224        "bucket": "kmutch-metadata"
   225      },
   226      "runner": {
   227        "slack_destination": "@karl.mutch"
   228      },
   229      "storage": {
   230        "type": "s3",
   231        "endpoint": "http://s3-us-west-2.amazonaws.com",
   232        "bucket": "kmutch-rmq"
   233      },
   234      "server": {
   235        "authentication": "None"
   236      },
   237      "env": {
   238        "PATH": "%PATH%:./bin",
   239        "AWS_DEFAULT_REGION": "us-west-2",
   240        "AWS_ACCESS_KEY_ID": "AKZAIE5G7Q2GZC3OMTYW",
   241        "AWS_SECRET_ACCESS_KEY": "rt43wqJ/w5aqAPat659gkkYpphnOFxXejsCBq"
   242      },
   243      "cloud": {
   244        "queue": {
   245          "rmq": "amqp://user:password@10.230.72.19:5672/%2f?connection_attempts=30&retry_delay=.5&socket_timeout=5"
   246        }
   247      }
   248    }
   249  }
   250  ```
   251  
   252  ### Encrypted payloads
   253  
   254  In the event that message level encryption is enabled then the payload format will vary from the clear-text format.  The encrypted format will retain a very few blocks in clear-text to assist in scheduling, the status, pythonver, experiment_lifetime, time_added, and the resources needed blocks as in the following example. All other fragments will be rolled up into an encrypted_data block, consisting of Base64 encoded data.  The fields used within the clear-text header retain the same purpose and meaning as those in the Request documented in the [Field Descriptions](#field-descriptions) section
   255  
   256  Encrypted payloads use a hybrid cryptosystem, for a detailed description please see https://en.wikipedia.org/wiki/Hybrid_cryptosystem.
   257  
   258  A detailed description of the StudioML implementation of this system can be found in the [docs/message_privacy.md](docs/message_privacy.md) documentation.
   259  
   260  The following figures shows an example of the clear-text headers and the encrypted payload portion of a message:
   261  
   262  ```json
   263  {
   264    "message": {
   265      "experiment": {
   266          "status": "waiting",
   267          "pythonver": "3.6",
   268      },
   269      "time_added": 1530054413.134781,
   270      "experiment_lifetime": "30m",
   271      "resources_needed": {
   272          "gpus": 1,
   273          "hdd": "3gb",
   274          "ram": "2gb",
   275          "cpus": 1,
   276          "gpuMem": "4gb"
   277      },
   278      "payload": "Full Base64 encrypted payload"
   279    }
   280  }
   281  ```
   282  
   283  The encrypted format will retain a very few blocks in clear-text to assist in scheduling, the status, pythonver, experiment_lifetime, time_added, and the resources needed blocks as in the following example. All other fragments will be rolled up into an encrypted_data block, consisting of Base64 encoded data.  The fields used within the clear-text header retain the same purpose and meaning as those in the Request documented in the [Field Descriptions](#field-descriptions) section
   284  
   285  Encrypted payloads use a hybrid cryptosystem, for a detailed description please see https://en.wikipedia.org/wiki/Hybrid_cryptosystem.
   286  
   287  A detailed description of the StudioML implementation of this system can be found in the [message_privacy](docs/message_privacy.md) documentation.
   288  
   289  The following figures shows an example of the clear-text headers and the encrypted payload portion of a message:
   290  
   291  ```json
   292  {
   293    "message": {
   294      "experiment": {
   295          "status": "waiting",
   296          "pythonver": "3.6",
   297      },
   298      "time_added": 1530054413.134781,
   299      "experiment_lifetime": "30m",
   300      "resources_needed": {
   301          "gpus": 1,
   302          "hdd": "3gb",
   303          "ram": "2gb",
   304          "cpus": 1,
   305          "gpuMem": "4gb"
   306      },
   307      "payload": "Full Base64 encrypted payload"
   308    }
   309  }
   310  ```
   311  
   312  The encrypted payload should consist of a 24 byte nonce, and then the users encrypted data.
   313  
   314  When processing messages runners can use the clear-text JSON in an advisory capacity to determine if messages are useful before decrypting their contents, however once decrypted messages will be re-evaluated using the decrypted contents only.  The clear-text portions of the message  will be ignored post decryption.
   315  
   316  Private keys and passphrases are provisioned on compute clusters using the Kubernetes secrets service and stored encrypted within etcd when the go runner is used.
   317  
   318  ### Signed payloads
   319  
   320  Message signing is a way of protecting the runner receiving messages from processing spoofed requests.  To prevent this the runner can be configured to read public key information from Kubernetes secrets and then to use this to validate messages that are being received.  The configuration information for the runner signing keys is detailed in the [message\_privacy.md](message_privacy.md) file.
   321  
   322  Message signing must be used in combination with message encryption features described in the previous section.
   323  
   324  The format of the signature that is transmitted using the StudioML message signature field consists of the Base64 encoded signature blob, encoded from the binary 64 byte signature.
   325  
   326  The signing information is encoded into two JSON elements, the fingerprint and signature elements, for example:
   327  
   328  ```
   329  ```json
   330  {
   331    "message": {
   332      "experiment": {
   333          "status": "waiting",
   334          "pythonver": "3.6",
   335      },
   336      "time_added": 1530054413.134781,
   337      "experiment_lifetime": "30m",
   338      "resources_needed": {
   339          "gpus": 1,
   340          "hdd": "3gb",
   341          "ram": "2gb",
   342          "cpus": 1,
   343          "gpuMem": "4gb"
   344      },
   345      "payload": "Full Base64 encrypted payload",
   346      "fingerprint": "Base64 of sha256 binary fingerprint",
   347      "signature": "Base64 encoded binary signature for the Base64 representation of the encrypted payload"
   348    }
   349  }
   350  ```
   351  
   352  ### Field descriptions
   353  
   354  ### experiment ↠ pythonver
   355  
   356  The value for this tag must be an integer 2 or 3 for the specific python version requested by the experimenter.
   357  
   358  ### experiment ↠ args
   359  
   360  A list of the command line arguments to be supplied to the python interpreter that will be passed into the main of the running python job.
   361  
   362  ### experiment ↠ max\_duration
   363  
   364  The period of time that the experiment is permitted to run in a single attempt.  If this time is exceeded the runner can abandon the task at any point but it may continue to run for a short period.
   365  
   366  ### experiment ↠ filename
   367  
   368  The python file in which the experiment code is to be found.  This file should exist within the workspace artifact archive relative to the top level directory.
   369  
   370  ### experiment ↠ project
   371  
   372  All experiments should be assigned to a project.  The project identifier is a label assigned by the StudioML user and is specific to their purposes.
   373  
   374  ### experiment ↠ artifacts
   375  
   376  Artifacts are assigned labels, some labels have significance.  The workspace artifact should contain any python code that is needed, it may container other assets for the python code to run including configuration files etc.  The output artifact is used to identify where any logging and returned results will be archives to.
   377  
   378  Work that is sent to StudioML runners must have at least one workspace artifact consisting of the python code that will be run.  Artifacts are typically tar archives that contain not just python code but also any other data needed by the experiment being run.
   379  
   380  Before the experiment commences the artifact will be unrolled onto local disk of the container running it.  When unrolled the artifact label is used to name the peer directory into which any files are placed.
   381  
   382  The experiment when running will be placed into the workspace directory which contains the contents of the workspace labeled artifact.  Any other artifacts that were downloaded will be peer directories of the workspace directory.  Artifacts that were mutable and not available for downloading at the start of the experiment will results in empty peer directories that are named based on the label as well.
   383  
   384  Artifacts do not have any restriction on the size of the data they identify.
   385  
   386  The StudioML runner will download all artifacts that it can prior to starting an experiment.  Should any mutable artifacts be not available then they will be ignored and the experiment will continue.  If non-mutable artifacts are not found then the experiment will fail.
   387  
   388  Named non-mutable artifacts are subject to caching to reduce download times and network load.
   389  
   390  ### experiment ↠ artifacts ↠ [label] ↠ bucket
   391  
   392  The bucket identifies the cloud providers storage service bucket.  This value is not used when the go runner is running tasks.  This value is used by the python runner for configurations where the StudioML client is being run in proximoity to a StudioML configuration file.
   393  
   394  ### experiment ↠ artifacts ↠ [label] ↠ key
   395  
   396  The key identifies the cloud providers storage service key value for the artifact.  This value is not used when the go runner is running tasks.  This value is used by the python runner for configurations where the StudioML client is being run in proxiomity to a StudioML configuration file.
   397  
   398  ### experiment ↠ artifacts ↠ [label] ↠ qualified
   399  
   400  The qualified field contains a fully specified cloud storage platform reference that includes a schema used for selecting the storage platform implementation.  The host name is used within AWS to select the appropriate endpoint and region for the bucket, when using Minio this identifies the endpoint being used including the port number.  The URI path contains the bucket and file name (key in the case of AWS) for the artifact.
   401  
   402  If the artifact is mutable and will be returned to the S3 or Minio storage then the bucket MUST exist otherwise the experiment will fail.
   403  
   404  The environment section of the json payload is used to supply the needed credentials for the storage.  The go runner will be extended in future to allow the use of a user:password pair inside the URI to allow for multiple credentials on the cloud storage platform.
   405  
   406  ### experiment ↠ artifacts ↠ [label] ↠ mutable
   407  
   408  mutable is a true/false flag for identifying whether an artifact should be returned to the storage platform being used.  mutable artifacts that are not able to be downloaded at the start of an experiment will not cause the runner to terminate the experiment, non-mutable downloads that fail will lead to the experiment stopping.
   409  
   410  ### experiment ↠ artifacts ↠ [label] ↠ unpack
   411  
   412  unpack is a true/false flag that can be used to supress the tar or other compatible archive format archive within the artifact.
   413  
   414  ### experiment ↠ artifacts ↠ resources\_needed
   415  
   416  This section is a repeat of the experiment config resources_needed section, please ignore.
   417  
   418  ### experiment ↠ artifacts ↠ pythonenv
   419  
   420  This section encapsulates a json string array containing pip install dependencies and their versions.  The string elements in this array are a json rendering of what would typically appear in a pip requirements files.  The runner will unpack the frozen pip packages and will install them prior to the experiment running.  Any valid pip reference can be used except for private dependencies that require specialized authentication which is not supported by runners.  If a private dependency is needed then you should add the pip dependency as a file within an artifact and load the dependency in your python experiment implemention to protect it.
   421  
   422  ### experiment ↠ artifacts ↠  time added
   423  
   424  The time that the experiment was initially created expressed as a floating point number representing the seconds since the epoc started, January 1st 1970.
   425  
   426  ### experiment ↠ config
   427  
   428  The StudioML configuration file can be used to store parameters that are not processed by the StudioML client.  These values are passed to the runners and are not validated.  When present to the runner they can then be used to configure it or change its behavior.  If you implement your own runner then you can add values to the configuration file and they will then be placed into the config section of the json payload the runner receives.
   429  
   430  Running experiments that make use of Sentient ENN tooling or third party libraries will often require that framework specific configuration values be placed into this section.  Example of frameworks that use these values include the StudioML completion service, and evolutionary strategies used for numerical optimization.
   431  
   432  ### experiment ↠ config ↠ experimentLifetime
   433  
   434  This variable is used to inform the go runner of the date and time that the experiment should be considered to be dead and any work related to it should be abandoned or discarded.  This acts as a gaureentee that the client will no longer need to be concerned with the experiment and work can be requeued in the system, as one example, without fear of repeatition.
   435  
   436  The value is expressed as an integer followed by a unit, s,m,h.
   437  
   438  ### experiment ↠ config ↠ verbose
   439  
   440  verbose can be used to adjust the logging level for the runner and for StudioML components.  It has the following valid string values debug, info, warn, error, crit.
   441  
   442  ### experiment ↠ config ↠ saveWorkspaceFrequency
   443  
   444  On a regular basis the runner can upload any logs and intermediate results from the experiments mutable labelled artifact directories.  This variable can be used to set the interval at which these uploads are done.  The primary purpose of this variable is to speed up remote monitoring of intermediate output logging from the runner and the python code within the experiment.
   445  
   446  This variable is not intended to be used as a substitute for experiment checkpointing.
   447  
   448  ### experiment ↠ config ↠ database
   449  
   450  The database within StudioML is used to store meta-data that StudioML generates to describe experiments, projects and other useful material related to the progress of experiments such as the start time, owner.
   451  
   452  The database can point at blob storage or can be used with structured datastores should you wish to customize it.  The database is used in the event that the API server is launched by a user as a very simply way of accessing experiment and user details.
   453  
   454  ### experiment ↠ config ↠ database ↠ type
   455  
   456  This variable denotes the storage format being used by StudioML to store meta-data and supports three types within the open source offering, firebase, gcloud, s3.  Using s3 does allow other stores such as Azure blob storage when a bridging technology such as Minio is used.
   457  
   458  ### experiment ↠ config ↠ database ↠ authentication
   459  
   460  Not yet widely supported across the database types this variable supports either none, firebase, or github.  Currently its application is only to the gcloud, amnd firebase storage.  The go runner is intended for non vendor dependent implementations and uses the env variable seetings for the AWS authentication currently.  It is planned in the future that the authentication would make use of shortlived tokens using this field.
   461  
   462  ### experiment ↠ config ↠ database ↠ endpoint
   463  
   464  The endpoint variable is used to denote the S3 endpoint that is used to terminate API requests on.  This is used for both native S3 and minio support.  
   465  
   466  In the case of a native S3 deployment it will be one of the well known endpoints for S3 and should be biased to using the region specific endpoints for the buckets being used, an example for this use case would be 'http://s3-us-west-2.amazonaws.com'.
   467  
   468  In the case of minio this should point at the appropriate endpoint for the minio server along with the port being used, for example http://40.114.110.201:9000/.  If you wish to use HTTPS to increase security the runners deployed must have the appropriate root certificates installed and the certs on your minio server setup to reference one of the publically well known certificate authorities.
   469  
   470  ### experiment ↠ config ↠ database ↠ bucket
   471  
   472  The bucket variable denotes the bucket name being used and should be homed in the region that is configured using the endpoint and any AWS style environment variables captured in the environment variables section, 'env'.
   473  
   474  ### experiment ↠ config ↠ storage
   475  
   476  The storage area within StudioML is used to store the artifacts and assets that are created by the StudioML client.  The typical files placed into the storage are include any directories that are stored on the local workstation of the experimenter and need to be copied to a location that is available to runners.
   477  
   478  At a minimum when an experiment starts there will be an workspace artifact placed into the storage area.  Any artifacts placed into the storage will have a key that denotes the exact experiment and the name of the directory that was archived.
   479  
   480  Upon completion of the experiment the storage area will be updated with artifacts that are denoted as mutable and that have been changed.
   481  
   482  ### experiment ↠ config ↠ storage ↠ type
   483  
   484  This variable denotes the storage being used as either gs (google cloud storage), or s3.
   485  
   486  ### experiment ↠ config ↠ storage ↠ endpoint
   487  
   488  The endpoint variable is used to denote the S3 endpoint that is used to terminate API requests on.  This is used for both native S3 and minio support.
   489  
   490  In the case of a native S3 deployment it will be one of the well known endpoints for S3 and should be biased to using the region specific endpoints for the buckets being used, an example for this use case would be 'http://s3-us-west-2.amazonaws.com'.
   491  
   492  In the case of minio this should point at the appropriate endpoint for the minio server along with the port being used, for example http://40.114.110.201:9000/.  If you wish to use HTTPS to increase security the runners deployed must have the appropriate root certificates installed and the certs on your minio server setup to reference one of the publically well known certificate authorities.
   493  
   494  ### experiment ↠ config ↠ storage ↠ bucket
   495  
   496  The bucket variable denotes the bucket name being used and should be homed in the region that is configured using the endpoint.  In the case of AWS any AWS style environment variables captured in the environment variables section, 'env', will be used for authentication.
   497  
   498  When the experiment is being initiated within the StudioML client then local AWS environment variables will be used.  When the bucket is accessed by the runner then the authentication details captured inside this json payload will be used to download and upload any data.
   499  
   500  ### experiment ↠ config ↠ storage ↠ authentication
   501  
   502  Not yet widely supported across the database types this variable supports either none, firebase, or github.  Currently its application is only to the gcloud, amnd firebase storage.  The go runner is intended for non vendor dependent implementations and uses the env variable seetings for the AWS authentication currently.  It is planned in the future that the authentication would make use of shortlived tokens using this field.
   503  
   504  ### experiment ↠ config ↠ resources\_needed
   505  
   506  This section details the minimum hardware requirements needed to run the experiment.
   507  
   508  Values of the parameters in this section are either integers or integer units.  For units suffixes can include Mb, Gb, Tb for megabytes, gigabytes, or terrabytes.
   509  
   510  It should be noted that GPU resources are not virtualized and the requirements are hints to the scheduler only.  A project over committing resources will only affects its own experiments as GPU cards are not shared across projects.  CPU and RAM are virtualized by the container runtime and so are not as prone to abuse.
   511  
   512  ### experiment ↠ config ↠ resources\_needed ↠ hdd
   513  
   514  The minimum disk space required to run the experiment.
   515  
   516  ### experiment ↠ config ↠ resources\_needed ↠ cpus
   517  
   518  The number of CPU Cores that should be available for the experiments.  Remember this value does not account for the power of the CPU.  Consult your cluster operator or administrator for this information and adjust the number of cores to deal with the expectation you have for the hardware.
   519  
   520  ### experiment ↠ config ↠ resources\_needed ↠ ram
   521  
   522  The amount of free CPU RAM that is needed to run the experiment.  It should be noted that StudioML is design to run in a co-operative environment where tasks being sent to runners adequately describe their resource requirements and are scheduled based upon expect consumption.  Runners are free to implement their own strategies to deal with abusers.
   523  
   524  ### experiment ↠ config ↠ resources\_needed ↠ gpus
   525  
   526  gpus are counted as slots using the relative throughput of the physical hardware GPUs. GTX 1060's count as a single slot, GTX1070 is two slots, and a TitanX is considered to be four slots.  GPUs are not virtualized and so the go runner will pack the jobs from one experiment into one GPU device based on the slots.  Cards are not shared between different experiments to prevent noise between projects from affecting other projects.  If a project exceeds its resource consumption promise it will only impact itself.
   527  
   528  ### experiment ↠ config ↠ resources\_needed ↠ gpuMem
   529  
   530  The amount on onboard GPU memory the experiment will require.  Please see above notes concerning the use of GPU hardware.
   531  
   532  ### experiment ↠ config ↠ env
   533  
   534  This section contains a dictionary of environmnet variables and their values.  Prior to the experiment being initiated by the runner the environment table will be loaded.  The envrionment table is current used for AWS authentication for S3 access and so this section should contain as a minimum the AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY variables.  In the future the AWS credentials for the artifacts will be obtained from the artifact block.
   535  
   536  ### experiment ↠ config ↠ cloud ↠ queue ↠ rmq
   537  
   538  This variable will contain the rabbitMQ URI and configuration parameters if rabbitMQ was used by the system to queue this work.  The runner will ignore this value if it is passed through as it gets its queue information from the runner configuration store.
   539  
   540  Copyright © 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.