github.com/sentienttechnologies/studio-go-runner@v0.0.0-20201118202441-6d21f2ced8ee/docs/interface.md (about) 1 # Interfacing and Integration 2 3 This document describes the interface, and interchange format used between the StudioML client and runners that process StudioML experiments. 4 5 <!--ts--> 6 7 Table of Contents 8 ================= 9 10 * [Interfacing and Integration](#interfacing-and-integration) 11 * [Table of Contents](#table-of-contents) 12 * [Introduction](#introduction) 13 * [Audience](#audience) 14 * [Runners](#runners) 15 * [Queuing](#queuing) 16 * [Experiment Lifecycle](#experiment-lifecycle) 17 * [Payloads](#payloads) 18 * [Encrypted payloads](#encrypted-payloads) 19 * [Signed payloads](#signed-payloads) 20 * [Field descriptions](#field-descriptions) 21 * [experiment ↠ pythonver](#experiment--pythonver) 22 * [experiment ↠ args](#experiment--args) 23 * [experiment ↠ max_duration](#experiment--max_duration) 24 * [experiment ↠ filename](#experiment--filename) 25 * [experiment ↠ project](#experiment--project) 26 * [experiment ↠ artifacts](#experiment--artifacts) 27 * [experiment ↠ artifacts ↠ [label] ↠ bucket](#experiment--artifacts--label--bucket) 28 * [experiment ↠ artifacts ↠ [label] ↠ key](#experiment--artifacts--label--key) 29 * [experiment ↠ artifacts ↠ [label] ↠ qualified](#experiment--artifacts--label--qualified) 30 * [experiment ↠ artifacts ↠ [label] ↠ mutable](#experiment--artifacts--label--mutable) 31 * [experiment ↠ artifacts ↠ [label] ↠ unpack](#experiment--artifacts--label--unpack) 32 * [experiment ↠ artifacts ↠ resources_needed](#experiment--artifacts--resources_needed) 33 * [experiment ↠ artifacts ↠ pythonenv](#experiment--artifacts--pythonenv) 34 * [experiment ↠ artifacts ↠ time added](#experiment--artifacts---time-added) 35 * [experiment ↠ config](#experiment--config) 36 * [experiment ↠ config ↠ experimentLifetime](#experiment--config--experimentlifetime) 37 * [experiment ↠ config ↠ verbose](#experiment--config--verbose) 38 * [experiment ↠ config ↠ saveWorkspaceFrequency](#experiment--config--saveworkspacefrequency) 39 * [experiment ↠ config ↠ database](#experiment--config--database) 40 * [experiment ↠ config ↠ database ↠ type](#experiment--config--database--type) 41 * [experiment ↠ config ↠ database ↠ authentication](#experiment--config--database--authentication) 42 * [experiment ↠ config ↠ database ↠ endpoint](#experiment--config--database--endpoint) 43 * [experiment ↠ config ↠ database ↠ bucket](#experiment--config--database--bucket) 44 * [experiment ↠ config ↠ storage](#experiment--config--storage) 45 * [experiment ↠ config ↠ storage ↠ type](#experiment--config--storage--type) 46 * [experiment ↠ config ↠ storage ↠ endpoint](#experiment--config--storage--endpoint) 47 * [experiment ↠ config ↠ storage ↠ bucket](#experiment--config--storage--bucket) 48 * [experiment ↠ config ↠ storage ↠ authentication](#experiment--config--storage--authentication) 49 * [experiment ↠ config ↠ resources_needed](#experiment--config--resources_needed) 50 * [experiment ↠ config ↠ resources_needed ↠ hdd](#experiment--config--resources_needed--hdd) 51 * [experiment ↠ config ↠ resources_needed ↠ cpus](#experiment--config--resources_needed--cpus) 52 * [experiment ↠ config ↠ resources_needed ↠ ram](#experiment--config--resources_needed--ram) 53 * [experiment ↠ config ↠ resources_needed ↠ gpus](#experiment--config--resources_needed--gpus) 54 * [experiment ↠ config ↠ resources_needed ↠ gpuMem](#experiment--config--resources_needed--gpumem) 55 * [experiment ↠ config ↠ env](#experiment--config--env) 56 * [experiment ↠ config ↠ cloud ↠ queue ↠ rmq](#experiment--config--cloud--queue--rmq) 57 <!--te--> 58 59 ## Introduction 60 61 StudioML has two major modules. 62 63 . The client, or front end, that shepherds experiments on behalf of users and packaging up experiments that are then placed on to a queue using json messages 64 65 . The runner that receives json formatted messages on a message queue and then runs the experiment they describe 66 67 There are other tools that StudioML offers for reporting and management of experiment artifacts that are not within the scope of this document. 68 69 It is not yet within the scope of this document to describe how data outside of the queuing interface is stored and formatted. 70 71 ## Audience 72 73 This document is intended for developers who wish to implement runners to process StudioML work, or implement clients that generate work for StudioML runners. 74 75 ## Runners 76 77 This project implements a StudioML runner, however it is not specific to StudioML. This runner could be used to deliver and execute and python code within a virtualenv that the runner supplies. 78 79 Any standard runners can accept a standalone virtualenv with no associated container. The go runner, this present project, has been extended to allow clients to also send work that has a Singularity container specified. 80 81 In the first case, virtualenv only, the runner implcitly trusts that any work received is trusted and is not malicous. In this mode the runner makes not attempt to protect the integrity of the host it is deployed into. 82 83 In the second case if a container is specified it will be used to launch work and the runner will rely upon the container runtime to prevent leakage into the host. 84 85 ## Queuing 86 87 The StudioML eco system relies upon a message queue to buffer work being sent by the StudioML client to any arbitrary runner that is subscribed to the experimenters choosen queuing service. StudioML support multiple queuing technologies including, AWS SQS, local file system, and RabbitMQ. The reference implementation is RabbitMQ for the purposes of this present project. The go runner project supports SQS, and RabbitMQ. 88 89 Additional queuing technologies can be added if desired to the StudioML (https://github.com/studioml/studio.git), and go runner (https://github.com/SentientTechnologies/studio-go-runner.git) code bases and a pull request submitted. 90 91 When using a queue the StudioML eco system relies upon a reliable, at-least-once, messaging system. An additional requirement for queuing systems is that if the worker disappears, or work is not reclaimed by the worker as progress is made that the work is requeued by the broker automatically. 92 93 ## Experiment Lifecycle 94 95 If you have had a chance to run some of the example experiments within the StudioML github repository then you will have noticed a keras example. The keras example is used to initiate a single experiment that queues work for a single runner and then immediately returns to the command line prompt without waiting for a result. Experiments run in this way rely on the user to monitor their cloud storage bucket and look for the output.tar file in a directory named after their experiment. For simple examples and tests this is a quick but manual way to work. 96 97 In more complex experiments there might be multiple phases to a project that is being run. Each experiment might represent an individual in for example evolutionary computation. The python software running the project might want to send potentially hundreds of experiments, or individuals to the runners and then wait for these to complete. Once complete it might select individuals that scored highly, using as one example a fitness screen. The python StudioML client might then generate a new population that are then marshall individuals from the population into experiments, repeating this cycle potentially for days. 98 99 To address the need for longer running experiments StudioML offers a number of python classes within the open source distribution that allows this style of longer running taining scenarios to be implemented by researchers and engineers. The combination of completion service and session server classes can be used to create these long running StudioML compliant clients. 100 101 Completion service based applications that use the StudioML classes generate work in exactly the same way as the CLI based 'studio run' command. Session servers are an implementation of a completion service combined with logic that once experiments are queued will on a regular interval examine the cloud storage folders for returned archives that runners have rolled up when they either save experiment workspaces, or at the conclusion of the experiment find that the python experiment code had generated files in directories identified as a part of the queued job. After the requisite numer of experiments are deemed to have finished based on the storage server bucket contents the session server can then examine the uploaded artifacts and determine their next set of training steps. 102 103 ## Payloads 104 105 The following figure shows an example of a job sent from the studioML front end to the runner. The runner does not always make use of the entire set of json tags, typically a limited but consistent subset of tags are used. This format is a clear text format, please see below for notes regarding the encrypted format. 106 107 ```json 108 { 109 "experiment": { 110 "status": "waiting", 111 "time_finished": null, 112 "git": null, 113 "key": "1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92", 114 "time_last_checkpoint": 1530054414.027222, 115 "pythonver": "3.6", 116 "metric": null, 117 "args": [ 118 "10" 119 ], 120 "max_duration": "20m", 121 "filename": "train_cifar10.py", 122 "project": null, 123 "artifacts": { 124 "output": { 125 "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output", 126 "bucket": "kmutch-rmq", 127 "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output.tar", 128 "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/output.tar", 129 "mutable": true, 130 "unpack": true 131 }, 132 "_metrics": { 133 "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics", 134 "bucket": "kmutch-rmq", 135 "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics.tar", 136 "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/_metrics.tar", 137 "mutable": true, 138 "unpack": true 139 }, 140 "modeldir": { 141 "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir", 142 "bucket": "kmutch-rmq", 143 "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir.tar", 144 "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/modeldir.tar", 145 "mutable": true, 146 "unpack": true 147 }, 148 "workspace": { 149 "local": "/home/kmutch/studio/examples/keras", 150 "bucket": "kmutch-rmq", 151 "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/blobstore/419411b17e9c851852735901a17bd6d20188cee30a0b589f1bf1ca5b487930b5.tar 152 ", 153 "key": "blobstore/419411b17e9c851852735901a17bd6d20188cee30a0b589f1bf1ca5b487930b5.tar", 154 "mutable": false, 155 "unpack": true 156 }, 157 "tb": { 158 "local": "/home/kmutch/.studioml/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb", 159 "bucket": "kmutch-rmq", 160 "qualified": "s3://s3-us-west-2.amazonaws.com/kmutch-rmq/experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb.tar", 161 "key": "experiments/1530054412_70d7eaf4-3ce3-493a-a8f6-ffa0212a5c92/tb.tar", 162 "mutable": true, 163 "unpack": true 164 } 165 "info": {}, 166 "resources_needed": { 167 "hdd": "3gb", 168 "gpus": 1, 169 "ram": "2gb", 170 "cpus": 1, 171 "gpuMem": "4gb" 172 }, 173 "pythonenv": [ 174 "APScheduler==3.5.1", 175 "argparse==1.2.1", 176 "asn1crypto==0.24.0", 177 "attrs==17.4.0", 178 "autopep8==1.3.5", 179 "awscli==1.15.4", 180 "boto3==1.7.4", 181 "botocore==1.10.4", 182 ... 183 "six==1.11.0", 184 "sseclient==0.0.19", 185 "-e git+https://github.com/SentientTechnologies/studio@685f4891764227a2e1ea5f7fc91b31dcf3557647#egg=studioml", 186 "terminaltables==3.1.0", 187 "timeout-decorator==0.4.0", 188 "tzlocal==1.5.1", 189 "uritemplate==3.0.0", 190 "urllib3==1.22", 191 "Werkzeug==0.14.1", 192 "wheel==0.31.0", 193 "wsgiref==0.1.2" 194 ], 195 "owner": "guest", 196 "time_added": 1530054413.134781, 197 "time_started": null 198 }, 199 "config": { 200 "optimizer": { 201 "visualization": true, 202 "load_checkpoint_file": null, 203 "cmaes_config": { 204 "load_best_only": false, 205 "popsize": 100, 206 "sigma0": 0.25 207 }, 208 "termination_criterion": { 209 "generation": 5, 210 "fitness": 999, 211 "skip_gen_timeout": 30, 212 "skip_gen_thres": 1 213 }, 214 }, 215 "result_dir": "~/Desktop/", 216 "checkpoint_interval": 0 217 }, 218 "verbose": "debug", 219 "saveWorkspaceFrequency": "3m", 220 "database": { 221 "type": "s3", 222 "authentication": "none", 223 "endpoint": "http://s3-us-west-2.amazonaws.com", 224 "bucket": "kmutch-metadata" 225 }, 226 "runner": { 227 "slack_destination": "@karl.mutch" 228 }, 229 "storage": { 230 "type": "s3", 231 "endpoint": "http://s3-us-west-2.amazonaws.com", 232 "bucket": "kmutch-rmq" 233 }, 234 "server": { 235 "authentication": "None" 236 }, 237 "env": { 238 "PATH": "%PATH%:./bin", 239 "AWS_DEFAULT_REGION": "us-west-2", 240 "AWS_ACCESS_KEY_ID": "AKZAIE5G7Q2GZC3OMTYW", 241 "AWS_SECRET_ACCESS_KEY": "rt43wqJ/w5aqAPat659gkkYpphnOFxXejsCBq" 242 }, 243 "cloud": { 244 "queue": { 245 "rmq": "amqp://user:password@10.230.72.19:5672/%2f?connection_attempts=30&retry_delay=.5&socket_timeout=5" 246 } 247 } 248 } 249 } 250 ``` 251 252 ### Encrypted payloads 253 254 In the event that message level encryption is enabled then the payload format will vary from the clear-text format. The encrypted format will retain a very few blocks in clear-text to assist in scheduling, the status, pythonver, experiment_lifetime, time_added, and the resources needed blocks as in the following example. All other fragments will be rolled up into an encrypted_data block, consisting of Base64 encoded data. The fields used within the clear-text header retain the same purpose and meaning as those in the Request documented in the [Field Descriptions](#field-descriptions) section 255 256 Encrypted payloads use a hybrid cryptosystem, for a detailed description please see https://en.wikipedia.org/wiki/Hybrid_cryptosystem. 257 258 A detailed description of the StudioML implementation of this system can be found in the [docs/message_privacy.md](docs/message_privacy.md) documentation. 259 260 The following figures shows an example of the clear-text headers and the encrypted payload portion of a message: 261 262 ```json 263 { 264 "message": { 265 "experiment": { 266 "status": "waiting", 267 "pythonver": "3.6", 268 }, 269 "time_added": 1530054413.134781, 270 "experiment_lifetime": "30m", 271 "resources_needed": { 272 "gpus": 1, 273 "hdd": "3gb", 274 "ram": "2gb", 275 "cpus": 1, 276 "gpuMem": "4gb" 277 }, 278 "payload": "Full Base64 encrypted payload" 279 } 280 } 281 ``` 282 283 The encrypted format will retain a very few blocks in clear-text to assist in scheduling, the status, pythonver, experiment_lifetime, time_added, and the resources needed blocks as in the following example. All other fragments will be rolled up into an encrypted_data block, consisting of Base64 encoded data. The fields used within the clear-text header retain the same purpose and meaning as those in the Request documented in the [Field Descriptions](#field-descriptions) section 284 285 Encrypted payloads use a hybrid cryptosystem, for a detailed description please see https://en.wikipedia.org/wiki/Hybrid_cryptosystem. 286 287 A detailed description of the StudioML implementation of this system can be found in the [message_privacy](docs/message_privacy.md) documentation. 288 289 The following figures shows an example of the clear-text headers and the encrypted payload portion of a message: 290 291 ```json 292 { 293 "message": { 294 "experiment": { 295 "status": "waiting", 296 "pythonver": "3.6", 297 }, 298 "time_added": 1530054413.134781, 299 "experiment_lifetime": "30m", 300 "resources_needed": { 301 "gpus": 1, 302 "hdd": "3gb", 303 "ram": "2gb", 304 "cpus": 1, 305 "gpuMem": "4gb" 306 }, 307 "payload": "Full Base64 encrypted payload" 308 } 309 } 310 ``` 311 312 The encrypted payload should consist of a 24 byte nonce, and then the users encrypted data. 313 314 When processing messages runners can use the clear-text JSON in an advisory capacity to determine if messages are useful before decrypting their contents, however once decrypted messages will be re-evaluated using the decrypted contents only. The clear-text portions of the message will be ignored post decryption. 315 316 Private keys and passphrases are provisioned on compute clusters using the Kubernetes secrets service and stored encrypted within etcd when the go runner is used. 317 318 ### Signed payloads 319 320 Message signing is a way of protecting the runner receiving messages from processing spoofed requests. To prevent this the runner can be configured to read public key information from Kubernetes secrets and then to use this to validate messages that are being received. The configuration information for the runner signing keys is detailed in the [message\_privacy.md](message_privacy.md) file. 321 322 Message signing must be used in combination with message encryption features described in the previous section. 323 324 The format of the signature that is transmitted using the StudioML message signature field consists of the Base64 encoded signature blob, encoded from the binary 64 byte signature. 325 326 The signing information is encoded into two JSON elements, the fingerprint and signature elements, for example: 327 328 ``` 329 ```json 330 { 331 "message": { 332 "experiment": { 333 "status": "waiting", 334 "pythonver": "3.6", 335 }, 336 "time_added": 1530054413.134781, 337 "experiment_lifetime": "30m", 338 "resources_needed": { 339 "gpus": 1, 340 "hdd": "3gb", 341 "ram": "2gb", 342 "cpus": 1, 343 "gpuMem": "4gb" 344 }, 345 "payload": "Full Base64 encrypted payload", 346 "fingerprint": "Base64 of sha256 binary fingerprint", 347 "signature": "Base64 encoded binary signature for the Base64 representation of the encrypted payload" 348 } 349 } 350 ``` 351 352 ### Field descriptions 353 354 ### experiment ↠ pythonver 355 356 The value for this tag must be an integer 2 or 3 for the specific python version requested by the experimenter. 357 358 ### experiment ↠ args 359 360 A list of the command line arguments to be supplied to the python interpreter that will be passed into the main of the running python job. 361 362 ### experiment ↠ max\_duration 363 364 The period of time that the experiment is permitted to run in a single attempt. If this time is exceeded the runner can abandon the task at any point but it may continue to run for a short period. 365 366 ### experiment ↠ filename 367 368 The python file in which the experiment code is to be found. This file should exist within the workspace artifact archive relative to the top level directory. 369 370 ### experiment ↠ project 371 372 All experiments should be assigned to a project. The project identifier is a label assigned by the StudioML user and is specific to their purposes. 373 374 ### experiment ↠ artifacts 375 376 Artifacts are assigned labels, some labels have significance. The workspace artifact should contain any python code that is needed, it may container other assets for the python code to run including configuration files etc. The output artifact is used to identify where any logging and returned results will be archives to. 377 378 Work that is sent to StudioML runners must have at least one workspace artifact consisting of the python code that will be run. Artifacts are typically tar archives that contain not just python code but also any other data needed by the experiment being run. 379 380 Before the experiment commences the artifact will be unrolled onto local disk of the container running it. When unrolled the artifact label is used to name the peer directory into which any files are placed. 381 382 The experiment when running will be placed into the workspace directory which contains the contents of the workspace labeled artifact. Any other artifacts that were downloaded will be peer directories of the workspace directory. Artifacts that were mutable and not available for downloading at the start of the experiment will results in empty peer directories that are named based on the label as well. 383 384 Artifacts do not have any restriction on the size of the data they identify. 385 386 The StudioML runner will download all artifacts that it can prior to starting an experiment. Should any mutable artifacts be not available then they will be ignored and the experiment will continue. If non-mutable artifacts are not found then the experiment will fail. 387 388 Named non-mutable artifacts are subject to caching to reduce download times and network load. 389 390 ### experiment ↠ artifacts ↠ [label] ↠ bucket 391 392 The bucket identifies the cloud providers storage service bucket. This value is not used when the go runner is running tasks. This value is used by the python runner for configurations where the StudioML client is being run in proximoity to a StudioML configuration file. 393 394 ### experiment ↠ artifacts ↠ [label] ↠ key 395 396 The key identifies the cloud providers storage service key value for the artifact. This value is not used when the go runner is running tasks. This value is used by the python runner for configurations where the StudioML client is being run in proxiomity to a StudioML configuration file. 397 398 ### experiment ↠ artifacts ↠ [label] ↠ qualified 399 400 The qualified field contains a fully specified cloud storage platform reference that includes a schema used for selecting the storage platform implementation. The host name is used within AWS to select the appropriate endpoint and region for the bucket, when using Minio this identifies the endpoint being used including the port number. The URI path contains the bucket and file name (key in the case of AWS) for the artifact. 401 402 If the artifact is mutable and will be returned to the S3 or Minio storage then the bucket MUST exist otherwise the experiment will fail. 403 404 The environment section of the json payload is used to supply the needed credentials for the storage. The go runner will be extended in future to allow the use of a user:password pair inside the URI to allow for multiple credentials on the cloud storage platform. 405 406 ### experiment ↠ artifacts ↠ [label] ↠ mutable 407 408 mutable is a true/false flag for identifying whether an artifact should be returned to the storage platform being used. mutable artifacts that are not able to be downloaded at the start of an experiment will not cause the runner to terminate the experiment, non-mutable downloads that fail will lead to the experiment stopping. 409 410 ### experiment ↠ artifacts ↠ [label] ↠ unpack 411 412 unpack is a true/false flag that can be used to supress the tar or other compatible archive format archive within the artifact. 413 414 ### experiment ↠ artifacts ↠ resources\_needed 415 416 This section is a repeat of the experiment config resources_needed section, please ignore. 417 418 ### experiment ↠ artifacts ↠ pythonenv 419 420 This section encapsulates a json string array containing pip install dependencies and their versions. The string elements in this array are a json rendering of what would typically appear in a pip requirements files. The runner will unpack the frozen pip packages and will install them prior to the experiment running. Any valid pip reference can be used except for private dependencies that require specialized authentication which is not supported by runners. If a private dependency is needed then you should add the pip dependency as a file within an artifact and load the dependency in your python experiment implemention to protect it. 421 422 ### experiment ↠ artifacts ↠ time added 423 424 The time that the experiment was initially created expressed as a floating point number representing the seconds since the epoc started, January 1st 1970. 425 426 ### experiment ↠ config 427 428 The StudioML configuration file can be used to store parameters that are not processed by the StudioML client. These values are passed to the runners and are not validated. When present to the runner they can then be used to configure it or change its behavior. If you implement your own runner then you can add values to the configuration file and they will then be placed into the config section of the json payload the runner receives. 429 430 Running experiments that make use of Sentient ENN tooling or third party libraries will often require that framework specific configuration values be placed into this section. Example of frameworks that use these values include the StudioML completion service, and evolutionary strategies used for numerical optimization. 431 432 ### experiment ↠ config ↠ experimentLifetime 433 434 This variable is used to inform the go runner of the date and time that the experiment should be considered to be dead and any work related to it should be abandoned or discarded. This acts as a gaureentee that the client will no longer need to be concerned with the experiment and work can be requeued in the system, as one example, without fear of repeatition. 435 436 The value is expressed as an integer followed by a unit, s,m,h. 437 438 ### experiment ↠ config ↠ verbose 439 440 verbose can be used to adjust the logging level for the runner and for StudioML components. It has the following valid string values debug, info, warn, error, crit. 441 442 ### experiment ↠ config ↠ saveWorkspaceFrequency 443 444 On a regular basis the runner can upload any logs and intermediate results from the experiments mutable labelled artifact directories. This variable can be used to set the interval at which these uploads are done. The primary purpose of this variable is to speed up remote monitoring of intermediate output logging from the runner and the python code within the experiment. 445 446 This variable is not intended to be used as a substitute for experiment checkpointing. 447 448 ### experiment ↠ config ↠ database 449 450 The database within StudioML is used to store meta-data that StudioML generates to describe experiments, projects and other useful material related to the progress of experiments such as the start time, owner. 451 452 The database can point at blob storage or can be used with structured datastores should you wish to customize it. The database is used in the event that the API server is launched by a user as a very simply way of accessing experiment and user details. 453 454 ### experiment ↠ config ↠ database ↠ type 455 456 This variable denotes the storage format being used by StudioML to store meta-data and supports three types within the open source offering, firebase, gcloud, s3. Using s3 does allow other stores such as Azure blob storage when a bridging technology such as Minio is used. 457 458 ### experiment ↠ config ↠ database ↠ authentication 459 460 Not yet widely supported across the database types this variable supports either none, firebase, or github. Currently its application is only to the gcloud, amnd firebase storage. The go runner is intended for non vendor dependent implementations and uses the env variable seetings for the AWS authentication currently. It is planned in the future that the authentication would make use of shortlived tokens using this field. 461 462 ### experiment ↠ config ↠ database ↠ endpoint 463 464 The endpoint variable is used to denote the S3 endpoint that is used to terminate API requests on. This is used for both native S3 and minio support. 465 466 In the case of a native S3 deployment it will be one of the well known endpoints for S3 and should be biased to using the region specific endpoints for the buckets being used, an example for this use case would be 'http://s3-us-west-2.amazonaws.com'. 467 468 In the case of minio this should point at the appropriate endpoint for the minio server along with the port being used, for example http://40.114.110.201:9000/. If you wish to use HTTPS to increase security the runners deployed must have the appropriate root certificates installed and the certs on your minio server setup to reference one of the publically well known certificate authorities. 469 470 ### experiment ↠ config ↠ database ↠ bucket 471 472 The bucket variable denotes the bucket name being used and should be homed in the region that is configured using the endpoint and any AWS style environment variables captured in the environment variables section, 'env'. 473 474 ### experiment ↠ config ↠ storage 475 476 The storage area within StudioML is used to store the artifacts and assets that are created by the StudioML client. The typical files placed into the storage are include any directories that are stored on the local workstation of the experimenter and need to be copied to a location that is available to runners. 477 478 At a minimum when an experiment starts there will be an workspace artifact placed into the storage area. Any artifacts placed into the storage will have a key that denotes the exact experiment and the name of the directory that was archived. 479 480 Upon completion of the experiment the storage area will be updated with artifacts that are denoted as mutable and that have been changed. 481 482 ### experiment ↠ config ↠ storage ↠ type 483 484 This variable denotes the storage being used as either gs (google cloud storage), or s3. 485 486 ### experiment ↠ config ↠ storage ↠ endpoint 487 488 The endpoint variable is used to denote the S3 endpoint that is used to terminate API requests on. This is used for both native S3 and minio support. 489 490 In the case of a native S3 deployment it will be one of the well known endpoints for S3 and should be biased to using the region specific endpoints for the buckets being used, an example for this use case would be 'http://s3-us-west-2.amazonaws.com'. 491 492 In the case of minio this should point at the appropriate endpoint for the minio server along with the port being used, for example http://40.114.110.201:9000/. If you wish to use HTTPS to increase security the runners deployed must have the appropriate root certificates installed and the certs on your minio server setup to reference one of the publically well known certificate authorities. 493 494 ### experiment ↠ config ↠ storage ↠ bucket 495 496 The bucket variable denotes the bucket name being used and should be homed in the region that is configured using the endpoint. In the case of AWS any AWS style environment variables captured in the environment variables section, 'env', will be used for authentication. 497 498 When the experiment is being initiated within the StudioML client then local AWS environment variables will be used. When the bucket is accessed by the runner then the authentication details captured inside this json payload will be used to download and upload any data. 499 500 ### experiment ↠ config ↠ storage ↠ authentication 501 502 Not yet widely supported across the database types this variable supports either none, firebase, or github. Currently its application is only to the gcloud, amnd firebase storage. The go runner is intended for non vendor dependent implementations and uses the env variable seetings for the AWS authentication currently. It is planned in the future that the authentication would make use of shortlived tokens using this field. 503 504 ### experiment ↠ config ↠ resources\_needed 505 506 This section details the minimum hardware requirements needed to run the experiment. 507 508 Values of the parameters in this section are either integers or integer units. For units suffixes can include Mb, Gb, Tb for megabytes, gigabytes, or terrabytes. 509 510 It should be noted that GPU resources are not virtualized and the requirements are hints to the scheduler only. A project over committing resources will only affects its own experiments as GPU cards are not shared across projects. CPU and RAM are virtualized by the container runtime and so are not as prone to abuse. 511 512 ### experiment ↠ config ↠ resources\_needed ↠ hdd 513 514 The minimum disk space required to run the experiment. 515 516 ### experiment ↠ config ↠ resources\_needed ↠ cpus 517 518 The number of CPU Cores that should be available for the experiments. Remember this value does not account for the power of the CPU. Consult your cluster operator or administrator for this information and adjust the number of cores to deal with the expectation you have for the hardware. 519 520 ### experiment ↠ config ↠ resources\_needed ↠ ram 521 522 The amount of free CPU RAM that is needed to run the experiment. It should be noted that StudioML is design to run in a co-operative environment where tasks being sent to runners adequately describe their resource requirements and are scheduled based upon expect consumption. Runners are free to implement their own strategies to deal with abusers. 523 524 ### experiment ↠ config ↠ resources\_needed ↠ gpus 525 526 gpus are counted as slots using the relative throughput of the physical hardware GPUs. GTX 1060's count as a single slot, GTX1070 is two slots, and a TitanX is considered to be four slots. GPUs are not virtualized and so the go runner will pack the jobs from one experiment into one GPU device based on the slots. Cards are not shared between different experiments to prevent noise between projects from affecting other projects. If a project exceeds its resource consumption promise it will only impact itself. 527 528 ### experiment ↠ config ↠ resources\_needed ↠ gpuMem 529 530 The amount on onboard GPU memory the experiment will require. Please see above notes concerning the use of GPU hardware. 531 532 ### experiment ↠ config ↠ env 533 534 This section contains a dictionary of environmnet variables and their values. Prior to the experiment being initiated by the runner the environment table will be loaded. The envrionment table is current used for AWS authentication for S3 access and so this section should contain as a minimum the AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY variables. In the future the AWS credentials for the artifacts will be obtained from the artifact block. 535 536 ### experiment ↠ config ↠ cloud ↠ queue ↠ rmq 537 538 This variable will contain the rabbitMQ URI and configuration parameters if rabbitMQ was used by the system to queue this work. The runner will ignore this value if it is passed through as it gets its queue information from the runner configuration store. 539 540 Copyright © 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.