# Individual Developer Workflow

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../concepts/pipeline-concepts/index.md).

## How it works

Working with Pachyderm includes multiple iterations of the
following steps:

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that processes data in Pachyderm can be written in any language
and can use any libraries. Whether your code is as simple as a
bash command or as complicated as a TensorFlow neural network,
it needs to be built, with all of its required dependencies, into
a container that can run anywhere, including inside of Pachyderm.
See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements:

* **Read files from a local file system**. Pachyderm automatically
    mounts each input data repository as `/pfs/<repo_name>` in the
    running containers of your Docker image. Therefore, the code that
    you write needs to read input data from this directory, as it
    would from any other file system.

    Because Pachyderm automatically spreads data across parallel
    containers, your analysis code does not have to deal with data
    sharding or parallelization. For example, if you have four
    containers that run your Python code, Pachyderm automatically
    supplies 1/4 of the input data to `/pfs/<repo_name>` in
    each running container.
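    For example, if your analysis is as simple as a bash command, it
    might be a small script like the following sketch, which counts
    the words in every input file. The input repo name `data` and the
    word-count logic are illustrative only, not part of this guide:

    ```shell
    # Read every file that Pachyderm mounts under /pfs/data and write
    # one result file to /pfs/out. The directory arguments exist only
    # so that the sketch can be tried outside a container; inside
    # Pachyderm, the defaults are the directories mounted for you.
    count_words() {
        in_dir="${1:-/pfs/data}"
        out_dir="${2:-/pfs/out}"
        for f in "$in_dir"/*; do
            printf '%s\t%s\n' "$(basename "$f")" "$(wc -w < "$f")" >> "$out_dir/counts.txt"
        done
    }

    # Run only when the Pachyderm mount is actually present.
    if [ -d /pfs/data ]; then
        count_words
    fi
    ```

    Note that the script never talks to Pachyderm directly; it only
    reads and writes local paths.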
    These workload balancing settings
    can be adjusted as needed through Pachyderm tunable parameters
    in the pipeline specification.

* **Write files into a local file system**, such as saving results.
    Your code must write to the `/pfs/out` directory that Pachyderm
    mounts in all of your running containers. Similar to reading data,
    your code does not have to manage parallelization or sharding.

## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you have your own routine, feel free to apply it.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, add them to the `cmd` field in your pipeline
specification. Pachyderm runs these commands inside the
container during job execution rather than relying on Docker
to run them.
The reason is that Pachyderm cannot execute your code immediately when
your container starts, so it runs a shim process in your container
instead and then calls your pipeline specification's `cmd` from there.

After building your image, you need to upload the image into
a public or private image registry, such as
[DockerHub](https://hub.docker.com).

Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline` command
with the `--build` and `--push-images` flags. For more information, see
[Update a Pipeline](updating_pipelines.md).

!!! note
    The `Dockerfile` example below is provided for your reference
    only. Your `Dockerfile` might look completely different.

To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use DockerHub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Log in to an image registry.

    * If you use DockerHub, run:

        ```shell
        docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
        ```

1. Build a new image from the `Dockerfile` by specifying a tag:

    ```shell
    docker build -t <image>:<tag> .
    ```

1. Push your image to your image registry.

    * If you use DockerHub, run:

        ```shell
        docker push <image>:<tag>
        ```

For more information about building Docker images, see the
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Load Your Data to Pachyderm

You need to add your data to Pachyderm so that your pipeline runs your code
against it. You can do so by using one of the following methods:

* By using the `pachctl put file` command
* By using a special type of pipeline, such as a spout or cron
* By using one of Pachyderm's [language clients](../reference/clients/)
* By using a compatible S3 client
* By using the Pachyderm UI (Enterprise version or free trial)

For more information, see [Load Your Data Into Pachyderm](../load-data-into-pachyderm/).

## Step 4: Create a Pipeline

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run.
Pipeline
specifications are stored in JSON format. As soon as you create a pipeline,
Pachyderm immediately spins up a pod or pods on a Kubernetes worker node
in which the pipeline code runs. By default, after the pipeline finishes
running, the pods continue to run while waiting for new data to be
committed into the Pachyderm input repository. You can configure this
behavior, as well as many other parameters, in the pipeline specification.

A standard pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

!!! note
    Some special types of pipelines, such as a spout pipeline, do not
    require you to specify all of these parameters.

You can store your pipeline spec locally or in a remote location, such
as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec:

    ```shell
    # my-pipeline.json
    {
      "pipeline": {
        "name": "my-pipeline"
      },
      "transform": {
        "image": "my-pipeline-image",
        "cmd": ["/binary", "/pfs/data", "/pfs/out"]
      },
      "input": {
        "pfs": {
          "repo": "data",
          "glob": "/*"
        }
      }
    }
    ```

1. Create a Pachyderm pipeline from the spec:

    ```shell
    pachctl create pipeline -f my-pipeline.json
    ```

    You can specify a local file or a file stored in a remote
    location, such as a GitHub repository. For example,
    `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.

!!! note "See Also:"
    - [Pipeline Specification](../reference/pipeline_spec.md)

<!-- - [Running Pachyderm in Production](TBA)-->
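Putting the four steps together, one full iteration of this workflow
might look like the following shell sketch. The registry, image, repo,
and file names are illustrative placeholders, not values prescribed by
this guide:

```shell
# Step 2: rebuild and push the image that contains your analysis code.
docker build -t <registry>/my-pipeline-image:v1 .
docker push <registry>/my-pipeline-image:v1

# Step 3: create an input repo and commit a file into it.
pachctl create repo data
pachctl put file data@master:/sample.txt -f sample.txt

# Step 4: create the pipeline; Pachyderm starts a job against the new data.
pachctl create pipeline -f my-pipeline.json

# Watch the job and, when it finishes, list the output files.
pachctl list job
pachctl list file my-pipeline@master
```

On the next iteration, you typically rebuild and push the image with a
new tag, update the tag in the pipeline spec, and run
`pachctl update pipeline -f my-pipeline.json` instead of `create`.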