github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/how-tos/individual-developer-workflow.md

# Individual Developer Workflow

A typical Pachyderm workflow involves multiple iterations of
experimenting with your code and pipeline specs.

!!! info
    Before you read this section, make sure that you
    understand basic Pachyderm pipeline concepts described in
    [Concepts](../concepts/pipeline-concepts/index.md).

## How it works

Working with Pachyderm includes multiple iterations of the
following steps:

![Developer workflow](../assets/images/d_steps_analysis_pipeline.svg)

## Step 1: Write Your Analysis Code

Because Pachyderm is completely language-agnostic, the code
that processes data in Pachyderm can be written in any language
and can use any libraries of your choice. Whether your code is as
simple as a bash command or as complicated as a TensorFlow neural
network, it needs to be built with all the required dependencies
into a container that can run anywhere, including inside
of Pachyderm. See [Examples](https://github.com/pachyderm/pachyderm/tree/master/examples).

Your code does not have to import any special Pachyderm
functionality or libraries. However, it must meet the
following requirements:

* **Read files from a local file system**. Pachyderm automatically
  mounts each input data repository as `/pfs/<repo_name>` in the running
  containers of your Docker image. Therefore, the code that you write needs
  to read input data from this directory, similar to any other
  file system.

  Because Pachyderm automatically spreads data across parallel
  containers, your analysis code does not have to deal with data
  sharding or parallelization. For example, if you have four
  containers that run your Python code, Pachyderm automatically
  supplies 1/4 of the input data to `/pfs/<repo_name>` in
  each running container. You can adjust this workload balancing
  through tunable parameters in the pipeline specification.

* **Write files into a local file system**, such as saving results.
  Your code must write to the `/pfs/out` directory that Pachyderm
  mounts in all of your running containers. Similar to reading data,
  your code does not have to manage parallelization or sharding.

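As an illustration, both requirements above can be met with ordinary file I/O. The following Python sketch is a hypothetical transform: the repository name `data` and the word-count logic are assumptions, and the directories are parameterized (defaulting to the paths Pachyderm mounts) so the function can also run outside a pipeline.

```python
import os

def word_count(input_dir="/pfs/data", output_dir="/pfs/out"):
    """Hypothetical transform: count the words in each input file.

    Inside Pachyderm, input_dir is the mounted input repository
    (/pfs/<repo_name>) and output_dir is /pfs/out.
    """
    for name in os.listdir(input_dir):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            count = len(f.read().split())
        # Write results to the output directory, like any local file system.
        with open(os.path.join(output_dir, name), "w") as out:
            out.write(str(count))
```

Packaged into the image built in Step 2, a script like this would be invoked through the pipeline specification's `cmd` field.
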
## Step 2: Build Your Docker Image

When you create a Pachyderm pipeline, you need
to specify a Docker image that includes the code or binary that
you want to run. Therefore, every time you modify your code,
you need to build a new Docker image, push it to your image registry,
and update the image tag in the pipeline spec. This section
describes one way of building Docker images, but
if you have your own routine, feel free to use it instead.

To build an image, you need to create a `Dockerfile`. However, do not
use the `CMD` field in your `Dockerfile` to specify the commands that
you want to run. Instead, add them to the `cmd` field in your pipeline
specification. Pachyderm runs these commands inside the
container during job execution rather than relying on Docker
to run them. The reason is that Pachyderm cannot execute your code
immediately when your container starts, so it runs a shim process
in your container instead and then calls your pipeline
specification's `cmd` from there.

After building your image, you need to upload it to
a public or private image registry, such as
[Docker Hub](https://hub.docker.com).

Alternatively, you can use Pachyderm's built-in functionality to
tag, build, and push images by running the `pachctl update pipeline` command
with the `--build` and `--push-images` flags. For more information, see
[Update a Pipeline](updating_pipelines.md).

!!! note
    The `Dockerfile` example below is provided for your reference
    only. Your `Dockerfile` might look completely different.

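For instance, a minimal `Dockerfile` for a Python transform might look like the following sketch. The base image, dependency, and file name are assumptions; note that it contains no `CMD` instruction, because the command to run belongs in the pipeline specification's `cmd` field.

```dockerfile
FROM python:3.7-slim
# Install your code's dependencies into the image.
RUN pip install --no-cache-dir numpy
# Copy your analysis code into the container.
COPY word_count.py /word_count.py
# No CMD instruction: Pachyderm invokes the command from the
# pipeline specification's `cmd` field instead.
```
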
To build a Docker image, complete the following steps:

1. If you do not have a registry, create one with a preferred provider.
   If you decide to use Docker Hub, follow the [Docker Hub Quickstart](https://docs.docker.com/docker-hub/) to
   create a repository for your project.
1. Create a `Dockerfile` for your project. See the [OpenCV example](https://github.com/pachyderm/pachyderm/blob/master/examples/opencv/Dockerfile).
1. Log in to an image registry.

   * If you use Docker Hub, run:

     ```shell
     docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
     ```

1. Build a new image from the `Dockerfile` by specifying a tag:

   ```shell
   docker build -t <IMAGE>:<TAG> .
   ```

1. Push your image to your image registry.

   * If you use Docker Hub, run:

     ```shell
     docker push <IMAGE>:<TAG>
     ```

For more information about building Docker images, see
[Docker documentation](https://docs.docker.com/engine/tutorials/dockerimages/).

## Step 3: Create a Pipeline

Pachyderm's pipeline specifications store the configuration information
about the Docker image and code that Pachyderm should run. Pipeline
specifications are stored in JSON format. As soon as you create a pipeline,
Pachyderm immediately spins up a pod or pods on a Kubernetes worker node
in which your pipeline code runs. By default, after the pipeline finishes
running, the pods continue to run while waiting for new data to be
committed into the Pachyderm input repository. You can configure this
parameter, as well as many others, in the pipeline specification.

A minimum pipeline specification must include the following
parameters:

- `name`
- `transform`
- `parallelism`
- `input`

You can store your pipeline locally or in a remote location, such
as a GitHub repository.

To create a pipeline, complete the following steps:

1. Create a pipeline specification. Here is an example of a pipeline
   spec:

   ```shell
   # my-pipeline.json
   {
     "pipeline": {
       "name": "my-pipeline"
     },
     "transform": {
       "image": "my-pipeline-image",
       "cmd": ["/binary", "/pfs/data", "/pfs/out"]
     },
     "input": {
       "pfs": {
         "repo": "data",
         "glob": "/*"
       }
     }
   }
   ```

1. Create a Pachyderm pipeline from the spec:

   ```shell
   pachctl create pipeline -f my-pipeline.json
   ```

   You can specify a local file or a file stored in a remote
   location, such as a GitHub repository. For example,
   `https://raw.githubusercontent.com/pachyderm/pachyderm/master/examples/opencv/edges.json`.

!!! note "See Also:"

    - [Pipeline Specification](../reference/pipeline_spec.md)

<!-- - [Running Pachyderm in Production](TBA)-->