# How to debug an E2E test for Kubeflow Training Operator

TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)

[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guide focuses on debugging e2e test failures.

## Prerequisites

1. Install Python 3.7

2. Clone the `kubeflow/testing` repo under `$GOPATH/src/github.com/kubeflow/`

3. Install [ksonnet](https://ksonnet.io/)

```
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
tar -xvzf ks_0.13.1_linux_amd64.tar.gz
sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/ks-13
```

> We would like to deprecate `ksonnet`, but it may take some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested in it.
> If your platform is darwin or windows, download the binaries from the [ksonnet v0.13.1](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1) release page.

4. Deploy the HEAD version of the training operator in your environment

```
# Use the same image tag in all three commands (1462 is an example PR ID)
IMG=kubeflow/training-operator:e2e-debug-1462 make docker-build

# Optional - load the image into the kind cluster if you are using kind
kind load docker-image kubeflow/training-operator:e2e-debug-1462

kubectl set image deployment.v1.apps/training-operator training-operator=kubeflow/training-operator:e2e-debug-1462
```

## Run E2E Tests locally

1. Set environment variables

```
export KUBEFLOW_PATH=$GOPATH/src/github.com/kubeflow
export KUBEFLOW_TRAINING_REPO=$KUBEFLOW_PATH/training-operator
export KUBEFLOW_TESTING_REPO=$KUBEFLOW_PATH/testing
export PYTHONPATH=$KUBEFLOW_TRAINING_REPO:$KUBEFLOW_TRAINING_REPO/py:$KUBEFLOW_TESTING_REPO/py:$KUBEFLOW_TRAINING_REPO/sdk/python
```

2. Install Python dependencies

```
pip3 install -r $KUBEFLOW_TESTING_REPO/py/kubeflow/testing/requirements.txt
```

> Note: if you run into problems installing the requirements, you may need to run `sudo apt-get install libffi-dev`. Feel free to share the error logs if you don't know how to resolve them.

3. Run Tests

```
# enter the ksonnet app to run tests
cd $KUBEFLOW_TRAINING_REPO/test/workflows

# run the individual test that failed in the presubmit job
python3 -m kubeflow.tf_operator.pod_names_validation_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=pod-names-validation-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifact
python3 -m kubeflow.tf_operator.cleanpod_policy_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=cleanpod-policy-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifact
python3 -m kubeflow.tf_operator.simple_tfjob_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=simple-tfjob-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=2 --artifacts_path=/tmp/output/artifact
```
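
Before running the tests, it can help to confirm that the variables exported in step 1 point at real checkouts, since a wrong `$GOPATH` layout otherwise only shows up later as an import error. The following is a minimal sketch, assuming the three variable names used in this guide; the helper `find_missing_repo_paths` is hypothetical and not part of the test harness.

```python
import os

# Variable names taken from the "Set environment variables" step.
REQUIRED_VARS = ("KUBEFLOW_PATH", "KUBEFLOW_TRAINING_REPO", "KUBEFLOW_TESTING_REPO")


def find_missing_repo_paths(env):
    """Return a list of problems: unset variables or variables naming a missing directory."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    # Print one line per problem; no output means all three paths exist.
    for problem in find_missing_repo_paths(os.environ):
        print(problem)
```

The function takes the environment as an argument rather than reading `os.environ` directly, so it can be exercised with a plain dictionary.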

## Check results

You can either check the test logs or inspect the JUnit XML results written to `/tmp/output/artifact`.

```
$ ls -al /tmp/output/artifact
junit_test_simple_tfjob_cpu.xml

$ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml
<testsuite failures="0" tests="1" time="659.5505294799805"><testcase classname="SimpleTfJobTests" name="simple-tfjob-tests-v1" time="659.5505294799805" /></testsuite>
```
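
When several result files accumulate, reading the XML by eye gets tedious. The sketch below summarizes a single `<testsuite>` document like the one shown above. It assumes the common JUnit convention that a failing `<testcase>` carries a `<failure>` or `<error>` child element, which this guide does not confirm for the harness; the function name `summarize_junit` is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Return (tests, failures, failed_names) for one <testsuite> document."""
    suite = ET.fromstring(xml_text)
    tests = int(suite.get("tests", "0"))
    failures = int(suite.get("failures", "0"))
    # Assumption: failing cases contain a <failure> or <error> child element.
    failed_names = [
        case.get("name")
        for case in suite.iter("testcase")
        if case.find("failure") is not None or case.find("error") is not None
    ]
    return tests, failures, failed_names


if __name__ == "__main__":
    # Sample taken from the output shown in this section.
    sample = (
        '<testsuite failures="0" tests="1" time="659.5505294799805">'
        '<testcase classname="SimpleTfJobTests" name="simple-tfjob-tests-v1" '
        'time="659.5505294799805" /></testsuite>'
    )
    print(summarize_junit(sample))  # (1, 0, [])
```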

## Common issues

1. ksonnet is not installed

```
ERROR|2021-11-16T03:06:06|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception [Errno 2] No such file or directory: 'ks-13': 'ks-13'
Traceback (most recent call last):
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
    test_func()
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 53, in test_pod_names
    self.params)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/util.py", line 579, in setup_ks_app
    cwd=app_dir)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/util.py", line 59, in run
    command, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  File "/usr/local/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ks-13': 'ks-13'
```

The test harness invokes the ksonnet binary as `ks-13`, so it must be on your `PATH` under that name. See the Prerequisites section to install ksonnet.

2. TypeError: load() missing 1 required positional argument: 'Loader'

```
ERROR|2021-11-16T03:04:12|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception load() missing 1 required positional argument: 'Loader'
Traceback (most recent call last):
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
    test_func()
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 51, in test_pod_names
    ks_cmd = ks_util.get_ksonnet_cmd(self.app_dir)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/ks_util.py", line 47, in get_ksonnet_cmd
    results = yaml.load(app_yaml)
TypeError: load() missing 1 required positional argument: 'Loader'
```

This is a PyYAML compatibility issue: starting with PyYAML 6.0, `yaml.load()` requires an explicit `Loader` argument, but the test utilities call it without one. Check whether you are using PyYAML 6.0 or later. If so, downgrade to `5.4.1`.

```
pip3 uninstall pyyaml
pip3 install pyyaml==5.4.1 --user
```
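
Both issues above can be detected before a test run. The following is a minimal preflight sketch, assuming the binary name `ks-13` from the Prerequisites section and the PyYAML 6.0 cutoff described above; the helper names `ksonnet_available` and `pyyaml_needs_downgrade` are hypothetical and not part of the test harness.

```python
import shutil


def ksonnet_available(binary="ks-13"):
    """Return True if the ksonnet binary is on PATH under the name the harness calls."""
    return shutil.which(binary) is not None


def pyyaml_needs_downgrade(version):
    """Return True for PyYAML 6.0 and later, where yaml.load() requires a Loader."""
    major = int(version.split(".")[0])
    return major >= 6


if __name__ == "__main__":
    if not ksonnet_available():
        print("ks-13 not found on PATH; see the ksonnet install step")
    try:
        import yaml
    except ImportError:
        print("PyYAML is not installed")
    else:
        if pyyaml_needs_downgrade(yaml.__version__):
            print(f"PyYAML {yaml.__version__} detected; this guide recommends 5.4.1")
```

The version check takes a string so it can be tested without PyYAML installed; only the `__main__` block imports `yaml`.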