# How to debug an E2E test for Kubeflow Training Operator

TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)

[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guide concentrates on debugging e2e failures.

## Prerequisites

1. Install Python 3.7

2. Clone the `kubeflow/testing` repo under `$GOPATH/src/github.com/kubeflow/`

3. Install [ksonnet](https://ksonnet.io/)

```
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
tar -xvzf ks_0.13.1_linux_amd64.tar.gz
sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/ks-13
```

> We would like to deprecate `ksonnet`, but this may take some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested.
> If your platform is darwin or windows, download the binaries from [ksonnet v0.13.1](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1).

4. Deploy the HEAD version of the training operator in your environment

```
# The tag suffix (1462 here) is only an example, e.g. a PR number; use the same tag in all three commands
IMG=kubeflow/training-operator:e2e-debug-1462 make docker-build

# Optional - load image into kind cluster if you are using kind
kind load docker-image kubeflow/training-operator:e2e-debug-1462

kubectl set image deployment.v1.apps/training-operator training-operator=kubeflow/training-operator:e2e-debug-1462
```

## Run E2E Tests locally

1. Set environment variables

```
export KUBEFLOW_PATH=$GOPATH/src/github.com/kubeflow
export KUBEFLOW_TRAINING_REPO=$KUBEFLOW_PATH/training-operator
export KUBEFLOW_TESTING_REPO=$KUBEFLOW_PATH/testing
export PYTHONPATH=$KUBEFLOW_TRAINING_REPO:$KUBEFLOW_TRAINING_REPO/py:$KUBEFLOW_TESTING_REPO/py:$KUBEFLOW_TRAINING_REPO/sdk/python
```

2. Install Python dependencies

```
pip3 install -r $KUBEFLOW_TESTING_REPO/py/kubeflow/testing/requirements.txt
```

> Note: if you run into problems installing the requirements, you may need to run `sudo apt-get install libffi-dev`. Feel free to share error logs if you don't know how to handle them.

3. Run tests

```
# enter the ksonnet app to run tests
cd $KUBEFLOW_TRAINING_REPO/test/workflows

# run the individual test that failed in the presubmit job
python3 -m kubeflow.tf_operator.pod_names_validation_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=pod-names-validation-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifact
python3 -m kubeflow.tf_operator.cleanpod_policy_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=cleanpod-policy-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifact
python3 -m kubeflow.tf_operator.simple_tfjob_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=simple-tfjob-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=2 --artifacts_path=/tmp/output/artifact
```

## Check results

You can either check the logs or inspect the results in `/tmp/output/artifact`.

```
$ ls -al /tmp/output/artifact
junit_test_simple_tfjob_cpu.xml

$ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml
<testsuite failures="0" tests="1" time="659.5505294799805"><testcase classname="SimpleTfJobTests" name="simple-tfjob-tests-v1" time="659.5505294799805" /></testsuite>
```

## Common issues

1. ksonnet is not installed

```
ERROR|2021-11-16T03:06:06|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception [Errno 2] No such file or directory: 'ks-13': 'ks-13'
Traceback (most recent call last):
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
    test_func()
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 53, in test_pod_names
    self.params)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/util.py", line 579, in setup_ks_app
    cwd=app_dir)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/util.py", line 59, in run
    command, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  File "/usr/local/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ks-13': 'ks-13'
```

The test runner cannot find the `ks-13` binary on your `PATH`. See the `Prerequisites` section to install ksonnet.

2. TypeError: load() missing 1 required positional argument: 'Loader'

```
ERROR|2021-11-16T03:04:12|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception load() missing 1 required positional argument: 'Loader'
Traceback (most recent call last):
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
    test_func()
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 51, in test_pod_names
    ks_cmd = ks_util.get_ksonnet_cmd(self.app_dir)
  File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/ks_util.py", line 47, in get_ksonnet_cmd
    results = yaml.load(app_yaml)
TypeError: load() missing 1 required positional argument: 'Loader'
```

This is a PyYAML compatibility issue: starting with PyYAML 6.0, `yaml.load()` requires an explicit `Loader` argument. Check whether you are using `pyyaml==6.0.0`. If so, downgrade to `5.4.1` instead.

```
pip3 uninstall pyyaml
pip3 install pyyaml==5.4.1 --user
```
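
To tell quickly whether the PyYAML issue above applies to your environment, a small check like the following may help. It is only a sketch: the helper names `yaml_load_requires_loader` and `installed_pyyaml_version` are illustrative and not part of the test harness, and it assumes the installed PyYAML exposes `yaml.__version__`.

```python
"""Check whether the installed PyYAML would trigger the missing 'Loader' error."""


def yaml_load_requires_loader(version):
    """Return True if this PyYAML version rejects yaml.load() without a Loader.

    PyYAML 6.0 made the Loader argument mandatory; earlier versions accept the call.
    """
    major = int(version.split(".")[0])
    return major >= 6


def installed_pyyaml_version():
    """Return the installed PyYAML version string, or None if it cannot be determined."""
    try:
        import yaml
    except ImportError:
        return None
    # Assumption: PyYAML exposes its version as yaml.__version__.
    return getattr(yaml, "__version__", None)


if __name__ == "__main__":
    version = installed_pyyaml_version()
    if version is None:
        print("PyYAML version could not be determined in this environment")
    elif yaml_load_requires_loader(version):
        print("PyYAML %s: yaml.load() needs a Loader; consider pyyaml==5.4.1" % version)
    else:
        print("PyYAML %s: yaml.load() without a Loader still works" % version)
```

Run it with the same `python3` interpreter you use for the tests, so that it inspects the same site-packages.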