Running e2e tests locally
Pre-requisites
We recommend using Python 3.9, along with Poetry.
On KinD clusters
Pre-requisite for KinD clusters: add the entry `127.0.0.1 kind` to your local /etc/hosts file. This maps your localhost IP address to the KinD cluster's hostname. This step is already performed on GitHub Actions.
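For example, on Linux or macOS the entry can be appended directly:

```bash
# Map the KinD cluster hostname to the loopback address (requires sudo)
echo "127.0.0.1 kind" | sudo tee -a /etc/hosts
```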
If the system you run on contains an NVIDIA GPU, you can enable GPU support in KinD, which also allows you to run the GPU tests. To enable GPU support on KinD, follow these instructions.
Setup Phase:
Pull the codeflare-operator repo and run the following make targets:
```bash
make kind-e2e
export CLUSTER_HOSTNAME=kind
make setup-e2e
make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
```

For running tests locally on a KinD cluster, we need to disable `rayDashboardOAuthEnabled` in the `codeflare-operator-config` ConfigMap and then restart the CodeFlare Operator.
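One way to do this, sketched below, assumes the setting lives under the ConfigMap's `config.yaml` key and that the operator runs as a Deployment named `codeflare-operator-manager`; substitute your actual namespace and names if they differ:

```bash
# Edit the ConfigMap and set rayDashboardOAuthEnabled: false under the kuberay section
kubectl edit configmap codeflare-operator-config -n <operator-namespace>

# Restart the operator so the new configuration is picked up
kubectl rollout restart deployment codeflare-operator-manager -n <operator-namespace>
```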
(Optional) - Create and add `sdk-user` with limited permissions to the cluster to run through the e2e tests:
```bash
# Get KinD certificates
docker cp kind-control-plane:/etc/kubernetes/pki/ca.crt .
docker cp kind-control-plane:/etc/kubernetes/pki/ca.key .

# Generate certificates for new user
openssl genrsa -out user.key 2048
openssl req -new -key user.key -out user.csr -subj '/CN=sdk-user/O=tenant'
openssl x509 -req -in user.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out user.crt -days 360

# Add generated certificates to KinD context
user_crt=$(base64 --wrap=0 user.crt)
user_key=$(base64 --wrap=0 user.key)
yq eval -i ".contexts += {\"context\": {\"cluster\": \"kind-kind\", \"user\": \"sdk-user\"}, \"name\": \"sdk-user\"}" $HOME/.kube/config
yq eval -i ".users += {\"name\": \"sdk-user\", \"user\": {\"client-certificate-data\": \"$user_crt\", \"client-key-data\": \"$user_key\"}}" $HOME/.kube/config
cat $HOME/.kube/config

# Cleanup
rm ca.crt
rm ca.srl
rm ca.key
rm user.crt
rm user.key
rm user.csr

# Add RBAC permissions to sdk-user
kubectl create clusterrole list-ingresses --verb=get,list --resource=ingresses
kubectl create clusterrolebinding sdk-user-list-ingresses --clusterrole=list-ingresses --user=sdk-user
kubectl create clusterrole appwrapper-creator --verb=get,list,create,delete,patch --resource=appwrappers
kubectl create clusterrolebinding sdk-user-appwrapper-creator --clusterrole=appwrapper-creator --user=sdk-user
kubectl create clusterrole namespace-creator --verb=get,list,create,delete,patch --resource=namespaces
kubectl create clusterrolebinding sdk-user-namespace-creator --clusterrole=namespace-creator --user=sdk-user
kubectl create clusterrole list-rayclusters --verb=get,list --resource=rayclusters
kubectl create clusterrolebinding sdk-user-list-rayclusters --clusterrole=list-rayclusters --user=sdk-user
kubectl config use-context sdk-user
```
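Before running the tests, you can sanity-check that the new context has the expected permissions (the resources checked below are only illustrative):

```bash
# With the sdk-user context active, confirm the RBAC rules behave as expected
kubectl auth can-i list rayclusters    # expected: yes
kubectl auth can-i create namespaces   # expected: yes
kubectl auth can-i delete nodes        # expected: no
```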
Install the latest development version of kueue
```bash
kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
```
Test Phase:
Once the codeflare-operator, kuberay-operator and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
```bash
poetry install --with test,docs
poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py
```
If the cluster doesn't have NVIDIA GPU support, we need to disable the NVIDIA GPU tests by providing the proper marker:
```bash
poetry install --with test,docs
poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py -m 'kind and not nvidia_gpu'
```
On OpenShift clusters
Setup Phase:
Pull the codeflare-operator repo and run the following make targets:
```bash
make setup-e2e
make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
```
Install the latest development version of kueue
```bash
kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
```
If the system you run on contains an NVIDIA GPU, you can enable GPU support on OpenShift, which also allows you to run the GPU tests. To enable GPU support on OpenShift, follow these instructions. Currently the SDK doesn't support tolerations, so the e2e tests can't be executed on nodes with a taint (i.e. a GPU taint).
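To quickly check whether your nodes carry a taint that would block the tests, the taints can be listed per node, for example:

```bash
# List the taints on every node; GPU nodes often carry an nvidia.com/gpu taint
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'
```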
Test Phase:
Once the codeflare-operator, kuberay-operator and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
```bash
poetry install --with test,docs
poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_test.py
```
To run multiple tests based on the cluster environment, pass the cluster environment (kind or openshift) to the `-m` marker option:
```bash
poetry run pytest -v -s ./tests/e2e -m openshift
```
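Similarly, against a KinD cluster the kind marker selects the KinD-specific tests:

```bash
poetry run pytest -v -s ./tests/e2e -m kind
```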
By default, tests are configured with a timeout of 15 minutes. If necessary, we can override the timeout using the `--timeout` option:
```bash
poetry run pytest -v -s ./tests/e2e -m openshift --timeout=1200
```
On OpenShift Disconnected clusters
In addition to the setup phase described above for OpenShift clusters, a disconnected environment requires the following pre-requisites:
Mirror image registry:
An image mirror registry is used to host the set of container images required locally by the applications and services. This ensures images can be pulled without an external network connection and provides continuous operation and deployment capabilities in a network-isolated environment.
PyPI mirror index:
When trying to install Python packages in a disconnected environment, the pip command might fail because it cannot install packages from external URLs. This issue can be resolved by setting up a PyPI mirror index on a separate endpoint in the same environment.
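As an example, pip can be pointed at the mirror explicitly (the endpoint below is a placeholder for your mirror host):

```bash
# Install from the internal mirror instead of pypi.org
pip install --index-url https://<mirror-host>/root/pypi/+simple/ \
            --trusted-host <mirror-host> \
            codeflare-sdk
```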
S3-compatible storage:
Some of our distributed training examples require an external storage solution so that all nodes can access the same data in a disconnected environment (for example, common datasets and model files).
A Minio S3-compatible storage instance can be deployed in the disconnected environment using /tests/e2e/minio_deployment.yaml or via the support methods in the e2e test suite.
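For instance, the bundled manifest can be applied directly from the codeflare-sdk repository root (adjust the namespace and credentials it defines as needed):

```bash
# Deploy the Minio instance used by the e2e tests
kubectl apply -f tests/e2e/minio_deployment.yaml
```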
The following environment variables configure the PyPI index URL for the required Python packages and the S3 or Minio storage used by your Ray Train script or interactive session:
```bash
# Prefer an image digest over an image tag in a disconnected environment
export RAY_IMAGE=quay.io/project-codeflare/ray@sha256:<image-digest>

PIP_INDEX_URL=https://<bastion-node-endpoint-url>/root/pypi/+simple/ \
PIP_TRUSTED_HOST=<bastion-node-endpoint-url> \
AWS_DEFAULT_ENDPOINT=<s3-compatible-storage-endpoint-url> \
AWS_ACCESS_KEY_ID=<s3-compatible-storage-access-key> \
AWS_SECRET_ACCESS_KEY=<s3-compatible-storage-secret-key> \
AWS_STORAGE_BUCKET=<storage-bucket-name> \
AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>
```
Note
When using the Python Minio client to connect to a Minio storage bucket, the `AWS_DEFAULT_ENDPOINT` environment variable expects a secure endpoint by default; include an https:// or http:// prefix in the endpoint URL so that the secure or insecure endpoint can be autodetected.
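For example (the hostname below is a placeholder):

```bash
# Secure (TLS) endpoint, detected from the https:// prefix
export AWS_DEFAULT_ENDPOINT=https://minio.example.internal:9000

# Insecure endpoint, indicated by an explicit http:// prefix
export AWS_DEFAULT_ENDPOINT=http://minio.example.internal:9000
```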