Troubleshooting

Diagnose and fix issues you may encounter in your Kubeflow deployment

For general errors related to Kubernetes and Amazon EKS, please refer to the Amazon EKS User Guide troubleshooting section. For issues with cluster creation or modification with eksctl, see the eksctl troubleshooting page.

Validate prerequisites

Version incompatibilities are a common source of problems. Before investigating more specific issues, verify that the prerequisites are installed at the required versions.

ALB fails to provision

If the ADDRESS column of your istio-ingress is still empty after a few minutes, the ALB ingress controller is likely misconfigured.

kubectl get ingress -n istio-system
NAME            HOSTS   ADDRESS   PORTS   AGE
istio-ingress   *                 80      3min
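The check above can also be scripted. The following is a minimal sketch, assuming the ingress is named istio-ingress in the istio-system namespace as shown in the output above; the function names ingress_address and wait_for_ingress_address are hypothetical helpers, not part of any Kubeflow tooling.

```shell
# Print the hostname assigned to istio-ingress.
# Prints nothing while the ALB has not been provisioned.
ingress_address() {
  kubectl get ingress istio-ingress -n istio-system \
    -o jsonpath='{.status.loadBalancer.ingress[*].hostname}'
}

# Poll until an address appears or the timeout expires.
# $1: timeout in seconds (default 300), $2: poll interval in seconds (default 10).
wait_for_ingress_address() {
  timeout="${1:-300}"
  interval="${2:-10}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    address="$(ingress_address)"
    if [ -n "$address" ]; then
      echo "$address"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "istio-ingress has no address after ${timeout}s" >&2
  return 1
}

# Example: wait_for_ingress_address 300 10
```

A non-zero exit status means the address stayed empty for the whole timeout, which is the point at which the controller logs are worth inspecting.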

Check the logs of the AWS Load Balancer Controller (formerly the AWS ALB Ingress Controller) for errors.

kubectl -n kube-system logs --selector=app.kubernetes.io/name=aws-load-balancer-controller --tail=-1
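To narrow the log output to errors only, a filter along these lines may help. This is a sketch that assumes the controller writes JSON log lines containing a "level":"error" field; adjust the pattern if your controller version logs in a different format. The function name controller_error_lines is hypothetical.

```shell
# Print only error-level lines from the controller logs.
# $1: number of recent lines to scan per Pod (default 500).
# Assumption: log lines are JSON and contain "level":"error".
# Returns non-zero if no error lines are found (grep semantics).
controller_error_lines() {
  kubectl -n kube-system logs \
    --selector=app.kubernetes.io/name=aws-load-balancer-controller \
    --tail="${1:-500}" \
    | grep -E '"level":"error"'
}

# Example: controller_error_lines 1000
```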

If the logs indicate a missing cluster name, check the following ConfigMap:

kubectl get configmaps -n kube-system aws-load-balancer-controller-config -o yaml

Make sure that the ConfigMap has the correct EKS cluster name assigned to the clusterName variable.

apiVersion: v1
kind: ConfigMap
data:
  clusterName: your-eks-cluster-name
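The comparison can be done without reading the YAML by hand. The sketch below assumes the ConfigMap name and the clusterName key shown above; check_cluster_name is a hypothetical helper, and the expected cluster name is whatever you pass as its argument.

```shell
# Compare the clusterName stored in the ConfigMap with the expected EKS cluster name.
# $1: expected cluster name.
check_cluster_name() {
  expected="$1"
  actual="$(kubectl get configmap aws-load-balancer-controller-config \
    -n kube-system -o jsonpath='{.data.clusterName}')"
  if [ "$actual" = "$expected" ]; then
    echo "clusterName matches: $actual"
  else
    echo "clusterName mismatch: ConfigMap has '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Example: check_cluster_name your-eks-cluster-name
```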

If this does not resolve the error, your subnets may be missing the tag that tells Kubernetes which subnets to use for external load balancers. To fix this, ensure that your cluster’s public subnets are tagged with the Key kubernetes.io/role/elb and the Value 1. See the Prerequisites section for application load balancing in the Amazon EKS User Guide for further details.
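The tag described above can be inspected and applied with the AWS CLI. This is a sketch, assuming the AWS CLI is configured for the account and region of your cluster; the function names are hypothetical, and the VPC and subnet IDs in the examples are placeholders.

```shell
# List the subnets of a VPC that already carry the kubernetes.io/role/elb=1 tag.
# $1: VPC ID of the cluster.
tagged_public_subnets() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" "Name=tag:kubernetes.io/role/elb,Values=1" \
    --query 'Subnets[].SubnetId' --output text
}

# Add the kubernetes.io/role/elb=1 tag to one or more public subnets.
# $@: subnet IDs to tag.
tag_public_subnets() {
  aws ec2 create-tags --resources "$@" --tags Key=kubernetes.io/role/elb,Value=1
}

# Example:
#   tagged_public_subnets vpc-0123456789abcdef0
#   tag_public_subnets subnet-aaaa1111 subnet-bbbb2222
```

Only tag subnets that are actually public; the tag on its own does not make a subnet routable from the internet.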

FSx issues

Verify that the FSx drivers are installed by running the following command:

kubectl get csidriver -A
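For a scripted check, the driver can be looked up by name. The sketch below assumes the FSx for Lustre CSI driver is registered under the name fsx.csi.aws.com; fsx_driver_installed is a hypothetical helper.

```shell
# Succeed if the FSx for Lustre CSI driver is registered in the cluster.
# Assumption: the driver registers itself as fsx.csi.aws.com.
fsx_driver_installed() {
  kubectl get csidriver fsx.csi.aws.com >/dev/null 2>&1
}

# Example:
#   fsx_driver_installed && echo "FSx CSI driver found" || echo "FSx CSI driver missing"
```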

Check that PersistentVolumes, PersistentVolumeClaims, and StorageClasses are all deployed as expected:

kubectl get pv,pvc,sc -A

Use the kubectl logs command to get more information on Pods that use these resources.
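When many claims exist, it can help to list only those that are not bound. This is a sketch that relies on the default column order of kubectl get pvc -A (NAMESPACE, NAME, STATUS); unbound_pvcs is a hypothetical helper.

```shell
# Print every PersistentVolumeClaim whose status is not Bound,
# formatted as "namespace/name (status)".
unbound_pvcs() {
  kubectl get pvc -A --no-headers \
    | awk '$3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# Example: unbound_pvcs
# Follow up on a reported claim with:
#   kubectl describe pvc <name> -n <namespace>
```

The Events section in the kubectl describe output usually states why a claim is still pending.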

For more information, see the Amazon FSx for Lustre CSI Driver GitHub repository. Troubleshooting information for specific FSx filesystems can be found in the Amazon FSx documentation.

RDS issues

To troubleshoot RDS issues, follow the installation verification steps.

Last modified September 19, 2022: Kubeflow 1.6 docs (#365) (5fdb7b1c)