Troubleshooting

This page collects troubleshooting information for Data on Amazon EKS (DoEKS) installation issues.

Error: local-exec provisioner error

If you encounter the following error during the execution of the local-exec provisioner:

│ Error: local-exec provisioner error
│
│   with module.eks-blueprints.module.emr_on_eks["data_team_b"].null_resource.update_trust_policy,
│   on .terraform/modules/eks-blueprints/modules/emr-on-eks/main.tf line 105, in resource "null_resource" "update_trust_policy":
│  105:   provisioner "local-exec" {
│
│ Error running command 'set -e
│ aws emr-containers update-role-trust-policy \
│   --cluster-name emr-on-eks \
│   --namespace emr-data-team-b \
│   --role-name emr-on-eks-emr-eks-data-team-b

Issue Description:

The error message indicates that the emr-containers command is not available in the AWS CLI version being used. This issue has been fixed in AWS CLI version 2.0.54.

Solution

To resolve the issue, update your AWS CLI version to 2.0.54 or a later version by executing the following command:

pip install --upgrade awscliv2

By updating the AWS CLI version, you will ensure that the necessary emr-containers command is available and can be executed successfully during the provisioning process.
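
As a quick check (optional; this assumes the upgraded CLI is the one on your PATH), confirm the installed version and that the emr-containers command is now recognized:

# Should report 2.0.54 or later
aws --version
# Should print the emr-containers help text instead of an "invalid choice" error
aws emr-containers help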

If you continue to experience any issues or require further assistance, please consult the AWS CLI GitHub issue for more details or contact our support team for additional guidance.

Timeouts during Terraform Destroy

Issue Description:

Customers may experience timeouts during the deletion of their environments, specifically when VPCs are being deleted. This is a known issue related to the vpc-cni component.

Symptoms:

ENIs (Elastic Network Interfaces) remain attached to subnets even after the environment is destroyed. The EKS managed security group associated with the ENI cannot be deleted by EKS.

Solution:

To work around this issue, run the `cleanup.sh` script included in the blueprint. The script removes any lingering ENIs and their associated security groups so that the VPC can be deleted.
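
If you prefer to clean up manually, or the script is not available in your copy of the blueprint, the commands below sketch the same idea: find the leftover ENIs in the VPC, delete them, and then remove the orphaned EKS-managed security groups. The <REGION>, <VPC_ID>, <ENI_ID> and <SECURITY_GROUP_ID> placeholders are assumptions you must fill in for your environment.

# List ENIs still present in the VPC
aws ec2 describe-network-interfaces --region <REGION> \
  --filters Name=vpc-id,Values=<VPC_ID> \
  --query 'NetworkInterfaces[].NetworkInterfaceId' --output text

# Delete a leftover ENI (detach it first if it is still in use)
aws ec2 delete-network-interface --region <REGION> --network-interface-id <ENI_ID>

# List non-default security groups remaining in the VPC, then delete the orphaned ones
aws ec2 describe-security-groups --region <REGION> \
  --filters Name=vpc-id,Values=<VPC_ID> \
  --query "SecurityGroups[?GroupName!='default'].GroupId" --output text

aws ec2 delete-security-group --region <REGION> --group-id <SECURITY_GROUP_ID>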

Error: could not download chart

If you encounter the following error while attempting to download a chart:

│ Error: could not download chart: failed to download "oci://public.ecr.aws/karpenter/karpenter" at version "v0.18.1"

│ with module.eks_blueprints_kubernetes_addons.module.karpenter[0].module.helm_addon.helm_release.addon[0],
│ on .terraform/modules/eks_blueprints_kubernetes_addons/modules/kubernetes-addons/helm-addon/main.tf line 1, in resource "helm_release" "addon":
│ 1: resource "helm_release" "addon" {

Issue Description:

The error message indicates that Terraform failed to download the specified chart from the public ECR registry. This can occur due to a bug in Terraform during the installation of Karpenter.

Solution:

To resolve the issue, you can try the following steps:

Authenticate with ECR: Run the following command to authenticate with the Amazon ECR Public registry (public.ecr.aws) that hosts the chart:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws

Re-run terraform apply: Execute the terraform apply command again with the --auto-approve flag to reapply the Terraform configuration:

terraform apply --auto-approve

By authenticating with ECR and re-running the terraform apply command, you will ensure that the necessary chart can be downloaded successfully during the installation process.
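
If the download still fails after authenticating, a commonly suggested additional step (not specific to this blueprint) is to clear any stale cached credentials for public.ecr.aws before retrying, since expired tokens can also cause this error:

docker logout public.ecr.aws
# If you previously logged in with Helm, clear its credentials as well
helm registry logout public.ecr.aws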

Terraform apply/destroy error authenticating with the EKS Cluster

ERROR:

│ Error: Get "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused

│ with module.eks.kubernetes_config_map_v1_data.aws_auth[0],
│ on .terraform/modules/eks/main.tf line 550, in resource "kubernetes_config_map_v1_data" "aws_auth":
│ 550: resource "kubernetes_config_map_v1_data" "aws_auth" {


Solution: In this situation Terraform is unable to refresh its data sources and authenticate with the EKS cluster. See the discussion here.

Try the following approach first, using the exec plugin in the Kubernetes provider.

provider "kubernetes" {
host = module.eks_blueprints.eks_cluster_endpoint
cluster_ca_certificate = base64decode(module.eks_blueprints.eks_cluster_certificate_authority_data)

exec {
api_version = "client.authentication.k8s.io/v1beta1"
command = "aws"
args = ["eks", "get-token", "--cluster-name", module.eks_blueprints.eks_cluster_id]
}
}
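
To confirm that this exec-based authentication will work outside Terraform, you can run the same command the provider invokes (a quick optional check; replace the placeholders with your cluster name and region):

aws eks get-token --cluster-name <EKS_CLUSTER_NAME> --region <CLUSTER_REGION>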


If the issue still persists after the above change, you can fall back to using a local kubeconfig file. NOTE: This approach might not be ideal for production; it simply lets you apply/destroy clusters with your local kubeconfig.

  1. Create a local kubeconfig for your cluster:
aws eks update-kubeconfig --name <EKS_CLUSTER_NAME> --region <CLUSTER_REGION>
  2. Update the providers.tf file with the configuration below, using only config_path.
provider "kubernetes" {
config_path = "<HOME_PATH>/.kube/config"
}

provider "helm" {
kubernetes {
config_path = "<HOME_PATH>/.kube/config"
}
}

provider "kubectl" {
config_path = "<HOME_PATH>/.kube/config"
}

EMR Containers Virtual Cluster (dhwtlq9yx34duzq5q3akjac00) delete: unexpected state 'ARRESTED'

If you encounter an error message stating "waiting for EMR Containers Virtual Cluster (xwbc22787q6g1wscfawttzzgb) delete: unexpected state 'ARRESTED', wanted target ''. last error: %!s(nil)", you can follow the steps below to resolve the issue:

Note: Replace <REGION> with the appropriate AWS region where the virtual cluster is located.

  1. Open a terminal or command prompt.
  2. Run the following command to list the virtual clusters in the "ARRESTED" state:
aws emr-containers list-virtual-clusters --region <REGION> --states ARRESTED \
--query 'virtualClusters[0].id' --output text

This command retrieves the ID of the virtual cluster in the "ARRESTED" state.

  3. Run the following command to delete the virtual cluster:
aws emr-containers list-virtual-clusters --region <REGION> --states ARRESTED \
--query 'virtualClusters[0].id' --output text | xargs -I{} aws emr-containers delete-virtual-cluster \
--region <REGION> --id {}

This command pipes the ID returned by the list command directly into aws emr-containers delete-virtual-cluster, so you only need to replace <REGION> before running it.

By executing these commands, you will be able to delete the virtual cluster that is in the "ARRESTED" state. This should resolve the unexpected state issue and allow you to proceed with further operations.
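
Note that the commands above only act on the first virtual cluster returned (virtualClusters[0]). If several virtual clusters are stuck in the ARRESTED state, the variation below (a sketch using the same commands) deletes all of them:

# Delete every virtual cluster currently in the ARRESTED state
# (with GNU xargs, add -r to skip execution when the list is empty)
aws emr-containers list-virtual-clusters --region <REGION> --states ARRESTED \
  --query 'virtualClusters[].id' --output text | \
  xargs -n 1 aws emr-containers delete-virtual-cluster --region <REGION> --id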

Terminating namespace issue

If you encounter the issue where a namespace is stuck in the "Terminating" state and cannot be deleted, you can use the following command to remove the finalizers on the namespace:

Note: Replace <namespace> with the name of the namespace you want to delete.

NAMESPACE=<namespace>
kubectl get namespace $NAMESPACE -o json | sed 's/"kubernetes"//' | kubectl replace --raw "/api/v1/namespaces/$NAMESPACE/finalize" -f -

This command retrieves the namespace details in JSON format, removes the "kubernetes" finalizer, and performs a replace operation to remove the finalizer from the namespace. This should allow the namespace to complete the termination process and be successfully deleted.
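
Before forcing the finalizer removal, it can help to see which finalizers are actually holding the namespace open (optional check):

kubectl get namespace $NAMESPACE -o jsonpath='{.spec.finalizers}'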

Please ensure that you have the necessary permissions to perform this operation. If you continue to experience issues or require further assistance, please reach out to our support team for additional guidance and troubleshooting steps.

KMS Alias AlreadyExistsException

During your Terraform installation or redeployment, you might encounter an error saying: AlreadyExistsException: An alias with the name ... already exists. This happens when the KMS alias you're trying to create already exists in your AWS account.

│ Error: creating KMS Alias (alias/eks/trainium-inferentia): AlreadyExistsException: An alias with the name arn:aws:kms:us-west-2:23423434:alias/eks/trainium-inferentia already exists

│ with module.eks.module.kms.aws_kms_alias.this["cluster"],
│ on .terraform/modules/eks.kms/main.tf line 452, in resource "aws_kms_alias" "this":
│ 452: resource "aws_kms_alias" "this" {

Solution:

To resolve this, delete the existing KMS alias using the aws kms delete-alias command. Remember to update the alias name and region in the command before running it.

aws kms delete-alias --alias-name <KMS_ALIAS_NAME> --region <ENTER_REGION>
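
If you are unsure of the exact alias name, you can list the EKS-related aliases that already exist in the region first (optional check; the alias/eks/ prefix filter is an assumption you can adjust):

aws kms list-aliases --region <ENTER_REGION> \
  --query "Aliases[?starts_with(AliasName, 'alias/eks/')].AliasName" --output text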

Error: creating CloudWatch Logs Log Group

Terraform cannot create a CloudWatch Logs log group because it already exists in your AWS account.


│ Error: creating CloudWatch Logs Log Group (/aws/eks/trainium-inferentia/cluster): operation error CloudWatch Logs: CreateLogGroup, https response error StatusCode: 400, RequestID: 5c34c47a-72c6-44b2-a345-925824f24d38, ResourceAlreadyExistsException: The specified log group already exists

│ with module.eks.aws_cloudwatch_log_group.this[0],
│ on .terraform/modules/eks/main.tf line 106, in resource "aws_cloudwatch_log_group" "this":
│ 106: resource "aws_cloudwatch_log_group" "this" {

Solution:

Delete the existing log group, updating the log group name and region in the command below.

aws logs delete-log-group --log-group-name <LOG_GROUP_NAME> --region <ENTER_REGION>
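
Alternatively, if you want to keep the existing log group and its data, you can import it into the Terraform state instead of deleting it. This is a sketch: the resource address comes from the error above, and the log group name must match your cluster's.

terraform import 'module.eks.aws_cloudwatch_log_group.this[0]' /aws/eks/trainium-inferentia/cluster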

Karpenter Error - Missing Service Linked Role

Karpenter throws the error below while trying to launch new instances.

"error":"launching nodeclaim, creating instance, with fleet error(s), AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances."}

Solution:

You will need to create the EC2 Spot service-linked role in the AWS account you're using to avoid the ServiceLinkedRoleCreationNotPermitted error.

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
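
You can verify that the role now exists (optional check; AWSServiceRoleForEC2Spot is the name AWS assigns to the EC2 Spot service-linked role):

aws iam get-role --role-name AWSServiceRoleForEC2Spot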