# Common Issues
This page covers frequently encountered issues during deployment, web application usage, API interaction, and pipeline execution for Visual Asset Management System (VAMS).
## CDK Deployment Errors

### Amazon ECS VPC Endpoint Conflicts
When redeploying with pipeline configuration changes, you may encounter AWS CloudFormation errors related to Amazon ECS VPC interface endpoints. This occurs when Amazon ECS endpoint changes conflict with AWS CloudFormation stack change restrictions.
Symptoms:
- CloudFormation stack rollback with VPC endpoint creation failures
- Errors referencing duplicate interface endpoints for Amazon ECS
Resolution:

1. Temporarily disable the affected pipelines (Isaac Lab Training, Gaussian Splat Toolbox) in `infra/config/config.json`.
2. Deploy with the pipelines disabled: `cdk deploy --all --require-approval never`.
3. Re-enable the pipelines in the configuration file.
4. Deploy again: `cdk deploy --all --require-approval never`.
This issue typically occurs only when toggling multiple pipelines simultaneously. Deploying pipeline changes incrementally can help avoid it.
### Docker Buildx Container Image Errors
When deploying with AWS CDK using Docker, you may encounter errors related to container image builds, particularly `failed commit on ref "manifest-sha256:..."` or `Lambda function XXX reached terminal FAILED state due to InvalidImage`.
Symptoms:

- `unexpected status from PUT request to https://....dkr.ecr.REGION.amazonaws.com/v2/foo/manifests/bar: 400 Bad Request`
- `InvalidImage(ImageLayerFailure: UnsupportedImageLayerDetected)` errors during Lambda function creation
- Container images fail to push to Amazon ECR
Resolution:
This is a known issue with certain Docker buildx versions. Set the following environment variable before running `cdk deploy`:

```bash
export BUILDX_NO_DEFAULT_ATTESTATIONS=1
```
Additionally, if deploying from an ARM64 machine (such as Apple Silicon Mac), you may need to clear the Docker cache and configure cross-platform emulation:
```bash
# Clear the Docker cache
docker system prune -a

# Set up cross-platform emulation
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
```
Set `BUILDX_NO_DEFAULT_ATTESTATIONS=1` permanently in your shell profile (`.bashrc`, `.zshrc`) to avoid repeating this step on every deployment.
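For example, to persist the variable for zsh (adjust the profile file for your shell):

```bash
# Append the buildx workaround to the shell profile and reload it
echo 'export BUILDX_NO_DEFAULT_ATTESTATIONS=1' >> ~/.zshrc
source ~/.zshrc
```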
If you are using Finch or Podman instead of Docker, buildx-specific issues may not apply. However, you may encounter different container build errors specific to your engine. Verify the issue reproduces with Docker before reporting. See the Prerequisites page for alternative container engine setup.
### Web Build or Infrastructure CDK Errors
After upgrading VAMS or switching branches, stale dependencies can cause build failures.
Symptoms:
- TypeScript compilation errors in `web/` or `infra/`
- Module resolution failures during `cdk synth`
Resolution:
```bash
# Clear and reinstall web dependencies (run from the repository root)
(cd web && rm -rf node_modules && npm install && npm run build)

# Clear and reinstall infrastructure dependencies
(cd infra && rm -rf node_modules && npm install)
```
Always run `npm install` in both the `web/` and `infra/` directories after pulling new code or switching branches.
### External VPC Import Failures
When importing an external Amazon VPC with subnets, the first deployment attempt may fail because AWS CDK cannot resolve VPC context before stack synthesis.
Symptoms:
- VPC or subnet lookup errors during `cdk synth`
- Stack deployment fails referencing missing VPC context
Resolution:
Perform a two-phase deployment:

```bash
# Phase 1: Import VPC context and deploy non-VPC stacks
cdk deploy --all --require-approval never --context loadContextIgnoreVPCStacks=true

# Phase 2: Deploy all stacks, including the VPC-dependent ones
cdk deploy --all --require-approval never
```
Alternatively, set `loadContextIgnoreVPCStacks: true` in `infra/config/config.json` for the first deployment, then set it back to `false` for subsequent deployments.
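If you take the configuration-file route, the flag looks something like this (a minimal sketch; its exact nesting within `infra/config/config.json` may differ by VAMS version):

```json
{
  "loadContextIgnoreVPCStacks": true
}
```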
### AWS KMS Key Permission Errors
AWS KMS key policy errors can occur when AWS CloudFormation custom resources attempt to modify Amazon S3 or Amazon DynamoDB tables encrypted with a customer-managed KMS key.
Symptoms:
- Custom resource Lambda functions fail with `AccessDeniedException` for AWS KMS operations
- Stack deployment rolls back during default data population steps
Resolution:
Verify that the KMS key policy includes the required principals. If using an external CMK via `app.useKmsCmkEncryption.optionalExternalCmkArn`, ensure the key policy grants the following actions to the AWS CloudFormation service principal and the deployment role (an example policy statement follows the list):

- `kms:Decrypt`
- `kms:Encrypt`
- `kms:GenerateDataKey`
- `kms:ReEncrypt*`
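As an illustration, a key policy statement granting those actions might look like the following; the account ID and role name are placeholders, and you should scope the principals to your actual deployment role:

```json
{
  "Sid": "AllowVamsDeploymentAccess",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:role/YOUR-DEPLOYMENT-ROLE",
    "Service": "cloudformation.amazonaws.com"
  },
  "Action": [
    "kms:Decrypt",
    "kms:Encrypt",
    "kms:GenerateDataKey",
    "kms:ReEncrypt*"
  ],
  "Resource": "*"
}
```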
## Web Application Issues

### Content Security Policy Errors in Local Development
During local development, Content Security Policy (CSP) errors may block certain viewer functionality or API calls.
Symptoms:
- Browser console errors mentioning `Content-Security-Policy`
- Viewers fail to load external resources
- WebAssembly modules blocked
Resolution:
VAMS includes a service worker that sets the required cross-origin isolation headers. Ensure the service worker is registered by verifying:
- The development server is running on `https://` or `localhost`.
- The browser has not blocked the service worker registration.
- For WASM-based viewers, verify that `allowUnsafeEvalFeatures` is enabled in the deployment configuration if testing against a deployed backend.
The Vite development server proxy handles most CSP issues automatically. If problems persist, clear your browser cache and service worker registrations.
### WASM-Based Viewers Not Loading
Viewers that use WebAssembly (Needle USD Viewer, Three.js CAD Viewer, Cesium 3D Tileset Viewer) require specific HTTP headers to function.
Symptoms:
- Viewer shows a loading spinner indefinitely
- Browser console errors mentioning `SharedArrayBuffer` or `Cross-Origin-Opener-Policy`
Resolution:
WASM-based viewers require cross-origin isolation headers (`Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`). These are provided in two ways:
- Amazon CloudFront deployment: Headers are set automatically by the CloudFront distribution.
- Application Load Balancer (ALB) deployment: A front-end service worker attempts to set the headers. If your organization's security policy blocks service workers, WASM viewers will not function.
For the Cesium 3D Tileset Viewer, you must also enable `allowUnsafeEvalFeatures` in `infra/config/config.json` because CesiumJS requires runtime code generation for its rendering engine.
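One way to check whether the isolation headers are being served (the domain below is a placeholder):

```bash
# Print any cross-origin isolation headers returned by the front end
curl -sI https://your-vams-domain.example.com/ \
  | grep -iE 'cross-origin-(opener|embedder)-policy'
```

Note that this check only applies to CloudFront deployments, where the distribution sets the headers server-side; in ALB deployments the service worker injects them in the browser, so they will not appear in a curl response.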
### Safari Limitations for WASM Viewers
Safari does not fully support the cross-origin isolation requirements needed by certain WASM-based viewers.
Symptoms:
- Needle USD Viewer, Three.js CAD formats (`.stp`, `.step`, `.iges`, `.brep`), and Cesium Viewer fail to load in Safari
- Standard mesh formats (`.gltf`, `.glb`, `.obj`, `.stl`) in the Three.js Viewer work correctly
Resolution:
Use a Chromium-based browser (Google Chrome, Microsoft Edge) or Mozilla Firefox for WASM-dependent viewers. Non-WASM viewers and standard mesh formats work in all supported browsers.
### Login Loop or Configuration Fetch Failures
Users may experience a login loop where the application repeatedly redirects to the sign-in page.
Symptoms:
- Page refreshes continuously after successful authentication
- Browser console shows errors fetching `/api/amplify-config` or `/api/secure-config`
Resolution:
- Clear your browser cache and local storage for the VAMS domain.
- Verify the API Gateway endpoint is accessible from your network.
- If using IP range restrictions (`authorizerOptions.allowedIpRanges`), confirm your IP address is within an allowed range.
- For external OAuth identity provider configurations, verify all endpoint URLs in the configuration are correct and reachable (a quick connectivity check follows this list).
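A quick way to test reachability is to request the configuration endpoints directly (replace the domain with your deployment's):

```bash
# A healthy deployment responds to both; persistent errors point to network or configuration issues
curl -sI https://your-vams-domain.example.com/api/amplify-config
curl -sI https://your-vams-domain.example.com/api/secure-config
```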
## API Issues

### 429 Rate Limiting
VAMS applies API rate limiting through Amazon API Gateway throttling.
Symptoms:
- API responses return HTTP 429 (Too Many Requests)
- Bulk operations fail intermittently
Resolution:
Increase the rate limits in `infra/config/config.json`:

```json
{
  "app": {
    "api": {
      "globalRateLimit": 100,
      "globalBurstLimit": 200
    }
  }
}
```
The default values are `globalRateLimit: 50` requests per second and `globalBurstLimit: 100`. Redeploy after changing these values.
Increasing rate limits raises the potential cost of Amazon API Gateway usage and may affect downstream service limits. Monitor your Amazon CloudWatch metrics after adjustments.
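As a starting point for that monitoring, the following sketch sums API Gateway 4XX errors (which include 429 throttles) over the last hour; the `ApiName` value is a placeholder and the `date` syntax is GNU-specific:

```bash
# Sum 4XX errors (including 429 throttles) for the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 4XXError \
  --dimensions Name=ApiName,Value=your-vams-api \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum
```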
### Timeout on Large Operations
Amazon API Gateway imposes a 29-second timeout on HTTP responses, while the underlying AWS Lambda function continues processing for up to 15 minutes.
Symptoms:
- API returns a 504 Gateway Timeout
- The operation actually completes successfully in the background
Affected operations:
- Listing or exporting assets with thousands of files
- Amazon OpenSearch re-indexing for large datasets
- Bulk metadata operations
Resolution:
For operations that may exceed 29 seconds, check your AWS Lambda function logs in Amazon CloudWatch to confirm whether the operation completed. The VamsCLI provides automatic pagination and retry logic that handles timeout scenarios for bulk operations.
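As a quick way to confirm completion, you can tail the function's log group with AWS CLI v2 (the log group name below is a placeholder; look up the actual name in the Amazon CloudWatch console):

```bash
# Follow the last 15 minutes of logs for the Lambda behind the timed-out request
aws logs tail /aws/lambda/your-vams-function --since 15m --follow
```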
### Amazon OpenSearch Indexing Delays After Bulk Operations
After uploading many files or performing bulk metadata changes, search results may not immediately reflect the updates.
Symptoms:
- Newly uploaded assets do not appear in search results
- Metadata changes are not reflected in search filters
Resolution:
Amazon OpenSearch indexing is asynchronous. After bulk operations, allow 30-60 seconds for indexing to complete. If indexing appears stuck:
- Check the Amazon CloudWatch logs for the indexing Lambda functions.
- Verify the Amazon OpenSearch cluster health in the AWS Management Console.
- If necessary, trigger a re-index by setting `reindexOnCdkDeploy: true` in the configuration and redeploying (a sketch follows this list), or by using the manual re-index tool in `infra/deploymentDataMigration/`.
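A minimal sketch of that flag (its exact nesting within the configuration file may differ by VAMS version):

```json
{
  "reindexOnCdkDeploy": true
}
```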
## Pipeline Issues

### Container Pull Failures
Pipeline containers running on AWS Batch with AWS Fargate may fail to pull container images from Amazon Elastic Container Registry (Amazon ECR).
Symptoms:
- AWS Batch job fails with `CannotPullContainerError`
- Timeout errors during image pull
Resolution:
Pipeline containers require network access to Amazon ECR endpoints. Verify:
- The VPC has the required VPC endpoints for Amazon ECR (`com.amazonaws.region.ecr.api` and `com.amazonaws.region.ecr.dkr`) and Amazon S3 (see the check after this list).
- If using pipelines that require internet access (such as RapidPipeline or ModelOps), ensure the VPC has NAT Gateway or public subnet access configured.
- Check that the security groups attached to AWS Batch compute environments allow outbound HTTPS traffic.
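To confirm the endpoints exist, one option is to list the VPC's endpoints from the AWS CLI (the VPC ID below is a placeholder):

```bash
# List the service names of all endpoints attached to the VPC
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[].ServiceName'
```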
### GPU Instance Unavailability for AWS Batch Jobs
Some pipelines (Isaac Lab Training, Gaussian Splat Toolbox) require GPU instances that may not be available in all AWS Regions or Availability Zones.
Symptoms:
- AWS Batch jobs remain in the `RUNNABLE` state indefinitely
- No compute environment instances are launched
Resolution:
- Verify GPU instance type availability in your AWS Region (e.g., `g6e.2xlarge`, `g5.xlarge`); a quick check follows this list.
- Request a service quota increase for the required instance types through the AWS Service Quotas console.
- For Isaac Lab Training, consider enabling the `keepWarmInstance` option to reduce cold start times at the cost of continuous compute charges.
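One way to verify availability before filing a quota request (the instance type shown is an example):

```bash
# List the Availability Zones in the current Region that offer the GPU instance type
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g6e.2xlarge \
  --query 'InstanceTypeOfferings[].Location'
```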
### Pipeline Timeout vs. Workflow Timeout
Pipeline step timeouts and overall workflow timeouts are configured separately. A pipeline step that exceeds its timeout will cause the workflow to fail.
Symptoms:
- Workflow execution fails but logs show the container was still processing
- AWS Step Functions execution history shows a timeout error on a specific state
Resolution:
Adjust the timeout values in the pipeline or workflow configuration. Large files (approaching the 100 GB limit for the 3D Thumbnail pipeline) may require extended processing time. Monitor the AWS Batch job logs and AWS Step Functions execution history to determine appropriate timeout values.
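To pinpoint which state timed out, the execution history can be inspected from the AWS CLI (the execution ARN below is a placeholder):

```bash
# Show the most recent execution events first; the timeout event appears near the top
aws stepfunctions get-execution-history \
  --execution-arn arn:aws:states:us-east-1:111122223333:execution:VamsWorkflow:example-execution \
  --reverse-order \
  --max-items 25
```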
For long-running pipeline operations, check the AWS Batch job logs in Amazon CloudWatch rather than relying solely on the API response or web UI status.