FAQ

The following are common questions you might have when deploying and using the solution.

Deployment

1. In which AWS Regions can this solution be deployed?

For the list of supported Regions, refer to supported Regions.

2. When creating a transfer task, should I deploy it on the data source side or the destination side?

Transfer performance is not affected by whether the solution is deployed on the data source side or the destination side.

If you do not have an ICP-registered domain name for AWS China Regions, we recommend deploying the solution in AWS Regions.

If you need to deploy in AWS China Regions but do not have a domain name, you can deploy the back-end version directly.

3. Do I need to deploy the solution on the data source and destination side separately?

No. You can choose to deploy on the data source or destination side, which has no impact on the transfer performance.

4. Is it possible to deploy the solution in AWS account A and transfer Amazon S3 objects from account B to account C?

Yes. In this case, you need to store the AccessKeyID and SecretAccessKey of account B and account C in the Secrets Manager of account A.
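
For illustration, the following is a minimal boto3 sketch of storing such a key pair in account A's Secrets Manager. The secret name and JSON key names are hypothetical; check the solution's documentation for the exact secret format it expects.

    import json
    import boto3

    # Hypothetical secret name and JSON key names, for illustration only.
    secretsmanager = boto3.client("secretsmanager")
    secretsmanager.create_secret(
        Name="dth-account-b-credentials",
        SecretString=json.dumps({
            "access_key_id": "AKIA...",           # AccessKeyID of account B
            "secret_access_key": "<secret-key>",  # SecretAccessKey of account B
        }),
    )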

5. For data transfer within the production account, is it recommended to create an AWS account specifically for deploying the solution?

Yes. It is recommended to create a new AWS account dedicated to deploying the solution. Account-level isolation improves the stability of the production account during data synchronization.

6. Is it possible to transfer data between different Regions under the same account?

Not supported currently. For this scenario, we recommend using Amazon S3's Cross-Region Replication.

7. Can I use AWS CLI to create a DTH S3 Transfer Task?

Yes. Please refer to the tutorial Using AWS CLI to launch DTH S3 Transfer task.

Performance

1. Will there be any difference in data transfer performance between deployments in AWS China Regions and in AWS Regions?

No. If you do not have an ICP-registered domain name for AWS China Regions, we recommend deploying the solution in AWS Regions.

2. What are the factors influencing the data transfer performance?

Transfer performance may be affected by the average file size, the destination of the data transfer, the geographic location of the data source, and real-time network conditions.

For example, with the same configuration, the transfer speed for an average file size of 50 MB is about 170 times that for an average file size of 10 KB.

3. What is the scale-up/scale-down policy of the Worker Auto Scaling Group?

The Auto Scaling Group automatically scales up or down according to the number of tasks (messages) in Amazon SQS; a policy sketch follows the steps below.

  • Scaling Up Steps are:

    { lower: 100,   change: +1 }
    { lower: 500,   change: +2 }
    { lower: 2000,  change: +5 }
    { lower: 10000, change: +10 }
    

  • Scaling Down Step is:

    { upper: 0, change: -10000 }
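
Together these form a step scaling policy driven by queue depth. The following boto3 sketch shows how a policy of this shape could be expressed, assuming a CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric with a threshold of 100 and a hypothetical Auto Scaling group name; it illustrates the mapping, not the solution's actual implementation.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Step intervals are relative to the alarm threshold (100 messages),
    # so lower bound 0 means ">= 100 messages", 400 means ">= 500", etc.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="DTH-Worker-ASG",  # hypothetical name
        PolicyName="dth-worker-scale-up",
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        StepAdjustments=[
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 400, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 400, "MetricIntervalUpperBound": 1900, "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 1900, "MetricIntervalUpperBound": 9900, "ScalingAdjustment": 5},
            {"MetricIntervalLowerBound": 9900, "ScalingAdjustment": 10},
        ],
    )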
    

Data security and authentication

1. How does the solution ensure data security?

The solution adopts the following to ensure data security:

  • All data is transferred in memory within the transfer node cluster, without being written to disk.
  • All external ports of the transfer nodes are closed, and there is no way to SSH into a transfer node.
  • All data downloads and uploads call the official AWS APIs under the hood, and data transfer conforms to the TLS protocol.

2. How does the solution ensure the security of resources on the cloud?

The solution was developed following the principle of least privilege for IAM permissions, and it uses Auto Scaling to automatically terminate idle worker nodes on the user's behalf.

3. Is the front-end console open to the public network? How to ensure user authentication and multi-user management?

Yes. You can access it via the front-end console link. User authentication and multi-user management are handled by an Amazon Cognito User Pool in AWS Regions, and by an OIDC-based SaaS provider in AWS China Regions.

4. How does the solution achieve cross-account and cross-cloud authentication?

Through the AccessKeyID and SecretAccessKey of the other party's account. The secret key is stored in AWS Secrets Manager and is read from Secrets Manager as needed.
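
As a minimal sketch of that pattern, the following boto3 snippet reads a stored key pair and builds an S3 client with it; the secret name and JSON key names are hypothetical.

    import json
    import boto3

    # Hypothetical secret name and JSON key names, for illustration only.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="dth-credentials")
    creds = json.loads(secret["SecretString"])

    # Build an S3 client for the other account using the stored key pair.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["access_key_id"],
        aws_secret_access_key=creds["secret_access_key"],
    )
    print(s3.list_buckets()["Buckets"])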

5. Does the solution support SSE-S3, SSE-KMS, and SSE-CMK?

Yes. The solution supports the use of SSE-S3 and SSE-KMS data sources. If your source bucket has SSE-CMK enabled, refer to the tutorial.

Features

1. What third-party clouds does Amazon S3 sync currently support?

Alibaba Cloud OSS, Tencent Cloud, Huawei Cloud, Qiniu Cloud, Baidu Cloud, and any cloud that supports S3-compatible protocols.

2. Why is the status of the Task still in progress after all files have been transferred to the destination? When will the task stop?

  • For a Fixed Rate Job

    The data difference between the source and the destination is monitored continuously, and the differences between the two sides are compared automatically after the first deployment.

    In addition, the comparison task (which runs once an hour by default) transfers any difference it finds. Therefore, the status of the Task remains in progress unless the user terminates the task manually.

    Thanks to the solution's built-in auto scaling, when there is no data to transfer, the number of transfer worker nodes is automatically reduced to the minimum value configured by the user.

  • For a One Time Transfer Job

    When all objects have been transferred to the destination, the status of the job becomes Completed.

    The transfer then stops, and you can select Stop to delete and release all back-end resources.

3. How often will the data difference between the data source and destination be compared?

By default, it runs hourly.

In Task Scheduling Settings, you can configure the task schedule.

  • If you want to compare the data difference between the two sides at a fixed frequency, select Fixed Rate.
  • If you want to schedule the comparison of data differences between the two sides with a Cron Expression, select Cron Expression (see the example below).
  • If you only want to perform the data synchronization task once, select One Time Transfer.
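
For example, assuming the solution accepts AWS-style six-field cron expressions, the following expression schedules the comparison at the top of every hour:

    0 * * * ? *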

4. Is real-time synchronization of newly added files possible?

Near-real-time synchronization can be achieved if the Data Transfer Hub is deployed in the same AWS account and the same Region as the data source. If the data source and the solution are not in the same account, you can configure it manually. For more information, refer to the tutorial.
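
At a high level, near-real-time sync relies on Amazon S3 event notifications feeding new-object events into the transfer queue. The following is a minimal boto3 sketch of that wiring, with hypothetical bucket and queue names; the actual setup is covered in the tutorial above.

    import boto3

    # Hypothetical names for illustration; the queue policy must also
    # allow Amazon S3 to send messages to the queue.
    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-source-bucket",
        NotificationConfiguration={
            "QueueConfigurations": [
                {
                    "QueueArn": "arn:aws:sqs:us-east-1:111122223333:DTH-S3TransferQueue",
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )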

5. Are there restrictions on the number of files and the size of files?

No. Large files are transferred in chunks using multipart upload.

6. If a single file transfer fails due to network issues, how to resolve it? Is there an error handling mechanism?

The solution retries a failed transfer 5 times. If it still fails after 5 retries, the user is notified of the failed task via email.

7. How to monitor the progress of the transfer by checking information like how many files are waiting to be transferred and the current transfer speed?

You can open the customized Amazon CloudWatch dashboard by clicking the CloudWatch Dashboard link in the Task Detail page of the web console. You can also go directly to CloudWatch to view it.

8. Do I need to create an S3 destination bucket before creating a transfer task?

Yes, you need to create the destination S3 bucket in advance.

9. How to use finderDepth and finderNumber to improve Finder performance?

You can use these two parameters to increase the parallelism of the Finder and thereby improve the performance of data comparison.

For example, suppose there are 12 subdirectories (Jan, Feb, ..., Dec), each containing over 100k files.

We recommend setting finderDepth=1 and finderNumber=12, which improves comparison performance by a factor of 12.

When using finderDepth and finderNumber, make sure that no objects sit at a folder level equal to or shallower than finderDepth. Otherwise, data loss may occur.

For example, assume that you set finderDepth=2 and finderNumber=12 * 31 = 372, and your S3 bucket structure looks like bucket_name/Jan/01/pic1.jpg.

Files such as bucket_name/pic.jpg and bucket_name/Jan/pic.jpg will be lost.

Files under a depth-2 prefix, such as all files under bucket_name/Jan/33/ and all files under bucket_name/13/33/, will not be lost.
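
For intuition, partitioning by prefix works roughly like the following boto3 sketch, which enumerates the depth-1 prefixes that independent Finder workers could each scan in parallel. It is an illustration of the idea, not the solution's actual Finder code.

    import boto3

    s3 = boto3.client("s3")

    # List depth-1 prefixes (Jan/, Feb/, ...); each prefix can be compared
    # by its own Finder worker in parallel (finderDepth=1, finderNumber=12).
    resp = s3.list_objects_v2(Bucket="bucket_name", Delimiter="/")
    prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

    # Objects at the bucket root (e.g. bucket_name/pic.jpg) appear in
    # "Contents" here, not under any prefix -- which is why objects at or
    # above finderDepth would be missed by the per-prefix scans.
    root_objects = [o["Key"] for o in resp.get("Contents", [])]
    print(prefixes, root_objects)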

10. How to deal with Access Key Rotation?

When Data Transfer Hub detects that the Access Key of S3 has been rotated, it automatically fetches the latest key from AWS Secrets Manager. Therefore, Access Key rotation does not affect the migration process of DTH.

11. Does the Payer Request mode support Public Data Set?

No. Currently, Payer Request data synchronization is only supported through the Access Key and Secret Key authentication method.

Others

1. The cluster node (EC2) is terminated by mistake. How to resolve it?

The solution's Auto Scaling mechanism automatically starts a new worker node.

However, if a node is terminated while transferring a shard of a file, the shards of that file may fail to merge on the destination side, and the error "api error NoSuchUpload: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed" may occur. In that case, you need to configure a lifecycle rule on the Amazon S3 bucket to delete expired delete markers or incomplete multipart uploads.
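
The rule can be created in the S3 console, or with a boto3 sketch like the following; the bucket name and retention period are hypothetical, so adjust them to your needs.

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-destination-bucket",  # hypothetical name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "clean-up-failed-transfers",
                    "Status": "Enabled",
                    "Filter": {},  # apply to the whole bucket
                    # Abort multipart uploads left incomplete for 7 days.
                    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                    # Remove expired object delete markers.
                    "Expiration": {"ExpiredObjectDeleteMarker": True},
                }
            ]
        },
    )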

2. The Secrets configuration in Secrets Manager is wrong. How to resolve it?

You need to update the Secrets in Secrets Manager first, and then go to the EC2 console and terminate all EC2 instances started by the task. The solution's Auto Scaling mechanism will then automatically start new worker nodes that pick up the updated Secrets.
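
If you prefer to script the termination step, the following boto3 sketch finds the task's instances via the Auto Scaling group tag that EC2 attaches automatically; the group name is hypothetical.

    import boto3

    ec2 = boto3.client("ec2")

    # Instances launched by an Auto Scaling group carry this tag automatically.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:aws:autoscaling:groupName", "Values": ["DTH-Worker-ASG"]},  # hypothetical
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        i["InstanceId"] for r in resp["Reservations"] for i in r["Instances"]
    ]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)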

3. How to find detailed transfer log?

  • For Portal users

    Go to the Tasks list page, and click the Task ID. You can see the dashboard and logs under the Monitoring section.

    Data Transfer Hub embeds the dashboard and log groups in the Portal, so you do not need to navigate to the AWS CloudWatch console to view the logs.

  • For Plugin (Pure Backend) users

    When deploying the stack, you are asked to enter a stack name (DTHS3Stack by default), and most resources are created with the stack name as a prefix. For example, the queue name has the format <StackName>-S3TransferQueue-<random suffix>. The plugin creates two main log groups.

    • If there is no data transfer, check whether there is a problem in the Finder task log. The following is the log group for scheduling Finder tasks. For more information, refer to the Troubleshooting section.

      <StackName>-CommonS3RepWorkerLogGroup<random suffix>

    • The following log group covers all EC2 instances, where you can find detailed transfer logs.

      <StackName>-EC2WorkerStackS3RepWorkerLogGroup<random suffix>
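
    A quick way to locate both groups is to list log groups by the stack-name prefix, as in this boto3 sketch (the stack name is assumed to be the default):

      import boto3

      logs = boto3.client("logs")
      resp = logs.describe_log_groups(logGroupNamePrefix="DTHS3Stack")
      for group in resp["logGroups"]:
          print(group["logGroupName"])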

4. How to make customized build?

If you want to make customized changes to this plugin, refer to Custom Build.

5. After the deployment is complete, why can't I find any log streams in the two CloudWatch log groups?

This is because the subnet you selected when deploying the solution does not have public network access, so the EC2 instances cannot download the CloudWatch agent and send logs to CloudWatch. Check your VPC settings. After resolving the issue, manually terminate any running EC2 instances started by this solution; the Auto Scaling group will then automatically start new instances.

6. How to use TLSv1.2_2021 or above for this solution?

Go to the CloudFront console and configure a custom domain name, which allows you to select a Security policy for CloudFront after the solution is deployed. You need to prepare a domain name and a corresponding TLS certificate in order to use more secure TLS configurations.