Architecture diagram
Deploying this solution with the default parameters builds the following environment in the AWS Cloud.
Sensitive Data Protection on AWS architecture
- The Application Load Balancer distributes the solution's frontend web UI assets hosted in AWS Lambda.
- An identity provider handles user authentication.
- The AWS Lambda functions are packaged as Docker images and stored in Amazon Elastic Container Registry (Amazon ECR).
- The backend Lambda function is a target for the Application Load Balancer.
- The backend Lambda function invokes AWS Step Functions in monitored accounts for sensitive data detection.
- In the AWS Step Functions workflow, an AWS Glue crawler takes inventory of the structured data sources and stores the results as metadata tables in the AWS Glue database. An Amazon SageMaker processing job preprocesses unstructured files in Amazon S3 buckets and stores their metadata in the AWS Glue database. An AWS Glue job then detects sensitive data.
- After the AWS Glue job has run, the Step Functions workflow sends Amazon SQS messages to the detection job queue.
- A Lambda function processes the messages from Amazon SQS.
- Amazon Athena queries the detection results, and the results are saved to a MySQL instance in Amazon RDS.
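The detection workflow in the list above (crawl, preprocess, detect, then notify the queue) can be sketched as an Amazon States Language definition built in Python. This is a minimal illustration, not the solution's actual state machine: the state names, crawler name, job name, and queue URL are all hypothetical placeholders, and the SageMaker request fields are omitted. Note also that starting a crawler is asynchronous, so a real workflow would likely poll for crawler completion before moving on.

```python
import json

# Hypothetical placeholders for illustration; these are NOT the
# solution's real resource names.
CRAWLER_NAME = "example-sdp-crawler"
GLUE_JOB_NAME = "example-sdp-detection-job"
QUEUE_URL = (
    "https://sqs.us-east-1.amazonaws.com/111122223333/"
    "example-detection-job-queue"
)


def build_detection_workflow() -> dict:
    """Return an Amazon States Language definition for the four steps."""
    return {
        "Comment": "Illustrative sketch of a sensitive data detection workflow",
        "StartAt": "CrawlStructuredSources",
        "States": {
            # Take inventory of structured data sources.
            "CrawlStructuredSources": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": CRAWLER_NAME},
                "Next": "PreprocessUnstructuredFiles",
            },
            # Preprocess unstructured files in S3. The request fields
            # (image, instance type, role) are omitted in this sketch.
            "PreprocessUnstructuredFiles": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
                "Next": "DetectSensitiveData",
            },
            # Run the Glue job and wait for it to finish.
            "DetectSensitiveData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": GLUE_JOB_NAME},
                "Next": "NotifyDetectionQueue",
            },
            # Send the state input as a message to the detection job queue.
            "NotifyDetectionQueue": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sqs:sendMessage",
                "Parameters": {"QueueUrl": QUEUE_URL, "MessageBody.$": "$"},
                "End": True,
            },
        },
    }


def state_order(definition: dict) -> list:
    """Follow the Next links from StartAt and return the state names in order."""
    order, name = [], definition["StartAt"]
    while name:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


if __name__ == "__main__":
    workflow = build_detection_workflow()
    print(" -> ".join(state_order(workflow)))
    print(json.dumps(workflow, indent=2))
```

Building the definition as a plain dictionary keeps the ordering of the steps easy to inspect before it is serialized to JSON and handed to Step Functions.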
The solution uses AWS Glue as its core service, both for building the data catalog in the monitored account(s) and for invoking the AWS Glue job that detects sensitive data, such as personally identifiable information (PII). The Glue job is distributed: it runs in each monitored account, while the admin account holds a centralized data catalog of the data sources across all AWS accounts. This is an implementation of the Data Mesh concept recommended by AWS.
Specifically, the solution introduces an event-driven process and uses AWS IAM roles to trigger sensitive data discovery jobs and to communicate between the admin account and the monitored account(s). The admin account can start PII detection jobs and retrieve data catalogs. All monitored AWS accounts are permitted to connect to the admin account, which can distinguish and access each of them.
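One common way for an admin account to reach a monitored account through IAM roles is to assume a role in that account with AWS STS. The sketch below assumes this pattern. The source does not state how the solution implements cross-account access, and the role name, session name prefix, and helper functions are hypothetical. The STS client is passed in (for example, one created with `boto3.client("sts")`) so the helper itself has no third-party imports, and the ARN assumes the standard `aws` partition.

```python
from typing import Any, Dict


def monitored_role_arn(account_id: str, role_name: str) -> str:
    """Build the IAM role ARN for a monitored account (standard partition assumed)."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_monitored_role(
    sts_client: Any, account_id: str, role_name: str
) -> Dict[str, str]:
    """Assume a role in a monitored account and return temporary credentials.

    `sts_client` is expected to expose `assume_role`, as a boto3 STS client does.
    The returned keys match the credential keyword arguments of `boto3.client`.
    """
    response = sts_client.assume_role(
        RoleArn=monitored_role_arn(account_id, role_name),
        # Hypothetical session name; any value allowed by STS would work.
        RoleSessionName=f"sdp-detection-{account_id}",
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

With boto3 installed, the returned dictionary can be unpacked into a client for the monitored account, for example `boto3.client("stepfunctions", **creds)`, which the admin account could then use to start a detection workflow there.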