DataLakeCatalog

AWS Glue Catalog databases on top of a DataLakeStorage.

Overview

DataLakeCatalog is a data catalog for your data lake. It is a set of AWS Glue Data Catalog databases configured on top of a DataLakeStorage. The construct creates three databases, one for each medallion layer (bronze, silver, and gold) of the DataLakeStorage:

The default location of each database points to the corresponding S3 bucket location, s3://<locationBucket>/<locationPrefix>/.
By default, each database has an active crawler scheduled to run once a day (00:01h local timezone). The crawler can be disabled, and its schedule can be changed with a cron expression.
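
To make the location pattern above concrete, here is a minimal sketch of how such a URI is assembled. The helper name `database_location` and the bucket and prefix values are hypothetical and only illustrate the s3://<locationBucket>/<locationPrefix>/ shape; this is not the construct's own implementation.

```python
def database_location(location_bucket: str, location_prefix: str) -> str:
    """Build an S3 URI of the form s3://<locationBucket>/<locationPrefix>/.

    Hypothetical helper for illustration only, not part of DSF on AWS.
    """
    # Drop leading and trailing slashes so the result has no doubled slashes.
    prefix = location_prefix.strip("/")
    if not prefix:
        # An empty prefix means the location is the bucket root.
        return f"s3://{location_bucket}/"
    return f"s3://{location_bucket}/{prefix}/"


# Assumed example values: one prefix per medallion layer in a single bucket.
for layer in ("bronze", "silver", "gold"):
    print(database_location("my-data-lake-bucket", layer))
```

The same normalization applies whether or not the prefix is written with surrounding slashes, so "/bronze/" and "bronze" give the same location.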

Data Catalog encryption

The AWS Glue Data Catalog resources created by the DataCatalogDatabase construct are not encrypted because the encryption is only available at the catalog level. Changing the encryption at the catalog level has a wide impact on existing Glue resources and producers/consumers. Similarly, changing the encryption configuration at the catalog level after this construct is deployed can break all the resources created as part of DSF on AWS.

Usage

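TypeScript

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dsf from '@cdklabs/aws-data-solutions-framework';

class ExampleDefaultDataLakeCatalogStack extends cdk.Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    // Storage that provides the bronze, silver, and gold layers
    const storage = new dsf.storage.DataLakeStorage(this, 'MyDataLakeStorage');

    // Catalog on top of the storage, using the default crawler settings
    new dsf.governance.DataLakeCatalog(this, 'DataCatalog', {
      dataLakeStorage: storage,
    });
  }
}

Python
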
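import aws_cdk as cdk
import cdklabs.aws_data_solutions_framework as dsf


class ExampleDefaultDataLakeCatalogStack(cdk.Stack):
    def __init__(self, scope, id):
        super().__init__(scope, id)
        # Storage that provides the bronze, silver, and gold layers
        storage = dsf.storage.DataLakeStorage(self, "MyDataLakeStorage")

        # Catalog on top of the storage, using the default crawler settings
        dsf.governance.DataLakeCatalog(self, "DataCatalog",
            data_lake_storage=storage
        )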

Modifying the crawlers' behavior for the entire catalog

You can change the default configuration of the AWS Glue Crawlers associated with the different databases to match your requirements:

Enable or disable the crawlers
Change the crawlers' run frequency
Provide your own key to encrypt the crawler logs

These parameters apply to all three databases. If you need fine-grained configuration per database, use the DataCatalogDatabase construct.

TypeScript

import { Key } from 'aws-cdk-lib/aws-kms';

// Inside the stack constructor, with `storage` created as in the previous example
const encryptionKey = new Key(this, 'CrawlerLogEncryptionKey');

new dsf.governance.DataLakeCatalog(this, 'DataCatalog', {
  dataLakeStorage: storage,
  autoCrawl: true,
  autoCrawlSchedule: {
    // Run every day at 00:01
    scheduleExpression: 'cron(1 0 * * ? *)',
  },
  crawlerLogEncryptionKey: encryptionKey,
  crawlerTableLevelDepth: 3,
});

Python

from aws_cdk.aws_kms import Key

# Inside the stack's __init__, with `storage` created as in the previous example
encryption_key = Key(self, "CrawlerLogEncryptionKey")

dsf.governance.DataLakeCatalog(self, "DataCatalog",
    data_lake_storage=storage,
    auto_crawl=True,
    auto_crawl_schedule=cdk.aws_glue.CfnCrawler.ScheduleProperty(
        # Run every day at 00:01
        schedule_expression="cron(1 0 * * ? *)"
    ),
    crawler_log_encryption_key=encryption_key,
    crawler_table_level_depth=3
)
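
The schedule expression used above follows the AWS six-field cron format (minutes, hours, day of month, month, day of week, year). As a minimal sketch, a daily schedule string can be built from an hour and a minute. The helper name `daily_cron` is hypothetical and not part of DSF on AWS or the CDK.

```python
def daily_cron(hour: int, minute: int) -> str:
    """Return an AWS cron expression that fires once a day at hour:minute.

    Hypothetical helper for illustration only.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if not 0 <= minute <= 59:
        raise ValueError(f"minute must be in 0..59, got {minute}")
    # Field order: minutes hours day-of-month month day-of-week year.
    # '?' leaves the day of week unspecified because day of month is '*'.
    return f"cron({minute} {hour} * * ? *)"


# The schedule from the example above: every day at 00:01.
print(daily_cron(0, 1))
```

The returned string can be passed as the schedule expression of the crawler schedule shown in the examples above.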
