Class GlueMetadataHandler

  • All Implemented Interfaces:
    com.amazonaws.services.lambda.runtime.RequestStreamHandler
    Direct Known Subclasses:
    DocDBMetadataHandler, DynamoDBMetadataHandler, ElasticsearchMetadataHandler, GcsMetadataHandler, HbaseMetadataHandler, NeptuneMetadataHandler, RedisMetadataHandler, TimestreamMetadataHandler

    public abstract class GlueMetadataHandler
    extends MetadataHandler
    This class allows you to leverage AWS Glue's DataCatalog to satisfy portions of the functionality required in a MetadataHandler. More precisely, this implementation uses AWS Glue's DataCatalog to implement:

    • doListSchemaNames(...)
    • doListTables(...)
    • doGetTable(...)

    When you extend this class you can optionally provide a DatabaseFilter and/or TableFilter to decide which databases (aka schemas) and tables are eligible for use with your connector. You can find examples of this in the athena-hbase and athena-docdb connector modules. Filtering is commonly useful when Glue contains databases or tables whose names match those in your source but which aren't actually relevant to it. In such cases you may choose to ignore those Glue tables.
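
    As a rough illustration of the filtering idea described above, the sketch below uses hypothetical stand-in types (GlueTableStub, StubTableFilter) rather than the real Glue model classes or GlueMetadataHandler.TableFilter. The marker property name is also an assumption made for this example, not something this class defines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: GlueTableStub and StubTableFilter are hypothetical stand-ins,
// not the Glue SDK's Table model or GlueMetadataHandler.TableFilter.
class GlueFilterSketch {
    static final class GlueTableStub {
        final String name;
        final Map<String, String> parameters;

        GlueTableStub(String name, Map<String, String> parameters) {
            this.name = name;
            this.parameters = parameters;
        }
    }

    // Returns true when the table should be visible to the connector.
    interface StubTableFilter {
        boolean filter(GlueTableStub table);
    }

    // Assumed convention for this sketch: relevant tables carry a marker property.
    static final String MARKER_PROPERTY = "example-metadata-flag";

    static final StubTableFilter MARKED_ONLY =
            table -> table.parameters.containsKey(MARKER_PROPERTY);

    // Keeps only the tables accepted by the filter; a null filter accepts everything.
    static List<String> eligibleTableNames(List<GlueTableStub> tables, StubTableFilter filter) {
        List<String> names = new ArrayList<>();
        for (GlueTableStub table : tables) {
            if (filter == null || filter.filter(table)) {
                names.add(table.name);
            }
        }
        return names;
    }
}
```

    In this sketch a table that merely shares a name with something in the source, but lacks the marker property, is skipped.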

    At present this class does not retrieve partition information from AWS Glue's DataCatalog. There is an open task (https://github.com/awslabs/aws-athena-query-federation/issues/5) for how best to handle partitioning information in this class. It is unclear at this time how many sources will have meaningful partition information in Glue, but many sources (DocDB, HBase, Redis) benefited from having basic schema information in Glue. As a result, support for partition information was deferred to a later time.

    See Also:
    MetadataHandler
    • Field Detail

      • GET_TABLES_REQUEST_MAX_RESULTS

        protected static final int GET_TABLES_REQUEST_MAX_RESULTS
        The maximum number of tables returned in a single response (as defined in the Glue API docs).
        See Also:
        Constant Field Values
      • DATETIME_FORMAT_MAPPING_PROPERTY_NORMALIZED

        public static final String DATETIME_FORMAT_MAPPING_PROPERTY_NORMALIZED
        See Also:
        Constant Field Values
      • GLUE_TABLE_CONTAINS_PREVIOUSLY_UNSUPPORTED_TYPE

        public static final String GLUE_TABLE_CONTAINS_PREVIOUSLY_UNSUPPORTED_TYPE
        See Also:
        Constant Field Values
    • Constructor Detail

      • GlueMetadataHandler

        public GlueMetadataHandler​(String sourceType,
                                   Map<String,​String> configOptions)
        Basic constructor which is recommended when extending this class.
        Parameters:
        sourceType - The source type, used in diagnostic logging.
        configOptions - The configOptions for this MetadataHandler.
      • GlueMetadataHandler

        public GlueMetadataHandler​(software.amazon.awssdk.services.glue.GlueClient awsGlue,
                                   String sourceType,
                                   Map<String,​String> configOptions)
        Constructor that allows injection of a customized Glue client.
        Parameters:
        awsGlue - The glue client to use.
        sourceType - The source type, used in diagnostic logging.
        configOptions - The configOptions for this MetadataHandler.
      • GlueMetadataHandler

        protected GlueMetadataHandler​(software.amazon.awssdk.services.glue.GlueClient awsGlue,
                                      EncryptionKeyFactory encryptionKeyFactory,
                                      software.amazon.awssdk.services.secretsmanager.SecretsManagerClient secretsManager,
                                      software.amazon.awssdk.services.athena.AthenaClient athena,
                                      String sourceType,
                                      String spillBucket,
                                      String spillPrefix,
                                      Map<String,​String> configOptions)
        Full DI constructor used mostly for testing.
        Parameters:
        awsGlue - The glue client to use.
        encryptionKeyFactory - The EncryptionKeyFactory to use for spill encryption.
        secretsManager - The SecretsManagerClient that can be used when attempting to resolve secrets.
        athena - The Athena client that can be used to fetch query termination status to fast-fail this handler.
        sourceType - The source type, used in diagnostic logging.
        spillBucket - The S3 Bucket to use when spilling results.
        spillPrefix - The S3 prefix to use when spilling results.
        configOptions - The configOptions for this MetadataHandler.
    • Method Detail

      • getAwsGlue

        protected software.amazon.awssdk.services.glue.GlueClient getAwsGlue()
        Provides access to the Glue client if the extender should need it. This will return null if Glue use is disabled.
        Returns:
        The Glue client being used by this class, or null if disabled.
      • getCatalog

        protected String getCatalog​(MetadataRequest request)
        Provides access to the current AWS Glue DataCatalog being used by this class.
        Parameters:
        request - The request for which we'd like to resolve the catalog.
        Returns:
        The glue catalog to use for the request.
      • doListSchemaNames

        public ListSchemasResponse doListSchemaNames​(BlockAllocator blockAllocator,
                                                     ListSchemasRequest request)
                                              throws Exception
        Returns an unfiltered list of schemas (aka databases) from AWS Glue DataCatalog.
        Specified by:
        doListSchemaNames in class MetadataHandler
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog they are querying.
        Returns:
        The ListSchemasResponse which mostly contains the list of schemas (aka databases).
        Throws:
        Exception
      • doListSchemaNames

        protected ListSchemasResponse doListSchemaNames​(BlockAllocator blockAllocator,
                                                        ListSchemasRequest request,
                                                        GlueMetadataHandler.DatabaseFilter filter)
                                                 throws Exception
        Returns a list of schemas (aka databases) from AWS Glue DataCatalog with optional filtering.
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog they are querying.
        filter - The DatabaseFilter to apply to all schemas (aka databases) before adding them to the results list.
        Returns:
        The ListSchemasResponse which mostly contains the list of schemas (aka databases).
        Throws:
        Exception
      • doListTables

        public ListTablesResponse doListTables​(BlockAllocator blockAllocator,
                                               ListTablesRequest request)
                                        throws Exception
        Returns an unfiltered list of tables from AWS Glue DataCatalog for the requested schema (aka database).
        Specified by:
        doListTables in class MetadataHandler
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog they are querying.
        Returns:
        The ListTablesResponse which mostly contains the list of table names.
        Throws:
        Exception
      • doListTables

        protected ListTablesResponse doListTables​(BlockAllocator blockAllocator,
                                                  ListTablesRequest request,
                                                  GlueMetadataHandler.TableFilter filter)
                                           throws Exception
        Returns a paginated list of tables from AWS Glue DataCatalog with optional filtering for the requested schema (aka database).
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog they are querying.
        filter - The TableFilter to apply to all tables before adding them to the results list.
        Returns:
        The ListTablesResponse which mostly contains the list of table names.
        Throws:
        Exception
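
        The paginated, filtered listing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: Page, PageSource, and the index-based token are hypothetical, and the in-memory source stands in for a remote call capped at a page size such as GET_TABLES_REQUEST_MAX_RESULTS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: Page, PageSource and the token format are hypothetical;
// they are not part of the SDK or the Glue client.
class PaginatedListingSketch {
    static final class Page {
        final List<String> tableNames;
        final String nextToken; // null when there are no more pages

        Page(List<String> tableNames, String nextToken) {
            this.tableNames = tableNames;
            this.nextToken = nextToken;
        }
    }

    interface PageSource {
        Page fetch(String nextToken, int maxResults);
    }

    // Walks every page, applying the optional filter before collecting names.
    static List<String> listAll(PageSource source, int maxResults, Predicate<String> filter) {
        List<String> result = new ArrayList<>();
        String token = null;
        do {
            Page page = source.fetch(token, maxResults);
            for (String name : page.tableNames) {
                if (filter == null || filter.test(name)) {
                    result.add(name);
                }
            }
            token = page.nextToken;
        } while (token != null);
        return result;
    }

    // In-memory page source for demonstration; the token is the next start index.
    static PageSource inMemory(List<String> all) {
        return (token, maxResults) -> {
            int start = token == null ? 0 : Integer.parseInt(token);
            int end = Math.min(start + maxResults, all.size());
            String next = end < all.size() ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(all.subList(start, end)), next);
        };
    }
}
```

        Applying the filter per page, as in this sketch, means a page may contribute fewer names than the page size even when more pages remain.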
      • doGetTable

        public GetTableResponse doGetTable​(BlockAllocator blockAllocator,
                                           GetTableRequest request)
                                    throws Exception
        Attempts to retrieve a Table (columns and properties) from AWS Glue for the requested schema (aka database) and table name with no filtering.
        Specified by:
        doGetTable in class MetadataHandler
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog, database, and table they are querying.
        Returns:
        A GetTableResponse mostly containing the columns, their types, and any table properties for the requested table.
        Throws:
        Exception
      • doGetTable

        protected GetTableResponse doGetTable​(BlockAllocator blockAllocator,
                                              GetTableRequest request,
                                              GlueMetadataHandler.TableFilter filter)
                                       throws Exception
        Attempts to retrieve a Table (columns and properties) from AWS Glue for the requested schema (aka database) and table name with optional filtering.
        Parameters:
        blockAllocator - Tool for creating and managing Apache Arrow Blocks.
        request - Provides details on who made the request and which Athena catalog, database, and table they are querying.
        filter - The TableFilter to apply to any matching table before generating the result.
        Returns:
        A GetTableResponse mostly containing the columns, their types, and any table properties for the requested table.
        Throws:
        Exception
      • convertField

        protected org.apache.arrow.vector.types.pojo.Field convertField​(String name,
                                                                        String glueType)
        Maps a Glue field to an Apache Arrow Field.
        Parameters:
        name - The name of the field in Glue.
        glueType - The type of the field in Glue.
        Returns:
        The corresponding Apache Arrow Field.
      • populateSourceTableNameIfAvailable

        protected static void populateSourceTableNameIfAvailable​(software.amazon.awssdk.services.glue.model.Table table,
                                                                 SchemaBuilder schemaBuilder)
        Glue has strict table naming rules and may not be able to match the exact table name from the source, so this method stores the source table name in the schema metadata, if necessary, to ease lookup later. It looks for it in the following places:

        • A table property called "sourceTable"
        • In StorageDescriptor.Location in the form of an ARN (e.g. arn:aws:dynamodb:us-east-1:012345678910:table/mytable)

        Override this method to fetch the source table name from somewhere else.

        Parameters:
        table - The Glue Table
        schemaBuilder - The schema being generated
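
        The two lookup locations listed above can be illustrated with the following sketch. It works on plain strings and maps rather than the Glue Table model. The class name, the ":table/" marker parsing, and the choice to prefer the table property over the ARN are assumptions made for this example, so the actual implementation may differ.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative only: operates on plain values instead of the Glue Table model.
class SourceTableNameSketch {
    static final String SOURCE_TABLE_PROPERTY = "sourceTable";
    private static final String ARN_TABLE_MARKER = ":table/";

    // Assumed precedence: the table property wins, then the ARN in the storage location.
    static Optional<String> resolve(Map<String, String> tableProperties, String storageLocation) {
        if (tableProperties != null) {
            String fromProperty = tableProperties.get(SOURCE_TABLE_PROPERTY);
            if (fromProperty != null && !fromProperty.isEmpty()) {
                return Optional.of(fromProperty);
            }
        }
        if (storageLocation != null && storageLocation.startsWith("arn:")) {
            int idx = storageLocation.indexOf(ARN_TABLE_MARKER);
            int nameStart = idx + ARN_TABLE_MARKER.length();
            if (idx >= 0 && nameStart < storageLocation.length()) {
                // Everything after ":table/" is treated as the source table name.
                return Optional.of(storageLocation.substring(nameStart));
            }
        }
        return Optional.empty();
    }
}
```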
      • getColumnNameMapping

        protected static Map<String,​String> getColumnNameMapping​(software.amazon.awssdk.services.glue.model.Table table)
        If available, will parse and return a column name mapping for cases when a data source's columns cannot be represented by Glue's quite restrictive naming rules. It looks for a comma-separated inline map in the "columnMapping" table property.
        Parameters:
        table - The glue table
        Returns:
        A column mapping if provided, otherwise an empty map
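
        To make the comma-separated inline map concrete, here is a minimal parsing sketch. The "key=value" pair syntax, the trimming of whitespace, and the direction of the mapping (Glue column name to source column name) are assumptions made for this example, so check the connector source for the exact format.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: assumes entries look like "glueName=SourceName,other=Other".
class ColumnMappingSketch {
    static Map<String, String> parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyMap(); // no mapping provided
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String pair : raw.split(",")) {
            String[] keyValue = pair.split("=", 2);
            if (keyValue.length != 2) {
                throw new IllegalArgumentException("Malformed mapping entry: " + pair);
            }
            mapping.put(keyValue[0].trim(), keyValue[1].trim());
        }
        return mapping;
    }
}
```

        For example, under these assumptions a property value of "col1=Col1, col2=Col2" would yield a two-entry map keyed by the lower-case names.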