Class GlueMetadataHandler
- java.lang.Object
-
- com.amazonaws.athena.connector.lambda.handlers.MetadataHandler
-
- com.amazonaws.athena.connector.lambda.handlers.GlueMetadataHandler
-
- All Implemented Interfaces:
com.amazonaws.services.lambda.runtime.RequestStreamHandler
- Direct Known Subclasses:
DocDBMetadataHandler, DynamoDBMetadataHandler, ElasticsearchMetadataHandler, GcsMetadataHandler, HbaseMetadataHandler, NeptuneMetadataHandler, RedisMetadataHandler, TimestreamMetadataHandler
public abstract class GlueMetadataHandler extends MetadataHandler
This class allows you to leverage AWS Glue's DataCatalog to satisfy portions of the functionality required in a MetadataHandler. More precisely, this implementation uses AWS Glue's DataCatalog to implement:
- doListSchemaNames(...)
- doListTables(...)
- doGetTable(...)
When you extend this class you can optionally provide a DatabaseFilter and/or TableFilter to decide which databases (aka schemas) and tables are eligible for use with your connector. You can find examples of this in the athena-hbase and athena-docdb connector modules. A common reason to filter is that Glue contains databases or tables whose names match those in your source but which are not actually relevant to it. In such cases you may choose to ignore those Glue tables.
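The filtering idea described above can be sketched without the SDK on the classpath. In the minimal example below, `TableInfo`, the nested `TableFilter` interface, the `MARKER` property name, and `eligibleTableNames` are all hypothetical stand-ins invented for illustration; the real filter interfaces operate on Glue model objects, and treating a null filter as "accept everything" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GlueFilterSketch {
    /** Hypothetical stand-in for a Glue table: only a name and its table properties. */
    public static final class TableInfo {
        public final String name;
        public final Map<String, String> parameters;

        public TableInfo(String name, Map<String, String> parameters) {
            this.name = name;
            this.parameters = parameters;
        }
    }

    /** Hypothetical stand-in for GlueMetadataHandler.TableFilter. */
    public interface TableFilter {
        boolean filter(TableInfo table);
    }

    /** Hypothetical marker property; not taken from any real connector. */
    public static final String MARKER = "my-connector-enabled";

    /** Returns the names of tables accepted by the filter; a null filter accepts every table. */
    public static List<String> eligibleTableNames(List<TableInfo> tables, TableFilter filter) {
        List<String> eligible = new ArrayList<>();
        for (TableInfo table : tables) {
            if (filter == null || filter.filter(table)) {
                eligible.add(table.name);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        List<TableInfo> glueTables = List.of(
                new TableInfo("orders", Map.of(MARKER, "true")),
                new TableInfo("etl_scratch", Map.of()));
        // Only tables carrying the marker property are treated as relevant to the connector.
        TableFilter onlyMarked = table -> table.parameters.containsKey(MARKER);
        System.out.println(eligibleTableNames(glueTables, onlyMarked));
    }
}
```

Keying the filter on a table property rather than on the table name is one way to handle the name-collision case, since an unrelated Glue table can share a name with a source table but is unlikely to carry the marker.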
At present this class does not retrieve partition information from AWS Glue's DataCatalog. There is an open task for how best to handle partitioning information in this class (https://github.com/awslabs/aws-athena-query-federation/issues/5). It is unclear at this time how many sources will have meaningful partition info in Glue, but many sources (DocDB, Hbase, Redis) benefited from having basic schema information in Glue. As a result, support for partition information has been deferred to a later time.
- See Also:
MetadataHandler
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class)
- static interface GlueMetadataHandler.DatabaseFilter
- static interface GlueMetadataHandler.TableFilter
-
Field Summary
Fields (Modifier and Type, Field, Description)
- static String COLUMN_NAME_MAPPING_PROPERTY
- static String DATETIME_FORMAT_MAPPING_PROPERTY
- static String DATETIME_FORMAT_MAPPING_PROPERTY_NORMALIZED
- protected static int GET_TABLES_REQUEST_MAX_RESULTS: The maximum number of tables returned in a single response (as defined in the Glue API docs).
- static String GLUE_TABLE_CONTAINS_PREVIOUSLY_UNSUPPORTED_TYPE
- static String SOURCE_TABLE_PROPERTY
- static String VIEW_METADATA_FIELD
-
Fields inherited from class com.amazonaws.athena.connector.lambda.handlers.MetadataHandler
configOptions, DISABLE_SPILL_ENCRYPTION, KMS_KEY_ID_ENV, SPILL_BUCKET_ENV, SPILL_PREFIX_ENV
-
-
Constructor Summary
Constructors (Modifier, Constructor, Description)
- GlueMetadataHandler(String sourceType, Map<String,String> configOptions): Basic constructor which is recommended when extending this class.
- protected GlueMetadataHandler(software.amazon.awssdk.services.glue.GlueClient awsGlue, EncryptionKeyFactory encryptionKeyFactory, software.amazon.awssdk.services.secretsmanager.SecretsManagerClient secretsManager, software.amazon.awssdk.services.athena.AthenaClient athena, String sourceType, String spillBucket, String spillPrefix, Map<String,String> configOptions): Full DI constructor used mostly for testing.
- GlueMetadataHandler(software.amazon.awssdk.services.glue.GlueClient awsGlue, String sourceType, Map<String,String> configOptions): Constructor that allows injection of a customized Glue client.
-
Method Summary
Methods (Modifier and Type, Method, Description)
- protected org.apache.arrow.vector.types.pojo.Field convertField(String name, String glueType): Maps a Glue field to an Apache Arrow Field.
- GetTableResponse doGetTable(BlockAllocator blockAllocator, GetTableRequest request): Attempts to retrieve a Table (columns and properties) from AWS Glue for the request schema (aka database) and table name with no filtering.
- protected GetTableResponse doGetTable(BlockAllocator blockAllocator, GetTableRequest request, GlueMetadataHandler.TableFilter filter): Attempts to retrieve a Table (columns and properties) from AWS Glue for the request schema (aka database) and table name with optional filtering.
- ListSchemasResponse doListSchemaNames(BlockAllocator blockAllocator, ListSchemasRequest request): Returns an unfiltered list of schemas (aka databases) from AWS Glue DataCatalog.
- protected ListSchemasResponse doListSchemaNames(BlockAllocator blockAllocator, ListSchemasRequest request, GlueMetadataHandler.DatabaseFilter filter): Returns a list of schemas (aka databases) from AWS Glue DataCatalog with optional filtering.
- ListTablesResponse doListTables(BlockAllocator blockAllocator, ListTablesRequest request): Returns an unfiltered list of tables from AWS Glue DataCatalog for the requested schema (aka database).
- protected ListTablesResponse doListTables(BlockAllocator blockAllocator, ListTablesRequest request, GlueMetadataHandler.TableFilter filter): Returns a paginated list of tables from AWS Glue DataCatalog with optional filtering for the requested schema (aka database).
- protected software.amazon.awssdk.services.glue.GlueClient getAwsGlue(): Provides access to the Glue client if the extender should need it.
- protected String getCatalog(MetadataRequest request): Provides access to the current AWS Glue DataCatalog being used by this class.
- protected static Map<String,String> getColumnNameMapping(software.amazon.awssdk.services.glue.model.Table table): If available, parses and returns a column name mapping for cases when a data source's columns cannot be represented under Glue's restrictive naming rules.
- protected static String getSourceTableName(org.apache.arrow.vector.types.pojo.Schema schema): Returns the source table name stored by populateSourceTableNameIfAvailable(software.amazon.awssdk.services.glue.model.Table, com.amazonaws.athena.connector.lambda.data.SchemaBuilder).
- protected static void populateSourceTableNameIfAvailable(software.amazon.awssdk.services.glue.model.Table table, SchemaBuilder schemaBuilder): Glue has strict table naming rules and may not be able to match the exact table name from the source.
-
Methods inherited from class com.amazonaws.athena.connector.lambda.handlers.MetadataHandler
doGetDataSourceCapabilities, doGetQueryPassthroughSchema, doGetSplits, doGetTableLayout, doHandleRequest, doPing, enhancePartitionSchema, getPartitions, getSecret, handleRequest, makeEncryptionKey, makeSpillLocation, onPing, resolveSecrets
-
-
-
-
Field Detail
-
GET_TABLES_REQUEST_MAX_RESULTS
protected static final int GET_TABLES_REQUEST_MAX_RESULTS
The maximum number of tables returned in a single response (as defined in the Glue API docs).
- See Also:
- Constant Field Values
-
SOURCE_TABLE_PROPERTY
public static final String SOURCE_TABLE_PROPERTY
- See Also:
- Constant Field Values
-
COLUMN_NAME_MAPPING_PROPERTY
public static final String COLUMN_NAME_MAPPING_PROPERTY
- See Also:
- Constant Field Values
-
DATETIME_FORMAT_MAPPING_PROPERTY
public static final String DATETIME_FORMAT_MAPPING_PROPERTY
- See Also:
- Constant Field Values
-
DATETIME_FORMAT_MAPPING_PROPERTY_NORMALIZED
public static final String DATETIME_FORMAT_MAPPING_PROPERTY_NORMALIZED
- See Also:
- Constant Field Values
-
VIEW_METADATA_FIELD
public static final String VIEW_METADATA_FIELD
- See Also:
- Constant Field Values
-
GLUE_TABLE_CONTAINS_PREVIOUSLY_UNSUPPORTED_TYPE
public static final String GLUE_TABLE_CONTAINS_PREVIOUSLY_UNSUPPORTED_TYPE
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
GlueMetadataHandler
public GlueMetadataHandler(String sourceType, Map<String,String> configOptions)
Basic constructor which is recommended when extending this class.
- Parameters:
sourceType
- The source type, used in diagnostic logging.
configOptions
- The configOptions for this MetadataHandler.
-
GlueMetadataHandler
public GlueMetadataHandler(software.amazon.awssdk.services.glue.GlueClient awsGlue, String sourceType, Map<String,String> configOptions)
Constructor that allows injection of a customized Glue client.
- Parameters:
awsGlue
- The glue client to use.
sourceType
- The source type, used in diagnostic logging.
configOptions
- The configOptions for this MetadataHandler.
-
GlueMetadataHandler
protected GlueMetadataHandler(software.amazon.awssdk.services.glue.GlueClient awsGlue, EncryptionKeyFactory encryptionKeyFactory, software.amazon.awssdk.services.secretsmanager.SecretsManagerClient secretsManager, software.amazon.awssdk.services.athena.AthenaClient athena, String sourceType, String spillBucket, String spillPrefix, Map<String,String> configOptions)
Full DI constructor used mostly for testing.
- Parameters:
awsGlue
- The glue client to use.
encryptionKeyFactory
- The EncryptionKeyFactory to use for spill encryption.
secretsManager
- The SecretsManagerClient that can be used when attempting to resolve secrets.
athena
- The Athena client that can be used to fetch query termination status to fast-fail this handler.
spillBucket
- The S3 Bucket to use when spilling results.
spillPrefix
- The S3 prefix to use when spilling results.
configOptions
- The configOptions for this MetadataHandler.
-
-
Method Detail
-
getAwsGlue
protected software.amazon.awssdk.services.glue.GlueClient getAwsGlue()
Provides access to the Glue client if the extender should need it. This will return null if Glue use is disabled.
- Returns:
- The AWSGlue client being used by this class, or null if disabled.
-
getCatalog
protected String getCatalog(MetadataRequest request)
Provides access to the current AWS Glue DataCatalog being used by this class.
- Parameters:
request
- The request for which we'd like to resolve the catalog.
- Returns:
- The glue catalog to use for the request.
-
doListSchemaNames
public ListSchemasResponse doListSchemaNames(BlockAllocator blockAllocator, ListSchemasRequest request) throws Exception
Returns an unfiltered list of schemas (aka databases) from AWS Glue DataCatalog.
- Specified by:
doListSchemaNames in class MetadataHandler
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog they are querying.
- Returns:
- The ListSchemasResponse which mostly contains the list of schemas (aka databases).
- Throws:
Exception
-
doListSchemaNames
protected ListSchemasResponse doListSchemaNames(BlockAllocator blockAllocator, ListSchemasRequest request, GlueMetadataHandler.DatabaseFilter filter) throws Exception
Returns a list of schemas (aka databases) from AWS Glue DataCatalog with optional filtering.
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog they are querying.
filter
- The DatabaseFilter to apply to all schemas (aka databases) before adding them to the results list.
- Returns:
- The ListSchemasResponse which mostly contains the list of schemas (aka databases).
- Throws:
Exception
-
doListTables
public ListTablesResponse doListTables(BlockAllocator blockAllocator, ListTablesRequest request) throws Exception
Returns an unfiltered list of tables from AWS Glue DataCatalog for the requested schema (aka database).
- Specified by:
doListTables in class MetadataHandler
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog they are querying.
- Returns:
- The ListTablesResponse which mostly contains the list of table names.
- Throws:
Exception
-
doListTables
protected ListTablesResponse doListTables(BlockAllocator blockAllocator, ListTablesRequest request, GlueMetadataHandler.TableFilter filter) throws Exception
Returns a paginated list of tables from AWS Glue DataCatalog with optional filtering for the requested schema (aka database).
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog they are querying.
filter
- The TableFilter to apply to all tables before adding them to the results list.
- Returns:
- The ListTablesResponse which mostly contains the list of table names.
- Throws:
Exception
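To illustrate what token-based paging with a per-page cap and an optional filter looks like in general, here is a self-contained sketch. It is not the implementation of doListTables: `Page`, `fetchPage`, and `listAll` are hypothetical names, the catalog is an in-memory list, and the token is simply an offset encoded as a string. The sketch drains every page, whereas a handler may instead return a single page along with a token for the caller to continue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PaginatedListingSketch {
    /** One page of names plus the token for the next page (null when there are no more pages). */
    public static final class Page {
        public final List<String> names;
        public final String nextToken;

        Page(List<String> names, String nextToken) {
            this.names = names;
            this.nextToken = nextToken;
        }
    }

    /** Hypothetical paged catalog call: returns at most maxResults names starting at the token's offset. */
    static Page fetchPage(List<String> catalog, String token, int maxResults) {
        int start = (token == null) ? 0 : Integer.parseInt(token);
        int end = Math.min(start + maxResults, catalog.size());
        String next = (end < catalog.size()) ? String.valueOf(end) : null;
        return new Page(new ArrayList<>(catalog.subList(start, end)), next);
    }

    /** Follows next-page tokens until none remain, keeping only names accepted by the filter. */
    public static List<String> listAll(List<String> catalog, int maxResults, Predicate<String> filter) {
        if (maxResults < 1) {
            throw new IllegalArgumentException("maxResults must be at least 1");
        }
        List<String> result = new ArrayList<>();
        String token = null;
        do {
            Page page = fetchPage(catalog, token, maxResults);
            for (String name : page.names) {
                // A null filter accepts every name (an assumption of this sketch).
                if (filter == null || filter.test(name)) {
                    result.add(name);
                }
            }
            token = page.nextToken;
        } while (token != null);
        return result;
    }

    public static void main(String[] args) {
        List<String> catalog = List.of("a", "b", "c", "d", "e");
        System.out.println(listAll(catalog, 2, name -> !name.equals("c")));
    }
}
```

Note that filtering after fetching means a page can contribute fewer names than the cap, which is why the loop is driven by the token rather than by the number of results collected.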
-
doGetTable
public GetTableResponse doGetTable(BlockAllocator blockAllocator, GetTableRequest request) throws Exception
Attempts to retrieve a Table (columns and properties) from AWS Glue for the request schema (aka database) and table name with no filtering.
- Specified by:
doGetTable in class MetadataHandler
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog, database, and table they are querying.
- Returns:
- A GetTableResponse mostly containing the columns, their types, and any table properties for the requested table.
- Throws:
Exception
-
doGetTable
protected GetTableResponse doGetTable(BlockAllocator blockAllocator, GetTableRequest request, GlueMetadataHandler.TableFilter filter) throws Exception
Attempts to retrieve a Table (columns and properties) from AWS Glue for the request schema (aka database) and table name with optional filtering.
- Parameters:
blockAllocator
- Tool for creating and managing Apache Arrow Blocks.
request
- Provides details on who made the request and which Athena catalog, database, and table they are querying.
filter
- The TableFilter to apply to any matching table before generating the result.
- Returns:
- A GetTableResponse mostly containing the columns, their types, and any table properties for the requested table.
- Throws:
Exception
-
convertField
protected org.apache.arrow.vector.types.pojo.Field convertField(String name, String glueType)
Maps a Glue field to an Apache Arrow Field.
- Parameters:
name
- The name of the field in Glue.
glueType
- The type of the field in Glue.
- Returns:
- The corresponding Apache Arrow Field.
-
populateSourceTableNameIfAvailable
protected static void populateSourceTableNameIfAvailable(software.amazon.awssdk.services.glue.model.Table table, SchemaBuilder schemaBuilder)
Glue has strict table naming rules and may not be able to match the exact table name from the source. So this stores the source table name in the schema metadata if necessary to ease lookup later. It looks for it in the following places:
- A table property called "sourceTable"
- In StorageDescriptor.Location in the form of an ARN (e.g. arn:aws:dynamodb:us-east-1:012345678910:table/mytable)
Override this method to fetch the source table name from somewhere else.
- Parameters:
table
- The Glue Table
schemaBuilder
- The schema being generated
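The two lookup places described above can be sketched in isolation as follows. This is an illustrative approximation, not the class's actual code: `resolveSourceTableName` is a hypothetical helper, the table is reduced to a property map and a location string, the ARN pattern is an assumption modeled on the example ARN above, and checking the property before the location is likewise assumed from the order in which the two places are listed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SourceTableNameSketch {
    /** Table property name quoted in the description above. */
    static final String SOURCE_TABLE_PROPERTY = "sourceTable";

    /** Assumed ARN shape: arn:partition:service:region:account:table/NAME. */
    private static final Pattern TABLE_ARN =
            Pattern.compile("^arn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:\\d*:table/(.+)$");

    /**
     * Returns the source table name from the table property if present,
     * otherwise from an ARN-shaped location, otherwise empty.
     */
    public static Optional<String> resolveSourceTableName(Map<String, String> tableProperties,
                                                          String location) {
        String fromProperty = tableProperties.get(SOURCE_TABLE_PROPERTY);
        if (fromProperty != null) {
            return Optional.of(fromProperty);
        }
        if (location != null) {
            Matcher matcher = TABLE_ARN.matcher(location);
            if (matcher.matches()) {
                return Optional.of(matcher.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String arn = "arn:aws:dynamodb:us-east-1:012345678910:table/mytable";
        System.out.println(resolveSourceTableName(Map.of(), arn));
        System.out.println(resolveSourceTableName(Map.of(SOURCE_TABLE_PROPERTY, "My-Table"), arn));
    }
}
```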
-
getSourceTableName
protected static String getSourceTableName(org.apache.arrow.vector.types.pojo.Schema schema)
Returns the source table name stored by populateSourceTableNameIfAvailable(software.amazon.awssdk.services.glue.model.Table, com.amazonaws.athena.connector.lambda.data.SchemaBuilder).
- Parameters:
schema
- The schema returned by doGetTable(com.amazonaws.athena.connector.lambda.data.BlockAllocator, com.amazonaws.athena.connector.lambda.metadata.GetTableRequest).
- Returns:
- The source table name
-
getColumnNameMapping
protected static Map<String,String> getColumnNameMapping(software.amazon.awssdk.services.glue.model.Table table)
If available, parses and returns a column name mapping for cases when a data source's columns cannot be represented under Glue's restrictive naming rules. It looks for a comma-separated inline map in the "columnMapping" table property.
- Parameters:
table
- The glue table.
- Returns:
- A column mapping if provided, otherwise an empty map
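As a rough illustration of parsing a comma-separated inline map from a table property, consider the sketch below. The `key=value` entry syntax is an assumption (the description above only says "comma separated inline map"), and `parseColumnMapping` is a hypothetical helper that works on a plain property map rather than a Glue Table.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnMappingSketch {
    /** Table property name quoted in the description above. */
    static final String COLUMN_MAPPING_PROPERTY = "columnMapping";

    /**
     * Parses entries of the assumed form "glueName=sourceName,other=Other".
     * Returns an empty map when the property is absent or blank.
     */
    public static Map<String, String> parseColumnMapping(Map<String, String> tableProperties) {
        String raw = tableProperties.get(COLUMN_MAPPING_PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyMap();
        }
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : raw.split(",")) {
            // Split on the first '=' only, so values may themselves contain '='.
            String[] keyValue = entry.split("=", 2);
            if (keyValue.length != 2) {
                throw new IllegalArgumentException("Malformed mapping entry: " + entry);
            }
            mapping.put(keyValue[0].trim(), keyValue[1].trim());
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> properties = Map.of(COLUMN_MAPPING_PROPERTY, "col1=Col1, col2=Col-2");
        System.out.println(parseColumnMapping(properties));
    }
}
```

Failing fast on a malformed entry is a design choice of this sketch; silently skipping bad entries would be an alternative, at the cost of hiding configuration mistakes.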
-
-