# Supported data types

## Structured/Semi-structured data scanning

| Classifier type | Classification string | Notes |
|---|---|---|
| Apache Avro | avro | Reads the schema at the beginning of the file to determine format. |
| Apache ORC | orc | Reads the file metadata to determine format. |
| Apache Parquet | parquet | Reads the schema at the end of the file to determine format. |
| JSON | json | Reads the beginning of the file to determine format. |
| Binary JSON | bson | Reads the beginning of the file to determine format. |
| XML | xml | Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document. For information about creating a custom XML classifier to specify rows in the document, see Writing XML custom classifiers. |
| Amazon Ion | ion | Reads the beginning of the file to determine format. |
| Combined Apache log | combined_apache | Determines log formats through a grok pattern. |
| Apache log | apache | Determines log formats through a grok pattern. |
| Linux kernel log | linux_kernel | Determines log formats through a grok pattern. |
| Microsoft log | microsoft_log | Determines log formats through a grok pattern. |
| Ruby log | ruby_logger | Reads the beginning of the file to determine format. |
| Squid 3.x log | squid | Reads the beginning of the file to determine format. |
| Redis monitor log | redismonlog | Reads the beginning of the file to determine format. |
| Redis log | redislog | Reads the beginning of the file to determine format. |
| CSV | csv | Checks for the following delimiters: comma (,), pipe (\|), … |
| Amazon Redshift | redshift | Uses JDBC connection to import metadata. |
| MySQL | mysql | Uses JDBC connection to import metadata. |
| PostgreSQL | postgresql | Uses JDBC connection to import metadata. |
| Oracle database | oracle | Uses JDBC connection to import metadata. |
| Microsoft SQL Server | sqlserver | Uses JDBC connection to import metadata. |
| Amazon DynamoDB | dynamodb | Reads data from the DynamoDB table. |

Compressed formats: Files in the following compressed formats can be classified:

| Compressed format | Notes |
|---|---|
| ZIP | Supported for archives containing only a single file. Note that ZIP is not well supported in other services (because of the archive). |
| Snappy | Supported for both standard and Hadoop native Snappy formats. |

Note: The solution uses AWS Glue to crawl this data into data catalogs. For the specific data formats that AWS Glue supports, see Built-in classifiers in AWS Glue.
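
Several notes in the table say a format is recognized by reading the beginning or the end of a file. The sketch below illustrates that general idea with well-known file signatures; it is not AWS Glue's implementation. The function name `guess_classification` and the fallback heuristics for JSON and XML are assumptions made for illustration only.

```python
from typing import Optional

# Well-known leading signatures for two of the formats listed above.
_LEADING_MAGIC = {
    "avro": b"Obj\x01",  # Avro object container file header
    "orc": b"ORC",       # ORC file header
}
# Parquet writes this marker at both the start and the end of a file.
_PARQUET_MAGIC = b"PAR1"


def guess_classification(data: bytes) -> Optional[str]:
    """Return a classification string from the table, or None if unknown.

    Illustrative only: real classifiers inspect far more than a signature.
    """
    # Parquet keeps its schema in a footer, so check the trailing marker too.
    if (
        len(data) >= 8
        and data.startswith(_PARQUET_MAGIC)
        and data.endswith(_PARQUET_MAGIC)
    ):
        return "parquet"
    for name, magic in _LEADING_MAGIC.items():
        if data.startswith(magic):
            return name
    # Text formats (assumed heuristic): look at the first non-whitespace byte.
    head = data.lstrip()[:1]
    if head in (b"{", b"["):
        return "json"
    if head == b"<":
        return "xml"
    return None
```

For example, `guess_classification(b'{"id": 1}')` returns `"json"`, while bytes that match none of the checks return `None`.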

## Unstructured data scanning (S3 only)

| File Type | Extensions |
|---|---|
| Document | ".docx", ".pdf" |
| Webpage | ".htm", ".html" |
| Email | ".eml" |
| Code | ".java", ".py", ".cpp", ".c", ".h", ".html", ".css", ".js", ".php", ".rb", ".swift", ".go", ".sql" |
| Text | ".txt", ".md", ".log" |
| Image | ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif" (ID cards, business licenses, driver's licenses, faces) |
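
To check ahead of time whether an S3 object falls into one of these file types, the extension mapping can be expressed as a small lookup. This is a minimal sketch, not part of the solution: `FILE_TYPES` and `file_types_for` are hypothetical names, and case-insensitive matching is an assumption. Because ".html" is listed under both Webpage and Code, the function returns every matching type.

```python
from pathlib import PurePosixPath
from typing import Dict, FrozenSet, List

# Extension lists copied from the table above.
FILE_TYPES: Dict[str, FrozenSet[str]] = {
    "Document": frozenset({".docx", ".pdf"}),
    "Webpage": frozenset({".htm", ".html"}),
    "Email": frozenset({".eml"}),
    "Code": frozenset({
        ".java", ".py", ".cpp", ".c", ".h", ".html", ".css",
        ".js", ".php", ".rb", ".swift", ".go", ".sql",
    }),
    "Text": frozenset({".txt", ".md", ".log"}),
    "Image": frozenset({".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif"}),
}


def file_types_for(key: str) -> List[str]:
    """Return every file type whose extension list contains the key's suffix.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    suffix = PurePosixPath(key).suffix.lower()
    return [name for name, exts in FILE_TYPES.items() if suffix in exts]


print(file_types_for("reports/2023/summary.PDF"))  # ['Document']
print(file_types_for("site/index.html"))           # ['Webpage', 'Code']
print(file_types_for("backup/archive.tar.gz"))     # []
```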