Supported data types
Structured/Semi-structured data scanning
| Classifier type |
Classification string |
Notes |
| Apache Avro |
avro |
Reads the schema at the beginning of the file to determine format. |
| Apache ORC |
orc |
Reads the file metadata to determine format. |
| Apache Parquet |
parquet |
Reads the schema at the end of the file to determine format. |
| JSON |
json |
Reads the beginning of the file to determine format. |
| Binary JSON |
bson |
Reads the beginning of the file to determine format. |
| XML |
xml |
Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document. For information about creating a custom XML classifier to specify rows in the document, see Writing XML custom classifiers. |
| Amazon Ion |
ion |
Reads the beginning of the file to determine format. |
| Combined Apache log |
combined_apache |
Determines log formats through a grok pattern. |
| Apache log |
apache |
Determines log formats through a grok pattern. |
| Linux kernel log |
linux_kernel |
Determines log formats through a grok pattern. |
| Microsoft log |
microsoft_log |
Determines log formats through a grok pattern. |
| Ruby log |
ruby_logger |
Reads the beginning of the file to determine format. |
| Squid 3.x log |
squid |
Reads the beginning of the file to determine format. |
| Redis monitor log |
redismonlog |
Reads the beginning of the file to determine format. |
| Redis log |
redislog |
Reads the beginning of the file to determine format. |
| CSV |
csv |
Checks for the following delimiters: comma (,), pipe ( |
| Amazon Redshift |
redshift |
Uses JDBC connection to import metadata. |
| MySQL |
mysql |
Uses JDBC connection to import metadata. |
| PostgreSQL |
postgresql |
Uses JDBC connection to import metadata. |
| Oracle database |
oracle |
Uses JDBC connection to import metadata. |
| Microsoft SQL Server |
sqlserver |
Uses JDBC connection to import metadata. |
| Amazon DynamoDB |
dynamodb |
Reads data from the DynamoDB table. |
| Compressed Formats |
|
Files in the following compressed formats can be classified: |
| ZIP |
|
Supported for archives containing only a single file. Note that Zip is not well-supported in other services (because of the archive). |
| BZIP |
|
|
| GZIP |
|
|
| LZ4 |
|
|
| Snappy |
|
Supported for both standard and Hadoop native Snappy formats. |
Note: The solution uses AWS Glue to crawl these data into data catalogs. For specific data format supported by AWS Glue, please refer to Built-in classifiers in AWS Glue.
## Unstructured data scanning (S3 only)
| File Type |
Extensions |
| Document |
".docx", ".pdf" |
| Webpage |
".htm", ".html" |
| Email |
".eml" |
| Code |
".java", ".py", ".cpp", ".c", ".h", ".html", ".css", ".js", ".php", ".rb", ".swift", ".go", ".sql" |
| Text |
".txt", ".md", ".log" |
| Image |
“.jpg”, “.jpeg”, “.png”, “.gif”, “.bmp”, “.tiff”, “.tif” - (ID cards/Business licenses/Driver's licenses/Faces) |