# Supported data types

## Structured/Semi-structured data scanning

| Classifier type | Classification string | Notes |
| --- | --- | --- |
| Apache Avro | avro | Reads the schema at the beginning of the file to determine format. |
| Apache ORC | orc | Reads the file metadata to determine format. |
| Apache Parquet | parquet | Reads the schema at the end of the file to determine format. |
| JSON | json | Reads the beginning of the file to determine format. |
| Binary JSON | bson | Reads the beginning of the file to determine format. |
| XML | xml | Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document. For information about creating a custom XML classifier to specify rows in the document, see Writing XML custom classifiers. |
| Amazon Ion | ion | Reads the beginning of the file to determine format. |
| Combined Apache log | combined_apache | Determines log formats through a grok pattern. |
| Apache log | apache | Determines log formats through a grok pattern. |
| Linux kernel log | linux_kernel | Determines log formats through a grok pattern. |
| Microsoft log | microsoft_log | Determines log formats through a grok pattern. |
| Ruby log | ruby_logger | Reads the beginning of the file to determine format. |
| Squid 3.x log | squid | Reads the beginning of the file to determine format. |
| Redis monitor log | redismonlog | Reads the beginning of the file to determine format. |
| Redis log | redislog | Reads the beginning of the file to determine format. |
| CSV | csv | Checks for the following delimiters: comma (,), pipe (\|), ... |
| Amazon Redshift | redshift | Uses JDBC connection to import metadata. |
| MySQL | mysql | Uses JDBC connection to import metadata. |
| PostgreSQL | postgresql | Uses JDBC connection to import metadata. |
| Oracle database | oracle | Uses JDBC connection to import metadata. |
| Microsoft SQL Server | sqlserver | Uses JDBC connection to import metadata. |
| Amazon DynamoDB | dynamodb | Reads data from the DynamoDB table. |
| Compressed formats | | Files in the following compressed formats can be classified: |
| ZIP | | Supported for archives containing only a single file. Note that ZIP is not well supported in other services (because of the archive). |
| BZIP | | |
| GZIP | | |
| LZ4 | | |
| Snappy | | Supported for both standard and Hadoop native Snappy formats. |


Note: The solution uses AWS Glue to crawl this data into data catalogs. For the specific data formats supported by AWS Glue, refer to Built-in classifiers in AWS Glue.
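
As an illustration of how the classification strings listed above might be used, the following minimal Python sketch checks whether a crawled table carries one of them. It assumes the table is a dictionary shaped like the `Table` element returned by the AWS Glue `GetTable` API, with the crawler's result stored under `Parameters["classification"]`. The helper names `table_classification` and `is_supported` are hypothetical, and this is not necessarily how the solution itself performs the check.

```python
from typing import Optional

# Classification strings copied from the table above.
SUPPORTED_CLASSIFICATIONS = frozenset({
    "avro", "orc", "parquet", "json", "bson", "xml", "ion",
    "combined_apache", "apache", "linux_kernel", "microsoft_log",
    "ruby_logger", "squid", "redismonlog", "redislog", "csv",
    "redshift", "mysql", "postgresql", "oracle", "sqlserver", "dynamodb",
})


def table_classification(table: dict) -> Optional[str]:
    """Return the classification recorded on a Glue table, or None if absent.

    Assumption: `table` has the shape of the `Table` element in a
    GetTable response, with the value under Parameters["classification"].
    """
    return table.get("Parameters", {}).get("classification")


def is_supported(table: dict) -> bool:
    """Return True if the table's classification appears in the list above."""
    classification = table_classification(table)
    return (
        classification is not None
        and classification.lower() in SUPPORTED_CLASSIFICATIONS
    )
```

With boto3, the input could come from `boto3.client("glue").get_table(DatabaseName=..., Name=...)["Table"]`, where the database and table names are your own.
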
## Unstructured data scanning (S3 only)
| File Type | Extensions |
| --- | --- |
| Document | ".docx", ".pdf" |
| Webpage | ".htm", ".html" |
| Email | ".eml" |
| Code | ".java", ".py", ".cpp", ".c", ".h", ".html", ".css", ".js", ".php", ".rb", ".swift", ".go", ".sql" |
| Text | ".txt", ".md", ".log" |
| Image | ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif" (ID cards, business licenses, driver's licenses, faces) |
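
The extension list above can be expressed as a simple lookup. The following Python sketch maps an S3 object key to the file types whose extension list contains its suffix. The names `FILE_TYPE_EXTENSIONS` and `file_types_for_key` are hypothetical, and matching on the lowercased suffix is an assumption. The solution may apply different rules. Because ".html" is listed under both Webpage and Code, the function returns a list rather than a single type.

```python
from pathlib import PurePosixPath

# File types and extensions copied from the table above, in the same order.
FILE_TYPE_EXTENSIONS = {
    "Document": {".docx", ".pdf"},
    "Webpage": {".htm", ".html"},
    "Email": {".eml"},
    "Code": {".java", ".py", ".cpp", ".c", ".h", ".html", ".css", ".js",
             ".php", ".rb", ".swift", ".go", ".sql"},
    "Text": {".txt", ".md", ".log"},
    "Image": {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif"},
}


def file_types_for_key(key: str) -> list:
    """Return every file type whose extension list contains the key's suffix.

    Assumption: extensions are compared case-insensitively, and only the
    final suffix of the object key is considered.
    """
    suffix = PurePosixPath(key).suffix.lower()
    return [
        file_type
        for file_type, extensions in FILE_TYPE_EXTENSIONS.items()
        if suffix in extensions
    ]
```

For example, a key such as `reports/2023/summary.PDF` maps to the Document type, while a key with an unlisted extension returns an empty list.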