Class SchemaUtils


  • public class SchemaUtils
    extends Object
    Collection of helpful utilities that handle DocumentDB schema inference, type, and naming conversion.

    Inferred schemas are formed by scanning N documents from the desired collection and then taking the union of all fields in those documents. The same union approach is applied to complex types (structs, i.e. nested documents). If a type mismatch is discovered, we assume the type is VARCHAR, since most types can be coerced to VARCHAR. However, this naive coercion does not work well if you then try to filter on the coerced field, because when we push the filter into DocDB it will almost certainly result in no matches.
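
    The union-and-fallback rule described above can be illustrated with a minimal sketch. This is not the connector's implementation: the class InferenceSketch, its methods, and the type-name strings are hypothetical, and documents are modeled as plain java.util.Map values rather than BSON documents or Arrow fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and type strings are assumptions.
class InferenceSketch {
    // Map a sampled scalar value to an illustrative type name.
    static String typeOf(Object value) {
        if (value instanceof Integer || value instanceof Long) return "BIGINT";
        if (value instanceof Double || value instanceof Float) return "FLOAT8";
        if (value instanceof Boolean) return "BIT";
        return "VARCHAR";
    }

    // Union the fields of every sampled document into one schema.
    // A schema value is either a type name (String) or a nested schema (Map).
    static Map<String, Object> infer(List<Map<String, Object>> sampledDocs) {
        Map<String, Object> schema = new LinkedHashMap<>();
        for (Map<String, Object> doc : sampledDocs) {
            merge(schema, doc);
        }
        return schema;
    }

    @SuppressWarnings("unchecked")
    static void merge(Map<String, Object> schema, Map<String, Object> doc) {
        for (Map.Entry<String, Object> entry : doc.entrySet()) {
            String key = entry.getKey();
            Object value = entry.getValue();
            if (value == null) {
                continue; // a null carries no type information in this sketch
            }
            Object seen = schema.get(key);
            if (value instanceof Map) {
                if (seen == null) {
                    seen = new LinkedHashMap<String, Object>();
                    schema.put(key, seen);
                }
                if (seen instanceof Map) {
                    // Nested documents get the same union treatment.
                    merge((Map<String, Object>) seen, (Map<String, Object>) value);
                } else {
                    schema.put(key, "VARCHAR"); // scalar vs. struct mismatch
                }
            } else {
                String type = typeOf(value);
                if (seen == null) {
                    schema.put(key, type);
                } else if (!type.equals(seen)) {
                    schema.put(key, "VARCHAR"); // type mismatch falls back to VARCHAR
                }
            }
        }
    }
}
```

    In this sketch, a field that is an integer in one sampled document and a string in another ends up as VARCHAR, which is exactly the case where a pushed-down filter on that field is unlikely to match the integer-valued documents.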

    • Method Detail

      • inferSchema

        public static org.apache.arrow.vector.types.pojo.Schema inferSchema(com.mongodb.client.MongoDatabase db,
                                                                            TableName table,
                                                                            int numObjToSample)
        This method will produce an Apache Arrow Schema for the given TableName and DocumentDB connection by scanning up to the requested number of rows and using basic schema inference to determine data types.
        Parameters:
        db - The DocumentDB connection to use for the scan operation.
        table - The DocumentDB TableName for which to produce an Apache Arrow Schema.
        numObjToSample - The number of records to scan as part of producing the Schema.
        Returns:
        An Apache Arrow Schema representing the schema of the DocumentDB table.