Document Classification

You can perform more accurate and cost-effective document classification with Rhubarb’s vector sampling functionality. Internally, Rhubarb uses the Amazon Titan Multimodal Embeddings model for this purpose. Here are the high-level steps you will follow to set up a classifier and then run document classifications with it.

How are the documents classified?

Classification based on vector sampling relies on the premise that you have a small set of labeled documents and their corresponding vectors (embeddings). Given this sample set, you can generate vector embeddings for a new document and perform a similarity check against the sample set to determine which type of document the new document most closely resembles.

For this, Rhubarb offers a choice of two methods:

  1. Cosine similarity (cosine) measures the similarity of a new document’s vector embedding against the sample set of labeled vectors. Cosine similarity is the cosine of the angle between two non-zero vectors in a multi-dimensional space, derived from the dot product of the vectors divided by the product of their magnitudes (norms). For each class, Rhubarb finds the maximum similarity score among all of that class’s vectors compared to the page’s embedding. This score represents how closely the page’s content resembles the content typical of that class. If the highest similarity score across all classes is below a certain threshold (unknown_threshold), the page is classified as “UNKNOWN”, meaning it does not closely resemble any of the known classes above the specified confidence level. The UNKNOWN class is assigned the highest similarity score found, even though it’s below the threshold (more on this later).

  2. Euclidean distance, also known as L2 distance (l2), measures the straight-line distance between two points in a multi-dimensional space. It is derived from the square root of the sum of the squared differences between corresponding elements of the two vectors. For each class, the Euclidean distance between the page’s embedding and each of the class’s sample vectors is calculated. Unlike cosine similarity, which measures the angle between vectors, Euclidean distance measures the magnitude of the difference between vectors. A smaller Euclidean distance indicates a closer or more similar match, so for each class you would typically look for the minimum Euclidean distance between the page’s embedding and the class’s vectors to determine the closest match. A minimal sketch of both approaches appears after this list.
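
The following is an illustrative sketch of the classification logic described above, not Rhubarb’s internal implementation; the classify_page() helper, its parameters, and the use of NumPy are assumptions made purely for illustration.

import numpy as np

def classify_page(page_vec, class_samples, metric="cosine", unknown_threshold=0.8):
    # class_samples maps a class label to a list of sample embeddings (np.ndarray).
    best_class, best_score = None, None
    for label, vectors in class_samples.items():
        if metric == "cosine":
            # Highest cosine similarity between the page and the class's samples.
            scores = [np.dot(page_vec, v) / (np.linalg.norm(page_vec) * np.linalg.norm(v))
                      for v in vectors]
            class_score = max(scores)
            better = best_score is None or class_score > best_score
        else:  # "l2": the smallest straight-line distance is the closest match.
            scores = [np.linalg.norm(page_vec - v) for v in vectors]
            class_score = min(scores)
            better = best_score is None or class_score < best_score
        if better:
            best_class, best_score = label, class_score
    # Cosine: scores below the threshold are UNKNOWN; L2: scores above it are UNKNOWN.
    if metric == "cosine":
        too_dissimilar = best_score < unknown_threshold
    else:
        too_dissimilar = best_score > unknown_threshold
    return ("UNKNOWN", best_score) if too_dissimilar else (best_class, best_score)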

Prerequisites

  1. You will need to define the classes (a.k.a. labels or document classes) for your documents.

  2. Collect a few sample documents for each class (minimum 1, maximum 10 samples per class).

  3. Create a CSV manifest file

  4. Run sampling using the manifest file, capture the sample ID (you will need this to run classification)

  5. Run classifications on new documents using the sample ID from step 4.

Create a manifest file

Your manifest file must be in CSV format. For example:

BANK_STATEMENT,s3://your-bucket/samples/bank_stmt_0.pdf,1
BANK_STATEMENT,s3://your-bucket/samples/bank_stmt_1.pdf,1
BANK_STATEMENT,s3://your-bucket/samples/bank_stmt_2.pdf,1
INVOICE,s3://your-bucket/samples/invoice_0.pdf,1
INVOICE,s3://your-bucket/samples/invoice_1.pdf,1
RECEIPT,s3://your-bucket/samples/receipt_0.pdf,1
RECEIPT,s3://your-bucket/samples/receipt_1.pdf,1
RECEIPT,s3://your-bucket/samples/receipt_2.pdf,1

The manifest CSV file must contain three fields:

  1. The first field is the class label.

  2. The second field is the sample document belonging to that class, either a local path or an s3:// location. S3 locations are recommended, but local paths can be used during development.

  3. The third field is the page number within the document (in case the document is a multi-page PDF).

Note

If you would like to include all pages of a multi-page document in the sample dataset, you must add a separate line to the CSV for each page, repeating the document path with the corresponding page number, as sketched below.
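
For instance, here is a minimal sketch that writes such a manifest with Python’s csv module and uploads it to S3; the class labels, file paths, and bucket name are placeholders, and the multi-page sample simply repeats the same document with each page number.

import csv
import boto3

# Each row is (class_label, document_location, page_number). To include every
# page of a multi-page sample, repeat the document once per page number.
rows = [
    ("BANK_STATEMENT", "s3://your-bucket/samples/bank_stmt_0.pdf", 1),
    ("INVOICE", "s3://your-bucket/samples/invoice_0.pdf", 1),
    ("INVOICE", "s3://your-bucket/samples/invoice_0.pdf", 2),
]

with open("manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Optionally upload the manifest to S3 (recommended) so run_sampling() can read it.
boto3.Session().client("s3").upload_file("manifest.csv", "your-bucket", "manifest.csv")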

You can store this manifest file in an S3 location (recommended) or use it locally. Once your manifest file is ready, you can create a classifier.

Set up a classifier

You initialize an instance of DocClassification and call the run_sampling() function with the manifest file path to start the sampling process. This should be a fairly quick process and shouldn’t take more than a few minutes. The function will return a sample_id which you can subsequently use to run document classification tasks. You will also need to provide an S3 bucket where the resulting vector samples will be stored for later use.

from rhubarb import DocClassification
import boto3

session = boto3.Session()

dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
classifier = dc.run_sampling(manifest_path="s3://your-bucket/manifest.csv")

Sample output

{
    "sample_id": "rb_classifier_1711608335"
}

Creating a classifier (sample) is a one-time process, i.e. once you have set up your classifier you can reuse the sample_id to perform document classification tasks.

Using a classifier

You use the run_classify() function with the path to the new document you would like to classify. The file can be a multi-page document, and Rhubarb will classify each page individually.

from rhubarb import DocClassification
import boto3

session = boto3.Session()

dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
results = dc.run_classify(sample_id="rb_classifier_1711608335",
                          file_path="./test_docs/bank_stmt.pdf")

Sample output

[
    {
        "page": 1,
        "classification": [
                {
                    "class": "BANK_STATEMENT",
                    "score": 0.92
                }
            ]
    }
]
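
Since run_classify() returns a plain list of page-level results, you can consume it directly. The loop below is a small sketch that prints the top prediction for each page, assuming the classification list is ordered best-first as in the sample output above.

# Iterate over the results returned by run_classify() and print each page's
# top prediction; downstream routing logic would go here.
for page_result in results:
    top = page_result["classification"][0]
    print(f"Page {page_result['page']}: {top['class']} (score {top['score']:.2f})")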

Using unknown_threshold

In some cases, your document processing pipeline may encounter documents that do not belong to any of the pre-configured classes. In such cases, it can be useful to mark these documents as UNKNOWN and isolate them for further analysis. This can be achieved using the unknown_threshold parameter. By default, the value is set to 0.8, i.e. any page whose best similarity score falls below this threshold is automatically marked as UNKNOWN. However, you can override this value to further tune your classification task.

from rhubarb import DocClassification
import boto3

session = boto3.Session()
dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
results = dc.run_classify(sample_id="rb_classifier_1711608335",
                          file_path="./test_docs/Sample1.pdf",
                          top_n=2,
                          unknown_threshold=0.85)
results

Sample output

[
    {
        "page": 1,
        "classification": [
            {"class": "BANK_STATEMENT", "score": 0.93},
            {"class": "DISCHARGE_SUMMARY", "score": 0.76}
        ]
    },
    {
        "page": 2,
        "classification": [
            {"class": "RECEIPT", "score": 1.0},
            {"class": "INVOICE", "score": 0.78}
        ]
    },
    {"page": 3, "classification": [{"class": "UNKNOWN", "score": 0.6150004682895025}]},
    {"page": 4, "classification": [{"class": "UNKNOWN", "score": 0.7916701829486292}]},
    {"page": 5, "classification": [{"class": "UNKNOWN", "score": 0.8255265125891919}]},
    {"page": 6, "classification": [{"class": "UNKNOWN", "score": 0.768307452125929}]}
]

The score for an UNKNOWN classification is the highest similarity score found (even though it’s below the threshold). This gives a sense of how close the page was to being classified into one of the known classes before being deemed unknown. To see which class the unknown pages would otherwise be classified into, you can reduce the unknown_threshold and re-run the classification. Ideally, you would experiment and tune this threshold based on your use case, the number of samples you used while creating the classifier, the number of classes, and the different types of documents you receive.
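
For example, the sketch below re-runs the same classification with a lower unknown_threshold to see which known class the previously UNKNOWN pages are closest to; the value of 0.6 is purely illustrative and should be tuned for your documents.

from rhubarb import DocClassification
import boto3

session = boto3.Session()
dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)

# A lower threshold lets previously UNKNOWN pages resolve to their closest
# known class so you can inspect where they land.
results = dc.run_classify(sample_id="rb_classifier_1711608335",
                          file_path="./test_docs/Sample1.pdf",
                          top_n=2,
                          unknown_threshold=0.6)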

Using similarity_metric

By default, Rhubarb uses cosine similarity to determine the class of a given page. With cosine similarity, scores range between 0 and 1 and higher is better: a higher score means the given page is more similar to the samples of that class.

However, you can also choose the Euclidean distance similarity metric (a.k.a. L2 distance). In this case, scores also range between 0 and 1 but lower is better: a smaller score means the straight-line distance between the vector of the given page and the vectors of the class it is categorized into is smaller. You can set similarity_metric to l2 to use Euclidean distance for classification.

Note

With Euclidean distance, unknown_threshold works in the opposite direction from cosine similarity: if unknown_threshold is set to 0.5, any page whose best score is greater than 0.5 will be marked as UNKNOWN, since it is the most dissimilar to all of the known classes.

from rhubarb import DocClassification
import boto3

session = boto3.Session()
dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
results = dc.run_classify(sample_id="rb_classifier_1711608335",
                          file_path="./test_docs/Sample1.pdf",
                          similarity_metric="l2",
                          unknown_threshold=0.5) # threshold is low because smaller is better

Sample output

[
    {"page": 1, "classification": [{"class": "BANK_STATEMENT", "score": 0.38}]},
    {"page": 2, "classification": [{"class": "RECEIPT", "score": 0.0}]},
    {"page": 3, "classification": [{"class": "UNKNOWN", "score": 0.87749589194715}]},
    {"page": 4, "classification": [{"class": "UNKNOWN", "score": 0.6454917724396393}]},
    {"page": 5, "classification": [{"class": "UNKNOWN", "score": 0.5907173241851739}]},
    {"page": 6, "classification": [{"class": "UNKNOWN", "score": 0.6807239544254996}]}
]

Viewing a classifier

You can view the details of an existing classifier using the view_sample() function. This will show you the classes configured for your classifier and the number of samples used to create it.

from rhubarb import DocClassification
import boto3

session = boto3.Session()

dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
dc.view_sample(sample_id="rb_classifier_1711608335")

Sample output

[
    {"class": "BANK_STATEMENT", "num_samples": 6},
    {"class": "INVOICE", "num_samples": 6},
    {"class": "RECEIPT", "num_samples": 6}
]

Updating a classifier

You may subsequently need to update your classifier to add new classes or samples. You can do that just as you would create a new classifier, with the help of a new manifest CSV file, but this time you supply an existing sample_id to the run_sampling() function via the update_sample_id parameter. This will process the manifest file but, instead of creating a new classifier, will update the existing classifier (sample). For example, with the new manifest2.csv below, we update our existing sample.

manifest2.csv

DISCHARGE_SUMMARY,s3://your-bucket/samples/discharge_summary_0.pdf,1
DISCHARGE_SUMMARY,s3://your-bucket/samples/discharge_summary_1.pdf,1
DISCHARGE_SUMMARY,s3://your-bucket/samples/discharge_summary_2.pdf,1
DISCHARGE_SUMMARY,s3://your-bucket/samples/discharge_summary_3.pdf,1
DISCHARGE_SUMMARY,s3://your-bucket/samples/discharge_summary_4.pdf,1

Note

It is recommended that you use samples that were not previously used to create the classifier (sample), since Rhubarb does not perform any de-duplication internally.

from rhubarb import DocClassification
import boto3

session = boto3.Session()
dc = DocClassification(bucket_name="your-classifier-bucket",
                       boto3_session=session)
classifier = dc.run_sampling(manifest_path="s3://your-bucket/manifest2.csv",
                             update_sample_id="rb_classifier_1711608335")

Sample output

{
   "sample_id": "rb_classifier_1711608335"
}

View the updated classifier

from rhubarb import DocClassification
import boto3

session = boto3.Session()

dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
dc.view_sample(sample_id="rb_classifier_1711608335")

Sample output

[
    {"class": "BANK_STATEMENT", "num_samples": 6},
    {"class": "INVOICE", "num_samples": 6},
    {"class": "RECEIPT", "num_samples": 6},
    {"class": "DISCHARGE_SUMMARY", "num_samples": 5}
]

How is the classifier stored?

Your classifier consists of three distinct parts:

  1. A unique identifier (sample_id)

  2. The class labels (e.g. BANK_STATEMENT, RECEIPT, and so on)

  3. The multi-modal vector embeddings for each sample document in each class, as specified in the manifest file.

All of this data is arranged and stored in a compressed Parquet file in the S3 bucket you provide (via the bucket_name parameter). Within the bucket, Rhubarb stores all your classifiers under the prefix rb_classification by default. However, you can override this prefix by setting the classification_prefix config to your desired prefix via Rhubarb’s GlobalConfig and its update_config() function.

Warning

The prefix must not contain any leading or trailing /.

from rhubarb import DocClassification, GlobalConfig
import boto3

session = boto3.Session()

GlobalConfig.update_config(classification_prefix="my_classifiers")
dc = DocClassification(bucket_name="your-classifier-bucket", boto3_session=session)
classifier = dc.run_sampling(manifest_path="s3://your-bucket/manifest.csv")
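
If you later want to inspect what was written, the sketch below simply lists the objects stored under the configured prefix; the bucket name and prefix are the placeholders used above, and the exact object layout is not documented here.

import boto3

s3 = boto3.Session().client("s3")

# Enumerate the classifier artifacts Rhubarb stored under the configured prefix.
resp = s3.list_objects_v2(Bucket="your-classifier-bucket", Prefix="my_classifiers/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])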