Schema Creation Assistant

Rhubarb can also create JSON schemas from plain-text prompts. Provide a document, describe the values you want to extract from it, and the generate_schema() function responds with a JSON schema. You can pass that schema as the output_schema, as shown above, or modify it further to fit your needs.

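import boto3
from rhubarb import DocAnalysis

# Uses your default AWS credentials and region
session = boto3.Session()

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf",
                 boto3_session=session,
                 pages=[1])
# Adjacent string literals are joined, so the prompt contains no stray whitespace
resp = da.generate_schema(message="I want to extract the employee name, employee SSN, employee address, "
                                  "date of birth and phone number from this document.")
resp['output']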

Sample output

{
  "type": "object",
  "properties": {
    "employeeName": {
      "type": "object",
      "properties": {
        "first": {
          "type": "string",
          "description": "The employee's first name"
        },
        "initial": {
          "type": "string",
          "description": "The employee's middle initial"
        },
        "last": {
          "type": "string",
          "description": "The employee's last name"
        }
      },
      "required": ["first", "last"]
    },
    "employeeSSN": {
      "type": "string",
      "description": "The employee's social security number"
    },
    "employeeAddress": {
      "type": "object",
      "properties": {
        "street": {
          "type": "string",
          "description": "The employee's street address"
        },
        "city": {
          "type": "string",
          "description": "The employee's city"
        },
        "state": {
          "type": "string",
          "description": "The employee's state"
        },
        "zipCode": {
          "type": "string",
          "description": "The employee's zip code"
        }
      },
      "required": ["street", "city", "state", "zipCode"]
    },
    "dateOfBirth": {
      "type": "string",
      "description": "The employee's date of birth in MM/DD/YY format"
    },
    "phoneNumber": {
      "type": "string",
      "description": "The employee's phone number"
    }
  },
  "required": [
    "employeeName",
    "employeeSSN",
    "employeeAddress",
    "dateOfBirth",
    "phoneNumber"
  ]
}
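
Because the generated schema is plain JSON Schema, it can be adjusted before it is passed as output_schema. The sketch below is illustrative only. It assumes resp['output'] is a Python dict (as the usage in this section suggests) and works on a trimmed copy of the sample schema above. The tweak_schema helper, the SSN pattern, and the choice to make phoneNumber optional are hypothetical examples, not something Rhubarb produces or requires.

```python
import copy

# Trimmed copy of the sample schema above (only two of the five fields kept).
generated_schema = {
    "type": "object",
    "properties": {
        "employeeSSN": {
            "type": "string",
            "description": "The employee's social security number",
        },
        "phoneNumber": {
            "type": "string",
            "description": "The employee's phone number",
        },
    },
    "required": ["employeeSSN", "phoneNumber"],
}


def tweak_schema(schema):
    """Return a modified copy of the schema; the original is left untouched."""
    tweaked = copy.deepcopy(schema)
    # Assumption: SSNs appear as three digit groups separated by spaces or dashes.
    tweaked["properties"]["employeeSSN"]["pattern"] = r"^\d{3}[ -]\d{2}[ -]\d{4}$"
    # Assumption: the phone number may be absent, so it is dropped from "required".
    tweaked["required"] = [name for name in tweaked["required"] if name != "phoneNumber"]
    return tweaked


custom_schema = tweak_schema(generated_schema)
print(custom_schema["required"])  # ['employeeSSN']
```

The resulting custom_schema could then be passed as output_schema in place of the generated one.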

We can then use this schema to perform extraction on the same document.

output_schema = resp['output']
# Adjacent string literals are joined, so the prompt contains no stray whitespace
resp = da.run(message="I want to extract the employee name, employee SSN, employee address, date of "
                      "birth and phone number from this document. Use the schema provided.",
              output_schema=output_schema)

Sample output

{
  "output": {
    "employeeName": {
      "first": "Martha",
      "initial": "C",
      "last": "Rivera"
    },
    "employeeSSN": "376 12 1987",
    "employeeAddress": {
      "street": "8 Any Plaza, 21 Street",
      "city": "Any City",
      "state": "CA",
      "zipCode": "90210"
    },
    "dateOfBirth": "09/19/80",
    "phoneNumber": "(383) 555-0100"
  },
  "token_usage": {
    "input_tokens": 2107,
    "output_tokens": 146
  }
}
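
It can be useful to confirm that the extracted output actually contains every field the schema marks as required. The following is a minimal sketch using only the standard library. The missing_required helper is hypothetical and not part of Rhubarb. It checks only the "required" lists of nested objects and is not a full JSON Schema validator, for which a dedicated library such as jsonschema would be a better fit.

```python
def missing_required(schema, data, path=""):
    """Return dotted paths of required properties that are absent from data."""
    missing = []
    # Required properties missing at this level.
    for name in schema.get("required", []):
        if name not in data:
            missing.append(path + name)
    # Recurse into nested objects that are present in the data.
    for name, subschema in schema.get("properties", {}).items():
        value = data.get(name)
        if subschema.get("type") == "object" and isinstance(value, dict):
            missing.extend(missing_required(subschema, value, path + name + "."))
    return missing


# Trimmed version of the sample schema and output shown above.
schema = {
    "type": "object",
    "properties": {
        "employeeName": {
            "type": "object",
            "properties": {
                "first": {"type": "string"},
                "initial": {"type": "string"},
                "last": {"type": "string"},
            },
            "required": ["first", "last"],
        },
        "employeeSSN": {"type": "string"},
    },
    "required": ["employeeName", "employeeSSN"],
}

complete = {
    "employeeName": {"first": "Martha", "initial": "C", "last": "Rivera"},
    "employeeSSN": "376 12 1987",
}
incomplete = {"employeeName": {"first": "Martha", "initial": "C"}}

print(missing_required(schema, complete))    # []
print(missing_required(schema, incomplete))  # ['employeeSSN', 'employeeName.last']
```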

Schema creation assistance with question rephrase

In many cases you may want to get started on a JSON schema for your document quickly, without spending much time crafting a precise prompt. For example, for a birth certificate you might ask something as vague as “I want to get the child’s, the mother’s and father’s details from the given document”. In such cases Rhubarb can rephrase the question based on the contents of the document and generate a matching schema, and the two can be used together to extract the data. To do this, set the assistive_rephrase parameter in your call to the generate_schema() function.

from rhubarb import DocAnalysis

da = DocAnalysis(file_path="./test_docs/birth_cert.jpeg",
                 boto3_session=session)
resp = da.generate_schema(message="I want to get the child's, the mother's and father's details from the given document",
                          assistive_rephrase=True)
resp['output']

Sample output

{
  "rephrased_question": "Extract the child's name, date of birth, sex, place of birth, mother's name, mother's date of birth, mother's place of birth, mother's address, father's name, and father's place of birth from the given birth certificate document.",
  "output_schema": {
    "type": "object",
    "properties": {
      "child": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "The child's full name"
          },
          "dateOfBirth": {
            "type": "string",
            "description": "The child's date of birth"
          },
          "sex": {
            "type": "string",
            "description": "The child's sex"
          },
          "placeOfBirth": {
            "type": "string",
            "description": "The child's place of birth"
          }
        },
        "required": ["name", "dateOfBirth", "sex", "placeOfBirth"]
      },
      "mother": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "The mother's full name"
          },
          "dateOfBirth": {
            "type": "string",
            "description": "The mother's date of birth"
          },
          "placeOfBirth": {
            "type": "string",
            "description": "The mother's place of birth"
          },
          "address": {
            "type": "string",
            "description": "The mother's address"
          }
        },
        "required": ["name", "dateOfBirth", "placeOfBirth", "address"]
      },
      "father": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "The father's full name"
          },
          "placeOfBirth": {
            "type": "string",
            "description": "The father's place of birth"
          }
        },
        "required": ["name", "placeOfBirth"]
      }
    },
    "required": ["child", "mother", "father"]
  }
}

You can then pass the rephrased question and the generated output_schema to run() to extract the data.

output_schema = resp['output']['output_schema']
question = resp['output']['rephrased_question']

resp = da.run(message=question,
              output_schema=output_schema)
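
If the same rephrased question and schema will be reused across similar documents, one option is to store them in a JSON file instead of regenerating them each time. The sketch below is an assumption about how that might be done with the standard library. The save_extraction_config and load_extraction_config helpers and the file name are hypothetical, and sample_output is a shortened stand-in for resp['output'].

```python
import json
import tempfile
from pathlib import Path


def save_extraction_config(output, path):
    """Write the rephrased question and schema to a JSON file."""
    config = {
        "rephrased_question": output["rephrased_question"],
        "output_schema": output["output_schema"],
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_extraction_config(path):
    """Read the file back and return (question, schema)."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return config["rephrased_question"], config["output_schema"]


# Shortened stand-in for resp['output'] from generate_schema(..., assistive_rephrase=True).
sample_output = {
    "rephrased_question": "Extract the child's name and date of birth from the given birth certificate document.",
    "output_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "The child's full name"},
            "dateOfBirth": {"type": "string", "description": "The child's date of birth"},
        },
        "required": ["name", "dateOfBirth"],
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "birth_cert_config.json"  # hypothetical file name
    save_extraction_config(sample_output, config_path)
    question, schema = load_extraction_config(config_path)

print(question)
```

The loaded question and schema could then be passed to run() as message and output_schema, in the same way as above.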