Named Entity Recognition (NER)
Rhubarb comes with 50 built-in entities, including common ones such as LOCATION and EVENT. Entities are available via the Entities class. You can choose which entities to detect and pass them to the run_entity() method.
import boto3
from rhubarb import DocAnalysis, Entities

# Assumes AWS credentials are already configured in the environment
session = boto3.Session()

da = DocAnalysis(file_path="./test_docs/employee_enrollment.pdf",
                 boto3_session=session,
                 pages=[1, 3])
resp = da.run_entity(message="Extract all the specified entities from this document.",
                     entities=[Entities.PERSON, Entities.ADDRESS])
Sample response
{
  "output": [
    {
      "page": 1,
      "entities": [
        {"PERSON": "Martha C Rivera"},
        {"ADDRESS": "8 Any Plaza, 21 Street, Any City, CA 90210"}
      ]
    },
    {
      "page": 3,
      "entities": [
        {"PERSON": "Mateo Rivera"},
        {"PERSON": "Pat Rivera"},
        {"ADDRESS": "8 Any Plaza, 21 Street, Any City, CA 90210"}
      ]
    }
  ],
  "token_usage": {
    "input_tokens": 3531,
    "output_tokens": 168
  }
}
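The response groups entities by page, with each entity given as a single-key object mapping the entity type to the detected value. A minimal sketch of regrouping the results by entity type follows. It assumes the response is available as a Python dict shaped like the sample above; the helper name collect_entities is hypothetical and not part of Rhubarb.

```python
from collections import defaultdict

# Sample response copied from above. In practice this would be the value
# returned by run_entity(), assumed here to be a plain Python dict.
resp = {
    "output": [
        {
            "page": 1,
            "entities": [
                {"PERSON": "Martha C Rivera"},
                {"ADDRESS": "8 Any Plaza, 21 Street, Any City, CA 90210"},
            ],
        },
        {
            "page": 3,
            "entities": [
                {"PERSON": "Mateo Rivera"},
                {"PERSON": "Pat Rivera"},
                {"ADDRESS": "8 Any Plaza, 21 Street, Any City, CA 90210"},
            ],
        },
    ],
    "token_usage": {"input_tokens": 3531, "output_tokens": 168},
}


def collect_entities(response):
    """Group detected values by entity type as (page, value) pairs.

    Hypothetical helper for illustration; not part of the Rhubarb API.
    """
    grouped = defaultdict(list)
    for page_result in response.get("output", []):
        page = page_result.get("page")
        for entity in page_result.get("entities", []):
            # Each entity is a single-key dict: {entity_type: value}
            for entity_type, value in entity.items():
                grouped[entity_type].append((page, value))
    return dict(grouped)


grouped = collect_entities(resp)
for entity_type, hits in grouped.items():
    for page, value in hits:
        print(f"{entity_type} (page {page}): {value}")
```

Keeping the page number alongside each value makes it possible to spot the same value detected on several pages, as with the address in the sample.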
Supported Entities
Below is a list of entities that are supported.
| Entity | Description |
|---|---|
| | A physical address, such as ‘100 Main Street, Anytown, USA’ or ‘Suite #12, Building 123’. |
| | An individual’s age, including the quantity and unit of time. |
| | A unique identifier that’s associated with a secret access key; used to sign programmatic AWS requests cryptographically. |
| | A unique identifier that’s associated with an access key. |
| | A three-digit card verification code (CVV) present on VISA, MasterCard, and Discover credit and debit cards. |
| | The expiration date for a credit or debit card. |
| | The number for a credit or debit card. |
| | A date, which can include a year, month, day, day of week, or time of day. |
| | The number assigned to a driver’s license. |
| | An email address. |
| | An International Bank Account Number (IBAN), which has specific formats in each country. |
| | An IPv4 address. |
| | A license plate for a vehicle. |
| | A media access control (MAC) address. |
| | An individual’s name. |
| | An alphanumeric string used as a password. |
| | A phone number. |
| | A four-digit personal identification number (PIN). |
| | A SWIFT code. |
| | A web address. |
| | A user name that identifies an account. |
| | A Vehicle Identification Number (VIN). |
| | A Canadian Health Service Number. |
| | A Canadian Social Insurance Number (SIN). |
| | An Indian Aadhaar number. |
| | An Indian National Rural Employment Guarantee Act (NREGA) number. |
| | An Indian Permanent Account Number. |
| | An Indian Voter ID number. |
| | A UK National Health Service Number. |
| | A UK National Insurance Number (NINO). |
| | A UK Unique Taxpayer Reference (UTR), a 10-digit number that identifies a taxpayer or a business. |
| | A US bank account number, typically 10 to 12 digits long. |
| | A US bank routing number, typically nine digits long. |
| | A passport number, ranging from six to nine alphanumeric characters. |
| | A US Individual Taxpayer Identification Number (ITIN), a nine-digit number. |
| | A US Social Security Number (SSN), a nine-digit number. |
| | A Spanish NIF number (personal tax ID). |
| | An Italian VAT code number. |
| | A Polish PESEL number. |
| | A National Registration Identification Card. |
| | An Australian Business Number (ABN), a unique 11-digit identifier issued to all entities registered in the Australian Business Register (ABR). |
| | An Australian Company Number, a unique nine-digit identifier issued by the Australian Securities and Investments Commission to every company registered under the Commonwealth Corporations Act 2001. |
| | An Australian tax file number (TFN), a unique identifier issued by the Australian Taxation Office to each taxpaying entity. |
| | A Medicare number, a unique identifier issued by the Australian Government. |
| | A branded product. |
| | An event, such as a festival, concert, election, etc. |
| | Large organizations, such as a government, company, religion, sports team, etc. |
| | Individuals, groups of people, nicknames, fictional characters. |
| | A quantified amount, such as currency, percentages, numbers, bytes, etc. |
| | An official name given to any creation or creative work, such as movies, books, songs, etc. |