Step 2: Define Classification and Grading Templates
Concept
- Data identifiers are specific rules for detecting certain sensitive data, such as identifiers for ID cards, email addresses, names, etc.
- A template is a collection of data identifiers. Templates will be used in sensitive data discovery jobs.
Best Practices
First, establish which data your company defines as sensitive and what rules identify that data. Only sensitive data that can be defined by regular expressions or AI can be identified by technical means (e.g., using the SDP solution). After defining data identifiers, add them to a template. When a sensitive data discovery job runs, it matches the data in the data source against the rules in the template and marks the matches in the data catalog.
View and Edit Built-in Data Identifiers
The solution provides built-in data identifiers, which are mainly based on national privacy data rules and provide a reference for you to decide how to classify your data.
On the Manage Data Identification Rules page, in the Built-in Data Identifiers tab, you can see a list of built-in data identifiers. For the complete list, please refer to Appendix - Built-in Data Identifiers.
You can click to edit and adjust the data grading classification. By default, these data identifiers have a PERSONAL category attribute and S2/S3/S4 identifier label attributes. You can update these attributes according to your sensitive data.
Create and Edit Custom Data Identifiers
On the Manage Data Identification Rules page, in the Custom Data Identifiers tab, you can see your defined list of custom data identifiers. By default, this list is empty. You can create or delete data identifiers based on business-sensitive data.
You can click to edit and adjust the data grading classification. For example, you can define the category as FINANCE/AUTO/GENERAL, and set the data sensitivity level to Level1/Level2/Level3/Level4.
To create a new data identifier, select Create Text-based Data Identifier.
On the data identifier creation page, you can define rules for sensitive data scanning, as detailed in the following table.
Parameter | Required | Description |
---|---|---|
Name | Yes | The name of the data identifier, used for automatic marking when sensitive data is scanned. |
Description | Optional | Additional explanation of the identifier, helpful for understanding its use and context. |
Identification Rules | Yes | Defines the rules for identifying data, which can be based on column name keywords, regular expressions, or a combination of both. |
Identifier Attributes | Optional | Allows for classification and grading of identifiers, e.g., by industry (Finance, Game, Personal, etc.) or security level (S1, S2, S3, etc.). |
Advanced Rules: Exclude Keywords | Optional | Defines column name keywords that should not be marked as sensitive data. |
Advanced Rules: Unstructured Data | Optional | Applicable for specific security levels (like S3), includes settings for the frequency of rule occurrence and the number of characters between keywords and regular expressions. |
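The parameters in the table above can be checked before you fill in the creation form. The following is a minimal sketch, not part of the solution: the class `DataIdentifierDraft` and its field names are hypothetical and only mirror the table rows. It assumes that "Identification Rules" is satisfied by a regular expression, column name keywords, or both.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataIdentifierDraft:
    """Hypothetical local model of the form fields; not an SDP API object."""
    name: str                                   # required
    description: str = ""                       # optional
    regex: Optional[str] = None                 # None means the regex rule is disabled
    keywords: List[str] = field(default_factory=list)          # column name keywords
    category: Optional[str] = None              # e.g. FINANCE (identifier attribute)
    level: Optional[str] = None                 # e.g. S2 (identifier attribute)
    exclude_keywords: List[str] = field(default_factory=list)  # advanced rule

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the draft looks complete."""
        problems = []
        if not self.name.strip():
            problems.append("Name is required")
        if self.regex is None and not self.keywords:
            problems.append("Identification Rules need a regex, keywords, or both")
        if self.regex is not None:
            try:
                re.compile(self.regex)
            except re.error as exc:
                problems.append(f"Regex does not compile: {exc}")
        return problems


if __name__ == "__main__":
    draft = DataIdentifierDraft(name="UserPrefix", regex="aaa_", keywords=["user"])
    print(draft.validate())
```

Checking that the regular expression compiles locally is only a convenience; the regex dialect accepted by the solution may differ from Python's `re` module.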
Add Data Identifiers to Template
- In the left menu, select Define Classification Template.
- Choose Add Data Identifier. You will see a sidebar displaying all data identifiers.
- Select one or more data identifiers and choose Add to Template.
Example: how data identifiers are labeled in the data catalog after a sensitive data discovery job.
Assume we want to detect sensitive data in the following table, named "PizzaOrderTable":
id | user_name | email_address | order_id |
---|---|---|---|
1 | aaa_frankzhu | frankzhu@mail.com | 12344536 |
2 | aaa_zheng | zhm@mail.com | 12344536 |
3 | aaa_patrickpark | ppark@example.com | 12344536 |
4 | aaa_kyle | kyle@qq.com | 1230000 |
Suppose we define five custom data identifiers:
Identifier Name | Regex | Keyword |
---|---|---|
OrderInfo1 | OrderInfo1 | order |
OrderInfo2 | (disabled) | _id |
UserEmail | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ | (disabled) |
EmailAddress | (disabled) | themail, email-address, email_address |
UserPrefix | aaa_ | user |
Assume all of the above identifiers were added to the classification template, and a discovery job was then started. The data catalog result for "PizzaOrderTable" is as follows:
Column | Identifiers | Privacy |
---|---|---|
id | N/A | Non-PII |
user_name | UserPrefix | Contain-PII |
email_address | UserEmail, EmailAddress | Contain-PII |
order_id | OrderInfo2 | Contain-PII |
Explanation for Identifiers:
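- The identifier "OrderInfo1" is not matched. Its keyword "order" matches the column name "order_id", but its regex does not match the data pattern, and an identifier that defines both a regex and keywords is labeled only when both match.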
- The identifier "OrderInfo2" is labeled on the "order_id" column because the keyword "_id" partially matches the column name "order_id".
- The identifier "UserEmail" is labeled on the "email_address" column because the regex matches the data pattern of the "email_address" column values.
- The identifier "EmailAddress" is labeled on the "email_address" column because one of the keywords "email_address" matches the column name.
- The identifier "UserPrefix" is labeled on the "user_name" column because both the regex and keywords match.
Explanation for Privacy labels:
- Columns "user_name", "email_address", and "order_id" receive the Contain-PII privacy label because at least one identifier matched each of them.
- Column "id" receives the Non-PII privacy label because no identifier matched.
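The labeling result above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the solution's actual matching engine: keywords are treated as substring matches on the column name, a regex must be found in every value of the column, and an identifier with both rules enabled needs both to match. The names `Identifier`, `identifier_matches`, and `label_table` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Identifier:
    name: str
    regex: Optional[str] = None        # None stands for "(disabled)"
    keywords: Tuple[str, ...] = ()     # empty tuple stands for "(disabled)"


def identifier_matches(ident: Identifier, column: str, values: List[str]) -> bool:
    """Assumed semantics: every enabled rule of the identifier must match."""
    if ident.regex is None and not ident.keywords:
        return False                   # no rule enabled, nothing to match
    if ident.keywords and not any(k in column for k in ident.keywords):
        return False                   # keyword rule: substring of the column name
    if ident.regex is not None:
        pattern = re.compile(ident.regex)
        if not values or not all(pattern.search(v) for v in values):
            return False               # regex rule: found in every column value
    return True


def label_table(table: Dict[str, List[str]],
                identifiers: List[Identifier]) -> Dict[str, Tuple[List[str], str]]:
    """Map each column to (matched identifier names, privacy label)."""
    catalog = {}
    for column, values in table.items():
        matched = [i.name for i in identifiers
                   if identifier_matches(i, column, values)]
        catalog[column] = (matched, "Contain-PII" if matched else "Non-PII")
    return catalog


# Data copied from the "PizzaOrderTable" example above.
PIZZA_ORDER_TABLE = {
    "id": ["1", "2", "3", "4"],
    "user_name": ["aaa_frankzhu", "aaa_zheng", "aaa_patrickpark", "aaa_kyle"],
    "email_address": ["frankzhu@mail.com", "zhm@mail.com",
                      "ppark@example.com", "kyle@qq.com"],
    "order_id": ["12344536", "12344536", "12344536", "1230000"],
}

# The five custom data identifiers from the example above.
IDENTIFIERS = [
    Identifier("OrderInfo1", regex="OrderInfo1", keywords=("order",)),
    Identifier("OrderInfo2", keywords=("_id",)),
    Identifier("UserEmail",
               regex=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"),
    Identifier("EmailAddress",
               keywords=("themail", "email-address", "email_address")),
    Identifier("UserPrefix", regex="aaa_", keywords=("user",)),
]

if __name__ == "__main__":
    for column, (names, privacy) in label_table(PIZZA_ORDER_TABLE, IDENTIFIERS).items():
        print(column, ", ".join(names) or "N/A", privacy)
```

Under these assumptions the sketch yields the same identifiers and privacy labels as the catalog table. A real discovery job may sample rows or apply match thresholds, so treat the "every value" rule as a simplification.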