Using Regular Expressions to Build a Microsoft Purview Custom Sensitive Information Type

Introduction

A colleague of mine needed to create a Microsoft Purview custom Sensitive Information Type (SIT) for a client. The client wanted to use this SIT as part of their implementation of Microsoft Purview and needed it to match occurrences of an organizational identifier that met specific criteria. There is a library of available SITs, but none of them was applicable. SITs are leveraged by several components in the Purview product family and in other areas of the Microsoft compliance universe.

This article will show how you can use Regular Expressions (RegEx) to help create a matching pattern in the “Primary Element” of a custom SIT. We will not be discussing how to configure the other three (3) components of the pattern in the SIT. We will begin with what a SIT is and how SITs are used in the Microsoft Purview solution suite to protect organizations and their priceless information assets. Microsoft Purview is a cornerstone solution that eGroup | Enabling Technologies uses to help our customers implement and maintain our overriding information security and compliance philosophy:

  1. Trust no one
  2. Harden everything

If you need to create a custom SIT, you will very likely need to use some RegEx. This article is not going to be a lesson on how to write RegEx rules but on how to create a custom SIT with a RegEx rule to provide the matching.

Sensitive Information Types

Sensitive Information Types (SITs) are used to identify and classify sensitive “items” that are in your organization’s data inventory. There are four (4) types of SIT:

  1. Built-in SITs
    1. These have been created by Microsoft and are available by default in the Compliance Console
    2. They cannot be edited but can be used as templates for custom SITs
    3. They can be found at Sensitive information type entity definitions
  2. Named Entity SITs
    1. These have also been created by Microsoft and are available in the Compliance Console
    2. They cannot be edited or copied
    3. They detect names of people, physical addresses, medical terms and conditions
    4. They can be found at Learn about Named Entities
  3. Exact Data Match (EDM) SITs
    1. These allow an organization to create custom SITs that refer to exact values in a dynamically updatable sensitive information database
    2. The content can be refreshed regularly
    3. An organization could update this database daily to include information about its employees, clients, patients, etc.
  4. Custom SITs
    1. If the Built-In and Named Entity SITs do not provide a rule that meets your requirements, you can create your own custom SITs
    2. Custom SITs can be written from “scratch” or can be based on one of the built-in SITs

SITs are leveraged currently by these components of the Microsoft Security and Compliance suite of products:

SITs are used to detect sensitive information in an organization’s documents, files, emails, chats, etc. Policies can be created to take an action if the SIT gets a match. For example, a policy could:

  • Tag an Excel workbook with the sensitivity label “Highly Confidential”
  • Warn a user that they may be trying to share something inappropriately
  • Block a user from sharing a document containing something sensitive
  • Prevent the sensitive information from being displayed in a Teams chat
  • Be set up to perform a regulatory compliance scan to safeguard against potential insider trading
  • Detect and manage the transfer of an employee’s personal data within the organization as required by the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA) and other similar regulations

Every SIT has a Name, Description, and a Pattern. The Pattern is the definition of what the SIT is looking for. It is composed of four (4) components:

  • Primary Element: This is what is being looked for. It can be a RegEx, keyword list, keyword dictionary or a function.
  • Supporting Element: These are used to provide supporting evidence that helps to increase the confidence of the match. If you are looking to match a Social Security number, having “SSN” appear in the document makes it more likely that the nine (9) digit number being looked at is a Social Security number and not a foreign phone number (most international numbers can be between eight (8) and sixteen (16) characters long). These can also be defined with a RegEx, keyword list or dictionary.
  • Confidence Level: An indication of how much supporting evidence was detected along with the Primary Element. Levels are defined as high, medium, or low.
  • Proximity: The number of characters between the Primary and Supporting element. If we find SSN several pages before the nine (9) digit number that matched the Primary Element, the likelihood that the number is a Social Security number is less than it would be if SSN preceded the number by a few characters.
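To make the interplay between the Primary Element, Supporting Element, Confidence Level and Proximity more concrete, here is a toy model in Python. It is only a sketch: the function name, the nine-digit pattern, the keyword list, the default window size and the way the window is measured are all assumptions made for illustration, and it does not reproduce how Purview actually evaluates a SIT.

```python
import re

# Hypothetical primary element: a standalone nine-digit number.
PRIMARY = re.compile(r"\b\d{9}\b")
# Hypothetical supporting keywords.
KEYWORDS = ("SSN", "Social Security")


def confidence(text: str, proximity: int = 300) -> str:
    """Toy scoring: 'high' if a keyword lies within `proximity` characters
    of a primary match, 'low' if only the primary matched, 'none' otherwise."""
    best = "none"
    for match in PRIMARY.finditer(text):
        best = "low"
        # Window of text surrounding the primary match.
        window_start = max(0, match.start() - proximity)
        window = text[window_start:match.end() + proximity]
        if any(keyword in window for keyword in KEYWORDS):
            return "high"
    return best
```

In this model, "SSN: 123456789" scores higher than a bare nine-digit number, and a keyword that sits outside the window no longer counts as supporting evidence.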

In this article, we are going to focus on creating a RegEx rule for the Primary Element that will match the criteria of the client’s internal identifier.

What are Regular Expressions (RegEx)?

Regular Expressions (RegEx) have been around since 1951, when the concept was originated by mathematician Stephen Cole Kleene. RegEx is a syntax that can be used to search for a pattern in text, in “find” or “find and replace” operations. It first came into popular use in Unix text-processing utilities. RegEx has been incorporated into most common programming languages including:

  • C#
  • C++
  • Java
  • JavaScript
  • Perl
  • PowerShell
  • Rust
  • Python
  • VBA

Over the years, many versions of the RegEx syntax have evolved. Microsoft 365 SITs use the Boost.RegEx 5.1.3 engine. This version of Boost is sometimes referred to by a different version number, 1.66. Microsoft Teams uses the .NET Framework version of RegEx. For a .NET Framework RegEx quick reference from Microsoft, refer to the Regular Expression Language – Quick Reference web page.

If you aren’t confused at this point, the SITs have some additional RegEx validation rules that you need to be aware of. Believe me, if you violate any of these rules, Microsoft 365 will let you know!

At eGroup | Enabling Technologies, we have been using RegEx for over a decade. RegEx has been the required syntax when creating telephone number Normalization Rules in Dial Plans from the days of Live Communications Server 2005 through Lync, Skype for Business, and Microsoft Teams. Some Session Border Controllers also use RegEx in their manipulation engines. The most common Dial Plan Normalization Rules translate four (4) digit telephone extensions (5100) dialed by a user into twelve (12) character E.164 phone numbers (+14436255100).

There are many sources and resources available on the web as well as those old-fashioned things called books! RegEx is supported in many programming languages. If you want to learn RegEx in general, avoid using a resource specific to a programming language; you won’t get the full picture.

Creating the Custom Sensitive Information Type

You can create a custom SIT in the Microsoft Purview Compliance Portal. Custom SITs can also be created offline in an XML file called a rule package. A colleague of ours wrote about this a few years ago in How to Create Data Loss Prevention Custom Sensitive Information Types.

Client Criteria and Initial RegEx Rule

The client asked us to create a SIT that would produce a match if a number in a document, e-mail, chat, etc. matched the definition of an organizational identifier with these criteria:

  • The number must be ten (10) digits long
  • It must not start or end with a zero (0)
  • Digits can be repeated sequentially up to three (3) times

Here are examples of numeric strings that meet the criteria:

  • 1234567891
  • 2223334445
  • 1234555666
  • 9876500019
  • 1112211122

And non-matching numeric strings:

  • 0234567891
  • 123456789
  • 1234567890
  • 1115555111
  • 1234500001
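Before writing any RegEx, it can help to state the three criteria as a plain check and run the sample strings through it. The following Python sketch does that without regular expressions. The function name is hypothetical, and the check assumes the candidate string has already been isolated from any surrounding text.

```python
from itertools import groupby


def is_org_identifier(candidate: str) -> bool:
    """Return True if `candidate` satisfies the three client criteria."""
    # Criterion 1: exactly ten ASCII digits.
    if len(candidate) != 10 or not (candidate.isascii() and candidate.isdigit()):
        return False
    # Criterion 2: must not start or end with zero.
    if candidate[0] == "0" or candidate[-1] == "0":
        return False
    # Criterion 3: no digit may appear more than three times in a row.
    longest_run = max(len(list(run)) for _, run in groupby(candidate))
    return longest_run <= 3


MATCHING = ["1234567891", "2223334445", "1234555666", "9876500019", "1112211122"]
NON_MATCHING = ["0234567891", "123456789", "1234567890", "1115555111", "1234500001"]
```

A check like this gives a reference answer for each sample, which is useful later when comparing what a candidate RegEx rule accepts and rejects.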

The criteria proved to be more challenging than expected. The restriction on not having more than three (3) repetitions of the same digit was the most difficult. The rule was first developed and tested using an online RegEx tester; there are several available, RegEx Tester and Regular Expressions 101 to name a few. Once we had the rule tested, we started the process of creating the SIT itself.

Creating the Custom SIT

  1. Sign in and navigate to the Microsoft Purview Compliance Portal, https://compliance.microsoft.com/homepage
  2. Click on “Data classification”
  3. Click on “Sensitive info types”
  4. Click “Create sensitive info type”
  5. Type in a name in the “Name” field for the SIT
  6. Add a description to the “Description” field. Descriptions are required.
  7. Click the “Next” button
  8. Click “Create pattern”
  9. Click the “+ Add primary element” drop-down
  10. Click on “Regular Expression”
  11. In the “ID” field, type in a name for the Regular Expression
  12. Paste the RegEx into the Regular Expression field. Even though the rule’s syntax was fine in the tool we used to create it, Microsoft 365 didn’t like the syntax. We’ll discuss this below.
  13. Select “String Match”. A match will occur even if the matched number is contained within preceding and/or ending text. The rule would match for “ID:1234567891Number”. If you select “Word Match”, the rule will only match instances of the string that “stand” by themselves. The example string would not match if you had selected “Word Match”.
  14. After fixing the syntax, the errors will clear
  15. Click the “Done” button
  16. Change the “Character proximity” as needed
  17. Add “Supporting Elements”
  18. Click the “Create” button
  19. Click the “Next” button
  20. Select a Confidence level
  21. Click the “Next” button
  22. Click the “Create” button
  23. Wait for the SIT to be created then click the “Done” button
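The difference between “String Match” and “Word Match” in the steps above can be approximated outside the portal. The Python sketch below uses the final rule from this article and models “Word Match” by wrapping the rule in \b word boundaries. That wrapping is an assumption made for illustration; Purview’s own word-matching logic and its Boost.RegEx engine may behave differently from Python’s re module.

```python
import re

# Final rule from this article.
RULE = (
    r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})"
    r"(?!(0))\d{9}(?!(0))\d"
)

# "String Match": the pattern may match anywhere, even inside other text.
string_match = re.compile(RULE)
# Approximation of "Word Match": require word boundaries around the pattern.
word_match = re.compile(r"\b(?:" + RULE + r")\b")

embedded = "ID:1234567891Number"
standalone = "ID: 1234567891 Number"

for text in (embedded, standalone):
    print(text, bool(string_match.search(text)), bool(word_match.search(text)))
```

In this approximation the embedded example is found only by the string-style search, while the standalone example is found by both.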

Testing the Custom SIT

  1. Click in the “Search” box and type in part of the name of the new rule
  2. Double-click on the name of the rule
  3. Click the “Test” button
  4. Click “Upload file”
  5. Select the test file.
  6. Click the “Open” button.
  7. Click the “Test” button.
  8. Wait for the test to complete, review the results.
  9. Click the “Finish” button.
  10. Correct the SIT as needed.

The Primary Element’s RegEx Rule of the SIT

  • To review, the client provided these requirements for matching the organizational identifier:
    • The number must be ten (10) digits long.
    • It must not start or end with a zero (0).
    • Digits can be repeated sequentially up to three (3) times.
  • The original rule that was created and tested with a web-based rule checker was:

    ((?!.*(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999).*)(?!(0))[0-9]{9}(?!(0))[0-9])

  • This rule failed validation when it was pasted into the Primary Element. The corrected rule was:

    (?!\d{0,6}(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999)\d{0,6})(?!(0))[0-9]{9}(?!(0))[0-9]

  • Upon further “dabbling” the final rule was defined:

    (?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})(?!(0))\d{9}(?!(0))\d
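As a quick sanity check, the three versions of the rule can be run against the sample strings listed earlier in this article. The sketch below uses Python’s re module, which is not the Boost.RegEx engine that Purview uses, so agreement here says nothing about whether a rule passes Purview’s validation. The helper name accepted is hypothetical.

```python
import re
from typing import List

ORIGINAL = r"((?!.*(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999).*)(?!(0))[0-9]{9}(?!(0))[0-9])"
CORRECTED = r"(?!\d{0,6}(0000|1111|2222|3333|4444|5555|6666|7777|8888|9999)\d{0,6})(?!(0))[0-9]{9}(?!(0))[0-9]"
FINAL = r"(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})(?!(0))\d{9}(?!(0))\d"

MATCHING = ["1234567891", "2223334445", "1234555666", "9876500019", "1112211122"]
NON_MATCHING = ["0234567891", "123456789", "1234567890", "1115555111", "1234500001"]


def accepted(rule: str, samples: List[str]) -> List[bool]:
    """Report, per sample, whether `rule` finds a match anywhere in it."""
    pattern = re.compile(rule)
    return [pattern.search(sample) is not None for sample in samples]


for name, rule in (("original", ORIGINAL), ("corrected", CORRECTED), ("final", FINAL)):
    print(name, accepted(rule, MATCHING), accepted(rule, NON_MATCHING))
```

On these ten samples all three rules give the same answers; the differences between them lie in the syntax Purview accepts, not in what they match here.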

High-Level Breakdown of the RegEx Rule

  1. Starting at the current position, look ahead for zero (0) to six (6) digits, then a digit repeated four (4) times, then zero (0) to six (6) more digits.
  2. If this look-ahead finds any digit that has repeated four (4) times, the match fails.
  3. If the match is still possible, check to see if the first digit is a zero (0). If it is, the match fails.
  4. If the match is still possible, match on a string of numbers, nine (9) digits long. If there are fewer than nine (9) digits, the rule fails.
  5. If the match is still possible, check if there is a tenth digit. If there isn’t, the match fails.
  6. If the match is still possible, check to see if the tenth digit is a zero (0). If so, the match fails.
  7. The match succeeds with the string of numbers matching the criteria for the organizational identifier.
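The breakdown above can be mapped onto the rule piece by piece. The sketch below writes the final rule with Python’s re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern. This layout is only a reading aid and an assumption of this example; the Purview portal expects the rule on a single line.

```python
import re

# The final rule, split across lines so each part can carry a comment.
ANNOTATED_RULE = re.compile(
    r"""
    (?!                 # steps 1-2: fail if the look-ahead below matches
        \d{0,6}         #   zero to six leading digits
        (0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})  # a digit four times in a row
        \d{0,6}         #   zero to six trailing digits
    )
    (?!(0))             # step 3: the first digit must not be zero
    \d{9}               # step 4: consume nine digits
    (?!(0))             # step 6: the tenth digit must not be zero
    \d                  # step 5: consume the tenth digit
    """,
    re.VERBOSE,
)

for sample in ("1234567891", "1115555111", "0234567891", "1234567890"):
    print(sample, bool(ANNOTATED_RULE.search(sample)))
```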

First Matching Section Detail

(?!\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6})

  • We are looking for a ten (10) digit numeric string.
  • We do not want a digit repeating more than three (3) times in a row.
  • A digit repeating four (4) times violates the criteria. A digit repeating five (5), six (6), seven (7) times or more also contains a run of four (4), so we only need to look for a digit that repeats four (4) times.
  • If a digit repeats four (4) times, the numeric string can only have six (6) more digits. The six (6) digits can be found in different places in the string. We are using “5555” to represent the repeating digits and “x” to represent all other digits in these samples:
    • All before the repeating digits, xxxxxx5555.
    • All after the repeating digits, 5555xxxxxx.
    • Split around the repeating digits, xx5555xxxx.
  • In the RegEx, the \d{0,6} means match if there are between zero (0) and six (6) digits.
  • The \d{0,6} is placed before and after the test for four (4) repetitions of each digit.
  • This results in getting a match if any digit repeats four (4) times in a ten (10) digit string.
  • The entire section is enclosed with (?!……), a construct known as a negative lookahead. This construct says that if there is a match based on the RegEx that appears between the “?!” and the “)”, the entire expression fails. If this first construct fails, don’t even bother trying the rest of the constructs; the rule fails and does not match the submitted numeric string.
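To see why a bound of six (6) is enough, the inside of the (?!……) construct can be tried on its own. The Python sketch below removes the negation, so a successful match means the full rule would reject the string at that position. The filler digits and the helper name are assumptions made for illustration.

```python
import re

# The inside of the first (?!...) construct, without the negation.
FOUR_RUN = re.compile(
    r"\d{0,6}(0{4}|1{4}|2{4}|3{4}|4{4}|5{4}|6{4}|7{4}|8{4}|9{4})\d{0,6}"
)


def run_detected(digits: str) -> bool:
    """True if a run of four identical digits starts within the first seven positions."""
    return FOUR_RUN.match(digits) is not None


# Slide "5555" through a ten-digit string made of six filler digits.
FILLER = "123467"  # contains no 5 and no repeated digits
samples = [FILLER[:offset] + "5555" + FILLER[offset:] for offset in range(7)]

for sample in samples:
    print(sample, run_detected(sample))
```

Every placement of the run inside a ten (10) digit string is caught. A run that starts later than the seventh position, as in the eleven (11) digit string 12345675555, is not seen from the first position because it ends beyond the ten (10) digits the rule consumes from there.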

 

Second Matching Section Detail

(?!(0))

  • Having passed the first test, this section asks if the first digit in the numeric string is a zero (0).
  • If it is, this construct fails; the rest of the rule is not tested and the match fails.

Third Matching Section Detail

\d{9}

  • This section looks for any nine (9) digits. These can be any digit between zero (0) and nine (9).
  • This section would allow all nine (9) digits to be the same, but the first section has already checked for this and would have disallowed it.
  • If there are fewer than nine (9) digits, the construct and the rule will fail.

Fourth Matching Section Detail

(?!(0))\d

  • If we’ve made it this far, we have matched nine (9) digits where:
    • The first digit is not a zero (0).
    • No digit repeats more than three (3) times consecutively.
  • This rule checks to make sure the tenth digit is not a zero (0). If it is, the construct and the rule will fail to make a match.
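The (?!(0)) construct followed by a digit appears twice in the rule, once for the first digit and once for the tenth. The short Python sketch below shows which single digits that combination accepts and compares it with the character class [1-9]. The character class is a hypothetical alternative shown only for comparison; it has not been validated in Purview and is not part of the rule described in this article.

```python
import re

NOT_ZERO_LOOKAHEAD = re.compile(r"(?!(0))\d")  # construct used in the rule
NOT_ZERO_CLASS = re.compile(r"[1-9]")          # hypothetical alternative

DIGITS = "0123456789"
lookahead_accepts = [d for d in DIGITS if NOT_ZERO_LOOKAHEAD.fullmatch(d)]
class_accepts = [d for d in DIGITS if NOT_ZERO_CLASS.fullmatch(d)]

print(lookahead_accepts)
print(class_accepts)
```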

Summary

  • Sensitive Information Types (SIT) are fundamental elements in an organization’s implementation of the Microsoft Security and Compliance suite of products.
  • These rules are what are used to identify and classify the organization’s information assets.
  • Microsoft has provided many built-in SITs that fulfill the identification and classification requirements for most customers. However, there are always exceptions.
  • Custom SITs are used to meet the specific data identification and classification requirements of an organization.
  • If you need to create a custom SIT, you will very likely need to create a RegEx rule.
  • With all the different flavors of RegEx out there and the SITs’ own RegEx validation requirements, creating RegEx rules can be challenging.

 

eGroup | Enabling Technologies is available and ready to help you harden and protect your organization and its information assets. SITs and custom SITs are basic components used by many of the security and compliance tools in the Microsoft product family. Determining which SITs you need to leverage or create is not always straightforward.

We have been writing RegEx rules for the Microsoft Unified Communications products and using them in various programming projects, PowerShell scripts, etc. for over fifteen (15) years. The need to use RegEx when creating custom SITs is something we are very comfortable with and ready to assist our customers in their implementation.

This series is part of our effort to help our customers implement a “Trust No One and Harden Everything” security infrastructure. If you need help in planning and implementing your organizational security infrastructure, please contact us!

John Miller

Cloud Solutions Architect - eGroup | Enabling Technologies

Last updated on July 26th, 2023 at 02:39 pm