Section 4 - Create Raw Validators using IntelliJ

In this section we will:

create a NOT NULL raw validator
create a POSITIVE raw validator
create an IN_SET raw validator
Description
Config Reference: Data Schemas Descriptor, Raw Data Point Validators
Artifacts: customers_a.xlsx, customers_b.xlsx

Processing Pipeline

Raw Validators without configurations

Raw validators are part of the processing pipeline of each data point.
They are defined in the config schemas, in this case the CUSTOMER schema.
We start by opening, in IntelliJ, the SCHEMA_CUSTOMERS.xml file we created in the previous section. Note: This file is the starting point for section 4, to which we will add data validators and processors. If you were not able to complete the previous section, you can copy the configuration below and paste it into SCHEMA_CUSTOMERS.xml to continue with this section.

We will now implement raw validations for the AGE data point descriptor, which appears at line 45 of the listing below.

    <?xml version="1.0" encoding="UTF-8"?>

    <apiroConf version="1" xmlns="http://apiro.com/apiro/v1/root">
        <groups/>
        <loadOrder>15</loadOrder>
        <schemas>
            <schema defBacked="false" historical="false" name="CUSTOMER">
                <groupTags>
                    <groupTag>EXAMPLES</groupTag>
                </groupTags>
                <metaData/>
                <identityKeys>
                    <identityKey>BAC</identityKey>
                </identityKeys>

                <!-- Data Point descriptions -->
                <dataPoints>
                    <dataPoint name="BAC"
                               dataType="STRING"
                               canEditValid="true"
                               canEditViolated="true"
                               displayName="BAC">
                        <nullable>false</nullable>

                        <metaData>
                            <item name="piiClassification">
                                <simpleValues>
                                    <simpleValue>High Risk</simpleValue>
                                </simpleValues>
                            </item>
                        </metaData>

                        <!-- BAC data point processors -->
                        <rawDPValidators/>
                        <rawDPProcessors/>
                        <!--consolidationAlgorithm></consolidationAlgorithm -->
                        <consDPValidators/>
                        <consDPProcessors/>
                    </dataPoint>

                    <dataPoint name="FIRST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="FIRST NAME"/>
                    <dataPoint name="LAST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="LAST NAME"/>
                    <dataPoint name="ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="ADDRESS"/>
                    <dataPoint name="PHONE_NUMBER" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PHONE NUMBER"/>
                    <dataPoint name="AGE" canEditValid="false" canEditViolated="true" dataType="INTEGER" displayName="AGE"/>
                    <dataPoint name="YEARLY_INCOME" canEditValid="false" canEditViolated="true" dataType="DECIMAL" displayName="YEARLY INCOME"/>
                    <dataPoint name="TFN" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="TFN"/>
                    <dataPoint name="PORTFOLIO_VALUE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PORTFOLIO VALUE"/>
                    <dataPoint name="COMPANY_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY NAME"/>
                    <dataPoint name="COMPANY_ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY ADDRESS"/>
                    <dataPoint name="PROFILE_IMAGE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PROFILE_IMAGE"/>
                    <dataPoint name="COMPANY_WEBSITE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY WEBSITE"/>
                    <dataPoint name="XML_ROOT_DOC"  canEditValid="false" canEditViolated="true"  displayName="XML Root Doc" dataType="XML"/>
                    <dataPoint name="JSON_ROOT_DOC"  canEditValid="false" canEditViolated="true"  displayName="JSON Root Doc" dataType="JSON"/>
                </dataPoints>
                <schemaAppliedProcessors>
                    <groupTags>
                        <groupTag>DEFAULT</groupTag>
                    </groupTags>
                    <metaData/>
                    <rawDPValidators/>
                    <rawDPProcessors/>
                    <consDPValidators/>
                    <consDPProcessors/>
                    <dataBlockProcessors/>
                </schemaAppliedProcessors>
                <alerts/>
            </schema>
        </schemas>
    </apiroConf>

Copy the AGE element below and use it to replace line 45. This implements the NOT_NULL and POSITIVE raw data point validators for the AGE data point descriptor. Note: Before proceeding, make sure that the AGE dataType is INTEGER and not STRING.

            <dataPoint  name="AGE" dataType="INTEGER" canEditValid="false" canEditViolated="false" displayName="Age">
                <rawDPValidators>
                    <!-- The name can be anything and it will appear in data audit/lineage -->
                    <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/>
                    <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE">
                        <!-- This is the default value if one is not specified -->
                        <lateBound>false</lateBound>
                    </rawDPValidator>
                </rawDPValidators>
            </dataPoint>
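
As the comment in the listing above notes, lateBound is false when it is not specified. Assuming that default applies whenever the element is omitted, the same chain can be sketched more compactly with a self-closing POSITIVE validator. This is an illustrative variant only, not a required change:

            <dataPoint  name="AGE" dataType="INTEGER" canEditValid="false" canEditViolated="false" displayName="Age">
                <rawDPValidators>
                    <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/>
                    <!-- Assumption: omitting lateBound is equivalent to setting it to false -->
                    <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE"/>
                </rawDPValidators>
            </dataPoint>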

You must now push your updated SCHEMA_CUSTOMERS.xml file to GIT and deploy it, following the instructions at the bottom of this page, to reload the configuration.
You have just implemented a chain of raw validators for the AGE data point.
If the AGE value from any data feed is NULL or not positive, a violation will be raised, tracked, audited and shown in the UI.
After reloading the schema, trigger sourcing for each feed individually (CUSTOMERS_A_XLSX, CUSTOMERS_B_XLSX) to ingest the data and run the pipeline for every data point and data block.
You can see below how the UI shows this violation in the Raw Data table.

However, the aggregated table does not show any violations, because the default behaviour is to aggregate all VALID data point values using the default mean algorithm. In this case the AGE sourced from one feed was INVALID because it was NULL, and the value sourced from the second feed was 22. The aggregated value is therefore 22, because there was only one VALID raw value. We will see later in the guide how to customize this behaviour. For example, we could specify that a data point value cannot be aggregated unless at least 2 feeds are available. This could be as simple as configuring a <consDPValidator name="MIN_FEEDS_FOR_AGE" entity="MINIMUM_FEEDS"> as shown below. Do not add this validator now, as it is outside the scope of this section; it is discussed in a later section.

    <dataPoint name="AGE" dataType="INTEGER">
        <consDPValidators>
            <consDPValidator name="MIN_FEEDS_FOR_AGE" entity="MINIMUM_FEEDS">
                <config>
                    <![CDATA[
                        {
                            "minFeeds" : 2
                        }
                    ]]>
                </config>
            </consDPValidator>
        </consDPValidators>
    </dataPoint>

If we double-click the aggregated cell (22 in this case), the Data Audit/Lineage popup dialog opens.

The Data Audit/Lineage popup dialog will show all the information related to this specific data point.

(1) The name of the first feed CUSTOMERS_A_XLSX, the sourced value null and the INVALID status of the data point from this feed.
(2) The name of the second feed CUSTOMERS_B_XLSX, the sourced value 22 and the VALID status of the data point from this feed.
(3) The aggregated value and the aggregation algorithm INTEGER_MEAN used to perform the aggregation.
(4) An indicator of whether we are allowed to manually edit this data point value.
(5) A complete data processing pipeline showing how the values were changed during the process.

Note: Whether a data point value is manually editable is specified in the schema config file through attributes on each data point, as shown below.

                <dataPoint  name="AGE"
                            displayName="Age"
                            dataType="INTEGER"
                            canEditValid="false"
                            canEditViolated="false"/>

If canEditViolated="true" and/or canEditValid="true" were set, the UI would allow manual updating of violated and/or valid values, respectively.
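
As an illustration, the base configuration at the start of this section declares most data points with canEditValid="false" and canEditViolated="true". Applied to AGE, that combination would look like the sketch below. Assuming the attributes behave as their names suggest, only violated AGE values could then be edited manually in the UI:

                <!-- Sketch: permit manual edits of violated values only (assumed semantics) -->
                <dataPoint  name="AGE"
                            displayName="Age"
                            dataType="INTEGER"
                            canEditValid="false"
                            canEditViolated="true"/>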

Raw Validators WITH configuration

Now, let's have a look at the IN_SET validator.
This predefined validator accepts a configuration, as shown below.
It validates the FIRST_NAME values and specifies that only Tom or Bob are valid. Any other first name will raise a violation.
Copy the FIRST_NAME element below and use it to replace the existing FIRST_NAME element in the SCHEMA_CUSTOMERS.xml file.

Push to GIT and reload the config as described at the bottom of this page.

        <dataPoint name="FIRST_NAME" 
                   dataType="STRING" 
                   displayName="First Name"
                   canEditValid="true"
                   canEditViolated="true">
            <rawDPValidators>
                <rawDPValidator name="IN_BAC_SET_CHECK" entity="IN_SET">
                    <config>
                        <![CDATA[
                            {
                                ignoreCase : true,
                                options : [ "Tom", "Bob"]
                            }
                        ]]>
                    </config>
                </rawDPValidator>
            </rawDPValidators>
        </dataPoint>
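
For comparison, here is a sketch of a stricter variant of the same validator. It assumes that ignoreCase controls whether the comparison against options is case-insensitive, and the validator name IN_NAME_SET_STRICT is made up for illustration. Under that assumption, a value such as tom would pass the configuration above but be flagged by this one. The rest of this section continues with the ignoreCase : true version, so do not apply this variant.

        <dataPoint name="FIRST_NAME"
                   dataType="STRING"
                   displayName="First Name"
                   canEditValid="true"
                   canEditViolated="true">
            <rawDPValidators>
                <!-- Hypothetical name; ignoreCase semantics are assumed, not confirmed -->
                <rawDPValidator name="IN_NAME_SET_STRICT" entity="IN_SET">
                    <config>
                        <![CDATA[
                            {
                                ignoreCase : false,
                                options : [ "Tom", "Bob"]
                            }
                        ]]>
                    </config>
                </rawDPValidator>
            </rawDPValidators>
        </dataPoint>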

You need to retrigger sourcing as described above.
As shown in the screenshot below, the record with first name Lucy is raised as a violated field.
We will see later in the guide that this list of allowed values doesn't always have to be hard-coded in the config. It can be retrieved dynamically from an external service or another schema.
As mentioned previously, Apiro provides fine-grained control over how raw and consolidated/aggregated data is processed and validated.
We can see below that, because the raw data for first name was violated, the pipeline did not proceed with consolidating the data points related to FIRST_NAME.
You will also notice that no violation was flagged for the consolidated data, even though the value is missing.
To flag this violation we need to introduce a <consDPValidator>.
We will discuss consolidated data point validators later in the guide, but we provide one here for completeness.

        <dataPoint name="FIRST_NAME" 
                   dataType="STRING" 
                   displayName="First Name"
                   canEditValid="true"
                   canEditViolated="true">
            <rawDPValidators>
                <rawDPValidator name="IN_BAC_SET_CHECK" entity="IN_SET">
                    <config>
                        <![CDATA[
                            {
                                ignoreCase : true,
                                options : [ "Tom", "Bob"]
                            }
                        ]]>
                    </config>
                </rawDPValidator>
            </rawDPValidators>

            <consDPValidators>
                <consDPValidator name="INVALID_IF_CONSOLIDATED_NULL" entity="NOT_NULL"/> 
            </consDPValidators>
        </dataPoint>

Once we add the consolidated data point validator for FIRST_NAME we can see the violation raised.
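
The same pattern could presumably be applied to AGE, combining the raw validator chain from earlier in this section with a consolidated NOT_NULL check. This is a sketch only: it assumes NOT_NULL behaves the same way for an INTEGER data point at the consolidated stage, and the name INVALID_IF_CONSOLIDATED_AGE_NULL is made up for illustration. It is not part of the completed configuration that follows.

        <dataPoint name="AGE" dataType="INTEGER" canEditValid="false" canEditViolated="false" displayName="Age">
            <rawDPValidators>
                <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/>
                <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE">
                    <lateBound>false</lateBound>
                </rawDPValidator>
            </rawDPValidators>

            <consDPValidators>
                <!-- Hypothetical validator name; intended to flag a missing consolidated AGE -->
                <consDPValidator name="INVALID_IF_CONSOLIDATED_AGE_NULL" entity="NOT_NULL"/>
            </consDPValidators>
        </dataPoint>

If it works like the FIRST_NAME case, such a validator would raise a violation only when no VALID raw AGE value is available to consolidate.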

Bringing it all together

Completed configuration files

This is the completed CUSTOMER schema configuration file that implements all of the above. Notice how simple and quick it was to add out-of-the-box and custom validators in a single configuration using the existing pre-wired pipelines, audit and data lineage features.

    <?xml version="1.0" encoding="UTF-8"?>

    <apiroConf version="1" xmlns="http://apiro.com/apiro/v1/root">
        <groups/>
        <loadOrder>15</loadOrder>
        <schemas>
            <schema defBacked="false" historical="false" name="CUSTOMER">
                <groupTags>
                    <groupTag>EXAMPLES</groupTag>
                </groupTags>
                <metaData/>
                <identityKeys>
                    <identityKey>BAC</identityKey>
                </identityKeys>

                <!-- Data Point descriptions -->
                <dataPoints>
                    <dataPoint name="BAC"
                               dataType="STRING"
                               canEditValid="true"
                               canEditViolated="true"
                               displayName="BAC">
                        <nullable>false</nullable>

                        <metaData>
                            <item name="piiClassification">
                                <simpleValues>
                                    <simpleValue>High Risk</simpleValue>
                                </simpleValues>
                            </item>
                        </metaData>

                        <!-- BAC data point processors -->
                        <rawDPValidators/>
                        <rawDPProcessors/>
                        <!--consolidationAlgorithm></consolidationAlgorithm -->
                        <consDPValidators/>
                        <consDPProcessors/>
                    </dataPoint>

                    <dataPoint name="FIRST_NAME"
                               dataType="STRING"
                               displayName="First Name"
                               canEditValid="true"
                               canEditViolated="true">
                        <rawDPValidators>
                            <rawDPValidator name="IN_BAC_SET_CHECK" entity="IN_SET">
                                <config>
                                    <![CDATA[
                                        {
                                            ignoreCase : true,
                                            options : [ "Tom", "Bob"]
                                        }
                                    ]]>
                                </config>
                            </rawDPValidator>
                        </rawDPValidators>

                        <consDPValidators>
                            <consDPValidator name="INVALID_IF_CONSOLIDATED_NULL" entity="NOT_NULL"/> 
                        </consDPValidators>
                    </dataPoint>

                    <dataPoint name="LAST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="LAST NAME"/>
                    <dataPoint name="ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="ADDRESS"/>
                    <dataPoint name="PHONE_NUMBER" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PHONE NUMBER"/>
                    <dataPoint  name="AGE" dataType="INTEGER" canEditValid="true" canEditViolated="true" displayName="Age">
                        <rawDPValidators>
                            <!-- The name can be anything and it will appear in data audit/lineage -->
                            <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/>
                            <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE">
                                <!-- This is the default value if one is not specified -->
                                <lateBound>false</lateBound>
                            </rawDPValidator>
                        </rawDPValidators>
                    </dataPoint>
                    <dataPoint name="YEARLY_INCOME" canEditValid="false" canEditViolated="true" dataType="DECIMAL" displayName="YEARLY INCOME"/>
                    <dataPoint name="TFN" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="TFN"/>
                    <dataPoint name="PORTFOLIO_VALUE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PORTFOLIO VALUE"/>
                    <dataPoint name="COMPANY_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY NAME"/>
                    <dataPoint name="COMPANY_ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY ADDRESS"/>
                    <dataPoint name="PROFILE_IMAGE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PROFILE_IMAGE"/>
                    <dataPoint name="COMPANY_WEBSITE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY WEBSITE"/>
                    <dataPoint name="XML_ROOT_DOC"  canEditValid="false" canEditViolated="true"  displayName="XML Root Doc" dataType="XML"/>
                    <dataPoint name="JSON_ROOT_DOC"  canEditValid="false" canEditViolated="true"  displayName="JSON Root Doc" dataType="JSON"/>
                </dataPoints>
                <schemaAppliedProcessors>
                    <groupTags>
                        <groupTag>DEFAULT</groupTag>
                    </groupTags>
                    <metaData/>
                    <rawDPValidators/>
                    <rawDPProcessors/>
                    <consDPValidators/>
                    <consDPProcessors/>
                    <dataBlockProcessors/>
                </schemaAppliedProcessors>
                <alerts/>
            </schema>
        </schemas>
    </apiroConf>

Deploy config files

Follow the steps in Config Deployment to deploy and start using your configuration files.
