Section 4 - Create Raw Validators using IntelliJ
Go back to Getting started guide
In this section we will:
- create a NOT NULL raw validator
- create a POSITIVE raw validator
-
create a IN_SET raw validator
Description Config Reference Artifacts
Processing Pipeline
Raw Validators without configurations
- Raw validators are part of the processing pipeline of each data point.
- They are defined in the config schemas. In this case CUSTOMER schema.
- We start by opening the
SCHEMA_CUSTOMERS.xml
file we created in the previous section, in IntelliJ. Note: This file is used a starting point just before starting section 4, to add data validators and processors. If you were not able to complete the previous section you could copy the configuration below and paste it intoSCHEMA_CUSTOMERS.xml
to continue with this section. - We will now implement raw validations for the data descriptor
AGE
highligted below at line 45.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
<?xml version="1.0" encoding="UTF-8"?> <apiroConf version="1" xmlns="http://apiro.com/apiro/v1/root"> <groups/> <loadOrder>15</loadOrder> <schemas> <schema defBacked="false" historical="false" name="CUSTOMER"> <groupTags> <groupTag>EXAMPLES</groupTag> </groupTags> <metaData/> <identityKeys> <identityKey>BAC</identityKey> </identityKeys> <!-- Data Point descriptions --> <dataPoints> <dataPoint name="BAC" dataType="STRING" canEditValid="true" canEditViolated="true" displayName="BAC"> <nullable>false</nullable> <metaData> <item name="piiClassification"> <simpleValues> <simpleValue>High Risk</simpleValue> </simpleValues> </item> </metaData> <!-- BAC data point processors --> <rawDPValidators/> <rawDPProcessors/> <!--consolidationAlgorithm></consolidationAlgorithm --> <consDPValidators/> <consDPProcessors/> </dataPoint> <dataPoint name="FIRST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="FIRST NAME"/> <dataPoint name="LAST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="LAST NAME"/> <dataPoint name="ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="ADDRESS"/> <dataPoint name="PHONE_NUMBER" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PHONE NUMBER"/> <dataPoint name="AGE" canEditValid="false" canEditViolated="true" dataType="INTEGER" displayName="AGE"/> <dataPoint name="YEARLY_INCOME" canEditValid="false" canEditViolated="true" dataType="DECIMAL" displayName="YEARLY INCOME"/> <dataPoint name="TFN" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="TFN"/> <dataPoint name="PORTFOLIO_VALUE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PORTFOLIO VALUE"/> <dataPoint name="COMPANY_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY NAME"/> <dataPoint name="COMPANY_ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY ADDRESS"/> <dataPoint name="PROFILE_IMAGE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PROFILE_IMAGE"/> <dataPoint name="COMPANY_WEBSITE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY WEBSITE"/> <dataPoint name="XML_ROOT_DOC" canEditValid="false" canEditViolated="true" displayName="XML Root Doc" dataType="XML"/> <dataPoint name="JSON_ROOT_DOC" canEditValid="false" canEditViolated="true" displayName="JSON Root Doc" dataType="JSON"/> </dataPoints> <schemaAppliedProcessors> <groupTags> <groupTag>DEFAULT</groupTag> </groupTags> <metaData/> <rawDPValidators/> <rawDPProcessors/> <consDPValidators/> <consDPProcessors/> <dataBlockProcessors/> </schemaAppliedProcessors> <alerts/> </schema> </schemas> </apiroConf>
- Copy the
AGE
element below and override line 45. This will implement theNOT_NULL
andPOSITIVE
raw data point validators for theAGE
Data Point Descriptor. Note: Before proceeding, you must ensure that theAGE dataType
isINTEGER
and not aSTRING
.1 2 3 4 5 6 7 8
<dataPoint name="AGE" dataType="INTEGER" canEditValid="false" canEditViolated="false" displayName="Age"> <rawDPValidators> <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/> // The name can be anything and it will appear in data audit/lineage <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE"> <lateBound>false</lateBound> // This is the default value if one is not specified </rawDPValidator> </rawDPValidators> </dataPoint>
- You must now push your updated
SCHEMA_CUSTOMER.xml
file to GIT and deploy as per the instructions provided at the bottom of this page to reload the configuration. - You have just implemented a chain of Raw Validators for the
AGE
data point. - If the
AGE
value of any data feed isNULL
orNON POSITIVE
, a violation will be raised, tracked, audited and shown in the UI. - After reloading the schema you need to trigger sourcing for both feeds individually (
CUSTOMERS_A_XLSX
,CUSTOMERS_B_XLSX
) to ingest the data and process each pipeline for every data point and data block. - You can see below how the UI shows this violation in the Raw Data table
- However, the aggregated table does not show any violations because the default behaviour is to aggregate all
VALID
data point values using the defaultmean average
algorithm. In this case theAGE
sourced from one feed wasINVALID
because it wasNULL
and the value sourced from the second feed was22
. This is why the aggregated value was 22 because there was only oneVALID
raw value. We will see later in the guide how we can customize this behaviour. For example we could specify that a data point value cannot be aggregated unless atleast 2 feeds are available. This could be as simple as configuring a<consDPValidator name="HAS_MIN_FEED_2" entity="MINIMUM_FEEDS">
as shown below. Do not add this validator at the moment as it is out of the scope of this section. It will be discussed at a later section.1 2 3 4 5 6 7 8 9 10 11 12 13
<dataPoint name="AGE" dataType="STRING"> <consDPValidators> <consDPValidator name="MIN_FEEDS_FOR_AGE" entity="MINIMUM_FEEDS"> <config> <![CDATA[ { "minFeeds" : 2 } ]]> </config> </consDPValidator> </consDPValidators> </dataPoint>
-
If we double click on the aggregated cell in this case
22
we will be able to see the Data Audit/Lineage popup dialog -
The Data Audit/Lineage popup dialog will show all the information related to this specific data point.
- (1) The name of the first feed
CUSTOMERS_A_XLSX
, the sourced valuenull
and theINVALID
status of the data point from this feed. - (2) The name of the second feed
CUSTOMERS_B_XLSX
, the sourced value22
and theVALID
status of the data point from this feed. - (3) The aggregated value and the aggregation algorithm
INTEGER_MEAN
used to perform the aggregation. - (4) Indicator if we are allowed to manually edit this data point value.
-
(5) A complete data processing pipeline showing how the values were changed during the process.
-
Note: Whether a data point value is manually editable or not, is specified in the schema config file as an attribute on each data point as shown below
1 2 3 4 5
<dataPoint name="AGE" displayName="Age" dataType="STRING" canEditValid="false" canEditViolated="false"/>
- If
canEditViolated="true"/>
and/orcanEditValid="true"/>
were set to true then the UI would allow the manual updating of any values that are violated.
- (1) The name of the first feed
Raw Validators WITH configuration
- Now, lets have a look at the IN_SET Validator.
- This predefined validator accepts a configuration as shown below.
- It validlates the
FIRST_NAME
values and specifies that onlyTom
orBob
are valid. Any other first names will raise a violation. - Copy the
FIRST_NAME
element below and override it in theCUSTOMER_SCHEMA.xml
file. - Push to GIT and reload the config as described at the bottom of this page.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
<dataPoint name="FIRST_NAME" dataType="STRING" displayName="First Name" canEditValid="true" canEditViolated="true"> <rawDPValidators> <rawDPValidator name="IN_BAC_SET_CHECK " entity="IN_SET"> <config> <![CDATA[ { ignoreCase : true, options : [ "Tom", "Bob"] } ]]> </config> </rawDPValidator> </rawDPValidators> </dataPoint>
- You need to retrigger sourcing as we described above.
- As shown in the screenshot below the record with first name
Lucy
are raised as violated fields. -
We will see later in the guide that this
exclusion
list doesn't always have to be hard coded in the config. It can be dynamically retrieved by external service or another schema. -
As mentioned previously Apiro provides a fine-grained control as to how
raw
andconsolidated/aggregated
data is processed and validated. - We can see below that because the raw data for first name was violated, then pipeline did not proceed with consolidating the data points related to
FIRST_NAME
- You will also notice that there was no violation flagged for consolidated data even thought the value is missing.
- In order to flag this violation we need to in introduce a
</consDPValidator>
- We will discuss consolidate data point validators later in the guide but we are provide this here for completeness
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
Bringing it all together
Completed configuration files
- This is the completed CUSTOMER schema configuration file that implements all the above. Notice how simple and quick it was to add out of the box and custom validators in a single configuration using the existing pre wired pipelines, audit and data lineage features.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
<?xml version="1.0" encoding="UTF-8"?> <apiroConf version="1" xmlns="http://apiro.com/apiro/v1/root"> <groups/> <loadOrder>15</loadOrder> <schemas> <schema defBacked="false" historical="false" name="CUSTOMER"> <groupTags> <groupTag>EXAMPLES</groupTag> </groupTags> <metaData/> <identityKeys> <identityKey>BAC</identityKey> </identityKeys> <!-- Data Point descriptions --> <dataPoints> <dataPoint name="BAC" dataType="STRING" canEditValid="true" canEditViolated="true" displayName="BAC"> <nullable>false</nullable> <metaData> <item name="piiClassification"> <simpleValues> <simpleValue>High Risk</simpleValue> </simpleValues> </item> </metaData> <!-- BAC data point processors --> <rawDPValidators/> <rawDPProcessors/> <!--consolidationAlgorithm></consolidationAlgorithm --> <consDPValidators/> <consDPProcessors/> </dataPoint> <dataPoint name="FIRST_NAME" dataType="STRING" displayName="First Name" canEditValid="true" canEditViolated="true"> <rawDPValidators> <rawDPValidator name="IN_BAC_SET_CHECK " entity="IN_SET"> <config> <![CDATA[ { ignoreCase : true, options : [ "Tom", "Bob"] } ]]> </config> </rawDPValidator> </rawDPValidators> <consDPValidators> <consDPValidator name="INVALID_IF_CONSOLIDATED_NULL" entity="NOT_NULL"/> </consDPValidators> </dataPoint> <dataPoint name="LAST_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="LAST NAME"/> <dataPoint name="ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="ADDRESS"/> <dataPoint name="PHONE_NUMBER" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PHONE NUMBER"/> <dataPoint name="AGE" dataType="INTEGER" canEditValid="true" canEditViolated="true" displayName="Age"> <rawDPValidators> <rawDPValidator name="INVALID_IF_NULL" entity="NOT_NULL"/> // The name can be anything and it will appear in data audit/lineage <rawDPValidator name="INVALID_IF_NEGATIVE" entity="POSITIVE"> <lateBound>false</lateBound> // This is the default value if one is not specified </rawDPValidator> </rawDPValidators> </dataPoint> <dataPoint name="YEARLY_INCOME" canEditValid="false" canEditViolated="true" dataType="DECIMAL" displayName="YEARLY INCOME"/> <dataPoint name="TFN" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="TFN"/> <dataPoint name="PORTFOLIO_VALUE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PORTFOLIO VALUE"/> <dataPoint name="COMPANY_NAME" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY NAME"/> <dataPoint name="COMPANY_ADDRESS" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY ADDRESS"/> <dataPoint name="PROFILE_IMAGE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="PROFILE_IMAGE"/> <dataPoint name="COMPANY_WEBSITE" canEditValid="false" canEditViolated="true" dataType="STRING" displayName="COMPANY WEBSITE"/> <dataPoint name="XML_ROOT_DOC" canEditValid="false" canEditViolated="true" displayName="XML Root Doc" dataType="XML"/> <dataPoint name="JSON_ROOT_DOC" canEditValid="false" canEditViolated="true" displayName="JSON Root Doc" dataType="JSON"/> </dataPoints> <schemaAppliedProcessors> <groupTags> <groupTag>DEFAULT</groupTag> </groupTags> <metaData/> <rawDPValidators/> <rawDPProcessors/> <consDPValidators/> <consDPProcessors/> <dataBlockProcessors/> </schemaAppliedProcessors> <alerts/> </schema> </schemas> </apiroConf>
Deploy config files
- Follow these steps Config Deployment to deploy and start using your configuration files.