What is a data point cleanser?

Data point cleansers modify the data imported from each point before they enter the sourcing pipeline process. This is helpful when data needs to be changed before it enters the sourcing pipeline. The most common scenario is when you have a numeric type Schema data point (named "percentage", for example) that is sourcing data containing non-numeric characters. If the data point named "percentage" of data type "DOUBLE" is sourcing data "15.5%" then it would normally throw a violation since the data contains a string character "%". However, using pre source cleansers we can remove the "%" character so that the data point sources "15.5" which can be cast into a DOUBLE type. The cleanser used in the above scenario is shown below:

<dataFeed definition="EXPR_CSV_FEED2" name="CUSTOMER_CSV_TEST">
    <execPriority>10</execPriority>
    <enabled>true</enabled>
    <push>false</push>
    <pull>true</pull>
    <schema>CUSTOMERS</schema>
    <config><![CDATA[
{
  "dataSource": {
    "entity": "IMPLICIT",
    "config": {
      "payload": "${LOCAL:PAYLOAD}",
      "mimeType": "text/plain"
    }
  },
  "titleAtLine": 1,
  "dataAtLine": 2,
  "explicitMappings": [
    {
      "dictionary": "BAC",
      "value": "#GRV{PAYLOAD.resolve('bac')}"
    },
    {
      "dictionary": "FIRSTNAME",
      "value": "#GRV{PAYLOAD.resolve('fname')}"
    },
    {
      "dictionary": "PERCENTAGE",
      "value": "#GRV{PAYLOAD.resolve('percentage')}",
      "cleansers" : [
        {
            "name":"REMOVE_PERCENTAGE_CHAR",
            "entity":"GEN_EXPRESS",
            "config": {
                "root":"#GRV{ return VALUE.replace("%", "") }"
            }
        }
      ]
    },
    {
      "dictionary": "ADDRESSL1",
      "value": "#GRV{PAYLOAD.resolve('a1')}"
    },
    {
      "dictionary": "ADDRESSL2",
      "value": "#GRV{PAYLOAD.resolve('a2')}"
    }
  ]
}
]]></config>
</dataFeed>

NOTE: Since cleansers are a pre source processor, they must be applied within the config of a data feed as can be seen above.

At its core, cleansers are represented within feed config as a Json array of "cleanser" objects. Here is the MOCK ANONYMISE cleanser, showing only its core structure:

"cleansers" : [
    {
         "anonCategory":"category",
         "extra":"anon"
    }
]

One or many cleansers can by added to the array. In the above example, only one is included but data points may contain multiple cleansers.

Full list of data point cleansers: