What is a data feed?

We mentioned previously in Data Schemas Descriptor that a schema specifies the rules and actions to take as a packet of information representing a datablock is ingested into Apiro. A datablock itself is an internal model representation of a set of units of data that together have some identity (such as an equity trade request).

A datablock is instantiated in Apiro by ingesting data from external systems. This data may arrive in arbitrary formats - from JSON REST services, spreadsheets, CSV files, PDF files, or any number of other places and formats.

A DataFeed is able to ingest data of a specific format - for example, a CSV file - and convert it into Apiro's internal DataBlock model representation to be processed by the pipeline. Apiro therefore has individual DataFeeds that are able to understand and extract data from Excel spreadsheets, CSV files, JSON files, XML files, etc.
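To make the conversion concrete, here is a minimal, illustrative sketch of what a CSV DataFeed conceptually does. Everything below is hypothetical - the function name, the column mapping, and the modelling of a DataBlock as a plain dictionary of DataPoints do not reflect Apiro's actual API.

```python
import csv
import io

def csv_feed_to_datablocks(raw_bytes: bytes, column_to_datapoint: dict[str, str]) -> list[dict]:
    """Convert raw CSV bytes into a list of DataBlock-like dicts.

    column_to_datapoint maps CSV header names to schema DataPoint names.
    """
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    blocks = []
    for row in reader:
        # Each row becomes one DataBlock; each mapped column one DataPoint.
        blocks.append({dp: row[col] for col, dp in column_to_datapoint.items()})
    return blocks

# Example: a trades CSV where the 'Ticker' and 'Qty' columns feed the
# schema's 'symbol' and 'quantity' DataPoints (names are hypothetical).
raw = b"Ticker,Qty\nBHP,100\nCBA,250\n"
print(csv_feed_to_datablocks(raw, {"Ticker": "symbol", "Qty": "quantity"}))
```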

Note that DataFeeds themselves are not concerned with where the data comes from - that is handled by another domain concept called a DataSource.

A DataFeed is able to process the data once it has been obtained by a DataSource. The two concepts are entirely orthogonal. For example, to process an Excel file that is sitting on an SFTP server, you would configure an SFTP DataSource to obtain the file from the SFTP server, and provide that to a properly configured Excel DataFeed to extract the data from the file. The DataSource is not itself concerned with the contents of the file it is obtaining (it is just a 'bag of bytes'), and the DataFeed is not concerned with where exactly the data it is extracting came from.

As such, the configuration of the DataSource is strictly related to how to access the file or data (host, port, username, password/key of the server, and the path to the file). The configuration of the DataFeed is strictly related to how the data in the file maps to a DataBlock as defined by the relevant schema (in the case of Excel: what sheet is the data on? What row do the DataBlocks start from? Which spreadsheet columns correspond to which DataPoints? etc.). DataFeed configurations minimally specify how to get the raw representational data for a DataPoint within a DataBlock. The corresponding Schema specifies the pipeline processing that the initial DataBlocks and DataPoints undergo from there (for example, data validation, derived processing, etc.).
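As an illustration of this separation of concerns, the sketch below models the two configurations as plain Python dictionaries. All keys and values are hypothetical - they are not Apiro's actual configuration schema - but they show how one configuration describes only access, and the other only mapping.

```python
# Hypothetical configurations only; the keys are illustrative.
# The DataSource knows how to fetch bytes; the DataFeed knows how to
# map those bytes onto a schema's DataPoints. Neither knows the other's job.

sftp_data_source = {
    "type": "sftp",
    "host": "files.example.com",     # where the file lives
    "port": 22,
    "username": "apiro",
    "private_key_path": "/etc/apiro/keys/sftp_ed25519",
    "remote_path": "/inbound/trades.xlsx",
}

excel_data_feed = {
    "type": "excel",
    "sheet": "Trades",               # which sheet holds the data
    "header_row": 1,
    "first_data_row": 2,             # DataBlocks start here
    "column_mapping": {              # spreadsheet columns -> DataPoints
        "A": "symbol",
        "B": "quantity",
        "C": "price",
    },
}
```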

In this light, it is important to note that multiple DataFeeds can be configured to ingest data for an individual Schema: it is an n-to-1 mapping. DataFeeds can also be combined in different ways. For example, two DataFeeds may redundantly ingest the same data for an individual DataBlock, and the processing pipeline may validate that the data is the same for each DataPoint, as sketched below.
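A minimal sketch of such a redundancy check, under the assumption that two feeds have each produced candidate values for the same DataBlock (the function and field names are hypothetical):

```python
def validate_redundant(block_a: dict, block_b: dict) -> list[str]:
    """Return the names of DataPoints whose values disagree between feeds."""
    return [dp for dp in block_a if block_a[dp] != block_b.get(dp)]

feed_a = {"symbol": "BHP", "quantity": "100"}
feed_b = {"symbol": "BHP", "quantity": "105"}  # disagreement on quantity
print(validate_redundant(feed_a, feed_b))       # ['quantity']
```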

Conversely, two or more DataFeeds may each contribute a partial, incomplete set of DataPoints to a DataBlock; together these unite to supply all the DataPoints required by the Schema to constitute a valid DataBlock. Alternatively, two DataFeeds may contribute full DataBlocks to a Schema that each represent a subset of all the DataBlocks for that Schema (for example, you may have an EQUITIES Schema where individual DataFeeds source the equities for individual exchanges: one for ASX, one for NASDAQ, and so on).
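And a sketch of the complementary composition pattern, where partial contributions unite into one DataBlock that is valid only once every schema-required DataPoint is present (again, all names here are hypothetical, not Apiro's actual model):

```python
REQUIRED_DATAPOINTS = {"symbol", "quantity", "price"}  # hypothetical schema

def merge_partial(*partials: dict) -> dict:
    """Compose one DataBlock from partial DataPoint contributions."""
    merged: dict = {}
    for partial in partials:
        merged.update(partial)
    missing = REQUIRED_DATAPOINTS - merged.keys()
    if missing:
        raise ValueError(f"DataBlock incomplete, missing: {sorted(missing)}")
    return merged

# One feed supplies trade identity, another supplies pricing.
print(merge_partial({"symbol": "BHP", "quantity": "100"}, {"price": "45.10"}))
```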

This documentation provides the information needed to configure these Apiro components to maximal effect.