All data-mature companies have chosen the strategy of collecting company data into a centralised data platform. Data is collected from systems that contain source data, from client devices, or from external parties. It is ingested into the platform, where it is processed offline and turned into refined, valuable data, which feeds analytics and data-driven features.
In a data platform, ingested data is copied as datasets into a data lake, or into data streams for real-time processing. Once data has been ingested into the data platform, it is processed in data pipelines: sequences of processing steps in which data is cleaned, combined, and refined. Each use case or data-driven feature has a dedicated data pipeline, although the pipelines share common steps, such as data cleaning and curation. Data is processed offline, without a direct connection to the online systems that serve users. At the end of each pipeline, the resulting refined data is copied to a suitable destination for consumption, e.g. a data warehouse for interactive analytics, or an online system for serving end users.
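The pipeline structure described above can be illustrated with a minimal sketch. It is not taken from any particular platform: the event fields, the function names, and the destination names (warehouse.revenue_by_country, serving.active_users) are all hypothetical, and an in-memory list stands in for a data lake dataset. It shows one shared cleaning step feeding two dedicated pipelines, each ending with a refined dataset intended for a different destination.

```python
from collections import defaultdict

# Hypothetical raw events, standing in for a dataset ingested into a data lake.
RAW_EVENTS = [
    {"user": "alice", "country": "SE", "amount": "10.0"},
    {"user": "bob", "country": "se", "amount": "5.5"},
    {"user": "", "country": "NO", "amount": "3.0"},        # missing user
    {"user": "carol", "country": "NO", "amount": "oops"},  # unparseable amount
    {"user": "alice", "country": "NO", "amount": "4.5"},
]


def clean(events):
    """Shared step: drop malformed records and normalise the remaining ones."""
    cleaned = []
    for event in events:
        if not event["user"]:
            continue
        try:
            amount = float(event["amount"])
        except ValueError:
            continue
        cleaned.append(
            {
                "user": event["user"],
                "country": event["country"].upper(),
                "amount": amount,
            }
        )
    return cleaned


def revenue_by_country(events):
    """Dedicated step for an analytics use case: total amount per country."""
    totals = defaultdict(float)
    for event in events:
        totals[event["country"]] += event["amount"]
    return dict(totals)


def active_users(events):
    """Dedicated step for a user-facing feature: sorted distinct users."""
    return sorted({event["user"] for event in events})


def run_pipelines(raw_events):
    """Run both pipelines offline and return one refined dataset per destination."""
    cleaned = clean(raw_events)  # common step, computed once and reused
    return {
        "warehouse.revenue_by_country": revenue_by_country(cleaned),
        "serving.active_users": active_users(cleaned),
    }


if __name__ == "__main__":
    for destination, dataset in run_pipelines(RAW_EVENTS).items():
        print(destination, dataset)
```

In a real platform each step would typically read and write datasets in the lake or on a stream instead of passing Python lists, and a final export step would copy the results to the warehouse or online system. The sketch only shows the shape: shared cleaning first, then pipelines that diverge per use case.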
In data-mature companies, the majority of data processing is structured as data pipelines, whether it is batch or real-time processing. The data platform is used for all sorts of computations, including simple propagation of data from one system to another, core business processes such as financial reporting, and data-intensive training of machine learning models.
Building data-driven products is an uncertain undertaking; success depends not only on technology, but also on the ingested data. It is therefore important to achieve short and efficient feedback cycles, in order to learn from production data and adapt early. Companies have adopted a DevOps culture in order to reduce development cycle time by removing the barrier between developers and operations. We now know that DevOps improves both speed and quality. For data-driven products, there is also a barrier between data scientists and data engineers to overcome, and the cure is known as DataOps. While most companies still have data scientists isolated from production, creating models that require translation before going to production, a few data-mature companies have successfully created cross-functional data product teams where experimental development and model evaluation happen in production.
Getting value from data and making efficient use of a data platform is both a technical and a cultural challenge. Effective use of data requires new workflows.