This is a curated list of recommended reading and viewing on scalable data processing. It is aimed primarily at software architects and developers, but it also contains material for people in leadership positions and for data scientists, in particular in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.
If you wish to discuss the contents or report any broken links, please do so via email to firstname.lastname@example.org. Also feel free to send any material you really think should be in here.
Data pipelines from zero to solid. End to end overview of how to build a data platform, including data ingestion, data pipelines, and serving computational results.
The next data engineering architecture: Beyond the lake, and the corresponding blog post, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Insights about scaling data processing environments beyond homogeneous data platforms, and proposed solution patterns. Rare and useful advice for companies that have come far in their data journey.
The profession of solving (the wrong problem). Keeping focus on the business value without getting distracted by technology, with real-world examples of typical failures.
Second generation “workflow managers” for big data. Explanation of workflow orchestration.
Applications in the emerging world of stream processing (slides). Good summary of how to build stream processing applications.
Stream processing, event sourcing, reactive, CEP… and making sense of it all. This blog post is also part of Martin Kleppmann’s free book Making Sense of Stream Processing, which also encompasses his two entries in the Data collection section.
Dataflow: A unified model for batch and streaming data processing. Google still dominates data processing technology, and looking at them is usually a glimpse into the future for open source technology. Apache Beam is still a bit young, but the semantics described in the presentation are likely to be the next architectural step, as an alternative to the lambda and kappa architectural patterns.
Cassandra data modeling best practices. This is one of the first documents on data modelling for Cassandra. It predates the Cassandra Query Language (CQL), but in order to use CQL efficiently, it is necessary to understand and adapt data models to the underlying structures.
Real-time stream processing at InMobi (Storm & Spark Streaming comparison)
Production Ready Data-Science with Python and Luigi. The steps from data science model to production pipeline.
Guide towards algorithm explainability in machine learning. Slides and code. Strategies for handling bias and looking into the black box of machine learning models.
Goods: organizing Google’s datasets. Google has higher ambitions for dataset structure than most companies need. The paper, however, gives insight into the kind of entropy that creeps into data processing systems, and examples of the structure and tools necessary to keep the chaos under control.
Schema evolution in Avro, Protocol Buffers and Thrift. Every data platform or pipeline should have a strategy for schema evolution. This article describes the details of how schemas evolve, and the difference between the three formats.
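To make the idea of schema evolution concrete, here is a minimal sketch in plain Python. It does not use the Avro, Protocol Buffers, or Thrift libraries; the schema representation, the `REQUIRED` marker, and the `resolve` function are hypothetical names invented for illustration. It mimics resolution by field name with defaults, roughly in the spirit of Avro (Protocol Buffers and Thrift identify fields by numeric tags instead).

```python
from typing import Any, Dict

# Hypothetical marker for a field that has no default value.
REQUIRED = object()

# Hypothetical schemas: field name -> default value (or REQUIRED).
# Version 2 adds a "language" field with a default.
SCHEMA_V1 = {"user_id": REQUIRED, "country": REQUIRED}
SCHEMA_V2 = {"user_id": REQUIRED, "country": REQUIRED, "language": "unknown"}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a decoded record onto the schema the reader expects."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            # Field known to both writer and reader: keep the written value.
            resolved[field] = record[field]
        elif default is not REQUIRED:
            # Field added after the record was written: fill in the default.
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    # Fields in the record that the reader does not know are dropped.
    return resolved


old_record = {"user_id": 1, "country": "SE"}
new_record = {"user_id": 2, "country": "NO", "language": "nb"}

# A new reader can read old data (backward compatibility).
print(resolve(old_record, SCHEMA_V2))
# An old reader can read new data (forward compatibility).
print(resolve(new_record, SCHEMA_V1))
```

The sketch suggests why adding a field with a default is usually a safe change, while adding a required field is not: an old record read with a schema that requires the new field would raise an error.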
Why is there a section here on Scala? Because Scala is rising as the preferred language for scalable data processing. The primary reason for this is not technical, but cultural; successful data-driven products rise out of a collaboration between data scientists and software engineers. The day-to-day activities of the former group involve model tinkering and experimentation, and the rituals and boilerplate involved in backend languages such as Java prohibit quick experimentation. The latter group, however, is concerned with operational stability, and languages frequently used for experimental purposes, such as Python and R, tend to be perceived as insufficiently rigid and lacking ecosystem support for quality assurance and operations.
Scala is the middle ground where these two worlds meet. It is succinct and expressive enough for experimental purposes, but also statically typed and standing on the JVM platform, providing the quality and operations ecosystems. The lion's share of innovation in data processing is therefore expressed in Scala, and it is a matter of time before most data-driven companies adopt it. It is possible to stick to Java or Python for data-driven products, but such decisions come at the cost of forgoing most of the innovation that happens in the data processing open source world.
Scala is powerful and well suited for data processing, but it comes with risks: it provides ammunition to shoot yourself in the foot, and it has an opinionated community that can be both elitist and cryptic. It is wise to collect input and advice from several different types of sources when adopting Scala.
Twitter Scala school. A guide to learning Scala from scratch.
Transitioning to Scala. Pragmatic advice for teams adopting Scala.
Building a company on Scala. Likewise pragmatic advice for companies using Scala.
Moving a team from Scala to Golang. This is a cautionary tale of Scala adoption gone wrong; it provides insight into cultural risks that have to be managed. These risks exist in all teams, but Scala provides fertile soil for them to bloom.
Don’t fear the implicits: Everything you need to know about typeclasses. Comprehensive explanation of typeclasses, one of the most powerful Scala constructs.
Privacy by design. My bag of tricks and patterns for protecting users’ privacy and complying with GDPR in a big data environment.
Papers we love (on distributed systems)
How To Become a Data Engineer. Resources for learning data engineering.