Data engineering reading list

Essential resources for data engineers

This is a curated list of recommended reading and viewing on scalable data processing. It is primarily aimed at software architects and developers, but there is also material for people in leadership positions and for data scientists, particularly in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.

If you wish to discuss the contents or report any broken links, please do so via email. Also feel free to send any material you really think should be in here.


Happy reading!

The big picture


Building scalable data pipelines

Data pipelines from zero to solid. End to end overview of how to build a data platform, including data ingestion, data pipelines, and serving computational results.

Building a data pipeline from scratch

Avoiding big data anti-patterns

The next data engineering architecture: Beyond the lake and the corresponding blog post How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Insights about scaling data processing environments beyond homogeneous data platforms, and proposed solution patterns. Rare and useful advice for companies that have come far in their data journey.

The profession of solving (the wrong problem). Keeping focus on the business value without getting distracted by technology, with real-world examples of typical failures.

End to end

Architecture and patterns

Business perspective, leadership

Head to tail

Data collection

Batch processing

Stream processing


Applications in the emerging world of stream processing (slides). Good summary of how to build stream processing applications.

Building real-time data-driven products (slides). Holistic view of building stream processing applications, architectural variants, and the tradeoffs involved.

The world beyond batch: Streaming 101

The world beyond batch: Streaming 102

Stream processing, event sourcing, reactive, CEP… and making sense of it all. This blog post is also part of Martin Kleppmann’s free book Making Sense of Stream Processing, which also encompasses his two entries in the Data collection section.

Apache Kafka, Samza, and the Unix philosophy of distributed data

Dataflow: A unified model for batch and streaming data processing. Google still dominates data processing technology, and looking at its work is usually a glimpse into the future of open source technology. Apache Beam is still a bit young, but the semantics described in the presentation are likely to be the next architectural step, as an alternative to the lambda and kappa architectural patterns.

Streaming big data & analytics for scale
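
One idea behind the Streaming 101/102 and Dataflow entries above is that results are grouped by event time (when something happened) rather than by arrival order. The following is a minimal sketch of fixed (tumbling) event-time windows in plain Python. The function name, the `(event_time, payload)` tuple layout, and the sample data are assumptions made for illustration. It deliberately ignores watermarks, triggers, and late data, which the referenced material covers in depth.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed event-time window, regardless of arrival order.

    `events` is an iterable of (event_time_seconds, payload) tuples;
    this layout is an assumption for the sketch, not a real API.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Each event belongs to the window that contains its event time.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Events arrive out of order, as they typically do in a distributed system.
events = [(3, "a"), (65, "b"), (10, "c"), (61, "d"), (59, "e")]
print(tumbling_window_counts(events, window_seconds=60))
```

Because the window is derived from the event's own timestamp, the out-of-order arrival of the events at times 10 and 59 does not change which window they are counted in.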

Data product serving, NoSQL

Cassandra data modeling best practices. This is one of the first documents on data modelling for Cassandra. It predates the Cassandra Query Language (CQL), but in order to use CQL efficiently, it is necessary to understand and adapt data models to the underlying structures.

Time series stream processing with Spark and Cassandra

Writing Datomic in Clojure

The Datomic information model

Components and comparisons

Creating value

Approximate algorithms

Data science, machine learning


Productivity, test, quality, monitoring


Test strategies for data processing pipelines (slides). How to build automated regression test suites for stream processing and batch processing data pipelines.

The mechanics of testing large data pipelines

Effective testing for spark programs

Testing Spark: best practices

Spark and Spark Streaming unit testing

Goods: organizing Google’s datasets. Google has higher ambitions for dataset structure than most companies need. The paper, however, gives insight into the kind of entropy that creeps into data processing systems, and examples of the structure and tools necessary to keep the chaos under control.

Schema & semantics

The unified logging infrastructure for data analytics at Twitter

Schema evolution in Avro, Protocol Buffers and Thrift. Every data platform or pipeline should have a strategy for schema evolution. This article describes the details of how schemas evolve, and the difference between the three formats.
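
As a rough illustration of what a schema evolution strategy has to handle, the sketch below projects records written with an older or newer schema onto the schema a reader expects: missing fields are filled from defaults and unknown fields are dropped. This is a simplified, hand-rolled model in plain Python, loosely inspired by reader/writer schema resolution; it does not use or reproduce the API of Avro, Protocol Buffers, or Thrift. The names `REQUIRED`, `resolve`, and the example fields are hypothetical.

```python
from typing import Any

# Sentinel marking a field without a default (hypothetical helper for this sketch).
REQUIRED = object()


def resolve(record: dict[str, Any], reader_schema: dict[str, Any]) -> dict[str, Any]:
    """Project a record onto the reader's schema.

    `reader_schema` maps field name to default value, or REQUIRED if none.
    """
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not REQUIRED:
            # Field added after the record was written: fall back to the default.
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    # Fields in the record that the reader does not know about are ignored.
    return resolved


# Version 2 of the reader schema added "country" with a default.
reader_schema_v2 = {"user_id": REQUIRED, "country": "unknown"}

old_record = {"user_id": 17}                                      # written before "country" existed
new_record = {"user_id": 18, "country": "SE", "device": "mobile"}  # written by a newer producer

print(resolve(old_record, reader_schema_v2))
print(resolve(new_record, reader_schema_v2))
```

The design point is that adding a field with a default keeps old data readable, while adding a required field does not; the article explains how each of the three formats handles these cases in practice.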


Why is there a section here on Scala? Because Scala is rising as the preferred language for scalable data processing. The primary reason for this is not technical, but cultural; successful data-driven products arise out of a collaboration between data scientists and software engineers. The day-to-day activities of the former group involve model tinkering and experimentation, and the rituals and boilerplate involved in backend languages such as Java prohibit quick experimentation. The latter group, however, is concerned with operational stability, and languages frequently used for experimental purposes, such as Python and R, tend to be perceived as insufficiently rigid and lacking ecosystem support for quality assurance and operations.

Scala is the middle ground where these two worlds meet. It is succinct and expressive enough for experimental purposes, but also statically typed and built on the JVM platform, which provides the quality and operations ecosystems. The lion's share of innovation in data processing is therefore expressed in Scala, and it is a matter of time before most data-driven companies adopt it. It is possible to stick to Java or Python for data-driven products, but such decisions come at the cost of forgoing most of the innovation that happens in the data processing open source world.


Scala is powerful and well suited for data processing, but it comes with risks: it gives you ammunition to shoot yourself in the foot, and an opinionated community that can be both elitist and cryptic. It is wise to collect input and advice from several different types of sources when adopting Scala.


Twitter Scala school. A guide to learning Scala from scratch.


Scala with style


Toward a safer Scala


Strategic Scala style: Practical type safety


Transitioning to Scala. Pragmatic advice for teams adopting Scala.

Building a company on Scala. Likewise pragmatic advice for companies using Scala.


Moving a team from Scala to Golang. This is a cautionary tale of Scala adoption gone wrong; it provides insight into cultural risks that have to be managed. These risks exist in all teams, but Scala provides soil for them to bloom.

Don’t fear the implicits: Everything you need to know about typeclasses. Comprehensive explanation of typeclasses, one of the most powerful Scala constructs.


Privacy by design. My bag of tricks and patterns for protecting users’ privacy and complying with GDPR in a big data environment.

Building privacy-protected data systems


How to prepare for proposed EU data protection regulation


Adapting your company to comply with EU privacy regulations (In Swedish)


Resource sites

