Essential resources for data engineers

This is a curated reading and viewing list for scalable data processing. It is primarily aimed at software architects and developers, but there is also material for people in leadership positions and for data scientists, in particular in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.

If you wish to discuss the contents or report any broken links, please do so via email to info@scling.com. Also feel free to send any material you really think should be in here.

Happy reading & watching!

The big picture

End to end

Building scalable data pipelines

Data pipelines from zero to solid. End to end overview of how to build a data platform, including data ingestion, data pipelines, and serving computational results.

Building a data pipeline from scratch

Avoiding big data anti-patterns

The next data engineering architecture: Beyond the lake and the corresponding blog post How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Insights about scaling data processing environments beyond homogeneous data platforms, and proposed solution patterns. Rare and useful advice for companies that have come far in their data journey.

The profession of solving (the wrong problem). Keeping focus on the business value without getting distracted by technology, with real-world examples of typical failures.

Architecture and patterns

Big data - principles and best practices of scalable realtime data systems

A call for sanity in NoSQL

Questioning the lambda architecture

System architectures for personalization and recommendation

Streaming analytics with Spark, Kafka, Cassandra, and Akka

Business perspective, leadership

The 5 stages of grief on the road to big data

Where big data projects fail

The 10 worst big data practices

Head to tail

Data collection

The Log: What every software engineer should know about real-time data’s unifying abstraction

Staying in sync: From transactions to streams

Getting data out of databases: a surprisingly tricky problem

Exactly-once streaming from Kafka

Kafka reliability - when it absolutely, positively has to be there

Infrastructure at scale: Apache Kafka, Twitter Storm & Elastic Search

Devices and timestamps: seriously though, WTF?

Batch processing

Second generation “workflow managers” for big data. Explanation of workflow orchestration.

Managing containerized data pipeline dependencies with Luigi

Scalable pipelines with Luigi

Stream processing

Applications in the emerging world of stream processing (slides). Good summary of how to build stream processing applications.

Building real-time data-driven products (slides). Holistic view on building stream processing applications, architectural variants, and the tradeoffs involved.

The world beyond batch: Streaming 101

The world beyond batch: Streaming 102

Stream processing, event sourcing, reactive, CEP… and making sense of it all. This blog post is also part of Martin Kleppmann’s free book Making Sense of Stream Processing, which also includes his two entries in the Data collection section.

Apache Kafka, Samza, and the Unix philosophy of distributed data

Dataflow: A unified model for batch and streaming data processing. Google still dominates data processing technology, and looking at them is usually a glimpse into the future for open source technology. Apache Beam is still a bit young, but the semantics described in the presentation are likely to be the next architectural step, as an alternative to the lambda and kappa architectural patterns.

Streaming big data & analytics for scale

Data product serving, NoSQL

Cassandra data modeling best practices. This is one of the first documents on data modeling for Cassandra. It predates the Cassandra Query Language (CQL), but to use CQL efficiently, it is necessary to understand the underlying storage structures and adapt data models to them.

Time series stream processing with Spark and Cassandra

Writing Datomic in Clojure

The Datomic information model

Components and comparisons

Insight data engineering ecosystem: An interactive map

Choosing an HDFS data storage format - Avro vs. Parquet and more

Hadoop file formats: it’s not just CSV anymore

Dataflow/Beam & Spark: A programming model comparison

Apache showdown: Flink vs. Spark

Comparison of various streaming technologies

Real-time stream processing at InMobi (Storm & Spark Streaming comparison)

Spark & Storm: when & where?

Picking the right SQL-on-Hadoop tool for the job

Creating value

Approximate algorithms

Some important streaming algorithms you should know about

Realtime personalization and recommendation with stream mining

Scalable real-time processing techniques - how to almost count

Acceptably inaccurate: probabilistic data structures

Data science, machine learning

Interactive recommender systems

Deep learning for high performance time-series databases

10 more lessons learned from building machine learning systems

Best practices for machine learning engineering

Production Ready Data-Science with Python and Luigi. The steps from data science model to production pipeline.

Guide towards algorithm explainability in machine learning. Slides and code. Strategies for handling bias and looking into the black box of machine learning models.

Practices

DataOps

7 steps to DataOps. A pragmatic and actionable guide to adopt DataOps.

Data engineering at the speed of software development. A walkthrough of the essential principles of DataOps.

Productivity, test, quality, monitoring

Test strategies for data processing pipelines (slides). How to build automated regression test suites for stream processing and batch processing data pipelines.

Big data, big quality? The dimensions of data quality, how to monitor them, and how to improve data maturity.

The mechanics of testing large data pipelines

Effective testing for Spark programs

Testing Spark: best practices

Spark and Spark Streaming unit testing

Goods: organizing Google’s datasets. Google has higher ambitions for dataset structure than most companies need. The paper, however, gives insight into the kind of entropy that creeps into data processing systems, and gives examples of the structure and tools necessary to keep the chaos under control.

Schema & semantics

The unified logging infrastructure for data analytics at Twitter

Schema evolution in Avro, Protocol Buffers and Thrift. Every data platform or pipeline should have a strategy for schema evolution. This article describes the details of how schemas evolve, and the difference between the three formats.
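One rule that schema evolution strategies commonly rely on is that a newly added field carries a default value, so that a reader on the new schema can still decode records written with the old one, and that a reader ignores fields it does not know about. The sketch below illustrates that rule with plain Python dictionaries. It does not use the Avro, Protocol Buffers, or Thrift libraries, and the names REQUIRED, READER_SCHEMA_V2, and read_record are hypothetical, chosen only for illustration.

```python
# Minimal sketch of reading records under an evolved schema.
# Assumption: a schema is modeled as a dict of field name -> default value.
# REQUIRED is a hypothetical sentinel for fields that have no default.
REQUIRED = object()

READER_SCHEMA_V2 = {
    "user_id": REQUIRED,    # present since the first schema version
    "country": "unknown",   # added in version 2, with a default
}


def read_record(record, reader_schema):
    """Project a written record onto the reader's schema."""
    result = {}
    for field, default in reader_schema.items():
        if field in record:
            # The writer supplied the field: use its value.
            result[field] = record[field]
        elif default is REQUIRED:
            # No value and no default: the record cannot be decoded.
            raise ValueError(f"missing required field: {field}")
        else:
            # Old writer did not know this field: fall back to the default.
            result[field] = default
    # Fields that only the writer knows about are silently dropped.
    return result


# A record written before "country" existed.
old_record = {"user_id": 42}
# A record from a newer writer with a field this reader does not know.
newer_record = {"user_id": 7, "country": "SE", "device": "phone"}

decoded_old = read_record(old_record, READER_SCHEMA_V2)
decoded_newer = read_record(newer_record, READER_SCHEMA_V2)
```

Under these assumptions the old record decodes with the default country, and the extra field on the newer record is ignored. Adding a required field without a default would break decoding of old records, which is why the change is modeled as an error here.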

Scala

Why is there a section here on Scala? Because Scala is rising as the preferred language for scalable data processing. The primary reason for this is not technical, but cultural: successful data-driven products arise from collaboration between data scientists and software engineers. The day-to-day activities of the former group involve model tinkering and experimentation, and the rituals and boilerplate involved in backend languages such as Java prohibit quick experimentation. The latter group, however, is concerned with operational stability, and languages frequently used for experimental purposes, such as Python and R, tend to be perceived as insufficiently rigid and lacking ecosystem support for quality assurance and operations.

Scala is the middle ground where these two worlds meet. It is succinct and expressive enough for experimental purposes, but also statically typed and built on the JVM platform, which provides the quality and operations ecosystems. The lion's share of innovation in data processing is therefore expressed in Scala, and it is a matter of time before most data-driven companies adopt it. It is possible to stick to Java or Python for data-driven products, but such decisions come at the cost of forgoing most of the innovation that happens in the open source data processing world.

Scala is powerful and well suited for data processing, but it comes with risks: it provides ample ammunition to shoot yourself in the foot, and it has an opinionated community that can be both elitist and cryptic. It is wise to collect input and advice from several different types of sources when adopting Scala.

Twitter Scala school. A guide to learning Scala from scratch.

Scala with style

Toward a safer Scala

Strategic Scala style: Practical type safety

Transitioning to Scala. Pragmatic advice for teams adopting Scala.

Building a company on Scala. Likewise pragmatic advice for companies using Scala.

Don’t fear the implicits: Everything you need to know about typeclasses. Comprehensive explanation of typeclasses, one of the most powerful Scala constructs.

Privacy

Privacy by design. My bag of tricks and patterns for protecting users’ privacy and complying with GDPR in a big data environment.

Building privacy-protected data systems

How to prepare for proposed EU data protection regulation

Adapting your company to comply with EU privacy regulations (in Swedish)

Performance

Everyday I’m shuffling - tips for writing better Spark programs

Avoid GroupByKey

Resource sites

A self-study list for data engineers and aspiring data architects

Hadoop weekly

Last week in stream processing & analytics

Big data resources

Awesome big data

10 free Hadoop tutorials

10 completely free resources for sharpening your skills in Hadoop

Big data projects

A curated list of data science articles

So you are interested in deep learning

Papers we love (on distributed systems)

How To Become a Data Engineer. Resources for learning data engineering.