Software Architecture

Data Analysis Pipeline

By Kees Jan Koster
Published 29.06.2023
Zero projects base their choices on data, so a lot of data is collected and that data is put to use in analysis. Building a reliable data analysis pipeline is an important aspect of all Zero projects.

Data analysis is an iterative learning process. One of our core values is to publish our learnings, so that you, the reader, may benefit from them. That is why the data analysis pipeline ends with publishing the data, the analysis code and our findings.

This document explains how we set up a staged pipeline for data analysis. It shows what steps are taken to go from raw sensor values all the way to published articles and notebooks.

Here is the pipeline overview. Each stage is explained below.

The preprocessing stage cleans and transforms the raw data into SI units. The analysis stage is the iterative learning process, the results of which are processed for publication.

For readability, data and metadata are considered a given.

Data and Metadata

This section explains how data is queried and how project metadata is linked, so that we can work with large volumes of data from wildly different sources.

I/O List Metadata

The sensors on a project are documented in technical drawings and in a master I/O list. The I/O list contains all sensors installed, with the metadata needed to interpret the data in the time series database.

One example of metadata in the I/O list would be the type of signal that a sensor outputs (volts, amperes, pulses etc.). The I/O list also describes whether the sensor follows a linear, logarithmic or some other non-linear response and which scaling is applied.

Each record in the I/O list has a key that identifies the associated data in the time series database. The format of this key depends on how the time series database uses keys. It may be necessary to use a composite key to uniquely identify a data series.

The I/O list has the advantage of being programmatically accessible, so the preprocessor can ingest the I/O list to determine what transformations are necessary for each data series.
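As a sketch, a preprocessor could load an I/O list exported to CSV along these lines. The file layout, field names and types are illustrative assumptions, not the actual project format.

```python
# Minimal sketch of ingesting an I/O list from CSV; the columns shown
# here (key, signal, response, unit, scale, offset) are assumptions.
import csv
from dataclasses import dataclass


@dataclass
class IoListEntry:
    key: str        # identifies the series in the time series database
    signal: str     # e.g. "4-20mA", "0-10V", "pulse"
    response: str   # "linear", "logarithmic", ...
    unit: str       # target unit after conversion, e.g. "m3/s"
    scale: float    # scaling factor applied by the preprocessor
    offset: float


def load_io_list(path: str) -> dict[str, IoListEntry]:
    """Read the I/O list and index it by the time series key."""
    entries: dict[str, IoListEntry] = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            entry = IoListEntry(
                key=row["key"],
                signal=row["signal"],
                response=row["response"],
                unit=row["unit"],
                scale=float(row["scale"]),
                offset=float(row["offset"]),
            )
            entries[entry.key] = entry
    return entries
```

The preprocessor can then look up `entries[series_key]` for any series it reads from the time series database and decide which conversion to apply.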

Sensor Data

All sensor and systems data in the project is gathered into a time series database. Data series are identified with the same key in the I/O list as in the time series database.

Weather Data

Weather data is a regular part of our systems, since most of our projects are heavily dependent on sun or wind. Reliable energy predictions almost certainly need weather factored in.

Weather data is retrieved from suitable weather data services in the form of GRIB files. For weather data we use the time and geo location as keys to relate the weather data to the sensor data.
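As an illustration, the sketch below uses xarray with the cfgrib engine to pick the grid point nearest to a project site and align one weather variable to the sensor timestamps. The file name, the variable name (`t2m`) and the coordinates are placeholders.

```python
# Sketch of relating GRIB weather data to sensor data by time and
# geo location; assumes xarray and cfgrib are installed and that all
# timestamps are UTC.
import pandas as pd
import xarray as xr

SITE_LAT, SITE_LON = 52.37, 4.90  # hypothetical project location

weather = xr.open_dataset("forecast.grib", engine="cfgrib")

# Pick the grid point nearest to the project site.
t2m = weather["t2m"].sel(latitude=SITE_LAT, longitude=SITE_LON, method="nearest")

# Convert to a pandas series (assuming time is the only remaining
# dimension) and align it to the sensor timestamps.
t2m_series = t2m.to_series()
sensor = pd.read_parquet("analysis_set.parquet")  # time-indexed sensor data
sensor["t2m"] = t2m_series.reindex(sensor.index, method="nearest")
```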

When working with weather data and maps, we prefer to use the Mercator projection, which has the advantage that straight lines on the map are lines of constant course.
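One way to draw such a map is with cartopy, as in the sketch below; the choice of cartopy and the map extent are assumptions for illustration.

```python
# Minimal sketch of a Mercator map with cartopy; straight lines drawn
# on this projection correspond to constant compass courses.
import cartopy.crs as ccrs
import matplotlib.pyplot as plt

ax = plt.axes(projection=ccrs.Mercator())
ax.coastlines()
ax.gridlines(draw_labels=True)
ax.set_extent([-10, 10, 45, 60], crs=ccrs.PlateCarree())  # hypothetical area
plt.show()
```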

We base our decisions on real life measurements, taken from projects with real people. Location data is therefore considered personal data and for that reason cannot be published.

Relational Data

Other systems may have relational data. In practice, relational data is not going to play a big role in the data analysis pipeline. Still, the preprocessor may ingest some data from relational databases too.
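Where that happens, the ingest could be as simple as the sketch below; the database, table and column names are purely illustrative.

```python
# Sketch of pulling relational data into the preprocessor; SQLite is a
# stand-in here, and the table and columns are hypothetical.
import sqlite3

import pandas as pd

with sqlite3.connect("project.db") as conn:
    maintenance = pd.read_sql(
        "SELECT sensor_key, serviced_at, note FROM maintenance_log",
        conn,
        parse_dates=["serviced_at"],
    )
```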

Data Preprocessing and Packaging

To give the analyst stable data sets to work with, we use a staged approach. The data preprocessor produces data files with clean and converted data. These files are known as the analysis sets.

Each analysis set is constructed specifically for an analysis task. The various analysis sets are constructed from a range of databases, in a variety of operational projects. What is presented in this document as a single preprocessor is in fact a collection of more or less reusable preprocessing code.

Time and Resolution

The data preprocessor performs any resampling needed to make the data align nicely to the time slots that the analysis expects. Not all data arrives at the time series database at the same data rate. It is the task of the data preprocessor to ensure that all data is time aligned and has the right sample rate.

All data is processed as being in the Coordinated Universal Time (UTC) time zone. This way we don't have to deal with time zones.
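A minimal sketch of this time alignment in pandas, assuming time-indexed frames; the sample rates, column names and the one-minute target grid are illustrative.

```python
# Sketch of resampling to a common grid in UTC.
import pandas as pd


def align(df: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """Resample a time-indexed frame onto a common grid, in UTC."""
    if df.index.tz is None:
        df = df.tz_localize("UTC")   # assume naive timestamps are UTC
    else:
        df = df.tz_convert("UTC")    # normalise everything to UTC
    return df.resample(freq).mean()  # mean per slot; pick per signal


# A fast and a slow series end up on the same one-minute grid.
fast = pd.DataFrame(
    {"flow": range(600)},
    index=pd.date_range("2023-06-29 10:00", periods=600, freq="s",
                        tz="Europe/Amsterdam"),
)
slow = pd.DataFrame(
    {"temp": range(10)},
    index=pd.date_range("2023-06-29 08:00", periods=10, freq="min", tz="UTC"),
)
aligned = align(fast).join(align(slow), how="outer")
```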

Unit and Value Conversion

The I/O list tells the data preprocessor what unit each column should be in. The data preprocessor applies conversions to ensure that everything is in the International System of Units (SI), an SI-derived unit or a unit accepted for use with SI. We use those units throughout our analysis, which gives us consistency and easy conversion.
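As an illustration, the conversion step could be driven by a small factor table like the one below; the units and factors shown are examples, the real ones come from the I/O list.

```python
# Sketch of converting a data series to its SI (or SI-accepted) unit.
# The conversion table is illustrative.
import pandas as pd

TO_SI = {
    "bar": ("Pa", 1e5),              # pressure
    "l/min": ("m3/s", 1 / 60_000),   # flow
    "km/h": ("m/s", 1 / 3.6),        # speed
}


def to_si(series: pd.Series, source_unit: str) -> tuple[pd.Series, str]:
    """Return the converted series and the unit it is now in."""
    target_unit, factor = TO_SI[source_unit]
    return series * factor, target_unit
```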

Further, the data preprocessor converts values into forms that are easier for the data analysis. A sensor producing pulse counts may be converted into litres per minute, for example, and the reading of a sensor giving off a milliampere signal may be transformed into a tank volume.

The transformations are described in the I/O list.
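For example, two such conversions might look like the sketch below; the sensor range, tank volume and pulse constants are illustrative stand-ins for values that the I/O list would supply.

```python
# Sketch of value conversions; all parameters here are hypothetical.


def milliamps_to_volume(ma: float, tank_volume_l: float = 5000.0) -> float:
    """Map a linear 4-20 mA level signal onto 0..tank_volume_l litres."""
    fraction = (ma - 4.0) / (20.0 - 4.0)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range readings
    return fraction * tank_volume_l


def pulses_to_litres_per_minute(pulse_count: int, pulses_per_litre: float,
                                window_seconds: float) -> float:
    """Convert a pulse count over a sampling window into litres per minute."""
    litres = pulse_count / pulses_per_litre
    return litres * 60.0 / window_seconds
```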

Data Cleaning

The data preprocessor then executes more general data cleaning operations: cleaning out bad records, adding column data type information, interpolating missing data and applying smoothing and filtering.
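A minimal pandas sketch of such a cleaning pass; the column names, limits and smoothing window are illustrative, and a UTC time index is assumed.

```python
# Sketch of the general cleaning operations on a time-indexed frame.
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.astype({"flow": "float64", "temp": "float64"})  # explicit dtypes
    df = df[(df["flow"] >= 0) & (df["flow"] < 1000)]        # drop bad records
    df = df.interpolate(method="time", limit=5)             # fill short gaps
    df["flow"] = df["flow"].rolling("5min").mean()          # light smoothing
    return df
```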

Analysis Set Format

The data preprocessor produces files in Apache Parquet format. That format is open and portable between systems and programming languages. Unlike flat file formats, Apache Parquet files contain column data type information and indexes, making the data easier to load and use.

The Apache Parquet files are generated and cached on the developer machines and not committed to a Git repository since they may be generated at any time (but see below for an exception).
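For illustration, writing and reading back an analysis set with pandas might look like this; the file and column names are placeholders.

```python
# Sketch of producing and loading an analysis set in Parquet format.
import pandas as pd

analysis_set = pd.DataFrame(
    {"flow_m3_per_s": [0.0012, 0.0011], "temp_k": [293.1, 293.4]},
    index=pd.date_range("2023-06-29", periods=2, freq="min", tz="UTC"),
)
analysis_set.to_parquet("analysis_set.parquet")

# Loading it back preserves the column dtypes and the time index.
restored = pd.read_parquet("analysis_set.parquet")
```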

Data Analysis

Data analysis is an iterative process where we typically use Jupyter Notebooks as a means to produce well-annotated research results. The notebooks are kept in a Git repository for versioning.

The analysis may be performed using tools other than Jupyter notebooks. For readability, this document treats the result of such analyses as notebooks.
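A typical opening of such a notebook might look like the sketch below, assuming a time-indexed Parquet analysis set with an illustrative column name.

```python
# Sketch of the first cells of an analysis notebook.
import pandas as pd

analysis_set = pd.read_parquet("analysis_set.parquet")
analysis_set.describe()

# Quick look at an hourly average before digging deeper.
analysis_set["flow_m3_per_s"].resample("1h").mean().plot()
```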

Publication of Analysis and Data

The final stage of the data analysis pipeline is to publish the notebook on the Internet. This happens in three progressively more technical forms: as insight, as white paper and as runnable notebook.

By publishing a report in these forms, we ensure that there is always a form that fits each part of our audience.

Editing

The editor and analyst take the analysis notebook and add explanatory text and diagrams. The objective is to take the notebook from its practical form and polish it for readability.

An unedited notebook likely contains traces of the iterative process: preprocessing steps and small experiments that linger in the code but no longer contribute to the message and value of the notebook. These are trimmed or removed at this stage, while the notebook remains runnable.

Notebook as Insight

The Foundation Zero insights are typically written to explain findings to a wider audience. The writer will probably write insights from scratch, rather than reusing parts of the notebook. The conclusion and message are the same for the insights as well as the more technical formats.

Notebook as White Paper

Those who would like more detail after reading the insight article can read the analysis in the white paper format. Unlike the insights, the white papers are generated from the actual analysis notebook. They show the code and the graphs from the analysis.
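One possible way to generate such a document from the edited notebook is nbconvert's Python API, sketched below; the toolchain and file names are assumptions on our part.

```python
# Sketch of rendering a notebook to a shareable HTML document.
from nbconvert import HTMLExporter

body, resources = HTMLExporter().from_filename("analysis.ipynb")
with open("analysis.html", "w", encoding="utf-8") as out:
    out.write(body)
```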

Runnable Notebook

The final step caters to technical readers who want to run the code for themselves.

During editing, the analysis set is reduced to its smallest viable form. Along with the final, edited notebook, it is uploaded to the Foundation Zero GitHub repository. We make the data and the code available on GitHub so that the work can be reviewed and reused.

Visitors can then run the notebook on their own machine or use a service like Binder where they can experiment with the analysis, testing their own changes to the code in the notebooks. They may also reuse our code, subject to the code's license.

Data Traceability and Transparency

When publishing results and findings, it is important that they can be traced back all the way to the original, raw data. The approach in this pipeline makes the path from the cleaned analysis sets to the published findings completely transparent, barring personal data.

The step that repackages the cleaned and transformed data does break total transparency, however. The I/O list is unlikely to be a public resource and some parts of the transformation code may remain closed.

We have to balance practicality and total transparency. Publishing the raw data is technically possible, but the analysis sets would grow significantly larger and the analysis notebooks would come with a lot of data processing code that distracts from the analysis and findings.

To work around this, the reader is invited to validate our measured data against similar systems. For project members, the raw sensor data is available on request, again barring personal data.
