Process Data Analytics - Course Outline

Instructors: Professor Sirish Shah and Dr. Nikolaos Anesiadis

Course Outline:

We are currently at the cusp of the fourth industrial revolution, which is poised to reshape all sectors of the economy and society with unprecedented depth and breadth. Its driving force will be the analysis of data to find patterns and information. Process industries are in a unique position to benefit from this data revolution, as they have the right infrastructure and possess massive amounts of heterogeneous industrial data. Extracting valuable information and knowledge from industrial data will provide economic and competitive advantages in the face of ever-increasing demands on energy, environment and quality, by enabling a level of automation and efficiency never seen before. The process industries have been using data analytics in various forms for more than three decades.

The emphasis in this course will be on tools and techniques that help in understanding data and discovering information that will lead to predictive monitoring and diagnosis of process faults, design of soft-sensors, process performance monitoring and on-line modeling methods. Highly interconnected process plants are now common, which makes monitoring and root-cause analysis of process abnormalities, including predictive risk analysis, non-trivial. The backbone of a viable process data analytics strategy should be the extraction of information from the fusion of process data, alarm and event data, and process connectivity; this will be the main focus of this course.

Learning outcomes and expected goals:

The focus in this course will be on tools and techniques that help with understanding data and discovering information and patterns in routine process data. The objective is to deliver a coherent and coordinated workflow so that students know which tools to use and when, and which pitfalls to avoid.

The goal is to have the students learn how to apply the following commonly used methodologies for the analysis of data and succeed in an analytics project that will ultimately lead to predictive monitoring and diagnosis of process faults, design of soft-sensors, process performance monitoring and on-line modeling methods:

Define the data analysis problem and ask the right questions; define clear objectives;
Get good data in context of the problem; do quality checks on data, sort and filter the data and check for outliers and missing data;
Get to know your data: visualize, explore and analyze; carry out data visualization and learn the key steps in data ingestion and data management;
Find the features that affect the outcome of interest;
Explore unsupervised learning using classical clustering methods such as kNN (k nearest neighbours) and Principal Components Analysis (PCA) for dimensionality reduction;
Carry out supervised learning using: Multivariate linear regression and its variants including LASSO; Logistic regression; Classification and Regression Trees (CART) including Random Forests; Support Vector Classification and Regression methods; kernel methods, model maintenance and feature extraction;
Build meaningful models for soft-sensing and for process and performance monitoring; carry out model quality checks;
Investigate causality analysis and process topology reconstruction methods;
Make the model operational and maintain these models.
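
As a rough illustration of the early workflow steps listed above (quality checks, outlier screening and PCA for dimensionality reduction), the following is a minimal sketch in Python using only NumPy. The synthetic data, the function name clean_and_reduce, median imputation, and the MAD-based outlier rule with a 3.5 threshold are all assumptions chosen for illustration; they are not prescribed by the course.

```python
import numpy as np

def clean_and_reduce(X, n_components=2, z_thresh=3.5):
    """Impute missing values, flag outlier rows, and project onto principal components."""
    X = np.asarray(X, dtype=float).copy()

    # Missing data: replace NaNs with the column median (an assumed, simple choice).
    col_median = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_median[cols]

    # Outliers: robust z-score based on the median absolute deviation (MAD).
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1.0  # guard against division by zero for constant columns
    robust_z = 0.6745 * (X - med) / mad
    outlier_rows = np.any(np.abs(robust_z) > z_thresh, axis=1)

    # PCA via SVD on the autoscaled (zero mean, unit variance) inlier rows.
    X_in = X[~outlier_rows]
    Z = (X_in - X_in.mean(axis=0)) / X_in.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)       # fraction of variance per component
    scores = Z @ Vt[:n_components].T      # data projected onto the leading components
    return scores, explained[:n_components], outlier_rows

# Synthetic example: three correlated "process variables" driven by one latent factor.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
X[5, 1] = np.nan    # inject a missing value
X[10, 2] = 50.0     # inject a gross outlier

scores, explained, outlier_rows = clean_and_reduce(X)
print("rows flagged as outliers:", np.flatnonzero(outlier_rows))
print("variance explained by first two components:", np.round(explained, 3))
```

Because the three synthetic variables share one latent factor, the first principal component should capture almost all of the variance; with real plant data the number of components to retain is a modelling decision.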

Course Content
Course overview and discussion of course evaluations; Statistical preliminaries; Statistical Science versus Data Science with examples; Data visualization with examples.
Data visualization with examples; General concepts in visual analytics; Discussion of visualization examples; Data quality assessment; Outlier detection and treatment of missing data; Filtering and general denoising; Concepts of balanced data sets and bootstrapping. Tutorial 1: Intro to Python (pandas, NumPy, for loops, basic plotting; random walk example). Assignment 1: Develop a Covid-19 dashboard. Assignment 2: Data quality checks on experimental data.
Simple linear regression with examples; Design of experiments and discussion about time series data; Concepts of auto-correlation and cross-correlation; Difference between steady-state models and dynamic models; Basic system identification: use of simple regression to estimate ARMA models. Assignment 3: Regression and dynamic model building using time series data.
Unsupervised and supervised learning; Clustering methods such as kNN with examples; Dendrograms and silhouette plots. Assignment 4: Clustering.
Supervised learning including classification; Regression methods: Simple and multivariate regression; Logistic regression; Classification and regression trees (CART); Random Forests; LASSO and Support Vector Regression. Assignment 5: Classification. Tutorial 2: Credit Card Defaulting; in-depth analysis and comparison of trees, logistic regression and random forests.
Dimension reduction, PCA and general concepts in multivariate statistics. Tutorial 3: PCA of Fetal Bovine Serum (FBS); applications in biotechnology. Assignment 6: PCA of Exchange-traded funds (ETFs); visualization development for a Fintech app.
Alarm data analytics with examples and industrial case studies.
Model evaluation/performance: cross-validation, train/test/validation sets, accuracy, ROC/AUC, confusion matrix, precision/recall/specificity, ranking, expected value calculations
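
To make the model evaluation terms in the last item concrete, here is a minimal sketch in Python that computes a confusion matrix and the derived accuracy, precision, recall and specificity for a binary classifier. The function name binary_metrics and the small label vectors are hypothetical, chosen only for illustration; in practice a library such as scikit-learn would typically be used.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and common metrics for 0/1 labels (1 = positive class)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = int(np.sum(y_true & y_pred))     # positives correctly predicted
    tn = int(np.sum(~y_true & ~y_pred))   # negatives correctly predicted
    fp = int(np.sum(~y_true & y_pred))    # negatives predicted as positive
    fn = int(np.sum(y_true & ~y_pred))    # positives predicted as negative

    def ratio(num, den):
        # Return NaN when a metric is undefined (empty denominator).
        return num / den if den else float("nan")

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual 0/1, columns: predicted 0/1
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),          # also called sensitivity
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical labels: 1 = fault, 0 = normal operation.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so accuracy, precision, recall and specificity all equal 0.75. On imbalanced data these metrics generally diverge, which is why accuracy alone can be misleading.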

Course Assessment Criteria

Course marks will be based on continuous evaluation through six assignments. The marks for the assignments will be split as follows:

Assignment	Weightage
Assignment-1	20%
Assignment-2	15%
Assignment-3	10%
Assignment-4	20%
Assignment-5	20%
Assignment-6	15%
Total	100%

Prerequisites: Basic knowledge of statistics, linear algebra, signal processing, system identification and control.

The main computational platforms for this course will be MATLAB and Python.