# AAL Data Cluster Analysis. Theory and Implementation

Bachelor Thesis 2016 57 Pages

## Excerpt

## Contents

Acknowledgments

Abstract

1. Introduction

1.1. Motivation

1.2. Outline

1.3. The Goal

1.4. Methodology

2. Theory Section

2.1. Gaussian Mixture Model

2.2. Dirichlet Process GMM

2.3. Hierarchical Clustering

3. Implementation

3.1. Software Specification

3.2. Data and Sensors

3.3. Data Visualisation and Inspection

3.4. Sensor values Discretisation and Extraction

3.5. Unsupervised Event Extraction

3.6. Data Structure for Event Analysis

3.7. Data Quality and Quantity

3.8. Predictive Analysis

3.8.1. Hourly Binning Analysis

3.8.2. Clustering Analysis

3.8.2.1. Hierarchical Clustering Analysis

3.8.2.2. Dirichlet Process Gaussian Mixture Model Clustering Analysis

4. Summary

4.1. Results

4.2. Discussion

Literature

List of Figures

List of Tables

List of Abbreviations

## Acknowledgments

First and foremost I would like to thank Amra Omeragic for her patience and loving support throughout the long writing process.

I would like to thank Peter Mayer and Wolfgang Zagler for contributions and navigating this project with their advices, ideas and revisions.

I must also thank my mother and father, Nevresa and Enver Hamzic, for all of their patience, support and love, and for beeing my biggest fans.

Finally, I would like to thank all those who stood by me in good and bad times.

Dzenan Hamzic

## Abstract

The eHome project from the Vienna University of Technology^{1} is an R&D project with goals of providing assistive technologies for private households of older people with idea to give them possibilities for longer and independent living in their homes. The eHome system consists of an adaptive intelligent network of wireless sensors for activity monitoring with a central contextaware embedded system^{2}.

The primary goal of this thesis is to investigate unsupervised prediction and clustering possibilities of user behaviour based on collected timeseries data from infrared temperature sensors in the eHome enviroment.

Three different prediction approaches are described. Hourly Based Event Binning approach is compared to two clustering algorithms, Hierarchical Clustering and Dirichlet Process GMM. Prediction rates are measured on data from three different test persons.

This thesis first examines two different approaches for event detection from infrared signal data. In a second stage three different methods for unsupervised prediction analytics are discussed and tested on selected datasets. Clustering algorithms parameter settings for timeseries data have also been discussed and tested in detail. Finally the prediction performance results are compared and each method's advantages and disadvantages have been discussed.

The practical part of this thesis is implemented in IPython notebook. Python version was 2.7 on 64 bit Ubuntu linux 12.04 LTS. Data analysis has been implemented with Python’s Pandas library. Visualisations are made with Matplotlib and Seaborn libraries.

The results reveal that prediction accuracy depends on data quantity and spread of data points. The simplest method in prediction comparison, the Hourly Based Binning has however given the best prediction rates overall.

By contrast to the Hourly Based Binning the Dirichlet Process Gaussian Mixture Models clustering show best prediction performance on smaller training data sets and well spread data. By further parameter tuning on Dirichlet Process GMM clustering the prediction rates could be further improved coming very close or even over performing the Hourly Based Binning. Due to the unknown distribution and well spread data, choosing the right threshold parameter for the Hierarchical Clustering was trickier than initially assumed. Despite the initial assumptions for Hierarchical Clustering, this method was at least applicable for unsupervised prediction analytics on used data sets.

## 1. Introduction

We are living in times of “smarthomes”. Sensors are registering our every movement. We can use these data on various ways like for preventing leakage, disasters, robbery and automating doors, lights etc. On the other hand, we can use these data to to support older people by finding out if they are behaving other than usual in order to trigger some help alarm like sending an emergency call or cutting the powersupply in order to prevent disaster. The eHome project from the Vienna University of Technology^{1} is an R&D project with goals of providing assistive technologies for private households of older people with the idea to give them possibilities for longer and independent living in their homes.

### 1.1 Motivation

Most people when they are old want to be supported by their loved ones. At the same time they do not want to be a burden to them. There are a lot of cases where such people live alone. Many of them have difficulties in walking and are prone to falling without possibility to get up again or calling help. As another example, if a cooking plate stays on without being noticed the house could be placed on fire. In order to prevent such a disaster it is a goal to detect events like that. If detected there is a possibility to trigger an alarm and to remind to turn off the plate and send someone to help the person to get up. For that reason and many others there is a need for a system like eHome that detects unusual and dangerous behaviour and supports older people by providing them assisting technologies.

### 1.2 Outline

Multiple wireless sensors of different types like Temperature Sensor, IIR (Infrared Temperature Sensor), REED (Magnetic Contact Sensor), PIR (Passive Infrared Sensor) etc. are placed in the house. The data from all sensors are recorded and analysed on eHome’s central unit. Events like cooking and leavingflat must firstly be correctly recognised in sensor data. Having recognised and extracted events, it can be further investigated if any behavioural patterns or event accumulations can be recognised in data which would be then useful in predicting the future behaviour of the person.

### 1.3 The Goal

The goal is to have as reliable as possible solutions for detecting the typical times of a person’s behaviour like cooking without any supervised interventions. If the person usually cooks in intervals between 10 AM and 1 PM and 3 PM and 5 PM it should be possible to find such time intervals. It should also be possible to have a probability of an event occurring in any given hour or time span.

### 1.4 Methodology

Three methods of possible event predicting are going to be discussed. The Hourly Binning of detected events is the simplest solution which returns event probabilities for every given hour. The two clustering approaches Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering are used for finding event accumulations in timespans. Prediction possibilities are measured by dividing the whole dataset of a test person by 50:50 to training and testing datasets and then comparing them for event hits and misses.

## 2. Theory

Every person is different and behaves differently. Some people cook normally at 12 PM others at 6 AM. Some people cook only once a day while others cook three or even four times a day. Some people don’t cook at all. This leads to the conclusion that no fixed number of cooking events can be assigned to each person. Having this in mind, the eHome system must have a possibility of unsupervised finding the number of events taking place in homes of older people. The clustering algorithms chosen in this section are later used for finding event accumulations and event clustering on timeseries data from the eHome project’s Infrared sensors. The Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering need no fixed number of clusters to be found what makes them appropriate for unsupervised finding of eventual behavioural patterns.

### 2.1 Gaussian Mixture Model

Common and important clustering techniques are based on probability density estimation using Gaussian Mixture Model and Expectation Maximisation.

The Kmeans Algorithm uses just single points in feature space as the cluster centre. Each data point is then assigned to the nearest cluster using euclidian distance from the cluster centre. Using only euclidian distance the Kmeans algorithm is not well suited for overlapping data points in feature space and for clusters that form no circular shape.

*Convex sets: In Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object ^{3}.*

Kmeans is often actually viewed as a special case of a Gaussian Mixture Model. GMM models can be seen as an extension to Kmeans models in which clusters are modeled with gaussian distributions using not only their means but also covariance that describes their ellipsoid shapes.

Covariance parameter in GMM can be constrained to pherical, diagonal, tied or full.

Figure 2.1 Gaussian Mixture Model density estimation^{4}.

Abbildung in dieser Leseprobe nicht enthalten

Figure 2.1 shows 3 gaussian components. Each component is described as Gaussian distribution so each has a mean ᵻ, variance or covariance ᵻ and the “size” ᵻ. The mean is responsible for the distribution shift. Variance determines how wide/narrow the component is. The ᵻ is the component's height.

The goal of performing an GMM clustering on a dataset is to find the model parameters (Mean and variance of each component) so the model fits the data as much as possible. This bestfit estimation usually translates into likelihoodmaximisation of the GMM model data.

Likelihood maximisation is performed by the Expectation Maximisation algorithm [B1]. The EM algorithm proceeds iteratively in 2 steps.

The Expectationstep treats gaussian component’s mean, covariance and the size as fixed. For each datapoint i and each cluster c probability value Ric of that datapoint belonging to cluster c is computed. If the particular probability value Ric is not high that data point i does not belong to c cluster.

The best possible explanation for single component belonging to one cluster is Ric = 1.

The Maximisationstep starts with assignment of Ric probabilities and updates gaussians components parameters mean, covariance, and size. For each cluster c parameters are update using and estimated weights of Ric probabilities.

Each iterative step increases the loglikelihood of the model.

Prior knowledge of cluster number is assumed for this clustering algorithm.

Advantages:

Fastest algorithm for mixture models learning

No cluster shape and size limits

Disadvantages:

Bad performance on small datasets

Fixed number of components

### 2.2 Dirichlet Process Gaussian Mixture Model

As noted earlier Gaussian Mixture Model and Kmeans algorithms assume a fixed number of components which should be found but in the most real world problems data are unstructured and no exact conclusion on datapoints distribution is can be made in advance.

The DPGMM is a Gaussian Mixture Model variant where no prior knowledge of cluster number is necessary. It uses a maximum number of clusters parameter as an upper bound for maximum components number to be found. Setting this parameter to e.g. five components should find all possible clusters in data up to maximum five. The algorithm should not simply split data into five components but deliver the real cluster number and cluster data accordingly. This is illustrated on diagram below.

Abbildung in dieser Leseprobe nicht enthalten

Figure 2.2 Gaussian Mixture Model and Dirichlet Process GMM both initialised with 5 components^{5}.

This upper bound parameter should be loosely coupled with the real cluster number.

In comparison to Gaussian Mixture Model, Dirichlet Process GMM uses one additional parameter ᵬ (alpha) to specify datapoints concentration.The alpha parameter controls the number of components used to fit the data. Lowering alpha parameter clusters the datapoints tightly as the expected number of clusters is alpha*log(N). Doing the opposite, more clusters are produced in any finite set of points. Given low data quantity, the DPGMM tends to fit data points to only single component.

Depending on the data type and data distribution DPGMM allows to set which cluster parameters are going to be updated in the training process. It can be setup to update wweights, mmeans and ccovariances or any combination of the three.

The Dirichlet Process can be explained by the “chinese restaurant process” which satisfies properties^{6} ^{7}:

“Rich get richer” property (the more people sitst at the table the higher the chance of a new person joining in)

There exists always a small probability that a new person entering joins the new table

Probability of a new group is set by concentration parameter alpha

Tables in the chinese restaurant paradigm are components of GMM.

Advantages:

Relatively stable (no big changes with small parameter tuning)

Less tuning needed

No need for component number specification (only loose upper bound)

Disadvantages:

Dirichlet Process makes inference slower (not much)

Implicit biases (sometimes better to use finite mixture models as GMM)

### 2.3 Hierarchical Clustering

Hierarchical methods need no cluster number and no cluster seed specification. In hierarchical clustering methods a nested series of clusters is produced. Hierarchical clustering tries to capture the underlying datastructure by constructing a tree of clusters.

Two hierarchical approaches are possible [B2]. Bottomup approach where at start every dataobject is a cluster by itself. Nearby clusters are iteratively merged into bigger clusters until all clusters are merged into a single cluster in the highest hierarchy level or some stopping criterion is met.

Topdown approach starts from one big cluster containing all datapoints in the highest level of the hierarchy. Going towards bottom in hierarchy this method repeatedly performs splitting of clusters resulting in smaller and smaller clusters until every data point is a cluster for itself or some stopping criterion is met.

Depending on the dataobject’s distances, a threshold for flat cluster formatting is to be set. Both approaches can use distance as a stopping criterion.

Computing distance between all data pointsin two clusters is an expensive operation especially on big datasets. Therefore, the Hierarchical Clustering method offers multiple algorithms for computing distances between clusters.

Singlelink algorithm: computes distance between two nearest points each in a different cluster. Completelink algorithm: computes the distance between two furthest points each in different cluster (opposite of singlelink algorithm).

Centroid algorithm: computes the distance between two in cluster center points each in different cluster Averagelink algorithm: computes distance between all data points pairs of each in different cluster.

Advantages:

Can provide more insight into the data (eventual cluster hierarchy)

Simple to implement

Can provide clusters at different levels of granularity

Disadvantages:

No dataobject resignation to other clusters

Time complexity О(n³)

Distance matrix requires О(n²) memory space

## 3. Implementation

### 3.1 Software Specification

The practical part of this thesis is implemented in IPython Notebook^{8} on Ubuntu 12.04 LTS. Python version is 2.7.10 with 64 Bit Anaconda 2.3.0^{9}. Anaconda is a completely free scientific Python distribution. It includes more than 300 of the most popular Python packages for science, math, engineering and data analysis. NumPy is used for mathematical functions like transponding, rounding and others, Pandas for CSV datatables management, scikit for Dirichlet Process GMM Algorithm implementation, scipy for Hierarchical Clustering implementation. Visualisations are made using Matplotlib and Seaborn^{10} libraries.

### 3.2 Data and Sensors

The eHome system consists of an adaptive intelligent network of wireless sensors placed in homes of older people for activity monitoring with a central contextaware embedded system. In each home the data are monitored by the central system in real time. The data subsets of monitored test persons used in this thesis is a collection of all sensor events recorded in a time frame of a few months.

Data events from wireless sensors like Accelerometers, Temperature Sensors, IIR (Infrared Temperature Sensor), REED (Magnetic Contact Sensor) and PIR (Passive Infrared Sensor)^{2} are recorded to single CSV formatted data files. Each line represents a single sensor event. Each new day begins with a new data file. The recorded data are separated by a comma and have the following format:

*day.month.year hour:min:sec, unix timestamp, milliseconds, sensor type, event type, event subtype, sensor ID, network ID, sensor value.*

The CSV file looks as following:

Abbildung in dieser Leseprobe nicht enthalten

Table 3.1. Sensor data sample from CSV file.

The marked lines indicate temperature of 22.6 C recorded by Infrared sensor with ID 173.

### 3.3 Data Visualisation and Inspection

The first step of almost every data analysis is to get to know the data. In this case the data of the Infrared Temperature Sensor (placed by the cooking plate) and Magnetic Contact Sensors (placed on doors) are going to be inspected by visualising their values.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.1 Cooking Sensor data overview.

As can be seen from figure 3.1, the cooking plate was active multiple times on 15.05.2010. It can be noticed that the temperature is not so high as usuall on cooking plates. The reason is that the sensor is not placed directly on the cooking plate but to the side.

By visually inspecting the cooking sensor temperature values, 3 big peaks and multiple small peaks can be seen. Such small peaks are of no interest for this thesis because they do not imply cooking. Cooking usually takes longer than 10 minutes. Event length detection and filtering will be discussed in further sections. The IIR (Infrared temperature sensor) cooking sensor is continuously delivering temperature values. Minimal sending interval is 1 minute. Maximal sending interval is 10 minutes. It is triggered by a temperature change of 0,5 C. This setup can be visually confirmed in Figure 3.2 below.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.2 Infra Red Sensor sending intervals.

It can be seen that the sensor fired multiple times in the interval between 8:46 and 8:56 what indicates a strong increase in temperature concluding that the cooking plate was turned on.

Sensors like PIRSensor (Movement sensor) are working with discrete values. PIR sensors are saving ones if a movement is detected. Otherwise nothing is saved. Such discrete values are easier to work with. The clustering algorithms can directly be fed with such values. Below is the visualisation of the Passive Infrared sensor values on 19. and 20. May of 2010 in the home of Test Person 1.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.3 PIR Sensor values visualisation for TP1.

From Figure 3.3 can nicely be seen if there was any movement in house. The value gaps, which could indicate sleeping or being outdoors, can easily be filled with virtual sensor values in order to be processed further.

In order to do density estimations on passive infrared sensor timeseries values, some kind of conversion to discrete values is needed. When working with timeseries data density estimation clustering algorithms are to be filled only with values when the cooking plate is turned on. All other values from the cooking sensor can be filtered out from the dataset or can be set to 0.

Multiple steps are needed in order to achieve discretisation of values from the cookingplate infrared sensor. There are multiple possible solutions to this issue and some of them are going to be discussed in next section of the thesis.

### 3.4 Sensor Values Discretisation and Extraction

As noted earlier IIR sensor values need discretisation. The idea is to implement some kind of sampling on continuous values of the IIR cooking sensor. This is needed to be able to extract the cookings from the rest of the sensor events from the dataset.

Having a discrete signal from infrared temperature sensor, two approaches for cooking event extraction from a dataset are going to be discussed. The goal is to have possibilities of reliable and correct cooking events detection in a single dataset.

The first approach is the sequencing of temperature rises that belong together. The second approach is unsupervised event extraction using a clustering algorithm for grouping temperature increases that belong together. The latter is discussed in the next section.

Three big peaks from Figure 3.1 need to be discretized since only “heating in progress” on the cooking plate is of interest. This gives an idea that positive temperature increases should be inspected and leads to conclusion that the first step of temperature signal transforming should be a difference operation on temperature values.

The result of temperature values difference operation is visualised on the following diagram.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.4 Temperature differences.

Figure 3.4 clearly shows temperature peaks. Small increases in temperature can be categorised as signal noise ( 0.1 C). This noise was present in most of the datasets and can clearly be seen e.g. between 03:00 and 06:00. Negative temperature differences can simply be filtered out from the new dataset. The positive “noise” in temperature signal (+0.1 C) does also have to be filtered out in order to get only relevant temperature increases. This can be dealt with using 2 approaches.

First:

Taking the sensors sensitivity of 0.5C into consideration every temperature difference value below that threshold could be set to 0.

Second:

Counting the absolute probability for every positive temperature difference and filtering (setting to 0) all values between some threshold. This threshold could be set by finding the values which have probabilities of e.g. 15% or 20% in difference dataset. Using this approach demands detailed analysis of sensor behaviour.

Below is the table of probabilities for every temperature difference.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.5 Absolute probabilities of temperaturevalue differences.

Almost every provided dataset was noisy. One of the two above mentioned methods can be used to get the clean signal. After successful implementation of this step the dataset should no longer contain the sensor noise.

Abbildung in dieser Leseprobe nicht enthalten

Figure 3.6. Filtered temperature differences.

The filtered signal can now be manipulated easily. One could add a new binary column to the new dataset which indicates the positive increases in temperature. This column would set ones to rows which indicate that the cooking plate is turned on. This isolates the needed signal events from the rest of the dataset. By plotting the the temperature values marked as true in new column we get the following diagram:

**[...]**

## Details

- Pages
- 57
- Year
- 2016
- ISBN (eBook)
- 9783668299542
- ISBN (Book)
- 9783668299559
- File size
- 2.1 MB
- Language
- English
- Catalog Number
- v340200
- Institution / College
- Vienna University of Technology
- Grade
- 1
- Tags
- time-series cluster analysis gaussian mixture model hierarchical clustering predictive analysis hourly binning sci-kit python