Predicting (Drops in) Vigilance using Machine Learning
Rosalie E. Lucas
6540384
June 26, 2021
Supervisor: Leon Kenemans
Bachelor Psychology
Utrecht University
Abstract
The prediction of (drops in) vigilance can be useful in the prevention of car crashes. Previous
work showed that temperature data, demographic data and subjective measurements can be
useful in these predictions. With machine learning this study explores these different features
as possible predictors of vigilance in the Brief Stimulus Reaction Time task (BSRT). Three
different kinds of Random Forest models were trained, and a naïve model was created for the
evaluation of these models. Temperature was measured four seconds prior to the target
presentation with iButtons and a FLIR camera. Results show that features measured with the
FLIR camera may be useful predictors of vigilance, especially the forehead temperature.
However, the accuracy scores of the models were only slightly better than that of the naïve model, so further research should explore these features and confirm these findings.
Introduction
In 2020 the Netherlands counted 620 fatal car crashes (Centraal Bureau voor de Statistiek, 2020). One of the main causes of car crashes is drops in attention (De Raedt &
Ponjaert-Kristoffersen, 2001). This can be explained by the fact that attention can be challenged
during routine tasks like driving (O'Connell et al., 2009). The question that therefore arises is
how these accidents can be prevented. Can people be alerted when their attention drops? To
answer this, more has to be known about the type of attention that is relevant for driving, named
vigilance. Vigilance is a state of readiness, manifested in the extent to which a person can detect
and respond at random time intervals to small changes occurring in the environment (Schmidt
et al., 2009). In other words, vigilance is the ability to pay attention, for a certain period of
time, to a task (Schmidt et al., 2009).
This research was an explorative bottom-up study of different predictors for (drops in)
vigilance. The aim of this study is to compare these predictors and see which predictors are
worth further exploring. This was done using machine learning. Machine learning is very useful
for the prediction of an outcome variable (Bzdok et al., 2018) because it learns from known
properties of the training data to give accurate predictions for unknown test data (Shanker,
2018). While classical statistics can also make predictions, machine learning is not bound by
the restrictions of these methods. Machine learning makes fewer assumptions and is therefore able to detect complicated nonlinear interactions, such as two-way interactions, three-way interactions, or even higher-order ones (Bzdok et al., 2018). To test this with classical statistics, such
as a general linear model, these interactions would have to be specified in advance and then
put into the classical specification, which can be more restrictive. When we look at a broad set
of possible predictors, machine learning can also be useful to assess the relative importance of
different features, even in settings where the number of input variables exceeds the number of
participants (Bzdok et al., 2018).
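As a minimal illustration of this point (not part of the analyses in this thesis), consider an outcome driven purely by a two-way interaction. A logistic regression without an explicitly specified interaction term stays at chance level, whereas a random forest recovers the pattern without any prior specification:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # pure XOR-like interaction

print(LogisticRegression().fit(X, y).score(X, y))  # ~.50, chance level
print(RandomForestClassifier(random_state=0).fit(X, y).score(X, y))  # ~1.0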
Machine learning techniques also have a downside. A reason to be cautious with
machine learning is the so-called black box problem: the user has limited to no insight into the
choices the algorithm makes to get from a given input to the given output (Chang et al., 2018).
That is why this research uses the Random Forest (RF) algorithm from Breiman (2001). A RF
is a classification algorithm based on randomised decision trees that together build a forest
(Breiman, 2001). These decision trees are easier to understand and give some insight into the choices that the computer makes to classify each trial.
Another reason to use a RF is that Menze et al. (2009) found the RF to be the best
algorithm for feature selection. This is because it classifies based on different features. A RF
randomly selects features and uses these to split the decision tree based on their Gini impurity
(Menze et al., 2009). Gini impurity indicates how well a potential split separates the data points into classes (Menze et al., 2009). For each feature, the Gini importance can be obtained, reflecting how much that feature's splits reduced impurity across the trees (Menze et al., 2009). It can therefore be used to create a hierarchy of features, from most to least important (Züger et al., 2018). Combined
with the Gini importance, this algorithm can thus be used to gain insight in the usefulness of
different features. The features found this way can subsequently be investigated in more depth.
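To make the splitting criterion concrete, the sketch below computes the Gini impurity of a node; the function is illustrative and not taken from the thesis scripts. A pure node (only hits or only misses) has impurity 0, while a maximally mixed node has impurity .5 for two classes:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0: pure node, all hits
print(gini_impurity([1, 1, 0, 0]))  # 0.5: maximally mixed node

In scikit-learn, the Gini importance of each feature is reported through the feature_importances_ attribute used in the appendices, normalised so that all importances sum to one.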
Background
To determine which features are important to consider, more must be known about
previous research into predictors for vigilance. The focus will be on four groups: EEG,
temperature related data, subjective measurements, and demographic data.
EEG. Previous studies have found EEG signals to be useful predictors of vigilance
(O'Connell et al., 2009). They found alpha activity to be a strong predictor of lapses in attention.
Also, delta and theta activity correlate strongly with poor task performance and fatigue (Lin et al., 2014). Other studies have found a similar predictive relationship using machine learning algorithms, such as extreme learning machines and support vector machines (Shi & Lu, 2013; Jin et al., 2020). Though EEG is a very powerful predictor of vigilance, it is not a very practical one. EEG caps are not comfortable to wear and are quite fragile (Lin et al., 2014). In the context of preventing car accidents, they are hard to impossible to implement during long drives. However, new research focuses on more wearable EEG systems (Lin et al., 2014).
Temperature. Fortunately, some research has been conducted into other predictors for
vigilance. Romeijn and Van Someren (2011) have conducted a vigilance experiment and found
skin temperature to be a possible predictor of vigilance. The fluctuations in skin temperature
can be described according to the Circadian Clock Mechanism or the Homeostatic Hourglass
(Te Lindert & Van Someren, 2018), implying skin temperature is related to the sleep-wake
cycle. A higher skin temperature results in feeling sleepier, having higher reaction times,
having more lapses in performance and therefore, being less vigilant (Romeijn & Van Someren,
2011).
Skin temperature can be measured at different locations. These can be divided into
distal and proximal locations. Distal locations fluctuate more in temperature whereas proximal
locations are more consistent. Romeijn and Van Someren (2011) found the gradient between a distal location and a proximal location to be a good predictor of vigilance. Mainly the finger-chest and finger-wrist gradients were strong predictors of vigilance. Also, the gradient
pinna-mastoid has been researched and seems to have a predictive relationship with reaction
times (Romeijn et al., 2012).
In short, skin temperature can be a reflection of the underlying processes of the
regulation of vigilance (Romeijn et al., 2012). Furthermore, skin temperature measurements seem to be applicable in everyday life (Romeijn & Van Someren, 2011). This means they could be useful predictors of vigilance, which is precisely the topic of this research.
Subjective Measures. Other predictors could be found in the ability of people to
evaluate their own level of vigilance. For instance, there is a strong relationship between the
subjective measures of the Karolinska Sleepiness Scale and reaction time (Schmidt et al.,
2009). This was investigated during a long drive, although the relationship was absent towards the end of the drive.
Degree of sleep deprivation is another possible measurement. The Pittsburgh Sleep
Quality Index (PSQI) assesses people's sleep quality by calculating a person's PSQI score (Buysse et al., 1989). Being sleep deprived results in longer reaction times (Blatter et al., 2006). When sleep quality is poor, a person is effectively sleep deprived.
Finally, Blatter et al. (2006) found that reaction times increased the longer participants were awake. The Morningness-Eveningness Questionnaire (MEQ) assesses whether a person is a
morning or evening person (Horne & Östberg, 1976). This could give an interaction effect with
the time of the experiment. A morning person in combination with an experiment in the
afternoon could result in longer reaction times because they are awake for a longer period. The
opposite could be true for an evening person, who usually wakes up later and, in combination with an afternoon experiment, would have shorter reaction times.
Demographic measures. Lastly, demographic data could provide useful predictors.
Blatter et al. (2006) found that gender and age can influence reaction times. They found women
to react more slowly, but also more accurately than men. Furthermore, they saw that age had
an interaction effect with the sleep condition of the participant. Sleep deprivation in
combination with younger age resulted in slower reaction times. Also, the interaction between age and time spent awake proved relevant in their study.
Research Questions
In this research the focus was on the different temperatures, PSQI scores, MEQ scores,
gender and age as potential features to predict drops in vigilance. Some of these predictors have
been previously researched, while other predictors were explored for the first time in this study.
This has led to the following research questions: 1) Which feature is the best predictor of (drops
in) vigilance according to its Gini Importance? 2) What is the hierarchy of these features in
predicting vigilance? Three different RF models with different feature combinations were
compared to answer these questions.
Expectations
Based on previous literature the best temperature related features should be the chest
temperature and the finger-chest gradient (Romeijn & Van Someren, 2011). The pinna-mastoid
gradient could also rank higher in the hierarchy (Romeijn et al., 2012). For the other predictors no expectations could be formulated in advance.
Methods
All procedures were approved by the ethical committee of Utrecht University and conducted according to the protocol of the Faculty of Social Sciences.
Participants.
In total 33 healthy participants participated in this study: 14 males and 19 females aged
19 - 59 years (M=26.15, SD=10.03). Participants were recruited within the social bubble of the
research leaders due to the COVID-19 pandemic. They joined on a voluntary basis and were
rewarded with student credit or eight euros per hour (total of 20 euros for 2.5 hours). None of
the participants had any known history of sleep-related disorders. Because their vigilance level
should not be artificially influenced, participants were not allowed to take caffeine, alcohol or
drugs after 22:00 the day prior to the experiment. Also, participants had to be non-smokers for
at least 6 months prior to the experiment. Participants had to have normal or corrected (glasses
or lenses) vision. They did not wear any hair products or make-up such that EEG electrodes
could be correctly placed.
Tasks.
BSRT. In this research the focus lies on the Brief Stimulus Reaction Time task. This is
a vigilance assessment task based on the task created by Romeijn and Van Someren (2011).
Compared to the original paper there were a few differences, because more lapses were needed
for the EEG signals. Participants have to focus on a fixation-cross (+ sign) and respond as fast
as possible when it changed into a fixation-cross with a shorter vertical line by pressing the
space bar with the index finger of their dominant hand. In the original paper participants had
to respond when the cross changed into a hyphen (- sign). The background is grey, and the
signs are black. The target presentation was only one frame (16.67ms), which is lower than the
original paper (25ms). Furthermore, this task had a staircase procedure. This resulted in the
change of difficulty of the task by altering the size of the vertical line of the fixation cross. The
most difficult condition of the task was a 10 pixels difference between the vertical line and the
normal vertical line of the target. The easiest condition was an 80 pixels difference. The task
started with a difficulty level of 50 pixels difference. When the participant responds incorrectly
twice, the following trial becomes easier. When the participant responds correctly, the next
8
trial will become more difficult. This is a 2-up-1-down procedure. The task comprised 144
stimuli with a 4 to 14 second interval and took around 20 minutes to complete.
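A minimal sketch of this staircase rule as described above; the bounds (10 and 80 pixels) and starting value (50 pixels) come from the text, while the step size of 10 pixels is an assumption for illustration only:

def next_difficulty(diff, correct, error_streak, step=10):
    # Correct response: the next trial becomes harder (smaller difference)
    if correct:
        return max(diff - step, 10), 0
    # Two incorrect responses in a row: the next trial becomes easier
    error_streak += 1
    if error_streak == 2:
        return min(diff + step, 80), 0
    return diff, error_streak

diff, streak = 50, 0
for response in [True, False, False, True]:
    diff, streak = next_difficulty(diff, response, streak)
    print(diff)  # 40, 40, 50, 40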
Methods and materials.
iButtons. Skin temperature was measured using iButtons (type DS1922L,
Maxim/Dallas, USA). The iButtons sample skin temperature with a .0625 ºC resolution at 2-
second intervals. The method has been described in detail and validated by Van Marken
Lichtenbelt et al. (2006).
iButtons were placed at three distal sites: finger, nose and pinna, and three proximal sites: chest, forehead and mastoid. From these data three distal-to-proximal gradients were calculated: finger minus chest (DPG_finger-chest), nose minus forehead (DPG_nose-forehead), and pinna minus mastoid (DPG_pinna-mastoid).
FLIR. Skin temperature was also measured using a FLIR camera (FLIR E53, FLIR Systems Inc., Wilsonville, USA). The camera samples skin temperature with a 240 x 180 pixel infrared resolution at a 33-millisecond interval. Thermal sensitivity is below .04 ºC and image frequency is 30 Hz. The camera was placed in front of the participant and focused on the participant's head. One distal-to-proximal gradient was calculated: nose minus forehead (FLIR DPG_nose-forehead).
Questionnaires. The PSQI and MEQ were both administered in Dutch, this being the native language of most participants. The PSQI was recoded according to the method described by Buysse et al. (1989): the score for every component of the PSQI is calculated and then summed into a total score. However, for some participants some components were missing, so the total score was calculated as the sum over all available components divided by the number of available components.
The MEQ was recoded according to the method described by Horne and Östberg (1976). This results in three different types: morning person (total score of 59 or higher), intermediate person (total score between 42 and 58) and evening person (total score of 41 or lower).
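A sketch of both recoding rules (function names are illustrative, not from the thesis scripts):

def meq_type(total):
    # Horne and Östberg (1976) cut-offs as used here
    if total >= 59:
        return 'morning'
    if total >= 42:
        return 'intermediate'
    return 'evening'

def psqi_score(components):
    # Prorated PSQI as described: the sum of the available component
    # scores divided by the number of available components
    available = [c for c in components if c is not None]
    return sum(available) / len(available)

print(meq_type(63))                             # 'morning'
print(psqi_score([1, 2, None, 0, 1, None, 2]))  # 1.2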
Procedure.
All experiments were conducted in the lab environment of Utrecht University. This
study is part of a bigger study. The procedure described was the same for all participants, but
not all collected data will be used in the current study.
On arrival at the lab, participants were asked if they kept to the guidelines and had no
symptoms of COVID-19. Then, participants had to fill in the MEQ, PSQI and a demographic
questionnaire. Next, participants were equipped with an EEG cap, face electrodes and iButtons.
During each session participants were seated before a 60 Hz screen in a dimly lit room at an
environmental temperature of 20.68 ºC (SD=0.56). Participants had to place their head in a
chin rest so that their movement was limited and the FLIR camera could be focused on their
head.
First participants were asked to focus on a fixation cross for five minutes with their eyes
open. Next, they had to close their eyes for 2.5 minutes. Then participants were asked how alert and awake they felt using a 100-point Likert scale. After that they were presented with the BSRT task
(Romeijn & Van Someren, 2011).
Data-Analysis.
All data were processed and analysed using Python 3.9 in PyCharm Professional 2021.1.1 (JetBrains, 2021). For the current study three different kinds of models were trained multiple times on 500 different bootstrapped samples: one kind with only temperature features, a second kind with only the subjective and demographic features, and a third kind with all the features. This was done so that the effect of the different categories of features could be analysed and evaluated, and so that the added value of combining features could be assessed.
Pre-processing. Data from 20 participants were excluded from the analysis due to
missing or flawed data files such as FLIR recordings, iButton data or questionnaires. The FLIR
data were extracted by using commercial software (Flirtools+, FLIR Systems Inc., Wilsonville,
U.S.A.). By hand, two ellipses were placed on the nose and forehead of the participant. These
had to be adjusted throughout the recordings, because of movement of the head. This procedure
was done for each participant and resulted in an output file with all the trials of one participant.
Next, all these data files were combined, and trials that contained missing values were removed
so the classification was only based on complete trials. Then, for all models, the data were split 80:20 into train and test data with random_state=0, a parameter that fixes the state of the random generator so that the split is the same every time (Pedregosa et al., 2011).
Feature extraction. The features were handpicked based on previous literature. The
datapoints that enter the random forest are individual trials. Each trial contains the
corresponding values of the features for a certain participant at a certain time point. The age,
gender, MEQ type and PSQI score of the participant were added to all the trials of that participant. The temperature features fluctuate between trials and participants. Because iButtons measure at 2-second intervals, the iButton temperatures were measured 4 or 5
seconds before the target was presented. This time interval was chosen because 4 seconds is
the shortest interval between trials. The FLIR temperatures were the average of the 30 frames
in the 4th second before the target was presented. The DPGs were calculated based on the
temperature features of the trials. The features used to train each model are described in Table
1. A full description of the feature extraction and data combination is described in Appendix
A.
Table 1.
Overview of which features are used in each model.

Feature                    Model 1   Model 2   Model 3
Age                                     X         X
Gender                                  X         X
MEQ                                     X         X
PSQI                                    X         X
Finger                        X                   X
Chest                         X                   X
Mastoid                       X                   X
Pinna                         X                   X
Nose                          X                   X
Forehead                      X                   X
DPG_finger-chest              X                   X
DPG_nose-forehead             X                   X
DPG_pinna-mastoid             X                   X
FLIR nose                     X                   X
FLIR forehead                 X                   X
FLIR DPG_nose-forehead        X                   X
Output. The output variable is whether the participant saw the target (a hit, coded 1) or did not see the target (a miss, coded 0). A hit means the participant was vigilant; a miss means the participant was less vigilant.
Data processing. Data from all participants were, as said before, combined in one file with all the trials of all the participants. The Random Forest (RF) was programmed using scikit-learn (Pedregosa et al., 2011) and according to the method described by Garreta and Moncecchi (2013). For each model 100 estimators were used. Again, a random state (random_state=9) was used to make sure the output is reproducible. Furthermore, default settings were used. The algorithm takes 100 bootstrapped samples from the training set, which are used to build the different decision trees. At each split a random subset of the input features is considered. Each tree classifies trials as hits or misses, and a majority vote then determines the final output. The procedure of the RF is summarised in Figure 1 (Fraiwan et al., 2012). A full description of the Random Forest models is given in Appendix B.
Figure 1.
The different steps of the classification process of a Random Forest. From Fraiwan, L., Lweesy,
K., Khasawneh, N., Wenz, H., & Dickhaus, H. (2012). Automated sleep stage identification
system based on time–frequency analysis of a single EEG channel and random forest classifier.
Computer Methods and Programs in Biomedicine, 108(1), 10–19.
https://doi.org/10.1016/j.cmpb.2011.11.005. Copyright 2021 by Elsevier Inc.
Hyperparameter tuning of the RF was done for the number of trees and their depth. This required a validation set, so the training set was again split into a train and validation set with a 70:30 split. In total there was thus a 56:24:20 split for training, validation and testing. The hyperparameters were varied over a range of values on the training data and evaluated in terms of accuracy on the validation data. The best values of the
hyperparameters over the validation data were 100 trees per RF and a maximum individual tree
depth of five levels. A full description of the optimisation of the hyperparameters is described
in Appendix C. Further explorative analyses have been conducted and are described in
Appendix F.
Model evaluation. To evaluate the trained models, a benchmark is needed. In this study
the trained models will be compared to a so-called naïve model which always guesses the
majority class of the data. This is a model that classifies all trials as hits (which is the dominant
outcome across trials). The models trained in this study therefore have to be more accurate than
this naïve model. The importance of each feature will be evaluated according to its Gini
Importance. The second benchmark is the Gini importance each feature would have if all features were equally important, that is, one divided by the number of features. If a feature scores higher than this benchmark it is relevant in the prediction of vigilance; if it scores below the benchmark, it does not contribute much to the prediction of vigilance.
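Both benchmarks reduce to a single line each; a sketch with illustrative numbers:

import numpy as np

y_test = np.array([1, 1, 1, 0, 1, 0, 1])  # illustrative hit/miss labels

# Naive model: always predict the majority class (here: hits)
naive_accuracy = max(y_test.mean(), 1 - y_test.mean())
print(naive_accuracy)  # 0.714...

# Equal-importance benchmark: with n features that are all equally
# informative, each Gini importance would be 1 / n
print(1 / 12)  # .0833, the benchmark for a model with 12 features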
Statistical Significance. With the previously described procedure, the models only produce output for one specific 80:20 split, so nothing can be said about statistical significance. That is why a bootstrapping procedure was performed. After the
hyperparameters were optimized, 500 samples of 1716 observations were bootstrapped from
the original data (with replacement). For each bootstrap sample an 80:20 split was performed.
Then the three different models were built, and their accuracy scores were calculated and saved.
Also, the Gini importance for each feature was calculated in each bootstrap sample and stored.
Finally, for each sample a naïve model and its accuracy were computed and saved. The
accuracies of each model were compared with the accuracies of the naïve model to test whether
the model performed significantly better than the naïve model. This was also done for the Gini
Importance of each feature to see whether it performed significantly better than the benchmark.
With this bootstrapping procedure a 95% Confidence Interval was created for the accuracy
differences between each model and the benchmark model. From this interval a statement can
be made on the statistical significance between the models in terms of accuracy performance
and feature relevance. The full procedure and Python script are described in Appendix D.
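The confidence intervals themselves can be read directly off the bootstrap distribution of differences; a minimal sketch with simulated stand-in values (the real script in Appendix D stores the actual per-sample differences):

import numpy as np

rng = np.random.default_rng(9)
# Stand-in for the 500 per-sample differences (model minus naive accuracy)
diffs = rng.normal(loc=0.01, scale=0.01, size=500)

# One-sided 95% CI: the model beats the naive model if the 5th
# percentile of the differences lies above zero
print(np.percentile(diffs, 5) > 0)

# Two-sided 95% CI (used for the Gini importances): drop the lowest
# and highest 2.5% of the bootstrap values
print(np.percentile(diffs, [2.5, 97.5]))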
Results
The training scores of the models before finetuning showed signs of overfitting.
Therefore, the models were finetuned. Below, only the results for these finetuned models are
presented. The finetuned models were also used to assess statistical significance using the bootstrapping procedure.
Accuracy scores. The accuracies of the first, second and third model were compared
with the accuracies of the naïve model. These differences were evaluated with a one-sided 95%
confidence interval to see whether the first, second or third model performed better than the
naïve model. The differences in accuracies for each bootstrapped sample are displayed in Figures 2, 3 and 4. In these figures the accuracy score of the naïve model was subtracted from that of the first, second or third model, thus creating a difference. To determine whether a model performs significantly better than the naïve model, it must be determined whether zero lies inside the one-sided confidence interval: the lower bound of the interval must be higher than zero.
Figure 2.
Accuracy Differences Between the First Model with Temperature Features and the
Benchmark Model
Note. The dashed line is the one-sided confidence interval
The first model performs better than the benchmark with 95% certainty. This is because a difference of zero or lower was not present in the confidence interval (CI = [.0029, 1.00]).
Figure 3.
Accuracy Differences Between the Second Model with Subjective and Demographic Features
and the Benchmark Model
Note. The dashed line is the one-sided confidence interval
The second model does not perform better than the benchmark with 95% certainty. This is because a difference of zero or lower was present in the confidence interval (CI = [-.020, 1.00]).
Figure 4.
Accuracy Differences Between the Third Model with Temperature, Subjective and
Demographic Features and the Benchmark Model
Note. The dashed line is the one-sided confidence interval
The third model does perform better than the benchmark with 95% certainty. This is because a difference of zero or lower was not present in the confidence interval (CI = [.0029, 1.00]).
Figure 5.
Accuracy Differences Between the Third Model with Temperature, Subjective and
Demographic Features and the First model with Temperature Features
Note. The dashed line is the one-sided confidence interval
The accuracies of the third model were compared with the accuracies of the first model.
These differences were evaluated with a one-sided 95% confidence interval to see whether the
third model performed better than the first model. The third model does not perform better than the first model with 95% certainty. This is because a zero or lower difference was present in the confidence interval (CI = [-.0087, 1.00]).
Gini Importance. The Gini importances of the features of the first, second and third model were compared with the second benchmark (equal importance for all features). These were evaluated with a two-sided 95% confidence interval to see whether the features performed better or worse than the benchmark. Figures 6, 7 and 8 display the differences between the Gini importance of a feature and the benchmark. The 2.5% lowest and highest values were removed, thus creating a 95% confidence interval. To test whether a feature performs better or worse than the benchmark, it must be determined whether zero is included in the interval or not.
Figure 6.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
1 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 6 displays the Gini hierarchy of the first model. There were 12 features in this
model, so the benchmark was set at .0833. The FLIR-measured features ranked high, but only the forehead temperature performed better than the benchmark with 95% certainty. Of the iButton-measured temperatures, the chest, mastoid and forehead temperatures performed worse than the benchmark with 95% certainty. The other temperatures contain 0.0 in their interval and therefore cannot be said to perform better or worse.
Figure 7.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
2 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 7 displays the Gini hierarchy of the second model. There were 4 features in this
model, so the benchmark was set at .250. The participants' MEQ type performed better than the benchmark with 95% certainty. The participants' PSQI score and gender performed worse than the benchmark with 95% certainty. The participants' age included a difference of 0.0 in its interval, so nothing can be said with certainty about its relative performance.
Figure 8.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
3 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 8 displays the Gini hierarchy of the third model. There were 16 features in this model, so the benchmark was set at .0625. The FLIR-measured features performed better than the benchmark with 95% certainty. Of the subjective and demographic features, the participants' PSQI score, gender and age performed worse than the benchmark.
The iButton-measured temperatures and the participants' MEQ type contained 0.0 in their interval and therefore cannot be said to perform better or worse.
Discussion
The current study explored the importance of different predictors of (drops in) vigilance
with machine learning. The algorithm random forest (RF) was used due to its usefulness in
feature selection (Menze et al., 2009). Three different kinds of models were trained once and finetuned. Then 500 bootstrapped samples were created, and the models were trained again on these samples with the optimised hyperparameters. To evaluate these models, a naïve model was created that classifies all trials as hits (1.0), and the accuracy scores of the models were compared with the accuracy scores of the naïve model. Before finetuning, the models strongly overfitted the training data and were therefore not useful for comparison. With the finetuned hyperparameters they performed slightly better than or the same as the naïve model. Below, results
are discussed for each model separately.
The first model used only temperature related features. It had an accuracy of 100% for
the train data and 73.8% for the test data. The high accuracy for the train data shows that the
model overfitted the data. Therefore, the parameters of the algorithm had to be finetuned. After finetuning, the model performed better than the naïve model with 95% certainty.
The highest-ranking features are now the forehead temperature measured with the FLIR
camera, the nose temperature measured with an iButton, and the nose temperature measured
with the FLIR camera.
Because the first model scores slightly better than the naïve model, conclusions can be
drawn from its Gini hierarchy. This model indicates that the temperature of the forehead
measured with the FLIR camera can be seen as the best predictor of (drops in) vigilance for
this task because it performs significantly better than the benchmark. The other features that
were measured or calculated with the FLIR camera, ranked relatively high but statistically
nothing can be said about their relative performance to the benchmark. The temperatures of the
chest, mastoid and forehead measured with the iButtons performed significantly worse than
the benchmark and therefore do not predict (drops in) vigilance.
The second model used the subjective and demographic features. The model did not
show any overfitting but was still finetuned the same way as the first and third model. The participants' MEQ type performed significantly better than the benchmark. The participants' PSQI score and gender performed significantly worse than the benchmark. Statistically nothing can be said about the age of the participant. The second model did not perform significantly
better than the naïve model. This implies that the subjective measurements and demographic
data on their own have no contribution to the prediction of vigilance for this task.
The third model used the temperature related features and the subjective and
demographic features. The accuracy of this model was 100% for the train data and 73.8% for
the test data. This model also has a high accuracy score for the train data. Therefore, the model
overfitted the data. The finetuning of this model was the same as for the first and second model.
After finetuning, the model performed better than the naïve model with 95% certainty. The highest-ranking features were measured with the FLIR camera: the forehead temperature, the nose temperature and the FLIR DPG_nose-forehead. The worst-ranking features were again the participants' age, PSQI score and gender.
The implications of this model are mostly consistent with the implications of the first
model. Again, this model performs slightly better than the naïve model. This means that
conclusions can be drawn from the Gini hierarchy of this model. The FLIR-measured features performed significantly better than the benchmark. The participants' PSQI score, gender and age performed significantly worse than the benchmark. Even though the nose temperature of
the iButton ranked high, statistically nothing can be said about its relative performance to the
benchmark. The same goes for the other features measured with the iButtons. Also, the third
model was compared with the first model. The third model did not perform significantly better
than the first model. So, adding the demographic and subjective features does not contribute
significantly to the accuracy of the classification. In short, FLIR measured features are worth
further exploration while age, gender and PSQI score do not show any potential for this task.
However, some placements of features in the hierarchy are inconsistent with previous
literature. Romeijn and Van Someren (2011) found the chest temperature to be a better predictor than the DPG_finger-chest. However, in this study the chest temperature does not score high in the hierarchies; it even scores significantly worse than the benchmark. In addition, the chest temperature does not rank higher than the DPG_finger-chest. Nevertheless, compared to the other iButton-measured features, this DPG ranks relatively high. This is consistent with the findings of Romeijn and Van Someren (2011), which showed that this DPG is a strong predictor of vigilance. However, its confidence interval still includes zero, so nothing can be said statistically about its performance relative to the benchmark.
Even though the first and third model perform slightly better than the naïve model, it is
debatable how relevant these models are. The reason for performing just slightly better could
be found in the difference between participants. There can be different fluctuations in skin
temperature between participants. Romeijn and Van Someren (2011) already noted that these
fluctuations call for sustained recordings and mapping temperatures to the individuals’ normal
range of temperature. These individual differences make it hard for the RF to make the right
split within the temperature features for classifying the trials.
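One way to implement such a mapping, offered here only as a sketch and not part of the current pipeline, is to z-score each temperature feature within each participant before training, so the model sees deviations from a person's own normal range (the participant_id column is hypothetical):

import pandas as pd

def normalise_within_participant(df, temp_cols, id_col='participant_id'):
    out = df.copy()
    grouped = out.groupby(id_col)[temp_cols]
    # Standardise each temperature column within each participant
    out[temp_cols] = (out[temp_cols] - grouped.transform('mean')) / grouped.transform('std')
    return out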
Furthermore, this study added a staircase procedure to the BSRT task created by
Romeijn and Van Someren (2011). The staircase procedure made sure that the number of hits for each participant was consistent. However, it could have introduced a confound:
when a participant became less vigilant, instead of producing more misses the task became
easier resulting in more hits again. On the other hand, when a participant was vigilant a miss
could be the result of the staircase making the task too difficult. Another consequence of the
staircase is visible in model 2. Because the number of hits was the same for each participant,
there could be no discrimination between the participants specific features like PSQI score, age
and gender.
Furthermore, the finetuned hyperparameters were fixed for all 500 samples. To be conclusive, these parameters should have been optimised for each bootstrapped sample separately to get optimal performance. However, due to a lack of time this optimisation was only done for one model, and those values were used as the hyperparameters for all samples. Further research could add this finetuning to the code described in Appendix D.
Lastly, due to the many different experiment leaders, some things went wrong during the data collection. Because there was not much time for the leaders to get used to the protocol, some files were not (correctly) saved. Also, the processing of the FLIR data took a long time, so additional participants had to be excluded from the study because their FLIR data were unavailable. This limited the amount of data that could be used for analysis and therefore reduced statistical power.
Due to these limitations, further research should explore the different features in the BSRT without the staircase. Also, other tasks like the CTET (O'Connell et al., 2009) could be useful for analysing the importance of different predictors. This task has proven suitable for predicting vigilance using EEG signals, so exploring it with temperature measurements could be useful. This task also has a lower hit rate than most vigilance assessment
tasks (O'Connell et al., 2009) which creates a more evenly distributed division between hits
and misses.
Also, more has to be known about the difference between individuals in fluctuations of
skin temperature. As mentioned before, participants can differ in their fluctuations in skin
temperature but also in their overall mean temperature. By training classification models for
individuals separately instead of building a general classification model for all individuals,
more can be known about the predictiveness of different temperature features and their
differences between participants.
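A sketch of that idea, training one forest per participant with the hyperparameters used in this study (again assuming a hypothetical participant_id column and enough trials per participant for a split):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def per_participant_models(df, feature_cols, id_col='participant_id'):
    models = {}
    for pid, trials in df.groupby(id_col):
        X = trials[feature_cols].values
        Y = trials['results'].values
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, train_size=0.8, random_state=0)
        rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                    random_state=9)
        rf.fit(X_train, Y_train)
        models[pid] = (rf, rf.score(X_test, Y_test))
    return models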
Further research should also look at the kind of predictive relationship between the
features and output. While a RF is very useful for feature selection, it cannot say much about
this relationship. Different machine learning algorithms can be used to determine the specific
relationship and classical statistics should confirm these relationships.
In conclusion, the research questions of the current study can be answered. The best
predictor of vigilance is most likely the temperature of the forehead measured with the FLIR
camera. The hierarchies of the first and third finetuned models are the best indicators of the
usefulness of predictors. However, these models' accuracies do not differ much from the naïve model, meaning that all included features in aggregate have little predictive power for drops in vigilance. The conclusions drawn here therefore have to be confirmed with further research.
References
Blatter, K., Graw, P., Münch, M., Knoblauch, V., Wirz-Justice, A., & Cajochen, C. (2006).
Gender and age differences in psychomotor vigilance performance under differential
sleep pressure conditions. Behavioural Brain Research, 168(2), 312–317.
https://doi.org/10.1016/j.bbr.2005.11.018
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/a:1010933404324
Buysse, D. J., Reynolds, C. F., Monk, T. H., Berman, S. R., & Kupfer, D. J. (1989). The
Pittsburgh sleep quality index: A new instrument for psychiatric practice and research.
Psychiatry Research, 28(2), 193–213. https://doi.org/10.1016/0165-1781(89)90047-4
Bzdok, D., Altman, N., & Krzywinski, M. (2018). Statistics versus machine learning. Nature
Methods, 15(4), 233–234. https://doi.org/10.1038/nmeth.4642
Centraal Bureau voor de Statistiek. (2020, October 15). Hoeveel mensen komen om in het
verkeer? https://www.cbs.nl/nl-nl/visualisaties/verkeer-en-vervoer/verkeer/hoeveel-
mensen-komen-om-in-het-verkeer-
Chang, S., Cohen, T., & Ostdiek, B. (2018). What is the machine learning? Physical Review
D, 97(5), 056009-1 – 056009-6. https://doi.org/10.1103/PhysRevD.97.056009
De Raedt, R., & Ponjaert-Kristoffersen, I. (2001). Predicting at-fault car accidents of older
drivers. Accident Analysis & Prevention, 33(6), 809–819.
https://doi.org/10.1016/S0001-4575(00)00095-6
Fraiwan, L., Lweesy, K., Khasawneh, N., Wenz, H., & Dickhaus, H. (2012). Automated sleep
stage identification system based on time–frequency analysis of a single EEG channel
and random forest classifier. Computer Methods and Programs in Biomedicine,
108(1), 10–19. https://doi.org/10.1016/j.cmpb.2011.11.005.
Garreta, R., & Moncecchi, G. (2013). Learning scikit-learn: Machine learning in Python.
Packt Publishing.
Horne, J. A., & Östberg, O. (1976). A self-assessment questionnaire to determine
morningness-eveningness in human circadian rhythms. International Journal of
Chronobiology, 4, 97–110.
Jin, C. Y., Borst, J. P., & Vugt, M. K. (2020). Distinguishing vigilance decrement and low
task demands from mind-wandering: A machine learning analysis of EEG. European
Journal of Neuroscience, 52(9), 4147–4164. https://doi.org/10.1111/ejn.14863
Lin, C.-T., Chuang, C.-H., Huang, C.-S., Tsai, S.-F., Lu, S.-W., Chen, Y.-H., & Ko, L.-W.
(2014). Wireless and Wearable EEG System for Evaluating Driver Vigilance. IEEE
Transactions on Biomedical Circuits and Systems, 8(2), 165–176.
https://doi.org/10.1109/TBCAS.2014.2316224
Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., &
Hamprecht, F. A. (2009). A comparison of random forest and its Gini importance
with standard chemometric methods for the feature selection and classification of
spectral data. BMC Bioinformatics, 10(1), 213. https://doi.org/10.1186/1471-2105-10-
213
O’Connell, R. G., Dockree, P. M., Robertson, I. H., Bellgrove, M. A., Foxe, J. J., & Kelly, S.
P. (2009). Uncovering the Neural Signature of Lapsing Attention:
Electrophysiological Signals Predict Errors up to 20 s before They Occur. Journal of
Neuroscience, 29(26), 8604–8611. https://doi.org/10.1523/JNEUROSCI.5967-
08.2009
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., & Cournapeau,
D. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research 12, 2825-2830.
PyCharm: the Python IDE for Professional Developers. (2020, August 25). JetBrains.
https://www.jetbrains.com/pycharm/
Romeijn, N., & Van Someren, E. J. W. (2011). Correlated Fluctuations of Daytime Skin
Temperature and Vigilance. Journal of Biological Rhythms, 26(1), 68–77.
https://doi.org/10.1177/0748730410391894
Romeijn, N., Verweij, I. M., Koeleman, A., Mooij, A., Steimke, R., & Virkkala, J. (2012).
Cold hands, warm feet: Sleep deprivation disrupts thermoregulation and its
association with vigilance. Sleep, 35(12), 1673–1680.
Schmidt, E. A., Schrauf, M., Simon, M., Fritzsche, M., Buchner, A., & Kincses, W. E.
(2009). Drivers’ misjudgement of vigilance state during prolonged monotonous
daytime driving. Accident Analysis & Prevention, 41(5), 1087–1093.
https://doi.org/10.1016/j.aap.2009.06.007
Shanker, A. (2019). Bioinformatics: Sequences, Structures, Phylogeny (Softcover reprint of
the original 1st ed. 2018 ed.). Springer.
Shi, L.-C., & Lu, B.-L. (2013). EEG-based vigilance estimation using extreme learning
machines. Neurocomputing, 102, 135–143.
https://doi.org/10.1016/j.neucom.2012.02.041
Te Lindert, B. H. W., & Van Someren, E. J. W. (2018). Skin temperature, sleep, and
vigilance. In Handbook of Clinical Neurology (Vol. 156, pp. 353–365). Elsevier.
https://doi.org/10.1016/B978-0-444-63912-7.00021-7
Van Schaik, A.S. (2021). Predicting drops in vigilance based on beta activity (Unpublished
Bachelor’s Thesis). Utrecht University.
Züger, M., Müller, S. C., Meyer, A. N., & Fritz, T. (2018). Sensing Interruptibility in the
Office: A Field Study on the Use of Biometric and Computer Interaction Sensors.
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems,
1–14. https://doi.org/10.1145/3173574.3174165
Appendix A
The script is called data_combination_BSRT.py. Here the main parts of the script will
be explained. The full script is available on https://github.com/rooslucas/Bachelor-Thesis.
There is also a script available to combine all participant files together. This is called
combine_all_files.py. An adaptation to combine the data for the CTET is also available on
GitHub and is called data_combination_CTET.py.
The script loops through all participants that are available. The participant ID is saved
and used further in the script. A new data frame is created with the triggers and times from
the trigger logger.
Next the practice trials are removed by scanning for the start trigger 21.
The trigger logger stores times as seconds passed since 1970-01-01 2:00 AM. These are converted to year-month-day hours:minutes:seconds:milliseconds.
A new column is added to the data frame to display the end time of a trial. The
locations of different triggers are saved. The script then loops through the locations with the
response triggers (8 and 0) and makes each row one trial. Then the other triggers are
removed, and the index of the data frame is reset.
from datetime import datetime, timedelta
import pandas as pd

for participant in trigger_list:
    print(f"\n Participant {i}/{len(trigger_list)}")
    # Save the participant id
    participant_id = participant.split('_')[0]
    print(f"Processing data from participant {participant_id}")
    trigger_path = trigger_folder + '/' + participant
    triggers = pd.read_csv(trigger_path)
    # Create dataframe
    data_frame = {'trigger': triggers['Trigger'],
                  'start_time': triggers['TriggerTime']}
    data_file = pd.DataFrame(data_frame)
    start, = data_file.index[data_file['trigger'] == 20.0]
    data_file.drop(range(1, start), inplace=True)
    data_file.dropna(inplace=True)
    # Refactor the trigger times
    original_time = datetime(year=1970, month=1, day=1, hour=2)
    for time in data_file['start_time']:
        new_time = original_time + timedelta(seconds=time)
        data_file['start_time'].replace(time, new_time, inplace=True)
Next, a column with the results is added. For each trial in the data frame the script
checks if it was a hit (2.0-8.0) or a miss (2.0-0.0). Rows with missing triggers get value 99
and are then removed. Again, the index is reset.
Now the FLIR data is added to the data frame. First the script checks if the FLIR file exists. Then the FLIR date, time and millisecond columns are combined into one timestamp.
# Add new column with end time of each trial
data_file['end_time'] = ''
# Get locations of triggers 8, 0 and 1
locations = data_file.index[(data_file['trigger'] == 8.0) |
                            (data_file['trigger'] == 0.0)]
locations_2 = data_file.index[data_file['trigger'] == 1.0]
data_file['trigger'] = data_file['trigger'].astype(str)
# Make each row one trial from target to response
for location in locations:
    data_file.at[location - 1, 'end_time'] = \
        data_file.at[location, 'start_time']
    data_file.at[location - 1, 'trigger'] = \
        str(data_file.at[location - 1, 'trigger']) + '-' + \
        str(data_file.at[location, 'trigger'])
    data_file.drop(index=location, inplace=True)
# Drop the other triggers
for location in locations_2:
    data_file.drop(index=location, inplace=True)
# Reset index of dataframe
data_file.reset_index(drop=True, inplace=True)

# Get responses from triggers
data_file['results'] = ''
for trigger in data_file['trigger']:
    loc = data_file.index[data_file['trigger'] == trigger]
    if trigger == '2.0-8.0':
        data_file.loc[loc, 'results'] = 1.0
    elif trigger == '2.0-0.0':
        data_file.loc[loc, 'results'] = 0.0
    else:
        data_file.loc[loc, 'results'] = 99  # missing trigger
missing_results = data_file.index[data_file['results'] == 99]
for error in missing_results:
    data_file.drop(error, inplace=True)
data_file.dropna(inplace=True)
data_file.reset_index(drop=True, inplace=True)
Then the script loops through the trials in the data frame. It also loops through the
times in the FLIR data and looks if the number of seconds of the FLIR time matches the trial
time four seconds prior. If so the average of the 30 frames in that second is calculated for the
nose and forehead. To speed up the process, rows that have already been scanned in the FLIR data are skipped, and after the right second has been found the loop breaks. Afterwards missing data are dropped.
if flir_data is not None:
    # Reset times of flir data
    for point in range(len(flir_data['Time'])):
        ttime = str(flir_data.at[point, 'Time'])
        good_time = pd.to_datetime(ttime) + timedelta(
            milliseconds=int(flir_data.at[point, 'Milliseconds']))
        good_time = datetime.combine(
            pd.to_datetime(flir_data.at[point, 'Date']), good_time.time())
        flir_data.at[point, 'good_time'] = good_time

    # Loop through trials
    for time in data_file['start_time']:
        matches_el1 = []
        matches_el2 = []
        row, = data_file.index[(data_file['start_time'] == time)]
        # Add values from the nearest time samples
        for time_f in range(start, len(flir_data['good_time'])):
            ttime = flir_data.iloc[time_f]['good_time']
            if int((time - datetime(1970, 1, 1)).total_seconds()) - 4 == \
                    int((ttime - datetime(1970, 1, 1)).total_seconds()):
                matches_el1.append(flir_data.iloc[time_f]['El1.Average'])
                matches_el2.append(flir_data.iloc[time_f]['El2.Average'])
                new_time = time_f  # Define new start point
            elif int((time - datetime(1970, 1, 1)).total_seconds()) - 3 == \
                    int((ttime - datetime(1970, 1, 1)).total_seconds()):
                start = new_time
                break
        # Calculate average values over one second
        if len(matches_el1) != 0:
            average_el1 = sum(matches_el1) / len(matches_el1)
            average_el2 = sum(matches_el2) / len(matches_el2)
            # Add values to the dataframe
            data_file.at[row, 'FLIR_forehead'] = average_el1
            data_file.at[row, 'FLIR_nose'] = average_el2
        print(f'{trial} done!')
        trial += 1
After the FLIR data is processed the start times drop their milliseconds so they can be
matched to the iButton times.
Now the iButton data is processed. Each participant has a folder containing the different iButton files. The script skips the file with the room temperature, then loops
through the trials and matches the start time with the iButton time 4 or 5 seconds before. The
temperature at that timepoint enters the data frame for the trial. Afterwards missing values are
again removed.
# Prep times for ibutton matches
for time in data_file['start_time']:
    new_time = time - timedelta(microseconds=time.microsecond)
    data_file['start_time'].replace(time, new_time, inplace=True)

# Add each ibutton to the dataframe
for ibutton in ibuttons_list:
    path = ibutton_folder + '/' + ibutton
    temp = pd.read_csv(path, skiprows=18)
    ibutton_name, rest = ibutton.split('_')
    print(f"Processing data from {ibutton_name}")
    if ibutton_name != "FF00000045298741":  # skip room-temperature iButton
        data_file[ibutton_name] = 99.9
        for time in temp['Date/Time']:
            good_time = pd.to_datetime(time)
            location_temp, = temp.index[(temp['Date/Time'] == time)]
            for trigger_time in data_file['start_time']:
                if good_time == trigger_time - timedelta(seconds=4):
                    location = data_file.index[
                        (data_file['start_time'] == trigger_time)]
                    data_file.loc[location, ibutton_name] = \
                        temp.iloc[location_temp]['Value']
                elif good_time == trigger_time - timedelta(seconds=5):
                    location = data_file.index[
                        (data_file['start_time'] == trigger_time)]
                    data_file.loc[location, ibutton_name] = \
                        temp.iloc[location_temp]['Value']
Then the different iButton DPGs are calculated. If there is FLIR data, the DPG based on the FLIR temperatures is also calculated. Finally, the data from the questionnaires are added: the script searches for the participant ID and enters the corresponding values into the data frame. Then the file is saved as a .csv with the participant ID in the name.
# Calculate the DPGs
print("Calculate DPG_finger-chest")
data_file['DPG_finger-chest'] = \
    data_file['4B0000004516B141'] - data_file['9A00000045146841']
print("Calculate DPG_nose-forehead")
data_file['DPG_nose-forehead'] = \
    data_file['CB000000452D7441'] - data_file['F9000000452CCF41']
print("Calculate DPG_pinna-mastoid")
data_file['DPG_pinna-mastoid'] = \
    data_file['76000000452C9741'] - data_file['7200000045201D41']
if flir_data is not None:
    print("Calculate FLIR_DPG_nose-forehead")
    data_file['FLIR_DPG_nose-forehead'] = \
        data_file['FLIR_nose'] - data_file['FLIR_forehead']

# Add data from questionnaires
print("Adding data from the questionnaires")
questionnaire_folder = directory + '/Questionnaires/questionnaire_data.csv'
questionnaire_file = pd.read_csv(questionnaire_folder)
row_number = questionnaire_file.index[
    (questionnaire_file['PPID'] == participant_id)]
if len(row_number) != 0:
    data_file['Gender'] = questionnaire_file.at[row_number[0], 'Gender']
    data_file['Age'] = questionnaire_file.at[row_number[0], 'Age']
    data_file['MEQ_type'] = questionnaire_file.at[row_number[0], 'type']
    data_file['PSQI'] = questionnaire_file.at[
        row_number[0], 'total_score_PSQI']

# Save file to a csv
data_file.to_csv(r'/Users/roos/Data/final_trials/trials' + participant_id +
                 '.csv', index=False, header=True)
Appendix B
For this study multiple random forest models were trained. All Jupyter notebooks are available on https://github.com/rooslucas/Bachelor-Thesis. These are the non-optimised models; to optimise the models, the parameters can be adjusted. The method is based on Garreta and Moncecchi (2013) and is the same for all models. Here the main parts of the method will be described for model 3. Models 1 and 2 use the same method and can, as said before, be found on the author's GitHub, as can the models for the CTET. For each model certain features are selected from the data frame.
MEQ type and gender are categorical variables and therefore have to be encoded: the labels are transformed into integers so the model can interpret them.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data_file_path = '/Users/roos/Data/all_trials_noNaN2.csv'
data_file = pd.read_csv(data_file_path)
data_3 = data_file[['Age', 'Gender', 'PSQI', 'MEQ_type',
                    '9A00000045146841', 'F9000000452CCF41',
                    '76000000452C9741', '7200000045201D41',
                    '4B0000004516B141', 'CB000000452D7441',
                    'DPG_finger-chest', 'DPG_nose-forehead',
                    'DPG_pinna-mastoid', 'results', 'FLIR_forehead',
                    'FLIR_nose', 'FLIR_DPG_nose-forehead']]

# Encode categorical variables
# Gender
encoder = LabelEncoder()
label_encoder_gender = encoder.fit(data_3['Gender'])
print("Gender classes:", label_encoder_gender.classes_)
integer_classes_gender = \
    label_encoder_gender.transform(label_encoder_gender.classes_)
print("Gender integer classes:", integer_classes_gender)
code = label_encoder_gender.transform(data_3['Gender'])
data_3['Gender'] = code

# MEQ_type
label_encoder_MEQ = encoder.fit(data_3['MEQ_type'])
print("MEQ classes:", label_encoder_MEQ.classes_)
integer_classes_MEQ = \
    label_encoder_MEQ.transform(label_encoder_MEQ.classes_)
print("MEQ integer classes:", integer_classes_MEQ)
code_MEQ = label_encoder_MEQ.transform(data_3['MEQ_type'])
data_3['MEQ_type'] = code_MEQ
A single decision tree is trained and evaluated. This tree is visualised and can be
found in appendix D.
Next the forest is created, and its accuracy is displayed.
For each feature its Gini importance is calculated, and the hierarchy is displayed in a
table.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import graphviz

dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, Y_train)

dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=data_3.drop(
                                    'results', axis=1).columns,
                                class_names=['0.0', '1.0'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('model3.gv', view=True)
graph
# Building a forest
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, random_state=9)
random_forest.fit(X_train, Y_train)
# Accuracy on the training set
print("Training Accuracy is: ", random_forest.score(X_train, Y_train))
# Accuracy on the test set
print("Testing Accuracy is: ", random_forest.score(X_test, Y_test))
fi2 = ''
final2 = ''
for i, column in enumerate(data_3.drop('results', axis=1)):
    print('Importance of feature {}: {:.3f}'.format(
        column, random_forest.feature_importances_[i]))
    fi2 = pd.DataFrame({'Variable': [column],
                        'Feature Importance Score':
                            [random_forest.feature_importances_[i]]})
    try:
        final2 = pd.concat([final2, fi2], ignore_index=True)
    except:
        final2 = fi2  # first iteration: final2 is not yet a DataFrame

# Ordering the data
final_fi2 = final2.sort_values('Feature Importance Score',
                               ascending=False).reset_index()
final_fi2
Appendix C
This script describes the optimisation process of model 1. The full Jupyter notebook is available on https://github.com/rooslucas/Bachelor-Thesis. The main parts of the script will be described here. The same optimised parameter values were afterwards used for model 2 and model 3.
The data was split into a train set and a test set with an 80:20 split. Then the train set was split into a train set and a validation set with a 70:30 split.
To determine the optimal number of estimators, different numbers between 100 and 500 were tried. The accuracies of the train and validation sets for each number were displayed in a graph.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Splitting the data
X = data.drop('results', axis=1).values
Y = data['results'].values
print('X shape: {}'.format(np.shape(X)))
print('Y shape: {}'.format(np.shape(Y)))
X_train1, X_test, Y_train1, Y_test = train_test_split(
    X, Y, train_size=0.8, test_size=0.2, random_state=0)
X_train, X_validate, Y_train, Y_validate = train_test_split(
    X_train1, Y_train1, train_size=0.7, test_size=0.3, random_state=0)
# Trying different numbers of trees
train_acc_tree = []
val_acc_tree = []
trees = [100, 150, 200, 250, 300, 350, 400, 450, 500]
for num_trees in trees:
    print("Number of trees:", num_trees)
    random_forest = RandomForestClassifier(n_estimators=num_trees, random_state=30)
    random_forest.fit(X_train, Y_train)
    # Accuracy on train
    train_acc_tree.append(random_forest.score(X_train, Y_train))
    # Accuracy on validation
    val_acc_tree.append(random_forest.score(X_validate, Y_validate))
plt.plot(trees, train_acc_tree, c="magenta")
plt.plot(trees, val_acc_tree, c="aqua")
plt.show()
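In such a graph the training accuracy typically stays close to 1 for every number of trees, while the validation accuracy levels off; the smallest number of trees at which the validation curve plateaus is then a reasonable choice.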
To determine the optimal depth, different numbers between 1 and 10 were tried. The accuracies of the train and validation sets for each number were displayed in a graph.
After the parameters were chosen, all sets were evaluated with the optimal settings.
# Trying different depths
train_acc_depth = []
val_acc_depth = []
depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for max_depth in depth:
    print("Depth:", max_depth)
    random_forest = RandomForestClassifier(n_estimators=100, random_state=30, max_depth=max_depth)
    random_forest.fit(X_train, Y_train)
    # Accuracy on train
    train_acc_depth.append(random_forest.score(X_train, Y_train))
    # Accuracy on validation
    val_acc_depth.append(random_forest.score(X_validate, Y_validate))
plt.plot(depth, train_acc_depth, c="magenta")
plt.plot(depth, val_acc_depth, c="aqua")
plt.show()
# Evaluation with the chosen settings
random_forest = RandomForestClassifier(n_estimators=100, random_state=30, max_depth=5)
random_forest.fit(X_train, Y_train)

# Accuracy on train
print("Training Accuracy is: ", random_forest.score(X_train, Y_train))

# Accuracy on validation
print("Validation Accuracy is: ", random_forest.score(X_validate, Y_validate))

# Proportion of positive cases in the test set (the naive benchmark)
print(Y_test.sum()/(len(Y_test)))

# Accuracy on test
print("Testing Accuracy is: ", random_forest.score(X_test, Y_test))
Appendix D
For this study 500 bootstrapped samples were made. The notebook is available on https://github.com/rooslucas/Bachelor-Thesis and is called vigilance_RF_bootstrapped.ipynb. The hyperparameters are fixed but could be optimized for each model individually by adding this step inside the for loop (a sketch of this extension is given after the loop code below). The main parts will be explained here. A full script and a visualization script are available on the author's GitHub.
First, a random seed was set so the procedure could be reproduced. Then the data file was loaded, and a new data frame was created.
Next, 500 bootstrapped samples were created and saved in a list.
Then the first kind of model was trained for each sample in the samples list. This procedure can be repeated for each of the different models.
Next, for each sample two random forests were created: one without hyperparameter optimization and one with this optimization. The accuracy scores and Gini importances were added to the data frame.
np.random.seed(0)
data_file_path = '/Users/roos/Data/all_trials_noNaN2.csv'
data_file = pd.read_csv(data_file_path)
output = pd.DataFrame()

samples = []
for i in range(500):
    samples.append(data_file.sample(n=len(data_file), replace=True))

index = 0
for sample in samples:
    sample1 = sample[['9A00000045146841', 'F9000000452CCF41',
                      '76000000452C9741', '7200000045201D41',
                      '4B0000004516B141', 'CB000000452D7441',
                      'DPG_finger-chest', 'DPG_nose-forehead',
                      'DPG_pinna-mastoid', 'results', 'FLIR_forehead',
                      'FLIR_nose', 'FLIR_DPG_nose-forehead']]

    # Splitting the data
    X = sample1.drop('results', axis=1).values
    Y = sample1['results'].values
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=0)
Lastly, the differences between the models, features and benchmarks were calculated
and saved in the data frame.
    # Building a forest without and with hyperparameter optimization
    random_forest = RandomForestClassifier(n_estimators=100, random_state=9)
    random_forest.fit(X_train, Y_train)
    random_forest2 = RandomForestClassifier(n_estimators=100, random_state=9, max_depth=5)
    random_forest2.fit(X_train, Y_train)

    output.at[index, 'Acc1'] = random_forest.score(X_test, Y_test)
    output.at[index, 'Acc_opt1'] = random_forest2.score(X_test, Y_test)

    importances1 = random_forest2.feature_importances_
    output.at[index, 'Gini_9A00000045146841_1'] = importances1[0]
    output.at[index, 'Gini_F9000000452CCF41_1'] = importances1[1]
    output.at[index, 'Gini_76000000452C9741_1'] = importances1[2]
    output.at[index, 'Gini_7200000045201D41_1'] = importances1[3]
    output.at[index, 'Gini_4B0000004516B141_1'] = importances1[4]
    output.at[index, 'Gini_CB000000452D7441_1'] = importances1[5]
    output.at[index, 'Gini_DPG_finger-chest_1'] = importances1[6]
    output.at[index, 'Gini_DPG_nose-forehead_1'] = importances1[7]
    output.at[index, 'Gini_DPG_pinna-mastoid_1'] = importances1[8]
    output.at[index, 'Gini_FLIR_forehead_1'] = importances1[9]
    output.at[index, 'Gini_FLIR_nose_1'] = importances1[10]
    output.at[index, 'Gini_FLIR_DPG_nose-forehead_1'] = importances1[11]
    index += 1
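Neither the per-sample hyperparameter optimization mentioned at the start of this appendix nor the calculation of the differences appears in the excerpt above. The following is a minimal sketch of how both steps could look, not the author's exact code: the parameter grid, the Diff_* column names and the choice of the majority-class proportion as naive benchmark are assumptions for illustration.

from sklearn.model_selection import GridSearchCV

# (1) Per-sample hyperparameter optimization: this fragment would replace
# the fixed random_forest2 inside the `for sample in samples:` loop.
# The parameter grid is illustrative, not taken from the thesis.
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, 7]}
search = GridSearchCV(RandomForestClassifier(random_state=9), param_grid, cv=3)
search.fit(X_train, Y_train)
output.at[index, 'Acc_opt1'] = search.score(X_test, Y_test)

# (2) Differences against the naive benchmark, computed after the loop.
# The benchmark is assumed here to be the majority-class proportion of the
# full data set, analogous to the naive model of Appendix C.
naive = max(data_file['results'].mean(), 1 - data_file['results'].mean())
output['Diff1'] = output['Acc1'] - naive
output['Diff_opt1'] = output['Acc_opt1'] - naive
output['Diff_opt_minus_plain1'] = output['Acc_opt1'] - output['Acc1']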
Appendix E
Three examples of decision trees, one example tree for each model.
Model 1. [Decision tree figure: root split FLIR_forehead ≤ 34.005 (gini = .343, samples = 1372); further splits on FLIR_nose, 76000000452C9741, FLIR_forehead, CB000000452D7441 and FLIR_DPG_nose-forehead.]
Model 2. [Decision tree figure: root split MEQ_type ≤ 1.5 (gini = .343, samples = 1372); further splits on Age, MEQ_type, PSQI and Gender.]
Model 3. [Decision tree figure: root split MEQ_type ≤ 1.5 (gini = .343, samples = 1372); further splits on DPG_pinna-mastoid, FLIR_forehead, CB000000452D7441, DPG_finger-chest, DPG_pinna-mastoid and FLIR_nose.]
Appendix F
Some further explorative analyses were conducted but not added to the original study. However, they are quite interesting and are therefore described in this appendix. Another task that the participants performed was the continuous temporal expectancy task (CTET), a vigilance assessment task based on the task created by O'Connell et al. (2009). A full description can be found in the study of Van Schaik (2021).
Again, three different models were made in the same way as for the BSRT. The only difference is that the iButton and FLIR data were taken 20 seconds before target display, because this matched the timing of the EEG sample in O'Connell et al. (2009). The results were as follows.
Figure 9.
Accuracy of Each Model of the CTET Before and After Finetuning of the Parameters
Before hyperparameter optimization the models were not strongly overfitted, but they still performed much better on the train data than on the test data, except for the second model. After finetuning, the accuracy of the models improved by almost 10% relative to the naïve model.
Figure 6.
Gini Importance per Feature for Model 1 of the CTET
The benchmark is set at .11 for the first model. The best performing features after hyperparameter optimization were the nose temperature, the DPG pinna-mastoid and the DPG finger-chest. The worst performing features were the forehead temperature, the mastoid temperature and the chest temperature.
Figure 7.
Gini Importance per Feature for Model 2 of the CTET
The benchmark is set at .25 for the second model. The best performing features were the participants' PSQI score and age. The worst performing features were the participants' MEQ type and gender.
Figure 8.
Gini Importance per Feature for Model 3 of the CTET
The benchmark is set at .077 for the third model. The best performing features after hyperparameter optimization were the nose temperature, the DPG pinna-mastoid and the DPG finger-chest. The worst performing features were the mastoid temperature and the participants' MEQ type and gender.
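These benchmark values appear to correspond to an equal-importance baseline of 1/k for a model with k features: 1/9 ≈ .11 for model 1, 1/4 = .25 for the four demographic features of model 2, and 1/13 ≈ .077 for model 3.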
Here a bootstrapped procedure would be useful as well; it would allow firmer conclusions about the usefulness of the different models and features. For now, the FLIR-measured features seem to be relevant. However, the second model performs best, which would suggest that the classification in all models is based on differences between persons rather than on differences between the temperature features.