Predicting (Drops in) Vigilance using Machine Learning
Rosalie E. Lucas
6540384
June 26, 2021
Supervisor: Leon Kenemans
Bachelor Psychology
Utrecht University
Abstract
The prediction of (drops in) vigilance can be useful in the prevention of car crashes. Previous
work showed that temperature data, demographic data and subjective measurements can be
useful in these predictions. With machine learning this study explores these different features
as possible predictors of vigilance in the Brief Stimulus Reaction Time task (BSRT). Three
different kinds of Random Forest models were trained, and a naïve model was created for the
evaluation of these models. Temperature was measured four seconds prior to the target
presentation with iButtons and a FLIR camera. Results show that features measured with the
FLIR camera may be useful predictors of vigilance, especially the forehead temperature.
However, the accuracy scores of the models were only slightly better than that of the naïve model, so further research should explore these features and confirm these findings.
Introduction
In 2020 the Netherlands counted 620 fatal car crashes (Centraal Bureau voor de Statistiek, 2020). One of the main causes of car crashes is drops in attention (De Raedt &
Ponjaert-Kristoffersen, 2001). This can be explained by the fact that attention can be challenged
during routine tasks like driving (O'Connell et al., 2009). The question that therefore arises is
how these accidents can be prevented. Can people be alerted when their attention drops? To
answer this, more has to be known about the type of attention that is relevant for driving, named
vigilance. Vigilance is a state of readiness, manifested in the extent to which a person can detect
and respond at random time intervals to small changes occurring in the environment (Schmidt
et al., 2009). In other words, vigilance is the ability to pay attention, for a certain period of
time, to a task (Schmidt et al., 2009).
This research was an explorative bottom-up study of different predictors for (drops in)
vigilance. The aim of this study is to compare these predictors and see which predictors are
worth further exploring. This was done using machine learning. Machine learning is very useful
for the prediction of an outcome variable (Bzdok et al., 2018) because it learns from known
properties of the training data to give accurate predictions for unknown test data (Shanker,
2018). While classical statistics can also make predictions, machine learning is not bound by
the restrictions of these methods. Machine learning makes fewer assumptions and is therefore able to detect complicated nonlinear interactions, such as two-way interactions, three-way interactions, or even higher-order ones (Bzdok et al., 2018). To test this with classical statistics, such
as a general linear model, these interactions would have to be specified in advance and then
put into the classical specification, which can be more restrictive. When we look at a broad set
of possible predictors, machine learning can also be useful to assess the relative importance of
different features, even in settings where the number of input variables exceeds the number of
participants (Bzdok et al., 2018).
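As a minimal illustration of this point (not part of the analyses in this thesis), consider an outcome driven purely by a two-way interaction. A logistic regression without an explicitly specified interaction term stays at chance level, whereas a random forest recovers the pattern without any prior specification:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # pure XOR-like interaction

print(LogisticRegression().fit(X, y).score(X, y))  # ~.50, chance level
print(RandomForestClassifier(random_state=0).fit(X, y).score(X, y))  # ~1.0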
Machine learning techniques also have a downside. A reason to be cautious with
machine learning is the so-called black box problem: the user has limited to no insight into the
choices the algorithm makes to get from a given input to the given output (Chang et al., 2018).
That is why this research uses the Random Forest (RF) algorithm from Breiman (2001). A RF
is a classification algorithm based on randomised decision trees that together build a forest
(Breiman, 2001). These decision trees are easier to understand and give some insight into the choices that the computer makes to classify each trial.
Another reason to use a RF is that Menze et al. (2009) found the RF to be the best
algorithm for feature selection. This is because it classifies based on different features. A RF
randomly selects features and uses these to split the decision tree based on their Gini impurity
(Menze et al., 2009). Gini impurity indicates how well a potential split separates the data points into classes (Menze et al., 2009). For each feature, the Gini importance can be obtained, reflecting how much that feature's splits reduced impurity across the trees (Menze et al., 2009). It can therefore be used to create a hierarchy of features, from most to least important (Züger et al., 2018). Combined
with the Gini importance, this algorithm can thus be used to gain insight in the usefulness of
different features. The features found this way can subsequently be investigated in more depth.
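To make the splitting criterion concrete, the sketch below computes the Gini impurity of a node; the function is illustrative and not taken from the thesis scripts. A pure node (only hits or only misses) has impurity 0, while a maximally mixed node has impurity .5 for two classes:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))  # 0.0: pure node, all hits
print(gini_impurity([1, 1, 0, 0]))  # 0.5: maximally mixed node

In scikit-learn, the Gini importance of each feature is reported through the feature_importances_ attribute used in the appendices, normalised so that all importances sum to one.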
Background
To determine which features are important to consider, more must be known about
previous research into predictors for vigilance. The focus will be on four groups: EEG,
temperature related data, subjective measurements, and demographic data.
EEG. Previous studies have found EEG signals to be useful predictors of vigilance
(O'Connell et al., 2009). They found alpha activity to be a strong predictor of lapses in attention.
Also, delta and theta activity correlate strongly with poor task performance and fatigue (Lin et al., 2014). Other studies have found a similar predictive relationship using machine learning algorithms, such as extreme learning machines and support vector machines (Shi & Lu, 2013; Jin et al., 2020). Though EEG is a very powerful predictor of vigilance, it is not a very practical one. EEG caps are not comfortable to wear and are quite fragile (Lin et al., 2014). In the context of preventing car accidents, they are hard to impossible to implement during long drives. However, new research focuses on more wearable EEG systems (Lin et al., 2014).
Temperature. Fortunately, some research has been conducted into other predictors for
vigilance. Romeijn and Van Someren (2011) have conducted a vigilance experiment and found
skin temperature to be a possible predictor of vigilance. The fluctuations in skin temperature
can be described according to the Circadian Clock Mechanism or the Homeostatic Hourglass
(Te Lindert & Van Someren, 2018), implying skin temperature is related to the sleep-wake
cycle. A higher skin temperature results in feeling sleepier, having higher reaction times,
having more lapses in performance and therefore, being less vigilant (Romeijn & Van Someren,
2011).
Skin temperature can be measured at different locations. These can be divided into
distal and proximal locations. Distal locations fluctuate more in temperature whereas proximal
locations are more consistent. Romeijn and Van Someren (2011) found the gradient between a distal location and a proximal location to be a good predictor of vigilance. Mainly the finger-chest and finger-wrist gradients were strong predictors of vigilance. Also, the gradient
pinna-mastoid has been researched and seems to have a predictive relationship with reaction
times (Romeijn et al., 2012).
In short, skin temperature can be a reflection of the underlying processes of the
regulation of vigilance (Romeijn et al., 2012). Furthermore, skin temperature measurements seem to be applicable in everyday life (Romeijn & Van Someren, 2011). This means they could be useful predictors of vigilance, which is precisely the topic of this research.
Subjective Measures. Other predictors could be found in the ability of people to
evaluate their own level of vigilance. For instance, there is a strong relationship between the
subjective measures of the Karolinska Sleepiness Scale and reaction time (Schmidt et al.,
2009). This was investigated during a long drive, although the relationship was absent towards the end of the drive.
Degree of sleep deprivation is another possible measurement. The Pittsburgh Sleep
Quality Index (PSQI) assesses people's sleep quality by calculating a person's PSQI score (Buysse et al., 1989). Being sleep deprived results in longer reaction times (Blatter et al., 2006). When sleep quality is poor, a person is effectively sleep deprived.
Finally, Blatter et al. (2006) found that reaction times increased the longer participants were awake. The Morningness-Eveningness Questionnaire (MEQ) assesses whether a person is a
morning or evening person (Horne & Östberg, 1976). This could give an interaction effect with
the time of the experiment. A morning person in combination with an experiment in the
afternoon could result in longer reaction times because they are awake for a longer period. The
opposite could be true for an evening person, who usually wakes up later and, in combination with an afternoon experiment, would have shorter reaction times.
Demographic measures. Lastly, demographic data could provide useful predictors.
Blatter et al. (2006) found that gender and age can influence reaction times. They found women
to react more slowly, but also more accurately than men. Furthermore, they saw that age had
an interaction effect with the sleep condition of the participant. Sleep deprivation in
combination with younger age resulted in slower reaction times. Also, the interaction between age and time spent awake proved relevant in their study.
Research Questions
In this research the focus was on the different temperatures, PSQI scores, MEQ scores,
gender and age as potential features to predict drops in vigilance. Some of these predictors have
been previously researched, while other predictors were explored for the first time in this study.
This has led to the following research questions: 1) Which feature is the best predictor of (drops
in) vigilance according to its Gini Importance? 2) What is the hierarchy of these features in
predicting vigilance? Three different RF models with different feature combinations were
compared to answer these questions.
Expectations
Based on previous literature the best temperature related features should be the chest
temperature and the finger-chest gradient (Romeijn & Van Someren, 2011). The pinna-mastoid
gradient could also rank higher in the hierarchy (Romeijn et al., 2012). For the other predictors no expectations could be formulated in advance.
Methods
All procedures were approved by the ethical committee of Utrecht University and conducted according to the protocol of the Faculty of Social Sciences.
Participants.
In total 33 healthy participants participated in this study: 14 males and 19 females aged
19 - 59 years (M=26.15, SD=10.03). Participants were recruited within the social bubble of the
research leaders due to the COVID-19 pandemic. They joined on a voluntary basis and were
rewarded with student credit or eight euros per hour (total of 20 euros for 2.5 hours). None of
the participants had any known history of sleep-related disorders. Because their vigilance level
should not be artificially influenced, participants were not allowed to take caffeine, alcohol or
drugs after 22:00 the day prior to the experiment. Also, participants had to be non-smokers for
at least 6 months prior to the experiment. Participants had to have normal or corrected (glasses
or lenses) vision. They did not wear any hair products or make-up such that EEG electrodes
could be correctly placed.
Tasks.
BSRT. In this research the focus lies on the Brief Stimulus Reaction Time task. This is
a vigilance assessment task based on the task created by Romeijn and Van Someren (2011).
Compared to the original paper there were a few differences, because more lapses were needed
for the EEG signals. Participants have to focus on a fixation-cross (+ sign) and respond as fast
as possible when it changed into a fixation-cross with a shorter vertical line by pressing the
space bar with the index finger of their dominant hand. In the original paper participants had
to respond when the cross changed into a hyphen (- sign). The background is grey, and the
signs are black. The target presentation was only one frame (16.67ms), which is lower than the
original paper (25ms). Furthermore, this task had a staircase procedure. This resulted in the
change of difficulty of the task by altering the size of the vertical line of the fixation cross. The
most difficult condition of the task was a 10 pixels difference between the vertical line and the
normal vertical line of the target. The easiest condition was an 80 pixels difference. The task
started with a difficulty level of 50 pixels difference. When the participant responds incorrectly
twice, the following trial becomes easier. When the participant responds correctly, the next
8
trial will become more difficult. This is a 2-up-1-down procedure. The task comprised 144
stimuli with a 4 to 14 second interval and took around 20 minutes to complete.
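A minimal sketch of this staircase rule as described above; the bounds (10 and 80 pixels) and starting value (50 pixels) come from the text, while the step size of 10 pixels is an assumption for illustration only:

def next_difficulty(diff, correct, error_streak, step=10):
    # Correct response: the next trial becomes harder (smaller difference)
    if correct:
        return max(diff - step, 10), 0
    # Two incorrect responses in a row: the next trial becomes easier
    error_streak += 1
    if error_streak == 2:
        return min(diff + step, 80), 0
    return diff, error_streak

diff, streak = 50, 0
for response in [True, False, False, True]:
    diff, streak = next_difficulty(diff, response, streak)
    print(diff)  # 40, 40, 50, 40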
Methods and materials.
iButtons. Skin temperature was measured using iButtons (type DS1922L,
Maxim/Dallas, USA). The iButtons sample skin temperature with a .0625 ºC resolution at 2-
second intervals. The method has been described in detail and validated by Van Marken
Lichtenbelt et al. (2006).
iButtons were placed at three distal sites: finger, nose and pinna, and three proximal sites: chest, forehead and mastoid. From these data three distal-to-proximal gradients were calculated: finger minus chest (DPG_finger-chest), nose minus forehead (DPG_nose-forehead), and pinna minus mastoid (DPG_pinna-mastoid).
FLIR. Skin temperature was also measured using a FLIR camera (FLIR E53, FLIR Systems Inc., Wilsonville, USA). The camera samples skin temperature with a 240 x 180 pixel infrared resolution at a 33-millisecond interval. Thermal sensitivity is below .04 ºC and image frequency is 30 Hz. The camera was placed in front of the participant and focused on the participant's head. One distal-to-proximal gradient was calculated: nose minus forehead (FLIR DPG_nose-forehead).
Questionnaires. The PSQI and MEQ were both administered in Dutch, this being the native language of most participants. The PSQI was recoded according to the method described by Buysse et al. (1989): the score for every component of the PSQI is calculated and then summed into a total score. However, for some participants some components were missing, so the total score was calculated as the sum over all available components divided by the number of available components.
The MEQ was recoded according to the method described by Horne and Östberg (1976). This results in three different types: morning person (total score of 59 or higher), intermediate person (total score between 42 and 58) and evening person (total score of 41 or lower).
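A sketch of both recoding rules (function names are illustrative, not from the thesis scripts):

def meq_type(total):
    # Horne and Östberg (1976) cut-offs as used here
    if total >= 59:
        return 'morning'
    if total >= 42:
        return 'intermediate'
    return 'evening'

def psqi_score(components):
    # Prorated PSQI as described: the sum of the available component
    # scores divided by the number of available components
    available = [c for c in components if c is not None]
    return sum(available) / len(available)

print(meq_type(63))                             # 'morning'
print(psqi_score([1, 2, None, 0, 1, None, 2]))  # 1.2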
Procedure.
All experiments were conducted in the lab environment of Utrecht University. This
study is part of a bigger study. The procedure described was the same for all participants, but
not all collected data will be used in the current study.
On arrival at the lab, participants were asked if they kept to the guidelines and had no
symptoms of COVID-19. Then, participants had to fill in the MEQ, PSQI and a demographic
questionnaire. Next, participants were equipped with an EEG cap, face electrodes and iButtons.
During each session participants were seated before a 60 Hz screen in a dimly lit room at an
environmental temperature of 20.68 ºC (SD=0.56). Participants had to place their head in a
chin rest so that their movement was limited and the FLIR camera could be focused on their
head.
First participants were asked to focus on a fixation cross for five minutes with their eyes
open. Next, they had to close their eyes for 2.5 minutes. Then participants were asked how alert and awake they felt using a 100-point Likert scale. After that they were presented with the BSRT task
(Romeijn & Van Someren, 2011).
Data-Analysis.
All data were processed and analysed using Python 3.9 in PyCharm Professional 2021.1.1 (JetBrains, 2021). For the current study three different kinds of models were trained multiple times on 500 different bootstrapped samples: one kind with only temperature features, a second kind with only the subjective and demographic features, and a third kind with all the features. This was done so that the effect of the different categories of features could be analysed and evaluated, and so that the added value of combining features could be assessed.
Pre-processing. Data from 20 participants were excluded from the analysis due to
missing or flawed data files such as FLIR recordings, iButton data or questionnaires. The FLIR
data were extracted by using commercial software (Flirtools+, FLIR Systems Inc., Wilsonville,
U.S.A.). By hand, two ellipses were placed on the nose and forehead of the participant. These
had to be adjusted throughout the recordings, because of movement of the head. This procedure
was done for each participant and resulted in an output file with all the trials of one participant.
Next, all these data files were combined, and trials that contained missing values were removed
so the classification was only based on complete trials. Then, for all models, the data were split 80:20 into train and test data with random_state=0, a parameter that fixes the state of the random generator so that the split is the same every time (Pedregosa et al., 2011).
Feature extraction. The features were handpicked based on previous literature. The
datapoints that enter the random forest are individual trials. Each trial contains the
corresponding values of the features for a certain participant at a certain time point. The age,
gender, MEQ type and PSQI score of the participant were added to all the trials of that participant. The temperature features fluctuate between trials and participants. Because iButtons measure at 2-second intervals, the iButton temperatures were measured 4 or 5
seconds before the target was presented. This time interval was chosen because 4 seconds is
the shortest interval between trials. The FLIR temperatures were the average of the 30 frames
in the 4th second before the target was presented. The DPGs were calculated based on the
temperature features of the trials. The features used to train each model are described in Table
1. A full description of the feature extraction and data combination is described in Appendix
A.
Table 1.
Overview of which features are used in each model.

Feature                    Model 1   Model 2   Model 3
Age                                     X         X
Gender                                  X         X
MEQ                                     X         X
PSQI                                    X         X
Finger                        X                   X
Chest                         X                   X
Mastoid                       X                   X
Pinna                         X                   X
Nose                          X                   X
Forehead                      X                   X
DPG_finger-chest              X                   X
DPG_nose-forehead             X                   X
DPG_pinna-mastoid             X                   X
FLIR nose                     X                   X
FLIR forehead                 X                   X
FLIR DPG_nose-forehead        X                   X
Output. The output variable is whether the participant saw the target (a hit, coded 1) or did not see the target (a miss, coded 0). A hit means the participant was vigilant; a miss means the participant was less vigilant.
Data processing. Data from all participants were, as said before, combined in one file with all the trials of all the participants. The Random Forest (RF) was programmed using scikit-learn (Pedregosa et al., 2011) and according to the method described by Garreta and Moncecchi (2013). For each model 100 estimators were used. Again, a random state (random_state=9) was used to make sure the output is reproducible. Furthermore, default settings were used. The algorithm takes 100 bootstrapped samples from the training set, which are used to build the different decision trees. At each split a random subset of the input features is considered. Each tree classifies trials as hits or misses, and a majority vote then determines the final output. The procedure of the RF is summarised in Figure 1 (Fraiwan et al., 2012). A full description of the Random Forest models is given in Appendix B.
Figure 1.
The different steps of the classification process of a Random Forest. From Fraiwan, L., Lweesy,
K., Khasawneh, N., Wenz, H., & Dickhaus, H. (2012). Automated sleep stage identification
system based on time–frequency analysis of a single EEG channel and random forest classifier.
Computer Methods and Programs in Biomedicine, 108(1), 10–19.
https://doi.org/10.1016/j.cmpb.2011.11.005. Copyright 2021 by Elsevier Inc.
Hyperparameter tuning of the RF was done for the number of trees and their depth. This required a validation set, so the training set was again split into a train and validation set with a 70:30 split. In total there was thus a 56:24:20 split for training, validation and testing. The hyperparameters were varied over a range of values on the training data and evaluated in terms of accuracy on the validation data. The best values of the
hyperparameters over the validation data were 100 trees per RF and a maximum individual tree
depth of five levels. A full description of the optimisation of the hyperparameters is described
in Appendix C. Further explorative analyses have been conducted and are described in
Appendix F.
Model evaluation. To evaluate the trained models, a benchmark is needed. In this study
the trained models will be compared to a so-called naïve model which always guesses the
majority class of the data. This is a model that classifies all trials as hits (which is the dominant
outcome across trials). The models trained in this study therefore have to be more accurate than
this naïve model. The importance of each feature will be evaluated according to its Gini
Importance. The second benchmark is the Gini importance each feature would have if all features were equally important, that is, one divided by the number of features. If a feature scores higher than this benchmark it is relevant in the prediction of vigilance; if it scores below the benchmark, it does not contribute much to the prediction of vigilance.
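Both benchmarks reduce to a single line each; a sketch with illustrative numbers:

import numpy as np

y_test = np.array([1, 1, 1, 0, 1, 0, 1])  # illustrative hit/miss labels

# Naive model: always predict the majority class (here: hits)
naive_accuracy = max(y_test.mean(), 1 - y_test.mean())
print(naive_accuracy)  # 0.714...

# Equal-importance benchmark: with n features that are all equally
# informative, each Gini importance would be 1 / n
print(1 / 12)  # .0833, the benchmark for a model with 12 features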
Statistical Significance. With the previously described procedure, the models only produce output for one specific 80:20 split, so nothing can be said about statistical significance. That is why a bootstrapping procedure was performed. After the
hyperparameters were optimized, 500 samples of 1716 observations were bootstrapped from
the original data (with replacement). For each bootstrap sample an 80:20 split was performed.
Then the three different models were built, and their accuracy scores were calculated and saved.
Also, the Gini importance for each feature was calculated in each bootstrap sample and stored.
Finally, for each sample a naïve model and its accuracy were computed and saved. The
accuracies of each model were compared with the accuracies of the naïve model to test whether
the model performed significantly better than the naïve model. This was also done for the Gini
Importance of each feature to see whether it performed significantly better than the benchmark.
With this bootstrapping procedure a 95% Confidence Interval was created for the accuracy
differences between each model and the benchmark model. From this interval a statement can
be made on the statistical significance between the models in terms of accuracy performance
and feature relevance. The full procedure and Python script are described in Appendix D.
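The confidence intervals themselves can be read directly off the bootstrap distribution of differences; a minimal sketch with simulated stand-in values (the real script in Appendix D stores the actual per-sample differences):

import numpy as np

rng = np.random.default_rng(9)
# Stand-in for the 500 per-sample differences (model minus naive accuracy)
diffs = rng.normal(loc=0.01, scale=0.01, size=500)

# One-sided 95% CI: the model beats the naive model if the 5th
# percentile of the differences lies above zero
print(np.percentile(diffs, 5) > 0)

# Two-sided 95% CI (used for the Gini importances): drop the lowest
# and highest 2.5% of the bootstrap values
print(np.percentile(diffs, [2.5, 97.5]))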
Results
The training scores of the models before finetuning showed signs of overfitting.
Therefore, the models were finetuned. Below, only the results for these finetuned models are
presented. The finetuned models were also used to assess statistical significance using the bootstrapping procedure.
Accuracy scores. The accuracies of the first, second and third model were compared
with the accuracies of the naïve model. These differences were evaluated with a one-sided 95%
confidence interval to see whether the first, second or third model performed better than the
naïve model. The differences in accuracies for each bootstrapped sample are displayed in Figures 2, 3 and 4. In these figures the accuracy score of the naïve model was subtracted from that of the first, second or third model, thus creating a difference. To determine whether a model performs significantly better than the naïve model, it must be determined whether zero lies inside the one-sided confidence interval: the lower bound of the interval must be higher than zero.
Figure 2.
Accuracy Differences Between the First Model with Temperature Features and the
Benchmark Model
Note. The dashed line is the one-sided confidence interval
The first model performs better than the benchmark with 95% certainty. This is because a difference of zero or lower was not present in the confidence interval (CI = [.0029, 1.00]).
Figure 3.
Accuracy Differences Between the Second Model with Subjective and Demographic Features
and the Benchmark Model
Note. The dashed line is the one-sided confidence interval
The second model does not perform better than the benchmark with 95% certainty. This is because a difference of zero or lower was present in the confidence interval (CI = [-.020, 1.00]).
Figure 4.
Accuracy Differences Between the Third Model with Temperature, Subjective and
Demographic Features and the Benchmark Model
Note. The dashed line is the one-sided confidence interval
The third model does perform better than the benchmark with 95% certainty. This is because a difference of zero or lower was not present in the confidence interval (CI = [.0029, 1.00]).
Figure 5.
Accuracy Differences Between the Third Model with Temperature, Subjective and
Demographic Features and the First model with Temperature Features
Note. The dashed line is the one-sided confidence interval
The accuracies of the third model were compared with the accuracies of the first model.
These differences were evaluated with a one-sided 95% confidence interval to see whether the
third model performed better than the first model. The third model does not perform better than the first model with 95% certainty. This is because a zero or lower difference was present in the confidence interval (CI = [-.0087, 1.00]).
Gini Importance. The Gini importances of the features of the first, second and third model were compared with the second benchmark (equal importance for all features). These were evaluated with a two-sided 95% confidence interval to see whether the features performed better or worse than the benchmark. Figures 6, 7 and 8 display the differences between the Gini importance of a feature and the benchmark. The 2.5% lowest and highest values were removed, thus creating a 95% confidence interval. To test whether a feature performs better or worse than the benchmark, it must be determined whether zero is included in the interval or not.
Figure 6.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
1 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 6 displays the Gini hierarchy of the first model. There were 12 features in this
model, so the benchmark was set at .0833. The FLIR-measured features ranked high, but only the forehead temperature performed better than the benchmark with 95% certainty. Of the iButton-measured temperatures, the chest, mastoid and forehead temperatures performed worse than the benchmark with 95% certainty. The other temperatures contain 0.0 in their interval and therefore cannot be said to perform better or worse.
Figure 7.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
2 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 7 displays the Gini hierarchy of the second model. There were 4 features in this
model, so the benchmark was set at .250. The participants' MEQ type performed better than the benchmark with 95% certainty. The participants' PSQI score and gender performed worse than the benchmark with 95% certainty. The participants' age included a difference of 0.0 in its interval, so nothing can be said with certainty about its relative performance.
Figure 8.
The Confidence Intervals of the Differences Between Gini Importance of the Features of Model
3 and the Second Benchmark
Note. The dashed line marks the 0.0 difference between the GI of the feature and the second
benchmark.
Figure 8 displays the Gini hierarchy of the third model. There were 16 features in this model, so the benchmark was set at .0625. The FLIR-measured features performed better than the benchmark with 95% certainty. Of the subjective and demographic features, the participants' PSQI score, gender and age performed worse than the benchmark.
The iButton-measured temperatures and the participants' MEQ type contained 0.0 in their interval and therefore cannot be said to perform better or worse.
Discussion
The current study explored the importance of different predictors of (drops in) vigilance
with machine learning. The algorithm random forest (RF) was used due to its usefulness in
feature selection (Menze et al., 2009). Three different kinds of models were trained once and finetuned. Then 500 bootstrapped samples were created, and the models were trained again on these samples with the optimised hyperparameters. To evaluate these models, a naïve model was created that classifies all trials as hits (1.0), and the accuracy scores of the models were compared with the accuracy scores of the naïve model. Before finetuning, the models strongly overfitted the training data and were therefore not useful for comparison. With the finetuned hyperparameters they performed slightly better than or the same as the naïve model. Below, results
are discussed for each model separately.
The first model used only temperature related features. It had an accuracy of 100% for
the train data and 73.8% for the test data. The high accuracy for the train data shows that the
model overfitted the data. Therefore, the parameters of the algorithm had to be finetuned. After finetuning, the model performed better than the naïve model with 95% certainty.
The highest-ranking features are now the forehead temperature measured with the FLIR
camera, the nose temperature measured with an iButton, and the nose temperature measured
with the FLIR camera.
Because the first model scores slightly better than the naïve model, conclusions can be
drawn from its Gini hierarchy. This model indicates that the temperature of the forehead
measured with the FLIR camera can be seen as the best predictor of (drops in) vigilance for
this task because it performs significantly better than the benchmark. The other features that
were measured or calculated with the FLIR camera, ranked relatively high but statistically
nothing can be said about their relative performance to the benchmark. The temperatures of the
chest, mastoid and forehead measured with the iButtons performed significantly worse than
the benchmark and therefore do not predict (drops in) vigilance.
The second model used the subjective and demographic features. The model did not
show any overfitting but was still finetuned the same way as the first and third model. The participants' MEQ type performed significantly better than the benchmark. The participants' PSQI score and gender performed significantly worse than the benchmark. Statistically nothing can be said about the age of the participant. The second model did not perform significantly
better than the naïve model. This implies that the subjective measurements and demographic
data on their own have no contribution to the prediction of vigilance for this task.
The third model used the temperature related features and the subjective and
demographic features. The accuracy of this model was 100% for the train data and 73.8% for
the test data. This model also has a high accuracy score for the train data. Therefore, the model
overfitted the data. The finetuning of this model was the same as for the first and second model.
After finetuning, the model performed better than the naïve model with 95% certainty. The highest-ranking features were measured with the FLIR camera: the forehead temperature, the nose temperature and the FLIR DPG_nose-forehead. The worst-ranking features were again the participants' age, PSQI score and gender.
The implications of this model are mostly consistent with the implications of the first
model. Again, this model performs slightly better than the naïve model. This means that
conclusions can be drawn from the Gini hierarchy of this model. The FLIR-measured features performed significantly better than the benchmark. The participants' PSQI score, gender and age performed significantly worse than the benchmark. Even though the nose temperature of
the iButton ranked high, statistically nothing can be said about its relative performance to the
benchmark. The same goes for the other features measured with the iButtons. Also, the third
model was compared with the first model. The third model did not perform significantly better
than the first model. So, adding the demographic and subjective features does not contribute
significantly to the accuracy of the classification. In short, FLIR measured features are worth
further exploration while age, gender and PSQI score do not show any potential for this task.
However, some placements of features in the hierarchy are inconsistent with previous
literature. Romeijn and Van Someren (2011) found the chest temperature to be a better predictor than the DPG_finger-chest. However, in this study the chest temperature does not score high in the hierarchies; it even scores significantly worse than the benchmark. In addition, the chest temperature does not rank higher than the DPG_finger-chest. Nevertheless, compared to the other iButton-measured features, this DPG ranks relatively high. This is consistent with the findings of Romeijn and Van Someren (2011), which showed that this DPG is a strong predictor of vigilance. However, its confidence interval still includes zero, so nothing can be said statistically about its performance relative to the benchmark.
Even though the first and third model perform slightly better than the naïve model, it is
debatable how relevant these models are. The reason for performing just slightly better could
be found in the difference between participants. There can be different fluctuations in skin
temperature between participants. Romeijn and Van Someren (2011) already noted that these
fluctuations call for sustained recordings and mapping temperatures to the individuals’ normal
range of temperature. These individual differences make it hard for the RF to make the right
split within the temperature features for classifying the trials.
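One way to implement such a mapping, offered here only as a sketch and not part of the current pipeline, is to z-score each temperature feature within each participant before training, so the model sees deviations from a person's own normal range (the participant_id column is hypothetical):

import pandas as pd

def normalise_within_participant(df, temp_cols, id_col='participant_id'):
    out = df.copy()
    grouped = out.groupby(id_col)[temp_cols]
    # Standardise each temperature column within each participant
    out[temp_cols] = (out[temp_cols] - grouped.transform('mean')) / grouped.transform('std')
    return out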
Furthermore, this study added a staircase procedure to the BSRT task created by
Romeijn and Van Someren (2011). The staircase procedure made sure that the number of hits for each participant was consistent. However, it could have introduced a confound:
when a participant became less vigilant, instead of producing more misses the task became
easier resulting in more hits again. On the other hand, when a participant was vigilant a miss
could be the result of the staircase making the task too difficult. Another consequence of the
staircase is visible in model 2. Because the number of hits was the same for each participant,
there could be no discrimination between the participants specific features like PSQI score, age
and gender.
Furthermore, the finetuned hyperparameters were fixed for all 500 samples. To be conclusive, these parameters should have been optimised for each bootstrapped sample separately to get optimal performance. However, due to a lack of time this optimisation was only done for one model, and those values were used as the hyperparameters for all samples. Further research could add this finetuning to the code described in Appendix D.
Lastly, due to the many different experiment leaders, some things went wrong during the data collection. Because there was not much time for the leaders to get used to the protocol, some files were not (correctly) saved. Also, the processing of the FLIR data took a long time, so additional participants had to be excluded from the study because their FLIR data were unavailable. This limited the amount of data that could be used for analysis and therefore reduced statistical power.
Due to these limitations, further research should explore the different features in the BSRT without the staircase. Also, other tasks like the CTET (O'Connell et al., 2009) could be useful for analysing the importance of different predictors. This task has proven suitable for predicting vigilance using EEG signals, so exploring it with temperature measurements could be useful. This task also has a lower hit rate than most vigilance assessment
tasks (O'Connell et al., 2009) which creates a more evenly distributed division between hits
and misses.
Also, more has to be known about the difference between individuals in fluctuations of
skin temperature. As mentioned before, participants can differ in their fluctuations in skin
temperature but also in their overall mean temperature. By training classification models for
individuals separately instead of building a general classification model for all individuals,
more can be known about the predictiveness of different temperature features and their
differences between participants.
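A sketch of that idea, training one forest per participant with the hyperparameters used in this study (again assuming a hypothetical participant_id column and enough trials per participant for a split):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def per_participant_models(df, feature_cols, id_col='participant_id'):
    models = {}
    for pid, trials in df.groupby(id_col):
        X = trials[feature_cols].values
        Y = trials['results'].values
        X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, train_size=0.8, random_state=0)
        rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                    random_state=9)
        rf.fit(X_train, Y_train)
        models[pid] = (rf, rf.score(X_test, Y_test))
    return models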
Further research should also look at the kind of predictive relationship between the
features and output. While a RF is very useful for feature selection, it cannot say much about
this relationship. Different machine learning algorithms can be used to determine the specific
relationship and classical statistics should confirm these relationships.
In conclusion, the research questions of the current study can be answered. The best
predictor of vigilance is most likely the temperature of the forehead measured with the FLIR
camera. The hierarchies of the first and third finetuned models are the best indicators of the
usefulness of predictors. However, these models' accuracies do not differ much from the naïve model, meaning that all included features in aggregate have little predictive power for drops in vigilance. The conclusions drawn here therefore have to be confirmed with further research.
References
Blatter, K., Graw, P., Münch, M., Knoblauch, V., Wirz-Justice, A., & Cajochen, C. (2006).
Gender and age differences in psychomotor vigilance performance under differential
sleep pressure conditions. Behavioural Brain Research, 168(2), 312–317.
https://doi.org/10.1016/j.bbr.2005.11.018
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/a:1010933404324
Buysse, D. J., Reynolds, C. F., Monk, T. H., Berman, S. R., & Kupfer, D. J. (1989). The
Pittsburgh sleep quality index: A new instrument for psychiatric practice and research.
Psychiatry Research, 28(2), 193–213. https://doi.org/10.1016/0165-1781(89)90047-4
Bzdok, D., Altman, N., & Krzywinski, M. (2018). Statistics versus machine learning. Nature
Methods, 15(4), 233–234. https://doi.org/10.1038/nmeth.4642
Centraal Bureau voor de Statistiek. (2020, October 15). Hoeveel mensen komen om in het
verkeer? https://www.cbs.nl/nl-nl/visualisaties/verkeer-en-vervoer/verkeer/hoeveel-
mensen-komen-om-in-het-verkeer-
Chang, S., Cohen, T., & Ostdiek, B. (2018). What is the machine learning? Physical Review
D, 97(5), 056009-1 – 056009-6. https://doi.org/10.1103/PhysRevD.97.056009
De Raedt, R., & Ponjaert-Kristoffersen, I. (2001). Predicting at-fault car accidents of older
drivers. Accident Analysis & Prevention, 33(6), 809–819.
https://doi.org/10.1016/S0001-4575(00)00095-6
Fraiwan, L., Lweesy, K., Khasawneh, N., Wenz, H., & Dickhaus, H. (2012). Automated sleep
stage identification system based on time–frequency analysis of a single EEG channel
and random forest classifier. Computer Methods and Programs in Biomedicine,
108(1), 10–19. https://doi.org/10.1016/j.cmpb.2011.11.005.
Garreta, R., & Moncecchi, G. (2013). Learning scikit-learn: Machine learning in Python.
Packt Publishing.
Horne, J. A., & Östberg, O. (1976). A self-assessment questionnaire to determine
morningness-eveningness in human circadian rhythms. International Journal of
Chronobiology, 4, 97–110.
Jin, C. Y., Borst, J. P., & Vugt, M. K. (2020). Distinguishing vigilance decrement and low
task demands from mind-wandering: A machine learning analysis of EEG. European
Journal of Neuroscience, 52(9), 4147–4164. https://doi.org/10.1111/ejn.14863
Lin, C.-T., Chuang, C.-H., Huang, C.-S., Tsai, S.-F., Lu, S.-W., Chen, Y.-H., & Ko, L.-W.
(2014). Wireless and Wearable EEG System for Evaluating Driver Vigilance. IEEE
Transactions on Biomedical Circuits and Systems, 8(2), 165–176.
https://doi.org/10.1109/TBCAS.2014.2316224
Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., &
Hamprecht, F. A. (2009). A comparison of random forest and its Gini importance
with standard chemometric methods for the feature selection and classification of
spectral data. BMC Bioinformatics, 10(1), 213. https://doi.org/10.1186/1471-2105-10-
213
O’Connell, R. G., Dockree, P. M., Robertson, I. H., Bellgrove, M. A., Foxe, J. J., & Kelly, S.
P. (2009). Uncovering the Neural Signature of Lapsing Attention:
Electrophysiological Signals Predict Errors up to 20 s before They Occur. Journal of
Neuroscience, 29(26), 8604–8611. https://doi.org/10.1523/JNEUROSCI.5967-
08.2009
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., & Cournapeau,
D. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research 12, 2825-2830.
PyCharm: the Python IDE for Professional Developers. (2020, August 25). JetBrains.
https://www.jetbrains.com/pycharm/
Romeijn, N., & Van Someren, E. J. W. (2011). Correlated Fluctuations of Daytime Skin
Temperature and Vigilance. Journal of Biological Rhythms, 26(1), 68–77.
https://doi.org/10.1177/0748730410391894
Romeijn, N., Verweij, I. M., Koeleman, A., Mooij, A., Steimke, R., & Virkkala, J. (2012).
Cold hands, warm feet: Sleep deprivation disrupts thermoregulation and its
association with vigilance. Sleep, 35(12), 1673–1680.
Schmidt, E. A., Schrauf, M., Simon, M., Fritzsche, M., Buchner, A., & Kincses, W. E.
(2009). Drivers’ misjudgement of vigilance state during prolonged monotonous
daytime driving. Accident Analysis & Prevention, 41(5), 1087–1093.
https://doi.org/10.1016/j.aap.2009.06.007
Shanker, A. (2019). Bioinformatics: Sequences, Structures, Phylogeny (Softcover reprint of
the original 1st ed. 2018 ed.). Springer.
Shi, L.-C., & Lu, B.-L. (2013). EEG-based vigilance estimation using extreme learning
machines. Neurocomputing, 102, 135–143.
https://doi.org/10.1016/j.neucom.2012.02.041
Te Lindert, B. H. W., & Van Someren, E. J. W. (2018). Skin temperature, sleep, and
vigilance. In Handbook of Clinical Neurology (Vol. 156, pp. 353–365). Elsevier.
https://doi.org/10.1016/B978-0-444-63912-7.00021-7
Van Schaik, A.S. (2021). Predicting drops in vigilance based on beta activity (Unpublished
Bachelor’s Thesis). Utrecht University.
Züger, M., Müller, S. C., Meyer, A. N., & Fritz, T. (2018). Sensing Interruptibility in the
Office: A Field Study on the Use of Biometric and Computer Interaction Sensors.
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems,
1–14. https://doi.org/10.1145/3173574.3174165
Appendix A
The script is called data_combination_BSRT.py. Here the main parts of the script will
be explained. The full script is available on https://github.com/rooslucas/Bachelor-Thesis.
There is also a script available to combine all participant files together. This is called
combine_all_files.py. An adaptation to combine the data for the CTET is also available on
GitHub and is called data_combination_CTET.py.
The script loops through all participants that are available. The participant ID is saved
and used further in the script. A new data frame is created with the triggers and times from
the trigger logger.
Next the practice trials are removed by scanning for the start trigger 21.
The trigger logger stores times as seconds passed since 1970-01-01 2:00 AM. These are converted to year-month-day hours:minutes:seconds:milliseconds.
A new column is added to the data frame to display the end time of a trial. The
locations of different triggers are saved. The script then loops through the locations with the
response triggers (8 and 0) and makes each row one trial. Then the other triggers are
removed, and the index of the data frame is reset.
from datetime import datetime, timedelta
import pandas as pd

for participant in trigger_list:
    print(f"\n Participant {i}/{len(trigger_list)}")
    # Save the participant id
    participant_id = participant.split('_')[0]
    print(f"Processing data from participant {participant_id}")
    trigger_path = trigger_folder + '/' + participant
    triggers = pd.read_csv(trigger_path)
    # Create dataframe
    data_frame = {'trigger': triggers['Trigger'],
                  'start_time': triggers['TriggerTime']}
    data_file = pd.DataFrame(data_frame)
    start, = data_file.index[data_file['trigger'] == 20.0]
    data_file.drop(range(1, start), inplace=True)
    data_file.dropna(inplace=True)
    # Refactor the trigger times
    original_time = datetime(year=1970, month=1, day=1, hour=2)
    for time in data_file['start_time']:
        new_time = original_time + timedelta(seconds=time)
        data_file['start_time'].replace(time, new_time, inplace=True)
Next, a column with the results is added. For each trial in the data frame the script
checks if it was a hit (2.0-8.0) or a miss (2.0-0.0). Rows with missing triggers get value 99
and are then removed. Again, the index is reset.
Now the FLIR data is added to the data frame. First the script checks if the FLIR file exists. Then the FLIR date, time and millisecond columns are combined into one timestamp.
# Add new column with end time of each trial
data_file['end_time'] = ''
# Get locations of triggers 8, 0 and 1
locations = data_file.index[(data_file['trigger'] == 8.0) |
                            (data_file['trigger'] == 0.0)]
locations_2 = data_file.index[data_file['trigger'] == 1.0]
data_file['trigger'] = data_file['trigger'].astype(str)
# Make each row one trial from target to response
for location in locations:
    data_file.at[location - 1, 'end_time'] = \
        data_file.at[location, 'start_time']
    data_file.at[location - 1, 'trigger'] = \
        str(data_file.at[location - 1, 'trigger']) + '-' + \
        str(data_file.at[location, 'trigger'])
    data_file.drop(index=location, inplace=True)
# Drop the other triggers
for location in locations_2:
    data_file.drop(index=location, inplace=True)
# Reset index of dataframe
data_file.reset_index(drop=True, inplace=True)

# Get responses from triggers
data_file['results'] = ''
for trigger in data_file['trigger']:
    loc = data_file.index[data_file['trigger'] == trigger]
    if trigger == '2.0-8.0':
        data_file.loc[loc, 'results'] = 1.0
    elif trigger == '2.0-0.0':
        data_file.loc[loc, 'results'] = 0.0
    else:
        data_file.loc[loc, 'results'] = 99  # missing trigger
missing_results = data_file.index[data_file['results'] == 99]
for error in missing_results:
    data_file.drop(error, inplace=True)
data_file.dropna(inplace=True)
data_file.reset_index(drop=True, inplace=True)
Then the script loops through the trials in the data frame. It also loops through the
times in the FLIR data and looks if the number of seconds of the FLIR time matches the trial
time four seconds prior. If so the average of the 30 frames in that second is calculated for the
nose and forehead. To speed up the process, rows that have already been scanned in the FLIR data are skipped, and after the right second has been found the loop breaks. Afterwards missing data are dropped.
if flir_data is not None:
    # Reset times of flir data
    for point in range(len(flir_data['Time'])):
        ttime = str(flir_data.at[point, 'Time'])
        good_time = pd.to_datetime(ttime) + timedelta(
            milliseconds=int(flir_data.at[point, 'Milliseconds']))
        good_time = datetime.combine(
            pd.to_datetime(flir_data.at[point, 'Date']), good_time.time())
        flir_data.at[point, 'good_time'] = good_time

    # Loop through trials
    for time in data_file['start_time']:
        matches_el1 = []
        matches_el2 = []
        row, = data_file.index[(data_file['start_time'] == time)]
        # Add values from the nearest time samples
        for time_f in range(start, len(flir_data['good_time'])):
            ttime = flir_data.iloc[time_f]['good_time']
            if int((time - datetime(1970, 1, 1)).total_seconds()) - 4 == \
                    int((ttime - datetime(1970, 1, 1)).total_seconds()):
                matches_el1.append(flir_data.iloc[time_f]['El1.Average'])
                matches_el2.append(flir_data.iloc[time_f]['El2.Average'])
                new_time = time_f  # Define new start point
            elif int((time - datetime(1970, 1, 1)).total_seconds()) - 3 == \
                    int((ttime - datetime(1970, 1, 1)).total_seconds()):
                start = new_time
                break
        # Calculate average values over one second
        if len(matches_el1) != 0:
            average_el1 = sum(matches_el1) / len(matches_el1)
            average_el2 = sum(matches_el2) / len(matches_el2)
            # Add values to the dataframe
            data_file.at[row, 'FLIR_forehead'] = average_el1
            data_file.at[row, 'FLIR_nose'] = average_el2
        print(f'{trial} done!')
        trial += 1
After the FLIR data is processed the start times drop their milliseconds so they can be
matched to the iButton times.
Now the iButton data is processed. Each participant has a folder containing the different iButton files. The script skips the file with the room temperature, then loops
through the trials and matches the start time with the iButton time 4 or 5 seconds before. The
temperature at that timepoint enters the data frame for the trial. Afterwards missing values are
again removed.
# Prep times for ibutton matches
for time in data_file['start_time']:
    new_time = time - timedelta(microseconds=time.microsecond)
    data_file['start_time'].replace(time, new_time, inplace=True)

# Add each ibutton to the dataframe
for ibutton in ibuttons_list:
    path = ibutton_folder + '/' + ibutton
    temp = pd.read_csv(path, skiprows=18)
    ibutton_name, rest = ibutton.split('_')
    print(f"Processing data from {ibutton_name}")
    if ibutton_name != "FF00000045298741":  # skip room-temperature iButton
        data_file[ibutton_name] = 99.9
        for time in temp['Date/Time']:
            good_time = pd.to_datetime(time)
            location_temp, = temp.index[(temp['Date/Time'] == time)]
            for trigger_time in data_file['start_time']:
                if good_time == trigger_time - timedelta(seconds=4):
                    location = data_file.index[
                        (data_file['start_time'] == trigger_time)]
                    data_file.loc[location, ibutton_name] = \
                        temp.iloc[location_temp]['Value']
                elif good_time == trigger_time - timedelta(seconds=5):
                    location = data_file.index[
                        (data_file['start_time'] == trigger_time)]
                    data_file.loc[location, ibutton_name] = \
                        temp.iloc[location_temp]['Value']
Then the different iButton DPGs are calculated. If there is FLIR data, the DPG based on the FLIR temperatures is also calculated. Finally, the data from the questionnaires are added: the script searches for the participant ID and enters the corresponding values into the data frame. Then the file is saved as a .csv with the participant ID in the name.
# Calculate the DPGs
print("Calculate DPG_finger-chest")
data_file['DPG_finger-chest'] = \
    data_file['4B0000004516B141'] - data_file['9A00000045146841']
print("Calculate DPG_nose-forehead")
data_file['DPG_nose-forehead'] = \
    data_file['CB000000452D7441'] - data_file['F9000000452CCF41']
print("Calculate DPG_pinna-mastoid")
data_file['DPG_pinna-mastoid'] = \
    data_file['76000000452C9741'] - data_file['7200000045201D41']
if flir_data is not None:
    print("Calculate FLIR_DPG_nose-forehead")
    data_file['FLIR_DPG_nose-forehead'] = \
        data_file['FLIR_nose'] - data_file['FLIR_forehead']

# Add data from questionnaires
print("Adding data from the questionnaires")
questionnaire_folder = directory + '/Questionnaires/questionnaire_data.csv'
questionnaire_file = pd.read_csv(questionnaire_folder)
row_number = questionnaire_file.index[
    (questionnaire_file['PPID'] == participant_id)]
if len(row_number) != 0:
    data_file['Gender'] = questionnaire_file.at[row_number[0], 'Gender']
    data_file['Age'] = questionnaire_file.at[row_number[0], 'Age']
    data_file['MEQ_type'] = questionnaire_file.at[row_number[0], 'type']
    data_file['PSQI'] = questionnaire_file.at[
        row_number[0], 'total_score_PSQI']

# Save file to a csv
data_file.to_csv(r'/Users/roos/Data/final_trials/trials' + participant_id +
                 '.csv', index=False, header=True)
Appendix B
For this study multiple random forest models were trained. All Jupyter notebooks are available on https://github.com/rooslucas/Bachelor-Thesis. These are the non-optimised models; to optimise the models, the parameters can be adjusted. The method is based on Garreta and Moncecchi (2013) and is the same for all models. Here the main parts of the method will be described for model 3. Models 1 and 2 use the same method and can, as said before, be found on the author's GitHub, as can the models for the CTET. For each model certain features are selected from the data frame.
MEQ type and gender are categorical variables and therefore have to be encoded: the labels are transformed into integers so the model can interpret them.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data_file_path = '/Users/roos/Data/all_trials_noNaN2.csv'
data_file = pd.read_csv(data_file_path)
data_3 = data_file[['Age', 'Gender', 'PSQI', 'MEQ_type',
                    '9A00000045146841', 'F9000000452CCF41',
                    '76000000452C9741', '7200000045201D41',
                    '4B0000004516B141', 'CB000000452D7441',
                    'DPG_finger-chest', 'DPG_nose-forehead',
                    'DPG_pinna-mastoid', 'results', 'FLIR_forehead',
                    'FLIR_nose', 'FLIR_DPG_nose-forehead']]

# Encode categorical variables
# Gender
encoder = LabelEncoder()
label_encoder_gender = encoder.fit(data_3['Gender'])
print("Gender classes:", label_encoder_gender.classes_)
integer_classes_gender = \
    label_encoder_gender.transform(label_encoder_gender.classes_)
print("Gender integer classes:", integer_classes_gender)
code = label_encoder_gender.transform(data_3['Gender'])
data_3['Gender'] = code

# MEQ_type
label_encoder_MEQ = encoder.fit(data_3['MEQ_type'])
print("MEQ classes:", label_encoder_MEQ.classes_)
integer_classes_MEQ = \
    label_encoder_MEQ.transform(label_encoder_MEQ.classes_)
print("MEQ integer classes:", integer_classes_MEQ)
code_MEQ = label_encoder_MEQ.transform(data_3['MEQ_type'])
data_3['MEQ_type'] = code_MEQ
A single decision tree is trained and evaluated. This tree is visualised and can be
found in appendix D.
Next the forest is created, and its accuracy is displayed.
For each feature its Gini importance is calculated, and the hierarchy is displayed in a
table.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import graphviz

dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, Y_train)

dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=data_3.drop(
                                    'results', axis=1).columns,
                                class_names=['0.0', '1.0'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('model3.gv', view=True)
graph
# Building a forest
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100, random_state=9)
random_forest.fit(X_train, Y_train)
# Accuracy on the training set
print("Training Accuracy is: ", random_forest.score(X_train, Y_train))
# Accuracy on the test set
print("Testing Accuracy is: ", random_forest.score(X_test, Y_test))
fi2 = ''
final2 = ''
for i, column in enumerate(data_3.drop('results', axis=1)):
    print('Importance of feature {}: {:.3f}'.format(
        column, random_forest.feature_importances_[i]))
    fi2 = pd.DataFrame({'Variable': [column],
                        'Feature Importance Score':
                            [random_forest.feature_importances_[i]]})
    try:
        final2 = pd.concat([final2, fi2], ignore_index=True)
    except:
        final2 = fi2  # first iteration: final2 is not yet a DataFrame

# Ordering the data
final_fi2 = final2.sort_values('Feature Importance Score',
                               ascending=False).reset_index()
final_fi2
Appendix C
This script describes the optimisation process of model 1. The full Jupyter notebook is available on https://github.com/rooslucas/Bachelor-Thesis. The main parts of the script will be described here. The same optimised parameter values were afterwards used for model 2 and model 3.
The data was split into a train set and a test set with an 80:20 split. Then the train set was split into a train set and a validation set with a 70:30 split.
To determine the optimal number of estimators, different numbers between 100 and 500 were tried. The accuracies of the train and validation sets for each number were displayed in a graph.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Splitting the data
X = data.drop('results', axis=1).values
Y = data['results'].values
print('X shape: {}'.format(np.shape(X)))
print('Y shape: {}'.format(np.shape(Y)))
X_train1, X_test, Y_train1, Y_test = train_test_split(
    X, Y, train_size=0.8, test_size=0.2, random_state=0)
X_train, X_validate, Y_train, Y_validate = train_test_split(
    X_train1, Y_train1, train_size=0.7, test_size=0.3, random_state=0)
# Trying different numbers of trees
train_acc_tree = []
val_acc_tree = []
trees = [100, 150, 200, 250, 300, 350, 400, 450, 500]
for num_trees in trees:
    print("Number of trees:", num_trees)
    random_forest = RandomForestClassifier(n_estimators=num_trees, random_state=30)
    random_forest.fit(X_train, Y_train)
    # Accuracy on train
    train_acc_tree.append(random_forest.score(X_train, Y_train))
    # Accuracy on validation
    val_acc_tree.append(random_forest.score(X_validate, Y_validate))
plt.plot(trees, train_acc_tree, c="magenta")
plt.plot(trees, val_acc_tree, c="aqua")
plt.show()
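In such a graph the training accuracy typically stays close to 1 for every number of trees, while the validation accuracy levels off; the smallest number of trees at which the validation curve plateaus is then a reasonable choice.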
To determine the optimal depth, different numbers between 1 and 10 were tried. The accuracies of the train and validation sets for each number were displayed in a graph.
After the parameters were chosen, all sets were evaluated with the optimal settings.
# Trying different depths
train_acc_depth = []
val_acc_depth = []
depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for max_depth in depth:
    print("Depth:", max_depth)
    random_forest = RandomForestClassifier(n_estimators=100, random_state=30, max_depth=max_depth)
    random_forest.fit(X_train, Y_train)
    # Accuracy on train
    train_acc_depth.append(random_forest.score(X_train, Y_train))
    # Accuracy on validation
    val_acc_depth.append(random_forest.score(X_validate, Y_validate))
plt.plot(depth, train_acc_depth, c="magenta")
plt.plot(depth, val_acc_depth, c="aqua")
plt.show()
# Evaluation with the chosen settings
random_forest = RandomForestClassifier(n_estimators=100, random_state=30, max_depth=5)
random_forest.fit(X_train, Y_train)

# Accuracy on train
print("Training Accuracy is: ", random_forest.score(X_train, Y_train))

# Accuracy on validation
print("Validation Accuracy is: ", random_forest.score(X_validate, Y_validate))

# Proportion of positive cases in the test set (the naive benchmark)
print(Y_test.sum()/(len(Y_test)))

# Accuracy on test
print("Testing Accuracy is: ", random_forest.score(X_test, Y_test))
Appendix D
For this study 500 bootstrapped samples were made. The notebook is available on https://github.com/rooslucas/Bachelor-Thesis and is called vigilance_RF_bootstrapped.ipynb. The hyperparameters are fixed but could be optimized for each model individually by adding this step inside the for loop (a sketch of this extension is given after the loop code below). The main parts will be explained here. A full script and a visualization script are available on the author's GitHub.
First, a random seed was set so the procedure could be reproduced. Then the data file was loaded, and a new data frame was created.
Next, 500 bootstrapped samples were created and saved in a list.
Then the first kind of model was trained for each sample in the samples list. This procedure can be repeated for each of the different models.
Next, for each sample two random forests were created: one without hyperparameter optimization and one with this optimization. The accuracy scores and Gini importances were added to the data frame.
np.random.seed(0)
data_file_path = '/Users/roos/Data/all_trials_noNaN2.csv'
data_file = pd.read_csv(data_file_path)
output = pd.DataFrame()

samples = []
for i in range(500):
    samples.append(data_file.sample(n=len(data_file), replace=True))

index = 0
for sample in samples:
    sample1 = sample[['9A00000045146841', 'F9000000452CCF41',
                      '76000000452C9741', '7200000045201D41',
                      '4B0000004516B141', 'CB000000452D7441',
                      'DPG_finger-chest', 'DPG_nose-forehead',
                      'DPG_pinna-mastoid', 'results', 'FLIR_forehead',
                      'FLIR_nose', 'FLIR_DPG_nose-forehead']]

    # Splitting the data
    X = sample1.drop('results', axis=1).values
    Y = sample1['results'].values
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=0)
Lastly, the differences between the models, features and benchmarks were calculated
and saved in the data frame.
    # Building a forest without and with hyperparameter optimization
    random_forest = RandomForestClassifier(n_estimators=100, random_state=9)
    random_forest.fit(X_train, Y_train)
    random_forest2 = RandomForestClassifier(n_estimators=100, random_state=9, max_depth=5)
    random_forest2.fit(X_train, Y_train)

    output.at[index, 'Acc1'] = random_forest.score(X_test, Y_test)
    output.at[index, 'Acc_opt1'] = random_forest2.score(X_test, Y_test)

    importances1 = random_forest2.feature_importances_
    output.at[index, 'Gini_9A00000045146841_1'] = importances1[0]
    output.at[index, 'Gini_F9000000452CCF41_1'] = importances1[1]
    output.at[index, 'Gini_76000000452C9741_1'] = importances1[2]
    output.at[index, 'Gini_7200000045201D41_1'] = importances1[3]
    output.at[index, 'Gini_4B0000004516B141_1'] = importances1[4]
    output.at[index, 'Gini_CB000000452D7441_1'] = importances1[5]
    output.at[index, 'Gini_DPG_finger-chest_1'] = importances1[6]
    output.at[index, 'Gini_DPG_nose-forehead_1'] = importances1[7]
    output.at[index, 'Gini_DPG_pinna-mastoid_1'] = importances1[8]
    output.at[index, 'Gini_FLIR_forehead_1'] = importances1[9]
    output.at[index, 'Gini_FLIR_nose_1'] = importances1[10]
    output.at[index, 'Gini_FLIR_DPG_nose-forehead_1'] = importances1[11]
    index += 1
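Neither the per-sample hyperparameter optimization mentioned at the start of this appendix nor the calculation of the differences appears in the excerpt above. The following is a minimal sketch of how both steps could look, not the author's exact code: the parameter grid, the Diff_* column names and the choice of the majority-class proportion as naive benchmark are assumptions for illustration.

from sklearn.model_selection import GridSearchCV

# (1) Per-sample hyperparameter optimization: this fragment would replace
# the fixed random_forest2 inside the `for sample in samples:` loop.
# The parameter grid is illustrative, not taken from the thesis.
param_grid = {'n_estimators': [100, 200], 'max_depth': [3, 5, 7]}
search = GridSearchCV(RandomForestClassifier(random_state=9), param_grid, cv=3)
search.fit(X_train, Y_train)
output.at[index, 'Acc_opt1'] = search.score(X_test, Y_test)

# (2) Differences against the naive benchmark, computed after the loop.
# The benchmark is assumed here to be the majority-class proportion of the
# full data set, analogous to the naive model of Appendix C.
naive = max(data_file['results'].mean(), 1 - data_file['results'].mean())
output['Diff1'] = output['Acc1'] - naive
output['Diff_opt1'] = output['Acc_opt1'] - naive
output['Diff_opt_minus_plain1'] = output['Acc_opt1'] - output['Acc1']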
Appendix E
Three examples of decision trees, one example tree for each model.
Model 1. [Decision tree figure: root split FLIR_forehead ≤ 34.005 (gini = .343, samples = 1372); further splits on FLIR_nose, 76000000452C9741, FLIR_forehead, CB000000452D7441 and FLIR_DPG_nose-forehead.]
Model 2. [Decision tree figure: root split MEQ_type ≤ 1.5 (gini = .343, samples = 1372); further splits on Age, MEQ_type, PSQI and Gender.]
Model 3. [Decision tree figure: root split MEQ_type ≤ 1.5 (gini = .343, samples = 1372); further splits on DPG_pinna-mastoid, FLIR_forehead, CB000000452D7441, DPG_finger-chest, DPG_pinna-mastoid and FLIR_nose.]
Appendix F
Some further explorative analyses were conducted but not added to the original study. However, they are quite interesting and are therefore described in this appendix. Another task that the participants performed was the continuous temporal expectancy task (CTET), a vigilance assessment task based on the task created by O'Connell et al. (2009). A full description can be found in the study of Van Schaik (2021).
Again, three different models were made in the same way as for the BSRT. The only difference is that the iButton and FLIR data were taken 20 seconds before target display, because this matched the timing of the EEG sample in O'Connell et al. (2009). The results were as follows.
Figure 9.
Accuracy of Each Model of the CTET Before and After Finetuning of the Parameters
Before hyperparameter optimization the models were not strongly overfitted, but they still performed much better on the train data than on the test data, except for the second model. After finetuning, the accuracy of the models improved by almost 10% relative to the naïve model.
Figure 6.
Gini Importance per Feature for Model 1 of the CTET
The benchmark is set at .11 for the first model. The best performing features after hyperparameter optimization were the nose temperature, the DPG pinna-mastoid and the DPG finger-chest. The worst performing features were the forehead temperature, the mastoid temperature and the chest temperature.
Figure 7.
Gini Importance per Feature for Model 2 of the CTET
The benchmark is set at .25 for the second model. The best performing features were the participants' PSQI score and age. The worst performing features were the participants' MEQ type and gender.
Figure 8.
Gini Importance per Feature for Model 3 of the CTET
The benchmark is set at .077 for the third model. The best performing features after hyperparameter optimization were the nose temperature, the DPG pinna-mastoid and the DPG finger-chest. The worst performing features were the mastoid temperature and the participants' MEQ type and gender.
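These benchmark values appear to correspond to an equal-importance baseline of 1/k for a model with k features: 1/9 ≈ .11 for model 1, 1/4 = .25 for the four demographic features of model 2, and 1/13 ≈ .077 for model 3.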
Here a bootstrapped procedure would be useful as well; it would allow firmer conclusions about the usefulness of the different models and features. For now, the FLIR-measured features seem to be relevant. However, the second model performs best, which would suggest that the classification in all models is based on differences between persons rather than on differences between the temperature features.