Explorative Analysis: Dimensionality reduction I

The objective of this assignment is to see if we can use PCA to classify different periods of the 24-hour EEG recording in a similar way as the expert human in the hypnogram (rech). To be able to make this comparison easily, we will chop the EEG time series in non-overlapping pieces of 3,000 data points so we match exactly the period of acquisition of the hypnogram (30 seconds). Remember that this is because the EEG was acquired at 100 Hz so that in 30 seconds the EEG contains 3,000 data points.

Introduction

How to do PCA

[coeff, score, latent] = princomp(eeg);

coeff are the weights of each principal component in the original axes, score is the data already projected on the rotated axes (remember that PCA is just an axis rotation), and latent tells you how much variance is explained by each of the principal components.

Exercises

Before we start the exercises, we import the data and create general variables.

% Import data
[hd, rec] = edfread('sc4002e0.rec'); % Import EEG data
[hdh, rech] = edfread('sc4002e0.hyp'); % Import Hypnogram data

% EEG Data
Fs1= 100; % Frequency sample of 100Hz
t1 = 1/Fs1; % Time sample (0.01 seconds)
EEG = rec(2,:); % Channel of the EEG signal
% Hypnogram Data
t2 = 30; % Time sample (30 seconds)

Exercise 1

**(A) Build a matrix that contains in each row a different piece of data of length 3,000. This matrix will have dimensions length(rech) x 3,000. To correct for possible drifts in the mean EEG signal, subtract from each of these windows its mean.**

We convert the EEG vector to matrix chopped every 3000 datapoint.

[mat,padded] = vec2mat(EEG,3000);

To correct for possible drifts in the mean EEG signal, subtract from each of these windows its mean.

m = [];
for i=1:2830
 m(i) = mean(mat(i,:));
 mat_sub (i,:) = mat(i,:)-m(i);
end

**(B) Compute the power spectrum with pwelch. You can let Matlab choose the default window size and sampling frequency, because we do not need that info now. We save the resulting power spectrum in decibels (i.e. we take the logarithm in base 10, log10, and multiply the result by 10) in a new matrix that will include all these spectra as different rows, and thus they will have dimensions length(rech) x length(power spectrum).**

We compute the power spectrum of the EEG matrix subracted (mat_sub). To make things easier, we save the resulting power spectrum in decibels in a new matrix called *pws_mat*.

pws_mat=[];
for i=1:size(mat_sub,1)
 [p1,f1] = pwelch(mat_sub(i,:),[],[],[],Fs1);
 pws_mat(i,:) = 10*log10(p1);
end

**(C) Now you are ready to pass this information to the PCA algorithm. First, though, see by yourself what the data looks like. Plot individual power spectra (from different random rows in your matrix) and see that they are messy and potentially different from one another in specific ways. In what ways? What are the essential differences that collectively separate these different spectra in separate classes that could signal different brain states? This is what PCA allows us to do.**

We plot the power spectrum of the EEG signal subtracted.

figure
plot(f1,pws_mat);
title('Power spectrum of the EEG matrix signal');
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
im1
Figure 1. Power spectrum of the EEG signal subtracted.

If we plot random power spectrum rows of pws_mat in the same plot, we will see a huge difference between them. To create the random numbers, we call the function randi. It creates 4 random numbers from 1 to 3000.

r = randi([1 size(pws_mat,1)],1,4);
figure
hold on
plot(f1,pws_mat(r(1),:));
plot(f1,pws_mat(r(2),:));
plot(f1,pws_mat(r(3),:));
plot(f1,pws_mat(r(4),:));
title('Random power spectrum of the EEG matrix signal');
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('Random Row A', 'Random Row B', 'Random Row C', 'Random Row D');
hold off
im2
Figure 2. Random power spectrum of the EEG signal.

The power spectrum of all four random rows of the psw_mat variable are messy and potentially different from one another. One of the reasons is because the power spectrum of the random rows have a sleep stage associated with it. Therefore, its information (i.e magnitude of the power spectrum across frequencies) is different.

Exercise 2

**(A) Run the PCA with princomp on the power spectra matrix and first plot the percent of variance explained by each principal component. How many components do you think are necessary to sufficiently explain the data?**

(!) pca and princomp functions do not work properly in my MATLAB R2015b version, so the only solution I found to do the expriment was to run the pca and get only the values PCCOEFF (i.e latent variable using princomp) and PCCVEC (i.e. coef variable using princomp). Since I need the variable SCORE and function pca in my laptop does not want to run, I asked a friend to send me the “scores.mat” variable so I can keep going with the exercices about PCA.

We run the PCA with princomp on the power spectra matrix.

[PCCOEFF, PCVEC] = pca(pws_mat);

We plot the percent of variance explained by each principal component in two different options:

Option 1

figure
var100 = PCCOEFF/sum(PCCOEFF);
plot(1:length(PCCOEFF),var100); xlim([1 10]); ylim([0 1]);
title('Cumulative distribution of variances by components');
xlabel('Components'); ylabel('Cumulative distribution');
im3
Figure 3. Cumulative distribution of variances by components (Option 1).

Option 2

figure
pareto(var100);
title('Cumulative distribution of variances by components');
xlabel('Components');
im4
Figure 4. Cumulative distribution of variances by components (Option 2).

By viewing the two figures, we can perfectly see how the first component (PC1) has a huge weight (i.e almost 90%), while the following components 2, 3 and 4 are barely reliable to be used.

**(B) Plot in a 2D graph each data point in the first and second principal components. What do you see? Could you easily separate the two main brain states, awake and asleep from this representation?**

(!) At this point, I need to upload the “scores.mat” variable.

load('scores.mat');

We save in PC1 variable the first component of scores (first column), and in PC2 variable the second component of scores (second column).

PC1 = score(:,1);
PC2 = score(:,2);

We plot in a 2D graph each data point in the PC1 and PC2.

figure
plot(PC1,PC2, '.');
title('Princpal Components');
xlabel('Principal Component 1'); ylabel('Principal Component 2');
im5
Figure 5. Principal Components (1st and 2nd).

What we see here are two different clouds separated at -100 from the x-axis (i.e PC1). Thus, we could separate the two main brain states, awake and asleep from this representation. The big cloud on the right might be the awake state and the one on the left is the sleep.

**(C) Which PC do you think would be better to separate these two clouds? Separate the two clouds based on a fixed value in the first principal component and average the spectra for all EEG pieces at each side of this boundary separately. Plot each of these two mean power spectra. Can you identify what cloud corresponds to the awake state and what cloud is more consistent with sleep?**

In the figure above we see a threshold separation between the two clouds that resides at -100 taking the x-axis as reference (aka the PC1 axis). Thus, we find which values of PC1 are above and below this threshold and we save them in the awake and sleep variables.

awake = find(PC1 >= -100);
sleep = find(PC1 < -100);

We calculate the mean of the power spectrum of each awake and sleep variable, so we can plot the result and compare the two situations.

awake_m = mean(pws_mat(awake,:));
sleep_m = mean(pws_mat(sleep,:));

We plot the mean power spectrum of the matrix signal when the subject is sleeping or it’s awake.

figure
plot(f1,awake_m); hold on
plot(f1,sleep_m);
title('Power spectrum of the EEG matrix signal: Sleep vs Awake');
xlabel('Frequency (Hz)'); ylabel('Mean Power (dB)'); legend('Awake','Sleep');
im6
Figure 6. Power spectrum of the EEG signal: sleep (red) and awake (blue).

(Extra) We plot in a 2D graph each data point in the PC1 and PC2 in a way that each sleep stage cloud has a color.

We find the dots for each PC of each state by using awake and sleep variable.

PC1_awake = PC1(awake); PC1_sleep = PC1(sleep);
PC2_awake = PC2(awake); PC2_sleep = PC2(sleep);

We plot the result:

figure
plot(PC1_awake,PC2_awake, '.'); hold on
plot(PC1_sleep,PC2_sleep, '.');
title('Princpal Components');
xlabel('Principal Component 1'); ylabel('Principal Component 2');

 

im7
Figure 7. Principal components (1st and 2nd) showing two different clouds for sleep and awake brain states.

Exercise 3

**(A) Knowing that the sleep phase contains separate brain states (REM, shallow NON-REM, deep NON-REM), you can try to see if PCA can distinguish those states by looking at the 2nd principal component. First zoom in your sleep cloud in the score plot and see if you can identify separate sub-clouds. Plot a histogram of the 2nd component score for the sleep points and see if you can identify separate subgroups. Do you see different modes on the distribution you got with the histogram?**

We look at the 2nd principal component to see if PCA can distinguish the brain states from the sleep phase. We first zoom in the score plot to see if we can visually distinguish the different sleep phases in the sleep cloud.

We plot the result, but we do not see separate sub-clouds.

figure
plot(PC1_sleep,PC2_sleep, '.');
title('2nd Princpal Component')
xlabel('Principal Component 1'); ylabel('Principal Component 2');

Figure

We plot a histogram of the 2nd component score for the sleep points and see if you can identify separate subgroups.

hist(PC1p,100);
title('Histogram of Principal Component 1 (PC1) in all states');

 

im15.png
Figure 8. Histogram of the 1st principal component (PC1)

Just by looking at the histogram, we cannot differentiate between clouds.

**(B) Perhaps you do not see very clearly the separation. The next thing you can do is to run again the PCA now restricted to the sleep spectra you identified in Exercise 2. This is a more powerful approach because, as mentioned in class, the 2nd component is forced to be orthogonal to the 1st component. In this new analysis this constraint is removed and now the direction that best distinguishes the sleep conditions can be evaluated without such restrictions. In this new PCA, plot now the fraction of explained variance in the two first principal components. Can you distinguish now 3 clouds? Can you visually determine the boundaries of each sleep phase? Use the histogram again. Try different number of bins and extract the boundaries from here. Using such boundaries you can now separate spectra corresponding to each cloud and average them together to obtain 3 mean spectra for each clouds. Plot them (Fig. 4). Can you identify the REM, shallow NON-REM, and deep NON-REM phases of sleep?**

We run again the PCA now restricted to the sleep spectra we identified in Exercise 2. This is a more powerful approach because the 2nd component is forced to be orthogonal to the 1st component. In this new analysis this constraint is removed and the direction that best distinguishes the sleep conditions can be evaluated without such restrictions.

Thus, in this new PCA, we plot the fraction of explained variance in the two first principal components.

[PCCOEFF2, PCVEC2] = pca(pws_mat(sleep,:));

We plot the percent of variance explained by each principal component:

var100_2 = PCCOEFF2/sum(PCCOEFF2);
figure
pareto(var100_2);
title('Comulative distribution of variances by components');
xlabel('Components'); ylabel('Comulative distribution');

 

im14.png
Figure 9. Cumulative distribution of variances by components (Sleep state)

We determine the boundaries of each sleep phase and determine three clouds. We use such boundaries to separate spectra corresponding to each cloud and average them together to obtain 3 mean spectra for each cloud. We can identify the REM, shallow Non-REM, and deep non-REM phases of sleep.

(!) At this point, I need to upload the “scores.mat” variable.

load('score2.mat');

We save in two variables PC1 and PC2.

PC1_sleep = score2(:,1);
PC2_sleep = score2(:,2);

We plot in a 2D graph each data point in the PC1 and PC2.

figure
plot(PC1_sleep,PC2_sleep, '.');
title('Princpal Components');
xlabel('Principal Component 1'); ylabel('Principal Component 2');

im16.png
Figure 10. Princpal components for sleep state.

We plot a histogram of the 2nd component score for the sleep points and see if we can identify separate subgroups.

hist(PC1_sleep,100);
title('Histogram of Principal Component 1 (PC1) in sleep state');
im12
Figure 11. Histogram of the 1st principal component (PC1)

Both figures agree in terms of the point of separation between the first cloud from the second resides around -17 taking as reference the x-axis. Likewise, we see how the second boundary between the other two clouds is at point 19.4. Thus, we find which values of `PC1` correspond to each sleep phase taking into account the two thresholds.

sleep1 = find(PC1_sleep <= -17);
sleep2 = find( -17 < PC1_sleep <= 19.4);
sleep3 = find(19.4 < PC1_sleep);

We select power spectrum of sleep state.

pws_mat2 = pws_mat(sleep,:);

We calculate the mean of the power spectrum of each awake and sleep variable, so we can plot the result and compare the two situations.

sleep1_m = mean(pws_mat2(sleep1,:)); 
sleep2_m = mean(pws_mat2(sleep2,:)); 
sleep3_m = mean(pws_mat2(sleep3,:));

We plot the mean power spectrum of the matrix signal when the subject is sleeping or it’s awake.

figure 
plot(f1,sleep1_m); hold on plot(f1,sleep2_m); plot(f1,sleep3_m); 
title('Power spectrum of the EEG matrix signal in all sleep states');
xlabel('Frequency (Hz)'); ylabel('Mean Power (dB)'); 
legend('Sleep 1','Sleep 2', 'Sleep 3');

im8.png
Figure 12. Power spectrum of the EEG matrix signal in all sleep states.

How do we know which signal corresponds to which sleep stage? We know that REM phase has no peak around 12Hz. Therefore, sleep 1 corresponds to REM 5. Following the same modus operandis, non-REM 3&4 sleep state is characterized by higher delta activity. Thus, sleep 3 corresponds to non-REM 3&4, and the sleep variable remaining corresponds to non-REM 1&2 (sleep 2).

We plot the same figure again with the legend corrected.

figure 
plot(f1,sleep1_m); hold on; 
plot(f1,sleep2_m); plot(f1,sleep3_m); 
title('Power spectrum of the EEG matrix signal in all sleep states'); 
xlabel('Frequency (Hz)'); ylabel('Mean Power (dB)'); 
legend('REM 5','Non-REM 1&2', 'Non-REM 3&4');

im9.png
Figure 13. Power spectrum of the EEG signal with all sleep states tagged.

**(C) You have now achieved your own classification of the phases of sleep-wake in this recording. Let’s see how well we did in identifying sleep phases on our own, just with the help of the PCA algorithm, compared with the expert eye recorded in the hypnogram rech. To this end, get the indices that rech attributes to the REM stage, to the NON-REM stages 1 and 2, and to the NON-REM stages 3 and 4. Then plot again the 2D plot of PC1 against PC2 for the two cases before (all spectra, and sleep-spectra alone), only that the dots are plotted with different color depending on the rech evaluation.**

We now achieved our own classification of the phases of sleep-wake in this recording. Let’s see how well we did in identifying sleep phases on our own, just with the help of the PCA algorithm, compared with the expert eye recorded in the hypnogram rech. To this end, get the indices that rech attributes to the REM 5 stage, to the NON-REM stages 1 and 2, and to the NON-REM stages 3 and 4. Let’s now separate the different sleep stages to compare the spectra by using unique function to save a vector with rech values (0/9).

sleep_values=unique(rech);
for i=1:length(sleep_values) 
 sleep_indeces{i}=find(rech==sleep_values(i));
end

Compute the indices of the eeg vector at which periods ranked as 1/2, 3/4 or 5 in rech end. We put together all indexes 1&2, 3&4, and 5 and use sort function to put the numbers in order in the vector.

sleep12 = sort([sleep_indeces{2} sleep_indeces{1}]);
sleep34 = sort([sleep_indeces{4} sleep_indeces{3}]); 
sleep5 = sleep_indeces{5};

We then plot again the 2D plot of PC1 against PC2 for the two cases before (all spectra, and sleep-spectra alone), only that the dots are plotted with different color depending on the rech evaluation.

PC1_nrem12 = PC1(sleep12); PC2_nrem12 = PC2(sleep12); % Non-REM 1&2 
PC1_nrem34 = PC1(sleep34); PC2_nrem34 = PC2(sleep34); % Non-REM 3&4 
PC1_rem5 = PC1(sleep5); PC2_rem5 = PC2(sleep5); % REM 5

We plot the result:

figure 
plot(PC1_nrem12,PC2_nrem12, 'r.'); hold on; 
plot(PC1_nrem34,PC2_nrem34, 'b.'); 
plot(PC1_rem5,PC2_rem5, 'g.'); 
plot(PC1_awake,PC2_awake, 'k.'); 
title('Princpal Components'); 
xlabel('Principal Component 1'); ylabel('Principal Component 2'); 
legend('Non-REM 1&2', 'Non-REM 3&4', 'REM 5', 'Awake');
im10_2.png
Figure 14. Principal components of awake and sleep states.

We now find the datapoints in sleep stage.

[~,sleep_points] = find(rech>0 & rech<9);
rech2 = rech(sleep_points);

We save sleep points to each respective sleep state.

    snrem12 = [find(rech2==1) find(rech2==2)];
     snrem34 = [find(rech2==3) find(rech2==4)];
     srem5 = find(rech2==5);

Select PC1 and PC2 data for each sleep state.

    PC1_srem12 = PC1_sleep(snrem12); 
    PC2_srem12 = PC2_sleep(snrem12); % Non-REM 1&2
    PC1_srem34 = PC1_sleep(snrem34); 
    PC2_srem34 = PC2_sleep(snrem34); % Non-REM 3&4
    PC1_srem5 = PC1_sleep(srem5); 
    PC2_srem5 = PC2_sleep(srem5); % REM 5

Plot PC1 and PC2 and their sleep data points for each sleep state.

    figure
    plot(PC1_srem12,PC2_srem12, 'r.'); hold on
    plot(PC1_srem34,PC2_srem34, 'b.');
    plot(PC1_srem5,PC2_srem5, 'g.');
    title('Princpal Components'); xlabel('Principal Component 1'); 
    ylabel('Principal Component 2'); legend('Non-REM 1&2','Non-REM 3&4','REM 5');
im11.png
Figure 15. Principal components of sleep state: non-REM 1&2 (red), non-REM 3&4 (blue), and REM phase (green).

 

OPTIONAL

**(1) Let’s quantify how good we were in classifying sleep phases without knowing anything about sleep rhythms, just using PCA and some intuition (i.e. thresholds!). Take each of the vigilance levels: awake, NREM12, NREM34 and REM and compute the fraction of data segments that were wrongly classified based on the simple threshold criterion. Which is the sleep phase that was worst classified based on the threshold?**

**(2) Can you try and interpret the weights of the PCs (also known as the “loadings”, which are in the matrix coeff returned as first element by princomp)? To this end, go back to the full PCA, and plot the weights of the 1st PC. Can you interpret what this curve means by looking at the average spectra that you obtained when thresholding on the first cloud? You can gain intuition about it by computing the mean power spectrum across all conditions, and subtracting it from the average spectra in Fig. 2. Alternatively, you can identify the center of mass in the PC1 axis of each cloud in Fig. 1, and use this number as a factor on the PC1 loadings and then add the grand-average power spectrum. How do the two curves that you obtain compare with the ones in Fig. 2? Can you now fully interpret the meaning of the PC1 loadings?**

Leave a Reply