Preprocessing an EEG dataset in Python to get better accuracy - python

I have an EEG dataset with 8 features, taken using an 8-channel EEG headset. Each row represents readings taken at 250 ms intervals. The values are all floating point, representing voltages in microvolts. If I plot individual features, I can see that they form a continuous wave. Now, the target has 3 categories: 0, 1, 2, and for a duration of time the target doesn't change, because a single sample spans multiple rows. I would appreciate any guidance on how to pre-process the dataset, since using it as-is gives me very low accuracy (80%), and according to Wikipedia the P300 signal can be detected with 95% accuracy. Please note that I have almost zero knowledge of signal processing and analysing waveforms.
I did try making a 3D array where each row represented a single target and each feature's value was a list of values that originally spanned multiple rows, but I get an error that says the estimator expects an array with <= 2 dimensions. I'm not sure if this was the right approach, but it didn't work anyway.
Here, have a look at my feature set:
-1.2198,-0.32769,-1.22,2.4115,0.057031,-2.6568,7.372,-0.2789
-1.4262,-4.19,-5.6546,-7.7161,-5.4359,-9.4553,-3.6705,-5.4851
-1.3152,-6.8708,-8.5599,-14.739,-9.1808,-14.268,-11.632,-8.929
-0.53987,-7.5156,-8.9646,-16.656,-10.119,-15.791,-14.616,-9.4095
Their corresponding targets:
0
0
0
0
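For reference, the "array expected to be <= 2" error arises because scikit-learn estimators only accept 2D input (samples x features). A minimal sketch of the common workaround, epoching the continuous signal and flattening each epoch into a single row; the window length, random data, and classifier here are illustrative assumptions, not from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the real recording: 300 rows x 8 channels, one label per row.
X_raw = rng.normal(size=(300, 8))
y_raw = np.repeat([0, 1, 2], 100)   # label stays constant over long stretches

window = 10                          # rows per epoch (assumed; tune for your data)
n_epochs = len(X_raw) // window

# Reshape to (epochs, window, channels), then flatten each epoch to one 2D row.
X_epochs = X_raw[:n_epochs * window].reshape(n_epochs, window, 8)
X = X_epochs.reshape(n_epochs, -1)   # 2D: (n_epochs, window * channels)
y = y_raw[:n_epochs * window].reshape(n_epochs, window)[:, 0]  # one label per epoch

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # (30, 80) -- now acceptable to any sklearn estimator
```

The same reshaped array can be fed to any scikit-learn classifier; real P300 pipelines also typically band-pass filter and baseline-correct each epoch first.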

Related

Get prediction from multiple rows - Decision Tree Regressor

I have a dataset of weather data that I want to use to make a prediction.
The data set consists of data from several different locations. The features in the data set are as follows:
datetime
location
rain
snow
temp_min
temp_max
clouds
pressure
humidity
wind_speed
wind_deg
weather_description
The measurements have been made at the same time in all locations, which makes it possible to distinguish between the individual measurements.
I want to use data from all locations as input to get a prediction for one location.
Is it possible to use several lines as input or can input data only consist of one line?
The DecisionTreeRegressor from scikit-learn expects a dataframe where each output is generated from a single row. You can nevertheless move all your measurements into one row (during training and testing), as below:
rain_stn1, rain_stn2, rain_stn3, ..., snow_stn1, snow_stn2, snow_stn3, ...
rain_value#stn1, rain_value#stn2, rain_value#stn3, ...
Of course, this means that there needs to be some logical relationship between the stations, such as distance. You could also create aggregate values such as rain_nearby (average of stations at <5 km distance) and rain_far (average of stations at >5 km distance), which is probably more helpful in your case.
To give more specific answers, you would need to provide more details on the use case, what you are trying to achieve, and what the dataset looks like.
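The long-to-wide reshaping described above can be done with a pandas pivot; a minimal sketch (the column and station names are illustrative, not from the question):

```python
import pandas as pd

# Long format: one row per (datetime, location) measurement.
df = pd.DataFrame({
    "datetime": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "location": ["stn1", "stn2", "stn1", "stn2"],
    "rain":     [1.2, 0.4, 0.0, 0.1],
    "snow":     [0.0, 0.0, 2.5, 1.8],
})

# Wide format: one row per datetime, columns like rain_stn1, rain_stn2, ...
wide = df.pivot(index="datetime", columns="location", values=["rain", "snow"])
wide.columns = [f"{var}_{stn}" for var, stn in wide.columns]
print(wide)
```

Each row of `wide` is now a single training example for DecisionTreeRegressor, with one column per (variable, station) pair.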

Unsupervised learning: Anomaly detection on discrete time series

I am working on a final year project on an unlabelled dataset consisting of vibration data from multiple components inside a wind turbine.
Datasets:
I have data from 4 wind turbines each consisting of 415 10-second intervals.
About the 10 second interval data:
Each of the 415 10-second intervals consist of vibration data for the generator, gearbox etc. (14 features in total)
The vibration data (the 14 features) have a resolution of 25.6kHz (262144 rows in each interval)
The 10-seconds are recorded once every day, at different times => A little more than 1 year worth of data
(Head of dataframe with some of the features omitted.)
Plan:
My current plan is to
Do a fast Fourier transform (FFT) from the time domain for each of the different sensors (gearbox, generator, etc.) for each of the 415 intervals. From the FFT I am able to extract frequency information to put in a dataframe (statistical data from the FFT, like spectral RMS per bin).
Build different data sets for different components.
Add features such as wind speed, wind direction, power produced etc.
I will then build unsupervised ML models that can detect anomalies.
Unsupervised models I am considering are encoder-decoder (autoencoder) models and clustering.
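Step 1 of the plan above (spectral RMS per frequency bin) can be sketched as follows; the sampling rate and interval length come from the question, while the signal itself and the number of bands are illustrative assumptions:

```python
import numpy as np

fs = 25600                       # 25.6 kHz sampling rate (from the question)
n = fs * 10                      # one 10-second interval
rng = np.random.default_rng(0)
x = rng.normal(size=n)           # stand-in for one sensor's vibration signal

# One-sided amplitude spectrum.
spectrum = np.abs(np.fft.rfft(x)) / n
freqs = np.fft.rfftfreq(n, d=1 / fs)

# Spectral RMS per bin: split 0..Nyquist into equal-width bands.
n_bands = 16                     # assumed; choose per component of interest
edges = np.linspace(0, fs / 2, n_bands + 1)
rms_per_band = np.array([
    np.sqrt(np.mean(spectrum[(freqs >= lo) & (freqs < hi)] ** 2))
    for lo, hi in zip(edges[:-1], edges[1:])
])
print(rms_per_band.shape)        # one feature row for this sensor and interval
```

Doing this per sensor per interval turns each 262144-row recording into one short feature row, which is the representation the clustering/autoencoder step would consume.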
Questions:
Does it look like I have enough data for this type of task? 415 intervals x 4 different turbines = 1660 rows and approx. 20 features.
Should the data be treated as a time series? (It is sampled for 10 seconds once a day at random times..)
What other unsupervised ML models/approaches that could be good for this task?
I hope this was clearly written. Thanks in advance for any input!

How can I initialize K means clustering on a data matrix with 569 rows (samples), and 30 columns (features)?

I'm having trouble understanding how to begin my solution. I have a matrix with 569 rows, each representing a single sample of my data, and 30 columns representing the features of each sample. My intuition is to plot each individual row, and see what the clusters (if any) look like, but I can't figure out how to do more than 2 rows on a single scatter plot.
I've spent several hours looking through tutorials, but have not been able to understand how to apply it to my data. I know a scatter plot takes 2 vectors as a parameter, so how could I possibly plot all 569 samples to cluster them? Am I missing something fundamental here?
#our_data is a 2-dimensional matrix of size 569 x 30
plt.scatter(our_data[0,:], our_data[1,:], s = 40)
My goal is to start k means clustering on the 569 samples.
Since you have a 30-dimensional feature space, it is difficult to plot such data in 2D (i.e. on a canvas). In such cases one usually applies dimension-reduction techniques first; this can help in understanding the data's structure. You can try applying, e.g., PCA (principal component analysis) first:
# your_matrix.shape = (569, 30)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected_data = pca.fit_transform(your_matrix)
plt.scatter(projected_data[:, 0], projected_data[:, 1])  # very helpful for understanding the data structure
plt.show()
You can also look at other (including non-linear) dimension-reduction techniques, such as t-SNE.
You can then apply k-means (or something else), either to the original data or to the projected data.
If by "initialize" you mean picking the k initial cluster centers, one of the common ways of doing so is k-means++, described here, which was developed to avoid poor clusterings.
It essentially entails semi-randomly choosing centers based on a probability distribution over distances from a first center that is chosen uniformly at random.
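In scikit-learn, k-means++ is the default initialization for KMeans; a minimal sketch on data of the same shape as in the question (the data itself is random, for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(569, 30))   # same shape as the question's matrix

# init="k-means++" is the default; shown explicitly here for clarity.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_.shape, km.cluster_centers_.shape)
```

`km.labels_` gives one cluster assignment per sample, and the fitted centers live in the full 30-dimensional space even if you only plot a 2D PCA projection.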

Understanding scikit-learn's PCA.transform function in Python

So I'm currently working on a project that involves the use of principal component analysis (PCA), and I'm attempting to learn it on the fly. Luckily, Python has a very convenient module, sklearn.decomposition, that seems to do most of the work for you. Before I really start to use it, though, I'm trying to figure out exactly what it's doing.
The dataframe I've been testing on looks like this:
0 1
0 1 2
1 3 1
2 4 6
3 5 3
And when I call PCA.fit() and then view the components I get:
array([[ 0.5172843 , 0.85581362],
[ 0.85581362, -0.5172843 ]])
From my rather limited knowledge of PCA, I kind of grasp how this was calculated, but where I get lost is when I then call PCA.transform. This is the output it gives me:
array([[-2.0197033 , -1.40829634],
[-1.84094831, 0.8206152 ],
[ 2.95540408, -0.9099927 ],
[ 0.90524753, 1.49767383]])
Could someone potentially walk me through how it takes the original dataframe and components and transforms it into this new array? I'd like to be able to understand the exact calculations it's doing so that when I scale up I'll have a better sense of what's going on. Thanks!
When you call fit, PCA computes some vectors that you can project your data onto in order to reduce its dimension. Since each row of your data is 2-dimensional, there will be a maximum of 2 vectors onto which data can be projected, and each of those vectors will be 2-dimensional. Each row of PCA.components_ is a single vector onto which things get projected, and it will have the same size as the number of columns in your training data. Since you did a full PCA you get 2 such vectors, so you get a 2x2 matrix. The first of those vectors maximizes the variance of the projected data; the second maximizes the variance of what's left after the first projection. Typically one passes a value of n_components that's less than the dimension of the input data, so that you get back fewer rows and you have a wide but not tall components_ array.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called. For each row of the data you pass to transform you'll have 1 row in the output and the number of columns in that row will be the number of vectors that were learned in the fit phase. In other words, the number of columns will be equal to the value of n_components you passed to the constructor.
Typically one uses PCA when the source data has lots of columns and you want to reduce the number of columns while preserving as much information as possible. Suppose you had a data set with 100 rows and each row had 500 columns. If you constructed a PCA like PCA(n_components = 10) and then called fit you'd find that components_ has 10 rows, one for each of the components you requested, and 500 columns as that's the input dimension. If you then called transform all 100 rows of your data would be projected into this 10-dimensional space so the output would have 100 rows (1 for each in the input) but only 10 columns thus reducing the dimension of your data.
The short answer to how this is done is that PCA computes a Singular Value Decomposition and then keeps only some of the columns of one of those matrices. Wikipedia has much more information on the actual linear algebra behind this - it's a bit long for a StackOverflow answer.
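The projection itself is just "center the data, then dot with the components"; a small check on the question's own data (the signs of the components may be flipped between sklearn versions, but the identity holds either way):

```python
import numpy as np
from sklearn.decomposition import PCA

# The 4x2 dataframe from the question.
X = np.array([[1, 2], [3, 1], [4, 6], [5, 3]], dtype=float)

pca = PCA(n_components=2)
transformed = pca.fit_transform(X)

# transform(X) is exactly (X - column means) @ components_.T
manual = (X - X.mean(axis=0)) @ pca.components_.T
print(np.allclose(transformed, manual))  # True
```

So each row of the transform output is the centered sample expressed in the new coordinate system defined by the component vectors.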

Doing classification after an FFT

I have a spectrum and I do an FFT, and I want to use this data for learning with scikit-learn. However, I don't know what to use as explanatory variables: the frequencies, the amplitudes, or the phases. It also seems there are specific methods for processing such data. If you have any ideas, thank you.
For example, measurements made on two species:
Measurements for species 1:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2806130.78600507 -79.781679752725
234.24463948875 1913786.60902507 17.7111789273704
351.366959233125 808519.710937228 116.444676921222
468.4892789775 122095.42475935 25.5770279979328
585.520239658112 607116.287067349 142.264887989957
702.642559402487 604818.747928879 -112.469849617122
819.764879146862 277750.38203791 -15.0000950192717
936.887198891237 118608.971696726 -74.5121366118222
1054.00951863561 344484.145698282 -6.21161038546633
1171.13183837999 327156.097365635 97.0304114077862
1288.25415812436 133294.989030519 -42.5375933954097
1405.37647786874 112216.937121264 78.5147573168857
1522.49879761311 231245.476714294 -25.4436913705878
1639.62111735749 201337.057689481 -24.3659638609968
1756.6520780381 77785.2190703514 29.0468023773855
1873.77439778247 103345.482912432 -13.8433556624336
1990.89671752685 164252.685204496 32.0091367478569
2108.01903727122 131507.600569796 3.20717282723705
2225.1413570156 62446.6053497028 17.6656168494324
2342.26367675998 92615.8137781526 -2.92386499550556
Measurements for species 2:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2786323.45338023 -78.5559125894388
234.24463948875 1915479.67743241 20.1586403367551
351.366959233125 830370.792189816 120.081294764269
468.4892789775 94486.3308071095 28.1762359863422
585.611598721875 590794.892175599 137.070646192436
702.642559402487 610017.558439343 -99.8603287979889
819.764879146862 300481.494163747 -7.0350571153689
936.887198891237 93989.1090623071 -52.6686900337389
1054.00951863561 332194.292343295 4.40278213901234
1171.13183837999 335166.932956212 92.5972261483014
1288.25415812436 154686.81104112 -64.5940556800747
1405.37647786874 91910.7647280088 82.3509804545009
1522.49879761311 223229.665336525 -64.4186985300827
1639.62111735749 211038.25587802 12.6057366375093
1756.74343710186 93456.4477333818 25.3398315513138
1873.77439778247 87937.8620001563 15.3447294063444
1990.89671752685 160213.112972346 7.41647669351739
2108.01903727122 141354.896010814 -48.4341201110724
2225.1413570156 69137.6327300227 39.9238718439715
2342.26367675998 82097.0663259956 -28.9291500313113
OP is asking how to classify this. I've explained it in comments and will break it down more here:
Each "species" represents a row, or a sample. Each sample thus has 60 features (20 x 3).
He is doing a binary classification problem
Re-cast the output of the FFT to give Freq1,Amp1,Phase1....etc as a numerical input set for a training algorithm
Use something like a Support Vector Machine or Decision Tree Classifier out of scikit-learn and train over the dataset
Evaluate and measure accuracy
Caveats: 60 features over 1000 samples is potentially going to be quite hard to separate and liable to over-fitting; OP needs to be careful. I haven't spent much time understanding the features themselves, but I suspect 20 of them are redundant (the frequencies always seem to be the same between samples).
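The recipe above (flatten each measurement into Freq1, Amp1, Phase1, ... and train a classifier) can be sketched like this; the data here is synthetic and the pipeline choices are illustrative, not from the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

n_samples, n_peaks = 200, 20
# Each sample: 20 rows of (frequency, amplitude, phase) -> flatten to 60 features.
measurements = rng.normal(size=(n_samples, n_peaks, 3))
X = measurements.reshape(n_samples, -1)    # shape (200, 60)
y = rng.integers(0, 2, size=n_samples)     # binary: species 1 vs species 2

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVMs; in the real data amplitudes are ~1e6 while
# phases are ~1e2, so StandardScaler keeps one feature from dominating.
clf = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

If the frequency columns really are constant across samples, dropping them (keeping only amplitude and phase) cuts the feature count to 40 with no information loss.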
