Error in using knn for multidimensional data - python

I am a beginner in machine learning and I am trying to classify multidimensional data into two classes. Each data point is a 40x6 array of float values. To begin with, I read my CSV file. In this file, the shot number identifies the data point.
https://docs.google.com/spreadsheets/d/1tW1xJqnNZa1PhVDAE-ieSVbcdqhT8XfYGy8ErUEY_X4/edit?usp=sharing
Here is the code in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot

from sklearn.neighbors import KNeighborsClassifier

# Read csv data into pandas data frame
data_frame = pd.read_csv('data.csv')

extract_columns = ['LinearAccX', 'LinearAccY', 'LinearAccZ', 'Roll', 'pitch', 'compass']

# Number of samples in one shot
samples_per_shot = 40

# Calculate number of shots in dataframe (integer division, so range() accepts it)
count_of_shots = len(data_frame.index) // samples_per_shot

# Initialize empty data frame
training_index = range(count_of_shots)
training_data_list = []

# Flag for backward compatibility
make_old_data_compatible_with_new = 0

if make_old_data_compatible_with_new:
    # Convert 40 shot data to 25 shot data
    # New logic takes 25 samples/shot
    # Old logic takes 40 samples/shot
    start_shot_sample_index = 9
    end_shot_sample_index = 34
else:
    # Start index from 1 and continue till, let's say, 40
    start_shot_sample_index = 1
    end_shot_sample_index = samples_per_shot

# Extract each shot into pandas series
for shot in range(count_of_shots):
    # Extract current shot
    current_shot_data = data_frame[data_frame['shot_no'] == (shot + 1)]

    # Select only the following columns
    selected_columns_from_shot = current_shot_data[extract_columns]

    # Select columns from selected rows
    # Find start and end row indexes
    current_shot_data_start_index = shot * samples_per_shot + start_shot_sample_index
    current_shot_data_end_index = shot * samples_per_shot + end_shot_sample_index
    # .loc replaces the label-based .ix, which newer pandas versions no longer provide
    selected_rows_from_shot = selected_columns_from_shot.loc[current_shot_data_start_index:current_shot_data_end_index]

    # Append to list of lists
    # Convert selected shot into multi-dimensional array
    training_data_list.append([selected_columns_from_shot[extract_columns[index]].values.tolist() for index in range(len(extract_columns))])

# Append each sliced shot into training data
training_data = pd.DataFrame(training_data_list, columns=extract_columns)
training_features = [1 for i in range(count_of_shots)]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_features)
After running the above code, I am getting an error
ValueError: setting an array element with a sequence.
for the line
knn.fit(training_data, training_features)
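One likely cause, offered as a hedged sketch and not as a confirmed diagnosis: every cell of training_data holds a list of 40 values, so the frame cannot be converted into a plain 2-D array of floats. A common workaround is to flatten each 40x6 shot into one row of 240 numbers. The synthetic data, n_shots and the name features below are assumptions made for illustration; only the column names, shot_no and samples_per_shot come from the question.

```python
import numpy as np
import pandas as pd

extract_columns = ['LinearAccX', 'LinearAccY', 'LinearAccZ', 'Roll', 'pitch', 'compass']
samples_per_shot = 40
n_shots = 5  # hypothetical number of shots, stands in for the real file

# Synthetic stand-in for data.csv: 40 rows per shot, numbered from 1
rng = np.random.default_rng(0)
data_frame = pd.DataFrame(
    rng.normal(size=(n_shots * samples_per_shot, len(extract_columns))),
    columns=extract_columns,
)
data_frame['shot_no'] = np.repeat(np.arange(1, n_shots + 1), samples_per_shot)

# One row per shot: flatten each 40x6 block into a vector of 240 floats
features = np.stack([
    shot_rows[extract_columns].to_numpy().ravel()
    for _, shot_rows in data_frame.groupby('shot_no', sort=True)
])

# features is now a plain 2-D float array of shape (n_shots, 240), the layout
# that an estimator's fit(X, y) expects, with one label per shot in y.
```

Note that the labels in the question are all 1, so a classifier trained on them could only ever predict one class. The real class of each shot would have to be supplied instead.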

Related

Sort rows of curve shaped data in python

I have a dataset that consists of 5 rows of points, each shaped like a curve. I want to separate the inner row from the others or, if possible, separate every row and store each one in its own array. Is there any way to do this, for example by somehow flattening the curved data and then sorting it by the x and y values?
I would like to number the points of each row from left to right, from 0 up to the maximum for that row. Right now the labels of the dots are not useful to me, and I can't change them.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves follow a pattern, so you can select the curve of interest using slicing. I had to offset the selection slightly to get the five curves, because the first 8 points are not in the same order as the rest of the data, so those initial 8 data points are discarded. They can be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve

plot specific columns from a text file

If I have a text file, data.txt, which contains many columns, how can I read this file with Python and plot only two chosen columns?
For example:
10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s
and I need to plot only the first and second columns.
Note the space before the first column.
Edit
I used the following code:
import matplotlib.pyplot as plt
from matplotlib import rcParamsDefault
import numpy as np
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
x, y = np.loadtxt('./calc.dat', delimiter=' ', unpack=True)
plt.plot(x, y, "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
but I have to modify my data to contain only two columns that I need to plot without any spaces before the first column, as follows
10 -22.82215289
12 -22.81978265
15 -22.82359691
20 -22.82464363
25 -22.82615348
30 -22.82641815
35 -22.82649347
40 -22.82655376
50 -22.82661407
60 -22.82663535
70 -22.82664864
80 -22.82665962
90 -22.82666308
100 -22.82666662
So, how can I modify the code so that I do not have to modify the data every time?
You can create a DataFrame from a text file using pandas read_csv, which can simplify future processing of the data besides plotting it.
In this case, the tricky part is the whitespace, which can be handled by setting the optional parameter sep to r'\s+':
import pandas as pd
df = pd.read_csv('data.txt', sep=r'\s+', header=None, names=['foo', 'bar', 'baz'])
>>> df
index  foo           bar    baz
0       10  -22.82215289  0.11s
1       12  -22.81978265  0.14s
2       15  -22.82359691  0.14s
3       20  -22.82464363  0.16s
4       25  -22.82615348  0.17s
5       30  -22.82641815  0.19s
6       35  -22.82649347  0.21s
7       40  -22.82655376  0.22s
8       50  -22.82661407  0.28s
9       60  -22.82663535  0.34s
10      70  -22.82664864  0.42s
11      80  -22.82665962  0.46s
12      90  -22.82666308  0.51s
13     100  -22.82666662  0.56s
And then just your code:
plt.rcParams["figure.dpi"]=150
plt.rcParams["figure.facecolor"]="white"
plt.plot(df['foo'], df['bar'], "o-", markersize=5, label='Etot')
plt.xlabel('ecut')
plt.ylabel('Etot')
plt.legend(frameon=False)
plt.savefig("fig.png")
I set the names of the columns to arbitrary strings. You can avoid that and just refer to the columns as df[0], df[1].
You could first read your file data.txt and preprocess it by stripping the whitespace on the left of each line, then save the preprocessed data to data_processed.txt and load it with pd.read_csv. After that, plot the two columns of your choice, col1 and col2, against each other with plt.plot, as follows:
import pandas as pd
import matplotlib.pyplot as plt
s = """ 10 -22.82215289 0.11s
12 -22.81978265 0.14s
15 -22.82359691 0.14s
20 -22.82464363 0.16s
25 -22.82615348 0.17s
30 -22.82641815 0.19s
35 -22.82649347 0.21s
40 -22.82655376 0.22s
50 -22.82661407 0.28s
60 -22.82663535 0.34s
70 -22.82664864 0.42s
80 -22.82665962 0.46s
90 -22.82666308 0.51s
100 -22.82666662 0.56s"""
with open('data.txt', 'w') as f:
    f.write(s)
with open('data.txt', 'r') as f:
    data = f.read()
data_processed = '\n'.join([l.lstrip() for l in data.split('\n')])
with open('data_processed.txt', 'w') as f:
    f.write(data_processed)
df = pd.read_csv('data_processed.txt', sep=' ', header=None)
col1 = 0
col2 = 1
plt.plot(df[col1], df[col2]);
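Since the question already uses np.loadtxt, here is a minimal sketch of a variant that needs no preprocessing. When no delimiter is given, loadtxt splits on any run of whitespace, so leading spaces are ignored, and usecols restricts parsing to the chosen columns so the non-numeric third column is not converted. The in-memory text stands in for data.txt and is an assumption made for illustration.

```python
import io
import numpy as np

# Stand-in for data.txt: leading spaces and a third column that is not numeric
raw = """ 10 -22.82215289 0.11s
 12 -22.81978265 0.14s
 15 -22.82359691 0.14s
"""

# usecols picks columns 0 and 1; unpack=True returns one array per column
x, y = np.loadtxt(io.StringIO(raw), usecols=(0, 1), unpack=True)

# x and y can then be passed to plt.plot(x, y, "o-") as in the question
```

With a real file, the first argument would be the file path instead of the StringIO object.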

How to obtain the first 4 rows for every 20 rows from a CSV file

I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd and 4th row of every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way, using iloc, to print the first 4 rows of every 20 lines together rather than separately as I do now? Apologies if this is worded badly; I am fairly new to both Python and pandas.
Using groupby.head:
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
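As a self-contained sketch of the groupby approach, the example below uses a hypothetical 100-row frame in place of Prem_results. It also shows an equivalent boolean mask passed to iloc, which is closer to what the question asks for. The names position, by_group and by_mask are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Prem_results
df = pd.DataFrame({"col1": np.arange(100)})

# Row positions 0..99; integer division by 20 labels each block of 20 rows
position = np.arange(len(df))
by_group = df.groupby(position // 20).head(4)

# Same selection with iloc: keep rows whose position within the block is 0..3
by_mask = df.iloc[position % 20 < 4]
```

Both results keep the original row order, so no sort_index call is needed afterwards.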
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
df = pd.DataFrame({'col1':np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
How it works: the four slices start at rows 0, 1, 2 and 3 respectively, and each one takes every 20th row from its starting point.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize = 20):
    df = pd.concat((df, chunk.head(4)))
More resources:
The usage of chunksize is described in the official pandas documentation for read_csv.
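The chunked read can be tried without a file on disk. This is a minimal sketch in which an in-memory CSV with a single hypothetical column named value stands in for the real file, and the chunks are collected in a list and concatenated once at the end.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file: a header line plus 60 data rows
csv_text = "value\n" + "\n".join(str(i) for i in range(60))

# Each chunk holds 20 consecutive rows; keep only the first 4 of each
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=20)
first_four = pd.concat([chunk.head(4) for chunk in chunks])
```

The row index continues across chunks, so the result keeps the original row numbers.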

TypeError: '<' not supported between instances of 'str' and 'int' while doing PCA for k-means clustering

I am trying to apply Kernel Principal Component Analysis to a dataset without a dependent variable, in order to do a cluster analysis with k-means and learn how the method works. Here is a sample of my dataset (in this scenario, it is the dataset of a shopping mall that wants to discover its customer segments from the data below):
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
First, I omitted CustomerID column and then encoded the gender column to be able to apply kernel PCA. Here is how I did it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, 1:5].values
df = pd.DataFrame(X)
#df is in order to visualize the "X" on variable explorer
#Encoding independent categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
After executing this code, I could get the array with float64 Type. The sample from the array I created is below:
0 1 19 15 39
0 1 21 15 81
1 0 20 16 6
1 0 23 16 77
1 0 31 17 40
1 0 22 17 76
1 0 35 18 6
1 0 23 18 94
0 1 64 19 3
1 0 30 19 72
0 1 67 19 14
And then, I wanted to apply Kernel PCA to get the principal components which I will use at k-means. However, when I try to execute the code below, I get the error "TypeError: '<' not supported between instances of 'str' and 'int'".
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 'None', kernel = 'rbf')
X = kpca.fit_transform(X)
explained_variance = kpca.explained_variance_ratio_
I encoded my categorical data and I don't have any strings in my dataset, so I cannot understand why I get this error. Can anyone help?
Thank you very much in advance.
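n_components = 'None' is the problem. The quotes turn it into a string instead of the None object, and that string ends up being compared with an integer. Remove the quotes and use: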
kpca = KernelPCA(n_components = None, kernel = 'rbf')
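The difference between the string and the None object can be reproduced in plain Python. This is a minimal sketch that assumes the library compares n_components with an integer somewhere internally, which the error message suggests. The exact call site is not shown in the question, and the value 5 is arbitrary.

```python
# The quoted value is an ordinary string, not the None object
n_components = 'None'

message = None
try:
    # Stand-in for an internal numeric comparison; 5 is an arbitrary integer
    n_components < 5
except TypeError as exc:
    message = str(exc)

# The unquoted value is the None singleton, which code can test with "is None"
keeps_default = None is None
```

Here message holds the same text as the error in the question, which points at the quoted parameter and not at the encoded data.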
I suspect this is what is happening:
The error may come from an included file, or from some code that runs before your own code. The "<" in the TypeError message would then refer to a string such as "<error>", returned by something that ran before your code.

Healpy map2alm and alm2map inconsistency?

I'm just starting to work with Healpy and have noticed that if I use a map to get alm's and then use those alm's to generate a new map, I do not get the map I started with. Here's what I'm looking at:
import numpy as np
import healpy as hp
nside = 2 # healpix nside parameter
m = np.arange(hp.nside2npix(nside)) # create a map to test
alm = hp.map2alm(m) # compute alm's
new_map = hp.alm2map(alm, nside) # create new map from computed alm's
# Let's look at two maps
print(m)
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47] # as expected
print(new_map)
[-23.30522233 -22.54434515 -21.50906755 -20.09203749 -19.48841773
-18.66392484 -16.99593867 -16.789984 -15.14587061 -14.57960049
-13.4403252 -13.35992138 -10.51368725 -10.49793946 -10.1262039
-8.6340571 -7.41789272 -6.87712224 -5.75765487 -3.75121764
-4.35825512 -1.6221964 -1.03902923 -0.41478954 0.52480646
2.34629955 2.1511705 2.40325268 5.39576497 5.38390848
5.78324832 7.24779083 8.4915595 9.0047257 10.15179735
12.1306303 12.62672772 13.4512206 15.11920678 15.32516145
16.96927483 17.53554496 18.67482024 18.75522407 20.42078855
21.18166574 22.21694334 23.6339734 ] # not what I was expecting
As you can see, new_map doesn't match the input map, m. I imagine there's some subtlety to these functions that I'm missing. Any idea?
I get a different result:
print(new_map)
[ 0.15859344, 0.91947062, 1.95474822, 3.37177828,
4.01808325, 4.84257613, 6.51056231, 6.71651698,
8.36063036, 8.92690049, 10.06617577, 10.1465796 ,
12.98620654, 13.00668621, 13.3736899 , 14.87056857,
16.08200108, 16.62750343, 17.74223892, 19.75340803,
19.13441288, 21.8704716 , 22.45363877, 23.07787846,
24.01747446, 25.83896755, 25.6438385 , 25.89592068,
28.89565876, 28.88853415, 29.28314212, 30.7524165 ,
31.9914533 , 32.50935137, 33.65169114, 35.63525597,
36.13322869, 36.95772158, 38.62570775, 38.83166242,
40.47577581, 41.04204594, 42.18132122, 42.26172504,
43.88460433, 44.64548151, 45.68075911, 47.09778917]
Older versions of healpy automatically removed a constant offset from the map before the transformation, so it is better to update healpy to the latest version.
The residual difference arises because the pixelization introduces an error, and this error is larger at low nside.
