I have a dataset like below:
In this dataset the first column is the id of a person, the last column is that person's label, and the remaining columns are the person's features.
101 166 633.0999756 557.5 71.80000305 60.40000153 2.799999952 1 1 -1
101 133 636.2000122 504.3999939 71 56.5 2.799999952 1 2 -1
105 465 663.5 493.7000122 82.80000305 66.40000153 3.299999952 10 3 -1
105 133 635.5999756 495.6000061 89 72 3.599999905 9 6 -1
105 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 -1
105 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 -1
105 99 615.5999756 575.7000122 80 67 3.200000048 0 0 -1
120 399 617.7000122 583.5 95.80000305 82.40000153 3.799999952 8 10 1
120 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 1
120 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 1
120 99 615.5999756 575.7000122 80 67 3.200000048 0 0 1
My aim is to classify these people, and I want to use a leave-one-person-out split: choose one person, take all of that person's data as the test set, and use everyone else's data for training. But when I try to select the test data with a list assignment, I get an error. This is my code:
import numpy as np

datasets = ["raw_fixationData.txt"]
file_name_array = [101, 105, 120]
for data in datasets:
    data = np.genfromtxt(data, delimiter="\t")
    data = data[1:, :]
    num_line = len(data[:, 1]) - 1
    num_feat = len(data[1, :]) - 2
    label = num_feat + 1
    X = data[0:num_line+1, 1:label]
    y = data[0:num_line+1, label]
    test_prtcpnt = []; test_prtcpnt_label = []; train_prtcpnt = []; train_prtcpnt_label = []
    for i in range(len(file_name_array)):
        m = 0  # test index
        n = 0  # train index
        for j in range(num_line):
            if X[j, 0] == file_name_array[i]:
                test_prtcpnt[m, 0:10] = X[j, 0:10]
                test_prtcpnt_label[m] = y[j]
                m = m + 1
            else:
                train_prtcpnt[n, 0:10] = X[j, 0:10]
                train_prtcpnt_label[n] = y[j]
                n = n + 1
This code gives me this error: test_prtcpnt[m,0:10]=X[j,0:10]; TypeError: list indices must be integers or slices, not tuple
How could I solve this problem?
I think you are misusing Python's slice notation. Please refer to the following Stack Overflow post on slicing:
Explain Python's slice notation
In this case, test_prtcpnt is a plain Python list, so the interpreter treats the index m,0:10 as a tuple; only NumPy arrays accept that kind of multi-dimensional index. Is it possible that you meant to say the following:
test_prtcpnt[0:10]=X[0:10]
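If the underlying goal is the leave-one-person-out split itself, one way to sidestep list indexing entirely is to keep everything in NumPy arrays and select rows with a boolean mask. A minimal sketch, assuming the layout described in the question (person id in the first column, label in the last):

import numpy as np

data = np.genfromtxt("raw_fixationData.txt", delimiter="\t")
data = data[1:, :]      # drop the header row, as in the original code
ids = data[:, 0]        # person id column
X = data[:, 1:-1]       # feature columns
y = data[:, -1]         # label column

for person in np.unique(ids):
    mask = ids == person                   # rows belonging to this person
    X_test, y_test = X[mask], y[mask]      # all of this person's data
    X_train, y_train = X[~mask], y[~mask]  # everyone else's data
    # fit a classifier on (X_train, y_train), evaluate on (X_test, y_test)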
I have a huge data frame.
I am using a for loop in the below sample code:
for i in range(1, len(df_A2C), 1):
    A2C_TT = df_A2C.loc[df_A2C['TO_ID'] == i].sort_values('DURATION_H').head(1)
    if A2C_TT.size > 0:
        print(A2C_TT)
This is working fine, but I want to use df.iterrows() since it will automatically avoid the empty-frame issue.
I want to iterate through the TO_ID values and find the minimum DURATION_H for each.
How should I replace my classical i loop counter with df.iterrows()?
Sample Data:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528555556 38.4398
2 26 0.512511111 37.38515
3 71 0.432452778 32.57571
4 83 0.599486111 39.26188
5 98 0.590516667 35.53107
6 108 1.077794444 76.79874
7 139 0.838972222 58.86963
8 146 1.185088889 76.39174
9 158 0.625872222 45.6373
10 208 0.500122222 31.85239
11 209 0.530916667 29.50249
12 221 0.945444444 62.69099
13 224 1.080883333 66.06291
14 240 0.734269444 48.1778
15 272 0.822875 57.5008
16 349 1.171163889 76.43536
17 350 1.080097222 71.16137
18 412 0.503583333 38.19685
19 416 1.144961111 74.35502
As far as I understand your question, you want to group your data by TO_ID and select the row where DURATION_H is smallest. Is that right?
df.loc[df.groupby('TO_ID').DURATION_H.idxmin()]
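To make the behavior concrete, here is a small self-contained demo on hypothetical data with a repeated TO_ID (column names match the sample above):

import pandas as pd

# hypothetical rows with a duplicated TO_ID, to show one row kept per group
df = pd.DataFrame({'FROM_ID': [1, 2, 3],
                   'TO_ID': [7, 7, 26],
                   'DURATION_H': [0.53, 0.43, 0.51],
                   'DIST_KM': [38.4, 25.5, 37.4]})

# idxmin returns, for each TO_ID group, the index label of the row with the
# smallest DURATION_H; df.loc then selects exactly those rows
print(df.loc[df.groupby('TO_ID').DURATION_H.idxmin()])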
Here is one way to do it:
# loop only over the unique TO_ID values, instead of iterrows,
# which would run over every row of the DataFrame
for idx in np.unique(df['TO_ID']):
    A2C_TT = df.loc[df['TO_ID'] == idx].sort_values('DURATION_H').head(1)
    print(A2C_TT)
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
Here is another way to do it:
df.loc[df['DURATION_H'].eq(df.groupby('TO_ID')['DURATION_H'].transform(min))]
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
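One practical difference between the two approaches: idxmin keeps exactly one row per TO_ID even when several rows tie for the minimum DURATION_H, while the transform(min) comparison keeps every tied row.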
I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so there is no need to join the data back together after computing the prediction. Thanks!
Use rolling with a window of 3 and min_periods of 1, then shift the result down one row so that each day's prediction uses only the previous days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
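Since the question asks for an arbitrary function rather than just the mean, the same pattern generalizes with rolling().apply(). A sketch, reusing the df above with a hypothetical weighted mean as the stand-in function (the column name weighted_prediction is made up for illustration):

import numpy as np

# rolling().apply() receives each window (1, 2, or 3 values here, thanks to
# min_periods=1); any reduction of the window can go in the lambda
df['weighted_prediction'] = (df['temp']
    .rolling(window=3, min_periods=1)
    .apply(lambda w: np.average(w, weights=np.arange(1, len(w) + 1)))
    .shift())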
I'm trying to normalize a Pandas DF by row, and there's a column with string values that is causing me a lot of trouble. Does anyone have a neat way to make this work?
For example:
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 111 28 219 98 0 133
18 hyp.metricsystem1 97 22 242 84 0 137
22 hyp.metricsystem5 107 11 246 85 0 127
17 hyp.eTranslation 49 30 262 80 0 143
20 hyp.metricsystem3 86 23 263 89 0 118
21 hyp.metricsystem4 74 17 274 70 0 111
I am trying to normalize each row across the numeric columns (Fluency, Terminology, ..., Other) over the row total. In other words, divide each integer column entry by the total of its row (Fluency[0]/total_row[0], Terminology[0]/total_row[0], ...).
I tried using this command, but it gives me an error because I have a column of strings:
bad_models.div(bad_models.sum(axis=1), axis = 0)
Any help would be greatly appreciated...
Use select_dtypes to select the numeric-only columns:
subset = bad_models.select_dtypes('number')
bad_models[subset.columns] = subset.div(subset.sum(axis=1), axis=0)
print(bad_models)
# Output
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 0.211832 0.21374 0.145418 0.193676 0 0.172952
18 hyp.metricsystem1 0.185115 0.167939 0.160691 0.166008 0 0.178153
22 hyp.metricsystem5 0.204198 0.083969 0.163347 0.167984 0 0.16515
17 hyp.eTranslation 0.093511 0.229008 0.173971 0.158103 0 0.185956
20 hyp.metricsystem3 0.164122 0.175573 0.174635 0.175889 0 0.153446
21 hyp.metricsystem4 0.141221 0.129771 0.181939 0.13834 0 0.144343
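Under the same assumptions, an equivalent way to write the normalization back is DataFrame.update, which aligns on index and columns and so overwrites only the numeric columns (a sketch):

numeric = bad_models.select_dtypes('number')
bad_models.update(numeric.div(numeric.sum(axis=1), axis=0))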
I want to search for a number in a column; if the number is found, the corresponding path should be stored in a list. My target number is 159, and my expected output is the corresponding repeated paths, since the number 159 appears 151 times. But the code shows the error: argument of type 'numpy.int64' is not iterable. How can I fix it?
i = 0
charpath0 = []
item = '159'
for x in range(len(df)):
    if item in df.grapheme_root[x]:
        s = df.path[x]
        charpath0.append(s)
        i = i + 1
The csv file looks like this:
image_id grapheme_root vowel_diacritic consonant_diacritic grapheme path Total_graphemes
0 Train_0 15 9 5 ক্ট্রো ../input/fulldata/train_images128/train_images... 164
1 Train_1 159 0 0 হ ../input/fulldata/train_images128/train_images... 151
2 Train_2 22 3 5 খ্রী ../input/fulldata/train_images128/train_images... 143
3 Train_3 53 2 2 র্টি ../input/fulldata/train_images128/train_images... 162
4 Train_4 71 9 5 থ্রো ../input/fulldata/train_images128/train_images... 164
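The error comes from the membership test: df.grapheme_root[x] is a single numpy.int64, and the in operator only works on iterables such as strings or lists. Comparing for equality with an integer instead, or filtering in one vectorized step, avoids the problem; a minimal sketch, assuming the frame shown above:

# equality test instead of membership test
charpath0 = []
for x in range(len(df)):
    if df.grapheme_root[x] == 159:   # compare as an int, not 'in' a string
        charpath0.append(df.path[x])

# or, vectorized, without an explicit loop
charpath0 = df.loc[df['grapheme_root'] == 159, 'path'].tolist()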
I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with the distance values shown in matrix_df.
The program iterates cell by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being processed in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row,
then use a join on column B between inp_df and matrix_df, where the matrix_df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a more pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside the iteration that fetches the matches, since the number of columns in matrix_df varies from cell to cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, the first column of each matrix_df was empty (they are header-less matrix output files), so I renamed it with the following code for readability:
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
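Since the question mentions millions of rows, a fully vectorized alternative is to reshape each matrix_df from wide to long with melt and then do a single merge, avoiding the row-wise apply. A sketch, assuming the matrix frames are loaded as above with their distance columns named by the string headers:

# wide -> long: one row per (B, A) pair, tagged with its cell number
long_df = pd.concat([
    matrix_df1.melt(id_vars=['B', 'avg_distance'], var_name='A',
                    value_name='distance').assign(cell=1),
    matrix_df2.melt(id_vars=['B', 'avg_distance'], var_name='A',
                    value_name='distance').assign(cell=2),
])
long_df['A'] = long_df['A'].astype(int)  # melt leaves the headers as strings

# a single merge replaces the per-row column lookup
out_df = inp_df.merge(long_df, on=['A', 'B', 'cell'])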