I want to loop through my dataset and replace the values of specific columns with one and the same value.
The whole dataset has 91164 rows.
In this case I need to replace vec_red, vec_green, vec_blue with new_data.
new_data has a shape of (91164,) and its number of entries equals the length of my dataframe's index.
For example, the last item is:
This 1 needs to be the value in vec_red, vec_blue, vec_green.
So I want to loop through the whole dataframe and replace the values in columns 3 to 5.
What I have is :
label_idx = 0
for i in range(321):
    for j in range(284):
        (sth here) = new_data[label_idx]
        label_idx += 1
What is happening here is that I am updating my pixel values after filtration. Thank you.
The shape of 91164 is the result of the multiplication 321 * 284; these are the pixel values of an RGB image.
Looping over rows of a dataframe is a code smell. If the 3 columns must receive the same values, you can do it in one single operation:
df[['vec_red', 'vec_green', 'vec_blue']] = np.transpose(
    np.array([new_data, new_data, new_data]))
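A slightly more concise equivalent, as a sketch (assuming new_data is a one-dimensional NumPy array), is to stack the same column three times:
import numpy as np

# Builds the same (n, 3) array whose three columns are all new_data.
df[['vec_red', 'vec_green', 'vec_blue']] = np.column_stack([new_data] * 3)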
Demo:
import numpy as np
import pandas as pd

np.random.seed(0)
nx = 284
ny = 321
df = pd.DataFrame({'x_indices': [i for j in range(ny) for i in range(nx)],
                   'y_indices': [j for j in range(ny) for i in range(nx)],
                   'vec_red': np.random.randint(0, 256, nx * ny),
                   'vec_green': np.random.randint(0, 256, nx * ny),
                   'vec_blue': np.random.randint(0, 256, nx * ny)})
new_data = np.random.randint(0, 256, nx * ny)
print(df)
print(new_data)
df[['vec_red', 'vec_green', 'vec_blue']] = np.transpose(
    np.array([new_data, new_data, new_data]))
print(df)
It gives as expected:
x_indices y_indices vec_red vec_green vec_blue
0 0 0 172 167 100
1 1 0 47 92 124
2 2 0 117 65 174
3 3 0 192 249 72
4 4 0 67 108 144
... ... ... ... ... ...
91159 279 320 16 162 42
91160 280 320 142 169 145
91161 281 320 225 81 143
91162 282 320 106 93 68
91163 283 320 85 65 130
[91164 rows x 5 columns]
[ 32 48 245 ... 26 66 58]
x_indices y_indices vec_red vec_green vec_blue
0 0 0 32 32 32
1 1 0 48 48 48
2 2 0 245 245 245
3 3 0 6 6 6
4 4 0 178 178 178
... ... ... ... ... ...
91159 279 320 27 27 27
91160 280 320 118 118 118
91161 281 320 26 26 26
91162 282 320 66 66 66
91163 283 320 58 58 58
[91164 rows x 5 columns]
What is the cleanest way to return the value closest to [reference] from the [ABCD] columns?
The output is the closest value. E.g. for the first row, the absolute deltas are [19 40 45 95], so the closest value to return is -21.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(-100, 300, size=(100, 4)), columns=list('ABCD'))  # generate a random dataframe
df2 = pd.DataFrame(np.random.randint(-100, 100, size=(100, 1)), columns=['reference'])
df = pd.concat([df1, df2], axis=1)
df['closest_value'] = "?"
df
You can apply a lambda function over the rows and get the closest value from the desired columns based on the absolute difference from the reference column:
# Exclude both 'reference' and the placeholder 'closest_value' column,
# and take the argmin over the same subset that we index into.
value_cols = [c for c in df.columns if c not in ('reference', 'closest_value')]
df['closest_value'] = df.apply(
    lambda x: x[value_cols].values[np.abs(x[value_cols].values - x['reference']).argmin()],
    axis=1)
OUTPUT:
A B C D reference closest_value
0 -2 227 -88 268 -68 -88
1 185 182 18 279 -59 18
2 140 40 264 98 61 40
3 0 98 -32 81 47 81
4 -6 70 -6 -9 -53 -9
.. ... ... ... ... ... ...
95 -29 -34 141 166 -76 -34
96 14 22 175 205 69 22
97 265 11 -25 284 -88 -25
98 283 31 -91 252 11 31
99 6 -59 84 95 -15 6
[100 rows x 6 columns]
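Note that apply with axis=1 calls the lambda once per row in Python, which is fine for 100 rows but slow on large frames; the vectorized approaches below avoid that per-row overhead.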
Try this:
idx = df.drop(['reference'], axis=1).sub(df.reference, axis=0).abs().idxmin(1)
df['closest_value'] = df.lookup(df.index, idx)
>>> display(df)
Edit:
Since pandas.DataFrame.lookup is deprecated (and removed in pandas 2.0), you can replace this line:
df['closest_value'] = df.lookup(df.index, idx)
by these:
out = df.set_index(idx, append=True)
out['closest_value'] = df.stack()
The cleanest way: using a conversion to numpy.
data = df[list('ABCD')].to_numpy()
reference = df[['reference']].to_numpy()
indices = np.abs(data - reference).argmin(axis=1)
df['closest_value'] = data[np.arange(len(data)), indices]
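Note that df[['reference']] (with double brackets) keeps a shape of (n, 1), so data - reference broadcasts the subtraction across all four columns.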
Result:
A B C D reference closest_value
0 -60 254 80 -46 89 80
1 5 10 72 259 41 10
2 219 14 269 -70 0 14
3 171 36 132 45 -55 36
4 7 233 -65 231 -76 -65
.. ... ... ... ... ... ...
95 229 213 -54 129 62 129
96 16 -26 -30 79 94 79
97 105 157 -3 148 -48 -3
98 -27 60 218 273 62 60
99 140 131 -49 28 -46 -49
[100 rows x 6 columns]
I have this dataframe:
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
0 1.478196e+09 219 128 220 27 141 193 95 50
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
... ... ... ... ... ... ... ... ... ...
242 1.478198e+09 15 133 112 2 236 81 94 252
243 1.478198e+09 0 123 163 160 13 156 145 32
244 1.478198e+09 83 147 61 61 33 199 147 110
245 1.478198e+09 172 95 87 220 226 99 108 176
246 1.478198e+09 123 240 180 145 132 213 47 60
I need to create a temporal features like this:
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
0 1.478196e+09 219 128 220 27 141 193 95 50
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
1 1.478196e+09 95 237 27 121 90 194 232 137
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
5 1.478196e+09 121 69 111 204 134 92 51 190
Timestamp DATA0 DATA1 DATA2 DATA3 DATA4 DATA5 DATA6 DATA7
2 1.478196e+09 193 22 103 217 138 195 153 172
3 1.478196e+09 181 120 186 73 120 239 121 218
4 1.478196e+09 70 194 36 16 81 129 95 217
5 1.478196e+09 121 69 111 204 134 92 51 190
6 1.478196e+09 199 132 39 197 159 242 153 104
How can I do this automatically? What structure should I use, and which functions?
I was told that the dataframe should become an array of arrays, but that's not very clear to me.
If I understand it correctly, you want e.g. a list of dataframes, where each dataframe is a progressing slice of the original frame. This example would give you a list of dataframes:
import pandas as pd
# dummy dataframe
df = pd.DataFrame({'col_1': range(10), 'col_2': range(10)})
# returns slices of size slice_length with step size 1 (+ 1 so that the final full slice is included)
slice_length = 5
lst = [df.iloc[i:i + slice_length, :] for i in range(df.shape[0] - slice_length + 1)]
Please note that you are duplicating a lot of data and thus increasing memory usage. If you merely have to perform an operation on subsequent slices, you are better off looping over the dataframe and applying your function. Even better, if possible, you should try to vectorize your operation, as this will likely make a huge difference in performance; see the sketch below.
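For purely numeric frames, NumPy's sliding_window_view can produce all the windows at once without copying any data. A minimal sketch, assuming df contains only numeric columns:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'col_1': range(10), 'col_2': range(10)})
slice_length = 5

# One read-only view of shape (n_windows, slice_length, n_columns); no copies are made.
windows = sliding_window_view(df.to_numpy(), (slice_length, df.shape[1])).squeeze(axis=1)
print(windows.shape)  # (6, 5, 2)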
EDIT: saving the slices to file:
If you're only interested in saving the slices to file (e.g. in a csv), you don't need to first create a list of all slices (with the associated memory usage). Instead, loop over the slices (by looping over the starting indices that define each slice), and save each slice to file.
slice_length = 5
# loop over indices (i.e. slices)
for idx_from in range(df.shape[0] - slice_length + 1):
    # create the slice and write it to file
    df.iloc[idx_from:idx_from + slice_length, :].to_csv(f'slice_starting_idx_{idx_from}.csv', sep=';', index=False)
Hi, I have tried this, which might match your expectations, based on indexes:
import numpy as np
import pandas as pd
x = np.array([[8, 9], [2, 3], [9, 10], [25, 78], [56, 67], [56, 67], [72, 12], [98, 24],
              [8, 9], [2, 3], [9, 10], [25, 78], [56, 67], [56, 67], [72, 12], [98, 24]])
df = pd.DataFrame(np.reshape(x, (16, 2)), columns=['Col1', 'Col2'])
print(df)
print("**********************************")

count = df['Col1'].count()  # number of rows in the dataframe
i = 0       # start index for every iteration
n = 4       # end index for every iteration
count2 = 3  # important: if you want 4 rows, set count2 to 4 - 1 = 3; for 5 rows, set it to 5 - 1 = 4
while count != 0:            # loop until count is decremented to 0
    df1 = df[i:n]            # first iteration i=0, n=4; second iteration i=4, n=8; and so on
    if i > 0:
        print(df1.set_index(np.arange(i - count2, n - count2)))
        count2 = count2 + 3  # increment count2 so the index runs 0 to 3, then 1 to 4, and so on
    else:
        print(df1.set_index(np.arange(i, n)))
    i = n
    count = count - 4
    n = n + 4
First output (the original dataframe):
Col1 Col2
0 8 9
1 2 3
2 9 10
3 25 78
4 56 67
5 56 67
6 72 12
7 98 24
8 8 9
9 2 3
10 9 10
11 25 78
12 56 67
13 56 67
14 72 12
15 98 24
Final output:
Col1 Col2
0 8 9
1 2 3
2 9 10
3 25 78
Col1 Col2
1 56 67
2 56 67
3 72 12
4 98 24
Col1 Col2
2 8 9
3 2 3
4 9 10
5 25 78
Col1 Col2
3 56 67
4 56 67
5 72 12
6 98 24
Note: I am also new to Python; there may be shorter ways to achieve the expected output, such as the sketch below.
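For reference, a shorter way to produce the same fixed-size chunks (a sketch; unlike the loop above, it simply reindexes every chunk from 0):
chunk_size = 4
# Iterate over the frame in consecutive chunks of chunk_size rows.
for start in range(0, len(df), chunk_size):
    print(df.iloc[start:start + chunk_size].reset_index(drop=True))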
I asked a question about the same problem earlier, but because my approach has changed I now have different questions.
My current code:
from sklearn import preprocessing
from openpyxl import load_workbook
import numpy as np
from numpy import exp, array, random, dot
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#Set sizes
rowSize = 200
numColumns = 4
# read from excel file
wb = load_workbook('python_excel_read.xlsx')
sheet_1 = wb["Sheet1"]
date = np.zeros(rowSize)
day = np.zeros(rowSize)
rain = np.zeros(rowSize)
temp = np.zeros(rowSize)
out = np.zeros(rowSize)
for i in range(0, rowSize):
    date[i] = sheet_1.cell(row=i + 1, column=1).value
    day[i] = sheet_1.cell(row=i + 1, column=2).value
    rain[i] = sheet_1.cell(row=i + 1, column=3).value
    temp[i] = sheet_1.cell(row=i + 1, column=4).value
    out[i] = sheet_1.cell(row=i + 1, column=5).value
train = np.zeros(shape=(rowSize, numColumns))
t_o = np.zeros(shape=(rowSize, 1))
for i in range(0, rowSize):
    train[i] = [date[i], day[i], rain[i], temp[i]]
    t_o[i] = [out[i]]
X = train
# Output
y = t_o
X_train, X_test, y_train, y_test = train_test_split(X, y)
####Neural Net
nn = MLPRegressor(
    hidden_layer_sizes=(3,), activation='relu', solver='adam', alpha=0.001, batch_size='auto',
    learning_rate='constant', learning_rate_init=0.01, power_t=0.5, max_iter=10000, shuffle=True,
    random_state=9, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
    early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
nn.fit(X_train, y_train.ravel())
y_pred = nn.predict(X_test)
###Linear Regression
# lm = LinearRegression()
# lm.fit(X_train,y_train)
# y_pred = lm.predict(X_test)
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(X_test[:, 0], y_pred, s=1, c='b', marker="s", label='NN Prediction')
ax1.scatter(X_test[:, 0], y_test, s=10, c='r', marker="o", label='real')
plt.show()
# Calc MSE; ravel y_test so the subtraction is elementwise rather than broadcast to an (n, n) matrix
mse = np.square(y_test.ravel() - y_pred).mean()
print(mse)
The results from this show a pretty bad prediction of the test data. Because I am new to this, I am not sure if it is my data, the model, or my coding. Based on the plot, I believe the model is wrong for the data (the model seems to predict something nearly linear or quadratic, while the actual data seems much more spread out).
Here are some of the data points:
Formatted as: day of year (2 is Jan 2nd), weekday (1) / weekend (0), rain (1) / no rain (0), temp in °F, attendance (this is the output).
2 0 0 51 1366
4 0 0 62 538
5 1 0 71 317
6 1 0 76 174
7 1 0 78 176
8 1 0 68 220
12 1 1 64 256
13 1 1 60 379
14 1 0 64 316
18 0 0 72 758
19 1 0 72 1038
20 1 0 72 405
21 1 0 71 326
24 0 0 74 867
26 1 1 68 521
27 1 0 71 381
28 1 0 72 343
29 1 1 68 266
30 0 1 57 479
31 0 1 57 717
33 1 0 70 542
34 1 0 73 220
35 1 0 74 360
36 1 0 79 444
42 1 0 78 534
45 0 0 80 1572
52 0 0 76 1236
55 1 1 64 689
56 1 0 69 726
59 0 0 67 1188
60 0 0 74 1140
61 1 1 63 979
62 1 1 62 657
63 1 0 67 687
64 1 0 72 615
67 0 0 80 1074
68 1 0 81 1261
71 1 0 83 1332
73 0 0 85 1259
74 0 0 86 1142
76 1 0 88 1207
77 1 1 78 1438
82 1 0 85 1251
83 1 0 83 1019
85 1 0 86 1178
86 0 0 92 1306
87 0 0 92 1273
89 1 0 93 1101
90 1 0 92 1274
93 0 0 83 1548
94 0 0 86 1318
96 1 0 83 1395
97 1 0 81 1338
98 1 0 75 1240
100 0 0 84 1335
102 0 0 83 931
103 1 0 87 746
104 1 0 91 746
105 1 0 81 600
106 1 0 72 852
108 0 1 87 1204
109 0 0 89 1191
110 1 0 90 769
111 1 0 88 642
112 1 0 86 743
114 0 1 75 1085
115 0 1 78 1109
117 1 0 84 871
120 1 0 96 599
123 0 0 93 651
129 0 0 74 1325
133 1 0 88 637
134 1 0 84 470
135 0 1 73 980
136 0 0 72 1096
137 0 0 83 792
138 1 0 87 565
139 1 0 84 501
141 1 0 88 615
142 0 0 79 722
143 0 0 80 1363
144 0 0 82 1506
146 1 0 93 626
147 1 0 94 415
148 1 0 95 596
149 0 0 100 532
150 0 0 102 784
154 1 0 99 514
155 1 0 94 495
156 0 1 87 689
157 0 1 94 931
158 0 0 97 618
161 1 0 92 451
162 1 0 97 574
164 0 0 102 898
165 0 0 104 746
166 1 0 109 587
167 1 0 109 465
174 1 0 108 514
175 1 0 109 572
179 0 0 107 811
181 1 0 104 423
182 1 0 103 526
184 0 1 97 849
185 0 0 103 852
189 1 0 106 728
191 0 0 101 577
194 1 0 105 511
198 0 1 101 616
199 0 1 97 1056
200 0 0 94 740
202 1 0 103 498
205 0 0 101 610
206 0 0 106 944
207 0 0 105 769
208 1 0 103 551
209 1 0 103 624
210 1 0 97 513
212 0 1 107 561
213 0 0 100 905
214 0 0 105 767
215 1 0 107 510
216 1 0 108 406
217 1 0 109 439
218 1 0 103 427
219 0 1 104 460
224 1 0 105 213
227 0 0 112 834
228 0 0 109 615
229 1 0 105 216
230 1 0 104 213
231 1 0 104 256
232 1 0 104 282
235 0 0 104 569
238 1 0 103 165
239 1 1 105 176
241 0 1 108 727
242 0 1 105 652
243 1 1 103 231
244 1 0 96 117
245 1 1 98 168
246 1 1 97 113
247 0 0 95 227
248 0 0 92 1050
249 0 0 101 1274
250 1 1 95 1148
254 0 0 99 180
255 0 0 104 557
258 1 0 94 228
260 1 0 95 133
263 0 0 100 511
264 1 1 89 249
265 1 1 90 245
267 1 0 101 390
272 1 0 100 223
273 1 0 103 194
274 1 0 103 150
275 0 0 95 224
276 0 0 92 705
277 0 1 92 504
279 1 1 77 331
281 1 0 89 268
284 0 0 95 566
285 1 0 94 579
286 1 0 95 420
288 1 0 93 392
289 0 1 94 525
290 0 1 86 670
291 0 1 89 488
294 1 1 74 295
296 0 0 81 314
299 1 0 88 211
301 1 0 84 246
303 0 1 76 433
304 0 0 80 216
307 1 1 80 275
308 1 1 66 319
312 0 0 80 413
313 1 0 78 278
316 1 0 74 305
320 1 1 57 323
324 0 0 76 220
326 0 0 77 461
327 1 0 78 510
331 0 0 60 1701
334 1 0 58 237
335 1 0 62 355
336 1 0 68 266
338 0 0 70 246
342 1 0 72 109
343 1 0 70 103
347 0 0 58 486
349 1 0 52 144
350 1 0 53 209
351 1 0 55 289
354 0 0 62 707
355 1 0 59 903
359 0 0 58 481
360 0 0 53 1342
364 1 0 57 1624
I have over a thousand data points in total, but I'm not using them all for training/testing. One thought is that I need more; another is that I need more factors, because temp/rain/day of week does not affect attendance enough.
Here is the plot:
What can I do to make my model more accurate and give better predictions?
Thanks
EDIT: I added more data points and another factor. I can't seem to upload the excel file, so I put the data on here with a better explanation of how it is formatted.
EDIT:
Here is the most recent code:
from sklearn import preprocessing
from openpyxl import load_workbook
import numpy as np
from numpy import exp, array, random, dot
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import LeaveOneOut
#Set sizes
rowSize = 500
numColumns = 254
# read from excel file
wb = load_workbook('python_excel_read.xlsx')
sheet_1 = wb["Sheet1"]
input = np.zeros(shape=(rowSize,numColumns))
out = np.zeros(rowSize)
for i in range(0, rowSize):
    for j in range(0, numColumns):
        input[i, j] = sheet_1.cell(row=i + 1, column=j + 1).value
    out[i] = sheet_1.cell(row=i + 1, column=numColumns + 1).value
output = np.zeros(shape=(rowSize, 1))
for i in range(0, rowSize):
    output[i] = [out[i]]
X = input
# Output
y = output
print(X)
print(y)
y[y < 500] = 0
y[np.logical_and(y >= 500, y <= 1000)] = 1
y[np.logical_and(y > 1000, y <= 1200)] = 2
y[y > 1200] = 3
# Use cross-validation
#kf = KFold(n_splits = 10, random_state=0)
loo = LeaveOneOut()
# Try different models
clf = svm.SVC()
scaler = StandardScaler()
pipe = Pipeline([('scaler', scaler), ('svc', clf)])
accuracy = cross_val_score(pipe, X, y.ravel(), cv = loo, scoring = "accuracy")
print(accuracy.mean())
#y_pred = cross_val_predict(clf, X, y.ravel(), cv = kf)
#cm = confusion_matrix(y, y_pred)
And here is the up-to-date data with as many features as I could add. Note this is a random sample from the full data:
Link to sample data
Current output:
0.6230954290296712
My ultimate goal is to achieve 90% or higher accuracy... I don't believe I can find more features, but will continue to gather as many as possible if helpful.
Your question is really general; however, I have some suggestions. You could use cross-validation and try different models. Personally, I would try SVR, random forests, and as a last choice an MLPRegressor.
I modified your code a bit to show a simple example:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import LeaveOneOut
import pandas as pd
from sklearn.decomposition import PCA
# read the data
df = pd.read_excel('python_excel_read.xlsx', header = None)
rows, cols = df.shape
X = df.iloc[: , 0:(cols - 1)]
y = df.iloc[: , cols - 1 ]
print(X.shape)
print(y.shape)
y[y < 500] = 0
y[np.logical_and(y >= 500, y <= 1000)] = 1
y[np.logical_and(y > 1000, y <= 1200)] = 2
y[y > 1200] = 3
print(np.unique(y))
# We can apply PCA to reduce the dimensions of the data
# pca = PCA(n_components=2)
# pca.fit(X)
# X = pca.fit_transform(X)
# Use cross-validation
#kf = KFold(n_splits = 10, random_state=0)
loo = LeaveOneOut()
# Try different models
clf = svm.SVC(kernel = 'linear')
scaler = StandardScaler()
pipe = Pipeline([('scaler', scaler), ('svc', clf)])
accuracy = cross_val_score(pipe, X, y.ravel(), cv = loo, scoring = "accuracy")
print(accuracy.mean())
#y_pred = cross_val_predict(clf, X, y.ravel(), cv = kf)
#cm = confusion_matrix(y, y_pred)
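For instance, swapping a random forest into the same cross-validation setup might look like this (a sketch; the hyperparameters are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Trees are scale-invariant, so no StandardScaler is needed in this pipeline.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
accuracy = cross_val_score(rf, X, y.ravel(), cv=10, scoring="accuracy")
print(accuracy.mean())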
I have 2 arrays of shape (128,). I want the elementwise difference between them.
for idx, x in enumerate(test):
    if idx == 0:
        print(test[idx])
        print()
        print(library[idx])
        print()
        print(np.abs(np.subtract(library[idx], test[idx])))
output:
[186 3 172 80 187 120 127 172 96 213 103 107 137 119 33 53 54 113
200 78 140 234 77 94 151 64 199 218 170 73 152 73 0 5 121 42
0 106 166 80 115 220 56 66 194 187 51 132 55 73 150 83 91 204
108 58 183 0 32 240 255 55 151 255 189 153 77 89 42 176 204 170
93 117 194 195 59 204 149 55 111 255 218 48 72 171 122 163 255 155
198 179 69 173 108 0 0 176 249 214 193 255 106 116 0 47 255 255
255 255 210 175 67 0 95 120 21 158 0 72 120 255 121 208 255 0
61 255]
[189 0 178 72 177 124 123 167 81 235 110 123 139 107 39 54 34 102
195 59 156 255 66 112 161 65 180 236 181 69 142 82 0 0 152 38
0 102 146 86 117 230 59 77 220 182 44 121 63 59 146 41 92 213
146 70 184 0 0 255 255 42 165 255 245 152 114 88 63 138 255 158
96 141 221 201 47 191 179 42 156 255 237 7 136 168 133 142 254 164
236 250 56 202 141 0 0 197 255 184 212 255 108 133 0 7 255 255
255 255 243 197 74 0 50 143 24 175 0 74 101 255 121 207 255 0
146 255]
[ 3 253 6 248 246 4 252 251 241 22 7 16 2 244 6 1 236 245
251 237 16 21 245 18 10 1 237 18 11 252 246 9 0 251 31 252
0 252 236 6 2 10 3 11 26 251 249 245 8 242 252 214 1 9
38 12 1 0 224 15 0 243 14 0 56 255 37 255 21 218 51 244
3 24 27 6 244 243 30 243 45 0 19 215 64 253 11 235 255 9
38 71 243 29 33 0 0 21 6 226 19 0 2 17 0 216 0 0
0 0 33 22 7 0 211 23 3 17 0 2 237 0 0 255 0 0
85 0]
So reading it, the last array printed out is the difference between the first two arrays:
189 - 186 is 3
3 - 0 is 3 (not 253)
I must be missing something trivial.
I'd rather not zip and subtract the values pairwise, as I have a ton of data.
Your arrays probably have dtype uint8; they cannot hold values outside the interval [0, 256), and subtracting 3 from 0 wraps around to 253. The absolute value of 253 is still 253.
Use a different dtype, or restructure your computation to avoid hitting the limits of the dtype you're using.
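A minimal demonstration of the wraparound, and one possible fix (a sketch):
import numpy as np

a = np.array([3], dtype=np.uint8)
b = np.array([0], dtype=np.uint8)
print(np.abs(b - a))  # [253] -- 0 - 3 wraps around in uint8

# Cast to a wider signed dtype first to get the true difference:
print(np.abs(b.astype(np.int16) - a.astype(np.int16)))  # [3]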
You can simply subtract two numpy arrays like this; it is an element-wise operation:
>>> test = np.array([1, 2, 3])
>>> library = np.array([1, 1, 1])
>>> np.abs(library - test)
array([0, 1, 2])
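Note that this works here because the literal arrays default to a signed integer dtype; with uint8 inputs the same subtraction would wrap around as described in the previous answer.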
I've been struggling to get something to work for the following text file format.
My overall goal is to extract the values for one of the variable names throughout the entire text file. For example, I want all the values for the [b] rows and [d] rows, then put them in a normal numpy array and run calculations.
Here is what the data file looks like:
[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48 204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48 204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION2]
That pattern continues for N sections.
Currently I read the file and put it into two columns:
filename_load = fileopenbox(msg=None, title='Load Data File',
                            default="Z:\*",
                            filetypes=None)
col1_data = np.genfromtxt(filename_load, skip_header=1, dtype=None,
                          usecols=(0,), usemask=True, invalid_raise=False)
col2_data = np.genfromtxt(filename_load, skip_header=1, dtype=None,
                          usecols=(1,), usemask=True, invalid_raise=False)
I was then going to use where to find the indices of the values I wanted, and make a new array of those values:
arr_index = np.where(col1_data == '[b]')
new_array = col2_data[arr_index]
The problem with that is that I end up with arrays of two different sizes because of the weird file format, so obviously the data in the arrays won't match up properly to the right variable names.
I have tried a few other alternatives and got stuck due to the weird text file format and how to read it into python.
I'm not sure whether I should stay on this track and, if so, how to address the problem, or try a totally different approach.
Thanks in advance!
A possible solution, sorting your data into a hierarchy of OrderedDict() dictionaries:
from collections import OrderedDict
import re
ss = """[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48 204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48 204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION2]"""
# regular expressions for matching SECTIONs
p1 = re.compile(r"^\[SECTION[0-9]+a\]")
p2 = re.compile(r"^\[SECTION[0-9]+b\]")
p3 = re.compile(r"^\[END SECTION[0-9]+\]")
def parse(ss):
    """Make a hierarchical dict from a string."""
    ll, l_cnt = ss.splitlines(), 0
    d = OrderedDict()
    while l_cnt < len(ll):  # iterate through the lines
        l = ll[l_cnt].strip()
        if p1.match(l):  # new sub-dict for [SECTION*a]
            dd, nn = OrderedDict(), l[1:-1]
            l_cnt += 1
            while (p2.match(ll[l_cnt].strip()) is None and
                   p3.match(ll[l_cnt].strip()) is None):
                ww = ll[l_cnt].split()
                dd[ww[0][1:-1]] = int(ww[1])
                l_cnt += 1
            d[nn] = dd
        elif p2.match(l):  # array of ints for [SECTION*b]
            d[l[1:-1]] = [int(w) for w in ll[l_cnt + 1].split()]
            l_cnt += 2
        elif p3.match(l):
            l_cnt += 1
        else:  # skip any unrecognized line (avoids an endless loop on e.g. blank lines)
            l_cnt += 1
    return d
dd = parse(ss)
Note that you can get much more robust code if you use an existing parsing tool (e.g., Parsley).
To retrieve '[c]' from all sections, do:
print("All entries for [c]: ", end="")
cc = [d['c'] for s,d in dd.items() if s.endswith('a')]
print(", ".join(["{}".format(c) for c in cc]))
# Gives: All entries for [c]: 873348378938, 873348378938
Or you could traverse the whole dictionary:
def print_recdicts(d, tbw=0):
    """Print the hierarchical dict."""
    for k, v in d.items():
        if type(v) is OrderedDict:
            print(" " * tbw + "* {}:".format(k))
            print_recdicts(v, tbw + 2)
        else:
            print(" " * tbw + "* {}: {}".format(k, v))
print_recdicts(dd)
# Gives:
# * SECTION1a:
# * a: 1424457484310
# * b: 5313402937
# ...
The following should do it. It uses a running store (tally) to cope with missing values, then writes the state out when hitting the end marker.
import re
import numpy as np
filename = "yourfilenamehere.txt"
# [e] 14957596088
match_line_re = re.compile(r"^\[([a-z])\]\W(\d*)")
result = {
    'b': [],
    'd': [],
}
tally_empty = dict(zip(result.keys(), [np.nan] * len(result)))
tally = tally_empty.copy()  # copy, so that resetting tally never mutates tally_empty
with open(filename, 'r') as f:
    for line in f:
        if line.startswith('[END SECTION'):
            # Write the accumulated data to the lists
            for k, v in tally.items():
                result[k].append(v)
            tally = tally_empty.copy()
        else:
            # Match the items using the regex
            m = match_line_re.search(line)
            if m:
                k, v = m.group(1), int(m.group(2))  # convert the captured digits to int
                print(k, v)
                if k in tally:
                    tally[k] = v
b = np.array(result['b'])
d = np.array(result['d'])
Note: whatever keys are in the result dict definition will be in the output.
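For the sample file above, this would give b = [5313402937, 5313402937] and d = [882992596992, 882992596992], one entry per section.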