I have 2-dimensional training data (200 rows of 4 features per file). I ran 100 different applications with 10 repetitions each, which produced 1000 CSV files. I want to stack the results of each CSV for machine learning, but I don't know how.
Each of my CSV files looks like the one below (test1.csv converted to a NumPy array):
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]
I tried the Python code below.
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
cnt = 0
for f in csv_files:
    cnt += 1
    separator = '_'
    app = os.path.basename(f).split(separator, 1)[0]
    if cnt == 1:
        a = np.array(preprocess(f))
        b = np.array(app)
    else:
        a = np.vstack((a, np.array(preprocess(f))))
        b = np.append(b, app)
print(a)
print(b)
The preprocess function returns df.to_numpy() for each CSV file.
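Roughly, preprocess is equivalent to this minimal sketch (the real function may do more cleaning):

import pandas as pd

def preprocess(path):
    # read one CSV and return its rows as a NumPy array
    return pd.read_csv(path).to_numpy()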
My expectation was a of shape (1000, 200, 4), like below:
[[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]],
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]],
...
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]]
However, I'm getting a of shape (200000, 4):
[[0 'crc32_pclmul' 445 0]
[0 'crc32_pclmul' 270 4096]
[0 'crc32_pclmul' 234 8192]
...
[249 'intel_pmt' 272 4096]
[249 'intel_pmt' 224 8192]
[249 'intel_pmt' 268 12288]]
I want to access each CSV's results using a[0] to a[999], where each sub-array has shape (200, 4).
How can I solve the problem? I'm quite lost.
Well, yes, that is what vstack (and append) do: they merge arrays along an existing axis (the rows axis).
a1 = np.arange(10).reshape(2, 5)
# [[0, 1, 2, 3, 4],
#  [5, 6, 7, 8, 9]]
a2 = np.arange(10, 20).reshape(2, 5)
# [[10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
np.vstack((a1, a2))
# [[ 0,  1,  2,  3,  4],
#  [ 5,  6,  7,  8,  9],
#  [10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
b1 = np.arange(5)
b2 = np.arange(5, 10)
np.append(b1, b2)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
If you expected (from those examples) to stack along a new axis, then you need to add it yourself, or use the more flexible stack.
np.vstack(([a1], [a2]))
# array([[[ 0,  1,  2,  3,  4],
#         [ 5,  6,  7,  8,  9]],
#
#        [[10, 11, 12, 13, 14],
#         [15, 16, 17, 18, 19]]])
Or, in the 1d case, use vstack instead of append:
np.vstack((b1, b2))
# array([[0, 1, 2, 3, 4],
#        [5, 6, 7, 8, 9]])
But more importantly, you shouldn't be doing this inside a loop in the first place. Each of those functions (stack, vstack, append) creates a whole new array on every call.
It would probably be more efficient to append each np.array(preprocess(f)) and app to plain Python lists, and call stack or vstack only once you've read them all.
Or, even better, append preprocess(f) and app directly to Python lists, and call np.array only once, after the loop, on the whole thing.
So, something like:
la = []
lb = []
for f in csv_files:
    separator = '_'
    app = os.path.basename(f).split(separator, 1)[0]
    la.append(preprocess(f))
    lb.append(app)
a = np.array(la)
b = np.array(lb)
You have to change from vstack to stack:
la = []
lb = []
for f in csv_files:
    separator = '_'
    app = os.path.basename(f).split(separator, 1)[0]
    la.append(preprocess(f))
    lb.append(app)
a = np.stack(la, axis=0)
b = np.array(lb)
vstack can only stack along the existing rows axis, but stack can stack along a new axis.
Make a new list (outside of the loop) and append each item to that new list after reading.
I am working on a multiclass semantic segmentation dataset that has RGB ground-truth segmentation masks for the original images, covering 24 classes. The following table displays the classes and their respective RGB values:
name         r    g    b
unlabeled    0    0    0
paved-area   128  64   128
dirt         130  76   0
grass        0    102  0
gravel       112  103  87
water        28   42   168
rocks        48   41   30
pool         0    50   89
vegetation   107  142  35
roof         70   70   70
wall         102  102  156
window       254  228  12
door         254  148  12
fence        190  153  153
fence-pole   153  153  153
person       255  22   96
dog          102  51   0
car          9    143  150
bicycle      119  11   32
tree         51   51   0
bald-tree    190  250  190
ar-marker    112  150  146
obstacle     2    135  115
conflicting  255  0    0
Sample RGB ground truth segmentation mask image: [image not shown]
There are 400 images in the dataset, each with a shape of 4000 px × 6000 px. The directory structure of the dataset is shown below:
dataset_folder
├── original_images
│   ├── 000.png
│   ├── 001.png
│   ├── ...
│   ├── 399.png
│   └── 400.png
└── masks
    ├── 000.png
    ├── 001.png
    ├── ...
    ├── 399.png
    └── 400.png
I want to create semantic segmentation masks from the RGB masks by assigning each pixel an integer value in the range 0-23 (where each integer represents a class), and save them to the working directory. Can someone please suggest efficient code for this task?
I had a similar problem. My solution is probably not the most efficient, but as there is no other answer, I'll share it anyway.
First, get an array from the image, for example by opening it with OpenCV.
For the example, let's make an "image" of 4×3 px with three channels:
import numpy as np

img = np.array([[[128,  64, 128],
                 [  0,   0,   0],
                 [  0,   0,   0],
                 [  0,   0,   0]],
                [[128,  64, 128],
                 [  0, 102,   0],
                 [  0,   0,   0],
                 [  0,   0,   0]],
                [[130,  76,   0],
                 [130,  76,   0],
                 [130,  76,   0],
                 [130,  76,   0]]])
Make a dictionary mapping each RGB value to the desired mask value (I wrote it down by hand for the example, but you can build it with pandas if you have a table like the one above), then make a list of the values encountered in the image, and finally create the mask with the corresponding categorical values.
unlabeled  = str([0, 0, 0])
paved_area = str([128, 64, 128])
dirt       = str([130, 76, 0])
grass      = str([0, 102, 0])
labels = {unlabeled: 0, paved_area: 1, dirt: 2, grass: 3}
print(labels)
>>> {'[0, 0, 0]': 0, '[128, 64, 128]': 1, '[130, 76, 0]': 2, '[0, 102, 0]': 3}
width = img.shape[1]
height = img.shape[0]
# .tolist() yields plain Python ints, so the string matches the dictionary keys
values = [str(img[i, j].tolist()) for i in range(height) for j in range(width)]
print(values)
>>> ['[128, 64, 128]', '[0, 0, 0]', ..., '[130, 76, 0]']
print(len(values))
>>> 12  # width * height
mask = [0] * (width * height)
for i, value in enumerate(values):
    mask[i] = labels[value]
mask = np.asarray(mask).reshape(height, width)
print(mask)
>>> array([[1, 0, 0, 0],
           [1, 3, 0, 0],
           [2, 2, 2, 2]])
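If speed matters, here is a vectorized sketch of the same idea (it assumes every pixel exactly matches one palette row; an unmatched pixel would silently fall back to class 0, since argmax of an all-False row is 0):

import numpy as np

# One RGB row per class id, in class order (just the four example classes here).
palette = np.array([[  0,   0,   0],   # 0: unlabeled
                    [128,  64, 128],   # 1: paved-area
                    [130,  76,   0],   # 2: dirt
                    [  0, 102,   0]])  # 3: grass

# (H, W, 1, 3) compared with (K, 3) broadcasts to (H, W, K, 3);
# all(axis=-1) marks the palette row each pixel equals.
matches = (img[:, :, None, :] == palette).all(axis=-1)
mask = matches.argmax(axis=-1)  # (H, W) array of class ids
print(mask)
# [[1 0 0 0]
#  [1 3 0 0]
#  [2 2 2 2]]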
Can someone point me in the right direction to accomplish the following? I would really appreciate it.
Given the following column:
111
108
106
107
109
130
I would like to take the first number (111) and find and print its difference from each of the other values, in the order they appear.
I would then like to repeat the process starting at the second position (108), and so on until all rows have been looped through.
Lastly, I would like to display the biggest difference and the row number where it occurs for each starting value.
Expected output is something along these lines:

Start  biggest-difference  row/position
111    19                  5
108    22                  5
106    24                  5
107    23                  5
109    21                  5
130    24                  2
You could use broadcasting:
import numpy as np
data = np.array([111, 108, 106, 107, 109, 130])
data - data[:, None]
# array([[ 0, -3, -5, -4, -2, 19],
# [ 3, 0, -2, -1, 1, 22],
# [ 5, 2, 0, 1, 3, 24],
# [ 4, 1, -1, 0, 2, 23],
# [ 2, -1, -3, -2, 0, 21],
# [-19, -22, -24, -23, -21, 0]])
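From that matrix you can get the requested table, assuming "biggest difference" means the largest absolute difference (which matches the expected output):

abs_diffs = np.abs(data - data[:, None])
biggest = abs_diffs.max(axis=1)      # largest absolute difference per start value
position = abs_diffs.argmax(axis=1)  # row where it occurs
print(np.column_stack([data, biggest, position]))
# [[111  19   5]
#  [108  22   5]
#  [106  24   5]
#  [107  23   5]
#  [109  21   5]
#  [130  24   2]]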
I have the array

[[ 430  780 1900  420]
 [   0    0 2272 1704]]

and need to convert it into this result:

[[[ 430  780    1]
  [1900  420    1]]

 [[   0    0    1]
  [2272 1704    1]]]

Basically, I want to turn the 2D array into a 3D one: split each row into pairs and append the number 1 to each pair. How can I achieve it?
As pointed out in the comments, the question leaves some ambiguity about what would happen with bigger arrays, but one way to obtain the result that you indicate is this:
import numpy as np
a = np.array([[430, 780, 1900, 420], [0, 0, 2272, 1704]])
b = a.reshape(a.shape[0], -1, 2)
b = np.concatenate([b, np.ones_like(b[..., -1:])], -1)
print(b)
# [[[ 430 780 1]
# [1900 420 1]]
#
# [[ 0 0 1]
# [2272 1704 1]]]
Try this for small arrays (for large arrays, consider @jdehesa's answer):
>>> arr = [[ 430, 780, 1900, 420],[ 0, 0, 2272, 1704]]
>>> [[[a[0],a[1],1],[a[2],a[3],1]] for a in arr]
[[[430, 780, 1], [1900, 420, 1]], [[0, 0, 1], [2272, 1704, 1]]]
I have a sparse tensor (generated by tf.Transform on a categorical value) which I convert into a dense representation using the following command:
bow_indecies = tf.sparse_tensor_to_dense(sparse_bow_indecies, default_value=0)
which results in a matrix of size batch_size × max_seq_length. The array looks like this:
[[ 597 1157 60 0 0 0]
[ 939 1212 169 10 0 0]
[ 242 719 215 520 57 6]]
I would like to move the zero padding from trailing to leading, so that it looks like this:
[[ 0 0 0 597 1157 60]
[ 0 0 939 1212 169 10]
[ 242 719 215 520 57 6]]
Any idea on how to do this?
There is one crude way to do that, if you can specify the indices of the SparseTensor.
I mean you have to tell your SparseTensor object (sparse_bow_indecies) the indices of the nonzero values.
The documentation says "Indices not in sp_input are assigned default_value.":
https://www.tensorflow.org/api_docs/python/tf/sparse_tensor_to_dense
So in your case, the indices in your SparseTensor object (sparse_bow_indecies) should be something like below for the result you are expecting:
SparseTensor(indices=[[0, 3], [0, 4], [0, 5], [1, 2], [1, 3], ...], values=[...], dense_shape=[3, 6])
Or try overriding the indices, if the SparseTensor object is already available:
sparse_bow_indecies.indices = [[0, 3], [0, 4], [0, 5], [1, 2], [1, 3], ...]  # dots kept for continuation
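Alternatively, here is a NumPy sketch that shifts each row's trailing zeros to the front once the dense matrix is available, assuming the padding value 0 never occurs inside the real data:

import numpy as np

x = np.array([[597, 1157,  60,   0,  0, 0],
              [939, 1212, 169,  10,  0, 0],
              [242,  719, 215, 520, 57, 6]])

# Shift each row right by its number of trailing zero slots.
shifts = x.shape[1] - (x != 0).sum(axis=1)

# np.roll wraps the trailing zeros around to the front of the row.
out = np.stack([np.roll(row, s) for row, s in zip(x, shifts)])
print(out)
# [[   0    0    0  597 1157   60]
#  [   0    0  939 1212  169   10]
#  [ 242  719  215  520   57    6]]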
I'm fitting a scikit-learn model (an ExtraTreesRegressor) with the aim of doing supervised feature selection.
I've made a toy example in order to be as clear as possible. Here is the toy code:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Original dataframe
df = pd.DataFrame({"A": [[10, 15, 12, 14], [20, 30, 10, 43]], "R": [2, 2], "C": [2, 2], "CLASS": [1, 0]})
X = np.array([np.array(df.A).reshape(1, 4), df.C, df.R])
Y = np.array(df.CLASS)

# prints
X = np.array([np.array(df.A), df.C, df.R])
Y = np.array(df.CLASS)
print("X", X)
print("Y", Y)
print(df)
df['A'].apply(lambda x: print("ORIGINAL SHAPE", np.array(x).shape, "field:", x))
df['A'] = df['A'].apply(lambda x: np.array(x).reshape(4, 1))
df['A'].apply(lambda x: print("RESHAPED SHAPE", np.array(x).shape, "field:", x))

model = ExtraTreesRegressor()
model.fit(X, Y)
model.feature_importances_
X [[[10, 15, 12, 14] [20, 30, 10, 43]]
   [2 2]
   [2 2]]
Y [1 0]
                  A  C  CLASS  R
0  [10, 15, 12, 14]  2      1  2
1  [20, 30, 10, 43]  2      0  2
ORIGINAL SHAPE (4,) field: [10, 15, 12, 14]
ORIGINAL SHAPE (4,) field: [20, 30, 10, 43]
This is the exception that's raised:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-5a36c4c17ea0> in <module>()
7 print(df)
8 model = ExtraTreesRegressor()
----> 9 model.fit(X,Y)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
371 force_all_finite)
372 else:
--> 373 array = np.array(array, dtype=dtype, order=order, copy=copy)
374
375 if ensure_2d:
ValueError: setting an array element with a sequence.
I've noticed that it involves np.arrays, so I tried fitting another toy dataframe, the most basic one with only scalars, and no error was raised. I then kept the same code and just modified the toy dataframe by adding a field that contains one-dimensional arrays, and the same exception was raised again.
I've looked around, but so far I haven't found a solution, even after trying reshapes, conversions into lists, np.array, etc., and matrices in my real problem. I'm still trying along this direction.
I've also seen that this kind of problem usually arises when the arrays have different lengths between samples, but that is not the case in the toy example.
Does anyone know how to deal with these structures and this exception?
Thanks in advance for any help.
Have a closer look at your X:
>>> X
array([[[10, 15, 12, 14], [20, 30, 10, 43]],
[2, 2],
[2, 2]], dtype=object)
>>> type(X[0,0])
<class 'list'>
Notice that it's dtype=object, and one of those objects is a list, hence "setting an array element with a sequence". Part of the problem is that np.array(df.A) does not correctly create a 2D array:
>>> np.array(df.A)
array([[10, 15, 12, 14], [20, 30, 10, 43]], dtype=object)
>>> _.shape
(2,) # oops!
But using np.stack(df.A) fixes the problem.
Are you looking for:
>>> X = np.concatenate([
...     np.stack(df.A),                 # condense A to (N, 4)
...     np.expand_dims(df.C, axis=-1),  # expand C to (N, 1)
...     np.expand_dims(df.R, axis=-1),  # expand R to (N, 1)
... ], axis=-1)
>>> X
array([[10, 15, 12, 14, 2, 2],
[20, 30, 10, 43, 2, 2]], dtype=int64)
To convert a pandas DataFrame whose cells contain lists to a NumPy matrix:
import numpy as np
import pandas as pd

def df2mat(col):
    a = col.to_numpy()  # as_matrix() was removed in pandas 1.0
    n = a.shape[0]
    m = len(a[0])
    b = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            b[i, j] = a[i][j]
    return b

df = pd.DataFrame({"A": [[1, 2], [3, 4]]})
b = df2mat(df.A)
After that, concatenate.
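For example, continuing from the code above (the C and R columns here are assumptions matching the earlier toy dataframe):

df = pd.DataFrame({"A": [[1, 2], [3, 4]], "C": [2, 2], "R": [2, 2]})
X = np.concatenate([df2mat(df.A),                # expanded A columns, shape (2, 2)
                    df[["C", "R"]].to_numpy()],  # scalar columns, shape (2, 2)
                   axis=1)
# X has shape (2, 4): the two A values per row followed by C and R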