Fill values in a numpy array given a condition - python

Currently I have an array as follows:
myArray = np.array(
[[ 976.77 , 152.95 , 105.62 , 53.44 , 0 ],
[ 987.61 , 156.63 , 105.53 , 51.1 , 0 ],
[1003.74 , 151.31 , 104.435, 52.86 , 0 ],
[ 968. , 153.41 , 106.24 , 58.98 , 0 ],
[ 978.66 , 152.19 , 103.28 , 57.97 , 0 ],
[1001.9 , 152.88 , 105.08 , 58.01 , 0 ],
[1024.93 , 146.59 , 107.06 , 59.94 , 0 ],
[1020.01 , 148.05 , 109.96 , 58.67 , 0 ],
[1034.01 , 152.69 , 107.64 , 59.74 , 0 ],
[ 0. , 154.88 , 102. , 58.96 , 0 ],
[ 0. , 147.46 , 100.69 , 54.95 , 0 ],
[ 0. , 149.7 , 102.439, 53.91 , 0 ]]
)
I would like to fill in the zeros in the first column with the last preceding non-zero value (1034.01). However, if the 0's start from index 0, they should remain 0.
Example of end result:
myArrayEnd = np.array(
[[ 976.77 , 152.95 , 105.62 , 53.44 , 0 ],
[ 987.61 , 156.63 , 105.53 , 51.1 , 0 ],
[1003.74 , 151.31 , 104.435, 52.86 , 0 ],
[ 968. , 153.41 , 106.24 , 58.98 , 0 ],
[ 978.66 , 152.19 , 103.28 , 57.97 , 0 ],
[1001.9 , 152.88 , 105.08 , 58.01 , 0 ],
[1024.93 , 146.59 , 107.06 , 59.94 , 0 ],
[1020.01 , 148.05 , 109.96 , 58.67 , 0 ],
[1034.01 , 152.69 , 107.64 , 59.74 , 0 ],
[1034.01 , 154.88 , 102. , 58.96 , 0 ],
[1034.01 , 147.46 , 100.69 , 54.95 , 0 ],
[1034.01 , 149.7 , 102.439, 53.91 , 0 ]]
)
I would like the code to be applicable to any array not just this one, where the situation may be different. (Column 3 might be all 0's and Column 4 might have 0's in the middle which should be filled with the last previous value).

Here's a vectorised way with pandas. This is also possible with numpy. In any case, you should not need explicit loops for this task.
import pandas as pd
import numpy as np
df = pd.DataFrame(myArray)\
.replace(0, np.nan)\
.ffill().fillna(0)
res = df.values
print(res)
[[ 976.77 152.95 105.62 53.44 0. ]
[ 987.61 156.63 105.53 51.1 0. ]
[ 1003.74 151.31 104.435 52.86 0. ]
[ 968. 153.41 106.24 58.98 0. ]
[ 978.66 152.19 103.28 57.97 0. ]
[ 1001.9 152.88 105.08 58.01 0. ]
[ 1024.93 146.59 107.06 59.94 0. ]
[ 1020.01 148.05 109.96 58.67 0. ]
[ 1034.01 152.69 107.64 59.74 0. ]
[ 1034.01 154.88 102. 58.96 0. ]
[ 1034.01 147.46 100.69 54.95 0. ]
[ 1034.01 149.7 102.439 53.91 0. ]]
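The pure-numpy route mentioned above can be sketched as follows. This is only one possible approach, not necessarily what the answer had in mind: it records the row index of every non-zero entry, takes a running maximum down each column to find the most recent non-zero row, and then gathers values from those rows. The function name ffill_zeros and the small test array are made up for illustration.

```python
import numpy as np

def ffill_zeros(a):
    """Forward-fill zeros down each column; leading zeros stay zero."""
    a = np.asarray(a, dtype=float)
    rows = np.arange(a.shape[0])[:, None]
    # Row index where the entry is non-zero, 0 otherwise.
    idx = np.where(a != 0, rows, 0)
    # Running maximum = index of the most recent non-zero row (or row 0).
    idx = np.maximum.accumulate(idx, axis=0)
    # Gather the values from those rows, column by column.
    return a[idx, np.arange(a.shape[1])]

# Hypothetical small input: leading zero, zeros in the middle, all-zero column.
demo = np.array([[0, 1, 0],
                 [2, 0, 0],
                 [0, 3, 0],
                 [0, 0, 0]])
filled = ffill_zeros(demo)
print(filled)
```

Leading zeros point back at row 0, which is itself zero, so they are left unchanged, and an all-zero column stays all zero.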

Staying within numpy:
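for k, c in enumerate(myArray.T):
    idx = np.flatnonzero(c == 0)
    # Note: every zero in the column gets the value just before the first zero,
    # so this assumes a single run of zeros per column.
    if idx.size > 0 and idx[0] > 0:
        myArray[idx, k] = myArray[idx[0] - 1, k]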

Assuming I've understood you correctly, this should do the trick:
def fill_zeroes(array):
    temp_array = array.copy()  # copy so the caller's array is not modified
    for i in range(1, len(temp_array)):  # range replaces Python 2's xrange
        if temp_array[i][0] == 0:
            temp_array[i][0] = temp_array[i-1][0]
    return temp_array

How about something like this (in pseudocode)?
for each col in array
    for each row in col
        if array[col,row] == 0 && row>0
            array[col,row] = array[col,row-1]
edit: Combined with @ukemi, who has a quicker solution, but does not loop over the various columns. Also, you need to make sure to not try to index array[0][-1].

The code below requires testing:
values = myArray.tolist()  # ndarray.tolist() is a method
result = []
last = None  # last non-zero value seen in the first column
for item in values:
    if item[0] == 0:
        # leading zeros (nothing seen yet) are left as 0
        if last is not None:
            item[0] = last
    else:
        last = item[0]
    result.append(item)

Related

Numpy dot product for group of rows

I am trying to calculate a dot product between two matrices, for each couple of rows.
I have matrix D with (u x 2) dimensions and matrix R with (u*2 x c) dimensions.
Below an example:
D = np.array([[0.02747092, 0.11233295],
[0.02747092, 0.07295284],
[0.01245856, 0.19935923],
[0.01245856, 0.13520913],
[0.11233295, 0.07295284]])
R = np.array([[-3. , 0. , 1. , -1. ],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[-2.33333333, -0.33333333, 1.66666667, -1.33333333],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[ 0. , -2. , 2. , -4. ],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[ 0.66666667, -3.33333333, 2.66666667, -4.33333333],
[-1.25 , 0.75 , 1.75 , -1.25 ],
[-2.33333333, -0.33333333, 1.66666667, -1.33333333],
[-3. , 0. , 1. , -1. ]])
The result should be matrix M with dimensions (u x c) as follows (example of first row):
M = np.array([[-0.2185, 0.0825, 0.2195, -0.1645],
[...]])
This is the result of the dot product between the first row of D and the first two rows of matrix R:
D_ = np.array([[0.027, 0.11]])
R_ = np.array([[-3., 0., 1., -1.],
[-1.25, 0.75, 1.75, -1.25]])
D_.dot(R_)
I tried various ways of using np.tensordot after reshaping the D matrix into a tensor, but without any luck. I am looking for a vectorized solution that avoids loops (which is my current solution, and quite slow).
Reshape R to 3D and use np.einsum:
np.einsum('ijk,ij->ik',R.reshape(len(D),2,-1),D)
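As a quick sanity check, the einsum call can be compared against the explicit loop it replaces. The sketch below reuses D and R from the question; R3, M and M_loop are names introduced here for illustration only.

```python
import numpy as np

D = np.array([[0.02747092, 0.11233295],
              [0.02747092, 0.07295284],
              [0.01245856, 0.19935923],
              [0.01245856, 0.13520913],
              [0.11233295, 0.07295284]])
R = np.array([[-3.        ,  0.        , 1.        , -1.        ],
              [-1.25      ,  0.75      , 1.75      , -1.25      ],
              [-2.33333333, -0.33333333, 1.66666667, -1.33333333],
              [-1.25      ,  0.75      , 1.75      , -1.25      ],
              [ 0.        , -2.        , 2.        , -4.        ],
              [-1.25      ,  0.75      , 1.75      , -1.25      ],
              [ 0.66666667, -3.33333333, 2.66666667, -4.33333333],
              [-1.25      ,  0.75      , 1.75      , -1.25      ],
              [-2.33333333, -0.33333333, 1.66666667, -1.33333333],
              [-3.        ,  0.        , 1.        , -1.        ]])

# Group R into blocks of two consecutive rows: shape (u, 2, c).
R3 = R.reshape(len(D), 2, -1)

# For each block i, sum over the pair index j: M[i, k] = sum_j D[i, j] * R3[i, j, k].
M = np.einsum('ijk,ij->ik', R3, D)

# Reference result from the slow per-row loop.
M_loop = np.array([D[i].dot(R[2 * i:2 * i + 2]) for i in range(len(D))])

print(M.shape)
print(np.allclose(M, M_loop))
```

The subscripts read as: i indexes the row of D (and the matching block of R), j runs over the two rows inside a block and is summed away, and k indexes the output columns.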

Plot RGB Values with matplotlib

I have a set of RGB values in an array rgb_array of the form
[255.000, 56.026, 0.000]
[246.100, 60.000, 0.000]
...
>>> print(rgb_array.shape)
(1000, 3)
that I'd like to plot similarly to the color gradient shown above.
How can I best use matplotlib's imshow to achieve this?
Supposing your array has N rows where each row contains 3 floats between 0 and 255, you can create an image as follows. First convert it to a numpy array of integers, and reshape it to (1, N, 3). This will make it a 1xN image. Then, display the image using imshow. You need to set an extent to get the x and y axes as in your example, or just set them to [0, 1, 0, 1]. Also the aspect ratio needs to be controlled, as otherwise the pixels would be considered "square".
import numpy as np
import matplotlib.pyplot as plt
rgb_array = [[255.000, 56.026 + (255 - 56.026) * i / 400, 255 * i / 400] for i in range(400)]
rgb_array += [[255 - 255 * i / 600, 255 - 255 * i / 600, 255] for i in range(600)]
img = np.array(rgb_array, dtype=int).reshape((1, len(rgb_array), 3))
plt.imshow(img, extent=[0, 16000, 0, 1], aspect='auto')
plt.show()
Don't use this method - @JohanC provides a much superior solution of creating an image rather than making a bar-graph.
I'm not so good on Matplotlib, but came up with this. There may be more efficient methods, so someone correct me please if this is the wrong approach.
#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
NSAMPLES = 100
# Synthesize R, G, B and A channels with dummy data
# The thing to note is that the samples are REAL and in range [0..1]
r = np.linspace(0,1,NSAMPLES).astype(float)  # np.float was removed in NumPy 1.24
g = 1.0 - r
b = np.full(NSAMPLES,0.5,float)
a = np.full(NSAMPLES,1,float)
# Merge into a single array, 4 deep
RGBA = np.dstack((r,g,b,a))
# Plot
height, width = 40, 1
plt.bar(np.arange(NSAMPLES), height, width, color=RGBA.reshape(-1,4))
plt.title("Some Funky Barplot")
plt.show()
The array RGBA looks like this:
array([[[0. , 1. , 0.5 , 1. ],
[0.01010101, 0.98989899, 0.5 , 1. ],
[0.02020202, 0.97979798, 0.5 , 1. ],
[0.03030303, 0.96969697, 0.5 , 1. ],
[0.04040404, 0.95959596, 0.5 , 1. ],
[0.05050505, 0.94949495, 0.5 , 1. ],
[0.06060606, 0.93939394, 0.5 , 1. ],
[0.07070707, 0.92929293, 0.5 , 1. ],
[0.08080808, 0.91919192, 0.5 , 1. ],
[0.09090909, 0.90909091, 0.5 , 1. ],
[0.1010101 , 0.8989899 , 0.5 , 1. ],
[0.11111111, 0.88888889, 0.5 , 1. ],
[0.12121212, 0.87878788, 0.5 , 1. ],
[0.13131313, 0.86868687, 0.5 , 1. ],
[0.14141414, 0.85858586, 0.5 , 1. ],
[0.15151515, 0.84848485, 0.5 , 1. ],
[0.16161616, 0.83838384, 0.5 , 1. ],
[0.17171717, 0.82828283, 0.5 , 1. ],
[0.18181818, 0.81818182, 0.5 , 1. ],
[0.19191919, 0.80808081, 0.5 , 1. ],
[0.2020202 , 0.7979798 , 0.5 , 1. ],
[0.21212121, 0.78787879, 0.5 , 1. ],
[0.22222222, 0.77777778, 0.5 , 1. ],
[0.23232323, 0.76767677, 0.5 , 1. ],
[0.24242424, 0.75757576, 0.5 , 1. ],
[0.25252525, 0.74747475, 0.5 , 1. ],
[0.26262626, 0.73737374, 0.5 , 1. ],
[0.27272727, 0.72727273, 0.5 , 1. ],
[0.28282828, 0.71717172, 0.5 , 1. ],
[0.29292929, 0.70707071, 0.5 , 1. ],
[0.3030303 , 0.6969697 , 0.5 , 1. ],
[0.31313131, 0.68686869, 0.5 , 1. ],
[0.32323232, 0.67676768, 0.5 , 1. ],
[0.33333333, 0.66666667, 0.5 , 1. ],
[0.34343434, 0.65656566, 0.5 , 1. ],
[0.35353535, 0.64646465, 0.5 , 1. ],
[0.36363636, 0.63636364, 0.5 , 1. ],
[0.37373737, 0.62626263, 0.5 , 1. ],
[0.38383838, 0.61616162, 0.5 , 1. ],
[0.39393939, 0.60606061, 0.5 , 1. ],
[0.4040404 , 0.5959596 , 0.5 , 1. ],
[0.41414141, 0.58585859, 0.5 , 1. ],
[0.42424242, 0.57575758, 0.5 , 1. ],
[0.43434343, 0.56565657, 0.5 , 1. ],
[0.44444444, 0.55555556, 0.5 , 1. ],
[0.45454545, 0.54545455, 0.5 , 1. ],
[0.46464646, 0.53535354, 0.5 , 1. ],
[0.47474747, 0.52525253, 0.5 , 1. ],
[0.48484848, 0.51515152, 0.5 , 1. ],
[0.49494949, 0.50505051, 0.5 , 1. ],
[0.50505051, 0.49494949, 0.5 , 1. ],
[0.51515152, 0.48484848, 0.5 , 1. ],
[0.52525253, 0.47474747, 0.5 , 1. ],
[0.53535354, 0.46464646, 0.5 , 1. ],
[0.54545455, 0.45454545, 0.5 , 1. ],
[0.55555556, 0.44444444, 0.5 , 1. ],
[0.56565657, 0.43434343, 0.5 , 1. ],
[0.57575758, 0.42424242, 0.5 , 1. ],
[0.58585859, 0.41414141, 0.5 , 1. ],
[0.5959596 , 0.4040404 , 0.5 , 1. ],
[0.60606061, 0.39393939, 0.5 , 1. ],
[0.61616162, 0.38383838, 0.5 , 1. ],
[0.62626263, 0.37373737, 0.5 , 1. ],
[0.63636364, 0.36363636, 0.5 , 1. ],
[0.64646465, 0.35353535, 0.5 , 1. ],
[0.65656566, 0.34343434, 0.5 , 1. ],
[0.66666667, 0.33333333, 0.5 , 1. ],
[0.67676768, 0.32323232, 0.5 , 1. ],
[0.68686869, 0.31313131, 0.5 , 1. ],
[0.6969697 , 0.3030303 , 0.5 , 1. ],
[0.70707071, 0.29292929, 0.5 , 1. ],
[0.71717172, 0.28282828, 0.5 , 1. ],
[0.72727273, 0.27272727, 0.5 , 1. ],
[0.73737374, 0.26262626, 0.5 , 1. ],
[0.74747475, 0.25252525, 0.5 , 1. ],
[0.75757576, 0.24242424, 0.5 , 1. ],
[0.76767677, 0.23232323, 0.5 , 1. ],
[0.77777778, 0.22222222, 0.5 , 1. ],
[0.78787879, 0.21212121, 0.5 , 1. ],
[0.7979798 , 0.2020202 , 0.5 , 1. ],
[0.80808081, 0.19191919, 0.5 , 1. ],
[0.81818182, 0.18181818, 0.5 , 1. ],
[0.82828283, 0.17171717, 0.5 , 1. ],
[0.83838384, 0.16161616, 0.5 , 1. ],
[0.84848485, 0.15151515, 0.5 , 1. ],
[0.85858586, 0.14141414, 0.5 , 1. ],
[0.86868687, 0.13131313, 0.5 , 1. ],
[0.87878788, 0.12121212, 0.5 , 1. ],
[0.88888889, 0.11111111, 0.5 , 1. ],
[0.8989899 , 0.1010101 , 0.5 , 1. ],
[0.90909091, 0.09090909, 0.5 , 1. ],
[0.91919192, 0.08080808, 0.5 , 1. ],
[0.92929293, 0.07070707, 0.5 , 1. ],
[0.93939394, 0.06060606, 0.5 , 1. ],
[0.94949495, 0.05050505, 0.5 , 1. ],
[0.95959596, 0.04040404, 0.5 , 1. ],
[0.96969697, 0.03030303, 0.5 , 1. ],
[0.97979798, 0.02020202, 0.5 , 1. ],
[0.98989899, 0.01010101, 0.5 , 1. ],
[1. , 0. , 0.5 , 1. ]]])

In-line column assignments in Python/Numpy

I have a bunch of points and need to select a subset of them, add a value to the x coordinates and store the information in the original points.
I need to do it without loops or intermediate assignments.
import numpy as np
points=np.array([[100. , 100. , 100. ],
[ 0. , -2.75, 0. ],
[ 0. , -2.75, 5. ],
[ 0. , -1.9 , 3.15],
[ 0. , -1.9 , 3.35]])
then trying:
points[[3,4,0]][:,[0]]+=2
or
points[[3,4,0]][:,[0]]=points[[3,4,0]][:,[0]]+2
the original points variable does not change.
Any ideas? I suspect I am missing some stupid stuff...
If you are looking to edit the first column of those rows, use:
points[[3,4,0], 0] += 2
points
#[[ 102. 100. 100. ]
# [ 0. -2.75 0. ]
# [ 0. -2.75 5. ]
# [ 2. -1.9 3.15]
# [ 2. -1.9 3.35]]
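The reason the original attempts leave points untouched is that indexing with a list (advanced indexing) returns a copy, so the chained form modifies a temporary array. A short sketch, using the data from the question; the names before and unchanged are introduced here only to make the effect visible:

```python
import numpy as np

points = np.array([[100., 100.  , 100.  ],
                   [  0.,  -2.75,   0.  ],
                   [  0.,  -2.75,   5.  ],
                   [  0.,  -1.9 ,   3.15],
                   [  0.,  -1.9 ,   3.35]])
before = points.copy()

# Chained indexing: points[[3, 4, 0]] builds a new array (a copy), so the
# in-place addition is applied to that temporary, which is then discarded.
points[[3, 4, 0]][:, [0]] += 2
unchanged = np.array_equal(points, before)
print(unchanged)

# A single indexing expression assigns into `points` itself.
points[[3, 4, 0], 0] += 2
print(points[:, 0])
```

Putting the row list and the column index into one pair of brackets turns the statement into a single item assignment on points, which is why it sticks.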

How can I extract two separate matrices from a file?

So I have a file that looks something like this:
# 3 # Number of network ROIs
# 2 # Number of netcc matrices
# WITH_ROI_LABELS
001 002 003
1 2 3
# CC
1.0000 0.9800 0.9895
0.9800 1.0000 0.9817
0.9895 0.9817 1.0000
# FZ
4.0000 2.2965 2.6240
2.2965 4.0000 2.3426
2.6240 2.3426 4.0000
I want to extract the 3x3 matrix labelled "CC"
I want to extract the 3x3 matrix labelled "FZ"
So I did the following:
import numpy
file = '/users/3dfile1'
A = numpy.genfromtxt(file)
m = A[:,:]
m
So the output I get looks like this:
array([[ 1. , 2. , 3. ],
[ 1. , 2. , 3. ],
[ 1. , 0.98 , 0.9895],
[ 0.98 , 1. , 0.9817],
[ 0.9895, 0.9817, 1. ],
[ 4. , 2.2965, 2.624 ],
[ 2.2965, 4. , 2.3426],
[ 2.624 , 2.3426, 4. ]])
However, I have multiple files in which the matrix size is NOT consistent: in some files the matrix will be 3x3, in others 8x8, 1x1, etc. In this case, how can I code something that will:
differentiate the matrix CC from FZ
extract the matrix (detect the size of the matrix somehow and give me the exact matrix I'm looking for)
Try
import numpy as np
x = np.array([[ 1. , 2. , 3. ],
[ 1. , 2. , 3. ],
[ 1. , 0.98 , 0.9895],
[ 0.98 , 1. , 0.9817],
[ 0.9895, 0.9817, 1. ],
[ 4. , 2.2965, 2.624 ],
[ 2.2965, 4. , 2.3426],
[ 2.624 , 2.3426, 4. ]])
x1 = x[2:,:]
x2 = x1.reshape(2,3,3)
CC ,FZ = x2
Result:
In [23]: CC
Out[23]:
array([[ 1. , 0.98 , 0.9895],
[ 0.98 , 1. , 0.9817],
[ 0.9895, 0.9817, 1. ]])
In [24]: FZ
Out[24]:
array([[ 4. , 2.2965, 2.624 ],
[ 2.2965, 4. , 2.3426],
[ 2.624 , 2.3426, 4. ]])
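The slicing above hard-codes a 3x3 size and the number of header rows. For files of varying size, one option is to read the file line by line and use the "# CC" and "# FZ" comment lines as section markers. The sketch below assumes every file follows the layout shown in the question; the function name read_labelled_matrices is made up, and an in-memory string stands in for the real file.

```python
import io
import numpy as np

def read_labelled_matrices(lines, labels=("CC", "FZ")):
    """Collect the numeric rows that follow each '# LABEL' line."""
    rows = {}
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("#"):
            # A comment line either opens a known section or closes the current one.
            tag = text.lstrip("#").strip()
            current = tag if tag in labels else None
            if current is not None:
                rows[current] = []
            continue
        if current is not None:
            rows[current].append([float(v) for v in text.split()])
    return {label: np.array(values) for label, values in rows.items()}

# Stand-in for open('/users/3dfile1'); same content as the sample file.
sample = io.StringIO("""# 3 # Number of network ROIs
# 2 # Number of netcc matrices
# WITH_ROI_LABELS
001 002 003
1 2 3
# CC
1.0000 0.9800 0.9895
0.9800 1.0000 0.9817
0.9895 0.9817 1.0000
# FZ
4.0000 2.2965 2.6240
2.2965 4.0000 2.3426
2.6240 2.3426 4.0000
""")
matrices = read_labelled_matrices(sample)
CC, FZ = matrices["CC"], matrices["FZ"]
print(CC.shape, FZ.shape)
```

Because the rows are collected until the next comment line, the matrix size is whatever the file contains, and the label rows before "# CC" are skipped since no section is open at that point.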

How can I one hot encode a subset of columns?

I have a data set which has some categorical columns. Here is a small sample:
Temp precip dow tod
-20.44 snow 4 14.5
-22.69 snow 4 15.216666666666667
-21.52 snow 4 17.316666666666666
-21.52 snow 4 17.733333333333334
-20.51 snow 4 18.15
Here, dow and precip are categorical, whereas the others are continuous.
Is there a way I can create a OneHotEncoder for just those columns? I don't want to use pd.get_dummies because that won't put the data in the proper format unless of each dow and precip are in the new data.
Two things you could check out: sklearn-pandas and, as mentioned by @Grr, pipelines with this good intro.
I prefer pipelines, as they are tidy, allow easy use with things like grid search, avoid leakage between folds in cross validation, etc. So I usually end up with a pipe like this (given you have precip LabelEncoded first):
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
class Columns(BaseEstimator, TransformerMixin):
    # Select the named columns from a DataFrame
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]
class Normalize(BaseEstimator, TransformerMixin):
    # Apply an optional function to the data; pass it through unchanged otherwise
    def __init__(self, func=None, func_param={}):
        self.func = func
        self.func_param = func_param
    def transform(self, X):
        if self.func is not None:
            return self.func(X, **self.func_param)
        else:
            return X
    def fit(self, X, y=None, **fit_params):
        return self
cat_cols = ['precip', 'dow']
num_cols = ['Temp','tod']
pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=num_cols),Normalize())),
        # sparse=False was renamed to sparse_output=False in scikit-learn 1.2
        ('categorical', make_pipeline(Columns(names=cat_cols),OneHotEncoder(sparse=False)))
    ])),
    ('model', LinearRegression())
])
The short answer is yes, but with some caveats.
First off, you won't be able to use OneHotEncoder directly on the precip feature. You will need to encode those labels into integers with LabelEncoder.
Secondly, if you just want to encode those features you can pass the proper values to the n_values and categorical_features parameters.
Example:
I will assume dow is day of the week, which will have seven values, and precip will have (rain, sleet, snow, and mix) as values.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
df2 = df.copy()
le = LabelEncoder()
le.fit(['rain', 'sleet', 'snow', 'mix'])
df2.precip = le.transform(df2.precip)
df2
Temp precip dow tod
0 -20.44 3 4 14.500000
1 -22.69 3 4 15.216667
2 -21.52 3 4 17.316667
3 -21.52 3 4 17.733333
4 -20.51 3 4 18.150000
# Initialize OneHotEncoder with 4 values for precip and 7 for dow.
ohe = OneHotEncoder(n_values=np.array([4,7]), categorical_features=[1,2])
X = ohe.fit_transform(df2)
X.toarray()
array([[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -20.44 , 14.5 ],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -22.69 ,
15.21666667],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -21.52 ,
17.31666667],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -21.52 ,
17.73333333],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -20.51 , 18.15 ]])
Ok, that works, but you have to either mutate your data in place or create a copy, and things can get a little messy. A more organized way to do this would be to use a Pipeline.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
def get_precip(X):
    le = LabelEncoder()
    le.fit(['rain', 'sleet', 'snow', 'mix'])
    return le.transform(X.precip).reshape(-1,1)
def get_dow(X):
    return X.dow.values.reshape(-1,1)
def get_rest(X):
    return X.drop(['precip', 'dow'], axis=1)
precip_trans = FunctionTransformer(get_precip, validate=False)
dow_trans = FunctionTransformer(get_dow, validate=False)
rest_trans = FunctionTransformer(get_rest, validate=False)
union = FeatureUnion([('precip', precip_trans), ('dow', dow_trans), ('rest', rest_trans)])
ohe = OneHotEncoder(n_values=[4,7], categorical_features=[0,1])
pipe = Pipeline([('union', union), ('one_hot', ohe)])
X = pipe.fit_transform(df)
X.toarray()
array([[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -20.44 , 14.5 ],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -22.69 ,
15.21666667],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -21.52 ,
17.31666667],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -21.52 ,
17.73333333],
[ 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. ,
1. , 0. , 0. , -20.51 , 18.15 ]])
I do want to point out that a CategoricalEncoder was planned for sklearn v0.20 to make this kind of thing even easier. In the released version that functionality ended up in OneHotEncoder and OrdinalEncoder instead, alongside the new ColumnTransformer.
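For readers on scikit-learn 0.20 or later, the same encoding can be sketched with ColumnTransformer, which applies OneHotEncoder to the named columns only and passes the rest through. This is an illustration rather than the answer's own method: the category lists (four precip values, seven days numbered 0 to 6) follow the assumptions stated earlier in this answer, and the DataFrame is rebuilt from the question's sample.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Temp": [-20.44, -22.69, -21.52, -21.52, -20.51],
    "precip": ["snow"] * 5,
    "dow": [4] * 5,
    "tod": [14.5, 15.216667, 17.316667, 17.733333, 18.15],
})

# Assumed full sets of categories, so unseen values still get a column.
precip_levels = ["mix", "rain", "sleet", "snow"]
dow_levels = list(range(7))

encoder = ColumnTransformer(
    [("onehot",
      OneHotEncoder(categories=[precip_levels, dow_levels]),
      ["precip", "dow"])],
    remainder="passthrough",  # keep Temp and tod as they are
)

X = encoder.fit_transform(df)
# The result may be sparse or dense depending on its density; normalise to dense.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)
```

The encoded columns come first (4 for precip, then 7 for dow), followed by the untouched Temp and tod columns, and no separate LabelEncoder step is needed because OneHotEncoder accepts string categories in these versions.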
I don't want to use pd.get_dummies because that won't put the data in
the proper format unless of each dow and precip are in the new data.
Assuming you want to encode but also maintain those two columns--are you sure this wouldn't work for you?
df = pd.DataFrame({
'temp': np.random.random(5) + 20.,
'precip': pd.Categorical(['snow', 'snow', 'rain', 'none', 'rain']),
'dow': pd.Categorical([4, 4, 4, 3, 1]),
'tod': np.random.random(5) + 10.
})
pd.concat((df[['dow', 'precip']],
pd.get_dummies(df, columns=['dow', 'precip'], drop_first=True)),
axis=1)
dow precip temp tod dow_3 dow_4 precip_rain precip_snow
0 4 snow 20.7019 10.4610 0 1 0 1
1 4 snow 20.0917 10.0174 0 1 0 1
2 4 rain 20.3978 10.5766 0 1 1 0
3 3 none 20.9804 10.0770 1 0 0 0
4 1 rain 20.3121 10.3584 0 0 1 0
In the case where you'll be interacting with new data that includes categories that df hasn't "seen," you can use
df['col'] = df['col'].cat.add_categories(...)
Where you pass a list of the set difference. This adds to the list of "recognized" categories for the resulting pd.Categorical object.
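A related way to get the same column layout on every batch is to declare the full category list up front, so that pd.get_dummies emits a column for each category even when it does not occur in the data. A minimal sketch, where precip_levels, train, new and encode are hypothetical names chosen for illustration:

```python
import pandas as pd

# Assumed complete set of categories, known ahead of time.
precip_levels = ["none", "rain", "snow", "sleet"]

train = pd.DataFrame({"precip": ["snow", "snow", "rain"]})
new = pd.DataFrame({"precip": ["sleet", "none"]})

def encode(frame):
    # Attach the fixed category list before encoding.
    fixed = frame.assign(
        precip=pd.Categorical(frame["precip"], categories=precip_levels)
    )
    return pd.get_dummies(fixed, columns=["precip"])

train_encoded = encode(train)
new_encoded = encode(new)
print(list(train_encoded.columns))
print(list(new_encoded.columns))
```

Both frames end up with one column per declared category, in the declared order, so the new data lines up with the training data regardless of which values it actually contains.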
