np.logical_and operator not working as it should (numpy) - python

I have a dataset (ndarray, float 32), for example:
[-3.4028235e+38 -3.4028235e+38 -3.4028235e+38 ... 1.2578617e-01
1.2651859e-01 1.3053264e-01] ...
I want to remove all values below 0 or greater than 1, so I use:
with rasterio.open(raster_file) as src:
    h = src.read(1)
    i = h[0]
    i[np.logical_and(i >= 0.0, i <= 1.0)]
Obviously the first entries (i.e. -3.4028235e+38) should be removed, but they still appear after the operator is applied. I'm wondering if this is related to the scientific notation and whether a pre-processing step is required, but I can't see what exactly. Any ideas?
To simplify this, here is the code again:
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 1.2578617e-01, 1.2651859e-01, 1.3053264e-01]
pp[np.logical_and(pp >= 0.0, pp <= 1.0)]
print (pp)
And the result:
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 0.12578617, 0.12651859, 0.13053264]
So the first 3 entries still remain.

The problem is that you are not removing the indices you selected. You are just selecting them.
If you want to remove them, you should probably convert them to NaN, like so:
from numpy import random, nan, logical_and
a = random.randn(10, 3)
print(a)
a[logical_and(a > 0, a < 1)] = nan
print(a)
Output example
[[-0.95355719 nan nan]
[-0.21268393 nan -0.24113676]
[-0.58929128 nan nan]
[ nan -0.89110972 nan]
[-0.27453321 1.07802157 1.60466863]
[-0.34829213 nan 1.51556019]
[-0.4890989 nan -1.08481203]
[-2.17016962 nan -0.65332871]
[ nan 1.58937678 1.79992471]
[ nan -0.91716538 1.60264461]]
Alternatively, you can look into masked arrays.
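For instance, a minimal sketch using numpy.ma (the sample values here are made up; masked_outside hides everything outside the given interval, and masked entries are ignored by most numpy.ma operations):
import numpy as np
import numpy.ma as ma

a = np.array([-3.4028235e+38, 0.5, 1.2, 0.13])
# mask everything outside [0, 1]
m = ma.masked_outside(a, 0.0, 1.0)
print(m)          # [-- 0.5 -- 0.13]
print(m.mean())   # 0.315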

Silly mistake: I had to wrap the list in a numpy array, then assign the newly constructed (filtered) array to a variable, like so:
j = np.array(pp)
mask = j[np.logical_and(j >= 0.0, j <= 1.0)]
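Applied to the simplified list from the question, a minimal self-contained sketch of that fix:
import numpy as np

pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 1.2578617e-01, 1.2651859e-01, 1.3053264e-01]
j = np.array(pp)
filtered = j[np.logical_and(j >= 0.0, j <= 1.0)]   # boolean mask keeps only values in [0, 1]
print(filtered)   # [0.12578617 0.12651859 0.13053264]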

Cleaning outliers inside a column with interpolation

I'm trying to do the following.
I have some data with wrong values (x<=0 or x>=1100) inside a dataframe.
I am trying to change those values to values inside an acceptable range.
For the time being, this is what I do, code-wise:
def while_non_nan(A, k):
    init = k
    if k+1 >= len(A)-1:
        return A.iloc[k-1]
    while np.isnan(A[k+1]):
        k += 1
    # Calculate the value.
    n = k - init + 1
    value = (n*A.iloc[init-1] + A.iloc[k]) / (n+1)
    return value

evoli.loc[evoli['T1'] >= 1100, 'T1'] = np.nan
evoli.loc[evoli['T1'] <= 0, 'T1'] = np.nan
inds = np.where(np.isnan(evoli))
# Place column means in the indices. Align the arrays using take
for k in inds[0]:
    evoli['T1'].iloc[k] = while_non_nan(evoli['T1'], k)
I transform the outlier values into NaN.
Afterwards, I get the positions of those NaN.
Finally, I replace each NaN with the mean of the previous value and the next one.
Since several NaN can be next to each other, while_non_nan searches for the next non-NaN value and computes the weighted mean.
Example of what I'm hoping to get:
Input :
[nan 0 1 2 nan 4 nan nan 7 nan ]
Output:
[0 0 1 2 3 4 5 6 7 7 ]
Hope it is clear enough. Thanks!
Pandas has a built-in interpolation you could use after setting the values outside your limits to NaN:
from numpy import nan
import pandas as pd
df = pd.DataFrame({"T1": [1, 2, nan, 3, 5, nan, nan, 4, nan]})
df["T1"] = df["T1"].interpolate(method='linear', axis=0).ffill().bfill()
print(df)
interpolate is a DataFrame/Series method that fills NaN values using the specified interpolation method (linear in this case). Calling .bfill() (backward fill) and .ffill() (forward fill) afterwards ensures the first and last items are also replaced if needed, with the nearest valid value. If you want some fancier strategy for the first and last items, you need to write it yourself.
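Applied to the example sequence from the question, a small sketch (assuming the gaps are already NaN):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0, 1, 2, np.nan, 4, np.nan, np.nan, 7, np.nan])
filled = s.interpolate(method='linear').ffill().bfill()
print(filled.tolist())   # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.0]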

How do I better perform this numpy calculation

I have text file something like this:
0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1
Let's label these columns as:
s1 a s2 r t
I also have another array with dummy values (for simplicity)
>>> V = np.array([10.,20.])
I want to do a certain calculation on these numbers with good performance. The calculation is: for each s1, I want the maximum over a of the sum of t*(r+V[s1]).
For example,
for s1=0, a=0, we will have sum = 2*(1+10)+1*(3+10) = 35
for s1=0, a=1, we will have sum = 1*(4+10)+3*(2+10) = 50
So max of this is 50, which is what I want to obtain as an output for s1=0.
Also, note that, in above calculation, 10 is V[s1].
If I didn't have the last three lines in the file, then for s1=1 I would simply return 3*(5+20)=75, where 20 is V[s1], and the final desired result would be [50,75].
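For concreteness, a quick plain-Python check of that arithmetic for s1=0 (the (r, t) pairs are copied from the sample file):
rows_a0 = [(1, 2), (3, 1)]   # (r, t) pairs for s1=0, a=0
rows_a1 = [(4, 1), (2, 3)]   # (r, t) pairs for s1=0, a=1
sum_a0 = sum(t * (r + 10) for r, t in rows_a0)   # 2*(1+10) + 1*(3+10) = 35
sum_a1 = sum(t * (r + 10) for r, t in rows_a1)   # 1*(4+10) + 3*(2+10) = 50
print(max(sum_a0, sum_a1))   # 50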
So I thought it would be good to load it into numpy as follows (consider values only for s1=0 for simplicity):
>>> c1=[[ [ [0,1,2],[1,3,1] ],[ [0,4,1],[1,2,3] ] ]]
>>> import numpy as np
>>> c1arr = np.array(c1)
>>> c1arr #when I actually load from file, its not loading as this (check Q2 below)
array([[[[0, 1, 2],
         [1, 3, 1]],
        [[0, 4, 1],
         [1, 2, 3]]]])
>>> np.sum(c1arr[0,0][:,2]*(c1arr[0,0][:,1]+V)) #sum over t*(r+V)
45.0
Q1. I am not able to figure out how to modify the above to get the numpy array [45.0, 80.0], so that I can take numpy.max over it.
Q2. When I actually load the file, I am not able to load it as c1arr (as noted in the comment above). Instead, I am getting the following:
>>> type(a) #a is populated by parsing file
<class 'list'>
>>> print(a)
[[[[0, -0.9, 0.3], [1, 0.9, 0.6]], [[0, -0.2, 0.6], [1, 0.7, 0.3]]], [[[1, 0.2, 1.0]], [[0, -0.8, 1.0]]]]
>>> np.array(a) #note that this is not same as c1arr above
<string>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
array([[list([[0, -0.9, 0.3], [1, 0.9, 0.6]]),
list([[0, -0.2, 0.6], [1, 0.7, 0.3]])],
[list([[1, 0.2, 1.0]]),
list([[0, -0.8, 1.0]])]], dtype=object)
How can I fix this?
Q3. Is there any overall better approach, say by laying out the numpy array differently? (Given I am not allowed to use pandas, but only numpy)
In my opinion, the most intuitive and maintainable approach is to use Pandas, where you can assign names to columns. Another important factor is that grouping is much easier in Pandas.
As your input sample contains only integers, I defined V also as an array of integers:
V = np.array([10, 20])
I read your input file as follows:
df = pd.read_csv('Input.txt', sep=' ', names=['s1', 'a', 's2', 'r', 't'])
(print it to see what has been read).
Then, to get results for each combination of s1 and a, you can run:
result = df.groupby(['s1', 'a']).apply(lambda grp:
    (grp.t * (grp.r + V[grp.s1])).sum())
Note that as you refer to named columns, this code is easy to read.
The result is:
s1  a
0   0     35
    1     50
1   0    138
    1    146
dtype: int64
Each result is an integer because V is also an array of int type. But if you define it just as in your post (an array of float), the result will also be of float type (your choice).
If you want the max result for each s1, run:
result.max(level=0)
This time the result is:
s1
0      50
1     146
dtype: int64
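Note: in recent pandas versions the level argument to Series.max has been removed; if that applies to your version, an equivalent form should be:
result.groupby(level='s1').max()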
The Numpy version
If you really are restricted to Numpy, there is also a solution, although it is more difficult to read and update.
Read your input file:
data = np.genfromtxt('Input.txt')
Initially I tried int type, just like in the pandasonic solution, but one of your comments states that the 2 rightmost columns are float. Because Numpy arrays must be of a single type, the whole array must be of float type.
Run the following code:
res = []
# First level grouping - by "s1" (column 0)
for s1 in np.unique(data[:,0]).astype(int):
    dat1 = data[np.where(data[:,0] == s1)]
    res2 = []
    # Second level grouping - by "a" (column 1)
    for a in np.unique(dat1[:,1]):
        dat2 = dat1[np.where(dat1[:,1] == a)]
        # t - column 4, r - column 3
        res2.append((dat2[:,4] * (dat2[:,3] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
The result (a Numpy array) is:
array([[  0.,  50.],
       [  1., 146.]])
The left column contains s1 values and the right column contains the maximum group values from the second-level grouping.
The Numpy version with a structured array
Actually, you can also use a Numpy structured array. Then the code is at least more readable, because you refer to column names, not to column numbers.
Read the array, passing a dtype with column names and types:
data = np.genfromtxt('Input.txt', dtype=[('s1', '<i4'), ('a', '<i4'),
    ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')])
Then run:
res = []
# First level grouping - by "s1"
for s1 in np.unique(data['s1']):
    dat1 = data[np.where(data['s1'] == s1)]
    res2 = []
    # Second level grouping - by "a"
    for a in np.unique(dat1['a']):
        dat2 = dat1[np.where(dat1['a'] == a)]
        res2.append((dat2['t'] * (dat2['r'] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
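One possible further variation, as a sketch assuming the structured data and V defined above: compute the per-row contribution t * (r + V[s1]) once, sum it per (s1, a) group with np.bincount, then take the maximum per s1.
import numpy as np

contrib = data['t'] * (data['r'] + V[data['s1']])           # per-row t*(r+V[s1])
keys = np.stack([data['s1'], data['a']], axis=1)            # (s1, a) per row
groups, inv = np.unique(keys, axis=0, return_inverse=True)  # unique (s1, a) pairs
sums = np.bincount(inv.ravel(), weights=contrib)            # sum per group
result = np.array([[s1, sums[groups[:, 0] == s1].max()]
                   for s1 in np.unique(groups[:, 0])])
print(result)   # roughly [[0., 50.], [1., 146.]]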

How to replace values in an array?

I'm beginning to study Python and saw this:
I have an array (km_media) that has NaN values:
km_media = km / (2019 - year)
It happened because the variable year contains some 2019 values.
So for the sake of learning, I would like to know how to do 2 things:
how can I use replace() to substitute the NaN values with 0 in the variable;
how can I print the variable with the NaN values replaced.
What I have until now:
1.
km_media = km_media.replace('nan', 0)
print(f"{km_media.replace('nan', 0)}")
Thanks
Not sure if this will do what you are looking for:
a = 2 / np.arange(5)
print(a)
array([ inf, 2. , 1. , 0.66666667, 0.5 ])
b = [i if not (np.isinf(i) or np.isnan(i)) else 0 for i in a]
print(b)
Output:
[0, 2.0, 1.0, 0.6666666666666666, 0.5]
Or:
np.where(np.isinf(a) | np.isnan(a), 0, a)
Or:
a[np.isinf(a)] = 0
Also, for part 2 of your question, I'm not sure what you mean. If you have just replaced the infs with 0, then you will just be printing zeros. If you want the index positions of the infs you have replaced, you can grab them before replacement:
np.where(a == np.inf)[0][0]
Output:
0 # this is the index position of np.inf in array a
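For the original km_media case, a minimal sketch that replaces the NaN/inf entries with 0 directly (the sample km and year values here are made up for illustration; np.nan_to_num is another option):
import numpy as np

km = np.array([10000.0, 5000.0, 0.0])
year = np.array([2015, 2018, 2019])
km_media = km / (2019 - year)            # division by zero produces nan/inf (with a warning)
km_media[~np.isfinite(km_media)] = 0     # replace nan and inf with 0
print(km_media)                          # [2500. 5000.    0.]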

How to add nan to the end of an array using numpy

I have a list of multiple arrays and I want them to have the same size, filling the ones with fewer elements with NaN. Some of the arrays contain integers and others contain strings.
For example:
a = ['Nike']
b = [1,5,10,15,20]
c = ['Adidas']
d = [150, 2]
I have tried
max_len = max(len(a),len(b),len(c),len(d))
empty = np.empty(max_len - len(a))
a = np.asarray(a) + empty
empty = np.empty(max_len - len(b))
b = np.asarray(b) + empty
I do the same with all of the arrays; however, an error occurs (TypeError: only integer scalar arrays can be converted to a scalar index).
I am doing this because I want to make a DataFrame with each of the arrays being a different column.
Thank you in advance.
How about this?
df1 = pd.DataFrame([a,b,c,d]).T
I'd suggest using lists since you also have strings. Here's one way using zip_longest:
from itertools import zip_longest
a, b, c, d = map(list,(zip(*zip_longest(a,b,c,d, fillvalue=float('nan')))))
print(a)
# ['Nike', nan, nan, nan, nan]
print(b)
# [1, 5, 10, 15, 20]
print(c)
# ['Adidas', nan, nan, nan, nan]
print(d)
# [150, 2, nan, nan, nan]
Another approach could be:
max_len = len(max([a,b,c,d], key=len))
a, b, c, d = [l+[float('nan')]*(max_len-len(l)) for l in [a,b,c,d]]
You should use numpy.append(array, values, axis) to append to an array. In your example that would be ans = np.append(a, empty).
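A sketch of that idea for the numeric lists, padding with np.full of NaN rather than np.empty (which holds uninitialized garbage); the string lists are better handled by the list-based approaches above, since NaN does not mix cleanly with a string dtype:
import numpy as np

b = np.asarray([1, 5, 10, 15, 20], dtype=float)
d = np.asarray([150, 2], dtype=float)
max_len = max(len(b), len(d))
d = np.append(d, np.full(max_len - len(d), np.nan))   # pad d with NaN up to max_len
print(d)   # [150.   2.  nan  nan  nan]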
You can do that directly just like so:
>>> import pandas as pd
>>> a = ['Nike']
>>> b = [1,5,10,15,20]
>>> c = ['Adidas']
>>> d = [150, 2]
>>> pd.DataFrame([a, b, c, d])
        0    1     2     3     4
0    Nike  NaN   NaN   NaN   NaN
1       1  5.0  10.0  15.0  20.0
2  Adidas  NaN   NaN   NaN   NaN
3     150  2.0   NaN   NaN   NaN

Shannon's Entropy on an array containing zeros

I use the following code to return Shannon's Entropy on an array that represents a probability distribution.
A = np.random.randint(10, size=10)
pA = A / A.sum()
Shannon2 = -np.sum(pA*np.log2(pA))
This works fine if the array doesn't contain any zeros.
Example:
Input: [2 3 3 3 2 1 5 3 3 4]
Output: 3.2240472715
However, if the array does contain zeros, Shannon's entropy produces nan.
Example:
Input:[7 6 6 8 8 2 8 3 0 7]
Output: nan
I do get two RuntimeWarnings:
1) RuntimeWarning: divide by zero encountered in log2
2) RuntimeWarning: invalid value encountered in multiply
Is there a way to alter the code to handle zeros? I'm just not sure if removing them completely will influence the result, specifically whether the variation would be greater due to the greater frequency in the distribution.
I think you want to use nansum to count nans as zero:
A = np.random.randint(10, size=10)
pA = A / A.sum()
Shannon2 = -np.nansum(pA*np.log2(pA))
The easiest and most common way is to ignore the zero probabilities and calculate Shannon's entropy on the remaining values.
Try the following:
import numpy as np
A = np.array([1.0, 2.0, 0.0, 5.0, 0.0, 9.0])
A = np.array(list(filter(lambda x: x != 0, A)))  # list() is needed: filter returns an iterator in Python 3
pA = A / A.sum()
Shannon2 = -np.sum(pA * np.log2(pA))
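If SciPy is available (an assumption; the question only uses numpy), scipy.stats.entropy handles zero probabilities directly by treating 0*log(0) as 0:
import numpy as np
from scipy.stats import entropy

A = np.array([7, 6, 6, 8, 8, 2, 8, 3, 0, 7], dtype=float)
pA = A / A.sum()
# entropy() normalizes its input and skips zero probabilities, so no warnings or nan
Shannon2 = entropy(pA, base=2)
print(Shannon2)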
