I have a list of multiple arrays and I want them all to have the same size, filling the ones with fewer elements with NaN. Some of the arrays contain integers and others contain strings.
For example:
a = ['Nike']
b = [1,5,10,15,20]
c = ['Adidas']
d = [150, 2]
I have tried
max_len = max(len(a),len(b),len(c),len(d))
empty = np.empty(max_len - len(a))
a = np.asarray(a) + empty
empty = np.empty(max_len - len(b))
b = np.asarray(b) + empty
I do the same with all of the arrays, but an error occurs (TypeError: only integer scalar arrays can be converted to a scalar index).
I am doing this because I want to make a DataFrame with each of the arrays as a separate column.
Thank you in advance.
How about this?
df1 = pd.DataFrame([a,b,c,d]).T
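For reference, a minimal sketch of what that produces with the example lists from the question; pandas pads the shorter lists with NaN and the transpose turns each original list into a column (the output shown in the comments is approximate):
import pandas as pd

a = ['Nike']
b = [1, 5, 10, 15, 20]
c = ['Adidas']
d = [150, 2]

df1 = pd.DataFrame([a, b, c, d]).T  # each original list becomes one column
print(df1)
#       0     1       2     3
# 0  Nike     1  Adidas   150
# 1   NaN   5.0     NaN   2.0
# 2   NaN  10.0     NaN   NaN
# 3   NaN  15.0     NaN   NaN
# 4   NaN  20.0     NaN   NaN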
I'd suggest using lists since you also have strings. Here's one way using zip_longest:
from itertools import zip_longest
a, b, c, d = map(list,(zip(*zip_longest(a,b,c,d, fillvalue=float('nan')))))
print(a)
# ['Nike', nan, nan, nan, nan]
print(b)
# [1, 5, 10, 15, 20]
print(c)
# ['Adidas', nan, nan, nan, nan]
print(d)
# [150, 2, nan, nan, nan]
Another approach could be:
max_len = len(max([a,b,c,d], key=len))
a, b, c, d = [l+[float('nan')]*(max_len-len(l)) for l in [a,b,c,d]]
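Either way, once the lists are the same length, building the DataFrame with one column per list is straightforward; the column names here are just illustrative:
import pandas as pd

# assumes a, b, c, d have already been padded to equal length as above
df = pd.DataFrame({'a': a, 'b': b, 'c': c, 'd': d})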
You should use numpy.append(arr, values, axis) to append to an array. In your example that would be ans = np.append(a, empty).
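Here is a minimal sketch of that idea for the numeric arrays; note that appending float NaN to a string array would most likely coerce the NaN to the string 'nan', so the list-based approaches above are probably a better fit for the mixed data:
import numpy as np

b = np.asarray([1, 5, 10, 15, 20], dtype=float)
d = np.asarray([150, 2], dtype=float)
max_len = max(len(b), len(d))

# np.append concatenates rather than adding element-wise, which is what the
# original attempt with "+" was missing
d = np.append(d, np.full(max_len - len(d), np.nan))
# array([150.,   2.,  nan,  nan,  nan])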
You can do that directly, like so:
>>> import pandas as pd
>>> a = ['Nike']
>>> b = [1,5,10,15,20]
>>> c = ['Adidas']
>>> d = [150, 2]
>>> pd.DataFrame([a, b, c, d])
        0    1     2     3     4
0    Nike  NaN   NaN   NaN   NaN
1       1  5.0  10.0  15.0  20.0
2  Adidas  NaN   NaN   NaN   NaN
3     150  2.0   NaN   NaN   NaN
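If you want each list as a named column right away (the stated goal), passing the lists in as Series also works, because pandas aligns them on the index and pads the shorter ones with NaN; the column names below are just illustrative:
import pandas as pd

df = pd.DataFrame({'a': pd.Series(a), 'b': pd.Series(b),
                   'c': pd.Series(c), 'd': pd.Series(d)})
# 'a', 'c' and 'd' are NaN-padded to the length of 'b'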
If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace the NaNs beforehand and that I could probably write my own aggregation function that returns NaN as soon as a NaN is within the group. That function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
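If there are several columns to aggregate, passing the lambda on its own applies it to every non-grouping column; a sketch reusing the frame c from the question:
c.groupby('b').agg(lambda x: x.mean(skipna=False))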
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for exactly this task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply, but slower than the built-in default aggregation:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs a bit more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].to_list()   # group keys that contain a NaN
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
out = gb.sum() * (cnt / cnt)
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
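Applied back to the frame from the question, the same "a single NaN kills the group" masking gives the result the OP asked for (a sketch reusing c from above):
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})
gb = c.groupby('b')

cnt = gb.count()                                # non-NaN values per group
mask = gb.size().values[:, None] == cnt.values  # True only for NaN-free groups
print((gb.sum() / cnt).where(mask))
#      a
# b
# 1  1.5
# 2  NaN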
I have a dataset (ndarray, float32), for example:
[-3.4028235e+38 -3.4028235e+38 -3.4028235e+38 ... 1.2578617e-01
1.2651859e-01 1.3053264e-01] ...
I want to remove all values below 0 or greater than 1, so I use:
with rasterio.open(raster_file) as src:
    h = src.read(1)
    i = h[0]
    i[np.logical_and(i >= 0.0, i <= 1.0)]
Obviously the first entries (i.e. -3.4028235e+38) should be removed, but they still appear after the operation is applied. I'm wondering if this is related to the scientific notation and some pre-processing step is required, but I can't see what exactly. Any ideas?
To simplify this, here is the code again:
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 1.2578617e-01, 1.2651859e-01, 1.3053264e-01]
pp[np.logical_and(pp >= 0.0, pp <= 1.0)]
print (pp)
And the result
pp = [-3.4028235e+38, -3.4028235e+38, -3.4028235e+38, 0.12578617, 0.12651859, 0.13053264]
So the first 3 entries still remain.
The problem is that you are not removing the indices you selected. You are just selecting them.
If you want to remove them, you could convert them to NaNs, like this:
from numpy import random, nan, logical_and
a = random.randn(10, 3)
print(a)
a[logical_and(a > 0, a < 1)] = nan
print(a)
Output example
[[-0.95355719 nan nan]
[-0.21268393 nan -0.24113676]
[-0.58929128 nan nan]
[ nan -0.89110972 nan]
[-0.27453321 1.07802157 1.60466863]
[-0.34829213 nan 1.51556019]
[-0.4890989 nan -1.08481203]
[-2.17016962 nan -0.65332871]
[ nan 1.58937678 1.79992471]
[ nan -0.91716538 1.60264461]]
Alternatively, you can look into masked arrays.
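For completeness, a minimal sketch of the masked-array route using the question's values; masked entries are ignored by most numpy.ma operations rather than physically removed:
import numpy as np

pp = np.array([-3.4028235e+38, -3.4028235e+38, -3.4028235e+38,
               1.2578617e-01, 1.2651859e-01, 1.3053264e-01])

masked = np.ma.masked_outside(pp, 0.0, 1.0)  # mask everything outside [0, 1]
print(masked.compressed())                   # only the unmasked values remain
# [0.12578617 0.12651859 0.13053264]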
Silly mistake: I had to wrap the list in a numpy array, then assign the result of the selection to a new variable, like so:
j = np.array(pp)
mask = j[np.logical_and(j >= 0.0, j <= 1.0)]
I'm new to numba and can't seem to figure out the arguments to pass to vectorize. Here's what I'm trying to do:
test = [x for x in range(10)]
test2 = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c']
test_df = pd.DataFrame({'test': test, 'test2': test2})
test_df['test3'] = np.where(test_df['test'].values % 2 == 0,
test_df['test'].values,
np.nan)
test test2 test3 test4
0 0 a 0.0 0.0
1 1 a NaN NaN
2 2 a 2.0 4.0
3 3 b NaN NaN
4 4 b 4.0 16.0
5 5 c NaN NaN
6 6 c 6.0 36.0
7 7 c NaN NaN
8 8 c 8.0 64.0
9 9 c NaN NaN
The task is to create a new column based on the following logic, first with standard pandas:
def nonnumba_test(row):
    if row['test2'] == 'a':
        return row['test'] * row['test3']
    else:
        return np.nan
Using apply; I understand I can accomplish this much faster using np.where and the .values attribute of the Series objects, but I want to test this against numba.
test_df.apply(nonnumba_test, axis=1)
0 0.0
1 NaN
2 4.0
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
Next, when I try to use the numba.vectorize decorator
@numba.vectorize()
def numba_test(x, y, z):
    if x == 'a':
        return y * z
    else:
        return np.nan
I get the following error when I call it:
numba_test(test_df['test2'].values,
test_df['test'].values,
test_df['test3'].values)
ValueError: Unsupported array dtype: object
I imagine I need to specify the return type in the signature argument, but I can't seem to figure it out.
The problem is that numba does not easily support strings (see here and see here).
The solution is to handle the boolean logic if x=='a' outside the numba decorated function. Modifying your example (both numba_test and the input argument) as follows produces the desired output (everything above the last two blocks in your example is unchanged):
from numba import vectorize, float64, int64, boolean
# @vectorize() will also work here, but I think it's best practice with numba to specify types.
@vectorize([float64(boolean, int64, float64)])
def numba_test(x, y, z):
    if x:
        return y * z
    else:
        return np.nan
# now test it...
# NOTICE the boolean argument, **not** string!
numba_test(test_df['test2'].values =='a',
test_df['test'].values,
test_df['test3'].values)
Returns:
array([ 0., nan, 4., nan, nan, nan, nan, nan, nan, nan])
as desired.
Final note: you'll see that I specify types in the vectorize decorator above. Yes, it's a bit annoying, but I think it's best practice, because it spares you headaches exactly like this one: if you had tried to specify the types, you would have found that there is no string type, and that would have pointed you straight at the problem.
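For comparison, the pure-numpy route mentioned in the question (np.where on the .values arrays) would look roughly like this; the column name test_np is just illustrative:
import numpy as np

# 'test_np' is a made-up column name for this comparison
test_df['test_np'] = np.where(test_df['test2'].values == 'a',
                              test_df['test'].values * test_df['test3'].values,
                              np.nan)
# gives the same 0.0, NaN, 4.0, NaN, ... pattern as the numba version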
I am new to Python and pandas, and I don't know how to solve the following problem in an elegant way.
Let's say we have a simple pandas dataframe.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
df = pd.DataFrame(np.arange(0,60,10), columns=['Value'])
Now set a variable, e.g.:
n = 3
The goal is to add a column to df, made of arrays of the n preceding values for each row.
The next step could be to set NaNs to zero.
Is there a smart way to do this?
Thank you in advance for your help,
Gilbert
We can use df.shift to generate the offset columns and a list comprehension to collect them into a list of lists for the dataframe. The result then needs to be transposed before assigning it to the original df, so that each row gets the list of values that corresponds to it.
df["b"] = np.array([df["a"].shift(x) for x in range(1, 4)]).T.tolist()
Input:
a
0 1
1 2
2 3
3 4
Output:
a b
0 1 [nan, nan, nan]
1 2 [1.0, nan, nan]
2 3 [2.0, 1.0, nan]
3 4 [3.0, 2.0, 1.0]
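Applied to the question's actual frame (column 'Value', n = 3), the same idea would look roughly like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 60, 10), columns=['Value'])
n = 3

# one shifted copy per offset; stacking and transposing gives each row a
# list of its n preceding values, most recent first
df['ArrayValues'] = np.array([df['Value'].shift(k) for k in range(1, n + 1)]).T.tolist()
# row 0: [nan, nan, nan]; row 3: [20.0, 10.0, 0.0]; ...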
This is a little gnarly to wrangle but the following works:
In [63]:
def func(x):
    return pd.Series(df['Value'], index=np.arange(x.name-3,x.name)).values.tolist()
df['ArrayValues'] = df[['Value']].apply(lambda x: func(x), axis=1)
df
Out[63]:
Value ArrayValues
0 0 [nan, nan, nan]
1 10 [nan, nan, 0.0]
2 20 [nan, 0.0, 10.0]
3 30 [0, 10, 20]
4 40 [10, 20, 30]
5 50 [20, 30, 40]
So, first we double-subscript the df using [[]] to force the single column into a DataFrame, which lets us call apply with axis=1 so that func runs row-wise. We need this because we want the current row's index value, accessed via the name attribute, to build a re-indexed Series over the range of preceding positions; index values that don't exist produce NaN. Finally we return the underlying numpy array converted to a list, so the result doesn't try to align on the Series index.
Edit
If we swap the start/stop arguments to np.arange and use a negative step, you get the order you want:
In [70]:
def func(x):
    return pd.Series(df['Value'], index=np.arange(x.name-1,x.name-4,-1)).values.tolist()
df['ArrayValues'] = df[['Value']].apply(lambda x: func(x), axis=1)
df
Out[70]:
Value ArrayValues
0 0 [nan, nan, nan]
1 10 [0.0, nan, nan]
2 20 [10.0, 0.0, nan]
3 30 [20, 10, 0]
4 40 [30, 20, 10]
5 50 [40, 30, 20]
I need to divide two Series element-wise.
The elements are of type float.
A = [10,20,30]
B = [2,5,5]
result = A/B
I expect
result = [5,4,6]
but get
result = [NaN, NaN, NaN]
This just works with pandas Series as expected:
In [3]: import pandas as pd
In [4]: A = pd.Series([10,20,30])
In [5]: B = pd.Series([2,5,5])
In [6]: A/B
Out[6]:
0 5
1 4
2 6
dtype: float64
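A likely cause of the all-NaN result is that the two Series were built with non-matching indexes: pandas aligns on the index before dividing, and labels present in only one Series come out as NaN. A small sketch of the failure mode and two ways around it:
import pandas as pd

A = pd.Series([10, 20, 30], index=[0, 1, 2])
B = pd.Series([2, 5, 5], index=[3, 4, 5])  # no index labels in common

A / B                # all NaN: nothing lines up
A / B.values         # 5.0, 4.0, 6.0 -- bypasses index alignment
A.reset_index(drop=True) / B.reset_index(drop=True)  # same result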