I have lists of numbers inside a pandas DataFrame, and I am trying to use a lambda function + list comprehension to remove values from these lists.
col1  col2
a     [-1, 2, 10, 600, -10]
b     [-0, -5, -6, -200, -30]
c     .....
etc.
df.col2.apply(lambda x: [i for i in x if i>= 0]) #just trying to remove negative values
The numbers are always ascending and can be all negative, all positive, or a mix. The lists are about 200 items long, all integers.
I get this error:
TypeError: 'numpy.float64' object is not iterable
Edit: when I do it this way it works:
[i for i in df['col2'][#] if i >= 0]
I guess I could run this through a for loop... seems slow though.
Edit 2: Looking at it with fresh eyes, it turns out that the column isn't entirely made up of lists; there are a few float values spread throughout (duh). Something weird was going on with the merge, and once I corrected that the code above worked as expected. Thanks for the help!
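As a side note, if the column can end up with a mix of lists and stray floats (as happened here), one way to make the original apply robust is to only filter entries that are actually lists. This is just a minimal sketch of that idea, not the only option:
df.col2.apply(lambda x: [i for i in x if i >= 0] if isinstance(x, list) else x)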
Because x in your lambda is a float, and you can't loop over a float :p.
If you need to do so, you can:
In [2]: np.random.seed(4)
...: df = pd.DataFrame(np.random.randint(-5,5, 7)).rename(columns={0:"col2"})
...: df.col2 = df.col2.astype(float)
...: df
Out[2]:
col2
0 2.0
1 0.0
2 -4.0
3 3.0
4 2.0
5 3.0
6 -3.0
In [3]: df.col2.apply(lambda x: x if x > 0 else None).dropna()
Out[3]:
0 2.0
3 3.0
4 2.0
5 3.0
Name: col2, dtype: float64
Related
I have a df with a column containing floats (transaction values).
I would like to iterate through the column and only print the value if it is not NaN.
Right now I have the following condition:
if j > 0:
    print(j)
    i += 1
else:
    i += 1
where i is my iteration number.
I do this because I know that in my dataset there are no negative values, so that is my workaround, but I would like to know how it would be done correctly if I did have negative values.
So what would the if condition be?
I have tried j != None
and j != np.nan, but it still prints all the NaNs.
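For what it's worth, the reason those comparisons fail is that NaN compares unequal to everything, including itself, so j != np.nan is always True. A minimal sketch of an explicit per-element check (assuming j is a float taken from the column):
import numpy as np
import pandas as pd

for j in pd.Series([1.5, np.nan, 3.0]):
    if pd.notna(j):  # NaN != NaN, so equality checks cannot detect it
        print(j)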
Why not use built-in pandas functionality?
Given some dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [-1, -2, 0, np.nan, 5, 6, np.nan]
})
You can filter out all NaNs:
df[df['a'].notna()]
>>> a
0 -1.0
1 -2.0
2 0.0
4 5.0
5 6.0
or only positive numbers:
df[df['a'] > 0]
>>> a
4 5.0
5 6.0
Let's assume here that the datatype of the filled values in the column is int64:
for i in range(0, 10):
    if type(array[i]) != int:
        print(array[i])
Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a number, say 5, and return the index where it previously occurred. So for the element 5, I should get 4, as the element appears at indices 4 and 6. Now I want to do this for all of the elements of the series, which can easily be done using a for loop:
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()
    if idx_prev < idx:
        return idx_prev
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
It's possible to keep working in numpy.
Equivalent of [num[num == x].idxmax() for idx,x in enumerate(num)] using numpy is:
_, out = np.unique(num.values, return_inverse=True)
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the invalid values of out to NaN like this:
out_series = pd.Series(out)
out_series[out >= np.arange(len(out))] = np.nan
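As a quick check, continuing the snippet above (note that assigning NaN upcasts the integer series to float; newer pandas versions may warn about this), out_series now matches the desired output:
print(out_series.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]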
If I calculate the mean of a groupby object and within one of the groups there is a NaN (or several), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect it to return NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I probably could write my own aggregation function to return NaN as soon as a NaN is within the group. Such a function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips the NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for exactly this task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on these comments: @Serge Ballesta, @RoelAdriaans
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
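If you need an actual NaN in the result rather than inf, you can replace it afterwards, as in the workaround shown above:
c.groupby('b').mean().replace(np.inf, np.nan)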
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply but slower than the default sum:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').sum()
nan_index = c.loc[c['a'].isna(), 'b'].unique()  # group keys that contain a NaN in 'a'
method3.loc[method3.index.isin(nan_index)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
    'a': [1, 2, 1, 2],
    'b': [1, np.nan, 2, 3],
    'c': [1, np.nan, 2, np.nan],
    'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
out = gb.sum() * (cnt / cnt)
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
I want to compute the expanding window of just the last few elements in a group...
df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1,1], 'A': [1, 2, 1, 2, 1, 2,1,2]})
df.groupby("A")["B"].expanding().quantile(0.5)
this gives:
1  0    NaN
   2    1.0
   4    1.5
   6    1.0
2  1    NaN
   3    1.0
   5    1.5
   7    1.0
I only really want the last two rows for each group, though. The result should be:
1  4    1.5
   6    1.0
2  5    1.5
   7    1.0
I can easily calculate it all and then just get the sections I want, but this is very slow if my dataframe is thousands of elements long, and I don't want to roll across the whole window... just the last two "rolls".
EDIT: I have amended the title; a lot of people are correctly answering part of the question, but ignoring what is IMO the important part (I should have been clearer).
The issue here is the time it takes. I could just "tail" the answer to get the last two, but then it involves calculating the first two "expanding windows" and throwing away those results. If my dataframe were instead thousands of rows long and I just needed the answer for the last few entries, much of this calculation would be wasted time. This is the main problem I have.
As I stated:
"I can easily calculate it all and then just get the sections I want" => through using tail.
Sorry for the confusion.
Also, potentially using tail doesn't involve calculating the lot, but it still seems like it does from the timings that I have done... maybe this is not correct; it is an assumption I have made.
EDIT2: The other option I have tried was using min_periods in rolling to force it to not calculate the initial sections of the group, but this has many pitfalls, such as: it doesn't work if the array includes NaNs, or if the groups are not the same length.
EDIT3:
As a simpler problem and reasoning: I think it's a limitation of the expanding/rolling window. Say we had an array [1,2,3,4,5]; the expanding windows are [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and if we run max over them we get 1, 2, 3, 4, 5 (the max of each window). But if I just want the max of the last two expanding windows, I only need max([1,2,3,4]) = 4 and max([1,2,3,4,5]) = 5. Intuitively, I don't need to calculate the max of the first three expanding windows to get the last two. But pandas' implementation might be such that it calculates max([1,2,3,4]) as max(max([1,2,3]), max([4])) = 4, in which case the calculation over the entire window is necessary... this might be the same for the quantile example. There might be an alternate way to do this without using expanding... not sure... this is what I can't work out.
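To illustrate the idea in EDIT3: the last two expanding windows of a group are just the group without its last element and the full group, so in principle they can be computed directly with two calls instead of building every intermediate window. A rough sketch of that idea (not a drop-in replacement for the groupby code above):
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
last_two = [np.quantile(arr[:-1], 0.5), np.quantile(arr, 0.5)]
# [2.5, 3.0] -- two quantile calls instead of five expanding windows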
Maybe try using tail: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.core.groupby.GroupBy.tail.html
df.groupby('A')['B'].rolling(4, min_periods=1).quantile(0.5).reset_index(level=0).groupby('A').tail(2)
Out[410]:
A B
4 1 1.5
6 1 1.0
5 2 1.5
7 2 1.0
rolling and expanding are similar
How about this (edited 06/12/2018):
def last_two_quantile(row, q):
    return pd.Series([row.iloc[:-1].quantile(q), row.quantile(q)])
df.groupby('A')['B'].apply(last_two_quantile, 0.5)
Out[126]:
A
1  0    1.5
   1    1.0
2  0    1.5
   1    1.0
Name: B, dtype: float64
If this (or something like it) doesn't do what you desire I think you should provide a real example of your use case.
Is this what you want?
df[-4:].groupby("A")["B"].expanding().quantile(0.5)
A
1  4    2.0
   6    1.5
2  5    2.0
   7    1.5
Name: B, dtype: float64
Hope this can help you.
Solution1:
newdf = df.groupby("A")["B"].expanding().quantile(0.5).reset_index()
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
Solution2:
newdf2 = df.groupby("A")["B"].expanding().quantile(0.5)
for i in range(newdf2.index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
Solution3:
for i in range(df.groupby("A")["B"].expanding().quantile(0.5).index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:], '\n')
output:
Empty DataFrame
Columns: [A, level_1, B]
Index: []
   A  level_1    B
2  1        4  1.5
3  1        6  1.0
   A  level_1    B
6  2        5  1.5
7  2        7  1.0
new solution:
newdf = pd.DataFrame(columns={"A", "B"})
for i in range(len(df["A"].unique())):
    newdf = newdf.append(pd.DataFrame(df[df["A"]==i+1][:-2].sum()).T)
newdf["A"] = newdf["A"]/2
for i in range(len(df["A"].unique())):
    newdf = newdf.append(df[df["A"]==df["A"].unique()[i]][-2:])
#newdf = newdf.reset_index(drop=True)
newdf["A"] = newdf["A"].astype(int)
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i].groupby("A")["B"].expanding().quantile(0.5)[-2:])
output:
Series([], Name: B, dtype: float64)
A
1  4    1.5
   6    1.0
Name: B, dtype: float64
A
2  5    1.5
   7    1.0
Name: B, dtype: float64
I have a pandas Series:
0 1
1 5
2 20
3 -1
Let's say I want to apply mean() on every two elements, so I get something like this:
0 3.0
1 9.5
Is there an elegant way to do this?
You can group by the index floor-divided by k=2:
k = 2
print (s.index // k)
Int64Index([0, 0, 1, 1], dtype='int64')
print (s.groupby([s.index // k]).mean())
name
0 3.0
1 9.5
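Note that s.index // k relies on the Series having the default 0..n-1 integer index; for an arbitrary index you could group on positions instead. A minimal variation of the same idea:
import numpy as np
print (s.groupby(np.arange(len(s)) // k).mean())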
You can do this:
(s.iloc[::2].values + s.iloc[1::2])/2
If you want, you can also reset the index afterwards, so you have 0, 1 as the index, using:
((s.iloc[::2].values + s.iloc[1::2])/2).reset_index(drop=True)
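One caveat: this assumes the Series has an even length; for an odd length the two slices have different sizes and the addition will fail. A minimal sketch that drops the trailing element first (just one way to handle it):
n = len(s) - (len(s) % 2)  # ignore the last element if the length is odd
pd.Series((s.iloc[:n:2].values + s.iloc[1:n:2].values) / 2)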
If you are using this over large series and many times, you'll want to consider a fast approach. This solution uses all numpy functions and will be fast.
Use reshape and construct a new pd.Series
consider the pd.Series s
s = pd.Series([1, 5, 20, -1])
generalized function
def mean_k(s, k):
    pad = (k - s.shape[0] % k) % k
    nan = np.repeat(np.nan, pad)
    val = np.concatenate([s.values, nan])
    return pd.Series(np.nanmean(val.reshape(-1, k), axis=1))
demonstration
mean_k(s, 2)
0 3.0
1 9.5
dtype: float64
mean_k(s, 3)
0 8.666667
1 -1.000000
dtype: float64