I have the data as below, the new pandas version doesn't preserve the grouped columns after the operation of fillna/ffill/bfill. Is there a way to have the grouped data?
data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With the most recent pandas if we would like keep the groupby columns , we need to adding apply here
out = df.groupby(['one','two']).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Does it what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes please set the index and then try grouping it so that it will preserve the columns as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
Related
I have three Pandas Dataframes:
df1:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
df2:
0 1
3 7
6 5
9 2
df3:
1 2
4 6
7 6
My goal is to assign the values of df2 and df3 to df1 based on the index.
df1 should then become:
0 1
1 2
2 NaN
3 7
4 6
5 NaN
6 5
7 6
8 NaN
9 2
I tried with simple assinment:
df1.loc[df2.index] = df2.values
or
df1.loc[df2.index] = df2
but this gives me an ValueError:
ValueError: Must have equal len keys and value when setting with an iterable
Thanks for your help!
You can do concat with combine_first:
pd.concat([df2,df3]).combine_first(df1)
Or reindex:
pd.concat([df2,df3]).reindex_like(df1)
0 1.0
1 2.0
2 NaN
3 7.0
4 6.0
5 NaN
6 5.0
7 6.0
8 NaN
9 2.0
I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to ffill on for certain occurences of values? Specifically, I want to forward fill only if values in df.data are == 1 or 4. Should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4
One option would be to forward fill (ffill) the column, then only populate where the values are 1 or 4 using (isin) and (mask):
s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
I have a df
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
I need to groub by a and b and then if c or d contains 1 or more nan's within groups I want the entire group in the specific column to be nan:
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 nan
1 3 1 nan
1 1 nan 3
1 1 nan 3
1 1 nan 4
and then combine c and d that there is no nan's anymore
a b c d e
0 1 nan 1 1
0 2 2 nan 2
0 2 3 nan 3
1 3 1 nan 1
1 1 nan 3 3
1 1 nan 3 3
1 1 nan 4 4
You will want to check each group for whether it is nan and then set the appropriate value (nan or existing value) and then use combine_first() to combine the columns.
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO("""
a b c d
0 1 nan 1
0 2 2 nan
0 2 3 4
1 3 1 nan
1 1 nan 3
1 1 2 3
1 1 2 4
"""), sep=' ')
for col in ['c', 'd']:
df[col] = df.groupby(['a','b'])[col].transform(lambda x: np.nan if any(x.isna()) else x)
df['e'] = df['c'].combine_first(df['d'])
df
a b c d e
0 0 1 NaN 1.0 1.0
1 0 2 2.0 NaN 2.0
2 0 2 3.0 NaN 3.0
3 1 3 1.0 NaN 1.0
4 1 1 NaN 3.0 3.0
5 1 1 NaN 3.0 3.0
6 1 1 NaN 4.0 4.0
I would like to use rolling count with maximum value is 36 which need to include NaN value such as start with 0 if its NaN. I have dataframe that look like this:
Input:
val
NaN
1
1
NaN
2
1
3
NaN
5
Code:
b = a.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x))).astype(int)
It gives me:
Val count
NaN 1
1 2
1 2
NaN 3
2 4
1 4
3 5
NaN 6
5 7
Expected Output:
Val count
NaN 0
1 1
1 1
NaN 1
2 2
1 2
3 3
NaN 3
5 4
You can just filter out nan
df.val.rolling(36,min_periods=1).apply(lambda x: len(np.unique(x[~np.isnan(x)]))).fillna(0)
Out[35]:
0 0.0
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 3.0
7 3.0
8 4.0
Name: val, dtype: float64
The reason why
np.unique([np.nan]*2)
Out[38]: array([nan, nan])
np.nan==np.nan
Out[39]: False
My dataset looks like this(first row is header)
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns of the dataset based on the column [5], e.g:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x[:x.iloc[5]], 1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Recreate your dataframe
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0