Fill by group and between two values - python

I want to fill all rows between two values, by group. For each group, var1 has two values equal to 1, and I want to fill the missing rows between those two 1s. var1 is what I have, var2 is what I want, and var3 is what my code produces, which is not what I want (it differs from var2):
var1  group  var2  var3
NaN   1      NaN   NaN
NaN   1      NaN   NaN
1     1      1     1
NaN   1      1     1
NaN   1      1     1
1     1      1     1
NaN   1      NaN   1
NaN   1      NaN   1
1     2      1     1
NaN   2      1     1
1     2      1     1
NaN   2      NaN   1
My code:
df['var3'] = df.groupby('group')['var1'].ffill()

Assuming the values are only 1 or NaN, you can groupby.ffill and groupby.bfill and keep only the positions where the two results are identical:
g = df.groupby('group')['var1']
s1 = g.ffill()  # fill forward from the first 1
s2 = g.bfill()  # fill backward from the last 1
df['var2'] = s1.where(s1.eq(s2))  # keep only where both fills agree
Output:
var1 group var2
0 NaN 1 NaN
1 NaN 1 NaN
2 1.0 1 1.0
3 NaN 1 1.0
4 NaN 1 1.0
5 1.0 1 1.0
6 NaN 1 NaN
7 NaN 1 NaN
8 1.0 2 1.0
9 NaN 2 1.0
10 1.0 2 1.0
11 NaN 2 NaN
Intermediates:
var1 group var2 ffill bfill
0 NaN 1 NaN NaN 1.0
1 NaN 1 NaN NaN 1.0
2 1.0 1 1.0 1.0 1.0
3 NaN 1 1.0 1.0 1.0
4 NaN 1 1.0 1.0 1.0
5 1.0 1 1.0 1.0 1.0
6 NaN 1 NaN 1.0 NaN
7 NaN 1 NaN 1.0 NaN
8 1.0 2 1.0 1.0 1.0
9 NaN 2 1.0 1.0 1.0
10 1.0 2 1.0 1.0 1.0
11 NaN 2 NaN 1.0 NaN
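For reference, the intermediates table above can be rebuilt from the pieces already computed; a minimal sketch:
# attach the two intermediate fills next to the result
print(df.assign(ffill=s1, bfill=s2))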

Related

Transfer multiple columns string values to numbers in Pandas

I'm working with a data frame like this:
id type1 type2 type3
0 1 dog NaN NaN
1 2 cat NaN NaN
2 3 dog cat NaN
3 4 cow NaN NaN
4 5 dog NaN NaN
5 6 cat NaN NaN
6 7 cat dog cow
7 8 dog NaN NaN
How can I transfer it to the following dataframe? Thank you.
id dog cat cow
0 1 1.0 NaN NaN
1 2 NaN 1.0 NaN
2 3 1.0 1.0 NaN
3 4 NaN NaN 1.0
4 5 1.0 NaN NaN
5 6 NaN 1.0 NaN
6 7 1.0 1.0 1.0
7 8 1.0 NaN NaN
First filter only the type columns with DataFrame.filter and reshape with DataFrame.stack, so Series.str.get_dummies can be called. Then, on the 0/1 output, take the max over the first level of the MultiIndex and change the 0s to NaN with DataFrame.mask. Last, add the first column back with DataFrame.join:
df1 = (df.filter(like='type')
         .stack()
         .str.get_dummies()
         .groupby(level=0).max()   # max(level=0) in pandas < 2.0
         .mask(lambda x: x == 0))
Or use get_dummies, take the max per column name, and again change the 0s to NaN:
df1 = (pd.get_dummies(df.filter(like='type'), prefix='', prefix_sep='')
         .T.groupby(level=0).max().T   # max(level=0, axis=1) in pandas < 2.0
         .mask(lambda x: x == 0))
df = df[['id']].join(df1)
print(df)
id cat cow dog
0 1 NaN NaN 1.0
1 2 1.0 NaN NaN
2 3 1.0 NaN 1.0
3 4 NaN 1.0 NaN
4 5 NaN NaN 1.0
5 6 1.0 NaN NaN
6 7 1.0 1.0 1.0
7 8 NaN NaN 1.0
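For completeness, here is a self-contained sketch of the first approach, with the question's frame reconstructed from the example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 9),
    'type1': ['dog', 'cat', 'dog', 'cow', 'dog', 'cat', 'cat', 'dog'],
    'type2': [np.nan, np.nan, 'cat', np.nan, np.nan, np.nan, 'dog', np.nan],
    'type3': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'cow', np.nan],
})

# stack drops the NaNs, get_dummies one-hot encodes the animal names,
# and groupby(level=0).max() collapses type1..type3 back to one row per id
df1 = (df.filter(like='type')
         .stack()
         .str.get_dummies()
         .groupby(level=0).max()
         .mask(lambda x: x == 0))
print(df[['id']].join(df1))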

pivot dataframe with duplicate values

Consider the below pd.DataFrame:
import numpy as np
import pandas as pd

temp = pd.DataFrame({'label_0': [1, 1, 1, 2, 2, 2],
                     'label_1': ['a', 'b', 'c', np.nan, 'c', 'b'],
                     'values': [0, 2, 4, np.nan, 8, 5]})
print(temp)
label_0 label_1 values
0 1 a 0.0
1 1 b 2.0
2 1 c 4.0
3 2 NaN NaN
4 2 c 8.0
5 2 b 5.0
My desired output is:
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
I have tried pd.pivot and wrangling with pd.groupby, but cannot get to the desired output due to the duplicate entries. Any help is most appreciated.
d = {}
# temp.get pulls each column, so zip iterates the rows as (label_0, label_1, value)
for _0, _1, v in zip(*map(temp.get, temp)):
    d.setdefault(_1, {})[_0] = v

pd.DataFrame.from_dict(d, orient='index')
1 2
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
NaN NaN NaN
OR
pd.DataFrame.from_dict(d, orient='index').rename_axis('label_1').reset_index()
label_1 1 2
0 a 0.0 NaN
1 b 2.0 5.0
2 c 4.0 8.0
3 NaN NaN NaN
Another way is to use set_index and unstack:
temp.set_index(['label_0','label_1'])['values'].unstack(0)
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
You can fillna (so NaN becomes a usable key) and then pivot; note that pivot's arguments are keyword-only since pandas 2.0:
temp.fillna('NaN').pivot(index='label_0', columns='label_1', values='values').T
Out[251]:
label_0 1 2
label_1
NaN NaN NaN
a 0 NaN
b 2 5
c 4 8
Seems like a straightforward pivot works:
temp.pivot(columns='label_0', index='label_1', values='values')
Output:
label_0 1 2
label_1
NaN NaN NaN
a 0.0 NaN
b 2.0 5.0
c 4.0 8.0
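Note that if there really were duplicate (label_0, label_1) pairs, pivot would raise a ValueError; pivot_table with an explicit aggfunc is the usual way around that. A sketch, assuming taking the first value per pair is acceptable (pivot_table drops NaN group keys, hence the fillna):
# make the NaN key explicit so the 'NaN' row survives the grouping
out = (temp.fillna({'label_1': 'NaN'})
           .pivot_table(index='label_1', columns='label_0',
                        values='values', aggfunc='first', dropna=False))
print(out)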

How to remove clustered/unclustered values less than a certain length from pandas dataframe?

If I have a pandas data frame like this:
A
1 1
2 1
3 NaN
4 1
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 1
12 1
13 1
How do I remove values that form a cluster (a run of consecutive non-NaN values) shorter than some length, in this case four? The result should be a frame like this:
A
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 NaN
12 NaN
13 NaN
Using groupby and np.where: df.A.isnull().cumsum() increments at every NaN, so each run of consecutive non-NaN values gets its own group key, and transform broadcasts each run's non-null count back to its rows:
import numpy as np

s = df.groupby(df.A.isnull().cumsum()).transform(lambda s: pd.notnull(s).sum())
df['B'] = np.where(s.A >= 4, df.A, np.nan)
Output:
A B
1 1.0 NaN
2 1.0 NaN
3 NaN NaN
4 1.0 NaN
5 NaN NaN
6 1.0 1.0
7 1.0 1.0
8 1.0 1.0
9 1.0 1.0
10 NaN NaN
11 1.0 NaN
12 1.0 NaN
13 1.0 NaN
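To see why the grouping works, here is a sketch of the intermediate group keys, with the sample column reconstructed from the frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, np.nan, 1, np.nan, 1, 1, 1, 1, np.nan, 1, 1, 1]},
                  index=range(1, 14))

# every NaN bumps the counter, so each run of consecutive
# non-NaN values shares a single group key
print(df.A.isnull().cumsum().tolist())
# [0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3]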

How to merge multiindex column dataframe

I want to merge static data with time-varying data.
First dataframe:
a_columns = pd.MultiIndex.from_product([["A", "B", "C"], ["1", "2"]])
a_index = pd.date_range("20100101", "20110101", freq="BM")
a = pd.DataFrame(columns=a_columns, index=a_index)  # A
Second dataframe:
b_columns = ["3", "4", "5"]
b_index = ["A", "B", "C"]
b = pd.DataFrame(columns=b_columns, index=b_index)
How do I join these two? My desired dataframe has the same form as a, but with additional columns.
Thanks!
I think you need to reshape b with stack and then create a one-row DataFrame with to_frame. concat needs a DatetimeIndex, so the new row's index is taken from the first value of a's index. Last, concat + sort_index:
# added some data - 2
a_columns = pd.MultiIndex.from_product([["A", "B", "C"], ["1", "2"]])
a_index = pd.date_range("20100101", "20110101", freq="BM")
a = pd.DataFrame(2, columns=a_columns, index=a_index)
# added some data - 1
b_columns = ["3", "4", "5"]
b_index = ["A", "B", "C"]
b = pd.DataFrame(1, columns=b_columns, index=b_index)

c = b.stack().to_frame(a.index[0]).T
print(c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
d = pd.concat([a, c], axis=1).sort_index(axis=1)
print(d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-03-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-04-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-05-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-06-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-07-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-08-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-09-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-10-29 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-11-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-12-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
Last, if you need to replace the NaNs in the added columns with the values from the first row, forward fill them:
d[c.columns] = d[c.columns].ffill()
print(d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-03-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-04-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-05-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-06-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-07-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-08-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-09-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-10-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-11-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-12-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
A similar solution with reindex:
c = b.stack().to_frame(a.index[0]).T.reindex(a.index, method='ffill')
print(c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
2010-02-26 1 1 1 1 1 1 1 1 1
2010-03-31 1 1 1 1 1 1 1 1 1
2010-04-30 1 1 1 1 1 1 1 1 1
2010-05-31 1 1 1 1 1 1 1 1 1
2010-06-30 1 1 1 1 1 1 1 1 1
2010-07-30 1 1 1 1 1 1 1 1 1
2010-08-31 1 1 1 1 1 1 1 1 1
2010-09-30 1 1 1 1 1 1 1 1 1
2010-10-29 1 1 1 1 1 1 1 1 1
2010-11-30 1 1 1 1 1 1 1 1 1
2010-12-31 1 1 1 1 1 1 1 1 1
d = pd.concat([a, c], axis=1).sort_index(axis=1)
print(d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-02-26 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-03-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-04-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-05-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-06-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-07-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-08-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-09-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-10-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-11-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-12-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
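If the static values should simply repeat on every row, a plain loop that assigns scalars gives the same result as the reindex variant; a minimal sketch using the a and b defined above:
# broadcast each static value into a new (asset, field) column
d = a.copy()
for asset in b.index:        # "A", "B", "C"
    for field in b.columns:  # "3", "4", "5"
        d[(asset, field)] = b.loc[asset, field]
print(d.sort_index(axis=1))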

pandas DataFrame filter by rows and columns

I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get a subset of the DataFrame where there is one or more '1' in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, removing all rows and columns that do not have a 1 in them.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1) > 0]
And then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0) > 0]
but get this error after the second line:
IndexingError('Unalignable boolean Series key provided')
Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df, (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum down the rows (axis=0, the default), collapsing the frame to one value per column, and threshold at 0, I get:
In [113]: row_sum = df.sum()
In [114]: row_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of row_sum is the columns of df, it doesn't make sense in this case to try to use the values of row_sum > 0 to fancy-index into the rows of df, since their row indices are not aligned and they cannot be aligned.
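For contrast, the same mask works fine on the axis it was computed over, since row_sum's index matches df's columns; a minimal sketch:
# row_sum is indexed by df's columns, so use it to select columns
df.loc[:, row_sum > 0]  # drops the all-NaN column 'd'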
Alternatively, to remove all-NaN rows or columns, you can use .any() too.
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN
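Since the frame holds only 1s and NaNs, dropna is an equivalent spelling of the .any() approach; a minimal sketch using the same df:
# drop rows, then columns, that are entirely NaN
df.dropna(how='all').dropna(how='all', axis=1)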
