How to do forward filling for each group in pandas - python

I have a dataframe similar to below
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B, C by forward filling, but per group. That means the forward filling should be applied within each id. How can I do that?

Use GroupBy.ffill for forward filling per group. If the first values in a group are NaN there is nothing to fill from, so chain fillna to replace them and finally cast back to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
  A B C
0 2 3 4
1 2 4 4
2 3 4 5
3 3 4 5
Or:
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
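In recent pandas versions the whole answer can be run end to end as follows (a sketch; the frame is rebuilt from the printout above):

```python
import numpy as np
import pandas as pd

# Rebuild the answer's example frame
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'A': [2.0, np.nan, 3.0, np.nan],
    'B': [3.0, 4.0, 4.0, np.nan],
    'C': [4.0, np.nan, 5.0, 5.0],
    'D': [5, 6, 6, 4],
    'E': [np.nan, np.nan, 6.0, 1.0],
})

cols = ['A', 'B', 'C']
# Forward fill within each id, replace any leading NaN with 0, cast to int
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print(df)
```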

Related

How to assign a value from the last row of a preceding group to the next group?

The goal is to put the digit from the last row of the previous letter group into the new column "last_digit_prev_group". The expected correct values were entered by me manually in the column "col_ok". I tried shift(), but the effect was far from what I expected. Maybe there is some other way?
Forgive the inconsistency of my post; I'm not an IT specialist and English is not my first language. Thanks in advance for your support.
df = pd.read_csv('C:/Users/.../a.csv', names=['group_letter', 'digit', 'col_ok'],
                 index_col=0)
df['last_digit_prev_group'] = df.groupby('group_letter')['digit'].shift(1)
print(df)
group_letter digit col_ok last_digit_prev_group
A 1 n NaN
A 3 n 1.0
A 2 n 3.0
A 5 n 2.0
A 1 n 5.0
B 1 1 NaN
B 2 1 1.0
B 1 1 2.0
B 1 1 1.0
B 3 1 1.0
C 5 3 NaN
C 6 3 5.0
C 1 3 6.0
C 2 3 1.0
C 3 3 2.0
D 4 3 NaN
D 3 3 4.0
D 2 3 3.0
D 5 3 2.0
D 7 3 5.0
Use Series.mask with DataFrame.duplicated to keep only the last value of digit per group, then Series.shift and finally ffill:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
.shift()
.ffill())
print (df)
group_letter digit col_ok last_digit_prev_group
0 A 1 n NaN
1 A 3 n NaN
2 A 2 n NaN
3 A 5 n NaN
4 A 1 n NaN
5 B 1 1 1.0
6 B 2 1 1.0
7 B 1 1 1.0
8 B 1 1 1.0
9 B 3 1 1.0
10 C 5 3 3.0
11 C 6 3 3.0
12 C 1 3 3.0
13 C 2 3 3.0
14 C 3 3 3.0
15 D 4 3 3.0
16 D 3 3 3.0
17 D 2 3 3.0
18 D 5 3 3.0
19 D 7 3 3.0
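A minimal runnable sketch of the chain, on a smaller hypothetical frame (col_ok omitted):

```python
import pandas as pd

# Hypothetical frame mirroring the question's structure
df = pd.DataFrame({
    'group_letter': list('AAABBBCCC'),
    'digit':        [1, 3, 2, 1, 2, 3, 5, 6, 1],
})

# Keep only the last digit of each letter group, shift it down one row
# so it lands on the first row of the next group, then forward fill.
df['last_digit_prev_group'] = (
    df['digit']
      .mask(df.duplicated('group_letter', keep='last'))
      .shift()
      .ffill()
)
print(df)
```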
If it is possible that some last value is NaN:
df['last_digit_prev_group'] = (df['digit'].mask(df.duplicated('group_letter', keep='last'))
                                          .shift()
                                          .groupby(df['group_letter'])
                                          .ffill())
print (df)
group_letter digit col_ok last_digit_prev_group
0 A 1.0 n NaN
1 A 3.0 n NaN
2 A 2.0 n NaN
3 A 5.0 n NaN
4 A 1.0 n NaN
5 B 1.0 1 1.0
6 B 2.0 1 1.0
7 B 1.0 1 1.0
8 B 1.0 1 1.0
9 B 3.0 1 1.0
10 C 5.0 3 3.0
11 C 6.0 3 3.0
12 C 1.0 3 3.0
13 C 2.0 3 3.0
14 C NaN 3 3.0
15 D 4.0 3 NaN
16 D 3.0 3 NaN
17 D 2.0 3 NaN
18 D 5.0 3 NaN
19 D 7.0 3 NaN
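The grouped variant can be sketched the same way; in this hypothetical frame group C's last digit is NaN, so group D correctly stays NaN instead of inheriting a stale value from an earlier group:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the last digit of group C is NaN
df = pd.DataFrame({
    'group_letter': list('AAABBBCCCDDD'),
    'digit': [1, 3, 2, 1, 2, 3, 5, 6, np.nan, 4, 3, 2],
})

# Grouping the ffill by group_letter keeps the fill inside each
# group, so a NaN last value cannot leak into the following group.
df['last_digit_prev_group'] = (
    df['digit']
      .mask(df.duplicated('group_letter', keep='last'))
      .shift()
      .groupby(df['group_letter'])
      .ffill()
)
```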

How to fill NaN in one column depending from values two different columns

I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them back into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
Will be thankful for any help.
Use pandas.DataFrame.groupby.transform then fillna. Demonstrated below on a shortened variant of the frame where row 1 holds the NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # < Value with nan
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
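End to end on the question's full frame, this can be sketched as (the id column is omitted since it plays no role in the fill):

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame; rows 11-14 have missing values
df = pd.DataFrame({
    'group':    [1, 1, 3, 2, 3, 2, 1, 1, 2, 3, 2, 1, 1, 2, 3],
    'subgroup': list('AABCACBCACBACBA'),
    'value':    [1, 3, 8, 1, 3, 6, 2, 3, 7, 4, 6,
                 np.nan, np.nan, np.nan, np.nan],
})

# transform('median') broadcasts each (group, subgroup) median back
# to the original row positions, so fillna can align on the index.
medians = df.groupby(['group', 'subgroup'])['value'].transform('median')
df['value'] = df['value'].fillna(medians)
```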

Pandas insert empty row at 0th position

Suppose have following data frame
A B C D E
1 2 3 4 5
4 5 6 7 8
I want to check if df.iloc[0, 0] is NaN and, if so, insert a row of NaN at the 0th position. So in the above case it will be
A B C D E
NaN NaN NaN NaN NaN
1 2 3 4 5
4 5 6 7 8
I am able to check the (0, 0) element, but how do I insert an empty row at the first position?
Use append of DataFrame with one empty row:
df1 = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = df1.append(df, ignore_index=True)
print (df)
A B C D E
0 NaN NaN NaN NaN NaN
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0
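Note that DataFrame.append was removed in pandas 2.0, so in current versions the same prepend is written with pd.concat:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8]],
                  columns=list('ABCDE'))

# Build a one-row all-NaN frame with the same columns and
# concatenate it in front of the original data.
empty = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
df = pd.concat([empty, df], ignore_index=True)
print(df)
```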
Perhaps you can first append a row of zeros, shift all rows down, and overwrite the first row with 0:
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
df.loc[len(df)] = 0
df
A B C D E
0 1 2 3 4 5
1 4 5 6 7 8
2 0 0 0 0 0
df = df.shift()
df.loc[0] = 0
df
A B C D E
0 0.0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0 5.0
2 4.0 5.0 6.0 7.0 8.0

Add new dataframe to existing database but only add if column name matches

I have two dataframes that I am trying to combine, but I'm not getting the result I want using pandas.concat.
I have a database of data that I want to add new data to, but only where the column names match.
Let says df1 is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
and df2 is:
A E D F
7 7 8 8
9 9 0 0
the result I would like to get is:
A B C D
1 1 2 2
3 3 4 4
5 5 6 6
7 - - 8
9 - - 0
The blank data doesn't have to be -; it can be anything.
When I use:
results = pandas.concat([df1, df2], axis=0, join='outer')
it gives me a new dataframe with all of the columns A through F, instead of what I want. Any ideas for how I can accomplish this? Thanks!
You want to use the pd.DataFrame.align method and specify that you want to align with the left argument's indices and that you only care about columns.
d1, d2 = df1.align(df2, join='left', axis=1)
Then you can use pd.DataFrame.append or pd.concat
pd.concat([d1, d2], ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
Or
d1.append(d2, ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
My preferred way would be to skip the reassignment to names
pd.concat(df1.align(df2, 'left', 1), ignore_index=True)
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
3 7 NaN NaN 8
4 9 NaN NaN 0
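A runnable sketch of the align-then-concat approach, with the keyword arguments spelled out (pd.concat instead of append, which was removed in pandas 2.0):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1, 2, 2], [3, 3, 4, 4], [5, 5, 6, 6]],
                   columns=list('ABCD'))
df2 = pd.DataFrame([[7, 7, 8, 8], [9, 9, 0, 0]],
                   columns=list('AEDF'))

# join='left' keeps df1's column set: d2 gains NaN columns B and C
# and drops E and F; d1 comes back unchanged.
d1, d2 = df1.align(df2, join='left', axis=1)
out = pd.concat([d1, d2], ignore_index=True)
```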
You can find the intersection of columns on df2 and use concat or append:
pd.concat(
[df1, df2[df1.columns.intersection(df2.columns)]]
)
Or,
df1.append(df2[df1.columns.intersection(df2.columns)])
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
You can also use reindex and concat:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[81]:
A B C D
0 1 1.0 2.0 2
1 3 3.0 4.0 4
2 5 5.0 6.0 6
0 7 NaN NaN 8
1 9 NaN NaN 0
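The reindex variant, runnable end to end (a sketch rebuilding both frames):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 1, 2, 2], [3, 3, 4, 4], [5, 5, 6, 6]],
                   columns=list('ABCD'))
df2 = pd.DataFrame([[7, 7, 8, 8], [9, 9, 0, 0]],
                   columns=list('AEDF'))

# reindex keeps only df1's columns in df1's order; B and C,
# missing from df2, are created as NaN.
out = pd.concat([df1, df2.reindex(columns=df1.columns)])
```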
Transpose first before merging.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True).T
A B C D
0_x 1.0 1.0 2.0 2.0
1_x 3.0 3.0 4.0 4.0
2 5.0 5.0 6.0 6.0
0_y 7.0 NaN NaN 8.0
1_y 9.0 NaN NaN 0.0
df1.T:
0 1 2
A 1 3 5
B 1 3 5
C 2 4 6
D 2 4 6
df2.T:
0 1
A 7 9
E 7 9
D 8 0
F 8 0
Now the result can be obtained with a merge with how="left" and we use the indices as the join key by passing left_index=True and right_index=True.
df1.T.merge(df2.T, how="left", left_index=True, right_index=True)
0_x 1_x 2 0_y 1_y
A 1 3 5 7.0 9.0
B 1 3 5 NaN NaN
C 2 4 6 NaN NaN
D 2 4 6 8.0 0.0

Select a range of column base on value of another column in pandas

My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns of the dataset based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x[:x.iloc[5]], axis=1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Recreate your dataframe
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
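The list-comprehension version can be sketched end to end as:

```python
import pandas as pd

df = pd.DataFrame([[1, 3, 4, 6, 2, 3],
                   [3, 8, 9, 3, 2, 4],
                   [2, 2, 3, 2, 1, 2]])

# For each row, keep the first n values where n is that row's entry
# in column 5; ragged rows are padded with NaN, then zero-filled.
out = pd.DataFrame([row[:row[5]] for row in df.values]).fillna(0)
```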
