Pandas: How to use df.groupby in a lambda formula

The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
Required: a GroupFirstValue column as shown below.
The idea is to use a lambda formula to get the first value for each group. For example, the first value for "a" is 0, for "b" it is 3, and for "c" it is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this in two steps: build a grouped df from the original df, then merge the two together. The idea is to see whether this can be done more efficiently in a single step. Many thanks in advance!

Group by Item and use first, which returns one value per group:
df.groupby('Item')['Value'].first()
Or use transform, which returns a value for every row, and assign the result to a new column in your frame:
df['new_col'] = df.groupby('Item')['Value'].transform('first')
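The difference between the two calls above, and the lambda form the question asks about, can be sketched as follows. The column name ViaLambda is only illustrative. Note that 'first' skips NaN within a group, while the lambda with iloc[0] takes whatever sits in the first row, so the two agree here only because Value has no missing data.

```python
import pandas as pd

items = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']
df = pd.DataFrame({'Item': items, 'Value': range(len(items))})

# One row per group: the first Value seen for each Item.
first_per_group = df.groupby('Item')['Value'].first()

# Same length as df: each row receives its group's first Value.
df['GroupFirstValue'] = df.groupby('Item')['Value'].transform('first')

# Lambda form; s is the Value Series of a single group.
df['ViaLambda'] = df.groupby('Item')['Value'].transform(lambda s: s.iloc[0])

print(first_per_group)
print(df)
```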

Use mask and duplicated to keep the value only on the first row of each group:
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
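To see what each step of the mask/duplicated approach contributes, here is a small sketch with the intermediate results named (the variable names are illustrative). The last line is an optional extra, not part of the answer above: forward-filling within each Item repeats the first value on every row of the group.

```python
import pandas as pd

items = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']
df = pd.DataFrame({'Item': items, 'Value': range(len(items))})

# duplicated() is False on the first occurrence of each Item, True afterwards.
is_repeat = df['Item'].duplicated()

# mask() replaces Value with NaN wherever is_repeat is True.
first_only = df['Value'].mask(is_repeat)

# Forward-filling within each Item spreads the first value over the group.
filled = first_only.groupby(df['Item']).ffill()

print(pd.DataFrame({'is_repeat': is_repeat,
                    'first_only': first_only,
                    'filled': filled}))
```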

Related

How to sum duplicate columns in dataframe and return nan if at least one value is nan

I have a dataframe with duplicate columns (number not known a priori) like this example:
a a a b b
0 1 1 1 1 1
1 1 nan 1 1 1
I need to be able to aggregate the columns by summing their values (by rows) and returning NaN if at least one value, in one of the columns among the duplicates, is NaN.
I have tried this code:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,1,1,1], [1,np.nan,1,1,1]], columns=['a','a','a','b','b'])
df = df.groupby(axis=1, level=0).sum()
The result I get is as follows, but it does not return NaN in the second row of column 'a'.
a b
0 3 2
1 2 2
The documentation for pandas.DataFrame.sum lists a skipna parameter that might suit my case. However, I am using pandas.core.groupby.GroupBy.sum, which does not have that parameter. It does have min_count, which does what I want, but the required count is not known in advance and differs for each set of duplicate columns.
For example, min_count=3 solves the problem for column 'a' but returns NaN for the whole of column 'b'.
The result I want to achieve is:
a b
0 3 2
1 nan 2
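One workaround is to use apply so that DataFrame.sum is called with skipna=False on each group of columns (run this on the original df, before it is overwritten by the groupby result):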
df.groupby(level=0, axis=1).apply(lambda x: x.sum(axis=1, skipna=False))
Output:
a b
0 3.0 2.0
1 NaN 2.0
Another possible solution:
cols, ldf = df.columns.unique(), len(df)
# The list is built row by row, so the target shape is (rows, unique columns).
pd.DataFrame(
    np.reshape([sum(df.loc[i, x]) for i in range(ldf) for x in cols],
               (ldf, len(cols))),
    columns=cols)
Output:
a b
0 3.0 2.0
1 NaN 2.0
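A further variant, offered as a sketch rather than as one of the answers above: compute the ordinary sums, then blank out every cell where any of the duplicated columns was NaN. It groups on the transposed frame instead of passing axis=1 to groupby, which recent pandas versions (2.1 and later, as far as I know) flag as deprecated. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1], [1, np.nan, 1, 1, 1]],
                  columns=['a', 'a', 'a', 'b', 'b'])

# Transpose so the duplicated labels become the row index, then group on it.
sums = df.T.groupby(level=0).sum().T

# True wherever any of the duplicated columns held NaN in that row.
has_nan = df.isna().T.groupby(level=0).any().T

# Replace the sums that were computed from incomplete data with NaN.
result = sums.mask(has_nan)
print(result)
```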

Merge dataframes of different sizes and simultaneously overwrite NaN values

I would like to combine two dataframes in Python of different sizes. These dataframes are loaded from Excel files. The first dataframe has many empty values containing NaN, and the second dataframe has the data to replace the NaN values in the first dataframe. The two dataframes are linked by the data in the first column, but are not in the same order.
I can successfully merge and organize the dataframes using merge(), but the resulting dataframe has extra columns because the NaN values were not overwritten. I can overwrite the NaN values with fillna(), but the resulting dataframe is out of order. Is there any way to perform this kind of merge that replaces NaN without separate operations that delete and reorder columns?
import pandas as pd
import numpy as np
df1=pd.DataFrame({'A':[1,2,3],'B':[np.nan,np.nan,np.nan],'C':['X','Y','Z']})
df1
A B C
0 1 NaN X
1 2 NaN Y
2 3 NaN Z
df2=pd.DataFrame({'A':[3,1,2],'B':['U','V','W'],'D':[7,8,9]})
df2
A B D
0 3 U 7
1 1 V 8
2 2 W 9
If I do:
df1.merge(df2,how='left',on='A',sort=True)
A B_x C B_y D
0 1 NaN X V 8
1 2 NaN Y W 9
2 3 NaN Z U 7
The data is in order but B has multiple instances.
If I do:
df1.fillna(df2)
A B C
0 1 U X
1 2 V Y
2 3 W Z
The data is out of order, but the NaN are replaced.
I want the output to be a dataframe which looks like this:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
You can use:
df3=pd.concat([df1['C'],df2[['A','B','D']].sort_values('A').reset_index(drop=True)],axis=1).reindex(columns=['A','B','C','D'])
Output:
df3
A B C D
0 1 V X 8
1 2 W Y 9
2 3 U Z 7
Explanation:
sort_values orders df2 by column A.
reset_index(drop=True) is needed so that the rows line up by position when concatenating.
concat joins column 'C' of df1 with df2, whose rows are now in the correct order. Finally, reindex repositions the columns of df3.
The order of the original df2 is unchanged, since inplace=True was not used.
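Another option is to map B from df2 onto df1 through A, then merge in the remaining columns. Note that this modifies both df1 and df2 in place:
d = dict(zip(df2.A,df2.B))
df1["B"] = df1["A"].map(d)
del df2["B"]
df1.merge(df2,how='left',on='A',sort=True)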
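A more general single-step sketch, assuming A uniquely identifies a row in both frames: index both frames by A and use combine_first, which keeps the values of the first frame and takes NaN cells and missing columns from the second. Unlike the concat approach, it matches rows on A rather than on position. The names left and right are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [np.nan, np.nan, np.nan],
                    'C': ['X', 'Y', 'Z']})
df2 = pd.DataFrame({'A': [3, 1, 2],
                    'B': ['U', 'V', 'W'],
                    'D': [7, 8, 9]})

# Index both frames by the key column so rows are matched on A, not position.
left = df1.set_index('A')
right = df2.set_index('A')

# Values from left win; its NaN cells and missing columns come from right.
df3 = left.combine_first(right).reset_index()
print(df3)
```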

Python Pandas equivalent to the excel fill handle?

Is there a Pandas function equivalent to the MS Excel fill handle?
It fills data down or extends a series if more than one cell is selected. My specific application is filling down with a set value in a specific column from a specific row in the dataframe, not necessarily filling a series.
This simple function essentially does what I want. I think it would be nice if ffill could be modified to fill in this way...
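def fill_down(df, col, val, start, end=0, interval=1):
    # Add val to df[col] on rows start, start + interval, ... up to end (exclusive).
    if not end:
        end = len(df)
    col_pos = df.columns.get_loc(col)
    for i in range(start, end, interval):
        # A single .iloc[row, column] access avoids chained assignment.
        df.iloc[i, col_pos] += val
    return df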
As others commented, there isn't a GUI for pandas, but ffill gives the functionality you're looking for. You can also use ffill with groupby for more powerful functionality. For example:
>>> df
A B
0 12 1
1 NaN 1
2 4 2
3 NaN 2
>>> df.A = df.groupby('B').A.ffill()
>>> df
A B
0 12 1
1 12 1
2 4 2
3 4 2
Edit: If you don't have NaNs, you could always create them where you want to fill down. For example:
>>> df
Out[8]:
A B
0 1 2
1 3 3
2 4 5
>>> df.replace(3, np.nan)
Out[9]:
A B
0 1.0 2.0
1 NaN NaN
2 4.0 5.0
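For the specific case in the question (a set value in one column, starting from a given row), positional slicing with iloc may be enough without ffill or a loop. This is a sketch of a hypothetical fill_down variant: unlike the function in the question it sets the value instead of adding to it, and it treats start, end and interval as row positions.

```python
import pandas as pd

def fill_down(df, col, val, start, end=None, interval=1):
    """Set df[col] to val on rows start, start + interval, ... (end exclusive)."""
    col_pos = df.columns.get_loc(col)  # positional index of the column
    df.iloc[start:end:interval, col_pos] = val
    return df

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [0, 0, 0, 0, 0]})

# Fill B with 9 from the third row to the end.
fill_down(df, 'B', 9, start=2)

# Set every other row of A to 0 within the first four rows (rows 0 and 2).
fill_down(df, 'A', 0, start=0, end=4, interval=2)

print(df)
```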

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
Use join with the transposed Series (the transposed frame has index 0, so it lines up with the first row):
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
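If the values should go into a row other than the one labelled 0, or df should be updated step by step, one possible sketch is to create the new columns first and then assign into the chosen row. The row label 0 below is just the one from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [2, 3]})
s = pd.Series({'C': 4, 'D': 6})

# Create the new columns first (filled with NaN), keeping the existing ones.
df = df.reindex(columns=[*df.columns, *s.index])

# Then write the Series values into the chosen row only.
df.loc[0, list(s.index)] = s.to_numpy()

print(df)
```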

How do I combine two columns within a dataframe in Pandas?

Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first:
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
You can use where which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
You can also use fillna:
df['C'] = df['B'].fillna(df['A'])
fillna replaces the NaN values in the Series it is called on. It accepts a scalar or, as here, another Series: each NaN in 'B' is replaced by the value of 'A' in the same row, and the result is stored in 'C'.
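A quick way to check that the approaches suggested here (combine_first, where, fillna) agree on the question's data is to run them side by side. The variable names are illustrative, and the frame is rebuilt from the table in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})

# Take B, falling back to A where B is NaN.
via_combine_first = df['B'].combine_first(df['A'])
via_fillna = df['B'].fillna(df['A'])

# Keep A where B is NaN, otherwise use B.
via_where = df['A'].where(df['B'].isna(), df['B'])

df['C'] = via_fillna
print(df)
```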
