Pandas dataframe group: sum one column, take first element from others - python

I have a pandas dataframe
x = pd.DataFrame.from_dict({'row':[1, 1, 2, 2, 3, 3, 3], 'add': [1, 2, 3, 4, 5, 6, 7], 'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'], 'take2': ['11', '22', '33', '44', '55', '66', '77'], 'range': [100, 200, 300, 400, 500, 600, 700]})
add range row take1 take2
0 1 100 1 a 11
1 2 200 1 b 22
2 3 300 2 c 33
3 4 400 2 d 44
4 5 500 3 e 55
5 6 600 3 f 66
6 7 700 3 g 77
I want to group it by the row column, then add up the entries in the add column, but take the first entry from take1 and take2, and select the min and max from range:
add row take1 take2 min_range max_range
0 3 1 a 11 100 200
1 7 2 c 33 300 400
2 18 3 e 55 500 700

Use DataFrameGroupBy.agg with a dict, but some cleanup is necessary afterwards, because the result gets a MultiIndex in the columns:
# map each column name to the aggregation function(s) to apply to it
d = {'add':'sum', 'take1':'first', 'take2':'first', 'range':['min','max']}
# group by the row column and apply the corresponding aggregation to each
# column as specified in the dictionary d
df = x.groupby('row', as_index=False).agg(d)
# blank out the function names in the second column level
df = df.rename(columns={'first':'', 'sum':''})
# flatten the MultiIndex columns; use a loop variable other than x so it
# does not shadow the original dataframe x
df.columns = ['{0[0]}_{0[1]}'.format(col).strip('_') for col in df.columns]
print(df)
row take1 range_min range_max take2 add
0 1 a 100 200 11 3
1 2 c 300 400 33 7
2 3 e 500 700 55 18
Details: aggregate the columns by the functions specified in the dictionary:
df = x.groupby('row', as_index=False).agg(d)
  row range       take2  take1  add
         min  max first  first  sum
0   1   100  200     11      a    3
1   2   300  400     33      c    7
2   3   500  700     55      e   18
Replacing the column names sum and first with '' leads to:
  row range       take2  take1  add
         min  max
0   1   100  200     11      a    3
1   2   300  400     33      c    7
2   3   500  700     55      e   18
A list comprehension over the columns with string formatting then produces the desired flat column names; assigning it to df.columns gives the desired output.
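On pandas 0.25 or newer, named aggregation avoids the MultiIndex cleanup entirely; a minimal sketch, assuming the same x as above:
# each keyword is output_column=(source column, aggregation function)
df = x.groupby('row', as_index=False).agg(
    add=('add', 'sum'),
    take1=('take1', 'first'),
    take2=('take2', 'first'),
    min_range=('range', 'min'),
    max_range=('range', 'max'),
)
print(df)
Each keyword becomes an output column named exactly as requested, so no flattening or renaming step is needed.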

Here's what I had, without column renaming/sorting.
x = pd.DataFrame.from_dict({'row':[1, 1, 2, 2, 3, 3, 3], 'add': [1, 2, 3, 4, 5, 6, 7], 'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'], 'take2': ['11', '22', '33', '44', '55', '66', '77'], 'range': [100, 200, 300, 400, 500, 600, 700]})
x.reset_index(inplace=True)
# pick the first row per group (smallest original index) for take1/take2
min_cols = x.loc[x.groupby(['row'])['index'].idxmin().values][['row', 'take1', 'take2']]
x_grouped = x.groupby(['row']).agg({'add': 'sum', 'range': [np.min, np.max]})
x_out = pd.merge(x_grouped, min_cols, how='left', left_index=True, right_on=['row'])
print(x_out)
(add, sum) (range, amin) (range, amax) row take1 take2
0 3 100 200 1 a 11
2 7 300 400 2 c 33
4 18 500 700 3 e 55

Related

How do I select the first item in a column after grouping for another column in pandas?

I have the following data frame:
df = {
'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'name_ID' : [1, 1, 2, 2, 2, 3, 3],
'score' : [400, 500, 3000, 1000, 4000, 600, 750],
'score_number' : [1, 2, 1, 2, 3, 1, 2]
}
df = pd.DataFrame(df)
Note that the df is grouped by name / name_ID. Names can have n scores, e.g. A has 2 scores, whereas B has 3 scores. I want an additional column that holds the first score per name / name_ID; the reference_score for the first score of each name should be NaN. Like this:
  name  name_ID  score  score_number  reference_score
0    A        1    400             1              NaN
1    A        1    500             2            400.0
2    B        2   3000             1              NaN
3    B        2   1000             2           3000.0
4    B        2   4000             3           3000.0
5    C        3    600             1              NaN
6    C        3    750             2            600.0
I have tried:
df_v2['first_fund'] = df_v2['fund_size'].groupby(df_v2['firm_ID']).first()
also with .nth but it didn't work.
Thanks in advance.
Let's use groupby.transform to broadcast the first value of each group, then mask the first row of each group as NaN with the condition ~df.duplicated('name', keep='first').
# sort the dataframe first if score number is not ascending
# df = df.sort_values(['name_ID', 'score_number'])
df['reference_score'] = (df.groupby('name')['score']
                           .transform('first')
                           .mask(~df.duplicated('name', keep='first')))
print(df)
name name_ID score score_number reference_score
0 A 1 400 1 NaN
1 A 1 500 2 400.0
2 B 2 3000 1 NaN
3 B 2 1000 2 3000.0
4 B 2 4000 3 3000.0
5 C 3 600 1 NaN
6 C 3 750 2 600.0
Or we can compare score_number with 1 to identify the first row in each group.
df['reference_score'] = (df.groupby('name')['score']
                           .transform('first')
                           .mask(df['score_number'].eq(1)))
Another solution:
df["reference_score"] = df.groupby("name")["score"].apply(
lambda x: pd.Series([x.iat[0]] * len(x), index=x.index).shift()
)
print(df)
Prints:
name name_ID score score_number reference_score
0 A 1 400 1 NaN
1 A 1 500 2 400.0
2 B 2 3000 1 NaN
3 B 2 1000 2 3000.0
4 B 2 4000 3 3000.0
5 C 3 600 1 NaN
6 C 3 750 2 600.0
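For completeness, the same masking idea can also be written with where and cumcount; a sketch, assuming the df from the question:
# broadcast each group's first score, then keep it only on rows that are
# not the first of their group (cumcount is 0 for the first row)
first = df.groupby('name')['score'].transform('first')
df['reference_score'] = first.where(df.groupby('name').cumcount() > 0)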

Python: Creating new column names in a for loop

I am trying to make custom column header names for a dataframe using a for loop. Currently I am using two for loops to iterate through the dataframe, but I don't know how to add new column headers without hardcoding them. I have:
import pandas

df = pandas.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0], })

result = []
for i in range(len(df.columns)):
    SelectedCol = df.iloc[:, i]
    for c in range(i + 1, len(df.columns)):
        result.append((SelectedCol + 1) / (df.iloc[:, c] + 1))
df1 = pandas.DataFrame(result)
df1 = df1.transpose()
In df, the first column is combined with the second, third, and fourth. The code then takes the second and combines it with the third and fourth, and so on through the for loop, so the output columns are
'AB', 'AC', 'AD', 'BC', 'BD', and 'CD'.
What could I add to my for loop to extract the column names, so that each column name of df1 can be 'Long A, Short B', 'Long A, Short C', ... and finally 'Long C, Short D'?
Thanks for your help
from itertools import combinations

for x, y in combinations(df.columns, 2):
    df['Long ' + x + ' Short ' + y] = df[x] * df[y]
import pandas
from itertools import combinations

df = pandas.DataFrame({
    'A': [5, 3, 6, 9, 2, 4],
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0], })

# print all column names (DataFrame.items; iteritems was removed in pandas 2.0)
for index, row in df.items():
    print(index)

# get all pairwise combinations of (name, column) pairs
result = combinations(df.items(), 2)

# compute the product for each pair and store it under the concatenated name
for name, data in result:
    _name = name[0] + data[0]
    _data = name[1] * data[1]
    df[_name] = _data
print(df)
A
B
C
D
A B C D AB AC AD BC BD CD
0 5 4 7 1 20 35 5 28 4 7
1 3 5 8 3 15 24 9 40 15 24
2 6 4 9 5 24 54 30 36 20 45
3 9 5 4 7 45 36 63 20 35 28
4 2 5 2 1 10 4 2 10 5 2
5 4 4 3 0 16 12 0 12 0 0
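To get exactly the names the question asks for ('Long A, Short B', ...), one option is to feed a dict comprehension to pd.concat; a sketch using the question's ratio formula, assuming df still holds only the original A-D columns:
import pandas as pd
from itertools import combinations

# one column per pair; the dict key becomes the column header
df1 = pd.concat(
    {f'Long {a}, Short {b}': (df[a] + 1) / (df[b] + 1)
     for a, b in combinations(df.columns, 2)},
    axis=1,
)
print(df1.columns.tolist())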

How to remove rows from DF as result of a groupby query?

I have this Pandas dataframe:
df = pd.DataFrame({'site': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'], 'day': [1, 1, 1, 1, 1, 1, 2, 2, 2],
'hour': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'clicks': [100, 200, 50, 0, 0, 0, 10, 0, 20]})
# site day hour clicks
# 0 a 1 1 100
# 1 a 1 2 200
# 2 a 1 3 50
# 3 b 1 1 0
# 4 b 1 2 0
# 5 b 1 3 0
# 6 a 2 1 10
# 7 a 2 2 0
# 8 a 2 3 20
And I want to remove all rows for a site/day where there were 0 clicks. So in the example above, I would want to remove the rows with site='b' and day=1.
I can basically group them and show where the sum is 0 for a day/site:
print(df.groupby(['site', 'day'])['clicks'].sum() == 0)
But what would be a straightforward way to remove the rows from the original dataframe where that condition applies?
The solution I have so far iterates over the groups, saves all site/day tuples in a list, and then separately removes all rows with those site/day combinations. That works, but I am sure there must be a more functional and elegant way to achieve the same result?
Option 1
Using groupby, transform and boolean indexing:
df[df.groupby(['site', 'day'])['clicks'].transform('sum') != 0]
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
Option 2
Using groupby and filter:
df.groupby(['site', 'day']).filter(lambda x: x['clicks'].sum() != 0)
Output:
site day hour clicks
0 a 1 1 100
1 a 1 2 200
2 a 1 3 50
6 a 2 1 10
7 a 2 2 0
8 a 2 3 20
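For reference, the OP's tuple-list idea can itself be vectorized with MultiIndex.isin instead of a Python loop; a sketch, assuming df as defined above:
import pandas as pd

# (site, day) pairs whose click sum is zero
sums = df.groupby(['site', 'day'])['clicks'].sum()
zero_groups = sums[sums == 0].index

# flag rows whose (site, day) pair belongs to a zero-click group, then drop them
mask = pd.MultiIndex.from_frame(df[['site', 'day']]).isin(zero_groups)
out = df[~mask]
print(out)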

Pandas groupby cumcount starting on row with a certain column value

I'd like to create two cumcount columns, depending on the values of two columns.
In the example below, I'd like one cumcount starting when colA is at least 100, and another cumcount starting when colB is at least 10.
columns = ['ID', 'colA', 'colB', 'cumcountA', 'cumcountB']
data = [['A', 3, 1, '', ''],
        ['A', 20, 4, '', ''],
        ['A', 102, 8, 1, ''],
        ['A', 117, 10, 2, 1],
        ['B', 75, 0, '', ''],
        ['B', 170, 12, 1, 1],
        ['B', 200, 13, 2, 2],
        ['B', 300, 20, 3, 3],
        ]
pd.DataFrame(columns=columns, data=data)
  ID  colA  colB cumcountA cumcountB
0  A     3     1
1  A    20     4
2  A   102     8         1
3  A   117    10         2         1
4  B    75     0
5  B   170    12         1         1
6  B   200    13         2         2
7  B   300    20         3         3
How would I calculate cumcountA and cumcountB?
You can clip the columns with df.clip at lower bounds equal to your threshold values (here 100 and 10), compare against the clipped values, then group by ID and take the cumulative sum:
col_list = ['colA', 'colB']
val_list = [100, 10]

df[['cumcountA', 'cumcountB']] = (df[col_list].ge(df[col_list].clip(lower=val_list, axis=1))
                                    .groupby(df['ID']).cumsum().replace(0, ''))
print(df)
Or, maybe even better, compare directly against the thresholds:
df[['cumcountA', 'cumcountB']] = (df[['colA', 'colB']].ge([100, 10])
                                    .groupby(df['ID']).cumsum().replace(0, ''))
print(df)
  ID  colA  colB cumcountA cumcountB
0  A     3     1
1  A    20     4
2  A   102     8         1
3  A   117    10         2         1
4  B    75     0
5  B   170    12         1         1
6  B   200    13         2         2
7  B   300    20         3         3
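Note that both versions count only the rows at or above the threshold, which matches the example because colA and colB only increase within each group. If a value could dip back below the threshold after first crossing it and the counter should keep running once started (an assumption beyond the question's example), you could latch the start flag with cummax; a sketch for colA:
# once a group has crossed 100, latch the flag on for all of its later rows
started_a = df['colA'].ge(100).astype(int).groupby(df['ID']).cummax()
df['cumcountA'] = started_a.groupby(df['ID']).cumsum().replace(0, '')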

Add incrementing number to certain columns of a pandas DataFrame [duplicate]

This question already has answers here:
Grouping and auto increment based on columns in pandas
How to use groupby and cumcount on unique names in a Pandas column
Say I have a df as follows
df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')
Output:
val category
idx
9 30 a
8 40 a
7 50 b
6 60 b
5 70 c
4 80 c
3 90 c
I want to add an incrementing number from 1 to the total number of rows for each 'category'. The new column should look like this:
category incrNbr val
idx
3 a 1 30
4 a 2 40
5 b 1 50
6 b 2 60
7 c 1 70
8 c 2 80
9 c 3 90
Currently I loop through each category like this:
li = []
for index, row in df.iterrows():
    cat = row['category']
    if cat not in li:
        li.append(cat)
        temp = df.loc[(df['category'] == row['category'])][['val']]
        temp.insert(0, 'incrNbr', range(1, 1 + len(temp)))
        del temp['val']
        df = df.combine_first(temp)
It is very slow.
Is there a way to do this using vectorized operations?
If your category column is sorted, we can use GroupBy.cumcount, which numbers the rows within each group starting at 0, so .add(1) makes the count start at 1:
df['incrNbr'] = df.groupby('category')['category'].cumcount().add(1)
val category incrNbr
idx
9 30 a 1
8 40 a 2
7 50 b 1
6 60 b 2
5 70 c 1
4 80 c 2
3 90 c 3
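If the categories were interleaved rather than contiguous, cumcount would still number rows by order of appearance within each group; sort first if you also want each category's rows together. A sketch, assuming the df from the question:
# order rows by category, then number them 1..n within each category
df = df.sort_values('category')
df['incrNbr'] = df.groupby('category').cumcount().add(1)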
