Group by using agg and count - python

I have the following groupby in pandas:
df = df.groupby("pred_text1").agg({
    "rouge_score": "first", "pred_text1": "first",
    "input_text": "first", "target_text": "first", "prefix": "first"})
I want to have another column which counts the number of rows in each group (the repetitions of each pred_text1).
I know that using transform('count') after groupby can add such a column, but I still need the agg function too.
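For illustration, a minimal sketch of that combination (assuming the df above): add the count with transform first, then carry it through agg with "first".
df["pred_count"] = df.groupby("pred_text1")["pred_text1"].transform("count")
df = df.groupby("pred_text1").agg({
    "rouge_score": "first", "pred_text1": "first", "pred_count": "first",
    "input_text": "first", "target_text": "first", "prefix": "first"})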

import pandas as pd
df = pd.DataFrame({'Group': ['G1', 'G1', 'G2', 'G3', 'G2', 'G1'], 'ValueLabel': [0, 1, 1, 1, 0, 2]})
You can do this in a few steps:
First, aggregate:
df = df.groupby('Group').agg({'ValueLabel': ['first', 'count']})
The new columns form a pd.MultiIndex, which we flatten. After that we create a mapping from those names to the labels we want and rename the columns:
df.columns = df.columns.to_flat_index()
mapper = dict(zip(
    pd.MultiIndex.from_product([['ValueLabel'], ['first', 'count']]),
    ['ValueFirstLabel', 'ValueCountLabel']))
df.rename(mapper, axis=1, inplace=True)
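Putting it together with the sample df above, the renamed frame looks like this:
print(df)
       ValueFirstLabel  ValueCountLabel
Group
G1                   0                3
G2                   1                2
G3                   1                1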

You can use the pd.NamedAgg function (pandas 0.25+).
Code:
import pandas as pd
# A sample dataframe
df = pd.DataFrame({
    'pred_text1': [chr(ord('A')+i%3) for i in range(10)],
    'rouge_score': [chr(ord('A')+i%5) for i in range(10)],
    'input_text': [chr(ord('A')+i%7) for i in range(10)],
    'target_text': [chr(ord('A')+i%9) for i in range(10)],
    'prefix': [chr(ord('A')+i%11) for i in range(10)],
})
# Aggregation
df = df.groupby("pred_text1").agg(
rouge_score=pd.NamedAgg("rouge_score", "first"),
pred_text1=pd.NamedAgg("pred_text1", "first"),
input_text=pd.NamedAgg("input_text", "first"),
target_text=pd.NamedAgg("target_text", "first"),
prefix=pd.NamedAgg("prefix", "first"),
pred_count=pd.NamedAgg("pred_text1", "count"),
)
Input:
   pred_text1 rouge_score input_text target_text prefix
0           A           A          A           A      A
1           B           B          B           B      B
2           C           C          C           C      C
3           A           D          D           D      D
4           B           E          E           E      E
5           C           A          F           F      F
6           A           B          G           G      G
7           B           C          A           H      H
8           C           D          B           I      I
9           A           E          C           A      J
Output:
           rouge_score pred_text1 input_text target_text prefix  pred_count
pred_text1
A                    A          A          A           A      A           4
B                    B          B          B           B      B           3
C                    C          C          C           C      C           3
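Note that pd.NamedAgg is just a (column, aggfunc) named tuple, so plain tuples are an equivalent, shorter spelling, e.g.:
df = df.groupby("pred_text1").agg(
    pred_text1=("pred_text1", "first"),
    pred_count=("pred_text1", "count"),
)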

How to pivot a pandas table just for some columns

I have a dataframe in pandas which has a group of suffixed columns (there are several, but I'll use 2 as an example, _1 and _2), each of which depicts a different year.
df = pd.DataFrame({'A': ['BP', 'Virgin'],
                   'B(LY)': ['A', 'C'],
                   'B(LY_1)': ['B', 'D'],
                   'C': [1, 3],
                   'C_1': [2, 4],
                   'D': ['W', 'Y'],
                   'D_1': ['X', 'Z']})
I'm trying to reorganise the table to pivot it, so that it looks like this:
df = pd.DataFrame({'A': ['BP', 'BP', 'Virgin', 'Virgin'],
                   'Year': ['A', 'B', 'C', 'D'],
                   'C': [1, 2, 3, 4],
                   'D': ['W', 'X', 'Y', 'Z']})
But I can't work out how to do it. The problem is, I only need each suffixed column to match the equivalent suffix for the other variables. Any help is appreciated, thanks.
EDIT
Here is a real-life example of the data:
df = pd.DataFrame({'Company': ['BP', 'Virgin'],
                   'Account_date(LY)': ['May', 'Apr'],
                   'Account_date(LY_1)': ['Apr', 'Mar'],
                   'Account_date(LY_2)': ['Mar', 'Feb'],
                   'Account_date(LY_3)': ['Feb', 'Jan'],
                   'Acc_day': [1, 5],
                   'Acc_day_1': [2, 6],
                   'Acc_day_2': [3, 7],
                   'Acc_day_3': [4, 8],
                   'D': ['W', 'A'],
                   'D_1': ['X', 'B'],
                   'D_2': ['Y', 'C'],
                   'D_3': ['Z', 'D']})
desired output:
df = pd.DataFrame({'Company': ['BP', 'BP', 'BP', 'BP', 'Virgin', 'Virgin', 'Virgin', 'Virgin'],
                   'Year': ['May', 'Apr', 'Mar', 'Feb', 'Apr', 'Mar', 'Feb', 'Jan'],
                   'Acc_day': [1, 2, 3, 4, 5, 6, 7, 8],
                   'D': ['W', 'X', 'Y', 'Z', 'A', 'B', 'C', 'D']})
You can use:
# set A aside
df2 = df.set_index('A')
# split columns to MultiIndex on "_"
df2.columns = df2.columns.str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Or using janitor's pivot_longer:
import janitor
out = (df.pivot_longer(index='A', names_sep='_', names_to=('.value', '_drop'), sort_by_appearance=True)
.rename(columns={'B': 'Year'}).drop(columns='_drop')
)
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
updated example
using a mapper to match (LY) -> _1, etc.
import re
# you can generate this mapper programmatically if needed
mapper = {'(LY)': '_1', '(LY_1)': '_2'}
# set A aside
df2 = df.set_index('A')
# replace each suffix with its mapped form, then split columns to a MultiIndex on "_"
pattern = '|'.join(map(re.escape, mapper))
df2.columns = df2.columns.str.replace(pattern, lambda m: mapper[m.group()], regex=True).str.split('_', expand=True)
# reshape
out = df2.stack().droplevel(1).rename(columns={'B': 'Year'}).reset_index()
Output:
A Year C D
0 BP A 1 W
1 BP B 2 X
2 Virgin C 3 Y
3 Virgin D 4 Z
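As the comment above notes, the mapper can also be generated programmatically. A sketch, assuming all the suffixes look like (LY), (LY_1), (LY_2), ...:
import re
# collect the distinct suffixes and number them in sorted order
found = sorted({m.group() for c in df.columns if (m := re.search(r'\(LY(_\d+)?\)', c))})
mapper = {s: f'_{i}' for i, s in enumerate(found, start=1)}
# -> {'(LY)': '_1', '(LY_1)': '_2'}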

use specific columns to map new column with json

I have a data frame with:
A B C
1 3 6
I want to take the 2 columns and create column D that reads {"A":"1", "C":"6"}
new dataframe output would be:
A B C D
1 3 6 {"A":"1", "C":"6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this is taking all columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you asked, but you can convert your 2 columns into a dict; then if you want to export your data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
A B C D
0 1 3 6 {'A': 1, 'C': 6}
For example, export the column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
    {
        "A":1,
        "C":6
    }
]
Use a subset in the lambda function:
df['D'] = df.apply(lambda x: x[['A','C']].to_json(), axis=1)
Or select columns before apply:
df['D'] = df[['A','C']].apply(lambda x: x.to_json(), axis=1)
If possible, create dictionaries:
df['D'] = df[['A','C']].to_dict(orient='records')
print (df)
A B C D
0 1 3 6 {'A': 1, 'C': 6}
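If you need actual JSON strings with quoted values, as in the desired output above, a variation (a sketch) is to cast to str before serializing:
df['D'] = df[['A', 'C']].astype(str).apply(lambda x: x.to_json(), axis=1)
print(df['D'][0])
# {"A":"1","C":"6"}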

Filtering with conditional conditions (pandas)

I have an array with multiple strings, some of which are none (0 or ''), and each non-empty entry should have its own condition. If the array at a given position is none, I don't have to apply that filter.
# df.columns = ['a','b','c','d','e']
# Case 1
l = ['A', 'B', '', '', 123]
## DESIRED FILTERING
df[ (df.a=='A') & (df.b=='B') & (df.e == 123)]
# Case 2
l = ['z', '', '', '', 123]
## DESIRED FILTERING
df[ (df.a=='z') & (df.e == 123) ]
This is my attempt, yet it failed because (df.col_name == 'something') returns a Series.
# Case 1, for example
check_null = [i != '' for i in l]  # -> returns [True, False, ...]
conditions = [ (df.a==l[0]),(df.b==l[1]),(df.c==l[2]), (df.d==l[3]), (df.e==l[4])]
filt = [conditions[i] for i in range(len(check_null)) if check_null[i]]
df[filt]
How do I manage to get this to work?
Create a dictionary of the non-empty values, convert it to a Series, and filter with boolean indexing:
df = pd.DataFrame(columns = ['a','b','c','d','e'])
df.loc[0] = ['A', 'B','g' ,'h' , 123]
df.loc[1] = ['A', 'B','g' ,'h' , 52]
l = ['A', 'B','' ,'' , 123]
s = pd.Series(dict(zip(df.columns, l))).loc[lambda x: x != '']
df = df[df[s.index].eq(s).all(axis=1)]
print (df)
a b c d e
0 A B g h 123
l = ['A', 'B', '','', '']
s = pd.Series(dict(zip(df.columns, l))).loc[lambda x: x != '']
df = df[df[s.index].eq(s).all(axis=1)]
print (df)
a b c d e
0 A B g h 123
1 A B g h 52
You can use a Series for comparison.
Either ensure that the value matches the Series (df.eq(s)), or (|) that the Series contains an empty string (s.eq('')). Broadcasting magic will do the rest ;)
s = pd.Series(l, index=df.columns)
df2 = df[(df.eq(s)|s.eq('')).all(1)]
Example with ['A', 'B', '', '', 123]:
# input
a b c d e
0 A B C D 123
1 A X C D 456
# output
a b c d e
0 A B C D 123
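The same trick wrapped in a small helper (a sketch; an empty string means "skip this column"):
def filter_by(df, values):
    # one entry per column; '' disables the condition for that column
    s = pd.Series(values, index=df.columns)
    return df[(df.eq(s) | s.eq('')).all(axis=1)]

out = filter_by(df, ['A', 'B', '', '', 123])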

Length mismatch while applying a logic to Dataframe

I'm trying to change to uppercase the alternate column names of a DataFrame having 6 columns.
Input:
df.columns[::2].str.upper()
Output:
Index(['FIRST_NAME', 'AGE_VALUE', 'MOB_#'], dtype='object')
Now I want to apply this to the DataFrame.
Input: df.columns = df.columns[::2].str.upper()
ValueError: Length mismatch: Expected axis has 6 elements, new values have 3 elements
You can use rename:
df
a b c d e f
0 a b c d e f
column_names = dict(zip(df.columns[::2], df.columns[::2].str.upper()))
column_names
{'a': 'A', 'c': 'C', 'e': 'E'}
df = df.rename(columns=column_names)
df
A b C d E f
0 a b c d e f
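Alternatively, build a full-length list of names so the assignment matches the axis length (a sketch of the same idea):
df.columns = [c.upper() if i % 2 == 0 else c for i, c in enumerate(df.columns)]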

Merge pandas dataframe with overwrite of columns

What is the quickest way to merge two pandas DataFrames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id. Are there any ways to do this using pandas operations? How I've implemented it right now is coded below:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
c_dict = a_dict.copy()
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))
Here, c would be equivalent to:
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
id letter
0 1 A
1 2 b
2 3 C
3 4 D
combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
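Note the direction: calling combine_first on b keeps b's values and only fills the gaps from a. Reversing it would keep a's lowercase letters:
a.set_index('id').combine_first(b.set_index('id')).reset_index()
   id letter
0   1      a
1   2      b
2   3      c
3   4      D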
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
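Note that groupby(..., axis=1) is deprecated in recent pandas; a sketch of the same overwrite without it, using explicit suffixes and fillna:
m = a.merge(b, 'outer', 'id', suffixes=('_a', '_b'))
m['letter'] = m['letter_b'].fillna(m['letter_a'])
c = m[['id', 'letter']]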
One way may be as follows:
concatenate dataframe a under dataframe b (append was removed in pandas 2.0, so use pd.concat)
drop duplicates based on id
sort the remaining values by id
reset the index and drop the older index
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)
print(c)
Try this
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
id letter
0 1 A
1 2 b
2 3 C
3 4 D
