I have a dataframe that looks like this:
id Field_name Field_value
1 consent yes
1 _REACTION TIME_ 5547
1 age 24
1 gender X
1 _REACTION TIME_ 45396
1 education uni
1 language EN
1 _REACTION TIME_ 105187
2 consent yes
2 _REACTION TIME_ 3547
2 age 25
2 gender F
2 _REACTION TIME_ 42396
2 education uni
2 language EU
2 _REACTION TIME_ 115427
and I would like to reshape it to one row per id, with every _REACTION TIME_ row becoming a separate column, such as:
id consent _REACTION TIME_1 age gender _REACTION TIME_2 education language _REACTION TIME_3
1 yes 5547 24 X 45396 uni EN 105187
2 yes 3547 25 F 42396 uni EU 115427
I have been looking for an answer to this all over SO, but I can't find one for this particular case, where only some of the entries are repeated, and those are repeated multiple times.
Thanks in advance!
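For reference, a minimal reconstruction of the sample frame (all values kept as strings, which matches how a mixed Field_value column is typically stored):
import pandas as pd

df = pd.DataFrame({
    'id': [1]*8 + [2]*8,
    'Field_name': ['consent', '_REACTION TIME_', 'age', 'gender',
                   '_REACTION TIME_', 'education', 'language',
                   '_REACTION TIME_'] * 2,
    'Field_value': ['yes', '5547', '24', 'X', '45396', 'uni', 'EN', '105187',
                    'yes', '3547', '25', 'F', '42396', 'uni', 'EU', '115427'],
})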
Use GroupBy.cumcount to add a counter, but only for the rows flagged as duplicates by DataFrame.duplicated; this makes the Field_name values unique, so the frame can be reshaped with DataFrame.pivot. Finally, restore the original column order with DataFrame.reindex:
# flag all rows whose (id, Field_name) pair occurs more than once
m = df.duplicated(['id','Field_name'], keep=False)
# append a running counter (1, 2, ...) to the duplicated names only
df.loc[m, 'Field_name'] += df[m].groupby(['id','Field_name']).cumcount().add(1).astype(str)
# remember the (now unique) names in their original order
cols = df['Field_name'].unique()
df = df.pivot(index='id', columns='Field_name', values='Field_value').reindex(cols, axis=1)
print (df)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education \
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
A similar solution that avoids overwriting the original DataFrame:
m = df.duplicated(['id','Field_name'], keep=False)
s = (df['Field_name']
       .add(df.groupby(['id','Field_name']).cumcount().add(1).astype(str))
       .where(m, df['Field_name']))
df1 = (df.assign(Field_name=s)
.pivot(index='id', columns='Field_name', values='Field_value')
.reindex(s.unique(), axis=1))
print (df1)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education \
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
If you want to keep the name _REACTION TIME_ in the column header instead of renaming it to _REACTION TIME_1 etc., you can use groupby.apply:
out = (df.groupby('id')
         .apply(lambda g: g.drop('id', axis=1).set_index('Field_name').T)
         .reset_index(level=0).reset_index(drop=True)
         .rename_axis('', axis=1))
print(out)
id consent _REACTION TIME_ age gender _REACTION TIME_ education language _REACTION TIME_
0 1 yes 5547 24 X 45396 uni EN 105187
1 2 yes 3547 25 F 42396 uni EU 115427
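One caveat with this variant: the result has three columns that all share the literal name _REACTION TIME_, so selecting that label returns a sub-DataFrame rather than a Series. A minimal illustration, assuming out from above:
# selecting a duplicated label yields all matching columns at once
rt = out['_REACTION TIME_']         # DataFrame with three columns
print(rt.astype(int).mean(axis=1))  # e.g. mean reaction time per participant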
Related
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe similar to the one above, I have duplicate names. I want to add the suffix '_0' to a name if its DOB is before 1990 and the name is a duplicate.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is duplicated and belongs to someone born before 1990?
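For anyone reproducing this, a minimal construction of the sample frame (DOB kept as strings; ISO-formatted date strings compare correctly with <):
import pandas as pd

df = pd.DataFrame({
    'DOB': ['1956-10-30', '1993-03-21', '2001-09-09', '1993-01-15',
            '1999-05-02', '1962-12-17', '1972-05-04'],
    'Name': ['Anna', 'Jerry', 'Peter', 'Anna', 'James', 'Jerry', 'Kate'],
})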
The problem with your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign it back, the rows that were filtered out have no corresponding value in the filtered dataframe, so they become NaN.
You can use mask instead:
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I have a dataframe that looks like this:
team_1 score_1 team_2 score_2
AUS 2 SCO 1
ENG 1 ARG 0
JPN 0 ENG 2
I can retrieve all the data for a single team by using:
#list specifiying team of interest
team = ['ENG']
# slice the dataframe to the rows where 'team_1' or 'team_2' is in the list 'team'
df.loc[df['team_1'].isin(team) | df['team_2'].isin(team)]
team_1 score_1 team_2 score_2
ENG 1 ARG 0
JPN 0 ENG 2
How can I now return only the score for my 'team' such as:
team score
ENG 1
ENG 2
Maybe by creating an index for each team so I can filter? Or by encoding the team_1 and team_2 columns?
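For reference, a minimal reconstruction of the sample frame:
import pandas as pd

df = pd.DataFrame({
    'team_1': ['AUS', 'ENG', 'JPN'],
    'score_1': [2, 1, 0],
    'team_2': ['SCO', 'ARG', 'ENG'],
    'score_2': [1, 0, 2],
})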
First take the rows where team_1 is 'ENG', keep that side's team/score pair, and rename the columns:
new_df_1 = df[df.team_1 == 'ENG'][['team_1', 'score_1']]
new_df_1 = new_df_1.rename(columns={"team_1": "team", "score_1": "score"})
# team score
# 0 ENG 1
new_df_2 = df[df.team_2 == 'ENG'][['team_2', 'score_2']]
new_df_2 = new_df_2.rename(columns={"team_2": "team", "score_2": "score"})
# team score
# 1 ENG 2
Then concat the two dataframes:
pd.concat([new_df_1, new_df_2])
The output is:
team score
0 ENG 1
1 ENG 2
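The two selections and the concat can also be written in one pass with set_axis; a minimal sketch (an alternative spelling, not from the answer above):
team = ['ENG']
out = pd.concat([
    df[['team_1', 'score_1']].set_axis(['team', 'score'], axis=1),
    df[['team_2', 'score_2']].set_axis(['team', 'score'], axis=1),
])
print(out[out['team'].isin(team)])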
Another approach is to melt the team columns, filter for values in team, pick the score that matches the melted side, and keep only team and score:
team = ["ENG"]
(
df
.melt(cols, value_name="team")
.query("team in #team")
.assign(score=lambda x: x.filter(like="score").sum(axis=1))
.loc[:, ["team", "score"]]
)
team score
1 ENG 1
5 ENG 2
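pd.wide_to_long can also pair the team/score columns directly, which avoids the manual side matching; a hedged sketch, assuming the same team list and the team_/score_ column naming with a numeric suffix:
long = (pd.wide_to_long(df.reset_index(), stubnames=['team', 'score'],
                        i='index', j='side', sep='_')
          .reset_index(drop=True))
print(long[long['team'].isin(team)][['team', 'score']])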
I am attempting to do an Excel COUNTIF with pandas but am hitting a roadblock.
I have this dataframe and need to count the 'Yes' values for each country, quarter-wise. The Quarter_1 and Quarter_2 columns below show the expected counts.
result.head(3)
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
FRANCE Yes Yes No No No No 2 0
BELGIUM Yes Yes No Yes No No 2 1
CANADA Yes No No Yes No No 1 1
I tried the following, but pandas spits out a single total instead, showing 5 for every value under Quarter_1. I don't know how to compute this per Country. Any assistance with this please!
result['Quarter_1'] = (len(result[result['Jan 1'] == 'Yes'])
                       + len(result[result['Feb 1'] == 'Yes'])
                       + len(result[result['Mar 1'] == 'Yes']))
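For reference, a minimal reconstruction of the question's frame, named df to match the answers below and without the Quarter columns (those are what we are about to compute):
import pandas as pd

df = pd.DataFrame({
    'Country': ['FRANCE', 'BELGIUM', 'CANADA'],
    'Jan 1': ['Yes', 'Yes', 'Yes'],
    'Feb 1': ['Yes', 'Yes', 'No'],
    'Mar 1': ['No', 'No', 'No'],
    'Apr 1': ['No', 'Yes', 'Yes'],
    'May 1': ['No', 'No', 'No'],
    'Jun 1': ['No', 'No', 'No'],
})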
We can take the floor division of the column positions to create the quarter groups, group on these along the columns, and sum the 'Yes' flags per group.
Finally we add the prefix Quarter_:
import numpy as np

df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3 + 1   # 1-based so the columns become Quarter_1, Quarter_2
dfn = (
    df.join(df.eq('Yes')
              .groupby(grps, axis=1)
              .sum()
              .astype(int)
              .add_prefix('Quarter_'))
      .reset_index()
)
Or using list comprehension to rename your columns:
df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3
dfn = df.eq('Yes').groupby(grps, axis=1).sum().astype(int)
dfn.columns = [f'Quarter_{col+1}' for col in dfn.columns]
df = df.join(dfn).reset_index()
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
0 FRANCE Yes Yes No No No No 2 0
1 BELGIUM Yes Yes No Yes No No 2 1
2 CANADA Yes No No Yes No No 1 1
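Note that groupby(axis=1) is deprecated in recent pandas (2.1+); the same column-wise grouping can be done on the transpose. A minimal sketch, assuming the reconstructed df from the question:
import numpy as np

base = df.set_index('Country')
grps = np.arange(len(base.columns)) // 3   # [0, 0, 0, 1, 1, 1]
counts = base.eq('Yes').T.groupby(grps).sum().T.astype(int)
counts.columns = [f'Quarter_{c+1}' for c in counts.columns]
out = base.join(counts).reset_index()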
I have the below synopsis of a df:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
What I'm looking for is to count 'user id' and average 'rating' while keeping all other columns intact. The result would be something like this:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
any idea how to do that?
Thanks
If all the values in the columns you are not aggregating over are the same within each group, you can avoid a join by including those columns in the group keys.
Then pass a dictionary of functions to agg, and set as_index=False to keep the grouped-by columns as regular columns:
df.groupby(['movie id','movie title','release date','IMDb URL','genre'], as_index=False).agg({'user id':len,'rating':'mean'})
Note that len is used here to count the rows in each group.
When you have too many columns, you probably do not want to type all of the column names, so here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do:
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
I have tried to find a solution to this but have failed.
I have my master df with transactional data and specifically credit card names:
transactionId, amount, type, person
1 -30 Visa john
2 -100 Visa Premium john
3 -12 Mastercard jenny
I am grouping by person and then aggregating by numb of records and amount.
person numbTrans Amount
john 2 -130
jenny 1 -12
This is fine but I need to add the dimension of creditcard type to my df.
I have grouped a df of the creditcards in use
index CreditCardName
0 Visa
1 Visa Premium
2 Mastercard
What I can't manage is creating a new column in my master dataframe called 'CreditCardId' that uses the type string ('Visa'/'Visa Premium'/'Mastercard') to look up the index from that table:
transactionId, amount, type, CreditCardId, person
1 -30 Visa 0 john
2 -100 Visa Premium 1 john
3 -12 Mastercard 2 jenny
I need this as I am doing some simple k-means clustering and require ints, not strings (or at least I think I do).
Thanks in advance
Rob
If you set the 'CreditCardName' as the index of the second df then you can just call map:
In [80]:
# set up dummy data
import io
import pandas as pd
temp = """transactionId,amount,type,person
1,-30,Visa,john
2,-100,Visa Premium,john
3,-12,Mastercard,jenny"""
temp1 = """index,CreditCardName
0,Visa
1,Visa Premium
2,Mastercard"""
df = pd.read_csv(io.StringIO(temp))
# crucially, set the index column to be the credit card name
df1 = pd.read_csv(io.StringIO(temp1), index_col=[1])
df
Out[80]:
transactionId amount type person
0 1 -30 Visa john
1 2 -100 Visa Premium john
2 3 -12 Mastercard jenny
In [81]:
df1
Out[81]:
index
CreditCardName
Visa 0
Visa Premium 1
Mastercard 2
In [82]:
# now we can call map, passing the series; map aligns on the index and returns the matching index value for our new column
df['CreditCardId'] = df['type'].map(df1['index'])
df
Out[82]:
transactionId amount type person CreditCardId
0 1 -30 Visa john 0
1 2 -100 Visa Premium john 1
2 3 -12 Mastercard jenny 2
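If the lookup table exists only to turn the card names into integers for clustering, pd.factorize can produce the codes directly; a minimal alternative sketch:
# codes are 0..n-1 in order of first appearance; card_names maps them back
df['CreditCardId'], card_names = pd.factorize(df['type'])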