How to add rows as sums of other rows in DataFrame? - python

I'm not sure I titled this post correctly, but I have a situation where I want to append a new set of rows to an existing DataFrame, computed as sums of rows from the existing sets, and I'm not sure where to start.
For example, I have the following DataFrame:
import pandas as pd
data = {'Team': ['Atlanta', 'Atlanta', 'Cleveland', 'Cleveland'],
        'Position': ['Defense', 'Kicker', 'Defense', 'Kicker'],
        'Points': [5, 10, 15, 20]}
df = pd.DataFrame(data)
print(df)
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
How do I create/append new rows which create a new position for each team and sum the points of the two existing positions for each team? Additionally, the full dataset consists of several more teams so I'm looking for a solution that will work for any number of teams.
edit: I forgot to include that there are other positions in the complete DataFrame, but I only want this solution applied to the positions "Defense" and "Kicker".
My desired output is below.
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
4 Atlanta Defense + Kicker 15
5 Cleveland Defense + Kicker 35
Thanks in advance!

We can use groupby agg to create the summary rows then append to the DataFrame:
df = df.append(df.groupby('Team', as_index=False).agg({
    'Position': ' + '.join,  # Concat Strings together
    'Points': 'sum'          # Total Points
}), ignore_index=True)
df:
Team Position Points
0 Atlanta Defense 5
1 Atlanta Kicker 10
2 Cleveland Defense 15
3 Cleveland Kicker 20
4 Atlanta Defense + Kicker 15
5 Cleveland Defense + Kicker 35
We can also whitelist certain positions by filtering df before groupby to aggregate only the desired positions:
whitelisted_positions = ['Kicker', 'Defense']
df = df.append(
    df[df['Position'].isin(whitelisted_positions)]
    .groupby('Team', as_index=False)
    .agg({
        'Position': ' + '.join,  # Concat Strings together
        'Points': 'sum'          # Total Points
    }), ignore_index=True
)

pandas.DataFrame.append is deprecated since version 1.4.0. Henry Ecker's neat solution just needs a slight tweak to use concat instead.
whitelisted_positions = ['Kicker', 'Defense']
df = pd.concat([
    df,
    df[df['Position'].isin(whitelisted_positions)]
    .groupby('Team', as_index=False)
    .agg({
        'Position': ' + '.join,  # Concat Strings together
        'Points': 'sum'          # Total Points
    })
], ignore_index=True)
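For reference, a self-contained version of the whole flow, using the sample data from the question (nothing here is new beyond the names already shown above):

import pandas as pd

df = pd.DataFrame({'Team': ['Atlanta', 'Atlanta', 'Cleveland', 'Cleveland'],
                   'Position': ['Defense', 'Kicker', 'Defense', 'Kicker'],
                   'Points': [5, 10, 15, 20]})

whitelisted_positions = ['Kicker', 'Defense']
summary = (df[df['Position'].isin(whitelisted_positions)]
           .groupby('Team', as_index=False)
           .agg({'Position': ' + '.join, 'Points': 'sum'}))
df = pd.concat([df, summary], ignore_index=True)
print(df)  # rows 4 and 5 are the per-team "Defense + Kicker" sums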

Related

How can I group by in python and create columns with information of a column if another column has a specific value?

I have a data frame with "Team", "HA" (home/away), "attack", "defense".
What I need is a table grouped by Team with 4 columns (Home_attack, Home_defense, Away_attack, Away_defense), as in the SQL below.
I guess it could be done with an aggregate function, but I don't really know how:
df_ad = df_calc.groupby(['Team', 'Liga']).agg(...)
The equivalent in SQL would be
SELECT Team,
       CASE WHEN HA='Home' THEN attack END AS Home_attack,
       CASE WHEN HA='Home' THEN defense END AS Home_defense,
       CASE WHEN HA='Away' THEN attack END AS Away_attack,
       CASE WHEN HA='Away' THEN defense END AS Away_defense
FROM df_calc;
Just pivot your dataframe:
out = df.pivot(index='Team', columns='HA', values=['attack', 'defense'])
out.columns = out.columns.swaplevel().to_flat_index().map(' '.join)
out = out.reset_index()
print(out)
# Output
Team Away attack Home attack Away defense Home defense
0 A. San Luis 1 3 2 4
1 AC Milan 5 7 6 8
2 AS Roma 9 11 10 12
For the first part, you can read: How can I pivot a dataframe?
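To make the pivot snippet reproducible, here is a minimal self-contained sketch; the input rows are back-filled from the output above, so treat them as dummy data:

import pandas as pd

df = pd.DataFrame({
    'Team': ['A. San Luis', 'A. San Luis', 'AC Milan', 'AC Milan', 'AS Roma', 'AS Roma'],
    'HA': ['Away', 'Home', 'Away', 'Home', 'Away', 'Home'],
    'attack': [1, 3, 5, 7, 9, 11],
    'defense': [2, 4, 6, 8, 10, 12],
})

out = df.pivot(index='Team', columns='HA', values=['attack', 'defense'])
out.columns = out.columns.swaplevel().to_flat_index().map(' '.join)  # ('attack', 'Away') -> 'Away attack'
out = out.reset_index()
print(out)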

Reading nested json to pandas dataframe

I have the below URL that returns a JSON response. I need to read this JSON into a pandas dataframe and perform operations on top of it. This is a case of nested JSON consisting of multiple lists and dicts within dicts.
URL: 'http://api.nobelprize.org/v1/laureate.json'
I have tried below code:
import json
import pandas as pd
import requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
print(df.head(5))
Output-
id firstname surname born died \
0 1 Wilhelm Conrad Röntgen 1845-03-27 1923-02-10
1 2 Hendrik A. Lorentz 1853-07-18 1928-02-04
2 3 Pieter Zeeman 1865-05-25 1943-10-09
3 4 Henri Becquerel 1852-12-15 1908-08-25
4 5 Pierre Curie 1859-05-15 1906-04-19
bornCountry bornCountryCode bornCity \
0 Prussia (now Germany) DE Lennep (now Remscheid)
1 the Netherlands NL Arnhem
2 the Netherlands NL Zonnemaire
3 France FR Paris
4 France FR Paris
diedCountry diedCountryCode diedCity gender \
0 Germany DE Munich male
1 the Netherlands NL NaN male
2 the Netherlands NL Amsterdam male
3 France FR NaN male
4 France FR Paris male
prizes
0 [{'year': '1901', 'category': 'physics', 'shar...
1 [{'year': '1902', 'category': 'physics', 'shar...
2 [{'year': '1902', 'category': 'physics', 'shar...
3 [{'year': '1903', 'category': 'physics', 'shar...
4 [{'year': '1903', 'category': 'physics', 'shar...
But here prizes comes through as a list. If I create a separate dataframe for prizes, it has affiliations as a list. I want all columns to come out as separate columns. Some entries may or may not have prizes, so that case needs to be handled as well.
I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we'll have to use meta and errors='ignore' here, but I'm not able to get it working. Appreciate your inputs here. Thanks.
You would have to do this in a few steps.
The first step is to extract the outer records with record_path = ['laureates'].
The second is record_path = ['laureates', 'prizes'] for the nested records, with meta set to the id from the parent record.
Then combine the two datasets by joining on the id column, and finally drop the unnecessary columns and store.
import json
import pandas as pd
import requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df0 = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
df1 = pd.json_normalize(json.loads(resp.content), record_path=['laureates', 'prizes'], meta=[['laureates', 'id']])
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(['prizes', 'laureates.id'], axis=1)
print('Shape of data ->',output.shape)
print('Columns ->',output.columns)
Shape of data -> (975, 18)
Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
'affiliations', 'overallMotivation'],
dtype='object')
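One caveat, since the question mentions laureates who may not have prizes: pd.merge defaults to an inner join, which silently drops them. A hedged tweak that keeps those rows, with NaN in the prize columns:

output = pd.merge(df0, df1, left_on='id', right_on='laureates.id', how='left').drop(['prizes', 'laureates.id'], axis=1)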
Found an alternate solution as well with less code. This works.
from flatten_json import flatten

winners = json.loads(resp.content)  # reuse the payload fetched above
data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)
(968, 43)
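Note that flatten expands list elements into indexed columns (keys like prizes.0.year, prizes.1.year with the '.' separator), which is why the column count jumps to 43: a laureate with several prizes becomes one wide row rather than several rows.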

Create dictionary and append certain fields and columns to existing dataframe (based on condition in Python)

I have an existing dataframe, df, to which I would like to append several columns whose values depend on whether the existing columns meet certain criteria.
Data
location type count year
ny marvel 1 2021
ca dc 1 2021
Desired
location type count year strength points cost
ny marvel 1 2021 13 1000 100,000
ca dc 1 2021 10 500 200,000
IF the string in the type column is 'marvel' then strength = 13, points = 1000 and cost = 100,000
IF the string in the type column is 'dc' then strength = 10, points = 500 and cost = 200,000
essentially, I would like to create 3 new columns and add values in these columns based on certain criteria
strength points cost
marvel 13 1000 100,000
dc 10 500 200,000
Doing
# empty dictionaries
marvel = {}
dc = {}
marvel_a = {'strength': 13, 'points': 1000, 'cost': 100000}
dc_a = {'strength': 10, 'points': 500, 'cost': 200000}
df.assign({'strength': '', 'points': '', 'cost': ''})
I am creating a dictionary that holds the key and the value, and then I am thinking that I need to append this to the existing dataframe. The dictionaries are working fine, but I am not able to add these 3 new columns.
Any suggestion or advice is appreciated.
If you have dataframe df:
location type count year
0 ny marvel 1 2021
1 ca dc 1 2021
And dataframe df_criteria:
strength points cost
marvel 13 1000 100,000
dc 10 500 200,000
Note the index of this dataframe.
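In case it helps, one hedged way to build df_criteria from the rules in the question (plain ints here; the comma formatting in the desired output is assumed to be display-only):

import pandas as pd

df_criteria = pd.DataFrame({'strength': [13, 10],
                            'points': [1000, 500],
                            'cost': [100000, 200000]},
                           index=['marvel', 'dc'])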
Then:
print(df.merge(df_criteria, how="left", left_on="type", right_index=True))
Prints:
location type count year strength points cost
0 ny marvel 1 2021 13 1000 100,000
1 ca dc 1 2021 10 500 200,000
Assume that your DataFrame is df:
import numpy as np

df['strength'] = np.where(df['type'] == 'marvel', 13,
                          np.where(df['type'] == 'dc', 10, None))
df['points'] = np.where(df['type'] == 'marvel', 1000,
                        np.where(df['type'] == 'dc', 500, None))
df['cost'] = np.where(df['type'] == 'marvel', 100000,
                      np.where(df['type'] == 'dc', 200000, None))
Let me explain:
numpy's where function is np.where(condition, value if condition is True, value if condition is False).
I use nested np.where because there are two conditions, marvel and dc.
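If more categories show up later, np.select avoids the nesting; a minimal sketch under the same column assumptions:

import numpy as np

conditions = [df['type'] == 'marvel', df['type'] == 'dc']
df['strength'] = np.select(conditions, [13, 10], default=None)
df['points'] = np.select(conditions, [1000, 500], default=None)
df['cost'] = np.select(conditions, [100000, 200000], default=None)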
Also, an alternative where you have a dict for each attribute:
import pandas as pd
df = pd.DataFrame({'location': ['ny', 'ca'], 'type': ['marvel','dc'], 'count':[1, 1], 'year': [2021, 2021]})
strength = {'marvel': 13, 'dc': 10}
points = {'marvel': 1000, 'dc': 500}
cost = {'marvel': 100000, 'dc': 200000}
df['strength'] = df['type'].map(strength)
df['points'] = df['type'].map(points)
df['cost'] = df['type'].map(cost)

Faster way to query & compute in Pandas [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes in Pandas. What I want to achieve is: grab every 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2.
For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.
DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
. . . .
70000 Jim 27 Yes
DF1 has approx 70,000 rows with 3 columns
The second dataframe, DF2, has approx 320,000 rows.
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
. . . . .
325082 Jim Scranton Pennsylvania
Currently I have two functions, which return the values of 'City' and 'State' using a filter.
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = df2[filt]['City'].values[0]
    else:
        field = ""
    return field

def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = df2[filt]['State'].values[0]
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
Computed this way, the result takes a long time: roughly 18 minutes to get back df['city_list'] and df['State_list'].
Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is an efficient way to do it.
I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
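Both beat the apply approach because the original functions re-scan all ~320,000 rows of df2 once per name, for each of the ~70,000 names (roughly O(n*m) comparisons), whereas groupby builds the lookup table once and map then does a constant-time lookup per row.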
I think you can do something like this:
# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# Merge on the 'Name' column, then rename the 'City' and 'State' columns
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
3 Jim 27 Yes
DataFrame DF2
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
3 Jim Scranton Pennsylvania
DataFrame final
Name Age Student city_list State_list
0 Dwight 20 Yes Miami Florida
1 Michael 30 No Scranton Pennsylvania
2 Pam 55 No Austin Texas
3 Jim 27 Yes Scranton Pennsylvania
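One difference from the original functions: they return "" for names missing from DF2 and take only the first match, while a default (inner) merge drops missing names and duplicates repeated ones. A hedged variant that mimics the original behaviour:

df = DF1.merge(DF2.drop_duplicates('Name'), on='Name', how='left') \
        .rename(columns={'City': 'city_list', 'State': 'State_list'})
df[['city_list', 'State_list']] = df[['city_list', 'State_list']].fillna('')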

Adding data to Pandas Dataframe in for loop

I am trying to populate a pandas dataframe from multiple dictionaries. Each of the dictionaries is in the form below:
{'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
{'Atlanta': {'DrPepper': '10', 'Pepsi': '25'}}
Ultimately what I want is a dataframe that looks like this (after this I plan to use pandas to do some data transformations, then output the dataframe to a tab-delimited file):
DrPepper Pepsi
Miami 5 8
Atlanta 10 25
If you don't mind using an additional library, you can use toolz.merge to combine all of the dictionaries, followed by DataFrame.from_dict:
import pandas as pd
import toolz

d1 = {'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
d2 = {'Atlanta': {'DrPepper': '10', 'Pepsi': '25'}}
df = pd.DataFrame.from_dict(toolz.merge(d1, d2), orient='index')
This method assumes that you don't have repeated index values (i.e. city names). If you do, the repeats will be overwritten, with the last one in the list of dictionaries taking precedence.
The resulting output:
DrPepper Pepsi
Atlanta 10 25
Miami 5 8
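If you'd rather not add a dependency, a plain-Python equivalent of the merge step (same d1 and d2 as above; later dicts still win on repeated city keys):

df = pd.DataFrame.from_dict({**d1, **d2}, orient='index')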
You can concat DataFrames created from dicts with DataFrame.from_dict:
d1 = {'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
d2 = {'Atlanta':{'DrPepper':'10','Pepsi':'25'}}
print (pd.DataFrame.from_dict(d1, orient='index'))
Pepsi DrPepper
Miami 8 5
print (pd.concat([pd.DataFrame.from_dict(d1, orient='index'),
                  pd.DataFrame.from_dict(d2, orient='index')]))
Pepsi DrPepper
Miami 8 5
Atlanta 25 10
Another solution uses transposition via .T:
print (pd.DataFrame(d1))
Miami
DrPepper 5
Pepsi 8
print (pd.concat([pd.DataFrame(d1).T, pd.DataFrame(d2).T]))
DrPepper Pepsi
Miami 5 8
Atlanta 10 25
It is also possible to use a list comprehension:
L = [d1,d2]
print (pd.concat([pd.DataFrame(d).T for d in L]))
DrPepper Pepsi
Miami 5 8
Atlanta 10 25
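Since the dictionary values are strings, the resulting columns have object dtype. Assuming the planned transformations are numeric, a hedged final step that also writes the tab-delimited file mentioned in the question (output.tsv is a hypothetical filename):

out = pd.concat([pd.DataFrame(d).T for d in L]).astype(int)
out.to_csv('output.tsv', sep='\t')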
