How to combine and do group computations on Pandas datasets? - python

I'm working on an economics paper and need some help with combining and transforming two datasets.
I have two pandas dataframes, one with a list of countries and their neighbors (borderdf) such as
borderdf
country  neighbor
sweden   norway
sweden   denmark
denmark  germany
denmark  sweden
and one with data (datadf) for each country and year such as
datadf
country  gdp   year
sweden   5454  2004
sweden   5676  2005
norway   3433  2004
norway   3433  2005
denmark  2132  2004
denmark  2342  2005
I need to create a column in datadf, neighborsmeangdp, that contains the mean of the GDPs of all of a country's neighbors, as given by borderdf. I would like my result to look like this:
datadf
country  year  gdp   neighborsmeangdp
sweden   2004  5454  5565
sweden   2005  5676  5775
How should I go about doing this?

You can merge the two directly using pandas' merge function.
The trick here is that you actually want to merge the country column in your datadf with the neighbor column in your borderdf.
Then use groupby and mean to get the average neighbor gdp.
Finally, merge back with the data to get the country's own GDP.
For example:
import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''
data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''
# the regex separator swallows the space after each comma and requires the python engine;
# blank lines are skipped, so the first non-blank line becomes the header
borders = pd.read_csv(StringIO(border_csv), sep=r',\s*', engine='python')
data = pd.read_csv(StringIO(data_csv), sep=r',\s*', engine='python')

# attach each neighbor's data by matching borderdf.neighbor to datadf.country
merged = pd.merge(borders, data, left_on='neighbor', right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country', 'neighbor', 'gdp', 'year']

# average the neighbors' gdp per country and year
grouped = merged.groupby(['country', 'year'])
neighbor_means = grouped[['gdp']].mean()
neighbor_means.columns = ['neighbor_gdp']
neighbor_means.reset_index(inplace=True)

# merge back with the data to recover each country's own gdp
results_df = pd.merge(neighbor_means, data, on=['country', 'year'])
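One thing to double-check: the sample output in the question (5565 for sweden in 2004) equals 3433 + 2132, i.e. the sum of the neighbors' GDPs rather than their mean. If the sum is what is actually wanted, swap .mean() for .sum():

# the question's sample numbers are sums (5565 = 3433 + 2132, 5775 = 3433 + 2342)
neighbor_sums = grouped[['gdp']].sum()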

I think a direct way is to put the GDP values in the border DataFrame. Then all that is needed is to sum the groupby object and do a merge:
# datadf2 is assumed to be datadf indexed by (country, year), including the
# made-up germany rows mentioned below
datadf2 = datadf.set_index(['country', 'year'])

borderdf[2004] = [datadf2.loc[(item, 2004)].values[0] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005)].values[0] for item in borderdf.neighbor]
# sum only the two year columns, not the string columns
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
df = pd.DataFrame(gpdf.unstack(), columns=['neighborsmeangdp'])
df = df.reset_index()
df = df.rename(columns={'level_0': 'year'})
print(pd.merge_ordered(datadf, df))  # pd.ordered_merge in older pandas
   country   gdp  year  neighborsmeangdp
0  denmark  2132  2004              7586
1  germany  2132  2004               NaN
2   norway  3433  2004               NaN
3   sweden  5454  2004              5565
4  denmark  2342  2005              8018
5  germany  2342  2005               NaN
6   norway  3433  2005               NaN
7   sweden  5676  2005              5775
[8 rows x 4 columns]
Of course, I had to make up some data for Germany:
germany 2132 2004
germany 2342 2005
which I am sure is doing better in reality.

Related

Delete the rows that have the same value in two columns of a DataFrame

I have a dataframe like this:
origin     destination
germany    germany
germany    italy
germany    spain
USA        USA
USA        spain
Argentina  Argentina
Argentina  Brazil
and I want to filter out the routes that stay within the same country, that is, I want to obtain the following dataframe:
origin     destination
germany    italy
germany    spain
USA        spain
Argentina  Brazil
How can I do this with pandas? I have tried dropping duplicates, but it does not give me the results I want.
Use a simple filter:
df = df[df['origin'] != df['destination']]
Output:
>>> df
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil
Alternatively, we could use query:
out = df.query('origin != destination')
Output:
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil

Converting multiple variables into one column and creating a matching values column using pandas

I have a Table with the following format:
Country  GDP  LifeExp
USA      6.5  75
UK       9.5  78
Italy    5.5  80
I need to change the table above to the format of the table below. This is just a small part of the actual table, so hard-coding is not going to cut it, unfortunately.
Country  Indicator name  Value
USA      GDP             6.5
USA      LifeExp         75
UK       GDP             9.5
UK       LifeExp         78
Italy    GDP             5.5
Italy    LifeExp         80
Here is the code to create the first Table:
import pandas as pd

df = pd.DataFrame({'Country': ["USA", "UK", "Italy"],
                   'GDP': [6.5, 9.5, 5.5],
                   'LifeExp': [75, 78, 80]})
I've never posted anything on Stack Overflow before, so I hope I've provided sufficient info for someone to help me with this problem.
Thanks in advance!
You can use .melt() with .sort_values(), as follows:
(df.melt(id_vars='Country', var_name='Indicator name', value_name='Value')
   .sort_values('Country', ascending=False)
   .reset_index(drop=True))
# Result
Country Indicator name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0
You can choose sorting order of Country column. If you want it in ascending order, you can simply remove the parameter ascending=False in the .sort_values() function.
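For instance, the same chain with the parameter removed (nothing else changes):

# ascending Country order: simply omit ascending=False
(df.melt(id_vars='Country', var_name='Indicator name', value_name='Value')
   .sort_values('Country')
   .reset_index(drop=True))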
Use .stack() and .reset_index():
print(
    df.set_index("Country")
      .stack()
      .reset_index()
      .rename(columns={"level_1": "Indicator Name", 0: "Value"})
)
Prints:
Country Indicator Name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0

How to filter within a subgroup (Pandas)

Here is my problem:
Below you will find a Pandas DataFrame. I would like to group by Date and then filter within the subgroups, but I am having a lot of difficulty doing it (I have spent 3 hours on this and haven't found a solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores from highest to lowest within each subgroup, and then select the two best scores, but they have to be from different countries.
(For example, if the two best are from the same country, then we keep the higher one and take the best score from a different country as the second.)
This is the DataFrame:
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The result I am looking for is something like this:
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code I started with:
df = pd.DataFrame(
    {'Date': ["2012"] * 6 + ["2013"] * 6 + ["2014"] * 6,
     'Name': ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"] * 3,
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France", "Italy", "UK"] * 6})

# note the double brackets: selecting a list of columns on a groupby
df = (df.set_index('Name')
        .groupby('Date')[['Score', 'Country']]
        .apply(lambda _df: _df.sort_values(['Score'], ascending=False)))
This gives me each year's scores sorted in descending order. But, as you can see for 2012, the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year.
2. Select only the two best scores (and the countries have to be different).
I would be really thankful for any help, because I really don't know how to do it.
If somebody has some ideas on this, please share them :)
PS: please don't hesitate to tell me if this wasn't clear enough.
Use DataFrame.sort_values first on both columns, then remove duplicates per Date and Country with DataFrame.drop_duplicates, and finally select the top two rows per group with GroupBy.head:
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date')
         .head(2))
print(df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
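For completeness, the asker's own two steps can also be spelled out directly; a sketch equivalent to the chain above:

# step 1: best score per (Date, Country)
best = df.loc[df.groupby(['Date', 'Country'])['Score'].idxmax()]
# step 2: top two per Date (countries are now guaranteed different)
top2 = (best.sort_values('Score', ascending=False)
            .groupby('Date')
            .head(2)
            .sort_values('Date'))
print(top2)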

Subtract columns from two data frames based on a common column

df1:
A          B
Asia       34
America    74
Australia  92
Africa     44

df2:
A          B
Asia       24
Australia  90
Africa     30

I want the output of df1 - df2 to be:
A          B
Asia       10
America    74
Australia   2
Africa     14
I am getting stuck on this; I am a newbie to pandas. Please help out.
Use Series.sub, mapping the second DataFrame onto the first with Series.map:
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print(df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If a changed ordering of the first column is acceptable, convert both first columns to the index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print(df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0

python groupby: how to move hierarchical grouped data from rows into columns?

I've got a python/pandas groupby that is grouped on name and looks like this:
name  gender  year  city           city total
jane  female  2011  Detroit        1
              2015  Chicago        1
dan   male    2009  Lexington      1
bill  male    2001  New York       1
              2003  Buffalo        1
              2000  San Francisco  1
and I want it to look like this:
name  gender  year1  city1          year2  city2     year3  city3    city total
jane  female  2011   Detroit        2015   Chicago                   2
dan   male    2009   Lexington                                       1
bill  male    2000   San Francisco  2001   New York  2003   Buffalo  3
So I want to keep the grouping by name, order each name's records by year, and have each name occupy only one row. Is it a variation on dummy variables, maybe? I'm not even sure how to summarize it.
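A sketch of one possible approach, assuming the data starts as a flat DataFrame with columns name, gender, year, and city (the 'visit' counter column is just an illustrative name): number each person's records chronologically with groupby().cumcount(), then unstack that counter into columns.

import pandas as pd

df = pd.DataFrame({
    'name':   ['jane', 'jane', 'dan', 'bill', 'bill', 'bill'],
    'gender': ['female', 'female', 'male', 'male', 'male', 'male'],
    'year':   [2011, 2015, 2009, 2001, 2003, 2000],
    'city':   ['Detroit', 'Chicago', 'Lexington', 'New York', 'Buffalo', 'San Francisco'],
})

# order each person's records by year and number them 1, 2, 3, ...
df = df.sort_values(['name', 'year'])
df['visit'] = df.groupby('name').cumcount() + 1

# pivot the counter into the columns: one row per (name, gender)
wide = df.set_index(['name', 'gender', 'visit'])[['year', 'city']].unstack('visit')
wide = wide.sort_index(axis=1, level='visit')  # pair up each visit's year/city columns
wide.columns = [f'{col}{i}' for col, i in wide.columns]

# count how many cities each person has
wide['city total'] = wide.filter(like='city').notna().sum(axis=1)
print(wide.reset_index())

Missing visits show up as NaN (which also turns the year columns into floats), so some column renaming and fillna cleanup may be wanted afterwards.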
