I have a dataframe like this:
origin destination
germany germany
germany italy
germany spain
USA USA
USA spain
Argentina Argentina
Argentina Brazil
and I want to filter out the routes that stay within the same country. That is, I want to obtain the following dataframe:
origin destination
germany italy
germany spain
USA spain
Argentina Brazil
How can I do this with pandas? I have tried dropping duplicates, but that does not give me the result I want.
Use a simple filter:
df = df[df['origin'] != df['destination']]
Output:
>>> df
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil
Alternatively, we could use query:
out = df.query('origin!=destination')
Output:
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil
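For anyone who wants to try both approaches side by side, here is a minimal, self-contained sketch. The sample frame is rebuilt from the question; the names routes, by_mask and by_query are only illustrative and not part of either answer.

```python
import pandas as pd

# Sample data rebuilt from the question
routes = pd.DataFrame({
    "origin": ["germany", "germany", "germany", "USA", "USA", "Argentina", "Argentina"],
    "destination": ["germany", "italy", "spain", "USA", "spain", "Argentina", "Brazil"],
})

# Boolean mask: keep rows whose origin differs from the destination
by_mask = routes[routes["origin"] != routes["destination"]]

# Same filter written as a query string
by_query = routes.query("origin != destination")

print(by_mask)
```

Both variants keep the original index labels; chain .reset_index(drop=True) if you would rather have a fresh 0..n-1 index.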
I have a table like this:
Province Country Date infected
New South Wales Australia 1/22/20 12
Victoria Australia 1/22/20 10
British Columbia Canada 1/22/20 5
USA 1/22/20 7
New South Wales Australia 1/23/20 6
Victoria Australia 1/23/20 2
British Columbia Canada 1/23/20 1
USA 1/23/20 10
Now I want to convert that table into something like this:
Province Country Date infected
New South Wales Australia 1/22/20 12
1/23/20 6
Victoria Australia 1/22/20 10
1/23/20 2
British Columbia Canada 1/22/20 5
1/23/20 1
USA 1/22/20 7
1/23/20 10
I have tried df.sort_values('Date'), but with no luck.
How can I produce this kind of table?
I'm a Python rookie, but let me think along (I'm sure this can be done more neatly).
df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
df = df.groupby(['Province', 'Country', 'Date']).sum()
This gave me:
Province Country Date infected
British Columbia Canada 1/22/20 5
1/23/20 1
USA 1/22/20 7
1/23/20 10
New South Wales Australia 1/22/20 12
1/23/20 6
Victoria Australia 1/22/20 10
1/23/20 2
I assumed you have NaN values in the empty places (at least that is what I got when importing the dataframe), so I replaced each NaN with the value from the row above it.
Then a groupby gave me the result above. Not sure if this is what you were after, but maybe it sparks some ideas =)
import pandas as pd
# Sample data; empty strings stand in for the missing Province values
data = {"Province": ["New South Wales", "Victoria", "British Columbia", "", "New South Wales", "Victoria", "British Columbia", ""],
"Country": ["Australia", "Australia", "Canada", "USA", "Australia", "Australia", "Canada", "USA"],
"Date": ["1/22/20", "1/22/20", "1/22/20", "1/22/20", "1/23/20", "1/23/20", "1/23/20", "1/23/20"],
"infected": [12, 10, 6, 5, 2, 3, 4, 5] }
brics = pd.DataFrame(data)
print(brics)
# Move the grouping columns into a MultiIndex and sort by it
df = brics.set_index(['Country', 'Province', 'Date']).sort_index()
print(df)
Output:
Province Country Date infected
0 New South Wales Australia 1/22/20 12
1 Victoria Australia 1/22/20 10
2 British Columbia Canada 1/22/20 6
3 USA 1/22/20 5
4 New South Wales Australia 1/23/20 2
5 Victoria Australia 1/23/20 3
6 British Columbia Canada 1/23/20 4
7 USA 1/23/20 5
infected
Country Province Date
Australia New South Wales 1/22/20 12
1/23/20 2
Victoria 1/22/20 10
1/23/20 3
Canada British Columbia 1/22/20 6
1/23/20 4
USA 1/22/20 5
1/23/20 5
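One caveat with the approach above: the dates are plain strings, so they are sorted as text rather than chronologically. A hedged sketch of the same idea with parsed dates follows. The "%m/%d/%y" format is an assumption based on how the sample dates look, and the names frame and out are only illustrative.

```python
import pandas as pd

# Same sample data as in the answer above
data = {
    "Province": ["New South Wales", "Victoria", "British Columbia", "",
                 "New South Wales", "Victoria", "British Columbia", ""],
    "Country": ["Australia", "Australia", "Canada", "USA",
                "Australia", "Australia", "Canada", "USA"],
    "Date": ["1/22/20", "1/22/20", "1/22/20", "1/22/20",
             "1/23/20", "1/23/20", "1/23/20", "1/23/20"],
    "infected": [12, 10, 6, 5, 2, 3, 4, 5],
}
frame = pd.DataFrame(data)

# Assumption: dates are month/day/two-digit-year
frame["Date"] = pd.to_datetime(frame["Date"], format="%m/%d/%y")

# Group the rows visually by moving the keys into a sorted MultiIndex
out = frame.set_index(["Country", "Province", "Date"]).sort_index()
print(out)
```

With real datetimes, something like 1/9/20 sorts before 1/22/20, which would not be the case for the raw strings.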
Let's say this is my data frame:
country Edition sports Athletes Medal Gender Score
Germany 1990 Aquatics HAJOS, Alfred gold M 3
Germany 1990 Aquatics HIRSCHMANN, Otto silver M 2
Germany 1990 Aquatics DRIVAS, Dimitrios gold W 3
Germany 1990 Aquatics DRIVAS, Dimitrios silver W 2
US 2008 Athletics MALOKINIS, Ioannis gold M 1
US 2008 Athletics HAJOS, Alfred silver M 2
US 2009 Athletics CHASAPIS, Spiridon gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold W 3
France 2010 Athletics CHOROPHAS, Efstathios gold M 3
France 2010 golf HAJOS, Alfred Bronze M 1
France 2011 golf ANDREOU, Joannis silver W 2
Spain 2011 golf BURKE, Thomas gold M 3
I am trying to find out for how many countries the sum of the men's scores equals the sum of the women's scores.
I have tried the following:
sum_men = df[df['Gender'] == 'M'].groupby('country')['Score'].sum()
sum_women = df[df['Gender'] == 'W'].groupby('country')['Score'].sum()
Now I don't know how to compare these two and count the countries where the sum of the men's scores equals the sum of the women's scores.
Can anyone please help me with this?
You can do this:
sum_men = df[df['Gender']=='M'].groupby('country')['Score'].sum().reset_index()  # note the reset_index()
sum_women = df[df['Gender']=='W'].groupby('country')['Score'].sum().reset_index()
new_df = sum_men.merge(sum_women, on="country")
new_df['diff'] = new_df['Score_x'] - new_df['Score_y']
new_df
country Score_x Score_y diff
0 France 4 5 -1
1 Germany 5 5 0
2 US 3 3 0
print(new_df[new_df['diff']==0])
country Score_x Score_y diff
1 Germany 5 5 0
2 US 3 3 0
Not sure whether you want to keep the countries whose sums are equal or the other ones, but the same logic applies either way:
group = df.groupby(['country', 'Gender'])['Score'].sum().unstack()
not_equal = group[group.M != group.W]
filtered_df = df[df.country.isin(not_equal.index)]
Output:
country Edition sports Athletes Medal Gender Score score_sum
7 France 2010 Athletics CHOROPHAS, Efstathios gold W 3 5
8 France 2010 Athletics CHOROPHAS, Efstathios gold M 3 4
9 France 2010 golf HAJOS, Alfred Bronze M 1 4
10 France 2011 golf ANDREOU, Joannis silver W 2 5
11 Spain 2011 golf BURKE, Thomas gold M 3 3
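Since the question asks for a count, here is a small sketch that produces the number directly. It only rebuilds the three columns that matter from the sample data, and the names scores, sums, equal and n_equal are illustrative. Countries with scores for only one gender (Spain here) end up with NaN on one side and therefore never compare as equal.

```python
import pandas as pd

# Only the columns needed for the comparison, taken from the sample data
scores = pd.DataFrame({
    "country": ["Germany"] * 4 + ["US"] * 3 + ["France"] * 4 + ["Spain"],
    "Gender": ["M", "M", "W", "W", "M", "M", "W", "W", "M", "M", "W", "M"],
    "Score": [3, 2, 3, 2, 1, 2, 3, 3, 3, 1, 2, 3],
})

# One row per country, one column per gender, summed scores as values
sums = scores.pivot_table(index="country", columns="Gender",
                          values="Score", aggfunc="sum")

# NaN == anything is False, so single-gender countries drop out here
equal = sums[sums["M"] == sums["W"]]
n_equal = len(equal)
print(n_equal, list(equal.index))
```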
Say I have a transposed df like so
id 0 1 2 3
0 1361 Spain Russia South Africa China
1 1741 Portugal Cuba UK Ukraine
2 1783 Germany USA France Egypt
3 1353 Brazil Russia Japan Kenya
4 1458 India Romania Holland Nigeria
How could I get all rows where any cell contains 'er', so that it returns this:
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
because 'er' is contained in Germany and Nigeria.
Thanks!
Using str.contains:
df[df.apply(lambda x: x.str.contains(pat='er')).any(axis=1)]
Out[96]:
id 0 1 2 3
2 1783 Germany USA France Egypt None
4 1458 India Romania Holland Nigeria None
Use apply + str.contains across rows:
df = df[df.apply(lambda x: x.str.contains('er').any(), axis=1)]
print(df)
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
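Both answers rely on every column supporting the .str accessor. If the id column is numeric, that may fail, so here is a hedged variant that casts everything to strings first. The frame is rebuilt from the question, and the names table, mask and hits are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; "id" is numeric on purpose
table = pd.DataFrame({
    "id": [1361, 1741, 1783, 1353, 1458],
    0: ["Spain", "Portugal", "Germany", "Brazil", "India"],
    1: ["Russia", "Cuba", "USA", "Russia", "Romania"],
    2: ["South Africa", "UK", "France", "Japan", "Holland"],
    3: ["China", "Ukraine", "Egypt", "Kenya", "Nigeria"],
})

# Cast to str so .str works on every column, then test each cell
# for the literal substring "er" (regex=False avoids regex surprises)
mask = table.astype(str).apply(
    lambda col: col.str.contains("er", regex=False)
).any(axis=1)

hits = table[mask]
print(hits)
```

The match is case-sensitive; pass case=False to str.contains if 'Er' should also count.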
I want to fill the Nationality_Codes field of my df2 dataframe with the country codes stored as alpha_3_code in the df dataframe. For every row in df2, I want to match Reviewer_Nationality against en_short_name in df and, if there is a match, assign the country code to Nationality_Codes in df2.
df2.head()
Nationality_Codes Reviewer_Nationality Reviewer_Score
NaN Russia 2.9
NaN United Kingdom 7.5
NaN Australia 7.1
NaN United Kingdom 3.8
NaN Russia 6.7
df.head()
alpha_3_code en_short_name nationality
RUS Russia Russian
ALA Åland Islands Åland Island
ALB Albania Albanian
AUS Australia Australian
UK United Kingdom British, UK
Final Result should be:
df2.head()
Nationality_Codes Reviewer_Nationality Reviewer_Score
RUS Russia 2.9
UK United Kingdom 7.5
AUS Australia 7.1
UK United Kingdom 3.8
RUS Russia 6.7
I tried this code, but it didn't work.
for index, row in df.iterrows():
    for index2, row2 in df2.iterrows():
        if row2['Reviewer_Nationality']==row['en_short_name']:
            df2['Nationality_Codes'][row2]=df['alpha_3_code'][row2]
Can anyone help me?
Many Thanks!
One way would be to create a Series that maps the English names to their codes, and use .map:
#my_map = pd.Series(df.alpha_3_code.values,index=df.en_short_name)
my_map = df.set_index('en_short_name')['alpha_3_code']
df2['Nationality_Codes'] = df2['Reviewer_Nationality'].map(my_map)
Output:
>>> df2
Nationality_Codes Reviewer_Nationality Reviewer_Score
0 RUS Russia 2.9
1 UK United Kingdom 7.5
2 AUS Australia 7.1
3 UK United Kingdom 3.8
4 RUS Russia 6.7
Try this:
merged = (
    df[['alpha_3_code', 'en_short_name']]
    .merge(df2[['Reviewer_Nationality', 'Reviewer_Score']],
           left_on='en_short_name', right_on='Reviewer_Nationality', how='left')
    .rename(columns={'alpha_3_code': 'Nationality_Codes'})
    .drop('en_short_name', axis=1)
)
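If you want to keep the rows and the order of df2, a left merge that starts from the reviews side may be closer to the desired result. This is only a sketch under that assumption; the frames are rebuilt from the question, and the names codes and reviews stand in for df and df2.

```python
import pandas as pd

# Stand-in for df (country lookup table), rebuilt from the question
codes = pd.DataFrame({
    "alpha_3_code": ["RUS", "ALA", "ALB", "AUS", "UK"],
    "en_short_name": ["Russia", "Åland Islands", "Albania", "Australia", "United Kingdom"],
})

# Stand-in for df2 (reviews), without the empty Nationality_Codes column
reviews = pd.DataFrame({
    "Reviewer_Nationality": ["Russia", "United Kingdom", "Australia", "United Kingdom", "Russia"],
    "Reviewer_Score": [2.9, 7.5, 7.1, 3.8, 6.7],
})

# Left merge from the reviews side keeps every review in its original order;
# nationalities without a match would get NaN as their code
merged = (
    reviews.merge(codes[["en_short_name", "alpha_3_code"]],
                  left_on="Reviewer_Nationality", right_on="en_short_name",
                  how="left")
    .drop(columns="en_short_name")
    .rename(columns={"alpha_3_code": "Nationality_Codes"})
)
print(merged)
```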
I'm working on an economics paper and need some help with combining and transforming two datasets.
I have two pandas dataframes, one with a list of countries and their neighbors (borderdf) such as
borderdf
country neighbor
sweden norway
sweden denmark
denmark germany
denmark sweden
and one with data (datadf) for each country and year such as
datadf
country gdp year
sweden 5454 2004
sweden 5676 2005
norway 3433 2004
norway 3433 2005
denmark 2132 2004
denmark 2342 2005
I need to create a column in datadf, neighborsmeangdp, that contains the mean of the GDP of all the neighbors, as given by borderdf. I would like my result to look like this:
datadf
country year gdp neighborsmeangdp
sweden 2004 5454 5565
sweden 2005 5676 5775
How should I go about doing this?
You can directly merge the two using pandas merge function.
The trick here is that you actually want to merge the country column in your datadf with the neighbor column in your borderdf.
Then use groupby and mean to get the average neighbor gdp.
Finally, merge back with the data to get the country's own GDP.
For example:
import pandas as pd
from io import StringIO
border_csv = '''
country, neighbor
sweden, norway
sweden, denmark
denmark, germany
denmark, sweden
'''
data_csv = '''
country, gdp, year
sweden, 5454, 2004
sweden, 5676, 2005
norway, 3433, 2004
norway, 3433, 2005
denmark, 2132, 2004
denmark, 2342, 2005
'''
# A regex separator needs the Python parser engine; the leading blank line is skipped
borders = pd.read_csv(StringIO(border_csv), sep=r',\s*', engine='python')
data = pd.read_csv(StringIO(data_csv), sep=r',\s*', engine='python')
# Attach each neighbor's GDP rows to the (country, neighbor) pairs
merged = pd.merge(borders,data,left_on='neighbor',right_on='country')
merged = merged.drop('country_y', axis=1)
merged.columns = ['country','neighbor','gdp','year']
# Average the neighbors' GDP per country and year
neighbor_means = merged.groupby(['country','year'])['gdp'].mean()
neighbor_means = neighbor_means.rename('neighbor_gdp').reset_index()
# Merge back to get the country's own GDP
results_df = pd.merge(neighbor_means,data, on=['country','year'])
I think a direct way is to put the GDP values into the border DataFrame. Then all that is needed is to sum over the groupby object and do a merge:
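In [178]:
# datadf2 is not defined in the original post; assumed here to be datadf indexed by (country, year)
datadf2 = datadf.set_index(['country', 'year'])
borderdf[2004] = [datadf2.loc[(item, 2004), 'gdp'] for item in borderdf.neighbor]
borderdf[2005] = [datadf2.loc[(item, 2005), 'gdp'] for item in borderdf.neighbor]
# Sum only the year columns, not the neighbor names
gpdf = borderdf.groupby(by=['country'])[[2004, 2005]].sum()
df = gpdf.unstack().rename('neighborsmeangdp').reset_index()
df = df.rename(columns={'level_0': 'year'})
print(pd.merge_ordered(datadf, df))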
country gdp year neighborsmeangdp
0 denmark 2132 2004 7586
1 germany 2132 2004 NaN
2 norway 3433 2004 NaN
3 sweden 5454 2004 5565
4 denmark 2342 2005 8018
5 germany 2342 2005 NaN
6 norway 3433 2005 NaN
7 sweden 5676 2005 5775
[8 rows x 4 columns]
Note that I had to make up some data for Germany:
germany 2132 2004
germany 2342 2005
I am sure that in reality she is doing better.
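One thing worth checking against the question: the expected value for sweden in 2004 (5565) matches the sum of the neighbors' GDP (3433 + 2132), while the mean would be 2782.5. A hedged sketch that computes both, so you can pick the one you actually need, is shown below. The frames are rebuilt from the question's sample data, and the column names neighbor_gdp, neighbors_mean and neighbors_sum are made up for this example.

```python
import pandas as pd

# Sample frames rebuilt from the question
borderdf = pd.DataFrame({
    "country": ["sweden", "sweden", "denmark", "denmark"],
    "neighbor": ["norway", "denmark", "germany", "sweden"],
})
datadf = pd.DataFrame({
    "country": ["sweden", "sweden", "norway", "norway", "denmark", "denmark"],
    "gdp": [5454, 5676, 3433, 3433, 2132, 2342],
    "year": [2004, 2005, 2004, 2005, 2004, 2005],
})

# Look up each neighbor's GDP per year (neighbors without data drop out)
pairs = borderdf.merge(
    datadf.rename(columns={"country": "neighbor", "gdp": "neighbor_gdp"}),
    on="neighbor",
)

# Aggregate the neighbors' GDP per country and year, both ways
stats = (
    pairs.groupby(["country", "year"])["neighbor_gdp"]
    .agg(neighbors_mean="mean", neighbors_sum="sum")
    .reset_index()
)

# Attach the aggregates to the original data; countries without
# listed neighbors (norway here) get NaN
result = datadf.merge(stats, on=["country", "year"], how="left")
print(result)
```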