I have a dataframe like this:
origin     destination
germany    germany
germany    italy
germany    spain
USA        USA
USA        spain
Argentina  Argentina
Argentina  Brazil
and I want to filter out the routes that stay within the same country; that is, I want to obtain the following dataframe:
origin     destination
germany    italy
germany    spain
USA        spain
Argentina  Brazil
How can I do this with pandas? I have tried dropping duplicates, but that does not give me the result I want.
Use a simple filter:
df = df[df['origin'] != df['destination']]
Output:
>>> df
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil
You could also use DataFrame.query:
out = df.query('origin!=destination')
Output:
origin destination
1 germany italy
2 germany spain
4 USA spain
6 Argentina Brazil
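For anyone who wants to try both answers end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question; the names mask, filtered and queried are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "origin": ["germany", "germany", "germany", "USA", "USA", "Argentina", "Argentina"],
    "destination": ["germany", "italy", "spain", "USA", "spain", "Argentina", "Brazil"],
})

# Boolean mask: True where origin and destination differ (element-wise comparison).
mask = df["origin"] != df["destination"]
filtered = df[mask]

# The same condition written as a query string.
queried = df.query("origin != destination")

print(filtered)
```

Both approaches keep the original index labels (1, 2, 4, 6). The comparison is exact and case-sensitive, so "USA" and "usa" would be treated as different countries.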
I have a Table with the following format:
Country  GDP  LifeExp
USA      6.5  75
UK       9.5  78
Italy    5.5  80
I need to reshape the table above into the format of the table below. This is just a small part of the actual table, so hard-coding is unfortunately not an option.
Country  Indicator name  Value
USA      GDP             6.5
USA      LifeExp         75
UK       GDP             9.5
UK       LifeExp         78
Italy    GDP             5.5
Italy    LifeExp         80
Here is the code to create the first Table:
import pandas as pd
df = pd.DataFrame({'Country': ["USA", "UK", "Italy"],
                   'GDP': [6.5, 9.5, 5.5],
                   'LifeExp': [75, 78, 80]})
I've never posted anything on Stack Overflow before, so I hope I've provided enough information for someone to help me with this problem.
Thanks in advance!
You can use .melt() with .sort_values(), as follows:
(df.melt(id_vars='Country', var_name='Indicator name', value_name='Value')
   .sort_values('Country', ascending=False)
).reset_index(drop=True)
# Result
Country Indicator name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0
You can choose the sort order of the Country column. If you want ascending order, simply remove the ascending=False argument from the .sort_values() call.
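As a sketch of the ascending variant: the example below also passes kind="mergesort", which is my own addition rather than part of the answer. It requests a stable sort so that GDP stays ahead of LifeExp within each country; the default sort algorithm does not promise that ordering in general.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "UK", "Italy"],
                   "GDP": [6.5, 9.5, 5.5],
                   "LifeExp": [75, 78, 80]})

long_df = (
    df.melt(id_vars="Country", var_name="Indicator name", value_name="Value")
      # Ascending is the default; mergesort is stable, so rows with the same
      # Country keep the order melt produced (GDP first, then LifeExp).
      .sort_values("Country", kind="mergesort")
      .reset_index(drop=True)
)

print(long_df)
```

Because GDP holds floats and LifeExp holds integers, the combined Value column comes out as float, which is why the answer's output shows 75.0 rather than 75.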
Use .stack() and .reset_index():
print(
    df.set_index("Country")
    .stack()
    .reset_index()
    .rename(columns={"level_1": "Indicator Name", 0: "Value"})
)
Prints:
Country Indicator Name Value
0 USA GDP 6.5
1 USA LifeExp 75.0
2 UK GDP 9.5
3 UK LifeExp 78.0
4 Italy GDP 5.5
5 Italy LifeExp 80.0
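The rename step above relies on the default labels ("level_1" and 0) that reset_index generates. A possible variation, shown here only as a sketch, names the column axis before stacking and names the value column while resetting the index, so no default labels need to be known. It uses the "Indicator name" spelling from the question.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "UK", "Italy"],
                   "GDP": [6.5, 9.5, 5.5],
                   "LifeExp": [75, 78, 80]})

out = (
    df.set_index("Country")
      .rename_axis(columns="Indicator name")  # label the column axis before stacking
      .stack()                                 # column labels become an inner index level
      .reset_index(name="Value")               # Series -> DataFrame with a "Value" column
)

print(out)
```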
Here is my problem:
Below is a pandas DataFrame. I would like to group by Date and then filter within each subgroup, but I am having a lot of difficulty doing it (I have spent 3 hours on this and haven't found a solution).
This is what I am looking for:
I first have to group everything by date, then sort the scores in each subgroup from highest to lowest, and then select the two best scores, which must come from different countries.
(For example, if the two best scores are from the same country, we keep the best one and then take the highest score from a different country.)
This is the DataFrame:
Date Name Score Country
2012 Paul 65 France
2012 Silvia 81 Italy
2012 David 80 UK
2012 Alphonse 46 France
2012 Giovanni 82 Italy
2012 Britney 53 UK
2013 Paul 32 France
2013 Silvia 59 Italy
2013 David 92 UK
2013 Alphonse 68 France
2013 Giovanni 23 Italy
2013 Britney 78 UK
2014 Paul 46 France
2014 Silvia 87 Italy
2014 David 89 UK
2014 Alphonse 76 France
2014 Giovanni 53 Italy
2014 Britney 90 UK
The result I am looking for is something like this:
Date Name Score Country
2012 Giovanni 82 Italy
2012 David 80 UK
2013 David 92 UK
2013 Alphonse 68 France
2014 Britney 90 UK
2014 Silvia 87 Italy
Here is the code that I started with:
import pandas as pd
df = pd.DataFrame(
    {'Date': ["2012","2012","2012","2012","2012","2012","2013","2013","2013","2013","2013","2013","2014","2014","2014","2014","2014","2014"],
     'Name': ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney", "Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney", "Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"],
     'Score': [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
     'Country': ["France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK","France","Italy","UK"]})
# Select several columns with a list (double brackets); pandas 2.0+ rejects the older ["Score","Country"] tuple form
df = df.set_index('Name').groupby('Date')[["Score", "Country"]].apply(lambda _df: _df.sort_values(["Score"], ascending=False))
And this is what I have:
But as you can see, in 2012 for example, the two best scores are from the same country (Italy), so what I still have to do is:
1. Select the max per country for each year
2. Select only two best scores (and the countries have to be different).
I would be really thankful for any help, because I really don't know how to do it.
If somebody has ideas on this, please share them :)
PS: please don't hesitate to tell me if anything is unclear.
First sort by both columns with DataFrame.sort_values, then drop duplicate Date/Country pairs with DataFrame.drop_duplicates (this keeps only the best score per country in each year), and finally select the top two rows per group with GroupBy.head:
df1 = (df.sort_values(['Date','Score'], ascending=[True, False])
         .drop_duplicates(['Date','Country'])
         .groupby('Date')
         .head(2))
print (df1)
Date Name Score Country
4 2012 Giovanni 82 Italy
2 2012 David 80 UK
8 2013 David 92 UK
9 2013 Alphonse 68 France
17 2014 Britney 90 UK
13 2014 Silvia 87 Italy
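To connect this chain to the two steps listed in the question, here is a self-contained sketch that splits it into named stages. The data is rebuilt from the question's table; the names best_per_country and top_two are only illustrative.

```python
import pandas as pd

# Data from the question (names, scores and countries repeat per year).
df = pd.DataFrame({
    "Date": ["2012"] * 6 + ["2013"] * 6 + ["2014"] * 6,
    "Name": ["Paul", "Silvia", "David", "Alphonse", "Giovanni", "Britney"] * 3,
    "Score": [65, 81, 80, 46, 82, 53, 32, 59, 92, 68, 23, 78, 46, 87, 89, 76, 53, 90],
    "Country": ["France", "Italy", "UK", "France", "Italy", "UK"] * 3,
})

# Step 1: best score per country within each year.
# After sorting by score (descending) inside each date, drop_duplicates keeps
# the first, i.e. highest-scoring, row for every (Date, Country) pair.
best_per_country = (
    df.sort_values(["Date", "Score"], ascending=[True, False])
      .drop_duplicates(["Date", "Country"])
)

# Step 2: the two best remaining rows per year; they now come from different countries.
top_two = best_per_country.groupby("Date").head(2)

print(top_two)
```

Changing head(2) to head(n) gives the top n countries per year. If two rows tie on Score within a year, which one survives depends on the sort order, so a tie-breaking column may be worth adding to sort_values.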
df1:
Asia 34
America 74
Australia 92
Africa 44
df2 :
Asia 24
Australia 90
Africa 30
I want the output of df1 - df2 to be
Asia 10
America 74
Australia 2
Africa 14
I am stuck on this, and I am new to pandas. Please help.
Use Series.sub, after mapping the second DataFrame's values onto the first with Series.map (the columns are assumed to be named A and B):
df1['B'] = df1['B'].sub(df1['A'].map(df2.set_index('A')['B']), fill_value=0)
print (df1)
A B
0 Asia 10.0
1 America 74.0
2 Australia 2.0
3 Africa 14.0
If it is acceptable for the order of the first column to change, convert the first column of both DataFrames to the index with DataFrame.set_index and subtract:
df2 = df1.set_index('A')['B'].sub(df2.set_index('A')['B'], fill_value=0).reset_index()
print (df2)
A B
0 Africa 14.0
1 America 74.0
2 Asia 10.0
3 Australia 2.0
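If the index-aligned subtraction is preferred but the original row order of df1 should be kept, one option is to reindex the result. This is a sketch under the same assumption as above, namely that the columns are called A and B (the question does not name them); s1, s2 and diff are illustrative names.

```python
import pandas as pd

# Column names "A" and "B" are assumed; the question shows unnamed columns.
df1 = pd.DataFrame({"A": ["Asia", "America", "Australia", "Africa"],
                    "B": [34, 74, 92, 44]})
df2 = pd.DataFrame({"A": ["Asia", "Australia", "Africa"],
                    "B": [24, 90, 30]})

s1 = df1.set_index("A")["B"]
s2 = df2.set_index("A")["B"]

diff = (
    s1.sub(s2, fill_value=0)   # labels missing from one side are treated as 0
      .reindex(s1.index)       # restore df1's original row order
      .reset_index()
)

print(diff)
```

The result is float because the alignment step introduces missing values before fill_value is applied; add .astype(int) at the end if integer output is needed and no real gaps remain.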
I've got a Python/pandas groupby result that is grouped on name and looks like this:
name  gender  year  city           city total
jane  female  2011  Detroit        1
              2015  Chicago        1
dan   male    2009  Lexington      1
bill  male    2001  New York       1
              2003  Buffalo        1
              2000  San Francisco  1
and I want it to look like this:
name  gender  year1  city1          year2  city2     year3  city3    city total
jane  female  2011   Detroit        2015   Chicago                   2
dan   male    2009   Lexington                                       1
bill  male    2000   San Francisco  2001   New York  2003   Buffalo  3
So I want to keep the grouping by name, order each person's entries by year, and give each name only one row. It's a variation on dummy variables, maybe? I'm not even sure how to summarize it.
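One way to describe this is a long-to-wide reshape with a running counter per name. The following is only a sketch: it assumes the underlying data is available as a flat DataFrame with columns name, gender, year and city (the question shows only the grouped display), and the helper column n and the variables wide and ordered are hypothetical names.

```python
import pandas as pd

# Assumed flat version of the data shown in the question.
df = pd.DataFrame({
    "name": ["jane", "jane", "dan", "bill", "bill", "bill"],
    "gender": ["female", "female", "male", "male", "male", "male"],
    "year": [2011, 2015, 2009, 2001, 2003, 2000],
    "city": ["Detroit", "Chicago", "Lexington", "New York", "Buffalo", "San Francisco"],
})

# Order each person's rows by year and number them 1, 2, 3, ...
df = df.sort_values(["name", "year"])
df["n"] = df.groupby("name").cumcount() + 1

# Move the counter into the columns: one row per (name, gender).
wide = df.set_index(["name", "gender", "n"])[["year", "city"]].unstack("n")

# Flatten the two-level column labels: ("year", 1) -> "year1", and so on.
wide.columns = [f"{col}{n}" for col, n in wide.columns]

# Interleave the columns as year1, city1, year2, city2, ...
ordered = [f"{col}{i}" for i in range(1, int(df["n"].max()) + 1) for col in ("year", "city")]
wide = wide[ordered]

# Number of cities per person, then back to ordinary columns.
wide["city total"] = df.groupby(["name", "gender"]).size()
wide = wide.reset_index()

print(wide)
```

Slots that a person does not have (for example jane's year3 and city3) come out as NaN, and the year columns become float as a result. The rows are ordered alphabetically by name after unstack, not in the order shown in the question.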