Pandas group by but keep another column - python

Say that I have a dataframe that looks something like this
date location year
0 1908-09-17 Fort Myer, Virginia 1908
1 1909-09-07 Juvisy-sur-Orge, France 1909
2 1912-07-12 Atlantic City, New Jersey 1912
3 1913-08-06 Victoria, British Columbia, Canada 1912
I want to use pandas groupby function to create an output that shows the total number of incidents by year but also keep the location column that will display one of the locations that year. Any which one works. So it would look something like this:
total location
year
1908 1 Fort Myer, Virginia
1909 1 Juvisy-sur-Orge, France
1912 2 Atlantic City, New Jersey
Can this be done without doing funky joining? The furthest I can get is using the normal groupby
df = df.groupby(['year']).count()
But that only gives me something like this
location
year
1908 1 1
1909 1 1
1912 2 2
How can I display one of the locations in this dataframe?

You can use groupby.agg and use 'first' to extract the first location in each group:
res = df.groupby('year')['location'].agg(['first', 'count'])
print(res)
# first count
# year
# 1908 Fort Myer, Virginia 1
# 1909 Juvisy-sur-Orge, France 1
# 1912 Atlantic City, New Jersey 2

Related

How to find the value belonging to the final row of a certain year in a pandas dataframe?

I have a data frame of the Euro cup results from 1960-2020.
Here is an example of the tournament in 1960 (only 4 games were played):
year
home_team
away_team
home_score
away_score
winner
1960
Czechoslovakia
Russia
0
3
Russia
1960
France
Yugoslavia
4
5
Yugoslavia
1960
France
Czechoslovakia
0
2
Czechoslovakia
1960
Russia
Yugoslavia
2
1
Russia
The winner of the tournament was russia, who won the final game of 1960.
The same will be true for the other years, where the winner always wins the final game of that year.
How can I get this info for the entire dataframe?
Also, when I groupby year, how can I add this info to a new column called "tournament_winner"?
Use
df['tournament_winner'] = df.groupby('year')['winner'].transform('last')

Creating new variable by aggregation in python

I'm pretty new to python and pandas, and know only the basics. Nowadays I'm conducting a research and I need your kind help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let’s call him ‘Births’, which contains the number of births in country at a date. In other words, I want to stick to just one row for each date+country combination by aggregating the number of countries in each date, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran
I guess you can use the groupby method of your DataFrame, then use the size method to count the number of individuals in each group :
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples related to group-by operations : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html.

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows on each country where the year adds 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably I would also want to create a loop that runs the code a certain number of times
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation in the new dataframe (sum 20 years, multiply the temperature by a constant or an array, etc...) and use then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
where df is your data frame name:
df['tempChange'] = df['year']+ 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure if I understood your logic correct so the math may need some work
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply to each row.

How to output the top 5 of a specific column along with associated columns using python?

I've tried to use df2.nlargest(5, ['1960'] this gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the column name "Country Name" and "1960" only, but sort by the column "1960."
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000

How to add a dictionary as the last element to a list of dictionaries?

I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the excel with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City" : "Beijing", "Country" : "China", "Population" : "24000000", "Ares" : "6490" }
print(beijing)
Try appending the new row to the DataFrame you read.
travel_df.append(beijing, ignore_index=True)

Categories

Resources