I have the following dict, which I converted to a DataFrame:
players_info = {'Afghanistan': {'Asghar Stanikzai': 809.0,
'Mohammad Nabi': 851.0,
'Mohammad Shahzad': 1713.0,
'Najibullah Zadran': 643.0,
'Samiullah Shenwari': 774.0},
'Australia': {'AJ Finch': 1082.0,
'CL White': 988.0,
'DA Warner': 1691.0,
'GJ Maxwell': 822.0,
'SR Watson': 1465.0},
'England': {'AD Hales': 1340.0,
'EJG Morgan': 1577.0,
'JC Buttler': 985.0,
'KP Pietersen': 1176.0,
'LJ Wright': 759.0}}
df = pd.DataFrame(players_info)
The resulting DataFrame has the players as the index and the three countries as columns, with NaN where a player has no score for a team.
But I want the columns to be mapped with rows like the following
Player Team Score
Mohammad Nabi Afghanistan 851.0
Mohammad Shahzad Afghanistan 1713.0
Najibullah Zadran Afghanistan 643.0
JC Buttler England 985.0
KP Pietersen England 1176.0
LJ Wright England 759.0
I tried reset_index, but it does not give me what I want. How can I do that?
You need:
df = df.stack().reset_index()
df.columns = ['Player', 'Team', 'Score']
Output of df.head(5):
             Player         Team   Score
0          AD Hales      England  1340.0
1          AJ Finch    Australia  1082.0
2  Asghar Stanikzai  Afghanistan   809.0
3          CL White    Australia   988.0
4         DA Warner    Australia  1691.0
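If you want the rows grouped by team as in the desired output, one extra sort does it (just a convenience step):
df = df.sort_values(['Team', 'Player']).reset_index(drop=True)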
Let's take a stab at this using melt. Should be pretty fast.
df.rename_axis('Player').reset_index().melt('Player').dropna()
Player variable value
2 Asghar Stanikzai Afghanistan 809.0
10 Mohammad Nabi Afghanistan 851.0
11 Mohammad Shahzad Afghanistan 1713.0
12 Najibullah Zadran Afghanistan 643.0
14 Samiullah Shenwari Afghanistan 774.0
16 AJ Finch Australia 1082.0
18 CL White Australia 988.0
19 DA Warner Australia 1691.0
21 GJ Maxwell Australia 822.0
28 SR Watson Australia 1465.0
30 AD Hales England 1340.0
35 EJG Morgan England 1577.0
37 JC Buttler England 985.0
38 KP Pietersen England 1176.0
39 LJ Wright England 759.0
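If you also want the question's column names, melt can assign them directly through its var_name and value_name parameters, so no separate rename is needed:
(df.rename_axis('Player')
   .reset_index()
   .melt('Player', var_name='Team', value_name='Score')
   .dropna()
   .reset_index(drop=True))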
I have this kind of code (the code sample is a recreation of production code):
import pandas as pd
df_nba = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/nba.csv')
df_nba['custom'] = 'abc'
df_gpby_team_clg = df_nba.groupby(['custom', 'College', 'Team']).agg({'Salary': sum})
print(df_gpby_team_clg)
The output is a single Salary column indexed by the (custom, College, Team) MultiIndex, ordered alphabetically by College.
Now I want the stats for the first N colleges. So if I give n=2, I get a df with Alabama and Arizona and their respective Team and Salary stats.
You can use .reset_index() to turn the MultiIndex that groupby() produces back into ordinary columns with a plain range index, which makes the subsequent operations easier.
Then extract the first n colleges into a list by calling .unique() on the College column.
Finally, filter the expanded dataframe with .loc, using .isin to check whether College is among those first n colleges:
n = 2
df_gpby_team_clg_expand = df_gpby_team_clg.reset_index()
first_N_college = df_gpby_team_clg_expand['College'].unique()[:n]
df_gpby_team_clg_expand.loc[df_gpby_team_clg_expand['College'].isin(first_N_college)]
Result:
custom College Team Salary
0 abc Alabama Cleveland Cavaliers 2100000.0
1 abc Alabama Memphis Grizzlies 845059.0
2 abc Alabama New Orleans Pelicans 1320000.0
3 abc Arizona Brooklyn Nets 1335480.0
4 abc Arizona Cleveland Cavaliers 9140305.0
5 abc Arizona Detroit Pistons 2841960.0
6 abc Arizona Golden State Warriors 11710456.0
7 abc Arizona Houston Rockets 947276.0
8 abc Arizona Indiana Pacers 5358880.0
9 abc Arizona Milwaukee Bucks 3000000.0
10 abc Arizona New York Knicks 4000000.0
11 abc Arizona Orlando Magic 4171680.0
12 abc Arizona Philadelphia 76ers 525093.0
13 abc Arizona Phoenix Suns 206192.0
Use get_level_values() to get the first n colleges:
n = 2
colleges = df_gpby_team_clg.index.get_level_values('College').unique()[:n]
# Index(['Alabama', 'Arizona'], dtype='object', name='College')
Then extract those colleges with IndexSlice:
index = pd.IndexSlice[:, colleges]
df_gpby_team_clg.loc[index, :]
# Salary
# custom College Team
# abc Alabama Cleveland Cavaliers 2100000.0
# Memphis Grizzlies 845059.0
# New Orleans Pelicans 1320000.0
# Arizona Brooklyn Nets 1335480.0
# Cleveland Cavaliers 9140305.0
# Detroit Pistons 2841960.0
# Golden State Warriors 11710456.0
# Houston Rockets 947276.0
# Indiana Pacers 5358880.0
# Milwaukee Bucks 3000000.0
# New York Knicks 4000000.0
# Orlando Magic 4171680.0
# Philadelphia 76ers 525093.0
# Phoenix Suns 206192.0
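If you'd rather avoid IndexSlice, a boolean mask over the level values selects the same rows; this is just an equivalent sketch:
# keep the rows whose College level is among the first n colleges
mask = df_gpby_team_clg.index.get_level_values('College').isin(colleges)
df_gpby_team_clg[mask]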
I have two DataFrames:
df1
date League teams
0 201902272215 brazil cup foz do iguacu fcceara ce
1 201902272300 colombia primera a deportes tolimaatletico bucaramanga
2 201902272300 brazil campeonato gaucho 2nd division ypiranga rsuniao frederiquense
3 201902272300 brazil campeonato gaucho 2nd division esportivo rstupi rs
4 201902272300 brazil campeonato gaucho 2nd division sao paulo rsgremio esportivo bage
14 201902280000 four nations women tournament (in usa) usa (w)japan (w)
25 201902280030 bolivia professional football league real potosibolivar
df2
date league teams
0 201902280000 womens international usa womenjapan women
1 201902280000 brazil amazonense sul america ecrio negro am
2 201902280030 bolivia apertura real potosibolivar
3 201902280030 brazil campeonato paulista palmeirasituano
4 201902280030 copa sudamericana racing clubcorinthians
The result I want is all the rows from df2, flagged by whether they are a near match with a row in df1:
date league teams near_match
0 201902280000 womens international usa womenjapan women 1
1 201902280000 brazil amazonense sul america ecrio negro am 0
2 201902280030 bolivia apertura real potosibolivar 1
3 201902280030 brazil campeonato paulista palmeirasituano 0
4 201902280030 copa sudamericana racing clubcorinthians 0
I have tried a variation of a for loop using SequenceMatcher, with the match threshold set above 0.8, but haven't had any luck.
import difflib

df_1['merge_teams'] = df_1['teams']  # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']

for teams_1, date_1 in df_1[['teams', 'date']].values:
    for ixb, (teams_2, date_2) in enumerate(df_2[['teams', 'date']].values):
        if difflib.SequenceMatcher(None, teams_1, teams_2).ratio() > .8:
            df_2.loc[ixb, 'merge_teams'] = teams_1  # creates a merge key in df_2
        if difflib.SequenceMatcher(None, date_1, date_2).ratio() > .8:
            df_2.loc[ixb, 'merge_date'] = date_1  # creates a merge key in df_2

# This should return all rows where teams, date and league all match by over 80%
# This is just for teams and date; I want to include league as well
Any advice or guidance would be greatly appreciated.
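One direction to try (only a sketch, assuming a fuzzy match on the concatenated teams string is enough to flag a row): difflib.get_close_matches applies the SequenceMatcher cutoff for you, which avoids the nested loop.
import difflib

# 1 if df_2's teams string matches anything in df_1['teams'] with ratio >= 0.8, else 0
candidates = df_1['teams'].tolist()
df_2['near_match'] = [
    int(bool(difflib.get_close_matches(t, candidates, n=1, cutoff=0.8)))
    for t in df_2['teams']
]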
I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the Excel file with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City": "Beijing", "Country": "China", "Population": "24000000", "Area": "6490"}
print(beijing)
Try appending the new row to the DataFrame you read. Note that append returns a new DataFrame rather than modifying in place, so assign the result back:
travel_df = travel_df.append(beijing, ignore_index=True)
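If the goal is just the list of dicts, plain list.append is enough. Also note that DataFrame.append was removed in pandas 2.0, so on recent versions use pd.concat instead:
# appending to the list of dicts directly
cities.append(beijing)

# pandas >= 2.0: DataFrame.append is gone, so concatenate a one-row frame
travel_df = pd.concat([travel_df, pd.DataFrame([beijing])], ignore_index=True)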
Hi, I am trying to aggregate some data in a dataframe using agg, but my initial statement raised the warning "FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version". I rewrote it based on the pandas documentation, but instead of getting the right column labels I am getting function labels, for example "<function size at 0x0000000002DE9950>". How can I correct the output so that the labels match the deprecated output below, with column names std, mean, size, sum?
Deprecated Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
    .agg({'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
Deprecated Syntax Output:
std mean size sum
Continent
Asia 6.790979e+08 5.797333e+08 5.0 2.898666e+09
Australia NaN 2.331602e+07 1.0 2.331602e+07
Europe 3.464767e+07 7.632161e+07 6.0 4.579297e+08
North America 1.996696e+08 1.764276e+08 2.0 3.528552e+08
South America NaN 2.059153e+08 1.0 2.059153e+08
New Syntax Command:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
.agg(['size', 'sum', 'mean', 'std'])\
.rename(columns={'size': np.size, 'sum': np.sum, 'mean': np.mean, 'std': np.std})
New Syntax Output:
<function size at 0x0000000002DE9950> <function sum at 0x0000000002DE90D0> <function mean at 0x0000000002DE9AE8> <function std at 0x0000000002DE9B70>
Continent
Asia 5 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1 2.331602e+07 2.331602e+07 NaN
Europe 6 4.579297e+08 7.632161e+07 3.464767e+07
North America 2 3.528552e+08 1.764276e+08 1.996696e+08
South America 1 2.059153e+08 2.059153e+08 NaN
Dataframe:
Rank Documents Citable documents Citations Self-citations Citations per document H index Energy Supply Energy Supply per Capita % Renewable 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 Pop Est Continent
Country
China 1 127050 126767 597237 411683 4.70 138 1.271910e+11 93.0 19.754910 3.992331e+12 4.559041e+12 4.997775e+12 5.459247e+12 6.039659e+12 6.612490e+12 7.124978e+12 7.672448e+12 8.230121e+12 8.797999e+12 1.367645e+09 Asia
United States 2 96661 94747 792274 265436 8.20 230 9.083800e+10 286.0 11.570980 1.479230e+13 1.505540e+13 1.501149e+13 1.459484e+13 1.496437e+13 1.520402e+13 1.554216e+13 1.577367e+13 1.615662e+13 1.654857e+13 3.176154e+08 North America
Japan 3 30504 30287 223024 61554 7.31 134 1.898400e+10 149.0 10.232820 5.496542e+12 5.617036e+12 5.558527e+12 5.251308e+12 5.498718e+12 5.473738e+12 5.569102e+12 5.644659e+12 5.642884e+12 5.669563e+12 1.274094e+08 Asia
United Kingdom 4 20944 20357 206091 37874 9.84 139 7.920000e+09 124.0 10.600470 2.419631e+12 2.482203e+12 2.470614e+12 2.367048e+12 2.403504e+12 2.450911e+12 2.479809e+12 2.533370e+12 2.605643e+12 2.666333e+12 6.387097e+07 Europe
Russian Federation 5 18534 18301 34266 12422 1.85 57 3.070900e+10 214.0 17.288680 1.385793e+12 1.504071e+12 1.583004e+12 1.459199e+12 1.524917e+12 1.589943e+12 1.645876e+12 1.666934e+12 1.678709e+12 1.616149e+12 1.435000e+08 Europe
Canada 6 17899 17620 215003 40930 12.01 149 1.043100e+10 296.0 61.945430 1.564469e+12 1.596740e+12 1.612713e+12 1.565145e+12 1.613406e+12 1.664087e+12 1.693133e+12 1.730688e+12 1.773486e+12 1.792609e+12 3.523986e+07 North America
Germany 7 17027 16831 140566 27426 8.26 126 1.326100e+10 165.0 17.901530 3.332891e+12 3.441561e+12 3.478809e+12 3.283340e+12 3.417298e+12 3.542371e+12 3.556724e+12 3.567317e+12 3.624386e+12 3.685556e+12 8.036970e+07 Europe
India 8 15005 14841 128763 37209 8.58 115 3.319500e+10 26.0 14.969080 1.265894e+12 1.374865e+12 1.428361e+12 1.549483e+12 1.708459e+12 1.821872e+12 1.924235e+12 2.051982e+12 2.200617e+12 2.367206e+12 1.276731e+09 Asia
France 9 13153 12973 130632 28601 9.93 114 1.059700e+10 166.0 17.020280 2.607840e+12 2.669424e+12 2.674637e+12 2.595967e+12 2.646995e+12 2.702032e+12 2.706968e+12 2.722567e+12 2.729632e+12 2.761185e+12 6.383735e+07 Europe
South Korea 10 11983 11923 114675 22595 9.57 104 1.100700e+10 221.0 2.279353 9.410199e+11 9.924316e+11 1.020510e+12 1.027730e+12 1.094499e+12 1.134796e+12 1.160809e+12 1.194429e+12 1.234340e+12 1.266580e+12 4.980543e+07 Asia
Italy 11 10964 10794 111850 26661 10.20 106 6.530000e+09 109.0 33.667230 2.202170e+12 2.234627e+12 2.211154e+12 2.089938e+12 2.125185e+12 2.137439e+12 2.077184e+12 2.040871e+12 2.033868e+12 2.049316e+12 5.990826e+07 Europe
Spain 12 9428 9330 123336 23964 13.08 115 4.923000e+09 106.0 37.968590 1.414823e+12 1.468146e+12 1.484530e+12 1.431475e+12 1.431673e+12 1.417355e+12 1.380216e+12 1.357139e+12 1.375605e+12 1.419821e+12 4.644340e+07 Europe
Iran 13 8896 8819 57470 19125 6.46 72 9.172000e+09 119.0 5.707721 3.895523e+11 4.250646e+11 4.289909e+11 4.389208e+11 4.677902e+11 4.853309e+11 4.532569e+11 4.445926e+11 4.639027e+11 NaN 7.707563e+07 Asia
Australia 14 8831 8725 90765 15606 10.28 107 5.386000e+09 231.0 11.810810 1.021939e+12 1.060340e+12 1.099644e+12 1.119654e+12 1.142251e+12 1.169431e+12 1.211913e+12 1.241484e+12 1.272520e+12 1.301251e+12 2.331602e+07 Australia
Brazil 15 8668 8596 60702 14396 7.00 86 1.214900e+10 59.0 69.648030 1.845080e+12 1.957118e+12 2.056809e+12 2.054215e+12 2.208872e+12 2.295245e+12 2.339209e+12 2.409740e+12 2.412231e+12 2.319423e+12 2.059153e+08 South America
Try using just this:
Top15.set_index('Continent').groupby(level=0)['Pop Est'].agg(['size', 'sum', 'mean', 'std'])
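The strings name the result columns directly, so no rename is needed. To reproduce the column order of the deprecated output, just list the aggregations in that order:
Top15.set_index('Continent').groupby(level=0)['Pop Est']\
    .agg(['std', 'mean', 'size', 'sum'])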
I have two csv files. The first file contains names of all countries with their capital cities,
CSV 1:
Capital Country Country Code
Budapest Hungary HUN
Rome Italy ITA
Dublin Ireland IRL
Paris France FRA
Berlin Germany DEU
.
.
.
CSV 2:
The second CSV file contains the trip details of a bus:
Trip City Trip Country No. of pax
Budapest HUN 24
Paris FRA 36
Munich DEU 9
Florence ITA 5
Milan ITA 25
Rome ITA 2
Rome ITA 45
I would like to add a new column df["Tourism visit"] containing the number of pax whenever the Trip City (from CSV 2) is the capital of a country (from CSV 1) and the number of pax is more than 10.
Thank you.
Try this:
mask = df2['Trip City'].isin(df1['Capital']) & (df2['No. of pax'] > 10)
df2['tourism'] = 0
df2.loc[mask, 'tourism'] = df2.loc[mask, 'No. of pax']
I get :
Trip_City Trip_Country No._of_pax tourism
0 Budapest HUN 24 24
1 Paris FRA 36 36
2 Munich DEU 9 0
3 Florence ITA 5 0
4 Milan ITA 25 0
5 Rome ITA 2 0
6 Rome ITA 45 45
(I had to add _s to get pd.read_clipboard() to work properly)
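Equivalently, with the same mask, numpy's where collapses this to a single assignment (assuming numpy is imported):
import numpy as np

# 'No. of pax' where the row qualifies, 0 otherwise
df2['tourism'] = np.where(mask, df2['No. of pax'], 0)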
This might also help.
Import the DataFrames:
df1 = pd.read_csv("CSV1.csv")
df2 = pd.read_csv("CSV2.csv")
Make a dictionary out of the two pandas Series:
my_dict = dict(zip(df1["Country Code"], df1["Capital"]))
Define a function that tests your conditions (note I used np.logical_and() to combine the conditions; a plain and would work just as well here):
import numpy as np

def isTourism(country_code, trip_city, no_of_pax):
    # .get avoids a KeyError if a country code is missing from CSV 1
    if np.logical_and(my_dict.get(country_code) == trip_city, no_of_pax > 10):
        return "Yes"
    else:
        return "No"
Call the function with map:
df2["Tourism"] = list(map(isTourism, df2["Trip Country"], df2["Trip City"], df2["No. of pax"]))
print(df2)
Trip City Trip Country No. of pax Tourism
0 Budapest HUN 24 Yes
1 Paris FRA 36 Yes
2 Munich DEU 9 No
3 Florence ITA 5 No
4 Milan ITA 25 No
5 Rome ITA 2 No
6 Rome ITA 45 Yes
If you filter your second dataframe to only the values > 10, you could merge and sum as follows:
import pandas as pd
df1 = pd.DataFrame({'Capital': ['Budapest', 'Rome', 'Dublin', 'Paris',
'Berlin'],
'Country': ['Hungary', 'Italy', 'Ireland', 'France',
'Germany'],
'Country Code': ['HUN', 'ITA', 'IRL', 'FRA', 'DEU']
})
df2 = pd.DataFrame({'Trip City': ['Budapest', 'Paris', 'Munich', 'Florence',
'Milan', 'Rome', 'Rome'],
'Trip Country': ['HUN', 'FRA', 'DEU', 'ITA', 'ITA',
'ITA', 'ITA'],
'No. of pax': [24, 36, 9, 5, 25, 2, 45]
})
df2 = df2[df2['No. of pax'] > 10]
combined = df1.merge(df2,
left_on=['Capital', 'Country Code'],
right_on=['Trip City', 'Trip Country'],
how='left').groupby(['Capital', 'Country Code'],
sort=False,
as_index=False)['No. of pax'].sum()
print(combined)
This prints:
Capital Country Code No. of pax
0 Budapest HUN 24.0
1 Rome ITA 45.0
2 Dublin IRL NaN
3 Paris FRA 36.0
4 Berlin DEU NaN
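One caveat, depending on your pandas version: newer releases sum an all-NaN group to 0.0 instead of NaN, so to keep the NaN rows for Dublin and Berlin you can pass min_count=1 to sum (a sketch, assuming pandas >= 0.22, where min_count was added):
combined = df1.merge(df2,
                     left_on=['Capital', 'Country Code'],
                     right_on=['Trip City', 'Trip Country'],
                     how='left').groupby(['Capital', 'Country Code'],
                                         sort=False,
                                         as_index=False)['No. of pax'].sum(min_count=1)
# min_count=1 keeps NaN for capitals with no qualifying trips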