I am trying to use numeric values as columns in a Pandas pivot_table. The problem is that since the numbers are mostly unique, the resulting pivot_table isn't very useful as a way to aggregate my data.
Here is what I have so far (fake data example):
import pandas as pd
df = pd.DataFrame({'Country': ['US', 'Brazil', 'France', 'Germany'],
'Continent': ['Americas', 'Americas', 'Europe', 'Europe'],
'Population': [321, 207, 80, 66]})
pd.pivot_table(df, index='Continent', columns='Population', aggfunc='count')
(Image of the resulting pivot table omitted; it has one mostly-empty column per unique Population value.)
How could I group my values into ranges based on my columns?
In other words, how can I count all countries with Population... <100, 100-200, >300?
Use pd.cut to bin the values first (numpy is needed for the open-ended top bin):
import numpy as np

df = df.assign(PopGroup=pd.cut(df.Population,
                               bins=[0, 100, 200, 300, np.inf],
                               labels=['<100', '100-200', '200-300', '>300']))
Output:
Continent Country Population PopGroup
0 Americas US 321 >300
1 Americas Brazil 207 200-300
2 Europe France 80 <100
3 Europe Germany 66 <100
pd.pivot_table(df, index='Continent', columns='PopGroup', values=['Country'], aggfunc='count')
Output:
Country
PopGroup 200-300 <100 >300
Continent
Americas 1.0 NaN 1.0
Europe NaN 2.0 NaN
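If you'd rather see integer counts with 0 instead of NaN, pivot_table's fill_value argument handles that (note that bins with no members at all, like '100-200' here, are still dropped by default via dropna=True). A small sketch building on the frame above:
pd.pivot_table(df, index='Continent', columns='PopGroup', values=['Country'],
               aggfunc='count', fill_value=0)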
I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population containing, for each country, the sum of its cities' populations. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this doesn't work. Could I have some advice on the problem?
Sounds like you're looking for groupby:
import pandas as pd
data = {
'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
To append this Series to the DataFrame (arguably discouraged, since it stores information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
Based on #Vaishali's comment, a one-liner. transform('sum') returns a Series aligned to the original index, one value per row, so it can be assigned directly as a new column:
df['country_population'] = df.groupby('country')['city_population'].transform('sum')
I have the following (toy) dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'System_Key':['MER-002', 'MER-003', 'MER-004', 'MER-005', 'BAV-378', 'BAV-379', 'BAV-380', 'BAV-381', 'AUD-220', 'AUD-221', 'AUD-222', 'AUD-223'],
'Manufacturer':['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'BMW', 'BMW', 'BMW', 'BMW', 'Audi', 'Audi', 'Audi', 'Audi'],
'Region':['Americas', 'Europe', 'Americas', 'Asia', 'Asia', 'Europe', 'Europe', 'Europe', 'Americas', 'Asia', 'Americas', 'Americas'],
'Department':[np.nan, 'Sales', np.nan, 'Operations', np.nan, np.nan, 'Accounting', np.nan, 'Finance', 'Finance', 'Finance', np.nan]
})
System_Key Manufacturer Region Department
0 MER-002 Mercedes Americas NaN
1 MER-003 Mercedes Europe Sales
2 MER-004 Mercedes Americas NaN
3 MER-005 Mercedes Asia Operations
4 BAV-378 BMW Asia NaN
5 BAV-379 BMW Europe NaN
6 BAV-380 BMW Europe Accounting
7 BAV-381 BMW Europe NaN
8 AUD-220 Audi Americas Finance
9 AUD-221 Audi Asia Finance
10 AUD-222 Audi Americas Finance
11 AUD-223 Audi Americas NaN
First, I remove the NaN values in the data frame:
df = df.fillna('')
Then, I pivot the data frame as follows:
pivot = pd.pivot_table(df, index='Manufacturer', columns='Region', values='System_Key', aggfunc='size').applymap(str)
Notice that I'm passing aggfunc='size' for counting.
This results in the following pivot table:
Region Americas Asia Europe
Manufacturer
Audi 3.0 1.0 NaN
BMW NaN 1.0 3.0
Mercedes 2.0 1.0 1.0
How would I convert the float values in this pivot table to integers?
Thanks in advance!
Try fill_value:
pivot = pd.pivot_table(df, index='Manufacturer', columns='Region', values='System_Key',
                       aggfunc='size', fill_value=-1).astype(int)
The only reason you get floats when aggregating integers is that the missing combinations become NaN. So use fill_value=0 to avoid producing the NaNs in the first place:
df.pivot_table(index='Manufacturer', columns='Region', values='System_Key', aggfunc='size', fill_value=0)
Region Americas Asia Europe
Manufacturer
Audi 3 1 0
BMW 0 1 3
Mercedes 2 1 1
Notes:
- This is much better than kludging the dtype afterwards.
- You also don't need the df.fillna(''); filling NaN with the string '' on an integer (or float) column is a bad idea.
- You don't need pd.pivot_table(df, ...); just call df.pivot_table(...) directly, since it's a DataFrame method.
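As an aside, when all you need is a table of co-occurrence counts, pd.crosstab builds it with integer counts (and 0 for absent combinations) directly, no pivot_table tweaking required:
pd.crosstab(df['Manufacturer'], df['Region'])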
Since there are NaNs in the pivoted result, pandas automatically upcasts the integer counts to float. You can either use the nullable Int64 dtype (available from Pandas 0.24+):
pivot = (pd.pivot_table(df, index='Manufacturer', columns='Region',
values='System_Key', aggfunc='size')
.astype('Int64')
)
Output:
Region Americas Asia Europe
Manufacturer
Audi 3 1 <NA>
BMW <NA> 1 3
Mercedes 2 1 1
or fill NaN with, say, -1 in pivot_table:
pivot = (pd.pivot_table(df, index='Manufacturer', columns='Region',
values='System_Key', aggfunc='size',
fill_value=-1) # <--- here
)
Output:
Region Americas Asia Europe
Manufacturer
Audi 3 1 -1
BMW -1 1 3
Mercedes 2 1 1
Use the Int64 datatype, which allows for integer NaNs. The convert_dtypes() method is handy here:
pivot.convert_dtypes()
Americas Asia Europe
Manufacturer
Audi 3 1 <NA>
BMW <NA> 1 3
Mercedes 2 1 1
Also...
- I'd probably do df.fillna('', inplace=True) instead of df = df.fillna('') to minimize data copies.
- I assume you meant to ditch the .applymap(str) bit at the end of your call to pivot_table().
I have concatenated many Pandas series' together to create a dataframe.
datasize = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.count())
datasum = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.sum())
datamean = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.mean())
datastd = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.std())
df = pd.concat([datasize,datasum,datamean,datastd],axis=1)
The output of df is:
estimate estimate estimate estimate
Asia 5 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1 2.331602e+07 2.331602e+07 NaN
Europe 6 4.579297e+08 7.632161e+07 3.464767e+07
North America 2 3.528552e+08 1.764276e+08 1.996696e+08
South America 1 2.059153e+08 2.059153e+08 NaN
However, I would like to rename the columns, in order, to ['size', 'sum', 'mean', 'std'].
I would also like to name the index 'Continent'.
Could anybody give me any advice on how to do this?
Instead of your solution, use GroupBy.agg and then DataFrame.rename_axis.
So change:
datasize = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.count())
datasum = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.sum())
datamean = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.mean())
datastd = Reducedset['estimate'].groupby(level=0).apply(lambda x:x.std())
df = pd.concat([datasize,datasum,datamean,datastd],axis=1)
df.columns = ['size', 'sum', 'mean', 'std']
to:
Reducedset['estimate'] = pd.to_numeric(Reducedset['estimate'], errors='coerce')
df = (Reducedset.groupby(level=0)['estimate']
.agg(['count','sum','mean','std'])
.rename(columns={'count':'size'})
.rename_axis('Continent'))
Or:
Reducedset['estimate'] = pd.to_numeric(Reducedset['estimate'], errors='coerce')
df = (Reducedset.groupby(level=0).agg(size=('estimate', 'count'),
                                      sum=('estimate', 'sum'),
                                      mean=('estimate', 'mean'),
                                      std=('estimate', 'std'))
.rename_axis('Continent'))
You can try this:
df.columns = ['size', 'sum', 'mean', 'std']
To add a name to the index, use df.rename_axis and assign the result (it returns a new DataFrame):
df = df.rename_axis('Continent')
Output:
size sum mean std
Continent
Asia 5 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1 2.331602e+07 2.331602e+07 NaN
Europe 6 4.579297e+08 7.632161e+07 3.464767e+07
North America 2 3.528552e+08 1.764276e+08 1.996696e+08
South America 1 2.059153e+08 2.059153e+08 NaN
Try this.
For the index title:
df.index.name = 'Continent'
For the column names:
df.columns = ['size', 'sum', 'mean', 'std']
Hope it helps.
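If you prefer a single chained expression, set_axis plus rename_axis covers both steps; a minimal sketch of the same fix:
df = df.set_axis(['size', 'sum', 'mean', 'std'], axis=1).rename_axis('Continent')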
I have been trying to merge two geopandas dataframes based on a column and am getting some really weird results. To test this point I made two simple dataframes, and merged them:
import pandas as pd
import geopandas as gpd
df = pd.DataFrame(
{'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
df2 = pd.DataFrame(
{'Capital': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota',
'Caracas'],
'Abbreviation': ['ARG', 'BRA', 'CHI', 'COL', 'VZL']})
combined_df = gdf.merge(df2, left_on='City', right_on='Capital')
print(combined_df)
When I print the results, I get what I expected:
City Country ... Capital Abbreviation
0 Buenos Aires Argentina ... Buenos Aires ARG
1 Brasilia Brazil ... Brasilia BRA
2 Santiago Chile ... Santiago CHI
3 Bogota Colombia ... Bogota COL
4 Caracas Venezuela ... Caracas VZL
The two datasets are merged on their common key: the 'City' column in the first and the 'Capital' column in the second.
I have some other data I am working with. Here is a link to it.
Both of the files are geopackages I've read in as geodataframes. Dataframe 1 has 16166 rows; dataframe 2 has 15511 rows. They have a common ID column, 'ALTPARNO' and 'altparno'. Here is the code I've used to read them in and merge them:
import geopandas as gpd
dataframe1 = gpd.read_file(filepath, layer='allkeepers_2019')
dataframe2 = gpd.read_file(filepath, layer='keepers_2019')
results = dataframe1.merge(dataframe2, left_on='altparno', right_on='ALTPARNO')
When I look at my results, I have a dataframe with over 4 million rows (should be around 15,000).
What is going on?
This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
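For the record, this kind of row explosion is the classic many-to-many join described there: if a key value appears m times in one frame and n times in the other, the merge emits m * n rows for that key, so IDs duplicated on both sides multiply 15,000-odd rows into millions. A sketch for diagnosing it, assuming the dataframe1/dataframe2 frames from the question:
# count keys that occur more than once on each side;
# duplicates on both sides multiply in the merge result
print((dataframe1['altparno'].value_counts() > 1).sum())
print((dataframe2['ALTPARNO'].value_counts() > 1).sum())
# validate= makes merge raise MergeError instead of silently exploding
results = dataframe1.merge(dataframe2, left_on='altparno', right_on='ALTPARNO',
                           validate='one_to_one')  # or 'one_to_many', etc.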
I have two datasets that look like this:
name Longitude Latitude continent
0 Aruba -69.982677 12.520880 North America
1 Afghanistan 66.004734 33.835231 Asia
2 Angola 17.537368 -12.293361 Africa
3 Anguilla -63.064989 18.223959 North America
4 Albania 20.049834 41.142450 Europe
And another dataset looks like this:
COUNTRY GDP (BILLIONS) CODE
0 Afghanistan 21.71 AFG
1 Albania 13.40 ALB
2 Algeria 227.80 DZA
3 American Samoa 0.75 ASM
4 Andorra 4.80 AND
Here, the columns name and COUNTRY contain the country names, but not in the same order.
How can I combine the second dataframe into the first, adding the CODE column to the first dataframe?
Required output:
name Longitude Latitude continent CODE
0 Aruba -69.982677 12.520880 North America NaN
1 Afghanistan 66.004734 33.835231 Asia AFG
2 Angola 17.537368 -12.293361 Africa NaN
3 Anguilla -63.064989 18.223959 North America NaN
4 Albania 20.049834 41.142450 Europe ALB
Attempt:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name' : ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
'Longitude' : [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
'Latitude' : [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
'continent' : ['North America','Asia','Africa','North America','Europe'] })
print(df)
df2 = pd.DataFrame({'COUNTRY' : ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
'GDP (BILLIONS)' : [21.71, 13.40, 227.80, 0.75, 4.80],
'CODE' : ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)
pd.merge(left=df, right=df2, left_on='name', right_on='COUNTRY')
# but this fails to keep the non-matching rows
By default, pd.merge uses how='inner', which uses the intersection of keys across your two dataframes. Here, you need how='left' to use keys only from the left dataframe:
res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
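To match the required output exactly, you could then drop the helper columns brought over from df2:
res = res.drop(columns=['COUNTRY', 'GDP (BILLIONS)'])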
The merge performs an 'inner' merge or join by default, only keeping records that have a match on both the left and the right. You want an 'outer' join, keeping all records (there is also 'left' or 'right').
Example:
import pandas as pd
df1 = pd.DataFrame({
'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)
df2 = pd.DataFrame({
'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)
# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')
# combining the columns used to match
df3['name'] = df3.apply(lambda row: row['name'] if not pd.isnull(row['name']) else row['COUNTRY'], axis=1)
# dropping the now spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
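As an aside, the apply/lambda used to combine the key columns can be replaced by a vectorized fillna that does the same thing:
df3['name'] = df3['name'].fillna(df3['COUNTRY'])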
Pandas has the pd.merge function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html), which performs an inner join by default. An inner join keeps only the keys present in both dataframes, specified via on, or via left_on and right_on when the key columns have different names.
Since you only need the CODE value added, the following line of code could be used:
pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')
This gives the following output:
name Longitude Latitude continent COUNTRY CODE
0 Aruba -69.982677 12.520880 North America NaN NaN
1 Afghanistan 66.004734 33.835231 Asia Afghanistan AFG
2 Angola 17.537368 -12.293361 Africa NaN NaN
3 Anguilla -63.064989 18.223959 North America NaN NaN
4 Albania 20.049834 41.142450 Europe Albania ALB
The following also gives the same result:
new_df = pd.merge(left=df2[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')
COUNTRY CODE name Longitude Latitude continent
0 Afghanistan AFG Afghanistan 66.004734 33.835231 Asia
1 Albania ALB Albania 20.049834 41.142450 Europe
2 NaN NaN Aruba -69.982677 12.520880 North America
3 NaN NaN Angola 17.537368 -12.293361 Africa
4 NaN NaN Anguilla -63.064989 18.223959 North America
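Another common idiom for pulling in a single looked-up column is Series.map with a Series indexed by the lookup key; a small sketch using the df/df2 frames above, which avoids the extra COUNTRY and GDP columns entirely:
df['CODE'] = df['name'].map(df2.set_index('COUNTRY')['CODE'])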