Sum columns by key values in another column - python

I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population containing the sum of the city populations for each country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this doesn't work. Could I have some advice on the problem?

Sounds like you're looking for groupby
import pandas as pd
data = {
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000]
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
You can then append this Series to the DataFrame (arguably discouraged, though, since it stores the information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000

Based on @Vaishali's comment, a one-liner:
df['country_population'] = df.groupby('country')['city_population'].transform('sum')
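For reference, transform('sum') returns a Series aligned with the original index (one group total per row), so it can be assigned directly as a new column with no merge needed. A minimal sketch of what it returns, using the data above:
print(df.groupby('country')['city_population'].transform('sum'))
output:
0    11000000
1     9410000
2     2960000
3    11000000
4     9410000
5     2960000
Name: city_population, dtype: int64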

Related

Filter of one dataframe using multiple columns of other dataframe in python

I have one dataframe (df1) which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2) which holds my filter conditions. The catch is that if a filter-condition column is blank, that column's condition should be skipped and the other columns' conditions applied.
Example below:
DF1:
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
CZCH  SEATLLE   DC    CZCH     Europe
NY    MARYLAND  DC    US       S America
NY    WASHIN    NY    US       America
NY    SEAGA     NJ    UK       Europe
DF2: (sample filter condition table - this table can have multiple conditions)
City  District  Town  Country  Continent
NY              DC
                NJ
Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.
OUTPUT DF: should look like this
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
NY    MARYLAND  DC    US       S America
NY    SEAGA     NJ    UK       Europe
So basically I need a filter-condition table that extracts information from the raw data for the fields I fill in in the filter table. I cannot change or delete columns in DF2; I can only leave a column blank if I don't require that filter condition.
Thanks in advance,
Nitz
If DF2 always has one row:
df = df1.merge(df2.dropna(axis=1))
print (df)
  City  District Town Country  Continent
0   NY    WASHIN   DC      US    America
1   NY  MARYLAND   DC      US  S America
If there are multiple rows with missing values:
Sample data:
import numpy as np
import pandas as pd

nan = np.nan
df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'], 'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'], 'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'], 'Country': ['US', 'CZCH', 'US', 'US', 'UK'], 'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', nan], 'District': [nan, nan], 'Town': ['DC', 'NJ'], 'Country': [nan, nan], 'Continent': [nan, nan]})
First remove the missing values by reshaping with DataFrame.stack:
print (df2.stack())
0 City NY
Town DC
1 Town NJ
dtype: object
Then, for each group, select the matching df1 columns and compare them against the values from df2:
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]
print (m)
[0 True
1 False
2 True
3 False
4 False
dtype: bool, 0 False
1 False
2 False
3 False
4 True
dtype: bool]
Use np.logical_or.reduce and filter with boolean indexing:
print (np.logical_or.reduce(m))
[ True False True False True]
df = df1[np.logical_or.reduce(m)]
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):
df1.loc[np.sum(np.sum(df1.values == df2.values[:, None], axis=2) ==
               np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]
Output:
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
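A more readable, if less vectorised, variant of the same idea is to build one boolean mask per df2 row from only its non-missing columns and then OR the masks together. A sketch using the sample df1/df2 defined above:
# one mask per filter row, using only the columns that are filled in
masks = []
for _, cond in df2.iterrows():
    cond = cond.dropna()                                 # keep only the filled-in filter fields
    masks.append(df1[cond.index].eq(cond).all(axis=1))   # a row must match all of them
print(df1[np.logical_or.reduce(masks)])
This returns the same three rows as the solutions above.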

How to choose random item from a dictionary to df and exclude one item?

I have a dictionary and a dataframe, for example:
import numpy as np
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df = pd.DataFrame(data)
print(df)
city = {"New York": "123",
        "LA": "456",
        "Miami": "789"}
Output:
Name
0 Tom
1 Joseph
2 Krish
3 John
I've created a column called CITY by using the following:
df["CITY"]=np.random.choice(list(city), len(df))
df
Name CITY
0 Tom New York
1 Joseph LA
2 Krish Miami
3 John New York
Now, I would like to generate a new column, CITY2, with a random item from the city dictionary, but CITY should be a different item than CITY2, so when generating CITY2 I need to exclude the row's CITY item.
It's worth mentioning that my real df is quite large, so I need this to be as efficient as possible.
Thanks in advance.
Continue with the approach you have used: pd.Series() is used as a convenience to remove the value that has already been used, wrapped in apply() so it runs against the value of each row.
import numpy as np
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John']}
df = pd.DataFrame(data)
city = {"New York": "123",
        "LA": "456",
        "Miami": "789"}
df["CITY"] = np.random.choice(list(city), len(df))
df["CITY2"] = df["CITY"].apply(lambda x: np.random.choice(pd.Series(city).drop(x).index))
     Name      CITY     CITY2
0     Tom     Miami  New York
1  Joseph        LA     Miami
2   Krish  New York     Miami
3    John  New York        LA
You could also first group by "CITY", remove the current city per group from the city dict and then create the new random list of cities.
Maybe this is faster because you don't have to drop one city per row, but per group.
city2 = pd.Series(dtype=object)
for key, group in df.groupby('CITY'):
    # all cities except the current group's city
    cities_subset = np.delete(np.array(list(city)), list(city).index(key))
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    city2 = pd.concat([city2, pd.Series(np.random.choice(cities_subset, len(group)), index=group.index)])
df["CITY2"] = city2
This gives for example:
Name CITY CITY2
0 Tom New York LA
1 Joseph New York Miami
2 Krish LA New York
3 John New York LA
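For a large DataFrame, a fully vectorised sketch (assuming the CITY column holds keys of the same city dict) is to map each row's CITY to its position in the list of cities and shift it by a random non-zero offset modulo the number of cities, which guarantees CITY2 differs from CITY and is uniform over the remaining cities:
cities = np.array(list(city))
n = len(cities)
# position of each row's CITY in the cities array
pos = df["CITY"].map({c: i for i, c in enumerate(cities)}).to_numpy()
# a random offset in 1..n-1, taken modulo n, can never land back on the same city
offset = np.random.randint(1, n, size=len(df))
df["CITY2"] = cities[(pos + offset) % n]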

Trying to use a list to populate a dataframe column

I have a dataframe (df) and I would like to create a new column called country, which is calculated by looking at the region column: where the region value is present in the EnglandRegions list, the country value is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
    name  salary         region B1salary  country
0  Jason   42000         London    42000  England
1  Molly   52000     South West           England
2   Tina   36000   East Midland           England
3   Jake   24000          Wales             Wales
4    Amy   73000  West Midlands           England
You can see that all the values in country are set to England except for the value assigned to Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns=['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary']>=40000) & (df['salary']<=50000) , df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is referencing is that you are missing a ] to enclose your .loc. However, fixing this won't work anyways. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially what you already had in the line above it for B1salary anyways.
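Applied to the data from the question, a quick check of the corrected line (a sketch; the salary and B1salary columns are omitted for brevity):
import numpy as np
import pandas as pd

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']})
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
print(df)
output:
    name         region  country
0  Jason         London  England
1  Molly     South West  England
2   Tina   East Midland  England
3   Jake          Wales    Wales
4    Amy  West Midlands  England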

How to combine two pandas dataframes on two different columns having elements not in order? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have two datasets that look like this:
name Longitude Latitude continent
0 Aruba -69.982677 12.520880 North America
1 Afghanistan 66.004734 33.835231 Asia
2 Angola 17.537368 -12.293361 Africa
3 Anguilla -63.064989 18.223959 North America
4 Albania 20.049834 41.142450 Europe
And another dataset looks like this:
COUNTRY GDP (BILLIONS) CODE
0 Afghanistan 21.71 AFG
1 Albania 13.40 ALB
2 Algeria 227.80 DZA
3 American Samoa 0.75 ASM
4 Andorra 4.80 AND
Here, columns name and COUNTRY contains the country names but not in the same order.
How can I combine the second dataframe into the first one and add the CODE column to the first dataframe?
Required output:
name Longitude Latitude continent CODE
0 Aruba -69.982677 12.520880 North America NaN
1 Afghanistan 66.004734 33.835231 Asia AFG
2 Angola 17.537368 -12.293361 Africa NaN
3 Anguilla -63.064989 18.223959 North America NaN
4 Albania 20.049834 41.142450 Europe ALB
Attempt:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
                   'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
                   'Latitude': [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
                   'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']})
print(df)
df2 = pd.DataFrame({'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
                    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
                    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)
pd.merge(left=df, right=df2,left_on='name',right_on='COUNTRY')
# but this fails
By default, pd.merge uses how='inner', which uses the intersection of keys across your two dataframes. Here, you need how='left' to use keys only from the left dataframe:
res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
The merge performs an 'inner' merge or join by default, only keeping records that have a match on both the left and the right. You want an 'outer' join, keeping all records (there is also 'left' or 'right').
Example:
import pandas as pd
df1 = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)
# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')
# combining the columns used to match
df3['name'] = df3.apply(lambda row: row['name'] if not pd.isnull(row['name']) else row['COUNTRY'], axis=1)
# dropping the now spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
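The name/COUNTRY combination step can also be written without apply, using fillna to take the COUNTRY value wherever name is missing (a small alternative sketch):
df3['name'] = df3['name'].fillna(df3['COUNTRY'])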
Pandas has a pd.merge function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) which uses an inner join by default. An inner join keeps only the key values that are present in both dataframes, whether the keys are specified with on, or with left_on and right_on when the key columns have different names in the two dataframes.
Since you require the CODE value to be added, the following line of code could be used:
pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')
This gives the following output:
name Longitude Latitude continent COUNTRY CODE
0 Aruba -69.982677 12.520880 North America NaN NaN
1 Afghanistan 66.004734 33.835231 Asia Afghanistan AFG
2 Angola 17.537368 -12.293361 Africa NaN NaN
3 Anguilla -63.064989 18.223959 North America NaN NaN
4 Albania 20.049834 41.142450 Europe Albania ALB
The following also gives the same result:
new_df = pd.merge(left=df2[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')
COUNTRY CODE name Longitude Latitude continent
0 Afghanistan AFG Afghanistan 66.004734 33.835231 Asia
1 Albania ALB Albania 20.049834 41.142450 Europe
2 NaN NaN Aruba -69.982677 12.520880 North America
3 NaN NaN Angola 17.537368 -12.293361 Africa
4 NaN NaN Anguilla -63.064989 18.223959 North America
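To reproduce the required output from the question exactly, the now-redundant COUNTRY key column can be dropped after either merge; a small follow-up sketch:
res = (pd.merge(df, df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')
         .drop(columns='COUNTRY'))
This leaves only name, Longitude, Latitude, continent and CODE, as in the required output.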

Fastest way to get occurrence of each element

I have a large DataFrame looking like this
name Country ...
1 Paul Germany
2 Paul Germany
3 George Italy
3 George Italy
3 George Italy
...
N John USA
I'm looking for the occurrence count of each element of the name column, such as
name Country Count
1 Paul Germany 2000
2 George Italy 500
...
N John USA 40000
Any idea what the most efficient way to do this is? Because the following is quite slow:
df['count'] = df.groupby(['name'])['name'].transform(pd.Series.value_counts)
You can do it like this:
df.groupby(['name', 'Country']).size()
example:
import pandas as pd
df = pd.DataFrame.from_dict({'name': ['paul', 'paul', 'George', 'George', 'George'],
                             'Country': ['Germany', 'Italy', 'Germany', 'Italy', 'Italy']})
df
output:
Country name
0 Germany paul
1 Italy paul
2 Germany George
3 Italy George
4 Italy George
Group by and get count:
df.groupby(['name', 'Country']).size()
output:
name Country
George Germany 1
Italy 2
paul Germany 1
Italy 1
If you just want the counts with respect to the name column, you don't need to use groupby; you can just select the name column from the DataFrame (which returns a Series object) and call value_counts() on it directly:
df['name'].value_counts()
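If the goal is the Count column from the question rather than a separate Series, a faster equivalent of the transform line in the question is to broadcast the group size back onto each row; a sketch ('size' counts rows per group, which is what value_counts was effectively doing here):
# per-row count, assigned as a new column
df['Count'] = df.groupby('name')['name'].transform('size')
# or, in recent pandas versions, one row per name/Country pair as in the desired output
out = df.groupby(['name', 'Country'], as_index=False).size().rename(columns={'size': 'Count'})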
