Geopandas .merge() produces strange results - python

I have been trying to merge two geopandas dataframes on a column and am getting some really weird results. To test this, I made two simple dataframes and merged them:
import pandas as pd
import geopandas as gpd

df = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))

df2 = pd.DataFrame(
    {'Capital': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Abbreviation': ['ARG', 'BRA', 'CHI', 'COL', 'VZL']})

combined_df = gdf.merge(df2, left_on='City', right_on='Capital')
print(combined_df)
When I print the results, I get what I expected:
City Country ... Capital Abbreviation
0 Buenos Aires Argentina ... Buenos Aires ARG
1 Brasilia Brazil ... Brasilia BRA
2 Santiago Chile ... Santiago CHI
3 Bogota Colombia ... Bogota COL
4 Caracas Venezuela ... Caracas VZL
The two datasets are merged on their common key: the 'City' column in one and the 'Capital' column in the other.
I have some other data I am working with; here is a link to it.
Both files are geopackages I've read in as geodataframes. Dataframe 1 has 16166 rows and dataframe 2 has 15511 rows. They have a common ID column, 'ALTPARNO' and 'altparno'. Here is the code I've tried to use to read them in and merge them:
import geopandas as gpd
dataframe1 = gpd.read_file(filepath, layer='allkeepers_2019')
dataframe2 = gpd.read_file(filepath, layer='keepers_2019')
results = dataframe1.merge(dataframe2, left_on='altparno', right_on='ALTPARNO')
When I look at my results, I have a dataframe with over 4 million rows (should be around 15,000).
What is going on?
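The usual cause of a blow-up like this is duplicated key values on one or both sides: merge is many-to-many by default, so every left row with a given altparno is paired with every right row carrying the same ALTPARNO, and the row counts multiply. A minimal check, as a sketch assuming the layers and column names from the question:
import geopandas as gpd

dataframe1 = gpd.read_file(filepath, layer='allkeepers_2019')
dataframe2 = gpd.read_file(filepath, layer='keepers_2019')

# how many rows on each side share their key with another row?
print(dataframe1['altparno'].duplicated(keep=False).sum())
print(dataframe2['ALTPARNO'].duplicated(keep=False).sum())

# validate raises a MergeError instead of silently multiplying rows
results = dataframe1.merge(dataframe2, left_on='altparno',
                           right_on='ALTPARNO', validate='one_to_one')
If the keys really should be unique, dropping or aggregating the duplicates before merging brings the result back to roughly 15,000 rows.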

Related

Reading nested json to pandas dataframe

I have the below URL that returns a JSON response. I need to read this JSON into a pandas dataframe and perform operations on it. This is a case of nested JSON consisting of multiple lists and dicts within dicts.
URL: 'http://api.nobelprize.org/v1/laureate.json'
I have tried the code below:
import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
print(df.head(5))
Output:
id firstname surname born died \
0 1 Wilhelm Conrad Röntgen 1845-03-27 1923-02-10
1 2 Hendrik A. Lorentz 1853-07-18 1928-02-04
2 3 Pieter Zeeman 1865-05-25 1943-10-09
3 4 Henri Becquerel 1852-12-15 1908-08-25
4 5 Pierre Curie 1859-05-15 1906-04-19
bornCountry bornCountryCode bornCity \
0 Prussia (now Germany) DE Lennep (now Remscheid)
1 the Netherlands NL Arnhem
2 the Netherlands NL Zonnemaire
3 France FR Paris
4 France FR Paris
diedCountry diedCountryCode diedCity gender \
0 Germany DE Munich male
1 the Netherlands NL NaN male
2 the Netherlands NL Amsterdam male
3 France FR NaN male
4 France FR Paris male
prizes
0 [{'year': '1901', 'category': 'physics', 'shar...
1 [{'year': '1902', 'category': 'physics', 'shar...
2 [{'year': '1902', 'category': 'physics', 'shar...
3 [{'year': '1903', 'category': 'physics', 'shar...
4 [{'year': '1903', 'category': 'physics', 'shar...
But here 'prizes' comes back as a list. If I create a separate dataframe for prizes, it has 'affiliations' as a list. I want everything to come out as separate columns. Some entries may or may not have prizes, so that case needs to be handled as well.
I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we'll have to use meta and errors='ignore' here, but I'm not able to get it working. I'd appreciate your input. Thanks.
You would have to do this in a few steps.
The first step is to extract the top-level records with record_path = ['laureates'].
The second is record_path = ['laureates', 'prizes'] for the nested JSON records, with the parent record's id passed via meta.
Then combine the two datasets by joining on the id column.
Finally, drop the unnecessary columns and store the result.
import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df0 = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
df1 = pd.json_normalize(json.loads(resp.content), record_path=['laureates', 'prizes'],
                        meta=[['laureates', 'id']])
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(['prizes', 'laureates.id'], axis=1)
print('Shape of data ->', output.shape)
print('Columns ->', output.columns)
Shape of data -> (975, 18)
Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
'affiliations', 'overallMotivation'],
dtype='object')
I found an alternate solution as well, with less code. This works:
import json, pandas as pd, requests
from flatten_json import flatten

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
winners = json.loads(resp.content)  # parse the response so that 'winners' is defined
data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)
(968, 43)

Structuring and pivoting corrupted dataframe in pandas

I have a dataframe which I read from an Excel file. The first 4 columns and their values look fine, but from the 5th column onward the data seems corrupted.
That is, the 'dateID' values like '2021-09-06' became columns, and the 'sourceOfData' column became values.
It looks like this:
countryName    provinceName  productID  productName  dateID            2021-09-06    2021-09-07    2021-09-08
                                                     sourceOfData      productPrice  productPrice  productPrice
United States  New York      35         Sugar        CommissionAgent1  2.6$          5.5$          3.4$
Canada         Ontario       55         Corn         CommissionAgent1  2.6$          5.5$          3.4$
But I want my data to look like this:
countryName    provinceName  productID  productName  sourceOfData      dateID      productPrice
United States  New York      35         Sugar        CommissionAgent1  2021-09-06  2.6$
United States  New York      35         Sugar        CommissionAgent1  2021-09-07  5.5$
United States  New York      35         Sugar        CommissionAgent1  2021-09-08  3.4$
Canada         Ontario       55         Corn         CommissionAgent1  2021-09-06  2.6$
Canada         Ontario       55         Corn         CommissionAgent1  2021-09-07  5.5$
Canada         Ontario       55         Corn         CommissionAgent1  2021-09-08  3.4$
The only thing that came to mind is pivot or melt. I started with something like this:
df2 = df.melt(var_name='dateID', value_name='productPrice')
df3 = df2.iloc[1:]
in order to organize dates and prices, but I'm stuck.
Hope I explained my needs. Thanks in advance.
For those who want to reproduce my question and obtain the dataframes, here is the code for what I have and what I need:
import pandas as pd

whatIHave = {'countryName': ['', 'United States', 'Canada'],
             'provinceName': ['', 'New York', 'Ontario'],
             'productID': ['', '35', '55'],
             'productName': ['', 'Sugar', 'Corn'],
             'dateID': ['sourceOfData', 'CommissionAgent1', 'CommissionAgent1'],
             '2021-09-06': ['productPrice', '2.6$', '2.6$'],
             '2021-09-07': ['productPrice', '5.5$', '5.5$'],
             '2021-09-08': ['productPrice', '3.4$', '3.4$']}
df_whatIHave = pd.DataFrame(whatIHave, columns=['countryName', 'provinceName', 'productID', 'productName',
                                                'dateID', '2021-09-06', '2021-09-07', '2021-09-08'])
print(df_whatIHave)

whatINeed = {'countryName': ['United States', 'United States', 'United States', 'Canada', 'Canada', 'Canada'],
             'provinceName': ['New York', 'New York', 'New York', 'Ontario', 'Ontario', 'Ontario'],
             'productID': ['35', '35', '35', '55', '55', '55'],
             'productName': ['Sugar', 'Sugar', 'Sugar', 'Corn', 'Corn', 'Corn'],
             'sourceOfData': ['CommissionAgent1'] * 6,
             'dateID': ['2021-09-06', '2021-09-07', '2021-09-08', '2021-09-06', '2021-09-07', '2021-09-08'],
             'productPrice': ['2.6$', '5.5$', '3.4$', '2.6$', '5.5$', '3.4$']}
df_whatINeed = pd.DataFrame(whatINeed, columns=['countryName', 'provinceName', 'productID', 'productName',
                                                'sourceOfData', 'dateID', 'productPrice'])
print(df_whatINeed)
I managed to solve this problem after hours of searching, by splitting the dataframe into pieces and merging them back together. Taking our dataframe as df_whatIHave:
df2 = df_whatIHave.iloc[1:, 0:6]
df2 = df2.reset_index()
columnSize = df_whatIHave.shape[1]
df3 = df_whatIHave.iloc[:, 6:columnSize]
df4 = df3.iloc[1:]
I split my dataframe into two parts and used the stack() function, which is crucial for replicating the rows by date:
df4 = df4.stack()
df4 = df4.to_frame().reset_index()
Then I merged these two dataframes like this:
df_merged = pd.merge(df2, df4, on='index', how='inner')
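For reference, the same reshaping can be done more directly with melt. This is a minimal sketch assuming the df_whatIHave reproduction above, where row 0 only holds the 'sourceOfData'/'productPrice' labels and the 'dateID' column actually contains the data source:
import pandas as pd

# drop the label row and rename 'dateID', which really holds the source of the data
df = df_whatIHave.iloc[1:].rename(columns={'dateID': 'sourceOfData'})

# melt the date columns into 'dateID'/'productPrice' pairs, one row per date
tidy = df.melt(
    id_vars=['countryName', 'provinceName', 'productID', 'productName', 'sourceOfData'],
    var_name='dateID',
    value_name='productPrice')
print(tidy)
The row order may differ from df_whatINeed, but the columns and values come out the same.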

Sum columns by key values in another column

I have a pandas DataFrame like this:
city country city_population
0 New York USA 8300000
1 London UK 8900000
2 Paris France 2100000
3 Chicago USA 2700000
4 Manchester UK 510000
5 Marseille France 860000
I want to create a new column country_population that holds, for each row, the sum of city_population over all cities in that row's country. I have tried:
df['Country population'] = df['city_population'].sum().where(df['country'])
But this doesn't work. Could I have some advice on the problem?
Sounds like you're looking for groupby:
import pandas as pd

data = {
    'city': ['New York', 'London', 'Paris', 'Chicago', 'Manchester', 'Marseille'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'France'],
    'city_population': [8_300_000, 8_900_000, 2_100_000, 2_700_000, 510_000, 860_000],
}
df = pd.DataFrame.from_dict(data)
# group by country, access 'city_population' column, sum
pop = df.groupby('country')['city_population'].sum()
print(pop)
output:
country
France 2960000
UK 9410000
USA 11000000
Name: city_population, dtype: int64
To append this Series to the DataFrame (arguably discouraged, though, since it stores the information redundantly and doesn't really fit the structure of the original DataFrame):
# add to existing df
pop.rename('country_population', inplace=True)
# how='left' to preserve original ordering of df
df = df.merge(pop, how='left', on='country')
print(df)
output:
city country city_population country_population
0 New York USA 8300000 11000000
1 London UK 8900000 9410000
2 Paris France 2100000 2960000
3 Chicago USA 2700000 11000000
4 Manchester UK 510000 9410000
5 Marseille France 860000 2960000
Based on @Vaishali's comment, a one-liner:
df['Country population'] = df.groupby('country')['city_population'].transform('sum')

Trying to use a list to populate a dataframe column

I have a dataframe (df) and I would like to create a new column called country, calculated by looking at the region column: where the region value is present in the EnglandRegions list, country is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
    name  salary         region B1salary  country
0  Jason   42000         London    42000  England
1  Molly   52000     South West           England
2   Tina   36000   East Midland           England
3   Jake   24000          Wales             Wales
4    Amy   73000  West Midlands           England
You can see that all the values in country are set to England except for Jake's record, which is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns=['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary'] >= 40000) & (df['salary'] <= 50000), df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is pointing at is a missing ] to close your .loc. However, even with that fixed, this approach won't do what you want. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially what you already had in the line above for B1salary anyway.
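An equivalent without the numpy call is Series.where, which keeps a value where the condition is True and substitutes the second argument elsewhere. A sketch using the same column names as above:
# keep the region where it is not an English region, otherwise use 'England'
df['country'] = df['region'].where(~df['region'].isin(EnglandRegions), 'England')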

Pandas pivot_table group column by values

I am trying to use numeric values as columns on a Pandas pivot_table. The problem is that since each number is mostly unique, the resulting pivot_table isn't very useful as a way to aggregate my data.
Here is what I have so far (fake data example):
import pandas as pd
df = pd.DataFrame({'Country': ['US', 'Brazil', 'France', 'Germany'],
'Continent': ['Americas', 'Americas', 'Europe', 'Europe'],
'Population': [321, 207, 80, 66]})
pd.pivot_table(df, index='Continent', columns='Population', aggfunc='count')
[image of the resulting pivot table: every unique Population value becomes its own column]
How could I group my values into ranges based on my columns?
In other words, how can I count all countries with Population... <100, 100-200, >300?
Use pd.cut (numpy is needed for np.inf):
import numpy as np

df = df.assign(PopGroup=pd.cut(df.Population, bins=[0, 100, 200, 300, np.inf],
                               labels=['<100', '100-200', '200-300', '>300']))
Output:
Continent Country Population PopGroup
0 Americas US 321 >300
1 Americas Brazil 207 200-300
2 Europe France 80 <100
3 Europe Germany 66 <100
pd.pivot_table(df, index='Continent', columns='PopGroup', values=['Country'], aggfunc='count')
Output:
Country
PopGroup 200-300 <100 >300
Continent
Americas 1.0 NaN 1.0
Europe NaN 2.0 NaN
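A variation that skips the intermediate column: pivot_table's columns argument also accepts an array/Series rather than a column name, so the binned values can be passed in directly. This is a sketch under the same data and bin edges as above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['US', 'Brazil', 'France', 'Germany'],
                   'Continent': ['Americas', 'Americas', 'Europe', 'Europe'],
                   'Population': [321, 207, 80, 66]})

# bin Population on the fly and count countries per Continent and bin
groups = pd.cut(df['Population'], bins=[0, 100, 200, 300, np.inf],
                labels=['<100', '100-200', '200-300', '>300'])
print(pd.pivot_table(df, index='Continent', columns=groups,
                     values='Country', aggfunc='count'))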
