I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and the latitude/longitude values were stored in a separate file, now detached from the CSV with the rest of our data. I am attempting to merge the latitude/longitude file with our dataset. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is that the coordinates data frame has one row per glacier id, with its latitude and longitude, whereas my glims data frame has multiple rows for each glacier id, with different data in each row.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
import numpy as np
import pandas as pd

glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')

df['que'] = np.where((coordinates['glacier_id'] ==
                      glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle a merge this big. I am also afraid of making mistakes, because they are nearly impossible to catch: the data carries no other identifying fields.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
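Optionally, you can sanity-check the result: with a left join, any GLAC_ID that has no match in coordinates ends up with NaN coordinates. A small check (a sketch, run after the merge above):

unmatched = glims[glims['lat'].isna()]
print(f"{len(unmatched)} of {len(glims)} rows did not get coordinates")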
This is a classic merging problem. One way to solve it is with plain .loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coordinates.set_index('glacier_id').lat
glims.loc[:, 'long'] = coordinates.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
         coordinates.rename(columns={'glacier_id': 'GLAC_ID'}),
         on='GLAC_ID')
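If you want pandas to confirm the relationship while merging, pd.merge also accepts a validate argument. A hedged variant that keeps every glims row with a left join and raises if glacier_id is not unique in coordinates:

merged = pd.merge(
    glims,
    coordinates.rename(columns={'glacier_id': 'GLAC_ID'}),
    on='GLAC_ID',
    how='left',
    validate='m:1',  # many glims rows -> one coordinate row
)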
Related
I currently have a geopandas dataframe that looks like this
| id | name  | ... | geometry                                    |
|----|-------|-----|---------------------------------------------|
| 1  | poly1 | ... | 0101000020E6100000A6D52A40F1E16690764A7D... |
| 2  | poly2 | ... | 0101000020E610000065H7D2A459A295J0A67AD2... |
And when getting ready to write it to PostGIS, I get the following warning:
/python3.7/site-packages/geopandas/geodataframe.py:1321: UserWarning: Geometry column does not contain geometry.
warnings.warn("Geometry column does not contain geometry.")
Is there a way to convert this geometry column to a proper geometry type, so that errors can be avoided when appending to the existing table with a geometry-typed column? I've tried:
df['geometry'] = gpd.GeoSeries.to_wkt(df['geometry'])
But there are errors parsing the existing geometry column. Is there a correct way I am missing?
The syntax needs to be changed as below
import re  # needed for the re.sub call below

df['geometry'] = df.geometry.apply(lambda x: x.wkt).apply(lambda x: re.sub('"(.*)"', '\\1', x))
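If the column actually holds raw WKB hex strings (which values like 0101000020E6100000... suggest) rather than shapely objects, a hedged alternative is to parse them with shapely first and rebuild the GeoDataFrame. The table name, the engine, and the CRS below are placeholders/assumptions:

import geopandas as gpd
from shapely import wkb

# parse the WKB hex strings into shapely geometry objects
df['geometry'] = df['geometry'].apply(lambda s: wkb.loads(s, hex=True))

# rebuild as a GeoDataFrame; the CRS is an assumption, adjust to your data
gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')

# `engine` is an existing SQLAlchemy engine, `my_table` a placeholder name
gdf.to_postgis('my_table', engine, if_exists='append')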
For each set of coordinates in a pyspark dataframe, I need to find the closest set of coordinates in another dataframe
I have one pyspark dataframe with coordinate data like so (dataframe a):
+------------------+-------------------+
| latitude_deg| longitude_deg|
+------------------+-------------------+
| 40.07080078125| -74.93360137939453|
| 38.704022| -101.473911|
| 59.94919968| -151.695999146|
| 34.86479949951172| -86.77030181884766|
| 35.6087| -91.254898|
| 34.9428028| -97.8180194|
And another like so (dataframe b); only a few rows are shown:
+-----+------------------+-------------------+
|ident| latitude_deg| longitude_deg|
+-----+------------------+-------------------+
| 00A| 30.07080078125| -24.93360137939453|
| 00AA| 56.704022| -120.473911|
| 00AK| 18.94919968| -109.695999146|
| 00AL| 76.86479949951172| -67.77030181884766|
| 00AR| 10.6087| -87.254898|
| 00AS| 23.9428028| -10.8180194|
Is it possible to somehow merge the dataframes so that each row in dataframe a gets the closest ident from dataframe b:
+------------------+-------------------+-------------+
| latitude_deg| longitude_deg|closest_ident|
+------------------+-------------------+-------------+
| 40.07080078125| -74.93360137939453| 12A|
| 38.704022| -101.473911| 14BC|
| 59.94919968| -151.695999146| 278A|
| 34.86479949951172| -86.77030181884766| 56GH|
| 35.6087| -91.254898| 09HJ|
| 34.9428028| -97.8180194| 09BV|
What I have tried so far:
I have defined a pyspark UDF to calculate the haversine distance between two pairs of coordinates.
udf_get_distance = F.udf(get_distance)
It works like this:
df = (df.withColumn("ABS_DISTANCE", udf_get_distance(
    df.latitude_deg_a, df.longitude_deg_a,
    df.latitude_deg_b, df.longitude_deg_b)
))
I'd appreciate any kind of help. Thanks so much
You need to do a crossJoin first; rename the coordinate columns to *_a / *_b suffixes beforehand so they don't clash. Something like this:
joined_df = source_df1.crossJoin(source_df2)
Then you can call your udf as you mentioned, generate a row number ordered by distance, and keep the closest one:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# one window partition per row of dataframe a, ordered by distance
rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")

udf_result_df = (joined_df
    .withColumn("ABS_DISTANCE", udf_get_distance(
        joined_df.latitude_deg_a, joined_df.longitude_deg_a,
        joined_df.latitude_deg_b, joined_df.longitude_deg_b))
    .withColumn("rownum", row_number().over(rwindow))
    .filter("rownum = 1"))
Note: add an explicit return type to your udf.
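For completeness, here is one possible shape for get_distance with an explicit return type. The haversine body is an assumption, since the original UDF isn't shown:

import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_distance(lat_a, lon_a, lat_b, lon_b):
    # haversine distance in kilometres (assumed implementation)
    r = 6371.0
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = math.radians(lat_b - lat_a)
    d_lambda = math.radians(lon_b - lon_a)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

udf_get_distance = F.udf(get_distance, DoubleType())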
I am fairly new to Python. I am working with the nycflights13 data.
My goal is to determine the most common and least common origin for every destination.
+---------+--------------------+-----------+---------------------+------------+
| destiny | most_common_origin | max_count | least_common_origin | min_count |
+---------+--------------------+-----------+---------------------+------------+
| MIA | LGA | 5781 | EWR | 2633 |
+---------+--------------------+-----------+---------------------+------------+
I managed to do it for one destination, but I'm having trouble with the loop and with binding the rows, because sometimes I get a Series instead of a DataFrame.
My working example
from nycflights13 import flights
flights_to_MIA = flights[flights['dest']=='MIA']
flights_to_MIA.groupby(flights_to_MIA['origin']).size().reset_index(name='counts')
Note that I used size() and then reset_index() to make it a DataFrame.
You can use agg + value_counts:
s = flights.groupby('dest').agg(
    most_common_origin=('origin', lambda x: x.mode()[0]),
    max_count=('origin', lambda x: x.value_counts().sort_values().iloc[-1]),
    least_common_origin=('origin', lambda x: x.value_counts().sort_values().index[0]),
    min_count=('origin', lambda x: x.value_counts().sort_values().iloc[0]),
)
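A small follow-up in case the final shape matters: the result above is indexed by dest, so flattening it gives one row per destination, like the example table.

# s is indexed by dest; reset_index() flattens it into a frame with
# one row per destination
result = s.reset_index()
print(result[result['dest'] == 'MIA'])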
I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting looks like this:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
I'm using this code (run createCSV() to execute it; it pulls data from the COVID-19 GitHub repository):
import pandas as pd   # CSV parsing
import requests       # retrieves the raw CSV from GitHub


def getFile():
    url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
           'csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv')
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)


# takes raw data from the link, creates a CSV for each unique state
# and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')

    # sets Province_State as the key. Searches based on date and key to
    # create new CSVs in the root directory of the python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')

    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the renaming and the loop is to turn the date columns (everything after the first two) into dates, so that I can select only 03/23/2020 onwards. If anyone has a better method of doing this, I would love to know.
To check that it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, as you can see, Admin2 has disappeared from my output CSV. I am not sure how to make this work; if anyone has any ideas, that'd be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-level key so that the Admin2 column shows up. Any smoother tips on the date-range section are welcome; a sketch of one option follows.
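For what it's worth, one possible simplification of the date-range step, a sketch that assumes data still has the raw date strings as column labels:

# parse every column label once; non-date labels become NaT and drop out
dates = pd.to_datetime(data.columns, errors='coerce')
keep = data.columns[dates >= '2020-03-23']

for name, g in data.groupby(level='Province_State'):
    g[keep].to_csv('{0}_confirmed_deaths.csv'.format(name))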
Thanks for the help all!
I'm trying to concatenate two data frames and write the resulting data frame to an Excel file. The concatenation works, more or less, but I'm having a hard time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought passing index=False to every Excel call would eliminate the issue, but it has not.
Hopefully you can see the image, if not please let me know.
import pandas as pd

# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"

# create data frames
df = pd.read_excel(file_name, index=False)
df2 = pd.read_excel(file_name2, index=False)

# filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]

# concatenate values
df4 = df3['WDDT'].map(str) + '-' + df3['Part Name'].map(str) + '-' + 'SN:' + df3['Remove SN'].map(str)

test = pd.DataFrame(df4)
test = test.transpose()

df = pd.concat([df, test], axis=1)

df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
As the other users also wrote, I don't see the index in your image either. If the index were being written, the output would look like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
if you pass the index=False option the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe before exporting it?
You might also want to check whether pandas imports the time column as a time index.
If you want to delete those extra columns you could use df.drop and pass a list of columns to it, e.g. df.drop(columns=df.columns[:3]), as sketched below. Does this maybe solve your problem?
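As a sketch of that last suggestion (the column slice is an assumption; adjust it to whatever stray index-like columns actually show up in your frame):

# drop the first few unwanted columns before writing to Excel
df_clean = df.drop(columns=df.columns[:3])
df_clean.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)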