Speeding up a nested for loop through two Pandas DataFrames - python

I have latitudes and longitudes stored in a pandas dataframe (df), with placeholder NaNs for stop_id, stoplat and stoplon, and another dataframe, areadf, which contains more lats/lons and an arbitrary id; this is the information that is to be populated into df.
I'm trying to connect the two so that the stop columns in df contain information about the stop closest to that lat/lon point, or stay NaN if there is no stop within a radius R of the point.
Right now my code is as follows, but it takes a reaaaaallly long time (>40 minutes for what I'm running at the moment, measured before changing area to a dataframe and using itertuples; I'm not sure what magnitude of difference that will make). There are thousands of lat/lon points and stops in each data set, and I need to run this on multiple files, so I'm looking for suggestions to make it run faster. I've already made some minor improvements (e.g. moving to a dataframe, using itertuples instead of iterrows, defining lats and lons outside the loop to avoid retrieving them from df on every iteration), but I'm out of ideas for speeding it up. getDistance, defined below, uses the Haversine formula to get the distance between a stop and the given lat/lon point.
import pandas as pd
from math import cos, asin, sqrt

def getDistance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # Pi/180
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) *
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a)) * 100

R = 5
lats = df['lat']
lons = df['lon']

for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index], lons[index],
                       stop[1], stop[2]) < R:
            df.at[index, 'stop_id'] = stop[0]  # id
            df.at[index, 'stoplat'] = stop[1]  # lat
            df.at[index, 'stoplon'] = stop[2]  # lon
Sample data:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 NaN NaN NaN
43.694324 -79.334555 NaN NaN NaN
areadf
stop_id stoplat stoplon
0 43.657675 -79.380145
1 45.435143 -90.543253
Desired:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 0 43.657675 -79.380145
43.694324 -79.334555 NaN NaN NaN

One way would be to use the numpy haversine function from here, just slightly modified so that you can account for the radius you want.
Then just iterate through your df with apply and find the closest stop within the given radius:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2, R):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    if km.min() <= R:
        return km.argmin()   # index of the closest stop within R
    else:
        return -1            # no stop within R

df['dex'] = df[['lat','lon']].apply(
    lambda row: haversine_np(row['lon'], row['lat'],
                             areadf.stoplon.values, areadf.stoplat.values, 1),
    axis=1)
Then merge the two dataframes.
df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)
lat lon stop_id stoplat stoplon
0 43.657676 -79.380146 0.0 43.657675 -79.380145
1 43.694324 -79.334555 NaN NaN NaN
NOTE: If you choose to follow this method, you must be sure that both dataframes' indexes are reset, i.e. sequentially ordered from 0 to the total length of each df. So be sure to reset the indexes before you run this:
df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True)
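If the row-by-row apply is still too slow, a spatial index is another option. The sketch below is not part of the answer above; it assumes scikit-learn is available and uses a BallTree with the haversine metric to look up the nearest stop within R km for all rows at once, leaving rows with no nearby stop as NaN.

import numpy as np
from sklearn.neighbors import BallTree

R = 5                      # search radius in km
EARTH_RADIUS_KM = 6371.0

# build the tree on the stop coordinates (the haversine metric expects radians)
tree = BallTree(np.radians(areadf[['stoplat', 'stoplon']].values), metric='haversine')

# nearest stop for every point in df, converted back to km
dist, idx = tree.query(np.radians(df[['lat', 'lon']].values), k=1)
dist_km = dist[:, 0] * EARTH_RADIUS_KM

# fill the stop columns only where the nearest stop is within R km
within = dist_km <= R
df.loc[within, ['stop_id', 'stoplat', 'stoplon']] = (
    areadf.iloc[idx[within, 0]][['stop_id', 'stoplat', 'stoplon']].values)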

Related

Efficiently reduce number of points in gps track by distance between points

I have a dataframe containing gps coordinates (Timestamp, Latitude, Longitude) of vehicle tracks. The interval between points ranges from 1 second to 30 seconds. This depends on some logic in the gps receiver, including speed, but it is not very reliable.
These tracks can be very long and contain many thousands of gps points, especially when a vehicle is moving slowly or is at rest. The data looks like this:
Timestamp        Latitude  Longitude
0 days 00:00:00  51.1513   9.61053
0 days 00:00:28  51.1513   9.61049
0 days 00:00:29  51.1513   9.61048
0 days 00:00:31  51.1513   9.61048
0 days 00:00:33  51.1513   9.61048
I want to reduce the size of the data frames by only including gps points which are at least 50 meters away from the previously kept gps position. The distance between two gps positions is calculated using the haversine formula:
from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees) in meters
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    # Radius of earth is 6371 km = 6371000 m
    m = 6371000 * c
    return m
Currently I use a very naive approach: looping over the dataframe and building a mask of the elements that are at least 50 meters apart. But this is very inefficient, and I am looking for an efficient way to calculate this for large data frames.
import numpy as np

def reduce_gps(df):
    mask = np.full(len(df), False)
    cpos = 0
    lat_col = df.columns.get_loc('Latitude')
    lon_col = df.columns.get_loc('Longitude')
    for pos in range(len(mask)):
        if haversine(df.iloc[cpos, lat_col], df.iloc[cpos, lon_col],
                     df.iloc[pos, lat_col], df.iloc[pos, lon_col]) > 50 or (pos == len(mask)-1):
            cpos = pos
            mask[pos] = True
    return df[mask]
The haversine formula can be vectorized if this is helpful:
def haversine_vec(df):
    data = np.deg2rad(df[['Latitude', 'Longitude']])
    diff = data.shift() - data
    d = np.sin(diff['Latitude']/2)**2 + np.cos(data['Latitude'])*np.cos(data['Latitude'].shift()) * np.sin(diff['Longitude']/2)**2
    return 2 * 6371000 * np.arcsin(np.sqrt(d))
I uploaded a small set of sample data here:
pd.read_csv('https://pastebin.com/raw/qeUDKr9z')
Try using a list comprehension:
df['distance'] = [haversine(df.Latitude[i], df.Longitude[i], df.Latitude[i+1], df.Longitude[i+1])
                  if i != len(df)-1 else 0
                  for i in range(len(df))]
df[df.distance > 50]
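Note that the list comprehension above measures the distance to the immediately following point. If, as in reduce_gps, the distance should be measured from the last kept point, the filter is inherently sequential, but it can still be sped up a lot by looping over plain numpy arrays instead of calling df.iloc on every row. A rough sketch, reusing the haversine function from the question:

import numpy as np

def reduce_gps_fast(df, min_dist=50):
    lats = df['Latitude'].to_numpy()
    lons = df['Longitude'].to_numpy()
    mask = np.full(len(df), False)
    cpos = 0  # index of the last kept point
    for pos in range(len(df)):
        if haversine(lats[cpos], lons[cpos], lats[pos], lons[pos]) > min_dist or pos == len(df) - 1:
            cpos = pos
            mask[pos] = True
    return df[mask]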

Find the nearest neighbor of each class on a large dataset

I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like that:
Observation_id   Latitude    Longitude    Class_id
10131188         45.146973   6.416794     101
10799362         46.783695   -2.072855    700
10392536         48.604866   -2.825003    1456
...              ...         ...          ...
22068176         29.806055   -98.41853    5532
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to the class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence from one class to my observation is more than 50km (whether it's 51km or 9,000km), I can fill 50 for the corresponding value of the observation's vector.
But I see two problems here:
My code will take forever to run.
The created file will be huge.
I started to create a small script that calculates the haversine distance, but for one observation it takes around 8 seconds to run, so it would be impossible for 2 million. Here it is anyway:
import numpy as np
from math import radians, sin, cos, asin, sqrt

lat1 = 45.705116  # lat for observation 10561949
lon1 = 1.424622   # lon for observation 10561949
df = df[df.observation_id != 10561949]  # removing observation 10561949 from the DataFrame
list_obs = np.full(18000, 50)  # Array of size 18 000 filled with the value 50

for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    rlon1, rlat1, rlon2, rlat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((rlat2 - rlat1)/2)**2 + cos(rlat1) * cos(rlat2) * sin((rlon2 - rlon1)/2)**2  # Haversine distance (1/2)
    dist = 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)
    if dist < list_obs[class_id]:  # index by class so each of the 18,000 slots keeps the nearest distance for that class
        list_obs[class_id] = dist
Do you have an idea of how to speed up the algorithm (the distance doesn't have to be perfectly accurate, I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic result afterwards (it will be an array-like of 2,000,000 x 18,000)?
The idea after this is to try to feed this to a Neural Network (let's say a MLP), to see the difference with a simple K-Nearest Neighbor.
Since you only care about distances < 50 km, the biggest saving I can think of is to pre-bin the points on an approximate grid, so that exact distances never have to be computed for points that are far away.
Below is my best attempt to solve this. It has a setup complexity of O(len(df)) but a per-observation search complexity of just O(9 * avg. bin size), which is significantly less than the O(len(df)) of your example.
Note 1) there are large parts of this that can be vectorized to improve performance.
Note 2) there are most certainly better ways to bin distances on a sphere, I am just not that familiar with them, but the idea to first index values such that you can quickly find all data points within distance x is the key.
Note 3) I would be surprised if this code is bug free.
# generate dummy data --------------------------
import pandas as pd
import random

random.seed(10)
rand_float = lambda: (random.random() - .5) * 90 * 2
rand_int = lambda: int(random.random() * 18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))

# bin data points -------------------------------
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points into ~bin_size km grid cells
    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    approximate distance conversion to exclude "far away" distances - only needs to be approximate since we add buffer,
    however I am sure there are much better methods one could use to do this approximation, I spent 10 mins googling
    """
    # ~110 km per degree of latitude, ~110*cos(lat) km per degree of longitude
    return int(110 * lat // bin_size), int(110 * cos(radians(lat)) * lon // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    bins[find_bin(row['lat'], row['lon'])].append(int(i))  # this is slow, it can be vectorized less elegantly but only needs to run once
print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')

# find distances to neighbours ------------------
import numpy as np

def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])  # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2  # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371  # Haversine distance (2/2)

def neighbours(x, y):
    """Yield the bin (x, y) and its 8 surrounding bins."""
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            yield x + dx, y + dy

obs = 1000  # 10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)

list_obs = np.full(18000, 50)  # Array of size 18 000 filled with the value 50
for adj_bin in neighbours(*b):      # O(9)
    for i in bins[adj_bin]:         # O(avg. bin size)
        lat2, lon2, class_ = df.loc[i].values
        dist = compute_distance(lon1, lat1, lon2, lat2)
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
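As Note 1 says, large parts of this can be vectorized. For example, once the candidate indices from the nine neighbouring bins are gathered, the distances and the per-class minima can be computed in one shot with numpy/pandas instead of the inner Python loop; a rough sketch of my own, on top of the same setup:

# collect all candidate rows from the 9 neighbouring bins at once
candidates = [i for adj_bin in neighbours(*b) for i in bins[adj_bin]]
cand = df.loc[candidates]

# vectorized haversine from (lat1, lon1) to every candidate
rlat1, rlon1 = np.radians(lat1), np.radians(lon1)
rlat2, rlon2 = np.radians(cand['lat'].values), np.radians(cand['lon'].values)
a = np.sin((rlat2 - rlat1)/2)**2 + np.cos(rlat1) * np.cos(rlat2) * np.sin((rlon2 - rlon1)/2)**2
dist = 2 * np.arcsin(np.sqrt(a)) * 6371

# per-class minimum distance, capped at 50 km
per_class = pd.Series(dist, index=cand['class'].values).groupby(level=0).min().clip(upper=50)
list_obs = np.full(18000, 50.0)
list_obs[per_class.index.astype(int)] = per_class.values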

Selecting rows in geopandas or pandas based on latitude/longitude and radius

I have a dataframe (pd) where each row contains a bunch of measures, as well as latitude and longitude values. I can convert those into geopandas points if needed.
From this dataframe, I would like to select only rows that fall within a certain (let's say 1km) radius from a new given lat/long.
Is there a wise way to go about this problem?
Here's a data sample from the df:
id   lat      long     polution   label
3    45.467   -79.51   7          'nice'
7    45.312   -79.56   8          'mediocre'
a sample lat/long would be lat = 45.4 and long = -79.5.
Here's an example of working code. First make a function to calculate your distance. I implemented a simple distance calculation, but I would recommend whichever you find most useful. Next, you can subset the DataFrame to the rows within your desired distance.
import pandas as pd

#Initialize DataFrame
df = pd.DataFrame(columns=['location','lat','lon'])
df['location'] = ['LA','NY','LV']
df['lat'] = [34.05, 40.71, 36.16]
df['lon'] = [-118.24, -74.00, -115.14]

#New point Reno 39.53,-119.81
newlat = 39.53
newlon = -119.81

#Import trig stuff from math
from math import sin, cos, sqrt, atan2, radians

#Distance function between two lat/lon points
def getDist(lat1, lon1, lat2, lon2):
    R = 6373.0
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

#Apply distance function to dataframe
df['dist'] = list(map(lambda k: getDist(df.loc[k]['lat'], df.loc[k]['lon'], newlat, newlon), df.index))

#This will give all locations within radius of 600 km
df[df['dist'] < 600]
You can use the following algorithm:
Create a geodataframe (gdfdata) from the input data (pd dataframe)
Create another geodataframe (gdfsel) with the center point for the selection
Create a buffer around the center point (make gdfselbuff from gdfsel) for the selection
Use the within method of geopandas to find the points inside the buffer, e.g. gdf_within = gdfdata.loc[gdfdata.geometry.within(gdfselbuff.unary_union)]
For making the buffer, you can use GeoSeries.buffer(distance, resolution). See these links for reference, and the sketch after them for a rough end-to-end example.
http://geopandas.org/geometric_manipulations.html
https://gis.stackexchange.com/questions/253224/geopandas-buffer-using-geodataframe-while-maintaining-the-dataframe
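A minimal sketch of those four steps (my own, assuming geopandas and shapely are installed, and using the lat/long columns from the question's sample df). The buffer is built in UTM zone 17N so the 1 km radius is actually in meters; pick the CRS that matches your data.

import geopandas as gpd
from shapely.geometry import Point

# 1) geodataframe from the input data
gdfdata = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['long'], df['lat']), crs='EPSG:4326')

# 2) geodataframe with the center point for the selection
gdfsel = gpd.GeoDataFrame(geometry=[Point(-79.5, 45.4)], crs='EPSG:4326')

# 3) 1 km buffer around the center, built in a metric CRS (UTM 17N covers these coordinates)
gdfselbuff = gdfsel.to_crs(epsg=32617).buffer(1000).to_crs(epsg=4326)

# 4) keep only the rows whose points fall within the buffer
gdf_within = gdfdata.loc[gdfdata.geometry.within(gdfselbuff.unary_union)]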
On top of Sharder's solution, I found it convenient to apply a filter function. It also seems to execute faster:
def filter(row, lat2, lon2, max):
    if getDist(row['lat'], row['lon'], lat2, lon2) < max:
        return True
    else:
        return False

df[df.apply(filter, args=(newlat, newlon, 600), axis=1)]

The most efficient way to calculate geospatial distances between many points?

I have two datasets, one describing locations and a second containing various points:
locations.head()
latitude longitude geobounds_lon1 geobounds_lat1 geobounds_lon2 geobounds_lat2
0 52.5054 13.33320 13.08830 52.6755 13.7611 52.3382
1 54.6192 9.99778 7.86496 55.0581 11.3129 53.3608
2 41.6671 -71.27420 -71.90730 42.0188 -71.0886 41.0958
3 25.9859 -80.12280 -87.81370 30.9964 -78.9917 24.5071
4 43.7004 11.51330 9.63364 44.5102 12.4104 42.1654
points.head()
category lat lon
0 161 47.923132 11.507743
1 161 47.926479 11.531736
2 161 47.943670 11.576099
3 161 57.617577 12.040591
4 23 52.124071 -0.491918
I need to calculate distances from each location (based on locations.latitude and locations.longitude) to every point of each category (for example, 161). Only points that are not too far from a location matter to me - I thought that using the location boundaries might help, so I wouldn't need to calculate all distances and then filter them.
The biggest problem for me is how to efficiently filter the points for every location (based on category and boundaries) and calculate the distances from the location to those points, since the data is quite big (there are almost 9 million rows in locations and more than 10 million rows in points).
For distance calculation I tried BallTree:
import numpy as np
from sklearn.neighbors import BallTree

RADIANT_TO_KM_CONSTANT = 6367

class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, leaf_size=40, metric='haversine')

    def query_radius(self, query, radius):
        radius_radiant = radius / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        result = self.ball_tree_index.query_radius(query, r=radius_radiant,
                                                   return_distance=True)
        return result[1][0]
And for filtering points:
condition = (points.category == c) & (points.lat > lat2) & (points.lat < lat1) & (points.lon < lon2) & (points.lon > lon1)
tmp = points[condition]
where c is the specific category, lat1, lat2, lon1, lon2 are the location boundaries.
However, this would take a lot of time, so I wonder if there is any way to make it faster.
I would like to have a new column in locations dataframe, for example:
distances_161
0 [distance0_0, distance0_1, ...]
1 [distance1_0, distance1_1, ...]
2 [distance2_1, distance2_2, ...]
I'm not 100% certain that this is what you want, but it seems to make sense to me.
import numpy as np
import pandas

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df = {'lon1': [40.7454513],
      'lat1': [-73.9536799],
      'lon2': [40.7060268],
      'lat2': [-74.0110188]}

df['distance'] = haversine_np(df['lon1'], df['lat1'], df['lon2'], df['lat2'])
Result:
array([6.48545403])
So Python is saying 6.485 and Google says 6.5 - note that with the 6367 km earth radius the function returns kilometres, not miles.
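Because haversine_np is fully vectorized, it also broadcasts a single location against any number of points, which is close to what the question asks for. A rough sketch of my own (not part of the answer) of building the distances_161 column for one category:

# distance vector from every location to all points of one category
cat_points = points[points.category == 161]

def distances_to_cat(loc):
    return haversine_np(loc['longitude'], loc['latitude'],
                        cat_points['lon'].values, cat_points['lat'].values)

locations['distances_161'] = locations.apply(distances_to_cat, axis=1)

For 9 million locations this apply is still a long Python loop, so for the full problem a BallTree per category (as in the question's own BallTreeIndex) is probably the better fit.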

Iterate over Pandas index pairs [0,1],[1,2][2,3]

I have a pandas dataframe of lat/lng points created from a gps device.
My question is how to generate a distance column for the distance between each point in the gps track line.
Some googling has given me the haversine method below, which works using single values selected with iloc, but I'm struggling with how to iterate over the dataframe to supply the method's inputs.
I had thought I could run a for loop, with something along the lines of
for i in len(df):
    df['dist'] = haversine(df['lng'].iloc[i], df['lat'].iloc[i], df['lng'].iloc[i+1], df['lat'].iloc[i+1])
but I get the error TypeError: 'int' object is not iterable. I was also thinking about df.apply, but I'm not sure how to get the appropriate inputs. Any help or hints on how to do this would be appreciated.
Sample DF
lat lng
0 -7.11873 113.72512
1 -7.11873 113.72500
2 -7.11870 113.72476
3 -7.11870 113.72457
4 -7.11874 113.72444
Method
import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km
Are you looking for a result like this?
lat lon dist2next
0 -7.11873 113.72512 0.013232
1 -7.11873 113.72500 0.026464
2 -7.11873 113.72476 0.020951
3 -7.11873 113.72457 0.014335
4 -7.11873 113.72444 NaN
There's probably a clever way to use pandas.rolling_apply... but for a quick solution, I'd do something like this.
import numpy as np

def haversine(loc1, loc2):
    # convert decimal degrees to radians
    lon1, lat1 = map(math.radians, loc1)
    lon2, lat2 = map(math.radians, loc2)
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6367 * c
    return km

df['dist2next'] = np.nan
for i in df.index[:-1]:
    loc1 = df.loc[i, ['lon', 'lat']]
    loc2 = df.loc[i+1, ['lon', 'lat']]
    df.loc[i, 'dist2next'] = haversine(loc1, loc2)
Alternatively, if you don't want to modify your haversine function like that, you can just pick off lats and lons one at a time using df.loc[i, 'lon'], df.loc[i, 'lat'], df.loc[i+1, 'lon'], etc.
I would recommend a quicker variation of looping through a df, such as:
df_shift = df.shift(1).add_prefix('lag_')  # previous row's lat/lng, prefixed lag_
df = df.join(df_shift)

log = []
for rows in df.itertuples():
    log.append(haversine(rows.lng, rows.lat, rows.lag_lng, rows.lag_lat))
pd.DataFrame(log)
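If the itertuples loop is still too slow, the same shift idea can be done entirely in numpy with no Python loop at all; a short sketch of my own, using the lat/lng columns from the sample df:

import numpy as np

lat = np.radians(df['lat'].to_numpy())
lng = np.radians(df['lng'].to_numpy())

# haversine distance from each point to the next one; the last row gets NaN
dlat = np.diff(lat)
dlng = np.diff(lng)
a = np.sin(dlat/2)**2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlng/2)**2
dist = 2 * 6367 * np.arcsin(np.sqrt(a))

df['dist2next'] = np.append(dist, np.nan)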
