I have two pandas dataframes, each containing a column with a tuple of coordinates (one for schools, the other for houses).
I would like to create a new column in the houses dataframe containing the shortest distance from any of the schools. Below is some code I tried, but without success:
import pandas as pd
import numpy as np
import geopy.distance
schools = pd.DataFrame([(29.775803, -95.56353), (40.060276, -83.004196), (40.70592, -74.010765)])
houses = pd.DataFrame([(41.291989997, -73.087632372), (41.16741635, -73.188437585), (41.038689564, -73.635282641), (40.825542, -96.60775)])
x = 0
minimum_distance = []
for i in houses.itertuples(index=False, name=None):
    for j in schools.itertuples(index=False, name=None):
        if geopy.distance.geodesic(i, j).km > x:
            v = geopy.distance.geodesic(i, j).km
            minimum_distance.append(v)
        else:
            continue
houses['shortest_distance'] = minimum_distance
The houses dataframe should look like this afterwards:
           0          1  shortest_distance
0  41.291990 -73.087632         101.332983
1  41.167416 -73.188438          86.153595
2  41.038690 -73.635283          48.656830
3  40.825542 -96.607750        1156.075739
Does anyone have any idea how to perform this? I used a double loop in my code because I don't think there is another way, since each element has to be checked, but I am also wondering how efficient that would be with two dataframes of 20,000 rows each.
Thank you in advance for your help!
Louis
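For reference, here is a minimal sketch of the nested-minimum approach the question describes, reusing the sample frames above (the exact distances in the expected output are not re-verified here):
import pandas as pd
import geopy.distance

schools = pd.DataFrame([(29.775803, -95.56353), (40.060276, -83.004196), (40.70592, -74.010765)])
houses = pd.DataFrame([(41.291989997, -73.087632372), (41.16741635, -73.188437585),
                       (41.038689564, -73.635282641), (40.825542, -96.60775)])

# For each house, keep the minimum geodesic distance over all schools.
houses['shortest_distance'] = [
    min(geopy.distance.geodesic(house, school).km
        for school in schools.itertuples(index=False, name=None))
    for house in houses.itertuples(index=False, name=None)
]
print(houses)
For two frames of 20,000 rows each this means 400 million geodesic calls, so a vectorized haversine formula (or sklearn.neighbors.BallTree with the haversine metric on radian coordinates) would likely scale much better.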
Related
So, for example, I have two columns, Column1a and Column1b, and another three columns, Column2a, Column2b, and Column2c. I want to make an output column that holds an array combining the column-1 values with the column-2 values (if present), as given below.
At least one value from column 1 and one from column 2 must be present to produce an output.
Column1a Column1b Column2a Column2b Column2c OUTPUT
123A QWER ERTY 1256Y 234
3456 89AS
WERT 1234 9087
CVBT
OUTPUT should be as follows:
OUTPUT
["123A|ERTY","123A|1256Y","123A|234","QWER|ERTY","QWER|1256Y","QWER|234]
""
["WERT|1234","WERT|9087"]
""
Please help me with using a loop in such cases. Thanks.
Here is the answer to your question:
import pandas as pd
import numpy as np
# df=pd.read_excel('demo2.xlsx')
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
from itertools import product
x=pd.DataFrame(list(product([0,1], [2,3,4])), columns=['l1', 'l2'])
# make sure the OUTPUT column exists with object dtype so a list can be stored per cell
df["OUTPUT"] = None
for j in range(len(df)):
    full = []
    # empty output if the column-1 pair or the column-2 triple is entirely missing
    if ((df.iloc[j, 0] == "nan") & (df.iloc[j, 1] == "nan")) | ((df.iloc[j, 2] == "nan") & (df.iloc[j, 3] == "nan") & (df.iloc[j, 4] == "nan")):
        full.append("")
    else:
        # pair every non-empty column-1 value with every non-empty column-2 value
        for k in range(len(x)):
            if df.iloc[j, x.iloc[k, 0]] != "nan":
                l1 = df.iloc[j, x.iloc[k, 0]]
                if df.iloc[j, x.iloc[k, 1]] != "nan":
                    l2 = df.iloc[j, x.iloc[k, 1]]
                    full.append(l1 + "|" + l2)
    df.at[j, "OUTPUT"] = full
The resulting OUTPUT column should match the expected lists shown in the question.
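For comparison, a more compact sketch of the same row-wise logic (assuming the column names from the question and that empty cells are genuine NaN rather than the string "nan"):
import itertools
import pandas as pd

def combine(row):
    # non-empty values from the column-1 group and the column-2 group
    left = [v for v in row[['Column1a', 'Column1b']] if pd.notna(v)]
    right = [v for v in row[['Column2a', 'Column2b', 'Column2c']] if pd.notna(v)]
    # at least one value from each group is required to produce an output
    if left and right:
        return [f"{a}|{b}" for a, b in itertools.product(left, right)]
    return ""

df['OUTPUT'] = df.apply(combine, axis=1)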
I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.
I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.
For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.
So far I have the following:
ID_list = []
for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                         and row['latitude'] < x['max_latitude']
                         and row['longitude'] >= x['min_longitude']
                         and row['longitude'] < x['max_longitude']
                         else None, axis=1)
    ID_list.append(B.loc[filtered_B == True]['ID'])
The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.
The above code technically works, but it completes about 1000 rows per minute, which is just not feasible for such a large dataset.
Any tips would be appreciated, thank you.
edit: sample dataframes:
A:
location    latitude    longitude
1           -33.81263   151.23691
2           -33.994823  151.161274
3           -33.320154  151.662009
4           -33.99019   151.1567332
B:
ID       min_latitude  max_latitude  min_longitude  max_longitude
9ae8704  -33.815       -33.810       151.234        151.237
2ju1423  -33.555       -33.543       151.948        151.957
3ef4522  -33.321       -33.320       151.655        151.668
0uh0478  -33.996       -33.990       151.152        151.182
expected output:
ID_list = [9ae8704, 0uh0478, 3ef4522, 0uh0478]
I would use geopandas to do this, which makes use of rtree indexing.
import geopandas as gpd
from shapely.geometry import box
a_gdf = gpd.GeoDataFrame(a[['location']], geometry=gpd.points_from_xy(a.longitude,
a.latitude))
b_gdf = gpd.GeoDataFrame(
b[['ID']],
geometry=[box(*bounds) for _, bounds in b.loc[:, ['min_longitude',
'min_latitude',
'max_longitude',
'max_latitude']].iterrows()])
gpd.sjoin(a_gdf, b_gdf)
Output:
   location                                geometry  index_right       ID
0         1             POINT (151.23691 -33.81263)            0  9ae8704
1         2           POINT (151.161274 -33.994823)            3  0uh0478
3         4  POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3           POINT (151.662009 -33.320154)            2  3ef4522
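If what you want is ID_list in the original row order of a (with NaN for any unmatched point), one possible follow-up, reusing the frames above, is a left join sorted back by index:
# Left join keeps unmatched points as NaN; sort_index restores a's row order.
joined = gpd.sjoin(a_gdf, b_gdf, how='left')
ID_list = joined.sort_index()['ID'].tolist()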
We can create a multi-interval-index on b and then use a regular .loc lookup into it with tuples built from the rows of a. Interval indexes are useful in situations like this, where we have a table of low and high values to bucket a variable into.
from io import StringIO
import pandas as pd
a = pd.read_table(StringIO("""
location latitude longitude
1 -33.81263 151.23691
2 -33.994823 151.161274
3 -33.320154 151.662009
4 -33.99019 151.1567332
"""), sep='\s+')
b = pd.read_table(StringIO("""
ID min_latitude max_latitude min_longitude max_longitude
9ae8704 -33.815 -33.810 151.234 151.237
2ju1423 -33.555 -33.543 151.948 151.957
3ef4522 -33.321 -33.320 151.655 151.668
0uh0478 -33.996 -33.990 151.152 151.182
"""), sep='\s+')
lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)
print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())
The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with nan.
A good option for this might be to perform a cross-product merge and then drop the pairings that don't qualify. For example, you might do:
AB_cross = A.merge(
    B,
    how = "cross"
)
Now we have a giant dataframe with every possible pairing of a point in A and a box in B, whether or not the box actually contains the point. This is fast, but it builds a large dataset in memory: the cross product is 300,000 × 10,000 rows long.
Now we apply our logic by filtering the dataset accordingly. This runs on numpy under the hood (as far as I'm aware), so it's vectorized and very fast. It is also easier to use between to make the code a bit more readable.
Note that below I use .between(inclusive = 'left') to express min_long <= long < max_long (the inclusive end of the comparison is the lower bound).
ID_list = AB_cross['ID'].loc[
AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive = 'left') &
AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive = 'left')
]
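Points of A that fall in no box simply drop out of the filter above. One hedged follow-up, assuming 'location' is a usable key column in A as in the sample, is to merge the matched IDs back onto A so unmatched rows keep NaN:
# Keep the matched (location, ID) pairs and left-merge them back onto A.
matched = AB_cross.loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left'),
    ['location', 'ID']
]
A_with_id = A.merge(matched, on='location', how='left')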
A reasonably fast approach could be to pre-sort points by latitudes and longitudes, then iterate over boxes finding points inside the box by latitude (lat_min < lat < lat_max) and longitude (lon_min < lon < lon_max) separately with np.searchsorted and then intersecting them with np.intersect1d.
For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.
Here's an example implementation:
# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)
# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()
# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi
ids.tolist()
Output (on provided sample data):
['9ae8704', '0uh0478', '3ef4522', '0uh0478']
Timing your code against this alternative:
A = pd.DataFrame.from_dict(
{'location': {0: 1, 1: 2, 2: 3, 3: 4},
'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
{'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)
def func(latitude, longitude):
    # return the ID of the first box that contains the point, or None if nothing matches
    for y, x in B.iterrows():
        if (latitude >= x.min_latitude and latitude < x.max_latitude
                and longitude >= x.min_longitude and longitude < x.max_longitude):
            return x['ID']

A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()
shows that the new solution is about 2.33 times faster.
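The timing screenshots from the original post aren't reproduced here; a quick way to compare the two approaches yourself is something like:
import time

start = time.perf_counter()
result = A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()
print(f"lookup took {time.perf_counter() - start:.4f} s")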
I think Geopandas would be the best solution for making sure all of your edge cases are covered like meridian/equator crossovers and the like - spatial queries are exactly what Geopandas is designed for. It can be a pain to install though.
One naive approach in numpy (assuming your coordinates don't straddle the equator or the antimeridian, where signs flip) would be to compute each of your clauses as the sign of a difference and then keep only the matches where all the signs meet your criteria.
For lots of intensive, repetitive calculations like this, it's usually better to dump it out of pandas into numpy, process and then put it back into pandas.
a_lats = A.latitude.values.reshape(-1,1)
b_min_lats = B.min_latitude.values.reshape(1,-1)
b_max_lats = B.max_latitude.values.reshape(1,-1)
a_lons = A.longitude.values.reshape(-1,1)
b_min_lons = B.min_longitude.values.reshape(1,-1)
b_max_lons = B.max_longitude.values.reshape(1,-1)
north_of_min_lat = np.sign(a_lats - b_min_lats)
south_of_max_lat = np.sign(b_max_lats - a_lats)
west_of_min_lon = np.sign(a_lons - b_min_lons)
east_of_max_lon = np.sign(b_max_lons - a_lons)
margin_matches = (north_of_min_lat + south_of_max_lat + west_of_min_lon + east_of_max_lon)
match_indexes = (margin_matches == 4).nonzero()
matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
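For example, a sketch of that CuPy variant might look like this (assuming a CUDA-capable GPU and cupy installed; untested here):
import cupy as cp

a_lats = cp.asarray(A.latitude.values).reshape(-1, 1)
b_min_lats = cp.asarray(B.min_latitude.values).reshape(1, -1)
b_max_lats = cp.asarray(B.max_latitude.values).reshape(1, -1)
a_lons = cp.asarray(A.longitude.values).reshape(-1, 1)
b_min_lons = cp.asarray(B.min_longitude.values).reshape(1, -1)
b_max_lons = cp.asarray(B.max_longitude.values).reshape(1, -1)

# Same sign trick as above, evaluated on the GPU.
margin_matches = (cp.sign(a_lats - b_min_lats) + cp.sign(b_max_lats - a_lats)
                  + cp.sign(a_lons - b_min_lons) + cp.sign(b_max_lons - a_lons))
rows, cols = cp.nonzero(margin_matches == 4)

# .get() copies the index arrays back to the host.
matches = [(A.location.iloc[i], B.ID.iloc[j]) for i, j in zip(rows.get(), cols.get())]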
What's the best way to do an operation on a dataframe that, for every row, I need to do a selection on another dataframe?
For example:
My first dataframe has the similarity between every pair of items. For starters, I'll set every similarity to zero and calculate the correct similarity later.
import pandas as pd
import numpy as np
import scipy as sp
from scipy.spatial import distance
items = [1,2,3,4]
item_item_idx = pd.MultiIndex.from_product([items, items], names = ['from_item', 'to_item'])
item_item_df = pd.DataFrame({'similarity': np.zeros(len(item_item_idx))},
index = item_item_idx
)
My next dataframe has the rating every user gave for every item. For sake of simplification, let's assume every user rated every item and generate random ratings between 1 and 5.
users = [1,2,3,4,5]
ratings_idx = pd.MultiIndex.from_product([items, users], names = ['item', 'user'])
rating_df = pd.DataFrame(
{'rating': np.random.randint(low = 1, high = 6, size = len(users)*len(items))},
columns = ['rating'],
index = ratings_idx
)
Now that I have the ratings, I want to update the cosine similarity between the items. What I need to do is, for every row in item_item_df, select from rating_df the vector of ratings for each of the two items and calculate the cosine distance between them.
I want to know the least dumb way to do this. Here's what I tried so far:
==== FIRST TRY - Iterating over rows
def similarity(ii, iu):
    for index, row in ii.iterrows():
        v = iu.loc[index[0]]
        u = iu.loc[index[1]]
        row['similarity'] = distance.cosine(v, u)
    return ii
import time
start_time = time.time()
item_item_df = similarity(item_item_df, rating_df)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.01002s to run this. In a problem with 10k items, I estimate it would take in the ballpark of 20 hours to run. Not good.
The thing is, I'm iterating over rows; my hope is that I can vectorize this to make it faster. I played around with df.apply() and df.map(). This is the best I've done so far:
==== SECOND TRY - index.map()
def similarity_map(idx):
    v = rating_df.loc[idx[0]]
    u = rating_df.loc[idx[1]]
    return distance.cosine(v, u)
start_time = time.time()
item_item_df['similarity'] = item_item_df.index.map(similarity_map)
print('Time: {:f}s'.format(time.time() - start_time))
Took me 0.034961s to execute. Slower than just iterating over rows.
So this was a naive attempt to vectorize. Is it even possible to do? What other options do I have to improve the runtime?
Thank you for your attention.
For your given example I'd just pivot it into an array and move on with my life.
from sklearn.metrics.pairwise import cosine_similarity
rating_df = rating_df.reset_index().pivot(index='item', columns='user')
cs_df = pd.DataFrame(cosine_similarity(rating_df),
index=rating_df.index, columns=rating_df.index)
>>> cs_df
item 1 2 3 4
item
1 1.000000 0.877346 0.660529 0.837611
2 0.877346 1.000000 0.608781 0.852029
3 0.660529 0.608781 1.000000 0.758098
4 0.837611 0.852029 0.758098 1.000000
This would be more difficult with a giant, highly sparse array. sklearn's cosine_similarity accepts sparse arrays, though, so as long as your number of items is reasonable (the output matrix will be dense) this should be workable.
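For instance, a hedged sketch of the sparse route, reusing the pivoted rating_df from above:
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# A sparse item-by-user matrix keeps memory down when most ratings are missing;
# the returned similarity matrix is still dense (n_items x n_items).
sparse_ratings = csr_matrix(rating_df.fillna(0).values)
item_similarities = cosine_similarity(sparse_ratings)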
Same thing but different. Work with numpy arrays. Fine for small arrays but with 10k rows you'll have some large arrays.
import numpy as np
data = rating_df.unstack().values # shape (4,5)
udotv = np.dot(data,data.T) # shape (4,4)
mag_data = np.linalg.norm(data,axis=1)
mag = mag_data * mag_data[:,None]
cos_sim = 1 - (udotv / mag)
item_item_df['sim2'] = cos_sim.flatten()
4k users and 14k items pretty much blows up my poor computer. I'm going to have to look how sklearn.metrics.pairwise.cosine_similarity handles that large data.
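If the full call is too heavy, one rough way to bound peak memory is to build the matrix in row blocks (a sketch, not tuned; note the full 14k × 14k result alone is about 1.5 GB in float64, half that in float32):
import numpy as np

def blockwise_cosine(data, block=1000):
    # data: (n_items, n_users); returns the dense (n_items, n_items) similarity matrix
    norms = np.linalg.norm(data, axis=1)
    n = data.shape[0]
    out = np.empty((n, n), dtype=np.float32)  # float32 halves the output memory
    for start in range(0, n, block):
        stop = min(start + block, n)
        # only a (block, n) slice of dot products is held at once
        out[start:stop] = (data[start:stop] @ data.T) / (norms[start:stop, None] * norms[None, :])
    return out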
I have a dataframe with different actors distribution of attention towards different issues. It looks like this:
Social politics & Welfare Technology & IT Business, Finance, & Economy ...
actor_1 0.034483 0.051724 0.017241 ...
actor_2 0.032000 0.016000 0.056000 ...
actor_3 0.012195 0.004065 0.010163 ...
actor_4 0.000000 0.045977 0.022989 ...
actor_5 0.027397 0.006849 0.000000 ...
actor_6 0.128205 0.000000 0.051282 ...
I've created two functions for creating a matrix with the similarity scores between all the different actors.
import math
import numpy as np

def dist_sim(array1, array2):
    array1 = array1 * 100
    array2 = array2 * 100
    distances = array1 - array2
    total_distance = 0
    for distance in distances:
        total_distance += math.sqrt(distance * distance)
    return 100 - total_distance / 2

def dist_sim_matrix(df):
    matrix = []
    for index, row in df.iterrows():
        party_matrix = []
        for index1, row1 in df.iterrows():
            party_matrix.append(dist_sim(row, row1))
        matrix.append(party_matrix)
    return np.array(matrix, int)
They work perfectly fine, however when I apply it to a large dataframe (eg. with 2000 different actors and 25 issues) it takes forever (I'm actually not sure I've got enough RAM for it?).
I'm new in the business of creating my own functions, so any help on optimization would be awesome!
Here is what you can do:
import pandas as pd
import numpy as np
# I used a fake dataframe
df = pd.DataFrame(data={'c1': np.random.rand(10),
'c2': np.random.rand(10),
'c3': np.random.rand(10),
'c4': np.random.rand(10)},
index=[f'actor_{i}' for i in range(1,11)])
# Traspose it
df = df.T
# Define the function to compute distance
def dist_sim(array1, array2):
    '''
    Vectorised with numpy. Note this uses the Euclidean norm, whereas the
    original dist_sim sums absolute differences.
    '''
    d = np.sqrt(np.square(array1 - array2).sum()) * 100
    return 100 - d / 2
# Initialize an empty dataframe
sim_df = pd.DataFrame(columns=list(df), index=list(df))
# cycle over the dataframe actors - exploit symmetry to halve the number of iterations
for i, c1 in enumerate(list(df)):
    for c2 in list(df)[i:]:
        sim_df.loc[c1, c2] = sim_df.loc[c2, c1] = dist_sim(df[c1], df[c2])
The resulting dataframe is something like
sim_df
actor_1 actor_2 actor_3 ... actor_8 actor_9 actor_10
actor_1 100 67.146 56.3693 ... 74.2303 77.7915 55.0946
actor_2 67.146 100 64.7546 ... 61.9146 72.5428 63.7388
actor_3 56.3693 64.7546 100 ... 57.5318 51.5127 95.3162
actor_4 68.5392 59.2313 75.0851 ... 73.3381 61.7608 74.6694
actor_5 72.671 67.2219 79.2112 ... 64.2796 59.9031 77.3241
actor_6 62.8109 67.1849 87.7293 ... 60.9305 53.3952 83.9605
actor_7 62.0589 63.5562 35.7006 ... 57.5888 61.3989 33.1785
actor_8 74.2303 61.9146 57.5318 ... 100 69.602 55.4216
actor_9 77.7915 72.5428 51.5127 ... 69.602 100 51.4612
actor_10 55.0946 63.7388 95.3162 ... 55.4216 51.4612 100
In this case there is an optimised function in scipy: see the spatial.distance module, specifically the pdist function, for computing:
Pairwise distances between observations in n-dimensional space.
in your case you can do:
from scipy.spatial import distance
d = distance.squareform(distance.pdist(df, 'euclidean'))
dd = pd.DataFrame(d, df.index, df.index)
Note that these are "distances", so the distance from an actor to itself is zero. If you really want it to take a maximal value (as in your calculations) you could do:
d *= -50
d += 100
before turning it into a dataframe. Note that I'm doing these calculations "in place" so that additional copies of a potentially enormous matrix aren't created.
I posted a question along the same lines yesterday; this is a slightly modified version of it. Previous question here.
I have 2 dataframes as follows:
data1 looks like this:
id address
1 11123451
2 78947591
data2 looks like the following:
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
I want to create another column in data1 called "place" which contains the place the id is from. There will be many ids coming from the same place. And some ids don't have a match.
The addresses here are float values.
What I am actually looking for is a Python equivalent of the following R code. It's easier for me to write this in R, but I am unsure how to do it in Python. Can someone help me with this?
data_place = rep(NA, nrow(data1))
for (i in 1:nrow(data1)) {
  tmp = as.character(data2[data1$address[i] >= data2$lowerbound_address & data1$address[i] <= data2$upperbound_address, "place"])
  if (length(tmp) == 1) {data_place[i] = tmp}
}
data1$place = data_place
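For what it's worth, a near-literal Python translation of that R loop might look like this (a sketch, assuming data1 and data2 are pandas DataFrames with the columns shown above):
import numpy as np
import pandas as pd

data_place = pd.Series(np.nan, index=data1.index, dtype=object)
for i in data1.index:
    # rows of data2 whose range contains this address
    tmp = data2.loc[
        (data1.loc[i, 'address'] >= data2['lowerbound_address']) &
        (data1.loc[i, 'address'] <= data2['upperbound_address']),
        'place'
    ]
    if len(tmp) == 1:
        data_place[i] = tmp.iloc[0]
data1['place'] = data_place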
Something like this would work.
import pandas as pd
import numpy as np
# The below section is only used to import data
from io import StringIO
data = """
id address
1 11123451
2 78947591
3 50000000
"""
data2 = """
lowerbound_address upperbound_address place
78392888 89000000 X
10000000 20000000 Y
"""
# The above section is only used to import data
df = pd.read_csv(StringIO(data), delimiter='\s+')
df2 = pd.read_csv(StringIO(data2), delimiter='\s+')
df['new'] = np.nan
df.loc[(df['address'] > df2['lowerbound_address'][0]) & (df['address'] < df2['upperbound_address'][0]), 'new'] = 'X'
df.loc[(df['address'] > df2['lowerbound_address'][1]) & (df['address'] < df2['upperbound_address'][1]), 'new'] = 'Y'
In addition to pandas, we use numpy for np.nan.
All I have done is create a new column filled with NaN, then apply two criteria to assign either 'X' or 'Y' based on the lower and upper boundaries in the second dataframe (last two lines).
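If data2 had more than two ranges, the same idea generalizes with a loop over its rows (a sketch reusing the df/df2 names above):
df['new'] = np.nan
for _, r in df2.iterrows():
    # inclusive on both ends, matching the >= and <= of the R version
    in_range = df['address'].between(r['lowerbound_address'], r['upperbound_address'])
    df.loc[in_range, 'new'] = r['place']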
Final results:
id address new
0 1 11123451 Y
1 2 78947591 X
2 3 50000000 NaN
Do a merge_asof and then replace the place with NaN wherever the address is out of bounds.
data1.sort_values('address', inplace = True)
data2.sort_values('lowerbound_address', inplace=True)
data3 = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
data3['place'] = data3['place'].where(data3.address <= data3.upperbound_address)
data3 = data3.drop(['lowerbound_address', 'upperbound_address'], axis=1)
Output
id address place
0 1 11123451 Y
1 3 50000000 NaN
2 2 78947591 X