I have a dataframe with several columns and rows.
E.g. minimal example to reproduce:
df2 = pd.DataFrame( {'tof': {4875: 553.347572418771, 4876: 554.4911639867169, 4877: 556.2920840265602, 4878: 560.4059228421942, 4879: 560.7254631042018, 4880: 563.528024491697, 4881: 566.9995989172061, 4882: 570.7775523817776, 4883: 572.0685266789887}, 'E': {4875: 21.390025636898983, 4876: 21.301354836365054, 4877: 21.16283071077996, 4878: 20.85142476306085, 4879: 20.82752461774972, 4880: 20.61965532001513, 4881: 20.366451134218167, 4882: 20.096164210524464, 4883: 20.005036194577794}})
tof E
4875 553.347572 21.390026
4876 554.491164 21.301355
4877 556.292084 21.162831
4878 560.405923 20.851425
4879 560.725463 20.827525
4880 563.528024 20.619655
4881 566.999599 20.366451
4882 570.777552 20.096164
4883 572.068527 20.005036
I need to select the 2 points (rows) closest to some provided value E_provided, one on each side.
E.g. E_provided = 20.83
The closest point (row) to the left of E_provided (closest_left) gives
4879 560.725463 20.827525
and the closest point (row) to the right of E_provided (closest_right) gives
4878 560.405923 20.851425
I understand that it's a simple task and that I need to compute some difference like df2['E'] - E_provided and apply additional conditions, but I don't see how to put it together :)
I have tried this one:
df2.iloc[(df2['E']-E_provided).abs().argsort()[:2]]
But this works incorrectly, because there are points close to the provided value from one side only. E.g. for E_provided == 20.1 it outputs the 2 nearest points, both of which lie to the left of E_provided:
4882 570.777552 20.096164
4883 572.068527 20.005036
While I need to find:
4881 566.999599 20.366451
4882 570.777552 20.096164
Find values greater than/less than your value:
E_provided = 20.83
mask = df2.E.gt(E_provided)
Use this to get the closest from each direction:
above_id = df2.loc[mask, 'E'].sub(E_provided).abs().idxmin()
below_id = df2.loc[~mask, 'E'].sub(E_provided).abs().idxmin()
Now we can look up those two rows:
out = df2.loc[[above_id, below_id]]
print(out)
Output:
tof E
4878 560.405923 20.851425
4879 560.725463 20.827525
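One caveat (my addition, not part of the original answer): if E_provided lies outside the range of df2['E'], one of the two masked selections is empty and idxmin raises an error. A minimal guard, reusing the mask defined above, could look like this:
above = df2.loc[mask, 'E']
below = df2.loc[~mask, 'E']
ids = []
if not above.empty:
    ids.append(above.sub(E_provided).abs().idxmin())
if not below.empty:
    ids.append(below.sub(E_provided).abs().idxmin())
out = df2.loc[ids]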
Less elegant, but a working solution (using df2 and E_provided from the question):
lower_ind = df2[df2['E'] < E_provided]['E'].idxmax()
upper_ind = df2[df2['E'] > E_provided]['E'].idxmin()
indx_list = [lower_ind, upper_ind]
out = df2.loc[indx_list]
print(out)
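A small edge case to keep in mind (my addition): with strict inequalities, a row whose E exactly equals E_provided is picked by neither side. If you want such a row treated as the lower neighbour, one possible tweak is:
lower_ind = df2[df2['E'] <= E_provided]['E'].idxmax()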
I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.
I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.
For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.
So far I have the following:
ID_list = []
for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                         and row['latitude'] < x['max_latitude']
                         and row['longitude'] >= x['min_longitude']
                         and row['longitude'] < x['max_longitude']
                         else None, axis=1)
    ID_list.append(B.loc[filtered_B == True]['ID'])
The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.
The above code technically works, but it completes about 1000 rows per minute, which is just not feasible for such a large dataset.
Any tips would be appreciated, thank you.
edit: sample dataframes:
A:
   location    latitude    longitude
   1           -33.81263   151.23691
   2           -33.994823  151.161274
   3           -33.320154  151.662009
   4           -33.99019   151.1567332
B:
   ID       min_latitude  max_latitude  min_longitude  max_longitude
   9ae8704  -33.815       -33.810       151.234        151.237
   2ju1423  -33.555       -33.543       151.948        151.957
   3ef4522  -33.321       -33.320       151.655        151.668
   0uh0478  -33.996       -33.990       151.152        151.182
expected output:
ID_list = [9ae8704, 0uh0478, 3ef4522, 0uh0478]
I would use geopandas to do this, which makes use of rtree indexing.
import geopandas as gpd
from shapely.geometry import box
a_gdf = gpd.GeoDataFrame(
    a[['location']],
    geometry=gpd.points_from_xy(a.longitude, a.latitude))

b_gdf = gpd.GeoDataFrame(
    b[['ID']],
    geometry=[box(*bounds) for _, bounds in
              b.loc[:, ['min_longitude', 'min_latitude',
                        'max_longitude', 'max_latitude']].iterrows()])
gpd.sjoin(a_gdf, b_gdf)
Output:
   location                                geometry  index_right       ID
0         1             POINT (151.23691 -33.81263)            0  9ae8704
1         2           POINT (151.161274 -33.994823)            3  0uh0478
3         4  POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3            POINT (151.662009 -33.320154)           2  3ef4522
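A small follow-up (my addition, not part of the original answer): to recover a plain ID list in the row order of a, you can sort the join result back by the left index:
joined = gpd.sjoin(a_gdf, b_gdf)
ID_list = joined.sort_index()['ID'].tolist()
print(ID_list)  # ['9ae8704', '0uh0478', '3ef4522', '0uh0478'] for the sample data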
We can create a multi-interval index on b and then use a regular loc lookup with tuples built from the rows of a. Interval indexes are useful in situations like this, when we have a table of low and high values to bucket a variable into.
from io import StringIO
import pandas as pd
a = pd.read_table(StringIO("""
location latitude longitude
1 -33.81263 151.23691
2 -33.994823 151.161274
3 -33.320154 151.662009
4 -33.99019 151.1567332
"""), sep='\s+')
b = pd.read_table(StringIO("""
ID min_latitude max_latitude min_longitude max_longitude
9ae8704 -33.815 -33.810 151.234 151.237
2ju1423 -33.555 -33.543 151.948 151.957
3ef4522 -33.321 -33.320 151.655 151.668
0uh0478 -33.996 -33.990 151.152 151.182
"""), sep='\s+')
lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)
print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())
The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with nan.
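If you want the result attached to a as a column rather than printed, a minimal follow-up sketch (my addition, reusing the same lookup as above) would be:
a['ID'] = b.loc[list(zip(a.latitude, a.longitude)), 'ID'].values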
A good option for this might be to perform a cross-product merge and then drop the rows that don't qualify. For example, you might do:
AB_cross = A.merge(
    B,
    how="cross"
)
Now we have a giant dataframe with all the possible pairings of points in A and boxes in B, whether or not they actually match (we don't know yet). This is fast but creates a large dataset in memory, since the result is 300,000 x 10,000 rows long.
Now, we need to apply our logic by filtering the dataset accordingly. This is a numpy process (as far as I'm aware), so it's vectorized and very fast! I will also say that it might be easier to use between to make your code a bit more semantic.
Note that below I use .between(inclusive='left') to express the condition min_long <= long < max_long (the inclusive side of the inequality is the left one).
ID_list = AB_cross['ID'].loc[
AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive = 'left') &
AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive = 'left')
]
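A possible follow-up (my addition, assuming location uniquely identifies rows in A, as it does in the sample data): keep the matching location alongside the ID so the result can be joined back onto A:
matches = AB_cross.loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left'),
    ['location', 'ID']
]
A_with_id = A.merge(matches, on='location', how='left')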
A reasonably fast approach could be to pre-sort points by latitudes and longitudes, then iterate over boxes finding points inside the box by latitude (lat_min < lat < lat_max) and longitude (lon_min < lon < lon_max) separately with np.searchsorted and then intersecting them with np.intersect1d.
For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.
Here's an example implementation:
# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)
# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()
# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi

ids.tolist()
Output (on provided sample data):
['9ae8704', '0uh0478', '3ef4522', '0uh0478']
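If you then want the IDs attached back to a (my addition; the ids series above was created with a's index, so it is already aligned):
a['ID'] = ids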
Your code takes a certain amount of time to run (timing screenshot omitted), while this code:
A = pd.DataFrame.from_dict(
{'location': {0: 1, 1: 2, 2: 3, 3: 4},
'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
{'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)
def func(latitude, longitude):
    for y, x in B.iterrows():
        if (latitude >= x.min_latitude and latitude < x.max_latitude
                and longitude >= x.min_longitude and longitude < x.max_longitude):
            return x['ID']

A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()
runs in less time (timing screenshot omitted), so the new solution is about 2.33 times faster.
I think Geopandas would be the best solution for making sure all of your edge cases are covered like meridian/equator crossovers and the like - spatial queries are exactly what Geopandas is designed for. It can be a pain to install though.
One naive approach in numpy (assuming that most of the points don't change signs anywhere) would be to calculate each of your clauses as a sign of a difference and then keep only the matches where all the signs match your criteria.
For lots of intensive, repetitive calculations like this, it's usually better to dump it out of pandas into numpy, process and then put it back into pandas.
a_lats = A.latitude.values.reshape(-1, 1)
b_min_lats = B.min_latitude.values.reshape(1, -1)
b_max_lats = B.max_latitude.values.reshape(1, -1)
a_lons = A.longitude.values.reshape(-1, 1)
b_min_lons = B.min_longitude.values.reshape(1, -1)
b_max_lons = B.max_longitude.values.reshape(1, -1)

# each term is +1 where the point satisfies that side of the bounding box
north_of_min_lat = np.sign(a_lats - b_min_lats)
south_of_max_lat = np.sign(b_max_lats - a_lats)
east_of_min_lon = np.sign(a_lons - b_min_lons)
west_of_max_lon = np.sign(b_max_lons - a_lons)
margin_matches = (north_of_min_lat + south_of_max_lat + east_of_min_lon + west_of_max_lon)
match_indexes = (margin_matches == 4).nonzero()
matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
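If you then want the IDs as a column on A (my addition, assuming each point falls into at most one box), you can turn the match pairs into a mapping:
match_map = dict(matches)  # {location: ID}
A['ID'] = A['location'].map(match_map)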
PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
Since my last post lacked information, here is an example of my df (the important columns):
device_id: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
device_id mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), calculating the vehicle's speed from the timestamp and the mileage. The result should then be written to the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0

for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to, however it performs very poorly. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value
df_ori['timeGoneSec'] = \
df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above produces NaN for the first row in each group
# fill 'valid' with 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double check the logic and make sure it works as desired.
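For reference, here is a minimal, self-contained sketch of the same groupby/diff idea applied to the sample rows from the question (the column names and maxSpeedKMH = 80 are assumptions taken from the post):
import pandas as pd

maxSpeedKMH = 80  # assumed max speed from the question

df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})

df_ori = df_ori.sort_values('position_timestamp_measure')
# seconds since the previous message of the same vehicle (NaN for the first message)
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
df_ori['valid'] = 0
# the first message per vehicle cannot be validated, so mark it valid
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
# implied speed in km/h; mark valid where it does not exceed the max speed
speed = df_ori['mileage'] / (df_ori['timeGoneSec'] / 3600)
df_ori.loc[speed <= maxSpeedKMH, 'valid'] = 1
print(df_ori)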
Hi, I have a dataset in the following format.
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
(I input the numbers as strings to show blank cells.)
The first three columns denote the date (Year, Month and Day) and the following columns represent individuals (my actual data file consists of about 300 such rows and about 1000 subjects; I presented a subset of the data here).
The column values refer to expenditure on FMCG products.
What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual, locate the first observation and duplicate its value for at least the previous six months. For example: Subject_C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from February 10, 2008 onwards to be equal to 65 for Subject_C (roughly 2/12/2008 is the cutoff date, so we leave the 3rd cell from the top of Subject_C's column blank).
b) Locate the last observation and repeat it for the following 3 months. For example, for Subject_A, we repeat 35 twice (till 6th November 2008).
Please refer to the diagram with the highlighted cells for the solution (screenshot not reproduced here).
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the rows in between, without caring about time (example screenshot omitted).
ii) Find the total time between the two observations and take half of it. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example, for Subject_A, the total number of days between 01/22/2008 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining days assign 35. In this case the columns for Subject_A and Subject_C would look accordingly (screenshot omitted).
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a, b and i] and Final data II [using a, b and ii] (screenshots not reproduced here).
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow-up question and issues
Thanks @Juan for the initial answer. Here's my follow-up question: suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues
For the current code, I found an issue with part II (ii). This is the output that I get (screenshot omitted):
It is actually on the right track, but the two cells above 35 do not seem to get updated. Is there something wrong on my end? Also, same question as before: would we be able to extend it to the case of more than 2 observations?
Here is a code solution for Subject_A; it should work for the other subjects as well:
import numpy as np
import pandas as pd

d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()
### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs (pd.Timedelta does not support month units, so use DateOffset)
min_cutoff = min_date_subja - pd.DateOffset(months=6)
max_cutoff = max_date_subja + pd.DateOffset(months=3)
## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()
lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
next_date = d1_2i.loc[N,'date']
next_val = d1_2i.loc[N,'Subject_A']
#PART II.i
d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
#PART II.ii
mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
lower_date = next_date
lower_val = next_val
print(d1_2i)
print(d1_2ii)
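If you want to apply the same steps to the other subject columns without copy-pasting, here is a hedged sketch (my addition; fill_subject is a name I introduce, not part of the original answer) that wraps the logic above into a function parameterised by the column name. It assumes the 'date' column created above already exists in d1:
def fill_subject(d1, col):
    # apply Part I (a, b) and Part II (i, ii) to one subject column; returns (d1_2i, d1_2ii)
    d1 = d1.copy()
    d1[col] = d1[col].replace('', np.nan).astype(float)
    notnull = d1[col].notnull()
    if notnull.sum() == 0:
        return d1.copy(), d1.copy()
    min_date = d1.loc[notnull, 'date'].min()
    max_date = d1.loc[notnull, 'date'].max()
    min_val = d1.loc[d1.loc[notnull, 'date'].idxmin(), col]
    max_val = d1.loc[d1.loc[notnull, 'date'].idxmax(), col]
    # Part I.a / I.b
    d1.loc[(d1['date'] < min_date) & (d1['date'] > min_date - pd.DateOffset(months=6)), col] = min_val
    d1.loc[(d1['date'] > max_date) & (d1['date'] < max_date + pd.DateOffset(months=3)), col] = max_val
    # Part II
    d1_2i, d1_2ii = d1.copy(), d1.copy()
    lower_date, lower_val = min_date, min_val
    for n in d1.loc[(d1['date'] > min_date) & notnull].index:
        next_date, next_val = d1.loc[n, 'date'], d1.loc[n, col]
        # II.i: fill the gap with the average of the two observations
        d1_2i.loc[(d1['date'] > lower_date) & (d1['date'] < next_date), col] = np.mean([lower_val, next_val])
        # II.ii: first half of the gap gets the earlier value, second half the later value
        half = pd.Timedelta((next_date - lower_date).days / 2, unit='d')
        d1_2ii.loc[(d1['date'] > lower_date) & (d1['date'] <= lower_date + half), col] = lower_val
        d1_2ii.loc[(d1['date'] > lower_date + half) & (d1['date'] <= next_date), col] = next_val
        lower_date, lower_val = next_date, next_val
    return d1_2i, d1_2ii

d1_2i, d1_2ii = fill_subject(d1, 'Subject_B')
Since the loop walks over every non-null observation after the first one, more than two observations per subject should be handled the same way.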