I have a dataframe with several columns and rows.
E.g. minimal example to reproduce:
df2 = pd.DataFrame( {'tof': {4875: 553.347572418771, 4876: 554.4911639867169, 4877: 556.2920840265602, 4878: 560.4059228421942, 4879: 560.7254631042018, 4880: 563.528024491697, 4881: 566.9995989172061, 4882: 570.7775523817776, 4883: 572.0685266789887}, 'E': {4875: 21.390025636898983, 4876: 21.301354836365054, 4877: 21.16283071077996, 4878: 20.85142476306085, 4879: 20.82752461774972, 4880: 20.61965532001513, 4881: 20.366451134218167, 4882: 20.096164210524464, 4883: 20.005036194577794}})
tof E
4875 553.347572 21.390026
4876 554.491164 21.301355
4877 556.292084 21.162831
4878 560.405923 20.851425
4879 560.725463 20.827525
4880 563.528024 20.619655
4881 566.999599 20.366451
4882 570.777552 20.096164
4883 572.068527 20.005036
I need to select the 2 points (rows) closest to some provided value E_provided, one on each side.
E.g. E_provided = 20.83
The closest point (row) to the left of E_provided (closest_left) gives
4879 560.725463 20.827525
and the closest point (row) to the right of E_provided (closest_right) gives
4878 560.405923 20.851425
I understand that it's a simple task and that I need to compute some difference like df2['E'] - E_provided and apply additional conditions, but I don't see how to put it together :)
I have tried this one:
df2.iloc[(df2['E']-E_provided).abs().argsort()[:2]]
But this works incorrectly, because there are points close to the provided value from one side only. E.g. for E_provided == 20.1 it outputs the 2 nearest points, both of which lie to the left of E_provided:
4882 570.777552 20.096164
4883 572.068527 20.005036
While I need to find:
4881 566.999599 20.366451
4882 570.777552 20.096164
Find values greater than/less than your value:
E_provided = 20.83
mask = df2.E.gt(E_provided)
Use this to get the closest from each direction:
above_id = df2.loc[mask, 'E'].sub(E_provided).abs().idxmin()
below_id = df2.loc[~mask, 'E'].sub(E_provided).abs().idxmin()
Now we can look up those two rows:
out = df2.loc[[above_id, below_id]]
print(out)
Output:
tof E
4878 560.405923 20.851425
4879 560.725463 20.827525
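One caveat (my addition, not part of the original answer): if E_provided lies outside the range of df2['E'], one of the two masked selections is empty and idxmin raises an error. A minimal guard, reusing the mask defined above, could look like this:
above = df2.loc[mask, 'E']
below = df2.loc[~mask, 'E']
ids = []
if not above.empty:
    ids.append(above.sub(E_provided).abs().idxmin())
if not below.empty:
    ids.append(below.sub(E_provided).abs().idxmin())
out = df2.loc[ids]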
Less elegant, but a working solution (using df2 and E_provided from the question):
lower_ind = df2[df2['E'] < E_provided]['E'].idxmax()
upper_ind = df2[df2['E'] > E_provided]['E'].idxmin()
indx_list = [lower_ind, upper_ind]
out = df2.loc[indx_list]
print(out)
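A small edge case to keep in mind (my addition): with strict inequalities, a row whose E exactly equals E_provided is picked by neither side. If you want such a row treated as the lower neighbour, one possible tweak is:
lower_ind = df2[df2['E'] <= E_provided]['E'].idxmax()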
I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.
I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.
For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.
So far I have the following:
ID_list = []
for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                         and row['latitude'] < x['max_latitude']
                         and row['longitude'] >= x['min_longitude']
                         and row['longitude'] < x['max_longitude']
                         else None, axis=1)
    ID_list.append(B.loc[filtered_B == True]['ID'])
The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.
The above code technically works, but it completes about 1000 rows per minute, which is just not feasible for such a large dataset.
Any tips would be appreciated, thank you.
edit: sample dataframes:
A:
   location    latitude    longitude
   1           -33.81263   151.23691
   2           -33.994823  151.161274
   3           -33.320154  151.662009
   4           -33.99019   151.1567332
B:
   ID       min_latitude  max_latitude  min_longitude  max_longitude
   9ae8704  -33.815       -33.810       151.234        151.237
   2ju1423  -33.555       -33.543       151.948        151.957
   3ef4522  -33.321       -33.320       151.655        151.668
   0uh0478  -33.996       -33.990       151.152        151.182
expected output:
ID_list = [9ae8704, 0uh0478, 3ef4522, 0uh0478]
I would use geopandas to do this, which makes use of rtree indexing.
import geopandas as gpd
from shapely.geometry import box
a_gdf = gpd.GeoDataFrame(
    a[['location']],
    geometry=gpd.points_from_xy(a.longitude, a.latitude))

b_gdf = gpd.GeoDataFrame(
    b[['ID']],
    geometry=[box(*bounds) for _, bounds in
              b.loc[:, ['min_longitude', 'min_latitude',
                        'max_longitude', 'max_latitude']].iterrows()])
gpd.sjoin(a_gdf, b_gdf)
Output:
   location                                geometry  index_right       ID
0         1             POINT (151.23691 -33.81263)            0  9ae8704
1         2           POINT (151.161274 -33.994823)            3  0uh0478
3         4  POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3            POINT (151.662009 -33.320154)           2  3ef4522
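A small follow-up (my addition, not part of the original answer): to recover a plain ID list in the row order of a, you can sort the join result back by the left index:
joined = gpd.sjoin(a_gdf, b_gdf)
ID_list = joined.sort_index()['ID'].tolist()
print(ID_list)  # ['9ae8704', '0uh0478', '3ef4522', '0uh0478'] for the sample data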
We can create a multi-interval index on b and then use a regular loc lookup with tuples built from the rows of a. Interval indexes are useful in situations like this, when we have a table of low and high values to bucket a variable into.
from io import StringIO
import pandas as pd
a = pd.read_table(StringIO("""
location latitude longitude
1 -33.81263 151.23691
2 -33.994823 151.161274
3 -33.320154 151.662009
4 -33.99019 151.1567332
"""), sep='\s+')
b = pd.read_table(StringIO("""
ID min_latitude max_latitude min_longitude max_longitude
9ae8704 -33.815 -33.810 151.234 151.237
2ju1423 -33.555 -33.543 151.948 151.957
3ef4522 -33.321 -33.320 151.655 151.668
0uh0478 -33.996 -33.990 151.152 151.182
"""), sep='\s+')
lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)
print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())
The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with nan.
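If you want the result attached to a as a column rather than printed, a minimal follow-up sketch (my addition, reusing the same lookup as above) would be:
a['ID'] = b.loc[list(zip(a.latitude, a.longitude)), 'ID'].values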
A good option for this might be to perform a cross-product merge and then drop the rows that don't qualify. For example, you might do:
AB_cross = A.merge(
    B,
    how="cross"
)
Now we have a giant dataframe with all the possible pairings of points in A and boxes in B, whether or not they actually match (we don't know yet). This is fast but creates a large dataset in memory, since the result is 300,000 x 10,000 rows long.
Now, we need to apply our logic by filtering the dataset accordingly. This is a numpy process (as far as I'm aware), so it's vectorized and very fast! I will also say that it might be easier to use between to make your code a bit more semantic.
Note that below I use .between(inclusive='left') to express the condition min_long <= long < max_long (the inclusive side of the inequality is the left one).
ID_list = AB_cross['ID'].loc[
AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive = 'left') &
AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive = 'left')
]
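A possible follow-up (my addition, assuming location uniquely identifies rows in A, as it does in the sample data): keep the matching location alongside the ID so the result can be joined back onto A:
matches = AB_cross.loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left'),
    ['location', 'ID']
]
A_with_id = A.merge(matches, on='location', how='left')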
A reasonably fast approach could be to pre-sort points by latitudes and longitudes, then iterate over boxes finding points inside the box by latitude (lat_min < lat < lat_max) and longitude (lon_min < lon < lon_max) separately with np.searchsorted and then intersecting them with np.intersect1d.
For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.
Here's an example implementation:
# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)
# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()
# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi

ids.tolist()
Output (on provided sample data):
['9ae8704', '0uh0478', '3ef4522', '0uh0478']
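If you then want the IDs attached back to a (my addition; the ids series above was created with a's index, so it is already aligned):
a['ID'] = ids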
Your code takes a certain amount of time to run (timing screenshot omitted), while this code:
A = pd.DataFrame.from_dict(
{'location': {0: 1, 1: 2, 2: 3, 3: 4},
'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
{'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)
def func(latitude, longitude):
    for y, x in B.iterrows():
        if (latitude >= x.min_latitude and latitude < x.max_latitude
                and longitude >= x.min_longitude and longitude < x.max_longitude):
            return x['ID']

A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()
runs in less time (timing screenshot omitted), so the new solution is about 2.33 times faster.
I think Geopandas would be the best solution for making sure all of your edge cases are covered like meridian/equator crossovers and the like - spatial queries are exactly what Geopandas is designed for. It can be a pain to install though.
One naive approach in numpy (assuming that most of the points don't change signs anywhere) would be to calculate each of your clauses as a sign of a difference and then keep only the matches where all the signs match your criteria.
For lots of intensive, repetitive calculations like this, it's usually better to dump it out of pandas into numpy, process and then put it back into pandas.
a_lats = A.latitude.values.reshape(-1, 1)
b_min_lats = B.min_latitude.values.reshape(1, -1)
b_max_lats = B.max_latitude.values.reshape(1, -1)
a_lons = A.longitude.values.reshape(-1, 1)
b_min_lons = B.min_longitude.values.reshape(1, -1)
b_max_lons = B.max_longitude.values.reshape(1, -1)

# each term is +1 where the point satisfies that side of the bounding box
north_of_min_lat = np.sign(a_lats - b_min_lats)
south_of_max_lat = np.sign(b_max_lats - a_lats)
east_of_min_lon = np.sign(a_lons - b_min_lons)
west_of_max_lon = np.sign(b_max_lons - a_lons)
margin_matches = (north_of_min_lat + south_of_max_lat + east_of_min_lon + west_of_max_lon)
match_indexes = (margin_matches == 4).nonzero()
matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
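If you then want the IDs as a column on A (my addition, assuming each point falls into at most one box), you can turn the match pairs into a mapping:
match_map = dict(matches)  # {location: ID}
A['ID'] = A['location'].map(match_map)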
PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
Since my last post lacked information, here is an example of my df (the important columns):
device_id: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
device_id mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), calculating the vehicle's speed from the timestamp and the mileage. The result should then be written to the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0

for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to, however it performs very poorly. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value
df_ori['timeGoneSec'] = \
df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above produces NaN for the first row in each group
# fill 'valid' with 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double check the logic and make sure it works as desired.
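For reference, here is a minimal, self-contained sketch of the same groupby/diff idea applied to the sample rows from the question (the column names and maxSpeedKMH = 80 are assumptions taken from the post):
import pandas as pd

maxSpeedKMH = 80  # assumed max speed from the question

df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})

df_ori = df_ori.sort_values('position_timestamp_measure')
# seconds since the previous message of the same vehicle (NaN for the first message)
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
df_ori['valid'] = 0
# the first message per vehicle cannot be validated, so mark it valid
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
# implied speed in km/h; mark valid where it does not exceed the max speed
speed = df_ori['mileage'] / (df_ori['timeGoneSec'] / 3600)
df_ori.loc[speed <= maxSpeedKMH, 'valid'] = 1
print(df_ori)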
Hi, I have a dataset in the following format.
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
(I input the numbers as strings to show blank cells.)
The first three columns denote the date (Year, Month and Day) and the following columns represent individuals (my actual data file consists of about 300 such rows and about 1000 subjects; I presented a subset of the data here).
The column values refer to expenditure on FMCG products.
What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual, locate the first observation and duplicate its value for at least the previous six months. For example: Subject_C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from February 10, 2008 onwards to be equal to 65 for Subject_C (roughly 2/12/2008 is the cutoff date, so we leave the 3rd cell from the top of Subject_C's column blank).
b) Locate the last observation and repeat it for the following 3 months. For example, for Subject_A, we repeat 35 twice (till 6th November 2008).
Please refer to the diagram with the highlighted cells for the solution (screenshot not reproduced here).
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the rows in between, without caring about time (example screenshot omitted).
ii) Find the total time between the two observations and take half of it. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example, for Subject_A, the total number of days between 01/22/2008 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining days assign 35. In this case the columns for Subject_A and Subject_C would look accordingly (screenshot omitted).
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a, b and i] and Final data II [using a, b and ii] (screenshots not reproduced here).
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow-up question and issues
Thanks @Juan for the initial answer. Here's my follow-up question: suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues
For the current code, I found an issue with part II (ii). This is the output that I get (screenshot omitted):
It is actually on the right track, but the two cells above 35 do not seem to get updated. Is there something wrong on my end? Also, same question as before: would we be able to extend it to the case of more than 2 observations?
Here is a code solution for Subject_A; it should work for the other subjects as well:
import numpy as np
import pandas as pd

d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()
### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs (pd.Timedelta does not support month units, so use DateOffset)
min_cutoff = min_date_subja - pd.DateOffset(months=6)
max_cutoff = max_date_subja + pd.DateOffset(months=3)
## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()
lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
next_date = d1_2i.loc[N,'date']
next_val = d1_2i.loc[N,'Subject_A']
#PART II.i
d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
#PART II.ii
mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
lower_date = next_date
lower_val = next_val
print(d1_2i)
print(d1_2ii)
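If you want to apply the same steps to the other subject columns without copy-pasting, here is a hedged sketch (my addition; fill_subject is a name I introduce, not part of the original answer) that wraps the logic above into a function parameterised by the column name. It assumes the 'date' column created above already exists in d1:
def fill_subject(d1, col):
    # apply Part I (a, b) and Part II (i, ii) to one subject column; returns (d1_2i, d1_2ii)
    d1 = d1.copy()
    d1[col] = d1[col].replace('', np.nan).astype(float)
    notnull = d1[col].notnull()
    if notnull.sum() == 0:
        return d1.copy(), d1.copy()
    min_date = d1.loc[notnull, 'date'].min()
    max_date = d1.loc[notnull, 'date'].max()
    min_val = d1.loc[d1.loc[notnull, 'date'].idxmin(), col]
    max_val = d1.loc[d1.loc[notnull, 'date'].idxmax(), col]
    # Part I.a / I.b
    d1.loc[(d1['date'] < min_date) & (d1['date'] > min_date - pd.DateOffset(months=6)), col] = min_val
    d1.loc[(d1['date'] > max_date) & (d1['date'] < max_date + pd.DateOffset(months=3)), col] = max_val
    # Part II
    d1_2i, d1_2ii = d1.copy(), d1.copy()
    lower_date, lower_val = min_date, min_val
    for n in d1.loc[(d1['date'] > min_date) & notnull].index:
        next_date, next_val = d1.loc[n, 'date'], d1.loc[n, col]
        # II.i: fill the gap with the average of the two observations
        d1_2i.loc[(d1['date'] > lower_date) & (d1['date'] < next_date), col] = np.mean([lower_val, next_val])
        # II.ii: first half of the gap gets the earlier value, second half the later value
        half = pd.Timedelta((next_date - lower_date).days / 2, unit='d')
        d1_2ii.loc[(d1['date'] > lower_date) & (d1['date'] <= lower_date + half), col] = lower_val
        d1_2ii.loc[(d1['date'] > lower_date + half) & (d1['date'] <= next_date), col] = next_val
        lower_date, lower_val = next_date, next_val
    return d1_2i, d1_2ii

d1_2i, d1_2ii = fill_subject(d1, 'Subject_B')
Since the loop walks over every non-null observation after the first one, more than two observations per subject should be handled the same way.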