Find points in cells through pandas dataframes of coordinates - python

I have to find which points are inside a grid of square cells, given the points coordinates and the coordinates of the bounds of the cells, through two pandas dataframes.
I'm calling dfc the dataframe containing the code and the boundary coordinates of the cells (I simplify the problem, in the real analysis I have a big grid with geographical points and tons of points to check):
Code,minx,miny,maxx,maxy
01,0.0,0.0,2.0,2.0
02,2.0,2.0,3.0,3.0
and dfp the dataframe containing an Id and the coordinates of the points:
Id,x,y
0,1.5,1.5
1,1.1,1.1
2,2.2,2.2
3,1.3,1.3
4,3.4,1.4
5,2.0,1.5
Now I would like to perform a search returning in dfc dataframe a new column (called 'GridCode') of the grid in which the point is in. The cells should be perfectly squared, so I would like to perform an analysis through:
a = np.where(
(dfp['x'] > dfc['minx']) &
(dfp['x'] < dfc['maxx']) &
(dfp['y'] > dfc['miny']) &
(dfp['y'] < dfc['maxy']),
r2['Code'],
'na')
avoiding several loops on the dataframes. The lenghts of the dataframes are not the same. The resulting dataframe should be as follows:
Id x y GridCode
0 0 1.5 1.5 01
1 1 1.1 1.1 01
2 2 2.2 2.2 02
3 3 1.3 1.3 01
4 4 3.4 1.4 na
5 5 2.0 1.5 na
Thanks in advance for your help!

Probably a better way, but since this has been sitting out there for awhile..
Using Pandas boolean indexing to filter the dfc data frame instead of np.where()
def findGrid(dfp):
c = dfc[(dfp['x'] > dfc['minx']) &
(dfp['x'] < dfc['maxx']) &
(dfp['y'] > dfc['miny']) &
(dfp['y'] < dfc['maxy'])].Code
if len(c) == 0:
return None
else:
return c.iat[0]
Then use the pandas apply() function
dfp['GridCode'] = dfp.apply(findGrid,axis=1)
Will yield this
Id x y GridCode
0 0 1.5 1.5 1
1 1 1.1 1.1 1
2 2 2.2 2.2 2
3 3 1.3 1.3 1
4 4 3.4 1.4 NaN
5 5 2.0 1.5 NaN

Related

Find index of first bigger row of current value in pandas dataframe

I have big dataset of values as follow:
column "bigger" would be index of the first row with bigger "bsl" than "mb" from current row. I need to do it without loop as I need it to be done in less than a second. by loop it's over a minute.
For example for the first row (with index 74729) the bigger is going to be 74731. I know it can be done by linq in C# but I'm almost new in python.
here is another example:
here is text version:
index bsl mb bigger
74729 47091.89 47160.00 74731.0
74730 47159.00 47201.00 74735.0
74731 47196.50 47201.50 74735.0
74732 47186.50 47198.02 74735.0
74733 47191.50 47191.50 74735.0
74734 47162.50 47254.00 74736.0
74735 47252.50 47411.50 74736.0
74736 47414.50 47421.00 74747.0
74737 47368.50 47403.00 74742.0
74738 47305.00 47310.00 74742.0
74739 47292.00 47320.00 74742.0
74740 47302.00 47374.00 74742.0
74741 47291.47 47442.50 74899.0
74742 47403.50 47416.50 74746.0
74743 47354.34 47362.50 74746.0
I'm not sure how many rows you have, but if the number is reasonable, you can perform a pairwise comparison:
# get data as arrays
a = df['bsl'].to_numpy()
b = df['mb'].to_numpy()
idx = df.index.to_numpy()
# compare values and mask lower triangle
# to ensure comparing only the greater indices
out = np.triu(a>b[:,None]).argmax(1).astype(float)
# reindex to original indices
idx = idx[out]
# mask invalid indices
idx[out<np.arange(len(out))] = np.nan
df['bigger'] = idx
Output:
bsl mb bigger
0 1 2 2.0
1 2 4 6.0
2 3 3 5.0
3 2 1 3.0
4 3 5 NaN
5 4 2 5.0
6 5 1 6.0
7 1 0 7.0

How to create new column in dataframe based on condition from other dataframe?

df1 = pd.DataFrame({"DEPTH":[0.5, 1, 1.5, 2, 2.5],
"POROSITY":[10, 22, 15, 30, 20],
"WELL":"well 1"})
df2 = pd.DataFrame({"Well":"well 1",
"Marker":["Fm 1","Fm 2"],
"Depth":[0.7, 1.7]})
Hello everyone. I have two dataframes and I would like to create a new column on df1, for example: df1["FORMATIONS"], with information from df2["Marker"] values based on depth limits from df2["Depth"] and df1["DEPTH"].
So, for example, if df2["Depth"] = 1.7, then all samples in df1 with df1["DEPTH"] > 1.7 should be labelled as "Fm 2" in this new column df1["FORMATIONS"].
And the final dataframe df1 should look like this:
DEPTH POROSITY WELL FORMATIONS
0.5 10 well 1 nan
1 22 well 1 Fm 1
1.5 15 well 1 Fm 1
2 30 well 1 Fm 2
2.5 20 well 1 Fm 2
Anyone could help me?
What you're doing here is transforming continuous data into categorical data. There are many ways to do this with pandas, but one of the better known ways is using pandas.cut.
When specifying the bins argument, you need to add float(inf) to the end of the list, to represent that the last bin goes to infinity.
df1["FORMATIONS"] = pd.cut(df1.DEPTH, list(df2.Depth) + [float('inf')], labels=df2.Marker)
df1 will now be:
Use pandas.merge_asof:
NB. the columns used for the merge need to be sorted first
pd.merge_asof(df1,
df2[['Marker', 'Depth']].rename(columns={'Marker': 'Formations'}),
left_on='DEPTH', right_on='Depth')
output:
DEPTH POROSITY WELL Formations Depth
0 0.5 10 well 1 NaN NaN
1 1.0 22 well 1 Fm 1 0.7
2 1.5 15 well 1 Fm 1 0.7
3 2.0 30 well 1 Fm 2 1.7
4 2.5 20 well 1 Fm 2 1.7

Pandas reorder rows of dataframe

I stumble upon very peculiar problem in Pandas. I have this dataframe
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is input file
Please note that Id always starts at 0 up to 7 and repeat and time column is in sequential step (which implies that previous row should be smaller or equal to current one).
I would like to reorder rows of the dataframe as it is below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result
Please note that I need to reorder dataframe rows based on this columns id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you see from the desired result we need to reoder dataframe in that way time column from smallest to largest one this holds true for the rest of columns, id, sim, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby but I failed.
Would you like to help to solve the problem? Any suggestions are welcome.
P.S.
I have paste dataframe so they can be read easily with clipboard function in order to be easily reproducible.
I am attaching pic as well.
What did you try to sort by several columns?
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP, SIM']
df = df.groupby(groupby_cols).reset_index()

How to find the closest element in another column for each element in a column?

The situation is as follows.
I have two pandas dataframes:
df1, which contains a column "p1" with 1895 rows of random numbers ranging from 2.805 to 3.035 (here are the first 20 rows):
p1
0 2.910
1 2.885
2 2.875
3 2.855
4 2.910
5 2.870
6 2.850
7 2.875
8 2.865
9 2.875
10 2.890
11 2.910
12 2.965
13 2.955
14 2.935
15 2.905
16 2.900
17 2.905
18 2.970
19 2.940
df2, which contains two columns, "p2" and "h"
p2 h
0 2.7 256.88
1 2.8 253.52
2 2.9 250.18
3 3.0 246.86
4 3.1 243.55
The aim is to first loop through all rows in df1 and find the closest element in p2 for each row. e.g. for p1[0] = 2.910, the closest element is p2[2] = 2.9.
Then, if these two values are the same, the output for that row is the corresponding value of h
otherwise, the output is the average of the previous and subsequent values of h.
Going back to our example, the output for p1[0] should therefore be (h[1]+h[3])/2
I hope this all makes sense, this is my first question on here :).
Thanks!
This is the usage of merge_asof, notice the allow_exact_matches=True is default as True, for example 2.9 nearest is 2.9 in this case
df1=df1.sort_values('p1')
s1=pd.merge_asof(df1,df2,left_on='p1',right_on='p2',direction='backward')
s2=pd.merge_asof(df1,df2,left_on='p1',right_on='p2',direction='forward')
df1['Value']=(s1.h+s2.h)/2
Another solution with numpy:
import numpy as np
# Generate some test data
x1 = np.random.randint(0,100,10)
x2 = np.vstack([np.random.randint(0,100,10),np.random.normal(0,1,10)]).T
# Repeat the two vectors
X1 = np.tile(x1,(len(x2),1))
X2 = np.tile(x2[:,0],(len(x1),1))
distance = np.abs(X1 - X2.T)
closest_idx = np.argmin(distance,axis=0)
print(x2[closest_idx,1])

Pandas - Outer Join on Column with Repeating Values

This is my first question on Stack Overflow, please let me know how I can help you help me if my question is unclear.
Goal: Use Python and Pandas to Outer join (or merge) Data Sets containing different experimental trials where the "x" axis of each trial is extremely similar but has some deviations. Most importantly, the "x" axis increases, hits a maximum and then decreases, often overlapping with previously existing "x" points.
Problem: When I go to join/merge the datasets on "x", the "x" column is sorted, messing up the order of the collected data and making it impossible to plot it correctly.
Here is a small example of what I am trying to do:
Wouldn't let me add pictures because I am new. Here is the code to generate these example data sets.
Data Sets :
Import:
import numpy as np
import pandas as pd
import random as rand
Code :
T1 = {'x':np.array([1,1.5,2,2.5,3,3.5,4,5,2,1]),'y':np.array([10000,8500,7400,6450,5670,5100,4600,4500,8400,9000]),'z':np.array(rand.sample(range(0,10000),10))}'
T2 = {'x':np.array([1,2,3,4,5,6,7,2,1.5,1]),'y':np.array([10500,7700,5500,4560,4300,3900,3800,5400,8400,8800]),'z':np.array(rand.sample(range(0,10000),10))}
Trial1 = pd.DataFrame(T1)
Trial2 = pd.DataFrame(T2)
Attempt to Merge/Join:
WomboCombo = Trial1.join(Trial2,how='outer',lsuffix=1,rsuffix=2, on='x')
WomboCombo2 = pd.merge(left=Trial1, right= Trial2, how = 'outer', left
Attempt to split into two parts, increasing and decreasing part (manually found row number where data "x" starts decreasing):
Trial1Inc = Trial1[0:8]
Trial2Inc = Trial2[0:7]
Result - Merge works well, join messes with the "x" column, not sure why:
Trial1Inc.merge(Trial2Inc,on='x',how='outer', suffixes=[1,2])
Incrementing section Merge Result
Trial1Inc.join(Trial2Inc,on='x',how='outer', lsuffix=1,rsuffix=2)
Incrementing section Join Result
Hopefully my example is clear, the "x" column in Trial 1 increases until 5, then decreases back towards 0. In Trial 2, I altered the test a bit because I noticed that I needed data at a slightly higher "x" value. Trial 2 Increases until 7 and then quickly decreases back towards 0.
My end goal is to plot the average of all y values (where there is overlap between the trials) against the corresponding x values.
If there is overlap I can add error bars. Pandas is almost perfect for what I am trying to do because an Outer join adds null values where there is no overlap and is capable of horizontally concatenating the two trials when there is overlap.
All thats left now is to figure out how to join on the "x" column but maintain its order of increasing values and then decreasing values. The reason it is important for me to first increase "x" and then decrease it is because when looking at the "y" values, it seems as though the initial "y" value at a given "x" is greater than the "y" value when "x" is decreasing (E.G. in trial 1 when x=1, y=10000, however, later in the trial when we come back to x=1, y=9000, this trend is important. When Pandas sorts the column before merging, instead of there being a clean curve showing a decrease in "y" as "x" increases and then the reverse, there are vertical downward jumps at any point where the data was joined.
I would really appreciate any help with either:
A) a perfect solution that lets me join on "x" when "x" contains duplicates
B) an efficient way to split the data sets into increasing "x" and decreasing "x" so that I can merge the increasing and decreasing sections of each trial separately and then vertically concat them.
Hopefully I did an okay job explaining the problem I would like to solve. Please let me know if I can clarify anything,
Thanks for the help!
I think #xyzjayne idea of splitting the dataframe is a great idea.
Splitting Trial1 and Trial2:
# index of max x value in Trial2
t2_max_index = Trial2.index[Trial2['x'] == Trial2['x'].max()].tolist()
# split Trial2 by max value
trial2_high = Trial2.loc[:t2_max_index[0]].set_index('x')
trial2_low = Trial2.loc[t2_max_index[0]+1:].set_index('x')
# index of max x value in Trial1
t1_max_index = Trial1.index[Trial1['x'] == Trial1['x'].max()].tolist()
# split Trial1 by max vlaue
trial1_high = Trial1.loc[:t1_max_index[0]].set_index('x')
trial1_low = Trial1.loc[t1_max_index[0]+1:].set_index('x')
Once we split the dataframes we join the highers together and the lowers together:
WomboCombo_high = trial1_high.join(trial2_high, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
WomboCombo_low = trial1_low.join(trial2_low, how='outer', lsuffix='1', rsuffix='2', on='x').reset_index()
We now combine them toegther to have one dataframe WomboCombo
WomboCombo = WomboCombo_high.append(WomboCombo_low)
OUTPUT:
x y1 z1 y2 z2
0 1.0 10000.0 3425.0 10500.0 3061.0
1 1.5 8500.0 5059.0 NaN NaN
2 2.0 7400.0 2739.0 7700.0 7090.0
3 2.5 6450.0 9912.0 NaN NaN
4 3.0 5670.0 2099.0 5500.0 1140.0
5 3.5 5100.0 9637.0 NaN NaN
6 4.0 4600.0 7581.0 4560.0 9584.0
7 5.0 4500.0 8616.0 4300.0 3940.0
8 6.0 NaN NaN 3900.0 5896.0
9 7.0 NaN NaN 3800.0 6211.0
0 2.0 8400.0 3181.0 5400.0 9529.0
2 1.5 NaN NaN 8400.0 3260.0
1 1.0 9000.0 4280.0 8800.0 8303.0
One possible solution is to give you trial rows specific IDs an then merge on the IDs. Should keep the x values from being sorted.
Here's what I was trying out, but it doesn't address varying numbers of data points. I like gym-hh's answer, though it's not clear to me that you wanted two columns of y,z pairs. So you could combine his ideas and this code to get what you need.
Trial1['index1'] = Trial1.index
Trial2['index1'] = Trial2.index
WomboCombo = Trial1.append(Trial2)
WomboCombo.sort_values(by=['index1'],inplace=True)
WomboCombo
Output:
x y z index1
0 1.0 10000 7148 0
0 1.0 10500 2745 0
1 1.5 8500 248 1
1 2.0 7700 9505 1
2 2.0 7400 6380 2
2 3.0 5500 3401 2
3 2.5 6450 6183 3
3 4.0 4560 5281 3
4 3.0 5670 99 4
4 5.0 4300 8864 4
5 3.5 5100 5132 5
5 6.0 3900 7570 5
6 4.0 4600 9951 6
6 7.0 3800 7447 6
7 2.0 5400 3713 7
7 5.0 4500 3863 7
8 1.5 8400 8776 8
8 2.0 8400 1592 8
9 1.0 9000 2167 9
9 1.0 8800 782 9

Categories

Resources