Comparison between rows based on custom function - python

I have a large data frame such as the one below:
Vehicle longitude latitude trip
0 33 155 0
0 34 156 1
1 32 154 2
1 37 154 5
2 25 145 2
. . . .
. . . .
I have also defined a custom boolean function to check whether a coordinate is inside a specific area.
def check(main_vehicle_latitude, main_vehicle_longitude, radius,
          compare_vehicle_latitude, compare_vehicle_longitude):
    if condition:
        x = True
    return x
Now I want to apply this function to each row of my data frame so that every vehicle/trip is compared with the coordinates of all other vehicles, and for each vehicle I get a list of all the other vehicles that have a similar trip location. For example, the coordinates of vehicle 0 on trip 0 should be compared with all other vehicles to find every vehicle whose start coordinates are similar to those of the first vehicle (trip 0), and the same check should then be repeated for every vehicle trip. It is a bit complicated to explain, but hopefully it is clear enough. I'm looking for a very efficient way to do this since the data frame is large; unfortunately I'm a beginner, so any help would be greatly appreciated.

With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "vehicle": [0, 0, 1, 1, 2, 3, 4, 5],
        "longitude": [33, 34, 32, 37, 25, 33, 37, 33],
        "latitude": [155, 156, 154, 154, 145, 155, 154, 155],
        "trip": [0, 1, 2, 5, 2, 1, 0, 6],
    }
)
print(df)
# Output
   vehicle  longitude  latitude  trip
0        0         33       155     0
1        0         34       156     1
2        1         32       154     2
3        1         37       154     5
4        2         25       145     2
5        3         33       155     1
6        4         37       154     0
7        5         33       155     6
Here is one way to do it with Pandas groupby, explode, and concat:
# Find matches
tmp = df.groupby(["longitude", "latitude"]).agg(list).reset_index(drop=True)
tmp["match"] = tmp.apply(lambda x: 1 if len(x["vehicle"]) > 1 else pd.NA, axis=1)
tmp = tmp.dropna()

# Format results
tmp["match"] = tmp.apply(
    lambda x: [[v, t] for v, t in zip(x["vehicle"], x["trip"])], axis=1
)
tmp = tmp.explode("vehicle")
tmp = tmp.explode("trip")
tmp["match"] = tmp.apply(
    lambda x: x["match"] if [x["vehicle"], x["trip"]] in x["match"] else pd.NA, axis=1
)
tmp = tmp.dropna()
tmp["match"] = tmp.apply(
    lambda x: [p[0] for p in x["match"] if p[0] != x["vehicle"]], axis=1
)

# Add results back to initial dataframe
df = pd.concat(
    [df.set_index(["vehicle", "trip"]), tmp.set_index(["vehicle", "trip"])], axis=1
)
Then:
print(df)
# Output
              longitude  latitude   match
vehicle trip
0       0            33       155  [3, 5]
        1            34       156     NaN
1       2            32       154     NaN
        5            37       154     [4]
2       2            25       145     NaN
3       1            33       155  [0, 5]
4       0            37       154     [1]
5       6            33       155  [0, 3]
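The answer above matches trips whose coordinates are exactly equal. If, as in the original check function, "similar" should instead mean "within a radius", one rough sketch (my own assumption, not part of the answer above, and O(n^2) since it compares every pair of rows) could start again from the toy df defined at the top of this answer:

radius = 1.0  # plain coordinate tolerance, not a geodesic distance (an assumption)
pairs = df.merge(df, how="cross", suffixes=("", "_other"))  # cross merge needs pandas >= 1.2
pairs = pairs[pairs["vehicle"] != pairs["vehicle_other"]]  # never match a vehicle with itself
close = (
    (pairs["longitude"] - pairs["longitude_other"]).abs().le(radius)
    & (pairs["latitude"] - pairs["latitude_other"]).abs().le(radius)
)
matches = (
    pairs.loc[close, ["vehicle", "trip", "vehicle_other"]]
    .drop_duplicates()
    .groupby(["vehicle", "trip"])["vehicle_other"]
    .agg(list)
    .rename("match")
)
result = df.join(matches, on=["vehicle", "trip"])

For a genuinely large frame, a spatial index (e.g. scikit-learn's BallTree) would scale better than this pairwise comparison.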

Related

split column header number and string in python

Here is my toy example. My question is how to create a new column called trial with values {2, 3}, where 2 and 3 come from the number part of the column names 2.0__sum_values, 3.0__sum_values, etc.
My code is:
import pandas as pd

before_spliting = {
    "ID": [1, 2, 3],
    "2.0__sum_values": [33, 28, 40],
    "2.0__mediane": [33, 70, 20],
    "2.0__root_mean_square": [33, 4, 30],
    "3.0__sum_values": [33, 28, 40],
    "3.0__mediane": [33, 70, 20],
    "3.0__root_mean_square": [33, 4, 30],
}
before_spliting = pd.DataFrame(before_spliting)
print(before_spliting)
   ID  2.0__sum_values  2.0__mediane  2.0__root_mean_square  3.0__sum_values  \
0   1               33            33                     33               33
1   2               28            70                      4               28
2   3               40            20                     30               40

   3.0__mediane  3.0__root_mean_square
0            33                     33
1            70                      4
2            20                     30
after_spliting = {
    "ID": [1, 1, 2, 2, 3, 3],
    "trial": [2, 3, 2, 3, 2, 3],
    "sum_values": [33, 33, 28, 28, 40, 40],
    "mediane": [33, 33, 70, 70, 20, 20],
    "root_mean_square": [33, 33, 4, 4, 30, 30],
}
after_spliting = pd.DataFrame(after_spliting)
print(after_spliting)
   ID  trial  sum_values  mediane  root_mean_square
0   1      2          33       33                33
1   1      3          33       33                33
2   2      2          28       70                 4
3   2      3          28       70                 4
4   3      2          40       20                30
5   3      3          40       20                30
You could try:
res = df.melt(id_vars="ID")
res[["trial", "columns"]] = res["variable"].str.split("__", expand=True)
res = (
    res.pivot_table(
        index=["ID", "trial"], columns="columns", values="value", aggfunc=list
    )
    .explode(sorted(set(res["columns"])))
    .reset_index()
)
Result for the following input dataframe
data = {
    "ID": [1, 2, 3],
    "2.0__sum_values": [33, 28, 40],
    "2.0__mediane": [43, 80, 30],
    "2.0__root_mean_square": [37, 4, 39],
    "3.0__sum_values": [34, 29, 41],
    "3.0__mediane": [44, 81, 31],
    "3.0__root_mean_square": [38, 5, 40],
}
df = pd.DataFrame(data)
is
columns  ID trial  mediane  root_mean_square  sum_values
0         1   2.0       43                37          33
1         1   3.0       44                38          34
2         2   2.0       80                 4          28
3         2   3.0       81                 5          29
4         3   2.0       30                39          40
5         3   3.0       31                40          41
Alternative solution with the same output:
res = df.melt(id_vars="ID")
res[["trial", "columns"]] = res["variable"].str.split("__", expand=True)
res = res.set_index(["ID", "trial"]).drop(columns="variable").sort_index()
res = pd.concat(
    (
        group[["value"]].rename(columns={"value": key})
        for key, group in res.groupby("columns")
    ),
    axis=1,
).reset_index()
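Another option (a sketch of my own, not from the answers above; it assumes every non-ID column follows the "<trial>__<statistic>" pattern) is to swap the two halves of each column name and let pd.wide_to_long do the reshaping:

renamed = df.rename(
    columns=lambda c: "__".join(c.split("__")[::-1]) if "__" in c else c
)  # "2.0__sum_values" -> "sum_values__2.0"
res = pd.wide_to_long(
    renamed,
    stubnames=["sum_values", "mediane", "root_mean_square"],
    i="ID",
    j="trial",
    sep="__",
    suffix=r"[\d.]+",
).reset_index()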
Because you are building the frame from a dict literal (curly brackets {}), you cannot have duplicate keys inside it, so you can instead build the new column with a list ([]):
trial = []
for i in range(len(d1)):
    trial.append([d1['2.0__sum_values'][i], d1['3.0__sum_values'][i]])
d1['trial'] = trial
Best of Luck

Conditional Rolling count

Here is my dataframe:
score
1
62
7
15
167
73
25
24
2
76
I want to compare a score with the previous 4 scores and count the number of scores higher than the current one.
This is my expected output:
score count
1
62
7
15
167 0
73 1 (we take the 4 previous scores: 167, 15, 7, 62; only 167 > 73, so the count is 1)
25 2
24 3
2 4
76 0
If somebody has an idea of how to do that, it would be welcome.
Thanks!
I am not sure your expected output matches your description. However, if you only look at the previous 4 elements, you could implement the following:
scores = [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]
highers = []
for index, score in enumerate(scores[4:]):
    higher = len([s for s in scores[index:index + 4] if score < s])
    print(higher)
    highers.append(higher)
print(highers)
# [0, 1, 2, 3, 4, 0]
Then, you could just add this highers list as a pandas column:
df['output'] = [0]*4 + highers
Note that I pad the output here by assigning zeros to the first four values.
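If you would rather stay in pandas, a rolling window of 5 (the current score plus the 4 before it) can do the same count; this is a sketch of my own, not part of the answer above, and the first four rows come out as NaN rather than 0:

df["count"] = df["score"].rolling(5).apply(
    lambda w: (w.iloc[:-1] > w.iloc[-1]).sum(),  # compare the 4 previous scores with the current one
    raw=False,
)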

Select slices/ range of columns for each row in a pandas dataframe

Here is the problem:
import numpy
import pandas
dfl = pandas.DataFrame(numpy.random.randn(30,10))
Now, I want the following cells put in a data frame:
For row 1: columns 3 to 6 (length = 4 cells),
For row 2: columns 4 to 7 (length = 4 cells),
For row 3: columns 1 to 4 (length = 4 cells),
etc.
Each of these range is always 4 cells wide, but the start/end are different columns.
The row-wise start points are in a list [3, 4, 1, ...] and so are the row-wise end points. The list of rows I'm interested in is also a list [1, 2, 3].
Finally, dfl has a datetime index which I would like to preserve
(meaning the end result should be a data frame with index dfl.index[[1, 2, 3]]).
Edit: range exceeds the number of columns
Some of the entries of the vector of row-wise start points are too large (say a row-wise start point of 9 in the example matrix above). In those cases, I just want all the columns from the row-wise start point onward and then as many NaNs as necessary to get the right shape (so since 9 + 4 > 10, the corresponding row of the result data frame should be [9, 10, NaN, NaN]).
Using NumPy broadcasting to create all those column indices and then advanced-indexing into the array data -
import numpy as np
import pandas as pd

def extract_rows(dfl, starts, L, fillval=np.nan):
    a = dfl.values
    idx = np.asarray(starts)[:, None] + range(L)
    valid_mask = idx < dfl.shape[1]
    idx[~valid_mask] = 0
    val = a[np.arange(len(idx))[:, None], idx]
    return pd.DataFrame(np.where(valid_mask, val, fillval))
Sample runs -
In [541]: np.random.seed(0)
In [542]: dfl = pandas.DataFrame(numpy.random.randint(11,99,(3,10)))
In [543]: dfl
Out[543]:
0 1 2 3 4 5 6 7 8 9
0 55 58 75 78 78 20 94 32 47 98
1 81 23 69 76 50 98 57 92 48 36
2 88 83 20 31 91 80 90 58 75 93
In [544]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=np.nan)
Out[544]:
0 1 2 3
0 78.0 78.0 20.0 94.0
1 50.0 98.0 57.0 92.0
2 75.0 93.0 NaN NaN
In [545]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=-1)
Out[545]:
0 1 2 3
0 78 78 20 94
1 50 98 57 92
2 75 93 -1 -1
Or we can use .iloc and enumerate:
l = [3, 4, 1]
pd.DataFrame(data=[dfl.iloc[x:x + 1, y:y + 4].values[0] for x, y in enumerate(l)])
Out[107]:
0 1 2 3
0 1.224124 -0.938459 -1.114081 -1.128225
1 -0.445288 0.445390 -0.154295 -1.871210
2 0.784677 0.997053 2.144286 -0.179895
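Neither snippet above keeps the original (datetime) index or pads out-of-range slices with NaN in one go. A rough per-row sketch of my own (it assumes the columns are labeled by position 0..N-1 as in the toy frame, with rows and starts listing the selected row positions and their start columns) could look like this:

import pandas as pd

def take_slices(dfl, rows, starts, width=4):
    out = []
    for r, s in zip(rows, starts):
        # reindex pads any column labels past the last column with NaN
        out.append(dfl.iloc[r].reindex(range(s, s + width)).to_numpy())
    return pd.DataFrame(out, index=dfl.index[rows])

# e.g. take_slices(dfl, rows=[1, 2, 3], starts=[3, 4, 1])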

merge rows pandas dataframe based on condition

Hi, I have a dataframe df containing a set of events (rows).
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 7, 10],
                        [10, 22, 1, 30],
                        [30, 42, 2, 10],
                        [100, 142, 22, 1],
                        [143, 152, 2, 10],
                        [160, 162, 12, 11]],
                  columns=['Start', 'End', 'Value1', 'Value2'])
df
Out[15]:
   Start  End  Value1  Value2
0      1    2       7      10
1     10   22       1      30
2     30   42       2      10
3    100  142      22       1
4    143  152       2      10
5    160  162      12      11
If 2 (or more) consecutive events are <= 10 apart, I would like to merge them (i.e. use the start of the first event, the end of the last, and sum the values in Value1 and Value2).
In the example above df becomes:
df
Out[15]:
   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22
That's totally possible:
df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start':min, 'End':max, 'Value1':sum, 'Value2': sum})
Explanation:
start_end_differences = df.Start - df.End.shift(1)  # shift moves the series down
threshold_selector = start_end_differences > 10  # boolean series; True marks a gap of more than 10
groups = threshold_selector.cumsum()  # sums up the Trues (1) and creates an integer series starting from 0
df.groupby(groups).agg({'Start': min})  # the aggregation is self-explanatory
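For the example frame above, the intermediate group labels work out as follows (a quick sanity check, not part of the original answer):

print(((df.Start - df.End.shift(1)) > 10).cumsum().tolist())
# [0, 0, 0, 1, 1, 1]  -> rows 0-2 form the first merged event, rows 3-5 the second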
Here is a generalized solution that remains agnostic of the other columns:
cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())
   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df,axis=1)
stock_returns.substract(p_df)
But none of them work; they all return NaN values.
I'm passing it through this function, and using the for loop to get the arguments:
def disp_calc(returns, p, wi):  # apply(disp_calc, rows = ...)
    wi = wi / np.sum(wi)
    rp = (col_len(returns) * (returns - p) ** 2).sum()  # returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean, axis=1)
    portfolio_return = returns_rolling[i]; p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is the result of subtracting the single column in p_df from all 517 columns in the first df, so the final result would still have 517 columns. Thanks.
You're almost there; you just need to set axis=0 to subtract along the index:
>>> stock_returns = pd.DataFrame([[10, 100, 200],
...                               [15, 115, 215],
...                               [20, 120, 220],
...                               [25, 125, 225],
...                               [30, 130, 230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
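Applied to the original frames, the same idea would look something like this (a sketch, assuming p_df's single column is aligned on the same index as stock_returns):

result = stock_returns.sub(p_df.iloc[:, 0], axis=0)  # still 144 rows x 517 columns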
data['new_col3'] = data['col1'] - data['col2']
