merge rows pandas dataframe based on condition - python

Hi, I have a dataframe df containing a set of events (rows).
df = pd.DataFrame(data=[[1, 2, 7, 10],
                        [10, 22, 1, 30],
                        [30, 42, 2, 10],
                        [100, 142, 22, 1],
                        [143, 152, 2, 10],
                        [160, 162, 12, 11]],
                  columns=['Start', 'End', 'Value1', 'Value2'])
df
Out[15]:
   Start  End  Value1  Value2
0      1    2       7      10
1     10   22       1      30
2     30   42       2      10
3    100  142      22       1
4    143  152       2      10
5    160  162      12      11
If 2 (or more) consecutive events are <= 10 apart (i.e. the Start of an event is at most 10 after the End of the previous one), I would like to merge them: use the Start of the first event, the End of the last, and sum the values in Value1 and Value2.
In the example above df becomes:
df
Out[15]:
   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22

That's totally possible:
df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start':min, 'End':max, 'Value1':sum, 'Value2': sum})
Explanation:
start_end_differences = df.Start - df.End.shift(1)  # shift(1) moves the End series down one row
threshold_selector = start_end_differences > 10  # boolean series; True marks a gap of more than 10
groups = threshold_selector.cumsum()  # cumulative sum of the Trues creates an integer group id starting at 0
df.groupby(groups).agg({'Start': min})  # the aggregation is self-explanatory
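For the sample data, the intermediate series come out as follows (a quick check you can run after the lines above):
print(start_end_differences.tolist())  # [nan, 8.0, 8.0, 58.0, 1.0, 8.0]
print(groups.tolist())  # [0, 0, 0, 1, 1, 1] -> two groups to aggregate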
Here is a generalized solution that remains agnostic of the other columns:
cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())
   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22
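Equivalently (a small variant of the line above), you can fold the sums into a single agg call by building the mapping programmatically from the cols and grps defined above:
agg_map = {'Start': 'min', 'End': 'max', **{c: 'sum' for c in cols}}
print(df.groupby(grps).agg(agg_map))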

Related

How to merge 2 lists and sort them based on index?

I have the following dataframe:
idx  val1  val2
0    15    12
1    14    38
2    11    88
3    95    21
4    19    98
5    12    48
6    35    38
7    25    39
8    65    28
I created two lists based on the index, say:
list1 = [0, 3, 6]
list2 = [5, 8]
I tried to write code where the indices in list1 take data from val1 and the indices in list2 take data from val2, and the result is sorted by index.
My output list should be
output = [15, 95, 48, 35, 28]
The solution is something like:
pd.concat([df1, df2], axis=0).sort_index()
where df1 and df2 are the two selections. Please provide a minimal reproducible example to get a solution specific to your task.
Try:
x = df.loc[df['idx'].isin(list1), 'val1']
y = df.loc[df['idx'].isin(list2), 'val2']
x = pd.concat([x, y]).sort_index().to_list()
print(x)
Prints:
[15, 95, 48, 35, 28]
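An equivalent variant (a sketch, assuming the values in idx are unique) sets idx as the index first, so label-based lookup replaces the isin masks:
x = df.set_index('idx').loc[list1, 'val1']
y = df.set_index('idx').loc[list2, 'val2']
print(pd.concat([x, y]).sort_index().to_list())
# [15, 95, 48, 35, 28]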

Comparison between rows based on custom function

I have a large data frame such as the one below:
Vehicle  longitude  latitude  trip
0        33         155       0
0        34         156       1
1        32         154       2
1        37         154       5
2        25         145       2
...      ...        ...       ...
I also defined a custom boolean function to check whether a coordinate is inside a specific area.
def check(main_vehicle_latitude, main_vehicle_longitude, radius,
          compare_vehicle_latitude, compare_vehicle_longitude):
    if condition:  # placeholder for the actual distance/area test
        x = True
    return x
Now I want to apply this function to each row of my data frame, comparing each vehicle/trip with the coordinates of all other vehicles, so that the final output is, for each vehicle, a list of all other vehicles with a similar trip location. For example, the coordinates of vehicle 0 on trip 0 should be compared with all other vehicles to find those with similar start coordinates, and the same check should run for every vehicle trip. It is a bit complicated to explain, but hopefully it is clear enough. I'm looking for a very efficient way to do this since the data frame is large; unfortunately I'm a beginner, so any help would be greatly appreciated.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "vehicle": [0, 0, 1, 1, 2, 3, 4, 5],
        "longitude": [33, 34, 32, 37, 25, 33, 37, 33],
        "latitude": [155, 156, 154, 154, 145, 155, 154, 155],
        "trip": [0, 1, 2, 5, 2, 1, 0, 6],
    }
)
print(df)
# Output
   vehicle  longitude  latitude  trip
0        0         33       155     0
1        0         34       156     1
2        1         32       154     2
3        1         37       154     5
4        2         25       145     2
5        3         33       155     1
6        4         37       154     0
7        5         33       155     6
Here is one way to do it with Pandas groupby, explode, and concat:
# Find matches
tmp = df.groupby(["longitude", "latitude"]).agg(list).reset_index(drop=True)
tmp["match"] = tmp.apply(lambda x: 1 if len(x["vehicle"]) > 1 else pd.NA, axis=1)
tmp = tmp.dropna()

# Format results
tmp["match"] = tmp.apply(
    lambda x: [[v, t] for v, t in zip(x["vehicle"], x["trip"])], axis=1
)
tmp = tmp.explode("vehicle")
tmp = tmp.explode("trip")
tmp["match"] = tmp.apply(
    lambda x: x["match"] if [x["vehicle"], x["trip"]] in x["match"] else pd.NA, axis=1
)
tmp = tmp.dropna()
tmp["match"] = tmp.apply(
    lambda x: [p[0] for p in x["match"] if p[0] != x["vehicle"]], axis=1
)

# Add results back to initial dataframe
df = pd.concat(
    [df.set_index(["vehicle", "trip"]), tmp.set_index(["vehicle", "trip"])], axis=1
)
Then:
print(df)
# Output
             longitude  latitude   match
vehicle trip
0       0           33       155  [3, 5]
        1           34       156     NaN
1       2           32       154     NaN
        5           37       154     [4]
2       2           25       145     NaN
3       1           33       155  [0, 5]
4       0           37       154     [1]
5       6           33       155  [0, 3]
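If "similar trip location" simply means an exact match on the coordinate pair, a self-merge is a shorter alternative; here is a sketch (starting again from the toy dataframe defined above, before the concat) that reproduces the same match lists:
# pair every row with every other row sharing the same coordinates
matches = df.merge(df, on=["longitude", "latitude"], suffixes=("", "_other"))
matches = matches[matches["vehicle"] != matches["vehicle_other"]]
# collect, per (vehicle, trip), the other vehicles seen at the same spot
out = (
    matches.groupby(["vehicle", "trip"])["vehicle_other"]
    .agg(list)
    .rename("match")
)
print(df.set_index(["vehicle", "trip"]).join(out))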

Conditional Rolling count

Here is my dataframe:
score
1
62
7
15
167
73
25
24
2
76
I want to compare a score with the previous 4 scores and count the number of scores higher than the current one.
This is my expected output:
score  count
1
62
7
15
167    0
73     1    (we take the 4 previous scores: 167, 15, 7, 62; only 167 > 73, so the count is 1)
25     2
24     3
2      4
76     0
If somebody has an idea of how to do that, it would be welcome. Thanks!
I do not think your expected output matches your question. However, if you do look only at the previous 4 elements, then you could implement the following:
scores = [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]
highers = []
for index, score in enumerate(scores[4:]):
    higher = len([s for s in scores[index:index + 4] if score < s])
    print(higher)
    highers.append(higher)
print(highers)
# [0, 1, 2, 3, 4, 0]
Then, you could just add this highers list as a pandas column:
df['output'] = [0]*4 + highers
Note that I pad the output in such a way that the first four values are assigned zeros.
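A pandas alternative (a sketch, assuming the scores live in a column named score): roll a window of 5 over the column and compare the last element of each window with the four before it:
import pandas as pd

df = pd.DataFrame({"score": [1, 62, 7, 15, 167, 73, 25, 24, 2, 76]})
# each window holds the current score (last position) plus the 4 previous ones
df["count"] = df["score"].rolling(5).apply(
    lambda w: (w.iloc[:4] > w.iloc[4]).sum(), raw=False
)
print(df["count"].tolist())
# [nan, nan, nan, nan, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0]
The first four entries come out as NaN rather than 0, since a full window of 4 previous scores does not exist there.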

How to find the single largest value across all rows and columns in Python and also show its row and column index

I am new to Python. I want to find the single largest value in a grid and also show its respective row and column index labels in the output. The value should be compared in absolute terms (irrespective of + or - sign).
My data set:
EleNo.  Exat0  Exat10  Exat20  Exat30  Exat40  Exat50
1000    10     20      -30     23      28      18
2536    -20    -36     -33     -38     2       -10
3562    3      4       8       8       34      4
2561    2      4       7       6       22      20
I tried df.abs().max(), but it shows the max value for every column, and only positive values. I want the absolute max value, with its original sign.
Expected result:
EleNo.: 2536
Exat30 : -38
Actual result:
Element No. 3562
Exat0: 20
Exat10: 36
Exat20: 33
Exat30: 38
Exat40: 34
Exat50: 20
Use numpy.unravel_index to get the indices, then create the result DataFrame with the constructor and indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Exat0': [10, -20, 3, 2],
                   'Exat10': [20, -36, 4, 4],
                   'Exat20': [-30, -33, 8, 7],
                   'Exat30': [23, -38, 8, 6],
                   'Exat40': [28, 2, 34, 22],
                   'Exat50': [18, -10, 4, 20]}, index=[1000, 2536, 3562, 2561])
df.index.name='EleNo.'
print (df)
Exat0 Exat10 Exat20 Exat30 Exat40 Exat50
EleNo.
1000 10 20 -30 23 28 18
2536 -20 -36 -33 -38 2 -10
3562 3 4 8 8 34 4
2561 2 4 7 6 22 20
a = df.abs().values
r, c = np.unravel_index(a.argmax(), a.shape)
print (r, c)
1 3
df1 = pd.DataFrame(df.values[r, c],
                   columns=[df.columns.values[c]],
                   index=[df.index.values[r]])
df1.index.name='EleNo.'
print (df1)
Exat30
EleNo.
2536 -38
Another, pandas-only solution with DataFrame.abs, DataFrame.stack and Series.idxmax to get the indices of the max value:
r1, c1 = df.abs().stack().idxmax()
Last select by DataFrame.loc:
df1 = df.loc[[r1], [c1]]
print (df1)
Exat30
EleNo.
2536 -38
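For reference (a quick check on the sample df), the stacked idxmax alone already yields both labels as a tuple:
print(df.abs().stack().idxmax())
# (2536, 'Exat30')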
EDIT:
df = pd.DataFrame({'Exat0': [10, -20, 3, 2],
                   'Exat10': [20, -36, 4, 4],
                   'Exat20': [-30, -33, 8, 7],
                   'Exat30': [23, -38, 8, 6],
                   'Exat40': [28, 2, 34, -38],
                   'Exat50': [18, -10, 4, 20]}, index=[1000, 2536, 3562, 2561])
df.index.name='EleNo.'
print (df)
Exat0 Exat10 Exat20 Exat30 Exat40 Exat50
EleNo.
1000 10 20 -30 23 28 18
2536 -20 -36 -33 -38 2 -10
3562 3 4 8 8 34 4
2561 2 4 7 6 -38 20
s = df.abs().stack()
mask = s == s.max()
df1 = df.stack()[mask].unstack()
print (df1)
Exat30 Exat40
EleNo.
2536 -38.0 NaN
2561 NaN -38.0
df2 = df.stack()[mask].reset_index()
df2.columns = ['EleNo.','cols','values']
print (df2)
EleNo. cols values
0 2536 Exat30 -38
1 2561 Exat40 -38
Use a combination of max() and dropna()
First create a dataframe:
df = pd.DataFrame(np.random.randn(4,4))
0 1 2 3
0 0.051775 0.352410 -0.451630 -0.452446
1 -1.434128 0.516264 -0.807776 -0.077892
2 1.615521 0.870604 -0.010285 -0.322280
3 -0.027598 1.046129 -0.165166 0.365150
Calculate max() twice to get the maximum absolute value in the dataframe, then cut out the rows and columns that are all nan. Comparing against df.abs() matches on magnitude while keeping the original sign:
result = df[df.abs() == df.abs().max().max()].dropna(axis=0, how="all").dropna(axis=1, how="all")
print(result)
0
2 1.615521
Finally, get the column and row value, plus the max value.
max_value = result.values.item()
max_column = result.columns.values[0]
max_row = result.index.values[0]
print('max_value', max_value, 'max_column', max_column,'max_row', max_row)
max_value 1.615520522284493 max_column 0 max_row 2
Your problem is that you have forgotten to tell pandas that the column EleNo. is the index. After that, things are simpler: build a series with the max of the absolute value of each line, take the index of the max of that series, and use it to find the required line in the original dataframe. The code could be:
s = df.set_index('EleNo.').apply(np.absolute).max(axis=1)
print(df[df['EleNo.'] == s[s == s.max()].index[0]])
Display is as expected:
EleNo. Exat0 Exat10 Exat20 Exat30 Exat40 Exat50
1 2536 -20 -36 -33 -38 2 -10

Large Dataframe Column multiplication

I have a very large dataframe
in>> all_data.shape
out>> (228714, 436)
What I would like to do efficiently is multiply many of the columns together. I started with a for loop and a list of columns; the most efficient way I have found is:
from itertools import combinations
newcolnames = list(all_data.columns.values)
newcolnames = newcolnames[0:87]
# make cross products (the columns I want to operate on are the first 87)
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data[c1] * all_data[c2]
The problem, as one may guess, is that I have 87 columns, which gives on the order of 3,800 new columns (yes, this is what I intended). Both my Jupyter notebook and IPython shell choke on this calculation. I need to figure out a better way to undertake this multiplication.
Is there a more efficient way to vectorize and/or process this? Perhaps using a NumPy array (my dataframe has been processed and now contains only numbers and NaNs; it started with categorical variables).
As you have mentioned NumPy in the question, that might be a viable option here, especially because you might want to work in the 2D space of NumPy instead of 1D columnar processing with pandas. To start off, you can convert the dataframe to a NumPy array with a call to np.array, like so -
arr = np.array(df) # df is the input dataframe
Now, you can get the pairwise combinations of the column IDs and then index into the columns and perform column-wise multiplications and all of this would be done in a vectorized manner, like so -
idx = np.array(list(combinations(newcolnames, 2)))
out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
Sample run -
In [117]: arr = np.random.randint(0,9,(4,8))
     ...: newcolnames = [1,4,5,7]
     ...: for c1, c2 in combinations(newcolnames, 2):
     ...:     print(arr[:,c1] * arr[:,c2])
     ...:
[16 2 4 56]
[64 2 6 16]
[56 3 0 24]
[16 4 24 14]
[14 6 0 21]
[56 6 0 6]
In [118]: idx = np.array(list(combinations(newcolnames, 2)))
...: out = arr[:,idx[:,0]]*arr[:,idx[:,1]]
...:
In [119]: out.T
Out[119]:
array([[16,  2,  4, 56],
       [64,  2,  6, 16],
       [56,  3,  0, 24],
       [16,  4, 24, 14],
       [14,  6,  0, 21],
       [56,  6,  0,  6]])
Finally, you can create the output dataframe with proper column headers (if needed), like so -
>>> headers = ['{0}*{1}'.format(idx[i,0],idx[i,1]) for i in range(len(idx))]
>>> out_df = pd.DataFrame(out,columns = headers)
>>> df
0 1 2 3 4 5 6 7
0 6 1 1 6 1 5 6 3
1 6 1 2 6 4 3 8 8
2 5 1 4 1 0 6 5 3
3 7 2 0 3 7 0 5 7
>>> out_df
1*4 1*5 1*7 4*5 4*7 5*7
0 1 5 3 5 3 15
1 4 3 8 12 32 24
2 0 6 3 0 0 18
3 14 0 14 0 49 0
You can try the df.eval() method:
for c1, c2 in combinations(newcolnames, 2):
    all_data['{0}*{1}'.format(c1, c2)] = all_data.eval('{} * {}'.format(c1, c2))
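One more sketch worth considering (my own suggestion, not from the answers above): inserting ~3,800 columns one at a time fragments the DataFrame, so building all the products in a dict and attaching them with a single concat can reduce the memory churn:
from itertools import combinations
import pandas as pd

# build every pairwise product in one pass, then attach them in one concat
products = {
    '{0}*{1}'.format(c1, c2): all_data[c1] * all_data[c2]
    for c1, c2 in combinations(newcolnames, 2)
}
all_data = pd.concat([all_data, pd.DataFrame(products, index=all_data.index)], axis=1)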
