Run a function and sum with the next row - python

My dataset df looks like this:
main_id  time        day  lat         long
1        2019-05-31  1    53.5501667  9.9716466
1        2019-05-31  1    53.6101545  9.9568781
1        2019-05-30  1    53.5501309  9.9716300
1        2019-05-30  2    53.5501309  9.9716300
1        2019-05-30  2    53.4561309  9.1246300
2        2019-06-31  4    53.5501667  9.9716466
2        2019-06-31  4    53.6101545  9.9568781
I want to find the total distance covered by each main_id item for each day. To calculate the distance between two sets of coordinates, I can use this function:
import geopy.distance

def find_kms(coords_1, coords_2):
    return geopy.distance.geodesic(coords_1, coords_2).km
but I am not sure how to sum it while grouping by main_id and day. The end result could be a new df like this:
main_id  day  total_dist  time
1        1    ...         2019-05-31
1        2    ...         2019-05-31
2        4    ...         2019-05-31
Where the derived time is any value (for example, the first) taken from the time column of the corresponding main_id and day group.
total_dist calculation:
For example, for main_id == 1 and day 1, total_dist would be calculated like this:
find_kms((53.5501667, 9.9716466), (53.6101545, 9.9568781)) + find_kms((53.6101545, 9.9568781), (53.5501309, 9.9716300))

Note that your function is not vectorized, which makes this a bit awkward.
(df.assign(dist=df.join(df.groupby(['main_id', 'day'])[['lat', 'long']].shift(),
                        rsuffix='1')
                  .bfill()
                  .reset_index()
                  .groupby('index')
                  .apply(lambda x: find_kms(x[['lat', 'long']].values,
                                            x[['lat1', 'long1']].values)))
   .groupby(['main_id', 'day'])['dist'].sum()
   .reset_index())
   main_id  day       dist
0        1    1  13.499279
1        1    2  57.167034
2        2    4   6.747748
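For comparison, here is a minimal, explicitly non-vectorized sketch of the same idea, assuming the df and find_kms from the question; the helper name group_distance is just illustrative:
def group_distance(g):
    # sum the geodesic distance between each pair of consecutive points in one group
    coords = list(zip(g['lat'], g['long']))
    return sum(find_kms(a, b) for a, b in zip(coords, coords[1:]))

total = (df.groupby(['main_id', 'day'])
           .apply(group_distance)
           .rename('total_dist')
           .reset_index())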
Another option is to use reduce:
from functools import reduce

def total_dist(x):
    coords = x[['lat', 'long']].values
    # accumulator is a tuple of (distance so far, previous coordinate)
    lam = lambda acc, y: (find_kms(acc[1], y) + acc[0], y)
    dist = reduce(lam, coords, (0, coords[0]))[0]
    return pd.Series({'dist': dist})
df.groupby(['main_id', 'day']).apply(total_dist).reset_index()
   main_id  day       dist
0        1    1  13.499351
1        1    2  57.167033
2        2    4   6.747775
EDIT:
If count is needed:
(pd.DataFrame({'count': df.groupby(['main_id', 'day']).main_id.count()})
   .join(df.groupby(['main_id', 'day']).apply(total_dist)))
Out[157]:
             count       dist
main_id day
1       1        3  13.499351
        2        2  57.167033
2       4        2   6.747775
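The count and the distance can also be produced in a single groupby pass; a sketch reusing the total_dist helper defined above:
out = (df.groupby(['main_id', 'day'])
         .apply(lambda g: pd.Series({'count': len(g),            # rows per group
                                     'dist': total_dist(g)['dist']})))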

Just another approach: convert the lat/long values to UTM coordinates, without using the find_kms function:
import pandas as pd
import numpy as np
import utm

# utm.from_latlon returns (easting, northing, zone_number, zone_letter)
u = utm.from_latlon(df.lat.values, df.long.values)
df['x'], df['y'] = u[0], u[1]  # easting and northing in metres
a = df.groupby(['main_id', 'day'])[['x', 'y']].apply(lambda g: g.diff().replace(np.nan, 0))
df['dist'] = np.sqrt(a.x**2 + a.y**2) / 1000  # distance in km
df1 = (df.groupby(['main_id', 'day'])[['dist', 'day']]
         .agg({'day': 'count', 'dist': 'sum'})
         .rename(columns={'day': 'day_count'}))
df1 output:
             day_count       dist
main_id day
1       1            3  13.494554
        2            2  57.145276
2       4            2   6.745386
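None of the summaries above carry the time column the question asked for. As a sketch, a representative time (here the first value per group) can be joined onto the UTM result df1, since df1 is indexed by (main_id, day):
first_time = df.groupby(['main_id', 'day'])['time'].first()
df1 = df1.join(first_time)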

Related

Extract a window of rows following a set of ones in a pandas dataframe

I have a pandas dataframe that looks like the following:
Day val
Day1 0
Day2 0
Day3 0
Day4 0
Day5 1
Day6 1
Day7 1
Day8 1
Day9 0
Day10 0
Day11 0
Day12 1
Day13 1
Day14 1
Day15 1
Day16 0
Day17 0
Day18 0
Day19 0
Day20 0
Day21 1
Day22 0
Day23 1
Day24 1
Day25 1
I am looking to extract at most 2 rows where val = 0, but only those where the preceding rows are a set of 1's.
For example:
There is a set of ones from Day5 to Day8 (an event). I would need to look at at most two rows after the end of the event, so here it's Day9 and Day10.
Similarly, Day21 is a single-day event, and I need to look into only Day22 since it is the single zero that follows the event.
For the table data above, the output would be the following:
Day    val
Day9     0
Day10    0
Day16    0
Day17    0
Day22    0
We can simplify the condition for each row to:
The val value should be 0
The previous day or the day before that should have a val of 1
In code:
cond = (df['val'].shift(1) == 1) | (df['val'].shift(2) == 1)
df.loc[(df['val'] == 0) & cond]
Result:
      Day  val
8    Day9    0
9   Day10    0
15  Day16    0
16  Day17    0
21  Day22    0
Note: If more than 2 days should be considered this can easily be added to the condition cond. In this case, cond can be constructed with a list comprehension and np.any(), for example:
import numpy as np

n = 2
cond = np.any([df['val'].shift(s) == 1 for s in range(1, n + 1)], axis=0)
df.loc[(df['val'] == 0) & cond]
You can compute a mask from the rolling max per group, where a new group starts at each 1->0 transition, and combine it with a second mask selecting the rows where the value is 0:
N = 2
o2z = df['val'].diff().eq(-1)            # True at each 1 -> 0 transition
m1 = (o2z.groupby(o2z.cumsum())          # a new group starts at each transition
         .rolling(N, min_periods=1).max()
         .astype(bool).values)           # True for the first N rows of each group
m2 = df['val'].eq(0)                     # rows where val is 0
df[m1 & m2]
Output:
      Day  val
8    Day9    0
9   Day10    0
15  Day16    0
16  Day17    0
21  Day22    0

Calculate an np.arange within a Pandas dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass to reach a certain goal. This should be stored as a list in a pandas column.
To start with I have this:
import numpy as np
import pandas as pd

cars = pd.DataFrame({'x_now': np.repeat(1, 5),
                     'y_now': np.arange(5, 0, -1),
                     'x_1_goal': np.repeat(1, 5),
                     'y_1_goal': np.repeat(10, 5)})
output would be:
   x_now  y_now  x_1_goal  y_1_goal
0      1      5         1        10
1      1      4         1        10
2      1      3         1        10
3      1      2         1        10
4      1      1         1        10
I have tried to add new columns like this, and it does not work:
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index, 'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(
            cars.at[xy_index, 'x_now'].astype(int),
            abs(cars.at[xy_index, 'y_now'].astype(int) - cars.at[xy_index, 'y_1_goal'].astype(int)))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index, 'x_now'], cars.at[xy_index, 'x_1_goal'],
                      (cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now']) /
                      abs(cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now']))
At the end I want the columns x_car_move_route and y_car_move_route so I can loop over the coordinates that the cars need to pass. I will show it with tkinter. I will also add more goals, since this is actually only the first turn that they need to make.
   x_now  y_now  x_1_goal  y_1_goal     x_car_move_route     y_car_move_route
0      1      5         1        10          [1,1,1,1,1]         [6,7,8,9,10]
1      1      4         1        10        [1,1,1,1,1,1]       [5,6,7,8,9,10]
2      1      3         1        10      [1,1,1,1,1,1,1]     [4,5,6,7,8,9,10]
3      1      2         1        10    [1,1,1,1,1,1,1,1]   [3,4,5,6,7,8,9,10]
4      1      1         1        10  [1,1,1,1,1,1,1,1,1]  [2,3,4,5,6,7,8,9,10]
You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        # this axis does not move: repeat the current value once per step on the other axis
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        # step one unit at a time towards the goal, shifted so the start point is excluded
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result
cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
   x_now  y_now  x_1_goal  y_1_goal     x_car_move_route     y_car_move_route
0      1      5         1        10          [1,1,1,1,1]         [6,7,8,9,10]
1      1      4         1        10        [1,1,1,1,1,1]       [5,6,7,8,9,10]
2      1      3         1        10      [1,1,1,1,1,1,1]     [4,5,6,7,8,9,10]
3      1      2         1        10    [1,1,1,1,1,1,1,1]   [3,4,5,6,7,8,9,10]
4      1      1         1        10  [1,1,1,1,1,1,1,1,1]  [2,3,4,5,6,7,8,9,10]
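Since the stated goal is to loop over the coordinates and draw them with tkinter, a minimal usage sketch over the two route columns might look like this (the print call is only a placeholder for the actual canvas update):
for _, row in cars.iterrows():
    for x, y in zip(row['x_car_move_route'], row['y_car_move_route']):
        print(x, y)  # placeholder: move the car's canvas item to (x, y) here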

Correlation between binary variables in pandas

I am trying to calculate the correlation between binary variables using Cramér's statistic:
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
However I do not know how to apply the code above within my dataset:
      CL  UP  NS  P  CL_S
480    1   0   1  0     1
1232   1   0   1  0     1
2308   1   1   1  0     1
1590   1   0   1  0     1
497    1   1   0  0     1
...   ..  ..  ..  ..   ..
1066   1   1   1  0     1
1817   1   0   1  0     1
2411   1   1   1  0     1
2149   1   0   1  0     1
1780   1   0   1  0     1
I would appreciate your help in guiding me
The function you wrote is not well suited to your dataset as it stands. Instead, use the function cramers_V(var1, var2) given below.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))  # build the cross table
    stat = chi2_contingency(crosstab)[0]  # keep the test statistic of the chi2 test
    obs = np.sum(crosstab)                # number of observations
    mini = min(crosstab.shape) - 1        # minimum of (rows, columns) of the cross table, minus 1
    return stat / (obs * mini)
Example code using the function is as follows:
cramers_V(df["CL"], df["NS"])
If you want to calculate it for all possible column pairs of your dataset, use this code:
import itertools

for col1, col2 in itertools.combinations(df.columns, 2):
    print(col1, col2, cramers_V(df[col1], df[col2]))
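If you would rather have the result as a matrix than as printed pairs, here is a sketch that collects the pairwise values into a symmetric DataFrame, reusing cramers_V and the imports above:
cols = df.columns
corr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)  # diagonal of 1s
for col1, col2 in itertools.combinations(cols, 2):
    v = cramers_V(df[col1], df[col2])
    corr.loc[col1, col2] = corr.loc[col2, col1] = v
print(corr)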

Increment count in Pandas dataframe when values change and when time exceeds condition

Each time the value of Person changes, reset the count to 0. Increment the count each time the time is more than 2 greater than the group's initial time.
Person  Time  DesiredResult
B          0              0
B          2              0
B          2              0
B          4              1
A          0              0
S          0              0
S          1              0
S          2              0
S          4              1
S          8              2
Code to generate the data frame:
df = pd.DataFrame({'Person': ['Bob', 'Bob', 'Bob', 'Bob', 'Alvin',
                              'Steve', 'Steve', 'Steve', 'Steve', 'Steve'],
                   'Time': [0, 2, 2, 4, 0, 0, 1, 2, 4, 8],
                   'DesiredResult': [0, 0, 0, 1, 0, 0, 0, 0, 1, 2]})
I assume that you do not have the DesiredResult column yet.
To create it, define the following function:
def fn(grp):
    t0 = grp.iloc[0].Time
    return (grp.Time > t0 + 2).cumsum()
Then run:
df['DesiredResult'] = df.groupby('Person').apply(fn).reset_index(level=0, drop=True)

create a summary of movements between prices by date in a pandas dataframe

I have a dataframe which shows: 1) dates, 2) prices and 3) the difference between two consecutive prices by row.
dates  data  result  change
24-09    24       0  none
25-09    26       2  pos
26-09    27       1  pos
27-09    28       1  pos
28-09    26      -2  neg
I want to create a summary of the above data in a new dataframe. The summary would have 4 columns: 1) start date, 2) end date, 3) number of days, 4) run.
For example, using the above there was a positive run of +4 from 25-09 to 27-09, so I would want this as a row in the new dataframe.
In the new dataframe there would be one new row for every change in the value of result from positive to negative. Where run = 0 this indicates no change from the previous day's price and would also need its own row in the dataframe:
start date  end date  num days  run
25-09       27-09            3    4
27-09       28-09            1   -2
23-09       24-09            1    0
The first step I think would be to create a new column "change" based on the value of run which then shows either of: "positive","negative" or "no change". Then maybe I could groupby this column.
A couple of useful functions for this style of problem are diff() and cumsum().
I added some extra datapoints to your sample data to flesh out the functionality.
The ability to pick and choose different (and more than one) aggregation functions assigned to different columns is a super feature of pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09',
                             '30-09', '01-10', '02-10', '03-10', '04-10'],
                   'data': [24, 26, 27, 28, 26, 25, 30, 30, 30, 28, 25],
                   'result': [0, 2, 1, 1, -2, 0, 5, 0, 0, -2, -3]})

def cat(x):
    return 1 if x > 0 else -1 if x < 0 else 0

df['cat'] = df['result'].map(lambda x: cat(x))  # probably there is a better way to do this
df['change'] = df['cat'].diff()
df['change_flag'] = df['change'].map(lambda x: 1 if x != 0 else x)
df['change_cum_sum'] = df['change_flag'].cumsum()  # which gives us our groupings
foo = df.groupby(['change_cum_sum']).agg({'dates': [np.min, np.max, 'count'], 'result': np.sum})
foo.reset_index(inplace=True)
foo.columns = ['id', 'start date', 'end date', 'num days', 'run']
print(foo)
which yields:
   id start date end date  num days  run
0   1      24-09    24-09         1    0
1   2      25-09    27-09         3    4
2   3      28-09    28-09         1   -2
3   4      29-09    29-09         1    0
4   5      30-09    30-09         1    5
5   6      01-10    02-10         2    0
6   7      03-10    04-10         2   -5
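As a side note, the same aggregation can be written with pandas named aggregation (pandas 0.25+), which avoids renaming the columns by position; a sketch assuming the change_cum_sum column built above:
foo = (df.groupby('change_cum_sum')
         .agg(start_date=('dates', 'min'),
              end_date=('dates', 'max'),
              num_days=('dates', 'count'),
              run=('result', 'sum'))
         .reset_index())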
