Calculate time difference between 2 adjacent datapoints for each user - python

I have the following dataframe:
import pandas as pd

df = pd.DataFrame(
    {'user_id': [53, 53, 53, 53, 53, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54],
     'timestamp': [10, 15, 20, 25, 30, 31, 34, 37, 14, 16, 18, 20, 22, 25, 28],
     'activity': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                  'D', 'D', 'D', 'D', 'D', 'D', 'D']}
)
df
df
user_id timestamp activity
0 53 10 A
1 53 15 A
2 53 20 A
3 53 25 A
4 53 30 A
5 53 31 A
6 53 34 A
7 53 37 A
8 54 14 D
9 54 16 D
10 54 18 D
11 54 20 D
12 54 22 D
13 54 25 D
14 54 28 D
I want to calculate the time difference between every
2 adjacent datapoints (rows) within each user_id and plot the CDF
per activity, assuming each user starts a new activity at 0 seconds. The timestamp column represents a unix timestamp; I show only the last 2 digits for brevity.
Target df (required result):
user_id timestamp activity timestamp_diff
0 53 10 A 0
1 53 15 A 5
2 53 20 A 5
3 53 25 A 5
4 53 30 A 5
5 53 31 A 1
6 53 34 A 3
7 53 37 A 3
8 54 14 D 0
9 54 16 D 2
10 54 18 D 2
11 54 20 D 2
12 54 22 D 2
13 54 25 D 3
14 54 28 D 3
My attempts (to calculate the time differences):
df['shift1'] = df.groupby('user_id')['timestamp'].shift(1, fill_value=0)
df['shift2'] = df.groupby('user_id')['timestamp'].shift(-1, fill_value=0)
df['diff1'] = df.timestamp - df.shift1
df['diff2'] = df.shift2 - df.timestamp
df['shift3'] = df.groupby('user_id')['timestamp'].shift(-1)
df['shift3'] = df['shift3'].ffill()
df['diff3'] = df.shift3 - df.timestamp
df
user_id timestamp activity shift1 shift2 diff1 diff2 shift3 diff3
0 53 10 A 0 15 10 5 15.0 5.0
1 53 15 A 10 20 5 5 20.0 5.0
2 53 20 A 15 25 5 5 25.0 5.0
3 53 25 A 20 30 5 5 30.0 5.0
4 53 30 A 25 31 5 1 31.0 1.0
5 53 31 A 30 34 1 3 34.0 3.0
6 53 34 A 31 37 3 3 37.0 3.0
7 53 37 A 34 0 3 -37 37.0 0.0
8 54 14 D 0 16 14 2 16.0 2.0
9 54 16 D 14 18 2 2 18.0 2.0
10 54 18 D 16 20 2 2 20.0 2.0
11 54 20 D 18 22 2 2 22.0 2.0
12 54 22 D 20 25 2 3 25.0 3.0
13 54 25 D 22 28 3 3 28.0 3.0
14 54 28 D 25 0 3 -28 28.0 0.0
I cannot reach the target; none of the diff1, diff2, or diff3 columns matches timestamp_diff.

IIUC you are looking for diff:
df['timestamp_diff'] = df.groupby('user_id')['timestamp'].diff().fillna(0).astype(int)
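The question also asks for the CDF per activity; here is a minimal plotting sketch with matplotlib (an assumption on my part, since no plotting library is specified), using the timestamp_diff column computed above:
import numpy as np
import matplotlib.pyplot as plt

for activity, grp in df.groupby('activity'):
    diffs = np.sort(grp['timestamp_diff'].to_numpy())
    cdf = np.arange(1, len(diffs) + 1) / len(diffs)  # empirical CDF: fraction of gaps <= x
    plt.step(diffs, cdf, where='post', label=activity)

plt.xlabel('time difference (s)')
plt.ylabel('CDF')
plt.legend()
plt.show()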

Related

Sum up the previous 3 rows and multiply by values from another column using pandas

I have 2 dataframes. For each row, grouped by unique id, I want the weighted sum of the previous 3 rows, where each value is multiplied by a weight from the other dataframe.
for example:
dataframe A (with the expected out_value column):
   unique_id  value  out_value
1          1     45
2          1     33
3          1     18
4          1     26      20.70
5          2     66
6          2     44
7          2     22
8          2     19      28.38
dataframe B:
   num_values
0        0.15
1        0.30
2        0.18
expected out_value column:
4th row = 18*0.15 + 33*0.30 + 45*0.18 = 2.7 + 9.9 + 8.1 = 20.70
8th row = 22*0.15 + 44*0.30 + 66*0.18 = 3.3 + 13.2 + 11.88 = 28.38
Based on unique_id, each value should be calculated from the previous 3 values in that group;
for every such row, the previous 3 rows are available.
import pandas as pd
import numpy as np
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 2, 2, 2, 2, 152, 152, 152, 152, 152],
    'value': [45, 33, 18, 26, 66, 44, 22, 19, 36, 27, 45, 81, 90]
}, index=range(1, 14))
df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id value
1 1 45
2 1 33
3 1 18
4 1 26
5 2 66
6 2 44
7 2 22
8 2 19
9 152 36
10 152 27
11 152 45
12 152 81
13 152 90
df_b
###
num_values
0 0.15
1 0.30
2 0.18
# main calculation: for every row, collect the previous 3 values (most recent first)
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
# weighted sum of each window with num_values
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean: keep only rows with at least 3 prior rows in their uni_id group
mask = df_a.groupby('uni_id').cumcount() + 1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id value out_value
1 1 45 NaN
2 1 33 NaN
3 1 18 NaN
4 1 26 20.70
5 2 66 NaN
6 2 44 NaN
7 2 22 NaN
8 2 19 28.38
9 152 36 NaN
10 152 27 NaN
11 152 45 NaN
12 152 81 21.33
13 152 90 30.51
If you want to keep only the non-null values, filter with:
df_a.query('out_value.notnull()')
###
uni_id value out_value
4 1 26 20.70
8 2 19 28.38
12 152 81 21.33
13 152 90 30.51
Grouping by uni_id and Year_Month
Data preparation:
# create a date range series with 5-day frequency
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
rng.integers(10, 100, 26)  # throwaway draw: it advances the generator, so the values shown below depend on it
date_range = pd.Series(pd.date_range(start='01.30.2020', periods=27, freq='5D')).dt.to_period('M')
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
               152, 152, 152, 152, 152, 152, 152, 152, 152, 152],
    'Year_Month': date_range,
    'value': rng.integers(10, 100, 26)
}, index=range(1, 27))
df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id Year_Month value
1 1 2020-02 46
2 1 2020-02 84
3 1 2020-02 59
4 1 2020-02 49
5 1 2020-02 50
6 1 2020-02 30
7 1 2020-03 18
8 1 2020-03 59
9 2 2020-03 89
10 2 2020-03 15
11 2 2020-03 87
12 2 2020-03 84
13 2 2020-04 34
14 2 2020-04 66
15 2 2020-04 24
16 2 2020-04 78
17 152 2020-04 73
18 152 2020-04 41
19 152 2020-05 16
20 152 2020-05 97
21 152 2020-05 50
22 152 2020-05 90
23 152 2020-05 71
24 152 2020-05 80
25 152 2020-06 78
26 152 2020-06 27
Processing
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean: require at least 3 prior rows within each (uni_id, Year_Month) group
mask = df_a.groupby(['uni_id', 'Year_Month']).cumcount() + 1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id Year_Month value out_value
1 1 2020-02 46 NaN
2 1 2020-02 84 NaN
3 1 2020-02 59 NaN
4 1 2020-02 49 40.17
5 1 2020-02 50 32.82
6 1 2020-02 30 28.32
7 1 2020-03 18 NaN
8 1 2020-03 59 NaN
9 2 2020-03 89 NaN
10 2 2020-03 15 NaN
11 2 2020-03 87 NaN
12 2 2020-03 84 41.4
13 2 2020-04 34 NaN
14 2 2020-04 66 NaN
15 2 2020-04 24 NaN
16 2 2020-04 78 30.78
17 152 2020-04 73 NaN
18 152 2020-04 41 NaN
19 152 2020-05 16 NaN
20 152 2020-05 97 NaN
21 152 2020-05 50 NaN
22 152 2020-05 90 45.96
23 152 2020-05 71 46.65
24 152 2020-05 80 49.5
25 152 2020-06 78 NaN
26 152 2020-06 27 NaN
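The shift-based window construction above builds windows over the whole frame and then masks out rows near group starts, which leans on index alignment between arr_b and mask and is easy to get subtly wrong. A perhaps more readable alternative is groupby plus rolling, which respects group boundaries by construction; rows without 3 prior in-group values come out as NaN automatically. This is a sketch under one assumption taken from the worked example: the weights are ordered most-recent-first, so 0.15 applies to the value immediately before the current row.
import numpy as np

# weights are most-recent-first; reverse them to line up with a chronological window
w_chrono = df_b['num_values'].to_numpy()[::-1]

df_a['out_value'] = (
    df_a.groupby('uni_id')['value']  # or ['uni_id', 'Year_Month'] for the monthly variant
        .transform(lambda s: s.shift(1)
                              .rolling(3)
                              .apply(lambda win: np.dot(win, w_chrono), raw=True))
)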

Rearranging Pandas Dataframe

I have a DataFrame as follows:
import pandas as pd

d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
     'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
     'Yr1': [11, 21, 31, 41, 51, 61],
     'Yr2': [12, 22, 32, 42, 52, 62],
     'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
name var Yr1 Yr2 Yr3
a v1 11 12 13
a v2 21 22 23
a v3 31 32 33
b v1 41 42 43
b v2 51 52 53
b v3 61 62 63
and I want to rearrange it to look like this:
name Yr v1 v2 v3
a 1 11 21 31
a 2 12 22 32
a 3 13 23 33
b 1 41 51 61
b 2 42 52 62
b 3 43 53 63
I am new to pandas and tried using other threads I found here but struggled to make it work. Any help would be much appreciated.
Try this
import pandas as pd
d = {'name': ['a', 'a', 'a', 'b', 'b', 'b'],
     'var': ['v1', 'v2', 'v3', 'v1', 'v2', 'v3'],
     'Yr1': [11, 21, 31, 41, 51, 61],
     'Yr2': [12, 22, 32, 42, 52, 62],
     'Yr3': [13, 23, 33, 43, 53, 63]}
df = pd.DataFrame(d)
# Solution
df.set_index(['name', 'var'], inplace=True)
df = df.unstack().stack(0)
print(df.reset_index())
output:
var name level_1 v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Reference: pandas.DataFrame.stack
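Note that this output still carries the literal Yr1..Yr3 labels in level_1 rather than the numeric Yr of the target; a small follow-up sketch (my addition, assuming the reshaped frame from above) finishes the conversion:
out = df.reset_index().rename(columns={'level_1': 'Yr'})
out['Yr'] = out['Yr'].str.replace('Yr', '', regex=False).astype(int)  # 'Yr1' -> 1, etc.
out.columns.name = None  # drop the leftover 'var' axis name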
Try groupby apply:
df.groupby("name").apply(
lambda x: x.set_index("var").T.drop("name")
).reset_index().rename(columns={"level_1": "Yr"}).rename_axis(columns=None)
name Yr v1 v2 v3
0 a Yr1 11 21 31
1 a Yr2 12 22 32
2 a Yr3 13 23 33
3 b Yr1 41 51 61
4 b Yr2 42 52 62
5 b Yr3 43 53 63
Or better:
df.pivot("var", "name", ["Yr1", "Yr2", "Yr3"]).T.sort_index(
level=1
).reset_index().rename({"level_0": "Yr"}, axis=1).rename_axis(columns=None)
Yr name v1 v2 v3
0 Yr1 a 11 21 31
1 Yr2 a 12 22 32
2 Yr3 a 13 23 33
3 Yr1 b 41 51 61
4 Yr2 b 42 52 62
5 Yr3 b 43 53 63
We can use pd.wide_to_long + df.unstack here.
pd.wide_to_long doc:
With stubnames [‘A’, ‘B’], this function expects to find one or more groups of columns with format A-suffix1, A-suffix2,…, B-suffix1, B-suffix2,… You specify what you want to call this suffix in the resulting long format with j (for example j=’year’).
pd.wide_to_long(
    df, stubnames="Yr", i=["name", "var"], j="Y"
).squeeze().unstack(level=1).reset_index()
var name Y v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63
We can use df.melt + df.pivot here.
out = df.melt(id_vars=['name', 'var'], var_name='Yr')
out['Yr'] = out['Yr'].str.replace('Yr', '')
out.pivot(index=['name', 'Yr'], columns='var', values='value').reset_index()
var name Yr v1 v2 v3
0 a 1 11 21 31
1 a 2 12 22 32
2 a 3 13 23 33
3 b 1 41 51 61
4 b 2 42 52 62
5 b 3 43 53 63

pandas dataframe in def

I tried the code below to pass a df into a function. The first line works fine with df.dropna; however, df.replace does not do the replacement as I expected.
def Max(df):
    df.dropna(subset=df.columns[3:10], inplace=True)
    print(df)
    df.replace(to_replace=65535, value=-10, inplace=True)
    print(df)
    return df
Does anyone know the issue and how to solve it?
Your code works well. Maybe try this version without inplace modifications:
>>> df
A B C D E F G H I J
0 1 2 3 4 5.0 6 7 8 9 10.0
1 11 65535 13 14 15.0 16 17 18 19 20.0
2 21 22 23 24 25.0 26 27 28 29 NaN
3 65535 32 33 34 NaN 36 37 38 39 40.0
4 41 42 65535 44 45.0 46 47 48 49 50.0
5 51 52 53 54 55.0 56 57 58 59 60.0
def Max(df):
    return df.dropna(subset=df.columns[3:10]).replace(65535, -10)
>>> Max(df)
A B C D E F G H I J
0 1 2 3 4 5.0 6 7 8 9 10.0
1 11 -10 13 14 15.0 16 17 18 19 20.0
4 41 42 -10 44 45.0 46 47 48 49 50.0
5 51 52 53 54 55.0 56 57 58 59 60.0
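The underlying point is Python's name binding: inplace=True mutates the very frame the caller passed in, while a plain assignment inside the function only rebinds the local name and leaves the caller's frame untouched. A small illustrative sketch with a hypothetical toy frame:
import pandas as pd

df = pd.DataFrame({'x': [1, 65535, 3]})

def replace_inplace(frame):
    frame.replace(65535, -10, inplace=True)  # mutates the caller's frame

def replace_rebind(frame):
    frame = frame.replace(65535, -10)  # rebinds the local name only; caller unaffected

replace_rebind(df)
print(df)  # still contains 65535
replace_inplace(df)
print(df)  # 65535 replaced by -10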

Python: Lookup value in header of another data frame and replace/map the corresponding value

I have a data frame with index members which looks like this (A,B,C,... are the company names):
df_members
Date 1 2 3 4
0 2016-01-01 A B C D
1 2016-01-02 B C D E
2 2016-01-03 C D E F
3 2016-01-04 F A B C
4 2016-01-05 B C D E
5 2016-01-06 A B C D
and I have a second table including e.g. prices:
df_prices
Date A B C D E F
0 2015-12-30 1 2 3 4 5 6
1 2015-12-31 7 8 9 10 11 12
2 2016-01-01 13 14 15 16 17 18
3 2016-01-02 20 21 22 23 24 25
4 2016-01-03 27 28 29 30 31 32
5 2016-01-04 34 35 36 37 38 39
6 2016-01-05 41 42 43 44 45 46
7 2016-01-06 48 49 50 51 52 53
The goal is to replace all company names in df_members with the corresponding price from df_prices, resulting in df_result:
df_result
Date 1 2 3 4
0 2016-01-01 13 14 15 16
1 2016-01-02 21 22 23 24
2 2016-01-03 29 30 31 32
3 2016-01-04 39 34 35 36
4 2016-01-05 42 43 44 45
5 2016-01-06 48 49 50 51
I already have a solution where I iterate through all cells of df_members, look up the values in df_prices, and write them to a new data frame df_result. The problem is that my data frames are very large and this process takes around 7 hours.
I already tried the merge/join, map, and lookup functions, but they could not solve the problem.
My approach is the following:
# Create new dataframes
df_result = pd.DataFrame(columns=df_members.columns, index=unique_dates_list)
# Load prices
df_prices = prices
# Search ticker & write values in new dataframe
for i in range(0, len(df_members)):
    for j in range(0, len(df_members.columns)):
        if str(df_members.iloc[i, j]) != 'nan' and df_members.iloc[i, j] in df_prices.columns:
            df_result.iloc[i, j] = df_prices.iloc[i, df_prices.columns.get_loc(df_members.iloc[i, j])]
Question: Is there a way to map the values more efficiently?
pandas DataFrame.lookup() will do what you need:
Code:
df_result = pd.DataFrame(columns=[], index=df_members.index)
for column in df_members.columns:
    df_result[column] = df_prices.lookup(
        df_members.index, df_members[column])
Test Code:
import pandas as pd
from io import StringIO

df_members = pd.read_fwf(StringIO(
u"""
Date        1  2  3  4
2016-01-01  A  B  C  D
2016-01-02  B  C  D  E
2016-01-03  C  D  E  F
2016-01-04  F  A  B  C
2016-01-05  B  C  D  E
2016-01-06  A  B  C  D"""
), header=1).set_index('Date')
df_prices = pd.read_fwf(StringIO(
u"""
Date         A   B   C   D   E   F
2015-12-30   1   2   3   4   5   6
2015-12-31   7   8   9  10  11  12
2016-01-01  13  14  15  16  17  18
2016-01-02  20  21  22  23  24  25
2016-01-03  27  28  29  30  31  32
2016-01-04  34  35  36  37  38  39
2016-01-05  41  42  43  44  45  46
2016-01-06  48  49  50  51  52  53"""
), header=1).set_index('Date')
df_result = pd.DataFrame(columns=[], index=df_members.index)
for column in df_members.columns:
    df_result[column] = df_prices.lookup(
        df_members.index, df_members[column])
print(df_result)
Results:
1 2 3 4
Date
2016-01-01 13 14 15 16
2016-01-02 21 22 23 24
2016-01-03 29 30 31 32
2016-01-04 39 34 35 36
2016-01-05 42 43 44 45
2016-01-06 48 49 50 51
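Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on current pandas, a rough equivalent using positional indexing looks like this (a sketch, assuming the indexed frames from the test code above):
rows = df_prices.index.get_indexer(df_members.index)  # row position of each date
df_result = pd.DataFrame(index=df_members.index, columns=df_members.columns)
for column in df_members.columns:
    cols = df_prices.columns.get_indexer(df_members[column])  # column position of each name
    df_result[column] = df_prices.to_numpy()[rows, cols]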

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
import pandas as pd

data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column – let's name it answer_1_perc – the first value would be 57 (because 51 is 57% of 90), the next value would be 35 (15 is 35% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.
You almost had it; note that you have string values in the Respondents col, which I've corrected prior to calling the following:
In [172]:
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
            .div(df['Respondents'], axis=0)
            .mul(100)
            .rename(columns=lambda col: 'percentage_' + col)
         )
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
Another solution: div the desired columns by the Respondents column and assign the result to the new column names:
print('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
df['percentage_' + df.columns[1:4]] = df.iloc[:, 1:4].div(df.Respondents, axis=0) * 100
print(df)
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
