Populating each calendar month with a specific incrementing pattern - python

Given a pandas DataFrame indexed by a timeseries, e.g.
import pandas as pd
import numpy as np
index = pd.date_range('2023-01-01', '2023-12-31', freq='1D')
pd.DataFrame({'a' : np.random.randint(0, 10, len(index))}, index=index)
a
2023-01-01 3
2023-01-02 2
2023-01-03 1
2023-01-04 3
2023-01-05 8
... ..
2023-12-27 2
2023-12-28 2
2023-12-29 0
2023-12-30 1
2023-12-31 7
How can I add a new column populated with an incrementing pattern within each calendar month? E.g. b: day_of_month / days_in_month,
a b
2023-01-01 0 0.032258
2023-01-02 5 0.064516
2023-01-03 2 0.096774
2023-01-04 7 0.129032
2023-01-05 4 0.161290
... .. ...
2023-12-27 6 0.870968
2023-12-28 5 0.903226
2023-12-29 8 0.935484
2023-12-30 2 0.967742
2023-12-31 9 1.000000
Such that the following pattern is created:
[plot of column b over the year: a ramp climbing from roughly 0.03 to 1.0, resetting at the start of each calendar month]
It is a bit convoluted, but this can be accomplished using the following procedure:
Assign the day of month to a new column.
df['b'] = df.index.day
a b
2023-01-01 3 1
2023-01-02 0 2
2023-01-03 0 3
2023-01-04 8 4
2023-01-05 2 5
... .. ..
2023-12-27 2 27
2023-12-28 1 28
2023-12-29 3 29
2023-12-30 1 30
2023-12-31 4 31
Assign the last value of b from a monthly resample to a third column. Because the resampled result shares the DataFrame's DatetimeIndex, each month's length lands on its month-end row and is NaN everywhere else.
df['c'] = df.resample('1M').b.last().copy()
a b c
2023-01-01 3 1 NaN
2023-01-02 0 2 NaN
2023-01-03 0 3 NaN
2023-01-04 8 4 NaN
2023-01-05 2 5 NaN
... .. .. ...
2023-12-27 2 27 NaN
2023-12-28 1 28 NaN
2023-12-29 3 29 NaN
2023-12-30 1 30 NaN
2023-12-31 4 31 31.0
Back-fill the helper column, then apply the desired operation (in the asked case, division).
(optional) Dispose of the helper column, c.
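In code (mirroring the complete example below):
df.c = df.c.bfill()
df.b /= df.c
df.drop(['c'], inplace=True, axis=1)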
a b
2023-01-01 3 0.032258
2023-01-02 0 0.064516
2023-01-03 0 0.096774
2023-01-04 8 0.129032
2023-01-05 2 0.161290
... .. ...
2023-12-27 2 0.870968
2023-12-28 1 0.903226
2023-12-29 3 0.935484
2023-12-30 1 0.967742
2023-12-31 4 1.000000
A complete example, extending the code from the question:
import pandas as pd
import numpy as np
index = pd.date_range('2023-01-01', '2023-12-31', freq='1D')
df = pd.DataFrame({'a' : np.random.randint(0, 10, len(index))}, index=index)
df['b'] = df.index.day                         # day of month: 1..31
df['c'] = df.resample('1M').b.last().copy()    # days in the month, on month-end rows only ('M' is spelled 'ME' in newer pandas)
df.c = df.c.bfill()                            # back-fill the month length over the whole month
df.b /= df.c                                   # b = day_of_month / days_in_month
df.drop(['c'], inplace=True, axis=1)           # dispose of the helper column
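For completeness, the helper-column procedure can be skipped entirely: a DatetimeIndex exposes both the day of the month and the number of days in the month, so the same column b is a one-liner:
df['b'] = df.index.day / df.index.days_in_month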

Related

(Python) How to calculate the average over a time period?

I have a dataFrame and I am trying to add a new column that calculates the average amount spent with a card over the last 3 days.
I have tried using df[avg_card_7days] = df.groupby('card')['amount'].resample('3D', on = 'date').mean()
The dataFrame currently looks like:
card date amount
1 2/1/10 50
2 2/1/10 40
3 2/1/10 10
1 2/2/10 20
2 2/2/10 30
3 2/2/10 30
1 2/3/10 10
2 2/3/10 30
3 2/3/10 20
...
But I am looking for this result:
card date amount avg_card_3days
1 2/1/10 50 NaN
2 2/1/10 40 NaN
3 2/1/10 10 NaN
1 2/2/10 20 NaN
2 2/2/10 30 NaN
3 2/2/10 30 NaN
1 2/3/10 10 26.67
2 2/3/10 30 33.33
3 2/3/10 20 20.00
...
Any help would be greatly appreciated!
df['date'] = pd.to_datetime(df.date, format='%m/%d/%y')
df = df.set_index('date')
df['avg_card_3days'] = df.groupby('card').expanding(3).amount.agg('mean').droplevel(0).sort_index()
df = df.reset_index()
df
Output
date card amount avg_card_3days
0 2010-02-01 1 50 NaN
1 2010-02-01 2 40 NaN
2 2010-02-01 3 10 NaN
3 2010-02-02 1 20 NaN
4 2010-02-02 2 30 NaN
5 2010-02-02 3 30 NaN
6 2010-02-03 1 10 26.666667
7 2010-02-03 2 30 33.333333
8 2010-02-03 3 20 20.000000
Explanation
Convert the date column to datetime and set it as the index.
Group the df by card and take the expanding mean with a minimum of 3 observations (expanding(3)), assigning it to the new column.
Reset the index to get the required output.
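Note that expanding(3) averages everything seen so far once at least 3 observations are available; with only 3 days of data this coincides with a 3-day window. If a strict trailing 3-day window is wanted instead, a time-based rolling per group is a possible alternative (a sketch, assuming date is the index as above):
df['avg_card_3days'] = (df.groupby('card')['amount']
                          .rolling('3D', min_periods=3)
                          .mean()
                          .droplevel(0)
                          .sort_index())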

How to use the intersection of 2 dataframe as index and then divide one by another

Here I have 2 dataframes. I want to divide the intersecting rows of these 2 dataframes: first find all codes that belong to both dataframes, then divide each element of df1 by the corresponding element of df2. Please note that the codes of df1 and df2 are not necessarily of the same length or in the same order. (code is the index, not one of the columns in the dataframes.)
df1:
code 20180101 20180102 ... 20181231
001 3 5 ... 5
002 2 1 ... 10
003 1 1 ... 5
...
1230 1 2 ... 0.5
1231 2 2 ... 5
df2:
code 20180101 20180102 ... 20181231
001 6 10 ... 10
002 4 3 ... 2
004 1 1 ... 5
...
1230 4 3 ... 1
1231 2 2 ... 5
I tried merging these two dataframes first, but I don't know what to do next. Is there a more efficient way? My ideal result is:
code 20180101 20180102 ... 20181231
001 0.5 0.5 ... 0.5
002 0.5 0.3333 ... 5
...
1230 0.25 0.6667 ... 2
1231 1 1 ... 1
Since df1 and df2 share the code index, though not necessarily with the same length or order, try:
df3 = (df1 / df2).dropna()
Test Run:
print(df1)
Output:
20180101 20180102 20181231
code
1 3 5 5
2 2 1 10
3 1 1 5
4 10 20 30
print(df2)
Output:
20180101 20180102 20181231
code
1 6 10 10
2 4 3 2
3 1 1 5
5 20 30 40
df3 = (df1 / df2).dropna()
print(df3)
Output:
20180101 20180102 20181231
code
1 0.5 0.500000 0.5
2 0.5 0.333333 5.0
3 1.0 1.000000 1.0

Is there any way to access the values of a groupby in python

I am working on a project where I was able to group by 7D, and now I want to access the grouped elements.
Here is the code:
group = df.set_index('date').groupby('user').resample('7D', convention='start', label='left')
group_result = pd.DataFrame({'Weekly_in_averge_amount': group.mean()['value'], 'Weekly_in_max_amount': group.max()['value'], 'Weekly_in_min_amount': group.min()['value'], 'Weekly_in_totalamount': group.sum()['value'], 'Weekly_in_degree': group.sum()['inputs'], 'monthdays': group.count()['month']})
groupUser = group_result.groupby('user').first()
I got this output
29 1.512015 ... 1.049153
30 34.896646 ... 26.350528
37 0.055000 ... 0.002245
38 0.835067 ... 0.102253
39 38.044883 ... 9.317114
40 1.476168 ... 0.090378
41 1.000000 ... 0.061224
42 8.976852 ... 0.183201
43 0.012000 ... 0.000490
44 2.377267 ... 0.048516
45 1.365204 ... 284.463992
For example, user 29 has transactions grouped one week at a time. Is it possible to display the grouped values for user 29, like this:
user date Weekly_in_averge_amount count
29 2011-05-25 1.512015 ... 34
29 2011-06-01 1.123298 ... 23
As we can see, user 29's rows have been grouped one week at a time. How can I get the rows grouped by each week?
Note that there are 34 rows grouped into the first group.
sorry if my explanation is not clear
Thank you for any help
Regards,
Khaled
You can use GroupBy.agg with a dictionary of column names and aggregate functions, then flatten the MultiIndex columns by joining the levels, and finally rename:
np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=10)
df = pd.DataFrame({'date': rng,
                   'value': range(10),
                   'inputs': range(3, 13),
                   'month': np.random.randint(1, 7, size=10),
                   'user': ['a'] * 3 + ['b'] * 3 + ['c'] * 4})
print (df)
date value inputs month user
0 2017-04-03 0 3 6 a
1 2017-04-04 1 4 3 a
2 2017-04-05 2 5 5 a
3 2017-04-06 3 6 3 b
4 2017-04-07 4 7 2 b
5 2017-04-08 5 8 4 b
6 2017-04-09 6 9 3 c
7 2017-04-10 7 10 4 c
8 2017-04-11 8 11 2 c
9 2017-04-12 9 12 2 c
df1 = (df.set_index('date')
         .groupby('user')
         .resample('7D', convention='start', label='left')
         .agg({'value': ['mean', 'max', 'min', 'sum'],
               'inputs': 'sum',
               'month': 'count'}))
df1.columns = df1.columns.map('_'.join)
d = {'value_max': 'Weekly_in_max_amount',
     'value_min': 'Weekly_in_min_amount',
     'value_sum': 'Weekly_in_totalamount',
     'inputs_sum': 'Weekly_in_degree',
     'month_count': 'monthdays',
     'value_mean': 'Weekly_in_averge_amount'}
df1 = df1.rename(columns=d).reset_index()
print (df1)
user date Weekly_in_averge_amount Weekly_in_max_amount \
0 a 2017-04-03 1.0 2
1 b 2017-04-06 4.0 5
2 c 2017-04-09 7.5 9
Weekly_in_min_amount Weekly_in_totalamount Weekly_in_degree monthdays
0 0 3 12 3
1 3 12 21 3
2 6 30 42 4
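From there, the weekly rows for a single user (the question's user 29) are a plain boolean filter on the flattened frame; with this sample data the users are letters, so e.g.:
print (df1[df1['user'] == 'a'])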

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
[picture of the head of the DataFrame]
I would like to create a pandas Series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive instances in the column and save it into a Series.
[picture of how the result should look]
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform with groups formed by integer division of a numpy.arange array; this general solution also works with any index (e.g. a DatetimeIndex):
np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
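If only the per-week averages are wanted as a standalone Series (as the question also allows), the same grouper with a plain mean gives one value per 7 rows:
weekly = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).mean()
print (weekly)
0 1.142857
1 2.285714
2 1.666667
Name: ClosingPrice, dtype: float64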

How to take the log of only non-zero values in a dataframe and replace 0's with NA's?

How do I take the log of non-zero values in a dataframe and replace 0's with NA's?
I have dataframe like below:
time y1 y2
0 2017-08-06 00:52:00 0 10
1 2017-08-06 00:52:10 1 20
2 2017-08-06 00:52:20 2 0
3 2017-08-06 00:52:30 3 0
4 2017-08-06 00:52:40 0 5
5 2017-08-06 00:52:50 4 6
6 2017-08-06 00:53:00 6 11
7 2017-08-06 00:53:10 7 12
8 2017-08-06 00:53:20 8 0
9 2017-08-06 00:53:30 0 13
I want to take the log of all columns except the first column, time; the log should be calculated only for non-zero values, and zeros should be replaced with NA's. How do I do this?
So, I tried to do something like this:
cols = df.columns.difference(['time'])
# Replacing 0's with NA's using below:
df[cols] = df[cols].mask(np.isclose(df[cols].values, 0), np.nan)
df[cols] = np.log(df[cols]) # but this will also try to take the log of the NA's
Please help.
The output should be a dataframe with the same time column, all zeros replaced with NA's, and the log equivalent of the remaining values in every column except the 1st.
If I understand correctly, you can just replace the zeros with np.nan and then call np.log directly; NaN inputs simply produce NaN outputs, with no error.
np.log(df[['y1', 'y2']].replace(0, np.nan))
Example
>>> df = pd.DataFrame({'time': pd.date_range('20170101', '20170110'),
...                    'y1': np.random.randint(0, 3, 10),
...                    'y2': np.random.randint(0, 3, 10)})
>>> df
time y1 y2
0 2017-01-01 1 2
1 2017-01-02 0 1
2 2017-01-03 2 0
3 2017-01-04 0 1
4 2017-01-05 1 0
5 2017-01-06 1 1
6 2017-01-07 2 0
7 2017-01-08 1 0
8 2017-01-09 0 1
9 2017-01-10 2 1
>>> df[['log_y1', 'log_y2']] = np.log(df[['y1', 'y2']].replace(0, np.nan))
>>> df
time y1 y2 log_y1 log_y2
0 2017-01-01 1 2 0.000000 0.693147
1 2017-01-02 0 1 NaN 0.000000
2 2017-01-03 2 0 0.693147 NaN
3 2017-01-04 0 1 NaN 0.000000
4 2017-01-05 1 0 0.000000 NaN
5 2017-01-06 1 1 0.000000 0.000000
6 2017-01-07 2 0 0.693147 NaN
7 2017-01-08 1 0 0.000000 NaN
8 2017-01-09 0 1 NaN 0.000000
9 2017-01-10 2 1 0.693147 0.000000
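Tying back to the attempt in the question, the same replace-then-log pattern works in place on every column except time, reusing the question's cols selection (applied to the original frame, before any log columns are added):
cols = df.columns.difference(['time'])
df[cols] = np.log(df[cols].replace(0, np.nan))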
