How to do a rolling count of values grouping by date - python

Hi, I have a table of data like the one below, and I want to do a rolling count that groups by date and also includes the values from prior dates.
Table of data:
Date        ID
1/1/2020    123
2/1/2020    432
2/1/2020    5234
4/1/2020    543
5/1/2020    645
6/1/2020    231
My desired output is something like this:
Date        count
1/1/2020    1
2/1/2020    3
4/1/2020    4
5/1/2020    5
6/1/2020    6
I have tried the following, but it doesn't work the way I want it to:
df[['id','date']].groupby('date').cumcount()

Convert the column to datetimes for correct ordering, then aggregate with GroupBy.size and add a cumulative sum with Series.cumsum:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.groupby('Date').size().cumsum().reset_index(name='count')
print (df)
Date count
0 2020-01-01 1
1 2020-01-02 3
2 2020-01-04 4
3 2020-01-05 5
4 2020-01-06 6
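Putting the answer together as a runnable sketch, using the sample data from the question (dates are day-first):

```python
import pandas as pd

# Reproduce the question's sample data.
df = pd.DataFrame({
    "Date": ["1/1/2020", "2/1/2020", "2/1/2020", "4/1/2020", "5/1/2020", "6/1/2020"],
    "ID": [123, 432, 5234, 543, 645, 231],
})

# Parse day-first dates so they sort chronologically, count rows per
# date, then accumulate the counts across dates.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
out = df.groupby("Date").size().cumsum().reset_index(name="count")
print(out)
```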

Related

I want to select duplicate rows between 2 dataframes

I want to filter rows in df1, whose date column is datetime64[ns], using df2 (same column name and dtype). I tried searching for a solution, but I get errors such as:
Can only compare identically-labeled Series objects or 'Timestamp' object is not iterable.
sample df1
id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
4   2018-11-25  120
5   2018-08-25  120
sample df2
date
2018-10-09
2018-10-10
sample result that I want
id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
In fact, I want this program to run once every 7 days, counting back from the day it starts, so it should remove dates that are not within the past 7 days.
from datetime import date, timedelta
import pandas as pd

# create new dataframe -> df2
df2 = pd.DataFrame({'date': []})
# Set the date to the last 7 days.
days_use = 7  # 7 -> 1
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day
# Change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])
Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date()-pd.DateOffset(7),periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
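Putting the isin filter together as a self-contained sketch on the question's sample data (df2 here is built from the fixed sample dates rather than the rolling 7-day window):

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime(["2018-10-09", "2018-10-09", "2018-10-10",
                            "2018-11-25", "2018-08-25"]),
    "value": [120, 60, 59, 120, 120],
})
df2 = pd.DataFrame({"date": pd.to_datetime(["2018-10-09", "2018-10-10"])})

# Keep only the rows of df1 whose date appears in df2.
result = df1[df1["date"].isin(df2["date"])]
print(result)
```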

Pandas group by date with subcategories and sums

I have a dataframe such as this one:
Date Category1 Cat2 Cat3 Cat4 Value
0 2021-02-02 4310 0 1 0 1082.00
1 2021-02-03 5121 2 0 0 -210.82
2 2021-02-03 4310 0 0 0 238.41
3 2021-02-12 5121 2 2 0 -1489.11
4 2021-02-25 6412 1 0 0 -30.97
5 2021-03-03 5121 1 1 0 -189.91
6 2021-03-09 6412 0 0 0 238.41
7 2021-03-13 5121 0 0 0 -743.08
Date column has been converted into datetime format, Value is a float, other columns are strings.
I am trying to group the dataframe by month and by each level of category, such as:
Level 1 = filter over category 1 and sum values for each category for each month:
Date Category1 Value
0 2021-02 4310 1320.41
1 2021-02 5121 -1699.93
2 2021-02 6412 -30.97
3 2021-03 5121 -932.99
4 2021-03 6412 238.41
Level 2 = filter over category 2 alone (one output dataframe) and over the concatenation of category 1 + 2 (another output dataframe):
Date Cat2 Value
0 2021-02 0 1320.41
1 2021-02 1 -30.97
2 2021-02 2 -1699.93
3 2021-03 0 -504.67
4 2021-03 1 -189.91
Second output :
Date Cat1+2 Value
0 2021-02 43100 1320.41
1 2021-02 51212 -1699.93
2 2021-02 64121 -30.97
3 2021-03 51210 -743.08
4 2021-03 51211 -189.91
5 2021-03 64120 238.41
Level 3 : filter over category 3 alone and over category 1+2+3
etc.
I am able to do one grouping at a time (by date or by category) but I can't combine them.
Grouping by date:
df.groupby(df["Date"].dt.year)
Grouping by category:
df.groupby('Category1')['Value'].sum()
You can try this. To group by month, you can use:
df.groupby(df['Date'].dt.strftime('%B'))['Value'].sum()
(see: How can I Group By Month from a Date field using Python/Pandas)
To group by multiple columns:
df.groupby(['col5', 'col2'])
You could also create a month-year column and then group by that new column.
See also: Pandas DataFrame Groupby two columns and get counts, Extracting just Month and Year separately from Pandas Datetime column
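As a sketch of the level-by-level grouping the question asks for (levels 1 and 2 only; Cat3/Cat4 follow the same pattern), one option is dt.to_period('M') instead of strftime, so the months sort chronologically:

```python
import pandas as pd

# The question's sample data; category columns are strings.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-02-02", "2021-02-03", "2021-02-03", "2021-02-12",
                            "2021-02-25", "2021-03-03", "2021-03-09", "2021-03-13"]),
    "Category1": ["4310", "5121", "4310", "5121", "6412", "5121", "6412", "5121"],
    "Cat2": ["0", "2", "0", "2", "1", "1", "0", "0"],
    "Value": [1082.00, -210.82, 238.41, -1489.11, -30.97, -189.91, 238.41, -743.08],
})

# Level 1: month period + Category1.
level1 = (df.groupby([df["Date"].dt.to_period("M"), "Category1"])["Value"]
            .sum().reset_index())

# Level 2, second output: concatenate the string categories, then group.
df["Cat1+2"] = df["Category1"] + df["Cat2"]
level2 = (df.groupby([df["Date"].dt.to_period("M"), "Cat1+2"])["Value"]
            .sum().reset_index())
print(level1)
print(level2)
```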

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
While I am still in the dark concerning the format of your Date column, I will assume it is a string object and the Hr column is an int64. To create the TimeStamp column in pandas Timestamp format, this is how I would proceed.
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='H'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
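As a variant, the same result can be had without the row-wise apply by parsing the dates once and adding the hours as a vectorized timedelta (using the lowercase unit alias, since 'H' is deprecated in newer pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["12/01/2010"] * 4 + ["12/02/2010"] * 4,
    "Hr": [1, 2, 3, 4, 1, 2, 3, 4],
})

# Parse the whole Date column once, add the hour offsets in one shot,
# then promote the result to the index.
df["TimeStamp"] = pd.to_datetime(df["Date"]) + pd.to_timedelta(df["Hr"], unit="h")
df = df.set_index("TimeStamp")
print(df)
```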

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know the number of columns whose date is after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
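A self-contained sketch of the filter-based approach on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [4, 6, 12],
    "Bill": [6, 8, 14],
    "Date 1": pd.to_datetime(["2000-10-04", "2016-05-03", "2016-11-16"]),
    "Date 2": pd.to_datetime(["2000-11-05", "2017-08-09", "2017-05-04"]),
    "Date 3": pd.to_datetime(["1999-12-05", "2018-07-14", "2017-07-04"]),
    "Date 4": pd.to_datetime(["2001-05-04", "2015-09-12", "2018-07-04"]),
    "Bill 2": [8, 17, 35],
})

# Select the Date* columns, compare against the cutoff, and count the
# True values per row.
df["Count"] = (df.filter(like="Date") > "2016-12-31").sum(axis=1)
print(df["Count"])
```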

Pandas, add date column to a series

I have a timeseries dataframe that is data agnostic and uses period vs date.
I would like to at some point add in dates, using the period.
My dataframe looks like
period custid
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
I would like to be able to pick an arbitrary starting date, for example 1/1/2018, which would be period 1, so you would end up with:
period custid date
1 1 1/1/2018
2 1 2/1/2018
3 1 3/1/2018
1 2 1/1/2018
2 2 2/1/2018
1 3 1/1/2018
2 3 2/1/2018
3 3 3/1/2018
4 3 4/1/2018
You could create a column of timedeltas based on the period column (subtracting 1 so it starts at 0), then add them to your start_date, defined as a datetime object:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
period custid date
0 1 1 2018-01-01
1 2 1 2018-01-02
2 3 1 2018-01-03
3 1 2 2018-01-01
4 2 2 2018-01-02
5 1 3 2018-01-01
6 2 3 2018-01-02
7 3 3 2018-01-03
8 4 3 2018-01-04
Edit: In your comment, you said you were trying to add months, not days. For this, you could use your method, or alternatively, the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] -1) * MonthBegin()
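A minimal runnable sketch of the month version. pd.DateOffset(months=...) is used here as an assumed equivalent to the MonthBegin approach (it behaves the same when start_date is already the first of a month):

```python
import pandas as pd

df = pd.DataFrame({
    "period": [1, 2, 3, 1, 2, 1, 2, 3, 4],
    "custid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
})

start_date = pd.Timestamp("2018-01-01")
# Map each period n to the start date shifted by n-1 whole months.
df["date"] = df["period"].apply(lambda p: start_date + pd.DateOffset(months=p - 1))
print(df)
```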
