How can I fill in missing dates in a dataframe? - python

I have a dataframe df_counts that contains the number of events that happen on a given day.
My goal is to fill in all the missing dates, and assign them a count of 0.
date count
0 2012-03-14 8
1 2012-03-19 1
2 2012-04-07 3
3 2012-04-10 1
4 2012-04-19 5
Desired output:
date count
0 2012-03-14 8
1 2012-03-15 0
2 2012-03-16 0
3 2012-03-17 0
4 2012-03-18 0
5 2012-03-19 1
6 2012-03-20 0
7 2012-03-21 0
8 2012-03-22 0
9 2012-03-23 0
...
Links I've read through:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
https://datatofish.com/pandas-dataframe-to-series/
Add missing dates to pandas dataframe
What I've tried:
idx = pd.date_range('2012-03-06', '2022-12-05')
s = df_counts.sequeeze
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value = 0)
s.head()
Output of what I've tried:
AttributeError: 'DataFrame' object has no attribute 'sequeeze'

You can achieve this using the asfreq() function.
# convert date column to datetime
df.date = pd.to_datetime(df.date)
# set date columns as index, drop the original index
df = df.set_index("date", drop=True)
# daily freq
df = df.asfreq("D", fill_value=0)
Some notes on your code:
squeeze() is a method, so it has to be called with parentheses. In addition, it is misspelled as sequeeze in your code, which is what raises the AttributeError.
You can directly use the date column as an index using the set_index() function.
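As a minimal sketch, here is the asker's original reindex idea with those two notes applied. The sample frame is rebuilt from the question, and using the data's own first and last dates as the range to fill is an assumption:

```python
import pandas as pd

# Sample data rebuilt from the question.
df_counts = pd.DataFrame({
    "date": ["2012-03-14", "2012-03-19", "2012-04-07", "2012-04-10", "2012-04-19"],
    "count": [8, 1, 3, 1, 5],
})
df_counts["date"] = pd.to_datetime(df_counts["date"])

# Use the date column as the index and select the single column as a Series.
s = df_counts.set_index("date")["count"]

# Assumption: the range to fill is the span covered by the data itself.
idx = pd.date_range(s.index.min(), s.index.max(), freq="D")

# Dates missing from the Series are added with a count of 0.
s = s.reindex(idx, fill_value=0)

# Back to a two-column DataFrame, if that shape is needed.
filled = s.rename_axis("date").reset_index()
```

Because fill_value is given, no NaN is ever introduced, so the count column should stay integer-typed.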

You can use resample:
new_df = df.set_index('date').resample('1D').agg('first').fillna(0)
Output:
'''
date count
2012-03-14 8.0
2012-03-15 0.0
2012-03-16 0.0
2012-03-17 0.0
2012-03-18 0.0
2012-03-19 1.0
2012-03-20 0.0
2012-03-21 0.0
2012-03-22 0.0
2012-03-23 0.0
2012-03-24 0.0
2012-03-25 0.0
2012-03-26 0.0
2012-03-27 0.0
2012-03-28 0.0
2012-03-29 0.0
2012-03-30 0.0
2012-03-31 0.0
2012-04-01 0.0
2012-04-02 0.0
2012-04-03 0.0
2012-04-04 0.0
2012-04-05 0.0
2012-04-06 0.0
2012-04-07 3.0
2012-04-08 0.0
2012-04-09 0.0
2012-04-10 1.0
2012-04-11 0.0
2012-04-12 0.0
2012-04-13 0.0
2012-04-14 0.0
2012-04-15 0.0
2012-04-16 0.0
2012-04-17 0.0
2012-04-18 0.0
2012-04-19 5.0
'''
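If integer counts are preferred over the floats shown above, one possible variation (an assumption, not part of the original answer) is to aggregate with sum instead of first. The sample frame is rebuilt from the question:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2012-03-14", "2012-03-19", "2012-04-07", "2012-04-10", "2012-04-19"]
    ),
    "count": [8, 1, 3, 1, 5],
})

# The sum over an empty daily bin is 0, so no fillna step is needed
# and the column keeps its integer dtype.
new_df = df.set_index("date").resample("D").sum()
```

Unlike first, sum also adds up the counts if the same date appears more than once, which may or may not be what is wanted.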

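My favorite is fillna; you can use it on the whole DataFrame or on a single column. Note that it only replaces existing NaN values and does not add rows for missing dates, so it has to be combined with something like reindex or asfreq: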
df["column"].fillna(0)

Related

Conditionally Set Values Greater Than 0 To 1

I have a dataframe that looks like this, with many more date columns
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 7.0 0.0
1 Catherine 0.0 13.0 17.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 3.0 0.0
4 Christine 0.0 0.0 0.0
I would like to set values in each column after the AUTHOR to 1 when the value is greater than 0, so the resulting table would look like this:
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
I tried the following line of code but got an error, which makes sense, as I need to figure out how to apply it to just the date columns while keeping the AUTHOR column in my table.
Counts[Counts != 0] = 1
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
You can select the date columns first, then apply mask to those columns only:
cols = df.drop(columns='AUTHOR').columns
# or
cols = df.filter(regex=r'\d{4}-\d{2}-\d{2}').columns
# or
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
AUTHOR 2022-07-01 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
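For a self-contained version on the sample data, here is a sketch that swaps mask for a comparison followed by a cast. This is a different technique from the answer above, shown only as an alternative:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "AUTHOR": ["Kathrine", "Catherine", "Amanda Jane", "Jaqueline", "Christine"],
    "2022-07-01": [0.0, 0.0, 0.0, 0.0, 0.0],
    "2022-10-14": [7.0, 13.0, 0.0, 3.0, 0.0],
    "2022-10-15": [0.0, 17.0, 0.0, 0.0, 0.0],
})

# Every column except AUTHOR is treated as a date column.
cols = df.columns.drop("AUTHOR")

# True where the value is > 0, then cast True/False to 1.0/0.0.
df[cols] = df[cols].gt(0).astype(float)
```

One difference to keep in mind: a NaN compares as not greater than 0, so it becomes 0.0 here, whereas mask would leave it as NaN.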
Since you only want to exclude the first column, you could set it as the index, apply the mask, and reset the index at the end.
df = df.set_index('AUTHOR').pipe(lambda g: g.mask(g > 0, 1)).reset_index()
df
AUTHOR 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0
1 Cathrine 1.0 1.0

Convert the data from summary to daily time series data (pandas)

I have a dataset which is a time series. It has several regions at once, here is a small example:
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR
71541 2022-07-19 19170.0 555.0 18484.0 YEV
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB
I have three cumulative columns, and for each of them I want a separate column showing how many new cases were added that day:
date confirmed deaths recovered region_code daily_confirmed daily_deaths daily_recovered
0 2020-03-27 3.0 0.0 0.0 ARK 3.0 0.0 0.0
1 2020-03-27 4.0 0.0 0.0 BA 4.0 0.0 0.0
2 2020-03-27 1.0 0.0 0.0 BEL 1.0 0.0 0.0
..........................................................
71540 2022-07-19 164194.0 2830.0 160758.0 YAR 32.0 16.0 8.0
71541 2022-07-19 19170.0 555.0 18484.0 YEV 6.0 1.0 1.0
71542 2022-07-19 169603.0 2349.0 167075.0 ZAB 1.0 8.0 9.0
That is, for each region, you need the difference between the current date and the previous day in order to see how many new cases have occurred.
The problem is that I don't know how to do this correctly. Since there are no missing dates in the data, something like df['daily_cases'] = df['confirmed'] - df['confirmed'].shift(fill_value=0) would work for a single region. But there are many different regions here, so the data would first have to be split by region somehow. Any ideas how to do this?
Use DataFrameGroupBy.diff, replace the first missing value of each group with the value from the original columns, add a prefix to the column names, and cast to integers if necessary:
print (df)
date confirmed deaths recovered region_code
0 2020-03-27 3.0 0.0 0.0 ARK
1 2020-03-27 4.0 0.0 0.0 BA
2 2020-03-27 1.0 0.0 0.0 BEL
3 2020-03-28 4.0 0.0 4.0 ARK
4 2020-03-28 6.0 0.0 0.0 BA
5 2020-03-28 1.0 0.0 0.0 BEL
6 2020-03-29 6.0 0.0 10.0 ARK
7 2020-03-29 8.0 0.0 0.0 BA
8 2020-03-29 5.0 0.0 0.0 BEL
cols = ['confirmed','deaths','recovered']
df1 = (df.groupby(['region_code'])[cols]
.diff()
.fillna(df[cols])
.add_prefix('daily_')
.astype(int))
print (df1)
daily_confirmed daily_deaths daily_recovered
0 3 0 0
1 4 0 0
2 1 0 0
3 1 0 4
4 2 0 0
5 0 0 0
6 2 0 6
7 2 0 0
8 4 0 0
Finally, join the result to the original:
df = df.join(df1)
print (df)
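diff works on the row order within each group, so it assumes that each region's rows are already in chronological order. If that is not guaranteed, sorting first is one way to make it safe. A sketch, with a small made-up frame in deliberately shuffled order:

```python
import pandas as pd

# Made-up sample: cumulative counts for two regions, rows out of order.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-28", "2020-03-27", "2020-03-27", "2020-03-28"]),
    "confirmed": [4.0, 3.0, 4.0, 6.0],
    "region_code": ["ARK", "ARK", "BA", "BA"],
})

# Sort so that each region's rows run from oldest to newest.
df = df.sort_values(["region_code", "date"]).reset_index(drop=True)

# Difference to the previous row of the same region; the first row of each
# region has no predecessor, so it keeps its cumulative value.
df["daily_confirmed"] = (
    df.groupby("region_code")["confirmed"]
    .diff()
    .fillna(df["confirmed"])
    .astype(int)
)
```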

Add missing datetime columns to grouped dataframe

Is it possible to add the missing date columns from a created date_range to the grouped dataframe df without a for loop, filling the missing values with zeros?
date_range has 7 dates, while df has 4 date columns. How can the 3 missing columns be added to df?
import pandas as pd
from datetime import datetime
start = datetime(2018,6,4, )
end = datetime(2018,6,10,)
date_range = pd.date_range(start=start, end=end, freq='D')
DatetimeIndex(['2018-06-04', '2018-06-05', '2018-06-06', '2018-06-07',
'2018-06-08', '2018-06-09', '2018-06-10'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({
'date':
['2018-06-07', '2018-06-10', '2018-06-09','2018-06-09',
'2018-06-08','2018-06-09','2018-06-08','2018-06-10',
'2018-06-10','2018-06-10',],
'name':
['sogan', 'lyam','alex','alex',
'kovar','kovar','kovar','yamo','yamo','yamo',]
})
df['date'] = pd.to_datetime(df['date'])
df = (df
.groupby(['name', 'date',])['date',]
.count()
.unstack(fill_value=0)
)
df
date date date date
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex 0 0 2 0
kovar 0 2 1 0
lyam 0 0 0 1
sogan 1 0 0 0
yamo 0 0 0 3
I would pivot the table so that the dates become rows, and then use pandas' .asfreq method, whose signature is:
DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
Thanks to Sina Shabani for the hint about turning the date columns into rows. In this situation, setting date as the index and using .reindex turned out to be more suitable:
df = (df.groupby(['date', 'name'])['name']
.size()
.reset_index(name='count')
.pivot(index='date', columns='name', values='count')
.fillna(0))
df
name alex kovar lyam sogan yamo
date
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.index = pd.DatetimeIndex(df.index)
df = (df.reindex(pd.date_range(start, freq='D', periods=7), fill_value=0)
.sort_index())
df
name alex kovar lyam sogan yamo
2018-06-04 0.0 0.0 0.0 0.0 0.0
2018-06-05 0.0 0.0 0.0 0.0 0.0
2018-06-06 0.0 0.0 0.0 0.0 0.0
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.T
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex 0.0 0.0 2.0 0.0
kovar 0.0 2.0 1.0 0.0
lyam 0.0 0.0 0.0 1.0
sogan 1.0 0.0 0.0 0.0
yamo 0.0 0.0 0.0 3.0
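If the original layout (names as rows, dates as columns) should be kept, reindex can also be applied along the columns. A sketch of that idea, using pd.crosstab for the counting step instead of the groupby/unstack from the question (a swapped-in technique, chosen here because it gives plain date columns without a MultiIndex):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-06-07", "2018-06-10", "2018-06-09", "2018-06-09", "2018-06-08",
        "2018-06-09", "2018-06-08", "2018-06-10", "2018-06-10", "2018-06-10",
    ]),
    "name": ["sogan", "lyam", "alex", "alex", "kovar",
             "kovar", "kovar", "yamo", "yamo", "yamo"],
})
date_range = pd.date_range("2018-06-04", "2018-06-10", freq="D")

# Count occurrences per (name, date); the dates become the columns.
counts = pd.crosstab(df["name"], df["date"])

# Add the dates that are missing as all-zero columns, without a loop.
counts = counts.reindex(columns=date_range, fill_value=0)
```

Since fill_value=0 is passed, the counts should remain integers rather than being converted to floats.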

Pandas resample timeseries in to 24hours

I have the data like this:
OwnerUserId Score
CreationDate
2015-01-01 00:16:46.963 1491895.0 0.0
2015-01-01 00:23:35.983 1491895.0 1.0
2015-01-01 00:30:55.683 1491895.0 1.0
2015-01-01 01:10:43.830 2141635.0 0.0
2015-01-01 01:11:08.927 1491895.0 1.0
2015-01-01 01:12:34.273 3297613.0 1.0
..........
This is a whole year of data with different users' scores. I hope to get data like this:
OwnerUserId 1491895.0 1491895.0 1491895.0 2141635.0 1491895.0
00:00 0.0 3.0 0.0 3.0 5.8
00:01 5.0 3.0 0.0 3.0 5.8
00:02 3.0 33.0 20.0 3.0 5.8
......
23:40 12.0 33.0 10.0 3.0 5.8
23:41 32.0 33.0 20.0 3.0 5.8
23:42 12.0 13.0 10.0 3.0 5.8
Each element of the dataframe is the score (mean or sum).
I have tried the following:
pd.pivot_table(data_series.reset_index(),index=['CreationDate'],columns=['OwnerUserId'],
fill_value=0).resample('W').sum()['Score']
I get a result like the one in the image.
I think you need:
#remove the `[]` and add the values parameter to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
index='CreationDate',
columns='OwnerUserId',
values='Score',
fill_value=0)
#truncate seconds and convert to a TimedeltaIndex ('min' is the minute alias; 'T' is deprecated in recent pandas)
df.index = pd.to_timedelta(df.index.floor('min').strftime('%H:%M:%S'))
#or round to minutes
#df.index = pd.to_timedelta(df.index.round('min').strftime('%H:%M:%S'))
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:16:00 0 0 0
00:23:00 1 0 0
00:30:00 1 0 0
01:10:00 0 0 0
01:11:00 1 0 0
01:12:00 0 0 1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='min')
#resample by minute and aggregate with sum; reindex adds the missing rows
df = df.resample('min').sum().fillna(0).reindex(idx, fill_value=0)
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:00:00 0.0 0.0 0.0
00:01:00 0.0 0.0 0.0
00:02:00 0.0 0.0 0.0
00:03:00 0.0 0.0 0.0
00:04:00 0.0 0.0 0.0
00:05:00 0.0 0.0 0.0
00:06:00 0.0 0.0 0.0
...
...
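The same minute-of-day table can also be built without going through strings. The following is only a sketch of an alternative, not the answer's method: the four sample rows are made up, and scores are summed across days:

```python
import pandas as pd

# Small made-up sample in the shape of the question's data.
data_series = pd.DataFrame(
    {
        "OwnerUserId": [1491895.0, 1491895.0, 2141635.0, 1491895.0],
        "Score": [0.0, 1.0, 2.0, 3.0],
    },
    index=pd.DatetimeIndex(
        ["2015-01-01 00:16:46", "2015-01-01 00:16:59",
         "2015-01-01 01:10:43", "2015-01-02 00:16:10"],
        name="CreationDate",
    ),
)

# Time elapsed since midnight, truncated to whole minutes.
minute_of_day = (data_series.index - data_series.index.normalize()).floor("min")

# Sum the scores per minute-of-day and user, across all days.
out = (
    data_series.groupby([minute_of_day, "OwnerUserId"])["Score"]
    .sum()
    .unstack(fill_value=0)
)

# One row for each of the 1440 minutes in a day.
idx = pd.timedelta_range("00:00:00", "23:59:00", freq="min")
out = out.reindex(idx, fill_value=0)
```

Replacing sum with mean in the groupby step would give the average score per minute instead.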

Set value based on day in month in pandas timeseries

I have a timeseries
date
2009-12-23 0.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 0.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
I would like to set the value to 1 based on the following rule:
If the constant is set to 9, this means the 9th of each month. Because
2010-01-09 doesn't exist in the series, I would like to set the next date
that does exist to 1, which is 2010-01-11 above.
I have tried creating two series, one (series1) with day < 9 set to 1 and one (series2) with day > 9 set to 1, and then computing series1.shift(1) * series2.
It works in the middle of the month, but not if the day is set to 1, because the last date of the previous month is set to 0 in series1.
Assume your time series is s, with a DatetimeIndex.
First, create a groupby object, grouped by month, that flags the index values whose day is greater than or equal to 9.
# pd.TimeGrouper has been removed from pandas; pd.Grouper(freq=...) replaces it ('ME' is the month-end alias from pandas 2.2 on)
g = s.index.to_series().dt.day.ge(9).groupby(pd.Grouper(freq='M'))
Then check that each month has at least one day >= 9 and grab the first such date. Those dates are assigned the value 1.
s.loc[g.idxmax()[g.any()]] = 1
s
date
2009-12-23 1.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 1.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
Name: val, dtype: float64
Note that 2009-12-23 also was assigned a 1 as it satisfies this requirement as well.
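pd.TimeGrouper is no longer available in current pandas releases. As a rough sketch of the same idea that avoids frequency aliases for grouping altogether, the eligible dates can be grouped by their calendar month via to_period. The shortened date list and the variable names are assumptions made for illustration:

```python
import pandas as pd

# A few business days around two month boundaries, all starting at 0.
dates = pd.to_datetime([
    "2009-12-23", "2009-12-28", "2010-01-04", "2010-01-08",
    "2010-01-11", "2010-01-12", "2010-02-01", "2010-02-02",
])
s = pd.Series(0.0, index=dates, name="val")

day = 9  # the constant from the question

# Keep only the dates on or after the chosen day of the month ...
eligible = s.index[s.index.day >= day]

# ... and take the earliest such date within each calendar month.
first_per_month = eligible.to_series().groupby(eligible.to_period("M")).min()

# Flag those dates in the original series.
s.loc[first_per_month.to_numpy()] = 1
```

Months without any date on or after the chosen day (February in this sample) simply produce no group, so nothing is flagged for them.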
