Fill in missing dates with 0 (zero) in Pandas - python

I have a large CSV file with millions of rows. The data looks like this: two columns (date, score) and millions of rows. I need the missing dates (for example 1/1/16, 2/1/16, 4/1/16) to have '0' values in the 'score' column while keeping my existing 'date' and 'score' values intact, all in the same CSV. But I also have multiple (probably hundreds of) scores on many dates, so I'm really having trouble coding this. I've looked up quite a few examples on Stack Overflow, but none of them have worked so far.
date score
3/1/16 0.6369
5/1/16 -0.2023
6/1/16 0.25
7/1/16 0.0772
9/1/16 -0.4215
12/1/16 0.296
15/1/16 0.25
15/1/16 0.7684
15/1/16 0.8537
...
...
31/12/18 0.5646
This is what I have done so far, but all I am getting is an index column filled with three years of dates, while the 'date' and 'score' columns are filled with '0'. I will really appreciate your answers and suggestions. Thank you very much.
import csv
import pandas as pd
import datetime as dt

df = pd.read_csv('myfile.csv')
dtr = pd.date_range('01.01.2016', '31.12.2018')
df.index = pd.DatetimeIndex(df.index)
df = df.reindex(dtr, fill_value=0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)
Note: I know I set index=True, which is why the index is appearing, but I don't know why the 'date' column is not being filled. If I put parse_dates=['date'] in my pd.read_csv I get the 'date' column filled with dates from 1970, with the same results as before.

You can do it like this:
(I did it with a smaller time frame, so change the dates so that they fit your range.)
import pandas as pd

x = {"date": ["3/1/16", "5/1/16", "5/1/16"],
     "score": [4, 5, 6]}
df = pd.DataFrame.from_dict(x)
df["date"] = pd.to_datetime(df["date"], format='%d/%m/%y')
df.set_index("date", inplace=True)
dtr = pd.date_range('01.01.2016', '01.10.2016', freq='D')
# empty series covering the full date range (dtype given explicitly to avoid a warning)
s = pd.Series(index=dtr, dtype='float64')
# append only the dates that are not already present, then sort
df = pd.concat([df, s[~s.index.isin(df.index)]]).sort_index()
# drop the helper column 0 created by concatenating an unnamed Series, fill gaps with 0
df = df.drop([0], axis=1).fillna(0)
print(df)
Output
score
2016-01-01 0.0
2016-01-02 0.0
2016-01-03 4.0
2016-01-04 0.0
2016-01-05 5.0
2016-01-05 6.0
2016-01-06 0.0
2016-01-07 0.0
2016-01-08 0.0
2016-01-09 0.0
2016-01-10 0.0
With a file
Because you asked in the comments, here is an example with a file:
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '01.10.2016', freq='D')
s = pd.Series(index=dtr, dtype='float64')
df = pd.concat([df, s[~s.index.isin(df.index)]]).sort_index()
df = df.drop([0], axis=1).fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)

Just an idea: try resampling with a 1-day frequency and filling the gaps,
like: nd = df.resample('D').pad()
(Note that pad() forward-fills the previous value rather than inserting zeros; an aggregation such as df.resample('D').sum() yields 0 for empty days.)
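A minimal sketch of that aggregation idea, assuming the 'myfile.csv' layout from the question ('date' and 'score' columns, day-first dates) and that the many scores per day should be collapsed into one daily total (unlike the other answers here, which keep the duplicate rows):
import pandas as pd

df = pd.read_csv('myfile.csv', parse_dates=['date'], dayfirst=True)
# one total score per day; resampling then inserts every missing day
daily = df.groupby('date')['score'].sum().resample('D').sum()
# days with no rows sum to 0, which is exactly the requested fill value
daily.to_csv('missingDateCorrected.csv', encoding='utf-8')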

Not very efficient, but it will work.
import pandas as pd
df = pd.read_csv('myfile.csv', index_col=0)
df.index = pd.to_datetime(df.index, format='%d/%m/%y')
dtr = pd.date_range('01.01.2016', '31.12.2018')
# Create an empty DataFrame from selected date range
empty = pd.DataFrame(index=dtr, columns=['score'])
# Concatenate the missing dates with your data, sort, and fill their scores with 0
df = pd.concat([df, empty[~empty.index.isin(df.index)]]).sort_index().fillna(0)
df.to_csv('missingDateCorrected.csv', encoding='utf-8', index=True)

Related

Add missing timestamp values in dataframe column, in timerange [duplicate]

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 days, because no events happened on some dates. I then get an AssertionError, as the sizes don't match, when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for the 04th and 05th.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
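If you want zeros instead of NaN right away, asfreq also takes a fill_value argument (it fills only the slots created by the upsampling, not NaNs that were already present):
print(s.asfreq('D', fill_value=0))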
One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00', '2016-11-16 2:00', '2016-11-16 3:00', '2016-11-18 4:00']),
    'values': ['a', 'b', 'c', 'd']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(the error means the existing index has duplicates, not that the axis we are reindexing onto is itself a duplicate)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
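For example, a sketch that fills the blanks in the 'values' column with 0, as the OP requested (the fill value is arbitrary here):
result = df.loc[all_days].fillna({'values': 0})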
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
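For instance, to get the zero-filled daily series the OP asked for, or a smoothly interpolated one (a sketch, assuming the numeric 'val' column above):
daily = df.resample('D').mean().fillna(0)        # gaps become 0
daily = df.resample('D').mean().interpolate()    # or: gaps filled from neighbors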
Here's a handy method to fill in missing dates in a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
import pandas as pd
from datetime import datetime, timedelta

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # sort according to the requested date_order (the original snippet never used it)
    return df.sort_index(ascending=(date_order == 'asc'))
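A usage sketch, assuming a CSV with a parseable 'date' column; note the function reindexes relative to today, so only dates within the last days_back days are kept:
df = pd.read_csv('myfile.csv')  # hypothetical input with a 'date' column
filled = fill_in_missing_dates(df, date_col_name='date', fill_value=0, days_back=30)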
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date': pd.to_datetime([
        '2022-02-10'
        , '2022-02-11'
        , '2022-02-14'
        , '2022-02-14'
        , '2022-02-24'
        , '2022-02-16'
    ])
    , 'value': [10, 20, 5, 10, 15, 30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
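The left join leaves NaN in 'value' for the inserted dates; fill them with 0 if that's the behavior you want:
new_df['value'] = new_df['value'].fillna(0)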
Another terse option: upsample to daily frequency, interpolate the gaps, then downsample to whatever frequency you need (quarterly here):
s.asfreq('D').interpolate().asfreq('Q')

Create new Row in Data Frame with ID and date if ID and date do not exist in "x" timeframe [duplicate]


Pandas DateTime index resampling not working

I have a pandas dataframe as shown in the code below. I am trying to resample the data to get a daily count of the ticket column. It does not give any error, but the resampling is not working. This is a sample of a much larger dataset. I want to be able to get counts by day, week, month, quarter, etc., but the .resample option is not giving me a solution. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['2019-07-30T00:00:00','22:15:00','car'],
                   ['2013-10-12T00:00:00','0:10:00','bus'],
                   ['2014-03-31T00:00:00','9:06:00','ship'],
                   ['2014-03-31T00:00:00','8:15:00','ship'],
                   ['2014-03-31T00:00:00','12:06:00','ship'],
                   ['2014-03-31T00:00:00','9:24:00','ship'],
                   ['2013-10-12T00:00:00','9:06:00','ship'],
                   ['2018-03-31T00:00:00','9:06:00','ship']],
                  columns=['date_field','time_field','transportation'])
df['date_field2'] = pd.to_datetime(df['date_field'])
df['time_field2'] = pd.to_datetime(df['time_field'],unit = 'ns').dt.time
df['date_time_field'] = df.apply(lambda df : pd.datetime.combine(df['date_field2'],df['time_field2']),1)
df.set_index(['date_time_field'],inplace=True)
df.drop(columns=['date_field','time_field','date_field2','time_field2'],inplace=True)
df['tickets']=1
df.sort_index(inplace=True)
df.drop(columns=['transportation'],inplace=True)
df.resample('D').sum()
print('\ndaily resampling:')
print(df)
I think you forgot to assign the output to a variable, like:
df1 = df.resample('D').sum()
print(df1)
Also, your code can be simplified:
# join the columns together with a space; pop extracts the time column and removes it
df['date_field'] = pd.to_datetime(df['date_field'] + ' ' + df.pop('time_field'))
# create and sort the DatetimeIndex, drop the unneeded column
df = df.set_index(['date_field']).sort_index().drop(columns=['transportation'])
# resample to daily counts
df1 = df.resample('D').size()
print(df1)
date_field
2013-10-12 2
2013-10-13 0
2013-10-14 0
2013-10-15 0
2013-10-16 0
..
2019-07-26 0
2019-07-27 0
2019-07-28 0
2019-07-29 0
2019-07-30 1
Freq: D, Length: 2118, dtype: int64
Also I think inplace is not good practice, check this and this.
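Since the question also asks for counts by week, month, and quarter, the same pattern extends to other resampling frequencies (a sketch using pandas' standard frequency aliases):
weekly = df.resample('W').size()
monthly = df.resample('M').size()
quarterly = df.resample('Q').size()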

Present value minus previous value (python)

I want to subtract the previous value from the present value in each line, and whenever there is an N/A, copy the previous available value and use it in the subtraction.
When I run the code below, I get the following message: 'DataFrame' object has no attribute 'value'. Could anyone please help fix it?
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('ccy_test.xlsx')
X = df.iloc[3:, 1:]
df.fillna(method='pad')
count_row = df.shape[0]
count_col = df.shape[1]
z = df.value[:count_row,1:count_col] - df.value[:count_row,:count_col-1]
dz = pd.DataFrame(z)
Sample File
There are some issues with the code you posted. The example file is a csv file, so you need to refer to "ccy_test.csv". The column Value only contains 0's, so for this example I use the column Open.
Furthermore, I added these arguments to your read_csv:
index_col=0 -> makes the first column (the dates) the index
parse_dates=[0] -> parses the dates as dates (instead of strings)
skiprows=3 -> skips the first rows, as they are not part of the table
header=0 -> reads the first row (after the skip) as column names
So:
import pandas as pd
df = pd.read_csv('ccy_test.csv', index_col=0, parse_dates=[0], skiprows=3, header=0)
df = df.fillna(method='pad')
df['Difference'] = df.Open.diff()
print(df)
The output:
Open High Low Value Volume Difference
Dates
2018-03-01 09:30:00 0.83064 0.83121 0.83064 0.0 0.0 NaN
2018-03-01 09:31:00 0.83121 0.83128 0.83114 0.0 0.0 0.00057
2018-03-01 09:32:00 0.83128 0.83161 0.83126 0.0 0.0 0.00007
2018-03-01 09:33:00 0.83161 0.83169 0.83161 0.0 0.0 0.00033
2018-03-01 09:34:00 0.83169 0.83169 0.83145 0.0 0.0 0.00008
Note that df.fillna(method='pad') won't change your dataframe by default; you need to reassign it with df = df.fillna(method='pad').

pandas index filter non datetimes

I'm reading in a csv file with a datetime column that has randomly interspersed blocks of non-datetime text (5 lines in a block at a time, and sometimes multiple blocks in a row). See below for an example snippet of the data file:
Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42
I can read the data from the clipboard and into a dataframe as follows:
df = pd.read_clipboard(sep=',')
I am looking for a way to clean the 'Date' column of non-date-formatted strings prior to converting it to a datetime index. I have tried converting the column to an index and then filtering like this:
df.index=df['Date']
df = df[~df.index.get_loc('RMR')]
df = df[~df.index.get_loc('Default Site')]
df = df[~df.index.get_loc('X2CMBasicOpticsBurst')]
df = df[~df.index.get_loc('Sonde STSO3275')]
df = df.dropna()
I can then parse the dates and times together and get a proper datetime index using date parse tools.
However, the contents of the text fields can change, and this approach seems very limited and non-pythonic.
Therefore, I'm looking for a better, more flexible and dynamic method to automatically skip these non-date fields in the index, hopefully without having to know the details of their contents (e.g. skipping a 4-row block when a blank line is encountered).
Thanks in advance.
Well, you can use to_datetime:
df.loc[:, 'Date'] = pd.to_datetime(df.Date, errors='coerce')
Any element that is not a datetime will be transformed to NaT; then you can drop those rows:
df = df.dropna()
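Note that a bare dropna() also removes rows that are NaN in any other column. A sketch that drops only the rows whose date failed to parse, assuming the 'Date' column from the sample above:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # metadata strings become NaT
df = df.dropna(subset=['Date'])  # drop only rows with an unparseable date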
I think you can use read_csv with dropna and to_datetime:
import pandas as pd
import io
temp=u"""Date,Time,Count,Fault,Battery
12/22/2015,05:24.0,39615.0,0.0,6.42
12/22/2015,05:25.0,39616.0,0.0,6.42
12/22/2015,05:26.0,39617.0,0.0,6.42
12/22/2015,05:27.0,39618.0,0.0,6.42
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
,,,,
Sonde STSO3275,,,,
RMR,,,,
Default Site,,,,
X2CMBasicOpticsBurst,,,,
12/22/2015,19:57.0,39619.0,0.0,6.42
12/22/2015,19:58.0,39620.0,0.0,6.42
12/22/2015,19:59.0,39621.0,0.0,6.42
12/22/2015,20:00.0,39622.0,0.0,6.42
12/22/2015,20:01.0,39623.0,0.0,6.42
12/22/2015,20:02.0,39624.0,0.0,6.42"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['Date','Time']])
df = df.dropna()
df['Date_Time'] = pd.to_datetime(df.Date_Time, format="%m/%d/%Y %H:%M.%S")
print(df)
Date_Time Count Fault Battery
0 2015-12-22 05:24:00 39615.0 0.0 6.42
1 2015-12-22 05:25:00 39616.0 0.0 6.42
2 2015-12-22 05:26:00 39617.0 0.0 6.42
3 2015-12-22 05:27:00 39618.0 0.0 6.42
14 2015-12-22 19:57:00 39619.0 0.0 6.42
15 2015-12-22 19:58:00 39620.0 0.0 6.42
16 2015-12-22 19:59:00 39621.0 0.0 6.42
17 2015-12-22 20:00:00 39622.0 0.0 6.42
18 2015-12-22 20:01:00 39623.0 0.0 6.42
19 2015-12-22 20:02:00 39624.0 0.0 6.42
