Filling date gaps in pandas dataframe - python

I have a pandas DataFrame (loaded from a .csv) with a date-time index, where there should be one entry per day.
The problem is that I have gaps, i.e. there are days for which I have no data at all.
What is the easiest way to insert rows (days) in the gaps? Also, is there a way to control what is inserted in the columns as data? Say 0, OR a copy of the previous day's values, OR sliding increasing/decreasing values interpolated from the previous date's data toward the next date's data.
Thanks.
Here is an example; 01-03 and 01-04 are missing:
In [60]: df['2015-01-06':'2015-01-01']
Out[60]:
Rate High (est) Low (est)
Date
2015-01-06 1.19643 0.0000 0.0000
2015-01-05 1.20368 1.2186 1.1889
2015-01-02 1.21163 1.2254 1.1980
2015-01-01 1.21469 1.2282 1.2014
Still experimenting, but this seems to solve the problem:
df.set_index(pd.DatetimeIndex(df.Date), inplace=True)
and then resample. The reason is that importing the .csv with a header column named Date does not actually create a DatetimeIndex but a FrozenList (whatever that means), and resample() expects a DatetimeIndex:
if isinstance(ax, DatetimeIndex): .....
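As a minimal sketch of an alternative, assuming the CSV has a column literally named Date, the index can be parsed as dates while reading the file, so resample() gets a DatetimeIndex right away. The in-memory csv_text below is a stand-in for the real file and only reuses the Rate values from the example.

```python
import io

import pandas as pd

# Stand-in for the real .csv file (assumption: a "Date" header column exists).
csv_text = """Date,Rate
2015-01-01,1.21469
2015-01-02,1.21163
2015-01-05,1.20368
2015-01-06,1.19643
"""

# parse_dates + index_col turn the Date column into a DatetimeIndex on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(type(df.index).__name__)

# With a DatetimeIndex in place, resampling to daily frequency inserts the
# missing days as rows of NaN.
daily = df.resample("D").asfreq()
print(daily)
```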
Here is my final solution :
#make dates the index
self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
#fill the gaps
# (fill_method='pad' was removed from resample(); chain .ffill() instead)
self.df = self.df.resample('D').ffill()
#fix the Date column
self.df.Date = self.df.index.values
I had to fix the Date column because resample() only lets you pad it, so the new rows repeat the previous date.
It fixes the index correctly though, so I could use the index to rebuild the Date column.
Here is a snippet of the data after correction:
2015-01-29 2015-01-29 1.13262 0.0000 0.0000
2015-01-30 2015-01-30 1.13161 1.1450 1.1184
2015-01-31 2015-01-31 1.13161 1.1450 1.1184
2015-02-01 2015-02-01 1.13161 1.1450 1.1184
01-31 and 02-01 are the newly generated rows (they repeat the 01-30 values).

You could resample by day, e.g. using the mean if there are multiple entries per day:
df.resample('D').mean()
You can then ffill to replace the NaNs with the previous day's result.
See up- and downsampling in the docs.
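To make the three fill options from the question concrete, here is a minimal sketch using only the Rate values from the example. It assumes the index is already a DatetimeIndex; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Rate": [1.21469, 1.21163, 1.20368, 1.19643]},
    index=pd.to_datetime(
        ["2015-01-01", "2015-01-02", "2015-01-05", "2015-01-06"]
    ),
)
df.index.name = "Date"

# Resample to one row per day; the missing days appear as NaN.
daily = df.resample("D").mean()

zero_filled = daily.fillna(0)        # option 1: insert 0
padded = daily.ffill()               # option 2: copy the previous day
interpolated = daily.interpolate()   # option 3: linear ramp between neighbours

print(pd.concat(
    {"zero": zero_filled["Rate"], "pad": padded["Rate"],
     "interp": interpolated["Rate"]},
    axis=1,
))
```

Which one is right depends on the data: padding suits prices that simply did not change, interpolation suits slowly varying measurements.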

Related

Groupby number of hours in a month in pandas

Could someone please guide me on how to group an hourly index by month, to find how many hours of null values there are in a specific month? I am thinking of ending up with a DataFrame with a monthly index.
Below is the DataFrame, which has the timestamp as index and another column which occasionally has null values.
timestamp            rel_humidity
1999-09-27 05:00:00  82.875
1999-09-27 06:00:00  83.5
1999-09-27 07:00:00  83.0
1999-09-27 08:00:00  80.6
1999-09-27 09:00:00  nan
1999-09-27 10:00:00  nan
1999-09-27 11:00:00  nan
1999-09-27 12:00:00  nan
I tried this but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby(OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year).count()
I always struggle with groupby in function. Therefore, highly appreciate any help. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to monthly periods with DatetimeIndex.to_period, and aggregate with sum (True values in the mask are counted as 1). Alternatively, use Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
.groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
.groupby(pd.Grouper(freq='m')).sum())
If you need only the months that actually have missing rows, the solution is similar, but first filter by boolean indexing and then count with GroupBy.size. Build the grouping key from the filtered frame's own index so that it has the same length as the rows being grouped:
missing = OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
gap_in_month = missing.groupby(missing.index.to_period('m')).size()
gap_in_month = missing.groupby(pd.Grouper(freq='m')).size()
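A small sketch of the difference between the two variants. The September values are the ones from the question; the three October values are made up for illustration so that there is a month without gaps.

```python
import numpy as np
import pandas as pd

# September rows come from the question; October rows are hypothetical.
sept = pd.date_range("1999-09-27 05:00", periods=8, freq=pd.Timedelta(hours=1))
octo = pd.date_range("1999-10-01 00:00", periods=3, freq=pd.Timedelta(hours=1))
s = pd.Series(
    [82.875, 83.5, 83.0, 80.6, np.nan, np.nan, np.nan, np.nan,
     70.0, 71.0, 72.0],
    index=sept.append(octo),
    name="rel_humidity",
)

is_missing = s.isna()

# Variant 1: sum the mask per month, months without gaps show up as 0.
per_month_all = is_missing.groupby(s.index.to_period("M")).sum()

# Variant 2: keep only the missing rows, then count them per month;
# months without gaps do not appear at all.
missing_rows = s[is_missing]
per_month_only = missing_rows.groupby(missing_rows.index.to_period("M")).size()

print(per_month_all)
print(per_month_only)
```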
Alternative to groupby, but (in my opinion) much nicer, is to use pd.Series.resample:
import numpy as np
import pandas as pd
# Some sample data with a DatetimeIndex:
series = pd.Series(
np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)
# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has an asfreq function for a DatetimeIndex. It is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has a reindex method: given a list of index labels, it keeps only the rows whose labels are in the list and adds NaN-filled rows for labels that were not present.
In your case, you can create all the dates you want, with date_range for example, and pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we call reindex with the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This produces NaNs in the places that had no value before.
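For the "cannot reindex from a duplicate axis" error in the update, one possible approach is to drop duplicate dates before calling asfreq. This is only a sketch: it assumes the duplicate comes from today's date being appended to a file that already contains it, and the second 2021-05-05 value below is made up.

```python
import pandas as pd

# Hypothetical data: 2021-05-05 appears twice, as if the program ran twice
# on the same day.
df2 = pd.DataFrame({
    "Date": ["2021-05-01", "2021-05-05", "2021-05-05"],
    "Portfoliovalue": [50000.0, 52304.0, 52310.0],
})
df2["Date"] = pd.to_datetime(df2["Date"])

# Keep only the most recent entry per date; asfreq() needs unique labels.
df2 = df2.drop_duplicates(subset="Date", keep="last")

filled = df2.set_index("Date").asfreq("D").reset_index()
print(filled)
```

Whether to keep the first or the last entry for a repeated date is a design choice; keep="last" keeps the latest value written for that day.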

How can I filter for rows one hour before and after a set timestamp in Python?

I am trying to filter a DataFrame to only show values 1-hour before and 1-hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with Pandas.
The posts I see regarding masking by date mostly cover the case of masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows based around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that outputs a DataFrame that is the initial DataFrame, but only rows for 1-hour before and 1-hour after a specified timestamp, and so only rows within this specified 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter where I get rid of all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with a boolean mask:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='ignore')
date_search= dt.datetime.strptime("2011-01-15 05:20:00",'%Y-%m-%d %H:%M:%S')
mask = (df['date'] > date_search-dt.timedelta(hours = 1)) & (df['date'] <= date_search+dt.timedelta(hours = 1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
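The attempt in the question is close: the SyntaxError comes from calling df(...) with a slice, and label slicing goes through df.loc[...] instead. A minimal sketch, assuming a sorted DatetimeIndex; window_around is a hypothetical helper name and the data is generated just for the demo.

```python
import pandas as pd


def window_around(df, when, hours=1):
    """Return the rows within `hours` before and after `when` (inclusive)."""
    center = pd.Timestamp(when)
    delta = pd.Timedelta(hours=hours)
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    return df.loc[center - delta : center + delta]


# Demo data: one row per minute from 04:00 to 07:59.
idx = pd.date_range("2011-07-14 04:00", periods=240, freq="min")
df = pd.DataFrame({"value": range(240)}, index=idx)
df.index.name = "timestamp"

result = window_around(df, "2011-07-14 06:15:00")
print(result.index.min(), result.index.max(), len(result))
```

The user only supplies the single timestamp; the start and end of the window are derived from it.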

pandas - efficiently computing minutely returns as columns on intraday data

I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labeled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency is key, but time being more important. Is there some vectorized, built in method to do this in pandas?
You can do this with some aggregation and shifting your time series that should result in more efficient calculations.
First aggregate your data by closingDate.
g = df.groupby("closingDate")
Next you can shift the data within each group by one row, i.e. by one minute.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last value will be from the previous minute. Now you can join to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column you can pivot the results. By grouping the closingDate and then applying the shift you avoid shifting dates across days. If you look at the first observation of each day you'll get a NaN since the values won't be shifted across separate days.
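The grouped shift and the final pivot can be sketched end to end as follows. This variant divides by the shifted column directly instead of merging, and the second day (1997-09-10) is made-up data so that the day boundary is visible; the minute and ret column names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "closingDate": ["1997-09-09"] * 4 + ["1997-09-10"] * 2,
    "Time": pd.to_datetime([
        "2018-12-13 00:00:00", "2018-12-13 00:01:00",
        "2018-12-13 00:02:00", "2018-12-13 00:03:00",
        "2018-12-13 00:00:00", "2018-12-13 00:01:00",
    ]),
    # The last two prices are hypothetical.
    "Last": [1000, 1002, 1001, 1005, 2000, 2010],
})

# Previous minute's price within the same closingDate; the first row of
# each day has no predecessor and becomes NaN.
prev = df.groupby("closingDate")["Last"].shift(1)
df["ret"] = df["Last"] / prev - 1

# One row per closingDate, one column per minute of the day.
df["minute"] = df["Time"].dt.strftime("%H:%M:%S")
wide = df.pivot(index="closingDate", columns="minute", values="ret")
print(wide)
```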

Adjusting Monthly Time Series Data in Pandas

I have a pandas DataFrame like this.
As you can see, the data corresponds to end of month data. The problem is that the end of month date is not the same for all the columns. ( The underlying reason is that the last trading day of the month does not always coincide with the end of the month. )
Currently, the end of 2016 January have two rows "2016-01-29" and "2016-01-31." It should be just one row. For example, the end of 2016 January should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to the end of monthly data, the data might not be nice enough and can conceivably include the middle of the month data for a random columns. In that case, I don't want to make any adjustment so that any prior data collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10 and 2016-01. However, the rows at 2015-07 and 2015-08 simply do not have data for some columns. So I would like to leave those values in 2015-07 and 2015-08 as NaN, while merging the end-of-month rows at 2015-05, 2015-10 and 2016-01. Hopefully this provides more insight into what I am trying to do.
You can use:
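df = df.groupby(pd.Grouper(freq='M')).ffill()
df = df.resample('M').last()
to create a new DatetimeIndex ending on the last day of each month and take the last available data point for each month. The forward fill ensures that, for columns with missing data on the last available date, the prior available value within the same month is used. (pd.TimeGrouper and the how= argument have been removed from pandas; pd.Grouper and .last() replace them. Recent pandas versions spell the month-end alias 'ME' instead of 'M'.)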
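A small sketch of the merge on a few rows from the question. It groups by calendar month with to_period rather than resampling to month end, which is a swapped-in technique chosen here so the result does not depend on which month-end alias the installed pandas accepts. The result is indexed by monthly periods, not by end-of-month dates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Index A": [2085.51, 2107.39, np.nan, 2103.84],
        "Index B": [1543.0, np.nan, 1550.39, np.nan],
        "Index C": [229.4, np.nan, 229.1, 228.8],
    },
    index=pd.to_datetime(
        ["2015-04-30", "2015-05-29", "2015-05-31", "2015-07-31"]
    ),
)
df.index.name = "DATE"

# GroupBy.last() takes the last non-null value per column in each month,
# so 2015-05-29 and 2015-05-31 collapse into one row, while a column with
# no data at all in a month (Index B in 2015-07) stays NaN.
monthly = df.groupby(df.index.to_period("M")).last()
print(monthly)
```

Note that this merges any rows within a month, so mid-month rows would be folded in as well rather than flagged.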
