Could someone please explain how to group by hours from an hourly index to find how many hours of null values there are in a specific month? I am therefore thinking of producing a dataframe with a monthly index.
Below is the dataframe, which has the timestamp as index and another column that occasionally has null values.
                     rel_humidity
timestamp
1999-09-27 05:00:00        82.875
1999-09-27 06:00:00        83.500
1999-09-27 07:00:00        83.000
1999-09-27 08:00:00        80.600
1999-09-27 09:00:00           NaN
1999-09-27 10:00:00           NaN
1999-09-27 11:00:00           NaN
1999-09-27 12:00:00           NaN
I tried this but the resulting dataframe is not what I expected.
gap_in_month = OG_1998_2022_gaps.groupby(OG_1998_2022_gaps.index.month, OG_1998_2022_gaps.index.year).count()
I always struggle with the groupby function, so any help is highly appreciated. Thanks in advance!
If you need 0 for months with no missing values, create a mask with Series.isna, convert the DatetimeIndex to month periods with DatetimeIndex.to_period and aggregate with sum (the True values of the mask count as 1), or use the alternative with pd.Grouper:
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
.groupby(OG_1998_2022_gaps.index.to_period('m')).sum())
gap_in_month = (OG_1998_2022_gaps['rel_humidity'].isna()
.groupby(pd.Grouper(freq='m')).sum())
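For the eight sample rows above, the to_period variant would produce one row per month, something like:
timestamp
1999-09    4
Freq: M, Name: rel_humidity, dtype: int64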
If you need only the months that actually contain missing values, the solution is similar, but first filter with boolean indexing and then aggregate counts with GroupBy.size:
missing = OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
gap_in_month = missing.groupby(missing.index.to_period('m')).size()
gap_in_month = (OG_1998_2022_gaps[OG_1998_2022_gaps['rel_humidity'].isna()]
.groupby(pd.Grouper(freq='m')).size())
An alternative to groupby, but (in my opinion) much nicer, is to use pd.Series.resample:
import numpy as np
import pandas as pd
# Some sample data with a DatetimeIndex:
series = pd.Series(
np.random.choice([1.0, 2.0, 3.0, np.nan], size=2185),
index=pd.date_range(start="1999-09-26", end="1999-12-26", freq="H")
)
# Solution:
series.isna().resample("M").sum()
# Note that GroupBy.count and Resampler.count count the number of non-null values,
# whereas you seem to be looking for the opposite :)
In your case:
OG_1998_2022_gaps['rel_humidity'].isna().resample("M").sum()
I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on what the value in the "Date" column is. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
You can try the following code, which matches on the day and month:
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
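Of the rows visible in the sample, this keeps something like:
          Date Gross Margin
1   2020-12-31       44.53%
57  2006-12-31       49.65%
61  2005-12-31       40.88%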
Assuming you have the dates formatted as year-month-day:
df = df[df['Date'].str.endswith('12-31')]
If the dates are using a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
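A minimal self-contained sketch of the string-based approach, assuming 'Date' holds year-month-day strings (the three-row frame here is a made-up stand-in for the real data):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-03-31', '2020-12-31', '2005-12-31'],
    'Gross Margin': ['44.79%', '44.53%', '40.88%'],
})
# keep only the rows whose date ends with "12-31"
df = df[df['Date'].str.endswith('12-31')]
print(df)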
I have a CSV that initially creates the following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates, with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill there replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
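A possible workaround, sketched under the assumption that the duplicate comes from appending today's date when a row for it already exists, is to drop duplicate dates before calling asfreq:
df2.Date = pd.to_datetime(df2.Date)
# keep only the last entry per date so the axis passed to asfreq is unique
df2 = df2.drop_duplicates(subset='Date', keep='last')
df2 = df2.set_index('Date').asfreq('D').reset_index()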
Pandas has the asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has the reindex method: given a new list of indices, it conforms the dataframe to that list, keeping the matching rows and inserting NaN rows for new indices.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You may need a simple set_index and reset_index around it, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with a full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in places that had no former value.
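A minimal self-contained version, assuming the two-row frame from the question:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-05-01', '2021-05-05']),
    'Portfoliovalue': [50000.0, 52304.0],
})
full_range = pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')
df = df.set_index('Date').reindex(full_range).rename_axis('Date').reset_index()
print(df)  # 2021-05-02 .. 2021-05-04 get NaN in Portfoliovalue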
I have a time-series in a pandas DataFrame at hourly frequency:
import pandas as pd
import numpy as np
idx = pd.date_range(freq="h", start="2018-01-01", periods=365*24)
df = pd.DataFrame({'value': np.random.rand(365*24)}, index=idx)
I have a list of dates:
dates = ['2018-03-20', '2018-04-08', '2018-07-14']
I want to end up with two DataFrames: one containing just the data for these dates, and one containing all of the data from the original DataFrame excluding all the data for these dates. In this case, I would have a DataFrame containing three days worth of data (for the days listed in dates), and a DataFrame containing 362 days data (all the data excluding those three days).
What is the best way to do this in pandas?
I can take advantage of nice string-based datetime indexing in pandas to extract each date separately, for example:
df[dates[0]]
and I can use this to put together a DataFrame containing just the specified dates:
to_concat = [df[date] for date in dates]
just_dates = pd.concat(to_concat)
This isn't as 'nice' as it could be, but does the job.
However, I can't work out how to remove those dates from the DataFrame to get the other output that I want. Doing:
df[~dates[0]]
gives a TypeError: bad operand type for unary ~: 'str', and I can't seem to get df.drop to work in this context.
What do you suggest as a nice, Pythonic and 'pandas-like' way to go about this?
Create a boolean mask with numpy.in1d, converting the dates to strings, or with Index.isin for a membership test:
m = np.in1d(df.index.date.astype(str), dates)
m = df.index.to_series().dt.date.astype(str).isin(dates)
Or use DatetimeIndex.strftime for strings:
m = df.index.strftime('%Y-%m-%d').isin(dates)
Another idea is to remove the times with DatetimeIndex.normalize, which keeps a DatetimeIndex in the output:
m = df.index.normalize().isin(dates)
#alternative
#m = df.index.floor('d').isin(dates)
Last, filter by boolean indexing:
df1 = df[m]
And for the second DataFrame, invert the mask with ~:
df2 = df[~m]
print(df1)
value
2018-03-20 00:00:00 0.348010
2018-03-20 01:00:00 0.406394
2018-03-20 02:00:00 0.944569
2018-03-20 03:00:00 0.425583
2018-03-20 04:00:00 0.586190
...
2018-07-14 19:00:00 0.710710
2018-07-14 20:00:00 0.403660
2018-07-14 21:00:00 0.949572
2018-07-14 22:00:00 0.629871
2018-07-14 23:00:00 0.363081
[72 rows x 1 columns]
One way to solve this:
df = df.reset_index()
with_date = df[df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##use del with_date.index.name to remove the index name, if required
without_date = df[~df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##with_date
value
index
2018-03-20 00:00:00 0.059623
2018-03-20 01:00:00 0.343513
...
##without_date
value
index
2018-01-01 00:00:00 0.087846
2018-01-01 01:00:00 0.481971
...
Another way to solve this:
Keep your dates in datetime format, for example through a pd.Timestamp:
dates_in_dt_format = [pd.Timestamp(date).date() for date in dates]
Then, keep only the rows where the index's date is not in that group, for example with:
df_without_dates = df.loc[[idx for idx in df.index if idx.date() not in dates_in_dt_format]]
df_with_dates = df.loc[[idx for idx in df.index if idx.date() in dates_in_dt_format]]
or using pandas apply instead of list comprehension:
df_with_dates = df[df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
df_without_dates = df[~df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
I am playing with a time series dataframe defined as df using pandas.
I've already set the row index to a datetime index using set_index.
I want to downsample a dataframe recorded at 5-second intervals using resample or asfreq.
Let's say downsample to 1 hour.
df_inst = df.asfreq('1H')
df_inst2 = df.resample('1H')
When I execute the code above, asfreq gives me the right dataframe downsampled to a 1-hour interval, which is exactly what I expected to see.
However, resample doesn't generate a dataframe; moreover, there is no error message.
When I inspect it using print, I get the following output:
print(df_inst2)
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
What am I missing?
More specifically, how can I get the same results with resample as I got with asfreq?
Thank you in advance.
DataFrame.resample returns a Resampler object, while DataFrame.asfreq returns the converted data.
If you want to use resample correctly, chain it with a specific method, for instance: df.resample('1H').asfreq().
Example from the docs:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series.resample('30S').asfreq().head(5)
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
Freq: 30S, dtype: float64
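Applied to the question's dataframe, that becomes:
df_inst2 = df.resample('1H').asfreq()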
I have a Pandas DataFrame (loaded from .csv) with a datetime index, where there is, or should be, one entry per day.
The problem is that I have gaps, i.e. there are days for which I have no data at all.
What is the easiest way to insert rows (days) into the gaps? Also, is there a way to control what is inserted in the columns as data? Say 0, or a copy of the previous day's info, or sliding increasing/decreasing values in the range from the previous date's towards the next date's values.
Thanks.
Here is an example where 01-03 and 01-04 are missing:
In [60]: df['2015-01-06':'2015-01-01']
Out[60]:
Rate High (est) Low (est)
Date
2015-01-06 1.19643 0.0000 0.0000
2015-01-05 1.20368 1.2186 1.1889
2015-01-02 1.21163 1.2254 1.1980
2015-01-01 1.21469 1.2282 1.2014
Still experimenting, but this seems to solve the problem:
df.set_index(pd.DatetimeIndex(df.Date), inplace=True)
and then resample... The reason is that importing the .csv with the header column name Date does not actually create a datetime index, but a FrozenList, whatever that means.
resample() expects: if isinstance(ax, DatetimeIndex): ...
Here is my final solution :
#make dates the index
self.df.set_index(pd.DatetimeIndex(self.df.Date), inplace=True)
#fill the gaps
self.df = self.df.resample('D').ffill()  #forward-fill (the old fill_method='pad' argument was removed in newer pandas)
#fix the Date column
self.df.Date = self.df.index.values
I had to fix the Date column, because resample() just pads it like any other column.
It fixes the index correctly though, so I could use it to fix the Date column.
Here is a snippet of the data after correction:
2015-01-29 2015-01-29 1.13262 0.0000 0.0000
2015-01-30 2015-01-30 1.13161 1.1450 1.1184
2015-01-31 2015-01-31 1.13161 1.1450 1.1184
2015-02-01 2015-02-01 1.13161 1.1450 1.1184
01-30 and 01-31 are the newly generated data.
You could resample by day, e.g. using mean if there are multiple entries per day:
df.resample('D').mean()
You can then ffill to replace NaNs with the previous day's result.
See up and down sampling in the docs.
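Putting the two steps together, a short sketch (assuming numeric columns like those in the sample above):
# one row per day, averaging any duplicate entries; missing days become NaN
daily = df.resample('D').mean()
# carry the previous day's values forward into the gaps
daily = daily.ffill()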