I have data sampled every 3 hours and I am trying to resample it to hourly using
df = df.groupby(df.index.date).resample('1h').pad()
however it stops at the last reading at 21:00 every day, so the last three hours of each day are missing. How should I solve this?
You could use DataFrame.asfreq:
hourly = df.asfreq('H')
hourly.groupby(hourly.index.date).resample('H').pad()
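Note that pad/ffill can only fill between existing timestamps; it cannot create rows after the last 21:00 reading. If the trailing 22:00 and 23:00 rows are still missing, one alternative (my sketch, not the answer above) is to reindex against a full hourly range first:

import pandas as pd

# made-up 3-hourly data that stops at 21:00 each day
idx = pd.date_range('2021-01-01 00:00', '2021-01-02 21:00', freq='3H')
df = pd.DataFrame({'value': range(len(idx))}, index=idx)

# extend the index to 23:00 of the last day, then forward-fill
full = pd.date_range(df.index.min(),
                     df.index.max().normalize() + pd.Timedelta(hours=23),
                     freq='H')
hourly = df.reindex(full).ffill()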
I have values that are measured event-related, so there is not the same amount of data every minute. To handle this data more easily, I aim to take only the first row of values every minute.
The time of the data I import from a csv looks like this:
time
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:12
11.11.2011 11:12
11.11.2011 11:13
The other values are temperatures.
One main problem is to import the time in the right format.
I tried to solve this with the help of this community like this:
import datetime as dt

with open('my_file.csv', 'r') as file:
    for line in file:
        try:
            time = line.split(';')[0]  # splits the line at the semicolon and takes the first bit
            time = dt.datetime.strptime(time, '%d.%m.%Y %H:%M')
            print(time)
        except ValueError:
            pass
then I imported the columns of the temperatures and joined them like this:
df = pd.read_csv("my_file.csv", sep=';', encoding='latin-1')
df=df[["time", "T1", "T2", "DT1", "DT2"]]
When I printed the dtypes of my data, the time was datetime64[ns] and the others were objects.
I tried different options of groupby and resample, like the following:
df=df.groupby([pd.Grouper(key = 'time', freq='1min')])
df.resample('M')
One main problem stated in the error messages was that the datatype of the time column was not appropriate for grouping, because it is not a DatetimeIndex.
So I tried to convert the dates to a DatetimeIndex like this:
df.index = pd.to_datetime(daten["time"].index, format='%Y-%m-%d %H:%M:%S')
but then I received an enumeration of the index starting at 1970-01-01, so I am not quite sure whether this conversion is possible with irregular data.
Without this conversion I also get the message <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026938A74850>
When I then try to call my dataframe, that message shows, and when saving it to csv like this:
df.to_csv('04_01_DTempminuten.csv', index=False, encoding='utf-8', sep =';', date_format = '%Y-%m-%d %H:%M:%S')
I receive either the same message or only one line with a decimal number instead of the time.
Does anyone have an idea how to deal with this irregular data to get one line of values each minute?
Thank you for reading my question. I am really thankful for any ideas.
Without sample data I can only show how I do it with irregular time series, which I think is your case. I work with price data that comes at irregular time intervals. If you need to sample the first value per minute, you can use resample for a specific interval with the ohlc aggregation function, which gives you four columns for each sample interval:
open: first value in the interval
high: highest value in the interval
low: lowest value in the interval
close: last value in the interval
In your case the sampling interval would be 1 minute ('T').
In the following example I'm using one second ('S') as the resampling frequency to resample the ask column (your temperature column):
import pandas as pd
df = pd.read_csv('my_tick_data.csv')
df['date_time'] = pd.to_datetime(df['date_time'])
df.set_index('date_time', inplace=True)
df.head(6)
df['ask'].resample('S').ohlc()
This does not solve your date issue, which is a prerequisite for this part because the data set needs to be indexed by date. If you can provide sample data, maybe I can help you with that part as well.
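For the date issue, here is a minimal sketch (my assumption of your csv layout: a 'time' column like '11.11.2011 11:11' plus the temperature columns) that parses the times and keeps the first row of values in each minute:

import pandas as pd

df = pd.read_csv('my_file.csv', sep=';', encoding='latin-1')
# day-first format from the question: '%d.%m.%Y %H:%M'
df['time'] = pd.to_datetime(df['time'], format='%d.%m.%Y %H:%M')
df = df.set_index('time').sort_index()

# first value of each minute for every temperature column
first_per_minute = df[['T1', 'T2', 'DT1', 'DT2']].resample('T').first()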
I have a dataframe that contains gps locations of vehicles received at various times in a day. For each vehicle, I want to resample hourly data such that I have the median report (according to the timestamp) for each hour of the day. For hours where there are no corresponding rows, I want a blank row.
I am using the following code:
for j in df.id.unique():
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23
Since my dataframe has millions of ids, it takes a lot of time to iterate through all of them. Is there an efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
Tested in the latest version of pandas: convert the hour column to a categorical with all possible categories, then aggregate without a loop:
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour']).median().reset_index()
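In recent pandas versions you may need to pass observed=False explicitly so the empty categories survive the groupby. A toy sketch (the 'lat' value column is made up; the question does not show its data columns):

import pandas as pd

# one vehicle that only reports during hour 5
df = pd.DataFrame({
    'id': [1, 1],
    'imo': [100, 100],
    'timestamp': pd.to_datetime(['2021-01-01 05:10', '2021-01-01 05:40']),
    'lat': [10.0, 12.0],  # hypothetical value column
})
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))

# observed=False keeps all 24 hours, so the 23 empty ones appear as NaN rows
out = df.groupby(['id', 'imo', 'hour'], observed=False)['lat'].median().reset_index()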
I am messing around in the NYT covid dataset, which has total covid cases for each county, per day.
I would like to find the difference in cases between each day, so theoretically I could get the number of new cases per day instead of total cases. Taking a rolling mean, or resampling every 2 days using a mean/sum/etc., works just fine. It's just the subtracting that is giving me such a headache.
Tried methods:
df.resample('2d').diff()
'DatetimeIndexResampler' object has no attribute 'diff'
df.resample('1d').agg(np.subtract)
ufunc() missing 1 of 2 required positional argument(s)
df.rolling(2).diff()
'Rolling' object has no attribute 'diff'
df.rolling('2').agg(np.subtract)
ufunc() missing 1 of 2 required positional argument(s)
Sample data:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'covid_cases':[1.2,2.0,2.9,3.6,3.9]
})
Desired sample output:
pd.DataFrame(data={'state':['Alabama','Alabama','Alabama','Alabama','Alabama'],
'date':[dt.date(2020,3,13),dt.date(2020,3,14),dt.date(2020,3,15),dt.date(2020,3,16),dt.date(2020,3,17)],
'new_covid_cases':[np.nan,0.8,0.9,0.7,0.3]
})
Recreate sample data from original NYT dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',parse_dates=['date'])
df = df.groupby(['state','date'])[['cases']].mean().reset_index()
Any help would be greatly appreciated! Would like to learn how to do this manually/via function rather than finding a "new cases" dataset as I will be working with timeseries a lot in the very near future.
Let's try this bit of complete code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df_daily_state = df.groupby(['date','state'])['cases'].sum().unstack()
daily_new_cases_AL = df_daily_state.diff()['Alabama']
ax = daily_new_cases_AL.iloc[-30:].plot.bar(title='Last 30 days Alabama New Cases')
Output: a bar chart titled 'Last 30 days Alabama New Cases'.
Details:
Download the historical case records from the NYTimes github using the raw URL.
Convert the dtype of the 'date' column to datetime.
Group by the 'date' and 'state' columns, sum 'cases', and unstack the state level of the index to get dates as rows and states as columns.
Take the difference by columns and select only the Alabama column.
Plot the last 30 days.
The diff function is correct, but if you look at your error message:
'DatetimeIndexResampler' object has no attribute 'diff'
from your first tried method, it's because diff is a function available on DataFrames, not on Resamplers, so turn the resampler back into a DataFrame by specifying how you want to aggregate it.
If you have the total number of COVID cases for each day and want to resample it to 2 days, you probably only want to keep the latest update out of the two days, in which case something like df.resample('2d').last().diff() should work.
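As a minimal sketch using the sample frame from the question (grouping by state before diff, so values never subtract across state boundaries):

import datetime as dt
import pandas as pd

df = pd.DataFrame(data={'state': ['Alabama'] * 5,
                        'date': [dt.date(2020, 3, 13), dt.date(2020, 3, 14),
                                 dt.date(2020, 3, 15), dt.date(2020, 3, 16),
                                 dt.date(2020, 3, 17)],
                        'covid_cases': [1.2, 2.0, 2.9, 3.6, 3.9]})

# diff within each state; the first row of each state becomes NaN
df = df.sort_values(['state', 'date'])
df['new_covid_cases'] = df.groupby('state')['covid_cases'].diff()

This reproduces the desired sample output above.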
I have a DataFrame series with 30s frequency.
df.head()
I want to calculate the daily averages for all signals in that series, but it doesn't seem to work. I tried both
df_average = df.to_period('D')
df.resample('D')
And I get output with more than one row per day. I want to have only one line per day. Why do I get more?
Thank you
If there is a DatetimeIndex, just add an aggregate function, here mean, to resample:
df1 = df.resample('D').mean()
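A toy example, assuming a 30-second signal like the one described:

import pandas as pd

idx = pd.date_range('2021-01-01', periods=2 * 24 * 60 * 2, freq='30S')
df = pd.DataFrame({'signal': range(len(idx))}, index=idx)

daily = df.resample('D').mean()  # exactly one row per day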
@jezrael's answer is the sure way to go. Could also try:
df.groupby(df.index.date).mean()
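One difference worth noting (a side remark, not from either answer): groupby(df.index.date) keys on plain datetime.date objects, so the result has an object index, while resample keeps a DatetimeIndex:

by_date = df.groupby(df.index.date).mean()
print(by_date.index.dtype)                  # object
print(df.resample('D').mean().index.dtype)  # datetime64[ns]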
I have a long time series, eg.
import pandas as pd
index=pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most effective way to do this while avoiding a check like index.date == given_date, which is very slow? Also, the user does not know a priori the range of days in the frame.
Any hint to do this with an iterator?
My current solution is this, but it is not so elegant and has the two issues described below:
import numpy as np

time_zone = 'Europe/Berlin'
# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()
results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i + 1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?
# last day
results.append(df[day_next:])
This approach has the following problems:
a=np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes day_next, but I need to exclude it from the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
Note that df.index.day would use the day of the month (i.e. 1 to 31) for grouping, which could result in undesirable behavior if the input dataframe's dates extend across multiple months.
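As a small follow-on sketch (assuming the df built in the question), a dict keyed by date can be handier than a bare list when you need a specific day back:

frames_by_day = {day: frame for day, frame in df.groupby(df.index.date)}
sub = frames_by_day[pd.Timestamp('2012-11-05').date()]  # one day's sub-frame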
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
Should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!