I'm doing some resampling on data and I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
Also, why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
1-min data:
result of resampling to 5-min intervals:
This is how I do the resampling:
df1.loc[:,'qKfz_gesamt'].resample('5min').mean()
I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
The problem is that resample creates consecutive 5-minute intervals even where the original data has gaps, and fills the non-existing values with NaN:
df1 = pd.DataFrame({'qKfz_gesamt': range(4)},
index=pd.to_datetime(['2018-11-25 00:00:00','2018-11-25 00:01:00',
'2018-11-25 00:02:00','2018-11-25 00:15:00']))
print (df1)
qKfz_gesamt
2018-11-25 00:00:00 0
2018-11-25 00:01:00 1
2018-11-25 00:02:00 2
2018-11-25 00:15:00 3
print (df1['qKfz_gesamt'].resample('5min').mean())
2018-11-25 00:00:00 1.0
2018-11-25 00:05:00 NaN
2018-11-25 00:10:00 NaN
2018-11-25 00:15:00 3.0
Freq: 5T, Name: qKfz_gesamt, dtype: float64
print (df1['qKfz_gesamt'].resample('5min').mean().dropna())
2018-11-25 00:00:00 1.0
2018-11-25 00:15:00 3.0
Name: qKfz_gesamt, dtype: float64
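If you would rather fill the empty bins than drop them, forward-filling the resampled series is another option (a small extension, not in the original answer):
print (df1['qKfz_gesamt'].resample('5min').mean().ffill())
2018-11-25 00:00:00    1.0
2018-11-25 00:05:00    1.0
2018-11-25 00:10:00    1.0
2018-11-25 00:15:00    3.0
Freq: 5T, Name: qKfz_gesamt, dtype: float64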
why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
You need to filter by the maximal value of the index:
rng = pd.date_range('2018-11-25', periods=10)
df1 = pd.DataFrame({'a': range(10)}, index=rng)
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
2018-12-01 6
2018-12-02 7
2018-12-03 8
2018-12-04 9
df1 = df1.loc[:'2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
Or:
df1 = df1.loc[df1.index <= '2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
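To see why this matters for resampling: resample always spans from index.min() to index.max(), so a single stray timestamp far past the real data is enough to create thousands of extra, mostly-NaN bins. A minimal sketch with made-up timestamps:
import pandas as pd

idx = pd.to_datetime(['2018-11-25 00:00', '2018-11-25 00:01', '2018-12-11 00:00'])
s = pd.Series([1, 2, 3], index=idx)
print (len(s.resample('5min').mean()))  # 4609 bins, almost all NaN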
For example, I have several columns of dates and I want to get the month from them. Is there a way to loop through columns instead of running pd.DatetimeIndex(df['date']).month
multiple times? The example below is simplified. The real dataset has many more columns.
import pandas as pd
import numpy as np
np.random.seed(0)
rng_start = pd.date_range('2015-07-24', periods=5, freq='M')
rng_mid = pd.date_range('2019-06-24', periods=5, freq='M')
rng_end = pd.date_range('2022-03-24', periods=5, freq='M')
df = pd.DataFrame({ 'start_date': rng_start, 'mid_date': rng_mid, 'end_date': rng_end })
df
start_date mid_date end_date
0 2015-07-31 2019-06-30 2022-03-31
1 2015-08-31 2019-07-31 2022-04-30
2 2015-09-30 2019-08-31 2022-05-31
3 2015-10-31 2019-09-30 2022-06-30
4 2015-11-30 2019-10-31 2022-07-31
The intended output would be
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You answered your question by saying "loop through columns":
for column in df:
df[column.replace("_date", "_month")] = df[column].dt.month
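If the frame also contained non-date columns, looping over everything would fail on .dt, so you may want to restrict the loop (a small variation, assuming the date columns all end in _date):
for column in df.filter(like='_date'):
    df[column.replace("_date", "_month")] = df[column].dt.month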
An alternative solution (a variation of @BENY's):
df[df.columns.str.replace("_date", "_month")] = df.apply(lambda x: x.dt.month, axis=1)
Try apply
df[['start_month', 'mid_month', 'end_month']] = df.apply(lambda x : x.dt.month,axis=1)
df
Out[244]:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
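Note that axis=1 applies the lambda once per row; applied to the three date columns explicitly, the default column-wise apply gives the same result and each x is then a whole datetime column:
df[['start_month', 'mid_month', 'end_month']] = (
    df[['start_date', 'mid_date', 'end_date']].apply(lambda x: x.dt.month))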
You can avoid looping using stack:
out = df.join(df.filter(like='_date') # select _date columns
.stack() # convert to Series
.dt.month
.unstack() # back to DataFrame
.rename(columns=lambda x: x.replace('_date', '_month'))
)
Output:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
Quite similar to this solution but a bit different:
df.join(df.applymap(lambda x: x.month)
          .set_axis(['start_month', 'mid_month', 'end_month'], axis=1))
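In pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map, so the same idea reads:
df.join(df.map(lambda x: x.month)
          .set_axis(['start_month', 'mid_month', 'end_month'], axis=1))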
I have multiple data frames, each holding one day of data, minute 1 to minute 1440. The dataframes are alike: same columns, same length, and the time column values are in hhmm format.
Let's say df_A has the data for the first day, 2021-05-06. It looks like this:
>df_A
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
And the next day's data is in df_B, which looks the same. Its date is 2021-05-07.
>df_B
timestamp col1 col2..... col80
0
1
2
.
.
.
2359
How could I stack these one under another into a single dataframe, while identifying each row with a column holding values in a format like YYYYMMDD HH:mm? It would look somewhat like this:
>df
timestamp col1 col2..... col80
20210506 0000
20210506 0001
.
.
20210506 2359
20210507 0000
.
.
20210507 2359
How could I achieve this while dealing with multiple data frames at once?
import pandas as pd

df_A = pd.DataFrame(range(0, 10), columns=['timestamp'])
df_B = pd.DataFrame(range(0, 10), columns=['timestamp'])

# zero-pad the hhmm values and prepend the day before parsing
df_A['date'] = pd.to_datetime('2021-05-06 ' + df_A['timestamp'].astype(str).str.zfill(4),
                              format='%Y-%m-%d %H%M')
df_B['date'] = pd.to_datetime('2021-05-07 ' + df_B['timestamp'].astype(str).str.zfill(4),
                              format='%Y-%m-%d %H%M')
df_final = pd.concat([df_A, df_B])
df_final
timestamp date
0 0 2021-05-06 00:00:00
1 1 2021-05-06 00:01:00
2 2 2021-05-06 00:02:00
3 3 2021-05-06 00:03:00
4 4 2021-05-06 00:04:00
5 5 2021-05-06 00:05:00
6 6 2021-05-06 00:06:00
7 7 2021-05-06 00:07:00
8 8 2021-05-06 00:08:00
9 9 2021-05-06 00:09:00
0 0 2021-05-07 00:00:00
1 1 2021-05-07 00:01:00
2 2 2021-05-07 00:02:00
3 3 2021-05-07 00:03:00
4 4 2021-05-07 00:04:00
5 5 2021-05-07 00:05:00
6 6 2021-05-07 00:06:00
7 7 2021-05-07 00:07:00
8 8 2021-05-07 00:08:00
9 9 2021-05-07 00:09:00
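To deal with many frames at once, one sketch (the frames dict and its keys are illustrative, not from the original answer) is to map each frame to its day and concatenate in one go:
frames = {'2021-05-06': df_A, '2021-05-07': df_B}  # add further days here
parts = []
for day, frame in frames.items():
    frame = frame.copy()
    frame['date'] = pd.to_datetime(day + ' ' + frame['timestamp'].astype(str).str.zfill(4),
                                   format='%Y-%m-%d %H%M')
    parts.append(frame)
df_final = pd.concat(parts, ignore_index=True)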
I am dealing with financial data which I need to extrapolate for different months. Here is my dataframe:
invoice_id,date_from,date_to
30492,2019-02-04,2019-09-18
I want to break this up into the different months between date_from and date_to. Hence I need to add rows for each month, with the month's start date and end date. The final output should look like:
invoice_id,date_from,date_to
30492,2019-02-04,2019-02-28
30492,2019-03-01,2019-03-31
30492,2019-04-01,2019-04-30
30492,2019-05-01,2019-05-31
30492,2019-06-01,2019-06-30
30492,2019-07-01,2019-07-31
30492,2019-08-01,2019-08-31
30492,2019-09-01,2019-09-18
Leap years need to be handled correctly as well. Is there a native method in pandas' datetime functionality that I can use to achieve the desired output?
Use:
print (df)
invoice_id date_from date_to
0 30492 2019-02-04 2019-09-18
1 30493 2019-01-20 2019-03-10
#added months between date_from and date_to
df1 = pd.concat([pd.Series(r.invoice_id,pd.date_range(r.date_from, r.date_to, freq='MS'))
for r in df.itertuples()]).reset_index()
df1.columns = ['date_from','invoice_id']
#added starts of months - sorting for correct positions
df2 = (pd.concat([df[['invoice_id','date_from']], df1], sort=False, ignore_index=True)
.sort_values(['invoice_id','date_from'])
.reset_index(drop=True))
#added MonthEnd and date_to to last rows
mask = df2['invoice_id'].duplicated(keep='last')
s = df2['invoice_id'].map(df.set_index('invoice_id')['date_to'])
df2['date_to'] = np.where(mask, df2['date_from'] + pd.offsets.MonthEnd(), s)
print (df2)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
8 30493 2019-01-20 2019-01-31
9 30493 2019-02-01 2019-02-28
10 30493 2019-03-01 2019-03-10
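An alternative sketch (not from the original answer; month_rows is an illustrative helper): build each invoice's month grid with period_range, then clip the first and last rows to the real dates:
def month_rows(r):
    months = pd.period_range(r.date_from, r.date_to, freq='M')
    out = pd.DataFrame({'invoice_id': r.invoice_id,
                        'date_from': months.to_timestamp(how='start'),
                        'date_to': months.to_timestamp(how='end').normalize()})
    # clip the partial first and last months to the actual invoice dates
    out.loc[out.index[0], 'date_from'] = pd.Timestamp(r.date_from)
    out.loc[out.index[-1], 'date_to'] = pd.Timestamp(r.date_to)
    return out

df3 = pd.concat([month_rows(r) for r in df.itertuples()], ignore_index=True)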
You can use pandas.date_range with start and end dates, combining freq='MS' (month start) and freq='M' (month end). The two ranges only cover the full months in between, so the partial first and last months have to be added back by hand:
x = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='MS')
y = pd.date_range(start=df.iloc[0]['date_from'], end=df.iloc[0]['date_to'], freq='M')

# prepend the real start date and append the real end date so the
# partial first and last months are included and the rows line up
df_new = pd.DataFrame({'date_from': [pd.Timestamp(df.iloc[0]['date_from'])] + list(x),
                       'date_to': list(y) + [pd.Timestamp(df.iloc[0]['date_to'])]})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)
date_from date_to invoice_id
0 2019-02-04 2019-02-28 30492
1 2019-03-01 2019-03-31 30492
2 2019-04-01 2019-04-30 30492
3 2019-05-01 2019-05-31 30492
4 2019-06-01 2019-06-30 30492
5 2019-07-01 2019-07-31 30492
6 2019-08-01 2019-08-31 30492
7 2019-09-01 2019-09-18 30492
Another way, using the resample method of a datetime index:
# melt, so we have start and end dates in 1 column
df = pd.melt(df, id_vars='invoice_id')
# now set the date column as index
df = df.set_index('value')
# resample to daily level
df = df.resample('D').ffill().reset_index()
# get the yr-month value of each daily row
df['yr_month'] = df['value'].dt.strftime("%Y-%m")
# Now group by month and take min/max day values
output = (df.groupby(['invoice_id', 'yr_month'])['value']
          .agg(date_from='min', date_to='max')  # named aggregation; dict-based renaming was removed from pandas
          .reset_index()
          .drop(labels='yr_month', axis=1))
print(output)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
How can I convert the timestamp column of a dataframe to a numeric value? The datatype of the Time column in the dataframe df below is datetime64.
Time Count
2018-05-15 00:00:00 4
2018-05-15 00:15:00 1
2018-05-15 00:30:00 5
2018-05-15 00:45:00 6
2018-05-15 01:15:00 3
2018-05-15 01:30:00 4
2018-05-15 02:30:00 5
2018-05-15 02:45:00 3
2018-05-15 03:15:00 2
2018-05-15 03:30:00 5
By using to_numeric
pd.to_numeric(df.Time)
Out[218]:
0 1526342400000000000
1 1526343300000000000
2 1526344200000000000
3 1526345100000000000
4 1526346900000000000
5 1526347800000000000
6 1526351400000000000
7 1526352300000000000
8 1526354100000000000
9 1526355000000000000
Name: Time, dtype: int64
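The integers are nanoseconds since the Unix epoch; if you want seconds instead, integer-divide by 10**9:
pd.to_numeric(df.Time) // 10**9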
I have a dataframe which I want to split into 5 chunks (more generally n chunks), so that I can apply a groupby on the chunks.
I want the chunks to have equal time intervals but in general each group may contain different numbers of records.
Let's call the data
s = pd.Series(pd.date_range('2012-1-1', periods=100, freq='D'))
and the time interval ti = (s.max() - s.min())/n
So the first chunk should include all rows with dates between s.min() and s.min() + ti, the second, all rows with dates between s.min() + ti and s.min() + 2*ti, etc.
Can anyone suggest an easy way to achieve this? If somehow I could convert all my dates into seconds since the epoch, then I could do something like thisgroup = floor(thisdate/ti).
Is there an easy 'pythonic' or 'panda-ista' way to do this?
Thanks very much (and Merry Christmas!),
Robin
You can use numpy.array_split (note that it splits into equal-count chunks, which match equal time intervals only when the dates are evenly spaced):
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(pd.date_range('2012-1-1', periods=10, freq='D'))
>>> np.array_split(s, 5)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
dtype: datetime64[ns], 2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
dtype: datetime64[ns], 4 2012-01-05 00:00:00
5 2012-01-06 00:00:00
dtype: datetime64[ns], 6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
dtype: datetime64[ns], 8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
>>> np.array_split(s, 2)
[0 2012-01-01 00:00:00
1 2012-01-02 00:00:00
2 2012-01-03 00:00:00
3 2012-01-04 00:00:00
4 2012-01-05 00:00:00
dtype: datetime64[ns], 5 2012-01-06 00:00:00
6 2012-01-07 00:00:00
7 2012-01-08 00:00:00
8 2012-01-09 00:00:00
9 2012-01-10 00:00:00
dtype: datetime64[ns]]
The answer is as follows:
import numpy as np
import pandas as pd

s = pd.DataFrame(pd.date_range('2012-1-1', periods=20, freq='D'), columns=["date"])
n = 5
# convert the datetimes to integer nanoseconds since the epoch
s["date"] = s["date"].astype(np.int64)
# equal-width bins; the small offset keeps the maximum inside bin n-1
s["bin"] = np.floor((n - 0.001) * (s["date"] - s["date"].min())
                    / (s["date"].max() - s["date"].min()))