resample data each column together in dataframe


I have a DataFrame named zz.
Its column names are ['Ancolmekar','Cidurian','Dayeuhkolot','Hantap','Kertasari','Meteolembang','Sapan'].
for col in zz.columns:
    df = pd.DataFrame(zz[col],index=pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
    df.resample('1M').mean()
error: invalid syntax
I want the monthly mean of data recorded at 10-minute intervals. When I run this, only the Sapan values appear, and they are all NaN. Beforehand, I replaced the data with 1 where it was NaN and 0 otherwise.
Sapan
2017-01-31 NaN
2017-02-28 NaN
2017-03-31 NaN
2017-04-30 NaN
2017-05-31 NaN
2017-06-30 NaN
2017-07-31 NaN
2017-08-31 NaN
2017-09-30 NaN
2017-10-31 NaN
2017-11-30 NaN
2017-12-31 NaN
2018-01-31 NaN
2018-02-28 NaN
2018-03-31 NaN
2018-04-30 NaN
2018-05-31 NaN
2018-06-30 NaN
2018-07-31 NaN
2018-08-31 NaN
2018-09-30 NaN
2018-10-31 NaN
2018-11-30 NaN
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
2019-11-30 NaN
2019-12-31 NaN
2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 NaN
2021-04-30 NaN
2021-05-31 NaN
2021-06-30 NaN
2021-07-31 NaN
2021-08-31 NaN
2021-09-30 NaN
2021-10-31 NaN
2021-11-30 NaN
2021-12-31 NaN
What should I do? Thanks in advance.

You are reassigning the variable df to a single-column DataFrame on each pass through the for loop. The last column is Sapan, so only that column is shown.
Additionally, the index you set on df is probably not the index of zz. pd.DataFrame(zz[col], index=...) looks up the new labels in the existing index, so every label that does not exist there comes back as NaN (Not a Number).
If zz has the default integer index and its rows correspond, in order, to the timestamps you are setting, this should work:
df = zz.copy()
df['new_column'] = pd.Series(pd.date_range('2017-01-01 00:00:00', '2021-12-31 23:50:00', freq='10T'))
df = df.set_index('new_column')
df.resample('1M').mean()
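As a minimal sketch of the idea above, the example below builds a small synthetic stand-in for zz (two station columns, two months of 10-minute data, default integer index; all values are made up), attaches the timestamps by position, and averages every column per month in one call. Grouping by to_period('M') is used here instead of resample('1M') only because recent pandas releases deprecate the 'M' and 'T' aliases in favor of 'ME' and 'min'; on older versions df.resample('1M').mean() gives the same monthly means.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for zz: made-up values, default RangeIndex
idx = pd.date_range("2017-01-01 00:00:00", "2017-02-28 23:50:00", freq="10min")
zz = pd.DataFrame({
    "Cidurian": np.ones(len(idx)),
    "Sapan": np.arange(len(idx), dtype=float),
})

df = zz.copy()
df.index = idx  # assign timestamps by position; lengths must match

# One monthly mean per column, for all columns at once
monthly = df.groupby(df.index.to_period("M")).mean()
print(monthly)
```

Assigning df.index directly fails loudly if the number of rows and timestamps differ, which is a useful check that the data really covers the whole date range.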

Related

Reindexing Pandas based on daterange

I am trying to reindex the dates in pandas because some dates are missing, such as weekends and national holidays.
To do this I am using the following code:
import pandas as pd
import yfinance as yf
import datetime
start = datetime.date(2015,1,1)
end = datetime.date.today()
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index = df.index.strftime('%Y-%m-%d')
full_dates = pd.date_range(start, end)
df.reindex(full_dates)
This code is producing this dataframe:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 NaN NaN NaN NaN NaN NaN
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2023-01-13 NaN NaN NaN NaN NaN NaN
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
Could you please advise why it is not reindexing the data and shows NaN values instead?
=== Edit ===
Could it be a Python version issue? I ran the same code in Python 3.7 and 3.10.
[Screenshots of the output in Python 3.7 and Python 3.10 not preserved]
In Python 3.10 the index is datetime, as the screenshot showed.
This is the datetime index I get after yf.download('F', start, end, interval ='1d', progress = False) without strftime.

Remove the line df.index = df.index.strftime('%Y-%m-%d'), which converts the DatetimeIndex to strings, so that you can reindex by datetimes.
df = yf.download('F', start, end, interval ='1d', progress = False)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407450 44079700.0
... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
print (df.index)
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
'2015-01-09', '2015-01-10',
...
'2023-01-08', '2023-01-09', '2023-01-10', '2023-01-11',
'2023-01-12', '2023-01-13', '2023-01-14', '2023-01-15',
'2023-01-16', '2023-01-17'],
dtype='datetime64[ns]', length=2939, freq='D')
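EDIT: There is a timezone difference. To remove the timezone, use DatetimeIndex.tz_convert (note that tz_convert(None) converts to UTC before dropping the timezone; if the timestamps then no longer fall on midnight, DatetimeIndex.tz_localize(None) keeps the local wall-clock time instead):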
df = yf.download('F', start, end, interval ='1d', progress = False)
df.index= df.index.tz_convert(None)
full_dates = pd.date_range(start, end)
df = df.reindex(full_dates)
print (df)
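
To see what the timezone step changes, here is a small sketch on made-up data (the prices and the America/New_York timezone are assumptions, not actual yfinance output). tz_convert(None) shifts to UTC before dropping the timezone, while tz_localize(None) keeps the local wall-clock time, so only the latter leaves midnight timestamps that match a plain pd.date_range.

```python
import pandas as pd

# Hypothetical timezone-aware daily data (values are made up)
idx = pd.date_range("2015-01-02", periods=3, freq="D", tz="America/New_York")
df = pd.DataFrame({"Close": [1.0, 2.0, 3.0]}, index=idx)

full_dates = pd.date_range("2015-01-01", "2015-01-05")  # timezone-naive

# Drop the timezone but keep the local midnight timestamps
localized = df.copy()
localized.index = localized.index.tz_localize(None)
via_localize = localized.reindex(full_dates)

# Convert to UTC first: midnight in New York becomes 05:00
converted = df.copy()
converted.index = converted.index.tz_convert(None)
via_convert = converted.reindex(full_dates)

print(via_localize)
print(via_convert)
```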

You need to use strings in reindex to keep a homogeneous type; otherwise pandas doesn't match the string (e.g., '2015-01-02') with the Timestamp (e.g., pd.Timestamp('2015-01-02')):
df.reindex(full_dates.astype(str))
#or
df.reindex(full_dates.strftime('%Y-%m-%d'))
Output:
Open High Low Close Adj Close Volume
2015-01-01 NaN NaN NaN NaN NaN NaN
2015-01-02 15.59 15.65 15.18 15.36 10.830517 24777900.0
2015-01-03 NaN NaN NaN NaN NaN NaN
2015-01-04 NaN NaN NaN NaN NaN NaN
2015-01-05 15.12 15.13 14.69 14.76 10.407451 44079700.0
... ... ... ... ... ... ...
2023-01-13 12.63 12.82 12.47 12.72 12.720000 96317800.0
2023-01-14 NaN NaN NaN NaN NaN NaN
2023-01-15 NaN NaN NaN NaN NaN NaN
2023-01-16 NaN NaN NaN NaN NaN NaN
2023-01-17 NaN NaN NaN NaN NaN NaN
[2939 rows x 6 columns]
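
The label mismatch can be reproduced without downloading anything. The sketch below uses two made-up rows (not real quotes) and compares reindexing a string-labelled frame with Timestamps against reindexing it with strings of the same format.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the downloaded data
df = pd.DataFrame(
    {"Close": [15.36, 14.76]},
    index=pd.to_datetime(["2015-01-02", "2015-01-05"]),
)
df.index = df.index.strftime("%Y-%m-%d")  # index labels are now plain strings

full_dates = pd.date_range("2015-01-01", "2015-01-05")

# Timestamp labels never equal string labels, so nothing matches
mismatched = df.reindex(full_dates)

# String labels in the same format do match
matched = df.reindex(full_dates.strftime("%Y-%m-%d"))

print(mismatched)
print(matched)
```

Keeping the DatetimeIndex is usually the more convenient option, since a string index loses date-based slicing and resampling.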

Append empty rows by subtracting 7 days from date

How can I create empty rows in 7-day steps going back from 2016-01-01 to January 2015? I tried reindexing.
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0

First create a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
Then use DataFrame.reindex with a date_range that runs from your chosen start date to the minimal index value, combined with Index.union so that the original index values are not lost:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
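
If the aligned start date (2015-01-02 above) is not known in advance, one possible variation is to anchor the weekly range at the first existing date and walk backwards. This is only a sketch: lower_bound and n_steps are names introduced here for illustration, and the choice of 2015-01-01 as the earliest allowed date is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2016-01-01", "2016-01-08", "2016-01-15"],
                   "value": [4.0, 5.0, 1.0]})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

first = df.index.min()
lower_bound = pd.Timestamp("2015-01-01")  # assumed earliest date allowed

# Number of whole 7-day steps that fit between lower_bound and the first date
n_steps = (first - lower_bound).days // 7

# Anchor the weekly range at the first existing date and walk backwards
back = pd.date_range(end=first, periods=n_steps + 1, freq="7D")
out = df.reindex(back.union(df.index))
print(out)
```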

Filling missing dates by imputing on previous dates in Python

I have a time series that I want to lag and predict on for future data one year ahead that looks like:
Date Energy Pred Energy Lag Error
.
2017-09-01 9 8.4
2017-10-01 10 9
2017-11-01 11 10
2017-12-01 12 11.5
2018-01-01 1 1.3
NaT (pred-true)
NaT
NaT
NaT
.
.
All I want to do is impute dates into the NaT entries to continue from 2018-01-01 to 2019-01-01 (just fill them like we're in Excel drag and drop) because there are enough NaT positions to fill up to that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the same previous date or drops things I don't want to drop.
Any way to just fill these NaTs with 1 month increments like the previous data?

Make the df by copying the text below to the clipboard, then set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex
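
For a version that runs without the clipboard, the same steps can be sketched with io.StringIO. The trailing commas in the sample text are an addition here so that the empty Lag Error field is explicit; the values are the ones from the question.

```python
import io
import pandas as pd

csv_text = """Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4,
2017-10-01,10,9,
2017-11-01,11,10,
2017-12-01,12,11.5,
2018-01-01,1,1.3,
"""

# Parse the Date column and use it as the index in one step
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Month-start dates from the first observation to one year past the last one
idx = pd.date_range(start="2017-09-01", end="2019-01-01", freq="MS")
out = df.reindex(idx)
print(out)
```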

how to assign values to a new data frame from another data frame in python

I set up a new data frame SimMean:
columns = ['Tenor','5x16', '7x8', '2x16H']
index = range(0,12)
SimMean = pd.DataFrame(index=index, columns=columns)
SimMean
Tenor 5x16 7x8 2x16H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
I have another data frame FwdDf:
FwdDf
Tenor 5x16 7x8 2x16H
0 2017-01-01 50.94 34.36 43.64
1 2017-02-01 50.90 32.60 42.68
2 2017-03-01 42.66 26.26 37.26
3 2017-04-01 37.08 22.65 32.46
4 2017-05-01 42.21 20.94 33.28
5 2017-06-01 39.30 22.05 32.29
6 2017-07-01 50.90 21.80 38.51
7 2017-08-01 42.77 23.64 35.07
8 2017-09-01 37.45 19.61 32.68
9 2017-10-01 37.55 21.75 32.10
10 2017-11-01 35.61 22.73 32.90
11 2017-12-01 40.16 29.79 37.49
12 2018-01-01 53.45 36.09 47.61
13 2018-02-01 52.89 35.74 45.00
14 2018-03-01 44.67 27.79 38.62
15 2018-04-01 38.48 24.21 34.43
16 2018-05-01 43.87 22.17 34.69
17 2018-06-01 40.24 22.85 34.31
18 2018-07-01 49.98 23.58 39.96
19 2018-08-01 45.57 24.76 37.23
20 2018-09-01 38.90 21.74 34.22
21 2018-10-01 39.75 23.36 35.20
22 2018-11-01 38.04 24.20 34.62
23 2018-12-01 42.68 31.03 40.00
Now I need to assign the 'Tenor' data from rows 12 to 23 in FwdDf to the new data frame SimMean.
I used
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor']
but it didn't work:
SimMean
Tenor 5x16 7x8 2x16H
0 None NaN NaN NaN
1 None NaN NaN NaN
2 None NaN NaN NaN
3 None NaN NaN NaN
4 None NaN NaN NaN
5 None NaN NaN NaN
6 None NaN NaN NaN
7 None NaN NaN NaN
8 None NaN NaN NaN
9 None NaN NaN NaN
10 None NaN NaN NaN
11 None NaN NaN NaN
I'm new to python. I would appreciate your help. Thanks

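The assignment aligns on index labels, and labels 12 to 23 do not exist in SimMean, so call .values to assign by position and avoid index alignment issues: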
In [35]:
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor'].values
SimMean
Out[35]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
EDIT
As your column is actually datetime, you need to convert the type again:
In [46]:
SimMean['Tenor'] = pd.to_datetime(SimMean['Tenor'])
SimMean
Out[46]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
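
The alignment behaviour can be sketched on cut-down versions of the two frames (only the Tenor column of FwdDf is rebuilt here, from a date range). The example also shows an alternative to .values: relabelling the source with reset_index(drop=True) so that the labels line up, which keeps the datetime dtype without a second conversion.

```python
import pandas as pd

# Hypothetical, cut-down versions of the two frames from the question
FwdDf = pd.DataFrame({"Tenor": pd.date_range("2017-01-01", periods=24, freq="MS")})
SimMean = pd.DataFrame(index=range(12), columns=["Tenor", "5x16", "7x8", "2x16H"])

source = FwdDf.loc[12:23, "Tenor"]  # labels 12..23

# What label alignment does: look up labels 0..11 in a Series labelled 12..23
aligned = source.reindex(SimMean.index)  # every value is missing (NaT)

# Relabel the source 0..11 so that the labels line up, then assign the column
SimMean["Tenor"] = source.reset_index(drop=True)
print(SimMean["Tenor"].head(3))
```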

"ValueError: cannot reindex from a duplicate axis"

I have the following df:
Timestamp A B C ...
2014-11-09 00:00:00 NaN 1 NaN NaN
2014-11-09 00:00:00 2 NaN NaN NaN
2014-11-09 00:00:00 NaN NaN 3 NaN
2014-11-09 08:24:00 NaN NaN 1 NaN
2014-11-09 08:24:00 105 NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
And I would like to make the following:
Timestamp A B C ...
2014-11-09 00:00:00 2 1 3 NaN
2014-11-09 00:01:00 NaN NaN NaN NaN
2014-11-09 00:02:00 NaN NaN NaN NaN
... NaN NaN NaN NaN
2014-11-09 08:23:00 NaN NaN NaN NaN
2014-11-09 08:24:00 105 NaN 1 NaN
2014-11-09 08:25:00 NaN NaN NaN NaN
2014-11-09 08:26:00 NaN NaN NaN NaN
2014-11-09 08:27:00 NaN NaN NaN NaN
... NaN NaN NaN NaN
2014-11-09 09:18:00 NaN NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
That is, I would like to merge the rows that share the same Timestamp (I have 17 columns), resample at 1-minute granularity, and have NaN in the columns that have no values.
I started in the following ways:
df.groupby('Timestamp').sum()
and
df = df.resample('1Min', how='max')
but I obtained the following error:
ValueError: cannot reindex from a duplicate axis
How can I solve this problem? I'm just learning Python so I don't have experience at all.
Thank you!

Assuming that Timestamp is already your index, you need to do the resample first and then reset_index before doing a groupby. Here's a working sample:
import pandas as pd
df
A B C ...
Timestamp
2014-11-09 00:00:00 NaN 1 NaN NaN
2014-11-09 00:00:00 2 NaN NaN NaN
2014-11-09 00:00:00 NaN NaN 3 NaN
2014-11-09 08:24:00 NaN NaN 1 NaN
2014-11-09 08:24:00 105 NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
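# Newer pandas removed how='max': call .max() on the resampler; min_count=1 keeps empty bins as NaN
df.resample('1min').max().reset_index().groupby('Timestamp').sum(min_count=1)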
A B C ...
Timestamp
2014-11-09 00:00:00 2 1 3 NaN
2014-11-09 00:01:00 NaN NaN NaN NaN
2014-11-09 00:02:00 NaN NaN NaN NaN
2014-11-09 00:03:00 NaN NaN NaN NaN
2014-11-09 00:04:00 NaN NaN NaN NaN
...
2014-11-09 09:17:00 NaN NaN NaN NaN
2014-11-09 09:18:00 NaN NaN NaN NaN
2014-11-09 09:19:00 NaN NaN 23 NaN
Hope this helps.
Updated:
As noted in the comments, your 'Timestamp' is not a datetime (it is probably a string), so you cannot resample on it as a DatetimeIndex. Just reset_index and convert it, something like this:
df = df.reset_index()
df['ts'] = pd.to_datetime(df['Timestamp'])
# 'ts' is now datetime of 'Timestamp', you just need to set it to index
df = df.set_index('ts')
...
Now just run the previous code again but replace 'Timestamp' with 'ts' and you should be OK.
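
Put together, the steps can be sketched as follows. The frame below is rebuilt by hand from the three columns shown in the question, and the index starts out as strings to mimic the situation described in the update. With current pandas, a single resample followed by max already merges rows that share a timestamp and leaves empty minutes as NaN.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, with string timestamps as the index
stamps = (["2014-11-09 00:00:00"] * 3
          + ["2014-11-09 08:24:00"] * 2
          + ["2014-11-09 09:19:00"])
df = pd.DataFrame(
    {
        "A": [np.nan, 2, np.nan, np.nan, 105, np.nan],
        "B": [1, np.nan, np.nan, np.nan, np.nan, np.nan],
        "C": [np.nan, np.nan, 3, 1, np.nan, 23],
    },
    index=pd.Index(stamps, name="Timestamp"),
)

# Convert the string index to datetimes so that resample can use it
df.index = pd.to_datetime(df.index)

# One row per minute; duplicates collapse to their max, empty minutes stay NaN
out = df.resample("1min").max()
print(out)
```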
