I have a pandas DataFrame indexed by pandas.core.indexes.datetimes.DatetimeIndex and I'd like to add new values starting from the last date in the series. I need each new value to be assigned to the next date, at daily frequency.
Example:
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
I'd like to insert, say, 5000 and have it automatically assigned to date 2014-01-01, and so on for the following values. What would be the best way to do that?
Example:
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
2014-01-01 5000
Use loc with DateOffset:
df.loc[df.index.max()+pd.DateOffset(1)] = 5000
TotalOrders
Date
2013-12-29 3756
2013-12-30 6222
2013-12-31 4918
2014-01-01 5000
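A minimal, self-contained sketch of that answer (the frame is rebuilt from the example above):

```python
import pandas as pd

# Rebuild the example frame with a DatetimeIndex
df = pd.DataFrame(
    {"TotalOrders": [3756, 6222, 4918]},
    index=pd.to_datetime(["2013-12-29", "2013-12-30", "2013-12-31"]),
)
df.index.name = "Date"

# Append a new value one day after the current last date
df.loc[df.index.max() + pd.DateOffset(1)] = 5000

print(df)
```

Each subsequent insertion repeats the same line, so the frame keeps growing at daily frequency.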
Related
I have 3 columns in the dataset to which I wanna add dates
Date                  temperature  humidity
2015-01-01 00:00:00           5.9        NA
2015-01-01 01:00:00           5.5        NA
⋮                               ⋮         ⋮
2015-01-01 23:00:00             7        NA
I wanna add the period from 1st May to 31 July to the Date column, at hourly frequency; it would look something like this:
Date                  temperature  humidity
⋮                               ⋮         ⋮
2015-01-01 23:00:00             7        NA
2015-05-01 00:00:00            ..        NA
2015-05-01 01:00:00            ..        NA
⋮                               ⋮         ⋮
until it reaches:
Date                  temperature  humidity
⋮                               ⋮         ⋮
2015-07-31 23:00:00            ..        NA
I've tried
import datetime

date = datetime.datetime(2015, 3, 31, 23, 0, 0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    print(date)
is there an easier way to do it?
Well, you have a start with the iteration using datetime.datetime and datetime.timedelta.
You just need to accumulate those dates in a list rather than printing them:
listofdates = []
date = datetime.datetime(2015, 3, 31, 23, 0, 0)
for i in range(32):
    date += datetime.timedelta(hours=1)
    listofdates.append(date)
And then, create a new dataframe from the existing one (let's call it df) and this list of dates. To do so, you can use pd.concat, which builds a dataframe from two dataframes.
So, you need a dataframe built from your new list of dates, with the same column name:
newlines = pd.DataFrame({'Date':listofdates})
Which gives
Date
2015-04-01 00:00:00
⋮
2015-04-02 07:00:00
(Note that it starts at 4/1 00:00, not 3/31 23:00, because you add the timedelta before appending)
We can concatenate your dataframe with this one (missing columns will be filled with NA) like this
newdf = pd.concat([df, newlines])
Some last remarks that I kept for the end to avoid confusion:
I would have stored the timedelta once and for all, rather than creating one each time (it is not that expensive, but still)
So, altogether
date = datetime.datetime(2015, 3, 31, 23, 0, 0)
dt = datetime.timedelta(hours=1)
listofdates = []
for i in range(32):
    date += dt
    listofdates.append(date)
newlines = pd.DataFrame({'Date':listofdates})
newdf = pd.concat([df, newlines])
For this kind of usage, you can build the list directly with a list comprehension
listofdates = [date + k*dt for k in range(1, 33)]
Or using numpy
listofdates=date+np.arange(1,33)*dt
Which allows for a one-liner
newdf = pd.concat([df, pd.DataFrame({'Date':date+np.arange(1,33)*dt})])
But don't try to understand this one until you've understood the longer version described above.
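As a side note, the same list of hourly timestamps can be built without any loop using pd.date_range (a sketch; the start and count are taken from the example above):

```python
import pandas as pd

# 32 hourly timestamps, starting one hour after 2015-03-31 23:00
dates = pd.date_range(start="2015-04-01 00:00", periods=32, freq="h")
newlines = pd.DataFrame({"Date": dates})
print(newlines.head())
```

The resulting frame can be passed to pd.concat exactly like `newlines` in the answer above.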
I have two dataframes, one is called Clim and one is called O3_mda8_3135. Clim is a dataframe including monthly average meteorological parameters for one year of data; here is a sample of the dataframe:
Clim.head(12)
Out[7]:
avgT_2551 avgT_5330 ... avgNOx_3135(ppb) avgCO_3135(ppm)
Month ...
1 14.924181 13.545691 ... 48.216128 0.778939
2 16.352172 15.415385 ... 36.110385 0.605629
3 20.530879 19.684720 ... 20.974544 0.460571
4 23.738576 22.919158 ... 14.270995 0.432855
5 26.961927 25.779007 ... 11.087005 0.334505
6 32.208322 31.225072 ... 12.801409 0.384325
7 35.280124 34.265880 ... 10.732970 0.321284
8 35.428857 34.433351 ... 11.916420 0.326389
9 32.008317 30.856782 ... 15.236616 0.343405
10 25.691444 24.139874 ... 24.829518 0.467317
11 19.310550 17.827946 ... 36.339847 0.621938
12 14.186050 12.860077 ... 49.173287 0.720708
[12 rows x 20 columns]
I also have the dataframe O3_mda8_3135, which was created by first calculating the rolling 8 hour average of each component, then finding the maximum daily value of ozone, which is why all of the timestamps and indices are different. There is one value for each meteorological parameter every day of the year. Here's a sample of this dataframe:
O3_mda8_3135
Out[9]:
date Temp_C_2551 ... CO_3135(ppm) O3_mda8_3135
12 2018-01-01 12:00:00 24.1 ... 0.294 10.4000
36 2018-01-02 12:00:00 26.3 ... 0.202 9.4375
60 2018-01-03 12:00:00 22.8 ... 0.184 7.1625
84 2018-01-04 12:00:00 25.6 ... 0.078 8.2500
109 2018-01-05 13:00:00 27.3 ... NaN 9.4500
... ... ... ... ...
8653 2018-12-27 13:00:00 19.6 ... 0.115 35.1125
8676 2018-12-28 12:00:00 14.9 ... 0.097 39.4500
8700 2018-12-29 12:00:00 13.9 ... 0.092 38.1250
8724 2018-12-30 12:00:00 17.4 ... 0.186 35.1375
8753 2018-12-31 17:00:00 8.3 ... 0.110 30.8875
[365 rows x 24 columns]
I am wondering how to subtract the average values in Clim from the corresponding columns and rows in O3_mda8_3135. For example, I would like to subtract the average value for temperature at site 2551 in January (avgT_2551 Month 1 in the Clim dataframe) from every day in January in the other dataframe O3_mda8_3135, column name Temp_C_2551.
avgT_2551 corresponds to Temp_C_2551 in the other dataframe
Is there a simple way to do this? Should I extract the month from the datetime and put it into another column for the O3_mda8_3135 dataframe? I am still a beginner and would appreciate any advice or tips.
I saw this post How to subtract the mean of a month from each day in that month? but there was not enough information given for me to understand what actions were being performed.
I figured it out on my own, thanks to Stack Overflow posts :)
I created new columns in both dataframes corresponding to the month. I had originally set the index of Clim to the month using Clim = Clim.set_index('Month'), so I removed that line. Then I created a 'Month' column in the O3_mda8_3135 dataframe. After that, I merged the two dataframes on the 'Month' column, then used the Series.sub method to subtract the columns I desired.
Here's some example code, sorry the variables are so long but this dataframe is huge.
O3_mda8_3135['Month'] = O3_mda8_3135['date'].dt.month
O3_mda8_3135_anom = pd.merge(O3_mda8_3135, Clim, how='left', on=('Month'))
O3_mda8_3135_anom['O3_mda8_3135_anom'] = O3_mda8_3135_anom['O3_mda8_3135'].sub(O3_mda8_3135_anom['MDA8_3135'])
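The same month-tag, merge, subtract pattern on toy data (all names below are invented for illustration; they are not the columns of the real dataframes):

```python
import pandas as pd

# Daily observations spanning two months
obs = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-10", "2018-01-20", "2018-02-05"]),
    "value": [10.0, 14.0, 20.0],
})

# Monthly climatology, one row per month (Month kept as a column, not the index)
clim = pd.DataFrame({"Month": [1, 2], "avg_value": [12.0, 18.0]})

# Tag each observation with its month, merge in the climatology, then subtract
obs["Month"] = obs["date"].dt.month
merged = obs.merge(clim, how="left", on="Month")
merged["anom"] = merged["value"].sub(merged["avg_value"])
print(merged[["date", "value", "avg_value", "anom"]])
```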
These posts helped me answer my question:
python pandas extract year from datetime: df['year'] = df['date'].year is not working
How to calculate monthly mean of a time series data and subtract the monthly mean from the values of that month of each year?
Find difference between 2 columns with Nulls using pandas
I have a dataframe with 5 minute time granularity. Right now I group the df over whole days and read the min / max values from two columns:
df.groupby(pd.Grouper(key='Date', freq='1D')).agg({'Low':[np.min],'High':[np.max] })
Now, instead of getting the whole day, I need to boil the dataframe down to a split day, with unequal intervals. Let's say 7:00 to 15:00 and 15:00 to 22:00.
How could I do it? freq allows only equal intervals.
I also have a column with value 'A' for the first part of the day, and 'B' for the second part of the day, in case it's easier to group.
Date High Low Session
0 2019-06-20 07:00:00 2927.50 2926.75 A
1 2019-06-20 07:05:00 2927.50 2927.00 A
2 2019-06-20 07:10:00 2927.25 2926.50 A
3 2019-06-20 07:15:00 2926.75 2926.25 A
4 2019-06-20 07:20:00 2926.75 2926.00 A
You can use your Session column
df = df.groupby([df.Date.dt.date, 'Session']).agg({'Low':'min', 'High':'max'})
Or you can make your own with pd.cut (right=False makes the bins left-inclusive, so hour 7 lands in the first interval):
df = (
    df.groupby([df.Date.dt.date,
                pd.cut(df.Date.dt.hour, bins=[7, 15, 22], right=False,
                       labels=['7-15', '15-22'])])
      .agg({'Low':'min', 'High':'max'})
)
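A runnable sketch of both variants on sample data (a couple of afternoon rows are invented so that both sessions appear; right=False is used because pd.cut bins are right-inclusive by default, which would otherwise drop hour 7):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-06-20 07:00", "2019-06-20 07:05",
        "2019-06-20 15:00", "2019-06-20 21:55",
    ]),
    "High": [2927.50, 2927.50, 2930.00, 2931.25],
    "Low": [2926.75, 2927.00, 2928.50, 2929.00],
    "Session": ["A", "A", "B", "B"],
})

# Variant 1: group on the existing Session column
by_session = df.groupby([df.Date.dt.date, "Session"]).agg({"Low": "min", "High": "max"})

# Variant 2: derive the session from the hour with pd.cut
by_cut = (
    df.groupby([df.Date.dt.date,
                pd.cut(df.Date.dt.hour, bins=[7, 15, 22], right=False,
                       labels=["7-15", "15-22"])],
               observed=True)
      .agg({"Low": "min", "High": "max"})
)
print(by_session)
print(by_cut)
```

Both give one row per (day, session) pair, with the session's min Low and max High.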
I have a simple time series of daily observations over 2 years. I basically want to plot the daily data for each month of the series (looking for daily seasonality that occurs each month). For example:
I'd expect a series on the chart for each month. Is there a way to split the dataframe easily to do this?
I'm trying to avoid doing this for each month/year...
df['JUN-2016'] = df[(df['date'].dt.month == 6) & (df['date'].dt.year == 2016)]
A sample of the dataframe:
DATE
2015-01-05 2.7483
2015-01-06 2.7400
2015-01-07 2.7250
2015-01-08 2.7350
2015-01-09 2.7350
2015-01-12 2.7350
2015-01-13 2.7450
2015-01-14 2.7450
2015-01-15 2.7350
2015-01-16 2.7183
2015-01-19 2.7300
2015-01-20 2.7150
2015-01-21 2.7150
2015-01-22 2.6550
2015-01-23 2.6500
2015-01-27 2.6450
2015-01-28 2.6350
2015-01-29 2.6100
2015-01-30 2.5600
2015-02-02 2.4783
2015-02-03 2.4700
First you need to convert the date column in your dataframe (let's say it is called df["date"]) into datetime format:
df["date"] = pd.to_datetime(df["date"])
You also need to import the datetime class:
from datetime import datetime
Then you can just do:
startDateOfInterval = "2016-05-31"
endDateOfInterval = "2016-07-01"
dfOfDesiredMonth = df[df["date"].apply(
    lambda x: datetime.strptime(startDateOfInterval, "%Y-%m-%d") < x
              < datetime.strptime(endDateOfInterval, "%Y-%m-%d"))]
The df you will get will then only contain the rows with date within this interval.
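To get one series per month without writing a mask for every month/year, you can also group on the index's year and month (a sketch, assuming the prices sit in a single column on a DatetimeIndex; only a few sample rows are used here):

```python
import pandas as pd

idx = pd.to_datetime(["2015-01-05", "2015-01-06", "2015-02-02", "2015-02-03"])
df = pd.DataFrame({"value": [2.7483, 2.7400, 2.4783, 2.4700]}, index=idx)

# One sub-frame per (year, month); each could be plotted as its own series
for (year, month), chunk in df.groupby([df.index.year, df.index.month]):
    print(year, month, len(chunk))
    # chunk["value"].reset_index(drop=True).plot(label=f"{year}-{month:02d}")
```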
I have a time series at daily frequency across 1204 days.
I want to resample it on a 365D basis (by summing), but the series spans about 3.29 × 365 days, not a multiple of 365D.
By default, resample is returning 4 lines.
Here is the raw data:
DATE
2012-08-12 15350.0
2012-08-19 11204.0
2012-08-26 11795.0
2012-09-02 15160.0
2012-09-09 9991.0
2012-09-16 12337.0
2012-09-23 10721.0
2012-09-30 9952.0
2012-10-07 11903.0
2012-10-14 8537.0
...
2015-09-27 14234.0
2015-10-04 17917.0
2015-10-11 13610.0
2015-10-18 8716.0
2015-10-25 15191.0
2015-11-01 8925.0
2015-11-08 13306.0
2015-11-15 8884.0
2015-11-22 11527.0
2015-11-29 6859.0
df.index.max() - df.index.min()
Timedelta('1204 days 00:00:00')
If I apply:
df.resample('365D').sum()
I got:
DATE
2012-08-12 536310.0
2013-08-12 555016.0
2014-08-12 569548.0
2015-08-12 245942.0
Freq: 365D, dtype: float64
It seems like the last bin is the one covering less than 365 days.
How do I force resample to exclude it from the result?
df.resample('365D') starts its bins at the lowest day in the index, so the last bin will almost always not cover a full 365 days. Just skip it:
df.resample('365D').sum()[:-1]
You can also consider resampling by calendar year instead:
df.resample('A').sum()
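A small reproducible sketch of the trimming approach, with synthetic weekly data standing in for the real series:

```python
import numpy as np
import pandas as pd

# ~2.3 years of weekly observations, so the span is not a multiple of 365 days
idx = pd.date_range("2012-08-12", periods=120, freq="7D")
s = pd.Series(np.ones(120), index=idx)

# Resample into 365-day bins; the final, partial bin is dropped with [:-1]
full_bins = s.resample("365D").sum()[:-1]
print(full_bins)
```

Only the bins that cover a full 365 days survive; the last, shorter one is discarded.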