I have a pandas DataFrame, shown below, with 30 days in each month. Now I would like to convert this DataFrame to the regular Julian-day calendar, put NA on the missing dates (e.g. 1/31/2001: NA, and so on), and interpolate later. Can anyone suggest a way to handle this in pandas?
Date X
1/1/2001 30.56787109
1/2/2001 29.57751465
1/3/2001 30.38424683
1/4/2001 28.64764404
1/5/2001 27.54763794
......
......
1/29/2001 27.44857788
1/30/2001 27.16296387
2/1/2001 28.02816772
2/2/2001 28.28137207
2/3/2001 28.38671875
.......
.......
02/29/2001 32.23730469
02/30/2001 32.56161499
3/1/2001 31.38146973
3/2/2001 30.73623657
3/3/2001 30.81912231
......
3/28/2001 33.7562561
3/29/2001 34.46350098
3/30/2001 33.49130249
4/1/2001 30.91223145
4/2/2001 30.94335938
.....
4/30/2001 30.02526855
......
......
12/29/2001 27.44161987
12/30/2001 28.43597412
So, I'm assuming that your Date column is just a string and is not the index. I'm also replacing X with integer values to make it easier to track what's happening. First, convert to datetime and set it as the index:
>>> df.Date=pd.to_datetime(df.Date,errors='coerce')
>>> df = df.set_index('Date')
Around the end of February, the result now looks like this:
2001-02-27 10
2001-02-28 11
NaT 12
NaT 13
2001-03-01 14
2001-03-02 15
That uses pandas' built-in time awareness to identify invalid dates (Feb 29 in a non-leap year and Feb 30 in any year), turning them into NaT.
Then you can just resample to get the index onto a valid calendar. You also have some fill options (besides the default NaN) with resample, or you can interpolate later on.
>>> df = df.resample('d').asfreq()  # older pandas allowed a bare df.resample('d')
2001-01-29 3
2001-01-30 4
2001-01-31 NaN
2001-02-01 5
2001-02-02 6
...
2001-02-27 10
2001-02-28 11
2001-03-01 14
2001-03-02 15
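If you then want to fill those NaNs, interpolate has a time-aware mode; a one-line sketch continuing the session above:
>>> df = df.interpolate(method='time')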
First, set the column as a pandas.DatetimeIndex and then use the to_julian_date() method. You can then use the interpolate() method to fill in the missing in-between dates.
Source:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.to_julian_date.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
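A minimal sketch of that route, assuming the invalid 30-day-calendar dates have already been coerced away as in the answer above:
import pandas as pd

df.index = pd.DatetimeIndex(df.index)          # ensure a proper DatetimeIndex
julian_days = df.index.to_julian_date()        # float Julian day numbers, if you need them
df = df.asfreq('d')                            # reindex onto the full daily calendar (missing days become NaN)
df['X'] = df['X'].interpolate(method='time')   # fill the gaps by time-weighted interpolation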
Related
I have a pandas dataframe column named disbursal_date which is a datetime:
disbursal_date
2009-01-28
2008-01-03
2008-07-15
and so on...
I want to keep the date and month part and replace the years by 2022 for all values.
I tried using df['disbursal_date'].map(lambda x: x.replace(year=2022)) but this didn't work for me.
You can use apply to run a Python function over a DataFrame column. The key point is to make sure that the dtype is pandas datetime64 and not object or string; that is most likely why your map call failed.
Below is sample code that works fine; it replaces the year with 2022.
import pandas as pd

df = pd.DataFrame(['2009-01-28', '2008-01-03', '2008-07-15'], columns=['disbursal_old'])
df['disbursal_old'] = df['disbursal_old'].astype('datetime64[ns]')
df['disbursal_new'] = df['disbursal_old'].apply(lambda x : x.replace(year=2022))
print(df['disbursal_new'])
0 2022-01-28
1 2022-01-03
2 2022-07-15
Name: disbursal_new, dtype: datetime64[ns]
The below code gives the difference between the years.
df['disbursal_diff_year'] = df['disbursal_new'].dt.year - df['disbursal_old'].dt.year
print(df)
disbursal_old disbursal_new disbursal_diff_year
0 2009-01-28 2022-01-28 13
1 2008-01-03 2022-01-03 14
2 2008-07-15 2022-07-15 14
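As a side note, here is a vectorized alternative that assembles the datetime from its parts (a sketch; note that neither approach can map a Feb 29 source date into the non-leap year 2022):
# Assemble year/month/day columns back into a datetime (the scalar year broadcasts)
df['disbursal_new'] = pd.to_datetime({
    'year': 2022,
    'month': df['disbursal_old'].dt.month,
    'day': df['disbursal_old'].dt.day,
})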
I am trying to filter a DataFrame to show only values 1 hour before and 1 hour after a specified time/date, but am having trouble finding the right function for this. I am working in Python with pandas.
The posts I see regarding masking by date mostly cover masking rows between a specified start and end date, but I am having trouble finding help on how to mask rows around a single date.
I have time series data as a DataFrame that spans about a year, so thousands of rows. This data is at 1-minute intervals, and so each row corresponds to a row ID, a timestamp, and a value.
Example of DataFrame:
ID timestamp value
0 2011-01-15 03:25:00 34
1 2011-01-15 03:26:00 36
2 2011-01-15 03:27:00 37
3 2011-01-15 03:28:00 37
4 2011-01-15 03:29:00 39
5 2011-01-15 03:30:00 29
6 2011-01-15 03:31:00 28
...
I am trying to create a function that takes the initial DataFrame and outputs only the rows from 1 hour before to 1 hour after a specified timestamp, i.e. only rows within this 2-hour window.
To be more clear:
I have a DataFrame that has 1-minute interval data throughout a year (as exemplified above).
I now identify a specific timestamp: 2011-07-14 06:15:00
I now want to output a DataFrame that is the initial input DataFrame, but now only contains rows that are within 1-hour before 2011-07-14 06:15:00, and 1-hour after 2011-07-14 06:15:00.
Do you know how I can do this? I understand that I could just create a filter that drops all values before 2011-07-14 05:15:00 and after 2011-07-14 07:15:00, but my goal is to have the user simply enter a single date/time (e.g. 2011-07-14 06:15:00) to produce the output DataFrame.
This is what I have tried so far:
hour = pd.DateOffset(hours=1)
date = pd.Timestamp("2011-07-14 06:15:00")
df = df.set_index("timestamp")
df([date - hour: date + hour])
which returns:
File "<ipython-input-49-d42254baba8f>", line 4
df([date - hour: date + hour])
^
SyntaxError: invalid syntax
I am not sure if this is really only a syntax error, or something deeper and more complex. How can I fix this?
Thanks!
You can do it with:
import pandas as pd
import datetime as dt
data = {"date": ["2011-01-15 03:10:00","2011-01-15 03:40:00","2011-01-15 04:10:00","2011-01-15 04:40:00","2011-01-15 05:10:00","2011-01-15 07:10:00"],
"value":[1,2,3,4,5,6]}
df=pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
date_search = dt.datetime.strptime("2011-01-15 05:20:00", '%Y-%m-%d %H:%M:%S')
# Keep rows within one hour on either side of the search time
mask = (df['date'] > date_search - dt.timedelta(hours=1)) & (df['date'] <= date_search + dt.timedelta(hours=1))
print(df.loc[mask])
result:
date value
3 2011-01-15 04:40:00 4
4 2011-01-15 05:10:00 5
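Alternatively, staying closer to the original attempt: with the timestamp as a sorted index, label-based slicing via .loc does the same thing (the df([...]) in the question fails because slicing needs .loc[...], not a call). A sketch:
import pandas as pd

df = df.set_index('timestamp').sort_index()
date = pd.Timestamp('2011-07-14 06:15:00')
hour = pd.Timedelta(hours=1)
window = df.loc[date - hour : date + hour]  # rows within the 2-hour window, endpoints inclusive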
I have a large dataset spanning many years and I want to subset this data frame by selecting data based on a specific day of the month using python.
This is simple enough and I have achieved with the following line of code:
df[df.index.day == 12]
This selects data from the 12th of each month for all years in the data set. Great.
The problem I have, however, is that the original data set contains working-day data only. The 12th might actually be a weekend or national holiday and thus doesn't appear in the data set, so nothing is returned for that month.
What I would like to happen is to select the 12th where available, else select the next working day in the data set.
All help appreciated!
Here's a solution that looks at three days from every month (the 12th, 13th, and 14th) and then picks the minimum. If the 12th is a weekend it won't exist in the original dataframe, so you'll get the 13th; if that is missing too, the 14th.
Here's the code:
import pandas as pd

# Create dummy data - initial range
df = pd.DataFrame(pd.date_range("2018-01-01", "2020-06-01"), columns=["date"])
# Create dummy data - Drop weekends
df = df[df.date.dt.weekday.isin(range(5))]
# get only the 12, 13, and 14 of every month
# group by year and month.
# get the minimum
df[df.date.dt.day.isin([12, 13, 14])].groupby(by=[df.date.dt.year, df.date.dt.month], as_index=False).min()
Result:
date
0 2018-01-12
1 2018-02-12
2 2018-03-12
3 2018-04-12
4 2018-05-14
5 2018-06-12
6 2018-07-12
7 2018-08-13
8 2018-09-12
9 2018-10-12
...
Edit
Per a question in the comments about national holidays: the same solution applies. Instead of picking 3 days (12, 13, 14), pick a larger range (e.g. 12-18), then take the minimum of those that actually exist in the dataframe; that is the first working day on or after the 12th.
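A minimal tweak of the groupby line above for that case, widening the candidate window to cover weekends plus holidays:
# Same as before, but consider the 12th through the 18th of each month
df[df.date.dt.day.isin(range(12, 19))].groupby(
    by=[df.date.dt.year, df.date.dt.month], as_index=False).min()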
You can backfill the dataframe first to fill the missing values, then select the date you want:
df = df.asfreq('d', method='bfill')
Then you can do df[df.index.day == 12]
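A runnable sketch of that idea on hypothetical business-day dummy data:
import pandas as pd

# Hypothetical dummy data: working days only, as in the question
df = pd.DataFrame({'x': range(260)},
                  index=pd.bdate_range('2018-01-01', periods=260))
df = df.asfreq('d', method='bfill')  # missing days take the next available row's values
print(df[df.index.day == 12].head())
One caveat: when the 12th was missing, the returned row is labeled the 12th but carries the next working day's values; if you need the actual working-day date itself, the groupby approaches above preserve it.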
This is my approach; each step is explained in the code comments, and the last line is explained below. Please feel free to add a comment if something is unclear:
!pip install workalendar #Install the module
import pandas as pd #Import pandas
from workalendar.usa import NewYork #Import the required country and city
df = pd.DataFrame(pd.date_range(start='1/1/2018', end='12/31/2018')).rename(columns={0:'Dates'}) #Create a dataframe with dates for the year 2018
cal = NewYork() #Instantiate the calendar
df['Is_Working_Day'] = df['Dates'].map(lambda x: cal.is_working_day(x)) #Create an extra column, True for working days, False otherwise
df[(df['Dates'].dt.day >= 12) & (df['Is_Working_Day'] == True)].groupby(df['Dates'].dt.month)['Dates'].first()
Essentially, this last line keeps all days with a day-of-month of 12 or higher that are actual working days, groups them by month, and returns the first day of each group where the condition is met (day >= 12 and Is_Working_Day == True).
Output:
Dates
1 2018-01-12
2 2018-02-13
3 2018-03-12
4 2018-04-12
5 2018-05-14
6 2018-06-12
7 2018-07-12
8 2018-08-13
9 2018-09-12
10 2018-10-12
11 2018-11-13
12 2018-12-12
I am processing time-series data within a pandas DataFrame. The datetime index is incomplete (i.e. some dates are missing).
I want to create a new column with a datetime series offset by 1 year, but containing only dates present in the original DatetimeIndex. The challenge is that in many cases the exact 1-year match is not present in the index.
Index (Input) 1 year offset (Output)
1/2/2014 None
1/3/2014 None
1/6/2014 None
1/7/2014 None
1/9/2014 None
1/10/2014 None
1/2/2015 1/2/2014
1/5/2015 1/3/2014
1/6/2015 1/6/2014
1/7/2015 1/7/2014
1/8/2015 1/9/2014
1/9/2015 1/10/2014
The requirements are as follows:
Every date as of 1/2/2015 must have a corresponding offset date (no blanks)
Every date within the "offset date" group must also be present in the Index column (i.e. introduction of new dates, like 1/8/2014, is not desired)
All offset dates must be ordered in an ascending way (the sequence of dates must be preserved)
What I have tried so far:
DateOffset doesn't help, since it is insensitive to dates not present in the index.
The .shift method, data["1 year offset (Output)"] = data.Index.shift(365), doesn't help because the number of dates within the index differs across the years.
What I am trying to do now has several steps:
Apply Dateoffset method at first to create "temp 1 year offset"
Remove single dates from "temp 1 year offset" that are not present in datetimeindex using set(list) method and replace cells by NaN
Select dates in the datetimeindex whose "temp 1 year offset" is NaN and subtract one year
Map the Dates from (3) to its closest date in the datetimeindex using argmin
The challenge here is that I am getting duplicate entries as well as a descending order of days in some cases. Those mess up the results in the following way (see the timedeltas between day n and day n+1):
Index (Input) 1 year offset (Output) Timedelta
4/17/2014 4/16/2014 1
4/22/2014 4/17/2014 1
4/23/2014 4/25/2014 8
4/24/2014 None
4/25/2014 4/22/2014 -3
4/28/2014 4/23/2014 1
4/29/2014 4/24/2014 1
4/30/2014 4/25/2014 1
In any case, this last approach seems like overkill given the simplicity of the underlying goal. Is there a faster and simpler way to do it?
How do you group every date in an uneven pandas datetime series with the closest date one year earlier in the same series?
This would be a way. Note, however, that simply subtracting 365 days mishandles years with 366 days; see this thread on properly adding one year to a date:
Add one year in current date PYTHON
This code therefore needs some small modifications.
import pandas as pd
import datetime
df = pd.DataFrame(dict(dates=[
'1/3/2014',
'1/6/2014',
'1/7/2014',
'1/9/2014',
'1/10/2014',
'1/2/2015',
'1/5/2015',
'1/6/2015',
'1/7/2015',
'1/8/2015',
'1/9/2015']))
# Convert column to datetime
df.dates = pd.to_datetime(df.dates)
# Store min(year) as a variable
minyear = min(df.dates).year
# Subtract 365 days (note: off by one day across leap years such as 2012)
df['offset'] = [(i + datetime.timedelta(days=-365)).date()
if i.year != minyear else None for i in df.dates]
df
Returns:
dates offset
0 2014-01-03 None
1 2014-01-06 None
2 2014-01-07 None
3 2014-01-09 None
4 2014-01-10 None
5 2015-01-02 2014-01-02
6 2015-01-05 2014-01-05
7 2015-01-06 2014-01-06
8 2015-01-07 2014-01-07
9 2015-01-08 2014-01-08
10 2015-01-09 2014-01-09
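For the stricter requirement that every offset date must itself exist in the index, one possible approach is pd.merge_asof, which snaps each one-year-earlier target to the closest earlier date actually present. This is a sketch only: it assumes the dates are sorted ascending, it can assign the same source date twice, and it does not reproduce the one-to-one pairing of the example output exactly.
import pandas as pd

# df as above, with df.dates sorted ascending
target = (df.dates - pd.DateOffset(years=1)).rename('target')  # leap-year-aware 1y offset
# Snap each target to the closest date at or before it that exists in the
# original series; targets earlier than the first date become NaT.
snapped = pd.merge_asof(target.to_frame(),
                        df[['dates']].rename(columns={'dates': 'offset'}),
                        left_on='target', right_on='offset',
                        direction='backward')
df['offset_in_index'] = snapped['offset'].values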
With pandas I have created a DataFrame from an imported .csv file (the file is generated through simulation). The DataFrame consists of half-hourly energy-consumption data for a single year. I have already created a DatetimeIndex for the dates.
I would like to reformat this data into average hourly weekday and weekend profiles, with the weekday profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in pandas that do this directly; I have used several offset aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating monthly and daily profiles.
I was thinking of using the business-day frequency to create a new date index of working days and compare that to my DataFrame's DatetimeIndex for every half hour, then return values for working days and weekend days when true or false respectively to create a new dataset, but I am not sure how to do this.
PS; I am just getting into Python and Pandas.
Dummy data (for future reference, you are more likely to get an answer if you post some in copy-paste-able form):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First, define a US business-day offset with holidays (or modify as appropriate), and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
if date.weekday() in [5,6]:
return 'weekend'
elif date.date() in bday_over_df:
return 'weekday'
else:
return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
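Since the question asks for average profiles rather than totals, swap .sum() for .mean() in the final step:
# Mean of all half-hourly readings in each (day_group, hour) bucket
profile = df.groupby(['day_group', 'hour'])['a'].mean()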