Not sure if it's relevant, but the dates are in a pandas DatetimeIndex (Python 3.6).
I'm trying to find all the ranges of consecutive days and output the minimum and maximum date of each range.
I'd prefer the output as a list, though it seems a DataFrame would also work since I can index into it.
I would later output these date ranges to an Excel sheet.
Sample input:
'1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06'
Expected output:
1990-10-01, 1990-10-03
1990-10-05
2002-10-05, 2002-10-06
I know a naive method would be a for loop that checks whether each next/previous date is off by exactly one day (comparing day, month, and year), roughly as sketched below. But what's a better way to do this?
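Rough sketch of the naive loop I mean, over the sample dates above (just for illustration):
import pandas as pd

dates = pd.to_datetime(['1990-10-01', '1990-10-02', '1990-10-03',
                        '1990-10-05', '2002-10-05', '2002-10-06'])

# Walk the sorted dates and close a range whenever the next date
# is not exactly one day after the current one.
ranges = []
start = dates[0]
for prev, curr in zip(dates[:-1], dates[1:]):
    if (curr - prev).days != 1:
        ranges.append((start, prev))
        start = curr
ranges.append((start, dates[-1]))
print(ranges)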
Thanks
Edited to clarify
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['Date'] = pd.to_datetime(['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06'])
Solution:
First calculate the running difference, create a flag to indicate whether consecutive dates belong in the same group, then group by the flag and take the start and end date of each group. A set is used so the end date is dropped when it equals the start date.
(
df.assign(DateDiff=(df.Date - df.Date.shift(1)).dt.days.fillna(0))
.assign(Flag=lambda x: np.where(x.DateDiff == 1, np.nan, range(len(x))))
.assign(Flag=lambda x: x.Flag.ffill())
.groupby(by='Flag').Date
.apply(lambda x: set([x.iloc[0].date(), x.iloc[-1].date()]))
)
Flag
0.0 {1990-10-01, 1990-10-03}
3.0 {1990-10-05}
4.0 {2002-10-05, 2002-10-06}
Name: Date, dtype: object
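If you'd rather have ordered start/end columns than sets (a set doesn't preserve which date is the start), the same run-grouping idea can finish with an agg instead; a short sketch:
runs = df.Date.diff().dt.days.ne(1).cumsum()
ranges = df.groupby(runs)['Date'].agg(['first', 'last'])
ranges.columns = ['start', 'end']
print(ranges)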
Let's create the example:
Input:
l = ['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06']
idx = pd.DatetimeIndex(l)
DatetimeIndex(['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05',
'2002-10-05', '2002-10-06'],
dtype='datetime64[ns]', freq=None)
Solution:
Create a helper series that calculates the difference between consecutive dates and starts a new group wherever the difference is not 1 day, then loop over the groups and take the first and last item of each group.
g = idx.to_series().diff().fillna(pd.Timedelta(days=1)).dt.days.ne(1).cumsum()
final = [grp.index[[0, -1]] if len(grp.index) > 1 else grp.index
         for _, grp in g.groupby(g)]
Output:
[DatetimeIndex(['1990-10-01', '1990-10-03'], dtype='datetime64[ns]', freq=None),
DatetimeIndex(['1990-10-05'], dtype='datetime64[ns]', freq=None),
DatetimeIndex(['2002-10-05', '2002-10-06'], dtype='datetime64[ns]', freq=None)]
If you want a DataFrame so you can call df.to_excel(...), just create one from the final list:
df = pd.DataFrame(final,columns = ['start','end'])
print(df)
start end
0 1990-10-01 1990-10-03
1 1990-10-05 NaT
2 2002-10-05 2002-10-06
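From there the Excel export mentioned in the question is a single call (assuming an Excel writer such as openpyxl or xlsxwriter is installed; the file name is only an example):
df.to_excel('date_ranges.xlsx', index=False)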
I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating a date-range or day column. To check whether a target date in tgt belongs to the date range specified by a row of df, you can calculate the end of the date range and then check whether each date in tgt falls between the start and end of the interval. The code below implements this and produces a "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) & (tgt[0] < df.daterange_end), "target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
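If tgt contains several target dates, the same interval test can be broadcast over all of them with NumPy; a sketch along the same lines (the extra target is just an example):
import numpy as np

tgt = pd.to_datetime(['2018-08-26', '2018-07-25'])   # example targets
targets = tgt.values[:, None]                        # shape (n_targets, 1)

# a row is flagged if any target falls strictly inside its interval
inside = (targets > df['date'].values) & (targets < df['daterange_end'].values)
df['target_date'] = inside.any(axis=0).astype(int)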
You should add axis=1 in apply (and pass n as the periods argument):
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='d'), axis=1)
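With the date_range column in place, the membership check against the target dates from the question can then be done per row, for example (a sketch, not part of the original answer; the column name is only illustrative):
tgt = [dt.datetime(2018, 8, 26)]   # example target list, as in the question

df['contains_target'] = df['date_range'].apply(
    lambda r: any(t in r for t in tgt))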
I came up with a solution that works (but I'm sure there's a nicer way...)
import numpy as np

# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
    df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
    df['target_date'] = np.where(df[col].isin(tgt),
                                 1,
                                 df['target_date'])
# drop intermediate cols
df = df[[c for c in df.columns if c not in new_cols]]
How can I check what category a date falls into if it is between the dates in the dates field? I cannot use merge_asof because at work we only have pandas v0.18.
d = {'buckets': ['1D', '1W', '1M'], 'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
date_buckets = pd.DataFrame(data=d)
buckets dates
0 1D 03-05-2018
1 1W 10-05-2018
2 1M 03-06-2018
So, for example, given the date 07-05-2018, how can I return 1W? I would need to do this for hundreds of rows, so it would need to be efficient.
Thanks,
You can use pandas.cut for binning values:
import pandas as pd
d = {'buckets': ['1D', '1W', '1M'],
'dates': ['03-05-2018', '10-05-2018', '03-06-2018']}
df_bin = pd.DataFrame(data=d)
df_bin['dates'] = pd.to_datetime(df_bin['dates'], dayfirst=True)\
.dt.strftime('%Y%m%d').astype(int)
df = pd.DataFrame({'date': ['07-05-2018']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)\
.dt.strftime('%Y%m%d').astype(int)
df['Tenor'] = pd.cut(df['date'],
bins=df_bin['dates'],
labels=df_bin['buckets'].iloc[1:])
print(df)
date Tenor
0 20180507 1W
Here's one way that could easily be extended to a larger set of dates to match:
# make sure the bucket dates are real datetimes first (they are strings in the question)
date_buckets['dates'] = pd.to_datetime(date_buckets['dates'], dayfirst=True)

scalar_date = pd.DataFrame(index=[pd.to_datetime("07-05-2018", format="%d-%m-%Y")])
scalar_date.join(date_buckets
                 .set_index('dates')
                 .reindex(pd.date_range(date_buckets.dates.min(),
                                        date_buckets.dates.max()),
                          method='bfill'))
# buckets
# 2018-05-07 1W
The idea here is to resize your date_buckets dataframe (using .reindex with method='bfill'), so that you can easily join it to a dataframe with your lookup dates.
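To classify many dates at once, the same daily table can be built once and then indexed with .loc (a sketch, assuming date_buckets['dates'] has been converted to datetimes as above; the example dates are arbitrary):
lookup = (date_buckets
          .set_index('dates')
          .reindex(pd.date_range(date_buckets.dates.min(),
                                 date_buckets.dates.max()),
                   method='bfill'))

many_dates = pd.to_datetime(['07-05-2018', '20-05-2018'], dayfirst=True)  # example inputs
print(lookup.loc[many_dates, 'buckets'])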
I have two datasets with different time formats, like this:
df1 = pd.DataFrame( {'A': [1499503900, 1512522054, 1412525061, 1502527681, 1512532303]})
df2 = pd.DataFrame( {'B' : ['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'] })
I need to find the nearest date in the second dataset for each value in the first one. It doesn't matter how far away it is; I just need the nearest time. For example:
1499503900 for '2017-07-03T10:20:46.333Z'
1512522054 for '2017-12-15T12:26:01.347Z'
1412525061 for '2017-05-31T08:27:41.943Z'
1502527681 for '2017-08-10T08:48:01.347Z'
1512532303 for '2017-06-05T14:44:56.425Z'
Here are a couple of helpers:
This one converts an ISO date string to an epoch timestamp:
import calendar
import datetime

def time1(date_text):
    date = datetime.datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return calendar.timegm(date.utctimetuple())
x = '2017-12-15T12:26:01.347Z'
print(time1(x))
out: 1513340761
And this is for converting to ISO format:
import datetime as DT

def time_covert(time):
    seconds_since_epoch = time
    return DT.datetime.utcfromtimestamp(seconds_since_epoch).isoformat()
y = 1499503900
print(time_covert(y))
out = 2017-07-08T08:51:40
Any idea would be extremely helpful.
Thank you all in advance!
Here's a quick start:
from datetime import datetime

import numpy as np
import pandas as pd

def time_covert(time):
    seconds_since_epoch = time
    return datetime.utcfromtimestamp(seconds_since_epoch)
# real time series
df2['B'] = pd.to_datetime(df2['B'])
df2.index = df2['B']
del df2['B']
for a in df1['A']:
    print(time_covert(a))
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    print(df2.iloc[i])
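To keep the matches instead of just printing them, the same loop can collect the nearest timestamp for each row (a small variation on the above; the column name is only illustrative):
nearest = []
for a in df1['A']:
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    nearest.append(df2.index[i])
df1['nearest_B'] = nearest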
I would like to approach this as an algorithmic question rather than a pandas-specific one. My approach is to sort the df2 series, and for each datetime in df1 perform a binary search on the sorted df2 to get the insertion index. Then check the entries just below and above that index to get the desired output.
Here is the code for the above procedure.
Use standard pandas datetimes for easy comparison:
df1 = pd.DataFrame( {'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame( {'B' : pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z']) })
Sort df2 by date and get the insertion positions using binary search:
df2 = df2.sort_values('B').reset_index(drop=True)
ind = df2['B'].searchsorted(df1['A'])
Now check which of the values just above and just below the insertion position has the minimum difference:
for index, row in df1.iterrows():
    i = ind[index]
    if i not in df2.index:
        print(df2.iloc[i-1]['B'])
    elif i-1 not in df2.index:
        print(df2.iloc[i]['B'])
    else:
        if abs(df2.iloc[i]['B'] - row['A']) > abs(df2.iloc[i-1]['B'] - row['A']):
            print(df2.iloc[i-1]['B'])
        else:
            print(df2.iloc[i]['B'])
These are the outputs, one for each value in df1 respectively. (Note: please recheck the expected outputs in your question; they do not correspond to the minimum difference.)
2017-07-03 10:20:46.333000
2017-11-28 15:25:39.016000
2017-05-30 16:24:03.175000
2017-08-10 08:48:01.347000
2017-11-28 15:25:39.016000
The above procedure has time complexity O(N log N) for sorting and O(log N) per lookup, where N = len(df2). If df1 is large, this is a fairly fast approach.
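For completeness: if your pandas version is 0.20 or newer, pd.merge_asof with direction='nearest' does the same nearest-match in one vectorized call (a sketch using the converted frames from above):
left = df1.sort_values('A')    # merge_asof needs both keys sorted
right = df2.sort_values('B')

nearest = pd.merge_asof(left, right, left_on='A', right_on='B',
                        direction='nearest')
print(nearest)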
I have working code that produces the desired result, but it uses an algorithm that iterates over the pandas array, which is obviously slower than pure pandas DataFrame calculations. I would like some advice on how I can use pandas functions to speed up this calculation.
Code to generate dummy data
df = pd.DataFrame(index=pd.date_range(start='2014-01-01', periods=365))
df['Month'] = df.index.month
df['MTD'] = (df.index.day+0.001)/10000
This is basically a pandas DataFrame with MTD figures for some value. This is purely given so that we have some data to play with.
Needed calculation
What I need is a new DataFrame that has the starting (investment) dates as columns, populated with a few beginning-of-month values. The index is all possible dates, and the values should be the YTD figure. I am using this DataFrame as a lookup/cache for investment dates.
Pseudocode:
YTD = product of (1 + last MTD figure of each month), over all months from the investment date up to the required date, minus 1
Working function
def calculate_YTD(df):  # slow, takes 3.5s on my machine!!!!!!
    YTD_df = pd.DataFrame(index=df.index)
    for investment_date in [datetime.datetime(2014, x+1, 1) for x in range(12)]:
        YTD_df[investment_date] = 1.0  # pre-populate with dummy floats
        for date in df.index:  # iterate over all dates in period
            h = (df[investment_date:date].groupby('Month')['MTD'].max().fillna(0) + 1).product() - 1
            YTD_df[investment_date][date] = h
    return YTD_df
I have hardcoded the investment dates list to simplify the problem statement. On my machine this code takes 2.5 to 3.5 seconds. Any suggestions on how I can speed it up?
Here's an approach that should be reasonably quick. It's quite possible there is something faster/cleaner, but this should be an improvement.
# assuming a fixed number of investment dates, build a list
investment_dates = pd.date_range('2014-1-1', periods=12, freq='MS')

# build a table, by month, which contains the cumulative MTD
# return for each investment date. Still have to loop over the investment dates,
# but don't need to loop over each daily value
running_mtd = []
for date in investment_dates:
    curr_mo = (df[df.index >= date].groupby('Month')['MTD'].last() + 1.).cumprod()
    curr_mo.name = date
    running_mtd.append(curr_mo)

running_mtd_df = pd.concat(running_mtd, axis=1)
running_mtd_df = running_mtd_df.shift(1).fillna(1.)
#merge running mtd returns with base dataframe
df = df.merge(running_mtd_df, left_on='Month', right_index=True)
# calculate the YTD return for each column / day, by multiplying the running
# monthly return with the current MTD value
for date in investment_dates:
    df[date] = np.where(df.index < date, np.nan, df[date] * (1. + df['MTD']) - 1.)
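For example, the YTD figure for the 2014-03-01 investment as of mid-June can then be read straight off the merged frame (just an illustrative lookup):
print(df.loc['2014-06-15', investment_dates[2]])   # YTD for the 2014-03-01 column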
I'm a beginner in the Python environment and I'm having trouble working with time series data.
Below is my 1-minute OHLC data:
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the "close" value from each row and reshape the data so that each day is one row, like the following:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd also like to find the highest close value and its time (minute) for EACH DAY, like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be much appreciated.
Thank you in advance,
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data and parse the first two columns as a
# date-time and set it as index
df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None)
# get only the fifth column (close)
df = df[[5]]
# Resample to date frequency and get the max value for each day.
df.resample('D', how='max')
If you also want to show the times, keep them in your DataFrame as a column and pass a function that determines the max close value and returns that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None,
usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.resample('D', how=lambda group: group.iloc[group['close'].argmax()])
close time
d_t
2011-11-01 250.60 2011-11-01 15:04:00
2011-11-02 246.35 2011-11-02 09:01:00
And if you want a list of the prices per day, just group by day and return the list of all prices from each group, using apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that you have some days without any data, so the function passed in as the resampler should handle those cases:
def func(group):
    if len(group) == 0:
        return None
    return group.iloc[group['close'].argmax()]

df.resample('D', how=func).dropna()