I have a pandas DataFrame in this format:
Dates
11-Feb-18
18-Feb-18
03-Mar-18
25-Mar-18
29-Mar-18
04-Apr-18
08-Apr-18
14-Apr-18
17-Apr-18
30-Apr-18
04-May-18
I want to find the dates between two consecutive dates. In this example I want to make a new column that contains the dates falling between each pair of consecutive dates. For example, between 11-Feb-18 and 18-Feb-18 I would get all the dates in between.
I tried this code but it's throwing an error:
pd.DataFrame({'dates':pd.date_range(pd.to_datetime(df_new['Time.[Day]'].loc[i].diff(-1)))})
If you want to add a column with the list of dates that are missing in between, this should work. It could probably be more efficient, and it has to work around the NaT in the last row, which makes it a bit longer than intended, but it gives you the result.
import pandas as pd
from datetime import timedelta
test_df = pd.DataFrame({
    "Dates":
        ["11-Feb-18", "18-Feb-18", "03-Mar-18", "25-Mar-18", "29-Mar-18", "04-Apr-18",
         "08-Apr-18", "14-Apr-18", "17-Apr-18", "30-Apr-18", "04-May-18"]
})
res = (
    test_df
    .assign(
        # convert to datetime
        Dates=lambda x: pd.to_datetime(x.Dates),
        # get the next row's date
        Dates_next=lambda x: x.Dates.shift(-1),
        # create the date range between the two dates
        Dates_list=lambda x: x.apply(
            lambda row:
                pd.date_range(
                    row.Dates + timedelta(days=1),
                    row.Dates_next - timedelta(days=1),
                    freq="D").date.tolist()
                if pd.notnull(row.Dates_next)
                else None,
            axis=1
        )
    )
)
print(res)
results in:
Dates Dates_next Dates_list
0 2018-02-11 2018-02-18 [2018-02-12, 2018-02-13, 2018-02-14, 2018-02-1...
1 2018-02-18 2018-03-03 [2018-02-19, 2018-02-20, 2018-02-21, 2018-02-2...
2 2018-03-03 2018-03-25 [2018-03-04, 2018-03-05, 2018-03-06, 2018-03-0...
3 2018-03-25 2018-03-29 [2018-03-26, 2018-03-27, 2018-03-28]
4 2018-03-29 2018-04-04 [2018-03-30, 2018-03-31, 2018-04-01, 2018-04-0...
5 2018-04-04 2018-04-08 [2018-04-05, 2018-04-06, 2018-04-07]
6 2018-04-08 2018-04-14 [2018-04-09, 2018-04-10, 2018-04-11, 2018-04-1...
7 2018-04-14 2018-04-17 [2018-04-15, 2018-04-16]
8 2018-04-17 2018-04-30 [2018-04-18, 2018-04-19, 2018-04-20, 2018-04-2...
9 2018-04-30 2018-05-04 [2018-05-01, 2018-05-02, 2018-05-03]
10 2018-05-04 NaT None
As a side note, if you don't need the last row after the analysis, you could filter it out after assigning the next date and eliminate the if statement to make it faster.
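For illustration, a minimal sketch of that variant (assuming the same test_df as above): drop the last row right after the shift, so the if/else disappears.
res2 = (
    test_df
    .assign(Dates=lambda x: pd.to_datetime(x.Dates),
            Dates_next=lambda x: x.Dates.shift(-1))
    .iloc[:-1]  # drop the last row, whose Dates_next is NaT
    .assign(Dates_list=lambda x: x.apply(
        lambda row: pd.date_range(row.Dates + timedelta(days=1),
                                  row.Dates_next - timedelta(days=1),
                                  freq="D").date.tolist(),
        axis=1))
)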
This works with DataFrames, adding a new column containing the requested list.
It iterates over the first column, preparing a list of lists for the second column.
At the end it creates a new DataFrame column and assigns the prepared values to it.
import pandas as pd
from datetime import datetime, timedelta

df = pd.read_csv("test.csv")
in_betweens = []
for i in range(len(df["dates"]) - 1):
    d = datetime.strptime(df["dates"][i], "%d-%b-%y")
    d2 = datetime.strptime(df["dates"][i + 1], "%d-%b-%y")
    d = d + timedelta(days=1)
    in_between = []
    while d < d2:
        in_between.append(d.strftime("%d-%b-%y"))
        d = d + timedelta(days=1)
    in_betweens.append(in_between)
# the last row has no following date, so it gets an empty list
in_betweens.append([])
df["in_betweens"] = in_betweens
df.head()
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a DataFrame df which, for now, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
pre_df_list = []
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date, and cumsum() to get a cumulative sum, i.e. to include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge in the previous results. The fillna(method='ffill') will fill gaps in the middle of the data, while the last fillna(0) handles gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
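If you still need the counts per date, a small sketch building on the same comparison (assuming date_list holds datetime.date objects, as in the question; the counts below follow from the three example events):
>>> date_list = [date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)]
>>> pd.DataFrame({'Date': date_list,
...               'Events': [(df.event_date <= d).sum() for d in date_list]})
         Date  Events
0  2020-09-01       1
1  2020-09-02       2
2  2020-09-03       3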
I have a time-series in a pandas DataFrame at hourly frequency:
import pandas as pd
import numpy as np
idx = pd.date_range(freq="h", start="2018-01-01", periods=365*24)
df = pd.DataFrame({'value': np.random.rand(365*24)}, index=idx)
I have a list of dates:
dates = ['2018-03-20', '2018-04-08', '2018-07-14']
I want to end up with two DataFrames: one containing just the data for these dates, and one containing all of the data from the original DataFrame excluding those dates. In this case, I would have a DataFrame containing three days' worth of data (for the days listed in dates), and a DataFrame containing 362 days of data (all the data excluding those three days).
What is the best way to do this in pandas?
I can take advantage of nice string-based datetime indexing in pandas to extract each date separately, for example:
df[dates[0]]
and I can use this to put together a DataFrame containing just the specified dates:
to_concat = [df[date] for date in dates]
just_dates = pd.concat(to_concat)
This isn't as 'nice' as it could be, but does the job.
However, I can't work out how to remove those dates from the DataFrame to get the other output that I want. Doing:
df[~dates[0]]
gives a TypeError: bad operand type for unary ~: 'str', and I can't seem to get df.drop to work in this context.
What do you suggest as a nice, Pythonic and 'pandas-like' way to go about this?
Create a boolean mask with numpy.in1d after converting the dates to strings, or with Index.isin to test membership:
m = np.in1d(df.index.date.astype(str), dates)
m = df.index.to_series().dt.date.astype(str).isin(dates)
Or use DatetimeIndex.strftime to get strings:
m = df.index.strftime('%Y-%m-%d').isin(dates)
Another idea is to remove the times with DatetimeIndex.normalize, which yields a DatetimeIndex:
m = df.index.normalize().isin(dates)
#alternative
#m = df.index.floor('d').isin(dates)
Finally, filter by boolean indexing:
df1 = df[m]
And for the second DataFrame, invert the mask with ~:
df2 = df[~m]
print (df1)
value
2018-03-20 00:00:00 0.348010
2018-03-20 01:00:00 0.406394
2018-03-20 02:00:00 0.944569
2018-03-20 03:00:00 0.425583
2018-03-20 04:00:00 0.586190
...
2018-07-14 19:00:00 0.710710
2018-07-14 20:00:00 0.403660
2018-07-14 21:00:00 0.949572
2018-07-14 22:00:00 0.629871
2018-07-14 23:00:00 0.363081
[72 rows x 1 columns]
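As a quick sanity check (a sketch), df2 should then hold the remaining 362 days of hourly rows (362 * 24 = 8688):
print (df2.shape)
(8688, 1)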
One way to solve this:
df = df.reset_index()
with_date = df[df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##use del with_date.index.name to remove the index name, if required
without_date = df[~df['index'].dt.date.astype(str).isin(dates)].set_index('index')
##with_date
value
index
2018-03-20 00:00:00 0.059623
2018-03-20 01:00:00 0.343513
...
##without_date
value
index
2018-01-01 00:00:00 0.087846
2018-01-01 01:00:00 0.481971
...
Another way to solve this:
Keep your dates in datetime format, for example through a pd.Timestamp:
dates_in_dt_format = [pd.Timestamp(date).date() for date in dates]
Then, keep only the rows where the index's date is not in that group, for example with:
df_without_dates = df.loc[[idx for idx in df.index if idx.date() not in dates_in_dt_format]]
df_with_dates = df.loc[[idx for idx in df.index if idx.date() in dates_in_dt_format]]
or using pandas apply instead of list comprehension:
df_with_dates = df[df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
df_without_dates = df[~df.index.to_series().apply(lambda x: pd.Timestamp(x).date()).isin(dates_in_dt_format)]
I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])

def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
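Equivalently, as a sketch, the same computation in a single groupby pass:
>>> df.timestamp.groupby(df.id).agg(lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))
id
1     1.0
2    62.0
Name: timestamp, dtype: float64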
You can sort by id and timestamp, then group by id, and then find the difference between the max and min timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds/60)
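On newer pandas, where Series.sum(level=...) has been removed, an equivalent sketch is:
print(df.groupby(level=0)['timestamp'].diff().groupby(level=0).sum().dt.total_seconds() / 60)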
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = [2014-01-01,...]
end_dts = [2014-01-30,...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row where transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
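As a hedged side note, pd.merge_asof can do a similar backward match (each transaction gets the most recent Start at or before it); both frames must be sorted on the join keys, and you may still want to check transaction_dt <= End afterwards:
res = pd.merge_asof(df.sort_values('transaction_dt'),
                    df2.sort_values('Start'),
                    left_on='transaction_dt', right_on='Start')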
If you want the start and end of each month, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
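One caveat worth knowing (a hedged note, not specific to this data): subtracting pd.offsets.MonthBegin(1) from a date that is already the first of the month rolls back a full month, so first-of-month transactions would need special handling. A quick check:
print(pd.Timestamp('2004-01-01') - pd.offsets.MonthBegin(1))
# 2003-12-01 00:00:00, not 2004-01-01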
New approach:
import io
import pandas as pd
import datetime

string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""

df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])

# Get all timestamps that are necessary.
# This assumes dates are sorted;
# if not, we should change [0] -> min_dt and [-1] -> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and fill in columns start and end
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if value >= start and value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop the ranges
            # that have already been passed.
            # This can be removed if dates are not sorted,
            # but it should speed things up for large datasets
            for _ in range(i):
                ranges.pop(0)
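Under the same sorted-dates assumption, the inner matching loop could be vectorized; here is a hedged sketch with numpy.searchsorted over the same bin edges (boundary semantics differ slightly from the closed-on-both-sides ranges above):
import numpy as np
edges = pd.DatetimeIndex(timestamps)
# index of the window each transaction falls into
pos = np.searchsorted(edges.values, df['transaction_dt'].values, side='right') - 1
df['start'] = edges[pos]
df['end'] = edges[np.minimum(pos + 1, len(edges) - 1)]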
Here's the setup:
I have two (integer-indexed) columns, start and month_delta. start has timestamps (its internal type is np.datetime64[ns]) and month_delta is integers.
I want to quickly produce a column that consists of each datetime in start, offset by the corresponding number of months in month_delta. How do I do this?
Things I've tried that don't work:
apply is too slow.
You can't add a series of DateOffset objects to a series of datetime64[ns] dtype (or a DatetimeIndex).
You can't use a Series of timedelta64 objects either; Pandas silently converts month-based timedeltas to nanosecond-based timedeltas that are ~30 days long. (Yikes! What happened to not failing silently?)
Currently I'm iterating over all different values of month_delta and doing a tshift by that amount on the relevant part of a DatetimeIndex I created, but this is a horrible kludge:
new_dates = pd.Series(pd.Timestamp.now(), index=start.index)
date_index = pd.DatetimeIndex(start)
for i in xrange(month_delta.max()):
    mask = (month_delta == i)
    cur_dates = pd.Series(index=date_index[mask]).tshift(i, freq='M').index
    new_dates[mask] = cur_dates
Yuck! Any suggestions?
Here is a way to do it (by adding NumPy datetime64s with timedelta64s) without calling apply:
import pandas as pd
import numpy as np
np.random.seed(1)
def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

def year(dates):
    "Return an array of the years given an array of datetime64s"
    return dates.astype('M8[Y]').astype('i8') + 1970

def month(dates):
    "Return an array of the months given an array of datetime64s"
    return dates.astype('M8[M]').astype('i8') % 12 + 1

def day(dates):
    "Return an array of the days of the month given an array of datetime64s"
    return (dates - dates.astype('M8[M]')) / np.timedelta64(1, 'D') + 1

N = 10
df = pd.DataFrame({
    'start': pd.date_range('2000-1-25', periods=N, freq='D'),
    'months': np.random.randint(12, size=N)})
start = df['start'].values
df['new_date'] = combine64(year(start), months=month(start) + df['months'],
                           days=day(start))
print(df)
yields
months start new_date
0 5 2000-01-25 2000-06-25
1 11 2000-01-26 2000-12-26
2 8 2000-01-27 2000-09-27
3 9 2000-01-28 2000-10-28
4 11 2000-01-29 2000-12-29
5 5 2000-01-30 2000-06-30
6 0 2000-01-31 2000-01-31
7 0 2000-02-01 2000-02-01
8 1 2000-02-02 2000-03-02
9 7 2000-02-03 2000-09-03
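As a quick usage check (a sketch) of the overflow behaviour: day counts past the target month's length spill into the next month, so a hypothetical "Feb 31" rolls over.
print(combine64(2000, months=2, days=31))
# 2000-03-02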
I think something like this might work:
df['start'] = pd.to_datetime(df.start)
df.groupby('month_delta').apply(lambda x: x.start + pd.DateOffset(months=x.month_delta.iloc[0]))
There might be a better way to create a series of DateOffset objects and add them directly, but I don't know of one...
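For what it's worth, a hedged sketch of that idea: on more recent pandas versions, adding an object-dtype series of DateOffsets elementwise may work (emitting a PerformanceWarning rather than failing), though I wouldn't rely on it across versions:
offsets = df['month_delta'].apply(lambda m: pd.DateOffset(months=m))
df['new_date'] = df['start'] + offsets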
I was not able to find a way without at least using an apply for the setup, but assuming that is okay:
import datetime
import pandas

df = pandas.DataFrame(
    [[datetime.date(2014,10,22), 1], [datetime.date(2014,11,20), 2]],
    columns=['date','delta'])
>>> df
date delta
0 2014-10-22 1
1 2014-11-20 2
from dateutil.relativedelta import relativedelta
df['offset'] = df['delta'].apply(lambda x: relativedelta(months=x))
>>> df['date'] + df['offset']
0 2014-11-22
1 2015-01-20
Note that you must use the datetime from the datetime module rather than the NumPy or pandas one. Since you are only creating the delta with the apply, I would hope you experience a speedup.
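A small usage follow-up (a sketch): the sum can be assigned back and converted to pandas datetimes if needed.
df['new_date'] = pandas.to_datetime(df['date'] + df['offset'])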