I am struggling to find a solution for the following problem: I have a dataframe which reports quarterly values. Unfortunately, some of the companies report their quarterly numbers a month after the typical release quarter-dates. For this reason, I would like to select these dates and change them to the typical release date. My dataframe looks like this:
# dataframe 1
rng1 = pd.date_range('2014-12-31', periods=5, freq='3M')
df1 = pd.DataFrame({ 'Date': rng1, 'Company': [1, 1, 1, 1 ,1], 'Val': np.random.randn(len(rng1)) })
# dataframe 2
rng2 = pd.date_range('2015-01-30', periods=5, freq='3M')
df2 = pd.DataFrame({ 'Date': rng2, 'Company': [2, 2, 2, 2 ,2],'Val': np.random.randn(len(rng2)) })
# Target Dataframe
frames = [df1, df2]
df_fin = pd.concat(frames)
Output:
Date Company Val
0 2014-12-31 1 0.374427
1 2015-03-31 1 0.328239
2 2015-06-30 1 -1.226196
3 2015-09-30 1 -0.153937
4 2015-12-31 1 -0.146096
0 2015-01-31 2 0.283528
1 2015-04-30 2 0.426100
2 2015-07-31 2 -0.044960
3 2015-10-31 2 -1.316574
4 2016-01-31 2 0.353073
So what I would like to do is the following: Company 2 reports their numbers a month later. For this reason I would like to shift their dates so they align with Company 1, e.g. change a date such as 2015-01-31 to 2014-12-31.
Any help is highly appreciated
Thanks in advance
Use pd.merge_asof with direction='nearest' to merge the dataframe df_fin with the reference quarterly dates qDates:
# Reference quarterly dates (typical release dates)
qDates = pd.date_range('2014-12-31', periods=5, freq='Q')
df = pd.merge_asof(
df_fin.sort_values(by='Date'), pd.Series(qDates, name='Quarter'),
left_on='Date', right_on='Quarter', direction='nearest')
df = (
df.sort_values(by=['Company', 'Quarter'])
.drop(columns='Date')
.rename(columns={'Quarter': 'Date'})
.reindex(df_fin.columns, axis=1)
.reset_index(drop=True)
)
# print(df)
Date Company Val
0 2014-12-31 1 0.146874
1 2015-03-31 1 0.297248
2 2015-06-30 1 1.444860
3 2015-09-30 1 -0.348871
4 2015-12-31 1 -0.093267
5 2014-12-31 2 -0.238166
6 2015-03-31 2 -1.503571
7 2015-06-30 2 0.791149
8 2015-09-30 2 -0.419414
9 2015-12-31 2 -0.598963
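If the goal is always to snap dates onto calendar quarter ends, a merge isn't strictly necessary. A minimal sketch of an alternative, assuming the reporting lag is always one month (as in the question): shift the late dates back a month, then round to the quarter end via Period.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the late reporter (company 2) from the question.
rng2 = pd.date_range('2015-01-30', periods=5, freq='3M')
df2 = pd.DataFrame({'Date': rng2, 'Company': 2, 'Val': np.random.randn(len(rng2))})

# Shift back one month (assumption: the lag is exactly one month),
# then snap to the end of the calendar quarter via a Period round-trip.
snapped = ((df2['Date'] - pd.DateOffset(months=1))
           .dt.to_period('Q')
           .dt.to_timestamp(how='end')
           .dt.normalize())
```

Unlike merge_asof this needs no reference date list, but it hard-codes the one-month lag instead of picking the nearest typical date.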
I hope I understand what you mean. You can use pd.DateOffset (or the offsets in pd.offsets) here to add or subtract a number of months from Date, conditioned on Company == 2.
For example:
df_fin.loc[df_fin['Company'] == 2,'Date'] = df_fin.loc[df_fin['Company'] == 2,'Date'] - pd.DateOffset(months=1)
df_fin now prints:
# df_fin
Date Company Val
0 2014-12-31 1 -0.794092
1 2015-03-31 1 -2.632114
2 2015-06-30 1 -0.176383
3 2015-09-30 1 0.701986
4 2015-12-31 1 -0.447678
0 2014-12-31 2 -0.003322
1 2015-03-30 2 0.475669
2 2015-06-30 2 -1.024190
3 2015-09-30 2 1.241122
4 2015-12-31 2 0.096882
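Note in the output above that 2015-04-30 minus one calendar month lands on 2015-03-30, not on the quarter end 2015-03-31. A small sketch of one way around that, using MonthEnd(0) to roll such dates forward to the month end while leaving dates that already are month ends untouched:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-01-31', '2015-04-30', '2016-01-31']))

# DateOffset(months=1) preserves the day-of-month where possible, so
# 2015-04-30 becomes 2015-03-30; MonthEnd(0) then rolls forward to the
# month end, but does not move dates that are already month ends.
shifted = (dates - pd.DateOffset(months=1)) + pd.offsets.MonthEnd(0)
```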
Related
I have a problem. I want to do some date calculations, but unfortunately I got the error ValueError: cannot reindex from a duplicate axis. I looked at What does `ValueError: cannot reindex from a duplicate axis` mean?, but nothing there worked for me. How can I solve the problem?
I tried print(True in df.index.duplicated()) [OUT] False
# Did not work for me
#df[df.index.duplicated()]
#df = df.loc[:,~df.columns.duplicated()]
#df = df.reset_index()
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']
}
df = pd.DataFrame(data=d)
#display(df)
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
[OUT]
ValueError: cannot reindex from a duplicate axis
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-3f715dc564ee> in <module>()
3 .set_index('fromDate')
4 .groupby('customerId')['lastInteractivity']
----> 5 .apply(lambda x: x.asfreq('d'))
6 .reset_index())
Indeed, I arrived at the same conclusion as @ALollz in his comment: by using drop_duplicates, you get the expected result:
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.drop_duplicates(['fromDate', 'customerId'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
Output :
customerId fromDate lastInteractivity
0 1 2021-02-10 468 days
1 1 2021-02-11 NaT
2 1 2021-02-12 NaT
3 1 2021-02-13 NaT
4 1 2021-02-14 NaT
...
485 3 2021-07-13 NaT
486 3 2021-07-14 NaT
487 3 2021-07-15 NaT
488 3 2021-07-16 NaT
489 3 2021-07-17 311 days
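For context, the error can be reproduced in isolation: asfreq('d') has to reindex the group onto a daily grid, and reindexing is ambiguous whenever the index contains duplicate labels, which is exactly what the repeated 2021-02-22 rows for customer 3 produce (and why drop_duplicates fixes it). A minimal sketch:

```python
import pandas as pd

# Two rows sharing the same timestamp: asfreq cannot decide which one
# belongs at that position of the daily grid, so pandas raises.
s = pd.Series([1, 2], index=pd.to_datetime(['2021-02-22', '2021-02-22']))
try:
    s.asfreq('d')
    raised = False
except ValueError:
    raised = True
```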
I have a dataframe:
df1 = pd.DataFrame(
[['2011-01-01','2011-01-03','A'], ['2011-04-01','2011-04-01','A'], ['2012-08-28','2012-08-30','B'], ['2015-04-03','2015-04-05','A'], ['2015-08-21','2015-08-21','B']],
columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains some events A and B that occurred in the specified interval from d0 to d1. (There are actually more events, they are mixed, but they have no intersection by dates.) Moreover, this interval can be 1 day (d0 = d1). I need to go from df1 to df2 in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
[['2011-01-01','A'], ['2011-01-02','A'], ['2011-01-03','A'], ['2011-04-01','A'], ['2012-08-28','B'], ['2012-08-29','B'], ['2012-08-30','B'], ['2015-04-03','A'], ['2015-04-04','A'], ['2015-04-05','A'], ['2015-08-21','B']],
columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried doing this based on resample and comparing areas where ffill = bfill but couldn't come up with anything. How can this be done in the most simple way?
We can set_index to event then create date_range per row, then explode to unwind the ranges and reset_index to create the DataFrame:
df2 = (
df1.set_index('event')
.apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
.explode()
.reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let us try a comprehension to create the pairs of date and event:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I don't know if this is the "most simple," but it's the most intuitive way I can think to do it. I iterate over the rows and unroll it manually into a new dataframe. This means that I look at each row, iterate over the dates between d0 and d1, and construct a row for each of them and compile them into a dataframe:
from datetime import timedelta
def unroll_events(df):
rows = []
for _, row in df.iterrows():
event = row['event']
start = row['d0']
end = row['d1']
current = start
while current != end:
rows.append(dict(Date=current, event=event))
current += timedelta(days=1)
rows.append(dict(Date=current, event=event)) # make sure last one is included
return pd.DataFrame(rows)
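A quick self-contained check of this approach (the helper is restated here, slightly restructured with a `<=` loop condition, so the snippet runs on its own):

```python
import pandas as pd
from datetime import timedelta

def unroll_events(df):
    # Walk each interval day by day; <= keeps single-day events (d0 == d1).
    rows = []
    for _, row in df.iterrows():
        current, end = pd.Timestamp(row['d0']), pd.Timestamp(row['d1'])
        while current <= end:
            rows.append(dict(Date=current, event=row['event']))
            current += timedelta(days=1)
    return pd.DataFrame(rows)

sample = pd.DataFrame([['2011-01-01', '2011-01-03', 'A'],
                       ['2011-04-01', '2011-04-01', 'A']],
                      columns=['d0', 'd1', 'event'])
out = unroll_events(sample)
```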
New to Python and coding in general here so this should be pretty basic for most of you.
I basically created this dataframe with a Datetime index.
Here's the dataframe
df = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
I would now like to add a new variable to my df called "vacation" with a value of 1 if the date is between 2018-06-24 and 2018-08-24 and value of 0 if it's not between those dates. How can I go about doing this?
I've created a variable with a range of vacation but I'm not sure how to put these two together along with creating a new column for "vacation" in my dataframe.
vacation = pd.date_range(start = '2018-06-24', end='2018-08-24')
Thanks in advance.
First, pd.date_range(start='2018-01-01', end='2019-12-31', freq='D') will not create a DataFrame instead it will create a DatetimeIndex. You can then convert it into a DataFrame by having it as an index or a separate column.
# Having it as an index
datetime_index = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
df = pd.DataFrame({}, index=datetime_index)
# Using numpy.where() to create the Vacation column
df['Vacation'] = np.where((df.index >= '2018-06-24') & (df.index <= '2018-08-24'), 1, 0)
Or
# Having it as a column
datetime_index = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
df = pd.DataFrame({'Date': datetime_index})
# Using numpy.where() to create the Vacation column
df['Vacation'] = np.where((df['Date'] >= '2018-06-24') & (df['Date'] <= '2018-08-24'), 1, 0)
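As a variation on the column version above, Series.between expresses the same two-sided check in one call:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2018-01-01', '2019-12-31', freq='D')})
# between() is inclusive on both endpoints by default.
df['Vacation'] = df['Date'].between('2018-06-24', '2018-08-24').astype(int)
```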
Solution for new DataFrame:
i = pd.date_range(start='2018-01-01', end='2018-08-26', freq='D')
m = (i > '2018-06-24') & (i < '2018-08-24')
df = pd.DataFrame({'vacation': m.astype(int)}, index=i)
Or:
df = pd.DataFrame({'vacation':np.where(m, 1, 0)}, index=i)
print (df)
vacation
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 0
...
2018-08-22 1
2018-08-23 1
2018-08-24 0
2018-08-25 0
2018-08-26 0
[238 rows x 1 columns]
Solution for adding a new column to an existing DataFrame:
Create a mask by comparing against the DatetimeIndex, chain the conditions with & (bitwise AND), and convert to integer (True to 1, False to 0), or use numpy.where:
i = pd.date_range(start='2018-01-01', end='2018-08-26', freq='D')
df = pd.DataFrame({'a': 1}, index=i)
m = (df.index > '2018-06-24') & (df.index < '2018-08-24')
df['vacation'] = m.astype(int)
#alternative
#df['vacation'] = np.where(m, 1, 0)
print (df)
a vacation
2018-01-01 1 0
2018-01-02 1 0
2018-01-03 1 0
2018-01-04 1 0
2018-01-05 1 0
.. ...
2018-08-22 1 1
2018-08-23 1 1
2018-08-24 1 0
2018-08-25 1 0
2018-08-26 1 0
[238 rows x 2 columns]
Another solution with DatetimeIndex and DataFrame.loc - the difference is that the edge values 2018-06-24 and 2018-08-24 are included:
df['vacation'] = 0
df.loc['2018-06-24':'2018-08-24', 'vacation'] = 1
print (df)
a vacation
2018-01-01 1 0
2018-01-02 1 0
2018-01-03 1 0
2018-01-04 1 0
2018-01-05 1 0
.. ...
2018-08-22 1 1
2018-08-23 1 1
2018-08-24 1 1
2018-08-25 1 0
2018-08-26 1 0
[238 rows x 2 columns]
df_new = pd.DataFrame(
{
'person_id': [1, 1, 3, 3, 5, 5],
'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA', '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})
The resulting frame mixes real dates with 'NA-NA-NA NA:NA:NA' placeholder strings.
What I would like to do is replace/fill these NA-type rows with actual date values from the same person_id group. For that I tried the below:
m1 = df_new['obs_date'].str.contains(r'^\d')
df_new['obs_date'] = df_new.groupby((m1).cumsum())['obs_date'].transform('first')
But this gives an unexpected output: for the 2nd row of person_id = 3, the value should have been 11/25/2009, but instead it comes from the 1st group (person_id = 1).
How can I get the expected output, where each placeholder is filled from its own person_id group?
Any elegant and efficient solution is helpful, as I am dealing with more than a million records.
First use to_datetime with errors='coerce' to convert non-datetimes to missing values, then use GroupBy.transform with 'first' to fill the column with the first non-missing value per group:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
Another idea is to use DataFrame.sort_values with GroupBy.ffill:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
.groupby('person_id')['obs_date']
.ffill())
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
You can do pd.to_datetime(.., errors='coerce') to turn non-date values into NaT, then ffill and bfill within each group:
df_new['obs_date']=(df_new.assign(obs_date=pd.to_datetime(df_new['obs_date'],
errors='coerce')).groupby('person_id')['obs_date'].apply(lambda x: x.ffill().bfill()))
print(df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
A per-group minimum joined back also works. Note that min here compares the raw strings, which only recovers the date because 'NA-NA-NA NA:NA:NA' happens to sort after digit-leading dates; converting with to_datetime first is safer:
df_new = df_new.join(df_new.groupby('person_id')["obs_date"].min(),
                     on='person_id',
                     rsuffix="_clean")
Output:
person_id obs_date obs_date_clean
0 1 12/31/2007 12/31/2007
1 1 NA-NA-NA NA:NA:NA 12/31/2007
2 3 NA-NA-NA NA:NA:NA 11/25/2009
3 3 11/25/2009 11/25/2009
4 5 10/15/2019 10/15/2019
5 5 NA-NA-NA NA:NA:NA 10/15/2019
I have data for a number of events with start and end times like this:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
Out:
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
Now I need to calculate the number of events active at the same time, and eg. the sum of their values. So the result should look something like this:
date count sum
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-08 0 0
2015-01-09 0 0
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
Any ideas for how to do this? I was thinking about using a custom Grouper for groupby, but as far as I can see a Grouper can only assign a row to a single group so that doesn't look useful.
EDIT: After some testing I found this rather ugly way to get the desired result:
df['count'] = 1
dates = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
start = df[['start', 'value', 'count']].set_index('start').reindex(dates)
end = df[['end', 'value', 'count']].set_index('end').reindex(dates).shift(1)
rstart = start.rolling(len(start), min_periods=1).sum()
rend = end.rolling(len(end), min_periods=1).sum()
rstart.subtract(rend, fill_value=0).fillna(0)
However, this only works with sums, and I can't see an obvious way to make it work with other functions. For example, is there a way to get it to work with median instead of sum?
If I were using SQL, I would do this by joining an all-dates table to the events table, and then grouping by date. Pandas doesn't make this approach especially easy, since there's no way to left-join on a condition, but we can fake it using dummy columns and reindexing:
df = pd.DataFrame({'start': ['2015-01-05', '2015-01-10', '2015-01-11'], 'end': ['2015-01-07', '2015-01-15', '2015-01-13'], 'value': [3, 4, 5]})
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
df['dummy'] = 1
Then:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
cross_join = date_df.merge(df, on='dummy')
cond_join = cross_join[(cross_join.start <= cross_join.date) & (cross_join.date <= cross_join.end)]
grp_join = cond_join.groupby(['date'])
final = (
pd.DataFrame(dict(
val_count=grp_join.size(),
val_sum=grp_join.value.sum(),
val_median=grp_join.value.median()
), index=date_series)
.fillna(0)
.reset_index()
)
The fillna(0) isn't perfect, since it turns nulls in the val_median column into 0s, when they should really remain nulls.
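One way to keep the median nulls while still zero-filling the others is to fillna only the columns where 0 is meaningful (a small sketch on a hypothetical two-row aggregate, one date having no events):

```python
import numpy as np
import pandas as pd

# Hypothetical aggregate: the second date had no active events.
final = pd.DataFrame({'val_count': [1.0, np.nan],
                      'val_sum': [3.0, np.nan],
                      'val_median': [3.0, np.nan]})
# Zero-fill counts and sums only; the median stays a true null.
final[['val_count', 'val_sum']] = final[['val_count', 'val_sum']].fillna(0)
```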
Alternatively, with pandas-ply we can code that up as:
date_series = pd.date_range('2015-01-05', '2015-01-15', freq='1D')
date_df = pd.DataFrame(dict(date=date_series, dummy=1))
final = (
date_df
.merge(df, on='dummy')
.ply_where(X.start <= X.date, X.date <= X.end)
.groupby('date')
.ply_select(val_count=X.size(), val_sum=X.value.sum(), median=X.value.median())
.reindex(date_series)
.ply_select('*', val_count=X.val_count.fillna(0), val_sum=X.val_sum.fillna(0))
.reset_index()
)
which handles nulls a bit better.
This is what I came up with. I've got to think there's a better way.
Given your frame
end start value
0 2015-01-07 2015-01-05 3
1 2015-01-15 2015-01-10 4
2 2015-01-13 2015-01-11 5
and then
dList = []
vList = []
d = {}
def buildDict(row):
for x in pd.date_range(row["start"],row["end"]): #build a range for each row
dList.append(x) #date list
vList.append(row["value"]) #value list
df.apply(buildDict,axis=1) #each row in df is passed to buildDict
#this d will be used to create our new frame
d["date"] = dList
d["value"] = vList
#from here you can use whatever agg functions you want
pd.DataFrame(d).groupby("date").agg(["count","sum"])
yields
value
count sum
date
2015-01-05 1 3
2015-01-06 1 3
2015-01-07 1 3
2015-01-10 1 4
2015-01-11 2 9
2015-01-12 2 9
2015-01-13 2 9
2015-01-14 1 4
2015-01-15 1 4
You can avoid the cross join by exploding the dates, imputing the missing rows with complete from pyjanitor, before aggregating the dates:
# pip install pyjanitor
import pandas as pd
import janitor
(df.assign(dates = [pd.date_range(start, end, freq='1D')
for start, end
in zip(df.start, df.end)])
.explode('dates')
.loc[:, ['value', 'dates']]
.complete({'dates': lambda df: pd.date_range(df.min(), df.max(), freq='1D')})
.groupby('dates')
.agg(['size', 'sum'])
.droplevel(level=0, axis='columns')
)
size sum
dates
2015-01-05 1 3.0
2015-01-06 1 3.0
2015-01-07 1 3.0
2015-01-08 1 0.0
2015-01-09 1 0.0
2015-01-10 1 4.0
2015-01-11 2 9.0
2015-01-12 2 9.0
2015-01-13 2 9.0
2015-01-14 1 4.0
2015-01-15 1 4.0
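For completeness, a join-free sketch that handles any aggregate (median included): for each day of the range, select the rows whose interval covers it and reduce. This is O(days x rows), so it is only a fit for modest inputs.

```python
import pandas as pd

df = pd.DataFrame({'start': pd.to_datetime(['2015-01-05', '2015-01-10', '2015-01-11']),
                   'end': pd.to_datetime(['2015-01-07', '2015-01-15', '2015-01-13']),
                   'value': [3, 4, 5]})

dates = pd.date_range('2015-01-05', '2015-01-15', freq='D')
# For each day, the values of all events whose [start, end] contains it.
active = {d: df.loc[(df['start'] <= d) & (d <= df['end']), 'value'] for d in dates}
out = pd.DataFrame({'count': {d: len(v) for d, v in active.items()},
                    'sum': {d: v.sum() for d, v in active.items()},
                    'median': {d: v.median() for d, v in active.items()}})
```

Empty days get count 0 and sum 0, while the median stays NaN, matching the null handling discussed above.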