I have a date string and a date column 'date1' in a pandas DataFrame 'df', as below:
date = '202107'
df
date1
0 2021-07-01
1 2021-08-01
2 2021-09-01
3 2021-10-01
4 2021-11-01
5 2023-02-01
6 2023-03-01
I want to create a column 'months' in df where
months = (date1 + 1 month) - date
My output dataframe should look like below:
df
date1 months
0 2021-07-01 1
1 2021-08-01 2
2 2021-09-01 3
3 2021-10-01 4
4 2021-11-01 5
5 2023-02-01 20
6 2023-03-01 21
Here's a way to do it using pandas. Note that comparing only .dt.month works while date1 stays in the same year as date; the 2023 rows in the question also need the year difference:
date = '202107'
date = pd.to_datetime(date, format='%Y%m')
df['months'] = (df.date1 + pd.offsets.MonthBegin(1)).dt.month - date.month
print(df)
date1 months
0 2021-07-01 1
1 2021-08-01 2
2 2021-09-01 3
3 2021-10-01 4
4 2021-11-01 5
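The snippet above compares month numbers only, so it is correct only while date1 stays in the same year as date; the 2023 rows from the question would come out wrong. A sketch that folds in the year difference, rebuilt on the question's sample data:

```python
import pandas as pd

date = pd.to_datetime('202107', format='%Y%m')
df = pd.DataFrame({'date1': pd.to_datetime([
    '2021-07-01', '2021-08-01', '2021-09-01', '2021-10-01',
    '2021-11-01', '2023-02-01', '2023-03-01'])})

# full month difference = 12 * year gap + month gap,
# plus 1 for the question's (date1 + 1 month) convention
df['months'] = ((df['date1'].dt.year - date.year) * 12
                + (df['date1'].dt.month - date.month) + 1)
print(df['months'].tolist())  # [1, 2, 3, 4, 5, 20, 21]
```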
Given a date variable as follows:
mydate = 202003
and a DataFrame df containing a datetime column start_date, you can do:
mydate_to_use = pd.to_datetime(str(mydate), format='%Y%m')
df['months'] = (df['start_date'].dt.year - mydate_to_use.year) * 12 + (df['start_date'].dt.month - mydate_to_use.month) + 1
(The + 1 follows the question's (date1 + 1 month) convention; drop it for a plain month difference.)
IIUC:
import numpy as np

s = (df.date1 - pd.to_datetime(date, format='%Y%m')) // np.timedelta64(1, 'M') + 1
s
Out[118]:
0 1
1 2
2 3
3 4
4 5
Name: date1, dtype: int64
df['months'] = s
(Dividing timedeltas by np.timedelta64(1, 'M') is deprecated in newer pandas/NumPy, since month units are ambiguous; the Update below sidesteps it.)
Update
(df.date1.dt.year * 12 + df.date1.dt.month) - (pd.to_numeric(date) // 100) * 12 - (pd.to_numeric(date) % 100) + 1
Out[379]:
0 1
1 2
2 3
3 4
4 5
5 20
6 21
Name: date1, dtype: int64
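As a self-contained check, the Update line reproduces the question's full expected output on the sample data, using only integer arithmetic on the yyyymm string:

```python
import pandas as pd

date = '202107'
df = pd.DataFrame({'date1': pd.to_datetime([
    '2021-07-01', '2021-08-01', '2021-09-01', '2021-10-01',
    '2021-11-01', '2023-02-01', '2023-03-01'])})

# encode both sides as "months since year 0" and subtract
months = ((df.date1.dt.year * 12 + df.date1.dt.month)
          - (pd.to_numeric(date) // 100) * 12
          - (pd.to_numeric(date) % 100) + 1)
print(months.tolist())  # [1, 2, 3, 4, 5, 20, 21]
```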
Related
I have a problem. Starting from a date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the last 6 months. He had two touches, on 2022-05-25 and 2022-05-20. However, I don't know how to group by customer and count, for each row's date, how many touches the customer had in the window before it. I also get a KeyError.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
"2022-06-02", "2021-03-01", "2021-02-01"]
}
df = pd.DataFrame(data=d)
print(df)
df_new = df.groupby(['customerId', 'fromDate'], as_index=False)['fromDate'].count()
df_new['count_from_date'] = df_new['fromDate']
df = df.merge(df_new['count_from_date'], how='inner', left_index=True, right_index=True)
(df.set_index(['fromDate']).sort_index().groupby('customerId').apply(lambda s: s['count_from_date'].rolling('180D').sum())- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
[OUT] KeyError: 'count_from_date'
What I want
customerId fromDate occur_last_6_months
0 1 2022-06-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 1 # 2022-05-20 = 1
2 1 2022-05-25 1 # 2022-05-20 = 1
3 1 2022-05-20 0 # None in the last 6 months
4 1 2021-09-05 0 # None in the last 6 months
5 2 2022-06-02 0 # None in the last 6 months
6 3 2021-03-01 1 # 2021-02-01 = 1
7 3 2021-02-01 0 # None in the last 6 months
If you want duplicated dates, like the second and third rows, to count each other, build a boolean mask comparing every date against the whole group and sum only the True values:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1) - 1
    return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 2
2 1 2022-05-25 2021-11-25 2
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
If you need to subtract the full count per duplicated date, instead of subtracting 1, use GroupBy.transform with 'size':
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
    d1 = x["fromDate"].to_numpy()
    d2 = x["last_month"].to_numpy()
    x['occur_last_6_months'] = ((d2[:, None] <= d1) & (d1 <= d1[:, None])).sum(axis=1)
    return x
df = df.groupby('customerId').apply(f)
s = df.groupby(['customerId', 'fromDate'])['customerId'].transform('size')
df['occur_last_6_months'] -= s
print(df)
customerId fromDate last_month occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
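As a cross-check on the broadcasting approach, here is a deliberately simple (and slower) per-row sketch of the same rule: for each row, count same-customer touches strictly earlier than that row's date and at most 6 months back. The count_prior helper name is just illustrative; it reproduces the requested occur_last_6_months column:

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
    'fromDate': pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
        "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01"]),
})

# earliest date still inside each row's 6-month window
cutoff = df['fromDate'] - pd.offsets.DateOffset(months=6)

def count_prior(row_idx):
    row = df.loc[row_idx]
    same = df[df['customerId'] == row['customerId']]
    # strictly earlier than this row's date, but not older than the cutoff
    mask = (same['fromDate'] >= cutoff[row_idx]) & (same['fromDate'] < row['fromDate'])
    return mask.sum()

df['occur_last_6_months'] = [count_prior(i) for i in df.index]
print(df['occur_last_6_months'].tolist())  # [3, 1, 1, 0, 0, 0, 1, 0]
```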
I'm using a DataFrame and converting the time column to years and months like this:
consumer_confidence = pd.read_csv('consumer_confidence.csv')
business_confidence = pd.read_csv('business_confidence.csv')
consumer_confidence['Year'] = pd.to_datetime(consumer_confidence['TIME']).dt.year
consumer_confidence['Month'] = pd.to_datetime(consumer_confidence['TIME']).dt.month
business_confidence['Year'] = pd.to_datetime(business_confidence['TIME']).dt.year
business_confidence['Month'] = pd.to_datetime(business_confidence['TIME']).dt.month
business_confidence = business_confidence.groupby('Year')['Value'].sum()
consumer_confidence = consumer_confidence.groupby('Year')['Value'].sum()
Attempting the .groupby() statements results in this error:
AttributeError: 'Series' object has no attribute 'Year'
I am unsure how to resolve this, as 'Year' should now be a column in the DataFrame. Could someone explain my error here?
Your code (with some sample inputs as shown below) works fine for me:
import pandas as pd
'''
consumer_confidence = pd.read_csv('consumer_confidence.csv')
business_confidence = pd.read_csv('business_confidence.csv')
'''
consumer_confidence = pd.DataFrame({'TIME':['2021-01-01', '2021-02-01', '2022-04-11', '2022-04-12'], 'Value':[1,2,3,4]})
business_confidence = pd.DataFrame({'TIME':['2020-01-01', '2021-02-01', '2022-04-11', '2022-04-12'], 'Value':[5,6,7,8]})
print(consumer_confidence)
print(business_confidence)
consumer_confidence['Year'] = pd.to_datetime(consumer_confidence['TIME']).dt.year
consumer_confidence['Month'] = pd.to_datetime(consumer_confidence['TIME']).dt.month
business_confidence['Year'] = pd.to_datetime(business_confidence['TIME']).dt.year
business_confidence['Month'] = pd.to_datetime(business_confidence['TIME']).dt.month
print(consumer_confidence)
print(business_confidence)
business_confidence = business_confidence.groupby('Year')['Value'].sum()
consumer_confidence = consumer_confidence.groupby('Year')['Value'].sum()
print(consumer_confidence)
print(business_confidence)
Output:
TIME Value
0 2021-01-01 1
1 2021-02-01 2
2 2022-04-11 3
3 2022-04-12 4
TIME Value
0 2020-01-01 5
1 2021-02-01 6
2 2022-04-11 7
3 2022-04-12 8
TIME Value Year Month
0 2021-01-01 1 2021 1
1 2021-02-01 2 2021 2
2 2022-04-11 3 2022 4
3 2022-04-12 4 2022 4
TIME Value Year Month
0 2020-01-01 5 2020 1
1 2021-02-01 6 2021 2
2 2022-04-11 7 2022 4
3 2022-04-12 8 2022 4
Year
2021 3
2022 7
Name: Value, dtype: int64
Year
2020 5
2021 6
2022 15
Name: Value, dtype: int64
I have loaded a pandas dataframe from a .csv file that contains a column having datetime values.
df = pd.read_csv('data.csv')
The name of the column holding the datetime values is pickup_datetime. Here's what I get if I do df['pickup_datetime'].head():
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]
How do I convert this column into a NumPy array containing only the day values of the datetimes? For example: 15 from 2009-06-15 17:26:00+00:00, 5 from 2010-01-05 16:52:00+00:00, etc.
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
df['pickup_datetime'].dt.day.values
# array([15, 5, 18, 21, 9])
Just adding another variant, although coldspeed has already provided the brief answer, as a Christmas and New Year bonus :-) :
>>> df
pickup_datetime
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Convert the strings to timestamps by inferring their format:
>>> df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
>>> df
pickup_datetime
0 2009-06-15 17:26:00
1 2010-01-05 16:52:00
2 2011-08-18 00:35:00
3 2012-04-21 04:30:00
4 2010-03-09 07:51:00
You can pick only the day from pickup_datetime:
>>> df['pickup_datetime'].dt.day
0 15
1 5
2 18
3 21
4 9
Name: pickup_datetime, dtype: int64
You can pick only the month from pickup_datetime:
>>> df['pickup_datetime'].dt.month
0 6
1 1
2 8
3 4
4 3
You can pick only the year from pickup_datetime:
>>> df['pickup_datetime'].dt.year
0 2009
1 2010
2 2011
3 2012
4 2010
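Since the question asks for a NumPy array rather than a Series, any of these accessors can be finished with .to_numpy() (or the older .values), e.g.:

```python
import pandas as pd

df = pd.DataFrame({'pickup_datetime': pd.to_datetime([
    '2009-06-15 17:26:00', '2010-01-05 16:52:00', '2011-08-18 00:35:00',
    '2012-04-21 04:30:00', '2010-03-09 07:51:00'])})

# .dt.day gives a Series; .to_numpy() turns it into a plain ndarray
days = df['pickup_datetime'].dt.day.to_numpy()
print(days.tolist())  # [15, 5, 18, 21, 9]
```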
I have a timeseries DataFrame that is date-agnostic and uses periods instead of dates.
At some point I would like to add in dates derived from the period.
My dataframe looks like
period custid
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
I would like to be able to pick an arbitrary starting date, for example 1/1/2018, which would be period 1, so you would end up with:
period custid date
1 1 1/1/2018
2 1 2/1/2018
3 1 3/1/2018
1 2 1/1/2018
2 2 2/1/2018
1 3 1/1/2018
2 3 2/1/2018
3 3 3/1/2018
4 3 4/1/2018
You could create a column of timedeltas based on the period column, where each row is a timedelta of period days (minus 1, so that it starts at 0). Then define your start_date as a datetime object and add the timedelta to it:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
period custid date
0 1 1 2018-01-01
1 2 1 2018-01-02
2 3 1 2018-01-03
3 1 2 2018-01-01
4 2 2 2018-01-02
5 1 3 2018-01-01
6 2 3 2018-01-02
7 3 3 2018-01-03
8 4 3 2018-01-04
Edit: In your comment, you said you were trying to add months, not days. For this, you could use your method or, alternatively, the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] - 1) * MonthBegin()
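A runnable sketch of the month-based variant on the question's sample data, using pd.DateOffset row by row (equivalent here, since period 1 maps to the start date itself):

```python
import pandas as pd

df = pd.DataFrame({'period': [1, 2, 3, 1, 2, 1, 2, 3, 4],
                   'custid': [1, 1, 1, 2, 2, 3, 3, 3, 3]})
start_date = pd.to_datetime('1/1/2018')

# period 1 -> start_date, period 2 -> start_date + 1 month, ...
df['date'] = df['period'].apply(lambda p: start_date + pd.DateOffset(months=p - 1))
print(df)
```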
I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to consider the two entries part of the same group and group accordingly. Dates are m-d-y.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the groupby summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects of this, for example what to do if the dates string together endlessly, three days apart; but the exact data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
.groupby(['user_id', pd.TimeGrouper('3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper is deprecated and has since been removed from pandas, so Grouper is the one to use in this scenario (thanks for the heads-up, Vaishali!).
Here is a rather ugly solution, but it still works...
df = df.sort_values(['user_id', 'date'])
df['Key'] = df.groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
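For completeness, here is the diff/cumsum idea as a runnable snippet on the question's sample data; Key just labels consecutive runs of same-user dates less than 3 days apart:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 1, 2, 2, 3],
    'date': pd.to_datetime(['1-1-17', '1-1-17', '1-1-17', '1-1-17',
                            '1-2-17', '1-2-17', '1-10-17', '2-1-17'],
                           format='%m-%d-%y'),
    'val': 1,
})

df = df.sort_values(['user_id', 'date'])
# start a new Key whenever the gap to the previous same-user date is >= 3 days
df['Key'] = df.groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
out = df.groupby(['user_id', 'Key'], as_index=False).agg(
    val=('val', 'sum'), date=('date', 'first'))
print(out)
```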