Pandas dataframe add integer columns into datetime columns - python

I have a column named transaction_date that stores a date (1970-01-01, for example), and a column payment_plan_days that stores a number of days (30, 70, or any other integer).
How should I add payment_plan_days to transaction_date to create a new column, membership_expire_date?
I tried the code below, but it doesn't work because the two columns do not have the same dtype.
df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'membership_expire_date'] = (
    df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'transaction_date']
    + df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'payment_plan_days']
)

I think you need to_timedelta:
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
Sample:
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
print (df)
   payment_plan_days transaction_date        new
0                 30       1970-01-01 1970-01-31
1                 70       2005-07-17 2005-09-25
2                100       2005-07-17 2005-10-25
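To update only the rows that still hold the 1970-01-01 placeholder, as the question attempts, a boolean mask can be combined with the same to_timedelta call. This is a minimal sketch: the sample values are made up for illustration, and it assumes both date columns are already datetime64 (for example after pd.to_datetime).

```python
import pandas as pd

# Illustrative data (not from the question); 1970-01-01 marks "not set yet".
df = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2017-01-01", "2017-02-15", "2017-03-10"]),
    "payment_plan_days": [30, 70, 100],
    "membership_expire_date": pd.to_datetime(["1970-01-01", "2017-06-30", "1970-01-01"]),
})

# Rows whose expiry date is still the placeholder.
mask = df["membership_expire_date"] == pd.Timestamp("1970-01-01")

# Convert the integer day counts to timedeltas before adding them to the dates.
df.loc[mask, "membership_expire_date"] = (
    df.loc[mask, "transaction_date"]
    + pd.to_timedelta(df.loc[mask, "payment_plan_days"], unit="D")
)

print(df)
```

Rows that already have a real expiry date are left untouched, because the assignment is restricted to the masked rows.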

The same idea as jezrael's answer, but using the standard library's datetime.timedelta:
import pandas as pd
from datetime import timedelta
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df.loc[:, 'expiration_date'] = list(map(lambda td, ppd: td+timedelta(ppd), df['transaction_date'], df['payment_plan_days']))

Related

Why do I get the same date in every row when trying to fill a dataset?

I have a dataset with 800 rows and I want to create a new column with a date that increases by one day in each row.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
Every row of the column ends up equal to '2014-01-12'. As I understand it, the column is filled as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
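You could use a date range (note that pandas parses the string '5/11/2011' month-first, i.e. as May 11, 2011, unlike the '%d/%m/%Y' format used in the question):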
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
          Date
0   2011-05-11
1   2011-05-12
2   2011-05-13
3   2011-05-14
4   2011-05-15
..         ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
pandas is a nice tool that can repeat a calculation over a whole column without a for-loop.
When you use df['Date'] = ... with a single value, you assign that same value to every cell in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x,'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
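A vectorized alternative that keeps the day-first date from the question is to add a range of day offsets to the start date in one step. This is a minimal sketch; the "value" column is a hypothetical stand-in for the existing 800-row dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the existing 800-row dataset.
df = pd.DataFrame({"value": range(800)})

# Parse the start date day-first, as in the question: 5 November 2011.
start = pd.to_datetime("5/11/2011", format="%d/%m/%Y")

# One offset per row (0, 1, 2, ... days), added to the start date at once.
df["Date"] = start + pd.to_timedelta(np.arange(len(df)), unit="D")

print(df["Date"].head())
print(df["Date"].tail())
```

The last row here is the date the loop in the question wrote into every row, which is consistent with the explanation that only the final iteration's value survived.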

How to generate random dates between date range inside pandas column?

I have df that looks like this
df:
id dob
1 7/31/2018
2 6/1992
I want to generate 88799 random dates for the dob column of the dataframe, between 1960-01-01 and 1990-12-31, keeping the mm/dd/yyyy format with no timestamp.
How would I do this?
I tried:
date1 = (1960,01,01)
date2 = (1990,12,31)
for i range(date1,date2):
df.dob = i
I would figure out how many days are in your date range, then select 88799 random integers in that range, and finally add that as a timedelta with unit='d' to your minimum date:
import numpy as np
min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')
d = (max_date - min_date).days + 1
# pd.np is not available in recent pandas versions, so call NumPy directly
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d,size=88799), unit='d')
>>> df.head()
dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03
>>> df.tail()
dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
EDIT: You can format your dates using .strftime('%m/%d/%Y'), but note that this will slow down the execution significantly:
import numpy as np
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d,size=88799), unit='d')).strftime('%m/%d/%Y')
>>> df.head()
dob
0 02/26/1969
1 04/09/1963
2 08/29/1984
3 02/12/1961
4 08/02/1988
>>> df.tail()
dob
88794 02/13/1968
88795 02/05/1982
88796 07/03/1964
88797 06/11/1976
88798 11/17/1965
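The same approach can be written with NumPy's Generator API, which makes the result reproducible through a seed. This is a sketch under assumptions: the seed value is arbitrary, and the dataframe is built from scratch rather than reusing the one from the question.

```python
import numpy as np
import pandas as pd

n_rows = 88799  # row count taken from the question
min_date = pd.Timestamp("1960-01-01")
max_date = pd.Timestamp("1990-12-31")

# Number of distinct days in the inclusive range.
n_days = (max_date - min_date).days + 1

# Seeded generator; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

# integers() excludes the upper bound, so offsets run from 0 to n_days - 1.
offsets = rng.integers(0, n_days, size=n_rows)

dates = min_date + pd.to_timedelta(offsets, unit="D")

# Store the dates as mm/dd/yyyy strings, as requested in the question.
df = pd.DataFrame({"dob": dates.strftime("%m/%d/%Y")})

print(df.head())
```

Keeping the intermediate dates object as real datetimes is useful if the values need to be sorted or compared later, since the formatted column holds plain strings.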

Subtracting values across grouped data frames in Pandas

I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result to cycletime df
        cycletime.loc[-1] = [id,time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
You can sort by id, then group by id and find the difference between the min and max timestamp per group.
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00','2018-02-01 03:01:00','2018-02-02 10:03:00',
         '2018-02-02 10:04:00','2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
# sum(level=0) is not available in recent pandas versions, so group again before summing;
# total_seconds() also stays correct when the total exceeds one day, unlike .dt.seconds
print(df.groupby(level=0)['timestamp'].diff().groupby(level=0).sum().dt.total_seconds() / 60)
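The answers above can be condensed into a version that builds the grouping once and returns the id/delta table from the question. This is a minimal sketch using the sample data from the question; the names grouped, delta_minutes and result are chosen here for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2018-02-01 03:00:00",
        "2018-02-01 03:01:00",
        "2018-02-02 10:03:00",
        "2018-02-02 10:04:00",
        "2018-02-02 11:05:00",
    ]),
})

# Group once and reuse the grouped object for both aggregations.
grouped = df.groupby("id")["timestamp"]

# Latest minus earliest timestamp per id, expressed in minutes.
delta_minutes = (grouped.max() - grouped.min()).dt.total_seconds() / 60

result = delta_minutes.rename("delta").reset_index()
print(result)
```

Because max and min are built-in groupby aggregations, this avoids the Python-level loop over groups that made the original version slow.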

pandas add a value column to datetime

I have this simple problem, but for some reason it's giving me a headache. I want to add an existing Date column to another column to get a NewDate column.
For example: I have Date and n columns, and I want to add a NewDate column to my existing df.
df:
Date n NewDate (New Calculation here: Date + n)
05/31/2017 3 08/31/2017
01/31/2017 4 05/31/2017
12/31/2016 2 02/28/2017
I tried:
df['NewDate'] = (pd.to_datetime(df['Date']) + MonthEnd(n))
but I get an error saying "cannot convert the series to class 'int'".
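You're probably looking for an addition with a timedelta object. Note that unit='M' is interpreted as an average month length (about 30.44 days), so the results are approximate and do not land exactly on month ends; recent pandas versions no longer accept unit='M' in to_timedelta.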
v = pd.to_datetime(df.Date) + (pd.to_timedelta(df.n, unit='M'))
v
0 2017-08-30 07:27:18
1 2017-06-01 17:56:24
2 2017-03-01 20:58:12
dtype: datetime64[ns]
At the end, you can convert the result back into the same format as before -
df['NewDate'] = v.dt.strftime('%m/%d/%Y')
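If the goal is the month-end dates shown in the question's expected NewDate column, one option is to apply a MonthEnd offset row by row, since the offset takes a single integer rather than a Series. This is a sketch, not the answer's method; it assumes every Date already falls on a month end, as in the sample data (for other dates, MonthEnd first rolls forward to the end of the current month and counts that as one step).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Date": ["05/31/2017", "01/31/2017", "12/31/2016"],
    "n": [3, 4, 2],
})

dates = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# MonthEnd needs a plain integer, so build one offset per row.
new_dates = pd.Series(
    [d + pd.offsets.MonthEnd(int(n)) for d, n in zip(dates, df["n"])],
    index=df.index,
)

# Convert back to the original mm/dd/yyyy string format.
df["NewDate"] = new_dates.dt.strftime("%m/%d/%Y")
print(df)
```

The list comprehension runs in Python rather than in vectorized code, so it may be slow on very large frames, but it keeps each row's own month count.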

Calculate difference between two dates

I have a DB with two date columns (Y-m-d):
Date_from Date_to
17/01/01 17/01/05
17/02/03 NaN
17/05/01 17/05/05
...
Date_from and Date_to are pandas columns.
I built a function that:
- returns "corrence" if Date_to is NaN;
- computes the difference between the two columns if Date_to is not NaN.
Both results are saved in a third column, like this:
Date_from Date_to Difference
17/01/01 17/01/05 4
17/02/03 NaN corrence
17/05/01 17/05/05 4
...
The function is this:
from datetime import datetime
def diff(data,d1, d2):
    if pd.isnull(data.iloc[[1],[12]]):
        data['difference'] = 366
    else:
        data[d1] = pd.to_datetime(data[d1])
        data[d2] = pd.to_datetime(data[d2])
        data['difference'] = data[d2] - data[d1]
    return data
d1 = ["Date_from"]
d2 = ["Date_to"]
df = replace_NaN(df,d1,d2)
The error that get out is this:
TypeError: replace_NaN() takes 2 positional arguments but 3 were given
I don't understand why
You don't need a function to do this. Instead:
- Convert the columns to datetime using pd.to_datetime
- Subtract Date_from from Date_to
- Extract the days component of the timedelta column using dt.days
- Call fillna on the result
i = pd.to_datetime(df.Date_to, format='%y/%m/%d', errors='coerce')
j = pd.to_datetime(df.Date_from, format='%y/%m/%d', errors='coerce')
df['Difference'] = i.sub(j).dt.days.fillna('corrence')
df
Date_from Date_to Difference
0 17/01/01 17/01/05 4
1 17/02/03 NaN corrence
2 17/05/01 17/05/05 4
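If a reusable function is still wanted, the same steps can be wrapped as below. This is a sketch under assumptions: add_difference is a hypothetical name, the column names are passed as plain strings rather than one-element lists, and the dates are assumed to be in yy/mm/dd form as in the sample. The call in the question goes to replace_NaN rather than to the diff function defined just above it, which presumably explains the TypeError.

```python
import pandas as pd


def add_difference(data, d1, d2, label="corrence"):
    """Return a copy of data with a 'Difference' column.

    The column holds the number of days between d2 and d1,
    or label when either date is missing or cannot be parsed.
    """
    start = pd.to_datetime(data[d1], format="%y/%m/%d", errors="coerce")
    end = pd.to_datetime(data[d2], format="%y/%m/%d", errors="coerce")
    delta = end - start

    out = data.copy()
    # Whole days where both dates exist, the label otherwise.
    out["Difference"] = [d.days if pd.notna(d) else label for d in delta]
    return out


# Sample data from the question.
df = pd.DataFrame({
    "Date_from": ["17/01/01", "17/02/03", "17/05/01"],
    "Date_to": ["17/01/05", None, "17/05/05"],
})

df = add_difference(df, "Date_from", "Date_to")
print(df)
```

Building the column from a list keeps the day counts as integers next to the string label, at the cost of an object-dtype column.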
