Pandas dataframe add integer columns into datetime columns

I have a column named transaction_date that stores dates (1970-01-01, for example), and a column payment_plan_days that stores a number of days (30, 70, or any other integer).
How can I add payment_plan_days to transaction_date to create a new column, membership_expire_date?
I tried the code below, but it doesn't work because the two columns do not have the same dtype.
df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'membership_expire_date'] = (
    df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'transaction_date']
    + df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'payment_plan_days']
)

I think you need to_timedelta, which converts the integer day counts into timedeltas that can be added to datetimes:
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
Sample:
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
print (df)
   payment_plan_days transaction_date        new
0                 30       1970-01-01 1970-01-31
1                 70       2005-07-17 2005-09-25
2                100       2005-07-17 2005-10-25
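
The question compares against '19700101', which suggests the dates may be stored as YYYYMMDD strings rather than datetimes. A minimal sketch of that case, assuming string input in that format (the sample values are hypothetical and reuse the numbers from the answer above): parse the column first, then add the day counts.

```python
import pandas as pd

# Hypothetical sample: dates stored as YYYYMMDD strings, as the question hints.
df_transactions = pd.DataFrame({
    'transaction_date': ['19700101', '20050717', '20050717'],
    'payment_plan_days': [30, 70, 100],
})

# Parse the strings into real datetimes; the explicit format avoids guessing.
df_transactions['transaction_date'] = pd.to_datetime(
    df_transactions['transaction_date'], format='%Y%m%d'
)

# Integer day counts become timedeltas, which can be added to datetimes.
df_transactions['membership_expire_date'] = (
    df_transactions['transaction_date']
    + pd.to_timedelta(df_transactions['payment_plan_days'], unit='D')
)
print(df_transactions)
```

If the column is already datetime64, the parsing step can simply be skipped.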

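The same idea as jezrael's answer, only using timedelta from the standard datetime module (this adds the days row by row in Python rather than in one vectorized step):
import pandas as pd
from datetime import timedelta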
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df.loc[:, 'expiration_date'] = list(map(lambda td, ppd: td+timedelta(ppd), df['transaction_date'], df['payment_plan_days']))

Related

Why do I get the same date in every row while trying to fill them in a dataset?

I have a dataset with 800 rows and I want to create a new column with dates, where each row increases by one day.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
In every row the date is equal to '2014-01-12'. As I understand it, the column is filled as if x were always equal to 799.

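Each time through the loop you are updating the ENTIRE Date column, so at the end you see the result of the 800th update.
You could use a date range instead. Note that pd.date_range reads '5/11/2011' month-first (11 May 2011), whereas the question's '%d/%m/%Y' format means 5 November 2011; pass an unambiguous string such as '2011-11-05' if that is the date you want: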
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
          Date
0   2011-05-11
1   2011-05-12
2   2011-05-13
3   2011-05-14
4   2011-05-15
..         ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr

pandas is a nice tool that can repeat a calculation over a whole column without a for loop.
When you use df['Date'] = ... with a single value, you assign that same value to every cell in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows):
import pandas as pd
import datetime
# start from an empty datetime column; integer placeholders would force a dtype change on assignment
df = pd.DataFrame({'Date': [pd.NaT] * 10})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x, 'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows):
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
# assigning a full-length sequence replaces the whole column at once
df['Date'] = pd.date_range(date, periods=10)
print(df)
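
When the DataFrame already exists, another way to get one date per row is to add a per-row day offset to the start date. This is only a sketch under assumptions: the 800-row frame and its value column are hypothetical stand-ins for the question's dataset, and the start date is read day-first, as in the question's strptime call.

```python
import numpy as np
import pandas as pd

# Hypothetical 800-row frame standing in for the question's dataset.
df = pd.DataFrame({'value': np.zeros(800)})

# '%d/%m/%Y' makes the day-first reading explicit: 5 November 2011.
start = pd.to_datetime('5/11/2011', format='%d/%m/%Y')

# Row i gets start + i days; the whole column is computed in one vectorized step.
df['Date'] = start + pd.to_timedelta(np.arange(len(df)), unit='D')

# Show the first and last dates of the column.
print(df['Date'].iloc[[0, -1]])
```

The last row lands on 2014-01-12, the same date the question's loop left in every row after its final iteration.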

How to generate random dates between date range inside pandas column?

I have a DataFrame that looks like this:
df:
id  dob
1   7/31/2018
2   6/1992
I want to generate 88799 random dates between 1960-01-01 and 1990-12-31 for the dob column, keeping the format mm/dd/yyyy with no timestamp.
How would I do this?
I tried:
date1 = (1960,01,01)
date2 = (1990,12,31)
for i range(date1,date2):
df.dob = i

I would figure out how many days are in your date range, then select 88799 random integers in that range, and finally add them as a timedelta with unit='d' to your minimum date:
import numpy as np
import pandas as pd
min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')
# number of days in the range, counting both endpoints
d = (max_date - min_date).days + 1
# randint draws integers from 0 to d - 1
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')
>>> df.head()
         dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03
>>> df.tail()
             dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
EDIT: You can format your dates using .strftime('%m/%d/%Y'), but note that this turns them into strings and slows down the execution significantly:
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')).strftime('%m/%d/%Y')
>>> df.head()
          dob
0  02/26/1969
1  04/09/1963
2  08/29/1984
3  02/12/1961
4  08/02/1988
>>> df.tail()
              dob
88794  02/13/1968
88795  02/05/1982
88796  07/03/1964
88797  06/11/1976
88798  11/17/1965
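
A reproducible variant of the same idea can be sketched with NumPy's Generator API. The seed value and the id column are assumptions made for illustration; neither comes from the answer above.

```python
import numpy as np
import pandas as pd

n_rows = 88799  # row count taken from the question
df = pd.DataFrame({'id': np.arange(1, n_rows + 1)})  # hypothetical id column

min_date = pd.Timestamp('1960-01-01')
max_date = pd.Timestamp('1990-12-31')
n_days = (max_date - min_date).days + 1  # both endpoints are valid draws

rng = np.random.default_rng(seed=0)  # the seed is an arbitrary assumption
offsets = rng.integers(0, n_days, size=len(df))  # upper bound is exclusive

dates = min_date + pd.to_timedelta(offsets, unit='D')
df['dob'] = dates.strftime('%m/%d/%Y')  # strings in mm/dd/yyyy, no time part
print(df.head())
```

Sizing the draw with len(df) keeps the number of dates in step with the number of rows, so the assignment cannot fail on a length mismatch.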

Subtracting values across grouped data frames in Pandas

I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id  timestamp
1   2018-02-01 03:00:00
1   2018-02-01 03:01:00
2   2018-02-02 10:03:00
2   2018-02-02 10:04:00
2   2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id  delta
1   1
2   62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result to cycletime df
        cycletime.loc[-1] = [id,time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing

First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1     1.0
2    62.0
Name: timestamp, dtype: float64

You can sort by id, then group by id and find the difference between the min and max timestamp per group.
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
   id  diff
0   1   1.0
1   2  62.0
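
A self-contained sketch of the same min/max approach, using the question's data plus a hypothetical third id (not in the original) whose timestamps span more than a day. It is meant to show that converting with total_seconds() keeps whole days in the minute count, whereas the .dt.seconds accessor would return only the seconds component.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3, 3],
    'timestamp': pd.to_datetime([
        '2018-02-01 03:00:00', '2018-02-01 03:01:00',
        '2018-02-02 10:03:00', '2018-02-02 10:04:00', '2018-02-02 11:05:00',
        # id 3 is a hypothetical addition: 25 hours between first and last row
        '2018-02-03 00:00:00', '2018-02-04 01:00:00',
    ]),
})

# One groupby pass computes both extremes per id.
span = df.groupby('id')['timestamp'].agg(['min', 'max'])

# total_seconds() covers the full duration, days included.
delta_min = (span['max'] - span['min']).dt.total_seconds() / 60
print(delta_min)
```

No sort is needed here, since min and max do not depend on row order.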

Another one:
import pandas as pd
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00','2018-02-01 03:01:00','2018-02-02 10:03:00',
         '2018-02-02 10:04:00','2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
# sum the row-to-row differences within each id; total_seconds() also counts whole days
deltas = df.groupby(level=0)['timestamp'].diff()
print(deltas.groupby(level=0).sum().dt.total_seconds() / 60)

pandas add a value column to datetime

I have this simple problem, but for some reason it's giving me a headache. I want to add an existing Date column to another column to get a NewDate column.
For example: I have Date and n columns, and I want to add a NewDate column to my existing df.
df:
Date        n  NewDate (new calculation here: Date + n)
05/31/2017  3  08/31/2017
01/31/2017  4  05/31/2017
12/31/2016  2  02/28/2017
I tried:
df['NewDate'] = (pd.to_datetime(df['Date']) + MonthEnd(n))
but I get an error saying "cannot convert the series to class 'int'".

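You're probably looking for an addition with a timedelta object. Note that unit='M' treats a month as a fixed average length (about 30.44 days), which is why the results below carry odd times and do not land on month ends; recent pandas versions reject unit='M' altogether.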
v = pd.to_datetime(df.Date) + (pd.to_timedelta(df.n, unit='M'))
v
0   2017-08-30 07:27:18
1   2017-06-01 17:56:24
2   2017-03-01 20:58:12
dtype: datetime64[ns]
At the end, you can convert the result back into the same format as before:
df['NewDate'] = v.dt.strftime('%m/%d/%Y')
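
To reproduce the month-end dates shown in the question's expected NewDate column, one option is to apply a MonthEnd offset row by row. This is a sketch of an alternative, not the method from the answer above, and it assumes every input date already falls on a month end, as in the sample; for a mid-month date, MonthEnd(n) first rolls forward to the end of the current month and counts that as one step.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({
    'Date': ['05/31/2017', '01/31/2017', '12/31/2016'],
    'n': [3, 4, 2],
})

dates = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# MonthEnd takes a single integer, so build one offset per row.
shifted = pd.Series(
    [d + MonthEnd(int(n)) for d, n in zip(dates, df['n'])],
    index=df.index,
)

# Convert back to the mm/dd/yyyy strings used in the question.
df['NewDate'] = shifted.dt.strftime('%m/%d/%Y')
print(df)
```

Passing the whole n column to MonthEnd at once is what triggers the "cannot convert the series" error in the question, because the offset expects a scalar.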

Calculate difference between two dates

I have a dataset with two date columns (y/m/d):
Date_from  Date_to
17/01/01   17/01/05
17/02/03   NaN
17/05/01   17/05/05
...
Date_from and Date_to are pandas columns.
I built a function that:
- returns "corrence" if Date_to is NaN;
- computes the difference between the two columns if Date_to is not NaN.
Both results are saved in a third column, like this:
Date_from  Date_to   Difference
17/01/01   17/01/05  4
17/02/03   NaN       corrence
17/05/01   17/05/05  4
...
The function is this:
from datetime import datetime
def diff(data,d1, d2):
    if pd.isnull(data.iloc[[1],[12]]):
        data['difference'] = 366
    else:
        data[d1] = pd.to_datetime(data[d1])
        data[d2] = pd.to_datetime(data[d2])
        data['difference'] = data[d2] - data[d1]
    return data
d1 = ["Date_from"]
d2 = ["Date_to"]
df = replace_NaN(df,d1,d2)
The error I get is this:
TypeError: replace_NaN() takes 2 positional arguments but 3 were given
I don't understand why.

You don't need a function to do this. Instead:
- Convert the columns to datetime using pd.to_datetime
- Subtract Date_from from Date_to
- Extract the days component of the timedelta column using dt.days
- Call fillna on the result
i = pd.to_datetime(df.Date_to, format='%y/%m/%d', errors='coerce')
j = pd.to_datetime(df.Date_from, format='%y/%m/%d', errors='coerce')
df['Difference'] = i.sub(j).dt.days.fillna('corrence')
df
  Date_from   Date_to Difference
0  17/01/01  17/01/05          4
1  17/02/03       NaN   corrence
2  17/05/01  17/05/05          4
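
The steps above can be sketched as a self-contained example on the question's sample rows. One deliberate substitution, which is my assumption rather than part of the answer: the final column is built with a list comprehension instead of fillna, so that the day counts stay plain integers next to the 'corrence' marker regardless of how the pandas version handles mixed types.

```python
import pandas as pd

df = pd.DataFrame({
    'Date_from': ['17/01/01', '17/02/03', '17/05/01'],
    'Date_to': ['17/01/05', None, '17/05/05'],
})

# errors='coerce' turns missing or unparseable values into NaT.
start = pd.to_datetime(df['Date_from'], format='%y/%m/%d', errors='coerce')
end = pd.to_datetime(df['Date_to'], format='%y/%m/%d', errors='coerce')

# Day counts; rows with a missing Date_to come out as NaN.
days = (end - start).dt.days

# Keep integers where a difference exists, and the marker string elsewhere.
df['Difference'] = [int(d) if pd.notna(d) else 'corrence' for d in days]
print(df)
```

The resulting column has object dtype because it mixes integers and a string, so it is suited to display rather than further arithmetic.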
