I have a DB with two date columns (Y-m-d):
Date_from Date_to
17/01/01 17/01/05
17/02/03 NaN
17/05/01 17/05/05
...
Date_from and Date_to are pandas columns.
I built a function that:
- returns "corrence" if Date_to contains NaN;
- otherwise computes the difference between the two columns.
Either result is saved in a third column, like this:
Date_from Date_to Difference
17/01/01 17/01/05 4
17/02/03 NaN corrence
17/05/01 17/05/05 4
...
The function is this:
from datetime import datetime

def diff(data, d1, d2):
    if pd.isnull(data.iloc[[1], [12]]):
        data['difference'] = 366
    else:
        data[d1] = pd.to_datetime(data[d1])
        data[d2] = pd.to_datetime(data[d2])
        data['difference'] = data[d2] - data[d1]
    return data

d1 = ["Date_from"]
d2 = ["Date_to"]
df = replace_NaN(df, d1, d2)
The error I get is this:
TypeError: replace_NaN() takes 2 positional arguments but 3 were given
I don't understand why.
You don't need a function to do this. Instead:
- convert the columns to datetime using pd.to_datetime;
- subtract Date_from from Date_to;
- extract the days component of the timedelta column using dt.days;
- call fillna on the result.
i = pd.to_datetime(df.Date_to, format='%y/%m/%d', errors='coerce')
j = pd.to_datetime(df.Date_from, format='%y/%m/%d', errors='coerce')
df['Difference'] = i.sub(j).dt.days.fillna('corrence')
df
Date_from Date_to Difference
0 17/01/01 17/01/05 4
1 17/02/03 NaN corrence
2 17/05/01 17/05/05 4
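As a side note, this mixes integers with the string 'corrence', so the Difference column ends up with object dtype. If you'd rather keep it numeric, you could reuse the 366 fallback from your own function instead; a minimal sketch, using the same i and j as above:
# keep the column numeric: fill open-ended rows with 366 instead of a string
df['Difference'] = i.sub(j).dt.days.fillna(366).astype(int)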
I have a pandas dataframe in this format:
Dates
11-Feb-18
18-Feb-18
03-Mar-18
25-Mar-18
29-Mar-18
04-Apr-18
08-Apr-18
14-Apr-18
17-Apr-18
30-Apr-18
04-May-18
I want to find the dates between each pair of consecutive dates and put them in a new column. For example, between 11-Feb-18 and 18-Feb-18 I want all the dates that fall between those two.
I tried this code but it's throwing an error:
pd.DataFrame({'dates':pd.date_range(pd.to_datetime(df_new['Time.[Day]'].loc[i].diff(-1)))})
If you want to add a column with the list of dates that are missing in between, this should work. It could be more efficient, and it gets a bit longer than intended because it has to work around the NaT in the last row, but it gives you the result.
import pandas as pd
from datetime import timedelta
test_df = pd.DataFrame({
"Dates" :
["11-Feb-18", "18-Feb-18", "03-Mar-18", "25-Mar-18", "29-Mar-18", "04-Apr-18",
"08-Apr-18", "14-Apr-18", "17-Apr-18", "30-Apr-18", "04-May-18"]
})
res = (
test_df
.assign(
# convert to datetime
Dates = lambda x : pd.to_datetime(x.Dates),
# get next rows date
Dates_next = lambda x : x.Dates.shift(-1),
# create the date range
Dates_list = lambda x : x.apply(
lambda x :
pd.date_range(
x.Dates + timedelta(days=1),
x.Dates_next - timedelta(days=1),
freq="D").date.tolist()
if pd.notnull(x.Dates_next)
else None
, axis = 1
))
)
print(res)
results in:
Dates Dates_next Dates_list
0 2018-02-11 2018-02-18 [2018-02-12, 2018-02-13, 2018-02-14, 2018-02-1...
1 2018-02-18 2018-03-03 [2018-02-19, 2018-02-20, 2018-02-21, 2018-02-2...
2 2018-03-03 2018-03-25 [2018-03-04, 2018-03-05, 2018-03-06, 2018-03-0...
3 2018-03-25 2018-03-29 [2018-03-26, 2018-03-27, 2018-03-28]
4 2018-03-29 2018-04-04 [2018-03-30, 2018-03-31, 2018-04-01, 2018-04-0...
5 2018-04-04 2018-04-08 [2018-04-05, 2018-04-06, 2018-04-07]
6 2018-04-08 2018-04-14 [2018-04-09, 2018-04-10, 2018-04-11, 2018-04-1...
7 2018-04-14 2018-04-17 [2018-04-15, 2018-04-16]
8 2018-04-17 2018-04-30 [2018-04-18, 2018-04-19, 2018-04-20, 2018-04-2...
9 2018-04-30 2018-05-04 [2018-05-01, 2018-05-02, 2018-05-03]
10 2018-05-04 NaT None
As a side note, if you don't need the last row after the analysis, you could filter it out after assigning the next date and eliminate the if statement to make this faster, as sketched below.
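A minimal sketch of that variant, assuming the same test_df as above:
res2 = (
    test_df
    .assign(
        # convert to datetime and get the next row's date, as before
        Dates = lambda x : pd.to_datetime(x.Dates),
        Dates_next = lambda x : x.Dates.shift(-1))
    # drop the last row, whose Dates_next is NaT, so the if is no longer needed
    .iloc[:-1]
    .assign(
        Dates_list = lambda x : x.apply(
            lambda r :
            pd.date_range(
                r.Dates + timedelta(days=1),
                r.Dates_next - timedelta(days=1),
                freq="D").date.tolist(),
            axis = 1))
)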
This works with dataframes, adding a new column containing the requested list.
It iterates over the dates column, preparing a list of lists for the new column.
At the end it creates the new dataframe column and assigns the prepared values to it.
import pandas as pd
from pprint import pp
from datetime import datetime, timedelta
df = pd.read_csv("test.csv")
in_betweens = []
for i in range(len(df["dates"])-1):
d = datetime.strptime(df["dates"][i],"%d-%b-%y")
d2 = datetime.strptime(df["dates"][i+1],"%d-%b-%y")
d = d + timedelta(days=1)
in_between = []
while d < d2:
in_between.append(d.strftime("%d-%b-%y"))
d = d + timedelta(days=1)
in_betweens.append(in_between)
in_betweens.append([])
df["in_betwens"] = in_betweens
df.head()
I have a df that looks like this:
df:
id dob
1 7/31/2018
2 6/1992
I want to generate 88799 random dates to go into the dob column of the dataframe, between 1960-01-01 and 1990-12-31, keeping the format mm/dd/yyyy with no timestamp.
How would I do this?
I tried:
date1 = (1960,01,01)
date2 = (1990,12,31)
for i range(date1,date2):
df.dob = i
I would figure out how many days are in your date range, then select 88799 random integers in that range, and finally add that as a timedelta with unit='d' to your minimum date:
import numpy as np

min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')
d = (max_date - min_date).days + 1
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')
>>> df.head()
dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03
>>> df.tail()
dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
EDIT: You can format your dates using .strftime('%m/%d/%Y'), but note that this will slow down the execution significantly:
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')).strftime('%m/%d/%Y')
>>> df.head()
dob
0 02/26/1969
1 04/09/1963
2 08/29/1984
3 02/12/1961
4 08/02/1988
>>> df.tail()
dob
88794 02/13/1968
88795 02/05/1982
88796 07/03/1964
88797 06/11/1976
88798 11/17/1965
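On newer NumPy the random draw would typically go through the Generator API instead; the same idea as a sketch:
import numpy as np

rng = np.random.default_rng()
# draw 88799 random day offsets and add them to the minimum date
df['dob'] = (min_date + pd.to_timedelta(rng.integers(d, size=88799), unit='d')).strftime('%m/%d/%Y')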
I have a set of IDs and Timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupedBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
for id, groupdf in gb:
time = groupdf.timestamp
# returns timestamp rows for the current id
time_delta = time.max() - time.min()
# convert Timedelta object to minutes
time_delta = time_delta / pd.Timedelta(minutes=1)
# insert result to cycletime df
cycletime.loc[-1] = [id,time_delta]
cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
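The two groupby passes can also be collapsed into a single agg over one groupby; a minimal sketch of the same computation:
delta = (df.groupby('id')['timestamp']
           .agg(lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1)))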
You can sort by id, then group by id and find the difference between the min and max timestamp per group.
import numpy as np

df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd

ids = [1, 1, 2, 2, 2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id': ids, 'timestamp': pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.seconds/60)
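One caveat on this approach: dt.seconds returns only the seconds component of each timedelta, which is fine here (62 minutes) but wrong for gaps of a day or more; dt.total_seconds() avoids that, a sketch:
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.total_seconds()/60)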
I have this simple problem, but for some reason it's giving me a headache. I want to add an existing Date column to another column to get a NewDate column.
For example: I have Date and n columns, and I want to add a NewDate column to my existing df.
df:
Date n NewDate (New Calculation here: Date + n)
05/31/2017 3 08/31/2017
01/31/2017 4 05/31/2017
12/31/2016 2 02/28/2017
I tried:
df['NewDate'] = (pd.to_datetime(df['Date']) + MonthEnd(n))
but I get an error saying "cannot convert the series to class 'int'".
You're probably looking for an addition with a timedelta object.
v = pd.to_datetime(df.Date) + (pd.to_timedelta(df.n, unit='M'))
v
0 2017-08-30 07:27:18
1 2017-06-01 17:56:24
2 2017-03-01 20:58:12
dtype: datetime64[ns]
At the end, you can convert the result back into the same format as before -
df['NewDate'] = v.dt.strftime('%m/%d/%Y')
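Note that unit='M' in to_timedelta uses an average month length, which is why the times above aren't exactly midnight. If you need exact month-end alignment as in your expected output, a MonthEnd offset works, but it can't take a Series directly (that is what raised your error), so apply it row by row; a sketch:
from pandas.tseries.offsets import MonthEnd

# add n month-ends per row, then format back to mm/dd/yyyy
df['NewDate'] = [(d + MonthEnd(k)).strftime('%m/%d/%Y')
                 for d, k in zip(pd.to_datetime(df['Date']), df['n'])]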
I have a column named transaction_date which stores a date, e.g. 1970-01-01, and payment_plan_days which stores a number of days: 30, 70, any integer.
How should I add payment_plan_days to transaction_date to create a new column, membership_expire_date?
I tried the code below, but it doesn't work since they are not the same dtype.
df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'membership_expire_date'] = (
    df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'transaction_date']
    + df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'payment_plan_days']
)
I think you need to_timedelta:
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
Sample:
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
print (df)
payment_plan_days transaction_date new
0 30 1970-01-01 1970-01-31
1 70 2005-07-17 2005-09-25
2 100 2005-07-17 2005-10-25
Same idea as jezrael's answer, only using datetime's timedelta function:
from datetime import timedelta
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df.loc[:, 'expiration_date'] = list(map(lambda td, ppd: td+timedelta(ppd), df['transaction_date'], df['payment_plan_days']))