I have a df that looks like this:
df:
id dob
1 7/31/2018
2 6/1992
I want to generate 88799 random dates to go into the dob column of the dataframe, between 1960-01-01 and 1990-12-31, while keeping the format mm/dd/yyyy with no time stamp.
How would I do this?
I tried:
date1 = (1960,01,01)
date2 = (1990,12,31)
for i in range(date1, date2):
    df.dob = i
I would figure out how many days are in your date range, then select 88799 random integers in that range, and finally add that as a timedelta with unit='d' to your minimum date:
import numpy as np

min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')
# number of days in the range, inclusive
d = (max_date - min_date).days + 1
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')
>>> df.head()
dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03
>>> df.tail()
dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
EDIT: You can format your dates using .strftime('%m/%d/%Y'), but note that this will slow down the execution significantly:
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')).strftime('%m/%d/%Y')
>>> df.head()
dob
0 02/26/1969
1 04/09/1963
2 08/29/1984
3 02/12/1961
4 08/02/1988
>>> df.tail()
dob
88794 02/13/1968
88795 02/05/1982
88796 07/03/1964
88797 06/11/1976
88798 11/17/1965
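If you prefer the newer NumPy Generator API, the same approach looks like this; this is my own sketch rather than part of the original answer, and seeding makes the sampled dates reproducible:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the sampled dates are reproducible

min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')
d = (max_date - min_date).days + 1  # days in the range, inclusive

# draw random day offsets, shift the minimum date, then format as mm/dd/yyyy
dob = min_date + pd.to_timedelta(rng.integers(d, size=88799), unit='d')
df = pd.DataFrame({'dob': dob.strftime('%m/%d/%Y')})
print(df.head())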
I have a dataset with 800 rows and I want to create a new column with dates, where each row's date increases by one day.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
Every row's date ends up equal to '2014-01-12'; as I understand it, it fills as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
You could use a date range:
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
pandas is a nice tool which can repeat some calculations without using a for-loop.
When you use df['Date'] = ... you assign the same value to every cell in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x, 'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)
I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id','timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
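The same thing can be done in a single groupby pass; a minimal sketch, rebuilding the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'timestamp': pd.to_datetime([
        '2018-02-01 03:00:00', '2018-02-01 03:01:00',
        '2018-02-02 10:03:00', '2018-02-02 10:04:00', '2018-02-02 11:05:00']),
})

# one pass: span of timestamps per id, expressed in minutes
delta = df.groupby('id')['timestamp'].agg(
    lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))
print(delta)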
You can sort by id and timestamp, then group by id and find the difference between the min and max timestamp per group.
import numpy as np

df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.seconds / 60)
I have dates in Python (pandas) written as "1/31/2010". To apply linear regression, I want to have 3 separate variables: the day number, the month number, and the year.
How can I split a date column in pandas into 3 columns?
Another question: I'd like the same, but with the days grouped into 3 groups: 1-10, 11-20, 21-31.
df['date'] = pd.to_datetime(df['date'])
#Create 3 additional columns
df['day'] = df['date'].dt.day
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
Ideally, you can do this without having to create 3 additional columns; you can just pass the Series to your function.
In [2]: pd.to_datetime('01/31/2010').day
Out[2]: 31
In [3]: pd.to_datetime('01/31/2010').month
Out[3]: 1
In [4]: pd.to_datetime('01/31/2010').year
Out[4]: 2010
This answers only your first question.
One solution is to extract attributes of pd.Timestamp objects using operator.attrgetter.
The benefit of this method is you can easily expand / change the attributes you require. In addition, the logic is not specific to object type.
from operator import attrgetter
import pandas as pd
df = pd.DataFrame({'date': ['1/21/2010', '5/5/2015', '4/30/2018']})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
attr_list = ['day', 'month', 'year']
attrs = attrgetter(*attr_list)
df[attr_list] = df['date'].apply(attrs).apply(pd.Series)
print(df)
date day month year
0 2010-01-21 21 1 2010
1 2015-05-05 5 5 2015
2 2018-04-30 30 4 2018
from datetime import datetime
import pandas as pd
df = pd.DataFrame({'yyyymmdd': ['20150204', '20160305']})
for col, field in [("year", "%Y"), ("month", "%m"), ("day", "%d")]:
    df[col] = df["yyyymmdd"].apply(
        lambda cell: datetime.strptime(cell, "%Y%m%d").strftime(field))
print(df)
yyyymmdd year month day
0 20150204 2015 02 04
1 20160305 2016 03 05
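For the second part of the question (grouping days into 1-10, 11-20, 21-31), a minimal sketch using pd.cut on a day column built as above; the bin edges and labels are my assumption of the intended grouping:
import pandas as pd

df = pd.DataFrame({'date': ['1/31/2010', '1/5/2010', '1/15/2010']})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['day'] = df['date'].dt.day

# right-inclusive bins: (0, 10] -> '1-10', (10, 20] -> '11-20', (20, 31] -> '21-31'
df['day_group'] = pd.cut(df['day'], bins=[0, 10, 20, 31],
                         labels=['1-10', '11-20', '21-31'])
print(df)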
I have this simple problem but for some reason it's giving me a headache. I want to add an existing Date column to another column to get a NewDate column.
For example: I have Date and n columns, and I want to add a NewDate column to my existing df.
df:
Date        n   NewDate (new calculation: Date + n months)
05/31/2017  3   08/31/2017
01/31/2017  4   05/31/2017
12/31/2016  2   02/28/2017
I tried:
df['NewDate'] = (pd.to_datetime(df['Date']) + MonthEnd(n))
but I get an error saying "cannot convert the series to <class 'int'>".
You're probably looking for an addition with a timedelta object.
v = pd.to_datetime(df.Date) + (pd.to_timedelta(df.n, unit='M'))
v
0 2017-08-30 07:27:18
1 2017-06-01 17:56:24
2 2017-03-01 20:58:12
dtype: datetime64[ns]
At the end, you can convert the result back into the same format as before:
df['NewDate'] = v.dt.strftime('%m/%d/%Y')
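If exact month-end results are needed, as in the expected output (rather than the fixed-length month approximation of unit='M'), a sketch that builds a MonthEnd offset per row could look like this:
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame({'Date': ['05/31/2017', '01/31/2017', '12/31/2016'],
                   'n': [3, 4, 2]})
dates = pd.to_datetime(df['Date'])

# MonthEnd is built row by row because n varies per row
df['NewDate'] = pd.to_datetime([d + MonthEnd(int(m)) for d, m in zip(dates, df['n'])])
df['NewDate'] = df['NewDate'].dt.strftime('%m/%d/%Y')
print(df)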
I have a column named transaction_date which stores a date, 1970-01-01 for example, and payment_plan_days which stores a number of days, 30, 70, or any integer.
How should I add payment_plan_days to transaction_date to create a new column, membership_expire_date?
I tried the code below and it doesn't work since they are not the same dtype.
df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'membership_expire_date'] = (
    df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'transaction_date']
    + df_transactions.loc[(df_transactions['membership_expire_date'] == '19700101'), 'payment_plan_days'])
I think you need to_timedelta:
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
Sample:
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df['new'] = df['transaction_date'] + pd.to_timedelta(df['payment_plan_days'], unit='d')
print (df)
payment_plan_days transaction_date new
0 30 1970-01-01 1970-01-31
1 70 2005-07-17 2005-09-25
2 100 2005-07-17 2005-10-25
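To apply this only to the rows the question filters on, a sketch reusing the '19700101' mask and the column names from the question; it assumes transaction_date is already a datetime column and uses hypothetical sample data:
import pandas as pd

# hypothetical sample with the columns named in the question
df_transactions = pd.DataFrame({
    'transaction_date': pd.to_datetime(['1970-01-01', '2005-07-17']),
    'payment_plan_days': [30, 70],
    'membership_expire_date': ['19700101', '20050925'],
})

mask = df_transactions['membership_expire_date'] == '19700101'
df_transactions.loc[mask, 'membership_expire_date'] = (
    df_transactions.loc[mask, 'transaction_date']
    + pd.to_timedelta(df_transactions.loc[mask, 'payment_plan_days'], unit='d'))
print(df_transactions)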
Same idea as jezrael's answer, only using datetime's timedelta function:
from datetime import timedelta
dates=pd.to_datetime(['1970-01-01','2005-07-17', '2005-07-17'])
df = pd.DataFrame({'transaction_date':dates, 'payment_plan_days':[30,70,100]})
df.loc[:, 'expiration_date'] = list(map(lambda td, ppd: td+timedelta(ppd), df['transaction_date'], df['payment_plan_days']))