I am trying to generate random data (integers) with dates so that I can practice pandas data-analysis commands on it and plot time-series graphs.
temp depth acceleration
2019-01-1 -0.218062 -1.215978 -1.674843
2019-02-1 -0.465085 -0.188715 0.241956
2019-03-1 -1.464794 -1.354594 0.635196
2019-04-1 0.103813 0.194349 -0.450041
2019-05-1 0.437921 0.073829 1.346550
Is there any random dataframe generator that can generate something like this with each date having a gap of one month?
You can either use pandas.util.testing (note that this module is deprecated in recent pandas versions):
import pandas.util.testing as testing
import numpy as np
np.random.seed(1)
testing.N, testing.K = 5, 3 # Setting the rows and columns of the desired data
print(testing.makeTimeDataFrame(freq='MS'))
>>>
A B C
2000-01-01 -0.488392 0.429949 -0.723245
2000-02-01 1.247192 -0.513568 -0.512677
2000-03-01 0.293828 0.284909 1.190453
2000-04-01 -0.326079 -1.274735 -0.008266
2000-05-01 -0.001980 0.745803 1.519243
Or, if you need more control over the random values being generated, you can use something like
import numpy as np
import pandas as pd
np.random.seed(1)
rows,cols = 5,3
data = np.random.rand(rows,cols) # You can use other random functions to generate values with constraints
tidx = pd.date_range('2019-01-01', periods=rows, freq='MS') # freq='MS' sets a month-start frequency; you can use 'T' for minutes, and so on
data_frame = pd.DataFrame(data, columns=['a','b','c'], index=tidx)
print(data_frame)
>>>
a b c
2019-01-01 0.992856 0.217750 0.538663
2019-02-01 0.189226 0.847022 0.156730
2019-03-01 0.572417 0.722094 0.868219
2019-04-01 0.023791 0.653147 0.857148
2019-05-01 0.729236 0.076817 0.743955
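The question's sample values are centered on zero, which suggests draws from a normal distribution; if that matters, np.random.randn is a drop-in alternative (a minimal sketch reusing the question's column names):
import numpy as np
import pandas as pd
np.random.seed(1)
rows, cols = 5, 3
tidx = pd.date_range('2019-01-01', periods=rows, freq='MS')
# randn samples the standard normal, so values can be negative
df = pd.DataFrame(np.random.randn(rows, cols),
                  columns=['temp', 'depth', 'acceleration'],
                  index=tidx)
print(df)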
Use the numpy.random.rand or numpy.random.randint functions with the DataFrame constructor:
np.random.seed(2019)
N = 10
rng = pd.date_range('2019-01-01', freq='MS', periods=N)
df = pd.DataFrame(np.random.rand(N, 3), columns=['temp','depth','acceleration'], index=rng)
print(df)
temp depth acceleration
2019-01-01 0.903482 0.393081 0.623970
2019-02-01 0.637877 0.880499 0.299172
2019-03-01 0.702198 0.903206 0.881382
2019-04-01 0.405750 0.452447 0.267070
2019-05-01 0.162865 0.889215 0.148476
2019-06-01 0.984723 0.032361 0.515351
2019-07-01 0.201129 0.886011 0.513620
2019-08-01 0.578302 0.299283 0.837197
2019-09-01 0.526650 0.104844 0.278129
2019-10-01 0.046595 0.509076 0.472426
If you need integers:
np.random.seed(2019)
N = 10
rng = pd.date_range('2019-01-01', freq='MS', periods=N)
df = pd.DataFrame(np.random.randint(20, size=(N, 3)),
                  columns=['temp','depth','acceleration'],
                  index=rng)
print(df)
temp depth acceleration
2019-01-01 8 18 5
2019-02-01 15 12 10
2019-03-01 16 16 7
2019-04-01 5 19 12
2019-05-01 16 18 5
2019-06-01 16 15 1
2019-07-01 14 12 10
2019-08-01 0 11 18
2019-09-01 15 19 1
2019-10-01 3 16 18
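On newer NumPy versions, the Generator API is the recommended way to draw random numbers; a minimal equivalent sketch:
import numpy as np
import pandas as pd
gen = np.random.default_rng(2019)  # NumPy >= 1.17 Generator API
rng = pd.date_range('2019-01-01', freq='MS', periods=10)
df = pd.DataFrame(gen.integers(0, 20, size=(10, 3)),
                  columns=['temp', 'depth', 'acceleration'],
                  index=rng)
print(df)
Since the goal is plotting practice, df.plot() (with matplotlib installed) will then draw each column as a line over the monthly date index.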
Related
I have a dataframe in long format with speed data sampled at varying time intervals and frequencies for two observation locations (A and B). If I apply the resample method to get the average daily value, I get the averages over all rows for a given time interval, losing the distinction between the two locations.
Does anyone know how to resample the dataframe while keeping the 2 locations, producing daily average speed data for each?
import pandas as pd
import numpy as np
dti = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = dti)
# Average speed in miles per hour
df['Location'] = 'A'
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.5 hours)
dti2 = pd.date_range('2015-01-01', '2016-06-05', freq='30min')
df2 = pd.DataFrame(index = dti2)
df2['Location'] = 'B'
df2['speed'] = np.random.randint(low=0, high=60, size=len(df2.index))
df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
df2 = df.resample('d', on='index').mean()
Use groupby and resample:
>>> df.groupby("Location").resample("D").mean().reset_index(0)
Location speed
2015-01-01 A 29.114583
2015-01-02 A 27.083333
2015-01-03 A 31.135417
2015-01-04 A 30.354167
2015-01-05 A 29.427083
... ...
2016-06-01 B 33.770833
2016-06-02 B 28.979167
2016-06-03 B 29.812500
2016-06-04 B 31.270833
2016-06-05 B 42.000000
If you instead want separate columns for A and B, you can use unstack:
>>> df.groupby("Location").resample("D").mean().unstack(0)
speed
Location A B
2015-01-01 29.114583 29.520833
2015-01-02 27.083333 27.291667
2015-01-03 31.135417 30.375000
2015-01-04 30.354167 31.645833
2015-01-05 29.427083 26.645833
... ...
2016-06-01 NaN 33.770833
2016-06-02 NaN 28.979167
2016-06-03 NaN 29.812500
2016-06-04 NaN 31.270833
2016-06-05 NaN 42.000000
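If the leftover column level from the MultiIndex is unwanted, selecting the speed column before aggregating avoids it (a small optional variant):
>>> df.groupby("Location").resample("D")["speed"].mean().unstack(0)
This returns a DataFrame whose columns are simply A and B.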
I have a dataframe with a series of timestamps for each user, and I want to calculate the duration between consecutive timestamps.
I used this code to import my CSV files:
import pandas as pd
import glob
path = r'C:\Users\...\Desktop'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, encoding='ISO-8859-1')
    li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
df.head()
ID timestamp
1828765 31-05-2021 22:27:03
1828765 31-05-2021 22:27:12
1828765 31-05-2021 22:27:13
1828765 31-05-2021 22:27:34
2056557 21-07-2021 10:27:12
2056557 21-07-2021 10:27:20
2056557 21-07-2021 10:27:22
And I want to get something like this:
ID timestamp duration(s)
1828765 31-05-2021 22:27:03 NAN
1828765 31-05-2021 22:27:12 9
1828765 31-05-2021 22:27:13 1
1828765 31-05-2021 22:27:34 21
2056557 21-07-2021 10:27:12 NAN
2056557 21-07-2021 10:27:20 8
2056557 21-07-2021 10:27:22 2
I've used this code, but it doesn't work for me:
import datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], format = "%d-%m-%Y %H:%M:%S")
df['time_diff'] = 0
for i in range(df.shape[0] - 1):
    df['time_diff'][i+1] = (datetime.datetime.min + (df['timestamp'][i+1] - df['timestamp'][i])).time()
Operations which occur over groups of values are GroupBy operations in pandas.
pandas supports mathematical operations over timestamps natively. For this reason, subtraction will give the correct duration between any two timestamps.
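For example, a minimal illustration of timestamp arithmetic:
import pandas as pd
a = pd.Timestamp('2021-05-31 22:27:03')
b = pd.Timestamp('2021-05-31 22:27:12')
print(b - a)  # 0 days 00:00:09 -- a Timedelta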
We've already successfully converted our timestamp column to datetime64[ns]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
Now we can take the difference between rows within groups with GroupBy.diff:
df['duration'] = df.groupby('ID')['timestamp'].diff()
df
ID timestamp duration
0 1828765 2021-05-31 22:27:03 NaT
1 1828765 2021-05-31 22:27:12 0 days 00:00:09
2 1828765 2021-05-31 22:27:13 0 days 00:00:01
3 1828765 2021-05-31 22:27:34 0 days 00:00:21
4 2056557 2021-07-21 10:27:12 NaT
5 2056557 2021-07-21 10:27:20 0 days 00:00:08
6 2056557 2021-07-21 10:27:22 0 days 00:00:02
If we want the duration in seconds, we can extract the total number of seconds using Series.dt.total_seconds:
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
df:
ID timestamp duration (s)
0 1828765 2021-05-31 22:27:03 NaN
1 1828765 2021-05-31 22:27:12 9.0
2 1828765 2021-05-31 22:27:13 1.0
3 1828765 2021-05-31 22:27:34 21.0
4 2056557 2021-07-21 10:27:12 NaN
5 2056557 2021-07-21 10:27:20 8.0
6 2056557 2021-07-21 10:27:22 2.0
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
'31-05-2021 22:27:13', '31-05-2021 22:27:34',
'21-07-2021 10:27:12', '21-07-2021 10:27:20',
'21-07-2021 10:27:22']
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
print(df)
I have a dataframe as shown below:
df = pd.DataFrame({'person_id': [11,11,11,21,21],
'offset' :['-131 days','29 days','142 days','20 days','-200 days'],
'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
'vis_date':['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit: SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
# trying to make the lines below more elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join; passing axis=0 aligns the offset Series with the DataFrame's index, so it is added row-wise:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
I need to resample some data in Pandas and I am using the code below:
On my data it takes 5 hours.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
This is prohibitively slow.
How can I speed up the above code, on data like:
id date value
1 16-12-1 9
1 16-12-1 8
1 17-1-1 18
2 17-3-4 19
2 17-3-4 20
1 17-4-3 21
2 17-7-13 12
3 17-8-9 12
2 17-9-12 11
1 17-11-12 19
3 17-11-12 21
giving output:
id date
1 2016-12-04 17
2017-01-01 18
2017-04-09 21
2017-11-12 19
2 2017-03-05 39
2017-07-16 12
2017-09-17 11
3 2017-08-13 12
2017-11-12 21
Name: value, dtype: int64
I set up date as an index but the code is so slow. Any help would be great.
Give this a try.
I am going to use pd.Grouper() and set the frequency to daily, hoping that it is faster. Also, I am getting rid of agg and using .sum() directly.
df['date'] = pd.to_datetime(df['date'], format='%y-%m-%d')
df = df.set_index('date')
df2 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
Results:
id date
1 2016-12-01 17
2017-01-01 18
2017-04-03 21
2017-11-12 19
2 2017-03-04 39
2017-07-13 12
2017-09-12 11
3 2017-08-09 12
2017-11-12 21
Hope this works.
[EDIT]
So I just did a small test between both methods over a randomly generated df with 100000 rows:
df = pd.DataFrame(np.random.randint(0, 30,size=100000),
columns=["id"],
index=pd.date_range("19300101", periods=100000))
df['value'] = np.random.randint(0, 10,size=100000)
and timed both approaches. The results are:
For using resample:
import time
startTime = time.time()
df2 = df.groupby('id').resample('D')['value'].agg('sum').loc[lambda x: x>0]
print(time.time()-startTime)
1.0451831817626953 seconds
For using pd.Grouper():
startTime = time.time()
df3 = df.groupby(['id',pd.Grouper(freq='D')])['value'].sum()
print(time.time()-startTime)
0.08430838584899902 seconds
so approximately 12 times faster! (if my math is correct)
I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset:
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
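To store the result as the new column the question asks for, plain assignment works (a short sketch reusing the question's setup):
import pandas as pd
df = pd.DataFrame([pd.Timestamp('20161011'),
                   pd.Timestamp('20161101')], columns=['date'])
plus_month_period = 3
# DateOffset adds calendar months, preserving the day of month where possible
df['future_date'] = df['date'] + pd.DateOffset(months=plus_month_period)
print(df)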
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, using Zero's code or mattblack's code won't be useful. You have to use a lambda function over the rows, where the function takes 2 arguments:
A date to which months need to be added
A month value in integer format
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this you can use the following code snippet to add months to the Start_Date column. Use the progress_apply functionality of pandas. Refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code, from dataset creation onward, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe that the simplest and most efficient (fastest) way to solve this is to transform the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert back to datetime with .dt.to_timestamp().
Using the sample data created by @Aruparna Maity, but with some dates shifted off the first of the month:
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
If the exact day is needed, just repeat the process, but changing the periods to days:
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day - 1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
Another way is to use numpy's timedelta64. Note that numpy's 'M' unit is an average-length month (about 30.44 days) rather than a calendar month, which is why the result below picks up a time-of-day component (and newer pandas versions may reject month-based timedeltas outright):
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]