I have solutions for this question, 2 solutions in fact, but I'm not happy with them. The reason is that the files I'm trying to read have about 12 millions rows, and using these solutions, it takes a huge amount of time to process them. Mainly, the reason is that the solutions are row-by-row operations.
So, I read the file like this:
In [1]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df.head()
Out [1]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 9252013 211 12 12 NaN
1 103N04152 9262013 0 7 7 NaN
2 103N04152 9032013 177 8 8 NaN
3 103N04152 9042013 176 8 9 7
My problem is with the DATE and EPOCH columns. I want to merge them into a single datetime column.
DATE is in '%m%d%Y' format (with the leading zero missing)
EPOCH is 5 minute epoch of a day:
Time EPOCH
00:00:00 => 0
00:05:00 => 1
...
...
12:00:00 => 144
12:05:00 => 145
...
...
23:50:00 => 286
23:55:00 => 287
What I want is something like this:
In [2]: df.head()
Out [2]: TMC DATE_TIME DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 103N04152 2013-09-25 17:35:00 9252013 211 12 12 NaN
1 103N04152 2013-09-26 00:00:00 9262013 0 7 7 NaN
2 103N04152 2013-09-03 14:45:00 9032013 177 8 8 NaN
3 103N04152 2013-09-04 14:30:00 9042013 176 8 9 7
Now, I can do this row-by-row as I mentioned earlier by doing either of these three things:
In [3]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV',
converters={'DATE': lambda x: datetime.datetime.strptime(x, '%m%d%Y'),
'EPOCH': lambda x: str(datetime.timedelta(minutes = int(x)*5))},
parse_dates = {'date_time': ['DATE', 'EPOCH']},
keep_date_col = True)
df.head()
Out [3]: date_time TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS
0 2013-09-25 17:35:00 103N04152 2013-09-25 17:35:00 12 12 NaN
1 2013-09-26 00:00:00 103N04152 2013-09-26 00:00:00 7 7 NaN
2 2013-09-03 14:45:00 103N04152 2013-09-03 14:45:00 8 8 NaN
3 2013-09-04 14:40:00 103N04152 2013-09-04 14:40:00 8 9 7
4 2013-09-05 09:35:00 103N04152 2013-09-05 09:35:00 10 10 NaN
In this method I lose the original formatting of DATE and EPOCH, but it doesn't really affect further computations on the dataframe. Instead of using converters as an argument, I could have used date_parser. Or, after reading the data, similar to line 1, I could have done something like this:
In [4]: df = pd.read_csv('C:/Projects/NPMRDS/FHWA_TASK2-4_NJ_09_2013_TT.CSV')
df['date_time'] = pd.to_datetime([datetime.datetime.strptime(str(df['DATE'][x]), '%m%d%Y') + datetime.timedelta(minutes = int(df['EPOCH'][x]*5)) for x in range(len(df))])
df.head()
Out [4]: TMC DATE EPOCH Travel_TIME_ALL_VEHICLES Travel_TIME_PASSENGER_VEHICLES Travel_TIME_FREIGHT_TRUCKS DATE_TIME
0 103N04152 9252013 211 12 12 NaN 2013-09-25 17:35:00
1 103N04152 9262013 0 7 7 NaN 2013-09-26 00:00:00
2 103N04152 9032013 177 8 8 NaN 2013-09-03 14:45:00
3 103N04152 9042013 176 8 9 7 2013-09-04 14:40:00
4 103N04152 9052013 115 10 10 NaN 2013-09-05 09:35:00
A more desirable result (don't worry about the column orders), but still row-by-row, and takes a huge amount of time.
Then there are pandas.to_datetime and pandas.to_timedelta, which run much faster than the methods described above. But I cannot merge the results together without resorting to string functions, which are again mainly row-by-row.
Does anyone know a better way to do this?
Edit: Solution!!!
In addition to chrisb's answer, I found a way to do it as well. The trick lies in setting the box parameter to False in pandas.to_datetime(). Like so:
df['DATE_TIME'] = pd.to_datetime(df['DATE'], format='%m%d%Y', box=False) + pd.to_timedelta(df['EPOCH']*5*60, unit='s')
Setting that to False returns a numpy.datetime[64] array, instead of pandas.DatetimeIndex. More information can be found in the pandas.to_datetime() documentation. And, pandas.to_timedelta() does not work with unit='m'.
Try this out - reduced runtime for me to about 1s (compared to 15s) on 4M rows of test data.
df = pd.read_csv('temp.csv')
df['DATE'] = pd.to_datetime(df['DATE'], format='%m%d%Y')
df['EPOCH'] = pd.to_timedelta((df['EPOCH'].astype(int) * 5).astype('timedelta64[m]'))
df['DATE_TIME'] = df['DATE'] + df['EPOCH']
Related
I am currently trying to find a way to merge specific rows of df2 to df1 based on their datetime indices in a way that avoids lookahead bias so that I can add external features (df2) to my main dataset (df1) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought process is to do this by using nested loops and if statements, but this method would be too slow as the dataframes I am trying to do this on both have over 30000 rows each. Is there a faster way of doing this?
df1
index a b
2015-06-02 16:00:00 0 5
2015-06-05 16:00:00 1 6
2015-06-06 16:00:00 2 7
2015-06-11 16:00:00 3 8
2015-06-12 16:00:00 4 9
df2
index c d
2015-06-02 9:03:00 10 16
2015-06-02 15:12:00 11 17
2015-06-02 16:07:00 12 18
... ... ...
2015-06-12 15:29:00 13 19
2015-06-12 16:02:00 14 20
2015-06-12 17:33:00 15 21
df_combined
(because you can't see the rows at 06-05, 06-06, 06-11, I just have NaN as the row values to make it easier to interpret)
index a b c d
2015-06-02 16:00:00 0 5 11 17
2015-06-05 16:00:00 1 NaN NaN NaN
2015-06-06 16:00:00 2 NaN NaN NaN
2015-06-11 16:00:00 3 NaN NaN NaN
2015-06-12 16:00:00 4 9 13 19
df_combined.loc[0, ['c', 'd']] and df_combined.loc[4, ['c', 'd']] are 11,17 and 13,19 respectively instead of 12,18 and 14,20 to avoid lookahead bias because in a live scenario, those values haven't been observed yet.
IIUC, you need merge_asof. assuming your index are ordered in time, it is with the direction backward.
print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
# a b c d
# 2015-06-02 16:00:00 0 5 11 17
# 2015-06-05 16:00:00 1 6 12 18
# 2015-06-06 16:00:00 2 7 12 18
# 2015-06-11 16:00:00 3 8 12 18
# 2015-06-12 16:00:00 4 9 13 19
Note that the dates 06-05, 06-06, 06-11 are not NaN but it is the last values in df2 (for 2015-06-02 16:07:00) being available before these dates in your given data.
Note: if what your dates are actually a column named index and not your index, then do:
print(pd.merge_asof(df1, df2, on='index', direction='backward'))
My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
indexing = df[['Timestamp','Data']]
indexing['Timestamp']=pd.to_datetime(indexing['Timestamp'])
indexing =indexing.set_index('Timestamp')
indexing1= indexing.resample('1S',fill_method='ffill')
# indexing1 = indexing1.resample('D')
return indexing1
indexing = resample(df3)
but incurred error
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error mean. #jezrael from this similar question suggested using drop_duplicates with groupby. I am not sure what this does to the data as it seems there are no duplicates in my data? Can someone explain this please? Thanks.
This error is caused because of the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample both these timestamps to the nearest second they both become
2018-01-01 00:00:06 and pandas doesn't know which value for the data to pick
because it has two to select from. Instead what you can do is use an aggregation function
such as last (though mean, max, min may also be suitable) in order to
select one of the values. Then you can apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO(""" Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50"""), sep='\s\s+')
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 (100+105+180/3); Time Elapsed=5.0 would have the value 150.0 (150+110+190/3); Time Elapsed=10.0 would have the value 133.3 (125+90+185/3) and so on for Time Elapsed=15,20,25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way , using transform with groupby get the group key 'Time Elapsed', then just groupby it get the mean
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post)
the counter is computed within (1) each year, (2) then each month and then (3) each day
since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year cumulative_record
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return difference as int if there are no missing values(NaT) and float if there is.
Pandas have a rich documentation on Time series / date functionality and Time deltas
You can divide column of dtype timedelta by np.timedelta64(1, 'D'), but output is not int, but float, because NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
Frequency conversion.
You can use datetime module to help here. Also, as a side note, a simple date subtraction should work as below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change type to datetime.timedelta, and then use the .days method on valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Diffference
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the overall answer does not handle if the dates 'wrap' around a year. This would be useful in understanding proximity to a date being accurate by day of year. In order to do these row operations, I did the following. (I had this used in a business setting in renewing customer subscriptions).
def get_date_difference(row, x, y):
try:
# Calcuating the smallest date difference between the start and the close date
# There's some tricky logic in here to calculate for determining date difference
# the other way around (Dec -> Jan is 1 month rather than 11)
sub_start_date = int(row[x].strftime('%j')) # day of year (1-366)
close_date = int(row[y].strftime('%j')) # day of year (1-366)
later_date_of_year = max(sub_start_date, close_date)
earlier_date_of_year = min(sub_start_date, close_date)
days_diff = later_date_of_year - earlier_date_of_year
# Calculates the difference going across the next year (December -> Jan)
days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
return min(days_diff, days_diff_reversed)
except ValueError:
return None
Then the function could be:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
Create a vectorized method
def calc_xb_minus_xa(df):
time_dict = {
'<Minute>': 'm',
'<Hour>': 'h',
'<Day>': 'D',
'<Week>': 'W',
'<Month>': 'M',
'<Year>': 'Y'
}
time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
offset_base_name = str(to_offset(time_delta).base)
time_term = time_dict.get(offset_base_name)
result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, month and Year.
open_time and end_time need to change according your df
I have a df pandas with in column just a 'Price' and in index dates. I want to find a new column called 'Aprox' with inside
aprox. = price of today - price of one year ago (or closest date from a year ago) -
price in one year (again take aprox if exact one year price don't exist)
for example
aprox. 2019-04-30 = 8 -4 -10 = -6 = aprox. 2019-04-30
- aprox. 2018-01-31 - aprox.2020-07-30
To be honest I am a bit strugling with that...
ex. [in]: Price
2018-01-31 4
2019-04-30 8
2020-07-30 10
2020-10-31 9
2021-01-31 14
2021-04-30 150
2021-07-30 20
2022-10-31 14
[out]: Price aprox.
2018-01-31 4
2019-04-30 8 -6 ((8-4-10) = -6) since there is no 2018-04-30
2020-07-30 10 -12 (10-14-8)
2020-10-31 9 ...
2021-01-31 14 ...
2021-04-30 150
2021-07-30 20
2022-10-31 14
I am strugling very much with that... even more with the approx.
Thank you very much!!
It's not quite clear to me what you are trying to do, but maybe this is what you want:
import pandas
def last_year(x):
"""
Return date from a year ago.
"""
return x - pandas.DateOffset(years=1)
# Simulate the data you provided in example
dt_str = ['2018-01-31', '2019-04-30', '2020-07-30', '2020-10-31',
'2021-01-31', '2021-04-30', '2021-07-30', '2022-10-31']
dates = [pandas.Timestamp(x) for x in dt_str]
df = pandas.DataFrame([4, 8, 10, 9, 14, 150, 20, 14], columns=['Price'], index=dates)
# This is the code that does the work
for dt, value in df['Price'].iteritems():
df.loc[dt, 'approx'] = value - df['Price'].asof(last_year(dt))
This gave me the following results:
In [147]: df
Out[147]:
Price approx
2018-01-31 4 NaN
2019-04-30 8 4.0
2020-07-30 10 2.0
2020-10-31 9 1.0
2021-01-31 14 6.0
2021-04-30 150 142.0
2021-07-30 20 10.0
2022-10-31 14 -6.0
The bottom line is that for this type of operation you can't just use the apply operation since you need both the index and the value.