Convert Varied Text Fields of Duration to Seconds in Pandas - python

I have a dataframe that contains the duration of a trip as text values as shown below in the column driving_duration_text.
print df
yelp_id driving_duration_text \
0 alexander-rubin-photography-napa 1 hour 43 mins
1 jumas-automotive-napa-2 1 hour 32 mins
2 larson-brothers-painting-napa 1 hour 30 mins
3 preferred-limousine-napa 1 hour 32 mins
4 cardon-y-el-tirano-miami 1 day 16 hours
5 sweet-dogs-miami 1 day 3 hours
As you can see some are written in hours and others in days. How could I convert this format to seconds?

UPDATE:
In [150]: df['seconds'] = (pd.to_timedelta(df['driving_duration_text']
.....: .str.replace(' ', '')
.....: .str.replace('mins', 'min'))
.....: .dt.total_seconds())
In [151]: df
Out[151]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180.0
1 jumas-automotive-napa-2 1 hour 32 mins 5520.0
2 larson-brothers-painting-napa 1 hour 30 mins 5400.0
3 preferred-limousine-napa 1 hour 32 mins 5520.0
4 cardon-y-el-tirano-miami 1 day 16 hours 144000.0
5 sweet-dogs-miami 1 day 3 hours 97200.0
OLD answer:
you can do it this way:
from collections import defaultdict
import re

def humantime2seconds(s):
    d = {
        'w': 7*24*60*60,
        'week': 7*24*60*60,
        'weeks': 7*24*60*60,
        'd': 24*60*60,
        'day': 24*60*60,
        'days': 24*60*60,
        'h': 60*60,
        'hr': 60*60,
        'hour': 60*60,
        'hours': 60*60,
        'm': 60,
        'min': 60,
        'mins': 60,
        'minute': 60,
        'minutes': 60
    }
    mult_items = defaultdict(lambda: 1).copy()
    mult_items.update(d)
    parts = re.search(r'^(\d+)([^\d]*)', s.lower().replace(' ', ''))
    if parts:
        return int(parts.group(1)) * mult_items[parts.group(2)] + humantime2seconds(re.sub(r'^(\d+)([^\d]*)', '', s.lower()))
    else:
        return 0

df['seconds'] = df.driving_duration_text.map(humantime2seconds)
Output:
In [64]: df
Out[64]:
yelp_id driving_duration_text seconds
0 alexander-rubin-photography-napa 1 hour 43 mins 6180
1 jumas-automotive-napa-2 1 hour 32 mins 5520
2 larson-brothers-painting-napa 1 hour 30 mins 5400
3 preferred-limousine-napa 1 hour 32 mins 5520
4 cardon-y-el-tirano-miami 1 day 16 hours 144000
5 sweet-dogs-miami 1 day 3 hours 97200

Given that the text does seem to follow a standardized format, this is relatively straightforward. We need to split the string into alternating counts and unit words, pair them up, and convert each pair to seconds.
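def parse_duration(duration):
    items = duration.split()
    words = items[1::2]   # unit words: 'hour', 'mins', ...
    counts = items[::2]   # counts as strings: '1', '43', ...
    seconds = 0
    for i, each in enumerate(words):
        # The counts are still strings here, so convert before multiplying
        seconds += get_seconds(each, int(counts[i]))
    return seconds

def get_seconds(word, count):
    counts = {
        'second': 1,
        'min': 60,
        'minute': 60,
        'hour': 3600,
        'day': 86400
        # and so on
    }
    # Bit complicated here to handle plurals: try the word without its
    # last letter first ('hours' -> 'hour'), then the word as given
    base = counts.get(word[:-1], counts.get(word, 0))
    return base * count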
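Both answers above pair each number with the unit that follows it. As a non-recursive sketch of the same idea, re.findall can pull out every count/unit pair in one pass. The function name duration_to_seconds, the UNIT_SECONDS table, and the set of unit spellings it accepts are assumptions made for illustration and are not part of either answer.

```python
import re

# Seconds per unit. The spellings listed here are an assumption; extend as needed.
UNIT_SECONDS = {
    "week": 7 * 24 * 60 * 60,
    "day": 24 * 60 * 60,
    "hour": 60 * 60,
    "hr": 60 * 60,
    "minute": 60,
    "min": 60,
    "second": 1,
    "sec": 1,
}


def duration_to_seconds(text):
    """Convert text such as '1 day 16 hours' to a number of seconds."""
    total = 0
    # Each match is a (count, unit) pair, e.g. ('43', 'mins')
    for count, unit in re.findall(r"(\d+)\s*([a-z]+)", text.lower()):
        # Accept plurals by dropping one trailing 's' when needed
        if unit not in UNIT_SECONDS and unit.endswith("s"):
            unit = unit[:-1]
        if unit not in UNIT_SECONDS:
            raise ValueError(f"unknown unit in {text!r}: {unit!r}")
        total += int(count) * UNIT_SECONDS[unit]
    return total


print(duration_to_seconds("1 hour 43 mins"))  # 6180
print(duration_to_seconds("1 day 16 hours"))  # 144000
```

With the frame from the question this could be applied as df['driving_duration_text'].map(duration_to_seconds). Raising on an unknown unit, rather than counting it as zero, is a design choice so that unexpected spellings are not silently dropped.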

Related

Print a schedule from an array

I have an array of shape (10, 200) containing 0s and 1s, so we have 10 users and 200 time slots.
df = pd.DataFrame({'Startpoint': [ 100 , 50, 40 , 75 , 52 , 43, 90 , 48, 56 ,20 ], 'endpoint': [ 150, 70, 80, 90, 140, 160 ,170 , 120 , 135, 170 ]})
df
rng = np.arange(200)
out = ((df['Startpoint'].to_numpy()[:, None] <= rng) & (rng < df['endpoint'].to_numpy()[:, None])).astype(int)
I would like to print a schedule like the one below, with one line for every slot that contains a 1:
Output
User 0 at hour 100
User 0 at hour 101
.
.
User 0 at hour 150
user 1 at hour 50
.
.
I think this should answer your question.
# Enumerate through your output and get the user ID and their schedule
for userID, user in enumerate(out):
    for i in range(len(user)):  # Enumerate through the length of the schedule by index
        if user[i] == 1:
            print(f"User {userID} at hour {i}")
This prints
User 0 at hour 100
User 0 at hour 101
User 0 at hour 102
User 0 at hour 103
Also, in your out variable you need
(rng <= df['endpoint'].to_numpy()[:, None])).astype(int)
instead of
(rng < df['endpoint'].to_numpy()[:, None])
so that the end time is included as well.
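As an alternative sketch to the nested loops, np.argwhere returns the (row, column) index of every matching entry in row-major order, which gives the same user-by-user listing. The small starts and ends arrays here are made-up data for illustration, and the end slot is included, as suggested above.

```python
import numpy as np

# Made-up start and end slots for three users (illustrative only)
starts = np.array([100, 50, 40])
ends = np.array([103, 52, 41])

rng = np.arange(200)
# 1 where start <= slot <= end (end slot included), 0 elsewhere
out = ((starts[:, None] <= rng) & (rng <= ends[:, None])).astype(int)

# argwhere yields one (user, slot) pair per entry equal to 1, ordered by user
lines = [f"User {user} at hour {slot}" for user, slot in np.argwhere(out == 1)]
print("\n".join(lines))
```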

Is there a way to convert/standardize text into Integer in Python?

I have a dataframe with a column showing the time (in minutes) spent organizing each inventory item. The goal is to show the minutes spent as either an integer or a float. However, the values in this column are not clean; see the examples below. Is there a way to standardize and convert everything to an integer or float? (For example, 10 hours should be 600 minutes.)
import pandas as pd
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
'420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
df1=pd.DataFrame(df1)
The desired output is like this
I used regex for this kind of problem.
import re
import numpy as np
import pandas as pd
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
               '420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
df1=pd.DataFrame(df1)
# Create an empty numpy array to fill by row index
arr = np.zeros(df1.shape[0])
# Copy the DataFrame for iteration
df1_copy = df1.copy()
for i,j in df1_copy.iterrows():
    if "h" in j["min"]:
        # Value is in hours: strip letters, brackets and spaces, then convert to minutes
        j["min"] = re.sub(r"[a-zA-Z()\s]","",j["min"])
        j["min"] = float(j["min"])
        arr[i] = float(j["min"]*60)
    else:
        # Value is already in minutes: strip everything that is not part of the number
        j["min"] = re.sub(r"[a-zA-Z()*\s]","",j["min"])
        j["min"] = float(j["min"])
        arr[i] = float(j["min"])
df1["min_clean"] = arr
print(df1)
min min_clean
0 420 420.0
1 450 450.0
2 480 480.0
3 512 512.0
4 560 560.0
5 10 hours 600.0
6 10.5 hours 630.0
7 420 (all inventory) 420.0
8 3h 180.0
9 4.1 hours 246.0
10 60** 60.0
11 6h 360.0
12 7hours 420.0
I currently don't know pandas but this solution (using regex) could help
import re
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
'420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
def mins(s):
    if re.match(r"\d*\.?\d+ *(h|hour)", s):
        # Keep only the digits and the decimal point, then convert hours to minutes.
        # Going through float also handles values such as 10.25 hours.
        return round(float(re.sub(r"[^\d.]", "", s)) * 60)
    return int(re.sub(r"\D", "", s))

min_clear = map(mins, df1["min"])
print(list(min_clear))
# output: [420, 450, 480, 512, 560, 600, 630, 420, 180, 246, 60, 360, 420]
You could later add min_clear to the DataFrame
Btw, I am just a beginner; if any use case fails, tell me and I will try to improve this.
Thanks
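The two answers above can also be condensed into a single pattern that captures the number and an optional hour marker. This is only a sketch: the function name to_minutes is hypothetical, and it assumes that any value whose number is followed by an h is in hours and that everything else is already in minutes.

```python
import re

# A number (integer or decimal), optional spaces, then an optional 'h'
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(h)?")


def to_minutes(text):
    """Return the duration in minutes as a float."""
    match = PATTERN.search(text.lower())
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    value = float(match.group(1))
    # Values marked with 'h' (h, hour, hours) are hours; convert to minutes
    if match.group(2):
        value *= 60
    # Round away floating-point noise such as 4.1 * 60
    return round(value, 2)


raw = ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
       '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']
print([to_minutes(item) for item in raw])
```

On the DataFrame from the question this could be applied with df1['min'].map(to_minutes).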

Runtime error - Forward Rates Calculation

I am trying to build an annual EONIA forward curve with inputs of tenors from 1 week to 50 years.
I have managed to code thus far:
data
maturity spot rate
0 1 -0.529
1 2 -0.529
2 3 -0.529
3 1 -0.504
4 2 -0.505
5 3 -0.506
6 4 -0.508
7 5 -0.509
8 6 -0.510
9 7 -0.512
10 8 -0.514
11 9 -0.515
12 10 -0.517
13 11 -0.518
14 1 -0.520
15 15 -0.524
16 18 -0.526
17 21 -0.527
18 2 -0.528
19 3 -0.519
20 4 -0.501
21 5 -0.476
22 6 -0.441
23 7 -0.402
24 8 -0.358
25 9 -0.313
26 10 -0.265
27 11 -0.219
28 12 -0.174
29 15 -0.062
30 20 0.034
31 25 0.054
32 30 0.039
33 40 -0.001
34 50 -0.037
terms= data["maturity"].tolist()
rates= data['spot rate'].tolist()
calendar = ql.TARGET()
business_convention = ql.ModifiedFollowing
day_count = ql.Actual360()
settlement_days_EONIA = 2
EONIA = ql.OvernightIndex("EONIA", settlement_days_EONIA, ql.EURCurrency(), calendar, day_count)
# Deposit Helper
depo_facility = -0.50
depo_helper = [ql.DepositRateHelper(ql.QuoteHandle(ql.SimpleQuote(depo_facility/100)), ql.Period(1,ql.Days), 1, calendar, ql.Unadjusted, False, day_count)]
# OIS Helper
OIS_helpers = []
for i in range(len(terms)):
    if i < 3:
        tenor = ql.Period(ql.Weeks)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
    elif i < 12:
        tenor = ql.Period(ql.Months)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
    else:
        tenor = ql.Period(ql.Years)
        eon_rate = rates[i]
        OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
rate_helpers = depo_helper + OIS_helpers
eonia_curve_c = ql.PiecewiseLogCubicDiscount(0, ql.TARGET(), rate_helpers, day_count)
#This doesn't give me a daily grid of rates, but only the rates at the tenors of my input
eonia_curve_c.enableExtrapolation()
days = ql.MakeSchedule(eonia_curve_c.referenceDate(), eonia_curve_c.maxDate(), ql.Period('1Y'))
rates_fwd = [
eonia_curve_c.forwardRate(d, calendar.advance(d,365,ql.Days), day_count, ql.Simple).rate()*100
for d in days
]
The problem is that when I run the code, I get the following error:
RuntimeError: more than one instrument with pillar June 18th, 2021
There is probably an error somewhere in the code for the OIS helper, where there is an overlap but I am not sure what I have done wrong. Anyone know what the problem is?
First off, apologies for any inelegant Python, as I am coming from C++:
The main issue with the original question was that ql.Period() takes two parameters when used with an integer number of periods, e.g. ql.Period(3,ql.Years). If you instead construct the input array with string representations of the tenors, e.g. '3y', you can pass this string straight to ql.Period(). So ql.Period(3,ql.Years) and ql.Period('3y') give the same result.
import QuantLib as ql
import numpy as np
import pandas as pd
curve = [ ['1w', -0.529],
['2w', -0.529],
['3w', -0.529],
['1m', -0.504],
['2m', -0.505],
['3m', -0.506],
['4m', -0.508],
['5m', -0.509],
['6m', -0.510],
['7m', -0.512],
['8m', -0.514],
['9m', -0.515],
['10m', -0.517],
['11m', -0.518],
['1y', -0.520],
['15m', -0.524],
['18m', -0.526],
['21m', -0.527],
['2y', -0.528],
['3y', -0.519],
['4y', -0.501],
['5y', -0.476],
['6y', -0.441],
['7y', -0.402],
['8y', -0.358],
['9y', -0.313],
['10y', -0.265],
['11y', -0.219],
['12y', -0.174],
['15y', -0.062],
['20y', 0.034],
['25y', 0.054],
['30y', 0.039],
['40y', -0.001],
['50y', -0.037] ]
data = pd.DataFrame(curve, columns = ['maturity','spot rate'])
print('Input curve\n',data)
terms= data["maturity"].tolist()
rates= data['spot rate'].tolist()
calendar = ql.TARGET()
day_count = ql.Actual360()
settlement_days_EONIA = 2
EONIA = ql.OvernightIndex("EONIA", settlement_days_EONIA, ql.EURCurrency(), calendar, day_count)
# Deposit Helper
depo_facility = -0.50
depo_helper = [ql.DepositRateHelper(ql.QuoteHandle(ql.SimpleQuote(depo_facility/100)), ql.Period(1,ql.Days), 1, calendar, ql.Unadjusted, False, day_count)]
# OIS Helper
OIS_helpers = []
for i in range(len(terms)):
    tenor = ql.Period(terms[i])
    eon_rate = rates[i]
    OIS_helpers.append(ql.OISRateHelper(settlement_days_EONIA, tenor, ql.QuoteHandle(ql.SimpleQuote(eon_rate/100)), EONIA))
rate_helpers = depo_helper + OIS_helpers
eonia_curve_c = ql.PiecewiseLogCubicDiscount(0, ql.TARGET(), rate_helpers, day_count)
#This doesn't give me a daily grid of rates, but only the rates at the tenors of my input
eonia_curve_c.enableExtrapolation()
days = ql.MakeSchedule(eonia_curve_c.referenceDate(), eonia_curve_c.maxDate(), ql.Period('1Y'))
rates_fwd = [
eonia_curve_c.forwardRate(d, calendar.advance(d,365,ql.Days), day_count, ql.Simple).rate()*100
for d in days
]
print('Output\n',pd.DataFrame(rates_fwd,columns=['Fwd rate']))
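The pillar error indicates that several helpers ended up with the same maturity, which is what happens in the original loop because the tenor there does not depend on terms[i]. A small pre-check on the tenor strings can catch this before the helpers are built. This is a plain-Python sketch that does not call QuantLib; tenor_key and find_duplicate_tenors are hypothetical helper names, and it assumes tenors are written like '2w', '15m' or '3y'.

```python
import re
from collections import Counter


def tenor_key(tenor):
    """Normalize a tenor such as '1y' or '12m' so equal maturities compare equal."""
    match = re.fullmatch(r"(\d+)([dwmy])", tenor.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized tenor: {tenor!r}")
    count, unit = int(match.group(1)), match.group(2)
    # Express years in months so that '1y' and '12m' collide
    if unit == "y":
        return ("m", count * 12)
    return (unit, count)


def find_duplicate_tenors(tenors):
    """Return the normalized tenors that appear more than once."""
    counts = Counter(tenor_key(t) for t in tenors)
    return sorted(key for key, n in counts.items() if n > 1)


print(find_duplicate_tenors(["1w", "2w", "3w", "1m", "12m", "1y", "2y"]))
# [('m', 12)]
```

The check only compares the tenor strings and ignores calendar adjustments, so it is a first sanity check rather than a guarantee that no two pillar dates coincide.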

How to set second in pandas dataframe?

I import data from a CSV file using Python. I want to calculate the mean for every row and column using only the time variables. The problem is that the values are plain numbers rather than durations in seconds.
How can I declare these variables as times in seconds instead of plain numeric values?
this is my data
--------------------------
|Responses|Time 1 | Time 2 | Time 3|
| abc |20 | 30 | 40 |
| bce |23 | 25 | 30 |
| cde |34 | 40 | 20 |
So, I want to calculate the total time for each response
df.sum(axis = 1)
abc 90
bce 78
cde 92
df.sum(axis = 0)
Time 1 76
Time 2 95
Time 3 90
But I actually want it in minutes and seconds, that is:
df.sum(axis = 0)
Time 1 1:16
Time 2 1:35
Time 3 1:30
Or it can be 1 minute 16 seconds or something. Anyone know how to do it?
Your question is not really well defined. You should follow the instructions, as suggested in the comments by jezrael.
As you said "Or it can be 1 minute 16 seconds or something.", I assumed that the output can simply be a string.
If you want the result as:
1:16, use to_minutes_seconds(x)
1 minute 16 seconds, use to_minutes_seconds_text(x)
import pandas as pd
from datetime import timedelta

def to_minutes_seconds(x):
    # x is the current value to process, for example 76
    td = timedelta(seconds=x)
    # split x into 3 variables: hours, minutes and seconds
    hours, remainder = divmod(td.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    # return the required format, minutes:seconds (seconds zero-padded to two digits)
    return "{}:{:02d}".format(minutes, seconds)

def to_minutes_seconds_text(x):
    td = timedelta(seconds=x)
    hours, remainder = divmod(td.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if minutes > 1:
        m = 'minutes'
    else:
        m = 'minute'
    if seconds > 1:
        s = 'seconds'
    else:
        s = 'second'
    return "{} {} {} {}".format(minutes, m, seconds, s)

# Create the input dictionary
df = pd.DataFrame.from_dict({'Responses': [76, 95, 90, 781]})
# Change the total seconds into the required format
df['Time'] = df['Responses'].apply(to_minutes_seconds)
df['Text'] = df['Responses'].apply(to_minutes_seconds_text)
print(df)
Output:
Responses Time Text
0 76 1:16 1 minute 16 seconds
1 95 1:35 1 minute 35 seconds
2 90 1:30 1 minute 30 seconds
3 781 13:01 13 minutes 1 second
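Note that timedelta.seconds only holds the part of a duration below one day, and the functions above compute the hours but leave them out of the result. If totals can exceed an hour, a sketch based on divmod alone keeps the minutes counting upward. The name format_min_sec is an assumption made for illustration.

```python
def format_min_sec(total_seconds):
    """Format a number of seconds as M:SS, letting minutes exceed 59."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


# Column totals taken from the question
totals = {"Time 1": 76, "Time 2": 95, "Time 3": 90}
for name, value in totals.items():
    print(name, format_min_sec(value))

print(format_min_sec(3700))  # 61:40
```

With pandas, df.sum(axis=0).map(format_min_sec) would apply the same formatting to the column totals.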

Iterate through each dataframe header and update int month to str month if and only if the header string has '20' in it

A few Django formatting issues which require df header changes.
Test data:
Test_Data = [
('Year_Month', ['Done_RFQ','Not_Done_RFQ','Total_RFQ']),
('2018_11', [10, 20, 30]),
('2019_06',[10,20,30]),
('2019_12', [40, 50, 60]),
]
df = pd.DataFrame(dict(Test_Data))
print(df)
Year_Month 2018_11 2019_06 2019_12
0 Done_RFQ 10 10 40
1 Not_Done_RFQ 20 20 50
2 Total_RFQ 30 30 60
Desired output:
Year_Month 2018_Nov 2019_Jun 2019_Dec
0 Done_RFQ 10 10 40
1 Not_Done_RFQ 20 20 50
2 Total_RFQ 30 30 60
My attempt:
df_names = df.columns
for df_name in df_names:
    if df_name[:1] == '20':
        df.df_name = str(pd.to_datetime(df_name, format='%Y_%m').dt.strftime('%Y_%b'))
Error: AttributeError: 'Timestamp' object has no attribute 'dt'
I was hoping the date object could be used for the formatting. Any suggestions on how to generalise this for any string in the headers?
IIUC
s=pd.Series(df.columns)
s2=pd.to_datetime(s,format='%Y_%m',errors ='coerce').dt.strftime('%Y_%b')
df.columns=s2.mask(s2=='NaT').fillna(s)
df
Out[368]:
2018_Nov 2019_Jun 2019_Dec Year_Month
0 10 10 40 Done_RFQ
1 20 20 50 Not_Done_RFQ
2 30 30 60 Total_RFQ
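You can drop the .dt, since .strftime is a method of Timestamp:
df.df_name = str(pd.to_datetime(df_name, format='%Y_%m').strftime('%Y_%b'))
Note that this only removes the AttributeError. As written in the question, df_name[:1] is a single character, so the comparison with '20' needs df_name[:2], and assigning to df.df_name sets an attribute literally named df_name rather than renaming the column.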
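To generalise this to every header, one option is to pass a function to df.rename(columns=...), which calls it once per column name. The sketch below uses the standard library's datetime.strptime in place of pd.to_datetime; month_header is a hypothetical helper name, and the month abbreviations produced by %b depend on the current locale.

```python
from datetime import datetime


def month_header(name):
    """Turn '2019_06' into '2019_Jun'; return any other header unchanged."""
    if not str(name).startswith("20"):
        return name
    try:
        return datetime.strptime(name, "%Y_%m").strftime("%Y_%b")
    except ValueError:
        # Starts with '20' but is not in YYYY_MM form: leave it alone
        return name


headers = ["Year_Month", "2018_11", "2019_06", "2019_12"]
print([month_header(h) for h in headers])
```

With the DataFrame from the question this would be used as df = df.rename(columns=month_header).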
