I am trying to create a new column in a DataFrame by applying a function to a column that holds numbers as strings. I have written the function to extract the numbers I want, tested it on a single string input, and confirmed that it works.
import re

SEARCH_PATTERN = r'([0-9]{1,2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})'

def get_total_time_minutes(time_col, pattern=SEARCH_PATTERN):
    """Uses regex to parse time_col, a string in the format 'd hh:mm:ss',
    to obtain a total time in minutes.
    """
    days, hours, minutes, _ = re.match(pattern, time_col).groups()
    total_time_minutes = (int(days)*24 + int(hours))*60 + int(minutes)
    return total_time_minutes
#test that the function works for a single input
text = "2 23:24:46"
print(get_total_time_minutes(text))
Output: 4284
#apply the function to the required columns
df['Minutes Available'] = df['Resource available (d hh:mm:ss)'].apply(get_total_time_minutes)
The picture below is a screenshot of my dataframe columns.
[Screenshot of the dataframe columns]
The 'Resources available (d hh:mm:ss)' column of my dataframe has pandas dtype 'O' (object, i.e. it holds strings, if my understanding is correct), and has data in the following format: '5 08:00:00'. When I call .apply(get_total_time_minutes) on it, though, I get the following error:
TypeError: expected string or bytes-like object
To clarify further, the "Resources Available" column is a string representing the total time in days, hours, minutes and seconds that the resource was available. I want to convert that time string to a total time in minutes, hence the regex and arithmetic within the get_total_time_minutes function.
This might be a bit hacky, because it uses the datetime library to parse the date and then turn it into a Timedelta by subtracting the default epoch:
>>> pd.to_datetime('2 23:48:30', format='%d %H:%M:%S') - pd.to_datetime('0', format='%S')
Out[47]: Timedelta('1 days 23:48:30')
>>> Out[47] / pd.Timedelta('1 minute')
Out[50]: 2868.5
But it does tell you how many minutes elapsed in those two days and however many hours. It's also vectorised, so you can apply it to the columns and get your minute values a lot faster than using apply.
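A minimal sketch of that vectorised version, assuming the column name from the question ('Resource available (d hh:mm:ss)') and that every value is a well-formed 'd hh:mm:ss' string:

import pandas as pd

# mirrors the subtraction trick above, applied to the whole column at once
parsed = pd.to_datetime(df['Resource available (d hh:mm:ss)'],
                        format='%d %H:%M:%S')
df['Minutes Available'] = (parsed - pd.to_datetime('0', format='%S')) / pd.Timedelta('1 minute')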
I have the following function that is called multiple times:
def next_valid_date(self, date_object):
    """Returns next valid date based on valid_dates.
    If argument date_object is valid, original date_object will be returned."""
    while date_object not in self.valid_dates.tolist():
        date_object += datetime.timedelta(days=1)
    return date_object
For reference, valid_dates is a numpy array that holds all recorded dates for a given stock pulled from yfinance. In the case of the example I've been working with, NVDA (nvidia stock), the valid_dates array has 5395 elements (dates).
I have another function, and its purpose is to create a series of start dates and end dates. In this example self.interval is a timedelta with a length of 365 days, and self.sub_interval is a timedelta with a length of 1 day:
def get_date_range_series(self):
    """Retrieves a series containing lists of start dates and corresponding end dates over a given interval."""
    interval_start = self.valid_dates[0]
    interval_end = self.next_valid_date(self.valid_dates[0] + self.interval)
    dates = [[interval_start, interval_end]]
    while interval_end < datetime.date.today():
        interval_start = self.next_valid_date(interval_start + self.sub_interval)
        interval_end = self.next_valid_date(interval_start + self.interval)
        dates.append([interval_start, interval_end])
    return pd.Series(dates)
My main issue is that it takes a lengthy period of time to execute (about 2 minutes), and I'm sure there's a far better way of doing this... Any thoughts?
I just created an alternate next_valid_date_alt() method that indexes a pandas DataFrame with .loc (the DataFrame's index is the list of valid dates, which is where valid_dates comes from in the first place):
def next_valid_date_alt(self, date_object):
    while True:
        try:
            self.stock_yf_df.loc[date_object]
            break
        except KeyError:
            date_object += datetime.timedelta(days=1)
    return date_object
Checking for the next valid date when 6/28/20 is passed in (not a valid trading date, since it falls on a weekend and the stock market is closed) resulted in the original method taking 0.0099754 seconds to complete and the alternate method taking 0.0019944 seconds to complete.
What this means is that get_date_range_series() takes just over 1 second to complete when using the next_valid_date_alt() method as opposed to 70 seconds when using the next_valid_date() method. I'll definitely look into the other optimizations mentioned as well. I appreciate everyone else's responses!
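For what it's worth, a loop-free sketch (not the poster's code) is also possible, assuming valid_dates is sorted and directly comparable with date_object: np.searchsorted returns the position of the first valid date that is greater than or equal to the requested one.

import numpy as np

def next_valid_date_sorted(self, date_object):
    """Return the first entry of self.valid_dates that is >= date_object.
    Assumes self.valid_dates is sorted; raises IndexError if date_object
    is later than the last recorded date."""
    idx = np.searchsorted(self.valid_dates, date_object)
    return self.valid_dates[idx]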
I have a script that reads a corrupt .CUE file that has no INDEX 00 and retrieves the minutes and seconds value of each track entry. When the values are found, the script subtracts 2 seconds from each track (thus creating a pregap and correcting the .CUE file) and creates a new, correct .CUE file. The script worked like a charm until it encountered .CUE files containing minute values greater than 60. The following error occurred:
ValueError: time data '60:01' does not match format '%M:%S'
I used datetime because I couldn't simply subtract the 2 seconds from each track entry as an integer: when an entry has an 'INDEX 01' seconds value of 01, subtracting 2 seconds also affects the minute value, which has to be reduced by 1.
This is the part of the code that does the formatting and subtraction. It worked fine until it encountered a minute value greater than 60:
from datetime import datetime
WrongIndex = '60:01'
NewIndex = '00:02'
format = '%M:%S'
time = datetime.strptime(WrongIndex, format) - datetime.strptime(NewIndex, format)
The expected returned value in this case should be '59:59'.
I'd like to know if there are other ways to use minute values greater than 60 since the max length of these files can go up to 79.8 minutes.
I don't think a datetime object is really an appropriate data structure for your problem. That type expects to be referencing a real clock time, not just an arbitrary number of minutes and seconds. If you were sticking with datetime, a more appropriate type would probably be timedelta, which represents a period of time, unmoored from any specific clock or calendar. But there's no equivalent to strptime for timedeltas.
And without the parsing, you don't get much from datetime at all. So I suggest just doing the parsing yourself. It's not very difficult:
minutes, seconds = map(int, WrongIndex.split(':'))
This just splits your input string (e.g. '60:01') into a list with two values (['60', '01']). It then converts the string values into integers. Then it assigns the two integers to the variables minutes and seconds.
To make doing math easy, you can then combine the two values into a single integer, a count of seconds:
seconds += minutes * 60
Then you can subtract your two-second offset and convert the number of seconds back to a time string:
seconds -= 2 # or parse the offset string if you don't want to hard code two seconds
result = "{:02}:{:02}".format(*divmod(seconds, 60))
In the formatting step, I'm using the divmod function, which computes a floor division and a modulus in one step (it returns both in a tuple).
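Putting those pieces together, a short sketch of the whole conversion (the name fix_index is just illustrative):

def fix_index(index_str, offset_seconds=2):
    """Parse an 'MM:SS' string (minutes may exceed 59), subtract the offset,
    and format the result back as 'MM:SS'."""
    minutes, seconds = map(int, index_str.split(':'))
    total = minutes * 60 + seconds - offset_seconds
    return "{:02}:{:02}".format(*divmod(total, 60))

print(fix_index('60:01'))  # 59:59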
You need to do some converting. I would convert your values to integers. If minutes is greater than 59, we carry the excess into hours. After that we can create datetime objects to subtract. To get minutes, we take our delta in seconds and divide it by 60.
from datetime import datetime
def to_time(value):
    """Takes a value in '%M:%S' form and returns a datetime object."""
    # cast the values to integers
    minutes, seconds = [int(i) for i in value.split(':')]
    # if minutes is greater than 59, carry the excess into hours
    hour = 0
    if minutes > 59:
        hour = minutes // 60
        minutes = minutes % 60
    return datetime.strptime(f'{hour}:{minutes}:{seconds}', '%H:%M:%S')
# now our calculations
wrong_index = '60:01'
new_index = '00:02'
time_ = to_time(wrong_index) - to_time(new_index)
print(time_.seconds/60)
How about this:
import datetime
WrongIndex = '60:01'
NewIndex = '00:02'
wrong_time = WrongIndex.split(':')
new_index = NewIndex.split(':')
old_seconds = int(wrong_time[0])*60 + int(wrong_time[1])
new_seconds = int(new_index[0])*60 + int(new_index[1])
time = datetime.timedelta(seconds=old_seconds-new_seconds)
print(time)
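Note that print(time) shows the timedelta as 0:59:59; if you need the corrected value back in the plain 'MM:SS' form expected by the .CUE file, you can format it yourself (a small addition, not part of the answer above):

total = int(time.total_seconds())
print("{:02}:{:02}".format(*divmod(total, 60)))  # 59:59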
I have a pyspark dataframe with the following time format 20190111-08:15:45.275753. I want to convert this to timestamp format keeping the microsecond granularity. However, it appears as though it is difficult to keep the microseconds as all time conversions in pyspark produce seconds?
Do you have a clue how this can be done? Note that converting it to pandas etc. will not work, as the dataset is huge, so I need an efficient way of doing this. An example of how I am doing it is below:
from pyspark.sql.functions import col, unix_timestamp

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])
res = time_df.withColumn("time", unix_timestamp(col("dt"),
                                                format='yyyyMMdd-HH:mm:ss.000').alias("time"))
res.show(5, False)
Normally timestamp granularity is in seconds so I do not think there is a direct method to keep milliseconds granularity.
In pyspark there is the function unix_timestamp that:
unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default)
to Unix time stamp (in seconds), using the default timezone and the default
locale, return null if fail.
if `timestamp` is None, then it returns current timestamp.
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> time_df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
[Row(unix_time=1428476400)]
>>> spark.conf.unset("spark.sql.session.timeZone")
A usage example:
import pyspark.sql.functions as F
res = df.withColumn(colName, F.unix_timestamp(F.col(colName), \
format='yyyy-MM-dd HH:mm:ss.000').alias(colName) )
What you could also do is split your date string (str.rsplit('.', 1)), keeping the milliseconds apart (for example by creating another column) in your dataframe.
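That splitting suggestion might look roughly like the sketch below (column names are illustrative); the sub-second part lives in its own column so it is not lost by the seconds-only conversion:

import pyspark.sql.functions as F

res = (time_df
       .withColumn('dt_sec', F.split(F.col('dt'), '[.]').getItem(0))
       .withColumn('fraction', F.split(F.col('dt'), '[.]').getItem(1))
       .withColumn('unix_time', F.unix_timestamp(F.col('dt_sec'), 'yyyyMMdd-HH:mm:ss')))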
EDIT
In your example the problem is that the time is of type string. First you need to convert it to a timestamp type: this can be done with:
res = time_df.withColumn("parsed", F.to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))
Then you can use unix_timestamp
res2 = res.withColumn("time", F.unix_timestamp(F.col("parsed"), format='yyyyMMdd-HH:mm:ss.000').alias("time"))
Finally, to create a column with the milliseconds:
res3 = res2.withColumn("ms", F.split(res2['dt'], '[.]').getItem(1))
I've found a workaround for this using the to_utc_timestamp function in pyspark, though I'm not entirely sure this is the most efficient; it seems to work fine on about 100 million rows of data. You can avoid the regexp_replace if your timestamp string looked like this -
1997-02-28 10:30:40.897748
from pyspark.sql.functions import regexp_replace, to_utc_timestamp
df = spark.createDataFrame([('19970228-10:30:40.897748',)], ['new_t'])
df = df.withColumn('t', regexp_replace('new_t', '^(.{4})(.{2})(.{2})-', '$1-$2-$3 '))
df = df.withColumn("time", to_utc_timestamp(df.t, "UTC").alias('t'))
df.show(5,False)
print(df.dtypes)
I have an excel file with dates formatted as such:
22.10.07 16:00
22.10.07 17:00
22.10.07 18:00
22.10.07 19:00
After using the parse method of pandas to read the data, the dates are read almost correctly:
In [55]: nts.data['Tid'][10000:10005]
Out[55]:
10000 2007-10-22 15:59:59.997905
10001 2007-10-22 16:59:59.997904
10002 2007-10-22 17:59:59.997904
10003 2007-10-22 18:59:59.997904
What do I need to do to either a) get it to work correctly, or b) is there a trick to fix this easily? (e.g. some kind of 'round' function for datetime)
I encountered the same issue and got around it by not parsing the dates using Pandas, but rather applying my own function (shown below) to the relevant column(s) of the dataframe:
import datetime as dt
import pandas as pd

def ExcelDateToDateTime(xlDate):
    epoch = dt.datetime(1899, 12, 30)
    delta = dt.timedelta(hours=round(xlDate*24))
    return epoch + delta

df = pd.DataFrame.from_csv('path')
df['Date'] = df['Date'].apply(ExcelDateToDateTime)
Note: This will ignore any time granularity below the hour level, but that's all I need, and it looks from your example that this could be the case for you too.
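As an alternative not mentioned in the answers, recent pandas versions expose a rounding helper on the .dt accessor, so you could keep pandas' own parsing and just snap the almost-correct timestamps to the nearest second (or hour):

# 'Tid' is the column name from the question; assumes it is already datetime64
nts.data['Tid'] = nts.data['Tid'].dt.round('s')   # or .dt.round('h') for whole hours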
Excel serializes datetimes with a ddddd.tttttt format, where the d part is an integer number representing the offset from a reference day (like Dec 31st, 1899), and the t part is a fraction between 0.0 and 1.0 which stands for the part of the day at the given time (for example at 12:00 it's 0.5, at 18:00 it's 0.75 and so on).
I asked you to upload a file with sample data. .xlsx files are really ZIP archives which contain your XML-serialized worksheets. These are the dates I extracted from the relevant column. Excerpt:
38961.666666666628
38961.708333333292
38961.749999999956
When you try to deserialize them manually you get the same datetimes as Pandas. Unfortunately, the way Excel stores times makes it impossible to represent some values exactly, so you have to round them for display purposes. I'm not sure if rounded data is needed for analysis, though.
This is the script I used to check that the deserialized datetimes really are the same ones Pandas produces:
from datetime import date, datetime, time, timedelta
from urllib2 import urlopen
def deserialize(text):
    tokens = text.split(".")
    date_tok = tokens[0]
    time_tok = tokens[1] if len(tokens) == 2 else "0"
    d = date(1899, 12, 31) + timedelta(int(date_tok))
    t = time(*helper(float("0." + time_tok), (24, 60, 60, 1000000)))
    return datetime.combine(d, t)

def helper(factor, units):
    result = list()
    for unit in units:
        value, factor = divmod(factor * unit, 1)
        result.append(int(value))
    return result

url = "https://gist.github.com/RaffaeleSgarro/877d7449bd19722b44cb/raw/" \
      "45d5f0b339d4abf3359fe673fcd2976374ed61b8/dates.txt"

for line in urlopen(url):
    print deserialize(line)
I just started moving from Matlab to Python 2.7 and I have some trouble reading my .mat-files. Time information is stored in Matlab's datenum format. For those who are not familiar with it:
A serial date number represents a calendar date as the number of days that has passed since a fixed base date. In MATLAB, serial date number 1 is January 1, 0000.
MATLAB also uses serial time to represent fractions of days beginning at midnight; for example, 6 p.m. equals 0.75 serial days. So the string '31-Oct-2003, 6:00 PM' in MATLAB is date number 731885.75.
(taken from the Matlab documentation)
I would like to convert this to Pythons time format and I found this tutorial. In short, the author states that
If you parse this using python's datetime.fromordinal(731965.04835648148) then the result might look reasonable [...]
(before any further conversions), which doesn't work for me, since datetime.fromordinal expects an integer:
>>> datetime.fromordinal(731965.04835648148)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: integer argument expected, got float
While I could just round them down for daily data, I actually need to import minutely time series. Does anyone have a solution for this problem? I would like to avoid reformatting my .mat files since there's a lot of them and my colleagues need to work with them as well.
If it helps, someone else asked for the other way round. Sadly, I'm too new to Python to really understand what is happening there.
/edit (2012-11-01): This has been fixed in the tutorial posted above.
You already link to the solution; it just has a small issue. It is this:
python_datetime = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
a longer explanation can be found here
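For example, with the 731885.75 datenum quoted from the MATLAB documentation above (31-Oct-2003, 6:00 PM):

from datetime import datetime, timedelta

matlab_datenum = 731885.75
python_datetime = (datetime.fromordinal(int(matlab_datenum))
                   + timedelta(days=matlab_datenum % 1)
                   - timedelta(days=366))
print(python_datetime)  # 2003-10-31 18:00:00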
Using pandas, you can convert a whole array of datenum values with fractional parts:
import numpy as np
import pandas as pd
datenums = np.array([737125, 737124.8, 737124.6, 737124.4, 737124.2, 737124])
timestamps = pd.to_datetime(datenums-719529, unit='D')
The value 719529 is the datenum value of the Unix epoch start (1970-01-01), which is the default origin for pd.to_datetime().
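Printing timestamps should then show values close to these (the fractional datenums are floats, so tiny sub-second noise can appear):

print(timestamps)
# roughly: 2018-03-06 00:00, 2018-03-05 19:12, 14:24, 09:36, 04:48, 00:00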
I used the following Matlab code to set this up:
datenum('1970-01-01') % gives 719529
datenums = datenum('06-Mar-2018') - linspace(0,1,6) % test data
datestr(datenums) % human readable format
Just in case it's useful to others, here is a full example of loading time series data from a Matlab mat file, converting a vector of Matlab datenums to a list of datetime objects using carlosdc's answer (defined as a function), and then plotting as time series with Pandas:
from scipy.io import loadmat
import pandas as pd
import datetime as dt
import urllib
# In Matlab, I created this sample 20-day time series:
# t = datenum(2013,8,15,17,11,31) + [0:0.1:20];
# x = sin(t)
# y = cos(t)
# plot(t,x)
# datetick
# save sine.mat
urllib.urlretrieve('http://geoport.whoi.edu/data/sine.mat','sine.mat');
# If you don't use squeeze_me = True, then Pandas doesn't like
# the arrays in the dictionary, because they look like an arrays
# of 1-element arrays. squeeze_me=True fixes that.
mat_dict = loadmat('sine.mat',squeeze_me=True)
# make a new dictionary with just dependent variables we want
# (we handle the time variable separately, below)
my_dict = { k: mat_dict[k] for k in ['x','y']}
def matlab2datetime(matlab_datenum):
    day = dt.datetime.fromordinal(int(matlab_datenum))
    dayfrac = dt.timedelta(days=matlab_datenum % 1) - dt.timedelta(days=366)
    return day + dayfrac
# convert Matlab variable "t" into list of python datetime objects
my_dict['date_time'] = [matlab2datetime(tval) for tval in mat_dict['t']]
# build a Pandas DataFrame indexed by the converted datetimes
df = pd.DataFrame(my_dict)
df = df.set_index('date_time')

# print df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 201 entries, 2013-08-15 17:11:30.999997 to 2013-09-04 17:11:30.999997
Data columns (total 2 columns):
x    201 non-null values
y    201 non-null values
dtypes: float64(2)

# plot with Pandas
df.plot()
Here's a way to convert these using numpy.datetime64, rather than datetime.
import numpy as np

origin = np.datetime64('0000-01-01', 'D') - np.timedelta64(1, 'D')
date = serdate * np.timedelta64(1, 'D') + origin
This works for serdate either a single integer or an integer array.
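As a quick sanity check with the 31-Oct-2003 example (MATLAB datenum 731885), using an integer input as noted:

import numpy as np

origin = np.datetime64('0000-01-01', 'D') - np.timedelta64(1, 'D')
print(731885 * np.timedelta64(1, 'D') + origin)  # 2003-10-31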
Just building on and adding to previous comments. The key is in the day counting as carried out by the method toordinal and constructor fromordinal in the class datetime and related subclasses. For example, from the Python Library Reference for 2.7, one reads that fromordinal
Return the date corresponding to the proleptic Gregorian ordinal, where January 1 of year 1 has ordinal 1. ValueError is raised unless 1 <= ordinal <= date.max.toordinal().
However, MATLAB starts counting from year 0 AD, which is itself a (leap) year, so an extra 366 days have to be taken into account. (Year 0 was a leap year, like 2016, which is exactly 504 four-year cycles later.)
These are two functions that I have been using for similar purposes:
import datetime
def datetime_pytom(d, t):
    '''
    Input
        d   Date as an instance of type datetime.date
        t   Time as an instance of type datetime.time
    Output
        The fractional day count since 0-Jan-0000 (proleptic ISO calendar)
        This is the 'datenum' datatype in matlab
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence an increase of 366 days, for year 0 AD was a leap year
    '''
    dd = d.toordinal() + 366
    tt = datetime.timedelta(hours=t.hour, minutes=t.minute,
                            seconds=t.second)
    tt = datetime.timedelta.total_seconds(tt) / 86400
    return dd + tt
def datetime_mtopy(datenum):
    '''
    Input
        The fractional day count according to the datenum datatype in matlab
    Output
        The date and time as an instance of type datetime in python
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence a reduction of 366 days, for year 0 AD was a leap year
    '''
    ii = datetime.datetime.fromordinal(int(datenum) - 366)
    ff = datetime.timedelta(days=datenum % 1)
    return ii + ff
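A quick round trip using the 31-Oct-2003, 6:00 PM example from the MATLAB documentation:

import datetime

datenum = datetime_pytom(datetime.date(2003, 10, 31), datetime.time(18, 0))
print(datenum)                  # 731885.75
print(datetime_mtopy(datenum))  # 2003-10-31 18:00:00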
Hope this helps and happy to be corrected.