Working with Microsecond Time Stamps in PySpark - python

I have a PySpark dataframe with the following time format: 20190111-08:15:45.275753. I want to convert this to timestamp format, keeping the microsecond granularity. However, it appears to be difficult to keep the microseconds, as all time conversions in PySpark produce seconds.
Do you have a clue on how this can be done? Note that converting it to pandas etc. will not work, as the dataset is huge, so I need an efficient way of doing this. An example of how I am doing it is below:
from pyspark.sql.functions import col, unix_timestamp

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])
res = time_df.withColumn("time", unix_timestamp(col("dt"), format='yyyyMMdd-HH:mm:ss.000').alias("time"))
res.show(5, False)

Timestamp granularity is normally in seconds, so I do not think there is a direct method to keep millisecond granularity.
In PySpark there is the function unix_timestamp, whose docstring reads:
unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default)
to Unix time stamp (in seconds), using the default timezone and the default
locale, return null if fail.
if `timestamp` is None, then it returns current timestamp.
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> time_df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
[Row(unix_time=1428476400)]
>>> spark.conf.unset("spark.sql.session.timeZone")
A usage example:
import pyspark.sql.functions as F

res = df.withColumn(colName, F.unix_timestamp(F.col(colName), format='yyyy-MM-dd HH:mm:ss.000').alias(colName))
What you might do is split your date string (str.rsplit('.', 1)) and keep the milliseconds apart, for example in another column of your dataframe (see the sketch after the EDIT below).
EDIT
In your example the problem is that the time is of type string. First you need to convert it to a timestamp type; this can be done with to_timestamp:
res = time_df.withColumn("new_col", F.to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))
Then you can use unix_timestamp on the parsed column:
res2 = res.withColumn("time", F.unix_timestamp(F.col("new_col")).alias("time"))
Finally, to create a column with the milliseconds:
res3 = res2.withColumn("ms", F.split(res2['dt'], '[.]').getItem(1))
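Putting these pieces together, here is a minimal sketch of that idea on the sample frame from the question (the column names are illustrative, and the fractional part is kept as microseconds rather than milliseconds):
import pyspark.sql.functions as F

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])

res = (time_df
       .withColumn("base", F.split("dt", "[.]").getItem(0))    # '20150408-01:12:04'
       .withColumn("micros", F.split("dt", "[.]").getItem(1))  # '275753'
       .withColumn("seconds", F.unix_timestamp("base", "yyyyMMdd-HH:mm:ss"))
       .withColumn("epoch_micros",
                   F.col("seconds").cast("long") * 1000000 + F.col("micros").cast("long")))
res.show(truncate=False)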

I've found a workaround for this using the to_utc_timestamp function in PySpark. I'm not entirely sure whether it is the most efficient approach, but it seems to work fine on about 100 million rows of data. You can avoid the regexp_replace if your timestamp string already looked like this -
1997-02-28 10:30:40.897748
from pyspark.sql.functions import regexp_replace, to_utc_timestamp
df = spark.createDataFrame([('19970228-10:30:40.897748',)], ['new_t'])
df = df.withColumn('t', regexp_replace('new_t', '^(.{4})(.{2})(.{2})-', '$1-$2-$3 '))
df = df.withColumn("time", to_utc_timestamp(df.t, "UTC").alias('t'))
df.show(5,False)
print(df.dtypes)
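If the string is already in that ISO-like form, a plain cast to timestamp should also preserve the fractional seconds (a minimal sketch using the t column produced above, not verified across all Spark versions):
df = df.withColumn("time", df.t.cast("timestamp"))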

Related

Converting a string to Timestamp with Pyspark

I am currently attempting to convert a column "datetime" which has values that are dates/times in string form, and I want to convert the column such that all of the strings are converted to timestamps.
The date/time strings are of the form "10/11/2015 0:41", and I'd like to convert the string to a timestamp of form YYYY-MM-DD HH:MM:SS. At first I attempted to cast the column to timestamp in the following way:
df=df.withColumn("datetime", df["datetime"].cast("timestamp"))
Though when I did so, I received null for every value, which led me to believe that the input dates needed to be formatted somehow. I have looked into numerous other possible remedies such as to_timestamp(), though this also gives the same null results for all of the values. How can a string of this format be converted into a timestamp?
Any insights or guidance are greatly appreciated.
Try wrapping the parser in a UDF so it can be applied to a column:
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

@udf(returnType=TimestampType())
def to_timestamp_udf(date_string):
    return datetime.datetime.strptime(date_string, "%m/%d/%Y %H:%M")

df = df.withColumn("datetime", to_timestamp_udf(df.datetime))
You can use the to_timestamp function. See Datetime Patterns for valid date and time format patterns.
import pyspark.sql.functions as F

df = df.withColumn('datetime', F.to_timestamp('datetime', 'M/d/y H:m'))
df.show(truncate=False)
You were doing it in the right way, except you missed adding the format of the string, which in this case is MM/dd/yyyy HH:mm. Here M is used for months and m is used for minutes. Having said that, see the code below for reference -
df = spark.createDataFrame([('10/11/2015 0:41',), ('10/11/2013 10:30',), ('12/01/2016 15:56',)], ("String_Timestamp", ))
from pyspark.sql.functions import *
df.withColumn("Timestamp_Format", to_timestamp(col("String_Timestamp"), "MM/dd/yyyy HH:mm")).show(truncate=False)
+----------------+-------------------+
|String_Timestamp| Timestamp_Format|
+----------------+-------------------+
| 10/11/2015 0:41|2015-10-11 00:41:00|
|10/11/2013 10:30|2013-10-11 10:30:00|
|12/01/2016 15:56|2016-12-01 15:56:00|
+----------------+-------------------+

Converting timestamp to epoch milliseconds in pyspark

I have a dataset like the below:
+-------------+-------------------+
|epoch_seconds|eq_time            |
+-------------+-------------------+
|1636663343887|2021-11-12 02:12:23|
+-------------+-------------------+
Now, I am trying to convert the eq_time to epoch seconds which should match the value of the first column but am unable to do so. Below is my code:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df = spark.sql("select '1636663343887' as epoch_seconds")
df1 = df.withColumn("eq_time", from_unixtime(col("epoch_seconds") / 1000))
df2 = df1.withColumn("epoch_sec", unix_timestamp(df1.eq_time))
df2.show(truncate=False)
I am getting output like below:
+-------------+-------------------+----------+
|epoch_seconds|eq_time            |epoch_sec |
+-------------+-------------------+----------+
|1636663343887|2021-11-12 02:12:23|1636663343|
+-------------+-------------------+----------+
I tried this link as well, but it didn't help. My expected output is that the first and third columns should match each other.
P.S: I am using the Spark 3.1.1 version on local whereas it is Spark 2.4.3 in production, and my end goal would be to run it in production.
Use to_timestamp instead of from_unixtime to preserve the milliseconds part when you convert epoch to spark timestamp type.
Then, to go back to a timestamp in milliseconds, you can use the unix_timestamp function or cast to long type, and concatenate the result with the fraction-of-seconds part of the timestamp, which you get with date_format using the pattern S:
import pyspark.sql.functions as F
df = spark.sql("select '1636663343887' as epoch_ms")
df2 = df.withColumn(
    "eq_time",
    F.to_timestamp(F.col("epoch_ms") / 1000)
).withColumn(
    "epoch_milli",
    F.concat(F.unix_timestamp("eq_time"), F.date_format("eq_time", "S"))
)
df2.show(truncate=False)
#+-------------+-----------------------+-------------+
#|epoch_ms |eq_time |epoch_milli |
#+-------------+-----------------------+-------------+
#|1636663343887|2021-11-11 21:42:23.887|1636663343887|
#+-------------+-----------------------+-------------+
I prefer to do the timestamp conversion using only cast.
from pyspark.sql.functions import col
df = spark.sql("select '1636663343887' as epoch_seconds")
df = df.withColumn("eq_time", (col("epoch_seconds") / 1000).cast("timestamp"))
df = df.withColumn("epoch_sec", (col("eq_time").cast("double") * 1000).cast("long"))
df.show(truncate=False)
If you do it this way, you need to think in seconds throughout, and then it will work perfectly.
To convert between time formats in Python, the datetime.datetime.strptime() and .strftime() are useful.
To read the string from eq_time and process into a Python datetime object:
import datetime
t = datetime.datetime.strptime('2021-11-12 02:12:23', '%Y-%m-%d %H:%M:%S')
To print t in epoch_seconds format:
print(t.strftime('%s'))
Pandas has date processing functions which work along similar lines: Applying strptime function to pandas series
You could run this on the eq_time column, immediately after extracting the data, to ensure your DataFrame contains the date in the correct format.
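A sketch of the vectorised pandas equivalent (assuming a DataFrame df with an eq_time string column as in the question):
import pandas as pd

# assumes df['eq_time'] holds strings like '2021-11-12 02:12:23'
df['eq_time'] = pd.to_datetime(df['eq_time'], format='%Y-%m-%d %H:%M:%S')
df['epoch_sec'] = df['eq_time'].astype('int64') // 10**9  # nanoseconds -> seconds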

Issues with converting date time to proper format- Columns must be same length as key

I'm doing some data analysis on a dataset (https://www.kaggle.com/sudalairajkumar/covid19-in-usa) and I'm trying to convert the date and time column (lastModified) to the proper datetime format. When I tried it first, it returned an error:
ValueError: hour must be in 0..23
so I tried doing this -
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
data_df['lastModified'] = (pd.to_datetime(data_df.pop('date'), format='%d/%m/%Y') +
                           pd.to_timedelta(data_df.pop('time') + ':00'))
This gives an error - Columns must be same length as key
I understand this means that both columns I'm splitting aren't the same size. How do I resolve this issue? I'm relatively new to Python. Please explain in an easy-to-understand manner, thanks very much.
This is my whole code-
import pandas as pd
dataset_url = 'https://www.kaggle.com/sudalairajkumar/covid19-in-usa'
import opendatasets as od
od.download(dataset_url)
data_dir = './covid19-in-usa'
import os
os.listdir(data_dir)
data_df = pd.read_csv('./covid19-in-usa/us_covid19_daily.csv')
data_df
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
data_df['lastModified'] = (pd.to_datetime(data_df.pop('date'), format='%d/%m/%Y') +
                           pd.to_timedelta(data_df.pop('time') + ':00'))
Looks like lastModified is in ISO format. I have used something like the below to convert an ISO date string:
from dateutil import parser
from datetime import datetime
...
timestamp = parser.isoparse(lastModified).timestamp()
dt = datetime.fromtimestamp(timestamp)
...
On this line:
data_df[['date','time']] = data_df['lastModified'].str.split(expand=True)
In order to do this assignment, the number of columns on both sides of the = must be the same. split can output multiple columns, but it will only do this if it finds the character it's looking for to split on. By default, it splits by whitespace. There is no whitespace in the date column, and therefore it will not split. You can read the documentation for this here.
For that reason, this line should be like this, so it splits on the T:
data_df[['date','time']] = data_df['lastModified'].str.split('T', expand=True)
But the solution posted by @southiejoe is likely to be more reliable. These timestamps are in a standard format; parsing them is a previously-solved problem.
You need these libraries
#import
from dateutil import parser
from datetime import datetime
Then try writing something similar to convert the date and time columns. This way the columns should be the same length as the key:
#convert the time column to the correct datetime format
clock = parser.isoparse(lastModified).timestamp()
#convert the date column to the correct datetime format
data = datetime.fromtimestamp(clock)

Filtering pandas dataframe by timeframe in epoch

I have a dataframe which has a timestamp column in seconds since epoch format. It has the dtype float.
I want to filter the dataframe by a specific time window.
Approach:
zombieData[(zombieData['record-ts'] > period_one_start) & (zombieData['record-ts'] < period_one_end)]
This returns an empty dataframe. I can confirm that I have timestamps bigger than, smaller than, and inside my time frame.
I calculate my timestamps with the following method:
period_one_start = datetime.strptime('2020-12-06 03:30:00', '%Y-%m-%d %H:%M:%S').timestamp()
I'd be glad for any help. I guess my filtering logic is wrong, which confuses me, as filtering on one condition (e.g. everything after the start time) is working.
Thx for your help!
This looks messy, but I highly recommend it. Converting to pd.Timestamp beforehand is the most robust way to ensure a good comparison, and calling the gt/lt methods for greater-than and less-than will compute a little bit quicker in a majority of situations (especially for larger dataframes).
zombieData[zombieData['record-ts'].gt(pd.Timestamp('2020-12-06')) &
zombieData['record-ts'].lt(pd.Timestamp('2020-12-09'))]
New Option:
I learned of the between method. I think this is easier to read.
zombieData[zombieData['record-ts'].between(left=pd.Timestamp('2020-12-06'),
right=pd.Timestamp('2020-12-09'),
inclusive="neither")]
import pandas as pd
from datetime import datetime
import numpy as np
date = np.array('2020-12-01', dtype=np.datetime64)
dates = date + np.arange(12)
period_one_start = datetime.strptime('2020-12-06 03:30:00', '%Y-%m-%d %H:%M:%S').timestamp()
period_one_end = datetime.strptime('2020-12-09 03:30:00', '%Y-%m-%d %H:%M:%S').timestamp()
zombieData = pd.DataFrame( data= {"record-ts": dates} )
zombieData[ ((zombieData['record-ts'] > '2020-12-06') & (zombieData['record-ts'] < '2020-12-09')) ]
(if you want to keep your format)
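If your record-ts column really holds float epoch seconds, as described in the question, converting it once makes the comparisons above work directly (a small sketch, assuming the zombieData frame):
# convert the float epoch-seconds column to datetime64 once,
# then compare against pd.Timestamp values or date strings as shown above
zombieData['record-ts'] = pd.to_datetime(zombieData['record-ts'], unit='s')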

Using Pandas .apply() method with a regex-based function

I am trying to create a new column in a DataFrame by applying a function to a column that has numbers as strings.
I have written the function to extract the numbers I want and tested it on a single string input and can confirm that it works.
import re

SEARCH_PATTERN = r'([0-9]{1,2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})'

def get_total_time_minutes(time_col, pattern=SEARCH_PATTERN):
    """Uses regex to parse time_col, which is a string in the format 'd hh:mm:ss',
    to obtain a total time in minutes.
    """
    days, hours, minutes, _ = re.match(pattern, time_col).groups()
    total_time_minutes = (int(days)*24 + int(hours))*60 + int(minutes)
    return total_time_minutes
#test that the function works for a single input
text = "2 23:24:46"
print(get_total_time_minutes(text))
Output: 4284
#apply the function to the required columns
df['Minutes Available'] = df['Resource available (d hh:mm:ss)'].apply(get_total_time_minutes)
[Screenshot of my dataframe columns]
The 'Resources available (d hh:mm:ss)' column of my dataframe is of Pandas dtype 'O' (object, i.e. string, if my understanding is correct), and has data in the following format: '5 08:00:00'. When I call apply(get_total_time_minutes) on it though, I get the following error:
TypeError: expected string or bytes-like object
To clarify further: the "Resources Available" column is a string representing the total time in days, hours, minutes and seconds that the resource was available. I want to convert that time string to a total time in minutes, hence the regex and arithmetic within the get_total_time_minutes function.
This might be a bit hacky, because it uses pandas' datetime parsing to read the string and then turns it into a Timedelta by subtracting the default epoch:
>>> pd.to_datetime('2 23:48:30', format='%d %H:%M:%S') - pd.to_datetime('0', format='%S')
Out[47]: Timedelta('1 days 23:48:30')
>>> Out[47] / pd.Timedelta('1 minute')
Out[50]: 2868.5
But it does tell you how many minutes elapsed in those two days and however many hours. It's also vectorised, so you can apply it to the columns and get your minute values a lot faster than using apply.
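A sketch of that vectorised version applied to the whole column (assuming the column names from the question; note that %d is parsed as day-of-month here, matching the example above):
import pandas as pd

# assumes df['Resource available (d hh:mm:ss)'] holds strings like '2 23:48:30'
parsed = pd.to_datetime(df['Resource available (d hh:mm:ss)'], format='%d %H:%M:%S')
df['Minutes Available'] = (parsed - pd.to_datetime('0', format='%S')) / pd.Timedelta('1 minute')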
