I have a dataset like the below:
epoch_seconds    eq_time
1636663343887    2021-11-12 02:12:23
Now I am trying to convert eq_time back to epoch seconds so that it matches the value in the first column, but I am unable to do so. Below is my code:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df = spark.sql("select '1636663343887' as epoch_seconds")
df1 = df.withColumn("eq_time", from_unixtime(col("epoch_seconds") / 1000))
df2 = df1.withColumn("epoch_sec", unix_timestamp(df1.eq_time))
df2.show(truncate=False)
I am getting output like below:
+-------------+-------------------+----------+
|epoch_seconds|eq_time            |epoch_sec |
+-------------+-------------------+----------+
|1636663343887|2021-11-12 02:12:23|1636663343|
+-------------+-------------------+----------+
I tried this link as well, but it didn't help. My expected output is for the first and third columns to match each other.
P.S.: I am using Spark 3.1.1 locally, whereas production runs Spark 2.4.3, and my end goal is to run this in production.
Use to_timestamp instead of from_unixtime to preserve the milliseconds part when you convert the epoch value to Spark's timestamp type.
Then, to go back to an epoch value in milliseconds, use the unix_timestamp function (or a cast to long) and concatenate the result with the fractional-seconds part of the timestamp, which you can get with date_format using the pattern S:
import pyspark.sql.functions as F
df = spark.sql("select '1636663343887' as epoch_ms")
df2 = df.withColumn(
    "eq_time",
    F.to_timestamp(F.col("epoch_ms") / 1000)
).withColumn(
    "epoch_milli",
    F.concat(F.unix_timestamp("eq_time"), F.date_format("eq_time", "S"))
)
df2.show(truncate=False)
#+-------------+-----------------------+-------------+
#|epoch_ms |eq_time |epoch_milli |
#+-------------+-----------------------+-------------+
#|1636663343887|2021-11-11 21:42:23.887|1636663343887|
#+-------------+-----------------------+-------------+
I prefer to do the timestamp conversion using only cast.
from pyspark.sql.functions import col
df = spark.sql("select '1636663343887' as epoch_seconds")
df = df.withColumn("eq_time", (col("epoch_seconds") / 1000).cast("timestamp"))
df = df.withColumn("epoch_sec", (col("eq_time").cast("double") * 1000).cast("long"))
df.show(truncate=False)
If you do it this way, you just need to think in seconds, and then it will work perfectly.
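If exact millisecond round-tripping matters, one small guard (a sketch, not part of the original answer) is to round before the final cast, since multiplying a double by 1000 and then truncating with cast("long") can occasionally land one millisecond low:

from pyspark.sql.functions import round as spark_round

# Hedged variant: round the double before casting so floating-point
# representation error cannot truncate the value down by one millisecond.
df = df.withColumn("epoch_sec", spark_round(col("eq_time").cast("double") * 1000).cast("long"))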
To convert between time formats in Python, the datetime.datetime.strptime() and .strftime() methods are useful.
To read the string from eq_time and process into a Python datetime object:
import datetime
t = datetime.datetime.strptime('2021-11-12 02:12:23', '%Y-%m-%d %H:%M:%S')
To print t in epoch_seconds format:
print(t.strftime('%s'))
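Note that '%s' is a platform-specific strftime directive (it is not available on Windows, for example). A more portable sketch, assuming t should be interpreted in the local time zone:

# datetime.timestamp() treats a naive datetime as local time and returns epoch seconds
print(int(t.timestamp()))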
Pandas has date processing functions which work along similar lines: Applying strptime function to pandas series
You could run this on the eq_time column, immediately after extracting the data, to ensure your DataFrame contains the date in the correct format.
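For example, a minimal pandas sketch (assuming eq_time holds strings shaped like the sample row above; time-zone handling is left aside here):

import pandas as pd

df = pd.DataFrame({'eq_time': ['2021-11-12 02:12:23']})
# parse the whole column at once instead of calling strptime row by row
df['eq_time'] = pd.to_datetime(df['eq_time'], format='%Y-%m-%d %H:%M:%S')
# datetime64[ns] -> whole seconds since the epoch
df['epoch_seconds'] = (df['eq_time'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
print(df)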
How to convert Long "1206946690" to date format "yyyy-MM-dd" using PySpark?
There is absolutely no need to use pyspark for this thing whatsoever.
Converting from a UNIX timestamp to a date is covered by the datetime module in Python's standard library; just use it.
Example:
from datetime import datetime
def to_date(n):
    return datetime.fromtimestamp(n).strftime('%Y-%m-%d')
>>> to_date(1206946690)
'2008-03-31'
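That said, if the values do sit in a Spark column, the same plain-Python function can still be applied through a UDF. A sketch (assuming an active spark session), not part of the original answer:

from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def to_date(n):
    return datetime.fromtimestamp(n).strftime('%Y-%m-%d')

# wrap the plain function as a UDF and apply it column-wise
to_date_udf = F.udf(to_date, StringType())
df = spark.createDataFrame([(1206946690,)], ["timestamp"])
df.withColumn("date", to_date_udf(F.col("timestamp"))).show()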
See the documentation; the extra thing needed is a cast to date, as in the example below.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.from_unixtime.html
pyspark.sql.functions.from_unixtime(timestamp: ColumnOrName, format: str = 'yyyy-MM-dd HH:mm:ss') → pyspark.sql.column.Column
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
from pyspark.sql.functions import from_unixtime

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
time_df = spark.createDataFrame([(1428476400,)], ['unix_time'])
time_df.select(from_unixtime('unix_time').cast("date")).show()
You need to cast to date to return the required yyyy-MM-dd format:
from pyspark.sql import functions as F
from pyspark.sql.types import DateType

df = spark.createDataFrame([(1206946690,)], ["timestamp"])
# parse epoch seconds, format as yyyy-MM-dd, then cast to a date type
df = df.withColumn("date", F.from_unixtime(df["timestamp"], "yyyy-MM-dd").cast(DateType()))
df.show()
I am calling some financial data from an API which stores the time values as (I think) UTC epoch milliseconds, for example 1645804609719.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know this works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
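A minimal end-to-end sketch of that approach (the sample values here are borrowed from the answer below):

import pandas as pd

df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30, 40]})
# epoch milliseconds stored as strings -> integers -> pandas datetime64[ns]
df['date'] = pd.to_datetime(df['date'].astype('int64'), unit='ms')
print(df)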
You can use "to_numeric" to convert the column in integers, "div" to divide it by 1000 and finally a loop to iterate the dataframe column with datetime to get the format you want.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30, 40]})
df['date'] = pd.to_numeric(df['date']).div(1000)
for i in range(len(df)):
    df.iloc[i, 0] = datetime.utcfromtimestamp(df.iloc[i, 0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
Output:
date values
0 2020-03-14 15:32:52 30
1 2022-02-25 15:56:49 40
I'm trying to change the time format of my data, which is currently in the form 15:41:28:4330 (hh:mm:ss:msmsmsms), to seconds.
I browsed through some of the pandas documentation but can't seem to find this format anywhere.
Would it be possible to simply calculate the seconds from that time format row by row?
You'll want to obtain a timedelta and call its total_seconds method to get seconds after midnight. So you can parse to datetime first and subtract the default date (which will be added automatically). Ex:
#1 - via datetime
import pandas as pd
df = pd.DataFrame({'time': ["15:41:28:4330"]})
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S:%f')
df['sec_after_mdnt'] = (df['time']-df['time'].dt.floor('d')).dt.total_seconds()
df
                     time  sec_after_mdnt
0 1900-01-01 15:41:28.433       56488.433
Alternatively, you can clean your time format and parse directly to timedelta:
#2 - str cleaning & to timedelta
df = pd.DataFrame({'time': ["15:41:28:4330"]})
# last separator must be a dot...
df['time'] = df['time'].str[::-1].str.replace(':', '.', n=1, regex=False).str[::-1]
df['sec_after_mdnt'] = pd.to_timedelta(df['time']).dt.total_seconds()
df
            time  sec_after_mdnt
0  15:41:28.4330       56488.433
I am getting the date from Quickbase in the format "1609372800000". I know how to convert this into the correct date format in plain Python.
The code is:
import datetime
date = datetime.datetime.fromtimestamp(1609372800000/1000.0)
date = date.strftime('%Y-%m-%d')
Now I want to apply this calculation to a column of a PySpark DataFrame.
I tried using this code, but it is giving me the error:
Expecting integer but received col type
df.withColumn("product_availability_due_date",col("product_availability_due_date").cast('int'))
df.withColumn('product_availability_due_date_1',datetime.datetime.fromtimestamp(col('product_availability_due_date')/1000.0).strftime('%Y-%m-%d'))
product_availability_due_date: this column's datatype is string.
You can use from_unixtime to do the conversion:
import pyspark.sql.functions as F
df2 = df.withColumn(
'product_availability_due_date_1',
F.from_unixtime((F.col('product_availability_due_date').cast('long') / 1000))
)
df2.show()
+-----------------------------+-------------------------------+
|product_availability_due_date|product_availability_due_date_1|
+-----------------------------+-------------------------------+
| 1609372800000| 2020-12-31 00:00:00|
+-----------------------------+-------------------------------+
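If only the yyyy-MM-dd date is needed (to match the strftime('%Y-%m-%d') call in the question), a small variation (a sketch, not part of the original answer) is to pass a format to from_unixtime:

df3 = df.withColumn(
    'product_availability_due_date_1',
    # format the epoch seconds directly as a date-only string
    F.from_unixtime(F.col('product_availability_due_date').cast('long') / 1000, 'yyyy-MM-dd')
)
df3.show()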
I have a PySpark dataframe with the following time format: 20190111-08:15:45.275753. I want to convert this to timestamp format, keeping the microsecond granularity. However, it appears difficult to keep the microseconds, as all time conversions in PySpark seem to produce seconds.
Do you have a clue on how this can be done? Note that converting it to pandas etc. will not work as the dataset is huge, so I need an efficient way of doing this. An example of what I am doing is below:
from pyspark.sql.functions import col, unix_timestamp

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])
res = time_df.withColumn("time", unix_timestamp(col("dt"),
                                 format='yyyyMMdd-HH:mm:ss.000').alias("time"))
res.show(5, False)
Normally timestamp granularity is in seconds so I do not think there is a direct method to keep milliseconds granularity.
In PySpark there is the function unix_timestamp, documented as:
unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default)
to Unix time stamp (in seconds), using the default timezone and the default
locale, return null if fail.
if `timestamp` is None, then it returns current timestamp.
>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> time_df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
[Row(unix_time=1428476400)]
>>> spark.conf.unset("spark.sql.session.timeZone")
A usage example:
import pyspark.sql.functions as F
res = df.withColumn(colName, F.unix_timestamp(F.col(colName),
                                              format='yyyy-MM-dd HH:mm:ss.000').alias(colName))
What you might do is split your date string (str.rsplit('.', 1)) to keep the milliseconds apart (for example, in another column) in your dataframe.
EDIT
In your example the problem is that the time is of type string. First you need to convert it to a timestamp type: this can be done with:
res = time_df.withColumn("new_col", F.to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))
Then you can use unix_timestamp
res2 = res.withColumn("time", F.unix_timestamp(F.col("new_col")).alias("time"))
Finally, to create a column with the milliseconds:
res3 = res2.withColumn("ms", F.split(res2['dt'], '[.]').getItem(1))
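If you then need a single full-precision epoch value, a sketch (not part of the original answer, and assuming the fractional part always has six digits as in the example) is to concatenate the two columns:

# epoch seconds (as string) + 6-digit fraction -> epoch microseconds
res4 = res3.withColumn(
    "epoch_micro",
    F.concat(F.col("time").cast("string"), F.col("ms")).cast("long")
)
res4.show(5, False)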
I've found a workaround for this using the to_utc_timestamp function in PySpark; I'm not entirely sure whether it is the most efficient, but it seems to work fine on about 100 million rows of data. You can avoid the regexp_replace if your timestamp string looks like this:
1997-02-28 10:30:40.897748
from pyspark.sql.functions import regexp_replace, to_utc_timestamp
df = spark.createDataFrame([('19970228-10:30:40.897748',)], ['new_t'])
df = df.withColumn('t', regexp_replace('new_t', '^(.{4})(.{2})(.{2})-', '$1-$2-$3 '))
df = df.withColumn("time", to_utc_timestamp(df.t, "UTC").alias('t'))
df.show(5,False)
print(df.dtypes)
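For what it's worth, since the string is already in yyyy-MM-dd HH:mm:ss.SSSSSS shape after the regexp_replace, a plain cast appears to do the same job; a sketch, not part of the original answer:

# casting a well-formed timestamp string to timestamp keeps the microseconds as well
df = df.withColumn("time2", df.t.cast("timestamp"))
df.show(5, False)
print(df.dtypes)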