I have a pandas.DataFrame indexed by time, as seen below. The time is in epoch time (milliseconds). When I graph the second column, these raw time values display along the x-axis. I want a more readable time in minutes:seconds.
In [13]: print df.head()
                   Time
1481044277379  0.581858
1481044277384  0.581858
1481044277417  0.581858
1481044277418  0.581858
1481044277467  0.581858
I have tried some pandas functions and some methods for converting the whole column; I visited the pandas docs, this question, and this cool site.
I am using pandas 0.18.1.
If you read your data with read_csv you can use a custom dateparser:
import datetime
import pandas as pd

# example.csv
'''
Time,Value
1481044277379,0.581858
1481044277384,0.581858
1481044277417,0.581858
1481044277418,0.581858
1481044277467,0.581858
'''

def dateparse(time_in_secs):
    # the raw values are epoch milliseconds; convert to seconds first
    time_in_secs = float(time_in_secs) / 1000
    return datetime.datetime.fromtimestamp(time_in_secs)

dtype = {"Time": float, "Value": float}
df = pd.read_csv("example.csv", dtype=dtype, parse_dates=["Time"], date_parser=dateparse)
print(df)
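As a side note (a sketch, not part of that answer), pandas can also convert epoch milliseconds in one step with unit='ms', which avoids the custom parser, and the minutes:seconds labels the question asks for then follow with strftime:

import pandas as pd

# assuming df['Time'] holds epoch milliseconds (use df.index instead for an epoch index)
df['Time'] = pd.to_datetime(df['Time'], unit='ms')
# readable minutes:seconds labels for the plot axis
df['M_S'] = df['Time'].dt.strftime('%M:%S')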
You can convert an epoch timestamp to HH:MM with:
import datetime as dt
hours_mins = dt.datetime.fromtimestamp(1347517370).strftime('%H:%M')
Adding a column to your pandas.DataFrame can be done as:
df['H_M'] = pd.Series([dt.datetime.fromtimestamp(int(ts)).strftime('%H:%M')
                       for ts in df['timestamp']]).values
I am calling some financial data from an API which is storing the time values as (I think) UTC (example below):
[screenshot: a DataFrame column of epoch-millisecond timestamps, e.g. 1645804609719]
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the following code, so I know this works, but I have 1000's of rows with this problem and thought pandas would offer an easier way to update all the values.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime, as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'], unit='ms')
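For reference, a quick sketch of what that produces (using the sample values from the next answer); the column becomes a real datetime64 type rather than strings:

import pandas as pd

df = pd.DataFrame({'date': [1584199972000, 1645804609719]})
df['date'] = pd.to_datetime(df['date'], unit='ms')
print(df)
#                      date
# 0 2020-03-14 15:32:52.000
# 1 2022-02-25 15:56:49.719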
You can use "to_numeric" to convert the column to numbers, "div" to divide it by 1000, and finally a loop over the DataFrame column with datetime to get the format you want.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30, 40]})
df['date'] = pd.to_numeric(df['date']).div(1000)
for i in range(len(df)):
    df.iloc[i, 0] = datetime.utcfromtimestamp(df.iloc[i, 0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
Output:
                  date  values
0  2020-03-14 15:32:52      30
1  2022-02-25 15:56:49      40
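As a side note (a sketch, not part of the original answer), the loop can be replaced with a single vectorized call that parses the milliseconds and formats them in one go:

df['date'] = pd.to_datetime(pd.to_numeric(df['date']), unit='ms').dt.strftime('%Y-%m-%d %H:%M:%S')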
I have a dataset like the below:
epoch_seconds    eq_time
1636663343887    2021-11-12 02:12:23
Now, I am trying to convert the eq_time back to epoch milliseconds, which should match the value of the first column, but am unable to do so. Below is my code:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df = spark.sql("select '1636663343887' as epoch_seconds")
df1 = df.withColumn("eq_time", from_unixtime(col("epoch_seconds") / 1000))
df2 = df1.withColumn("epoch_sec", unix_timestamp(df1.eq_time))
df2.show(truncate=False)
I am getting output like below:
+-------------+-------------------+----------+
|epoch_seconds|eq_time            |epoch_sec |
+-------------+-------------------+----------+
|1636663343887|2021-11-12 02:12:23|1636663343|
+-------------+-------------------+----------+
I tried this link as well, but it didn't help. My expected output is that the first and third columns should match each other.
P.S.: I am using Spark 3.1.1 locally, whereas production runs Spark 2.4.3, and my end goal is to run this in production.
Use to_timestamp instead of from_unixtime to preserve the milliseconds part when you convert the epoch value to a Spark timestamp type.
Then, to go back to a timestamp in milliseconds, you can use the unix_timestamp function (or cast to long type) and concatenate the result with the fraction-of-seconds part of the timestamp, which you get with date_format using the pattern S:
import pyspark.sql.functions as F
df = spark.sql("select '1636663343887' as epoch_ms")
df2 = df.withColumn(
    "eq_time",
    F.to_timestamp(F.col("epoch_ms") / 1000)
).withColumn(
    "epoch_milli",
    F.concat(F.unix_timestamp("eq_time"), F.date_format("eq_time", "S"))
)
df2.show(truncate=False)
#+-------------+-----------------------+-------------+
#|epoch_ms |eq_time |epoch_milli |
#+-------------+-----------------------+-------------+
#|1636663343887|2021-11-11 21:42:23.887|1636663343887|
#+-------------+-----------------------+-------------+
I prefer to do the timestamp conversion using only cast.
from pyspark.sql.functions import col
df = spark.sql("select '1636663343887' as epoch_seconds")
df = df.withColumn("eq_time", (col("epoch_seconds") / 1000).cast("timestamp"))
df = df.withColumn("epoch_sec", (col("eq_time").cast("double") * 1000).cast("long"))
df.show(truncate=False)
If you do it this way, you just need to think in seconds, and then it will work perfectly.
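For reference (this output is not shown in the original answer), you should see something like the below; the eq_time rendering depends on your Spark session time zone:

+-------------+-----------------------+-------------+
|epoch_seconds|eq_time                |epoch_sec    |
+-------------+-----------------------+-------------+
|1636663343887|2021-11-11 21:42:23.887|1636663343887|
+-------------+-----------------------+-------------+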
To convert between time formats in Python, the datetime.datetime.strptime() and .strftime() methods are useful.
To read the string from eq_time and process into a Python datetime object:
import datetime
t = datetime.datetime.strptime('2021-11-12 02:12:23', '%Y-%m-%d %H:%M:%S')
To print t in epoch_seconds format:
print(t.strftime('%s'))
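Note that '%s' is a platform-dependent strftime extension (it is not available on Windows); a portable alternative, assuming Python 3.3+, is:

print(int(t.timestamp()))  # seconds since the epoch; a naive datetime is interpreted as local time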
Pandas has date processing functions which work along similar lines: Applying strptime function to pandas series
You could run this on the eq_time column, immediately after extracting the data, to ensure your DataFrame contains the date in the correct format.
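A minimal sketch of that idea, assuming a DataFrame with an eq_time column of strings in the format above (the column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'eq_time': ['2021-11-12 02:12:23']})
parsed = pd.to_datetime(df['eq_time'], format='%Y-%m-%d %H:%M:%S')
# elapsed whole seconds since the epoch
df['epoch_seconds'] = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')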
I have a csv file with a long timestamp column (years):
1990-05-12 14:01
.
.
1999-01-10 10:00
where the time is in hh:mm format. I'm trying to extract each day's worth of data into a new csv file. Here's my code:
import datetime
import pandas as pd

df = pd.read_csv("/home/parallels/Desktop/ewh_log/hpwh_log.csv", parse_dates=True)

# change timestamp column format
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
    print(df)

def write_to_csv(df):
    print('writing ..')
    # todo

x1 = pd.to_datetime(df['timestamp'], format='%m-%d %H:%M').notnull().all()
if x1:
    extract_months_data(df)
else:
    x2 = pd.to_datetime(df['timestamp'])
    x2 = x2.dt.strftime('%m-%d %H:%M')
    write_to_csv(df)
The issue is that when I get to the following lines:
def extract_months_data(df):
    df = pd.to_datetime(df['timestamp'])
I get the following error:
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime
Is there an alternative solution with pandas that doesn't ignore the rest of the data? I saw posts that suggested using coerce, but that replaces the rest of the data with NaT.
Thanks
UPDATE:
This post here answers half of the question, which is how to filter hours (or minutes) out of a timestamp column. The second part is how to extract a full day to another csv file. I'll post updates here once I get to a solution.
You are converting to datetime twice, which is not needed.
Something like this should work:
import pandas as pd
df = pd.read_csv('data.csv')
df['month_data'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')
df['month_data'] = df['month_data'].dt.strftime('%m-%d %H:%M')
# if you don't want rows where month_data is NaN
df = df[df['month_data'].notna()]
print(df)
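To cover the second half of the question (writing one day's worth of rows to a new csv), a sketch building on the code above; the chosen day comes from the sample data and the filename is hypothetical:

# parse the full timestamps (year included) and filter one calendar day
dates = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M')
one_day = df[dates.dt.date == pd.Timestamp('1990-05-12').date()]
one_day.to_csv('1990-05-12.csv', index=False)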
import pandas

TimeSeries = pandas.Series(df['time_col'].values.tolist())
pandas.to_timedelta(TimeSeries).mean()
After taking the mean() I need to convert it to a Timestamp datatype to add it to a DataFrame.
The lines below are not working:
pandas.to_timestamp(pandas.to_timedelta(TimeSeries).mean())
pandas.Timestamp(pandas.to_timedelta(TimeSeries).mean())
Thanks in advance,
Ragu
To convert a timedelta to a timestamp, it is necessary to add some base datetime, e.g.:
import pandas as pd
out = pd.to_datetime('2000-01-01') + pd.to_timedelta(TimeSeries).mean()
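A quick sketch of what this yields, using the sample durations from the answer below:

import pandas as pd

TimeSeries = pd.Series(['00:00:02.285932', '00:00:11.366717', '00:00:11.367594'])
out = pd.to_datetime('2000-01-01') + pd.to_timedelta(TimeSeries).mean()
print(out)  # 2000-01-01 00:00:08.340081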
You could use the total_seconds of the mean timedelta and feed it to pd.Timestamp with the correct unit specified:
import pandas as pd
# example Series:
TimeSeriesDelta = pd.Series(pd.to_timedelta(['00:00:02.285932',
                                             '00:00:11.366717',
                                             '00:00:11.367594']))
timestamp = pd.Timestamp(TimeSeriesDelta.mean().total_seconds(), unit='s')
# Timestamp('1970-01-01 00:00:08.340081')
Note that this will add a date, 1970-01-01.
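If only the time-of-day part matters, a short sketch of stripping the dummy date back off:

print(timestamp.time())                   # 00:00:08.340081
print(timestamp.strftime('%H:%M:%S.%f'))  # 00:00:08.340081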
Data: pandas DataFrame, read from Excel
Month     Sales
01-01-17   1009
01-02-17   1004
..
01-12-19   2244
Code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import adfuller
import datetime
CHI = pd.read_excel('D:\DS\TS.xls', index="Month")
CHI['Month'] = pd.to_datetime(CHI['Month']).dt.date
CHI['NetSalesUSD'] = pd.to_numeric(CHI['NetSalesUSD'], errors='coerce')
result = adfuller(CHI)
Error received:
float() argument must be a string or a number, not 'datetime.date'
I tried converting to integer, but I'm still not able to get the results. Any suggestions?
I think the issue here is excel.
Excel likes to show dates as Month-Day for some reason.
Try changing the date format to short date in Excel, then save and run your Python script again.
It looks like Pandas is not recognizing the date format by default. You can instruct Pandas to use a custom date parser. See the Pandas documentation for more details.
In your case, it would look something like this:
def parse_custom_date(x):
    return pd.datetime.strptime(x, '%b-%y')
data_copy = pd.read_excel(
    'D:\DS\DATA.xls',
    'CHI',
    index='Month',
    parse_dates=['Month'],
    date_parser=parse_custom_date,
)
Note that your date format does not appear to have day of the month, so this would assume the first day of the month.
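For reference, a small sketch of how the '%b-%y' format behaves (the sample value here is made up); the unspecified day defaults to the 1st:

from datetime import datetime

print(datetime.strptime('Jan-17', '%b-%y'))  # 2017-01-01 00:00:00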