I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (presumably timestamps) instead of dates. For example, I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.
Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into a date using .dt.date.
import pandas as pd
import dask.dataframe as dd
sample_dates = [
'2019-01-01 00:01:00',
'2019-01-02 05:04:02',
'2019-01-02 15:04:02'
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)
# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()
# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date
ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
For time conversion, you can use .astype or dd.to_datetime; see the answers to this question. There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
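For instance, here is a minimal sketch of that approach (the file name, column names and sample values are illustrative; the coerce_timestamps keyword is passed through to pyarrow, as in the snippet above):
import pandas as pd
import dask.dataframe as dd

# illustrative data; in practice this would be your existing frame
df = pd.DataFrame({'datestring': ['2019-01-01', '2019-02-01'], 'value': [1, 2]})
ddf = dd.from_pandas(df, npartitions=1)

# parse the strings with dd.to_datetime instead of .astype
ddf['date'] = dd.to_datetime(ddf['datestring'])

# downcast timestamps to millisecond precision on write, so readers that
# expect ms-based values interpret them correctly
ddf.to_parquet('dates_parquet', engine='pyarrow',
               coerce_timestamps='ms', allow_truncated_timestamps=True)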
Playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scaling of the variable:
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
If memory serves, Drill uses an old non-standard INT96 timestamp format, which was never supported by Parquet. A Parquet timestamp is essentially a UNIX timestamp, stored as an int64 with varying precision. Drill must have a function to correctly convert this to its internal format.
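If it helps to check what was actually written, you can inspect the Parquet schema with pyarrow (a minimal sketch; the part-file name depends on what Dask produced in your output folder):
import pyarrow.parquet as pq

# a timestamp column will typically show up as e.g. timestamp[us] or timestamp[ms]
schema = pq.read_schema('/some/folder/part.0.parquet')
print(schema)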
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but it might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
Here's a link to the Drill docs about the TO_TIMESTAMP() function: https://drill.apache.org/docs/data-type-conversion/#to_timestamp. I think @mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP(<date_col> / 1000) FROM ...
Related
Question: Why is myTimeStampCol1 in the following code returning a null value in the third row, and how can we fix the issue?
from pyspark.sql.functions import *
df=spark.createDataFrame(data = [ ("1","Arpit","2021-07-24 12:01:19.000"),("2","Anand","2019-07-22 13:02:20.000"),("3","Mike","11-16-2021 18:00:08")],
schema=["id","Name","myTimeStampCol"])
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),"yyyy-MM-dd HH:mm:ss.SSSS").alias("myTimeStampCol1")).show()
Output
+--------------------+-------------------+
|      myTimeStampCol|    myTimeStampCol1|
+--------------------+-------------------+
|2021-07-24 12:01:...|2021-07-24 12:01:19|
|2019-07-22 13:02:...|2019-07-22 13:02:20|
| 11-16-2021 18:00:08|               null|
+--------------------+-------------------+
Remarks:
I'm running the code in a Python notebook in Azure Databricks (which is almost the same as Databricks).
The above example is just a sample to explain the issue. The real code imports a data file with millions of records, and the file has a column in the format MM-dd-yyyy HH:mm:ss (for example 11-16-2021 18:00:08); all the values in that column have exactly the same format.
The error occurs because of the difference in formats. Since all the records in this column are in the format MM-dd-yyyy HH:mm:ss, you can modify the code as follows.
df.select(col("myTimeStampCol"),to_timestamp(col("myTimeStampCol"),'MM-dd-yyyy HH:mm:ss').alias("myTimeStampCol1")).show(truncate=False)
#only if all the records in this column are 'MM-dd-yyyy HH:mm:ss' format
to_timestamp() expects either one or two arguments: a column with the timestamp values and, optionally, the format of those values. Since all of these values are in the same format, MM-dd-yyyy HH:mm:ss, you can specify it as the second argument.
A sample output for this case is given in the below image:
It seems like the timestamp pattern at index #3 is not aligned with the others.
Spark uses the default patterns: yyyy-MM-dd for dates and yyyy-MM-dd HH:mm:ss for timestamps.
Changing the value to match the default format should solve the problem: 2021-11-16 18:00:08
Edit #1:
Alternatively, creating a custom transformation function may be a good idea (sorry, I only found an example in Scala).
Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe
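As a hedged Python sketch of the same idea (reusing the df from the example above), you can try each known pattern and keep whichever one parses:
from pyspark.sql.functions import coalesce, col, to_timestamp

# to_timestamp returns null when a pattern does not match, and coalesce keeps
# the first non-null result per row
df_parsed = df.select(
    col("myTimeStampCol"),
    coalesce(
        to_timestamp(col("myTimeStampCol"), "yyyy-MM-dd HH:mm:ss.SSSS"),
        to_timestamp(col("myTimeStampCol"), "MM-dd-yyyy HH:mm:ss"),
    ).alias("myTimeStampCol1"),
)
df_parsed.show(truncate=False)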
I have values that are measured event-related, so there is not the same amount of data every minute. To handle this data better, I aim to take only the first row of values every minute.
The time of the data I import from a csv looks like this:
time
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:12
11.11.2011 11:12
11.11.2011 11:13
The other values are temperatures.
One main problem is to import the time in the right format.
I tried to solve this with the help of this community like this:
import datetime as dt

with open('my_file.csv', 'r') as file:
    for line in file:
        try:
            time = line.split(';')[0]  # splits the line at the semicolon and takes the first part
            time = dt.datetime.strptime(time, '%d.%m.%Y %H:%M')
            print(time)
        except:
            pass  # skip lines (e.g. the header) that don't parse as a date
Then I imported the columns of the temperatures and joined them like this:
df = pd.read_csv("my_file.csv", sep=';', encoding='latin-1')
df=df[["time", "T1", "T2", "DT1", "DT2"]]
When I printed the dtypes of my data, the time was datetime64[ns] and the others were objects.
I tried different options of groupby and resample. Like the following:
df=df.groupby([pd.Grouper(key = 'time', freq='1min')])
df.resample('M')
One main problem that was stated in the error messages was that the datatype of the time was not appropriate for grouping, because it is not a DatetimeIndex.
So I tried to convert the dates to a DatetimeIndex like this:
df.index = pd.to_datetime(daten["time"].index, format='%Y-%m-%d %H:%M:%S')
but then I received a numeration of the index starting with 1970-01-01, so I am not quite sure if this conversion is possible with irregular data.
Without this conversion I also get the message <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026938A74850>
When I then try to display my dataframe, that message shows, and when I save it to csv like this:
df.to_csv('04_01_DTempminuten.csv', index=False, encoding='utf-8', sep =';', date_format = '%Y-%m-%d %H:%M:%S')
I receive either the same message or only one line with a decimal number instead of the time.
Does anyone have an idea how to deal with this irregular data to get one line of values each minute?
Thank you for reading my question. I am really thankful for any ideas.
Without sample data I can only show how I do it with irregular time series, which I think is your case. I work with price data that comes at irregular time intervals. So if you need to sample by taking the first value in each minute, you can use resample for a specific interval with the ohlc aggregation function, which will give you four columns for each sample interval:
open: first value in the interval
high: highest value in the interval
low: lowest value in the interval
close: last value in the interval
In your case the sampling interval would be 1 minute ('T').
In the following example I'm using one second ('S') as resampling frequency, to resample ask column (your temperature column):
import pandas as pd
df = pd.read_csv('my_tick_data.csv')
df['date_time'] = pd.to_datetime(df['date_time'])
df.set_index('date_time', inplace=True)
df.head(6)
df['ask'].resample('S').ohlc()
This does not solve your date issue, which is a prerequisite for this part because the data set needs to be indexed by date. If you can provide sample data, maybe I can help you with that part as well.
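For the specific goal of keeping only the first row of values each minute, here is a minimal sketch assuming the semicolon-separated layout and day-first timestamps described in the question:
import pandas as pd

# parse the day-first timestamps explicitly so the column becomes datetime64
df = pd.read_csv('my_file.csv', sep=';', encoding='latin-1')
df['time'] = pd.to_datetime(df['time'], format='%d.%m.%Y %H:%M')
df = df.set_index('time')

# keep the first observed row in every minute; drop minutes with no data
first_per_minute = df.resample('T').first().dropna(how='all')
first_per_minute.to_csv('04_01_DTempminuten.csv', sep=';',
                        date_format='%Y-%m-%d %H:%M:%S')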
I have multiple csv files, I've set DateTime as the index.
df6.set_index("gmtime", inplace=True)
#correct the underscores in old datetime format
df6.index = [" ".join( str(val).split("_")) for val in df6.index]
df6.index = pd.to_datetime(df6.index)
The time was supposed to be in GMT, but I think it was saved as BST (British Summer Time) when I set the clock on the Raspberry Pi.
I want to shift the time one hour backwards. When I use
df6.tz_convert(pytz.timezone('utc'))
it gives me the error below, as it assumes that the time is already correct:
Cannot convert tz-naive timestamps, use tz_localize to localize
How can I shift the time by one hour?
Given a column that contains date/time info as string, you would convert to datetime, localize to a time zone (here: Europe/London), then convert to UTC. You can do that before you set as index.
Ex:
import pandas as pd
dti = pd.to_datetime(["2021-09-01"]).tz_localize("Europe/London").tz_convert("UTC")
print(dti) # notice 1 hour shift:
# DatetimeIndex(['2021-08-31 23:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
Note: setting a time zone means that DST is accounted for, i.e. here, during winter you'd have UTC+0 and during summer UTC+1.
To add to FObersteiner's response (sorry, new user, can't comment on answers yet):
I've noticed that in all the real-world situations I've run across (with full dataframes or pandas Series instead of just a single date), .tz_localize() and .tz_convert() need to be called slightly differently.
What's worked for me is
df['column'] = pd.to_datetime(df['column']).dt.tz_localize('Europe/London').dt.tz_convert('UTC')
Without the .dt, I get "index is not a valid DatetimeIndex or PeriodIndex."
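A small runnable sketch of the Series-based variant (the column name and sample values are just illustrative):
import pandas as pd

df = pd.DataFrame({'gmtime': ['2021-06-01 12:00:00', '2021-12-01 12:00:00']})

# localize to Europe/London, then convert to UTC; the summer value shifts back
# one hour (BST -> UTC) while the winter value stays the same
df['gmtime'] = (pd.to_datetime(df['gmtime'])
                  .dt.tz_localize('Europe/London')
                  .dt.tz_convert('UTC'))
print(df['gmtime'])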
My goal is to convert period to datetime.
If Life Was Easy:
master_df = master_df['Month'].to_datetime()
Back Story:
I built a new dataFrame that originally summed the monthly totals and made a 'Month' column by converting a timestamp to period. Now I want to convert that time period back to a timestamp so that I can create plots using matplotlib.
I have tried following:
Reading the docs for Period.to_timestamp.
Converting to a string and then back to datetime. Still keeps the period issue and won't convert.
Following a couple of similar questions on Stack Overflow, but I could not seem to get them to work.
A simple goal would be to plot the following:
plot.bar(m_totals['Month'], m_totals['Showroom Visits']);
This is the error I get if I try to use a period dtype in my charts:
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value.
This often happens if you pass a non-datetime value to an axis that has datetime units.
Additional Material:
Code I used to create the Month column (where period issue was created):
master_df['Month'] = master_df['Entry Date'].dt.to_period('M')
Codes I used to group to monthly totals:
m_sums = master_df.groupby(['DealerName','Month']).sum().drop(columns={'Avg. Response Time','Closing Percent'})
m_means = master_df.groupby(['DealerName','Month']).mean()
m_means = m_means[['Avg. Response Time','Closing Percent']]
m_totals = m_sums.join(m_means)
m_totals.reset_index(inplace=True)
m_totals
Resulting DataFrame:
I was able to cast the period type to string then to datetime. Just could not go straight from period to datetime.
m_totals['Month'] = m_totals['Month'].astype(str)
m_totals['Month'] = pd.to_datetime(m_totals['Month'])
m_totals.dtypes
I wish I did not get downvoted for not providing the entire dataFrame.
First change it to str, then to date:
import pandas as pd

# the DataFrame df is assumed to already have 144 rows
index = pd.period_range(start='1949-01', periods=144, freq='M')
type(index)
# changing period to date: cast to str, then parse back to datetime
index = index.astype(str)
index = pd.to_datetime(index)
df.set_index(index, inplace=True)
type(df.index)
df.info()
Another potential solution is to use to_timestamp. For example: m_totals['Month'] = m_totals['Month'].dt.to_timestamp()
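A minimal sketch of that approach on a standalone frame (the column names and values here are illustrative):
import pandas as pd

# a period[M] column like the one produced by .dt.to_period('M')
df = pd.DataFrame({'Month': pd.period_range('2019-01', periods=3, freq='M'),
                   'Showroom Visits': [10, 20, 30]})

# convert the periods back to timestamps (start of each month), which
# matplotlib can plot directly
df['Month'] = df['Month'].dt.to_timestamp()
print(df.dtypes)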
I have a Pandas dataframe that includes a timestamp column starting from 0.
The first row always starts at time = 0, and the following rows get the relative time from that point. So if, for example, the second row comes 0.25 seconds after the first, it gets the timestamp 0.25.
I want to use the timestamp column mainly for the ability to do resampling and interpolation. So, as far as I know, for that purpose it has to be some time related dtype (pd.Timestamp in my case).
Now, what I additionally want is to be able to save the dataframe as a CSV file afterwards. Unfortunately, the pd.Timestamp column is then stored as a datetime string of the format
1970-01-01 00:00:00.000000000
I'd however like to save it like it comes in: as a float value starting from 0.0.
I'm thinking about storing the timestamp in the dataframe as two separate columns, one in pd.Timestamp format and the other with the same original value as float.
Additionally, the data value floats in the frame are stored in the format %7.3f, but the float value of the timestamp should be more precise, e.g. %.6f or even more decimal digits. So on top of that, I'd need a different float format for a single column.
How can I do all of that together?
Your times sound more like Timedeltas to me. You can initialise them as 0, add them together, and then represent them as floats with pd.Timedelta.total_seconds().
import pandas as pd

t0 = pd.Timedelta(0)               # start of the recording
t1 = t0 + pd.Timedelta('0.25s')    # 0.25 seconds later
t1_as_float = t1.total_seconds()   # 0.25
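Putting it together, a minimal sketch that indexes by Timedelta for resampling/interpolation and then writes the relative time back out as a float with more precision than the other columns (the column names, file name and values are illustrative):
import pandas as pd

# relative times in seconds, as they come in
df = pd.DataFrame({'t': [0.0, 0.25, 0.5, 1.0],
                   'value': [1.234, 2.345, 3.456, 4.567]})

# index by Timedelta so resampling and interpolation work
df.index = pd.to_timedelta(df['t'], unit='s')
upsampled = df['value'].resample('100ms').mean().interpolate()

# write the original floats back out; pre-format the time column so it keeps
# more decimals than the %7.3f used for the data columns
out = df.copy()
out['t'] = [f'{s:.6f}' for s in out.index.total_seconds()]
out.to_csv('out.csv', index=False, float_format='%7.3f')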