Transfer and write Parquet with Python and pandas gives a timestamp error - python

I tried to concat() two parquet files with pandas in Python.
It works, but when I try to write and save the DataFrame to a parquet file, it displays the error:
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:
I checked the pandas docs; it defaults the timestamp resolution to ms when writing the parquet file.
How can I write the parquet file with the original schema after concat?
Here is my code:
import pandas as pd

table1 = pd.read_parquet(path='path1.parquet', engine='pyarrow')
table2 = pd.read_parquet(path='path2.parquet', engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')

Pandas has forwarded unknown kwargs to the underlying parquet engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work; I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.

Thanks to @axel for the link to the Apache Arrow documentation:
allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if
microsecond or nanosecond data is lost when coercing to ‘ms’, do not
raise an exception.
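For concreteness, here is a minimal sketch of what the full call might look like (file names are placeholders); pairing allow_truncated_timestamps with coerce_timestamps='ms' makes the intended truncation explicit:

import pandas as pd

table1 = pd.read_parquet('path1.parquet', engine='pyarrow')
table2 = pd.read_parquet('path2.parquet', engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)

# Both kwargs are forwarded to pyarrow: truncate ns -> ms instead of raising ArrowInvalid.
table.to_parquet('./file.gzip', compression='gzip', engine='pyarrow',
                 coerce_timestamps='ms', allow_truncated_timestamps=True)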
It seems like in modern Pandas versions we can pass parameters to ParquetWriter.
The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):
df.to_parquet(filename, use_deprecated_int96_timestamps=True)

I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.
The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.
import pandas as pd

table1 = pd.read_parquet(path='path1.parquet')
table2 = pd.read_parquet(path='path2.parquet')
# Downcast the timestamp column to millisecond precision before writing.
table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')

I experienced a similar problem while using pd.to_parquet; my final workaround was to use the argument engine='fastparquet' (a minimal sketch follows the list below), but I realize this doesn't help if you need to use PyArrow specifically.
Things I tried which did not work:
@DrDeadKnee's workaround of manually casting columns with .astype("datetime64[ms]") did not work for me (pandas v0.24.2).
Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change the behaviour.
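A minimal sketch of the fastparquet fallback (file names are placeholders):

import pandas as pd

table1 = pd.read_parquet('path1.parquet')
table2 = pd.read_parquet('path2.parquet')
table = pd.concat([table1, table2], ignore_index=True)
# fastparquet applies its own timestamp handling, so the pyarrow cast error does not occur.
table.to_parquet('./file.gzip', compression='gzip', engine='fastparquet')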

I experienced a related order-of-magnitude problem when writing dask DataFrames with datetime64[ns] columns to AWS S3 and crawling them into Athena tables.
The problem was that subsequent Athena queries showed the datetime fields as year >57000 instead of 2020. I managed to use the following fix:
df.to_parquet(path, times="int96")
which forwards the kwarg times="int96" to fastparquet.writer.write().
I checked the resulting parquet file using the parquet-tools package; it indeed shows the datetime columns stored as INT96. On Athena (which is based on Presto) the INT96 format is well supported and does not have the order-of-magnitude problem.
Reference: https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py, function write(), kwarg times.
(dask 2.30.0 ; fastparquet 0.4.1 ; pandas 1.1.4)
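For reference, a slightly fuller sketch of the dask variant (bucket, prefix, and column names are hypothetical), assuming the fastparquet engine so that the times kwarg is passed through:

import dask.dataframe as dd

ddf = dd.read_csv('s3://my-bucket/raw/*.csv', parse_dates=['event_time'])
# times='int96' is forwarded to fastparquet.writer.write() and stores timestamps as INT96.
ddf.to_parquet('s3://my-bucket/curated/events/', engine='fastparquet', times='int96')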

Related

I can't read a parquet file with the pandas read_parquet function

When I use pd.read_parquet to read a parquet file, the error below is displayed.
My code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet
Please provide a minimal example, i.e., a small parquet file that generates the error.
It seems there are some open issues with this. Conventions are not compatible and apparently there are pitfalls in reading/writing dates in parquet via pandas. Thus, I propose a solution by directly using pyarrow:
import pyarrow.parquet as pq

table = pq.read_table('fhv_tripdata_2018-05.parquet')
df = table.to_pandas(timestamp_as_object=True)

# write to csv
df.to_csv('data.csv')
Note the extra flag timestamp_as_object which prevents the overflow you observed.

pandas raising OutOfBoundsDatetime on CSV but not on SQL

I have one service running pandas version 0.25.2. This service reads data from a database and stores a snapshot as CSV:
df = pd.read_sql_query(sql_cmd, oracle)
The query results in a dataframe with some very large datetime values (e.g. 3000-01-02 00:00:00).
Afterwards I use df.to_csv(index=False) to create a CSV snapshot and write it into a file.
On a different machine with pandas 0.25.3 installed, I am reading the content of the CSV file into a dataframe and trying to change the datatype of the date column to datetime. This results in an OutOfBoundsDatetime exception:
df = pd.read_csv("xy.csv")
pd.to_datetime(df['val_until'])
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-01-02 00:00:00
I am thinking about using pickle to create the snapshots and load the dataframes directly. However, I am curious why pandas is able to handle a big datetime in the first case and not in the second one.
Also, any suggestions for how I can keep using CSV as the transfer format are appreciated.
I believe I got it.
In the first case, I'm not sure what the actual data type stored in the SQL database is, but if not otherwise specified, reading it into the dataframe likely results in some generic or string type, which has a much higher overflow limit.
Eventually, though, it ends up in a CSV file, where it is just a string. A string can be arbitrarily long without any overflow, whereas the datetime64[ns] type that pandas.to_datetime casts to has a maximum value of '2262-04-11 23:47:16.854775807' (see pandas.Timestamp.max in the docs).
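If you want to keep CSV as the transfer format, a minimal sketch along these lines may help (assuming the 'val_until' column from your example): either coerce out-of-range values to NaT, or keep the full range as Python datetime objects instead of datetime64[ns].

import datetime as dt
import pandas as pd

df = pd.read_csv("xy.csv")

# Option 1: values beyond pandas.Timestamp.max become NaT
df["val_until_ts"] = pd.to_datetime(df["val_until"], errors="coerce")

# Option 2: keep the full range as Python datetime objects (object dtype)
df["val_until_obj"] = df["val_until"].apply(
    lambda s: dt.datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))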

Pandas dataframe type datetime64[ns] is not working in Hive/Athena

I am working on a Python application which just converts a CSV file to a Hive/Athena-compatible parquet format, and I am using the fastparquet and pandas libraries to perform this. There are timestamp values in the CSV file like 2018-12-21 23:45:00 which need to be written as timestamp type in the parquet file. Below is the code that I am running:
import io
import boto3
import pandas as pd
import s3fs
from fastparquet import write

columnNames = ["contentid", "processed_time", "access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0, sep=',',
                 quotechar='"', names=columnNames, error_bad_lines=False, dtype=dtypes, parse_dates=dateCols)
s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,
      file_scheme='hive', partition_on=PARTITION_KEYS)
The code ran successfully; below are the dtypes of the dataframe created by pandas:
contentid object
processed_time datetime64[ns]
access_time datetime64[ns]
And finally, when I queried the parquet file in Hive and Athena, the timestamp value is +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00.
Any help is highly appreciated
I know this question is old but it is still relevant.
As mentioned before, Athena only supports int96 timestamps.
Using fastparquet it is possible to generate a parquet file with the correct format for Athena. The important part is times='int96', as this tells fastparquet to convert pandas datetimes to int96 timestamps.
from fastparquet import write
import pandas as pd

def write_parquet():
    df = pd.read_csv('some.csv')
    write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')
I solved the problem this way:
transform the df series with the to_datetime method, then
use the .dt accessor to pick the date part of the datetime64[ns].
Example:
df.field = pd.to_datetime(df.field)
df.field = df.field.dt.date
After that, Athena will recognize the data.
You could try:
dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)
The problem seems to be with Athena: it only seems to support int96, and when you create a timestamp in pandas it is an int64.
My dataframe column that contains a string date is "sdate"; I first convert it to a timestamp:
import pandas
import pyarrow
import pyarrow.parquet
import s3fs

s3 = s3fs.S3FileSystem()  # filesystem handle used for writing to S3 below

# add a new column w/ timestamp
df["ndate"] = pandas.to_datetime(df["sdate"])
# convert the timestamp to microseconds
df["ndate"] = pandas.to_datetime(df["ndate"], unit='us')
# Then I convert my dataframe to pyarrow
table = pyarrow.Table.from_pandas(df, preserve_index=False)
# After that, when writing to parquet, add coerce_timestamps and
# use_deprecated_int96_timestamps. (Also writing to S3 directly)
OUTBUCKET = "my_s3_bucket"
pyarrow.parquet.write_to_dataset(table, root_path='s3://{0}/logs'.format(OUTBUCKET),
                                 partition_cols=['date'], filesystem=s3,
                                 coerce_timestamps='us', use_deprecated_int96_timestamps=True)
I also ran into this problem multiple times.
My mistake was setting the index to datetime format with:
df.set_index(pd.DatetimeIndex(df.index), inplace=True)
When I then read the parquet file with fastparquet, it raised:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 219968-03-28 05:07:11
However, it can easily be solved by using pd.read_parquet(path_file) rather than fastparquet.ParquetFile(path_file).to_pandas().
Please use pd.read_parquet(path_file) to fix this problem.
That's my solution and it works well; hope it helps, so you don't need to worry about which way to write the parquet file.
I was facing the same problem; after a lot of research, it is now solved.
When you do
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen, file_scheme='hive', partition_on=PARTITION_KEYS)
it uses fastparquet behind the scenes, which uses a different encoding for datetimes than what Athena is compatible with.
The solution is to uninstall fastparquet and install pyarrow:
pip uninstall fastparquet
pip install pyarrow
Run your code again. It should work this time. :)
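For reference, a minimal sketch of a pyarrow-based write (the bucket/prefix is hypothetical, and df is the dataframe from above); storing timestamps as INT96 is what Athena expects:

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
table = pa.Table.from_pandas(df, preserve_index=False)
# Extra kwargs are forwarded to the parquet writer; INT96 timestamps keep Athena happy.
pq.write_to_dataset(table, root_path='s3://my-bucket/outfile',
                    filesystem=fs, compression='SNAPPY',
                    use_deprecated_int96_timestamps=True)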

Proper way of writing and reading Dataframe to file in Python

I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write and later read a DataFrame? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format (note that to_hdf requires a key under which the frame is stored):
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning is because pandas has detected conflicting data types in those columns. You can specify the datatypes in the read call if you wish, e.g. by adding
dtype={'FIELD': int, 'FIELD2': str}
etc.
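A short sketch of what that looks like in the read call (the column names and dtypes here are hypothetical; using a nullable integer or string dtype avoids the mixed-type guessing):

import pandas as pd

df_final = pd.read_table('snapshot.tsv', encoding='utf8', index_col=[0, 1],
                         dtype={'FIELD': 'Int64', 'FIELD2': str})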

Exporting a dataframe in dataframe format to pass as an argument to the next program

I have certain computations performed on a dataset and I need the result to be stored in an external file.
Had it been CSV, to process it further I'd have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python program. That program must accept a DataFrame and not CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. - I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by pandas and by graphlab.SFrame, and besides that HDF5 is very fast.
Alternatively you can export the pandas DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
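And a short sketch of the HDF5 route mentioned above (the path is hypothetical; sf is the SFrame from the question):

import pandas as pd

df = sf.to_dataframe()                       # SFrame -> pandas DataFrame
df.to_hdf('/path/to/pd_frame.h5', key='df')  # requires PyTables

# from the same or another script
df2 = pd.read_hdf('/path/to/pd_frame.h5', key='df')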
