When I use pd.read_parquet to read a parquet file, the error below is displayed.
My code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet
Please provide a minimal example, i.e., a small parquet file that generates the error.
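Here is a minimal sketch that reproduces the error for me, assuming an older pandas/pyarrow where timestamp[us] columns are force-cast to nanoseconds on load (the column name is just illustrative):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# one value of 32094334800000000 microseconds (roughly the year 2987), taken from the error message
arr = pa.array([32094334800000000], type=pa.timestamp('us'))
pq.write_table(pa.table({'dropOff_datetime': arr}), 'tiny.parquet')
pd.read_parquet('tiny.parquet')  # raises the ArrowInvalid cast error here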
It seems there are some open issues with this. Conventions are not compatible and apparently there are pitfalls in reading/writing dates in parquet via pandas. Thus, I propose a solution by directly using pyarrow:
import pyarrow.parquet as pq
table = pq.read_table('fhv_tripdata_2018-05.parquet')
df = table.to_pandas(timestamp_as_object=True)
# to csv
df.to_csv('data.csv')
Note the extra flag timestamp_as_object which prevents the overflow you observed.
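If you only need the CSV and not a pandas DataFrame at all, a sketch of an alternative (assuming a reasonably recent pyarrow, where pyarrow.csv.write_csv is available) is to let pyarrow write the CSV directly:
import pyarrow.parquet as pq
import pyarrow.csv as pacsv
table = pq.read_table('fhv_tripdata_2018-05.parquet')
pacsv.write_csv(table, 'data.csv')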
I have one service running pandas version 0.25.2. This service reads data from a database and stores a snapshot as csv
df = pd.read_sql_query(sql_cmd, oracle)
The query results in a dataframe with some very large datetime values (e.g. 3000-01-02 00:00:00).
Afterwards I use df.to_csv(index=False) to create a csv snapshot and write it into a file
On a different machine with pandas 0.25.3 installed, I read the content of the csv file into a dataframe and try to change the datatype of the date column to datetime. This results in an OutOfBoundsDatetime exception:
df = pd.read_csv("xy.csv")
pd.to_datetime(df['val_until'])
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-01-02 00:00:00
I am thinking about using pickle to create the snapshots and load the dataframes directly. However, I am curious why pandas is able to handle a big datetime in the first case and not in the second one.
Also, any suggestions on how to keep using csv as the transfer format are appreciated.
I believe I got it.
In the first case, I'm not sure what the actual data type stored in the sql database is, but if not otherwise specified, reading it into the df likely results in some generic or string type, which has a much higher overflow limit.
Eventually, though, it ends up in a csv file, which stores everything as strings. A string can be arbitrarily long without any overflow, whereas the datetime64[ns] type that pandas.to_datetime casts into has a maximum value of 2262-04-11 23:47:16.854775807, i.e. pandas.Timestamp.max.
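If you want to keep csv as the transfer format, one sketch (my suggestion, assuming the values look exactly like 3000-01-02 00:00:00) is to parse the column into plain Python datetime objects instead of datetime64[ns], so dates past 2262 survive:
import pandas as pd
from datetime import datetime
df = pd.read_csv("xy.csv")
# the column stays dtype=object, so the nanosecond bounds never apply
df['val_until'] = df['val_until'].map(lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))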
I am working on a Python application which converts a csv file to Hive/Athena-compatible parquet format, using the fastparquet and pandas libraries. There are timestamp values in the csv file like 2018-12-21 23:45:00 which need to be written as timestamp type in the parquet file. Below is the code I am running:
import io
import boto3
import pandas as pd
import s3fs
from fastparquet import write

columnNames = ["contentid", "processed_time", "access_time"]
dtypes = {'contentid': 'str'}
dateCols = ['access_time', 'processed_time']
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucketname, Key=keyname)
df = pd.read_csv(io.BytesIO(obj['Body'].read()), compression='gzip', header=0, sep=',', quotechar='"', names=columnNames, error_bad_lines=False, dtype=dtypes, parse_dates=dateCols)
s3filesys = s3fs.S3FileSystem()
myopen = s3filesys.open
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen, file_scheme='hive', partition_on=PARTITION_KEYS)
The code ran successfully; below are the dtypes of the dataframe created by pandas:
contentid object
processed_time datetime64[ns]
access_time datetime64[ns]
And finally, when I queried the parquet file in Hive and Athena, the timestamp value is +50942-11-30 14:00:00.000 instead of 2018-12-21 23:45:00.
Any help is highly appreciated
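One thing worth checking before querying (a small sketch of my own, assuming the output is available as a local parquet file or one of the part files from the hive-style directory) is how the timestamp columns actually got encoded:
import pyarrow.parquet as pq
print(pq.ParquetFile('outfile.snappy.parquet').schema)  # the parquet-level schema shows whether timestamps are INT96 or INT64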
I know this question is old but it is still relevant.
As mentioned before, Athena only supports int96 timestamps.
Using fastparquet it is possible to generate a parquet file in the correct format for Athena. The important part is times='int96', as this tells fastparquet to convert the pandas datetimes to int96 timestamps.
from fastparquet import write
import pandas as pd

def write_parquet():
    df = pd.read_csv('some.csv')
    write('/tmp/outfile.parquet', df, compression='GZIP', times='int96')
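If I read this right, applied to the write call from the question (same myopen and PARTITION_KEYS as defined there), this would just mean adding the times argument:
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen, file_scheme='hive', partition_on=PARTITION_KEYS, times='int96')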
I solved the problem this way:
transform the df series with the to_datetime method,
then with the .dt accessor pick the date part of the datetime64[ns].
Example:
df.field = pd.to_datetime(df.field)
df.field = df.field.dt.date
After that, Athena will recognize the data.
You could try:
dataframe.to_parquet(file_path, compression=None, engine='pyarrow', allow_truncated_timestamps=True, use_deprecated_int96_timestamps=True)
The problem seems to be with Athena: it only supports int96 timestamps, whereas when you create a timestamp in pandas it is stored as an int64 of nanoseconds.
My dataframe column that contains a string date is "sdate". I first convert it to a timestamp:
import pandas
import pyarrow
import pyarrow.parquet
import s3fs

# assuming s3 here is an s3fs filesystem, as is typical with write_to_dataset
s3 = s3fs.S3FileSystem()
# add a new column w/ timestamp
df["ndate"] = pandas.to_datetime(df["sdate"])
# convert the timestamp to microseconds
df["ndate"] = pandas.to_datetime(df["ndate"], unit='us')
# Then I convert my dataframe to pyarrow
table = pyarrow.Table.from_pandas(df, preserve_index=False)
# After that when writing to parquet add the coerce_timestamps and
# use_deprecated_int96_timestamps. (Also writing to S3 directly)
OUTBUCKET = "my_s3_bucket"
pyarrow.parquet.write_to_dataset(table, root_path='s3://{0}/logs'.format(OUTBUCKET), partition_cols=['date'], filesystem=s3, coerce_timestamps='us', use_deprecated_int96_timestamps=True)
I also ran into this problem multiple times.
In my case the cause was that I set the index to datetime format with:
df.set_index(pd.DatetimeIndex(df.index), inplace=True)
When I then read the parquet file with fastparquet, it raised:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 219968-03-28 05:07:11
However, it can be easily solved by using pd.read_parquet(path_file) rather than fastparquet.ParquetFile(path_file).to_pandas()
Please use pd.read_parquet(path_file) to fix this problem.
That's my solution and it works well; hope it helps, so you don't need to worry about which way the parquet file was written.
I was facing the same problem; after a lot of research, it is solved now.
When you do
write('outfile.snappy.parquet', df, compression='SNAPPY', open_with=myopen,file_scheme='hive',partition_on=PARTITION_KEYS)
it uses fastparquet behind the scenes, which uses a different encoding for DateTime than what Athena is compatible with.
The solution is to uninstall fastparquet and install pyarrow:
pip uninstall fastparquet
pip install pyarrow
Run your code again. It should work this time. :)
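If you'd rather keep both libraries installed, another option (my addition, not part of this answer) is to pin the engine explicitly when writing with pandas:
df.to_parquet('outfile.snappy.parquet', engine='pyarrow', compression='snappy')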
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when I read the file because "low_memory" is deprecated... I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck with the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')  # to_hdf requires a key; the name is arbitrary
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df_final')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
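A quick sanity check (my addition) that the round trip really preserved the dtypes:
import pandas as pd
restored = pd.read_pickle(self.get_local_file_path(hash, dataset_name))
assert restored.dtypes.equals(df_final.dtypes)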
The warning is because pandas has detected conflicting data types in your columns. You can specify the datatypes in the read_table/read_csv call if you wish, e.g.
dtype={'FIELD': int, 'FIELD2': str}
etc.
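For example, sketched onto the read_table call from the question (the column names here are placeholders):
df_final = pd.read_table(self.get_local_file_path(hash, dataset_name), encoding='utf8', index_col=[0, 1], dtype={'FIELD': int, 'FIELD2': str})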
I have certain computations performed on a dataset, and I need the result to be stored in an external file.
If I wrote it to CSV, then to process it further I'd have to convert it back to a DataFrame/SFrame again, which increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is in SFrame and can be converted to DFrame using
df_train = train_data.to_dataframe()
Now that it is in a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame and not a CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. - I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by Pandas and by graphlab.SFrame, and besides that the HDF5 format is very fast.
Alternatively, you can export the Pandas.DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
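And a sketch of the HDF5 route mentioned above (the key name 'train' is arbitrary, and PyTables needs to be installed):
sf.to_dataframe().to_hdf(r'/path/to/pd_frame.h5', key='train')
then in the other script:
df = pd.read_hdf(r'/path/to/pd_frame.h5', key='train')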