I can't read a parquet file with the pandas read_parquet function - python

When I use pd.read_parquet to read a parquet file, the error below is displayed.
My code:
import pandas as pd
df = pd.read_parquet("fhv_tripdata_2018-05.parquet")
error:
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 32094334800000000
I want to convert this file to csv:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2018-05.parquet

Please provide a minimal example, i.e., a small parquet file that generates the error.
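For reference, a tiny file that reproduces the error can be generated directly with pyarrow (a sketch; the failure only occurs with pandas/pyarrow combinations that coerce timestamps to nanoseconds, which is what the question hit):
import datetime
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# a single microsecond-resolution timestamp far outside the datetime64[ns]
# range (pandas nanosecond timestamps only reach the year 2262)
table = pa.table({"ts": pa.array([datetime.datetime(2987, 1, 1)], type=pa.timestamp("us"))})
pq.write_table(table, "tiny.parquet")
# on affected versions this raises the same ArrowInvalid out-of-bounds error
df = pd.read_parquet("tiny.parquet")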
It seems there are some open issues around this: the timestamp conventions are not compatible, and there are known pitfalls when reading/writing dates in parquet via pandas. I therefore propose a solution that uses pyarrow directly:
import pyarrow.parquet as pq
table = pq.read_table('fhv_tripdata_2018-05.parquet')
df = table.to_pandas(timestamp_as_object=True)
# to csv
df.to_csv('data.csv')
Note the extra flag timestamp_as_object which prevents the overflow you observed.
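Alternatively, a sketch that skips pandas entirely and writes the CSV straight from the Arrow table (assuming pyarrow >= 4.0, which provides pyarrow.csv.write_csv):
import pyarrow.parquet as pq
import pyarrow.csv as pv_csv
table = pq.read_table('fhv_tripdata_2018-05.parquet')
# timestamps are written as ISO-8601 text, so no nanosecond cast is needed
pv_csv.write_csv(table, 'data.csv')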

Related

What happens when pandas read_csv is run on a file that is too large

If a file fed into pandas read_csv is too large, will it raise an exception?
What I'm afraid of is that it will just read what it can, say the first 1,000,000 rows, and proceed as if there were no problem.
Are there situations in which pandas will fail to read all records in a file yet also fail to raise an exception (or print errors)?
If you have a large dataset and you want to read it many times, I recommend using a .pkl file.
Or you can use a try/except approach.
However, if you still want to use a csv file, you can visit this link and find a solution: How do I read a large csv file with pandas? (sketched below)
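That answer usually boils down to reading the file in chunks; a minimal sketch (file name and chunk size are just examples):
import pandas as pd
chunks = []
# read 1,000,000 rows at a time instead of the whole file at once
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    chunks.append(chunk)  # or process each chunk and discard it
df = pd.concat(chunks, ignore_index=True)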
I'd recommend using dask, a high-level library that supports parallel computing.
You can easily point it at all your data, but it won't be loaded into memory until you compute:
import dask.dataframe as dd
df = dd.read_csv('data.csv')
df
and from there, you can compute only the selected columns/rows you are interested in:
df_selected = df[columns].loc[indices_to_select]
df_selected.compute()

Python: Reading tdms files using Python npTDMS and creating a Pandas dataframe

I'm able to read a LabVIEW .tdms file using the Python npTDMS package, and can read metadata and sample data for groups and channels as well.
However, the file has timestamp values with year '9999', so I get the following error while converting to a pandas dataframe:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp
I went through the documentation at https://nptdms.readthedocs.io/en/stable/apireference.html#nptdms.TdmsFile.as_dataframe; however, I couldn't find an option to deal with this data situation.
Passing errors='coerce' while calling as_dataframe() didn't work either. Any pointers or directions for reading the .tdms file into a pandas dataframe, given this data situation, would be very helpful.
Changing the data at the source is not an option.
Code snippet to read tdms file:
import numpy as np
import pandas as pd
from nptdms import TdmsFile as td
tdms_file = td.read(<tdms file name>)
tdms_file_df = tdms_file.as_dataframe()
Error while creating a pandas dataframe
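One possible workaround (a sketch only, not verified against a file with year-9999 timestamps) is to skip as_dataframe() and build the dataframe channel by channel, coercing out-of-range timestamps to NaT:
import pandas as pd
from nptdms import TdmsFile
tdms_file = TdmsFile.read("example.tdms")  # hypothetical file name
columns = {}
for group in tdms_file.groups():
    for channel in group.channels():
        data = channel[:]  # channel data as a numpy array
        if data.dtype.kind == "M":  # datetime-like channel
            # out-of-range values (e.g. year 9999) become NaT instead of raising
            data = pd.to_datetime(data, errors="coerce")
        columns[f"{group.name}/{channel.name}"] = data
tdms_file_df = pd.DataFrame(columns)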

Transfer and write Parquet with python and pandas got timestamp error

I tried to concat() two parquet files with pandas in Python.
It works, but when I try to write and save the DataFrame to a parquet file, it displays the error:
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:
I checked the pandas documentation; by default it coerces timestamps to ms when writing the parquet file.
How can I write the parquet file with the original schema after the concat?
Here is my code:
import pandas as pd
table1 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table2 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
Pandas has forwarded unknown kwargs to the underlying parquet engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work - I verified it with pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.
Thanks to @axel for the link to the Apache Arrow documentation:
allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception.
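For reference, a minimal sketch of the combined call with both keywords forwarded to pyarrow (data and file name are illustrative):
import pandas as pd
# a tiny frame with nanosecond-precision timestamps
df = pd.DataFrame({"ts": pd.to_datetime(["2018-05-01 12:00:00.123456789"])})
# coerce the timestamps down to milliseconds and allow the truncation
df.to_parquet("file.parquet", engine="pyarrow", coerce_timestamps="ms", allow_truncated_timestamps=True)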
It seems like in modern Pandas versions we can pass parameters to ParquetWriter.
The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):
df.to_parquet(filename, use_deprecated_int96_timestamps=True)
I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.
The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.
import pandas as pd
table1 = pd.read_parquet(path=('path1.parquet'))
table2 = pd.read_parquet(path=('path2.parquet'))
table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
I experienced a similar problem while using pd.to_parquet; my final workaround was to use the argument engine='fastparquet' (see the sketch after the list below), but I realize this doesn't help if you need to use PyArrow specifically.
Things I tried which did not work:
@DrDeadKnee's workaround of manually casting columns with .astype("datetime64[ms]") did not work for me (pandas v0.24.2)
Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change behaviour.
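A minimal sketch of the working call (requires the fastparquet package; data and file name are illustrative):
import pandas as pd
df = pd.DataFrame({"ts": pd.to_datetime(["2018-05-01 12:00:00.123456789"])})
# fastparquet writes the nanosecond timestamps without the pyarrow cast error
df.to_parquet("file.parquet", engine="fastparquet")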
I experienced a related order-of-magnitude problem when writing dask DataFrames with datetime64[ns] columns to AWS S3 and crawling them into Athena tables.
The problem was that subsequent Athena queries showed the datetime fields as year >57000 instead of 2020. I managed to use the following fix:
df.to_parquet(path, times="int96")
This forwards the kwarg **{"times": "int96"} into fastparquet.writer.write().
I checked the resulting parquet file with the parquet-tools package; it indeed shows the datetime columns stored as INT96. On Athena (which is based on Presto), the int96 format is well supported and does not have the order-of-magnitude problem.
Reference: https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py, function write(), kwarg times.
(dask 2.30.0 ; fastparquet 0.4.1 ; pandas 1.1.4)
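A minimal end-to-end sketch of this approach (paths, column name, and data are placeholders):
import dask.dataframe as dd
import pandas as pd
# placeholder data with a datetime64[ns] column
pdf = pd.DataFrame({"event_time": pd.to_datetime(["2020-01-01", "2020-06-15"])})
ddf = dd.from_pandas(pdf, npartitions=1)
# "times" is forwarded to fastparquet.writer.write(); int96 avoids the
# year >57000 artefact when the files are queried from Athena/Presto
ddf.to_parquet("output_dir/", engine="fastparquet", times="int96")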

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error:
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader, and nothing works with this file.
The csv file is totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to convert the Excel file to a CSV file (note: if you have multiple sheets, each sheet should be converted to its own csv file; see the sketch below), and then read those.
You can use read_excel, or you can use a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
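If the workbook has several sheets, a minimal sketch of that conversion (file and sheet names are whatever your workbook contains):
import pandas as pd
# sheet_name=None loads every sheet into a dict of DataFrames
sheets = pd.read_excel("Data_Matches_tekha.xlsx", sheet_name=None)
for sheet_name, sheet_df in sheets.items():
    sheet_df.to_csv(f"{sheet_name}.csv", index=False)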
Use read_excel instead of read_csv for an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read function: the pickle module.
pickle is part of the Python standard library, so there is nothing to install.
You can then write your data (first) with the code below:
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle
with open(path, 'rb') as input:
    data = pickle.load(input)
Note that if you want to read your saved data with a different Python version than the one you saved it with, you can specify that in the writing step by passing protocol=x, with x corresponding to the version (2 or 3) you intend to use for reading.
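For example (the protocol value is only an illustration):
import pickle
with open(path, 'wb') as output:
    # protocol=2 keeps the file readable from Python 2 as well
    pickle.dump(variable_to_save, output, protocol=2)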
I hope this can be of any use.

How to read a Parquet file into Pandas DataFrame?

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Update: since I answered this there has been a lot of work in this area; look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow document Reading and Writing Single Files.
Parquet
Step 1: Data to play with
df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas parquet cannot.
df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
Parquet files are often large, so read them using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
files = glob.glob('data/*.parquet')
@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()
df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
Consider a .parquet file named data.parquet:
parquet_file = '../data.parquet'
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows
new_parquet_df = pd.read_parquet(parquet_file)
