I have one service running pandas version 0.25.2. This service reads data from a database and stores a snapshot as csv
df = pd.read_sql_query(sql_cmd, oracle)
The query results in a dataframe with some very large datetime values (e.g. 3000-01-02 00:00:00).
Afterwards I use df.to_csv(index=False) to create a csv snapshot and write it into a file.
On a different machine with pandas 0.25.3 installed, I read the content of the csv file into a dataframe and try to change the datatype of the date column to datetime. This results in an OutOfBoundsDatetime exception:
df = pd.read_csv("xy.csv")
pd.to_datetime(df['val_until'])
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-01-02 00:00:00
I am thinking about using pickle to create the snapshots and load the dataframes directly. However, I am curious why pandas is able to handle the big datetime in the first case and not in the second one.
Also, any suggestions on how to keep using csv as the transfer format are appreciated.
I believe I got it.
In the first case, I'm not sure what the actual data type stored in the SQL database is, but unless otherwise specified, reading it into the dataframe likely results in some generic or string type, which has a much higher overflow limit.
Eventually, though, it ends up in a csv file as a string. A string can be arbitrarily long without overflowing, whereas the type that pandas.to_datetime casts into has a maximum value of 2262-04-11 23:47:16.854775807, according to the Timestamp.max shown in the pandas documentation.
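If you want to keep csv as the transfer format, a minimal sketch (reusing the file and column names from the question) is to coerce the offending values while parsing and decide afterwards how to treat them:
import pandas as pd

df = pd.read_csv("xy.csv")

# Out-of-range values such as 3000-01-02 become NaT instead of raising
df['val_until'] = pd.to_datetime(df['val_until'], errors='coerce')

# Optionally map those "far future" sentinels to the largest valid timestamp;
# note that genuinely missing dates would be filled as well.
df['val_until'] = df['val_until'].fillna(pd.Timestamp.max.floor('s'))
The trade-off is that out-of-range dates and genuinely missing dates both pass through NaT before the fill step, so you lose the ability to tell them apart.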
Related
I am reading a CSV file using this piece of code:
import pandas as pd
import os
#Open document (.csv format)
path=os.getcwd()
database=pd.read_csv(path+"/mydocument.csv",delimiter=';',header=0)
#Date in the required format
size=len(database.Date)
I get the following error: 'DataFrame' object has no attribute 'Date'
As you can see in the image, the first column of mydocument.csv is called Date. This is weird because I used this same procedure to work with this document before, and it worked.
Try using delimiter=','. It must be a comma.
(Can't post comments yet, but think I can help).
The explanation for your problem is, simply, that you don't have a column called Date. Has pandas interpreted Date as an index column?
If your Date is spelled correctly (no trailing whitespace or anything else that might confuse things), then try this:
pd.read_csv(path+"/mydocument.csv",delimiter=';',header=0, index_col=False)
or if that fails this:
pd.read_csv(path+"/mydocument.csv",delimiter=';',header=0, index_col=None)
In my experience, I've had pandas unexpectedly infer an index_col.
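As a quick diagnostic, something along these lines (reusing the file name from the question) usually shows whether the header arrived under a different name, with stray whitespace, or as an index:
import os
import pandas as pd

path = os.getcwd()
database = pd.read_csv(path + "/mydocument.csv", delimiter=';', header=0)

# Show exactly what pandas parsed as column names
print(database.columns.tolist())

# Strip accidental whitespace from the headers, then access the column by name
database.columns = database.columns.str.strip()
size = len(database['Date'])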
I have imported a csv file in pandas that contains fields that look like datetimes but are initially parsed as 'object'. I make the required conversion from 'object' to 'datetime' using df.X = pd.to_datetime(df.X).
Now, when I try to save these changes by writing this out to a new .csv file and importing that, the format is still 'object'. Is there any way to fix its datatype so that I don't have to perform the conversion every time I import it? My dataset is quite big and the conversion takes some time, which I want to save.
Date parsing can be expensive, so pandas doesn't parse dates by default. You need to specify the parse_dates argument when calling read_csv:
df = pd.read_csv('my_file.csv', parse_dates=['date_column'])
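If the conversion itself is the slow part, a hedged variant (file and column names taken from the line above; the format string is only an example) is to pass an explicit format so pandas skips per-value inference:
import pandas as pd

df = pd.read_csv('my_file.csv')
# An explicit format parses much faster than letting pandas guess;
# adjust the pattern to whatever your dates actually look like.
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')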
I tried to concat() two parquet files with pandas in Python.
It works, but when I try to write and save the dataframe to a parquet file, it displays the error:
ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:
I checked the pandas docs; it defaults to ms timestamps when writing the parquet file.
How can I write the parquet file while keeping the original schema after the concat?
Here is my code:
import pandas as pd
table1 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table2 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
Pandas has forwarded unknown kwargs to the underlying parquet engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work - I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.
Thanks to #axel for the link to Apache Arrow documentation:
allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception.
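Putting the two hints together, a sketch of the full call (paths reused from the question) could look like this with the pyarrow engine:
import pandas as pd

table1 = pd.read_parquet('path.parquet', engine='pyarrow')
table2 = pd.read_parquet('path.parquet', engine='pyarrow')
table = pd.concat([table1, table2], ignore_index=True)

# Both keywords below are passed through to pyarrow.parquet.write_table:
# coerce_timestamps downcasts to millisecond precision, and
# allow_truncated_timestamps suppresses the "would lose data" error.
table.to_parquet('./file.gzip',
                 compression='gzip',
                 engine='pyarrow',
                 coerce_timestamps='ms',
                 allow_truncated_timestamps=True)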
It seems like in modern Pandas versions we can pass parameters to ParquetWriter.
The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):
df.to_parquet(filename, use_deprecated_int96_timestamps=True)
I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.
The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.
import pandas as pd
table1 = pd.read_parquet(path=('path1.parquet'))
table2 = pd.read_parquet(path=('path2.parquet'))
table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")
table = pd.concat([table1, table2], ignore_index=True)
table.to_parquet('./file.gzip', compression='gzip')
I experienced a similar problem while using pd.to_parquet, my final workaround was to use the argument engine='fastparquet', but I realize this doesn't help if you need to use PyArrow specifically.
Things I tried which did not work:
#DrDeadKnee's workaround of manually casting columns .astype("datetime64[ms]") did not work for me (pandas v. 0.24.2)
Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change behaviour.
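For reference, the call that did work for me looked roughly like this (fastparquet has to be installed; paths reused from the earlier answer):
import pandas as pd

table1 = pd.read_parquet('path1.parquet')
table2 = pd.read_parquet('path2.parquet')
table = pd.concat([table1, table2], ignore_index=True)

# Writing with fastparquet instead of pyarrow sidestepped the timestamp error
table.to_parquet('./file.gzip', compression='gzip', engine='fastparquet')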
I experienced a related order-of-magnitude problem when writing dask DataFrames with datetime64[ns] columns to AWS S3 and crawling them into Athena tables.
The problem was that subsequent Athena queries showed the datetime fields as year >57000 instead of 2020. I managed to use the following fix:
df.to_parquet(path, times="int96")
This forwards the kwarg **{"times": "int96"} into fastparquet.writer.write().
I checked the resulting parquet file using package parquet-tools. It indeed shows the datetime columns as INT96 storage format. On Athena (which is based on Presto) the int96 format is well supported and does not have the order of magnitude problem.
Reference: https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py, function write(), kwarg times.
(dask 2.30.0 ; fastparquet 0.4.1 ; pandas 1.1.4)
I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype option on import or set low_memory=False.
I found this question, which in the bottom line says I should specify the field types when reading the file because "low_memory" is deprecated. I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format:
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
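A minimal round trip (the file name and columns are made up for illustration) shows the dtypes surviving the pickle step, which is exactly what csv cannot guarantee:
import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2020-01-01', '2020-06-15']),
    'value': [1.5, 2.5],
})

df.to_pickle('snapshot.pkl')
restored = pd.read_pickle('snapshot.pkl')

# Both frames report datetime64[ns] and float64; nothing degrades to object
print(df.dtypes)
print(restored.dtypes)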
The warning is because pandas has detected conflicting data types in your columns. You can specify the datatypes in the read call if you wish.
,dtype={'FIELD':int,'FIELD2':str}
Etc.
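Spelled out as a complete call (the path and the column names are placeholders), the mapping goes straight into the read:
import pandas as pd

# Map each problematic column to the dtype you actually expect
df = pd.read_table('my_file.tsv', dtype={'FIELD': int, 'FIELD2': str})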
Python newbie here.
I am trying to save a large data frame into HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation for mode {'a', 'w', 'r+'}, default 'a': 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As #Giovanni Maria Strampelli pointed out, the answer of #Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of #Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file have been loaded
srl = []  # list of the loaded Series objects
while not dfdone:
    # print(i)  # uncomment to see whether the code is working properly
    try:  # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # collect each Series to build the dataframe at the end
        i += 1  # increment i after loading the Series object
    except KeyError:  # i exceeds the number of keys, so all keys have been loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here. Thus, the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
In addition to #tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to an error. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).
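An alternative to juggling one key per chunk, if the PyTables table format is acceptable, is to append every chunk under a single key. This is only a sketch (the small stand-in frame below replaces the huge one from the question, and the batch size is arbitrary):
import numpy as np
import pandas as pd

# Stand-in for the large frame from the question
df = pd.DataFrame({'col1': [999999999999999999] * 10000,
                   'col2': ['aaaaaaaaaaaaaaaaa'] * 10000})

batch_size = 1000
for _, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    # format='table' plus append=True accumulates all chunks under one key
    df_chunk.to_hdf('df.h5', key='table', mode='a',
                    format='table', append=True,
                    complib='blosc:lz4', complevel=9)

# The whole frame comes back in one read, no key bookkeeping needed
df_restored = pd.read_hdf('df.h5', key='table')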