Parquet file not accessible for writing after first read using PyArrow - python

I am trying to read a Parquet file into a pandas DataFrame, do some manipulation, and write it back to the same file; however, the file no longer seems to be writable after the first read in the same function.
It only works if I skip STEP 1 below.
Is there any way to unlock the file?
import pyarrow as pa
import pyarrow.parquet as pq

#STEP 1: Read entire parquet file
pq_file = pq.ParquetFile('\dev\abc.parquet')
exp_df = pq_file.read(nthreads=1, use_pandas_metadata=True).to_pandas()
#STEP 2: Change some data in the dataframe
#STEP 3: Write the merged dataframe
pyarrow_table = pa.Table.from_pandas(exp_df)
pq.write_table(pyarrow_table, '\dev\abc.parquet', compression='none')
Error:
File "C:\Python36\lib\site-packages\pyarrow\parquet.py", line 943, in
write_table
**kwargs)
File "C:\Python36\lib\site-packages\pyarrow\parquet.py", line 286, in
__init__
**options)
File "_parquet.pyx", line 832, in pyarrow._parquet.ParquetWriter.__cinit__
File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Failed to open local file: \dev\abc.parquet , error: Invalid argument
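There is no accepted answer in this excerpt, but two things seem worth ruling out. pq.ParquetFile keeps an open handle on the file, and on Windows an open handle can prevent the same path from being reopened for writing; reading via pq.read_table instead loads everything into memory without leaving a ParquetFile object around. Separately, in a plain string literal like '\dev\abc.parquet' the \a is an escape sequence, so a raw string is safer. A minimal sketch under those assumptions:
import pyarrow as pa
import pyarrow.parquet as pq

path = r'\dev\abc.parquet'  # raw string, so backslash escapes cannot mangle the path

# STEP 1: read the whole file up front; no ParquetFile handle lingers afterwards
exp_df = pq.read_table(path, use_pandas_metadata=True).to_pandas()

# STEP 2: change some data in the dataframe
# ...

# STEP 3: write the modified dataframe back to the same path
pq.write_table(pa.Table.from_pandas(exp_df), path, compression='none')
If ParquetFile is needed (for example, to read individual row groups), deleting the object with del pq_file before writing should release the handle as well.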

Related

Why am I getting a list out of range error while trying to save an Excel workbook in Python?

I'm trying to create a workbook, edit it, and save it. Here is what the code looks like:
cur_wb = openpyxl.Workbook()
# code modifying the values of cells, for example:
target_cell._style = copy(template_cell._style)
filename = "DB-" + months[inm] + ".xlsx"
cur_wb.save(filename="output/" + filename)
File "Python\Python310\lib\site-packages\openpyxl\workbook\workbook.py", line 407, in save
save_workbook(self, filename)
File "Python\Python310\lib\site-packages\openpyxl\writer\excel.py", line 293, in save_workbook
writer.save()
File "Python\Python310\lib\site-packages\openpyxl\writer\excel.py", line 275, in save
self.write_data()
File "Python\Python310\lib\site-packages\openpyxl\writer\excel.py", line 84, in write_data
stylesheet = write_stylesheet(self.workbook)
File "Python\Python310\lib\site-packages\openpyxl\styles\stylesheet.py", line 253, in write_stylesheet
xf.alignment = wb._alignments[style.alignmentId]
IndexError: list index out of range
Why am I getting this error? Is it because I used copy?
So, I just changed that one copy line of code.
I created a temporary NamedStyle, copied the styles (Font, Alignment, Fill, Border) of the template cell to the NamedStyle, and applied the NamedStyle to the target cell.
This fixed the issue.
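A hedged sketch of that approach (the cell references and file name here are made up for illustration; only the NamedStyle technique is from the answer):
from copy import copy
import openpyxl
from openpyxl.styles import NamedStyle

wb = openpyxl.Workbook()
ws = wb.active
template_cell = ws['A1']  # hypothetical cell carrying the desired look
target_cell = ws['B2']

# build a throwaway NamedStyle from the template cell's individual style parts,
# instead of assigning the private _style attribute directly
tmp_style = NamedStyle(name='tmp_style')
tmp_style.font = copy(template_cell.font)
tmp_style.alignment = copy(template_cell.alignment)
tmp_style.fill = copy(template_cell.fill)
tmp_style.border = copy(template_cell.border)

wb.add_named_style(tmp_style)
target_cell.style = 'tmp_style'  # apply the registered style by name
wb.save('DB-example.xlsx')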

Read h5 files through a NAS

I'm reading h5 files remotely; previously they were on a server, but more recently I had to store some on a NAS device. When I try to read the ones on the NAS, some of them produce the following error:
HDF5ExtError: HDF5 error back trace
File "C:\ci\hdf5_1593121603621\work\src\H5Dio.c", line 199, in H5Dread
can't read data
File "C:\ci\hdf5_1593121603621\work\src\H5Dio.c", line 603, in H5D__read
can't read data
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 621, in H5D__contig_read
contiguous read failed
File "C:\ci\hdf5_1593121603621\work\src\H5Dselect.c", line 283, in H5D__select_read
read error
File "C:\ci\hdf5_1593121603621\work\src\H5Dselect.c", line 218, in H5D__select_io
read error
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 956, in H5D__contig_readvv
can't perform vectorized sieve buffer read
File "C:\ci\hdf5_1593121603621\work\src\H5VM.c", line 1500, in H5VM_opvv
can't perform operation
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 753, in H5D__contig_readvv_sieve_cb
block read failed
File "C:\ci\hdf5_1593121603621\work\src\H5Fio.c", line 118, in H5F_block_read
read through page buffer failed
File "C:\ci\hdf5_1593121603621\work\src\H5PB.c", line 732, in H5PB_read
read through metadata accumulator failed
File "C:\ci\hdf5_1593121603621\work\src\H5Faccum.c", line 260, in H5F__accum_read
driver read request failed
File "C:\ci\hdf5_1593121603621\work\src\H5FDint.c", line 205, in H5FD_read
driver read request failed
File "C:\ci\hdf5_1593121603621\work\src\H5FDsec2.c", line 725, in H5FD_sec2_read
file read failed: time = Tue May 10 11:37:06 2022, filename = 'Y:/myFolder\myFile.h5', file descriptor = 4, errno = 22, error message = 'Invalid argument', buf = 0000020F03F14040, total read size = 16560000, bytes this sub-read = 16560000, bytes actually read = 18446744073709551615, offset = 480252764
End of HDF5 error back trace
Problems reading the array data.
I don't really understand the error; it always happens for the same files, yet I can open the file and read the data myself with HDFView. If I put the file on the server, I can read it without a problem using the same lines of code (the path is correct in both cases):
hdf5store = pd.HDFStore(myPath[fich])
datacopy = hdf5store['my_data']
By the way, the error occurs at the second line of code. Right now I don't have access to the server and can't copy the file locally because I don't have enough space. Does anyone know how to fix this so I can keep working through the NAS?

How to write on top of pandas HDF5 'read-only mode' files?

I am storing data using pandas built-in HDF5 methods.
Somehow, these HDF5 files were turned into 'read-only' files: I get a lot of Opening xxx in read-only mode messages when I open them in write mode, and I can't write to them, which is something I really need to do.
What I really don't understand is how those files became read-only, since I am not aware of any code I wrote that could cause that behavior. (I have checked whether the data stored in the HDF5 files is corrupt, but I can read and manipulate it, so it seems to be fine.)
I have 2 questions:
How can I append data to those 'read-only mode' HDF5 files? (Can I convert them back to write mode, or is there another clever solution?)
Is there any pandas method that turns an HDF5 file 'read-only' by default, so that I can avoid these files becoming read-only in the first place?
Code:
The piece of code that raises this issue, which I use to save the output I generate, is:
with pd.HDFStore('data/observer/' + self._currency + '_' + str(ts)) as hdf:
    hdf.append(key='observers', value=df, format='table', data_columns=True)
I also use this piece of code to manipulate the outputs that were generated previously:
for the_file in list_dir:
    if currency in the_file:
        temp_df = pd.read_hdf(folder + the_file)
        ...
I use some select commands as well to get specific columns from the data files:
with pd.HDFStore('data/observer/' + self.currency + '_' + timestamp) as hdf:
    df = hdf.select(key='observers', columns=[x, y])
Error Traceback:
File ".../data_processing/observer_data.py", line 52, in save_obs_to_pandas
hdf.append(key='observers', value=df, format='table', data_columns=True)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 963, in append
**kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 1341, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3930, in write
self.set_info()
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3163, in set_info
self.attrs.info = self.info
File ".../venv/lib/python3.5/site-packages/tables/attributeset.py", line 464, in __setattr__
nodefile._check_writable()
File ".../venv/lib/python3.5/site-packages/tables/file.py", line 2119, in _check_writable
raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../general_manager.py", line 144, in <module>
gm.run()
File ".../general_manager.py", line 114, in run
list_of_observer_managers = self.load_all_observer_managers()
File ".../general_manager.py", line 64, in load_all_observer_managers
observer = currency_pool.map(self.load_observer_manager, list_of_currencies)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
tables.exceptions.FileModeError: the file is not writable
The issue was that I had messed up OS file permissions. The files I was trying to read belonged to root (because I had run the code that generated them as root), and I was trying to access them from a regular user account.
I am running Debian, and the following command (run as root) solved my issue:
chown -R user.user folder
This command recursively changes the ownership of all files inside that folder to user.user.
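If you want to check from Python which files are affected before running chown, a small diagnostic sketch (Unix-only, since it uses the pwd module; the folder name is taken from the question):
import os
import pwd

root_dir = 'data/observer'  # folder from the question

for dirpath, dirnames, filenames in os.walk(root_dir):
    for name in filenames:
        full = os.path.join(dirpath, name)
        owner = pwd.getpwuid(os.stat(full).st_uid).pw_name
        # os.access reports whether the *current* user may write the file
        print(full, 'owner=' + owner, 'writable=' + str(os.access(full, os.W_OK)))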

Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1106012 rows and 171 columns. Other, smaller files finish successfully without any error, but when I read this 1 GB file it fails every time at exactly line 1106011, which is the second-to-last line of the file. I could remove that line manually, but that is not a solution, because I have hundreds of other files of the same size and I cannot fix all of them by hand. Can anyone help me with this, please?
def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
    # engine='python'
    # quoting=csv.QUOTE_NONE
    # encoding='utf-8'
    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    df = df.apply(lambda x: x.astype(str).str.lower())
    return df
I then call this function within a loop, and that part works just fine.
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
I read Pandas ParserError EOF character when reading multiple csv files to HDF5, read_csv() & EOF character in string cause parsing issue, https://github.com/pandas-dev/pandas/issues/11654, and many more, and tried adding read_csv parameters such as
engine='python'
quoting=csv.QUOTE_NONE  # hangs, and even hangs the Python shell; I don't know why
encoding='utf-8'
but none of them worked; it still throws the following error:
Error:
Traceback (most recent call last):
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
sep=',')
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
If you are under Linux, try removing all non-printable characters, then load your file again after this operation:
tr -dc '[:print:]\n' < file > newfile
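The traceback above shows a Windows path, where tr is not available; a rough Python equivalent of that command (a sketch with placeholder file names; string.printable also keeps tabs and carriage returns, so it is slightly more permissive than [:print:]):
import string

printable = set(string.printable)  # letters, digits, punctuation, whitespace

# keep only printable characters, similar to: tr -dc '[:print:]\n' < file > newfile
with open('file.csv', 'r', encoding='utf-8', errors='ignore') as src, \
     open('newfile.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(''.join(ch for ch in line if ch in printable))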
I looked into many solutions; some of them worked, but they affected the calculations. I used this one, which skips the lines that cause the error:
pd.read_csv(file, engine='python', error_bad_lines=False)
# engine='python' provides better output
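Combining that with the chunked reading from the question gives something like the sketch below (a hedged example; note that in later pandas versions error_bad_lines was replaced by on_bad_lines='skip'):
import pandas as pd

def read_big_csv(path, chunk_size=10000):
    # the Python engine tolerates some quoting problems the C engine rejects;
    # error_bad_lines=False drops rows that still fail to parse
    reader = pd.read_csv(path,
                         sep=',',
                         engine='python',
                         error_bad_lines=False,
                         chunksize=chunk_size)
    for chunk in reader:
        yield chunk

for chunk in read_big_csv('huge_file.csv'):
    ...  # process each chunk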

Create HDF5 file using pytables with table format and data columns

I want to read an h5 file previously created with PyTables.
The file is read using Pandas, and with some conditions, like this:
pd.read_hdf('myH5file.h5', 'anyTable', where='some_conditions')
From another question, I have been told that, in order for an h5 file to be queryable with read_hdf's where argument, it must be written in table format and, in addition, some columns must be declared as data columns.
I cannot find anything about this in the PyTables documentation.
The documentation on PyTables' create_table method does not mention anything about it.
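For reference, the where machinery relies on metadata that pandas itself writes, which is presumably why a file built directly with PyTables' create_table does not support it. A minimal pandas-side sketch of what "table format with data columns" looks like (file and column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'operation': [1, 2, 1], 'value': [0.1, 0.2, 0.3]})

# format='table' makes the store queryable; data_columns declares which
# columns may appear in a where clause
df.to_hdf('example.h5', key='basic_data', format='table',
          data_columns=['operation'])

subset = pd.read_hdf('example.h5', 'basic_data', where='operation == 1')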
So, right now, if I try such a query on my h5 file created with PyTables, I get the following:
>>> d = pd.read_hdf('test_file.h5','basic_data', where='operation==1')
C:\Python27\lib\site-packages\pandas\io\pytables.py:3070: IncompatibilityWarning: where criteria is being ignored as this version [0.0.0] is too old (or not-defined), read the file in and write it out to a new file to upgrade (with the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 323, in read_hdf
return f(store, True)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 305, in <lambda>
key, auto_close=auto_close, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3968, in read
if not self.read_axes(where=where, **kwargs):
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3196, in read_axes
values = self.selection.select()
File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 4482, in select
start=self.start, stop=self.stop)
File "C:\Python27\lib\site-packages\tables\table.py", line 1567, in read_where
self._where(condition, condvars, start, stop, step)]
File "C:\Python27\lib\site-packages\tables\table.py", line 1528, in _where
compiled = self._compile_condition(condition, condvars)
File "C:\Python27\lib\site-packages\tables\table.py", line 1366, in _compile_condition
compiled = compile_condition(condition, typemap, indexedcols)
File "C:\Python27\lib\site-packages\tables\conditions.py", line 430, in compile_condition
raise _unsupported_operation_error(nie)
NotImplementedError: unsupported operand types for *eq*: int, bytes
EDIT:
The traceback mentions something about IncompatibilityWarning and version [0.0.0]; however, if I check my versions of pandas and PyTables, I get:
>>> import pandas
>>> pandas.__version__
'0.15.2'
>>> import tables
>>> tables.__version__
'3.1.1'
So, I am totally confused.
I had the same issue, and this is what I did:
1. Create an HDF5 file with PyTables.
2. Read this HDF5 file with pandas.read_hdf, using parameters like where=where_string, columns=selected_columns.
I got a warning like the one below, plus other error messages:
D:\Program Files\Anaconda3\lib\site-packages\pandas\io\pytables.py:3065: IncompatibilityWarning: where criteria is being ignored as this version [0.0.0] is too old (or not-defined), read the file in and write it out to a new file to upgrade (with the copy_to method)
warnings.warn(ws, IncompatibilityWarning)
I tried commands like this:
hdf5_store = pd.HDFStore(hdf5_file, mode = 'r')
h5cpt_store_new = hdf5_store.copy(hdf5_new_file, complevel=9, complib='blosc')
h5cpt_store_new.close()
Then, running the read command exactly as in step 2 works.
>>> pandas.__version__
'0.17.1'
>>> tables.__version__
'3.2.2'
