Workstation-specific error when running a Python script

I am getting an error on one workstation when running a Python script. The script runs fine on the VMs and on my own workstation.
pip list shows the installed packages are the same.
All workstations are using Python 3.10.4 64-bit.
This is the only workstation throwing this error.
It might be a memory issue, but the workstation has 2x4 GB of RAM and the file is barely 1 MB. I tried to chunk it out, but that did not work either.
As troubleshooting, I cut the file down to just 500 rows and it ran fine. When I tried 1000 of the 2500 rows in the file, it gave the same error. Interestingly, the workstation now cannot run the script with even a single row.
Passing error_bad_lines=False, iterator=True, chunksize=, and low_memory=False has not worked either.
What is causing this error? Why did the script run fine with a few rows, but now fail with even one row?
Here is the Traceback:
Traceback (most recent call last):
File "c:\Users\script.py", line 5, in <module>
data = pd.read_csv("C:/Path/file.csv", encoding='latin-1' )
File "C:\Users\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 4
Here is the script:
import pandas as pd
import numpy as np

# Import raw data
data = pd.read_csv("C:/Users/Script.csv", encoding='latin-1')
# Create array to track failed cases.
data['Test Case Failed']= ''
data = data.replace(np.nan,'')
data.insert(0, 'ID', range(0, len(data)))
# Testcase 1
data_1 = data[(data['FirstName'] == data['SRFirstName'])]
ids = data_1.index.tolist()
for i in ids:
    data.at[i, 'Test Case Failed'] += ', 1'
# There are 15 more test cases that perform similar tasks
# Total cases
failed = data[(data['Test Case Failed'] != '')]
passed = data[(data['Test Case Failed'] == '')]
failed['Test Case Failed'] = failed['Test Case Failed'].str[1:]
failed = failed[(failed['Test Case Failed'] != '')]
# Clean up
del failed["ID"]
del passed["ID"]
# Print results
failed['Test Case Failed'].value_counts()
print("There was a total of",data.shape[0], "rows.", "There was" ,data.shape[0] - failed.shape[0], "rows passed and" ,failed.shape[0], "rows failed at least one test case")
# Drop unwanted columns
redata = passed.drop(columns=['ConsCodeImpID', 'ImportID', 'Suff1', 'SRSuff2', 'Inactive',
'AddrRegion','AddrImpID', 'AddrImpID', 'AddrImpID.2', 'AddrImpID.1', 'PhoneAddrImpID',
'PhoneAddrImpID.1', 'PhoneImpID', 'PhoneAddrImpID', 'PhoneImpID', 'PhoneType.1', 'DateTo',
'SecondID', 'Test Case Failed', 'PhoneImpID.1'])
# Clean address
redata['AddrLines'] = redata['AddrLines'].str.replace('Apartment ','Apt ',regex=True)
redata['AddrLines'] = redata['AddrLines'].str.replace('Apt\\.','Apt ',regex=True)
redata['AddrLines'] = redata['AddrLines'].str.replace('APT','Apt ',regex=True)
redata['AddrLines'] = redata['AddrLines'].str.replace('nApt','Apt ',regex=True)
#There's about 100 more rows of address clean up
# Output edited dropped columns
redata.to_csv("C:/Users/cleandata.csv", index = False)
# Output failed rows
failed.to_csv("C:/Users/Failed.csv", index = False)
# Output passed rows
passed.to_csv("C:/Users/Passed.csv", index = False)

It turned out the workstation was corrupting the file, even though the file had never been opened on that machine before running the script. I repaired the file and the script worked. After reinstalling Excel, I no longer had to repair the file and could run the script as normal.
Click File > Open.
Click the location and folder that contains the corrupted workbook.
In the Open dialog box, select the corrupted workbook.
Click the arrow next to the Open button, and then click Open and Repair.
To recover as much of the workbook data as possible, pick Repair. If Repair isn't able to recover your data, pick Extract Data to extract values and formulas from the workbook.
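Independent of the Excel repair, a quick way to see what the pandas C parser is choking on is to count the fields on each line yourself. This is a minimal sketch; the path and encoding are the ones from the read_csv call in the question:

import csv

# Print the 1-based line number and field count for every line, so the line the
# ParserError complains about ("Expected 1 fields in line 5, saw 4") can be inspected.
with open("C:/Path/file.csv", encoding="latin-1", newline="") as f:
    for lineno, fields in enumerate(csv.reader(f), start=1):
        print(lineno, len(fields))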

Related

KeyError in Bioinformatics Program Using Pandas

I'll try to keep this as short as possible. I'm trying to create a bioinformatics program for our patient 'reporting' team. To preface this, the examples I give are just examples and not actual patient information.
The script takes the results of a patient's genetic test, specifically their nucleotide results for the SNPs we test for (organized by rsID from NCBI). The patient information is merged with a reference library I've made and compared against it. The goals are to 1) merge these files, 2) compare the patient's nucleotide results to the nucleotides from the reference library, and 3) create a "Flag" if the patient's nucleotide is rare, i.e. occurs at a small frequency percentage.
The issue I'm having is that when running the script, after uploading the patient file and the population data, I get a KeyError because it is not able to find the rsID column in the patient .csv.
I'll add 2 photos of what each .csv file looks like
[screenshot: population data CSV]
[screenshot: patient data CSV]
Here is a short excerpt of the code
onClick('Upload Patient Files First')
patient_data = pd.read_csv(ask_path())
### patient_genotype = patient_data.loc[patient_data['rsID'] == rsID]['NCBI SNP Reference']
## Not using
onClick('Upload Population Frequency Data Next')
pop_ref_data = pd.read_csv(ask_path())

# Creating a dictionary of the population reference data
def pop_dict(pop_ref_data):
    pop_ref_dict = {}
    for _, row in pop_ref_data.iterrows():
        variant_data = {}
        rsID = row['rsID']
        dominant_nucleotide = row['DomNucl']
        recessive_nucleotide = row['RecNucl']
        dominant_freq = row['DomAllele']
        recessive_freq = row['RecessiveAllele']
        variant_data[dominant_nucleotide] = dominant_freq
        variant_data[recessive_nucleotide] = recessive_freq
        pop_ref_dict[rsID] = variant_data
    return pop_ref_dict
The population data is pretty straightforward. I'm getting stuck on the first check, though: the KeyError is raised on the "rsID" column.
The patient data is further down in its respective CSV. I'm trying to get it to find the information under the columns 'NCBI SNP Reference' and 'Call'.
Quick edit: these are my traceback calls. Also, to answer another question: yes, I'm trying to bypass all of the header info on the CSV so that I can just use the bulk information I actually need once the genotyping run is finished.
Traceback (most recent call last):
File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\indexes\base.py", line 3802, in get_loc
return self._engine.get_loc(casted_key)
File "pandas_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
File "pandas_libs\hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas_libs\hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rsID'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\rcthu\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\Flag Process 2.12.py", line 61, in
pop_ref_row = pop_dict(pop_ref_data)
File "C:\Users\rcthu\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\Flag Process 2.12.py", line 41, in pop_dict
rsID = row['rsID']
File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\series.py", line 981, in getitem
return self._get_value(key)
File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\series.py", line 1089, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\indexes\base.py", line 3804, in get_loc
raise KeyError(key) from err
KeyError: 'rsID'
Process finished with exit code 1
The first thing to notice is that 'rsID' is the first key you access. Looking at your data, 'rsID' may not be what you expect: it may not actually be a column header at that point, for instance if it ended up in the index or if the real header row sits further down in the file.
You should be able to set a breakpoint before the line that breaks and run your code in debug mode. Once you hit the breakpoint, you can inspect what row really is and which keys it has.
You could also just print(row) and then return, to look at the first one.
Hope this helps.
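Since the patient CSV apparently has metadata rows above the real header, one option (a hedged sketch: the skiprows value is an assumption, and ask_path() is the question's own helper) is to tell read_csv where the real header row is and then verify what pandas actually parsed:

import pandas as pd

# skiprows is a guess: set it to the number of metadata lines above the real header row.
patient_data = pd.read_csv(ask_path(), skiprows=15)
print(patient_data.columns.tolist())   # confirm 'NCBI SNP Reference' and 'Call' are present

pop_ref_data = pd.read_csv(ask_path())
print(pop_ref_data.columns.tolist())   # confirm 'rsID', 'DomNucl', 'RecNucl', 'DomAllele', 'RecessiveAllele'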

How to write to pandas HDF5 files that are in 'read-only mode'?

I am storing data using pandas built-in HDF5 methods.
Somehow, these HDF5 files have turned into 'read-only' files: I get a lot of "Opening xxx in read-only mode" messages when I open them in write mode, and I cannot write to them, which is something I really need to do.
What I really don't understand is how those files became read-only in the first place, as I am not aware of any code I wrote that could cause that behavior. (I have checked whether the data stored in the HDF5 files is corrupt, but I am able to read and manipulate it, so it seems to be fine.)
I have two questions:
How can I append data to those 'read-only mode' HDF5 files? (Can I convert them back to write mode, or is there another clever solution?)
Is there any pandas method that changes an HDF5 file to 'read-only mode' by default, so that I can avoid turning those files read-only in the first place?
Code:
The piece of code that raises this issue, which is the piece I use to save the output I generate, is:
with pd.HDFStore('data/observer/' + self._currency + '_' + str(ts)) as hdf:
    hdf.append(key='observers', value=df, format='table', data_columns=True)
I also use this piece of code to manipulate the outputs that were generated previously:
for the_file in list_dir:
    if currency in the_file:
        temp_df = pd.read_hdf(folder + the_file)
        ...
I use some select commands as well to get specific columns from the data files:
with pd.HDFStore('data/observer/' + self.currency + '_' + timestamp) as hdf:
    df = hdf.select(key='observers', columns=[x, y])
Error Traceback:
File ".../data_processing/observer_data.py", line 52, in save_obs_to_pandas
hdf.append(key='observers', value=df, format='table', data_columns=True)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 963, in append
**kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 1341, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3930, in write
self.set_info()
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3163, in set_info
self.attrs.info = self.info
File ".../venv/lib/python3.5/site-packages/tables/attributeset.py", line 464, in __setattr__
nodefile._check_writable()
File ".../venv/lib/python3.5/site-packages/tables/file.py", line 2119, in _check_writable
raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../general_manager.py", line 144, in <module>
gm.run()
File ".../general_manager.py", line 114, in run
list_of_observer_managers = self.load_all_observer_managers()
File ".../general_manager.py", line 64, in load_all_observer_managers
observer = currency_pool.map(self.load_observer_manager, list_of_currencies)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
tables.exceptions.FileModeError: the file is not writable
The issue at hand was that I had messed up the OS file permissions. The files I was trying to write belonged to root (as I had run the code that generated them as root), and I was trying to access them from a regular user account.
I am running Debian, and the following command (run as root) solved my issue:
chown -R user.user folder
This command recursively changes the owner and group of all files inside that folder to user.
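If you want to catch this from Python instead of waiting for PyTables to raise FileModeError, a minimal sketch (the path and dataframe are illustrative, not from the original code) is to check writability before appending:

import os
import pandas as pd

path = "data/observer/EURUSD_1234567890"   # illustrative file name
df = pd.DataFrame({"price": [1.0, 2.0]})   # stand-in for the real observer frame

# Fail early with a clear message if the file exists but the current user cannot write to it.
if os.path.exists(path) and not os.access(path, os.W_OK):
    raise PermissionError(path + " is not writable; check ownership with ls -l and fix with chown/chmod")

with pd.HDFStore(path) as hdf:
    hdf.append(key='observers', value=df, format='table', data_columns=True)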

Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1,106,012 rows and 171 columns. Other, smaller files do not show any error and finish successfully, but when I read this 1 GB file it fails every time at exactly line 1,106,011, which is the second-to-last line of the file. I can remove that line manually, but that is not a solution, because I have hundreds of other files of the same size and I cannot fix all of them by hand. Can anyone help me with this, please?
def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
    # engine='python'
    # quoting=csv.QUOTE_NONE
    # encoding='utf-8'
    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    df = df.apply(lambda x: x.astype(str).str.lower())
    return df
I'm then calling this function within a loop, and it works just fine.
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
I read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654, and many more, and tried read_csv parameters such as
engine='python'
quoting=csv.QUOTE_NONE  # hangs, and even hangs the Python shell; I don't know why
encoding='utf-8'
but none of them worked; it still throws the following error:
Error:
Traceback (most recent call last):
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
sep=',')
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>>
If you are on Linux, try removing all non-printable characters, then try loading your file again after this operation:
tr -dc '[:print:]\n' < file > newfile
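The traceback in the question is from Windows, where tr is not available; a rough Python equivalent of that clean-up (the file names are illustrative) would be:

import string

printable = set(string.printable)   # printable ASCII plus \t, \n, \r

# Copy the file, dropping any non-printable characters, roughly like tr -dc '[:print:]\n'.
with open("huge_input.csv", "r", encoding="utf-8", errors="replace") as src, \
     open("huge_input_clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write("".join(ch for ch in line if ch in printable))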
I looked into many solutions; some of them worked but affected the calculations. I used this one, which skips the lines that are causing the error:
pd.read_csv(file,engine='python', error_bad_lines=False)
#engine='python' provides a better output
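Note that error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent call (a sketch, adjust to your pandas version) is:

# pandas >= 1.3: on_bad_lines replaces error_bad_lines/warn_bad_lines
pd.read_csv(file, engine='python', on_bad_lines='skip')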

Pandas read_excel() fails with xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document

I'm trying to use pandas to parse an .xlsm document. My code worked perfectly with the example file I was given, but once I got the rest of the documents, it failed with the error below. Here's the offending stack trace:
Traceback (most recent call last):
File "########/UnsupervisedCAM.py", line 9, in <module>
info_dict = read_excel_to_dict('files/' + filename)
File "########\readCAM.py", line 7, in read_excel_to_dict
df = pandas.read_excel(filename, parse_cols='E,G,I,K,Q,O')
File "########\Anaconda3\envs\tensorflow\lib\site-packages\pandas\io\excel.py", line 191, in read_excel
io = ExcelFile(io, engine=engine)
File "########\Anaconda3\envs\tensorflow\lib\site-packages\pandas\io\excel.py", line 249, in __init__
self.book = xlrd.open_workbook(io)
File "########\Anaconda3\envs\tensorflow\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "########\Anaconda3\envs\tensorflow\lib\site-packages\xlrd\book.py", line 87, in open_workbook_xls
ragged_rows=ragged_rows,
File "########\Anaconda3\envs\tensorflow\lib\site-packages\xlrd\book.py", line 595, in biff2_8_load
raise XLRDError("Can't find workbook in OLE2 compound document")
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
I'm not even sure where to start... Haven't found anything of use online.
I got the same error message and could solve it by removing the password protection of the xlsx-file.
(not saying that it's the only reason for the error, but worth checking!)
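One quick diagnostic (an editor addition, not from the original answers) is to look at the file's first bytes: a normal .xlsx/.xlsm is a ZIP archive and starts with PK, while an OLE2 compound document (a legacy .xls, or an encrypted/password-protected workbook) starts with D0 CF 11 E0 A1 B1 1A E1. A small sketch, with an illustrative path:

# Check the magic bytes to see what kind of container the file really is.
with open("files/example.xlsm", "rb") as f:   # illustrative path
    magic = f.read(8)

if magic[:2] == b"PK":
    print("ZIP container: a normal .xlsx/.xlsm")
elif magic == b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1":
    print("OLE2 compound document: legacy .xls or an encrypted/password-protected workbook")
else:
    print("Unknown signature; the file may be corrupted")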
After a lot of searching, the only way I've found to do this is to open and save all the Excel documents, which seems to 'strip' them of their OLE2 format. I automated the process with the following VBS script:
Dim objFSO, objFolder, objFile
Dim objExcel, objWB
Set objExcel = CreateObject("Excel.Application")
Set objFSO = CreateObject("Scripting.FileSystemObject")
MyFolder = "<PATH/TO/FILES>"
Set objFolder = objFSO.GetFolder(MyFolder)
For Each objFile In objFolder.Files
    If Right(objFile.Name, 4) = "<EXTENSION>" Then
        Set objWB = objExcel.Workbooks.Open(objFile)
        objWB.Save
        objWB.Close
    End If
Next
objExcel.Quit
Set objExcel = Nothing
Set objFSO = Nothing
WScript.Echo "Done"
Make sure to change the path to the folder and extension.
In case you face this issue in a Jupyter notebook, as I did when searching for the error, you can simply restart the kernel and the issue gets resolved.

Errors when loading .csv file using pandas in python

I have a large CSV file, approximately 6 GB, and it's taking a lot of time to load into Python. I get the following error:
import pandas as pd
df = pd.read_csv('nyc311.csv', low_memory=False)
Python(1284,0x7fffa37773c0) malloc: *** mach_vm_map(size=18446744071562067968) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 851, in pandas.parser.TextReader.read (pandas/parser.c:10438)
File "pandas/parser.pyx", line 939, in pandas.parser.TextReader._read_rows (pandas/parser.c:11607)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
I don't think I'm understanding the error; the last line seems to suggest that the file is too big to load? I also tried the low_memory=False option, but that did not work either.
I'm not sure what "can't allocate region" means. Could it be that the header includes 'region' and pandas cannot locate the column underneath it?
The out-of-memory issue is due to RAM; there is no other explanation for it. The total size of all in-RAM objects, plus their memory overheads, exceeds the available RAM, which is what malloc: *** mach_vm_map(size=18446744071562067968) failed is telling you.
Try using:
df = pd.read_csv('nyc311.csv', chunksize=5000, lineterminator='\r')
Or, if reading this CSV is only a part of your program and other dataframes were created before it, try deleting the ones you no longer use:
import gc
del old_df         # clear dataframes not in use
gc.collect()       # collect garbage
del gc.garbage[:]  # clear the list of uncollectable objects
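With chunksize set, read_csv returns an iterator rather than a single dataframe, so the usual pattern (a sketch; the column names are assumptions about the NYC 311 data, not taken from the question) is to process each chunk and keep only what you need:

import pandas as pd

pieces = []
for chunk in pd.read_csv('nyc311.csv', chunksize=100_000):
    # Keep only the needed columns so the full 6 GB file never sits in RAM at once.
    pieces.append(chunk[['Created Date', 'Complaint Type']])   # column names are assumptions

df = pd.concat(pieces, ignore_index=True)
print(len(df), "rows loaded")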
