How to continue a loop after catching exception in try ... except - python

I am reading a big file in chunks and doing some operations on each chunk. While reading one of them I got the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 26 fields in line 15929977, saw 118
which means that one of the lines in my file doesn't follow the same format as the others. What I thought I could do was simply omit this chunk, but I couldn't find a way to do it. I tried a try/except block as follows:
data = pd.read_table('ny_data_file.txt', sep=',',
                     header=0, encoding='latin1', chunksize=5000)
try:
    for chunk in data:
        # operations
except pandas.errors.ParserError:
    # Here is my problem
Where I have written # Here is my problem, the issue is that if the chunk is not parsed correctly, my code automatically goes to the except block instead of continuing the for loop, but what I would like is to skip this chunk and move forward to the next one, on which I would then perform the operations inside the loop.
I have checked on Stack Overflow but couldn't find anything similar where the try was wrapped around the for loop. Any help would be appreciated.
UPDATE:
I have tried to do as suggested in the comments:
try:
    for chunk in data:
        # operations
except pandas.errors.ParserError:
    # continue/pass/handle error
But it is still not catching the exception because, as said, the exception is raised when getting the chunk out of my data, not when doing operations with it.

The way you use try/except makes it skip the entire for loop if an exception is caught in it. If you want to skip only one iteration, you need to write the try/except inside the loop, like so:
for chunk in data:
    try:
        # operations
    except pandas.errors.ParserError as e:
        # inform the user of the error
        print("Error encountered while parsing chunk {}".format(chunk))
        print(e)

As I understood it, you get the exception in the operations part. If that is the case, you should just continue:
for chunk in data:
    try:
        # operations
    except pandas.errors.ParserError:
        continue

I am not sure where the exception is thrown. Maybe adding the full stack trace would help. If the error is thrown by the read_table() call, maybe you could try this:
try:
    data = pd.read_table('ny_data_file.txt', sep=',',
                         header=0, encoding='latin1', chunksize=5000)
except pandas.errors.ParserError:
    pass

for chunk in data:
    # operations
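If, as the asker's update suggests, the error is actually raised while fetching the next chunk rather than in the read_table() call itself, one possible workaround is to drive the iterator manually with next() so that each fetch gets its own try/except. This is only a rough sketch; whether the reader can actually resume after a parse error depends on the pandas version, so the error_bad_lines approach below is more robust:

import pandas as pd

reader = pd.read_table('ny_data_file.txt', sep=',', header=0,
                       encoding='latin1', chunksize=5000)
while True:
    try:
        chunk = next(reader)
    except StopIteration:
        break                      # no more chunks
    except pd.errors.ParserError:
        continue                   # skip the chunk that failed to parse
    # operations on chunk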

As suggested by @JonClements, what solved my problem was to use error_bad_lines=False in pd.read_csv, so it just skipped the lines causing trouble and let me execute the rest of the for loop.
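For reference, a minimal sketch of that approach (assuming an older pandas where error_bad_lines is still accepted; from pandas 1.3 onwards the equivalent is on_bad_lines='skip'):

import pandas as pd

data = pd.read_csv('ny_data_file.txt', sep=',', header=0,
                   encoding='latin1', chunksize=5000,
                   error_bad_lines=False,   # skip malformed lines instead of raising
                   warn_bad_lines=True)     # but print a warning for each skipped line
for chunk in data:
    # operations
    pass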

Related

Pandas throws a parse error after being caught by try except while code in the except is running

I'm reading large csv files into pandas (up to ~20 GB), and using the dtype parameter to control what types are loaded in order to manage memory use. Occasionally one of the files has a bad row, such as one of the floats having two decimal points (e.g. 3.45.23). Apparently, when using pandas.read_csv like this, it does not care about error_bad_lines=False, and just throws something like:
ValueError: could not convert string to float: '-0.0.10118473'
and TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'
My solution was, in the rare cases where this problem happens, to catch the exception (with a generic try/except, no error types declared), go through the csv, and fix any bad lines (by checking them against a regex) before re-trying the pandas load.
When I do this, the exception triggers the except clause, I print the error at the beginning of the except, and then it moves on to process the file. But then, from the middle of the except block, the same pandas error from the try block crashes the execution (there is no pandas call in the except block). The crash point is shown with a comment below.
How is it running in two places like this? Is there a fix?
read_succesufully = 0
attempted_fix = 0
while(not read_succesufully):
    try:
        self.in_data = pd.read_csv(filename, usecols=usecols, dtype=dtype,
                                   error_bad_lines=False, warn_bad_lines=True)
        read_succesufully = 1
    except:
        print('error parsing, attempting to fix bad csv ' + filename, sys.exc_info())
        pattern = r'(-?\d+\.?\d*(e-?\d+)?,)+-?\d+\.?\d*$'
        with open(filename, 'r') as file:
            lines = file.readlines()
        os.rename(filename, filename + '.bak')
        print('read in file %s with %d lines, attempting to fix' % (filename, len(lines)))
        # crash occurs here, with pandas ValueError????
        wrote_lines = 0
        with open(filename, 'w') as file:
            file.write(lines[0])
            for line in lines[1:]:
                if(re.match(pattern, line)):
                    file.write(line)
                    wrote_lines += 1
        print('wrote %d lines and renamed original file with ".bak". now attempting to re-read' % wrote_lines)
        attempted_fix = 1

Use python struct.error to stop a function?

I'm trying to write a small function for my program (Python) which writes some data into a csv file. The input data comes from another file. Both files are opened!
The code for the actual read and write process is:
while aux != '':
    data = f.read(4)
    data = unpack('I', data)
    data = list(data)
    writer.writerow(data)
else:
    print('done')
This code works fine so far, but sometimes my input data doesn't have 4 bytes left at the end for the last read, so it gives me the error "struct.error: unpack requires a string argument of length 4".
This is totally fine for me, I don't mind some data loss at the end, but this error stops my whole program.
Is there a way to stop the function and return to the main program if this error occurs? Or to just stop the while loop and go on with the "else:" part?
This should do it:
try:
    data = unpack('I', data)
except struct.error as err:
    print(err)
This way you'll know when there is a problem, but program execution will continue.
Do read up on Python error handling for the full story.
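Putting that into the asker's loop, a minimal Python 2-style sketch (matching the error message in the question; the file names are placeholders, not from the question) that simply stops reading when fewer than 4 bytes are left:

import csv
import struct
from struct import unpack

with open('input.bin', 'rb') as f, open('output.csv', 'wb') as out:
    writer = csv.writer(out)
    while True:
        data = f.read(4)
        if len(data) < 4:            # end of file or a truncated final record
            break
        try:
            writer.writerow(list(unpack('I', data)))
        except struct.error as err:
            print(err)               # report the problem ...
            break                    # ... and leave the loop instead of crashing
print('done')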

Why does this code throw an IndexError?

I am trying to read values from a file, data.txt, with this code. However, it throws an IndexError when I run it. Why would this be happening?
def main():
    myfile=open('data.txt','r')
    line=myfile.readline()
    while line!='':
        line=line.split()
        age=line[1]
        line=myfile.readline()
    myfile.close()

main()
If line happens to contain exactly one fragment, line.split() returns a list of exactly one element, and accessing its second element (at index 1) leads to an error.
Also, to make your code better, don't ever reassign the variables. It hampers readers, and the code is written mostly to be read, especially by yourself.
I'd use a simpler loop:
for line in myfile:  # this iterates over the lines of the file
    fragments = line.split()
    if len(fragments) >= 2:
        age = fragments[1]
        ...
Also, the idiomatic way to open the file for a particular duration and close it automatically is the use of with:
with open(...) as myfile:
    for line in myfile:
        ...
# At this point, the file will be automatically closed.
Python starts indexing at 0.
In your age=line[1] part, if there is only one word on the line, Python will throw an IndexError to tell you that. Seeing your data would be helpful, but the following is the generally accepted and much easier way of reading a file:
with open('data.txt', 'r') as myfile:
    for line in myfile:
        # '' is a false value, so this is the same as: if line != ''
        if line:
            line = line.split()
            # if age is always the first thing on the line:
            age = line[0]
            # if age can be somewhere else on the line, you need to do something more complicated
Note that, because you used with, you don't need to close the file yourself; the with statement does that.
def main():
    myfile=open('data.txt','r')
    line=myfile.readline()
    while line!='':
        line=line.split()
        try:
            age=line[1]
        except IndexError:
            age = None
        line=myfile.readline()
    myfile.close()

main()
The try statement works as follows.
First, the try clause (the statement(s) between the try and except keywords) is executed.
If no exception occurs, the except clause is skipped and execution of the try statement is finished.
If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement.
If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message.
For more details, see https://docs.python.org/2/tutorial/errors.html#handling-exceptions
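A small illustration of that control flow, using made-up data in the asker's pattern:

line = "Alice".split()   # only one fragment, so index 1 does not exist
try:
    age = line[1]        # raises IndexError; the rest of the try block is skipped
    print("never reached")
except IndexError:       # the exception type matches, so this clause runs
    age = None
print(age)               # execution continues after the try statement: prints None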

How to troubleshoot code in the case of big data

I'm trying to implement this python solution to count the number of lines with identical content in the first few columns of a table. Here is my code:
#count occurrences of reads
import pandas as pd
#pd.options.display.large_repr = 'info'
#pd.set_option('display.max_rows', 100000000)
#pd.set_option('display.width',50000)
import sys
file1 = sys.argv[1]
file2 = file1[:4] + '_multi_nobidir_count.soap'
df = pd.read_csv(file1,sep='\t',header=None)
df.columns = ['v0','v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11']
df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
df.to_csv(file2,sep='\t',index=False,header=False)
It worked fine with the test data (200 lines) but gives me the following error when I apply it to the real data (20 million lines):
Traceback (most recent call last):
  File "count_same_reads.py", line 14, in <module>
    df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2732, in transform
    return self._transform_item_by_item(obj, fast_path)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2799, in _transform_item_by_item
    raise TypeError('Transform function invalid for data types')
TypeError: Transform function invalid for data types
How do I go about troubleshooting, to find out why I am getting this error?
[EDIT] Uncommenting the pd.options. and pd.set_option lines did not change the outcome.
[EDIT2] Taking into consideration some of the replies below, I ran the following code on my data to output any lines of data that do not have a number in the 4th column:
#test data type
import sys

file1 = sys.argv[1]

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

with open(file1, 'r') as data:
    for row in data:
        a = row.strip().split()[3]
        if is_number(a) == False:
            print row.strip()
This worked on the test data, in which I changed one row's fourth-column value from 1 to e: it output only the line containing the letter instead of a number. I ran it on the original big data, but no lines were returned.
The exception you have received is TypeError which hints at problems with the file. But with large files it is always possible that there are, e.g., memory problems with the code handling the comparisons. So, you have two possibilities:
the file is broken
the code (yours or pandas's) is broken
In order to debug this, you may try to feed your file into your code in pieces (see the sketch below). At some point you will have isolated the problem. It will be one of the two:
no matter which n lines you take, it throws an exception (but not with n-1 lines); memory management or something else is broken
the problems can be isolated onto a single line or lines of your data file; the data file is broken
I second merlin2011's guess: there is something unexpected in your file. It is unlikely that pandas will choke with only 20 000 000 records.
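One way to feed the file in pieces, as suggested above, and narrow down the failing region (a rough sketch reusing the asker's column names; the chunk size is arbitrary):

import pandas as pd

cols = ['v%d' % i for i in range(12)]
# file1 is the input path, as in the question
chunks = pd.read_csv(file1, sep='\t', header=None, names=cols, chunksize=100000)
for i, piece in enumerate(chunks):
    try:
        piece.groupby(['v0', 'v1', 'v2']).transform(sum)
    except TypeError:
        print('transform failed somewhere in rows %d-%d'
              % (i * 100000, (i + 1) * 100000 - 1))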
Open the file /usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py, go to line 2799.
Right before the following statement, at the same indentation level, add a line to print the value of the offending data.
raise TypeError('Transform function invalid for data types')
Now, right before the TypeError is thrown, you will know what data caused the error.
Given that you are trying to sum, I would speculate that you have a non-numeric value in your column, but I do not have your data, so that is pure speculation.
I have taken a quick look at the code region around where the error occurs, and it appears that in this case you should be inspecting the object obj before the TypeError is raised.
for i, col in enumerate(obj):
    try:
        output[col] = self[col].transform(wrapper)
        inds.append(i)
    except Exception:
        pass

if len(output) == 0:  # pragma: no cover
    raise TypeError('Transform function invalid for data types')
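For example, a debug line one might add just above the raise (purely illustrative; it simply dumps the dtypes of the DataFrame being transformed):

print('transform failed for every column; dtypes of the offending frame:')  # added for debugging
print(obj.dtypes)
raise TypeError('Transform function invalid for data types')                # the existing line 2799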
Here's how to troubleshoot something like this:
Create a function to wrap the operation (this will be a fair bit slower, as it's not cythonized), but it should catch your error.
def f(x):
    try:
        return x.sum()
    except:
        import pdb; pdb.set_trace()

df['v3'] = df.groupby(['v0','v1','v2']).transform(f).v3

Skipping broken jsons python

I am reading JSON from the database and parsing it using python.
cur1.execute("Select JSON from t1")
dataJSON = cur1.fetchall()
for row in dataJSON:
    jsonparse = json.loads(row)
The problem is that some of the JSONs I'm reading are broken.
I would like my program to skip a JSON if it's not valid, and if it is, go ahead and parse it. Right now my program crashes once it encounters a broken JSON.
t1 has several JSONs that I'm reading one by one.
Update
You're getting an "expected string or buffer" error - you need to be using row[0], as the results will be 1-tuples... and you wish to take the first and only column.
If you did want to check for bad json
You can put a try/except around it:
for row in dataJSON:
    try:
        jsonparse = json.loads(row)
    except Exception as e:
        pass
Now - instead of using Exception as above - use the type of exception that's occurring at the moment, so that you don't capture errors unrelated to JSON loading... (It's probably ValueError)
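Combining both points from this answer, a minimal sketch (assuming each fetched row is a 1-tuple whose first and only column holds the JSON text):

for row in dataJSON:
    try:
        jsonparse = json.loads(row[0])  # first (and only) column of the tuple
    except ValueError:                  # broken JSON raises ValueError (JSONDecodeError subclasses it)
        continue                        # skip this row and move on to the next one
    # ... work with jsonparse ...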
If you just want to silently ignore errors, you can wrap json.loads in a try..except block:
try: jsonparse = json.loads(row)
except: pass
Try this:
def f(x):
    try:
        return json.loads(x)
    except:
        pass

json_df = pd.DataFrame()
json_df = df.join(df["error"].apply(lambda x: f(x)).apply(pd.Series))
Besides loading the JSON, I also wanted to convert each key-value pair from the JSON into a new column (one per JSON key), so I used apply(pd.Series) in conjunction. If your goal is only to parse each row of a dataframe column as JSON, try this with that part removed.
