I'm trying to implement this Python solution to count the number of lines with identical content in the first few columns of a table. Here is my code:
#count occurrences of reads
import pandas as pd
#pd.options.display.large_repr = 'info'
#pd.set_option('display.max_rows', 100000000)
#pd.set_option('display.width',50000)
import sys
file1 = sys.argv[1]
file2 = file1[:4] + '_multi_nobidir_count.soap'
df = pd.read_csv(file1,sep='\t',header=None)
df.columns = ['v0','v1','v2','v3','v4','v5','v6','v7','v8','v9','v10','v11']
df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
df.to_csv(file2,sep='\t',index=False,header=False)
It worked fine with the test data (200 lines) but gives me the following error when I apply it to the real data (20 million lines):
Traceback (most recent call last):
File "count_same_reads.py", line 14, in <module>
df['v3']=df.groupby(['v0','v1','v2']).transform(sum).v3
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2732, in transform
return self._transform_item_by_item(obj, fast_path)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2799, in _transform_item_by_item
raise TypeError('Transform function invalid for data types')
TypeError: Transform function invalid for data types
How do I go about troubleshooting, to find out why I am getting this error?
[EDIT] Uncommenting the pd.options. and pd.set_option lines did not change the outcome.
[EDIT2] Taking into consideration some of the replies below, I ran the following code on my data to output any lines of data that do not have a number in the 4th column:
#test data type
import sys

file1 = sys.argv[1]

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

with open(file1, 'r') as data:
    for row in data:
        a = row.strip().split()[3]
        if is_number(a) == False:
            print row.strip()
This worked on the test data, in which I had changed one row's fourth-column value from 1 to e: it output only the line containing the letter instead of a number. I then ran it on the original big data, but no lines were returned.
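For reference, a similar check can be done from inside pandas itself; a minimal sketch, assuming the same 12-column tab-separated layout as above (if v3 does not come back as a numeric dtype, or if map(type) reports more than one type, the grouped sum will run into non-numeric data):

import sys
import pandas as pd

file1 = sys.argv[1]
df = pd.read_csv(file1, sep='\t', header=None,
                 names=['v%d' % i for i in range(12)])
print(df.dtypes)                           # v3 should be int64 or float64
print(df['v3'].map(type).value_counts())   # more than one type means mixed data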
The exception you received is a TypeError, which hints at problems with the file. But with large files it is always possible that there are, e.g., memory problems in the code handling the comparisons. So there are two possibilities:
the file is broken
the code (yours or pandas's) is broken
In order to debug this, you may try to feed your file into your code in pieces (see the sketch after this list). At some point you will have isolated the problem, and it will be one of the two:
no matter which n lines you take, it throws an exception (but not with n-1 lines); memory management or something else in the code is broken
the problems can be isolated to a single line or a few lines of your data file; the data file is broken
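A minimal sketch of that bisection, assuming the same tab-separated 12-column layout as in the question: read the file in fixed-size chunks and rerun the failing groupby on each chunk, so the exception narrows the problem down to a line range (filename and chunk size here are hypothetical).

import pandas as pd

cols = ['v%d' % i for i in range(12)]
chunk_size = 1000000
reader = pd.read_csv('reads.soap', sep='\t', header=None,
                     names=cols, chunksize=chunk_size)
for i, chunk in enumerate(reader):
    try:
        chunk.groupby(['v0', 'v1', 'v2']).transform(sum)
    except TypeError:
        print('problem somewhere in lines %d-%d'
              % (i * chunk_size, (i + 1) * chunk_size - 1))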
I second merlin2011's guess: there is something unexpected in your file. It is unlikely that pandas would choke on a mere 20 million records.
Open the file /usr/local/lib/python2.7/dist-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/groupby.py, go to line 2799.
Right before the following statement, at the same indentation level, add a line to print the value of the offending data.
raise TypeError('Transform function invalid for data types')
Now, right before the TypeError is thrown, you will know what data caused the error.
Given that you are trying to sum, I would speculate that you have a non-numeric value in your column, but I do not have your data, so that is pure speculation.
I have taken a quick look at the code region around where the error occurs, and it appears that you should in this case be inspecting the object obj before the TypeError is raised.
for i, col in enumerate(obj):
    try:
        output[col] = self[col].transform(wrapper)
        inds.append(i)
    except Exception:
        pass

if len(output) == 0:  # pragma: no cover
    raise TypeError('Transform function invalid for data types')
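A minimal version of that inspection, inserted just above the raise (a sketch against the pandas 0.14 source quoted above; obj is the frame from that code):

if len(output) == 0:  # pragma: no cover
    # Temporary debugging aid: show what the transform actually received.
    print(obj.dtypes)
    print(obj.head())
    raise TypeError('Transform function invalid for data types')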
Here's how to troubleshoot something like this: create a function to wrap the operation (this will be a fair bit slower, as it is not cythonized), but it should catch your error.
def f(x):
    try:
        return x.sum()
    except:
        import pdb; pdb.set_trace()

df['v3'] = df.groupby(['v0','v1','v2']).transform(f).v3
Related
I'm reading large csv files into pandas (up to ~20 GB) and using the dtype parameter to control what types are loaded, in order to manage memory use. Occasionally one of the files has a bad row, such as one of the floats having two decimal points (e.g. 3.45.23). Apparently, when using pandas.read_csv like this, it does not care about error_bad_lines=False and just throws something like:
ValueError: could not convert string to float: '-0.0.10118473'
and TypeError: Cannot cast array from dtype('O') to dtype('float32') according to the rule 'safe'
My solution was, in the rare cases where this problem happens, to catch the exception (with a generic try: except:, no error types declared), go through the csv, and fix any bad lines (by checking them against a regex) before retrying the pandas load.
When I do this, the exception triggers the except clause, I print the error at the beginning of the except, and then it moves on to process the file. But then, from the middle of the except block, the same pandas error from the try block crashes the execution (there is no pandas call in the except block). The crash point is shown with a comment below.
How is it running in two places like this? Is there a fix?
read_succesufully = 0
attempted_fix = 0
while(not read_succesufully):
    try:
        self.in_data = pd.read_csv(filename, usecols=usecols, dtype=dtype,
                                   error_bad_lines=False, warn_bad_lines=True)
        read_succesufully = 1
    except:
        print('error parsing, attempting to fix bad csv ' + filename, sys.exc_info())
        pattern = r'(-?\d+\.?\d*(e-?\d+)?,)+-?\d+\.?\d*$'
        with open(filename, 'r') as file:
            lines = file.readlines()
        os.rename(filename, filename + '.bak')
        print('read in file %s with %d lines, attempting to fix' % (filename, len(lines)))
        #crash occurs here, with pandas ValueError????
        wrote_lines = 0
        with open(filename, 'w') as file:
            file.write(lines[0])
            for line in lines[1:]:
                if(re.match(pattern, line)):
                    file.write(line)
                    wrote_lines += 1
        print('wrote %d lines and renamed original file with ".bak". now attempting to re-read')
        attempted_fix = 1
I am reading a big file in chunks and doing some operations on each of the chunks. While reading one of them I got the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 26 fields in line 15929977, saw 118
which means that one of my file lines doesn't follow the same format as the others. I thought I could just omit this chunk, but I couldn't find a way to do it. I tried a try/except block as follows:
data = pd.read_table('ny_data_file.txt', sep=',',
                     header=0, encoding='latin1', chunksize=5000)
try:
    for chunk in data:
        # operations
except pandas.errors.ParserError:
    # Here is my problem
My problem is that if the chunk is not well parsed, my code goes straight to the exception without even entering the for loop, whereas what I would like is to skip this chunk, move forward to the next one, and perform the operations inside the loop on that one.
I have checked on Stack Overflow but couldn't find anything similar where the try wraps the for loop. Any help would be appreciated.
UPDATE:
I have tried to do as suggested in the comments:
try:
    for chunk in data:
        #operations
except pandas.errors.ParserError:
    # continue/pass/handle error
But it is still not catching the exception because, as said, the exception is raised when getting the chunk out of my data, not when doing operations with it.
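For reference, a sketch of what that observation means in code (hypothetical, not taken from the answers below): the ParserError comes out of the iterator itself, i.e. out of next(), so a try/except has to wrap the chunk retrieval rather than the operations. Whether reading can resume after such an error depends on the parser; the accepted fix further down uses error_bad_lines=False instead.

import pandas as pd

reader = pd.read_table('ny_data_file.txt', sep=',', header=0,
                       encoding='latin1', chunksize=5000)
while True:
    try:
        chunk = next(reader)          # this is where the ParserError is raised
    except StopIteration:
        break
    except pd.errors.ParserError:
        continue                      # attempt to skip the offending chunk
    # operations on chunk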
The way you use try/except makes it skip the entire for loop if an exception is caught inside it. If you want to skip only one iteration, you need to write the try/except inside the loop, like so:
for chunk in data:
    try:
        # operations
    except pandas.errors.ParserError as e:
        # inform the user of the error
        print("Error encountered while parsing chunk {}".format(chunk))
        print(e)
As I understand it, you get the exception in the operations part. If that is the case, you should just continue:
for chunk in data:
    try:
        # operations
    except pandas.errors.ParserError:
        continue
I am not sure where the exception is thrown; maybe adding a full error stack would help. If the error is thrown by the read_table() call, maybe you could try this:
try:
    data = pd.read_table('ny_data_file.txt', sep=',',
                         header=0, encoding='latin1', chunksize=5000)
except pandas.errors.ParserError:
    pass

for chunk in data:
    # operations
As suggested by @JonClements, what solved my problem was to use error_bad_lines=False in the pd.read_csv call, so it simply skipped the lines causing trouble and let me execute the rest of the for loop.
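A sketch of that call (note that in pandas 1.3+ the error_bad_lines/warn_bad_lines flags were replaced by on_bad_lines='skip'):

import pandas as pd

data = pd.read_table('ny_data_file.txt', sep=',', header=0,
                     encoding='latin1', chunksize=5000,
                     error_bad_lines=False, warn_bad_lines=True)
for chunk in data:
    pass  # operations on the chunks that parsed cleanly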
Trying to solve a problem, but the HackerRank compiler keeps throwing EOFError while parsing; I don't know where I am wrong.
#!usr/bin/python
b=[]
b=raw_input().split()
c=[]
d=[]
a=raw_input()
c=a.split()
f=b[1]
l=int(b[1])
if(len(c)==int(b[0])):
    for i in range(l,len(c)):
        d.append(c[i])
        #print c[i]
    for i in range(int(f)):
        d.append(c[i])
        #print c[i]
    for j in range(len(d)):
        print d[j],
I also tried try/except to solve it, but then I get no input.
try:
    a=input()
    c=a.split()
except(EOFError):
    a=""
The input format is two space-separated integers at the beginning and then the array.
The traceback is:
Traceback (most recent call last):
File "solution.py", line 4, in <module>
b=raw_input().split()
EOFError: EOF when reading a line
There are several ways to handle the EOF error.
1. Catch the exception:
while True:
    try:
        value = raw_input()
        do_stuff(value) # next line was found
    except (EOFError):
        break # end of file reached
2. Check the input content:
while True:
    value = raw_input()
    if (value != ""):
        do_stuff(value) # next line was found
    else:
        break
3. Use sys.stdin.readlines() to convert the input into a list, and then use a for-each loop. A more detailed explanation is in Why does standard input() cause an EOF error.
import sys

# Read input and assemble Phone Book
n = int(input())
phoneBook = {}
for i in range(n):
    contact = input().split(' ')
    phoneBook[contact[0]] = contact[1]

# Process Queries
lines = sys.stdin.readlines() # convert lines to list
for i in lines:
    name = i.strip()
    if name in phoneBook:
        print(name + '=' + str( phoneBook[name] ))
    else:
        print('Not found')
I faced the same issue, and this is what I noticed. I haven't seen your main function, but HackerRank already reads in all the data for us; we do not have to read in anything. For example, for a function def doSomething(a, b):, both a and b, whether arrays or plain integers, will be read in for us. We just have to focus on our main code without worrying about reading input. Also, make sure your function returns something at the end, otherwise you will get another error. HackerRank takes care of printing the final output too. Their code samples and FAQs are a bit misleading. This was my observation in my test; your test could be different.
It's because your function is expecting an input, but it was not provided. Provide a custom input and try to compile it. It should work.
I don't know why, but providing a custom input and compiling got me in, and it passed all test cases without my changing anything.
There is some code hidden below the main visible code in HackerRank. You need to expand it (note the line number where you got the error and check that line by expanding); that hidden code is valid, and your top visible code has to match it.
In my case there was something like the following:
regex_integer_in_range = r"___________" # Do not delete 'r'.
regex_alternating_repetitive_digit_pair = r"__________" # Do not delete 'r'.
I just filled in the blanks as below and it worked fine with the given hidden code:
regex_integer_in_range = r"^[0-9][\d]{5}$" # Do not delete 'r'.
regex_alternating_repetitive_digit_pair = r"(\d)(?=\d\1)" # Do not delete 'r'.
I have a large csv that I load as follows
df=pd.read_csv('my_data.tsv',sep='\t',header=0, skiprows=[1,2,3])
I get several errors during the loading process.
First, if I don't specify warn_bad_lines=True, error_bad_lines=False, I get:
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
Second, if I use the options above, I now get:
CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585
Question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?
I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):
from pandas import parser

try:
    df=pd.read_csv('mydata.tsv',sep='\t',header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
    print detail
but still get
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
I'll give my answer in two parts.
Part 1:
The OP asked how to output these bad lines. To answer this, we can use the Python csv module in simple code like this:
import csv

file = 'your_filename.csv' # use your filename
lines_set = set([100, 200]) # use your bad lines numbers here

with open(file) as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > max(lines_set):
            break
        elif line_number in lines_set: # put your bad lines numbers here
            print(line_number, row)
We can also put it in a more general function, like this:
import csv

def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    with open(file) as f_obj:
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)

if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
Part 2: the cause of the error you get.
It's hard to diagnose a problem like this without a sample of the file, but you should try this:
pd.read_csv(filename)
Does it parse the file with no error? If so, here is why: the number of columns is inferred from the first line. By using skiprows and header=0 you skipped the first three rows, which I guess contain the column names or a header with the correct number of columns. Basically you are constraining what the parser is doing, so parse without skiprows or header=0, then reindex to what you need later.
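A sketch of that suggestion, assuming the three skipped rows are ordinary rows you simply do not want: let the parser see every line so the field count is inferred consistently, then drop those rows afterwards.

import pandas as pd

df = pd.read_csv('my_data.tsv', sep='\t', header=0)    # no skiprows
df = df.drop(index=[0, 1, 2]).reset_index(drop=True)   # the rows skiprows=[1, 2, 3] was removing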
Note: if you are unsure about which delimiter is used in the file, use sep=None, but it will be slower.
From the pandas.read_csv docs:
sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
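A sketch of the sep=None route from the note above; the Python engine is selected explicitly, since the C engine cannot sniff the delimiter.

import pandas as pd

df = pd.read_csv('my_data.tsv', sep=None, engine='python',
                 header=0, skiprows=[1, 2, 3])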
In my case, adding a separator helped:
data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')
We can get the line number from the error and print that line to see what it looks like.
Try:
import subprocess
import re
import pandas as pd
from pandas import parser

try:
    filename = 'mydata.tsv'
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
    print detail
    err = re.findall(r'\b\d+\b', str(detail)) # gives all the numbers ['22', '329867', '24']; the line number is at index 1
    line = subprocess.check_output("sed -n %s %s" % (str(err[1]) + 'p', filename), stderr=subprocess.STDOUT, shell=True) # shell command 'sed -n 2p filename' prints line 2 of filename
    print 'Bad line'
    print line # to see line
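A shell-free variant of the same idea (just a sketch, reusing filename and err from the snippet above; linecache is in the standard library and uses 1-based line numbers):

import linecache

print linecache.getline(filename, int(err[1])) # print the offending line without calling sed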
I have been using this piece of code:
import glob
import json

def read_text_files(filename):
    # Creates JSON Decoder
    decoder = json.JSONDecoder()
    with open(filename, 'r') as inputfile:
        # Returns next item in input file, removes whitespace from it and saves it in line
        line = next(inputfile).strip()
        while line:
            try:
                # Returns 2-tuple of Python representation of data and index where data ended
                obj, index = decoder.raw_decode(line)
                # Remove object
                yield obj
                # Remove already scanned part of line from rest of file
                line = line[index:]
            except ValueError:
                line += next(inputfile).strip()
                if not line:
                    line += next(inputfile).strip()
                global count
                count += 1
                print str(count)

all_files = glob.glob('Documents/*')
for filename in all_files:
    for data in read_text_files(filename):
        rawTweet = data['text']
        print 'Here'
It reads in a JSON file and decodes it. However, what I realise is that when I place the count and print statements inside the ValueError handler, I'm losing almost half of the documents being scanned here - they never make it back to the main method.
Could somebody explain to me exactly what the try statement is doing and why I'm losing documents in the except part? Is it due to bad JSON?
Edit: Including more code
Currently, with the code posted, the machine prints:
"Here"
2
3 etc...
199
Here
200
Here (alternating like this until)...
803
804
805 etc...
1200
Is this happening because some of the JSON is corrupt? Is it because some of the documents are duplicates (and some definitely are)?
Edit 2:
Interestingly, deleting:
line = next(inputfile).strip()
while line:
and replacing it with:
for line in inputfile:
appears to have fixed the problem. Is there a reason for this?
The try statement is specifying a block of statements for which exceptions are handled through the following except blocks (only one in your case).
My impression is that with your modifications you are making a second exception trigger inside the exception handler itself. This makes control go to a higher-level exception handler, even outside function read_text_files. If no exception occurs in the exception handler, the loop can continue.
Please check that count exists and has been initialized with an integer value (say 0).
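A minimal sketch of that last check: initialise the counter at module level, before any file is processed, so that count += 1 inside the handler cannot raise a NameError of its own (which would be exactly the kind of second exception described above).

count = 0  # must exist before read_text_files() ever increments it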