Pandas read_csv produces unexpected behavior, why? - python

I've got a large tab-separated file and am trying to load it using
import pandas as pd
df = pd.read_csv(..., sep="\t")
however, the process crashes with the error being
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 1743925, saw 12
Nothing appeared wrong with that particular line when I printed it out manually. Feeling confident that there was nothing wrong with my file, I went and calculated the field counts myself...
from collections import Counter

lengths = []
with open(...) as f:
    for line in f:
        lengths.append(len(line.split('\t')))
c = Counter(lengths)
print(c)
...and got the result Counter({8: 2385674}). So I wondered what pandas does differently, but the error is raised inside a .pyx file, so I cannot set a breakpoint there. What could be the cause of this? Where is my expectation flawed?

Fixed the issue. It turns out the problem was mismatched quoting between the CSV export and the read. The issue was solved by matching the quoting on read_csv with the quoting on the to_csv call that created the loaded file. I assume some tabs and newlines were treated as parts of quoted string literals because of this, hence the parser counting 11 tab characters on one row (it was actually 2+ physical rows).
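For anyone hitting the same thing, here is a minimal sketch of how such a quoting mismatch can manifest (the frame, buffer, and constants here are illustrative, not the original data):

import csv
import io
import pandas as pd

# A field containing a literal tab; to_csv (default quoting=csv.QUOTE_MINIMAL)
# wraps it in quotes on export.
df_out = pd.DataFrame({'a': ['x\ty'], 'b': [1]})
buf = io.StringIO()
df_out.to_csv(buf, sep='\t', index=False)

# Reading back with mismatched quoting: QUOTE_NONE treats the quotes as data,
# so the embedded tab splits the field and the row appears to have extra columns.
buf.seek(0)
try:
    pd.read_csv(buf, sep='\t', quoting=csv.QUOTE_NONE)
except pd.errors.ParserError as err:
    print(err)  # Expected 2 fields in line 2, saw 3

# Matching the quoting used on export parses the file cleanly.
buf.seek(0)
print(pd.read_csv(buf, sep='\t', quoting=csv.QUOTE_MINIMAL))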

Related

Pandas kernel appears to have died. It will restart automatically

I tried to load a CSV file with 43186 rows using this code:
import csv
import pandas as pd
df = pd.read_csv('file.csv', sep=',', engine='python', error_bad_lines=False)
it outputs
Skipping line 2574: field larger than field limit (131072)
Skipping line 892: Expected 13 fields in line 892, saw 15
Skipping line 6376: Expected 13 fields in line 6376, saw 15
Skipping line 35433: Expected 13 fields in line 35433, saw 15
before the kernel eventually dies. I tried with some other, larger CSVs and the exact same code works for those. How can I fix this? I'm OK with skipping the lines. I tried to increase the limit with csv.field_size_limit(sys.maxsize), but it doesn't work either. I skimmed from line 35433 through the end and there are no more bad lines there, and if bad lines do exist, they should be skipped by error_bad_lines=False, right? Any help would be appreciated!
field larger than field limit (131072)
This probably means that multiple (a lot of) lines are seen as one multiline field by the parser, so there must be something wrong with the quoting in the csv file.
One solution (apart from fixing the csv file) could be to somehow tell the parser not to scan for multiline fields; such a line would then simply be marked as invalid.
Or maybe the csv file does not use quoted fields at all (which would allow quote characters inside fields). In that case you should tell the parser that the fields are not quoted, as sketched below.
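A minimal sketch of that last suggestion, assuming the filename from the question and that the file genuinely has no quoted fields (note that error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' is the replacement from pandas 1.3 on):

import csv
import pandas as pd

# Treat quote characters as ordinary data so a stray quote cannot swallow
# many physical lines into one multiline field.
df = pd.read_csv(
    'file.csv',
    sep=',',
    engine='python',
    quoting=csv.QUOTE_NONE,
    on_bad_lines='skip',  # pandas >= 1.3; older versions: error_bad_lines=False
)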

ValueError Reading large data set with pd.read_json

I am working through a set of code exercises that use a Yelp reviews dataset. At this point in the exercises I am supposed to read in review.json, which has one JSON record per line. I have made a smaller version of the JSON file, with only 100 records, for testing.
I can read the entire test file into a pandas dataframe and examine it.
The complete dataset file, however, has about 6 million lines. The recommendation is to use chunksize and build a json reader. I'm hitting errors, even with my test input.
My code currently looks like this
path = 'file://localhost/Users/.../DSC_Intro/'
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(StringIO(filename), lines=True, chunksize=10)
type(review_reader)
The type call returns
pandas.io.json.json.JsonReader
which looks good.
Then I try
for chunk in review_reader:
    print(chunk)
as referenced in the pandas user guide, and I get an error:
ValueError: Unexpected character found when decoding 'false'
Update - it has been suggested that the issue is caused by embedded (quoted) "\n" characters in the data file, and that pandas is therefore seeing the JSON records not as one per line but as spanning multiple lines.
If that's the case, the error message is VERY opaque. Also, with 6 million lines, how should I tell pd.read_json to ignore embedded "\n" and only look at actual newlines in the data?
Update
It's been suggested that I fix my typo (it was a typo in this post, not a typo in my code) and use a plain Unix file path instead of a URL (JSON doesn't care: see docs).
When I do this but keep StringIO(), I get a different ValueError.
When I do this and remove StringIO(), the code works.
This seems to be very fragile. :-(
Note The tutorial has an answer key. I've tried that code. The answer key uses
review_reader = pd.read_json(filename, lines=True, chunksize=10)
which throws the TypeError
sequence item 0: expected str instance, bytes found
Adding StringIO() seems to have solved that.
Input Sample JSON record, one per line of the input file.
{"review_id":"Amo5gZBvCuPc_tZNpHwtsA","user_id":"DzZ7piLBF-WsJxqosfJgtA","business_id":"qx6WhZ42eDKmBchZDax4dQ","stars":5.0,"useful":1,"funny":0,"cool":0,"text":"Our family LOVES the food here. Quick, friendly, delicious, and a great restaurant to take kids to. 5 stars!","date":"2017-03-27 01:14:37"}
Firstly, your
path 'file://localhost/Users/.../DSC_Intro/'
is not valid python code. If you try to execute that as such, you will get an invalid syntax error. I assume, then, that this is just showing the value of the path variable. I don't know whether those ellipses are literal or the result of your environment truncating the display of path. I'll assume here that your path is a valid file URL for your system as it doesn't seem germane here to consider an incorrect path.
Either way, yes, read_json can read json from a file URL as you're specifying there (I learned something there) if you read it in one go:
pd.read_json(fname, lines=True)
But if you try to create a reader from this, by specifying
pd.read_json(fname, lines=True, chunksize=...)
then you get
TypeError: sequence item 0: expected str instance, bytes found
Secondly, yes, wrapping your file-like argument with StringIO makes this error go away, but it isn't helping for the reason you might think, and its use is based on a misreading of the pandas docs you point to.
I'll quote a couple of bits from the read_json doc here:
Signature: pd.read_json(
path_or_buf=None, ...
path_or_buf : a valid JSON string or file-like, default: None
The string could be a URL. Valid URL schemes include http, ftp, s3,
gcs, and file. For file URLs, a host is expected. For instance, a local
file could be file://localhost/path/to/table.json
So with read_json, you can either give it an actual string that is valid JSON, or you can give it a file-like object that points to a file that contains JSON.
Notice in the pandas docs that you cite:
In [258]: jsonl = '''
.....: {"a": 1, "b": 2}
.....: {"a": 3, "b": 4}
.....: '''
.....:
is JSON, not a path. When their example then does:
df = pd.read_json(jsonl, lines=True)
it is merely parsing the JSON in the string - no files are involved here.
When it then wants to demonstrate reading from a file in chunks, it does
# reader is an iterator that returns `chunksize` lines each iteration
In [262]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
In other words, they are wrapping a JSON string, not a path, by StringIO(). This is just for the purposes of the documented example, so you can see that if you treated the JSON string as if it were being read from a file you can read it in chunks. That's what StringIO() does. So when you wrap the string that describes your file URL in StringIO(), I expect that read_json is then trying to interpret that string as JSON that's being read from a file and parse it. It understandably falls over because it isn't JSON.
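To make the distinction concrete, here is a minimal sketch (the JSON string and the path are illustrative):

from io import StringIO
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

# StringIO presents the JSON *content* as a file-like object, so chunked
# reading works.
for chunk in pd.read_json(StringIO(jsonl), lines=True, chunksize=1):
    print(chunk)

# Wrapping a *path* in StringIO asks pandas to parse the path string itself
# as JSON, which fails because a path is not JSON.
pd.read_json(StringIO('/some/path/review_100.json'), lines=True)  # ValueError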
This brings us back to why read_json cannot read your file URL in chunks. I don't have an immediate good answer to that. I suspect it lies in the internals of how read_json opens file URLs, or what function underlies this. If you were intent upon, or forced to, do this chunking from a file URL then I suspect you'd be looking at controlling the mode in which the file is opened, or perhaps somehow providing explicit guidance to read_json how to interpret the bytestream it gets. Libraries such as urllib2 may be useful here, I'm not sure.
But let's cut to the best fix here. Why are we trying to specify the path as a file URL? Simply specify your path as an OS path, e.g.
path = '/path/to/my/data/'
and then
filename = path + 'yelp_dataset/review_100.json'
# create a reader to read in chunks
review_reader = pd.read_json(filename, lines=True, chunksize=10)
And I betcha it works as intended! (It does for me, as it always has).
Caveat: Windows doesn't use forward-slash path delimiters, and constructing paths by concatenating strings as above can be fragile, but usually if you use 'proper' forward-slash delimiters (smile), decent languages internally understand that. It's constructing paths with backslashes that is guaranteed to cause you pain. Just keep an eye on that; a more portable sketch follows below.
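For what it's worth, here is a sketch of a more portable way to build the path (the directory names are placeholders):

from pathlib import Path
import pandas as pd

# pathlib joins path components with the right separator on any OS.
path = Path('/path/to/my/data')
filename = path / 'yelp_dataset' / 'review_100.json'

review_reader = pd.read_json(filename, lines=True, chunksize=10)
for chunk in review_reader:
    print(chunk.shape)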

Error tokenizing data during Pandas read_csv. How to actually see the bad lines?

I have a large csv that I load as follows
df=pd.read_csv('my_data.tsv',sep='\t',header=0, skiprows=[1,2,3])
I get several errors during the loading process.
First, if I don't specify warn_bad_lines=True, error_bad_lines=False I get:
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
Second, if I use the options above, I now get:
CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585
Question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?
I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):
from pandas import parser

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1,2,3])
except parser.CParserError as detail:
    print detail
but still get
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
I'll give my answer in two parts.
Part 1:
The OP asked how to output these bad lines. To answer this, we can use Python's csv module in simple code like this:
import csv

file = 'your_filename.csv'   # use your filename
lines_set = set([100, 200])  # use your bad line numbers here

with open(file) as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > max(lines_set):
            break
        elif line_number in lines_set:  # print only the bad lines
            print(line_number, row)
We can also put it in a more general function, like this:

import csv

def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    with open(file) as f_obj:
        # use the reader passed in as a parameter
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)

if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
Part 2: the cause of the error you get:
It's hard to diagnose a problem like this without a sample of the file you use, but you should try this:
pd.read_csv(filename)
Does it parse the file with no error? If so, I will explain why.
The number of columns is inferred from the first line. By using skiprows and header=0 you skipped the first 3 rows, which I guess contain the column names or a header with the correct number of columns.
Basically you are constraining what the parser is doing. So parse without skiprows or header=0, then reindex to what you need later, as sketched below.
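A minimal sketch of that suggestion (the filename is from the question; whether it parses cleanly depends on your file):

import pandas as pd

# Let the parser infer the width from the real header row instead of
# constraining it with skiprows, then drop the unwanted rows afterwards.
df = pd.read_csv('my_data.tsv', sep='\t', header=0)
df = df.iloc[3:].reset_index(drop=True)  # drop the 3 rows previously skipped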
Note:
If you are unsure about which delimiter is used in the file, use sep=None, but it will be slower.
from pandas.read_csv docs:
sep : str, default ‘,’ Delimiter to use. If sep is None, the C engine
cannot automatically detect the separator, but the Python parsing
engine can, meaning the latter will be used and automatically detect
the separator by Python’s builtin sniffer tool, csv.Sniffer. In
addition, separators longer than 1 character and different from '\s+'
will be interpreted as regular expressions and will also force the use
of the Python parsing engine. Note that regex delimiters are prone to
ignoring quoted data. Regex example: '\r\t'
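For instance, a one-line sketch (filename from the question):

import pandas as pd

# sep=None forces the python engine, which sniffs the delimiter.
df = pd.read_csv('my_data.tsv', sep=None, engine='python')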
In my case, adding a separator helped:
data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')
We can get the line number from the error and print that line to see what it looks like.
Try:
import re
import subprocess
import pandas as pd
from pandas import parser

try:
    filename = 'mydata.tsv'
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1,2,3])
except parser.CParserError as detail:
    print detail
    # all the numbers in the message, e.g. ['22', '329867', '24'];
    # the line number is at index 1
    err = re.findall(r'\b\d+\b', str(detail))
    # shell command 'sed -n 329867p filename' prints that line of the file
    line = subprocess.check_output("sed -n %s %s" % (err[1] + 'p', filename),
                                   stderr=subprocess.STDOUT, shell=True)
    print 'Bad line'
    print line  # to see the line

Pandas ParserError EOF character when reading multiple csv files to HDF5

Using Python3, Pandas 0.12
I'm trying to write multiple csv files (total size is 7.9 GB) to an HDF5 store to process later on. The csv files contain around a million rows each and 15 columns; data types are mostly strings, but some floats. However, when I try to read the csv files I get the following error:
Traceback (most recent call last):
File "filter-1.py", line 38, in <module>
to_hdf()
File "filter-1.py", line 31, in to_hdf
for chunk in reader:
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
yield self.read(self.chunksize)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
Edit:
I managed to find a file that produces this problem. I think it's reading an EOF character. However, I have no clue how to overcome this problem. Given the large size of the combined files, I think it's too cumbersome to check each single character in each string. (Even then I would still not be sure what to do.) As far as I checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
My code is the following:
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob

def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
Edit
If I go into the CSV file that raises the CParserError EOF... and manually delete all rows after the line that is causing the problem, the csv file is read properly. However, all I'm deleting are blank rows anyway.
The weird thing is that when I manually correct the erroneous csv files, they are loaded fine into the store individually. But when I again use a list of multiple files, the 'false' files still give me errors.
I had a similar problem. The line listed in the 'EOF inside string' error contained a string with a single quote mark (') inside it. When I added the option quoting=csv.QUOTE_NONE, it fixed my problem.
For example:
import csv
df = pd.read_csv(csvfile, header = None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
I had the same problem, and after adding these two parameters to my code the problem was gone:
read_csv(..., quoting=3, error_bad_lines=False)
(quoting=3 is csv.QUOTE_NONE.)
I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.
From the csv.py docstring:
* quoting - controls when quotes should be generated by the writer.
It can take on any of the following module constants:
csv.QUOTE_MINIMAL means only when required, for example, when a
field contains either the quotechar or the delimiter
csv.QUOTE_ALL means that quotes are always placed around fields.
csv.QUOTE_NONNUMERIC means that quotes are always placed around
fields which do not parse as integers or floating point
numbers.
csv.QUOTE_NONE means that quotes are never placed around fields.
csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file you have a quotechar, it will be parsed as a string until another occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars will be parsed as a single string. Even if there are many line breaks (expected to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate with an example, consider this:
In[4]: import pandas as pd
...: from io import StringIO
...: test_csv = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: p,q,r
...: s,t,u
...: '''
...:
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]:
a b c
0 d,e,f\ng,h,i\nm n o
1 p q r
2 s t u
In[7]: test_csv_2 = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: "p,q,r
...: s,t,u
...: '''
...: test_2 = StringIO(test_csv_2)
...:
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The first string has 2 (an even number of) quotechars. So each quotechar is closed and the csv is parsed without an error, although probably not as we expected. The other string has 3 (an odd number of) quotechars. The last one is not closed and the EOF is reached, hence the error. But the line 2 that we get in the error message is misleading. We would expect 4, but since everything between the first and second quotechar is parsed as a string, our "p,q,r line is actually the second.
Making your inner loop like this will allow you to detect the 'bad' file (and investigate further):
from pandas.io import parser

def to_hdf():
    .....
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)
            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)
        except parser.CParserError as detail:
            print(f, detail)
The solution is to use the parameter engine='python' in the read_csv function. The pandas CSV parser can use two different "engines" to parse a CSV file: Python, or C (which is the default).
pandas.read_csv(filepath, sep=',', delimiter=None,
header='infer', names=None,
index_col=None, usecols=None, squeeze=False,
..., engine=None, ...)
The Python engine is described to be “slower, but is more feature complete” in the Pandas documentation.
engine : {‘c’, ‘python’}
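Applied to a call like the ones above, a sketch (the filename is a placeholder):

import pandas as pd

# the python engine is slower but more tolerant of irregular lines
df = pd.read_csv('my_data.csv', engine='python')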
My error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 4488
was resolved by adding delimiter="\t" in my code as:
import pandas as pd
df = pd.read_csv("filename.csv", delimiter="\t")
Use
engine="python",
error_bad_lines=False,
on the read_csv. The full call will look like this:
df = pd.read_csv(csvfile,
                 delimiter="\t",
                 engine="python",
                 error_bad_lines=False,
                 encoding='utf-8')
For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gives the error C error: EOF inside string starting at line. Using a different quoting didn't give the desired results either, since I did not want to have quotes in my text.
I realised that there was a bug in Pandas 0.20. Upgrading to version 0.21 completely solved my issue. More info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559
Note: this may be Windows-related as mentioned in the URL.
After looking for a solution for hours, I have finally come up with a workaround.
The best way to eliminate the C error: EOF inside string starting at line exception without losing multiprocessing efficiency is to preprocess the input data (if you have the opportunity).
Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.
After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'. A sketch follows below.
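A hedged sketch of that workaround, assuming the default '"' quotechar and placeholder filenames; it only hides newlines that occur inside quoted fields, then restores them after parsing:

import pandas as pd

SENTINEL = 'aghr21*&'  # any sequence guaranteed absent from the data

# Hide newlines that occur inside quoted fields behind the sentinel.
with open('input.csv', encoding='utf-8') as src, \
        open('clean.csv', 'w', encoding='utf-8') as dst:
    in_quotes = False
    for ch in src.read():
        if ch == '"':
            in_quotes = not in_quotes
        if ch == '\n' and in_quotes:
            dst.write(SENTINEL)
        else:
            dst.write(ch)

df = pd.read_csv('clean.csv')

# Restore the embedded newlines in all string columns.
for col in df.select_dtypes(include='object'):
    df[col] = df[col].str.replace(SENTINEL, '\n', regex=False)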
Had a similar issue while trying to pull data from a GitHub repository. Simple mistake: I was trying to pull data from the git blob (the HTML-rendered part) instead of the raw csv.
If you're pulling data from a git repo, make sure your link doesn't include a <repo name>/blob unless you're specifically interested in HTML code from the repo.

Writing Integers to a File

I'm having a really difficult time writing integers out to a file. Here's my situation. I have a file, let's call it 'idlist.txt'. It has multiple columns and is fairly long (10,000 rows), but I only care about the first column of data.
I'm loading it into python using:
import numpy as np
FH = np.loadtxt('idlist.txt',delimiter=',',comments='#')
# Testing initial data type
print FH[0,0],type(FH[0,0])
>>> 85000370342.0 <type 'numpy.float64'>
# Converting to integers
F = [int(FH[i,0]) for i in range(len(FH))]
print F[0],type(F[0])
>>> 85000370342 <type 'long'>
As you can see, the data must be made into integers. What I would now like to do is write the entries of this list out as the first column of another file (really the only column in the entire file); call it 'idonly.txt'. Here is how I'm trying to do it:
with open('idonly.txt','a') as f:
    for i in range(len(F)):
        f.write('%d\n' % (F[i]))
This is clearly not producing the desired output: when I open the file 'idonly.txt', each entry is actually a float (i.e., 85000370342.0). What exactly is going on here, and why is writing integers to a file such a complicated task? I found the string formatting idea in How to write integers to a file, but it didn't fix my issue.
Okay, well it appears that this is completely my fault. When opening the file I'm using mode 'a', which means append. It turns out that the first time I wrote this out to a file I did it incorrectly, and ever since I've been appending the correct answer onto that and simply not looking down as far as I should, since it's a really long file.
For reference, here are all of the modes you can use when handling files in Python: http://www.tutorialspoint.com/python/python_files_io.htm. Choose carefully.
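A sketch of the fix implied above: open with mode 'w' so stale contents are replaced instead of appended to (filename and list F as in the question):

# 'w' truncates the file first, so old incorrect lines cannot linger
with open('idonly.txt', 'w') as f:
    for val in F:
        f.write('%d\n' % val)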
Try using:
f.write('{}\n'.format(F[i]))
