I've got a large tab-separated file and am trying to load it using
import pandas as pd
df = pd.read_csv(..., sep="\t")
but the process crashes with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 1743925, saw 12
Nothing appeared wrong with that particular line when I printed it out manually. Confident that there was nothing wrong with my file, I went and tried to calculate the field counts myself...
from collections import Counter

lengths = []
with open(...) as f:
    for line in f:
        lengths.append(len(line.split('\t')))
c = Counter(lengths)
print(c)
...and got the result Counter({8: 2385674}). So I was wondering what pandas does differently, but the error is raised inside a .pyx file, so I cannot set a breakpoint there. What could be the cause of this? Where is my expectation flawed?
Fixed the issue. It turns out the problem was different quoting settings on the csv export and the read. The issue was solved by matching the quoting of read_csv with the quoting of the to_csv call that created the loaded file. I assume that, because of the mismatch, some tabs and newlines were treated as parts of quoted string literals, hence pandas seeing 11 tab characters (12 fields) on one "row" that was actually 2+ physical rows.
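A minimal sketch of that failure mode and the fix (the file name and data here are made up):

import csv
import pandas as pd

df_out = pd.DataFrame({"text": ["line1\nline2", "plain"], "n": [1, 2]})
# to_csv quotes the field containing the embedded newline (QUOTE_MINIMAL).
df_out.to_csv("data.tsv", sep="\t", index=False, quoting=csv.QUOTE_MINIMAL)

# Reading with the same quoting keeps that newline inside one field.
# Reading with quoting=csv.QUOTE_NONE instead would split the record
# across two physical rows and shift the per-line field counts.
df_in = pd.read_csv("data.tsv", sep="\t", quoting=csv.QUOTE_MINIMAL)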
I have several csv files. Each csv file comes with a description spanning several rows (15 rows in some files, 100 rows in others, etc.). I want to read the csv files into dataframes. I tried to use pandas.DataFrame('file1.csv') to read the data into a dataframe. However, I am getting the following error.
Traceback (most recent call last):
File "snowdepthData.py", line 5, in <module>
depthDF = pd.DataFrame('Alaska_SD_Sep2019toOct2020.csv')
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 485, in __init__
raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!
Is there any way I can skip reading the description and convert the data into a dataframe?
Thank you.
Those lines seem to begin with #, so you can probably use the comment parameter:
comment : str, optional
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in 'a,b,c' being treated as the header.
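A quick sketch of that behaviour (the data here is made up):

import pandas as pd
from io import StringIO

raw = "# description line 1\n# description line 2\na,b,c\n1,2,3\n"
# Lines starting with '#' are skipped entirely; 'a,b,c' becomes the header.
df = pd.read_csv(StringIO(raw), comment="#")
print(df)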
You can use the pandas read_csv() function (see the documentation here) to read the csv file.
In this function you can add an argument called "skiprows" and define the number of rows that should be skipped when reading the file.
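For example, if the description takes up the first 15 rows (adjust the count per file; the file name is the one from the question):

import pandas as pd

df = pd.read_csv("Alaska_SD_Sep2019toOct2020.csv", skiprows=15)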
I am new to Stack Overflow, so if my post is not correctly posted or you need more info, please let me know. I have a really weird problem. I have a txt file with a lot of lines separated by ";". Normally there should be 42 fields/columns, but for some reason, when the file is imported and separated by ";", a large number of lines are skipped because python "expected 42 fields, saw 45". I import the file using pandas, as most of my transformations are done with it:
text = pd.read_csv('file.txt', encoding='ISO-8859-1', keep_default_na=False, error_bad_lines=False, sep=';')
What I found out is that some lines have 3 extra ";" at the end. Because most of the data is confidential and I cannot share it outside my company, I generated a similar 3-line txt file to show you where my issue lies.
;;;5123123;text1;text2;;;;123124;text3;text4;;;;5234234;text5;text6;;;;412321;text7;text8;;;;512312;text9;text10;;;;15123213;text11;text12;;;;123123;text13;text14
;;;4666190;text1;text2;;;;312312;text3;text4;;;;5123123;text5;text6;;;;;;;;;;;;;;;;;;;;;;55123;text7;text8
;;;5123123;text1;text2;;;;1321321;text3;text4;;;;123124;text5;text6;;;;;;;;;;;;;;;;;;;;;;3123123;512312312;text7;;;
Those are three similar lines from my file, but with substituted names. The first and second lines are correct, but the third yields 45 fields when imported.
So is there a way to go through the file before importing it, look for all lines starting with ;;;5123123, check whether there are extra ";" at the end, remove them if there are, and after that import the file? The problem is only with some lines starting with ;;;5123123. There are a few hundred lines with this error, and the whole data set is a little more than 50k lines.
I believe pd is pandas, so you can use the usecols argument of the read_csv method:
text = pd.read_csv('file.txt',
                   encoding='ISO-8859-1',
                   keep_default_na=False,
                   error_bad_lines=False,
                   sep=';',
                   usecols=list(range(43)),
                   names=list(range(43)),
                   header=None)
Edited
You can also add the names and header arguments.
Have you tried splitting into a list and then removing the blank elements?
with open('file.txt', encoding='ISO-8859-1') as f:
    raw_str = f.read()
full_list = raw_str.split(';')
templist = list(filter(None, full_list))
Printing templist gives a list of all the elements. You can perform any action on it, for example joining it back into a string with a for loop, according to your requirement.
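If you would rather pre-clean the file exactly as the question describes, here is a minimal sketch under the question's assumptions (42 expected fields, ';' separator, ISO-8859-1 encoding; the output file name is made up). It trims only surplus trailing empty fields, so legitimate empty columns at the end of a 42-field line are left alone:

with open("file.txt", encoding="ISO-8859-1") as src, \
        open("file_clean.txt", "w", encoding="ISO-8859-1") as dst:
    for line in src:
        fields = line.rstrip("\n").split(";")
        # Drop surplus trailing empty fields beyond the expected 42.
        while len(fields) > 42 and fields[-1] == "":
            fields.pop()
        dst.write(";".join(fields) + "\n")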
I'm trying to get only the first 100 rows of a csv.gz file that has over 4 million rows in Python. I also want information on the number of columns and the headers of each. How can I do this?
I looked at python: read lines from compressed text files to figure out how to open the file, but I'm struggling with how to actually print the first 100 rows and get some metadata on the information in the columns.
I found this Read first N lines of a file in python but not sure how to marry this to opening the csv.gz file and reading it without saving an uncompressed csv file.
I have written this code:
import gzip
import csv
import json
import pandas as pd

df = pd.read_csv('google-us-data.csv.gz', compression='gzip', header=0, sep=' ', quotechar='"', error_bad_lines=False)
for i in range(100):
    print df.next()
I'm new to Python and I don't understand the results. I'm sure my code is wrong and I've been trying to debug it but I don't know which documentation to look at.
I get these results (and it keeps going down the console - this is an excerpt):
Skipping line 63: expected 3 fields, saw 7
Skipping line 64: expected 3 fields, saw 7
Skipping line 65: expected 3 fields, saw 7
Skipping line 66: expected 3 fields, saw 7
Skipping line 67: expected 3 fields, saw 7
Skipping line 68: expected 3 fields, saw 7
Skipping line 69: expected 3 fields, saw 7
Skipping line 70: expected 3 fields, saw 7
Skipping line 71: expected 3 fields, saw 7
Skipping line 72: expected 3 fields, saw 7
Pretty much what you've already done, except read_csv also has nrows where you can specify the number of rows you want from the data set.
Additionally, to prevent the errors you were getting, you can set error_bad_lines to False. You'll still get warnings (if that bothers you, set warn_bad_lines to False as well). These are there to indicate inconsistency in how your dataset is filled out.
import pandas as pd

data = pd.read_csv('google-us-data.csv.gz', nrows=100, compression='gzip',
                   error_bad_lines=False)
print(data)
You can easily do something similar with the built-in csv library, but it'll require a for loop to iterate over the data, as shown in other examples.
I think you could do something like this (from the gzip module examples)
import gzip
with gzip.open('/home/joe/file.txt.gz', 'rb') as f:
    header = f.readline()
    # Read lines any way you want now.
The first answer you linked suggests using gzip.GzipFile - this gives you a file-like object that decompresses for you on the fly.
Now you just need some way to parse csv data out of a file-like object ... like csv.reader.
The csv.reader object will give you a list of fieldnames, so you know the columns, their names, and how many there are.
Then you need to get the first 100 csv row objects, which will work exactly like in the second question you linked, and each of those 100 objects will be a list of fields.
So far this is all covered in your linked questions, apart from knowing about the existence of the csv module, which is listed in the library index.
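Putting those pieces together, a sketch might look like this (the file name is the one from the question):

import csv
import gzip
from itertools import islice

with gzip.open("google-us-data.csv.gz", "rt", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)            # the column names
    print(len(header), header)       # how many columns there are, and what they are
    for row in islice(reader, 100):  # the first 100 data rows
        print(row)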
Your code is OK; those "Skipping line" messages are just warnings from pandas read_csv:
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True,
a warning for each “bad line” will be output. (Only valid with C parser).
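So a sketch that silences both the errors and the warnings would be (note these flags are for older pandas; from 1.3 on they are replaced by on_bad_lines="skip"):

import pandas as pd

df = pd.read_csv("google-us-data.csv.gz", compression="gzip",
                 error_bad_lines=False, warn_bad_lines=False)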
Using Python3, Pandas 0.12
I'm trying to write multiple csv files (total size is 7.9 GB) to an HDF5 store to process later on. The csv files contain around a million rows each and 15 columns; the data types are mostly strings, but some floats. However, when I try to read the csv files I get the following error:
Traceback (most recent call last):
File "filter-1.py", line 38, in <module>
to_hdf()
File "filter-1.py", line 31, in to_hdf
for chunk in reader:
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
yield self.read(self.chunksize)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
Edit:
I managed to find a file that produces this problem. I think it's reading an EOF character. However, I have no clue how to overcome this problem. Given the large size of the combined files, I think it's too cumbersome to check each single character in each string. (Even then I would still not be sure what to do.) As far as I checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
My code is the following:
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob

def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
Edit
If I go into the CSV file that raises the "CParserError: EOF inside string..." and manually delete all rows after the line that is causing the problem, the csv file is read properly. However, all I'm deleting are blank rows anyway.
The weird thing is that when I manually correct the erroneous csv files, they are loaded into the store fine individually. But when I again use a list of multiple files, the 'false' files still return errors.
I had a similar problem. The line listed with the 'EOF inside string' error had a string that contained a single quote mark (') within it. When I added the option quoting=csv.QUOTE_NONE it fixed my problem.
For example:
import csv
import pandas as pd

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
I have the same problem, and after adding these two parameters to my code the problem was gone:

pd.read_csv(..., quoting=3, error_bad_lines=False)

(quoting=3 is csv.QUOTE_NONE.)
I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.
From the csv.py docstring:
* quoting - controls when quotes should be generated by the writer.
  It can take on any of the following module constants:

  csv.QUOTE_MINIMAL means only when required, for example, when a
  field contains either the quotechar or the delimiter
  csv.QUOTE_ALL means that quotes are always placed around fields.
  csv.QUOTE_NONNUMERIC means that quotes are always placed around
  fields which do not parse as integers or floating point numbers.
  csv.QUOTE_NONE means that quotes are never placed around fields.
csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file there is a quotechar, everything after it will be parsed as one string until the next occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars is parsed as a single string: even if there are many line breaks (which you would expect to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate with an example, consider this:
In[4]: import pandas as pd
...: from io import StringIO
...: test_csv = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: p,q,r
...: s,t,u
...: '''
...:
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]:
a b c
0 d,e,f\ng,h,i\nm n o
1 p q r
2 s t u
In[7]: test_csv_2 = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: "p,q,r
...: s,t,u
...: '''
...: test_2 = StringIO(test_csv_2)
...:
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The first string has 2 (an even number of) quotechars, so each quotechar is closed and the csv is parsed without an error, although probably not as we expected. The second string has 3 (an odd number of) quotechars: the last one is never closed, the EOF is reached, hence the error. But "line 2" in the error message is misleading. We would expect 4, but since everything between the first and second quotechar is parsed as one string, our "p,q,r line is actually the second.
Making your inner loop like this will allow you to detect the 'bad' file (and investigate it further):
from pandas.io import parser

def to_hdf():
    .....
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)
            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)
        except (parser.CParserError) as detail:
            print(f, detail)
The solution is to use the parameter engine='python' in the read_csv function. The pandas CSV parser can use two different "engines" to parse a CSV file: Python or C (which is also the default).
pandas.read_csv(filepath, sep=',', delimiter=None,
                header='infer', names=None,
                index_col=None, usecols=None, squeeze=False,
                ..., engine=None, ...)
The Python engine is described to be “slower, but is more feature complete” in the Pandas documentation.
engine : {'c', 'python'}
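So the call, with a placeholder file name, is simply:

import pandas as pd

# Fall back to the slower but more forgiving Python parser.
df = pd.read_csv("data.csv", engine="python")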
My error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 4488
was resolved by adding delimiter="\t" in my code as:
import pandas as pd
df = pd.read_csv("filename.csv", delimiter="\t")
Use engine="python" and error_bad_lines=False on the read_csv call. The full call will be like this:
df = pd.read_csv(csvfile,
                 delimiter="\t",
                 engine="python",
                 error_bad_lines=False,
                 encoding='utf-8')
For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gives the error C error: EOF inside string starting at line. Using a different quoting didn't give the desired results either, since I did not want to have quotes in my text.
I realised that there was a bug in Pandas 0.20. Upgrading to version 0.21 completely solved my issue. More info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559
Note: this may be Windows-related as mentioned in the URL.
After looking for a solution for hours, I have finally come up with a workaround.
The best way to eliminate the "C error: EOF inside string starting at line" exception without reducing multiprocessing efficiency is to preprocess the input data (if you have such an opportunity).
Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.
After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'.
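A sketch of that pre-processing, under the assumption that real records end in '\r\n' while the offending breaks inside fields are bare '\n' (the file names are placeholders; adjust the logic to your data):

import pandas as pd

SENTINEL = "aghr21*&"  # the arbitrary marker sequence from above

with open("input.csv", newline="") as f:
    raw = f.read()

cleaned = (raw.replace("\r\n", "\x00")   # protect real record endings
              .replace("\n", SENTINEL)   # neutralise stray line breaks
              .replace("\x00", "\r\n"))  # restore record endings

with open("input_clean.csv", "w", newline="") as f:
    f.write(cleaned)

df = pd.read_csv("input_clean.csv")
# ...then swap SENTINEL back for '\n' in the affected string columns.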
I had a similar issue while trying to pull data from a GitHub repository. Simple mistake: I was trying to pull data from the git blob (the HTML-rendered part) instead of the raw csv.
If you're pulling data from a git repo, make sure your link doesn't include a <repo name>/blob segment unless you're specifically interested in HTML code from the repo.
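For illustration, with placeholder user/repo names, the difference is:

https://github.com/<user>/<repo>/blob/main/data.csv (the HTML page)
https://raw.githubusercontent.com/<user>/<repo>/main/data.csv (the raw file)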