Python: C engine does not support regex separators

Attempting to upload a bunch of csv's to a database. The csvs are not necessarily always separated by a comma, so I used a regular expression to ensure the correct delimiters are used. I then added
error_bad_lines=False
in order to handle
CParserError: Error tokenizing data. C error: Expected 3 fields in line 127, saw 4
which resulted in me getting this error:
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'error_bad_lines' to be ignored as it is not supported by the 'python' engine.
for the following code. Is there a workaround?
import psycopg2
import pandas as pd
import sqlalchemy as sa

csvList = []
tableList = []
filenames = find_csv_filenames(directory)
for name in filenames:
    lhs, rhs = str(name).split(".", 1)
    print name
    dataRaw = pd.read_csv(name, sep=";|,", chunksize=5000000, error_bad_lines=False)
    for chunk in dataRaw:
        chunk.to_sql(name=str(lhs), if_exists='append', con=con)

As per the pandas parameters described in this link Pandas-link, if the separator is more than one character you need to set the engine parameter to 'python'.
Try this:
dataRaw = pd.read_csv(name, sep=";|,", engine='python', chunksize=5000000,
                      error_bad_lines=False)

If you can preprocess and change your files, try changing the ; separator to , to make a clean csv file. You could do it with fileinput to change it in place:
import fileinput

for line in fileinput.input('your_file', inplace=True):
    line = line.replace(';', ',')
    print(line, end='')
fileinput.close()
Then you could use read_csv with the c engine and the error_bad_lines parameter, or you could also preprocess the other files with that loop.
Note: if you want to make a backup of your file, you can use the backup parameter of fileinput.input.
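Once the delimiters are normalized to commas, the question's chunked load should work with the default C engine, and error_bad_lines is honored. A minimal sketch reusing the question's names (con is the SQLAlchemy connection from the question; 'your_table' is a placeholder):
import pandas as pd

# With a single-character separator the fast C engine is used,
# so error_bad_lines=False is no longer ignored.
dataRaw = pd.read_csv('your_file', sep=',', chunksize=5000000, error_bad_lines=False)
for chunk in dataRaw:
    chunk.to_sql(name='your_table', if_exists='append', con=con)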

Related

Python - How can I check if a CSV file has a comma or a semicolon as a separator?

I have a bunch of CSV files that I would like to read with Python Pandas. Some of them have a comma as a delimiter, hence I use the following command to read them:
import pandas as pd
df = pd.read_csv('file_with_commas.csv')
However, I have other CSVs that have a semicolon as a delimiter. Hence, since the default separator is the comma, I now need to specify it and therefore use the following command:
import pandas as pd
df = pd.read_csv('file_with_semicolons.csv', sep=';')
I would like to write a piece of code that recognises if the CSV file has a comma or a semicolon as a delimiter (before I read it) so that I do not have to change the code every time. How can this be done?
Note: I have checked this similar question on Stack Overflow, but it does not help since it applies to R rather than Python.
Use sep=None
df = pd.read_csv('some_file.csv', sep=None)
From the docs
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
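Note that sep=None makes pandas fall back to the Python engine and emit a ParserWarning; passing engine='python' explicitly keeps it quiet. A minimal sketch:
import pandas as pd

# sep=None lets csv.Sniffer detect the delimiter; engine='python' is stated
# explicitly to avoid the warning about falling back from the C engine.
df = pd.read_csv('some_file.csv', sep=None, engine='python')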
Say that you would like to read an arbitrary CSV, named input.csv, and you do not know whether the separator is a comma or a semicolon.
You could open your file using the csv module. The Sniffer class is then used to deduce its format, as in the following code:
import csv

with open('input.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
For this module, the Dialect class is a container class whose attributes carry information on how to handle delimiters (among other things, like doublequotes and whitespace). You can check the delimiter attribute using the following code:
print(dialect.delimiter)
# This will be either a comma or a semicolon, depending on what the input is
Therefore, in order to do a smart CSV reading, you could use something like the following:
if dialect.delimiter == ',':
    df = pd.read_csv('input.csv')  # Import the csv with a comma as the separator
elif dialect.delimiter == ';':
    df = pd.read_csv('input.csv', sep=';')  # Import the csv with a semicolon as the separator
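Alternatively, since the sniffed delimiter is a single character either way, you could skip the branching and pass it straight through (same assumptions as above):
df = pd.read_csv('input.csv', sep=dialect.delimiter)  # works for ',' and ';' alike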
More information can be found here.

May I use either tab or comma as delimiter when reading from pandas csv?

I have csv files. Some are comma delimited, and some are tab delimited.
df = pd.read_csv(data_file, sep='\t')
Is there a way to specify either tab or comma as delimiter when using pd.read_csv()? Or, is there a way to automatically detect whether the file is tab or comma delimited? If I know that, I can use a different sep= parameter when reading the file.
Recently I had a similar problem; I ended up using a different method, but I explored using the Sniffer class from the csv standard library.
I haven't used this in production, only to figure out file formats while prototyping, so use at your own risk!
from the documentation
"Sniffs" the format of a CSV file (i.e. delimiter, quotechar) Returns
a Dialect object.
You can return the Dialect object and then pass dialect.delimiter to the sep argument in pd.read_csv.
'text_a.csv'
cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G
'text_b.csv'
cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC
from csv import Sniffer

sniffer = Sniffer()

def detect_delim(file, num_rows, sniffer):
    with open(file, 'r') as f:
        for row in range(num_rows):
            line = next(f).strip()
            delim = sniffer.sniff(line)
            print(delim.delimiter)  # ideally you should return the dialect object - just being lazy.
            # return delim.delimiter
detect_delim(file='text_a.csv',num_rows=5,sniffer=sniffer)
'|'
detect_delim(file='text_b.csv',num_rows=5,sniffer=sniffer)
'\t'
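If you change detect_delim to return delim.delimiter instead of printing it, wiring it into pandas is straightforward. A sketch under that assumption:
import pandas as pd

# Assumes detect_delim was modified to return delim.delimiter.
sep = detect_delim(file='text_a.csv', num_rows=5, sniffer=sniffer)
df = pd.read_csv('text_a.csv', sep=sep)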
I'd just read the first row and see which gives you more columns:
import pandas as pd

tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]

if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')
Is this useful? You can use the Python regex parser with read_csv and specify multiple delimiters.
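For example, something like this (a sketch; the character class matches either a comma or a tab, and any regex separator forces the Python engine):
import pandas as pd

# '[,\t]' is a regex matching comma or tab; regex separators require engine='python'.
df = pd.read_csv(data_file, sep='[,\t]', engine='python')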
Ask the user to specify how the file is formatted if you don't expect to be able to determine it from the file contents itself.
E.g. a flag of some sort, such as --tab-delimited-file=true, and then flip the separator based on their input.

Auto-detect the delimiter in a CSV file using pd.read_csv

Is there a way for read_csv to auto-detect the delimiter? numpy's genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas' read_csv.
Any ideas?
Another option is to use the built-in csv.Sniffer. I limit it to reading only a certain number of bytes in case the CSV file is large.
import csv

def get_delimiter(file_path, bytes=4096):
    sniffer = csv.Sniffer()
    data = open(file_path, "r").read(bytes)
    delimiter = sniffer.sniff(data).delimiter
    return delimiter
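Usage might look like this (a sketch with an assumed filename):
import pandas as pd

# Sniff the delimiter first, then hand it to pandas.
delim = get_delimiter("data.csv")
df = pd.read_csv("data.csv", sep=delim)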
Option 1
Using delim_whitespace=True
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep='\s+')
This is equivalent to the first option.
Documentation for pd.read_csv.
For better control, I use a Python module called detect_delimiter. See https://pypi.org/project/detect-delimiter/. It has been around for some time. As with all code, you should test it with your interpreter prior to deployment. I have tested up to Python version 3.8.5.
See the code example below, where the delimiter is automatically detected and the variable delimiter is set from the method's output. The code then reads the CSV file with sep = delimiter. I have tested it with the following delimiters, although others should work: ; , |
It does not work with multi-character delimiters, such as ","
CAUTION! This method will do nothing to detect a malformed CSV file. In the case
where the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd

filename = "some_csv.csv"
with open(filename) as f:
    firstline = f.readline()
    delimiter = detect(firstline)
records = pd.read_csv(filename, sep=delimiter)

Error tokenizing data during Pandas read_csv. How to actually see the bad lines?

I have a large csv that I load as follows
df = pd.read_csv('my_data.tsv', sep='\t', header=0, skiprows=[1,2,3])
I get several errors during the loading process.
First, if I don't specify warn_bad_lines=True, error_bad_lines=False, I get:
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
Second, if I use the options above, I now get:
CParserError: Error tokenizing data. C error: EOF inside string starting at line 32357585
Question is: how can I have a look at these bad lines to understand what's going on? Is it possible to have read_csv return these bogus lines?
I tried the following hint (Pandas ParserError EOF character when reading multiple csv files to HDF5):
from pandas import parser

try:
    df = pd.read_csv('mydata.tsv', sep='\t', header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
    print detail
but still get
Error tokenizing data. C error: Expected 22 fields in line 329867, saw 24
I'll give my answer in two parts.
Part 1: The OP asked how to output these bad lines. To answer this, we can use the Python csv module in a simple piece of code like this:
import csv

file = 'your_filename.csv'  # use your filename
lines_set = set([100, 200])  # use your bad lines numbers here

with open(file) as f_obj:
    for line_number, row in enumerate(csv.reader(f_obj)):
        if line_number > max(lines_set):
            break
        elif line_number in lines_set:  # put your bad lines numbers here
            print(line_number, row)
We can also put it in a more general function, like this:
import csv

def read_my_lines(file, lines_list, reader=csv.reader):
    lines_set = set(lines_list)
    with open(file) as f_obj:
        for line_number, row in enumerate(reader(f_obj)):
            if line_number > max(lines_set):
                break
            elif line_number in lines_set:
                print(line_number, row)

if __name__ == '__main__':
    read_my_lines(file='your_filename.csv', lines_list=[100, 200])
Part 2: the cause of the error you get:
It's hard to diagnose a problem like this without a sample of the file you use, but you should try this:
pd.read_csv(filename)
Does it parse the file with no error? If so, I will explain why.
The number of columns is inferred from the first line. By using skiprows and header=0 you skipped the first 3 rows, which I guess contain the column names or a header with the correct number of columns. Basically, you are constraining what the parser is doing.
So parse without skiprows or header=0, then reindex to what you need later, as in the sketch below.
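A minimal sketch of that idea, using the question's filename (the exact reindexing will depend on your data):
import pandas as pd

# Let the first line define the column count, without skiprows constraints.
df = pd.read_csv('my_data.tsv', sep='\t')
# Then drop the rows you originally meant to skip.
df = df.iloc[3:].reset_index(drop=True)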
Note: if you are unsure about which delimiter is used in the file, use sep=None, but it will be slower.
From the pandas.read_csv docs:
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python's builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
link
In my case, adding a separator helped:
data = pd.read_csv('/Users/myfile.csv', encoding='cp1251', sep=';')
We can get the line number from the error and print that line to see what it looks like.
Try:
import subprocess
import re
import pandas as pd
from pandas import parser

try:
    filename = 'mydata.tsv'
    df = pd.read_csv(filename, sep='\t', header=0, skiprows=[1,2,3])
except (parser.CParserError) as detail:
    print detail
    err = re.findall(r'\b\d+\b', str(detail))  # gives all the numbers, e.g. ['22', '329867', '24']; the line number is at index 1
    line = subprocess.check_output("sed -n %s %s" % (str(err[1]) + 'p', filename),
                                   stderr=subprocess.STDOUT, shell=True)  # shell command 'sed -n 2p filename' prints line 2 of filename
    print 'Bad line'
    print line  # to see the line

Pandas ParserError EOF character when reading multiple csv files to HDF5

Using Python3, Pandas 0.12
I'm trying to write multiple csv files (total size is 7.9 GB) to an HDF5 store to process later on. The csv files contain around a million rows each, 15 columns, and data types are mostly strings, but some floats. However, when I'm trying to read the csv files I get the following error:
Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
Edit:
I managed to find a file that produced this problem. I think it's reading an EOF character. However, I have no clue how to overcome this problem. Given the large size of the combined files, I think it's too cumbersome to check each single character in each string. (Even then I would still not be sure what to do.) As far as I checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
My code is the following:
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob

def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
Edit
If I go into the CSV file that raises the CParserError EOF... and manually delete all rows after the line that is causing the problem, the csv file is read properly. However all I'm deleting are blank rows anyway.
The weird thing is that when I manually correct the erroneous csv files, they are loaded fine into the store individually. But when I again use a list of multiple files, the 'false' files still give me errors.
I had a similar problem. The line flagged with 'EOF inside string' contained a string that had a single quote mark (') within it. When I added the option quoting=csv.QUOTE_NONE, it fixed my problem.
For example:
import csv
import pandas as pd

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
I had the same problem, and after adding these two params to my code, the problem was gone:
read_csv(..., quoting=3, error_bad_lines=False)
(quoting=3 is the same as csv.QUOTE_NONE.)
I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.
From the csv.py docstring:
* quoting - controls when quotes should be generated by the writer.
It can take on any of the following module constants:
csv.QUOTE_MINIMAL means only when required, for example, when a
field contains either the quotechar or the delimiter
csv.QUOTE_ALL means that quotes are always placed around fields.
csv.QUOTE_NONNUMERIC means that quotes are always placed around
fields which do not parse as integers or floating point
numbers.
csv.QUOTE_NONE means that quotes are never placed around fields.
csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file you have a quotechar, it will be parsed as a string until another occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars will be parsed as a single string. Even if there are many line breaks (expected to be parsed as separate rows), it all goes into a single field of the table. So the line number that you get in the error can be misleading. To illustrate with an example, consider this:
In[4]: import pandas as pd
...: from io import StringIO
...: test_csv = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: p,q,r
...: s,t,u
...: '''
...:
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]:
a b c
0 d,e,f\ng,h,i\nm n o
1 p q r
2 s t u
In[7]: test_csv_2 = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: "p,q,r
...: s,t,u
...: '''
...: test_2 = StringIO(test_csv_2)
...:
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The first string has 2 (even) quotechars, so each quotechar is closed and the csv is parsed without an error, although probably not the way we expected. The other string has 3 (odd) quotechars: the last one is not closed and the EOF is reached, hence the error. But line 2 that we get in the error message is misleading: we would expect 4, but since everything between the first and second quotechars is parsed as a single string, our "p,q,r line is actually the second.
Making your inner loop like this will allow you to detect the 'bad' file (and investigate further):
from pandas.io import parser

def to_hdf():
    .....
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)
            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)
        except (parser.CParserError) as detail:
            print f, detail
The solution is to use the parameter engine='python' in the read_csv function. The pandas CSV parser can use two different "engines" to parse a CSV file: Python or C (which is also the default).
pandas.read_csv(filepath, sep=',', delimiter=None,
                header='infer', names=None,
                index_col=None, usecols=None, squeeze=False,
                ..., engine=None, ...)
The Python engine is described as "slower, but is more feature complete" in the pandas documentation.
engine : {'c', 'python'}
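In practice that just means adding the single keyword. A sketch with an assumed filename:
import pandas as pd

# The Python engine is slower but more tolerant than the default C engine.
df = pd.read_csv('my_data.csv', engine='python')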
My error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 4488
was resolved by adding delimiter="\t" in my code as:
import pandas as pd
df = pd.read_csv("filename.csv", delimiter="\t")
Use engine="python", error_bad_lines=False on the read_csv.
The full call will be like this:
df = pd.read_csv(csvfile,
                 delimiter="\t",
                 engine="python",
                 error_bad_lines=False,
                 encoding='utf-8')
For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gives the error C error: EOF inside string starting at line. Using a different quoting didn't give the desired results either, since I did not want to have quotes in my text.
I realised that there was a bug in Pandas 0.20. Upgrading to version 0.21 completely solved my issue. More info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559
Note: this may be Windows-related as mentioned in the URL.
After looking for a solution for hours, I have finally come up with a workaround.
The best way to eliminate the C error: EOF inside string starting at line exception without reducing multiprocessing efficiency is to preprocess the input data (if you have such an opportunity).
Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.
After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'.
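A minimal sketch of that round trip, assuming you control how the file is produced; the sentinel, filenames, and DataFrame contents are purely illustrative:
import pandas as pd

SENTINEL = 'aghr21*&'  # illustrative unique marker from the paragraph above

# When producing the file: scrub embedded newlines out of each string field.
df_out = pd.DataFrame({'text': ['line one\nline two', 'no newline'], 'n': [1, 2]})
df_out = df_out.applymap(
    lambda v: v.replace('\n', SENTINEL) if isinstance(v, str) else v)
df_out.to_csv('clean.csv', index=False)

# When consuming it: read_csv now succeeds; afterwards, restore the newlines.
df_in = pd.read_csv('clean.csv')
df_in = df_in.applymap(
    lambda v: v.replace(SENTINEL, '\n') if isinstance(v, str) else v)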
Had a similar issue while trying to pull data from a GitHub repository. Simple mistake: I was trying to pull data from the git blob (the HTML-rendered part) instead of the raw csv.
If you're pulling data from a git repo, make sure your link doesn't include a <repo name>/blob unless you're specifically interested in HTML code from the repo.
