Is there a way for read_csv to auto-detect the delimiter? numpy's genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas' read_csv.
Any ideas?
Another option is to use the built-in csv.Sniffer. I combine it with reading only a certain number of bytes, in case the CSV file is large.
import csv

def get_delimiter(file_path, n_bytes=4096):
    # sample only the start of the file so large files stay cheap to sniff
    sniffer = csv.Sniffer()
    with open(file_path, "r") as f:
        data = f.read(n_bytes)
    return sniffer.sniff(data).delimiter
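As a quick end-to-end check of the sniffing approach, here is a sketch that writes a small semicolon-delimited file (the file name and contents are invented for the demo), sniffs its delimiter, and feeds the result to read_csv:

```python
import csv
import os
import tempfile

import pandas as pd

def get_delimiter(file_path, n_bytes=4096):
    # sample only the start of the file so large files stay cheap to sniff
    with open(file_path, "r") as f:
        data = f.read(n_bytes)
    return csv.Sniffer().sniff(data).delimiter

# throwaway semicolon-delimited file, created just for the demo
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a;b;c\n1;2;3\n")
    path = tmp.name

delim = get_delimiter(path)
df = pd.read_csv(path, sep=delim)
os.remove(path)
```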
Option 1
Using delim_whitespace=True
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep=r'\s+')
This is equivalent to the first option.
Documentation for pd.read_csv.
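To see that this handles the asker's mix of single spaces, double spaces, and tabs, here is a small in-memory sketch (the sample data is invented for the demo):

```python
from io import StringIO

import pandas as pd

# made-up sample mixing single spaces, double spaces, and a tab
data = "a b  c\td\n1 2  3\t4\n"
df = pd.read_csv(StringIO(data), sep=r'\s+')
```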
For better control, I use a Python module called detect_delimiter, available on PyPI: https://pypi.org/project/detect-delimiter/ . It has been around for some time. As with all code, you should test it with your interpreter prior to deployment; I have tested up to Python 3.8.5.
See the code example below, where the delimiter is automatically detected and the variable delimiter is set from the method's output. The code then reads the CSV file with sep=delimiter. I have tested with the following delimiters, although others should work: ; , |
It does not work with multi-character delimiters, such as ", " (a comma followed by a space).
CAUTION! This method will do nothing to detect a malformed CSV file. In the case where the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd

filename = "some_csv.csv"
with open(filename) as f:
    firstline = f.readline()
delimiter = detect(firstline)
records = pd.read_csv(filename, sep=delimiter)
Related
I have a bunch of CSV files that I would like to read with Python Pandas. Some of them have a comma as a delimiter, hence I use the following command to read them:
import pandas as pd
df = pd.read_csv('file_with_commas.csv')
However, I have other CSVs that use a semicolon as the delimiter. Since the default separator is the comma, I now need to specify it, and therefore use the following command:
import pandas as pd
df = pd.read_csv('file_with_semicolons.csv', sep=';')
I would like to write a piece of code that recognises if the CSV file has a comma or a semicolon as a delimiter (before I read it) so that I do not have to change the code every time. How can this be done?
Note: I have checked this similar question on Stack Overflow, but it does not help since it applies to R rather than Python.
Use sep=None (this forces the Python engine; passing engine='python' explicitly avoids a ParserWarning):
df = pd.read_csv('some_file.csv', sep=None, engine='python')
From the docs
sep str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can,
meaning the latter will be used and automatically detect the separator
by Python’s builtin sniffer tool, csv.Sniffer. In addition,
separators longer than 1 character and different from '\s+' will be
interpreted as regular expressions and will also force the use of the
Python parsing engine. Note that regex delimiters are prone to
ignoring quoted data. Regex example: '\r\t'.
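A minimal sketch of this sniffing in action, using an in-memory semicolon-separated sample (invented for the demo) so the Python engine has to infer the separator:

```python
from io import StringIO

import pandas as pd

data = "x;y;z\n1;2;3\n4;5;6\n"
# sep=None forces the Python engine, which sniffs the delimiter
df = pd.read_csv(StringIO(data), sep=None, engine='python')
```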
Say that you would like to read an arbitrary CSV, named input.csv, and you do not know whether the separator is a comma or a semicolon.
You could open your file using the csv module. The Sniffer class is then used to deduce its format, like in the following code:
import csv
with open('input.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
In this module, Dialect is a container class whose attributes describe how to handle delimiters (among other things, such as double quotes and whitespace). You can check the delimiter attribute using the following code:
print(dialect.delimiter)
# This will be either a comma or a semicolon, depending on what the input is
Therefore, in order to do a smart CSV read, you could use something like the following:
if dialect.delimiter == ',':
    df = pd.read_csv('input.csv')  # comma is the default separator
elif dialect.delimiter == ';':
    df = pd.read_csv('input.csv', sep=';')
More information can be found here.
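Rather than branching on each possible delimiter, the sniffed value can also be passed straight to sep; a sketch with invented in-memory data:

```python
import csv
from io import StringIO

import pandas as pd

text = "a;b\n1;2\n3;4\n"
dialect = csv.Sniffer().sniff(text)
# hand whatever delimiter was sniffed directly to read_csv
df = pd.read_csv(StringIO(text), sep=dialect.delimiter)
```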
I have csv files. Some are comma delimited, and some are tab delimited.
df = pd.read_csv(data_file, sep='\t')
Is there a way to specify either tab or comma as the delimiter when using pd.read_csv()? Or, is there a way to automatically detect whether the file is tab- or comma-delimited? If I knew that, I could use a different sep= parameter when reading the file.
Recently I had a similar problem. I ended up using a different method, but I explored the Sniffer class from the csv standard library.
I haven't used this in production, only to identify file types while prototyping, so use at your own risk!
from the documentation
"Sniffs" the format of a CSV file (i.e. delimiter, quotechar) Returns
a Dialect object.
You can return the Dialect object and then pass dialect.delimiter to the sep argument in pd.read_csv.
'text_a.csv'
cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G
'text_b.csv'
cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC
from csv import Sniffer

sniffer = Sniffer()

def detect_delim(file, num_rows, sniffer):
    # sniff a sample made of the first num_rows lines
    with open(file, 'r') as f:
        sample = ''.join(next(f) for _ in range(num_rows))
    dialect = sniffer.sniff(sample)
    print(dialect.delimiter)  # ideally you should return the Dialect object - just being lazy
    # return dialect.delimiter
detect_delim(file='text_a.csv',num_rows=5,sniffer=sniffer)
'|'
detect_delim(file='text_b.csv',num_rows=5,sniffer=sniffer)
'\t'
I'd just read the first row and see which gives you more columns:
import pandas as pd
tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]
if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')
Perhaps this is useful: you can use the Python regex parser with read_csv and specify several possible delimiters in one pattern.
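For example, a character class matching either a comma or a tab can serve as the separator (sample data invented for the demo; note a regex sep forces the slower Python engine):

```python
from io import StringIO

import pandas as pd

# made-up sample that mixes commas and tabs as separators
data = "a,b\tc\n1,2\t3\n"
df = pd.read_csv(StringIO(data), sep=r'[,\t]', engine='python')
```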
Ask the user to specify how the file is formatted if you don't expect to be able to determine it from the file contents itself.
E.g. a flag of some sort, such as --tab-delimited-file=true, and then flip the separator based on their input.
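One way to sketch that, assuming an argparse-based command-line interface (the flag name follows the answer; the rest of the wiring is made up):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--tab-delimited-file', action='store_true',
                    help='treat the input as tab- rather than comma-delimited')

# simulate the user passing the flag on the command line
args = parser.parse_args(['--tab-delimited-file'])
sep = '\t' if args.tab_delimited_file else ','
```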
I am often given a file with , (comma-space) separators that I would like to read into a pandas dataframe. The straightforward pd.read_csv(fname, header=0, sep=', ') reads just fine, but is ~8x slower (than an equivalent sep=',') on my files, because the multi-char sep appears like a regex and forces the python engine.
What is a good way (main metric: speed) to read files with non-regex, multi-character delimiters without falling back to the Python engine? My current solution (posted below) essentially runs s/, /,/g on the file first, but this requires passing over the file twice. Is there a preferred solution without this drawback?
Use df = pd.read_csv(fname, skipinitialspace=True), since:
skipinitialspace : bool, default False
Skip spaces after delimiter.
Reference: Dealing with extra white spaces while reading CSV in Pandas
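A quick sketch showing that skipinitialspace keeps the fast C engine while absorbing the space after each comma (the sample data is invented for the demo):

```python
from io import StringIO

import pandas as pd

# comma-space separated data; the C engine handles it with skipinitialspace
data = "a, b, c\n1, 2, 3\n"
df = pd.read_csv(StringIO(data), skipinitialspace=True)
```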
My current solution is to simply replace the delimiter with a single-char before calling read_csv.
from io import StringIO

import pandas as pd

def read_comma_space(fname):
    with open(fname, 'r') as f:
        text = f.read().replace(', ', ',')
    s = StringIO(text)
    return pd.read_csv(s, header=0, sep=',')
This enables use of the C engine, but must make multiple passes over the file. Compared to a baseline read of the file (read_csv(fname, header=0, sep=',')), this solution adds about 50% to total execution time in my tests. This is much better than the ~8x execution time over the baseline of the python engine.
Attempting to upload a bunch of CSVs to a database. The CSVs are not necessarily always separated by a comma, so I used a regular expression to ensure the correct delimiters are used. I then added
error_bad_lines=False
in order to handle CParserError: Error tokenizing data. C error: Expected 3 fields in line 127, saw 4,
which resulted in me getting this error
ValueError: Falling back to the 'python' engine because the 'c' engine does not support regex separators, but this causes 'error_bad_lines' to be ignored as it is not supported by the 'python' engine.
for the following code
Is there a workaround?
import psycopg2
import pandas as pd
import sqlalchemy as sa

csvList = []
tableList = []
filenames = find_csv_filenames(directory)
for name in filenames:
    lhs, rhs = str(name).split(".", 1)
    print(name)
    dataRaw = pd.read_csv(name, sep=";|,", chunksize=5000000, error_bad_lines=False)
    for chunk in dataRaw:
        chunk.to_sql(name=str(lhs), if_exists='append', con=con)
As per the pandas parameters in this link (Pandas-link), if the separator is more than one character you need to add the engine parameter as 'python'.
Try this:
dataRaw = pd.read_csv(name, sep=";|,", engine='python', chunksize=5000000,
                      error_bad_lines=False)
If you can preprocess and change your file, try changing the ; separator to , to make a clean CSV file. You can do it in place with fileinput:
import fileinput

for line in fileinput.FileInput('your_file', inplace=True):
    line = line.replace(';', ',')
    print(line, end='')
fileinput.close()
Then you can use read_csv with the C engine and the error_bad_lines parameter, or preprocess the other files with the same loop.
Note: If you want to make a backup of your file, you can use the backup parameter of FileInput.
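A sketch of the in-place rewrite with a backup, using a throwaway file created just for the demo (backup='.bak' keeps the original alongside the rewritten file):

```python
import fileinput
import os
import tempfile

# throwaway semicolon-separated file for the demo
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("a;b\n1;2\n")
    path = tmp.name

# inplace=True redirects print() back into the file; backup keeps the original
fi = fileinput.FileInput(path, inplace=True, backup='.bak')
for line in fi:
    print(line.replace(';', ','), end='')
fi.close()

with open(path) as f:
    converted = f.read()
os.remove(path)
os.remove(path + '.bak')
```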
Pandas documentation has this:
With sep=None, read_csv will try to infer the delimiter automatically
in some cases by “sniffing”.
How can I access pandas' guess for the delimiter?
I want to read in 10 lines of my file, have pandas guess the delimiter, and start up my GUI with that delimiter already selected. But I don't know how to access what pandas thinks is the delimiter.
Also, is there a way to pass pandas a list of strings to restrict its guesses to?
Looking at the source code, I doubt that it's possible to get the delimiter out of read_csv. But pandas internally uses the Sniffer class from the csv module. Here's an example that should get you going:
import csv

s = csv.Sniffer()
print(s.sniff("a,b,c").delimiter)
print(s.sniff("a;b;c").delimiter)
print(s.sniff("a#b#c").delimiter)
Output:
,
;
#
What remains, is reading the first line from a file and feeding it to the Sniffer.sniff() function, but I'll leave that up to you.
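That remaining step might look like the following sketch (the file name and contents are invented for the demo):

```python
import csv
import os
import tempfile

def sniff_delimiter(path):
    # read only the first line and hand it to csv.Sniffer
    with open(path, newline='') as f:
        return csv.Sniffer().sniff(f.readline()).delimiter

# throwaway demo file
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("id;name\n1;alice\n")
    demo_path = tmp.name

delim = sniff_delimiter(demo_path)
os.remove(demo_path)
```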
The csv.Sniffer is the simplest solution, but it doesn't work if you need to use compressed files.
Here's what's working, although it uses a private member, so beware:
reader = pd.read_csv('path/to/file.tar.gz', sep=None, engine='python', iterator=True)
sep = reader._engine.data.dialect.delimiter
reader.close()