Pandas guess delimiter with sep=None - python

Pandas documentation has this:
With sep=None, read_csv will try to infer the delimiter automatically
in some cases by “sniffing”.
How can I access pandas' guess for the delimiter?
I want to read in 10 lines of my file, have pandas guess the delimiter, and start up my GUI with that delimiter already selected. But I don't know how to access what pandas thinks is the delimiter.
Also, is there a way to pass pandas a list of strings to restrict it's guesses to?

Looking at the source code, I doubt that it's possible to get the delimiter out of read_csv. But pandas internally uses the Sniffer class from the csv module. Here's an example that should get you going:
import csv
s = csv.Sniffer()
print s.sniff("a,b,c").delimiter
print s.sniff("a;b;c").delimiter
print s.sniff("a#b#c").delimiter
Output:
,
;
#
What remains, is reading the first line from a file and feeding it to the Sniffer.sniff() function, but I'll leave that up to you.

The csv.Sniffer is the simplest solution, but it doesn't work if you need to use compressed files.
Here's what's working, although it uses a private member, so beware:
reader = pd.read_csv('path/to/file.tar.gz', sep=None, engine='python', iterator=True)
sep = reader._engine.data.dialect.delimiter
reader.close()

Related

Python - How can I check if a CSV file has a comma or a semicolon as a separator?

I have a bunch of CSV files that I would like to read with Python Pandas. Some of them have a comma as a delimiter, hence I use the following command to read them:
import pandas as pd
df = pd.read_csv('file_with_commas.csv')
However I have others CSVs that have a semicolon as a delimiter. Hence, since the default separator is the comma, I now need to specify it and therefore use the following command:
import pandas as pd
df = pd.read_csv('file_with_semicolons.csv', sep=';')
I would like to write a piece of code that recognises if the CSV file has a comma or a semicolon as a delimiter (before I read it) so that I do not have to change the code every time. How can this be done?
Note: I have checked this similar question on Stack Overflow but it is helpless since it is applicable to R, rather than Python.
Use sep=None
df = pd.read_csv('some_file.csv', sep=None)
From the docs
sep str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can,
meaning the latter will be used and automatically detect the separator
by Python’s builtin sniffer tool, csv.Sniffer. In addition,
separators longer than 1 character and different from '\s+' will be
interpreted as regular expressions and will also force the use of the
Python parsing engine. Note that regex delimiters are prone to
ignoring quoted data. Regex example: '\r\t'
.
Say that you would like to read an arbitrary CSV, named input.csv, and you do not know whether the separator is a comma or a semicolon.
You could open your file using the csv module. The Sniffer class is then used to deduce its format, like in the following code:
import csv
with open(input.csv, newline='') as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read())
For this module, the dialect class is a container class whose attributes contain information for how to handle delimiters (among other things like doublequotes, whitespaces, etc). You can check the delimiter attribute by using the following code:
print(dialect.delimiter)
# This will be either a comma or a semicolon, depending on what the input is
Therefore, in order to do a smart CSV reading, you could use something like the following:
if dialect.delimiter == ',':
df = pd.read_csv(input.csv) # Import the csv with a comma as the separator
elif dialect.delimiter == ';':
df = pd.read_csv(input.csv, sep=';') # Import the csv with a semicolon as the separator
More information can be found here.

May I use either tab or comma as delimiter when reading from pandas csv?

I have csv files. Some are comma delimited, and some are tab delimited.
df = pd.read_csv(data_file, sep='\t')
Is there a way to specify either tab or comma as delimiter when using pd.read_csv()? Or, is there a way to automatically detect whether the file is tab or comma delimited? If I know that, I can use different sep='' paramters when reading the file.
Recently I had a similair problem, I ended up using a different method but I explored using the Sniffer Class from the CSV standard library.
I haven't used this in production but only to help find what file types are for testing prototyping, use at your own risk!
from the documentation
"Sniffs" the format of a CSV file (i.e. delimiter, quotechar) Returns
a Dialect object.
you can return the dialect object then pass dialect.delimiter to the sep arg in pd.read_csv
'text_a.csv'
cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G
'text_b.csv'
cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC
from csv import Sniffer
sniffer = Sniffer()
def detect_delim(file,num_rows,sniffer):
with open(file,'r') as f:
for row in range(num_rows):
line = next(f).strip()
delim = sniffer.sniff(line)
print(delim.delimiter) # ideally you should return the dialect object - just being lazy.
#return delim.dedelimiter
detect_delim(file='text_a.csv',num_rows=5,sniffer=sniffer)
'|'
detect_delim(file='text_b.csv',num_rows=5,sniffer=sniffer)
'\t'
I'd just read the first row and see which gives you more columns:
import pandas as pd
tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]
if tab > com:
df = pd.read_csv(data_file, sep='\t')
else:
df = pd.read_csv(data_file, sep=',')
Is this useful? You can used python regex parser with read_csv and specify different delimiters.
Ask the user to specify how the file is formatted if you don't expect to be able to determine from the file contents itself.
E.g. a flag of some sort as --tab-delimited-file=true and then you flip the separator based on their input.

Pandas read_csv (quickly!) with non-regex, multi-char sep

I am often given a file with , (comma-space) separators that I would like to read into a pandas dataframe. The straightforward pd.read_csv(fname, header=0, sep=', ') reads just fine, but is ~8x slower (than an equivalent sep=',') on my files, because the multi-char sep appears like a regex and forces the python engine.
What is good way (main metric: fast) to read files with non-regex, multi-char delimiters without needing to fallback onto the python engine? My current solution (posted below) essentially runs s/, /,/g on the file first, but this requires passing over the file twice. Is there a preferred solution without this drawback?
use df = pd.read_csv(fname, skipinitialspace=True) as :
skipinitialspacebool, default False
Skip spaces after delimiter.
a refer: Dealing with extra white spaces while reading CSV in Pandas
My current solution is to simply replace the delimiter with a single-char before calling read_csv.
from io import StringIO
def read_comma_space(fname):
with open(fname, 'r') as f:
text = f.read().replace(', ', ',')
s = StringIO(text)
return pd.read_csv(s, header=0, sep=',')
This enables use of the C engine, but must make multiple passes over the file. Compared to a baseline read of the file (read_csv(fname, header=0, sep=',')), this solution adds about 50% to total execution time in my tests. This is much better than the ~8x execution time over the baseline of the python engine.

Issue with parsing csv from Django web form

I was hoping someone could help me with this. I'm getting a file from a form in Django, this file is a csv and I'm trying to read it with Python's library csv. The problem here is that when I apply the function csv.reader and I turn that result into a list in order to print it, I find out that csv.reader is not splitting correctly my file.
Here are some images to show the problem
This is my csv file:
This my code:
And this is the printed value of the variable file_readed:
As you can see in the picture, it seems to be splitting my file character by character with some exceptions.
I thank you for any help you can provide me.
If you are pulling from a web form, try getting the csv as a string, confirm in a print or debug tool that the result is correct, and then pass it to csv using StringIO.
from io import StringIO
import csv
csv_string = form.files['carga_cie10'].file_read().decode(encoding="ISO-88590-1")
csv_file = StringIO(csv_string)
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in reader:
print(row)
Another thing you can try is changing the lineterminator argument to csv.reader(). It can default to \r\n but the web form might use some other value. Inspect the string you get from the web form to confirm.
that CSV does not seem right: you got some lines with more arguments than others.
The acronym of CSV being Comma Separated Values, you need to have the exact same arguments separated by commas for each line, or else it will mess it up.
I see in your lines you're maybe expecting to have 3 columns, instead you got lines with 2, or 4 arguments, and some of them have an opening " in one argument, comma, then closing " in the second argument
check if your script works with other CSVs maybe
Most likely you need to specify delimiter. Since you haven't explicitly told about the delimiter, I guess it's confused.
csv.reader(csvfile, delimiter=',')
However, since there are quotations with comma delimiter, you may need to alter the default delimiter on the CSV file's creation too for tab or something else.
The problem is here:
print(list(file_readed))
'list' is causing printing of every element within the csv as an individual unit.
Try this instead:
with open('carga_cie10') as f:
reader = csv.reader(f)
for row in reader:
print(" ".join(row))
Edit:
import pandas as pd
file_readed = pd.read_csv(file_csv)
print(file_readed)
The output should look clean. Pandas is highly useful in situations where data needs to be read, manipulated, changed, etc.

Auto-detect the delimiter in a CSV file using pd.read_csv

Is there a way for read_csv to auto-detect the delimiter? numpy's genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas' read_csv.
Any ideas?
Another option is to use the built in CSV Sniffer. I mix it up with only reading a certain number of bytes in case the CSV file is large.
import csv
def get_delimiter(file_path, bytes = 4096):
sniffer = csv.Sniffer()
data = open(file_path, "r").read(bytes)
delimiter = sniffer.sniff(data).delimiter
return delimiter
Option 1
Using delim_whitespace=True
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep='\s+')
This is equivalent to the first option
Documentation for pd.read_csv.
For better control, I use a python module called detect_delimiter from python projects. See https://pypi.org/project/detect-delimiter/ . It has been around for some time. As with all code, you should test with your interpreter prior to deployment. I have tested up to python version 3.8.5.
See code example below where the delimiter is automatically detected, and the var
delimiter is defined from the method's output. The code then reads the CSV file
with sep = delimiter. I have tested with the following delimiters, although others should work: ; , |
It does not work with multi-char delimiters, such as ","
CAUTION! This method will do nothing to detect a malformed CSV file. In the case
where the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd
filename = "some_csv.csv"
with open(filename) as f:
firstline = f.readline()
delimiter = detect(firstline)
records = pd.read_csv(filename, sep = delimiter)

Categories

Resources