Pandas: Write CSV file with Windows line ending - python

Example:
import pandas as pd
df = pd.DataFrame({'A':[1,2], 'B':[3,4]})
df.to_csv('test.csv', line_terminator='\r\n')
gives the file
,A,B\r\r\n
0,1,3\r\r\n
1,2,4\r\r\n
However, I would like to have
,A,B\r\n
0,1,3\r\n
1,2,4\r\n
How can I achieve this (i.e., \r\n instead of \r\r\n)? My operating system is Windows 10.

The following is from a submission to Kaggle; hope you'll find it useful. I had to write the output loop myself, as pandas insisted on adding both CR and LF, which caused problems for Kaggle.
import io

with io.open('../output/submission_{}.csv'.format('22-Sep-2018-Try-Something_with_re'), 'w', newline='\n') as of:
    of.write('fullVisitorId,PredictedLogRevenue\n')
    for ind in out_log.index:
        of.write('{},{}\n'.format(ind, out_log['PredictedLogRevenue'][ind]))
The main things to notice are the use of io.open, setting newline='\n' in the open call, and adding '\n' at the end of each write call.
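A more direct route for the original question (a minimal sketch using the DataFrame from the question) is to open the file yourself with newline='' so that Python performs no newline translation, then let to_csv write the terminator:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# newline='' disables Python's own newline translation, so to_csv's
# terminator is written exactly once per row (\r\n instead of \r\r\n)
with open('test.csv', 'w', newline='') as f:
    df.to_csv(f, line_terminator='\r\n')
Note that in newer pandas releases (1.5 and later) the parameter is spelled lineterminator.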

Related

Pandas read_csv (quickly!) with non-regex, multi-char sep

I am often given a file with ', ' (comma-space) separators that I would like to read into a pandas DataFrame. The straightforward pd.read_csv(fname, header=0, sep=', ') reads it just fine, but is ~8x slower (than an equivalent sep=',') on my files, because the multi-char sep is treated as a regex and forces the Python engine.
What is a good way (main metric: fast) to read files with non-regex, multi-char delimiters without needing to fall back on the Python engine? My current solution (posted below) essentially runs s/, /,/g on the file first, but this requires passing over the file twice. Is there a preferred solution without this drawback?
Use df = pd.read_csv(fname, skipinitialspace=True). From the read_csv documentation:
skipinitialspace : bool, default False
    Skip spaces after delimiter.
See also: Dealing with extra white spaces while reading CSV in Pandas
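A quick sketch of that approach (the file name is hypothetical); sep stays a single comma, so the fast C engine is used and the space after each comma is dropped:
import pandas as pd

# 'data.csv' is a hypothetical comma-space separated file
df = pd.read_csv('data.csv', sep=',', skipinitialspace=True)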
My current solution is to simply replace the two-character delimiter with a single comma before calling read_csv.
import pandas as pd
from io import StringIO

def read_comma_space(fname):
    with open(fname, 'r') as f:
        text = f.read().replace(', ', ',')
    s = StringIO(text)
    return pd.read_csv(s, header=0, sep=',')
This enables use of the C engine, but it has to make multiple passes over the file. Compared to a baseline read of the file (read_csv(fname, header=0, sep=',')), this solution adds about 50% to total execution time in my tests, which is much better than the ~8x execution time of the Python engine.

Strange character while reading a CSV file

I am trying to read a CSV file in Python, but the first element in the first row is read with a strange character prepended to the 0; that character isn't in the file, it's just a plain 0. Here is the code I used:
import csv

matriceDist = []
file = csv.reader(open("distanceComm.csv", "r"), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)
I had this same issue. Save your Excel file as CSV (MS-DOS) rather than CSV UTF-8, and those odd characters should be gone.
Specifying the byte order mark when opening the file as follows solved my issue:
open('inputfilename.csv', 'r', encoding='utf-8-sig')
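Applied to the question's code, that looks roughly like this (a sketch; utf-8-sig strips the byte order mark if present and reads like plain utf-8 otherwise):
import csv

matriceDist = []
with open('distanceComm.csv', 'r', encoding='utf-8-sig') as f:
    # the BOM is consumed by the codec, so the first cell comes back as '0'
    for row in csv.reader(f, delimiter=';'):
        matriceDist.append(row)
print(matriceDist)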
Just using pandas with an explicit encoding (utf-8, for example) is going to be easier:
import pandas as pd
df = pd.read_csv('distanceComm.csv', header=None, encoding = 'utf8', delimiter=';')
print(df)
I don't know what your input file is. But since it has a Byte Order Mark for UTF-8, you can use something like this:
import csv
import codecs

matriceDist = []
# utf-8-sig consumes the UTF-8 byte order mark at the start of the file
file = csv.reader(codecs.open('distanceComm.csv', encoding='utf-8-sig'), delimiter=";")
for row in file:
    matriceDist.append(row)
print(matriceDist)

Auto-detect the delimiter in a CSV file using pd.read_csv

Is there a way for read_csv to auto-detect the delimiter? numpy's genfromtxt does this.
My files have data with single space, double space and a tab as delimiters. genfromtxt() solves it, but is slower than pandas' read_csv.
Any ideas?
Another option is to use the built-in csv.Sniffer. I combine it with reading only a certain number of bytes, in case the CSV file is large.
import csv

def get_delimiter(file_path, bytes=4096):
    # read only the first few KB so large files aren't loaded whole
    sniffer = csv.Sniffer()
    with open(file_path, "r") as f:
        data = f.read(bytes)
    return sniffer.sniff(data).delimiter
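A possible usage sketch (the file name is hypothetical), feeding the sniffed delimiter straight into read_csv:
import pandas as pd

sep = get_delimiter('data.csv')        # e.g. ',', ';' or '\t'
df = pd.read_csv('data.csv', sep=sep)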
Option 1
Using delim_whitespace=True
df = pd.read_csv('file.csv', delim_whitespace=True)
Option 2
Pass a regular expression to the sep parameter:
df = pd.read_csv('file.csv', sep=r'\s+')
This is equivalent to the first option
Documentation for pd.read_csv.
For better control, I use a Python module called detect_delimiter, available on PyPI (see https://pypi.org/project/detect-delimiter/). It has been around for some time. As with all code, you should test it with your interpreter prior to deployment; I have tested up to Python 3.8.5.
See the code example below, where the delimiter is detected automatically and the variable delimiter is set from the method's output. The code then reads the CSV file with sep = delimiter. I have tested with the following delimiters, although others should work: ; , |
It does not work with multi-char delimiters, such as ",".
CAUTION! This method will do nothing to detect a malformed CSV file. In the case
where the input file contains both ; and , the method returns , as the detected delimiter.
from detect_delimiter import detect
import pandas as pd

filename = "some_csv.csv"
with open(filename) as f:
    firstline = f.readline()
delimiter = detect(firstline)
records = pd.read_csv(filename, sep=delimiter)

Reading ASCII with field delimiter as ctrl A and line delimiting as \n into python

I have an ASCII dataset that has ctrl A field delimiting and \n as the line delimiter. I am looking to read this into Python and am wondering how to deal with it. In particular I would like to be able to read this information into a pandas dataframe.
I currently have;
import pandas as pd
input = pd.read_csv('000000_0', sep='^A')
The error I then get is
__main__:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does
not support regex separators; you can avoid this warning by specifying engine='python'.
I then don't know how I am specifying the line delimiter too.
Any ideas?
Thanks in advance!
Instead of mentioning "^A", mention the hex code. It works like a charm:
import pandas as pd
data = pd.read_csv('000000_0', sep='\x01')
Use pd.read_csv with the parameter sep=chr(1):
from io import StringIO
import pandas as pd
mycsv = """a{0}b{0}c
d{0}e{0}f""".format(chr(1))
pd.read_csv(StringIO(mycsv), sep=chr(1))
   a  b  c
0  d  e  f
If by CTRL+A you mean the ASCII-Code for SOH (start of heading), try splitting your data on newline first to get the rows, and split these on "\x01", which is the hex code for SOH. But without any code, data, expected result or error message, this is mostly guessing.
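A minimal sketch of that manual approach, assuming the raw text has already been read into a string named raw:
# rows are separated by '\n', fields by '\x01' (SOH / Ctrl-A)
rows = [line.split('\x01') for line in raw.split('\n') if line]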
Try this (it rewrites a comma-delimited file as a chr(1)-delimited one):
import csv

reader = csv.reader(open("/Users/778123/Documents/Splunk/data/DMS3^idms_core^20200723140421.csv", newline=None), delimiter=',')
print(reader)
writer = csv.writer(open("/Users/778123/Documents/Splunk/data/DMS3^idms_core^test.csv", 'w'), delimiter=chr(1), quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(reader)
Python's csv library is pretty good at reading delimited files ;-)
Taking an example from the docs linked above:
import csv
with open('eggs.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print(', '.join(row))
This will automatically iterate over the lines in the file (and thus handle the newline characters), and you can set the delimiter as shown.

Pandas guess delimiter with sep=None

Pandas documentation has this:
With sep=None, read_csv will try to infer the delimiter automatically
in some cases by “sniffing”.
How can I access pandas' guess for the delimiter?
I want to read in 10 lines of my file, have pandas guess the delimiter, and start up my GUI with that delimiter already selected. But I don't know how to access what pandas thinks is the delimiter.
Also, is there a way to pass pandas a list of strings to restrict its guesses to?
Looking at the source code, I doubt that it's possible to get the delimiter out of read_csv. But pandas internally uses the Sniffer class from the csv module. Here's an example that should get you going:
import csv

s = csv.Sniffer()
print(s.sniff("a,b,c").delimiter)
print(s.sniff("a;b;c").delimiter)
print(s.sniff("a#b#c").delimiter)
Output:
,
;
#
What remains is reading the first line from a file and feeding it to the Sniffer.sniff() function, but I'll leave that up to you.
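For example, a sketch (file name hypothetical) that sniffs a sample from the file and also restricts the candidate delimiters, which loosely covers the second part of the question:
import csv

with open('data.csv', newline='') as f:
    sample = f.read(4096)
# the optional delimiters argument limits which characters Sniffer may guess
dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
print(dialect.delimiter)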
The csv.Sniffer is the simplest solution, but it doesn't work if you need to use compressed files.
Here's what's working, although it uses a private member, so beware:
import pandas as pd

reader = pd.read_csv('path/to/file.tar.gz', sep=None, engine='python', iterator=True)
sep = reader._engine.data.dialect.delimiter
reader.close()
