Python Pandas - use Multiple Character Delimiter when writing to_csv

It appears that the pandas to_csv function only allows single character delimiters/separators.
Is there some way to allow a string of characters to be used instead, like "::" or "%%"?
I tried:
df.to_csv(local_file, sep = '::', header=None, index=False)
and got:
TypeError: "delimiter" must be a 1-character string

Use numpy.savetxt
Ex:
np.savetxt('file.csv', np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s', encoding=None)
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
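For context, here is a self-contained version of that approach; the chunk_data DataFrame here is just stand-in data:

import numpy as np
import pandas as pd

chunk_data = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # stand-in data

# savetxt joins each row's formatted values with the delimiter string,
# so multi-character delimiters like '~|' are allowed.
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
# file.dat now contains:
# 1~|x
# 2~|y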

Think about what the line a::b::c means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping: "abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.
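A minimal sketch of that trick, assuming a small throwaway DataFrame (the padding column names are my own placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [3.0, 4.0]})

# Interleave an empty-string column after every real column, so that
# joining with ':' yields '::' between the real values.
padded = pd.DataFrame()
for i, col in enumerate(df.columns):
    padded[col] = df[col]
    if i < len(df.columns) - 1:
        padded['_pad%d' % i] = ''  # hypothetical placeholder column

padded.to_csv('out.csv', sep=':', header=False, index=False)
# out.csv:
# 1::x::3.0
# 2::y::4.0

Swap ':' for '%' and the same trick yields '%%'.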

Related

How to escape the escapechar in pandas to_csv

I'm trying to write dataframes to CSV. A lot of the incoming data is user-generated and may contain special characters. I can set escapechar='\\' (for example), but then if there is a backslash in the data it gets written as "\", which gets interpreted as an escaped double quote rather than a string containing a backslash. How can I escape the escapechar (i.e., how can I have to_csv write \\ by escaping the backslash)?
Example code:
import pandas as pd
import io, csv
data = [[1, "\\", "text"]]
df = pd.DataFrame(data)
sIo = io.StringIO()
df.to_csv(
    sIo,
    index=False,
    sep=',',
    header=False,
    quoting=csv.QUOTE_MINIMAL,
    doublequote=False,
    escapechar='\\'
)
sioText = sIo.getvalue()
print(sioText)
Actual output:
1,"\",text
What I need:
1,"\\",text
The engineering use case that creates the constraints is that this will be some core code for moving data from one system to another. I won't know the format of the data in advance and won't have much control over it (any column could contain the escape character), and I can't control the escape character on the other side, so the actual output will be read as an error. Hence the original question of "how do you escape the escape character."
For reference this parameter's definition in the pandas docs is:
escapechar : str, default None
String of length 1. Character used to escape sep and quotechar when appropriate.
For anyone coming across this, I solved this by using Pandas' regex replacer:
df = df.replace('\\\\', '\\\\\\\\', regex=True)
We need four backslashes per final backslash because we are doing two layers of escaping: one for literal Python strings, and one to escape them in the regular expression. This will find and replace any \ in any column of the data frame, wherever it appears in the string.
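Applied to the question's example code, a quick check (same DataFrame and to_csv options as above):

import io, csv
import pandas as pd

df = pd.DataFrame([[1, "\\", "text"]])
df = df.replace('\\\\', '\\\\\\\\', regex=True)  # double every backslash first

sIo = io.StringIO()
df.to_csv(sIo, index=False, sep=',', header=False,
          quoting=csv.QUOTE_MINIMAL, doublequote=False, escapechar='\\')
print(sIo.getvalue())
# 1,"\\",text   (the desired output, given the writer behavior shown above)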
It is mind-boggling to me that this is still the default behavior.
Huh. This seems like an open issue with round-tripping data from pandas to csv. See this issue: https://github.com/pandas-dev/pandas/issues/14122, and especially pandas creator Wes McKinney's post:
This behavior is present in the csv module https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4 . From first principles seems like the offending backslash should be escaped. If I manually edit the file to be
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\\"
then read_csv returns the original input
I fiddled with R and it doesn't seem to do much better
> df <- data.frame(a=c("Hello! Please \"help\" me. I cannot quote a csv.\\"))
> write.table(df, sep=',', qmethod='e', row.names=F)
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\"
Another example of CSV not being a high fidelity data interchange tool =|
I'm as baffled as you that this doesn't work, but it seems like the official position is... df[col] = df[col].str.replace('\\', '\\\\', regex=False)?

Python: how to build a text file for loading data?

I'm new to Python, and I am following this guide to implement a linear regression
http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise1.ipynb
Basically I am on the step where I need to build a data set and import it into Python.
I have created a text file with two columns, with the data separated by a tab.
However, this is what I get.
I looked around online and it seems the tab delimiter is the issue. What am I doing wrong? How can I build this text file?
I would advise using the official documentation instead of "looking around online": if you check the pandas read_csv() documentation, it lists (at the very top) the default values of each argument. The default value of the sep (separator) argument is ','. So just change your call to pd.read_csv() to add sep='\t'.
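A quick sketch of that call; ex1data.txt is the tab-separated file from the question, and the column names are just illustrative guesses:

import pandas as pd

# sep='\t' because the columns are tab-separated;
# header=None assumes the exercise file has no header row.
df = pd.read_csv('ex1data.txt', sep='\t', header=None, names=['population', 'profit'])
print(df.head())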
Use ',' instead of a tab as the delimiter in your text file ex1data.txt, as pandas' default delimiter is ','.
Here is an explanation from pandas official documentation for delimiter :
sep : str, default ','
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used automatically. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'

LPTHW: double quotes around CSV.writer

While going through LPTHW, I've set out to read the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double quotes. There are several questions here about the problem but I'm not grokking it.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
Double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you will encounter some more complex problems. The first one is: what should happen if the character denoting the delimiter occurs in field content? Because there is no real standard for the CSV format, different people had different ideas about the correct answer to this question.
The Python csv library tries to abstract away this complexity and the various approaches, making it easier to read and write CSV files that follow different rules. This is done by Dialect class objects.
The author of write_file function decided to construct output row manually by joining all fields and delimiter characters together, but then used csv module to actually write data into file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
                 + ',' + givers_list[ind][3]
                 + ',' + givers_list[rand_vec[ind]][1] + ','
                 + givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of the csv module resulted in the entire row of data being treated as a single field. Because that field contains characters used as field delimiters, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that the field should be quoted using Dialect.quotechar, which defaults to the double quote character ("). That's why the entire field ends up surrounded by double quote characters.
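A short demonstration of that difference, using only the standard library:

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)          # default dialect: QUOTE_MINIMAL, quotechar='"'
writer.writerow(['a,b,c'])        # ONE field that happens to contain commas
writer.writerow(['a', 'b', 'c'])  # three separate fields
print(buf.getvalue())
# "a,b,c"
# a,b,c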
A fast and easy, but not correct, solution would be changing the quoting algorithm to csv.QUOTE_NONE. This tells the writer object never to surround fields, and instead to escape special characters with Dialect.escapechar. According to the documentation, leaving it as None (the default) will raise an error. I guess that setting it to an empty string could do the job.
The correct solution is feeding writer.writerow the expected input data: a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
                 givers_list[ind][3],
                 givers_list[rand_vec[ind]][1],
                 givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double) quotes are needed when there is a separator character inside a field, and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.

csv.reader removing commas within quoted values

I have two programs that I converted to using the csv module in Python 2.7. The first I just used csv.reader:
f = open(sys.argv[1], 'r')
line = csv.reader(f)
for values in line:
    # ...do some stuff...
In this case, commas within quoted strings are removed. Another program was too complex to just read, so I copied the file, changing the delimiter to a semicolon:
fin = open(sys.argv[1], 'rb')
lines = csv.reader(fin)
with open('temp.csv', 'wb') as fout:
    w = csv.writer(fout, delimiter=';')
    w.writerows(lines)
In this case the commas inside quoted strings are preserved. I found no posts about this on bugs.python.org, so I conclude it must be me. Has anyone else experienced this behavior? In the second case all elements are quoted; in the first case quotes are used only where required.
It's hard to guess without seeing a sample of your input file and the resulting output, but maybe this is related: Read CSV file with comma within fields in Python
Try csv.reader(f, skipinitialspace=True).
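For reference, a minimal reproduction of what that option changes (made-up data):

import csv, io

data = 'a, "b,c", d\n'  # note the space after each comma

# Without skipinitialspace, the quoted field is not recognized, because the
# quote does not come immediately after the delimiter.
print(list(csv.reader(io.StringIO(data))))
# [['a', ' "b', 'c"', ' d']]

print(list(csv.reader(io.StringIO(data), skipinitialspace=True)))
# [['a', 'b,c', 'd']]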

Own pretty print option in python script

I'm outputting a pretty huge XML structure to a file, and I want the user to be able to enable/disable pretty printing.
I'm working with approximately 150 MB of data; when I tried xml.etree.ElementTree and built the tree structure from its Element objects, it used an awful lot of memory, so I do this manually by storing raw strings and outputting with .write(). My output sequence looks like this:
ofile.write(pretty_print(u'\
\t\t<LexicalEntry id="%s">\n\
\t\t\t<feat att="languageCode" val="cz"/>\n\
\t\t\t<Lemma>\n\
\t\t\t\t<FormRepresentation>\n\
\t\t\t\t\t<feat att="writtenForm" val="%s"/>\n\
\t\t\t\t</FormRepresentation>\n\
\t\t\t</Lemma>\n\
\t\t\t<Sense>%s\n' % (str(lex_id), word['word'], '' if word['pos']=='' else '\n\t\t\t\t<feat att="partOfSpeech" val="%s"/>' % word['pos'])))
Inside the .write() I call my function pretty_print, which, depending on a command-line option, SHOULD strip all tab and newline characters:
o_parser = OptionParser()
# ....
o_parser.add_option("-p", "--prettyprint", action="store_true", dest="pprint", default=False)
# ....
def pretty_print(string):
    if not options.pprint:
        return string.strip('\n\t')
    return string
I wrote 'should' because it does not; in this particular case it does not strip any of the characters.
BUT in this case, it works fine:
for ss in word['synsets']:
ofile.write(pretty_print(u'\t\t\t\t<Sense synset="%s-synset"/>\n' % ss))
The first thing that came to my mind was that there might be some issue with the substitution, but when I print the passed string inside the pretty_print function it looks perfectly fine.
Any suggestions as to what might cause .strip() not to work?
Or, if there is a better way to do this, I'll accept any advice.
Your issue is that str.strip() only removes characters from the beginning and end of a string.
You either want str.replace() to remove all instances, or to split the string into lines and strip each line, if you only want to remove them from the beginning and end of each line.
Also note that for your massive string, Python supports multi-line strings with triple quotes that will make it a lot easier to type out, and the old-style string formatting with % has been superseded by str.format(), which you probably want to use instead in new code.
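A minimal sketch of both options (the function names are mine):

def compact(string):
    # Remove every newline and tab, wherever they appear in the string.
    return string.replace('\n', '').replace('\t', '')

def strip_lines(string):
    # Strip tabs only at the ends of each line, keeping the line breaks.
    return '\n'.join(line.strip('\t') for line in string.splitlines())

s = '\t\t<Lemma>\n\t\t\t<FormRepresentation/>\n'
print(repr(compact(s)))      # '<Lemma><FormRepresentation/>'
print(repr(strip_lines(s)))  # '<Lemma>\n<FormRepresentation/>'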
