How to escape the escapechar in pandas to_csv - python

I'm trying to write dataframes to CSV. A lot of the incoming data is user-generated and may contain special characters. I can set escapechar='\\' (for example), but then if there is a backslash in the data, it gets written as "\", which gets interpreted as an escaped double-quote rather than a string containing a backslash. How can I escape the escapechar (i.e., how can I have to_csv write \\ by escaping the backslash)?
Example code:
import pandas as pd
import io, csv
data = [[1, "\\", "text"]]
df = pd.DataFrame(data)
sIo = io.StringIO()
df.to_csv(
    sIo,
    index=False,
    sep=',',
    header=False,
    quoting=csv.QUOTE_MINIMAL,
    doublequote=False,
    escapechar='\\'
)
sioText = sIo.getvalue()
print(sioText)
Actual output:
1,"\",text
What I need:
1,"\\",text
The engineering use case that creates these constraints is that this will be core code for moving data from one system to another. I won't know the format of the data in advance and won't have much control over it (any column could contain the escape character), and I can't control the escape character on the receiving side, so output like the above will be parsed incorrectly there. Hence the original question of "how do you escape the escape character."
For reference this parameter's definition in the pandas docs is:
escapechar : str, default None
    String of length 1. Character used to escape sep and quotechar when appropriate.

For anyone coming across this, I solved it by using Pandas' regex replacer:
df = df.replace('\\\\', '\\\\\\\\', regex=True)
We need 4 slashes per final slash because we are doing 2 layers of escaping: one for literal Python strings, and one to escape them in the regular expression. This will find-and-replace any backslash in any column of the data frame, wherever it appears in the string.
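Applied to the snippet from the question, a minimal sketch of the full round trip with the pre-escaping step added (same to_csv options as above):
import pandas as pd
import io, csv

data = [[1, "\\", "text"]]
df = pd.DataFrame(data)
# pre-escape: turn each literal backslash into two before writing
df = df.replace('\\\\', '\\\\\\\\', regex=True)
sIo = io.StringIO()
df.to_csv(sIo, index=False, sep=',', header=False,
          quoting=csv.QUOTE_MINIMAL, doublequote=False, escapechar='\\')
print(sIo.getvalue())  # 1,"\\",text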
It is mind-boggling to me that this is still the default behavior.

Huh. This seems like an open issue with round-tripping data from pandas to csv. See this issue: https://github.com/pandas-dev/pandas/issues/14122, and especially pandas creator Wes McKinney's post:
This behavior is present in the csv module https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4 . From first principles seems like the offending backslash should be escaped. If I manually edit the file to be
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\\"
then read_csv returns the original input
I fiddled with R and it doesn't seem to do much better
> df <- data.frame(a=c("Hello! Please \"help\" me. I cannot quote a csv.\\"))
> write.table(df, sep=',', qmethod='e', row.names=F)
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\"
Another example of CSV not being a high fidelity data interchange tool =|
I'm as baffled as you that this doesn't work, but it seems like the official position is... df[col] = df[col].str.replace('\\', '\\\\', regex=False)?

Related

Python pandas delimiter misprint - double sign

This is my code to open the file:
df = pd.read_csv(path_df, delimiter='|')
I get the error: Error tokenizing data. C error: Expected 5 fields in line 13571, saw 6
When I check that particular line, I see there was a misprint: three signs "|||" appear instead of one. I would prefer to treat double and triple signs as one, but perhaps there is another solution.
How can I solve this problem?
Use the regex separator [|]+ - one or more |:
import pandas as pd
import io

temp = u"""a|b|c
ss|||s|s
t|g|e"""
# after testing, replace io.StringIO(temp) with 'filename.csv'
df = pd.read_csv(io.StringIO(temp), sep="[|]+", engine='python')
print(df)
a b c
0 ss s s
1 t g e
Another way to define a delimiter is using sep while reading a CSV in pandas.
df = pd.read_csv(path_df, sep=r'\|+', engine='python')
Note that regex separators like this are not supported by the default C engine (the source of the 'C error' message), so you need to force the Python engine by specifying engine='python' in the arguments.
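A quick sketch of what happens without the explicit engine (assuming a recent pandas; the exact warning text may vary by version):
import io
import warnings
import pandas as pd

data = "a|b|c\nss|||s|s\nt|g|e\n"

# without engine='python', pandas falls back and emits a ParserWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pd.read_csv(io.StringIO(data), sep=r"\|+")
    print([str(w.message) for w in caught])

# with the explicit engine: same result, no warning
print(pd.read_csv(io.StringIO(data), sep=r"\|+", engine="python"))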
My suspicion is that this is caused by the file being written incorrectly: if a field was supposed to contain the value "|", a csv writer would normally quote or escape it (such a line would normally be written as 1|2|3|"|"|5), but if it was mistakenly written without any escaping, it would cause exactly this issue.
In that case I don't think you can solve this with pandas, because the issue is badly formed csv.
If it's a one-off you can just edit the file first, perhaps replacing all "|||" with "||" - but again, that could have unintended consequences. I've had this trouble before and I don't think there's a better way than manually editing the file (at least pandas gives you the line number to look at!)
On the other hand, if it really is just a repeated character misprint, then the other answer will work fine.

Python Pandas - use Multiple Character Delimiter when writing to_csv

It appears that the pandas to_csv function only allows single character delimiters/separators.
Is there some way to allow a string of characters to be used, like "::" or "%%", instead?
I tried:
df.to_csv(local_file, sep = '::', header=None, index=False)
and I get:
TypeError: "delimiter" must be a 1-character string
Use numpy.savetxt, which accepts multi-character delimiters.
Ex (chunk_data here is the DataFrame to write):
import numpy as np

np.savetxt('file.csv', np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s', encoding=None)
np.savetxt('file.dat', chunk_data.values, delimiter='~|', fmt='%s', encoding='utf-8')
Think about what the line a::b::c means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping: "abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.
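Here's a minimal sketch of that idea (the _pad column names are arbitrary placeholders):
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'x'], 'b': [1, 2], 'c': ['text', 'y']})

# interleave an empty filler column between every real column,
# so that sep=':' renders each boundary as '::'
padded = pd.DataFrame()
for i, col in enumerate(df.columns):
    padded[col] = df[col]
    if i < len(df.columns) - 1:
        padded['_pad%d' % i] = ''

print(padded.to_csv(sep=':', index=False, header=False))
# abc::1::text
# x::2::y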

Pandas to_csv prefixing 'b' when doing .astype('|S') on column

I'm following the advice of this article to reduce Pandas DataFrame memory usage. I'm using .astype('|S') on object columns like so:
data_frame['COLUMN1'] = data_frame['COLUMN1'].astype('|S')
data_frame['COLUMN2'] = data_frame['COLUMN2'].astype('|S')
Performing this on the DataFrame cuts memory usage by 20-40% without negative impacts on processing the columns. However, when outputting the file using .to_csv():
data_frame.to_csv(filename, sep='\t', encoding='utf-8')
The columns converted with .astype('|S') are output with a b prefix and single quotes:
b'00001234' b'Source'
Removing the .astype('|S') call and outputting to csv gives the expected behavior:
00001234 Source
Some googling on this issue does find GitHub issues, but I don't think they are related (looks like they were fixed as well): to_csv and bytes on Python 3, BUG: Fix default encoding for CSVFormatter.save
I'm on Python 3.6.4 and Pandas 0.22.0. I verified that the behavior is consistent on both MacOS and Windows. Any advice on how to output the columns without the b prefix and single quotes?
The b prefix indicates a Python 3 bytes literal: after the conversion, the column holds bytes objects rather than unicode strings. So if you want to remove the prefix, you can decode each bytes object using its decode method before saving to a csv file:
data_frame['COLUMN1'] = data_frame['COLUMN1'].apply(lambda s: s.decode('utf-8'))
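Equivalently, a small sketch using pandas' .str accessor, which can decode a whole bytes column in one call:
data_frame['COLUMN1'] = data_frame['COLUMN1'].str.decode('utf-8')
data_frame['COLUMN2'] = data_frame['COLUMN2'].str.decode('utf-8')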

Python Pandas: correct way to change comma decimal to dot decimal in Pandas Dataframe? [duplicate]

I am importing a CSV file like the one below, using pandas.read_csv:
df = pd.read_csv(Input, delimiter=";")
Example of CSV file:
10;01.02.2015 16:58;01.02.2015 16:58;-0,59;0,1;-4,39;NotApplicable;0,79;0,2
11;01.02.2015 16:58;01.02.2015 16:58;-0,57;0,2;-2,87;NotApplicable;0,79;0,21
The problem is that when I later on in my code try to use these values I get this error: TypeError: can't multiply sequence by non-int of type 'float'
The error occurs because the numbers I'm trying to use are written with a comma (,) as the decimal separator instead of a dot (.). After manually changing the commas to dots, my program works.
I can't change the format of my input, and thus have to replace the commas in my DataFrame in order for my code to work, and I want python to do this without the need of doing it manually. Do you have any suggestions?
pandas.read_csv has a decimal parameter for this; see the docs.
I.e. try with:
df = pd.read_csv(Input, delimiter=";", decimal=",")
I think the earlier-mentioned answer of including decimal="," in pandas read_csv is the preferred option.
However, I found it to be incompatible with the Python parsing engine: e.g., when using skiprows=, read_csv will fall back to this engine, so you can't use skiprows= and decimal= in the same read_csv statement, as far as I know. Also, I haven't been able to actually get the decimal= statement to work (probably due to me, though).
The long way round I used to achieve the same result is with list comprehensions, .replace and .astype. The major downside to this method is that it needs to be done one column at a time:
df = pd.DataFrame({'a': ['120,00', '42,00', '18,00', '23,00'],
                   'b': ['51,23', '18,45', '28,90', '133,00']})
df['a'] = [x.replace(',', '.') for x in df['a']]
df['a'] = df['a'].astype(float)
Now, column a will have float type cells. Column b still contains strings.
Note that the .replace used here is not pandas' but rather Python's built-in version. Pandas' version requires the string to be an exact match or a regex.
stallasia's answer looks like the best one.
However, if you want to change the separator when you already have a dataframe, you could do:
df['a'] = df['a'].str.replace(',', '.').astype(float)
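If many columns are affected, a small sketch applying the same one-liner to every object-dtype column at once:
import pandas as pd

df = pd.DataFrame({'a': ['120,00', '42,00'], 'b': ['51,23', '18,45']})
# convert every string column from comma-decimals to float
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace(',', '.').astype(float)
print(df.dtypes)  # both columns are now float64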
Thanks for the great answers. I just want to add that in my case just using decimal=',' did not work, because I had numbers like 1.450,00 (with a thousands separator), so pandas did not recognize them; passing thousands='.' as well made it read the file correctly:
df = pd.read_csv(
    Input,
    delimiter=";",
    decimal=",",
    thousands="."
)
I'll answer the question of how to change the decimal comma to a decimal dot with Python Pandas.
$ cat test.py
import pandas as pd
df = pd.read_csv("test.csv", quotechar='"', decimal=",")
df.to_csv("test2.csv", sep=',', encoding='utf-8', quotechar='"', decimal='.')
where we specify the decimal separator as a comma when reading, while for the output it is specified as a dot. So
$ cat test.csv
header,header2
1,"2,1"
3,"4,0"
$ cat test2.csv
,header,header2
0,1,2.1
1,3,4.0
where you see that the decimal separator has changed to a dot.

LPTHW: double quotes around CSV.writer

While going through LPTHW, I've set about reading the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double-quotes. There are several questions here about this problem, but I'm not grokking the answers.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
The double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you will encounter some more complex problems. The first one is: what should happen if the character denoting the delimiter occurs in field content? Because there is no real standard for the CSV format, different people have had different ideas about the correct answer to this question.
The Python csv library tries to abstract away this complexity and the various approaches, making it easier to read and write CSV files that follow different rules. This is done with Dialect class objects.
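For instance, a minimal sketch of how two quoting settings render the same row differently:
import csv, io

row = ['a,b', 'c']
for quoting in (csv.QUOTE_MINIMAL, csv.QUOTE_ALL):
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    print(buf.getvalue(), end='')
# "a,b",c
# "a,b","c"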
The author of the write_file function decided to construct the output row manually, joining all fields and delimiter characters together, but then used the csv module to actually write the data into the file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
                 + ',' + givers_list[ind][3]
                 + ',' + givers_list[rand_vec[ind]][1] + ','
                 + givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of the csv module resulted in the entire row of data being treated as a single field. Because that field contains the character used as the field delimiter, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that the field should be quoted using Dialect.quotechar - which defaults to the double quote character ("). That's why the entire field eventually ends up surrounded by double quote characters.
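To make the difference concrete, a quick sketch (the sample values are made up):
import csv, io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['Alice,101,alice#mail.org'])        # one field containing commas
writer.writerow(['Alice', '101', 'alice#mail.org'])  # three separate fields
print(buf.getvalue())
# "Alice,101,alice#mail.org"
# Alice,101,alice#mail.org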
A fast and easy, but not correct, solution would be changing the quoting algorithm to csv.QUOTE_NONE. This tells the writer object to never surround fields, and instead to escape special characters with Dialect.escapechar. According to the documentation, leaving it set to None (the default) will raise an error. I guess that setting it to an empty string could do the job.
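For illustration, here's what QUOTE_NONE looks like with a backslash as the escapechar (I haven't verified the empty-string idea, so this sketch uses a backslash instead):
import csv, io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(['has,comma', 'plain'])
print(buf.getvalue())
# has\,comma,plain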
The correct solution is feeding writer.writerow the input it expects - a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
                 givers_list[ind][3],
                 givers_list[rand_vec[ind]][1],
                 givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double) quotes are needed when there is a separator character inside a field - and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.
