csv.reader removing commas within quoted values - python

I have two programs that I converted to using the csv module in Python 2.7. The first I just used csv.reader:
f=open(sys.argv[1],'r')
line=csv.reader(f)
for values in line:
…do some stuff…
In this case commas within quoted string are removed. Another program was too complex just to read so I copied the file changing the delimiter to semi-colon:
fin=open(sys.argv[1],'rb')
lines=csv.reader(fin)
with open('temp.csv','wb') as fout:
w=csv.writer(fout,delimiter=';')
w.writerows(lines)
In this case the commas inside of quoted string are preserved. I find no posts on bugs.python.org on this, so I conclude it must be me. So, anyone else experience this behavior. In the second case all elements are quoted. In the first case quotes are used only where required.

It's hard to guess without seeing a sample of your input file and the resulting output, but maybe this is related: Read CSV file with comma within fields in Python
Try csv.reader(f, skipinitialspace=True).

Related

How to read a csv that contains commas in multiple columns where delimiter is already a comma in python? [duplicate]

I need some help, I have a CSV file that contains an address field, whoever input the data into the original database used commas to separate different parts of the address - for example:
Flat 5, Park Street
When I try to use the CSV file it treats this one entry as two separate fields when in fact it is a single field. I have used Python to strip commas out where they are between inverted commas as it is easy to distinguish them from a comma that should actually be there, however this problem has me stumped.
Any help would be gratefully received.
Thanks.
You can define the separating and quoting characters with Python's CSV reader. For example:
With this CSV:
1,`Flat 5, Park Street`
And this Python:
import csv
with open('14144315.csv', 'rb') as csvfile:
rowreader = csv.reader(csvfile, delimiter=',', quotechar='`')
for row in rowreader:
print row
You will see this output:
['1', 'Flat 5, Park Street']
This would use commas to separate values but inverted commas for quoted commas
The CSV file was not generated properly. CSV files should have some form of escaping of text, usually using double-quotes:
1,John Doe,"City, State, Country",12345
Some CSV exports do this to all fields (this is an option when exporting from Excel/LibreOffice), but ambiguous fields (such as those including commas) must be escaped.
Either fix this manually or properly regenerate the CSV. Naturally, this cannot be fixed programatically.
Edit: I just noticed something about "inverted commas" being used for escaping - if that is the case see Jason Sperske's answer, which is spot on.

Python Pandas - use Multiple Character Delimiter when writing to_csv

It appears that the pandas to_csv function only allows single character delimiters/separators.
Is there some way to allow for a string of characters to be used like, "::" or "%%" instead?
I tried:
df.to_csv(local_file, sep = '::', header=None, index=False)
and getting:
TypeError: "delimiter" must be a 1-character string
Use numpy-savetxt
Ex:
np.savetxt(file.csv, np.char.decode(chunk_data.values.astype(np.bytes_), 'UTF-8'), delimiter='~|', fmt='%s',encoding=None)
np.savetxt(file.dat, chunk_data.values, delimiter='~|', fmt='%s',encoding='utf-8')
Think about what this line a::b::c‘ means to a standard CSV tool: an a, an empty column, a b, an empty column, and a c. Even in a more complicated case with quoting or escaping:"abc::def"::2 means an abc::def, an empty column, and a 2.
So, all you have to do is add an empty column between every column, and then use : as a delimiter, and the output will be almost what you want.
I say “almost” because Pandas is going to quote or escape single colons. Depending on the dialect options you’re using, and the tool you’re trying to interact with, this may or may not be a problem. Unnecessary quoting usually isn’t a problem (unless you ask for QUOTE_ALL, because then your columns will be separated by :"":, so hopefully you don’t need that dialect option), but unnecessary escapes might be (e.g., you might end up with every single : in a string turned into a \: or something). So you have to be careful with the options. But it’ll work for the basic “quote as needed, with mostly standard other options” settings.

cleaning the format of the printed data in python

I am trying to compare two lists in python and produce two arrays that contain matching rows and non-matching rows, but the program prints the data in an ugly format. How can I clean I go about cleaning it up?
If you want to read the file without the \n character, you might consider doing the following
lines = list1.readlines()
lines2 = list2.readlines()
would read your file without the "\n" characters
Alternatively, for each line, you can do .strip("\n")
The "ugly format" might be because you are using print(match) (which is actually translated by Python to print ( repr(match) ), printing something that is more useful for debugging or as input back to Python - but not 'nice'.
If you want it printed 'nicely', you'd have to decide what format that would be and write the code for it. In the simplest case, you might do:
for i in match:
print(i)
(note your original list contains \n characters, that's what enumerating an open text file does. They will get printed, as well (together with the `\n' added by print() itself). I don't know if you want them removed or not. See the other answer for possible ways of getting rid of them.

Django import-export leading zeros for numerical values in excel

I am faced with the following problem: when I generate .csv files in python using django-import-export even though the field is a string, when I open it in Excel the leading zeros are omitted. E.g. 000123 > 123.
This is a problem, because if I'd like to display a zipcode I need the zeros the way they are. I can cover it in quotes, but that's not desirable since it will grab unnecessary attention and it just looks bad. I'm also aware that you can do it in Excel files manually by changing the data type, but I don't want to explain that to people who are using my software.
Any suggestions?
Thanks in advance.
I've tried this solution. It's the solution suggested by #jquijano but it hasn't worked.
After generating the CSV, I opened it with 'open office' and 'excel' and in both cases I could see the (') character at the beginning of each string. However, if I added a new value to the CSV in the editor, for example '0895, the (') disappeared and the leading 0 wasn't removed.
Luckily, I found a workaround. I just added an empty character at the beginning.
value = chr(24) + unidecode('00123')
An easy fix would be adding an apostrophe (') at the beginning of each number when doing using import-export. This way Excel will recognize those numbers as a text.

LPTHW: double quotes around CSV.writer

While going through LPTHW, I've set on reading the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double-quotes. There are several questions here about the problem but I'm not groking.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
Double quotes are introduced in write_file function.
CSV files look simple on the surface, but sooner or later you will encounter some more complex problems. The first one is: what should happen if character denoting delimiter occurs in field content? Because there is no real standard for CSV format, different people had different ideas of correct answer for this question.
Python csv library tries to abstract this complexity and various approaches and make it easier to read and write CSV files following different rules. This is done by Dialect class objects.
The author of write_file function decided to construct output row manually by joining all fields and delimiter characters together, but then used csv module to actually write data into file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
+ ',' + givers_list[ind][3]
+ ',' + givers_list[rand_vec[ind]][1] + ','
+ givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of csv module resulted in entire row of data being treated as single field. Because that field contains characters used as field delimiters, Dialect.quoting decides how it should be handled. Default quoting configuration, csv.QUOTE_MINIMAL says that field should be quoted using Dialect.quotechar - which defaults to double quote character ("). That's why eventually entire field ends up surrounded by double quote characters.
Fast and easy, but not correct, solution would be changing quoting algorithm to csv.QUOTE_NONE. This will tell writer object to never surround fields, but instead to escape special characters by Dialect.escapechar. According to documentation, setting it to None (default) will raise an error. I guess that setting it to empty string could do the job.
The correct solution is feeding writer.writerrow with expected input data - list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
givers_list[ind][3],
givers_list[rand_vec[ind]][1],
givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
In general, (double)quotes are needed when there is a seperator-char inside a field - and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as seperate fields instead of one long string.

Categories

Resources