Use csv file with multiple newline characters in Python 3

I'm trying to import a CSV file that uses # as the delimiter and \r\n as the line break. One column contains data that also has newlines in it, but these are plain \n.
I can read the file line by line without problems, but I'm stuck when using the csv library (Python 3).
The example below throws:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Is it possible to use the csv lib with multiple newline characters?
Thanks!
import csv
with open('../database.csv', newline='\r\n') as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
database.csv:
2202187#"645cc14115dbfcc4defb916280e8b3a1"#"cd2d3e434fb587db2e5c2134740b8192"#"{
Age = 22;
Salary = 242;
}

Please try this code. According to the Python 3.5.4 documentation, with newline=None, common line endings such as '\r\n' are translated to '\n' when reading.
import csv
with open('../database.csv', newline=None) as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
I've replaced newline='\r\n' with newline=None.
You could also use the 'rU' mode, but it is deprecated (and was removed in Python 3.11).
...
with open('../database.csv', 'rU') as csvfile:
    ...
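A minimal sketch of the same situation, using newline='' (the value the csv documentation recommends) rather than newline=None. The sample data is hypothetical and held in memory through io.StringIO so the snippet runs without a file; rows end in \r\n while the quoted field contains bare \n.

```python
import csv
import io

# Hypothetical sample: records end with \r\n, the quoted field holds bare \n.
sample = (
    '1#"a"#"b"#"{\nAge = 22;\nSalary = 242;\n}"\r\n'
    '2#"c"#"d"#"{\nAge = 30;\n}"\r\n'
)

# newline='' passes line endings through untranslated, so the csv module
# decides which newlines end a record and which belong to a quoted field.
with io.StringIO(sample, newline='') as csvfile:
    rows = list(csv.reader(csvfile, delimiter='#', quotechar='"'))

for row in rows:
    print(repr(row[3]))
```

With a real file, the equivalent call would be open('../database.csv', newline='').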

Related

\ufeff is appearing while reading csv using unicodecsv module

I have following code
import unicodecsv
CSV_PARAMS = dict(delimiter=",", quotechar='"', lineterminator='\n')
unireader = unicodecsv.reader(open('sample.csv', 'rb'), **CSV_PARAMS)
for line in unireader:
    print(line)
and it prints
['\ufeff"003', 'word one"']
['003,word two']
['003,word three']
The CSV looks like this
"003,word one"
"003,word two"
"003,word three"
I am unable to figure out why the first row has \ufeff (which I believe is a file marker). There is also a stray " at the beginning of the first row.
The CSV file comes from a client, so I can't dictate how they save it. I'm looking to fix my code so that it handles the encoding.
Note: I have already tried passing encoding='utf8' to CSV_PARAMS and it didn't solve the problem.

encoding='utf-8-sig' will remove the UTF-8-encoded BOM (byte order mark) used as a UTF-8 signature in some files:
import unicodecsv
with open('sample.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='utf-8-sig')
    for line in r:
        print(line)
Output:
['003,word one']
['003,word two']
['003,word three']
But why are you using the third-party unicodecsv with Python 3? The built-in csv module handles Unicode correctly:
import csv
# Note, newline='' is a documented requirement for the csv module
# for reading and writing CSV files.
with open('sample.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.reader(f)
    for line in r:
        print(line)
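To see the difference between the two encodings side by side, here is a small sketch that decodes the same BOM-prefixed bytes with 'utf-8' and with 'utf-8-sig'. The bytes are built in memory as an assumption about what the client's file contains, and read_rows is a hypothetical helper, not part of the csv module.

```python
import codecs
import csv
import io

# Assumed file content: a UTF-8 BOM followed by two quoted rows.
raw = codecs.BOM_UTF8 + b'"003,word one"\r\n"003,word two"\r\n'

def read_rows(data, encoding):
    """Decode bytes with the given encoding and parse them as CSV."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline='') as text:
        return list(csv.reader(text))

with_bom = read_rows(raw, 'utf-8')         # BOM survives as '\ufeff'
without_bom = read_rows(raw, 'utf-8-sig')  # BOM is stripped while decoding

print(with_bom[0])
print(without_bom[0])
```

Because the BOM sits in front of the opening quote, the plain 'utf-8' reader no longer sees the field as quoted, which is why the first row splits at the comma.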

Open file has data but reports back length 0 in python

I must be missing something very simple here, but I've been hitting my head against the wall for a while and don't understand where the error is. I am trying to open a csv file and read the data. I am detecting the delimiter, then reading in the data with this code:
with open(filepath, 'r') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read())
    delimiter = repr(dialect.delimiter)[1:-1]
    csvdata = [line.split(delimiter) for line in csvfile.readlines()]
However, my csvfile is being read as having no length. If I run:
print(sum(1 for line in csvfile))
The result is zero. If I run:
print(sum(1 for line in open(filepath, 'r')))
Then I get five lines, as expected. I've checked for name clashes by changing csvfile to other random names, but this does not change the result. Am I missing a step somewhere?

You need to move the file pointer back to the start of the file after sniffing it. You don't need to read the whole file in to do that, just enough to include a few rows:
import csv
with open(filepath, 'r') as f_input:
    dialect = csv.Sniffer().sniff(f_input.read(2048))
    f_input.seek(0)
    csv_input = csv.reader(f_input, dialect)
    csv_data = list(csv_input)
Also, csv.reader() will do the splitting for you.
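The effect of the file pointer can be reproduced in memory. This sketch assumes a small semicolon-delimited sample held in io.StringIO in place of the real file; the variable leftover is only there to show that nothing remains to be read after sniffing.

```python
import csv
import io

# Hypothetical sample standing in for the file at filepath.
f_input = io.StringIO('a;b;c\n1;2;3\n4;5;6\n', newline='')

dialect = csv.Sniffer().sniff(f_input.read(2048))

# The read above consumed the data, so a second read returns nothing.
leftover = f_input.read()

# Rewind before handing the file to csv.reader.
f_input.seek(0)
csv_data = list(csv.reader(f_input, dialect))

print(repr(dialect.delimiter), repr(leftover), csv_data)
```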

Python Exception for escaped quotes within exception for quotes

I want to convert a CSV file from comma-separated to tab-separated. There are also commas between quotes, so I need an exception for those. Some googling and Stack Overflow got me this:
import re
f1 = open('query_result.csv', 'r')
f2 = open('query_result_tab_separated.csv', 'w')
for line in f1:
    line = re.sub(',(?=(([^\"]*\"){2})*[^\"]*$)(?![^\[]*\])', '\t', line)
    f2.write(line)
f1.close()
However, between the quotes I also find escaped quotes \". An example of a line:
"01-003412467812","Drontmann B.V.",1,6420,"Expert in \"Social, Life and Tech Sciences\""
My current code changes the comma after Social into a tab as well, but I don't want this. How can I make an exception for quotes and, within that exception, an exception for escaped quotes?

You can't do this reliably with a regular expression.
Python has a csv module that is intended for this:
import csv
with open('test.csv', newline='') as csvfile:
    data = csv.reader(csvfile, delimiter=',', quotechar='"', escapechar='\\')
    for row in data:
        print(' | '.join(row))

The csv module can handle this. You can set the escape character and specify how quotes within a field are escaped using escapechar and doublequote:
import csv
with open('file.csv', newline='') as infile, open('file_tabs.csv', 'w', newline='') as outfile:
    r = csv.reader(infile, doublequote=False, escapechar='\\')
    w = csv.writer(outfile, delimiter='\t', doublequote=False, escapechar='\\')
    w.writerows(r)
This will create a new tab delimited file that preserves the commas and escaped quotes within a field from the original file. Alternatively, the default settings will use "" (double quote) to escape the quotes:
w = csv.writer(outfile, delimiter='\t')
which would write data like this:
01-003412467812 Drontmann B.V. 1 6420 "Expert in ""Social, Life and Tech Sciences"""
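As a quick check of the reader settings, the sketch below parses the example line from the question in memory and writes it back tab-delimited with the writer's default quoting. The io.StringIO objects stand in for the real input and output files.

```python
import csv
import io

# The example line from the question, with backslash-escaped quotes.
line = r'"01-003412467812","Drontmann B.V.",1,6420,"Expert in \"Social, Life and Tech Sciences\""' + '\n'

infile = io.StringIO(line, newline='')
outfile = io.StringIO(newline='')

# Read with backslash as the escape character instead of doubled quotes.
rows = list(csv.reader(infile, doublequote=False, escapechar='\\'))

# Write tab-delimited, letting the writer double any quotes inside a field.
csv.writer(outfile, delimiter='\t').writerows(rows)

print(rows[0])
print(repr(outfile.getvalue()))
```

The comma after Social stays inside the fifth field, so the row still has five columns after conversion.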

erroneous line added while adding new columns python

I am trying to add extra columns to a CSV file after processing an input CSV file, but an extra blank line is added after each line of the output.
What is missing or wrong in my code below?
import csv
with open('test.csv', 'r') as infile:
    with open('test_out.csv', 'w') as outfile:
        reader = csv.reader(infile, delimiter=',')
        writer = csv.writer(outfile, delimiter=',')
        for row in reader:
            colad = row[5].rstrip('0123456789./ ')
            if colad == row[5]:
                col2ad = row[11]
            else:
                col2ad = row[5].split(' ')[-1]
            writer.writerow([row[0],colad,col2ad] +row[1:])
I am processing a huge CSV file, so I would like to get rid of those extra lines.

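I had the same problem on Windows (your OS as well, I presume?). The csv module combined with Windows text mode writes \r\r\n at the end of each line, which shows up as a double newline.
On Python 2, open the output file in binary mode:
with open('test_out.csv', 'wb') as outfile:
On Python 3, open it in text mode with newline='' instead:
with open('test_out.csv', 'w', newline='') as outfile: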
For other answers:
Python's CSV writer produces wrong line terminator
CSV in Python adding an extra carriage return
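The doubled carriage return can be reproduced on any platform. This sketch imitates Windows text mode by wrapping an in-memory buffer with newline='\r\n' and compares it with newline=''; written_bytes is a hypothetical helper used only for the demonstration.

```python
import csv
import io

def written_bytes(newline):
    """Write one CSV row through a text wrapper and return the raw bytes."""
    buffer = io.BytesIO()
    text = io.TextIOWrapper(buffer, encoding='utf-8', newline=newline)
    csv.writer(text).writerow(['a', 'b'])
    text.flush()
    return buffer.getvalue()

# The csv writer ends rows with \r\n; translating \n to \r\n again doubles the \r.
translated = written_bytes('\r\n')

# newline='' disables translation, so the row terminator is written as-is.
untranslated = written_bytes('')

print(translated, untranslated)
```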

CSV new-line character seen in unquoted field error

The following code worked until today, when I imported a file from a Windows machine and got this error:
new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
import csv
class CSV:
    def __init__(self, file=None):
        self.file = file
    def read_file(self):
        data = []
        file_read = csv.reader(self.file)
        for row in file_read:
            data.append(row)
        return data
    def get_row_count(self):
        return len(self.read_file())
    def get_column_count(self):
        new_data = self.read_file()
        return len(new_data[0])
    def get_data(self, rows=1):
        data = self.read_file()
        return data[:rows]
How can I fix this issue?
def upload_configurator(request, id=None):
    """
    A view that allows the user to configurator the uploaded CSV.
    """
    upload = Upload.objects.get(id=id)
    csvobject = CSV(upload.filepath)
    upload.num_records = csvobject.get_row_count()
    upload.num_columns = csvobject.get_column_count()
    upload.save()
    form = ConfiguratorForm()
    row_count = csvobject.get_row_count()
    colum_count = csvobject.get_column_count()
    first_row = csvobject.get_data(rows=1)
    first_two_rows = csvobject.get_data(rows=5)

It would help to see the CSV file itself, but this might work for you. Replace:
file_read = csv.reader(self.file)
with:
file_read = csv.reader(self.file, dialect=csv.excel_tab)
Or open the file in universal newline mode and pass it to csv.reader (the 'U' flag is deprecated in Python 3 and was removed in 3.11, where the default text mode already uses universal newlines):
reader = csv.reader(open(self.file, 'rU'), dialect=csv.excel_tab)
Or use splitlines(), like this:
def read_file(self):
    with open(self.file, 'r') as f:
        data = [row for row in csv.reader(f.read().splitlines())]
    return data
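A short sketch of the splitlines() approach on CR-only text, which is the line ending that typically triggers this error. The sample string is hypothetical. Note that splitlines() also splits on newlines inside quoted fields, so this is only suitable when fields contain no embedded line breaks.

```python
import csv

# Hypothetical CR-only content, as saved by old Macintosh-style exports.
cr_only = 'name,age\rAnn,31\rBob,42\r'

# str.splitlines() treats \r, \n and \r\n as line boundaries, and
# csv.reader accepts any iterable of strings.
rows = list(csv.reader(cr_only.splitlines()))

print(rows)
```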

I realize this is an old post, but I ran into the same problem and don't see the correct answer, so I will give it a try.
Python Error:
_csv.Error: new-line character seen in unquoted field
Caused by trying to read Macintosh (pre OS X formatted) CSV files. These are text files that use CR for end of line. If using MS Office make sure you select either plain CSV format or CSV (MS-DOS). Do not use CSV (Macintosh) as save-as type.
My preferred EOL version would be LF (Unix/Linux/Apple), but I don't think MS Office provides the option to save in this format.

For Mac OS X, save your CSV file in "Windows Comma Separated (.csv)" format.

If this happens to you on a Mac (as it did to me):
Save the file as CSV (MS-DOS Comma-Separated).
Run the following script:
# 'rU' is Python 2 style; on Python 3 use open(csv_filename, newline='') instead
with open(csv_filename, 'rU') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(', '.join(row))

Try running dos2unix on the files imported from Windows first.

I faced this error too. I had saved the .csv file on Mac OS X.
Saving it as "Windows Comma Separated Values (.csv)" resolved the issue.

This worked for me on OSX.
import csv
# allow a string to be opened as a file
from io import StringIO
# library to transliterate accented and other non-ASCII characters to ASCII
from unidecode import unidecode
# read the raw bytes of the Windows-formatted input file
with open(filename, 'rb') as fID:
    uncleansedBytes = fID.read()
# decode the file using the correct encoding scheme
# (probably this old Windows one)
uncleansedText = uncleansedBytes.decode('Windows-1252')
# normalize \r\n pairs and lone carriage returns to new-lines
cleansedText = uncleansedText.replace('\r\n', '\n').replace('\r', '\n')
# transliterate any remaining non-ASCII characters to ASCII
asciiText = unidecode(cleansedText)
# read each line of the csv file as a dict,
# using the first line as field names for each dict.
reader = csv.DictReader(StringIO(asciiText))
for line_entry in reader:
    # do something with your read data
    pass

I know this was answered quite some time ago, but the answers did not solve my problem. I am using DictReader and StringIO for my CSV reading due to some other complications. I was able to solve the problem more simply by replacing the line endings explicitly:
import urllib.request
with urllib.request.urlopen(q) as response:
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8')
    data = raw_data.decode(encoding)
    if '\r\n' not in data:
        # probably bare \r line endings (old Mac style); convert them to \r\n
        data = data.replace('\r', '\r\n')
Might not be reasonable for enormous CSV files, but worked well for my use case.
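The replacement step can be tried without a network call. In this sketch, normalize_newlines is a hypothetical wrapper around the same check, and the sample string stands in for the decoded response body before it is passed to DictReader through StringIO.

```python
import csv
import io

def normalize_newlines(data):
    """Turn bare \\r line endings into \\r\\n, leaving \\r\\n data untouched."""
    if '\r\n' not in data:
        data = data.replace('\r', '\r\n')
    return data

# Hypothetical decoded payload that uses bare \r line endings.
data = normalize_newlines('name,age\rAnn,31\rBob,42\r')

records = list(csv.DictReader(io.StringIO(data, newline='')))

print(repr(data))
print(records)
```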

An alternative, quick solution: I faced the same error. I reopened the "weird" CSV file in Gnumeric on my Lubuntu machine and exported it as a CSV file. This corrected the issue.
