I have a CSV file like this:
Niklas Fagerstr�m http://www.vimeo.com/niklasf 5379549 5379549
Niklas Fagerstr�m http://fagerstrom.eu/en 5379549 5379549
I am reading these two fields:
Niklas Fagerstr�m
Niklas Fagerstr�m
All the � characters should be decoded correctly, but my script is not handling the encoding.
import csv
import MySQLdb
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('finland_5000_rows.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        # row[0] = row[0].encode('')
        one = row[0]
        print one
Output:
Niklas Fagerstr�m
Niklas Fagerstr�m
But I want output like this:
Niklas Fagerström
Niklas Fagerström
What change should I make in the above code to get the expected result?
What I do when this happens is copy the text from the CSV into Notepad++, press Convert to UTF-8, and save it.
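If you'd rather do the conversion in code, the same idea works in Python (shown here in Python 3 syntax): decode the bytes with the encoding the file was actually saved in, then work with proper Unicode text. Windows-1252 is only an assumption for illustration; check what your source really uses.

```python
import csv
import io

# Assumption for illustration: the file was saved as Windows-1252 (cp1252),
# which is a common reason characters like "ö" turn into replacement characters.
raw_bytes = 'Niklas Fagerström,http://fagerstrom.eu/en\n'.encode('cp1252')

# Decode with the real source encoding; the result is proper Unicode text.
text = raw_bytes.decode('cp1252')
for row in csv.reader(io.StringIO(text)):
    print(row[0])  # Niklas Fagerström
```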
Related
I have the following code:
import unicodecsv
CSV_PARAMS = dict(delimiter=",", quotechar='"', lineterminator='\n')
unireader = unicodecsv.reader(open('sample.csv', 'rb'), **CSV_PARAMS)
for line in unireader:
    print(line)
and it prints
['\ufeff"003', 'word one"']
['003,word two']
['003,word three']
The CSV looks like this
"003,word one"
"003,word two"
"003,word three"
I am unable to figure out why the first row has \ufeff (which I believe is a byte order mark). Moreover, there is a stray " at the beginning of the first row.
The CSV file is coming from a client, so I can't dictate how they save it. I'm looking to fix my code so that it can handle the encoding.
Note: I have already tried passing encoding='utf8' in CSV_PARAMS and it didn't solve the problem.
encoding='utf-8-sig' will remove the UTF-8-encoded BOM (byte order mark) used as a UTF-8 signature in some files:
import unicodecsv
with open('sample.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='utf-8-sig')
    for line in r:
        print(line)
Output:
['003,word one']
['003,word two']
['003,word three']
But why are you using the third-party unicodecsv with Python 3? The built-in csv module handles Unicode correctly:
import csv
# Note, newline='' is a documented requirement for the csv module
# for reading and writing CSV files.
with open('sample.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.reader(f)
    for line in r:
        print(line)
I'm trying to import a CSV file which has # as the delimiter and \r\n as the line break. Inside one column there is data which also contains newlines, but as \n.
I'm able to read the file line by line without problems, but with the csv lib (Python 3) I've got stuck.
The example below throws:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Is it possible to use the csv lib with multiple newline characters?
Thanks!
import csv
with open('../database.csv', newline='\r\n') as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
database.csv:
2202187#"645cc14115dbfcc4defb916280e8b3a1"#"cd2d3e434fb587db2e5c2134740b8192"#"{
Age = 22;
Salary = 242;
}
Please try this code. According to the Python 3.5.4 documentation, with newline=None, common line endings like '\r\n' are translated to '\n'.
import csv
with open('../database.csv', newline=None) as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
I've replaced newline='\r\n' with newline=None.
You could also use the 'rU' mode modifier, but it is deprecated:
...
with open('../database.csv', 'rU') as csvfile:
...
I want to write Cyrillic symbols to a CSV file, but I get a Unicode encode error. English symbols work perfectly. I'm using Python 3.6.2.
UnicodeEncodeError: 'ascii' codec can't encode characters in position
1-6: ordinal not in range(128)
import csv
with open("test.csv", 'w') as csvfile:
    csvfile = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    hello = 'привет, мир!'
    csvfile.writerow([hello])
Declare the encoding of the file when you open it. newline='' is also required per the csv documentation.
import csv
with open('test.csv', 'w', encoding='utf8', newline='') as csvfile:
    csvfile = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    hello = 'привет, мир!'
    csvfile.writerow([hello])
You just need to encode the hello string before you write it to the CSV file. Otherwise Python expects only ASCII characters; for non-ASCII characters you can use UTF-8 encoding:
# -*- coding: utf-8 -*-
import csv
with open("test.csv", 'w') as csvfile:
    csvfile = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    hello = u'привет, мир!'  # Better way of declaring a unicode string literal
    csvfile.writerow([hello.encode("utf-8")])
Add this code at the top of your file:
# encoding=utf8
--------------------------------------
import sys
reload(sys)
sys.setdefaultencoding('utf8')
For Python 2 users, use this instead of the normal open function:
import codecs
codecs.open(out_path, encoding='utf-8', mode='w')
This is the equivalent of the following in Python 3:
open(out_path, 'w', encoding='utf8')
I'm working on a simple Python script which checks whether any lines of an input file match any of the patterns from a CSV file.
The following code doesn't show anything:
# -*- coding: utf8 -*-
import re
import csv
csvfile = open('errors.csv', 'r')
errorsreader = csv.reader(csvfile, delimiter="\t")
log = open('gcc.log', 'r')
for line in log:
    for row in errorsreader:
        matchObj = re.match(row[0], line)
        if matchObj:
            print(line)
While the same code, with the following pattern in place of row[0], works:
.* error: expected ‘;’ before ‘}’ token .*
I have been looking for workarounds but none of them seem to work. Any guesses?
The problem is that csv.reader is a one-shot iterator: on the first line from log you consume all rows from errorsreader, and every later iteration of the inner loop reads nothing. You can change
errorsreader = csv.reader(csvfile, delimiter="\t")
to
errorsreader = list(csv.reader(csvfile, delimiter="\t"))
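Putting the fix together: the patterns are materialized into a list once, so the inner loop can restart for every log line. A sketch using in-memory stand-ins for errors.csv and gcc.log (the file contents here are assumed for illustration):

```python
import csv
import io
import re

# Stand-ins for the asker's errors.csv and gcc.log files
errors_csv = io.StringIO(".* error: .*\n")
gcc_log = io.StringIO("foo.c:1: error: expected token\nall good here\n")

# list(...) materializes the rows, so they can be iterated more than once
errorsreader = list(csv.reader(errors_csv, delimiter="\t"))

for line in gcc_log:
    for row in errorsreader:
        matchObj = re.match(row[0], line)
        if matchObj:
            print(line)  # only the matching error line is printed
```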
The following code worked until today, when I imported a file from a Windows machine and got this error:
new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
import csv
class CSV:
    def __init__(self, file=None):
        self.file = file

    def read_file(self):
        data = []
        file_read = csv.reader(self.file)
        for row in file_read:
            data.append(row)
        return data

    def get_row_count(self):
        return len(self.read_file())

    def get_column_count(self):
        new_data = self.read_file()
        return len(new_data[0])

    def get_data(self, rows=1):
        data = self.read_file()
        return data[:rows]
How can I fix this issue?
def upload_configurator(request, id=None):
    """
    A view that allows the user to configure the uploaded CSV.
    """
    upload = Upload.objects.get(id=id)
    csvobject = CSV(upload.filepath)
    upload.num_records = csvobject.get_row_count()
    upload.num_columns = csvobject.get_column_count()
    upload.save()

    form = ConfiguratorForm()

    row_count = csvobject.get_row_count()
    colum_count = csvobject.get_column_count()
    first_row = csvobject.get_data(rows=1)
    first_two_rows = csvobject.get_data(rows=5)
It would help to see the CSV file itself, but this might work for you. Give it a try: replace
file_read = csv.reader(self.file)
with:
file_read = csv.reader(self.file, dialect=csv.excel_tab)
Or, open the file in universal-newline mode and pass it to csv.reader, like:
reader = csv.reader(open(self.file, 'rU'), dialect=csv.excel_tab)
Or, use splitlines(), like this:
def read_file(self):
    with open(self.file, 'r') as f:
        data = [row for row in csv.reader(f.read().splitlines())]
    return data
I realize this is an old post, but I ran into the same problem and don't see the correct answer, so I will give it a try.
Python Error:
_csv.Error: new-line character seen in unquoted field
This is caused by trying to read Macintosh (pre-OS X) formatted CSV files. These are text files that use CR as the end-of-line character. If using MS Office, make sure you select either plain CSV format or CSV (MS-DOS); do not use CSV (Macintosh) as the save-as type.
My preferred EOL version would be LF (Unix/Linux/Apple), but I don't think MS Office provides the option to save in this format.
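In Python 3 you can also skip re-saving the file entirely: open it with newline='' and the csv module recognizes bare CR line endings by itself. A minimal sketch (the file name and contents are made up for illustration):

```python
import csv

# Assumption: an old-Mac-style file using a bare CR ('\r') line ending.
with open('mac_file.csv', 'w', newline='') as f:
    f.write('a,b\rc,d\r')

# newline='' passes the line endings through untranslated,
# letting the csv reader split records on '\r' as well.
with open('mac_file.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)  # ['a', 'b'] then ['c', 'd']
```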
For Mac OS X, save your CSV file in "Windows Comma Separated (.csv)" format.
If this happens to you on a Mac (as it did to me):
Save the file as CSV (MS-DOS Comma-Separated)
Run the following script:
with open(csv_filename, 'rU') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print ', '.join(row)
Try running dos2unix on your Windows-imported files first.
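If dos2unix isn't available, a byte-level equivalent is easy to sketch in Python. The file name and contents below are made up for illustration; the script rewrites the file in place:

```python
# Create a sample CRLF-terminated file to normalize (assumed name and content)
path = 'database.csv'
with open(path, 'wb') as f:
    f.write(b'a,b\r\nc,d\r\n')

# dos2unix equivalent: replace CRLF with bare LF at the byte level
with open(path, 'rb') as f:
    data = f.read()
with open(path, 'wb') as f:
    f.write(data.replace(b'\r\n', b'\n'))
```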
This is an error that I faced after saving a .csv file on Mac OS X.
While saving, save it as "Windows Comma Separated Values (.csv)", which resolved the issue.
This worked for me on OSX.
# allow variables to be opened as files
from io import StringIO
# library to map other strange (accented) characters to ASCII equivalents
from unidecode import unidecode
import csv

# cleanse an input file with Windows formatting into a plain string
with open(filename, 'rb') as fID:
    uncleansedBytes = fID.read()

# decode the file using the correct encoding scheme
# (probably this old Windows one)
uncleansedText = uncleansedBytes.decode('Windows-1252')

# replace carriage returns with newlines
cleansedText = uncleansedText.replace('\r', '\n')

# map any other non-ASCII characters to their closest ASCII equivalents
asciiText = unidecode(cleansedText)

# read each line of the csv file and store it as an array of dicts,
# using the first line as the field names for each dict
reader = csv.DictReader(StringIO(asciiText))
for line_entry in reader:
    # do something with your read data
    pass
I know this has been answered for quite some time, but the existing answers did not solve my problem. I am using DictReader and StringIO for my CSV reading due to some other complications. I was able to solve the problem more simply by replacing the delimiters explicitly:
import urllib.request

with urllib.request.urlopen(q) as response:
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8')
    data = raw_data.decode(encoding)
    if '\r\n' not in data:
        # probably a Windows-delimited thing... try to update it
        data = data.replace('\r', '\r\n')
Might not be reasonable for enormous CSV files, but worked well for my use case.
Alternative and fast solution: I faced the same error. I reopened the "weird" CSV file in Gnumeric on my Lubuntu machine and exported it as a CSV file. This corrected the issue.