\ufeff is appearing while reading csv using unicodecsv module - python

I have the following code:
import unicodecsv
CSV_PARAMS = dict(delimiter=",", quotechar='"', lineterminator='\n')
unireader = unicodecsv.reader(open('sample.csv', 'rb'), **CSV_PARAMS)
for line in unireader:
    print(line)
and it prints:
['\ufeff"003', 'word one"']
['003,word two']
['003,word three']
The CSV looks like this
"003,word one"
"003,word two"
"003,word three"
I am unable to figure out why the first row has \ufeff (which I believe is a file marker). Moreover, there is a " at the beginning of the first row.
The CSV file comes from a client, so I can't dictate how they save it. I'm looking to fix my code so that it can handle the encoding.
Note: I have already tried passing encoding='utf8' to CSV_PARAMS, and it didn't solve the problem.

encoding='utf-8-sig' will remove the UTF-8-encoded BOM (byte order mark) used as a UTF-8 signature in some files:
import unicodecsv
with open('sample.csv', 'rb') as f:
    r = unicodecsv.reader(f, encoding='utf-8-sig')
    for line in r:
        print(line)
Output:
['003,word one']
['003,word two']
['003,word three']
But why are you using the third-party unicodecsv with Python 3? The built-in csv module handles Unicode correctly:
import csv
# Note, newline='' is a documented requirement for the csv module
# for reading and writing CSV files.
with open('sample.csv', encoding='utf-8-sig', newline='') as f:
    r = csv.reader(f)
    for line in r:
        print(line)
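To see why 'utf-8' leaves the marker in place while 'utf-8-sig' removes it, and why the stray " shows up, here is a small in-memory sketch. It uses the built-in csv module on Python 3 and a hand-built byte string standing in for the first line of sample.csv, so no real file is involved.

```python
import codecs
import csv
import io

# Bytes of the first line as a BOM-prefixed UTF-8 file would store it.
raw = codecs.BOM_UTF8 + b'"003,word one"\n'

# Plain 'utf-8' keeps the BOM as the character U+FEFF.
with_bom = raw.decode("utf-8")
# 'utf-8-sig' strips a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

# With the BOM in front, the quote is no longer the first character of
# the field, so the parser treats it as literal text and splits on the comma.
broken_row = next(csv.reader(io.StringIO(with_bom)))
clean_row = next(csv.reader(io.StringIO(without_bom)))

print(broken_row)
print(clean_row)
```

The first row printed matches the broken output in the question; the second is the row as intended.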

Related

How to change .csv.gz encoding to utf-8

I want to use either R or Python to convert a .csv.gz file to UTF-8 encoding. How can I do this directly? I have not been able to find a comprehensive guide on how to do this.
My best attempt was to read .csv.gz file with csv.reader in python:
csvFile = gzip.open('pracodawcy_20190611_5.csv.gz', 'rt', newline='')
reader = csv.reader(csvFile)
But how do I then save it as a CSV in UTF-8?
This is easy to do. The code below reads the file into a list of lines and writes them back out encoded as UTF-8:
import gzip
# Read every line into a list (assuming the lines are separated as you said)
with gzip.open('input_file.csv.gz', 'rt', newline='\n') as f:
    content = f.readlines()
# Print the contents of the list
for v in content:
    print(v)
# Write the lines to a new .csv.gz file as UTF-8
with gzip.open('output.csv.gz', 'wb') as f:
    for v in content:
        f.write(v.encode('utf-8'))
If the file is too big to hold in memory, you can also process it lazily, either in chunks with read(size) or line by line with a for loop. There are a lot of examples here and on the web.
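The lazy variant could look like the sketch below. The helper name recode_gzip is hypothetical, and the source encoding has to be supplied by the caller because the question does not say what it is. If it is omitted, gzip.open falls back to the platform default, which may or may not match the file.

```python
import gzip
import shutil


def recode_gzip(src_path, dst_path, src_encoding):
    """Re-encode a gzip-compressed text file to UTF-8, streaming in chunks."""
    # newline='' on both sides leaves the original line endings untouched.
    with gzip.open(src_path, "rt", encoding=src_encoding, newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        # copyfileobj reads and writes fixed-size chunks, so the whole
        # file never has to fit in memory.
        shutil.copyfileobj(src, dst)


# Example call; 'cp1250' is only a guess at the source encoding:
# recode_gzip('input_file.csv.gz', 'output.csv.gz', 'cp1250')
```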

How do I write Japanese text from a json to a csv using python?

I want to write some Japanese characters to a csv, but I can't get it to display properly. The csv ends up with characters like 学校 and I'm not sure what I'm doing wrong.
import requests
import csv
r = requests.get('http://jisho.org/api/v1/search/words?keyword=%23common')
data = r.json()
with open('common_words.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['word', 'reading'])
    for entry in data["data"]:
        word = entry["japanese"][0]["word"]
        reading = entry["japanese"][0]["reading"]
        writer.writerow({'word': word, 'reading': reading})
The code is correct; the problem was opening the CSV file in Excel. Opening the file in a text editor displayed the characters properly.
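If the file does need to open cleanly in Excel, a common workaround is to write it with 'utf-8-sig', so that it starts with a UTF-8 BOM that Excel can use to detect the encoding. This is a sketch under that assumption: write_words is a hypothetical helper, the rows are hard-coded stand-ins for the API data, and the file is overwritten rather than appended to.

```python
import csv


def write_words(path, rows):
    """Write word/reading dicts to a CSV file that starts with a UTF-8 BOM."""
    with open(path, "w", newline="", encoding="utf-8-sig") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["word", "reading"])
        writer.writerows(rows)


# Stand-in data; the real rows would come from the API response.
sample_rows = [{"word": "学校", "reading": "がっこう"}]

# Example call:
# write_words('common_words.csv', sample_rows)
```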

Use csv file with multiple newline characters in Python 3

I'm trying to import a CSV file that uses # as the delimiter and \r\n as the line break. One column contains data that also has newlines in it, but those are \n.
I can read the file line by line without problems, but I'm stuck when using the csv library (Python 3).
The example below throws:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Is it possible to use the csv lib with multiple newline characters?
Thanks!
import csv
with open('../database.csv', newline='\r\n') as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
database.csv:
2202187#"645cc14115dbfcc4defb916280e8b3a1"#"cd2d3e434fb587db2e5c2134740b8192"#"{
Age = 22;
Salary = 242;
}
Please try this code. According to the Python 3.5.4 documentation, with newline=None, common line endings like '\r\n' are replaced by '\n'.
import csv
with open('../database.csv', newline=None) as csvfile:
    file = csv.reader(csvfile, delimiter='#', quotechar='"')
    for row in file:
        print(row[3])
I've replaced newline='\r\n' with newline=None.
You could also use the 'rU' mode, but it is deprecated (and was removed in Python 3.11).
...
with open('../database.csv', 'rU') as csvfile:
...
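Another option is the one the csv module documentation recommends: open the file with newline='' and let the csv module deal with the line endings itself. Here is a minimal sketch; read_rows is a hypothetical helper, and it assumes the multi-line field is closed by a quote (the sample in the question is cut off before that point).

```python
import csv


def read_rows(path):
    """Read '#'-delimited rows, keeping newlines inside quoted fields."""
    # newline='' passes line endings to the csv module untranslated, so a
    # bare '\n' inside a quoted field stays part of that field while
    # '\r\n' outside quotes still ends the row.
    with open(path, newline="") as csvfile:
        return list(csv.reader(csvfile, delimiter="#", quotechar='"'))


# Example call:
# for row in read_rows('../database.csv'):
#     print(row[3])
```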

Do I have to encode unicode variable before write to file?

I read the "Unicode Pain" article a few days ago, and I keep the "Unicode Sandwich" in mind.
Now I have to handle some Chinese text, and I've got a list:
chinese = [u'中文', u'你好']
Do I need to encode it before writing it to a file?
add_line_break = [word + u'\n' for word in chinese]
encoded_chinese = [word.encode('utf-8') for word in add_line_break]
with open('filename', 'wb') as f:
    f.writelines(encoded_chinese)
Somehow I found out that in Python 2 I can do this:
chinese = ['中文', '你好']
with open('filename', 'wb') as f:
    f.writelines(chinese)
No Unicode handling involved. :D
You don't have to do that; you can use io or codecs to open the file with an encoding.
import io
with io.open('file.txt', 'w', encoding='utf-8') as f:
    f.write(u'你好')
codecs.open has the same syntax.
In Python 3:
with open('file.txt', 'w', encoding='utf-8') as f:
    f.write('你好')
will do just fine.
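Put together, the "Unicode sandwich" from the question looks roughly like this in Python 3: decode once on input, work on str in the middle, and encode once on output. The function names are made up for illustration.

```python
def add_line_breaks(words):
    """Pure text processing: works on str only, no bytes involved."""
    return [word + "\n" for word in words]


def write_lines(path, words):
    # Encoding happens once, at the output boundary, inside open().
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(add_line_breaks(words))


def read_lines(path):
    # Decoding happens once, at the input boundary.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


chinese = ["中文", "你好"]

# Example calls:
# write_lines('filename', chinese)
# print(read_lines('filename'))
```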

python encoding issue - utf8 encoding not working

I have a csv file like this
Niklas Fagerstr�m http://www.vimeo.com/niklasf 5379549 5379549
Niklas Fagerstr�m http://fagerstrom.eu/en 5379549 5379549
I am reading these two fields:
Niklas Fagerstr�m
Niklas Fagerstr�m
All the ? characters in them should be encoded properly, but my script is not doing that.
import csv
import MySQLdb
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('finland_5000_rows.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        #row[0] = row[0].encode('')
        one = row[0]
        print one
Output:
Niklas Fagerstr�m
Niklas Fagerstr�m
But I want output like this:
Niklas Fagerström
Niklas Fagerström
What change should I make in the above code to get the expected result?
What I do when this happens is copy the text from the CSV into Notepad++, choose "Convert to UTF-8", and save it.
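The same conversion can be scripted. Below is a Python 3 sketch that assumes the file is in Latin-1 (ISO-8859-1). The question does not state the real encoding, so that default is a guess and may need changing (cp1252 is another common candidate). The helper name and the output file name are hypothetical.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite a text file as UTF-8, decoding it from src_encoding."""
    # newline='' keeps the original line endings as they are.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Example call; the output name is made up:
# convert_to_utf8('finland_5000_rows.csv', 'finland_5000_rows_utf8.csv')
```

After the conversion, the file can be read with encoding='utf-8' and the names should come out as expected, provided the guessed source encoding was right.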
