Python error "Ordinal not in range" with accents

I'm scraping a table from the Internet and saving as a CSV file. There are characters with French accents in the text, resulting in a unicode error on save:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128)
I'd like to find an elegant solution for saving accented characters that I can apply to any situation. I've sometimes used the following:
encode('ascii','ignore')
but it doesn't work this time, for reasons unknown. I'm also trying to replace the <sup> tags in a cell, so I'm converting it using str() first.
Here's the pertinent part of my code:
data = [
    str(td[0]).split('<sup')[0].split('>')[1].split('<')[0],
    td[1].getText()
]
output.append(data)

csv_file = csv.writer(open('savedFile.csv', 'w'), delimiter=',')
for line in output:
    csv_file.writerow(line)

If td[0] is u"a<sup>b</sup>c":
td[0].split('<sup')[0] is u"a".
td[0].partition('>')[2].split('<')[0] is u"b".
td[0][td[0].rindex('>') + 1:] is u"c".
If this kind of string indexing and matching is too simplistic, you might consider creating a regular expression and matching it against the text in the HTML tag:
import re
r = re.compile("[^<]*<sup>([^<]*)</sup>")
m = r.match("some<sup>text</sup>")
print(m.groups()[0])
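For a quick sanity check, the approaches above can be exercised directly (Python 3 shown; the u prefix is optional there):

```python
import re

s = "a<sup>b</sup>c"

# plain string indexing, as described above
before = s.split('<sup')[0]                 # everything before the tag
inside = s.partition('>')[2].split('<')[0]  # the text inside <sup>
after = s[s.rindex('>') + 1:]               # everything after the closing tag

# the regular-expression variant
r = re.compile(r"[^<]*<sup>([^<]*)</sup>")
m = r.match("some<sup>text</sup>")
```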

In Python 2, csv.reader() and csv.writer() require files opened in binary mode. You should also close the file at the end. Therefore, you should write it like:
f = open('output.csv', 'wb')
writer = csv.writer(f, delimiter=',')
for row in output:
    writer.writerow(row)
f.close()
Or you can use the with construct when using newer versions of Python:
with open('output.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',')
    for row in output:
        writer.writerow(row)
... and the file will be closed automatically.
Either way, csv.writer() in Python 2 expects rows composed of byte strings (not Unicode strings). If you have Unicode strings, convert them using .encode('utf-8'):
for row in output:
    encoded_row = [s.encode('utf-8') for s in row]
    writer.writerow(encoded_row)
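For reference, in Python 3 this whole dance goes away: open the file in text mode with an explicit encoding and write the Unicode strings directly (a minimal sketch; the rows and filename are made up):

```python
import csv

rows = [["café", "1"], ["naïve", "2"]]

# newline='' stops the csv module from doubling line endings on Windows;
# the file object handles all encoding, so no per-cell .encode() is needed
with open("savedFile.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter=",").writerows(rows)

with open("savedFile.csv", newline="", encoding="utf-8") as f:
    readback = list(csv.reader(f))
```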


List to csv without commas in Python

I have a following problem.
I would like to save a list into a csv (in the first column).
See example here:
import csv
mylist = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]
def list_na_csv(file, mylist):
    with open(file, "w", newline="") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerows(mylist)

list_na_csv("example.csv", mylist)
My output in excel looks like this:
Desired output is:
You can see that I have two issues: first, each character is followed by a comma; second, I don't know how to specify an encoding, for example UTF-8 or cp1250. How can I fix it, please?
I tried searching for a similar question, but nothing worked for me. Thank you.
You have two problems here.
writerows expects a list of rows, said differently a list of iterables. As a string is iterable, you end up writing each word on a different row, one character per field. If you want one row with one word per field, you should use writerow:
csv_writer.writerow(mylist)
by default, the csv module uses the comma as the delimiter (this is the most common one). But Excel is a pain in the ass with it: it expects the delimiter to be the locale's one, which is the semicolon (;) in many West European countries, including Germany. If you want to use your file easily with Excel, you should change the delimiter:
csv_writer = csv.writer(csv_file, delimiter=';')
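Both points are easy to see in memory with io.StringIO (my own sketch, not part of the original answer):

```python
import csv
import io

words = ["Hallo", "Glas"]

# writerow: one row, one word per field
buf_row = io.StringIO()
csv.writer(buf_row, delimiter=";").writerow(words)

# writerows: each string is itself iterable, so it becomes
# a row of one-character fields
buf_rows = io.StringIO()
csv.writer(buf_rows, delimiter=";").writerows(words)
```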
After your edit, you want all the data in the first column, one element per row. This is kind of a decayed csv file, because it only has one value per record and no separator. If the fields can never contain a semicolon nor a new line, you could just write a plain text file:
...
with open(file, "w", newline="") as csv_file:
    for row in mylist:
        print(row, file=csv_file)
...
If you want to be safe and prevent future problems if you later need to handle more corner-case values, you could still use the csv module and write one element per row by wrapping it in another iterable:
...
with open(file, "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=';')
    csv_writer.writerows([elt] for elt in mylist)
...
l = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]
with open("file.csv", "w") as msg:
    msg.write(",".join(l))
For less trivial examples:
l = ["Hallo", "der, Pixer", "Glas", "Telefon", "Der Kühlschrank, brach kaputt."]
with open("file.csv", "w") as msg:
    msg.write(",".join(['"' + x + '"' for x in l]))
Here you basically wrap every list element in quotes, to protect against the intra-field comma problem.
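Note that the csv module does this quoting for you whenever a field contains the delimiter, which is the main argument for using it over a manual join (sketch assuming Python 3):

```python
import csv
import io

l = ["Hallo", "der, Pixer", "Glas"]

buf = io.StringIO()
csv.writer(buf).writerow(l)  # only the field containing a comma gets quoted
line = buf.getvalue()
```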
Try this, it will work:
import csv
mylist = ["Hallo", "der Pixer", "Glas", "Telefon", "Der Kühlschrank brach kaputt."]
def list_na_csv(file, mylist):
    with open(file, "w") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(mylist)

list_na_csv("example.csv", mylist)
If you want to write the entire list of strings to a single row, use csv_writer.writerow(mylist) as mentioned in the comments.
If you want to write each string to a new row, as I believe your reference to writing them in the first column implies, you'll have to format your data as the class expects: "A row must be an iterable of strings or numbers for Writer objects". On this data that would look something like:
csv_writer.writerows((entry,) for entry in mylist)
There, I'm using a generator expression to wrap each word in a tuple, thus making it an iterable of strings. Without something like that, your strings are themselves iterables and lead to it delimiting between each character as you've seen.
Using csv to write a single entry per line is almost pointless, but it does have the advantage that it will escape your delimiter if it appears in the data.
To specify an encoding, the docs say:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
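For example, to write the German sample data in cp1250 rather than the system default (the filename is arbitrary):

```python
import csv

# pass the target encoding to open(); the csv module never sees raw bytes
with open("example.csv", "w", newline="", encoding="cp1250") as f:
    csv.writer(f, delimiter=";").writerow(["Der Kühlschrank brach kaputt."])

with open("example.csv", newline="", encoding="cp1250") as f:
    row = next(csv.reader(f, delimiter=";"))
```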
try split("\n")
example:
amazinglist = ["hello", "hi"]
counter = 0
for x in amazinglist:
    ok = amazinglist[counter].split("\n")
    writer.writerow(ok)
    counter += 1

python write umlauts into file

i have the following output, which i want to write into a file:
l = ["Bücher", "Hefte", "Mappen"]
i do it like:
f = codecs.open("testfile.txt", "a", stdout_encoding)
f.write(l)
f.close()
in my Textfile i want to see: ["Bücher", "Hefte, "Mappen"] instead of B\xc3\xbccher
Is there any way to do so without looping over the list and decode each item ? Like to give the write() function any parameter?
Many thanks
First, make sure you use unicode strings: add the "u" prefix to strings:
l = [u"Bücher", u"Hefte", u"Mappen"]
Then you can write or append to a file:
I recommend you to use the io module which is Python 2/3 compatible.
with io.open("testfile.txt", mode="a", encoding="UTF8") as fd:
    for line in l:
        fd.write(line + "\n")
To read your text file in one piece:
with io.open("testfile.txt", mode="r", encoding="UTF8") as fd:
    content = fd.read()
The resulting content is a Unicode string.
If you encode this string using the UTF8 encoding, you'll get a byte string like this:
b"B\xc3\xbccher"
Edit: using writelines.
The method writelines() writes a sequence of strings to the file. The sequence can be any iterable object producing strings, typically a list of strings. There is no return value.
# add new lines
lines = [line + "\n" for line in l]
with io.open("testfile.txt", mode="a", encoding="UTF8") as fd:
    fd.writelines(lines)
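In Python 3 the io module detour is unnecessary, since the built-in open() accepts encoding directly; the same example becomes (using "w" here so the file content is predictable):

```python
l = ["Bücher", "Hefte", "Mappen"]

# write each word on its own line, UTF-8 encoded
with open("testfile.txt", "w", encoding="UTF8") as fd:
    fd.writelines(line + "\n" for line in l)

with open("testfile.txt", encoding="UTF8") as fd:
    content = fd.read()
```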

Python issues on character encoding

I'm working on a program that needs to take two files, merge them, and write the union as a new file. The problem is that the output file contains chars like \xf0, or, if I change some of the encodings, something like \u0028. The input files are encoded in UTF-8. How can I print chars like "è", "ò" and "-" to the output file?
I have done this code:
import codecs
import pandas as pd
import numpy as np

goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)
IDs = tFile['ID']
tweets = tFile['Tweet']
dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]
with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue
import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()
and this is example of the outputs
desired: Beyoncè
obtained: Beyonc\xe9
You are producing Python string literals, here:
import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
Pretty-printing is useful for producing debugging output; objects are passed through repr() to make non-printable and non-ASCII characters easily distinguishable and reproducible:
>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'
The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.
Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
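In Python 3 the same distinction shows up between ascii() (the analogue of the Python 2 repr shown above) and the plain string; writing the string itself avoids the escapes:

```python
value = "Beyonc\xe9"    # the é character, U+00E9

escaped = ascii(value)  # debugging form with escapes, like pprint/repr output
plain = value           # the actual text you want in the file
```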
Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.
Personally, I'd not even bother using pandas here; you appear to want to write CSV data, so I've simplified your code to use the csv module instead, and I'm not actually bothering to decode the UTF-8 here (this is safe for this case as both input and output is entirely in UTF-8):
import csv

tweet_map = {}
with open(tweets, "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweet_map[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        writer.writerow([tweet_map[index]] + columns[1:4] + [columns[5]])

Note that you really want to avoid using dict as a variable name; it masks the built-in type. I used tweet_map instead.

how to remove non utf 8 code and save as a csv file python

I have some Amazon review data that I converted from text format to CSV format successfully. Now the problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save to another CSV file?
Thank you!
EDIT1:
Here is the code I used to convert the text to CSV:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME, encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME, "w")
outfile.write(",".join(header) + "\n")

currentLine = []
for line in f:
    line = line.strip()
    # need to remove the commas so that the review text won't spill into many columns
    line = line.replace(',', '')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":", 1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I solved it by modifying the output format in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not utf-8 encoded, it is probably not a good idea to try to read it as utf-8...
You have basically 2 ways to deal with decode errors:
use a charset that will accept any byte such as iso-8859-15 also known as latin9
if output should be utf-8 but contains errors, use errors=ignore -> silently removes non utf-8 characters, or errors=replace -> replaces non utf-8 characters with a replacement marker (usually ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
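The difference between the two strategies is easy to see on a single bad byte (a minimal sketch; the byte string is made up):

```python
raw = b"caf\xf8"  # 0xF8 is not a valid UTF-8 start byte

latin = raw.decode("iso-8859-15")                 # a.k.a. latin9: never fails, every byte maps
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes the U+FFFD marker
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is silently dropped
```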
If you are using python3, it provides inbuilt support for unicode content -
f = open('file.csv', encoding="utf-8")
If you still want to remove all unicode data from it, you can read it as a normal text file and remove the unicode content
import re

def remove_unicode(string_data):
    """ (str|bytes) -> str
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        string_data = string_data.decode('ascii', 'ignore')
    else:
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8", errors="ignore") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()
Now you can read it without any unicode data issues.

Ascii Code error while converting from xlsx to csv

I have referred to some posts related to Unicode errors but didn't find a solution to my problem. I am converting xlsx to csv from a workbook of 6 sheets.
I use the following code:
def csv_from_excel(file_loc):
    # file_access check
    print os.access(file_loc, os.R_OK)
    wb = xlrd.open_workbook(file_loc)
    print wb.nsheets
    sheet_names = wb.sheet_names()
    print sheet_names
    counter = 0
    while counter < wb.nsheets:
        try:
            sh = wb.sheet_by_name(sheet_names[counter])
            file_name = str(sheet_names[counter]) + '.csv'
            print file_name
            fh = open(file_name, 'wb')
            wr = csv.writer(fh, quoting=csv.QUOTE_ALL)
            for rownum in xrange(sh.nrows):
                wr.writerow(sh.row_values(rownum))
        except Exception as e:
            print str(e)
        finally:
            fh.close()
        counter += 1
I get an error on the 4th sheet:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
but position 0 is blank, and the conversion got as far as the 33rd row. I am unable to figure this out. CSV seemed like an easy way to read the content into my data structure.
You'll need to manually encode Unicode values to bytes; for CSV usually UTF-8 is fine:
for rownum in xrange(sh.nrows):
    wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])
Here I use unicode() for column data that is not text.
The character you encountered is the U+2018 LEFT SINGLE QUOTATION MARK, which is just a fancy form of the ' single quote. Office software (spreadsheets, word processors, etc.) often auto-replace single and double quotes with the 'fancy' versions. You could also just replace those with ASCII equivalents. You can do that with the Unidecode package:
from unidecode import unidecode

for rownum in xrange(sh.nrows):
    wr.writerow([unidecode(unicode(c)) for c in sh.row_values(rownum)])
Use this when non-ASCII codepoints are only used for quotes and dashes and other punctuation.
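If adding a dependency is overkill, a small str.translate table covers the usual quote and dash substitutions (my own sketch, not from the answer above):

```python
# map common 'smart' punctuation to plain ASCII equivalents
FANCY_TO_ASCII = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})

cleaned = "\u2018quoted\u2019 text".translate(FANCY_TO_ASCII)
```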
