It seems that I cannot encode the character '\u015f' (letter s with cedilla). Please could someone help?
from selenium import webdriver
import time
with open('Violators_UNGC1.csv', 'w',encoding='utf-8'.replace(u"\u015f", "ş")) as file:
    file.write("Participants; Sectors; Countries; Expelled \n")
driver=webdriver.Chrome(executable_path='C:\webdrivers\chromedriver.exe')
driver.get('https://www.unglobalcompact.org/participation/report/cop/create-and-submit/expelled?page=1&per_page=250')
driver.maximize_window()
time.sleep(2)
for k in range(150):
    Participants = driver.find_elements("xpath",'//td[@class="participant"]/a')
    Sectors = driver.find_elements("xpath",'//td[@class="sector"]')
    Countries = driver.find_elements("xpath",'//td[@class="country"]')
    Expelled = driver.find_elements("xpath",'//td[@class="year"]')
    time.sleep(1)
    with open('Violators_UNGC1.csv', 'a') as file:
        for i in range(len(Participants)):
            file.write(Participants[i].text + ";" + Sectors[i].text + ";" + Countries[i].text + ";" + Expelled[i].text + "\n")
driver.close()
I get the following error message:
UnicodeEncodeError
Traceback (most recent call last) Cell In [15], line 28
26 with open('Violators_UNGC1.csv', 'a') as file:
27 for i in range(len(Participants)):
---> 28 file.write(Participants[i].text + ";" + Sectors[i].text + ";" + Countries[i].text + ";" + Expelled[i].text + "\n")
30 driver.close() File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 32: character maps to <undefined>
Thank you all !
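As mentioned in the comments, the default encoding of open() depends on the platform and locale (here cp1252, as the traceback shows), so it should be declared explicitly on every call. In your code the second call, open('Violators_UNGC1.csv', 'a'), has no encoding argument, and that is the one that fails. UTF-8 can encode every Unicode character. I also suggest opening the file once instead of re-opening it on every pass through the loop, and using the csv module to write CSV files: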
import csv
# newline='' lets the csv module control the line endings itself
with open('Violators_UNGC1.csv', 'w', encoding='utf-8', newline='') as file:
    w = csv.writer(file, delimiter=';')
    w.writerow(['Participants','Sectors','Countries','Expelled'])
    # Fake data for demonstration
    Participants = 'oneş','twoş','threeş'
    Sectors = 'sec1','sec2','sec3'
    Countries = 'USA','Germany','France'
    Expelled = 'A','B','C'
    # zip returns all the first items in each group, then the 2nd, etc.
    for row in zip(Participants, Sectors, Countries, Expelled):
        w.writerow(row)
Output file:
Participants;Sectors;Countries;Expelled
oneş;sec1;USA;A
twoş;sec2;Germany;B
threeş;sec3;France;C
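To see why the explicit encoding matters, here is a minimal sketch that checks whether a string can be encoded at all by a given codec. The helper name can_encode is made up for illustration; the codecs compared are cp1252 (named in the traceback) and UTF-8.

```python
import locale

# The encoding open() falls back to when none is given; it varies by machine.
print("default text encoding:", locale.getpreferredencoding(False))

text = "one\u015f"  # 'one' followed by s-with-cedilla, like the scraped values


def can_encode(s, encoding):
    """Hypothetical helper: True if every character of s exists in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp1252_ok = can_encode(text, "cp1252")  # the codec from the traceback
utf8_ok = can_encode(text, "utf-8")     # UTF-8 covers all of Unicode
print("cp1252:", cp1252_ok, "utf-8:", utf8_ok)
```

On a machine whose default is cp1252, any open() call without encoding= inherits the first result, which is what happened in the append-mode open() in the question.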
Related
Below is the code that is supposed to convert a bz2 dump to text format. However, I am getting a Unicode error. Since I am using utf-8, I wonder what the error could be.
from __future__ import print_function
import logging
import os.path
import six
import sys
from gensim.corpora import WikiCorpus
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
    inp = "trwiki-latest-pages-articles.xml.bz2"
    outp = "wiki_text_dump.txt"
    space = " "
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
            # ###another method###
            # output.write(
            #     space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
            #output.write(text)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Error:
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-42-9404745af31b> in <module>()
32 for text in wiki.get_texts():
33 if six.PY3:
---> 34 output.write(' '.join(text).encode().decode('unicode_escape') + '\n')
35 # ###another method###
36 # output.write(
c:\users\m\appdata\local\programs\python\python37\lib\encodings\cp1254.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\x9f' in position 47: character maps to <undefined>
I've also tried replacing "unicode_escape" with "utf-8", and then I get this error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 87-92: character maps to <undefined>
As the traceback indicates, the error is raised while encoding, not while decoding: output.write() has to encode the string with the file's codec, and because the file was opened without an encoding argument, that codec is the Windows default, cp1254. Changing the codec passed to .decode therefore cannot fix the problem.
Since the code is running in Python 3.x (six.PY3 is true; is 2.x compatibility still a concern in new code written today?), and since ' '.join(text) worked, we conclude that text is either a string or a list of strings (not bytes or a list of bytes), and ' '.join(text) is a string. Indeed, the documentation tells us that WikiCorpus already provides strings.
The string being written contains a character that cp1254 (a Windows code page intended for Turkish text) cannot encode. It is not clear what you hope to accomplish by encoding and then decoding again; with 'unicode_escape' the round trip actually corrupts non-ASCII text, which is where the '\x9f' in the error message comes from. Just write the string, and open the output file with an explicit encoding such as encoding='utf-8'. Whether text needs joining depends on whether it is a single string or a list of tokens; you should verify this for yourself by debugging.
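The effect of the encode/decode round trip can be reproduced in isolation. This is a small sketch, assuming the text contains the Turkish letter U+015F; it shows that cp1254 can encode the original letter but not what .encode().decode('unicode_escape') turns it into.

```python
original = "\u015f"  # Turkish s-with-cedilla

# .encode() defaults to UTF-8, which gives two bytes for this letter ...
utf8_bytes = original.encode()
# ... and unicode_escape reads each non-ASCII byte as one Latin-1 character.
mangled = utf8_bytes.decode("unicode_escape")
print(ascii(mangled))

# cp1254 has a slot for the original letter, but none for U+009F.
original_in_cp1254 = original.encode("cp1254")
try:
    mangled.encode("cp1254")
    mangled_fails = False
except UnicodeEncodeError:
    mangled_fails = True
print(original_in_cp1254, mangled_fails)
```

So the round trip is not a harmless no-op here: it replaces one encodable character with two, one of which the default Windows codec rejects.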
When I use CountVectorizer in sklearn, it needs the input file to be Unicode-encoded, but my data files are ANSI-encoded.
I tried changing the encoding to Unicode with Notepad++, but then readlines cannot read all the lines; it only reads the last one. After that, I tried to read the lines from each data file and write them to a new file as Unicode, but that failed.
def merge_file():
    root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
    resname='resule_final.txt'
    if os.path.exists(resname):
        os.remove(resname)
    result = codecs.open(resname,'w','utf-8')
    num = 1
    for back_name in os.listdir(r'd:\\workspace\\minibatchk-means\\data\\20_newsgroups'):
        current_dir = root_dir + str(back_name)
        for filename in os.listdir(current_dir):
            print num ,":" ,str(filename)
            num = num+1
            path=current_dir + "\\" +str(filename)
            source=open(path,'r')
            line = source.readline()
            line = line.strip('\n')
            line = line.strip('\r')
            while line !="":
                line = unicode(line,"gbk")
                line = line.replace('\n',' ')
                line = line.replace('\r',' ')
                result.write(line + ' ')
                line = source.readline()
            else:
                print 'End file :'+ str(filename)
                result.write('\n')
                source.close()
    print 'End All.'
    result.close()
The error message is: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 0-1: illegal multibyte sequence
I found a way.
First, use chardet to detect the encoding of each file's content.
Second, use codecs to read or write the file in the specific encoding.
Here is the code.
import chardet
import codecs
import os
root_dir="d:\\workspace\\minibatchk-means\\data\\20_newsgroups\\"
num = 1
failed = []
for back_name in os.listdir("d:\\workspace\\minibatchk-means\\data\\20_newsgroups"):
    current_dir = root_dir + str(back_name)
    for filename in os.listdir(current_dir):
        print num,":",str(filename)
        num=num+1
        path=current_dir+"\\"+str(filename)
        content = open(path,'r').read()
        source_encoding=chardet.detect(content)['encoding']
        if source_encoding == None:
            print '??' , filename
            failed.append(filename)
        elif source_encoding != 'utf-8':
            content=content.decode(source_encoding,'ignore')
            codecs.open(path,'w',encoding='utf-8').write(content)
print failed
Thanks for all your help.
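The code above is Python 2 and relies on the third-party chardet package. As a rough Python 3 sketch of the same idea using only the standard library, one can try a list of candidate encodings in order. This is a swapped-in technique, not what chardet does: the function name decode_with_fallback and the candidate list are assumptions for illustration, and the order of candidates affects the result.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gbk", "cp1252")):
    """Hypothetical helper: try each encoding in order.

    Returns (text, encoding_used). The input must be bytes, so files
    should be opened in 'rb' mode before calling this.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails
    return raw.decode("latin-1"), "latin-1"


gbk_bytes = "\u4e2d\u6587".encode("gbk")  # bytes as they would sit in a GBK file
text, used = decode_with_fallback(gbk_bytes)
utf8_text, utf8_used = decode_with_fallback("plain ascii".encode("utf-8"))
print(used, utf8_used)
```

A successful decode does not prove the guess was right, since many byte sequences are valid in several encodings, so spot-checking the converted files is still worthwhile.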
#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-ZZZ]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')
            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)
with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
        out_file.write("{}\n".format(line))
Gives the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 0: character maps to undefined
I thought this line would have solved that:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
You need to specify the encoding when opening the output file, same as you did with the input file:
with open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding="utf-8") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))
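As a quick check of why the bytes(...).decode(...) line did not help, the sketch below shows that it returns an equal string, and that the write succeeds once the file itself is opened as UTF-8. The temporary path is a stand-in for the real output file.

```python
import os
import tempfile

line = "\u2192 next"  # starts with the arrow character from the error message

# Encoding to UTF-8 and decoding straight back returns an equal string,
# so this step cannot change what the file object later has to encode.
round_tripped = bytes(line, "utf-8").decode("utf-8", "ignore")

# Hypothetical temporary path, used only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "bnc_demo.txt")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("{}\n".format(line))

with open(path, "r", encoding="utf-8") as in_file:
    read_back = in_file.read()
print(round_tripped == line, read_back == line + "\n")
```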
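If your script has multiple reads and writes and you want one encoding (say UTF-8) for all of them, Python 2 lets you change the default encoding. This trick is Python 2 only: in Python 3, reload is not a builtin and sys.setdefaultencoding does not exist, so it does not apply to the python3 script in the question.
# Python 2 only
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
Even in Python 2, use it only when there are many reads/writes, and do it at the beginning of the script. In Python 3.7 and later, the closest equivalent is UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable).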
Changing default encoding of Python?
I am using the code below to parse the XML format wikipedia training data into a pure text file:
from __future__ import print_function
import logging
import os.path
import six
import sys
from gensim.corpora import WikiCorpus
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
            # ###another method###
            # output.write(
            #     space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
When I run this code, it gives me the following error message:
File "wiki_parser.py", line 42, in <module>
output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 1537: illegal multibyte sequence
When I searched this error online, most answers told me to add 'utf-8' as the encoding which is already there. What could be the possible issue with the code?
Minimal example
The problem is that your file is opened with an implicit encoding (inferred from your system). I can recreate your issue as follows:
a = '\u1f00'
with open('f.txt', 'w', encoding='cp949') as f:
    f.write(a)
Error message: UnicodeEncodeError: 'cp949' codec can't encode character '\u1f00' in position 0: illegal multibyte sequence
You have two options. Either open the file using an encoding which can encode the character you are using:
with open('f.txt', 'w', encoding='utf-8') as f:
    f.write(a)
Or open the file as binary and write encoded bytes:
with open('f.txt', 'wb') as f:
    f.write(a.encode('utf-8'))
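The two options put the same bytes on disk. A small sketch to confirm this, assuming a scratch directory created with tempfile (the file names are made up for the demo):

```python
import os
import tempfile

a = "\u1f00"
folder = tempfile.mkdtemp()  # hypothetical scratch directory for the demo
text_path = os.path.join(folder, "text_mode.txt")
binary_path = os.path.join(folder, "binary_mode.txt")

# Option 1: text mode, the file object encodes for us.
with open(text_path, "w", encoding="utf-8") as f:
    f.write(a)

# Option 2: binary mode, we encode before writing.
with open(binary_path, "wb") as f:
    f.write(a.encode("utf-8"))

with open(text_path, "rb") as f:
    text_mode_bytes = f.read()
with open(binary_path, "rb") as f:
    binary_mode_bytes = f.read()
print(text_mode_bytes, binary_mode_bytes)
```

Text mode is usually the more convenient choice because it also handles newline translation; binary mode leaves both encoding and line endings to the caller.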
Applied to your code:
I would replace this part:
output = open(outp, 'w')
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
for text in wiki.get_texts():
    if six.PY3:
        output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        # ###another method###
        # output.write(
        #     space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
    else:
        output.write(space.join(text) + "\n")
with this:
from io import open
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
with open(outp, 'w', encoding='utf-8') as output:
    for text in wiki.get_texts():
        output.write(u' '.join(text) + u'\n')
which should work in both Python 2 and Python 3.
I need to log Connection errors to log.txt. Windows is Russian.
My code:
# e is a requests.ConnectionError raised on Windows when the server is not available
# I take the error, extract the part I need, and convert it to str
e_warning = str(e.args[0].reason)
# I search the string for the text I need with "re"
e_lst = re.findall(r'>:\s(.+)', e_warning)
# I build a string again from the list that "re" returns
e_str = ''.join(e_lst)
# I convert the string to bytes
e_str_unicode = codecs.encode(e_str, 'utf-8')
# This is the message for the warning window
e_str_utf = codecs.decode(e_str_unicode, encoding='utf-8')
messagebox.showerror(title='Connection error', message=e_str)
with codecs.open('log.txt', 'a', encoding='utf-8') as log:
    log.write(strftime(str("%H:%M:%S %Y-%m-%d") + str(e_str_unicode) + '\n'))
If I use "e_str_utf" in the last line it gives me:
UnicodeEncodeError: 'locale' codec can't encode character '\u041f' in position 72: Illegal byte sequence
That makes sense: position 72 is the first Russian letter.
If I use "e_str_unicode" in the last line there is no error, but in the log file I see:
15:25:18 2017-04-28b'Failed to establish a new connection: [WinError 10060] \xd0\x9f\xd0\xbe\xd0\xbf\xd1\x8b\xd1\x82\xd0\xba\xd0\xb0 \xd1\x83\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd0\xbe\xd0\xb5\xd0\xb4\xd0\xb8\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb1\xd1\x8b\xd0\xbb\xd0\xb0 \xd0\xb1\xd0\xb5\xd0\xb7\xd1\x83\xd1\x81\xd0\xbf\xd0\xb5\xd1\x88\xd0\xbd\xd0\xbe\xd0\xb9, \xd1\x82.\xd0\xba. \xd0\xbe\xd1\x82 \xd0\xb4\xd1\x80\xd1\x83\xd0\xb3\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbf\xd1\x8c\xd1\x8e\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0 \xd0\xb7\xd0\xb0 \xd1\x82\xd1\x80\xd0\xb5\xd0\xb1\xd1\x83\xd0\xb5\xd0\xbc\xd0\xbe\xd0\xb5 \xd0\xb2\xd1\x80\xd0\xb5\xd0\xbc\xd1\x8f \xd0\xbd\xd0\xb5 \xd0\xbf\xd0\xbe\xd0\xbb\xd1\x83\xd1\x87\xd0\xb5\xd0\xbd \xd0\xbd\xd1\x83\xd0\xb6\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xbe\xd1\x82\xd0\xba\xd0\xbb\xd0\xb8\xd0\xba, \xd0\xb8\xd0\xbb\xd0\xb8 \xd0\xb1\xd1\x8b\xd0\xbb\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb7\xd0\xbe\xd1\x80\xd0\xb2\xd0\xb0\xd0\xbd\xd0\xbe \xd1\x83\xd0\xb6\xd0\xb5 \xd1\x83\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb5 \xd1\x81\xd0\xbe\xd0\xb5\xd0\xb4\xd0\xb8\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xb7-\xd0\xb7\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb2\xd0\xb5\xd1\x80\xd0\xbd\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbe\xd1\x82\xd0\xba\xd0\xbb\xd0\xb8\xd0\xba\xd0\xb0 \xd1\x83\xd0\xb6\xd0\xb5 \xd0\xbf\xd0\xbe\xd0\xb4\xd0\xba\xd0\xbb\xd1\x8e\xd1\x87\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbf\xd1\x8c\xd1\x8e\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0'
As I understand it, encoding='utf-8' in
with codecs.open('log.txt', 'a', encoding='utf-8') as log:
should save the text as UTF-8 bytes in my file, but for some reason the encoding setting seems to be ignored. Why?
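First: why codecs.open('log.txt', 'a', encoding='utf-8')? In Python 3 the built-in open() accepts an encoding argument directly.
Second: strftime(str("%H:%M:%S %Y-%m-%d") + str(e_str_unicode) + '\n') is not right. Calling str() on a bytes object produces its b'...' representation, which is exactly what ends up in your log. Format the time on its own and append the text as a string, not as bytes: strftime("%H:%M:%S %Y-%m-%d") + e_str + '\n'
Here is a short example of how to do it: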
from time import strftime
text = input()
print(text)
with open('log.text', 'a', encoding='utf-8') as log:
    message = strftime("%H:%M:%S %Y-%m-%d") + '=>' + text + '\n'
    log.write(message)
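The b'...\xd0\x9f...' text in the log can be reproduced without any file at all. A minimal sketch, assuming a shortened stand-in for the real error message:

```python
from time import strftime

# Shortened stand-in for the real message: ASCII prefix plus a Russian word.
e_str = "[WinError 10060] \u041f\u043e\u043f\u044b\u0442\u043a\u0430"
e_bytes = e_str.encode("utf-8")

# str() on bytes gives the printable representation, escapes included.
wrong = str(e_bytes)
# .decode() gives the original text back.
right = e_bytes.decode("utf-8")

# Format the timestamp on its own, then concatenate plain strings.
log_line = strftime("%H:%M:%S %Y-%m-%d") + " " + right + "\n"
print(ascii(wrong[:24]))
```

The encoding setting on the file was honoured all along: the escapes were already ordinary ASCII characters in the string by the time it reached write().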