dictionary data extraction issue

top_100 is a MongoDB collection. The following code:
from pymongo import Connection
db = Connection().test
top_100 = db.top_100_thread
x = []
for doc in top_100.find():
    x.append(doc['_id'])
thread = [a["thread"] for a in x]
for doc in thread:
    print doc
gives this error:
Traceback (most recent call last):
  File "C:\Users\chatterjees\workspace\de.vogella.python.first\src\top_100_thread.py", line 21, in <module>
    print doc
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b9' in position 10: character maps to <undefined>
What's going on?

It's because your document contains Unicode data that the console encoding (cp1252 in this traceback) cannot represent.
You need to encode the Unicode data explicitly for output
instead of printing it directly.
See:
python 3.0, how to make print() output unicode?
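
As a rough illustration of encoding explicitly before output, the sketch below is written in Python 3 syntax (the question itself uses Python 2, where `encode` on a unicode object accepts the same `errors` argument). The helper name `console_safe`, the sample string, and the cp1252 default are assumptions made for the example, not part of the original code.

```python
def console_safe(text, encoding="cp1252"):
    """Replace characters the target encoding cannot represent with '?'."""
    # Encoding with errors="replace" never raises UnicodeEncodeError;
    # decoding again gives back a string that is safe to print.
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample value containing the character from the traceback.
thread_title = "alpha \u03b9 beta"
printable = console_safe(thread_title)
print(printable)
```

This loses the unsupported characters. If the console can be switched to UTF-8, printing the original text is preferable.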

Related

Emoji support reading from file in python? [duplicate]

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid it?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
  File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>

Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
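
A self-contained sketch of the same idea, assuming Python 3. It writes a small UTF-8 file to a temporary folder and reads it back, so it runs without the original corpus. The helper `read_text`, the file name, and the sample word are made up for the example.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file using an explicit encoding."""
    # Without encoding=..., open() falls back to the locale default
    # (cp1252 on many Windows setups), which is what the traceback shows.
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Hypothetical sample text in Tamil script, written with escapes.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    corpus = read_text(path)

print(corpus == sample)
```

The `with` block also closes the file promptly, which the one-line `open(...).read()` form leaves to the garbage collector.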

Store Unicoded string into file using python

temp.xml
<?xml version="1.0" encoding="utf-8"?>
<PubmedArticleSet>
  <LastName>Nalivaĭko</LastName>
  <ForeName>Anthony V</ForeName>
</PubmedArticleSet>
My code:
import xml.dom.minidom
doc = xml.dom.minidom.parse("temp.xml")
file = open('output1.xml','w')
articles = doc.getElementsByTagName('PubmedArticleSet')
for art in articles:
    ln = art.getElementsByTagName("LastName")[0]
    data = ln.firstChild.nodeValue
    file.write("<LastName>")
    file.write(data)
    file.write("</LastName>\n")
print("Completed")
file.close()
I need the output to be the same as the string in the LastName tag.
Required output: <LastName>Nalivaĭko</LastName>
I'm getting this error while running my code:
Traceback (most recent call last):
  File "C:\Users\Yugam\Desktop\python\ParsingUsingDOM.py", line 12, in <module>
    file.write(data)
  File "C:\Users\Yugam\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u012d' in position 6: character maps to <undefined>

You can open the file for writing with the desired encoding like so:
open('output1.xml','w', encoding='utf-8')
Then you can write out your unicode string as normal.
The output file:
<LastName>Nalivaĭko</LastName>
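
Put together, the change might look like the sketch below, assuming Python 3. To keep it runnable on its own, the XML is parsed from an in-memory string rather than from temp.xml, and the output goes to a temporary folder. The function name `copy_last_names` and the constant `SOURCE` are illustrative, not from the question.

```python
import os
import tempfile
import xml.dom.minidom

# Same content as temp.xml in the question, with the special character escaped.
SOURCE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<PubmedArticleSet>"
    "<LastName>Naliva\u012dko</LastName>"
    "<ForeName>Anthony V</ForeName>"
    "</PubmedArticleSet>"
)


def copy_last_names(xml_text, out_path):
    """Write every LastName element to out_path, encoded as UTF-8."""
    doc = xml.dom.minidom.parseString(xml_text.encode("utf-8"))
    # encoding="utf-8" is the actual fix: the file no longer uses cp1252.
    with open(out_path, "w", encoding="utf-8") as out:
        for art in doc.getElementsByTagName("PubmedArticleSet"):
            ln = art.getElementsByTagName("LastName")[0]
            out.write("<LastName>" + ln.firstChild.nodeValue + "</LastName>\n")


with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "output1.xml")
    copy_last_names(SOURCE, out_path)
    with open(out_path, encoding="utf-8") as handle:
        result = handle.read()

print(result == "<LastName>Naliva\u012dko</LastName>\n")
```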

UnicodeEncodeError: 'charmap' codec can't encode character : character maps to <undefined>

I am doing some CSV read/write operations on a huge flat file. According to the source, the contents of the file are UTF-8 encoded, but while trying to write to a .csv file I get the following error:
Traceback (most recent call last):
  File " basic.py", line 12, in <module>
    F.write(q)
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9a' in position 19: character maps to <undefined>
It is quite possible that the file contains symbols from several languages, since its data represents global information.
Because the file is huge, these can't be fixed manually one by one. Is there a way to fix these errors, make the code ignore them, or ideally standardize these characters? After writing the CSV file I will be uploading it to a SQL Server database.

I had the same error message when trying to write text to a .txt file.
I solved the problem by replacing
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w")
text_file.write(my_text)
with
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w", encoding='utf-8')
text_file.write(my_text)
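
The question also asks about ignoring or standardizing the offending characters. The sketch below, assuming Python 3, compares the standard `errors` handlers on a string containing the `'\x9a'` character from the traceback. The sample string is invented, and whether replacing or dropping characters is acceptable depends on the data.

```python
# Hypothetical sample: '\x9a' is a control character that cp1252 cannot encode.
text = "price\x9a 100"

# UTF-8 can encode every character, so nothing is lost.
as_utf8 = text.encode("utf-8")

# With cp1252, the errors argument decides what happens to '\x9a'.
replaced = text.encode("cp1252", errors="replace")  # substitutes '?'
ignored = text.encode("cp1252", errors="ignore")    # drops the character

print(as_utf8.decode("utf-8") == text)
print(replaced)
print(ignored)
```

`open()` accepts the same `errors` argument next to `encoding`, so the handlers can be applied while writing the file instead of encoding by hand.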

Encode UTF-8 for list

I'm using selenium to retrieve a list from a javascript object.
search_reply = driver.find_element_by_class_name("ac_results")
When trying to write to csv, I get this error:
Traceback (most recent call last):
  File "insref_lookup15.py", line 54, in <module>
    wr_insref.writerow(instrument_name)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 22: ordinal not in range(128)
I have tried placing .encode("utf-8") on both:
search_reply = driver.find_element_by_class_name("ac_results").encode("utf-8")
and
wr_insref.writerow(instrument_name).encode("utf-8")
but I just get the message
AttributeError: 'xxx' object has no attribute 'encode'

You need to encode the elements in the list:
wr_insref.writerow([v.encode('utf8') for v in instrument_name])
The csv module documentation has an Examples section that covers writing Unicode objects in more detail, including utility classes to handle this automatically.
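
The answer above targets Python 2, where the csv module works on byte strings. On Python 3 the per-element encoding is not needed; a rough equivalent is to open the file with an explicit encoding and pass the text through unchanged. The sample row and the file name below are placeholders, not data from the question.

```python
import csv
import os
import tempfile

# Hypothetical row; '\u00e4' is the character reported in the traceback.
instrument_name = ["alpha", "b\u00e4ta"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "insref.csv")
    # newline="" is what the csv documentation recommends for csv files.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        wr_insref = csv.writer(handle)
        wr_insref.writerow(instrument_name)
    # Read the file back to confirm the row survived the round trip.
    with open(path, encoding="utf-8", newline="") as handle:
        rows = list(csv.reader(handle))

print(rows == [instrument_name])
```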

UnicodeDecodeError in Python 2.7

I am trying to read a UTF-8 encoded XML file in Python and do some processing on the lines read from the file, something like below:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Here doc_content is the line read from the file and word_value is one of the strings from the same line. I get an encoding-related error on the line above whenever doc_content or word_value contains Unicode characters. So I tried to decode them first with UTF-8 (instead of the default ASCII encoding), as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting UnicodeDecodeError as below :
Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest a suitable way to avoid these kinds of encoding errors in Python 2.7?

codecs.utf_8_decode(input.encode('utf8'))
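
A likely reading of the traceback: in Python 2, calling `.decode('utf-8')` on a value that is already a unicode object first encodes it implicitly with ASCII, which is why a decode call ends in a UnicodeEncodeError. One way around that is to decode only values that are still bytes. The helper `to_text` and the sample strings below are assumptions for illustration; the sketch is run here as Python 3, and since `bytes` is an alias of `str` in Python 2.7 the same check should apply there.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings; return text that is already decoded unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical inputs: one still encoded as bytes, one already text.
doc_content = to_text("caf\u00e9 au lait, caf\u00e9 noir".encode("utf-8"))
word_value = to_text("caf\u00e9")

# Search for the second occurrence, starting after the first character.
next_sent_separator_index = doc_content.find(word_value, 1)
print(next_sent_separator_index)
```

Normalizing every value to text once, right after reading, avoids mixing encoded and decoded strings later in the processing.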
