Store a Unicode string in a file using Python - python

temp.XML
<?xml version="1.0" encoding="utf-8"?>
<PubmedArticleSet>
<LastName>Nalivaĭko</LastName>
<ForeName>Anthony V</ForeName>
</PubmedArticleSet>
MY CODE
import xml.dom.minidom
doc = xml.dom.minidom.parse("temp.xml")
file = open('output1.xml','w')
articles = doc.getElementsByTagName('PubmedArticleSet')
for art in articles:
    ln = art.getElementsByTagName("LastName")[0]
    data = ln.firstChild.nodeValue
    file.write("<LastName>")
    file.write(data)
    file.write("</LastName>\n")
print("Completed")
file.close()
I need the output to match the string in the LastName tag exactly.
Required Output - <LastName>Nalivaĭko</LastName>
I'm getting this error while running my code
Traceback (most recent call last):
File "C:\Users\Yugam\Desktop\python\ParsingUsingDOM.py", line 12, in <module>
file.write(data)
File "C:\Users\Yugam\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u012d' in position 6: character maps to <undefined>

You can open the file for writing with the desired encoding like so:
open('output1.xml','w', encoding='utf-8')
Then you can write out your unicode string as normal.
The output file:
<LastName>Nalivaĭko</LastName>
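For reference, here is a minimal sketch of the question's script with that one change applied. The function name copy_last_names is made up for illustration, and the file names are the ones from the question; treat it as one way to arrange the code rather than the only one.

```python
import xml.dom.minidom


def copy_last_names(src_path, dst_path):
    """Copy the first LastName of each PubmedArticleSet into dst_path (hypothetical helper)."""
    doc = xml.dom.minidom.parse(src_path)
    # An explicit encoding avoids falling back to the platform default (cp1252 in the traceback).
    with open(dst_path, "w", encoding="utf-8") as out:
        for art in doc.getElementsByTagName("PubmedArticleSet"):
            ln = art.getElementsByTagName("LastName")[0]
            out.write("<LastName>" + ln.firstChild.nodeValue + "</LastName>\n")


# Usage, with the file names from the question:
# copy_last_names("temp.xml", "output1.xml")
```

The with block also closes the file if an exception is raised partway through the loop.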

Related

Emoji support reading from file in python? [duplicate]

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, this is the error I get. How do I avoid it?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>
Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
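To see why the default codec fails here, the following sketch encodes a short Tamil word as UTF-8 and tries to decode the bytes as cp1252. The sample word is an assumption chosen for illustration; the actual contents of ettuthokai.txt are not known.

```python
# Sample text (an assumption, not taken from the asker's file): the word "Tamil" in Tamil script.
text = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
raw = text.encode("utf-8")

# The final character, U+0BCD, encodes to the bytes E0 AF 8D.
# Byte 0x8d has no mapping in Python's cp1252 codec, so decoding fails.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Decoding with the encoding the bytes were written in round-trips cleanly.
round_trip = raw.decode("utf-8")
```

Passing encoding="utf-8" to open() makes the file object perform the second kind of decoding instead of the first.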

How to extract text from a docx file and store in a text file

I have been trying to read a .docx file and copy its text to a .txt file.
I started with this script:
if extension == 'docx' :
    document = Document(filepath)
    for para in document.paragraphs:
        with open("C:/Users/prasu/Desktop/PySumm-resource/CodeSamples/output.txt","w") as file:
            file.writelines(para.text)
The error is as follows:
Traceback (most recent call last):
File "input_script.py", line 27, in <module>
file.writelines(para.text)
File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 0: character maps to <undefined>
Printing "para.text" with print() works.
Now I want to write "para.text" to a .txt file.
You could try using the codecs module.
Based on your error message, the character "≥" (U+2265) is causing the problem, because cp1252 cannot represent it. Writing the output as UTF-8 with codecs should solve your issue.
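from docx import Document
import codecs
filepath = r"test.docx"
document = Document(filepath)
# Open the output once; reopening it per paragraph with "utf-8-sig" would write a BOM each time.
with codecs.open('output.txt', 'a', "utf-8-sig") as o_file:
    for para in document.paragraphs:
        # Paragraph text carries no trailing newline, so add one to keep paragraphs separate.
        o_file.write(para.text + "\n")
# No explicit close() is needed: the with block closes the file.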
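In Python 3 the built-in open() accepts the same encoding names, so codecs is optional. The sketch below takes a plain list of strings in place of document.paragraphs so that it runs without python-docx; the helper name write_paragraphs is hypothetical.

```python
def write_paragraphs(paragraphs, path, encoding="utf-8"):
    """Write each paragraph on its own line using an explicit encoding (hypothetical helper)."""
    with open(path, "w", encoding=encoding) as out:
        for text in paragraphs:
            out.write(text + "\n")


# With python-docx this could be called as:
# write_paragraphs([p.text for p in document.paragraphs], "output.txt")
```

Choosing "utf-8-sig" instead of "utf-8" prepends a byte order mark, which some Windows programs use to detect UTF-8; plain "utf-8" writes no such marker.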

UnicodeEncodeError: 'charmap' codec can't encode character : character maps to <undefined>

I am doing some CSV read/write operations over a huge flat file. According to the source, the contents of the file are UTF-8 encoded, but while trying to write to a .csv file I am getting the following error:
Traceback (most recent call last):
File " basic.py", line 12, in <module>
F.write(q)
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9a' in position 19: character maps to <undefined>
The file quite possibly contains some non-English symbols, since it holds data from around the world.
But as it is huge, I can't fix them manually one by one. Is there a way to fix these errors, make the code ignore them, or ideally standardize these characters? After writing the CSV file I will be uploading it to a SQL Server database.
I had the same error message when trying to write text to a .txt file.
I solved the problem by replacing
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w")
text_file.write(my_text)
with
my_text = "text containing non-unicode characters"
text_file = open("Output.txt", "w", encoding='utf-8')
text_file.write(my_text)
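The question also asks whether unencodable characters can be ignored or standardized. open() takes an errors argument for that, as in the rough sketch below, which uses the csv module. The helper name write_rows and the sample rows are made up for illustration.

```python
import csv


def write_rows(path, rows, encoding="utf-8", errors="strict"):
    """Write rows as CSV with an explicit encoding and error handler (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for files passed to csv.writer.
    with open(path, "w", encoding=encoding, errors=errors, newline="") as f:
        csv.writer(f).writerows(rows)


# UTF-8 can represent every character, so nothing is lost:
# write_rows("out.csv", rows)
# If the target must be cp1252, errors="replace" writes "?" for characters it cannot encode,
# and errors="ignore" drops them:
# write_rows("out.csv", rows, encoding="cp1252", errors="replace")
```

Writing UTF-8 keeps the data intact; the replace and ignore handlers lose information, so they are best kept for cases where the target encoding is fixed.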

Error when printing random line from text file

I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
    print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
line 3, in <module>
lines = f.readlines()
line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is with reading, not printing: the file contains non-ASCII bytes, and it is being decoded as ASCII. Try opening your file with an explicit encoding:
with open('Long films', encoding='latin-1') as f:
...
Also, have you changed your locale settings? Have you set any encoding scheme at the top of your file? When no encoding argument is given, Python 3 decodes text files using the locale's preferred encoding, which here appears to be ASCII. On most systems that default is UTF-8, so you typically should not be getting this error.
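Byte 0xe2 is often the first byte of a UTF-8 punctuation mark such as a curly apostrophe, so the file may well be UTF-8 rather than Latin-1. The sketch below compares the two decodings on a made-up line; the sample bytes are an assumption, since the real file contents are unknown.

```python
# Hypothetical line from the file: "It's a long film" with a curly apostrophe, stored as UTF-8.
raw = b"It\xe2\x80\x99s a long film\n"

# Decoding as UTF-8 recovers the apostrophe as a single character (U+2019).
as_utf8 = raw.decode("utf-8")

# latin-1 maps every byte to a character, so it never raises,
# but the three-byte sequence comes out as three unrelated characters.
as_latin1 = raw.decode("latin-1")

# ASCII rejects any byte above 0x7f, which is the error in the traceback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

If the printed lines show stray accented characters after switching to latin-1, try encoding='utf-8' instead.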

dictionary data extraction issue

top_100 is a MongoDB collection.
The following code:
x=[]
thread=[]
for doc in top_100.find():
    x.append(doc['_id'])
db = Connection().test
top_100 = db.top_100_thread
thread = [a["thread"] for a in x]
for doc in thread:
    print doc
gives this error:
Traceback (most recent call last):
File "C:\Users\chatterjees\workspace\de.vogella.python.first\src\top_100_thread.py", line 21, in <module>
print doc
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b9' in position 10: character maps to <undefined>
what's going on?
It's because your document contains some Unicode data that the console encoding (cp1252) cannot represent.
You need to encode the Unicode data for output explicitly
instead of printing it directly.
See:
python 3.0, how to make print() output unicode?
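One way to keep print from raising is to escape whatever the console encoding cannot represent. The sketch below is written for Python 3, whereas the question's code is Python 2, and the helper name console_safe is made up for illustration.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as backslash escapes."""
    # Fall back to ASCII if stdout has no known encoding (an assumption for redirected output).
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Usage: print(console_safe(doc))
```

Characters the console can display pass through unchanged; the rest appear as escapes such as \u03b9 instead of stopping the loop with an exception.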
