UnicodeEncodeError when trying to read a graph with networkx - python

I have a little script that takes hashtags from Twitter with the TwitterSearch API and uses them as nodes in a graph with networkx. TwitterSearch returns the hashtags in unicode format, and I have no problems while saving the graph using the write_pajek function. Instead, when I try to read the graph with read_pajek, it returns me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in read_pajek
File "C:\Python27\lib\site-packages\networkx\utils\decorators.py", line 263, in _open_file
result = func(*new_args, **kwargs)
File "C:\Python27\lib\site-packages\networkx\readwrite\pajek.py", line 134, in read_pajek
return parse_pajek(lines)
File "C:\Python27\lib\site-packages\networkx\readwrite\pajek.py", line 170, in parse_pajek
splitline=shlex.split(str(next(lines)))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
I think the problem is due to the fact that it tries to decode with the ascii codec some chinese/japanese characters, but I don't know how solve it. In the second argument of the function, you can declare the encoding of the file, which by default is "UTF-8", so in theory I shouldn't have any problem while reading it.

Related

Python: Can't seem to decode the textfile

I am trying to open the file i.md3 in python. Only the first line is displayed properly.
This file is correct as I can open this in C easily with structures and pointers.
How can I decode this file. I have tried many encoding techniques many of which can show only output of first line.
Without the "encoding=cp850" there is an error:
Traceback (most recent call last):
File "D:\Eclipse Workspace\IGG Project\Main.py", line 40, in <module>
line = fp1.read()
File "C:\Program Files (x86)\Python38-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 35: character maps to <undefined>
This file is correct as I can open this in C easily with structures and pointers.
Code:
fp1 = open("i.md3", encoding="cp850")
while 1:
line = fp1.read()
if not line:
break
print (line)
First few lines of output is in the link below:
https://i.stack.imgur.com/TGZhq.png

About UnicodeDecodeError

I am writing a program to count the words with python(3.6), the code runs smoothly from the terminal. But if I use python IDLE, below error happens:
Traceback (most recent call last):
File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 122, in <module>
main()
File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 21, in main
for line in txtFile:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 33: ordinal not in range(128)
How to solve this?
Since there is not much info about your code. I can only suggest instead of codecs you can also use this package.
https://github.com/iki/unidecode. The method below should solve your problem. Open your file with open method, and pass it the file_handle.read()
unidecode.unidecode_expect_nonascii(string)

Python rdflib Not Parsing Creative Commons License Information Correctly

I was using rdflib version 3.2.3 and everything was working fine. After upgrading to 4.0.1 I started getting the error:
RDFa parsing Error! 'ascii' codec can't decode byte 0xc3 in position 5454: ordinal not in range(128)
I tried various way to make this work but so far have not succeeded. Below are my attempts.
In each case I:
from rdflib import Graph
First attempt:
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/graph.py", line 1002, in parse
parser.parse(source, self, **args)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 268, in parse
vocab_cache=vocab_cache)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 148, in _process
_check_error(processor_graph)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/structureddata.py", line 57, in _check_error
raise Exception("RDFa parsing Error! %s" % msg)
Exception: RDFa parsing Error! 'ascii' codec can't decode byte 0xc3 in position 4801: ordinal not in range(128)
Second attempt:
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/rdf'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/graph.py", line 1002, in parse
parser.parse(source, self, **args)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 570, in parse
self._parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 349, in end_element_ns
self._cont_handler.endElementNS(pair, None)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
self.current.end(name, qname)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 461, in property_element_end
current.data, literalLang, current.datatype)
File "/home/alex/Projects/RDF/rdfEnv/local/lib/python2.7/site-packages/rdflib/term.py", line 541, in __new__
raise Exception("'%s' is not a valid language tag!"%lang)
Exception: 'i18n' is not a valid language tag!
Third attempt: gives no errors but also does not give any results
>>> lg =Graph()
>>> len(lg.parse('http://creativecommons.org/licenses/by/3.0/rdf', format='rdfa'))
0
So someone please tell me what I am dong wrong! :)
As Graham replied on the rdflib mailinglist, there is a html5lib problem - we will pin it correctly for python 2 for the next release, but for now just do:
pip install html5lib==0.95
The second problem is a problem in the data from creative commons, "i18n" really isn't a valid language tag according to rfc5646. I added the check, but in retrospect it seems to strict to raise an exception. I guess I'll change it to a warning.

UnicodeDecodeError for writing file

I know that this is a very common error, but it's the first time I've encountered it when trying to write a file.
I'm using networkx to work with graphs for network analysis, and when I try to write into any format:
nx.write_gml(G, "Graph.gml")
nx.write_pajek(G, "Graph.net")
nx.write_gexf(G, "graph.gexf")
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in write_pajek
File "/Library/Python/2.7/site-packages/networkx/utils/decorators.py", line 263, in _open_file
result = func(*new_args, **kwargs)
File "/Library/Python/2.7/site-packages/networkx/readwrite/pajek.py", line 100, in write_pajek
path.write(line.encode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
I haven't found documentation on this, so quite confused.
Wondering if you can make use of codec module to solve it or not. Just create a file object by codec as following before feeding to networkx.
ex,
import codecs
f = codecs.open("graph.gml", "w", "utf-8")

UnicodeDecodeError in Python 2.7

I am trying to read a utf-8 encoded xml file in python and I am doing some processing on the lines read from the file something like below:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Where doc_content is the line read from the file and word_value is one of the string from the the same line. I am getting encoding related error for above line whenever doc_content or word_value is having some Unicode characters. So, I tried to decode them first with utf-8 decoding (instead of default ascii encoding) as below :
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting UnicodeDecodeError as below :
Traceback (most recent call last):
File "snippetRetriver.py", line 402, in <module>
sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
File "snippetRetriver.py", line 201, in getSentenceList
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest me a suitable approach / way to avoid these kind of encoding errors in python 2.7 ?
codecs.utf_8_decode(input.encode('utf8'))

Categories

Resources