Trouble using gdata and Unicode Cyrillic in Python

I have this code:
# -*- coding: utf8 -*-
__author__ = 'user'
import gdata.youtube.service
yt_service = gdata.youtube.service.YouTubeService()
query = gdata.youtube.service.YouTubeVideoQuery()
query.vq = u"не"
feed = yt_service.YouTubeQuery(query)
for yt_item in feed.entry:
    print yt_item.GetSwfUrl()
And I am getting this error:
Traceback (most recent call last):
File "cyr_search.py", line 7, in <module>
feed = yt_service.YouTubeQuery(query)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/gdata/youtube/service.py", line 1346, in YouTubeQuery
result = self.Query(query.ToUri())
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/gdata/service.py", line 1715, in ToUri
return atom.service.BuildUri(q_feed, self)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/atom/service.py", line 584, in BuildUri
parameter_list = DictionaryToParamList(url_params, escape_params)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/atom/service.py", line 551, in DictionaryToParamList
for param, value in (url_parameters or {}).items()]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1275, in quote_plus
return quote(s, safe)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1268, in quote
return ''.join(map(quoter, s))
KeyError: u'\u043d'
How do I search for non-ASCII text? Do I need to URL-encode the query? I thought the library would do that on its own.

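Change the assignment to:
query.vq = u"не".encode('utf8')
The string needs to be encoded before being sent. In Python 2, urllib.quote cannot percent-encode a unicode object that contains non-ASCII characters (hence the KeyError: u'\u043d'), so pass it UTF-8 bytes instead.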
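For readers on Python 3, where gdata's Python 2 code path does not apply, the same idea can be sketched with the standard library alone. This is only an illustration of encoding to UTF-8 before percent-encoding; the helper name encode_query_term is made up here and is not part of gdata.

```python
from urllib.parse import quote_plus


def encode_query_term(term: str) -> str:
    """Encode a search term to UTF-8 bytes, then percent-encode it."""
    # Hypothetical helper for illustration; not a gdata function.
    return quote_plus(term.encode("utf-8"))


print(encode_query_term("не"))  # %D0%BD%D0%B5
```

Each Cyrillic letter becomes two percent-encoded bytes because it occupies two bytes in UTF-8.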

Related

ignore encoding error when parsing pdf with pdfminer

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fn='test.pdf'
with open(fn, mode='rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm'])['Fields']
    item = {}
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        item[name]=value
Hello, I need help with this code, as it gives me a Unicode error on some characters:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
x = x.resolve(default=default)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdftypes.py", line 67, in resolve
return self.doc.getobj(self.objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 673, in getobj
stream = stream_value(self.getobj(strmid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 676, in getobj
obj = self._getobj_parse(index, objid)
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/pdfdocument.py", line 648, in _getobj_parse
raise PDFSyntaxError('objid mismatch: %r=%r' % (objid1, objid))
File "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 85, in __repr__
return self.name.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
Is there anything I can add so that it ignores the characters it is not able to decode, or at least returns the name with a blank value in name, value = field.get('T'), field.get('V')?
Any help is appreciated.

Here is one way to fix it. Open the file in an editor:
nano "/home/timmy/.local/lib/python3.8/site-packages/pdfminer/psparser.py"
Then, at line 85, pass 'ignore' as the error handler:
def __repr__(self):
    return self.name.decode('ascii', 'ignore') # this fixes it
Editing installed library source is not recommended, so you should also report the issue on GitHub.
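To see what the 'ignore' error handler does without touching pdfminer, here is a small standalone sketch. The function decode_name and the sample bytes are invented for illustration; only bytes.decode and its standard error handlers are real.

```python
def decode_name(raw: bytes, errors: str = "ignore") -> str:
    """Decode bytes as ASCII, using the given handler for undecodable bytes."""
    # Hypothetical helper; it mirrors the one-line patch shown above.
    return raw.decode("ascii", errors)


raw_name = b"\xe5bc"  # starts with the byte 0xe5 from the traceback

print(repr(decode_name(raw_name)))             # undecodable byte is dropped
print(repr(decode_name(raw_name, "replace")))  # undecodable byte becomes U+FFFD
```

'ignore' silently loses data, while 'replace' leaves a visible marker, which can make damaged names easier to spot later.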

Unable to decode yml file ... utf8' codec can't decode byte #xa0: invalid start byte

I'm trying to read a YAML file and convert it into a dictionary, but loading the file into a dict variable fails.
I searched for similar issues. One of the replies on Stack Overflow suggested replacing each '\\xa0' character with ' ', so I tried line = line.replace('\\xa0',' '). The program doesn't work on Python 2.7, but it works fine on Python 3.
import yaml
import sys
yaml_dir = "/root/tools/test_case/"
#file_name = "TC_CFD_SR.yml"
file_name = "TC_QB.yml"
tc_file_name = yaml_dir + file_name
def write(file,content):
    file = open(file,'a')
    file.write(content)
    file.close()
def verifyYmlFile(yml_file):
    data = {}
    with open(yml_file, 'r') as fin:
        for line in fin:
            line = line.replace('\\xa0',' ')
            write('anand-yaml.yml',line)
    with open('anand-yaml.yml','r') as fin:
        data = yaml.load(fin)
    return data
if __name__ == '__main__':
    data = {}
    print "verifying yaml"
    data= verifyYmlFile(tc_file_name)
Error:
[root@anand-harness test_case]# python verify_yaml.py
verifying yaml
Traceback (most recent call last):
File "verify_yaml.py", line 29, in <module>
data= verifyYmlFile(tc_file_name)
File "verify_yaml.py", line 23, in verifyYmlFile
data = yaml.load(fin)
File "/usr/lib64/python2.6/site-packages/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/usr/lib64/python2.6/site-packages/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 111, in compose_sequence_node
node.value.append(self.compose_node(node, index))
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 133, in compose_mapping_node
item_value = self.compose_node(node, item_key)
File "/usr/lib64/python2.6/site-packages/yaml/composer.py", line 64, in compose_node
if self.check_event(AliasEvent):
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/lib64/python2.6/site-packages/yaml/parser.py", line 449, in parse_block_mapping_value
if not self.check_token(KeyToken, ValueToken, BlockEndToken):
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 244, in fetch_more_tokens
return self.fetch_single()
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 653, in fetch_single
self.fetch_flow_scalar(style='\'')
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 667, in fetch_flow_scalar
self.tokens.append(self.scan_flow_scalar(style))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1156, in scan_flow_scalar
chunks.extend(self.scan_flow_scalar_non_spaces(double, start_mark))
File "/usr/lib64/python2.6/site-packages/yaml/scanner.py", line 1196, in scan_flow_scalar_non_spaces
while self.peek(length) not in u'\'\"\\\0 \t\r\n\x85\u2028\u2029':
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 91, in peek
self.update(index+1)
File "/usr/lib64/python2.6/site-packages/yaml/reader.py", line 165, in update
exc.encoding, exc.reason)
yaml.reader.ReaderError: 'utf8' codec can't decode byte #xa0: invalid start byte
in "anand-yaml.yml", position 3246
What am I missing?

The character sequence "\\xa0" is not the problem reported in the message; the problem is the sequence "\xa0" (note that the backslash is not escaped).
Your replacement line should be:
line = line.replace('\xa0',' ')
to circumvent the problem.
If you know what the file's actual encoding is, you can do the correct conversion yourself, but that should not be necessary, and neither that nor the patch above is a structural solution. It would be best if the YAML file were generated correctly in the first place (YAML defaults to UTF-8, so the file should contain valid UTF-8). It could also be UTF-16 without the appropriate BOM (which the yaml library interprets, IIRC).
The following snippet shows the difference between the two sequences:
s1 = 'abc\\xa0xyz'
print(repr(s1))
u1 = s1.decode('utf-8') # this works fine
s = 'abc\xa0xyz'
print(repr(s))
u = s.decode('utf-8') # this throws an error
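If the file cannot be regenerated, one possible workaround is to decode the raw bytes yourself before handing the text to the YAML loader. The sketch below is Python 3 and assumes, without confirmation from the question, that the stray 0xa0 byte is a Latin-1 non-breaking space; the helper name decode_lenient is hypothetical.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8; if that fails, fall back to Latin-1 and
    turn non-breaking spaces into ordinary spaces."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the file is really Latin-1, where 0xa0 is NBSP.
        return raw.decode("latin-1").replace("\xa0", " ")


print(decode_lenient(b"name: 'a\xa0b'"))
```

The returned string could then be passed to the YAML loader in place of the open file object.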

Python HTMLParser Not Reading Whole File

from HTMLParser import HTMLParser
class HTMLParserDos(HTMLParser):
    full_text = ""
    def handle_data(self, data):
        self.full_text += data
        return self.full_text
h = HTMLParserDos()
file = open('emails.txt', 'r')
h.feed(file.read())
file.close()
print h.container
This code is getting an error:
Traceback (most recent call last):
File "/Users/laurenstrom/Google Drive/PYTHON/RANDO_CALRISSIAN/html_parse", line 15, in <module>
h.feed(file.read())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
self.error("malformed start tag")
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 7, column 18
I'm not sure what I'm missing about .feed() but I can't seem to find anything about why it won't just read the whole file.

You are asking the HTML parser to parse a file, most of which isn't HTML. It is tripping over line 7 of your file, which is:
Return-Path: <Tom@sjnetworkconsulting.com>
I would imagine it sees the < and assumes an HTML tag is starting, which of course is not the case.
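One way to avoid feeding mail headers to the HTML parser is to parse the message with the email module first and pass only the text/html parts on. The sketch below is Python 3 and assumes the input is a single RFC 822 style message; emails.txt may well hold several, which would need splitting first. The names TextCollector and html_text_from_email, and the sample message, are made up for illustration.

```python
from email import message_from_string
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text_from_email(raw_message: str) -> str:
    """Return the text of every text/html part of one email message."""
    message = message_from_string(raw_message)
    collector = TextCollector()
    for part in message.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # bytes
            charset = part.get_content_charset() or "utf-8"
            collector.feed(payload.decode(charset, errors="replace"))
    collector.close()
    return "".join(collector.chunks)


# Invented sample message; the address is a placeholder.
sample = (
    "Return-Path: <sender@example.com>\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>Hello</p></body></html>\n"
)
print(html_text_from_email(sample).strip())
```

The headers, including the Return-Path line that caused the error, are consumed by the email parser and never reach HTMLParser.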

Python Pandas print error in Eclipse's PyDev: unknown encoding: MS874

I am trying to use the Pandas library to read CSV files in Eclipse's PyDev.
foo.csv file:
"head1", "head2",
"A", "123"
test.py:
import pandas as pd
data = pd.read_csv('foo.csv');
print data
I ran this and got an error:
Traceback (most recent call last):
File "C:\Users\qqq\studyspace\macd\test3.py", line 4, in <module>
print data
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 666, in __str__
return self.__bytes__()
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 676, in __bytes__
return self.__unicode__().encode(encoding, 'replace')
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 691, in __unicode__
fits_horizontal = self._repr_fits_horizontal_()
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 651, in _repr_fits_horizontal_
d.to_string(buf=buf)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1488, in to_string
formatter.to_string()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 314, in to_string
strcols = self._to_str_columns()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 258, in _to_str_columns
str_index = self._get_formatted_index()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 472, in _get_formatted_index
fmt_index = [index.format(name=show_index_names, formatter=fmt)]
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 450, in format
return self._format_with_header(header, **kwargs)
File "C:\Python27\lib\site-packages\pandas\core\index.py", line 472, in _format_with_header
result = _trim_front(format_array(values, None, justify='left'))
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1321, in format_array
return fmt_obj.get_result()
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1448, in get_result
return _make_fixed_width(fmt_values, self.justify)
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 1495, in _make_fixed_width
max_len = np.max([_strlen(x) for x in strings])
File "C:\Python27\lib\site-packages\pandas\core\format.py", line 184, in _strlen
return len(x.decode(encoding))
LookupError: unknown encoding: MS874
I have tried to run this in IPython, and it does not give the error, so I think the problem is with my Eclipse setting. I use Eclipse Juno and I installed Pandas via Python(x,y).
I have tried to solve it blindly like this:
import pandas as pd
data = pd.read_csv('foo.csv');
b = True;
while(b):
    try:
        print data
        b = False
    except:
        print 'foooo'
And it just printed 'foooo' forever.

I have found the solution.
Right click on the project => Properties => Resource => Text file encoding. Choose other => UTF-8.
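The traceback ends in x.decode(encoding) with an encoding name, MS874, that the Python installation does not recognize; the name presumably comes from the console that PyDev sets up. As a rough diagnostic sketch (Python 3, standard library only; the helper name encoding_is_known is made up for illustration), you can check whether Python knows the name the console reports:

```python
import codecs
import sys


def encoding_is_known(name):
    """Return True if Python has a codec registered under this name."""
    if not name:
        return False
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# sys.stdout.encoding may be None when output is redirected.
console_encoding = getattr(sys.stdout, "encoding", None)
print(console_encoding, encoding_is_known(console_encoding))
```

If the check prints False inside Eclipse but True in IPython, that would point at the console configuration rather than at Pandas.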

How to fix encoding in Python Mechanize?

Here is the sample code:
from mechanize import Browser
br = Browser()
page = br.open('http://hunters.tclans.ru/news.php?readmore=2')
br.form = br.forms().next()
print br.form
The problem is that the server returns an incorrect encoding name (windows-cp1251). How can I manually set the encoding of the current page in mechanize?
Error:
Traceback (most recent call last):
File "/tmp/stackoverflow.py", line 5, in <module>
br.form = br.forms().next()
File "/usr/local/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 426, in forms
return self._factory.forms()
File "/usr/local/lib/python2.6/dist-packages/mechanize/_html.py", line 559, in forms
self._forms_factory.forms())
File "/usr/local/lib/python2.6/dist-packages/mechanize/_html.py", line 225, in forms
_urlunparse=_rfc3986.urlunsplit,
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 967, in ParseResponseEx
_urlunparse=_urlunparse,
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 1104, in _ParseFileEx
fp.feed(data)
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 870, in feed
sgmllib.SGMLParser.feed(self, data)
File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.6/sgmllib.py", line 193, in goahead
self.handle_entityref(name)
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 751, in handle_entityref
'&%s;' % name, self._entitydefs, self._encoding))
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 238, in unescape
return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)
File "/usr/lib/python2.6/re.py", line 151, in sub
return _compile(pattern, 0).sub(repl, string, count)
File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 230, in replace_entities
repl = repl.encode(encoding)
LookupError: unknown encoding: windows-cp1251

I don't know about Mechanize, but you could hack codecs to accept wrong encoding names that contain both ‘windows’ and ‘cp’:
>>> import codecs
>>> def fixcp(name):
...     if name.lower().startswith('windows-cp'):
...         try:
...             return codecs.lookup(name[:8]+name[10:])
...         except LookupError:
...             pass
...     return None
...
>>> codecs.register(fixcp)
>>> '\xcd\xe0\xef\xee\xec\xe8\xed\xe0\xe5\xec'.decode('windows-cp1251')
u'Напоминаем'
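The session above is Python 2. A Python 3 adaptation might look like the sketch below; it is not the answerer's code. It normalizes hyphens to underscores because newer Python 3 releases pass a normalized name to codec search functions, and the function name fix_windows_cp is invented for illustration.

```python
import codecs


def fix_windows_cp(name):
    """Map misnamed encodings such as 'windows-cp1251' to 'cp1251'."""
    # The name may arrive with hyphens or underscores depending on the
    # Python version, so normalize before comparing.
    normalized = name.lower().replace("-", "_")
    prefix = "windows_cp"
    if normalized.startswith(prefix):
        try:
            return codecs.lookup("cp" + normalized[len(prefix):])
        except LookupError:
            pass
    return None


codecs.register(fix_windows_cp)

data = b"\xcd\xe0\xef\xee\xec\xe8\xed\xe0\xe5\xec"
print(data.decode("windows-cp1251"))  # Напоминаем
```

Registering a search function affects the whole process, so a hack like this is best kept near the program's entry point.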

Fixed by setting the following after br.open() (note the underscores), where enc is the name of the correct encoding:
br._factory.encoding = enc
br._factory._forms_factory.encoding = enc
br._factory._links_factory._encoding = enc
