I've been writing bad perl for a while, but am attempting to learn to write bad python instead. I've read around the problem I've been having for a couple of days now (and know an awful lot more about unicode as a result) but I'm still having trouble with a rogue em-dash in the following code:
import urllib2

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data  # debug console

### START HERE ####
graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = ['http://www.supersavvyme.co.uk', 'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper)
My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy-and-paste the solution) has failed.
Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?
P.S. I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here? (It would certainly make that graph_query_root easier and prettier to create...)
---8<----
The traceback I get from the actual scraper on ScraperWiki is as follows:
http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)
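The exception can be reproduced in isolation, without any network access. U+2013 is strictly an en-dash (it reads much like an em-dash), and ASCII only covers code points 0-127, so encoding it to ASCII must fail while UTF-8 succeeds. A minimal Python 3 sketch:

```python
# U+2013 (en-dash) is outside ASCII's 0-127 range, so this raises.
try:
    u'how-to-be-happy\u2013laugh-more'.encode('ascii')
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode character '\u2013' in position 15...

# Encoding to UTF-8 succeeds: the en-dash becomes the three bytes E2 80 93.
print(u'how-to-be-happy\u2013laugh-more'.encode('utf-8'))
```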
If you are using Python 3.x, all you have to do is add one line and change another:
gq = graph_query.encode('utf-8')
query_data = scrape(gq)
If you are using Python 2.x, first put the following line in at the top of the module file:
# -*- coding: utf-8 -*- (this declares the encoding of the source file itself)
and then make all your string literals unicode and encode just before passing to urlopen:
def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + u"%22" + url + u"%22"
        gq = graph_query.encode('utf-8')
        query_data = scrape(gq)
        print query_data  # debug console

### START HERE ####
graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
Judging from the urllib2 import, the print statements, and the traceback, you are using 2.x; 3.x is really better for dealing with stuff like this, but you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.
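As the P.S. in the question suspects, percent-encoding is the other half of the story: urlopen ultimately needs an ASCII-only URL, and urllib's quote turns the en-dash into %E2%80%93. A sketch using Python 3's urllib.parse (in 2.x the same function lives at urllib.quote and expects a byte string):

```python
from urllib.parse import quote

url = u'http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more'

# quote() percent-encodes each UTF-8 byte of the non-ASCII character;
# safe=':/' keeps the scheme and path separators untouched.
ascii_url = quote(url, safe=':/')
print(ascii_url)
# http://www.supersavvyme.co.uk/article/how-to-be-happy%E2%80%93laugh-more
```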
It seems that when I try to decode some bytes that were decoded from base64, I get an "Input Format not Supported" error. I cannot isolate the issue: when I bring the decoding logic alone into a new file, the error does not happen, which makes me think it has something to do with the way Flask passes arguments to the functions.
Code:
from flask import Flask
import base64
import lzma
from urllib.parse import quote, unquote

app = Flask('app')

@app.route('/')
def hello_world():
    return 'Hello, World!<br><button onclick = "var base = \'https://Text-Viewer-from-Bsace-64-URL.inyourface3445.repl.co/encode\';location.href = `${base}/${prompt(\'What do you want to send?\')}`" >Use</button>'

newline = '/n'

@app.route('/view/<path:b64>')
def viewer(b64):
    print(type(b64))
    s1 = base64.b64decode(b64.encode() + b'==')
    s2 = lzma.decompress(s1).decode()
    s3 = unquote(s2).replace(newline, '<br>')
    return f'<div style="overflow-x: auto;">{s3}</div>'

@app.route('/encode/<path:txt>')
def encode(txt):
    quote_text = quote(txt, safe = "")
    compressed_text = lzma.compress(quote_text.encode())
    base_64_txt = base64.b64encode(compressed_text).decode()
    return f'text link '

app.run(host='0.0.0.0', port=8080, debug=True)
Can someone explain what I am doing wrong?
You are passing a base64-encoded string as part of the URL, and that string may contain characters that get mangled in the process.
For example, visiting /encode/hello will give the following URL:
https://text-viewer-from-bsace-64-url.inyourface3445.repl.co/view//Td6WFoAAATm1rRGAgAhARYAAAB0L+WjAQAEaGVsbG8AAAAAsTe52+XaHpsAAR0FuC2Arx+2830BAAAAAARZWg==
Several characters could go wrong:
The first character is /, and as a result Flask will redirect from view//Td6... to view/Td6...: in other words, the first character gets deleted
Depending on how URL-encoding is performed by the browser and URL-decoding is performed by Flask, the + character may be decoded into a space
To avoid these issues, I would suggest using base64.urlsafe_b64encode / base64.urlsafe_b64decode which are versions of the base64 encoding where the output can be used in URLs without being mangled.
The following changes to your code seem to do the trick:
s1 = base64.urlsafe_b64decode(b64.encode()) in viewer
base_64_txt = base64.urlsafe_b64encode(compressed_text).decode() in encode
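To see why the URL-safe alphabet matters, here is a self-contained round trip through the same quote + lzma + base64 pipeline as the question (the sample text is made up):

```python
import base64
import lzma
from urllib.parse import quote

# Hypothetical payload standing in for whatever the user submits.
text = "hello /world+test"
compressed = lzma.compress(quote(text, safe="").encode())

# The URL-safe alphabet swaps '+' for '-' and '/' for '_', so the token
# survives being embedded in a URL path without redirects or mangling.
url_safe = base64.urlsafe_b64encode(compressed).decode()
assert '/' not in url_safe and '+' not in url_safe

# Round trip back to the original percent-encoded text.
restored = lzma.decompress(base64.urlsafe_b64decode(url_safe)).decode()
print(restored == quote(text, safe=""))  # True
```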
I saved a search in https://news.google.com/ but google does not use the actual links found on its results page. Rather, you will find links like this:
https://news.google.com/articles/CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA?hl=en-US&gl=US&ceid=US%3Aen
I want the 'real link' that this resolves to using python. If you plug the above url into your browser, for a split second you will see
Opening https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm
I tried a few things using the Requests module but 'no cigar'.
If it can't be done, are these google links permanent - can they always be used to open up the web page?
UPDATE 1:
After posting this question I used a hack to solve the problem: I simply used urllib again to open the Google URL and then parsed the source to find the 'real URL'.
It was exciting to see TDG's answer, as it would help my program run faster. But Google is being cryptic, and it did not work for every link.
For this morning's news feed, it bombed on the 4th news item:
RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py
cp1252
cp1252
>>> 1
Tommy Angelo Presents: The Butoff
CBMiTWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvdG9tbXktYW5nZWxvLXByZXNlbnRzLXRoZS1idXRvZmYtMzE4ODEuaHRt0gEA
b'\x08\x13"Mhttps://www.pokernews.com/strategy/tommy-angelo-presents-the-butoff-31881.htm\xd2\x01\x00'
Flopped Set of Nines: Get All In on Flop or Wait?
CBMiXGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvZmxvcHBlZC1zZXQtb2YtbmluZXMtZ2V0LWFsbC1pbi1vbi1mbG9wLW9yLXdhaXQtMzE4ODAuaHRt0gEA
b'\x08\x13"\\https://www.pokernews.com/strategy/flopped-set-of-nines-get-all-in-on-flop-or-wait-31880.htm\xd2\x01\x00'
What Not to Do Online: Don’t Just Stop Thinking and Shove
CBMiZWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd2hhdC1ub3QtdG8tZG8tb25saW5lLWRvbi10LWp1c3Qtc3RvcC10aGlua2luZy1hbmQtc2hvdmUtMzE4NzAuaHRt0gEA
b'\x08\x13"ehttps://www.pokernews.com/strategy/what-not-to-do-online-don-t-just-stop-thinking-and-shove-31870.htm\xd2\x01\x00'
Hold’em with Holloway, Vol. 77: Joseph Cheong Gets Crazy with a Pair of Ladies
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
Traceback (most recent call last):
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 68, in <module>
GetGoogleNews("https://news.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.pokernews.com%2Fstrategy&hl=en-US&gl=US&ceid=US%3Aen", 'news')
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 34, in GetGoogleNews
real_URL = base64.b64decode(coded)
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
>>>
UPDATE 2:
After reading up on base64, I think the 'Incorrect padding' message means that the length of the input string must be divisible by 4. So I added 'aa' to
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
and did not get the error message:
>>> t = s + 'aa'
>>> len(t)/4
32.0
>>> base64.b64decode(t)
b'\x08\x13"Whttps://www.pokernews.com/strategy/hold-em-with-holloway-vol-77-joseph-cheong-31858.htm\xd2\x01\x00\x06\x9a'
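Appending 'aa' happens to make the length divisible by 4, but it injects garbage bytes at the end of the decoded output (note the trailing \x06\x9a above). A safer general fix is to pad with '=', the official base64 padding character, up to the next multiple of 4:

```python
import base64

s = 'CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA'

# '=' * (-len(s) % 4) adds exactly as many '=' as needed (0 to 3),
# so the decoder sees a correctly padded string and appends no stray bytes.
padded = s + '=' * (-len(s) % 4)
print(base64.b64decode(padded))
```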
Basically it is a base64-encoded string. If you run the following code snippet:
import base64
coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
url = base64.b64decode(coded)
print(url)
You'll get the following output:
b'\x08\x13"Phttps://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm\xd2\x01\x00'
So it looks like your URL with some extra bytes around it. If all the extras follow the same pattern, it will be easy to filter out the URL. If not, you'll have to handle each one separately.
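Building on that, here is a small sketch that fixes the padding and pulls the first URL out of the wrapper bytes. The wrapper format is undocumented, so the regex (everything from http up to the \xd2 terminator seen in the samples) is an assumption based on this thread's examples, not a guarantee:

```python
import base64
import re

def extract_url(coded):
    # Repad to a multiple of 4, then decode the standard base64 alphabet.
    decoded = base64.b64decode(coded + '=' * (-len(coded) % 4))
    # Grab the first http(s) URL embedded in the wrapper bytes.
    match = re.search(rb'https?://[^\xd2]+', decoded)
    return match.group(0).decode()

coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
print(extract_url(coded))
# https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm
```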
I use the following code which you can put in a new module, e.g. gnews.py. This answer is applicable to the RSS feeds provided by Google News, and may otherwise need a slight adjustment. Note that I cache the returned value.
Steps used:
Find the base64 text in the encoded URL, and fix its padding.
Find the first URL in the decoded base64 text.
"""Decode encoded Google News entry URLs."""
import base64
import functools
import re

# Ref: https://stackoverflow.com/a/59023463/
_ENCODED_URL_PREFIX = "https://news.google.com/__i/rss/rd/articles/"
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')

@functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
    match = _ENCODED_URL_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]  # type: ignore
    encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    decoded_text = base64.urlsafe_b64decode(encoded_text)
    match = _DECODED_URL_RE.match(decoded_text)
    primary_url = match.groupdict()["primary_url"]  # type: ignore
    primary_url = primary_url.decode()
    return primary_url

def decode_google_news_url(url: str) -> str:  # Not cached because not all Google News URLs are encoded.
    """Return Google News entry URLs after decoding their encoding as applicable."""
    return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIX) else url
Usage example:
>>> decode_google_news_url('https://news.google.com/__i/rss/rd/articles/CBMiQmh0dHBzOi8vd3d3LmV1cmVrYWxlcnQub3JnL3B1Yl9yZWxlYXNlcy8yMDE5LTExL2RwcGwtYmJwMTExODE5LnBocNIBAA?oc=5')
'https://www.eurekalert.org/pub_releases/2019-11/dppl-bbp111819.php'
I have issues with a Python script. I am just trying to translate some sentences with the Google Translate API. Some sentences have problems with special UTF-8 characters like ä, ö or ü. I can't imagine why some sentences work and others don't.
If I try the API call direct in the browser, it works, but inside my Python script I get a mismatch.
Here is a reduced version of my script that reproduces the error:

# -*- coding: utf-8 -*-
import requests
import json

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=' + satz
r = requests.get(url)
r.text.encode().decode('utf8', 'ignore')
n = json.loads(r.text)
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i = i + 1
this is how my result looks:
Unter dem Mondschein glänzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |
Google has served you a Mojibake; the JSON response contains data that was originally encoded using UTF-8 but was then decoded with a different codec, resulting in incorrect data.
I suspect Google does this as it decodes the URL parameters; in the past, URL parameters could be encoded in any number of codecs, and that UTF-8 is now the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.
You should also make sure your URL string is properly percent-encoded; if you pass your parameters in a dictionary to params, then requests will take care of adding them to the URL properly:
satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]

for inputline, outputline, *__ in results:
    print(outputline)
Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in python - it uses some javascript checking so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner

def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()

debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser

f = open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/debug_log'
So, the problem is: either Apple or spynner handles Cyrillic symbols incorrectly. I see them fine if I try browser.show() after loading, but in the code and logs they are still wrongly encoded, like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as if it were ISO-8859-1...
As you see - I don't use any headers in my request, but if I take them from chrome's network debug console and pass it to load() method e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')] - I get the same result.
Also, from the same network console you can see that chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the begin and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, in which case str() won't help if the result actually contains non-Latin unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
If you suspect it's in Russian, you could encode it to a Russian single-byte codepage (still a bytestring) by doing
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then you can save it to an ASCII file.
str(browser.webframe.toHtml()) saved me
I have a problem with website encoding. I wrote a program to scrape a website, but I have not succeeded in changing the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the encoding of the website, not the utf-8 encoding that I tried to convert to.
Have anyone some suggestion?
Thanks
Well, there is no guarantee that the encoding presented by the HTTP headers is the same as the one specified inside the HTML itself. This can happen either due to misconfiguration on the server side, or because the charset declaration inside the HTML is simply wrong. There is really no fully reliable automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily distinguished) and then hardcoding the encoding manually inside your app.
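As a sketch of that manual check, the charset declared by the server and the one declared in the HTML can be extracted and compared; the regexes and sample strings below are illustrative, not a robust HTML parser:

```python
import re

def charset_from_header(content_type):
    """Pull the charset out of a Content-Type header value, if any."""
    m = re.search(r'charset=([\w-]+)', content_type, re.IGNORECASE)
    return m.group(1).lower() if m else None

def charset_from_html(html):
    """Pull the charset out of a <meta ... charset=...> declaration, if any."""
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return m.group(1).lower() if m else None

# Hypothetical mismatch: the server header and the HTML disagree,
# which is exactly the situation described above.
print(charset_from_header('text/html; charset=iso-8859-1'))  # iso-8859-1
print(charset_from_html('<html><head><meta charset="utf-8"></head></html>'))  # utf-8
```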