I saved a search in https://news.google.com/ but google does not use the actual links found on its results page. Rather, you will find links like this:
I want the 'real link' that this resolves to using python. If you plug the above url into your browser, for a split second you will see
Opening https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm
I tried a few things using the Requests module but 'no cigar'.
If it can't be done, are these google links permanent - can they always be used to open up the web page?
After posting this question I used a hack to solve the problem. I simply used urllib again to open up the google url and then parsed the source to find the 'real url'.
It was exciting to see TDG's answer, as it would help my program run faster. But Google is being cryptic, and it did not work for every link.
For this morning's news feed, it bombed on the 4th news item:
RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py
>>> 1
Tommy Angelo Presents: The Butoff
Flopped Set of Nines: Get All In on Flop or Wait?
What Not to Do Online: Don’t Just Stop Thinking and Shove
Hold’em with Holloway, Vol. 77: Joseph Cheong Gets Crazy with a Pair of Ladies
Traceback (most recent call last):
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 68, in <module>
GetGoogleNews("https://news.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.pokernews.com%2Fstrategy&hl=en-US&gl=US&ceid=US%3Aen", 'news')
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 34, in GetGoogleNews
real_URL = base64.b64decode(coded)
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
After reading up on base64, I think the 'Incorrect padding' message means that the length of the input string must be divisible by 4. So I added 'aa' to
and did not get the error message:
>>> t = s + 'aa'
>>> len(t)/4
>>> base64.b64decode(t)
Basically, it is a base64-encoded string. If you run the following code snippet:
import base64
coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
url = base64.b64decode(coded)
You'll get the following output:
So it looks like your url with some extras. If all the extras are the same, it will be easy to filter out the url. If not - you'll have to handle every one separately.
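A minimal sketch of that filtering step, assuming the wrapped link is the first run of printable ASCII characters starting with "http" in the decoded bytes. The helper name extract_url and the regular expression are illustrative assumptions, not a description of Google's format:

```python
import base64
import re

# Assumption: the article link is the first run of printable ASCII
# characters beginning with "http" inside the decoded bytes.
_URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def extract_url(coded: str) -> str:
    """Decode a Google News token and return the first URL found in it."""
    # base64 input must be a multiple of 4 characters; pad with "=" if needed.
    padded = coded + "=" * (-len(coded) % 4)
    raw = base64.urlsafe_b64decode(padded)
    match = _URL_RE.search(raw)
    if match is None:
        raise ValueError("no URL found in decoded token")
    return match.group(0).decode("ascii")


coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
print(extract_url(coded))
```

Padding with "=" instead of arbitrary characters keeps the decoded bytes unchanged, which matters when the token length is not a multiple of 4.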
I use the following code which you can put in a new module, e.g. gnews.py. This answer is applicable to the RSS feeds provided by Google News, and may otherwise need a slight adjustment. Note that I cache the returned value.
Steps used:
Find the base64 text in the encoded URL, and fix its padding.
Find the first URL in the decoded base64 text.
"""Decode encoded Google News entry URLs."""
import base64
import functools
import re

# Ref: https://stackoverflow.com/a/59023463/

_ENCODED_URL_PREFIX = "https://news.google.com/__i/rss/rd/articles/"
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


@functools.lru_cache()  # Cache decoded URLs (default cache size).
def _decode_google_news_url(url: str) -> str:
    match = _ENCODED_URL_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]  # type: ignore
    encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)
    primary_url = match.groupdict()["primary_url"]  # type: ignore
    primary_url = primary_url.decode()
    return primary_url


def decode_google_news_url(url: str) -> str:  # Not cached because not all Google News URLs are encoded.
    """Return Google News entry URLs after decoding their encoding as applicable."""
    return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIX) else url
Usage example:
>>> decode_google_news_url('https://news.google.com/__i/rss/rd/articles/CBMiQmh0dHBzOi8vd3d3LmV1cmVrYWxlcnQub3JnL3B1Yl9yZWxlYXNlcy8yMDE5LTExL2RwcGwtYmJwMTExODE5LnBocNIBAA?oc=5')
It seems that when I try to decompress some bytes that were decoded from base64, I get an "Input format not supported" error. I cannot isolate the issue: when I move the decoding logic into a new file on its own, the error does not happen, which makes me think it has something to do with the way Flask passes arguments to the functions.
from flask import Flask
import base64
import lzma
from urllib.parse import quote, unquote

app = Flask('app')


@app.route('/')
def hello_world():
    return 'Hello, World!<br><button onclick = "var base = \'https://Text-Viewer-from-Bsace-64-URL.inyourface3445.repl.co/encode\';location.href = `${base}/${prompt(\'What do you want to send?\')}`" >Use</button>'


newline = '/n'


@app.route('/view/<b64>')
def viewer(b64):
    s1 = base64.b64decode(b64.encode() + b'==')
    s2 = lzma.decompress(s1).decode()
    s3 = unquote(s2).replace(newline, '<br>')
    return f'<div style="overflow-x: auto;">{s3}</div>'


@app.route('/encode/<txt>')
def encode(txt):
    quote_text = quote(txt, safe = "")
    compressed_text = lzma.compress(quote_text.encode())
    base_64_txt = base64.b64encode(compressed_text).decode()
    return f'text link '


app.run(host='', port=8080, debug=True)
Can someone explain what I am doing wrong?
You are passing a base64-encoded string as part of the URL, and that string may contain characters that get mangled in the process.
For example, visiting /encode/hello will give the following URL:
Several characters could go wrong:
The first character is /, and as a result Flask will redirect from view//TD6... to view/TD6...: in other words the first character gets deleted
Depending on how URL-encoding is performed by the browser and URL-decoding is performed by Flask, the + character may be decoded into a space
To avoid these issues, I would suggest using base64.urlsafe_b64encode / base64.urlsafe_b64decode which are versions of the base64 encoding where the output can be used in URLs without being mangled.
The following changes to your code seem to do the trick:
s1 = base64.urlsafe_b64decode(b64.encode()) in viewer
base_64_txt = base64.urlsafe_b64encode(compressed_text).decode() in encode
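As a rough illustration of why the URL-safe variant helps, here is a minimal round trip outside Flask. The helper names encode_token and decode_token are made up for this sketch; the compression and quoting steps mirror the question's code:

```python
import base64
import lzma
from urllib.parse import quote, unquote


def encode_token(text: str) -> str:
    """Percent-encode, compress, and wrap text in URL-safe base64."""
    compressed = lzma.compress(quote(text, safe="").encode())
    return base64.urlsafe_b64encode(compressed).decode()


def decode_token(token: str) -> str:
    """Reverse encode_token."""
    compressed = base64.urlsafe_b64decode(token.encode())
    return unquote(lzma.decompress(compressed).decode())


token = encode_token("hello world")

# For comparison: the same compressed bytes in the standard alphabet.
standard = base64.b64encode(lzma.compress(b"hello%20world")).decode()

# The xz header begins with byte 0xFD, so the standard encoding starts
# with "/", while the URL-safe alphabet uses "_" in that position.
print(standard[0], token[0])
print(decode_token(token))
```

The URL-safe alphabet replaces "+" with "-" and "/" with "_", so the token can sit in a path segment without being split or rewritten.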
I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls song lyrics in Japanese from one website, and makes post requests to another website that basically annotates the lyrics with extra information.
When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.
This is a minimal script that doesn't exhibit the issue:
#!/usr/bin/env python2
from bs4 import BeautifulSoup
import requests
import mechanize


def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)  # assumes the lyrics form is the first form on the page
    browser.form['text'] = raw_lyrics
    request = browser.submit()

    # My actual script does more stuff at this point, but this snippet doesn't need it
    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics


if __name__ == '__main__':
    main()
The truncated output is:
扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]
This is a minimal script that exhibits the issue:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests


def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()
    print(annotated_lyrics)


if __name__ == '__main__':
    main()
whose truncated output is:
IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]
It's worth noting that if I just try to get the HTML of the second request, like so:
# Use second website to annotate lyrics with furigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')
An "embedded null character" error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the POST request. In the current example, only one character can be passed.
However, with
url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'
I can pass up to 51 characters, like so:
data = {'text': raw_lyrics[0:51], 'state': 'output'}
before triggering the embedded null character error.
I've tried using urllib instead of requests, decoding and encoding to utf-8 the resulting HTML of the post request, or the data passed as an argument to this request. I've also checked that the encoding of the website is utf-8, which matches the encoding of the post requests:
r = requests.post(url, data=data)
print(r.encoding)
prints utf-8.
I think the problem has to do with how Python 3 is more strict in how it treats strings vs bytes, but I've been unable to pinpoint the exact cause.
While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong: what is the code doing that results in the failure?
I'm able to get the lyrics properly with this code in python3.x:
url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.
The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is the form is using multipart instead of urlencoded form data. (signified by enctype="multipart/form-data")
Sending multipart form data is a little bit strange in requests, I had to poke around a bit and eventually found this which helps show how to format the multipart data in a way that the backing server understands. To do this you have to abuse files but have a "None" filename. "for humans" hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
(Note that this code should work in either python2 or python3)
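To see the difference between the two encodings without contacting the server, the requests can be prepared but not sent. This is only a sketch; the sample text and the variable names are assumptions chosen for illustration:

```python
import requests

url = "http://furigana.sourceforge.net/cgi-bin/index.cgi"
fields = {"text": "こんにちは", "state": "output"}  # sample text, not the real lyrics

# data= produces an application/x-www-form-urlencoded body.
urlencoded = requests.Request("POST", url, data=fields).prepare()

# files= with a None filename produces multipart/form-data parts
# that carry plain form fields rather than file uploads.
multipart = requests.Request(
    "POST", url, files={name: (None, value) for name, value in fields.items()}
).prepare()

print(urlencoded.headers["Content-Type"])
print(multipart.headers["Content-Type"].split(";")[0])
```

Nothing is transmitted here: prepare() only builds the headers and body that requests.post would send, which makes it a convenient way to compare against what the browser submits.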
I've been writing bad perl for a while, but am attempting to learn to write bad python instead. I've read around the problem I've been having for a couple of days now (and know an awful lot more about unicode as a result) but I'm still having troubles with a rogue em-dash in the following code:
import urllib2


def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()


def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data  # debug console


### START HERE ####
graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = ['http://www.supersavvyme.co.uk', 'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
query_graph_api(url_list)
(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper)
My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy-and-paste the solution) has failed.
Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?
P.S. I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here? (Certainly it would make that graph_query_root easier and prettier to create...)
The traceback I get from the actual scraper on ScraperWiki is as follows:
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)
If you are using Python 3.x, all you have to do is add one line and change another:
gq = graph_query.encode('utf-8')
query_data = scrape(gq)
If you are using Python 2.x, first put the following line in at the top of the module file:
# -*- coding: utf-8 -*- (read what this is for here)
and then make all your string literals unicode and encode just before passing to urlopen:
def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()


def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + u"%22" + url + u"%22"
        gq = graph_query.encode('utf-8')
        query_data = scrape(gq)
        print query_data  # debug console


### START HERE ####
graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
It looks from the code like you are using 3.x, which is really better for dealing with stuff like this. But you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.
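On the urlencode idea raised in the question: a minimal Python 3 sketch of percent-encoding the dash so that the final URL is pure ASCII. The shortened field list in the query is an illustrative assumption, and whether the Graph API accepts the result has not been checked here:

```python
from urllib.parse import quote, urlencode

page_url = "http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more"

# Percent-encode the non-ASCII dash as UTF-8 bytes, leaving ":" and "/" intact.
ascii_url = quote(page_url, safe=":/")
print(ascii_url)

# urlencode() builds the whole query string and escapes spaces,
# quotes, and non-ASCII characters in one step.
fql = 'SELECT share_count FROM link_stat WHERE url="%s"' % page_url
graph_query = "https://graph.facebook.com/fql?" + urlencode({"q": fql})
print(graph_query)
```

Either way the dash ends up as %E2%80%93, so the ASCII codec never sees the character that caused the UnicodeEncodeError.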
I'm currently going through the Python Challenge, and I'm up to level 4 (see here). I have only been learning Python for a few months, and I'm trying to learn Python 3 rather than 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:
import urllib, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print " going to", nothing
So to convert this to 3, I would change to this:
import urllib.request, urllib.parse, urllib.error, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'
while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
If I run the 2.x version, it works fine: it goes through the loop, scraping each URL until it reaches the end. I get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If I run the 3.x version, I get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
So if i change the r to a b in this line
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode() #note: utf-8
The page doesn't specify a character encoding via the Content-Type header or a <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset
value of "ISO-8859-1" when received via HTTP.
Regular expressions make sense only on text, not on binary data.
So, keep findnothing = re.compile(r"nothing is (\d+)").search, and convert text to string instead.
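A small offline sketch of that advice. The bytes literal stands in for what urlopen(...).read() returns, so no network access is needed, and UTF-8 is assumed as the encoding:

```python
import re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search

# Stand-in for urllib.request.urlopen(prefix + nothing).read(), which returns bytes.
raw = b"and the next nothing is 44827"

text = raw.decode("utf-8")  # bytes -> str, so the str pattern can be applied
match = findnothing(text)
nothing = match.group(1)    # a str, so it can be concatenated with prefix
next_url = prefix + nothing
print(" going to", nothing)
```

Because the decode happens once, right after reading, everything downstream (the regular expression, the concatenation, the print) works on str and no bytes leak into the loop.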
Instead of urllib, we're using requests, which gives you two ways to read the response (urllib may offer similar options):
Response object
import requests
>>> response = requests.get('https://api.github.com')
response.content holds the raw body as bytes:
>>> response.content
response.text holds the body decoded to a str:
>>> response.text
requests picks the encoding based on the response headers, but you can set it right after the request like so:
import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
After that, response.text will be decoded using the encoding you set.
I want to create a script in Python which downloads the current KML files of all the Maps I created on Google Maps.
To do so manually, I can use this:
where USER_ID is a constant number Google uses to identify me, and MAP_ID is the individual map identifier generated by the link icon on top-right corner.
This is not very straightforward, because I have to manually browse "My Places" page on Google Maps, and get the links one by one.
From Google Maps API HTTP Protocol Reference:
The Map Feed is a feed of user-created maps.
This feed's full GET URI is:
This feed returns a list of all maps for the authenticated user.
** The page says this service is no longer available, so I wonder if there is a way to do the same in the present.
So, the question is: Is there a way to get/download the list of MAP_IDs of all my maps, preferably using Python?
Thanks for reading
The correct answer to this question involves using Google Maps Data API, HTML interface, which by the way is deprecated but still solves my need in a more official way, or at least more convincing than parsing a web page. Here it goes:
# coding: utf-8

import urllib2, urllib, re, getpass

username = 'heltonbiker'
senha = getpass.getpass('Password for user ' + username + ':')

dic = {
    'accountType': 'GOOGLE',
    'Email': (username + '@gmail.com'),
    'Passwd': senha,
    'service': 'local',
    'source': 'helton-mapper-1'
}
url = 'https://www.google.com/accounts/ClientLogin?' + urllib.urlencode(dic)
output = urllib2.urlopen(url).read()
authid = output.strip().split('\n')[-1].split('=')[-1]

request = urllib2.Request('http://maps.google.com/maps/feeds/maps/default/full')
request.add_header('Authorization', 'GoogleLogin auth=%s' % authid)
source = urllib2.urlopen(request).read()

for link in re.findall('<link rel=.alternate. type=.text/html. href=((.)[^\1]*?)>', source):
    s = link[0]
    if 'msa=0' in s:
        print s
I arrived at this solution with the help of a bunch of other questions on SO, where a lot of people helped me, so I hope this code might help anyone else trying to do the same in the future.
A quick and dirty way I have found, which skips the Google Maps API completely and might well break in the near future, is this:
# coding: utf-8

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-2]
            print shown, mapid, mapname
            shown += 1

            # uncomment if you want to download the KML files:
            # urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
            #                    '&msa=0&output=kml', mapname + '.kml')

    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break  # no further pages, so stop instead of looping forever
Of course, it only prints a numbered list, but from there it is a small step to save the results in a dictionary and/or automate the KML download via the &output=kml URL trick.