I'm currently going through the Python Challenge, and I'm up to level 4 (see here). I have only been learning Python for a few months, and I'm trying to learn Python 3 over 2.x. So far so good, except when I use this bit of code. Here's the Python 2.x version:
import urllib, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'

while True:
    text = urllib.urlopen(prefix + nothing).read()
    print text
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print " going to", nothing
    else:
        break
So to convert this to 3, I would change it to this:
import urllib.request, urllib.parse, urllib.error, re

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search
nothing = '12345'

while True:
    text = urllib.request.urlopen(prefix + nothing).read()
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break
So if I run the 2.x version, it works fine: it goes through the loop, scraping each URL, and gets to the end. I get the following output:
and the next nothing is 72198
going to 72198
and the next nothing is 80992
going to 80992
and the next nothing is 8880
going to 8880 etc
If I run the 3.x version, I get the following output:
b'and the next nothing is 44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 26, in <module>
match = findnothing(b"text")
TypeError: can't use a string pattern on a bytes-like object
So if I change the r to a b in this line
findnothing = re.compile(b"nothing is (\d+)").search
I get:
b'and the next nothing is 44827'
going to b'44827'
Traceback (most recent call last):
File "C:\Python32\lvl4.py", line 24, in <module>
text = urllib.request.urlopen(prefix + nothing).read()
TypeError: Can't convert 'bytes' object to str implicitly
Any ideas?
I'm pretty new to programming, so please don't bite my head off.
You can't mix bytes and str objects implicitly.
The simplest thing would be to decode bytes returned by urlopen().read() and use str objects everywhere:
text = urllib.request.urlopen(prefix + nothing).read().decode()  # note: defaults to utf-8
The page doesn't specify a preferred character encoding via the Content-Type header or a <meta> element. I don't know what the default encoding should be for text/html, but RFC 2068 says:
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset
value of "ISO-8859-1" when received via HTTP.
Regular expressions make sense only on text, not on binary data.
So keep findnothing = re.compile(r"nothing is (\d+)").search, and convert the text to a string instead.
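For completeness, here is a sketch of the full Python 3 loop with that one change applied (decoding as ISO-8859-1, the HTTP default quoted above; a plain .decode() would also work here since these pages are ASCII):

import re
import urllib.request

prefix = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing="
findnothing = re.compile(r"nothing is (\d+)").search  # str pattern, unchanged
nothing = '12345'

while True:
    # read() returns bytes; decode to str so the str pattern can be applied
    text = urllib.request.urlopen(prefix + nothing).read().decode('ISO-8859-1')
    print(text)
    match = findnothing(text)
    if match:
        nothing = match.group(1)
        print(" going to", nothing)
    else:
        break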
Instead of urllib, we're using requests here, and it has two relevant attributes (you may be able to find similar options in urllib):
Response object
import requests
>>> response = requests.get('https://api.github.com')
Using response.content - it has type bytes
>>> response.content
b'{"current_user_url":"https://api.github.com/user","current_us...."}'
While using response.text - you get the decoded response as a str
>>> response.text
'{"current_user_url":"https://api.github.com/user","current_us...."}'
Here the encoding is utf-8 (taken from the response headers), but you can set it right after the request like so
import requests
>>> response = requests.get('https://api.github.com')
>>> response.encoding = 'SOME_ENCODING'
And then response.text will hold the content decoded with the encoding you set.
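For comparison, a rough urllib counterpart to response.content and response.text (a sketch; get_content_charset() pulls the charset from the Content-Type header, with 'utf-8' as an assumed fallback):

import urllib.request

with urllib.request.urlopen('https://api.github.com') as response:
    content = response.read()                                 # bytes, like response.content
    encoding = response.headers.get_content_charset('utf-8')  # charset from headers, else utf-8
    text = content.decode(encoding)                           # str, like response.text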
Related
I saved a search in https://news.google.com/, but Google does not use the actual links found on its results page. Rather, you will find links like this:
https://news.google.com/articles/CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA?hl=en-US&gl=US&ceid=US%3Aen
I want the 'real link' that this resolves to, using Python. If you plug the above URL into your browser, for a split second you will see
Opening https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm
I tried a few things using the Requests module but 'no cigar'.
If it can't be done, are these google links permanent - can they always be used to open up the web page?
UPDATE 1:
After posting this question I used a hack to solve the problem: I simply used urllib again to open up the Google URL and then parsed the source to find the 'real URL'.
It was exciting to see TDG's answer, as it would help my program run faster. But Google is being cryptic, and it did not work for every link.
For this morning's news feed, it bombed on the 4th news item:
RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py
cp1252
cp1252
>>> 1
Tommy Angelo Presents: The Butoff
CBMiTWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvdG9tbXktYW5nZWxvLXByZXNlbnRzLXRoZS1idXRvZmYtMzE4ODEuaHRt0gEA
b'\x08\x13"Mhttps://www.pokernews.com/strategy/tommy-angelo-presents-the-butoff-31881.htm\xd2\x01\x00'
Flopped Set of Nines: Get All In on Flop or Wait?
CBMiXGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvZmxvcHBlZC1zZXQtb2YtbmluZXMtZ2V0LWFsbC1pbi1vbi1mbG9wLW9yLXdhaXQtMzE4ODAuaHRt0gEA
b'\x08\x13"\\https://www.pokernews.com/strategy/flopped-set-of-nines-get-all-in-on-flop-or-wait-31880.htm\xd2\x01\x00'
What Not to Do Online: Don’t Just Stop Thinking and Shove
CBMiZWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd2hhdC1ub3QtdG8tZG8tb25saW5lLWRvbi10LWp1c3Qtc3RvcC10aGlua2luZy1hbmQtc2hvdmUtMzE4NzAuaHRt0gEA
b'\x08\x13"ehttps://www.pokernews.com/strategy/what-not-to-do-online-don-t-just-stop-thinking-and-shove-31870.htm\xd2\x01\x00'
Hold’em with Holloway, Vol. 77: Joseph Cheong Gets Crazy with a Pair of Ladies
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
Traceback (most recent call last):
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 68, in <module>
GetGoogleNews("https://news.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.pokernews.com%2Fstrategy&hl=en-US&gl=US&ceid=US%3Aen", 'news')
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 34, in GetGoogleNews
real_URL = base64.b64decode(coded)
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
>>>
UPDATE 2:
After reading up on base64, I think the 'Incorrect padding' message means that the length of the input string must be divisible by 4. So I added 'aa' to
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
and did not get the error message:
>>> t = s + 'aa'
>>> len(t)/4
32.0
>>> base64.b64decode(t)
b'\x08\x13"Whttps://www.pokernews.com/strategy/hold-em-with-holloway-vol-77-joseph-cheong-31858.htm\xd2\x01\x00\x06\x9a'
Basically it is a base64-encoded string. If you run the following code snippet:
import base64
coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
url = base64.b64decode(coded)
print(url)
You'll get the following output:
b'\x08\x13"Phttps://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm\xd2\x01\x00'
So it looks like your URL with some extras. If all the extras are the same, it will be easy to filter out the URL. If not, you'll have to handle each one separately.
I use the following code which you can put in a new module, e.g. gnews.py. This answer is applicable to the RSS feeds provided by Google News, and may otherwise need a slight adjustment. Note that I cache the returned value.
Steps used:
Find the base64 text in the encoded URL, and fix its padding.
Find the first URL in the decoded base64 text.
"""Decode encoded Google News entry URLs."""
import base64
import functools
import re
# Ref: https://stackoverflow.com/a/59023463/
_ENCODED_URL_PREFIX = "https://news.google.com/__i/rss/rd/articles/"
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')
#functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
match = _ENCODED_URL_RE.match(url)
encoded_text = match.groupdict()["encoded_url"] # type: ignore
encoded_text += "===" # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
decoded_text = base64.urlsafe_b64decode(encoded_text)
match = _DECODED_URL_RE.match(decoded_text)
primary_url = match.groupdict()["primary_url"] # type: ignore
primary_url = primary_url.decode()
return primary_url
def decode_google_news_url(url: str) -> str: # Not cached because not all Google News URLs are encoded.
"""Return Google News entry URLs after decoding their encoding as applicable."""
return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIX) else url
Usage example:
>>> decode_google_news_url('https://news.google.com/__i/rss/rd/articles/CBMiQmh0dHBzOi8vd3d3LmV1cmVrYWxlcnQub3JnL3B1Yl9yZWxlYXNlcy8yMDE5LTExL2RwcGwtYmJwMTExODE5LnBocNIBAA?oc=5')
'https://www.eurekalert.org/pub_releases/2019-11/dppl-bbp111819.php'
I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re

url = "http://www.google.com"
regex = r'<title>(,+?)</title>'
pattern = re.compile(regex)

with urllib.request.urlopen(url) as response:
    html = response.read()

title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
File "path\to\file\Crawler.py", line 11, in <module>
title = re.findall(pattern, html)
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since Python doesn't know how those bytes are encoded, it throws an exception when you try to use a string regex on them.
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(,+?)</title>'
#        ^
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)
See the urlopen documentation for more details.
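Putting those pieces together, a minimal runnable sketch (note: (.+?) is used here instead of the question's (,+?), which would only match commas and looks like a typo):

import re
import urllib.request

url = "http://www.google.com"
pattern = re.compile(r'<title>(.+?)</title>')

with urllib.request.urlopen(url) as response:
    # Take the charset from the Content-Type header, falling back to utf-8
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)

print(re.findall(pattern, html))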
Based on the last answer, this was simple to do when reading a PDF too:
text = text.decode('ISO-8859-1')
Thanks @Aran-fey
While porting code from Python 2 to 3, I get this error when reading from a URL:
TypeError: initial_value must be str or None, not bytes.
import urllib
import json
import gzip
from io import StringIO
from urllib.parse import urlencode
from urllib.request import Request

service_url = 'https://babelfy.io/v1/disambiguate'
text = 'BabelNet is both a multilingual encyclopedic dictionary and a semantic network'
lang = 'EN'
Key = 'KEY'

params = {
    'text': text,
    'key': Key,
    'lang': 'EN'
}

url = service_url + '?' + urlencode(params)
request = Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib.request.urlopen(request)

if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = json.loads(f.read())
The exception is thrown at this line
buf = StringIO(response.read())
If I use python2, it works fine.
response.read() returns an instance of bytes while StringIO is an in-memory stream for text only. Use BytesIO instead.
From What's new in Python 3.0 - Text Vs. Data Instead Of Unicode Vs. 8-bit
The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively.
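A minimal sketch of that fix applied to the question's code (reusing its response object; BytesIO wraps the gzipped bytes, and the decompressed bytes are decoded before json.loads, assuming the API returns UTF-8 JSON):

import gzip
import json
from io import BytesIO

if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(response.read())               # bytes-based in-memory stream
    f = gzip.GzipFile(fileobj=buf)
    data = json.loads(f.read().decode('utf-8'))  # decompress, then decode to str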
This looks like another Python 3 bytes vs. str problem. Your response is of type bytes (which is different from str in Python 3). You need to get it into a string first, using e.g. response.read().decode('utf-8'), and then use StringIO on it. Or you may want to use BytesIO, as someone said; but if you expect str, the preferred way is to decode into a str first.
Consider using six.StringIO instead of io.StringIO.
And if you are migrating code from Python 2 to Python 3 and are using an old version of suds, use suds-py3 for Python 3.
I am using Python 3.x. While using urllib.request to download a webpage, I am getting a lot of \n in between. I am trying to remove them using the methods given in other threads of the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code on Eclipse. Here is my code:
import urllib.request

# Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

# Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for all the \n in the raw_html variable.
Your download_page() function corrupts the html (the str() call); that is why you see \n (two characters, \ and n) in the output. Don't use .replace() or similar workarounds; fix the download_page() function instead:
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., by getting it from the Content-Type HTTP header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass a charset in the Content-Type header, then there are complex rules to figure out the character encoding of an html5 document; e.g., it may be specified inside the html document: <meta charset="utf-8"> (you would need an html parser to get it).
If you read the html correctly then you shouldn't see the literal characters \n in the page.
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
Seems like they are literal \n characters, so I suggest you do it like this:
raw_html2 = raw_html.replace('\\n', '')
I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of functions, but not this one.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'
So you have to get the header, and split it. Twice.
>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'
That's a surprising amount of steps for such a basic function. Am I missing something?
Have you checked this?
How to download any(!) webpage with correct charset in python?
I did some research and came up with this solution:
response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()
This is how I would do it in Python 3. I haven't tested it in Python 2, but I am guessing that you would have to use urllib2 instead of urllib.request.
Here is how it works, since the official Python documentation doesn't explain it very well: the result of urlopen is an http.client.HTTPResponse object. The headers property of this object is an http.client.HTTPMessage object, which, according to the documentation, "is implemented using the email.message.Message class", which has a method called get_content_charset, which tries to determine and return the character set of the response.
By default, this method returns None if it is unable to determine the character set, but you can override this behavior by passing a failobj parameter:
encoding = response.headers.get_content_charset(failobj="utf-8")
You're not missing anything. It's doing the right thing: the encoding of an HTTP response is a subpart of Content-Type.
Note also that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8">; that's an ugly hack though (on the part of the page author) and is not too common.
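If you do need to handle that meta hack, here is a rough sketch using the standard library's HTMLParser (CharsetSniffer and html_bytes are hypothetical names; it only covers the two common meta forms):

from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Collect the first charset hint found in <meta> tags."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:  # <meta charset="utf-8">
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            # <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            content = attrs.get('content', '')
            if 'charset=' in content:
                self.charset = content.split('charset=')[1].split(';')[0].strip()

sniffer = CharsetSniffer()
sniffer.feed(html_bytes.decode('ascii', errors='replace'))  # html_bytes: the raw response body
print(sniffer.charset)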
I would go with chardet Universal Encoding Detector.
>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}
You are doing it right, but your approach would fail for pages where the charset is declared in a meta tag or is not declared at all.
If you look closer at the chardet sources, it has charsetprober/charsetgroupprober modules that deal with this problem nicely.
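For instance, a sketch of chardet's incremental UniversalDetector (Python 3 urllib here; feeding chunks lets the detector stop as soon as it is confident):

import urllib.request
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with urllib.request.urlopen('http://google.cn/') as response:
    for chunk in iter(lambda: response.read(4096), b''):
        detector.feed(chunk)
        if detector.done:  # confident enough; stop reading
            break
detector.close()
print(detector.result)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99}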