I'm looking for a really simple piece of code that calls a URL and prints the HTML source. This is what I am using; I'm following an online course which has this code:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except:
        return ""

print(get_page('https://www.yahoo.com/'))
It prints nothing, but also raises no errors. Alternatively, from browsing these forums, I've tried:
from urllib.request import urlopen
print(urlopen('https://xkcd.com/353/'))
When I do this I get:
<http.client.HTTPResponse object at 0x000001E947559710>
from urllib.request import urlopen
print(urlopen('https://xkcd.com/353/').read().decode())

This assumes UTF-8 encoding was used.
from urllib import request

def get_src_code(url):
    r = request.urlopen(url)
    byte_code = r.read()
    src_code = byte_code.decode()
    return src_code
It prints the empty string from the except block.

Your code is generating an error because there is no attribute called open in the urllib module. You can't see the error because you are using a try/except block that returns an empty string on every error. In your code, you can see the error like this:
def get_page(url):
    try:
        import urllib
        return urllib.open(url).read()
    except Exception as e:
        return e.args[0]
To get your expected output, do it like this:
def get_page(url):
    try:
        from urllib.request import urlopen
        return urlopen(url).read().decode('utf-8')
    except Exception as e:
        return e.args[0]
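With this version, the call from the question should print the page source (assuming the page is served as UTF-8) instead of an empty string:

print(get_page('https://www.yahoo.com/'))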
I am trying to read a table on a website. The first (initial) read is correct, but the subsequent requests in the loop are out of date (the information doesn't change even though the website does). Any suggestions?
The link shown in the code is not the actual website that I am looking at. Also, I am going through a proxy server.
I don't get an error, just out-of-date information.
Here is my code:
import time
import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd

def url_get_contents(url):
    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

link = 'https://www.w3schools.com/html/html_tables.asp'
xhtml = url_get_contents(link).decode('utf-8')
p = HTMLTableParser()
p.feed(xhtml)
stored_page = p.tables[0]

while True:
    try:
        xhtml = url_get_contents(link).decode('utf-8')
        p = HTMLTableParser()
        p.feed(xhtml)
        print('now:', p.tables[0])
        time.sleep(120)
        continue
    # To handle exceptions
    except Exception as e:
        print("error")
I'm trying to download images from a CSV with a lot of links. It works fine until some link is broken (urllib.error.HTTPError: HTTP Error 404: Not Found).
import pandas as pd
import urllib.request
import urllib.error

opener = urllib.request.build_opener()

def url_to_jpg(i, url, file_path):
    filename = "image-{}".format(i)
    full_path = "{}{}".format(file_path, filename)
    opener.addheaders = [('User-Agent', 'Chrome/5.0')]
    urllib.request.install_opener(opener)
    urllib.request.urlretrieve(url, full_path)
    print("{} Saved".format(filename))
    return None

filename = "listado.csv"
file_path = "/Users/marcelomorelli/Downloads/tapas/imagenes"
urls = pd.read_csv(filename)

for i, url in enumerate(urls.values):
    url_to_jpg(i, url[0], file_path)
Thanks!
Any idea how I can make Python continue to the next link in the list every time it gets that error?
You can use a try/except pattern and ignore the errors.
Your code would look like this:
for i, url in enumerate(urls.values):
    try:
        url_to_jpg(i, url[0], file_path)
    except Exception as e:
        print(f"Failed due to: {e}")
Reference: https://docs.python.org/3/tutorial/errors.html
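If you would rather skip only the broken links (such as the 404 in the question) and still let unexpected failures surface, a variant that catches just urllib's HTTP errors is possible; a sketch using the urllib.error module the question already imports:

import urllib.error

for i, url in enumerate(urls.values):
    try:
        url_to_jpg(i, url[0], file_path)
    except urllib.error.HTTPError as e:
        # e.g. HTTP Error 404: Not Found -- log it and move on to the next link
        print("Skipping {}: {}".format(url[0], e))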
Sorry in advance for the beginner question. I'm just learning how to access web data in Python, and I'm having trouble understanding exception handling in the requests package.
So far, when accessing web data using the urllib package, I wrap the urlopen call in a try/except structure to catch bad URLs, like this:
import urllib, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

try: uh = urllib.urlopen(url)
except:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
This is obviously kind of a crude way to do it, as it can mask all kinds of problems other than bad URLs.
From the documentation, I had sort of gathered that you could avoid the try/except structure when using the requests package, like this:
import requests, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

r = requests.get(url)
if r.raise_for_status() is not None:
    print 'Failed to open url.'
    sys.exit()

text = r.text
print text
However, this clearly doesn't work (throws an error and a traceback). What's the "right" (i.e., simple, elegant, Pythonic) way to do this?
Try catching the connection error:
from requests.exceptions import ConnectionError

try:
    requests.get('https://httpbinTYPO.org/')
except ConnectionError:
    print 'Failed to open url.'
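If you also want HTTP error codes (4xx/5xx) to count as failures, not just unreachable hosts, a common pattern is to call raise_for_status() inside the try block and catch the base RequestException (which ConnectionError and HTTPError both inherit from); a sketch in the same Python 2 style as the question:

import sys
import requests
from requests.exceptions import RequestException

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL
try:
    r = requests.get(url)
    r.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
except RequestException as e:
    print 'Failed to open url:', e
    sys.exit()
print r.text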
You can specify a kind of exception after the keyword except. So to catch just errors that come from bad connections, you can do:
import urllib, sys

url = 'https://httpbinTYPO.org/'  # Note the typo in my URL

try: uh = urllib.urlopen(url)
except IOError:
    print 'Failed to open url.'
    sys.exit()

text = uh.read()
print text
I was following a tutorial about how to use JSON objects (link: https://www.youtube.com/watch?v=Y5dU2aGHTZg). When they ran the code, they got no errors, but I did. Does it have something to do with different Python versions?
from urllib.request import urlopen
import json

def printResults(data):
    theJSON = json.loads(data)
    print(theJSON)

def main():
    urlData = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.geojson"
    webUrl = urlopen(urlData)
    print(webUrl.getcode())
    if webUrl.getcode() == 200:
        data = webUrl.read()
        printResults(data)
    else:
        print("You failed")

main()
The HTTPResponse object returned by urlopen gives you bytes (raw binary data), not str (textual data), while json.loads here expects str. You need to know (or inspect the headers to determine) the encoding used for the received data and decode it appropriately before calling json.loads.
Assuming it's UTF-8 (most websites are), you can just change:
data = webUrl.read()
to:
data = webUrl.read().decode('utf-8')
and it should fix your problem.
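If you would rather not assume UTF-8, the response headers usually name the charset, and you can fall back to UTF-8 only when they don't; a sketch using the webUrl response object from the question:

charset = webUrl.headers.get_content_charset() or 'utf-8'
data = webUrl.read().decode(charset)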
I think they were using a different version of Python's urllib. In Python 2 the import looks like this:

from urllib import urlopen

Hope this is the fix to your problem.
I have been having a persistent problem getting an rss feed from a particular website. I wound up writing a rather ugly procedure to perform this function, but I am curious why this happens and whether any higher level interfaces handle this problem properly. This problem isn't really a show stopper, since I don't need to retrieve the feed very often.
I have read a solution that traps the exception and returns the partial content, but since the incomplete reads differ in the number of bytes that are actually retrieved, I have no certainty that such a solution will actually work.
#!/usr/bin/env python
import os
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead, e:
    print "IncompleteRead using urllib2", e

# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead, e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to
# learn what's happening. Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    read_it = True
    content = ''
    while read_it:
        try:
            content += response.read(1)
        except IncompleteRead:
            read_it = False
    return content, response.info()

content, info = get_rss_feed(url)
feed = feedparser.parse(content)
As already stated, this isn't a mission-critical problem, but it is a curiosity: even though I can expect urllib2 to have this problem, I am surprised that the error is encountered in mechanize and requests as well. The feedparser module doesn't even throw an error, so checking for errors depends on the presence of a 'bozo_exception' key.
Edit: I just wanted to mention that both wget and curl perform the function flawlessly, retrieving the full payload correctly every time. I have yet to find a pure Python method that works, apart from my ugly hack, and I am very curious to know what is happening on the backend of httplib. On a lark, I decided to also try this with twill the other day and got the same httplib error.
P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload. It seems that feedparser and requests fail after reading 926 bytes, while mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without an explanation or understanding.
At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib which is where the exception is being thrown.
Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:
>>> try:
... contents = f.read()
... except httplib.IncompleteRead as e:
... contents = e.partial
...
>>> print contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:
import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead, e:
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing things wrong or whether httplib is mishandling an edge case.
I found that, in my case, sending an HTTP/1.0 request fixed the problem; I just added this to the code:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
After that I do the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I go back to HTTP/1.1 with (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
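Putting the three fragments together, here is an illustrative sketch (the function name and the post/headers placeholders are mine, not from the answer) that applies the downgrade per request and restores HTTP/1.1 in a finally block, so the process doesn't stay pinned to HTTP/1.0 if the request fails:

import httplib
import urllib2

def fetch_with_http10(url, post=None, headers=None):
    # Temporarily downgrade httplib to HTTP/1.0 for this request,
    # then restore HTTP/1.1 for connections that support 1.1.
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        req = urllib2.Request(url, post, headers or {})
        return urllib2.urlopen(req).read()
    finally:
        httplib.HTTPConnection._http_vsn = 11
        httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'

A likely reason the downgrade helps is that HTTP/1.0 does not use chunked transfer coding, and the Transfer-Encoding: chunked header in the response shown earlier suggests the chunked framing is where the read is getting truncated.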
I fixed the issue by using HTTPS instead of HTTP and it's working fine. No code change was required.