I can't get an HTML page with requests - Python

I would like to fetch an HTML page and read its content. I am using requests (Python) and my code is very simple:
import requests
url = "http://www.romatoday.it"
r = requests.get(url)
print r.text
When I try this, I always get:
Connection aborted.', error(110, 'Connection timed out')
If I open the URL in a browser, everything works fine, and requests works fine with other URLs too.
I think it is something particular to "http://www.romatoday.it", but I can't figure out what the problem is. Can you help me please?

Maybe the problem is that the comma here
>> url = "http://www.romatoday,it"
should be a dot:
>> url = "http://www.romatoday.it"
I tried that and it worked for me.

Hmm.. Have you tried other packages besides 'requests'?
The code below gives the same result as your code:
import urllib
url = "http://www.romatoday.it"
r = urllib.urlopen(url)
print r.read()
Here is a picture I captured after running your code.
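For completeness: if the URL is correct and the server still times out, a common next step is to send a browser-like User-Agent and set an explicit timeout. A minimal sketch (the header value and the 10-second timeout are just illustrative assumptions):
import requests

url = "http://www.romatoday.it"
# a browser-like User-Agent sometimes helps with servers that drop
# non-browser clients; an explicit timeout fails fast instead of hanging
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers, timeout=10)
print(r.text)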

Related

How to get the HTML content of a 404 error page using Python?

I am using Python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of that custom 404 error page (the page that says something like "Page is not found")?
Current code (imports added for completeness; the snippet is from inside a loop, hence the break):
from urllib.request import Request, urlopen

try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # downloading html data
    page_html = client.read()
    # closing connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break
Have you tried the requests library?
Just install the library with pip
pip install requests
And use it like this
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
To preserve the comment that also answers the question, and because it is what I was looking for, here is a way to do this without going outside urllib:
By t.m.adam on Nov 4, 2018 at 10:07:
See HTTPError. It has a .read() method which returns the response content.
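A minimal urllib-only sketch of that suggestion, reusing the example URL from the requests answer above: HTTPError doubles as a file-like response, so its .read() returns the body of the custom error page.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

req = Request('https://stackoverflow.com/nonexistent_path',
              headers={'User-Agent': 'Mozilla/5.0'})
try:
    page_html = urlopen(req).read()
except HTTPError as e:
    print(e.code)          # 404
    page_html = e.read()   # the HTML of the error page itself
print(page_html[:200])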

Getting the redirected URL in urllib2

I have a URL, and as soon as I click on it, it redirects me to another webpage. I want to get that redirected URL in my code with urllib2.
Sample code:
link='mywebpage.com'
html = urllib2.urlopen(link).read()
Any help is much appreciated
Use the requests library; by default, Requests will perform location redirection for all verbs except HEAD:
r = requests.get('https://mywebpage.com')
Or turn off redirects:
r = requests.get('https://mywebpage.com', allow_redirects=False)
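With redirects disabled, the target URL is exposed in the Location header of the 30x response. A short sketch (the URL is the placeholder from the question):
import requests

r = requests.get('https://mywebpage.com', allow_redirects=False)
print(r.status_code)          # e.g. 301 or 302 if the server redirects
print(r.headers['Location'])  # where the server is redirecting you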

Need a solution with urllib2

I'm working with urllib2 and I need some help.
When I get the information I need from the website, it works fine, but when the info on the website changes, the result stays the same. I'm thinking I have to find a way of clearing the "cache", or maybe something like "lib.close"... I don't know. Could someone help me out with that please? Thank you.
Here is the code:
import urllib2

url = 'http://website.com'
response = urllib2.urlopen(url)
webContent = response.read()
# grab everything after the first '***' marker
string = webContent.find('***')
alert = webContent[string+11:]
# then cut at the next '***' marker
webContent = alert
string = webContent.find('***')
alert = webContent[:string]
alert = alert.replace('</strong>', ' ')
print alert
urllib2 does not do any caching; either an HTTP proxy is involved or the caching happens server-side.
Check the response headers: an X-Cache or X-Cache-Lookup header would mean that you are connected through a proxy.
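A quick way to check, sticking with urllib2 as in the question (the URL is the placeholder from the code above):
import urllib2

response = urllib2.urlopen('http://website.com')
headers = response.info()
# X-Cache / X-Cache-Lookup in the response headers point to a caching proxy
print headers.getheader('X-Cache')
print headers.getheader('X-Cache-Lookup')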

Python: urllib2 gets nothing from a URL that does exist

I'm trying to crawl my college website; I set a cookie and add headers, then:
homepage = opener.open("website")
content = homepage.read()
print content
Sometimes I can get the source code, but sometimes I get nothing at all.
I can't figure out what happened.
Is my code wrong, or is it the website?
Also, can a single geturl() call follow two or even more redirects?
redirect = urllib2.urlopen(info_url)
redirect_url = redirect.geturl()
print redirect_url
It usually returns the final URL, but sometimes gives me an intermediate one.
Rather than working around redirects with urlopen, you're probably better off using the more robust requests library: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
r = requests.get('website', allow_redirects=True)
print r.text
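With requests, the full redirect chain is also available, which covers the "two or even more redirects" part: r.url holds the final URL and r.history the intermediate responses. A small sketch (the URL is hypothetical):
import requests

r = requests.get('http://website.com')   # hypothetical URL
print(r.url)                              # final URL after all redirects
for hop in r.history:
    print(hop.status_code, hop.url)       # each intermediate redirect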

HTTP Error 401: Authorization Required while downloading a file from an HTTPS website and saving it

Basically I need a program that, given a URL, downloads a file and saves it. I know this should be easy, but there are a couple of drawbacks here...
First, it is part of a tool I'm building at work. I have everything else in place, but the URL is HTTPS; it is one of those URLs you paste into your browser and get a pop-up asking whether you want to open or save the file (.txt).
Second, I'm a beginner at this, so if there's info I'm not providing, please ask me. :)
I'm using Python 3.3, by the way.
I tried this:
import urllib.request
response = urllib.request.urlopen('https://websitewithfile.com')
txt = response.read()
print(txt)
And I get:
urllib.error.HTTPError: HTTP Error 401: Authorization Required
Any ideas? Thanks!!
You can do this easily with the requests library.
import requests
response = requests.get('https://websitewithfile.com/text.txt', verify=False, auth=('user', 'pass'))
print(response.text)
To save the file you would type:
with open('filename.txt', 'w') as fout:
    fout.write(response.text)
(I would suggest you always set verify=True in the requests.get() command.)
Doesn't the browser also ask you to sign in? Then you need to repeat the request with the added authentication, as in these questions:
Python urllib2, basic HTTP authentication, and tr.im
Equally good: Python, HTTPS GET with basic authentication
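If you want to stay with urllib on Python 3 (as in the question), basic auth can be wired up through an opener. A minimal sketch; the URL and credentials are placeholders:
import urllib.request

url = 'https://websitewithfile.com/text.txt'
# register placeholder credentials for this URL (realm=None matches any realm)
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, 'user', 'pass')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
# write the payload to disk
with open('filename.txt', 'wb') as fout:
    fout.write(response.read())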
If you don't have the Requests module, then the code below works for Python 2.6 or later. Not sure about 3.x.
import urllib
testfile = urllib.URLopener()
testfile.retrieve("https://randomsite.com/file.gz", "/local/path/to/download/file")
You can try this solution: https://github.qualcomm.com/graphics-infra/urllib-siteminder
import siteminder
import getpass

url = 'https://XYZ.dns.com'
r = siteminder.urlopen(url, getpass.getuser(), getpass.getpass(), "dns.com")
# getpass prompts for the password: Password: <Enter Your Password>
data = r.read()   # or pd.read_html(r.read()) -- needs: import pandas as pd
