How can I read the contents of a URL with Python? - python

The following works when I paste it into the browser:
http://www.somesite.com/details.pl?urn=2344
But when I try reading the URL with Python, nothing happens:
link = 'http://www.somesite.com/details.pl?urn=2344'
f = urllib.urlopen(link)
myfile = f.readline()
print myfile
Do I need to encode the URL, or is there something I'm not seeing?

To answer your question:
import urllib
link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.read()
print(myfile)
You need to call read(), not readline(): readline() returns only the first line of the response.
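The difference is easy to see if you treat the response object like the file it emulates; here is a minimal Python 3 sketch (example.com stands in for the real URL):
from urllib.request import urlopen

f = urlopen('http://example.com/')
first_line = f.readline()  # only the first line of the body
rest = f.read()            # everything that remains
print(first_line)
print(len(rest))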
EDIT (2018-06-25): In Python 3, the legacy urllib.urlopen() was replaced by urllib.request.urlopen() (see the notes at https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen for details).
If you're using Python 3, see answers by Martin Thoma or i.n.n.m within this question:
https://stackoverflow.com/a/28040508/158111 (Python 2/3 compat)
https://stackoverflow.com/a/45886824/158111 (Python 3)
Or, just get the requests library here: http://docs.python-requests.org/en/latest/ and seriously use it :)
import requests
link = "http://www.somesite.com/details.pl?urn=2344"
f = requests.get(link)
print(f.text)
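A small, hedged addition: with requests it is also worth checking the response status before trusting the body. A sketch, reusing the question's placeholder URL:
import requests

link = "http://www.somesite.com/details.pl?urn=2344"
r = requests.get(link)
r.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
print(r.status_code)   # e.g. 200
print(r.text[:200])    # first 200 characters of the decoded body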

For Python 3 users, to save time, use the following code:
from urllib.request import urlopen
link = "https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html"
f = urlopen(link)
myfile = f.read()
print(myfile)
I know there are separate threads for the error NameError: name 'urlopen' is not defined, but I thought this might save time.

None of these answers are very good for Python 3 (tested on latest version at the time of this post).
This is how you do it...
import urllib.request
import urllib.error

try:
    with urllib.request.urlopen('http://www.python.org/') as f:
        print(f.read().decode('utf-8'))
except urllib.error.URLError as e:
    print(e.reason)
The above is for contents that return 'utf-8'. Remove .decode('utf-8') if you want the raw bytes instead of a decoded string.
Documentation:
https://docs.python.org/3/library/urllib.request.html#module-urllib.request
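And if you would rather honor whatever encoding the server declares than hard-code 'utf-8', one possible sketch (assuming the server sends a charset at all; get_content_charset() returns None when it does not):
import urllib.request

with urllib.request.urlopen('http://www.python.org/') as f:
    charset = f.headers.get_content_charset() or 'utf-8'  # fall back to utf-8
    print(f.read().decode(charset))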

A solution that works with both Python 2.x and Python 3.x makes use of the Python 2 and 3 compatibility library six:
from six.moves.urllib.request import urlopen
link = "http://www.somesite.com/details.pl?urn=2344"
response = urlopen(link)
content = response.read()
print(content)

We can read a website's HTML content as below:
from urllib.request import urlopen
response = urlopen('http://google.com/')
html = response.read()
print(html)

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Works on Python 3 and Python 2,
# when the server knows where the request is coming from.
import sys

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib import urlopen

url = urlopen('https://www.facebook.com/')
data = url.read()
url.close()
print(data)

# When the server does not know where the request is coming from.
# Works on Python 3 only.
import urllib.request

user_agent = ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) '
              'Gecko/2009021910 Firefox/3.0.7')
url = 'https://www.facebook.com/'
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)

from urllib.request import urlopen
# if the page contains non-ASCII text (e.g. Chinese), apply decode()
html = urlopen("https://blog.csdn.net/qq_39591494/article/details/83934260").read().decode('utf-8')
print(html)

from urllib.request import urlopen
from bs4 import BeautifulSoup
link = "https://www.timeshighereducation.com/hub/sinorbis"
f = urlopen(link)
soup = BeautifulSoup(f, 'html.parser')
# get the text content of the webpage
text = soup.get_text()
print(text)
Using BeautifulSoup's HTML parser, we can extract the text content of the webpage.

I used the following code:
# Python 2
import urllib

def read_text():
    quotes = urllib.urlopen("https://s3.amazonaws.com/udacity-hosted-downloads/ud036/movie_quotes.txt")
    contents_file = quotes.read()
    print contents_file

read_text()

# Retrieving data from a URL (Python 3 only).
import urllib.request

def main():
    url = "http://docs.python.org"
    # retrieve the data from the URL
    webUrl = urllib.request.urlopen(url)
    print("Result code: " + str(webUrl.getcode()))
    # print the data from the URL
    print("Returned data: -----------------")
    data = webUrl.read().decode("utf-8")
    print(data)

if __name__ == "__main__":
    main()
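One caveat this answer glosses over: urlopen() raises an exception for HTTP error statuses instead of returning them, so getcode() only ever sees successful results. A hedged sketch of handling both cases (the /no-such-page path is a made-up example):
import urllib.request
import urllib.error

try:
    webUrl = urllib.request.urlopen("http://docs.python.org/no-such-page")
    print("Result code: " + str(webUrl.getcode()))
except urllib.error.HTTPError as e:
    print("HTTP error: " + str(e.code))       # e.g. 404
except urllib.error.URLError as e:
    print("Failed to reach the server: " + str(e.reason))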

The URL should be a string:
import urllib
link = "http://www.somesite.com/details.pl?urn=2344"
f = urllib.urlopen(link)
myfile = f.readline()
print myfile

Related

Printing an HTML Output in Python

I've been creating a program with a variety of uses. I call it the Electronic Database with Direct Yield (EDDY). The thing I have had the most trouble with is EDDY's Google search capability. EDDY asks the user for an input, edits it slightly by replacing any spaces (' ') with plus signs ('+'), then goes to the resulting URL (without opening a browser). It then copies the HTML from the webpage and is SUPPOSED to give the results and descriptions from the site, without the HTML code.
This is what I have so far.
import urllib
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import requests

def cleanup(url):
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    length = len(soup.prettify()) - 1
    print(soup.prettify()[16800:length])
    print(soup.title.text)
    print(soup.body.text)

def eddysearch():
    headers = {'User-Agent': 'Chrome.exe'}
    reg_url = "http://www.google.com/search?q="
    print("Ready for query")
    query = input()
    if query != "quit":
        print("Searching for keyword: " + query)
        print("Please wait...")
        search = urllib.parse.quote_plus(query)
        url = reg_url + search
        req = Request(url=url, headers=headers)
        html = urlopen(req).read()
        cleanup(url)
        eddysearch()

eddysearch()
Can anyone help me out? Thanks in advance!
If you don't want to use an SSL certificate, you can just call .read():
# Python 2.7.x
import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()

# Python 3.x
import urllib.request
url = 'http://www.stackoverflow.com'
f = urllib.request.urlopen(url)
print(f.read())
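As for stripping the tags out (the part EDDY is really after), here is a minimal sketch with BeautifulSoup's get_text(); the Mozilla User-Agent and the html.parser choice are assumptions, and Google's result markup changes often, so treat this as a starting point rather than a finished scraper:
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.google.com/search?q=python",
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator="\n", strip=True))  # page text with tags removed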

MissingSchema: Invalid URL "/": No schema supplied

I want to retrieve data from Google links, so I followed this, but the above error comes up.
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.parse import quote
qstr = quote("postal code paris")
url_getallfolders = "https://www.google.co.in/?client=safari"+qstr
x = urlopen(url_getallfolders)
data = x.read()
import requests
response=requests.get(x)
response.content
Why are you passing x? Try checking its type with type(x): it is an HTTP client response object, not a URL: <class 'http.client.HTTPResponse'>.
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.parse import quote
qstr = quote("postal code paris")
url_getallfolders = "https://www.google.co.in/?client=safari"+qstr
x = urlopen(url_getallfolders)
data = x.read()
import requests
response = requests.get(url_getallfolders)  # x is not a URL
response.content
It is saying that you have an invalid URL.
Try just printing the URL to make sure it is being read correctly before passing it to requests.
I think you are mixing frameworks here. requests is one library and urllib.request is a different one. If you print data, you will see the correct HTML document. These are two alternative approaches, not to be mixed.
Alternative 1:
req = urllib.request.Request('https://www.google.co.in/search?q=searc',headers={'User-Agent': 'Mozilla/5.0'})
x = urllib.request.urlopen(req)
data = x.read()
print(data)
Alternative 2:
import requests
response = requests.get(req.full_url)
response.content

Python urllib is not extracting reader comments from a website

I am trying to extract reader comments from the following page with the code shown below. But the output file test.html does not contain any comments from the page. How do I get this information with Python?
http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/
from bs4 import BeautifulSoup
import urllib
import urllib.request
import urllib.parse
req =urllib.request.Request('http://www.theglobeandmail.com/opinion/it-doesnt-matter-who-won-the-debate-america-has-already-lost/article32314064/comments/')
response = urllib.request.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page, 'html.parser')
f = open('test.html', 'w')
f.write(soup.prettify())
f.close()
Thanks!
The comments are retrieved using an AJAX request, which you can mimic.
You can see there are numerous parameters, but what is below is enough to get a result; I will leave it to you to figure out how you can influence the results:
from json import loads
from urllib.request import urlopen
from urllib.parse import urlencode
data = {"categoryID":"Production",
"streamID":"32314064",
"APIKey":"2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD",
"callback" :"foo",}
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
json_dcts = loads(r.read().decode("utf-8"))["comments"]
print(json_dcts)
That gives you a list of dicts that hold all the comments, upvotes, downvotes, etc. If you want to find the API key, it is in the src URL of one of the page's scripts: src='https://cdns.gigya.com/js/socialize.js?apiKey=2_oNjjtSC8Qc250slf83cZSd4sbCzOF4cCiqGIBF8__5dWzOJY_MLAoZvds76cHeQD'; the streamID is in your original URL.
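A sketch of walking that list; "commentText" is an assumed field name here (I have not verified Gigya's schema), so the fallback prints whatever keys are actually present:
for dct in json_dcts:
    text = dct.get("commentText")  # assumed key, may not exist
    if text is not None:
        print(text)
    else:
        print(sorted(dct.keys()))  # inspect the real field names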

Try to download image from image url, but get html instead

Similar to Try to scrape image from image url (using python urllib) but get html instead, but the solution does not work for me.
from BeautifulSoup import BeautifulSoup
import urllib2
import requests
img_url='http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'
r = requests.get(img_url, allow_redirects=False)
headers = {}
headers['Referer'] = r.headers['location']
r = requests.get(img_url, headers=headers)
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
fh.write(r.content)
The downloaded file is still an HTML page, not an image.
Your referrer was not being set correctly. I have hard-coded the referrer and it works fine:
from BeautifulSoup import BeautifulSoup
import urllib2
import requests
img_url='http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'
r = requests.get(img_url, allow_redirects=False)
headers = {}
headers['Referer'] = 'http://7-themes.com/7041933-beautiful-backgrounds-wallpaper.html'
r = requests.get(img_url, headers=headers, allow_redirects=False)
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
fh.write(r.content)
I found the root cause in my code: the Referer field in the header still pointed to an HTML page, not the image. So I changed the Referer field to the img_url, and this works:
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
import requests
img_url='http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'
headers = {}
headers['Referer'] = img_url
r = requests.get(img_url, headers=headers)
with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
fh.write(r.content)
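A related sketch, not from the original answers: for larger images you may want to stream the download rather than hold the whole file in memory. Same URL and Referer trick as above:
import requests

img_url = 'http://7-themes.com/data_images/out/79/7041933-beautiful-backgrounds-wallpaper.jpg'
with requests.get(img_url, headers={'Referer': img_url}, stream=True) as r:
    with open('7041933-beautiful-backgrounds-wallpaper.jpg', 'wb') as fh:
        for chunk in r.iter_content(chunk_size=8192):  # write in 8 KiB pieces
            fh.write(chunk)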

How to read an HTML body from a web site using Python 3.x

I would like to connect to a specific web site link and receive the HTTP response.
I have this Python code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
When I pass the response as a parameter to:
BeautifulSoup(str(mystr), 'html.parser')
to get the cleaned HTML text, I get the following error:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25bc' in position 1139: character maps to <undefined>
How can I solve this problem?
Complete code:
import urllib.request
import os,sys,re,datetime
fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()
mystr = mybytes.decode(encoding=sys.stdout.encoding)
fp.close()
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(mystr), 'html.parser')
mystr = soup;
print(mystr.get_text())
BeautifulSoup is perfectly happy to consume the file-like object returned by urlopen:
from urllib.request import urlopen
from bs4 import BeautifulSoup

with urlopen("...") as website:
    soup = BeautifulSoup(website)
    print(soup.prettify())
If you use the requests library, you can avoid these complications. :)
import requests
fp = requests.get("http://www.python.org")
mystr = fp.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(mystr, 'html.parser')
mystr = soup;
print(mystr.get_text())
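Worth spelling out where the original error comes from: the UnicodeEncodeError is raised by print() when the console's codec (charmap on Windows) cannot represent a character such as '\u25bc'; BeautifulSoup itself is not at fault. A hedged sketch of printing defensively in that situation:
import sys

text = soup.get_text()  # 'soup' from the snippet above
encoding = sys.stdout.encoding or 'utf-8'
print(text.encode(encoding, errors='replace').decode(encoding))  # unprintable chars become '?'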
