I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't fetch the content of the next URL I feed in.
How do I make urllib.urlopen continuously download HTML pages?
If that is not possible, is there any other suggestion for downloading web pages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib

def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same "once-only" crawl problem also occurs with urllib2:
import urllib2

def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without exception handling, I'm getting an IOError with urllib:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
return urllib.urlopen(url).read().decode('utf8')
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 462, in open_file
return self.open_local_file(url)
File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without exception handling, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 392, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM): a non-breaking space had found its way into the second URL. Thanks for all your help and suggestions in solving the problem!
Your code is choking on .read().decode('utf8'), but you wouldn't see that, since you are just swallowing exceptions. urllib works fine "more than once":
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
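If you want to guard against this programmatically, here is a minimal sketch (an assumption on my part, matching the BOM/non-breaking-space diagnosis above, with clean_url as a hypothetical helper) that scrubs each URL byte string before opening it:

import urllib

def clean_url(url):
    # Assumption: url is a UTF-8 byte string; scrub a BOM and any
    # non-breaking spaces that can sneak in when pasting URLs around.
    return url.replace('\xef\xbb\xbf', '').replace('\xc2\xa0', '').strip()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print urllib.urlopen(clean_url(seed)).read()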
Related
I am following "The Complete Python Course: Beginner to Advanced!" on Skillshare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a web scraper with BeautifulSoup, Pillow, and io. I'm supposed to be able to search for anything on Bing, then save the pictures from the image search results to a folder on my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)

soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, it eventually fails with raise MissingSchema(error):
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the Bing page source, and the only change I have made is from the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using the r prefix in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have permission to write to the scraped_images folder, which I do, as well as to webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
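For illustration, a minimal sketch of that fix using the standard library's urljoin, which handles joining the base and the path (this snippet assumes the same item variable as your loop):

from urllib.parse import urljoin

base = "https://bing.com"  # scheme + authority
full_url = urljoin(base, item.attrs["href"])  # e.g. https://bing.com/images/search?view=...
img_obj = requests.get(full_url)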
That being said, I don't think the URL you obtained is actually the URL of the image. Inspecting Bing Images' page with Chrome's developer tools shows that each anchor (a) element with class iusc points to the preview page, while its child img element contains the actual path to the image resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})

for item in links:
    img_obj = requests.get(item.attrs["src"])
    print("getting", item.attrs["src"])
    title = item.attrs["src"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
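One caveat, in light of your second edit: the last path component of an image URL can still carry a query string (the ?w=187&h=333... part), and ? is not a legal character in Windows file names, which would explain the OSError. A hedged tweak is to cut the query string off when building the title:

title = item.attrs["src"].split("/")[-1].split("?")[0]  # drop the ?w=...&pid=... suffix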
I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup, but found out that the page itself wouldn't load without JavaScript. So I'm using some code I found on Google that uses the requests-html library:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')
But there's always an error along the lines of:
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
resp.html.render()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
return future.result()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
content = await page.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
return await frame.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
'''.strip())
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
Process finished with exit code 1
Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.
As Ivan said, here is the full code; sleep=1, keep_page=True do the trick:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))
Response:
[<title>
Milled wheat and wheat flour produced</title>]
Seems like a bug in the underlying pyppeteer library, caused by processing some JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it'll help:
resp.html.render(sleep=1, keep_page=True)
You need to load the JS, because if you don't, the HTML won't fully load. You can use Selenium.
Try Selenium.
Selenium is a library that allows programs to interact with web pages by taking control of the browser.
Here is an example in an answer to someone else's question.
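A minimal sketch of that approach (it assumes Selenium and a matching chromedriver are installed; this is an illustration, not the linked answer verbatim):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
# page_source reflects the DOM after the browser has executed the JavaScript
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find_all('h1'))
driver.quit()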
I am trying to read a list of coordinates from a text file and insert them into a URL.
Here is my code:
with open("coords.txt", "r") as txtFile:
    for line in txtFile:
        coords = line
        url = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=' + coords + '&radius=1&key=' + key
        json_obj = urllib2.urlopen(url)
        data = json.load(json_obj)
        print data['results']
When I run it, I get this error:
Traceback (most recent call last):
File "C:\Users\Vel0city\Desktop\Coding\Python\placeid.py", line 8, in <module>
json_obj = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 429, in open
req = meth(req)
File "C:\Python27\lib\urllib2.py", line 1125, in do_request_
raise URLError('no host given')
URLError: <urlopen error no host given>
I am pretty sure this is because each line read from a text file keeps its trailing newline, so when I print out the final URL with the coords concatenated to it, I get this:
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297
&radius=1&key=
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.773972,-122.431297&radius=1&key=
How would I remove this line break so it doesn't break the URL?
You can use the strip method:
coords = line.strip()
strip does nothing but remove the whitespace surrounding your string.
You can use rstrip or lstrip if you would like to strip only one side of the line.
EDIT:
As TemporalWolf mentioned in the comments, the strip method can be used to strip characters other than whitespace (which is the default).
For example, line.strip('0') would remove any leading and trailing '0' characters.
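A quick illustration of both behaviours (the values here are made up):

line = "37.773972,-122.431297\n"
print line.strip()          # -> '37.773972,-122.431297' (newline gone)
print "000420".strip("0")   # -> '42' (leading and trailing zeros removed)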
I am having trouble with the following code:
import praw
import argparse

# argument handling was here

def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for i in range(len(args.subreddits)):
        try:
            r.get_subreddit(args.subreddits[i])  # test to see if the subreddit is valid
        except:
            print "Invalid subreddit"
        else:
            submissions = r.get_subreddit(args.subreddits[i]).get_hot(limit=100)
            print [str(x) for x in submissions]

if __name__ == '__main__':
    main()
Subreddit names are taken as arguments to the program.
When an invalid name in args.subreddits is passed to get_subreddit, it should throw an exception that is caught by the code above.
When a valid subreddit name is given as an argument, the program runs fine.
But when an invalid name is given, the expected exception is not thrown; instead, the following uncaught exception is output:
Traceback (most recent call last):
File "./pyrig.py", line 33, in <module>
main()
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 434, in get_content
page_data = self.request_json(url, params=params)
File "/usr/local/lib/python2.7/dist-packages/praw/decorators.py", line 95, in wrapped
return_value = function(reddit_session, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 469, in request_json
response = self._request(url, params, data)
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 342, in _request
response = handle_redirect()
File "/usr/local/lib/python2.7/dist-packages/praw/__init__.py", line 316, in handle_redirect
url = _raise_redirect_exceptions(response)
File "/usr/local/lib/python2.7/dist-packages/praw/internal.py", line 165, in _raise_redirect_exceptions
.format(subreddit))
praw.errors.InvalidSubreddit: `soccersdsd` is not a valid subreddit
I can't tell what I am doing wrong. I have also tried rewriting the except clause as
except praw.errors.InvalidSubreddit:
which also does not work.
EDIT: exception info for PRAW can be found here
File "./pyrig.py", line 30, in main
print [str(x) for x in submissions]
The problem, as your traceback indicates, is that the exception doesn't occur when you call get_subreddit. In fact, it also doesn't occur when you call get_hot. The first is a lazy invocation that just creates a dummy Subreddit object without doing anything with it; the second returns a generator that doesn't make any requests until you actually iterate over it.
Thus you need to move the exception-handling code around your print statement (line 30), which is where the request that results in the exception is actually made.
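A minimal sketch of that rearrangement, keeping the question's PRAW 2.x-style API (the exact exception class may differ between versions):

def main():
    r = praw.Reddit(user_agent='Python Reddit Image Grabber v0.1')
    for name in args.subreddits:
        try:
            # get_subreddit is lazy and get_hot returns a generator, so the
            # HTTP request only happens when the generator is consumed below
            submissions = r.get_subreddit(name).get_hot(limit=100)
            print [str(x) for x in submissions]
        except praw.errors.InvalidSubreddit:
            print "Invalid subreddit"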
The title pretty much says it all. Here's my code:
from urllib2 import urlopen as getpage
print = getpage("www.radioreference.com/apps/audio/?ctid=5586")
and here's the traceback error I get:
Traceback (most recent call last):
File "C:/Users/**/Dropbox/Dev/ComServ/citetest.py", line 2, in <module>
contents = getpage("www.radioreference.com/apps/audio/?ctid=5586")
File "C:\Python25\lib\urllib2.py", line 121, in urlopen
return _opener.open(url, data)
File "C:\Python25\lib\urllib2.py", line 366, in open
protocol = req.get_type()
File "C:\Python25\lib\urllib2.py", line 241, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.radioreference.com/apps/audio/?ctid=5586
My best guess is that urllib can't retrieve data from untidy PHP URLs. If that is the case, is there a workaround? If not, what am I doing wrong?
You should first try adding 'http://' in front of the URL. Also, do not store the result in print, as that rebinds the name to another (non-callable) object.
So this line should be:
page_contents = getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
This returns a file-like object. To read its contents, you need to use file manipulation methods, like this:
for line in page_contents.readlines():
    print line
You need to pass a full URL, i.e. it must begin with http://.
Simply use http://www.radioreference.com/apps/audio/?ctid=5586 and it'll work fine.
In [24]: from urllib2 import urlopen as getpage
In [26]: print getpage("http://www.radioreference.com/apps/audio/?ctid=5586")
<addinfourl at 173987116 whose fp = <socket._fileobject object at 0xa5eb6ac>>
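Note that printing the object shows the file-like wrapper rather than the page; to see the HTML itself, read from it:

print getpage("http://www.radioreference.com/apps/audio/?ctid=5586").read()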