Data scraping from a webpage with JavaScript using Python

I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup, but found out that the page itself wouldn't load without JavaScript. So I'm using some code that I found on Google that uses the requests-html library:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')
But there's always an error along the lines of:
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
resp.html.render()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
return future.result()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
content = await page.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
return await frame.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
'''.strip())
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
Process finished with exit code 1
Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

As Ivan said, here you have the full code: sleep=1, keep_page=True do the trick.
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
# sleep=1 gives the page's JavaScript time to finish before the HTML is captured;
# keep_page=True keeps the pyppeteer page alive so its execution context isn't destroyed mid-render.
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))
Response:
[<title>
Milled wheat and wheat flour produced</title>]

Seems like a bug in the underlying library pyppeteer, caused by processing some JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it'll help.
resp.html.render(sleep=1, keep_page=True)

You need to render the JavaScript, because if you don't, the HTML content won't load. You can use Selenium.

Try Selenium.
Selenium is a library that allows programs to interact with web pages by taking control of the browser.
Here is an example in an answer to someone else's question.
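For instance, a minimal sketch of that approach (illustrative only; it assumes Chrome and a matching chromedriver are available, and uses the URL from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium drives a real browser, so the page's JavaScript runs before we read the DOM.
driver = webdriver.Chrome()
driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
driver.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")

heading = driver.find_element(By.TAG_NAME, "h1")
print(heading.text)

driver.quit()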

Related

HTMLSession : cssselect.xpath.ExpressionError: Pseudo-elements are not supported

I'm working on a web scraper project with HTMLSession. I plan to scrape Ask search engine results using a set of user-defined keywords. I have already started writing the code for my scraper; here it is:
from requests_html import HTMLSession

class Scraper():
    def scrapedata(self, tag):
        url = f'https://www.ask.com/web?q={tag}'
        s = HTMLSession()
        r = s.get(url)
        print(r.status_code)
        qlist = []
        ask = r.html.find('div.PartialSearchResults-item')
        for a in ask:
            print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first=True).text.strip())

ask = Scraper()
ask.scrapedata('ferrari')
However, when I run this code, instead of getting the list of all the web page titles related to the searched keywords in my terminal as I should, I get the following errors:
[Running] python -u "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py"
200
Traceback (most recent call last):
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 19, in <module>
ask.scrapedata('ferrari')
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 15, in scrapedata
print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first = True ).text.strip())
File "C:\Python310\lib\site-packages\requests_html.py", line 212, in find
for found in self.pq(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 261, in __call__
result = self._copy(*args, parent=self, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 247, in _copy
return self.__class__(*args, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 232, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 243, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in css_to_xpath
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in <genexpr>
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 222, in selector_to_xpath
xpath = self.xpath_pseudo_element(xpath, selector.pseudo_element)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 232, in xpath_pseudo_element
raise ExpressionError('Pseudo-elements are not supported.')
cssselect.xpath.ExpressionError: Pseudo-elements are not supported.
[Done] exited with code=1 in 17.566 seconds
I don't even know what this means. I searched the Internet but instead came across problems related to IE7, and I don't see what that has to do with my problem, especially since I'm using Microsoft Edge as my default web browser. I hope to count on the help of more experienced members of the community to solve this. Thank you from Cameroon.
Just remove the ::text, like this: print(a.find("a.PartialSearchResults-item-title-link.result-link", first=True).text.strip()), and you will get the titles of your web pages.
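In other words, cssselect (which requests-html uses under the hood) doesn't support pseudo-elements such as ::text; drop it from the selector and read .text from the element requests-html returns. A sketch of the corrected loop:

for a in ask:
    link = a.find('a.PartialSearchResults-item-title-link.result-link', first=True)
    if link is not None:  # skip result items without a title link
        print(link.text.strip())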

MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" on SkillShare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a web scraper with BeautifulSoup, Pillow, and IO. I'm supposed to be able to search for anything on Bing, then save the pictures from the image search results to a folder on my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})
for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, at the end it gives me a raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the Bing code, and the only change I have made is changing the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using the 'r' character in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have permission to write in the scraped_images folder, which I do, as well as in webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
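For example (an illustrative one-liner, not from the original answer):

img_obj = requests.get("https://bing.com" + item.attrs["href"])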
That being said, I don't think the URL you obtained is actually the URL to the image. Inspecting Bing Images' webpage with Chrome's developer tools, we can see that each result's anchor (a) element points to the preview page, while its child img element (with class mimg) contains the actual path to the image resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})
for item in links:
img_obj = requests.get(item.attrs["src"])
print("getting", item.attrs["src"])
title = item.attrs["src"].split("/")[-1]
img = Image.open(BytesIO(img_obj.content))
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
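One caveat beyond the original answer: the OSError in Edit 2 happens because the image URL's query string ("?w=187...") ends up in the filename, and characters like ? are not allowed in Windows file names. A small guard, assuming the same variables as above:

# Drop the query string so the saved filename contains no "?" or "&"
title = item.attrs["src"].split("/")[-1].split("?")[0]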

BeautifulSoup giving me many error lines when used

I've installed BeautifulSoup (the folder named bs4) into my pythonProject folder, which is the same folder as the Python file I am running. The .py file contains the following code, and for input I am using this URL to a simple page with one link, which the code is supposed to retrieve.
URL used as url input: http://data.pr4e.org/page1.htm
.py code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
Though I could be wrong, it appears to me that bs4 imports correctly, because my IDE suggests BeautifulSoup when I begin typing it. After all, it is installed in the same directory as the .py file. However, it spits out the following error lines when I run it with the previously provided URL:
Traceback (most recent call last):
File "C:\Users\Thomas\PycharmProjects\pythonProject\main.py", line 16, in <module>
soup = BeautifulSoup(html, 'html.parser')
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 215, in __init__
self._feed()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 241, in _feed
self.endData()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 315, in endData
self.object_was_parsed(o)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 320, in
object_was_parsed
previous_element = most_recent_element or self._most_recent_element
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1001, in __getattr__
return self.find(tag)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1238, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1259, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 516, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1560, in __init__
self.text = self._normalize_search_value(text)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1565, in _
normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value,
'match')
AttributeError: module 'collections' has no attribute 'Callable'
Process finished with exit code 1
The lines referred to in the error messages are from files inside bs4 that were downloaded as part of it. I haven't edited or even touched any of the files contained in bs4. Can anyone help me figure out why bs4 isn't working?
Are you using Python 3.10? It looks like the beautifulsoup library is using deprecated aliases to the Collections Abstract Base Classes that were removed in 3.10. More info here: https://docs.python.org/3/whatsnew/3.10.html#removed
A quick fix is to paste these two lines just below your imports:
import collections
# Restore the alias that this older bs4 code still expects (removed in Python 3.10)
collections.Callable = collections.abc.Callable

Why is urllib.urlopen() only working once? - Python

I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't fetch the content of the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages?
If it is not possible, is there any other suggestion for downloading web pages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib

def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2

def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without the try/except, I'm getting an IOError with urllib:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
return urllib.urlopen(url).read().decode('utf8')
File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
return opener.open(url)
File "/usr/lib/python2.7/urllib.py", line 207, in open
return getattr(self, name)(url)
File "/usr/lib/python2.7/urllib.py", line 462, in open_file
return self.open_local_file(url)
File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without the try/except, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
print get_page(j)
File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 392, in open
protocol = req.get_type()
File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of Unicode byte order mark (BOM); a non-break space was found in the second URL. Thanks for all your help and suggestions in solving the problem!!
Your code is choking on .read().decode('utf8'), but you wouldn't see that, since you are just swallowing exceptions. urllib works fine "more than once".
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
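One quick way to check for such invisible characters (a suggestion beyond the original answer, written in the question's Python 2 style):

# repr() makes hidden bytes visible, e.g. a BOM ('\xef\xbb\xbf')
# or a non-break space ('\xc2\xa0') lurking inside a URL string.
for j in seed:
    print repr(j)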

Why is BeautifulSoup throwing this HTMLParseError?

I thought BeautifulSoup would be able to handle malformed documents, but when I sent it the source of a page, the following traceback got printed:
Traceback (most recent call last):
File "mx.py", line 7, in
s = BeautifulSoup(content)
File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
File "C:\Python26\lib\HTMLParser.py", line 108, in feed
self.goahead(0)
File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "C:\Python26\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34
Shouldn't it be able to handle this sort of thing? If it can, how do I do it? If not, is there a module that can handle malformed documents?
EDIT: here's an update. I saved the page locally using Firefox, and I tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I try to create a soup object directly from the website, it works. Here's the document that causes trouble for soup.
Worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a similar problem as yours some time ago and reverted, which fixed the problem; I'd try that.
If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
In my experience BeautifulSoup isn't that fault tolerant. I had to use it once for a small script and ran into these problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.
The problem appears to be the
contents = contents.replace(/</g, '&lt;');
in line 258, plus the similar
contents = contents.replace(/>/g, '&gt;');
in the next line.
I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be like throwing out the baby with the bathwater IMHO.
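A minimal sketch of that workaround (illustrative; it assumes content holds the page source as a string, and uses the old BeautifulSoup 3 import seen in the traceback):

import re
from BeautifulSoup import BeautifulSoup

# Neutralize the JavaScript replace(/</g, ...) calls whose bare "<" and ">"
# trip up HTMLParser, before handing the document to BeautifulSoup.
content = re.sub(r"replace\(/[<>]/", "replace(/x/", content)
soup = BeautifulSoup(content)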
