I'm working on a web scraper project with HTMLSession. I plan to scrape Ask search engine results using a set of user-defined keywords. I have already started writing the code for my scraper; here it is:
from requests_html import HTMLSession

class Scraper():
    def scrapedata(self, tag):
        url = f'https://www.ask.com/web?q={tag}'
        s = HTMLSession()
        r = s.get(url)
        print(r.status_code)
        qlist = []
        ask = r.html.find('div.PartialSearchResults-item')
        for a in ask:
            print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first=True).text.strip())

ask = Scraper()
ask.scrapedata('ferrari')
However, when I run this code, instead of getting the list of web page titles related to the searched keyword printed in my terminal as I should, I get the following error:
[Running] python -u "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py"
200
Traceback (most recent call last):
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 19, in <module>
ask.scrapedata('ferrari')
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 15, in scrapedata
print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first = True ).text.strip())
File "C:\Python310\lib\site-packages\requests_html.py", line 212, in find
for found in self.pq(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 261, in __call__
result = self._copy(*args, parent=self, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 247, in _copy
return self.__class__(*args, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 232, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 243, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in css_to_xpath
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in <genexpr>
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 222, in selector_to_xpath
xpath = self.xpath_pseudo_element(xpath, selector.pseudo_element)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 232, in xpath_pseudo_element
raise ExpressionError('Pseudo-elements are not supported.')
cssselect.xpath.ExpressionError: Pseudo-elements are not supported.
[Done] exited with code=1 in 17.566 seconds
I don't even know what this means. I searched the Internet, but instead I came across problems related to IE7, and I don't see what that has to do with my problem, especially since I'm using Microsoft Edge as my default web browser. I hope I can count on the help of more experienced members of the community to solve this problem. Thank you from Cameroon.
Just remove the ::text pseudo-element from your selector, like this:

print(a.find('a.PartialSearchResults-item-title-link.result-link', first=True).text.strip())

and you will get the titles of your web pages.
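For completeness, here is a minimal corrected version of the whole scraper. This is a sketch: the selectors come straight from the question, and Ask.com's markup may have changed since it was written.

from requests_html import HTMLSession

class Scraper:
    def scrapedata(self, tag):
        url = f'https://www.ask.com/web?q={tag}'
        s = HTMLSession()
        r = s.get(url)
        print(r.status_code)
        results = r.html.find('div.PartialSearchResults-item')
        for a in results:
            # Find the element first (no ::text), then read its .text attribute.
            link = a.find('a.PartialSearchResults-item-title-link.result-link', first=True)
            if link is not None:
                print(link.text.strip())

ask = Scraper()
ask.scrapedata('ferrari')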
Related
I am following "The Complete Python Course: Beginner to Advanced!" on SkillShare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a web scraper with BeautifulSoup, Pillow, and IO. I'm supposed to be able to search for anything on Bing, then save the pictures from the image search results to a folder on my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}

r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, it ends by raising MissingSchema:
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the Bing page's code, and the only change I have made is to change the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial, but then the program just runs without saving anything and eventually finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using the 'r' character in front of the C: path, but it keeps giving me the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C:. I also made sure I have permission to write to the scraped_images folder, which I do, as well as to webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
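For example, urllib.parse.urljoin from the standard library handles this correctly for both relative and absolute URLs (a sketch, using the authority from your stack trace):

from urllib.parse import urljoin

# '/images/search?...' becomes 'https://bing.com/images/search?...'
full_url = urljoin("https://bing.com", item.attrs["href"])
img_obj = requests.get(full_url)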
That being said, I don't think the URL you obtained is actually the URL to the image. Inspecting Bing Images with Chrome's developer tools, we can see that the structure of each result looks something like this (simplified sketch):
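<a class="iusc" href="/images/search?view=detailV2&...">
    <img class="mimg" src="https://tse2.mm.bing.net/th/id/..." />
</a>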
Notice that the anchor (a) element points to the preview page while its child element img contains the actual path to the resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})

for item in links:
    img_obj = requests.get(item.attrs["src"])
    print("getting", item.attrs["src"])
    # Strip the query string so the filename is valid on Windows.
    title = item.attrs["src"].split("/")[-1].split("?")[0]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.
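One extra note on building the output path: pathlib sidesteps the slash and escaping issues from the question's edits (a sketch; the folder is assumed to already exist):

from pathlib import Path

out_dir = Path(r"C:\Users\user\PycharmProjects\webscrapery\scraped_images")
img.save(out_dir / title, img.format)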
I'm trying to parse some Java code in Python using ANTLRv4. I've tried to follow this post, but I get the following error:
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/antlr/proto_antlr.py", line 14, in <module>
main()
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/antlr/proto_antlr.py", line 9, in main
tree = parser.compilationUnit()
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/antlr/Java8Parser.py", line 4182, in compilationUnit
self.enterRule(localctx, 62, self.RULE_compilationUnit)
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/Parser.py", line 374, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/CommonTokenStream.py", line 62, in LT
self.lazyInit()
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/BufferedTokenStream.py", line 187, in lazyInit
self.setup()
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/BufferedTokenStream.py", line 190, in setup
self.sync(0)
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/BufferedTokenStream.py", line 112, in sync
fetched = self.fetch(n)
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/BufferedTokenStream.py", line 124, in fetch
t = self.tokenSource.nextToken()
File "/home/xxxxxxx/xxxxxxx/xxxxxxx/lib/python3.8/site-packages/antlr4/Lexer.py", line 130, in nextToken
self._tokenStartLine = self._interp.line_number
AttributeError: 'LexerATNSimulator' object has no attribute 'line_number'
I can't figure out what I'm doing wrong. The file I'm trying to parse is proper Java; it's extracted from the docker-maven-plugin package. I've tried other files, but I get the same error.
Any idea?
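For reference, my driver in proto_antlr.py follows the usual ANTLR4 Python pattern; here is a minimal sketch of that pattern (the file names are placeholders, not my exact code):

from antlr4 import FileStream, CommonTokenStream
from Java8Lexer import Java8Lexer
from Java8Parser import Java8Parser

def main():
    # Lexer and parser generated with: antlr4 -Dlanguage=Python3 Java8*.g4
    stream = FileStream("Example.java", encoding="utf-8")
    lexer = Java8Lexer(stream)
    tokens = CommonTokenStream(lexer)
    parser = Java8Parser(tokens)
    tree = parser.compilationUnit()
    print(tree.toStringTree(recog=parser))

if __name__ == "__main__":
    main()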
Actually, it was just a problem of overly aggressive refactoring... I had renamed line to line_number in my code, and the rename changed it in the installed libraries too. Changing it back to line cleared the problem.
Thanks to @Thomas Kläger for making me realize it.
I'm trying to scrape the title off of a webpage. Initially I tried using BeautifulSoup, but found out that the page itself wouldn't load without JavaScript. So I'm using some code that I found through Google that uses the requests-html library:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')
But there's always an error along the lines of:
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
resp.html.render()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
return future.result()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
content = await page.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
return await frame.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
'''.strip())
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
Process finished with exit code 1
Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.
As Ivan said, here is the full code: sleep=1, keep_page=True do the trick.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))
Response:
[<title>
Milled wheat and wheat flour produced</title>]
Seems like a bug in the underlying library pyppeteer, caused by processing some JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it'll help.
resp.html.render(sleep=1, keep_page=True)
You need to load the JavaScript, because if you don't load it, the HTML code won't fully load. You can use Selenium.
Try Selenium.
Selenium is a library that allows programs to interact with web pages by taking control of the browser.
Here is an example in an answer to someone else's question.
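For instance, a minimal sketch with Selenium (assumes Chrome and a matching chromedriver on the PATH; the URL is the one from the question):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
# The browser executes the JavaScript, so the rendered title is available.
print(driver.title)
driver.quit()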
I used to use the following script for retrieving the different location ids, to create a VSI order:
https://softlayer.github.io/python/list_packages/
Specifically:
def getAllLocations(self):
    mask = "mask[id,locations[id,name]]"
    result = self.client['SoftLayer_Location_Group_Pricing'].getAllObjects(mask=mask);
    pp(result)
Unfortunately, it now throws the following exception:
Traceback (most recent call last):
File "new.py", line 59, in <module>
main.getAllLocations()
File "new.py", line 52, in getAllLocations
result = self.client['SoftLayer_Location_Group_Pricing'].getAllObjects(mask=mask);
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 392, in call_handler
return self(name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 360, in call
return self.client.call(self.name, name, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/API.py", line 263, in call
return self.transport(request)
File "/usr/local/lib/python2.7/site-packages/SoftLayer/transports.py", line 195, in __call__
raise _ex(ex.faultCode, ex.faultString)
SoftLayer.exceptions.SoftLayerAPIError: SoftLayerAPIError(SOAP-ENV:Server): Internal Error
Is there something that needs to be changed within the function?
Nope, your code is fine. The problem is an issue with the method itself, which is not working. I am going to report the issue, or if you want, you can open a SoftLayer ticket and report it yourself.
An issue with the getAllObjects method was fixed yesterday for this service. Please try the request again.
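For anyone retrying this, a self-contained sketch of the same call (assumes SL_USERNAME and SL_API_KEY are set in the environment):

import SoftLayer
from pprint import pprint as pp

client = SoftLayer.create_client_from_env()
mask = "mask[id,locations[id,name]]"
result = client['SoftLayer_Location_Group_Pricing'].getAllObjects(mask=mask)
pp(result)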
This question already has an answer here:
Why do I get a recursion error with BeautifulSoup and IDLE?
(1 answer)
Closed 8 years ago.
This is a really bizarre error that I can't seem to figure out.
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.crummy.com/software/BeautifulSoup/bs4/doc/'
soup = BeautifulSoup(urllib2.urlopen(url))
print soup.title
This returns
<title>Beautiful Soup Documentation — Beautiful Soup 4.0.0 documentation</title>
as should be expected, but if I change it to "print soup.title.string" (which is supposed to return everything above minus the <title> tags), I get
Traceback (most recent call last):
File "C:\Users\MyName\Desktop\MyProgram\Python\test.py", line 7, in <module>
print soup.title.string
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
I've looked around and can't find anybody else experiencing this error. Any advice?
Edit: So I've tried the same code on some other pages and it worked better; google.com works, for instance. This implies it's something about the construction of the pages.
Maybe the problem is that the title contains non-ASCII characters (here, the em dash in the page title). Modify your print statement to this:
print soup.title.string.encode('ascii','ignore')
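If you want to keep the non-ASCII characters instead of dropping them, an alternative sketch (untested assumption: a plain unicode copy avoids IDLE's pickling recursion):

# Convert the NavigableString to a plain unicode object before printing.
print unicode(soup.title.string)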