BeautifulSoup giving me many error lines when used - python

I've installed beautifulsoup (file named bs4) into my pythonproject folder which is the same folder as the python file I am running. The .py file contains the following code, and for input I am using this URL to a simple page with 1 link which the code is supposed to retrieve.
URL used as url input: http://data.pr4e.org/page1.htm
.py code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
print(tag.get('href', None))
Though I could be wrong, it appears to me that bs4 imports correctly because my IDE program suggests BeautifulSoup when I begin typing it. After all, it is installed in the same directory as the .py file. however, It spits out the following lines of error when I run it using the previously provided url:
Traceback (most recent call last):
File "C:\Users\Thomas\PycharmProjects\pythonProject\main.py", line 16, in <module>
soup = BeautifulSoup(html, 'html.parser')
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 215, in __init__
self._feed()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 241, in _feed
self.endData()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 315, in endData
self.object_was_parsed(o)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 320, in
object_was_parsed
previous_element = most_recent_element or self._most_recent_element
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1001, in __getattr__
return self.find(tag)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1238, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1259, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 516, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1560, in __init__
self.text = self._normalize_search_value(text)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1565, in _
normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value,
'match')
AttributeError: module 'collections' has no attribute 'Callable'
Process finished with exit code 1
The lines being referred to in the error messages are from files inside bs4 that were downloaded as part of it. I haven't edited any of the bs4 contained files or even touched them. Can anyone help me figure out why bs4 isn't working?

Are you using python 3.10? Looks like beautifulsoup library is using removed deprecated aliases to Collections Abstract Base Classes. More info here: https://docs.python.org/3/whatsnew/3.10.html#removed
A quick fix is to paste these 2 lines just below your imports:
import collections
collections.Callable = collections.abc.Callable

Related

MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" In SkillShare, and there is a point where my code breaks while the code in the tutorial continues just fine.
The tutorial is about making a webscraper with BeautifulSoup, Pillow, and IO. I'm supposed to be able to do a search for anything in bing, then save the pictures on the images search results to a folder in my computer.
Here's the Code:
from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO
search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})
for item in links:
img_obj = requests.get(item.attrs["href"])
print("getting", item.attrs["href"])
title = item.attrs["href"].split("/")[-1]
img = Image.open(BytesIO(img_obj.content))
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)
Whenever I run it, at the end it gives me a raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL
I tried adding
img_obj = requests.get("https://" + item.attrs["href"])
but it keeps giving me the same error.
I have gone and looked at the bing code, and the only change I have done is change the "thumb" class to "iusc". I tried using the "thumb" class as in the tutorial but then the program just runs without saving anything and eventually just finishes.
Thank you for your help
EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
img_obj = requests.get(item.attrs["href"])
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
prep = self.prepare_request(req)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
p.prepare(
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
self.prepare_url(url, params)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?
Edit 2: I followed hawschiat instructions, and I am getting a different error this time:
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
print("getting", item.attrs["href"])
KeyError: 'href'
However, if I keep the "src" attribute in the print line, I get
getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'
I tried using the 'r' character in front of the C: path, but it keeps giving me the same error. I also tried to change the forward slashes to back slashes, and putting 2 slashes in front of the C. I also made sure I have permission to write on the scrapped_images folder, which I do, as well as webscrapery.
The last line of your stack trace gives you a hint of the cause of the error. The URL scraped from the webpage is not a full URL, but rather the path to the resource.
To make it a full URL, you can simply prepend it with the scheme and authority. In your case, that would be https://bing.com.
That being said, I don't think the URL you obtained is actually the URL to the image. Inspecting Bing Image's webpage using Chrome's developer tool, we can see that the structure of the page looks something like this:
Notice that the anchor (a) element points to the preview page while its child element img contains the actual path to the resource.
With that in mind, we can rewrite your code to something like:
links = soup.findAll("img", {"class": "mimg"})
for item in links:
img_obj = requests.get(item.attrs["src"])
print("getting", item.attrs["src"])
title = item.attrs["src"].split("/")[-1]
img = Image.open(BytesIO(img_obj.content))
img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)
And this should achieve what you are trying to do.

Data scraping from a webpage with javascript using python

I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup but found out that the page itself wouldn't load without Javascript. So I'm using some code that I found off Google that use the request-html library:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')
But there's always an error along the line of:
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
resp.html.render()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
return future.result()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
content = await page.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
return await frame.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
'''.strip())
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
Process finished with exit code 1
Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.
As Ivan said, here you have full code: sleep=1, keep_page=True make the trick
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))
Response:
[<title>
Milled wheat and wheat flour produced</title>]
Seems like a bug in underlying library puppeteer, caused by processing some javascript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251, maybe it'll help.
resp.html.render(sleep=1, keep_page=True)
You need to load the JS because if you don't load it the HTML code wont load. You can use Selenium
Try Seleneum.
Seleneum is a library that allows programs to interact with web pages by taking control of the browser.
Here is an example in
an answer
to someone else's question.

Python: BeautifulSoup UnboundLocalError

I am trying to remove the HTML tags from some documents in a .txt format. However, there seems to be an error with the bs4 as far as I understand. The error that I am getting is the following:
Traceback (most recent call last):
File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
text = BeautifulSoup(file_read, "html.parser")
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
self._feed()
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
self.builder.feed(self.markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
parser.feed(markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
if not match:
UnboundLocalError: local variable 'match' referenced before assignment
And the code that I am using is the following:
import os
from bs4 import BeautifulSoup
path_to_10k = "D:/10ks/list_missing_10k/"
path_to_saved_10k = "D:/10ks/list_missing_10kp/"
list_txt = os.listdir(path_to_10k)
for name in list_txt:
file = open(path_to_10k + name, "r+", encoding="utf-8")
file_read = file.read()
text = BeautifulSoup(file_read, "html.parser")
text = text.get_text("\n")
file2 = open(path_to_saved_10k + name, "w+", encoding="utf-8")
file2.write(str(text))
file2.close()
file.close()
The thing is that I have used this method on 51320 documents and it worked just fine, however, there are a few documents which it cannot do. When I open those HTML documents they seem the same to me.. If anyone could have any indication of what could be the problem and how to fix it it would be great. Thank you!
EXAMPLE OF FILE: https://files.fm/u/2s45uafp
https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/
pip install w3lib
and
from w3lib.html import remove_tags
And then remove_tags(data) return clear data.
Here is a solution which is using regular expression for removing the HTML tags.
import re
TAG_RE = re.compile(r'<[^>]+>')
f = open("C:\Temp\Data.txt", "r")
strHtml=f.read()
def remove_Htmltags(text):
return TAG_RE.sub('', text)
strClearText=remove_Htmltags(strHtml)
print(strClearText)

Why can't I retrieve a URL from an XPath query?

I have the following script, to find an image on a page and download it:
from lxml import html
import urllib
import urllib2
url = 'http://www.example.com/pages/page0987/'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
tree = html.fromstring(data)
src = tree.xpath('/html/body/div[2]/div[4]/div/div/img/#src')
urllib.urlretrieve(src, "local-filename.jpg")
I get a webpage, access an <img> element on this page (I tr to find it using an XPath query), then I get a src attribute of this element and then try to download the image using this url from the source.
But something is wrong; Python says:
Traceback (most recent call last):
File "C:\Users\Sergey\Desktop\dlImg.py", line 15, in <module>
urllib.urlretrieve(src, "local-filename.jpg")
File "C:\Python27\lib\urllib.py", line 94, in urlretrieve
return _urlopener.retrieve(url, filename, reporthook, data)
File "C:\Python27\lib\urllib.py", line 228, in retrieve
url = unwrap(toBytes(url))
File "C:\Python27\lib\urllib.py", line 1060, in unwrap
url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
Your tree.xpath() query returns a list, not a single match. At the very least index for the first item:
urllib.urlretrieve(src[0], "local-filename.jpg")
or use a loop over the results. Take into account that the list can be empty as well (no matches found).

Why do I get a recursion error with BeautifulSoup and IDLE?

I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to remove names from the urls on a html page I downloaded. I have it working great to this point.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
print link
but when I enter this next part
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
names = link.contents[0]
fullLink = link.get('href')
print names
print fullLink
I get this error
Traceback (most recent call last):
File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
print names
File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
return self.shell.write(s, self.tags)
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)

Categories

Resources