I am following a tutorial to learn how to use BeautifulSoup. I am trying to remove names from the URLs on an HTML page I downloaded. I have it working great up to this point:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    print link
but when I run this next part:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()

links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
  File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
    print names
  File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
    return self.shell.write(s, self.tags)
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
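For what it's worth, the same conversion idea carries over to Python 3 and current bs4, where NavigableString subclasses str rather than unicode. A standalone sketch (the markup here is made up for illustration):

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<p><a href="/x">Jane Doe</a></p>', 'html.parser')
name = soup.a.contents[0]

# name is a NavigableString: it behaves like a string but keeps a
# reference back into the parse tree, which is what confused IDLE
print(type(name).__name__)    # NavigableString
print(isinstance(name, str))  # True

# Converting to a plain string drops the tree reference
plain = str(name)
print(plain)                  # Jane Doe
print(type(plain).__name__)   # str
```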
I've installed beautifulsoup (a folder named bs4) into my pythonproject folder, which is the same folder as the Python file I am running. The .py file contains the following code, and for input I am using this URL to a simple page with one link, which the code is supposed to retrieve.
URL used as url input: http://data.pr4e.org/page1.htm
.py code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
Though I could be wrong, it appears to me that bs4 imports correctly, because my IDE suggests BeautifulSoup when I begin typing it. After all, it is installed in the same directory as the .py file. However, it spits out the following error when I run it with the URL above:
Traceback (most recent call last):
  File "C:\Users\Thomas\PycharmProjects\pythonProject\main.py", line 16, in <module>
    soup = BeautifulSoup(html, 'html.parser')
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 215, in __init__
    self._feed()
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 241, in _feed
    self.endData()
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 315, in endData
    self.object_was_parsed(o)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 320, in object_was_parsed
    previous_element = most_recent_element or self._most_recent_element
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1001, in __getattr__
    return self.find(tag)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1238, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1259, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 516, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1560, in __init__
    self.text = self._normalize_search_value(text)
  File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1565, in _normalize_search_value
    if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
AttributeError: module 'collections' has no attribute 'Callable'
Process finished with exit code 1
The lines referenced in the error messages are from files inside bs4 that were downloaded as part of it; I haven't edited or even touched them. Can anyone help me figure out why bs4 isn't working?
Are you using Python 3.10? It looks like the beautifulsoup library is using deprecated aliases to the Collections Abstract Base Classes that were removed in 3.10. More info here: https://docs.python.org/3/whatsnew/3.10.html#removed
A quick fix is to paste these 2 lines just below your imports:
import collections
collections.Callable = collections.abc.Callable
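A minimal sketch of what that shim does (note the explicit import of collections.abc, since on 3.10+ the alias no longer exists on the collections package itself):

```python
import collections
import collections.abc

# Restore the alias that old bs4 copies expect. On Python < 3.10 the
# deprecated alias still exists, so the guard makes this a no-op there.
if not hasattr(collections, 'Callable'):
    collections.Callable = collections.abc.Callable

print(collections.Callable is collections.abc.Callable)  # True
```

The longer-term fix is to install a current beautifulsoup4 release (e.g. with pip) instead of relying on an old copy of bs4 dropped into the project folder.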
How can I set the "Unicode strings encoding declaration" in the lxml.html.clean.Cleaner module? I'm looking to read the plaintext of a website, and have used lxml in the past as a way of doing this, scraping out the html and javascript. For some pages, I'm starting to get some weird errors about encoding, but can't make sense of how to set this param correctly in the documentation.
import requests
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.html = True

url = 'http://www.princeton.edu'
r = requests.get(url)
lx = r.text.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
# lx = r.text
lxclean = cleaner.clean_html(lx)
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/clean.py", line 501, in clean_html
    doc = fromstring(html)
  File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 672, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 568, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2997, in lxml.etree.fromstring (src/lxml/lxml.etree.c:63276)
  File "parser.pxi", line 1607, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:93592)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
However, it works for other urls, like 'http://www.google.com'
This breaks because r.text is already-decoded unicode, while the page declares its own encoding, and lxml refuses unicode input that carries an encoding declaration. Use BeautifulSoup's HTML parser instead (or pass the raw bytes from r.content).
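A minimal sketch of the BeautifulSoup route, stripping script and style tags and extracting the text (the HTML is inlined here rather than fetched, and the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html><head><style>body { color: red }</style></head>
<body><script>alert('hi');</script><p>Hello, world.</p></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Remove script and style elements entirely, much like Cleaner would
for tag in soup(['script', 'style']):
    tag.decompose()

text = soup.get_text(' ', strip=True)
print(text)  # Hello, world.
```

Because bs4 handles the decoding itself, you can also feed it r.content directly and sidestep the encoding-declaration issue entirely.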
This question already has an answer here:
Why do I get a recursion error with BeautifulSoup and IDLE?
(1 answer)
Closed 8 years ago.
This is a really bizarre error that I can't seem to figure out.
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.crummy.com/software/BeautifulSoup/bs4/doc/'
soup = BeautifulSoup(urllib2.urlopen(url))
print soup.title
This returns
<title>Beautiful Soup Documentation — Beautiful Soup 4.0.0 documentation</title>
as should be expected, but if I change it to "print soup.title.string" (which is supposed to return everything above minus the html tag) I get
Traceback (most recent call last):
  File "C:\Users\MyName\Desktop\MyProgram\Python\test.py", line 7, in <module>
    print soup.title.string
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
    getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
I've looked around and can't find anybody else experiencing this error. Any advice?
Edit: So I've tried the same code on some other pages and it's worked better. google.com works for instance. This implies it's something about the construction of the pages.
Maybe the problem is that it contains non-ASCII characters. Modify your print statement to this:
print soup.title.string.encode('ascii','ignore')
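For illustration, encode(..., 'ignore') simply drops any characters the target encoding cannot represent. A standalone sketch (the title string here is made up; the question's code is Python 2, but the behavior is the same in Python 3):

```python
# A title containing a non-ASCII em dash and an accented character
title = u'Beautiful Soup \u2014 caf\xe9'

# 'ignore' silently drops what ASCII cannot represent
ascii_bytes = title.encode('ascii', 'ignore')
print(ascii_bytes)  # b'Beautiful Soup  caf'
```

This avoids the crash but loses characters; in this case it is really just working around the IDLE/NavigableString bug described in the earlier question.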
start_url=requests.get('http://www.delicious.com/golisoda')
soup=BeautifulSoup(start_url)
This code displays the following error:
Traceback (most recent call last):
  File "test2_requests.py", line 10, in <module>
    soup = BeautifulSoup(start_url)
  File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
    self.builder.prepare_markup(markup, from_encoding))
  File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
    dammit = UnicodeDammit(markup, try_encodings, is_html=True)
  File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 203, in __init__
    self._detectEncoding(markup, is_html)
  File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 373, in _detectEncoding
    xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
Use the .content of the response:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.content)
Alternatively, you can use the decoded unicode text:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.text)
See the Response content section of the documentation.
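The underlying point is that BeautifulSoup accepts markup as bytes or as a decoded string, but not a Response object. A quick illustration with inline markup (no network needed; the snippet is made up):

```python
from bs4 import BeautifulSoup

markup_bytes = b'<p>hello</p>'  # what response.content gives you
markup_text = '<p>hello</p>'    # what response.text gives you

# Both forms parse fine; bs4 decodes bytes itself via UnicodeDammit
print(BeautifulSoup(markup_bytes, 'html.parser').p.string)  # hello
print(BeautifulSoup(markup_text, 'html.parser').p.string)   # hello
```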
You probably need to use

soup = BeautifulSoup(start_url.content)

or

soup = BeautifulSoup(start_url.text)

(a requests response has no .read() method).
from BeautifulSoup import BeautifulSoup
import urllib2

data = urllib2.urlopen('http://www.delicious.com/golisoda').read()
soup = BeautifulSoup(data)
I'm writing a crawler to download static HTML pages using urllib.
The get_page function works for one cycle, but when I try to loop it, it doesn't open the content for the next URL I've fed in.
How do I make urllib.urlopen continuously download HTML pages? If that is not possible, is there any other suggestion for downloading webpages within my Python code?
My code below only returns the HTML for the first website in the seed list:
import urllib

def get_page(url):
    return urllib.urlopen(url).read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
The same crawl "once-only" problem also occurs with urllib2:
import urllib2

def get_page(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read().decode('utf8')

seed = ['http://www.pmo.gov.sg/content/pmosite/home.html',
        'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for j in seed:
    print "here"
    print get_page(j)
Without the exception handling, I'm getting an IOError with urllib:
Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 91, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 4, in get_page
    return urllib.urlopen(url).read().decode('utf8')
  File "/usr/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 462, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 476, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html'
Without the exception handling, I'm getting a ValueError with urllib2:
Traceback (most recent call last):
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 95, in <module>
    print get_page(j)
  File "/home/alvas/workspace/SingCorp/sgcrawl.py", line 7, in get_page
    response = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 392, in open
    protocol = req.get_type()
  File "/usr/lib/python2.7/urllib2.py", line 254, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http://www.pmo.gov.sg/content/pmosite/aboutpmo.html
ANSWERED:
The IOError and ValueError occurred because there was some sort of invisible Unicode character (a byte order mark or non-breaking space) in the second URL. Thanks for all your help and suggestions in solving the problem!
Your code is choking on .read().decode('utf8'), but you wouldn't see that since you are swallowing exceptions. urllib works fine "more than once":
import urllib

def get_page(url):
    return urllib.urlopen(url).read()

seeds = ['http://www.pmo.gov.sg/content/pmosite/home.html',
         'http://www.pmo.gov.sg/content/pmosite/aboutpmo.html']

for seed in seeds:
    print 'here'
    print get_page(seed)
Both of your examples work fine for me. The only explanation I can think of for your exact errors is that the second URL string contains some sort of non-printable character (a Unicode BOM, perhaps) that got filtered out when pasting the code here. Try copying the code back from this site into your file, or retyping the entire second string from scratch.
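One way to confirm (and fix) such an invisible character is to scan the URL string for code points outside printable ASCII. A sketch with a deliberately corrupted, made-up URL:

```python
# Hypothetical URL with a stray byte order mark pasted into the middle
url = 'http://www.example.com/\ufeffaboutpmo.html'

# Any code point outside printable ASCII is suspect
suspects = [(i, hex(ord(c))) for i, c in enumerate(url) if ord(c) > 126]
print(suspects)  # [(23, '0xfeff')]

# Strip the offenders and the URL opens normally again
clean = ''.join(c for c in url if ord(c) <= 126)
print(clean)  # http://www.example.com/aboutpmo.html
```

This explains both tracebacks: with the hidden character embedded, urllib failed to recognize the scheme and fell back to treating the string as a local file path, while urllib2 rejected it outright as an unknown URL type.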