I'm trying to build a pretty simple scraper to harvest links as part of a crawler project. I've set up the following function to do the scraping:
import requests as rq
from bs4 import BeautifulSoup

def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    text = response.text
    soup = BeautifulSoup(text)
    for a in soup.findAll('a'):
        homepageLinks.append(a['href'])
    return homepageLinks
I saved this file as "scraper2.py". When I try to run the code, I get the following error:
>>> import scraper2 as sc
>>> sc.getHomepageLinks('http://washingtonpost.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scraper2.py", line 9, in getHomepageLinks
    for a in soup.findAll('a'):
TypeError: 'NoneType' object is not callable
Now for the odd part: If I try to debug the code and just print the response, it works fine:
>>> response = rq.get('http://washingtonpost.com')
>>> text = response.text
>>> soup = BeautifulSoup(text)
>>> for a in soup.findAll('a'):
...     print(a['href'])
...
https://www.washingtonpost.com
#
#
http://www.washingtonpost.com/politics/
https://www.washingtonpost.com/opinions/
http://www.washingtonpost.com/sports/
http://www.washingtonpost.com/local/
http://www.washingtonpost.com/national/
http://www.washingtonpost.com/world/
...
If I'm reading the error messages correctly, the problem is occurring with soup.findAll, but only when findAll is part of a function. I'm sure I'm spelling it correctly (not findall or Findall, as in many of the errors on here), and I've tried a fix using lxml suggested in a previous post, but that didn't work. Does anyone have any ideas?
Try replacing your for-loop with the following. Not every <a> tag has an href attribute; a['href'] raises an error on those, while a.get("href") returns None so you can skip them:
for a in soup.findAll('a'):
    url = a.get("href")
    if url is not None:
        homepageLinks.append(url)
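Put together, a sketch of the full function with that guard in place (also naming the parser explicitly, which bs4 otherwise warns about):
import requests as rq
from bs4 import BeautifulSoup

def getHomepageLinks(page):
    homepageLinks = []
    response = rq.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')  # explicit parser avoids the bs4 warning
    for a in soup.findAll('a'):
        url = a.get('href')  # returns None instead of raising KeyError when href is missing
        if url is not None:
            homepageLinks.append(url)
    return homepageLinks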
New to Python here, and I keep running into an error when trying to set up some code to scrape data off a list of web pages.
The link to one of those pages is https://rspo.org/members/2.htm,
and I am trying to grab the information on there, like 'Membership Number', 'Category', 'Sector', 'Country', etc., and export it all into a spreadsheet.
Code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
import requests

pages = []
for i in range(1, 10):
    url = 'https://rspo.org/members/' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = soup(page.text, 'html.parser')
    member = soup.find_all("span", {"class":"current"})
And I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
    soup = soup(page.text, 'html.parser')
TypeError: 'ResultSet' object is not callable
Not sure why I am getting this error. I tried looking at other pages on Stack Overflow but nothing seemed to have a similar error to the one I get above.
The problem is that you have a name conflict: you are using the name soup in two different ways. You import the BeautifulSoup class under the alias soup, but the first pass through the loop rebinds that name to the BeautifulSoup instance the call returns. Calling an instance is shorthand for find_all(), which returns a ResultSet, and calling that ResultSet on the next pass raises the TypeError you see.
Try this instead:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests

pages = []
for i in range(1, 10):
    url = 'https://rspo.org/members/' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')
    member = soup.find_all("span", {"class":"current"})
Note that I just removed the alias from BeautifulSoup. The reason I took this approach is simple: the standard convention in Python is that class names are written in CapWords, e.g. ClassOne and BeautifulSoup, while instances are lower-case, e.g. soup. This helps avoid name conflicts, and it also makes your code more intuitive. Once you learn this, it becomes much easier to read code and to write clean code.
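To see why the alias bites, here is a minimal sketch (illustrative only) of what happens to the name soup on each pass through the loop:
from bs4 import BeautifulSoup as soup

html = '<span class="current">1</span>'
soup = soup(html, 'html.parser')  # pass 1: soup was the class; now it is a BeautifulSoup instance
soup = soup(html, 'html.parser')  # pass 2: calling an instance means find_all(), which returns a ResultSet
soup = soup(html, 'html.parser')  # pass 3: TypeError: 'ResultSet' object is not callable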
I'm doing some basic screenscraping, with BeautifulSoup. I'm fairly new to Python, and completely new to BeautifulSoup. So I might just be missing something, but I can't figure out why I'm encountering this error.
import urllib2
from BeautifulSoup import BeautifulSoup

def get_page(url):
    resp = urllib2.urlopen(url)
    rval = resp.read()
    resp.close()
    return rval

def spider_stuff(tree_str):
    lable_to_location = dict()
    soup = BeautifulSoup(tree_str)
    for tag in soup.findAll('a'):
        if tag is not None:
            print(type(tag))
            print(tag.get_text())
            print(tag.get('href'))
            lable_to_location[tag.get_text()] = tag.get('href')
        else:
            print('what?')
    return lable_to_location

print(spider_stuff(get_page('https://www.example.com/')))
I get this output:
<class 'BeautifulSoup.Tag'>
Traceback (most recent call last):
  File "spider.py", line 36, in <module>
    print(spider_stuff(get_page('https://www.example.com/')))
  File "spider.py", line 17, in spider_stuff
    print(tag.get_text())
TypeError: 'NoneType' object is not callable
Why am I getting this error?
The get_text attribute of the tag variable has a value of None, which means you can't use it to make a function call. Tags in BeautifulSoup 3 have no get_text() method (that spelling was introduced in bs4); looking up an unknown attribute on a Tag is treated as a search for a child tag of that name, which returns None when no such child exists.
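The same trap exists in bs4, where an unknown attribute name on a Tag is also treated as a child-tag search. A minimal sketch:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">link</a>', 'html.parser')
tag = soup.a

print(tag.get_text())  # works: bs4 defines get_text()
print(tag.get_textt)   # a misspelled name becomes a child-tag search and yields None
# tag.get_textt()      # calling that None raises: 'NoneType' object is not callable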
My environment:
Windows 8
Python 3.6
I have installed beautiful soup 4 using as per documentation:
pip install beautifulsoup4
I see that urllib2 does not work with my Python version, so I changed it to from urllib.request import urlopen. Additionally, I added the html.parser parameter to BeautifulSoup.
Finally, your code looks like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page(url):
    resp = urlopen(url)
    rval = resp.read()
    resp.close()
    return rval

def spider_stuff(tree_str):
    lable_to_location = dict()
    soup = BeautifulSoup(tree_str, "html.parser")
    for tag in soup.findAll('a'):
        if tag is not None:
            print(type(tag))
            print(tag.get_text())
            print(tag.get('href'))
            lable_to_location[tag.get_text()] = tag.get('href')
        else:
            print('what?')
    return lable_to_location

print(spider_stuff(get_page('https://www.example.com/')))
Output:
<class 'bs4.element.Tag'>
More information...
http://www.iana.org/domains/example
{'More information...': 'http://www.iana.org/domains/example'}
I just bought a book to show me how to scrape websites, but the first example right off the bat is not working for me. Now I am a little upset that I bought the book in the first place, but I would like to try to get it going.
In Python 3.5 my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
BsObj = BeautifulSoup(html.read())
print(bsObj.h1)*
Here is the error I am getting
Traceback (most recent call last):
  File "C:/Users/MyName/AppData/Local/Programs/Python/Python35-32/Lib/site-packages/bs4/test.py", line 5, in <module>
    BsObj = BeautifulSoup(html.read())
  File "C:\Users\MyName\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py", line 153, in __init__
    builder = builder_class()
  File "C:\Users\MyName\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\builder\_htmlparser.py", line 39, in __init__
    return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Any ideas would be super helpful.
Thanks in advance
I guess you transcribed the code from the book. bsObj is not named consistently and there is an unnecessary * after print(). It will work after you change those two things.
Also note that read() is not needed and that it's better to define the parser, otherwise you will get a warning.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bsObj = BeautifulSoup(html, 'html.parser')
print(bsObj.h1)
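If the page is unchanged from the book's example, this should print something like <h1>An Interesting Title</h1>.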
Hey, you just had some typos: BsObj, not bsObj, in the print line (and a stray * at the end).
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
BsObj = BeautifulSoup(html.read())
print(BsObj.h1)
I'm trying to parse a second set of data. I make a GET request to the Gigya status page and parse out the important part with Beautiful Soup. Then I take the returned HTML string and try to parse that with Beautiful Soup as well, but I get a markup error. The returned content is a string too, so I'm not sure why.
Error:
Traceback (most recent call last):
  File "C:\Users\Administraor\workspace\ChronoTrack\get_gigiya.py", line 17, in <module>
    soup2 = BeautifulSoup(rows)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
Code:
import requests
import sys
from bs4 import BeautifulSoup
url = ('https://console.gigya.com/site/apiStatus/getTable.ashx')
r = requests.request('GET', url)
content = str(r.content)
soup = BeautifulSoup(content)
table = soup.findAll('table')
rows = soup.findAll('tr')
rows = rows[8]
soup2 = BeautifulSoup(rows) #this is where it fails
items = soup2.findAll('td')
print items
The line soup2 = BeautifulSoup(rows) is unnecessary; rows at that point is already a Tag object, not a markup string. (That is also where the error comes from: BeautifulSoup checks its input for a read method, and on a Tag the lookup of read becomes a child-tag search that returns None, which it then tries to call.) You can simply do:
rows = rows[8]
items = rows.findAll('td')
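A sketch of the whole script with that change applied (assuming the page still returns at least nine <tr> rows):
import requests
from bs4 import BeautifulSoup

url = 'https://console.gigya.com/site/apiStatus/getTable.ashx'
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')  # parse the markup once
rows = soup.findAll('tr')
row = rows[8]                # already a Tag; no second BeautifulSoup pass needed
items = row.findAll('td')    # search within that row directly
print(items)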
I wrote the line below:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
The data is obtained by urllib.urlopen(XXX).read() in Python 2.7.
It works well when XXX is a page that consists entirely of English characters, such as http://python.org. But when it is a page with some Chinese characters, it fails.
There will be a KeyError, and [x for ...] returns an empty list.
What's more, if there is no parseOnlyThese=SoupStrainer('a'), both pages work fine.
Is there a bug in SoupStrainer?
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
data = urllib.urlopen('http://tudou.com').read()
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
gives the traceback:
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    [x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
  File "F:\ActivePython27\lib\site-packages\beautifulsoup-3.2.1-py2.7.egg\BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
There are <a> tags on that page that do not have an href attribute. Use the following instead:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a')) if x.has_key('href')]
For example, it is perfectly normal to declare a link target with <a name="something" />; you are selecting those tags too, but they do not have an href attribute, and your code fails on that.
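For reference, a bs4 version of the same guard would look something like this (has_key() is gone in bs4, and parseOnlyThese was renamed parse_only):
from bs4 import BeautifulSoup, SoupStrainer

data = '<a href="/x">link</a><a name="anchor">target</a>'
soup = BeautifulSoup(data, 'html.parser', parse_only=SoupStrainer('a'))
print([a['href'] for a in soup.find_all('a') if a.has_attr('href')])  # prints ['/x']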