Python web crawler not printing the results

It is not printing out any results and gives back a strange error, as shown in the screenshot from PyCharm.
Code I wrote:
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    if page <= max_pages:
        webpage = (url) + str(page)
        source_code = requests.get(url)
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text)
        for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
            href = str(url) + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')

The warning message tells you exactly what to do to stop it from being raised: you just need to pass a parser to the BeautifulSoup object that you instantiate, e.g.
soup_format = BeautifulSoup(code_text, features='html.parser')
However, there are some more issues with your code. This line from your original post:
for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
will not match anything, because there are no <a> tags with the class s-item__image-wrapper; all tags with that class on the target page are <div>s.
I have a suggestion below that seems to capture what you're looking to scrape. It instead iterates across each <div class="s-item__image">, which acts as a wrapper around the item data you want to print. It then drills down to the first child <a> tag to get the item href, and takes the alt attribute of the item's <img> within the wrapper for the item description. I have changed the print order of these and added a trailing newline in the example below for readability.
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    if page <= max_pages:
        webpage = (url) + str(page)
        # fetch the paginated URL (base URL with the page number appended)
        source_code = requests.get(webpage)
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text, features='html.parser')
        for wrapper in soup_format.findAll('div', attrs={'class': 's-item__image'}):
            href = str(url) + wrapper.find('a').get('href')
            title = wrapper.find('img').get('alt')
            print(title)
            print(href)
            print()
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')
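If you want to sanity-check that selector logic without hitting eBay, here is a minimal sketch against a simplified, made-up fragment of the markup (the href and alt values are invented; only the class name and drilling pattern mirror the code above):

from bs4 import BeautifulSoup

# Simplified, invented markup mimicking the wrapper structure described above.
html = """
<div class="s-item__image">
  <a href="/itm/12345"><img src="case.jpg" alt="Example phone case"></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for wrapper in soup.findAll('div', attrs={'class': 's-item__image'}):
    print(wrapper.find('img').get('alt'))   # Example phone case
    print(wrapper.find('a').get('href'))    # /itm/12345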

Related

Can anyone tell me why (crux-component-title) is used and where it is taken from?

This code works, but I do not understand some things.
import requests
from bs4 import BeautifulSoup

def get_products(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    out = []
    for title in soup.select(".crux-component-title"):
        out.append(title.get_text(strip=True))
    return out

url = "https://www.consumerreports.org/cro/coffee-makers.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for category_link in soup.select("h3.crux-product-title a"):
    u = "https://www.consumerreports.org" + category_link["href"]
    print("Getting {}".format(u))
    all_data.extend(get_products(u))

for i, title in enumerate(all_data, 1):
    print("{:<5} {}".format(i, title))
I did not get why crux-component-title is used and where it came from.
The crux-component-title class comes from the page whose URL is obtained in the loop and passed to the get_products function.
This is your code:
# Loop the links found in the anchor HTML tag "a"
# that are inside the "h3" tag:
for category_link in soup.select("h3.crux-product-title a"):
    # Get the "href" value from the link:
    u = "https://www.consumerreports.org" + category_link["href"]
The following line calls the get_products function, which makes a request to the page (i.e. the URL obtained in the loop):
all_data.extend(get_products(u))
In the get_products function, the code gets the titles found on the page passed via the u parameter; those titles are contained in HTML elements with the crux-component-title class:
for title in soup.select(".crux-component-title"):
    out.append(title.get_text(strip=True))
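To see how that selector works in isolation, here is a minimal sketch with a made-up HTML fragment (the product names are invented; only the class names mirror the real pages):

from bs4 import BeautifulSoup

html = """
<h3 class="crux-product-title"><a href="/cro/coffee-makers.htm">Coffee makers</a></h3>
<div class="crux-component-title">Example coffee maker A</div>
<div class="crux-component-title">Example coffee maker B</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ".crux-component-title" is a plain CSS class selector: select() returns every
# element whose class attribute contains crux-component-title.
print([t.get_text(strip=True) for t in soup.select(".crux-component-title")])
# ['Example coffee maker A', 'Example coffee maker B']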

TypeError: must be str, not NoneType

I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here's my code
import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    page = 1
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)
        page += 1

main_spider(1)
Here's the error
href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
TypeError: must be str, not NoneType
As noted by @Shiping, your code is not indented properly... I corrected it below.
Also... link.get('href') is not returning a string in one of the cases.
import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)

main_spider(1)
For purposes of evaluating what was happening, I added several lines of code between several of your existing lines and removed the offending line (for the time being).
soup = BeautifulSoup(plain_text, "html.parser")
print('All anchor tags:', soup.findAll('a'))   ### ADDED
for link in soup.findAll("a"):
    print(type(link.get("href")), link.get("href"))   ### ADDED
The result of my additions was this (truncated for brevity):
NOTE: the first anchor does NOT have an href attribute, and thus link.get('href') has no value to return, so it returns None
[<a id="top"></a>, navigation,
search,
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg
...
To prevent the error, a possible solution would be to add a conditional check OR a try/except block to your code. I'll demo the conditional check.
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll("a"):
    if link.get('href') is None:
        continue
    else:
        href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
        print(href)
The first "a" link on the wikipedia page is
<a id="top"></a>
Therefore, link.get("href") will return None, as there is no href.
To fix this, check for None first:
if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here
Not all anchors (<a> elements) need to have an href attribute (see https://www.w3schools.com/tags/tag_a.asp):
In HTML5, the tag is always a hyperlink, but if it has no href attribute, it is only a placeholder for a hyperlink.
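As a quick self-contained demonstration of such a placeholder anchor (a minimal sketch, not from the original post):

from bs4 import BeautifulSoup

# The first anchor has no href, just like <a id="top"></a> on the Wikipedia page.
soup = BeautifulSoup('<a id="top"></a><a href="/wiki/Star_Wars">Star Wars</a>', "html.parser")
for a in soup.find_all("a"):
    print(a.get("href"))
# None
# /wiki/Star_Wars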
Actually, you already got the exception, and Python is great at handling exceptions, so why not catch it? This style is called "easier to ask forgiveness than permission" (EAFP) and is actually encouraged:
import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # The following part is new:
            try:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
            except TypeError:
                pass

main_spider(1)
Also the page = 1 and page += 1 lines can be omitted. The for page in range(1, max_pages+1): instruction is already sufficient here.
I had the same error from different code. After adding a conditional inside a function, I thought the return type was not being set properly, but what I realized was that when the condition was False, the return statement was never reached at all; a change to my indentation fixed the problem.
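A minimal sketch of how a return statement that is never reached makes a function silently hand back None (the function and data here are hypothetical):

def first_long_word(words):
    for word in words:
        if len(word) > 5:
            return word
    # If no word matches, execution falls off the end of the function and
    # Python implicitly returns None.

result = first_long_word(["a", "web", "bot"])
print("found: " + result)  # TypeError: must be str, not NoneType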
I had the same error message in a similar situation.
I was concatenating strings too, and one variable was supposed to be assigned the return value of a function.
But in one case there was no return value, so the variable was "empty" (None). This caused the same error message.
input = get_input()  # <-- make sure this always returns a value
print("input was " + input)

Python web scraping page loop

I appreciate this has been asked many times on here, but I can't seem to get it to work for me.
I've written a scraper which successfully scrapes everything I need from the first page of the site, but I can't figure out how to get it to loop through the various pages.
The URL simply increments like this: BLAH/3 + 'page=x'
I haven't been learning to code for very long, so any advice would be appreciated!
import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    "<a href='>%s'>%s</a>" % (link.get("href"), link.text)

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
    name = print(item.contents[0].text)
    address = print(item.contents[1].text.replace('.',''))
    care_type = print(item.contents[2].text)
Update:
r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3')

for page in range(10):
    r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3' + 'page=' + page)
    soup = BeautifulSoup(r.content, "html.parser")
    #print(soup.prettify())

    # String substitution for HTML
    for link in soup.find_all("a"):
        "<a href='>%s'>%s</a>" % (link.get("href"), link.text)

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})
    for item in general_data:
        name = print(item.contents[0].text)
        address = print(item.contents[1].text.replace('.',''))
        care_type = print(item.contents[2].text)
Update 2!:
import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3&page='

for page in range(10):
    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)
To loop over pages with page=x you need a for loop like this:
import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

for page in range(10):
    print('---', page, '---')

    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)
Every page can be different, and a better solution needs more information about the page. Sometimes you can get a link to the last page and then use that number instead of the hard-coded 10 in range(10); see the sketch below.
Or you can use while True to loop and break to leave the loop when there is no link to the next page. But first you would have to show the real page URL in the question.
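For the last-page idea mentioned above, a minimal sketch could look like this. It assumes, purely for illustration, that the pager contains a link with class last whose text is the final page number; that markup is hypothetical and has to be checked against the real site:

import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

# Read the first page once to find out how many pages there are.
first = BeautifulSoup(requests.get(url + '0').content, "html.parser")
last_link = first.find('a', {'class': 'last'})   # hypothetical pager markup
last_page = int(last_link.text) if last_link else 0

for page in range(last_page + 1):
    soup = BeautifulSoup(requests.get(url + str(page)).content, "html.parser")
    for item in soup.find_all('div', {'class': 'title'}):
        print(item.contents[0].text)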
EDIT: an example of how to get the link to the next page and then follow it, so you get all pages, not only 10 pages as in the previous version.
import requests
from bs4 import BeautifulSoup

# link to first page - without `page=`
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'

# only for information, not used in url
page = 0

while True:
    print('---', page, '---')

    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class' : 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.',''))
        print(item.contents[2].text)

    # link to next page
    next_page = soup.find('a', {'class': 'next'})
    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break  # exit `while True`

Why does my crawler with BeautifulSoup not show results?

This is the code that I wrote.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
        page += 1

code_search(2)
My PyCharm version is pycharm-community-5.0.3 for Mac.
It just says:
"Process finished with exit code 0"
But there should be some results if I have written the code correctly...
Please help me out here!
You have no print statements, so the program doesn't output anything.
Add some print statements. For example, to output the link, do this:
for link in soup.findAll('a', {'class': 'item-link'}):
    href = link.get('href')
    print(href)
page += 1
The answer depends on what you want to achieve with the web crawler. The first observation is that nothing is printed.
The following code prints the current URL and all the links found on it.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        print("Current URL:", url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            print("Found URL:", href)
        page += 1

code_search(2)
It is also possible to let the method return all found URLs and then print the results:
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    urls = []
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            urls.append(href)
        page += 1
    return urls

print("Found URLs:", code_search(2))

New to Python: what am I doing wrong that I'm not seeing <a> tags (links) returned with BS4?

I'm new to Python and learning it. Basically I am trying to pull all the links for my e-commerce store products, which are stored in the HTML below. I'm getting no results returned, though, and I can't figure out why not.
<h3 class="two-lines-name">
    <a title="APPLE IPOD IPOD A1199 2GB" target="_self" href="/Item/Details/APPLE-IPOD-IPOD-A1199-2GB/d1003297dbe7443c8953750f0c96c62a/400">
        APPLE IPOD IPOD A1199 2GB
    </a>
</h3>
This is my python code
import requests
from bs4 import BeautifulSoup

def my_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'www.buya.com/Store/SAM-S-LOCKER/400?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'h3 class': "two-lines-name"}):
            href = link.get('href')
            print(href)
        page += 1
    my_spider(5)
Result with no data
Process finished with exit code 0
You are not actually running the function, because the function call is inside the function itself. After you correct that, you are going to get an error, since that is not a valid URL to pass to requests. Finally, your soup.findAll('a', {'h3 class': "two-lines-name"}) is not going to find anything:
import requests
from bs4 import BeautifulSoup

def my_spider(max_pages):
    # use range from 1 to max_pages
    for i in range(1, max_pages+1):
        url = 'http://www.buya.com/Store/SAM-S-LOCKER/400?page={}'.format(i)  # add the http:// scheme
        source_code = requests.get(url)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text)
        # you want the h3 tags and to extract the href from the a tags
        for link in soup.findAll("h3", {'class': "two-lines-name"}):
            href = link.a["href"]
            print(href)

my_spider(5)  # outside the function
Output:
/Item/Details/12-FT-CHAIN-W-HOOK/cbb1eb65b100459283d15102606208c2/400
/Item/Details/12-INCH-FUSION-SUBWOOFER/534c4d677b2547fb814668b7d061df5d/400
/Item/Details/18-Gold-Chain-14K-Yellow-Gold-2-03g/0aaf2e1e5532461884cb44e786329e80/400
/Item/Details/1820-HANDMADE-STRAIGHT-RAZOR/ed0ba44f98224067b595b726bf01f5ab/400
/Item/Details/2-PAIRS-OF-POCKET-PLIERS-LEATHERMAN/410bcb9e4321426487bee7639b3cb96e/400
/Item/Details/20TH-CENTURY-FOX-Motorcycle-Helmet-RACING-HELMET/e12a75dc7e004e5aa43698c1edf87773/400
/Item/Details/30-CLUBS/a65f1cbff00c4d59ac998dee96eed98b/400
/Item/Details/30-STEEL-CHAINSAW-BLADE/daaca24ede1341c58bb0d0cd32051646/400
/Item/Details/5-GALLON-GLASS-JUG-BREWING-JUG-CHANGE-JAR/dde9b1bfea2a4a23ad93da098ffc674d/400
/Item/Details/5150-SNOWBOARDS-Snowboard-5150-155CM/bcaa07c71c8c4b499a70d34459244f75/400
/Item/Details/6-FT-STEEL-CHAIN/7c24fb1a16ac46e7b9e91f99883652f6/400
/Item/Details/6-5-CUSTOM-HUNTING-KNIFE/ffda1685b2324abe96e3fb7cb6f7f265/400
/Item/Details/95150/39cb080edd474eb6b770b26b40e3dc6b/400
/Item/Details/ACER-Monitor-P201W/ff03d9c33ca747e08e4646d2c3d5143e/400
/Item/Details/ACOUSTIC-RESEARCH-Monitor-Speakers-RESEARCH-AW825/856ff1d8beb9480d893f94d9d49a8642/400
/Item/Details/ACTIVISION-Microsoft-XBOX-360-CALL-OF-DUTY-BLACK-OPS-2-XBOX-360/aef62055b4f14e379f2eea154d162551/400
/Item/Details/ACTIVISION-Video-Game-Accessory-DJ-HERO-95837809/41e3c7f0114e497caf23d8a50fe1f547/400
/Item/Details/ACTIVISION-Video-Game-Accessory-WII-FIT/7daee2a759a54dd7a4e2b6acd37b9c3e/400
/Item/Details/AIMTECH-1911-SCOPE-MOUNT/ac69ae1c40fe4d7db8c53a8ebf842d7d/400
/Item/Details/AIRCO-TIG-WELDING-TUNGSTEN-Arc-Welder-ELECTRODE/70b9b35db0c547c29eb90e02ef60d91a/400
/Item/Details/AIWA-Portable-CD-Player-XP-SP911/75761bfff9a44093be51e4d70410bd85/400
/Item/Details/ALESSI-Gent-s-Wristwatch-KARIM-RASHID/251c3f95173f49078722b301e1d920fe/400
/Item/Details/ALL-AMERICAN-RIDER-Motorcycle-Part-SADDLE-BAGS/87634c0c08d2458ba5b84fa39c9bc3fc/400
/Item/Details/ALL-AMERICAN-RIDER-Motorcycle-Part-SADDLE-BAGS/803f6dfdc9f44326a5a52b63681779ad/400
/Item/Details/ALLY-SKATEBAORD-USED/716cec1588d9408e859718f5961e1ec6/400
/Item/Details/ALPINE-ARCHERY-Bow-FRONTIER/e73dda8034cf4cdb8ebeeebc9683b55d/400
/Item/Details/AMAZON-Tablet-KINDLE-D01100/ea9ac5b291ef487ea6f75ca328e05750/400
/Item/Details/AMAZON-Tablet-KINDLE-FIRE-D01400/ebe0e7001ac744ffa030fd153942a548/400
/Item/Details/APPLE-Computer-Accessories-A1023/6a38f60d2e034dc597043cc42282246e/400
/Item/Details/APPLE-Cell-Phone-Smart-Phone-IPHONE-5C-A1532-AT-T/cc65c513e848475c8000b6e10b6855e5/400
/Item/Details/APPLE-IPOD-IPOD-A1199-2GB/d1003297dbe7443c8953750f0c96c62a/400
...................................................
Your indentation is wrong... you're calling my_spider inside the my_spider function.
Remove the indentation on the last line and it should work fine.
