This is the code that I wrote.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
        page += 1

code_search(2)
My PyCharm version is PyCharm Community 5.0.3 for Mac.
It just says:
"Process finished with exit code 0"
But there should be some results if I have written the code correctly...
Please help me out here!
You have no print statements - so the program doesn't output anything.
Add some print statements. For example, to output each link, do this:
for link in soup.findAll('a', {'class': 'item-link'}):
    href = link.get('href')
    print(href)
page += 1
The answer depends on what you want to achieve with the web crawler. The first observation is that nothing is printed.
The following code prints the URL and all links found on the URL.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        print("Current URL:", url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            print("Found URL:", href)
        page += 1

code_search(2)
It is also possible to let the method return all found URLs and then print the results:
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    urls = []
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            urls.append(href)
        page += 1
    return urls

print("Found URLs:", code_search(2))
Related
It is not printing out any results and gives back a strange error (shown in the attached picture) when run in PyCharm.
Code I wrote:
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    if page <= max_pages:
        webpage = (url) + str(page)
        source_code = requests.get(url)
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text)
        for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
            href = str(url) + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')
The warning message tells you exactly what to do to stop it from being raised. You just need to pass a parser to the BeautifulSoup that you instantiate on line 10 e.g.
soup_format = BeautifulSoup(code_text, features='html.parser')
However, there are some more issues with your code. Line 11 from the code in your original post:
for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
will not match anything, because there are no <a> tags with the class s-item__image-wrapper - all tags with that class on the target page are <div>s.
I have a suggestion below that seems to capture what you're looking to scrape. It instead iterates over each <div class="s-item__image">, which is something of a wrapper class around the item data you are looking to print. It then drills down to the first child <a> tag to get the item href, and takes the alt attribute of the item's img within the wrapper for the item description. I have changed the print order of these and added a trailing newline in the example below for readability.
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    if page <= max_pages:
        webpage = (url) + str(page)
        source_code = requests.get(url)
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text, features='html.parser')
        for wrapper in soup_format.findAll('div', attrs={'class': 's-item__image'}):
            href = str(url) + wrapper.find('a').get('href')
            title = wrapper.find('img').get('alt')
            print(title)
            print(href)
            print()
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')
Any help would be appreciated as I am new to Python. I have created the web crawler below, but it doesn't crawl all the pages, just 2 pages. What changes need to be made for it to crawl all the pages?
See the def trade_spider(max_pages) loop; at the bottom I have trade_spider(18), which should loop over all pages.
Thanks for your help.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])

trade_spider(18)
Your code is working fine and it does crawl all the pages (though there are just 14 pages, not 18). It seems like you're trying to scrape street addresses; in that case the second function is unnecessary and only makes your crawler slow by calling requests.get() too many times. I've modified the code a little; this version is faster.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        # Changed the class value
        for link in soup.findAll('a', {'class': 'listing-results-address'}):
            # href = "http://www.zoopla.co.uk" + link.get('href')
            # title = link.string
            # get_single_item_data(href)
            address = link.get_text()
            print(address)  # Just to check it is working fine.
            writer.writerow([address])
        print(page)
        page += 1

# Unnecessary code
'''def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])'''

trade_spider(18)
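If you would rather not hard-code 18 at all, a minimal sketch (reusing the imports and writer from the script above, and assuming that a page past the end simply returns no listing-results-address links, which I have not verified) is to break out of the loop as soon as a page yields no matches:

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        links = soup.findAll('a', {'class': 'listing-results-address'})
        if not links:
            break  # assumption: an empty result set means we ran past the last page
        for link in links:
            writer.writerow([link.get_text()])
        page += 1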
I am trying to get the output from the Python script into Excel. The script works fine in Python, but when I add the csv import and the writerow call it doesn't work. It says price is not defined in writerow, and how would I write multiple items? Any help would be appreciated.
import csv
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/manchester/?identifier=manchester&q=manchester&search_source=home&radius=0&pn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('div', {'class': 'listing-details-address'}):
        address = item_name.string
        print(item_name.get_text(strip=True))
    for item_fame in soup.findAll('div', {'class': 'listing-details-price text-price'}):
        price = item_fame.string
        print(item_fame.get_text(strip=True))

writer.writerow(price)

trade_spider(1)
The object price is not defined anywhere in your script outside of the function get_single_item_data. Outside of that function your code cannot recognize any object with that name. Also, get_single_item_data does not return anything from the BeautifulSoup object. It only prints it. You should rewrite your function to be something like this:
def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    # create a list to hold the addresses
    addresses = []
    for item_name in soup.findAll('div', {'class': 'listing-details-address'}):
        address = item_name.string
        # add each address to the list
        addresses.append(address)
        print(item_name.get_text(strip=True))
    # create a list for the prices
    prices = []
    for item_fame in soup.findAll('div', {'class': 'listing-details-price text-price'}):
        price = item_fame.string
        # add each price to the list
        prices.append(price)
        print(item_fame.get_text(strip=True))
    # alter the code to return the data structure you prefer
    return [addresses, prices]
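The calling code can then unpack what the function returns and pass each pair to writer.writerow. A minimal sketch of that caller side, reusing the imports and writer from the question (and assuming each listing page yields matching address/price pairs, which may not always hold):

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/manchester/?identifier=manchester&q=manchester&search_source=home&radius=0&pn=' + str(page)
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            addresses, prices = get_single_item_data(href)
            # writerow expects a sequence, so write one row per address/price pair
            for address, price in zip(addresses, prices):
                writer.writerow([address, price])
        page += 1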
I appreciate this has been asked many times on here, but I can't seem to get it to work for me.
I've written a scraper which successfully scrapes everything I need from the first page of the site. But I can't figure out how to get it to loop through the various pages.
The URL simply increments like this: BLAH/3 + 'page=x'
I haven't been learning to code for very long, so any advice would be appreciated!
import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'
soup = BeautifulSoup(r.content, "html.parser")

# String substitution for HTML
for link in soup.find_all("a"):
    "<a href='>%s'>%s</a>" % (link.get("href"), link.text)

# Fetch and print general data from title class
general_data = soup.find_all('div', {'class': 'title'})
for item in general_data:
    name = print(item.contents[0].text)
    address = print(item.contents[1].text.replace('.', ''))
    care_type = print(item.contents[2].text)
Update:
r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3')

for page in range(10):
    r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3' + 'page=' + page)
    soup = BeautifulSoup(r.content, "html.parser")
    #print(soup.prettify())

    # String substitution for HTML
    for link in soup.find_all("a"):
        "<a href='>%s'>%s</a>" % (link.get("href"), link.text)

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        name = print(item.contents[0].text)
        address = print(item.contents[1].text.replace('.', ''))
        care_type = print(item.contents[2].text)
Update 2!:
import requests
from bs4 import BeautifulSoup

url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3&page='

for page in range(10):
    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.', ''))
        print(item.contents[2].text)
To loop over pages with page=x you need a for loop like this:
import requests
from bs4 import BeautifulSoup

url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

for page in range(10):
    print('---', page, '---')

    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.', ''))
        print(item.contents[2].text)
Every page can be different, and a better solution needs more information about the page. Sometimes you can get a link to the last page and then use that number instead of 10 in range(10).
Or you can use while True to loop and break to leave the loop when there is no link to a next page. But first you would have to show this page (the URL of the real page) in the question.
EDIT: an example of how to get the link to the next page and follow it, so you get all pages - not only 10 pages as in the previous version.
import requests
from bs4 import BeautifulSoup

# link to first page - without `page=`
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'

# only for information, not used in url
page = 0

while True:
    print('---', page, '---')

    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")

    # String substitution for HTML
    for link in soup.find_all("a"):
        print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))

    # Fetch and print general data from title class
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.', ''))
        print(item.contents[2].text)

    # link to next page
    next_page = soup.find('a', {'class': 'next'})
    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break  # exit `while True`
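For the other strategy mentioned above (reading the last page number from the pagination links on the first page), a rough sketch could look like the following; the div.pager selector is an assumption, not the site's real markup, so check the actual HTML before relying on it:

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='

# Look at the first page's pagination links to find the highest page number.
soup = BeautifulSoup(requests.get(base_url + '0').content, 'html.parser')

# Hypothetical selector: 'div.pager a' is an assumed class name.
pager_links = soup.select('div.pager a')
page_numbers = [int(a.text) for a in pager_links if a.text.strip().isdigit()]
last_page = max(page_numbers) if page_numbers else 0

for page in range(last_page + 1):
    r = requests.get(base_url + str(page))
    # ... parse each page as in the examples above ...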
The page I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm. Specifically, I am focusing on this page right now: http://www.boxofficemojo.com/movies/?id=ironman3.htm.
For each of the movies on the first link, I want to get the Genre, Runtime, MPAA Rating, Foreign Gross, and Budget. I am having trouble getting this because there is no identifying tag on the information. What I have so far:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title, href
            get_single_item_data(href)

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print soup.find_all("Genre: ")
    for person in soup.select('td > font > a[href^=/people/]'):
        print person.string

trade_spider(1)
So far, this retrieves all the titles of the movies from the original page, their link, and a list of the actors/people/directors etc. for each movie. Right now I am trying to get the Genre of the movie.
I tried to approach this in a similar way to the
"for person in soup.select('td > font > a[href^=/people/]'):
    print person.string"
lines, but the genre isn't a link, it is only text, so this approach is not working.
How can I get this data for each of the movies?
Find the Genre: text and get the next sibling:
soup.find(text="Genre: ").next_sibling.text
Demo:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: response = requests.get("http://www.boxofficemojo.com/movies/?id=ironman3.htm")
In [4]: soup = BeautifulSoup(response.content)
In [5]: soup.find(text="Genre: ").next_sibling.text
Out[5]: u'Action / Adventure'
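The same label-plus-next-sibling pattern should extend to the other fields the question asks about (Runtime, MPAA Rating, Budget), although the exact label strings below are assumptions and may need adjusting to match the page text; Foreign Gross is laid out differently on the page and may need its own lookup:

# Assumed label strings - check the actual page text if a lookup finds nothing.
for label in ["Genre: ", "Runtime: ", "MPAA Rating: ", "Production Budget: "]:
    node = soup.find(text=label)
    if node is not None and node.next_sibling is not None:
        sibling = node.next_sibling
        # the value may live in a tag (as with Genre) or be plain text
        print label, sibling.text if hasattr(sibling, 'text') else sibling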