BeautifulSoup web crawling: how to get a piece of text - Python

The page I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm. Specifically, I am focusing on this page right now: http://www.boxofficemojo.com/movies/?id=ironman3.htm.
For each of the movies on the first link, I want to get the Genre, Runtime, MPAA Rating, Foreign Gross, and Budget. I am having trouble getting this because there is no identifying tag on the information. What I have so far:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title, href
            get_single_item_data(href)

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    print soup.find_all("Genre: ")
    for person in soup.select('td > font > a[href^=/people/]'):
        print person.string

trade_spider(1)
So far, this retrieves all the movie titles from the original page, their links, and a list of the actors/people/directors etc. for each movie. Right now I am trying to get the genre of the movie.
I tried to approach this the same way as the
for person in soup.select('td > font > a[href^=/people/]'):
    print person.string
lines, but the genre is not a link, only plain text, so that approach is not working.
How can I get this data for each of the movies?

Find the Genre: text and get the next sibling:
soup.find(text="Genre: ").next_sibling.text
Demo:
In [1]: import requests
In [2]: from bs4 import BeautifulSoup
In [3]: response = requests.get("http://www.boxofficemojo.com/movies/?id=ironman3.htm")
In [4]: soup = BeautifulSoup(response.content)
In [5]: soup.find(text="Genre: ").next_sibling.text
Out[5]: u'Action / Adventure'
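The same pattern should extend to the other fields the question asks about. Below is a minimal sketch, assuming each field is rendered as a literal label text node (e.g. "Runtime: ") followed by a tag holding the value, exactly as in the demo above; the get_field helper and the exact label strings are my assumptions, so check them against the page source:

import requests
from bs4 import BeautifulSoup

def get_field(soup, label):
    # Find the literal label text node and return the text of the
    # element that follows it; None if the label is missing.
    node = soup.find(text=label)
    if node is not None and node.next_sibling is not None:
        return node.next_sibling.text
    return None

response = requests.get("http://www.boxofficemojo.com/movies/?id=ironman3.htm")
soup = BeautifulSoup(response.content)
for label in ("Genre: ", "Runtime: ", "MPAA Rating: ", "Production Budget: "):
    print(label + str(get_field(soup, label)))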

Related

Python web crawler not printing the results

It is not printing out any results, and PyCharm shows a strange warning.
Code I wrote:
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    if page <= max_pages:
        webpage = url + str(page)
        source_code = requests.get(url)
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text)
        for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
            href = str(url) + link.get('href')
            title = link.string
            print(href)
            print(title)
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')
The warning message tells you exactly what to do to stop it from being raised. You just need to pass a parser when you instantiate BeautifulSoup, e.g.
soup_format = BeautifulSoup(code_text, features='html.parser')
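Any supported parser silences the warning; html.parser ships with Python, while the alternatives below must be installed separately:

soup_format = BeautifulSoup(code_text, features='lxml')      # faster, requires the lxml package
soup_format = BeautifulSoup(code_text, features='html5lib')  # most lenient, requires html5lib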
However, there are some more issues with your code. This line from your original post:
for link in soup_format.findAll('a', {'class': 's-item__image-wrapper'}):
will match nothing, as there are no <a> tags with the class s-item__image-wrapper - all tags with that class in the target page are <div>s.
I have a suggestion below that seems to capture what you're looking to scrape. It instead iterates across each <div class="s-item__image">, which is something of a wrapper class around the item data you are looking to print. It then drills down to the first child <a> tag to get the item href, and takes the alt attribute of the item img within the wrapper for the item description. I have changed the print order of these and added a trailing newline for readability, made the page counter an actual loop, and pointed the request at the paginated webpage URL (the original fetched the base url every time).
import requests
from bs4 import BeautifulSoup

def webcrawler(max_pages, url):
    page = 1
    while page <= max_pages:  # 'while' (not 'if') so every page up to max_pages is crawled
        webpage = url + str(page)
        source_code = requests.get(webpage)  # request the paginated URL, not the base url
        code_text = source_code.text
        soup_format = BeautifulSoup(code_text, features='html.parser')
        for wrapper in soup_format.findAll('div', attrs={'class': 's-item__image'}):
            href = str(url) + wrapper.find('a').get('href')
            title = wrapper.find('img').get('alt')
            print(title)
            print(href)
            print()
        page += 1

webcrawler(1, 'https://www.ebay.com/b/Cell-Phone-Accessories/9394/bn_320095?_pgn=')

My BeautifulSoup spider only crawls 2 pages, not all the pages

Any help would be appreciated, as I am new to Python. I have created the web crawler below, but it doesn't crawl all the pages, just 2 pages. What changes need to be made for it to crawl all the pages?
See the def trade_spider(max_pages) loop; at the bottom I have trade_spider(18), which should loop over all pages.
Thanks for your help.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])

trade_spider(18)
Your code is working fine; it does crawl all the pages (though there are just 14 pages, not 18). It seems like you're trying to scrape street addresses; in that case the second function is unnecessary and only makes your crawler slow by calling requests.get() too many times. I've modified the code a little, and this version is faster.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        # Fixed: this variable was named 'furl', which left 'url' undefined below
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        # Changed the class value
        for link in soup.findAll('a', {'class': 'listing-results-address'}):
            # href = "http://www.zoopla.co.uk" + link.get('href')
            # title = link.string
            # get_single_item_data(href)
            address = link.get_text()
            print(address)  # just to check it is working fine
            writer.writerow([address])
        print(page)
        page += 1

# Unnecessary code
'''def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('h2', {'itemprop': 'streetAddress'}):
        address = item_name.get_text(strip=True)
        writer.writerow([address])'''

trade_spider(18)
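One design note: the file f is opened at module level and never closed, so the last rows may not be flushed to disk when the script exits. A variant of the same crawler using a with block (my adjustment, not part of the original answer):

import csv
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages, writer):
    page = 1
    while page <= max_pages:
        url = ('http://www.zoopla.co.uk/for-sale/property/nottingham/'
               '?price_max=200000&identifier=nottingham&q=Nottingham'
               '&search_source=home&radius=0&pn=' + str(page) + '&page_size=100')
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'listing-results-address'}):
            writer.writerow([link.get_text()])
        page += 1

# The with block guarantees the file is flushed and closed, even on errors.
with open('dataoutput.csv', 'w', newline="") as f:
    trade_spider(18, csv.writer(f))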

Extract links from html page

I am trying to fetch all movie/show Netflix links and their country names from here: http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html. E.g. from the page source, I want http://www.netflix.com/WiMovie/80048948, USA, etc. I have done the following, but it returns all links instead of the Netflix ones I want. I am a little new to regex. How should I go about this?
from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    ##reqlink = re.search('netflix', link.get('href'))
    ##if reqlink:
    print link.get('href')

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')
If I uncomment the lines above, I get the following error:
TypeError: expected string or buffer
What should I do?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url, stream=True)
count = 1
title = []
country = []
for line in r.iter_lines():
    if count == 746:
        soup = BeautifulSoup(line)
        for link in soup.findAll('a', href=re.compile('netflix')):
            title.append(link.get('href'))
        for link in soup.findAll('img'):
            print link.get('alt')
            country.append(link.get('alt'))
    count = count + 1
print len(title), len(country)
The previous error has been fixed. Now the only thing left is handling films with multiple countries - how do I get them together?
e.g. for 10.0 Earthquake, link = http://www.netflix.com/WiMovie/80049286, country = UK, USA.
Your code can be simplified to a couple of selects:
import requests
from bs4 import BeautifulSoup

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content)
for a in soup.select("a[href*=netflix]"):
    print(a["href"])
And for the img:
co = {"UK", "USA"}
for img in soup.select("img[alt]"):
    if img["alt"] in co:
        print(img)
As for the first question - it failed for links that didn't have an href value, so re.search received None instead of a string, hence the TypeError.
The following works:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    link_href = link.get('href')
    if link_href:
        reqlink = re.search('netflix', link_href)
        if reqlink:
            print link_href

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')
As for the second question, I would recommend keeping a dictionary mapping each movie to the list of countries it appears in; then it is easy to format it as a string the way you want.
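For instance, a minimal formatting sketch, assuming you have built a dict called movie_countries while scraping (the name and sample entry are illustrative, taken from the 10.0 Earthquake example above):

# Hypothetical data in the shape the answer recommends:
movie_countries = {'10.0 Earthquake': ['UK', 'USA']}

for movie, countries in movie_countries.items():
    # prints e.g. "10.0 Earthquake - UK, USA"
    print('{} - {}'.format(movie, ', '.join(countries)))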
I think you'd have an easier time iterating through the listing rows and using a generator to assemble the data structure you're looking for (ignore the minor differences in my code, I'm using Python 3):
from bs4 import BeautifulSoup
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/' \
      'complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content)
rows = soup.select('span[class="listings"] tr')

def get_movie_info(rows):
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            link = link['href']
            countries = [img['alt'] for img in row('img', class_='flag')]
            yield link, countries

print('\n'.join(map(str, get_movie_info(rows))))
Edit: Or if you're looking for a dict instead of a list:
def get_movie_info(rows):
    output = {}
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            name = link.text
            link = link['href']
            countries = [img['alt'] for img in row('img', class_='flag')]
            output[name or 'some_default'] = {'link': link, 'countries': countries}
    return output

print('\n'.join(map(str, get_movie_info(rows).items())))
url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url, stream=True)
count = 1
final = []
for line in r.iter_lines():
    if count == 746:
        soup = BeautifulSoup(line)
        for row in soup.findAll('tr'):
            url = row.find('a', href=re.compile('netflix'))
            if url:
                t = url.string
                u = url.get('href')
                one = []
                for country in row.findAll('img'):
                    one.append(country.get('alt'))
                final.append({'Title': t, 'Url': u, 'Countries': one})
    count = count + 1
final is the final list.
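With the 10.0 Earthquake example from earlier, an entry in final should look roughly like {'Title': u'10.0 Earthquake', 'Url': 'http://www.netflix.com/WiMovie/80049286', 'Countries': ['UK', 'USA']}.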

How should I scrape these images without errors?

I'm trying to scrape the images (or the image links) from this forum (http://www.xossip.com/showthread.php?t=1384077). I've tried Beautiful Soup 4; here is the code I tried:
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)
        for link in soup.findAll('a', {'class': 'alt1'}):
            src = link.get('src')
            print(src)
        page += 1

spider(1)
How should I correct it so that I get links of images like pzy.be/example ?
Okay, so I did this by getting all of the #post_message_* divs and then getting the images from each of those.
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
        sourcecode = requests.get(url)
        plaintext = sourcecode.text
        soup = BeautifulSoup(plaintext)
        divs = soup.findAll('div', id=lambda d: d and d.startswith('post_message_'))
        for div in divs:
            img = div.find('img')
            if img is None:
                continue  # some posts contain no image at all
            src = img['src']
            if src.startswith('http'):  # b/c it could be a smilie or something like that
                print(src)
        page += 1

spider(1)
The simplest way is to just request each page and filter the img tags:
from bs4 import BeautifulSoup
from requests import get
import re

def get_wp():
    start_url = "http://www.xossip.com/showthread.php?t=1384077&page={}"
    for i in range(73):
        r = get(start_url.format(i))
        soup = BeautifulSoup(r.content)
        for img in (tag["src"] for tag in soup.find_all("img", src=re.compile("http://pzy.be.*.jpg"))):
            yield img
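Since get_wp() is a generator, nothing happens until you iterate over it; a minimal usage example:

for img_url in get_wp():
    print(img_url)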

Why does my crawler with BeautifulSoup not show results?

This is the code that I wrote.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
        page += 1

code_search(2)
My pycharm version is pycharm-community-5.0.3 for mac.
It just says:
"Process finished with exit code 0"
But there should be some results if I have written the code correctly...
Please help me out here!
You have no print statements - so the program doesn't output anything.
Add some print statements. For example, to output each link, do this:
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            print(href)
        page += 1
The answer depends on what you want to achieve with the web crawler. The first observation is that nothing is printed.
The following code prints the URL and all links found on the URL.
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        print("Current URL:", url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            print("Found URL:", href)
        page += 1

code_search(2)
It is also possible to let the method return all found URLs and then print the results:
import requests
from bs4 import BeautifulSoup

def code_search(max_pages):
    page = 1
    urls = []
    while page <= max_pages:
        url = 'http://kindai.ndl.go.jp/search/searchResult?searchWord=朝鲜&facetOpenedNodeIds=&featureCode=&viewRestrictedList=&pageNo=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-link'}):
            href = link.get('href')
            urls.append(href)
        page += 1
    return urls

print("Found URLs:", code_search(2))
