import urllib2
from bs4 import BeautifulSoup
url = 'http://www.amazon.com/dp/B00IOXUJRY'
page = BeautifulSoup(urllib2.urlopen(url))
print page
title = page.find(id='productTitle') #.text.replace('\t','').strip()
print repr(title)
If I try to get the text of this productTitle id, it returns None, although I printed the page value and checked that the title is static text, not coming from JavaScript/AJAX. I've already spent an hour on this but am unable to find the reason. Maybe I'm making a very small, silly mistake I'm not aware of?
PS: I have one more query.
There is a "Product Description" section below the "Important Information" section. This is JavaScript-generated content (I think?), so I would have to use a Selenium/PhantomJS kind of library. Is there any way to get this content with BeautifulSoup or one of Python's built-in libraries (because Selenium is too slow)?
Or any other library like mechanize, robobrowser, etc.?
You are experiencing the differences between parsers used by BeautifulSoup under the hood.
Since you haven't specified it explicitly, BeautifulSoup chooses one automatically:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
Here's a demo of what is happening:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.amazon.com/dp/B00IOXUJRY'
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
>>> print page.find(id='productTitle')
None
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html5lib')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
>>> page = BeautifulSoup(urllib2.urlopen(url), 'lxml')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
In other words, the solution is to specify the parser explicitly, either html5lib or lxml, but make sure you have these modules installed.
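Both of those parsers are third-party packages; if they are missing, a quick install fixes that:
pip install lxml html5lib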
To get the product description, you don't need to use selenium+PhantomJS approach. You can get it using BeautifulSoup:
print page.find('div', class_='productDescriptionWrapper').text.strip()
Prints:
Coffee People Donut Shop K-Cup Coffee is a medium roast coffee
reminiscent of the cup of joe that you find at classic donut counters
throughout the United States. Sweet and rich with dessert flavors in
every single cup, this classic coffee is approachable even to those
who fear coffee bitters. Sweet savory flavor set Coffee People Donut
Shop coffees apart from your average coffee blends, and now you can
enjoy this unique coffee with the convenience of single serve K-Cup
refills. Includes 24 K-Cups.
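Since not every Amazon page includes that block, a small guard avoids an AttributeError when the element is missing:
description = page.find('div', class_='productDescriptionWrapper')
if description is not None:
    print description.text.strip()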
I'm trying to find, on a website, all texts with the "strong" tag, but within only a specific part of the page, as opposed to finding all texts with the "strong" tag across the whole page.
This is the code I have so far.
for link in soup.find_all("strong"):
    file = open('destination', 'a')
    sys.stdout = file
    print(link.text)
First find the part of the page you do want (or the element marking what you don't), then use the tree-navigation methods from the docs: going up, going sideways, going back and forth, etc.
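For example, a minimal sketch that scopes the search to a single container first (the id here is hypothetical; substitute whatever element wraps the part of the page you care about). This also sidesteps redirecting sys.stdout:
container = soup.find("div", id="article-body")  # hypothetical wrapper id
if container is not None:
    with open('destination', 'a') as out:
        for strong in container.find_all("strong"):
            out.write(strong.text + '\n')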
beautifulsoup4 version 4.7.1 or newer is needed for this selector:
import requests
from bs4 import BeautifulSoup
url = 'https://www.cbssports.com/wwe/news/wwe-smackdown-results-recap-grades-kevin-owens-steals-the-show-ahead-of-extreme-rules/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for strong in soup.select('#article-main-body strong:not(#article-main-body h3:last-of-type ~ * strong)'):
    print(strong.text)
Prints:
Big fan of WWE?
Subscribe to our podcast -- State of Combat with Brian Campbell
-- where we go in depth on everything you need to know in WWE each week.
Roman Reigns def. Dolph Ziggler via pinfall:
Grade:
B+
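If your installed beautifulsoup4 is older than 4.7.1, a rough equivalent with plain tree navigation (reusing the soup from above, and assuming the same #article-main-body container) looks like this:
body = soup.select_one('#article-main-body')
last_h3 = body.find_all('h3')[-1]
# collect the <strong> tags that occur before the last <h3> in document order
wanted = {id(s) for s in last_h3.find_all_previous('strong')}
for strong in body.find_all('strong'):
    if id(strong) in wanted:
        print(strong.text)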
I'm looking to get the price of a specific article online, and I cannot seem to get the element under its tag, even though I could do it on another page of the same website. On this particular page, I only get an empty list, although printing soup.text works. I don't want to use Selenium if possible, as I'm trying to understand how BS4 works for this kind of case.
import requests
from bs4 import BeautifulSoup
url = 'https://reverb.com/p/electro-harmonix-oceans-11-reverb-2018'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
cards = soup.select(".product-row-card")
print(cards)
>>>[]
What I would like to get is the name and price of the cards on the page. I've had this problem before, and every answer I found only suggests using Selenium (which I could get to work), but I don't know why it's necessary, and I find it even less practical.
Also, is there a chance, as I've read, that the website is using JavaScript to fetch these results? If that's the case, why could I fetch the data at https://reverb.com/price-guide/effects-and-pedals but not here? Would Selenium be the only solution in that case?
You are correct that the site you're targeting relies on JavaScript to render the data you're trying to obtain. The issue is that requests does not evaluate JavaScript.
You're also correct that Selenium WebDriver is often used in these situations, as it drives a real, full-blown browser instance. But it's not the only option: requests-html has JavaScript support and is perhaps less cumbersome for simple scraping.
As an example to get you started, the following gets the title and price of the first five items on the site you're accessing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
r = session.get("https://reverb.com/p/electro-harmonix-oceans-11-reverb-2018")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for item in soup.select(".product-row-card", limit=5):
    title = item.select_one(".product-row-card__title__text").text.strip()
    price = item.select_one(".product-row-card__price__base").text.strip()
    print(f"{title}: {price}")
Result:
Electro-Harmonix EHX Oceans 11 Eleven Reverb Hall Spring Guitar Effects Pedal: $119.98
Electro-Harmonix Oceans 11 Reverb - Used: $119.99
Electro-Harmonix Oceans 11 Multifunction Digital Reverb Effects Pedal: $122
Pre-Owned Electro-Harmonix Oceans 11 Reverb Multi Effects Pedal Used: $142.27
Electro-Harmonix Oceans 11 Reverb Matte Black: $110
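Note that the first call to render() downloads a Chromium build (via pyppeteer), so the very first run takes noticeably longer; the sleep=5 just gives the page's scripts time to populate the listings before the HTML is captured.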
I am trying to scrape text from paragraphs with different id names. The text looks as follows:
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.<br><a
onclick="toggle('comTrunc1'); toggle('comFull1');return false;"
href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It's worked Very well for me. I'm sleeping I'm
eating I'm going Out in the public. Overall I'm very
satisfied.However I haven't heard anybody mention this but my feet are
very puffy and swollen is this a side effect does anyone know?<br><a
onclick="toggle('comTrunc2'); toggle('comFull2');return false;"
href="#">Hide Full Comment</a></p>
......
I am able to scrape text from one particular id, but not from all the ids at once. Can anyone help me scrape the text from all the ids? The code looks like this:
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.Hide Full Comment"
Try this. If all the paragraph ids are suffixed with 1, 2, 3, etc., as in comFull1, comFull2, comFull3, then the selector below should handle them all.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, "html.parser")  # webpage fetched as in your code above
for item in soup.select("[id^='comFull']"):
    print(item.text)
The issue you are having, as I understand it, is scraping the text of all the <p> tags in a webpage. The function you are looking for is:
soup.find_all('p')
A more comprehensive example is shown in the following docs -
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
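Since every comment paragraph in your sample HTML also carries class="comment", filtering on the class works too:
for p in soup.find_all('p', class_='comment'):
    print(p.text)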
If you want to use XPath you can use:
response.xpath("//p[contains(@id, 'comFull')]/text()").extract()
But since you are using beautiful soup you can pass a function or regular expression to find_all method as mentioned here.
Matching id's in BeautifulSoup
import re
soup.find_all('p', id=re.compile('^comFull'))
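The same match can also be expressed as a function filter, which avoids the regular expression entirely:
soup.find_all('p', id=lambda value: value and value.startswith('comFull'))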
I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply am not storing any data when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests
date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
titles = soup.find_all('h1 class="title multiline"')
print titles
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for h1 class="title multiline" elements.
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json

results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print result["title"]
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but find_all's first argument is a tag name (or a list of names), not a generic query string.
Hence, if you want to filter by tag and an attribute like class, do this:
soup.find_all('h1', {'class' : 'multiline'})
I want to make a Python list of all of Vincent van Gogh's paintings out of the JSON file from a Wikipedia API call. Here is my URL that I use to make the request:
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=list%20of%20works%20by%20Vincent%20van%20Gogh&Page&prop=revisions&rvprop=content
As you can see if you open the URL in your browser, it's a huge blob of text. How can I begin to extract the titles of paintings from this massive JSON return? I have done a great deal of research before asking this question, and tried numerous methods to solve it. It would be helpful if this JSON file was a useful dictionary to work with, but I can't make sense of it. How would you extract names of paintings from this JSON file?
Instead of directly parsing the results of JSON API calls, use a python wrapper:
import wikipedia
page = wikipedia.page("List_of_works_by_Vincent_van_Gogh")
print page.links
There are also other clients and wrappers.
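If you do want to work with the raw API response, a minimal sketch of unwrapping it (this assumes the standard MediaWiki query/pages/revisions response shape; the wikitext itself still needs parsing afterwards):
import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&titles=List%20of%20works%20by%20Vincent%20van%20Gogh'
       '&prop=revisions&rvprop=content')
data = json.load(urllib2.urlopen(url))
page = data['query']['pages'].values()[0]  # the page id key is dynamic
wikitext = page['revisions'][0]['*']       # the raw wikitext of the list
print wikitext[:300]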
Alternatively, here's an option using BeautifulSoup HTML parser:
>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh"
>>> soup = BeautifulSoup(urlopen(url))
>>> table = soup.find('table', class_="wikitable")
>>> for row in table.find_all('tr')[1:]:
... print(row.find_all('td')[1].text)
...
Still Life with Cabbage and Clogs
Crouching Boy with Sickle, Black chalk and watercolor
Woman Sewing, Watercolor
Woman with White Shawl
...
Here is a quick way to get your list into a pandas DataFrame:
import pandas as pd
url = 'http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh'
df = pd.read_html(url, attrs={"class": "wikitable"})[0] # 0 is for the 1st table in this particular page
df.head()
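From there, the list of titles is one step away (the column name here is an assumption; inspect df.columns for the real header):
titles = df['Title'].tolist()  # assumed column name; check df.columns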