How to scrape text from paragraphs with different id names? - Python

I am trying to scrape text from paragraphs with different id names. The HTML looks like this:
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.<br><a
onclick="toggle('comTrunc1'); toggle('comFull1');return false;"
href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It's worked Very well for me. I'm sleeping I'm
eating I'm going Out in the public. Overall I'm very
satisfied.However I haven't heard anybody mention this but my feet are
very puffy and swollen is this a side effect does anyone know?<br><a
onclick="toggle('comTrunc2'); toggle('comFull2');return false;"
href="#">Hide Full Comment</a></p>
......
I am able to scrape text from one particular id, but not from all of the ids at once. Can anyone help me scrape the text from all of the ids? My code looks like this:
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.Hide Full Comment"

Try this. If the paragraph ids are all suffixed with 1, 2, 3, etc., as in comFull1, comFull2, comFull3, then the selector below should handle them all.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'  # the page URL from the question
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
soup = BeautifulSoup(content, "html.parser")
# match every <p> whose id starts with "comFull"
for item in soup.select("[id^='comFull']"):
    print(item.text)
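(The [id^='comFull'] pattern is a CSS attribute selector: ^= matches any element whose id starts with the given prefix, so one selector covers comFull1, comFull2, comFull3, and so on.)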

The issue, as I understand it, is that you want to scrape the text of all the paragraphs, i.e. the <p> tags, in the webpage.
The function you are looking for is -
soup.find_all('p')
A more comprehensive example is shown in the following docs -
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
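For instance, a minimal sketch against the HTML from the question (the comment bodies here are shortened stand-ins):

from bs4 import BeautifulSoup

html = """
<p id="comFull1" class="comment"><strong>Comment:</strong><br>First comment text.</p>
<p id="comFull2" class="comment"><strong>Comment:</strong><br>Second comment text.</p>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all('p') returns every <p> tag, regardless of its id
for p in soup.find_all('p'):
    print(p.text)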

If you want to use XPath, you can use:
response.xpath("//p[contains(@id,'comFull')]/text()").extract()
But since you are using BeautifulSoup, you can pass a function or regular expression to the find_all method, as mentioned here:
Matching id's in BeautifulSoup
import re
soup.find_all('p', id=re.compile('^comFull'))
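Equivalently, here is a sketch of the function option: find_all accepts any callable as an attribute filter, so you can test each id directly.

# a callable filter; accepts only ids that start with "comFull"
soup.find_all('p', id=lambda value: value and value.startswith('comFull'))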

Webscraping - looking to find 'hidden stuff'

edit*: Problem has been solved. It was my mistake for not looking far enough, and for focusing on the wrong way of thinking.
I'm trying to scrape the prices from the following website: Online webshop
I can scrape everything except for the prices. When I inspect the page and look for the prices, the only thing I find is: class="hit-area__link medium--is-hidden"
Which is true :-)
How can I get the price?
btw, I'm using BeautifulSoup (in Python)
Many thanks for helping me out!
Kind regards,
Peter
Looking at the page, I saw there is a span tag with a class of "promo-price" for each product. Using the following code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is the shop page from the question
soup = BeautifulSoup(r.text, "html.parser")
product_prices = soup.find_all("span", {"class": "promo-price"})
for price in product_prices:
    # each price looks like: <span class="promo-price" data-test="price">19 <sup class="promo-price__fraction" data-test="price-fraction">58</sup>
    print(str(price.text).replace(' ', '.').replace('\n', ''))
You can obtain the product prices, which are split across the "promo-price" span and its "promo-price__fraction" child, and then strip the newline and replace the whitespace with a period.
Next time, could you copy your code into the question so we know what you have tried? :)
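If you also want the cleaned price as a number, a small follow-on sketch (assuming the cleaned string looks like "19.58"):

cleaned = str(price.text).replace(' ', '.').replace('\n', '')
value = float(cleaned)  # e.g. 19.58; raises ValueError if the format differs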

Why does BeautifulSoup return empty list on search results websites?

I'm looking to get the price of a specific article online, and I cannot seem to get the element under its tag, though I could do it on a different page of the same website. On this particular page, I only get an empty list. Printing soup.text does work. I don't want to use Selenium if possible, as I'm looking to understand how BS4 works for these kinds of cases.
import requests
from bs4 import BeautifulSoup
url = 'https://reverb.com/p/electro-harmonix-oceans-11-reverb-2018'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
cards = soup.select(".product-row-card")
print (cards)
>>>[]
What I would like to get is the name and price of the cards on the page. I've had this problem before as well, but every solution here suggests using Selenium (which I could get to work), and I don't know why; I find it even less practical.
Also, is there a chance, as I've read, that the website is using JavaScript to fetch these results? If that's the case, why could I fetch the data at https://reverb.com/price-guide/effects-and-pedals but not here? Would Selenium be the only solution in that case?
You are correct that the site you're targeting relies on JavaScript to render the data you're trying to obtain. The issue is that requests does not evaluate JavaScript.
You're also correct that Selenium WebDriver is often used in these situations, as it drives a real, full-blown browser instance. But it's not the only option: requests-html has JavaScript support and is perhaps less cumbersome for simple scraping.
As an example to get you started, the following gets the title and price of the first five items on the site you're accessing:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get("https://reverb.com/p/electro-harmonix-oceans-11-reverb-2018")
r.html.render(sleep=5)  # give the page's JavaScript time to run before parsing
soup = BeautifulSoup(r.html.raw_html, "html.parser")
for item in soup.select(".product-row-card", limit=5):
    title = item.select_one(".product-row-card__title__text").text.strip()
    price = item.select_one(".product-row-card__price__base").text.strip()
    print(f"{title}: {price}")
Result:
Electro-Harmonix EHX Oceans 11 Eleven Reverb Hall Spring Guitar Effects Pedal: $119.98
Electro-Harmonix Oceans 11 Reverb - Used: $119.99
Electro-Harmonix Oceans 11 Multifunction Digital Reverb Effects Pedal: $122
Pre-Owned Electro-Harmonix Oceans 11 Reverb Multi Effects Pedal Used: $142.27
Electro-Harmonix Oceans 11 Reverb Matte Black: $110

Cannot get BeautifulSoup to recognize tag

This simple scanner, depicted below, is designed to find the tag that displays a stock's percent change for the day on Yahoo Finance. When I examine the source code of the webpage, I can easily identify that there is only one span tag whose class is equal to what I have written below. The tag's class reads either $dataGreen if the price has gone up, or $dataRed if it has gone down.
I am using iterators in many other places on this webpage; all are formatted exactly the same way and are functional. But for some reason, no amount of tweaking here will give me a result. It is as though this tag cannot be detected.
I haven't a clue why this tag can be found with ctrl+F but not with .find_all().
Any guidance you can give me would be most appreciated. Here's my code.
import bs4 as bs
from urllib.request import urlopen
import urllib.request, urllib.error

url = 'https://finance.yahoo.com/quote/ABEO?p=ABEO'
source = urllib.request.urlopen(url, timeout=30).read()
soup = bs.BeautifulSoup(source, 'lxml')
for row in soup.find('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)"}):
    print(1)
for row in soup.find('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataRed)"}):
    print(2)
Edit:
I've saved the source to a .txt and pored through it for the tag, though I couldn't detect it with ctrl+F either. When I compare what I found in the .txt to what I had pulled from the webpage, it differs. My problem seems to be solved, but I would love for someone to explain why that worked. The class in the saved source reads:
Trsdu(0.3s) Fw(500) Fz(14px) C($dataRed)
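A likely explanation: the HTML that urlopen receives is not the same as what the browser shows after its JavaScript runs, so the exact class string differs (Fz(14px) in the raw source versus Fz(24px) in the rendered page). One way to be robust to that is to match only the color token with a regex; the pattern below is a sketch based on the two class strings in this question, not a guaranteed selector.

import re
import urllib.request
import bs4 as bs

url = 'https://finance.yahoo.com/quote/ABEO?p=ABEO'
source = urllib.request.urlopen(url, timeout=30).read()
soup = bs.BeautifulSoup(source, 'lxml')

# bs4 matches a regex against each individual class token, so this finds the
# span carrying C($dataGreen) or C($dataRed) whatever the other tokens are
row = soup.find('span', class_=re.compile(r'C\(\$data(Green|Red)\)'))
if row is not None:
    print(row.text)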

find_all() not finding any results when using Beautiful Soup + Requests

I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply am not storing any data when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests
date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
titles = soup.find_all('h1 class="title multiline"')
print titles
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for h1 class="title multiline" elements.
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json

results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print result["title"]
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but find_all only searches by tag name; it's not a generic search function.
Hence, if you want to filter by tag and an attribute like class, do this:
soup.find_all('h1', {'class' : 'multiline'})
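Note that BeautifulSoup matches the class filter against each of an element's CSS classes individually, which is why 'multiline' alone works even though the attribute value is "title multiline". A quick sketch:

from bs4 import BeautifulSoup

html = '<h1 class="title multiline">Headline</h1>'
soup = BeautifulSoup(html, "html.parser")
# either class name alone matches the element
print(soup.find_all('h1', {'class': 'multiline'}))
print(soup.find_all('h1', {'class': 'title'}))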

BeautifulSoup find function unusual behaviour

import urllib2
from bs4 import BeautifulSoup
url = 'http://www.amazon.com/dp/B00IOXUJRY'
page = BeautifulSoup(urllib2.urlopen(url))
print page
title = page.find(id='productTitle') #.text.replace('\t','').strip()
print repr(title)
If I try to get the text of this productTitle id, it returns None, although I printed the page value and checked whether this is static text or coming from JavaScript/AJAX. I've already spent an hour on this but am unable to find the reason. Maybe I'm making a very small, silly mistake that I'm not aware of?
PS: I have one more query.
There is a "product description" section below the "Important Information" section. This is JavaScript-generated content (I think so??), so I would have to use a Selenium/PhantomJS kind of library. Is there any way to get this content with BeautifulSoup or Python's built-in libraries (because Selenium is too slow)?
Any other library, like mechanize or RoboBrowser, etc.?
You are experiencing the differences between parsers used by BeautifulSoup under the hood.
Since you haven't specified it explicitly, BeautifulSoup chooses one automatically:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
Here's the demo of what is happening:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.amazon.com/dp/B00IOXUJRY'
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
>>> print page.find(id='productTitle')
None
>>> page = BeautifulSoup(urllib2.urlopen(url), 'html5lib')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
>>> page = BeautifulSoup(urllib2.urlopen(url), 'lxml')
>>> print page.find(id='productTitle')
<span class="a-size-large" id="productTitle">Keurig, The Original Donut Shop, K-Cup packs (Regular - Medium Roast Extra Bold, 24 Count)</span>
In other words, the solution is to explicitly specify the parser, either html5lib or lxml, but make sure you have these modules installed.
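If you're unsure which parsers your environment has, here is a minimal sketch (Python 3) that simply tries to construct a soup with each one:

from bs4 import BeautifulSoup, FeatureNotFound

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        BeautifulSoup("<p>test</p>", parser)
        print(parser, "is available")
    except FeatureNotFound:
        print(parser, "is NOT installed")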
To get the product description, you don't need the selenium+PhantomJS approach. You can get it using BeautifulSoup:
print page.find('div', class_='productDescriptionWrapper').text.strip()
Prints:
Coffee People Donut Shop K-Cup Coffee is a medium roast coffee
reminiscent of the cup of joe that you find at classic donut counters
throughout the United States. Sweet and rich with dessert flavors in
every single cup, this classic coffee is approachable even to those
who fear coffee bitters. Sweet savory flavor set Coffee People Donut
Shop coffees apart from your average coffee blends, and now you can
enjoy this unique coffee with the convenience of single serve K-Cup
refills. Includes 24 K-Cups.
