Web scraping - looking to find 'hidden stuff' - Python

Edit: the problem has been solved. It was my own mistake: I wasn't looking far enough and had been stuck in the wrong way of thinking.
I'm trying to scrape the prices from the following website: Online webshop
I can scrape everything except the prices. When I inspect the page and look for the prices, the only thing I find is: class="hit-area__link medium--is-hidden"
Which is true :-)
How can I get the price?
btw, I'm using Beautifulsoup (in Python)
Many thanks for helping me out!
Kind regards,
Peter

Looking at the page, I saw there is a span tag with a class of "promo-price" for each product. Using the following code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # the shop URL from the question
soup = BeautifulSoup(r.text, "html.parser")
product_prices = soup.find_all("span", {"class": "promo-price"})
for price in product_prices:
    print(price)  # <span class="promo-price" data-test="price">19 <sup class="promo-price__fraction" data-test="price-fraction">58</sup></span>
    print(str(price.text).replace(' ', '.').replace('\n', ''))
This obtains the product prices, which are split across the "price" and "price-fraction" parts; the code then strips the newline and replaces the whitespace with a period.
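If you need the price as a number rather than a string, here is a minimal sketch building on the same soup (assuming every promo-price span follows the structure shown in the comment above; spans without a fraction part fall back to "00"):
for price in product_prices:
    whole = price.contents[0].strip()  # the integer part, e.g. "19"
    fraction_tag = price.find(class_="promo-price__fraction")
    fraction = fraction_tag.text.strip() if fraction_tag else "00"
    print(float(f"{whole}.{fraction}"))  # e.g. 19.58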
Next time, could you copy your code into the question so we know what you have tried? :)

Related

Not able to scrape data in "div" class on WSJ pages

I am trying to scrape text content from articles on the WSJ site. For example, consider the following HTML source:
<div class="article-content ">
<p>BEIRUT—
Carlos Ghosn,
who is seeking to clear his name in Lebanon, would face a very different path to vindication here, where endemic corruption and the former auto executive’s widespread popularity could influence the outcome of a potential trial. </p> <p>Mr. Ghosn, the former chief of auto makers
I am using the following code:
import requests
from bs4 import BeautifulSoup

res = requests.get(url)
html = BeautifulSoup(res.text, "lxml")
classid = "article-content "  # note the trailing space, copied from the page source
item = html.find_all("div", {"class": classid})
This returns an empty result. I saw a few other posts where people suggested adding delays and similar fixes, but these are not working in my case. I plan on using the scraped text for some ML projects.
I have a subscription to WSJ and am logged in when running the above script.
Any help with this will be much appreciated! Thanks
Your code worked fine for me. Just make sure that you are searching for the correct 'classid'. I don't think this will make a difference, but you can try using this as an alternative:
item = html.find_all("div", class_ = classid)
One thing you can do is confirm the element is actually present by checking in the browser's JavaScript console. Often background requests are made to build the page, so you might see the element on the rendered page even though it comes from a request to a different URL or is generated inside a script.
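As a quick first check from Python (a sketch; 'article-content' is just the class from the question), you can test whether the class name appears in the raw response at all:
import requests

res = requests.get(url)  # the WSJ article URL from the question
# If the class never shows up in the raw HTML, the content is loaded by a
# background request or a script, and find_all() on this response will fail.
if 'article-content' in res.text:
    print('element is present in the served HTML')
else:
    print('element is probably rendered client-side')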
Try using select() and set the parser to 'lxml':
soup = BeautifulSoup(res.text, 'lxml')
content = [p.text for p in soup.select('.article-content p')]

Decompose Specific Links When Scraping Data (Python)

Below is a section of HTML code I am currently scraping.
<div class="RadAjaxPanel" id="LiveBoard1_LiveBoard1_litGamesPanel">
<a href="leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=8&season=2016&month=0&season1=2016&ind=0&team=0&rost=0&age=0&filter=&players=p2018-04-20">
Today's Probable Starters and Lineups Leaderboard
</a>
</div>
Throughout the code, I need to figure out a way to scrape all the links in this div class with the exception of the one posted above. Does anyone know how to decompose one specific link within a specific div class but still scrape the remaining links? With regard to this specific link, the beginning ("leaders.aspx") of the link is different from the links I am currently targeting. Below is a sample of my current code.
import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.fangraphs.com/livescoreboard.aspx?date=2018-04-18')
soup = BeautifulSoup(page.text, 'html.parser')

# Remove unwanted links
[link.decompose() for link in soup.find_all(class_='lineup')]
[yesterday.decompose() for yesterday in soup.find_all('td', attrs={'colspan': '2'})]

team_name_list = soup.find(class_='RadAjaxPanel')
team_name_list_items = team_name_list.find_all('a')
for team_name in team_name_list_items:
    teams = team_name.contents[0]
    print(teams)

winprob_list = soup.find(class_='RadAjaxPanel')
winprob_list_items = winprob_list.find_all('td', attrs={'style': 'border:1px solid black;'})
for winprob in winprob_list_items:
    winprobperc = winprob.contents[0]
    print(winprobperc)
To summarize, I just need to remove the "Today's Probable Starters and Lineups Leaderboard" link that was posted in the first code block. Thanks in advance!
Just use CSS selectors with .select_one():
soup.select_one('.RadAjaxPanel > center > a').decompose()
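In context, a minimal sketch of the full flow (assuming, as the selector above implies, that the leaderboard anchor sits inside a <center> element on the live page, which the snippet in the question omits):
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.fangraphs.com/livescoreboard.aspx?date=2018-04-18')
soup = BeautifulSoup(page.text, 'html.parser')

# Drop the leaderboard link first; every anchor left in the panel
# is then one of the links you actually want
leaderboard = soup.select_one('.RadAjaxPanel > center > a')
if leaderboard is not None:
    leaderboard.decompose()

for link in soup.find(class_='RadAjaxPanel').find_all('a'):
    print(link.contents[0])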

How to scrape text from paragraphs with different id names?

I am trying to scrape text from paragraphs with different id names. The text looks as follows:
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.<br><a
onclick="toggle('comTrunc1'); toggle('comFull1');return false;"
href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It's worked Very well for me. I'm sleeping I'm
eating I'm going Out in the public. Overall I'm very
satisfied.However I haven't heard anybody mention this but my feet are
very puffy and swollen is this a side effect does anyone know?<br><a
onclick="toggle('comTrunc2'); toggle('comFull2');return false;"
href="#">Hide Full Comment</a></p>
......
I am able to scrape text from a particular id, but not from all ids at once. Can anyone help with scraping the text from all of the ids? The code looks like this:
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.Hide Full Comment"
Try this. If the paragraph ids are all suffixed with 1, 2, 3, etc., as in comFull1, comFull2, comFull3, then the selector below should handle them.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
# Select every element whose id starts with "comFull"
for item in soup.select("[id^='comFull']"):
    print(item.text)
The issue you are having, as I understand it, is scraping the text of all paragraphs in a webpage, i.e. all <p> tags.
The function you are looking for is -
soup.find_all('p')
A more comprehensive example is shown in the following docs -
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
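Since the comment paragraphs in the question all carry class="comment", a narrower variant of that call (a sketch) avoids picking up unrelated <p> tags elsewhere on the page:
# Only paragraphs with class="comment", per the HTML in the question
for p in soup.find_all('p', class_='comment'):
    print(p.text)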
If you want to use XPath you can use:
response.xpath("//p[contains(@id,'comFull')]/text()").extract()
But since you are using BeautifulSoup you can pass a function or regular expression to the find_all method, as mentioned here:
Matching id's in BeautifulSoup
import re
soup.find_all('p', id=re.compile('^comFull'))
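The function variant mentioned above might look like this (a sketch):
# Match any <p> whose id starts with "comFull"; the function receives the
# attribute value, which is None when the tag has no id at all.
for item in soup.find_all('p', id=lambda i: i and i.startswith('comFull')):
    print(item.text)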

How do I make the code return the text using xpath?

from lxml import html
import requests
import time
#Gets prices
page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
tree = html.fromstring(page.content)
price = tree.xpath('//h2[@data-attribute="Hi Guess the Food - What’s the Food Brand in the Picture"]/text()')
print(price)
This only returns []
When I look into page.content, it shows the Amazon anti-bot page. How can I bypass this?
One general piece of advice when you're trying to scrape a website: look at the returned content, in this case page.content, before trying anything else. You're wrongly assuming Amazon will nicely let you fetch their data, when they don't.
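For example, a crude check before parsing (a sketch; the marker strings are assumptions, since the exact wording of Amazon's interstitial varies):
# If Amazon served its anti-bot page, the listing markup won't be there
if 'Robot Check' in page.text or 'captcha' in page.text.lower():
    print('Amazon returned its anti-bot page, not search results')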
I think urllib2 is better, and the XPath could be:
price = c.xpath('//div[@class="s-item-container"]//h2')[0]
print(price.text)
After all, the long string may contain strange characters.

find_all() not finding any results when using Beautiful Soup + Requests

I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply am not getting any data back when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests
date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
titles = soup.find_all('h1 class="title multiline"')
print(titles)
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for h1 class="title multiline" elements.
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json

# The rail titles live in a JSON blob stored in the "data-pageraillist" attribute
results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print(result["title"])
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but find_all only searches by tag name; it's not a generic search function.
Hence, if you want to filter by tag and an attribute like class, do this:
soup.find_all('h1', {'class': 'multiline'})
