Decompose Specific Links When Scraping Data (Python)

Below is a section of HTML code I am currently scraping.
<div class="RadAjaxPanel" id="LiveBoard1_LiveBoard1_litGamesPanel">
<a href="leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=8&season=2016&month=0&season1=2016&ind=0&team=0&rost=0&age=0&filter=&players=p2018-04-20">
Today's Probable Starters and Lineups Leaderboard
</a>
</div>
I need to scrape all the links in this div with the exception of the one posted above. Does anyone know how to decompose one specific link within a specific div class while still scraping the remaining links? Note that the beginning of this link ("leaders.aspx") differs from the links I am currently targeting. Below is a sample of my current code.
import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.fangraphs.com/livescoreboard.aspx?date=2018-04-18')
soup = BeautifulSoup(page.text, 'html.parser')

# Remove unwanted links
[link.decompose() for link in soup.find_all(class_='lineup')]
[yesterday.decompose() for yesterday in soup.find_all('td', attrs={'colspan': '2'})]

# Team names come from the anchors inside the RadAjaxPanel div
team_name_list = soup.find(class_='RadAjaxPanel')
team_name_list_items = team_name_list.find_all('a')
for team_name in team_name_list_items:
    teams = team_name.contents[0]
    print(teams)

# Win probabilities come from the bordered table cells
winprob_list = soup.find(class_='RadAjaxPanel')
winprob_list_items = winprob_list.find_all('td', attrs={'style': 'border:1px solid black;'})
for winprob in winprob_list_items:
    winprobperc = winprob.contents[0]
    print(winprobperc)
To summarize, I just need to remove the "Today's Probable Starters and Lineups Leaderboard" link that was posted in the first code block. Thanks in advance!

Just use CSS selectors with .select_one():
soup.select_one('.RadAjaxPanel > center > a').decompose()
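If the panel's markup doesn't include the <center> wrapper that selector assumes, another option is to key off the href prefix noted in the question; a minimal sketch, assuming "leaders.aspx" uniquely identifies the unwanted link:

# Sketch: remove the one link whose href starts with "leaders.aspx",
# then scrape the remaining links as before.
unwanted = soup.select_one('.RadAjaxPanel a[href^="leaders.aspx"]')
if unwanted is not None:
    unwanted.decompose()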

Related

How to extract only a specific kind of link from a webpage with beautifulsoup4

I'm trying to extract specific links on a page full of links. The links I need contain the word "apartment" in them.
But whatever I try, I get way more data extracted than only the links I need.
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">
If anyone could help me out on this, it'd be much appreciated!
Also, if you have a good source that could inform me better about this, it would be double appreciated!
You can use a regular expression with the re module:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(Pagesource, 'html.parser')
alltags = soup.find_all("a", attrs={"href": re.compile("apartment")})
for item in alltags:
    print(item['href'])  # grab the href value
Or you can use a CSS selector:
soup = BeautifulSoup(Pagesource, 'html.parser')
alltags = soup.select("a[href*='apartment']")
for item in alltags:
    print(item['href'])
You can find the details in the official BeautifulSoup documentation.
Edited:
You need to match the parent div first, then find the anchor tag within it:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select("div[data-type='resultgallery-resultitem'] > a[href*='apartment']"):
    print(item['href'])

Scrapy Nested Div Selection

I am trying to scrape article headings from
https://time.com/
I want to select only those article links which appear under "The Brief" heading.
I have tried to select the nested div using this code:
for url in response.xpath('//div[@class="column text-align-left visible-desktop visible-mobile last-column"]/div[@class="column-tout"]/a/@href').extract():
but it did not work.
Can someone please help me extract those specific articles?
You can find this div by its text content and then take all following-sibling divs:
for url in response.xpath('//div[.="The Brief"]/following-sibling::div//a/@href').extract():
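A minimal sketch of how that selector might sit inside a spider; the spider class and the yielded item shape are placeholders, not from the original thread:

import scrapy

class BriefSpider(scrapy.Spider):
    # Hypothetical spider for illustration; only the XPath below is from the answer.
    name = 'brief'
    start_urls = ['https://time.com/']

    def parse(self, response):
        # Find the div whose text is exactly "The Brief", then collect
        # article links from its following sibling divs.
        for url in response.xpath('//div[.="The Brief"]/following-sibling::div//a/@href').extract():
            yield {'url': response.urljoin(url)}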

Cannot get Beautifulsoup to recognize tag

This simple scanner depicted below is designed to find the tag which displays a stock's percent change for the day on Yahoo Finance. When I examine the source code of the webpage, I can easily identify that there is only one span tag with a class equal to what I have written below. The tag's class reads either $dataGreen if the price has gone up, or $dataRed if it has gone down.
I am using iterators in many other places on this webpage, all formatted exactly the same way, and all functional. But for some reason, no amount of tweaking here will give me a result. It is as though this tag cannot be detected.
I haven't a clue why this tag can be found with ctrl+F but not with .find_all().
Any guidance you can give me would be most appreciated. Here's my code.
import bs4 as bs
from urllib.request import urlopen
import urllib.request, urllib.error

url = str('https://finance.yahoo.com/quote/ABEO?p=ABEO')
source = urllib.request.urlopen(url, timeout=30).read()
soup = bs.BeautifulSoup(source, 'lxml')

for row in soup.find('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)"}):
    print(1)

for row in soup.find('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataRed)"}):
    print(2)
Edit:
I've saved the source to a .txt and pored through it for the tag, though I couldn't detect it with ctrl+F either. When I compare what I found in the .txt to what I had pulled from the webpage, it differs. My problem seems to be solved, but I would love for someone to explain why that worked. This is the class as it appears in the .txt:
Trsdu(0.3s) Fw(500) Fz(14px) C($dataRed)
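A likely explanation, offered as an assumption rather than something confirmed in the thread: the HTML that urllib downloads is not the browser-rendered DOM that DevTools shows, because Yahoo's JavaScript rewrites parts of the page after load, so the class string differs (Fz(14px) in the raw source versus Fz(24px) in the browser). One way to sidestep that is to match on the stable color token instead of the full class string; a sketch:

# Sketch: match spans on the stable C($dataGreen)/C($dataRed) token
# instead of the full, JS-dependent class string. When class_ is a
# function, BeautifulSoup tests it against each individual class value.
rows = soup.find_all('span', class_=lambda c: c is not None and c.startswith('C($data'))
for row in rows:
    print(row.get_text())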

Excluding unwanted results of findAll using BeautifulSoup

Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,
content = page.read()
soup = BeautifulSoup(content)
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that every 30 or so times the soup.find_all gets a match, it also matches and grabs something that I really don't want, which is a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
Read more »</p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas:
I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a>.
I've drowned in regular-expression matching limbo with no success.
And I can't seem to take advantage of the class="show-archived" attribute.
Any ideas would be gratefully appreciated. Thanks in advance.
Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
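If you would rather drop the archived reviews from the tree entirely, mirroring the decompose() approach from the first question, a variant sketch:

# Sketch: remove archived reviews so later searches only see current ones.
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        p.decompose()
results = soup.find_all("p", "review_comment")  # archived reviews are gone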

Pull Tag Value using BeautifulSoup

Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by using its CSS class and extract the contents:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the details necessary.
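Note that .contents[0] returns the text "Fun Text"; to pull the title attribute ("Funstuff") with the same CSS-class lookup, a small bs4 sketch:

from bs4 import BeautifulSoup

# Sketch (bs4): select the span by its class and read its title attribute.
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
print(soup.select_one('.thisClass')['title'])  # -> Funstuff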
