How to fetch text from BeautifulSoup, getting error - python

I am trying to fetch text from a webpage - https://www.symantec.com/security_response/definitions.jsp?pid=sep14
Exactly where is says -
File-Based Protection (Traditional Antivirus)
Extended Version: 4/18/2019 rev. 2
But I am still facing errors, can I get the part where it says - 4/18/2019 rev. 2
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find_all('div', class_='unit size1of2 feedBody')
print(extended)

You can actually use CSS selectors to do this. This is done with Beautiful Soup 4.7+. Here we target the same div and classes that you did above, but we also look for the descendant li and it's direct child > strong. We then use the custom pseudo-class :contains() to ensure that the strong element contains the text Extended Version:. We use select_one API call as it will return the first element that matches, select would return all elements that match in a list, but we only need one.
Once we have the strong element, we know the next sibling text node has the information we want, so we can just use next_sibling to grab that text:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.select_one('div.unit.size1of2.feedBody li:contains("Extended Version:") > strong')
print(extended.next_sibling)
Output
4/18/2019 rev. 7
EDIT: As #QHarr mentions in the comments, you can most likely get away with a more simplified strong:contains("Extended Version:"). It is important to remember that :contains() searches all child text nodes of the given element, even sub text nodes of child elements, so being specific is important. I wouldn't use :contains("Extended Version:") as it would find the div, the list elements, etc., so by specify (at the very minimum) strong should narrow the selection enough to give you exactly what you need.

i changed your code like below, now it's showing that you want
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').find_all('li')
print(extended[2])

Try this maybe?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').findAll('li')
print(extended[2].text.strip())

Related

How to select a tags and scrape href value?

I am having trouble getting hyperlinks for tennis matches listed on a webpage, how do I go about fixing the code below so that it can obtain it through print?
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02")
webpage = response.content
soup = BeautifulSoup(webpage, "html.parser")
print(soup.findAll('a href'))
In newer code avoid old syntax findAll() instead use find_all() or select() with css selectors - For more take a minute to check docs
Select your elements more specific and use set comprehension to avoid duplicates:
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Example
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.betexplorer.com/results/tennis/?year=2022&month=11&day=02')
soup = BeautifulSoup(r.text)
set('https://www.betexplorer.com'+a.get('href') for a in soup.select('a[href^="/tennis"]:has(strong)'))
Output
{'https://www.betexplorer.com/tennis/itf-men-singles/m15-new-delhi-2/sinha-nitin-kumar-vardhan-vishnu/tOasQaJm/',
'https://www.betexplorer.com/tennis/itf-women-doubles/w25-jerusalem/mushika-mao-mushika-mio-cohen-sapir-nagornaia-sofiia/xbNOHTEH/',
'https://www.betexplorer.com/tennis/itf-men-singles/m25-jakarta-2/barki-nathan-anthony-sun-fajing/zy2r8bp0/',
'https://www.betexplorer.com/tennis/itf-women-singles/w15-solarino/margherita-marcon-abbagnato-anastasia/lpq2YX4d/',
'https://www.betexplorer.com/tennis/itf-women-singles/w60-sydney/lee-ya-hsuan-namigata-junri/CEQrNPIG/',
'https://www.betexplorer.com/tennis/itf-men-doubles/m15-sharm-elsheikh-16/echeverria-john-marrero-curbelo-ivan-ianin-nikita-jasper-lai/nsGbyqiT/',...}
Change the last line to
print([a['href'] for a in soup.findAll('a')])
See a full tutorial here: https://pythonprogramminglanguage.com/get-links-from-webpage/

Matching a specific piece of text in a title using Beuatiful Soup

Basically, I want to find all links that contain certain key terms. In my case, the titles of these links that I want come in this form: abc... (common text), dce... (common text), ... I want to take all of the links containing "(common text)" and put them in the list. I got the code working and I understand how to find all links. However, I converted the links to strings to find the "(common text)". I know that this isn't good practice and I am not sure how to use Beautiful Soup to find this common element without converting to a string. The issue here is that the titles I am searching for are not all the same. Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
links_length = len(links)
string_links = []
targetlist = []
for a in range(links_length):
string_links.append(str(links[a]))
if '(common text)' in string_links[a]:
targetlist.append(string_links[a])
NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.
Without the actual website and actual output you want, it's very difficult to say what you want but this is a "cleaner" solution using list comprehension.
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
targetlist = [str(link) for link in links if "(common text)" in str(link)]

Unable to find the class for price - web scraping

I want to extract the price off the website
However, I'm having trouble locating the class type.
on this website
we see that the price for this course is $5141. When I check the source code the class for the price should be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url =
"https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-
advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However when I ran the code I got a description of the course instead of the price..not sure what I did wrong. Any help appreciated, thanks!
There are actually several "field-item even" classes on your webpage so you have to pick the one inside the good class. Here's the code :
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result :
5141.00
With bs4 4.7.1 + you can use :contains to isolate the appropriate preceeding tag then use adjacent sibling and descendant combinators to get to the target
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This
.field-label:contains("Price:")
looks for an element with class field-label, the . is a css class selector, which contains the text Price:. Then the + is an adjacent sibling combinator specifying to get the adjacent div. The .field-item (space dot field-item) is a descendant combinator (the space) and class selector for a child of the adjacent div having class field-item. select_one returns the first match in the DOM for the css selector combination.
Reading:
css selectors
To get the price you can try using .select() which is precise and less error prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually the class I see, using Firefox inspector is : field-item even, it's where the text is:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change a little bit your code:
price = soup.find_all("div",{"class":'field-item even'})[2]
There are more than one "field-item even" labeled class, price is not the first one.

beautiful soup findall returning different results

I am trying to parse through a div class from an html table on Amazon, and when I run the code, find_all() sometimes returns the right div classes that I am looking for, and other times it will return an empty list. Any ideas on why the results vary?
I am pulling from this url: https://www.amazon.com/dp/B0767653BK
My code:
req = requests.get('https://www.amazon.com/dp/B0767653BK')
page = req.text
BSoup = BeautifulSoup(page, 'html.parser')
divClass = Bsoup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis')
It is better to use a beautifulsoup selector when trying to find all elements with a combination of CSS classes:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.amazon.com/dp/B0767653BK')
soup = BeautifulSoup(req.text, 'html.parser')
for div_class in soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'):
print div_class.get_text(strip=True)
This is preferable as it allows the four class elements to be present in any order. So if the page decides to change the ordering of the classes, it will still find them.
Take a look at Searching by CSS class in the documenation.

Python Beautiful Soup - find value based on text in HTML

I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns [], yet there absolutely is text in a 'td' tag with 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements and the .string value, which is checked when you apply the text argument, is None (bs4 has it documented here).
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
tag.name == "td" and
tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking to find the following number 5:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
You can't use regex with tag. It just won't work. Don't know if it's a bug of specification. I just search after all, and then get the parent back in a list comprehension cause "td" "regex" would give you the td tag.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit=soup.find_all(text=re.compile('Market Cap'))
findit=[x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing that XPath is ideally suited for. If I were doing what you're doing, I would use lxml for parsing rather than BeautifulSoup's built in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI the above returns an lxml object rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need to return a list of the actual source for some reason, you would add something like:
print [etree.tostring(i) for i in tds_w_market_cap]
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]

Categories

Resources