Beautiful Soup find_all() returning different results - Python

I am trying to parse a div class from an HTML table on Amazon. When I run the code, find_all() sometimes returns the right div classes I am looking for, and other times it returns an empty list. Any ideas on why the results vary?
I am pulling from this url: https://www.amazon.com/dp/B0767653BK
My code:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.amazon.com/dp/B0767653BK')
page = req.text
BSoup = BeautifulSoup(page, 'html.parser')
divClass = BSoup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis')

It is better to use a CSS selector when trying to find all elements with a combination of CSS classes:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.amazon.com/dp/B0767653BK')
soup = BeautifulSoup(req.text, 'html.parser')
for div_class in soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'):
    print(div_class.get_text(strip=True))
This is preferable as it allows the four classes to be present in any order, so if the page changes the ordering of the classes, the selector will still match them.
Take a look at Searching by CSS class in the documentation.
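To see why, here's a minimal standalone demonstration (not part of the original answer): exact-string class matching is order-sensitive, while a CSS selector is not.

from bs4 import BeautifulSoup

# the classes appear in a different order than the search string below
html = '<div class="a-spacing-none a-section">hello</div>'
soup = BeautifulSoup(html, 'html.parser')

# exact-string matching is order-sensitive, so this finds nothing
print(soup.find_all('div', class_='a-section a-spacing-none'))  # []

# a CSS selector matches the classes in any order
print(soup.select('div.a-section.a-spacing-none'))  # [<div class="a-spacing-none a-section">hello</div>]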

Related

Not able to scrape the Airtel website plans with Python BeautifulSoup

from bs4 import BeautifulSoup
import requests
req = requests.get("https://www.airtel.in/myplan-infinity/")
soup = BeautifulSoup(req.content, 'html.parser')
# finding the div (no class or id specified, so this returns the first div)
div_bs4 = soup.find('div')
print(div_bs4)
What should I do to scrape the recharge plans of the page?
You should use content when you need to get media data (pictures, videos, etc.); if you need non-media data, use text instead of content.
Make the soup like this: soup = BeautifulSoup(req.text, 'lxml')
Make sure that the line div_bs4 = soup.find('div') finds the exact div you need, because as written it just gets the first div in the HTML.
The line print(div_bs4) won't give you the data you need; you'd better use print(div_bs4.text).
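Putting those points together, a minimal sketch might look like this. Note that 'plan-card' is a placeholder class name assumed for illustration, not taken from the actual page; inspect the live HTML to find the real container that holds the plans.

from bs4 import BeautifulSoup
import requests

req = requests.get("https://www.airtel.in/myplan-infinity/")
# use .text for HTML (non-media) responses
soup = BeautifulSoup(req.text, 'lxml')

# 'plan-card' is a hypothetical class name; replace with the real one
for plan in soup.find_all('div', class_='plan-card'):
    print(plan.get_text(strip=True))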

Web Scraping through links with Beautiful Soup

I'm trying to scrape a blog "https://blog.feedspot.com/ai_rss_feeds/" and crawl through all the links in it to look for Artificial Intelligence related information in each of the crawled links.
The blog follows a pattern - it has multiple RSS Feeds, and each Feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute. Example: aitrends.com, sciencedaily.com/... etc. In the code, the main div has a class called "rss-block", which has another nested class called "data", and each "data" block contains several tags with <a> elements in them. The value in href gives the links to be crawled. We need to look for AI related articles in each of those links found by scraping the above-mentioned structure.
I've tried various variations of the following code but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawl through each of these links to scrape articles related to AI in them.
If you could help me finish both the parts of the problem, that'd be a great learning for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML Structure. Thanks in advance!
The first twenty results are stored in the html as you see on page. The others are pulled from a script tag and you can regex them out to create the full list of 67. Then loop that list and issue requests to those for further info. I offer a choice of two different selectors for the initial list population (the second - commented out - uses :contains - available with bs4 4.7.1+)
from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1+
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]
    for link in results:
        # do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from individual page
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]

Unable to find the class for price - web scraping

I want to extract the price off the website, but I'm having trouble locating the class.
On this website (https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education) we see that the price for this course is $5141. When I check the source code, the class for the price should be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However, when I ran the code I got a description of the course instead of the price. Not sure what I did wrong; any help appreciated, thanks!
There are actually several "field-item even" classes on the webpage, so you have to pick the one inside the right parent class. Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result:
5141.00
With bs4 4.7.1+ you can use :contains to isolate the appropriate preceding tag, then use adjacent sibling and descendant combinators to get to the target.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This
.field-label:contains("Price:")
looks for an element with class field-label (the . is a CSS class selector) which contains the text Price:. Then the + is an adjacent sibling combinator specifying to get the adjacent div. The .field-item (space dot field-item) is a descendant combinator (the space) plus a class selector for a descendant of the adjacent div having class field-item. select_one returns the first match in the DOM for the CSS selector combination.
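As a self-contained illustration, here is the same selector run against a hardcoded fragment that mirrors the structure described above (the fragment is reconstructed from this answer's description, not copied from the live page):

from bs4 import BeautifulSoup as bs

html = '''
<div class="field field-name-field-price">
  <div class="field-label">Price:</div>
  <div class="field-items"><div class="field-item even">5141.00</div></div>
</div>
'''
soup = bs(html, 'html.parser')
# .field-label:contains("Price:") finds the label, + div steps to its
# adjacent sibling, and .field-item selects the descendant holding the price
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)  # 5141.00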
Reading:
css selectors
To get the price you can try using .select(), which is precise and less error-prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually the class I see, using the Firefox inspector, is field-item even; it's where the text is:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change your code a little bit:
price = soup.find_all("div",{"class":'field-item even'})[2]
There is more than one element with the "field-item even" class; the price is not the first one.

How to fetch text from BeautifulSoup, getting error

I am trying to fetch text from a webpage - https://www.symantec.com/security_response/definitions.jsp?pid=sep14
Exactly where it says -
File-Based Protection (Traditional Antivirus)
Extended Version: 4/18/2019 rev. 2
But I am still facing errors. How can I get the part where it says "4/18/2019 rev. 2"?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find_all('div', class_='unit size1of2 feedBody')
print(extended)
You can actually use CSS selectors to do this (with Beautiful Soup 4.7+). Here we target the same div and classes that you did above, but we also look for the descendant li and its direct child > strong. We then use the custom pseudo-class :contains() to ensure that the strong element contains the text Extended Version:. We use the select_one API call as it returns the first element that matches; select would return all matching elements in a list, but we only need one.
Once we have the strong element, we know the next sibling text node has the information we want, so we can just use next_sibling to grab that text:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.select_one('div.unit.size1of2.feedBody li:contains("Extended Version:") > strong')
print(extended.next_sibling)
Output
4/18/2019 rev. 7
EDIT: As #QHarr mentions in the comments, you can most likely get away with the more simplified strong:contains("Extended Version:"). It is important to remember that :contains() searches all child text nodes of the given element, even sub text nodes of child elements, so being specific is important. I wouldn't use a bare :contains("Extended Version:") as it would also find the div, the list elements, etc., so specifying (at the very minimum) strong should narrow the selection enough to give you exactly what you need.
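For reference, a minimal sketch of that simplified form, assuming the same soup object built in the code above:

extended = soup.select_one('strong:contains("Extended Version:")')
print(extended.next_sibling)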
I changed your code as below; now it shows what you want:
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').find_all('li')
print(extended[2])
Try this maybe?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').findAll('li')
print(extended[2].text.strip())

Scraping site returns different href for a link

In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Now I used the following code to get the href with Python:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see to be the first you get.
The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL.
For example:
"/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
The above href contains two params, namely kh and uddg.
uddg is the actual link you need I suppose.
The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')

for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
