Cannot scrape this specific div class - python

I am curious why I cannot get the div elements with this class using the code below (the same approach worked on other sites before). Maybe it is an issue with this site?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page=requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, features="lxml")
divs = soup.find_all("div", attrs={"class": "l-product mod-standard product list-item ff-slider"})
print(divs)
This prints an empty list. I want all div elements with the class 'l-product mod-standard product list-item ff-slider'.

You only need a single class from the multi-valued class attribute, and that will be less fragile. Also, remove headers, which is never defined in your snippet.
from bs4 import BeautifulSoup
import requests
url = "https://www.docmorris.de/produkte/abnehmen"
page = requests.get(url)
soup = BeautifulSoup(page.content, features="lxml")
divs = soup.select('.l-product')
print(divs)
The multi-valued (more fragile) version would be:
divs = soup.select('.l-product.mod-standard.product.list-item.ff-slider')
Or, as noted in the comments (ensure it is on one line):
divs = soup.find_all("div",attrs={"class": "l-product mod-standard product list-item ff-slider"})
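If even the single-class selector comes back empty, a minimal sanity check (a sketch, not part of the answer above) is to look for the class name in the raw response before parsing; if it is missing, the product tiles are probably rendered client-side by JavaScript:
from bs4 import BeautifulSoup
import requests

url = "https://www.docmorris.de/produkte/abnehmen"
page = requests.get(url)

# Check whether the class name appears anywhere in the raw markup before parsing.
if "l-product" in page.text:
    soup = BeautifulSoup(page.content, features="lxml")
    print(len(soup.select(".l-product")), "product tiles found")
else:
    print("Class not in the raw HTML - the content is likely rendered client-side")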

Related

Facebook Scraper

I'm trying to scrape posts and images from this Facebook profile, https://www.facebook.com/carlostablanteoficial, and I get nothing when trying to reach the actual post text with this code:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
html = urlopen("https://www.facebook.com/carlostablanteoficial")
res = BeautifulSoup(html.read(), "html5lib")
resdiv = res.div
post = resdiv.findAll('div', class_='text_exposed_root')
print(post)
This will return many results:
import requests
from bs4 import BeautifulSoup
data = requests.get("https://www.facebook.com/carlostablanteoficial")
soup = BeautifulSoup(data.text, 'html.parser')
for div in soup.find_all('div'):
    print(div)
To search for a specific class, change the loop to:
for div in soup.find_all('div', {'class': 'text_exposed_root'}):
    print(div)
But when I tried it, it returned nothing, meaning there is no div with that class on the page.
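One way to confirm that (a rough sketch, not from the answer above) is to list every class value that actually appears in the HTML requests receives; Facebook serves heavily scripted markup to plain HTTP clients, so a class visible in the browser's inspector may simply not be in the response:
import requests
from bs4 import BeautifulSoup

data = requests.get("https://www.facebook.com/carlostablanteoficial")
soup = BeautifulSoup(data.text, 'html.parser')

# Collect every distinct class value present in the fetched markup.
seen_classes = set()
for tag in soup.find_all(True):
    seen_classes.update(tag.get('class', []))

print('text_exposed_root' in seen_classes)  # expected to be False here
print(sorted(seen_classes)[:20])            # a small sample of what is actually present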

Trying to find the number within class

The element I am looking for has no class attribute of its own, and I have tried both of the approaches below. When I search containers = soup.find_all('div', class_='today_nowcard-temp') it returns the line that has the value I want, but I have no way of extracting the number. Because the line I want has no class attribute, I can't get the number directly from the find function. If I do containers = soup.find('span', class_=None) it doesn't return the value I want. This is the whole of what I have so far:
import requests
from bs4 import BeautifulSoup
url = 'https://weather.com/en-CA/weather/today/l/8663b88e4a1c7d6068b7f33e360396ac1c89f3dde9533082cd342aef06ad1e87'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
containers = soup.find_all('div', class_='today_nowcard-temp')
print(containers)
Use find() instead of find_all() and then do find_next('span')
import requests
from bs4 import BeautifulSoup
url = 'https://weather.com/en-CA/weather/today/l/8663b88e4a1c7d6068b7f33e360396ac1c89f3dde9533082cd342aef06ad1e87'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
containers = soup.find('div', class_='today_nowcard-temp')
print(containers.find_next('span').text)
If you want to use find_all(), then iterate over the results like below:
containers = soup.find_all('div', class_='today_nowcard-temp')
for item in containers:
    print(item.find_next('span').text)
Or you can use a CSS selector:
print(soup.select_one("div.today_nowcard-temp > span").text)
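If only the numeric value is wanted, the span text may still carry a degree sign or similar characters depending on the markup; a small follow-up sketch (the regex here is an illustrative assumption) strips everything but the number:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://weather.com/en-CA/weather/today/l/8663b88e4a1c7d6068b7f33e360396ac1c89f3dde9533082cd342aef06ad1e87'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

raw = soup.select_one("div.today_nowcard-temp > span").text
# Keep an optional sign and the digits from text such as "13°".
match = re.search(r'-?\d+', raw)
if match:
    print(int(match.group()))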

Unable to find the class for price - web scraping

I want to extract the price off the website below. However, I'm having trouble locating the class type. On this website we see that the price for this course is $5141. When I check the source code, the class for the price should be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However, when I ran the code I got a description of the course instead of the price. Not sure what I did wrong. Any help appreciated, thanks!
There are actually several elements with the "field-item even" class on your webpage, so you have to pick the one inside the right parent element. Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result :
5141.00
With bs4 4.7.1+ you can use :contains to isolate the appropriate preceding tag, then use adjacent sibling and descendant combinators to get to the target:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This selector:
.field-label:contains("Price:")
looks for an element with class field-label (the . is a CSS class selector) which contains the text Price:. The + is an adjacent sibling combinator specifying to get the adjacent div. The " .field-item" (space, dot, field-item) combines a descendant combinator (the space) with a class selector, matching a descendant of that adjacent div having class field-item. select_one returns the first match in the DOM for the whole selector combination.
Reading:
css selectors
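Note that recent soupsieve releases deprecate :contains in favour of :-soup-contains, so on a newer install the same selector (a sketch of the renamed pseudo-class only) would be written as:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')

# :-soup-contains is the non-deprecated spelling of :contains in recent soupsieve versions.
print(soup.select_one('.field-label:-soup-contains("Price:") + div .field-item').text)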
To get the price you can try using .select_one(), which is precise and less error-prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually, the class I see using the Firefox inspector is field-item even; it's where the text is:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change your code a little bit:
price = soup.find_all("div", {"class": "field-item even"})[2]
There is more than one element labeled with the "field-item even" class; the price is not the first one.
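A quick way to see which occurrence actually holds the price (a sketch; the index you end up using depends on the page layout at the time) is to print every match together with its position:
from bs4 import BeautifulSoup
import requests

url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Print each "field-item even" div with its index so the right one can be picked by eye.
for i, div in enumerate(soup.find_all("div", {"class": "field-item even"})):
    print(i, div.get_text(strip=True))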

'NoneType' object is not callable in Beautiful Soup 4

I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page and then, with those links, repeat the process until I have an entire website parsed.
import bs4 as bs
import urllib.request as url

links_unclean = []
links_clean = []

soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')

        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))

        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)

        links_clean = list(dict.fromkeys(links_clean))

        input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
    soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
Can you please help?
Be careful when importing modules as something. In this case, the module alias url from line 2 gets overwritten by your for loop variable when you iterate.
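A minimal fix along those lines (a sketch of the renaming only, keeping the rest of your logic unchanged) is to give the module alias and the loop variable different names:
import bs4 as bs
import urllib.request as request  # module alias no longer clashes with a loop variable

soup = bs.BeautifulSoup(request.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for anchor in soup.find_all('a'):  # 'anchor' instead of reusing the name 'url'
    print(anchor.get('href'))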
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen
content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture, as not all links are captured because of errors in the page's HTML tags. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the Requests-HTML package):
from requests_html import HTML, HTMLSession
session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')
for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
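If you later need to separate same-domain links from external ones, a small follow-up sketch using the standard library (the domain check is an illustrative assumption) could look like this:
from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')

internal, external = [], []
for link in r.html.absolute_links:
    # Treat links whose host ends with pythonprogramming.net as internal, the rest as external.
    if urlparse(link).netloc.endswith('pythonprogramming.net'):
        internal.append(link)
    else:
        external.append(link)

print(len(internal), 'internal links,', len(external), 'external links')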
I would consider using an attribute=value CSS selector with the ^ operator to specify that the href attributes begin with https. You will then only have valid protocols. Also, use set comprehensions to ensure there are no duplicates, and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
final = []
with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)

beautiful soup findall returning different results

I am trying to parse a div class from an HTML table on Amazon, and when I run the code, find_all() sometimes returns the right div elements I am looking for, and other times it returns an empty list. Any ideas on why the results vary?
I am pulling from this url: https://www.amazon.com/dp/B0767653BK
My code:
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.amazon.com/dp/B0767653BK')
page = req.text
BSoup = BeautifulSoup(page, 'html.parser')
divClass = BSoup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis')
It is better to use a beautifulsoup selector when trying to find all elements with a combination of CSS classes:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.amazon.com/dp/B0767653BK')
soup = BeautifulSoup(req.text, 'html.parser')
for div_class in soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'):
    print(div_class.get_text(strip=True))
This is preferable as it allows the four class elements to be present in any order. So if the page decides to change the ordering of the classes, it will still find them.
Take a look at Searching by CSS class in the documentation.
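A tiny self-contained illustration of the ordering point (the markup here is made up for the example): matching on the full class string only succeeds when the attribute value is written in exactly that order, whereas the CSS selector matches regardless:
from bs4 import BeautifulSoup

# Same element, but the classes are written in a different order than the search string.
html = '<div class="a-spacing-none a-section overflow_ellipsis a-padding-none">x</div>'
soup = BeautifulSoup(html, 'html.parser')

# Exact string match on the class attribute: order-sensitive, finds nothing here.
print(soup.find_all('div', class_='a-section a-spacing-none a-padding-none overflow_ellipsis'))

# CSS selector: order-independent, finds the element.
print(soup.select('div.a-section.a-spacing-none.a-padding-none.overflow_ellipsis'))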
