Cannot extract span element with BeautifulSoup - python

See below. I am using BeautifulSoup to try and extract this value. What I've tried:
pg = requests.get(websitelink)
soup = BeautifulSoup(pg.content, 'html.parser'
value = soup.find('span',{'class':'wall-header__item_count'}).text
I've tried find, and find all and it returns a Nonetype. For whatever reason the wall-header item count is unable to be found with these methods, even though it appears in the HTML. How can I get this value? Thanks!

I'm assuming you want to get the number of total items. The number is stored within the HTML page inside the <script>. beautifulsoup doesn't see it, but you can use re/json modules to extract it:
import re
import json
import requests
url = "https://www.nike.com/w"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_REDUX_STATE=(\{.*\})", html_doc).group(1)
data = json.loads(data)
# uncomment this to print all data;
# print(json.dumps(data, indent=4))
print("Total items:", data["Wall"]["pageData"]["totalResources"])
Prints (in case in my country):
Total items: 5600

you are forgetting to close the braces
soup = BeautifulSoup(pg.content, 'html.parser'
should be:
soup = BeautifulSoup(pg.content, 'html.parser')

Related

Can't scrape <h3> tag from page

Seems like i can scrape any tag and class, except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name = "h3" , class_ = "jsx-4245974604")
movies_text=[]
for item in movies:
result = item.getText()
movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned this is dynamic content, which needs to be generated first when opening/running the webpage. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable you actually can see that you won't find it. But if simply you want to get the names of the movies you can just use another part of the html in this case.
The movie name is in the alt tag of the picture (and actually also in many other parts of the html).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text=[]
for item in movies:
result = item.get('alt')
movies_text.append(result)
print(movies_text)
If you run into this issue in the future, remember to just print out the initial html you can get with soup and just check by eye if the information you need can be found.

Scrape <div<span from HTML-page

I am trying to create a simple weather forecast with Python in Eclipse. So far I have written this:
from bs4 import BeautifulSoup
import requests
def weather_forecast():
url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
r = requests.get(url) # Get request for contents of the page
print(r.content) # Outputs HTML code for the page
soup = BeautifulSoup(r.content, 'html5lib') # Parse the data with BeautifulSoup(HTML-string, html-parser)
min_max = soup.select('min-max.temperature') # Select all spans with a "min-max-temperature" attribute
print(min_max.prettify())
table = soup.find('div', attrs={'daily-weather-list-item__temperature'})
print(table.prettify())
From a html-page with elements that looks like this:
I have found the path to the first temperature in the HTML-page's elements, but when I try and execute my code, and print to see if I have done it correctly, nothing is printed. My goal is to print a table with dates and corresponding temperatures, which seems like an easy task, but I do not know how to properly name the attribute or how to scrape them all from the HTML-page in one iteration.
The <span has two temperatures stored, one min and one max, here it just happens that they're the same.
I want to go into each <div class="daily-weather-list-item__temperature", collect the two temperatures and add them to a dictionary, how do I do this?
I have looked at this question on stackoverflow but I couldn't figure it out:
Python BeautifulSoup - Scraping Div Spans and p tags - also how to get exact match on div name
You could use a dictionary comprehension. Loop over all the forecasts which have class daily-weather-list-item, then extract date from the datetime attribute of the time tags, and use those as keys; associate the keys with the maxmin info.
import requests
from bs4 import BeautifulSoup
def weather_forecast():
url = 'https://www.yr.no/nb/v%C3%A6rvarsel/daglig-tabell/1-92416/Norge/Vestland/Bergen/Bergen'
r = requests.get(url) # Get request for contents of the page
soup = BeautifulSoup(r.content, 'html5lib')
temps = {i.select_one('time')['datetime']:i.select_one('.min-max-temperature').get_text(strip= True)
for i in soup.select('.daily-weather-list-item')}
return temps
weather_forecast()

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

How do I parse the values like this with BeutifulSoup?

I am trying to parse some information from awebsite but ran in to a little problem, the information I need wont print out and just shows [] when I need the values (3 for example from the source code provided. I would need some help to get it working. Hope someone here can help me out and assist in the matter.
Best of regards.
import re
import requests
from bs4 import BeautifulSoup
url_to_parse = "https://www.webpage.com"
response = requests.get(url_to_parse)
response_text = response.text
soup = BeautifulSoup(response_text, 'lxml')
#print(soup.prettify())
ragex = re.compile('c76a6')
content_lis = soup.find_all('button', attrs={'class': ragex})
print(content_lis)
source: <button class="c76a6" type="button" data-test-name="valueButton"><span class="_5a5c0" data-test-name="value">3</span></button>
because the find_all returns in array to get item require to index on it or loop through the matches and that takes time if you know that the target is unique so you have to use find get the first match so in that case you should add attribute called text to get the value only
import re
import requests
from bs4 import BeautifulSoup
url_to_parse = "https://www.webpage.com"
response = requests.get(url_to_parse)
response_content = response.content
soup = BeautifulSoup(response_content, 'lxml')
# print(soup.prettify())
regex = re.compile('c76a6')
content_list = soup.find('button',{'class': regex})
print(content_list.text)

Extracting from script - beautiful soup

How would the value for the "tier1Category" be extracted from the source of this page?
https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product
soup.find('script')
returns only a subset of the source, and the following returns another source within that code.
json.loads(soup.find("script", type="application/ld+json").text)
Bitto and I have similar approaches to this, however I prefer to not rely on knowing which script contains the matching pattern, nor the structure of the script.
import requests
from collections import abc
from bs4 import BeautifulSoup as bs
def nested_dict_iter(nested):
for key, value in nested.items():
if isinstance(value, abc.Mapping):
yield from nested_dict_iter(value)
else:
yield key, value
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
if 'tier1Category' in script.text:
j = json.loads(script.text[str(script.text).index('{'):str(script.text).rindex(';')])
for k,v in list(nested_dict_iter(j)):
if k == 'tier1Category':
print(v)
Here are the steps I used to get the output
use find_all and get the 10th script tag. This script tag contains the tier1Category value.
Get the script text from the first occurrence of { and till last occurrence of ; . This will give us a proper json text.
Load the text using json.loads
Understand the structure of the json to find how to get the tier1Category value.
Code:
import json
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
script_text=soup.find_all('script')[9].text
start=str(script_text).index('{')
end=str(script_text).rindex(';')
proper_json_text=script_text[start:end]
our_json=json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])
Output:
Medicines & Treatments
I think you can use an id. I assume tier 1 is after shop in the navigation tree. Otherwise, I don't see that value in that script tag. I see it in an ordinary script (without the script[type="application/ld+json"] ) tag but there are a lot of regex matches for tier 1
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)

Categories

Resources