Extracting from script - beautiful soup - python

How would the value for the "tier1Category" be extracted from the source of this page?
https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product
soup.find('script')
returns only the first script tag, which is a small subset of the source, and the following returns a different script from the page:
json.loads(soup.find("script", type="application/ld+json").text)

Bitto and I have similar approaches to this, however I prefer to not rely on knowing which script contains the matching pattern, nor the structure of the script.
import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs
def nested_dict_iter(nested):
    # Recursively yield (key, value) pairs from nested mappings
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
    if 'tier1Category' in script.text:
        # Keep only the text between the first { and the last ;
        j = json.loads(script.text[str(script.text).index('{'):str(script.text).rindex(';')])
        for k, v in list(nested_dict_iter(j)):
            if k == 'tier1Category':
                print(v)
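One caveat with nested_dict_iter: it only descends into mappings, so a value that sits inside a list would be skipped. Here is a minimal sketch of a variant that also walks lists. The sample HTML, the window.__STATE__ variable and the find_key helper are made up for illustration; the real page's markup may differ.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the product page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__STATE__ = {"product": {"results": [{"productInfo": {"tier1Category": "Medicines & Treatments"}}]}};</script>
</body></html>
"""

def find_key(node, target):
    """Yield every value stored under `target`, descending into dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield value
            yield from find_key(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, target)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = []
for script in soup.find_all("script"):
    text = script.string or ""
    if "tier1Category" in text:
        # Same slicing idea as above: first "{" up to the last ";"
        payload = json.loads(text[text.index("{"):text.rindex(";")])
        matches.extend(find_key(payload, "tier1Category"))

print(matches)
```

Collecting all matches into a list also makes it obvious when the key appears more than once.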

Here are the steps I used to get the output:
1. Use find_all and get the 10th script tag (index 9). This script tag contains the tier1Category value.
2. Take the script text from the first occurrence of { up to the last occurrence of ; . This gives us valid JSON text.
3. Load the text using json.loads.
4. Inspect the structure of the JSON to find the path to the tier1Category value.
Code:
import json
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
script_text=soup.find_all('script')[9].text
start=str(script_text).index('{')
end=str(script_text).rindex(';')
proper_json_text=script_text[start:end]
our_json=json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])
Output:
Medicines & Treatments
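Slicing up to the last ; assumes the script holds exactly one statement. If that ever stops being true, one alternative (not what the answer above uses) is json.JSONDecoder().raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows. A small sketch, with a made-up script body and variable names:

```python
import json

# Hypothetical script body; the real page's variable names and layout may differ.
script_text = (
    'window.__APP_INITIAL_STATE__ = {"product": {"results": {"productInfo": '
    '{"tier1Category": "Medicines & Treatments"}}}}; window.other = {"a": 1};'
)

start = script_text.index("{")
# raw_decode parses one JSON value beginning at `start` and returns it
# together with the index just past the end of that value.
data, end = json.JSONDecoder().raw_decode(script_text, start)
category = data["product"]["results"]["productInfo"]["tier1Category"]
print(category)
```

With this input, slicing from the first { to the last ; would also swallow the second assignment and json.loads would fail on it.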

I think you can use an id. I assume tier 1 is the entry right after "Shop" in the breadcrumb navigation. Otherwise, I don't see that value in the script[type="application/ld+json"] tag. It does appear in an ordinary script tag (one without that type attribute), but a regex for "tier 1" returns a lot of matches there.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)

Related

Using BeautifulSoup4, find every time text starts with a certain symbol in a website

I am trying to scrape price for an item from a website using python.
import requests
from bs4 import BeautifulSoup
URL = "https://..."
result = requests.get(URL)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(???)
print(prices)
I know I can put the full string to look for in place of the question marks, but I want it to find every piece of text that starts with "$".
Is it possible, if so, how?
Use a regular expression to catch the tags whose text starts with a certain character, as below:
import re
from bs4 import BeautifulSoup
html = """
<p>$Show me</p>
<p>I am invisible</p>
<p>me too</p>
<p>$Show me too</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all("p", string=re.compile(r"^\$"))
# -> [<p>$Show me</p>, <p>$Show me too</p>]
Note that I put a backslash (\) before $ because the dollar sign is itself a special character in regular expressions. See the regular expression syntax documentation for more information.
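If you don't know which tag holds the price, the same idea works without a tag name: passing only string= makes find_all return the matching text nodes themselves. A small sketch on made-up HTML (the markup and class name are assumptions), also allowing leading whitespace before the $:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div><span class="price">$19.99</span><p>Free shipping</p></div>
<ul><li> $5.00 </li><li>Costs 3$ more</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
# With no tag name, find_all(string=...) searches the text nodes directly.
price_strings = soup.find_all(string=re.compile(r"^\s*\$"))
prices = [s.strip() for s in price_strings]
print(prices)
```

Each result is a NavigableString, so .parent gives you the enclosing tag if you need it.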

Is there a way I can extract a list from a javascript document?

There is a website where I need to obtain the owners of this item from an online-game item and from research, I need to do some 'web scraping' to get this data. But, the information is in a Javascript document/code, not an easily parseable HTML document like bs4 shows I can easily extract information from. So, I need to get a variable in this javascript document (contains a list of owners of the item I'm looking at) and make it into a usable list/json/string I can implement in my program. Is there a way I can do this? if so, how can I?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content  # the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas)  # prints the script sections of the page
To scrape a JavaScript variable you can't use BeautifulSoup alone; a regular expression (re) is required as well.
Use ast.literal_eval to convert the string representation of the dict into a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
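A pattern like (.*) captures everything to the end of the line, including a trailing semicolon, which json.loads would reject. Here is a hedged sketch that captures only the object literal. The page source is invented for illustration, and it assumes the JavaScript literal is also valid JSON (double-quoted keys, no trailing commas).

```python
import json
import re

# Hypothetical page source; the variable name follows the snippet above.
page_source = """
<script>
var item_name = "Example";
var history_data = {"id": 1029025, "num_points": 3, "timestamps": [1491004800, 1491091200, 1491177600]};
</script>
"""

# Capture from the opening brace up to the "}" that is followed by ";" at end of line.
match = re.search(r"var history_data\s*=\s*(\{.*?\});\s*$", page_source, re.MULTILINE)
history = json.loads(match.group(1)) if match else None
print(history["timestamps"] if history else "variable not found")
```

json.loads also understands true, false and null, which ast.literal_eval does not, so it tends to be the safer choice when the literal is JSON-compatible.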

Cannot extract span element with BeautifulSoup

See below. I am using BeautifulSoup to try and extract this value. What I've tried:
pg = requests.get(websitelink)
soup = BeautifulSoup(pg.content, 'html.parser'
value = soup.find('span',{'class':'wall-header__item_count'}).text
I've tried find, and find all and it returns a Nonetype. For whatever reason the wall-header item count is unable to be found with these methods, even though it appears in the HTML. How can I get this value? Thanks!
I'm assuming you want to get the number of total items. The number is stored within the HTML page inside the <script>. beautifulsoup doesn't see it, but you can use re/json modules to extract it:
import re
import json
import requests
url = "https://www.nike.com/w"
html_doc = requests.get(url).text
data = re.search(r"window\.INITIAL_REDUX_STATE=(\{.*\})", html_doc).group(1)
data = json.loads(data)
# uncomment this to print all data;
# print(json.dumps(data, indent=4))
print("Total items:", data["Wall"]["pageData"]["totalResources"])
Prints (for my country):
Total items: 5600
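To make the failure mode concrete, here is a small offline sketch: find returns None because the span is not in the HTML that requests receives, so the code falls back to the embedded state. The static_html string is a made-up, heavily shortened stand-in for the real page.

```python
import json
import re
from bs4 import BeautifulSoup

# Hypothetical static HTML as requests would receive it: the span is missing
# because the browser normally builds it with JavaScript.
static_html = """
<html><body>
<div id="app"></div>
<script>window.INITIAL_REDUX_STATE={"Wall":{"pageData":{"totalResources":5600}}}</script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
span = soup.find("span", {"class": "wall-header__item_count"})

if span is not None:
    total = span.get_text(strip=True)
else:
    # Fall back to the JSON state embedded in the script tag.
    match = re.search(r"window\.INITIAL_REDUX_STATE=(\{.*\})", static_html)
    total = json.loads(match.group(1))["Wall"]["pageData"]["totalResources"] if match else None

print("Total items:", total)
```

Checking for None before touching .text avoids the AttributeError on NoneType.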
You are forgetting to close the parenthesis:
soup = BeautifulSoup(pg.content, 'html.parser'
should be:
soup = BeautifulSoup(pg.content, 'html.parser')

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and is passed to a jQuery call. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
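If you would rather not depend on the table's position, a CSS selector can collect every matching cell and take the first one that has content. A sketch on invented markup that only mimics the class names mentioned above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables with the same class, only the second has a value.
html = """
<table class="atreps-results-table"><tr><td class="cell-redshift"></td></tr></table>
<table class="atreps-results-table"><tr><td class="cell-redshift">0.06</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.select("table.atreps-results-table td.cell-redshift")
# Take the first non-empty cell, or None when nothing matches.
redshift = next((c.get_text(strip=True) for c in cells if c.get_text(strip=True)), None)
print(redshift)
```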
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the page source of the URL, you will find two script elements that contain CDATA. The one you are interested in also contains jQuery, so you can select the script element on that basis. After that, some cleaning is needed to strip the CDATA markers and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
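The chain of replace calls depends on the exact wrapper text. A possible alternative, sketched here on a shortened, made-up copy of the script, is to locate the jQuery.extend call and let json.JSONDecoder().raw_decode parse the object that follows it, so the CDATA markers and the closing ); never need to be stripped.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's settings script.
html = """
<script type="text/javascript">
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, {"basePath": "/", "objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});
//--><!]]>
</script>
"""

MARKER = "jQuery.extend(Drupal.settings,"
soup = BeautifulSoup(html, "html.parser")
redshift = None
for script in soup.find_all("script"):
    text = script.string or ""
    if MARKER in text:
        # First "{" after the marker is where the settings object starts.
        start = text.index("{", text.index(MARKER))
        settings, _ = json.JSONDecoder().raw_decode(text, start)
        redshift = settings["objectFlot"]["plotMain1"]["params"]["redshift"]
        break
print(redshift)
```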

find a particular tag using beautifulsoup

I'm trying to get the town and state for a given zip code using the following site:
http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go
Using the following code I get all the tr tags:
import sys
import os
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)
print soup.find_all('tr')
How do I find a particular tr tag? In examples like this one: "How to find tag with particular text with Beautiful Soup?", you already know the text you are looking for. What do I use if I don't know the text ahead of time?
EDIT
i've now added the following and get nowhere:
for tag in soup.find_all(re.compile("^td align=")):
    print (tag.name)
After taking a look at the HTML of the website you provided, I would say the best way to locate the row is by text rather than by class, id, etc.
First you can easily identify the header row based on the text using the key word "Mail", and then you can easily get the row that contains the content you want.
Here is my code:
import urllib2, re, bs4
soup = bs4.BeautifulSoup(urllib2.urlopen("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go"))
# find the header, then find the next tr, which contains your data
tr = soup.find(text=re.compile("Mailing")).find_next("tr")
name, code, zip = [ td.text.strip() for td in tr.find_all("td")]
print name
print code
print zip
After you print them out they look like this:
New York
NY
10023
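For anyone on Python 3, the same text-based lookup can be tried offline. This is only a sketch: the table below is a simplified, made-up version of the real results page, and zip_code is just a name chosen to avoid shadowing the built-in zip.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the results table.
html = """
<table>
  <tr><th>Mailing Name</th><th>State Code</th><th>ZIP Code</th></tr>
  <tr><td align="center">New York</td><td align="center">NY</td><td align="center">10023</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the header text, then jump to the next row, which holds the data.
header_text = soup.find(string=re.compile("Mailing"))
data_row = header_text.find_next("tr")
name, code, zip_code = [td.get_text(strip=True) for td in data_row.find_all("td")]
print(name, code, zip_code)
```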
I would navigate to that point in the HTML source with a mix of find() and find_all() calls, because I cannot differentiate it from other <td> elements based on position, attributes or anything else:
import sys
import os
from bs4 import BeautifulSoup
import requests
l = list()
r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)
for table in soup.find('table'):
    center = table.find_all('center')[3]
    for tr in center.find_all('tr')[-1]:
        l.append(tr.string)
print(l[0:-1])
Run it like:
python script.py
That yields:
[u'New York', u'NY']
