Soup does not find specific class from div - python

I have tried everything. The response is fine and I do get the page content I expect; I just don't understand why I receive an empty list when I search for a div with a specific class (that definitely exists) on the web page. I have tried looking everywhere, but nothing seems to work.
Here's my code:
import requests
import lxml
from bs4 import BeautifulSoup
baseurl = 'https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB'
response = requests.get(baseurl)
soup = BeautifulSoup(response.content, 'lxml')
productlist = soup.find_all("div", class_="nsv-product ns_b_a")
print(productlist)
I am essentially trying to build a script that emails me when items from an e-shop are marked as available ("på lager") instead of unavailable ("ikke på lager").

You might need to use Selenium on this one.
The div is, AFAIK, rendered by JS.
BeautifulSoup does not capture JS-rendered content.
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

options = webdriver.FirefoxOptions()
options.headless = True
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
driver.get('https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB')
# note: XPath attribute selectors use '@class', not '#class'
k = driver.find_elements_by_xpath("//div[@class='nsv-product ns_b_a']")
Your code below that snippet should contain everything you need, e.g. processing the elements, saving them into your database, etc.
Note: the snippet is only an example and a bit rough, e.g. you may prefer Chrome over Firefox, so tweak it to your own needs.
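Since the content is JS-rendered, it may also not be in the DOM yet the moment the page loads. A minimal sketch of an explicit wait, assuming the same driver as above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the JS-rendered product divs to appear
products = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@class='nsv-product ns_b_a']"))
)
for product in products:
    print(product.text)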

You need to inspect the page source (on Windows: Ctrl+U) and search for the section window.netset.model.productListDataModel; it's enclosed in a <script> tag. What you need to do is parse that enclosed JSON string,
<script>window.netset.model.productListDataModel = {....}
which contains your desired product listing (8 per page).
Here is the code:
import re, json
import requests

baseurl = 'https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
response = requests.get(baseurl, headers=headers)
# print response status
print(response)
# regex to find the script-enclosed json string
product_list_raw_str = re.findall(r'productListDataModel\s+=\s+(\{.*?\});\n', response.text)[0].strip()
# parse json string
products_json = json.loads(product_list_raw_str)
# find the product list
product_list = products_json['response']['productlistrows']['productrows']
# check product list count, 8 per page
print(len(product_list))
# iterate the product list
for product in product_list:
    print(product)
It will output:
<Response [200]>
{'buyable': True, 'ispackage': False, 'artcolumndata': 'MK1E3DK/A', 'quantityDisabled': False, 'rowid': '5418526',.........
..... 'showAddToMyList': False, 'showAddToCompareList': True, 'showbuybutton': True}
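From there, a minimal sketch of the availability check the question is after, assuming the 'buyable' flag visible in the output above reflects stock status (verify the exact stock field against the full JSON before relying on it):
for product in product_list:
    # 'buyable' and 'artcolumndata' (the article number) both appear in the output above
    if product.get('buyable'):
        print(product.get('artcolumndata'), 'appears to be in stock (på lager)')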

Related

I can't access the text in the span using BeautifulSoup

Hi everyone, I receive an error message when executing this code:
from bs4 import BeautifulSoup
import requests
from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all("tr")
for table in tables:
    movie_name = table.find("span", class_="secondaryInfo").text
    print(movie_name)
output:
movie_name = table.find("span", class_ = "secondaryInfo").text
AttributeError: 'NoneType' object has no attribute 'text'
Your loop also hits the first row, which is the table header; it doesn't have that class because it doesn't list a price. An alternative is to simply exclude the header with a CSS selector of nth-child(n+2). You also only need requests.
from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
for row in soup.select('tr:nth-child(n+2)'):
    movie_name = row.find("span", class_="secondaryInfo")
    print(movie_name.text)
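If you prefer, the row filter and the span lookup can be combined into one selector; this is just a sketch of the same idea in a single select call:
for span in soup.select('tr:nth-child(n+2) span.secondaryInfo'):
    print(span.text)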
Just use the SelectorGadget Chrome extension to grab a CSS selector by clicking on the desired element in your browser, without inventing anything superfluous. However, it doesn't work perfectly if the HTML structure is terrible.
You're looking for this:
for result in soup.select(".titleColumn a"):
    movie_name = result.text
Also, there's no need to use HTMLSession unless you want to persist certain parameters across requests to the same host (website).
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests

# a user-agent header makes the request look like a real browser visit;
# this can reduce the chance (a little bit) of being blocked by a website
# and help avoid an IP rate limit or a permanent ban
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht", headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

for result in soup.select(".titleColumn a"):
    movie_name = result.text
    print(movie_name)
# output
'''
Eternals
Dune: Part One
No Time to Die
Venom: Let There Be Carnage
Ron's Gone Wrong
The French Dispatch
Halloween Kills
Spencer
Antlers
Last Night in Soho
'''
P.S. I have a dedicated web scraping blog. If you need to parse search engines, have a try using SerpApi.
Disclaimer: I work for SerpApi.

Trying to access hidden <div> tags when web scraping in python

So I'm trying to extract some data from a website by web scraping with Python, but some of the div tags are not expanding to show the data that I want.
This is my code.
import requests
from bs4 import BeautifulSoup as soup

uq_url = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
}
web_r = requests.get(uq_url, headers=headers)
web_soup = soup(web_r.text, 'html.parser')
print(web_soup.prettify())
This is what the code scrapes, but it won't extract any of the data in the div with id="app". That div is supposed to contain a lot of data, as in the second picture. Any help would be appreciated.
All that content is present within a script tag, as shown in your image. You can regex out the appropriate JavaScript object, then handle the unquoted keys with hjson in order to parse it. Then extract whatever you want:
import requests, re, hjson
from bs4 import BeautifulSoup as bs  # there is some data embedded as html you may wish to parse later from the json

r = requests.get('https://my.uq.edu.au/programs-courses/requirements/program/2451/2021', headers={'User-Agent': 'Mozilla/5.0'})
data = hjson.loads(re.search(r'window\.AppData = ([\s\S]+?);\n', r.text).group(1))
# hjson.dumpsJSON(data['programRequirements'])
core_courses = data['programRequirements']['payload']['components'][1]['payload']['body'][0]['body']
for course in core_courses:
    if 'curriculumReference' in course:
        print(course['curriculumReference'])
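Note: hjson is a third-party package (pip install hjson); unlike the standard json module, it tolerates the unquoted keys present in the embedded JavaScript object.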

Why is the data retrieved showing as blank instead of outputting the correct numbers?

I can't seem to see what is missing. Why is the response not printing the ASINs?
import requests
import re
urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2'
]

for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    asins = set(re.findall(r'/[^/]+/dp/([^"]+)', decoded_content))
    print(asins)
Output:
set()
set()
[Finished in 0.735s]
Regular expressions should not be used to parse HTML; practically every Stack Overflow answer to questions like this recommends against it. It is difficult to write a regular expression complex enough to get the data-asin value from each <div>. The BeautifulSoup library will make this task much easier. But if you must use regex, this code will return everything inside the body tags:
re.findall(r'<body.*?>(.+?)</body>', decoded_content, flags=re.DOTALL)
Also, print decoded_content and read the HTML. You might not be receiving the same page that you see in the web browser. Using your code I just get an error message from Amazon or a small check to see if I am a robot. If you do not attach real headers to your request, big websites like Amazon will not return the page you want; they try to prevent people from scraping their site.
Here is some code that works using the BeautifulSoup library. You need to install the library first: pip3 install bs4.
from bs4 import BeautifulSoup
import requests

def getAsins(url):
    headers = requests.utils.default_headers()
    headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
                    'Accept-Language': 'en-US, en;q=0.5'})
    decoded_content = requests.get(url, headers=headers).content.decode()
    soup = BeautifulSoup(decoded_content, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
result = getAsins('https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2')
print(result)
'''
{None: 'B07RBN5C9C', '8652921a-81ee-4e15-b12d-5129c3d35195': 'B07P15JL3T', 'cb25b4bf-efc3-4bc6-ae7f-84f69dcf131b': 'B0886YWLC9', 'bc730e28-2818-472d-bc03-6e9fb97dcaad': 'B089F8R7SQ', '339c4ca0-1d24-4920-be60-54ef6890d542': 'B08GQW447N', '4532f725-f416-4372-8aa0-8751b2b090cc': 'B08DD5559K', 'a0e17b74-7457-4df7-85c9-5eefbfe4025b': 'B08BXHCQKR', '52ef86ef-58ac-492d-ad25-46e7bed0b8b9': 'B087XR383W', '3e79c338-525c-42a4-80da-4f2014ed6cf7': 'B07H5VVV1H', '45007b26-6d8c-4120-9ecc-0116bb5f703f': 'B07DJW4WZC', 'dc061247-2f4c-4f6b-a499-9e2c2e50324b': 'B07YLGXLYQ', '18ff6ba3-37b9-44f8-8f87-23445252ccbd': 'B01FST8A90', '6d9f29a1-9264-40b6-b34e-d4bfa9cb9b37': 'B088MZ4R82', '74569fd0-7938-4375-aade-5191cb84cd47': 'B07SXMV28K', 'd35cb3a0-daea-4c37-89c5-db53837365d4': 'B07DFJJ3FN', 'fc0b73cc-83dd-44d9-b920-d08f07be76eb': 'B07KYC1VL7', 'eaeb69d1-a2f9-4ea4-ac97-1d9a955d706b': 'B076PRWVFG', '0aafbb75-1bac-492c-848e-a046b2de9978': 'B07Q47W1B4', '9e373245-9e8b-4564-a32f-42baa7b51d64': 'B07C4SGGZ2', '4af7587a-98bf-41e0-bde6-2a2fad512d95': 'B07SJ2T3CW', '8635a92e-22a7-4474-a27d-3db75c75e500': 'B08D44W56B', '49d752ce-5d68-4323-be9b-3cbb34c8b562': 'B086JQGB7W', '6398531f-6864-4c7b-9879-84ee9de57d80': 'B07XD3TK36'}
'''
If you are reading the html from a file, then:
from bs4 import BeautifulSoup

def getAsins(location_to_file):
    # a with-block closes the file automatically once parsing is done
    with open(location_to_file) as file:
        soup = BeautifulSoup(file, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
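For example, assuming you saved a results page locally (the filename here is hypothetical):
result = getAsins('amazon_xbox_games.html')
print(result)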

Crawling google search url list with python

I'd like to scrape Google search result URLs with Python.
Here's my code:
import requests
from bs4 import BeautifulSoup
def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:])
    return result

search('computer')
Then I get results. The first URL of the list is the Wikipedia one, which is:
'https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQFggTMAA&usg=AOvVaw2nvT-2sO4iJenW_fkyCS3i',
'?q=computer&num=100&ie=UTF-8&prmd=ivnsbp&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQsAQIHg'
I want to get the clean URL, which is 'https://en.wikipedia.org/wiki/Computer', and likewise for all the other search results.
How can I modify my code?
Edit: As you can see in the image below, I want to get the real URL (marked yellow), not the messy, long URL above.
How about appending
.split('&')[0]
to your code, so that it becomes:
import requests
from bs4 import BeautifulSoup
def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:].split('&')[0])
    return result

search('computer')
[EDIT]
Taking https://en.wikipedia.org/wiki/Computer as an example:
Through chrome developer tools the url looks clean.
Since it belongs to <h3 class="r">, your code should work fine and return the clean url.
Instead, if you replace
result.append(i.find('a', href=True)['href'][7:])
with
print(i)
then in my terminal it returns the following for the above link:
/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ
You can see that /url?q= has been prepended, and &sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ has been appended.
By looking at other links as well, I observed that the prepended part always looks like /url?q=, and the appended part always begins with a &.
Therefore it's my belief that my original answer should work:
result.append(i.find('a', href=True)['href'][7:].split('&')[0])
[7:] removes the prepended string, and split('&')[0] removes the appended string.
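To make the two operations concrete, here they are applied to the example href above ('/url?q=' is exactly 7 characters):
href = '/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ'
# strip the '/url?q=' prefix, then drop everything from the first '&'
print(href[7:].split('&')[0])  # https://en.wikipedia.org/wiki/Computer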
I found the solution.
This modification to the search function works:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword), headers=headers).text
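Putting the two fixes together, a sketch of the full revised function (Google's markup changes over time, so the h3.r selector reflects the layout at the time of writing):
import requests
from bs4 import BeautifulSoup

def search(keyword):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword), headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        # strip the '/url?q=' prefix and the '&...' tracking suffix
        result.append(i.find('a', href=True)['href'][7:].split('&')[0])
    return result

print(search('computer'))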

Crawling the pair of tags from html

I am using Python 3.6 and PyCharm 2016.2 as my editor.
I would like to crawl the pairs of contents inside "th" : "td" tags when the "td" tag has a child input tag with checked='checked'. I tried regex, find_all from BeautifulSoup, and other approaches, but I still get error messages.
Please help.
This is the website address: http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber
Below is my code:
from bs4 import BeautifulSoup
import urllib.request

popup_inspection = "http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber"
res = urllib.request.urlopen(popup_inspection)
html = res.read()
soup_inspection = BeautifulSoup(html, 'html.parser')
insp_trs = soup_inspection.find_all('tr')

for insp_tr in insp_trs:
    th = insp_tr.find('th')
    td = insp_tr.find('td')
    if td.find('input', checked=''):
        print(th, ":", td)
The idea is to use a searching function to locate the th elements that are followed by a td sibling. Then we can locate the input element with type="radio" and a present checked attribute. If there is one, we can locate the label element coming right after the radio input.
Sample implementation:
import requests
from bs4 import BeautifulSoup

url = "http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=652691&tbl=cyber"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}
    page = session.get(url)

    soup = BeautifulSoup(page.content, "html.parser")
    for label in soup.find_all(lambda tag: tag.name == "th" and tag.find_next_sibling('td')):
        value_cell = label.find_next_sibling('td')

        # if this is a radio-button cell, get the checked input
        selected_value = value_cell.find("input", type="radio", checked=True)
        if selected_value:
            value = selected_value.find_next("label").get_text()
            print(label.get_text(), value)
Currently prints:
10. 보증유형 자가보증
13. 사고/침수유무(단순수리제외) 무
12. 불법구조변경 없음
This, of course, can and should be further improved, but I hope the techniques used in the snippet help you get to the final solution.
