How to parse a dynamically loading wildberries page? - python

I need to extract data from a comment block that is loaded dynamically. I have tried many methods from the Internet; all of them return an empty array.
The program cannot access the comments because they are loaded only when the user scrolls through the page to them. How can I get all the content of the page?
Page - https://www.wildberries.ru/catalog/22063490/detail.aspx?targetUrl=XS
Here is the code that I have now.
import requests as req
from bs4 import BeautifulSoup as bs
import pandas as pd
from sqlalchemy import create_engine
import psycopg2
from selenium import webdriver
import time
import lxml

driver = webdriver.Chrome(executable_path=r"D:\Downloads\chromedriver.exe")
safe_delay = 15

def read_comments(url):
    response = req.get(url)
    response.encoding = 'utf-8'
    driver.get(url)
    time.sleep(safe_delay)
    html = driver.page_source
    soup = bs(html, "html.parser")
    #soup = bs(response.text, 'lxml')
    coms = soup.find_all('div', class_='comment j-b-comment')
    return coms

print(read_comments('https://www.wildberries.ru/catalog/22063490/detail.aspx?targetUrl=XS'))

You will need to scroll down the page for the comments to load. You can do this by sending the space key repeatedly, with a small sleep value to give the page time to load.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup as bs
import time

driver = webdriver.Chrome()
url = 'https://www.wildberries.ru/catalog/22063490/detail.aspx?targetUrl=XS'

def read_comments(url):
    driver.get(url)
    for x in range(10):
        actions = ActionChains(driver)
        actions.send_keys(Keys.SPACE)
        actions.perform()
        time.sleep(.5)
    html = driver.page_source
    soup = bs(html, "html.parser")
    coms = soup.find_all('div', class_='comment j-b-comment')
    return coms
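If the fixed sleeps turn out to be flaky, an explicit wait can be combined with the scrolling. The following is an untested sketch of that idea; it assumes the comment markup from the question (div.comment.j-b-comment) is still what the page uses, scrolls one viewport at a time, and stops as soon as at least one comment element appears.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome()

def read_comments(url, max_scrolls=20):
    driver.get(url)
    for _ in range(max_scrolls):
        # Scroll one viewport down so the lazy-loaded comment block gets triggered.
        driver.execute_script("window.scrollBy(0, window.innerHeight);")
        try:
            # Wait up to 3 seconds per step for at least one comment to be present.
            WebDriverWait(driver, 3).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.comment.j-b-comment"))
            )
            break
        except TimeoutException:
            continue  # comments not loaded yet, keep scrolling
    soup = bs(driver.page_source, "html.parser")
    return soup.find_all('div', class_='comment j-b-comment')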

Related

python bs4, how to scrape this text in html?

The site URL: https://n.news.naver.com/mnews/article/421/0006111920
I want to scrape the "5" from the HTML below.
I used this code: soup.select_one('span.u_likeit_text._count').get_text()
The result is '추천' instead of the number.
html code
<span class="u_likeit_text _count num">5</span>
The main issue here is that the count is generated dynamically by JavaScript, so it is not present in the response and therefore not in your soup.
You could use selenium to render the page the way a browser does and pass driver.page_source to BeautifulSoup:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://n.news.naver.com/mnews/article/421/0006111920")
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.select_one('span.u_likeit_text._count').get_text()
Output:
8
You have to separate the classes with a space instead of joining them with a dot.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span class='u_likeit_text _count num'>5</span>", 'html.parser')
print(soup)
seven_day = soup.find_all("span" , class_="u_likeit_text _count num")
print(seven_day[0].text)

Scraping using beautiful soup not working fully?

I was trying to scrape some data with BeautifulSoup in Python from the site 'https://www.geappliances.com/ge-appliances/kitchen/ranges/', which lists some products.
import unittest, time, random
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
import pandas as pd

links = []
browser = webdriver.Firefox(executable_path="C:\\Users\\drivers\\geckodriver\\win64\\v0.29.1\\geckodriver.exe")
browser.get("https://www.geappliances.com/ge-appliances/kitchen/ranges/")
content = browser.page_source
soup = BeautifulSoup(content, "html.parser")

for main in soup.findAll('li', attrs={'class': 'product'}):
    name = main.find('a', href=True)
    if (name != ''):
        links.append((name.get('href')).strip())

print("Got links : ", len(links))
exit()
Here is the output I get:
Got links: 0
I printed the soup and saw that this part was not in it. I have been trying to get around this problem to no avail.
Am I doing something wrong? Any suggestions are appreciated. Thanks.
Study the source of the webpage and check your findAll call.
You should wait until the page loads. Use time.sleep() to pause execution for a while.
You can try like this.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = 'https://www.geappliances.com/ge-appliances/kitchen/ranges/'
driver = webdriver.Chrome("chromedriver.exe")
driver.get(url)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, 'lxml')
u = soup.find('ul', class_='productGrid')
for item in u.find_all('li', class_='product'):
    print(item.find('a')['href'])

driver.close()
Output:
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Induction-Fingerprint-Resistant-Range-with-In-Oven-Camera-PHS93XYPFS
/appliance/GE-Profile-30-Electric-Pizza-Oven-PS96PZRSS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Double-Oven-Convection-Fingerprint-Resistant-Range-PGS960YPFS
/appliance/GE-Profile-30-Smart-Dual-Fuel-Slide-In-Front-Control-Fingerprint-Resistant-Range-P2S930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Double-Oven-Convection-Fingerprint-Resistant-Range-PS960YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Induction-and-Convection-Range-with-No-Preheat-Air-Fry-PHS930BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Fingerprint-Resistant-Front-Control-Induction-and-Convection-Range-with-No-Preheat-Air-Fry-PHS930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Range-with-No-Preheat-Air-Fry-PGS930BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Front-Control-Gas-Fingerprint-Resistant-Range-with-No-Preheat-Air-Fry-PGS930YPFS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Convection-Range-with-No-Preheat-Air-Fry-PSS93BPTS
/appliance/GE-Profile-30-Smart-Slide-In-Electric-Convection-Fingerprint-Resistant-Range-with-No-Preheat-Air-Fry-PSS93YPFS
/appliance/GE-30-Slide-In-Front-Control-Gas-Double-Oven-Range-JGSS86SPSS

fixed url scraping (Selenium)

Hello! I have a question.
I want to scrape company names and ticker names from "https://www.nasdaq.com/market-activity/stocks/screener".
I think Selenium can help with this, but my code only works on the first page.
I'm sorry for my poor English.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
import time

nasduq_all = []      # ticker + company
nasduq_ticker = []   # ticker lists (odd-numbered entries)
nasduq_company = []  # company lists (even-numbered entries)
dict_nasduq = {}     # ticker + company

page_url = 'https://www.nasdaq.com/market-activity/stocks/screener'

driver = webdriver.Chrome('/Users/kim/Desktop/dgaja/chromedriver')
driver.implicitly_wait(2)
driver.get(page_url)
time.sleep(1)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# First of all, I'm only trying to go to page 2. :[
driver.find_element_by_xpath("/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[5]/button[2]").send_keys(Keys.ENTER)
time.sleep(10)

ticker = soup.find("tbody", {"class": "nasdaq-screener__table-body"}).find_all('a')
for i in ticker:  # text
    name = i.text
    nasduq_all.append(name)

print(nasduq_all)
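The core problem is that the soup is built from driver.page_source before the button is clicked, and nothing is parsed afterwards, so only page 1 is ever read. Below is a rough, untested sketch of one way around that: it reuses the button XPath from the question, re-parses the page source after every click, and uses an arbitrary five-page limit.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome('/Users/kim/Desktop/dgaja/chromedriver')
driver.get('https://www.nasdaq.com/market-activity/stocks/screener')
time.sleep(3)  # give the screener table time to render

nasduq_all = []
for page in range(5):  # arbitrary page limit for the sketch
    # Re-parse the page source on every iteration, after the table has changed.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tbody = soup.find("tbody", {"class": "nasdaq-screener__table-body"})
    for a in tbody.find_all('a'):
        nasduq_all.append(a.text)
    # Move to the next page and give the table time to refresh.
    driver.find_element_by_xpath(
        "/html/body/div[2]/div/main/div[2]/article/div[3]/div[1]/div/div/div[3]/div[5]/button[2]"
    ).send_keys(Keys.ENTER)
    time.sleep(3)

print(nasduq_all)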

Attempting to generate links from all products on website using Selenium

The main goal of the script is to generate links for all the products available on the website; the products are segregated by category.
The issue I am having is that I can only generate links for one category (infusion), specifically the URL I have saved. The second category/URL I would like to include is: https://www.vatainc.com/wound-care.html
Is there a way I can loop through multiple category URLs, that have the same effect of the script I already have?
Here is my code:
import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []
url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)

links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]

for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("div", {"class": "product-view"})

print(links)
Here is some of the output; there are approximately 52 links from this one URL.
['https://www.vatainc.com/infusion/0705-vascular-access-ultrasound-phantom-1616.html', 'https://www.vatainc.com/infusion/0751-simulated-ultrasound-blood.html', 'https://www.vatainc.com/infusion/body-skin-shell-0242.html', 'https://www.vatainc.com/infusion/2366-advanced-four-vein-venipuncture-training-aidtm-dermalike-iitm-latex-free-1533.html',
You could just loop through the two URLs. But if you want to pull the category URLs first and then loop through them, this works:
import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
import pandas as pd

root_url = 'https://www.vatainc.com/'
service = service.Service(r'C:\chromedriver_win32\chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(root_url)
time.sleep(2)

# Grab the urls, but only keep the ones of interest
urls = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//ol[contains(@class, 'nav-primary')]/li/a")]
urls = [x for x in urls if 'html' in x]

# It produces duplicates, so drop those and include ?limit=all to query all products
urls_list = pd.Series(urls).drop_duplicates().tolist()
urls_list = [x + '?limit=all' for x in urls_list]
driver.close()

all_product = []
# loop through those urls and the links to generate a final product list
for url in urls_list:
    print('Url: ' + url)
    driver = webdriver.Remote(service.service_url, capabilities)
    driver.get(url)
    time.sleep(2)
    links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    for link in links:
        html = requests.get(link).text
        soup = BeautifulSoup(html, "html.parser")
        products = soup.findAll("div", {"class": "product-view"})
        all_product.append(link)
        print(link)
    driver.close()
This produces a list of 303 links.
Just use a simple for loop to enumerate through the two URLs:
import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []
urls = ["website", "website2"]

service = service.Service('/Users/Jonathan/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)

for index, url in enumerate(urls):
    driver.get(url)
    time.sleep(2)
    links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    for link in links:
        html = requests.get(link).text
        soup = BeautifulSoup(html, "html.parser")
        products = soup.findAll("div", {"class": "product-view"})
    print(links)

Python BeautifulSoup soup.find

I want to scrape some specific data from a website using urllib and BeautifulSoup.
I'm trying to fetch the text "190.0 kg". As you can see in my code, I have tried using attrs={'class': 'col-md-7'},
but this returns the wrong result. Is there any way to specify that I want the text between the <h3> tags?
from urllib.request import urlopen
from bs4 import BeautifulSoup

# specify the url
quote_page = 'https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne'
# query the website and return the html to the variable 'page'
page = urlopen(quote_page)
# parse the html using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
# take out the <div> of name and get its value
name_box = soup.find('div', attrs={'class': 'col-md-7'})
name = name_box.text.strip()
print(name)
Since this content is dynamically generated, there is no way to access that data with a plain HTTP request (urllib or requests).
You can use selenium webdriver to accomplish this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_driver = "path_to_chromedriver"
driver = webdriver.Chrome(chrome_options=chrome_options,executable_path=chrome_driver)
driver.get('https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne')
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
current_lifter = soup.find("div", {"id":"current_lifter"})
value = current_lifter.find_all("div", {'class':'row'})[2].find_all("h3")[0].text
driver.quit()
print(value)
Just be sure to have the chromedriver executable on your machine.
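If hard-coding the chromedriver path is a nuisance, the webdriver_manager approach used in the earlier answers should work here as well; a minimal, untested sketch:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.add_argument("--headless")
# webdriver_manager downloads a matching chromedriver and returns its path
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)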
