I am trying to scrape a page with BeautifulSoup and am getting back an empty result set when I call findAll.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='
# Download the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# Parse it and look for the product containers
page_soup = soup(page_html, 'html.parser')
containers = page_soup.findAll("div", {"class": "product"})
containers
I also got empty results when I tried the approaches from these related questions:
findAll returning empty for html
and BeautifulSoup find_all() returns no data
Can anyone offer any help?
The page content is loaded with JavaScript, so you can't just fetch it with urllib and parse it with BeautifulSoup. You have to use another module, such as Selenium, to simulate the JavaScript execution.
Here is an example:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='
driver = webdriver.Firefox()  # requires geckodriver on your PATH
driver.get(url)
page = driver.page_source  # the DOM after the JavaScript has run
driver.quit()
page_soup = soup(page, 'html.parser')
containers = page_soup.findAll("div", {"class": "product"})
print(containers)
print(len(containers))
OUTPUT:
[
<div class="product "> ...
...,
<div class="product hl-product hookLogic highlighted straplineRow" ...
]
64
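Why this works: driver.page_source returns the DOM after Firefox has executed the page's JavaScript, so the same findAll call that matched nothing against the raw urllib response now finds the rendered product divs.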
Related
I am trying to scrape some data from a website, but when I print it I just get the tags back, without the information inside them.
This is the code:
# Imports
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# URL
my_url = 'https://website.com'

# Opening the connection and grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()

# Closing the page
uClient.close()

# Parsing the HTML
page_soup = soup(page_html, "html.parser")
price = page_soup.findAll("div", {"id": "lastTrade"})
print(price)
This is what I get back:
[<div id="lastTrade"> </div>]
So can anyone tell me what I have to change or add so that I receive the actual information from inside this tag?
Maybe loop through your list like this:
for res in price:
    print(res.text)
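Note that in your output the div is already empty in the static HTML ([<div id="lastTrade"> </div>]), so res.text will only return whitespace. The value inside lastTrade is most likely filled in by JavaScript, in which case the Selenium approach from the first answer above applies here as well.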
The image shows the area that I want to access.
containers = pagebs('div',{'class':"search-content"})
When I print containers, it just displays
[<div class="search-content">
</div>]
with nothing inside it. I tried searching for the tags inside it, but that didn't work.
Is there a workaround, or can I just not access it no matter what I do?
This is what I've written so far:
from bs4 import BeautifulSoup as BS
from urllib.request import urlopen as uReq
url = 'https://bahrain.sharafdg.com/?q=asus%20laptops&post_type=product'
uclient = uReq(url)
pagehtml = uclient.read()
uclient.close()
pagebs = BS(pagehtml , 'html.parser')
containers = pagebs('div',{'class':"search-content"})
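This question shows the same symptom as the others in this thread: the search-content div is filled in by JavaScript, so it is empty in the static HTML. A minimal sketch along the lines of the Selenium answers above (untested against this site; it assumes geckodriver is on your PATH, and that the product tags appear under search-content once the page has rendered):
from bs4 import BeautifulSoup as BS
from selenium import webdriver

url = 'https://bahrain.sharafdg.com/?q=asus%20laptops&post_type=product'

driver = webdriver.Firefox()  # opens a real browser so the JavaScript runs
driver.get(url)
pagehtml = driver.page_source  # the DOM after rendering
driver.quit()

pagebs = BS(pagehtml, 'html.parser')
containers = pagebs('div', {'class': 'search-content'})
print(len(containers))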
I'm trying to scrape data from Robintrack, but I cannot get the data from the increase/decrease section; I can only scrape the home page data. Here is my soup:
import bs4
import requests
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
robin_track_url= 'https://robintrack.net/popularity_changes?changeType=increases'
#r = requests.get('https://robintrack.net/popularity_changes?changeType=increases')
#soup = bs4.BeautifulSoup(r.text, "xml")
#Grabs and downloads html page
uClient = uReq(robin_track_url)
page_html = uClient.read()
uClient.close()
#Does HTML parsing
page_soup = soup(page_html, "html.parser")
print("On Page")
print(page_soup.body)
stocks = page_soup.body.findAll("div", {"class":"ReactVirtualized__Table__row"})
print(len(stocks))
print(stocks)
Am I doing something wrong?
Your version will not work because the data you want is loaded via JavaScript; requests (and urllib) fetch only the static page.
If you want the data, request it from the site's API endpoint directly:
requests.get('https://robintrack.net/api/largest_popularity_increases?hours_ago=24&limit=50&percentage=false&min_popularity=50&start_index=0').json()
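Spelled out as a runnable snippet (the endpoint and query parameters are exactly the ones above; endpoints like this can be found in the browser dev tools' Network tab while the page loads — the shape of the returned JSON isn't shown here, so print it to inspect it):
import requests

# API endpoint from the answer above; the query string controls the time
# window, result count, and pagination
url = ('https://robintrack.net/api/largest_popularity_increases'
       '?hours_ago=24&limit=50&percentage=false&min_popularity=50&start_index=0')
data = requests.get(url).json()
print(data)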
I want to get tables from this website with this code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.flashscore.pl/pilka-nozna/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all('table', {'class': 'soccer'})
print(len(containers))
But when I try to check how many tables I got with print(len(containers)), I get 0.
Any solutions?
Edit: it's possible the page is dynamic. You can use requests-html, which lets the page render before you pull the HTML, or you can use Selenium, as I did here.
This resulted in 42 elements with table class="soccer":
import bs4
from selenium import webdriver

url = 'https://www.flashscore.pl/pilka-nozna/'

# Path to your local chromedriver; a raw string keeps the backslashes intact
browser = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
browser.get(url)
html = browser.page_source
browser.close()

soup = bs4.BeautifulSoup(html, 'html.parser')
containers = soup.find_all('table', {'class': 'soccer'})
print(len(containers))
OUTPUT:
42
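For reference, a minimal sketch of the requests-html route mentioned above — untested against this site, and note that render() downloads a Chromium build the first time it runs:
from requests_html import HTMLSession
import bs4

session = HTMLSession()
r = session.get('https://www.flashscore.pl/pilka-nozna/')
r.html.render()  # executes the page's JavaScript before we read the HTML
soup = bs4.BeautifulSoup(r.html.html, 'html.parser')
containers = soup.find_all('table', {'class': 'soccer'})
print(len(containers))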
findAll doesn't find the class I need. However, I was able to find the class above that one, but its data structure is not well organized.
Do you know what we can do to get the data, or to organize the output from the class above, which has all the data lumped together?
Please see the HTML below and the images.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='
#Opening a connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parse
page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("div", {"class":"wine-explorer__results__item"})
len(container)
Thanks everyone; as you all suggested, a module that can execute the JavaScript was needed to select that class. I've used Selenium in this case, but PyQt5 might be a better option.
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszU0UMtNrLA1MVBLrrQtLVYrsDVUK7ZNTlQrS7YtKSpNVSsviY4FioEpIwhlDKFMIJQ5VM4EAJCfGxQ='

# Open the page in a real browser so the JavaScript runs
driver = webdriver.Firefox()
driver.get(my_url)

# Grab the rendered DOM and parse that instead of the raw response
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()  # close the browser once we have the HTML
#print(html)
html_page_soup = soup(html, "html.parser")
container = html_page_soup.findAll("div", {"class": "wine-explorer__results__item"})
len(container)
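A small follow-up on the Selenium route: if you don't want a Firefox window opening on every run, it can be started headless, along these lines (a sketch; the flag applies to recent Selenium/geckodriver versions):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.add_argument('-headless')  # run Firefox without a visible window
driver = webdriver.Firefox(options=opts)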
You can also use the dryscrape module with bs4, because the wine-explorer element is created by JavaScript; dryscrape provides the JavaScript support.
Try using the following instead:
container = page_soup.findAll("div", {"class": "wine-explorer__results"})
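This matches what you observed: the outer wine-explorer__results container is present in the static HTML, but the individual wine-explorer__results__item divs inside it are added by JavaScript, which is why the accepted answer above needs Selenium.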