Here is the site; I believe it's hosted on GitHub.
I am having trouble scraping the values in the input fields, specifically the private key and the public key. I tried using Selenium and BeautifulSoup, but they give empty values, or rather None (the HTML doesn't contain the keys).
I checked the page source and it seems that the input's value is empty (not contained within the HTML), but when the page loads, the value is visible and present in the input box.
Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()

def openit(browser):
    browser.get('file:///Users/Aha/Desktop/Code/english/index.html')
    time.sleep(5)
    nav = browser.find_element_by_id("addr")
    print(nav.text)  # prints an empty string
    return browser.page_source

soupdata = openit(browser)
soup = BeautifulSoup(soupdata, 'html.parser')
val = soup.find('input', {'id': 'addr'}).get('value')
print(val)  # prints None
You can retrieve that value via the execute_script method of the Selenium webdriver:
print(browser.execute_script("return $('#addr').val();"))
Output:
14ropRunS5iY9sx9d9mpCRNEsXj7RtTtuS
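If the page doesn't expose jQuery, a plain-DOM script, or Selenium's own get_attribute, should read the live value just as well. A hedged sketch, reusing the addr id from above:
# Plain DOM JavaScript instead of jQuery
val = browser.execute_script("return document.getElementById('addr').value;")
print(val)

# Or let Selenium read the current value of the input directly
print(browser.find_element_by_id("addr").get_attribute('value'))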
Related
I am doing web scraping on a Newegg page and I want to scrape the product's consumer rating. I am using this code:
import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product').text
soup = bs(page, 'lxml')
the_rating = soup.find_all(class_='rating rating-4')
print(the_rating)
And it returns only this one element, even though I am using find_all:
[<i class="rating rating-4"></i>]
I get [] with your code; to see why, break it up and print the response status and URL:
r = requests.get('https://www.newegg.com/msi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc/p/N82E16814137632?Description=gpu&cm_re=gpu-_-14-137-632-_-Product')
print(f'<{r.status_code} {r.reason}> from {r.url}')
# soup = bs(r.content, 'lxml')
output:
<200 OK> from https://www.newegg.com/areyouahuman?referer=/areyouahuman?referer=https%3A%2F%2Fwww.newegg.com%2Fmsi-geforce-rtx-3060-rtx-3060-ventus-2x-12g-oc%2Fp%2FN82E16814137632%3FDescription%3Dgpu%26cm_re%3Dgpu-_-14-137-632-_-Product&why=8&cm_re=gpu-_-14-137-632-_-Product&Description=gpu
It's been redirected to a CAPTCHA...
Anyway, even if you get past that (I couldn't, so I just pasted and parsed the response from my browser's network logs to test), all you can get from page is the source HTML, which does not contain any elements with class="rating rating-4". Using Selenium and waiting for the page to finish loading yielded a bit more, but even then there weren't any exact matches.
[There were some matches when I inspected in the browser, but only if I wasn't in incognito mode, which is likely why Selenium didn't find them either.]
So the site probably adds or removes some classes based on the source of the request. If you just want to get all elements with both the rating and rating-4 classes (which will include elements with class="rating is-large rating-4"), you can use .find_all with a lambda (or define a separate function), or use .select with a CSS selector like:
the_rating = soup.select('.rating.rating-4') # shorter than
# .find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', []))))
[Just make sure you have the full/correct HTML.]
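For a self-contained illustration of the two equivalent approaches, here is a small sketch on a hypothetical fragment of the page's markup (the real HTML may of course differ):
from bs4 import BeautifulSoup

html = '<i class="rating rating-4"></i><i class="rating is-large rating-4"></i>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.rating.rating-4'))  # CSS selector: matches both tags
print(soup.find_all(lambda t: {'rating', 'rating-4'}.issubset(set(t.get('class', [])))))  # same two tags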
I'm trying to make an automation program to scrape part of a website. But this website is built with JavaScript, and the part of the website I want to scrape is in a shadow DOM.
So I figured I should use Selenium to go to that website and use this code to access elements in the shadow DOM:
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
and use
driver.page_source
to get the HTML of that website. But this code doesn't show me elements that are inside the shadow DOM.
I've tried combining those two and tried
root1 = driver.find_element(By.CSS_SELECTOR, "path1")
shadow_root = expand_shadow_element(root1)
html = shadow_root.page_source
but I got
AttributeError: 'ShadowRoot' object has no attribute 'page_source'
for a response. So I think I need to use BeautifulSoup to scrape data from that page, but I can't figure out how to combine BeautifulSoup and Selenium to scrape data from a shadow DOM.
P.S. If the part I want to scrape is
<h3>apple</h3>
<p>1$</p>
<p>red</p>
I want to scrape that code exactly, not
apple
1$
red
You would use BeautifulSoup here as follows:
soup = BeautifulSoup(driver.page_source, 'lxml')
my_parts = soup.select('h3') # for example
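If you want the markup itself rather than just the text (as in the P.S. above), note that converting a tag to a string gives its HTML. A small sketch using the P.S. fragment:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h3>apple</h3><p>1$</p><p>red</p>", 'lxml')
for tag in soup.select('h3, p'):
    print(str(tag))  # prints <h3>apple</h3>, <p>1$</p>, <p>red</p> verbatim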
Most likely you need to wait for the element to appear, so set an implicit or explicit wait; once the element has loaded you can parse that page source with BeautifulSoup for the HTML result.
driver.implicitly_wait(15)  # in seconds
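For the explicit-wait variant, a hedged sketch (reusing the hypothetical "path1" selector for the shadow host):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the shadow host to be present before expanding it
host = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "path1"))
)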
None of the answers solved my problem, so I tinkered with the code and this worked! The answer was get_attribute:
text = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')
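Pulling the pieces together as one hedged sketch (the URL and the "host-element"/"path2" selectors are placeholders for the real ones on the site):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com")  # placeholder URL
driver.implicitly_wait(15)

# Expand the shadow root of the host element
host = driver.find_element(By.CSS_SELECTOR, "host-element")
shadow_root = driver.execute_script('return arguments[0].shadowRoot', host)

# innerHTML gives the raw markup inside the shadow DOM, which BeautifulSoup can parse
inner_html = shadow_root.find_element(By.CSS_SELECTOR, "path2").get_attribute('innerHTML')
soup = BeautifulSoup(inner_html, 'html.parser')
print(soup.prettify())  # e.g. <h3>apple</h3> <p>1$</p> <p>red</p>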
So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click the nextPage button it runs a script that loads the next batch of products without actually moving to another page.
So when I run the next loop, all I get is the same content as the first one, even though the content in the browser I'm emulating is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.my-website.com/search/results-34y1i")
soup = BeautifulSoup(driver.page_source, 'html.parser')
time.sleep(2)

# /////////// code to find total number of pages

currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you're only fetching page 1, as shown in the next line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
But as you can see, there is a query parameter called page in the URL that determines which page's HTML you are fetching. So every time you loop to a new page, you'll have to fetch the new HTML content with the driver by changing the page query parameter. For example, in your loop it will be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
After you fetch the new HTML you'll be able to access the new elements that are present on each of the different pages, as required.
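As a hedged sketch of that loop (assuming totalPages has already been determined earlier in your script), the key point is to re-fetch and re-parse the page source on every iteration:
from bs4 import BeautifulSoup

base_url = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
            "?productLineName=magic&setName=commander-streets-of-new-capenna"
            "&page={page}&view=grid")

for currentPage in range(1, totalPages + 1):
    driver.get(base_url.format(page=currentPage))
    time.sleep(5)  # or use an explicit wait for the product grid
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ///////// code to find the products in this page's soup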
I'm trying to scrape matches and their respective odds from a local bookie site, but with every site I try my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all the sites below for almost a month but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
1. Load some HTML structure for the page,
2. Use JavaScript to fill in the contents.
You can prove this to yourself by right clicking on the page, do "view page source" and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
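A quick way to confirm the same thing from Python (a small sketch, assuming the site answers plain requests at all): fetch the page and check whether the container class appears in the raw HTML.
import requests

html = requests.get("https://www.betpawa.ug/", timeout=5).text
print("events-container" in html)  # expected: False, since the class is added by JavaScript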
You should consider using urllib.request instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
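Put together, that approach looks roughly like the sketch below; note that it still fetches only the static HTML, so any JavaScript-rendered odds will not be present.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req)
html = BeautifulSoup(res, 'html.parser')
print(html.title)  # proves we got a page back, but not the dynamic content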
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable you can see that the page would display a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options = chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the loop simply extracts the text for each match div that has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
- which of the available information you want,
- how to identify this information inside the match div's structure,
- in which data type you need this information.
To make this easy, run the program and open Chrome's developer tools with the F12 key; in the top left corner you will see the icon for "select an element ...".
If you click that icon and then click the desired element in the browser, the corresponding source is shown in the area below the icon.
Analyse it carefully to get the info you need, for example:
- The title of the football match is the first h3 tag in the match div and is a string.
- The odds shown are span tags with the class event-odds and are numbers (float/double).
Search for the function you need on Google or in the reference of the package you use (BeautifulSoup4).
Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so we don't get the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")               # use on the match variable
if len(title_tags) > 0:                        # at least one found?
    title = title_tags[0].getText()            # get the text of the first one
    print("Title: ", title)                    # show it
else:
    print("no h3 tags found")
    exit()

# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                         # at least three found?
    odds = []                                  # create a list
    for tag in odds_tags:                      # loop over the odds_tags we found
        odd = tag.getText()                    # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()                # remove empty spaces
        odd = float(clean_odd)                 # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)                       # collect the converted odds
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
I am trying to scrape data from a data table on this website: http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793
The site has multiple tabs, which change the HTML (I am working in the 'Matchup' tab). Within that Matchup tab, there is a drop-down menu that changes the data table I am trying to access. The items in the table that I am trying to access are 'li' tags within an unordered list. I just want to scrape the data from the "Overall" category of the drop-down menu.
I have been unable to access the data that I want. The item that I'm trying to access comes back as a NoneType. Is there a way to do this?
url = "http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-
744793"
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
dataList = []
for ultag in soup.find_all('ul', {'class': 'base-list team-stats'}):
print(ultag)
for iltag in ultag.find_all('li'):
dataList.append(iltag.get_text())
The problem is that the content of the tab you are trying to pull data from is dynamically loaded using React. So you have to use the selenium module in Python to open a browser, click the "Matchup" list element programmatically, and then get the source after clicking it.
On my mac I installed selenium and the chromewebdriver using these instructions:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then signed the Python file, so that the OS X firewall doesn't complain when trying to run it, using these instructions:
Add Python to OS X Firewall Options?
Then ran the following Python 3 code:
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793")
# Find the matchup list element using a css selector and click it.
link = driver.find_element_by_css_selector("li[id='react-tabs-0']").click()
# Wait for content to load
time.sleep(1)
# Get the current page source.
source = driver.page_source
# Parse into soup() the source of the page after the link is clicked and use "html.parser" as the Parser.
soupify = soup(source, 'html.parser')
dataList = []
for ultag in soupify.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())
# We are done with the driver so quit.
driver.quit()
Hope this helps; I noticed this was a similar problem to the one I just solved here - Beautifulsoup doesn't reach a child element