I want to use Selenium to render the HTML of a request without having to save it to a file. I'm open to other ideas to solve this problem.
I need to render the HTML so I can save it with browser-generated tags such as <tbody>.
I tried injecting the HTML into the driver as a data URL, but it doesn't work for long HTML documents.
from selenium import webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
html_test = '''<body>
<div>
<table>
<tr><td>
Hello world
</td></tr>
</table>
</div>
</body>'''
driver.get(f'data:text/html;charset=utf-8,{html_test}')
rendered_html = driver.page_source
driver.close()
This code works with small HTML samples, but not with a full page of code.
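One common reason a data URL works for small snippets but fails on a full page is that characters such as `#` and `&` get interpreted as URL syntax rather than markup. A minimal sketch of percent-encoding the document first (untested against a real full page, and browsers may still cap very long data URLs):

```python
from urllib.parse import quote

html_test = '''<body>
<div>
<table>
<tr><td>
Hello world
</td></tr>
</table>
</div>
</body>'''

# Percent-encode the markup so characters such as '<', '#', and '&'
# are not misread as parts of the URL itself.
data_url = "data:text/html;charset=utf-8," + quote(html_test)

# driver.get(data_url)  # then read driver.page_source as before
print(data_url[:40])
```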
When I view the page source, it looks like this:
<li class="results__list-container-item"></li>
But when I click Inspect Element in Firefox I see something like this:
<li class="results__list-container-item"><div class="offer offer--normal"><a class="offer__click-area" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566"></a><div class="offer__info"><div class="offer-details"><div class="offer-logo"><img src="https://i.gpcdn.pl/oferty-loga-firm/wyniki-wyszukiwania/14032.png" alt="logo" class="offer-logo__image"></div><div class="offer-details__text"><h3 class="offer-details__title"><a class="offer-details__title-link" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566">Data Engineer for Bixby Voice Assistant</a></h3><p class="offer-company"><span class="offer-company__link-wrapper"></li>
Is it possible to extract this hidden content with a web scraper (BeautifulSoup4)?
Hidden content like this is usually generated via JavaScript. A plain request to the webpage will not contain the hidden HTML, because the page has to be loaded in a browser for the hidden content to be rendered. We can get around this by using Selenium to actually open the page in a browser and then grab the HTML from the rendered page.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('https://example-url.com')  # the real URL goes here
html = browser.page_source
browser.quit()
soup = BeautifulSoup(html, features='html.parser')
hidden_divs = soup.find_all('div', {'class': 'offer offer--normal'})
Of course, we would need the URL you are looking at to actually test this but this is how it generally works.
I am trying to scrape AFL odds from betfair (https://www.betfair.com.au/exchange/plus/australian-rules).
I am fairly new to web scraping, but I have managed to scrape odds from other bookies; I am having trouble with Betfair, though. The data I need is within a "ui-view" tag which never seems to be populated when I use Beautiful Soup to get the HTML.
I've tried unsuccessfully to use selenium when loading the page to get the odds.
from selenium import webdriver
from bs4 import BeautifulSoup
import pprint as pp
BETFAIR_URL = "https://www.betfair.com.au/exchange/plus/australian-rules"
# functions
def parse(url):
    # open the browser
    driver = webdriver.Chrome(
        'C:/Users/Maroz/Downloads/chromedriver_win32 (1)/chromedriver.exe')
    # open the page
    driver.get(url)
    # parse it as html
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # close the browser
    driver.quit()
    return soup

betfair_soup = parse(BETFAIR_URL)
pp.pprint(betfair_soup)

# edit to show that it finds nothing in the span I need, which is within the ui-view tags
price = betfair_soup.find_all("span", {"class": "bet-button-price"})
pp.pprint(price)
# output is []
I expected betfair_soup to contain the information within the ui-view tag,
however it remains empty when printed to the terminal.
It won't let me post an image because this is my first post, but you might be able to see a screenshot of the tags I am trying to access here: https://imgur.com/gallery/jI3MQYY
As requested, here is the HTML I get in the terminal:
<!--[if IE]>
<script type="text/javascript">window['isIE'] = true;</script>
<![endif]-->
<!-- Set ie10 class: http://www.impressivewebs.com/ie10-css-hacks/ -->
<!--[if !IE]><!-->
<script>
(function () {
var isIE10 = Function('/*#cc_on return document.documentMode===10#*/')();
if (isIE10) {
document.documentElement.className += ' ie10';
}
})();
</script>
<!--<![endif]-->
<bf-meta-tags></bf-meta-tags>
<bf-tooltip-guide><div class="tooltip-guide-container" ng-controller="TooltipGuideController as controller"><!-- --></div></bf-tooltip-guide>
<!-- --><ui-view></ui-view> #INFO IS IN HERE
<script src="//ie2eds.cdnppb.net/resources/eds/bundle/vendor-assets-min_4146.js"></script>
<script src="//ie2eds.cdnppb.net/resources/eds/bundle/bf-eds-static-client.min_4146_.js"></script>
<script type="text/javascript">
I put a comment where the odds are located. When I view the page source the tags are also closed, so other than the photo link I posted above there isn't any way of showing you what I see when I inspect the odds box.
edit: After trying the suggestion to wait for ui-view to load, this is the entire response; I still couldn't access the information in the span tags though.
https://pastebin.com/v6JzYa6V
FINAL EDIT: Problem solved! Thank you to everyone for your suggestions and special thanks to S Ahmed for his persistence in solving this for me!
Looks like it takes time to load the content of the <ui-view> tag, and it is loaded by JavaScript. Try waiting for an internal element to be present before getting the source of the page.
Try this:
from selenium.common.exceptions import TimeoutException

def parse(url):
    driver.get(url)
    try:
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "main-wrapper")))
    except TimeoutException:
        pp.pprint("Exception")
    finally:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()
        return soup
You have to import the following libraries
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Edit:
Try waiting for the span.bet-button-price to be present instead of #main-wrapper:
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,"span.bet-button-price")))
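Once the wait succeeds and the page source actually contains the odds, the spans can be pulled out with BeautifulSoup as above. As a rough standard-library-only illustration of that extraction step (the class name comes from the question; the sample markup here is made up to stand in for `driver.page_source`):

```python
from html.parser import HTMLParser

class PriceSpanParser(HTMLParser):
    """Collect the text of <span class="bet-button-price"> elements."""
    def __init__(self):
        super().__init__()
        self.in_price_span = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "bet-button-price":
            self.in_price_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price_span = False

    def handle_data(self, data):
        if self.in_price_span and data.strip():
            self.prices.append(data.strip())

# Made-up sample standing in for driver.page_source after the wait:
sample = ('<ui-view><span class="bet-button-price">1.85</span>'
          '<span class="bet-button-price">2.10</span></ui-view>')
parser = PriceSpanParser()
parser.feed(sample)
print(parser.prices)  # → ['1.85', '2.10']
```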
My Selenium script does nothing beyond launching Chrome.
I don't know why my Selenium is not working. It just opens the browser (Chrome) with the URL and then does nothing more, not even maximizing the window or filling in the form.
Is there anything wrong with my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re, time, csv
driver = webdriver.Chrome("C:\\Users\\Ashraf%20Misran\\Installer\\chromedriver.exe")
driver.get("file:///C:/Users/Ashraf%20Misran/Devs/project-html/learning-html/selenium sandbox.html")
driver.maximize_window()
username = driver.find_element_by_xpath(".//input")
username.click()
username.send_keys("000200020002")
The page I opened is coded as below:
<!DOCTYPE html>
<html>
<head>
<title>Sandbox</title>
</head>
<body>
<form>
<input type="text" name="username">
</form>
</body>
</html>
I think the problem is with the web page you are trying to open. I would suggest first trying a simple test, like opening the Google page and entering something in the search field. With this you will be able to verify that you implemented the driver initialization correctly.
Update: try to use this CSS selector: input[name='username']. If the page is loaded correctly and this still fails, then you have a problem with your web element selector.
I think there is a problem with using a relative XPath locator. Please try this one:
username = driver.find_element_by_xpath("//input")
I am writing a program using Python and Selenium to automate logging into a website. The website asks a security question for additional verification. Clearly the answer I would send using send_keys depends on the question asked, so I need to figure out what is being asked based on the text. BeautifulSoup can be used to parse the HTML, but in all the examples I have seen you have to supply a URL, which is then read as the page content. How do I read the content of a page that's already open? The code I am using is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
loginElem = browser.find_element_by_id('bbbb')
loginElem.send_keys('cccc')
passwordElem = browser.find_element_by_id('dddd')
passwordElem.send_keys('eeee')
passwordElem.send_keys(Keys.RETURN)
The page with the security questions loads after this, and that's the page whose content I want to read.
I also tried finding the element directly, but for some reason it wasn't working, which is why I am trying a workaround. Below is the HTML for the entire div class where the question is. Alternatively, maybe you can help me search for the right one.
<div class="answer-section">
<p> Please answer your challenge question so we can help
verify your identity.
</p> <label for="tlpvt-challenge-answer"> What is the name of your dog?
</label>
<input type="text" id="tlpvt-challenge-answer" class="tl-private gis-mask"
name="challengeQuestionAnswer" value=""/>
</div>
Well, if you want to use BeautifulSoup you can retrieve the source code from the webdriver and then parse it:
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
# call page_source attr from a webdriver instance to
# retrieve HTML source code
html = browser.page_source
# parse it with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('label', {'for': 'tlpvt-challenge-answer'})
print(label.get_text())
output:
$ What is the name of your dog?
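Once the question text is extracted, one simple way to pick the matching answer is a lookup table. A minimal sketch (the questions and answers here are placeholders, and `normalize` is a hypothetical helper that just collapses the whitespace seen in the scraped label):

```python
# Placeholder security questions and answers, for illustration only.
SECURITY_ANSWERS = {
    "What is the name of your dog?": "Rex",
    "What city were you born in?": "Springfield",
}

def normalize(question_text):
    # Labels scraped from HTML often carry stray whitespace and newlines.
    return " ".join(question_text.split())

question = " What is the name of your dog?\n"
answer = SECURITY_ANSWERS[normalize(question)]
print(answer)  # → Rex

# In the Selenium flow, this would then be sent to the input field:
# browser.find_element_by_id('tlpvt-challenge-answer').send_keys(answer)
```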
I am trying to run Selenium on a local HTML string but can't seem to find any documentation on how to do so. I retrieve the HTML source from an e-mail API, so Selenium won't be able to fetch it from a URL directly. Is there any way to alter the following so that it would read the HTML string below?
Python Code for remote access:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_class_name("q")
Local HTML Code:
s = """<body>
<p>This is a test</p>
<p class="q">This is a second test</p>
</body>"""
If you don't want to create a file or load a URL before being able to replace the content of the page, you can always leverage the Data URLs feature, which supports HTML, CSS and JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
html_content = """
<html>
<head></head>
<body>
<div>
Hello World =)
</div>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
If I understand the question correctly, I can imagine 2 ways to do this:
Save the HTML code to a file, and load it as a file:///file/location URL. The problem is that where the file lives and how a browser loads it may differ across OSs / browsers. On the other hand, the implementation is very simple.
Another option is to inject your code into some page, and then work with it as regular dynamic HTML. I think this is more reliable, but also more work. This question has a good example.
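The first option above can be sketched with the standard library, which also sidesteps the per-OS question of turning a path into a URL (feeding the URL to the driver at the end is the assumed usage):

```python
import tempfile
from pathlib import Path

html = "<body><p class='q'>This is a second test</p></body>"

# Write the HTML to a temporary file...
with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(html)
    temp_path = f.name

# ...and let pathlib build a well-formed file:// URL for it.
file_url = Path(temp_path).resolve().as_uri()
print(file_url)  # e.g. file:///tmp/tmpabc123.html

# driver.get(file_url)  # then inspect driver.page_source as usual
```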
Here was my solution for doing basic generated tests without having to make lots of temporary local files.
import json
from selenium import webdriver
driver = webdriver.PhantomJS() # or your browser of choice
html = '''<div>Some HTML</div>'''
# json.dumps already yields a quoted, escaped JS string literal,
# so no extra quotes are needed around the placeholder
driver.execute_script("document.write({})".format(json.dumps(html)))
# your tests
If I am reading correctly you are simply trying to get text from an element. If that is the case then the following bit should fit your needs:
elem = driver.find_element_by_class_name("q").text
print(elem)
Assuming "q" is the element you need.