Download html of a webpage thats already loaded - python

I am writing a program using Python and selenium to automate logging into a website. The website asks a security question for additional verification. Clearly the answer I would send using "send_keys" would depend on the question asked so I need to figure out what is being asked based on the text. BeautifulSoup can be used to parse through the HTML but in all the examples I have seen you have to give a URL to then read the page content. How do I read the content of a page that's already open? The code I am using is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
loginElem = browser.find_element_by_id('bbbb')
loginElem.send_keys('cccc')
passwordElem = browser.find_element_by_id('dddd')
passwordElem.send_keys('eeee')
passwordElem.send_keys(Keys.RETURN)
The page with the security questions loads after this and that's the page I want the URL of.
I also tried finding by element but for some reason it wasnt working which is why I am trying a workaround. Below is the HTML for the entire div class where the question is. Alternatively maybe you can help me search for the right one.
<div class="answer-section">
<p> Please answer your challenge question so we can help
verify your identity.
</p> <label for="tlpvt-challenge-answer"> What is the name of your dog?
</label>
<input type="text" id="tlpvt-challenge-answer" class="tl-private gis- mask"
name="challengeQuestionAnswer" value=""/>
</div>

well if you want to use BeautifulSoup you can retrieve the source code from the webdriver and then parse it:
chromedriver = 'C:\\Program Files\\Google\\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
browser.get('http://www.aaaa.com')
# call page_source attr from a webdriver instance to
# retrieve HTML source code
html = browser.page_source
# parse it with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('label', {'for': 'tlpvt-challenge-answer'})
print label.get_text()
output:
$ What is the name of your dog?

Related

How do I extract text from a button using Beautiful Soup?

I am trying to scrape GoFundMe information but can't seem to extract the number of donors.
This is the html I am trying to navigate. I am attempting to retrieve 11.1K,
<ul class="list-unstyled m-meta-list m-meta-list--default">
<li class="m-meta-list-item">
<button class="text-stat disp-inline text-left a-button a-button--inline" data-element-
id="btn_donors" type="button" data-analytic-event-listener="true">
<span class="text-stat-value text-underline">11.1K</span>
<span class="m-social-stat-item-title text-stat-title">donors</span>
I've tried using
donors = soup.find_all('li', class_ = 'm-meta-list-item')
for donor in donors:
print(donor.text)
The class/button seems to be hidden inside another class? How can I extract it?
I'm new to beautifulsoup but have used selenium quite a bit.
Thanks in advance.
These fundraiser pages all have similar html and that value is dynamically retrieved. I would suggest using selenium and a css class selector
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.gofundme.com/f/treatmentforsiyona?qid=7375740208a5ee878a70349c8b74c5a6')
num = d.find_element_by_css_selector('.text-stat-value').text
print(num)
d.quit()
Learn more about selenium:
https://sqa.stackexchange.com/a/27856
get the id gofundme.com/f/{THEID} and call the API
/web-gateway/v1/feed/THEID/donations?sort=recent&limit=20&offset=20
process the Data
for people in apiResponse['references']['donations']
print(people['name'])
use browser console to find host API.

How to see hidden content in html selectors?

When I want to show view source it looks like this:
<li class="results__list-container-item"></li>
But when I click Inspect Element in Firefox I see something like this:
<li class="results__list-container-item"><div class="offer offer--normal"><a class="offer__click-area" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566"></a><div class="offer__info"><div class="offer-details"><div class="offer-logo"><img src="https://i.gpcdn.pl/oferty-loga-firm/wyniki-wyszukiwania/14032.png" alt="logo" class="offer-logo__image"></div><div class="offer-details__text"><h3 class="offer-details__title"><a class="offer-details__title-link" href="/praca/data-engineer-for-bixby-voice-assistant-krakow,oferta,7201566">Data Engineer for Bixby Voice Assistant</a></h3><p class="offer-company"><span class="offer-company__link-wrapper"></li>
And it's possible to extract hidden content by web scraper(BeautifulSoup4)?
Hidden content is usually generated via JS. If you make a request to the webpage, it will not contain hidden HTML because the page has to be loaded in a browser for the hidden content to be loaded. We can get around this by using selenium web browser to actually open the page and then get the HTML from the rendered page.
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('example-url.com')
html = browser.page_source
soup = BeautifulSoup(html,features='html.parser')
hidden_divs = soup.find_all('div', {'class':'offer offer--normal'})
Of course, we would need the URL you are looking at to actually test this but this is how it generally works.

Unable to fetch information as 3rd party browser plugin is blocking JS to work

I wanted to extract data from https://www.similarweb.com/ but when I run my code it shows (converted the output of HTML into text):
Pardon Our Interruption http://cdn.distilnetworks.com/css/distil.css" media="all" /> http://cdn.distilnetworks.com/images/anomaly-detected.png" alt="0" />
Pardon Our Interruption...
As you were browsing www.similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article .
After completing the CAPTCHA below, you will immediately regain access to www.similarweb.com.
if (!RecaptchaOptions){ var RecaptchaOptions = { theme : 'blackglass' }; }
You reached this page when attempting to access https://www.similarweb.com/ from 14.139.82.6 on 2017-05-22 12:02:37 UTC.
Trace: 9d8ae335-8bf6-4218-968d-eadddd0276d6 via 536302e7-b583-4c1f-b4f6-9d7c4c20aed2
I have written the following piece of code:
import urllib
from BeautifulSoup import *
url = "https://www.similarweb.com/"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print (soup.prettify())
# tags = soup('a')
# for tag in tags:
# print 'TAG:',tag
# print tag.get('href', None)
# print 'Contents:',tag.contents[0]
# print 'Attrs:',tag.attrs
Can anyone help me as to how I can extract the information?
I tried with requests; it failed. selenium seems to work.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.similarweb.com/')

Python Selenium On Local HTML String

I am trying to run Selenium on a local HTML string but can't seem to find any documentation on how to do so. I retrieve HTML source from an e-mail API, so Selenium won't be able to parse it directly. Is there anyway to alter the following so that it would read the HTML string below:
Python Code for remote access:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_class_name("q")
Local HTML Code:
s = "<body>
<p>This is a test</p>
<p class="q">This is a second test</p>
</body>"
If you don't want to create a file or load a URL before being able to replace the content of the page, you can always leverage the Data URLs feature, which supports HTML, CSS and JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
html_content = """
<html>
<head></head>
<body>
<div>
Hello World =)
</div>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
If I understand the question correctly, I can imagine 2 ways to do this:
Save HTML code as file, and load it as url file:///file/location. The problem with that is that location of file and how file is loaded by a browser may differ for various OSs / browsers. But implementation is very simple on the other hand.
Another option is to inject your code onto some page, and then work with it as a regular dynamic HTML. I think this is more reliable, but also more work. This question has a good example.
Here was my solution for doing basic generated tests without having to make lots of temporary local files.
import json
from selenium import webdriver
driver = webdriver.PhantomJS() # or your browser of choice
html = '''<div>Some HTML</div>'''
driver.execute_script("document.write('{}')".format(json.dumps(html)))
# your tests
If I am reading correctly you are simply trying to get text from an element. If that is the case then the following bit should fit your needs:
elem = driver.find_element_by_class_name("q").text
print elem
Assuming "q" is the element you need.

Fetching name and email from a web page [duplicate]

This question already has an answer here:
How to get data off from a web page in selenium webdriver [closed]
(1 answer)
Closed 7 years ago.
I'm trying to fetch data off from a Link. I want to fetch name/email/location/etc content from the web page and paste it into the webpage. I have written the code for it always when i run this code it just stores a blank list.
Please help me to copy these data from the web page.
I want to fetch company name, email, phone number from this Link and put these contents in an excel file. I want to do the same for the all pages of the website. I have got the logic to fetch the the links in the browser and switch in between them. I'm unable to fetch the data from the website. Can anybody provide me an enhancement to the code i have written.
Below is the code i have written:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
from lxml import html
import requests
import xlwt
browser = webdriver.Firefox() # Get local session of firefox
# 0 wait until the pages are loaded
browser.implicitly_wait(3) # 3 secs should be enough. if not, increase it
browser.get("http://ae.bizdirlib.com/taxonomy/term/1493") # Load page
links = browser.find_elements_by_css_selector("h2 > a")
#print link
for link in links:
link.send_keys(Keys.CONTROL + Keys.RETURN)
link.send_keys(Keys.CONTROL + Keys.PAGE_UP)
#tree = html.fromstring(link.text)
time.sleep(5)
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
The Html code is given below
<div class="content">
<div id="node-946273" class="node node-country node-promoted node-full clearfix">
<div class="content clearfix">
<div itemtype="http://schema.org/Corporation" itemscope="">
<fieldset>
<legend>Company Information</legend>
<div style="width:100%;">
<div style="float:right; width:340px; vertical-align:top;">
<br/>
<ul>
<li>
<strong>Company Name</strong>
:
<span itemprop="name">Sabbro - F.Z.C</span>
</li>
</ul>
when i use it it gives me a error that list' object has no attribute 'text'. Can somebody help me to enhance the code and make it work. I'm kind of like stuck forever on this issue.
companyNameElement = browser.find_elements_by_css_selector(".content.clearfix>div>fieldset>div>ul>li").text
companyName = companyNameElement
print companyNameElement
find_elements_by... return a list, you can either access first element of that list or use equivalent find_element_by... method that would get just the first element.

Categories

Resources