Scraping dynamic data with Python

Scraping dynamic data with Python - python

I have this simple page on html:
<html>
<body>
<p>Javascript (dynamic data) test:</p>
<p class='jstest' id='yesnojs'>Hello</p>
<button onclick="myFunction()">Try it</button>
<script>
function myFunction() {
document.getElementById('yesnojs').innerHTML = 'GoodBye';
}
</script>
</body>
</html>
I would like now scrap this page using Python to get when the id "yesnojs" is "GoodBye", I mean, when the user has clicked the button. I have been trying some tutorials but I always get "Hello", it doesn´t care if I have click and I am viewing on the page "GoodBye".
I hope your help, thank you.
PD:
this is my code on Python for try scrape the page:
from selenium import webdriver
chrome_path=
"C:\\Users\\Antonio\\Downloads\\chromedriver_win32\\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://localhost/templates/scraping.html")
review = driver.find_elements_by_class_name("jstest")
for post in review:
print(post.text)

Selenium does not attach to your existing open web pages. It opens a new web page. You would have to simulate clicking with Selenium if you're designing a unit test.
Alternatively, are you looking at making a browser extension that does the scraping when this event happens, Selenium is not the tool for this.

Related

Selenium not working except launching browser and open the page

I see that my selenium cannot execute codes except to launch Chrome.
I don't know why my selenium is not working. It just open the browser (Chrome) with the URL and then doing nothing even to maximize the window, not even inserting the form.
Is there anything wrong with my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re, time, csv
driver = webdriver.Chrome("C:\\Users\\Ashraf%20Misran\\Installer\\chromedriver.exe")
driver.get("file:///C:/Users/Ashraf%20Misran/Devs/project-html/learning-html/selenium sandbox.html")
driver.maximize_window()
username = driver.find_element_by_xpath(".//input")
username.click()
username.send_keys("000200020002")
The page I opened is coded as below:
<!DOCTYPE html>
<html>
<head>
<title>Sandbox</title>
</head>
<body>
<form>
<input type="text" name="username">
</form>
</body>
</html>

I think the problem is with web-page, you are trying to open. Would suggest to try first with simple test, like Open Google page, enter something in search field. With this you will be able to verify, if you correctly implemented driver initialization.
Update: try to use this css selector: input[name='username'], if page is loaded correctly, then you have a problem with your web element selector.

I think, there is a problem with using relative xpath locator. Please try that one:
username = driver.findElement(By.xpath("//input"))

Unable to fetch information as 3rd party browser plugin is blocking JS to work

I wanted to extract data from https://www.similarweb.com/ but when I run my code it shows (converted the output of HTML into text):
Pardon Our Interruption http://cdn.distilnetworks.com/css/distil.css" media="all" /> http://cdn.distilnetworks.com/images/anomaly-detected.png" alt="0" />
Pardon Our Interruption...
As you were browsing www.similarweb.com something about your browser made us think you were a bot. There are a few reasons this might happen:
You're a power user moving through this website with super-human speed.
You've disabled JavaScript in your web browser.
A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article .
After completing the CAPTCHA below, you will immediately regain access to www.similarweb.com.
if (!RecaptchaOptions){ var RecaptchaOptions = { theme : 'blackglass' }; }
You reached this page when attempting to access https://www.similarweb.com/ from 14.139.82.6 on 2017-05-22 12:02:37 UTC.
Trace: 9d8ae335-8bf6-4218-968d-eadddd0276d6 via 536302e7-b583-4c1f-b4f6-9d7c4c20aed2
I have written the following piece of code:
import urllib
from BeautifulSoup import *
url = "https://www.similarweb.com/"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print (soup.prettify())
# tags = soup('a')
# for tag in tags:
# print 'TAG:',tag
# print tag.get('href', None)
# print 'Contents:',tag.contents[0]
# print 'Attrs:',tag.attrs
Can anyone help me as to how I can extract the information?

I tried with requests; it failed. selenium seems to work.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('https://www.similarweb.com/')

Python Selenium ChromeDriver: chromedriver dummy frame

I had this simple login script to facebook that used to work perfectly until about a month ago. But yesterday when I tried running it again I got this dummy page:
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body><pre style="word-wrap: break-word; white-space: pre-wrap;">
</pre>
<iframe name="chromedriver dummy frame" src="about:blank"></iframe>
</body>
</html>
I guess they've added some new detections. Is there a way to avoid those?
This is my simplified code:
browser = webdriver.Chrome(executable_path=path, service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
browser.get("https://www.facebook.com/")
for line in browser.page_source.split('\n'):
print line

I have a similar problem which is not Facebook but our developing pages.
I might be ssl problem. (which might be solved --ignore-ssl-... option.)
Mostly, This is waiting problem.
The Selenium bot captures whole HTML PAGE before the server print out their contexts.
Thus, it might be solved, using same wait options (See this)
If there is some unique ID html elements, please insert following codes:
wait = WebDriverWait(driver, 5)
element = wait.until(EC.visibility_of_element_located((By.ID, 'unique')))

Python Selenium On Local HTML String

I am trying to run Selenium on a local HTML string but can't seem to find any documentation on how to do so. I retrieve HTML source from an e-mail API, so Selenium won't be able to parse it directly. Is there anyway to alter the following so that it would read the HTML string below:
Python Code for remote access:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_class_name("q")
Local HTML Code:
s = "<body>
<p>This is a test</p>
<p class="q">This is a second test</p>
</body>"

If you don't want to create a file or load a URL before being able to replace the content of the page, you can always leverage the Data URLs feature, which supports HTML, CSS and JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
html_content = """
<html>
<head></head>
<body>
<div>
Hello World =)
</div>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))

If I understand the question correctly, I can imagine 2 ways to do this:
Save HTML code as file, and load it as url file:///file/location. The problem with that is that location of file and how file is loaded by a browser may differ for various OSs / browsers. But implementation is very simple on the other hand.
Another option is to inject your code onto some page, and then work with it as a regular dynamic HTML. I think this is more reliable, but also more work. This question has a good example.

Here was my solution for doing basic generated tests without having to make lots of temporary local files.
import json
from selenium import webdriver
driver = webdriver.PhantomJS() # or your browser of choice
html = '''<div>Some HTML</div>'''
driver.execute_script("document.write('{}')".format(json.dumps(html)))
# your tests

If I am reading correctly you are simply trying to get text from an element. If that is the case then the following bit should fit your needs:
elem = driver.find_element_by_class_name("q").text
print elem
Assuming "q" is the element you need.

Clicking on a Javascript Link on Firefox with Selenium

I am trying to get some comments off the car blog, Jalopnik. It doesn't come with the web page initially, instead the comments get retrieved with some Javascript. You only get the featured comments. I need all the comments so I would click "All" (between "Featured" and "Start a New Discussion") and get them.
To automate this, I tried learning Selenium. I modified their script from Pypi, guessing the code for clicking a link was link.click() and link = broswer.find_element_byxpath(...). It doesn't look liek the "All" button (displaying all comments) was pressed.
Ultimately I'd like to download the HTML of that version to parse.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers/") # Load page
time.sleep(0.2)
link = browser.find_element_by_xpath("//a[#class='tc cn_showall']")
link.click()
browser.save_screenshot('screenie.png')
browser.close()

Using Firefox with the Firebug plugin, I browsed to http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers.
I then opened the Firebug console and clicked on ALL; it obligingly showed a single AJAX call to http://jalopnik.com/index.php?op=threadlist&post_id=5912009&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null
Opening that url in a new window gets me the comment feed you are seeking.
More generally, if you substitute the appropriate article-ID into that url, you should be able to automate the process without Selenium.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping dynamic data with Python - python

Related

Selenium not working except launching browser and open the page

Unable to fetch information as 3rd party browser plugin is blocking JS to work

Python Selenium ChromeDriver: chromedriver dummy frame

Python Selenium On Local HTML String

Clicking on a Javascript Link on Firefox with Selenium

Categories

Resources