I am trying to scrape information from a website. So far, I've been able to access the webpage, log in with a username and password, and then print that landing page's page source into a separate .html/.txt file as needed.
Here's where the problems arise: on that landing page, there's a table whose data I want to scrape. If I manually right-click any integer in that table and select "Inspect," I can find the integer with no problem. However, when looking at the page source as a whole, I don't see the integers at all, just variable/parameter names. This leads me to believe it is a dynamic website.
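One way to confirm that theory, as a sketch: fetch the raw HTML without a browser and check whether an integer you can see via "Inspect" is actually in it. (The URL and value below are placeholders, and the login step is omitted for brevity.)
import requests

# Placeholder URL and value; substitute the real landing page and an
# integer that is visible via "Inspect" in the browser.
raw_html = requests.get("https://www.example.com/landing").text
print("42" in raw_html)  # False suggests the table is filled in by JavaScript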
How can I scrape the data?
I've been to hell and back trying to scrape this website, and so far, here's how the available technology has worked for me:
Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work if I log in manually.
Selenium's Chromium package has been failing on me repeatedly (on my Windows 7 laptop) and I have even posted a question about the matter here. For now I'll assume it's just a lost cause, but I'm willing to graciously accept anyone's benevolent help.
Spynner's description looked promising, but that setup has frustrated me for quite some time, and the lack of a clear introduction only compounds its cumbersome nature for a novice like myself.
I prefer to code in Python, as it is the language I am most comfortable with. I have a pending company request to have the company install Visual Studio on my computer (to try doing this in C#), but I'm not holding my breath...
If my code can be of any use, so far, here's how I'm using PhantomJS with Selenium:
# Headless Browsing Using PhantomJS and Selenium
#
# PhantomJS is installed in current directory
#
import io
import time
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550)  # need a fake browser size to fetch elements

def login_entry(username, password):
    # Fill in the login form and submit it
    login_email = browser.find_element_by_id('UserName')
    login_email.send_keys(username)
    login_password = browser.find_element_by_id('Password')
    login_password.send_keys(password)
    submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
    submit_elem.click()

browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')
time.sleep(10)

# Write the rendered source itself, not its repr()
with io.open('phantomjs_test_source_output.html', 'w', encoding='utf-8') as test_output:
    test_output.write(browser.page_source)

browser.quit()
P.S. If anyone thinks I should add the javascript tag to this question, let me know. I don't know JavaScript myself, but I sense it might be part of the problem/solution.
Try something like this. Sometimes with dynamic pages you need to wait for the data to load.
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# my_locator is a (By.<strategy>, value) tuple, e.g. (By.ID, 'my-table')
WebDriverWait(my_driver, my_time).until(EC.presence_of_all_elements_located(my_locator))
http://selenium-python.readthedocs.io/waits.html
https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html
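For example, a minimal sketch (the XPath and timeout are placeholders for whatever your page actually uses):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.PhantomJS()
browser.get("https://www.example.com")
# Wait up to 15 seconds for the table cells to appear in the DOM
cells = WebDriverWait(browser, 15).until(
    EC.presence_of_all_elements_located((By.XPATH, "//table//td")))
print([cell.text for cell in cells])
browser.quit()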
Related
I am trying to use Selenium to log into a printer and conduct some tests. I really do not have much experience with this process and have found it somewhat confusing.
First, to get the values I wanted, I opened the printer's web page in Chrome and did a right-click "View page source". This turned out not to be helpful: from there I can only see a bunch of <script> tags that call some .js scripts. I am assuming this is a big part of my problem.
Next I selected the "Inspect" option after right-clicking. From here I can see the actual HTML that is loaded. I logged into the site and recorded the process in Chrome. With this I was able to identify the variables which contain the username and password. I went to that part of the HTML, right-clicked, and copied the XPath. I then tried to use Selenium's find_element_by_xpath, but still no luck. I have tried all the other methods too (find by ID, and by name); however, each returns an error that the element is not found.
I feel like there is something fundamental here that I am not understanding. Does anyone have any experience with this?
Note: I am using Python 3.7 and Selenium; however, I am not opposed to trying something other than Selenium if there is a more graceful way to accomplish this.
My code looks something like this:
EDIT
Here is my updated code - I can confirm this is not just a time/wait issue. I have managed to successfully grab the first two outer elements but the second I go deeper it errors out.
def sel_test():
    chromeOptions = Options()
    chromeOptions.add_experimental_option("useAutomationExtension", False)
    browser = webdriver.Chrome(chrome_options=chromeOptions)
    url = 'http://<ip address>/'
    browser.get(url)
    try:
        element = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="ccrx-root"]')))
    finally:
        browser.quit()
The element that I want is buried inside this tag; maybe this has something to do with it? Possibly related to this post:
<frame name="wlmframe" src="../startwlm/Start_Wlm.htm?arg11=">
As mentioned in this post, Selenium can only see elements in the frame it currently has focus on. You need to tell Selenium to switch frames in order to access child frames.
For example:
browser.switch_to.frame('wlmframe')
This will then load the nested content so you can access the children
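A minimal sketch of the full sequence (reusing the ccrx-root id from the question; switch back to the top-level document when you are done):
browser.switch_to.frame('wlmframe')  # enter the child frame
element = browser.find_element_by_id('ccrx-root')  # now visible to Selenium
# ... interact with the element here ...
browser.switch_to.default_content()  # return to the top-level page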
Your issue is most likely due to either the element not loading on the page until after your bot searches for it, or a pop-up changing the XPath of the element.
Try this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
delay = 3  # seconds
try:
    elementUsername = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.XPATH, 'element-xpath')))
    elementUsername.send_keys('your username')
except TimeoutException:
    print("Loading took too much time!")
you can find out more about this here
Thank you for answering my previous question, but as one problem is solved, another is apparently found.
Interacting with the Flash game itself is now the problem. I have tried researching how to do it in Selenium, but it can't be done. I've seen FlashSelenium, Sikuli, and AutoIt.
I can only find documentation for FlashSelenium in Java. It's easier for me to use AutoIt rather than Sikuli, as with Sikuli I'd have to learn Jython to create the kind of script I want; that's not something I'm against learning, I'm just trying to finish this as fast as possible. As for AutoIt, the only problem is that I don't understand how to use it with Selenium.
from selenium import webdriver
import autoit
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://na58.evony.com/s.html?loginid=747970653D74727947616D65&adv=index")
driver.maximize_window()
assert "Evony - Free forever " in driver.title
So far I have this, and it's doing what it's supposed to do, which is create a new account using that driver.get. But when I reach the page, it is all Flash and I cannot interact with anything on it, so I have to use AutoIt; I just don't know how to get it to "pick up" from where Selenium left off. I want it to interact with a button on the webpage, and from a previous post on Stack Overflow I gather I can use an (x, y) pair to specify the location, but unfortunately that post didn't explain beyond that. Any and all information would be great, thanks.
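The kind of hand-off I'm imagining would be something like this, assuming the pyautoit bindings (imported above as autoit); the coordinates are hypothetical and would have to be measured against the actual Flash button:
# Selenium has already navigated to the page (see above); since the Flash
# content has no DOM, click at screen coordinates with AutoIt instead.
driver.maximize_window()  # a fixed window size keeps coordinates repeatable
autoit.mouse_click("left", 640, 400)  # hypothetical (x, y) of the button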
Yes, you can use any number of scraping libraries (scrapy and beautiful soup are both easy to use and very powerful). Personally though, I like Selenium and its python bindings because they're the most flexible. Your final script would look something like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://xx.yy.zz")
# Click the "New comer, click here to play!" button
elem = driver.find_element_by_partial_link_text("Click here to play")
elem.send_keys(Keys.RETURN)
Can you post what the source of the page looks like (maybe using a Pastebin)?
Edit: updated to show you how to click the "Click here to play" button.
My coding experience is in Python. Is there a simple way to execute Python code in Firefox that would detect a particular address, say nytimes.com, load the page, then delete the end of the address following "html" (this allows bypassing the 20 pageviews/month limit) and reload?
Your best bet is to use Selenium, as proposed before. Here's a small example of how you could do it. Basically, the code checks whether the limit has been reached; if it has, it deletes the cookies and refreshes the page, letting you continue reading. Deleting the cookies lets you read another 10 articles without continuously editing the address. That's the technical part; you have to consider the legal implications yourself.
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.nytimes.com')
# find_elements (plural) returns an empty list when the banner is absent,
# so this works as a truthiness check without raising NoSuchElementException
if browser.find_elements_by_xpath('.//*[contains(.,"You’ve reached the limit of 10 free articles a month.")]'):
    browser.delete_all_cookies()
    browser.refresh()
You can use Selenium; it lets you easily and fully control Firefox and other web browsers with Python. It would only take a few lines of code to achieve this. This answer, How to integrate Selenium and Python, has a working example.
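As a rough sketch of those few lines (assuming, per the question, that the article URL ends in ".html" followed by extra text to drop):
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.nytimes.com/some-article.html?extra=stuff')  # hypothetical URL
url = browser.current_url
if '.html' in url:
    # Drop everything after ".html" and reload the trimmed address
    browser.get(url[:url.index('.html') + len('.html')])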
I would like to web-scrape the HTML source code of JavaScript pages that I can't access without selecting one option in a drop-down list and then 'clicking' on links. Though not itself a JavaScript-heavy page, a simple example of this is:
Scrape the main Wikipedia pages in all the languages available in the drop-down list at the bottom of this URL: http://www.wikipedia.org/
To do so, I need to select one language, English for example, and then 'click' the 'Main Page' link on the left of the new URL (http://en.wikipedia.org/wiki/Special:Search?search=&go=Go).
After this step, I would scrape the HTML source code of the Wikipedia main page in English.
Is there any way to do this using R? I have already tried the RCurl and XML packages, but they do not work well with JavaScript pages.
If it is not possible with R, could anyone tell me how to do this with Python?
It's possible to do this using python with the selenium package. There are some useful examples here. I found it helpful to install Firebug so that I could identify elements on the page. There is also a Selenium Firefox plugin with an interactive window that can help too.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://website.aspx")
elem = driver.find_element_by_id("ctl00_ctl00")
elem.send_keys('15')
elem.send_keys(Keys.RETURN)
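Applied to the Wikipedia example from the question, a sketch could look like this; the searchLanguage and searchInput element ids are assumptions about the wikipedia.org markup and should be verified with Firebug or the browser's inspector:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("http://www.wikipedia.org/")
# Choose English in the language drop-down (element id assumed)
Select(driver.find_element_by_id('searchLanguage')).select_by_value('en')
driver.find_element_by_id('searchInput').submit()  # submit the empty search form (id assumed)
# On the resulting page, follow the 'Main Page' link and grab the rendered source
driver.find_element_by_link_text('Main Page').click()
html = driver.page_source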
Take a look at the RCurl and XML packages for posting form information to the website and then processing the data afterwards. RCurl is pretty cool, but you might have an issue with the HTML parsing: if it isn't standards-compliant, the XML package may not want to play nice.
If you are interested in learning Python, however, Celenius's example above coupled with BeautifulSoup would be what you need.
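A sketch of that combination (assuming BeautifulSoup 4 and reusing the driver from the example above):
from bs4 import BeautifulSoup

# Parse the JavaScript-rendered source that Selenium sees
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))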
When I try to automatically download a file from some webpage using Python, I get a Webpage Dialog window (I use IE). The window has two buttons, 'Continue' and 'Cancel'. I cannot figure out how to click on the Continue button. The problem is that I don't know how to control a Webpage Dialog with Python. I tried to use winGuiAuto to find the controls of the window, but it fails to recognize any Button-type controls... Any ideas?
Sasha
A clarification of my question:
My purpose is to download stock data from a certain website. I need to perform this for many stocks, so I need Python to do it for me in a repetitive way. This specific site exports the data by letting me download it as an Excel file by clicking a link. However, after clicking the link I get a Webpage Dialog box asking me if I am sure that I want to download this file. This Webpage Dialog is my problem: it is not an HTML page and it is not a regular Windows dialog box. It is something else, and I cannot figure out how to control it with Python. It has two buttons, and I need to click on one of them (i.e. Continue). It seems like it is a special kind of window implemented in IE, distinguished by its title, which looks like this: Webpage Dialog -- Download blalblabla. If I click Continue manually, it opens a regular Windows dialog box (Open/Save/Cancel), which I know how to handle with the winGuiAuto library. I tried to use that library on the Webpage Dialog window with no luck, and tried to recognize the buttons with the AutoIt Info tool, with no luck either. In fact, maybe these are not buttons but actually links; however, I cannot see the links and there is no source code visible... What I need is someone to tell me what this Webpage Dialog box is and how to control it with Python. That was my question.
You can't, and you don't want to. When you ask a question, try explaining what you are trying to achieve, not just the task immediately in front of you; you are likely barking up the wrong tree. There is some other way of doing what you are trying to do.
The title 'Webpage Dialog' suggests it is a JavaScript-generated input box, which is why you can't access it via winGuiAuto. What you're asking for directly is unlikely to be possible.
However, assuming that what you want to do is just download this data from the site, why are you using the GUI at all? Python provides everything you need to download files from the internet without controlling IE. The process you will want to follow is:
Download the host page
Find the url for your download in the page (if it changes)
Download the file from that url to a local file
In Python this would look something like this:
import re
import urllib

f = urllib.urlopen('http://yoursitehere')  # original page where the download button is
html = f.read()
f.close()
# Find a quoted URL ending in .xls anywhere in the page; if the match is a
# relative URL, join it to the page URL before downloading
m = re.search(r'[\'"]([^\'"]*\.xls)[\'"]', html, re.S)
if m:
    urllib.urlretrieve(m.group(1), 'local_filename.xls')  # retrieve the Excel file
It is better to use the Selenium Python bindings:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

class AlertsManager:
    def alertsManager(self, url):
        self.url_to_visit = url
        self.driver = webdriver.Ie()
        self.driver.get(self.url_to_visit)
        try:
            # Keep accepting alerts for as long as a new one appears within 1 second
            while WebDriverWait(self.driver, 1).until(EC.alert_is_present()):
                self.driver.switch_to.alert.accept()
        except TimeoutException:
            pass

if __name__ == '__main__':
    AM = AlertsManager()
    url = "http://htmlite.com/JS006.php"  # This website has 2 popups
    AM.alertsManager(url)