Python mechanize wait and click

Using mechanize, how can I wait for some time after page load (some websites have a timer before links appear, like in download pages), and after the links have been loaded, click on a specific link?
Since it's an anchor tag and not a submit button, will browser.submit() work? (I got errors when I tried that.)

Mechanize does not offer javascript functionality, so you will not see dynamic content (like a timer that turns into a link).
As far as clicking on a link, you have to find the element, and then you can call click_link on it. See the Finding Links section of this site.
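For example, a minimal sketch of finding and following a link with mechanize (the URL and link text here are assumptions):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # many download pages disallow robots
br.open('http://example.com/downloads')  # hypothetical URL

# find_link matches on link text, URL, name, etc.
link = br.find_link(text_regex=r'Download')
response = br.follow_link(link)  # issues a GET for the link's href
print(response.geturl())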
If you are looking for something to handle such sites, a good option is PhantomJS. It is scripted in JavaScript and runs on the WebKit engine, allowing you to parse dynamic content. If you have your heart set on Python, using Selenium to programmatically drive a real browser may be your best bet.

If it's an anchor tag, then just GET/POST whatever URL it points to.
The timer before links appear is generally implemented in JavaScript; some sites you are attempting to scrape may not be usable without JavaScript, or may require a token generated client-side by JavaScript.
Depending on the site, you can either extract the wait time in milliseconds/seconds and time.sleep() for that long, or you'll have to use something that can execute JavaScript.
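For the first case, a rough sketch of pulling the countdown out of the markup and sleeping (the URL and the regular expression are assumptions about what the page might contain):

import re
import time
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
html = br.open('http://example.com/download').read().decode('utf-8')  # hypothetical URL

# Assume the countdown is embedded in the markup, e.g. "var wait = 10;"
match = re.search(r'var wait = (\d+);', html)
if match:
    time.sleep(int(match.group(1)))  # wait as long as the page's timer would

# After the wait, reload or follow whatever link the timer was guarding.
response = br.reload()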

Related

Navigate webpage as if from my browser (Python, selenium)

I need to parse a page, keeping HTML and JS the same as in my own browser. The site must think that I am logged in using the same browser; I need to "press" some buttons using JS and find some elements.
When using the requests library or selenium.webdriver.Firefox(), the site thinks I am visiting from a new browser. But I think Selenium should help.
Requests cannot process JavaScript, nor can it parse HTML and CSS to create a DOM. Requests is just a very nice abstraction around making HTTP requests to any server, but websites/browsers aren't the only things that use HTTP.
What you're looking for is a JavaScript engine along with an HTML and CSS parser so that it can create an actual DOM for the site and allow you to interact with it. Without these things, there'd be no way to tell what the DOM of the page would be, and so you wouldn't be able to click buttons on it and have the resulting JavaScript do what it should.
So what you're looking for is a web browser. There's just no way around it. Anything that does those things is, by definition, a web browser.
To clarify from one of your comments, just because something has a GUI, that doesn't mean it isn't automatic. In fact, that's exactly what Selenium is for (i.e. automating the interactions with the GUI that is the web page). It's not meant to emulate user behavior exactly 1:1, and it's actually an abstraction around the WebDriver protocol, which is meant for writing automated tests. However, it does allow you to interact with the webpage in a way that approximates how a user would interact with it.
You may not want to see the GUI of the browser, but luckily, Chrome and Firefox have "headless" modes, and Selenium can control headless instances of those browsers. This would have the browser GUI be hidden while Selenium controls it, which sounds like what you're looking for.
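A minimal sketch of driving headless Firefox with Selenium (assuming a recent Selenium and geckodriver on your PATH; the URL and selector are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')  # run the browser without a visible window
driver = webdriver.Firefox(options=options)

driver.get('https://example.com')  # hypothetical URL
print(driver.title)
driver.find_element(By.CSS_SELECTOR, 'button').click()  # interact as a user would

driver.quit()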

miss part of html by getting html by requests in python [duplicate]

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is output some HTML when you press a button. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?
In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.
You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
Since there is no comprehensive answer here, I'll go ahead and write one.
To scrape JS-rendered pages, we will need a browser that has a JavaScript engine (i.e., supports JavaScript rendering).
Options like Mechanize or urllib2 will not work, since they DO NOT support JavaScript.
So here's what you do:
Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the DOM after JavaScript rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk
driver.quit()
I have had to do this before (in .NET) and you are basically going to have to host a browser, get it to click the button, and then interrogate the DOM (document object model) of the browser to get at the generated HTML.
This is definitely one of the downsides to web apps moving towards an Ajax/Javascript approach to generating HTML client-side.
I use webkit, which is the browser renderer behind Chrome and Safari. There are Python bindings to webkit through Qt. And here is a full example to execute JavaScript and extract the final HTML.
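A minimal sketch of that QtWebKit pattern, assuming PyQt4 with QtWebKit is installed (the Render class and URL are illustrative):

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # Load a URL in a headless QtWebKit page and keep the rendered HTML.
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()  # HTML after JavaScript has run
        self.app.quit()

rendered = Render('http://example.com')  # hypothetical URL
print(rendered.html)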
For Scrapy (a great Python scraping framework) there is scrapyjs: an additional downloader handler / middleware handler able to scrape JavaScript-generated content.
It's based on the WebKit engine via pygtk, python-webkit, and python-jswebkit, and it's quite simple.

Downloading dynamic web pages in python

I use the python requests (http://docs.python-requests.org/en/latest/) library in order to download and parse particular web pages. This works fine as long as the page is not dynamic. Things look different if the page under consideration uses javascript.
In particular, I am talking about a web page that automatically loads more content once you scrolled to the bottom of the page so that you can continue scrolling. This new content is not included in the page's source text, thus, I can't download it.
I thought about simulating a browser in python (selenium) - is this the right way to go?
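A rough sketch of what the Selenium approach could look like (the URL, the fixed wait, and the stopping condition are assumptions):

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://example.com/feed')  # hypothetical URL

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    # Scroll to the bottom so the page's JavaScript loads more content.
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # crude wait for the new content to arrive
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:  # nothing new was added, so stop
        break
    last_height = new_height

html = driver.page_source  # now includes the dynamically loaded entries
driver.quit()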

How to select "Load more results" button when scraping using Python & lxml

I am scraping a webpage. The webpage consists of 50 entries. After 50 entries it shows a
"Load more results" button. I need to select it automatically. How can I do that? For scraping I am using Python and lxml.
Even the JavaScript is using HTTP requests to get the data, so one method would be to investigate which requests provide the data when the user clicks "Load more results", and to emulate those requests.
This is not traditional scraping, which is based on plain or rendered HTML content and detecting further links, but it can be a working solution.
Next actions:
visit the page in Google Chrome or Firefox
press F12 to start up Developer tools or Firebug
switch to "Network" tab
click "Load more results"
check what HTTP requests served the data for loading more results and what data they return.
try to emulate these requests from Python
Note that the data does not necessarily come back as HTML or XML; it could be JSON. But Python provides enough tools to process this format too.
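A minimal sketch of emulating such a request with the requests library (the endpoint and parameters are hypothetical; use whatever the Network tab actually shows):

import requests

# Hypothetical endpoint and parameters discovered in the Network tab.
url = 'https://example.com/api/results'
params = {'offset': 50, 'limit': 50}

response = requests.get(url, params=params)
response.raise_for_status()

data = response.json()  # many such endpoints return JSON rather than HTML
for entry in data.get('items', []):
    print(entry)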
You can't do that. The functionality is provided by javascript, which lxml will not execute.

Advanced screen-scraping using curl

I need to create a script that will log into an authenticated page and download a pdf.
However, the pdf that I need to download is not at a URL, but it is generated upon clicking on a specific input button on the page. When I check the HTML source, it only gives me the url of the button graphic and some obscure name of the button input, and action=".".
In addition, both the URL where the button is and the form name are obscured, for example:
url = /WebObjects/MyStore.woa/wo/5.2.0.5.7.3
input name = 0.0.5.7.1.1.11.19.1.13.13.1.1
How would I log into the page, 'click' that button, and download the pdf file within a script?
Maybe the mechanize module can help.
I think the URL requested on clicking the button may be generated using JavaScript. So, to run JavaScript code from a Python script, take a look at Spidermonkey.
Try mechanize or twill. HttpFox or Firebug can help you build your queries. Remember you can also pickle cookies from your browser and use them later with Python libraries. If the code is generated by JavaScript, it could be possible to 'reverse engineer' it. If not, you can run a JavaScript interpreter, or use Selenium or Windmill to script a real browser.
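A minimal sketch of the mechanize route, reusing the obscured names from the question (the login URL, form index, and field names are assumptions):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Log in first (form index and field names are hypothetical).
br.open('https://example.com/login')
br.select_form(nr=0)
br['username'] = 'user'
br['password'] = 'secret'
br.submit()

# Open the page with the button and 'click' it by submitting its form,
# naming the obscure input from the question.
br.open('https://example.com/WebObjects/MyStore.woa/wo/5.2.0.5.7.3')
br.select_form(nr=0)
response = br.submit(name='0.0.5.7.1.1.11.19.1.13.13.1.1')

with open('document.pdf', 'wb') as f:
    f.write(response.read())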
You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). You may then be able to request the PDF directly.
It's difficult to help without seeing the page in question.
As Acorn said, you should try monitoring the actual requests and see if you can spot a pattern.
If not, then your best bet is to automate a fully-featured browser that will be able to run JavaScript, so you'll exactly mimic what a regular user would do. Have a look at this page on the Python Wiki for ideas; check the section Python Wrappers around Web "Libraries" and Browser Technology.
