Follow a link with Ghost.py - python

I'm trying to use Ghost.py to do some web scraping. I'm trying to follow a link, but Ghost doesn't seem to actually evaluate the JavaScript and follow it. My problem is that I'm in an HTTPS session and cannot use redirection. I've also looked at other options (like Selenium), but I cannot install a browser on the machine that will run the script. I also need to evaluate some JavaScript further on, so I cannot use mechanize.
Here's what I do...
## Open the website
page,resources = ghost.open('https://my.url.com/')
## Fill textboxes of the form (the form didn't have a name)
result, resources = ghost.set_field_value("input[name=UserName]", "myUser")
result, resources = ghost.set_field_value("input[name=Password]", "myPass")
## Submitting the form
result, resources = ghost.evaluate( "document.getElementsByClassName('loginform')[0].submit();", expect_loading=True)
## Print the link to make sure that's the one I want to follow
#result, resources = ghost.evaluate( "document.links[4].href")
## Click the link
result, resources = ghost.evaluate( "document.links[4].click()")
#print ghost.content
When I look at ghost.content, I'm still on the same page and result is empty. I noticed that when I add expect_loading=True to the click evaluation, I get a timeout error.
When I try to run the JavaScript in a Chrome Developer Tools console, I get
event.returnValue is deprecated. Please use the standard
event.preventDefault() instead.
but the page does load the linked URL correctly.
Any ideas are welcome.
Charles

I think you are using the wrong methods for that.
If you want to submit the form, there's a dedicated method for that:
page, resources = ghost.fire_on("loginform", "submit", expect_loading=True)
There's also a dedicated Ghost.py method for performing a click:
ghost.click('#some-selector')
Another possibility, if you just want to open that link, could be:
link_url = ghost.evaluate("document.links[4].href")[0]
ghost.open(link_url)
You only have to find the right selectors for that.
I don't know which page you want to perform the task on, so I can't fix your code exactly, but I hope this helps.
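Putting those pieces together for the flow in the question, a minimal sketch might look like the following. The URL, field names and link index come from the question; "a.next-link" is a placeholder selector, and the exact method signatures can vary between Ghost.py versions, so check them against the one you have installed:
from ghost import Ghost

ghost = Ghost()

# Open the login page (URL taken from the question)
page, resources = ghost.open('https://my.url.com/')

# Fill in the credentials
ghost.set_field_value("input[name=UserName]", "myUser")
ghost.set_field_value("input[name=Password]", "myPass")

# Submit the form with fire_on instead of evaluating submit() by hand
# (the question's form has class "loginform"; adjust the selector if needed)
page, resources = ghost.fire_on(".loginform", "submit", expect_loading=True)

# Option 1: click the link with Ghost.py's own helper, then wait for the load
ghost.click("a.next-link")          # placeholder selector
ghost.wait_for_page_loaded()

# Option 2: read the link's href and open it directly
link_url, resources = ghost.evaluate("document.links[4].href")
page, resources = ghost.open(link_url)

print(ghost.content)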

Related

URL request does not parse all information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start extracting the price information from this website?
I know Python, but I'm not that familiar with websites/HTML. So I would appreciate it if you explained the website-related info as if you were talking to a beginner. Thanks!
There could be a few reasons for this.
From what I can tell, the website runs behind a proxy server, which interferes with your request's loading time. This is why it's not connecting directly to the main page.
It might also be the case that the elements are rendered using JavaScript AFTER the page has loaded, so you only get the page and not the JavaScript-rendered parts. You can try to increase your sleep() time, but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()          # start a real Firefox instance
browser.get("http://example.com")      # load the page so JavaScript can run
html_source = browser.page_source      # HTML source after rendering
With Selenium, you can also use an XPath to obtain the data for "extract the price information from this website"; you can see a tutorial on that here. Alternatively,
once you extract the HTML code, you can also use a parser such as bs4 to extract the required data.
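As a rough sketch, assuming the prices appear in the rendered HTML once JavaScript has run ("span.price" is a made-up selector; inspect the page to find the real one):
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://chiliz.net")       # URL from the question
html_source = browser.page_source       # HTML after JavaScript has rendered
browser.quit()

soup = BeautifulSoup(html_source, "html.parser")
# "span.price" is a placeholder selector; adjust it after inspecting the page
for price in soup.select("span.price"):
    print(price.get_text(strip=True))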

How do I post data to a website's search bar using a Python script?

As we all know, in web applications we have the GET method and the POST method.
My problem is with posting data.
For example, I want my Python code to access a website's search bar, insert some values, submit (press the website's button), and then check the resulting page.
What would the code look like, and is there any documentation about these Python concepts?
I am totally confused.
Note: I am just a beginner in Python.
If the website relies on JavaScript, you're going to need to use something like Selenium, which will emulate a typical browser and allow you to insert information onto a page and execute JavaScript commands.
If, however, the search bar simply posts data to a URL, you can determine that URL and then use requests to post the data and retrieve the result.
resp = requests.post('http://website/search', data = {'term':'value'})
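For example, assuming the search form posts a field called "q" to a /search path (both are guesses; read the real values from the form's action attribute and its input name attributes, or from the browser's network tab):
import requests

# The /search path and the "q" field name are assumptions; take them from
# the <form action="..."> tag and its <input name="..."> fields.
resp = requests.post('http://website/search', data={'q': 'my search term'})

print(resp.status_code)   # 200 means the request went through
print(resp.text[:500])    # first part of the returned HTML page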

How to crawl a web site where page navigation involves dynamic loading

I want to crawl a website that has multiple pages, and when a page number is clicked the page is dynamically loaded. How do I screen-scrape it?
i.e. since the URL is not present as an href or <a> tag, how do I crawl to the other pages?
I would be grateful if someone helped me with this.
PS: The URL remains the same when a different page is clicked.
You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms and take snapshots very quickly.
If you are using Google Chrome, you can check the URL that is being called dynamically under Network -> Headers in the developer tools, and based on that you can identify whether it is a GET or POST request.
If it is a GET request, you can find the parameters straight away in the URL.
If it is a POST request, you can find the parameters under Form Data in Network -> Headers
of the developer tools.
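Once you know the method and the parameters, you can replay that request with requests. The endpoint and parameter names below are placeholders for whatever you actually see in the network tab:
import requests

# URL and parameter names are placeholders taken from Network -> Headers
ajax_url = 'http://example.com/ajax/endpoint'

# If the panel shows a GET request, the parameters live in the query string:
resp = requests.get(ajax_url, params={'page': 2})

# If it shows a POST request, the parameters come from the Form Data section:
# resp = requests.post(ajax_url, data={'page': 2})

print(resp.text)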
You could look for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fun things with regular expressions.
Alternatively, some of the browser testing libraries like splinter work by loading the page in an actual browser like Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
You cannot do that easily, since it is AJAX pagination (even with mechanize). Instead, open the page source and figure out which URL is requested for the AJAX pagination. Then you can recreate that request yourself and process the returned data in your own way.
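For example, a pagination loop might look like the sketch below; the endpoint, the "page" parameter and the assumption that it returns a JSON list are all placeholders you would need to confirm from the page source or the network tab:
import requests

page = 1
while True:
    # Hypothetical AJAX endpoint and "page" parameter
    resp = requests.get('http://example.com/items', params={'page': page})
    items = resp.json()          # assumes the endpoint returns a JSON list
    if not items:
        break                    # no more pages
    for item in items:
        print(item)              # process the returned data your own way
    page += 1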
If you don't mind using gevent, GRobot is another good choice.

Modify wiki page (Confluence) programmatically

I'd like to modify a wiki page (Confluence by Atlassian - JIRA editors) programmatically (in Python). What I have tried so far is to simulate user behaviour:
click on Edit button
change content of a textarea input
submit changes with Save button
Step 1 is OK since I have the URL corresponding to an edit of the page, and step 2 (retrieval of the page and modification) is OK too, but I don't know how to achieve step 3... I'm using urllib2.
Thanks for your help !!
EDIT: XML-RPC is indeed the solution; this example does exactly what I want!
# write to a confluence page
import xmlrpclib
CONFLUENCE_URL = "https://intranet.example.com/confluence/rpc/xmlrpc"
CONFLUENCE_LOGIN = "a confluence username here"
CONFLUENCE_PASSWORD = "confluence pwd for username"
# get this from the page url while editing
# e.g. ../editpage.action?pageId=132350005 <-- here
PAGE_ID = "132350005"
client = xmlrpclib.Server(CONFLUENCE_URL, verbose = 0)
auth_token = client.confluence2.login(CONFLUENCE_LOGIN, CONFLUENCE_PASSWORD)
page = client.confluence2.getPage(auth_token, PAGE_ID)
# and write the new contents
page['content'] = "!!!your content here!!!"
result = client.confluence2.storePage(auth_token, page)
client.confluence2.logout(auth_token)
Note that Confluence modifies your HTML code when you do this. It strips out scripts, styles and sometimes title attributes on elements, for example. In order to get that stuff back in, you then need to use their macro code.
The easiest way to deal with this is to edit the page in Confluence until it looks the way you want, then grab the page and print page['content'] to see what magical new stuff the Atlassian people have decided to do to standard HTML.
This seems like the absolutely wrong way to go about it.
First off, Confluence has a plugin architecture which should allow you to manage content programmatically from the application itself without any kind of HTTP requests. Secondly, even if you don't want to, or can't, use the plugin API for some reason, the next obvious option is to use the SOAP/XML-RPC API.
There is no reason to actually mess with buttons and textareas unless you're trying to do some kind of end-to-end test that includes testing GUI (e.g. automated cross-browser testing).

Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (and of course it also appears in the HTML code after clicking).
I found out that Scrapy can handle forms (like logins) as shown here. But the problem is that there is no form to fill out, so it's not exactly what I need.
How can I simply click a button, which then shows the information I need?
Do I have to use an external library like mechanize or lxml?
Scrapy cannot interpret JavaScript.
If you absolutely must interact with the JavaScript on the page, you want to be using Selenium.
If using Scrapy, the solution to the problem depends on what the button is doing.
If it's just showing content that was previously hidden, you can scrape the data without a problem; it doesn't matter that it wouldn't appear in the browser, since the HTML is still there.
If it's fetching the content dynamically via AJAX when the button is pressed, the best thing to do is to view the HTTP request that goes out when you press the button using a tool like Firebug. You can then just request the data directly from that URL.
Do I have to use an external library like mechanize or lxml?
If you want to interpret JavaScript, yes, you need to use a different library, although neither of those two fits the bill; neither of them knows anything about JavaScript. Selenium is the way to go.
If you can give the URL of the page you're working on scraping I can take a look.
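Following the AJAX approach above, here is a rough sketch of requesting that URL directly from a Scrapy spider once you have copied it from Firebug or your browser's network tab. The URLs and the id parameter are placeholders for whatever request you actually observe:
import scrapy

class ButtonDataSpider(scrapy.Spider):
    name = 'button_data'
    start_urls = ['http://example.com/page']    # placeholder start page

    def parse(self, response):
        # URL copied from the network tab when the button is pressed
        # (placeholder; use the real request you observed)
        yield scrapy.Request('http://example.com/ajax/details?id=123',
                             callback=self.parse_details)

    def parse_details(self, response):
        # The AJAX response is often JSON or an HTML fragment
        self.logger.info(response.text[:200])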
Selenium provides a very nice solution. Here is an example (pip install -U selenium):
from scrapy import Spider, Request
from selenium import webdriver

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self):
        # Drive a real Firefox instance alongside the Scrapy spider
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # Note the @ in the XPath attribute selector
                next = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next.click()
            except Exception:
                # No next button left: stop paging
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
To properly and fully use JavaScript you need a full browser engine, and this is possible only with Watir/WatiN/Selenium, etc.
Although it's an old thread, I've found it quite useful to use Helium (built on top of Selenium) for this purpose, and it is far easier/simpler than using Selenium directly. It will be something like the following:
from helium import *
start_firefox('your_url')
s = S('path_to_your_button')
click(s)
...
