I'm using Scrapy to crawl a webpage. Some of the information I need only appears after you click a certain button (and of course it then shows up in the HTML after the click).
I found out that Scrapy can handle forms (like logins) as shown here, but the problem is that there is no form to fill out, so that's not quite what I need.
How can I simply click a button, which then shows the information I need?
Do I have to use an external library like mechanize or lxml?
Scrapy cannot interpret javascript.
If you absolutely must interact with the javascript on the page, you want to be using Selenium.
If using Scrapy, the solution to the problem depends on what the button is doing.
If it's just showing content that was previously hidden, you can scrape the data without a problem. It doesn't matter that it isn't visible in the browser; the HTML is still there.
If it's fetching the content dynamically via AJAX when the button is pressed, the best thing to do is to inspect the HTTP request that goes out when you press the button, using a tool like Firebug. You can then just request the data directly from that URL.
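For the AJAX case, a minimal Scrapy sketch looks something like this (the domain, endpoint, and JSON field are placeholders for whatever request shows up in the network panel when you press the button):

import json
import scrapy

class AjaxButtonSpider(scrapy.Spider):
    # Hypothetical spider: URLs and field names below are assumptions,
    # not the actual site from the question.
    name = "ajax_button"
    start_urls = ["https://www.example.com/page"]

    def parse(self, response):
        # Skip the button entirely and call the endpoint its JavaScript would hit.
        yield scrapy.Request(
            "https://www.example.com/api/details?id=123",
            callback=self.parse_details,
        )

    def parse_details(self, response):
        # If the endpoint returns JSON, parse it directly.
        data = json.loads(response.text)
        yield {"price": data.get("price")}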
Do I have to use an external library like mechanize or lxml?
If you want to interpret JavaScript, yes, you need a different library, but neither of those two fits the bill: neither knows anything about JavaScript. Selenium is the way to go.
If you can give the URL of the page you're working on scraping I can take a look.
Selenium browser automation provides a very nice solution. Here is an example (pip install -U selenium):
from scrapy import Spider, Request
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class NorthshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # locate the "next" button in the Selenium-driven browser
                next_button = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next_button.click()
            except NoSuchElementException:
                break
        self.driver.close()

    def parse2(self, response):
        print('you are here!')
To fully execute JavaScript you need a full browser engine, and that is only possible with tools like Watir, WatiN, or Selenium.
Although it's an old thread, I've found Helium (built on top of Selenium) quite useful for this purpose, and far easier/simpler than using Selenium directly. It will be something like the following:
from helium import *

# start a Firefox session at your page, then click the button;
# S() matches an element by CSS selector, XPath, name, etc.
start_firefox('your_url')
s = S('path_to_your_button')
click(s)
...
I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is, I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I reach the full HTML code and then start to extract the price information from this website?
I know Python, but I'm not that familiar with websites/HTML. So I would appreciate it if you explain the website-related info as if you were talking to a beginner. Thanks!
There could be a few reasons for this.
The website runs behind a proxy server from what I can tell, so this does interfere with your request loading time. This is why it's not directly connecting to the main page.
It might also be the case that the elements are rendered using javascript AFTER the page has loaded. So, you only get the page and not the javascript rendered parts. You can try to increase your sleep() time but I don't think that will help.
You can also use a library called Selenium. It simply automates browsers and you can use the page_source property to obtain the HTML source code.
Code (taken from here)
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium you can also use an XPath expression to locate the data you mentioned ('extract the price information from this website'); you can see a tutorial on that here. Alternatively,
once you have the HTML source, you can use a parser such as bs4 (BeautifulSoup) to extract the required data, as in the sketch below.
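A minimal sketch of that second route, assuming the price sits in an element with a class like "price" (the selector is a placeholder; inspect the real page to find the right one):

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://chiliz.net")  # the exchange page from the question

# parse the fully rendered page source with BeautifulSoup
soup = BeautifulSoup(browser.page_source, "html.parser")

# ".price" is a placeholder selector; replace it with the element that
# actually holds the figure you want.
for tag in soup.select(".price"):
    print(tag.get_text(strip=True))

browser.quit()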
In Python you can open a web browser like this...
import webbrowser
webbrowser.open("stackoverflow.com")
This method opens a new tab EVERY time the page is called. I want to create a web page with text boxes, graphic (SVG) devices, etc... then pass variables to it. Basically... use the browser as a display screen.
The HTML page would reside in the same folder with the python code... so this works just fine...
import webbrowser
webbrowser.open("sample.html")
The issue is... if I place this in a timer that updates every second... I get tab after tab... but what I want is for it to open the page ONCE, then just pass data to it as if I had used a SUBMIT button...
My code would generate the appropriate text... URL plus data... then pass it as a long URL.
webbrowser.open("sample.html?alpha=50&beta=100")
The page would pull the variables "alpha" and "beta", then shove the data into some graphic device using javascript. I have had great success manipulating SVG this way... http://askjerry.info/SVG for example.
(Feel free to grab my graphics if you like.)
Is it possible to keep updating a SINGLE page/window instead of a new tab every time??
Thanks,
Jerry
Use the selenium module. The .get() method just opens the given URL in the same window, replacing the old one. In fact, there's even a .refresh() method.
From this question: Refresh a local web page using Python
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('URL')

# reload the same window every 20 seconds
while True:
    time.sleep(20)
    driver.refresh()

driver.quit()
Replace 'URL' with your own URL and parameters. But if you want to pass data from Python to HTML/JavaScript, you will be better off learning Flask or something similar. Then you can update your page using AJAX, which will make your graphics look nicer and will stay tractable if you need to pass more data than just alpha and beta.
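As a rough illustration of that Flask idea (purely a sketch; the route names, template file, and fields are made up, and it assumes the page's JavaScript polls a JSON endpoint with fetch()):

from flask import Flask, jsonify, render_template

app = Flask(__name__)

# hypothetical state; in practice this would come from your own logic
state = {"alpha": 50, "beta": 100}

@app.route("/")
def index():
    # templates/display.html would hold your SVG page; its JavaScript can
    # poll /data and update the graphics in place, no new tabs needed
    return render_template("display.html")

@app.route("/data")
def data():
    return jsonify(state)

if __name__ == "__main__":
    app.run(debug=True)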
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests.text always grabs the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
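For completeness, explicit streaming with requests looks roughly like this (it won't help with your actual problem; stream=True and iter_content are standard requests features):

import requests

url = "http://www.anthropologie.com/anthro/product/4120200892474.jsp"
with requests.get(url, stream=True) as resp:
    chunks = []
    # iter_content yields the body piece by piece instead of all at once
    for chunk in resp.iter_content(chunk_size=8192):
        chunks.append(chunk)
html = b"".join(chunks).decode(resp.encoding or "utf-8")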
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the table, which is not present when requests fetches the HTML. So you are getting all the HTML, just not the parts generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
I'm trying to use Ghost.py to do some web scraping. I'm trying to follow a link, but Ghost doesn't seem to actually evaluate the JavaScript and follow it. My problem is that I'm in an HTTPS session and cannot use redirection. I've also looked at other options (like Selenium), but I cannot install a browser on the machine that will run the script. I also have some JavaScript evaluation further on, so I cannot use mechanize.
Here's what I do...
## Open the website
page,resources = ghost.open('https://my.url.com/')
## Fill textboxes of the form (the form didn't have a name)
result, resources = ghost.set_field_value("input[name=UserName]", "myUser")
result, resources = ghost.set_field_value("input[name=Password]", "myPass")
## Submitting the form
result, resources = ghost.evaluate( "document.getElementsByClassName('loginform')[0].submit();", expect_loading=True)
## Print the link to make sure that's the one I want to follow
#result, resources = ghost.evaluate( "document.links[4].href")
## Click the link
result, resources = ghost.evaluate( "document.links[4].click()")
#print ghost.content
When I look at ghost.content, I'm still on the same page and result is empty. I noticed that when I add expect_loading=True when trying to evaluate the click, I get a timeout error.
When I try to run the JavaScript in a Chrome Developer Tools console, I get
event.returnValue is deprecated. Please use the standard
event.preventDefault() instead.
but the page does load the linked URL correctly.
Any ideas are welcome.
Charles
I think you are using the wrong methods for that.
If you want to submit the form there's a special method for that:
page, resources = ghost.fire_on("loginform", "submit", expect_loading=True)
Also there's a special ghost.py method for performing a click:
ghost.click('#some-selector')
Another possibility, if you just want to open that link, could be:
link_url = ghost.evaluate("document.links[4].href")[0]
ghost.open(link_url)
You only have to find the right selectors for that.
I don't know which page you want to perform the task on, so I can't fix your code, but I hope this helps.
Quite often I have to download the PDFs from websites, but sometimes they are not all on one page.
The links are split across paginated pages, and I have to click through every page to get them.
I am learning Python and I want to write a script where I can put in the web URL and it extracts the PDF links from that website.
I am new to Python, so can anyone please give me directions on how I can do it?
Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Result:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
If there are a lot of pages with links, you can try the excellent framework Scrapy (http://scrapy.org/).
It is pretty easy to understand how to use it, and it can download the PDF files you need; a rough sketch of such a spider follows.
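A minimal sketch, assuming the site paginates with plain "next" links (the start URL is the one from the earlier answer, and the CSS selectors are placeholders you would adjust to the real site):

import scrapy

class PdfLinkSpider(scrapy.Spider):
    name = "pdf_links"
    start_urls = ["http://www.renderx.com/demos/examples.html"]

    def parse(self, response):
        # yield every link whose href ends in .pdf
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                yield {"pdf_url": response.urljoin(href)}

        # follow a "next page" link if the listing is paginated
        # ("a.next" is an assumed selector)
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)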
If you are going to grab things from a website that consists of static pages, you can easily grab the HTML with requests:
import requests
page_content = requests.get(url).text
But if you are grabbing from something like a community site, there will be anti-scraping measures, and getting around them becomes the problem.
First way: make your requests look more like a browser (a human):
add the headers (you can use Chrome's dev tools or Fiddler to copy the headers);
build the right POST form data, copying what the browser sends when you submit the form;
get the cookies and add them to the requests session (see the sketch below).
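A rough sketch of that first approach using a requests session (the URL, form field names, and header values are placeholders copied from what you would see in the dev tools):

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.example.com/login",
})

# reproduce the login form the browser would POST; cookies set by the
# response are kept on the session automatically
session.post(
    "https://www.example.com/login",
    data={"username": "me", "password": "secret"},
)

# later requests reuse those cookies and headers
page = session.get("https://www.example.com/protected-page")
print(page.text[:200])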
Second way: use Selenium and a browser driver. Selenium drives a real browser (in my case, chromedriver).
Remember to add chromedriver to the PATH,
or point to the driver executable in code (double-check this setup call against the Selenium docs for your version):
from selenium import webdriver
driver = webdriver.Chrome(executable_path=path)
driver.get(url)
This genuinely visits the URL with a real browser, which makes grabbing things much easier.
Get the page source:
page = driver.page_source
Some websites redirect through several pages, which can cause errors. Make your script wait until a certain element shows up:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the element (the id is a placeholder)
certain_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'youKnowThereIsAnElementsID')))
Or use an implicit wait for however long you like:
driver.implicitly_wait(5)  # seconds
You can control the whole page through WebDriver. I won't describe everything here, but the module is well worth reading up on.