I'm trying to use PhantomJS to write a scraper, but even the example in the morph.io documentation does not work. I suspect the problem is HTTPS: when I test with plain HTTP it works. Can you please give me a solution?
I tested the same page with Firefox and it works.
from splinter import Browser

with Browser("phantomjs") as browser:
    # Optional, but make sure large enough that responsive pages don't
    # hide elements on you...
    browser.driver.set_window_size(1280, 1024)

    # Open the page you want...
    browser.visit("https://morph.io")

    # submit the search form...
    browser.fill("q", "parliament")
    button = browser.find_by_css("button[type='submit']")
    button.click()

    # Scrape the data you like...
    links = browser.find_by_css(".search-results .list-group-item")
    for link in links:
        print link['href']
Is PhantomJS not working on HTTPS URLs?
Splinter uses the Selenium WebDriver bindings for Python under the hood, so you can simply pass the necessary options like this:
with Browser("phantomjs", service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any']) as browser:
...
See PhantomJS failing to open HTTPS site for why those options might be necessary. Take a look at the PhantomJS commandline interface for more options.
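For completeness, here is roughly how the original snippet looks with those flags applied (a sketch only, assuming the morph.io page structure from the question hasn't changed):

from splinter import Browser

with Browser("phantomjs",
             service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any']) as browser:
    browser.driver.set_window_size(1280, 1024)
    browser.visit("https://morph.io")  # HTTPS should now load
    browser.fill("q", "parliament")
    browser.find_by_css("button[type='submit']").click()
    for link in browser.find_by_css(".search-results .list-group-item"):
        print link['href']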
So I am trying to log in programmatically (Python) to https://www.datacamp.com/users/sign_in using my email and password.
I have tried two methods of login: one using the requests library and another using Selenium (code below). Both times I'm facing a [403] issue.
Could someone please help me log in to it programmatically?
Thank you!
Using the requests library:
import requests
r = requests.get("https://www.datacamp.com/users/sign_in")
r  # gives <Response [403]>
Using the Selenium webdriver:
driver = webdriver.Chrome(executable_path=driver_path, options=option)
driver.get("https://www.datacamp.com/users/sign_in")
driver.find_element_by_id("user_email")  # there is supposed to be a form element with id=user_email for inputting the email
An implicit wait, at least, should have worked, like this:
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
url = "https://www.datacamp.com/users/sign_in"
driver.get(url)
driver.find_element_by_id("user_email").send_keys("test@dsfdfs.com")
driver.find_element_by_css_selector("#new_user>button[type=button]").click()
BUT
The real issue is that the site uses anti-scraping software.
If you open the Console and inspect the request itself, you'll see that the site blocks your connection even before you try to log in.
Here is a similar question with different solutions: Can a website detect when you are using Selenium with chromedriver?
Not all answers will work for you; try the different approaches suggested.
With Firefox you'll have the same issue (I've already checked).
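One of the approaches suggested in that question, for reference, is to hide the most obvious automation fingerprints via ChromeOptions (a sketch only; it may or may not be enough for this particular site's protection):

from selenium import webdriver

options = webdriver.ChromeOptions()
# flags commonly suggested to make automated Chrome look less like automation
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
driver.get("https://www.datacamp.com/users/sign_in")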
You have to add a wait after driver.get("https://www.datacamp.com/users/sign_in") and before driver.find_element_by_id("user_email") to let the page load.
Try something like WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'user_email')))
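Spelled out with the imports it needs, the explicit wait looks roughly like this (a minimal sketch; the email value is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.datacamp.com/users/sign_in")

# wait up to 10 seconds for the email field to appear in the DOM
email_field = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "user_email"))
)
email_field.send_keys("you@example.com")

Keep in mind that, as the other answer notes, the site's anti-scraping protection may still block the request no matter how long you wait.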
I was hoping someone could help me understand what's going on:
I'm using Selenium with the Firefox browser to download a PDF (I need Selenium to log in to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the PDF, but after that it just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang, but it opens the PDF instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with the selenium cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using the most recent version of Firefox, it'll most likely render the PDF via pdf.js so you can view it inline. This isn't ideal because now we can't download the file.
You can disable pdf.js via Selenium options but this will likely lead to the issue in this question where the browser gets stuck. This might be because of an unknown MIME-Type but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver

pdf_url = "/url/to/some/file.pdf"

# set up the driver with whatever options/profile you need
driver = webdriver.Firefox()

# do whatever you need to do to auth/login/click/etc.

# navigate to the PDF URL in case the PDF link issues a
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)

# get the (possibly redirected) URL from Selenium
current_pdf_url = driver.current_url

# create a requests session
session = requests.session()

# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])

# Note: If headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium

# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)

# access the bytes response from the session
pdf_bytes = pdf_response.content
I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you return headers, wait for requests to finish, use proxies, and much more.
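For reference, a rough sketch of what the seleniumwire variant looks like when you also need the request headers (it assumes seleniumwire is installed; the URL is a placeholder):

from seleniumwire import webdriver  # drop-in replacement for selenium's webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/some/file.pdf")  # placeholder URL

# seleniumwire records every request the browser made
for request in driver.requests:
    if request.response and request.url.endswith(".pdf"):
        # the headers the browser actually sent, reusable with requests
        print(request.headers)

driver.quit()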
Here is my problem: I'm trying to use Selenium to access a webpage, and the special thing about this page is that it auto-redirects (you open the page and after a few seconds it automatically redirects to another page). When I use driver = webdriver.Firefox(), my IDM catches that link perfectly after a few seconds.
Because I don't want the browser window to come up, I use PhantomJS instead, but it is not working. My application only gets the loading-page URL (bitdl-1336...) but not the redirected link. Please help!
This is my code:
link = 'http://torrent.ajee.sh/hash.php?hash=' + self.global_hash_code
driver = webdriver.PhantomJS('phantomjs.exe')
driver.get(str(link))
element = driver.find_element_by_link_text('Download Zip')
element.click()
time.sleep(10)
msg = QMessageBox.information(self, QString('Thành công'), QString(driver.current_url))  # 'Thành công' = 'Success'
And the result is that the message box only shows the loading-page URL, not the redirected link.
Please help!
Sorry about my English.
Not exactly an answer to your PhantomJS-specific question, but a workaround to the problem.
And because i don't want the browser to come up so i use Phantomjs instead
You can continue using Firefox, but start it in a virtual display; see more information at:
How do I run Selenium in Xvfb?
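The usual recipe from that question is the pyvirtualdisplay wrapper around Xvfb; a minimal sketch (Linux only, assuming xvfb and pyvirtualdisplay are installed):

from pyvirtualdisplay import Display
from selenium import webdriver

# start an invisible X display so Firefox has somewhere to render
display = Display(visible=0, size=(1280, 1024))
display.start()

driver = webdriver.Firefox()
url = "http://example.com/redirecting-page"  # placeholder: build the link as in your code above
driver.get(url)
# ... wait for the redirect, read driver.current_url, download, etc.

driver.quit()
display.stop()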
You may also need to let the browser automatically save the archive in a specified directory; see:
How do I automatically download files from a pop up dialog using selenium-python
Access to file download dialog in Firefox
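What those questions boil down to is a Firefox profile that saves known MIME types straight to disk without a dialog; a rough sketch (the download directory and MIME types here are assumptions; adjust them to whatever the site actually serves):

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)            # 2 = use a custom directory
profile.set_preference("browser.download.dir", "/tmp/downloads")    # where files get saved
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/zip,application/octet-stream")  # MIME types to save silently

driver = webdriver.Firefox(firefox_profile=profile)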
I wrote a program to take a screenshot of a chosen webpage. The user types a URL and then my application takes a screenshot of that page. I wonder whether it is possible (and how) to hide the browser window? I mean, not to open it at all, but still take the screenshot. Thanks in advance :)
I use Python 2.7 and splinter for this. Code below:
from splinter import Browser
import socket

url = raw_input('> ')
browser = None
try:
    browser = Browser('firefox')
    try:
        browser.visit(url)
        if browser.status_code.is_success():
            browser.driver.save_screenshot('picture.png')
    except socket.gaierror, e:
        print "URL not found: %s" % url
finally:
    if browser is not None:
        browser.quit()
For Ubuntu, I found this: Selenium-Python Client Library - Automating in Background. But what about Windows?
You have a few options:
Use a "dumb" headless browser, like mechanize. This is fast, and perfect for a quick visit and a screenshot. However, it doesn't understand javascript.
Use the zope.testbrowser browser in your splinter tests. It's a headless browser, so it won't appear on screen. It understands javascript, but will take more investment to get up and going.
Just use urllib2 with some special headers.
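A further option, not listed above but already used elsewhere on this page, is splinter's phantomjs driver: PhantomJS renders the page off-screen (on Windows as well), so no window appears but a real screenshot is still possible. A minimal sketch, assuming PhantomJS is installed and on the PATH:

from splinter import Browser

url = raw_input('> ')
with Browser('phantomjs') as browser:  # headless: no browser window is opened
    browser.visit(url)
    browser.driver.save_screenshot('picture.png')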
I am trying to get some comments off the car blog Jalopnik. They don't come with the web page initially; instead, the comments get retrieved with some JavaScript, and you only get the featured ones. I need all the comments, so I would click "All" (between "Featured" and "Start a New Discussion") to get them.
To automate this, I tried learning Selenium. I modified their script from PyPI, guessing that the code for clicking a link was link.click() and link = browser.find_element_by_xpath(...). It doesn't look like the "All" button (displaying all comments) was pressed.
Ultimately I'd like to download the HTML of that version to parse.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers/") # Load page
time.sleep(0.2)
link = browser.find_element_by_xpath("//a[@class='tc cn_showall']")
link.click()
browser.save_screenshot('screenie.png')
browser.close()
Using Firefox with the Firebug plugin, I browsed to http://jalopnik.com/5912009/prius-driver-beat-up-after-taking-out-two-bikers.
I then opened the Firebug console and clicked on ALL; it obligingly showed a single AJAX call to http://jalopnik.com/index.php?op=threadlist&post_id=5912009&mode=all&page=0&repliesmode=hide&nouser=true&selected_thread=null
Opening that url in a new window gets me the comment feed you are seeking.
More generally, if you substitute the appropriate article-ID into that url, you should be able to automate the process without Selenium.
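For example, something along these lines should pull the comment feed directly (a rough sketch; it assumes the endpoint and parameters shown in the AJAX call above still behave the same way):

import requests

post_id = "5912009"  # the article ID from the Jalopnik URL
params = {
    "op": "threadlist",
    "post_id": post_id,
    "mode": "all",
    "page": 0,
    "repliesmode": "hide",
    "nouser": "true",
    "selected_thread": "null",
}
response = requests.get("http://jalopnik.com/index.php", params=params)
html = response.text  # the full comment feed, ready to parse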