I'm working on a web crawler in Python using Selenium. I successfully got the contents using chromedriver, but a problem occurred when I tried headless crawling through PhantomJS: find_element_by_id and find_element_by_name did not work. Is there any difference between these drivers? I am making this headless because I want to run the code on an Ubuntu server as a batch job without GUI support.
My script is as below.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
#driver = webdriver.PhantomJS('/Users/user/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
#driver = webdriver.Chrome('/Users/user/Downloads/chromedriver')
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(3)
#here I tried two different locator calls but neither worked
user = driver.find_element(by=By.NAME,value="user:email")
password = driver.find_element_by_id('user_password')
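One thing I am unsure about is whether an explicit wait would behave differently under PhantomJS; a minimal sketch of what I mean (reusing the same field names) is:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the login fields to appear before locating them
wait = WebDriverWait(driver, 10)
user = wait.until(EC.presence_of_element_located((By.NAME, "user:email")))
password = wait.until(EC.presence_of_element_located((By.ID, "user_password")))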
I am trying to get into this website: "https://core.cro.ie/".
I can get in using a normal web browser, but I can't get in using Selenium.
My code looks like this:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from webdriver_manager.microsoft import EdgeChromiumDriverManager

site = "https://core.cro.ie/"
driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))
driver.get(site)
driver.maximize_window()
Any ideas? Thank you very much
This code works fine for navigation (I don't have the Edge browser):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
site= "https://core.cro.ie/"
driver = webdriver.Firefox()
driver.get(site)
driver.maximize_window()
I installed selenium prior to running the test. It seems like the website has some sort of bot-prevention mechanism, but navigation works fine:
pip install selenium
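If Edge itself is being flagged, one thing that might be worth trying (a sketch, assuming a Chromium-based Edge and Selenium 4) is to start it with the automation-controlled Blink feature disabled:
from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
from webdriver_manager.microsoft import EdgeChromiumDriverManager

options = Options()
# hide the navigator.webdriver automation hint that some sites check
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Edge(
    service=Service(EdgeChromiumDriverManager().install()),
    options=options,
)
driver.get("https://core.cro.ie/")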
I am trying to set up python selenium to work on my WSL 2 (Kali).
I have followed along with this article https://www.gregbrisebois.com/posts/chromedriver-in-wsl2/
Running "google-chrome" in terminal returns a working browser.
Trying to run this test script results in a browser window, but nothing loads into the window and no code after driver = webdriver.Chrome() runs.
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time
print('starting')
driver = webdriver.Chrome()
driver.get('https://www.google.com')
print('Worked')
driver.close()
The script opens a browser window, but the page never stops loading.
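The article sets up a virtual display, and pyvirtualdisplay is imported here but never started; a sketch of what I believe is intended (assuming Xvfb is installed inside WSL) is:
from pyvirtualdisplay import Display
from selenium import webdriver

# start an Xvfb-backed virtual display before launching Chrome
display = Display(visible=0, size=(1280, 800))
display.start()

driver = webdriver.Chrome()
driver.get('https://www.google.com')
print(driver.title)

driver.quit()
display.stop()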
I am currently using Python and trying to have Selenium click the "About" link on Google without using an id. When I try to use .click() it does not execute. What is wrong with my code? I have looked at many videos and tutorials and it looks correct.
from selenium import webdriver
from time import sleep
browser = webdriver.Safari()
browser.get('http://google.com')
browser.maximize_window()
elm = browser.find_element_by_link_text('About')
browser.implicitly_wait(5)
elm.click()
I think you can try using find_element_by_xpath.
First copy the XPath of the About link, then try it like below:
from selenium import webdriver
from time import sleep
browser = webdriver.Safari()
browser.get('http://google.com')
browser.maximize_window()
elm = browser.find_element_by_xpath('//*[@id="fsl"]/a[3]')
browser.implicitly_wait(5)
elm.click()
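Note that on Selenium 4 the find_element_by_* helpers have been removed, so the same lookup needs the By-based form (a sketch keeping the same XPath):
from selenium.webdriver.common.by import By

# equivalent lookups with the Selenium 4 API
elm = browser.find_element(By.XPATH, '//*[@id="fsl"]/a[3]')
# or locate the link directly by its visible text
elm = browser.find_element(By.LINK_TEXT, 'About')
elm.click()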
So the issue ended up being Safari. For some reason the Safari web driver was not allowing me to use .click(). I switched to Chrome and it worked.
I am using Selenium to crawl a JavaScript website. The issue is that a Firefox browser opens up, but the call for the URL is never made; only when I close the browser is the URL requested, and of course I then get the missing-driver exception. What do you think the issue comes from?
Knowing that:
all programs are up to date
my solution works fine locally, but when I try to deploy it on the server, I start having issues
Example: on my local machine I run this script and everything goes smoothly, but when I run it on a server (Linux), only the browser opens up and the URL is never requested.
from selenium import webdriver
import time
geckodriver_path = r'.../geckodriver'
driver = webdriver.Firefox(executable_path= geckodriver_path)
time.sleep(3)
driver.get("http://www.stackoverflow.com")
I ended up finding the solution:
from selenium import webdriver
import time
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
geckodriver_path = r'/path_to/geckodriver'
binary = FirefoxBinary(r'/usr/bin/firefox')
capabilities = webdriver.DesiredCapabilities().FIREFOX
capabilities["marionette"] = False
driver = webdriver.Firefox(firefox_binary=binary,
executable_path= geckodriver_path,
capabilities=capabilities)
time.sleep(3)
driver.get("https://stackoverflow.com/")
time.sleep(6)
driver.close()
# solution from:
# https://github.com/SeleniumHQ/selenium/issues/3884
# https://stackoverflow.com/questions/25713824/setting-path-to-firefox-binary-on-windows-with-selenium-webdriver
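Since the goal is to run this on a server, a related variant worth noting (a sketch, assuming a reasonably recent Firefox and geckodriver) is to run Firefox headless so no window is opened at all:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without opening a window

driver = webdriver.Firefox(executable_path=r'/path_to/geckodriver', options=options)
driver.get("https://stackoverflow.com/")
print(driver.title)
driver.quit()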
I have written many scrapers, but I am not really sure how to handle infinite scrollers. These days most websites, e.g. Facebook and Pinterest, have infinite scrollers.
You can use Selenium to scrape an infinite-scrolling website like Twitter or Facebook.
Step 1 : Install Selenium using pip
pip install selenium
Step 2 : Use the code below to automate the infinite scroll and extract the source code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import unittest, time, re
class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
Step 3 : Print the data if required.
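As a variation on Step 2, instead of a fixed number of iterations you can stop scrolling once the page height stops growing (a small sketch of the same loop):
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(4)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content loaded, we reached the end
    last_height = new_height
html_source = driver.page_source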
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
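Once you have that URL, you can usually request it directly and page through the results; a sketch (the endpoint and its page parameter are made up for illustration) might look like:
import requests

page = 1
while True:
    # hypothetical endpoint discovered in the browser's network console
    resp = requests.get("https://example.com/api/items", params={"page": page})
    items = resp.json()
    if not items:
        break  # no more pages
    for item in items:
        print(item)
    page += 1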
Finding the URL of the AJAX source will be the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.