Without Selenium's headless mode, the code below works fine. With headless mode enabled, why doesn't the for loop execute?
Here is my code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
options = Options()
options.add_argument("--disable-notifications")
options.add_argument('headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=options)
url = "https://www.justdial.com/Delhi/S-K-Premium-Par-Hari-Nagar/011PXX11-XX11-131128122154-B8G6_BZDET"
driver.get(url)
try:
    pop_up = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="best_deal_detail_div"]/section/span')))
    pop_up.click()  # Dismiss the pop-up
except TimeoutException:
    pass
while True:
    try:
        element = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, "//span[text()='Load More Reviews..']")))
        element.click()
    except TimeoutException:
        break
    except:
        pass
soup = BeautifulSoup(driver.page_source, 'lxml')
services = soup.find_all('span', {'class': "rName lng_commn"})
for i in services:
    print(i.text)
I want to run this code with Selenium in headless mode. Please help.
Some websites behave differently when they see a headless user agent.
Try changing your user agent to that of a regular Chrome browser and see if it works.
options.add_argument("""user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)""")
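A minimal sketch of how the suggestion above might be wired up, assuming Chrome and a desktop user-agent string of your choosing. The names `headless_args`, `apply_args`, and `FakeOptions` are hypothetical and exist only so the snippet runs without a browser; with Selenium you would pass a real `Options()` object in place of `FakeOptions()`.

```python
# Hypothetical helpers (not part of Selenium): build the Chrome arguments
# once, then apply them to any object that exposes add_argument().

DESKTOP_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
)


def headless_args(user_agent=DESKTOP_UA, width=1920, height=1080):
    """Return Chrome command-line arguments for a headless run."""
    return [
        "--headless",
        f"--window-size={width},{height}",  # give the page a desktop-sized viewport
        f"user-agent={user_agent}",         # replace the default headless user agent
    ]


def apply_args(options, args):
    """Add every argument to an options object and return it."""
    for arg in args:
        options.add_argument(arg)
    return options


class FakeOptions:
    """Stand-in for selenium's Options, used only for this demonstration."""

    def __init__(self):
        self.arguments = []

    def add_argument(self, arg):
        self.arguments.append(arg)


opts = apply_args(FakeOptions(), headless_args())
for arg in opts.arguments:
    print(arg)
```

Keeping the argument list in one function makes it easy to run the same script with and without `--headless` when comparing what the site returns.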
Related
I am trying to scrape data from the following site. I was able to click "Load More", yet the code doesn't catch most of the elements, and I don't know what to do.
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
url = 'https://www.carrefouregypt.com/mafegy/en/c/FEGY1701230'
products = []
options = Options()
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(8)
# click on load more
while True:
    try:
        btn_class = 'css-1n3fqy0'
        btn = driver.find_element(By.CLASS_NAME, btn_class)
        btn.click()
        driver.implicitly_wait(10)
    except NoSuchElementException:
        break
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(8)
The following code will click that button until it cannot locate it, and exit gracefully:
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options as Firefox_Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
import time as t
firefox_options = Firefox_Options()
firefox_options.add_argument("--width=1280")
firefox_options.add_argument("--height=720")
# firefox_options.headless = True
firefox_options.set_preference("general.useragent.override", "Mozilla/5.0 (Linux; Android 7.0; SM-A310F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.91 Mobile Safari/537.36 OPR/42.7.2246.114996")
driverService = Service('chromedriver/geckodriver')
browser = webdriver.Firefox(service=driverService, options=firefox_options)
url = 'https://www.carrefouregypt.com/mafegy/en/c/FEGY1701230'
browser.get(url)
t.sleep(5)
while True:
    try:
        load_more_button = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//button[text()="Load More"]')))
        browser.execute_script('window.scrollBy(0, 100);')
        load_more_button.click()
        print('clicked')
        t.sleep(3)
    except TimeoutException:
        print('all elements loaded in page')
        break
It uses Firefox on a Linux setup (for some reason Chrome was temperamental on this one). You only need to pay attention to the imports and to the code after the browser/driver is defined. Selenium documentation: https://www.selenium.dev/documentation/
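The click-until-timeout pattern above can be sketched independently of any browser. This is an illustration under stated assumptions, not the answer's exact code: `click_until_gone`, `FakeButton`, and `FakePage` are hypothetical names, and the `max_clicks` cap is an added safety limit so the loop cannot run forever if the button never disappears.

```python
def click_until_gone(find_button, timeout_exc, max_clicks=100):
    """Click the button returned by find_button() until it stops appearing.

    find_button: callable returning a clickable element, or raising
                 timeout_exc when the button cannot be found in time.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except timeout_exc:
            break  # button no longer present: everything is loaded
        button.click()
        clicks += 1
    return clicks


class FakeButton:
    """Stand-in for a WebElement; click() does nothing."""

    def click(self):
        pass


class FakePage:
    """Stand-in page whose button disappears after a fixed number of lookups."""

    def __init__(self, remaining):
        self.remaining = remaining

    def find(self):
        if self.remaining == 0:
            raise TimeoutError("button not found")
        self.remaining -= 1
        return FakeButton()


page = FakePage(remaining=3)
total = click_until_gone(page.find, TimeoutError)
print(total)
```

With Selenium, `find_button` could be a lambda wrapping `WebDriverWait(browser, 10).until(EC.element_to_be_clickable(...))`, and `timeout_exc` would be `TimeoutException`.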
I want to extract the CSV download URL from this website: https://www.nseindia.com/option-chain
Code I used till now
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.get("https://www.nseindia.com/option-chain")
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.ID, "equity_underlyingVal")))
nifty = driver.find_element(By.XPATH, '//*[@id="equity_underlyingVal"]').text.replace('NIFTY ', '').replace(',', '')
time_stamp = driver.find_element(By.XPATH, '//*[@id="equity_timeStamp"]').text
I need to load the CSV behind that link into a pandas DataFrame. I would prefer not to use Selenium, or, if Selenium is required, to run it headless. Let me know if anyone has a better idea for extracting the data directly into a pandas DataFrame.
You can extract the download link contained in that element with Selenium as follows:
link = driver.find_element(By.CSS_SELECTOR, '#downloadOCTable').get_attribute("href")
As the download link is not present in the href attribute, the best approach is to download the CSV file.
Interacting with a page in headless mode can cause problems if the window-size argument is not specified. A workaround for downloading files in headless mode is to set the download path by sending the Page.setDownloadBehavior command through driver.command_executor.
Code snippet to download the CSV in headless mode:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import os
options = Options()
# add necessary arguments
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36")
options.add_argument("--window-size=1920,1080")
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(),options=options)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
#set download path (set to current working directory in this example)
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow','downloadPath':os.getcwd()}}
command_result = driver.execute("send_command", params)
driver.get("https://www.nseindia.com/option-chain")
# wait for table details to appear
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="equity_optionChainTable"]')))
# find and click on download csv button
download_button = driver.find_element(By.XPATH, '//*[@id="downloadOCTable"]')
download_button.click()
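After the click, the script still has to wait for the file to land on disk before it can be read. A minimal sketch of one way to do that, assuming Chrome (which writes partial downloads with a `.crdownload` suffix) and a download directory that holds no older CSV files; `wait_for_download` is a hypothetical helper, not a Selenium API.

```python
import os
import tempfile
import time


def wait_for_download(directory, suffix=".csv", timeout=30.0, poll=0.5):
    """Return the path of the newest finished file ending in suffix.

    Returns None if no finished file shows up before the timeout.
    A download counts as unfinished while any .crdownload file exists.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        in_progress = any(name.endswith(".crdownload") for name in names)
        done = [os.path.join(directory, name)
                for name in names if name.endswith(suffix)]
        if done and not in_progress:
            return max(done, key=os.path.getmtime)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Demonstration with a temporary directory instead of a real download.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "option-chain.csv"), "w").close()
    found = wait_for_download(demo_dir, timeout=1.0, poll=0.1)
    print(os.path.basename(found))
```

In the snippet above the directory would be `os.getcwd()`, and the returned path could then be passed to `pandas.read_csv` to build the DataFrame.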
I'm testing with this site: https://www.nike.com.br/cosmic-unity-153-169-211-324680
A few seconds after the page loads, you must select a size, and I can't select the size automatically with Selenium. Can someone help me?
When the prompt to select the sneaker size appears (I'm in Brazil and I select size 40), inspecting the "40" shows that it is a label, and this label has no id. The label is the following HTML snippet:
<label for="tamanho__id40">40</label>
How could I click on this label in Selenium?
I currently have this code:
import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
option = Options()
prefs = {'profile.default_content_setting_values': {'images': 2}}
option.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(options = option)
# Navigate to url
driver.get("https://www.nike.com.br/cosmic-unity-153-169-211-324680")
What would I have to add to be able to click on this label that has no id?
1. You need to accept cookies.
2. Use Selenium's explicit waits. To use them you will need to import:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
3. Use reliable locators. I propose this XPath locator for shoe size 40: //label[@for="tamanho__id40"]
4. I added some chrome_options for dealing with this site.
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-blink-features")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=chrome_options)
driver.get("https://www.nike.com.br/cosmic-unity-153-169-211-324680")
wait = WebDriverWait(driver, 15)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.cc-allow'))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, '//label[@for="tamanho__id40"]'))).click()
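The attribute-based locator can be checked offline before running a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports the same `[@attr='value']` predicate. Only the `<label for="tamanho__id40">40</label>` line comes from the question; the surrounding markup is invented for illustration and the real page will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the label from the question.
snippet = """
<div>
  <input type="radio" id="tamanho__id39"/>
  <label for="tamanho__id39">39</label>
  <input type="radio" id="tamanho__id40"/>
  <label for="tamanho__id40">40</label>
</div>
""".strip()

root = ET.fromstring(snippet)

# Same idea as the Selenium locator //label[@for="tamanho__id40"]:
# select the label by its "for" attribute rather than by an id it lacks.
label = root.find(".//label[@for='tamanho__id40']")
print(label.text)
```

An element without an id can still be located reliably as long as some attribute, here `for`, identifies it uniquely.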
I've been trying to scrape this website for 2 days now. I'm completely stuck. The problem is that it detects me as a bot.
I have a list of URLs that I need to crawl, and in the results folder every file says "Access to this page has been denied... To continue, please prove you are not a robot...", etc.
Below is my current code
import time
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
CHROMEDRIVER_PATH = './chromedriver'
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
ua = UserAgent()
userAgent = ua.random
chrome_options.add_argument('user-agent={}'.format(userAgent))
LOGIN_PAGE = "https://www.seekingalpha.com/login"
ACCOUNT = "Account"
PASSWORD = "Password"
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
wait = WebDriverWait(driver, 30)
driver.get("https://www.seekingalpha.com/login")
time.sleep(1)
wait.until(EC.element_to_be_clickable((By.NAME, "email"))).send_keys(ACCOUNT)
wait.until(EC.element_to_be_clickable((By.ID, "signInPasswordField"))).send_keys(PASSWORD)
wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Sign in']"))).click()
time.sleep(1)
with open("links.txt", "r") as inArticle:
    articles = inArticle.read().splitlines()
for article in articles:
    outName = article.split("/")[-1]
    outName = outName.split("-")[0]
    driver.get(article)
    time.sleep(1)
    html_source = driver.page_source
    out_text = str(html_source).encode("utf8")
    with open("./results/"+outName, "w") as outFile:
        outFile.write(out_text)
driver.quit()
Is there a better way to do this? And is there a way to pass this bot check?
I am trying to register a new account for this site. However, I cannot register there because of an error or block for ChromeDriver (selenium Python).
I am using this code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import random
import string
from time import sleep
from selenium.webdriver.common.keys import Keys
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'
options = webdriver.ChromeOptions()
options.add_argument(f'user-agent={user_agent}')
options.add_argument('disable-infobars')
options.add_argument('--profile-directory=Default')
options.add_argument("--incognito")
options.add_argument("--disable-plugins-discovery")
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors", "safebrowsing-disable-download-protection", "safebrowsing-disable-auto-update", "disable-client-side-phishing-detection"])
options.add_argument('--disable-extensions')
options.add_argument("start-maximized")
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'chromedriver1.exe')
driver.get('https://www.nordstrom.com/signin?cm_sp=SI_SP_A-_-SI_SP_B-_-SI_SP_C&origin=tab&ReturnURL=https%3A%2F%2Fwww.nordstrom.com%2F')
def email(stringLength=8):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))
email = email(6) + "@gmail.com"
sleep(5)
# email
driver.find_element_by_name("email").send_keys(email)
sleep(5)
# next
driver.find_element_by_id('account-check-next-button').send_keys(Keys.ENTER)
I think the website is blocking WebDriver. When I use Chrome on my computer, I don't encounter any problems, but with ChromeDriver this is the issue I get.
To open the form, use:
driver.get('https://www.nordstrom.com/signin')
not:
driver.get('https://www.nordstrom.com/signin?cm_sp=SI_SP_A-_-SI_SP_B-_-SI_SP_C&origin=tab&ReturnURL=https%3A%2F%2Fwww.nordstrom.com%2F')
Try using an explicit wait before the click:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
wait = WebDriverWait(driver, 30)
wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'button[alt="next button"]')))
btn = driver.find_element(By.CSS_SELECTOR, 'button[alt="next button"]')
btn.click()
This should work, because I was able to reproduce your error. The problems are:
You should use click() for clicking, not send_keys(Keys.ENTER). In your case, Enter is pressed before the email has been completely typed, just before the @ symbol.
Your time.sleep(5) is not enough. Use an explicit wait or, in the worst case, increase the sleep (if you don't care about speed).
Read up on how to use Selenium's waits instead of time.sleep().
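To show why an explicit wait is preferable to a fixed sleep, here is a minimal pure-Python sketch of the polling idea behind it. It is an illustration only, not Selenium's implementation; `wait_until` is a hypothetical helper, and the 0.5 s default poll interval simply mirrors the default used by `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises TimeoutError
    once the timeout has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Demonstration: the "element" becomes ready after about 0.3 seconds.
start = time.monotonic()
ready_at = start + 0.3
value = wait_until(lambda: time.monotonic() >= ready_at, timeout=2.0, poll=0.05)
elapsed = time.monotonic() - start
print(value, elapsed < 2.0)
```

Unlike `time.sleep(5)`, the wait returns as soon as the condition holds and fails loudly when it never does, so the script is neither slower than necessary nor silently too fast.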
Update:
Clear all cookies, cache, and application data for this site and try again, manually at first. It looks like the site blocks users after several unsuccessful attempts, even with valid emails.
UPDATE:
Unfortunately, Nordstrom is blocking automated requests...
Nordstrom tracks all customer actions, so it is unlikely to be usable for testing. I would suggest trying other sites to save time.