I would like to scrape job listings from a Dutch job listings website. However, when I try to open the page with selenium I run into a cookiewall (new GDPR rules). How do I bypass the cookiewall?
from selenium import webdriver

# launch URL
url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"
# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)
Edit: I tried something:
from selenium import webdriver
import pickle
url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"
driver = webdriver.Firefox()
driver.set_page_load_timeout(20)
driver.get(url)
pickle.dump(driver.get_cookies(), open("NVBCookies.pkl", "wb"))
After that, loading the cookies did not work:
for cookie in pickle.load(open("NVBCookies.pkl", "rb")):
    driver.add_cookie(cookie)
InvalidCookieDomainException: Message: Cookies may only be set for the current domain (cookiewall.vnumediaonline.nl)
It looks like I don't get the cookies from the cookiewall, correct?
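Side note: Selenium's add_cookie() only accepts cookies for the domain the driver is currently on, which is exactly what the InvalidCookieDomainException says. A minimal workaround sketch, assuming each pickled cookie carries a 'domain' field:

import pickle
from selenium import webdriver

driver = webdriver.Firefox()
for cookie in pickle.load(open("NVBCookies.pkl", "rb")):
    # add_cookie() only works for the current domain, so visit it first
    domain = cookie.get("domain", "").lstrip(".")
    driver.get("https://{}/".format(domain))
    driver.add_cookie(cookie)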
Instead of bypassing the cookiewall, why don't you write code that checks whether it is present and accepts it, then continues with the next operation? Please find the code below for more details.
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome(executable_path="C:\\Users\\USER\\Downloads\\New folder (2)\\chromedriver_win32\\chromedriver.exe")

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO")
        # accept the cookiewall by clicking its save/accept button
        elem = driver.find_element_by_xpath("//div[@class='article__button']//button[@id='form_save']")
        elem.click()

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
Or simply:
driver.find_element_by_xpath('//*[@id="form_save"]').click()
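Since the cookiewall may not appear on every visit, a hedged variant that only clicks the button when it actually shows up (so the script continues either way) could look like this:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

def accept_cookiewall_if_present(driver, timeout=5):
    # wait briefly for the accept button; skip if it never appears
    try:
        button = WebDriverWait(driver, timeout).until(
            lambda d: d.find_element_by_xpath('//*[@id="form_save"]'))
        button.click()
    except TimeoutException:
        pass  # no cookiewall this time; continue with the next operation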
OK, I made Selenium click the accept button. That works for me too. Not sure if I'll run into more cookiewalls later.
Related
I am currently working on a scraper for aniworld.to.
My goal is to enter the anime name and have all of the episodes downloaded.
I have everything working except one thing...
The website has a Watch button. That button redirects you to https://aniworld.to/redirect/SOMETHING, and that site has a captcha, which means the link is not in the HTML...
Is there a way to bypass this or get the link in Python? Or a way to display the captcha so I can solve it?
The captcha only appears very rarely anyway.
The only thing I need from that page is the redirect link. It looks like this:
https://vidoza.net/embed-something.html
My (very much WIP) code is here if it helps: https://github.com/wolfswolke/aniworld_scraper
Mitchdu showed me how to do it.
If anyone else needs help, here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from threading import Thread
import os

def open_captcha_window(full_url):
    working_dir = os.getcwd()
    path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
    options = webdriver.ChromeOptions()
    # open the page as a minimal app-style window (no tabs or address bar)
    options.add_argument("app=" + full_url)
    options.add_argument("window-size=423,705")
    # suppress chromedriver console logging noise
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    # load a local uBlock extension if it is bundled next to the script
    if os.path.exists(path_to_ublock):
        options.add_argument('load-extension=' + path_to_ublock)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(full_url)
    # poll every 0.3 s for up to 100 s until the user solves the captcha
    # and the page redirects away from the original URL
    wait = WebDriverWait(driver, 100, 0.3)
    wait.until(lambda redirect: redirect.current_url != full_url)
    new_page = driver.current_url
    # close the browser in a background thread so the caller is not blocked
    Thread(target=threaded_driver_close, args=(driver,)).start()
    return new_page

def threaded_driver_close(driver):
    driver.close()
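For completeness, calling it might look like this (the /redirect/ path below is a made-up placeholder):

# hypothetical example call; the redirect ID is a placeholder
redirect_url = "https://aniworld.to/redirect/123456"
embed_link = open_captcha_window(redirect_url)
print(embed_link)  # e.g. https://vidoza.net/embed-something.html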
I am trying to scrape LinkedIn; the script was working for 3 months, but yesterday it crashed.
I use Selenium WebDriver with Firefox and a fake user agent.
The URL is https://www.linkedin.com/company/my_company/
def init_driver():
    """Initiates selenium webdriver.

    :return: Firefox browser instance
    """
    try:
        # use random UserAgent to avoid captcha
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate driver (pass the profile so the user-agent override applies)
        options = FirefoxOptions()
        # options.add_argument("--headless")
        return webdriver.Firefox(firefox_profile=fp, firefox_options=options)
    except Exception:
        logging.error('Exception occurred initiating webdriver', exc_info=True)
And then I just open a page with driver.get(url).
At the moment it opens the page but cannot load it; the same thing happens without the fake user agent and when using Chrome.
Has anyone encountered something like this? When I open the link myself, everything is OK. The driver instead ends up redirected to:
https://www.linkedin.com/authwall?trk=gf&trkInfo=AQFvPeNP8NQIxwAAAXLqc-uI5rnQe1ZIysPcZOgjZCzbrBHZj7q6gd68fPG9NzbX00Rlre_yC0tITChjMDEXSNnD8tZRaMXqcRG-z_3QUMlCvQPR4uVGBQYoSOl3ycoO2E6Jl9w=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2my_company%2F
The function opens other URLs without problems.
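For debugging, a quick hedged check for that redirect (based on the /authwall path in the URL above) might be:

driver.get(url)
# LinkedIn sends anonymous or suspected-bot sessions to /authwall
if "/authwall" in driver.current_url:
    print("Hit the LinkedIn authwall; the request was flagged")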
This is how you should modify your code. With the changes below, it executed correctly for me.
from selenium import webdriver
from fake_useragent import UserAgent
import logging

def init_driver():
    """Initiates selenium webdriver.

    :return: Firefox browser instance
    """
    path = r"your firefox driver path"
    try:
        # use random UserAgent to avoid captcha
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate driver; pass the profile so the override takes effect
        options = webdriver.FirefoxOptions()
        # options.add_argument("--headless")
        return webdriver.Firefox(firefox_profile=fp, firefox_options=options, executable_path=path)
    except Exception:
        logging.error('Exception occurred initiating webdriver', exc_info=True)

url = "your url"
driver = init_driver()
driver.get(url)
I'm doing a "stupid" bot with Python and Selenium to automate some actions on web.telegram.org.
I want to stay logged in after the first logon, but when I try to save the cookies with driver.get_cookies() they are empty (I tried to print them and the output was "[]"). When I did the same thing with another website, youtube.com for example, it worked! I also tried different WebDrivers, but I got the same results.
The code is:
from selenium import webdriver
import time
import pickle
driver = webdriver.Firefox()
driver.get('https://web.telegram.org')
time.sleep(4)
phone_number = driver.find_element_by_name("phone_number")
phone_number.send_keys("3478995060")
login_button = driver.find_element_by_class_name("login_head_submit_btn")
login_button.click()
time.sleep(2)
ok_button = driver.find_element_by_xpath("//span[@my-i18n='modal_ok']")
ok_button.click()
time.sleep(30)
all_cookies = driver.get_cookies()
print(all_cookies)
driver.quit()
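Possibly relevant: web.telegram.org seems to keep its session in localStorage rather than in cookies (an assumption worth verifying), which would explain the empty list. A sketch that saves and restores localStorage with the same pickle approach:

import pickle

# dump localStorage instead of cookies (assumes the session lives there)
local_storage = driver.execute_script(
    "var items = {};"
    "for (var i = 0; i < localStorage.length; i++) {"
    "    var k = localStorage.key(i); items[k] = localStorage.getItem(k);"
    "}"
    "return items;")
pickle.dump(local_storage, open("TelegramStorage.pkl", "wb"))

# later, after driver.get('https://web.telegram.org'):
for key, value in pickle.load(open("TelegramStorage.pkl", "rb")).items():
    driver.execute_script(
        "localStorage.setItem(arguments[0], arguments[1]);", key, value)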
I have written a small Python script with Selenium to search Google and open the first link, but whenever I run it, it opens a console plus a new Chrome window and runs the script in that new Chrome window.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def main():
    setup()

# open Chrome and open Google
def setup():
    driver = webdriver.Chrome(r'C:\python_programs(Starting_out_python)\chromedriver.exe')
    driver.get('https://www.google.com')
    assert 'Google' in driver.title
    mySearch(driver)

# search for a keyword
def mySearch(driver):
    search = driver.find_element_by_id("lst-ib")
    search.clear()
    search.send_keys("Beautiful Islam")
    search.send_keys(Keys.RETURN)
    first_link(driver)

# click the first link
def first_link(driver):
    link = driver.find_elements_by_class_name("r")
    link1 = link[0]
    link1.click()

main()
How can I open this in the same browser I am using?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def main():
    setup()

# open Chrome and open Google
def setup():
    driver = webdriver.Chrome()
    driver.get('https://www.google.com')
    assert 'Google' in driver.title
    mySearch(driver)

# search for a keyword
def mySearch(driver):
    search = driver.find_element_by_id("lst-ib")
    search.clear()
    search.send_keys("test")
    search.send_keys(Keys.RETURN)
    first_link(driver)

# click the first link
def first_link(driver):
    link = driver.find_elements_by_xpath("//a[@href]")
    # uncomment to see each href of the found links
    # for i in link:
    #     print(i.get_attribute("href"))
    first_link = link[0]
    url = first_link.get_attribute("href")
    # open the link in a second tab of the same browser window
    driver.execute_script("window.open('about:blank', 'tab2');")
    driver.switch_to.window("tab2")
    driver.get(url)
    # Do something else with this new tab now

main()
A few observations: the first link you get might not be the first link you want. In my case, the first link was the login to a Google account. So you might want to do some more validation before opening it, like checking its href property or its text to see whether it matches something, etc.
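For example, a hedged filter along those lines (the keyword check is just an illustration):

def pick_first_matching_link(links, keyword):
    # skip links that don't look like real search results
    for l in links:
        href = l.get_attribute("href") or ""
        if "accounts.google" in href:
            continue  # skip the Google account login link
        if keyword.lower() in (l.text or "").lower():
            return l
    return None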
Another observation is that there are easier ways of crawling Google search results, such as using Google's API directly or a third-party implementation like https://pypi.python.org/pypi/google or https://pypi.python.org/pypi/google-search
To my knowledge, there's no way to attach Selenium to an already-running browser.
More to the point, why do you want to do that? The only thing I can think of is that you're trying to set something up in the browser manually and then have Selenium act on it from that manually-set-up state. If you want your tests to run as consistently as possible, you shouldn't rely on a human setting up the browser in a particular way; the script should do this itself.
Take for example the following selenium test in python:
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("selenium")
        elem.send_keys(Keys.RETURN)
        self.assertIn("Google", driver.title)

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
Taken from: http://selenium-python.readthedocs.org/en/latest/getting-started.html#id2
The resulting output is something like:
----------------------------------------------------------------------
Ran 1 test in 15.566s
OK
Is there any way to get selenium to output the html after it has executed its browser actions?
Basically, I am using Selenium IDE for Firefox to record actions in the browser. I want to play them back in Python, get the resulting HTML, and take further action based on that HTML (e.g. the first Selenium test might log on to a website and navigate somewhere; based on what is there, I want to run a second test, with the user still logged on). Is this possible using Selenium?
Thanks in Advance!
It sounds as though your tests might end up being dependent on each other, which is a very bad idea.
Nonetheless, the page_source property will return the full HTML of the page the driver is currently looking at:
https://code.google.com/p/selenium/source/browse/py/selenium/webdriver/remote/webdriver.py#429
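A minimal usage sketch (run_second_test is a hypothetical follow-up, not a real API):

driver.get("http://www.python.org")
html = driver.page_source  # full HTML of the currently rendered page

# branch on the page content to chain a follow-up action
if "Log out" in html:
    run_second_test(driver)  # hypothetical next step, user still logged on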