Why can't my script load the requested URL when scraping LinkedIn? - python

I am trying to scrape LinkedIn. The script worked for 3 months, but yesterday it started failing.
I use Selenium WebDriver with Firefox and a fake user agent.
The URL is https://www.linkedin.com/company/my_company/
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from fake_useragent import UserAgent
import logging

def init_driver():
    """Initiates selenium webdriver.

    :return: Firefox browser instance
    """
    try:
        # use a random user agent to avoid captchas
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate the driver with the customized profile
        options = FirefoxOptions()
        # options.add_argument("--headless")
        return webdriver.Firefox(firefox_profile=fp, firefox_options=options)
    except Exception:
        logging.error('Exception occurred initiating webdriver', exc_info=True)
Then I just open a page with driver.get(url).
At the moment the browser opens but cannot load the page.
The same thing happens without the fake user agent and when using Chrome.
Has anyone encountered something like this? When I open the link myself, everything is ok. Instead of the company page, the driver lands on this URL:
https://www.linkedin.com/authwall?trk=gf&trkInfo=AQFvPeNP8NQIxwAAAXLqc-uI5rnQe1ZIysPcZOgjZCzbrBHZj7q6gd68fPG9NzbX00Rlre_yC0tITChjMDEXSNnD8tZRaMXqcRG-z_3QUMlCvQPR4uVGBQYoSOl3ycoO2E6Jl9w=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2my_company%2F
Other URLs open without problems using the same function.

This is how you can modify your code.
After these changes, your code executed correctly for me.
from selenium import webdriver
from fake_useragent import UserAgent
import logging

def init_driver():
    """Initiates selenium webdriver.

    :return: Firefox browser instance
    """
    path = r"your firefox driver path"
    try:
        # use a random user agent to avoid captchas
        fp = webdriver.FirefoxProfile()
        fp.set_preference("general.useragent.override", UserAgent().random)
        fp.update_preferences()
        # initiate the driver with the profile and an explicit geckodriver path
        options = webdriver.FirefoxOptions()
        # options.add_argument("--headless")
        return webdriver.Firefox(firefox_profile=fp, firefox_options=options,
                                 executable_path=path)
    except Exception:
        logging.error('Exception occurred initiating webdriver', exc_info=True)
url = "your url"
driver = init_driver()
driver.get(url)
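
The redirect URL in the question points at LinkedIn's authwall, which LinkedIn serves to sessions it does not recognize as logged in; as the question itself notes, a random user agent does not get around it. As a minimal, hedged sketch (the is_authwalled helper is hypothetical, not part of either snippet above), you can at least detect the redirect after the page loads instead of silently scraping the wrong page:

def is_authwalled(driver):
    # LinkedIn redirects unauthenticated sessions to /authwall
    return "/authwall" in driver.current_url

driver = init_driver()
driver.get("https://www.linkedin.com/company/my_company/")
if is_authwalled(driver):
    # the session is not logged in; authenticate first (or reuse a
    # logged-in browser profile) before retrying the company page
    print("Redirected to authwall:", driver.current_url)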

Related

Interacting with browser that has been garbage collected in try block

I have a Selenium browser where I've added options to use my Google Chrome profile when the browser is opened.
I know there will be an error when creating the Selenium browser if Chrome is already open elsewhere with the same profile.
But despite there being an error, the browser still opens.
What I want is to still be able to interact with this browser, since it opens with the profile I wanted (and for various reasons I don't want to close my other Chrome instances).
I had to throw in a try/except so the program doesn't stop, but I think the browser gets garbage collected in the try block.
So is there a way to stop it getting garbage collected, or can I find all browsers opened by WebDriver and then set one of them as the new browser?
Here's the code:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
try:
    chrome_options.add_argument("user-data-dir=C:\\Users\\coderoftheday\\AppData\\Local\\Google\\Chrome\\User Data\\")
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
except:
    pass

browser.get('https://www.google.co.uk/')
Error:
NameError: name 'browser' is not defined
Isn't this just a Python variable scope issue? webdriver.Chrome() raises inside the try block, so the assignment to browser never happens and the name is undefined afterwards.
See https://stackoverflow.com/a/25666911/1387701
Simplest solution:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
browser = None
try:
    chrome_options.add_argument("user-data-dir=C:\\Users\\coderoftheday\\AppData\\Local\\Google\\Chrome\\User Data\\")
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
except:
    # code here to kill the existing chrome instance
    # then retry opening chrome:
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)

browser.get('https://www.google.co.uk/')

Selenium is not loading TikTok pages

I'm implementing a TikTok crawler using Selenium and Scrapy:
start_urls = ['https://www.tiktok.com/trending']

....

def parse(self, response):
    options = webdriver.ChromeOptions()
    from fake_useragent import UserAgent
    ua = UserAgent()
    user_agent = ua.random
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('window-size=800x841')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(response.url)
The crawler opens Chrome, but Chrome does not load the videos; the page just sits there loading.
The same problem also happens using Firefox: the page never loads.
The same problem occurs with a simple script using Selenium on its own:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()

driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
Did you try to navigate further within the Selenium browser window? If a 404 error appears on subsequent pages, I have a solution that worked for me:
I simply changed my user agent to "Naverbot", which is allowed by TikTok's robots.txt.
After changing that, all pages and videos loaded properly.
Other user agents listed under the "Allow" section of the robots.txt should work too, if you want to add a rotation.
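
As a hedged sketch of that suggestion (Chrome shown here; "Naverbot" is the user agent named above, and any other agent allowed by the robots.txt could be substituted):

from selenium import webdriver

options = webdriver.ChromeOptions()
# override the user agent with one allowed by TikTok's robots.txt
options.add_argument('user-agent=Naverbot')
driver = webdriver.Chrome(options=options)
driver.get("https://www.tiktok.com/trending")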
You can use Windows IE instead of Chrome or Firefox. Videos will load in IE, but IE's layout of the feed is somewhat different from Chrome's and Firefox's.
As for reasons why your page is not loading: a few advanced web apps check your browser history, profile data and cache to verify the authenticity of the user.
One other thing you can do is run your default browser profile within Selenium; that would be helpful, as in the sketch below.
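
A minimal, hedged sketch of loading a default Chrome profile into Selenium (the user-data-dir path is a hypothetical Windows example and varies by OS and user name; Chrome must not already be running with the same profile):

from selenium import webdriver

options = webdriver.ChromeOptions()
# point at your Chrome user-data directory (hypothetical path)
options.add_argument(r"user-data-dir=C:\Users\you\AppData\Local\Google\Chrome\User Data")
# pick which profile folder inside it to load
options.add_argument("profile-directory=Default")
driver = webdriver.Chrome(options=options)
driver.get("https://www.tiktok.com/trending")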

bypass cookiewall selenium

I would like to scrape job listings from a Dutch job listings website. However, when I try to open the page with Selenium, I run into a cookiewall (new GDPR rules). How do I bypass the cookiewall?
from selenium import webdriver

# launch url
url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"

# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)
Edit: I tried something.
from selenium import webdriver
import pickle

url = "https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO"
driver = webdriver.Firefox()
driver.set_page_load_timeout(20)
driver.get(url)  # was driver.get(start_url), but start_url is never defined
pickle.dump(driver.get_cookies(), open("NVBCookies.pkl", "wb"))
After that, loading the cookies back did not work:

for cookie in pickle.load(open("NVBCookies.pkl", "rb")):
    driver.add_cookie(cookie)

InvalidCookieDomainException: Message: Cookies may only be set for the current domain (cookiewall.vnumediaonline.nl)

It looks like I don't get the cookies from the cookiewall domain, correct?
Instead of bypassing it, why don't you write code that checks whether the cookie dialog is present, accepts it if so, and otherwise continues with the next operation? Please see the code below for details.
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome(executable_path="C:\\Users\\USER\\Downloads\\New folder (2)\\chromedriver_win32\\chromedriver.exe")

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO")
        elem = driver.find_element_by_xpath("//div[@class='article__button']//button[@id='form_save']")
        elem.click()

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
driver.find_element_by_xpath('//*[@id="form_save"]').click()
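
To literally "check if it's present, then accept it, otherwise continue", a hedged sketch using an explicit wait (the form_save locator comes from the answer above; the 5-second timeout is an assumption):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.get("https://www.nationalevacaturebank.nl/vacature/zoeken?query=&location=&distance=city&limit=100&sort=relevance&filters%5BcareerLevel%5D%5B%5D=Starter&filters%5BeducationLevel%5D%5B%5D=MBO")
try:
    # wait briefly for the cookie button and click it if it appears
    button = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.ID, "form_save")))
    button.click()
except TimeoutException:
    # no cookiewall this time; carry on with scraping
    pass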
OK, I made Selenium click the accept button. That's also fine by me. Not sure if I'll run into cookiewalls later.

How to open a fully functioning Chrome browser using Selenium WebDriver with Python?

When I try to open a web page, it opens in a new Chrome window stripped of all the extensions and modules. I'm not able to emulate certain behavior of the website in the Selenium-driven Chrome window, but I'm able to do the same thing in a normal Chrome window without any issues.
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedriver.exe')
driver.get("remote_worksplace_link")
id_box = driver.find_element_by_id('Enter user name')
id_box.send_keys('123456')
pass_box = driver.find_element_by_id('passwd')
pass_box.send_keys('123abc')
login_button = driver.find_element_by_id('Log_On')
login_button.click()
driver.implicitly_wait(2)
# find_element_by_class_name cannot take a compound class name,
# so locate the two classes with a CSS selector instead
launch_button = driver.find_element_by_css_selector('.storeapp-icon.ui-sortable-handle')
launch_button.click()
driver.implicitly_wait(5)
driver.close()
Every extension has its own .crx file; you just need to add the path to each one:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_extension('path_to_extension')
driver = webdriver.Chrome(executable_path=executable_path, chrome_options=chrome_options)
driver.get("url")
driver.quit()

Cannot load Chrome default profile

I have the following Python script:
from selenium import webdriver
import time

def main():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("user-data-dir=/Users/octavian/Library/Application Support/Google/Chrome/Default")
    driver = webdriver.Chrome(chrome_options=chrome_options)
    driver.get("https://www.facebook.com/")
    time.sleep(1000)

main()
I am trying to load my Facebook profile. However, the page that opens doesn't have my profile (I'm not logged in), which means the browser state was not loaded.
However, my Chrome profile is stored in this directory:
/Users/octavian/Library/Application Support/Google/Chrome/Default
Why is the profile not seen by Selenium?
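
A hedged note, since no answer is recorded here: Chrome's user-data-dir argument is normally expected to point at the User Data directory itself (.../Google/Chrome), not at the Default profile folder inside it; the profile is then chosen with a separate profile-directory argument, and Chrome must not already be running with that profile. A sketch under those assumptions:

from selenium import webdriver
import time

def main():
    chrome_options = webdriver.ChromeOptions()
    # point at the parent directory, not .../Chrome/Default
    chrome_options.add_argument("user-data-dir=/Users/octavian/Library/Application Support/Google/Chrome")
    # then select the profile folder by name
    chrome_options.add_argument("profile-directory=Default")
    driver = webdriver.Chrome(chrome_options=chrome_options)
    driver.get("https://www.facebook.com/")
    time.sleep(1000)

main()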
