Deploy Facebook post scraper on Heroku using Selenium Python - python

Here is my config for using ChromeDriver on Heroku:
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--disable-gpu')
chrome = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=chrome_options)
chrome.get("https://facebook.com/groups/744721719556973/")
My task is to crawl posts from a public group, so I use Selenium and BeautifulSoup.
It worked very well locally; I crawled the data successfully. But when I deploy it to Heroku, it returns an empty array.
My local config:
options = Options()
options.add_argument("--disable-notifications")
options.add_argument("--headless")  # options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")  # options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
# chrome = webdriver.Chrome('./chromedriver', chrome_options=options)
# chrome.get("https://facebook.com/groups/744721719556973/")
# scroll down to load more posts
chrome.execute_script("window.scrollTo(0,document.body.scrollHeight)")
tree = html.fromstring(chrome.page_source)
soup = BeautifulSoup(chrome.page_source, 'html.parser')
Here is my approach:
# this still works locally, but on Heroku it doesn't find any div with these classes
match = soup.find_all('div', class_='du4w35lb k4urcfbm l9j0dhe7 sjgh65i0')
options.add_argument("--disable-notifications")
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
I tried many ways but it doesn't work. If you have overcome this problem, please give me a suggestion. Thank you so much.

Here is an in-depth article from Medium that may help you overcome your issue: https://medium.com/#mikelcbrowne/running-chromedriver-with-python-selenium-on-heroku-acc1566d161c

Related

Why does my Selenium "webdriver.Remote" not work? It's a real headache

Why does my "webdriver.Remote" not work?
from selenium import webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    options=options
)
driver.get("http://www.google.com")
driver.quit()
I tried running "webdriver.Chrome" directly on my local machine and it was successful:
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)
driver.get("http://www.google.com")
I found that it kept printing "Starting ChromeDriver 100.0.4896.60" while running, and then I found another, outdated chromedriver binary in the directory next to selenium-server.jar. How stupid of me.
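A stale driver like this is easy to spot from the session itself: `driver.capabilities` reports which browser and driver binary the server actually launched. A small helper (the capability key names are the standard ones Chrome reports):

```python
def report_versions(driver):
    """Return the browser and driver versions the server actually launched.

    `driver` is any live Selenium driver (local or Remote). The
    "browserVersion" and "chromedriverVersion" keys are the standard
    capability names reported by Chrome sessions.
    """
    caps = driver.capabilities
    return {
        "browser": caps.get("browserVersion"),
        "driver": caps.get("chrome", {}).get("chromedriverVersion"),
    }
```

If the two major versions disagree, the server picked up the wrong chromedriver binary.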

How to avoid bot detection in Selenium?

I am trying to scrape this website with the following code:
from selenium import webdriver
options = webdriver.ChromeOptions()
driver_path = '/Users/francopiccolo/Utils/chromedriver97'
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
url = 'https://www.zonaprop.com.ar/inmuebles-venta-rosario.html'
driver.get(url)
The problem is it somehow detects a bot and throws an error.
Ideas?
options = Options()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
Try these options to avoid bot detection.

Accessing Corsair.com with Python Selenium?

Is it possible to access https://www.corsair.com/ with Selenium in Python without getting blocked by Corsair?
When I try to load the page in Selenium, it keeps giving me an error message.
To bypass it, I tried changing the user-agent to a random one, which didn't fix the issue.
This is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent
options = webdriver.ChromeOptions()
options.add_argument("window-size=1400,600")
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=options)
print('Loading Corsair Website ...')
driver.get("https://www.corsair.com/")
There are multiple ways to evade detection of Selenium automation and one of them is to use the following argument:
--disable-blink-features=AutomationControlled.
Code Block:
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.corsair.com/")
driver.save_screenshot("image.png")

Print dom after Selenium driver.get

I'm used to requests, where I can just print the response after I make a GET request. I find myself unsure whether parts of the page are in the response or not, particularly when the website uses React or jQuery.
Is there a way I can do the same with Selenium?
Like this?
DRIVER_PATH = '/usr/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://example.com')
# Print the DOM
driver.quit()
You are looking for driver.page_source.
from selenium import webdriver
DRIVER_PATH = '/usr/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://google.com')
# Print the DOM
print(driver.page_source)
driver.quit()

How to initiate Chrome Canary in headless mode through Selenium and Python

from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = r'C:\Users\mpmccurdy\Desktop\Google Chrome Canary.lnk'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.python.org")
Even if you are using Chrome Canary, ChromeDriver still expects regular Chrome to be installed in the default location for the underlying OS.
You can also override the default Chrome Binary Location following the documentation Using a Chrome executable in a non-standard location as follows:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
options.add_argument('--headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe', chrome_options=options)
driver.get("https://www.python.org")
chrome_options = Options()
chrome_options.add_argument("--headless")
path = os.getcwd() +'\\chromedriver.exe' #needs to be in your current working directory
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=path)
