Deploy Facebook post scraper on Heroku using Selenium Python - python

Here is my config for using ChromeDriver on Heroku:
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--disable-gpu')
chrome = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=chrome_options)
chrome.get("https://facebook.com/groups/744721719556973/")
My task is to crawl posts from a public group, so I use Selenium and BeautifulSoup.
It worked very well locally; I crawled the data successfully. But when I deploy it to Heroku, it returns an empty array.
My local config:
options = Options()
options.add_argument("--disable-notifications")
options.add_argument("--headless")  # options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")  # options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
# chrome = webdriver.Chrome('./chromedriver', chrome_options=options)
# chrome.get("https://facebook.com/groups/744721719556973/")
# scroll down to load more posts
chrome.execute_script("window.scrollTo(0,document.body.scrollHeight)")
tree = html.fromstring(chrome.page_source)
soup = BeautifulSoup(chrome.page_source, 'html.parser')
Here is my approach:
# this still works locally, but on Heroku it doesn't find any div with these classes
match = soup.find_all('div', class_='du4w35lb k4urcfbm l9j0dhe7 sjgh65i0')
options.add_argument("--disable-notifications")
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
I tried many ways but it doesn't work. If you have overcome this problem, please give me a suggestion. Thank you so much.

Here is an in-depth article from Medium that may help you overcome your issue: https://medium.com/#mikelcbrowne/running-chromedriver-with-python-selenium-on-heroku-acc1566d161c

Related

Why does my Selenium "webdriver.Remote" not work? It's a real headache

Why does my "webdriver.Remote" not work?
from selenium import webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    options=options
)
driver.get("http://www.google.com")
driver.quit()
I tried running "webdriver.Chrome" directly on my local machine and it was successful:
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)
driver.get("http://www.google.com")
I found that it kept printing "Starting ChromeDriver 100.0.4896.60" while running, and then I found another, outdated chromedriver binary in the directory next to selenium-server.jar. How stupid of me.
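A stale driver like this is easy to spot from the session itself: `driver.capabilities` reports which browser and driver binary the server actually launched. A small helper (the capability key names are the standard ones Chrome reports):

```python
def report_versions(driver):
    """Return the browser and driver versions the server actually launched.

    `driver` is any live Selenium driver (local or Remote). The
    "browserVersion" and "chromedriverVersion" keys are the standard
    capability names reported by Chrome sessions.
    """
    caps = driver.capabilities
    return {
        "browser": caps.get("browserVersion"),
        "driver": caps.get("chrome", {}).get("chromedriverVersion"),
    }
```

If the two major versions disagree, the server picked up the wrong chromedriver binary.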

How to avoid bot detection in Selenium?

I am trying to scrape this website with the following code:
from selenium import webdriver
options = webdriver.ChromeOptions()
driver_path = '/Users/francopiccolo/Utils/chromedriver97'
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
url = 'https://www.zonaprop.com.ar/inmuebles-venta-rosario.html'
driver.get(url)
The problem is it somehow detects a bot and throws an error.
Ideas?
options = Options()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
Try these options to avoid bot detection.

Accessing Corsair.com with Python Selenium?

Is it possible to access https://www.corsair.com/ with Selenium in Python without getting blocked by Corsair?
When I try to load the page in Selenium, it keeps giving me an error message.
To bypass it, I tried changing the user-agent to a random one, which didn't fix the issue.
This is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent
options = webdriver.ChromeOptions()
options.add_argument("window-size=1400,600")
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=options)
print('Loading Corsair Website ...')
driver.get("https://www.corsair.com/")
There are multiple ways to evade detection of Selenium automation and one of them is to use the following argument:
--disable-blink-features=AutomationControlled.
Code Block:
options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.corsair.com/")
driver.save_screenshot("image.png")

Print dom after Selenium driver.get

I'm used to requests, where I can just print the response after I make a GET request. I find myself unsure whether parts of the page are in the response or not, particularly when the website uses React or jQuery.
Is there a way I can do the same with Selenium?
Like this?
DRIVER_PATH = '/usr/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://example.com')
# Print the DOM
driver.quit()
You are looking for driver.page_source.
from selenium import webdriver
DRIVER_PATH = '/usr/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
driver.get('https://google.com')
# Print the DOM
print(driver.page_source)
driver.quit()

How to initiate Chrome Canary in headless mode through Selenium and Python

from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = r'C:\Users\mpmccurdy\Desktop\Google Chrome Canary.lnk'
options.add_argument('headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.python.org")
Even if you are using Chrome Canary, ChromeDriver still expects regular Chrome to be installed in the default location for the underlying OS.
You can also override the default Chrome Binary Location following the documentation Using a Chrome executable in a non-standard location as follows:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
options.add_argument('--headless')
options.add_argument('window-size=1200x600')
driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe', chrome_options=options)
driver.get("https://www.python.org")
chrome_options = Options()
chrome_options.add_argument("--headless")
path = os.getcwd() +'\\chromedriver.exe' #needs to be in your current working directory
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=path)
