I am scraping an e-commerce website with Selenium, because the pages are rendered by JavaScript.
Here's the workflow:
1. Instantiate a web driver in virtual display mode, while sending a random user agent. Using a random user agent decreases your chances of detection a little, but it does not reduce the chances of being blocked by IP.
2. For each query term, say "pajamas", build the search URL for that website and open the URL.
3. Get the corresponding text elements via XPath, say the top 10 product IDs, their prices, product titles, etc.
4. Store them in a file that I will process further.
I have upwards of 38,000 such URLs for which I need to fetch the elements on page load.
I used multiprocessing, and quickly realized the process was failing because, after a while, my IP was blocked by the website, so the page loads no longer happened.
How can I do IP spoofing in Python, and will it work when Selenium is driving the browser rather than urllib/urlopen?
Aside from the actual fetching via the XPaths, here's the basic code; see init_driver in particular:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import argparse
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import codecs, urllib, os
import multiprocessing as mp
from pyvirtualdisplay import Display
from my_custom_path import scraping_conf_updated as sf
from fake_useragent import UserAgent

def set_cookies(COOKIES, exp, driver):
    for key, val in COOKIES[exp].items():
        driver.add_cookie({'name': key, 'value': val, 'path': '/', 'secure': False, 'expiry': None})
    return driver

def check_cookies(driver, exp):
    print("printing cookie name & value")
    for cookie in driver.get_cookies():
        if cookie['name'] in COOKIES[exp].keys():
            print(cookie['name'], "-->", cookie['value'])

def wait_for(driver):
    if conf_key['WAIT_FOR_ID'] != '':
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, conf_key['WAIT_FOR_ID'])))
    elif conf_key['WAIT_FOR_CLASS'] != '':
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, conf_key['WAIT_FOR_CLASS'])))
    return driver

def init_driver(base_url, url, exp):
    # run the browser inside a virtual display so no window is shown
    display = Display(visible=0, size=(1024, 768))
    display.start()
    # override the user agent with a random one for each driver instance
    profile = webdriver.FirefoxProfile()
    ua = UserAgent(cache=False)
    profile.set_preference("general.useragent.override", ua.random)
    driver = webdriver.Firefox(profile)
    if len(conf_key['COOKIES'][exp]) != 0:
        driver.get(base_url)
        driver.delete_all_cookies()
        driver = set_cookies(COOKIES, exp, driver)
        check_cookies(driver, exp)
    # set the timeout before loading the search URL
    driver.set_page_load_timeout(300)
    driver.get(url)
    if len(conf_key['POP_UP']['XPATH']) > 0:
        driver = identify_and_close_popup(driver)
    driver = wait_for(driver)
    return driver
Use a VPN provider, or an HTTP or SOCKS proxy, to change your apparent originating IP address as seen by the target website.
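For example, with a FirefoxProfile-based setup like the one in the question, routing each driver instance through one of a pool of proxies could look roughly like this. This is a minimal sketch: the PROXIES list and the init_driver_with_proxy helper are placeholders I made up, not part of the original code.

import random
from selenium import webdriver

# hypothetical proxy pool - replace with your own host:port entries
PROXIES = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]

def init_driver_with_proxy(url):
    host, port = random.choice(PROXIES)
    profile = webdriver.FirefoxProfile()
    # 1 = manual proxy configuration
    profile.set_preference("network.proxy.type", 1)
    profile.set_preference("network.proxy.http", host)
    profile.set_preference("network.proxy.http_port", port)
    profile.set_preference("network.proxy.ssl", host)
    profile.set_preference("network.proxy.ssl_port", port)
    # for a SOCKS proxy use network.proxy.socks / network.proxy.socks_port instead
    profile.update_preferences()
    driver = webdriver.Firefox(profile)
    driver.get(url)
    return driver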
Related
I am currently working on a scraper for aniworld.to.
My goal is to enter the anime name and get all of the episodes downloaded.
I have everything working except one thing...
The website has a Watch button. That button redirects you to https://aniworld.to/redirect/SOMETHING, and that site has a captcha, which means the link is not in the HTML...
Is there a way to bypass this / get the link in Python? Or a way to display the captcha so I can solve it?
The captcha only appears once in a blue moon anyway.
The only thing I need from that page is the redirect link. It looks like this:
https://vidoza.net/embed-something.html
My very very wip code is here if it helps: https://github.com/wolfswolke/aniworld_scraper
Mitchdu showed me how to do it.
If anyone else needs help here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from threading import Thread
import os

def open_captcha_window(full_url):
    working_dir = os.getcwd()
    path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
    options = webdriver.ChromeOptions()
    # open the page as a minimal "app" window so only the captcha is shown
    options.add_argument("app=" + full_url)
    options.add_argument("window-size=423,705")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    if os.path.exists(path_to_ublock):
        options.add_argument('load-extension=' + path_to_ublock)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(full_url)
    wait = WebDriverWait(driver, 100, 0.3)
    # block until the user solves the captcha and the page redirects away from full_url
    wait.until(lambda redirect: redirect.current_url != full_url)
    new_page = driver.current_url
    Thread(target=threaded_driver_close, args=(driver,)).start()
    return new_page

def threaded_driver_close(driver):
    driver.close()
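For reference, a minimal usage sketch of the helper above; the redirect URL is a placeholder, and the surrounding download logic from the repository is omitted:

# hypothetical example: resolve one redirect link by letting the user solve the captcha
redirect_url = "https://aniworld.to/redirect/SOMETHING"
embed_url = open_captcha_window(redirect_url)
print("resolved embed link:", embed_url)  # e.g. https://vidoza.net/embed-something.html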
I've created a script in Python, in combination with Selenium and proxies, to log in to Facebook and scrape the name of the user whose post is at the top of my feed. I would like the script to do this every five minutes, indefinitely.
As this continuous logging in may get my account banned, I thought I would implement proxies within the script so the whole thing is done anonymously.
This is what I've written so far:
import random
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_first_user(random_proxy):
    options = webdriver.ChromeOptions()
    prefs = {"profile.default_content_setting_values.notifications": 2}
    options.add_experimental_option("prefs", prefs)
    options.add_argument(f'--proxy-server={random_proxy}')
    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://www.facebook.com/")
        driver.find_element_by_id("email").send_keys("username")
        driver.find_element_by_id("pass").send_keys("password", Keys.RETURN)
        user = wait.until(EC.presence_of_element_located((By.XPATH, "//h4[@id][@class][./span[./a]]/span/a"))).text
        return user

if __name__ == '__main__':
    proxies = [`list of proxies`]
    while True:
        random_proxy = proxies.pop(random.randrange(len(proxies)))
        print(get_first_user(random_proxy))
        time.sleep(60 * 5)  # time.sleep() takes seconds, so this waits 5 minutes
How to stay undetected while scraping data continuously from a site that requires authentication?
I'm not sure why you would want to continuously log in to your Facebook account every 5 minutes to scrape content. Using a random proxy address for each login would also likely raise a red flag with Facebook's security systems.
Instead of logging in to Facebook every 5 minutes, I would recommend staying logged in. Selenium has a refresh method for the webpage it is controlling, which you could use to reload your Facebook feed at a predefined interval, such as every 5 minutes.
The code below uses this refresh method to reload the page and checks for the user post at the top of your feed.
In testing I noted that Facebook uses some randomized tagging, which is likely meant to mitigate scraping. I also noted that Facebook changed the username format on posts linked to groups, so more testing is required if you want the usernames linked to those posts. I highly recommend conducting more tests to determine which user elements aren't being scraped correctly.
from time import sleep
from random import randint
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

chrome_options = Options()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-popup-blocking")

# disable the banner "Chrome is being controlled by automated test software"
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])

# global driver
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
driver.get('https://www.facebook.com')
driver.implicitly_wait(20)
driver.find_element_by_id("email").send_keys("your_username")
driver.find_element_by_id("pass").send_keys("your_password")
driver.implicitly_wait(10)
driver.find_element_by_xpath(("//button[text()='Log In']")).click()

# this function checks for a standard username tag
def user_element_exist():
    try:
        if driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a"):
            return True
    except NoSuchElementException:
        return False

# this function looks for a username linked to Facebook Groups at the top of your feed
def group_element():
    try:
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b"):
            poster_name = driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/span[1]/span/span/a/b").text
            return poster_name
        if driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span"):
            poster_name = driver.find_element_by_xpath("//*[starts-with(@id, 'jsc_c_')]/strong[1]/span/a/span/span").text
            return poster_name
    except NoSuchElementException:
        return "No user information found"

while True:
    element_exists = user_element_exist()
    if not element_exists:
        user_name = group_element()
        print(user_name)
        driver.refresh()
    elif element_exists:
        user_name = driver.find_element_by_xpath("//h4[@id][@class][./span[./a]]/span/a").text
        print(user_name)
        driver.refresh()

    # set the sleep timer to fit your needs
    sleep(300)  # This sleeps for 300 seconds, which is 5 minutes.

    # I would likely use a random sleep function
    # sleep(randint(180, 360))
Instead of logging in every 5 minutes, try to move that part out of the loop so you log in only once:
# uses the imports and options from the question's script
if __name__ == '__main__':
    with webdriver.Chrome(options=options) as driver:
        wait = WebDriverWait(driver, 10)
        driver.get("https://www.facebook.com/")
        driver.find_element_by_id("email").send_keys("username")
        driver.find_element_by_id("pass").send_keys("password", Keys.RETURN)
        while True:
            user = wait.until(EC.presence_of_element_located((By.XPATH, "//h4[@id][@class][./span[./a]]/span/a"))).text
            print(user)
            time.sleep(60 * 5)  # time.sleep() takes seconds, so this waits 5 minutes
Also consider using a random interval instead of a hardcoded sleep:
import random
time.sleep(random.randint(240, 360)) # wait for 4~6 minutes
I'm trying to control Spotify's browser player. All of the controls are put inside iframe sections.
Problem: The iframes are EMPTY in the Selenium WebDriver object. Yet, the iframes are filled with the correct content in the ACTUAL browser.
Code sample:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import time

username = 'myusername'
base_url = 'https://play.spotify.com/user/spotifydiscover/playlist/76HML2OQXigkyKopqYkeng'
browser = None

def login():
    global browser
    password = input("Password: ")
    browser = webdriver.Firefox()
    browser.get(base_url)
    browser.find_element_by_id('has-account').click()
    browser.find_element_by_id('login-usr').clear()
    browser.find_element_by_id('login-usr').send_keys(username)
    browser.find_element_by_id('login-pass').clear()
    browser.find_element_by_id('login-pass').send_keys(password)
    browser.find_element_by_id('login-pass').submit()

def next_track():
    global browser
    wrapper = browser.find_element_by_id("section-collection")
    print(wrapper.get_attribute('innerHTML').encode('utf-8'))
    iframe = wrapper.find_element_by_tag_name('iframe')
    print(iframe.get_attribute('innerHTML').encode('utf-8'))
    sub = browser.switch_to_frame(iframe)
    sub.find_element_by_id('next').click()

def test():
    login()
    time.sleep(14)  # Adjusted until page is 100% fully loaded
    next_track()
    time.sleep(40)
The problem is here:
iframe = wrapper.find_element_by_tag_name('iframe')
There are multiple iframes on the page, and you are interested in the one with the app-player id:
browser.switch_to_frame("app-player")
# now use browser.find_element_* to locate elements inside iframe
Also note that using time.sleep makes your automation code seriously fragile and usually slower than needed - instead use Explicit Waits to wait for the specific conditions to be met on a page.
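Putting both suggestions together, a rough sketch of a next_track helper, assuming the app-player frame id and a "next" control id as in the question, might look like this:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def next_track(browser):
    # switch into the player iframe by its id instead of grabbing the first iframe found
    WebDriverWait(browser, 10).until(
        EC.frame_to_be_available_and_switch_to_it((By.ID, "app-player")))
    # wait for the "next" control to be clickable inside the frame, then click it
    WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.ID, "next"))).click()
    # switch back to the main document afterwards
    browser.switch_to.default_content()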
I'm trying to screen-scrape a web site without launching an actual browser instance from a Python script (using Selenium). I can do this with Chrome or Firefox - I've tried it and it works - but I want to use PhantomJS so it's headless.
The code looks like this:
import sys
import traceback
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)
except Exception as e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)
finally:
    browser.close()
The output is merely:
<html><head></head><body></body></html>
But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that out. No difference.
What am I missing?
UPDATED: I will try to keep the snippet below updated until it works. What's below is what I'm currently trying.
import sys
import traceback
import time
import re
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print("getting web page...")
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print("waiting %s seconds..." % timeout)
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID, 'the_id')))
    print("done waiting. Response:")

    # Rest of code snipped. Fails at the "wait" above.
I was facing the same problem and no amount of code to make the driver wait was helping.
The problem is with SSL errors on the HTTPS websites; ignoring them will do the trick.
Call the PhantomJS driver as:
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This solved the problem for me.
You need to wait for the page to load. Usually, it is done by using an Explicit Wait to wait for a key element to be present or visible on a page. For instance:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# ...
browser.get("https://www.whatever.com")
wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))
html_source = browser.page_source
# ...
Here, we'll wait up to 10 seconds for a div element with class="content" to become visible before getting the page source.
Additionally, you may need to ignore SSL errors:
browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
Though, I'm pretty sure this is related to the redirecting issues in PhantomJS. There is an open ticket in phantomjs bugtracker:
PhantomJS does not follow some redirects
driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])
This worked for me
I have written many scrapers but I am not really sure how to handle infinite scrollers. These days most websites, e.g. Facebook and Pinterest, have infinite scrollers.
You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.
Step 1: Install Selenium using pip
pip install selenium
Step 2: Use the code below to automate infinite scrolling and extract the page source
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import unittest, time, re

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

if __name__ == "__main__":
    unittest.main()
Step 3: Print the data if required.
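As a variation on Step 2, you could keep scrolling until the page height stops changing instead of looping a fixed 100 times. A rough sketch, assuming a driver set up as in the example above:

import time

def scroll_to_end(driver, pause=4, max_rounds=200):
    """Scroll until document.body.scrollHeight stops growing, then return the page source."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the JavaScript time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content was added, assume we reached the end
        last_height = new_height
    return driver.page_source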
Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using this rather than scraping.
But if you must scrape...
Such sites are using JavaScript to request additional content from the site when you reach the bottom of the page. All you need to do is figure out the URL of that additional content and you can retrieve it. Figuring out the required URL can be done by inspecting the script, by using the Firefox Web console, or by using a debug proxy.
For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you wish to scrape. You'll see all the files as they are loaded. Scroll the page while watching the Web Console and you'll see the URLs being used for the additional requests. Then you can request that URL yourself and see what format the data is in (probably JSON) and get it into your Python script.
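For instance, once you have spotted the endpoint in the Web Console, fetching it directly could look roughly like this. The URL, query parameters, and JSON keys below are made-up placeholders for whatever the site actually requests:

import requests

# hypothetical endpoint discovered in the browser's network tools
url = "https://example.com/api/feed"
params = {"offset": 0, "limit": 20}

items = []
for _ in range(5):  # fetch the first five "pages" of the infinite scroll
    response = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
    data = response.json()          # the additional content is usually JSON
    items.extend(data["items"])     # adjust the key to match the real payload
    params["offset"] += params["limit"]

print(len(items), "items collected")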
Finding the URL of the AJAX source will be the best option, but it can be cumbersome for certain sites. Alternatively, you could use a headless browser like QWebKit from PyQt and send keyboard events while reading the data from the DOM tree. QWebKit has a nice and simple API.