I am working on web scraping flight prices using Selenium WebDriver. I want my code to be able to search flight prices for multiple trips; as of now it works for one destination only.
Most answers I find online involve a for loop over the specific URLs of multiple destinations, which is not applicable in my case because the URLs depend on the destinations I choose.
Does anyone know how I can search for these prices concurrently, without waiting for each individual search to complete? Or perhaps an even faster way to do this?
Thanks!
I believe you could use a process pool (concurrent.futures.ProcessPoolExecutor) to fetch the flights concurrently. Here is an example I have worked with using Selenium.
Script to execute your Selenium function:
# MultiProcess
from subprocess import Popen
from concurrent.futures import ProcessPoolExecutor

urls = [url1, url2, url3]  # the search URLs you built for each trip
N = 4  # number of worker processes you want to use

futures = []
# Launch one bot per URL
with ProcessPoolExecutor(N) as executor:
    for url in urls:
        command = ["python", "mySeleniumScript.py", url]
        future = executor.submit(Popen, command)
        futures.append(future)
In this case, the Python script containing your Selenium scraper should read the URL from its command-line arguments, like this:
mySeleniumScript.py
from selenium import webdriver
import sys
url = sys.argv[1]
driver = webdriver.Firefox()
driver.get(url)
# Your scraper logic here
Hopefully this points you in the right direction; let me know how it goes!
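If spawning a separate Python process per URL feels heavy, the same fan-out pattern can also call a scrape function directly from a pool; since each worker mostly waits on the browser, threads are usually enough. Here is a minimal sketch of the pattern — `fetch_prices` and `scrape_all` are illustrative names, and `fetch_prices` is a stand-in for the real Selenium work:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_prices(url):
    # Stand-in for the real Selenium scrape: in your script this would
    # start a driver, load `url`, extract the prices, and quit the driver.
    return {"url": url, "n_chars": len(url)}

def scrape_all(urls, workers=4):
    # Fan the searches out across the pool and collect results as they finish.
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fetch_prices, u) for u in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results

print(scrape_all(["https://a.example/flight", "https://b.example/flight"]))
```

Results arrive in completion order, not submission order, so keep the URL in each result as shown if you need to match them back up.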
I want to get the latest result from the Aviator game each time it crashes. I'm trying to do it with Python and Selenium, but I can't get it to work: the website takes some time to load, which complicates the process since the classes are not present from the beginning.
This is the website I'm trying to scrape: https://estrelabet.com/ptb/bet/main
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://estrelabet.com/ptb/bet/main'

options = Options()
options.headless = True
navegador = webdriver.Chrome(options=options)
navegador.get(url)
# navegador.find_element(...)  # stuck here: find_element needs a locator for the results block
navegador.quit()
This is what I've done so far. I want to get all the elements in the results (payout) block, and then read each result individually.
I tried to extract the data using Selenium, but it was impossible since the IDs and elements are dynamic. I was able to extract the data using an OCR library called Tesseract instead. I'm sharing the code I used for this purpose; I hope it helps you:
AviatorScraping github
Let's use the URL https://www.google.cl/#q=stackoverflow as an example. Using Chrome Developer Tools on the first link given by the search, we see this HTML code:
Now, if I run this code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = urlopen("https://www.google.cl/#q=stackoverflow")
soup = BeautifulSoup(url, "html.parser")
print(soup.prettify())
I won't find the same elements. In fact, I won't find any link from the results returned by the Google search. The same goes if I use the requests module. Why does this happen? Can I do something to get the same results as if I were requesting from a web browser?
Since the HTML is generated dynamically, likely by a modern single-page JavaScript framework such as Angular or React (or even plain JavaScript), you will need to actually drive a browser to the site using Selenium or PhantomJS before parsing the DOM.
Here is some skeleton code.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://google.com")
html = driver.execute_script("return document.documentElement.innerHTML")
soup = BeautifulSoup(html, "html.parser")
Here is the selenium documentation for more info on running selenium, configurations, etc.:
http://selenium-python.readthedocs.io/
edit:
you will likely need to add a wait before grabbing the HTML, since it may take a second or so for certain elements of the page to load. See below for a reference to the explicit wait documentation for Python Selenium:
http://selenium-python.readthedocs.io/waits.html
Another source of complication is that certain parts of the page might be hidden until AFTER user interaction. In this case you will need to code your selenium script to interact with the page in certain ways before grabbing the html.
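A separate detail worth noting about this particular example URL (my own observation, not part of the answer above): everything after # is a URL fragment, which the browser keeps client-side and never sends to the server, so urlopen on https://www.google.cl/#q=stackoverflow requests the plain Google homepage, not the search results. A quick check with the standard library shows this:

```python
from urllib.parse import urlparse

parsed = urlparse("https://www.google.cl/#q=stackoverflow")
print(parsed.path)      # the only path sent in the request: /
print(parsed.fragment)  # q=stackoverflow -- stays in the browser, never sent over HTTP
```

This is exactly the kind of URL where the page's JavaScript reads the fragment and fetches the results itself, which is why driving a real browser works.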
I'm trying to automate our system with Python 2.7, Selenium WebDriver, and Sikuli. I have a problem on login: every time I open our system, the first page is an empty page that jumps to another page automatically; the new page is the main login page, so Python keeps trying to find the element on the first page. The first page sometimes shows:
your session has timeout
I set a really large number for session timeout, but it doesn't work.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('http://172.16.1.186:8080/C******E/servlet/LOGON')
# time.sleep(15)
bankid = driver.find_element_by_id("idBANK")
bankid.send_keys("01")
empid = driver.find_element_by_id("idEMPLOYEE")
empid.send_keys("200010")
pwdid = driver.find_element_by_id("idPASSWORD")
pwdid.send_keys("C******e1")
elem = driver.find_element_by_id("maint")
elem.send_keys(Keys.RETURN)
First of all, I can't see any Sikuli usage in your example. If you were using Sikuli, it wouldn't matter how the other page was launched, as you'd be interacting with whatever is visible on your screen at the time.
In Selenium, if you have multiple windows you have to switch your driver to the correct one. A quick way to get a list of the available windows is something like this:
for handle in driver.window_handles:
    driver.switch_to_window(handle)
    print "Switched to handle:", handle
    element = driver.find_element_by_tag_name("title")
    print element.get_attribute("value")
I am pretty new to using Python with Selenium web testing.
I am creating a handful of test cases for my website, and I would like to see how long specific pages take to load. Is there a way to print the page load time after or during the test?
Here is a basic example of what one of my test cases looks like:
import time
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("some URL")
driver.implicitly_wait(10)
element = driver.find_element_by_name("username")
element.send_keys("User")
element = driver.find_element_by_name("password")
element.send_keys("Pass")
element.submit()
time.sleep(2)
driver.close()
In this example I would like to see how long it took for the page to load after submitting my log in information.
I found a way around this by running my tests as Python unit tests. I now record my steps using the Selenium IDE and export them into a Python file, which I then modify as needed. After the test runs, it shows the elapsed time by default.
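If you want to time one specific step rather than the whole test, a small stopwatch helper is enough. This is a generic sketch (the `timed` helper is my own name, not a Selenium API), shown the way you might wrap `driver.get(...)` or `element.submit()`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Context manager that prints how long the enclosed block took.
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print("%s took %.3f s" % (label, elapsed))

# Usage: wrap the step you care about, e.g.
#   with timed("login page load"):
#       element.submit()
with timed("simulated page load"):
    time.sleep(0.2)  # stands in for driver.get(...) / element.submit()
```

Note this measures wall-clock time around the call, so with Selenium it includes driver overhead, not just the browser's own page-load timing.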
I am creating a script that will have to parse a JS-generated site (using Python) at least 1000 times a day. Parsing it the usual way (opening it in a browser and then reading the code) takes me about 30 seconds, which is not satisfactory. Thinking about how to make the process faster, I have an idea: is it possible to run a browser that doesn't create a window (ignores the visual part) and is simply a process, in other words an "invisible" browser? What I want to know is whether this would be efficient, and whether there are other ways to make it faster. Any help is appreciated.
EDIT : Here is the code for my parser
from selenium import webdriver
import re
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()