How to use pool-specific functions while multiprocessing with python?

I'm using Selenium and multiprocessing to open four different websites in parallel, and I want to run functions specific to the website each driver loaded.
This is similar to my current code:
from multiprocessing import Pool
from selenium import webdriver

def gh(hosts):
    driver = webdriver.Chrome(executable_path='./chromedriver')
    driver.get(hosts)
    html_source = driver.page_source
    if 'ryan' in html_source:
        print 'ryan'
        doSomethingForRyan()
    elif 'austin' in html_source:
        print 'austin'
        doSomethingForAustin()
    elif 'travis' in html_source:
        print 'travis'
        doSomethingForTravis()
    elif 'levi' in html_source:
        print 'levi'
        doSomethingForLevi()
    else:
        print '--NONE--'

if __name__ == '__main__':
    p = Pool(4)
    hosts = ["http://ryan.com", "https://www.austin.com", "http://levi.com", "http://travis.com"]
    p.map(gh, hosts)
The result I'm getting is something like:
austin
austin
ryan
austin

EDIT - SOLVED
Instead of checking driver.page_source, checking driver.current_url makes sure the correct website-specific function runs in each process.
if 'ryan' in driver.current_url:
    print 'ryan'
    doStuff()
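The same idea can be written as a small dispatch table, so each process picks its handler from the URL it actually loaded. This is only a sketch; it reuses the imports from the question, and the doSomethingFor... handlers are the placeholders from the original code:

# Sketch: map URL keywords to the (placeholder) handlers from the question.
HANDLERS = {
    'ryan': doSomethingForRyan,
    'austin': doSomethingForAustin,
    'travis': doSomethingForTravis,
    'levi': doSomethingForLevi,
}

def gh(host):
    driver = webdriver.Chrome(executable_path='./chromedriver')
    driver.get(host)
    for keyword, handler in HANDLERS.items():
        if keyword in driver.current_url:
            print(keyword)
            handler()
            break
    else:
        # No keyword matched the loaded URL.
        print('--NONE--')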

Related

Web scraping Google Maps with Selenium uses too much data

I am scraping travel times from Google Maps. The code below scrapes travel times between 1 million random points in Tehran, and it works perfectly fine. I also use multiprocessing to get travel times simultaneously. The results are fully replicable; feel free to run the code in a terminal (but not in an interactive session like Spyder, as the multiprocessing won't work). On Google Maps, the value I am scraping is the displayed travel time (22 min in the example I was looking at).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from multiprocessing import Process, Pipe, Pool, Value
import time
from multiprocessing.pool import ThreadPool
import threading
import gc

threadLocal = threading.local()

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has been "quitted".')

    @classmethod
    def create_driver(cls):
        the_driver = getattr(threadLocal, 'the_driver', None)
        if the_driver is None:
            print('Creating new driver.')
            the_driver = cls()
            threadLocal.the_driver = the_driver
        driver = the_driver.driver
        the_driver = None
        return driver

success = Value('i', 0)
error = Value('i', 0)

def f(x):
    global success
    global error
    with success.get_lock():
        success.value += 1
        print("Number of errors", success.value)
    with error.get_lock():
        error.value += 1
        print("counter.value:", error.value)

def scraper(url):
    """
    This now scrapes a single URL.
    """
    global success
    global error
    try:
        driver = Driver.create_driver()
        driver.get(url)
        time.sleep(1)
        trip_times = driver.find_element(By.XPATH, "//div[contains(@aria-labelledby,'section-directions-trip-title')]//span[@jstcache='198']")
        print("got data from: ", url)
        print(trip_times.text)
        with success.get_lock():
            success.value += 1
            print("Number of successful scrapes: ", success.value)
    except Exception as e:
        # print(f"Error: {e}")
        with error.get_lock():
            error.value += 1
            print("Number of errors", error.value)

import random

min_x = 35.617487
max_x = 35.783375
min_y = 51.132557
max_y = 51.492329

urls = []
for i in range(1000000):
    x = random.uniform(min_x, max_x)
    y = random.uniform(min_y, max_y)
    url = f'https://www.google.com/maps/dir/{x},{y}/35.8069533,51.4261312/@35.700769,51.5571612,21z'
    urls.append(url)

number_of_processes = min(2, len(urls))

start_time = time.time()
with ThreadPool(processes=number_of_processes) as pool:
    # result_array = pool.map(scraper, urls)
    result_array = pool.map(scraper, urls)
    # Must ensure drivers are quitted before threads are destroyed:
    del threadLocal
    # This should ensure that the __del__ method is run on class Driver:
    gc.collect()
    pool.close()
    pool.join()
print(result_array)
print("total time: ", round((time.time() - start_time) / 60, 1), "number of urls: ", len(urls))
But after having it run for only 24 hours, it has already used around 80 GB of data! Is there a way to make this more efficient in terms of data usage?
I suspect this excessive data usage is because Selenium has to load each URL completely every time before it can access the HTML and get the target node. Can I change anything in my code to prevent that and still get the travel time?
Please note that using the Google Maps API is not an option, because the limit is too small for my application and the service is not provided in my country.
You can use a page load strategy.
Selenium WebDriver has three page load strategies:
normal - waits for all resources to download.
eager - DOM access is ready, but other resources such as images may still be loading.
none - does not block WebDriver at all.
options.page_load_strategy = "none"  # ["normal", "eager", "none"]
It might help you (obviously it doesn't perform miracles, but it's better than nothing).
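Applied to the Driver class in the question, it could look roughly like this (a sketch, assuming Selenium 4, where page_load_strategy is a settable attribute on the options object):

from selenium import webdriver

# Sketch: set the strategy on the options before creating the driver.
# "eager" returns once the DOM is ready; "none" does not wait at all.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.page_load_strategy = "eager"  # or "none"
driver = webdriver.Chrome(options=options)

With "eager" the trip-time element should usually be present without waiting for images and other heavy resources, which is where most of the data usage is likely going.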

JustEat delivery checker script

I am starting to learn Python and just wondering if someone could help me with my first script.
As you can see from the code below, the script opens a Firefox session and grabs details from a Just Eat page; if they are delivering, it reports up, otherwise down.
If they are down, it should open WhatsApp Web and send a message to a set group.
What I am trying to do now, and where I'm getting stuck: if the site reports down, I want to run the check again, and only report the site as down if it is still down after, say, another 4 checks. This would stop me from getting false results at times.
Also, I know this code could be made faster and stronger, but it is my first time coding in Python :-) positive and constructive comments are welcome.
Thanks guys <3
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import time
import pywhatkit
from datetime import datetime
import keyboard

print("Delivery checker v1.0")

def executeSomething():
    ser = Service(r"C:\xampp\htdocs\drivers\geckodriver.exe")
    my_opt = Options()
    my_opt.headless = True
    driver = webdriver.Firefox(options=my_opt, service=ser)
    driver.get("https://www.just-eat.co.uk/restaurants-mcdonaldsstevenstonhawkhillretailpark-stevenston/menu")
    driver.find_element("xpath", "/html/body/div[2]/div[6]/div/div/div/div/div[2]/button[1]").click()
    status = driver.find_element("xpath", "/html/body/div[2]/div[2]/div[3]/div/main/header/section/div[1]/p/span")
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
    if status.text == "Delivering now":
        print("[" + dt_string + "] - up")
    else:
        print("[" + dt_string + "] - down [" + str(counter) + "]")
        pywhatkit.sendwhatmsg_to_group_instantly("HpYQbjTU5fz728BGjiS45K", "MCDONALDS ISNT SHOWING UP ON JUST EAT!!!!")
        keyboard.press_and_release('ctrl+w')
    driver.close()
    time.sleep(1)

while True:
    executeSomething()
I would do something like this
down_count = 0
down_max = 4

while True:
    if is_web_down():
        down_count += 1
    else:
        down_count = 0
    if down_count >= down_max:
        send_message()
    time.sleep(1)
You just need to refactor executeSomething() into two functions: is_web_down() and send_message().
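A sketch of that refactor, reusing the setup, XPaths and WhatsApp call from the question (untested; the cookie-banner click and selectors are taken as-is from the original code):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import pywhatkit
import keyboard

def is_web_down():
    # Same setup and XPaths as the original script, just returning True/False.
    ser = Service(r"C:\xampp\htdocs\drivers\geckodriver.exe")
    my_opt = Options()
    my_opt.headless = True
    driver = webdriver.Firefox(options=my_opt, service=ser)
    try:
        driver.get("https://www.just-eat.co.uk/restaurants-mcdonaldsstevenstonhawkhillretailpark-stevenston/menu")
        driver.find_element("xpath", "/html/body/div[2]/div[6]/div/div/div/div/div[2]/button[1]").click()
        status = driver.find_element("xpath", "/html/body/div[2]/div[2]/div[3]/div/main/header/section/div[1]/p/span")
        return status.text != "Delivering now"
    finally:
        driver.quit()

def send_message():
    # Same WhatsApp call as the original script.
    pywhatkit.sendwhatmsg_to_group_instantly("HpYQbjTU5fz728BGjiS45K",
                                             "MCDONALDS ISNT SHOWING UP ON JUST EAT!!!!")
    keyboard.press_and_release('ctrl+w')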

Can't Stop ThreadPoolExecutor

I'm scraping hundreds of URLs, each with a leaderboard of data I want, and the only difference between each URL string is a 'platform', 'region', and lastly, the page number. There are only a few platforms and regions, but the page numbers change each day and I don't know how many there are. So that's the first function; I'm just creating lists of URLs to be requested in parallel.
If I use page=1, the result will contain table_rows > 0 in the last function. But around page=500, the requested URL still pings back, just very slowly, and then it shows an error message, no leaderboard found, and the last function shows table_rows == 0, etc. The problem is that I need to get through to the very last page, and I want to do this quickly, hence the ThreadPoolExecutor, but I can't cancel all the threads or processes or whatever once PAGE_LIMIT is tripped. I threw in the executor.shutdown(cancel_futures=True) just to show what I'm looking for. If nobody can help me, I'll miserably remove the parallelization and scrape slowly, sadly, one URL at a time...
Thanks
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup
import pandas
import requests
import time

PLATFORM = ['xbl', 'psn', 'atvi', 'battlenet']
REGION = ['us', 'ca']
PAGE_LIMIT = True

def leaderboardLister():
    global REGION
    global PLATFORM
    list_url = []
    for region in REGION:
        for platform in PLATFORM:
            for i in range(1, 750):
                list_url.append('https://cod.tracker.gg/warzone/leaderboards/battle-royale/' + platform + '/KdRatio?country=' + region + '&page=' + str(i))
    leaderboardExecutor(list_url, 30)

def leaderboardExecutor(urls, threads):
    global PAGE_LIMIT
    global INTERNET
    if len(urls) > 0:
        with ThreadPoolExecutor(max_workers=threads) as executor:
            while True:
                if PAGE_LIMIT == False:
                    executor.shutdown(cancel_futures=True)
                while INTERNET == False:
                    try:
                        print('bad internet')
                        requests.get("http://google.com")
                        INTERNET = True
                    except:
                        time.sleep(3)
                        print('waited')
                executor.map(scrapeLeaderboardPage, urls)

def scrapeLeaderboardPage(url):
    global PAGE_LIMIT
    checkInternet()
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, features='lxml')
        table_rows = soup.find_all('tr')
        if len(table_rows) == 0:
            PAGE_LIMIT = False
            print(url)
        else:
            pass
            print('success')
    except:
        INTERNET = False

leaderboardLister()
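One possible direction, shown only as a sketch and not as the code above: submit each URL as its own future, use a shared threading.Event so workers stop doing real work once an empty page is seen, and call executor.shutdown(cancel_futures=True) (Python 3.9+) to drop whatever has not started yet.

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

import requests
from bs4 import BeautifulSoup

stop_event = threading.Event()

def scrape_page(url):
    # Return the table rows for one page, or None once the page limit is reached.
    if stop_event.is_set():
        return None
    page = requests.get(url)
    soup = BeautifulSoup(page.content, features='lxml')
    table_rows = soup.find_all('tr')
    if not table_rows:
        stop_event.set()  # signal the other workers that the last page was passed
        return None
    return table_rows

def scrape_all(urls, threads=30):
    results = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(scrape_page, url) for url in urls]
        for future in as_completed(futures):
            rows = future.result()
            if rows is None:
                # Python 3.9+: cancel everything that has not started yet.
                executor.shutdown(wait=False, cancel_futures=True)
                break
            results.append(rows)
    return results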

How to create HTML reporting in Python unit test?

I'm automating a mobile app with Appium and Python. I need to get an HTML report of the test results.
What I need is to save the test results in HTML files, for a human-readable presentation of the results. I have tried a few approaches, but nothing has worked for me.
Anyone knows how to do it? Thanks in advance.
import os, sys
import glob
import unittest
from appium import webdriver
from time import sleep

PLATFORM_VERSION = '5.1.1'

class EntranceTests(unittest.TestCase):

    def setUp(self):
        print 'commandline args', sys.argv[1]
        desired_caps = {}
        desired_caps['platformName'] = 'Android'
        desired_caps['platformVersion'] = '5.1.1'
        desired_caps['deviceName'] = 'CooTel S32'
        desired_caps['udid'] = sys.argv[1]
        desired_caps['appPackage'] = 'com.android.systemui'
        desired_caps['appActivity'] = ' '
        url = "http://localhost:{}/wd/hub".format(sys.argv[2])
        self.driver = webdriver.Remote(url, desired_caps)

    def data_connection(self):
        self.driver.orientation = "PORTRAIT"
        self.driver.swipe(340, 1, 340, 800, 2000)
        notification = self.driver.find_element_by_id('com.android.systemui:id/header')
        notification.click()
        try:
            wifi = self.driver.find_element_by_xpath('//*[contains(@class,"android.widget.TextView") and contains(@text, "WLAN")]')
            wifi.is_displayed()
            print 'Wifi is switched off'
            mobiledata = self.driver.find_element_by_xpath('//android.widget.TextView[contains(@text, "Mobile data")]')
            mobiledata.click()
            print 'SUCCESS! Switch on Mobile data'
            sleep(5)
        except:
            print 'Wifi is switched on'
            wifi_off = self.driver.find_element_by_xpath('//*[contains(@class,"android.widget.ImageView") and contains(@index, "0")]')
            wifi_off.click()
            print 'SUCCESS! Switch off Wifi'
            mobiledata = self.driver.find_element_by_xpath('//android.widget.TextView[contains(@text, "Mobile data")]')
            mobiledata.click()
            print 'SUCCESS! Switch on Mobile data'
            sleep(5)

    def testcase_dataAndWifi(self):
        self.data_connection()

    def tearDown(self):
        self.driver.quit()

if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(EntranceTests)
    result = str(unittest.TextTestRunner(verbosity=2).run(suite))
You can use the nose-html-reporting module (pip install nose-html-reporting) to generate HTML reports while using the Python unittest framework.
Please refer to the link below for more information:
https://pypi.org/project/nose-html-reporting/
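A minimal sketch of running the suite above through nose with the plugin enabled; the --with-html and --html-report flag names are an assumption taken from the plugin's PyPI page, so verify them against your installed version (nosetests --help):

import nose

# Hypothetical invocation; flag names are assumptions from the
# nose-html-reporting docs, check them against your installed version.
nose.run(argv=[
    'nosetests',
    '--with-html',
    '--html-report=report.html',
    'test_entrance.py',
])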

Scraping with Beautifulsoup-Python

I want to scrape the name of the hotel on TripAdvisor from each review page of the hotel.
I wrote some code in Python which is very simple, and I don't think it is wrong.
But every time it stops at a different point (a different page: for example, the first time it stopped at page 150, the second time at page 330).
I am 100% sure that my code is correct. Is there any possibility that TripAdvisor blocks me every time?
I updated the code and I use Selenium too, but the problem still remains.
The updated code is the following:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import os
import urllib.request
import time
import re

file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")
file2.write(b"hotel,Address,HelpCount,HotelCount,Reviewer" + b"\n")

Checker = "REVIEWS"

# example option: add 'incognito' command line arg to options
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# create new instance of chrome in incognito mode
browser = webdriver.Chrome(executable_path='/Users/thimios/AppData/Local/Google/chromedriver.exe', chrome_options=option)
#print(browser)

# go to website of interest
for i in range(10, 50, 10):
    Websites = ["https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-or" + str(i) + "-The_Thief-Oslo_Eastern_Norway.html#REVIEWS"]
    print(Websites)
    for theurl in Websites:
        thepage = browser.get(theurl)
        thepage1 = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage1, "html.parser")

        # wait up to 10 seconds for page to load
        timeout = 5
        try:
            WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="HEADING"]')))
            #print(WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="HEADING"]'))))
        except TimeoutException:
            print("Timed out waiting for page to load")
            browser.quit()

        # Extract the helpful votes, hotel reviews
        helpcountarray = ""
        hotelreviewsarray = ""
        for profile in soup.findAll(attrs={"class": "memberBadging g10n"}):
            image = profile.text.replace("\n", "|||||").strip()
            #print(image)
            if image.find("helpful vote") > 0:
                counter = re.findall('\d+', image.split("helpful vote", 1)[0].strip()[-4:])
                if len(helpcountarray) == 0:
                    helpcountarray = [counter]
                else:
                    helpcountarray.append(counter)
            elif image.find("helpful vote") < 0:
                if len(helpcountarray) == 0:
                    helpcountarray = ["0"]
                else:
                    helpcountarray.append("0")
            print(helpcountarray)
            #print(len(helpcountarray))
            if image.find("hotel reviews") > 0:
                counter = re.findall('\d+', image.split("hotel reviews", 1)[0].strip()[-4:])
                if len(hotelreviewsarray) == 0:
                    hotelreviewsarray = counter
                else:
                    hotelreviewsarray.append(counter)
            elif image.find("hotel reviews") < 0:
                if len(hotelreviewsarray) == 0:
                    hotelreviewsarray = ['0']
                else:
                    hotelreviewsarray.append("0")
            print(hotelreviewsarray)
            #print(len(hotelreviewsarray))

        hotel_element = browser.find_elements_by_xpath('//*[@id="HEADING"]')
        Address_element = browser.find_elements_by_xpath('//*[@id="HEADING_GROUP"]/div/div[3]/address/div/div[1]')
        for i in range(0, 10):
            print(i)
            for x in hotel_element:
                hotel = x.text
                print(hotel)
                #print(type(hotel))
            for y in Address_element:
                Address = y.text.replace(',', '').replace('\n', '').strip()
                print(Address)
                #print(type(Address))
            HelpCount = helpcountarray[i]
            HelpCount = " ".join(str(w) for w in HelpCount)
            print(HelpCount)
            #print(type(HelpCount))
            HotelCount = hotelreviewsarray[i]
            HotelCount = " ".join(str(w) for w in HotelCount)
            print(HotelCount)
            #print(type(HotelCount))
            Reviewer = soup.findAll(attrs={"class": "username mo"})[i].text.replace(',', ' ').replace('”', '').replace('“', '').replace('"', '').strip()
            print(Reviewer)
            Record2 = hotel + "," + Address + "," + HelpCount + "," + HotelCount + "," + Reviewer
            if Checker == "REVIEWS":
                file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")
file2.close()
I read somewhere that I should add a header. Something like
headers={'user-agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
in order for the web site to allow me to scrape it. Is that true?
Thanks for your help
Yes, there is such a possibility.
Websites use measures to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages.
The default User-Agent typically identifies an automated process implemented with Python software, so you will want to change it to a browser-like User-Agent.
Even so, I do not believe you were blocked by TripAdvisor.
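If you do want to send a browser-like User-Agent with the urllib.request call in your code, it could look like this (a sketch; fetch_soup is a hypothetical helper and the User-Agent string is only an example):

import urllib.request
from bs4 import BeautifulSoup

# Sketch: send a browser-like User-Agent instead of urllib's default
# "Python-urllib/3.x" (the UA string below is just an example).
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}

def fetch_soup(theurl):
    # theurl is the same loop variable used in the question's code.
    req = urllib.request.Request(theurl, headers=headers)
    return BeautifulSoup(urllib.request.urlopen(req), "html.parser")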
Try to slow down the downloading by
import time
...
time.sleep(1)
No, try slowing it down for real, using backoff, so the target website doesn't think you're a bot...
import time
import requests

for term in ["web scraping", "web crawling", "scrape this site"]:
    t0 = time.time()
    r = requests.get("http://example.com/search", params=dict(
        query=term
    ))
    response_delay = time.time() - t0
    time.sleep(10 * response_delay)  # wait 10x longer than it took them to respond
source:
https://blog.hartleybrody.com/web-scraping-cheat-sheet/#delays-and-backing-off
