I am starting to learn Python and just wondering if someone could help me with my first script.
As you can see from the code below, the script starts a Firefox service, grabs details from a Just Eat page, and reports "up" or "down" depending on whether the restaurant is delivering.
If it is down, the script opens a WhatsApp Web page and sends a message to a set group.
What I am trying to do now, and where I'm getting stuck: if the site reports down, I want to run the check again, and only report the site as down if it fails, say, another 4 times. This should stop me from getting false positives.
Also, I know this code could be made faster and more robust, but it is my first time coding in Python :-) positive and constructive comments are welcomed.
Thanks guys <3
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import time
import pywhatkit
from datetime import datetime
import keyboard

print("Delivery checker v1.0")

def executeSomething():
    # start a headless Firefox session
    ser = Service(r"C:\xampp\htdocs\drivers\geckodriver.exe")
    my_opt = Options()
    my_opt.headless = True
    driver = webdriver.Firefox(options=my_opt, service=ser)
    driver.get("https://www.just-eat.co.uk/restaurants-mcdonaldsstevenstonhawkhillretailpark-stevenston/menu")
    # dismiss the cookie banner, then read the delivery status banner
    driver.find_element("xpath", "/html/body/div[2]/div[6]/div/div/div/div/div[2]/button[1]").click()
    status = driver.find_element("xpath", "/html/body/div[2]/div[2]/div[3]/div/main/header/section/div[1]/p/span")
    now = datetime.now()
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")
    if status.text == "Delivering now":
        print("[" + dt_string + "] - up")
    else:
        print("[" + dt_string + "] - down")
        pywhatkit.sendwhatmsg_to_group_instantly("HpYQbjTU5fz728BGjiS45K", "MCDONALDS ISNT SHOWING UP ON JUST EAT!!!!")
        keyboard.press_and_release('ctrl+w')  # close the WhatsApp Web tab that pywhatkit opens
    driver.close()
    time.sleep(1)

while True:
    executeSomething()
I would do something like this:

down_count = 0
down_max = 4

while True:
    if is_web_down():
        down_count += 1
    else:
        down_count = 0
    if down_count >= down_max:
        send_message()
        down_count = 0  # reset so we don't re-send the message every second
    time.sleep(1)

You just need to refactor executeSomething() into two functions: is_web_down() and send_message().
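For example, a minimal sketch of that refactor, reusing the imports, driver setup, selectors, and pywhatkit call from your own script:

def is_web_down():
    # returns True when the page does NOT show "Delivering now"
    ser = Service(r"C:\xampp\htdocs\drivers\geckodriver.exe")
    my_opt = Options()
    my_opt.headless = True
    driver = webdriver.Firefox(options=my_opt, service=ser)
    try:
        driver.get("https://www.just-eat.co.uk/restaurants-mcdonaldsstevenstonhawkhillretailpark-stevenston/menu")
        driver.find_element("xpath", "/html/body/div[2]/div[6]/div/div/div/div/div[2]/button[1]").click()
        status = driver.find_element("xpath", "/html/body/div[2]/div[2]/div[3]/div/main/header/section/div[1]/p/span")
        return status.text != "Delivering now"
    finally:
        driver.quit()  # always release the browser, even if a selector fails

def send_message():
    pywhatkit.sendwhatmsg_to_group_instantly("HpYQbjTU5fz728BGjiS45K", "MCDONALDS ISNT SHOWING UP ON JUST EAT!!!!")
    keyboard.press_and_release('ctrl+w')  # close the WhatsApp Web tab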
Related
I have the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
driver = webdriver.Chrome(options=options)
driver.get("https://www.theguardian.com/uk")
time.sleep(2)
driver.refresh()
I'd like to be able to do the following with the above code:
1. Go to the URL
2. Wait for the page to load
3. Refresh the page
4. Repeat steps 2 & 3 for 'n' number of times (let us say n = 100)
You can simply introduce a while loop with a counter i, and set a condition so that when i == 100 (let's say) you break out of the loop.
Code:
i = 0
while True:
    time.sleep(2)
    driver.refresh()
    i = i + 1  # increment before the check; placed after a continue it would never run
    if i == 100:
        break
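Alternatively, since the number of repeats is known up front, a plain for loop over range does the same job without manual bookkeeping (a sketch with n = 100, using the driver from the code above):

n = 100
for _ in range(n):
    time.sleep(2)  # wait for the page to load
    driver.refresh()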
I have been working on a script that automatically joins Google Meets. It logs in to Gmail and then goes to the meeting automatically when it is time for the meeting. But now I am having problems with leaving the meeting after a certain time. I want to just close the browser tab, and thus the meeting, then continue checking for the next meeting. I think the last while loop, which is intended to close the Chrome tab after the meeting is done, does not run at all. I have tried replacing it with print statements to see if it is executed, but it is not, and I do not know why.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import datetime
import time
import signal

now = datetime.datetime.now()
current_time = now.strftime("%H:%M / %A")
justtime = now.strftime("%H:%M")
print(current_time)

def Glogin(mail_address, password):
    # os.system("obs --startvirtualcam &")
    # login page
    driver.get(
        'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/&ec=GAZAAQ')
    # input Gmail
    driver.find_element_by_id("identifierId").send_keys(mail_address)
    driver.find_element_by_id("identifierNext").click()
    driver.implicitly_wait(10)
    # input password
    driver.find_element_by_xpath(
        '//*[@id="password"]/div[1]/div/div[1]/input').send_keys(password)
    driver.implicitly_wait(10)
    driver.find_element_by_id("passwordNext").click()
    driver.implicitly_wait(10)
    # go to the Google home page, then to the meeting link
    driver.get('https://google.com/')
    driver.implicitly_wait(100)
    driver.get(sub)
    # turn off the microphone
    time.sleep(1)
    driver.find_elements_by_class_name("JRY2Pb")[0].click()
    # switch camera
    time.sleep(2)
    for x in driver.find_elements_by_class_name("JRtysb"):
        x.click()
    time.sleep(2)
    for a in driver.find_elements_by_class_name("FwR7Pc"):
        a.click()
    time.sleep(2)
    for b in driver.find_elements_by_class_name("XhPA0b"):
        b.click()
    time.sleep(2)
    driver.find_element_by_tag_name('body').send_keys(Keys.TAB + Keys.TAB + Keys.ARROW_DOWN + Keys.ENTER)
    time.sleep(1)
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    time.sleep(2)
    # join the meet
    time.sleep(1)
    driver.implicitly_wait(2000)
    driver.find_element_by_css_selector(
        'div.uArJ5e.UQuaGc.Y5sE8d.uyXBBb.xKiqt').click()

# assign email id and password
mail_address = 'email'
password = 'password'

# create Chrome instance
opt = Options()
opt.add_argument('--disable-blink-features=AutomationControlled')
opt.add_argument('--start-maximized')
opt.add_experimental_option("prefs", {
    "profile.default_content_setting_values.media_stream_mic": 1,
    "profile.default_content_setting_values.media_stream_camera": 1,
    "profile.default_content_setting_values.geolocation": 0,
    "profile.default_content_setting_values.notifications": 1
})

while True:
    if current_time == "05:00 / Wednesday":
        sub = "link"
        driver = webdriver.Chrome(options=opt, executable_path=r'/usr/bin/chromedriver')
        Glogin(mail_address, password)
        break

while True:
    if current_time == "05:01 / Wednesday":
        driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
        break
If the last while loop isn't running, it's because the previous while True loop never broke.
I suspect it has something to do with your condition current_time == "05:00 / Wednesday": current_time is set once at the top of the script and never updated, so it is never equal to "05:00 / Wednesday" and the first loop spins forever.
Based on the limited context, I can only suggest two things.
First off, don't use a while True loop with only one if statement inside; use the if-condition as your while loop condition (see the sketch below).
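For instance, a sketch of that first suggestion, where the loop body just refreshes the clock until the condition is met:

while current_time != "05:00 / Wednesday":
    now = datetime.datetime.now()
    current_time = now.strftime("%H:%M / %A")
# it is now 05:00 on Wednesday, so join the meeting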
Secondly, you need to refresh current_time inside your loop. Modify your loop to look something like this:
run_loop = True  # will be False when we want our loop to quit
while run_loop:
    if current_time == "05:00 / Wednesday":
        sub = "link"
        driver = webdriver.Chrome(options=opt, executable_path=r'/usr/bin/chromedriver')
        Glogin(mail_address, password)
        run_loop = False  # break us out of the loop
    else:  # keep re-reading the time to see if it's 05:00 yet
        now = datetime.datetime.now()
        current_time = now.strftime("%H:%M / %A")
The above code will continuously check whether the time meets your condition, and then exit appropriately.
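The second loop, the one that leaves the meeting, presumably needs the same treatment, since it compares against the same stale current_time. A sketch reusing the pattern above:

run_loop = True
while run_loop:
    if current_time == "05:01 / Wednesday":
        # Ctrl+W closes the current tab, which leaves the meeting
        driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')
        run_loop = False
    else:  # keep refreshing the clock until it is 05:01
        now = datetime.datetime.now()
        current_time = now.strftime("%H:%M / %A")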
I am trying to create a Python/Selenium project which checks whether the people in my WhatsApp chat list are online or offline. Basically it brute-forces one by one, checking whether each person is online, and then saves the data in an Excel file. It also gives a green background to the people who are online.
Here is my code:
from selenium import webdriver  # this import was missing; webdriver.Chrome below needs it
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from openpyxl import Workbook
from openpyxl.styles import PatternFill
import time

workbook = Workbook()
sheet = workbook.active

browser = webdriver.Chrome(executable_path=r"F:\software\chromedriver_win32\chromedriver.exe")
browser.get('https://web.whatsapp.com/')

print("Loading..\n")
for x in range(5, 0, -1):
    print(x)
    time.sleep(1)

# the function below checks whether the 'online' element exists or not
# I got the class name by inspecting the WhatsApp Web page
def check_exists_by_xpath():
    try:
        browser.find_element_by_xpath('//span[@class="O90ur _3FXB1"]')
    except NoSuchElementException:
        return False
    return True

count = 1
# the XPath gets the names of the people on my chat list
for iterator in browser.find_elements_by_xpath('//div[@class="_2wP_Y"]'):
    iterator.click()
    cellA = "A" + str(count)
    cellB = "B" + str(count)
    time.sleep(2)
    name = browser.find_element_by_xpath('//div[@class="_3XrHh"]/span').text
    if check_exists_by_xpath():
        sheet[cellA] = name
        sheet[cellB] = " isOnline\n"
        sheet[cellA].fill = PatternFill(start_color="a4d968", end_color="a4d968", fill_type="solid")
        sheet[cellB].fill = PatternFill(start_color="a4d968", end_color="a4d968", fill_type="solid")
    else:
        sheet[cellA] = name
        sheet[cellB] = " isOffline\n"
    count = count + 1

workbook.save(filename="WhatsApp_Data.xlsx")
print("Complete..!")
browser.close()
But I can't understand why the code stops after collecting data on 18 people. Also, can anyone suggest a better technique to achieve this, other than brute-forcing?
Actually the code just clicks on the names of the people in the WhatsApp Web list, and if the element which displays the online message (beneath the name) exists, it returns True, or else False.
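As an aside on that existence check (a style suggestion only, not a fix for the stop at 18): find_elements returns an empty list instead of raising, so the try/except can be avoided entirely:

def check_exists_by_xpath():
    # an empty result list means the 'online' element is not present
    return len(browser.find_elements_by_xpath('//span[@class="O90ur _3FXB1"]')) > 0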
This question already has answers here: Website blocking Selenium: is there a way to bypass? (2 answers)
Closed 3 years ago.
I read a lot of posts on the topic, and also tried some of this article's advice, but I am still blocked:
https://www.scraperapi.com/blog/5-tips-for-web-scraping
1. IP rotation: done. I'm using a VPN and often changing IP (but not DURING the script, obviously).
2. Set a real User-Agent: implemented fake-useragent with no luck.
3. Set other request headers: tried with Selenium Wire, but how do I use it at the same time as 2.? (See the sketch after this list.)
4. Set random intervals between your requests: done, but at the present time I cannot even access the starting home page!
5. Set a referer: same as 3.
6. Use a headless browser: no clue.
7. Avoid honeypot traps: same as 4.
10: irrelevant
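For 3., a hedged sketch of combining fake-useragent with Selenium Wire (an assumption that both packages are installed; Selenium Wire's request interceptor is its documented way to rewrite outgoing headers):

from fake_useragent import UserAgent
from seleniumwire import webdriver  # selenium-wire wraps the standard webdriver

ua = UserAgent()
driver = webdriver.Chrome(executable_path="chromedriver")

def interceptor(request):
    # replace the default User-Agent on every outgoing request
    del request.headers['User-Agent']
    request.headers['User-Agent'] = ua.random

driver.request_interceptor = interceptor
driver.get("https://www.winamax.fr/paris-sportifs/")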
The website I want to scrape: https://www.winamax.fr/paris-sportifs/
Without Selenium: it goes smoothly to a page with some games and their odds, and I can navigate from there.
With Selenium: the page shows a "Winamax est actuellement en maintenance" ("Winamax is currently under maintenance") message, with no games and no odds.
Try to execute this piece of code and you might get blocked quite quickly:
from selenium import webdriver
import time
from time import sleep
import json

driver = webdriver.Chrome(executable_path="chromedriver")
driver.get("https://www.winamax.fr/paris-sportifs/")  # I'm even blocked here now!
toto = driver.page_source.splitlines()
titi = {}
matchez = []
matchez_detail = []
resultat_1 = {}
resultat_2 = {}
taratata = 1
comptine = 1

for tut in toto:
    if tut[0:53] == '<script type="text/javascript">var PRELOADED_STATE = ':
        titi = json.loads(tut[53:tut.find(";var BETTING_CONFIGURATION = ")])

for p_id in titi.items():
    if p_id[0] == "sports":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_1[tyty[0]] = tyty[1]["categories"]

for p_id in titi.items():
    if p_id[0] == "categories":
        for fufu in p_id:
            if isinstance(fufu, dict):
                for tyty in fufu.items():
                    resultat_2[tyty[0]] = tyty[1]["tournaments"]

for p_id in resultat_1.items():
    for tgtg in p_id[1]:
        for p_id2 in resultat_2.items():
            if str(tgtg) == p_id2[0]:
                for p_id3 in p_id2[1]:
                    matchez.append("https://www.winamax.fr/paris-sportifs/sports/" + str(p_id[0]) + "/" + str(tgtg) + "/" + str(p_id3))

for alisson in matchez:
    print("compet " + str(taratata) + "/" + str(len(matchez)) + " : " + alisson)
    taratata = taratata + 1
    driver.get(alisson)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']/div/div[1]/span/div/div[2]/div/section/div/div/div[1]/div/div/div/div/a")
    for elm in elements:
        matchez_detail.append(elm.get_attribute("href"))

for mat in matchez_detail:
    print("match " + str(comptine) + "/" + str(len(matchez_detail)) + " : " + mat)
    comptine = comptine + 1
    driver.get(mat)
    sleep(1)
    elements = driver.find_elements_by_xpath("//*[@id='app-inner']//button/div/span")
    for elm in elements:
        elm.click()
        sleep(1)  # and after, my specific code to scrape what I want
I recommend using requests. I don't see a reason to use Selenium, since you said requests works, and requests can work with pretty much any site as long as you are using appropriate headers. You can see the headers needed by looking at the developer console in Chrome or Firefox and examining the request headers.
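A minimal sketch of that approach (the header values here are placeholders; copy the real ones from the Network tab of your own browser's developer tools):

import requests

headers = {
    # illustrative values only; lift the real ones from your browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    'Accept-Language': 'fr-FR,fr;q=0.9,en;q=0.8',
    'Referer': 'https://www.winamax.fr/',
}

response = requests.get('https://www.winamax.fr/paris-sportifs/', headers=headers)
response.raise_for_status()  # fail loudly if we're blocked
html = response.text         # then parse PRELOADED_STATE out of this as before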
I want to scrape the name of the hotel on TripAdvisor from each review page of the hotel.
I wrote some code in Python which is very simple, and I don't think it is wrong.
But every time it stops at a different point (page); for example, the first time it stopped at page 150, the second time at page 330.
I am 100% sure that my code is correct. Is there any possibility that TripAdvisor blocks me every time?
I updated the code to use Selenium too, but the problem still remains.
The updated code is the following:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import os
import urllib.request
import time
import re

file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")
file2.write(b"hotel,Address,HelpCount,HotelCount,Reviewer" + b"\n")

Checker = "REVIEWS"

# example option: add 'incognito' command line arg to options
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# create new instance of Chrome in incognito mode
browser = webdriver.Chrome(executable_path='/Users/thimios/AppData/Local/Google/chromedriver.exe', chrome_options=option)

# go to the website of interest, one review page at a time
for i in range(10, 50, 10):
    Websites = ["https://www.tripadvisor.ca/Hotel_Review-g190479-d3587956-Reviews-or" + str(i) + "-The_Thief-Oslo_Eastern_Norway.html#REVIEWS"]
    print(Websites)
    for theurl in Websites:
        thepage = browser.get(theurl)
        thepage1 = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage1, "html.parser")

        # wait up to `timeout` seconds for the page to load
        timeout = 5
        try:
            WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="HEADING"]')))
        except TimeoutException:
            print("Timed out waiting for page to load")
            browser.quit()

        # extract the helpful votes and hotel review counts
        helpcountarray = ""
        hotelreviewsarray = ""
        for profile in soup.findAll(attrs={"class": "memberBadging g10n"}):
            image = profile.text.replace("\n", "|||||").strip()
            if image.find("helpful vote") > 0:
                counter = re.findall('\d+', image.split("helpful vote", 1)[0].strip()[-4:])
                if len(helpcountarray) == 0:
                    helpcountarray = [counter]
                else:
                    helpcountarray.append(counter)
            elif image.find("helpful vote") < 0:
                if len(helpcountarray) == 0:
                    helpcountarray = ["0"]
                else:
                    helpcountarray.append("0")
            print(helpcountarray)
            if image.find("hotel reviews") > 0:
                counter = re.findall('\d+', image.split("hotel reviews", 1)[0].strip()[-4:])
                if len(hotelreviewsarray) == 0:
                    hotelreviewsarray = counter
                else:
                    hotelreviewsarray.append(counter)
            elif image.find("hotel reviews") < 0:
                if len(hotelreviewsarray) == 0:
                    hotelreviewsarray = ['0']
                else:
                    hotelreviewsarray.append("0")
            print(hotelreviewsarray)

        hotel_element = browser.find_elements_by_xpath('//*[@id="HEADING"]')
        Address_element = browser.find_elements_by_xpath('//*[@id="HEADING_GROUP"]/div/div[3]/address/div/div[1]')

        for i in range(0, 10):
            print(i)
            for x in hotel_element:
                hotel = x.text
                print(hotel)
            for y in Address_element:
                Address = y.text.replace(',', '').replace('\n', '').strip()
                print(Address)
            HelpCount = helpcountarray[i]
            HelpCount = " ".join(str(w) for w in HelpCount)
            print(HelpCount)
            HotelCount = hotelreviewsarray[i]
            HotelCount = " ".join(str(w) for w in HotelCount)
            print(HotelCount)
            Reviewer = soup.findAll(attrs={"class": "username mo"})[i].text.replace(',', ' ').replace('”', '').replace('“', '').replace('"', '').strip()
            print(Reviewer)
            Record2 = hotel + "," + Address + "," + HelpCount + "," + HotelCount + "," + Reviewer
            if Checker == "REVIEWS":
                file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")

file2.close()
I read somewhere that I should add a header, something like
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
in order for the website to allow me to scrape it. Is that true?
Thanks for your help
Yes, there is such a possibility.
Websites use various techniques to prevent web scraping, such as detecting and disallowing bots from crawling (viewing) their pages.
The default User-Agent typically identifies the request as coming from automated Python software, so you will want to change it to a browser-like User-Agent.
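For instance, a sketch reusing your incognito ChromeOptions and the User-Agent string from your question (--user-agent is a standard Chrome command-line flag):

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
# make requests look like they come from a regular desktop browser
option.add_argument("--user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
browser = webdriver.Chrome(executable_path='/Users/thimios/AppData/Local/Google/chromedriver.exe', chrome_options=option)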
Even so, I do not believe you were blocked by TripAdvisor.
Try to slow down the downloading:
import time
...
time.sleep(1)
No, really slow it down, using backoff, so the target website doesn't think you're a bot:

import time
import requests  # this example uses requests rather than Selenium

for term in ["web scraping", "web crawling", "scrape this site"]:
    t0 = time.time()
    r = requests.get("http://example.com/search", params=dict(query=term))
    response_delay = time.time() - t0
    time.sleep(10 * response_delay)  # wait 10x longer than it took them to respond
Source: https://blog.hartleybrody.com/web-scraping-cheat-sheet/#delays-and-backing-off