Python Selenium webpage fill: downloading data from links

I've written this code to perform an iteration of downloads from a webpage that has multiple download links. Once a download link is clicked, the webpage produces a web form which has to be filled in and submitted for the download to start. When I run the code I get a warning on the 'try'/'except' block (Error: Too broad exception clause) and, towards the end, a warning associated with 'submit' (Error: method submit may be static); the run then ends in 'SyntaxError: invalid syntax'. Any suggestions/help will be much appreciated. Thank you.
import os
from selenium import webdriver
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", os.getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-msdos-program")
driver = webdriver.Firefox(firefox_profile=fp)
driver.get('http://def.com/catalog/attribute')
# This is to find the download links in the webpage one by one
i = 0
while i < 1:
    try:
        driver.find_element_by_xpath('//*[@title="xml (Open in a new window)"]').click()
    except:
        i = 1
# Once the download link is clicked, this has to fill the form for submission, which will download the file
class FormPage(object):
    def fill_form(self, data):
        driver.find_element_by_xpath('//input[@type = "radio" and @value = "Non-commercial"]').click()
        driver.find_element_by_xpath('//input[@type = "checkbox" and @value = "R&D"]').click()
        driver.find_element_by_xpath('//input[@name = "name_d"]').send_keys(data['name_d'])
        driver.find_element_by_xpath('//input[@name = "mail_d"]').send_keys(data['mail_d'])
        return self

    def submit(self):
        driver.find_element_by_xpath('//input[@value = "Submit"]').click()

data = {
    'name_d': 'abc',
    'mail_d': 'xyz@gmail.com',
}

FormPage().fill_form(data).submit()
driver.quit()

Actually you have two warnings and an error:
1 - "Too broad exception clause": this is a warning telling you that you should catch specific errors, not all of them. Your "except" line should look like except TheExceptionYouAreTreating:; an example would be except ValueError:. However, this warning alone should not stop your code from running.
2 - "Error: method submit maybe static" this is warning telling you that the method submit is a static method (bassically is a method that don't use the self attribute) to supress this warning you can use the decorator #staticmethod like this
@staticmethod
def submit():
    ...
3 - "SyntaxError: invalid syntax" this is what is stopping your code from running. This is a error telling you that something is written wrong in your code. I think that may be the indentation on your class. Try this:
i = 0
while i < 1:
    try:
        driver.find_element_by_xpath('//*[@title="xml (Open in a new window)"]').click()
    except:
        i = 1

# Once the download link is clicked, this has to fill the form for submission, which will download the file
class FormPage(object):
    def fill_form(self, data):
        driver.find_element_by_xpath('//input[@type = "radio" and @value = "Non-commercial"]').click()
        driver.find_element_by_xpath('//input[@type = "checkbox" and @value = "R&D"]').click()
        driver.find_element_by_xpath('//input[@name = "name_d"]').send_keys(data['name_d'])
        driver.find_element_by_xpath('//input[@name = "mail_d"]').send_keys(data['mail_d'])
        return self

    def submit(self):
        driver.find_element_by_xpath('//input[@value = "Submit"]').click()

data = {
    'name_d': 'abc',
    'mail_d': 'xyz@gmail.com',
}

FormPage().fill_form(data).submit()
driver.quit()
One more thing: those are really simple errors and warnings, and you should be able to fix them by yourself by carefully reading what the error has to say. I also recommend reading about Exceptions.

Related

How can I use my Python script as a proxy for URLs

I have a script that checks the input link; if it's equivalent to the one I specified in the code, it runs my code, otherwise it opens the link in Chrome.
I want to make that script act as a kind of default browser, to gain speed compared to opening the browser, grabbing the link with the help of an extension, and then sending it to my script using POST.
I used Procmon to check where the process in question queries the registry, and it seems it tried to check HKCU\Software\Classes\ChromeHTML\shell\open\command, so I added a key there and, in command, I edited the content of the key with my script path and arguments (-- %1) (-- is only here for testing purposes).
Unfortunately, once the program queries this to send a link, Windows prompts me to choose a browser instead of running my script, which isn't what I want.
Any idea?
In HKEY_CURRENT_USER\Software\Classes\ChromeHTML\shell\open\command, replace the value in (Default) with "C:\Users\samdra.r\AppData\Local\Programs\Python\Python39\pythonw.exe" "[Script_path_here]" %1
When launching a link, you'll be asked to set a default browser only once (it asks for a default browser after each change you make to the key); I selected Chrome in my case.
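If you prefer to script that registry change instead of editing it by hand, something like this should work (a minimal sketch using the standard winreg module; the pythonw.exe and script paths are placeholders you have to adjust):

import winreg

# Placeholder paths - point these at your own pythonw.exe and script.
command = r'"C:\Path\To\pythonw.exe" "C:\Path\To\NewAltDownload.pyw" %1'

with winreg.OpenKey(winreg.HKEY_CURRENT_USER,
                    r"Software\Classes\ChromeHTML\shell\open\command",
                    0, winreg.KEY_SET_VALUE) as key:
    # An empty value name writes to (Default).
    winreg.SetValueEx(key, "", 0, winreg.REG_SZ, command)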
As for the Python script, here it is:
import sys
import browser_cookie3
import requests
from bs4 import BeautifulSoup as BS
import re
import os
import asyncio
import shutil

def Prep_download(args):
    settings = os.path.abspath(__file__.split("NewAltDownload.py")[0]+'/settings.txt')
    if args[1] == "-d" or args[1] == "-disable":
        with open(settings, 'r+') as f:
            f.write(f.read()+"\n"+"False")
        print("Background program disabled, exiting...")
        exit()
    if args[1] == "-e" or args[1] == "-enable":
        with open(settings, 'r+') as f:
            f.write(f.read()+"\n"+"True")
    link = args[-1]
    with open(settings, 'r+') as f:
        try:
            data = f.read()
            osupath = data.split("\n")[0]
            state = data.split("\n")[1]
        except:
            f.write(f.read()+"\n"+"True")
            print("Possible first run, wrote True, exiting...")
            exit()
    if state == "True":
        asyncio.run(Download_map(osupath, link))

async def Download_map(osupath, link):
    if link.split("/")[2] == "osu.ppy.sh" and link.split("/")[3] == "b" or link.split("/")[3] == "beatmapsets":
        with requests.get(link) as r:
            link = r.url.split("#")[0]
        BMID = []
        id = re.sub("[^0-9]", "", link)
        for ids in os.listdir(os.path.abspath(osupath+("/Songs/"))):
            if re.match(r"(^\d*)", ids).group(0).isdigit():
                BMID.append(re.match(r"(^\d*)", ids).group(0))
        if id in BMID:
            print(link+": Map already exist")
            os.system('"'+os.path.abspath("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")+'" '+link)
            return
        if not id.isdigit():
            print("Invalid id")
            return
        cj = browser_cookie3.load()
        print("Downloading", link, "in", os.path.abspath(osupath+"/Songs/"))
        headers = {"referer": link}
        with requests.get(link) as r:
            t = BS(r.text, 'html.parser').title.text.split("·")[0]
        with requests.get(link+"/download", stream=True, cookies=cj, headers=headers) as r:
            if r.status_code == 200:
                try:
                    id = re.sub("[^0-9]", "", link)
                    with open(os.path.abspath(__file__.split("NewAltDownload.pyw")[0]+id+" "+t+".osz"), "wb") as otp:
                        otp.write(r.content)
                    shutil.copy(os.path.abspath(__file__.split("NewAltDownload.pyw")[0]+id+" "+t+".osz"), os.path.abspath(osupath+"/Songs/"+id+" "+t+".osz"))
                except:
                    print("You either aren't connected on osu!'s website or you're limited by the API, in which case you now have to wait 1h and then try again.")
            else:
                os.system('"'+os.path.abspath("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")+'" '+link)

args = sys.argv
if len(args) == 1:
    print("No arguments provided, exiting...")
    exit()
Prep_download(args)
You obtain the %1 argument (the link) with sys.argv[-1] (since sys.argv is a list), and from there you just check whether the link looks like the one you're looking for (in my case it needs to look like https://osu.ppy.sh/b/ or https://osu.ppy.sh/beatmapsets/).
If that's the case, run your own code; otherwise, just launch Chrome with the Chrome executable and the link as argument. If the id of the beatmap is already in the Songs folder, I also open the link in Chrome.
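In other words, the dispatch logic boils down to something like this (a simplified sketch; the Chrome path is the same hard-coded one as in the script above):

import subprocess
import sys

link = sys.argv[-1]  # the %1 argument Windows appends via the registry command
if link.startswith(("https://osu.ppy.sh/b/", "https://osu.ppy.sh/beatmapsets/")):
    print("Beatmap link, handling it here:", link)  # your own download code goes here
else:
    # Anything else goes straight to Chrome.
    subprocess.run(["C:/Program Files (x86)/Google/Chrome/Application/chrome.exe", link])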
To make it work in the background I had to fight with subprocesses and even more tricks, and in the end it suddenly started working with pythonw and the .pyw extension.

WhatsApp Web driver with Python times out

I want to create a program that can read all the messages from my WhatsApp and print them to the screen using Python.
In order to do that I tried using the whatsapp-web library https://pypi.org/project/whatsapp-web/.
But when I tried to run their code example I got a timeout error.
This is the code:
import time
from selenium import webdriver
from simon.accounts.pages import LoginPage
from simon.header.pages import HeaderPage
from simon.pages import BasePage
# Creating the driver (browser)
driver = webdriver.Firefox()
driver.maximize_window()
login_page = LoginPage(driver)
login_page.load()
login_page.remember_me = False
time.sleep(7)
base_page = BasePage(driver)
base_page.is_welcome_page_available()
base_page.is_nav_bar_page_available()
base_page.is_search_page_available()
base_page.is_pane_page_available()
base_page.is_chat_page_available()
# 3. Logout
header_page = HeaderPage(driver)
header_page.logout()
# Close the browser
driver.quit()
and this is the error
    base_page.is_welcome_page_available()
  File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 18, in wrapper
    return func(*args, **kwargs)
  File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 51, in is_welcome_page_available
    if self._find_element(WelcomeLocators.WELCOME):
  File "D:\zoom\venv\lib\site-packages\simon\pages.py", line 77, in _find_element
    lambda driver: self.driver.find_element(*locator))
  File "D:\zoom\venv\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Your code alone is not enough to find or correct the error, because the error is coming from the imported package.
Try increasing the time limit for the page to load, e.g. use time.sleep(15).
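A fixed sleep is fragile, though; an explicit wait that polls for a known element is usually more robust. A minimal sketch, reusing the driver from the code above (the selector is an assumption and will need updating whenever WhatsApp Web changes its markup):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 60 seconds for the chat pane to appear after the QR scan.
wait = WebDriverWait(driver, 60)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#pane-side")))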
You can try to automate WhatsApp Web by yourself without using the whatsapp-web pip package, because I guess that one is not updated.
The code is giving an error because WhatsApp Web has changed its element class names. Because of that, the code is not able to find the welcome page within the given time limit, so the program execution breaks.
I have done the same without using the whatsapp-web pip package.
I hope it will work for you.
A complete code reference example:
from selenium import webdriver
import time

# You can use any web browser which is supported by selenium and which can run WhatsApp Web.
# For using Google Chrome:
web_driver = webdriver.Chrome("Chrome_Driver_Path/chromedriver.exe")
web_driver.get("https://web.whatsapp.com/")

# For using Firefox:
# web_driver = webdriver.Firefox(executable_path=r"C:/Users/Pascal/Desktop/geckodriver.exe")
# web_driver.get("https://web.whatsapp.com/")

time.sleep(25)  # time to scan the QR code
# Please make sure that the QR code scan was successful.
confirm = int(input("Press 1 to proceed if successfully logged in or press 0 to retry: "))
if confirm == 1:
    print("Continuing...")
elif confirm == 0:
    web_driver.close()
    exit()
else:
    print("Sorry, please try again")
    web_driver.close()
    exit()

while True:
    unread_chats = web_driver.find_elements_by_xpath("//span[@class='_38M1B']")
    # In the above line, change the xpath's class name to the current class name by inspecting the span element
    # which contains the number of unread messages shown on the contact card (inside a green circle) before opening the chat room.

    # Open each chat using the loop and read the messages.
    for chat in unread_chats:
        chat.click()
        time.sleep(2)
        # Get the messages in order to act on them.
        message = web_driver.find_elements_by_xpath("//span[@class='_3-8er selectable-text copyable-text']")
        # In the above line, change the xpath's class name to the current class name by inspecting the span element
        # which contains the received text messages of the chat room.
        for i in message:
            try:
                print("Message received: " + str(i.text))
                # Here you can add your own code to act on the message as needed.
            except:
                pass
Please make sure that the indentation is equal in the code blocks if you are copying it.
See the following link for more info about automating WhatsApp Web using Python:
https://stackoverflow.com/a/68288416/15284163
I am developing a WhatsApp bot using Python.
For contribution you can contact me at: anurag.cse016@gmail.com
Please give a star on my https://github.com/4NUR46 if this answer helps you.

How to download books automatically from Gutenberg

I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib.request

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:

try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue

or just don't catch the exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also, some of your regexps look quite wrong (greedy match-all etc.).
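For example, a rough sketch of the same crawl with BeautifulSoup instead of regexps might look like this (the tag and attribute choices are assumptions about the page layout, not tested against the live site):

import requests
from bs4 import BeautifulSoup

start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
response = requests.get(start_url)
response.raise_for_status()  # fail loudly on a 4xx/5xx instead of parsing an error page

soup = BeautifulSoup(response.text, 'html.parser')
# Walk the anchor tags instead of regexp-matching raw HTML.
for anchor in soup.find_all('a', href=True):
    print(anchor['href'], anchor.get_text(strip=True))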
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error that start_url is not defined:
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup: from bs4 import BeautifulSoup. For information on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has got a newline between </h2> and <ul>, try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error, I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue

How to split Selenium Python code into multiple functions

I'm writing a Python program which will test some functions on a website. It will log in to the site, check its version, and run some tests on it depending on that version. I want to write a few tests for this site, but a few things will repeat, for example logging in to the site.
I tried to split my code into functions, like hue_login(), and use it in every test where I need to log in to the site. To log in to the site I use Selenium webdriver. If I split the code into small functions and try to use them in another function where I also use Selenium webdriver, I end up with two browser windows: one from my hue_login() function, where the function logs me in, and a second browser window which tries to open the URL I want to visit after logging in to the site interface. Of course, because I am not logged in in the second browser window, the site won't show and the remaining tests fail (the tests from this second function).
Example:
def hue_version():
    url = global_var.domain + global_var.about
    response = urllib.request.urlopen(url)
    htmlparser = etree.HTMLParser()
    xpath = etree.parse(response, htmlparser).xpath('/html/body/div[4]/div/div/h2/text()')
    string = "".join(xpath)
    pattern = re.compile(r'(\d{1,2}).(\d{1,2}).(\d{1,2})')
    return pattern.search(string).group()

hue_ver = hue_version()
print(hue_ver)
if hue_ver == '3.9.0':
    ...  # do something
elif hue_ver == '3.7.0':
    ...  # do something else
else:
    print("Hue version not recognized!")

def hue_login():
    driver = webdriver.Chrome(global_var.chromeDriverPath)
    driver.get(global_var.domain + global_var.loginPath)
    input_username = driver.find_element_by_name('username')
    input_password = driver.find_element_by_name('password')
    input_username.send_keys(username)
    input_password.send_keys(password)
    input_password.submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()

def file_browser():
    hue_login()
    click_file_browser_link = global_var.domain + global_var.fileBrowserLink
    driver = webdriver.Chrome(global_var.chromeDriverPath)
    driver.get(click_file_browser_link)
How can I call hue_login() from file_browser() so that the rest of the code from file_browser() is executed in the same window opened by hue_login()?
Here you go:
driver = webdriver.Chrome(global_var.chromeDriverPath)

def hue_login():
    driver.get(global_var.domain + global_var.loginPath)
    input_username = driver.find_element_by_name('username')
    input_password = driver.find_element_by_name('password')
    input_username.send_keys(username)
    input_password.send_keys(password)
    input_password.submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()

def file_browser():
    hue_login()
    click_file_browser_link = global_var.domain + global_var.fileBrowserLink
    driver.get(click_file_browser_link)
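A variation that avoids the module-level global is to have hue_login() create and return the driver, then pass the logged-in session around explicitly (a sketch of the same idea, not from the original answer):

def hue_login():
    driver = webdriver.Chrome(global_var.chromeDriverPath)
    driver.get(global_var.domain + global_var.loginPath)
    driver.find_element_by_name('username').send_keys(username)
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_name('password').submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()
    return driver  # hand the logged-in session to the caller

def file_browser():
    driver = hue_login()
    driver.get(global_var.domain + global_var.fileBrowserLink)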

Python Selenium: Unable to Find Element After First Refresh

I've seen a few instances of this question, but I was not sure how to apply the changes to my particular situation. I have code that monitors a webpage for changes and refreshes every 30 seconds, as follows:
import sys
import ctypes
from time import sleep
from Checker import Checker

USERNAME = sys.argv[1]
PASSWORD = sys.argv[2]

def main():
    crawler = Checker()
    crawler.login(USERNAME, PASSWORD)
    crawler.click_data()
    crawler.view_page()
    while crawler.check_page():
        crawler.wait_for_table()
        crawler.refresh()
    ctypes.windll.user32.MessageBoxW(0, "A change has been made!", "Attention", 1)

if __name__ == "__main__":
    main()
The problem is that Selenium will always show an error stating it is unable to locate the element after the first refresh has been made. The element in question, I suspect, is a table from which I retrieve data using the following function:
def get_data_cells(self):
    contents = []
    table_id = "table.datadisplaytable:nth-child(4)"
    table = self.driver.find_element(By.CSS_SELECTOR, table_id)
    cells = table.find_elements_by_tag_name('td')
    for cell in cells:
        contents.append(cell.text)
    return contents
I can't tell if the issue is in the above function or in the main(). What's an easy way to get Selenium to refresh the page without returning such an error?
Update:
I've added a wait function and adjusted the main() function accordingly:
def wait_for_table(self):
    table_selector = "table.datadisplaytable:nth-child(4)"
    delay = 60
    try:
        wait = ui.WebDriverWait(self.driver, delay)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, table_selector)))
    except TimeoutError:
        print("Operation timeout! The requested element never loaded.")
Since the same error is still occurring, either my timing function is not working properly or it is not a timing issue.
I've run into the same issue while doing web scraping before and found that re-sending the GET request (instead of refreshing) seemed to eliminate it.
It's not very elegant, but it worked for me.
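In this code, that would mean replacing the refresh with a fresh GET of the current URL, e.g. (a sketch, assuming refresh() is a method on the same class that owns self.driver):

def refresh(self):
    # Re-issue the GET request instead of driver.refresh().
    self.driver.get(self.driver.current_url)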
I appear to have fixed my own problem.
My refresh() function was written as follows:
def refresh(self):
    self.driver.refresh()
All I did was switch frames right after the refresh() call. That is:
def refresh(self):
    self.driver.refresh()
    self.driver.switch_to.frame("content")
This took care of it. I can see that the page is now refreshing without issues.
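If the table ever races the refresh again, the explicit wait from wait_for_table() could also be folded in after the frame switch (a sketch reusing the question's selector and the ui/EC/By aliases from above):

def refresh(self):
    self.driver.refresh()
    self.driver.switch_to.frame("content")
    # Block until the table is present before anything tries to read it.
    ui.WebDriverWait(self.driver, 60).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "table.datadisplaytable:nth-child(4)")))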
