Easy way to work around slow Selenium Python startup? - python

So I have a program I want to run using Selenium, specifically one that takes a series of actions on a password-protected website. Basically, I need to be able to input a unique link and password when I get it, which will take me to the main website that I have automated. The issue here is that Selenium takes a long time to load a webpage when you start it up, and time is very important in this application. Inputting the link and launching the browser to that link directly takes a long time.
What I have tried doing is preloading the browser to a different website (e.g. https://google.com) beforehand, and then waiting on user input for the link to the actual page. This process works a lot quicker, but I'm having trouble getting it to work inside a function and with multiprocessing. I am using multiprocessing to execute this on a wide scale with lots of instances, and I am trying to start all of my functions the second a link is defined by me. I am on Windows 10, using Python 3.8.3, and using Chrome for my Selenium browser.
from selenium import webdriver

global link
link = input('Paste Link Here: ')

def instance_1():
    browser1 = webdriver.Chrome(*my webdriver file path*)
    browser1.get('https://google.com')
    # need something that waits here until the link variable is defined by me
    browser1.get(link)
    # the rest of the automation works fine from here
Ideally, the solution would be able to work with multiprocessing. The ideal flow would be something like this:
1. All Selenium instances (written as their own functions) start up and preload to a website (this part works fine)
2. They wait until the link to go to is specified (this is where the issue is)
3. They then go to the link and execute the automation (this part works fine)
TL;DR: basically anything that would allow the program to continue while waiting on the input would be nice.
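One way to structure this is sketched below. It is only a sketch: it assumes chromedriver is on your PATH (otherwise pass your driver path to webdriver.Chrome), and it gives each worker its own multiprocessing.Queue so every preloaded browser blocks until the parent process hands it the link.

import multiprocessing as mp
from selenium import webdriver

def instance(link_queue):
    # preload the browser so the startup cost is paid before the link arrives
    browser = webdriver.Chrome()
    browser.get('https://google.com')
    link = link_queue.get()        # blocks here until the parent sends the link
    browser.get(link)
    # the rest of the automation goes here

if __name__ == '__main__':
    n_instances = 4                # however many parallel browsers you want
    queues = [mp.Queue() for _ in range(n_instances)]
    workers = [mp.Process(target=instance, args=(q,)) for q in queues]
    for w in workers:
        w.start()                  # browsers start loading immediately
    link = input('Paste Link Here: ')
    for q in queues:
        q.put(link)                # every preloaded instance proceeds at once
    for w in workers:
        w.join()

Because each queue belongs to one worker, the parent puts the same link into every queue; a single multiprocessing.Event plus a shared value would also work, but a queue keeps the sketch short.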

Related

Extracting info from webpage via python

I'd like to ask somebody with experience with headless browsers and Python whether it's possible to extract the box info with the distance from the closest strike on the webpage below. Until now I was using Python bs4, but since everything here is driven by jQuery, a simple download of the webpage doesn't work. I found PhantomJS, but I wasn't able to extract it with that either, so I am not sure if it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can be done easily with Selenium. Selenium has both a headless mode and a headed mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as searching for a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out what Selenium features you can use to do what you want; that depends on luck and on whether what you want happens to be easy or hard for that particular website.
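As a rough illustration of that route, here is a minimal headless-Selenium sketch; it assumes chromedriver is installed, and the CSS selector is a placeholder you would replace after inspecting the page.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')
driver = webdriver.Chrome(options=opts)
driver.get('https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0')
driver.implicitly_wait(10)   # give the jQuery-driven content time to render
box = driver.find_element_by_css_selector('.distance-box')   # placeholder selector
print(box.text)
driver.quit()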
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
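For illustration, here is a hedged sketch of hitting that endpoint with requests; the parameter values below are guesses, and authid, timestamp and hash almost certainly need to be copied from a request captured in your browser's network tab.

import requests

url = 'https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark'
params = {
    'location': '49.13688,16.56522',       # guessed format, verify against a captured request
    'locationtype': 'latitudelongitude',   # guessed value
    'units': 'metric',
    'verbose': 'true',
    # 'authid': ..., 'timestamp': ..., 'hash': ...,  # copy these from a captured request
}
resp = requests.get(url, params=params, timeout=10)
print(resp.status_code)
print(resp.text)   # likely JSON, or JSONP if the callback parameter is required

If the response comes back wrapped in a JSONP callback, omitting the callback parameter may get you plain JSON; otherwise just strip the wrapper before parsing.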

Controlling the mouse and browser with pyautogui for process automation

I'm new to Python and I need expert guidance for the project I'm trying to finish at work, as none of my coworkers are programmers.
I'm making a script that logs into a website and pulls a CSV dataset. Here are the steps that I'd like to automate:
Open chrome, go to a website
Login with username/password
Navigate to another internal site via menu dropdown
Input text into a search tag box or delete search tags, e.g. "Hours", press "Enter" or "Tab" to select (repeat this for 3-4 search tags)
Click "Run data"
Wait until data loads, then click "Download" to get a CSV file with 40-50k rows of data
Repeat this process 3-4 times for different data pulls, different links and different search tags
This process usually takes 30-40 minutes for a total of 4 or 5 data pulls each week so it's like watching paint dry.
I've tried to automate this using the pyautogui module, but it isn't working out for me. It works too fast, or doesn't work at all. I think I'm using it wrong.
This is my code:
import webbrowser
import pyautogui

# print(pyautogui.position()) is handy for finding the screen coordinates used below
pyautogui.FAILSAFE = True   # slam the mouse into a screen corner to abort

# macOS-style command for opening a URL in Chrome
chrome_path = 'open -a /Applications/Google\ Chrome.app %s'

url = 'http://Google.com/'
webbrowser.get(chrome_path).open(url)

# click the address bar and type the site to visit
pyautogui.moveTo(185, 87, duration=0.25)
pyautogui.click()
pyautogui.typewrite('www.linkedin.com')
pyautogui.press('enter')
# login here? Research
In case pyautogui is not suited for this task, can you recommend an alternative way?
The way you are going about grabbing your data is very error prone and not how people generally grab data from websites. What you want is a web scraper, which lets you pull information from websites; alternatively, some companies provide APIs that give you easier access to the data.
LinkedIn has a built-in API for grabbing information. You did mention that you were navigating to another site, though; in that case I would check whether that site has an API, or look into using Scrapy, a web scraper that should allow you to pull the information you need.
Sidenote: You can also look into synchronous and asynchronous programming with Python to make multiple requests faster/easier.
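To make the Scrapy suggestion concrete, here is a minimal spider sketch. The URL and selectors are placeholders for your internal site, and a real login-protected site would also need a login step first (Scrapy's FormRequest is the usual tool for that).

import scrapy

class ReportSpider(scrapy.Spider):
    name = 'report'
    start_urls = ['https://example.com/reports']    # placeholder URL

    def parse(self, response):
        # placeholder selectors: yield one item per table row
        for row in response.css('table tr'):
            yield {'cells': row.css('td::text').getall()}

You could run it with something like: scrapy runspider report_spider.py -o report.csv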

Run selenium in parallel/scrape multiple sites in parallel or in order?

I'm new to Django. I'm currently creating a web app that has a form which sends user input to some other online tools, submits that data, and then scrapes the results and shows the output in the form of a table on my web app, like so: https://i.imgur.com/AVj3cJJ.png
I've ended up using Selenium with PhantomJS, as the tools have JavaScript objects. I was using MechanicalSoup before, which was working well, but I got stuck when I reached the JavaScript elements, hence switching to Selenium.
I've run into a problem. Currently I run the scraper via a function I've created and call it in my view like so:
if form.is_valid():
    crisporDF = crispor(form.cleaned_data['dnaSeq'], form.cleaned_data['species'])
This function takes the user data, feeds it into the scraper, and then produces the table seen in the screenshot above; all of that works fine. However, I have 5 tools I need to scrape, producing 5 tables, so I've created another similar function for another tool, and this is the result I get: https://i.imgur.com/qDXVR7s.png
As you can see it kind of works: the results in the second table are fine, but using the same input data from the user, the first table no longer produces the correct results.
I think it's because I'm creating a new webdriver in each function. I know Django is a synchronous framework, but I also know there are ways to run functions asynchronously. I don't know why it's not producing the correct results for the first site anymore, but if my logic is correct it must be because only one instance of a webdriver can be used at once? I've tried creating a single webdriver, but it doesn't seem to work.
Hence I am asking:
How do I run Selenium in parallel?
Or how do I get Selenium to go to the first site, run my function and get the data, then go to the next URL, run my next function and get the data, i.e. in order/chronologically? I understand that this will increase page load time, but that is not something I am worried about; with scientific data processing it can take seconds to even hours, so that is not a problem.
I understand there are tools like Celery, but since loading times aren't a problem I don't mind running the processes in the foreground. Would multiprocessing in Python be an option? E.g. create an instance for each tool, so 5 instances of the webdriver (one per tool), get the data and then pool it together to print as tables on the site? Is that possible?
Thanks for all your help! If you need any more info please ask!
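If you do try the multiprocessing route described above, a minimal sketch is below. The tool URLs and the body of scrape_tool are placeholders; the key point is that each process creates and quits its own webdriver, since driver objects can't be shared across processes.

from multiprocessing import Pool
from selenium import webdriver

TOOL_URLS = ['https://tool-one.example', 'https://tool-two.example']   # placeholder URLs

def scrape_tool(url):
    driver = webdriver.PhantomJS()     # or a headless Chrome/Firefox driver
    try:
        driver.get(url)
        # fill the form, submit, wait for the results, parse the table...
        return driver.title            # placeholder for the real scraped table
    finally:
        driver.quit()

if __name__ == '__main__':
    with Pool(processes=len(TOOL_URLS)) as pool:
        tables = pool.map(scrape_tool, TOOL_URLS)   # one process per tool
    print(tables)

Whether spawning processes from inside a Django view is wise is a separate question; a task queue is the more common pattern for long-running work, but the sketch shows the mechanics asked about.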

Browser rendering time

I am trying to capture the rendering time for pages in an automated way. I tried two approaches:
a) Create a Selenium script, attach HttpWatch to the browser window, and collect metrics (the free version of HttpWatch doesn't give me all the metrics I need).
b) Use Selenium to launch a Chrome window, collect performance logs using Chrome's performance logging capability, and then try to read the content of the log file by loading it back into Chrome or some other tool that suits the purpose.
The code I used in Selenium is:
driver = webdriver.Chrome(executable_path="C:\\IEDriverServer\\chromedriver.exe",
                          desired_capabilities={'loggingPrefs': {'performance': 'ALL'}})
The problem is that Chrome is not able to give me any data when I save the output to a file and load it.
What am I doing wrong? Please let me know if there is a better way to do this. The metrics I need are basically the rendering time in the browser, and rendering start to onload event start.
I am using Python for the Selenium scripts.
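One thing worth trying (a sketch, not a verified fix) is to read the data out of the live session instead of writing it to a file and reloading it: window.performance.timing gives the navigation/onload timestamps directly, and driver.get_log('performance') returns the DevTools events that the loggingPrefs capability enables.

import json
from selenium import webdriver

caps = {'loggingPrefs': {'performance': 'ALL'}}   # 'goog:loggingPrefs' on newer Chrome/chromedriver
driver = webdriver.Chrome(executable_path="C:\\IEDriverServer\\chromedriver.exe",
                          desired_capabilities=caps)
driver.get('https://example.com')                 # placeholder URL

timing = driver.execute_script('return window.performance.timing')
print('navigationStart to loadEventStart:',
      timing['loadEventStart'] - timing['navigationStart'], 'ms')

for entry in driver.get_log('performance'):       # raw DevTools events as JSON strings
    message = json.loads(entry['message'])['message']
    if message['method'].startswith('Page.'):
        print(message['method'])
driver.quit()

Which pair of timing fields best matches the "rendering start to onload" metric depends on what you treat as the start of rendering; the timing dict exposes the standard Navigation Timing fields, so you can subtract whichever two you need.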

automatically edit firefox web address upon pageload, and then reload

My coding experience is in Python. Is there a simple way to execute Python code in Firefox that would detect a particular address, say nytimes.com, load the page, then delete the end of the address following "html" (this allows bypassing the 20 pageviews/month limit) and reload?
Your best bet is to use Selenium, as proposed before. Here's a small example of how you could do it. Basically the code checks whether the limit has been reached, and if it has, it deletes cookies and refreshes the page, letting you continue reading. Deleting cookies lets you read another 10 articles without continuously editing the address. That's the technical part; you have to consider the legal implications yourself.
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.nytimes.com')
# find_elements (plural) returns an empty list instead of raising when the banner is absent
if browser.find_elements_by_xpath('.//*[contains(.,"You’ve reached the limit of 10 free articles a month.")]'):
    browser.delete_all_cookies()
    browser.refresh()
You can use Selenium; it lets you easily and fully control Firefox and other web browsers with Python. It would only take a few lines of code to achieve this. This answer, How to integrate Selenium and Python, has a working example.
