Distributing a Python workload across multiple processes

Let us suppose that I want to do a Google search for the word 'hello'. I then want to go to every single link on the first 100 pages of results and download the HTML of each linked page. Since there are 10 results per page, this would mean I'd have to click about 1,000 links.
This is how I would do it with a single process:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://google.com')

# do the search
search = driver.find_element_by_name('q')
search.send_keys('hello')
search.submit()

# click all the items
links_on_page = driver.find_elements_by_xpath('//li/div/h3/a')
for item in links_on_page:
    item.click()
    # do something on the page
    driver.back()

# go to the next page
driver.find_element_by_xpath('//*[@id="pnnext"]').click()
Doing this on 100 pages would obviously take a very long time. How would I distribute the load, such that I could have (for example) three drivers open, and each would 'check out' a page? For example:
Driver #1 checks out page 1. Starts page 1.
Driver #2 sees that page 1 is checked out and goes to page #2. Starts page 2.
Driver #3 sees that page 1 is checked out and goes to page #2. Same with page 2. Starts page 3.
Driver #1 finishes work on page 1...starts page 4.
I understand the principle of how this would work, but what would be the actual code to get a basic implementation of this working?

You probably want to use a multiprocessing Pool. To do so, write a method that is parameterised by page number:
def get_page_data(page_number):
    # Fetch page data
    ...
    # Parse page data
    ...
    for linked_page in parsed_links:
        # Fetch page source and save to file
        ...
Then just use a Pool of however many processes you think is appropriate (determining this number will probably require some experimentation):
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(get_page_data, range(1, 101))
This will now set 4 processes going, each fetching a page from Google and then fetching each of the pages it links to.
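Here is a minimal sketch of how get_page_data() could be fleshed out so that each worker process drives its own Firefox instance. The Google search URL with a start offset, the output filenames, and the pool size of three are assumptions for illustration, not part of the original question or answer:
from multiprocessing import Pool
from selenium import webdriver

def get_page_data(page_number):
    # one browser per worker process
    driver = webdriver.Firefox()
    try:
        # assumption: each results page can be reached directly via the
        # "start" offset (10 results per page)
        driver.get('https://www.google.com/search?q=hello&start=%d'
                   % ((page_number - 1) * 10))
        links = [a.get_attribute('href')
                 for a in driver.find_elements_by_xpath('//li/div/h3/a')]
        for index, url in enumerate(links):
            driver.get(url)  # visit the linked page
            # hypothetical output filename
            with open('page_%d_link_%d.html' % (page_number, index), 'w') as f:
                f.write(driver.page_source)
    finally:
        driver.quit()

if __name__ == '__main__':
    with Pool(processes=3) as pool:  # three concurrent drivers, as in the question
        pool.map(get_page_data, range(1, 101))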

Not answering your question directly, but proposing an avenue that might make your code usable in a single process, thereby avoiding synchronisation issues between different threads/processes...
You would probably be better off with a framework such as Twisted, which enables asynchronous network operations, in order to keep all operations within the same process. In your code, parsing the HTML is likely to take far less time than the network operations required to fetch the pages. Therefore, using asynchronous IO, you can launch several requests at the same time and parse each result only as its response arrives. In effect, while each page is being fetched, your process would otherwise just be 'idling' in the run loop.
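As a rough illustration of the same asynchronous-IO idea, here is a sketch using the standard-library asyncio together with the third-party aiohttp package rather than Twisted itself (that substitution, and the placeholder URLs, are my assumptions):
import asyncio
import aiohttp

async def fetch_and_parse(session, url):
    # the network wait dominates; parsing happens as each response arrives
    async with session.get(url) as resp:
        html = await resp.text()
    return url, len(html)  # stand-in for real parsing

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_and_parse(session, u) for u in urls]
        for coro in asyncio.as_completed(tasks):
            url, size = await coro
            print(url, size)

if __name__ == '__main__':
    # placeholder URLs for illustration
    asyncio.run(main(['https://example.com', 'https://example.org']))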

Related

In Python Splinter/Selenium, how to load all contents in a lazy-load web page

What I want to do - I want to crawl content (similar to companies' stock prices) on a website. The value of each element (i.e. a stock price) is updated every second. However, the page is lazy-loaded, so only 5 elements are visible at a time, while I need to collect data from ~200 elements.
What I tried - I use Python Splinter to get the data from the elements' div class, but only the 5-10 elements surrounding the current view appear in the HTML. I tried scrolling the browser down, which lets me get the next elements (the next companies' stock prices), but the prior elements' information is then no longer available. This process (scrolling down and getting new data) is too slow, and by the time I finish getting all 200 elements, the first element's value has already changed several times.
So, can you suggest some approaches to handle this issue? Is there any way to force the browser to load all contents instead of lazy-loading?
There is not one right way. It depends on how the website works in the background. Normally there are two options if it's a lazy-loaded page:
Selenium. It executes all JS scripts and "merges" all requests from the background into a complete page, like a normal web browser.
Access the API. In this case you don't have to care about the UI and dynamically hidden elements. The API gives you access to all data on the web page, often more than is displayed.
In your case, if there is an update every second, it sounds like a stream connection (maybe a WebSocket). So try to figure out how the website gets its data and then try to scrape the API endpoint directly.
What page is it?
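If the site does turn out to expose a plain JSON endpoint, a minimal polling sketch might look like this (the URL and field names are placeholders; find the real endpoint in the browser's network tab):
import time
import requests

API_URL = 'https://example.com/api/prices'  # hypothetical endpoint

while True:
    data = requests.get(API_URL, timeout=10).json()
    for item in data:  # all ~200 items arrive in one response
        print(item.get('symbol'), item.get('price'))
    time.sleep(1)      # the values update roughly every second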

selenium - webdriver won't go back to previous page

I am trying to click on a link, scrape data from that webpage, go back again, click on the next link, and so on. But I am not able to go back to the previous page for some reason. I observed that I can execute the code to go back if I am outside the loop, and I can't figure out what is wrong with the loop. I tried to use driver.back() too and yet it won't work. Any help is appreciated! TIA
x = 0  # counter
contents = []
for link in soup_level1.find_all('a', href=re.compile(r"^/new-homes/arizona/phoenix/"), tabindex=-1):
    python_button = driver.find_element_by_xpath("//div[@class='clearfix len-results-items len-view-list']//a[contains(@href,'/new-homes/arizona/phoenix/')]")
    driver.execute_script("arguments[0].click();", python_button)
    driver.implicitly_wait(50)
    soup_level2 = BeautifulSoup(driver.page_source, 'lxml')
    a = soup_level2.find('ul', class_='plan-info-lst')
    for names in a.find('li'):
        contents.append(names.span.next_sibling.strip())
    driver.execute_script("window.history.go(-1)")
    driver.implicitly_wait(50)
    x += 1
Some more information about your use case in terms of:
Selenium client version
WebDriver variant and version
Browser type and version
would have helped us to debug the issue in a better way.
However to go back to the previous page you can use either of the following solutions:
Using back(): Goes one step backward in the browser history.
Usage:
driver.back()
Using execute_script(): Synchronously Executes JavaScript in the current window/frame.
Usage:
driver.execute_script("window.history.go(-1)")
Use case: Internet Explorer
As per @james.h.evans.jr's comment in the discussion driver.navigate().back() blocks when back button triggers a javascript alert on the page, if you are using Internet Explorer, back() may at times not work, and that is pretty much expected, as IE navigates back in the history by using the COM GoBack() method of the IWebBrowser interface. Given that, if any modal dialogs appear during the execution of the method, the method will block.
You may even face similar issues while invoking forward() in the history and submitting forms. The GoBack method can be executed on a separate thread, which would involve calling a few not-very-intuitive COM object marshaling functions, e.g. CoGetInterfaceAndReleaseStream() and CoMarshalInterThreadInterfaceInStream(), but it seems there is not much we can do about that.
Instead of using
driver.execute_script("window.history.go(-1)")
you can try using
driver.back() (see here).
Please be aware that this functionality depends entirely on the underlying driver. It’s just possible that something unexpected may happen when you call these methods if you’re used to the behavior of one browser over another.
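As a rough sketch of the overall click / scrape / go-back loop, here is one way to structure it using explicit waits instead of the repeated implicitly_wait() calls; the wait strategy, the placeholder start URL, and the simplified XPath are my assumptions, not part of the answers above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://example.com/results')  # placeholder results page

results_xpath = "//a[contains(@href,'/new-homes/arizona/phoenix/')]"
count = len(driver.find_elements(By.XPATH, results_xpath))

for i in range(count):
    # re-locate the links on every pass; going back invalidates old element references
    link = driver.find_elements(By.XPATH, results_xpath)[i]
    driver.execute_script("arguments[0].click();", link)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.plan-info-lst')))
    # ... scrape the detail page here ...
    driver.back()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, results_xpath)))

driver.quit()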

Controlling the mouse and browser with pyautogui for process automation

I'm new to Python and I need expert guidance for the project I'm trying to finish at work, as none of my coworkers are programmers.
I'm making a script that logs into a website and pulls a CSV dataset. Here are the steps that I'd like to automate:
Open chrome, go to a website
Login with username/password
Navigate to another internal site via menu dropdown
Input text into a search tag box or delete search tags, e.g. "Hours", press "Enter" or "Tab" to select (repeat this for 3-4 search tags)
Click "Run data"
Wait until data loads, then click "Download" to get a CSV file with 40-50k rows of data
Repeat this process 3-4 times for different data pulls, different links and different search tags
This process usually takes 30-40 minutes for a total of 4 or 5 data pulls each week so it's like watching paint dry.
I've tried to automate this using the pyautogui module, but it isn't working out for me. It works too fast, or doesn't work at all. I think I'm using it wrong.
This is my code:
import webbrowser
import pyautogui
#pyautogui.position()
#print(pyautogui.position())
#1-2
pyautogui.FAILSAFE = True
chrome_path = 'open -a /Applications/Google\ Chrome.app %s'
#2-12
url = 'http://Google.com/'
webbrowser.get(chrome_path).open(url)
pyautogui.moveTo(185, 87, duration=0.25)
pyautogui.click()
pyautogui.typewrite('www.linkedin.com')
pyautogui.press('enter')
#loginhere? Research
In case pyautogui is not suited for this task, can you recommend an alternative way?
The way you are going about grabbing your data is very error prone and not how people generally grab data from websites. What you want is a web scraper, which lets you pull information out of websites; some companies also provide APIs that give you easier access to the data.
For grabbing information from LinkedIn, there is a built-in API. You did mention that you were navigating to another site, though, in which case I would check whether that site has an API, or look into using Scrapy, a web scraper that should allow you to pull the information you need.
Sidenote: You can also look into synchronous and asynchronous programming with Python to make multiple requests faster/easier.
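A minimal Scrapy spider sketch along those lines, assuming the internal site exposes the report as plain CSV links on an HTML page (the URL, the selector, and the lack of authentication are placeholders for illustration):
import scrapy

class ReportSpider(scrapy.Spider):
    name = 'report'
    start_urls = ['https://example.com/reports']  # hypothetical internal page

    def parse(self, response):
        # follow every link that points at a CSV download
        for href in response.css('a[href$=".csv"]::attr(href)').getall():
            yield response.follow(href, callback=self.save_csv)

    def save_csv(self, response):
        # write the raw CSV body to a local file
        filename = response.url.rsplit('/', 1)[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
Run it with scrapy runspider report_spider.py; a real login would typically need scrapy.FormRequest or an API token.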

Speed up the number of page I can scrape via threading

I'm currently using BeautifulSoup to scrape sourceforge.net for various project information. I'm using the solution in this thread. It works well, but I wish to do it faster. Right now I'm creating a list of 15 URLs and feeding them into run_parallel_in_threads. All the URLs are sourceforge.net links. I'm currently getting about 2.5 pages per second, and it seems that increasing or decreasing the number of URLs in my list doesn't have much effect on the speed. Are there any strategies to increase the number of pages I can scrape? Any other solutions that are more suitable for this kind of project?
You could have the threads that run in parallel simply retrieve the web content. Once an HTML page is retrieved, pass it into a queue served by multiple workers, each parsing a single HTML page. Now you've essentially pipelined your workflow: instead of having each thread do multiple steps (retrieve page, scrape, store), each of your parallel threads simply retrieves a page and then hands the parsing task to a queue, which processes these tasks in a round-robin fashion.
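A minimal sketch of that fetch/parse pipeline, assuming the standard-library threading and queue modules plus requests and BeautifulSoup (the URL list and the "parse" step are placeholders):
import queue
import threading
import requests
from bs4 import BeautifulSoup

urls = ['https://sourceforge.net/projects/example%d/' % i for i in range(15)]  # placeholders
html_queue = queue.Queue()

def fetcher(url):
    # network-bound step: just download the page and hand it off
    html_queue.put(requests.get(url, timeout=10).text)

def parser():
    # CPU-bound step: parse pages as they arrive
    while True:
        html = html_queue.get()
        if html is None:  # sentinel: no more pages
            break
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.title.string if soup.title else 'no title')

parse_thread = threading.Thread(target=parser)
parse_thread.start()

fetch_threads = [threading.Thread(target=fetcher, args=(u,)) for u in urls]
for t in fetch_threads:
    t.start()
for t in fetch_threads:
    t.join()

html_queue.put(None)  # tell the parser to stop
parse_thread.join()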
Please let me know if you have any questions!

Python 3.X Extract Source Code ONLY when page is done loading

I submit a query on a web page. The query takes several seconds before it is done. Only when it is done does it display an HTML table that I would like to get the information from. Let's say this query takes a maximum of 4 seconds to load. While I would prefer to get the data as soon as it is loaded, it would be acceptable to wait 4 seconds then get the data from the table.
The issue I have is that when I make my urlopen request, the page hasn't finished loading yet. I tried loading the page, then issuing a sleep command, then loading it again, but that does not work either.
My code is
import urllib.request
import time

uf = urllib.request.urlopen(urlname)
time.sleep(3)
text = uf.read().decode('UTF-8')
print(text)
The webpage I am looking at is http://bookscouter.com/prices.php?isbn=9781111835811 (feel free to ignore the interesting textbook haha)
And I am using Python 3.X on a Raspberry Pi
The prices you want are not in the page you're retrieving, so no amount of waiting will make them appear. Instead, the prices are retrieved by a JavaScript in that page after it has loaded. The urllib module is not a browser, so it won't run that script for you. You'll want to figure out what the URL is for the AJAX request (a quick look at the source code gives a pretty big hint) and retrieve that instead. It's probably going to be in JSON format so you can just use Python's json module to parse it.
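For illustration, here is a sketch of fetching such an AJAX endpoint directly with urllib and parsing the JSON; the endpoint path below is a placeholder, and the real URL has to be found in the page source or the browser's network tab:
import json
import urllib.request

# placeholder endpoint; substitute the real AJAX URL found in the page source
ajax_url = 'http://bookscouter.com/path/to/ajax?isbn=9781111835811'

with urllib.request.urlopen(ajax_url) as resp:
    data = json.loads(resp.read().decode('utf-8'))

# inspect the structure once, then pull out the fields you need
print(json.dumps(data, indent=2))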
