Web Scraping with Multiprocessing in Python Stalls Randomly

I am attempting to scrape a list of URLs using a function called data_finder, which takes a URL as its only argument. The list of URLs is called urls.
To speed up the process, I am using the multiprocessing package in Python 3 on Windows 10. The code I am using is below:
if __name__ == '__main__':
    multiprocessing.freeze_support()
    p = multiprocessing.Pool(10)
    records = p.map(data_finder, urls)
    p.close()
    p.join()
    print('Successfully exported.')
    with open('test.json', 'w') as outfile:
        json.dump(records, outfile)
The problem I am having is that sometimes the code freezes and is unable to continue, while other times it works as expected. When it does freeze, it is usually within the last 10 URLs. Is this a common occurrence, and is there a solution?

Have you tried timing the request call to check whether that is what is stalling? Given that it only happens 'sometimes', it looks to me like the network is causing the delay.
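If the network is the culprit, a per-request timeout usually keeps the pool from hanging on a single slow URL. A minimal sketch, assuming data_finder uses the requests library (the function body here is an illustration, not the asker's actual code):

import requests

def data_finder(url):
    try:
        # hypothetical scrape: the timeout makes a dead connection fail
        # after 30 seconds instead of blocking the worker forever
        response = requests.get(url, timeout=30)
        return {'url': url, 'status': response.status_code, 'length': len(response.text)}
    except requests.exceptions.RequestException as e:
        return {'url': url, 'error': str(e)}

With this in place, a stalled URL comes back as an error record instead of freezing the whole p.map call.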

Related

How to timeout child processes using Process(target=), start(), join(), alive() and terminate()

This question covers many topics and scenarios of multiprocessing.
Stack Overflow
I've searched Stack Overflow and, although I've found many questions on this topic, I haven't found an answer that fits my situation, and I'm not a strong enough Python programmer to adapt the existing answers to my needs. I have been working on multiprocessing for more than two weeks.
I have looked on YouTube, Stack Overflow, Google, and GitHub to no avail:
https://www.youtube.com/watch?v=fKl2JW_qrso
https://www.youtube.com/watch?v=35yYObtZ95o
https://www.youtube.com/watch?v=IT8RYokUvvQ
kill a function after a certain time in windows
Creating a timeout function in Python with multiprocessing
Handle multiprocessing.TimeoutError in multiprocessing pool.map_async()
many others.
Target Point
I am generating sitemaps for different websites. The website URLs are in an Excel sheet, and 'data' (used in the code below) is the name of the column that holds them. Some websites take a very long time to crawl; if a website takes more than 3 minutes to crawl, I want to stop that process, store whatever data was crawled in those 3 minutes, and start a new process for the next site.
I tried targeting (row), but I don't know how to access row inside if __name__ == '__main__':
Program
from multiprocessing import Process
import time
from pysitemap import crawler
import pandas

# function with parameter crawler
def do_actions(crawler):
    # pandas reads the Excel file; the Excel file holds the URLs of the different websites; 'Absolute path' is the path of the Excel sheet
    df = pandas.read_excel(r'Absolute path')
    for index, row in df.iterrows():
        # 'data' is the name of the column of the Excel sheet that holds the URLs
        Url = row['data']
        try:
            # crawler is used to crawl the URLs that are in the Excel sheet
            crawler(Url, out_file=f'{index}sitemap.xml')
        except Exception as e:
            print(e)
            pass

if __name__ == '__main__':
    # create a Process
    action_process = Process(target=row)
    # start the process
    action_process.start()
    # timeout 180 seconds
    action_process.join(timeout=180)
    # if the process is still alive after 180 s, terminate it
    if action_process is alive():
        action_process.terminate()
You want to pass a callable as the target function, not the return value of the function itself. Also, the specific attribute you want to check is process.is_alive(). Lastly, while it doesn't affect your program, you should instantiate action_process under the if __name__ ... block. Think of it this way: do your child processes require access to this variable? If not, put it under that block.
if __name__ == '__main__':
    action_process = Process(target=do_actions, args=(crawler, ))
    # start the process
    action_process.start()
    # timeout 180 seconds
    action_process.join(timeout=180)
    # if the process is still alive after 180 s, terminate it
    if action_process.is_alive():
        action_process.terminate()
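Note that the snippet above puts a single 180-second limit on the whole do_actions loop. If the goal is a 3-minute budget per website, one option is to spawn one process per URL and join each with its own timeout. A rough sketch under that assumption (crawl_one is a hypothetical helper; whether partial results survive a terminate() depends on how pysitemap writes its output file):

from multiprocessing import Process
import pandas
from pysitemap import crawler

def crawl_one(url, index):
    # crawl a single site; any partial output is whatever crawler() has written so far
    crawler(url, out_file=f'{index}sitemap.xml')

if __name__ == '__main__':
    df = pandas.read_excel(r'Absolute path')  # same Excel sheet as in the question
    for index, row in df.iterrows():
        p = Process(target=crawl_one, args=(row['data'], index))
        p.start()
        p.join(timeout=180)       # give each site at most 3 minutes
        if p.is_alive():
            p.terminate()         # stop this site and move on to the next one
            p.join()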

Python Multithreading Rest API

I download data over a REST API and wrote a module for it. The download takes, let's say, 10 seconds. During this time, the rest of the script in 'main' and in the module does not run until the download is finished.
How can I change that, e.g. by processing the download on another core?
I tried this code, but it does not do the trick (same lag). Then I tried to implement this approach, and it just gives me errors; I suspect 'map' does not work with 'wget.download'?
My code from the module:
from multiprocessing.dummy import Pool as ThreadPool
import urllib.parse
import os
import wget

# define the needed data
function = 'TIME_SERIES_INTRADAY_EXTENDED'
symbol = 'IBM'
interval = '1min'
slice = 'year1month1'
adjusted = 'true'
apikey = key[0].rstrip()

# create the URL
SCHEME = os.environ.get("API_SCHEME", "https")
NETLOC = os.environ.get("API_NETLOC", "www.alphavantage.co")  # query?
PATH = os.environ.get("API_PATH", "query")
query = urllib.parse.urlencode(dict(function=function, symbol=symbol, interval=interval, slice=slice, adjusted=adjusted, apikey=apikey))
url = urllib.parse.urlunsplit((SCHEME, NETLOC, PATH, query, ''))

# this is my original code to download the data (working, but slow and blocking the rest of the script)
wget.download(url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv')
# this is my attempt to speed things up via multithreading, based on code I found
pool = ThreadPool(4)
if __name__ == '__main__':
    futures = []
    for x in range(1):
        futures.append(pool.apply_async(wget.download, [url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv']))
    # futures is now a list of 10 futures.
    for future in futures:
        print(future.get())
Any suggestions, or do you see the error I am making?
OK, I figured it out, so I will leave it here in case someone else needs it.
I made a module called APIcall which has a function APIcall() that uses wget.download() to download my data.
In main, I create a function (called threaded_APIfunc) which calls the APIcall() function in my module APIcall:
import threading
import APIcall

def threaded_APIfunc():
    APIcall.APIcall(function, symbol, interval, slice, adjusted, apikey)
    print("Data Download complete for ${}".format(symbol))
And then I run threaded_APIfunc within a thread, like so:
threading.Thread(target=threaded_APIfunc).start()
print ('Start Downloading Data for ${}'.format(symbol))
What happens is that the .csv file gets downloaded in the background, while the main loop does not wait until the download is completed; it runs the code that comes after the threading call right away.
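For completeness, the earlier thread-pool attempt can also be made to work: apply_async expects the positional arguments as a tuple (or list), which is likely what broke the original call. A minimal sketch under that assumption (url is the Alpha Vantage URL built above):

from multiprocessing.dummy import Pool as ThreadPool
import wget

if __name__ == '__main__':
    pool = ThreadPool(4)
    out_path = 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv'
    future = pool.apply_async(wget.download, (url, out_path))
    # ... other work can run here while the download proceeds ...
    print(future.get())   # blocks only when the result is actually needed
    pool.close()
    pool.join()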

How to make for loop that will create processes in Python?

I'm only a beginner, so maybe the answer is obvious to you. I'm working on a simple program and I would like to use multiprocessing in it. I have a function (let's call it f) that makes requests to some URL. I would like to give the user the opportunity to choose the number of 'threads' (the number of function calls running simultaneously?).
I have tried making a for loop with a _ variable over a range the user can adjust with input(), but I don't know how to create the processes (I know how to do it 'manually', but not 'automatically'). Could you please help me?
if __name__ == '__main__':
    p1 = Process(target=f1)
You can use Pool to decide how many processes run at the same time, and you don't have to build a loop for this.
from multiprocessing import Pool
import requests

def f(url):
    print('url:', url)
    data = requests.get(url).json()
    result = data['args']
    return result

if __name__ == '__main__':
    urls = [
        'https://httpbin.org/get?x=1',
        'https://httpbin.org/get?x=2',
        'https://httpbin.org/get?x=3',
        'https://httpbin.org/get?x=4',
        'https://httpbin.org/get?x=5',
    ]
    number_of_processes = 2
    with Pool(number_of_processes) as p:
        results = p.map(f, urls)
    print(results)
It will start 2 processes with the first two URLs. When one of them finishes its job, it starts a process again with the next URL.
You can see a similar example with Pool in the documentation: multiprocessing
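Since the question asked to let the user pick the number of processes, that value can simply come from input(). A small sketch of that, reusing f and urls from the example above (the prompt text is just an illustration):

from multiprocessing import Pool

if __name__ == '__main__':
    number_of_processes = int(input('How many processes? '))  # e.g. the user types 4
    with Pool(number_of_processes) as p:
        results = p.map(f, urls)
    print(results)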

Python multiprocessing, using pool multiple times in a loop gets stuck after first iteration

I have the following situation, where I create a pool in a for loop as follows (I know it's not very elegant, but I have to do this for pickling reasons). Assume that pathos.multiprocessing is equivalent to Python's multiprocessing library (it is, up to some details that are not relevant to this problem).
I have the following code I want to execute:
self.pool = pathos.multiprocessing.ProcessingPool(number_processes)
for i in range(5):
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
Now my problem: the loop successfully runs the first iteration. However, at the second iteration the algorithm suddenly stops (it does not finish the pool.map operation). I suspected that zombie processes were generated, or that the process was somehow switched. Below you will find everything I have tried so far.
for i in range(5):
    pool = pathos.multiprocessing.ProcessingPool(number_processes)
    all_responses = self.pool.map(wrapper_singlerun, range(self.no_of_restarts))
    pool._clear()
    gc.collect()
    for p in multiprocessing.active_children():
        p.terminate()
        gc.collect()
    print("We have so many active children: ", multiprocessing.active_children())  # Returns []
The above code works perfectly well on my Mac. However, when I upload it to a cluster with the following specs, it gets stuck after the first iteration:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"
This is the link to pathos' multiprocessing library file.
I am assuming that you are trying to call this from some function, which is not the correct way to use it.
You need to wrap it in:
if __name__ == '__main__':
    for i in range(5):
        pool = pathos.multiprocessing.Pool(number_processes)
        all_responses = pool.map(wrapper_singlerun, range(self.no_of_restarts))
If you don't, it will keep creating copies of itself and putting them onto the stack, which will ultimately fill the stack and block everything. The reason it works on a Mac is that it uses fork, while Windows does not have it.
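For reference, here is a minimal, self-contained version of the guarded pattern the answer describes, written with the stdlib Pool for illustration (wrapper_singlerun here is a stand-in for the question's function, not the original one):

import multiprocessing

def wrapper_singlerun(i):
    # stand-in worker: the real function would run one restart of the algorithm
    return i * i

if __name__ == '__main__':
    number_processes = 4
    for _ in range(5):
        # a fresh pool per iteration, closed and joined by the context manager
        with multiprocessing.Pool(number_processes) as pool:
            all_responses = pool.map(wrapper_singlerun, range(10))
        print(all_responses)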

Selenium Multiprocessing Help in Python 3.4

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then over newspapers, and it outputs the collected data into CSVs. Currently this covers 3 search terms x 3 newspapers over 3 years, giving me 9 CSVs at about 10 minutes per CSV.
I would like to use multiprocessing to run each search and newspaper combination simultaneously, or at least faster. I've tried to follow other examples on here, but have not been able to implement them successfully. Below is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool

def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv = list_of_inputs[2]
        directory = list_of_inputs[3]
        os.chdir(directory)
        if search == broad:
            specification = "broad"
            relPapers = newsabbv
        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv
        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv
        else:
            pass
        for newspapers in relPapers:
            # ...rest of code here that gets the data and puts it in a list named all_Data...
            browser.close()
            df = pd.DataFrame(all_Data)
            df.to_csv(filename, index=False)
    except:
        print('error with item')

if __name__ == '__main__':
    # ...initializing values and things like that go here; this helps with the setup for search...
    # These are things that go into the function
    start = ["January", 2015]
    end = ["August", 2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]
    list_of_inputs = [start, end, newsabbv, directory]
    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
    print(list_of_inputs)
If I add a print statement in the main() function, it prints, but nothing else really ends up happening. I'd appreciate any and all help. I left out the code that collects the values and puts them into a list, since it's convoluted, but I know it works.
Thanks in advance for any and all help! Let me know if there is more information I can provide.
Isaac
EDIT: I have looked for more help online and realize that I misunderstood the purpose of mapping a list to a function using pool.map(fn, list). I have updated my code to reflect my current approach, which is still not working. I also moved the initializing values into the main function.
I don't think it can be multiprocessed your way, because the work still goes through a single, queue-like process (not the queue module) caused by Selenium.
The reason is that Selenium can only handle one window; it cannot handle several windows or browser tabs at the same time (a limitation of the window_handle feature). That means your multiple processes only parallelize the in-memory data processing around Selenium, while everything Selenium crawls still goes through one instance; trying to run all of Selenium's crawling in one script file makes Selenium the bottleneck.
The best way to get real multiprocessing is:
Make a script that uses Selenium to crawl the URL it is given, and save that script as a file, e.g. crawler.py; make sure the script has a print command to print the result.
e.g.:
# import all the modules that you need to run selenium
import sys

url = sys.argv[1]  # you will catch the url here
driver = ......    # open the browser
driver.get(url)
# just continue the script based on your method
print(--the result that you want--)
sys.exit(0)
I can't give a more detailed explanation here, because this is the main core of the process and only you understand what you want to do on that site.
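For concreteness, here is one way such a crawler.py could look: a minimal runnable sketch of my own, assuming Chrome is available and using the page title as a placeholder for whatever result you actually need:

import sys
from selenium import webdriver

url = sys.argv[1]             # the url passed on the command line
driver = webdriver.Chrome()   # assumes chromedriver is available on PATH
driver.get(url)
print(driver.title)           # placeholder: print whatever result you actually need
driver.quit()
sys.exit(0)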
Make another script file that:
a. Divides the URLs. Multiprocessing means creating several processes and running them together on all CPU cores, and the best way to start is by dividing the input; in your case that is probably the target URLs (you don't tell us which website you want to crawl), since every page of the website has a different URL. Just collect all the URLs and divide them into several groups (best practice: your CPU cores - 1).
e.g.:
import multiprocessing as mp
cpucore = int(mp.cpu_count()) - 1
b. Sends the URLs for processing to the crawler.py you already made (via subprocess, or another module, e.g. os.system). Make sure you run at most cpucore instances of crawler.py at once.
e.g.:
import subprocess
from subprocess import PIPE
from multiprocessing import Pool
import multiprocessing as mp

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl():
    global url1, url2, url3, url4
    # make a script that results in:
    # url1 = group or list of urls
    # url2 = group or list of urls
    # url3 = group or list of urls
    # url4 = group or list of urls

def target1():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your need...
        # do you see the combination between the python crawler and the url?
        # the cmd command will be: python crawler.py "value", and the "value" is caught by the sys.argv[1] command in crawler.py

def target2():
    for url in url2:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your need...

def target3():
    for url in url3:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your need...

def target4():
    for url in url4:
        t1 = subprocess.Popen(['python', crawler, url], stdout=PIPE)
        # continue the script based on your need...

cpucore = int(mp.cpu_count()) - 1
pool = Pool(processes=cpucore)  # max is the value of cpucore
for search in search_list:
    pool.map(target1, devideurl)
    pool.map(target2, devideurl)
    pool.map(target3, devideurl)
    pool.map(target4, devideurl)
    # you can add more, depending on your cpu cores
c. Reads the printed result back into the memory of the main script (see the sketch after this list).
d. Continues your script to process the data you have already got.
Finally, make the multiprocessing script for the whole process in the main script.
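A small sketch of step c (my own addition, not part of the original answer): reading the printed result of each crawler.py subprocess back into the main script with communicate():

import subprocess
from subprocess import PIPE

def run_crawler(url):
    proc = subprocess.Popen(['python', 'crawler.py', url], stdout=PIPE)
    out, _ = proc.communicate()      # wait for the child and capture its stdout
    return out.decode().strip()      # the text that crawler.py printed

# placeholder urls, just to show the call pattern
results = [run_crawler(u) for u in ['https://example.com', 'https://example.org']]
print(results)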
With this method you can open many browser windows and handle them at the same time, and because crawling data from a website is slower than processing data in memory, this approach at least reduces the bottleneck in the data flow; it should be faster than your previous method.
Hopefully helpful... cheers
