About Threads in Python [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I have a small Python program I wrote myself that scrapes a website for some prices. I am using BeautifulSoup 4 and Python's threading module.
The problem is that I don't know how to "control" the threads. As you can see from the code, I made subclasses of threading.Thread (something like producer and consumer). In one class I take links from the pages, and in the other I look for certain classes in the HTML with BS4 and write to the main file.
When I start the script, I normally start with Thread 1. I scrape every link on the website, taking the name and article price. For every link, I create a thread. As the website has many links (around 3,000), after some time I have so many threads that they are killing my computer. python.exe grows to around 2 GB and I have to kill the program.
This is my fourth day trying to find a solution. Please... :)
If I understand it correctly: setDaemon(True) means the program kills the threads after execution, and .join() waits for a thread to complete.
I am a total beginner in programming and am also aware that the code is a little messy. Any suggestions are welcome.
Don't worry about the last few try blocks; they are just for fun.
Thank you!
import threading
import csv
import urllib2
import time
from bs4 import BeautifulSoup
import re
import Queue

httpLink = "WWW.SOMEWEBSITE.COM"
fn = 'J:\\PRICES\\'

queue = Queue.Queue()
soup_queue = Queue.Queue()
brava = threading.Lock()

links = []
brokenLinks = []
pageLinks = []

fileName = time.strftime("%d_%m_%Y-%H_%M")

class TakeURL(threading.Thread):
    def __init__(self, queue, soup_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.soup_queue = soup_queue

    def run(self):
        while True:
            host = self.queue.get()

            try:
                url = urllib2.urlopen(host)
                chunk = url.read()
            except:
                print ("Broken link " + host)
                writeCSV("BrokenLinks.csv", "ab", host)
                brokenLinks.append(host)
                time.sleep(30)

            writeCSV('Links.csv','ab',host)

            if ("class=\"price\"" in chunk):
                self.soup_queue.put(chunk)
            else:
                writeCSV("LinksWithoutPrice.csv", "ab", host)

            try:
                findLinks(chunk, "ul", "mainmenu")
            except:
                print ("Broken Link" + host)
                writeCSV("BrokenLinks.csv", "ab", host)
                brokenLinks.append(host)
                time.sleep(30)

            self.queue.task_done()

class GetDataURL(threading.Thread):
    getDataUrlLock = threading.Lock()

    def __init__ (self, soup_queue):
        threading.Thread.__init__(self)
        self.soup_queue = soup_queue

    def run(self):
        while True:
            chunk = self.soup_queue.get()
            soup = BeautifulSoup(chunk)

            dataArticle = soup.findAll("tr",{"class":""})
            pagination = soup.findAll("a",{"class":"page"})

            self.getDataUrlLock.acquire()

            f = open(fn + fileName + ".csv", "ab")
            filePrice = csv.writer(f)

            for groupData in dataArticle:
                for articleName in groupData.findAll("a",{"class":"noFloat"}):
                    fullName = articleName.string.encode('utf-8')
                    print (fullName)

                for articlePrice in groupData.findAll("div", {"class":"price"}):
                    if (len(articlePrice) > 1):
                        fullPrice = articlePrice.contents[2].strip()
                    else:
                        fullPrice = articlePrice.get_text().strip()
                    print (fullPrice[:-12])
                    print ('-')*80

                filePrice.writerow([fullName, fullPrice[:-12]])

            f.close()

            for page in pagination:
                pageLink = page.get('href')
                pageLinks.append('http://www.' + pageLink[1:])

            self.getDataUrlLock.release()
            self.soup_queue.task_done()

def writeCSV(fileName, writeMode, link):
    try:
        brava.acquire()
        f = csv.writer(open(fn + fileName,writeMode))
        f.writerow([link])
    except IOError as e:
        print (e.message)
    finally:
        brava.release()

def findLinks(chunk, tagName, className):
    soup = BeautifulSoup(chunk)
    mainmenu = soup.findAll(tagName,{"class":className})
    for mm in mainmenu:
        for link in mm.findAll('a'):
            href = link.get('href')
            links.insert(0,href)
            print (href)
            print ('-')*80

def startMain(links):
    while (links):
        #time.sleep(10)
        threadLinks = links[-10:]
        print ("Alive Threads: " + str(threading.activeCount()))
        #time.sleep(1)

        for item in range(len(threadLinks)):
            links.pop()

        for i in range(len(threadLinks)):
            tu = TakeURL(queue, soup_queue)
            tu.setDaemon(True)
            tu.start()

        for host in threadLinks:
            queue.put(host)

        for i in range(len(threadLinks)):
            gdu = GetDataURL(soup_queue)
            gdu.setDaemon(True)
            gdu.start()

        queue.join()
        soup_queue.join()

if __name__ == "__main__":
    start = time.time()

    httpWeb = urllib2.urlopen(httpLink)
    chunk = httpWeb.read()
    findLinks(chunk, 'li','tab')

    startMain(links)
    pageLinks = list(set(pageLinks))
    startMain(pageLinks)
    startMain(brokenLinks)

    print ('-') * 80
    print ("Seconds: ") % (time.time() - start)
    print ('-') * 80

Your thread's run method never returns, so the thread never stops; it just keeps running its while loop. And since you're starting new threads for every batch of links, you keep piling up more and more threads while the earlier ones may not be doing anything. You essentially wouldn't need a queue the way you have it set up. This approach causes problems with a large number of jobs, as you're noticing.
worker = GetDataURL()
worker.start()
really points to GetDataURL.run(), which is an infinite while loop. The same is true for TakeURL.start().
You could go a couple of routes:
1) Take the while loop out of your thread, do away with the queues, and return the result at the end of the run definition. That way each thread has one task, returns its result, and then stops. Not the most efficient, but it requires the least amount of code modification.
2) In your startMain, outside of the while loop, start a group of, say, 10 threads (i.e. a thread pool). These 10 threads always run, and instead of starting a new thread for each link, you just put the link in the queue. When a thread is available, it runs the next item in the queue. But you still need to manage the cleanup of these threads.
3) Rework your code a bit more and make use of built-in helpers like thread pools and process pools. I've posted on process pools before: SO MultiProcessing.
With this method you can forget all the mess associated with locks too. After each pool.map (or whatever you use) you can write that chunk of information to the file in your startMain code. That cleans things up a lot (a minimal sketch follows below).
Hopefully that makes some sense. I chose not to modify your code because I think it's worth you experimenting with the options and choosing a direction.
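Here is a minimal sketch of the pool.map idea from option 3, assuming Python 2 and the urllib2 module from the question; fetch_page(), the example link list, and the CSV columns are hypothetical placeholders, and the BeautifulSoup parsing is left out.
import csv
import urllib2
from multiprocessing.pool import ThreadPool

def fetch_page(host):
    # Runs inside a worker thread: download one page, report failures as None.
    try:
        return host, urllib2.urlopen(host).read()
    except Exception:
        return host, None

if __name__ == "__main__":
    links = ["http://www.somewebsite.com/page1", "http://www.somewebsite.com/page2"]

    pool = ThreadPool(10)                  # only 10 threads alive at any time
    results = pool.map(fetch_page, links)  # blocks until every link has been fetched
    pool.close()
    pool.join()

    # All writing happens here in the main thread, so no locks are needed.
    with open("prices.csv", "ab") as f:
        writer = csv.writer(f)
        for host, chunk in results:
            if chunk is None:
                continue                   # or record it in BrokenLinks.csv
            writer.writerow([host, len(chunk)])  # parse prices with BeautifulSoup here instead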

Related

Run different tasks using multiprocessing Pool

I'm writing code in Python that is supposed to get all the links found at one URL, save them as key:value pairs (url: [links]), and then go over those links and do the same thing over and over until I have enough keys.
I've already done this with a list of threads, removing them from the list when they finished running, but I want to use a thread pool for easier maintenance.
To do this I made two functions:
get the content from the url and return it
extract the links from the content and return them
Now I want to manage those tasks with a thread pool, but I don't know how to do it properly because I don't know how to control the flow.
I can extract the links only after the GET request has returned the HTML page.
These are the functions I will use:
import re
import requests

def extract_links(response):
    arr_starts = [m.start() for m in re.finditer('href="https://', response.content)]
    arr_ends = []
    links = []
    for start in arr_starts:
        end_index = response.content.find('"', start + 6)
        arr_ends.append(end_index)
    for i in range(len(arr_starts)):
        link = response.content[arr_starts[i] + 6:arr_ends[i]]
        links.append(link)
    return links

def get_page(url):
    return requests.get(url)
And this is the code I wrote the first time:
first_url = r'https://blablabla'
hash_links = {}
thread_list = []

web_crawl(first_url, hash_links)

while len(hash_links.keys()) < 30:
    if len(thread_list) < MAX_THREAD_COUNT:
        for urls in hash_links.values():
            for url in urls:
                if url not in hash_links:
                    new_tread = threading.Thread(target=web_crawl, args=(url, hash_links))
                    thread_list.append(new_tread)
                    new_tread.start()
                    new_tread.join()
    else:
        for t in thread_list:
            if not t.isAlive():
                t.handled = True
        thread_list = [t for t in thread_list if not t.handled]

for key in hash_links.keys():
    print key + ':'
    for link in hash_links[key]:
        print '----' + link
Your problem seems to be that of producing content from a URL and then processing the links found at that URL as keys, while also scheduling those links for processing, and doing it all in parallel using a thread pool and semaphore objects.
If that is the case, I would point you to this article on semaphore objects and the thread pool.
Also, your problem sounds to me a lot like something that would benefit from a producer-consumer architecture, so I would also recommend this article.
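For illustration, a minimal producer-consumer sketch along those lines, assuming Python 2 and reusing get_page() and extract_links() from the question; MAX_THREAD_COUNT and the 30-key stop condition mirror the original code.
import threading
import Queue

MAX_THREAD_COUNT = 5
url_queue = Queue.Queue()
hash_links = {}
lock = threading.Lock()

def worker():
    while True:
        url = url_queue.get()
        try:
            links = extract_links(get_page(url))
            with lock:
                hash_links[url] = links
                if len(hash_links) < 30:              # stop producing once we have enough keys
                    for link in links:
                        if link not in hash_links:
                            url_queue.put(link)
        except Exception:
            pass                                      # skip URLs that fail to download or parse
        finally:
            url_queue.task_done()

for _ in range(MAX_THREAD_COUNT):
    t = threading.Thread(target=worker)
    t.daemon = True                                   # workers die with the main thread
    t.start()

url_queue.put(r'https://blablabla')
url_queue.join()                                      # block until the queue drains

for key in hash_links:
    print key + ':'
    for link in hash_links[key]:
        print '----' + link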

Splitting a task for multiple threads

I am working on a personal project: a brute-forcing program in Python. I have already written it, but now I want to make it faster by adding some threads. The problem is that the program has a for loop which repeats for every user/password pair, so if I just create some threads and join the main process to them, every thread repeats the same user/password pairs. I don't want that; I want every thread to have a different user/password pair to brute-force. Is there any way to tell a thread to grab this user/password and not that one, because that one is being used by another thread?
Thanks.
Here is the code:
import requests as r

user_list = ['a','b','c','d']
pass_list = ['e','f','g','h']

def main_part():
    for user,pwd in zip(user_list,pass_list):
        action_url = 'https:example.com'
        payload = {'user_email':user,'password':pwd}
        req = r.post(action_url,data=payload)
        print(req.content)
You can use multiprocessing to do what you want. You just need to define a function which handles a single user:
def brute_force_user(user, pwd):
    action_url = 'https:example.com'
    payload = {'user_email':user,'password':pwd}
    req = r.post(action_url,data=payload)
    print(req.content)
Then run it like this:
import multiprocessing
import os

pool = multiprocessing.Pool(os.cpu_count() - 1)
# starmap expects a single iterable of argument tuples, hence the zip
results = pool.starmap(brute_force_user, zip(user_list, pass_list))
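Since each request spends most of its time waiting on the network, a thread pool is a lighter-weight alternative; here is a hedged sketch that reuses brute_force_user() and the user_list/pass_list from above.
from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    pool = ThreadPool(4)                  # 4 requests in flight at a time
    pool.map(lambda pair: brute_force_user(*pair), zip(user_list, pass_list))
    pool.close()
    pool.join()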

Asynchronously get and store images in python

The following code is a sample of non-asynchronous code. Is there any way to get the images asynchronously?
import urllib

for x in range(0,10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the grequests library, but I couldn't figure out from the documentation whether this is possible or how to do it.
You don't need any third-party library. Just create a thread for every request, start the threads, and then either wait for all of them to finish in the background or continue your application while the images are being downloaded.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0,10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# Wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
map(lambda t: t.join(), threads)
Optionally you can create a thread pool that will get URLs and destinations from a queue.
If you're using Python 3, this is already implemented for you in the concurrent.futures module.
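A hedged sketch of that thread-pool variant, assuming Python 3; the http://test.com URLs are the placeholders from the question and the temp/ directory is assumed to exist.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

jobs = [("http://test.com/file %s.png" % x, "temp/file %s.png" % x) for x in range(10)]

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(urlretrieve, url, dest) for url, dest in jobs]
    for future in futures:
        future.result()   # re-raises any download error; the with block also waits for all of them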
Something like this should help you:
import grequests

urls = ['url1', 'url2', ....]  # this should be the list of urls
requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file with extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images is parallel; writing each image's content to a file is sequential, so you can create a thread or do something else to make that part parallel or asynchronous as well (see the sketch below).
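For what it's worth, a hedged sketch of offloading the file writes to a thread pool as well (Python 3's concurrent.futures), reusing the responses list and the hypothetical generate_file_name() from the snippet above.
from concurrent.futures import ThreadPoolExecutor

def save(response):
    if 199 < response.status_code < 400:
        name = generate_file_name()          # hypothetical helper, as in the snippet above
        with open(name, 'wb') as f:
            f.write(response.content)

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(save, responses))      # consuming the iterator surfaces any write errors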

Processing Result outside For Loop in Python

I have this simple code which fetches a page via urllib2:
browser_list = ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url,eachBrowser))
Now I can read the result via result.read(), but I was wondering whether all this functionality can be done outside the for loop, because otherwise the other URLs to be fetched have to wait until the current result has been processed.
I want to process result outside the for loop. Can this be done?
One way to do this may be to store result in a dictionary. What you can do is:
result = {}
for eachBrowser in browser_list:
    result[eachBrowser] = urllib2.urlopen(urljoin(user_string_url,eachBrowser))
and use result[BrowserName] outside the loop.
Hope this helps.
If you simply want to access all the results outside the loop, just append them to an array or dictionary as in the answer above.
Or, if you are trying to speed up your task, try multithreading.
import threading

class myThread (threading.Thread):
    def __init__(self, result):
        threading.Thread.__init__(self)
        self.result = result

    def run(self):
        # process your result (as self.result) here
        pass

browser_list = ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url,eachBrowser))
    myThread(result).start()  # starts processing result on another thread and continues the loop without waiting
It's a simple way of multithreading. It may break depending on how you process your results. Consider reading the documentation and some examples before you try it.
You can use threads for this:
import threading
import urllib2
from urlparse import urljoin

def worker(url):
    res = urllib2.urlopen(url)
    data = res.read()
    res.close()

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = 'http://www.useragentstring.com/'

for browser in browser_list:
    url = urljoin(user_string_url, browser)
    threading.Thread(target=worker, args=[url]).start()

# wait for everyone to complete
for thread in threading.enumerate():
    if thread == threading.current_thread(): continue
    thread.join()
Are you using Python 3? If so, you can use futures for this task:
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

browser_list = ['Chrome','Mozilla','Safari','Internet+Explorer','Opera']
user_string_url = "http://www.useragentstring.com/pages/"

def process_request(url, future):
    print("Processing:", url)
    print("Reading data")
    print(future.result().read())

with ThreadPoolExecutor(max_workers=10) as executor:
    submit = executor.submit
    for browser in browser_list:
        url = urljoin(user_string_url, browser) + '/'
        submit(process_request, url, submit(urlopen, url))
You could also do this with yield:
def collect_browsers():
    browser_list = ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
    user_string_url = "http://www.useragentstring.com/pages/"
    for eachBrowser in browser_list:
        yield eachBrowser, urllib2.urlopen(urljoin(user_string_url,eachBrowser))

def process_browsers():
    for browser, result in collect_browsers():
        do_something(result)
This is still a synchronous call (browser 2 will not fire until browser 1 has been processed), but it lets you keep the logic for dealing with the results separate from the logic managing the connections. You could of course also use threads to handle the processing asynchronously, with or without yield (see the sketch below).
Edit
Just re-read the OP and should repeat that yield does not provide multi-threaded, asynchronous execution, in case that was not clear in my first answer!
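A hedged sketch of combining the generator idea with a thread pool, assuming Python 2 with urllib2/urljoin as above; browser_list, user_string_url, and do_something() are reused from the surrounding answers.
from multiprocessing.pool import ThreadPool

def fetch(eachBrowser):
    return eachBrowser, urllib2.urlopen(urljoin(user_string_url, eachBrowser))

pool = ThreadPool(5)
# imap_unordered yields (browser, result) pairs as soon as each download finishes,
# so the fetching is parallel while the processing stays in the main thread.
for browser, result in pool.imap_unordered(fetch, browser_list):
    do_something(result)
pool.close()
pool.join()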

Fire off function without waiting for answer (Python)

I have a stream of links coming in, and I want to check them for RSS every now and then. But when I fire off my get_rss() function, it blocks and the stream halts. This is unnecessary, and I'd like to just fire-and-forget the get_rss() call (it stores its results elsewhere).
My code is like this:
self.ff.get_rss(url) # not async
print 'im back!'
(...)

def get_rss(url):
    page = urllib2.urlopen(url) # not async
    soup = BeautifulSoup(page)
I'm thinking that if I can fire-and-forget the first call, then I can even use urllib2 without worrying about it not being async. Any help is much appreciated!
Edit:
Trying out gevent, but like this nothing happens:
print 'go'
g = Greenlet.spawn(self.ff.do_url, url)
print g
print 'back'
# output:
go
<Greenlet at 0x7f760c0750f0: <bound method FeedFinder.do_url of <rss.FeedFinder object at 0x2415450>>(u'http://nyti.ms/SuVBCl')>
back
The greenlet seems to be registered, but the function self.ff.do_url(url) doesn't seem to run at all. What am I doing wrong?
Fire and forget using the multiprocessing module:
from multiprocessing import Process

def fire_and_forget(arg_one):
    # do stuff
    ...

def main_function():
    p = Process(target=fire_and_forget, args=(arg_one,))
    # you have to set daemon to True so you don't have to wait for the process to join
    p.daemon = True
    p.start()
    return "doing stuff in the background"
Here is sample code for a thread-based method invocation; additionally, a desired threading.stack_size can be set to boost performance.
import threading
import requests

# The stack size set by threading.stack_size is the amount of memory
# to allocate for the call stack in threads.
threading.stack_size(524288)

def alpha_gun(url, json, headers):
    #r = requests.post(url, data=json, headers=headers)
    r = requests.get(url)
    print(r.text)

def trigger(url, json, headers):
    threading.Thread(target=alpha_gun, args=(url, json, headers)).start()

url = "https://raw.githubusercontent.com/jyotiprakash-work/Live_Video_steaming/master/README.md"
payload = "{}"
headers = {
    'Content-Type': 'application/json'
}

for i in range(10):
    print(i)
    # fire the request only when the condition is met
    if i == 5:
        trigger(url=url, json=payload, headers=headers)
        print('invoked')
You want to use the threading module or the multiprocessing module and save the result in a database, a file, or a queue.
You can also use gevent.
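A minimal sketch of the threading-plus-queue variant, assuming Python 2 and the get_rss() function from the question (which stores its own results); the daemon flag means the main program never waits for the worker.
import threading
import Queue

rss_queue = Queue.Queue()

def rss_worker():
    while True:
        url = rss_queue.get()
        try:
            get_rss(url)                 # stores its results elsewhere, nothing to return
        finally:
            rss_queue.task_done()

worker = threading.Thread(target=rss_worker)
worker.daemon = True                     # do not block interpreter exit on this thread
worker.start()

rss_queue.put('http://nyti.ms/SuVBCl')   # returns immediately: fire and forget
print 'im back!'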
