Read file dynamically with Python and threads

I want a Python script that reads a file in an infinite loop (until it is stopped from the keyboard or the process is killed).
The input file is appended to dynamically, at both the top and the bottom.
The script should have 5 threads reading the file and removing the lines they have read.
I am having trouble with that: either a line from the input file is read more than once (which must not happen), or the threads don't update the file properly.
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen
import requests
import os

session = requests.Session()
rawBody = "\r\n"
i = 0
k = 0
lines = [line.rstrip() for line in open('input.txt')]
urls = lines

def fetch_url(url):
    global k
    try:
        print url
        return url, None, None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(5).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        #print ""
        i = i + 1
        #print("%r fetched in %ss" % (url, timer() - start))
    else:
        i = i + 1
        print error
        #print("error fetching %r: %s" % (url, error))
#print("Elapsed Time: %s" % (timer() - start,))
Here is a real example of how I want to use the script: the input file contains a large number of URLs of image files, and I want to check every URL for GPS data. The script needs to run continuously, waiting for new URLs even when the input file is empty. I need the threads so that several URLs are processed at once, because handling them one by one is too slow.
And I need a good-quality script, not a "quick and dirty" sample.
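A minimal sketch of one way to structure this (not from the original post), assuming Python 3: a single reader thread tails input.txt and feeds a queue, and five worker threads consume URLs from that queue. Having only one thread touch the file is what prevents duplicate reads; the GPS check itself is left as a placeholder.
import queue
import threading
import time

INPUT_FILE = "input.txt"   # assumed input path
NUM_WORKERS = 5

url_queue = queue.Queue()

def reader():
    # Single reader: drain the file, queue its lines, then truncate it.
    # Note: truncating while another process appends can race; a more robust
    # version would rename the file aside before reading it.
    while True:
        try:
            with open(INPUT_FILE, "r+") as f:
                lines = [line.strip() for line in f if line.strip()]
                f.seek(0)
                f.truncate()
        except FileNotFoundError:
            lines = []
        for line in lines:
            url_queue.put(line)
        time.sleep(1)  # wait for new URLs to be appended

def worker():
    while True:
        url = url_queue.get()
        try:
            print("processing", url)  # placeholder: fetch the image, check GPS data
        finally:
            url_queue.task_done()

threading.Thread(target=reader, daemon=True).start()
for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

try:
    while True:          # run until stopped from the keyboard
        time.sleep(1)
except KeyboardInterrupt:
    pass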

Related

How can I download multiple files with the web addresses found in a local .txt file

import wget

with open('downloadhlt.txt') as file:
    urls = file.read()
for line in urls.split('\n'):
    wget.download(line, 'localfolder')
For some reason the post wouldn't format properly, so I put the code above.
What I'm trying to do is work from a text file that has ~2 million lines like these:
http://halitereplaybucket.s3.amazonaws.com/1475594084-2235734685.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594100-2251426701.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594119-2270812773.hlt
I want to grab each line and request it, so that downloads happen in groups of more than 10 at a time. What I have currently downloads one item at a time, which is very time-consuming.
I tried looking at Ways to read/edit multiple lines in python, but the iteration there seems to be for editing, while mine is for multiple executions of wget.
I have not tried other methods, simply because this is the first time I have ever needed to make over 2 million download calls.
This should work fine. I'm a total newbie, so I can't really advise you on the number of threads to start.
These are my 2 cents anyway; hope it somehow helps.
I timed yours and mine over 27 downloads:
(base) MBPdiFrancesco:stack francesco$ python3 old.py
Elapsed Time: 14.542160034179688
(base) MBPdiFrancesco:stack francesco$ python3 new.py
Elapsed Time: 1.9618661403656006
And here is the code; you have to create a "downloads" folder first:
import wget
from multiprocessing.pool import ThreadPool
from time import time as timer

s = timer()
thread_num = 8

def download(url):
    try:
        wget.download(url, 'downloads/')
    except Exception as e:
        print(e)

if __name__ == "__main__":
    with open('downloadhlt.txt') as file:
        urls = file.read().split("\n")
    results = ThreadPool(thread_num).imap_unordered(download, urls)
    c = 0
    for i in results:
        c += 1
        print("Downloaded {} file{} so far".format(c, "" if c == 1 else "s"))
    print("Elapsed Time: {} seconds\nDownloaded {} files".format(timer() - s, c))

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using:
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list: it prints DONE or CAN'T DOWNLOAD for about half of the list it has processed, but I don't know what is happening with the rest. Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance.
OK, I have found an answer.
A likely culprit was that the script was getting stuck while connecting to or downloading from a URL, so I added a socket timeout to limit the time allowed to connect and download each image.
The issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url =    # preprocess the url from the input val
        local =  # filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue.
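As a possible alternative (not part of the answer above), urllib.request.urlopen accepts a per-call timeout, so instead of a process-wide socket.setdefaulttimeout() you can bound each download individually and stream the body to disk yourself:
import shutil
import urllib.request

def fetch(url, local_path, timeout=20):
    # Raises urllib.error.URLError (or a timeout error) if the server stalls
    # for longer than `timeout` seconds.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)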
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all of these instances: each of them tries to lock the I/O process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
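For illustration (not part of the answer above), and because the GIL is released during network I/O, a pool of threads is usually enough to download in parallel without separate processes; download_one here is a hypothetical helper and the URL list is a placeholder.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def download_one(url):
    # Fetch one URL and report how many bytes came back.
    with urllib.request.urlopen(url, timeout=20) as response:
        return url, len(response.read())

urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]  # placeholder list

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, size in pool.map(download_one, urls):
        print(url, size, "bytes")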

Lambda Python Pool.map and urllib2.urlopen: Retry only failing processes, log only errors

I have an AWS Lambda function which calls a set of URLs using pool.map. The problem is that if one of the URLs returns anything other than a 200, the Lambda function fails and immediately retries, and it retries the ENTIRE Lambda function. I'd like it to retry only the failed URLs, and if (after a second try) they still fail, call a fixed URL to log an error.
This is the code as it currently sits (with some details removed); it only works when every URL succeeds:
from __future__ import print_function

import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    results = pool.map(urllib2.urlopen, urls);

    # Close the pool and wait for the work to finish
    pool.close();
    return pool.join();
I tried reading the official documentation, but it seems to be a bit lacking when it comes to explaining the map function, specifically its return values.
Using the urlopen documentation, I've tried modifying my code to the following:
from __future__ import print_function

import urllib2
from multiprocessing.dummy import Pool as ThreadPool
import hashlib
import datetime
import json

print('Loading function')

def lambda_handler(event, context):
    f = urllib2.urlopen("https://example.com/geturls/?action=something");
    data = json.loads(f.read());
    urls = [];
    for d in data:
        urls.append("https://"+d+".example.com/path/to/action");

    # Make the Pool of workers
    pool = ThreadPool(4);

    # Open the urls in their own threads
    # and return the results
    try:
        results = pool.map(urllib2.urlopen, urls);
    except URLError:
        try:  # try once more before logging error
            urllib2.urlopen(URLError.url);  # TODO: figure out which URL errored
        except URLError:  # log error
            urllib2.urlopen("https://example.com/error/?url="+URLError.url);

    # Close the pool and wait for the work to finish
    pool.close();
    return true;  # always return true so we never duplicate successful calls
I'm not sure whether I'm handling the exceptions the right way, or even whether my Python exception syntax is correct. Again, my goal is to retry only the failed URLs, and if (after a second try) they still fail, call a fixed URL to log an error.
I figured out the answer thanks to a "lower-level" look at this question I posted here.
The answer was to create my own custom wrapper around the urllib2.urlopen function, since each thread needed its own try/except rather than wrapping the whole pool.map call. That function looks like this:
def my_urlopen(url):
    try:
        return urllib2.urlopen(url)
    except URLError:
        urllib2.urlopen("https://example.com/log_error/?url="+url)
        return None
I put that above the def lambda_handler function declaration; then, inside the handler, I could replace the whole try/except block, from this:
try:
    results = pool.map(urllib2.urlopen, urls);
except URLError:
    try:  # try once more before logging error
        urllib2.urlopen(URLError.url);
    except URLError:  # log error
        urllib2.urlopen("https://example.com/error/?url="+URLError.url);
To this:
results = pool.map(my_urlopen, urls);
Q.E.D.
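The wrapper above logs on the first failure; a small extension (a sketch, not part of the original answer) adds the single retry and the error-endpoint call that the question asked for, using the same urllib2 calls and example URLs:
def my_urlopen(url):
    # Try the URL twice before giving up.
    for attempt in range(2):
        try:
            return urllib2.urlopen(url)
        except urllib2.URLError:
            pass
    # Both attempts failed: report the URL to the fixed logging endpoint.
    try:
        urllib2.urlopen("https://example.com/error/?url=" + url)
    except urllib2.URLError:
        pass  # the logging endpoint itself is unreachable; nothing more to do
    return None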

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service (321cheese.com), by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct result (121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115, and so on, but never gives the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')
for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)
g.close()
There are multiple issues in your code.
The main issue is the use of the global name d across multiple threads. To fix it, pass the filename explicitly as an argument to downloadImage().
The easiest way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as the futures backport) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib

from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are not truly parallel! Use the multiprocessing module if you want real parallelism. But since the images probably all come from the same server, opening one hundred connections to it at the same time will likely get your connections dropped by its firewall.

How can I fix this multithreaded Python script?

I'm writing a Python script to read through a list of domains, find out what rating McAfee's SiteAdvisor service gives each one, and then output the domain and result to a CSV.
I've based my script on this previous answer. It uses urllib to scrape SiteAdvisor's page for the domain in question (not the best method, I know, but SiteAdvisor provides no alternative). Unfortunately, it fails to produce anything; I consistently get this error:
Traceback (most recent call last):
File "multi.py", line 55, in <module>
main()
File "multi.py", line 44, in main
resolver_thread.start()
File "/usr/lib/python2.6/threading.py", line 474, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Here is my script:
import threading
import urllib

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            content = urllib.urlopen("http://www.siteadvisor.com/sites/" + self.address).read(12000)
            search1 = content.find("didn't find any significant problems.")
            search2 = content.find('yellow')
            search3 = content.find('web reputation analysis found potential security')
            search4 = content.find("don't have the results yet.")
            if search1 != -1:
                result = "safe"
            elif search2 != -1:
                result = "caution"
            elif search3 != -1:
                result = "warning"
            elif search4 != -1:
                result = "unknown"
            else:
                result = ""
            self.result_dict[self.address] = result
        except:
            pass

def main():
    infile = open("domainslist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()
    for thread in threads:
        thread.join()
    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
    outfile.close()

if __name__ == '__main__':
    main()
Any help would be greatly appreciated.
It looks like you are trying to start too many threads.
Check how many items are in the [address.strip() for address in intext if address.strip()] list; I guess that is the problem here. There is a limit on the resources available for starting new threads.
The solution is to chunk your list into pieces of, say, 20 elements, process each chunk in 20 threads, wait for those threads to finish their jobs, and then pick up the next chunk. Repeat until all elements of your list are processed.
You can also use a thread pool for better thread management (I recently used this implementation).
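A rough sketch of the chunking idea described above, reusing the Resolver class from the question: start at most 20 threads, join them all, then move on to the next slice of the address list.
CHUNK_SIZE = 20

def resolve_in_chunks(addresses, results):
    for start in range(0, len(addresses), CHUNK_SIZE):
        chunk = addresses[start:start + CHUNK_SIZE]
        threads = [Resolver(address, results) for address in chunk]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # wait for this chunk before starting the next one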
There's probably an upper limit to the number of threads you can create, and you're probably exceeding it.
Suggestion: create a small, fixed number of Resolvers (under 10 will probably get you 90% of the possible parallelism benefit) and a thread-safe Queue from Python's queue library. Have the main thread dump all the domains into the queue, and have each Resolver take one domain at a time from the queue and work on it.
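A sketch of that fixed-worker layout (Python 2, to match the question); get_rating() is a hypothetical stand-in for the scraping logic in Resolver.run(), and each worker pulls addresses from a shared Queue until it is empty.
import threading
import Queue

NUM_WORKERS = 8  # small, fixed number of threads

def worker(task_queue, results):
    while True:
        try:
            address = task_queue.get_nowait()
        except Queue.Empty:
            return  # no more work left in the queue
        results[address] = get_rating(address)  # hypothetical helper

def resolve_all(addresses):
    task_queue = Queue.Queue()
    for address in addresses:
        task_queue.put(address)
    results = {}
    threads = [threading.Thread(target=worker, args=(task_queue, results))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results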
