I use Python 2.7, and I have a simple multithreaded MD5 dictionary brute-forcer:
# -*- coding: utf-8 -*-
import md5
import Queue
import threading
import traceback

md5_queue = Queue.Queue()

def Worker(queue):
    while True:
        try:
            item = md5_queue.get_nowait()
        except Queue.Empty:
            break
        try:
            work(item)
        except Exception:
            traceback.print_exc()
        queue.task_done()

def work(param):
    with open('pwds', 'r') as f:
        pwds = [x.strip() for x in f.readlines()]
    for pwd in pwds:
        if md5.new(pwd).hexdigest() == param:
            print '%s:%s' % (pwd, md5.new(pwd).hexdigest())

def main():
    global md5_queue
    md5_lst = []
    threads = 5
    with open('md5', "r") as f:
        md5_lst = [x.strip() for x in f.readlines()]
    for m in md5_lst:
        md5_queue.put(m)  # add md5 hash to queue
    for i in xrange(threads):
        t = threading.Thread(target=Worker, args=(md5_queue,))
        t.start()
    md5_queue.join()

if __name__ == '__main__':
    main()
It works in 5 threads. Each thread reads one hash from the queue and checks it against the list of passwords. Pretty simple: one thread, one check in a 'for' loop.
I want a little bit more: each thread should read a hash from the queue and then start several more threads to check the passwords against it (for example, 1 thread per hash with 10 password-checking threads inside it, or 20 hash threads with 20 brute-force threads each). Something like that.
How can I do this?
P.S. Sorry for my explanation; ask if you did not understand what I want.
P.P.S. It's not about bruting md5, it's about multi-threading.
Thanks.
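For reference, one way to structure what is being asked (one thread per hash, with each hash thread splitting the password list across a few checker threads) might look roughly like the sketch below. The function names and the 10-way split are illustrative, not taken from the code above:

# Hypothetical sketch of the nested-thread layout described in the question.
import md5
import threading

def check_chunk(target_hash, chunk, hits):
    # Each checker thread scans its own slice of the password list.
    for pwd in chunk:
        if md5.new(pwd).hexdigest() == target_hash:
            hits.append(pwd)

def crack_one_hash(target_hash, pwds, inner_threads=10):
    # Called from a per-hash thread; fans the work out to checker threads.
    chunk_size = max(1, len(pwds) // inner_threads)
    chunks = [pwds[i:i + chunk_size] for i in range(0, len(pwds), chunk_size)]
    hits = []
    checkers = [threading.Thread(target=check_chunk, args=(target_hash, chunk, hits))
                for chunk in chunks]
    for t in checkers:
        t.start()
    for t in checkers:
        t.join()
    for pwd in hits:
        print('%s:%s' % (pwd, target_hash))

Note that, as the answers below explain, this layout will not make the checks run in parallel on CPython because of the GIL.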
The default implementation of Python (called CPython) uses a Global Interpreter Lock (GIL) that effectively allows only one thread to execute Python bytecode at a time. For I/O-bound multithreaded applications this is not usually a problem, but for CPU-bound applications like yours it means you're not going to see much of a multicore speedup.
I'd suggest using a different Python implementation that doesn't have a GIL, such as Jython, or rewriting your code in a different language that doesn't have one. Writing it in natively compiled code is a good idea, but most scripting languages that have an MD5 function implement it in native code anyway, so I honestly wouldn't expect much of a speedup between a natively compiled language and a scripting language.
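A third option on CPython, not mentioned above, is the multiprocessing module, which uses separate processes and is therefore not limited by the GIL. A rough sketch, reusing the 'pwds' and 'md5' file names from the question (hashlib swapped in for the md5 module so it also runs on Python 3):

import hashlib
from multiprocessing import Pool

def load_lines(filename):
    with open(filename) as f:
        return [line.strip() for line in f]

PWDS = load_lines('pwds')  # loaded once per process at import time

def crack(target_hash):
    # Runs in a worker process, so the hashing really happens in parallel.
    for pwd in PWDS:
        if hashlib.md5(pwd.encode()).hexdigest() == target_hash:
            return '%s:%s' % (pwd, target_hash)
    return None

if __name__ == '__main__':
    pool = Pool(processes=4)
    for hit in pool.map(crack, load_lines('md5')):
        if hit:
            print(hit)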
I believe the following code would be considerably more efficient than your example:
from __future__ import with_statement

try:
    import md5
    digest = lambda text: md5.new(text).hexdigest()
except ImportError:
    import hashlib
    digest = lambda text: hashlib.md5(text.encode()).hexdigest()

def main():
    passwords = load_passwords('pwds')
    check_hashes('md5', passwords)

def load_passwords(filename):
    passwords = {}
    with open(filename) as file:
        for word in (line.strip() for line in file):
            passwords.setdefault(digest(word), []).append(word)
    return passwords

def check_hashes(filename, passwords):
    with open(filename) as file:
        for code in (line.strip() for line in file):
            for word in passwords.get(code, ()):
                print(word + ':' + code)

if __name__ == '__main__':
    main()
It has been written for both Python 2.x and 3.x and should run on either version. The key difference from your code is that it builds a digest-to-passwords dictionary once, so checking each target hash becomes a single dictionary lookup instead of re-hashing the whole wordlist.
Related
I have a question. I'm new to the Python async world, and I wrote some code to test the power of asyncio. I created 10 files with random content, named file1.txt, file2.txt, ..., file10.txt.
Here is my code:
import asyncio
import aiofiles
import time

async def reader(pack, address):
    async with aiofiles.open(address) as file:
        pack.append(await file.read())

async def main():
    content = []
    await asyncio.gather(*(reader(content, f'./file{_+1}.txt') for _ in range(10)))
    return content

def core():
    content = []
    for number in range(10):
        with open(f'./file{number+1}.txt') as file:
            content.append(file.read())
    return content

if __name__ == '__main__':
    # Asynchronous
    s = time.perf_counter()
    content = asyncio.run(main())
    e = time.perf_counter()
    print(f'Take {e - s: .3f}')

    # Synchronous
    s = time.perf_counter()
    content = core()
    e = time.perf_counter()
    print(f'Take {e - s: .3f}')
and got this result:
Asynchronous: Take 0.011
Synchronous: Take 0.001
Why does the asynchronous code take longer than the synchronous code?
Where did I go wrong?
I posted issue #110 on aiofiles's GitHub, and the author of aiofiles answered:
You're not doing anything wrong. What aiofiles does is delegate the file reading operations to a thread pool. This approach is going to be slower than just reading the file directly. The benefit is that while the file is being read in a different thread, your application can do something else in the main thread. A true, cross-platform way of reading files asynchronously is not available yet, I'm afraid :)
I hope this is helpful to anybody who has the same problem.
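To illustrate the author's point, here is a small sketch (file names as in the question; the heartbeat coroutine is just a hypothetical stand-in for other work) where the thread-pool file reads overlap with other coroutine work, which is where aiofiles actually pays off:

import asyncio
import aiofiles

async def read_file(path):
    async with aiofiles.open(path) as f:
        return await f.read()

async def heartbeat():
    # Stand-in for "something else" the event loop can do while files are read.
    for _ in range(3):
        print('still responsive')
        await asyncio.sleep(0.01)

async def main():
    contents, _ = await asyncio.gather(
        asyncio.gather(*(read_file(f'./file{n + 1}.txt') for n in range(10))),
        heartbeat(),
    )
    return contents

if __name__ == '__main__':
    asyncio.run(main())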
I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using:
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # Filename generation from global variables and rand stuffs...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for about half of the list, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
OK, I have found an answer.
A possible culprit was that the script was getting stuck connecting to or downloading from a URL. So I added a socket timeout to limit the time allowed to connect and download an image.
And now, the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url = # preprocess the url from the input val
        local = # Filename generation from global variables and rand stuffs...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue
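As a side note (not part of the fix above), urllib.request.urlopen accepts a per-call timeout, so an alternative to setting a process-wide socket default is something like this sketch, where save_image is a hypothetical helper:

import shutil
import urllib.request

def save_image(url, local, timeout=20):
    # urlopen takes a per-call timeout, unlike urlretrieve.
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(local, 'wb') as out:
        shutil.copyfileobj(resp, out)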
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really launches separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the I/O process; the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
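For completeness, a commonly suggested alternative for I/O-bound downloads like these is a plain thread pool, since network I/O releases the GIL while waiting. A rough sketch, not taken from the linked answers; download_one and the (url, filename) pairs are illustrative:

import urllib.request
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

def download_one(pair):
    url, local = pair
    urllib.request.urlretrieve(url, local)
    return local

def download_all(pairs, workers=4):
    # Threads suffice here: most of the time is spent waiting on the network.
    pool = ThreadPool(workers)
    try:
        return pool.map(download_one, pairs)
    finally:
        pool.close()
        pool.join()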
I am looking for a way to set the language on the fly when requesting a translation for a string with gettext. I'll explain why:
I have a multithreaded bot that responds to users by text on multiple servers, so it needs to reply in different languages.
The gettext documentation states that, to change the locale while running, you should do the following:
import gettext # first, import gettext
lang1 = gettext.translation('myapplication', languages=['en']) # Load all the translations
lang2 = gettext.translation('myapplication', languages=['fr'])
lang3 = gettext.translation('myapplication', languages=['de'])
# start by using language1
lang1.install()
# ... time goes by, user selects language 2
lang2.install()
# ... more time goes by, user selects language 3
lang3.install()
But this does not apply in my case, as the bot is multithreaded.
Imagine the two following snippets running at the same time:
import time
import gettext
lang1 = gettext.translation('myapplication', languages=['fr'])
lang1.install()
message(_("Loading a dummy task")) # This should be in French, and it will
time.sleep(10)
message(_("Finished loading")) # This should be in French too, but it won't :'(
and
import time
import gettext
lang = gettext.translation('myapplication', languages=['en'])
time.sleep(3) # Not requested on the same time
lang.install()
message(_("Loading a dummy task")) # This should be in English, and it will
time.sleep(10)
message(_("Finished loading")) # This should be in English too, and it will
You can see that messages are sometimes translated into the wrong locale.
But if I could do something like _("string", lang="FR"), the problem would disappear!
Have I missed something, or am I using the wrong module for the task?
I'm using Python 3.
While the above solutions seem to work, they don’t play well with the conventional _() function that aliases gettext(). But I wanted to keep that function, because it’s used to extract translation strings from the source (see docs or e.g. this blog).
Because my module runs in a multi-process and multi-threaded environment, using the application's built-in namespace or a module's global namespace wouldn't work: _() would be a shared resource, subject to race conditions if multiple threads install different translations.
So, first I wrote a short helper function that returns a translation closure:
import gettext

def get_translator(lang: str = "en"):
    trans = gettext.translation("foo", localedir="/path/to/locale", languages=(lang,))
    return trans.gettext
And then, in functions that use translated strings, I assigned that translation closure to _, thus making it the desired _() in the local scope of the function without polluting a globally shared namespace:
def some_function(...):
    _ = get_translator()  # Pass whatever language is needed.
    log.info(_("A translated log message!"))
(Extra brownie points for wrapping the get_translator() function into a memoizing cache to avoid creating the same closures too many times.)
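The memoizing cache mentioned above can be as simple as functools.lru_cache; a sketch along those lines, with the same hypothetical domain and locale path as before:

import functools
import gettext

@functools.lru_cache(maxsize=None)
def get_translator(lang: str = "en"):
    # Identical to the helper above, but each language's closure is built only once.
    trans = gettext.translation("foo", localedir="/path/to/locale", languages=(lang,))
    return trans.gettext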
You can just create translation objects for each language directly from .mo files:
from babel.support import Translations

def gettext(msg, lang):
    return get_translator(lang).gettext(msg)

def get_translator(lang):
    with open(f"path_to_{lang}_mo_file", "rb") as fp:
        return Translations(fp=fp, domain="name_of_your_domain")
And a dict cache for them can be easily thrown in there too.
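For example, the dict cache could look roughly like this, with the same hypothetical .mo path and domain as above:

from babel.support import Translations

_translations = {}

def get_translator(lang):
    # Build each Translations object once and reuse it on later calls.
    if lang not in _translations:
        with open(f"path_to_{lang}_mo_file", "rb") as fp:
            _translations[lang] = Translations(fp=fp, domain="name_of_your_domain")
    return _translations[lang]

def gettext(msg, lang):
    return get_translator(lang).gettext(msg)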
I took a moment to whip up a script that uses all the locales available on the system, and tries to print a well-known message in them. Note that "all locales" includes mere encoding changes, which are negated by Python anyway, and plenty of translations are incomplete so do use the fallback.
Obviously, you will also have to make appropriate changes to your use of xgettext (or equivalent) for your real code to identify the translating function.
#!/usr/bin/env python3
import gettext
import os

def all_languages():
    rv = []
    for lang in os.listdir(gettext._default_localedir):
        base = lang.split('_')[0].split('.')[0].split('#')[0]
        if 2 <= len(base) <= 3 and all(c.islower() for c in base):
            if base != 'all':
                rv.append(lang)
    rv.sort()
    rv.append('C.UTF-8')
    rv.append('C')
    return rv

class Domain:
    def __init__(self, domain):
        self._domain = domain
        self._translations = {}

    def _get_translation(self, lang):
        try:
            return self._translations[lang]
        except KeyError:
            # The fact that `fallback=True` is not the default is a serious design flaw.
            rv = self._translations[lang] = gettext.translation(self._domain, languages=[lang], fallback=True)
            return rv

    def get(self, lang, msg):
        return self._get_translation(lang).gettext(msg)

def print_messages(domain, msg):
    domain = Domain(domain)
    for lang in all_languages():
        print(lang, ':', domain.get(lang, msg))

def main():
    print_messages('libc', 'No such file or directory')

if __name__ == '__main__':
    main()
The following example uses the translation object directly, as shown in o11c's answer, to allow the use of threads:
import gettext
import threading
import time

def translation_function(quit_flag, language):
    lang = gettext.translation('simple', localedir='locale', languages=[language])
    while not quit_flag.is_set():
        print(lang.gettext("Running translator"), ": %s" % language)
        time.sleep(1.0)

if __name__ == '__main__':
    thread_list = list()
    quit_flag = threading.Event()
    try:
        for lang in ['en', 'fr', 'de']:
            t = threading.Thread(target=translation_function, args=(quit_flag, lang,))
            t.daemon = True
            t.start()
            thread_list.append(t)
        while True:
            time.sleep(1.0)
    except KeyboardInterrupt:
        quit_flag.set()
        for t in thread_list:
            t.join()
Output:
Running translator : en
Traducteur en cours d’exécution : fr
Laufenden Übersetzer : de
Running translator : en
Traducteur en cours d’exécution : fr
Laufenden Übersetzer : de
I would have posted this answer if I had known more about gettext. I am leaving my previous answer for folks who really want to continue using _().
The following simple example shows how to use a separate process for each translator:
import gettext
import multiprocessing
import time

def translation_function(language):
    try:
        lang = gettext.translation('simple', localedir='locale', languages=[language])
        lang.install()
        while True:
            print(_("Running translator"), ": %s" % language)
            time.sleep(1.0)
    except KeyboardInterrupt:
        pass

if __name__ == '__main__':
    thread_list = list()
    try:
        for lang in ['en', 'fr', 'de']:
            t = multiprocessing.Process(target=translation_function, args=(lang,))
            t.daemon = True
            t.start()
            thread_list.append(t)
        while True:
            time.sleep(1.0)
    except KeyboardInterrupt:
        for t in thread_list:
            t.join()
The output looks like this:
Running translator : en
Traducteur en cours d’exécution : fr
Laufenden Übersetzer : de
Running translator : en
Traducteur en cours d’exécution : fr
Laufenden Übersetzer : de
When I tried this using threads, I only got an English translation. You could create individual threads in each process to handle connections. You probably do not want to create a new process for each connection.
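A rough sketch of that last idea (a language-specific process that serves each connection from its own thread) might look like this; handle_connection and conn are hypothetical:

import gettext
import multiprocessing
import threading

def language_process(language, connections):
    # One process per language: install its translation once, then serve
    # every connection from a thread inside this process.
    lang = gettext.translation('simple', localedir='locale', languages=[language])
    lang.install()
    threads = [threading.Thread(target=handle_connection, args=(conn,))
               for conn in connections]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def handle_connection(conn):
    # conn is a hypothetical connection object; _() was installed by this
    # process, so every thread in it replies in the same language.
    conn.send(_("Running translator"))

# e.g. multiprocessing.Process(target=language_process, args=('fr', french_conns)).start()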
I'm trying to download a lot of data from Yahoo Finance using multiple threads. I'm using concurrent.futures.ThreadPoolExecutor to speed things up. Everything goes well until I consume all the available file descriptors (1024 by default).
When urllib.request.urlopen() raises an exception, the file descriptor is left open (no matter what socket timeout I use). Normally this file descriptor is reused if I run everything from a single (main) thread, so this problem doesn't occur. But when these exceptional urlopen() calls are made from ThreadPoolExecutor threads, the file descriptors are left open. The only solution I have come up with so far is to either use processes (ProcessPoolExecutor), which is very cumbersome and inefficient, or increase the number of allowed file descriptors to something really big (not all the potential users of my library are going to do this anyway). There must be a smarter way to deal with this problem.
I also wonder whether this is a bug in the Python libraries, or whether I'm just doing something wrong...
I'm running Python 3.4.1 on Debian (testing, kernel 3.10-3-amd64).
This is example code that demonstrates this behaviour:
import concurrent
import concurrent.futures
import urllib.request
import os
import psutil
from time import sleep

def fetchfun(url):
    urllib.request.urlopen(url)

def main():
    print(os.getpid())
    p = psutil.Process(os.getpid())
    print(p.get_num_fds())
    # this url doesn't exist
    test_url = 'http://ichart.finance.yahoo.com/table.csv?s=YHOOxyz' + \
        '&a=00&b=01&c=1900&d=11&e=31&f=2019&g=d'
    with concurrent.futures.ThreadPoolExecutor(1) as executor:
        futures = []
        for i in range(100):
            futures.append(executor.submit(fetchfun, test_url))
        count = 0
        for future in concurrent.futures.as_completed(futures):
            count += 1
            print("{}: {} (ex: {})".format(count, p.get_num_fds(), future.exception()))
    print(os.getpid())
    sleep(60)

if __name__ == "__main__":
    main()
When the HTTPError is raised, it saves a reference to the HTTPResponse object for the request as the fp attribute of the HTTPError. That reference gets saved in your futures list, which isn't destroyed until your program ends. That means there's a reference to the HTTPResponse being kept alive for your entire program. As long as that reference exists, the socket used in the HTTPResponse stays open. One way you can work around this is by explicitly closing the HTTPResponse when you handle the exception:
with concurrent.futures.ThreadPoolExecutor(1) as executor:
    futures = []
    for i in range(100):
        futures.append(executor.submit(fetchfun, test_url))
    count = 0
    for future in concurrent.futures.as_completed(futures):
        count += 1
        exc = future.exception()
        print("{}: {} (ex: {})".format(count, p.get_num_fds(), exc))
        exc.fp.close()  # Close the HTTPResponse
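Another option (a sketch, not from the answer above) is to close the response inside fetchfun itself, so the descriptor is released no matter how long the futures are kept around:

import urllib.error
import urllib.request

def fetchfun(url):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        e.fp.close()  # release the socket before the error is stored in the future
        raise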
Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service (321cheese.com) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc., but never gives the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')
for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)
g.close()
There are multiple issues in your code.
The main issue is the use of the global name d in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    # get_urls() is assumed to yield the image URLs (not shown here)
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism, but since the images are probably all from the same server, if you open one hundred connections to it at the same time, its firewall will probably start dropping your connections.