I am following one of the examples in a book I am reading ("Violent Python"): building a zip file password cracker from a dictionary. I have two questions about it. First, the book says to thread it, as I have written in the code below, to increase performance, but when I timed it (I know time.time() is not great for timing) the unthreaded version was about twelve seconds faster. Is this because starting the threads takes longer? Second, without threads I can stop as soon as the correct value is found by printing the result and calling exit(0). Is there a way to get the same result using threading, so that when I find the password I can end all the other threads at once?
import zipfile
from threading import Thread
import time

def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time() - starttime))

def main():
    start = time.time()
    z = zipfile.ZipFile('test.zip')
    pwdfile = open('words.txt')
    pwds = pwdfile.read()
    pwdfile.close()
    for pwd in pwds.splitlines():
        t = Thread(target=extractFile, args=(z, pwd, start))
        t.start()
        #extractFile(z, pwd, start)
    print(str(time.time() - start))

if __name__ == '__main__':
    main()
In CPython, the Global Interpreter Lock ("GIL") enforces the restriction that only one thread at a time can execute Python bytecode.
So in this application it is probably better to use the map method of a multiprocessing.Pool, since every attempt is independent of the others:
import multiprocessing
import zipfile

def tryfile(password):
    rv = password
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password)  # note: pwd must be bytes on Python 3
        except:
            rv = None
    return rv

with open('words.txt') as pwdfile:
    data = pwdfile.read()
pwds = data.split()

p = multiprocessing.Pool()
results = p.map(tryfile, pwds)
results = [r for r in results if r is not None]
This will start (by default) as many processes as your computer has cores. It will keep running tryfile() with different passwords in these processes until the list pwds is exhausted, then gather the results and return them. The last list comprehension discards the None results.
Note that this code could be improved to shut down the pool as soon as the password is found. You'd probably have to use map_async and a shared variable in that case. It would also be nice to load the zipfile only once and share that.
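For instance, here is a minimal sketch of that early exit; it swaps map for imap_unordered (rather than map_async), since that yields results as workers finish, and reuses the hypothetical test.zip and words.txt from above:

import multiprocessing
import zipfile

def tryfile(password):
    # each call opens its own handle, since ZipFile objects can't be
    # shared across processes without extra work
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password.encode())  # pwd must be bytes on Python 3
        except Exception:
            return None
    return password

if __name__ == '__main__':
    with open('words.txt') as pwdfile:
        pwds = pwdfile.read().split()
    with multiprocessing.Pool() as p:
        # imap_unordered yields each result as soon as a worker produces it,
        # so we can terminate the pool on the first hit
        for result in p.imap_unordered(tryfile, pwds):
            if result is not None:
                print('PWD IS', result)
                p.terminate()
                break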
This code is slow because Python has a Global Interpreter Lock, which means only one thread can execute Python bytecode at a time. This tends to make multithreaded CPU-bound code run no faster, and often slower, than serial code. If you want a truly parallel application, you have to use the multiprocessing module.
To break out of the threads and get the return value, you can use os._exit(1). First, import the os module at the top of your file:
import os
Then, change your extractFile function to use os._exit(1):
def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time() - starttime))
        os._exit(1)  # kills the whole process immediately, taking every other thread with it
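If killing the interpreter outright feels too blunt, a gentler (hypothetical) variant uses a shared threading.Event so the threads stop cooperatively once one of them succeeds:

import threading

found = threading.Event()  # set once any thread cracks the password

def extractFile(z, password, starttime):
    if found.is_set():
        return  # another thread already found it
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        found.set()  # tell main() to stop launching threads
        print('PWD IS ' + password)
        print(str(time.time() - starttime))

The loop in main() would then check if found.is_set(): break before starting each new thread.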
I am using multiprocessing to perform jobs in parallel. My goal is to use multiple CPU cores, hence I chose the multiprocessing module instead of the threading module.
Now I have a method which uses the subprocess module to execute a Linux shell command; I need to filter the output and update the results to a DB.
For every worker, the subprocess execution time may differ: some inputs may take 10 seconds, others 15 seconds.
My concern is whether each worker will always get its own execution result, or whether results can get mixed up between workers,
or whether I have to go for a locking mechanism. If yes, can you provide an example that suits my requirement?
Below is the example code:
#!/usr/bin/env python
import json
from subprocess import check_output
import multiprocessing

class Test:
    # Convert bytes to UTF-8 string
    @staticmethod
    def bytes_to_string(string_convert):
        if not isinstance(string_convert, bytes) and isinstance(string_convert, str):
            return string_convert, True
        elif isinstance(string_convert, bytes):
            string_convert = string_convert.decode("utf-8")
        else:
            print("Passed in non-byte type to convert to string: {0}".format(string_convert))
            return "", False
        return string_convert, True

    # Execute commands in Linux shell
    @staticmethod
    def command_output(command):
        try:
            output = check_output(command)
        except Exception as e:
            return e, False
        output, state = Test.bytes_to_string(output)
        return output, True

    @staticmethod
    def run_multi(num):
        test_result, success = Test.command_output(["curl", "-sb", "-H", "Accept: application/json", "http://127.0.0.1:5500/stores"])
        out = json.loads(test_result)
        # Is it safe to update the database here, or do I need locks?

if __name__ == '__main__':
    test = Test()
    input_list = list(range(0, 1000))
    numberOfThreads = 100
    p = multiprocessing.Pool(numberOfThreads)
    p.map(test.run_multi, input_list)
    p.close()
    p.join()
Depends on what sort of updates you're doing in the database...
If it's a full database, it'll have its own locking mechanisms; you'll need to work with them, but other than that it's already designed to handle concurrent access.
For example, if the update involves inserting a row, you can just do that; the database will end up with all the rows, each exactly once.
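For instance, a minimal sketch of an insert done independently from each worker (sqlite3 here purely for illustration; the stores table and payload column are made up):

import sqlite3

def save_result(row):
    # each worker process opens its own connection; connection objects
    # must not be shared across processes
    conn = sqlite3.connect('results.db', timeout=30)  # timeout waits out other writers
    with conn:  # commits on success, rolls back on error
        conn.execute('INSERT INTO stores (payload) VALUES (?)', (row,))
    conn.close()

The database's own locking serializes the writes, so each row ends up inserted exactly once with no extra locks in your Python code.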
I want to implement a function for finding the password of an Excel file. I use both threading and Process to run the code, but it raises a question: the run time is really the same whether I use 4 threads, 8 threads, or just the plain function to find the password.
Here are my steps:
The first step is to use itertools to create the password dictionary:
import itertools, os, time
from win32com.client import Dispatch
from multiprocessing import Process
import threading
import pywintypes
import pythoncom

time_begin = time.time()

# setting the password words
path = 'C:/Users/M_Seas/Desktop/file.xlsx'
words = 'wulong'
passw = []
pw = itertools.product(words, repeat=6)

# write all combinations into the dictionary file and the list
dic = open("password.txt", "w")
for word in pw:
    dic.write(''.join(word) + '\n')
dic.close()

with open("password.txt", "r") as f:
    for word in f.readlines():
        passw.append(word.strip('\n'))
The next step is to create the function mainstep(), which tries each combination to open the Excel file:
def mainstep(file, passwords):
    pythoncom.CoInitialize()
    excel = Dispatch("Excel.Application")
    excel.Visible = False
    for password in passwords:
        try:
            excel.Workbooks.Open(file, False, True, None, Password=password)
        except pywintypes.com_error:
            excel.Application.Quit()
            pass
        else:
            print('Success ' + password)
            excel.Application.Quit()
            time_end = time.time()
            print('time:{} seconds '.format(time_end - time_begin))
            break
Finally, create the processes or threads to speed things up:
if __name__ == '__main__':
    NUMBER_OF_PROCESSES = 10
    groups = [passw[i::NUMBER_OF_PROCESSES] for i in range(NUMBER_OF_PROCESSES)]
    '''test in two ways'''
    # for group in groups:
    #     t = threading.Thread(target=mainstep, args=(path, group))
    #     t.start()
    for group in groups:
        p = Process(target=mainstep, args=(path, group))
        p.start()
I tried setting NUMBER_OF_PROCESSES to 4, 8, and 10; the run time is the same whether I use threading or multiprocessing.
I also tested just calling mainstep() by itself to see the original run time; it is the same as when I use multiprocessing or multithreading.
Is there any way to really speed this up?
I wanted to make a Python module with a convenience function for running commands in parallel using Python 3.7 on Windows (for az cli commands).
I wanted to make a function that:
Was easy to use: Just pass a list of commands as strings, and have them execute in parallel.
Let me see the output generated by the commands.
Used built-in Python libraries
Worked equally well on Windows and Linux (Python multiprocessing uses fork(), and Windows doesn't have fork(), so multiprocessing code sometimes works on Linux but not on Windows.)
Could be made into an importable module for greater convenience.
This was surprisingly difficult; I think maybe it used to not be possible in older versions of Python? (I saw several 2-8 year old Q&As that said you had to use if __name__ == '__main__': to pull off parallel processing, but I discovered that didn't work in a consistently predictable way when it came to making an importable module.)
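For reference, the guard those answers meant looks like this minimal sketch (square is just a placeholder function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Windows has no fork(), so multiprocessing spawns fresh interpreters that
    # re-import this module; without the guard, every re-import would try to
    # create another Pool and spawn workers recursively.
    with Pool(4) as pool:
        print(pool.map(square, range(8)))

My original attempt looked like this: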
def removeExtraLinesFromString(inputstring):
    stringtoreturn = ""
    for line in inputstring.split("\n"):
        if len(line.strip()) > 0:  # Only add non-empty lines to the stringtoreturn
            stringtoreturn = stringtoreturn + line  # note: the newline itself is dropped
    return stringtoreturn

def runCmd(cmd):  # string of a command passed in here
    from subprocess import run, PIPE
    stringtoreturn = str(run(cmd, shell=True, stdout=PIPE).stdout.decode('utf-8'))
    stringtoreturn = removeExtraLinesFromString(stringtoreturn)
    return stringtoreturn

def exampleOfParrallelCommands():
    if __name__ == '__main__':  # I don't like this method, because it doesn't work when imported; refactoring attempts led to infinite loops and unexpected behavior.
        from multiprocessing import Pool
        cmd = "python -c \"import time;time.sleep(5);print('5 seconds have passed')\""
        cmds = []
        for i in range(12):  # If this were running in series it'd take at least a minute to sleep 5 seconds 12 times
            cmds.append(cmd)
        with Pool(processes=len(cmds)) as pool:
            results = pool.map(runCmd, cmds)  # results is a list of cmd output
        print(results[0])
        print(results[1])
        return results
When I tried importing this as a module it didn't work (which makes sense because of the if statement), so I tried rewriting the code to move the if statement around. I think I removed it once, which caused my computer to go into a loop until I shut the program down. Another time I was able to import the module into another python program, but to make that work I had to add __name__ == '__main__' to the importing program, which is not very intuitive.
I almost gave up, but after 2 days of searching through tons of Python websites and SO posts I finally figured out how to do it after seeing user jfs's code in this Q&A (Python: execute cat subprocess in parallel). I modified his code so it'd better fit into an answer to my question.
toolbox.py
def removeExtraLinesFromString(inputstring):
    stringtoreturn = ""
    for line in inputstring.split("\n"):
        if len(line.strip()) > 0:  # Only add non-empty lines to the stringtoreturn
            stringtoreturn = stringtoreturn + line
    return stringtoreturn

def runCmd(cmd):  # string of a command passed in here
    from subprocess import run, PIPE
    stringtoreturn = str(run(cmd, shell=True, stdout=PIPE).stdout.decode('utf-8'))
    stringtoreturn = removeExtraLinesFromString(stringtoreturn)
    return stringtoreturn

def runParallelCmds(listofcommands):
    from multiprocessing.dummy import Pool  # thread pool
    from subprocess import Popen, PIPE, STDOUT
    listofprocesses = [Popen(listofcommands[i], shell=True, stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
                       for i in range(len(listofcommands))]
    # Python calls this a list comprehension; it's a way of making a list
    def get_outputs(process):  # the thread pool's map requires a function, thus this definition
        return process.communicate()[0]  # process is an object of type subprocess.Popen
    outputs = Pool(len(listofcommands)).map(get_outputs, listofprocesses)  # outputs is a list of bytes
    listofoutputstrings = []
    for i in range(len(listofcommands)):
        outputasstring = removeExtraLinesFromString(outputs[i].decode('utf-8'))  # .decode('utf-8') converts bytes to string
        listofoutputstrings.append(outputasstring)
    return listofoutputstrings
main.py
from toolbox import runCmd  # (cmd)
from toolbox import runParallelCmds  # (listofcommands)

listofcommands = []
cmd = "ping -n 2 localhost"
listofcommands.append(cmd)
cmd = "python -c \"import time;time.sleep(5);print('5 seconds have passed')\""
for i in range(12):
    listofcommands.append(cmd)  # If 12 processes each sleep 5 seconds, this taking less than 1 minute proves parallel processing
outputs = runParallelCmds(listofcommands)
print(outputs[0])
print(outputs[1])
output:
Pinging neokylesPC [::1] with 32 bytes of data:
Reply from ::1: time<1ms Reply from ::1: time<1ms Ping statistics
for ::1:
Packets: Sent = 2, Received = 2, Lost = 0 (0% loss), Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum = 0ms, Average = 0ms
5 seconds have passed
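For comparison, a minimal sketch of the same idea using concurrent.futures (also in the standard library): threads are enough here because each task just waits on an external process, so the GIL is not a bottleneck, and a thread pool avoids the __main__ guard problem entirely.

from concurrent.futures import ThreadPoolExecutor
from subprocess import run, PIPE

def runCmd(cmd):  # same contract as the toolbox version: command string in, output string out
    return run(cmd, shell=True, stdout=PIPE).stdout.decode('utf-8')

def runParallelCmds(listofcommands):
    # one thread per command; each thread blocks on its own subprocess
    with ThreadPoolExecutor(max_workers=len(listofcommands)) as executor:
        return list(executor.map(runCmd, listofcommands))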
I'd like to read from two (or more) serial ports (/dev/ttyUSB0 etc) at the same time in python on Linux. I want to read complete lines from each port (whichever has data) and process the results in the order received (without race conditions). As a simple example could just write the lines to a single merged file.
I assume the way to do this is based on pyserial, but I can't quite figure out how to do it. Pyserial has non-blocking reads using asyncio and using threads. Asyncio is marked as experimental. I assume there wouldn't be any race conditions if the processing is done in asyncio.Protocol.data_received(). In the case of threads, the processing would probably have to be protected by a mutex.
Perhaps this can also be done without pyserial: the two serial ports can be opened as files and then read whenever data is available using select().
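For that last route, a minimal sketch with the standard selectors module (assuming Linux, where pyserial objects expose fileno(); real code would also need to buffer partial reads until each port delivers a full line):

import selectors
import serial

sel = selectors.DefaultSelector()
for dev in ('/dev/ttyUSB0', '/dev/ttyUSB1'):
    port = serial.Serial(dev, timeout=0)  # non-blocking reads
    sel.register(port, selectors.EVENT_READ)

with open('merged.log', 'wb') as log:
    while True:
        for key, _ in sel.select():  # blocks until some port has data
            port = key.fileobj
            log.write(port.read(port.in_waiting or 1))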
Consider using aioserial.
Here's an example:
import asyncio
import concurrent.futures
import queue

import aioserial

async def readline_and_put_to_queue(
        aioserial_instance: aioserial.AioSerial,
        q: queue.Queue):
    while True:
        q.put(await aioserial_instance.readline_async())

async def process_queue(q: queue.Queue):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        while True:
            line: bytes = await asyncio.get_running_loop().run_in_executor(
                executor, q.get)
            print(line.decode(errors='ignore'), end='', flush=True)
            q.task_done()

q: queue.Queue = queue.Queue()

aioserial_ttyUSB0: aioserial.AioSerial = \
    aioserial.AioSerial(port='/dev/ttyUSB0')
aioserial_ttyUSB1: aioserial.AioSerial = \
    aioserial.AioSerial(port='/dev/ttyUSB1', baudrate=115200)

asyncio.run(asyncio.wait([
    readline_and_put_to_queue(aioserial_ttyUSB0, q),
    readline_and_put_to_queue(aioserial_ttyUSB1, q),
    process_queue(q),
]))
As suggested by @AlexHall in a comment, here is a solution that uses one thread for each serial port and a queue to synchronize access:
import serial
import queue
import threading

line_queue = queue.Queue(1000)

def serial_read(s):
    while True:
        line = s.readline()
        line_queue.put(line)

serial0 = serial.Serial('/dev/ttyUSB0')
serial1 = serial.Serial('/dev/ttyUSB1')
threading.Thread(target=serial_read, args=(serial0,)).start()
threading.Thread(target=serial_read, args=(serial1,)).start()

while True:
    line = line_queue.get(True, 1)  # raises queue.Empty if nothing arrives for 1 second
    print(line)
It may be possible to write this more elegantly, but it works.
You could try reading the values in order and storing them in variables:

a = data1.read()
b = data2.read()

And afterwards process them in order:

if len(a) != 0 or len(b) != 0:
    # process a
    # process b

Using this method, if one or both of the variables has data, you process it.
Suppose you're running Django on Linux, and you've got a view, and you want that view to return the data from a subprocess called cmd that operates on a file that the view creates, for example like so:
def call_subprocess(request):
    response = HttpResponse()
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(request.GET['data'])  # i.e. some data
        # cmd operates on f.name and returns output
        p = subprocess.Popen(["cmd", f.name],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        out, err = p.communicate()
    response.write(out)  # would be text/plain...
    return response
Now, suppose cmd has a very slow start-up time, but a very fast operating time, and it does not natively have a daemon mode. I would like to improve the response-time of this view.
I would like to make the whole system run much faster by starting a number of instances of cmd in a worker pool, having them wait for input, and having call_process ask one of those worker-pool processes to handle the data.
This is really 2 parts:
Part 1. A function that calls cmd and cmd waits for input. This could be done with pipes, i.e.
def _run_subcmd(fname):
    p = subprocess.Popen(["cmd", fname],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    # write 'out' to a tmp file
    o = open("out.txt", "w")
    o.write(out)
    o.close()
    exit()

def _run_cmd(data):
    f = tempfile.NamedTemporaryFile("w")
    pipe = os.mkfifo(f.name)
    if os.fork() == 0:
        _run_subcmd(f.name)
    else:
        f.write(data)
    r = open("out.txt", "r")
    out = r.read()
    # read 'out' from a tmp file
    return out

def call_process(request):
    response = HttpResponse()
    out = _run_cmd(request.GET['data'])
    response.write(out)  # would be text/plain...
    return response
Part 2. A set of workers running in the background that are waiting for data. That is, we want to extend the above so that the subprocesses are already running; e.g. when the Django instance initializes, or when call_process is first called, a set of these workers is created:
WORKER_COUNT = 6
WORKERS = []

class Worker(object):
    def __init__(self, index):
        self.tmp_file = tempfile.NamedTemporaryFile("w")  # get a tmp file name
        os.mkfifo(self.tmp_file.name)
        self.p = subprocess.Popen(["cmd", self.tmp_file.name],
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.index = index

    def run(self, out_filename, data):
        WORKERS[self.index] = None  # qua-mutex??
        self.tmp_file.write(data)
        if os.fork() == 0:  # does the child have access to self.p??
            out, err = self.p.communicate()
            o = open(out_filename, "w")
            o.write(out)
            exit()
        self.tmp_file.close()
        WORKERS[self.index] = Worker(self.index)  # replace this one
        return out_filename

    @classmethod
    def get_worker(cls):
        # get the next worker
        # ... static, incrementing index
        ...
There should be some initialization of workers somewhere, like this:
def init_workers():  # create WORKER_COUNT workers
    for i in xrange(0, WORKER_COUNT):
        WORKERS.append(Worker(i))
Now, what I have above becomes something like so:
def _run_cmd(data):
    worker = Worker.get_worker()  # this needs to be atomic & lock the worker at Worker.index
    fifo = open(tempfile.NamedTemporaryFile("r").name)  # this stores output of cmd
    worker.run(fifo.name, data)
    # please ignore the fact that everything will be
    # appended to out.txt ... these will be tmp files, too, but named elsewhere.
    out = fifo.read()  # read 'out' from a tmp file
    return out

def call_process(request):
    response = HttpResponse()
    out = _run_cmd(request.GET['data'])
    response.write(out)  # would be text/plain...
    return response
Now, the questions:
Will this work? (I've just typed this off the top of my head into StackOverflow, so I'm sure there are problems, but conceptually, will it work?)
What are the problems to look for?
Are there better alternatives to this? e.g. Could threads work just as well (it's Debian Lenny Linux)? Are there any libraries that handle parallel process worker-pools like this?
Are there interactions with Django that I ought to be conscious of?
Thanks for reading! I hope you find this as interesting a problem as I do.
Brian
It may seem like I am punting this product, as this is the second time I have responded with a recommendation of it.
But it seems like you need a message queuing service, in particular a distributed message queue.
Here is how it will work:
Your Django app requests CMD
CMD gets added to a queue
CMD gets pushed to several workers
It is executed and the results are returned upstream
Most of this code exists, and you don't have to go about building your own system.
Have a look at Celery, which was initially built with Django.
http://www.celeryq.org/
http://robertpogorzelski.com/blog/2009/09/10/rabbitmq-celery-and-django/
Issy already mentioned Celery, but since comments don't work well with code samples, I'll reply as an answer instead.
You should try to use Celery synchronously with the AMQP result store.
You could distribute the actual execution to another process or even another machine. Executing synchronously in celery is easy, e.g.:
>>> from celery.task import Task
>>> from celery.registry import tasks

>>> class MyTask(Task):
...
...     def run(self, x, y):
...         return x * y

>>> tasks.register(MyTask)
>>> async_result = MyTask.delay(2, 2)
>>> retval = async_result.get()  # Now synchronous
>>> retval
4
The AMQP result store makes sending back the result very fast,
but it's only available in the current development version (in code-freeze to become
0.8.0)
How about "daemonizing" the subprocess call using python-daemon or its successor, grizzled?
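If you go that route, the skeleton with python-daemon is roughly this (a sketch; serve_forever is a hypothetical loop that would feed cmd its input and collect results):

import daemon

def serve_forever():
    # hypothetical: read requests (e.g. from a FIFO), run "cmd" once
    # per request, and write the results back
    ...

with daemon.DaemonContext():
    serve_forever()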