Multiprocessing to speed up for loop - python

Just trying to learn, and I'm wondering if multiprocessing would speed
up this for loop. I'm trying to compare
alexa_white_list (1,000,000 lines) and
dnsMISP (can grow to 160,000 lines).
The code checks each line in dnsMISP and looks for it in alexa_white_list;
if it doesn't find it, it adds the line to a blacklist.
Without the mp_handler function the code works fine, but it takes
around 40-45 minutes. For brevity, I've omitted all the other imports and
the function that pulls down and unzips the Alexa white list.
The code below gives me the following error -
File "./vetdns.py", line 128, in mp_handler
p.map(dns_check,dnsMISP,alexa_white_list)
NameError: global name 'dnsMISP' is not defined
from multiprocessing import Pool

def dns_check():
    awl = []
    blacklist = []
    ctr = 0
    dnsMISP = open(INPUT_FILE, "r")
    dns_misp_lines = dnsMISP.readlines()
    dnsMISP.close()
    alexa_white_list = open(outname, 'r')
    alexa_white_list_lines = alexa_white_list.readlines()
    alexa_white_list.close()
    print "converting awl to proper format"
    for line in alexa_white_list_lines:
        awl.append(".".join(line.split(".")[-2:]).strip())
    print "done"
    for host in dns_misp_lines:
        host = host.strip()
        host = ".".join(host.split(".")[-2:])
        if not host in awl:
            blacklist.append(host)
    file_out = open(FULL_FILENAME, "w")
    file_out.write("\n".join(blacklist))
    file_out.close()

def mp_handler():
    p = Pool(2)
    p.map(dns_check, dnsMISP, alexa_white_list)

if __name__ == '__main__':
    mp_handler()
If I declare it as global etc. I still get the error. I'd appreciate any
suggestions!

There's no need for multiprocessing here. In fact this code can be greatly simplified:
def get_host_from_line(line):
    return line.strip().split(".", 1)[-1]

def dns_check():
    with open('alexa.txt') as alexa:
        awl = {get_host_from_line(line) for line in alexa}
    blacklist = []
    with open(INPUT_FILE, "r") as dns_misp_lines:
        for line in dns_misp_lines:
            host = get_host_from_line(line)
            if host not in awl:
                blacklist.append(host)
    with open(FULL_FILENAME, "w") as file_out:
        file_out.write("\n".join(blacklist))
Using a set comprehension to create your Alexa collection has the advantage of O(1) lookup time. Sets are similar to dictionaries; they are essentially dictionaries that have only keys and no values. There is some additional memory overhead, and the initial creation will likely be slower, since the values you put into a set need to be hashed and hash collisions dealt with, but the faster lookups should more than make up for it.
You can also clean up your line parsing. split() takes an additional parameter that limits the number of times the input is split, as illustrated below. I'm assuming your lines look something like this:
http://www.something.com and you want something.com (if this isn't the case let me know)
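For example, a quick illustration of the maxsplit argument (using a made-up hostname):

print("www.something.com".split(".", 1))       # ['www', 'something.com']
print("www.something.com".split(".", 1)[-1])   # 'something.com'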
It's important to remember that the in operator isn't magic. When you use it to check membership (is an element in the list) what it's essentially doing under the hood is this:
for element in list:
    if element == input:
        return True
return False
So every time your code did if element in list, your program had to iterate across each element until it either found what it was looking for or reached the end. This was probably the biggest bottleneck in your code.
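To see the scale of the difference for yourself, here is a small sketch using timeit (the collection size and lookup key are made up for illustration):

import timeit

setup = """
items = [str(i) for i in range(100000)]
as_list = list(items)
as_set = set(items)
"""

# Worst case for the list: the element we look up is at the very end,
# so every lookup scans all 100,000 entries.
print(timeit.timeit("'99999' in as_list", setup=setup, number=100))
print(timeit.timeit("'99999' in as_set", setup=setup, number=100))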

You tried to read a variable named dnsMISP to pass as an argument to Pool.map. It doesn't exist in local or global scope (where do you think it's coming from?), so you got a NameError. This has nothing to do with multiprocessing; you could just type a line with nothing but:
dnsMISP
and have the same error.
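As an aside, Pool.map takes one function plus one iterable of arguments; each worker call receives a single item. If you did later want to parallelize the check, a rough sketch might look like this (hypothetical names; it assumes the whitelist set awl is filled in the parent before the pool is created, so fork-based platforms share it with the workers):

from multiprocessing import Pool

awl = set()  # placeholder: fill with the Alexa domains before creating the Pool

def check_host(host):
    # Workers inherit 'awl' from the parent process on fork-based platforms.
    host = ".".join(host.strip().split(".")[-2:])
    return host if host not in awl else None

def mp_handler(dns_misp_lines):
    with Pool(2) as p:
        results = p.map(check_host, dns_misp_lines)
    return [r for r in results if r is not None]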

Related

Execute python code and evaluate/test results

Admittedly I am not sure how to ask this, as I know how to handle this in R (code execution in a new environment), but equivalent searches for the Python solution are not yielding what I was hoping for.
In short, I will receive a spreadsheet (or csv) where the contents of the column will contain, hopefully, valid python code. This could be the equivalent of a script, but just contained in the csv/workbook. For a use case, think teaching programming and the output is an LMS.
What I am hoping to do is loop over the file, and for each cell, run the code, and with the results in memory, test to see if certain things exist.
For example: https://docs.google.com/spreadsheets/d/1D-zC10rUTuozfTR5yHfauIGbSNe-PmfrZCkC7UTPH1c/edit?usp=sharing
When evaluating the first response in the spreadsheet above, I would want to test that x, y, and z are all properly defined and have the expected values.
Because there would be multiple rows in the file, one per student, how can I run each row separately, evaluate the results, and ensure that the results are isolated to only that cell? Simply put, when moving on, I should not retain any of the past evaluations.
(I am unaware of tools to do code checking, so I am dealing with it in a very manual way.)
It is possible to use Python's exec() function to execute strings such as the content in the cells.
Ex:
variables = {}
exec("""import os
# a comment
x = 2
y = 6
z = x * y""", variables)
assert variables["z"] == 12
Dealing with the csv file:
import csv

csv_file = open("path_to_csv_file", "rt")
csv_reader = csv.reader(csv_file)
iterator = iter(csv_reader)
next(iterator)  # To skip the titles of the columns
for row in iterator:
    user = row[0]
    answer = row[1]
    ### Any other code involving the csv file must be put here to work properly,
    ### that is, before closing csv_file.
csv_file.close()  # Remember to close the file.
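As a small variant (my addition, not part of the original answer), a with statement closes the file automatically, which removes the need to remember the explicit close():

import csv

with open("path_to_csv_file", "rt") as csv_file:
    csv_reader = csv.reader(csv_file)
    iterator = iter(csv_reader)
    next(iterator)  # skip the column titles
    for row in iterator:
        user = row[0]
        answer = row[1]
        # per-student processing goes here, inside the with block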
It won't be able to detect whether some module was imported (because when a module is imported inside exec(), it remains in the module cache for subsequent exec calls). One way to test this would be to 'unimport' the module and test the exec for exceptions.
Ex:
# This piece of code would go before closing the file,
# INSIDE THE FOR LOOP AND INDENTED WITH IT (because you want
# it to run for each student).
try:
    del os  # 'unimporting' os (this doesn't 'unimport' so much as delete a
            # reference to the module, which could be problematic if a 'from
            # module import object' statement was used)
except NameError:  # So that trying to delete a module that wasn't imported
    pass           # does not lead to exceptions being raised.
namespace = dict()
try:
    exec(answer, namespace)
except Exception:
    # The answer code could not be run without raising exceptions, i.e. the
    # code is poorly written.
    pass  # Code you want to run when the answer is wrong.
else:
    # The code hasn't raised exceptions; time to test the variables.
    x, y, z = namespace['x'], namespace['y'], namespace['z']
    if (x == 2) and (y == 6) and (z == x * y):
        pass  # Code you want to run when the answer is right.
    else:
        pass  # Code you want to run when the answer is wrong.
I sense that this is not the best way to do this, but it is certainly an attempt.
I hope this helped.
EDIT: Removed some bad code and added part of Tadhg McDonald-Jensen's comment.

Python: Why my function returns None and then executes

So, I have a function which basically does this:
import os
import json
import requests
from openpyxl import load_workbook
def function(data):
    statuslist = []
    for i in range(len(data[0])):
        result = performOperation(data[0][i])
        if result in satisfying_results:
            print("its okay")
            statuslist.append("Pass")
        else:
            print("absolutely not okay")
            statuslist.append("Fail" + result)
    return statuslist
Then I invoke the function like this (I added the error handling to check what would happen, after stumbling upon the reason for asking this question), and I was actually amazed by the results, as the function returns None and then executes:
statuslist = function(data)
print(statuslist)
try:
    for i in range(len(statuslist)):
        anotherFunction(i)
        print("Confirmation that it is working")
except TypeError:
    print("This is utterly nonsense I think")
The output of the program is then as follows:
None
This is utterly nonsense I think
its okay
its okay
its okay
absolutely not okay
its okay
There is only a single return statement at the end of the function, the function is not recursive, and it's pretty straightforward and top-down (but parses a lot of data in the meantime).
From the output log, it appears that the function first returns None and is then properly executed. I am puzzled, and I was unable to find any similar problems on the internet (maybe I am phrasing the question incorrectly).
Even if there were some inconsistency in the code, I'd still expect it to return [] instead.
After changing the initial list to statuslist = ["WTF"], the return is [].
To rule out the possibility that I modified the list in some other function called inside function(data), I changed the name of the initial list several times - the results are consistently beyond my comprehension.
I will be very grateful for tips on debugging the issue. Why does the function return the value first and get executed after?
Since I was unable to write code that would simultaneously reproduce what happened in my code in full, be readable, and not violate any of my company's security policies, I rewrote it in a simpler form (the original code was written when I had three months of programming experience), and the issue no longer reproduces. I guess there was some level of nesting of functions that I misinterpreted, and this rewritten code, which does pretty much the same thing, correctly returns the expected list.
Thank you everyone for your time and suggestions.
So the answer appears to be: you do not understand your own code; make it simpler.

How to get results out of a Python exec()/eval() call?

I want to write a tool in Python to prepare a simulation study by creating for each simulation run a folder and a configuration file with some run-specific parameters.
study/
    study.conf
    run1/
        run.conf
    run2/
        run.conf
The tool should read the overall study configuration from a file including (1) static parameters (key-value pairs), (2) lists for iteration parameters, and (3) some small code snippets to calculate further parameters from the previous ones. The latter are run specific depending on the permutation of the iteration parameters used.
Before writing the run.conf files from a template, I need to run some code like this to determine the specific key-value pairs from the code snippets for that run
code = compile(code_str, 'foo.py', 'exec')
rv = eval(code, context, {})
However, as confirmed by the Python documentation, this just yields None as the return value.
The code string and context dictionary in the example are filled elsewhere. For this discussion, this snippet should do it:
code_str = """import math
math.sqrt(width**2 + height**2)
"""
context = {
    'width': 30,
    'height': 10
}
I have done this before in Perl and Java+JavaScript. There, you just give the code snippet to some evaluation function or script engine and get in return a value (object) from the last executed statement -- not a big issue.
Now, in Python I struggle with the fact that eval() is too narrow, allowing just a single expression, and exec() doesn't return values in general. I need to import modules and sometimes do slightly more complex calculations, e.g., 5 lines of code.
Isn't there a better solution that I'm not seeing at the moment?
During my research, I found some very good discussions about Python's eval() and exec(), and also some tricky solutions that circumvent the issue by going via stdout and parsing the return value from there. The latter would do it, but it is not very nice and is already 5 years old.
The exec function will modify the global parameter (dict) passed to it. So you can use the code below
code_str = """import math
Result1 = math.sqrt(width**2 + height**2)
"""
context = {
    'width': 30,
    'height': 10
}
exec(code_str, context)
print(context['Result1'])  # 31.6
Every variable that code_str creates will end up as a key:value pair in the context dictionary. So the dict is the "object" you mentioned from JavaScript.
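If this pattern is needed in several places, it can be wrapped in a small helper (run_snippet is a hypothetical name; it executes against a copy so the caller's context dictionary stays unchanged):

def run_snippet(code_str, context):
    # Execute the snippet against a copy so the caller's dict is not mutated;
    # return the resulting namespace for inspection.
    env = dict(context)
    exec(code_str, env)
    return env

env = run_snippet("import math\nResult1 = math.sqrt(width**2 + height**2)",
                  {'width': 30, 'height': 10})
print(env['Result1'])  # ~31.62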
Edit1:
If you only need the result of the last line in code_str and want to avoid writing something like Result1=..., try the code below:
code_str="""import math
math.sqrt(width**2 + height**2)
"""
context = { 'width' : 30, 'height' : 10 }
lines = [l for l in code_str.split('\n') if l.strip()]
lines[-1] = '__myresult__='+lines[-1]
exec('\n'.join(lines), context)
print (context['__myresult__'])
This approach is not as robust as the former one, but it should work for most cases. If you need to manipulate the code in a more sophisticated way, take a look at abstract syntax trees (the ast module).
Since this whole exec()/eval() thing in Python is a bit weird, I chose to redesign the whole thing based on a proposal in the comments to my question (thanks @jonrsharpe).
Now the whole study specification is a .py module that the user can edit. From there, the configuration setup is written directly to a central object of the whole package. On tool runs, the configuration module is imported using the code below:
import imp
import os
import sys

# import the configuration as a module
(path, name) = os.path.split(filename)
(name, _) = os.path.splitext(name)
(file, filename, data) = imp.find_module(name, [path])
try:
    module = imp.load_module(name, file, filename, data)
except ImportError as e:
    print(e)
    sys.exit(1)
finally:
    file.close()
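A side note, not part of the original answer: imp has since been deprecated, and the same file-based import can be done with importlib, roughly like this:

import importlib.util
import os

def load_config_module(filename):
    # Load a .py file as a module without requiring it to be on sys.path.
    name = os.path.splitext(os.path.basename(filename))[0]
    spec = importlib.util.spec_from_file_location(name, filename)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module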
I came across similar needs, and finally figured out an approach by playing with ast:
import ast

code = """
def tf(n):
    return n*n
r = tf(3)
{"vvv": tf(5)}
"""
ast_ = ast.parse(code, '<code>', 'exec')
final_expr = None
for field_ in ast.iter_fields(ast_):
    if 'body' != field_[0]:
        continue
    if len(field_[1]) > 0 and isinstance(field_[1][-1], ast.Expr):
        final_expr = ast.Expression()
        final_expr.body = field_[1].pop().value
ld = {}
rv = None
exec(compile(ast_, '<code>', 'exec'), None, ld)
if final_expr:
    rv = eval(compile(final_expr, '<code>', 'eval'), None, ld)
print('got locals: {}'.format(ld))
print('got return: {}'.format(rv))
It will eval the last clause instead of exec-ing it if it is an expression; otherwise everything is exec-ed and rv remains None.
Output:
got locals: {'tf': <function tf at 0x10103a268>, 'r': 9}
got return: {'vvv': 25}
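The same idea can be written a bit more compactly, since the parsed Module node exposes its statement list directly as .body. This is a sketch of the same technique, not the original author's code:

import ast

def exec_with_result(code, env=None):
    # Split off a trailing expression, exec the rest, then eval the expression.
    env = {} if env is None else env
    tree = ast.parse(code, '<code>', 'exec')
    last = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last = ast.Expression(body=tree.body.pop().value)
    exec(compile(tree, '<code>', 'exec'), env)
    if last is not None:
        return eval(compile(last, '<code>', 'eval'), env)
    return None

print(exec_with_result("import math\nmath.sqrt(2)"))  # 1.4142135623730951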

how to get intermediate results in ipython parallel processing?

My work involves processing lots of XML files; to get results faster I want to use IPython's parallel processing. Below is my sample code, in which I just count the number of elements in an xml/xsd file with the cElementTree module.
>>> from IPython.parallel import Client
>>> import os
>>> c = Client()
>>> c.ids
>>> lview = c.load_balanced_view()
>>> lview.block = True
>>> def return_len(xml_filepath):
        import xml.etree.cElementTree as cElementTree
        tree = cElementTree.parse(xml_filepath)
        my_count = 0
        file_result = []
        cdict = {}
        for elem in tree.getiterator():
            cdict[my_count] = {}
            if elem.tag:
                cdict[my_count]['tag'] = elem.tag
            if elem.text:
                cdict[my_count]['text'] = (elem.text).strip()
            if elem.attrib.items():
                cdict[my_count]['xmlattb'] = {}
                for key, value in elem.attrib.items():
                    cdict[my_count]['xmlattb'][key] = value
            if list(elem):
                cdict[my_count]['xmlinfo'] = len(list(elem))
            if elem.tail:
                cdict[my_count]['tail'] = elem.tail.strip()
            my_count += 1
        output = xml_filepath.split('\\')[-1], len(cdict)
        return output
        ## return cdict
>>> def get_dir_list(target_dir, *extensions):
        """
        This function will filter out the files from the given dir based on their extensions
        """
        my_paths = []
        for top, dirs, files in os.walk(target_dir):
            for nm in files:
                fileStats = os.stat(os.path.join(top, nm))
                if nm.split('.')[-1] in extensions:
                    my_paths.append(top + '\\' + nm)
        return my_paths
>>> r = lview.map_async(return_len, get_dir_list('C:\\test_folder', 'xsd', 'xml'))
To get the final result I have to do
>>> r.get()
With this, I get the result only when the whole process has completed. My question is: can I get intermediate results as individual files finish? For example, if I apply my work to a folder containing 1000 xml/xsd files, can I get each result immediately after that particular file has been processed - 1st file is done --> show its result... 2nd file is done --> show its result... 1000th file is done --> show its result - rather than the current behaviour of waiting until the final file is finished and only then showing the complete result for all 1000 files?
Also, to deal with import/namespace errors I have put the import inside the return_len function; is there a better way to deal with that?
Sure. An AsyncMapResult (the type returned by map_async) is iterable immediately,
and the items yielded by the iteration are the same as the list ultimately produced by r.get(). So after you do:
amr = lview.map_async(return_len, get_dir_list('C:\\test_folder','xsd','xml'))
You can do:
for r in amr:
    print r
or keep the index with enumerate:
for i, r in enumerate(amr):
    print i, r
or perform reductions with the reduce builtin:
summary_result = reduce(myfunc, amr)
All of these will iterate through your results as they arrive. If you don't care about the ordering and the time for each task is significantly varied, you can pass map_async(...,ordered=False). If you do this, when you iterate through the AMR, you will get individual results on a first-come-first-serve basis, rather than preserving the submission order.
There's a bit more info in the ipython docs.
also to deal with import/namespace error i have defined import inside of return_len function; is there any better way to deal with that?
Yes and no. There are a few ways to set up the engine namespace, such as using modules, the @parallel.require("module") decorator, or simply performing the import explicitly with %px import xml.etree.cElementTree as cElementTree, each of which has benefits in certain scenarios. But I often find putting the imports inside the function to be the easiest way, and the one with the fewest surprises.
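For completeness, a sketch of the explicit-import option using a DirectView (assuming the IPython.parallel client API of that era):

from IPython.parallel import Client

c = Client()
dview = c[:]  # a DirectView over all engines

# Run the import once on every engine so the function body no longer needs it.
dview.execute('import xml.etree.cElementTree as cElementTree', block=True)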

pykka -- Actors are slow?

I am currently experimenting with actor concurrency (in Python) because I want to learn more about it. I therefore chose pykka, but when I test it, it's more than twice as slow as a normal function.
The code is only meant to check that it works; it's not meant to be elegant. :)
Maybe I did something wrong?
from pykka.actor import ThreadingActor
import numpy as np

class Adder(ThreadingActor):
    def add_one(self, i):
        l = []
        for j in i:
            l.append(j + 1)
        return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    adder = Adder.start().proxy()
    adder.add_one(data)
    adder.stop()
This doesn't run very fast:
time python actor.py
real 0m8.319s
user 0m8.185s
sys 0m0.140s
And now the dummy 'normal' function:
def foo(i):
    l = []
    for j in i:
        l.append(j + 1)
    return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    foo(data)
Gives this result:
real 0m3.665s
user 0m3.348s
sys 0m0.308s
So what is happening here is that your functional version creates two very large lists, and that is the bulk of the time. When you introduce actors, mutable data like lists must be copied before being sent to the actor to maintain proper concurrency, and the list created inside the actor must be copied again when it is sent back to the sender. This means that instead of two very large lists, four are created.
Consider designing things so that the data is constructed and maintained by the actor and then queried by calls to the actor, minimizing the size of the messages passed back and forth. Try to apply the principle of minimal data movement. Passing the list in the functional case is only efficient because the data is not actually moving, due to the shared memory space. If the actor were on a different machine, we would not have the benefit of a shared memory space even if the message data were immutable and didn't need to be copied.
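Here is a minimal sketch of that design with pykka (illustrative names; it assumes the ThreadingActor/proxy API used in the question): the actor owns the large list, and only a single float crosses the actor boundary.

from pykka import ThreadingActor

class Summer(ThreadingActor):
    def __init__(self):
        super(Summer, self).__init__()
        self.data = []

    def load(self, n):
        # Build the large list inside the actor so it never travels in a message.
        self.data = [float(i) for i in range(n)]

    def add_one_and_sum(self):
        # Do the heavy work in place; only one small value is sent back.
        return sum(x + 1 for x in self.data)

if __name__ == '__main__':
    summer = Summer.start().proxy()
    summer.load(1000000).get()             # proxy calls return futures
    print(summer.add_one_and_sum().get())  # resolve the future to a value
    summer.stop()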
