My work involves processing lots of XML files. To get results faster I want to use IPython's parallel processing; below is my sample code, in which I simply count the number of elements in each XML/XSD file with the cElementTree module.
>>> from IPython.parallel import Client
>>> import os
>>> c = Client()
>>> c.ids
>>> lview = c.load_balanced_view()
>>> lview.block =True
>>> def return_len(xml_filepath):
        import xml.etree.cElementTree as cElementTree
        tree = cElementTree.parse(xml_filepath)
        my_count = 0
        cdict = {}
        for elem in tree.getiterator():
            cdict[my_count] = {}
            if elem.tag:
                cdict[my_count]['tag'] = elem.tag
            if elem.text:
                cdict[my_count]['text'] = elem.text.strip()
            if elem.attrib.items():
                cdict[my_count]['xmlattb'] = {}
                for key, value in elem.attrib.items():
                    cdict[my_count]['xmlattb'][key] = value
            if list(elem):
                cdict[my_count]['xmlinfo'] = len(list(elem))
            if elem.tail:
                cdict[my_count]['tail'] = elem.tail.strip()
            my_count += 1
        output = xml_filepath.split('\\')[-1], len(cdict)
        return output
        ## return cdict
>>> def get_dir_list(target_dir, *extensions):
        """
        Return the paths of the files under target_dir whose extension is in extensions.
        """
        my_paths = []
        for top, dirs, files in os.walk(target_dir):
            for nm in files:
                if nm.split('.')[-1] in extensions:
                    my_paths.append(top + '\\' + nm)
        return my_paths
>>> r=lview.map_async(return_len,get_dir_list('C:\\test_folder','xsd','xml'))
To get the final result I have to call
>>> r.get()
which only gives me the result once the whole job is complete. My question is: can I get intermediate results as they finish? For example, if I apply my work to a folder containing 1000 XML/XSD files, can I get each file's result as soon as that particular file has been processed, i.e. 1st file is done --> show its result, 2nd file is done --> show its result, ..., 1000th file is done --> show its result, instead of the current behaviour of waiting until the last file is finished before showing the complete result for all 1000 files?
Also, to deal with import/namespace errors I have placed the import inside the return_len function; is there a better way to handle that?
Sure. An AsyncMapResult (the type returned by map_async) is iterable immediately, and the items yielded by the iteration are the same as the list ultimately produced by r.get(). So after you do:
amr = lview.map_async(return_len, get_dir_list('C:\\test_folder','xsd','xml'))
You can do:
for r in amr:
    print r
or keep the index with enumerate:
for i, r in enumerate(amr):
    print i, r
or perform reductions with the reduce builtin:
summary_result = reduce(myfunc, amr)
All of these will iterate through your results as they arrive. If you don't care about the ordering and the time per task varies significantly, you can pass map_async(..., ordered=False). If you do this, then when you iterate through the AMR you will get individual results on a first-come, first-served basis, rather than in submission order.
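For instance, a rough sketch of consuming unordered results as they complete, assuming return_len returns the (filename, count) tuple shown above:
amr = lview.map_async(return_len,
                      get_dir_list('C:\\test_folder', 'xsd', 'xml'),
                      ordered=False)
for fname, n_elements in amr:   # each result is yielded as soon as its file is done
    print fname, n_elements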
There's a bit more info in the ipython docs.
Also, to deal with import/namespace errors I have placed the import inside the return_len function; is there a better way to handle that?
Yes and no. There are a few ways to set up the engine namespace, such as using modules, the @parallel.require("module") decorator, or simply performing the import explicitly with %px import xml.etree.cElementTree as cElementTree, each of which has benefits in certain scenarios. But I often find that putting the imports inside the function is the easiest way, and the one with the fewest surprises.
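For example, a minimal sketch of the explicit-import route, assuming the Client created above (the %px magic does the same thing interactively from the IPython prompt):
dview = c[:]   # a DirectView over all engines
dview.execute('import xml.etree.cElementTree as cElementTree', block=True)
# After this, functions shipped to the engines can refer to cElementTree without
# importing it themselves, because the name now exists in each engine's namespace.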
I have a large list of various types of objects that I would like Z3 to synthesize in my Python project. Since the constraints associated with each object to be synthesized are independent, this process can be completely parallelized: instead of synthesizing one value at a time, a machine with 4 cores can synthesize 4 values at the same time. To do this, we must use Python's multiprocessing package instead of threading (due to the GIL and the fact that the workload is CPU-bound).
For simplicity, say I have a simple str synthesizer that synthesizes a new str that is lexicographically less than a given input value, something like this:
def lt_constraint(value):
    solver = Solver()
    # do a number of processing steps on 'value', which is an input string
    # ... 'offset', 'char' and '_chars' are defined in the real code
    var = String("var")
    template = Concat(Re(StringVal(value[:offset])), char, Star(_chars))
    solver.add(InRe(var, template))
    if solver.check() == sat:
        value = solver.model()[var]
        return convert_to_str(value)
Now if I have a number of values, I want to run the function above in parallel:
from pathos.multiprocessing import ProcessingPool as Pool

with Pool(processes=4) as pool:
    value_list = ['This', 'is', 'an', 'example']
    synthesized_strs = pool.map(lt_constraint, value_list)
I use pathos in the hope that it will handle the pickling issue, but I still get this error:
TypeError: cannot pickle 're.Match' object
which I believe is because Z3 uses methods from re, and those objects need to be pickled when pickling lt_constraint(), but dill cannot pickle them.
Is there any other way to parallelize Z3 for my case (other than implementing pickling myself for re or what not)?
Thanks!
Stack Overflow works best when you include your whole code so people can experiment with it. Having said that, I had good luck with the following:
from z3 import *
import concurrent.futures
def getVal(value):
    solver = Solver()
    var = Int('var')
    solver.add(var > value)
    if solver.check() == sat:
        return solver.model()[var].as_long()
    else:
        return 'CANT SOLVE'

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(getVal, i) for i in [1, 2, 3]]
    results = [f.result() for f in futures]
    print(results)
This prints:
$ python3.9 a.py
[2, 3, 4]
Without the details of how you constructed your lt_constraint, it's hard to tell whether this will work for your case, but the concurrent.futures library seems to play well with z3, at least for simple constraints. Give this a try and see if it handles your case as well. If not, please post the full code as a minimal reproducible example; see https://stackoverflow.com/help/minimal-reproducible-example
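Since the question emphasises CPU-bound work, a process-based variant of the same sketch may also be worth trying. This is only an assumption about your real constraints, but it should work here because getVal builds its solver internally and returns plain picklable values:
from z3 import *
import concurrent.futures

def getVal(value):
    solver = Solver()
    var = Int('var')
    solver.add(var > value)
    if solver.check() == sat:
        return solver.model()[var].as_long()
    return 'CANT SOLVE'

if __name__ == '__main__':
    # Processes sidestep the GIL; arguments and return values must be picklable.
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(getVal, [1, 2, 3]))
    print(results)   # e.g. [2, 3, 4], as in the threaded run above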
I'm trying to sum up the sizes of all files in a directory, including those in all subdirectories recursively. The relevant function (self._count) works totally fine if I just call it once. But for large numbers of files I want to use multiprocessing to make the program faster. Here are the relevant parts of the code.
self._sum_dicts sums up the values of matching keys across the given dicts.
self._get_file_type returns the category (a key of stats) the file should be counted under.
self._categories holds a list of all possible categories.
number_of_threats specifies the number of worker processes that shall be used.
path holds the path to the directory mentioned in the first sentence.
import os
from multiprocessing import Pool

def _count(self, path):
    stats = dict.fromkeys(self._categories, 0)
    try:
        dir_list = os.listdir(path)
    except:
        # I do some warning here, but removed it for SSCCE
        return stats
    for element in dir_list:
        new_path = os.path.join(path, element)
        if os.path.isdir(new_path):
            add_stats = self._count(new_path)
            stats = self._sum_dicts([stats, add_stats])
        else:
            file_type = self._get_file_type(element)
            try:
                size = os.path.getsize(new_path)
            except Exception as e:
                # I do some warning here, but removed it for SSCCE
                continue
            stats[file_type] += size
    return stats

files = []
dirs = []
for e in dir_list:
    new_name = os.path.join(path, e)
    if os.path.isdir(new_name):
        dirs.append(new_name)
    else:
        files.append(new_name)

with Pool(processes=number_of_threats) as pool:
    res = pool.map(self._count, dirs)
self._stats = self._sum_dicts(res)
I know that this code won't consider files directly in path, but that is something I can easily add. When executing the code I get the following exception.
Exception has occurred: TypeError
cannot serialize '_io.TextIOWrapper' object
...
line ... in ...
res = pool.map(self._count, dirs)
I found out that this exception can occur when sharing resources between processes, which - as far as I can see - I only do with stats = dict.fromkeys(self._categories, 0). But replacing this line with hardcoded values doesn't fix the problem. Even placing a breakpoint at this line doesn't help me, because it is never reached.
Does anybody have an idea what the reason for this problem is and how I can fix this?
The problem is that you are trying to transmit "self". If self holds a stream object, it can't be serialized.
Try to move the multiprocessed code outside the class.
Python multiprocessing launches a new interpreter, and if you try to ship an object that can't be pickled (or serialized), it fails. Usually it doesn't crash where you think it crashed, but when the worker tries to receive the object.
I converted some code that used threads to multiprocessing and got a lot of weird errors even though I didn't send or use those objects, because I used their parent (self).
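A minimal sketch of that suggestion, assuming the category lookup can be expressed as a module-level helper (the names here are illustrative, not the asker's actual API, and os.walk replaces the recursive _count):
import os
from multiprocessing import Pool

def get_file_type(name):
    # stand-in for self._get_file_type; it must live at module level so the
    # worker can run without the class instance
    return os.path.splitext(name)[1].lstrip('.').lower() or 'no_extension'

def count_sizes(args):
    # Module-level worker: no reference to 'self', so nothing unpicklable travels.
    path, categories = args
    stats = dict.fromkeys(categories, 0)
    for top, _dirs, files in os.walk(path):
        for name in files:
            key = get_file_type(name)
            if key not in stats:
                continue
            try:
                stats[key] += os.path.getsize(os.path.join(top, name))
            except OSError:
                continue
    return stats

# Usage from the class method; only picklable data crosses the process boundary:
#     with Pool(processes=number_of_threats) as pool:
#         res = pool.map(count_sizes, [(d, self._categories) for d in dirs])
#     self._stats = self._sum_dicts(res)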
I am working on a Python project in PyCharm.
Every time I create a new file, I have to import all of the basic libraries again. I keep getting kicked out of my workflow for a few seconds every time I notice that I forgot to import a library.
As far as I can tell, there is no reason not to just import every library I use anywhere in every single file, or is there? I would much rather import a library I don't need than risk losing my concentration because I forgot to import something and have to waste a few seconds on it. Since the auto-sorting of libraries doesn't seem to work properly, I even have to add the import statement manually.
Is there a way to give PyCharm a large list of libraries and just import those in absolutely every file by default?
For comparison, this is possible to do in Jupyter Lab: You can have one notebook that contains all import statements, and just call "%run import_everything.ipynb" in all the other notebooks. This has saved me a lot of time, and is also much more readable than having different imports in every notebook.
(I care specifically about how to do this in PyCharm, but if there is a more generic way to do it, that information would also be appreciated)
I wrote a short script that goes over all my files and combines all basic import statements. It also wraps them in comments that say "BASIC IMPORTS", so in the future I can just use PyCharm's "replace string in whole project" feature to mass-replace all imports by looking for these comments with a regex.
from pathlib import Path
import re

p = Path(__file__).absolute().parent.parent
excluded_files = ['__init__.py', 'setup.py']
affected_files = [a for a in p.glob('**/*.py') if a.name not in excluded_files]

acc = {}
for a in affected_files:
    b = a.read_text()
    b = b[:b.index("\n\n")]
    finds = re.findall("(import |from )(.*)", b)
    for c in finds:
        acc[c] = True
        assert "intnet" not in c[1], a

lst = list(acc.keys())
lst.sort(key=lambda a: a[1])
res = '\n'.join(f"{a[0]}{a[1]}" for a in lst)
res = "# BASIC IMPORTS\n" + res + "\n# / BASIC IMPORTS"

for a in affected_files:
    b = a.read_text()
    b = res + b[b.index("\n\n"):]
    a.write_text(b)
Just trying to learn, and I'm wondering if multiprocessing would speed up this for loop. I'm trying to compare alexa_white_list (1,000,000 lines) and dnsMISP (which can reach 160,000 lines). The code checks each line in dnsMISP and looks for it in alexa_white_list; if it doesn't find it, it adds the line to the blacklist. Without the mp_handler function the code works fine, but it takes around 40-45 minutes. For brevity I've omitted all the other imports and the function that pulls down and unzips the Alexa white list.
The code below gives me the following error -
File "./vetdns.py", line 128, in mp_handler
p.map(dns_check,dnsMISP,alexa_white_list)
NameError: global name 'dnsMISP' is not defined
from multiprocessing import Pool

def dns_check():
    awl = []
    blacklist = []
    ctr = 0
    dnsMISP = open(INPUT_FILE, "r")
    dns_misp_lines = dnsMISP.readlines()
    dnsMISP.close()
    alexa_white_list = open(outname, 'r')
    alexa_white_list_lines = alexa_white_list.readlines()
    alexa_white_list.close()
    print "converting awl to proper format"
    for line in alexa_white_list_lines:
        awl.append(".".join(line.split(".")[-2:]).strip())
    print "done"
    for host in dns_misp_lines:
        host = host.strip()
        host = ".".join(host.split(".")[-2:])
        if not host in awl:
            blacklist.append(host)
    file_out = open(FULL_FILENAME, "w")
    file_out.write("\n".join(blacklist))
    file_out.close()

def mp_handler():
    p = Pool(2)
    p.map(dns_check, dnsMISP, alexa_white_list)

if __name__ == '__main__':
    mp_handler()
If I mark it as global etc. I still get the error. I'd appreciate any suggestions!
There's no need for multiprocessing here. In fact this code can be greatly simplified:
def get_host_from_line(line):
    return line.strip().split(".", 1)[-1]

def dns_check():
    with open('alexa.txt') as alexa:
        awl = {get_host_from_line(line) for line in alexa}
    blacklist = []
    with open(INPUT_FILE, "r") as dns_misp_lines:
        for line in dns_misp_lines:
            host = get_host_from_line(line)
            if host not in awl:
                blacklist.append(host)
    with open(FULL_FILENAME, "w") as file_out:
        file_out.write("\n".join(blacklist))
Using a set comprehension to create your Alexa collection gives you O(1) lookup time. Sets are similar to dictionaries; they are essentially dictionaries that have only keys and no values. There is some extra memory overhead, and the initial creation will likely be slower since the values you put into a set need to be hashed and hash collisions dealt with, but the gain from the faster lookups should more than make up for it.
You can also clean up your line parsing. split() takes an additional parameter that limits the number of times the input is split. I'm assuming your lines look something like http://www.something.com and you want something.com (if this isn't the case, let me know).
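For instance, with that second parameter set to 1:
>>> "www.something.com".split(".", 1)
['www', 'something.com']
>>> "www.something.com".split(".", 1)[-1]
'something.com'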
It's important to remember that the in operator isn't magic. When you use it to check membership (is this element in the list?), what it essentially does under the hood is this:
for element in list:
    if element == input:
        return True
return False
So every time your code did if element in list, the program had to iterate over each element until it either found what you were looking for or reached the end. This was probably the biggest bottleneck in your code.
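As a rough illustration of the difference (timings will vary by machine; this only uses the standard-library timeit module):
import timeit

setup = """
data_list = [str(i) for i in range(1000000)]
data_set = set(data_list)
"""

# worst case for the list: the element we look for is at the very end
print(timeit.timeit("'999999' in data_list", setup=setup, number=100))  # linear scan
print(timeit.timeit("'999999' in data_set", setup=setup, number=100))   # hash lookup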
You tried to read a variable named dnsMISP to pass as an argument to Pool.map. It doesn't exist in local or global scope (where do you think it's coming from?), so you got a NameError. This has nothing to do with multiprocessing; you could just type a line with nothing but:
dnsMISP
and have the same error.
I am currently experimenting with actor concurrency (in Python) because I want to learn more about it. I chose pykka, but when I test it, it's more than twice as slow as a normal function.
The code is only meant to check that it works; it's not meant to be elegant. :)
Maybe I did something wrong?
from pykka.actor import ThreadingActor
import numpy as np

class Adder(ThreadingActor):
    def add_one(self, i):
        l = []
        for j in i:
            l.append(j + 1)
        return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    adder = Adder.start().proxy()
    adder.add_one(data)
    adder.stop()
This does not run very fast:
time python actor.py
real 0m8.319s
user 0m8.185s
sys 0m0.140s
And now the dummy 'normal' function:
def foo(i):
    l = []
    for j in i:
        l.append(j + 1)
    return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    foo(data)
Gives this result:
real 0m3.665s
user 0m3.348s
sys 0m0.308s
So what is happening here is that your functional version is creating two very large lists, which is the bulk of the time. When you introduce actors, mutable data like lists must be copied before being sent to the actor in order to maintain proper concurrency, and the list created inside the actor must be copied again when it is sent back to the sender. This means that instead of two very large lists being created, we have four.
Consider designing things so that the data is constructed and maintained by the actor and then queried by calls to the actor, minimizing the size of the messages passed back and forth. Try to apply the principle of minimal data movement. Passing the list in the functional case is only efficient because the data does not actually move, thanks to the shared memory space. If the actor were on a different machine, we would not have the benefit of a shared memory space even if the message data were immutable and didn't need to be copied.
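For example, a rough sketch of that design using the same pykka API as the question: the actor owns the large array, the heavy work happens inside it, and only a single small number crosses the actor boundary (what you actually query for depends on your real workload):
from pykka.actor import ThreadingActor
import numpy as np

class DataHolder(ThreadingActor):
    def __init__(self):
        super(DataHolder, self).__init__()
        self.data = None

    def load(self, n):
        # the big array lives inside the actor; only 'n' is sent in the message
        self.data = np.random.random(n)

    def add_one_and_sum(self):
        # heavy work stays in the actor; only one float travels back
        return float((self.data + 1).sum())

if __name__ == '__main__':
    holder = DataHolder.start().proxy()
    holder.load(1000000)
    result = holder.add_one_and_sum().get()   # proxy calls return futures; .get() blocks
    holder.stop()
    print(result)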