I need some help with the code below.
GOAL:
I have a huge ".txt" file and need a script to search it for a string given as input by the user.
In this specific code I'm trying to speed things up by creating multiple threads (8, to be exact) and assigning a slice of the ".txt" file to each thread, so each thread searches for results in only a portion of the file.
Since I need the return value of the search function that I created and pass to the threads, I used the "ThreadWithReturnValue" class shown in this question:
How to get the return value from a thread in python?
PROBLEM:
The problem is that I only get results if they are located in the FIRST thread's slice. The results from the other threads are always empty lists.
If I add logging inside the search function, the log messages appear in the terminal for the first thread only; the messages for the other threads never show up. This makes me believe that the other threads are not even running, even though the thread list shows the correct number of created threads.
HELP: I'm new to Python and have very little experience with threads. I'm very probably missing something basic here. If you can help me I would be immensely grateful.
NOTE: I would like to make the current code work; I know there are other options to accomplish the same task, but I would like to make this one work in the first place.
#! python3
# Search for a string in a huge ".txt" file

import pprint
from threading import Thread

# Search function that takes the searched value, the file object,
# and the start and end lines of its slice
def searchFile(value, fileObj, linestart, lineend):
    results = []
    textToSearch = fileObj.readlines()[linestart:lineend]
    for line in textToSearch:
        if value in line:
            results.append(line[0:-1])
    return results

# Thread subclass that actually returns the target function's return value
class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs={}, Verbose=None):
        Thread.__init__(self, group, target, name, args, kwargs)
        self._return = None

    def run(self):
        print(type(self._target))
        if self._target is not None:
            self._return = self._target(*self._args, **self._kwargs)

    def join(self, *args):
        Thread.join(self, *args)
        return self._return

# File variables
fileToSearch = 'somefile.txt'
totalLines = 1000000  # Let's suppose the file has 1 million lines
sizeThread = totalLines // 8  # Each thread will cover 1/8 of the file's lines

# Main program loop
while True:
    # Ask the user for input
    userinput = input('Enter your input here: \n')
    # Open the file
    SearchFileObj = open(fileToSearch)
    # Create an empty search results list
    searchResults = []
    # Create an empty threads list
    searchThreads = []
    # Create 8 threads
    for i in range(0, totalLines, sizeThread):
        ThreadlLineStart = i
        ThreadlLineEnd = i + sizeThread
        # Create the thread object
        threadObj = ThreadWithReturnValue(target=searchFile,
                                          args=[userinput, SearchFileObj, ThreadlLineStart, ThreadlLineEnd])
        searchThreads.append(threadObj)
        threadObj.start()
    # Wait for the threads to finish
    for thread in searchThreads:
        threadResult = thread.join()
        # Check the thread result and append it (if any) to the list
        if threadResult == []:
            continue
        else:
            searchResults.append(threadResult)
    # Close the file
    SearchFileObj.close()
    # Give the output to the user
    if searchResults == []:
        print('No results found.')
    else:
        print('The results are:')
        pprint.pprint(searchResults)
        break
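For reference, the file-object behaviour at play here can be reproduced without any threads, in a minimal sketch: readlines() consumes the whole stream, so every call after the first starts at end-of-file and returns an empty list, which is exactly what each searchFile call after the first ends up slicing.
# Sketch: two consecutive readlines() calls on one shared file object.
fileObj = open('somefile.txt')
first = fileObj.readlines()     # reads every line; the file pointer is now at EOF
second = fileObj.readlines()    # returns [] -- there is nothing left to read
print(len(first), len(second))  # e.g. "1000000 0"
fileObj.close()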
I've got some code where I want to share objects between processes using queues. I've got a parent:
processing_manager = mp.Manager()
to_cacher = processing_manager.Queue()
fetchers = get_fetchers()
fetcher_process = mp.Process(target=fetch_news, args=(to_cacher, fetchers))
fetcher_process.start()
while 1:
    print(to_cacher.get())
And a child:
def fetch_news(pass_to: Queue, fetchers: List[Fetcher]):
    def put_news_to_query(pass_to: Queue, fetchers: List[Fetcher]):
        for fet in fetchers:
            for news in fet.latest_news():
                print(news)
                pass_to.put(news)
        print("----------------")
    put_news_to_query(pass_to, fetchers)
I'm expecting to see N objects printed in put_news_to_query, then a line, and then the same objects printed in the while loop in the parent. The problem is that objects appear to go missing: if I get, say, 8 objects printed in put_news_to_query, I only get 2-3 objects printed in the while loop. What am I doing wrong here?
This is not the answer, unless the answer is that the code is already working. I've just modified the code to make it a running example of the same technique. The data gets from child to parent without data loss.
import multiprocessing as mp
import time
import random

def worker(pass_to):
    for i in range(10):
        time.sleep(random.randint(0, 10) / 1000)
        print('child', i)
        pass_to.put(i)
    print("---------------------")
    pass_to.put(None)

def main():
    manager = mp.Manager()
    to_cacher = manager.Queue()
    fetcher = mp.Process(target=worker, args=(to_cacher,))
    fetcher.start()
    while 1:
        msg = to_cacher.get()
        if msg is None:
            break
        print(msg)

if __name__ == "__main__":
    main()
So, apparently, it was something related to the order in which the put and get statements were executed. Basically, some of the objects from the parent's print were printed before the separator line. If you struggle with something like this, I'd recommend adding something to distinguish the prints, like this:
print(f"Worker: {news}")
print(f"Main: {to_cacher.get()}")
I'm trying to improve the speed of my program, and I decided to use multiprocessing!
The problem is that I can't seem to find any way to use the Pool function (I think this is what I need) with my function.
Here is the code that I am dealing with:
def dataLoading(output):
    name = ""
    link = ""
    upCheck = ""
    isSuccess = ""
    for i in os.listdir():
        with open(i) as currentFile:
            data = json.loads(currentFile.read())
            try:
                name = data["name"]
                link = data["link"]
                upCheck = data["upCheck"]
                isSuccess = data["isSuccess"]
            except:
                print("error in loading data from config: improper naming or formatting used")
            output[name] = [link, upCheck, isSuccess]

#working
def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    html = requests.get(link, headers=headers)
    page_source = html.text
    count = page_source.count(isSuccess)
    if count > 0:
        return True
    else:
        return False
I have a parent function to run these two together, but I don't think I need to show the whole thing, just the part that gets the data iteratively:
for i in configData:
    data = configData[i]
    link = data[0]
    print(link)
    upCheck = data[1]  # just for future use
    isSuccess = data[2]
    if userCheck(link, username, isSuccess) == True:
        good.append(i)
You can see how I pass all of the data in there. How would I be able to use multiprocessing to do this when I am iterating through the dictionary to collect multiple parameters?
I like to use mp.Pool().map. I think it is the easiest and most straightforward option and handles most multiprocessing cases. So how does map work? For starters, we have to keep in mind that mp creates workers; each worker receives a copy of the namespace (yes, the whole thing), then each worker works on what they are assigned and returns. Hence, doing something like "updating a global variable" while they work doesn't work, since each worker receives a copy of the global variable and none of the workers communicate with each other. (If you want communicating workers you need to use mp.Queue's and such; it gets complicated.) Anyway, here is using map:
from multiprocessing import Pool

t = 'abcd'

def func(s):
    return t[int(s)]

results = Pool().map(func, range(4))
Each worker received a copy of t, func, and the portion of range(4) they were assigned. They are then automatically tracked and everything is cleaned up in the end by Pool.
Something like your dataLoading won't work very well; we need to modify it. I also cleaned up the code a little.
def loadfromfile(file):
    data = json.loads(open(file).read())
    items = [data.get(k, "") for k in ['name', 'link', 'upCheck', 'isSuccess']]
    return items[0], items[1:]

output = dict(Pool().map(loadfromfile, os.listdir()))
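For the userCheck loop, the same map idea extends to functions with several parameters via Pool().starmap, which unpacks each argument tuple into the function's arguments. A sketch, assuming configData, username, and the headers used inside userCheck are defined as in your snippets:
from multiprocessing import Pool

# One argument tuple per entry: (link, user, isSuccess).
tasks = [(configData[i][0], username, configData[i][2]) for i in configData]
with Pool() as pool:
    flags = pool.starmap(userCheck, tasks)  # unpacks each tuple into userCheck(...)
# Keep the keys whose check came back True.
good = [i for i, ok in zip(configData, flags) if ok]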
I would like to write a class with the following interface.
class Automaton:
    """ A simple automaton class """

    def iterate(self, something):
        """ yield something and expects some result in return """
        print("Yielding", something)
        result = yield something
        print("Got \"" + result + "\" in return")
        return result

    def start(self, somefunction):
        """ start the iteration process """
        yield from somefunction(self.iterate)
        return "I'm done!"  # raising StopIteration inside a generator is a RuntimeError since Python 3.7 (PEP 479)

def first(iterate):
    while iterate("what do I do?") != "over":
        continue

def second(iterate):
    value = yield from iterate("what do I do?")
    while value != "over":
        value = yield from iterate("what do I do?")

# A simple driving process
automaton = Automaton()
#generator = automaton.start(first)  # This one hangs
generator = automaton.start(second)  # This one runs smoothly
next_yield = generator.__next__()
for step in range(4):
    next_yield = generator.send("Continue...({})".format(step))
try:
    end = generator.send("over")
except StopIteration as excp:
    print(excp)
The idea is that Automaton will regularly yield values to the caller which will in turn send results/commands back to the Automaton.
The catch is that the decision process "somefunction" will be some user-defined function I have no control over, which means that I can't really expect it to call the iterate method with a yield from in front. Worse, it could be that the user wants to plug some third-party function he has no control over into this Automaton class, meaning that the user might not be able to rewrite his somefunction to include yield from in front of the iterate calls.
To be clear: I completely understand why using the first function hangs the automaton. I am just wondering if there is a way to alter the definition of iterate or start that would make the first function work.
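One direction that can make a plain function like first work, sketched below under the assumption that blocking a worker thread is acceptable: run somefunction on its own thread and let iterate hand values over a pair of queues instead of yielding, so the user code never needs yield from. The ThreadedAutomaton name and the _DONE sentinel are illustrative, not part of the original interface.
import threading
import queue

_DONE = object()  # illustrative sentinel: the user function has returned

class ThreadedAutomaton:
    """Sketch: drive a plain (non-generator) user function.

    somefunction runs on a worker thread; iterate() blocks on queues
    instead of yielding, so the user never writes `yield from`.
    """
    def __init__(self):
        self.to_driver = queue.Queue()  # values "yielded" to the caller
        self.to_worker = queue.Queue()  # results sent back in

    def iterate(self, something):
        self.to_driver.put(something)   # hand a value to the driver
        return self.to_worker.get()     # block until the driver answers

    def start(self, somefunction):
        def runner():
            somefunction(self.iterate)
            self.to_driver.put(_DONE)   # signal that the user code returned
        threading.Thread(target=runner).start()
        while True:
            item = self.to_driver.get()
            if item is _DONE:
                return "I'm done!"      # becomes the StopIteration value
            result = yield item         # the driver's send() lands here
            self.to_worker.put(result)

# The same driving process as above now works with `first`:
automaton = ThreadedAutomaton()
generator = automaton.start(first)
next_yield = generator.__next__()
for step in range(4):
    next_yield = generator.send("Continue...({})".format(step))
try:
    end = generator.send("over")
except StopIteration as excp:
    print(excp)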
I've been writing Python code for a Raspberry Pi 3 Model B to read values received over Bluetooth LE.
I can read the correct values with:
child.sendline("char-read-hnd handle")
child.expect("Characteristic value/descriptor: ", timeout=5)
What I am trying to do now is to check for notifications at any time, so I have a Thread that searches for the expected pattern "Notification handle=" like this:
def run():
    patterns = ['Notification handle=', 'Indication handle=']
    while True:
        try:
            matched_pattern_index = child.expect(patterns, timeout=1)
            if matched_pattern_index in {0, 1}:
                print("Received Notification")
                handleNotification()
        except pexpect.TIMEOUT:
            pass
Now, during my main code, I am always doing some child.sendline to check for new values, as well as child.expect. The problem is that the Thread's call to pexpect's expect above blocks the other expect calls in my code.
I already tried creating a second child, similar to the first one, to use inside the Thread, but the result is the same.
So, does anyone have any idea how I can achieve this?
Any help would be appreciated.
Thanks in advance
I was thinking of subclassing pexpect.spawn and providing an expect_before(p,c) method that would save a list of patterns p and callback functions c, then override expect(p2) to prefix the list p onto the list p2 of that call before calling the real spawn.expect function.
If the real function then returns a match index i that is within the size of the list p, we can call function c[i] and loop again. When the index is beyond that list, we adjust it to be an index in list p2 and return from the call. Here's an approximation:
#!/usr/bin/python
# https://stackoverflow.com/q/51622025/5008284
from __future__ import print_function
import sys, pexpect

class Myspawn(pexpect.spawn):
    def __init__(self, *args, **kwargs):
        pexpect.spawn.__init__(self, *args, **kwargs)

    def expect_before(self, patterns, callbacks):
        self.eb_pat = patterns
        self.eb_cb = callbacks

    def expect(self, patlist, *args, **kwargs):
        numbefore = len(self.eb_pat)
        while True:
            rc = pexpect.spawn.expect(self, self.eb_pat + patlist, *args, **kwargs)
            if rc >= numbefore:
                break
            self.eb_cb[rc]()  # call the matching callback, then keep waiting
        return rc - numbefore

def handler1():  # stub handlers so the example runs
    print("matched device 1")

def handler2():
    print("matched device 2")

def test():
    child = Myspawn("sudo bluetoothctl", logfile=sys.stdout)
    patterns = ['Device FC:A8:9A:..:..:..', 'Device C8:FD:41:..:..:..']
    callbacks = [lambda: handler1(), lambda: handler2()]
    child.expect_before(patterns, callbacks)
    child.sendline('scan on')
    while True:
        child.expect(['[bluetooth].*?#'], timeout=5)
I have a dialog window containing check-boxes; when each of them is checked, a particular class needs to be instantiated and a task run on a separate thread (one for each check-box). I have 14 check-boxes, and checking the returned Boolean of the .isChecked() property for each of them individually is understandably inefficient and requires a lot more coding.
Hence I decided to get all the child items corresponding to check-box elements, keep just those that are checked, append their names to a list, and loop through that list, matching each name against a dictionary whose keys are the check-box names and whose values are the corresponding classes to instantiate.
EXAMPLE:
# class dictionary
self.summary_runnables = {'dupStreetCheckBox': [DupStreetDesc(), 0],
                          'notStreetEsuCheckBox': [StreetsNoEsuDesc(), 1],
                          'notType3CheckBox': [Type3Desc(False), 2],
                          'incFootPathCheckBox': [Type3Desc(True), 2],
                          'dupEsuRefCheckBox': [DupEsuRef(True), 3],
                          'notEsuStreetCheckBox': [NoLinkEsuStreets(), 4],
                          'invCrossRefCheckBox': [InvalidCrossReferences()],
                          'startEndCheckBox': [CheckStartEnd(tol=10), 8],
                          'tinyEsuCheckBox': [CheckTinyEsus("esu", 1)],
                          'notMaintReinsCheckBox': [CheckMaintReins()],
                          'asdStartEndCheckBox': [CheckAsdCoords()],
                          'notMaintPolysCheckBox': [MaintNoPoly(), 16],
                          'notPolysMaintCheckBox': [PolyNoMaint()],
                          'tinyPolysCheckBox': [CheckTinyEsus("rd_poly", 1)]}
# looping through list
self.long_task = QThreadPool(None).globalInstance()
self.long_task.setMaxThreadCount(1)
start_report = StartReport(val_file_path)
end_report = EndReport()
# start_report.setAutoDelete(False)
# end_report.setAutoDelete(False)
end_report.signals.result.connect(self.log_progress)
end_report.signals.finished.connect(self.show_finished)
# end_report.setAutoDelete(False)
start_report.signals.result.connect(self.log_progress)
self.long_task.start(start_report)
# print str(self.check_boxes_names)
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    if run_class[0].__class__.__name__ is 'CheckStartEnd':
        run_class[0].tolerance = tolerance
    runnable = run_class[0]()
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
self.long_task.start(end_report)
Example of a runnable (even if some of them use different global functions):
I can't post the global functions that write content to the file, as there are too many and not all 14 tasks execute the same type of function. The arguments of these functions are int keys into other dictionaries that contain the report's static content and the SQL queries that return the report's main dynamic contents.
class StartReport(QRunnable):
    def __init__(self, file_path):
        super(StartReport, self).__init__()
        # open the db connection in thread
        db.open()
        self.signals = GeneralSignals()
        # self.simple_signal = SimpleSignal()
        # print self.signals.result
        self.file_path = file_path
        self.task = "Starting Report"
        self.progress = 1
        self.org_name = org_name
        self.user = user
        self.report_title = "Validation Report"
        print "instantiation of start report "

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if self.file_path is None:
            print "I started and found file none "
            return
        else:
            global report_file
            # create the file and print the header
            report_file = open(self.file_path, 'wb')
            report_file.write(str(self.report_title) + ' for {0} \n'.format(self.org_name))
            report_file.write('Created on : {0} at {1} By : {2} \n'.format(datetime.today().strftime("%d/%m/%Y"),
                                                                           datetime.now().strftime("%H:%M"),
                                                                           str(self.user)))
            report_file.write(
                "------------------------------------------------------------------------------------------ \n \n \n \n")
            report_file.flush()
            os.fsync(report_file.fileno())
class EndReport(QRunnable):
    def __init__(self):
        super(EndReport, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Finishing report"
        self.progress = 100

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is not None:
            # write footer and close file
            report_file.write("\n \n \n")
            report_file.write("---------- End of Report -----------")
            report_file.flush()
            os.fsync(report_file.fileno())
            report_file.close()
            self.signals.finished.emit()
            # TODO: check whether opening a db connection in the thread might affect the db on the GUI
            # if db.isOpen():
            #     db.close()
        else:
            return
class DupStreetDesc(QRunnable):
    """
    duplicate street description report section creation
    :return: void if the report is to text
             list[string] if the report is to screen
    """
    def __init__(self):
        super(DupStreetDesc, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Checking duplicate street descriptions..."
        self.progress = 16.6

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is None:
            print "report file is none "
            # items_list = write_content(0, 0, 0, 0)
            # for item in items_list:
            #     self.signals.list.emit(item)
        else:
            write_content(0, 0, 0, 0)
Now, I have used this approach before and it has always worked fine without using multiprocessing. In this case it works to some extent: I can run the tasks the first time, but if I try to run them a second time I get the following Python error:
self.long_task.start(run_class[0])
RuntimeError: wrapped C/C++ object of type DupStreetDesc has been deleted
I tried to use run_class[0].setAutoDelete(False) before running them in the loop, but PyQt crashes with a minidump error (I am running the code in QGIS) and the program exits, leaving little chance to understand what happened.
On the other hand, if I run my classes separately, checking each check-box with an if/else statement, then it works fine: I can run the tasks again and the C++ classes are not deleted. But it isn't a nice coding approach, at least from my very little experience.
Is there anyone out there who can advise a different approach to make this run smoothly without too many lines of code? Or does anyone know a more efficient pattern for handling this problem, which I think must be quite common?
It seems that you should create a new instance of each runnable, and allow Qt to automatically delete it. So your dictionary entries could look like this:
'dupStreetCheckBox': [lambda: DupStreetDesc(), 0],
and then you can do:
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    runnable = run_class[0]()
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
I don't know why setAutoDelete does not work (assuming you are calling it before starting the threadpool). I suppose there might be a bug, but it's impossible to be sure without having a fully-working example to test.
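Following that pattern through, the whole dictionary would store zero-argument factories, and the CheckStartEnd special case can move into its own factory instead of mutating a shared instance inside the loop. A sketch, assuming tolerance is available where the dictionary is built:
# Each entry stores a factory, so a fresh runnable is created on every run
# and Qt can safely auto-delete it once the thread pool is done with it.
self.summary_runnables = {'dupStreetCheckBox': [lambda: DupStreetDesc(), 0],
                          'startEndCheckBox': [lambda: CheckStartEnd(tol=tolerance), 8],
                          'tinyEsuCheckBox': [lambda: CheckTinyEsus("esu", 1)],
                          # ... and so on for the remaining entries
                          }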