I'm trying to build a list of parent/comment pairs from the publicly available Reddit data set.
I have a CSV file that I load into a pandas DataFrame. Each row holds a comment together with its own ID, its parent's ID, its score and its controversiality. The data is loaded using the following block of code:
import os
import time  # needed for time.time() further down
import multiprocessing as mp
import numpy as np
import pandas as pd
sourcePATH = 'C:\\'  # a raw string literal cannot end in a single backslash
workingFILE = 'output-pt1.csv'
# filepaths
input_file = os.path.join(sourcePATH, workingFILE)
data_df = pd.read_csv(input_file,header=None,names=['PostIDX','ParentIDX','Comment','Score','Controversiality'])
The aim is to scan through each row in the dataframe and use its parent ID to search the rest of the dataframe for the parent comment. If the parent is present, I store the child and parent comments in a tuple with some other information. The tuple is added to a list, which is written out to a CSV file at the end. To do this I use the following code:
def checkChildParent(ParentIDX_curr, ChildIDX_curr,ChildComment_curr,ChildScore_curr,ChildCont_curr):
    idx = data_df.loc[data_df['PostIDX'] == ParentIDX_curr]
    if idx.empty is False:
        ParentComment = idx.iloc[0,2]
        ParentScore = idx.iloc[0,3]
        ParentCont = idx.iloc[0,4]
        outPut.put([ParentIDX_curr[0], ParentComment,ParentScore,ParentCont,ChildIDX_curr[0], ChildComment_curr[0],ChildScore_curr[0],ChildCont_curr[0]])
if __name__ == '__main__':
    print('Process started')
    t_start_init = time.time()
    t_start = time.time()
    noCores = 1
    #pool = mp.Pool(processes=noCores)
    update_freq = 100
    n = 1000
    #n = round(len(data_df)/8)
    flag_create = 0
    flag_run = 0
    i = 0
    outPut = mp.Queue()
    #parent_child_df = pd.DataFrame()
    #parent_child_df.coumns = ['PostIDX','ParentIDX']
    while i < n:
        #print(i)
        procs = []
        ParentIDX = []
        ParentComment = []
        ParentScore = []
        ParentCont = []
        ChildIDX = []
        ChildComment = []
        ChildScore = []
        ChildCont = []
        for worker in range(0,noCores):
            ParentIDX.append(data_df.iloc[i,1])
            ChildIDX.append(data_df.iloc[i,0])
            ChildComment.append(data_df.iloc[i,2])
            ChildScore.append(data_df.iloc[i,3])
            ChildCont.append(data_df.iloc[i,4])
            i = i + 1
        #when I call the function this way it returns the expected matches
        #checkChildParent(ParentIDX,ChildIDX,ChildComment,
        #                 ChildScore,ChildCont)
        #when I call the function with Process function nothing appears to be happening
        for proc in range(0,noCores):
            p = mp.Process(target = checkChildParent, args=(ParentIDX[proc],ChildIDX[proc],ChildComment[proc],ChildScore[proc],ChildCont[proc]))
            procs.append(p)
            p.start()
        #for p in procs:
        #    p.join()
        if outPut.empty() is False:
            print(outPut.get())
At the top of the file is a function which scans the dataframe for a given row and returns the tuple of the matched parent and child comment if one was found. If I call this function normally it works fine, but when I call it through Process it doesn't match anything! I'm guessing the issue is the form of the arguments being passed to the function, but I have been trying to debug this all afternoon and have failed so far. If anyone has any suggestions then please let me know!
Thanks!
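For comparison, here is a minimal sketch of the same parent/child pairing done without any processes, as a self-join with DataFrame.merge. This is a swapped-in technique, not the code from the question. The tiny table and its IDs are made up, and it assumes every PostIDX is unique (the loop above only ever takes the first match).

```python
import pandas as pd

# Tiny stand-in for the comment table; the IDs and texts are made up.
data_df = pd.DataFrame({
    'PostIDX': ['a', 'b', 'c', 'd'],
    'ParentIDX': ['x', 'a', 'a', 'q'],
    'Comment': ['root', 'reply 1', 'reply 2', 'orphan'],
    'Score': [5, 2, 3, 1],
    'Controversiality': [0, 0, 1, 0],
})

# Rename the columns of a copy so that they describe the parent side of a pair.
parents = data_df.drop(columns='ParentIDX').rename(columns={
    'PostIDX': 'ParentIDX',
    'Comment': 'ParentComment',
    'Score': 'ParentScore',
    'Controversiality': 'ParentCont',
})

# Inner join: keeps only the children whose parent is present in the table.
pairs = data_df.merge(parents, on='ParentIDX', how='inner')
print(pairs[['ParentIDX', 'ParentComment', 'PostIDX', 'Comment']])
```

The resulting frame can be written out with pairs.to_csv(...), which replaces both the queue and the final list.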
I have a multidimensional array that needs to be processed with an imported function. (I am using a Jupyter notebook, so I exported the function to an ipynb file and imported it again.) The function takes a 1-dimensional array as its argument.
#Function
def calculatespi(datagrid,q):
    date_time = datagrid['time'][:]
    gridvalue = datagrid.values
    if np.isnan(np.sum(gridvalue)) != True:
        df_precip = pd.DataFrame({"Date": date_time,"precip":gridvalue})
        spi_prc = spi.SPI()
        spi3_grid = spi_prc.calculate(df_precip, 'Date', 'precip', freq = 'M', scale = 3, fit_type ="lmom", dist_type="gam")
        spi3 = spi3_grid['precip_scale_3_calculated_index'].values
    else:
        spi3 = np.empty((489))
        spi3[:] = np.nan
    q.put(spi3)
#Main Notebook
if __name__ == "__main__":
    spipi = []
    processes = []
    for x in range (3):
        for y in range(3):
            q = multiprocessing.Queue()
            p = multiprocessing.Process(target=calculatespi, args= (prcoba[:,x,y],q))
            p.start()
            processes.append(p)
            spipi.append(q.get())
    for process in processes:
        process.join()
After hundreds of attempts I can finally retrieve the results, but it takes longer than running the same calculation without multiprocessing. What should I do?
Using concurrent.futures.ProcessPoolExecutor makes things much easier.
First, in calculatespi, replace q.put(spi3) with return spi3 and remove the q parameter. Then the "main" code can be written as
#Main Notebook
if __name__ == "__main__":
    from concurrent.futures import ProcessPoolExecutor
    args = []
    for x in range (3):
        for y in range(3):
            args.append(prcoba[:,x,y])
    with ProcessPoolExecutor() as executor:
        spipi = list(executor.map(calculatespi, args))
The executor takes care of everything else.
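The same pattern as a self-contained sketch that can be run without the SPI package. Here column_mean is a hypothetical stand-in for calculatespi, grid_columns is a helper name I made up, and the 4x3x3 cube is invented data. It only shows the shape of the approach: a worker that returns its result, mapped over every cube[:, x, y] series.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def column_mean(series):
    """Hypothetical stand-in for calculatespi: reduce one 1-D series to a value."""
    if np.isnan(series).any():
        return float('nan')
    return float(series.mean())


def grid_columns(cube):
    """Yield the 1-D time series cube[:, x, y] for every grid cell."""
    for x in range(cube.shape[1]):
        for y in range(cube.shape[2]):
            yield cube[:, x, y]


if __name__ == '__main__':
    cube = np.arange(36, dtype=float).reshape(4, 3, 3)  # made-up data
    with ProcessPoolExecutor() as executor:
        # map() hands the results back in the same order as the inputs.
        results = list(executor.map(column_mean, grid_columns(cube)))
    print(results)
```

Because map() preserves input order, the flat result list can be reshaped back to the (x, y) grid afterwards.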
I'm a hobby coder who started with AHK, then some Java, and now I'm trying to learn Python. I have searched and found some tips, but I have not yet been able to implement them in my own code.
Hopefully someone here can help me; it's a very short program.
I'm using a .txt CSV database with ";" as the separator.
DATABASE EXAMPLE:
Which color is normally a cat?;Black
How tall was the longest man on earth?;272 cm
Is the earth round?;Yes
The database now consists of 20,000 lines, which makes the program too slow, as it only uses 25% CPU (1 core).
If I can make it use all 4 cores (100%), I guess it would perform the task a lot faster. The task is basically to compare the clipboard with the database and, if there is a match, give me the answer in return. Perhaps I can also separate the database into 4 pieces?
The code currently looks like this. It is no more than 65 lines and it does its job (but too slowly). I need advice on how to make this process use multiple cores.
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy
ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
def load_db():
    while True:
        try:
            # Read and create database
            db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
            db = db.drop_duplicates()
            return db
        except:
            print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
            time.sleep(fall_back_time)
def top_answers(db, question):
    db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
    db_sorted = db.sort_values(by='ratio', ascending=False)
    db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
    return db_sorted
def write_txt(top):
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar.txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")
def main():
    try:
        db = load_db()
        last_db_reload = time.time()
        while True:
            # Get contents of clipboard
            question = pp.paste()
            # Rank answer
            top = top_answers(db, question)
            # If answer was found, show results
            if len(top) > 0:
                write_txt(top)
            time.sleep(fall_back_time)
    except:
        print("Error in main(). Will sleep for %i seconds..." % fall_back_time)
        time.sleep(fall_back_time)
if __name__ == '__main__':
    main()
If you could divide the db into four equally large parts, you could process them in parallel like this:
import time
import pyperclip as pp
import pandas as pd
import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy
import threading
ratio_threshold = 90
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
def worker(thread_id, question):
    thread_id = str(thread_id)
    # Each worker reads its own part: database.txt1, database.txt2, ...
    db = pd.read_csv(db_file_path + thread_id, sep=db_separator, encoding=db_encoding)
    db = db.drop_duplicates()
    db['ratio'] = db['question'].apply(lambda q: fuzz.ratio(q, question))
    db_sorted = db.sort_values(by='ratio', ascending=False)
    db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
    top = db_sorted
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar" + thread_id + ".txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")
    return
def main():
    question = pp.paste()
    threads = []
    # range(1, 5) gives the four parts 1..4
    for i in range(1, 5):
        t = threading.Thread(target=worker, args=(i, question))
        t.start()
        threads.append(t)
    # Join only after all threads have been started; joining inside the
    # start loop would run the workers one after another.
    # Note: CPython threads share the GIL, so CPU-bound scoring may not speed up.
    for t in threads:
        t.join()
if __name__ == '__main__':
    main()
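The worker above expects the parts to exist already as database.txt1, database.txt2 and so on. A minimal sketch of how they could be produced, assuming the first line of database.txt is a header such as question;answer that every part needs to repeat (split_database is a name I made up):

```python
from pathlib import Path


def split_database(path, parts=4):
    """Write path1 .. pathN, each starting with the original header line."""
    lines = Path(path).read_text(encoding='latin-1').splitlines(keepends=True)
    header, rows = lines[0], lines[1:]
    size = -(-len(rows) // parts)  # ceiling division
    written = []
    for k in range(parts):
        chunk = rows[k * size:(k + 1) * size]
        target = Path(f'{path}{k + 1}')
        target.write_text(header + ''.join(chunk), encoding='latin-1')
        written.append(target)
    return written


if __name__ == '__main__':
    # Assumes database.txt exists in the working directory.
    print(split_database('database.txt'))
```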
The solution with multiprocessing:
import time
import pyperclip as pp
import pandas as pd
#import pymsgbox as pmb
from fuzzywuzzy import fuzz
import numpy as np
# pathos uses a better pickle (dill) to transfer more complicated objects
from pathos.multiprocessing import Pool
import sys
import os
from contextlib import closing
ratio_threshold = 70
fall_back_time = 1
db_file_path = 'database.txt'
db_separator = ';'
db_encoding = 'latin-1'
chunked_db = []
NUM_PROCESSES = os.cpu_count()
def load_db():
    while True:
        try:
            # Read and create database
            db = pd.read_csv(db_file_path, sep=db_separator, encoding=db_encoding)
            db.columns = ['question', 'answer']
            #db = db.drop_duplicates() # dropped for the experiment
            break
        except:
            print("Error in load_db(). Will sleep for %i seconds..." % fall_back_time)
            time.sleep(fall_back_time)
    # split database into roughly equal chunks:
    # (if you have a lot of RAM, otherwise you
    # need to compute ranges in db, something like
    # chunk_size = len(db)//NUM_PROCESSES
    # ranges[i] = (i*chunk_size, (i+1)*chunk_size)
    # and pass ranges in original db to processes)
    # Note: np.split(db, [NUM_PROCESSES]) would cut once at row NUM_PROCESSES
    # and give two uneven pieces, so the rows are sliced explicitly instead.
    chunk_size = max(1, -(-len(db) // NUM_PROCESSES))  # ceiling division
    chunked_db = [db.iloc[i:i + chunk_size].copy() for i in range(0, len(db), chunk_size)]
    return chunked_db
def top_answers_multiprocessed(question, chunked_db):
    # On Linux, Python has traditionally used 'fork' mode by default,
    # so the process has 'copy-on-change' access to all global variables,
    # i.e. if a process changes something in db, it will be copied to it
    # with a lot of overhead.
    # Unfortunately, I've heard that on Windows only 'spawn' mode with a full
    # copy of everything is used.
    # The process pipeline uses pickle, which is quite slow,
    # so on a small database you may not benefit from multiprocessing.
    # If you are going to transfer big objects in or out, look
    # in the direction of multiprocessing.Array.
    # This solution is not fully efficient,
    # as the pool is recreated each time.
    # You can create daemon processes which will monitor a
    # Queue for incoming questions, but that is harder to implement.
    def top_answers(idx):
        # question is in the scope of the parent function
        chunked_db[idx]['ratio'] = chunked_db[idx]['question'].apply(lambda q: fuzz.ratio(q, question))
        db_sorted = chunked_db[idx].sort_values(by='ratio', ascending=False)
        db_sorted = db_sorted[db_sorted['ratio'] >= ratio_threshold]
        return db_sorted
    with closing(Pool(processes=NUM_PROCESSES)) as pool:
        # chunked_db is a list of databases;
        # they are in global scope, and we send only the index because
        # everything that is sent gets pickled
        num_chunks = len(chunked_db)
        # apply function top_answers across generator range(num_chunks)
        res = pool.imap_unordered(top_answers, range(num_chunks))
        res = list(res)
    # now res is a list of dataframes; stack them and re-sort by ratio
    # (merging on 'ratio' would join unrelated rows that share a score)
    res_final = pd.concat(res).sort_values(by='ratio', ascending=False)
    return res_final
def write_txt(top):
    result = top.apply(lambda row: "%s" % (row['answer']), axis=1).tolist()
    result = '\n'.join(result)
    fileHandle = open("svar.txt", "w")
    fileHandle.write(result)
    fileHandle.close()
    pp.copy("")
def mainfunc():
    global chunked_db
    chunked_db = load_db()
    last_db_reload = time.time()
    print('db loaded')
    last_clip = ""
    while True:
        # Get contents of clipboard
        try:
            new_clip = pp.paste()
        except:
            continue
        if (new_clip != last_clip) and (len(new_clip)> 0):
            print(new_clip)
            last_clip = new_clip
            question = new_clip.strip()
        else:
            continue
        # Rank answer
        top = top_answers_multiprocessed(question, chunked_db)
        # If answer was found, show results
        if len(top) > 0:
            #write_txt(top)
            print(top)
if __name__ == '__main__':
    mainfunc()
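The comments above mention that recreating the pool for every question is wasteful. Below is a minimal sketch of the chunk-and-score idea with a pool that is created once and reused. It uses only the standard library: difflib.SequenceMatcher stands in for fuzz.ratio (its scores are similar in spirit but not guaranteed to be identical), and the function names and the two sample rows are my own. Note that this version pickles each chunk on every call, so for a large database it may pay off to load the chunks inside the workers instead.

```python
from difflib import SequenceMatcher
from multiprocessing import Pool


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def make_chunks(rows, parts):
    """Split rows into at most `parts` consecutive, nearly equal slices."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def best_in_chunk(task):
    """Return (score, question, answer) for every row at or above the threshold."""
    question, threshold, chunk = task
    scored = [(similarity(q, question), q, a) for q, a in chunk]
    return [hit for hit in scored if hit[0] >= threshold]


def search(pool, rows, question, threshold=90, parts=4):
    """Score all chunks in parallel and return the hits, best first."""
    tasks = [(question, threshold, chunk) for chunk in make_chunks(rows, parts)]
    hits = [hit for part in pool.map(best_in_chunk, tasks) for hit in part]
    return sorted(hits, reverse=True)


if __name__ == '__main__':
    rows = [('Is the earth round?', 'Yes'),
            ('Which color is normally a cat?', 'Black')]
    with Pool(processes=2) as pool:  # created once, reused for every question
        print(search(pool, rows, 'Is the earth round?'))
```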
I have code in which I need to read an Excel file and store the information in dictionaries.
I have to use multiprocessing.Manager() to create the dictionaries in order to be able to retrieve calculation output from a function that I run using multiprocessing.Process.
The problem is that creating a dictionary with multiprocessing.Manager() and manager.dict() takes ~400 times longer than using only dict() (and dict() is not a shared memory structure).
Here is sample code to verify the difference:
import xlrd
import multiprocessing
import time
def DictManager(inp1, inp2):
    manager = multiprocessing.Manager()
    Dict = manager.dict()
    Dict['input1'] = inp1
    Dict['input2'] = inp2
    Dict['Output1'] = None
    Dict['Output2'] = None
    return Dict
def DictNoManager(inp1, inp2):
    Dict = dict()
    Dict['input1'] = inp1
    Dict['input2'] = inp2
    Dict['Output1'] = None
    Dict['Output2'] = None
    return Dict
def ReadFileManager(excelfile):
    DictList = []
    book = xlrd.open_workbook(excelfile)
    sheet = book.sheet_by_index(0)
    line = 2
    for line in range(2,sheet.nrows):
        inp1 = sheet.cell(line,2).value
        inp2 = sheet.cell(line,3).value
        dictionary = DictManager(inp1, inp2)
        DictList.append(dictionary)
    print('Done!')
def ReadFileNoManager(excelfile):
    DictList = []
    book = xlrd.open_workbook(excelfile)
    sheet = book.sheet_by_index(0)
    line = 2
    for line in range(2,sheet.nrows):
        inp1 = sheet.cell(line,2).value
        inp2 = sheet.cell(line,3).value
        dictionary = DictNoManager(inp1, inp2)
        DictList.append(dictionary)
    print('Done!')
if __name__ == '__main__':
    excelfile = 'MyFile.xlsx'
    start = time.time()
    ReadFileNoManager(excelfile)
    end = time.time()
    print('Run time NoManager:', end - start, 's')
    start = time.time()
    ReadFileManager(excelfile)
    end = time.time()
    print('Run time Manager:', end - start, 's')
Is there a way to improve the performance of multiprocessing.Manager()?
If the answer is No, is there any other shared memory structure that I can use to replace what I am doing and improve performance?
I would appreciate your help!
EDIT:
My main function uses the following code:
def MyFunction(Dict, otherdata):
    #Perform calculation and save results in the dictionary
    Dict['Output1'] = Value1
    Dict['Output2'] = Value2
ListOfProcesses = []
for Dict in DictList:
    p = multiprocessing.Process(target=MyFunction, args=(Dict, otherdata))
    p.start()
    ListOfProcesses.append(p)
for p in ListOfProcesses:
    p.join()
If I do not use the manager, I will not be able to retrieve the Outputs.
As I mentioned in the comments, I recommend using the main process to read in the Excel file and then using multiprocessing for the function calls. Just put your calculation into apply_function and make sure it returns whatever you want. results will then contain a list of your results.
Update: I changed map to starmap to include your extra argument.
import multiprocessing
from itertools import repeat  # needed for repeat(otherdata) below
import xlrd
# DictNoManager is the helper from the question
def ReadFileNoManager(excelfile):
    DictList = []
    book = xlrd.open_workbook(excelfile)
    sheet = book.sheet_by_index(0)
    line = 2
    for line in range(2,sheet.nrows):
        inp1 = sheet.cell(line,2).value
        inp2 = sheet.cell(line,3).value
        dictionary = DictNoManager(inp1, inp2)
        DictList.append(dictionary)
    print('Done!')
    return DictList
def apply_function(your_dict, otherdata):
    pass
if __name__ == '__main__':
    excelfile = 'MyFile.xlsx'
    otherdata = None  # placeholder for the extra argument from the question
    dict_list = ReadFileNoManager(excelfile)
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    results = pool.starmap(apply_function, zip(dict_list, repeat(otherdata)))
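To make the "return whatever you want" part concrete, here is a minimal runnable sketch with plain dicts and no Excel file. The arithmetic inside apply_function, the sample dicts and the value 10 for otherdata are invented for illustration. The point is only that each worker returns a filled-in copy, so no Manager is needed to get the outputs back.

```python
from itertools import repeat
from multiprocessing import Pool


def apply_function(record, otherdata):
    """Hypothetical calculation: return a new dict with the outputs filled in."""
    result = dict(record)  # work on a copy; the worker's changes travel back via return
    result['Output1'] = record['input1'] + otherdata
    result['Output2'] = record['input2'] * otherdata
    return result


if __name__ == '__main__':
    dict_list = [{'input1': 1, 'input2': 2, 'Output1': None, 'Output2': None},
                 {'input1': 3, 'input2': 4, 'Output1': None, 'Output2': None}]
    with Pool() as pool:
        # starmap unpacks each (record, otherdata) pair into two arguments
        results = pool.starmap(apply_function, zip(dict_list, repeat(10)))
    print(results)
```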
For testing purposes I start only one process. One of the arguments is an array that is supposed to be changed by that process.
class Engine():
    Ready = Value('i', False)
    def movelisttoctypemovelist(self, movelist):
        ctML = []
        for zug in movelist:
            ctZug = ctypeZug()
            ctZug.VonReihe = zug.VonReihe
            ctZug.VonLinie = zug.VonLinie
            ctZug.NachReihe = zug.NachReihe
            ctZug.NachLinie = zug.NachLinie
            ctZug.Bewertung = zug.Bewertung
            ctML.append(ctZug)
        return ctML
    def findbestmove(self, board, settings, enginesettings):
        print ("Computer using", multiprocessing.cpu_count(),"Cores.")
        movelist = Array(ctypeZug, [], lock = True)
        movelist = self.movelisttoctypemovelist(board.movelist)
        bd = board.boardtodictionary()
        process = []
        for i in range(1):
            p = Process(target=self.calculatenullmoves, args=(bd, movelist, i, self.Ready))
            process.append(p)
            p.start()
        for p in process:
            p.join()
        self.printctypemovelist(movelist, settings)
        print ("Ready:", self.Ready.value)
    def calculatenullmoves(self, boarddictionary, ml, processindex, ready):
        currenttime = time()
        print ("Process", processindex, "begins to work...")
        board = Board()
        board.dictionarytoboard(boarddictionary)
        ...
        ml[processindex].Bewertung = 2.4
        ready.value = True
        print ("Process", processindex, "finished work in", time()-currenttime, "sec")
    def printctypemovelist(self, ml):
        for zug in ml:
            print (zug.VonReihe, zug.VonLinie, zug.NachReihe, zug.NachLinie, zug.Bewertung)
I try to write 2.4 directly into the list, but no change shows up when calling "printctypemovelist".
Setting "Ready" to True, on the other hand, works.
I used information from http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.sharedctypes
I hope someone can find my mistake. If it is too difficult to read, please let me know.
The problem is that you're trying to share a plain Python list:
ctML = []
Use a proxy object instead:
from multiprocessing import Manager
ctML = Manager().list()
See Python doc on Sharing state between processes for more detail.
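As an alternative that stays with the sharedctypes approach the question links to, here is a minimal sketch that fills the shared Array directly instead of replacing it with a plain list. Move is a hypothetical, cut-down stand-in for ctypeZug with invented field names, and make_shared_moves and rate_move are names I made up. Initialising the Array from tuples follows the pattern in the multiprocessing documentation.

```python
import ctypes
from multiprocessing import Array, Process


class Move(ctypes.Structure):
    # Hypothetical stand-in for ctypeZug; only two fields are mirrored.
    _fields_ = [('from_row', ctypes.c_int), ('rating', ctypes.c_double)]


def make_shared_moves(plain_moves):
    """Copy (from_row, rating) tuples into a lock-protected shared array."""
    return Array(Move, plain_moves, lock=True)


def rate_move(shared_moves, index, rating):
    """Write a rating into one slot; may be called from a child process."""
    with shared_moves.get_lock():
        shared_moves[index].rating = rating


if __name__ == '__main__':
    moves = make_shared_moves([(1, 0.0), (2, 0.0)])
    p = Process(target=rate_move, args=(moves, 0, 2.4))
    p.start()
    p.join()
    # The parent reads the value that the child wrote into shared memory.
    print([(moves[i].from_row, moves[i].rating) for i in range(len(moves))])
```

The key difference from the code in the question is that the name holding the Array is never rebound to an ordinary list, so the child writes into memory the parent can see.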
I am filtering huge text files using multiprocessing. The code basically opens a text file, works on it, then closes it.
The thing is, I'd like to be able to launch it successively on multiple text files. Hence, I tried to add a loop, but for some reason it doesn't work (while the code works on each file individually). I believe this is an issue with:
if __name__ == '__main__':
However, I am looking for something else. I tried to create a Launcher and a LauncherCount files like this:
LauncherCount.py:
def setLauncherCount(n):
    global LauncherCount
    LauncherCount = n
and,
Launcher.py:
import os
import LauncherCount
LauncherCount.setLauncherCount(0)
os.system("OrientedFilterNoLoop.py")
LauncherCount.setLauncherCount(1)
os.system("OrientedFilterNoLoop.py")
...
I import LauncherCount.py and use LauncherCount.LauncherCount as my loop index.
Of course, this doesn't work either: it only edits the variable LauncherCount.LauncherCount locally, so the change is not visible in the copy of LauncherCount that the launched script imports.
Is there any way to globally edit a variable in an imported file? Or is there any other way to do this? What I need is to run the same code multiple times, changing one value each time, and apparently without using a loop.
Thanks!
Edit: Here is my main code if necessary. Sorry for the bad style ...
import multiprocessing
import config
import time
import LauncherCount
class Filter:
    """ Filtering methods """
    def __init__(self):
        print("launching methods")
    # Return the list: [Latitude,Longitude] (elements are floating point numbers)
    def LatLong(self,line):
        comaCount = []
        comaCount.append(line.find(','))
        comaCount.append(line.find(',',comaCount[0] + 1))
        comaCount.append(line.find(',',comaCount[1] + 1))
        Lat = line[comaCount[0] + 1 : comaCount[1]]
        Long = line[comaCount[1] + 1 : comaCount[2]]
        try:
            return [float(Lat) , float(Long)]
        except ValueError:
            return [0,0]
    # Return a boolean:
    # - True if the Lat/Long is within the Lat/Long rectangle defined by:
    #   tupleFilter = (minLat,maxLat,minLong,maxLong)
    # - False if not
    def LatLongFilter(self,LatLongList , tupleFilter) :
        if (tupleFilter[0] <= LatLongList[0] <= tupleFilter[1] and
                tupleFilter[2] <= LatLongList[1] <= tupleFilter[3]):
            return True
        else:
            return False
    def writeLine(self,key,line):
        filterDico[key][1].write(line)
def filteringProcess(dico):
    myFilter = Filter()
    while True:
        try:
            currentLine = readFile.readline()
        except ValueError:
            break
        if len(currentLine) ==0: # Breaks at the end of the file
            break
        if len(currentLine) < 35: # Deletes wrong lines (too short)
            continue
        LatLongList = myFilter.LatLong(currentLine)
        for key in dico:
            if myFilter.LatLongFilter(LatLongList,dico[key][0]):
                myFilter.writeLine(key,currentLine)
###########################################################################
# Main
###########################################################################
# Open read files:
readFile = open(config.readFileList[LauncherCount.LauncherCount][1], 'r')
# Generate writing files:
pathDico = {}
filterDico = config.filterDico
# Create outputs
for key in filterDico:
    output_Name = (config.readFileList[LauncherCount.LauncherCount][0][:-4]
                   + '_' + key +'.log')
    pathDico[output_Name] = config.writingFolder + output_Name
    filterDico[key] = [filterDico[key],open(pathDico[output_Name],'w')]
p = []
CPUCount = multiprocessing.cpu_count()
CPURange = range(CPUCount)
startingTime = time.localtime()
if __name__ == '__main__':
    ### Create and start processes:
    for i in CPURange:
        p.append(multiprocessing.Process(target = filteringProcess ,
                                         args = (filterDico,)))
        p[i].start()
    ### Kill processes:
    while True:
        if [p[i].is_alive() for i in CPURange] == [False for i in CPURange]:
            readFile.close()
            for key in config.filterDico:
                config.filterDico[key][1].close()
                print(key,"is Done!")
            endTime = time.localtime()
            break
    print("Process started at:",startingTime)
    print("And ended at:",endTime)
To process groups of files in sequence while working on files within a group in parallel:
#!/usr/bin/env python
from multiprocessing import Pool
def work_on(args):
    """Process a single file."""
    i, filename = args
    print("working on %s" % (filename,))
    return i
def files():
    """Generate input filenames to work on."""
    #NOTE: you could read the file list from a file, get it using glob.glob, etc
    yield "inputfile1"
    yield "inputfile2"
def process_files(pool, filenames):
    """Process filenames using pool of processes.
    Wait for results.
    """
    for result in pool.imap_unordered(work_on, enumerate(filenames)):
        #NOTE: in general the files won't be processed in the original order
        print(result)
def main():
    p = Pool()
    # to do "successive" multiprocessing
    for filenames in [files(), ['other', 'bunch', 'of', 'files']]:
        process_files(p, filenames)
if __name__=="__main__":
    main()
Each call to process_files() starts only after the previous call has completed, i.e., the files from different calls to process_files() are not processed in parallel.
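On the Launcher.py idea from the question: os.system starts a separate interpreter, so a module global set in the launcher is not visible in the launched script. One way to hand a value to each run is the command line. This is a minimal sketch under that assumption: the child script here is a made-up three-line stand-in written to a temporary folder, not OrientedFilterNoLoop.py, and run_once and run_all are names I made up.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child script: reads its loop index from the command line.
CHILD_SOURCE = "import sys\nindex = int(sys.argv[1])\nprint('filtering file', index)\n"


def run_once(script_path, index):
    """Run the script in a fresh interpreter and return what it printed."""
    completed = subprocess.run(
        [sys.executable, str(script_path), str(index)],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


def run_all(count):
    """Write the stand-in script to a temporary folder and run it count times."""
    with tempfile.TemporaryDirectory() as folder:
        script = Path(folder) / 'child.py'
        script.write_text(CHILD_SOURCE)
        return [run_once(script, n) for n in range(count)]


if __name__ == '__main__':
    print(run_all(2))
```

In the real script, int(sys.argv[1]) would take the place of LauncherCount.LauncherCount.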