I am trying to create a pipeline that takes in 3 files, takes n rows from each file (indexed by obs_num), compares each value in the file to a random float between 0 and 1, and either returns the value if it is greater than the random number or False if not. I then append these results to a list (list 1).
I then look at the second file: I check the position given by obs_num, and if the value in the same position returned False for the previous file I return False; otherwise I check again whether the value is greater than the random float. I then do the same for the third file, and append these results to lists as well (lists 2 and 3).
I then convert these 3 lists to a dataframe, with each list representing a column.
The problem, however, is that when I run my pipeline the output is a file with one blank column, as opposed to a csv with rows equal to obs_num.
Here is the code I am using for my wrapper:
import pandas as pd
import luigi
import state_to_state_machine as ssm

class wrapper(luigi.WrapperTask):
    def requires(self):
        file_tag = ['Sessiontolead','leadtoopportunity','opportunitytocomplete']
        size = 10
        for j in range(1,int(size)):
            return [ssm.state_machine(file_tag=i,size=size,obs_nums=j) for i in file_tag]

    def run(self):
        print('The wrapper is complete')
        pd.DataFrame().to_csv('/Users/emm/Documents/AttributionData/Data/datawranglerwrapper3.csv') #never returns anything

    def output(self):
        return luigi.LocalTarget('/Users/emm/Documents/AttributionData/Data/datawranglerwrapper3.csv')

if __name__ == '__main__':
    luigi.build([wrapper()],workers=8,local_scheduler=True)
state machine:
import pandas as pd
import get_samples as gs
import luigi
import random

class state_machine(luigi.Task):
    file_tag = luigi.Parameter()
    obs_nums = luigi.Parameter() #directly get element - don't write to file
    size = luigi.Parameter()

    def run(self):
        path = '/Users/emm/Documents/AttributionData/Data/Probabilities/'
        file = path+self.file_tag+'sampleprobs.csv'

        def generic_state_machine(tag,file=file,obs_nums=self.obs_nums):
            if file.split('/')[7][:4] == tag:
                state_machine = pd.read_csv(file)
                return state_machine.ix[:,1][obs_nums] if s.ix[:,1][obs_nums] > random.uniform(0,1) else False

        session_to_leads = []
        lead_to_opps = []
        opps_to_comp = []
        session_to_leads.append(generic_state_machine(tag='Sessiontoload+sampleprobabs',file=file,obs_nums=self.obs_nums))
        lead_to_opps.append(generic_state_machine(tag='leadtoopportunity+sampleprobabs',file=file,obs_nums=self.obs_nums)) if session_to_leads[self.obs_nums-1] != False else lead_to_opps.append(False)
        opps_to_comp.append(generic_state_machine(tag='opportunitytocomplete+sampleprobabs',file=file,obs_nums=self.obs_nums)) if lead_to_opps[self.obs_nums-1] != False else opps_to_comps.append(False)
        df = pd.DataFrame(zip(session_to_leads,lead_to_opps,opps_to_comp),columns=['session_to_leads','lead_to_opps','oops_to_comp'])

    with self.output().open('w') as out_csv:
        out_csv.write(df.to_csv('/Users/emmanuels/Documents/AttributionData/Data/Probabilities/'+str(self.file_tag)+str(self.size)+'statemachine.csv'))

def output(self):
    return luigi.LocalTarget('/Users/emmanuels/Documents/AttributionData/Data/Probabilities/'+str(self.file_tag)+str(self.size)+'statemachine.csv')
I have asked similar versions of this question before, but the code has changed each time and I have managed to resolve most of the initial issues, so this is not a repetition of previous questions.
From my understanding, this should produce 3 state machine files, each with 10 rows: one for each observation and the comparison made.
The three files are literally files with 2 columns, the first being the index and the second being probabilities between 0 and 1.
I'm not sure if this is a problem with the logic of my code or with how I am using Luigi.
Check your formatting. In your state machine file, your with statement is at the class level for some reason and the output method is at the namespace level.
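For reference, here is a minimal sketch of the intended layout (assuming the same parameters and output path as in the question; the body of run is elided). Both run and output are indented as methods of the Task, and the with block lives inside run. Note also that df.to_csv(path) returns None, so writing its result to the target would fail; df.to_csv() with no path returns the csv as a string:
import luigi
import pandas as pd

class state_machine(luigi.Task):
    file_tag = luigi.Parameter()
    obs_nums = luigi.Parameter()
    size = luigi.Parameter()

    def run(self):
        df = pd.DataFrame()  # placeholder: build the three lists and the dataframe here
        with self.output().open('w') as out_csv:
            out_csv.write(df.to_csv())  # to_csv() without a path returns a string

    def output(self):  # indented to the class level, like run
        return luigi.LocalTarget('/Users/emm/Documents/AttributionData/Data/Probabilities/' + str(self.file_tag) + str(self.size) + 'statemachine.csv')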
Related
I have two Python files (using PyCharm). In Python file #2, I want to call a function from Python file #1.
from main import load_data_from_file
delay, wavelength, measured_trace = load_data_from_file("Sweep_0.txt")
print(delay.shape)
where main is the name of Python file #1. However, when I run Python file #2 (the code posted at the top), I can see that the whole of Python file #1 also runs.
Any suggestion on how I can just run print(delay.shape) without running the entire Python file #1?
Here is my code:
class import_trace:
    def __init__(self,delays,wavelengths,spectra):
        self.delays = delays
        self.wavelengths = wavelengths
        self.spectra = spectra

    def load_data_from_file(self, filename):
        # logging.info("Entered load_data_from_file")
        with open(filename, 'r') as data_file:
            wavelengths = []
            spectra = []
            delays = []
            for num, line in enumerate(data_file):
                if num == 0:
                    # Get the 1st line, drop the 1st element, and convert it
                    # to a float array.
                    delays = np.array([float(stri) for stri in line.split()[1:]])
                else:
                    data = [float(stri) for stri in line.split()]
                    # The first column contains wavelengths.
                    wavelengths.append(data[0])
                    # All other columns contain intensity at that wavelength
                    # vs time.
                    spectra.append(np.array(data[1:]))
        logging.info("Data loaded from file has sizes (%dx%d)" %
                     (delays.size, len(wavelengths)))
        return delays, np.array(wavelengths), np.vstack(spectra)
and below I use this to get the values; however, it does not work:
frog = import_trace(delays,wavelengths,spectra)
delay, wavelength, measured_trace = frog.load_data_from_file("Sweep_0.txt")
I got this error:
frog = import_trace(delays,wavelengths,spectra)
NameError: name 'delays' is not defined
You can wrap up the functions of file 1 in a class, and in file 2 you can create an object and call the specific function. However, if you could share file 1's code it would be clearer. For your reference, see below:
File-1
class A:
    def func1():
        ...
File-2
import file1 as f

# function call
f.A.func1()
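Note also that any code at module level in file 1 runs whenever file 1 is imported. If the goal is to stop the rest of file 1 from executing on import, a common fix (a sketch with a placeholder body, not the asker's actual parsing code) is to guard the script part of file 1:
# file1.py
def load_data_from_file(filename):
    # placeholder body; the real parsing code from the question goes here
    with open(filename) as f:
        return f.readline()

if __name__ == "__main__":
    # this block runs only when file1.py is executed directly,
    # not when another file does `import file1`
    print(load_data_from_file("Sweep_0.txt"))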
I have 3 types of files, each of the same size (around 500 files of each type). I have to pass these files to a function. How can I use multiprocessing for this?
The files are rgb_image: 15.png, 16.png, 17.png ...; depth_img: 15.png, 16.png, 17.png ...; and mat: 15.mat, 16.mat, 17.mat ... I have to pass the three files 15.png, 15.png and 15.mat (one from each folder) as arguments to the function. The starting names of the files can vary, but they always follow this format.
The code is as follows:
def depth_rgb_registration(rgb, depth, mat):
    # the required operation is performed here and
    # gait_list (a list) is the output of this function
    ...

def display_fun(mat, selected_depth, selected_color, excel):
    for idx, color_img in enumerate(color_lists):
        for i in range(len(depth_lists)):
            if color_img.split('.')[0] == depth_lists[i].split('.')[0]:
                rgb = os.path.join(selected_color, color_img)
                depth = os.path.join(selected_depth, sorted(depth_lists)[i])
                m = sorted(mat_lists)[idx]
                mat2 = os.path.join(mat, m)
                abc = color_img.split('.')[0]
                gait_list1 = []
                fnum = int("".join([str(i) for i in re.findall("(\d+)", abc)]))
                gait_list1.append(fnum)
                depth_rgb_registration(rgb, depth, mat2)
                gait_list2.append(gait_list1) #Output gait_list1 from above function
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)
In the above code, display_fun is the main function, which is called from other code.
In this function, color_img, depth_img, and mat are three different types of files from the folders. These three files are given as arguments to the depth_rgb_registration function. In that function, some required values are stored in gait_list1, which is then written to an excel file for every set of files.
The loop above works, but it takes around 20-30 minutes to run, depending on the number of files.
So I wanted to use multiprocessing to reduce the overall time.
I tried multiprocessing by following some examples, but I am not able to understand how I can give these 3 files as arguments. I know using a dictionary here (as I have done below) is not correct, but what would be an alternative?
Even asynchronous multiprocessing is fine. I even thought of using a GPU to run the function, but as I read, extra time would go into loading the data onto the GPU. Any suggestions?
def display_fun2(mat, selected_depth, selected_color, results, excel):
    path3 = selected_depth
    path4 = selected_color
    path5 = mat
    rgb_depth_pairs = defaultdict(list)
    for rgb in path4.iterdir():
        rgb_depth_pairs[rgb.stem].append(rgb)
    included_extensions = ['png']
    images = [fn for ext in included_extensions for fn in path3.glob(f'*.{ext}')]
    for image in images:
        rgb_depth_pairs[image.stem].append(image)
    for mat in path5.iterdir():
        rgb_depth_pairs[mat.stem].append(mat)
    rgb_depth_pairs = [item for item in rgb_depth_pairs.items() if len(item) == 3]
    with Pool() as p:
        p.starmap_async(process_pairs, rgb_depth_pairs)
        gait_list2.append(gait_list1)
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)

def depth_rgb_registration(rgb, depth, mat):
    # the required operation for one set of files
    ...
I did not look at the code in detail (it was too long), but provided that the combinations of arguments that will be sent to your function with 3 arguments can be evaluated independently (outside of the function itself), you can simply use Pool.starmap:
For example:
from multiprocessing import Pool

def myfunc(a, b, c):
    return 100*a + 10*b + c

myargs = [(2,3,1), (1,2,4), (5,3,2), (4,6,1), (1,3,8), (3,4,1)]

p = Pool(2)
print(p.starmap(myfunc, myargs))
returns:
[231, 124, 532, 461, 138, 341]
Alternatively, if your function can be recast as a function which accepts a single argument (the tuple) and expands from this into the separate variables that it needs, then you can use Pool.map:
def myfunc(t):
    a, b, c = t # unpack the tuple and carry on
    return 100*a + 10*b + c

...

print(p.map(myfunc, myargs))
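Applied to this question, a rough sketch (assuming pathlib folders named rgb_image, depth_img and mat, and the asker's depth_rgb_registration function) could build the argument triples first and then hand them to starmap:
from multiprocessing import Pool
from pathlib import Path

rgb_dir = Path('rgb_image')    # assumed folder names
depth_dir = Path('depth_img')
mat_dir = Path('mat')

# pair 15.png / 15.png / 15.mat etc. on the shared stem
triples = [(rgb, depth_dir / rgb.name, mat_dir / (rgb.stem + '.mat'))
           for rgb in sorted(rgb_dir.glob('*.png'))]

if __name__ == '__main__':
    with Pool() as p:
        # one gait_list per (rgb, depth, mat) triple
        results = p.starmap(depth_rgb_registration, triples)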
I have extremely large files; each file is almost 2GB. Therefore, I would like to run multiple files in parallel, and I can do that because all of the files have a similar format, so the file reading can be done in parallel. I know I should use the multiprocessing library, but I am really confused about how to use it with my code.
My code for file reading is:
def file_reading(file,num_of_sample,segsites,positions,snp_matrix):
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
My main function is:
if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []
    path_to_directory = '/dataset/example/'
    extension = '*.msOut'
    num_of_samples = 162
    filename = glob.glob(path_to_directory+extension)
    ###How can I use multiprocessing with function file_reading
    number_of_workers = 10
    x,y,z = [],[],[]
    array_of_number_tuple = [(filename[file], segsites,positions,snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        pos,snp = p.map(file_reading,array_of_number_tuple)
        x.extend(pos)
        y.extend(snp)
So my inputs to the function are as follows:
file - a single filename (taken from the filename list)
num_of_samples - an int value
segsites - initially an empty list, to which I want to append as I read the file.
positions - initially an empty list, to which I want to append as I read the file.
snp_matrix - initially an empty list, to which I want to append as I read the file.
The function returns the positions list and the snp_matrix list at the end. How can I use multiprocessing for this, where my arguments are lists and an integer? The way I've used multiprocessing gives me the following error:
TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'
The elements of the list passed to Pool.map are not automatically unpacked, and your file_reading function can generally take only one argument.
Of course, this argument can be a tuple, so it is no problem to unpack it yourself:
def file_reading(args):
    file, num_of_sample, segsites, positions, snp_matrix = args
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []
    path_to_directory = '/dataset/example/'
    extension = '*.msOut'
    num_of_samples = 162
    filename = glob.glob(path_to_directory+extension)
    number_of_workers = 10
    x,y,z = [],[],[]
    array_of_number_tuple = [(filename[file], num_of_samples, segsites,positions,snp_matrix) for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        # map returns one (positions, snp_matrix) tuple per input file,
        # so collect the results per tuple rather than unpacking the whole list
        results = p.map(file_reading,array_of_number_tuple)
    for pos, snp in results:
        x.extend(pos)
        y.extend(snp)
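One caveat worth adding: segsites, positions and snp_matrix are pickled and copied into each worker process, so anything appended to them inside file_reading will not be visible in the parent process. Any data you need back has to travel through the tuples that p.map returns, as above.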
Introduction
Firstly, thank you so much for taking the time to look at my question and code. I know my coding needs improving, but as with all new things it takes time to perfect.
Background
I am making use of different functions to do the following:
Import multiple files (3 for now). Each file contains time, pressure and void columns.
Store the lists in dictionaries, e.g. all the pressures in one pressure dictionary.
Filter the pressure data and ensure that I still have the corresponding number of time data points (for each file).
I call these functions in the main function.
Problem
Everything runs perfectly until I run the time loop in the FilterData function for a second time. However, I did check that I can access all three lists for pressure and time in the dictionaries. Why do I get 'index out of range' at the line t=timeFROMtimes[i] when the function runs a second time? This is the output I get for the code.
Code
#IMPORT LIBRARIES
import glob
import pandas as pd
import matplotlib.pyplot as plt, numpy as np
from scipy.interpolate import spline

#INPUT PARAMETERS
max_iter = 3 #maximum number of iterations
i = 0 #starting number of iteration
tcounter = 0
number_of_files = 3 #number of files in directory
sourcePath = 'C:\\Users\\*' #not displaying actual directory
list_of_source_files = glob.glob(sourcePath + '/*.TXT')
pressures = {} #initialize a dictionary
times = {} #initialize a dictionary

#GET SPECIFIC FILE & PRESSURE, TIME VALUES
def Get_source_files(list_of_source_files,i):
    #print('Get_source_files')
    with open(list_of_source_files[i]) as source_file:
        print("file entered:",i+1)
        lst = []
        for line in source_file:
            lst.append([float(x) for x in line.split()])
        time = np.array([x[0] for x in lst]) #first column of the file as an array
        void = np.array([x[1] for x in lst]) #second column of the file as an array
        pressure = (np.array([x[2] for x in lst]))*-1 #etc. & change the sign of the imported pressure data
    return pressure,time

#SAVE THE TIME AND PRESSURE IN DICTIONARIES
def SaveInDictionary(Pressure,time,i):
    #print('SaveInDictionary')
    pressures[i]=[Pressure]
    times[i]=[time]
    return pressures,times
#FILTER PRESSURE DATA AND ADJUST TIME
def FilterData(pressureFROMpressures,timeFROMtimes,i):
    print('FilterData')
    t=timeFROMtimes[i]
    p=pressureFROMpressures[i]
    data_points=int((t[0]-t[-1])/-0.02) #determine nr of data points per column of the imported file
    filtered_pressure = [] #create an empty list for the filtered pressure
    pcounter = 0 #initiate a counter
    tcounter = 0
    time_new = [0,0.02] #these initial values are needed for the for-loop
    for j in range(data_points):
        if p[j]<8: #filter out all garbage pressure data points above 8 bar
            if p[j]>0:
                filtered_pressure.append(p[j]) #save all the new pressure values in filtered_pressure
                pcounter += 1
    for i in range(pcounter-1):
        time_new[0]=0
        time_new[i+1]=time_new[i]+0.02 #add 0.02 to the previous value
        time_new.append(time) #append the time list
        tcounter += 1 #increment the counter
    del time_new[-1] #somehow a list of the entire time is formed at the end that should be removed
    return filtered_pressure, time_new
#MAIN!!
P = list()
while (i <= number_of_files and i!=max_iter):
    pressure,time = Get_source_files(list_of_source_files,i) #return pressure and time from specific file
    pressures, times = SaveInDictionary(pressure,time,i) #save pressure and time in the dictionaries
    i+=1 #increment i to loop

print('P1=',pressures[0])
print('P2=',pressures[1])
print('P3=',pressures[2])
print('t1=',times[0])
print('t2=',times[1])
print('t3=',times[2])

#I took this out of the main function to check my error:
for i in range(2):
    filtered_pressure,changed_time = FilterData(pressures[i],times[i],i)
    finalP, finalT = SaveInDictionary(filtered_pressure,changed_time,i) #save pressure and time in the dictionaries
In the loop you took out of the main function, you already index into times:
#I took this out of the main function to check my error:
for i in range(2):
    filtered_pressure,changed_time = FilterData(pressures[i],times[i],i) # <-- here you call FilterData with times[i]
but in the FilterData function, you index the passed variable timeFROMtimes with i again:
def FilterData(pressureFROMpressures,timeFROMtimes,i):
    print('FilterData')
    t=timeFROMtimes[i] # <-- right here
This seems odd. Try removing one of the index operators ([i]) and also do so for pressures.
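Concretely, a sketch of the first option (assuming the definitions from the question): index only at the call site and pass the whole dictionaries through, since FilterData already does timeFROMtimes[i] and pressureFROMpressures[i]:
for i in range(2):
    filtered_pressure, changed_time = FilterData(pressures, times, i)  # no [i] here
    finalP, finalT = SaveInDictionary(filtered_pressure, changed_time, i)
The alternative is to keep the call as it is and drop the extra [i] lookups inside FilterData.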
Edit
As @strubbly pointed out, in your SaveInDictionary function you assign the value [Pressure]. The square brackets denote a list with one element, the numpy array Pressure. This can be seen in your error message, in the opening bracket before the arrays when you print P1-3 and t1-3.
To save your numpy array directly, lose the brackets:
def save_in_dictionary(pressure_array, time, i):
    pressures[i] = pressure_array
    times[i] = time
    return pressures, times
I'm trying to cross-compare the two outputs labeled "S" in compareDNA (calculating a Hamming distance). However, I cannot figure out how to pass an integer from one def to another. I've tried returning the variable, but I am unable to use it (in a different def) after returning it.
I'm attempting to see which output of compareDNA(Udnalin, Mdnalin) and compareDNA(Udnalin, Hdnalin) is higher, to determine which has the greater Hamming distance.
How does one pass an integer from one def to another?
import sys

def main():
    var()

def var():
    Mdna = open("mouseDNA.txt", "r")
    Mdnalin = Mdna.readline()
    print(Mdnalin)
    Mdna.close
    Hdna = open("humanDNA.txt", "r")
    Hdnalin = Hdna.readline()
    print(Hdnalin)
    Hdna.close
    Udna = open("unknownDNA.txt", "r")
    Udnalin = Udna.readline()
    print(Udnalin)
    Udna.close
    S = 0
    S1 = 0
    S2 = 0
    print("Udnalin + Mdnalin")
    compareDNA(Udnalin, Mdnalin)
    S1 = S
    print("Udnalin + Hdnalin")
    compareDNA(Udnalin, Hdnalin)

def compareDNA(i, j):
    diffs = 0
    length = len(i)
    for x in range(length):
        if i[x] != j[x]:
            diffs += 1
    S = length - diffs / length
    S = round(S, 2)
    return S
    # print("Mouse")
    # print("Human")
    # print("RATMA- *cough* undetermined")

main()
You probably want to assign the value returned by each call to compareDNA to a separate variable in your var function. Then you can do whatever you want with those values (what exactly you want to do is not clear from your question). Try something like this:
S1 = compareDNA(Udnalin, Mdnalin) # bind the return value from this call to S1
S2 = compareDNA(Udnalin, Hdnalin) # and this one to S2
# do something with S1 and S2 here!
If what you want to do is especially simple (e.g. comparing them to see which is larger), you could even use the return values directly in an expression, such as the condition of an if statement:
if compareDNA(Udnalin, Mdnalin) > compareDNA(Udnalin, Hdnalin):
    print("Unknown DNA is closer to a Mouse")
else:
    print("Unknown DNA is closer to a Human")
There's one further thing I'd like to point out, which is unrelated to the core of your question: you should use with statements to handle closing your files, rather than closing them manually. Your current code doesn't actually close the files at all (you're missing the parentheses after .close in each case, which are needed to make it a function call).
If you use a with statement instead, the files will be closed automatically at the end of the block (even if there is an exception):
with open("mouseDNA.txt", "r") as Mdna:
Mdnalin = Mdna.readline()
print(Mdnalin)
with open("humanDNA.txt", "r") as Hdna:
Hdnalin = Hdna.readline()
print(Hdnalin)
with open("unknownDNA.txt", "r") as Udna:
Udnalin = Udna.readline()
print(Udnalin)