Elegant way of integrating a for loop with a flag in Python

I am using click to create a command-line tool that performs some data preprocessing. Until now I have basically survived using click.option() as a flag, with some if statements in my code so that I can choose the options I want. However, I am struggling to find an elegant way to solve my next issue. Since I believe the general code structure does not depend on my purposes, I will try to be as general as possible without getting into the details of what goes into the main code.
I have a list of elements my_list that I want to loop over and apply some very long code after each iteration. However, I want this to be dependent on a flag (via click, as I said). The general structure would be like this:
# click.option('--myflag', is_flag=True)
# just used click.option to point out that I use click, but it is just a boolean
if myflag:
    for i in my_list:
        print('Function that depends on i')
        print('Here goes the rest of the code')
else:
    print('Function independent of i')
    print('Here goes the rest of the code')
My issue is that I don't want to copy-paste the rest of the code twice in the above structure (it is long and hard to integrate into a function). Is there a way to do that? That is, is there a way to tell Python: "If myflag == True, run the full code while looping over my_list. Otherwise, just run the full code once." All of that without having to duplicate the code.
EDIT: I believe it might actually be useful to get a bit more specific.
What I have is:
my_list = ['washington', 'la', 'houston']
if myflag:
    for i in my_list:
        train, test = full_data[full_data.city != i], full_data[full_data.city == i]
        print('CODE: Clean, tokenize, train, performance')
else:
    def train_test_split2(df, frac=0.2):
        # get a random sample
        test = df.sample(frac=frac, axis=0, random_state=123)
        # get everything but the test sample
        train = df.drop(index=test.index)
        return train, test

    train, test = train_test_split2(full_data[['tidy_tweet', 'classify']])
    print('CODE: Clean, tokenize, train, performance')
full_data is a pandas DataFrame that contains text and a classification. Whenever I set my_flag=True, I intend the code to train some models and test performance while leaving one city out as the test data. Hence the loop gives me an overview of how my model performs on different cities (some sort of GroupKFold loop).
Under the second option, my_flag=False, there is a random train-test split and the training is performed only once.
It is the CODE part that I don't want to duplicate.
I hope this helps the previous intuition.

What do you mean by "hard to integrate into a function"? If a function is an option, just use the following code.
def rest_of_the_code(i):
    ...

if myflag:
    for i in my_list:
        print('Function that depends on i')
        rest_of_the_code(i)
else:
    print('Function independent of i')
    rest_of_the_code(0)
Otherwise you could do something like this:
if not myflag:
    my_list = [0]  # if you even need to initialize it; not sure where my_list comes from

for i in my_list:
    print('Function that may depend on i')
    print('Here goes the rest of the code')
EDIT, to answer your clarification: you could build a list of the train/test pairs and then iterate over it.
my_list = ['washington', 'la', 'houston']
list_of_dataframes = []
if myflag:
    for i in my_list:
        train, test = full_data[full_data.city != i], full_data[full_data.city == i]
        list_of_dataframes.append((train, test))
else:
    train, test = train_test_split2(full_data[['tidy_tweet', 'classify']])
    list_of_dataframes.append((train, test))

for train, test in list_of_dataframes:
    print('CODE: Clean, tokenize, train, performance')
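If you'd rather not hold every (train, test) pair in memory at once (the per-city DataFrames can be large), the same idea reads naturally as a generator. A minimal sketch, assuming full_data, my_list, myflag and train_test_split2 are defined as above:
def splits(full_data, my_list, myflag):
    # Yield (train, test) pairs one at a time instead of storing them all.
    if myflag:
        for i in my_list:
            yield full_data[full_data.city != i], full_data[full_data.city == i]
    else:
        yield train_test_split2(full_data[['tidy_tweet', 'classify']])

for train, test in splits(full_data, my_list, myflag):
    print('CODE: Clean, tokenize, train, performance')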

Related

Python/Pandas: how to correctly speed up a class's __init__ function with Numba?

I have a class that runs various mathematical calculations in a loop, and I want to speed up its processing with Numba.
Right now I'm trying to apply Numba to the __init__ function.
The class and its __init__ look like this:
class Generic(object):
    # @numba.vectorize
    def __init__(self, N, N1, S1, S2):
        A = [['Time'], ['Time', 'Price'], ['Time', 'Qty'],
             ['Time', 'Open_interest'], ['Time', 'Operation', 'Quantity']]
        self.table = pd.DataFrame(pd.read_csv('Datasets\\RobotMath\\table_OI.csv'))
        self.Reader()
        for a in A:
            if 'Time' in a:
                self.df = pd.DataFrame(pd.read_csv(Ex2_Csv, usecols=a, parse_dates=[0]))
                self.df['Time'] = self.df['Time'].dt.floor('S')
                self.df['Time'] = pd.to_datetime(self.df['Time']).dt.time
                if a == ['Time']:
                    self.Tik()
                elif a == ['Time', 'Price']:
                    self.Poc()
                    self.Pmm()
                    self.SredPrice()
                    self.Delta_Ema(N, N1)
                    self.Comulative(S1, S2)
                    self.M()
                elif a == ['Time', 'Qty']:
                    self.Volume()
                elif a == ['Time', 'Open_interest']:
                    self.Open_intrest()
                elif a == ['Time', 'Operation', 'Quantity']:
                    self.D()
                    # self.Dataset()
            else:
                print('Something went wrong', f"Set Error: {a}")
All functions of the class are ordinary column calculations using Pandas.
Here are two of them for example:
def Tik(self):
    df2 = self.df.groupby('Time').value_counts(ascending=False)
    df2.to_csv('Datasets\\RobotMath\\Tik.csv')

def Poc(self):
    g = self.df.groupby('Time', sort=False)
    out = (g.last() - g.first()).reset_index()
    out.to_csv('Datasets\\RobotMath\\Poc.csv', index=False)
I tried to use Numba in different ways, but I got an error every time.
Is it possible to speed up the __init__ function specifically? Or do I need to look for another way, one that can't avoid rewriting the class?
There's nothing special or magical about the __init__ function; at run time it's just another function like any other.
In terms of performance, your class is doing quite a lot here - it might be worth breaking down the timings of each component first to establish where your performance hang-ups lie.
For example, just reading in the files might be responsible for a fair amount of that time, and for Ex2_Csv you repeatedly do that within a loop, which is likely to be sub-optimal depending on the volume of data you're dealing with. But before targeting a resolution (like Numba, for example), it'd be prudent to identify which aspects of the code are performing within expectations.
You can gather that information in a number of ways, but the simplest might be to add in some tagged print statements that emit the elapsed time since the last print statement.
e.g.
import datetime

start = datetime.datetime.now()
print("starting", start)

# some block of code that does XXX
XXX_finish = datetime.datetime.now()
print("XXX finished at", XXX_finish, "taking", XXX_finish - start)

# ... and repeat to generate a runtime report showing timings for each block of code
Then you can break the runtime of your program down into feature-aligned chunks, and when tweaking you'll be able to see the effect directly. Profiling the runtime of your code like this can really help when making performance tweaks, and it sharpens the focus on which tweaks are benefiting or harming specific areas of your code.
For example, in the section that performs the group-bys, with some timestamp outputs before and after, you can compare running it with Numba turned on and then again with it turned off.
From there, with a little careful debugging, sensible logging of times and a bit of old-fashioned sleuthing (I tend to jot these kinds of things down with paper and pencil) you (and if you share your findings, we) ought to be in a better position to answer your question more precisely.
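If the print-statement approach gets tedious as the number of blocks grows, the standard library's cProfile can produce a per-function breakdown automatically. A minimal sketch, where the Generic(N, N1, S1, S2) call stands in for whatever you want to measure:
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
Generic(N, N1, S1, S2)  # the code under test
profiler.disable()

# Show the 10 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)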

Design pattern for a data parsing & feature engineering pipeline

I know this question has been asked a few times on these boards in the past, but I have a more generic version of it, which might be applicable to someone else's projects in the future as well.
In short: I am building an ML system (with Python, but the language choice in this case is not critical), which has its ML model at the end of a pipeline of actions:
Data upload
Data parsing
Feature engineering
Feature engineering (in a different logical bracket from prev. step)
Feature engineering (in a different logical bracket from prev. steps)
... (more steps like the last 3)
Data passed to ML model
Each of the above steps has its own series of actions it must take in order to build a proper output, which is then used as the input to the next one, and so on. These sub-steps, in turn, can either be completely decoupled from one another, or some of them might need other steps inside the same big step to be completed first, producing the data those following steps use.
The thing right now is that I need to build a custom pipeline which makes it super easy to add new steps into the mix (both big and small) without upsetting the existing ones.
So far, I have this concept of how this might look from an architecture perspective, as shown below:
Looking at this architecture, I immediately think of the Chain of Responsibility design pattern, which manages the BIG STEPS (1, 2, ..., n), with each of these BIG STEPS having its own small Chain of Responsibility inside its guts: the NO_REQ steps run independently, and then the REQ steps run (looping over the REQ steps until they are all done). With a shared interface for running the logic inside big and small steps, it would probably run rather neatly.
Yet I am wondering if there is a better way of doing it. Moreover, what I do not like about Chain of Responsibility is that it would require a person adding a new BIG/SMALL step to always edit the guts of the logic that sets up the step bags, to manually include the newly added step. I would love to build something which instead just scans a folder specific to the steps under each BIG STEP, and builds the lists of NO_REQ and REQ steps on its own (to uphold the Open/Closed SOLID principle).
I would be grateful for any ideas.
I did something similar recently. The challenge is to allow "steps" to be plugged in easily without many changes. So what I did was something like the below. The core idea is to define a process interface and fix the input and output formats.
Code:
class Factory:
    def process(self, input):
        raise NotImplementedError

class Extract(Factory):
    def process(self, input):
        print("Extracting...")
        output = {}
        return output

class Parse(Factory):
    def process(self, input):
        print("Parsing...")
        output = {}
        return output

class Load(Factory):
    def process(self, input):
        print("Loading...")
        output = {}
        return output

pipeline = {
    "Extract": Extract(),
    "Parse": Parse(),
    "Load": Load(),
}

input_data = {}  # vanilla input
for process_name, process_instance in pipeline.items():
    output = process_instance.process(input_data)
    input_data = output
Output:
Extracting...
Parsing...
Loading...
So, in case you need to add a step, say 'append_headers' after parse, all you need to do is:
# Defining a new step.
class AppendHeaders(Factory):
    def process(self, input):
        print("adding headers...")
        output = {}
        return output

pipeline = {
    "Extract": Extract(),
    "Append headers": AppendHeaders(),  # adding the new step
    "Parse": Parse(),
    "Load": Load(),
}
New output:
Extracting...
adding headers...
Parsing...
Loading...
The additional requirement in your case, where you might want to scan a specific folder for REQ/NO_REQ steps, can be handled by adding a field in a JSON config that is loaded into the pipeline, meaning you create those "step" objects only if the flag is set to REQ.
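For the folder-scanning idea specifically, the standard library's pkgutil and importlib can discover step modules automatically, so adding a step never touches existing code. A minimal sketch, where the steps package layout (one module per small step) is a hypothetical name, not something from your post:
import importlib
import pkgutil

import steps  # hypothetical package: steps/extract.py, steps/parse.py, ...

def discover_steps(package):
    # Import every module in the package and instantiate its Factory subclasses.
    found = []
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package.__name__}.{info.name}")
        for obj in vars(module).values():
            if isinstance(obj, type) and issubclass(obj, Factory) and obj is not Factory:
                found.append(obj())
    return found

pipeline = discover_steps(steps)  # adding a step = dropping a new module into steps/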
Not sure how much this idea helps; I just thought I'd share my thoughts.

Theano's function() reports that my `givens` value is not needed for the graph

Sorry for not posting entire snippets -- the code is very big and spread out, so hopefully this can illustrate my issue. I have these:
train = theano.function([X], output, updates=update_G,
                        givens={train_mode=:np.cast['int32'](1)})
and
test = theano.function([X], output, updates=update_G,
                       givens={train_mode=:np.cast['int32'](0)})
To my understanding, givens would input the value of train_mode (i.e. 1/0) wherever it's needed to compute the output.
The output is computed along the lines of this:
...
network2 = Net2()
# This is sort of a dummy variable so I don't get a NameError when this
# is called before `theano.function()` is called. Not sure if this is the
# right way to do this.
train_mode = T.iscalar('train_mode')
output = loss(network1.get_outputs(network2.get_outputs(X, train_mode=train_mode)), something).mean()
...

class Net2():
    def get_outputs(self, x, train_mode):
        from theano.ifelse import ifelse
        import theano.tensor as T
        my_flag = ifelse(T.eq(train_mode, 1), 1, 0)
        return something if my_flag else something_else
So train_mode is used as an argument in one of the nested functions, and I use it to distinguish between train and test, as I'd like to handle them slightly differently.
However, when I try to run this, I get this error:
theano.compile.function_module.UnusedInputError: theano.function was
asked to create a function computing outputs given certain inputs, but
the provided input variable at index 1 is not part of the computational
graph needed to compute the outputs: <TensorType(int32, scalar)>. To make
this error into a warning, you can pass the parameter
on_unused_input='warn' to theano.function. To disable it completely, use
on_unused_input='ignore'.
If I delete the givens parameter, the error disappears, so to my understanding Theano believes that my train_mode is not necessary to compute the function. I can use on_unused_input='ignore' as per their suggestion, but that would just ignore my train_mode if they think it's unused. Am I going about this the wrong way? I basically just want to train a neural network with dropout, but not use dropout when evaluating.
Why do you use the "=" sign? I think it makes train_mode unreadable; my code works well written as:
givens = {train_mode: 1}
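For completeness, with the stray "=" removed, the two calls from the question would read like this (a sketch, assuming X, output, update_G and train_mode as defined above):
train = theano.function([X], output, updates=update_G,
                        givens={train_mode: np.cast['int32'](1)})
test = theano.function([X], output, updates=update_G,
                       givens={train_mode: np.cast['int32'](0)})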

Python: How to avoid many if statements

I'll try to simplify my problem. I'm writing a test program using py.test and Appium. Now:
In the application I have 4 media formats: Video, Audio, Image and Document.
I have a control interface with previous, next, play and stop buttons.
Each media format has a unique ID, like
video_playbutton, audio_playbutton, document_playbutton, image_playbutton, video_stopbutton, audio_stopbutton, etc.
But the operation I have to do is the same for all of them, e.g. pressing the play button.
I can address the play button of each when I give it explicitly, like this:
find_element_by_id("video_playbutton")
And when I want to press the other play buttons I have to repeat the above line each time, like this:
find_element_by_id("video_playbutton")
find_element_by_id("audio_playbutton")
find_element_by_id("image_playbutton")
find_element_by_id("document_playbutton")
And because I'm calling this function from another script, I would first have to distinguish which string I got, e.g.:
def play(mediatype):
    if mediatype == "video":
        el = find_element_by_id("video_playbutton")
        el.click()
    if mediatype == "audio":
        el = find_element_by_id("audio_playbutton")
        el.click()
    if .....
What is the best way to solve this situation? I want to avoid hundreds of if statements, because there are also stop, next, previous, etc. buttons.
I'm rather searching for something like this:
def play(mediatype):
    find_element_by_id(mediatype.playbutton)
You can separate out the selectors and operations into two dictionaries, which scales better; otherwise the mapping eventually gets huge. Here is an example.
dictMedia = {
    'video': ['video_playbutton', 'video_stopbutton', 'video_nextbutton'],
    'audio': ['audio_playbutton', 'audio_stopbutton', 'audio_nextbutton'],
}
dictOperations = {'play': 0, 'stop': 1, 'next': 2}

def get_selector(mediatype, operation):
    return dictMedia[mediatype][dictOperations[operation]]

print(get_selector('video', 'play'))
PS: The above doesn't check for key-not-found errors.
However, I still feel that if the media-specific operations grow, a page object model would be better.
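Since every ID in the question follows the <mediatype>_<operation>button naming convention, another option is to build the ID string directly. A minimal sketch, where press and driver are hypothetical names used for illustration:
def press(driver, mediatype, operation):
    # e.g. press(driver, 'video', 'play') clicks the "video_playbutton" element
    el = driver.find_element_by_id(f"{mediatype}_{operation}button")
    el.click()

press(driver, "video", "play")
press(driver, "document", "stop")  # clicks "document_stopbutton"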

pykka -- Actors are slow?

I am currently experimenting with actor concurrency (in Python) because I want to learn more about it. I chose pykka, but when I test it, it's more than twice as slow as a normal function.
The code is only meant to check whether it works; it's not meant to be elegant. :)
Maybe I did something wrong?
from pykka.actor import ThreadingActor
import numpy as np

class Adder(ThreadingActor):
    def add_one(self, i):
        l = []
        for j in i:
            l.append(j + 1)
        return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    adder = Adder.start().proxy()
    adder.add_one(data)
    adder.stop()
This doesn't run so fast:
time python actor.py
real 0m8.319s
user 0m8.185s
sys 0m0.140s
And now the dummy 'normal' function:
def foo(i):
    l = []
    for j in i:
        l.append(j + 1)
    return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    foo(data)
Gives this result:
real 0m3.665s
user 0m3.348s
sys 0m0.308s
What is happening here is that your functional version creates two very large lists, which is the bulk of the time. When you introduce actors, mutable data like lists must be copied before being sent to the actor to maintain proper concurrency, and the list created inside the actor must be copied as well when sent back to the sender. This means that instead of two very large lists being created, we have four.
Consider designing things so that data is constructed and maintained by the actor and then queried by calls to the actor, minimizing the size of the messages passed back and forth. Try to apply the principle of minimal data movement. Passing the list in the functional case is only efficient because the data is not actually moving, due to the shared memory space. If the actor were on a different machine, we would not have the benefit of a shared memory space even if the message data were immutable and didn't need to be copied.
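To illustrate that principle, here is a minimal sketch with pykka, where Summer and its methods are hypothetical names: the large array never crosses the actor boundary, only a single float does.
from pykka.actor import ThreadingActor
import numpy as np

class Summer(ThreadingActor):
    def __init__(self):
        super().__init__()
        self.data = None

    def load(self, n):
        # Build the large array inside the actor instead of sending it in.
        self.data = np.random.random(n)

    def add_one_and_sum(self):
        # Do the heavy work locally; return a single scalar.
        return float((self.data + 1).sum())

if __name__ == '__main__':
    actor = Summer.start().proxy()
    actor.load(1000000).get()             # wait for the load to finish
    print(actor.add_one_and_sum().get())  # only one float is passed back
    actor.stop()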
