Design pattern for a data parsing&feature engineering pipeline

I know this question has been asked a few times on these boards in the past, but I have a more generic version of it, which might be applicable to someone else's projects in the future as well.
In short - I am building an ML system (with Python, but language choice in this case is not very critical), which has its ML model at the end of a pipeline of actions happening:
Data upload
Data parsing
Feature engineering
Feature engineering (in a different logical bracket from prev. step)
Feature engineering (in a different logical bracket from prev. steps)
... (more steps like the last 3)
Data passed to ML model
Each of the above steps has its own series of actions to perform in order to build a proper output, which is then used as the input to the next step. These sub-steps can either be completely decoupled from one another, or some of them might require other sub-steps inside the same big step to finish first, because they consume the data those sub-steps produce.
The thing right now is, that I need to build a custom pipeline, which will make it super easy to add new steps into the mix (both big and small), without upsetting the existing ones.
So far, I have this concept of what this might look like from an architecture perspective, as shown below:
While looking at this architecture, I am immediately thinking about a Chain of Responsibility Design Pattern, which manages BIG STEPS (1, 2, ..., n), and each of these BIG STEPS having their own small version of Chain of Responsibility happening inside of their guts, which happen independently for NO_REQ steps, and then for REQ steps (with REQ steps looping-over until they are all done). With a shared interface for running logic inside of big and small steps, it would probably run rather neatly.
Yet, I am wondering, if there is any better way of doing it? Moreover, what I do not like about a Chain of Responsibility, is that it would require a person adding new BIG/SMALL step, to always edit the "guts" of the logic setting up step bags, to manually include the newly added step. I would love to build something, which instead would just scan a folder specific to steps under each BIG STEP, and build a list of NO_REQ and REQ steps on its own (to uphold the Open/Closed SOLID principle).
I would be grateful for any ideas.

I did something similar recently. The challenge is to allow "steps" to be plugged in easily without many changes, so what I did was something like the code below. The core idea is to define a common interface, 'process', and to fix the input and output format.
Code:
class Factory:
    def process(self, input):
        raise NotImplementedError
class Extract(Factory):
    def process(self, input):
        print("Extracting...")
        output = {}
        return output
class Parse(Factory):
    def process(self, input):
        print("Parsing...")
        output = {}
        return output
class Load(Factory):
    def process(self, input):
        print("Loading...")
        output = {}
        return output
pipeline = {
    "Extract" : Extract(),
    "Parse" : Parse(),
    "Load" : Load(),
}
input_data = {} #vanilla input
for process_name, process_instance in pipeline.items():
    output = process_instance.process(input_data)
    input_data = output
Output:
Extracting...
Parsing...
Loading...
So, if you need to add a step such as 'append_headers' after extract (and before parse), all you need to do is:
#Defining a new step.
class AppendHeaders(Factory):
    def process(self, input):
        print("adding headers...")
        output = {}
        return output
pipeline = {
    "Extract" : Extract(),
    "Append headers": AppendHeaders(), #adding a new step
    "Parse" : Parse(),
    "Load" : Load(),
}
New output:
Extracting...
adding headers...
Parsing...
Loading...
The additional requirement in your case, scanning a specific folder for REQ/NO_REQ steps, can be handled by adding a field to a JSON config and loading it into the pipeline; that is, create those "step" objects only if the flag is set to REQ.
I'm not sure how much this idea will help, but I thought I would share it.
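
One way to avoid editing the pipeline definition for every new step is to let steps register themselves. The following is only a minimal sketch under assumptions that are not in the question: the names `Step`, `requires`, `Tokenize`, `CountTokens` and `run_all` are hypothetical, a step with an empty `requires` plays the role of a NO_REQ step, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) orders the REQ steps instead of looping until they are all done.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Step:
    """Hypothetical base class: every subclass registers itself when defined."""

    registry = {}   # step name -> step class
    requires = ()   # names of steps that must run first (empty means NO_REQ)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Step.registry[cls.__name__] = cls

    def process(self, data):
        raise NotImplementedError


class Tokenize(Step):
    # NO_REQ step: needs nothing from its siblings.
    def process(self, data):
        return {**data, "tokens": data["text"].split()}


class CountTokens(Step):
    # REQ step: consumes the output of Tokenize.
    requires = ("Tokenize",)

    def process(self, data):
        return {**data, "n_tokens": len(data["tokens"])}


def run_all(data):
    """Run every registered step, dependencies first."""
    graph = {name: set(cls.requires) for name, cls in Step.registry.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        data = Step.registry[name]().process(data)
    return order, data


order, result = run_all({"text": "a b c"})
print(order)
print(result["n_tokens"])
```

With this layout, adding a step means adding a class. If each step lives in its own module under a per-BIG-STEP package, importing every module in that package (for example with `pkgutil.iter_modules` and `importlib.import_module`) would be enough to trigger the registration, so the code that builds the pipeline stays closed to modification.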

Related

Struggling with how to iterate data

I am learning Python3 and I have a fairly simple task to complete but I am struggling how to glue it all together. I need to query an API and return the full list of applications which I can do and I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url,authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"],app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
Next I have, I think, 'summaryguid', and I need to query a different API with it and return a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically call a GUID in the URL and return the correct information but I haven't yet figured out how to get it to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl",authmethod)
if summary.ok:
    fulldata = summary.json()
    for appsummary in fulldata["static-analysis"]["modules"]["module"]:
        print(appsummary["compiler"])
I would prefer to not yet have someone just type out the right answer but just drop a few hints and let me continue to work through it logically so I learn how to deal with what I assume is a common issue in the future. My thought right now is I need to move my second if up as part of my initial block and continue the logic in that space but I am stuck with that.

You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications in the first API call. By doing so, you can get the information you require by making the second API call for each application.

import requests
applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        # Second request, made once per application
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        # Skip applications whose summary request failed
        if summary.ok:
            fulldata = summary.json()
            for appsummary in fulldata["static-analysis"]["modules"]["module"]:
                print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
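
To go from printing to the "master list" mentioned in the question, the results can be collected in a dictionary keyed by application name. This is a rough sketch, not the asker's actual code: `collect_compilers` and `fetch_summary` are hypothetical names, and `fetch_summary` stands in for the second `requests.get(...).json()` call so that the example runs without a network. The sample data only mimics the JSON shape shown above.

```python
def collect_compilers(apps, fetch_summary):
    """Map each application name to the sorted, de-duplicated compilers of its modules."""
    master = {}
    for app in apps:
        summary = fetch_summary(app["guid"])
        modules = summary["static-analysis"]["modules"]["module"]
        master[app["profile"]["name"]] = sorted({m["compiler"] for m in modules})
    return master


# Made-up data in the same shape as the two API responses.
apps = [
    {"guid": "g1", "profile": {"name": "App One"}},
    {"guid": "g2", "profile": {"name": "App Two"}},
]
summaries = {
    "g1": {"static-analysis": {"modules": {"module": [
        {"compiler": "JAVAC_8"}, {"compiler": "JAVAC_8"}]}}},
    "g2": {"static-analysis": {"modules": {"module": [
        {"compiler": "MSIL"}, {"compiler": "JAVAC_8"}]}}},
}

master = collect_compilers(apps, summaries.__getitem__)
print(master)
```

In the real script, `fetch_summary` would wrap the second request and return `summary.json()`.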

Elegant way of integrating for loop with a flag in Python

I am using click to create a command line tool that performs some data preprocessing. Until now, I have basically survived using click.option() as flag with some if statements in my code so that I can choose the options I want. However, I am struggling to find an elegant way to solve my next issue. Since I believe the general code structure does not depend on my purposes I will try to be as general as possible without getting into details of what goes into the main code.
I have a list of elements my_list that I want to loop over and apply some very long code after each iteration. However, I want this to be dependent on a flag (via click, as I said). The general structure would be like this:
#click.option('--myflag', is_flag=True)
#just used click.option to point out that I use click but it is just a boolean
if myflag:
    for i in my_list:
        print('Function that depends on i')
        print('Here goes the rest of the code')
else:
    print('Function independent on i')
    print('Here goes the rest of the code')
My issue is that I don't want to copy and paste the rest of the code twice in the above structure (it is long and hard to wrap in a function). Is there a way to do that? That is, is there a way to tell Python: "If myflag==True, run the full code while looping over my_list. Otherwise, just run the full code once." All of that without having to duplicate the code.
EDIT: I believe it might actually be useful to go a bit more specific.
What I have is :
my_list=['washington','la','houston']
if myflag:
    for i in my_list:
        train,test = full_data[full_data.city!=i],full_data[full_data.city==i]
        print('CODE:Clean,tokenize,train,performance')
else:
    def train_test_split2(df, frac=0.2):
        # get random sample
        test = df.sample(frac=frac, axis=0,random_state=123)
        # get everything but the test sample
        train = df.drop(index=test.index)
        return train, test
    train, test = train_test_split2(full_data[['tidy_tweet', 'classify']])
    print('CODE: Clean,tokenize,train,performance')
full_data is a pandas data frame that contains text and classification. Whenever I set myflag=True, I intend the code to train some models and test performance while holding out one city as the test data. Hence the loop gives me an overview of how my model performs on different cities (some sort of GroupKFold loop).
Under the second option, myflag=False, there is a random test-train split and the training is only performed once.
It is the CODE part that I don't want to duplicate.
I hope this helps the previous intuition.

What do you mean by "hard to integrate into a function"? If a function is an option, just use the following code.
def rest_of_the_code(i):
    ...
if myflag:
    for i in my_list:
        print('Function that depends on i')
        rest_of_the_code(i)
else:
    print('Function independent on i')
    rest_of_the_code(0)
Otherwise you could do something like this:
if not myflag:
    my_list = [0] # if you even need to initialize it, not sure where my_list comes from
for i in my_list:
    print('Function that may depend on i')
    print('Here goes the rest of the code')
EDIT to answer regarding your clarification: You could use a list which is iterated.
my_list=['washington','la','houston']
list_of_dataframes = []
if myflag:
    for i in my_list:
        train,test = full_data[full_data.city!=i],full_data[full_data.city==i]
        list_of_dataframes.append( (train, test) )
else:
    train, test = train_test_split2(full_data[['tidy_tweet', 'classify']])
    list_of_dataframes.append( (train, test) )
for train, test in list_of_dataframes:
    print('CODE: Clean,tokenize,train,performance')

Finding a specific "route" through a network simulator

I'm writing something for a game that involves networks. In this game, a network is a class and the "connections" to each node are formatted like:
network.nodes = [router, computer1, computer2]
network.connections = [ [1, 2], [0], [0] ]
Each iteration in "network.nodes" works in parallel with each iteration in "network.connections", meaning "network.connections[0]" represents all the nodes "network.nodes[0]" is connected to. I'm trying to write a simple function in the network class that finds a route starting from the router - "network.connections[0]" - and then to a specific "node". The more thought I put into this, the more complicated the answer seems to be.
In this rather simple case, it should return something like
[router, computer1]
That's what I'd like to see if I was trying to find a route to "computer1", but I need something that will work with more complicated network simulations.
It's basically a simulator for a computer network. But in this game, I need to be able to know exactly which nodes something might travel through to reach a specific target.
Any help would be greatly appreciated. Thanks.

How about dropping the .nodes and .connections and just keeping them in one data structure like a dictionary.
network.nodes = {"router": [computer1, computer2],
                 "computer1": [router],
                 "computer2": [router]
                 }
You could even drop the strings as keys and use the objects themselves:
network.nodes = {router: [computer1, computer2],
                 computer1: [router],
                 computer2: [router]
                 }
That way, if you need to access the connections for the router, you would do:
>>>network.nodes[router]
[computer1, computer2]
Because I don't have a full overview of your project, I can't just give you a function to do that, but I can try and point you in the right direction.
If you build the network 'map' up as a dictionary, and network.nodes[router] returns [computer1, computer2], the next thing you would need to do is network.nodes[computer1] and network.nodes[computer2].
In your firewall example from the comments, you would rebuild the network map to include the firewall. So the dictionary would look like this:
network.nodes = {router: [firewall, computer2],
                 firewall: [computer1],
                 computer1: [firewall],
                 computer2: [router]
                 }
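
Following connections outward from the router one hop at a time, as described above, is a breadth-first search. Here is a minimal sketch under a few assumptions: `find_route` is a hypothetical helper, plain strings stand in for the node objects, and the firewall is assumed to link back to the router so that every connection is listed in both directions.

```python
from collections import deque


def find_route(connections, start, target):
    """Return the shortest list of nodes from start to target, or None if unreachable."""
    queue = deque([[start]])   # each queue entry is a full path so far
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in connections.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Strings stand in for the game's node objects.
network = {
    "router": ["firewall", "computer2"],
    "firewall": ["router", "computer1"],
    "computer1": ["firewall"],
    "computer2": ["router"],
}

route = find_route(network, "router", "computer1")
print(route)  # ['router', 'firewall', 'computer1']
```

The `visited` set keeps the search from looping forever on networks that contain cycles, and because breadth-first search explores hop by hop, the first route found is also one with the fewest hops.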

Python How to avoid many if statements

I'll try to simplify my problem. I'm writing a test program using py.test and appium. Now:
In the application I have 4 media formats: Video, Audio, Image and Document.
I have a control interface with previous, next, play , stop buttons.
Each media format has unique IDs for its buttons, like
video_playbutton, audio_playbutton, document_playbutton, image_playbutton, video_stopbutton audio_stopbutton ...etc etc.
But the operation I have to do is the same for all of them e.g press on playbutton.
I can address the playbutton of each one when I give the ID explicitly, like this:
find_element_by_id("video_playbutton")
And when I want to press the other playbuttons, I have to repeat the above line each time, like this:
find_element_by_id("video_playbutton")
find_element_by_id("audio_playbutton")
find_element_by_id("image_playbutton")
find_element_by_id("document_playbutton")
And because I'm calling this function from another script I would have to distinguish first what string I got e.g:
def play(mediatype):
    if mediatype == "video":
        el = find_element_by_id("video_playbutton")
        el.click()
    if mediatype == "audio":
        el = find_element_by_id("audio_playbutton")
        el.click()
    if .....
What is the best way to solve this situation? I want to avoid hundreds of if-statements because there is also stop, next , previous etc buttons.
I'm rather searching for something like this
def play(mediatype):
    find_element_by_id(mediatype.playbutton)

You can separate out the selectors and operations in two dictionaries which scales better. Otherwise the mapping eventually gets huge. Here is the example.
dictMedia = {
    'video': ['video_playbutton', 'video_stopbutton', 'video_nextbutton'],
    'audio': ['audio_playbutton', 'audio_stopbutton', 'audio_nextbutton'],
}
dictOperations = {'play': 0, 'stop': 1, 'next': 2}
def get_selector(mediatype, operation):
    return dictMedia[mediatype][dictOperations[operation]]
print(get_selector('video', 'play'))
PS: The above operation doesn't check for key not found errors.
However, I still feel, if the media specific operations grow, then a page object model would be better.
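
If the IDs really do follow the `<mediatype>_<operation>button` pattern shown in the question, the selector can also be built from the two strings directly, with no lookup table. This is a sketch under that assumption only: `button_id` and `press` are hypothetical names, the `previous` IDs are assumed to follow the same pattern, and `find_element_by_id` is passed in as a parameter so that the example does not depend on a particular driver object.

```python
MEDIA_TYPES = {"video", "audio", "image", "document"}
OPERATIONS = {"play", "stop", "next", "previous"}


def button_id(mediatype, operation):
    """Build an element ID such as 'video_playbutton', rejecting unknown names."""
    if mediatype not in MEDIA_TYPES:
        raise ValueError(f"unknown media type: {mediatype!r}")
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return f"{mediatype}_{operation}button"


def press(find_element_by_id, mediatype, operation):
    """Look up the button with the supplied finder function and click it."""
    find_element_by_id(button_id(mediatype, operation)).click()


print(button_id("video", "play"))  # video_playbutton
```

A call would then look like `press(driver.find_element_by_id, "audio", "stop")`, where `driver` is whatever object the test script already uses to find elements.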

How to get around using eval in python

I have a game I've been working on for awhile. The core is C++, but I'm using Python for scripting and for Attacks/StatusEffects/Items etc.
I've kind of coded myself into a corner where I'm having to use eval to get the behaviour I want. Here's how it's arising:
I have an xml document that I use to spec attacks, like so:
<Attack name="Megiddo Flare" mp="144" accuracy="1.5" targetting="EnemyParty">
    <Flags>
        <IgnoreElements/>
        <Unreflectable/>
        <ConstantDamage/>
        <LockedTargetting/>
    </Flags>
    <Components>
        <ElementalWeightComponent>
            <Element name="Fire" weight="0.5"/>
        </ElementalWeightComponent>
        <ConstantDamageCalculatorComponent damage="9995" index="DamageCalculatorComponent"/>
    </Components>
</Attack>
I parse this file in Python and build my Attacks. Each Attack consists of any number of Components that implement different behaviour. In this Attack's case, I implement a DamageCalculatorComponent, telling Python to use the ConstantDamage variant. I implement all these components in my script files. This is all well and good for component types I'm going to use often. However, there are some attacks that will be the only ones to use a particular Component variant. Rather than adding the component to my script files, I wanted to be able to specify the Component class in the XML file.
For instance, If I were to implement the classic White Wind attack from Final Fantasy (restores team HP by the amount of HP of the attacker)
<Attack name="White Wind" mp="41" targetting="AnyParty">
    <Flags>
        <LockedTargetting/>
    </Flags>
    <Components>
        <CustomComponent index="DamageCalculatorComponent">
<![CDATA[
class WhiteWindDamageComponent(DamageCalculatorComponent):
    def __init__(self, Owner):
        DamageCalculatorComponent.__init__(self, Owner)
    def CalculateDamage(self, Action, Mechanics):
        Dmg = 0
        character = Action.GetUsers().GetFirst()
        SM = character.GetComponent("StatManagerComponent")
        if (SM != None):
            Dmg = -SM.GetCurrentHP()
        return Dmg
return WhiteWindDamageComponent(Owner)
]]>
        </CustomComponent>
    </Components>
</Attack>
I was wondering if there might be a better way to do this? The only other way I can see is to put every possible Component variant definition into my Python files, and expand my Component creators to check for the additional variant. That seems a bit wasteful for a single-use Component. Is there a better/safer alternative to generating types dynamically, or perhaps another solution I'm not seeing?
Thanks in advance

Inline Python is bad because
Your source code files cannot be understood by Python source code editors and you will miss syntax highlighting
All other tools, like pylint, which can be used to lint and validate source code will fail also
Alternative
In element <CustomComponent index="DamageCalculatorComponent">
... add parameter script:
<CustomComponent index="DamageCalculatorComponent" script="damager.py">
Then add file damager.py somewhere along the file system.
Load it as described here: What is an alternative to execfile in Python 3?
In your main game engine code construct the class out of loaded system module like:
damager_class = sys.modules["mymodulename"].my_factory_function()
All Python modules must share some kind of agreed class names / entry point functions.
If you want to have really pluggable architecture, use Python eggs and setup.py entry points
http://wiki.pylonshq.com/display/pylonscookbook/Using+Entry+Points+to+Write+Plugins
Example
https://github.com/miohtama/vvv/blob/master/setup.py
