Understanding flow of execution of Python code

I'm trying to do a homework assignment in Python from Data Manipulation at Scale: Systems and Algorithms on Coursera. Generally, I have problems understanding the base code that was presented as an example of a MapReduce algorithm. I would be grateful for help understanding it in two places; details below.
I tried to go step by step through the code flow of the two files below after running the command:
python wordcount.py 'data/books.json'
File wordcount.py is opened.
mr = MapReduce.MapReduce() - an mr object is created.
The def __init__(self): part from MapReduce.py is executed.
We come back to wordcount.py.
The functions def mapper(record): and def reducer(key, list_of_values): are defined, but for the time being not executed.
Python goes to if __name__ == '__main__':
inputdata = open(sys.argv[1]) - the json file is assigned to a variable.
mr.execute(inputdata, mapper, reducer) - a call to the method from MapReduce.py.
And here is my first question: we haven't defined a mapper or reducer variable/object so far. Is it just null/no value passed to this function, or did we somehow define these variables before and I missed it?
Then we move to def execute(self, data, mapper, reducer): in MapReduce.py.
And there we have mapper(record).
So this is a reference to a function in wordcount.py, am I right? But if we have a reference to a function in a different file, shouldn't we use import at the beginning of the file and specify which file this function comes from?
(...) further code execution
wordcount.py file:
import MapReduce
import sys

"""
Word Count Example in the Simple Python MapReduce Framework
"""

mr = MapReduce.MapReduce()

# =============================
# Do not modify above this line

def mapper(record):
    # key: document identifier
    # value: document contents
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
        mr.emit_intermediate(w, 1)

def reducer(key, list_of_values):
    # key: word
    # value: list of occurrence counts
    total = 0
    for v in list_of_values:
        total += v
    mr.emit((key, total))

# Do not modify below this line
# =============================

if __name__ == '__main__':
    inputdata = open(sys.argv[1])
    mr.execute(inputdata, mapper, reducer)
MapReduce.py file:
import json

class MapReduce:
    def __init__(self):
        self.intermediate = {}
        self.result = []

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, [])
        self.intermediate[key].append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        for line in data:
            record = json.loads(line)
            mapper(record)

        for key in self.intermediate:
            reducer(key, self.intermediate[key])

        #jenc = json.JSONEncoder(encoding='latin-1')
        jenc = json.JSONEncoder()
        for item in self.result:
            print(jenc.encode(item))
Thank you in advance for help with that.

In Python everything is an object, and that includes functions, so you can pass functionA as an argument to another functionB (or to a class, or whatever). If functionB expects you to do that, it will assume you gave it a function with the right signature and proceed as normal.
In your case:
mr.execute(inputdata, mapper, reducer)
Here mapper and reducer are the functions previously defined. They are passed as arguments to the execute method of mr, an instance of the class MapReduce, and as you can see, that method uses them as the functions it expects.
Thanks to this you can, as that code shows, write generic code that does some computation and can be used in a similar way by many applications, by giving users the option of supplying their own functions.
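For illustration, here is a minimal two-file sketch of the same pattern (hypothetical file names, not from the course code). The function object itself is passed in, so the callee never has to look the name up in another module:
# framework.py
def execute(data, mapper):
    return [mapper(record) for record in data]

# app.py
import framework

def mapper(record):  # defined here; framework.py never imports it
    return record * 2

print(framework.execute([1, 2, 3], mapper))  # prints [2, 4, 6]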
A much more generic example of this is the built-in function map. It receives a function that does something; map doesn't care what that function does or where it comes from, only that it accepts as many arguments as map itself receives (besides the function) and returns a value, which map uses to build a new list with the results.
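A quick sketch of that:
def square(x):
    return x * x

print(list(map(square, [1, 2, 3])))  # prints [1, 4, 9]; square is simply passed in as an object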

PyYAML: Custom/default constructor for nodes without tags

I have a YAML file for storing constants. Some of the entries have custom tags such as !HANDLER or !EXPR and these are easily handled by adding constructors to the YAML loader.
However, I also want to have a custom constructor for non-tagged nodes. Reason being, I want to add these non-tagged values to a dictionary for use elsewhere. These values need to be available before parsing finishes, hence I can't just let parsing finish and then update the dictionary.
So with a YAML file like
sample_rate: 16000
input_file: !HANDLER
  handler_fn: file_loader
  handler_input: path/to/file
  mode: w
I have a handler constructor
def file_handler_loader(loader, node):
    params = loader.construct_mapping(node)
    handler_fn = params.pop('handler_fn')  # pop the function name once and reuse it below
    module = __import__('handlers.file_handlers', fromlist=[handler_fn])
    func = getattr(module, handler_fn)
    handler_input = params.pop('handler_input')
    return func(handler_input, **params)
And a function initialize_constants
def _get_loader():
    loader = FullLoader
    loader.add_constructor('!HANDLER', file_handler_loader)
    loader.add_constructor('!EXPR', expression_loader)
    return loader

def initialize_constants(path_to_yaml: str) -> None:
    try:
        with open(path_to_yaml, 'r') as yaml_file:
            constants = yaml.load(yaml_file, Loader=_get_loader())
    except FileNotFoundError as ex:
        LOGGER.error(ex)
        exit(-1)
The goal is then to have a constructor for non-tagged entries in the YAML. However, I haven't been able to figure out how to add a constructor for non-tagged entries. Ideally, the code would look like this:
def default_constructor(loader, node):
    param = loader.construct_scalar(node)
    constants[node_name] = param
I've also attempted to add a resolver to solve the problem. The code below was tested but didn't work as expected:
loader.add_constructor('!DEF', default_constructor)
loader.add_implicit_resolver('!DEF', re.compile('.*'), first=None)

def default_constructor(loader, node):
    # do stuff
In this case, what happened was that the node contained the value sample_rate and not the 16000 I expected.
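For what it's worth, the behaviour can be reproduced with a minimal sketch (assuming PyYAML's FullLoader): an implicit resolver with a catch-all regex matches every plain scalar, including mapping keys, so the constructor fires for the key node sample_rate before it ever sees 16000:
import re
import yaml

def default_constructor(loader, node):
    # For demonstration, just construct and return the scalar.
    return loader.construct_scalar(node)

class DefaultLoader(yaml.FullLoader):
    pass

DefaultLoader.add_constructor('!DEF', default_constructor)
DefaultLoader.add_implicit_resolver('!DEF', re.compile(r'.*'), first=None)

# Both the key and the value resolve to !DEF, and keys are constructed first:
print(yaml.load('sample_rate: 16000', Loader=DefaultLoader))  # {'sample_rate': '16000'}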
Thanks in advance :)

Python unit testing with PyTest, stuck on more advanced testing

I have been using PyTest for some time now to write simple tests (like the ones you find in tutorials and YouTube videos), and I thought it was now time to start writing actual tests for our Python scripts. The scripts are far more advanced than any shown in tutorials, so I am getting a bit stuck. I do not want the entire correct answer, but rather a nudge in the right direction if possible. Here is my issue:
We have a script that reads a .md text file and converts it to a pdf file based on an external template. Part of the script is below (I removed most of it because I first just want to have one running test):
class DocumentationEngine:
    def __init__(self, title, subtitle, series, style='TIIStyle_Digital_Aug_2020', templateFile='template.docet', tableOfContents=True, listOfFigures=False, listOfTables=False):
        self.title = title
        self.subtitle = subtitle
        self.series = series
        self.style = style
        self.template = {}
        self.hasTOC = tableOfContents
        self.hasLOF = listOfFigures
        self.hasLOT = listOfTables
        self.loadTemplate(templateFile)

    def loadTemplate(self, file='template.docet'):
        with open(file, "r") as templatefile:
            lines = templatefile.readlines()
            key = "dummy"
            value = ""
            for line in lines:
                line = line.strip()
                if line.startswith('[') and line.endswith(']'):
                    self.template[key] = value
                    key = line[1:-1]
                    value = ""
                else:
                    value += line + '\n'

    def build(self, versions=[], content='', filename='Documenter\\_Autogenerated'):
        document = self.template["doc"]
        document = document.replace("%%style%%", self.style)
        document = document.replace("%%body%%",
                                    self.buildFirstPage() +
                                    self.buildTableOfContents() +
                                    self.buildListOfFigures() +
                                    self.buildListOfTables() +
                                    self.buildVersionTable(versions, filename) +
                                    self.buildContentPages(content=content) +
                                    self.buildLastPage()
                                    )
        return document

    def buildLastPage(self):
        return self.template["last_page"]
I am trying to write a simple unit test for the buildLastPage method and have been stuck for several days now.
I am not sure whether or not I need to mock the template file, use a fixture, and/or whether I can actually test only that method with all its dependencies.
I started with the following:
from doceng import DocumentationEngine
import pytest

class Test:
    def test_buildLastPage(self):
        build_last_page = DocumentationEngine()
        assert build_last_page.template(1) == 1
which gives me an error regarding 3 required arguments. When adding the arguments like this:
from doceng import DocumentationEngine
import pytest

class Test:
    def test_buildLastPage(self, title, subtitle, series):
        build_last_page = DocumentationEngine()
        assert build_last_page.template(1) == 1
which gives me an error that the fixture is not found.
I added a fixture in the conftest.py file like this:
import pytest
from doceng import DocumentationEngine

@pytest.fixture
def title(title):
    return title("test")
which will get me another error: recursive dependency involving fixture 'title' detected.
I'm quite stuck, so any nudge in the right direction for a newbie would be highly appreciated.
The fixture error concerns your test function test_buildLastPage. The way you are using it, it only needs the self argument.
In pytest, the arguments of an undecorated test function are always looked up as fixtures of the same name. You did not define any fixtures, and you also do not use those arguments in your function, so you can remove them safely.
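To illustrate the name-based lookup (a generic sketch, not tied to the code above):
import pytest

@pytest.fixture
def title():                    # note: a fixture must not take itself as an argument
    return "test"

def test_uses_fixture(title):   # pytest injects the 'title' fixture here
    assert title == "test"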
The actual error points at DocumentationEngine(). The class expects 3 arguments when initializing the object, and you pass none. Check your __init__ function again to find the proper arguments.
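A minimal sketch of what such a test could look like (the literal title/subtitle/series values are made up, the template content is only a guess at the .docet format, and the trailing [end] marker is needed because the loadTemplate shown above only stores a section when the next [section] header starts):
from doceng import DocumentationEngine

def test_buildLastPage(tmp_path):
    # Write a throwaway template so that __init__ -> loadTemplate succeeds.
    template = tmp_path / "template.docet"
    template.write_text("[doc]\n%%body%%\n[last_page]\nThe End\n[end]\n")
    engine = DocumentationEngine(
        title="t", subtitle="s", series="x", templateFile=str(template)
    )
    assert engine.buildLastPage() == "The End\n"
tmp_path is a built-in pytest fixture, so no template file has to be checked into the repository.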

Unsure how asyncio works and how I return information from a continuously running async folder watchdog function [duplicate]

This question already has answers here:
Yielding asyncio generator data back from event loop possible?
(3 answers)
Closed 2 years ago.
I am working on a tool that watches up to 3 folders for changes.
If a change occurs, it passes the path into a function.
This function is supposed to gather information and return it in a way that I can use, for example, to display it on a GUI.
The code is as follows.
# Contains code for different checks on the output folder of the AOIs
from export_read import aoi_output_read
import asyncio
from watchgod import awatch
import config_ini

async def __run_export_read(path):
    # On every trigger of the function, read a .ini with config settings.
    # This .ini gets changed from the outside (by hand or with a tool).
    # This way we can modify the actions that happen on every read.
    # The config file contains the names of the functions in export_read. If the value of the key is 1, it uses getattr to run that function and return its value.
    # For example it runs "list_of_errors" if that key is set to 1 in the ini file and returns the result of that.
    config_cls = config_ini.export_read_config()
    active_tools = config_cls.read_ini()
    aoi_output_read_cls = aoi_output_read(path)
    if not aoi_output_read_cls.is_valid():  # checks if the file is valid; returns if not
        return
    for tool in active_tools:
        try:
            method = getattr(aoi_output_read_cls, tool)  # gets the method named by the string read from the config file
        except AttributeError:
            print("Class `{}` does not implement `{}`".format(aoi_output_read_cls.__class__.__name__, tool))
            continue  # skip to the next tool; otherwise the yield below would use an undefined `method`
        yield method()  # runs the function that was read from the ini file
    # todo: I need to figure out how to keep returning data from this function as long as the async loop from the watchdog is running
async def __watchdog(path):
    async for changes in awatch(path):
        for items in changes:
            if items[0] == 1 or items[0] == 2:  # when a file is created or changed
                async for item in __run_export_read(items[1]):
                    yield item  # This doesn't work, as run_until_complete(__watchdog(path)) needs a different output, which I do not understand
The yield inside __watchdog() does not work because I can't pass a list to run_until_complete. But this is as far as my knowledge goes.
def folder_watchdog(path):
    loop = asyncio.get_event_loop()
    loop.run_until_complete(__watchdog(path))
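As an aside (my observation, not from the original post): run_until_complete expects an awaitable, and calling an async generator function returns an async generator, which is not awaitable. A minimal sketch of the failure:
import asyncio

async def agen():
    yield 1

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(agen())  # an async generator is not awaitable
except TypeError as e:
    print(e)  # e.g. "An asyncio.Future, a coroutine or an awaitable is required"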
My main file.
import analysis

def main():
    path = r"I:\Data Export\5"
    print(analysis.folder_watchdog(path))

main()
Currently I just want to be able to use the print statement in main to know that data gets passed through all of this code.
What can I do to get the data I need without stopping the execution?
Is there a completely different way of doing it?
Solved this with the following (solution from user4815162342):
analysis.py:
def iter_over_async(ait, loop):
    ait = ait.__aiter__()

    async def get_next():
        try:
            obj = await ait.__anext__()
            return False, obj
        except StopAsyncIteration:
            return True, None

    while True:
        done, obj = loop.run_until_complete(get_next())
        if done:
            break
        yield obj

def folder_watchdog(path):
    while True:
        loop = asyncio.get_event_loop()
        for items in iter_over_async(__watchdog(path), loop):
            yield items
Now in my main file print works like this:
def main():
    path = r"I:\AOI Data Export\L5"
    for stuff in analysis.folder_watchdog(path):
        print(stuff)

main()

BDD behave Python need to create a World map to hold values

I'm not too familiar with Python, but I have set up a BDD framework using Python behave. I now want to create a World map class that holds data and is retrievable throughout all scenarios.
For instance I will have a world class where I can use:
World w
w.key.add('key', api.response)
In one scenario and in another I can then use:
World w
key = w.key.get('key')
Edit:
Or if there is a built-in way of using context or similar in behave, where the attributes are saved and retrievable throughout all scenarios, that would be good. Like lettuce, where you can use world: http://lettuce.it/tutorial/simple.html
I've tried this between scenarios but it doesn't seem to be picking it up:
class World(dict):
    def __setitem__(self, key, item):
        self.__dict__[key] = item
        print(item)

    def __getitem__(self, key):
        return self.__dict__[key]
Setting the item in one step in scenario A: w.__setitem__('key', response)
Getting the item in another step in scenario B: w.__getitem__('key')
This shows me an error though:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python\lib\site-packages\behave\model.py", line 1456, in run
    match.run(runner.context)
  File "C:\Program Files (x86)\Python\lib\site-packages\behave\model.py", line 1903, in run
    self.func(context, *args, **kwargs)
  File "steps\get_account.py", line 14, in step_impl
    print(w.__getitem__('appToken'))
  File "C:Project\steps\world.py", line 8, in __getitem__
    return self.__dict__[key]
KeyError: 'key'
It appears that the World does not hold values between the steps that are run here.
Edit:
I'm unsure how to use environment.py, but I can see it has a way of running code before the steps. How can I make my call to a SOAP client within environment.py and then pass the result to a particular step?
Edit:
I have made the request in environment.py and hardcoded the values; how can I pass variables to environment.py and back?
It's called "context" in the python-behave jargon. The first argument of your step definition function is an instance of the behave.runner.Context class, in which you can store your world instance. Please see the appropriate part of the tutorial.
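A minimal sketch of that approach (the file layout and step texts are made up for illustration):
# features/environment.py
def before_all(context):
    # Anything set here lives for the whole test run, across scenarios.
    context.world = {}

# features/steps/shared_steps.py
from behave import given, then

@given('the API has responded')
def step_store(context):
    context.world['key'] = 'api-response'  # stand-in for api.response

@then('the stored response is available')
def step_read(context):
    assert context.world['key'] == 'api-response'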
Have you tried the simple approach, using a global var? For instance:
def before_all(context):
    global response
    response = api.response

def before_scenario(context, scenario):
    global response
    w.key.add('key', response)
I guess the feature can be accessed from the context, for instance:
def before_feature(context, feature):
    feature.response = api.response

def before_scenario(context, scenario):
    w.key.add('key', context.feature.response)
You are looking for:
Class variable: A variable that is shared by all instances of a class.
Your code in the question uses a class instance variable.
Read about: python_classes_objects
For instance:
class World(dict):
    __class_var = {}

    def __setitem__(self, key, item):
        World.__class_var[key] = item

    def __getitem__(self, key):
        return World.__class_var[key]

# Scenario A
A = World()
A['key'] = 'test'
print('A[\'key\']=%s' % A['key'])
del A

# Scenario B
B = World()
print('B[\'key\']=%s' % B['key'])
Output:
A['key']=test
B['key']=test
Tested with Python 3.4.2.
Come back and flag your question as answered if this works for you, or comment on why not.
Defining a global var in the before_all hook, as mentioned by @stovfl, did not work for me.
But defining a global var within one of my steps worked out.
Instead, as Szabo Peter mentioned, use the context:
context.your_variable_name = api.response
and just use context.your_variable_name anywhere the value is to be used.
For this I actually used a config file [config.py]. I then added the variables in there and retrieved them using getattr. See below:
WSDL_URL = 'wsdl'
USERNAME = 'name'
PASSWORD = 'PWD'
Then retrieved them like:
import config
getattr(config, 'USERNAME', 'username not found')

Using python class with spark DataFrame to parse URL's

I'm trying to process URLs in a pyspark dataframe using a class that I've written and a udf. I'm aware of urllib and other URL parsing libraries, but for this case I need to use my own code.
In order to get the tld of a url, I cross-check it against the IANA public suffix list.
Here's a simplification of my code:
class Parser:
    # list of available public suffixes for extracting top level domains
    file = open("public_suffix_list.txt", 'r')
    data = []
    for line in file:
        if line.startswith("//") or line == '\n':
            pass
        else:
            data.append(line.strip('\n'))

    def __init__(self, url):
        self.url = url
        # the code here extracts port, protocol, query etc.
        # I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname]
        # extra functionality in my actual class
        i = matches.index(self.string)
        try:
            self.tld = matches[i]
        # logic to find tld if no match
The class works in pure Python, so for example I can run:
import Parser
x = Parser("www.google.com")
x.tld #returns ".com"
However, when I try to do:
import Parser
from pyspark.sql.functions import udf
parse = udf(lambda x: Parser(x).url)
df = sqlContext.table("tablename").select(parse("column"))
When I call an action I get
File "<stdin>", line 3, in <lambda>
File "<stdin>", line 27, in __init__
TypeError: 'in <string>' requires string as left operand
So my guess is that it's failing to interpret the data as a list of strings?
I've also tried to use
file = sc.textFile("my_file.txt")\
    .filter(lambda x: not x.startswith("//") and x != "")\
    .collect()
data = sc.broadcast(file)
to open my file instead, but that causes
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any ideas?
Thanks in advance
EDIT: Apologies, I didn't have my code to hand, so my test code didn't explain the problems I was having very well. The error I initially reported was a result of the test data I was using.
I've updated my question to be more reflective of the challenge I'm facing.
Why do you need a class in this case? The code defining your class is incorrect: you never declared self.data before using it in the __init__ method. The only relevant line that affects the output you want is self.string = string, so you are basically passing the identity function as a udf.
The UnicodeDecodeError is due to an encoding issue in your file; it has nothing to do with your definition of the class.
The second error is in the line sc.broadcast(file), details of which can be found here: Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion
EDIT 1
I would redefine your class structure as follows. You basically need to create the instance attribute self.data by assigning self.data = data before you can use it. Also, anything that you write before the __init__ method is executed at class definition time, irrespective of whether you instantiate the class or not, so moving the file parsing part out of the class will not have any effect.
# list of available public suffixes for extracting top level domains
file = open("public_suffix_list.txt", 'r')
data = []
for line in file:
    if line.startswith("//") or line == '\n':
        pass
    else:
        data.append(line.strip('\n'))

class Parser:
    def __init__(self, url):
        self.url = url
        self.data = data
        # the code here extracts port, protocol, query etc.
        # I think this bit below is causing the error
        matches = [r for r in self.data if r in self.hostname]
        # extra functionality in my actual class
        i = matches.index(self.string)
        try:
            self.tld = matches[i]
        # logic to find tld if no match
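If the suffix list still needs to reach the executors efficiently, a broadcast variable can be created on the driver and referenced by value inside the udf. A minimal sketch (the longest-suffix rule is only a guess at the matching logic; sc, sqlContext, and the table and column names are taken from the question):
from pyspark.sql.functions import udf

# Parse the suffix list once on the driver, then broadcast the plain list.
suffixes = [line.strip('\n') for line in open("public_suffix_list.txt")
            if not line.startswith("//") and line != '\n']
bcast = sc.broadcast(suffixes)

def tld_of(url):
    # Only bcast.value is used here; the SparkContext itself never reaches the workers.
    matches = [s for s in bcast.value if s in url]
    return max(matches, key=len) if matches else None

parse = udf(tld_of)
df = sqlContext.table("tablename").select(parse("column"))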
