Using python class with spark DataFrame to parse URL's - python

I'm trying to process URL's in a pyspark dataframe using a class that I've written and a udf. I'm aware of urllib and other url parsing libraries but for this case I need to use my own code.
In order to get the tld of a url I cross check it against the iana public suffix list.
Here's a simplification of my code
class Parser:
# list of available public suffixes for extracting top level domains
file = open("public_suffix_list.txt", 'r')
data = []
for line in file:
if line.startswith("//") or line == '\n':
pass
else:
data.append(line.strip('\n'))
def __init__(self, url):
self.url = url
#the code here extracts port,protocol,query etc.
#I think this bit below is causing the error
matches = [r for r in self.data if r in self.hostname]
#extra functionality in my actual class
i = matches.index(self.string)
try:
self.tld = matches[i]
# logic to find tld if no match
The class works in pure python so for example I can run
import Parser
x = Parser("www.google.com")
x.tld #returns ".com"
However when I try to do
import Parser
from pyspark.sql.functions import udf
parse = udf(lambda x: Parser(x).url)
df = sqlContext.table("tablename").select(parse("column"))
When I call an action I get
File "<stdin>", line 3, in <lambda>
File "<stdin>", line 27, in __init__
TypeError: 'in <string>' requires string as left operand
So my guess is that it's failing to interpret the data as a list of strings?
I've also tried to use
file = sc.textFile("my_file.txt")\
.filter(lambda x: not x.startswith("//") or != "")\
.collect()
data = sc.broadcast(file)
to open my file instead, but that causes
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any ideas?
Thanks in advance
EDIT: Apologies, I didn't have my code to hand so my test code didn't explain very well the problems I was having. The error I initially reported was a result of the test data I was using.
I've updated my question to be more reflective of the challenge I'm facing.

Why do you need a class in this case (the code for defining your class is incorrect, you never declared self.data before using it in the init method) the only relevant line that affects the output you want is self.string=string, so you are basically passing the identity function as udf.
The UnicodeDecodeError is due to an encoding issue in your file, it has nothing to do with your definition of the class.
The second error is in the line sc.broadcast(file) , details of which can be found here : Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion
EDIT 1
I would redefine your class structure as follows. You basically need to create the instance self.data by calling self.data = data before you can use it. Also anything that you write before the init method is executed irrespective of whether you call that class or not. So moving out the file parsing part will not have any effect.
# list of available public suffixes for extracting top level domains
file = open("public_suffix_list.txt", 'r')
data = []
for line in file:
if line.startswith("//") or line == '\n':
pass
else:
data.append(line.strip('\n'))
class Parser:
def __init__(self, url):
self.url = url
self.data = data
#the code here extracts port,protocol,query etc.
#I think this bit below is causing the error
matches = [r for r in self.data if r in self.hostname]
#extra functionality in my actual class
i = matches.index(self.string)
try:
self.tld = matches[i]
# logic to find tld if no match

Related

Python unit testing with PyTest, stuck on more advanced testing

I have been using PyTest for some time now to write some simple tests (like the ones you find in tutorials and youtube video's) and I thought now it was time to start writing actual test for our python scripts. The scripts are way more advanced than any shown in tutorials so I am getting a bit stuck. I do not want the entire correct answer, but rather a nudge in the right direction if possible. Here is my issue:
We have a script that reads a .md text file and converts it to a pdf file based on an external template. Part of the script is here below (I removed most of it because I first just want to have 1 running test)
class DocumentationEngine:
def __init__(self, title, subtitle, series, style='TIIStyle_Digital_Aug_2020', templateFile='template.docet', tableOfContents=True, listOfFigures=False, listOfTables=False):
self.title = title
self.subtitle = subtitle
self.series = series
self.style = style
self.template = {}
self.hasTOC = tableOfContents
self.hasLOF = listOfFigures
self.hasLOT = listOfTables
self.loadTemplate(templateFile)
def loadTemplate(self, file='template.docet'):
with open(file, "r") as templatefile:
lines = templatefile.readlines()
key = "dummy"
value = ""
for line in lines:
line = line.strip()
if line.startswith('[') and line.endswith(']'):
self.template[key] = value
key = line[1:-1]
value = ""
else:
value += line + '\n'
def build(self, versions=[], content='', filename='Documenter\\_Autogenerated'):
document = self.template["doc"]
document = document.replace("%%style%%", self.style)
document = document.replace("%%body%%",
self.buildFirstPage() +
self.buildTableOfContents() +
self.buildListOfFigures() +
self.buildListOfTables() +
self.buildVersionTable(versions, filename) +
self.buildContentPages(content=content) +
self.buildLastPage()
)
return document
def buildLastPage(self):
return self.template["last_page"]
I am trying to write a simple unit test for the buildLastPage method and have been stuck for several days now.
I am not sure whether or not I need to mock the template file, use a fixture and/or if I can actually test only that method with all dependencies.
I started with the following:
from doceng import DocumentationEngine
import pytest
class Test:
def test_buildLastPage(self):
build_last_page = DocumentationEngine()
assert build_last_page.template(1) == 1
which gives me an error regarding 3 required arguments. When adding the arguments like this:
from doceng import DocumentationEngine
import pytest
class Test:
def test_buildLastPage(self, title, subtitle, series):
build_last_page = DocumentationEngine()
assert build_last_page.template(1) == 1
which gives me an error that the fixture is not found.
I added a fixture in conftest.py file like this:
import pytest
from doceng import DocumentationEngine
#pytest.fixture
def title(title):
return title("test")
which will get me another error, recursive dependency involving fixture 'title' detected
I'm quite stuck so any nudge in the right direction for a newbie would be highly appreciated
The error of the fixtures is regarding your test function test_buildLastPage. The way you are using it, it only needs the self argument.
A test function in pytest without any decorators always expects to find fixtures, that have the same name as the arguments. You did not define any fixtures and also do not use the arguments in your function. Therefore, you can remove them safely.
The actual error points DocumentationEngine(). The class expect 3 arguments when initializing the object. You set no arguments. Check your __init__ function again to find the proper arguments.

Adding subtitles to video with moviepy requiring encoding

I have a srt file named subtitles.srt that is non-English, and I followed the instructions of the documentation and source code of the moviepy package (https://moviepy.readthedocs.io/en/latest/_modules/moviepy/video/tools/subtitles.html):
from moviepy.video.tools.subtitles import SubtitlesClip
from moviepy.video.io.VideoFileClip import VideoFileClip
generator = lambda txt: TextClip(txt, font='Georgia-Regular', fontsize=24, color='white')
sub = SubtitlesClip("subtitles.srt", generator, encoding='utf-8')
And this gives the error TypeError: __init__() got an unexpected keyword argument 'encoding'.
In the source code, the class SubtitlesClip does have a keyword argument encoding. Does that mean the version of the source code is outdated or something? And what can I do about this? I even attempted to copy the source code for moviepy.video.tools.subtitles with the encoding keyword argument directly to my code, yet it led to more errors like at the line:
from moviepy.decorators import convert_path_to_string
it failed to import the decorator convert_path_to_string.
The source code does not seem to agree with what I have installed. Anyway to fix it? If not, are there any good alternatives of Python libraries for inserting subtitles or video editing in general?
Edit: My current solution is to create a child class of SubtitlesClip and override the constructor of the parent class:
from moviepy.video.tools.subtitles import SubtitlesClip
from moviepy.video.VideoClip import TextClip, VideoClip
from moviepy.video.compositing.CompositeVideoClip import CompositeVideoClip
import re
from moviepy.tools import cvsecs
def file_to_subtitles_with_encoding(filename):
""" Converts a srt file into subtitles.
The returned list is of the form ``[((ta,tb),'some text'),...]``
and can be fed to SubtitlesClip.
Only works for '.srt' format for the moment.
"""
times_texts = []
current_times = None
current_text = ""
with open(filename,'r',encoding='utf-8') as f:
for line in f:
times = re.findall("([0-9]*:[0-9]*:[0-9]*,[0-9]*)", line)
if times:
current_times = [cvsecs(t) for t in times]
elif line.strip() == '':
times_texts.append((current_times, current_text.strip('\n')))
current_times, current_text = None, ""
elif current_times:
current_text += line
return times_texts
class SubtitlesClipUTF8(SubtitlesClip):
def __init__(self, subtitles, make_textclip=None):
VideoClip.__init__(self, has_constant_size=False)
if isinstance(subtitles, str):
subtitles = file_to_subtitles_with_encoding(subtitles)
#subtitles = [(map(cvsecs, tt),txt) for tt, txt in subtitles]
self.subtitles = subtitles
self.textclips = dict()
if make_textclip is None:
make_textclip = lambda txt: TextClip(txt, font='Georgia-Bold',
fontsize=24, color='white',
stroke_color='black', stroke_width=0.5)
self.make_textclip = make_textclip
self.start=0
self.duration = max([tb for ((ta,tb), txt) in self.subtitles])
self.end=self.duration
def add_textclip_if_none(t):
""" Will generate a textclip if it hasn't been generated asked
to generate it yet. If there is no subtitle to show at t, return
false. """
sub =[((ta,tb),txt) for ((ta,tb),txt) in self.textclips.keys()
if (ta<=t<tb)]
if not sub:
sub = [((ta,tb),txt) for ((ta,tb),txt) in self.subtitles if
(ta<=t<tb)]
if not sub:
return False
sub = sub[0]
if sub not in self.textclips.keys():
self.textclips[sub] = self.make_textclip(sub[1])
return sub
def make_frame(t):
sub = add_textclip_if_none(t)
return (self.textclips[sub].get_frame(t) if sub
else np.array([[[0,0,0]]]))
def make_mask_frame(t):
sub = add_textclip_if_none(t)
return (self.textclips[sub].mask.get_frame(t) if sub
else np.array([[0]]))
self.make_frame = make_frame
hasmask = bool(self.make_textclip('T').mask)
self.mask = VideoClip(make_mask_frame, ismask=True) if hasmask else None
I actually only changed two lines, but I have to create a new class and redefine the whole thing, so I doubt whether it's really necessary. Any better solution than this?
The latest version in the documentation (the one you are looking at) corresponds to a dev version 2.x, which is not released to PyPI yet. The version you have installed through pip is most likely 1.0.3, which is the latest on PyPI, and it doesn't allow an encoding parameter.
From the PR where the feature was introduced, you can see that it's only been tagged for release in 2.x versions.
Copying only that file to your source code will most likely not work, because it will depend on changes that happened in between the two versions. However, if you feel adventurous, you can install the dev version of the package, by following the Method by hand section in moviepy's docs.

BDD behave Python need to create a World map to hold values

I'm not too familiar with Python but I have setup a BDD framework using Python behave, I now want to create a World map class that holds data and is retrievable throughout all scenarios.
For instance I will have a world class where I can use:
World w
w.key.add('key', api.response)
In one scenario and in another I can then use:
World w
key = w.key.get('key').
Edit:
Or if there is a built in way of using context or similar in behave where the attributes are saved and retrievable throughout all scenarios that would be good.
Like lettuce where you can use world http://lettuce.it/tutorial/simple.html
I've tried this between scenarios but it doesn't seem to be picking it up
class World(dict):
def __setitem__(self, key, item):
self.__dict__[key] = item
print(item)
def __getitem__(self, key):
return self.__dict__[key]
Setting the item in one step in scenario A: w.setitem('key', response)
Getting the item in another step in scenario B: w.getitem('key',)
This shows me an error though:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python\lib\site-packages\behave\model.py", line 1456, in run
match.run(runner.context)
File "C:\Program Files (x86)\Python\lib\site-packages\behave\model.py", line 1903, in run
self.func(context, *args, **kwargs)
File "steps\get_account.py", line 14, in step_impl
print(w.__getitem__('appToken'))
File "C:Project\steps\world.py", line 8, in __getitem__
return self.__dict__[key]
KeyError: 'key'
It appears that the World does not hold values here between steps that are run.
Edit:
I'm unsure how to use environment.py but can see it has a way of running code before the steps. How can I allow my call to a soap client within environment.py to be called and then pass this to a particular step?
Edit:
I have made the request in environment.py and hardcoded the values, how can I pass variables to environment.py and back?
It's called "context" in the python-behave jargon. The first argument of your step definition function is an instance of the behave.runner.Context class, in which you can store your world instance. Please see the appropriate part of the tutorial.
Have you tried the
simple approach, using global var, for instance:
def before_all(context):
global response
response = api.response
def before_scenario(context, scenario):
global response
w.key.add('key', response)
Guess feature can be accessed from context, for instance:
def before_feature(context, feature):
feature.response = api.response
def before_scenario(context, scenario):
w.key.add('key', context.feature.response)
You are looking for:
Class variable: A variable that is shared by all instances of a class.
Your code in Q uses Class Instance variable.
Read about: python_classes_objects
For instance:
class World(dict):
__class_var = {}
def __setitem__(self, key, item):
World.__class_var[key] = item
def __getitem__(self, key):
return World.__class_var[key]
# Scenario A
A = World()
A['key'] = 'test'
print('A[\'key\']=%s' % A['key'] )
del A
# Scenario B
B = World()
print('B[\'key\']=%s' % B['key'] )
Output:
A['key']=test
B['key']=test
Tested with Python:3.4.2
Come back and Flag your Question as answered if this is working for you or comment why not.
Defining global var in before_all hook did not work for me.
As mentioned by #stovfl
But defining global var within one of my steps worked out.
Instead, as Szabo Peter mentioned use the context.
context.your_variable_name = api.response
and just use
context.your_variable_name anywhere the value is to be used.
For this I actually used a config file [config.py] I then added the variables in there and retrieved them using getattr. See below:
WSDL_URL = 'wsdl'
USERNAME = 'name'
PASSWORD = 'PWD'
Then retrieved them like:
import config
getattr(config, 'USERNAME ', 'username not found')

Understanding flow of execution of Python code

I'm trying to do home assignment connected with python from Data Manipulation at Scale: Systems and Algorithms at Curesra. Generally I have problems with understanding base code which was presented as an example of MapReduce alogorythm. I would be grateful for helping me understand it in 2 places, details below.
I tired to go step by step through code flow of below two files after running command:
python wordcount.py 'data/books.json'
File wordcount.py is opened
mr = MapReduce.MapReduce() - me object is created
def __init__(self): part from MapReduce.py is
executed
We come back to wordcount.py
Functions def mapper(record): and def reducer(key,list_of_values): are created but for the time being without execution
Python go to if __name__ == '__main__':
` inputdata = open(sys.argv[1]) - json file is assigned to a
variable
mr.execute(inputdata, mapper, reducer) - A call to the function from MapReduce.py.
And here is my first question we haven't deffined mapper or reducer variable/object so far. Is it just null/no value passed to this function or we somehow defined this variable before but I missed this?
Later me move to def execute(self, data, mapper, reducer): in
MapReduce.py
And there we have mapper(record).
So this is reference to a function in wordcount.py, am I right? But if we have reference to a function in different file shouldn't we use import at the beginning of the file and define from which file this function came?
(...) further code execution
wordcount.py file:
import MapReduce
import sys
"""
Word Count Example in the Simple Python MapReduce Framework
"""
mr = MapReduce.MapReduce()
# =============================
# Do not modify above this line
def mapper(record):
# key: document identifier
# value: document contents
key = record[0]
value = record[1]
words = value.split()
for w in words:
mr.emit_intermediate(w, 1)
def reducer(key, list_of_values):
# key: word
# value: list of occurrence counts
total = 0
for v in list_of_values:
total += v
mr.emit((key, total))
# Do not modify below this line
# =============================
if __name__ == '__main__':
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)
MapReduce.py file:
import json
class MapReduce:
def __init__(self):
self.intermediate = {}
self.result = []
def emit_intermediate(self, key, value):
self.intermediate.setdefault(key, [])
self.intermediate[key].append(value)
def emit(self, value):
self.result.append(value)
def execute(self, data, mapper, reducer):
for line in data:
record = json.loads(line)
mapper(record)
for key in self.intermediate:
reducer(key, self.intermediate[key])
#jenc = json.JSONEncoder(encoding='latin-1')
jenc = json.JSONEncoder()
for item in self.result:
print jenc.encode(item)
Thank you in advance for help with that.
In python everything is a object, that include functions, so you can pass a functionA as argument to another functionB (or class or whenever), and if functionB expect that you to do it, it will assume that you give it a functions with the right firm and a proceed as normal.
In yours case
mr.execute(inputdata, mapper, reducer)
here mapper, reducer are the functions previously defined that are passed as argument to the method execute of the instance mr of the class MapReduce and as you can see, said method use it as the functions that it expect.
Thank to this you can, as the that code show, make generic code that do some calculus that can be used in similar way by many applications by given the user the options of supplies his/her own functions.
A much more generic example of this is the function map, this function receive a function that do something, map don't care what it does or where it comefrom, only that receive as many argument as map himself receive (others that say functions) and return a value to build a new list with the results.

Variable Route Not Working in For Loop

I tried to create multiple routes in one go by using the variables from the database and a for loop.
I tried this
temp = "example"
#app.route("/speaker/<temp>")
def getSpeakerAtr(temp):
return '''%s''' % temp
It works very well. BUT:
for x in models.Speaker.objects:
temp = str(x.name)
#app.route("/speaker/<temp>")
def getSpeakerAtr(temp):
return '''%s''' % temp
Doesn't work. The error message:
File "/Users/yang/Documents/CCPC-Website/venv/lib/python2.7/site-packages/flask/app.py", line 1013, in decorator
02:03:04 web.1 | self.add_url_rule(rule, endpoint, f, **options)
**The reason I want to use multiple routes is that I need to get the full data of an object by querying from the route. For example:
if we type this url:
//.../speaker/sam
we can get the object who has the 'name' value as 'sam'. Then I can use all of the values in this object like bio or something.**
You don't need multiple routes. Just one route that validates its value, eg:
#app.route('/speaker/<temp>')
def getSpeakerAtr(temp):
if not any(temp == str(x.name) for x in models.Speaker.objects):
# do something appropriate (404 or something?)
# carry on doing something else
Or as to your real intent:
#app.route('/speaker/<name>')
def getSpeakerAtr(name):
speaker = # do something with models.Speaker.objects to lookup `name`
if not speaker: # or whatever check is suitable to determine name didn't exist
# raise a 404, or whatever's suitable
# we have a speaker object, so use as appropriate

Categories

Resources