I am trying to update a variable in PySpark and want to use it in another method. I am using @property in a class; when I tested it in plain Python it works as expected, but when I try to implement it in PySpark the variable is not updated. Please help me find out what I am doing wrong.
Code:
class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval
filenme = sc.wholeTextFiles("/user/root/CCDs")

hrk = Hrk("No Value")

def add_demo(filename):
    pfname = []
    plname = []
    PDOB = []
    gender = []
    .......i have not mentioned my logic, i skipped that part......
    hrk.hrkval = pfname[0] + "##" + plname[0] + PDOB[0] + gender[0]
    return str(hrk.hrkval)

def add_med(filename):
    return str(hrk.hrkval)

filenme.map(getname).map(add_demo).saveAsTextFile("/user/cloudera/Demo/")
filenme.map(getname).map(add_med).saveAsTextFile("/user/cloudera/Med/")
In my first method call (add_demo) I get the proper value, but when I try to use the same variable in the second method (add_med) I get No Value. I don't know why it is not updating the variable, whereas similar logic works fine in plain Python.
You are trying to mutate the state of a global variable using the map API. This is not a recommended pattern for Spark. Try to use pure functions as much as possible, and use operations like .reduce, .reduceByKey, or .fold. The reason the following simplified example does not work is that when .map is called, Spark creates a closure for the function f1, makes a copy of the hrk object for each partition, and applies the function to the rows within each partition.
import pyspark
import pyspark.sql

number_cores = 2
memory_gb = 1
conf = (
    pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb))
)
sc = pyspark.SparkContext(conf=conf)
spark = pyspark.sql.SQLContext(sc)

class Hrk(object):
    def __init__(self, hrkval):
        self.hrkval = hrkval

    @property
    def hrkval(self):
        return self._hrkval

    @hrkval.setter
    def hrkval(self, value):
        self._hrkval = value

    @hrkval.deleter
    def hrkval(self):
        del self._hrkval

hrk = Hrk("No Value")
print(hrk.hrkval)
# No Value

def f1(x):
    hrk.hrkval = str(x)
    return "str:" + str(hrk.hrkval)

data = sc.parallelize([1, 2, 3])
data.map(f1).collect()
# ['str:1', 'str:2', 'str:3']

print(hrk.hrkval)
# No Value
You can read more about closures in the Understanding Closures section of the RDD programming guide in the official Spark docs; here are some important snippets:
One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-
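If the goal is simply to carry a value computed in one map stage into a later one, the usual pure-function fix is to return that value from the mapped function so it travels with each record, instead of stashing it on a shared object. Here is a minimal sketch, reusing sc from the example above (the parsing is just a placeholder, not your real logic):

def add_demo(record):
    # placeholder for the real parsing logic
    demo_key = "parsed:" + str(record)
    return (record, demo_key)      # keep the derived value with the row

def add_med(pair):
    record, demo_key = pair
    return str(demo_key)           # the value is available here without any globals

data = sc.parallelize([1, 2, 3])
print(data.map(add_demo).map(add_med).collect())
# ['parsed:1', 'parsed:2', 'parsed:3']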
Is it possible in PySpark to use the parallelize function over Python objects? I want to run in parallel over a list of objects, modify them using a function, and then print these objects.
def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark(object_list):
    spark, sc = init_spark(appname="analysis")
    p_configs_RDD = sc.parallelize(object_list)
    p_configs_RDD = p_configs_RDD.map(func)
    p_configs_RDD.foreach(print)

def func(object):
    return do_something(object)
When I run the above code, I encounter the error "AttributeError: Can't get attribute 'Object' on <module 'pyspark.daemon' from...>". How can I solve it?
I did the following workaround, but I don't think it is a good solution in general, and it assumes I can change the constructor of the object.
I converted the object into a dictionary, and constructed the object back from the dictionary.
def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark(object_list):
    spark, sc = init_spark(appname="analysis")
    p_configs_RDD = sc.parallelize([x.__dict__ for x in object_list])
    p_configs_RDD = p_configs_RDD.map(func)
    p_configs_RDD.foreach(print)

def func(dict):
    object = CreateObject(create_from_dict=True, dictionary=dict)
    return do_something(object)

In the constructor of the Object:

class Object:
    def __init__(self, create_from_dict=False, dictionary=None, other_params...):
        if create_from_dict:
            self.__dict__.update(dictionary)
            return
Are there any better solutions?
Well, for a better answer I suggest you post a sample of the object_list and your desired output, so we can test with real code.
According to the PySpark docs (as above), the parallelize function should accept any collection, so I think the problem might be the object_list itself. I can see why the workaround works, since the input type becomes a list of dictionaries (or other mapping objects).
As for a modular way to run over generally created objects, it depends on how you want the RDD to look, but the general approach is to convert the whole object into a collection-type object. One solution, without modifying the constructor or structure, can be
sc.parallelize([object_list])
The key point is to ensure that the input is a collection type.
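To make the dictionary round trip concrete, here is a rough sketch, under the assumption that the objects can be rebuilt from their __dict__ (Config and do_something are hypothetical names, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analysis").getOrCreate()
sc = spark.sparkContext

class Config:
    def __init__(self, name, value):
        self.name = name
        self.value = value

def do_something(cfg):
    # placeholder transformation
    return {"name": cfg.name, "value": cfg.value * 2}

object_list = [Config("a", 1), Config("b", 2)]

# ship plain dictionaries to the executors and rebuild the objects there
results = (
    sc.parallelize([x.__dict__ for x in object_list])
      .map(lambda d: do_something(Config(**d)))
      .collect()
)
print(results)   # [{'name': 'a', 'value': 2}, {'name': 'b', 'value': 4}]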
I am new to Python and am trying to learn about unit testing. I am using pytest to run some unit tests on my code, something very similar to this example:
def test_calc_total():
    total = mathlib.calc_total(4, 5)
    assert total == 9
I may be incorrect here, but I am learning that in order for a function to be testable, it has to return a value. Is this true for all unit tests?
I am finding that the majority of my functions cannot be tested, since they don't have a return value. I am creating a ticket tracking system and storing the ticket information in a dataframe. Here is a portion of the TicketDF class:
class TicketDF():
    cwd = os.getcwd()
    df = pd.DataFrame()

    def __init__(self, configParser):
        self.populateFilePaths(configParser)
        self.populateAttributes(configParser)
        self.populateTable()

    def populateTable(self):
        if self.doesFileExistAndNotEmpty(self.ticketCSVFilePath):
            self.df = pd.read_csv(self.ticketCSVFilePath, converters={'Comments': literal_eval}) #, 'Attachments':literal_eval})
            self.df.set_index('Ticket ID', inplace=True)

    def populateAttributes(self, configParser):
        self.ticketAttributes = configParser.getTicketAttributes()

    def populateFilePaths(self, configParser):
        self.ticketCSVFilePath = configParser.getTicketsCSVPath()

    def doesFileExistAndNotEmpty(self, filepath):
        if os.path.exists(filepath) and os.path.getsize(filepath) > 0:
            return True
        else:
            return False

    def getTicketDF(self):
        return self.df

    def addTicketToDF(self, data):
        df = pd.DataFrame(data, columns=self.ticketAttributes)
        df.set_index('Ticket ID', inplace=True)
        self.df = self.df.append(df)
        self.saveDFToCSV()
As you can see, the majority of the functions, even the ones not shown, do not have return values. Is it best practice to write code so that all functions have a return value? Or is there another way to test functions without a return value?
The short answer is no.
Testing a function depends on the ability either to observe changes caused by the function to the state of the program, or to access the return value of the function.
So, if a function's behavior changes the state of the program and we need to test that behavior, then we need to be able to access the part of the state that is changed. Such tests may not rely on the return value of the function at all. Hence, functions need not return a value to be testable, but they do need to expose information pertaining to the behaviors that need to be tested.
For more on this aspect of testing, read about test doubles. The chapter about indirect inputs and outputs ("Using Test Doubles") in the book "xUnit Test Patterns" is a good reference.
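For instance, a state-based test of the TicketDF class from the question could assert on the dataframe exposed through getTicketDF instead of on a return value. This is only a sketch: FakeConfigParser is an assumed test double, and saveDFToCSV (not shown in the question) also runs inside addTicketToDF, so in practice you might stub it out or point it at a temporary file:

class FakeConfigParser:
    # minimal test double standing in for the real config object
    def getTicketAttributes(self):
        return ['Ticket ID', 'Comments']

    def getTicketsCSVPath(self):
        return 'does_not_exist.csv'   # populateTable skips a missing file

def test_add_ticket_changes_state():
    tickets = TicketDF(FakeConfigParser())
    tickets.addTicketToDF({'Ticket ID': ['T-1'], 'Comments': [['first comment']]})
    # no return value needed: we observe the changed state instead
    assert 'T-1' in tickets.getTicketDF().index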
No, a function can have a side effect, and you test that side effect.
For example, consider a function that writes to a file and reports errors via exceptions. It has nothing to return. The unit test would invoke the function and check that the relevant file has been written to, as well as induce the various exceptions.
Another example is a method call which changes the state of an object. You call the method and check that the object's state has changed.
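As a sketch of the file-writing case (write_report is a hypothetical function, not from the question; tmp_path is pytest's built-in temporary-directory fixture):

def write_report(path, lines):
    # hypothetical function under test: its only observable effect is the file it writes
    with open(path, 'w') as f:
        f.write('\n'.join(lines))

def test_write_report_creates_file(tmp_path):
    report = tmp_path / 'report.txt'
    write_report(report, ['line one', 'line two'])
    # no return value: assert on the side effect instead
    assert report.read_text() == 'line one\nline two'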
When creating a new instance of an object in Python, why would someone spell out the parameter names at instantiation time? Say you have the following object:
class Thing:
    def __init__(self, var1=None, var2=None):
        self.var1 = var1
        self.var2 = var2
The programmer then decides to create an instance of this object at some later point and writes it in the following way:
NewObj = Thing(var1=newVar,var2=otherVar)
Is there a reason why someone would enter it that way vs. just entering the newVar/otherVar variables into the constructor parameters without using "var1=" and "var2="? Like below:
NewObj = Thing(newVar,otherVar)
I'm a fairly novice Python user, and I couldn't find anything about this specific sort of syntax, even though it seems like a fairly simple/straightforward question.
The reason is clarity, not for the computer, but for yourself and other humans.
class Calculation:
    def __init__(self, low=None, high=None, mean=None):
        self.low = low
        self.high = high
        self.mean = mean
        ...

# with naming (notice how ordering is not important)
calc = Calculation(mean=0.5, low=0, high=1)

# without naming (now order is important and it is less clear what the numbers are used for)
calc = Calculation(0, 1, 0.5)
Note that the same can be done for any function, not only when initializing an object.
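For example, the same readability argument applies to a plain function call (a tiny illustrative sketch):

def describe(name, age=None, city=None):
    return "{}, age {}, from {}".format(name, age, city)

# keyword arguments make the call self-documenting and order-independent
print(describe("Ada", city="London", age=36))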
I am working in Pyspark with a lambda function like the following:
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])
The implementation of method1 is the following:
def method1(value, dict_global):
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result
'dict_global' is a global dictionary that contains some values.
The problem is that when I execute the lambda function the result is always None. For some reason the 'method1' function doesn't see the variable 'dict_global' as an external variable. Why? What can I do?
Finally I found a solution. I write it below:
Lambda functions (as well as the map and reduce functions) executed in Spark are scheduled across the different executors and run in separate execution threads. So the problem in my code was that global variables are sometimes not captured by the functions executed in parallel on different executors, so I looked for a way to solve it.
Fortunately, Spark has an element called a "Broadcast" which allows variables to be passed to a function whose execution is distributed among the executors, so they can be used without problems. There are two types of shareable variables: broadcasts (immutable, read-only) and accumulators (mutable, but only numeric values are accepted).
I rewrote my code to show you how I fixed the problem:
broadcastVar = sc.broadcast(dict_global)
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar.value), IntegerType())
result_col = udf_func(df[atr1])
Hope it helps!
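For reference, here is a minimal self-contained sketch of the same broadcast pattern, using pyspark.sql.functions.udf rather than the older UserDefinedFunction; the column name and dictionary contents are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

dict_global = {"a": 1, "b": 2}
broadcastVar = sc.broadcast(dict_global)

def method1(value, lookup):
    # default to the dictionary size when the key is missing
    return lookup.get(value, len(lookup))

lookup_udf = udf(lambda value: method1(value, broadcastVar.value), IntegerType())

df = spark.createDataFrame([("a",), ("b",), ("z",)], ["atr1"])
df.withColumn("result", lookup_udf("atr1")).show()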
I had a program that read in a text file and took out the necessary variables for serialization into Turtle format and storage in an RDF graph. The code I had was crude and I was advised to separate it into functions. As I am new to Python, I had no idea how to do this. Below are some of the functions of the program.
I am getting confused as to when parameters should be passed into the functions and when they should be initialized with self. Here are some of my functions. If I could get an explanation as to what I am doing wrong, that would be great.
#!/usr/bin/env python
from rdflib import URIRef, Graph
from StringIO import StringIO
import subprocess as sub

class Wordnet():
    def __init__(self, graph):
        self.graph = Graph()

    def process_file(self, file):
        file = open("new_2.txt", "r")
        return file

    def line_for_loop(self, file):
        for line in file:
            self.split_pointer_part()
            self.split_word_part()
            self.split_gloss_part()
            self.process_lex_filenum()
            self.process_synset_offset()
            +more functions............
            self.print_graph()

    def split_pointer_part(self, before_at, after_at, line):
        before_at, after_at = line.split('#', 1)
        return before_at, after_at

    def get_num_words(self, word_part, num_words):
        """ 1 as default, may want 0 as an invalid case """
        """ do if else statements on l3 variable """
        if word_part[3] == '0a':
            num_words = 10
        else:
            num_words = int(word_part[3])
        return num_words

    def get_pointers_list(self, pointers, after_at, num_pointers, pointerList):
        pointers = after_at.split()[0:0 + 4 * num_pointers:4]
        pointerList = iter(pointers)
        return pointerList

    ............code to create triples for graph...............

    def print_graph(self):
        print graph.serialize(format='nt')

def main():
    wordnet = Wordnet()
    my_file = wordnet.process_file()
    wordnet.line_for_loop(my_file)

if __name__ == "__main__":
    main()
Your question is mainly a question about what object-oriented programming is. I will try to explain quickly, but I recommend reading a proper tutorial on it, like
http://www.voidspace.org.uk/python/articles/OOP.shtml
http://net.tutsplus.com/tutorials/python-tutorials/python-from-scratch-object-oriented-programming/
and/or http://www.tutorialspoint.com/python/python_classes_objects.htm
When you create a class and instantiate it (with mywordnet = Wordnet(somegraph)), you can reuse the mywordnet instance many times. Each variable you set on self in Wordnet is stored in that instance. So, for instance, self.graph is always available if you call any method of mywordnet. If you didn't store it in self.graph, you would need to specify it as a parameter in each method (function) that requires it, which would be tedious if all of these method calls require the same graph anyway.
So, to look at it another way: everything you set with self can be seen as a sort of configuration for that specific instance of Wordnet. It influences the Wordnet behaviour. You could, for instance, have two Wordnet instances, each instantiated with a different graph, but with all other functionality the same. That way you can choose which graph to print to, depending on which Wordnet instance you use, but everything else stays the same.
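A small sketch of that idea; note it assumes __init__ actually stores the graph that is passed in, which the original __init__ forgets to do (it always creates a fresh Graph()):

from rdflib import Graph

class Wordnet(object):
    def __init__(self, graph):
        self.graph = graph            # keep the graph passed in, not a new Graph()

    def print_graph(self):
        print(self.graph.serialize(format='nt'))

noun_graph = Graph()
verb_graph = Graph()

nouns = Wordnet(noun_graph)   # each instance carries its own configuration
verbs = Wordnet(verb_graph)

nouns.print_graph()           # serializes noun_graph
verbs.print_graph()           # serializes verb_graph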
I hope this helps you out a little.
First, I suggest you figure out the basic functional decomposition on its own - don't worry about writing a class at all.
For example,
def split_pointer_part(self, before_at, after_at, line):
    before_at, after_at = line.split('#', 1)
    return before_at, after_at
doesn't touch any instance variables (it never refers to self), so it can just be a standalone function.
It also exhibits a peculiarity I see in your other code: you pass two arguments (before_at, after_at) but never use their values. If the caller doesn't already know what they are, why pass them in?
So, a free function should probably look like:
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
If you want to put this function in your class scope (so it doesn't pollute the top-level namespace, or just because it's a logical grouping), you still don't need to pass self if it isn't used. You can make it a static method:
@staticmethod
def split_pointer_part(line):
    """get tuple (before #, after #)"""
    return line.split('#', 1)
One thing that would be very helpful for you is a good visual debugger. There's a nice free one for Python called Winpdb. There are also excellent debuggers in the commercial products IntelliJ IDEA/PyCharm, Komodo IDE, WingIDE, and Visual Studio (with the Python Tools add-in). Probably a few others too.
I highly recommend setting up one of these debuggers and running your code under it. It will let you step through your code line by line and see what happens with all your variables and objects.
You may find people who tell you that real programmers don't need or shouldn't use debuggers. Don't listen to them: a good debugger is one of the very best tools to help you learn a new language or to get familiar with a piece of code.