Python: Extracting variables from a join

I have joined two datasets in Spark (PySpark), and the output looks like this:
(u'SomeThing', (u'ABC', u'500'))
I would like to define a function that extracts and returns only ABC and 500. I wrote a function like this:
def extract_lasttwo_cols(three_cols):
    a, b, c = three_cols.split(',')
    return b, c
But this function results in an error "tuple object has no attribute split()"
Is it possible to extract the variables without saving the results as text files and then processing them?

Tuples are immutable. split() is for str types.
This will return the b and c separately:
def extract_lasttwo_cols(three_cols):
    b, c = three_cols[1][0], three_cols[1][1]
    return b, c

Your value is a tuple with two elements, and the second element is itself a tuple, so you can return it directly:
def extract_lasttwo_cols(three_cols):
    return three_cols[1]
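As an illustration, here is a minimal sketch in plain Python (no Spark required) of how nested tuple unpacking pulls the two values out of a record shaped like the join output. The sample record and the function name extract_value_pair are assumptions made for this example; in PySpark, a function like this could be passed to the RDD's map().

```python
# A record shaped like one element of the joined output: (key, (value1, value2)).
record = (u'SomeThing', (u'ABC', u'500'))

def extract_value_pair(key_value):
    """Return the two joined values, dropping the key."""
    _key, (first, second) = key_value  # nested unpacking; no split() needed
    return first, second

b, c = extract_value_pair(record)
print(b, c)
```

Because the record is already a tuple, nothing has to be written out as text and parsed again.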

Related

Python concurrent.futures

I have multiprocessing code in which each process has to analyse the same data differently.
I have implemented:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res
and function:
def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
The problem is that goal_fcn is called only once, while it should be called multiple times.
In the debugger, I checked how the variable p looks, and it has multiple columns and rows. Inside goal_fcn, the variable x holds only the first row, which looks correct.
But the function is called only once. There is no error; the code just executes the next steps.
Even if I change p to [1,3,4,5] (and adjust the code accordingly), goal_fcn is executed only once.
I have to use map() because the order between input and output must be preserved.
map works like zip: it stops as soon as the shortest input sequence is exhausted. Your [global_DataFrame] and [global_String] lists have one element each, so that is where map ends.
There are two ways around this:
Use itertools.product. This is the equivalent of running "for all data frames, for all strings, for all p". Something like this:
def goal_fcn(x_DataFrame_String):
    x, DataFrame, String = x_DataFrame_String
    ...
executor.map(goal_fcn, itertools.product(p, [global_DataFrame], [global_String]))
Bind the fixed arguments instead of abusing the sequence arguments.
def goal_fcn(x, DataFrame, String):
    pass
bound = functools.partial(goal_fcn, DataFrame=global_DataFrame, String=global_String)
executor.map(bound, p)
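The truncation and the functools.partial fix can be reproduced with a small, self-contained sketch. Everything in it is an assumption for illustration: the goal_fcn body, the stand-in values 10 and "run", and the use of ThreadPoolExecutor, chosen only so the snippet runs without an `if __name__ == "__main__":` guard. Executor.map pairs its iterables the same way for ProcessPoolExecutor.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def goal_fcn(x, data, label):
    # Stand-in for the heavy calculation in the question.
    return f"{label}:{x * data}"

p = [1, 3, 4, 5]

with ThreadPoolExecutor() as executor:
    # The one-element lists end the zip-like pairing after the first item of p.
    truncated = list(executor.map(goal_fcn, p, [10], ["run"]))

    # Binding the fixed arguments leaves p as the only iterable.
    bound = partial(goal_fcn, data=10, label="run")
    full = list(executor.map(bound, p))

print(truncated)
print(full)
```

The results of map() come back in input order, so no call to as_completed() is needed to keep inputs and outputs aligned.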

Dynamically adding functions to array columns

I'm trying to dynamically add function calls to fill in array columns. I will be accessing the array millions of times so it needs to be quick.
I'm thinking of adding the function call to a dictionary, looked up by a string variable:
numpy_array[row,column] = dict[key[index containing function call]]
The full scope of the code I'm working with is too large to post, so here is an equivalent simplified example I've tried.
def hello(input):
    return input
dict1 = {}
# another function returns the name and ID values
name = 'hello'
ID = 0
dict1["hi"] = globals()[name](ID)
print(dict1)
However, using
globals()[name](ID)
calls the function immediately instead of storing hello(0) in the dictionary to be called later.
I'm a bit out of my depth here.
What is the proper way to implement this?
Is there a more efficient way to do this than reading into a dictionary on every call of
numpy_array[row,column] = dict[key[index containing function call]]
as I will be accessing and updating it millions of times.
I don't know if the dictionary is called every time the array is written to or if the location of the column is already saved into cache.
Would appreciate the help.
edit
Ultimately what I'm trying to do is initialize some arrays, dictionaries, and values with a function
def initialize(*args):
    create arrays and dictionaries
    assign values to global and local variables, arrays, dictionaries
Each time the initialize() function is used, it creates a new set of variables (names, values, etc.) that point to a different function with a different set of variables.
I have a NumPy array in which I want to store information from the function and the associated values created by the initialize() function.
In other words, for the hello(0) example above, that means the name of the function, its value, and some other things as set up within initialize().
What I'm trying to do is add the function with these settings to the NumPy array as a new column before I run the main program.
As another example, if I were setting up hello() (and hello() were a complex function), using initialize() might give me a value of 1, for hello(1).
Then if I use initialize() again, it might give me a value of 2, for hello(2).
If I used it one more time, it might give the value 0 for the function goodbye(0).
So in this scenario, let's say I have an array
array[row,0] = stuff()
array[row,1] = things()
array[row,2] = more_stuff()
array[row,3] = more_things()
Now I want it to look like
array[row,0] = stuff()
array[row,1] = things()
array[row,2] = more_stuff()
array[row,3] = more_things()
array[row,4] = hello(1)
array[row,5] = hello(2)
array[row,6] = goodbye(0)
As a third example:
def function1():
    do something
def function2():
    do something
def function3():
    do something
numpy_array(size)
initialize():
    do some stuff
    then add function1(23) to the next column in numpy_array
initialize():
    do some stuff
    then add function2(5) to the next column in numpy_array
initialize():
    do some stuff
    then add function3(50) to the next column in numpy_array
So as you can see, I need to permanently append new columns to the array and feed the new columns with the function/value as directed by the initialize() function, without manual intervention.
So fundamentally I need to figure out how to assign syntax to an array column based upon a string value without activating the syntax on assignment.
edit #2
I guess my explanations weren't clear enough.
Here is another way to look at it.
I'm trying to dynamically assign functions to an additional column in a numpy array based upon the output of a function.
The functions added to the array column will be used to fill the array millions of times with data.
The functions added to the array can be various different functions with different input values, and the number of functions added can vary.
I've tried assigning the functions to a dictionary using exec(), eval(), and globals() but when using these during assignment it just instantly activates the functions instead of assigning them.
numpy_array = np.array((1,5))
def some_function():
    do some stuff
    return ('other_function(15)')
# somehow add 'other_function(15)' to the array column.
numpy_array[1,6] = other_function(15)
The functions returned by some_function() may or may not exist each time the program is run so the functions added to the array are also dynamic.
I'm not sure this is what the OP is after, but here is a way to make an indirection of functions by name:
def make_fun_dict():
    magic = 17
    def foo(x):
        return x + magic
    def bar(x):
        return 2 * x + 1
    def hello(x):
        return x**2
    return {k: f for k, f in locals().items() if hasattr(f, '__name__')}
mydict = make_fun_dict()
>>> mydict
{'foo': <function __main__.make_fun_dict.<locals>.foo(x)>,
'bar': <function __main__.make_fun_dict.<locals>.bar(x)>,
'hello': <function __main__.make_fun_dict.<locals>.hello(x)>}
>>> mydict['foo'](0)
17
Example usage:
x = np.arange(5, dtype=int)
names = ['foo', 'bar', 'hello', 'foo', 'hello']
>>> np.array([mydict[name](v) for name, v in zip(names, x)])
array([17, 3, 4, 20, 16])
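To store a call without executing it at assignment time, one option is functools.partial, which bundles a function with its arguments into a callable that runs only when invoked. The sketch below is an illustration under stated assumptions: the registry dictionary, the body of goodbye, and the names requested and column_funcs are invented for this example, and a plain list stands in for a row of the NumPy array.

```python
from functools import partial

def hello(value):
    return value

def goodbye(value):
    return value + 100  # hypothetical body, only so its output differs from hello()

# Explicit name-to-function registry (an assumption; an alternative to globals()).
registry = {"hello": hello, "goodbye": goodbye}

# Pairs such as an initialize() step might produce: (function name, argument).
requested = [("hello", 1), ("hello", 2), ("goodbye", 0)]

# partial() stores the function together with its argument; nothing is called yet.
column_funcs = [partial(registry[name], arg) for name, arg in requested]

# Later, fill one row by calling each stored function.
row = [func() for func in column_funcs]
print(row)
```

Resolving each name once and keeping the resulting callables in a list means the repeated fill step does not have to look anything up by string.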

Declaring a pandas series with a user-defined function

I am trying to simplify some code with a function. The intent is to use the function to declare blank series to populate later.
The code currently declares each series on a separate line like this:
series1=pd.Series()
series2=pd.Series()
This approach works well but makes the code lengthy with many series.
I would like to do the following:
Create a list of blank objects to use in the function with the names series1, series2, etc. or with a more descriptive name for each
series_list=[series1,series2]
Declare function
def series(name):
    name=pd.Series()
    return name
Call function with input
for i in series_list:
    series(i)
However, when I try to declare the series_list, it raises NameError: [variable] is not defined. Is there a way to populate the series_list with empty objects (i.e. no data but with the names series1, series2, ... series1000)?
Here's how you instantiate the Series objects iteratively, then use the generated list to assign to known variables:
def assign_series(n):
    series_list = []
    #series_dict = {}
    num_of_series = n
    for i in range(num_of_series):
        # explicit dtype: the default for an empty Series differs between pandas versions
        series_list.append(pd.Series(dtype='float64'))
        #or if you want to call them by name
        #series_dict['series'+str(i)] = pd.Series(dtype='float64')
    return series_list
corporate_securities, agency_securities, unrealized_gainloss = assign_series(3)
corporate_securities
Series([], dtype: float64)

“Too many values to unpack”

I have a function that preprocesses images in batches to forward to caffe as input. It looks something like the code below and returns two variables.
def processImageCrop(im_info, transformer, flowtransformer):
    .....
    return processed_image, processed_flowimage
class ImageProcessorCrop(object):
    def __init__(self, transformer, flowtransformer):
        self.transformer = transformer
        self.flowtransformer = flowtransformer
        #self.flow = flow
    def __call__(self, im_info):
        return processImageCrop(im_info, self.transformer, self.flowtransformer) #, self.flow)
I call this function with pool.map sending im_info parameters, and want to assign the two variables returned as below, but I get the exception Too many values to unpack. Both variables should have length 192. How can I assign the returned values? Thx. I don't want to iterate over each element, but return the two values and assign them to two variables.
result['data'] , result['flowdata'] = pool.map(image_processor, im_info)
Your pool.map call is going to return a list with the results of calling your callable class once per value in im_info. If im_info has more than two values, your assignment that unpacks the list into two variables will not work.
If you actually want to be unpacking the two-tuples within the list, you probably want to use zip to transpose the data:
result['data'], result['flowdata'] = zip(*pool.map(image_processor, im_info))
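The transposition can be checked with a small sketch. The process function and its inputs are hypothetical stand-ins, and the built-in map is used in place of pool.map so the snippet runs on its own; both produce one result per input item.

```python
def process(item):
    # Hypothetical stand-in for the image processor: returns a pair per input.
    return item * 2, item * 3

items = [1, 2, 3]

# One (data, flowdata) pair per input item, like the list pool.map returns.
pairs = list(map(process, items))

# zip(*pairs) regroups the pairs into one tuple of firsts and one of seconds.
data, flowdata = zip(*pairs)
print(data)
print(flowdata)
```

Note that zip yields tuples, so wrap the results in list() if lists are required downstream.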

How to return multiple strings from a script to the rule sequence in booggie 2?

This is an issue specific to the use of python scripts in booggie 2.
I want to return multiple strings to the sequence and store them there in variables.
The script should look like this:
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): string, string"""
    return "string_1", "string_2"
In the sequence I want to have this:
(param_1, param_2) = getConfiguration(1)
Please note: The booggie-project does not exist anymore but led to the development of Soley Studio which covers the same functionality.
Scripts in booggie 2 are restricted to a single return value.
But you can return an array which then contains your strings.
Sadly Python arrays are different from GrGen arrays so we need to convert them first.
So your example would look like this:
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): array<string>"""
    #TypeHelper in booggie 2 contains conversion methods from Python to GrGen types
    return TypeHelper.ToSeqArray(["string_1", "string_2"])
return a tuple
return ("string_1", "string_2")
See this example
In [124]: def f():
   .....:     return (1,2)
   .....:
In [125]: a, b = f()
In [126]: a
Out[126]: 1
In [127]: b
Out[127]: 2
It is still not possible to return multiple values, but a Python list is now converted into a C# array that works in the sequence.
The python script itself should look like this
def getConfiguration(config_id):
    """ Signature: getConfiguration(int): array<string>"""
    return ["feature_1", "feature_2"]
In the sequence, you can then use this list as if it was an array:
config_list:array<string> # initialize array of string
(config_list) = getConfiguration(1) # assign script output to that array
{first_item = config_list[0]} # get the first string("feature_1")
{second_item = config_list[1]} # get the second string("feature_2")
