How to convert a function to a Pandas UDF in PySpark?

I have a function in Python I would like to adapt to PySpark. I am pretty new to PySpark, so finding a way to implement this - whether with a UDF or natively in PySpark - is posing a challenge.
Essentially, it performs a series of numpy calculations on a grouped DataFrame, and I am not sure of the best way to do this in PySpark.
Python code:
import numpy as np
import pandas as pd

data = [
    [1, "a", 10, 23, 33],
    [1, "b", 11, 25, 34],
    [1, "c", 12, 35, 35],
    [1, "d", 13, 40, 36],
    [2, "e", 14, 56, 38],
    [2, "g", 14, 56, 39],
    [2, "g", 16, 40, 38],
    [2, "g", 19, 87, 90],
    [3, "a", 20, 12, 90],
    [3, "a", 21, 45, 80],
    [3, "b", 21, 45, 38],
    [3, "c", 12, 45, 67],
    [3, "d", 18, 45, 78],
    [3, "d", 12, 78, 90],
    [3, "d", 8, 85, 87],
    [3, "d", 19, 87, 89],
]
df = pd.DataFrame(data, columns=["id", "sub_id", "sub_sub_id", "value_1", "value_2"])

grouped_df = df.groupby(["id", "sub_id", "sub_sub_id"])
aggregated_df = grouped_df.agg(
    {
        "value_1": ["mean", "std"],
        "value_2": ["mean", "std"],
    }
).reset_index()

for value in ["value_1", "value_2"]:
    aggregated_df[f"{value}_calc"] = np.maximum(
        aggregated_df[value]["mean"] - grouped_df[value].min().values,
        grouped_df[value].max().values - aggregated_df[value]["mean"],
    )
I was trying to perform a Window function with the already grouped and aggregated Spark Dataframe, but I am pretty sure this is not the best way to do this.
test = aggregated_sdf.withColumn(
    "new_calculated_value",
    spark_fns.max(
        spark_fns.expr(
            "ave_value_1" - spark_fns.min(spark_fns.collect_list("ave_value_1"))
        ),
        (
            spark_fns.expr(
                spark_fns.max(spark_fns.collect_list("ave_value_1")) - "ave_value_1"
            )
        ),
    ).over(Window.partitionBy("id", "sub_id", "sub_sub_id"))
)

You can try doing the calculations during the aggregation, similar to what you did in the pandas code. The equivalent of np.maximum should be F.greatest. F.max is an aggregate function which gets the maximum in a column, while F.greatest is not an aggregate function, and gets the maximum of several columns along a single row.
import pyspark.sql.functions as F

df2 = df.groupby("id", "sub_id", "sub_sub_id").agg(
    F.mean('value_1').alias('ave_value_1'),
    F.mean('value_2').alias('ave_value_2'),
    F.greatest(
        F.mean('value_1') - F.min('value_1'),
        F.max('value_1') - F.mean('value_1')
    ).alias('value_1_calc'),
    F.greatest(
        F.mean('value_2') - F.min('value_2'),
        F.max('value_2') - F.mean('value_2')
    ).alias('value_2_calc')
)
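Since the question title asks about Pandas UDFs specifically, here is a minimal sketch of the grouped-map alternative using applyInPandas (Spark 3.0+). It assumes an active SparkSession named spark and the data list from the question; the names sdf and calc and the schema string are illustrative, not part of the answer above.
import numpy as np
import pandas as pd

sdf = spark.createDataFrame(data, ["id", "sub_id", "sub_sub_id", "value_1", "value_2"])

def calc(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one group as a pandas DataFrame and returns one summary row.
    out = pdf[["id", "sub_id", "sub_sub_id"]].iloc[:1].copy()
    for value in ["value_1", "value_2"]:
        mean = pdf[value].mean()
        out[f"ave_{value}"] = mean
        out[f"{value}_calc"] = np.maximum(mean - pdf[value].min(),
                                          pdf[value].max() - mean)
    return out

result = sdf.groupBy("id", "sub_id", "sub_sub_id").applyInPandas(
    calc,
    schema=("id long, sub_id string, sub_sub_id long, "
            "ave_value_1 double, value_1_calc double, "
            "ave_value_2 double, value_2_calc double"),
)
In practice the pure-Spark aggregation above is usually preferable; the grouped-map route mainly helps when the per-group numpy logic is too awkward to express with built-in functions.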

Related

How to change or create new ndarray from list

I have a value X of type ndarray with shape (40000, 2).
The second column of X contains lists of 50 numbers.
Example:
[17, [1, 2, 3, ...]],
[39, [44, 45, 45, ...]], ...
I want to convert it to an ndarray of shape (40000, 51):
the first column will stay the same;
every element of the list will go into its own column.
For my example:
[17, 1, 2, 3, ...],
[39, 44, 45, 45, ...]
How can I do it?
np.hstack((arr[:,0].reshape(-1,1), np.array(arr[:,1].tolist())))
Example:
>>> arr
array([[75, list([90, 39, 63])],
       [20, list([82, 92, 22])],
       [80, list([12, 6, 89])],
       [79, list([11, 96, 74])],
       [96, list([26, 37, 65])]], dtype=object)
>>> np.hstack((arr[:,0].reshape(-1,1), np.array(arr[:,1].tolist()))).astype(int)
array([[75, 90, 39, 63],
       [20, 82, 92, 22],
       [80, 12, 6, 89],
       [79, 11, 96, 74],
       [96, 26, 37, 65]])
You can do this for each line of your ndarray; here is an example:
import numpy

# X = [39, [44, 45, 45, ...]]  (one row: a scalar followed by a list of 50 numbers)
newX = numpy.ndarray(shape=(1, 51))
newX[0, 0] = X[0]  # adding the first element
# now the rest of the elements
i = 1
for e in X[1]:
    newX[0, i] = e
    i = i + 1
You can wrap this process in a function and apply it in this way (a sketch of such a helper follows below):
newArray = numpy.ndarray(shape=(40000, 51))
i = 0
for x in oldArray:
    Process(newArray[i], x)
    i = i + 1
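The helper Process is left undefined in the answer; a minimal sketch of what it could look like (the in-place fill is an assumption, not part of the original answer):
def Process(target_row, x):
    # Fill a pre-allocated 51-element output row in place
    # from one [scalar, list_of_50_numbers] pair.
    target_row[0] = x[0]
    target_row[1:] = x[1]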
I defined the source array (with shorter lists in column 1) as:
X = np.array([[17, [1, 2, 3, 4]], [39, [44, 45, 45, 46]]])
To do your task, define the following function:
def myExplode(row):
    tbl = [row[0]]
    tbl.extend(row[1])
    return tbl
Then apply it to each row:
np.apply_along_axis(myExplode, axis=1, arr=X)
The result is:
array([[17, 1, 2, 3, 4],
       [39, 44, 45, 45, 46]])

Python KeyError: 0 troubleshooting

I'm new to Gurobi and Python in general, and keep getting the error 'KeyError: 0' on line 27 (the final line) whenever I run my code (which obviously isn't complete, but my professor encouraged us to run our code as we write it because it's easier to troubleshoot that way).
I've read on multiple forums what that means (that it tried to access key 0 in a dictionary where that key isn't present or hasn't been initialized), but I still don't really understand it.
from gurobipy import *
# Sets
SetA = ["a", "b", "c", "d", "e"]
SetB = ["f", "g", "h", "i", "j",
        "k", "l", "m", "n", "o"]
A = range(len(SetA))
B = range(len(SetB))
# Data
PC = 100
X = [1, 2, 3, 4, 5]
D = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
    [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    [41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
]
m = Model("Problem 2")
# Variables
Z = {(a,b): m.addVar() for a in A for b in B}
# Objective
m.setObjective(quicksum((PC+X[a]+Y[a][b])*Z[a][b] for a in A for b in B), GRB.MINIMIZE)
Solution:
Change final line to:
m.setObjective(quicksum((PC+X[a]+Y[a][b])*Z[a,b] for a in A for b in B), GRB.MINIMIZE)
You are getting KeyError: 0 because at the beginning of your generator expression, where
for a in A
a is equal to 0. So at this line
m.setObjective(quicksum((PC+X[a]+Y[a][b])*Z[a][b] for a in A for b in B), GRB.MINIMIZE)
where you typed Z[a][b], you are trying to access the value at key 0 of dictionary Z (and then key 0 of Z[a], which does not even exist), but dictionary Z has no key 0, since all its keys are tuples.
So, as you correctly derived yourself, you don't want to access a value stored under key b of a dictionary Z[a]; instead you want the value stored under the tuple key (a, b) of dictionary Z:
m.setObjective(quicksum((PC+X[a]+Y[a][b])*Z[a,b] for a in A for b in B), GRB.MINIMIZE)
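A tiny plain-Python illustration of the difference (no Gurobi needed; the dictionary below is made up just to show the indexing):
Z = {(a, b): 0 for a in range(2) for b in range(3)}
print(Z[0, 1])   # fine: looks up the tuple key (0, 1)
print(Z[0][1])   # raises KeyError: 0 -- there is no key 0, only tuple keys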

Python: How to bin a set of data by a repeated value in one column

Say, I have a numpy array like this:
import numpy as np

x = np.array(
    [[100, 14, 12, 15],
     [100, 21, 16, 11],
     [100, 19, 10, 13],
     [160, 24, 15, 12],
     [160, 43, 12, 65],
     [160, 17, 53, 23],
     [300, 15, 17, 11],
     [300, 66, 23, 12],
     [300, 44, 70, 19]])
The original array is much bigger, so my question is whether there's a way to bin or group rows based on the value in the first column?
For example:
{'100': [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
 '160': [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
 '300': [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
Since you tagged this as pandas, you might want to do it using DataFrame's groupby() functionality. You'd create a DataFrame from your original array
import pandas as pd
df = pd.DataFrame(x)
and group by the first column; then you can iterate over the resulting GroupBy object to get the sub-frames whose rows all share the same value in the first column.
{key: group for key, group in df.groupby(0)}
Of course, in this snippet group includes the first column. You can strip it out using indexing:
{key: group.iloc[:,1:] for key, group in df.groupby(0)}
and if you would like to convert the sub-frames back into Numpy arrays, use group.iloc[:,1:].values instead. (If you want them as lists of lists, as indicated in your question, it shouldn't be hard to write a function to make that conversion, but it'll probably be more efficient to keep it in Pandas or at least Numpy if you can.)
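For reference, a minimal sketch of that conversion to plain lists of lists (the keys stay as integers here; cast them to str if you want string keys like the question's expected output):
{key: group.iloc[:, 1:].values.tolist() for key, group in df.groupby(0)}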
An alternative is to use the OG groupby() from itertools which lets you avoid Pandas (if you have some reason for doing so) and use a plain old iterative approach.
import itertools as it
{key: list(map(list, group))
 for key, group in it.groupby(x, lambda row: row[0])}
This, again, includes the key in the resulting rows, but you can trim it out using indexing
{key: list(map(lambda a: list(a)[1:], group))
 for key, group in it.groupby(x, lambda row: row[0])}
You can make the code a tad cleaner by using the groupby_transform() function from the more-itertools module (which is not included in the standard Python library):
import more_itertools as mt
{key: list(group) for key, group in mt.groupby_transform(
    x, lambda row: row[0], lambda row: list(row[1:])
)}
Disclosure: I contributed the groupby_transform() function to more-itertools
We are talking about a large dataset, so we might need the performance, and we already have the input data as a NumPy array. Listed in this post are two NumPy approaches.
Approach #1
Here's one approach using np.unique to get the row indices separating the groups and then using a dictionary comprehension to build the output dictionary -
unq, idx = np.unique(x[:,0], return_index=1)
idx1 = np.r_[idx,x.shape[0]]
dict_out = {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}
This assumes the first column is sorted, as suggested by the question title - ...repeated value in one column. If that's not the case, we need to use x[:,0].argsort() to sort the rows of x first and then proceed, as sketched below.
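A one-line sketch of that pre-sorting step (assuming x is the input array from the question):
x = x[x[:, 0].argsort()]  # sort rows by the first column before splitting into groups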
Sample input, output -
In [41]: x
Out[41]:
array([[100, 14, 12, 15],
       [100, 21, 16, 11],
       [100, 19, 10, 13],
       [160, 24, 15, 12],
       [160, 43, 12, 65],
       [160, 17, 53, 23],
       [300, 15, 17, 11],
       [300, 66, 23, 12],
       [300, 44, 70, 19]])
In [42]: dict_out
Out[42]:
{100: array([[14, 12, 15],
        [21, 16, 11],
        [19, 10, 13]]),
 160: array([[24, 15, 12],
        [43, 12, 65],
        [17, 53, 23]]),
 300: array([[15, 17, 11],
        [66, 23, 12],
        [44, 70, 19]])}
Approach #2
Here's another getting rid of np.unique for further performance boost -
idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
dict_out = {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
Runtime test
Approaches -
# @COLDSPEED's soln
from collections import defaultdict
def defaultdict_app(x):
    data = defaultdict(list)
    for l in x:
        data[l[0]].append(l[1:])
    return data

# @David Z's soln-1
import pandas as pd
def pandas_groupby_app(x):
    df = pd.DataFrame(x)
    return {key: group.iloc[:,1:] for key, group in df.groupby(0)}

# @David Z's soln-2
import itertools as it
def groupby_app(x):
    return {key: list(map(list, group)) for key, group in \
            it.groupby(x, lambda row: row[0])}

# Proposed in this post
def numpy_app1(x):
    unq, idx = np.unique(x[:,0], return_index=1)
    idx1 = np.r_[idx,x.shape[0]]
    return {unq[i]:x[idx1[i]:idx1[i+1],1:] for i in range(len(unq))}

# Proposed in this post
def numpy_app2(x):
    idx1 = np.concatenate(([0],np.flatnonzero(x[1:,0] != x[:-1,0])+1, [x.shape[0]]))
    return {x[i,0]:x[i:j,1:] for i,j in zip(idx1[:-1], idx1[1:])}
Timings -
In [84]: x = np.random.randint(0,100,(10000,4))
In [85]: x[:,0].sort()
In [86]: %timeit defaultdict_app(x)
...: %timeit pandas_groupby_app(x)
...: %timeit groupby_app(x)
...: %timeit numpy_app1(x)
...: %timeit numpy_app2(x)
...:
100 loops, best of 3: 4.43 ms per loop
100 loops, best of 3: 15 ms per loop
100 loops, best of 3: 12.1 ms per loop
1000 loops, best of 3: 310 µs per loop
10000 loops, best of 3: 75.6 µs per loop
You can group your data with collections.defaultdict and a loop.
from collections import defaultdict
data = defaultdict(list)
for l in x:
    data[l[0]].append(l[1:])
print(dict(data))
Output:
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]],
160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]],
300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}
I think you want something like this (after edit):
ls_dict = {}
for ls in x:
    key = ls[0]
    if key in ls_dict:
        value = ls[1:]
        ls_dict[key].append(value)
    else:
        value = [ls[1:]]
        ls_dict[key] = value
print(ls_dict)
Output:
{100: [[14, 12, 15], [21, 16, 11], [19, 10, 13]], 160: [[24, 15, 12], [43, 12, 65], [17, 53, 23]], 300: [[15, 17, 11], [66, 23, 12], [44, 70, 19]]}

How to reindex columns MultiIndex of a Pandas Dataframe?

I have a Pandas DataFrame with a MultiIndex on the columns (let's say 3 levels):
MultiIndex(levels=[['BA-10.0', 'BA-2.5', ..., 'p'], ['41B004', '41B005', ..., 'T1M003', 'T1M011'], [25, 26, ..., 276, 277]],
           labels=[[0, 0, 0, ..., 18, 19, 19], [4, 5, 6, ..., 14, 12, 13], [24, 33, 47, ..., 114, 107, 113]],
           names=['measurandkey', 'sitekey', 'channelid'])
When I iterate through the first level and yield subsets of the DataFrame:
def cluster(df):
    for key in df.columns.levels[0]:
        yield df[key]

for subdf in cluster(df):
    print(subdf.columns)
The columns index has lost its first level, but the MultiIndex still contains references to all the other keys in the sub-levels, even if they are missing from the subset.
MultiIndex(levels=[['41B004', '41B005', '41B006', '41B008', '41B011', '41MEU1', '41N043', '41R001', '41R002', '41R012', '41WOL1', '41WOL2', 'T1M001', 'T1M003', 'T1M011'], [25, 26, 27, 28, 30, 31, 32, 3, ....
           labels=[[4, 5, 6, 7, 9, 10], [24, 33, 47, 61, 83, 98]],
           names=['sitekey', 'channelid'])
How can I force subdf to have its columns MultiIndex updated with only keys that are present?
def cluster(df):
    for key in df.columns.levels[0]:
        d = df[key]
        d.columns = pd.MultiIndex.from_tuples(d.columns.to_series())
        yield d
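On pandas 0.20 or newer, MultiIndex.remove_unused_levels() does the same cleanup more directly; a small sketch under that assumption:
def cluster(df):
    for key in df.columns.levels[0]:
        d = df[key]
        d.columns = d.columns.remove_unused_levels()
        yield d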

Python: Renumerate elements in multiple lists

Suppose I have a dictionary with lists as follows:
{0: [31, 32, 58, 59], 1: [31, 32, 12, 13, 37, 38], 2: [12, 13]}
I am trying to obtain the following one from it:
{0: [1, 2, 3, 4], 1: [1, 2, 5, 6, 7, 8], 2: [5, 6]}
So I renumerate all the entries in order of occurrence, skipping those that were already renumerated.
What I have now is a bunch of for loops going back and forth, which works but doesn't look good at all. Could anyone please tell me the way it should be done in Python 2.7?
Thank you
import operator

data = {0: [31, 32, 58, 59], 1: [31, 32, 12, 13, 37, 38], 2: [12, 13]}

# the accumulator is the new dict with renumbered values, combined with a list
# of the values renumbered so far
# item is a (key, value) element out of the original dict
def reductor(acc, item):
    (out, renumbered) = acc
    (key, values) = item
    def remapper(v):
        try:
            # 1-based position of an already-seen value
            x = renumbered.index(v) + 1
        except ValueError:
            renumbered.append(v)
            x = len(renumbered)
        return x
    # transform current values to renumbered values
    out[key] = map(remapper, values)
    # return output and updated list of renumbered values
    return (out, renumbered)

# now reduce the original data
print reduce(reductor, sorted(data.iteritems(), key=operator.itemgetter(0)), ({}, []))
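With the sample dict from the question, this should print something like the following (the trailing list is the order in which values were first seen):
({0: [1, 2, 3, 4], 1: [1, 2, 5, 6, 7, 8], 2: [5, 6]}, [31, 32, 58, 59, 12, 13, 37, 38])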
If you're not worried about memory or speed you can use an intermediate dictionary to map the new values:
a = {0: [31, 32, 58, 59], 1: [31, 32, 12, 13, 37, 38], 2: [12, 13]}
b = {}
c = {}
for key in sorted(a.keys()):
    c[key] = [b.setdefault(val, len(b)+1) for val in a[key]]
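A quick check of the result with the sample data (not part of the original answer):
>>> c
{0: [1, 2, 3, 4], 1: [1, 2, 5, 6, 7, 8], 2: [5, 6]}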
Just use a function like this:
def renumerate(data):
    ids = {}
    def getid(val):
        if val not in ids:
            ids[val] = len(ids) + 1
        return ids[val]
    return {k: map(getid, data[k]) for k in sorted(data.keys())}
Example
>>> data = {0: [31, 32, 58, 59], 1: [31, 32, 12, 13, 37, 38], 2: [12, 13]}
>>> print renumerate(data)
{0: [1, 2, 3, 4], 1: [1, 2, 5, 6, 7, 8], 2: [5, 6]}
data = {0: [31, 32, 58, 59], 1: [31, 32, 12, 13, 37, 38], 2: [12, 13]}
from collections import defaultdict
numbered = defaultdict(lambda: len(numbered)+1)
result = {key: [numbered[v] for v in val] for key, val in sorted(data.iteritems(), key=lambda item: item[0])}
print result
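Running this against the sample data should print (not shown in the original answer):
{0: [1, 2, 3, 4], 1: [1, 2, 5, 6, 7, 8], 2: [5, 6]}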
