PySpark: Convert RDD to column in dataframe - python

I have a Spark DataFrame on which I am calculating the Euclidean distance between each row and a given set of coordinates. I am recreating a structurally similar DataFrame 'df_vector' here to explain better.
from pyspark.ml.feature import VectorAssembler
arr = [[1,2,3], [4,5,6]]
df_example = spark.createDataFrame(arr, ['A','B','C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns],outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
>>> df_vector.dtypes
[('features', 'vector')]
As you can see, the features column is a vector. In practice, I get this vector column as the output of a StandardScaler. Anyway, since I need to calculate the Euclidean distance, I do the following
rdd = df_vector.select('features').rdd.map(lambda r: np.linalg.norm(r-b))
where
b = np.asarray([0.5,1.0,1.5])
I have all the calculations I need, but I need this RDD back as a column in df_vector. How do I go about it?

Instead of creating a new RDD, you could use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
norm_udf = udf(lambda r: float(np.linalg.norm(r - b)), FloatType())
df_vector.withColumn("norm", norm_udf(df_vector.features))
Make sure numpy is available on the worker nodes.

One way to tackle performance issues might be to use mapPartitions. The idea would be, at partition level, to convert the features to an array and calculate the norm on the whole array (thus implicitly using numpy vectorisation), then do some housekeeping to get the form you want. For large datasets this might improve performance:
Here is the function which calculates the norm at partition level:
from pyspark.sql import Row

def getnorm(vectors):
    # convert vectors into a numpy array
    vec_array = np.vstack([v['features'] for v in vectors])
    # calculate the norm of each row against b
    norm = np.linalg.norm(vec_array - b, axis=1)
    # tidy up to get the norm as a column
    output = [Row(features=x, norm=y) for x, y in zip(vec_array.tolist(), norm.tolist())]
    return output
Applying this using mapPartitions gives an RDD of Rows which can then be converted to a DataFrame:
df_vector.rdd.mapPartitions(getnorm).toDF()
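For completeness, a minimal end-to-end sketch under the question's setup (numpy imported as np and b defined as in the question; df_norm is just a name chosen here):
import numpy as np

b = np.asarray([0.5, 1.0, 1.5])
df_norm = df_vector.rdd.mapPartitions(getnorm).toDF()
df_norm.show()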

Related

Python Dataframe Filter data using linear relation

I have a DataFrame with input and output columns that have a linear relation. I want to remove data that does not fit this relation. My actual df is big and has many samples; here I am giving an example.
My code:
import pandas as pd
xdf = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
I have no idea how to proceed.
You can do a linear fit first, then filter out the data whose residuals fall outside a certain threshold.
Sample code below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
# do a linear fit on ip and op
f = np.polyfit(df.ip, df.op, 1)
fl = np.poly1d(f)
# you will have to determine this threshold in some way
threshold = 100
# keep only the rows whose residual is within the threshold
output = df[(df.op - fl(df.ip)).abs() < threshold]
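One way to pick that threshold without hard-coding it (my own suggestion, not part of the original answer) is to base it on the spread of the residuals, e.g. keep points within two standard deviations:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ip': [10, 20, 30, 40], 'op': [105, 195, 500, 410]})
fl = np.poly1d(np.polyfit(df.ip, df.op, 1))

residuals = df.op - fl(df.ip)
threshold = 2 * residuals.std()           # assumption: 2-sigma cutoff
output = df[residuals.abs() < threshold]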
Another way:
You can create a boolean mask checking whether the ratio op/ip is less than its mean value:
m=xdf.eval("op/ip").lt(xdf.eval("op/ip").mean())
Finally:
import matplotlib.pyplot as plt

out = xdf[m]
plt.scatter(x=out['ip'], y=out['op'])

Creating correlation matrix for multiple combinations of variables

I have a CSV file with 10 columns. I can use pandas to import the dataframe and use the corr() function to output a matrix heatmap. What I want to achieve next is for the code to loop through the dataframe and find high or low correlations between combinations of columns.
For example, the simple correlation matrix looks at:
A:A, A:B, A:C, A:D etc
But I want the code to combine columns in every conceivable way, such as:
AB:A, AB:B, AB:C, AB:D etc
ABC:A, ABC:B, ABC:D etc
And if there are any noticeable correlations between certain combinations, to highlight those.
Is this possible at all? Or are there proprietary applications that can do this?
Thanks
I assume that by "combination" you mean linear combination. You can loop over the columns (not the most elegant way) and use sklearn's linear_model:
import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.DataFrame(np.random.random([10, 10]), columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

for i, col1 in enumerate(df):
    if i > 0:
        # predictors: all columns to the left of col1
        X = df.iloc[:, 0:i]
        for j, col2 in enumerate(df):
            if j >= i:
                # regress col2 on the first i columns and report R^2
                y = df[[col2]]
                regr = linear_model.LinearRegression()
                regr.fit(X, y)
                score = regr.score(X, y)
                print(f'X: {X.columns} y: {y.columns} score:{score}')
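If you literally want every conceivable combination of predictor columns (not just the first i columns), here is a rough sketch using itertools.combinations; the cap of 3 predictors and the 0.8 R^2 cutoff are my own assumptions, not from the question:
import pandas as pd
import numpy as np
from itertools import combinations
from sklearn import linear_model

df = pd.DataFrame(np.random.random([10, 10]), columns=list('ABCDEFGHIJ'))

for size in range(1, 4):  # assumption: cap the number of predictors at 3 to keep this tractable
    for predictors in combinations(df.columns, size):
        for target in df.columns:
            if target in predictors:
                continue
            X, y = df[list(predictors)], df[target]
            regr = linear_model.LinearRegression().fit(X, y)
            score = regr.score(X, y)
            if score > 0.8:  # assumption: only highlight "noticeable" fits
                print(f'{"".join(predictors)}:{target} R^2={score:.3f}')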

Value Error and problem with shape during creation of Data Frame in Python?

I would like to combine the coefficients from a Linear Regression model with values from the test dataset; nevertheless, I get the error below. My code is below; do you know where the problem is and what I can do?
I need something like the output below, where the index comes from X.columns and the numbers come from LR.coef_.
In the following example, values is a dataframe which has the same shape as your LR.coef_. To use its first row as column values in another dataframe, you can create a dict and pass that dict to pandas.DataFrame().
import pandas as pd
import numpy as np
values = pd.DataFrame(np.zeros((1, 689)))
X = pd.DataFrame(np.zeros((2096, 689)))
frame = { 'coefficient': values.iloc[0] }
coefficient = pd.DataFrame(frame, index=X.columns)
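Translated to the question's setting, the same pattern looks roughly like this (a self-contained sketch with toy data; in your case X, y, and the fitted model come from your own pipeline):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-ins for the question's feature matrix X and target y
X = pd.DataFrame(np.random.random((20, 3)), columns=['a', 'b', 'c'])
y = np.random.random(20)

LR = LinearRegression().fit(X, y)

# one row per feature, indexed by the feature names
coefficient = pd.DataFrame({'coefficient': np.ravel(LR.coef_)}, index=X.columns)
print(coefficient)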

Parallel programming approach to solve pandas problems

I have a dataframe of the following format.
df
A B Target
5 4 3
1 3 4
I am finding the correlation of each column (except Target) with the Target column using pd.DataFrame(df.corr().iloc[:-1,-1]).
But the issue is that the size of my actual dataframe is (216, 72391), which takes at least 30 minutes to process on my system. Is there any way to parallelize it, e.g. using a GPU? I need to compute values of this kind multiple times, so I can't wait 30 minutes each time.
Here, I have tried to implement your operation using numba:
import numpy as np
import pandas as pd
from numba import jit

#
# ------------You can ignore the code starting from here---------
#
# Create a random df with cols_size = 72391 and row_size = 300
df_dict = {}
for i in range(0, 72391):
    df_dict[i] = np.random.randint(100, size=300)
df_dict['target'] = np.random.randint(100, size=300)
df = pd.DataFrame(df_dict)
# ----------Ignore code till here. This is just to generate dummy data-------

# Assume df is your original DataFrame
target_array = df['target'].values
# You can choose to restore this column later,
# but for now we will remove it, since we will
# call df.values and find the correlation of each
# column with target
df.drop(['target'], inplace=True, axis=1)

# This function takes a numpy 2D array and a target array as input.
# The numpy 2D array holds the data of all the columns.
# We find the correlation of each column with the target array.
# The 2D array is passed in transposed, i.e. its shape is (72391, 300),
# while the target array's shape is (300,), so df_values[i] is one column.
def do_stuff(df_values, target_arr):
    # Array to store the result;
    # df_values.shape[0] = 72391, equal to the no. of columns in df
    result = np.random.random(df_values.shape[0])
    # Iterate over each column
    for i in range(0, df_values.shape[0]):
        # Find the correlation of this column with the target column
        result[i] = np.corrcoef(df_values[i], target_arr)[0][1]
    return result

# Compile the function with numba
do_stuff_numba = jit(nopython=True, parallel=True)(do_stuff)

# This contains all the correlations
# (transpose once so the shape passed in is (72391, 300))
result_array = do_stuff_numba(np.transpose(df.values), target_array)
Link to colab notebook.
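As a point of comparison (my own addition, not taken from the answers above), the same per-column correlation with the target can also be vectorised in plain NumPy without an explicit Python loop:
import numpy as np
import pandas as pd

# toy stand-ins with the same orientation as above: rows are samples, columns are features
df = pd.DataFrame(np.random.random((300, 1000)))
target_array = np.random.random(300)

X = df.values                              # shape (n_rows, n_cols)
Xc = X - X.mean(axis=0)                    # centre each column
tc = target_array - target_array.mean()    # centre the target
# Pearson correlation of every column with the target, all at once
corr = (Xc * tc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((tc ** 2).sum()))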
You should take a look at dask. It should be able to do what you want and a lot more.
It parallelizes most of the DataFrame functions.

How to find indices where multiple vectors are all zero

Beginner pySpark question here:
How do I find the indices where all vectors are zero?
After a series of transformations, I have a spark df with ~2.5M rows and a tfidf Sparse Vector of length ~262K. I would like to perform PCA dimensionality reduction to make this data more manageable for multi-layer perceptron model fitting, but pyspark's PCA is limited to a max of 65,535 columns.
+--------------------+
| tfidf_features| df.count() >>> 2.5M
+--------------------+ Example Vector:
|(262144,[1,37,75,...| SparseVector(262144, {7858: 1.7047, 12326: 1.2993, 15207: 0.0953,
|(262144,[0],[0.12...| 24112: 0.452, 40184: 1.7047,...255115: 1.2993, 255507: 1.2993})
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0,6,22,3...|
+--------------------+
Therefore, I would like to delete the indices (columns) of the sparse tfidf vector that are zero for all ~2.5M documents (rows). This will hopefully get me under the 65,535 maximum for PCA.
My plan is to create a UDF that (1) converts the Sparse Vectors to Dense Vectors (or np arrays), (2) searches all vectors to find the indices where every vector is zero, and (3) deletes those indices. However, I am struggling with the second part (finding the indices where all vectors equal zero). Here's where I am so far, but I think my plan of attack is way too time-consuming and not very pythonic (especially for such a big dataset):
import numpy as np

row_count = df.count()

def find_zero_indices(df):
    vectors = df.select('tfidf_features').take(row_count)[0]
    zero_indices = []
    to_delete = []
    for vec in vectors:
        vec = vec.toArray()
        for value in vec:
            if value.nonzero():
                zero_indices.append(vec.index(value))
    for value in zero_indices:
        if zero_indices.count(value) == row_count:
            to_delete.append(value)
    return to_delete
Any advice or help appreciated!
If anything, it makes more sense to find the indices which should be preserved:
from pyspark.ml.linalg import DenseVector, SparseVector
from pyspark.sql.functions import explode, udf
from operator import itemgetter

@udf("array<integer>")
def indices(v):
    if isinstance(v, DenseVector):
        return [i for i in range(len(v))]
    if isinstance(v, SparseVector):
        return v.indices.tolist()
    return []
indices_list = (df
    .select(explode(indices("tfidf_features")))
    .distinct()
    .rdd.map(itemgetter(0))
    .collect())
and use VectorSlicer:
from pyspark.ml.feature import VectorSlicer
slicer = VectorSlicer(
    inputCol="tfidf_features",
    outputCol="tfidf_features_subset",
    indices=indices_list)
slicer.transform(df)
However, in practice I would recommend using a fixed-size vector, either with HashingTF:
HashingTF(inputCol="words", outputCol="tfidf_features", numFeatures=65535)
or CountVectorizer:
CountVectorizer(inputCol="words", outputCol="vectorizer_features",
                vocabSize=65535)
In both cases you can combine it with StopWordsRemover.
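For illustration, a hedged sketch of such a pipeline (column names like "raw_words" are placeholders, not from the question; df is assumed to have an array<string> column of tokens named raw_words):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

remover = StopWordsRemover(inputCol="raw_words", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=65535)
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")

pipeline = Pipeline(stages=[remover, hashing_tf, idf])
tfidf_df = pipeline.fit(df).transform(df)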
