First, I tried everything in the link below to fix my error, but none of it worked.
How to convert RDD of dense vector into DataFrame in pyspark?
I am trying to convert a dense vector into a dataframe (preferably Spark) along with column names, and I am running into issues.
My column in the Spark dataframe is a vector that was created using VectorAssembler, and I now want to convert it back to a dataframe, as I would like to create plots for some of the variables in the vector.
Approach 1:
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.linalg import Vectors
temp = output.select("all_features")
temp.rdd.map(
    lambda row: (DenseVector(row[0].toArray()))
).toDF()
Below is the error:
TypeError: not supported type: <type 'numpy.ndarray'>
Approach 2:
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf
from pyspark.ml.linalg import *
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = output.withColumn("all_features", as_ml("all_features"))
result.head(5)
Error:
AttributeError: 'numpy.ndarray' object has no attribute 'asML'
I also tried converting the dataframe into a Pandas dataframe, but after that I am not able to split the values into separate columns.
Approach 3:
import pandas as pd

pandas_df = temp.toPandas()
pandas_df1 = pd.DataFrame(pandas_df.all_features.values.tolist())
The above code runs fine, but I still end up with only one column in my dataframe, with all the values in a single comma-separated list.
Any help is greatly appreciated!
EDIT:
Here is what my temp dataframe looks like. It just has one column, all_features. I am trying to create a dataframe that splits all of these values into separate columns (all_features is a vector that was created from 200 columns).
+--------------------+
| all_features|
+--------------------+
|[0.01193689934723...|
|[0.04774759738895...|
|[0.0,0.0,0.194417...|
|[0.02387379869447...|
|[1.89796699621085...|
+--------------------+
only showing top 5 rows
The expected output is a dataframe with all 200 values separated out into their own columns:
+----------------------------+
| col1| col2| col3|...
+----------------------------+
|0.01193689934723|0.0|0.5049431301173817...
|0.04774759738895|0.0|0.1657316216149636...
|0.0|0.0|7.213126372469...
|0.02387379869447|0.0|0.1866693496827619|...
|1.89796699621085|0.0|0.3192169213385746|...
+----------------------------+
only showing top 5 rows
Here is what my Pandas DF output looks like:
0
0 [0.011936899347238104, 0.0, 0.5049431301173817...
1 [0.047747597388952415, 0.0, 0.1657316216149636...
2 [0.0, 0.0, 0.19441761495525278, 7.213126372469...
3 [0.023873798694476207, 0.0, 0.1866693496827619...
4 [1.8979669962108585, 0.0, 0.3192169213385746, ...
Since you want all the features in separate columns (as I gather from your EDIT), the answer you linked to is not your solution.
Try this:
# column_names is your list of the 200 feature column names
temp = temp.rdd.map(lambda x: [float(y) for y in x['all_features']]).toDF(column_names)
EDIT:
Since your temp is originally a dataframe, you can also use this method without converting it to an RDD:
import pyspark.sql.functions as F
from pyspark.sql.types import *
# bind i as a default argument so each UDF extracts its own position from the vector
splits = [F.udf(lambda val, i=i: float(val[i].item()), FloatType()) for i in range(200)]
temp = temp.select(*[s(F.col('all_features')).alias(c) for c, s in zip(column_names, splits)])
temp.show()
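If you are on Spark 3.0 or later, a simpler route is the built-in pyspark.ml.functions.vector_to_array, which avoids the per-position UDFs entirely. A minimal sketch, assuming the same temp dataframe and column_names list as above (feat_arr is just an illustrative intermediate name):
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

# turn the vector column into a plain array column, then pick out each position by index
arr_df = temp.withColumn("feat_arr", vector_to_array("all_features"))
temp_split = arr_df.select(*[F.col("feat_arr")[i].alias(c) for i, c in enumerate(column_names)])
temp_split.show()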
I am new to PySpark and am trying to append a numpy array to a dataframe.
I have a numpy array as:
print(category_dimension_vectors)
[[ 5.19333403e-01 -3.36615935e-01 -6.93262848e-02 2.37293671e-01]
[ 4.45220874e-01 1.30108798e-01 1.12913839e-01 1.87161517e-01]]
I would like to append this to a pyspark dataframe as a new column, where each row of the array is stored in its corresponding row of the dataframe.
The number of rows in the array and the number of rows in the dataframe are equal.
This is what I have tried first:
arr_rows = udf(lambda row: category_dimension_vectors[row,:], ArrayType(DecimalType()))
df = df.withColumn("category_dimensions_reduced", arr_rows(df))
Getting the error:
TypeError: Invalid argument, not a string or column
Then I tried:
df = df.withColumn("category_dimensions_reduced", lit(category_dimension_vectors))
But got the error:
org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE]
What I am trying to achieve is:
+----+----+-----------------------------------------------------------------+
| a| b|category_dimension_vectors |
+----+----+-----------------------------------------------------------------+
|foo | 1|[5.19333403e-01,-3.36615935e-01,-6.93262848e-02,2.37293671e-01] |
|bar | 2|[4.45220874e-01,1.30108798e-01,1.12913839e-01,1.87161517e-01] |
+----+----+-----------------------------------------------------------------+
How should I approach this problem?
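One possible approach, sketched under the assumption that the rows of the array line up with the rows of the dataframe in order: manufacture a row index on both sides and join on it. The names idx, arr_df and result below are illustrative, not from the original code.
from pyspark.sql import Row

# index every dataframe row via the RDD API (zipWithIndex follows the dataframe's current row order)
df_indexed = df.rdd.zipWithIndex().map(
    lambda pair: Row(idx=pair[1], **pair[0].asDict())
).toDF()

# build a small dataframe out of the numpy array with the same index
arr_rows = [
    Row(idx=i, category_dimensions_reduced=[float(v) for v in row])
    for i, row in enumerate(category_dimension_vectors)
]
arr_df = spark.createDataFrame(arr_rows)

result = df_indexed.join(arr_df, on="idx").drop("idx")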
I have the following dataframe. I would like to concatenate the lat and lon into a list; mmsi is similar to an ID (it is unique).
+---------+--------------------+--------------------+
| mmsi| lat| lon|
+---------+--------------------+--------------------+
|255801480|[47.1018366666666...|[-5.3017783333333...|
|304182000|[44.6343033333333...|[-63.564803333333...|
|304682000|[41.1936, 41.1715...|[-8.7716, -8.7514...|
|305930000|[49.5221333333333...|[-3.6310166666666...|
|306216000|[42.8185133333333...|[-29.853155, -29....|
|477514400|[47.17205, 47.165...|[-58.6317, -58.60...|
+---------+--------------------+--------------------+
Therefore, I would like to concatenate the lat and lon arrays along axis = 1; that is, I would like to end up with a list of lists in a separate column, like:
[[47.1018366666666, -5.3017783333333], ... ]
How could that be done on a pyspark dataframe? I have tried concat, but that returns:
[47.1018366666666, 44.6343033333333, ..., -5.3017783333333, -63.564803333333, ...]
Any help is much appreciated!
Starting with Spark version 2.4, you can use the built-in function arrays_zip.
from pyspark.sql.functions import arrays_zip
df.withColumn('zipped_lat_lon',arrays_zip(df.lat,df.lon)).show()
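Note that arrays_zip gives you an array of structs. If you specifically want the list-of-lists shape from the question (an array of arrays), one way, sketched for Spark 2.4+ and assuming lat and lon are already array columns, is to rewrap each struct with a SQL transform (zipped and lat_lon_pairs are illustrative names):
from pyspark.sql.functions import arrays_zip, expr

df_pairs = (df
    .withColumn("zipped", arrays_zip("lat", "lon"))
    .withColumn("lat_lon_pairs", expr("transform(zipped, p -> array(p.lat, p.lon))"))
    .drop("zipped"))
df_pairs.select("mmsi", "lat_lon_pairs").show(truncate=False)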
Can anyone give me some guidance on pivot tables, using a Spark dataframe in Python?
I am getting the following error: Column is not iterable
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation operation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
With specified column values - df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
Without specified column values (more concise but less efficient) - df.groupBy("year").pivot("course").sum("earnings")
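In PySpark the same two calls look like this; a sketch assuming a dataframe with year, course and earnings columns, and note that the pivot values are passed as a plain Python list rather than a Seq:
# with the distinct values specified up front
df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")

# letting Spark discover the distinct values (more concise but less efficient)
df.groupBy("year").pivot("course").sum("earnings")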
You are proceeding in the right direction. Here is sample working code (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
| 2019| 3| 2|
+--------+---+---+
//pivot.csv
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
I have a Pandas dataframe. I first tried to join two columns containing string values into a list, and then, using zip, I joined each element of the list with '_'. My data set looks like this:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I want to join these two columns into a third column, as shown below, for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in Python using the code below, but the dataframe is quite large and it takes a very long time to run for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time figuring out how to replicate the Pandas logic with equivalent PySpark functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark using the code below:
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip the elements of the arrays in the two columns and join each pair with '_'. How can I use zip for this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
.toDF("column_1", "column_2")
.withColumn("column_1", split("column_1", "\s*,\s*"))
.withColumn("column_2", split("column_2", "\s*,\s*")))
You can just apply it on the result
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now to combine the results you can transform (How to use transform higher-order function?, TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr

df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
The higher-order function transform and arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only the zip_with function, zipping and concatenating in a single step:
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).
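A minimal end-to-end sketch, assuming the same column_1/column_2 string columns as in the question, split into arrays first:
from pyspark.sql.functions import expr, split

df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')], ['column_1', 'column_2'])
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))

# zip_with concatenates the two arrays element-wise in a single pass
df = df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
df.select("column_3").show(truncate=False)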
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))
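Applied to the same split column_1 and column_2 arrays as in the answers above, this should give the same result (expected output, not a captured run):
df.select("column_3").show(truncate=False)
+---------------------------+
|column_3                   |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+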
I have a Spark dataframe that looks as follows:
+-----------+-------------------+
| ID | features |
+-----------+-------------------+
| 18156431|(5,[0,1,4],[1,1,1])|
| 20260831|(5,[0,4,5],[2,1,1])|
| 91859831|(5,[0,1],[1,3]) |
| 206186631|(5,[3,4,5],[1,5]) |
| 223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+
In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When doing this, the features column is saved as a text column, for example: "(5,[0,1,4],[1,1,1])".
When importing it again in Spark, the column stays a string, as you would expect. How can I convert the column back to (sparse) vector format?
This is not particularly efficient due to UDF overhead (it would be a good idea to use a format that preserves types), but you can do something like this:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
df = sc.parallelize([
(18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
Please note this doesn't port directly to 2.0.0+ and ML vectors. Since ML vectors don't provide a parse method, you'd have to parse with MLlib and convert with asML; the UDF's return type then has to be the ML VectorUDT:
from pyspark.ml.linalg import VectorUDT as MLVectorUDT

parse = udf(lambda s: Vectors.parse(s).asML(), MLVectorUDT())
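A quick sanity check on the result, reusing the df defined above (the schema shown is what I'd expect, not a captured run):
parsed = df.withColumn("features", parse("features"))
parsed.printSchema()
root
 |-- id: long (nullable = true)
 |-- features: vector (nullable = true)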