How to Append Pyspark Dataframe with Numpy Array? - python

I am new to PySpark and am trying to append a dataframe with a numpy array.
I have a numpy array as:
print(category_dimension_vectors)
[[ 5.19333403e-01 -3.36615935e-01 -6.93262848e-02 2.37293671e-01]
[ 4.45220874e-01 1.30108798e-01 1.12913839e-01 1.87161517e-01]]
I would like to append this to a pyspark dataframe as a new column, where each row of the array is stored in its corresponding row of the dataframe.
The number of rows in the array and the number of rows in the dataframe are equal.
This is what I have tried first:
arr_rows = udf(lambda row: category_dimension_vectors[row,:], ArrayType(DecimalType()))
df = df.withColumn("category_dimensions_reduced", arr_rows(df))
Getting the error:
TypeError: Invalid argument, not a string or column
Then I have tried
df = df.withColumn("category_dimensions_reduced", lit(category_dimension_vectors))
But got the error:
org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE]
What I try to achieve is:
+----+----+-----------------------------------------------------------------+
| a| b|category_dimension_vectors |
+----+----+-----------------------------------------------------------------+
|foo | 1|[5.19333403e-01,-3.36615935e-01,-6.93262848e-02,2.37293671e-01] |
|bar | 2|[4.45220874e-01,1.30108798e-01,1.12913839e-01,1.87161517e-01] |
+----+----+-----------------------------------------------------------------+
How should I approach this problem?
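One way to approach it (a sketch rather than a definitive answer: it assumes an active spark session and that a stable row order can be imposed on df via an index) is to give both the dataframe and the numpy array an explicit row index and join on it:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# attach a 0-based positional index to the existing dataframe
w = Window.orderBy(F.monotonically_increasing_id())
df_indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)

# turn each numpy row into an (index, list-of-floats) record
vec_rows = [(i, [float(x) for x in row]) for i, row in enumerate(category_dimension_vectors)]
vec_df = spark.createDataFrame(vec_rows, ["row_idx", "category_dimension_vectors"])

# join on the shared index and drop it
result = df_indexed.join(vec_df, "row_idx").drop("row_idx")
result.show(truncate=False)
The join sidesteps both errors above: a udf accepts columns rather than a whole dataframe, and lit cannot build a literal column from a 2-D numpy array.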

Related

How can I convert a row from a dataframe in pyspark to a column but keep the column names? - pyspark or python

I have an array that is made up of several arrays.
Zip the list and then call the dataframe constructor:
df = spark.createDataFrame(zip(*all_data), cols)
df.show(truncate=False)
+-----------------------------+-----------+
|name |chromossome|
+-----------------------------+-----------+
|NM_019112.4(ABCA7):c.161-2A>T|19p13.3 |
|CCL2, 767C-G |17q11.2-q12|
+-----------------------------+-----------+
Or with zip_longest:
from itertools import zip_longest
df = spark.createDataFrame(zip_longest(*all_data,fillvalue=''),cols)
df.show()
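For reference, a self-contained sketch of the zip approach (the all_data and cols values here are assumptions reconstructed from the output above, and an active spark session is assumed):
all_data = [
    ["NM_019112.4(ABCA7):c.161-2A>T", "CCL2, 767C-G"],  # name values
    ["19p13.3", "17q11.2-q12"],                         # chromossome values
]
cols = ["name", "chromossome"]

# zip(*all_data) transposes the per-column lists into per-row tuples
df = spark.createDataFrame(zip(*all_data), cols)
df.show(truncate=False)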

Merging two rows into one based on common field

I have dataframe with the following data:
+----------+------------+-------------+---------------+----------+
|id |name |predicted |actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| null|100.10023 |2020-01-10|
| null| NirPost| 57145|null |2020-01-10|
+----------+------------+-------------+---------------+----------+
I want to merge these two rows into one, based on the name. This df is the result of a query which I've restricted to one company and a single day. In the real dataset, there are ~70 companies with daily data. I want to rewrite this data into a new table as single rows.
This is the output I'd like:
+----------+------------+-------------+---------------+----------+
|id |name |predicted | actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| 57145 |100.10023 |2020-01-10|
+----------+------------+-------------+---------------+----------+
I've tried this:
df.replace('null','').groupby('name',as_index=False).agg(''.join)
However, this outputs my original df but with NaN instead of null.
`df.dtypes`:
id float64
name object
predicted float64
actual float64
yyyy_mm_dd object
dtype: object
How about explicitly passing all the columns to the groupby aggregation with max, so that it eliminates the null values?
import pandas as pd
import numpy as np
data = {'id':[215,np.nan],'name':['nirpost','nirpost'],'predicted':[np.nan,57145],'actual':[100.12,np.nan],'yyyy_mm_dd':['2020-01-10','2020-01-10']}
df = pd.DataFrame(data)
df = df.groupby('name').agg({'id':'max','predicted':'max','actual':'max','yyyy_mm_dd':'max'}).reset_index()
print(df)
Returns:
name id predicted actual yyyy_mm_dd
0 nirpost 215.0 57145.0 100.12 2020-01-10
Of course, since you have more data, you should probably consider adding something else to your groupby so as not to drop too many rows, but for the example data you provide, I believe this is a way to solve the issue.
EDIT:
If all columns are being named as original_column_name_max, then you can simply use this:
df.columns = [x[:-4] for x in list(df)]
With the list comprehension you create a list that strips the last 4 characters (that is, _max) from each value in list(df), which is the list of column names. Finally, you assign it back with df.columns =.
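As a small illustration (the suffixed column names here are hypothetical, not the exact frame above), stripping the suffix only where it is present avoids truncating unsuffixed columns such as name:
import pandas as pd

# hypothetical aggregated frame whose value columns were suffixed with '_max'
agg = pd.DataFrame({'name': ['nirpost'], 'id_max': [215.0], 'predicted_max': [57145.0]})

# drop a trailing '_max' where present, leave other names untouched
agg.columns = [c[:-4] if c.endswith('_max') else c for c in agg.columns]
print(agg.columns.tolist())  # ['name', 'id', 'predicted']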

How to Concat 2 column of ArrayType on axis = 1 in Pyspark dataframe?

I have the following dataframe:
I would like to concatenate the lat and lon into a list, where mmsi is similar to an ID (this is unique).
+---------+--------------------+--------------------+
| mmsi| lat| lon|
+---------+--------------------+--------------------+
|255801480|[47.1018366666666...|[-5.3017783333333...|
|304182000|[44.6343033333333...|[-63.564803333333...|
|304682000|[41.1936, 41.1715...|[-8.7716, -8.7514...|
|305930000|[49.5221333333333...|[-3.6310166666666...|
|306216000|[42.8185133333333...|[-29.853155, -29....|
|477514400|[47.17205, 47.165...|[-58.6317, -58.60...|
Therefore, I would like to concatenate the lat and lon arrays along axis = 1, that is, I would like to end up with a list of lists in a separate column, like:
[[47.1018366666666, -5.3017783333333], ... ]
How could that be done in a pyspark dataframe? I have tried concat, but that returns:
[47.1018366666666, 44.6343033333333, ..., -5.3017783333333, -63.564803333333, ...]
Any help is much appreciated!
Starting with Spark 2.4, you can use the built-in function arrays_zip.
from pyspark.sql.functions import arrays_zip
df.withColumn('zipped_lat_lon',arrays_zip(df.lat,df.lon)).show()
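A self-contained sketch of what that produces (the coordinates are made up, and an active spark session is assumed); the second select is an optional extra step, using a Spark 2.4+ SQL higher-order function, for the case where you want plain [lat, lon] arrays instead of structs:
from pyspark.sql import functions as F

demo = spark.createDataFrame(
    [(255801480, [47.10, 47.09], [-5.30, -5.29])],
    ["mmsi", "lat", "lon"],
)

# arrays_zip pairs the i-th lat with the i-th lon as an array of structs
demo.withColumn("zipped_lat_lon", F.arrays_zip("lat", "lon")).show(truncate=False)

# optional: turn each struct into a two-element array
demo.withColumn(
    "lat_lon_pairs",
    F.expr("transform(arrays_zip(lat, lon), x -> array(x.lat, x.lon))"),
).show(truncate=False)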

Convert a Dense Vector to a Dataframe using Pyspark

First, I tried everything in the link below to fix my error, but none of it worked.
How to convert RDD of dense vector into DataFrame in pyspark?
I am trying to convert a dense vector into a dataframe (Spark preferably) along with column names, and am running into issues.
My column in the Spark dataframe is a vector that was created using VectorAssembler, and I now want to convert it back to a dataframe, as I would like to create plots of some of the variables in the vector.
Approach 1:
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.linalg import Vectors
temp = output.select("all_features")
temp.rdd.map(
    lambda row: (DenseVector(row[0].toArray()))
).toDF()
Below is the Error
TypeError: not supported type: <type 'numpy.ndarray'>
Approach 2:
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf
from pyspark.ml.linalg import *
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = output.withColumn("all_features", as_ml("all_features"))
result.head(5)
Error:
AttributeError: 'numpy.ndarray' object has no attribute 'asML'
I also tried to convert the dataframe into a Pandas dataframe, but after that I am not able to split the values into separate columns.
Approach 3:
pandas_df=temp.toPandas()
pandas_df1=pd.DataFrame(pandas_df.all_features.values.tolist())
The above code runs fine, but I still have only one column in my dataframe, with all the values as a comma-separated list.
Any help is greatly appreciated!
EDIT:
Here is what my temp dataframe looks like. It just has one column, all_features. I am trying to create a dataframe that splits all of these values into separate columns (all_features is a vector that was created using 200 columns).
+--------------------+
| all_features|
+--------------------+
|[0.01193689934723...|
|[0.04774759738895...|
|[0.0,0.0,0.194417...|
|[0.02387379869447...|
|[1.89796699621085...|
+--------------------+
only showing top 5 rows
Expected output is a dataframe with all 200 columns separated out in a dataframe
+----------------------------+
| col1| col2| col3|...
+----------------------------+
|0.01193689934723|0.0|0.5049431301173817...
|0.04774759738895|0.0|0.1657316216149636...
|0.0|0.0|7.213126372469...
|0.02387379869447|0.0|0.1866693496827619|...
|1.89796699621085|0.0|0.3192169213385746|...
+----------------------------+
only showing top 5 rows
Here is what my Pandas DF output looks like:
0
0 [0.011936899347238104, 0.0, 0.5049431301173817...
1 [0.047747597388952415, 0.0, 0.1657316216149636...
2 [0.0, 0.0, 0.19441761495525278, 7.213126372469...
3 [0.023873798694476207, 0.0, 0.1866693496827619...
4 [1.8979669962108585, 0.0, 0.3192169213385746, ...
Since you want all the features in separate columns (as I gathered from your EDIT), the answer you linked to is not your solution.
Try this,
# column_names is your list of 200 column names
temp = temp.rdd.map(lambda x: [float(y) for y in x['all_features']]).toDF(column_names)
EDIT:
Since your temp is originally a dataframe, you can also use this method without converting it to an rdd:
import pyspark.sql.functions as F
from pyspark.sql.types import *
# bind i as a default argument so each UDF reads its own index (otherwise every lambda would use the final value of i)
splits = [F.udf(lambda val, i=i: float(val[i].item()), FloatType()) for i in range(200)]
temp = temp.select(*[s(F.col('all_features')).alias(c) for c, s in zip(column_names, splits)])
temp.show()
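Separately, if you happen to be on Spark 3.0 or later, pyspark.ml.functions.vector_to_array gives a UDF-free route (a sketch under that assumption, reusing the column_names list and the 200-feature count from the question):
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array

# convert the ML vector column into an array column, then index it out into named columns
arr = temp.withColumn("features_arr", vector_to_array("all_features"))
temp_wide = arr.select(*[F.col("features_arr")[i].alias(column_names[i]) for i in range(200)])
temp_wide.show()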

Transform string column to vector column Spark DataFrames

I have a Spark dataframe that looks as follows:
+-----------+-------------------+
| ID | features |
+-----------+-------------------+
| 18156431|(5,[0,1,4],[1,1,1])|
| 20260831|(5,[0,4,5],[2,1,1])|
| 91859831|(5,[0,1],[1,3]) |
| 206186631|(5,[3,4,5],[1,5]) |
| 223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+
In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When doing this, the features column is saved as a text column, for example "(5,[0,1,4],[1,1,1])".
When importing it again in Spark, the column stays a string, as you could expect. How can I convert the column back to (sparse) vector format?
This is not particularly efficient due to UDF overhead (it would be a good idea to use a format that preserves types), but you can do something like this:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])
parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
Please note this doesn't port directly to 2.0.0+ and the ML Vector type. Since ML vectors don't provide a parse method, you'd have to parse with MLlib and call asML (importing VectorUDT from pyspark.ml.linalg in that case):
parse = udf(lambda s: Vectors.parse(s).asML(), VectorUDT())
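Spelled out, the 2.x version would look roughly like this (a sketch; note that VectorUDT now has to come from pyspark.ml.linalg so the UDF's declared return type matches the ML vector produced by asML):
from pyspark.mllib.linalg import Vectors
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf

# parse the MLlib string representation, then convert to the ML vector type
parse = udf(lambda s: Vectors.parse(s).asML() if s is not None else None, VectorUDT())
df.select(parse("features").alias("features"))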
