I have a Pandas dataframe with two columns of string values. I first split each column into a list and then, using zip, joined the corresponding elements of the two lists with '_'. My data set looks like this:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I want to combine these two columns into a third column, as shown below, for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in Python using the code below, but the dataframe is quite large and it takes a very long time to run over the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time working out how to replace the Pandas functions with their PySpark equivalents. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark using the code below:
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import col, split
# withColumn returns a new DataFrame, so assign the result back
crash = crash.withColumn("column_1",
    split(col("column_1"), r",\s*").cast(ArrayType(StringType()))
)
crash = crash.withColumn("column_2",
    split(col("column_2"), r",\s*").cast(ArrayType(StringType()))
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
    .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
    .toDF("column_1", "column_2")
    .withColumn("column_1", split("column_1", r"\s*,\s*"))
    .withColumn("column_2", split("column_2", r"\s*,\s*")))
you can just apply it to the result:
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
    "zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now, to combine the results, you can use transform (see also: How to use transform higher-order function?, TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr
df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note:
The higher-order function transform and the function arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done with the zip_with function alone, which zips and concatenates at the same time:
from pyspark.sql.functions import expr
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
The higher-order function takes 2 arrays to merge, element-wise, using a lambda function (x, y) -> concat(x, '_', y).
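To make the element-wise behaviour concrete, here is a plain-Python model of zip_with (no Spark involved). It is only a sketch under stated assumptions: the helper names are illustrative, and it assumes Spark's documented behaviour that the shorter array is padded with nulls, that concat returns null when any input is null, and that concat_ws skips null inputs.

```python
from itertools import zip_longest


def concat(*parts):
    """Model of SQL concat: the result is null (None) if any input is null."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)


def concat_ws(sep, *parts):
    """Model of SQL concat_ws: null inputs are skipped."""
    return sep.join(p for p in parts if p is not None)


def zip_with(left, right, func):
    """Model of zip_with: pad the shorter list with None, then apply func pairwise."""
    return [func(x, y) for x, y in zip_longest(left, right)]


names = ["abc", "def", "ghi"]
values = ["1.0", "2.0", "3.0"]

# Arrays of equal length: the result asked for in the question
equal_length = zip_with(names, values, lambda x, y: concat(x, "_", y))

# Arrays of different length: the two concatenation functions behave differently
ragged_concat = zip_with(names, values[:2], lambda x, y: concat(x, "_", y))
ragged_concat_ws = zip_with(names, values[:2], lambda x, y: concat_ws("_", x, y))
```

If the two columns can differ in length, the choice between concat and concat_ws therefore decides whether the unmatched element becomes null or is kept on its own.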
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+
## I hope this is what your dataframe looks like
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf, ArrayType(StringType()))
df = df.withColumn('col3', udf1(F.split(df.col1, ','), F.split(df.col2, ',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
Since Spark 3.1, pyspark.sql.functions.zip_with() accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))
Can anyone give me some guidance on pivot tables, using a Spark dataframe in Python?
I am getting the following error: Column is not iterable
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
Specifying the column values: df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
Without specifying the column values (more concise but less efficient): df.groupBy("year").pivot("course").sum("earnings")
You are heading in the right direction. Here is a working sample (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
| 2019| 3| 2|
+--------+---+---+
Contents of pivot.csv:
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
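To see what groupBy(...).pivot(...).sum(...) computes, here is a plain-Python model (no Spark involved) applied to the pivot.csv rows above. It is only a sketch: the function name pivot_sum and the use of None for an empty cell are illustrative assumptions, not Spark's implementation.

```python
from collections import defaultdict

# The rows of pivot.csv as (thisyear, month, value1) tuples
rows = [
    (2019, "JAN", 1),
    (2019, "JAN", 1),
    (2019, "FEB", 1),
    (2019, "JAN", 1),
    (2019, "FEB", 1),
]


def pivot_sum(rows, pivot_values):
    """Group by the first field, make one column per pivot value, sum the third field."""
    table = defaultdict(lambda: dict.fromkeys(pivot_values))
    for year, month, value in rows:
        cells = table[year]  # every group gets a row, even without a matching month
        if month in cells:
            cells[month] = (cells[month] or 0) + value
    return dict(table)


pivoted = pivot_sum(rows, ["JAN", "FEB"])
```

Here pivoted holds one entry per year with a JAN total of 3 and a FEB total of 2, the same figures as in the pdf.show(10) output.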
First, I tried everything in the question linked below to fix my error, but none of it worked.
How to convert RDD of dense vector into DataFrame in pyspark?
I am trying to convert a dense vector into a dataframe (preferably a Spark one) with column names, and I am running into issues.
The column in my Spark dataframe is a vector that was created using VectorAssembler. I now want to convert it back to a dataframe, as I would like to create plots of some of the variables in the vector.
Approach 1:
from pyspark.ml.linalg import SparseVector, DenseVector
from pyspark.ml.linalg import Vectors
temp=output.select("all_features")
temp.rdd.map(
    lambda row: (DenseVector(row[0].toArray()))
).toDF()
Below is the error:
TypeError: not supported type: <type 'numpy.ndarray'>
Approach 2:
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import udf
from pyspark.ml.linalg import *
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = output.withColumn("all_features", as_ml("all_features"))
result.head(5)
Error:
AttributeError: 'numpy.ndarray' object has no attribute 'asML'
I also tried to convert the dataframe into a Pandas dataframe, but after that I am not able to split the values into separate columns.
Approach 3:
pandas_df=temp.toPandas()
pandas_df1=pd.DataFrame(pandas_df.all_features.values.tolist())
The above code runs fine, but I still have only one column in my dataframe, with all the values in a comma-separated list.
Any help is greatly appreciated!
EDIT:
Here is what my temp dataframe looks like. It has just one column, all_features. I am trying to create a dataframe that splits all of these values into separate columns (all_features is a vector that was created from 200 columns).
+--------------------+
| all_features|
+--------------------+
|[0.01193689934723...|
|[0.04774759738895...|
|[0.0,0.0,0.194417...|
|[0.02387379869447...|
|[1.89796699621085...|
+--------------------+
only showing top 5 rows
Expected output is a dataframe with all 200 columns separated out in a dataframe
+----------------------------+
| col1| col2| col3|...
+----------------------------+
|0.01193689934723|0.0|0.5049431301173817...
|0.04774759738895|0.0|0.1657316216149636...
|0.0|0.0|7.213126372469...
|0.02387379869447|0.0|0.1866693496827619|...
|1.89796699621085|0.0|0.3192169213385746|...
+----------------------------+
only showing top 5 rows
Here is what my Pandas DF output looks like:
0
0 [0.011936899347238104, 0.0, 0.5049431301173817...
1 [0.047747597388952415, 0.0, 0.1657316216149636...
2 [0.0, 0.0, 0.19441761495525278, 7.213126372469...
3 [0.023873798694476207, 0.0, 0.1866693496827619...
4 [1.8979669962108585, 0.0, 0.3192169213385746, ...
Since you want all the features in separate columns (as I understand from your EDIT), the answer you linked to is not the solution for you.
Try this:
# column_names: a list with the names of the output columns
temp = temp.rdd.map(lambda x: [float(y) for y in x['all_features']]).toDF(column_names)
EDIT:
Since your temp is originally a dataframe, you can also use this method without converting it to an RDD:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
# Bind i as a default argument so that each UDF keeps its own index
splits = [F.udf(lambda val, i=i: float(val[i]), FloatType()) for i in range(200)]
temp = temp.select(*[s(F.col('all_features')).alias(c) for c, s in zip(column_names, splits)])
temp.show()
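One thing to watch for when building one lambda per index in a comprehension is Python's late binding of closure variables. The following is a plain-Python sketch (no Spark involved; the variable names are illustrative) of the difference between looking the index up at call time and binding it per lambda:

```python
vector = [0.1, 0.2, 0.3]

# Late binding: every lambda looks up i when it is called,
# which is after the comprehension has finished.
late = [lambda v: v[i] for i in range(3)]

# Binding i as a default argument freezes the value for each lambda.
bound = [lambda v, i=i: v[i] for i in range(3)]

late_result = [f(vector) for f in late]
bound_result = [f(vector) for f in bound]
```

Here late_result repeats the last element three times, while bound_result returns one element per index. Whether this bites in Spark depends on when the UDF's function is serialized, so binding the index explicitly is the safer habit.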
I have a PySpark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int and stored in a variable, to be used elsewhere in the program.
result = 130
I think the simplest way is:
df.groupBy().sum().collect()
which returns a list.
In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
If you want a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation. Avoid groupByKey; use the RDD API and reduceByKey instead:
df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]
I tried this on a bigger dataset and measured the processing time:
RDD and ReduceByKey : 2.23 s
GroupByKey: 30.5 s
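To show why the reduceByKey one-liner ends in collect()[0][1], here is a plain-Python model of its map and reduce steps (no Spark involved). It is only a sketch: reduce_by_key is an illustrative helper, not the Spark method, and it says nothing about performance.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# The rows of the example dataframe as (Letter, Number) tuples
rows = [("A", 20), ("B", 30), ("D", 80)]

# Map step: give every row the same key so that all values meet in one group
pairs = [(1, row[1]) for row in rows]


def reduce_by_key(pairs, func):
    """Combine the values of each key with func, returning (key, result) pairs."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


collected = reduce_by_key(pairs, lambda x, y: x + y)

# One (key, total) pair comes back: [0] picks the pair, [1] picks the total
result = collected[0][1]
```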
Here is another way to do this, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
This is similar to the other answers, but without the use of groupBy or agg. I just select the column in question, sum it, collect it, and then index twice to get the int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the one below would be easier for me to follow.
import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]
You can also try using the first() function. It returns the first row of the dataframe, and you can access the values of the respective columns by index.
df.groupBy().sum().first()[0]
In your case, the result is a dataframe with a single row and a single column, so the above snippet works.
Select the column as an RDD, abuse keys() to get the value in each Row (or use .map(lambda x: x[0])), then use the RDD sum:
df.select("Number").rdd.keys().sum()
SQL sum using selectExpr:
df.selectExpr("sum(Number)").first()[0]
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
Sometimes, when you read a CSV file into a PySpark DataFrame, a numeric column ends up as a string type (for example '23'). In that case, use pyspark.sql.functions.sum to get the result, not sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
Here is the code to create a pyspark.sql DataFrame:
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
# Create the contexts used below (reuses an existing SparkContext if there is one)
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
So that sparkdf looks like
a b c
1 2 3
4 5 6
7 8 9
10 11 12
Now I would like to add a numpy array (or even a list) as a new column:
new_col = np.array([20,20,20,20])
But the standard way
sparkdf = sparkdf.withColumn('newcol', new_col)
fails.
A UDF is probably the way to go, but I don't know how to create a UDF that assigns a different value to each DataFrame row, i.e. one that iterates through new_col.
I have looked at other pyspark and pyspark.sql questions but couldn't find a solution.
Also, I need to stay within pyspark.sql, so a Scala solution won't help. Thanks!
Assuming that the data frame is sorted to match the order of the values in the array, you can zip the RDDs and rebuild the data frame as follows:
n = sparkdf.rdd.getNumPartitions()
# Parallelize and cast to plain integer (np.int64 won't work)
new_col = sc.parallelize(np.array([20,20,20,20]), n).map(int)
def process(pair):
    # Merge the Row's fields with the new value (list() keeps this working on Python 3)
    return dict(list(pair[0].asDict().items()) + [("new_col", pair[1])])
rdd = (sparkdf
    .rdd            # Extract RDD
    .zip(new_col)   # Zip with new col
    .map(process))  # Add new column
sqlContext.createDataFrame(rdd)  # Rebuild data frame
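The zip-and-rebuild idea can be modelled in plain Python (no Spark involved), which makes the ordering assumption easy to see. This is only a sketch: the rows are represented as dicts and the length check is an illustrative stand-in for RDD.zip's requirement that both RDDs have the same number of partitions and the same number of elements in each partition.

```python
# The rows of sparkdf as plain dicts, in the order the data frame holds them
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
    {"a": 7, "b": 8, "c": 9},
    {"a": 10, "b": 11, "c": 12},
]
new_col = [20, 20, 20, 20]


def process(pair):
    """Merge one row with the value that sits at the same position in new_col."""
    row, value = pair
    return {**row, "new_col": value}


# zip pairs elements purely by position, so both sequences must line up
if len(rows) != len(new_col):
    raise ValueError("rows and new_col must have the same length")

rebuilt = [process(pair) for pair in zip(rows, new_col)]
```

Because the pairing is positional, a differently ordered data frame would silently attach the values to the wrong rows.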
You can also use joins:
new_col = sqlContext.createDataFrame(
    zip(range(1, 5), [20] * 4),
    ("rn", "new_col"))
sparkdf.registerTempTable("df")
sparkdf_indexed = sqlContext.sql(
    # Make sure we have specific order and add row number
    "SELECT row_number() OVER (ORDER BY a, b, c) AS rn, * FROM df")
(sparkdf_indexed
    .join(new_col, new_col.rn == sparkdf_indexed.rn)
    .drop(new_col.rn))
but the window function component does not scale and should be avoided with larger datasets.
Of course, if all you need is a column with a single value, you can simply use lit:
import pyspark.sql.functions as f
sparkdf.withColumn("new_col", f.lit(20))
but I assume that is not the case.
I have a table stored as an RDD of lists, on which I want to perform something akin to a groupby in SQL or pandas, taking the sum or average for every variable.
The way I currently do it is this (untested code):
l=[(3, "add"),(4, "add")]
dict={}
i=0
for aggregation in l:
    RDD= RDD.map(lambda x: (x[6], float(x[aggregation[0]])))
    agg=RDD.reduceByKey(aggregation[1])
    dict[i]=agg
    i+=1
Then I'll need to join all the RDDs in dict.
This isn't very efficient though. Is there a better way?
If you are using Spark 1.3 or later, you could look at the DataFrame API.
In the pyspark shell:
import numpy as np
# create a DataFrame (this can also be from an RDD)
df = sqlCtx.createDataFrame([[float(v) for v in row] for row in np.random.rand(50, 3)])
df.agg({col: "mean" for col in df.columns}).collect()
This outputs:
[Row(AVG(_3#1456)=0.5547187588389414, AVG(_1#1454)=0.5149476209374797, AVG(_2#1455)=0.5022967093047612)]
The available aggregate methods are "avg"/"mean", "max", "min", "sum", "count".
To get several aggregations for the same column, you can call agg with a list of explicitly constructed aggregations rather than with a dictionary:
from pyspark.sql import functions as F
df.agg(*[F.min(col) for col in df.columns] + [F.avg(col) for col in df.columns]).collect()
Or for your case:
df.agg(F.count(df.var3), F.max(df.var3), ) # etc...
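For comparison, the grouped sum and mean from the question can be computed in a single pass instead of one pass per aggregated column. The following plain-Python model (no Spark involved) is only a sketch: aggregate_by_key, the sample rows, and the column positions are all hypothetical.

```python
from collections import defaultdict


def aggregate_by_key(rows, key_index, value_indices):
    """Return {key: {index: {"sum": ..., "mean": ...}}} in one pass over rows."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        counts[key] += 1
        for index in value_indices:
            sums[key][index] += float(row[index])
    return {
        key: {
            index: {"sum": total, "mean": total / counts[key]}
            for index, total in per_key.items()
        }
        for key, per_key in sums.items()
    }


# Hypothetical rows: the key is in position 2, the values in positions 0 and 1
rows = [
    ("1", "10", "x"),
    ("3", "20", "x"),
    ("5", "30", "y"),
]
summary = aggregate_by_key(rows, key_index=2, value_indices=[0, 1])
```

Collecting every aggregate while visiting each row once is also what lets the DataFrame agg call above avoid joining separate per-column results afterwards.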