I have a DataFrame with a single row and multiple columns. I would like to convert it into multiple rows.
I found a similar question here on Stack Overflow.
That question shows how it can be done in Scala, but I wanted to do this in PySpark. I tried to replicate the code in PySpark but wasn't able to.
I am not able to convert the Scala code below to Python:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, map}

var ColumnsAndValues: Array[Column] = df1.columns.flatMap { c => Array(lit(c), col(c)) }
val df2 = df1.withColumn("myMap", map(ColumnsAndValues: _*))
In PySpark you can use the create_map function to create a map column, and a list comprehension with itertools.chain to get the equivalent of Scala's flatMap:
import itertools
from pyspark.sql import functions as F
columns_and_values = itertools.chain(*[(F.lit(c), F.col(c)) for c in df1.columns])
df2 = df1.withColumn("myMap", F.create_map(*columns_and_values))
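If the goal really is one row per original column, a possible follow-up (a sketch, using the df2 built above; the alias names are arbitrary) is to explode that map into key/value rows:
df3 = df2.select(F.explode("myMap").alias("column_name", "value"))
df3.show()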
newlist = []
for column in new_columns:
    count12 = new_df.loc[new_df[column].diff() == 1]

new_df2 = new_df2.groupby(['my_id', 'friend_id', 'family_id', 'colleage_id']).apply(len)
There doesn't seem to be an option in PySpark for getting the length of each group like this.
How can I achieve this code in PySpark?
Thanks in advance.
apply(len) is literally just an aggregation function that counts the grouped elements from groupby. You can do the very same thing in basic PySpark syntax:
import pyspark.sql.functions as F
(df
.groupBy('my_id','friend_id','family_id','colleage_id')
.agg(F.count('*'))
.show()
)
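If you prefer, the built-in count() on the grouped data gives the same result (the column will simply be named count):
df.groupBy('my_id', 'friend_id', 'family_id', 'colleage_id').count().show()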
Can anyone give me some guidance on pivot tables using a Spark DataFrame in Python?
I am getting the following error: Column is not iterable.
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
With the column values specified: df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
Without the column values specified (more concise but less efficient): df.groupBy("year").pivot("course").sum("earnings")
You are proceeding in the right direction. Here is sample working code (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
| 2019| 3| 2|
+--------+---+---+
//pivot.csv
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
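For reference, a minimal in-memory sketch of the same pivot, using the spark session created above, so no pivot.csv file is needed:
data = [(2019, 'JAN', 1), (2019, 'JAN', 1), (2019, 'FEB', 1),
        (2019, 'JAN', 1), (2019, 'FEB', 1)]
df = spark.createDataFrame(data, ['thisyear', 'month', 'value1'])
df.groupBy('thisyear').pivot('month', ['JAN', 'FEB']).sum('value1').show()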
I have some 5 columns (A - E) to be added to the DataFrame. The values for these columns are stored in variables (a - e).
Instead of using
df.withColumn("A", a).withColumn("B", b).withColumn..... etc
can we do this with a UDF?
Currently I have a named function:
def add_col(df_name, newCol, value):
    df = df_name
    df = df.withColumn(newCol, value)
    return df
But I am not able to understand how to convert it to a UDF and use it. Please help.
If you want to add multiple columns you can use select with *:
df.select("*", some_column, another_column, ...)
You should not use a UDF; a UDF can't produce multiple columns.
However, you can write a select statement similar to the one in the other answer:
df.select(col("*"), lit(a).as("a"), lit(b).as("b"), ...)
You can also automate adding them:
val fieldsMap = Map("a" -> a, "b" -> b)
df.select(Array(col("*")) ++ fieldsMap.map(e => lit(e._2).as(e._1)) : _*)
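For completeness, a PySpark sketch of the same automated approach, assuming a and b are plain Python values that should become literal columns:
from pyspark.sql import functions as F

fields = {"a": a, "b": b}  # column name -> value; extend with c, d, e as needed
df = df.select("*", *[F.lit(v).alias(k) for k, v in fields.items()])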
Here is the code to create a pyspark.sql DataFrame
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

# sc / sqlContext are predefined in the pyspark shell; otherwise:
sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame(np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]]), columns=['a','b','c'])
sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
So that sparkdf looks like
a b c
1 2 3
4 5 6
7 8 9
10 11 12
Now I would like to add a numpy array (or even a list) as a new column:
new_col = np.array([20,20,20,20])
But the standard way
sparkdf = sparkdf.withColumn('newcol', new_col)
fails.
Probably a UDF is the way to go, but I don't know how to create a UDF that assigns a different value to each DataFrame row, i.e. one that iterates through new_col.
I have looked at other pyspark and pyspark.sql questions but couldn't find a solution.
Also, I need to stay within pyspark.sql, so not a Scala solution. Thanks!
Assuming the data frame is sorted to match the order of values in the array, you can zip the RDDs and rebuild the data frame as follows:
n = sparkdf.rdd.getNumPartitions()

# Parallelize and cast to plain integer (np.int64 won't work)
new_col = sc.parallelize(np.array([20, 20, 20, 20]), n).map(int)

def process(pair):
    # Merge the original Row (as a dict) with the new column value
    return dict(list(pair[0].asDict().items()) + [("new_col", pair[1])])

rdd = (sparkdf
    .rdd            # Extract RDD
    .zip(new_col)   # Zip with new col
    .map(process))  # Add new column

sqlContext.createDataFrame(rdd)  # Rebuild data frame
You can also use joins:
new_col = sqlContext.createDataFrame(
    zip(range(1, 5), [20] * 4),
    ("rn", "new_col"))

sparkdf.registerTempTable("df")
sparkdf_indexed = sqlContext.sql(
    # Make sure we have specific order and add row number
    "SELECT row_number() OVER (ORDER BY a, b, c) AS rn, * FROM df")

(sparkdf_indexed
    .join(new_col, new_col.rn == sparkdf_indexed.rn)
    .drop(new_col.rn))
but the window function component is not scalable and should be avoided with larger datasets.
Of course, if all you need is a column with a single value, you can simply use lit:
import pyspark.sql.functions as f
sparkdf.withColumn("new_col", f.lit(20))
but I assume it is not the case.
I have a table stored as an RDD of lists, on which I want to perform something akin to a groupby in SQL or pandas, taking the sum or average for every variable.
The way I currently do it is this (untested code):
l = [(3, "add"), (4, "add")]
dict = {}
i = 0
for aggregation in l:
    RDD = RDD.map(lambda x: (x[6], float(x[aggregation[0]])))
    agg = RDD.reduceByKey(aggregation[1])
    dict[i] = agg
    i += 1
Then I'll need to join all the RDDs in dict.
This isn't very efficient though. Is there a better way?
If you are using >= Spark 1.3, you could look at the DataFrame API.
In the pyspark shell:
import numpy as np
# create a DataFrame (this can also be from an RDD)
df = sqlCtx.createDataFrame(map(lambda x:map(float, x), np.random.rand(50, 3)))
df.agg({col: "mean" for col in df.columns}).collect()
This outputs:
[Row(AVG(_3#1456)=0.5547187588389414, AVG(_1#1454)=0.5149476209374797, AVG(_2#1455)=0.5022967093047612)]
The available aggregate methods are "avg"/"mean", "max", "min", "sum", "count".
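For example, the dictionary form can mix those methods across different columns; a sketch using the auto-generated _1/_2/_3 column names from the DataFrame above:
df.agg({"_1": "min", "_2": "max", "_3": "sum"}).collect()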
To get several aggregations for the same column, you can call agg with a list of explicitly constructed aggregations rather than with a dictionary:
from pyspark.sql import functions as F
df.agg(*[F.min(col) for col in df.columns] + [F.avg(col) for col in df.columns]).collect()
Or for your case:
df.agg(F.count(df.var3), F.max(df.var3))  # etc...
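If you want readable names on those aggregation columns, alias() can be chained on each expression; a small sketch using the var3 column referenced above:
df.agg(
    F.count(df.var3).alias("var3_count"),
    F.max(df.var3).alias("var3_max"),
).show()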