Computing one value from multiple values in row - python

I have a PySpark Dataframe, and I'd like to add a column computed from multiple values from the other columns.
For instance let's say I have a simple dataframe with ages and names of people, and I want to compute some value, like age*2 + len(name). Can I do this with a udf or a .withColumn?
from pyspark.sql import Row
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)
display(schemaPeople)

Use withColumn:
from pyspark.sql import functions as F
schemaPeople.withColumn(
"my_column",
F.col("age")*2 + F.length(F.col("name"))
).show()

I found an way to do this with #udf:
#udf
def complex_op(age, name):
return age*2 + len(name)
schemaPeople.withColumn(
"my_column",
lit(complex_op(schemaPeople["age"], schemaPeople["name"]))
)

Related

filtering out spark dataframe using udf

I have a pyspark dataframe with two columns, name and source. All the values in the name column are distinct. Source has multiple strings separated with a comma (,).
I want to filter out all those rows where any of the strings in the source column contains any value from the whole name column.
I am using the following UDF:
def checkDependentKPI(df, name_list):
for row in df.collect():
for src in row["source"].split(","):
for name in name_list:
if name in src:
return row['name']
return row['name']
My end goal is to put all such rows at the end of the dataframe. How can I do it?
Sample dataframe:
+--------------------+--------------------+
| name| source|
+--------------------+--------------------+
|dev.................|prod, sum, diff.....|
|prod................|dev, diff, avg......|
|stage...............|mean, mode..........|
|balance.............|median, mean........|
|target..............|avg, diff, sum......|
+--------------------+--------------------+
You can use a like() to leverage the SQL like expression without any heavy collect() action and loop checking. Suppose you already have a list of name:
from functools import reduce
df.filter(
reduce(lambda x, y: x|y, [func.col('source').like(f"%{pattern}%") for pattern in name])
).show(20, False)
Maybe this?
from pyspark.sql import functions as psf
test_data = [('dev','prod,sum,diff')
, ('prod','dev,diff,avg')
, ('stage','mean,mode')
, ('balance','median,mean')
, ('target','avg,diff,sum')]
df = spark.createDataFrame(test_data, ['kpi_name','kpi_source_table'])
df = df.withColumn('kpi_source_table', psf.split('kpi_source_table', ','))
df_flat = df.agg(psf.collect_list('kpi_name').alias('flat_kpi'))
df = df.join(df_flat, how='cross')
df = df.withColumn('match', psf.array_intersect('kpi_source_table', 'flat_kpi'))
display(df.orderBy('match'))

Pyspark: Count how many rows have the same value in two columns and drop the duplicates [duplicate]

I have three Arrays of string type containing following information:
groupBy array: containing names of the columns I want to group my data by.
aggregate array: containing names of columns I want to aggregate.
operations array: containing the aggregate operations I want to perform
I am trying to use spark data frames to achieve this. Spark data frames provide an agg() where you can pass a Map [String,String] (of column name and respective aggregate operation ) as input, however I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?
Scala:
You can for example map over a list of functions with a defined mapping from name to function:
import org.apache.spark.sql.functions.{col, min, max, mean}
import org.apache.spark.sql.Column
val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")
val mapping: Map[String, Column => Column] = Map(
"min" -> min, "max" -> max, "mean" -> avg)
val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")
val exprs = aggregate.flatMap(c => operations .map(f => mapping(f)(col(c))))
df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show
// +---+------+------+------+
// | k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// | 1| 3.0| 3.0| 3.0|
// | 2| -5.0| -5.0| -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately parser which is used internally SQLContext is not exposed publicly but you can always try to build plain SQL queries:
df.registerTempTable("df")
val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
f => s"$f($c) AS ${c}_${f}")
).mkString(",")
sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
Python:
from pyspark.sql.functions import mean, sum, max, col
df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]
exprs = [f(col(c)) for f in funs for c in aggregate]
# or equivalent df.groupby(groupBy).agg(*exprs)
df.groupby(*groupBy).agg(*exprs)
See also:
Spark SQL: apply aggregate functions to a list of column
For those that wonder, how #zero323 answer can be written without a list comprehension in python:
from pyspark.sql.functions import min, max, col
# init your spark dataframe
expr = [min(col("valueName")),max(col("valueName"))]
df.groupBy("keyName").agg(*expr)
Do something like
from pyspark.sql import functions as F
df.groupBy('groupByColName') \
.agg(F.sum('col1').alias('col1_sum'),
F.max('col2').alias('col2_max'),
F.avg('col2').alias('col2_avg')) \
.show()
Here is another straight forward way to apply different aggregate functions on the same column while using Scala (this has been tested in Azure Databricks).
val groupByColName = "Store"
val colName = "Weekly_Sales"
df.groupBy(groupByColName)
.agg(min(colName),
max(colName),
round(avg(colName), 2))
.show()
for example if you want to count percentage of zeroes in each column in pyspark dataframe for which we can use expression to be executed on each column of the dataframe
from pyspark.sql.functions import count,col
def count_zero_percentage(c):
pred = col(c)==0
return sum(pred.cast("integer")).alias(c)
df.agg(*[count_zero_percentage(c)/count('*').alias(c) for c in df.columns]).show()
case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF
import org.apache.spark.sql.functions._
val groupped = df.groupBy("firstName", "lastName").agg(
sum("Amount"),
mean("Amount"),
stddev("Amount"),
count(lit(1)).alias("numOfRecords")
).toDF()
display(groupped)
// Courtesy Zach ..
Zach simplified answer for a post Marked Duplicate
Spark Scala Data Frame to have multiple aggregation of single Group By

Another way of passing orderby list in pyspark windows method

I just had a below concern in performing window operation on pyspark dataframe. I want to get the latest records from the input table with the below condition, but I want to exclude the for loop:
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date').desc()", "col('load_date').desc()"]
window_spec = Window.partitionBy(*groupby_col).orderBy([eval(x) for x in orderby_col])
df = df.withColumn("rank", rank().over(window_spec))
df = df.filter(col('rank') == '1')
My concern, is I'm using the orderby_col and evaluating to covert in columner way using eval() and for loop to check all the orderby columns in the list.
Could you please let me know how we can pass multiple columns in order by without having a for loop to do the descending order??
import pyspark.sql.functions as f
groupby_col = ["col('customer_id')"]
orderby_col = ["col('process_date')", "col('load_date')"]
window_spec = Window.partitionBy(*groupby_col).orderBy(f.desc(*orderby_col))
df = df.withColumn("rank", f.rank().over(window_spec))
df = df.filter(col('rank') == '1')

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create an user defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
To extract first 4 characters from the arrival_date (StringType) column, create a new_df by using UserDefinedFunction (as you cannot modify the columns: they are immutable):
from pyspark.sql.functions import UserDefinedFunction, to_date
old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to covert the arrival_date (StringType) column into DateType column, use the to_date function as show below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html

Filtering a Pyspark DataFrame with SQL-like IN clause

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
String you pass to SQLContext it evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
reiterating what #zero323 has mentioned above : we can do the same thing using a list as well (not only set) like below
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df", such that you want to keep rows based upon a column "v" taking only the values from choice_list, then
from pyspark.sql.functions import col
df_filtered = df.where( ( col("v").isin (choice_list) ) )
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
def filter_func(a):
"""wrapper function to pass a in udf"""
def filter_func_(col):
"""filtering function"""
if col in a.value:
return True
return False
return udf(filter_func_, BooleanType())
# Broadcasting allows to pass large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1'))) \
from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark=spark.read.csv('datasets/myData.csv',header=True,inferSchema=True)
df_spark.createOrReplaceTempView("df") # we need to create a Temp table first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()

Categories

Resources