filtering out spark dataframe using udf - python

I have a pyspark dataframe with two columns, name and source. All the values in the name column are distinct. Source has multiple strings separated with a comma (,).
I want to filter out all those rows where any of the strings in the source column contains any value from the whole name column.
I am using the following UDF:
def checkDependentKPI(df, name_list):
    for row in df.collect():
        for src in row["source"].split(","):
            for name in name_list:
                if name in src:
                    return row['name']
        return row['name']
My end goal is to put all such rows at the end of the dataframe. How can I do it?
Sample dataframe:
+--------------------+--------------------+
| name| source|
+--------------------+--------------------+
|dev.................|prod, sum, diff.....|
|prod................|dev, diff, avg......|
|stage...............|mean, mode..........|
|balance.............|median, mean........|
|target..............|avg, diff, sum......|
+--------------------+--------------------+

You can use like() to apply the SQL LIKE expression without a heavy collect() action or nested loops. Suppose you already have the values of the name column in a Python list called name:
from functools import reduce
from pyspark.sql import functions as func
# OR together one LIKE condition per name
df.filter(
    reduce(lambda x, y: x | y, [func.col('source').like(f"%{pattern}%") for pattern in name])
).show(20, False)

Maybe this?
from pyspark.sql import functions as psf
test_data = [('dev','prod,sum,diff')
, ('prod','dev,diff,avg')
, ('stage','mean,mode')
, ('balance','median,mean')
, ('target','avg,diff,sum')]
df = spark.createDataFrame(test_data, ['kpi_name','kpi_source_table'])
df = df.withColumn('kpi_source_table', psf.split('kpi_source_table', ','))
df_flat = df.agg(psf.collect_list('kpi_name').alias('flat_kpi'))
df = df.join(df_flat, how='cross')
df = df.withColumn('match', psf.array_intersect('kpi_source_table', 'flat_kpi'))
display(df.orderBy('match'))
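For checking either answer on a small sample, the intended ordering can be sketched in plain Python without Spark. This is only a reference under stated assumptions: the rows are the sample data from the question, and depends_on_known_name is a hypothetical helper that uses the same substring test as the question's loop.

```python
# Sample rows from the question: (name, source)
rows = [
    ("dev", "prod, sum, diff"),
    ("prod", "dev, diff, avg"),
    ("stage", "mean, mode"),
    ("balance", "median, mean"),
    ("target", "avg, diff, sum"),
]

# Every value of the name column
names = [name for name, _ in rows]


def depends_on_known_name(source):
    """True if any comma-separated entry of source contains any name."""
    entries = [entry.strip() for entry in source.split(",")]
    return any(name in entry for entry in entries for name in names)


# sorted() is stable and False sorts before True, so matching rows
# move to the end while the original order is kept within each group.
ordered = sorted(rows, key=lambda row: depends_on_known_name(row[1]))

for name, source in ordered:
    print(name, "|", source)
```

On the sample data, dev and prod reference each other, so they end up last. Note that the substring test also matches partial words, which may or may not be what you want.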

Related

How to use wide_to_long (Pandas)

I have this code which I thought would reformat the dataframe so that the columns with the same column name would be replaced by their duplicates.
# Function that splits dataframe into two separate dataframes, one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
    # Extract common prefix -> remove trailing digits
    columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
    # Split columns
    unq_cols = columns[columns == 1].index
    # All columns from dataframe that are not in unq_cols
    dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]
    return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = (
    pd.wide_to_long(
        dataset.reset_index(),
        stubnames=['t_dur', 't_dance', 't_energy', 't_key', 't_mode', 't_speech',
                   't_acous', 't_ins', 't_live', 't_val', 't_tempo'],
        i=['index'] + cols, j='temp', sep='t_')
    .reset_index().groupby(cols, as_index=False).mean()
)
temp
Which gave me this output:
I tried to look at this question, but the dataframe that's returned has "Nothing to show". What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by-hand", but I am trying to do it more efficiently using the already defined built-in functions.
The desired output is the dataframe that is shown last.

Pyspark: Count how many rows have the same value in two columns and drop the duplicates [duplicate]

I have three Arrays of string type containing following information:
groupBy array: containing names of the columns I want to group my data by.
aggregate array: containing names of columns I want to aggregate.
operations array: containing the aggregate operations I want to perform
I am trying to use Spark DataFrames to achieve this. The DataFrame API provides an agg() that accepts a Map[String, String] (column name to aggregate operation) as input; however, I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?
Scala:
You can for example map over a list of functions with a defined mapping from name to function:
import org.apache.spark.sql.functions.{col, min, max, avg}
import org.apache.spark.sql.Column
val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")
val mapping: Map[String, Column => Column] = Map(
  "min" -> min, "max" -> max, "mean" -> avg)
val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))
df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show
// +---+------+------+------+
// | k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// | 1| 3.0| 3.0| 3.0|
// | 2| -5.0| -5.0| -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately, the parser that SQLContext uses internally is not exposed publicly, but you can always build plain SQL queries:
df.registerTempTable("df")
val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
  f => s"$f($c) AS ${c}_${f}")
).mkString(",")
sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
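The same query string can also be assembled in Python. The following is a minimal sketch, assuming the three input lists from the question and a table already registered under the name df; build_agg_query is a hypothetical helper, not part of Spark. It only builds the string, so it can be inspected before being passed to a SQL entry point.

```python
def build_agg_query(table, group_by, aggregate, operations):
    """Build a GROUP BY query applying every operation to every column.

    The names are interpolated directly into the SQL text, so they must
    come from trusted input (for example, the DataFrame's own columns).
    """
    group_exprs = ", ".join(group_by)
    agg_exprs = ", ".join(
        f"{op}({column}) AS {column}_{op}"
        for column in aggregate
        for op in operations
    )
    return f"SELECT {group_exprs}, {agg_exprs} FROM {table} GROUP BY {group_exprs}"


group_by = ["k"]
aggregate = ["v"]
operations = ["min", "max", "mean"]

query = build_agg_query("df", group_by, aggregate, operations)
print(query)
```

The resulting string can then be handed to sqlContext.sql(query), as in the Scala version.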
Python:
from pyspark.sql.functions import mean, sum, max, col
df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]
exprs = [f(col(c)) for f in funs for c in aggregate]
# or equivalent df.groupby(groupBy).agg(*exprs)
df.groupby(*groupBy).agg(*exprs)
See also:
Spark SQL: apply aggregate functions to a list of column
For those wondering how @zero323's answer can be written without a list comprehension in Python:
from pyspark.sql.functions import min, max, col
# init your spark dataframe
expr = [min(col("valueName")), max(col("valueName"))]
df.groupBy("keyName").agg(*expr)
Do something like
from pyspark.sql import functions as F
df.groupBy('groupByColName') \
.agg(F.sum('col1').alias('col1_sum'),
F.max('col2').alias('col2_max'),
F.avg('col2').alias('col2_avg')) \
.show()
Here is another straightforward way to apply different aggregate functions to the same column in Scala (this has been tested in Azure Databricks).
val groupByColName = "Store"
val colName = "Weekly_Sales"
df.groupBy(groupByColName)
.agg(min(colName),
max(colName),
round(avg(colName), 2))
.show()
For example, to compute the percentage of zeros in each column of a PySpark DataFrame, you can build one expression per column and pass them all to agg():
from pyspark.sql.functions import col, count, sum
def count_zero_percentage(c):
    pred = col(c) == 0
    # cast the boolean predicate to 0/1 and sum it to count the zeros
    return sum(pred.cast("integer")).alias(c)
# alias the whole ratio so each result column keeps its original name
df.agg(*[(count_zero_percentage(c) / count('*')).alias(c) for c in df.columns]).show()
case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF
import org.apache.spark.sql.functions._
val groupped = df.groupBy("firstName", "lastName").agg(
sum("Amount"),
mean("Amount"),
stddev("Amount"),
count(lit(1)).alias("numOfRecords")
).toDF()
display(groupped)
// Courtesy Zach ..
This is Zach's simplified answer to a post that was marked as a duplicate:
Spark Scala Data Frame to have multiple aggregation of single Group By

Append pandas dataframe to existing table in databricks

I want to append a pandas dataframe (8 columns) to an existing table in Databricks (12 columns), and fill the other 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It threw the error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
It looks like Spark can't handle this operation with unmatched columns. Is there any way to achieve what I want?
I think that the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two.
from pyspark.sql import Row
from pyspark.sql.functions import lit
bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])
smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
Can you try this?
from pyspark.sql import functions as F
df = spark.createDataFrame(pandas_df)
df_table_struct = sqlContext.sql('select * from my_table limit 0')
# add every table column the pandas data lacks, filled with nulls
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))
df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
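Both answers come down to working out which table columns the incoming data lacks. Column order matters as well when using insertInto, since it resolves columns by position rather than by name. A minimal plain-Python sketch of that planning step follows; plan_alignment and the column names are hypothetical and only illustrate the idea.

```python
def plan_alignment(table_columns, frame_columns):
    """Return (missing, ordered) for appending a frame to a table.

    missing: table columns absent from the frame, to be filled with nulls.
    ordered: the final column order, identical to the table's order.
    """
    unknown = [c for c in frame_columns if c not in table_columns]
    if unknown:
        raise ValueError(f"columns not in target table: {unknown}")
    missing = [c for c in table_columns if c not in frame_columns]
    return missing, list(table_columns)


# Hypothetical column names, for illustration only
table_columns = ["id", "name", "city", "score"]
frame_columns = ["name", "id"]

missing, ordered = plan_alignment(table_columns, frame_columns)
print(missing)   # columns to add as nulls
print(ordered)   # order to select before writing
```

With Spark, each name in missing would be added via withColumn(name, lit(None)), followed by select(*ordered) before the write.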

Computing one value from multiple values in row

I have a PySpark Dataframe, and I'd like to add a column computed from multiple values from the other columns.
For instance let's say I have a simple dataframe with ages and names of people, and I want to compute some value, like age*2 + len(name). Can I do this with a udf or a .withColumn?
from pyspark.sql import Row
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)
display(schemaPeople)
Use withColumn:
from pyspark.sql import functions as F
schemaPeople.withColumn(
"my_column",
F.col("age")*2 + F.length(F.col("name"))
).show()
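I found a way to do this with @udf:
from pyspark.sql.functions import udf, lit
# without an explicit returnType, @udf produces a string column
@udf
def complex_op(age, name):
    return age*2 + len(name)
schemaPeople.withColumn(
    "my_column",
    lit(complex_op(schemaPeople["age"], schemaPeople["name"]))
)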
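Because the body of the UDF is ordinary Python, the computation can be checked without Spark before wrapping it. A small sketch using the sample people from the question (the undecorated complex_op here is the same function without @udf):

```python
def complex_op(age, name):
    # Same formula as in the question: age*2 + len(name)
    return age * 2 + len(name)


# Sample data from the question
people = [("Ankit", 25), ("Jalfaizy", 22), ("saurabh", 20), ("Bala", 26)]

expected = {name: complex_op(age, name) for name, age in people}
print(expected)
```

These values can then be compared against the my_column output of either the withColumn or the UDF version.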

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create a user-defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
To extract the first 4 characters from the arrival_date (StringType) column, create a new_df with a UserDefinedFunction (DataFrames are immutable, so you cannot modify a column in place):
from pyspark.sql.functions import UserDefinedFunction, to_date
from pyspark.sql.types import StringType
old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to convert the arrival_date (StringType) column into a DateType column, use the to_date function as shown below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
