Pyspark: compare one column with other columns and flag if similar - python

I'm looking to create a column which flags if my column1 value is found in either column2, column3 or column4.
I can do it this way:
import pyspark.sql.functions as f
df.withColumn("FLAG", f.when((f.col("column1") == f.col("column2")) |
(f.col("column1") == f.col("column3")) |
(f.col("column1") == f.col("column4")), 'Y')\
.otherwise('N'))
This takes quite some time and I find it inefficient. Is there a better way to write this, perhaps using UDFs? I'm also trying to figure out how to reference which column the match comes from.
Any help helps! Thanks

You can use the Spark built-in function isin:
from pyspark.sql import functions as F
df = (spark
      .sparkContext
      .parallelize([
          ('A', 'A', 'B', 'C'),
          ('A', 'B', 'C', 'A'),
          ('A', 'C', 'A', 'B'),
          ('A', 'X', 'Y', 'Z'),
      ])
      .toDF(['ca', 'cb', 'cc', 'cd']))

(df
 .withColumn('flag', F.col('ca').isin(
     F.col('cb'),
     F.col('cc'),
     F.col('cd'),
 ))
 .show())
# +---+---+---+---+-----+
# | ca| cb| cc| cd| flag|
# +---+---+---+---+-----+
# | A| A| B| C| true|
# | A| B| C| A| true|
# | A| C| A| B| true|
# | A| X| Y| Z|false|
# +---+---+---+---+-----+
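The question also asks how to tell which column the match came from; isin only returns a boolean. One possible way (a sketch of my own, reusing the df above): chain when() expressions and let coalesce pick the name of the first matching column, or null if none match.
matched = F.coalesce(
    F.when(F.col('ca') == F.col('cb'), F.lit('cb')),
    F.when(F.col('ca') == F.col('cc'), F.lit('cc')),
    F.when(F.col('ca') == F.col('cd'), F.lit('cd')),
)
df.withColumn('matched_col', matched).show()
# matched_col comes out as 'cb', 'cd', 'cc' and null for the four rows above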

Related

display the difference of two values of two columns from two different dataframes without losing the other columns

I've got two dataframes with different values of "d" but the same values of "a" and "b".
This is df1:
df1 = spark.createDataFrame([
    ('c', 'd', 8),
    ('e', 'f', 8),
    ('c', 'j', 9),
], ['a', 'b', 'd'])

df1.show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 8|
| e| f| 8|
| c| j| 9|
+---+---+---+
And this is df2:
df2 = spark.createDataFrame([
    ('c', 'd', 7),
    ('e', 'f', 3),
    ('c', 'j', 8),
], ['a', 'b', 'd'])

df2.show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 7|
| e| f| 3|
| c| j| 8|
+---+---+---+
I want to obtain the difference between the values of column "d", but I also want to keep the columns "a" and "b":
df3
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 1|
| e| f| 5|
| c| j| 1|
+---+---+---+
I tried doing a subtract between the two dataframes, but it didn't work:
df1.subtract(df2).show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 8|
| e| f| 8|
| c| j| 9|
+---+---+---+
Here is how you can do it:
df3 = (df1.join(df2, on=['b', 'a'], how='outer')
          .select('a', 'b', (df1.d - df2.d).alias('diff')))
df3.show()
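A variant of the same idea (my sketch, not part of the answer above): rename df2's value column before the join so you don't have to reference the original df1/df2 objects afterwards.
from pyspark.sql import functions as F

df3 = (df1.join(df2.withColumnRenamed('d', 'd2'), on=['a', 'b'], how='outer')
          .select('a', 'b', (F.col('d') - F.col('d2')).alias('d')))
df3.show()
# d is 1, 5 and 1 for the three rows above (row order may vary)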

PySpark replace value in several column at once

I want to replace a value in a dataframe column with another value, and I have to do it for many columns (let's say 30-100 columns).
I've gone through this and this already.
from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo", "val"), (2, "bar", "baz"), (3, "baz", "buz")]).toDF(["x", "y", "z"])
df.show()

# I can replace "baz" with Null separately in columns y and z
def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("y", replace(col("y"), "baz"))\
       .withColumn("z", replace(col("z"), "baz"))
df.show()
I can replace "baz" with Null separaely in column y and z. But I want to do it for all columns - something like list comprehension way like below
[replace(df[col], "baz") for col in df.columns]
Since there are on the order of 30-100 columns, let's add a few more columns to the DataFrame to make the example more general.
# Loading the requisite packages
from pyspark.sql.functions import col, when

df = sc.parallelize([(1, "foo", "val", "baz", "gun", "can", "baz", "buz", "oof"),
                     (2, "bar", "baz", "baz", "baz", "got", "pet", "stu", "got"),
                     (3, "baz", "buz", "pun", "iam", "you", "omg", "sic", "baz")])\
       .toDF(["x", "y", "z", "a", "b", "c", "d", "e", "f"])
df.show()
+---+---+---+---+---+---+---+---+---+
| x| y| z| a| b| c| d| e| f|
+---+---+---+---+---+---+---+---+---+
| 1|foo|val|baz|gun|can|baz|buz|oof|
| 2|bar|baz|baz|baz|got|pet|stu|got|
| 3|baz|buz|pun|iam|you|omg|sic|baz|
+---+---+---+---+---+---+---+---+---+
Let's say we want to replace baz with Null in all the columns except columns x and a. Use a list comprehension to choose the columns where the replacement has to be done.
# This contains the list of columns where we apply replace() function
all_column_names = df.columns
print(all_column_names)
['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f']
columns_to_remove = ['x','a']
columns_for_replacement = [i for i in all_column_names if i not in columns_to_remove]
print(columns_for_replacement)
['y', 'z', 'b', 'c', 'd', 'e', 'f']
Finally, do the replacement using when(), which effectively acts as an if-else clause.
# Doing the replacement on all the requisite columns
for i in columns_for_replacement:
    df = df.withColumn(i, when((col(i) == 'baz'), None).otherwise(col(i)))
df.show()
+---+----+----+---+----+---+----+---+----+
| x| y| z| a| b| c| d| e| f|
+---+----+----+---+----+---+----+---+----+
| 1| foo| val|baz| gun|can|null|buz| oof|
| 2| bar|null|baz|null|got| pet|stu| got|
| 3|null| buz|pun| iam|you| omg|sic|null|
+---+----+----+---+----+---+----+---+----+
There is no need to create a UDF to do the replacement if it can be done with a normal if-else clause. UDFs are in general a costly operation and should be avoided whenever possible.
You can also use a reduce() function:
from functools import reduce
reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")), [df, 'y', 'z']).show()
#+---+----+----+
#| x| y| z|
#+---+----+----+
#| 1| foo| val|
#| 2| bar|null|
#| 3|null| buz|
#+---+----+----+
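The same reduce pattern generalizes to the longer column list from the first answer; a sketch, assuming replace() from the question and the columns_for_replacement list defined above:
from functools import reduce
from pyspark.sql.functions import col

# Fold replace() over every column that needs it, starting from df
df = reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")),
            columns_for_replacement, df)
df.show()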
You can use select and a list comprehension:
import pyspark.sql.functions as f

df = df.select([replace(f.col(column), 'baz').alias(column) if column != 'x' else f.col(column)
                for column in df.columns])
df.show()

Dataframe Join Null-Safe Condition Use

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:
dfA:
# +----+----+
# |col1|col2|
# +----+----+
# | a|null|
# | b| 0|
# | c| 0|
# +----+----+
dfB:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x|
# | b| 0| x|
# +----+----+----+
The dataframes can be created with this script:
dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)
Join call:
dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()
Result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null|null| <- col3 should be x
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
Expected result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x| <-
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
It works if I set col2 in the first row to anything other than null, but I need to support null values.
I tried comparing with null-safe equals in the join condition, as outlined in this post, like so:
cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Result of null-safe join:
# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# | a|null| a|null| x|
# | b| 0| b| 0| x|
# | c| 0|null|null|null|
# +----+----+----+----+----+
This retains duplicate columns, though. I'm still looking for a way to achieve the expected result at the end of the join.
A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.
dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
This fails because col1 in orderBy is ambiguous. You should reference a specific source, for example dfA:
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
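Putting the pieces together (my combination of the condition from the question and the select above), this should yield the expected result:
cond = dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2)

(dfA
 .join(dfB, cond, how='left')
 .select(dfA.col1, dfA.col2, dfB.col3)  # keep dfA's key columns, take col3 from dfB
 .orderBy(dfA.col1)
 .show())
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+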
If you have to join on columns that contain null values in PySpark, use eqNullSafe (available since Spark 2.3) in the join condition so that null matches null. More examples: https://knowledges.co.in/how-to-use-eqnullsafe-in-pyspark-for-null-values/

python to pyspark, converting the pivot in pyspark

I have the DataFrames below and achieved the desired output in Python (pandas). But I want to convert the same logic into PySpark.
import pandas as pd

d = {'user': ['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'],
     'songs': [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77]}
data = pd.DataFrame(data=d)

e = {'user': ['A', 'B', 'C', 'D', 'E', 'F', 'A'], 'cluster': [1, 2, 3, 1, 2, 3, 2]}
clus = pd.DataFrame(data=e)
Desired output: for each user, all the songs that the user has not listened to but that were listened to within the user's cluster. A belongs to cluster 1, and cluster 1 has songs [11, 22, 33, 44], so A hasn't listened to [33, 44]. I achieved that using the Python code below.
user
A [33, 44]
B [55, 66]
C [77]
D [11, 22]
E [11, 99]
F [66]
PYTHON CODE:
import numpy as np

df = pd.merge(data, clus, on='user', how='left').drop_duplicates(['user', 'songs'])
df1 = (df.groupby(['cluster']).apply(lambda x: x.pivot('user', 'songs', 'cluster').isnull())
         .fillna(False)
         .reset_index(level=0, drop=True)
         .sort_index())
s = np.where(df1, ['{}, '.format(x) for x in df1.columns], '')
# remove empty values
s1 = pd.Series([''.join(x).strip(', ') for x in s], index=df1.index)
print(s1)
How to achieve the same in PySpark distributed code?
There could be a better solution than this, but it works.
Assuming that each user belongs to only one cluster,
import pyspark.sql.functions as F
from pyspark.sql.types import *

d = list(zip(['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'],
             [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77]))
data = spark.createDataFrame(d).toDF('user', 'songs')
This gives,
+----+-----+
|user|songs|
+----+-----+
| A| 11|
| A| 22|
| B| 99|
| B| 11|
| C| 11|
| D| 44|
| C| 66|
| E| 66|
| D| 33|
| E| 55|
| F| 11|
| F| 77|
+----+-----+
Creating clusters assuming each user belongs only to one cluster,
c = list(zip(['A', 'B', 'C', 'D', 'E', 'F'], [1, 2, 3, 1, 2, 3]))
clus = spark.createDataFrame(c).toDF('user', 'cluster')
clus.show()
+----+-------+
|user|cluster|
+----+-------+
| A| 1|
| B| 2|
| C| 3|
| D| 1|
| E| 2|
| F| 3|
+----+-------+
Now, we get all songs heard by a user along with their cluster,
all_combine = (data.groupBy('user').agg(F.collect_list('songs').alias('songs'))
                   .join(clus, data.user == clus.user)
                   .select(data['user'], 'songs', 'cluster'))
all_combine.show()
+----+--------+-------+
|user| songs|cluster|
+----+--------+-------+
| F|[11, 77]| 3|
| E|[66, 55]| 2|
| B|[99, 11]| 2|
| D|[44, 33]| 1|
| C|[11, 66]| 3|
| A|[11, 22]| 1|
+----+--------+-------+
Finally, calculating all songs heard in a cluster and subsequently all songs not heard by a user in that cluster,
not_listened = F.udf(lambda song, all_: list(set(all_) - set(song)), ArrayType(IntegerType()))

grouped_clusters = (data.join(clus, data.user == clus.user).select(data['user'], 'songs', 'cluster')
                        .groupby('cluster').agg(F.collect_list('songs').alias('all_songs'))
                        .join(all_combine, ['cluster'])
                        .select('user', all_combine['cluster'], 'songs', 'all_songs')
                        .select('user', not_listened(F.col('songs'), F.col('all_songs')).alias('not_listened')))
grouped_clusters.show()
We get output as,
+----+------------+
|user|not_listened|
+----+------------+
| D| [11, 22]|
| A| [33, 44]|
| F| [66]|
| C| [77]|
| E| [99, 11]|
| B| [66, 55]|
+----+------------+

How can I apply groupBy in a dataframe without removing other columns of the not-grouped instances in Pyspark?

I am trying to perform an operation with groupBy() in PySpark, but I run into the following problem:
I have a dataframe (df1) which has 3 attributes: attr1, attr2 and attr3. I want to apply a groupBy operation over that dataframe, taking into account only the attributes attr1 and attr2. Of course, when groupBy(attr1, attr2) is applied over df1, it generates groups of those instances that are equal to each other.
What I want to get is:
If I apply the groupBy() operation and some instances are equal, I want to generate an independent dataframe with those groups; and if there are instances that are not equal to any other one, I want to keep those in another dataframe with all 3 attributes: attr1, attr2 and attr3 (the one not used to group by).
Is it possible?
from pyspark.sql import functions as f
from pyspark.sql import *

spark = SparkSession.builder.appName('MyApp').getOrCreate()

df = spark.createDataFrame([('a', 'a', 3), ('a', 'c', 5), ('b', 'a', 4),
                            ('c', 'a', 2), ('a', 'a', 9), ('b', 'a', 9)],
                           ('attr1', 'attr2', 'attr3'))

df = df.withColumn('count', f.count('attr3').over(Window.partitionBy('attr1', 'attr2'))).cache()
output:
+-----+-----+-----+-----+
|attr1|attr2|attr3|count|
+-----+-----+-----+-----+
| b| a| 4| 2|
| b| a| 9| 2|
| a| c| 5| 1|
| c| a| 2| 1|
| a| a| 3| 2|
| a| a| 9| 2|
+-----+-----+-----+-----+
and
an_independent_dataframe = df.filter(df['count'] > 1).groupBy('attr1', 'attr2').sum('attr3')
+-----+-----+----------+
|attr1|attr2|sum(attr3)|
+-----+-----+----------+
| b| a| 13|
| a| a| 12|
+-----+-----+----------+
another_dataframe = df.filter(df['count'] == 1).select('attr1', "attr2", "attr3")
+-----+-----+-----+
|attr1|attr2|attr3|
+-----+-----+-----+
| a| c| 5|
| c| a| 2|
+-----+-----+-----+
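If the goal is to keep the original rows of the repeated groups rather than an aggregate, the same window-based count column can be reused; a small variant of the filters above:
repeated_rows = df.filter(df['count'] > 1).select('attr1', 'attr2', 'attr3')
repeated_rows.show()
# keeps (b,a,4), (b,a,9), (a,a,3), (a,a,9); row order may vary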
