PySpark Dataframe: Column based on existence and Value of another column - python

What I am trying to do is set a value based on an .isNotNull() check on a potentially non-existing column.
What I mean is:
I have a dataframe A like
A B C
-------------------
1 a "Test"
2 b null
3 c "Test2"
Where C isn't necessarily defined. I want to define another DataFrame B
B:
D E F
---------------
1 a 'J'
2 b 'N'
3 c 'J'
Where column B.F is 'N' everywhere if A.C does not exist, or otherwise 'N' if A.C's value is null and 'J' if the value is not null.
How would you proceed at this point?
I thought of using a when statement:
DF.withColumn('F', when(A.C.isNotNull(), 'J').otherwise('N'))
but how would you check for the existence of the Column in the same statement?

First you check if the column exists. If not, you create it.
from pyspark.sql import functions as F

if "C" not in df.columns:
    df = df.withColumn("C", F.lit(None))
Then you create the column F:
df = df.withColumn('F', F.when(F.col("C").isNotNull(), 'J').otherwise('N'))

You can check the column's presence using 'c' in data_sdf.columns. Here's an example using it.
Let's say the input dataframe has 3 columns - ['a', 'b', 'c']
from pyspark.sql import functions as func

data_sdf. \
    withColumn('d',
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True), func.lit('N')).
               otherwise(func.lit('J'))
               ). \
    show()
# +---+---+----+---+
# | a| b| c| d|
# +---+---+----+---+
# | 1| 2| 3| J|
# | 1| 2|null| N|
# +---+---+----+---+
Now, let's say there are only 2 columns - ['a', 'b']
# +---+---+---+
# | a| b| d|
# +---+---+---+
# | 1| 2| N|
# | 1| 2| N|
# +---+---+---+
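Putting the two answers together, a small helper like the following (a sketch, not taken from either answer; the function name and the SparkSession variable spark are assumptions) covers both the missing-column and the null-value cases:
from pyspark.sql import functions as F

def flag_not_null(df, src_col, out_col):
    # 'J' where src_col is not null, 'N' where it is null,
    # and 'N' everywhere if src_col does not exist at all
    if src_col in df.columns:
        return df.withColumn(out_col, F.when(F.col(src_col).isNotNull(), 'J').otherwise('N'))
    return df.withColumn(out_col, F.lit('N'))

# Usage on the example data from the question
df_a = spark.createDataFrame(
    [(1, 'a', 'Test'), (2, 'b', None), (3, 'c', 'Test2')],
    ['A', 'B', 'C'])
flag_not_null(df_a, 'C', 'F').show()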

Related

Edit the value of all the rows of column with the same id based on the value of another column in pyspark

| id | val1 | val2    |
|----|------|---------|
| 1  | Y    | Flagged |
| 1  | N    | Flagged |
| 2  | N    | Flagged |
| 2  | Y    | Flagged |
| 2  | N    | Flagged |
I have the above table. I want to check the rows in val1 that share the same id: if there is at least one Y and one N, then all rows with that id should be marked flagged in val2. In addition, for more efficient code, I want it to jump to the next id once it finds a Y.
Assuming the val1 column contains only Y and N as unique values, you can group the dataframe by id and aggregate val1 with countDistinct to count the unique values, create a new column flagged for the condition distinct count > 1, and finally join this new column with the original dataframe to get the result:
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1 may contain other values along with Y and N, then first mask the values which are not Y or N:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
+---+----+-------+
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly, as having a column named flagged with boolean values makes more sense.
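If you do want the string label from your example rather than a boolean, one way (a sketch, not part of the answer) is to map the boolean afterwards:
df = df.withColumn('val2', F.when(F.col('flagged'), F.lit('Flagged')))
This leaves val2 as null for ids that are not flagged.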
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array([F.lit('N'), F.lit('Y')])
out = (df.withColumn("Flagged",
                     F.array_intersect(a, F.collect_set("val1").over(W.partitionBy("id"))) == a))
out.show()
+---+----+-------+-------+
| id|val1| val2|Flagged|
+---+----+-------+-------+
| 1| Y|Flagged| true|
| 1| N|Flagged| true|
| 2| N|Flagged| true|
| 2| Y|Flagged| true|
| 2| N|Flagged| true|
+---+----+-------+-------+
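A closely related variant (a sketch, not taken from either answer) avoids both the join and the array comparison by counting the distinct values per id directly over the window; it assumes val1 contains only Y and N:
from pyspark.sql import functions as F, Window as W

# more than one distinct val1 per id means both Y and N occur
out = df.withColumn('flagged', F.size(F.collect_set('val1').over(W.partitionBy('id'))) > 1)
out.show()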

Pyspark add row based on a condition

I have a dataframe with the below structure:

| A | B          | C                 |
|---|------------|-------------------|
| 1 | open       | 01.01.22 10:05:04 |
| 1 | In-process | 01.01.22 10:07:02 |
I need to insert a row before the open row. So I need to check whether the status is open and then add a new row before it, with the other columns keeping the same values except the C column, which gets 1 hour subtracted. How can this be achieved using Pyspark?
Instead of "insert a row" – which is a non-trivial issue to solve –, think about it as "union dataset"
Assuming this is your dataset
df = spark.createDataFrame([
    (1, 'open', '01.01.22 10:05:04'),
    (1, 'In process', '01.01.22 10:07:02'),
], ['a', 'b', 'c'])
+---+----------+-----------------+
| a| b| c|
+---+----------+-----------------+
| 1| open|01.01.22 10:05:04|
| 1|In process|01.01.22 10:07:02|
+---+----------+-----------------+
Based on your rule, we can construct another dataset like this
from pyspark.sql import functions as F
df_new = (df
    .where(F.col('b') == 'open')
    .withColumn('b', F.lit('Before open'))
    .withColumn('c', F.to_timestamp('c', 'dd.MM.yy HH:mm:ss'))  # convert text to timestamp with the custom date format
    .withColumn('c', F.col('c') - F.expr('interval 1 hour'))  # subtract 1 hour
    .withColumn('c', F.from_unixtime(F.unix_timestamp('c'), 'dd.MM.yy HH:mm:ss'))  # revert to the custom date format
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
+---+-----------+-----------------+
Now you just need to union them together, and sort if you want to "see" it
(df
    .union(df_new)
    .orderBy('a', 'c')
    .show()
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
| 1| open|01.01.22 10:05:04|
| 1| In process|01.01.22 10:07:02|
+---+-----------+-----------------+
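One robustness note, not part of the original answer: union matches columns by position. If the column order could ever differ between df and df_new, unionByName (available since Spark 2.3) is the safer choice:
(df
    .unionByName(df_new)  # matches columns by name instead of position
    .orderBy('a', 'c')
    .show()
)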

Pyspark - Select the distinct values from each column

I am trying to find all of the distinct values in each column of a dataframe and show them in one table.
Example data:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| A | C | D |
| A | C | E |
| B | C | E |
| B | C | F |
| B | C | F |
|-----------|-----------|-----------|
Example output:
|-----------|-----------|-----------|
| COL_1 | COL_2 | COL_3 |
|-----------|-----------|-----------|
| A | C | D |
| B | | E |
| | | F |
|-----------|-----------|-----------|
Is this even possible? I have been able to do it in separate tables, but it would be much better all in one table.
Any ideas?
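For reference, the example data above could be constructed like this (a sketch; a SparkSession named spark is assumed):
df = spark.createDataFrame(
    [('A', 'C', 'D'), ('A', 'C', 'D'), ('A', 'C', 'E'),
     ('B', 'C', 'E'), ('B', 'C', 'F'), ('B', 'C', 'F')],
    ['COL_1', 'COL_2', 'COL_3'])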
The simplest thing here would be to use pyspark.sql.functions.collect_set on all of the columns:
import pyspark.sql.functions as f
df.select(*[f.collect_set(c).alias(c) for c in df.columns]).show()
#+------+-----+---------+
#| COL_1|COL_2| COL_3|
#+------+-----+---------+
#|[B, A]| [C]|[F, E, D]|
#+------+-----+---------+
Obviously, this returns the data as one row.
If instead you want the output as you wrote in your question (one row per unique value for each column), it's doable but requires quite a bit of pyspark gymnastics (and any solution likely will be much less efficient).
Nevertheless, here are some options:
Option 1: Explode and Join
You can use pyspark.sql.functions.posexplode to explode the elements in the set of values for each column along with the index in the array. Do this for each column separately and then outer join the resulting list of DataFrames together using functools.reduce:
from functools import reduce
unique_row = df.select(*[f.collect_set(c).alias(c) for c in df.columns])
final_df = reduce(
    lambda a, b: a.join(b, how="outer", on="pos"),
    (unique_row.select(f.posexplode(c).alias("pos", c)) for c in unique_row.columns)
).drop("pos")
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| A| null| E|
#| null| null| D|
#| B| C| F|
#+-----+-----+-----+
Option 2: Select by position
First compute the size of the maximum array and store this in a new column max_length. Then select elements from each array if a value exists at that index.
Once again we use pyspark.sql.functions.posexplode but this time it's just to create a column to represent the index in each array to extract.
Finally, we use the trick of indexing into each array with the value of the pos column as the index.
final_df = df.select(*[f.collect_set(c).alias(c) for c in df.columns])\
    .withColumn("max_length", f.greatest(*[f.size(c) for c in df.columns]))\
    .select("*", f.expr("posexplode(split(repeat(',', max_length-1), ','))"))\
    .select(
        *[
            f.expr(
                "case when size({c}) > pos then {c}[pos] else null end AS {c}".format(c=c)
            )
            for c in df.columns
        ]
    )
final_df.show()
#+-----+-----+-----+
#|COL_1|COL_2|COL_3|
#+-----+-----+-----+
#| B| C| F|
#| A| null| E|
#| null| null| D|
#+-----+-----+-----+

Filter A Dataframe Column based on the values of another column [duplicate]

This question already has an answer here:
Pyspark filter dataframe by columns of another dataframe
(1 answer)
Closed 4 years ago.
Say I have two dataframes,
A:
| a | b | c |
| 1 | 2 | 3 |

B:
| a |
| 1 |
I want to filter the contents of dataframe A based on the values in column a from Dataset B. The equivalent where clause in SQL is like this
WHERE NOT (A.a IN (SELECT a FROM B))
How can I achieve this?
To keep all the rows in the left table where there is a match in the right, you can use a leftsemi join. In this case you only want to keep rows when there is no match in the right table, so you can use a leftanti join:
df = spark.createDataFrame([(1,2,3),(2,3,4)], ["a","b","c"])
df2 = spark.createDataFrame([(1,2)], ["a","b"])
df.join(df2,'a','leftanti').show()
df
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
df2
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
result
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 3| 4|
+---+---+---+
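For contrast, the leftsemi join mentioned above keeps only the rows of df that do have a match in df2 on a (a minimal sketch using the same data):
df.join(df2, 'a', 'leftsemi').show()
# +---+---+---+
# |  a|  b|  c|
# +---+---+---+
# |  1|  2|  3|
# +---+---+---+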
Hope this helps!

How to drop all columns with null values in a PySpark DataFrame?

I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that?
The following only drops a single column or rows containing null.
df.where(col("dt_mvmt").isNull()) #doesnt work because I do not have all the columns names or for 1000's of columns
df.filter(df.dt_mvmt.isNotNull()) #same reason as above
df.na.drop() #drops rows that contain null, instead of columns that contain null
For example
a | b | c
1 | | 0
2 | 2 | 3
In the above case the whole column b should be dropped because one of its values is null.
Here is one possible approach for dropping all columns that have NULL values; it works by counting the NULL values per column.
import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()
def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df
# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
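An equivalent, slightly more compact formulation of the same idea (a sketch, not from the original answer) computes the null counts with a sum of casted booleans:
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df.columns]
).first().asDict()
df_clean = df.drop(*[c for c, n in null_counts.items() if n > 0])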
Hope this helps!
If we need to keep only the rows that have at least one non-null value among the inspected columns, use this instead. It executes quickly.
from operator import or_
from functools import reduce
from pyspark.sql import functions as F

inspected = df.columns
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))
