PySpark replace value in several columns at once - python

I want to replace a value in a dataframe column with another value, and I have to do it for many columns (let's say 30 or 100 columns).
I've gone through this and this already.
from pyspark.sql.functions import when, lit, col
df = sc.parallelize([(1, "foo", "val"), (2, "bar", "baz"), (3, "baz", "buz")]).toDF(["x", "y", "z"])
df.show()
# I can replace "baz" with null separately in columns y and z
def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("y", replace(col("y"), "baz"))\
       .withColumn("z", replace(col("z"), "baz"))
df.show()
I can replace "baz" with null separately in columns y and z. But I want to do it for all columns - something like a list comprehension, as below:
[replace(df[col], "baz") for col in df.columns]

Since there are on the order of 30-100 columns, let's add a few more columns to the DataFrame to generalize it well.
# Loading the requisite packages
from pyspark.sql.functions import col, when
df = sc.parallelize([(1,"foo","val","baz","gun","can","baz","buz","oof"),
(2,"bar","baz","baz","baz","got","pet","stu","got"),
(3,"baz","buz","pun","iam","you","omg","sic","baz")]).toDF(["x","y","z","a","b","c","d","e","f"])
df.show()
+---+---+---+---+---+---+---+---+---+
| x| y| z| a| b| c| d| e| f|
+---+---+---+---+---+---+---+---+---+
| 1|foo|val|baz|gun|can|baz|buz|oof|
| 2|bar|baz|baz|baz|got|pet|stu|got|
| 3|baz|buz|pun|iam|you|omg|sic|baz|
+---+---+---+---+---+---+---+---+---+
Let's say we want to replace baz with null in all the columns except x and a. Use a list comprehension to choose the columns where the replacement has to be done.
# This contains the list of columns where we apply replace() function
all_column_names = df.columns
print(all_column_names)
['x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f']
columns_to_remove = ['x','a']
columns_for_replacement = [i for i in all_column_names if i not in columns_to_remove]
print(columns_for_replacement)
['y', 'z', 'b', 'c', 'd', 'e', 'f']
Finally, do the replacement using when(), which is essentially an if/else (SQL CASE WHEN) expression.
# Doing the replacement on all the requisite columns
for i in columns_for_replacement:
    df = df.withColumn(i, when((col(i) == 'baz'), None).otherwise(col(i)))
df.show()
+---+----+----+---+----+---+----+---+----+
| x| y| z| a| b| c| d| e| f|
+---+----+----+---+----+---+----+---+----+
| 1| foo| val|baz| gun|can|null|buz| oof|
| 2| bar|null|baz|null|got| pet|stu| got|
| 3|null| buz|pun| iam|you| omg|sic|null|
+---+----+----+---+----+---+----+---+----+
There is no need to create a UDF to do the replacement if it can be done with a normal if/else expression. UDFs are in general a costly operation and should be avoided whenever possible.
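As a side note, the same replacement can also be expressed in a single select() instead of a loop of withColumn() calls; this is only a sketch of the equivalent expression, reusing the columns_for_replacement list from above:
df = df.select([
    when(col(c) == 'baz', None).otherwise(col(c)).alias(c)
    if c in columns_for_replacement else col(c)
    for c in df.columns
])
df.show()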

You can also use reduce():
from functools import reduce
reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")), ['y', 'z'], df).show()
#+---+----+----+
#| x| y| z|
#+---+----+----+
#| 1| foo| val|
#| 2| bar|null|
#| 3|null| buz|
#+---+----+----+
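To cover many columns, the same pattern can take the whole column list and use df as the initial value (a sketch, reusing the replace() helper and col from the question; the excluded columns are just an example):
from functools import reduce

cols_to_replace = [c for c in df.columns if c not in ('x', 'a')]
reduce(lambda d, c: d.withColumn(c, replace(col(c), "baz")), cols_to_replace, df).show()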

You can use select and a list comprehension:
import pyspark.sql.functions as f

df = df.select([replace(f.col(column), 'baz').alias(column) if column != 'x' else f.col(column)
                for column in df.columns])
df.show()
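If more than one column has to be left untouched (for example x and a, as in the earlier answer), a set of excluded names works the same way; a minimal sketch:
skip = {'x', 'a'}
df = df.select([replace(f.col(c), 'baz').alias(c) if c not in skip else f.col(c)
                for c in df.columns])
df.show()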

Related

PySpark Dataframe: Column based on existence and Value of another column

What I am trying to do is set a value "EXIST" based on an isNotNull() check on a potentially non-existing column.
What I mean is:
I have a dataframe A like
A B C
-------------------
1 a "Test"
2 b null
3 c "Test2"
Where C isn't necessarily defined. I want to define another DataFrame B:
B:
D E F
---------------
1 a 'J'
2 b 'N'
3 c 'J'
Where column B.F is either 'N' everywhere (in case A.C is not defined), or 'N' if A.C's value is null and 'J' if the value is not null.
How would you proceed at this point?
I thought of using a when statement:
DF.withColumn('F', when(A.C.isNotNull(), 'J').otherwise('N'))
but how would you check for the existence of the column in the same statement?
First you check if the column exists. If not, you create it.
from pyspark.sql import functions as F
if "c" not in df.columns:
df = df.withColumn("c", F.lit(None))
Then you create the column F:
df.withColumn('F', F.when(F.col("C").isNotNull(), 'J').otherwise('N'))
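Putting both steps together (a minimal sketch, using the question's column names; df stands for DataFrame A):
from pyspark.sql import functions as F

if "C" not in df.columns:
    df = df.withColumn("C", F.lit(None))

df = df.withColumn("F", F.when(F.col("C").isNotNull(), "J").otherwise("N"))
df.show()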
You can check whether the column is present using 'c' in data_sdf.columns. Here's an example using it.
Let's say the input dataframe has 3 columns - ['a', 'b', 'c']
from pyspark.sql import functions as func

data_sdf. \
    withColumn('d',
               func.when(func.col('c').isNull() if 'c' in data_sdf.columns else func.lit(True),
                         func.lit('N')).
               otherwise(func.lit('J'))). \
    show()
# +---+---+----+---+
# | a| b| c| d|
# +---+---+----+---+
# | 1| 2| 3| J|
# | 1| 2|null| N|
# +---+---+----+---+
Now, let's say there are only 2 columns - ['a', 'b']
# +---+---+---+
# | a| b| d|
# +---+---+---+
# | 1| 2| N|
# | 1| 2| N|
# +---+---+---+
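If several possibly-missing columns are involved, a small helper can make the pattern reusable. This is just a sketch with a hypothetical name col_or_null, assuming func is pyspark.sql.functions and data_sdf is the input DataFrame:
from pyspark.sql import functions as func

def col_or_null(sdf, name):
    # return the column if it exists in sdf, otherwise a NULL literal
    return func.col(name) if name in sdf.columns else func.lit(None)

data_sdf.withColumn('d', func.when(col_or_null(data_sdf, 'c').isNull(), 'N').otherwise('J')).show()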

Pyspark: compare one column with other columns and flag if similar

I'm looking to create a column which flags if my column1 value is found in either column2, column3 or column4.
I can do it this way:
import pyspark.sql.functions as f
df.withColumn("FLAG", f.when((f.col("column1") == f.col("column2")) |
(f.col("column1") == f.col("column3")) |
(f.col("column1") == f.col("column4")), 'Y')\
.otherwise('N'))
This takes quite some time and I find it inefficient. I was wondering if there is a better way to write this, perhaps using UDFs? I'm also trying to figure out how to reference which column the match comes from.
Any help helps! Thanks
You can use the Spark built-in function isin():
from pyspark.sql import functions as F
df = (spark
      .sparkContext
      .parallelize([
          ('A', 'A', 'B', 'C'),
          ('A', 'B', 'C', 'A'),
          ('A', 'C', 'A', 'B'),
          ('A', 'X', 'Y', 'Z'),
      ])
      .toDF(['ca', 'cb', 'cc', 'cd'])
)
(df
 .withColumn('flag', F.col('ca').isin(
     F.col('cb'),
     F.col('cc'),
     F.col('cd'),
 ))
 .show()
)
# +---+---+---+---+-----+
# | ca| cb| cc| cd| flag|
# +---+---+---+---+-----+
# | A| A| B| C| true|
# | A| B| C| A| true|
# | A| C| A| B| true|
# | A| X| Y| Z|false|
# +---+---+---+---+-----+
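isin() only gives a boolean, but the question also asks which column the match comes from. A hedged sketch using coalesce() over when() expressions returns the name of the first matching column (null when nothing matches); the output column name matched_in is just an example:
df_flagged = df.withColumn('matched_in', F.coalesce(
    F.when(F.col('cb') == F.col('ca'), F.lit('cb')),
    F.when(F.col('cc') == F.col('ca'), F.lit('cc')),
    F.when(F.col('cd') == F.col('ca'), F.lit('cd')),
))
df_flagged.show()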

Check if value greater than zero exists in all columns of dataframe using pyspark

data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show()
This is the code I was using to get the count of NaN values. I want to write an if-else condition where, if a specific column contains NaN values, I print the name of the column and the count of NaN values.
If I understand you correctly, you want to perform column filtering first, before passing the columns to the list comprehension.
For example, you have a df that looks as follows, where column c is NaN-free:
from pyspark.sql.functions import isnan, count, when
import numpy as np
df = spark.createDataFrame([(1.0, np.nan, 0.0), (np.nan, 2.0, 9.0),
                            (np.nan, 3.0, 8.0), (np.nan, 4.0, 7.0)], ('a', 'b', 'c'))
df.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# |1.0|NaN|0.0|
# |NaN|2.0|9.0|
# |NaN|3.0|8.0|
# |NaN|4.0|7.0|
# +---+---+---+
With the code you already have, you can produce
df.select([count(when((isnan(c)),c)).alias(c) for c in df.columns]).show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 3| 1| 0|
# +---+---+---+
but you want
# +---+---+
# | a| b|
# +---+---+
# | 3| 1|
# +---+---+
In order to have that output, you can try this
rows = df.collect()
# column filtering based on your NaN condition
nan_columns = [key for row in rows for (key, val) in row.asDict().items() if np.isnan(val)]
nan_columns = list(set(nan_columns))  # sort if order is important
# nan_columns
# ['a', 'b']
df.select([count(when(isnan(c), c)).alias(c) for c in nan_columns]).show()
# +---+---+
# | a| b|
# +---+---+
# | 3| 1|
# +---+---+
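A lighter alternative, if the goal is just to print each column name together with its NaN count, is to collect only the one-row count result instead of all rows; a minimal sketch:
nan_counts = df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).collect()[0].asDict()
for name, cnt in nan_counts.items():
    if cnt > 0:
        print(name, cnt)  # only columns that actually contain NaN values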
You can just convert the same comprehension to:
df.select([count(when(col(c) > 0, c)).alias(c) for c in df.columns]).show()
but this will cause problems when you have other dtypes.
So let's go with:
from pyspark.sql.functions import col
# You could do the following two lines in one, but keeping them separate is more readable
schema = {c: c_type for c, c_type in df.dtypes}
numeric_columns = [
    c for c, c_type in schema.items()
    if c_type in "int double bigint".split()
]
df.select([count(when(col(c) > 0, c)).alias(c) for c in numeric_columns]).show()
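To answer the original question (does a value greater than zero exist in every column?), the same counts can then be collected and checked on the driver; a sketch, assuming a zero count means "no positive value" in that column:
positive_counts = df.select([count(when(col(c) > 0, c)).alias(c) for c in numeric_columns]).collect()[0].asDict()
all_columns_have_positive = all(v > 0 for v in positive_counts.values())
print(all_columns_have_positive)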

How to drop all columns with null values in a PySpark DataFrame?

I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that?
The following only drops a single column or rows containing null.
df.where(col("dt_mvmt").isNull()) #doesnt work because I do not have all the columns names or for 1000's of columns
df.filter(df.dt_mvmt.isNotNull()) #same reason as above
df.na.drop() #drops rows that contain null, instead of columns that contain null
For example
a | b | c
1 | | 0
2 | 2 | 3
In the above case it should drop the whole column b because one of its values is empty.
Here is one possible approach for dropping all columns that have NULL values: See here for the source on the code of counting NULL values per column.
import pyspark.sql.functions as F
import pandas as pd

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()
def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df
# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
Hope this helps!
If you need to keep only the rows that have at least one inspected column not null, use this. It runs very fast.
from operator import or_
from functools import reduce
inspected = df.columns
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))

Pyspark - casting multiple columns from Str to Int

I'm attempting to cast multiple String columns to integers in a DataFrame using PySpark 2.1.0. The data set starts as an RDD; when I create a DataFrame from it, it generates the following error:
TypeError: StructType can not accept object 3 in type <class 'int'>
A sample of what I'm trying to do:
import pyspark.sql.types as typ
from pyspark.sql.functions import *
labels = [
    ('A', typ.StringType()),
    ('B', typ.IntegerType()),
    ('C', typ.IntegerType()),
    ('D', typ.IntegerType()),
    ('E', typ.StringType()),
    ('F', typ.IntegerType())
]
rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()
cols_to_cast = [dt[0] for dt in df.dtypes if dt[1]=='string']
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))
df2 = df.select(*(df[dt[0]].cast("integer").alias(dt[0])
                  for dt in df.dtypes if dt[1] == 'string'))
df2.show()
The initial problem is that the DataFrame is not being created from the RDD.
After that, I have tried two ways to do the cast (df2); the first is commented out.
Any suggestions?
Alternatively is there anyway I could use the .withColumn functions for casting all columns in 1 go, instead of specifying each column?
The actual dataset, although not large, has many columns.
The problem isn't your code, it's your data. You are passing a single flat list, which will be treated as a single column instead of the six columns you want.
Try the rdd line as below and it should work fine (notice the extra brackets around the list):
rdd = sc.parallelize([["1", 2, 3, 4, "5", 6]])
Your code with the corrected line above shows me the following output:
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
+---+---+---+---+---+---+
+---+---+
| A| E|
+---+---+
| 1| 5|
+---+---+
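For the second part of the question (casting every string column in one go while keeping the other columns as they are), one way is to branch on df.dtypes inside a single select(); a sketch:
df2 = df.select([df[c].cast("integer").alias(c) if t == 'string' else df[c]
                 for c, t in df.dtypes])
df2.show()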
