How to drop all columns with null values in a PySpark DataFrame? - python

I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that?
The following only drops a single column or rows containing null.
df.where(col("dt_mvmt").isNull())   # doesn't work: I don't know all the column names, and there can be thousands of columns
df.filter(df.dt_mvmt.isNotNull())   # same problem as above
df.na.drop()                        # drops rows that contain null, instead of columns that contain null
For example
a | b | c
1 | | 0
2 | 2 | 3
In the above case the whole column b should be dropped, because one of its values is empty (null).

Here is one possible approach for dropping all columns that contain NULL values; the per-column NULL-counting trick it uses is the same one covered in the "Pyspark - Calculate number of null values in each dataframe column" question further down.
import pandas as pd
import pyspark.sql.functions as F

# Sample data (built with pandas, then converted to a Spark DataFrame)
pdf = pd.DataFrame({'x1': ['a', '1', '2'],
                    'x2': ['b', None, '2'],
                    'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(pdf)
df.show()
def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df
# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
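As a quick check against the a / b / c example from the question (a sketch reusing the drop_null_columns helper and sqlContext from above), only column b is dropped:
df_example = sqlContext.createDataFrame(
    [(1, None, 0), (2, 2, 3)],  # column b contains a null
    ['a', 'b', 'c']
)
drop_null_columns(df_example).show()
# +---+---+
# |  a|  c|
# +---+---+
# |  1|  0|
# |  2|  3|
# +---+---+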
Hope this helps!

If instead we need to keep only the rows that have at least one non-null value among the inspected columns, the following single filter does it and runs quickly:
from functools import reduce
from operator import or_
from pyspark.sql import functions as F

inspected = df.columns
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))
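Equivalently, the built-in null handling can express the same row filter; a minimal sketch using na.drop with how='all', which keeps exactly the rows that have at least one non-null value among the inspected columns:
df = df.na.drop(how='all', subset=inspected)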

Related

Edit the value of all the rows of column with the same id based on the value of another column in pyspark

+---+----+-------+
| id|val1|   val2|
+---+----+-------+
|  1|   Y|Flagged|
|  1|   N|Flagged|
|  2|   N|Flagged|
|  2|   Y|Flagged|
|  2|   N|Flagged|
+---+----+-------+
I have the above table. For each id, I want to check the rows in val1: if there is at least one Y and at least one N, then all the rows with that id should be marked as Flagged in val2. In addition, for more efficient code, I want the processing to jump to the next id once it finds a Y.
Assuming the val1 column contains only Y and N as distinct values, you can group the dataframe by id and aggregate val1 with countDistinct to count the unique values, create a new column flagged from the condition "distinct count > 1", and finally join this new column back to the original dataframe to get the result:
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1 may contain other values besides Y and N, first mask the values that are not Y or N:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
+---+----+-------+
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly, as a column named flagged with boolean values makes more sense.
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array([F.lit('N'), F.lit('Y')])
out = df.withColumn(
    "Flagged",
    F.array_intersect(a, F.collect_set("val1").over(W.partitionBy("id"))) == a
)
out.show()
+---+----+-------+-------+
| id|val1| val2|Flagged|
+---+----+-------+-------+
| 1| Y|Flagged| true|
| 1| N|Flagged| true|
| 2| N|Flagged| true|
| 2| Y|Flagged| true|
| 2| N|Flagged| true|
+---+----+-------+-------+
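If you prefer to express the "at least one Y and at least one N per id" condition literally (so it also behaves sensibly when val1 contains other values), a window-based sketch along the same lines could be:
from pyspark.sql import functions as F, Window as W

w = W.partitionBy('id')
out = df.withColumn(
    'flagged',
    (F.max(F.when(F.col('val1') == 'Y', 1).otherwise(0)).over(w) == 1) &
    (F.max(F.when(F.col('val1') == 'N', 1).otherwise(0)).over(w) == 1)
)
out.show()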

Pyspark add row based on a condition

I have a dataframe with the structure below:
+---+----------+-----------------+
|  A|         B|                C|
+---+----------+-----------------+
|  1|      open|01.01.22 10:05:04|
|  1|In-process|01.01.22 10:07:02|
+---+----------+-----------------+
I need to insert a row before each row whose status is open. The new row should have the same values in the other columns, except that column C should have 1 hour subtracted. How can this be achieved using PySpark?
Instead of "insert a row" – which is a non-trivial issue to solve –, think about it as "union dataset"
Assuming this is your dataset
df = spark.createDataFrame([
(1, 'open', '01.01.22 10:05:04'),
(1, 'In process', '01.01.22 10:07:02'),
], ['a', 'b', 'c'])
+---+----------+-----------------+
| a| b| c|
+---+----------+-----------------+
| 1| open|01.01.22 10:05:04|
| 1|In process|01.01.22 10:07:02|
+---+----------+-----------------+
Based on your rule, we can construct another dataset like this
from pyspark.sql import functions as F
df_new = (df
.where(F.col('b') == 'open')
.withColumn('b', F.lit('Before open'))
.withColumn('c', F.to_timestamp('c', 'dd.MM.yy HH:mm:ss')) # convert text to date with custom date format
.withColumn('c', F.col('c') - F.expr('interval 1 hour')) # subtract 1 hour
.withColumn('c', F.from_unixtime(F.unix_timestamp('c'), 'dd.MM.yy HH:mm:ss')) # revert to custom date format
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
+---+-----------+-----------------+
Now you just need to union them together, and sort if you want to "see" it
(df
.union(df_new)
.orderBy('a', 'c')
.show()
)
+---+-----------+-----------------+
| a| b| c|
+---+-----------+-----------------+
| 1|Before open|01.01.22 09:05:04|
| 1| open|01.01.22 10:05:04|
| 1| In process|01.01.22 10:07:02|
+---+-----------+-----------------+
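One hedge on the union step: union matches columns by position, so if the two frames might not share the same column order, unionByName (available since Spark 2.3) is the safer variant:
(df
 .unionByName(df_new)
 .orderBy('a', 'c')
 .show()
)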

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you want the results in a single column instead (one row per original column), you can union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
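As an alternative sketch, the single aggregated row can also be unpivoted in one pass with Spark SQL's stack function (built here from the df_agg frame defined above):
stack_expr = "stack({n}, {pairs}) as (Column_Name, NULL_Count)".format(
    n=len(df_agg.columns),
    pairs=", ".join("'{c}', `{c}`".format(c=c) for c in df_agg.columns)
)
df_agg_col = df_agg.select(F.expr(stack_expr))
df_agg_col.show()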
A Scala alternative could be (run in spark-shell, or add the usual imports of spark.implicits._ and org.apache.spark.sql.functions._):
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column computed from only the columns that are not updatable; this serves as the unique id. So let's say col1 and col2 can change value (can be updated), but col3,...,coln are unique. I created the hash as hash(col3,...,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns compared to dataframe A, so a simple union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
'col_1',
f.when(
~f.isnull(f.col('b.col_2')),
f.col('b.col_2')
).otherwise(f.col('a.col_2')).alias('col_2'),
'b.col_3'
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
*[
['col_1'] +
[
f.when(
~f.isnull(f.col('b.{}'.format(c))),
f.col('b.{}'.format(c))
).otherwise(f.col('a.{}'.format(c))).alias(c)
for c in cols_to_update
] +
['b.col_3']
]
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. First identify the subset of dfA that will be updated (replaceDf) by performing an inner join on keyCols (a list). Then subtract this replaceDf from dfA and union the result with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()
Even though dfA and dfB have different columns, you can still overcome this by obtaining the list of columns from both DataFrames and taking their union. Then, instead of select('a.*'), prepare a select that lists the columns of dfA that also exist in dfB, plus "null as colname" for the columns of dfB that do not exist in dfA.
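A rough sketch of that column-alignment idea, assuming (as in the question) that dfB contains every column of dfA plus some extras, with keyCols as above:
from pyspark.sql import functions as F

# Align dfA to dfB's schema: keep dfA's own columns, add typed nulls for the missing ones
dfB_types = dict(dfB.dtypes)
dfA_aligned = dfA.select(
    *[F.col(c) if c in dfA.columns else F.lit(None).cast(dfB_types[c]).alias(c)
      for c in dfB.columns]
)
replaceDf = dfA_aligned.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA_aligned.subtract(replaceDf).union(dfB)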
If you want to keep only unique values, and require strictly correct results, then union followed by dropDuplicates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
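The hash columns built in the question can also be used with a left_anti join: keep the rows of A whose key hash does not appear in B, and take every row of B (updated or brand new). A sketch under the question's assumptions; it presumes the 'hash' columns were added as shown there and that A has already been aligned to B's columns (nulls for the extras) as discussed above, otherwise the union will fail:
# rows of A whose key does not appear in B stay as they are
A_untouched = A.join(B.select('hash'), on='hash', how='left_anti')
# every row of B (updated or brand new) wins; drop the helper column at the end
result = A_untouched.union(B).drop('hash')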

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names:
df = df1.join(df2, df1['id'] == df2['id'])
Join works fine but you can't call the id column because it is ambiguous and you would get the following exception:
pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous,
could be: id#5691, id#5918.;"
This makes id not usable anymore...
The following function solves the problem:
def join(df1, df2, cond, how='left'):
    df = df1.join(df2, cond, how=how)
    repeated_columns = [c for c in df1.columns if c in df2.columns]
    for col in repeated_columns:
        df = df.drop(df2[col])
    return df
What I don't like about it is that I have to iterate over the column names and delete them one by one. This looks really clunky...
Do you know of any other solution that will either join and remove duplicates more elegantly or delete multiple columns without iterating over each of them?
If the join columns at both data frames have the same names and you only need equi join, you can specify the join columns as a list, in which case the result will only keep one of the join columns:
df1.show()
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 4| 4|
| 5| 5|
+---+----+
df2.show()
+---+----+
| id|val2|
+---+----+
| 1| 2|
| 1| 3|
| 2| 4|
| 3| 5|
+---+----+
df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
Otherwise you need to give the join data frames alias and refer to the duplicated columns by the alias later:
df1.alias("a").join(
df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
With df.join(other, on, how), when on is a column name string or a list of column name strings, the returned dataframe will not have duplicate columns.
When on is a join expression, it will result in duplicate columns. We can use .drop(df.a) to drop the duplicated columns. Example:
cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# result will have duplicate column a
result = df.join(other, cond, 'inner').drop(df.a)
Assuming 'a' is a dataframe with column 'id' and 'b' is another dataframe with column 'id', I use the following two methods to remove duplicates:
Method 1: Using a string join expression as opposed to a boolean expression. This automatically removes the duplicate column for you:
a.join(b, 'id')
Method 2: Renaming the column before the join and dropping it after
b = b.withColumnRenamed('id', 'b_id')
joinexpr = a['id'] == b['b_id']
a.join(b, joinexpr).drop('b_id')
The code below works with Spark 1.6.0 and above.
salespeople_df.show()
+---+------+-----+
|Num| Name|Store|
+---+------+-----+
| 1| Henry| 100|
| 2| Karen| 100|
| 3| Paul| 101|
| 4| Jimmy| 102|
| 5|Janice| 103|
+---+------+-----+
storeaddress_df.show()
+-----+--------------------+
|Store| Address|
+-----+--------------------+
| 100| 64 E Illinos Ave|
| 101| 74 Grand Pl|
| 102| 2298 Hwy 7|
| 103|No address available|
+-----+--------------------+
Assuming -in this example- that the name of the shared column is the same:
joined=salespeople_df.join(storeaddress_df, ['Store'])
joined.orderBy('Num', ascending=True).show()
+-----+---+------+--------------------+
|Store|Num| Name| Address|
+-----+---+------+--------------------+
| 100| 1| Henry| 64 E Illinos Ave|
| 100| 2| Karen| 64 E Illinos Ave|
| 101| 3| Paul| 74 Grand Pl|
| 102| 4| Jimmy| 2298 Hwy 7|
| 103| 5|Janice|No address available|
+-----+---+------+--------------------+
.join will prevent the duplication of the shared column.
Let's assume that you want to remove the column Num in this example, you can just use .drop('colname')
joined=joined.drop('Num')
joined.show()
+-----+------+--------------------+
|Store| Name| Address|
+-----+------+--------------------+
| 103|Janice|No address available|
| 100| Henry| 64 E Illinos Ave|
| 100| Karen| 64 E Illinos Ave|
| 101| Paul| 74 Grand Pl|
| 102| Jimmy| 2298 Hwy 7|
+-----+------+--------------------+
After I've joined multiple tables together, I run them through a simple function to drop columns in the DF if it encounters duplicates while walking from left to right. Alternatively, you could rename these columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = dropDupeDfCols(NamesAndDates)
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where dropDupeDfCols is defined as:
def dropDupeDfCols(df):
    newcols = []
    dupcols = []

    for i in range(len(df.columns)):
        if df.columns[i] not in newcols:
            newcols.append(df.columns[i])
        else:
            dupcols.append(i)

    df = df.toDF(*[str(i) for i in range(len(df.columns))])
    for dupcol in dupcols:
        df = df.drop(str(dupcol))

    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Date'].
In my case I had a dataframe with multiple duplicate columns after joins, and I was trying to save that dataframe in CSV format, but because of the duplicate columns I was getting an error. I followed the steps below to drop the duplicate columns. The code is in Scala.
1) Rename all the duplicate columns and make a new dataframe
2) Make a separate list of all the renamed columns
3) Make a new dataframe with all columns (including the renamed ones from step 1)
4) Drop all the renamed columns
private def removeDuplicateColumns(dataFrame: DataFrame): DataFrame = {
  var allColumns: mutable.MutableList[String] = mutable.MutableList()
  val dup_Columns: mutable.MutableList[String] = mutable.MutableList()
  dataFrame.columns.foreach((i: String) => {
    if (allColumns.contains(i)) {
      allColumns += "dup_" + i
      dup_Columns += "dup_" + i
    } else {
      allColumns += i
    }
  })
  val columnSeq = allColumns.toSeq
  val df = dataFrame.toDF(columnSeq: _*)
  val unDF = df.drop(dup_Columns: _*)
  unDF
}
To call the above function, use the code below and pass the dataframe that contains the duplicate columns:
val uniColDF = removeDuplicateColumns(df)
Here is a simple solution for removing a duplicate column:
final_result=df1.join(df2,(df1['subjectid']==df2['subjectid']),"left").drop(df1['subjectid'])
If you join on a list or string, duplicate columns are automatically removed. This is a Scala solution; you could translate the same idea into any language:
// get a list of duplicate columns or use a list/seq
// of columns you would like to join on (note that this list
// should include columns for which you do not want duplicates)
val duplicateCols = df1.columns.intersect(df2.columns)
// no duplicate columns in resulting DF
df1.join(df2, duplicateCols.distinct.toSeq)
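A rough PySpark sketch of the same idea (illustrative; assumes an equi-join on all of the shared column names is what you want):
# columns present in both frames become the join keys, so they appear only once in the result
duplicate_cols = [c for c in df1.columns if c in df2.columns]
df1.join(df2, on=duplicate_cols).show()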
Spark SQL version of this answer:
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("select * from t1 inner join t2 using (id)").show()
# +---+----+----+
# | id|val1|val2|
# +---+----+----+
# | 1| 2| 2|
# | 1| 2| 3|
# | 2| 3| 4|
# +---+----+----+
This works for me when multiple columns are used in the join and the join keys are not simple column-name strings, so more than one column would otherwise be duplicated:
final_data = mdf1.alias("a").join(
    df3.alias("b"),
    (mdf1.unique_product_id == df3.unique_product_id) &
    (mdf1.year_week == df3.year_week),
    "left"
).select("a.*", "b.promotion_id")
Use a.* to select all columns from one table, and pick out specific columns from the other table.
