split delimited column into new columns in pyspark dataframe - python

I need to split the delimited (~) column values into new columns dynamically. The input is a dataframe and a column name list. We are trying to solve this using Spark dataframe functions. Please help.
Input:
|Raw_column_name|
|1~Ram~1000~US|
|2~john~2000~UK|
|3~Marry~7000~IND|
col_names=[id,names,sal,country]
output:
id | names | sal | country
1 | Ram | 1000 | US
2 | john | 2000 | UK
3 | Marry | 7000 | IND

We can use split() and then use the resulting array's elements to create columns.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(func.col('raw_col_split_arr').getItem(0).alias('id'),
           func.col('raw_col_split_arr').getItem(1).alias('name'),
           func.col('raw_col_split_arr').getItem(2).alias('sal'),
           func.col('raw_col_split_arr').getItem(3).alias('country')
           ). \
    show()
# +---+-----+----+-------+
# | id| name| sal|country|
# +---+-----+----+-------+
# | 1| Ram|1000| US|
# | 2| john|2000| UK|
# | 3|Marry|7000| IND|
# +---+-----+----+-------+
If the use case is extended to a dynamic list of columns:
col_names = ['id', 'names', 'sal', 'country']
data_sdf. \
    withColumn('raw_col_split_arr', func.split('raw_column_name', '~')). \
    select(*[func.col('raw_col_split_arr').getItem(i).alias(k) for i, k in enumerate(col_names)]). \
    show()
# +---+-----+----+-------+
# | id|names| sal|country|
# +---+-----+----+-------+
# | 1| Ram|1000| US|
# | 2| john|2000| UK|
# | 3|Marry|7000| IND|
# +---+-----+----+-------+

Another option is the from_csv() function. The only thing that needs to be defined is the schema, which has the added advantage that the data can be parsed to the correct types automatically:
from pyspark.sql.functions import col, from_csv

df = spark.createDataFrame([('1~Ram~1000~US',), ('2~john~2000~UK',), ('3~Marry~7000~IND',)], ["Raw_column_name"])
df.show()

col_names = ['id', 'names', 'sal', 'country']
schema = ','.join([f'{name} string' for name in col_names])
# if custom type conversion is needed
# schema = "id int, names string, sal string, country string"
options = {'sep': '~'}

df2 = (df
    .select(from_csv(col('Raw_column_name'), schema, options).alias('cols'))
    .select(col('cols.*'))
)
df2.printSchema()
df2.show()
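For reference, with the all-string schema above every field stays a string. A minimal sketch of the typed variant hinted at in the comment, reusing df and options from above (df3 is just an illustrative name):
typed_schema = "id int, names string, sal string, country string"
df3 = (df
    .select(from_csv(col('Raw_column_name'), typed_schema, options).alias('cols'))
    .select('cols.*')
)
df3.printSchema()  # id is now parsed as int; the other fields stay string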

Related

How do I filter the column in pyspark?

I am new to pyspark. I want to compare two tables. If the value in one of the columns does not match, I want to print out that column name in a new column. Using the Compare two dataframes Pyspark link, I am able to get that result. Now, I want to filter the new table based on the newly created column.
df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
from pyspark.sql.functions import *
#from pyspark.sql.functions import col, array, when, array_remove
# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]
df3 = df1.join(df2, "id").select(*select_expr)
df3.show()
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 1  | ABC   | 5000 | US      | []           |
| 2  | DEF   | 4000 | CAN     | [address]    |
| 3  | GHI   | 3500 | JPN     | [sal]        |
| 4  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
This is the step where I am getting an error message.
df3.filter(df3.column_names!="")
Error: cannot resolve '(column_names = '')' due to data type mismatch: differing types in '(column_names = '')' (array<string> and string).
I want the following result:
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 1  | DEF   | 4000 | CAN     | [address]    |
| 2  | GHI   | 3500 | JPN     | [sal]        |
| 3  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
You are getting the error because you are comparing an array type to a string. Convert the column_names array to a string first, then the comparison will work:
df3 = df3.withColumn('column_names',concat_ws(";",col("column_names")))
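With the array flattened to a string, the original style of filter works as intended; a minimal sketch, assuming the concat_ws conversion above:
# rows whose column_names string is empty had no differing columns, so drop them
df3.filter(col('column_names') != "").show()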
You can create a UDF to filter and pass the relevant column to it; I hope the code below helps.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# simple filter function
@udf(returnType=BooleanType())
def my_filter(col1):
    return True if len(col1) > 0 else False
df3.filter(my_filter(col('column_names'))).show()
Another way:
from pyspark.sql.functions import array, array_except, col, countDistinct, lit, when

# Do an outer join
new = df1.join(df2.alias('df2'), how='outer', on=['id', 'name', 'sal', 'Address'])

# Count distinct values in each column per id
new1 = new.groupBy('id').agg(*[countDistinct(x).alias(f'{x}') for x in new.drop('id').columns])

# Using case when, where there is more than one distinct value, append the column name to a new column
new2 = new1.select('id', array_except(array(*[when(col(c) != 1, lit(c)) for c in new1.drop('id').columns]), array(lit(None).cast('string'))).alias('column_names'))

# Join back to df2
df2.join(new2, how='right', on='id').show()
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKL_M|4800| CHN| [name, sal]|
+---+-----+----+-------+------------+
Use filter('array_column != array()'). See the example below, which filters out the empty arrays.
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
    show()
# +------------+
# | arrcol|
# +------------+
# | []|
# |[blah, bleh]|
# +------------+
spark.sparkContext.parallelize([([],), (['blah', 'bleh'],)]).toDF(['arrcol']). \
    filter('arrcol != array()'). \
    show()
# +------------+
# | arrcol|
# +------------+
# |[blah, bleh]|
# +------------+
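Applied to the question's df3, the same idea would be (a sketch, assuming column_names is still an array column and has not already been converted to a string):
# keep only the rows where at least one column differed
df3.filter('column_names != array()').show()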

How can I verify the schema (number and name of the columns) of a Dataframe in pyspark?

I have to read a csv file and verify the names and the number of columns of the dataframe. The minimum number of columns is 3 and they have to be: 'id', 'name' and 'phone'. There is no problem with having more columns than that, but it always needs to have at least those 3 columns with the exact names. Otherwise, the program should fail.
For example:
Correct:
+-----+-----+-----+ +-----+-----+-----+-----+
| id| name|phone| | id| name|phone|unit |
+-----+-----+-----+ +-----+-----+-----+-----+
|3940A|jhon |1345 | |3940A|jhon |1345 | 222 |
|2BB56|mike | 492 | |2BB56|mike | 492 | 333 |
|3(401|jose |2938 | |3(401|jose |2938 | 444 |
+-----+-----+-----+ +-----+-----+-----+-----+
Incorrect:
+-----+-----+-----+ +-----+-----+
| sku| nomb|phone| | sku| name|
+-----+-----+-----+ +-----+-----+
|3940A|jhon |1345 | |3940A|jhon |
|2BB56|mike | 492 | |2BB56|mike |
|3(401|jose |2938 | |3(401|jose |
+-----+-----+-----+ +-----+-----+
Using a simple Python if-else statement should do the job:
mandatory_cols = ["id", "name", "phone"]
if all(c in df.columns for c in mandatory_cols):
    ...  # your logic
else:
    raise ValueError("missing columns!")
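A minimal usage sketch placing that check right after the CSV is read, reusing mandatory_cols from above (the file path is hypothetical):
# hypothetical input path; header=True so the column names come from the file
df = spark.read.option("header", True).csv("/path/to/input.csv")
missing = [c for c in mandatory_cols if c not in df.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")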
Here's an example on how to check if columns exist in your dataframe:
from pyspark.sql import Row

def check_columns_exits(cols):
    if 'id' in cols and 'name' in cols and 'phone' in cols:
        print("All required columns are present")
    else:
        print("Does not have all the required columns")

data = [Row(id="3940A", name="john", phone="1345", unit=222),
        Row(id="2BB56", name="mike", phone="492", unit=333)]
df = spark.createDataFrame(data)
check_columns_exits(df.columns)

data1 = [Row(id="3940A", name="john", unit=222),
         Row(id="2BB56", name="mike", unit=333)]
df1 = spark.createDataFrame(data1)
check_columns_exits(df1.columns)
RESULT:
All required columns are present
Does not have all the required columns

is there a way to get date difference over a certain column

I want to calculate the time difference/date difference for each unique name: how long it took for the status to go from order to arrived.
The input dataframe is like this:
+------------+----+-------+------------+
| Date       | id | name  | staus      |
+------------+----+-------+------------+
| 1986/10/15 | A  | john  | order      |
| 1986/10/16 | A  | john  | dispatched |
| 1986/10/18 | A  | john  | arrived    |
| 1986/10/15 | B  | peter | order      |
| 1986/10/16 | B  | peter | dispatched |
| 1986/10/17 | B  | peter | arrived    |
| 1986/10/16 | C  | raul  | order      |
| 1986/10/17 | C  | raul  | dispatched |
| 1986/10/18 | C  | raul  | arrived    |
+------------+----+-------+------------+
the expected output dataset should look similar to this:
+----+-------+------------------------------------------+
| id | name  | time_difference_from_order_to_delivered  |
+----+-------+------------------------------------------+
| A  | john  | 3 days                                    |
| B  | peter | 2 days                                    |
| C  | Raul  | 2 days                                    |
+----+-------+------------------------------------------+
I am stuck on what logic to implement
You can group by and calculate the date diff using a conditional aggregation:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3|
| C| raul| 2|
| B|peter| 2|
+---+-----+---------+
You can also directly subtract the dates, which will return an interval type column:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3 days|
| C| raul| 2 days|
| B|peter| 2 days|
+---+-----+---------+
Assuming the 'order' row has the earliest date and the 'arrived' row the latest, just use aggregation and datediff():
select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;
For more precision, you can use conditional aggregation:
select id, name,
       datediff(max(case when status = 'arrived' then date end),
                min(case when status = 'order' then date end)
       ) as num_days
from t
group by id, name;
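If the data lives in a DataFrame rather than a SQL table, a minimal sketch of running the same query through a temp view (assuming the columns are named Date and staus as in the sample data, so the dates also need parsing with to_date):
# register the DataFrame as a temp view named t so the SQL above can run against it
df.createOrReplaceTempView('t')
spark.sql("""
    select id, name,
           datediff(max(case when staus = 'arrived' then to_date(Date, 'yyyy/MM/dd') end),
                    min(case when staus = 'order'   then to_date(Date, 'yyyy/MM/dd') end)) as num_days
    from t
    group by id, name
""").show()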

Compare two dataframes Pyspark

I'm trying to compare two data frames which have the same number of columns, i.e. 4 columns, with id as the key column in both data frames.
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to DF2, i.e. column_names, which is the list of the columns whose values differ from df1.
df2.withColumn("column_names",udf())
DF1
+----+------+------+---------+
| id | name | sal  | Address |
+----+------+------+---------+
| 1  | ABC  | 5000 | US      |
| 2  | DEF  | 4000 | UK      |
| 3  | GHI  | 3000 | JPN     |
| 4  | JKL  | 4500 | CHN     |
+----+------+------+---------+
DF2:
+----+-------+------+---------+
| id | name  | sal  | Address |
+----+-------+------+---------+
| 1  | ABC   | 5000 | US      |
| 2  | DEF   | 4000 | CAN     |
| 3  | GHI   | 3500 | JPN     |
| 4  | JKL_M | 4800 | CHN     |
+----+-------+------+---------+
Now I want DF3
DF3:
+----+-------+------+---------+--------------+
| id | name  | sal  | Address | column_names |
+----+-------+------+---------+--------------+
| 1  | ABC   | 5000 | US      | []           |
| 2  | DEF   | 4000 | CAN     | [address]    |
| 3  | GHI   | 3500 | JPN     | [sal]        |
| 4  | JKL_M | 4800 | CHN     | [name,sal]   |
+----+-------+------+---------+--------------+
I saw this SO question, How to compare two dataframe and print columns that are different in scala. I tried that, however the result is different.
I'm thinking of going with a UDF by passing a row from each dataframe to the UDF, comparing column by column and returning a column list. However, for that both data frames should be in sorted order so that rows with the same id are sent to the UDF. Sorting is a costly operation here. Any solution?
Assuming that we can use id to join these two datasets, I don't think that there is a need for a UDF. This can be solved just by using an inner join and the array and array_remove functions, among others.
First let's create the two datasets:
df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for each column except id. When the columns aren't equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, from which we finally remove the empty items:
from pyspark.sql.functions import array, array_remove, col, lit, when

# get conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]
df1.join(df2, "id").select(*select_expr).show()
# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# | 1| ABC|5000| US| []|
# | 3| GHI|3500| JPN| [sal]|
# | 2| DEF|4000| CAN| [Address]|
# | 4|JKL_M|4800| CHN| [name, sal]|
# +---+-----+----+-------+------------+
Here is your solution with a UDF. I have renamed the first dataframe's columns dynamically (with an x_ prefix) so that they are not ambiguous during the check. Go through the code below and let me know in case of any concerns.
>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| UK|
| 3| GHI|3000| JPN|
| 4| JKL|4500| CHN|
+---+----+----+-------+
>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| CAN|
| 3| GHI|3500| JPN|
| 4|JKLM|4800| CHN|
+---+----+----+-------+
>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
# udf declaration
>>> def CheckMatch(Column, r):
...     check = ''
...     ColList = Column.split(",")
...     for cc in ColList:
...         if r[cc] != r["x_" + cc]:
...             check = check + "," + cc
...     return check.replace(',', '', 1).split(",")
>>> CheckMatchUDF = udf(CheckMatch)
# final columns required for the select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")
>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)),struct([df3[x] for x in df3.columns])))
.select(finalCol)
.show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKLM|4800| CHN| [name, sal]|
+---+----+----+-------+------------+
Python: PySpark version of my previous Scala code.
import pyspark.sql.functions as f
df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")
columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")
for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()
Scala: Here is my best approach for your problem.
val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")
val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
.withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
.show(false)
First, I join the two dataframes into df3 and use the columns from df1. Folding left over df3 adds temp columns that hold the column name when df1 and df2 share the same id but the values in that column differ.
After that, concat_ws joins those column names; the nulls drop out and only the column names are left.
+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1 |ABC |5000|US | |
|2 |DEF |4000|UK |Address |
|3 |GHI |3000|JPN |sal |
|4 |JKL |4500|CHN |name,sal |
+---+----+----+-------+------------+
The only thing different from your expected result is that the output is not a list but a string.
p.s. I forgot to use PySpark but this is the normal spark, sorry.
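If a true array is needed, one option (a sketch, assuming the final PySpark result above is kept in a variable, here hypothetically df_out) is to split on the comma and drop empty entries:
from pyspark.sql.functions import array_remove, col, split

# turn "name,sal" into [name, sal]; rows with an empty string become empty arrays
df_out = df_out.withColumn("column_names", array_remove(split(col("column_names"), ","), ""))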
You can get that query built for you in PySpark and Scala by the spark-extension package.
It provides the diff transformation that does exactly that.
from gresearch.spark.diff import *
options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff| changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
| N| []| 1| ABC| ABC| 5000| 5000| US| US|
| C| [Address]| 2| DEF| DEF| 4000| 4000| UK| CAN|
| C| [sal]| 3| GHI| GHI| 3000| 3500| JPN| JPN|
| C|[name, sal]| 4| JKL| JKL_M| 4500| 4800| CHN| CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.
There is a wonderful package for pyspark that compares two dataframes. The name of the package is datacompy:
https://capitalone.github.io/datacompy/
example code:
import datacompy as dc
comparison = dc.SparkCompare(spark, base_df=df1, compare_df=df2, join_columns=common_keys, match_rates=True)
comparison.report()
The above code will generate a summary report, and the one below it will give you the mismatches.
comparison.rows_both_mismatch.display()
There are also more features that you can explore.

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
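Another way to reshape the single-row df_agg into two columns, without the union, is the SQL stack() function; a minimal sketch, assuming df_agg from above:
# build "stack(n, 'col1', `col1`, 'col2', `col2`, ...)" dynamically from the aggregated columns
stack_expr = ", ".join(f"'{c}', `{c}`" for c in df_agg.columns)
df_agg_col = df_agg.selectExpr(f"stack({len(df_agg.columns)}, {stack_expr}) as (Column_Name, NULL_Count)")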
A Scala alternative could be:
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()
