I have the DataFrame below. I want to fill the null values in the id column with the next consecutive ids, which must be unique and incrementing.
+----------------+----+--------------------+
|local_student_id| id| last_updated|
+----------------+----+--------------------+
| 610931|null| null|
| 599768| 3|2020-02-26 15:47:...|
| 633719|null| null|
| 612949| 2|2020-02-26 15:47:...|
| 591819| 1|2020-02-26 15:47:...|
| 595539| 4|2020-02-26 15:47:...|
| 423287|null| null|
| 641322| 5|2020-02-26 15:47:...|
+----------------+----+--------------------+
I want the expected output below. Can anybody help me? I am new to PySpark. I also want to add the current timestamp in the last_updated column.
+----------------+----+--------------------+
|local_student_id| id| last_updated|
+----------------+----+--------------------+
| 610931| 6|2020-02-26 16:00:...|
| 599768| 3|2020-02-26 15:47:...|
| 633719| 7|2020-02-26 16:00:...|
| 612949| 2|2020-02-26 15:47:...|
| 591819| 1|2020-02-26 15:47:...|
| 595539| 4|2020-02-26 15:47:...|
| 423287| 8|2020-02-26 16:00:...|
| 641322| 5|2020-02-26 15:47:...|
+----------------+----+--------------------+
Actually, I tried:
final_data = final_data.withColumn(
'id', when(col('id').isNull(), row_number() + max(col('id'))).otherwise(col('id')))
but it gives the error below:
: org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`local_student_id`' is not an aggregate function. Wrap '(CASE WHEN (`id` IS NULL) THEN (CAST(row_number() AS BIGINT) + max(`id`)) ELSE `id` END AS `id`)' in windowing function(s) or wrap '`local_student_id`' in first() (or first_value) if you don't care which value you get.;;
Here is the code you need. Compute the current maximum id once as a scalar, use row_number over a window to extend it for the rows where id is null, and coalesce last_updated with the current timestamp:
from pyspark.sql import functions as F, Window

# current maximum id in the whole table, collected as a single scalar
max_id = final_data.groupBy().max("id").collect()[0][0]

final_data = final_data.withColumn(
    "id",
    F.coalesce(
        F.col("id"),
        # null ids sort first, so they are numbered 1..n and become max_id + 1, max_id + 2, ...
        F.row_number().over(Window.orderBy("id")) + F.lit(max_id)
    )
).withColumn(
    "last_updated",
    # keep the existing timestamp, otherwise stamp the row with the current time
    F.coalesce(
        F.col("last_updated"),
        F.current_timestamp()
    )
)
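One edge case worth hedging against: if every id in the table is null, max_id comes back as None, F.lit(None) is a null literal, and the computed ids stay null. A small guard (assuming it is acceptable to start the sequence at 0):

max_id = final_data.groupBy().max("id").collect()[0][0]
if max_id is None:  # no non-null ids yet, so start counting from 0
    max_id = 0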
I'm new to PySpark.
I would like to find the products not seen after 10 days from the first day they entered the store, and create a column in the DataFrame that is set to 1 for these products and 0 for the rest.
First I need to group the data by product_id, then find the maximum of seen_date, then calculate the difference between import_date and max(seen_date) within each group, and finally create a new column based on the value of date_diff in each group.
Following is the code I used to first get the difference between import_date and seen_date, but it gives an error:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = (Window()
.partitionBy(df.product_id)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("date_diff", F.datediff(F.max(F.from_unixtime(F.col("import_date")).over(w)), F.from_unixtime(F.col("seen_date"))))
Error:
AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.
This is the rest of my code to define a new column based on the date_diff:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

not_seen = udf(lambda x: 0 if x > 10 else 1, IntegerType())
df = df.withColumn('not_seen', not_seen("date_diff"))
Q: Can someone provide a fix for this code or a better approach to solve this problem?
Sample data generation:
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
("123", "2014-05-06", "2014-06-11"),
("125", "2015-01-02", "2015-01-03"),
("125", "2015-01-02", "2015-01-04"),
("128", "2015-08-06", "2015-08-25")]
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2 = dfFromData2.withColumn("import_date",F.unix_timestamp(F.col("import_date"),'yyyy-MM-dd'))
dfFromData2 = dfFromData2.withColumn("seen_date",F.unix_timestamp(F.col("seen_date"),'yyyy-MM-dd'))
+----------+-----------+----------+
|product_id|import_date| seen_date|
+----------+-----------+----------+
| 123| 1399334400|1399420800|
| 123| 1399334400|1402444800|
| 125| 1420156800|1420243200|
| 125| 1420156800|1420329600|
| 128| 1438819200|1440460800|
+----------+-----------+----------+
It is simpler to keep import_date and seen_date as real date columns using to_date instead of unix timestamps; then the per-product maximum can be computed with a window and compared using datediff. Recreating the sample data:
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
("123", "2014-05-06", "2014-06-11"),
("125", "2015-01-02", "2015-01-03"),
("125", "2015-01-02", "2015-01-04"),
("128", "2015-08-06", "2015-08-25")]
df = spark.createDataFrame(data).toDF(*columns)
df = df.withColumn("import_date",F.to_date(F.col("import_date"),'yyyy-MM-dd'))
df = df.withColumn("seen_date",F.to_date(F.col("seen_date"),'yyyy-MM-dd'))
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = (Window()
.partitionBy(df.product_id)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
# Compute the windowed max in its own column first, then take datediff of plain
# columns; putting a window expression inside F.max(...) is what triggered the
# AnalysisException above.
df\
    .withColumn('max_import_date', F.max(F.col("import_date")).over(w))\
    .withColumn("date_diff", F.datediff(F.col('seen_date'), F.col('max_import_date')))\
    .withColumn('not_seen', F.when(F.col('date_diff') > 10, 0).otherwise(1))\
    .show()
+----------+-----------+----------+---------------+---------+--------+
|product_id|import_date| seen_date|max_import_date|date_diff|not_seen|
+----------+-----------+----------+---------------+---------+--------+
| 123| 2014-05-06|2014-05-07| 2014-05-06| 1| 1|
| 123| 2014-05-06|2014-06-11| 2014-05-06| 36| 0|
| 125| 2015-01-02|2015-01-03| 2015-01-02| 1| 1|
| 125| 2015-01-02|2015-01-04| 2015-01-02| 2| 1|
| 128| 2015-08-06|2015-08-25| 2015-08-06| 19| 0|
+----------+-----------+----------+---------------+---------+--------+
Alternatively, you can use the max window function inside a SQL expression to extract the max date per product:
dfFromData2 = dfFromData2.withColumn(
'not_seen',
F.expr('if(datediff(max(from_unixtime(seen_date)) over (partition by product_id), from_unixtime(import_date)) > 10, 1, 0)')
)
dfFromData2.show(truncate=False)
# +----------+-----------+----------+--------+
# |product_id|import_date|seen_date |not_seen|
# +----------+-----------+----------+--------+
# |125 |1420128000 |1420214400|0 |
# |125 |1420128000 |1420300800|0 |
# |123 |1399305600 |1399392000|1 |
# |123 |1399305600 |1402416000|1 |
# |128 |1438790400 |1440432000|1 |
# +----------+-----------+----------+--------+
I have a DataFrame that I want to left_outer join with itself, but I would like to do it with the PySpark API rather than with aliases.
So it is something like:
df = ...
df2 = df
df.join(df2, [df['SomeCol'] == df2['SomeOtherCol']], how='left_outer')
Interestingly, this is incorrect. When I run it I get this warning:
WARN Column: Constructing trivially true equals predicate, 'CAMPAIGN_ID#62L = CAMPAIGN_ID#62L'. Perhaps you need to use aliases.
Is there a way to do this without using alias? Or a clean way with alias? Aliases really make the code a lot dirtier than using the PySpark API directly.
The cleanest way of using aliases is as follows.
Given the following DataFrame:
df.show()
+---+----+---+
| ID|NAME|AGE|
+---+----+---+
| 1|John| 50|
| 2|Anna| 32|
| 3|Josh| 41|
| 4|Paul| 98|
+---+----+---+
In the following example, I am simply appending "2" to each of the column names so that each column has a unique name after the join.
from pyspark.sql import functions

df2 = df.select([functions.col(c).alias(c + "2") for c in df.columns])
df = df.join(df2, on=df['NAME'] == df2['NAME2'], how='left_outer')
df.show()
+---+----+---+---+-----+----+
| ID|NAME|AGE|ID2|NAME2|AGE2|
+---+----+---+---+-----+----+
| 1|John| 50| 1| John| 50|
| 2|Anna| 32| 2| Anna| 32|
| 3|Josh| 41| 3| Josh| 41|
| 4|Paul| 98| 4| Paul| 98|
+---+----+---+---+-----+----+
If I simply did a df.join(df).select("NAME"), PySpark would not know which column I want to select, as they both have exactly the same name. That leads to errors like the following:
AnalysisException: Reference 'NAME' is ambiguous, could be: NAME, NAME.
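As a quick follow-up sketch, once the right-hand columns are renamed the references are unambiguous, so a plain select works without any alias juggling:

# df here is the joined DataFrame from above; NAME and NAME2 are now distinct names
df.select("NAME", "NAME2").show()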
I have a PySpark DataFrame with a text column.
I want to map the values using regex expressions:
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
I also want to map specific values according to a dictionary; I did the following (mapper comes from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finally, the values which have not been mapped by the dictionary or by the regex expressions should be set to null. I do not know how to do this part in combination with the other two.
Is it possible to have something like a dictionary of regex expressions so I can combine the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS-set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first 4 lines are mapped with a dict, and the others are mapped using regex. Unmapped values should be null or ??.
Thank you,
You can achieve it using the contains function:
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()

names = ("GDF", "ADS", "FI", "RH")

def c(col, names):
    # one element per candidate name: the name if the message contains it, "" otherwise
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]

# drop the empty markers and concatenate whatever is left into a single status value
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
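If you would rather drive everything from a dictionary of regex patterns, as asked in the question, here is a hedged sketch (the GDF/ADS patterns are assumptions standing in for the dictionary-mapped values); coalescing one rlike test per entry also leaves unmapped rows as null:

from pyspark.sql import functions as f

# hypothetical pattern -> status mapping; adjust the patterns to your data
regex_map = {".*-RH$": "RH", ".*-FI$": "FI", "^GDF.*": "GDF", "^ADS.*": "ADS"}

status = f.coalesce(
    *[f.when(f.col("message").rlike(pattern), f.lit(value))
      for pattern, value in regex_map.items()]
)

# rows matching no pattern get null, as requested
df.withColumn("status", status).show(truncate=False)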
In my Spark application I have a DataFrame with information like:
+------------------+---------------+
| labels | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] | 000 |
| ['l3','l4','l5'] | 100 |
+------------------+---------------+
What I am trying to achieve is, given a label name as input, to create a single_label_value column that takes the value for that label from the labels_values column.
For example, for label='l3' I would like to retrieve this output:
+------------------+---------------+--------------------+
| labels | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] | 000 | 0 |
| ['l3','l4','l5'] | 100 | 1 |
+------------------+---------------+--------------------+
Here's what I am attempting to use:
selected_label='l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df= my_df.withColumn(
"single_label_value",
F.substring(my_df.labels_values, label_position, 1)
)
But I am getting an error because the substring function does not like the label_position argument.
Is there any way to combine these function outputs without writing a UDF?
Hope this will work for you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.getOrCreate()
mydata=[[['l1','l2','l3'],'000'], [['l3','l4','l5'],'100']]
df = spark.createDataFrame(mydata,schema=["lebels","lebel_values"])
selected_label='l3'
df2=df.select(
"*",
(array_position(df.lebels,selected_label)-1).alias("pos_val"))
df2.createOrReplaceTempView("temp_table")
df3=spark.sql("select *,substring(lebel_values,pos_val,1) as val_pos from temp_table")
df3.show()
+------------+------------+-------+-------+
| lebels|lebel_values|pos_val|val_pos|
+------------+------------+-------+-------+
|[l1, l2, l3]| 000| 2| 0|
|[l3, l4, l5]| 100| 0| 1|
+------------+------------+-------+-------+
array_position gives the 1-based location of the value; subtract 1 from it if you want a 0-based index.
Edited answer -> worked with a temp view. Still looking for a solution using the withColumn option. I hope it helps you for now.
Edit 2 -> answer using the DataFrame API:
df2=df.select(
"*",
(array_position(df.lebels,selected_label)-1).astype("int").alias("pos_val")
)
df3=df2.withColumn("asked_col",expr("substring(lebel_values,pos_val,1)"))
df3.show()
Try maybe:
import pyspark.sql.functions as f
from pyspark.sql.functions import *
selected_label='l3'
df=df.withColumn('single_label_value', f.substring(f.col('labels_values'), array_position(f.col('labels'), lit(selected_label))-1, 1))
df.show()
(for Spark version >= 2.4)
I think lit() was the function you were missing; you can use it to pass constant values to Spark DataFrames.
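As a tiny, hedged illustration of lit() on its own (reusing the df and selected_label already defined above):

# lit() wraps a Python constant as a Column, so it can be used
# wherever Spark expects a column expression
df.withColumn("selected_label", f.lit(selected_label)).show()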
I'm just starting to dive into PySpark.
There's a dataset which contains some values I'll demonstrate below, in order to ask about a query I'm not able to create.
This is a sample of the actual dataset, which contains roughly 20K rows. I'm reading this CSV file in the PySpark shell as a DataFrame and trying to translate some basic SQL queries on this data to get hands-on practice. Below is one such query I'm not able to do:
1. Which country has the least number of rows for a given Government type (4th column)?
There are other queries I've created myself that I can do in SQL, but I'm stuck understanding this one. If I get an idea for this, it will be fairly easy to apply to the other ones.
This is the only line I could come up with after much debugging:
df.filter(df.Government=='Democratic').select('Country').show()
I'm not sure how to approach this problem statement. Any ideas?
Here is how you can do it:
from pyspark.sql import Row
import pyspark.sql.functions as f

Demography = Row("City", "Country", "Population", "Government")
demo1 = Demography("a","AD",1.2,"Democratic")
demo2 = Demography("b","AD",1.2,"Democratic")
demo3 = Demography("c","AD",1.2,"Democratic")
demo4 = Demography("m","XX",1.2,"Democratic")
demo5 = Demography("n","XX",1.2,"Democratic")
demo6 = Demography("o","XX",1.2,"Democratic")
demo7 = Demography("q","XX",1.2,"Democratic")
demographic_data = [demo1,demo2,demo3,demo4,demo5,demo6,demo7]
demographic_data_df = spark.createDataFrame(demographic_data)
demographic_data_df.show(10)
+----+-------+----------+----------+
|City|Country|Population|Government|
+----+-------+----------+----------+
| a| AD| 1.2|Democratic|
| b| AD| 1.2|Democratic|
| c| AD| 1.2|Democratic|
| m| XX| 1.2|Democratic|
| n| XX| 1.2|Democratic|
| o| XX| 1.2|Democratic|
| q| XX| 1.2|Democratic|
+----+-------+----------+----------+
new = demographic_data_df.groupBy('Country').count().select('Country', f.col('count').alias('n'))
max = new.agg(f.max('n').alias('n'))
new.join(max, on="n", how="inner").show()
+---+-------+
| n|Country|
+---+-------+
| 4| XX|
+---+-------+
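Since the question actually asks for the country with the least number of rows, the same pattern works with f.min in place of f.max; a quick sketch reusing new from above:

least = new.agg(f.min('n').alias('n'))
new.join(least, on='n', how='inner').show()
# expected result: the AD row (3 cities), since XX has 4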
The other option is to register the DataFrame as a temporary table and run normal SQL queries against it. To register it as a temporary table you can do the following:
demographic_data_df.registerTempTable("demographic_data_table")
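For example, a hedged sketch of the same "least rows" query in plain SQL against the table registered above:

spark.sql("""
    SELECT Country, COUNT(*) AS n
    FROM demographic_data_table
    GROUP BY Country
    ORDER BY n ASC
    LIMIT 1
""").show()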
Hope that helps