I have two dataframes, brand_name and poi_name.
Dataframe 1 (brand_name):
+-------------+
|brand_stop[0]|
+-------------+
|TOASTMASTERS |
|USBORNE |
|ARBONNE |
|USBORNE |
|ARBONNE |
|ACADEMY |
|ARBONNE |
|USBORNE |
|USBORNE |
|PILLAR |
+-------------+
Dataframe 2 (poi_name):
+---------------------------------------+
|Name |
+---------------------------------------+
|TOASTMASTERS DISTRICT 48 |
|USBORNE BOOKS AND MORE |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|ARBONNE |
|ACADEMY, LTD. |
|ARBONNE |
|USBORNE BOOKS AT HOME |
|USBORNE BOOKS & MORE |
|PILLAR TO POST HOME INSPECTION SERVICES|
+---------------------------------------+
I want to check whether each string in the brand_stop column of dataframe 1 is present in the Name column of dataframe 2. The matching should be done row-wise, and when a match succeeds, that record should be stored in a new column.
I have tried filtering the dataframe using a join:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
contains = udf(lambda s, q: q in s, BooleanType())
like_with_python_udf = (poi_names.join(brand_names1)
.where(contains(col("Name"), col("brand_stop[0]")))
.select(col("Name")))
like_with_python_udf.show()
But this raises an error:
"AnalysisException: u'Detected cartesian product for INNER join between logical plans"
I am new to PySpark. Please help me with this.
Thank you
The Scala code would look like this:
val d1 = Array(("TOASTMASTERS"),("USBORNE"),("ARBONNE"),("USBORNE"),("ARBONNE"),("ACADEMY"),("ARBONNE"),("USBORNE"),("USBORNE"),("PILLAR"))
val rdd1 = sc.parallelize(d1)
val df1 = rdd1.toDF("brand_stop")
val d2 = Array(("TOASTMASTERS DISTRICT 48"),("USBORNE BOOKS AND MORE"),("ARBONNE"),("USBORNE BOOKS AT HOME"),("ARBONNE"),("ACADEMY, LTD."),("ARBONNE"),("USBORNE BOOKS AT HOME"),("USBORNE BOOKS & MORE"),("PILLAR TO POST HOME INSPECTION SERVICES"))
val rdd2 =sc.parallelize(d2)
val df2 = rdd2.toDF("names")
def matchFunc(s1: String, s2: String): Boolean = {
  // True when s2 contains s1 as a literal substring.
  s2.contains(s1)
}
val contains = udf(matchFunc _)
val like_with_python_udf = (df1.join(df2).where(contains(col("brand_stop"), col("names"))).select(col("brand_stop"), col("names")))
like_with_python_udf.show()
The Python code:
from pyspark.sql import Row
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
schema1 = Row("brand_stop")
schema2 = Row("names")
df1 = sc.parallelize([
schema1("TOASTMASTERS"),
schema1("USBORNE"),
schema1("ARBONNE")
]).toDF()
df2 = sc.parallelize([
schema2("TOASTMASTERS DISTRICT 48"),
schema2("USBORNE BOOKS AND MORE"),
schema2("ARBONNE"),
schema2("ACADEMY, LTD."),
schema2("PILLAR TO POST HOME INSPECTION SERVICES")
]).toDF()
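# True when the brand string s occurs inside the POI name q.
contains = udf(lambda s, q: s in q, BooleanType())
# crossJoin makes the cartesian product explicit, which avoids the
# "Detected cartesian product" AnalysisException raised by a bare join().
like_with_python_udf = (df1.crossJoin(df2)
    .where(contains(col("brand_stop"), col("names")))
    .select(col("brand_stop"), col("names")))
like_with_python_udf.show()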
I am getting this output:
+------------+
| brand_stop|
+------------+
|TOASTMASTERS|
| USBORNE|
| ARBONNE|
+------------+
The matching should be done row wise
In that case you have to add some form of index to both dataframes and join on it:
from pyspark.sql.types import *
def index(df):
    # Append a positional index column so the two dataframes can be joined row by row.
    schema = StructType(df.schema.fields + [(StructField("_idx", LongType()))])
    rdd = df.rdd.zipWithIndex().map(lambda x: x[0] + (x[1], ))
    return rdd.toDF(schema)
brand_name = spark.createDataFrame(["TOASTMASTERS", "USBORNE"], "string").toDF("brand_stop")
poi_name = spark.createDataFrame(["TOASTMASTERS DISTRICT 48", "USBORNE BOOKS AND MORE"], "string").toDF("poi_name")
index(brand_name).join(index(poi_name), ["_idx"]).selectExpr("*", "poi_name rlike brand_stop").show()
# +----+------------+--------------------+-------------------------+
# |_idx| brand_stop| poi_name|poi_name RLIKE brand_stop|
# +----+------------+--------------------+-------------------------+
# | 0|TOASTMASTERS|TOASTMASTERS DIST...| true|
# | 1| USBORNE|USBORNE BOOKS AND...| true|
# +----+------------+--------------------+-------------------------+
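The positional comparison that the indexed join performs can be mirrored in plain Python to check the expected result without a Spark session. This is only a sketch using the sample data from the question; the helper name rowwise_matches and the brand "A.C" are hypothetical. It also illustrates one caveat: rlike treats brand_stop as a regular expression, whereas a substring test (Python's in, or Column.contains in Spark) is literal.

```python
import re

# Sample data copied from the question, in the same row order.
brands = ["TOASTMASTERS", "USBORNE", "ARBONNE", "USBORNE", "ARBONNE",
          "ACADEMY", "ARBONNE", "USBORNE", "USBORNE", "PILLAR"]
names = ["TOASTMASTERS DISTRICT 48", "USBORNE BOOKS AND MORE", "ARBONNE",
         "USBORNE BOOKS AT HOME", "ARBONNE", "ACADEMY, LTD.", "ARBONNE",
         "USBORNE BOOKS AT HOME", "USBORNE BOOKS & MORE",
         "PILLAR TO POST HOME INSPECTION SERVICES"]


def rowwise_matches(brands, names):
    """Pair rows by position and test for a literal substring match."""
    return [(b, n, b in n) for b, n in zip(brands, names)]


matches = rowwise_matches(brands, names)

# Regex versus literal matching, using a hypothetical brand "A.C":
# as a regex the dot matches any character, as a literal it does not.
regex_hit = re.search("A.C", "ABC INC") is not None
literal_hit = "A.C" in "ABC INC"
```

If brand names may contain characters such as . or +, a literal test is probably the safer choice.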
I'm new to PySpark.
I would like to find the products not seen after 10 days from the first day they entered the store, then create a column in the dataframe that is set to 1 for these products and 0 for the rest.
First I need to group the data by product_id and find the maximum seen_date. Then I need to calculate the difference between import_date and max(seen_date) within each group, and finally create a new column based on the value of date_diff in each group.
Following is the code I used to first get the difference between import_date and seen_date, but it raises an error:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = (Window()
.partitionBy(df.product_id)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("date_diff", F.datediff(F.max(F.from_unixtime(F.col("import_date")).over(w)), F.from_unixtime(F.col("seen_date"))))
Error:
AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.
This is the rest of my code to define a new column based on the date_diff:
not_seen = udf(lambda x: 0 if x >10 else 1, IntegerType())
df = df.withColumn('not_seen', not_seen("date_diff"))
Q: Can someone provide a fix for this code or a better approach to solve this problem?
sample data generation:
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
("123", "2014-05-06", "2014-06-11"),
("125", "2015-01-02", "2015-01-03"),
("125", "2015-01-02", "2015-01-04"),
("128", "2015-08-06", "2015-08-25")]
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2 = dfFromData2.withColumn("import_date",F.unix_timestamp(F.col("import_date"),'yyyy-MM-dd'))
dfFromData2 = dfFromData2.withColumn("seen_date",F.unix_timestamp(F.col("seen_date"),'yyyy-MM-dd'))
+----------+-----------+----------+
|product_id|import_date| seen_date|
+----------+-----------+----------+
| 123| 1399334400|1399420800|
| 123| 1399334400|1402444800|
| 125| 1420156800|1420243200|
| 125| 1420156800|1420329600|
| 128| 1438819200|1440460800|
+----------+-----------+----------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
        ("123", "2014-05-06", "2014-06-11"),
        ("125", "2015-01-02", "2015-01-03"),
        ("125", "2015-01-02", "2015-01-04"),
        ("128", "2015-08-06", "2015-08-25")]
df = spark.createDataFrame(data).toDF(*columns)
df = df.withColumn("import_date",F.to_date(F.col("import_date"),'yyyy-MM-dd'))
df = df.withColumn("seen_date",F.to_date(F.col("seen_date"),'yyyy-MM-dd'))
w = (Window()
.partitionBy(df.product_id)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df\
.withColumn('max_import_date', F.max(F.col("import_date")).over(w))\
.withColumn("date_diff", F.datediff(F.col('seen_date'), F.col('max_import_date')))\
.withColumn('not_seen', F.when(F.col('date_diff') > 10, 0).otherwise(1))\
.show()
+----------+-----------+----------+---------------+---------+--------+
|product_id|import_date| seen_date|max_import_date|date_diff|not_seen|
+----------+-----------+----------+---------------+---------+--------+
| 123| 2014-05-06|2014-05-07| 2014-05-06| 1| 1|
| 123| 2014-05-06|2014-06-11| 2014-05-06| 36| 0|
| 125| 2015-01-02|2015-01-03| 2015-01-02| 1| 1|
| 125| 2015-01-02|2015-01-04| 2015-01-02| 2| 1|
| 128| 2015-08-06|2015-08-25| 2015-08-06| 19| 0|
+----------+-----------+----------+---------------+---------+--------+
You can use the max windowing function to extract the max date.
dfFromData2 = dfFromData2.withColumn(
'not_seen',
F.expr('if(datediff(max(from_unixtime(seen_date)) over (partition by product_id), from_unixtime(import_date)) > 10, 1, 0)')
)
dfFromData2.show(truncate=False)
# +----------+-----------+----------+--------+
# |product_id|import_date|seen_date |not_seen|
# +----------+-----------+----------+--------+
# |125 |1420128000 |1420214400|0 |
# |125 |1420128000 |1420300800|0 |
# |123 |1399305600 |1399392000|1 |
# |123 |1399305600 |1402416000|1 |
# |128 |1438790400 |1440432000|1 |
# +----------+-----------+----------+--------+
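To check the per-product logic independently of Spark, here is a minimal plain-Python sketch using the sample rows from the question. It follows the convention of the question's own udf (1 when the last sighting is at most 10 days after import, 0 otherwise), which is the opposite of the if(...) expression just above. The function name not_seen_flags and the threshold_days parameter are hypothetical.

```python
from datetime import date

# (product_id, import_date, seen_date), copied from the question's sample data.
rows = [
    ("123", date(2014, 5, 6), date(2014, 5, 7)),
    ("123", date(2014, 5, 6), date(2014, 6, 11)),
    ("125", date(2015, 1, 2), date(2015, 1, 3)),
    ("125", date(2015, 1, 2), date(2015, 1, 4)),
    ("128", date(2015, 8, 6), date(2015, 8, 25)),
]


def not_seen_flags(rows, threshold_days=10):
    """Return {product_id: 1} when the product was last seen within
    threshold_days of its first import date, else 0."""
    first_import = {}
    last_seen = {}
    for pid, imported, seen in rows:
        first_import[pid] = min(imported, first_import.get(pid, imported))
        last_seen[pid] = max(seen, last_seen.get(pid, seen))
    return {
        pid: 1 if (last_seen[pid] - first_import[pid]).days <= threshold_days else 0
        for pid in first_import
    }


flags = not_seen_flags(rows)
```

Product 123 was last seen 36 days after import and product 128 after 19 days, so both get 0; product 125 was last seen after 2 days and gets 1.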
Consider the simple DataFrame:
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf, PandasUDFType
spark = SparkSession.builder.appName('Trial').getOrCreate()
simpleData = (("2000-04-17", "144", 1), \
("2000-07-06", "015", 1), \
("2001-01-23", "015", -1), \
("2001-01-18", "144", -1), \
("2001-04-17", "198", 1), \
("2001-04-18", "036", -1), \
("2001-04-19", "012", -1), \
("2001-04-19", "188", 1), \
("2001-04-25", "188", 1),\
("2001-04-27", "015", 1) \
)
columns= ["dates", "id", "eps"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
Out:
root
|-- dates: string (nullable = true)
|-- id: string (nullable = true)
|-- eps: long (nullable = true)
+----------+---+---+
|dates |id |eps|
+----------+---+---+
|2000-04-17|144|1 |
|2000-07-06|015|1 |
|2001-01-23|015|-1 |
|2001-01-18|144|-1 |
|2001-04-17|198|1 |
|2001-04-18|036|-1 |
|2001-04-19|012|-1 |
|2001-04-19|188|1 |
|2001-04-25|188|1 |
|2001-04-27|015|1 |
+----------+---+---+
I would like to sum the values in the eps column over a rolling window, keeping only the last value for any given ID in the id column. For example, defining a window of 5 rows and assuming we are on 2001-04-17, I want to sum only the last eps value for each unique ID. In those 5 rows we have only 3 different IDs, so the sum must have 3 elements: -1 for ID 144 (fourth row), -1 for ID 015 (third row) and 1 for ID 198 (fifth row), for a total of -1.
In my mind, within the rolling window I should do something like F.sum(groupBy('id').agg(F.last('eps'))) that of course is not possible to achieve in a rolling window.
I obtained the desired result using a UDF.
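import pandas as pd

@pandas_udf(IntegerType(), PandasUDFType.GROUPED_AGG)
def fun_sum(id, eps):
    df = pd.DataFrame()
    df['id'] = id
    df['eps'] = eps
    # Keep the last eps value per id within the window, then sum those values.
    value = int(df.groupby('id')['eps'].last().sum())
    return value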
And then:
# 4 preceding rows plus the current row give the 5-row window described above.
w = Window.orderBy('dates').rowsBetween(-4, 0)
df = df.withColumn('sum', fun_sum(F.col('id'), F.col('eps')).over(w))
The problem is that my dataset contains more than 8 million rows, and performing this task with this UDF takes about 2 hours.
I was wondering whether there is a way to achieve the same result with built-in PySpark functions, avoiding a UDF, or at least whether there is a way to improve the performance of my UDF.
For completeness, the desired output should be:
+----------+---+---+----+
|dates |id |eps|sum |
+----------+---+---+----+
|2000-04-17|144|1 |1 |
|2000-07-06|015|1 |2 |
|2001-01-23|015|-1 |0 |
|2001-01-18|144|-1 |-2 |
|2001-04-17|198|1 |-1 |
|2001-04-18|036|-1 |-2 |
|2001-04-19|012|-1 |-3 |
|2001-04-19|188|1 |-1 |
|2001-04-25|188|1 |0 |
|2001-04-27|015|1 |0 |
+----------+---+---+----+
EDIT: the result must also be achievable using a .rangeBetween() window.
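As a reference for what the rolling computation is expected to return, here is a minimal plain-Python sketch. It assumes the rows are processed in the order they are listed in the question and that the window is the current row plus the 4 rows before it; the function name rolling_last_sum is hypothetical. It is meant for checking results on small samples, not as a replacement for a Spark solution.

```python
# (dates, id, eps), in the order shown in the question.
rows = [
    ("2000-04-17", "144", 1),
    ("2000-07-06", "015", 1),
    ("2001-01-23", "015", -1),
    ("2001-01-18", "144", -1),
    ("2001-04-17", "198", 1),
    ("2001-04-18", "036", -1),
    ("2001-04-19", "012", -1),
    ("2001-04-19", "188", 1),
    ("2001-04-25", "188", 1),
    ("2001-04-27", "015", 1),
]


def rolling_last_sum(rows, window=5):
    """For each row, sum the most recent eps per id over the last `window` rows."""
    result = []
    for i in range(len(rows)):
        last_eps = {}
        # Later rows overwrite earlier ones, so only the last eps per id survives.
        for _, id_, eps in rows[max(0, i - window + 1): i + 1]:
            last_eps[id_] = eps
        result.append(sum(last_eps.values()))
    return result


sums = rolling_last_sum(rows)
print(sums)  # [1, 2, 0, -2, -1, -2, -3, -1, 0, 0]
```

The printed list agrees with the sum column of the desired output above, row by row.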
In case you haven't figured it out yet, here's one way of achieving it, assuming that df is defined and initialised the way it is in your question.
Import the required functions and classes:
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
Create the necessary WindowSpec:
window_spec = (
Window
# Partition by 'id'.
.partitionBy(df.id)
# Order by 'dates', latest dates first.
.orderBy(df.dates.desc())
)
Create a DataFrame with partitioned data:
partitioned_df = (
df
# Use the window function 'row_number()' to populate a new column
# containing a sequential number starting at 1 within a window partition.
.withColumn('row', row_number().over(window_spec))
# Only select the first entry in each partition (i.e. the latest date).
.where(col('row') == 1)
)
Just in case you want to double-check the data:
partitioned_df.show()
# +----------+---+---+---+
# | dates| id|eps|row|
# +----------+---+---+---+
# |2001-04-19|012| -1| 1|
# |2001-04-25|188| 1| 1|
# |2001-04-27|015| 1| 1|
# |2001-04-17|198| 1| 1|
# |2001-01-18|144| -1| 1|
# |2001-04-18|036| -1| 1|
# +----------+---+---+---+
Group and aggregate the data:
sum_rows = (
partitioned_df
# Aggregate data.
.groupBy()
# Sum all rows in 'eps' column.
.sum('eps')
# Get all records as a list of Rows.
.collect()
)
Get the result:
print(f"sum eps: {sum_rows[0][0]}")
# sum eps: 0
I have some data in two tables, one table is a list of dates (with other fields), running from 1st Jan 2014 until yesterday. The other table contains a year's worth of numeric data (coefficients / metric data) in 2020.
A left join between the two datasets on the date table results in all the dates being brought back with only the year of data being populated for 2020, the rest null as expected.
What I want to do is to populate the history to 2014 (and future) with the data in 2020 on a -364 day mapping.
For example
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|null |
#|04/02/2018|null |
#|05/02/2018|null |
#|06/02/2018|null |
#|07/02/2018|null |
#|08/02/2018|null |
#|09/02/2018|null |
#|10/02/2018|null |
#|.... | |
#|02/02/2019|null |
#|03/02/2019|null |
#|04/02/2019|null |
#|05/02/2019|null |
#|06/02/2019|null |
#|07/02/2019|null |
#|08/02/2019|null |
#|09/02/2019|null |
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
This is what I am trying to achieve:
#+----------+-----------+
#|date |metric |
#+----------+-----------+
#|03/02/2018|0.071957531|
#|04/02/2018|0.086542975|
#|05/02/2018|0.023767137|
#|06/02/2018|0.109725808|
#|07/02/2018|0.005774458|
#|08/02/2018|0.056242301|
#|09/02/2018|0.086208715|
#|10/02/2018|0.010676928|
#|.... | |
#|02/02/2019|0.071957531|
#|03/02/2019|0.086542975|
#|04/02/2019|0.023767137|
#|05/02/2019|0.109725808|
#|06/02/2019|0.005774458|
#|07/02/2019|0.056242301|
#|08/02/2019|0.086208715|
#|09/02/2019|0.010676928|
#|... |... |
#|01/02/2020|0.071957531|
#|02/02/2020|0.086542975|
#|03/02/2020|0.023767137|
#|04/02/2020|0.109725808|
#|05/02/2020|0.005774458|
#|06/02/2020|0.056242301|
#|07/02/2020|0.086208715|
#|08/02/2020|0.010676928|
It is worth noting that I may eventually have to go back further than 2014, so a solution that populates the dates dynamically would help!
I'm doing this in Databricks, so I can use various languages, but I would like to focus on Python / PySpark / SQL for solutions.
Any help would be appreciated.
Thanks.
CT
First create new columns month and year (with import pyspark.sql.functions as f):
df_with_month = (df
    .withColumn("month", f.month(f.to_timestamp("date", "dd/MM/yyyy")))
    .withColumn("year", f.year(f.to_timestamp("date", "dd/MM/yyyy"))))
Create a new DataFrame with 2020's data:
df_2020 = (df_with_month.filter(f.col("year") == 2020)
    .withColumnRenamed("metric", "new_metric"))
Join the results on the month:
df_with_metrics = (df_with_month.join(df_2020, df_with_month.month == df_2020.month, "left")
    .drop("metric")
    .withColumnRenamed("new_metric", "metric"))
You can do a self join using the condition that the date difference is a multiple of 364 days:
import pyspark.sql.functions as F
df2 = df.join(
df.toDF('date2', 'metric2'),
F.expr("""
datediff(to_date(date, 'dd/MM/yyyy'), to_date(date2, 'dd/MM/yyyy')) % 364 = 0
and
to_date(date, 'dd/MM/yyyy') <= to_date(date2, 'dd/MM/yyyy')
""")
).select(
'date',
F.coalesce('metric', 'metric2').alias('metric')
).filter('metric is not null')
df2.show(999)
+----------+-----------+
| date| metric|
+----------+-----------+
|03/02/2018|0.071957531|
|04/02/2018|0.086542975|
|05/02/2018|0.023767137|
|06/02/2018|0.109725808|
|07/02/2018|0.005774458|
|08/02/2018|0.056242301|
|09/02/2018|0.086208715|
|10/02/2018|0.010676928|
|02/02/2019|0.071957531|
|03/02/2019|0.086542975|
|04/02/2019|0.023767137|
|05/02/2019|0.109725808|
|06/02/2019|0.005774458|
|07/02/2019|0.056242301|
|08/02/2019|0.086208715|
|09/02/2019|0.010676928|
|01/02/2020|0.071957531|
|02/02/2020|0.086542975|
|03/02/2020|0.023767137|
|04/02/2020|0.109725808|
|05/02/2020|0.005774458|
|06/02/2020|0.056242301|
|07/02/2020|0.086208715|
|08/02/2020|0.010676928|
+----------+-----------+
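The 364-day condition used in the self join can be checked with the standard library alone. The sketch below assumes that the reference dates are the eight February 2020 dates shown in the question; the function name map_forward and its parameters are hypothetical. Because 364 is a multiple of 7, each historical date lands on a reference date with the same weekday.

```python
from datetime import date, timedelta

# Reference dates that carry a metric (01/02/2020 to 08/02/2020 in the question).
reference_dates = {date(2020, 2, 1) + timedelta(days=i) for i in range(8)}


def map_forward(d, reference_dates, step_days=364, max_steps=100):
    """Shift d forward in steps of step_days until it hits a reference date.

    Returns the matching reference date, or None if none is reached
    within max_steps steps.
    """
    candidate = d
    for _ in range(max_steps):
        if candidate in reference_dates:
            return candidate
        candidate += timedelta(days=step_days)
    return None


mapped_2018 = map_forward(date(2018, 2, 3), reference_dates)  # two steps of 364 days
mapped_2019 = map_forward(date(2019, 2, 2), reference_dates)  # one step of 364 days
```

Both dates map to 2020-02-01, which matches the first rows of the expected output (03/02/2018 and 02/02/2019 receive the metric of 01/02/2020).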
First you can add the timestamp column:
df = df.select(F.to_timestamp("date", "dd/MM/yyyy").alias('ts'), '*')
Then you can join on equal month and day:
cond = [F.dayofmonth(F.col('left.ts')) == F.dayofmonth(F.col('right.ts')),
F.month(F.col('left.ts')) == F.month(F.col('right.ts'))]
df.select('ts', 'date').alias('left').\
join(df.filter(F.year('ts')==2020).select('ts', 'metric').alias('right'), cond)\
.orderBy(F.col('left.ts')).drop('ts').show()
I have a PySpark dataframe with a text column.
I want to map its values using regular expressions:
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
I also want to map specific values according to a dictionary, which I did as follows (mapper comes from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finally, the values that have not been mapped by either the dictionary or the regular expressions should be set to null. I do not know how to do this part together with the other two.
Is it possible to have something like a dictionary of regular expressions, so that I can combine the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS/set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first 4 rows are mapped with a dict, and the others are mapped using regex. Unmapped values are null or ??.
Thank you,
You can achieve it using the contains function:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType
df = spark.createDataFrame(
["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
"dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()
names = ("GDF", "ADS", "FI", "RH")
def c(col, names):
    # One column per name: the name itself when the message contains it, else "".
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
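The output above leaves unmapped rows as an empty string rather than null. As a sketch of the "dictionary of regular expressions" the question asks about, the rules below are tried in order and the first match wins, with no match giving None (null in Spark). The RULES list, map_status and status_column are hypothetical names, and the patterns are only an assumption about the intended mapping. The Spark helper imports pyspark lazily so that the plain-Python part runs without Spark installed.

```python
import re

# Ordered (pattern, label) pairs; hypothetical rules based on the sample data.
RULES = [
    (r"^GDF", "GDF"),
    (r"^ADS", "ADS"),
    (r"-RH$", "RH"),
    (r"-FI$", "FI"),
]


def map_status(message, rules=RULES):
    """Return the label of the first rule whose pattern is found in message,
    or None when no rule applies."""
    for pattern, label in rules:
        if re.search(pattern, message):
            return label
    return None


def status_column(col_name, rules=RULES):
    """Build the equivalent Spark column as a chain of when(...) clauses.

    Without an otherwise(), rows matching no rule evaluate to null.
    """
    from pyspark.sql import functions as F  # lazy import, needs PySpark

    expr = None
    for pattern, label in rules:
        cond = F.col(col_name).rlike(pattern)
        expr = F.when(cond, label) if expr is None else expr.when(cond, label)
    return expr


messages = ["GDF2009", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI",
            "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"]
statuses = [map_status(m) for m in messages]
```

With a Spark session, the column could then be attached with df.withColumn("status", status_column("message")).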
How do I expand a dataframe based on column values? I intend to go from this dataframe:
+---------+----------+----------+
|DEVICE_ID| MIN_DATE| MAX_DATE|
+---------+----------+----------+
| 1|2019-08-29|2019-08-31|
| 2|2019-08-27|2019-09-02|
+---------+----------+----------+
To one that looks like this:
+---------+----------+
|DEVICE_ID| DATE|
+---------+----------+
| 1|2019-08-29|
| 1|2019-08-30|
| 1|2019-08-31|
| 2|2019-08-27|
| 2|2019-08-28|
| 2|2019-08-29|
| 2|2019-08-30|
| 2|2019-08-31|
| 2|2019-09-01|
| 2|2019-09-02|
+---------+----------+
Any help would be much appreciated.
from datetime import timedelta, date
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType
# Create a sample data row.
df = sqlContext.sql("""
select 'dev1' as device_id,
to_date('2020-01-06') as start,
to_date('2020-01-09') as end""")
# Define a UDF that returns a comma-separated list of dates.
@udf
def datelist(start, end):
    return ",".join([str(start + timedelta(days=x)) for x in range(0, 1+(end-start).days)])
# explode the list of dates into rows
df.select("device_id",
F.explode(
F.split(datelist(df["start"], df["end"]), ","))
.alias("date")).show(10, False)
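The date list that the UDF builds can be reproduced in plain Python to check the expected rows. The sketch below also shows a possible UDF-free variant, assuming Spark 2.4 or later (where the SQL function sequence is available) and assuming MIN_DATE and MAX_DATE are date-typed columns as in the question. The names date_range and expand_dates are hypothetical, and pyspark is imported lazily so the plain-Python part runs without Spark.

```python
from datetime import date, timedelta


def date_range(start, end):
    """Return every date from start to end, both inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def expand_dates(df):
    """Explode each (MIN_DATE, MAX_DATE) pair into one row per day.

    Assumes Spark 2.4+ and date-typed MIN_DATE / MAX_DATE columns.
    """
    from pyspark.sql import functions as F  # lazy import, needs PySpark

    days = F.expr("sequence(MIN_DATE, MAX_DATE, interval 1 day)")
    return df.select("DEVICE_ID", F.explode(days).alias("DATE"))


device_1 = date_range(date(2019, 8, 29), date(2019, 8, 31))
device_2 = date_range(date(2019, 8, 27), date(2019, 9, 2))
```

For the sample data, device 1 expands to 3 rows and device 2 to 7 rows, matching the target dataframe in the question.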