Merge 2 spark dataframes with non overlapping columns - python

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they don't overlap (i.e., when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want a single column. How do I do this? Thanks

Try this-
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")

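If both frames are pandas DataFrames in the same row order (for example after toPandas(); this does not work on Spark DataFrames directly), you could do this:
import numpy as np
mydf = df1.copy() # make a copy of the first frame
idx = np.where(df1['col_name'].isnull())[0] # positions where df1 is null
val = df2['col_name'].values[idx] # values from df2 at those positions
mydf.iloc[idx, mydf.columns.get_loc('col_name')] = val # assign those values in mydf
mydf # display mydf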

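You should be able to use the coalesce function to achieve this.
from pyspark.sql.functions import coalesce, col
# Rename the shared column so the two sides can be told apart after the join
renamedDF1 = df1.withColumnRenamed("col_name", "col_name_a")
renamedDF2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = renamedDF1.join(renamedDF2, "id")
# coalesce returns the first non-null value among its arguments
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(col("col_name_a"), col("col_name_b"))
).select("id", "col_name")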
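To see what coalesce does row by row, here is a small plain-Python model of the join-then-coalesce idea. It does not use Spark; the `coalesce` helper and the dictionaries are hypothetical stand-ins for the first few rows of df1 and df2, so treat it as a sketch of the semantics rather than of the Spark API.

```python
# Plain-Python model of coalesce(a, b): return the first value that is not None.
# The helper and the sample data are illustrative only.
def coalesce(*values):
    for v in values:
        if v is not None:
            return v
    return None

# id -> col_name for the first five rows of each frame
df1_rows = {0: "a", 1: "b", 2: None, 3: None, 4: "e"}
df2_rows = {0: None, 1: None, 2: "c", 3: "d", 4: None}

# "Join" on id, then coalesce the two col_name values into one.
merged = {i: coalesce(df1_rows[i], df2_rows[i]) for i in df1_rows}
print(merged)
```

Because the two inputs never have a value for the same id, the order of the arguments to coalesce does not matter here; if they could overlap, the first argument would win.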

Related

Pyspark create combinations from list

Say, I have Dataframe:
df = spark.createDataFrame([['some_string', 'A'],['another_string', 'B']],['a','b'])
a | b
---------------------------+------------
some_string | A
another_string | B
And I have a list of ints like [1,2,3].
What I want is to add the list as a column to my dataframe, one row per value:
a | b | c
---------------------------+-----------+------------
some_string | A | 1
some_string | A | 2
some_string | A | 3
another_string | B | 1
another_string | B | 2
another_string | B | 3
Is there any way to do it without udf?
Use crossJoin. Please check below code.
>>> dfa.show()
+--------------+---+
| a| b|
+--------------+---+
| some_string| A|
|another_string| B|
+--------------+---+
>>> dfb.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
>>> dfa.crossJoin(dfb).show()
+--------------+---+---+
| a| b| id|
+--------------+---+---+
| some_string| A| 1|
| some_string| A| 2|
| some_string| A| 3|
|another_string| B| 1|
|another_string| B| 2|
|another_string| B| 3|
+--------------+---+---+
You could also just use explode, and avoid unnecessary shuffle caused by joins.
ints=[1,2,3]
from pyspark.sql import functions as F
df.withColumn("c", F.explode(F.array(*[F.lit(x) for x in ints]))).show()
#+--------------+---+---+
#| a| b| c|
#+--------------+---+---+
#| some_string| A| 1|
#| some_string| A| 2|
#| some_string| A| 3|
#|another_string| B| 1|
#|another_string| B| 2|
#|another_string| B| 3|
#+--------------+---+---+
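Both answers produce the Cartesian product of the rows and the list. As a rough mental model (plain Python, no Spark; the variable names are made up for illustration), the result is the same pairing that `itertools.product` gives:

```python
from itertools import product

rows = [("some_string", "A"), ("another_string", "B")]
ints = [1, 2, 3]

# Every row is paired with every int, which is the shape that crossJoin
# (or explode over a constant array) produces.
combined = [(a, b, c) for (a, b), c in product(rows, ints)]
for row in combined:
    print(row)
```

The output has len(rows) * len(ints) rows, so this pattern grows quickly if either side is large.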

Reshape pyspark dataframe to show moving window of item interactions

I have a large pyspark dataframe of subject interactions in long format: each row describes a subject interacting with some item of interest, along with a timestamp and a rank order for that subject's interaction (i.e., the first interaction is 1, the second is 2, etc.). Here are a few rows:
+----------+---------+----------------------+--------------------+
| date|itemId |interaction_date_order| userId|
+----------+---------+----------------------+--------------------+
|2019-07-23| 10005880| 1|37 |
|2019-07-23| 10005903| 2|37 |
|2019-07-23| 10005903| 3|37 |
|2019-07-23| 12458442| 4|37 |
|2019-07-26| 10005903| 5|37 |
|2019-07-26| 12632813| 6|37 |
|2019-07-26| 12632813| 7|37 |
|2019-07-26| 12634497| 8|37 |
|2018-11-24| 12245677| 1|5 |
|2018-11-24| 12245677| 1|5 |
|2019-07-29| 12541871| 2|5 |
|2019-07-29| 12541871| 3|5 |
|2019-07-30| 12626854| 4|5 |
|2019-08-31| 12776880| 5|5 |
|2019-08-31| 12776880| 6|5 |
+----------+---------+----------------------+--------------------+
I need to reshape these data such that, for each subject, a row has a length-5 moving window of interactions. So then, something like this:
+------+--------+--------+--------+--------+--------+
|userId| i-2 | i-1 | i | i+1 | i+2|
+------+--------+--------+--------+--------+--------+
|37 |10005880|10005903|10005903|12458442|10005903|
|37 |10005903|10005903|12458442|10005903|12632813|
Does anyone have suggestions for how I might do this?
Import spark and everything
from pyspark.sql import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
Create your dataframe
columns = '| date|itemId |interaction_date_order| userId|'.split('|')
lines = '''2019-07-23| 10005880| 1|37 |
2019-07-23| 10005903| 2|37 |
2019-07-23| 10005903| 3|37 |
2019-07-23| 12458442| 4|37 |
2019-07-26| 10005903| 5|37 |
2019-07-26| 12632813| 6|37 |
2019-07-26| 12632813| 7|37 |
2019-07-26| 12634497| 8|37 |
2018-11-24| 12245677| 1|5 |
2018-11-24| 12245677| 2|5 |
2019-07-29| 12541871| 3|5 |
2019-07-29| 12541871| 4|5 |
2019-07-30| 12626854| 5|5 |
2019-08-31| 12776880| 6|5 |
2019-08-31| 12776880| 7|5 |'''
Interaction = Row("date", "itemId", "interaction_date_order", "userId")
interactions = []
for line in lines.split('\n'):
    column_values = line.split('|')
    interaction = Interaction(column_values[0], int(column_values[1]), int(column_values[2]), int(column_values[3]))
    interactions.append(interaction)
df = spark.createDataFrame(interactions)
now we have
df.show()
+----------+--------+----------------------+------+
| date| itemId|interaction_date_order|userId|
+----------+--------+----------------------+------+
|2019-07-23|10005880| 1| 37|
|2019-07-23|10005903| 2| 37|
|2019-07-23|10005903| 3| 37|
|2019-07-23|12458442| 4| 37|
|2019-07-26|10005903| 5| 37|
|2019-07-26|12632813| 6| 37|
|2019-07-26|12632813| 7| 37|
|2019-07-26|12634497| 8| 37|
|2018-11-24|12245677| 1| 5|
|2018-11-24|12245677| 2| 5|
|2019-07-29|12541871| 3| 5|
|2019-07-29|12541871| 4| 5|
|2019-07-30|12626854| 5| 5|
|2019-08-31|12776880| 6| 5|
|2019-08-31|12776880| 7| 5|
+----------+--------+----------------------+------+
Create a window and collect itemId with count
from pyspark.sql.window import Window
import pyspark.sql.functions as F
window = Window() \
.partitionBy('userId') \
.orderBy('interaction_date_order') \
.rowsBetween(Window.currentRow, Window.currentRow+4)
df2 = df.withColumn("itemId_list", F.collect_list('itemId').over(window))
df2 = df2.withColumn("itemId_count", F.count('itemId').over(window))
df_final = df2.where(df2['itemId_count'] == 5)
now we have
df_final.show()
+----------+--------+----------------------+------+--------------------+------------+
| date| itemId|interaction_date_order|userId| itemId_list|itemId_count|
+----------+--------+----------------------+------+--------------------+------------+
|2018-11-24|12245677| 1| 5|[12245677, 122456...| 5|
|2018-11-24|12245677| 2| 5|[12245677, 125418...| 5|
|2019-07-29|12541871| 3| 5|[12541871, 125418...| 5|
|2019-07-23|10005880| 1| 37|[10005880, 100059...| 5|
|2019-07-23|10005903| 2| 37|[10005903, 100059...| 5|
|2019-07-23|10005903| 3| 37|[10005903, 124584...| 5|
|2019-07-23|12458442| 4| 37|[12458442, 100059...| 5|
+----------+--------+----------------------+------+--------------------+------------+
Final touch
df_final2 = (df_final
.withColumn('i-2', df_final['itemId_list'][0])
.withColumn('i-1', df_final['itemId_list'][1])
.withColumn('i', df_final['itemId_list'][2])
.withColumn('i+1', df_final['itemId_list'][3])
.withColumn('i+2', df_final['itemId_list'][4])
.select('userId', 'i-2', 'i-1', 'i', 'i+1', 'i+2')
)
df_final2.show()
+------+--------+--------+--------+--------+--------+
|userId| i-2| i-1| i| i+1| i+2|
+------+--------+--------+--------+--------+--------+
| 5|12245677|12245677|12541871|12541871|12626854|
| 5|12245677|12541871|12541871|12626854|12776880|
| 5|12541871|12541871|12626854|12776880|12776880|
| 37|10005880|10005903|10005903|12458442|10005903|
| 37|10005903|10005903|12458442|10005903|12632813|
| 37|10005903|12458442|10005903|12632813|12632813|
| 37|12458442|10005903|12632813|12632813|12634497|
+------+--------+--------+--------+--------+--------+
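The window-and-filter steps above can be mirrored in plain Python to check the expected rows on a small sample. This is only a sketch of the logic, not Spark code; `sliding_windows` is a hypothetical helper and the data is user 37 from the sample.

```python
from collections import defaultdict

# (userId, interaction_date_order, itemId), taken from the sample data above
interactions = [
    (37, 1, 10005880), (37, 2, 10005903), (37, 3, 10005903),
    (37, 4, 12458442), (37, 5, 10005903), (37, 6, 12632813),
    (37, 7, 12632813), (37, 8, 12634497),
]

def sliding_windows(rows, size=5):
    """Return (userId, item_1, ..., item_size) for every full window."""
    by_user = defaultdict(list)
    for user, order, item in sorted(rows):  # sort by user, then by order
        by_user[user].append(item)
    out = []
    for user, items in by_user.items():
        # Only start positions that leave room for a full window are kept,
        # mirroring the itemId_count == 5 filter.
        for start in range(len(items) - size + 1):
            out.append((user, *items[start:start + size]))
    return out

windows = sliding_windows(interactions)
for w in windows:
    print(w)
```

A user with n interactions yields n - 4 windows, which is why user 37 (8 rows) contributes 4 rows to the final table.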

Combine two rows in Pyspark if a condition is met

I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers,
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate positive numbers:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
which should become
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
Another way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
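The same grouping-key idea can be tried out in plain Python on the five sample rows. This is a sketch of the logic only (no Spark): `enumerate` stands in for monotonically_increasing_id, and the tagged tuple keys are an illustrative choice, not part of the answer.

```python
rows = [(1, 1), (2, 1), (1, 2), (-1, 3), (-1, 1)]  # (mergeId, number)

totals = {}
for uid, (merge_id, number) in enumerate(rows):
    # Positive ids share one key per mergeId; every other row gets a key
    # unique to that row, so it is never summed with anything else.
    key = ("merge", merge_id) if merge_id > 0 else ("row", uid)
    prev = totals.get(key, (merge_id, 0))
    totals[key] = (merge_id, prev[1] + number)

result = list(totals.values())
print(result)
```

Tagging the key with "merge" or "row" keeps a row id from colliding with a real mergeId, which is the role the second groupBy column (mergeId) plays in the Spark version.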
You will have to filter out only the rows where shouldMerge is true and aggregate them, then union the result with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
grouby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(grouby_field)\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.

Partition pyspark dataframe based on the change in column value

I have a dataframe in pyspark.
Say it has some columns a, b, c, ...
I want to split the data into groups whenever the value of a column changes. Say
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y),
along with the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
.toDF(['A', 'B'])\
.withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.
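The two window steps (flag each change, then take a running sum of the flags) can be checked in plain Python on the same column A. This is a sketch of the logic only, without Spark; the variable names are illustrative.

```python
from itertools import accumulate

a_values = [1, 1, 0, 0, 0, 1, 1, 1]

# changed[i] is 1 when A differs from the previous row; the first row is
# compared against a default of 0, matching lag('A', 1, 0).
previous = [0] + a_values[:-1]
changed = [int(cur != prev) for cur, prev in zip(a_values, previous)]

# A running sum of the change flags gives one id per run of equal values.
group_id = list(accumulate(changed))
print(changed)
print(group_id)
```

Note that if the first value of A were 0, the first flag would be 0 and the ids would start at 0 instead of 1; the grouping itself would be unaffected.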

Fill Pyspark dataframe column null values with average value from same column

With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
- compute statistics
- fill the blanks
That pretty much limits what you can really improve here. Still, you can:
- replace flatMap(list).collect()[0] with first()[0] or structure unpacking
- compute all stats with a single action
- use built-in Row methods to extract a dictionary
The final result could look like this:
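from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    # Compute the mean of every non-excluded column in a single action
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # Row.asDict() gives {column: mean}, which na.fill accepts directly
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])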
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
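The two steps (compute the statistic, then fill the blanks) can be verified by hand on the cost column from the question. This is a plain-Python sketch of the arithmetic only, not Spark code; the list is copied from the sample data.

```python
costs = [223, 83, None, 3003, None, 2321, 10, None]

# Step 1: compute the statistic from the non-null values only.
present = [c for c in costs if c is not None]
mean_cost = sum(present) / len(present)

# Step 2: fill the blanks with it.
filled = [mean_cost if c is None else c for c in costs]
print(mean_cost)
print(filled)
```

Like avg in Spark, the mean here ignores the nulls rather than counting them as zero, which is how the question arrives at 1128.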
