Joining two RDDs in PySpark? - python

In a previous question I asked about two RDDs I was trying to join together. Here are the two RDDs:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
+------+---+
| _1| _2|
+------+---+
|Python| 10|
| C| 1|
| C#| 1|
+------+---+
After running the following line of code on the two RDDs, this was the result:
joined_rdd = rdd1.join(rdd2).map(lambda x: (x[0], *x[1]))
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
This was exactly what I wanted. But how can I join another RDD to this three-column joined_rdd? The code I used originally does not work, and none of the variations I have tried give the result I want. Here is what I want it to look like:
rdd3
+------+---+
| _1| _2|
+------+---+
|Python| 8|
| C| 15|
| C#|100|
+------+---+
After joining with joined_rdd:
final_joined_rdd
+------+---+---+---+
| _1| _2| _3| _4|
+------+---+---+---+
|Python| 36| 10| 8|
| C| 6| 1| 15|
| C#| 8| 1|100|
+------+---+---+---+
Any help to achieve this result would be appreciated, thanks!
Note: I cannot convert these RDDs to DataFrames and then join them, because the RDDs are really just DStream objects.
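One possible approach, sketched under the assumption that join only pairs up (key, value) elements: re-key the three-element tuples so that everything after the key becomes a single value, join, then flatten again. The `join_pairs` helper below is hypothetical and only imitates an inner join on small Python lists so the reshaping can be checked without a cluster. The RDD expression in the trailing comment is the assumed equivalent and has not been run against a DStream (DStreams also expose map and join, so the same chain may carry over).

```python
# Hypothetical stand-in for an inner join on lists of (key, value) pairs.
def join_pairs(left, right):
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

joined = [("Python", 36, 10), ("C", 6, 1), ("C#", 8, 1)]  # shape of joined_rdd
rdd3 = [("Python", 8), ("C", 15), ("C#", 100)]

# Step 1: re-key so each element is (key, (v1, v2)) instead of (key, v1, v2).
keyed = [(x[0], x[1:]) for x in joined]

# Step 2: join, giving (key, ((v1, v2), v3)).
nested = join_pairs(keyed, rdd3)

# Step 3: flatten back to (key, v1, v2, v3).
final = [(x[0], *x[1][0], x[1][1]) for x in nested]
print(final)

# Assumed PySpark equivalent (untested here):
# final_joined_rdd = (joined_rdd.map(lambda x: (x[0], x[1:]))
#                               .join(rdd3)
#                               .map(lambda x: (x[0], *x[1][0], x[1][1])))
```

The key point is step 1: once the element is a two-item (key, value) pair again, the original join-then-flatten pattern applies, with one extra level of unpacking.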

Related

Merge 2 spark dataframes with non overlapping columns

I have two data frames, df1:
+---+---------+
| id| col_name|
+---+---------+
| 0| a |
| 1| b |
| 2| null|
| 3| null|
| 4| e |
| 5| f |
| 6| g |
| 7| h |
| 8| null|
| 9| j |
+---+---------+
and df2:
+---+---------+
| id| col_name|
+---+---------+
| 0| null|
| 1| null|
| 2| c|
| 3| d|
| 4| null|
| 5| null|
| 6| null|
| 7| null|
| 8| i|
| 9| null|
+---+---------+
and I want to merge them so I get
+---+---------+
| id| col_name|
+---+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
| 6| g|
| 7| h|
| 8| i|
| 9| j|
+---+---------+
I know for sure that they don't overlap (i.e. when the df2 entry is null the df1 entry isn't, and vice versa).
I know that if I use join I won't get them in the same column and will instead get two "col_name" columns. I just want a single column. How do I do this? Thanks
Try this:
df1.alias("a").join(df2.alias("b"), "id").selectExpr("id", "coalesce(a.col_name, b.col_name) as col_name")
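You could do this (this assumes df1 and df2 are pandas DataFrames with nulls stored as the string 'null', not Spark DataFrames):
import numpy as np
mydf = df1.copy()  # make a copy of the first frame
idx = np.where(df1['col_name'].values == 'null')[0]  # positions where df1 is null
val = df2['col_name'].values[idx]  # values from df2 at those positions
mydf.iloc[idx, mydf.columns.get_loc('col_name')] = val  # assign those values in mydf
mydf  # display mydf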
You should be able to use the coalesce function to achieve this.
from pyspark.sql.functions import coalesce
# Rename the columns so they can be told apart after the join
df1 = df1.withColumnRenamed("col_name", "col_name_a")
df2 = df2.withColumnRenamed("col_name", "col_name_b")
joinedDF = df1.join(df2, "id")
# coalesce returns the first non-null value among its arguments
joinedDF = joinedDF.withColumn(
    "col_name",
    coalesce(joinedDF["col_name_a"], joinedDF["col_name_b"])
)
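To make the coalesce idea concrete, here is a minimal plain-Python sketch of what happens per row after the join. The `coalesce` function and the two small dictionaries are stand-ins invented for illustration (they mimic only the first four ids of df1 and df2); this is not Spark code.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

# id -> col_name, mirroring the first rows of df1 and df2
df1_rows = {0: "a", 1: "b", 2: None, 3: None}
df2_rows = {0: None, 1: None, 2: "c", 3: "d"}

# Equivalent of joining on id and coalescing the two col_name columns.
merged = {key: coalesce(df1_rows[key], df2_rows.get(key)) for key in df1_rows}
print(merged)
```

Because the two frames never have a value for the same id, the order of the arguments does not matter here; if they could overlap, the first argument would win.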

pyspark dataframe withColumn command not working

I have an input dataframe: df_input (updated df_input)
|comment|inp_col|inp_val|
|11 |a |a1 |
|12 |a |a2 |
|15 |b |b3 |
|16 |b |b4 |
|17 |c |&b |
|17 |c |c5 |
|17 |d |&c |
|17 |d |d6 |
|17 |e |&d |
|17 |e |e7 |
I want to replace each variable reference in the inp_val column with the values it refers to. I tried the code below to create a new column.
First, I take the list of values that start with '&':
df_new = df_inp.select(inp_val).where(df.inp_val.substr(0, 1) == '&')
Now I iterate over that list to replace each '&' value with the values of the column it refers to:
for a in [row[inp_val] for row in df_new.collect()]
df_inp = df_inp.withColumn
(
'new_col',
when(df.inp_val.substr(0, 1) == '&',
[row[inp_val] for row in df.select(df.inp_val).where(df.inp_col == a[1:]).collect()])
.otherwise(df.inp_val)
)
But I'm getting the error below:
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[5], [6]]
Basically, I want the output shown below. Please check and let me know where the error is.
I suspect the code above is trying to put values of two different data types into the same column?
Updated lines of code:
tst_1 = tst.withColumn("col3_extract", when(tst.col3.substr(0, 1) == '&', regexp_replace(tst.col3, "&", "")).otherwise(""))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
But the code above doesn't work when multiple iterations are needed, for example when a reference points to a column that itself contains a reference.
Updated Expected Output:
|comment|inp_col|inp_val|new_col |
|11 |a |a1 |['a1'] |
|12 |a |a2 |['a2'] |
|15 |b |b3 |['b3'] |
|16 |b |b4 |['b4'] |
|17 |c |&b |['b3', 'b4'] |
|18 |c |c5 |['c5'] |
|19 |d |&c |['b3', 'b4', 'c5'] |
|20 |d |d6 |['d6'] |
|21 |e |&d |['b3', 'b4', 'c5', 'd6'] |
|22 |e |e7 |['e7'] |
Try this. A self-join with the collected list, using an rlike join condition, is the way to go.
from pyspark.sql import functions as F
df.show() # sample dataframe
#+-------+---------+---------+
#|comment|input_col|input_val|
#+-------+---------+---------+
#| 11| a| 1|
#| 12| a| 2|
#| 15| b| 5|
#| 16| b| 6|
#| 17| c| &b|
#| 17| c| 7|
#+-------+---------+---------+
df.join(df.groupBy("input_col").agg(F.collect_list("input_val").alias("y1"))\
.withColumnRenamed("input_col","x1"),F.expr("""input_val rlike x1"""),'left')\
.withColumn("new_col", F.when(F.col("input_val").cast("int").isNotNull(), F.array("input_val"))\
.otherwise(F.col("y1"))).drop("x1","y1").show()
#+-------+---------+---------+-------+
#|comment|input_col|input_val|new_col|
#+-------+---------+---------+-------+
#| 11| a| 1| [1]|
#| 12| a| 2| [2]|
#| 15| b| 5| [5]|
#| 16| b| 6| [6]|
#| 17| c| &b| [5, 6]|
#| 17| c| 7| [7]|
#+-------+---------+---------+-------+
You can simply use regexp_replace like this:
from pyspark.sql.functions import col, regexp_replace
df.withColumn("new_col", regexp_replace(col("inp_val"), "&", ""))
Can you try out this solution? Your approach may run into a whole lot of problems.
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
#Test data
tst = sqlContext.createDataFrame([(1,'a','3'),(1,'a','4'),(1,'b','5'),(1,'b','7'),(2,'c','&b'),(2,'c','&a'),(2,'d','&b')],schema=['col1','col2','col3'])
# extract the special character out
tst_1 = tst.withColumn("col3_extract",F.substring(F.col('col3'),2,1))
# Select which values need to be replaced; withColumnRenamed will also solve spark self join issues
# The substring search can also be done using regex function
tst_filter=tst.where(~F.col('col3').contains('&')).withColumnRenamed('col2','col2_collect')
# For the selected data, perform a collect list
tst_clct = tst_filter.groupby('col2_collect').agg(F.collect_list('col3').alias('col3_collect'))
#%% Join the main table with the collected list
tst_join = tst_1.join(tst_clct,on=tst_1.col3_extract==tst_clct.col2_collect,how='left').drop('col2_collect')
#%% In the column3 replace the values such as a, b
tst_result = tst_join.withColumn("result",F.when(~F.col('col3').contains('&'),F.array(F.col('col3'))).otherwise(F.col('col3_collect')))
Results:
+----+----+----+------------+------------+------+
|col1|col2|col3|col3_extract|col3_collect|result|
+----+----+----+------------+------------+------+
| 2| c| &a| a| [3, 4]|[3, 4]|
| 2| c| &b| b| [7, 5]|[7, 5]|
| 2| d| &b| b| [7, 5]|[7, 5]|
| 1| a| 3| | null| [3]|
| 1| a| 4| | null| [4]|
| 1| b| 5| | null| [5]|
| 1| b| 7| | null| [7]|
+----+----+----+------------+------------+------+
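The join-based answers resolve one level of '&' references. The question's expected output needs references that point at other references (&d leads to &c, which leads to &b). As a rough sketch of that logic, assuming the lookup data is small enough to collect to the driver, the resolution can be written recursively in plain Python. The `resolve` function and `values_by_col` mapping are hypothetical names, and the rows are copied from the question's df_input. As far as I can tell, the original error most likely arises because `when` expects a column or a literal value, while the list comprehension hands it a Python list.

```python
from collections import defaultdict

rows = [
    ("a", "a1"), ("a", "a2"), ("b", "b3"), ("b", "b4"),
    ("c", "&b"), ("c", "c5"), ("d", "&c"), ("d", "d6"),
    ("e", "&d"), ("e", "e7"),
]

# inp_col -> list of its inp_val entries, in input order
values_by_col = defaultdict(list)
for inp_col, inp_val in rows:
    values_by_col[inp_col].append(inp_val)

def resolve(value, seen=()):
    """Expand a value; '&x' is replaced by the expanded values of column x."""
    if not value.startswith("&"):
        return [value]
    ref = value[1:]
    if ref in seen:
        raise ValueError(f"cyclic reference through {ref!r}")
    expanded = []
    for inner in values_by_col.get(ref, []):
        expanded.extend(resolve(inner, seen + (ref,)))
    return expanded

new_col = [resolve(inp_val) for _, inp_val in rows]
for (inp_col, inp_val), resolved in zip(rows, new_col):
    print(inp_col, inp_val, resolved)
```

The `seen` tuple only guards against a reference loop; it is not something the question asks for.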

Randomly Sample Pyspark dataframe with column conditions

I'm trying to randomly sample a PySpark dataframe where a column value meets a certain condition. I would like to use the sample method to randomly select rows based on a column value. Let's say I have the following data frame:
+---+----+------+-------------+------+
| id|code| amt|flag_outliers|result|
+---+----+------+-------------+------+
| 1| a| 10.9| 0| 0.0|
| 2| b| 20.7| 0| 0.0|
| 3| c| 30.4| 0| 1.0|
| 4| d| 40.98| 0| 1.0|
| 5| e| 50.21| 0| 2.0|
| 6| f| 60.7| 0| 2.0|
| 7| g| 70.8| 0| 2.0|
| 8| h| 80.43| 0| 3.0|
| 9| i| 90.12| 0| 3.0|
| 10| j|100.65| 0| 3.0|
+---+----+------+-------------+------+
I would like to sample only 1 row (or any fixed number) for each of the values 0, 1, 2, 3 in the result column, so I'd end up with this:
+---+----+------+-------------+------+
| id|code| amt|flag_outliers|result|
+---+----+------+-------------+------+
| 1| a| 10.9| 0| 0.0|
| 3| c| 30.4| 0| 1.0|
| 5| e| 50.21| 0| 2.0|
| 8| h| 80.43| 0| 3.0|
+---+----+------+-------------+------+
Is there a good programmatic way to achieve this, i.e. to take the same number of rows for each of the values in a given column? Any help is really appreciated!
You can use sampleBy() which returns a stratified sample without replacement based on the fraction given on each stratum.
>>> from pyspark.sql.functions import col
>>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("result"))
>>> sampled = dataset.sampleBy("result", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("result").count().orderBy("result").show()
+------+-----+
|result|count|
+------+-----+
| 0| 5|
| 1| 9|
+------+-----+
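Note that sampleBy works with fractions, so the number of rows per stratum is only approximate. If an exact count per value is needed, the logic looks roughly like the plain-Python sketch below: group the rows by the column, then draw a fixed number from each group. `sample_per_group` is a hypothetical helper and the rows are a trimmed copy of the question's data. In Spark itself, a commonly used pattern for this is numbering rows with row_number over a window partitioned by the column and ordered by rand(), then filtering on that number; that variant is not shown or tested here.

```python
import random
from collections import defaultdict

def sample_per_group(rows, key, n, seed=None):
    """Pick up to n random rows for each distinct value of rows[i][key]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sampled = []
    for value in sorted(groups):
        members = groups[value]
        # min() guards against groups smaller than n
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled

rows = [
    {"id": 1, "result": 0.0}, {"id": 2, "result": 0.0},
    {"id": 3, "result": 1.0}, {"id": 4, "result": 1.0},
    {"id": 5, "result": 2.0}, {"id": 6, "result": 2.0}, {"id": 7, "result": 2.0},
    {"id": 8, "result": 3.0}, {"id": 9, "result": 3.0}, {"id": 10, "result": 3.0},
]

picked = sample_per_group(rows, "result", n=1, seed=0)
print(picked)
```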

Sort or orderBy in pyspark showing strange output

I am trying to sort the values in my PySpark dataframe, but it's showing strange output. Instead of sorting by the whole number, it sorts by the first digit of each number.
I have tried both the sort and orderBy methods; both give the same result.
sdf=spark.read.csv("dummy.txt", header=True)
sdf.sort('1',ascending=False).show()
I am getting the following output:
+---+
| 98|
| 9|
| 8|
| 76|
| 7|
| 68|
| 6|
| 54|
| 5|
| 43|
| 4|
| 35|
| 34|
| 34|
| 3|
| 2|
| 2|
| 2|
| 10|
+---+
Can anyone explain this?
Because your column contains data of String type, the values are compared character by character (lexicographically) rather than numerically, which is why "9" sorts ahead of "76".
So you can cast the column to a numeric type and then apply orderBy to get the required result.
>>> df
DataFrame[Numb: string]
>>> df.show()
+----+
|Numb|
+----+
| 20|
| 19|
| 1|
| 200|
| 60|
+----+
>>> df.orderBy(df.Numb.cast('int'),ascending=False).show()
+----+
|Numb|
+----+
| 200|
| 60|
| 20|
| 19|
| 1|
+----+

Groupby and divide count of grouped elements in pyspark data frame

I have a data frame in PySpark like the one below. I want to group by the category column and count the rows in each group.
df.show()
+--------+----+
|category| val|
+--------+----+
| cat1| 13|
| cat2| 12|
| cat2| 14|
| cat3| 23|
| cat1| 20|
| cat1| 10|
| cat2| 30|
| cat3| 11|
| cat1| 7|
| cat1| 8|
+--------+----+
res = df.groupBy('category').count()
res.show()
+--------+-----+
|category|count|
+--------+-----+
| cat2| 3|
| cat3| 2|
| cat1| 5|
+--------+-----+
I am getting my desired result. Now I want to calculate the average per category. The data frame has records for 3 days, and I want the average count over these 3 days.
The result I want is below. I basically want count / number of days.
+--------+-----+
|category|count|
+--------+-----+
| cat2| 1|
| cat3| 1|
| cat1| 2|
+--------+-----+
How can I do that?
I believe what you want is:
from pyspark.sql import functions as F
df.groupby('category').agg((F.count('val') / 3).alias('average'))
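As a quick check of the arithmetic, here is the same calculation in plain Python using the category values from the question. Dividing by 3 gives fractional averages (for example 5 / 3 for cat1); the whole numbers in the desired table match what you get after rounding, which is an assumption on my part about what was intended. In Spark, wrapping the expression in F.round would be one way to do that.

```python
from collections import Counter

categories = ["cat1", "cat2", "cat2", "cat3", "cat1",
              "cat1", "cat2", "cat3", "cat1", "cat1"]
num_days = 3  # the question states the data covers 3 days

counts = Counter(categories)                                    # rows per category
average = {cat: n / num_days for cat, n in counts.items()}      # count / number of days
rounded = {cat: round(avg) for cat, avg in average.items()}     # assumed rounding step

print(average)
print(rounded)
```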
