Python to PySpark: converting the pivot to PySpark

I have the DataFrames below and achieved the desired output in Python (pandas), but I want to convert the same logic into PySpark.
import pandas as pd

d = {'user': ['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'], 'songs': [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77]}
data = pd.DataFrame(data = d)
e = {'user': ['A', 'B','C', 'D', 'E', 'F','A'], 'cluster': [1,2,3,1,2,3,2]}
clus = pd.DataFrame(data= e)
Desired output: for each user, I want all the songs in that user's cluster that the user has not listened to. A belongs to cluster 1, and cluster 1 has songs [11, 22, 33, 44], so A hasn't listened to [33, 44]. I achieved that with the Python code below.
user
A [33, 44]
B [55, 66]
C [77]
D [11, 22]
E [11, 99]
F [66]
PYTHON CODE:
import numpy as np

df = pd.merge(data, clus, on='user', how='left').drop_duplicates(['user', 'songs'])
df1 = (df.groupby(['cluster']).apply(lambda x: x.pivot('user', 'songs', 'cluster').isnull())
       .fillna(False)
       .reset_index(level=0, drop=True)
       .sort_index())
s = np.where(df1, ['{}, '.format(x) for x in df1.columns], '')
# remove empty values
s1 = pd.Series([''.join(x).strip(', ') for x in s], index=df1.index)
print (s1)
How can I achieve the same in PySpark (distributed) code?

There could be a better solution than this, but it works.
Assuming that each user belongs to only one cluster,
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

# 'sql' here is the SparkSession
d = zip(['A', 'A', 'B', 'B', 'C', 'D', 'C', 'E', 'D', 'E', 'F', 'F'], [11, 22, 99, 11, 11, 44, 66, 66, 33, 55, 11, 77])
data = sql.createDataFrame(d).toDF('user', 'songs')
This gives,
+----+-----+
|user|songs|
+----+-----+
| A| 11|
| A| 22|
| B| 99|
| B| 11|
| C| 11|
| D| 44|
| C| 66|
| E| 66|
| D| 33|
| E| 55|
| F| 11|
| F| 77|
+----+-----+
Creating clusters assuming each user belongs only to one cluster,
c = zip(['A', 'B','C', 'D', 'E', 'F'],[1,2,3,1,2,3])
clus = sql.createDataFrame(c).toDF('user','cluster')
clus.show()
+----+-------+
|user|cluster|
+----+-------+
| A| 1|
| B| 2|
| C| 3|
| D| 1|
| E| 2|
| F| 3|
+----+-------+
Now, we get all songs heard by a user along with their cluster,
all_combine = data.groupBy('user').agg(F.collect_list('songs').alias('songs'))\
    .join(clus, data.user == clus.user).select(data['user'], 'songs', 'cluster')
all_combine.show()
+----+--------+-------+
|user| songs|cluster|
+----+--------+-------+
| F|[11, 77]| 3|
| E|[66, 55]| 2|
| B|[99, 11]| 2|
| D|[44, 33]| 1|
| C|[11, 66]| 3|
| A|[11, 22]| 1|
+----+--------+-------+
Finally, calculating all songs heard in a cluster and subsequently all songs not heard by a user in that cluster,
# songs heard in the user's cluster but not by the user themselves
not_listened = F.udf(lambda song, all_: list(set(all_) - set(song)), ArrayType(IntegerType()))

grouped_clusters = data.join(clus, data.user == clus.user).select(data['user'], 'songs', 'cluster')\
    .groupby('cluster').agg(F.collect_list('songs').alias('all_songs'))\
    .join(all_combine, ['cluster']).select('user', all_combine['cluster'], 'songs', 'all_songs')\
    .select('user', not_listened(F.col('songs'), F.col('all_songs')).alias('not_listened'))
grouped_clusters.show()
We get output as,
+----+------------+
|user|not_listened|
+----+------------+
| D| [11, 22]|
| A| [33, 44]|
| F| [66]|
| C| [77]|
| E| [99, 11]|
| B| [66, 55]|
+----+------------+
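On Spark 2.4+ the Python UDF can be avoided entirely with the built-in array_except. A minimal sketch, reusing data, clus and all_combine from above:

cluster_songs = data.join(clus, 'user') \
    .groupBy('cluster') \
    .agg(F.collect_set('songs').alias('all_songs'))

# songs heard anywhere in the cluster, minus the user's own songs
result = all_combine.join(cluster_songs, 'cluster') \
    .select('user', F.array_except('all_songs', 'songs').alias('not_listened'))
result.show()

Keeping the set difference inside the JVM this way also avoids the serialization overhead of a Python UDF.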

Related

Display the difference between the values of a column from two different dataframes without losing the other columns

I've got two dataframes with different values of "d" but the same values of "a" and "b".
This is df1:
df1 = spark.createDataFrame([
('c', 'd', 8),
('e', 'f', 8),
('c', 'j', 9),
], ['a', 'b', 'd'])
df1.show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 8|
| e| f| 8|
| c| j| 9|
+---+---+---+
and this is df2:
df2 = spark.createDataFrame([
('c', 'd', 7),
('e', 'f', 3),
('c', 'j', 8),
], ['a', 'b', 'd'])
df2.show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 7|
| e| f| 3|
| c| j| 8|
+---+---+---+
I want to obtain the difference between the values of column "d", but I also want to keep the columns "a" and "b". The desired result, df3:
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 1|
| e| f| 5|
| c| j| 1|
+---+---+---+
I tried doing a subtract between the two dataframes, but it didn't work:
df1.subtract(df2).show()
+---+---+---+
| a| b| d|
+---+---+---+
| c| d| 8|
| e| f| 8|
| c| j| 9|
+---+---+---+
Here is how you can do it:
df3 = df1.join(df2, on = ['b', 'a'], how = 'outer').select('a', 'b', (df1.d - df2.d).alias('diff'))
df3.show()
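If you want the result column to keep the name d (and to make explicit which d comes from which side), here is a sketch using DataFrame aliases, assuming every (a, b) pair exists in both dataframes:

from pyspark.sql import functions as F

df3 = (df1.alias('l')
          .join(df2.alias('r'), on=['a', 'b'], how='inner')
          .select('a', 'b', (F.col('l.d') - F.col('r.d')).alias('d')))
df3.show()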

Need to aggregate and put into list by group in Pyspark dataframe

I have a pyspark dataframe, where I want to group by some index, and combine all the values in each column into one list per column.
Example input:
id_1| id_2| id_3|timestamp|thing1|thing2|thing3
A | b | c |time_0 |1.2 |1.3 |2.5
A | b | c |time_1 |1.1 |1.5 |3.4
A | b | c |time_2 |2.2 |2.6 |2.9
A | b | d |time_0 |5.1 |5.5 |5.7
A | b | d |time_1 |6.1 |6.2 |6.3
A | b | e |time_0 |0.1 |0.5 |0.9
A | b | e |time_1 |0.2 |0.3 |0.6
Example output:
id_1|id_2|id_3| timestamp |thing1 |thing2 |thing3
A |b | c |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6]|[2.5,3.4,2.9]
A |b | d |[time_0,time_1] |[5.1,6.1] |[5.5,6.2] |[5.7,6.3]
A |b | e |[time_0,time_1] |[0.1,0.2] |[0.5,0.3] |[0.9,0.6]
How can I do this efficiently?
Use collect_list(), as others have suggested as well.
# Creating the DataFrame
df = sqlContext.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5), ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
     ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9), ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
     ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3), ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
     ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6)],
    ['id_1', 'id_2', 'id_3', 'timestamp', 'thing1', 'thing2', 'thing3'])
df.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
| A| b| c| time_2| 2.2| 2.6| 2.9|
| A| b| d| time_0| 5.1| 5.5| 5.7|
| A| b| d| time_1| 6.1| 6.2| 6.3|
| A| b| e| time_0| 0.1| 0.5| 0.9|
| A| b| e| time_1| 0.2| 0.3| 0.6|
+----+----+----+---------+------+------+------+
In addition to using agg(), you can write familiar SQL syntax to operate on it, but first we have to register our DataFrame as a temporary SQL view:
df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
                         collect_list(timestamp) as timestamp,
                         collect_list(thing1) as thing1,
                         collect_list(thing2) as thing2,
                         collect_list(thing3) as thing3
                  from df_view
                  group by id_1, id_2, id_3""")
df.show(truncate=False)
+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp |thing1 |thing2 |thing3 |
+----+----+----+------------------------+---------------+---------------+---------------+
|A |b |d |[time_0, time_1] |[5.1, 6.1] |[5.5, 6.2] |[5.7, 6.3] |
|A |b |e |[time_0, time_1] |[0.1, 0.2] |[0.5, 0.3] |[0.9, 0.6] |
|A |b |c |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+
Note: the triple quotes (""") are used so the statement can span multiple lines for readability; with a plain 'select id_1 ...' string you could not spread the statement over several lines. Either way, the final result is the same.
Here is the example from the GitHub test TestExample1:
from pyspark.sql.functions import col, collect_list

exampleDf = self.spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
     ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4)],
    ("id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"))
exampleDf.show()

ans = exampleDf.groupBy(col("id_1"), col("id_2"), col("id_3")) \
    .agg(collect_list(col("timestamp")),
         collect_list(col("thing1")),
         collect_list(col("thing2")))
ans.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
| A| b| c| time_0| 1.2| 1.3| 2.5|
| A| b| c| time_1| 1.1| 1.5| 3.4|
+----+----+----+---------+------+------+------+
+----+----+----+-----------------------+--------------------+--------------------+
|id_1|id_2|id_3|collect_list(timestamp)|collect_list(thing1)|collect_list(thing2)|
+----+----+----+-----------------------+--------------------+--------------------+
| A| b| c| [time_0, time_1]| [1.2, 1.1]| [1.3, 1.5]|
+----+----+----+-----------------------+--------------------+--------------------+
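One caveat: collect_list does not guarantee any particular ordering of the collected values after a shuffle. If the lists must follow the timestamp order, one option is to collect structs and sort them. A sketch, assuming df here is the original 7-row DataFrame created at the top of this answer (before it was overwritten by the SQL result) and that the timestamp strings sort lexicographically:

from pyspark.sql import functions as F

ordered = (df.groupBy('id_1', 'id_2', 'id_3')
             .agg(F.sort_array(F.collect_list(
                  F.struct('timestamp', 'thing1', 'thing2', 'thing3'))).alias('rows'))
             .select('id_1', 'id_2', 'id_3',
                     F.col('rows.timestamp').alias('timestamp'),
                     F.col('rows.thing1').alias('thing1'),
                     F.col('rows.thing2').alias('thing2'),
                     F.col('rows.thing3').alias('thing3')))
ordered.show(truncate=False)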

How to filter data from a dataframe using pyspark

I have a table named mytable available as a dataframe; the table is below:
+---+----+----+----+
| x| y| z| w|
+---+----+----+----+
| 1| a|null|null|
| 1|null| b|null|
| 1|null|null| c|
| 2| d|null|null|
| 2|null| e|null|
| 2|null|null| f|
+---+----+----+----+
I want a result where we group by column x and concatenate the values of columns y, z, and w. The result looks like below:
+---+------+
|  x|result|
+---+------+
|  1| a b c|
|  2| d e f|
+---+------+
from pyspark.sql.functions import concat_ws, collect_list, concat, coalesce, lit
#sample data
df = sc.parallelize([
[1, 'a', None, None],
[1, None, 'b', None],
[1, None, None, 'c'],
[2, 'd', None, None],
[2, None, 'e', None],
[2, None, None, 'f']]).\
toDF(('x', 'y', 'z', 'w'))
df.show()
result_df = df.groupby("x").\
agg(concat_ws(' ', collect_list(concat(*[coalesce(c, lit("")) for c in df.columns[1:]]))).
alias('result'))
result_df.show()
Output is:
+---+------+
| x|result|
+---+------+
| 1| a b c|
| 2| d e f|
+---+------+
Sample input:
+---+----+----+----+
| x| y| z| w|
+---+----+----+----+
| 1| a|null|null|
| 1|null| b|null|
| 1|null|null| c|
| 2| d|null|null|
| 2|null| e|null|
| 2|null|null| f|
+---+----+----+----+
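Since collect_list already skips nulls, a slightly shorter variant is possible. A sketch, relying on concat_ws accepting array<string> columns and on each group here having at most one non-null value per column:

result_df = df.groupby('x').agg(
    concat_ws(' ', collect_list('y'), collect_list('z'), collect_list('w')).alias('result'))
result_df.show()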

How can I convert a list of lists into a DataFrame in PySpark, where each inner list holds the values of one attribute?

I have a list of lists of the form:
[[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
Each inner list contains the values of one of the attributes 'A1', 'A2', and 'A3'.
I want to get the following dataframe:
+----------+----------+----------+
| A1 | A2 | A3 |
+----------+----------+----------+
| 1 | A | aa |
+----------+----------+----------+
| 2 | B | bb |
+----------+----------+----------+
| 3 | C | cc |
+----------+----------+----------+
How can I do it?
You can create a Row class with the headers as fields, and use zip to loop through the lists row-wise, constructing a Row object for each row:
lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
from pyspark.sql import Row
R = Row("A1", "A2", "A3")
sc.parallelize([R(*r) for r in zip(*lst)]).toDF().show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
| 1| A| aa|
| 2| B| bb|
| 3| C| cc|
+---+---+---+
Or, if you have pandas installed, create a pandas DataFrame first; you can create a Spark DataFrame from a pandas DataFrame directly using spark.createDataFrame:
import pandas as pd
headers = ['A1', 'A2', 'A3']
pdf = pd.DataFrame.from_dict(dict(zip(headers, lst)))
spark.createDataFrame(pdf).show()
+---+---+---+
| A1| A2| A3|
+---+---+---+
| 1| A| aa|
| 2| B| bb|
| 3| C| cc|
+---+---+---+
from pyspark.sql import Row
names = ['A1', 'A2', 'A3']
data = sc.parallelize(zip(*[[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']])).\
    map(lambda x: Row(**{names[i]: elt for i, elt in enumerate(x)})).toDF()
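If you would rather skip both Row and pandas, spark.createDataFrame also accepts a plain list of tuples together with a list of column names. A minimal sketch:

lst = [[1, 2, 3], ['A', 'B', 'C'], ['aa', 'bb', 'cc']]
# each zipped tuple becomes one row; the second argument supplies the column names
spark.createDataFrame(list(zip(*lst)), ['A1', 'A2', 'A3']).show()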

How can I apply groupBy on a dataframe in PySpark without losing the other columns of the non-grouped instances?

I am trying to perform a groupBy() operation in PySpark, but I run into the following problem:
I have a dataframe (df1) with 3 attributes: attr1, attr2 and attr3. I want to apply a groupBy over that dataframe taking into account only the attributes attr1 and attr2. Of course, when groupBy(attr1, attr2) is applied to df1 it generates groups of the instances that are equal to each other.
What I want to get is:
If I apply the groupBy() operation and some instances are equal, I want to collect those groups in an independent dataframe; and if some instances are not equal to any other, I want to keep them in another dataframe with all 3 attributes: attr1, attr2 and attr3 (the one not used to group by).
Is that possible?
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('MyApp').getOrCreate()

df = spark.createDataFrame(
    [('a', 'a', 3), ('a', 'c', 5), ('b', 'a', 4), ('c', 'a', 2), ('a', 'a', 9), ('b', 'a', 9)],
    ('attr1', 'attr2', 'attr3'))
df = df.withColumn('count', f.count('attr3').over(Window.partitionBy('attr1', 'attr2'))).cache()
output:
+-----+-----+-----+-----+
|attr1|attr2|attr3|count|
+-----+-----+-----+-----+
| b| a| 4| 2|
| b| a| 9| 2|
| a| c| 5| 1|
| c| a| 2| 1|
| a| a| 3| 2|
| a| a| 9| 2|
+-----+-----+-----+-----+
and
an_independent_dataframe = df.filter(df['count'] > 1).groupBy('attr1', 'attr2').sum('attr3')
+-----+-----+----------+
|attr1|attr2|sum(attr3)|
+-----+-----+----------+
| b| a| 13|
| a| a| 12|
+-----+-----+----------+
another_dataframe = df.filter(df['count'] == 1).select('attr1', "attr2", "attr3")
+-----+-----+-----+
|attr1|attr2|attr3|
+-----+-----+-----+
| a| c| 5|
| c| a| 2|
+-----+-----+-----+
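The same split can be done without a window function. A sketch that recomputes the group sizes with a plain groupBy and joins them back, assuming df is the DataFrame before the withColumn('count', ...) step:

grp = df.groupBy('attr1', 'attr2').agg(f.count('attr3').alias('cnt'),
                                       f.sum('attr3').alias('sum_attr3'))

# groups that contain more than one instance
an_independent_dataframe = grp.filter(f.col('cnt') > 1).drop('cnt')
# instances whose (attr1, attr2) combination is unique, keeping all 3 attributes
another_dataframe = df.join(grp.filter(f.col('cnt') == 1), ['attr1', 'attr2']) \
                      .select('attr1', 'attr2', 'attr3')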
