How to create multiple count columns in Pyspark? - python

I have a dataframe of title and bin:
+---------------------+-------------+
| Title| bin|
+---------------------+-------------+
| Forrest Gump (1994)| 3|
| Pulp Fiction (1994)| 2|
| Matrix, The (1999)| 3|
| Toy Story (1995)| 1|
| Fight Club (1999)| 3|
+---------------------+-------------+
How do I count each bin value into its own column of a new dataframe using PySpark? For instance:
+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
| 1| 1| 3|
+------------+------------+------------+
Is this possible? Would someone please help me with this if you know how?

Group by bin and count, then pivot the column bin and rename the columns of the resulting dataframe if you want:
import pyspark.sql.functions as F
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])
df1.show()
#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#| 1| 1| 3|
#+----------+----------+----------+
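If the bin values are known up front, the same one-row result can also be produced in a single pass with conditional aggregation instead of groupBy + pivot. A minimal sketch, assuming the bins are exactly 1, 2 and 3:
import pyspark.sql.functions as F
# one aggregation pass: count rows where bin equals each expected value
df.agg(*[F.count(F.when(F.col("bin") == b, 1)).alias(f"count_bin{b}") for b in [1, 2, 3]]).show()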

Related

Conditional calculation with two datasets - PySpark

Imagine you have two datasets df and df2 like the following:
df:
ID Size Condition
1 2 1
2 3 0
3 5 0
4 7 1
df2:
aux_ID Scalar
1 2
3 2
I want to get an output where, if the condition in df is 1, we multiply the size by the scalar and then return df with the changed values.
I want to do this as efficiently as possible, perhaps avoiding the join if that's possible.
output_df:
ID Size Condition
1 4 1
2 3 0
3 5 0
4 7 1
Not sure why you would want to avoid joins in the first place. They can be efficient in their own regard.
With that said, this can easily be done by merging the 2 datasets and building a case-when statement against the condition.
Data Preparation
# imports assumed by this snippet (not shown in the original post)
import pandas as pd
from io import StringIO
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

df1 = pd.read_csv(StringIO("""ID,Size,Condition
1,2,1
2,3,0
3,5,0
4,7,1
"""), delimiter=',')

df2 = pd.read_csv(StringIO("""aux_ID,Scalar
1,2
3,2
"""), delimiter=',')

sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show()
+---+----+---------+
| ID|Size|Condition|
+---+----+---------+
| 1| 2| 1|
| 2| 3| 0|
| 3| 5| 0|
| 4| 7| 1|
+---+----+---------+
sparkDF2.show()
+------+------+
|aux_ID|Scalar|
+------+------+
| 1| 2|
| 3| 2|
+------+------+
Case When
finalDF = (sparkDF1.join(sparkDF2,
                         sparkDF1['ID'] == sparkDF2['aux_ID'],
                         'left')
                   .select(sparkDF1['*'],
                           sparkDF2['Scalar'],
                           sparkDF2['aux_ID'])
                   .withColumn('Size_Changed',
                               F.when((F.col('Condition') == 1) & F.col('aux_ID').isNotNull(),
                                      F.col('Size') * F.col('Scalar'))
                                .otherwise(F.col('Size'))))
finalDF.show()
+---+----+---------+------+------+------------+
| ID|Size|Condition|Scalar|aux_ID|Size_Changed|
+---+----+---------+------+------+------------+
| 1| 2| 1| 2| 1| 4|
| 3| 5| 0| 2| 3| 5|
| 2| 3| 0| null| null| 3|
| 4| 7| 1| null| null| 7|
+---+----+---------+------+------+------------+
You can drop the unnecessary columns; I kept them for illustration.
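Since df2 is tiny, a broadcast hash join is another way to keep the join cheap; combined with a final select it reproduces the asker's output_df shape. A sketch, not part of the original answer, reusing the sparkDF1/sparkDF2 frames above:
from pyspark.sql import functions as F

output_df = (sparkDF1.join(F.broadcast(sparkDF2), sparkDF1['ID'] == sparkDF2['aux_ID'], 'left')
                     .withColumn('Size',
                                 F.when((F.col('Condition') == 1) & F.col('aux_ID').isNotNull(),
                                        F.col('Size') * F.col('Scalar'))
                                  .otherwise(F.col('Size')))
                     .select('ID', 'Size', 'Condition'))
output_df.show()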

Labelling duplicates in PySpark

I am trying to label the duplicates in my PySpark DataFrame based on their group, while keeping the full-length data frame. Below is example code.
data= [
("A", "2018-01-03"),
("A", "2018-01-03"),
("A", "2018-01-03"),
("B", "2019-01-03"),
("B", "2019-01-03"),
("B", "2019-01-03"),
("C", "2020-01-03"),
("C", "2020-01-03"),
("C", "2020-01-03"),
]
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark= SparkSession.builder.getOrCreate()
df= spark.createDataFrame(data=data, schema=["Group", "Date"])
df= df.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
from pyspark.sql import Window
windowSpec= Window.partitionBy("Group").orderBy(F.asc("Date"))
df.withColumn("group_number", F.dense_rank().over(windowSpec)).orderBy("Date").show()
This is my current output. It is technically correct, since the code ranks "Date" within each group, but it is not my desired outcome.
+-----+----------+------------+
|Group| Date|group_number|
+-----+----------+------------+
| A|2018-01-03| 1|
| A|2018-01-03| 1|
| A|2018-01-03| 1|
| B|2019-01-03| 1|
| B|2019-01-03| 1|
| B|2019-01-03| 1|
| C|2020-01-03| 1|
| C|2020-01-03| 1|
| C|2020-01-03| 1|
+-----+----------+------------+
I was hoping my output to look like this
+-----+----------+------------+
|Group| Date|group_number|
+-----+----------+------------+
| A|2018-01-03| 1|
| A|2018-01-03| 1|
| A|2018-01-03| 1|
| B|2019-01-03| 2|
| B|2019-01-03| 2|
| B|2019-01-03| 2|
| C|2020-01-03| 3|
| C|2020-01-03| 3|
| C|2020-01-03| 3|
+-----+----------+------------+
Any suggestions? I have found this post but this is just a binary solution! I have more than 2 groups in my dataset.
You don't need to use the partitionBy function when you declare your windowSpec. By specifying the column "Group" in partitionBy, you're telling the program to do a dense_rank() within each partition based on the "Date". So the output is correct: the rows in group A all share the same date, thus they all get a group_number of 1; moving on to group B, the rows again all share a date, so they also get a group_number of 1.
So a quick fix for your problem is to remove the partitionBy in your windowSpec.
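A minimal sketch of that quick fix, using the asker's setup from above (ordering by Date alone is enough here because each group has a single distinct date):
windowSpec = Window.orderBy(F.asc("Date"))
df.withColumn("group_number", F.dense_rank().over(windowSpec)).orderBy("Date").show()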
EDIT: If you were to group by the Group column, the following is another solution: you can use a user-defined function (UDF) as the second argument of df.withColumn(). In this UDF, you specify your input/output like a normal function. Something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def new_column(group):
    return ord(group) - 64  # Unicode integer equivalent, as "A" is 65

funct = udf(new_column, IntegerType())
df.withColumn("group_number", funct(df["Group"])).orderBy("Date").show()
If you were to use a UDF for the Date, you would need some way to keep track of the dates already seen. An example:
date_dict = {}

def new_column(date_obj):
    key = date_obj.strftime("%Y-%m-%d")
    if key in date_dict:                  # date already seen: reuse its number
        return date_dict[key]
    date_dict[key] = len(date_dict) + 1   # otherwise assign the next number
    return date_dict[key]
What you want is to rank over all the groups, not within each group, so you don't need to partition the Window; just ordering by Group and Date will give you the desired output:
windowSpec = Window.orderBy(F.asc("Group"), F.asc("Date"))
df.withColumn("group_number", F.dense_rank().over(windowSpec)).orderBy("Date").show()
#+-----+----------+------------+
#|Group| Date|group_number|
#+-----+----------+------------+
#| A|2018-01-03| 1|
#| A|2018-01-03| 1|
#| A|2018-01-03| 1|
#| B|2019-01-03| 2|
#| B|2019-01-03| 2|
#| B|2019-01-03| 2|
#| C|2020-01-03| 3|
#| C|2020-01-03| 3|
#| C|2020-01-03| 3|
#+-----+----------+------------+
And you surely don't need any UDF as the other answer suggests.

Flatten pyspark Dataframe to get timestamp for each particular value and field

I have tried to find the change in value for each attribute in the following manner:
from pyspark.sql import Window, functions as F

windowSpec = Window.partitionBy("attribute").orderBy(df_series["time"].asc())
final_df_series = df_series.withColumn("lagdate", F.lag(df_series["time"], -1).over(windowSpec))\
    .withColumn("value_lagvalue$df", F.lag(df_series["value"], -1).over(windowSpec))\
    .withColumn("value_grp$df", (F.col("value") - F.col("value_lagvalue$df")).cast("int"))\
    .filter(F.col("value_grp$df") != 0).drop(F.col("value_grp$df"))\
    .select("attribute", "lagdate", "value_lagvalue$df").persist()
The output of dataframe from above code is :
+---------+-------------------+-----------------+
|attribute| lagdate|value_lagvalue$df|
+---------+-------------------+-----------------+
| column93|2020-09-07 10:29:24| 3|
| column93|2020-09-07 10:29:38| 1|
| column93|2020-09-07 10:31:08| 0|
| column94|2020-09-07 10:29:26| 3|
| column94|2020-09-07 10:29:40| 1|
| column94|2020-09-07 10:31:18| 0|
|column281|2020-09-07 10:29:34| 3|
|column281|2020-09-07 10:29:54| 0|
|column281|2020-09-07 10:31:08| 3|
|column281|2020-09-07 10:31:13| 0|
|column281|2020-09-07 10:35:24| 3|
|column281|2020-09-07 10:36:08| 0|
|column282|2020-09-07 10:41:13| 3|
|column282|2020-09-07 10:49:24| 1|
|column284|2020-09-07 10:51:08| 1|
|column284|2020-09-07 11:01:13| 0|
|column285|2020-09-07 11:21:13| 1|
+---------+-------------------+-----------------+
I want to transform it into following structure
attribute,timestamp_3,timestamp_1,timestamp_0
column93,2020-09-07 10:29:24,2020-09-07 10:29:38,2020-09-07 10:31:08
column94,2020-09-07 10:29:26,2020-09-07 10:29:40,2020-09-07 10:31:18
column281,2020-09-07 10:29:34,null,2020-09-07 10:29:54
column281,2020-09-07 10:31:08,null,2020-09-07 10:31:13
column281,2020-09-07 10:35:24,null,2020-09-07 10:36:08
column282,2020-09-07 10:41:13,2020-09-07 10:49:24,null
column284,null,2020-09-07 10:51:08,2020-09-07 11:01:13
column285,null,2020-09-07 11:21:13,null
Any help appreciated. (A solution in PySpark is preferable, since it is better optimized for large dataframes of this kind, but a pandas solution is also very helpful.)
Update:
This article seems to achieve nearly the same thing. I hope the community can help me achieve the desired goal:
PySpark explode list into multiple columns based on name
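One possible way to get that structure (a sketch, not from the original thread): number the k-th occurrence of each value per attribute, then pivot so the k-th timestamps for 3, 1 and 0 land on the same row. This assumes the k-th occurrence of each value belongs to the k-th cycle of its attribute (which holds for the sample above) and that value_lagvalue$df holds the integer values 3, 1 and 0:
from pyspark.sql import Window, functions as F

# number the k-th occurrence of each (attribute, value) pair by time
occ = Window.partitionBy("attribute", "value_lagvalue$df").orderBy("lagdate")

result = (final_df_series
          .withColumn("occurrence", F.row_number().over(occ))
          .groupBy("attribute", "occurrence")
          .pivot("value_lagvalue$df", [3, 1, 0])   # assumed integer values
          .agg(F.first("lagdate"))
          .withColumnRenamed("3", "timestamp_3")
          .withColumnRenamed("1", "timestamp_1")
          .withColumnRenamed("0", "timestamp_0")
          .drop("occurrence"))
result.show(truncate=False)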

Filter A Dataframe Column based on the the values of another column [duplicate]

This question already has an answer here:
Pyspark filter dataframe by columns of another dataframe
(1 answer)
Closed 4 years ago.
Say I have two dataframes,
A:
| a | b | c |
| 1 | 2 | 3 |
B:
| a |
| 1 |
I want to filter the contents of dataframe A based on the values in column a of dataframe B. The equivalent WHERE clause in SQL looks like this:
WHERE NOT (A.a IN (SELECT a FROM B))
How can I achieve this?
To keep all the rows in the left table where there is a match in the right, you can use a leftsemi join. In this case, you only want to keep rows when there is no match in the right table, so you can use a leftanti join:
df = spark.createDataFrame([(1,2,3),(2,3,4)], ["a","b","c"])
df2 = spark.createDataFrame([(1,2)], ["a","b"])
df.join(df2,'a','leftanti').show()
df
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
df2
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
result
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 3| 4|
+---+---+---+
Hope this helps!

count values in multiple columns that contain a substring based on strings of lists pyspark

I have a data frame in PySpark like the one below. I want to count values in two columns based on some lists and populate new columns for each list.
df.show()
+---+-------------+--------------+
| id| device| device_model|
+---+-------------+--------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 2| iphone| apple iphone|
| 3| spy camera| |
| 3| cctv| cctv|
+---+-------------+--------------+
lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
I want to count the device and device_model values for each id and pivot them into a new data frame.
I want to count the values in both the device_model and device columns for each id that match the strings in the lists.
For example: phone_list contains the string iphone, so it should count both iphone and iphone5.
The result I want
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 4| 2| 2|
| 2| 2|null| 1|
| 3| null| 2| 3|
+---+------+----+--------+
I have tried the following:
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
Using the above I can only handle the device column, and only when the string matches exactly. I am unable to figure out how to do this for both columns and when the value merely contains the string.
How can I achieve the result I want?
Here is a working solution. I have used a udf function to check the strings and calculate the sum. You can use inbuilt functions if possible (comments are provided as explanation).
#creating dictionary for the lists with names for columns
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

#udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    total = 0
    for x in columnLists[name]:
        if x in device:
            total += 1
        if x in deviceModel:
            total += 1
    return total

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

#populating the sum returned from udf function to respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

#finally grouping and sum
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
which should give you
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
| 3| 0| 2| 3|
| 1| 4| 2| 2|
| 2| 2| 0| 1|
+---+-----+---+--------+
The aggregation part can be generalized like the rest. Improvements and modifications are all in your hands. :)
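If you do want to avoid the UDF, a rough equivalent using only built-in functions (a sketch, assuming the same columnLists dictionary as above and that device/device_model contain no nulls, as in the sample) could look like this. Unlike the UDF, it counts each column at most once per category, which gives the same result for this sample:
import pyspark.sql.functions as F

def match_count(keywords):
    # 1 per column whose value contains any of the keywords, summed over both columns
    pattern = "|".join(keywords)
    return sum(F.col(c).rlike(pattern).cast("int") for c in ["device", "device_model"])

df.groupBy("id").agg(
    *[F.sum(match_count(kw)).alias(name) for name, kw in columnLists.items()]
).show()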
