how to use if condition for the following case in pyspark? - python

I am working with a pyspark dataframe as shown below:
df1:
+-----------+-------+------------+----------+
|parsed_date| id| count| date|
+-----------+-------+------------+----------+
| 2018-01-16|1520036| 1277|2018-01-17|
| 2018-01-14|1516457| 767|2018-01-17|
| 2018-01-15|1518451| 1074|2018-01-17|
| 2018-01-24|1536787| 1306|2018-01-27|
| 2018-01-25|1537211| 1105|2018-01-27|
| 2018-01-26|1539203| 1100|2018-01-27|
| 2019-01-03|2325105| 1298|2019-01-16|
+-----------+-------+------------+----------+
I want to sum all the counts for the same date:
df2:
+----------+----------+
| date| sum |
+----------+----------+
|2018-01-17| 3118|
|2018-01-27| 3511|
|2019-01-16| 1298|
+----------+----------+
So far I could do the following inside a for loop over the different dates:
df1_list = []
for d in date_list:
    df1 = my_func(df, d)
    df1 = df1.withColumn("sum", F.sum("count").over(Window.partitionBy("date")))
    df1_list.append(df1)
full_df1 = reduce(DataFrame.unionAll, df1_list)
But now there can be a case where a date has no records in df1 (i.e. some date from date_list is not present in df1), in which case I want the sum to be zero, as shown below:
expected output:
example -> date_list: 2018-01-17, 2018-01-27, 2019-01-16, 2019-01-18
+----------+----------+
| date| sum |
+----------+----------+
|2018-01-17| 3118|
|2018-01-27| 3511|
|2019-01-16| 1298|
|2019-01-18| 0|
+----------+----------+
How can I use if condition (or any other logic) while making new column sum to get this done?

You can create a dataframe from date_list and do a left join to df, before a group by and sum:
import pyspark.sql.functions as F
date_list = ['2018-01-17', '2018-01-27', '2019-01-16', '2019-01-18']
date_df = spark.createDataFrame([[d] for d in date_list], 'date string')
result = (date_df.join(df, 'date', 'left')
          .fillna(0, 'count')
          .groupBy('date')
          .agg(F.sum('count').alias('sum'))
         )
result.show()
+----------+----+
| date| sum|
+----------+----+
|2018-01-17|3118|
|2019-01-16|1298|
|2018-01-27|3511|
|2019-01-18| 0|
+----------+----+
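If you prefer to skip the fillna step, a minimal sketch of an equivalent approach (assuming the same df and date_df as above) is to wrap the sum in coalesce, so dates with no matching rows get 0:
import pyspark.sql.functions as F

# coalesce turns the null sum of an empty group into 0
result = (date_df.join(df, 'date', 'left')
                 .groupBy('date')
                 .agg(F.coalesce(F.sum('count'), F.lit(0)).alias('sum')))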

Related

is there a way to get date difference over a certain column

I want to calculate the time difference/date difference for each unique name, i.e. how long it took for the status to get from order to arrived.
The input dataframe is like this:
+----------+---+-----+----------+
|      Date| id| name|     staus|
+----------+---+-----+----------+
|1986/10/15|  A| john|     order|
|1986/10/16|  A| john|dispatched|
|1986/10/18|  A| john|   arrived|
|1986/10/15|  B|peter|     order|
|1986/10/16|  B|peter|dispatched|
|1986/10/17|  B|peter|   arrived|
|1986/10/16|  C| raul|     order|
|1986/10/17|  C| raul|dispatched|
|1986/10/18|  C| raul|   arrived|
+----------+---+-----+----------+
The expected output dataset should look similar to this:
+---+-----+---------------------------------------+
| id| name|time_difference_from_order_to_delivered|
+---+-----+---------------------------------------+
|  A| john|                                  3days|
|  B|peter|                                  2days|
|  C| Raul|                                  2days|
+---+-----+---------------------------------------+
I am stuck on what logic to implement
You can group by and calculate the date diff using a conditional aggregation:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3|
| C| raul| 2|
| B|peter| 2|
+---+-----+---------+
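If you also want the literal "3days" formatting from the expected output, a small sketch (not part of the original answer) is to concatenate the day count with a string literal:
import pyspark.sql.functions as F

# format the integer day count as a string like "3days"
df2 = df2.withColumn('time_diff', F.concat(F.col('time_diff').cast('string'), F.lit('days')))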
You can also directly subtract the dates, which will return an interval type column:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3 days|
| C| raul| 2 days|
| B|peter| 2 days|
+---+-----+---------+
Assuming order is the earliest date and arrived is the last, just use aggregation and datediff():
select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;
For more precision, you can use conditional aggregation:
select id, name,
       datediff(max(case when staus = 'arrived' then date end),
                min(case when staus = 'order' then date end)
       ) as num_days
from t
group by id, name;
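If you want to run the SQL version from PySpark, a sketch (assuming the DataFrame is registered as a temp view named t, and Spark 2.2+ for the two-argument to_date, since Date is stored as 'yyyy/MM/dd' strings):
# register the DataFrame and run the conditional-aggregation SQL
df.createOrReplaceTempView('t')
result = spark.sql("""
    select id, name,
           datediff(to_date(max(case when staus = 'arrived' then Date end), 'yyyy/MM/dd'),
                    to_date(min(case when staus = 'order' then Date end), 'yyyy/MM/dd')) as num_days
    from t
    group by id, name
""")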

Pyspark dataframe apply function to a row and add row to bottom of dataframe

I have a df that only has one row.
+---+----+-----+------+
| id| id2|score|score2|
+---+----+-----+------+
|  0|   1|    4|     2|
+---+----+-----+------+
and I want to add a row of the percentages of these to the bottom, i.e. every number divided by 7:
|0/7| 1/7|  4/7|   2/7|
but the solution I came up with is incredibly slow
temp = [i/7 for i in df.collect()[0]]
row = sc.parallelize(Row(temp)).toDF()
df.union(row)
This took 21 seconds to run, almost all of it in the last two lines of code. Is there a better way to do this? My other thought was to transpose the table, since then this could easily be done with df.withColumn(). Ideally, I also want to filter out the column with 0, but I haven't really looked into that yet.
check this out and let me know if it helps
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([(0, 1, 4, 2)]).toDF(["id", "id2", "score", "score2"])
df2 = df.select(*[(F.col(column) / 7).alias(column) for column in df.columns])
df3 = df.union(df2)
df3.show()
+---+-------------------+------------------+------------------+
| id| id2| score| score2|
+---+-------------------+------------------+------------------+
|0.0| 1.0| 4.0| 2.0|
|0.0|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+---+-------------------+------------------+------------------+
If you want to filter out the column having 0, you can use the code below:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols)
df2 = df1.select(*[(F.col(column) / 7).alias(column) for column in df1.columns])
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+
| id2| score| score2|
+-------------------+------------------+------------------+
| 1.0| 4.0| 2.0|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|
+-------------------+------------------+------------------+
Please check the code below for a df that also has a type column:
non_zero_cols = [c for c in df.columns if df[[c]].first()[c] > 0]
df1 = df.select(*non_zero_cols, F.lit('count').alias('type'))
df2 = df1.select(*[(F.col(column) / 7).alias(column) for column in df1.columns if column != 'type'],
                 F.lit('percent').alias('type'))
df3 = df1.union(df2)
df3.show()
+-------------------+------------------+------------------+-------+
| id2| score| score2| type|
+-------------------+------------------+------------------+-------+
| 1.0| 4.0| 2.0| count|
|0.14285714285714285|0.5714285714285714|0.2857142857142857|percent|
+-------------------+------------------+------------------+-------+
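A small efficiency note (a sketch, not part of the answer above): df[[c]].first() launches a separate Spark job for every column. Since df only has one row, you can fetch it once and test each column locally:
# fetch the single row once, then filter columns in plain Python
first_row = df.first()
non_zero_cols = [c for c in df.columns if first_row[c] > 0]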

Pyspark - Calculate number of null values in each dataframe column

I have a dataframe with many columns. My aim is to produce a dataframe that lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
        for c in df_agg.columns
    )
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
    lambda a, b: a.union(b),
    (
        df.agg(
            F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
        ).select(F.lit(c).alias("Column_Name"), "NULL_Count")
        for c in df.columns
    )
)
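An alternative sketch that avoids the union loop entirely (assuming Spark 2.0+ for create_map; not part of the original answer) is to turn the single aggregated row into a map and explode it:
import pyspark.sql.functions as F
from itertools import chain

# one aggregation pass: count nulls per column, keeping the result as one row
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

# build a map of column name -> null count, then explode it into two columns
counts_map = F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in df_agg.columns))
df_agg_col = df_agg.select(F.explode(counts_map).alias("Column_Name", "NULL_Count"))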
A Scala alternative could be:
import org.apache.spark.sql.functions.{col, lit, sum}

case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()

Merging two PySpark DataFrame's gives unexpected results

I have two PySpark DataFrames (NOT pandas):
df1 =
+----------+--------------+-----------+---------+
|pk |num_id |num_pk |qty_users|
+----------+--------------+-----------+---------+
| 63479840| 12556940| 298620| 13|
| 63480030| 12557110| 298620| 9|
| 63835520| 12627890| 299750| 8|
df2 =
+----------+--------------+-----------+----------+
|pk2 |num_id2 |num_pk2 |qty_users2|
+----------+--------------+-----------+----------+
| 63479800| 11156940| 298620| 10 |
| 63480030| 12557110| 298620| 1 |
| 63835520| 12627890| 299750| 2 |
I want to join both DataFrames in order to get one DataFrame df:
+----------+--------------+-----------+---------+
|pk |num_id |num_pk |total |
+----------+--------------+-----------+---------+
| 63479840| 12556940| 298620| 13|
| 63479800| 11156940| 298620| 10|
| 63480030| 12557110| 298620| 10|
| 63835520| 12627890| 299750| 10|
The only condition for merging is that I want to sum up the values of qty_users for those rows that have the same values of < pk, num_id, num_pk > in df1 and df2. Just as I showed in the above example.
How can I do it?
UPDATE:
This is what I did:
newdf = df1.join(df2,(df1.pk==df2.pk2) & (df1.num_pk==df2.num_pk2) & (df1.num_id==df2.num_id2),'outer')
newdf = newdf.withColumn('total', sum(newdf[col] for col in ["qty_users","qty_users2"]))
But it gives me 9 columns instead of 4 columns. How to solve this issue?
The outer join will return all columns from both tables. Also, we have to fill the null values in qty_users and qty_users2, otherwise their sum will also be null.
Finally, we can select using the coalesce function:
from pyspark.sql import functions as F
newdf = df1.join(df2,(df1.pk==df2.pk2) & (df1.num_pk==df2.num_pk2) & (df1.num_id==df2.num_id2),'outer').fillna(0,subset=["qty_users","qty_users2"])
newdf = newdf.withColumn('total', sum(newdf[col] for col in ["qty_users","qty_users2"]))
newdf.select(*[F.coalesce(c1,c2).alias(c1) for c1,c2 in zip(df1.columns,df2.columns)][:-1]+['total']).show()
+--------+--------+------+-----+
| pk| num_id|num_pk|total|
+--------+--------+------+-----+
|63479840|12556940|298620| 13|
|63480030|12557110|298620| 10|
|63835520|12627890|299750| 10|
|63479800|11156940|298620| 10|
+--------+--------+------+-----+
Hope this helps!
Does this output what you want?
df3 = pd.concat([df1, df2]).groupby(['pk','num_id','num_pk'], as_index=False)['qty_users'].sum()
The merging of your 2 dataframes is achieved via pd.concat([df1, df2])
Finding the sum of the qty_users column when all other columns are the same first requires grouping by those columns
groupby(['pk','num_id','num_pk'], as_index=False)
and then finding the grouped sum of qty_users
['qty_users'].sum()
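If you want the same concat-then-group idea in PySpark rather than pandas, a sketch (assuming the column names shown in the question and Spark 2.3+ for unionByName):
import pyspark.sql.functions as F

# align df2's column names with df1, stack the rows, then aggregate
mapping = {'pk2': 'pk', 'num_id2': 'num_id', 'num_pk2': 'num_pk', 'qty_users2': 'qty_users'}
df2_renamed = df2
for old, new in mapping.items():
    df2_renamed = df2_renamed.withColumnRenamed(old, new)

result = (df1.unionByName(df2_renamed)
             .groupBy('pk', 'num_id', 'num_pk')
             .agg(F.sum('qty_users').alias('total')))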

In Pyspark, how do I populate a new column in one dataframe using calculations from another dataframe?

In df1 I have
+--------+----------+----------+
| ID1|start_date| stop_date|
+--------+----------+----------+
|50194077|2012-05-22|2012-05-25|
|50194077|2012-05-19|2012-05-22|
|50194077|2012-06-15|2012-06-18|
|50127135|2016-05-12|2016-05-15|
...
+--------+----------+----------+
In df2 I have
+----------+-------------------+------------------+
| ID2| date| X|
+----------+-------------------+------------------+
| 50127135|2016-06-10 00:00:00| 24.14699999999999|
| 50127135|2015-08-01 00:00:00|17.864999999999995|
| 50127135|2015-05-10 00:00:00|1.6829999999999998|
| 50127135|2014-07-02 00:00:00| 5.301000000000002|
...
+----------+-------------------+------------------+
I would like to add a column to df1 called X_sum which contains the sum of the X values that meet the conditions ID2 == ID1 and date is between start_date and stop_date.
I tried
def f(start_date, stop_date, ID, df2):
    sub_df2 = df2[df2['date'].between(start_date, stop_date) & df2.ID2 == ID]
    return sub_df2.select(F.sum(sub_df2['X'])).collect()[0][0]

udf_f = udf(cumulative_func, DoubleType())
df1 = df1.withColumn('X_sum',
                     udf_f(df1.start_date, df1.stop_date, df1.ID1, F.lit('X'), df2))
(and a few other variants), but I don't think pyspark likes how I'm trying to include df2.
I'm using python 2.7 and Spark 1.6.
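One way this kind of problem is often handled (a sketch, not an answer from the original thread; it assumes the column names above and that date, start_date and stop_date are comparable date/timestamp types) is to replace the UDF with a join on the ID and date-range condition, then aggregate:
import pyspark.sql.functions as F

# left join keeps every (ID1, start_date, stop_date) row even when no df2
# rows fall inside the window; coalesce turns the resulting null sum into 0
result = (df1.join(df2,
                   (df1.ID1 == df2.ID2) &
                   df2.date.between(df1.start_date, df1.stop_date),
                   'left')
             .groupBy('ID1', 'start_date', 'stop_date')
             .agg(F.coalesce(F.sum('X'), F.lit(0.0)).alias('X_sum')))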
