How to save pyspark 'for' loop output as a single dataframe? - python

I have a basic 'for' loop that shows the number of active customers each year. I can print the output, but I want it as a single table/dataframe with 2 columns (year and # customers), where each iteration of the loop creates 1 row in the table:
for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())

I was able to solve it by creating a blank dataframe with the desired schema outside the loop and using union, but I'm still curious if there's a shorter solution:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("year", IntegerType(), True), StructField("customer_count", IntegerType(), True)])
df2 = spark.createDataFrame([], schema=schema)

for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter(year(col('first_sale')) <= yr).count()
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)

display(df2)

I don't have your data, so I can't test if this works, but how about something like this:
year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
The view grp[['year', 'cumsum']] should give the same result as your for-loop.
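If you prefer to stay entirely in Spark (no toPandas), a similar cumulative count can be sketched with a window function. This is untested against your data and assumes the first_sale column from your loop; note that, unlike the explicit range(2018, 2023), years in which no new customers appear will not get a row:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count new customers per first_sale year, then take a running total.
# The unpartitioned window is fine here because the per-year aggregate is tiny.
yearly = df.groupBy(F.year("first_sale").alias("year")).agg(F.count("*").alias("new_customers"))
w = Window.orderBy("year").rowsBetween(Window.unboundedPreceding, Window.currentRow)
result = (yearly
          .withColumn("customer_count", F.sum("new_customers").over(w))
          .select("year", "customer_count")
          .orderBy("year"))
result.show()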

Related

how can I assign a row with Pyspark Dataframe?

Can you please convert the expression below from a Pandas DataFrame to a PySpark DataFrame? I am trying to find the equivalent of loc in PySpark.
import pandas as pd
df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}
df3.loc[len(df3)] = new_entry
In PySpark you need to union to add a new row to an existing data frame. But Spark data frames are unordered and there is no index as in pandas, so there is no exact equivalent of loc. For the example you gave, you would write it like this in PySpark:
from pyspark.sql.types import *

# 'months' is kept as a string here so the schema matches the example value 'month1'
schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])

df = spark.createDataFrame(sc.emptyRDD(), schema)

new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])
# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")

df = df.union(new_row_df)
df.show()
#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
If you want to add a row at a "specific position", you can create an index column using, for example, the row_number function over a defined ordering, then filter out the row number you want to assign the new row to before doing the union:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
# df.loc[1] = ...
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
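If instead you want to insert the new row at a position while keeping all existing rows, a rough sketch along the same lines (assuming the Devices/months columns above, untested) shifts the later row numbers down by one:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Tag existing rows with an index, make room at position 2, give the new row
# index 2, then union and restore the ordering.
indexed = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
shifted = indexed.withColumn(
    "rn", F.when(F.col("rn") >= 2, F.col("rn") + 1).otherwise(F.col("rn")))
new_indexed = new_row_df.withColumn("rn", F.lit(2))
df = shifted.unionByName(new_indexed).orderBy("rn").drop("rn")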

Pandas UDF in pyspark

I am trying to fill a series of observations on a Spark dataframe. Basically I have a list of days and I need to create the missing ones for each group.
In pandas there is the reindex function, which is not available in pyspark.
I tried to implement a pandas UDF:
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
This looks like it should do what I need, however it fails with this message: AttributeError: Can only use .dt accessor with datetimelike values. What am I doing wrong here?
Here the full code:
data = spark.createDataFrame(
    [(1, "2020-01-01", 0),
     (1, "2020-01-03", 42),
     (2, "2020-01-01", -1),
     (2, "2020-01-03", -2)],
    ('id', 'dates', 'value'))
data = data.withColumn('dates', col('dates').cast("date"))
schema = StructType([
    StructField('id', IntegerType()),
    StructField('dates', DateType()),
    StructField('value', DoubleType())])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
data = data.groupby('id').apply(reindex_by_date)
Ideally I would like something like this:
+---+----------+-----+
| id| dates|value|
+---+----------+-----+
| 1|2020-01-01| 0|
| 1|2020-01-02| 0|
| 1|2020-01-03| 42|
| 2|2020-01-01| -1|
| 2|2020-01-02| 0|
| 2|2020-01-03| -2|
+---+----------+-----+
Case 1: Each ID has an individual date range.
I would try to reduce the content of the udf as much as possible. In this case I would only calculate the date range per ID in the udf. For the other parts I would use Spark native functions.
from pyspark.sql import types as T
from pyspark.sql import functions as F
# Get min and max date per ID
date_ranges = data.groupby('id').agg(F.min('dates').alias('date_min'), F.max('dates').alias('date_max'))
# Calculate the date range for each ID
@F.udf(returnType=T.ArrayType(T.DateType()))
def get_date_range(date_min, date_max):
    return [t.date() for t in list(pd.date_range(date_min, date_max))]
# To get one row per potential date, we need to explode the UDF output
date_ranges = date_ranges.withColumn(
    'dates',
    F.explode(get_date_range(F.col('date_min'), F.col('date_max')))
)
date_ranges = date_ranges.drop('date_min', 'date_max')
# Add the value for existing entries and add 0 for others
result = date_ranges.join(
    data,
    ['id', 'dates'],
    'left'
)
result = result.fillna({'value': 0})
Case 2: All ids have the same date range
I think there is no need to use a UDF here. What you want can be achieved in a different way: First, you get all possible IDs and all necessary dates. Second, you crossJoin them, which will provide you with all possible combinations. Third, left join the original data onto the combinations. Fourth, replace the resulting null values with 0.
# Get all unique ids
ids_df = data.select('id').distinct()
# Get the date series
date_min, date_max = data.agg(F.min('dates'), F.max('dates')).collect()[0]
dates = [[t.date()] for t in list(pd.date_range(date_min, date_max))]
dates_df = spark.createDataFrame(data=dates, schema="dates:date")
# Calculate all combinations
all_combinations = ids_df.crossJoin(dates_df)
# Add the value column
result = all_combinations.join(
    data,
    ['id', 'dates'],
    'left'
)
# Replace all null values with 0
result = result.fillna({'value': 0})
Please be aware of the following limitations of this solution:
crossJoins can be quite costly. One potential solution to cope with the issue can be found in this related question.
The collect statement and the use of Pandas result in a Spark transformation that is not perfectly parallelised.
[EDIT] Split into two cases as I first thought all IDs have the same date range.
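If you do want to keep the UDF route, here is a hedged sketch using the newer applyInPandas API (Spark 3.x). It assumes the data and schema from the question; the main changes are converting the dates to pandas timestamps before reindexing, and restoring them as a regular column afterwards so the output matches the declared schema. Untested against the real data:
import pandas as pd

def reindex_by_date(pdf):
    gid = pdf['id'].iloc[0]
    # Work on a DatetimeIndex so reindex() matches the generated date range
    pdf = pdf.assign(dates=pd.to_datetime(pdf['dates'])).set_index('dates')
    full = pd.date_range(pdf.index.min(), pdf.index.max())
    out = pdf.reindex(full, fill_value=0).rename_axis('dates').reset_index()
    out['id'] = gid                                 # restore the group id on the filled rows
    out['dates'] = out['dates'].dt.date             # back to plain dates for DateType
    out['value'] = out['value'].astype('float64')   # the schema declares DoubleType
    return out

filled = data.groupby('id').applyInPandas(reindex_by_date, schema=schema)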

Fill Null values with mean of previous rows

Here is my sample data:
date,number
2018-06-24,13
2018-06-25,4
2018-06-26,5
2018-06-27,1
2017-06-24,3
2017-06-25,5
2017-06-26,2
2017-06-27,null
2016-06-24,3
2016-06-25,5
2016-06-26,2
2016-06-27,7
2015-06-24,8
2015-06-25,9
2015-06-26,12
2015-06-27,13
I need to fill null values with mean of previous year data.
That is, if '2017-06-27' is a null value, I need to fill it with the mean of the '2016-06-27' and '2015-06-27' values.
Output:
date,number
2018-06-24,13
2018-06-25,4
2018-06-26,5
2018-06-27,1
2017-06-24,3
2017-06-25,5
2017-06-26,2
2017-06-27,10
2016-06-24,3
2016-06-25,5
2016-06-26,2
2016-06-27,7
2015-06-24,8
2015-06-25,9
2015-06-26,12
2015-06-27,13
I used the code below, but it gives me the mean of everything in a particular partition.
I extracted date and month columns:
wingrp = Window.partitionBy('datee','month')
df = df.withColumn("TCount",avg(df["Count"]).over(wingrp))
Your solution is a step in the right direction (even though you’re not showing the columns you’ve added). You’ll want to partition by the month and the day of the month in your window, sort the resulting windows by the date column (so basically by year) and then limit the window to all preceding rows. Like this:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
schema = StructType([
    StructField("date", DateType(), True),
    StructField("number", IntegerType(), True)
])

df = spark.read.csv("your_data.csv",
                    header=True,
                    schema=schema)

wind = (Window
        .partitionBy(month(df.date), dayofmonth(df.date))
        .orderBy("date")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
        )

result = (df
          .withColumn("result",
                      coalesce(df.number, avg(df.number).over(wind)))
          )
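If you would rather overwrite the original column than keep a separate result column, a small follow-up (assuming the names above) could be:
df_filled = result.withColumn("number", col("result")).drop("result")
Note that the window average is a double, so the number column becomes a double after this step.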

How to transform a data frame in spark structured streaming using python?

I am testing structured streaming using a localhost socket from which it reads a stream of data. Input streaming data from localhost:
ID Subject Marks
--------------------
1 Maths 85
1 Physics 80
2 Maths 70
2 Physics 80
I would like to get the average marks for each unique ID.
I tried this but am not able to transform the DF, which has only a single value column.
Below is my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName("SrteamingAge").getOrCreate()
schema = StructType([StructField("ID", IntegerType(), \
True),StructField("Subject", StringType(), True),StructField("Marks", \
IntegerType(), True)])
marks = spark.readStream.format("socket").option("host",
"localhost").option("port", 9999).schema(schema).load()
marks.printSchema()
result = marks.groupBy("ID").agg(avg("Marks").alias("Average Marks"))
But I am getting the below error:
root
|-- value: string (nullable = true)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'ID' given input columns: [value];"
I am creating a schema for the same but no luck. Any help would be appreciated.
My expected output is just 2 columns (ID and Average Marks)
ID Average Marks
1 82.5
2 75
Your dataframe has no column named ID, but you are trying to group on it. You need to split the column named "value" like so:
df = marks \
    .withColumn("value", split(col("value"), "\\,")) \
    .select(
        col("value").getItem(0).cast("int").alias("ID"),
        col("value").getItem(1).alias("Subject"),
        col("value").getItem(2).cast("int").alias("Marks")) \
    .drop("value")
Then group on df:
result = df.groupBy("ID").agg(avg("Marks").as("Average Marks"))
Assumption: Input is of the form 1,Maths,85 and so on
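To actually see the running averages you still need to start the aggregation as a streaming query. A minimal sketch, assuming console output is fine for testing (streaming aggregations need the "complete" or "update" output mode):
query = (result.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()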

pyspark dataframe "condition should be string or Column"

I am unable to use a filter on a data frame. I keep getting the error "TypeError("condition should be string or Column")".
I have tried changing the filter to use a col object. Still, it does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
    StructField("fromLocation", StringType(), True),
    StructField("toLocation", StringType(), True),
    StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws errors as follows:
"cannot resolve 'productType' given input columns"
"condition should be string or Column"
In short, I am trying to solve problem 3 given in the link below using pyspark instead of scala. The dataset is also provided at the URL below.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should be able to get the desired result only for productType value 1
The productType column is dropped as soon as you do .select("toLocation").groupBy("toLocation").count(), which is why it cannot be resolved afterwards: the filter has to be applied before the aggregation. The easiest is to use a string condition:
answerthree = df.filter("productType = 1")\
    .groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
Alternatively, you can reference the data frame variable and use a column-based filter:
answerthree = df.filter(df['productType'] == 1)\
    .groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
