Fill Null values with mean of previous rows - python

Here is my sample data:
date,number
2018-06-24,13
2018-06-25,4
2018-06-26,5
2018-06-27,1
2017-06-24,3
2017-06-25,5
2017-06-26,2
2017-06-27,null
2016-06-24,3
2016-06-25,5
2016-06-26,2
2016-06-27,7
2015-06-24,8
2015-06-25,9
2015-06-26,12
2015-06-27,13
I need to fill null values with the mean of the same day in previous years.
For example, if '2017-06-27' is null, I need to fill it with the mean of the '2016-06-27' and '2015-06-27' values.
Expected output:
date,number
2018-06-24,13
2018-06-25,4
2018-06-26,5
2018-06-27,1
2017-06-24,3
2017-06-25,5
2017-06-26,2
2017-06-27,10
2016-06-24,3
2016-06-25,5
2016-06-26,2
2016-06-27,7
2015-06-24,8
2015-06-25,9
2015-06-26,12
2015-06-27,13
I used the code below, but it gives me the mean of everything in a particular partition.
I extracted the date and month columns first:
wingrp = Window.partitionBy('datee','month')
df = df.withColumn("TCount",avg(df["Count"]).over(wingrp))

Your solution is a step in the right direction (even though you’re not showing the columns you’ve added). You’ll want to partition by the month and the day of the month in your window, sort each partition by the date column (so effectively by year), and then limit the frame to the current row and all preceding rows. Because avg ignores nulls, a null row gets the mean of the earlier years only. Like this:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
schema = StructType([
    StructField("date", DateType(), True),
    StructField("number", IntegerType(), True)
])
df = spark.read.csv("your_data.csv",
                    header=True,
                    schema=schema)
# One partition per calendar day (month + day), ordered by year
wind = (Window
    .partitionBy(month(df.date), dayofmonth(df.date))
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
# Keep the original value where present, otherwise use the running mean
result = (df
    .withColumn("result",
                coalesce(df.number, avg(df.number).over(wind)))
)
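To check the window logic without a Spark session, here is a minimal plain-Python sketch of what the expression above computes. It is only a model of the window, not Spark code: the helper name fill_with_running_mean is made up for illustration, and None stands in for null. It uses a subset of the sample rows from the question.

```python
from collections import defaultdict
from datetime import date

# A subset of the sample rows from the question: (date, number).
# None stands in for a null value.
rows = [
    (date(2018, 6, 27), 1),
    (date(2017, 6, 27), None),
    (date(2016, 6, 27), 7),
    (date(2015, 6, 27), 13),
    (date(2017, 6, 26), 2),
    (date(2016, 6, 26), 2),
    (date(2015, 6, 26), 12),
]


def fill_with_running_mean(rows):
    """Model coalesce(number, avg(number) over the window) in plain Python."""
    # partitionBy(month, dayofmonth)
    partitions = defaultdict(list)
    for d, n in rows:
        partitions[(d.month, d.day)].append((d, n))

    filled = {}
    for part in partitions.values():
        part.sort(key=lambda r: r[0])  # orderBy("date")
        seen = []  # non-null values from the first row up to the current row
        for d, n in part:
            if n is not None:
                seen.append(n)
            # avg skips nulls; with no values at all the mean stays None
            mean = sum(seen) / len(seen) if seen else None
            filled[d] = n if n is not None else mean  # coalesce
    return filled


filled = fill_with_running_mean(rows)
print(filled[date(2017, 6, 27)])  # 10.0, the mean of 13 (2015) and 7 (2016)
```

Note that a null in the earliest year of a partition has nothing before it, so it stays null in this model.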

Related

How to save pyspark 'for' loop output as a single dataframe?

I have a basic 'for' loop that shows the number of active customers each year. I can print the output, but I want the output to be a single table/dataframe (with 2 columns: year and # customers, each iteration of the loop creates 1 row in the table)
for yr in range(2018, 2023):
    print(yr, df.filter(year(col('first_sale')) <= yr).count())
I was able to solve it by creating an empty dataframe with the desired schema outside the loop and using union, but I'm still curious whether there's a shorter solution.
from pyspark.sql.functions import col, year
from pyspark.sql.types import StructType, StructField, IntegerType
schema = StructType([StructField("year", IntegerType(), True), StructField("customer_count", IntegerType(), True)])
df2 = spark.createDataFrame([], schema=schema)
for yr in range(2018, 2023):
    c1 = yr
    c2 = df.filter((year(col('first_sale')) <= yr)).count()
    # One single-row dataframe per year, appended to the running result
    newRow = spark.createDataFrame([(c1, c2)], schema)
    df2 = df2.union(newRow)
display(df2)
I don't have your data, so I can't test if this works, but how about something like this:
year_col = year(col('first_sale')).alias('year')
grp = df.groupby(year_col).count().toPandas().sort_values('year').reset_index(drop=True)
grp['cumsum'] = grp['count'].cumsum()
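grp[['year', 'cumsum']] should then match the output of your for-loop, except that years without any first sale are missing from grp and years before 2018 appear as rows of their own.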
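To see why a cumulative sum over per-year counts gives the same numbers as the loop, here is a small plain-Python sketch. The list of first-sale years is made up for illustration; it is not your data.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical first-sale years, one entry per customer.
first_sale_years = [2017, 2018, 2018, 2020, 2021, 2021, 2022]

# Loop version: count customers whose first sale is in or before each year.
loop_counts = {
    yr: sum(1 for y in first_sale_years if y <= yr)
    for yr in range(2018, 2023)
}

# Grouped version: count per year, then take a running total.
per_year = Counter(first_sale_years)
years = sorted(per_year)
running = dict(zip(years, accumulate(per_year[y] for y in years)))

print(loop_counts)  # {2018: 3, 2019: 3, 2020: 4, 2021: 6, 2022: 7}
print(running)      # {2017: 1, 2018: 3, 2020: 4, 2021: 6, 2022: 7}
```

The two agree on every year that has at least one first sale. 2019 has none here, so it only shows up in the loop version, and 2017 only shows up in the grouped one.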

how can I assign a row with Pyspark Dataframe?

Can you please convert the expression below from pandas to a PySpark DataFrame? I am looking for the equivalent of loc in PySpark.
import pandas as pd
df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}
df3.loc[len(df3)] = new_entry
In PySpark you need a union to add a new row to an existing DataFrame. Spark DataFrames are unordered and have no index as in pandas, so there is no direct equivalent of loc. For the example you gave, you can write it like this in PySpark:
from pyspark.sql.types import *
# Both columns hold strings, matching the pandas example
schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])
df = spark.createDataFrame([], schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])
# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")
df = df.union(new_row_df)
df.show()
#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
If you want to add a row at a "specific position", you can create an index column, for example with the row_number function over a defined ordering, then filter out the row number you want to overwrite before doing the union:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
# df.loc[1] = ...
df = df.filter("rn <> 1").drop("rn").union(new_row_df)

Pandas UDF in pyspark

I am trying to fill in a series of observations in a Spark DataFrame. Basically, I have a list of days and I need to create the missing ones for each group.
In pandas there is the reindex function, which is not available in PySpark.
I tried to implement a pandas UDF:
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
This looks like it should do what I need, but it fails with this message:
AttributeError: Can only use .dt accessor with datetimelike values
What am I doing wrong here?
Here is the full code:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DoubleType
data = spark.createDataFrame(
    [(1, "2020-01-01", 0),
     (1, "2020-01-03", 42),
     (2, "2020-01-01", -1),
     (2, "2020-01-03", -2)],
    ('id', 'dates', 'value'))
data = data.withColumn('dates', col('dates').cast("date"))
schema = StructType([
    StructField('id', IntegerType()),
    StructField('dates', DateType()),
    StructField('value', DoubleType())])
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
data = data.groupby('id').apply(reindex_by_date)
Ideally I would like something like this:
+---+----------+-----+
| id|     dates|value|
+---+----------+-----+
|  1|2020-01-01|    0|
|  1|2020-01-02|    0|
|  1|2020-01-03|   42|
|  2|2020-01-01|   -1|
|  2|2020-01-02|    0|
|  2|2020-01-03|   -2|
+---+----------+-----+
Case 1: Each ID has an individual date range.
I would try to reduce the content of the udf as much as possible. In this case I would only calculate the date range per ID in the udf. For the other parts I would use Spark native functions.
import pandas as pd
from pyspark.sql import types as T
from pyspark.sql import functions as F
# Get min and max date per ID
date_ranges = data.groupby('id').agg(F.min('dates').alias('date_min'), F.max('dates').alias('date_max'))
# Calculate the date range for each ID
@F.udf(returnType=T.ArrayType(T.DateType()))
def get_date_range(date_min, date_max):
    return [t.date() for t in list(pd.date_range(date_min, date_max))]
# To get one row per potential date, we need to explode the UDF output
date_ranges = date_ranges.withColumn(
    'dates',
    F.explode(get_date_range(F.col('date_min'), F.col('date_max')))
)
date_ranges = date_ranges.drop('date_min', 'date_max')
# Add the value for existing entries and add 0 for others
result = date_ranges.join(
    data,
    ['id', 'dates'],
    'left'
)
result = result.fillna({'value': 0})
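If you would rather not depend on pandas inside the UDF, the date range itself can be built with the standard library. This is only a sketch of what the UDF body could look like: the function name date_range is made up here, and it assumes both arguments arrive as datetime.date objects. On Spark 2.4 or later it may also be worth checking whether the built-in F.sequence function lets you drop the UDF altogether.

```python
from datetime import date, timedelta


def date_range(date_min, date_max):
    """Return every calendar day from date_min to date_max, inclusive."""
    n_days = (date_max - date_min).days
    # range(n_days + 1) is empty when date_max is before date_min
    return [date_min + timedelta(days=i) for i in range(n_days + 1)]


print(date_range(date(2020, 1, 1), date(2020, 1, 3)))
# [datetime.date(2020, 1, 1), datetime.date(2020, 1, 2), datetime.date(2020, 1, 3)]
```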
Case 2: All ids have the same date range
I think there is no need to use a UDF here. What you want can be achieved in a different way: First, get all unique IDs and all necessary dates. Second, crossJoin them, which gives you all possible combinations. Third, left join the original data onto the combinations. Fourth, replace the resulting null values with 0.
# Get all unique ids
ids_df = data.select('id').distinct()
# Get the date series
date_min, date_max = data.agg(F.min('dates'), F.max('dates')).collect()[0]
dates = [[t.date()] for t in list(pd.date_range(date_min, date_max))]
dates_df = spark.createDataFrame(data=dates, schema="dates:date")
# Calculate all combinations
all_combinations = ids_df.crossJoin(dates_df)
# Add the value column
result = all_combinations.join(
    data,
    ['id', 'dates'],
    'left'
)
# Replace all null values with 0
result = result.fillna({'value': 0})
Please be aware of the following limitations of this solution:
crossJoins can be quite costly. One potential solution to cope with the issue can be found in this related question.
The collect statement and the use of pandas mean that the Spark transformation is not fully parallelised.
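For reference, the four steps of Case 2 can be modelled in plain Python on the sample data from the question. This is only a sketch of the logic, not Spark code: a dict lookup with a default plays the role of the left join followed by fillna.

```python
from datetime import date, timedelta
from itertools import product

# Sample data from the question, keyed by (id, dates).
data = {
    (1, date(2020, 1, 1)): 0,
    (1, date(2020, 1, 3)): 42,
    (2, date(2020, 1, 1)): -1,
    (2, date(2020, 1, 3)): -2,
}

# Step 1: all unique ids and the full date series.
ids = sorted({i for i, _ in data})
all_dates = [d for _, d in data]
date_min, date_max = min(all_dates), max(all_dates)
dates = [date_min + timedelta(days=k)
         for k in range((date_max - date_min).days + 1)]

# Steps 2 to 4: cross join, left join, and fill missing values with 0.
result = [(i, d, data.get((i, d), 0)) for i, d in product(ids, dates)]

for row in result:
    print(row)
```

This produces six rows, with value 0 for 2020-01-02 under both ids, which matches the table you asked for.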
[EDIT] Split into two cases as I first thought all IDs have the same date range.

How to add new column with min and max function in Pyspark and group by the data?

PySpark Dataframe: adobeDF
Adding new columns to the dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
adobeDF_new = adobeDF.withColumn('start_date', f.col('Date')).withColumn('end_date', f.col('Date'))
Result:
I am trying to figure out how to store the min(Date) values in start_date and the max(Date) values in end_date, and to group the final dataframe by post_evar10 and Type.
What I have tried: the code below works, but I want to see if there is a better way to do it and to limit the data to 60 days from start_date.
from pyspark.sql.window import Window
from pyspark.sql import functions as f
adobe_window = Window.partitionBy('post_evar10','Type').orderBy('Date')
adobeDF_new = adobeDF.withColumn('start_date', f.min(f.col('Date')).over(adobe_window)).withColumn('end_date', f.max(f.col('Date')).over(adobe_window))
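How about the following? It relies on the start_date and end_date columns added above, so it aggregates adobeDF_new and groups by both keys:
adobeDF_new.groupBy("post_evar10", "Type").agg(
    f.min("start_date").alias("min_start"),
    f.max("end_date").alias("max_end")
)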
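The question also asks about limiting the data to 60 days from start_date. Here is a plain-Python sketch of one way to read that requirement, assuming it means "within each group, keep only rows dated no later than 60 days after the group's earliest Date". The rows below are made up for illustration and this is not Spark code.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical rows: (post_evar10, Type, Date).
rows = [
    ("a", "x", date(2021, 1, 1)),
    ("a", "x", date(2021, 2, 15)),
    ("a", "x", date(2021, 4, 1)),   # more than 60 days after the first date
    ("b", "y", date(2021, 3, 10)),
]

# Group the dates by (post_evar10, Type).
groups = defaultdict(list)
for evar, kind, d in rows:
    groups[(evar, kind)].append(d)

summary = {}
for key, dates in groups.items():
    start = min(dates)                      # start_date
    cutoff = start + timedelta(days=60)     # assumed 60-day limit
    in_window = [d for d in dates if d <= cutoff]
    summary[key] = (start, max(in_window))  # (start_date, end_date)

print(summary[("a", "x")])  # start 2021-01-01, end 2021-02-15
```

In Spark the same idea would be an aggregation for the start date followed by a filter on the date difference, but I have not tested that against your data.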

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column in place, but you can override it. Also, you shouldn't need to create a user-defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
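However, if you need to specify the format, you can use unix_timestamp (from Spark 2.2 on, to_date also accepts a format argument):
from pyspark.sql.functions import *
# Avoid shadowing the built-in format and the imported col function
fmt = 'yyMM'
parsed = unix_timestamp(df['arrival_date'].substr(0, 4), fmt).cast('timestamp')
df = df.withColumn("arrival_date", parsed)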
All this can be found in the pyspark documentation.
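For comparison with the graphlab lambda in the question, here is what the two steps do to a single value in plain Python. The sample string is an assumption (the question does not show the actual data), and the helper name first_chars is made up. Two things to keep in mind when mapping this back to Spark: the documented Column.substr example uses a 1-based start position (substr(1, 3)), while Python slices are 0-based, and Python's strptime directives such as %Y are not the same as Spark's pattern letters such as yyyy.

```python
from datetime import datetime


def first_chars(value, length=4):
    """Keep only the first `length` characters of a string."""
    return value[:length]


arrival = "2018-06-24"  # hypothetical arrival_date value
year_text = first_chars(arrival)

# Parsing a bare year fills in January 1st for the missing month and day.
as_date = datetime.strptime(year_text, "%Y").date()

print(year_text)  # 2018
print(as_date)    # 2018-01-01
```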
To extract the first 4 characters from the arrival_date (StringType) column, create a new_df using a user-defined function (you cannot modify the columns in place: they are immutable):
from pyspark.sql.functions import udf, to_date
from pyspark.sql.types import StringType
old_df = spark.sql("SELECT * FROM table")
first_four = udf(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[first_four(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to convert the arrival_date (StringType) column into a DateType column, use the to_date function as shown below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
