Pyspark cumulative product using numpy - python

I want to compute a cumulative product. Previous successful answers use logarithmic sums to do the deed. However, is there a way to use NumPy's cumprod instead? I have tried with no clear result; here is my code:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
def cumulative_product(x):
    """Calculation of cumulative product using numpy function cumprod."""
    return np.cumprod(float(x)).tolist()
spark_cumulative_product = udf(cumulative_product, ArrayType(DoubleType()))
# the dataset in question:
param.show()
Which gives me for example:
+--------------+-----+
|financial_year| wpi|
+--------------+-----+
| 2014|1.026|
| 2015|1.024|
| 2016|1.021|
| 2017|1.019|
| 2018|1.021|
+--------------+-----+
When applying
param = param.withColumn('cum_wpi', spark_cumulative_product(param['wpi']))
param.show()
I get no change, i.e. each row only contains its own value:
+--------------+-----+-------+
|financial_year| wpi|cum_wpi|
+--------------+-----+-------+
| 2014|1.026|[1.026]|
| 2015|1.024|[1.024]|
| 2016|1.021|[1.021]|
| 2017|1.019|[1.019]|
| 2018|1.021|[1.021]|
+--------------+-----+-------+
Can anyone help with what is going wrong, or tell me if there is a better way to do cumprod without using exp-sum-log?
Update:
The desired output is:
+--------------+-----+-------+
|financial_year| wpi|cum_wpi|
+--------------+-----+-------+
| 2014|1.026| 1.026 |
| 2015|1.024| 1.051 |
| 2016|1.021| 1.073 |
| 2017|1.019| 1.093 |
| 2018|1.021| 1.116 |
+--------------+-----+-------+
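A plain-Python sketch of why the UDF in the question appears to do nothing: a regular UDF is called once per row, so the cumulative product only ever sees a single value. A running product needs the whole ordered column at once. The names `row_wise` and `running` are illustrative, and `itertools.accumulate` stands in for `np.cumprod` only to keep the sketch free of dependencies.

```python
from itertools import accumulate
from operator import mul

# Values from the question, already ordered by financial_year
wpi = [1.026, 1.024, 1.021, 1.019, 1.021]

# What a row-at-a-time UDF effectively computes: one call per value,
# so each "cumulative product" runs over a single element.
row_wise = [list(accumulate([x], mul)) for x in wpi]

# What is actually wanted: one pass over the whole ordered column.
running = [round(p, 3) for p in accumulate(wpi, mul)]

print(row_wise)  # one single-element list per row
print(running)   # the running product, rounded to 3 decimals
```

The second list matches the desired `cum_wpi` column, which is why the answers below reach for grouped or windowed operations rather than a row-wise UDF.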

One way to achieve this is with the pandas Series cumprod() function, applied through a pandas grouped map UDF.
Sample DataFrame:
#+--------------+-----+
#|financial_year| wpi|
#+--------------+-----+
#| 2014|1.026|
#| 2015|1.024|
#| 2016|1.021|
#| 2017|1.019|
#| 2018|1.021|
#+--------------+-----+
I will first create a dummy column, which will be similar to our cum_wpi. I will overwrite this dummy column in the pandas udf. The use of orderBy right before the groupby and apply is there to ensure that the dataframe is sorted on financial_year.
import pandas as pd
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
df1=df.withColumn("cum_wpi", F.lit(1.2456))
@pandas_udf(df1.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    # Overwrite the dummy column with the running product, rounded to 3 decimals
    df1['cum_wpi']=df1['wpi'].cumprod().round(decimals=3)
    return df1
df.orderBy(F.col("financial_year").asc())\
    .groupby().apply(grouped_map).show()
#+--------------+-----+-------+
#|financial_year| wpi|cum_wpi|
#+--------------+-----+-------+
#| 2014|1.026| 1.026|
#| 2015|1.024| 1.051|
#| 2016|1.021| 1.073|
#| 2017|1.019| 1.093|
#| 2018|1.021| 1.116|
#+--------------+-----+-------+
UPDATE:
You can also use aggregate, as mentioned earlier by @pault. As long as we cast acc (the accumulator) to double, it can handle your values.
df.withColumn("cum_wpi", F.expr("""format_number(aggregate(collect_list(wpi)\
over (order by financial_year)\
,cast(1 as double),(acc,x)-> acc*x),3)"""))\
.show(truncate=False)
#+--------------+-----+-------+
#|financial_year|wpi |cum_wpi|
#+--------------+-----+-------+
#|2014 |1.026|1.026 |
#|2015 |1.024|1.051 |
#|2016 |1.021|1.073 |
#|2017 |1.019|1.093 |
#|2018 |1.021|1.116 |
#+--------------+-----+-------+
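To make the expression above easier to follow, here is a minimal plain-Python sketch of what it does per row, assuming the ordered window hands each row the list of all values up to and including itself. The variable names are illustrative; `functools.reduce` plays the role of `aggregate`, and string formatting plays the role of `format_number`.

```python
from functools import reduce

rows = [(2014, 1.026), (2015, 1.024), (2016, 1.021), (2017, 1.019), (2018, 1.021)]
rows.sort(key=lambda r: r[0])  # "order by financial_year"

cum_wpi = []
for i in range(len(rows)):
    # collect_list(wpi) over the ordered window: every value up to the current row
    prefix = [wpi for _, wpi in rows[: i + 1]]
    # aggregate(prefix, 1.0, (acc, x) -> acc * x): fold the prefix into a product
    product = reduce(lambda acc, x: acc * x, prefix, 1.0)
    # format_number(..., 3) yields a string with 3 decimals
    cum_wpi.append(f"{product:.3f}")

print(cum_wpi)
```

Note that each row re-multiplies its whole prefix, so the work grows quadratically with the number of rows in the window. That is fine for a handful of years but worth keeping in mind for long series.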

Related

Pyspark use partition or groupby with agg and datediff

I'm new to Pyspark.
I would like to find the products that were not seen again more than 10 days after the first day they entered the store, then create a column in the dataframe that is 1 for these products and 0 for the rest.
First I need to group the data by product_id and find the maximum seen_date, then calculate the difference between import_date and max(seen_date) within each group, and finally create a new column based on the value of date_diff in each group.
Following is the code I used to first get the difference between import_date and seen_date, but it gives an error:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = (Window()
.partitionBy(df.product_id)
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn("date_diff", F.datediff(F.max(F.from_unixtime(F.col("import_date")).over(w)), F.from_unixtime(F.col("seen_date"))))
Error:
AnalysisException: It is not allowed to use a window function inside an aggregate function. Please use the inner window function in a sub-query.
This is the rest of my code to define a new column based on the date_diff:
not_seen = udf(lambda x: 0 if x >10 else 1, IntegerType())
df = df.withColumn('not_seen', not_seen("date_diff"))
Q: Can someone provide a fix for this code or a better approach to solve this problem?
sample data generation:
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
("123", "2014-05-06", "2014-06-11"),
("125", "2015-01-02", "2015-01-03"),
("125", "2015-01-02", "2015-01-04"),
("128", "2015-08-06", "2015-08-25")]
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2 = dfFromData2.withColumn("import_date",F.unix_timestamp(F.col("import_date"),'yyyy-MM-dd'))
dfFromData2 = dfFromData2.withColumn("seen_date",F.unix_timestamp(F.col("seen_date"),'yyyy-MM-dd'))
+----------+-----------+----------+
|product_id|import_date| seen_date|
+----------+-----------+----------+
| 123| 1399334400|1399420800|
| 123| 1399334400|1402444800|
| 125| 1420156800|1420243200|
| 125| 1420156800|1420329600|
| 128| 1438819200|1440460800|
+----------+-----------+----------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
columns = ["product_id","import_date", "seen_date"]
data = [("123", "2014-05-06", "2014-05-07"),
        ("123", "2014-05-06", "2014-06-11"),
        ("125", "2015-01-02", "2015-01-03"),
        ("125", "2015-01-02", "2015-01-04"),
        ("128", "2015-08-06", "2015-08-25")]
df = spark.createDataFrame(data).toDF(*columns)
# Parse the strings as dates so that datediff works on them directly
df = df.withColumn("import_date",F.to_date(F.col("import_date"),'yyyy-MM-dd'))
df = df.withColumn("seen_date",F.to_date(F.col("seen_date"),'yyyy-MM-dd'))
w = (Window()
     .partitionBy(df.product_id)
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df\
.withColumn('max_import_date', F.max(F.col("import_date")).over(w))\
.withColumn("date_diff", F.datediff(F.col('seen_date'), F.col('max_import_date')))\
.withColumn('not_seen', F.when(F.col('date_diff') > 10, 0).otherwise(1))\
.show()
+----------+-----------+----------+---------------+---------+--------+
|product_id|import_date| seen_date|max_import_date|date_diff|not_seen|
+----------+-----------+----------+---------------+---------+--------+
| 123| 2014-05-06|2014-05-07| 2014-05-06| 1| 1|
| 123| 2014-05-06|2014-06-11| 2014-05-06| 36| 0|
| 125| 2015-01-02|2015-01-03| 2015-01-02| 1| 1|
| 125| 2015-01-02|2015-01-04| 2015-01-02| 2| 1|
| 128| 2015-08-06|2015-08-25| 2015-08-06| 19| 0|
+----------+-----------+----------+---------------+---------+--------+
You can use the max windowing function to extract the max date.
dfFromData2 = dfFromData2.withColumn(
'not_seen',
F.expr('if(datediff(max(from_unixtime(seen_date)) over (partition by product_id), from_unixtime(import_date)) > 10, 1, 0)')
)
dfFromData2.show(truncate=False)
# +----------+-----------+----------+--------+
# |product_id|import_date|seen_date |not_seen|
# +----------+-----------+----------+--------+
# |125 |1420128000 |1420214400|0 |
# |125 |1420128000 |1420300800|0 |
# |123 |1399305600 |1399392000|1 |
# |123 |1399305600 |1402416000|1 |
# |128 |1438790400 |1440432000|1 |
# +----------+-----------+----------+--------+
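The per-product logic described in the question can be sketched in plain Python as follows. This is only an illustration of the grouping and date arithmetic, not Spark code. The flag polarity is an assumption taken from the question's own `not_seen` lambda (1 when the last sighting is within 10 days of the import date, 0 otherwise), and the names `last_seen` and `not_seen` are illustrative.

```python
from datetime import date

rows = [
    ("123", date(2014, 5, 6), date(2014, 5, 7)),
    ("123", date(2014, 5, 6), date(2014, 6, 11)),
    ("125", date(2015, 1, 2), date(2015, 1, 3)),
    ("125", date(2015, 1, 2), date(2015, 1, 4)),
    ("128", date(2015, 8, 6), date(2015, 8, 25)),
]

# max(seen_date) per product_id, i.e. the windowed max over the partition
last_seen = {}
for product_id, _, seen in rows:
    last_seen[product_id] = max(seen, last_seen.get(product_id, seen))

# datediff(max(seen_date), import_date), then the 10-day threshold
not_seen = {}
for product_id, imported, _ in rows:
    days = (last_seen[product_id] - imported).days
    not_seen[product_id] = 1 if days <= 10 else 0

print(not_seen)
```

With the sample data, products 123 and 128 were still seen 36 and 19 days after import, so only product 125 gets the flag under this convention.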

Replacing dots with commas on a pyspark dataframe

I'm using the code below to collect some info:
df = (
    df
    .select(
        date_format(date_trunc('month', col("reference_date")), 'yyyy-MM-dd').alias("month"),
        col("id"),
        col("name"),
        col("item_type"),
        col("sub_group"),
        col("latitude"),
        col("longitude")
    )
)
My latitude and longitude are values with dots, like this: -30.130307 -51.2060018, but I must replace the dot with a comma. I've tried both .replace() and .regexp_replace(), but neither works. Could you help me, please?
With the following dataframe as an example.
df.show()
+-------------------+-------------------+
| latitude| longitude|
+-------------------+-------------------+
| 85.70708380916193| -68.05674981929877|
| 57.074495803252404|-42.648691976080215|
| 2.944303748172473| -62.66186439333423|
| 119.76923402031701|-114.41179457810185|
|-138.52573939229234| 54.38429596238362|
+-------------------+-------------------+
You should be able to use spark.sql functions like the following
from pyspark.sql import functions
df = df.withColumn("longitude", functions.regexp_replace('longitude',r'[.]',","))
df = df.withColumn("latitude", functions.regexp_replace('latitude',r'[.]',","))
df.show()
+-------------------+-------------------+
| latitude| longitude|
+-------------------+-------------------+
| 85,70708380916193| -68,05674981929877|
| 57,074495803252404|-42,648691976080215|
| 2,944303748172473| -62,66186439333423|
| 119,76923402031701|-114,41179457810185|
|-138,52573939229234| 54,38429596238362|
+-------------------+-------------------+
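A small plain-Python sketch of the same substitution, using the coordinates from the question, shows one consequence worth keeping in mind: the result is text, not a number. The variable names are illustrative.

```python
import re

coords = [-30.130307, -51.2060018]  # values quoted in the question

# The replacement operates on text, so the numbers are rendered as strings first
as_text = [str(c) for c in coords]

# Same pattern as in the answer: a literal dot becomes a comma
with_commas = [re.sub(r"[.]", ",", s) for s in as_text]

print(with_commas)
```

After this step the values can no longer be used in arithmetic or passed to `float()` without converting the comma back, so it is probably best done as the last step before export.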

Use spark function result as input of another function

In my Spark application I have a dataframe with information like
+------------------+---------------+
| labels | labels_values |
+------------------+---------------+
| ['l1','l2','l3'] | 000 |
| ['l3','l4','l5'] | 100 |
+------------------+---------------+
What I am trying to achieve is to create, given a label name as input, a single_label_value column that takes the value for that label from the labels_values column.
For example, for label='l3' I would like to retrieve this output:
+------------------+---------------+--------------------+
| labels | labels_values | single_label_value |
+------------------+---------------+--------------------+
| ['l1','l2','l3'] | 000 | 0 |
| ['l3','l4','l5'] | 100 | 1 |
+------------------+---------------+--------------------+
Here's what I am attempting to use:
selected_label='l3'
label_position = F.array_position(my_df.labels, selected_label)
my_df= my_df.withColumn(
"single_label_value",
F.substring(my_df.labels_values, label_position, 1)
)
But I am getting an error because the substring function does not like the label_position argument.
Is there any way to combine these function outputs without writing an udf?
Hope, this will work for you.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark=SparkSession.builder.getOrCreate()
mydata=[[['l1','l2','l3'],'000'], [['l3','l4','l5'],'100']]
df = spark.createDataFrame(mydata,schema=["lebels","lebel_values"])
selected_label='l3'
df2=df.select(
"*",
(array_position(df.lebels,selected_label)-1).alias("pos_val"))
df2.createOrReplaceTempView("temp_table")
df3=spark.sql("select *,substring(lebel_values,pos_val,1) as val_pos from temp_table")
df3.show()
+------------+------------+-------+-------+
| lebels|lebel_values|pos_val|val_pos|
+------------+------------+-------+-------+
|[l1, l2, l3]| 000| 2| 0|
|[l3, l4, l5]| 100| 0| 1|
+------------+------------+-------+-------+
array_position gives the location of the value. If you want the exact index, you can subtract 1 from this value.
Edit: this works with a temp view. I am still looking for a solution using the withColumn option. I hope it helps for now.
Edit 2: answer using the DataFrame API.
df2=df.select(
"*",
(array_position(df.lebels,selected_label)-1).astype("int").alias("pos_val")
)
df3=df2.withColumn("asked_col",expr("substring(lebel_values,pos_val,1)"))
df3.show()
Try maybe:
import pyspark.sql.functions as f
from pyspark.sql.functions import *
selected_label='l3'
df=df.withColumn('single_label_value', f.substring(f.col('labels_values'), array_position(f.col('labels'), lit(selected_label))-1, 1))
df.show()
(for spark version >=2.4)
I think lit() was the function you were missing - you can use it to pass constant values to spark dataframes.
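One thing to double-check before relying on the `-1` offset used above: as far as I can tell from the Spark documentation, both `array_position` and `substring` count positions from 1, and the question's sample strings ('000' and '100') cannot reveal an off-by-one. The sketch below models that 1-based behaviour in plain Python. The helper names are hypothetical stand-ins, not Spark functions, and the value '001' is made-up test data chosen so that neighbouring characters differ.

```python
def array_position_1based(arr, value):
    """Hypothetical stand-in: 1-based index of value, or 0 if absent."""
    return arr.index(value) + 1 if value in arr else 0

def substring_1based(text, pos, length):
    """Hypothetical stand-in: only positive 1-based positions are modelled."""
    if pos < 1:
        raise ValueError("this sketch only models positions >= 1")
    return text[pos - 1 : pos - 1 + length]

def single_label_value(labels, labels_values, label):
    pos = array_position_1based(labels, label)
    if pos == 0:
        return None  # label not present in this row
    return substring_1based(labels_values, pos, 1)

rows = [(["l1", "l2", "l3"], "001"), (["l3", "l4", "l5"], "100")]
result = [single_label_value(labels, values, "l3") for labels, values in rows]

# Subtracting 1 first selects the neighbouring character instead
shifted = substring_1based("001", array_position_1based(["l1", "l2", "l3"], "l3") - 1, 1)

print(result, shifted)
```

If the real functions behave like these stand-ins, passing the position through unchanged is what picks the character belonging to the label. It is worth verifying on data where adjacent characters differ.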

Getting a value from DataFrame based on other column value (PySpark)

I have a Spark dataframe for which I want to get statistics:
stats_df = df.describe(['mycol'])
stats_df.show()
+-------+------------------+
|summary| mycol|
+-------+------------------+
| count| 300|
| mean| 2243|
| stddev| 319.419860456123|
| min| 1400|
| max| 3100|
+-------+------------------+
How do I extract the min and max values of mycol by looking them up in the summary column? And how do I do it by numeric index?
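You can filter on the summary column and assign the value from the resulting row to a variable. The comparison has to be made on the column, not on two Python strings:
x = stats_df.where(stats_df['summary'] == 'min').select('mycol').first()['mycol']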
Ok let's consider the following example :
from pyspark.sql.functions import rand, randn
df = sqlContext.range(1, 1000).toDF('mycol')
df.describe().show()
# +-------+-----------------+
# |summary| mycol|
# +-------+-----------------+
# | count| 999|
# | mean| 500.0|
# | stddev|288.5307609250702|
# | min| 1|
# | max| 999|
# +-------+-----------------+
If you want to access the row for stddev, for example, you just need to convert the result into an RDD, collect it and convert it into a dictionary as follows:
stats = dict(df.describe().rdd.map(lambda r : (r.summary,r.mycol)).collect())
print(stats['stddev'])
# 288.5307609250702
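A short sketch of the lookup step on its own, assuming the collected rows look like the `describe()` output shown above. `describe()` returns its values as strings, so a conversion is needed before doing arithmetic with them. The `rows` list is a hand-written stand-in for the collected result.

```python
# Stand-in for the collected (summary, mycol) pairs from describe()
rows = [
    ("count", "999"),
    ("mean", "500.0"),
    ("stddev", "288.5307609250702"),
    ("min", "1"),
    ("max", "999"),
]

# Lookup by summary name
stats = dict(rows)
min_val = float(stats["min"])
max_val = float(stats["max"])

# Lookup by position: row 3 is min and row 4 is max in this layout
min_by_index = float(rows[3][1])

print(min_val, max_val, min_by_index)
```

Looking up by name is probably the safer of the two, since it does not depend on the order in which the summary rows are returned.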

Choosing random items from a Spark GroupedData Object

I'm new to using Spark in Python and have been unable to solve this problem: After running groupBy on a pyspark.sql.dataframe.DataFrame
df = sqlsc.read.json("data.json")
df.groupBy('teamId')
how can you choose N random samples from each resulting group (grouped by teamId) without replacement?
I'm basically trying to choose N random users from each team, maybe using groupBy is wrong to start with?
Well, it is kind of wrong. GroupedData is not really designed for a data access. It just describes grouping criteria and provides aggregation methods. See my answer to Using groupBy in Spark and getting back to a DataFrame for more details.
Another problem with this idea is selecting N random samples. It is a task which is really hard to achieve in parallel without physical grouping of the data, and that is not something that happens when you call groupBy on a DataFrame.
There are at least two ways to handle this:
convert to RDD, groupBy and perform local sampling
import random
n = 3
def sample(iter, n):
    rs = random.Random() # We should probably use os.urandom as a seed
    return rs.sample(list(iter), n)
df = sqlContext.createDataFrame(
[(x, y, random.random()) for x in (1, 2, 3) for y in "abcdefghi"],
("teamId", "x1", "x2"))
grouped = df.rdd.map(lambda row: (row.teamId, row)).groupByKey()
sampled = sqlContext.createDataFrame(
grouped.flatMap(lambda kv: sample(kv[1], n)))
sampled.show()
## +------+---+-------------------+
## |teamId| x1| x2|
## +------+---+-------------------+
## | 1| g| 0.81921738561455|
## | 1| f| 0.8563875814036598|
## | 1| a| 0.9010425238735935|
## | 2| c| 0.3864428179837973|
## | 2| g|0.06233470405822805|
## | 2| d|0.37620872770129155|
## | 3| f| 0.7518901502732027|
## | 3| e| 0.5142305439671874|
## | 3| d| 0.6250620479303716|
## +------+---+-------------------+
use window functions
from pyspark.sql import Window
from pyspark.sql.functions import col, rand, row_number # rowNumber before Spark 1.6
w = Window.partitionBy(col("teamId")).orderBy(col("rnd_"))
sampled = (df
    .withColumn("rnd_", rand()) # Add random numbers column
    .withColumn("rn_", row_number().over(w)) # Add row number over window
    .where(col("rn_") <= n) # Take n observations
    .drop("rn_") # drop helper columns
    .drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 1| i| 0.8173912535268248|
## | 2| h| 0.10862995810038856|
## | 2| c| 0.3864428179837973|
## | 2| a| 0.6695356657072442|
## | 3| b|0.012329360826023095|
## | 3| a| 0.6450777858109182|
## | 3| e| 0.5142305439671874|
## +------+---+--------------------+
but I am afraid both will be rather expensive. If the size of the individual groups is balanced and relatively large, I would simply use DataFrame.randomSplit.
If the number of groups is relatively small, it is possible to try something else:
from pyspark.sql.functions import count, udf
from pyspark.sql.types import BooleanType
from operator import truediv
counts = (df
.groupBy(col("teamId"))
.agg(count("*").alias("n"))
.rdd.map(lambda r: (r.teamId, r.n))
.collectAsMap())
# This defines fraction of observations from a group which should
# be taken to get n values
counts_bd = sc.broadcast({k: truediv(n, v) for (k, v) in counts.items()})
to_take = udf(lambda k, rnd: rnd <= counts_bd.value.get(k), BooleanType())
sampled = (df
.withColumn("rnd_", rand())
.where(to_take(col("teamId"), col("rnd_")))
.drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| d| 0.14815204548854788|
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 2| a| 0.6695356657072442|
## | 2| d| 0.37620872770129155|
## | 2| g| 0.06233470405822805|
## | 3| b|0.012329360826023095|
## | 3| h| 0.9022527556458557|
## +------+---+--------------------+
In Spark 1.5+ you can replace udf with a call to sampleBy method:
df.sampleBy("teamId", counts_bd.value)
It won't give you exact number of observations but should be good enough most of the time as long as a number of observations per group is large enough to get proper samples. You can also use sampleByKey on a RDD in a similar way.
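The fraction-based idea can be sketched in plain Python to show why the per-group count is only approximate: each row is kept independently with probability n / group size. This is an illustration of the principle, not of Spark's implementation. The seed and variable names are arbitrary choices for the sketch.

```python
import random

rng = random.Random(42)  # arbitrary seed, only to make the sketch repeatable
n = 3
rows = [(team, member) for team in (1, 2, 3) for member in "abcdefghi"]

# Group sizes, as in the groupBy/count step
counts = {}
for team, _ in rows:
    counts[team] = counts.get(team, 0) + 1

# Fraction of each group to keep so that about n rows survive
fractions = {team: n / size for team, size in counts.items()}

# Keep each row independently with its group's probability
sampled = [row for row in rows if rng.random() <= fractions[row[0]]]

# Rows kept per team; usually close to n, but not guaranteed to equal it
per_team = {team: sum(1 for t, _ in sampled if t == team) for team in counts}
print(fractions, per_team)
```

Because every row is decided on its own, no grouping or sorting is needed, which is what makes this approach cheap compared with the exact methods.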
I found this approach more DataFrame-oriented than going the RDD way.
You can use a window function to create a ranking within each group, where the ranking is random to suit your case. Then you can filter on the number of samples (N) you want for each group:
from pyspark.sql import Window, functions as F
window_1 = Window.partitionBy(data['teamId']).orderBy(F.rand())
data_1 = data.select('*', F.rank().over(window_1).alias('rank')).filter(F.col('rank') <= N).drop('rank')
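The random-ranking idea can be illustrated without Spark: give every row a random key, sort each group by that key, and keep the first N. The sketch below is a plain-Python analogue with made-up data and an arbitrary seed; all names are illustrative.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # arbitrary seed, only to make the sketch repeatable
n = 3
rows = [(team, member) for team in (1, 2, 3) for member in "abcdefghi"]

# Attach a random key to every row (the role of F.rand())
groups = defaultdict(list)
for team, member in rows:
    groups[team].append((rng.random(), member))

# Order each group by its random key and keep the first n (rank <= N)
sampled = {team: [m for _, m in sorted(items)[:n]] for team, items in groups.items()}

print(sampled)
```

Unlike the fraction-based approach, this returns exactly N rows per group whenever the group has at least N rows. One caveat for the Spark version: rank() gives tied rows the same rank, so row_number() may be the safer choice if an exact count matters, although ties between random doubles should be very rare.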
Here's an alternative using the pandas DataFrame.sample method. It uses the Spark applyInPandas method, available from Spark 3.0.0, to distribute the groups. This allows you to select an exact number of rows per group.
I've added args and kwargs to the function so you can access the other arguments of DataFrame.sample.
def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group
df = spark.createDataFrame(
[
(1, 1.0),
(1, 2.0),
(2, 3.0),
(2, 5.0),
(2, 10.0)
],
("id", "v")
)
(df.groupBy("id")
.applyInPandas(
sample_n_per_group(2, random_state=2),
schema=df.schema
)
)
Be aware of the limitations for very large groups. From the documentation:
This function requires a full shuffle. All the data of a group will be
loaded into memory, so the user should be aware of the potential OOM
risk if data is skewed and certain groups are too large to fit in
memory.
See also here:
How take a random row from a PySpark DataFrame?
