Spark DataFrame operators (nunique, multiplication) - python

I'm using jupyter notebook with pandas, but when i use Spark, i want to use Spark DataFrame to convert or computation instead of Pandas. Please help me convert some computation to Spark DataFrame or RDD.
DataFrame:
df =
+--------+-------+---------+--------+
| userId | item | price | value |
+--------+-------+---------+--------+
| 169 | I0111 | 5300 | 1 |
| 169 | I0973 | 70 | 1 |
| 336 | C0174 | 455 | 1 |
| 336 | I0025 | 126 | 1 |
| 336 | I0973 | 4 | 1 |
| 770963 | B0166 | 2 | 1 |
| 1294537| I0110 | 90 | 1 |
+--------+-------+---------+--------+
1. Using Pandas computing:
(1) userItem = df.groupby(['userId'])['item'].nunique()
and result is a Series object:
+--------+------+
| userId | |
+--------+------+
| 169 | 2 |
| 336 | 3 |
| 770963 | 1 |
| 1294537| 1 |
+--------+------+
2. Using multiplication
data_sum = df.groupby(['userId', 'item'])['value'].sum() --> result is Series object
average_played = np.mean(userItem) --> result is number
(2) weighted_games_played = data_sum * (average_played / userItem)
Please help me using Spark DataFrame and Opertors on Spark to do this (1) and (2)

You can achieve (1) using something like the following:
import pyspark.sql.functions as f
userItem=df.groupby('userId').agg(f.expr('count(distinct item)').alias('n_item'))
and for (2):
data_sum=df.groupby(['userId','item']).agg(f.sum('value').alias('sum_value'))
average_played=userItem.agg(f.mean('n_item').alias('avg_played'))
data_sum=data_sum.join(userItem, on='userId').crossJoin(average_played)
data_sum=data_sum.withColumn("weighted_games_played", f.expr("sum_value*avg_played/n_item"))

You can define method like below:
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.linalg.{Vectors,Matrices,DenseVector}
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
import org.apache.spark.{SparkConf,SparkContext}
object retain {
implicit class DataFrameTransforms(left: DataFrame) {
val dftordd = left.rdd.map{case row =>
Vectors.dense(row.toSeq.toArray.map{x=>x.asInstanceOf[Double]})}
val leftRM = new RowMatrix(dftordd)
def multiply(right:DataFrame):DataFrame = {
val matrixC = right.columns.map(col(_))
val arr = right.select(array(matrixC:_*).as("arr")).as[Array[Double]].collect.flatten
val rows = right.count().toInt
val cols = matrixC.length
val rightRM = Matrices.dense(cols,rows,arr).transpose
val product = leftRM.multiply(rightRM).rows
val x = product.map(r=>r.toArray).collect.map(p=>Row(p: _*))
var schema = new StructType()
var i = 0
val t = cols
while (i < t) {
schema = schema.add(StructField(s"component${i}", DoubleType, true))
i = i + 1
}
val err = spark.createDataFrame(sc.parallelize(x),schema)
err
}
}
}
and before using just
import retain._
say you have two dataframes called df1(m×n) and df2(n×m)
df1.multiply(df2)

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have table which would look like that
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for specific ID, what I want to achieve is to group those two values and calculate new type using formula A/B, it should be applied to value_one and value_two, so table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark, and as for now I wasn't able to achieve described result, would appreciate any tips/solutions.
You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
reuslt_df = spark.sql(sql)
reuslt_df.show(truncate=False)

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id | nodes | tags |
|-----------|-----------------------------------------------------|---------------------|
| 62960871 | [783186590,783198852] | "{""foo"":""bar""}" |
| 211528816 | [2215187080,2215187140,2215187205,2215187256] | "{""foo"":""boo""}" |
| 62960872 | [783198772,783183397,783167527,783169067,783198772] | "{""foo"":""buh""}" |
I need to create a dataframe with a row for each consecutive combination of 2 nodes the list of nodes, then save it as parquet.
The expected result will have a length of n-1, with n len(nodes) for each rows. It would look like this (with other columns that I'll add):
| id | from | to | tags |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852 | 783186590 | 783198852 | "{""foo"":""bar""}" |
| 2215187080_2215187140 | 2215187080 | 2215187140 | "{""foo"":""boo""}" |
| 2215187140_2215187205 | 2215187140 | 2215187205 | "{""foo"":""boo""}" |
| 2215187205_2215187256 | 2215187205 | 2215187256 | "{""foo"":""boo""}" |
| 783198772_783183397 | 783198772 | 783183397 | "{""foo"":""buh""}" |
| 783183397_783167527 | 783183397 | 783167527 | "{""foo"":""buh""}" |
| 783167527_783169067 | 783167527 | 783169067 | "{""foo"":""buh""}" |
| 783169067_783198772 | 783169067 | 783198772 | "{""foo"":""buh""}" |
I tried to initiate with the following
from pyspark.sql.functions import udf
def split_ways_into_arcs(row):
arcs = []
for node in range(len(row['nodes']) - 1):
arc = dict()
arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
arc['from'] = row['nodes'][node]
arc['to'] = row['nodes'][node + 1]
arc['tags'] = row['tags']
arcs.append(arc)
return arcs
# Declare function as udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Iterate over the nodes array using transform and explode the array afterwards:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
.withColumn("nodes", F.explode("nodes")) \
.filter("not nodes.to is null") \
.selectExpr("concat_ws('_', nodes.to, nodes.from) as id", "nodes.*", "tags") \
.show(truncate=False)
Output:
+---------------------+----------+----------+-----------------+
|id |from |to |tags |
+---------------------+----------+----------+-----------------+
|783198852_783186590 |783186590 |783198852 |{""foo"":""bar""}|
|2215187140_2215187080|2215187080|2215187140|{""foo"":""boo""}|
|2215187205_2215187140|2215187140|2215187205|{""foo"":""boo""}|
|2215187256_2215187205|2215187205|2215187256|{""foo"":""boo""}|
|783183397_783198772 |783198772 |783183397 |{""foo"":""buh""}|
|783167527_783183397 |783183397 |783167527 |{""foo"":""buh""}|
|783169067_783167527 |783167527 |783169067 |{""foo"":""buh""}|
|783198772_783169067 |783169067 |783198772 |{""foo"":""buh""}|
+---------------------+----------+----------+-----------------+

Splitting a csv into multiple csv's depending on what is in column 1 using python

so I currently have a large csv containing data for a number of events.
Column one contains a number of dates as well as some id's for each event for example.
Basically I want to write something within Python that whenever there is an id number (AL.....) it creates a new csv with the id number as the title with all the data in it before the next id number so i end up with a csv for each event.
For info the whole csv contains 8 columns but the division into individual csvs is only predicated on column one
Use Python to split a CSV file with multiple headers
I notice this questions is quite similar but in my case I I have AL and then a different string of numbers after it each time and also I want to call the new csvs by the id numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So, interesting positions are 0 and 10 as there are the AL* strings...
Now to filter the AL* you can use:
idx = df.index[df['idx'].str.startswith('AL')] # get's you all index where AL is
dfs = np.split(df, idx) # splits the data
for out in dfs[1:]:
name = out.iloc[0, 0]
out.to_csv(name + ".csv", index=False, header=False) # saves the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.

Iterate pyspark dataframe rows and apply UDF

I have a dataframe that looks like this:
partitionCol orderCol valueCol
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by the partitionCol, then within each partition to iterate over the rows, ordered by orderCol and apply some function to calculate a new column based on the valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
tmp = <some value based on a condition between col_value and cached_value>
<update the cached_value using some logic>
return tmp
I understand I need to groupby the partitionCol and apply a UDF that will operate on each chink separately, but struggling to find a good way to iterate the rows and applying the logic I described, to get a desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol -
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply an UDF on the whole set of data :
# first, you create a struct with the order col and the valu col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol'))
# then you create an array of that new column
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data")
# finaly, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data"))
But without knowing exactly what you want to do, that is all I can offer.

How to add conditional 'if' using 'map(+lambda)' function under Python

I have an example csv file with name 'r2.csv':
Factory | Product_Number | Date | Avg_Noshow | Walk_Cost | Room_Rev
-------------------------------------------------------------------------
A | 1 | 01MAY2017 | 5.6 | 125 | 275
-------------------------------------------------------------------------
A | 1 | 02MAY2017 | 0 | 200 | 300
-------------------------------------------------------------------------
A | 1 | 03MAY2017 | 6.6 | 150 | 250
-------------------------------------------------------------------------
A | 1 | 04MAY2017 | 7.5 | 175 | 325
-------------------------------------------------------------------------
And I would like to read the file and calculate as output using the following code:
I have the following python code to read a csv file and transfer the columns to arrays:
# Read csv file
import numpy as np
import scipy.stats as stats
from scipy.stats import poisson, norm
import csv
with open('r2.csv', 'r') as infile:
reader = csv.DictReader(infile)
data = {}
for row in reader:
for header, value in row.items():
try:
data[header].append(value)
except KeyError:
data[header] = [value]
# Transfer the column from list to arrays for later calculation.
mu = data['Avg_Noshow']
cs = data['Walk_Cost']
co = data['Room_Rev']
mu = map(float,mu)
cs = map(float,cs)
co = map(float,co)
The prior part works fine and it reads data. Following is the function for calculation.
# The following 'map()' function calculates Overbooking number
Overbooking_Number = map(lambda mu_,cs_,co_:np.ceil(poisson.ppf(co_/(cs_+co_),mu_)),mu,cs,co)
data['Overbooking_Number'] = Overbooking_Number
header = 'LOC_ID', 'Prod_Class_ID', 'Date', 'Avg_Noshow', 'Walk_Cost', 'Room_Rev', 'Overbooking_Number'
# Write to an output file
with open("output.csv",'wb') as resultFile:
wr = csv.writer(resultFile,quoting=csv.QUOTE_ALL)
wr.writerow(header)
z = zip(data['LOC_ID'],data['Prod_Class_ID'],data['Date'],data['Avg_Noshow'],data['Walk_Cost'],data['Room_Rev'],data['Overbooking_Number'])
for i in z:
wr.writerow(i)
It works fine as well.
However, if I would like to calculate and output 'Overbooking_Number' using the above function only if 'Avg_Noshow > 0' and output 'Overbooking_Number = 0' if 'Avg_Noshow = 0'?
For example, the output table may look like below:
Factory | Product_Number | Date | Avg_Noshow | Walk_Cost | Room_Rev | Overbooking_Number
----------------------------------------------------------------------------------------------
A | 1 | 01MAY2017 | 5.6 | 125 | 275 | ...
----------------------------------------------------------------------------------------------
A | 1 | 02MAY2017 | 0 | 200 | 300 | 0
----------------------------------------------------------------------------------------------
A | 1 | 03MAY2017 | 6.6 | 150 | 250 | ...
----------------------------------------------------------------------------------------------
A | 1 | 04MAY2017 | 7.5 | 175 | 325 | ...
----------------------------------------------------------------------------------------------
What shall I add a conditional if into my map(+lambda) function?
Thank you!
If I understand correctly, the condition is that mu should be higher than zero. In this case, I think you can simply use Python's "ternary operator" like this:
Overbooking_Number = map(lambda mu_, cs_, co_:
(np.ceil(poisson.ppf(co_ / (cs_ + co_), mu_)
if mu_ > 0 else 0),
mu, cs, co)

Categories

Resources