I'm facing a heavy data transformation. In a nutshell, I have columns of data, each containing strings which correspond to some ordinals. For example, HIGH, MID and LOW. My objective is to map these strings to integers which will preserve the order. In this case, LOW -> 0, MID -> 1 and HIGH -> 2.
Here is a simple function generating such data:
def fresh_df(N=100000, seed=None):
    np.random.seed(seed)
    feat1 = np.random.choice(["HI", "LO", "MID"], size=N)
    feat2 = np.random.choice(["SMALL", "MEDIUM", "LARGE"], size=N)
    pdf = pd.DataFrame({
        "feat1": feat1,
        "feat2": feat2
    })
    return spark.createDataFrame(pdf)
My first approach was:
feat1_dict = {"HI": 1, "MID": 2, "LO": 3}
feat2_dict = {"SMALL": 0, "MEDIUM": 1, "LARGE": 2}

mappings = {
    "feat1": F.create_map([F.lit(x) for x in chain(*feat1_dict.items())]),
    "feat2": F.create_map([F.lit(x) for x in chain(*feat2_dict.items())])
}

for col in df.columns:
    col_map = mappings[col]
    df = df.withColumn(col + "_mapped", col_map[df[col]])
This works as expected, but in reality it turns out to be slow and I wanted to optimize the process. I read about pandas_udf and it gave me hope. Here is the modified code:
feats_dict = {
    "feat1": feat1_dict,
    "feat2": feat2_dict
}

for col_name in df.columns:
    @F.pandas_udf('integer', F.PandasUDFType.SCALAR)
    def map_map(col):
        return col.map(feats_dict[col_name])

    df = df.withColumn(col_name + "_mapped", map_map(df[col_name]))
Alas! When comparing the two versions there was no improvement in execution time. I compared them on a local instance of Spark (using Docker) and on a 5-node EMR cluster (with the default configurations).
I created a notebook where you can see all the code. In general, I used the following imports:
import numpy as np
import pandas as pd
from itertools import chain
from pyspark.sql import functions as F
What am I missing? Why is this process so slow, and why is there no improvement when using pandas_udf?
Why so slow? Because Spark runs in the JVM while PySpark code runs in a separate Python process, so evaluating a Python UDF requires serializing the data, moving it between the two processes, and deserializing it again.
You can map the values with the when and otherwise functions instead, which keeps the computation inside the JVM, avoids the serialize/deserialize round trip, and increases performance.
import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.shell import spark
def fresh_df(n=100000, seed=None):
    np.random.seed(seed)
    feat1 = np.random.choice(["HI", "LO", "MID"], size=n)
    feat2 = np.random.choice(["SMALL", "MEDIUM", "LARGE"], size=n)
    pdf = pd.DataFrame({
        "feat1": feat1,
        "feat2": feat2
    })
    return spark.createDataFrame(pdf)
df = fresh_df()
df = df.withColumn('feat1_mapped', f
                   .when(df.feat1 == f.lit('HI'), 1)
                   .otherwise(f.when(df.feat1 == f.lit('MID'), 2).otherwise(3)))

df = df.withColumn('feat2_mapped', f
                   .when(df.feat2 == f.lit('SMALL'), 0)
                   .otherwise(f.when(df.feat2 == f.lit('MEDIUM'), 1).otherwise(2)))
df.show(n=20)
Output
+-----+------+------------+------------+
|feat1| feat2|feat1_mapped|feat2_mapped|
+-----+------+------------+------------+
| LO| SMALL| 3| 0|
| LO|MEDIUM| 3| 1|
| MID|MEDIUM| 2| 1|
| MID| SMALL| 2| 0|
| MID| LARGE| 2| 2|
| MID| SMALL| 2| 0|
| LO| SMALL| 3| 0|
| MID| LARGE| 2| 2|
| MID| LARGE| 2| 2|
| MID| SMALL| 2| 0|
| MID|MEDIUM| 2| 1|
| LO| LARGE| 3| 2|
| HI|MEDIUM| 1| 1|
| LO| SMALL| 3| 0|
| HI|MEDIUM| 1| 1|
| MID| SMALL| 2| 0|
| MID|MEDIUM| 2| 1|
| HI| SMALL| 1| 0|
| HI| LARGE| 1| 2|
| MID| LARGE| 2| 2|
+-----+------+------------+------------+
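As a follow-up, if the mappings should stay driven by the original dictionaries (useful when there are many levels), the same when/otherwise chain can be built programmatically. A minimal sketch; ordinal_column is just an illustrative helper name, not part of the answer above:

import pyspark.sql.functions as f

feat1_dict = {"HI": 1, "MID": 2, "LO": 3}
feat2_dict = {"SMALL": 0, "MEDIUM": 1, "LARGE": 2}

def ordinal_column(col_name, mapping, default=None):
    # Fold the dict into a chained when(...).when(...).otherwise(...) expression,
    # which Spark evaluates natively with no Python serialization round trip.
    items = list(mapping.items())
    expr = f.when(f.col(col_name) == f.lit(items[0][0]), f.lit(items[0][1]))
    for key, value in items[1:]:
        expr = expr.when(f.col(col_name) == f.lit(key), f.lit(value))
    return expr.otherwise(f.lit(default))

df = fresh_df()
df = df.withColumn("feat1_mapped", ordinal_column("feat1", feat1_dict))
df = df.withColumn("feat2_mapped", ordinal_column("feat2", feat2_dict))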
I have a dataframe like this
+---+---------------------+
| id| csv|
+---+---------------------+
| 1|a,b,c\n1,2,3\n2,3,4\n|
| 2|a,b,c\n3,4,5\n4,5,6\n|
| 3|a,b,c\n5,6,7\n6,7,8\n|
+---+---------------------+
and I want to explode the string type csv column, in fact I'm only interested in this column. So I'm looking for a method to obtain the following dataframe from the above.
+--+--+--+
| a| b| c|
+--+--+--+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
| 6| 7| 8|
+--+--+--+
Looking at the from_csv documentation it seems that the input csv string can contain only one row of data, which I found stated more clearly here. So that's not an option.
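(A minimal illustration of that limitation, assuming the usual spark session: from_csv maps each value to exactly one parsed record.)

from pyspark.sql import functions as F

# Each value holds a single CSV record; from_csv returns one struct per value.
one_rec = spark.createDataFrame([("1,2,3",), ("2,3,4",)], ["csv"])
one_rec.select(F.from_csv("csv", "a int, b int, c int").alias("rec")).select("rec.*").show()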
I guess I could loop over the individual rows of the input dataframe, extract and parse the csv string from each row and then stitch everything together:
rows = df.collect()
for (i, row) in enumerate(rows):
    data = row['csv']
    data = data.split('\\n')
    rdd = spark.sparkContext.parallelize(data)
    df_row = (spark.read
              .option('header', 'true')
              .schema('a int, b int, c int')
              .csv(rdd))
    if i == 0:
        df_new = df_row
    else:
        df_new = df_new.union(df_row)
df_new.show()
But that seems awfully inefficient. Is there a better way to achieve the desired result?
Using split + from_csv functions along with transform you can do something like:
from pyspark.sql import functions as F
df = spark.createDataFrame([
(1, r"a,b,c\n1,2,3\n2,3,4\n"), (2, r"a,b,c\n3,4,5\n4,5,6\n"),
(3, r"a,b,c\n5,6,7\n6,7,8\n")], ["id", "csv"]
)
df1 = df.withColumn(
    "csv",
    F.transform(
        F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
        lambda x: F.from_csv(x, "a int, b int, c int")
    )
).selectExpr("inline(csv)")
df1.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 1| 2| 3|
# | 2| 3| 4|
# | 3| 4| 5|
# | 4| 5| 6|
# | 5| 6| 7|
# | 6| 7| 8|
# +---+---+---+
I want to calculate the number of rows that satisfy a condition on a very large dataframe, which can be achieved by
df.filter(col("value") >= thresh).count()
I want to know the result for each threshold in the range [1, 10]. Enumerating each threshold and running this action scans the dataframe 10 times, which is slow.
Can I achieve this by scanning the df only once?
Create an indicator column for each threshold, then sum:
import random
import pyspark.sql.functions as F
from pyspark.sql import Row
df = spark.createDataFrame([Row(value=random.randint(0,10)) for _ in range(1_000_000)])
df.select([
    (F.col("value") >= thresh)
    .cast("int")
    .alias(f"ind_{thresh}")
    for thresh in range(1, 11)
]).groupBy().sum().show()
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# |sum(ind_1)|sum(ind_2)|sum(ind_3)|sum(ind_4)|sum(ind_5)|sum(ind_6)|sum(ind_7)|sum(ind_8)|sum(ind_9)|sum(ind_10)|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
# | 908971| 818171| 727240| 636334| 545463| 454279| 363143| 272460| 181729| 90965|
# +----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+
Using conditional aggregation with when expressions should do the job.
Here's an example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (4,), (6,), (7,)], ["value"])
count_expr = [
    F.count(F.when(F.col("value") >= th, 1)).alias(f"gte_{th}")
    for th in range(1, 11)
]
df.select(*count_expr).show()
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#|gte_1|gte_2|gte_3|gte_4|gte_5|gte_6|gte_7|gte_8|gte_9|gte_10|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
#| 7| 6| 5| 4| 2| 2| 1| 0| 0| 0|
#+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
Using a user-defined function udf from pyspark.sql.functions:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100, size=(20)), columns=['val'])
thres = [90, 80, 30] # these are the thresholds
thres.sort(reverse=True) # list needs to be sorted
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("myApp") \
    .getOrCreate()
sparkDF = spark.createDataFrame(df)
myUdf = udf(lambda x: 0 if x>thres[0] else 1 if x>thres[1] else 2 if x>thres[2] else 3)
sparkDF = sparkDF.withColumn("rank", myUdf(sparkDF.val))
sparkDF.show()
# +---+----+
# |val|rank|
# +---+----+
# | 28| 3|
# | 54| 2|
# | 19| 3|
# | 4| 3|
# | 74| 2|
# | 62| 2|
# | 95| 0|
# | 19| 3|
# | 55| 2|
# | 62| 2|
# | 33| 2|
# | 93| 0|
# | 81| 1|
# | 41| 2|
# | 80| 2|
# | 53| 2|
# | 14| 3|
# | 16| 3|
# | 30| 3|
# | 77| 2|
# +---+----+
sparkDF.groupby(['rank']).count().show()
# Out:
# +----+-----+
# |rank|count|
# +----+-----+
# | 3| 7|
# | 0| 2|
# | 1| 1|
# | 2| 10|
# +----+-----+
A value gets rank i if it is strictly greater than thres[i] but smaller than or equal to thres[i-1]. This should minimize the number of comparisons.
For thres = [90, 80, 30] the ranks are: 0 -> (90, max], 1 -> (80, 90], 2 -> (30, 80], 3 -> [min, 30].
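For larger data, the same ranking can also be built from the sorted thres list with native when expressions instead of a Python udf. A sketch, not part of the answer above; it produces integer ranks rather than the strings returned by the udf:

import pyspark.sql.functions as F

thres = [90, 80, 30]  # sorted in descending order, as above

# rank 0 for values above thres[0], rank 1 for values above thres[1], ..., len(thres) otherwise
rank_expr = F.when(F.col("val") > thres[0], 0)
for i, t in enumerate(thres[1:], start=1):
    rank_expr = rank_expr.when(F.col("val") > t, i)
rank_expr = rank_expr.otherwise(len(thres))

sparkDF = sparkDF.withColumn("rank", rank_expr)
sparkDF.groupby(["rank"]).count().show()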
I wanted to apply a function to some columns of a Spark DataFrame using different methods: fn and fn1. Here is how I did that:
def fn(x):
    return x * 2

udf_1 = udf(fn, DecimalType())

def fn1(x):
    return x * 3

udf_2 = udf(fn1, DecimalType())

def process_df1(df, col_name):
    df1 = df.withColumn(col_name, udf_1(col_name))
    return df1

def process_df2(df, col_name):
    df2 = df.withColumn(col_name, udf_2(col_name))
    return df2
For a single column it works fine. But now I get a list of dicts containing information on various columns:
cols_info = [{'col_name': 'metric_1', 'process': 'True', 'method':'simple'}, {'col_name': 'metric_2', 'process': 'False', 'method':'hash'}]
How should I parse the cols_info list and apply the above logic only to the columns that have process:True and use a required method?
The first thing that comes to mind is to filter out columns with process:False
list(filter(lambda col_info: col_info['process'] == 'True', cols_info))
But I'm still missing a more generic approach here.
The selectExpr function will be useful here:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(1,3,4,1),(1,4,5,5),(1,6,7,8),(2,1,9,2),(2,2,9,9)],schema=['col1','col2','col3','col4'])
def fn(x):
    return x * 2

def fn1(x):
    return x * 3
sqlContext.udf.register("fn1", fn)
sqlContext.udf.register("fn2", fn1)
cols_info =[{'col_name':'col1','encrypt':False,},{'col_name':'col2','encrypt':True,'method':'fn1'},{'col_name':'col3','encrypt':True,'method':'fn2'}]
# determine which columns have to be encrypted
modified_columns = [x['col_name'] for x in cols_info if x['encrypt']]
# select which columns have to be retained as-is
columns_retain = list(set(tst.columns) - set(modified_columns))
expr = columns_retain + [((x['method']) + '(' + (x['col_name']) + ') as ' + x['col_name']) for x in cols_info if x['encrypt']]
tst_res = tst.selectExpr(*expr)
The results will be:
+----+----+----+----+
|col4|col1|col2|col3|
+----+----+----+----+
| 4| 1| 4| 9|
| 1| 1| 6| 12|
| 5| 1| 8| 15|
| 8| 1| 12| 21|
| 2| 2| 2| 27|
| 9| 2| 4| 27|
+----+----+----+----+
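If selectExpr feels too string-based, the same dispatch can also be driven from the OP's original cols_info with withColumn and Python UDF objects. A rough sketch; method_map and the lambda bodies are placeholder assumptions, not from the answer above:

from pyspark.sql.functions import udf
from pyspark.sql.types import DecimalType

# Map the method names appearing in cols_info to UDFs (placeholder implementations).
method_map = {
    "simple": udf(lambda x: x * 2, DecimalType()),
    "hash": udf(lambda x: x * 3, DecimalType()),
}

cols_info = [
    {"col_name": "metric_1", "process": "True", "method": "simple"},
    {"col_name": "metric_2", "process": "False", "method": "hash"},
]

# Apply only to columns flagged for processing, using the requested method.
for info in cols_info:
    if info["process"] == "True":
        df = df.withColumn(info["col_name"], method_map[info["method"]](info["col_name"]))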
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate rows with a positive mergeId:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to #shanmuga's answer).
The other way would be to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy: if the mergeId is positive, we use the mergeId to group; otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group-by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
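To spell that final step out, a sketch assuming the intermediate result above (without the trailing .show()) is bound to a variable called intermediate:

import pyspark.sql.functions as f

# Group on the derived mergeKey (keeping mergeId for the output), sum, then drop the key.
result = (
    intermediate
    .groupBy("mergeKey", "mergeId")
    .agg(f.sum("number").alias("number"))
    .drop("mergeKey")
)
result.show()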
You will have to filter out only the rows where shouldMerge is true and aggregate them, then union this with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
grouby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(grouby_field)\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()
The first problem posted by the OP.
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
I'm working on a process in pyspark in which I have a dataframe and I'm trying to add one more column to it (using the withColumn method).
The problem is that the formula is:
STATUS1 = If 'PETP-today' > 0 then 'Status1 last day' + 'PETP-today' else 0
Each Status1 result depends on the Status1 value from the previous day.
One solution I found was to create a pandas dataframe and run through the records one by one until I can calculate each value, using variables. However, that would give me performance issues. Can you help?
Consider the dataframe columns: Date (daily) / PETP (Float)/ STATUS1? (Float)
I really appreciate any help!
I think the key to your solution is the lag function. Try this (for simplicity, I am assuming integer data for all columns):
First, shift the status column up by one day:
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
columns = ['date', 'petp', 'status']
data = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3,3,3), (4,4,4), (5,5,5)]
pd_data = pd.DataFrame.from_records(data=data, columns=columns)
spark_data = spark.createDataFrame(pd_data)
spark_data_with_lag = spark_data.withColumn("status_last_day", F.lag("status", 1, 0).over(Window.orderBy("date")))
spark_data_with_lag.show()
+----+----+------+---------------+
|date|petp|status|status_last_day|
+----+----+------+---------------+
| 1| 1| 1| 0|
| 2| 2| 2| 1|
| 3| 3| 3| 2|
| 4| 4| 4| 3|
| 5| 5| 5| 4|
+----+----+------+---------------+
Then use that data in your conditional
status2 = spark_data_with_lag.withColumn("status2", F.when(F.col("date") > 0, F.col("petp") + F.col("status_last_day")).otherwise(0))
status2.show()
+----+----+------+---------------+-------+
|date|petp|status|status_last_day|status2|
+----+----+------+---------------+-------+
| 1| 1| 1| 0| 1|
| 2| 2| 2| 1| 3|
| 3| 3| 3| 2| 5|
| 4| 4| 4| 3| 7|
| 5| 5| 5| 4| 9|
+----+----+------+---------------+-------+
I hope that is what you were looking for.
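Note that the formula in the question is recursive (today's STATUS1 depends on yesterday's STATUS1, not on a pre-existing status column), which a single lag cannot express on its own. Under the assumption that the dates are daily and gap-free, one way to sketch it with window functions only is to start a new group every time PETP is not positive and take a running sum of PETP within each group. This is an illustration, not part of the answer above:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Note: a global orderBy window moves all rows to one partition.
w_cum = Window.orderBy("date")                    # running frame up to the current row
w_grp = Window.partitionBy("grp").orderBy("date")

result = (
    spark_data
    # Every non-positive PETP day resets STATUS1 to 0, so it starts a new group.
    .withColumn("reset", (F.col("petp") <= 0).cast("int"))
    .withColumn("grp", F.sum("reset").over(w_cum))
    # Within a group, STATUS1 is the running sum of the positive PETP values.
    .withColumn(
        "status1",
        F.when(
            F.col("petp") > 0,
            F.sum(F.when(F.col("petp") > 0, F.col("petp"))).over(w_grp),
        ).otherwise(F.lit(0)),
    )
    .drop("reset", "grp")
)
result.show()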