I have a PySpark dataframe df and want to add an "iteration suffix". For every iteration, counter should be raised by 1 and added as suffix to the dataframe name. For test purposes, my code looks like this:
counter = 1
def loop:
counter = counter + 1
df_%s = df.select('A','B') % counter
2 problems here: I don't know how to set up the counter variable as this version runs into an error ('local variable 'counter' referenced before assignment') and I don't know how to correctly pass the current counter value to the dataframe name. Thanks for your help!
Given the following dataframe
+---+------+-----+
| A| B| C|
+---+------+-----+
| 1| Red| 5.52|
| 2| Blue| 1.99|
| 3| Green| 3.71|
| 4|Purple|12.09|
+---+------+-----+
You can get the result with the following
for i in range(0, 9):
globals()['df_{}'.format(i)] = df.select("A","B")
Now you have 10 different dataframes to operate on
from pyspark.sql import functions
df_1 = df_1.withColumn("test", functions.lit(1))
df_1.show()
+---+------+----+
| A| B|Test|
+---+------+----+
| 1| Red| 1|
| 2| Blue| 1|
| 3| Green| 1|
| 4|Purple| 1|
+---+------+----+
df_2.show()
+---+------+
| A| B|
+---+------+
| 1| Red|
| 2| Blue|
| 3| Green|
| 4|Purple|
+---+------+
#and so on..
Related
spark = 2.x
New to pyspark.
While encoding date related columns for training DNN keep on facing error mentioned in the title.
from df
day month ...
1 1
2 3
3 1 ...
I am trying to get cos, sine value for each column in order to capture their cyclic nature.
When applying function to column in pyspark udf worked fine until now. But below code doesn't work
def to_cos(x, _max):
return np.sin(2*np.pi*x / _max)
to_cos_udf = udf(to_cos, DecimalType())
df = df.withColumn("month", to_cos_udf("month", 12))
I've tried it with IntegerType and tried it with only one variable def to_cos(x) however none of them seem to work and outputs:
Py4JJavaError: An error occurred while calling 0.24702.showString.
Since you havent shared the entire Stacktrack from the error , not sure what is the actual error which is causing the failure
However by the code snippets you have shared , Firstly you need to update your UDF definition as below -
Will passing arguments to a UDF function using it with lambda is probably the best approach towards it , apart from that you can use partial
Data Preparation
df = pd.DataFrame({
'month':[i for i in range(0,12)],
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+-----+
|month|
+-----+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
+-----+
Custom UDF
def to_cos(x,_max):
try:
res = np.sin(2*np.pi*x / _max)
except Exception as e:
res = 0.0
return float(res)
max_cos = 12
to_cos_udf = F.udf(lambda x: to_cos(x,max_cos),FloatType())
sparkDF = sparkDF.withColumn('month_cos',to_cos_udf('month'))
sparkDF.show()
+-----+-------------+
|month| month_cos|
+-----+-------------+
| 0| 0.0|
| 1| 0.5|
| 2| 0.8660254|
| 3| 1.0|
| 4| 0.8660254|
| 5| 0.5|
| 6|1.2246469E-16|
| 7| -0.5|
| 8| -0.8660254|
| 9| -1.0|
| 10| -0.8660254|
| 11| -0.5|
+-----+-------------+
Custom UDF - Partial
from functools import partial
partial_func = partial(to_cos,_max=max_cos)
to_cos_partial_udf = F.udf(partial_func)
sparkDF = sparkDF.withColumn('month_cos',to_cos_partial_udf('month'))
sparkDF.show()
+-----+--------------------+
|month| month_cos|
+-----+--------------------+
| 0| 0.0|
| 1| 0.49999999999999994|
| 2| 0.8660254037844386|
| 3| 1.0|
| 4| 0.8660254037844388|
| 5| 0.49999999999999994|
| 6|1.224646799147353...|
| 7| -0.4999999999999998|
| 8| -0.8660254037844384|
| 9| -1.0|
| 10| -0.8660254037844386|
| 11| -0.5000000000000004|
+-----+--------------------+
I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson| device|amount_sold|
+-----------+-----------+-----------+
| john| notebook| 2|
| gary| notebook| 3|
| john|small_phone| 2|
| mary|small_phone| 3|
| john|large_phone| 3|
| john| camera| 3|
+-----------+-----------+-----------+
and I have transformed it using pivot function to this with a Total column
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
+-----------+------+-----------+--------+-----------+-----+
but I would like a dataframe with a row (Total) that would also contain a total for every column like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
| Total| 3| 3| 5| 5| 16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this is Spark using Scala/Python? (Preferably Scala and using Spark) and not using Union if possible
TIA
You can do something like below:
val columns = df.columns.dropWhile(_ == "salesperson").map(col)
//Use function `sum` on each column and union the result with original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum):_*))
//I think this column already exists in DataFrame
//Append another column by adding value from each column
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))
With spark Scala, you can achieve this using following snippet of code.
// Assuming spark session available as variable named 'spark'
import spark.implicits._
val resultDF = df.withColumn("Total", sum($"camera", $"large_phone", $"notebook", $"small_phone"))
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the columns with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate positive numbers:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
shouldMerge | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to #shanmuga's answer).
Other way would be use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
You will have to filter out only the rows where should merge is true and aggregate. then union this with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
grouby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(grouby_field)\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
result.show()
The first problem posted by the OP.
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
hi i have already a dataframe:
df_init with all column:
A|B|C|D
i receive a json like:
json=[{"A":"1","B":"2","C":"3"},
{"A":"1","B":"2","C":"3","D":"4"},
{"A":"1","B":"2"}]
i want to have df_final like:
A|B| C |D
1|2| 3 |None
1|2| 3 |4
1|2|None|None
if i do:
msgJSON=self.spark.sparkContext.parallelize([json_string],1)
df = self.sqlContext.read.option("multiLine", "true").options(samplingRatio=1.0).json(msgJSON)
but i have some problems with error.
thanks
json = [{"A":"1","B":"2","C":"3"},
{"A":"1","B":"2","C":"3","D":"4"},
{"A":"1","B":"2"}]
msgJSON = spark.sparkContext.parallelize([json],1)
df_final = sqlContext.read.option("multiLine","true").options(samplingRatio=1.0).json(msgJSON)
df_final.show()
+---+---+----+----+
| A| B| C| D|
+---+---+----+----+
| 1| 2| 3|null|
| 1| 2| 3| 4|
| 1| 2|null|null|
+---+---+----+----+
I replicated your code without keyword self. You cannot use self, "everywhere".
For more information, refer: The self variable in python explained
With a dataframe like this,
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Well, one way or another you have to:
compute statistics
fill the blanks
It pretty much limits what you can really improve here, still:
replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all stats with a single action
use built-in Row methods to extract dictionary
The final result could like this:
def fill_with_mean(df, exclude=set()):
stats = df.agg(*(
avg(c).alias(c) for c in df.columns if c not in exclude
))
return df.na.fill(stats.first().asDict())
fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.