I have a pyspark DataFrame and I want to get a specific column and iterate over its values. For example:
userId  itemId
1       2
2       2
3       7
4       10
I get the userId column with df.userId, and for each userId in this column I want to apply a method. How can I achieve this?
Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,2),(2,2),(3,7),(4,10)], ['userId', 'itemId'])
Secondly, create the function and convert it into a UDF that can be used by PySpark:
def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
Finally, add a new column for ItemDescription and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription",item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
|     1|     2|       iPhone 8|
|     2|     2|       iPhone 8|
|     3|     7|     Apple iMac|
|     4|    10|           iPad|
+------+------+---------------+
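As a side note, if itemId values outside the dictionary are possible, a UDF-free sketch with a literal map column returns null for unknown ids instead of raising a KeyError (the mapping below just mirrors the dictionary above):
from itertools import chain
from pyspark.sql.functions import create_map, lit

# hypothetical literal mapping, mirroring the dictionary in the UDF above
item_map = {2: "iPhone 8", 7: "Apple iMac", 10: "iPad"}
mapping_expr = create_map([lit(x) for x in chain(*item_map.items())])

# unknown itemId values simply come back as null
df = df.withColumn("ItemDescription", mapping_expr[df.itemId])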
id  val1  val2
1   Y     Flagged
1   N     Flagged
2   N     Flagged
2   Y     Flagged
2   N     Flagged
I have the table above. For rows that share the same id, I want to check val1: if there is at least one Y and one N, then all the rows with that id should be marked Flagged in val2. In addition, for more efficient code, I want it to jump to the next id once it finds a Y.
Assuming the val1 column contains only Y and N as unique values, you can group the dataframe by id and aggregate val1 with countDistinct to count the unique values, then create a new column flagged for the condition where the distinct count > 1, and finally join this column back to the original dataframe to get the result:
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1 may contain other values besides Y and N, then first mask the values which are not Y or N:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly, as a column named flagged with boolean values makes more sense.
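If you do want the literal string Flagged in a val2 column, as in your original table, a small sketch on top of the boolean column would be:
from pyspark.sql import functions as F

# rows whose id is not flagged end up with null in val2
df = df.withColumn('val2', F.when(F.col('flagged'), 'Flagged'))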
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array([F.lit('N'),F.lit('Y')])
out = (df.withColumn("Flagged",
                     F.array_intersect(a, F.collect_set("val1").over(W.partitionBy("id"))) == a))
out.show()
+---+----+-------+-------+
| id|val1|   val2|Flagged|
+---+----+-------+-------+
|  1|   Y|Flagged|   true|
|  1|   N|Flagged|   true|
|  2|   N|Flagged|   true|
|  2|   Y|Flagged|   true|
|  2|   N|Flagged|   true|
+---+----+-------+-------+
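An equivalent sketch of the same window idea that skips the literal array and just checks that both values were collected:
from pyspark.sql import functions as F, Window as W

w = W.partitionBy("id")
vals = F.collect_set("val1").over(w)
# flag an id when both Y and N appear among its val1 values
out = df.withColumn("Flagged", F.array_contains(vals, "Y") & F.array_contains(vals, "N"))
out.show()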
I want to add a new column new_col: if the value of column a is in yes_list, then the value in new_col should be 1, else 0.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = sqlContext.read.json(rdd)
yes_list = ['y']
Something like this:
rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])
But the above is not correct and raises an error:
TypeError: Column is not iterable
How to achieve it?
You can use the when and isin functions from the Spark SQL API. It would go as follows:
from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+
|  a|   b|   c|new_col|
+---+----+----+-------+
|  y|null|null|      1|
|  y|   2|null|      1|
|  n|null|   3|      0|
+---+----+----+-------+
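Equivalently, since isin already produces a boolean column, you could simply cast it to an integer (a small sketch of the same idea):
from pyspark.sql import functions

rdd_df.withColumn("new_col", functions.col("a").isin(yes_list).cast("int")).show()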
I am new to PySpark and struggling with a simple dataframe manipulation. I have a dataframe similar to:
product  period  rating  product_Desc1  product_Desc2  ..... more columns
a        1       60      foo            xx
a        2       70      foo            xx
a        3       59      foo            xx
b        1       50      bar            yy
b        2       55      bar            yy
c        1       90      foo bar        xy
c        2       100     foo bar        xy
I would like to groupBy product, add columns to calculate arithmetic, geometric and harmonic means of ratings while also maintaining the rest of the columns in the dataframe, which are all consistent across each product.
I have tried to do so with a combination of built-in functions and a UDF. For example:
a_means = df.groupBy("product").agg(mean("rating").alias("a_mean"))
g_means = df.groupBy("product").agg(udf_gmean("rating").alias("g_mean"))
where:
def g_mean(x):
    gm = reduce(mul, x)**(1/len(x))
    return gm

udf_gmean = udf(g_mean, FloatType())
I would then join the a_means and g_means output with the original dataframe on product and drop duplicates. However, this method returns an error for g_means, stating that "rating" is neither involved in the groupBy nor a user-defined aggregate function...
I have also tried using SciPy's gmean module but the error message I get states that the ufunc 'log' is not suitable for the input types, despite all of the rating column being integer type as far as I can see.
There are similar questions on the site but nothing that I can find that seems to fix this issue I have. I would really appreciate the help as it's driving me mad!
Thanks in advance and I should be able to provide any further info quickly today if I haven't provided enough.
It's worth noting that, for efficiency, I am unable to simply convert to Pandas and transform as I would with a Pandas dataframe...and I am using Spark 2.2 and unable to update!
How about something like this:
from pyspark.sql.functions import avg
# reduce to a (count, product-of-ratings) pair per product, then take product**(1/count)
df1 = (df.select("product", "rating").rdd
         .map(lambda x: (x[0], (1.0, x[1] * 1.0)))
         .reduceByKey(lambda x, y: (x[0] + y[0], x[1] * y[1]))
         .toDF(['product', 'g_mean']))
gdf = df1.select(df1['product'], pow(df1['g_mean._2'], 1.0/df1['g_mean._1']).alias("rating_g_mean"))
display(gdf)
+-------+-----------------+
|product|    rating_g_mean|
+-------+-----------------+
|      a|62.81071936240795|
|      b|52.44044240850758|
|      c|94.86832980505137|
+-------+-----------------+
df1 = df.withColumn("h_mean", 1.0/df["rating"])
hdf = df1.groupBy("product").agg(avg(df1["rating"]).alias("rating_mean"), (1.0/avg(df1["h_mean"])).alias("rating_h_mean"))
sdf = hdf.join(gdf, ['product'])
display(sdf)
+-------+-----------+-----------------+-----------------+
|product|rating_mean|    rating_h_mean|    rating_g_mean|
+-------+-----------+-----------------+-----------------+
|      a|       63.0|62.62847514743051|62.81071936240795|
|      b|       52.5|52.38095238095239|52.44044240850758|
|      c|       95.0|94.73684210526315|94.86832980505137|
+-------+-----------+-----------------+-----------------+
fdf = df.join(sdf, ['product'])
display(fdf.sort("product"))
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|product|period|rating|product_Desc1|product_Desc2|rating_mean|    rating_h_mean|    rating_g_mean|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|      a|     3|    59|          foo|           xx|       63.0|62.62847514743051|62.81071936240795|
|      a|     2|    70|          foo|           xx|       63.0|62.62847514743051|62.81071936240795|
|      a|     1|    60|          foo|           xx|       63.0|62.62847514743051|62.81071936240795|
|      b|     2|    55|          bar|           yy|       52.5|52.38095238095239|52.44044240850758|
|      b|     1|    50|          bar|           yy|       52.5|52.38095238095239|52.44044240850758|
|      c|     2|   100|      foo bar|           xy|       95.0|94.73684210526315|94.86832980505137|
|      c|     1|    90|      foo bar|           xy|       95.0|94.73684210526315|94.86832980505137|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
A slightly easier way than above using gapply:
from spark_sklearn.group_apply import gapply
from scipy.stats.mstats import gmean
from pyspark.sql.types import StructType, FloatType
import pandas as pd

def g_mean(_, vals):
    gm = gmean(vals["rating"])
    return pd.DataFrame(data=[gm])

geoSchema = StructType().add("geo_mean", FloatType())
gMeans = gapply(df.groupby("product"), g_mean, geoSchema)
This returns a dataframe which can then be sorted and joined onto the original using:
df_withGeo = df.join(gMeans, ["product"])
Then repeat the process for any other aggregation-type columns you want to add to the original DataFrame...
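If you would rather avoid UDFs and external packages altogether, the geometric mean can also be expressed with a log/exp trick and the harmonic mean as the reciprocal of the mean of reciprocals, assuming all ratings are strictly positive (a sketch, not from the original answers):
from pyspark.sql import functions as F

means = df.groupBy("product").agg(
    F.avg("rating").alias("rating_mean"),
    F.exp(F.avg(F.log("rating"))).alias("rating_g_mean"),          # geometric mean
    (1.0 / F.avg(1.0 / F.col("rating"))).alias("rating_h_mean"))   # harmonic mean

df_withMeans = df.join(means, ["product"])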
I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column computed from only the columns that are not updatable. This serves as a unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns compared to dataframe A, so a union is not possible.
Example:
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
|    a|  www|
|    b|  eee|
|    c|  rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    a|  wew|    1|
|    d|  yyy|    2|
|    c|  rer|    3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    a|  wew|    1|
|    b|  eee| null|
|    c|  rer|    3|
|    d|  yyy|    2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        'col_1',
        f.when(
            ~f.isnull(f.col('b.col_2')),
            f.col('b.col_2')
        ).otherwise(f.col('a.col_2')).alias('col_2'),
        'b.col_3'
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#|    a|  wew|    1|
#|    b|  eee| null|
#|    c|  rer|    3|
#|    d|  yyy|    2|
#+-----+-----+-----+
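As a side note, the when/isnull pair above can be written a bit more compactly with coalesce, which returns the first non-null value; a sketch of the same select:
import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select('col_1',
            f.coalesce(f.col('b.col_2'), f.col('a.col_2')).alias('col_2'),
            'b.col_3')\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()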
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        *[
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        ]
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
I would opt for a different solution, which I believe is less verbose, more generic, and does not involve column listing. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join based on keyCols (a list). Then I would subtract this replaceDf from dfA and union it with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()
Even though there will be different columns in dfA and dfB, you can still overcome this by obtaining the lists of columns from both DataFrames and taking their union. Then, instead of select('a.*'), prepare a select that lists the columns of dfA that also exist in dfB, plus "null as colname" for the columns that exist only in dfB; a rough sketch of this alignment follows below.
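A minimal sketch of that alignment, assuming keyCols is defined as above and that dfB holds the full set of result columns (the column handling here is illustrative, not part of the original answer):
from pyspark.sql import functions as F

# project dfA into dfB's column order; columns missing from dfA become typed nulls
aligned_cols = [
    F.col(c) if c in dfA.columns else F.lit(None).cast(t).alias(c)
    for c, t in dfB.dtypes
]
dfA_aligned = dfA.select(aligned_cols)

replaceDf = dfA_aligned.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA_aligned.subtract(replaceDf).union(dfB)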
If you want to keep only unique values, and require strictly correct results, then union followed by dropDuplicates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
Suppose I've got a data frame df (created from a hard-coded array for tests)
+----+----+---+
|name|  c1|qty|
+----+----+---+
|   a|abc1|  1|
|   a|abc2|  0|
|   b|abc3|  3|
|   b|abc4|  2|
+----+----+---+
I am grouping and aggregating it to get df1
import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
|   b|       2|
|   a|       0|
+----+--------+
What is the expected order of the rows in df1?
Suppose now that I am writing a unit test. I need to compare df1 with the expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?
The ordering of the rows in the dataframe is not fixed. There is an easy way to use the expected DataFrame in test cases:
Do a dataframe diff. For Scala:
assert(df1.except(expectedDf).count == 0)
And
assert(expectedDf.except(df1).count == 0)
For Python you need to replace except with subtract.
From documentation:
subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
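For example, a minimal PySpark sketch of an order-insensitive comparison in a test (expected_df is just an illustrative name here):
def assert_same_rows(actual, expected):
    # subtract is EXCEPT DISTINCT, so also compare counts if duplicate rows matter
    assert actual.count() == expected.count()
    assert actual.subtract(expected).count() == 0
    assert expected.subtract(actual).count() == 0

assert_same_rows(df1, expected_df)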