Order of rows in DataFrame after aggregation - python

Suppose I've got a data frame df (created from a hard-coded array for tests)
+----+----+---+
|name| c1|qty|
+----+----+---+
| a|abc1| 1|
| a|abc2| 0|
| b|abc3| 3|
| b|abc4| 2|
+----+----+---+
I am grouping and aggregating it to get df1
import pyspark.sql.functions as sf
df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
| b| 2|
| a| 0|
+----+--------+
What is the expected order of the rows in df1 ?
Suppose now that I am writing a unit test. I need to compare df1 with the expected data frame. Should I compare them ignoring the order of rows. What is the best way to do it ?

The ordering of the rows in the dataframe is not fixed. There is an easy way to use the expected Dataframe in test cases
Do a dataframe diff . For scala:
assert(df1.except(expectedDf).count == 0)
And
assert(expectedDf.except(df1).count == 0)
For python you need to replace except by subtract
From documentation:
subtract(other)
Return a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.

Related

Calculating percentage of multiple column values of a Spark DataFrame in PySpark

I have multiple binary columns (0 and 1) in my Spark DataFrame. I want to calculate the percentage of 1 in each column and project the result in another DataFrame.
The input DataFrame dF looks like:
+------------+-----------+
| a| b|
+------------+-----------+
| 0| 1|
| 1| 1|
| 0| 0|
| 1| 1|
| 0| 1|
+------------+-----------+
Expected output would look like:
+------------+-----------+
| a| b|
+------------+-----------+
| 40| 80|
+------------+-----------+
40 (2/5) and 80 (4/5) is the percentage of 1 in columns a and b respectively.
What I tried so far is creating a custom aggregation function, passing over the two columns a and b to it, doing a group by to get the count of 0s and 1s, calculating percentages of 0s and 1s, and finally filtering the DataFrame to only keep the 1.
selection = ['a', 'b']
#F.udf
def cal_perc(c, dF):
grouped = dF.groupBy(c).count()
grouped = grouped.withColumn('perc_' + str(c), ((grouped['count']/5) * 100))
return grouped[grouped[c] == 1].select(['perc_' + str(c)])
dF.select(*(dF[c].alias(c) for c in selection)).agg(*(cal_perc(c, dF).alias(c) for c in selection)).show()
This does not seem to be working. I'm not able to figure out where I'm going wrong. Any help appreciated. Thanks.
If your columns in fact always are 0/1 and no other digits a mean should be equivalent.
It is implemented natively in spark.

Groupby and UDF/UDAF in PySpark while maintaining DataFrame structure

I am new to PySpark and struggling with a simple dataframe manipulation. I have a dataframe similar to:
product period rating product_Desc1 product_Desc2 ..... more columns
a 1 60 foo xx
a 2 70 foo xx
a 3 59 foo xx
b 1 50 bar yy
b 2 55 bar yy
c 1 90 foo bar xy
c 2 100 foo bar xy
I would like to groupBy product, add columns to calculate arithmetic, geometric and harmonic means of ratings while also maintaining the rest of the columns in the dataframe, which are all consistent across each product.
I have tried to do so with a combination of built in functions and UDF. For example:
a_means = df.groupBy("product").agg(mean("rating").alias("a_mean")
g_means = df.groupBy("product").agg(udf_gmean("rating").alias("g_mean")
where:
def g_mean(x):
gm = reduce(mul,x)**(1/len(x))
return gm
udf_gmean = udf(g_mean, FloatType())
I would then join the a_means and g_means output with the original dataframe on product and drop duplicates. However, this method returns an error, for g_means, stating that "rating" is not involved in the groupBy nor is it a user defined aggregation function....
I have also tried using SciPy's gmean module but the error message I get states that the ufunc 'log' is not suitable for the input types, despite all of the rating column being integer type as far as I can see.
There are similar questions on the site but nothing that I can find that seems to fix this issue I have. I would really appreciate the help as it's driving me mad!
Thanks in advance and I should be able to provide any further info quickly today if I haven't provided enough.
It's worth noting that, for efficiency, I am unable to simply convert to Pandas and transform as I would with a Pandas dataframe...and I am using Spark 2.2 and unable to update!
How about something like this
from pyspark.sql.functions import avg
df1 = df.select("product","rating").rdd.map(lambda x: (x[0],(1.0,x[1]*1.0))).reduceByKey(lambda x,y: (x[0]+y[0], x[1]*y[1])).toDF(['product', 'g_mean'])
gdf = df1.select(df1['product'],pow(df1['g_mean._2'],1.0/df1['g_mean._1']).alias("rating_g_mean"))
display(gdf)
+-------+-----------------+
|product| rating_g_mean|
+-------+-----------------+
| a|62.81071936240795|
| b|52.44044240850758|
| c|94.86832980505137|
+-------+-----------------+
df1 = df.withColumn("h_mean", 1.0/df["rating"])
hdf = df1.groupBy("product").agg(avg(df1["rating"]).alias("rating_mean"), (1.0/avg(df1["h_mean"])).alias("rating_h_mean"))
sdf = hdf.join(gdf, ['product'])
display(sdf)
+-------+-----------+-----------------+-----------------+
|product|rating_mean| rating_h_mean| rating_g_mean|
+-------+-----------+-----------------+-----------------+
| a| 63.0|62.62847514743051|62.81071936240795|
| b| 52.5|52.38095238095239|52.44044240850758|
| c| 95.0|94.73684210526315|94.86832980505137|
+-------+-----------+-----------------+-----------------+
fdf = df.join(sdf, ['product'])
display(fdf.sort("product"))
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|product|period|rating|product_Desc1|product_Desc2|rating_mean| rating_h_mean| rating_g_mean|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
| a| 3| 59| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 2| 70| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 1| 60| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| b| 2| 55| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| b| 1| 50| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| c| 2| 100| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
| c| 1| 90| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
A slightly easier way than above using gapply:
from spark_sklearn.group_apply import gapply
from scipy.stats.mstats import gmean
import pandas as pd
def g_mean(_, vals):
gm = gmean(vals["rating"])
return pd.DataFrame(data=[gm])
geoSchema = StructType().add("geo_mean", FloatType())
gMeans = gapply(df.groupby("product"), g_mean, geoSchema)
This returns a dataframe which can then be sorted and joined onto the original using:
df_withGeo = df.join(gMeans, ["product"])
And repeat the process for any aggregation type function columns to be added to the original DataFrame...

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns from dataframe A, so a union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
'col_1',
f.when(
~f.isnull(f.col('b.col_2')),
f.col('b.col_2')
).otherwise(f.col('a.col_2')).alias('col_2'),
'b.col_3'
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
*[
['col_1'] +
[
f.when(
~f.isnull(f.col('b.{}'.format(c))),
f.col('b.{}'.format(c))
).otherwise(f.col('a.{}'.format(c))).alias(c)
for c in cols_to_update
] +
['b.col_3']
]
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
I would opt for different solution, which I believe is less verbose, more generic and does not involve column listing. I would first identify subset of dfA that will be updated (replaceDf) by performing inner join based on keyCols (list). Then I would subtract this replaceDF from dfA and union it with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB).show()
Even though there will be different columns in dfA and dfB, you can still overcome this with obtaining list of columns from both DataFrames and finding their union. Then I would
prepare select query (instead of "select.('a.')*") so that I would just list columns from dfA that exist in dfB + "null as colname" for those that do not exist in dfB.
If you want to keep only unique values, and require strictly correct results, then union followed by dropDupilcates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names:
df = df1.join(df2, df1['id'] == df2['id'])
Join works fine but you can't call the id column because it is ambiguous and you would get the following exception:
pyspark.sql.utils.AnalysisException: "Reference 'id' is ambiguous,
could be: id#5691, id#5918.;"
This makes id not usable anymore...
The following function solves the problem:
def join(df1, df2, cond, how='left'):
df = df1.join(df2, cond, how=how)
repeated_columns = [c for c in df1.columns if c in df2.columns]
for col in repeated_columns:
df = df.drop(df2[col])
return df
What I don't like about it is that I have to iterate over the column names and delete them why by one. This looks really clunky...
Do you know of any other solution that will either join and remove duplicates more elegantly or delete multiple columns without iterating over each of them?
If the join columns at both data frames have the same names and you only need equi join, you can specify the join columns as a list, in which case the result will only keep one of the join columns:
df1.show()
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 4| 4|
| 5| 5|
+---+----+
df2.show()
+---+----+
| id|val2|
+---+----+
| 1| 2|
| 1| 3|
| 2| 4|
| 3| 5|
+---+----+
df1.join(df2, ['id']).show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
Otherwise you need to give the join data frames alias and refer to the duplicated columns by the alias later:
df1.alias("a").join(
df2.alias("b"), df1['id'] == df2['id']
).select("a.id", "a.val1", "b.val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 4|
+---+----+----+
df.join(other, on, how) when on is a column name string, or a list of column names strings, the returned dataframe will prevent duplicate columns.
when on is a join expression, it will result in duplicate columns. We can use .drop(df.a) to drop duplicate columns. Example:
cond = [df.a == other.a, df.b == other.bb, df.c == other.ccc]
# result will have duplicate column a
result = df.join(other, cond, 'inner').drop(df.a)
Assuming 'a' is a dataframe with column 'id' and 'b' is another dataframe with column 'id'
I use the following two methods to remove duplicates:
Method 1: Using String Join Expression as opposed to boolean expression. This automatically remove a duplicate column for you
a.join(b, 'id')
Method 2: Renaming the column before the join and dropping it after
b.withColumnRenamed('id', 'b_id')
joinexpr = a['id'] == b['b_id']
a.join(b, joinexpr).drop('b_id)
The code below works with Spark 1.6.0 and above.
salespeople_df.show()
+---+------+-----+
|Num| Name|Store|
+---+------+-----+
| 1| Henry| 100|
| 2| Karen| 100|
| 3| Paul| 101|
| 4| Jimmy| 102|
| 5|Janice| 103|
+---+------+-----+
storeaddress_df.show()
+-----+--------------------+
|Store| Address|
+-----+--------------------+
| 100| 64 E Illinos Ave|
| 101| 74 Grand Pl|
| 102| 2298 Hwy 7|
| 103|No address available|
+-----+--------------------+
Assuming -in this example- that the name of the shared column is the same:
joined=salespeople_df.join(storeaddress_df, ['Store'])
joined.orderBy('Num', ascending=True).show()
+-----+---+------+--------------------+
|Store|Num| Name| Address|
+-----+---+------+--------------------+
| 100| 1| Henry| 64 E Illinos Ave|
| 100| 2| Karen| 64 E Illinos Ave|
| 101| 3| Paul| 74 Grand Pl|
| 102| 4| Jimmy| 2298 Hwy 7|
| 103| 5|Janice|No address available|
+-----+---+------+--------------------+
.join will prevent the duplication of the shared column.
Let's assume that you want to remove the column Num in this example, you can just use .drop('colname')
joined=joined.drop('Num')
joined.show()
+-----+------+--------------------+
|Store| Name| Address|
+-----+------+--------------------+
| 103|Janice|No address available|
| 100| Henry| 64 E Illinos Ave|
| 100| Karen| 64 E Illinos Ave|
| 101| Paul| 74 Grand Pl|
| 102| Jimmy| 2298 Hwy 7|
+-----+------+--------------------+
After I've joined multiple tables together, I run them through a simple function to drop columns in the DF if it encounters duplicates while walking from left to right. Alternatively, you could rename these columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = dropDupeDfCols(NamesAndDates)
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where dropDupeDfCols is defined as:
def dropDupeDfCols(df):
newcols = []
dupcols = []
for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)
df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))
return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Date'].
In my case I had a dataframe with multiple duplicate columns after joins and I was trying to same that dataframe in csv format, but due to duplicate column I was getting error. I followed below steps to drop duplicate columns. Code is in scala
1) Rename all the duplicate columns and make new dataframe
2) make separate list for all the renamed columns
3) Make new dataframe with all columns (including renamed - step 1)
4) drop all the renamed column
private def removeDuplicateColumns(dataFrame:DataFrame): DataFrame = {
var allColumns: mutable.MutableList[String] = mutable.MutableList()
val dup_Columns: mutable.MutableList[String] = mutable.MutableList()
dataFrame.columns.foreach((i: String) =>{
if(allColumns.contains(i))
if(allColumns.contains(i))
{allColumns += "dup_" + i
dup_Columns += "dup_" +i
}else{
allColumns += i
}println(i)
})
val columnSeq = allColumns.toSeq
val df = dataFrame.toDF(columnSeq:_*)
val unDF = df.drop(dup_Columns:_*)
unDF
}
to call the above function use below code and pass your dataframe which contains duplicate columns
val uniColDF = removeDuplicateColumns(df)
Here is simple solution for remove duplicate column
final_result=df1.join(df2,(df1['subjectid']==df2['subjectid']),"left").drop(df1['subjectid'])
If you join on a list or string, dup cols are automatically]1 removed
This is a scala solution, you could translate the same idea into any language
// get a list of duplicate columns or use a list/seq
// of columns you would like to join on (note that this list
// should include columns for which you do not want duplicates)
val duplicateCols = df1.columns.intersect(df2.columns)
// no duplicate columns in resulting DF
df1.join(df2, duplicateCols.distinct.toSet)
Spark SQL version of this answer:
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("select * from t1 inner join t2 using (id)").show()
# +---+----+----+
# | id|val1|val2|
# +---+----+----+
# | 1| 2| 2|
# | 1| 2| 3|
# | 2| 3| 4|
# +---+----+----+
This works for me when multiple columns used to join and need to drop more than one column which are not string type.
final_data = mdf1.alias("a").join(df3.alias("b")
(mdf1.unique_product_id==df3.unique_product_id) &
(mdf1.year_week==df3.year_week) ,"left" ).select("a.*","b.promotion_id")
Give a.* to select all columns from one table and from the other table choose specific columns.

select random columns from a very large dataframe in pyspark

I have a dataframe in pyspark which has around 150 columns. These columns are obtained from joining different tables. Now my requirement is to write the dataframe to a file but in a specific order like first write 1 to 50 columns then column 90 to 110 and then column 70 and 72. That is I want to select only specific columns along with rearranging them.
I know one one of the way is to use df.select("give your column order") but in my case, the columns are very large and it is not possible to write each and every column name in 'select'.
Please tell me how can I achieve this in pyspark.
Note- I cannot provide any sample data as the number of columns is very large and the column number is the main road blocker in my case.
It sounds like all that you want to do is to programmatically return the list of column names, pick out some slice or slices from that list, and then select that subset of columns in some order from your dataframe. You can do this by manipulating the list df.columns. As an example:
a=[list(range(10)),list(range(1,11)),list(range(2,12))]
df=sqlContext.createDataFrame(a,schema=['col_'+i for i in 'abcdefghij'])
df is a dataframe is with columns ['col_a', 'col_b', 'col_c', 'col_d', 'col_e', 'col_f', 'col_g', 'col_h', 'col_i', 'col_j']. You can return that list by calling df.columns which you can slice and reorder like you would any other python list. How you do that is up to you and which columns you want to select from the df and in which order. For example:
mycolumnlist=df.columns[8:9]+df.columns[0:5]
df[mycolumnlist].show()
Returns
+-----+-----+-----+-----+-----+-----+
|col_i|col_a|col_b|col_c|col_d|col_e|
+-----+-----+-----+-----+-----+-----+
| 8| 0| 1| 2| 3| 4|
| 9| 1| 2| 3| 4| 5|
| 10| 2| 3| 4| 5| 6|
+-----+-----+-----+-----+-----+-----+
You can create list of columns programmatically
first_df.join(second_df, on-'your_condition').select([column_name for column_name in first_df.columns] + [column_name for column_name in second_df.columns])
You can select random subset of columns by using random.sample(first_df.columns, number_of_columns) function.
Hope this helps :)

Categories

Resources