PySpark: Split DataFrame into multiple DataFrames without using loop - python

I have a DataFrame as shown:
ID X Y
1 1234 284
1 1396 179
2 8620 178
3 1620 191
3 8820 828
I want to split this DataFrame into multiple DataFrames based on ID, so for this example there will be 3 DataFrames. One way to achieve this is to run a filter operation in a loop. However, I would like to know whether it can be done in a more efficient way.

#initialize spark dataframe
df = sc.parallelize([ (1,1234,282),(1,1396,179),(2,8620,178),(3,1620,191),(3,8820,828) ] ).toDF(["ID","X","Y"])
#get the list of unique ID values ; there's probably a better way to do this, but this was quick and easy
listids = [x.asDict().values()[0] for x in df.select("ID").distinct().collect()]
#create list of dataframes by IDs
dfArray = [df.where(df.ID == x) for x in listids]
dfArray[0].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 1|1234|282|
| 1|1396|179|
+---+----+---+
dfArray[1].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 2|8620|178|
+---+----+---+
dfArray[2].show()
+---+----+---+
| ID| X| Y|
+---+----+---+
| 3|1620|191|
| 3|8820|828|
+---+----+---+

The answer of @James Tobin needs to be altered slightly if you are working with Python 3.x, because dict.values() returns a dict_values view object instead of a list, and a view cannot be indexed. A quick workaround is to wrap it in list():
listids = [list(x.asDict().values())[0] for x in df.select("ID").distinct().collect()]
Posting as a separate answer as I do not have the reputation required to comment on his answer.
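To make the Python 3 point concrete, here is a minimal sketch that runs without Spark. The plain dict row_dict is only a stand-in I am assuming for what Row.asDict() returns; the variable names are mine, not from the answers above.

```python
# A plain dict standing in for Row.asDict(); a real Row would come from Spark.
row_dict = {"ID": 1}

values_view = row_dict.values()

# In Python 3 a dict view cannot be indexed, which is why values()[0] fails.
try:
    values_view[0]
    indexable = True
except TypeError:
    indexable = False

first_via_list = list(values_view)[0]     # the workaround from the answer
first_via_iter = next(iter(values_view))  # same value without building a list
first_via_key = row_dict["ID"]            # looking the column up by name
```

Looking the value up by column name avoids relying on the position of the value in the dict at all, which may be the more readable option when the column name is known.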

Related

How to create a new column with values matching conditions from different dataframes?

Is there a simple function (in either pandas or NumPy) to create a new column with true or false values, based on matching criteria from different dataframes?
I'm trying to compare two dataframes that both have an email column and see, for example, which emails in the first dataframe also appear in the second. The goal is to print a table that looks like this (where hola@lorem.com is present in both the first and the second dataframe):
| id | email | match |
|:------|:------ |:-------|
| 1 | hola@lorem.com | true|
| 2 | adios@lorem.com | false|
| 3 | bye@lorem.com | false|
Thanks in advance for your help
Use DataFrame.assign, checking each email in df1 against the emails in df2:
df1 = df1.assign(match=df1["email"].isin(df2["email"]))
You can for example use the function isin:
df1['match'] = df1['email'].isin(df2['email'])
df2['match'] = df2['email'].isin(df1['email'])

PySpark - new column partial regex matching from dictionary

I have a PySpark dataframe like this:
A B
1 abc_value
2 abc_value
3 some_other_value
4 anything_else
I have a mapping dictionary:
d = {
    "abc": "X",
    "some_other": "Y",
    "anything": "Z"
}
I need to create new column in my original Dataframe which should be like this:
A B C
1 abc_value X
2 abc_value X
3 some_other_value Y
4 anything_else Z
I tried mapping like this:
mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
and then applying it with withColumn. However, that matches exactly, whereas I need partial (regex) matching, as you can see.
How to accomplish this, please?
I'm afraid in PySpark there's no implemented function that extracts substrings according to a defined dictionary; you probably need to resort to tricks.
In this case, you can first create a search string which includes all your dictionary keys to be searched:
keys = list(d.keys())
keys_expr = '|'.join(keys)
keys_expr
# 'abc|some_other|anything'
Then you can use regexp_extract to extract the first key from keys_expr that we encounter in column B, if present (that's the reason for the | operator).
Finally, you can use dictionary d to replace the values in the new column.
import pyspark.sql.functions as F
df = df\
.withColumn('C', F.regexp_extract('B', keys_expr, 0))\
.replace(d, subset=['C'])
df.show()
+---+----------------+---+
| A| B| C|
+---+----------------+---+
| 1| abc_value| X|
| 2| abc_value| X|
| 3|some_other_value| Y|
| 4| anything_else| Z|
+---+----------------+---+
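The matching logic can be tried out without a cluster. Below is a plain-Python sketch using the re module, not the Spark code itself; the lookup helper and the extra "no_key_here" row are hypothetical additions for illustration. Two precautions that the answer does not mention are assumed here: keys are escaped in case they contain regex metacharacters, and longer keys are placed first so that a key which is a prefix of another does not shadow it.

```python
import re

d = {"abc": "X", "some_other": "Y", "anything": "Z"}

# Longest keys first, each escaped, joined with the alternation operator.
keys_expr = "|".join(re.escape(k) for k in sorted(d, key=len, reverse=True))
pattern = re.compile(keys_expr)

def lookup(value):
    """Return the label for the first dictionary key found in value, else None."""
    match = pattern.search(value)
    return d[match.group(0)] if match else None

values = ["abc_value", "abc_value", "some_other_value", "anything_else", "no_key_here"]
labels = [lookup(v) for v in values]
```

One difference to keep in mind: in Spark, regexp_extract returns an empty string when nothing matches, so rows without any key would end up with '' in column C rather than null.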

Combine two DataFrames in PySpark into matrix

I have 2 DataFrames in PySpark script.
DF1 has this data:
+-----+--------------+
| id | keyword |
+-----+--------------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
+-----+--------------+
DF2 has this data:
+----+---------------+
| id | tokens |
+----+---------------+
| 13 | ['abc', 'def']|
| 14 | ['ghi', 'jkl']|
| 15 | ['mno', 'pqr']|
+----+---------------+
I'm looking to build a third DataFrame by a result of combining both of the DataFrames above and performing some complex calculations (the calculations are not important) between the keyword and the tokens defined by a python function:
def complex_calculation(keyword, tokens):
    # various steps that produce a numeric result from the keyword and the tokens
    result = 0.7768756  # placeholder, e.g. the computed score
    return result
The final result should look something like this:
+-------------+---------+--------+--------+
| keyword | 13 | 14 | 15 |
+-------------+---------+--------+--------+
| banana | 0.5345 | 0.4325 | 0.6543 |
| apple | 0.2435 | 0.7865 | 0.9123 |
| orange | 0.3765 | 0.6942 | 0.2765 |
+-------------+---------+--------+--------+
Your complex calculation function is actually quite important in this context, because what you're looking to do is the following:
Create a cartesian product of your two tables
# spark.sparkContext is the public accessor for the SparkContext
table1 = spark.sparkContext.parallelize([[1, "banana"],
                                         [2, "apple"],
                                         [3, "orange"]]).toDF(["id", "keyword"])
table2 = spark.sparkContext.parallelize([[13, ['abc', 'def']],
                                         [14, ['ghi', 'jkl']],
                                         [15, ['mno', 'pqr']]]).toDF(["id", "token"])
Pivot with an aggregation function. Now this is where your function comes into play. As you can see, I am using f.count() as my aggregation function.
import pyspark.sql.functions as f
(
    table1.select("keyword")
    .crossJoin(table2)
    .groupBy('keyword')
    .pivot('id')
    .agg(f.count("token"))
).show()
+-------+---+---+---+
|keyword| 13| 14| 15|
+-------+---+---+---+
| orange| 1| 1| 1|
| apple| 1| 1| 1|
| banana| 1| 1| 1|
+-------+---+---+---+
If you want to use some custom, clever calculation, you really have two options. If you're competent in Scala, you can write a UDAF (user-defined aggregate function) and register this jar to your Spark cluster. Alternatively, you can have a look at pandas udfs with something such as:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf("struct<agg_key: string, parameter1: parameter1_type>", PandasUDFType.GROUPED_MAP)
def my_agg_function(df):
    df = pd.DataFrame(
        df.groupby(agg_key).apply(lambda x: (...))
    )
    df.reset_index(inplace=True, drop=False)
    return df
And then you use your pandas udf such as
spark_df.groupBy("keyword").pivot("id").apply(my_agg_function(...))
However, despite best attempts at being vectorized, pandas UDFs still carry overhead and can have a significant performance impact. Hope this helps. More on pandas UDFs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf
Ideally, you should try to do your complex aggregations using spark functions as much as you can, because Tungsten can then optimise this under the hood and give you best performance possible.
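For anyone who wants to see the shape of the computation before wiring it into Spark, here is a plain-Python sketch of the two steps (cartesian product, then pivot). The body of complex_calculation is a made-up stand-in, since the real function was not posted; the names scores and matrix are also mine.

```python
from itertools import product

keywords = ["banana", "apple", "orange"]
token_rows = [(13, ["abc", "def"]), (14, ["ghi", "jkl"]), (15, ["mno", "pqr"])]

def complex_calculation(keyword, tokens):
    # Hypothetical scoring: fraction of tokens sharing a letter with the keyword.
    shared = sum(1 for t in tokens if set(t) & set(keyword))
    return shared / len(tokens)

# Step 1: cartesian product, one score per (keyword, id) pair.
scores = {
    (kw, tid): complex_calculation(kw, toks)
    for kw, (tid, toks) in product(keywords, token_rows)
}

# Step 2: pivot the long form into one row per keyword, one column per id.
ids = [tid for tid, _ in token_rows]
matrix = {kw: [scores[(kw, tid)] for tid in ids] for kw in keywords}
```

Because each (keyword, id) pair yields exactly one score here, the pivot step needs no real aggregation. That suggests that in Spark a per-row function applied after the cross join, followed by a pivot that simply picks the single value, may be enough when the calculation does not need to combine several rows.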

Groupby and UDF/UDAF in PySpark while maintaining DataFrame structure

I am new to PySpark and struggling with a simple dataframe manipulation. I have a dataframe similar to:
product period rating product_Desc1 product_Desc2 ..... more columns
a 1 60 foo xx
a 2 70 foo xx
a 3 59 foo xx
b 1 50 bar yy
b 2 55 bar yy
c 1 90 foo bar xy
c 2 100 foo bar xy
I would like to groupBy product, add columns to calculate arithmetic, geometric and harmonic means of ratings while also maintaining the rest of the columns in the dataframe, which are all consistent across each product.
I have tried to do so with a combination of built in functions and UDF. For example:
a_means = df.groupBy("product").agg(mean("rating").alias("a_mean"))
g_means = df.groupBy("product").agg(udf_gmean("rating").alias("g_mean"))
where:
def g_mean(x):
    gm = reduce(mul, x)**(1/len(x))
    return gm
udf_gmean = udf(g_mean, FloatType())
I would then join the a_means and g_means output with the original dataframe on product and drop duplicates. However, this method returns an error, for g_means, stating that "rating" is not involved in the groupBy nor is it a user defined aggregation function....
I have also tried using SciPy's gmean module but the error message I get states that the ufunc 'log' is not suitable for the input types, despite all of the rating column being integer type as far as I can see.
There are similar questions on the site but nothing that I can find that seems to fix this issue I have. I would really appreciate the help as it's driving me mad!
Thanks in advance and I should be able to provide any further info quickly today if I haven't provided enough.
It's worth noting that, for efficiency, I am unable to simply convert to Pandas and transform as I would with a Pandas dataframe...and I am using Spark 2.2 and unable to update!
How about something like this:
from pyspark.sql.functions import avg
df1 = (
    df.select("product", "rating")
    .rdd.map(lambda x: (x[0], (1.0, x[1] * 1.0)))  # (product, (count, rating))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] * y[1]))  # sum the counts, multiply the ratings
    .toDF(['product', 'g_mean'])
)
gdf = df1.select(df1['product'], pow(df1['g_mean._2'], 1.0/df1['g_mean._1']).alias("rating_g_mean"))
display(gdf)
+-------+-----------------+
|product| rating_g_mean|
+-------+-----------------+
| a|62.81071936240795|
| b|52.44044240850758|
| c|94.86832980505137|
+-------+-----------------+
df1 = df.withColumn("h_mean", 1.0/df["rating"])
hdf = df1.groupBy("product").agg(avg(df1["rating"]).alias("rating_mean"), (1.0/avg(df1["h_mean"])).alias("rating_h_mean"))
sdf = hdf.join(gdf, ['product'])
display(sdf)
+-------+-----------+-----------------+-----------------+
|product|rating_mean| rating_h_mean| rating_g_mean|
+-------+-----------+-----------------+-----------------+
| a| 63.0|62.62847514743051|62.81071936240795|
| b| 52.5|52.38095238095239|52.44044240850758|
| c| 95.0|94.73684210526315|94.86832980505137|
+-------+-----------+-----------------+-----------------+
fdf = df.join(sdf, ['product'])
display(fdf.sort("product"))
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
|product|period|rating|product_Desc1|product_Desc2|rating_mean| rating_h_mean| rating_g_mean|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
| a| 3| 59| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 2| 70| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| a| 1| 60| foo| xx| 63.0|62.62847514743051|62.81071936240795|
| b| 2| 55| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| b| 1| 50| bar| yy| 52.5|52.38095238095239|52.44044240850758|
| c| 2| 100| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
| c| 1| 90| foo bar| xy| 95.0|94.73684210526315|94.86832980505137|
+-------+------+------+-------------+-------------+-----------+-----------------+-----------------+
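As a sanity check on the numbers above, the three means can be reproduced in plain Python with only the standard library. This is a sketch over the sample ratings from the question, not Spark code; the helper names are my own. It computes the geometric mean as exp(mean(log(x))) rather than as the n-th root of the product, which avoids very large intermediate products when a group has many rows.

```python
import math
from collections import defaultdict

rows = [("a", 60), ("a", 70), ("a", 59), ("b", 50), ("b", 55), ("c", 90), ("c", 100)]

# Group the ratings by product.
groups = defaultdict(list)
for product_name, rating in rows:
    groups[product_name].append(rating)

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean of logs equals the n-th root of the product
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

means = {
    p: (arithmetic_mean(xs), geometric_mean(xs), harmonic_mean(xs))
    for p, xs in groups.items()
}
```

Since exp, avg and log all exist in pyspark.sql.functions, the same identity could probably be written as a single built-in aggregation inside groupBy(...).agg(...), which would keep the whole computation in Spark without the RDD step.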
A slightly easier way than above using gapply:
from spark_sklearn.group_apply import gapply
from scipy.stats.mstats import gmean
from pyspark.sql.types import StructType, FloatType
import pandas as pd
def g_mean(_, vals):
    # vals holds the rows of one product group
    gm = gmean(vals["rating"])
    return pd.DataFrame(data=[gm])
geoSchema = StructType().add("geo_mean", FloatType())
gMeans = gapply(df.groupby("product"), g_mean, geoSchema)
This returns a dataframe which can then be sorted and joined onto the original using:
df_withGeo = df.join(gMeans, ["product"])
And repeat the process for any aggregation type function columns to be added to the original DataFrame...

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns from dataframe A, so a union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        'col_1',
        f.when(
            ~f.isnull(f.col('b.col_2')),
            f.col('b.col_2')
        ).otherwise(f.col('a.col_2')).alias('col_2'),
        'b.col_3'
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        *[
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        ]
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
I would opt for a different solution, which I believe is less verbose, more generic, and does not involve listing columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join on keyCols (a list of key column names). Then I would subtract replaceDf from dfA and union the remainder with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()
If dfA and dfB have different columns, you can still make this work by obtaining the list of columns from both DataFrames and taking their union. Then I would prepare a select expression (instead of select('a.*')) that lists the columns dfA has and adds "null as colname" for those that exist only in dfB.
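The idea (keep the rows of A whose key does not occur in B, pad the missing columns with null, then append all of B) can be sketched in plain Python on the sample data from the question. This is only an illustration of the logic with lists of dicts, not Spark code; key_of, all_cols and the other names are hypothetical.

```python
rows_a = [
    {"col_1": "a", "col_2": "www"},
    {"col_1": "b", "col_2": "eee"},
    {"col_1": "c", "col_2": "rrr"},
]
rows_b = [
    {"col_1": "a", "col_2": "wew", "col_3": 1},
    {"col_1": "d", "col_2": "yyy", "col_3": 2},
    {"col_1": "c", "col_2": "rer", "col_3": 3},
]
key_cols = ["col_1"]

def key_of(row):
    return tuple(row[c] for c in key_cols)

# Union of the column names, keeping A's order and appending B-only columns.
all_cols = list(rows_a[0]) + [c for c in rows_b[0] if c not in rows_a[0]]

# Rows of A whose key is absent from B (the part of A that is not replaced).
keys_in_b = {key_of(r) for r in rows_b}
kept_from_a = [r for r in rows_a if key_of(r) not in keys_in_b]

# Pad missing columns with None (the "null as colname" step), then combine.
result = sorted(
    ({c: r.get(c) for c in all_cols} for r in kept_from_a + rows_b),
    key=key_of,
)
```

The result has the four rows a, b, c, d from the expected Dataframe C, with col_3 empty for b.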
If you want to keep only unique values, and require strictly correct results, then union followed by dropDuplicates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
