Efficient column processing in PySpark - python

I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1s and 0s based on the first column, like this:
for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However, this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] | | | |
| ['Bar', 'Baz'] | | | |
| ['Foo'] | | | |
+----------------+-----+-----+-----+

There is nothing specifically wrong with your code, other than the very wide data: the loop
for column in list_of_column_names:
    df = df.withColumn(...)
only generates the execution plan.
Actual data processing will be concurrent and parallelized once the result is evaluated.
It is, however, an expensive process, as it requires O(NMK) operations, with N rows, M columns and K values in the list.
Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in terms of the number of records). If that becomes the limiting factor, you might be better off with RDDs:
Sort the list column using the sort_array function.
Convert the data to an RDD.
For each output column, look the name up with a binary search (see the sketch below).
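A minimal sketch of that RDD recipe, assuming the SparkSession is available as spark and that df and list_of_column_names from the question are in scope (the helper contains_sorted is mine, not from the original answer):
from bisect import bisect_left
from pyspark.sql.functions import sort_array

def contains_sorted(sorted_values, key):
    # Binary search in the pre-sorted array column.
    i = bisect_left(sorted_values, key)
    return 1 if i < len(sorted_values) and sorted_values[i] == key else 0

rdd = (df
       .withColumn('list_column', sort_array('list_column'))
       .rdd
       .map(lambda row: [row['list_column']] +
                        [contains_sorted(row['list_column'], c)
                         for c in list_of_column_names]))

result = spark.createDataFrame(rdd, ['list_column'] + list_of_column_names)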

You might approach it like this, building all of the columns in a single select so that only one projection is added to the plan:
import pyspark.sql.functions as F

exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)

withColumn is already distributed, so a faster approach would be difficult to get beyond what you already have. You can try defining a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t

def containsUdf(listColumn):
    # One struct field per column name: 1 if the name is in the array, else 0.
    row = {}
    for column in list_of_column_names:
        if column in listColumn:
            row.update({column: 1})
        else:
            row.update({column: 0})
    return row

callContainsUdf = f.udf(containsUdf, t.StructType([t.StructField(x, t.StringType(), True) for x in list_of_column_names]))

df.withColumn('struct', callContainsUdf(df['list_column']))\
    .select(f.col('list_column'), f.col('struct.*'))\
    .show(truncate=False)
which should give you
+-----------+---+---+---+
|list_column|Foo|Bar|Baz|
+-----------+---+---+---+
|[Foo, Bak] |1 |0 |0 |
|[Bar, Baz] |0 |1 |1 |
|[Foo] |1 |0 |0 |
+-----------+---+---+---+
Note: list_of_column_names = ["Foo","Bar","Baz"]

Related

How to split pyspark dataframe into segments of equal sized rows

I'm trying to create a new column that puts rows in my pyspark dataframe into groups based on observed rank values. For example, I'd like the first 100,000 ranks to be group 1, the next 100,000 to be group 2, and so on, up to an arbitrary number of ranks (it needs this flexibility, as the size of my data and the number of overall ranks are likely to change).
Does anyone know how to achieve this? This is what my intended output looks like
--------------------------------------
| id  | rank    | segment |
--------------------------------------
| 100 | 1 | 1 |
| 200 | 100,002 | 2 |
| 300 | 900,007 | 9 |
--------------------------------------
The only help I can find from browsing is for splitting the ranks into some kind of quantile, but I need guarantees that my segments are of size 100,000.
Does anyone have any tips as to how to achieve this outcome?
Some sample code here if it helps
import pandas as pd

spark.createDataFrame(pd.DataFrame({
    "id": [100, 200, 300], "rank": [1, 100002, 900007]
}))
You can use the ceil function.
import pyspark.sql.functions as F
......
step = 100000
df = df.withColumn('segment', F.expr(f'ceil(rank / {step})'))
df.show(truncate=False)
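For the sample frame above this works out as follows (a worked check of the arithmetic, not output copied from the answer):
ceil(1 / 100000)      = 1
ceil(100002 / 100000) = 2
ceil(900007 / 100000) = 10   (ranks 800,001-900,000 form segment 9, so rank 900,007 lands in segment 10)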

Extracting a sub array from PySpark DataFrame column [duplicate]

This question already has answers here: get first N elements from dataframe ArrayType column in pyspark (2 answers). Closed 4 years ago.
I wish to remove the last element of the array from this DataFrame. We have this link demonstrating the same thing, but with UDFs, and that I wish to avoid. Is there a simple way to do this - something like list[:2]?
data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
| data|
+-------------------+
| [cat, dog, sheep]|
| [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+
Expected DataFrame:
+--------------+
| data|
+--------------+
| [cat, dog]|
| [bus, truck]|
| [ice, pizza]|
+--------------+
UDF is the best thing you can find for PySpark :)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements of each array
split_row = udf(lambda row: row[:2], ArrayType(StringType()))
# apply the udf to each row
new_df = df.withColumn("data", split_row(df["data"]))
new_df.show()
# Output
+------------+
| data|
+------------+
| [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+
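As a side note that is not part of the answer above: on Spark 2.4+ the built-in slice and size SQL functions avoid the UDF entirely and keep the column an array, e.g. dropping the last element whatever the array length:
from pyspark.sql import functions as F

new_df = df.withColumn("data", F.expr("slice(data, 1, size(data) - 1)"))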

Extracting multiple columns from column in PySpark DataFrame using named regex

Suppose I have a DataFrame df in pySpark of the following form:
| id | type | description |
| 1 | "A" | "Date: 2018/01/01\nDescr: This is a test des\ncription\n |
| 2 | "B" | "Date: 2018/01/02\nDescr: Another test descr\niption\n |
| 3 | "A" | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n |
which is of course a dummy set, but will suffice for this example.
I have made a regex-statement with named groups that can be used to extract the relevant information from the description-field, something along the lines of:
^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$
again, dummy regex, the actual regex is somewhat more elaborate, but the purpose is to capture three possible groups:
| DATE | DESCR | WARNING |
| 2018/01/01 | This is a test des\ncription | None |
| 2018/01/02 | Another test descr\niption | None |
| 2018/01/03 | None | This is a warnin\ng, watch out |
Now I would want to add the columns that are the result of the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).
I have tried several ways to accomplish this, but none has led to the full solution yet. One thing I've tried is:
import re

def extract_fields(string):
    patt = <ABOVE_PATTERN>
    result = re.match(patt, string, re.DOTALL).groupdict()
    # Actually, a slight work-around is needed to overcome the None problem when
    # no match can be made; I'm using pandas' .str.extract for this now
    return result

df.rdd.map(lambda x: extract_fields(x.description))
This will yield the second table, but I see no way to combine it with the original columns from df. I have tried to construct a new Row(), but then I run into problems with the ordering of columns (and the fact that I cannot hard-code the column names that will be added by the regex groups) that is needed in the Row()-constructor, resulting in a DataFrame that has the columns all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id, type, description, DATE, DESCR and WARNING?
Remark. Actually, the description field is not just one field, but several columns. Using concat_ws, I have concatenated these columns into a new column description with the description fields separated by \n, but maybe this can be incorporated in a nicer way.
I think you can use Pandas features for this case. First I convert df to an rdd to split the description field, then I pull the pieces into a Pandas df and create a Spark df from that Pandas df. It works regardless of the number of fields in description:
>>> import pandas as pd
>>> import re
>>>
>>> df.show(truncate=False)
+---+----+-----------------------------------------------------------+
|id |type|description |
+---+----+-----------------------------------------------------------+
|1 |A |Date: 2018/01/01\nDescr: This is a test des\ncription\n |
|2 |B |Date: 2018/01/02\nDescr: Another test desc\niption\n |
|3 |A |Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n|
+---+----+-----------------------------------------------------------+
>>> #convert df to rdd
>>> rdd = df.rdd.map(list)
>>> rdd.first()
[1, 'A', 'Date: 2018/01/01\\nDescr: This is a test des\\ncription\\n']
>>>
>>> #split description field
>>> rddSplit = rdd.map(lambda x: (x[0],x[1],re.split('\n(?=[A-Z])', x[2].encode().decode('unicode_escape'))))
>>> rddSplit.first()
(1, 'A', ['Date: 2018/01/01', 'Descr: This is a test des\ncription\n'])
>>>
>>> #create empty Pandas df
>>> df1 = pd.DataFrame()
>>>
>>> #insert rows
>>> for rdd in rddSplit.collect():
... a = {i.split(':')[0].strip():i.split(':')[1].strip('\n').replace('\n','\\n').strip() for i in rdd[2]}
... a['id'] = rdd[0]
... a['type'] = rdd[1]
... df2 = pd.DataFrame([a], columns=a.keys())
... df1 = pd.concat([df1, df2])
...
>>> df1
Date Descr Warning id type
0 2018/01/01 This is a test des\ncription NaN 1 A
0 2018/01/02 Another test desc\niption NaN 2 B
0 2018/01/03 NaN This is a warnin\ng, watch out 3 A
>>>
>>> #create spark df
>>> df3 = spark.createDataFrame(df1.fillna('')).replace('',None)
>>> df3.show(truncate=False)
+----------+----------------------------+------------------------------+---+----+
|Date |Descr |Warning |id |type|
+----------+----------------------------+------------------------------+---+----+
|2018/01/01|This is a test des\ncription|null |1 |A |
|2018/01/02|Another test desc\niption |null |2 |B |
|2018/01/03|null |This is a warnin\ng, watch out|3 |A |
+----------+----------------------------+------------------------------+---+----+
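An alternative sketch of my own (not part of the answer above): the extract_fields function from the question, with its None workaround in place, can be combined with the original columns directly in the rdd.map step:
def with_extracted(row):
    # Merge the regex groups with the columns we already have.
    fields = extract_fields(row.description)
    return (row.id, row.type, row.description,
            fields.get('DATE'), fields.get('DESCR'), fields.get('WARNING'))

result = df.rdd.map(with_extracted).toDF(
    ['id', 'type', 'description', 'DATE', 'DESCR', 'WARNING'])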

PySpark aggregation function for "any value"

I have a PySpark DataFrame with an A field, a few B fields that depend on A (A->B), and C fields that I want to aggregate per each A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, present any of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together or select MIN(B) per each A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL COALESCE in PySpark?
Thanks
You'll just need to use first instead:
from pyspark.sql.functions import first, sum, col
from pyspark.sql import Row

array = [Row(A="A", B=1, C=6),
         Row(A="A", B=1, C=7),
         Row(A="B", B=2, C=8),
         Row(A="B", B=2, C=4)]
df = sqlContext.createDataFrame(sc.parallelize(array))

results = df.groupBy(col("A")).agg(first(col("B")).alias("B"), sum(col("C")).alias("C"))
Let's now check the results:
results.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | B| 2| 12|
# | A| 1| 13|
# +---+---+---+
From the comments:
Is first here computationally equivalent to any?
groupBy causes a shuffle, so non-deterministic behaviour is to be expected.
Which is confirmed in the documentation of first :
Aggregate function: returns the first value in a group.
The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
note:: The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
So yes, computationally they are the same, and that's one of the reasons you would need to use sorting if you need deterministic behaviour.
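As a small addition of my own (not in the original answer): first also accepts an ignorenulls flag, which is the closest match to the COALESCE-style "first non-null" intent in the question:
results = df.groupBy("A").agg(
    first("B", ignorenulls=True).alias("B"),
    sum("C").alias("C"))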
I hope this helps !

Filter a large number of IDs from a dataframe Spark

I have a large dataframe with a format similar to
+-----+------+------+
|ID |Cat |date |
+-----+------+------+
|12 | A |201602|
|14 | B |201601|
|19 | A |201608|
|12 | F |201605|
|11 | G |201603|
+-----+------+------+
and I need to filter rows based on a list of around 5000 thousand IDs. The straightforward way would be to filter with isin, but that has really bad performance. How can this filter be done?
If you're committed to using Spark SQL and isin doesn't scale anymore, then an inner equi-join should be a decent fit.
First convert the id list to a single-column DataFrame. If it is a local collection:
ids_df = sc.parallelize(id_list).map(lambda x: (x, )).toDF(["id"])
and join:
df.join(ids_df, ["ID"], "inner")
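If the id DataFrame is small enough to fit comfortably in executor memory, a broadcast hint (my addition, not part of the original answer) avoids shuffling the large table at all:
from pyspark.sql.functions import broadcast

filtered = df.join(broadcast(ids_df), ["ID"], "inner")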
