Creating a new dataframe from a pyspark dataframe column efficiently - python

I wonder what is the most efficient way to extract a column from a PySpark dataframe and turn it into a new dataframe. The following code runs without any problems on small datasets, but runs very slowly and even causes out-of-memory errors. How can I improve the efficiency of this code?
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In the PySpark dataframe sdf_grp, the column "pairs" contains information as below
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges that looks like below
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+

The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F
schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])
df = spark.createDataFrame(
    [
        ([[39169813, 24907492], [39169813, 19650174]],),
        ([[10876191, 139604770]],),
        ([[6481958, 22689674]],),
        ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
        ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
    ], schema)
df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
| node1 | node2 |
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+

Well, I just solved it with the below
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
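For reference, the RDD round trip can be skipped entirely with a DataFrame-only variant (a short sketch, assuming sdf_grp has the "pairs" column with the schema shown above):
import pyspark.sql.functions as F

# explode each array element into its own row, then flatten the struct fields
sdf_edges = sdf_grp.select(F.explode('pairs').alias('pair')) \
                   .select('pair.node1', 'pair.node2')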

Related

Splitting Dataframe column containing structs into new columns

I have a Dataframe named df with the following structure:
root
 |-- country: string (nullable = true)
 |-- competition: string (nullable = true)
 |-- competitor: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- time: string (nullable = true)
Which looks like this:
|country|competition|competitor     |
|_____________________________________|
|USA    |WN         |[{Adam, 9.43}] |
|China  |FN         |[{John, 9.56}] |
|China  |FN         |[{Adam, 9.48}] |
|USA    |MNU        |[{Phil, 10.02}]|
|...    |...        |...            |
I want to pivot (or something similar) the competitor Column into new columns depending on the values in each struct so it looks like this:
|country|competition|Adam|John|Phil |
|____________________________________|
|USA    |WN         |9.43|... |...  |
|China  |FN         |9.48|9.56|...  |
|USA    |MNU        |... |... |10.02|
The names are unique, so if a column already exists I don't want to create a new one but fill the value into the existing one. There are a lot of names, so it needs to be done dynamically.
I have a pretty big dataset, so I can't use Pandas.
We can extract the struct from the array and then create new columns from that struct. The name column can be pivoted with the time as values.
from pyspark.sql import functions as func

data_sdf. \
    withColumn('name_time_struct', func.col('competitor')[0]). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()
+-------+-----------+----+----+-----+
|country|competition|Adam|John| Phil|
+-------+-----------+----+----+-----+
| USA| MNU|null|null|10.02|
| USA| WN|9.43|null| null|
| China| FN|9.48|9.56| null|
+-------+-----------+----+----+-----+
P.S. this assumes there's just 1 struct in the array.
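If the array can hold more than one struct per row, a sketch of the same idea that explodes the array first (still assuming the data_sdf and column names above) would be:
data_sdf. \
    withColumn('name_time_struct', func.explode('competitor')). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()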

Calculate relative frequency of bigrams in PySpark

I'm trying to count word pairs in a text file. First, I've done some pre-processing on the text, and then I counted word pairs as shown below:
((Aspire, to), 1) ; ((to, inspire), 4) ; ((inspire, before), 38)...
Now, I want to report the 1000 most frequent pairs, sorted by:
Word (second word of the pair)
Relative frequency (pair occurrences / 2nd word total occurrences)
Here's what I've done so far
from pyspark.sql import SparkSession
import re
spark = SparkSession.builder.appName("Bigram occurences and relative frequencies").master("local[*]").getOrCreate()
sc = spark.sparkContext
text = sc.textFile("big.txt")
tokens = text.map(lambda x: x.lower()).map(lambda x: re.split(r"[\s,.;:!?]+", x))
pairs = tokens.flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:]))).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
frame = pairs.toDF(['pair', 'count'])
# Dataframe ordered by the most frequent pair to the least
most_frequent = frame.sort(frame['count'].desc())
# For each row, trying to add a column with the relative frequency, but I'm getting an error
with_rf = frame.withColumn("rf", frame['count'] / (frame.pair._2.sum()))
I think I'm relatively close to the result I want but I can't figure it out. I'm new to Spark and DataFrames in general.
I also tried
import pyspark.sql.functions as F
frame.groupBy(frame['pair._2']).agg((F.col('count') / F.sum('count')).alias('rf')).show()
Any help would be appreciated.
EDIT: here's a sample of the frame dataframe
+--------------------+-----+
| pair|count|
+--------------------+-----+
|{project, gutenberg}| 69|
| {gutenberg, ebook}| 14|
| {ebook, of}| 5|
| {adventures, of}| 6|
| {by, sir}| 12|
| {conan, doyle)}| 1|
| {changing, all}| 2|
| {all, over}| 24|
+--------------------+-----+
root
|-- pair: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
|-- count: long (nullable = true)
The relative frequency can be computed using a window function that partitions by the second word in the pair and applies a sum.
Then we limit the entries in the df to the top x based on count, and finally order by the second word in the pair and the relative frequency.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

data = [(("project", "gutenberg"), 69,),
        (("gutenberg", "ebook"), 14,),
        (("ebook", "of"), 5,),
        (("adventures", "of"), 6,),
        (("by", "sir"), 12,),
        (("conan", "doyle"), 1,),
        (("changing", "all"), 2,),
        (("all", "over"), 24,), ]

df = spark.createDataFrame(data, ("pair", "count", ))

ws = W.partitionBy(F.col("pair")._2).rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

(df.withColumn("relative_freq", F.col("count") / F.sum("count").over(ws))
   .orderBy(F.col("count").desc())
   .limit(3)  # change here to select top 1000
   .orderBy(F.desc(F.col("pair")._2), F.col("relative_freq").desc())
).show()
"""
+--------------------+-----+-------------+
| pair|count|relative_freq|
+--------------------+-----+-------------+
| {all, over}| 24| 1.0|
|{project, gutenberg}| 69| 1.0|
| {gutenberg, ebook}| 14| 1.0|
+--------------------+-----+-------------+
"""

Pyspark: sum over a window based on a condition

Consider the simple DataFrame:
from pyspark import SparkContext
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf, PandasUDFType
spark = SparkSession.builder.appName('Trial').getOrCreate()
simpleData = (("2000-04-17", "144", 1), \
("2000-07-06", "015", 1), \
("2001-01-23", "015", -1), \
("2001-01-18", "144", -1), \
("2001-04-17", "198", 1), \
("2001-04-18", "036", -1), \
("2001-04-19", "012", -1), \
("2001-04-19", "188", 1), \
("2001-04-25", "188", 1),\
("2001-04-27", "015", 1) \
)
columns= ["dates", "id", "eps"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)
Out:
root
|-- dates: string (nullable = true)
|-- id: string (nullable = true)
|-- eps: long (nullable = true)
+----------+---+---+
|dates |id |eps|
+----------+---+---+
|2000-04-17|144|1 |
|2000-07-06|015|1 |
|2001-01-23|015|-1 |
|2001-01-18|144|-1 |
|2001-04-17|198|1 |
|2001-04-18|036|-1 |
|2001-04-19|012|-1 |
|2001-04-19|188|1 |
|2001-04-25|188|1 |
|2001-04-27|015|1 |
+----------+---+---+
I would like to sum the values in the eps column over a rolling window, keeping only the last value for any given ID in the id column. For example, defining a window of 5 rows and assuming we are on 2001-04-17, I want to sum only the last eps value for each unique ID. In the 5 rows we have only 3 different IDs, so the sum must be over 3 elements: -1 for ID 144 (fourth row), -1 for ID 015 (third row) and 1 for ID 198 (fifth row), for a total of -1.
In my mind, within the rolling window I should do something like F.sum(groupBy('id').agg(F.last('eps'))), which of course is not possible to achieve in a rolling window.
I obtained the desired result using a UDF.
import pandas as pd

@pandas_udf(IntegerType(), PandasUDFType.GROUPED_AGG)
def fun_sum(id, eps):
    df = pd.DataFrame()
    df['id'] = id
    df['eps'] = eps
    value = df.groupby('id').last().sum()
    return value
And then:
w = Window.orderBy('dates').rowsBetween(-5,0)
df = df.withColumn('sum', fun_sum(F.col('id'), F.col('eps')).over(w))
The problem is that my dataset contains more than 8 million rows and performing this task with this UDF takes about 2 hours.
I was wondering whether there is a way to achieve the same result with built-in PySpark functions, avoiding a UDF, or at least whether there is a way to improve the performance of my UDF.
For completeness, the desired output should be:
+----------+---+---+----+
|dates |id |eps|sum |
+----------+---+---+----+
|2000-04-17|144|1 |1 |
|2000-07-06|015|1 |2 |
|2001-01-23|015|-1 |0 |
|2001-01-18|144|-1 |-2 |
|2001-04-17|198|1 |-1 |
|2001-04-18|036|-1 |-2 |
|2001-04-19|012|-1 |-3 |
|2001-04-19|188|1 |-1 |
|2001-04-25|188|1 |0 |
|2001-04-27|015|1 |0 |
+----------+---+---+----+
EDIT: the result must also be achievable using a .rangeBetween() window.
In case you haven't figured it out yet, here's one way of achieving it.
Assuming that df is defined and initialised the way you defined and initialised it in your question.
Import the required functions and classes:
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
Create the necessary WindowSpec:
window_spec = (
    Window
    # Partition by 'id'.
    .partitionBy(df.id)
    # Order by 'dates', latest dates first.
    .orderBy(df.dates.desc())
)
Create a DataFrame with partitioned data:
partitioned_df = (
    df
    # Use the window function 'row_number()' to populate a new column
    # containing a sequential number starting at 1 within a window partition.
    .withColumn('row', row_number().over(window_spec))
    # Only select the first entry in each partition (i.e. the latest date).
    .where(col('row') == 1)
)
Just in case you want to double-check the data:
partitioned_df.show()
# +----------+---+---+---+
# | dates| id|eps|row|
# +----------+---+---+---+
# |2001-04-19|012| -1| 1|
# |2001-04-25|188| 1| 1|
# |2001-04-27|015| 1| 1|
# |2001-04-17|198| 1| 1|
# |2001-01-18|144| -1| 1|
# |2001-04-18|036| -1| 1|
# +----------+---+---+---+
Group and aggregate the data:
sum_rows = (
    partitioned_df
    # Aggregate data.
    .groupBy()
    # Sum all rows in 'eps' column.
    .sum('eps')
    # Get all records as a list of Rows.
    .collect()
)
Get the result:
print(f"sum eps: {sum_rows[0][0]})
# sum eps: 0
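For what it's worth, the same steps can be condensed into a single chain (a sketch reusing window_spec from above, not a different algorithm):
from pyspark.sql.functions import row_number, col, sum as sum_

total = (df
         .withColumn('row', row_number().over(window_spec))
         .where(col('row') == 1)
         .agg(sum_('eps').alias('sum_eps'))
         .collect()[0]['sum_eps'])
print(f"sum eps: {total}")
# sum eps: 0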

Pyspark mapping regex

I have a pyspark dataframe, with text column.
I wanted to map the values with a regex expression.
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
Plus, I wanted to map specific values according to a dictionary. I did the following (mapper is from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finally, the values which have not been mapped by the dictionary or the regex expressions will be set to null. I do not know how to do this part in accordance with the two others.
Is it possible to have something like a dictionary of regex expressions so I can regroup the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS/set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first 4 lines are mapped with a dict, and the others are mapped using regex. Unmapped values are null or ??.
Thank you,
You can achieve it using the contains function:
from pyspark.sql.types import StringType
from pyspark.sql import functions as f

df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()

names = ("GDF", "ADS", "FI", "RH")

def c(col, names):
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]

df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
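To come back to the "dictionary of regex expressions" idea from the question, here is a sketch that folds a pattern-to-status dict into a single when/rlike chain; rows matching no pattern fall through to null. The dict contents below are assumptions for illustration, not taken from the question:
import pyspark.sql.functions as F
from functools import reduce

# hypothetical pattern -> status mapping
regex_map = {"^GDF": "GDF", "^ADS": "ADS", ".*-FI$": "FI", ".*-RH$": "RH"}

items = list(regex_map.items())
status_col = reduce(
    lambda acc, kv: acc.when(F.col("message").rlike(kv[0]), kv[1]),
    items[1:],
    F.when(F.col("message").rlike(items[0][0]), items[0][1]),
)

# unmatched rows (e.g. "kijipiadoa") get null, as requested
df.withColumn("status", status_col).show(truncate=False)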

PySpark: How to convert column with Ljava.lang.Object

I created data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two column of strings:
+------------+------------------+
|my_column |my_other_column |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, I get my_column column with some strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column |my_other_column |
+-----------------------------+------------------+
|[Ljava.lang.Object;#7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;#5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;#6be335ee |my_other_string_3 |
|[Ljava.lang.Object;#153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;#1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;#3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;#33846636 |my_other_string_7 |
|[Ljava.lang.Object;#521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, the my_other_column column looks as expected. Is there any way to convert the objects in the my_column column to human-readable strings?
Jaroslav,
I tried with the following code, using a sample parquet file from here, and I am able to get the desired output from the dataframe. Can you please check your code against the snippet below, and also the sample file referred to above, to see if there's any other issue:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path to your HDFS location.
Dataframe output for your reference:
