PySpark: removing special/numeric strings from an array of strings

To keep it simple I have a df with the following schema:
root
|-- Event_Time: string (nullable = true)
|-- tokens: array (nullable = true)
| |-- element: string (containsNull = true)
Some of the elements of "tokens" contain numbers and special characters, for example:
"431883", "r2b2", "#refe98"
Is there any way I can remove all of those and keep only actual words? I want to run LDA later and want to clean my data first.
I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly.
Thanks
Edit 2:
df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', '')))
This works only if the column is of string type; with the explode method I can explode the array into strings, but then they are no longer in the same row... Can anyone improve on this?

from pyspark.sql.functions import *
df = spark.createDataFrame([(["#a", "b", "c"],), ([],)], ['data'])
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{#]",""))
#df_1.printSchema()
df_1.show()
+----------+------+------+
| data|data_1|data_2|
+----------+------+------+
|[#a, b, c]|#a,b,c| a,b,c|
| []| | |
+----------+------+------+

The solution I found (as also stated by pault in the comment section):
After exploding tokens, I groupBy and agg with collect_list to get the tokens back in the format I want them.
Here is pault's comment:
After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:
df2 = (df_1
       .select("Event_Time", explode("tokens").alias("elements"))
       .select("Event_Time", regexp_replace("elements", "<your regex here>", "").alias("elements"))
       .groupBy("Event_Time")
       .agg(collect_list("elements").alias("tokens")))
Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or rdd.
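For reference, here is what a UDF-based cleanup could look like. This is only a minimal sketch, assuming you simply want to drop every token that is not made up purely of letters; the helper name keep_words and its regex are my own choices, not from the thread:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical helper: keep only tokens consisting entirely of letters.
def keep_words(tokens):
    if tokens is None:
        return None
    return [t for t in tokens if re.fullmatch(r'[A-Za-z]+', t)]

keep_words_udf = udf(keep_words, ArrayType(StringType()))

# tokens stays an array column and rows are untouched, unlike the explode approach.
df_clean = df_1.withColumn('tokens', keep_words_udf('tokens'))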

The transform() function (added to pyspark.sql.functions in PySpark 3.1.0, and available as a SQL higher-order function since Spark 2.4) helped me accomplish this task a little more easily. The example in the question would now look like this:
from pyspark.sql import functions as F
df_2 = df_1.withColumn(
    "tokens",
    F.expr(r"transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w*', ''))"))

Related

Convert to DateType column

I have a column with the values below. How can I add another column with values converted to DateType?
As the front of the string is fixed and the middle of the string is comma-separated, you could use a mix of substr and split to get what you want. Finally use make_date to create the date from the component parts.
A simple example:
import pyspark.sql.functions as F
df2 = df \
    .withColumn("xyear", F.split(F.col("col1").substr(19, 12), ",").getItem(0)) \
    .withColumn("xmonth", F.split(F.col("col1").substr(19, 12), ",").getItem(1)) \
    .withColumn("xday", F.split(F.col("col1").substr(19, 12), ",").getItem(2)) \
    .withColumn("md2", F.expr("make_date(xyear, xmonth, xday)"))
df2.show()
You could also look at regex to split the string; there are good examples of that approach around. I'd be interested to see if there is a more Pythonic way of doing it.
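For what it's worth, a regex-based variant could look like the sketch below. This only assumes (as the substr/split answer above does) that the string embeds the date as comma-separated year, month, day values; the pattern itself is my own guess, not taken from the original data:
import pyspark.sql.functions as F

# Hypothetical pattern: pull out "year,month,day" wherever it appears in col1.
pattern = r"(\d{4}),(\d{1,2}),(\d{1,2})"
df2 = df \
    .withColumn("xyear", F.regexp_extract("col1", pattern, 1)) \
    .withColumn("xmonth", F.regexp_extract("col1", pattern, 2)) \
    .withColumn("xday", F.regexp_extract("col1", pattern, 3)) \
    .withColumn("md2", F.expr("make_date(xyear, xmonth, xday)"))
df2.show()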

Databricks: Python pivot table in spark dataframe

Can anyone give me some guidance on pivot tables using a Spark DataFrame in Python?
I am getting the following error: Column is not iterable
Does anyone have an idea?
The pivot function pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of pivot: one that requires the caller to specify the list of distinct values to pivot on, and one that does not.
With the column values specified: df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings")
Without specifying the column values (more concise but less efficient): df.groupBy("year").pivot("course").sum("earnings")
You are proceeding in the right direction. Here is sample working code (Python 2):
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import col
>>> spark = SparkSession.builder.master('local').appName('app').getOrCreate()
>>> df = spark.read.option('header', 'true').csv('pivot.csv')
>>> df = df.withColumn('value1', col('value1').cast("int"))
>>> pdf = df.groupBy('thisyear').pivot('month', ['JAN','FEB']).sum('value1')
>>> pdf.show(10)
+--------+---+---+
|thisyear|JAN|FEB|
+--------+---+---+
| 2019| 3| 2|
+--------+---+---+
//pivot.csv
thisyear,month,value1
2019,JAN,1
2019,JAN,1
2019,FEB,1
2019,JAN,1
2019,FEB,1
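For completeness, the same pivot can be written without listing the month values; as noted above this is more concise but less efficient, because Spark first has to work out the distinct values of month on its own:
# Same result as above, but Spark computes the distinct month values itself.
pdf = df.groupBy('thisyear').pivot('month').sum('value1')
pdf.show(10)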

How to zip two array columns in Spark SQL

I have a Pandas dataframe. I tried to turn two columns containing string values into lists first and then, using zip, joined the elements of the lists with '_'. My data set looks like this:
df['column_1']: 'abc, def, ghi'
df['column_2']: '1.0, 2.0, 3.0'
I wanted to join these two columns into a third column, as shown below, for each row of my dataframe.
df['column_3']: [abc_1.0, def_2.0, ghi_3.0]
I have successfully done so in Python using the code below, but the dataframe is quite large and it takes a very long time to run for the whole dataframe. I want to do the same thing in PySpark for efficiency. I have read the data into a Spark dataframe successfully, but I'm having a hard time figuring out how to replicate Pandas functionality with the equivalent PySpark functions. How can I get my desired result in PySpark?
df['column_3'] = df['column_2']
for index, row in df.iterrows():
    while index < 3:
        if isinstance(row['column_1'], str):
            row['column_1'] = list(row['column_1'].split(','))
            row['column_2'] = list(row['column_2'].split(','))
            row['column_3'] = ['_'.join(map(str, i)) for i in zip(list(row['column_1']), list(row['column_2']))]
I have converted the two columns to arrays in PySpark by using the below code
from pyspark.sql.types import ArrayType, IntegerType, StringType
from pyspark.sql.functions import col, split
crash.withColumn("column_1",
split(col("column_1"), ",\s*").cast(ArrayType(StringType())).alias("column_1")
)
crash.withColumn("column_2",
split(col("column_2"), ",\s*").cast(ArrayType(StringType())).alias("column_2")
)
Now all I need is to zip each element of the arrays in the two columns using '_'. How can I use zip with this? Any help is appreciated.
A Spark SQL equivalent of Python's zip would be pyspark.sql.functions.arrays_zip:
pyspark.sql.functions.arrays_zip(*cols)
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
So if you already have two arrays:
from pyspark.sql.functions import split
df = (spark
.createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')])
.toDF("column_1", "column_2")
.withColumn("column_1", split("column_1", "\s*,\s*"))
.withColumn("column_2", split("column_2", "\s*,\s*")))
you can just apply it to the result:
from pyspark.sql.functions import arrays_zip
df_zipped = df.withColumn(
"zipped", arrays_zip("column_1", "column_2")
)
df_zipped.select("zipped").show(truncate=False)
+------------------------------------+
|zipped |
+------------------------------------+
|[[abc, 1.0], [def, 2.0], [ghi, 3.0]]|
+------------------------------------+
Now, to combine the results, you can use transform (see How to use transform higher-order function? and TypeError: Column is not iterable - How to iterate over ArrayType()?):
from pyspark.sql.functions import expr

df_zipped_concat = df_zipped.withColumn(
    "zipped_concat",
    expr("transform(zipped, x -> concat_ws('_', x.column_1, x.column_2))")
)
df_zipped_concat.select("zipped_concat").show(truncate=False)
+---------------------------+
|zipped_concat |
+---------------------------+
|[abc_1.0, def_2.0, ghi_3.0]|
+---------------------------+
Note: the higher-order functions transform and arrays_zip were introduced in Apache Spark 2.4.
For Spark 2.4+, this can be done using only the zip_with function, which zips and concatenates at the same time:
df.withColumn("column_3", expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))"))
This higher-order function takes two arrays and merges them element-wise using the lambda function (x, y) -> concat(x, '_', y).
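As a minimal end-to-end sketch of this approach (assuming the comma-separated columns from the question still need to be split first, and importing expr explicitly):
from pyspark.sql.functions import expr, split

df = (spark
      .createDataFrame([('abc, def, ghi', '1.0, 2.0, 3.0')], ['column_1', 'column_2'])
      .withColumn("column_1", split("column_1", r"\s*,\s*"))
      .withColumn("column_2", split("column_2", r"\s*,\s*")))

df_3 = df.withColumn(
    "column_3",
    expr("zip_with(column_1, column_2, (x, y) -> concat(x, '_', y))")
)
df_3.select("column_3").show(truncate=False)
# Expected output, as in the question: [abc_1.0, def_2.0, ghi_3.0]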
You can also use a UDF to zip the split array columns:
df = spark.createDataFrame([('abc,def,ghi','1.0,2.0,3.0')], ['col1','col2'])
+-----------+-----------+
|col1 |col2 |
+-----------+-----------+
|abc,def,ghi|1.0,2.0,3.0|
+-----------+-----------+ ## Hope this is how your dataframe is
from pyspark.sql import functions as F
from pyspark.sql.types import *
def concat_udf(*args):
    return ['_'.join(x) for x in zip(*args)]
udf1 = F.udf(concat_udf,ArrayType(StringType()))
df = df.withColumn('col3',udf1(F.split(df.col1,','),F.split(df.col2,',')))
df.show(1,False)
+-----------+-----------+---------------------------+
|col1 |col2 |col3 |
+-----------+-----------+---------------------------+
|abc,def,ghi|1.0,2.0,3.0|[abc_1.0, def_2.0, ghi_3.0]|
+-----------+-----------+---------------------------+
For Spark 3.1+, pyspark.sql.functions.zip_with() accepts a Python lambda function, so it can be done like this:
import pyspark.sql.functions as F
df = df.withColumn("column_3", F.zip_with("column_1", "column_2", lambda x,y: F.concat_ws("_", x, y)))

Elegant way to fillna missing values for dates in spark

Let me break this problem down into a smaller chunk. I have a DataFrame in PySpark with a column arrival_date in date format:
from pyspark.sql.functions import to_date, col, lit, when
values = [('22.05.2016',),('13.07.2010',),('15.09.2012',),(None,)]
df = sqlContext.createDataFrame(values,['arrival_date'])
#Following code line converts String into Date format
df = df.withColumn('arrival_date',to_date(col('arrival_date'),'dd.MM.yyyy'))
df.show()
+------------+
|arrival_date|
+------------+
| 2016-05-22|
| 2010-07-13|
| 2012-09-15|
| null|
+------------+
df.printSchema()
root
|-- arrival_date: date (nullable = true)
After applying a lot of transformations to the DataFrame, I finally wish to fill in the missing dates (marked as null) with 01-01-1900.
One method is to convert the column arrival_date to String, replace the missing values this way: df.fillna('1900-01-01', subset=['arrival_date']), and finally reconvert the column with to_date. This is very inelegant.
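Spelled out, that round-trip would be something like this (just a sketch of the approach I described, not what I want to keep):
from pyspark.sql.functions import col, to_date

# date -> string -> fillna -> back to date
df = df.withColumn('arrival_date', col('arrival_date').cast('string'))
df = df.fillna('1900-01-01', subset=['arrival_date'])
df = df.withColumn('arrival_date', to_date(col('arrival_date'), 'yyyy-MM-dd'))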
The following line of code doesn't work as expected, and I get an error:
df = df.fillna(to_date(lit('1900-01-01'),'yyyy-MM-dd'), subset=['arrival_date'])
The documentation says: "The value must be of the following type: Int, Long, Float, Double, String, Boolean."
Another way is by using withColumn() and when() -
df = df.withColumn('arrival_date',when(col('arrival_date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('arrival_date')))
Is there a way where I could directly assign a date of my choice to a date-formatted column using some function?
Does anyone have a better suggestion?
The second way should be the way to do it, but you don't have to use to_date to convert between string and date; just use datetime.date(1900, 1, 1).
import datetime as dt
from pyspark.sql.functions import col, when

df = df.withColumn('arrival_date', when(col('arrival_date').isNull(), dt.date(1900, 1, 1)).otherwise(col('arrival_date')))
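If you would rather avoid the when/otherwise pattern altogether, coalesce with a date literal is a possible alternative (a sketch using the same 1900-01-01 default, not something the original answer mentions):
from pyspark.sql.functions import coalesce, col, lit, to_date

# coalesce returns the first non-null value, so nulls fall back to the default date.
df = df.withColumn(
    'arrival_date',
    coalesce(col('arrival_date'), to_date(lit('1900-01-01'), 'yyyy-MM-dd'))
)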

PySpark - Sum a column in dataframe and return results as int

I have a PySpark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
| 130|
+-----------+
I would like 130 returned as an int, stored in a variable to be used elsewhere in the program.
result = 130
I think the simplest way:
df.groupBy().sum().collect()
will return a list.
In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
If you want a specific column:
import pyspark.sql.functions as F
df.agg(F.sum("my_column")).collect()[0][0]
The simplest way, really:
df.groupBy().sum().collect()
But it is a very slow operation: avoid groupByKey; you should use RDD and reduceByKey instead:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
I tried it on a bigger dataset and measured the processing time:
RDD and reduceByKey: 2.23 s
groupByKey: 30.5 s
This is another way you can do it, using agg and collect:
sum_number = df.agg({"Number":"sum"}).collect()[0]
result = sum_number["sum(Number)"]
Similar to the other answers, but without the use of a groupBy or agg. I just select the column in question, sum it, collect it, and then grab the first two indices to return an int. The only reason I chose this over the accepted answer is that I am new to PySpark and was confused that the 'Number' column was not explicitly summed in the accepted answer. If I had to come back after some time and try to understand what was happening, syntax such as the below would be easier for me to follow.
import pyspark.sql.functions as f
df.select(f.sum('Number')).collect()[0][0]
You can also try using the first() function. It returns the first row of the dataframe, and you can access the values of the respective columns using indices.
df.groupBy().sum().first()[0]
In your case the result is a dataframe with a single row and column, so the above snippet works.
Select the column as an RDD, abuse keys() to get the value out of the Row (or use .map(lambda x: x[0])), then use the RDD's sum:
df.select("Number").rdd.keys().sum()
SQL sum using selectExpr:
df.selectExpr("sum(Number)").first()[0]
The following should work:
df.groupBy().sum().rdd.map(lambda x: x[0]).collect()
Sometimes when you read a CSV file into a PySpark DataFrame, a numeric column comes in as string type ('23', for example). In that case you should use pyspark.sql.functions.sum to get the result as an int, not the built-in sum():
import pyspark.sql.functions as F
df.groupBy().agg(F.sum('Number')).show()
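A small sketch of that, casting the string column first and then pulling the total out as a plain Python int (the cast and the int() call are my own additions for illustration):
import pyspark.sql.functions as F

# Cast the string column to a numeric type before aggregating,
# then collect the single aggregated value as a plain Python int.
total = int(df.agg(F.sum(F.col('Number').cast('int'))).collect()[0][0])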
