I have a pyspark DataFrame with only one column as follows:
df = spark.createDataFrame(["This is AD185E000834", "U1JG97297 And ODNO926902 etc.","DIHK2975290;HI22K2390279; DSM928HK08", "there is nothing here."], "string").toDF("col1")
I would like to extract the codes in col1 to other columns like:
df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]
Does anyone know how to do this? Thank you very much.
I believe this can be shortened; I went the long way to show you my logic. It would have been easier if you had laid out your logic in the question.
import pyspark.sql.functions as F

# split the string into an array of tokens on whitespace or ';'
df1 = df.withColumn('k', F.split(F.col('col1'), r'\s|;')).withColumn('j', F.size('k'))

# compute the maximum array length across all rows
s = df1.agg(F.max('j').alias('max')).collect()[0][0]

df1 = (df1
       # keep only tokens made entirely of uppercase letters and digits
       .withColumn('k', F.expr("filter(k, x -> x rlike('^[A-Z0-9]+$'))"))
       # convert the resulting array into a struct so it can be expanded into columns
       .withColumn(
           "k",
           F.struct(*[F.col("k")[i].alias(f"col{i+2}") for i in range(s)])
       ))

# expand the struct column in df1 and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()
+--------------------+------------+------------+----------+----+
| col1| col2| col3| col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834| null| null|null|
|U1JG97297 And ODN...| U1JG97297| ODNO926902| null|null|
|there is nothing ...| null| null| null|null|
+--------------------+------------+------------+----------+----+
As you said in your comment, here we are assuming that your "codes" are strings of at least two characters composed only of uppercase letters and digits.
That being said, as of Spark 3.1+, you can use regexp_extract_all inside an expr to create a temporary array column with all the codes, then dynamically create one column for each entry of the arrays.
import pyspark.sql.functions as F
# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))
# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))
# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i+2)) for i in range(max_array_length)]) \
.show(truncate=False)
Max array length: 3
+------------------------------------+------------+------------+----------+
|col1 |col2 |col3 |col4 |
+------------------------------------+------------+------------+----------+
|This is AD185E000834 |AD185E000834|null |null |
|U1JG97297 And ODNO926902 etc. |U1JG97297 |ODNO926902 |null |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here. |null |null |null |
+------------------------------------+------------+------------+----------+
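One caveat: the [A-Z0-9]{2,} pattern would also pick up plain all-caps words of two or more letters. If your codes always contain at least one digit, a slightly stricter pattern with a lookahead should skip such words; a hedged, untested sketch of the same approach, reusing the column names above:
# hypothetical variant: require at least one digit inside each matched code
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '((?=[A-Z0-9]*[0-9])[A-Z0-9]{2,})', 1)"))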
I'm trying to aggregate a dataframe of many sensor readings over time down to just the sum for each sensor. I have many dataframes, but they all have the same schema, with 10 columns, one for each sensor:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+---------+
|sensor_1|sensor_2|sensor_3|sensor_4|sensor_5|sensor_6|sensor_7|sensor_8|sensor_9|sensor_10|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+---------+
| 220.0| 339.0| -336.0| 364.0| null| 492.0| -796.0| -423.0| -582.0| -40.0|
| 178.0| 221.0| -317.0| 366.0| null| 525.0| -754.0| -415.0| -932.0| -305.0|
| 151.0| 42.0| -280.0| 250.0| null| 463.0| -772.0| -229.0| -257.0| -59.0|
| 162.0| -123.0| -243.0| 288.0| null| 303.0| -899.0| 212.0| -295.0| 38.0|
| 158.0| -287.0| -300.0| 372.0| null| 169.0| -769.0| 755.0| 169.0| -239.0|
| 136.0| -302.0| -308.0| 242.0| null| 241.0| -510.0| 888.0| 282.0| -293.0|
| 124.0| -131.0| -292.0| 132.0| null| 234.0| -494.0| 970.0| -326.0| -203.0|
| 127.0| 133.0| -208.0| 14.0| null| 134.0| -748.0| 700.0| 237.0| -278.0|
| 142.0| 374.0| -81.0| -177.0| null| -200.0| -678.0| 402.0| 664.0| -460.0|
| 135.0| 538.0| 52.0| -113.0| null| -440.0| -711.0| 35.0| 877.0| -452.0|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+---------+
You can try recreating a small version of the dataframe with the following (an explicit schema is needed because sensor_5 is entirely null):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

columns = ['sensor_1','sensor_2','sensor_3','sensor_4','sensor_5','sensor_6','sensor_7','sensor_8','sensor_9','sensor_10']
data = [(220.0, 339.0, -336.0, 364.0, None, 492.0, -796.0, -423.0, -582.0, -40.0),
        (178.0, 221.0, -317.0, 366.0, None, 525.0, -754.0, -415.0, -932.0, -305.0),
        (151.0, 42.0, -280.0, 250.0, None, 463.0, -772.0, -229.0, -257.0, -59.0)]
schema = StructType([StructField(c, DoubleType(), True) for c in columns])

spark = SparkSession.builder.appName('Sensors').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
After that, I try to create the sums DataFrame with:
exprs = {x: "sum" for x in df.columns}
sum_df = df.agg(exprs)
This gives the following output.
+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
|sum(sensor_2)|sum(sensor_9)|sum(sensor_3)|sum(sensor_8)|sum(sensor_4)|sum(sensor_7)|sum(sensor_1)|
+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| 276834.0| 87904.0| 213587.0| 76103.0| 121201.0| 423609.0| -96621.0|
+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
As you can see, the columns are out of order relative to the original. I had to trim the display of the whole df to fit neatly in this post, but you get the idea. I'm not sure what ordering the Spark engine decided to use, but it doesn't suit me because I need the columns in a consistent order. Why does it do this, and how can I keep the order consistent?
Let's try using a list comprehension; unlike the dict form of agg, the unpacked list of expressions keeps the columns in the order you pass them, i.e. the order of df.columns:
from pyspark.sql import functions as f

df1 = df.agg(*[f.sum(x).alias(x) for x in df.columns])
df1.show()
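For completeness, if you want to keep the dict-based agg from your question, you can still restore a consistent layout afterwards by re-selecting the generated sum(...) columns in the order of df.columns. A minimal sketch, relying on Spark's default sum(<col>) naming:
# hedged sketch: reorder and rename the columns produced by the dict-based agg
sum_df = df.agg({x: "sum" for x in df.columns})
ordered = sum_df.select(*[f.col(f"sum({c})").alias(c) for c in df.columns])
ordered.show()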
I have a pyspark dataframe with multiple columns. For example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| a| p|
|Jack| b| q|
|Bell| c| r|
|Bell| d| s|
+----+--------+--------+
Now I want to group by "name" and concatenate the values in every row for both columns.
I know how to do it, but if there are thousands of columns my code becomes very ugly.
Here is my solution.
import pyspark.sql.functions as f
t = score_card.groupby("name").agg(
    f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
    f.concat_ws("", f.collect_list("letters2")).alias("letters2")
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| ab| pq|
|Bell| cd| rs|
+----+--------+--------+
But my main concern is about these two lines of code:
f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
f.concat_ws("", f.collect_list("letters2")).alias("letters2")
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.
Yes, you can build the aggregation expressions in a loop (here a list comprehension) inside agg and iterate over df.columns. Let me know if it helps.
from pyspark.sql import functions as F
df.show()
# +--------+--------+----+
# |letters1|letters2|name|
# +--------+--------+----+
# | a| p|Jack|
# | b| q|Jack|
# | c| r|Bell|
# | d| s|Bell|
# +--------+--------+----+
df.groupBy("name").agg( *[F.array_join(F.collect_list(column), "").alias(column) for column in df.columns if column !='name' ]).show()
# +----+--------+--------+
# |name|letters1|letters2|
# +----+--------+--------+
# |Bell| cd| rs|
# |Jack| ab| pq|
# +----+--------+--------+
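If you would rather keep concat_ws from your original code, the same pattern should work; a minimal equivalent sketch:
df.groupBy("name").agg(
    *[F.concat_ws("", F.collect_list(c)).alias(c) for c in df.columns if c != 'name']
).show()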
I have a pyspark dataframe with a column I am trying to extract information from. To give you an example, the column is a combination of 4 foreign keys which could look like this:
Ex 1: 12345-123-12345-4
Ex 2: 5678-4321-123-12
I am trying to extract the last piece of the string, in this case the 4 & 12. Any idea on how I can do this?
I've tried the following:
df.withColumn("result", sf.split(sf.col("column_to_split"), '\_')[1])\
.withColumn("result", sf.col("result").cast('integer'))
However, the result for double-digit values is null, and it only returns an integer for single digits (0-9).
Thanks!
For Spark 2.4+, you should use element_at with index -1 on the array you get after split.
from pyspark.sql import functions as sf
df.withColumn("result", sf.element_at(sf.split("column_to_split","\-"),-1).cast("int")).show()
+-----------------+------+
| column_to_split|result|
+-----------------+------+
|12345-123-12345-4| 4|
| 5678-4321-123-12| 12|
+-----------------+------+
Mohammad's answer is very clean and a nice solution. However, if you need a solution for Spark versions < 2.4, you can utilise the reverse string function: reverse the string, split it, take the first element, reverse that back, and cast it to an integer, e.g.:
import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = pd.DataFrame()
df['column_to_split'] = ["12345-123-12345-4", "5678-4321-123-12"]
df = spark.createDataFrame(df)
df.withColumn("result",
f.reverse(f.split(f.reverse("column_to_split"), "-")[0]). \
cast(t.IntegerType())).show(2, False)
+-----------------+------+
|column_to_split |result|
+-----------------+------+
|12345-123-12345-4|4 |
|5678-4321-123-12 |12 |
+-----------------+------+
This is how to get the last digits of the serial number above:
serial_no = '12345-123-12345-4'
last_digit = serial_no.split('-')[-1]
print(last_digit)
So in your case, the same idea in PySpark (plain Python int() and negative list indexing don't work on Spark columns) would be:
df.withColumn("result", sf.element_at(sf.split(sf.col("column_to_split"), "-"), -1).cast("int"))
If it doesn't work, please share the result.
Adding another couple of ways:
You can also use the .regexp_extract() or .substring_index() functions:
Example:
df.show()
#+-----------------+
#| column_to_split|
#+-----------------+
#|12345-123-12345-4|
#| 5678-4321-123-12|
#+-----------------+
df.withColumn("result",regexp_extract(col("column_to_split"),"([^-]+$)",1).cast("int")).\
withColumn("result1",substring_index(col("column_to_split"),"-",-1).cast("int")).\
show()
#+-----------------+------+-------+
#| column_to_split|result|result1|
#+-----------------+------+-------+
#|12345-123-12345-4| 4| 4|
#| 5678-4321-123-12| 12| 12|
#+-----------------+------+-------+
So I have a dataframe df like so,
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
+---+-----+
I also have a dict like so:
{"COL_B":"abc","COL_C":""}
Now what I have to do is update df so that each key in the dict becomes a new column name and its value becomes that column's constant value.
Expected df should be like:
+---+-----+-----+-----+
| ID|COL_A|COL_B|COL_C|
+---+-----+-----+-----+
| 1| 123| abc| |
+---+-----+-----+-----+
Here's my Python (pandas) code that does it, and it works fine...
input_data = pd.read_csv(inputFilePath,dtype=str)
for key, value in mapRow.iteritems(): #mapRow is the dict
if value is None:
input_data[key] = ""
else:
input_data[key] = value
Now I'm migrating this code to PySpark and would like to know how to do the same thing there.
Thanks for the help.
To combine RDDs, we can use zip or join. Below is an explanation using zip: zip pairs the two RDDs element-wise, and map flattens each pair into a single record.
from pyspark.sql import Row
rdd_1 = sc.parallelize([Row(ID=1,COL_A=2)])
rdd_2 = sc.parallelize([Row(COL_B="abc",COL_C=" ")])
result_rdd = rdd_1.zip(rdd_2).map(lambda x: [j for i in x for j in i])
NOTE: I don't have pyspark with me at the moment, so this isn't tested.
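If all you need is to add constant-valued columns from the dict (rather than combining two RDDs), a simpler DataFrame-only route is withColumn with lit. A minimal untested sketch, reusing the df and mapRow names from the question:
from pyspark.sql.functions import lit

# add one constant column per dict entry; None becomes an empty string,
# mirroring the pandas logic in the question
for key, value in mapRow.items():
    df = df.withColumn(key, lit("" if value is None else value))
df.show()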
This creates my example dataframe:
from pyspark.sql.functions import lit, col

df = sc.parallelize([('abc',),('def',)]).toDF()
df = df.selectExpr("_1 as one")
df = df.withColumn("two", lit('z'))
df.show()
looking like this:
+---+---+
|one|two|
+---+---+
|abc| z|
|def| z|
+---+---+
Now what I want to do is run a series of SQL LIKE-style checks, appending each letter to column two when column one matches it.
In "pseudo code" it looks like this:
for letter in ['a','b','c','d']:
df = df['two'].where(col('one').like("%{}%".format(letter))) += letter
finally resulting in a df looking like this:
+---+----+
|one| two|
+---+----+
|abc|zabc|
|def| zd|
+---+----+
If you are using a list of strings to subset your string column, broadcast variables are a good fit. Let's start with a more realistic example where your strings still contain spaces:
from pyspark.sql.functions import lit

df = sc.parallelize([('a b c',),('d e f',)]).toDF()
df = df.selectExpr("_1 as one")
df = df.withColumn("two", lit('z'))
Then we create a broadcast variable from the list of letters, and define a udf that uses it to subset the list of tokens, finally concatenating the kept letters with the value in the other column and returning one string:
from pyspark.sql.functions import udf

letters = ['a','b','c','d']
letters_bd = sc.broadcast(letters)

def subs(col1, col2):
    # keep only the tokens that appear in the broadcast list of letters
    l_subset = [x for x in col1 if x in letters_bd.value]
    return col2 + ' ' + ' '.join(l_subset)

subs_udf = udf(subs)
To apply the above, the string we are subsetting needs to be converted to a list, so we use split() first and then apply our udf:
from pyspark.sql.functions import col, split
df.withColumn("three", split(col('one'), r'\W+')) \
.withColumn("three", subs_udf("three", "two")) \
.show()
+-----+---+-------+
| one|two| three|
+-----+---+-------+
|a b c| z|z a b c|
|d e f| z| z d|
+-----+---+-------+
Or, without a udf, using regexp_replace and concat, if your letters fit comfortably into a regex expression:
from pyspark.sql.functions import regexp_replace, col, concat, lit
df.withColumn("three", concat(col('two'), lit(' '),
regexp_replace(col('one'), '[^abcd]', ' ')))
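Note that this version replaces every non-matching character with a space, so the result can carry extra whitespace compared with the udf output. A hedged sketch that squeezes repeated spaces and trims the tail, using the same column names as above:
from pyspark.sql.functions import regexp_replace, col, concat, lit, trim

# replace unwanted characters with spaces, collapse runs of spaces, then trim
df.withColumn("three",
              concat(col("two"), lit(' '),
                     trim(regexp_replace(regexp_replace(col("one"), '[^abcd]', ' '), ' +', ' ')))
  ).show()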