I have a pyspark dataframe like the input data below. I would like to split the values in the productname column on white space. I'd then like to create new columns with the first 3 values. I have example input and output data below. Can someone please suggest how to do this with pyspark?
input data:
+------+-------------------+
|id    |productname        |
+------+-------------------+
|235832|EXTREME BERRY Sweet|
|419736|BLUE CHASER SAUCE  |
|124513|LAAVA C2L5         |
+------+-------------------+
output:
+------+-------------------+-------------+-------------+-------------+
|id    |productname        |product1     |product2     |product3     |
+------+-------------------+-------------+-------------+-------------+
|235832|EXTREME BERRY Sweet|EXTREME      |BERRY        |Sweet        |
|419736|BLUE CHASER SAUCE  |BLUE         |CHASER       |SAUCE        |
|124513|LAAVA C2L5         |LAAVA        |C2L5         |             |
+------+-------------------+-------------+-------------+-------------+
Split the productname column, then create the new columns using element_at() (or .getItem()) on the index value.
df.withColumn("tmp",split(col("productname"),"\s+")).\
withColumn("product1",element_at(col("tmp"),1)).\
withColumn("product2",element_at(col("tmp"),2)).\
withColumn("product3",coalesce(element_at(col("tmp"),3),lit(""))).drop("tmp").show()
#or
df.withColumn("tmp",split(col("productname"),"\s+")).\
withColumn("product1",col("tmp").getItem(0)).\
withColumn("product2",col("tmp").getItem(1)).\
withColumn("product3",coalesce(col("tmp").getItem(2),lit(""))).drop("tmp").show()
#+------+-------------------+--------+--------+--------+
#|    id|        productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME|   BERRY|   Sweet|
#|419736|  BLUE CHASER SAUCE|    BLUE|  CHASER|   SAUCE|
#|124513|         LAAVA C2L5|   LAAVA|    C2L5|        |
#+------+-------------------+--------+--------+--------+
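The same three columns can also be built in a single select instead of chained withColumn calls; a minimal sketch, assuming the same df and imports as above:
#a minimal sketch: reuse one split expression and alias each element in a single select
tmp = split(col("productname"), r"\s+")
df.select(
    "*",
    tmp.getItem(0).alias("product1"),
    tmp.getItem(1).alias("product2"),
    coalesce(tmp.getItem(2), lit("")).alias("product3"),
).show()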
To do it in a more dynamic way:
df.show()
#+------+-------------------+
#|    id|        productname|
#+------+-------------------+
#|235832|EXTREME BERRY Sweet|
#|419736|  BLUE CHASER SAUCE|
#|124513|         LAAVA C2L5|
#+------+-------------------+
#calculate the max array size and store it in a variable
arr=int(df.select(size(split(col("productname"),r"\s+")).alias("size")).orderBy(desc("size")).collect()[0][0])
#loop up to the arr variable and add the columns, replacing null with ""
(df.withColumn('temp', split('productname', r'\s+')).select("*",*(coalesce(col('temp').getItem(i),lit("")).alias('product{}'.format(i+1)) for i in range(arr))).drop("temp").show())
#+------+-------------------+--------+--------+--------+
#|    id|        productname|product1|product2|product3|
#+------+-------------------+--------+--------+--------+
#|235832|EXTREME BERRY Sweet| EXTREME|   BERRY|   Sweet|
#|419736|  BLUE CHASER SAUCE|    BLUE|  CHASER|   SAUCE|
#|124513|         LAAVA C2L5|   LAAVA|    C2L5|        |
#+------+-------------------+--------+--------+--------+
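If you'd rather avoid the orderBy + collect, the maximum size can also be computed with an aggregate; a minimal sketch, assuming the same df:
from pyspark.sql import functions as F

#a minimal sketch: compute the max token count with an aggregate instead of sorting and collecting
arr = int(df.select(F.max(F.size(F.split("productname", r"\s+"))).alias("size")).first()["size"])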
You can use split, element_at, and a when/otherwise clause with array_union to fill in empty strings.
from pyspark.sql import functions as F

df.withColumn("array", F.split("productname", r"\s+"))\
.withColumn("array", F.when(F.size("array")==2, F.array_union(F.col("array"),F.array(F.lit(""))))\
.when(F.size("array")==1, F.array_union(F.col("array"),F.array(F.lit(" "),F.lit(""))))\
.otherwise(F.col("array")))\
.withColumn("product1", F.element_at("array",1))\
.withColumn("product2", F.element_at("array",2))\
.withColumn("product3", F.element_at("array",3)).drop("array")\
.show(truncate=False)
+------+-------------------+--------+--------+--------+
|id    |productname        |product1|product2|product3|
+------+-------------------+--------+--------+--------+
|235832|EXTREME BERRY Sweet|EXTREME |BERRY   |Sweet   |
|419736|BLUE CHASER SAUCE  |BLUE    |CHASER  |SAUCE   |
|124513|LAAVA C2L5         |LAAVA   |C2L5    |        |
|123455|LAVA               |LAVA    |        |        |
+------+-------------------+--------+--------+--------+
Related
I'm trying to split comma separated values in a string column to individual values and count each individual value.
The data I have is formatted as such:
+--------------------+
|                tags|
+--------------------+
|cult, horror, got...|
|            violence|
|            romantic|
|inspiring, romant...|
|cruelty, murder, ...|
|romantic, queer, ...|
|gothic, cruelty, ...|
|mystery, suspense...|
|            violence|
|revenge, neo noir...|
+--------------------+
And I want the result to look like
+--------------------+-----+
|                tags|count|
+--------------------+-----+
|cult                |    4|
|horror              |   10|
|goth                |    4|
|violence            |   30|
...
The code I've tried that hasn't worked is below:
data.select('tags').groupby('tags').count().show(10)
I also used the countDistinct function, which also failed to work.
I feel like I need a function that separates the values by comma and then lists them, but I'm not sure how to do that.
You can use split() to split strings, then explode(). Finally, groupby and count:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
["cult,horror"],
["cult,comedy"],
["romantic,comedy"],
["thriler,horror,comedy"],
], schema=["tags"])
df = df \
.withColumn("tags", F.split("tags", pattern=",")) \
.withColumn("tags", F.explode("tags"))
df = df.groupBy("tags").count()
[Out]:
+--------+-----+
|tags    |count|
+--------+-----+
|romantic|1    |
|thriler |1    |
|horror  |2    |
|cult    |2    |
|comedy  |3    |
+--------+-----+
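Note that in the question the tags are separated by a comma followed by a space; a minimal sketch that also trims the split values, assuming the question's DataFrame is named data with a tags column:
import pyspark.sql.functions as F

#a minimal sketch, assuming a DataFrame named `data` whose tags column is comma-plus-space separated
tag_counts = (data
    .withColumn("tags", F.explode(F.split("tags", r",\s*")))
    .withColumn("tags", F.trim("tags"))
    .groupBy("tags")
    .count())
tag_counts.show(truncate=False)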
I have a dataframe of title and bin:
+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+
How do I count each bin value into its own column of a new dataframe using Pyspark? For instance:
+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+
Is this possible? Would someone please help me with this if you know how?
Group by bin and count, then pivot the column bin and rename the columns of the resulting dataframe if you want:
import pyspark.sql.functions as F
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])
df1.show()
#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+
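If the bin values are known up front, conditional aggregation is an alternative to pivot; a minimal sketch, assuming the bins are 1, 2 and 3:
import pyspark.sql.functions as F

#a minimal sketch, assuming the bin values 1, 2 and 3 are known in advance
df.agg(*[F.count(F.when(F.col("bin") == b, 1)).alias(f"count_bin{b}") for b in [1, 2, 3]]).show()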
I am new to Data Science and I am working on a simple self project using Google Colab. I took data from a something1.csv file and a something2.csv file.
df1 = spark.read.csv('something1.csv', header=True)
df2 = spark.read.csv('something2.csv', header=True)
The data of something1.csv looks like
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|    col1|   col2|
#+----------+---------+----------+--------+-------+
#|   Andorra|   42.506|    1.5218|       2|      1|
#| Australia|   -31.81|    115.86|       1|      6|
#|   Austria|   41.597|    12.580|       4|      9|
#|   Belgium|   21.782|     1.286|       2|      3|
#|     India|   78.389|    12.972|       1|      7|
#|   Ireland|    9.281|     9.286|       9|      8|
#|       USA|   69.371|    21.819|       7|      2|
#+----------+---------+----------+--------+-------+
The data of something2.csv looks like
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|    col1|   col2|
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|       2|      6|
#|   Belgium|   21.782|     1.286|       1|      6|
#|     India|   78.389|    12.972|       3|      5|
#|       USA|   69.371|    21.819|       2|      5|
#+----------+---------+----------+--------+-------+
Now I want to intersect df2 with df1 based on Longitude and Latitude and get the rows that are present in df1 along with col1 and col2 from df1. My table should look like
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|    col1|   col2|
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|       1|      6|
#|   Belgium|   21.782|     1.286|       2|      3|
#|     India|   78.389|    12.972|       1|      7|
#|       USA|   69.371|    21.819|       7|      2|
#+----------+---------+----------+--------+-------+
I tried using the following code but it didn't work.
new_df = df1.intersect(df2) #using intersect in pyspark, which gave me an empty table
Then I also tried based on Latitude and Longitude
new_df = df2.select('Latitude','Longitude').intersect(df1.select('Latitude','Logitude')) #intersecting based on columns
I tried both of the above methods in pyspark but they didn't work.
Intersect only gets rows that are common in both dataframes.
But in your case you need col1, col2 from df1 and the other columns from df2. Join the dataframes (left/inner as per your requirement) and select only col1, col2 from df1 and the other columns from df2.
(or) As mentioned in the comments by Mohammad Murtaza Hashmi, use a left_semi join.
Example:
#using left semi join
df1.join(df2,['Latitude','Longitude'],'left_semi').show()
#using left join
df2.alias("t2").join(df1.alias("t1"),['Latitude','Longitude'],'left').select("t2.country","t2.Latitude","t2.Longitude","t1.col1","t1.col2").show()
#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
Dynamic way:
#join columns
join_cols=[x for x in df1.columns if x.startswith("L")]
#selecting cols from t1
t1_cols=["t1."+x for x in df1.columns if x.startswith("col")]
#selecting cols from t2
t2_cols=["t2."+x for x in df2.columns if not x.startswith("col")]
df2.alias("t2").join(df1.alias("t1"),['Latitude','Longitude'],'inner').select(*t2_cols + t1_cols).show()
#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
I have a pyspark dataframe like the input data below. I would like to create a new column product1_num that parses the first numeric value in each record of the productname column into a new column. I have example output data below. I'm not sure what's available in pyspark as far as string splitting and regex matching go. Can anyone suggest how to do this with pyspark?
input data:
+------+-------------------+
|id    |productname        |
+------+-------------------+
|234832|EXTREME BERRY SAUCE|
|419836|BLUE KOSHER SAUCE  |
|350022|GUAVA (1G)         |
|123213|GUAVA 1G           |
+------+-------------------+
output:
+------+-------------------+-------------+
|id    |productname        |product1_num |
+------+-------------------+-------------+
|234832|EXTREME BERRY SAUCE|             |
|419836|BLUE KOSHER SAUCE  |             |
|350022|GUAVA (1G)         |1            |
|123213|GUAVA G5           |5            |
|125513|3GULA G5           |3            |
|127143|GUAVA G50          |50           |
|124513|LAAVA C2L5         |2            |
+------+-------------------+-------------+
You can use regexp_extract:
from pyspark.sql import functions as F
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)",1)).show()
+------+-------------------+------------+
|    id|        productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE|            |
|419836|  BLUE KOSHER SAUCE|            |
|350022|         GUAVA (1G)|           1|
|123213|           GUAVA G5|           5|
|125513|           3GULA G5|           3|
|127143|          GUAVA G50|          50|
|124513|         LAAVA C2L5|           2|
+------+-------------------+------------+
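If you want the result as a number rather than a string (with null where no digits are found), you can cast the extracted value; a minimal sketch, assuming the same df:
from pyspark.sql import functions as F

#a minimal sketch: cast the extracted digits to int; rows with no digits become null instead of ""
df.withColumn("product1_num", F.regexp_extract("productname", r"(\d+)", 1).cast("int")).show()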
My dataframe is this, and I want to split my data frame on the colon (:)
+------------------+
|Name:Roll_no:Class|
+------------------+
|      #ab:cd#:23:C|
|      #sd:ps#:34:A|
|      #ra:kh#:14:H|
|      #ku:pa#:36:S|
|      #ra:sh#:50:P|
+------------------+
and I want my dataframe to look like:
+-----+-------+-----+
| Name|Roll_no|Class|
+-----+-------+-----+
|ab:cd|     23|    C|
|sd:ps|     34|    A|
|ra:kh|     14|    H|
|ku:pa|     36|    S|
|ra:sh|     50|    P|
+-----+-------+-----+
If you need to split on the last 2 colons, use Series.str.rsplit, then set the columns by splitting the column name, and finally remove the first and last # by indexing:
col = 'Name:Roll_no:Class'
df1 = df[col].str.rsplit(':', n=2, expand=True)
df1.columns = col.split(':')
df1['Name'] = df1['Name'].str[1:-1]
#if you only want to strip a leading and trailing #
#df1['Name'] = df1['Name'].str.strip('#')
print (df1)
    Name Roll_no Class
0  ab:cd      23     C
1  sd:ps      34     A
2  ra:kh      14     H
3  ku:pa      36     S
4  ra:sh      50     P
Use read_csv() with sep=':' and quotechar='#':
str = """Name:Roll_no:Class
#ab:cd#:23:C
#sd:ps#:34:A
#ra:kh#:14:H
#ku:pa#:36:S
#ra:sh#:50:P"""
df = pd.read_csv(pd.io.common.StringIO(str), sep=':', quotechar='#')
>>> df
    Name Roll_no Class
0  ab:cd      23     C
1  sd:ps      34     A
2  ra:kh      14     H
3  ku:pa      36     S
4  ra:sh      50     P
This is how you could do this in pyspark:
Specify the separator and the quote on read
If you're reading the data from a file, you can use spark.read.csv with the following arguments:
df = spark.read.csv("path/to/file", sep=":", quote="#", header=True)
df.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+
Use Regular Expressions
If you're unable to change the way the data is read and you're starting with the DataFrame shown in the question, you can use regular expressions to get the desired output.
First get the new column names by splitting the existing column name on ":"
new_columns = df.columns[0].split(":")
print(new_columns)
#['Name', 'Roll_no', 'Class']
For the Name column, you need to extract the data between the #s. For the other two columns, you need to remove the strings between the #s (and the following ":") and use pyspark.sql.functions.split to extract the components:
from pyspark.sql.functions import regexp_extract, regexp_replace, split
df.withColumn(new_columns[0], regexp_extract(df.columns[0], r"(?<=#).+(?=#)", 0))\
.withColumn(new_columns[1], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[0])\
.withColumn(new_columns[2], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[1])\
.select(*new_columns)\
.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+
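An alternative sketch that avoids the intermediate regexp_replace by pulling all three fields out of a single pattern with regexp_extract; the pattern below assumes the Name is always wrapped in #s and that Roll_no and Class contain no colons:
from pyspark.sql.functions import regexp_extract

#a minimal sketch, assuming names are wrapped in #s and the last two fields contain no ":"
old_col = df.columns[0]
pattern = r"^#(.+)#:([^:]+):([^:]+)$"
df.select(
    regexp_extract(old_col, pattern, 1).alias("Name"),
    regexp_extract(old_col, pattern, 2).alias("Roll_no"),
    regexp_extract(old_col, pattern, 3).alias("Class"),
).show()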