PySpark: Get first character of each word in string - python

For an assignment I have been asked to shorten clients' names to only the first letter of each name, where the names are separated by a space character.
I found a lot of solutions for this in Python, but I am not able to translate them to a dataframe.
The DF looks like this:
| ID | Name |
| -------- | -------------- |
| 1 | John Doe |
| 2 | Roy Lee Winters|
| 3 | Mary-Kate Baron|
My desired output would be:
| ID | Name | Shortened_name|
| -------- | -------- | -------------- |
| 1 | John Doe | JD |
| 2 | Roy Lee Winters | RLW |
| 3 | Mary-Kate Baron | MB |
I've had some success with the code below, but it does not work when there are more than 2 names. I would also like to have more 'flexible' code, as some people have 4 or 5 names while others only have 1.
df.withColumn("col1", F.substring(F.split(F.col("Name"), " ").getItem(0), 1, 1))\
.withColumn("col2", F.substring(F.split(F.col("Name"), " ").getItem(1), 1, 1))\
.withColumn('Shortened_name', F.concat('col1', 'col2'))

You can split the Name column, then use the transform function on the resulting array to get the first letter of each element:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, "John Doe"), (2, "Roy Lee Winters"), (3, "Mary-Kate Baron")], ["ID", "Name"])
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(F.expr("transform(split(Name, ' '), x -> left(x, 1))"), "")
)
df1.show()
# +---+---------------+--------------+
# | ID| Name|Shortened_name|
# +---+---------------+--------------+
# | 1| John Doe| JD|
# | 2|Roy Lee Winters| RLW|
# | 3|Mary-Kate Baron| MB|
# +---+---------------+--------------+
Or by using the aggregate function:
df1 = df.withColumn(
    "Shortened_name",
    F.expr("aggregate(split(Name, ' '), '', (acc, x) -> acc || left(x, 1))")
)
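If you prefer to stay in the DataFrame API rather than SQL expressions, the same logic can be written with F.transform and F.substring. This is a minimal sketch, assuming PySpark 3.1+ where F.transform accepts a Python lambda:
from pyspark.sql import functions as F

# Take the first character of each word, then join them back together
df1 = df.withColumn(
    "Shortened_name",
    F.array_join(
        F.transform(F.split("Name", " "), lambda x: F.substring(x, 1, 1)),
        ""
    )
)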

Related

Check if PySpark column values exist in another dataframe's column values

I'm trying to figure out the condition to check if the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare again. I was thinking of doing multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check if the value of df1.id exists in df2.id and, if true, return df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then, using functools.reduce, you can construct a boolean expression that checks whether the columns match in the two dataframes:
from functools import reduce
check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
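If there is only one value column to compare, the reduce step can be skipped and the comparison written directly against the joined dataframe. A minimal sketch:
from pyspark.sql import functions as F

# Compare the single 'value' column; ids missing from df2 yield null here,
# so wrap in F.coalesce(..., F.lit(0)) if you need 0 instead of null.
df3 = df.withColumn("match", (F.col("value") == F.col("df2_value")).cast("int"))
df3.show()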

Combining Two Pandas Dataframes with the Same Columns into One String Column

I have two Pandas dataframes. Table 1:
+-------+-------------------+
| Name  | Class             |
+-------+-------------------+
| Alice | Physics           |
| Bob   | "" (Empty string) |
+-------+-------------------+
Table 2:
+-------+-----------+
| Name  | Class     |
+-------+-----------+
| Alice | Chemistry |
| Bob   | Math      |
+-------+-----------+
Is there a way to combine it easily on the column Class so the resulting table is like:
+-------+--------------------+
| Name  | Class              |
+-------+--------------------+
| Alice | Physics, Chemistry |
| Bob   | Math               |
+-------+--------------------+
I also want to make sure there are no extra commas when adding columns. Thanks!
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'],
                   'Class': ['Physics', np.nan]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'],
                    'Class': ['Chemistry', 'Math']})
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df3 = pd.concat([df, df2]).dropna(subset=['Class']).groupby('Name')['Class'].apply(list).reset_index()
# to remove the list, join it into a single string
df3['Class'] = df3['Class'].apply(lambda x: ', '.join(x))
Try with concat and groupby:
>>> pd.concat([df1, df2]).groupby("Name").agg(lambda x: ", ".join(i for i in x.tolist() if len(i.strip())>0)).reset_index()
Name Class
Alice Physics, Chemistry
Bob Math
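If the missing classes can appear as either empty strings or NaN (both show up in the snippets above), one way to cover both cases is to normalise before grouping. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Class': ['Physics', '']})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Class': ['Chemistry', 'Math']})

combined = (
    pd.concat([df, df2])
      .assign(Class=lambda d: d['Class'].replace('', np.nan))  # treat empty strings as missing
      .dropna(subset=['Class'])
      .groupby('Name')['Class']
      .agg(', '.join)
      .reset_index()
)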

Creating a JSON struct from available rows after Group By in PySpark

I would like to group by customer_id, so that I can collect the key/value pair of field_name and field_value as a JSON struct. So, for example, I have a dataframe like this:
customerID | field_name | field_value
-------------------------------------
A | age | 20
A | sex | M
A | country | US
B | country | US
B | age | 34
c | sex | F
All columns in the DataFrame have a String data type. What I want is this:
customerID | custom_attributes
-------------------------------------
A | {'age':'20', 'sex': 'M', 'country':'US'}
B | {'age':'34', 'country':'US'}
c | {'sex':'F'}
This is what I tried:
from pyspark.sql.functions import col, collect_list, struct

test = (data
    .groupBy('customer_id')
    .agg(collect_list(struct(col('field_name'), col('field_value'))).alias('custom_attributes'))
)
But this only gets me as far as an array that I don't know how to flatten:
customer_id | custom_attributes
--------------------------------
A | [{'field_name':'sex', 'field_value':'M'},
| {'field_name':'age', 'field_value':'34'},
| {'field_name':'country', 'field_value':'US'}]
You need to do a pivot here:
import pyspark.sql.functions as F

df2 = (df.groupBy('customerID')
    .pivot('field_name')
    .agg(F.first('field_value'))
    .select('customerID', F.to_json(F.struct('age', 'country', 'sex')).alias('custom_attributes'))
    .orderBy('customerID')
)
df2.show(truncate=False)
+----------+-------------------------------------+
|customerID|custom_attributes |
+----------+-------------------------------------+
|A |{"age":"20","country":"US","sex":"M"}|
|B |{"age":"34","country":"US"} |
|c |{"sex":"F"} |
+----------+-------------------------------------+
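Alternatively, if you don't want to hard-code the field names in the struct, the array of (field_name, field_value) structs you already collected can be turned into a map and serialised directly. A sketch, assuming PySpark 2.4+ for map_from_entries:
import pyspark.sql.functions as F

# Build a map from the collected (field_name, field_value) pairs, then serialise it to JSON
df2 = (df.groupBy('customerID')
    .agg(F.to_json(
        F.map_from_entries(
            F.collect_list(F.struct('field_name', 'field_value'))
        )).alias('custom_attributes'))
)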

Writing to excel file based on column value & string naming

The DF looks something like this and extends for thousands of rows (i.e. every combination of 'Type' & 'Name' is possible):
| total | big | med | small| Type | Name |
|:-----:|:-----:|:-----:|:----:|:--------:|:--------:|
| 5 | 4 | 0 | 1 | Pig | John |
| 6 | 0 | 3 | 3 | Horse | Mike |
| 5 | 2 | 3 | 0 | Cow | Rick |
| 5 | 2 | 3 | 0 | Horse | Rick |
| 5 | 2 | 3 | 0 | Cow | John |
| 5 | 2 | 3 | 0 | Pig | Mike |
I would like to write code that writes files to excel based on the 'Type' column value. In the example above there are 3 different "Types" so I'd like one file for Pig, one for Horse, one for Cow respectively.
I have been able to do this using two columns, but for some reason have not been able to do it with just one. See the code below.
for idx, df in data.groupby(['Type', 'Name']):
    table_1 = function_1(df)
    table_2 = function_2(df)
    with pd.ExcelWriter(f"{'STRING1' + '_' + ('_'.join(idx)) + '_' + 'STRING2'}.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)
Current result is:
STRING1_Pig_John_STRING2.xlsx (all the rows that have Pig and John)
What I would like is:
STRING1_Pig_STRING2.xlsx (all the rows that have Pig)
Do you have anything against boolean indexing? If not:
vals = df['Type'].unique().tolist()
with pd.ExcelWriter("blah.xlsx") as writer:
    for val in vals:
        ix = df[df['Type'] == val].index
        df.loc[ix].to_excel(writer, sheet_name=str(val), index=False)
EDIT:
If you want to stick to groupby, that would be :
with pd.ExcelWriter("blah.xlsx") as writer:
    for idx, df in data.groupby(['Type']):
        val = list(set(df.Type))[0]
        df.to_excel(writer, sheet_name=str(val), index=False)
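Note that both versions above write one sheet per Type into a single workbook. If you really want one file per Type with the naming scheme from the question (STRING1_<Type>_STRING2.xlsx), the original loop only needs to group on 'Type' alone. A sketch, where function_1, function_2 and the STRING placeholders are the asker's own names:
import pandas as pd

# One workbook per Type, named like the desired output in the question
for type_val, sub_df in data.groupby('Type'):
    table_1 = function_1(sub_df)   # function_1/function_2 are the asker's own helpers
    table_2 = function_2(sub_df)
    with pd.ExcelWriter(f"STRING1_{type_val}_STRING2.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)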

Concat multiple string rows for each unique ID by a particular order

I want to create a table where each row is a unique ID and the Place and City columns consist of all the places and cities a person visited, ordered by the date of visit, using either PySpark or Hive.
df.groupby("ID").agg(F.concat_ws("|",F.collect_list("Place")))
does the concatenation but I am unable to order it by the date. Also for each column I need to keep doing this step separately.
I also tried using a window function as mentioned in this post (collect_list by preserving order based on another variable),
but it throws an error: java.lang.UnsupportedOperationException: 'collect_list(' is not supported in a window operation.
I want to:
1- order the concatenated columns in order of the date travelled
2- do this step for multiple columns
Data
| ID | Date | Place | City |
| -- | ---- | ----- | ---- |
| 1  | 2017 | UK    | Birm |
| 2  | 2014 | US    | LA   |
| 1  | 2018 | SIN   | Sin  |
| 1  | 2019 | MAL   | KL   |
| 2  | 2015 | US    | SF   |
| 3  | 2019 | UK    | Lon  |
Expected
| ID | Place      | City        |
| -- | ---------- | ----------- |
| 1  | UK,SIN,MAL | Birm,Sin,KL |
| 2  | US,US      | LA,SF       |
| 3  | UK         | Lon         |
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('ID').orderBy('Date')
# Input dataframe
>>> df.show()
+---+----+-----+----+
| ID|Date|Place|City|
+---+----+-----+----+
| 1|2017| UK|Birm|
| 2|2014| US| LA|
| 1|2018| SIN| Sin|
| 1|2019| MAL| KL|
| 2|2015| US| SF|
| 3|2019| UK| Lon|
+---+----+-----+----+
>>> df2 = df.withColumn("Place", F.collect_list("Place").over(w)) \
...         .withColumn("City", F.collect_list("City").over(w)) \
...         .groupBy("ID") \
...         .agg(F.max("Place").alias("Place"), F.max("City").alias("City"))
# Values collected as lists
>>> df2.show()
+---+--------------+---------------+
| ID| Place| City|
+---+--------------+---------------+
| 3| [UK]| [Lon]|
| 1|[UK, SIN, MAL]|[Birm, Sin, KL]|
| 2| [US, US]| [LA, SF]|
+---+--------------+---------------+
# If you want the values as strings
>>> df2.withColumn("Place", F.concat_ws(" ", "Place")).withColumn("City", F.concat_ws(" ", "City")).show()
+---+----------+-----------+
| ID| Place| City|
+---+----------+-----------+
| 3| UK| Lon|
| 1|UK SIN MAL|Birm Sin KL|
| 2| US US| LA SF|
+---+----------+-----------+
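A window-free alternative is to collect (Date, value) structs, sort them by date, and pull the field back out, which also lets you use the comma delimiter from the expected output. A sketch, assuming Spark 2.4+ where field access on an array of structs returns an array:
from pyspark.sql import functions as F

# Collect (Date, Place) and (Date, City) structs, sort each array by Date,
# then extract the second field and join with commas
df2 = (df.groupBy("ID")
    .agg(F.sort_array(F.collect_list(F.struct("Date", "Place"))).alias("p"),
         F.sort_array(F.collect_list(F.struct("Date", "City"))).alias("c"))
    .select("ID",
            F.concat_ws(",", F.col("p.Place")).alias("Place"),
            F.concat_ws(",", F.col("c.City")).alias("City")))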
