Split & Map fields of array<string> in pyspark [duplicate] - python

This question already has answers here:
How to explode multiple columns of a dataframe in pyspark
(7 answers)
Closed last year.
I have a PySpark dataframe as below with 7 columns, of which 6 are of type array<string> and one is array<array<string>>.
Sample data is as below:
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|customer_id |equipment_id |type |language |country |lang_cnt_str |model_num |
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
|[18e644bb-4342-4c22-ab9b-a90fda50ad69, 70f0b998-3e4e-422d-b863-1f5f455c4883, 54a99992-5403-4946-b059-f71ec7ef2cca]|[1407c4a9-b075-4837-bada-690da10717cd, fc4632f3-302b-43cb-9245-ede2d1ac590f, 1407c4a9-b075-4837-bada-690da10717cd]|[comm, comm, vspec]|[cs, en-GB, pt-PT] |[[CZ], [PT], [PT]] |[(language = 'cs' AND country IS IN ('CZ')), (language = 'en-GB' AND country IS IN ('PT')), (language = 'pt-PT' AND country IS IN ('PT'))]|[1618832612617, 1618832612858, 1618832614027]|
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+-------------------+--------------------+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------+
I want to split and map every element of all columns. Below is the expected output.
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|customer_id |equipment_id |type |language |country |lang_cnt_str |model_num |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
|18e644bb-4342-4c22-ab9b-a90fda50ad69 |1407c4a9-b075-4837-bada-690da10717cd |comm |cs |[CZ] |(language = 'cs' AND country IS IN ('CZ')) |1618832612617 |
|70f0b998-3e4e-422d-b863-1f5f455c4883 |fc4632f3-302b-43cb-9245-ede2d1ac590f |comm |en-GB |[PT] |(language = 'en-GB' AND country IS IN ('PT')) |1618832612858 |
|54a99992-5403-4946-b059-f71ec7ef2cca |1407c4a9-b075-4837-bada-690da10717cd |vspec |pt-PT |[PT] |(language = 'pt-PT' AND country IS IN ('PT')) |1618832614027 |
+---------------------------------------+---------------------------------------+-------+-----------+------------+--------------------------------------------------+-------------------+
How can we achieve this in PySpark? Can someone please help me? Thanks in advance!

We exchanged a couple of comments above, and I think there's nothing special about the array<array<string>> column. So I'm posting this answer to show the solution described in How to explode multiple columns of a dataframe in pyspark:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    (['1', '2', '3'], [['1'], ['2'], ['3']])
], ['col1', 'col2'])

df = (df
      .withColumn('zipped', f.arrays_zip(f.col('col1'), f.col('col2')))
      .withColumn('unzipped', f.explode(f.col('zipped')))
      .select(f.col('unzipped.col1'),
              f.col('unzipped.col2'))
     )
df.show()
The input is:
+---------+---------------+
| col1| col2|
+---------+---------------+
|[1, 2, 3]|[[1], [2], [3]]|
+---------+---------------+
And the output is:
+----+----+
|col1|col2|
+----+----+
| 1| [1]|
| 2| [2]|
| 3| [3]|
+----+----+
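Applied to the seven array columns in the question, the same pattern would look roughly like this (a sketch, assuming Spark 2.4+ for arrays_zip and the column names shown in the sample output):
import pyspark.sql.functions as f

cols = ['customer_id', 'equipment_id', 'type', 'language',
        'country', 'lang_cnt_str', 'model_num']

result = (df
          .withColumn('zipped', f.arrays_zip(*[f.col(c) for c in cols]))
          .withColumn('unzipped', f.explode(f.col('zipped')))
          .select(*[f.col('unzipped.{}'.format(c)).alias(c) for c in cols])
         )
result.show(truncate=False)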


Combine values from multiple columns into one Pyspark Dataframe [duplicate]

This question already has answers here:
Concat multiple columns of a dataframe using pyspark
(1 answer)
Concatenate columns in Apache Spark DataFrame
(18 answers)
How to concatenate multiple columns in PySpark with a separator?
(2 answers)
Closed 2 years ago.
I have a pyspark dataframe that has fields:
"id",
"fields_0_type" ,
"fields_0_price",
"fields_1_type",
"fields_1_price"
+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+
How can I combine these values into two columns called "type" and "price", with the values separated by ","? The ideal dataframe looks like this:
+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+
Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.
Update:
Thanks, it works. But is there any way to get rid of the extra ","s? They are caused by blank values in the columns. Is there a way to simply skip the columns that have blank values?
What it is showing now:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+------------------------------------------------------------------+
How I want it:
+------------------------------------------------------------------+
|type |
+------------------------------------------------------------------+
|New,New,New |
|New,New,Sale,Sale,New|
+------------------------------------------------------------------+
Put all the columns into an array, then use the concat_ws function.
Example:
df.show()
#+----+-------------+-------------+-------------+
#| id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234| a| b| c|
#+----+-------------+-------------+-------------+
from pyspark.sql.functions import array, concat_ws

columns = df.columns
columns.remove('id')
df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#| id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+
UPDATE:
df.show()
#+----+-------------+--------------+-------------+--------------+
#| id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234| a| 45| b| 50|
#+----+-------------+--------------+-------------+--------------+
type_cols = [c for c in df.columns if 'type' in c]
price_cols = [c for c in df.columns if 'price' in c]

df.withColumn("type", concat_ws(",", array(*type_cols))).\
    withColumn("price", concat_ws(",", array(*price_cols))).\
    drop(*type_cols, *price_cols).\
    show()
#+----+----+-----+
#| id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
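For the follow-up about the extra commas: concat_ws already skips nulls, but not empty strings. One way to handle blanks (a sketch, assuming Spark 2.4+ so array_remove is available) is to drop the empty strings from the array before concatenating:
from pyspark.sql.functions import array, array_remove, concat_ws

type_cols = [c for c in df.columns if 'type' in c]

df.withColumn(
    "type",
    concat_ws(",", array_remove(array(*type_cols), ""))  # strip empty strings before joining
).drop(*type_cols).show(truncate=False)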

Difference between two DataFrames based on only one column in pyspark [duplicate]

This question already has an answer here:
Pyspark filter dataframe by columns of another dataframe
(1 answer)
Closed 4 years ago.
I am looking for a way to find the difference of two DataFrames based on one column. For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:
+----------+---+
|first name| id|
+----------+---+
| fa| 3|
| fb| 5|
| fc| 7|
+----------+---+
DataFrame B:
+---------+---+
|last name| id|
+---------+---+
| la| 3|
| lb| 10|
| lc| 13|
+---------+---+
My goal is to find the difference of DataFrame A and DataFrame B considering the column id; the output would be the following DataFrame:
+---------+---+
|last name| id|
+---------+---+
| lb| 10|
| lc| 13|
+---------+---+
I don't want to use the following method:
from pyspark.sql.functions import col

a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))
I'm looking for an efficient method (in terms of memory and speed) where I don't have to collect the ids (there can be billions of them), maybe something like the RDD subtractByKey, but for DataFrames.
PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.
You can do a left_anti join on column id:
df_b.join(df_a.select('id'), how='left_anti', on=['id']).show()
+---+---------+
| id|last name|
+---+---------+
| 10| lb|
| 13| lc|
+---+---------+
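If df_a's id column is small enough to fit on each executor, you could also add a broadcast hint so the anti join avoids shuffling df_b (a sketch, not required for the answer above):
from pyspark.sql.functions import broadcast

df_c = df_b.join(broadcast(df_a.select('id')), on=['id'], how='left_anti')
df_c.show()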

How to drop all columns with null values in a PySpark DataFrame?

I have a large dataset from which I would like to drop the columns that contain null values and return a new dataframe. How can I do that?
The following only drops a single column or rows containing null.
df.where(col("dt_mvmt").isNull())   # doesn't work: I don't have all the column names, and there can be 1000's of columns
df.filter(df.dt_mvmt.isNotNull())   # same reason as above
df.na.drop()                        # drops rows that contain null, instead of columns that contain null
For example:
a | b | c
1 |   | 0
2 | 2 | 3
In the above case it should drop the whole column b, because one of its values is empty.
Here is one possible approach for dropping all columns that contain NULL values (see here for the source of the code that counts NULL values per column):
import pandas as pd
import pyspark.sql.functions as F

# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
                   'x2': ['b', None, '2'],
                   'x3': ['c', '0', '3']})
df = sqlContext.createDataFrame(df)
df.show()

def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v > 0]
    df = df.drop(*to_drop)
    return df

# Drops column x2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
Hope this helps!
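If blank strings should count as missing too (as in the a | b | c example in the question), the same idea extends with one extra condition. A sketch, assuming the columns can be cast to string:
import pyspark.sql.functions as F

def drop_null_or_blank_columns(df):
    """Drop every column that contains at least one null or blank (empty/whitespace-only) value."""
    counts = df.select([
        F.count(F.when(F.col(c).isNull() |
                       (F.trim(F.col(c).cast('string')) == ''), c)).alias(c)
        for c in df.columns
    ]).collect()[0].asDict()
    return df.drop(*[c for c, n in counts.items() if n > 0])

drop_null_or_blank_columns(df).show()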
If we need to keep only the rows having at least one of the inspected columns not null, then use this. Execution time is very short.
import pyspark.sql.functions as F
from operator import or_
from functools import reduce

inspected = df.columns
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected), F.lit(False)))

How to update a pyspark dataframe with new values from another dataframe?

I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns compared to dataframe A, so a simple union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
import pyspark.sql.functions as f

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        'col_1',
        f.when(
            ~f.isnull(f.col('b.col_2')),
            f.col('b.col_2')
        ).otherwise(f.col('a.col_2')).alias('col_2'),
        'b.col_3'
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']

dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
    .select(
        *(
            ['col_1'] +
            [
                f.when(
                    ~f.isnull(f.col('b.{}'.format(c))),
                    f.col('b.{}'.format(c))
                ).otherwise(f.col('a.{}'.format(c))).alias(c)
                for c in cols_to_update
            ] +
            ['b.col_3']
        )
    )\
    .union(dfB)\
    .dropDuplicates()\
    .sort('col_1')\
    .show()
I would opt for a different solution, which I believe is less verbose, more generic and does not involve listing the columns. I would first identify the subset of dfA that will be updated (replaceDf) by performing an inner join based on keyCols (a list). Then I would subtract this replaceDf from dfA and union it with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB)
resultDf.show()
Even though there will be different columns in dfA and dfB, you can still overcome this by obtaining the list of columns from both DataFrames and taking their union. Then, instead of select('a.*'), I would prepare a select that lists the columns from dfA that exist in dfB, plus "null as colname" for those that do not exist in dfB. A sketch of this alignment is shown below.
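Here is that sketch, assuming keyCols and the dfA/dfB from the example above (the align helper and the names all_cols/types are just illustrative):
from pyspark.sql import functions as F

all_cols = list(dict.fromkeys(dfA.columns + dfB.columns))  # union of columns, order preserved
types = dict(dfA.dtypes + dfB.dtypes)

def align(df):
    # add the columns the frame is missing as typed nulls so subtract/union line up
    return df.select(*[
        F.col(c) if c in df.columns else F.lit(None).cast(types[c]).alias(c)
        for c in all_cols
    ])

dfA_aligned, dfB_aligned = align(dfA), align(dfB)

replaceDf = dfA_aligned.alias('a').join(dfB_aligned.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA_aligned.subtract(replaceDf).union(dfB_aligned)
resultDf.show()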
If you want to keep only unique values, and require strictly correct results, then a union followed by dropDuplicates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)

pyspark parse fixed width text file

Trying to parse a fixed width text file.
My text file looks like the following, and I need a row id, date, a string, and an integer:
00101292017you1234
00201302017 me5678
I can read the text file to an RDD using sc.textFile(path).
I can createDataFrame with a parsed RDD and a schema.
It's the parsing in between those two steps that I'm stuck on.
Spark's substr function can handle fixed-width columns, for example:
df = spark.read.text("/tmp/sample.txt")
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show()
will result in:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
Having split the columns, you can reformat and use them as in a normal Spark dataframe.
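If you specifically want to do the parsing between sc.textFile and createDataFrame, as described in the question, a sketch could look like this (slice positions taken from the two sample lines):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

rdd = sc.textFile("/tmp/sample.txt")

def parse(line):
    # slice the fixed-width fields out of each raw line
    return (line[0:3], line[3:11], line[11:14], int(line[14:18]))

schema = StructType([
    StructField("id", StringType()),
    StructField("date", StringType()),
    StructField("string", StringType()),
    StructField("integer", IntegerType()),
])

spark.createDataFrame(rdd.map(parse), schema).show()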
Someone asked how to do it based on a schema. Based on the above responses, here is a simple example:
x= ''' 1 123121234 joe
2 234234234jill
3 345345345jane
4abcde12345jack'''
schema = [
    ("id", 1, 5),
    ("ssn", 6, 10),
    ("name", 16, 4)
]

with open("personfixed.csv", "w") as f:
    f.write(x)

df = spark.read.text("personfixed.csv")
df.show()

df2 = df
for colinfo in schema:
    df2 = df2.withColumn(colinfo[0], df2.value.substr(colinfo[1], colinfo[2]))
df2.show()
Here is the output:
+-------------------+
| value|
+-------------------+
| 1 123121234 joe|
| 2 234234234jill|
| 3 345345345jane|
| 4abcde12345jack|
+-------------------+
+-------------------+-----+----------+----+
| value| id| ssn|name|
+-------------------+-----+----------+----+
| 1 123121234 joe| 1| 123121234| joe|
| 2 234234234jill| 2| 234234234|jill|
| 3 345345345jane| 3| 345345345|jane|
| 4abcde12345jack| 4|abcde12345|jack|
+-------------------+-----+----------+----+
Here is a one-liner for you:
from pyspark.sql.functions import trim

df = spark.read.text("/folder/file.txt")
df = df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['idx'], col_idx[x]['len'])).alias(x), col_idx))
where col_idx is something like this:
col_idx = {'col1': {'idx': 1, 'len': 2}, 'col2': {'idx': 3, 'len': 1}}
It's practical when you have a lot of columns, and it is also more efficient to use select than multiple withColumn calls (see The hidden cost of Spark withColumn).
df = spark.read.text("fixedwidth")
df.withColumn("id",df.value.substr(1,5)).withColumn("name",df.value.substr(6,11)).drop('value').show()
The result is:
+-----+------+
| id| name|
+-----+------+
|23465|ramasg|
|54334|hjsgfd|
|87687|dgftre|
|45365|ghfduh|
+-----+------+
