I have a PySpark dataframe with a string column whose value is [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of struct.
Can I do that simply with a native Spark function, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
[ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
(2,'[{"AppId":"APACON","ExtId":"141793"}]'),
],
['idx','txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data:
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sqlContext.createDataFrame(
[ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
(2,'[{"AppId":"APACON","ExtId":"141793"}]'),
],
['idx','txt']
)
You can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
StructField("ExtId", StringType())])
df = df.withColumn('array',F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
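Note that the snippet above yields a single struct per row rather than an array of structs. If a newer Spark is available (an assumption about your environment; 2.4+ is enough), from_json also accepts an ArrayType schema, so a minimal sketch for a true array of structs could look like this:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# assumption: Spark 2.4+, where from_json accepts an ArrayType schema
array_schema = ArrayType(StructType([
    StructField("AppId", StringType()),
    StructField("ExtId", StringType())
]))

# 'array_of_structs' becomes array<struct<AppId:string,ExtId:string>>
df = df.withColumn('array_of_structs', F.from_json(F.col('txt'), array_schema))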
Version < Spark 2.1
One way to bypass the issue would be to first slightly modify your input string:
# Use regexp_extract to ignore square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show()
+---+-------------------------------------+-----------------------------------+
|idx|txt |txt_parsed |
+---+-------------------------------------+-----------------------------------+
|1 |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2 |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the txt_parsed column:
df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()
+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
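As a possible shortcut (an assumption here is that your Spark version's get_json_object supports array indexing in the JSON path), you could also index into the JSON array directly on the original txt column and skip the regexp step; a sketch:
import pyspark.sql.functions as F

# '$[0]' addresses the first (and only) element of the JSON array
df = df.withColumn('AppId', F.get_json_object('txt', '$[0].AppId')) \
       .withColumn('ExtId', F.get_json_object('txt', '$[0].ExtId'))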
Related
s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)
+--------------------+
| request|
+--------------------+
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
+--------------------+
I would like to get it as:
+--------------------+---------------+
| request| sub|
+--------------------+---------------+
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}| {'num':123}|
+--------------------+---------------+
I did write it as below, but it is throwing an error:
TypeError: Column is not iterable
df = df.withColumn("sub",substring(col('request'),locate('{',col('request')),length(col('request'))-locate('{',col('request'))))
df.show()
Can someone please help me?
You need to use the substring function in a SQL expression (expr) in order to pass columns for the position and length arguments. Note also that you need to add +1 to the length to get the correct result:
import pyspark.sql.functions as F
df = df.withColumn(
"json",
F.expr("substring(request, locate('{',request), length(request) - locate('{', request) + 1)")
)
df.show()
#+--------------------+---------------+
#| request| json|
#+--------------------+---------------+
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}| {'num':123}|
#+--------------------+---------------+
You could also consider using the regexp_extract function instead of substring, like this:
df = df.withColumn(
"json",
F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)
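For context on the original error: pyspark.sql.functions.substring only takes plain integers for the position and length, which is why passing Columns raised TypeError: Column is not iterable. A sketch of another workaround that avoids expr, assuming Column.substr (which does accept Column arguments) is available in your Spark version:
import pyspark.sql.functions as F

# Column.substr accepts Column arguments, unlike functions.substring
start = F.locate('{', F.col('request'))
df = df.withColumn('json', F.col('request').substr(start, F.length('request') - start + 1))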
I have some large (~150 GB) csv files using semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand, &amp;. The semicolon is getting picked up as a column separator, so I need a way to escape it or replace &amp; with & while loading the dataframe.
As an example, I have the following csv file:
ID;FirstName;LastName
1;Chandler;Bing
2;Ross &amp; Monica;Geller
I load it using the following notebook:
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test.csv')
df.show()
The result I'm getting is:
+---+---------+--------+
| ID|FirstName|LastName|
+---+---------+--------+
| 1| Chandler| Bing|
| 2|Ross &amp| Monica|
+---+---------+--------+
Whereas what I'm looking for is:
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+
I have tried using .option("escape", "&amp;") but that escaping only works on a single character.
Update
I have a hacky workaround using RDDs that works at least for small test files, but I'm still looking for a proper solution that escapes the string while loading the dataframe.
rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header","true").csv('/mnt/input/AMP test escaped.csv')
df.show()
You can do that with dataframes directly. It helps if you have at least one file that you know does not contain any &amp;, so you can retrieve the schema from it.
Let's assume such a file exists and its path is "valid.csv".
from pyspark.sql import functions as F
# I acquire a valid file without the &amp; corrupted data to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema
df = spark.read.text("/mnt/input/AMP test.csv")
# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)
# I replace "&amp;" with "&", and split the column
df = df.withColumn(
"value", F.regexp_replace(F.col("value"), "&", "&")
).withColumn(
"value", F.split("value", ";")
)
# I explode the array in several columns and add types based on schm defined previously
df = df.select(
*(
F.col("value").getItem(i).cast(col.dataType).alias(col.name)
for i, col in enumerate(schm)
)
)
Here is the result:
df.show()
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
I think there isn't a way to escape this multi-character sequence &amp; using only spark.read.csv, so the solution is like your "workaround":
rdd.map: this already replaces the value &amp; with & in all columns.
It is not necessary to save your RDD to a temporary path; just pass it as the csv parameter:
rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))
df = spark.read.csv(rdd, header=True, sep=";")
df.show()
+---+-------------+--------+
| ID| FirstName|LastName|
+---+-------------+--------+
| 1| Chandler| Bing|
| 2|Ross & Monica| Geller|
+---+-------------+--------+
I am using the below code snippets to extract a portion of a dataframe column.
df.withColumn("chargemonth",getBookedMonth1(df['chargedate']))
def getBookedMonth1(chargedate):
    booked_year = chargedate[0:3]
    booked_month = chargedate[5:7]
    return booked_year + "-" + booked_month
I have also used getBookedMonth for the same, but I am getting a null value for the new column chargemonth in both cases.
from pyspark.sql.functions import substring
def getBookedMonth(chargedate):
    booked_year = substring(chargedate, 1, 4)
    booked_month = substring(chargedate, 5, 6)
    return booked_year + "-" + booked_month
Is this the correct way to extract a substring from columns in PySpark?
Please DON'T use udf for this! UDFs are known for bad performance.
I would suggest you use Spark built-in functions to manipulate dates. Here is an example:
from pyspark.sql.functions import col, to_date, date_format

# DF sample
data = [(1, "2019-12-05"), (2, "2019-12-06"), (3, "2019-12-07")]
df = spark.createDataFrame(data, ["id", "chargedate"])
# format dates as 'yyyy-MM'
df.withColumn("chargemonth", date_format(to_date(col("chargedate")), "yyyy-MM")).show()
+---+----------+-----------+
| id|chargedate|chargemonth|
+---+----------+-----------+
| 1|2019-12-05| 2019-12|
| 2|2019-12-06| 2019-12|
| 3|2019-12-07| 2019-12|
+---+----------+-----------+
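If you specifically want the substring approach from the question without a UDF, a minimal sketch using the built-in substring and concat functions (same chargedate column as above, assuming the 'yyyy-MM-dd' format) could be:
from pyspark.sql.functions import col, substring, concat, lit

# year = characters 1-4, month = characters 6-7 of 'yyyy-MM-dd'
df.withColumn(
    "chargemonth",
    concat(substring(col("chargedate"), 1, 4), lit("-"), substring(col("chargedate"), 6, 2))
).show()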
You need to make the new function a PySpark UDF.
>>> from pyspark.sql.functions import udf
>>> data = [
... {"chargedate":"2019-01-01"},
... {"chargedate":"2019-02-01"},
... {"chargedate":"2019-03-01"},
... {"chargedate":"2019-04-01"}
... ]
>>>
>>> booked_month = udf(lambda a:"{0}-{1}".format(a[0:4], a[5:7]))
>>>
>>> df = spark.createDataFrame(data)
>>> df = df.withColumn("chargemonth",booked_month(df['chargedate'])).drop('chargedate')
>>> df.show()
+-----------+
|chargemonth|
+-----------+
| 2019-01|
| 2019-02|
| 2019-03|
| 2019-04|
+-----------+
>>>
withColumn is the right way to add a column, and drop is used to drop a column.
I am new to PySpark and I am figuring out how to cast a column to dict type and then flatten that column into multiple columns using explode.
Here's what my dataframe looks like:
col1   | col2
-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
test:1 | {"test1":[{"Id":"17","cName":"c1"},{"Id":"01","cName":"c2","pScore":0.003609}]},{"test8":[{"Id":"1","cName":"c11","pScore":0.0},{"Id":"012","cName":"c2","pScore":0.003609}]}
test:2 | {"test1:subtest2":[{"Id":"18","cName":"c13","pScore":0.00203}]}
Right now, the schema of this dataframe is
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
The output I am looking to have is like this:
col1 | col2 | Id | cName | pScore |
------------------------------------------------
test:1 | test1 | 17 | c1 | null |
test:1 | test1 | 01 | c2 | 0.003609|
test:1 | test8 | 1 | c11 | 0.0 |
test:1 | test8 | 012| c2 | 0.003609|
test:2 | test1:subtest2 | 18 | c13 | 0.00203 |
I am having trouble defining the right schema for col2 to cast its type from string to JSON or dict. I would then like to explode the values into multiple columns as shown above. Any help would be greatly appreciated. I am using Spark 2.0+.
Thank you!
Updating my answer: I used a udf to put the key into the array, then explode to reach the desired output.
See the example below:
import json
import re
import pyspark.sql.functions as f
from pyspark.shell import spark
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, DoubleType
df = spark.createDataFrame([
('test:1',
'{"test1":[{"Id":"17","cName":"c1"},{"Id":"01","cName":"c2","pScore":0.003609}]},'
'{"test8":[{"Id":"1","cName":"c11","pScore":0.0},{"Id":"012","cName":"c2","pScore":0.003609}]}'),
('test:2', '{"test1:subtest2":[{"Id":"18","cName":"c13","pScore":0.00203}]}')
], ['col1', 'col2'])
schema = ArrayType(
StructType(
[
StructField("Col", StringType()),
StructField("Id", StringType()),
StructField("cName", StringType()),
StructField("pScore", DoubleType())
]
)
)
@f.udf(returnType=schema)
def parse_col(column):
    updated_values = []
    # each match is one complete {"key":[...]} object
    for it in re.finditer(r'{.*?}]}', column):
        parse = json.loads(it.group())
        for key, values in parse.items():
            for value in values:
                value['Col'] = key
                updated_values.append(value)
    return updated_values
df = df \
.withColumn('tmp', parse_col(f.col('col2'))) \
.withColumn('tmp', f.explode(f.col('tmp'))) \
.select(f.col('col1'),
f.col('tmp').Col.alias('col2'),
f.col('tmp').Id.alias('Id'),
f.col('tmp').cName.alias('cName'),
f.col('tmp').pScore.alias('pScore'))
df.show()
Output:
+------+--------------+---+-----+--------+
| col1| col2| Id|cName| pScore|
+------+--------------+---+-----+--------+
|test:1| test1| 17| c1| null|
|test:1| test1| 01| c2|0.003609|
|test:1| test8| 1| c11| 0.0|
|test:1| test8|012| c2|0.003609|
|test:2|test1:subtest2| 18| c13| 0.00203|
+------+--------------+---+-----+--------+
Because the key name differs in each row's JSON, defining a general schema for the JSON is not going to work well; I believe it's better to handle this via UDFs:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import Row
import json
def extract_key(dumped_json):
    """
    Extracts the single key from the dumped json (as a string).
    """
    if dumped_json is None:
        return None
    d = json.loads(dumped_json)
    try:
        return list(d.keys())[0]
    except IndexError:
        return None


def extract_values(dumped_json):
    """
    Extracts the single array value from the dumped json and parses each element
    of the array as a spark Row.
    """
    if dumped_json is None:
        return None
    d = json.loads(dumped_json)
    try:
        return [Row(**_d) for _d in list(d.values())[0]]
    except IndexError:
        return None
# Definition of the output type of the `extract_values` function
output_values_type = t.ArrayType(t.StructType(
[t.StructField("Id", t.StringType()),
t.StructField("cName", t.StringType()),
t.StructField("pScore", t.DoubleType())]
))
# Define UDFs
extract_key_udf = f.udf(extract_key, t.StringType())
extract_values_udf = f.udf(extract_values, output_values_type)
# Extract values and keys
extracted_df = df.withColumn("values", extract_values_udf("col2")). \
withColumn("col2", extract_key_udf("col2"))
# Explode the array
exploded_df = extracted_df.withColumn("values", f.explode("values"))
# Select the wanted columns
final_df = exploded_df.select("col1", "col2", "values.Id", "values.cName",
"values.pScore")
The result is then as wanted:
+------+--------------+---+-----+--------+
|col1 |col2 |Id |cName|pScore |
+------+--------------+---+-----+--------+
|test:1|test1:subtest1|17 |c1 |0.002034|
|test:1|test1:subtest1|01 |c2 |0.003609|
|test:2|test1:subtest2|18 |c13 |0.00203 |
+------+--------------+---+-----+--------+
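For completeness, a UDF-free sketch of the same idea. This relies on two assumptions beyond the question: a recent Spark (3.x, where from_json accepts an array-of-maps schema) and that col2 always consists of one or more {...} objects separated by commas, so wrapping it in [ ] yields valid JSON:
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, ArrayType, StructType, StructField, StringType, DoubleType

entry = StructType([
    StructField("Id", StringType()),
    StructField("cName", StringType()),
    StructField("pScore", DoubleType())
])
# each row parses to an array of single-key maps, e.g. [{"test1": [...]}, {"test8": [...]}]
schema = ArrayType(MapType(StringType(), ArrayType(entry)))

result = (
    df
    .withColumn("parsed", F.from_json(F.concat(F.lit("["), F.col("col2"), F.lit("]")), schema))
    .select("col1", F.explode("parsed").alias("obj"))            # one row per {...} object
    .select("col1", F.explode("obj").alias("col2", "entries"))   # map -> key / value columns
    .select("col1", "col2", F.explode("entries").alias("e"))     # one row per entry struct
    .select("col1", "col2", "e.Id", "e.cName", "e.pScore")
)
result.show(truncate=False)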
I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf, col
values = [("u'['2','4','713']", 10), ("u'['12','245']", 20), ("u'['101','12']", 30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
|             list|  A|
+-----------------+---+
|u'['2','4','713']| 10|
|   u'['12','245']| 20|
|   u'['101','12']| 30|
+-----------------+---+
How can I convert the above dataframe so that each element in the list is a float and is within a proper list?
I tried the below one:
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same without a change.
I don't want to convert the dataframe to pandas or use collect as it is memory intensive.
If possible, please give me an optimal solution. I am using PySpark.
That's because you forgot to specify the return type of the udf:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
I can create the correct result in Python 3 with a little change in the definition of the function df_amp_conversion. You didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
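As a closing note, if the elements are needed as actual floats (per the original question), a minimal UDF-free sketch against the df defined just above (values like ['2','4','713']):
from pyspark.sql.functions import regexp_replace, split, col

# strip brackets, quotes and spaces, split on commas, then cast to floats
df_float = df.withColumn(
    "float_list",
    split(regexp_replace(col("list"), "[\\[\\]' ]", ""), ",").cast("array<float>")
)
df_float.show()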