How to Remove / Replace Character from PySpark List - python

I am very new to Python/PySpark and currently using it with Databricks.
I have the following list
dummyJson = [
    ('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
    ('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]
When I try to parallelize it with
jsonRDD = sc.parallelize(dummyJson)
and then load it into a dataframe with
spark.read.json(jsonRDD)
the JSON is not parsed correctly. The resulting dataframe has a single column with _corrupt_record as the header.
Looking at the elements in dummyJson, there seems to be an extra, unnecessary comma just before the closing parenthesis of each element/record.
How can I remove this comma from each element of this list?
Thanks

If you can fix the input format at the source, that would be ideal.
But for your given case, you can fix it by taking the JSON strings out of the tuples: the trailing comma just makes each element a one-item tuple, and spark.read.json expects an RDD of JSON strings, not tuples.
>>> dJson = [i[0] for i in dummyJson]
>>> jsonRDD = sc.parallelize(dJson)
>>> jsonDF = spark.read.json(jsonRDD)
>>> jsonDF.show()
+----+--------------------+
|name| object|
+----+--------------------+
| leo|[191.168.192.96, ...|
|anne|[191.168.192.103,...|
+----+--------------------+
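As a side note, the trailing comma isn't corrupting the JSON itself; it only makes each list element a one-item tuple, which is a perfectly valid row for createDataFrame. If you prefer to keep the tuples, a minimal sketch (assuming Spark 2.1+ for from_json and a hand-written schema matching the sample records) could let Spark parse the JSON column instead:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical schema matching the sample records
schema = StructType([
    StructField("name", StringType()),
    StructField("object", ArrayType(StringType())),
])

rawDF = spark.createDataFrame(dummyJson, ["json_str"])   # one-item tuples become single-column rows
parsedDF = rawDF.select(F.from_json("json_str", schema).alias("j")).select("j.*")
parsedDF.show()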

Related

Turning object column into JSON column

I have an object column in a dataframe with this data structure:
{"sku":"AHG5289"}, {"sku":"MCPV443"}, {"sku":"KBP2646"}, {"sku":"KCB2677"}, {"sku":"OR6344"}, {"sku":"WFM5449"}, {"sku":"TCM3322"}, {"sku":"ADE5357"}, {"sku":"MCP6412"}
And I'm hoping to convert it so that it becomes a proper JSON formatted column with this structure:
[{"sku":"AHG5289"}, {"sku":"MCPV443"}, {"sku":"KBP2646"}, {"sku":"KCB2677"}, {"sku":"OR6344"}, {"sku":"WFM5449"}, {"sku":"TCM3322"}, {"sku":"ADE5357"}, {"sku":"MCP6412"}]
How can I accomplish this?
Edit: I have tried to_json(orient="records") but it adds a bunch of weird backslashes and quotation marks, so that it looks like this:
["{\"sku\":\"AHG5289\"}, ..."]
I think you can add [ and ] to the start and end of each row in the column, and use json.loads for each row:
import json
df['col'] = ('[' + df['col'] + ']').apply(json.loads)
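For illustration, a minimal self-contained pandas sketch of the same idea (the column name col and the sample values are assumptions):
import json
import pandas as pd

# Hypothetical sample: each row is a string of comma-separated JSON objects
df = pd.DataFrame({"col": ['{"sku":"AHG5289"}, {"sku":"MCPV443"}']})

# Wrap each row in [ ... ] so it becomes a valid JSON array, then parse it
df["col"] = ("[" + df["col"] + "]").apply(json.loads)
print(df["col"].iloc[0])   # [{'sku': 'AHG5289'}, {'sku': 'MCPV443'}]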

Pyspark: Regex_replace commas between quotes

I'm struggling with regexp_replace in Pyspark. I have the following string column:
"1233455666, 'ThisIsMyAdress, 1234AB', 24234234234"
A better overview of the string:
Id         | Address                  | Code
1233455666 | 'ThisIsMyAdress, 1234AB' | 24234234234
The full string that I receive and process is comma separated, like the example at the beginning. Unfortunately I can't change the format of the delivered data. To handle the data properly I want to replace the comma between the quotes with nothing.
The only requirement is using regexp_replace.
I've tried the code below, and many variations of it, but with this code the comma separation breaks as well: the string becomes one big string with all commas removed.
.withColumn("ColCommasRemoved" , regexp_replace( col("X"), "[,]", ""))
which gave me this output:
"1233455666 'ThisIsMyAdress 1234AB' 24234234234"
The output what I want to achieve:
"1233455666, 'ThisIsMyAdress 1234AB', 24234234234"
Using regexp_replace:
from pyspark.sql import functions as F
df = spark.createDataFrame([("1233455666, 'ThisIsMyAdress, 1234AB', 24234234234",)], ["X"])
result = df.withColumn(
    "ColCommasRemoved",
    F.split(F.regexp_replace("X", ",(?=[^']*'[^']*(?:'[^']*'[^']*)*$)", ""), ",")
).select(
    F.col("ColCommasRemoved")[0].alias("ID"),
    F.col("ColCommasRemoved")[1].alias("Address"),
    F.col("ColCommasRemoved")[2].alias("Code")
)
result.show(truncate=False)
#+----------+------------------------+------------+
#|ID |Address |Code |
#+----------+------------------------+------------+
#|1233455666| 'ThisIsMyAdress 1234AB'| 24234234234|
#+----------+------------------------+------------+
Or, if you want to split the original column directly by , and ignore the commas inside the quotes:
result = df.withColumn(
    "split",
    F.split(F.col("X"), ",(?=(?:[^']*'[^']*')*[^']*$)")
)
result.show(truncate=False)
#+-------------------------------------------------+-----------------------------------------------------+
#|X |split |
#+-------------------------------------------------+-----------------------------------------------------+
#|1233455666, 'ThisIsMyAdress, 1234AB', 24234234234|[1233455666, 'ThisIsMyAdress, 1234AB', 24234234234]|
#+-------------------------------------------------+-----------------------------------------------------+
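If you go with the second variant, a small follow-up sketch (using the column names from above) could pull the three fields out of the array and trim the leading spaces left by the split:
from pyspark.sql import functions as F

parsed = result.select(
    F.trim(F.col("split")[0]).alias("Id"),
    F.trim(F.col("split")[1]).alias("Address"),
    F.trim(F.col("split")[2]).alias("Code"),
)
parsed.show(truncate=False)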

PySpark groupby elements with key of their occurence

I have this data in a DATAFRAME:
id,col
65475383,acacia
63975914,acacia
65475383,excelsa
63975914,better
I want to have a dictionary that will contain the column "word" and every id that is associated with it, something like this:
word:key
acacia: 65475383,63975914
excelsa: 65475383
better: 63975914
I tried groupBy, but that is a way to aggregate data. How should I approach this problem?
I'm not sure if you intend to have the result as a Python dictionary or as a Dataframe (it is not clear from your question).
However, if you do want a Dataframe then one way to calculate that is:
from pyspark.sql.functions import collect_list
idsByWords = df \
    .groupBy("col") \
    .agg(collect_list("id").alias("ids")) \
    .withColumnRenamed("col", "word")
This will result in:
idsByWords.show(truncate=False)
+-------+--------------------+
|word |ids |
+-------+--------------------+
|excelsa|[65475383] |
|better |[63975914] |
|acacia |[65475383, 63975914]|
+-------+--------------------+
Then you can turn that dataframe into a Python dictionary:
d = {r["word"]: r["ids"] for r in idsByWords.collect()}
To finally get:
{
    'excelsa': [65475383],
    'better': [63975914],
    'acacia': [65475383, 63975914]
}
Note that collect may crash your driver program if the collected result exceeds your driver memory.
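For reference, a minimal sketch of how the sample dataframe used above could be built (column names taken from the question):
df = spark.createDataFrame(
    [(65475383, "acacia"), (63975914, "acacia"), (65475383, "excelsa"), (63975914, "better")],
    ["id", "col"],
)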

Store string in a column as nested JSON to a JSON file - Pyspark

I have a PySpark dataframe; this is what it looks like:
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid |Timestamp |updated |member_id |easy_id |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|
I transformed the above dataframe into this:
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|
using the following code:
ll = ['member_uuid', 'member_id', 'easy_id', 'field']
df = df.withColumn('timestamp', col('Timestamp')).withColumn('attribute', lit('profile')).withColumn('operation', lit(col_name)) \
    .withColumn('field', col('updated')).withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')
I have to save this dataframe df to a text file after converting it to JSON.
I tried using the following code to do the same,
df_final.toJSON().coalesce(1).saveAsTextFile('file')
The file contains,
{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}
I want it to be saved in this format:
{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}
to_json saves the value in the params column as a string. Is there a way to keep the JSON structure here so I can save it as the desired output?
Don't use to_json to create the params column in the dataframe.
The trick here is to just create a struct and write it to the file (using .saveAsTextFile() or .write.json()); Spark will create the JSON for the struct field.
If we have already created a JSON string and then write it in JSON format, Spark will add \ to escape the quotes that already exist in the JSON string.
Example:
from pyspark.sql.functions import *

# sample data
df = spark.createDataFrame(
    [("027130fe-584d-4d8e-9fb0-b87c984a0c20", "2020-02-11 19:15:32", "password_hash",
      "ajuypjtnlzmk4na047cgav27jma6_STG", "993269700")],
    ["member_uuid", "Timestamp", "updated", "member_id", "easy_id"])
df1 = df.withColumn("attribute", lit("profile")).withColumn("operation", lit("UPDATE"))

df1.selectExpr("struct(member_uuid,member_id,easy_id) as params", "attribute", "operation", "timestamp") \
    .write.format("json").mode("overwrite").save("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}

df1.selectExpr("struct(member_uuid,member_id,easy_id) as params", "attribute", "operation", "timestamp") \
    .toJSON().saveAsTextFile("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
A simple way to handle it is to just do a replace operation on the file
sourceData = open('file').read().replace('"{', '{').replace('}"', '}').replace('\\', '')
with open('file', 'w') as final:
    final.write(sourceData)
This might not be what you are looking for, but will achieve the end result.
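If the dataframe already has params as a JSON string (as in df_final above), another option, sketched here under the assumption that you know the schema of params, is to parse it back into a struct with from_json before writing, so no post-processing of the output file is needed:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for the params JSON string
params_schema = StructType([
    StructField("member_uuid", StringType()),
    StructField("member_id", StringType()),
    StructField("easy_id", LongType()),
    StructField("field", StringType()),
])

df_struct = df_final.withColumn("params", F.from_json("params", params_schema))
df_struct.write.mode("overwrite").json("<path>")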

Splitting a dictionary in a Pyspark dataframe into individual columns

I have a dataframe (in Pyspark) that has one of the row values as a dictionary:
df.show()
And it looks like:
+----+---+-----------------------------+
|name|age|info |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda} |
+----+---+-----------------------------+
Based on the comments, to give more detail:
df.printSchema()
The types are strings
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- dict: string (nullable = true)
Is it possible to take the keys from the dictionary (color and car) and make them columns in the dataframe, and have the values be the rows for those columns?
Expected Result:
+----+---+-----+----------+
|name|age|color|car       |
+----+---+-----+----------+
|rob |26 |red  |volkswagen|
|evan|25 |blue |mazda     |
+----+---+-----+----------+
Do I need to use df.withColumn() and somehow iterate through the dictionary to pick each key and then make a column out of it? I've tried to find some answers so far, but most were using Pandas rather than Spark, so I'm not sure if I can apply the same logic.
Your strings:
"{color: red, car: volkswagen}"
"{color: blue, car: mazda}"
are not in a Python-friendly format. They can't be parsed using json.loads, nor can they be evaluated using ast.literal_eval.
However, if you knew the keys ahead of time and can assume that the strings are always in this format, you should be able to use pyspark.sql.functions.regexp_extract:
For example:
from pyspark.sql.functions import regexp_extract
df.withColumn("color", regexp_extract("info", "(?<=color: )\w+(?=(,|}))", 0))\
.withColumn("car", regexp_extract("info", "(?<=car: )\w+(?=(,|}))", 0))\
.show(truncate=False)
#+----+---+-----------------------------+-----+----------+
#|name|age|info |color|car |
#+----+---+-----------------------------+-----+----------+
#|rob |26 |{color: red, car: volkswagen}|red |volkswagen|
#|evan|25 |{color: blue, car: mazda} |blue |mazda |
#+----+---+-----------------------------+-----+----------+
The pattern is:
(?<=color: ): A positive look-behind for the literal string "color: "
\w+: One or more word characters
(?=(,|})): A positive look-ahead for either a literal comma or close curly brace.
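As a quick sanity check outside Spark, the same look-around pattern can be tried with Python's re module (illustrative only; Spark uses Java regex, but the look-around syntax used here is the same):
import re

pat = r"(?<=color: )\w+(?=(,|}))"
print(re.search(pat, "{color: red, car: volkswagen}").group(0))   # red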
Here is how to generalize this for more than two keys, and handle the case where the key does not exist in the string.
from pyspark.sql.functions import regexp_extract, when, col
from functools import reduce
keys = ["color", "car", "year"]
pat = "(?<=%s: )\w+(?=(,|}))"
df = reduce(
    lambda df, c: df.withColumn(
        c,
        when(
            col("info").rlike(pat % c),
            regexp_extract("info", pat % c, 0)
        )
    ),
    keys,
    df
)
df.drop("info").show(truncate=False)
#+----+---+-----+----------+----+
#|name|age|color|car |year|
#+----+---+-----+----------+----+
#|rob |26 |red |volkswagen|null|
#|evan|25 |blue |mazda |null|
#+----+---+-----+----------+----+
In this case, we use pyspark.sql.functions.when and pyspark.sql.Column.rlike to test whether the string contains the pattern before we try to extract the match.
If you don't know the keys ahead of time, you'll either have to write your own parser or try to modify the data upstream.
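If you do end up writing your own parser, one possible sketch (an assumption, not tested against every edge case) is a UDF that turns the string into a MapType column, so the keys don't have to be hard-coded:
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

@F.udf(MapType(StringType(), StringType()))
def parse_info(s):
    # Parse strings like "{color: red, car: volkswagen}" into a dict
    if s is None:
        return None
    pairs = [p.split(":", 1) for p in s.strip("{} ").split(",") if ":" in p]
    return {k.strip(): v.strip() for k, v in pairs}

df_map = df.withColumn("info_map", parse_info("info"))
df_map.select(
    "name", "age",
    F.col("info_map")["color"].alias("color"),
    F.col("info_map")["car"].alias("car"),
).show()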
As you can see from the printSchema output, Spark understands your dictionary as a string. The function that slices a string and creates new columns is split(), so a simple solution to this problem could be to:
Create a UDF that converts the dictionary string into a comma-separated string (removing the keys from the dictionary but keeping the order of the values).
Then apply split() and create two new columns from the new format of our dictionary.
The code:
from pyspark.sql.functions import udf, split

@udf()
def transform_dict(dict_str):
    str_of_dict_values = dict_str.\
        replace("}", "").\
        replace("{", "").\
        replace("color:", "").\
        replace(" car: ", "").\
        strip()
    # output example: 'red,volkswagen'
    return str_of_dict_values

# Create a new column with our UDF, holding the dict values converted to a str
df = df.withColumn('info_clean', transform_dict("info"))
# Split these values and store them in a tmp variable
split_col = split(df['info_clean'], ',')
# Create new columns with the split values
df = df.withColumn('color', split_col.getItem(0))
df = df.withColumn('car', split_col.getItem(1))
This solution is only correct if we assume that the dictionary elements always come in the same order and that the keys are fixed.
For other, more complex cases we could build a dictionary inside the UDF and form the string of values by explicitly looking up each of the dictionary keys, which would ensure that the order of the output string is maintained.
I feel the most scalable solution is the following one, passing the general keys through a lambda function:
from pyspark.sql.functions import explode, map_keys, col

keysDF = df.select(explode(map_keys(df.info))).distinct()
keysList = keysDF.rdd.map(lambda x: x[0]).collect()
keyCols = list(map(lambda x: col("info").getItem(x).alias(str(x)), keysList))
df.select(df.name, df.age, *keyCols).show()
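One caveat with the snippet above: map_keys only works if info is already a MapType column, while in the question it is a plain string. A possible preprocessing sketch (assuming Spark's built-in str_to_map SQL function and the exact ", " / ": " delimiters from the sample data) would be:
from pyspark.sql import functions as F

# Strip the surrounding braces, then let str_to_map build a map<string,string> column
df = df.withColumn(
    "info",
    F.expr("str_to_map(regexp_replace(info, '[{}]', ''), ', ', ': ')")
)
df.printSchema()   # info: map<string,string>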
