Pyspark explode string column containing JSON nested in array laterally - python

I have a dataframe containing a column like:
df['metrics'] =
[{id=1,name=XYZ,value=3}, {id=2,name=KJH,value=2}]
[{id=4,name=ABC,value=7}, {id=8,name=HGS,value=9}]
The column is a String type, and I am trying to explode the column using:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
array_item_schema = spark.read.json(df.rdd.map(lambda row: row['metrics'])).schema
json_array_schema = ArrayType(array_item_schema, True)
arrays_df = df.select(F.from_json('metrics', json_array_schema).alias('json_arrays'))
objects_df = arrays_df.select(F.explode('json_arrays').alias('objects'))
However, I get null values returned when I try
objects_df.show()
The output I am looking for is each element of the 'metrics' column split out into its own row, with columns named id, name, and value, in the same dataframe. I don't know where to start decoding it. Thanks for the help!

You can use the schema_of_json function to get the schema from a JSON string and pass it to the from_json function to get a struct type.
from pyspark.sql.functions import from_json, schema_of_json

json_array_schema = schema_of_json(str(df.select("metrics").first()[0]))
arrays_df = df.select(from_json('metrics', json_array_schema).alias('json_arrays'))
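From there you can explode the array and flatten the struct fields into columns. A minimal end-to-end sketch (assuming the strings in 'metrics' are valid JSON; note that from_json returns null for malformed input, so data that literally looks like {id=1,name=XYZ,value=3} rather than {"id":1,...} would also explain the nulls you saw):
from pyspark.sql.functions import explode, from_json, schema_of_json

# Infer the array schema from one sample row (assumes every row shares the same shape)
json_array_schema = schema_of_json(str(df.select("metrics").first()[0]))

objects_df = (
    df.withColumn('json_arrays', from_json('metrics', json_array_schema))
      .withColumn('objects', explode('json_arrays'))
      .select('objects.*')  # flattens to id, name, value columns
)
objects_df.show()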

Related

Creating new column with list of strings

I have a given dataset: https://www.kaggle.com/abcsds/pokemon
I need to create a new column based on another column called 'Name of the Pokemon' (string type); the new column should contain lists of strings instead of single strings.
I need to use a function. This is my code:
import pandas as pd
import numpy as np

df = pd.read_csv('pokemon.csv')

def transform_faves(df):
    df = df.assign(name_as_list=df.name)  # new column
    list_of_a_single_column = df['name'].tolist()
    df['name_as_list'] = list_of_a_single_column
    print(type(list_of_a_single_column))
    return df

df = transform_faves(df)
The problem is that the new column still contains strings rather than lists of strings. Why doesn't this conversion work?
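No answer is recorded for this one, but a likely explanation (my note, not from the thread): df['name'].tolist() builds one flat list for the entire column, and assigning it back just redistributes the strings one per row. To get a list in each cell you can wrap every value individually, e.g.:
import pandas as pd

df = pd.read_csv('pokemon.csv')

# Wrap each name in its own one-element list, row by row
# (assumes the column is called 'Name' as in the Kaggle file; adjust if yours is 'name')
df['name_as_list'] = df['Name'].apply(lambda x: [x])

print(type(df['name_as_list'].iloc[0]))  # <class 'list'>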

Get value from Pyspark Column and compare it to a Python dictionary

So I have a pyspark dataframe to which I want to add another column, using the value from the Section_1 column as a key into a Python dictionary. Basically, I take the value from the Section_1 cell, look it up as a key in the dictionary, and fill the corresponding value into the new column, like below.
Original dataframe:

DataId   ObjId      Name         Object  Section_1
My data  Data name  Object name  rd.111  rd.123
Python dictionary:
object_map= {'rd.123' : 'rd.567'}
Where Section_1 has a value of rd.123, I will search the dictionary for the key 'rd.123' and want to return its value rd.567 and place that in the new column.
Desired dataframe:

DataId   ObjId      Name         Object  Section_1  Section_2
My data  Data name  Object name  rd.111  rd.123     rd.567
Right now I get the error below with my current code, and I don't really know what I did wrong, as I am not too familiar with pyspark:
There is an incorrect call to a Column object in your code. Please review your code.
Here is the code I am currently using, where object_map is the Python dictionary.
test_df = output.withColumn('Section_2', object_map.get(output.Section_1.collect()))
You can try this (adapted from this answer with added null handling):
from itertools import chain
from pyspark.sql.functions import create_map, lit, when

object_map = {'rd.123': 'rd.567'}
mapping_expr = create_map([lit(x) for x in chain(*object_map.items())])

df1 = df.filter(df['Section_1'].isNull()).withColumn('Section_2', lit(None))
df2 = df.filter(df['Section_1'].isNotNull()).withColumn(
    'Section_2',
    when(
        df['Section_1'].isNotNull(),
        mapping_expr[df['Section_1']]
    )
)
result = df1.unionAll(df2)
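For reference, a minimal mocked-up run of the mapping lookup (my sketch, not from the original post). Note that the map lookup itself already yields null for null or missing keys, so the filter/union split above is mainly defensive:
from itertools import chain
from pyspark.sql.functions import create_map, lit

object_map = {'rd.123': 'rd.567'}
mapping_expr = create_map([lit(x) for x in chain(*object_map.items())])

df = spark.createDataFrame([('rd.123',), (None,)], ['Section_1'])
df.withColumn('Section_2', mapping_expr[df['Section_1']]).show()
# +---------+---------+
# |Section_1|Section_2|
# +---------+---------+
# |   rd.123|   rd.567|
# |     null|     null|
# +---------+---------+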

How to replace unicode dicts values and return to unicode format in python

I have a dataframe with one column named "metadata" in unicode format, as can be seen below:
print(df.metadata[1])
u'{"vehicle_year":2010,"issue_state":"RS",...,"type":4}'
type(df.metadata[1])
unicode
I have another column in this dataframe named 'issue_state_update', and I need to change the 'issue_state' value inside each row's metadata to the value held in that row's 'issue_state_update' column.
I have tried to use the following:
for i in range(len(df_final['metadata'])):
    df_final['metadata'][i] = json.loads(df_final['metadata'][i])
    json_dumps(df_final['metadata'][i].update({'issue_state': df_final['issue_state_update'][i]}), ensure_ascii=False).encode('utf-8')
However what I get is an error:
TypeError: expected string or buffer
What I need is to have exactly the same format as before doing this change, but with the new info associated with 'issue_state'
For example:
u'{"vehicle_year":2010,"issue_state":"NO STATE",...,"type":4}'
I'm assuming you have a DataFrame (DF) that looks something like:
[screenshot of a mocked-up DataFrame]
Since you're working with a DF you should manipulate the data as a vector instead of iterating over it like in standard Python. One way to do this is by defining a function and then "applying" it to your data. Something like:
def parse_dict(x):
    # updates the dict stored in 'metadata' in place
    x['metadata']['issue_state'] = x['issue_state_update']
Then you could apply it to every row in your DataFrame using:
some_df.apply(parse_dict, axis=1)
After running that code I get an updated DF that looks like:
[updated DataFrame where the dict now holds the value from 'issue_state_update']
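If the 'metadata' column still holds raw JSON strings rather than dicts (as in the question), a sketch of the full round trip could look like this (my adaptation, not the answerer's exact code):
import json

def update_issue_state(row):
    # Parse the JSON string, swap in the new value, and serialize back
    meta = json.loads(row['metadata'])
    meta['issue_state'] = row['issue_state_update']
    return json.dumps(meta, ensure_ascii=False)

df['metadata'] = df.apply(update_issue_state, axis=1)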
Actually I have found the answer. I don't know how efficient it is, but it works. Here it goes:
import json
import unicodedata

def replacer(df):
    df_final = df
    df_final['issue_state_upd'] = ""
    for i in range(len(df_final['issue_state'])):
        # From unicode to string
        df_final['issue_state_upd'][i] = unicodedata.normalize('NFKD', df_final['issue_state'][i]).encode('ascii', 'ignore')
        # From string to dict
        df_final['issue_state_upd'][i] = json.loads(df_final['issue_state_upd'][i])
        # Replace value in fuel key
        df_final['issue_state_upd'][i].update({'fuel_type': df_final['issue_state_upd'][i]})
        # From dict to str
        df_final['issue_state_upd'][i] = json.dumps(df_final['issue_state_upd'][i])
        # From str to unicode
        df_final['issue_state_upd'][i] = unicode(df_final['issue_state_upd'][i], "utf-8")
    return df_final

How to create a new column reading part of string in another column and transforming it to integer

I need to create a new column in a dataframe based on information on another column which is of string type.
dataframe name= total_data
class,name
a, C-FRA_FRA-S18_FU_L_FUS_FR073_STR001-STR00
b, C-FRA_FRA-S18_FU_L_FUS_FR074_STR010-STR011
I have tried using the find() method, but it does not work; I obtain NaN values for the new column total_data.Frame:
total_data["Frame"]=total_data.name.str[total_data.name.str.find("FR0"):total_data.name.str.find("_STR")]
Using the code above, I obtain a new column that contains only NaN values.
I want to have a new column in the dataframe as follows:
class,name, Frame
a,C-FRA_FRA-S18_FU_L_FUS_FR073_STR001-STR001,73
b,C-FRA_FRA-S18_FU_L_FUS_FR074_STR010-STR011,74
and, if possible, this new column should contain integers.
If all the strings are in the same format, you can use a regex and str.extract like so:
df['Frame'] = df['name'].str.extract(r"FR0(\d+)_STR").astype(int)
# class name Frame
# 0 a C-FRA_FRA-S18_FU_L_FUS_FR073_STR001-STR00 73
# 1 b C-FRA_FRA-S18_FU_L_FUS_FR074_STR010-STR011 74
You can create a custom function and apply it to the DataFrame column using apply:
import pandas as pd

# Example set-up:
df = pd.DataFrame(data={"class": ["a", "b"],
                        "name": ["C-FRA_FRA-S18_FU_L_FUS_FR073_STR001-STR00",
                                 "C-FRA_FRA-S18_FU_L_FUS_FR074_STR010-STR011"]})

# Solution:
def str_func(s):
    ix1 = s.find("FR0") + 3
    ix2 = s.find("_STR")
    return s[ix1:ix2]

df["Frame"] = df["name"].apply(str_func).astype(int)

How to save the returned values of UDF function into two columns?

My function get_data returns a tuple: two integer values.
get_data_udf = udf(lambda id: get_data(spark, id), (IntegerType(), IntegerType()))
I need to split them into two columns val1 and val2. How can I do it?
dfnew = df \
    .withColumn("val", get_data_udf(col("id")))
Should I save the tuple in a column, e.g. val, and then split it somehow into two columns. Or is there any shorter way?
You can give the udf a StructType of StructFields, so the returned values can be accessed as named fields later.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType, StructField, StructType

get_data_udf = udf(lambda id: get_data(spark, id),
                   StructType([StructField('first', IntegerType()),
                               StructField('second', IntegerType())]))

dfnew = df \
    .withColumn("val", get_data_udf(col("id"))) \
    .select('*', col('val.first').alias('first'), col('val.second').alias('second'))
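To make that runnable end to end, here is a sketch with a stubbed get_data (the stub is hypothetical; the real function is the asker's):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType, StructField, StructType

def get_data(id):
    # hypothetical stand-in for the asker's get_data(spark, id)
    return id * 2, id * 3

schema = StructType([StructField('first', IntegerType()),
                     StructField('second', IntegerType())])
get_data_udf = udf(get_data, schema)

df = spark.createDataFrame([(1,), (2,)], ['id'])
df.withColumn('val', get_data_udf(col('id'))) \
  .select('*', col('val.first').alias('first'), col('val.second').alias('second')) \
  .show()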
Tuples can be indexed just like lists, so you can fill the first column with get_data()[0] and the second column with get_data()[1].
You can also do v1, v2 = get_data() and in this way assign the returned tuple values to the variables v1 and v2.
Take a look at this question for further clarification.
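In plain Python terms (a trivial illustration, not tied to Spark):
def get_data():
    return 1, 2          # returns a tuple

pair = get_data()
first, second = pair[0], pair[1]   # index like a list
v1, v2 = get_data()                # or unpack directly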
For example, suppose you have a sample dataframe of one column like below (this answer uses Scala):

val df = sc.parallelize(Seq(3)).toDF()
df.show()

// Below is a function which will return a tuple
def tupleFunction(): (Int, Int) = (1, 2)

// We will create two new columns from the above function
df.withColumn("newCol", typedLit(tupleFunction.toString.replace("(", "").replace(")", "").split(",")))
  .select((0 to 1).map(i => col("newCol").getItem(i).alias(s"newColFromTuple$i")): _*)
  .show
