Create a dataframe from column of dictionaries in pyspark - python

I want to create a new dataframe from an existing dataframe in pyspark. The dataframe "df" contains a column named "data" whose rows are dictionaries stored as strings, and the keys of each dictionary are not fixed. For example, "name" and "address" are the keys in the first row, but other rows may have different keys. The following is an example:
data
--------------------------------------------------------
{"name": "sam", "address": "uk"}
{"name": "jack", "address": "aus", "occupation": "job"}
How do I convert this into a dataframe with individual columns, like the following?
name  address  occupation
sam   uk
jack  aus      job

Convert the data to an RDD, then use spark.read.json to turn the RDD into a DataFrame; the schema (the union of all keys) is inferred automatically.
from pyspark.sql import SparkSession

data = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"}
]

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# read.json infers the union of all keys as columns; fill the missing values with ''
df = spark.read.json(sc.parallelize(data)).na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
| uk| sam| |
| aus|jack| job|
+-------+----+----------+
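If the JSON strings already sit in a column of your existing DataFrame (as in the question's "data" column), the same idea applies: feed that column's values to spark.read.json. A minimal sketch, assuming the original DataFrame is called df and its string column is named data:
# Extract the raw JSON strings from the "data" column and let read.json
# infer the union of all keys across rows as columns.
json_rdd = df.rdd.map(lambda row: row.data)
parsed = spark.read.json(json_rdd).na.fill('')
parsed.show()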

If the order of the rows is not important, here is another way to do it:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# toDF() infers the schema from the dicts; keys missing in a row become null
df = sc.parallelize([
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]).toDF()
df = df.na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
| aus|jack| job|
| uk| sam| |
+-------+----+----------+
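Note that letting Spark infer a schema from plain dicts is deprecated in newer PySpark releases. If you know the possible keys up front, a sketch with an explicit schema (not part of the original answer) sidesteps the warning and makes the missing-key behaviour explicit:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]

# With an explicit schema, keys missing from a dict simply become nulls,
# which na.fill('') then turns into empty strings.
df = spark.createDataFrame(data, "name string, address string, occupation string")
df.na.fill('').show()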

Related

How to append a value from exploded value in dataframe in pyspark

The data is:
data = [{"_id": "Inst001", "Type": "AAAA",
         "Model001": [{"_id": "Mod001", "Name": "FFFF"},
                      {"_id": "Mod0011", "Name": "FFFF4"}]},
        {"_id": "Inst002", "Type": "BBBB",
         "Model001": [{"_id": "Mod002", "Name": "DDD"}]}]
I need to frame a dataframe as follows:
pid      _id      Name
Inst001  Mod001   FFFF
Inst001  Mod0011  FFFF4
Inst002  Mod002   DDD
The approach I had in mind is to explode "Model001" and then append the main _id to the exploded dataframe. But how can this append be done in pyspark?
Is there any built-in method available in pyspark for this?
Create a dataframe with a proper schema, and use inline on the Model001 column:
df = spark.createDataFrame(
    data,
    '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
).selectExpr('_id as pid', 'inline(Model001)')
df.show(truncate=False)
+-------+-------+-----+
|pid |_id |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD |
+-------+-------+-----+
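For comparison, the explode-based approach the question describes also works: explode Model001 while keeping the parent _id as pid, then pull the struct fields out. A sketch against the same data and schema as above:
from pyspark.sql import functions as F

src = spark.createDataFrame(
    data,
    '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
)

# explode() produces one row per element of Model001; the parent _id travels
# along as pid, and the struct fields are then selected out of the exploded column.
result = (
    src.select(F.col('_id').alias('pid'), F.explode('Model001').alias('m'))
       .select('pid', F.col('m._id').alias('_id'), F.col('m.Name').alias('Name'))
)
result.show(truncate=False)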

Split file name into different columns of pyspark dataframe

I am using pyspark SQL function input_file_name to add the input file name as a dataframe column.
df = df.withColumn("filename",input_file_name())
The column now has values like the one below.
"abc://dev/folder1/date=20200813/id=1"
From the above column I have to create two different columns:
Date
ID
I have to get only the date and the id from the file name above and populate them into the columns mentioned.
I can use split and get them, but if the folder structure changes then it might be a problem.
Is there a way to check whether the file name contains the strings "date" and "id", get the values after the equals sign, and populate them into two new columns?
Below is the expected output.
filename                              date      id
abc://dev/folder1/date=20200813/id=1  20200813  1
You could use regexp_extract with a pattern that looks at the date= and id= substrings:
from pyspark.sql import Row
import pyspark.sql.functions as f
df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
                     'def://dev/folder25/id=3/date=20200814'])\
    .map(lambda l: Row(file=l)).toDF()
df.show(truncate=False)
+-------------------------------------+
|file |
+-------------------------------------+
|abc://dev/folder1/date=20200813/id=1 |
|def://dev/folder25/id=3/date=20200814|
+-------------------------------------+
df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
.withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
df.show(truncate=False)
Which outputs:
+-------------------------------------+--------+---+
|file |date |id |
+-------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1 |20200813|1 |
|def://dev/folder25/id=3/date=20200814|20200814|3 |
+-------------------------------------+--------+---+
I used withColumn and split to break the column value into date and id, creating them as columns in the same dataframe; the code snippet is below:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import split

adata = [("abc://dev/folder1/date=20200813/id=1",)]
aschema = StructType([StructField("filename", StringType(), True)])
adf = spark.createDataFrame(data=adata, schema=aschema)
# split on 'date=' / 'id=' and keep the part after it (first 8 chars for the date)
bdf = adf.withColumn('date', split(adf['filename'], 'date=').getItem(1)[0:8])\
         .withColumn('id', split(adf['filename'], 'id=').getItem(1))
bdf.show(truncate=False)
Which outputs:
+------------------------------------+--------+---+
|filename |date |id |
+------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1|20200813|1 |
+------------------------------------+--------+---+
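If you want something that ignores the order of the date= and id= folders entirely, one option (a sketch, not from either answer above) is to turn every key=value path segment into a map with Spark SQL's str_to_map and then look up just the keys you need:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("abc://dev/folder1/date=20200813/id=1",),
     ("def://dev/folder25/id=3/date=20200814",)],
    ["filename"])

# str_to_map splits on '/' and then on '='; segments without '=' (scheme, plain
# folder names) become keys with null values that are simply never looked up.
# This assumes the path segments are unique, since duplicate map keys raise an
# error on Spark 3+.
df = (df.withColumn("kv", F.expr("str_to_map(filename, '/', '=')"))
        .withColumn("date", F.col("kv")["date"])
        .withColumn("id", F.col("kv")["id"])
        .drop("kv"))
df.show(truncate=False)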

String to array in spark

I have a dataframe in PySpark with a string column whose value is [{"AppId":"APACON","ExtId":"141730"}] (the value is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of struct.
Can I do that simply with native Spark functions, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]')],
    ['idx', 'txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data:
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]')],
    ['idx', 'txt']
)
You can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])
df = df.withColumn('array', F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
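Since the question asks for an array of struct rather than a single struct, note that more recent Spark versions also accept an ArrayType schema in from_json. A sketch:
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Each txt value like '[{"AppId":...,"ExtId":...}]' becomes a true
# array<struct<AppId:string, ExtId:string>> column.
array_schema = ArrayType(StructType([
    StructField("AppId", StringType()),
    StructField("ExtId", StringType())
]))
df = df.withColumn('parsed_array', F.from_json(F.col('txt'), array_schema))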
Version < Spark 2.1
One way to bypass the issue would be to first strip the square brackets from the input string:
# Use regexp_extract to drop the square brackets
df = df.withColumn('txt', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
df.show()
+---+-----------------------------------+
|idx|txt                                |
+---+-----------------------------------+
|1  |{"AppId":"APACON","ExtId":"141730"}|
|2  |{"AppId":"APACON","ExtId":"141793"}|
+---+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the txt column
df = df.withColumn('AppId', F.get_json_object(df.txt, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt, '$.ExtId'))
df.show()
+---+--------------------+------+------+
|idx| txt| AppId| ExtId|
+---+--------------------+------+------+
| 1|{"AppId":"APACON"...|APACON|141730|
| 2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+

How to create a dataframe with a single header (1 row, many cols) and update values in this dataframe in pyspark?

I want to create a dataframe in pyspark like the table below:
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
So, the code I tried is below:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()
This gives an error:
AssertionError: col should be Column
Also, the values of the columns have to be updated again later, and that update will also be a single row.
How can this be done?
I think the problem with your code lies in the lines where you use raw variables, like .withColumn("bucket", bucket). You are trying to create a new column from a plain integer value, but withColumn expects a Column, not an integer.
To solve this, use lit, just as you already do for "nation":
df = df\
    .withColumn("category", F.lit('nation'))\
    .withColumn("category_id", F.lit('nation'))\
    .withColumn("bucket", F.lit(bucket))\
    .withColumn("prop_count", F.lit(prop_count))\
    .withColumn("event_count", F.lit(event_count))\
    .withColumn("accum_prop_count", F.lit(accum_prop_count))\
    .withColumn("accum_event_count", F.lit(accum_event_count))
Another, simpler and cleaner way to write it:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create schema
fields = [StructField("category", StringType(), True),
          StructField("category_id", StringType(), True),
          StructField("bucket", IntegerType(), True),
          StructField("prop_count", IntegerType(), True),
          StructField("event_count", IntegerType(), True),
          StructField("accum_prop_count", IntegerType(), True),
          StructField("accum_event_count", IntegerType(), True)]
schema = StructType(fields)

# load data (matching the single row from the question)
data = [["nation", "nation", 1, 222, 444, 555, 6677]]
df = spark.createDataFrame(data, schema)
df.show()
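As for updating the values later: DataFrames are immutable, so an "update" of this single row is just another transformation (or a rebuild of the one-row frame). A sketch using lit, with placeholder values standing in for whatever the recomputed numbers are:
import pyspark.sql.functions as F

# "Updating" means deriving a new DataFrame; 333 and 666 are placeholder values.
df = df.withColumn("prop_count", F.lit(333)) \
       .withColumn("event_count", F.lit(666))
df.show()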

Count in pyspark

I have a spark dataframe df with a column "id" (string) and another column "values" (array of strings). I want to create another column called "count" which contains the count of values for each id.
df looks like -
id      values
1fdf67  [dhjy1, jh87w3, 89yt5re]
df45l1  [hj098, hg45l0, sass65r4, dh6t21]
Result should look like -
id      values                             count
1fdf67  [dhjy1, jh87w3, 89yt5re]           3
df45l1  [hj098, hg45l0, sass65r4, dh6t21]  4
I am trying to do it as below -
df= df.select(id,values).toDF(id,values,values.count())
This doesn't seem to be working for my requirement.
Use the size function:
from pyspark.sql.functions import size

df = spark.createDataFrame([
    ("1fdf67", ["dhjy1", "jh87w3", "89yt5re"]),
    ("df45l1", ["hj098", "hg45l0", "sass65r4", "dh6t21"])],
    ("id", "values"))
df.select("*", size("values").alias("count")).show(2, False)
+------+---------------------------------+-----+
|id |values |count|
+------+---------------------------------+-----+
|1fdf67|[dhjy1, jh87w3, 89yt5re] |3 |
|df45l1|[hj098, hg45l0, sass65r4, dh6t21]|4 |
+------+---------------------------------+-----+
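Equivalently, if you prefer to add the column to the existing dataframe rather than re-select everything:
from pyspark.sql.functions import size

df = df.withColumn("count", size("values"))
df.show(truncate=False)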
