How to append a value from an exploded column in a dataframe in pyspark - python

The data is
data = [
    {"_id": "Inst001", "Type": "AAAA", "Model001": [{"_id": "Mod001", "Name": "FFFF"},
                                                    {"_id": "Mod0011", "Name": "FFFF4"}]},
    {"_id": "Inst002", "Type": "BBBB", "Model001": [{"_id": "Mod002", "Name": "DDD"}]}
]
Need to frame a dataframe as follows:
pid     | _id     | Name
-------------------------
Inst001 | Mod001  | FFFF
Inst001 | Mod0011 | FFFF4
Inst002 | Mod002  | DDD
The approach I had is:
Need to explode "Model001".
Then need to append the main _id to this exploded dataframe. But how can this append be done in pyspark?
Is there any builtin method available in pyspark for the above problem?

Create a dataframe with a proper schema, and do inline on the Model001 column:
df = spark.createDataFrame(
    data,
    '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
).selectExpr('_id as pid', 'inline(Model001)')
df.show(truncate=False)
+-------+-------+-----+
|pid |_id |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD |
+-------+-------+-----+
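As an alternative sketch, if you prefer explode over inline, you can explode the array and select the struct fields explicitly (this reuses the data list and the schema string from above):
from pyspark.sql import functions as F

df2 = (
    spark.createDataFrame(
        data,
        '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
    )
    .withColumn('model', F.explode('Model001'))   # one row per element of the array
    .select(
        F.col('_id').alias('pid'),                # parent _id becomes pid
        F.col('model._id').alias('_id'),          # nested _id from the struct
        F.col('model.Name').alias('Name'),
    )
)
df2.show(truncate=False)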

Related

Converting a mongodb nested record into a dataframe using pandas

I was reading a record from mongodb using python and the end result was not as expected.
MongoDb records
_id: objectID("4624689264826482")
verison: 2
name: "matt"
code: "57532"
status: "active"
address: object
    address1: "4638, 14th cross"
    city: "london"
    state: "london"
    date: "2021-10-25T00:19:56:000+00:00"
floordetails: object
    floorname: "2"
    room: "5"
metadata: object
    extid: "3303"
    ctype: "6384"

_id: objectID("20889689264826482")
verison: 3
name: "rick"
code: "96597"
status: "active"
address: object
    address1: "34, 12th street"
    city: "london"
    state: "london"
    date: "2021-10-25T00:19:56:000+00:00"
floordetails: object
    floorname: "4"
    room: "234"
metadata: object
    extid: "26403"
    ctype: "4724"
I tried converting the records to a dataframe (every nested key within the objects should become its own column).
expected result:
_id |verison|name |code |status |address1 |city |state |date |floorname |room|extid |ctype
objectID("4624689264826482") |2 |"matt"|"57532"|"active"|"4638, 14th cross"|"london"|"london"|"2021-10-25T00:19:56:000+00:00"|"2" |"5" |"3303"|"6384"
objectID("20889689264826482")|3 |"rick"|"96597"|"active"|"34, 12th street" |"london"|"london"|"2021-10-25T00:19:56:000+00:00"|"4" |"234"|"26403"|"4724"
but the final result appears as below
id |verison|name |code |status |address |floordetails |metadata
objectID("4624689264826482") |2 |"matt"|"57532"|"active"|{address1:"4638, 14th cross",city:"london",state:"london",date:"2021-10-25T00:19:56:000+00:00"}|{floorname:"2",room:"5"} |{extid:"3303",ctype:"6384"}
objectID("20889689264826482")|3 |"rick"|"96597"|"active"|{address1:"34, 12th street",city:"london",state:"london",date:"2021-10-25T00:19:56:000+00:00"} |{floorname:"4",room:"234"}|{extid:"26403",ctype:"4724"}
Please advise me on this.
Supposing you're loading the data with:
df = pd.DataFrame(list(db.collection_name.find({})))
I don't think there is a direct way to "unpack" your nested values separately. If the nested values come in as strings rather than dicts, you first need to convert them back into proper dictionaries so they can be processed with pandas.DataFrame:
import ast
df['address'] = df['address'].map(ast.literal_eval)
df['floordetails'] = df['floordetails'].map(ast.literal_eval)
df['metadata'] = df['metadata'].map(ast.literal_eval)
Now use pandas.DataFrame.join() to attach each set of unpacked nested-dict values to a new dataframe:
import pandas as pd

newdf = df[['_id', 'verison', 'name', 'code', 'status']]
newdf = newdf.join(pd.DataFrame(df['address'].tolist(), index=df.index).add_prefix('address.'))
newdf = newdf.join(pd.DataFrame(df['floordetails'].tolist(), index=df.index).add_prefix('floordetails.'))
newdf = newdf.join(pd.DataFrame(df['metadata'].tolist(), index=df.index).add_prefix('metadata.'))
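A possibly simpler alternative, assuming the nested values are already real dicts rather than strings: pandas.json_normalize (pandas 1.0+) can flatten all the nested objects in one call. The records variable below is just the list returned by the same MongoDB query:
import pandas as pd

# records is the list of dicts returned by the MongoDB query
records = list(db.collection_name.find({}))

# sep='.' yields column names like 'address.city', 'floordetails.room', 'metadata.extid'
flat_df = pd.json_normalize(records, sep='.')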

Split file name into different columns of pyspark dataframe

I am using pyspark SQL function input_file_name to add the input file name as a dataframe column.
df = df.withColumn("filename",input_file_name())
The column now has values like the one below.
"abc://dev/folder1/date=20200813/id=1"
From the above column I have to create 2 different columns.
Date
ID
I have to get only the date and id from the above file name and populate them into the columns mentioned above.
I can use split_col and get it, but if the folder structure changes then it might be a problem.
Is there a way to check whether the file name contains the strings "date" and "id", get the values after the equals sign, and populate them into two new columns?
Below is the expected output.
filename date id
abc://dev/folder1/date=20200813/id=1 20200813 1
You could use regexp_extract with a pattern that looks at the date= and id= substrings:
from pyspark.sql import Row
import pyspark.sql.functions as f

df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
                     'def://dev/folder25/id=3/date=20200814'])\
    .map(lambda l: Row(file=l)).toDF()
df.show(truncate=False)
+-------------------------------------+
|file |
+-------------------------------------+
|abc://dev/folder1/date=20200813/id=1 |
|def://dev/folder25/id=3/date=20200814|
+-------------------------------------+
df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
.withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
df.show(truncate=False)
Which outputs:
+-------------------------------------+--------+---+
|file |date |id |
+-------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1 |20200813|1 |
|def://dev/folder25/id=3/date=20200814|20200814|3 |
+-------------------------------------+--------+---+
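Tying this back to the original input_file_name column, a minimal sketch (assuming df is the dataframe read from those paths; regex group 1 captures the digits after date= and id= wherever they appear in the path):
from pyspark.sql import functions as F

df = (
    df.withColumn("filename", F.input_file_name())
      .withColumn("date", F.regexp_extract("filename", r"date=(\d+)", 1))
      .withColumn("id", F.regexp_extract("filename", r"id=(\d+)", 1))
)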
I have used withColumn and split to break the column value into date and id, creating them as columns in the same dataset; the code snippet is below:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import split

adata = [("abc://dev/folder1/date=20200813/id=1",)]
aschema = StructType([StructField("filename", StringType(), True)])
adf = spark.createDataFrame(data=adata, schema=aschema)
bdf = adf.withColumn('date', split(adf['filename'], 'date=').getItem(1)[0:8])\
         .withColumn('id', split(adf['filename'], 'id=').getItem(1))
bdf.show(truncate=False)
Which outputs to :
+------------------------------------+--------+---+
|filename |date |id |
+------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1|20200813|1 |
+------------------------------------+--------+---+
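As a sketch of a variant that does not depend on the folder order, you could parse the whole path into a map with the built-in str_to_map SQL function and then look up the date and id keys (this reuses the adf dataframe from the snippet above):
from pyspark.sql import functions as F

# Split the path into segments on '/' and each segment into key/value on '=',
# then pull 'date' and 'id' out of the resulting map regardless of position.
cdf = (
    adf.withColumn("kv", F.expr("str_to_map(filename, '/', '=')"))
       .withColumn("date", F.col("kv")["date"])
       .withColumn("id", F.col("kv")["id"])
       .drop("kv")
)
cdf.show(truncate=False)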

Create a dataframe from column of dictionaries in pyspark

I want to create a new dataframe from an existing dataframe in pyspark. The dataframe "df" contains a column named "data" whose rows are dictionaries stored as strings. The keys of each dictionary are not fixed: for example, name and address are the keys for the first row's dictionary, but other rows may have different keys. Following is an example of that:
data
-------------------------------------------------------
{"name": "sam", "address": "uk"}
{"name": "jack", "address": "aus", "occupation": "job"}
How do I convert it into a dataframe with individual columns like the following?
name address occupation
sam uk
jack aus job
Convert the data to an RDD, then use spark.read.json to convert the RDD into a dataframe with an inferred schema.
from pyspark.sql import SparkSession

data = [
    {"name": "sam", "address": "uk"},
    {"name": "jack", "address": "aus", "occupation": "job"}
]

spark = SparkSession.builder.getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize(data)).na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
| uk| sam| |
| aus|jack| job|
+-------+----+----------+
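Since the question describes the dictionaries as JSON strings already sitting in a data column of an existing dataframe df, a minimal sketch of that variant (the column name data is taken from the question):
# Feed the JSON strings from the existing 'data' column to spark.read.json,
# which infers the union of all keys across rows as the schema.
json_rdd = df.rdd.map(lambda row: row.data)
flat_df = spark.read.json(json_rdd).na.fill('')
flat_df.show()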
If the order of rows is not important, this is another way you can do this:
from pyspark import SparkContext
sc = SparkContext()
df = sc.parallelize([
    {"name": "jack", "address": "aus", "occupation": "job"},
    {"name": "sam", "address": "uk"}
]).toDF()
df = df.na.fill('')
df.show()
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
| aus|jack| job|
| uk| sam| |
+-------+----+----------+

How to create dataframe with single header ( 1 row many cols) and update values to this dataframe in pyspark?

I want to create a dataframe in pyspark like the table below :
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
So, the code I tried is below:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()
This is giving an error :
AssertionError: col should be Column
Also, the values of the columns have to be updated again later, and the update will also be a single row.
How to do this??
I think the problem with your code lies in the lines where you are using variables, like .withColumn("bucket", bucket). You are trying to create a new column by giving it an integer value, but withColumn expects a column and not a plain integer value.
To solve this, you can use lit just like you are already doing for "nation", like:
df = df\
.withColumn("category",F.lit('nation'))\
.withColumn("category_id",F.lit('nation'))\
.withColumn("bucket",F.lit(bucket))\
.withColumn("prop_count",F.lit(prop_count))\
.withColumn("event_count",F.lit(event_count))\
.withColumn("accum_prop_count",F.lit(accum_prop_count))\
.withColumn("accum_event_count",F.lit(accum_event_count))
Another, simpler and cleaner way to write it may be like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# create schema
fields = [StructField("category", StringType(), True),
          StructField("category_id", StringType(), True),
          StructField("bucket", IntegerType(), True),
          StructField("prop_count", IntegerType(), True),
          StructField("event_count", IntegerType(), True),
          StructField("accum_prop_count", IntegerType(), True),
          StructField("accum_event_count", IntegerType(), True)]
schema = StructType(fields)

# load data
data = [["nation", "nation", 1, 222, 444, 555, 6677]]
df = spark.createDataFrame(data, schema)
df.show()
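The question also mentions updating the values of this single row later; a minimal sketch of one way to do that is simply overwriting the columns with new literal values (the new_* variables here are illustrative):
from pyspark.sql import functions as F

# Hypothetical new values for the single-row dataframe
new_prop_count = 333
new_event_count = 888

df = df.withColumn("prop_count", F.lit(new_prop_count)) \
       .withColumn("event_count", F.lit(new_event_count))
df.show()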

Count in pyspark

I have a spark dataframe df with a column "id" (string) and another column "values" (array of strings). I want to create another column called count which contains the count of values for each id.
df looks like -
id values
1fdf67 [dhjy1,jh87w3,89yt5re]
df45l1 [hj098,hg45l0,sass65r4,dh6t21]
Result should look like -
id values count
1fdf67 [dhjy1,jh87w3,89yt5re] 3
df45l1 [hj098,hg45l0,sass65r4,dh6t21] 4
I am trying to do as below -
df= df.select(id,values).toDF(id,values,values.count())
This doesn't seem to be working for my requirement.
Please use the size function:
from pyspark.sql.functions import size
df = spark.createDataFrame([
    ("1fdf67", ["dhjy1", "jh87w3", "89yt5re"]),
    ("df45l1", ["hj098", "hg45l0", "sass65r4", "dh6t21"])],
    ("id", "values"))

df.select("*", size("values").alias("count")).show(2, False)
+------+---------------------------------+-----+
|id |values |count|
+------+---------------------------------+-----+
|1fdf67|[dhjy1, jh87w3, 89yt5re] |3 |
|df45l1|[hj098, hg45l0, sass65r4, dh6t21]|4 |
+------+---------------------------------+-----+
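A small follow-up sketch in case some rows might have a null values column (an assumption, not stated in the question): guard the count with when/otherwise so those rows get 0:
from pyspark.sql import functions as F

df = df.withColumn(
    "count",
    F.when(F.col("values").isNull(), F.lit(0)).otherwise(F.size("values"))
)
df.show(truncate=False)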
