How to check whether a key or value exists in a PySpark Map - python

I have a Map column in a Spark DF and would like to filter this column on a particular key (i.e. keep the row if the key in the map matches a desired value).
For example, my schema is defined as:
df_schema = StructType([
    StructField('id', StringType()),
    StructField('rank', MapType(StringType(), IntegerType()))
])
My sample data is:
{ "id": "0981850006", "rank": {"a": 1} }
Is there any way to filter my df on rows where "a" is in "rank" without using explode()?
Is there a better schema representation for the given json than what I have defined?

Accessing the key with rank.key would only work if rank were a StructType(). Although explode is probably the best solution, let's build a UDF to check whether or not k is a key of rank.
First let's create our dataframe:
from pyspark.sql.types import *

df_schema = StructType([
    StructField('id', StringType()),
    StructField('rank', MapType(StringType(), IntegerType()))
])
df = spark.createDataFrame([
    ["0981850006", {"a": 1}],
    ["0981850006", {"b": 2, "c": 3}],
], df_schema)
Now our UDF:
import pyspark.sql.functions as psf

def isKey(k, d):
    return k in d.keys()

isKey_udf = lambda k: psf.udf(lambda d: isKey(k, d), BooleanType())
Which gives:
df.withColumn(
    "is_key",
    isKey_udf('a')(df.rank)
).show()
+----------+-------------------+------+
| id| rank|is_key|
+----------+-------------------+------+
|0981850006| Map(a -> 1)| true|
|0981850006|Map(b -> 2, c -> 3)| false|
+----------+-------------------+------+
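If you want to avoid a UDF altogether, here is a minimal sketch using only built-in functions (assuming Spark 2.3+, where map_keys is available); it also answers the filtering part of the question directly:
import pyspark.sql.functions as F

# keep only the rows whose "rank" map contains the key "a"
df.filter(F.array_contains(F.map_keys(df.rank), "a")).show()

# equivalent check: looking up a missing key in a map yields NULL
df.filter(df.rank.getItem("a").isNotNull()).show()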

Related

pandas dataframe, with multi-index, to dictionary

I am trying to transform a pandas dataframe resulting from a groupby([columns]). The resulting index will have, for each "target_index", different lists of words (example shown as an image in the original post). Transforming it with to_dict() does not seem to work directly (I have tried several orient arguments).
The input dataframe can be recreated with the constructor further below.
The desired output (only two keys for the example):
{
    "2060": {
        "NOUN": ["product"]
    },
    "3881": {
        "ADJ": ["greater", "direct", "raw"],
        "NOUN": ["manufacturing", "capital"],
        "VERB": ["increased"]
    }
}
To recreate the dataset:
import pandas as pd

df = pd.DataFrame([
    ["2060", "NOUN", ["product"]],
    ["2060", "ADJ", ["greater"]],
    ["3881", "NOUN", ["manufacturing", "capital"]],
    ["3881", "ADJ", ["greater", "direct", "raw"]],
    ["3881", "VERB", ["increased"]]
], columns=["a", "b", "c"])
df = df.groupby(["a", "b"]).agg({"c": lambda x: x})
The input given in the constructor is different from the one in the image; I used the input from the constructor. You could use a lambda in groupby.apply to convert each group to a dict, then convert the aggregate to a dict:
out = df.groupby(level=0).apply(lambda x: x.droplevel(0).to_dict()['c']).to_dict()
Another option is to use itertuples and dict.setdefault:
out = {}
for (ok, ik), v in df.itertuples():
    out.setdefault(ok, {}).setdefault(ik, []).extend(v)
Output:
{'2060': {'ADJ': ['greater'], 'NOUN': ['product']},
 '3881': {'ADJ': ['greater', 'direct', 'raw'],
          'NOUN': ['manufacturing', 'capital'],
          'VERB': ['increased']}}
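If the goal is the literal JSON text shown in the question, the nested dict built above can be dumped with the standard library (a minimal sketch, reusing the out variable from either option):
import json

# pretty-print the nested dict as JSON text
print(json.dumps(out, indent=2))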

Row-wise operation in Pandas DataFrame

I have a Dataframe as
import pandas as pd
df = pd.DataFrame({
    "First": ['First1', 'First2', 'First3'],
    "Secnd": ['Secnd1', 'Secnd2', 'Secnd3']
})
df.index = ['Row1', 'Row2', 'Row3']
I would like to use a lambda function in the apply method to create a list of dictionaries (including the index item) as below:
[
    {'Row1': ['First1', 'Secnd1']},
    {'Row2': ['First2', 'Secnd2']},
    {'Row3': ['First3', 'Secnd3']},
]
If I use something like .apply(lambda x: <some operation>) here, x does not include the index, only the values.
Cheers,
DD
To expand Hans Bambel's answer to get the exact desired output:
[{k: list(v.values())} for k, v in df.to_dict('index').items()]
You don't need apply here. You can just use the to_dict() function with the "index" argument:
df.to_dict("index")
This gives the output:
{'Row1': {'First': 'First1', 'Secnd': 'Secnd1'},
'Row2': {'First': 'First2', 'Secnd': 'Secnd2'},
'Row3': {'First': 'First3', 'Secnd': 'Secnd3'}}
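For completeness, the same list of single-key dicts can also be built by iterating over the rows directly (a minimal sketch; iterrows is assumed to be acceptable here, although it is slow on large frames):
# build [{index: [row values]}, ...] without going through to_dict
result = [{idx: list(row)} for idx, row in df.iterrows()]
print(result)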

Pyspark - from long to wide with new column names

I have this dataframe:
data = [{"name": "test", "sentiment":'positive', "avg":13.65, "stddev":15.24},
{"name": "test", "sentiment":'neutral', "avg":338.74, "stddev":187.27},
{"name": "test", "sentiment":'negative', "avg":54.58, "stddev":50.19}]
df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment| avg|stddev|
+----+---------+------+------+
|test| positive| 13.65| 15.24|
|test| neutral|338.74|187.27|
|test| negative| 54.58| 50.19|
+----+---------+------+------+
I'd like to create a dataframe with this structure:
+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test| 13.65| 338.74| 54.58| 15.24| 187.27| 50.19|
+----+------------+-----------+------------+------------+-----------+------------+
I also don't know the name of this operation, feel free to suggest a proper title.
Thanks!
Use groupBy() and pivot():
import pyspark.sql.functions as F

df_grp = df.groupBy("name").pivot("sentiment").agg(
    F.first("avg").alias("avg"),
    F.first("stddev").alias("stddev")
)
df_grp.show()
+----+------------+---------------+-----------+--------------+------------+---------------+
|name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
+----+------------+---------------+-----------+--------------+------------+---------------+
|test| 54.58| 50.19| 338.74| 187.27| 13.65| 15.24|
+----+------------+---------------+-----------+--------------+------------+---------------+
Rename the columns afterwards if you want them to match the desired header exactly, as sketched below.
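A minimal renaming sketch (the loop and the std_ prefix are assumptions made here to match the header in the question; the pivoted names are <sentiment>_avg and <sentiment>_stddev):
renamed = df_grp
for sentiment in ["positive", "neutral", "negative"]:
    # e.g. positive_avg -> avg_positive, positive_stddev -> std_positive
    renamed = (renamed
               .withColumnRenamed(f"{sentiment}_avg", f"avg_{sentiment}")
               .withColumnRenamed(f"{sentiment}_stddev", f"std_{sentiment}"))
renamed.show()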

Spark: Dataframe Transformation

I have dataframe_1:
+-------------+----+---------+
| Name| Age| Salary|
+-------------+----+---------+
|Avery Bradley|25.0|7730337.0|
| Jae Crowder|25.0|6796117.0|
+-------------+----+---------+
and want to transform it to dataframe_2:
+----------------------------------------------------------------------------------------------------------------------+
| json_data |
+----------------------------------------------------------------------------------------------------------------------+
|[{"Name": "Avery Bradley", "Age": 25.0, "Salary" 7730337.0}, {"Name": "Jae Crowder", "Age": 25.0, "Salary" 6796117.0}]|
+----------------------------------------------------------------------------------------------------------------------+
I can do dataframe_1.toPandas().to_dict(orient="records"), but this would be a DataFrame-to-dict (JSON object) transformation, and I need a DataFrame-to-DataFrame transformation.
A solution in PySpark, if possible, would be appreciated.
You can do a collect_list of json:
import pyspark.sql.functions as F
df2 = df.agg(F.collect_list(F.to_json(F.struct('*'))).alias('json_data'))
df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------+
|json_data |
+--------------------------------------------------------------------------------------------------------------+
|[{"Name":"Avery Bradley","Age":25.0,"Salary":7730337.0}, {"Name":"Jae Crowder","Age":25.0,"Salary":6796117.0}]|
+--------------------------------------------------------------------------------------------------------------+
You can also merge the columns into a map and then build the JSON from it (note that create_map coerces all values to a common type, which is why the numbers come out as strings in this variant):
(df
 .withColumn('json', F.to_json(F.create_map(
     F.lit('name'), F.col('name'),
     F.lit('age'), F.col('age'),
     F.lit('salary'), F.col('salary'),
 )))
 .agg(F.collect_list('json').alias('json_value'))
).show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------+
|json_value |
+----------------------------------------------------------------------------------------------------------------------+
|[{"name":"Avery Bradley","age":"25.0","salary":"7730337.0"}, {"name":"Jae Crowder","age":"25.0","salary":"6796117.0"}]|
+----------------------------------------------------------------------------------------------------------------------+

optimize performance when creating dataframe from dictionary

Data looks like this:
data = {"date": 20210606,
"B": 11355,
"C": 4,
"ID": "ladygaga"}
I want to convert it to a DataFrame; however, each value needs to be a list, so I do:
import pandas as pd

data = {key: [item] for key, item in data.items()}
df = pd.DataFrame.from_dict(data)
This is what I do. I want to optimize the code as much as possible since this will run in a production-level API.
You can wrap the dictionary in a list:
df = pd.DataFrame([data])
print(df)
       date      B  C        ID
0  20210606  11355  4  ladygaga
Your own approach should also be a bit faster if written as a single expression:
df = pd.DataFrame({key: [item] for key, item in data.items()})
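If performance really matters, a quick benchmark along these lines can settle it (a minimal sketch; the timings are illustrative and depend on your machine and data):
import timeit
import pandas as pd

data = {"date": 20210606, "B": 11355, "C": 4, "ID": "ladygaga"}

# compare wrapping the dict in a list vs. wrapping each value in a list
t_list = timeit.timeit(lambda: pd.DataFrame([data]), number=10_000)
t_dict = timeit.timeit(lambda: pd.DataFrame({k: [v] for k, v in data.items()}), number=10_000)
print(f"pd.DataFrame([data]):         {t_list:.3f}s")
print(f"per-value dict comprehension: {t_dict:.3f}s")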
