I have 3 string values and 4 lists from which I'm trying to create a DataFrame.
Lists -
>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']
>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
>>> this_exp_success
[True, True, True, False, False, False]
Strings -
>>> is_overall_success
'False'
>>> data_source
'stores'
>>>
>>> run_time
'2022-02-24T05:43:16.678220+00:00'
I'm trying to create the DataFrame as below.
columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']
dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)
Error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
struct = self._inferSchemaFromList(data, names=schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
for f in a.fields]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
_merge_type(a.valueType, b.valueType, name='value of map %s' % name),
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
Expected Output
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
| data_source | run_time |exp_type |expectations |results |this_exp_success|is_overall_success|
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'country'} |{} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_not_be_null |{u'column': u'country'} |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_be_of_type |{u'column': u'country', u'type_': u'StringType'} |{u'observed_value': u'StringType'} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'countray'} |{} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_table_row_count_to_equal |{u'value': 10} |{u'observed_value': 102} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_sum_to_be_between |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100} |{u'observed_value': 22075.0} |False |False |
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
A solution to your problem would be the following: zip only the four lists, and add the three string values as Spark literal columns with lit (they cannot go into zip, because zipping a string iterates over its characters). The TypeError itself comes from schema inference: the results dictionaries mix integer and string values, so Spark cannot merge them into a single MapType; providing an explicit schema, with results declared as a plain string column, sidesteps that.
exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
u'expect_column_sum_to_be_between']
expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
{u'column': u'countray'}, {u'value': 10},
{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
{u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, MapType

df_schema = StructType([StructField("exp_type", StringType(), True),
                        StructField("expectations", MapType(StringType(), StringType()), True),
                        StructField("results", StringType(), True),
                        StructField("this_exp_success", StringType(), True)
                        ])
this_exp_success = [True, True, True, False, False, False]
result = spark.createDataFrame(
zip(exp_type, expectations, results, this_exp_success), df_schema)\
.withColumn('is_overall_success', lit('False'))\
.withColumn('data_source', lit('stores'))\
.withColumn('run_time', lit('2022-02-24T05:43:16.678220+00:00'))
result.show(truncate=False)
And the result shows:
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|exp_type |expectations |results |this_exp_success|is_overall_success|data_source|run_time |
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|expect_column_to_exist |[column -> country] |{} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_not_be_null|[column -> country] |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType] |{observed_value=StringType} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_to_exist |[column -> countray] |{} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_table_row_count_to_equal |[value -> 10] |{observed_value=102} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_sum_to_be_between |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
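If you would rather avoid the explicit schema, a minimal sketch, assuming the same variables as above: repeat the scalar strings per row and store both dictionary columns as JSON text, so that schema inference has nothing to merge. Note that this turns the two dictionary columns into plain strings.
import json

n = len(exp_type)
rows = list(zip([data_source] * n, [run_time] * n, exp_type,
                [json.dumps(e) for e in expectations],   # dict -> JSON string
                [json.dumps(r) for r in results],        # avoids the mixed-type map merge
                this_exp_success, [is_overall_success] * n))

columns = ['data_source', 'run_time', 'exp_type', 'expectations',
           'results', 'this_exp_success', 'is_overall_success']
result = spark.createDataFrame(rows, columns)
result.show(truncate=False)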
Related
I have a PySpark DataFrame and I want to map the values of one of its columns.
Sample dataset:
data = [(1, 'N'), \
(2, 'N'), \
(3, 'C'), \
(4, 'S'), \
(5, 'North'), \
(6, 'Central'), \
(7, 'Central'), \
(8, 'South')
]
columns = ["ID", "City"]
df = spark.createDataFrame(data = data, schema = columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
And the results are:
+--+-------+--------+
|ID|   City|City_New|
+--+-------+--------+
| 1|      N|   North|
| 2|      N|   North|
| 3|      C| Central|
| 4|      S|   South|
| 5|  North|    null|
| 6|Central|    null|
| 7|Central|    null|
| 8|  South|    null|
+--+-------+--------+
As you can see, I get null values for the rows whose City values are not included in the mapping dictionary. To solve this, I can define the mapping dictionary as:
{'N': 'North', 'C': 'Central', 'S': 'South', \
'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it is hard to define a mapping dictionary.
Is there any better way for this purpose?
You can use coalesce. Here's how it would look.
from pyspark.sql import functions as func

# the mapping dictionary from the question
map_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}

# create a separate CASE WHEN for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
# Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
# Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass the case whens to coalesce, with the `city` column itself as the last (fallback) value;
# data_sdf is the input dataframe (`df` in the question)
data_sdf. \
    withColumn('city_new', func.coalesce(*map_whens, 'city')). \
    show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
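If you would rather keep the create_map expression from the question, a small sketch (reusing mapping_expr and df from above) is to wrap the lookup in coalesce so that unmapped values fall back to the original City:
from pyspark.sql import functions as F

# keep the map lookup, but fall back to the City value itself when the key is missing
df_new = df.withColumn('City_New', F.coalesce(mapping_expr[F.col('City')], F.col('City')))
df_new.show()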
I have a spark data frame as below:
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pyspark.sql.functions as F
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True, "time": 1},
{"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False, "time": 2},
{"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None, "time": 3},
{"Category": 'C', "ID": 4, "Value": 33.87, "Truth": True, "time": 4},
{"Category": 'D', "ID": 4, "Value": 33.87, "Truth": True, "time": 5},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 6},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 7},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 8}
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
w = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.filter(df["ID"] == 4).withColumn("val_sum", F.sum(F.col("Value")).over(w)).withColumn("max_time", F.max(F.col("time")).over(w)).show()
and I received the following output:
+--------+---+-----+-----+----+------------------+--------+
|Category| ID|Truth|Value|time| val_sum|max_time|
+--------+---+-----+-----+----+------------------+--------+
| C| 4| true|33.87| 4| 33.87| 4|
| D| 4| true|33.87| 5| 33.87| 5|
| E| 4| true|33.87| 6| 33.87| 6|
| E| 4| true|33.87| 7| 67.74| 7|
| E| 4| true|33.87| 8|101.60999999999999| 8|
+--------+---+-----+-----+----+------------------+--------+
However, my expected output is like below:
+--------+---+-----+-----+----+------------------+--------+
|Category| ID|Truth|Value|time| val_sum|max_time|
+--------+---+-----+-----+----+------------------+--------+
| C| 4| true|33.87| 4| 33.87| 4|
| D| 4| true|33.87| 5| 33.87| 5|
| E| 4| true|33.87| 6| 33.87| 8|
| E| 4| true|33.87| 7| 67.74| 8|
| E| 4| true|33.87| 8|101.60999999999999| 8|
+--------+---+-----+-----+----+------------------+--------+
Can anyone please assist me with this? Please let me know if anything is unclear so that I can provide more info.
Define another window spec for the max function, with unboundedPreceding and unboundedFollowing instead of currentRow.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pyspark.sql.functions as F
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True, "time": 1},
{"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False, "time": 2},
{"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None, "time": 3},
{"Category": 'C', "ID": 4, "Value": 33.87, "Truth": True, "time": 4},
{"Category": 'D', "ID": 4, "Value": 33.87, "Truth": True, "time": 5},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 6},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 7},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True, "time": 8}
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
w = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
w1 = Window.partitionBy(F.col("Category")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.filter(df["ID"] == 4).withColumn("val_sum", F.sum(F.col("Value")).over(w)).withColumn("max_time", F.max(F.col("time")).over(w1)).show()
+--------+---+-----+-----+----+------------------+--------+
|Category| ID|Truth|Value|time| val_sum|max_time|
+--------+---+-----+-----+----+------------------+--------+
| E| 4| true|33.87| 6| 33.87| 8|
| E| 4| true|33.87| 7| 67.74| 8|
| E| 4| true|33.87| 8|101.60999999999999| 8|
| D| 4| true|33.87| 5| 33.87| 5|
| C| 4| true|33.87| 4| 33.87| 4|
+--------+---+-----+-----+----+------------------+--------+
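As a side note, the explicit bounds on w1 are mostly for emphasis: when a window spec has no orderBy, Spark's default frame is already the entire partition, so the sketch below should behave the same (worth verifying on your Spark version):
# with no orderBy, the frame defaults to the whole partition
w1 = Window.partitionBy(F.col("Category"))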
I am just studying PySpark and I got confused by the following code:
df.groupBy(['Category','Register']).agg({'NetValue':'sum',
'Units':'mean'}).show(5,truncate=False)
df.groupBy(['Category','Register']).agg({'NetValue':'sum',
'Units': lambda x: pd.Series(x).nunique()}).show(5,truncate=False)
The first statement is correct, but the second one fails. The error message is:
AttributeError: 'function' object has no attribute '_get_object_id'
It looks like I did not use the lambda function correctly, but this is how I would use a lambda in a normal Python environment, where it works.
Could anyone help me here?
If you are okay with the performance of PySpark primitives using pure Python functions, the following code gives the desired result. You can modify the logic in _map to suit your specific need. I made some assumptions about what your data schema might look like.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
schema = StructType([
StructField('Category', StringType(), True),
StructField('Register', LongType(), True),
StructField('NetValue', LongType(), True),
StructField('Units', LongType(), True)
])
test_records = [
{'Category': 'foo', 'Register': 1, 'NetValue': 1, 'Units': 1},
{'Category': 'foo', 'Register': 1, 'NetValue': 2, 'Units': 2},
{'Category': 'foo', 'Register': 2, 'NetValue': 3, 'Units': 3},
{'Category': 'foo', 'Register': 2, 'NetValue': 4, 'Units': 4},
{'Category': 'bar', 'Register': 1, 'NetValue': 5, 'Units': 5},
{'Category': 'bar', 'Register': 1, 'NetValue': 6, 'Units': 6},
{'Category': 'bar', 'Register': 2, 'NetValue': 7, 'Units': 7},
{'Category': 'bar', 'Register': 2, 'NetValue': 8, 'Units': 8}
]
spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame(test_records, schema)
def _map(group):
    # group is ((category, register), iterable_of_records)
    (category, register), records = group
    net_value_sum = 0
    uniques = set()
    for record in records:
        net_value_sum += record['NetValue']
        uniques.add(record['Units'])
    return category, register, net_value_sum, len(uniques)
new_dataframe = spark.createDataFrame(
dataframe.rdd.groupBy(lambda x: (x['Category'], x['Register'])).map(_map),
schema
)
new_dataframe.show()
Result:
+--------+--------+--------+-----+
|Category|Register|NetValue|Units|
+--------+--------+--------+-----+
| bar| 2| 15| 2|
| foo| 1| 3| 2|
| foo| 2| 7| 2|
| bar| 1| 11| 2|
+--------+--------+--------+-----+
If you need performance or to stick with the pyspark.sql framework, then see this related question and its linked questions:
Custom aggregation on PySpark dataframes
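For the original question itself, the usual DataFrame-only fix is to build the aggregation from pyspark.sql.functions rather than a Python lambda, since agg with a dict only accepts built-in aggregate names as strings. A sketch against the test dataframe above, assuming a distinct count is what pd.Series(x).nunique() was meant to compute:
from pyspark.sql import functions as F

dataframe.groupBy('Category', 'Register').agg(
    F.sum('NetValue').alias('NetValue'),        # same as {'NetValue': 'sum'}
    F.countDistinct('Units').alias('Units')     # replaces the lambda
).show(5, truncate=False)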
I am new to PySpark and want to create a dictionary from a PySpark DataFrame. I have working pandas code, but I need the equivalent in PySpark and somehow I am not able to figure out how to do it.
df = spark.createDataFrame([
(11, 101, 5.9),
(11, 102, 5.4),
(22, 111, 5.2),
(22, 112, 5.9),
(22, 101, 5.7),
(33, 101, 5.2),
(44, 102, 5.3),
], ['user_id', 'team_id', 'height'])
df = df.select(['user_id', 'team_id'])
df.show()
+-------+-------+
|user_id|team_id|
+-------+-------+
| 11| 101|
| 11| 102|
| 22| 111|
| 22| 112|
| 22| 101|
| 33| 101|
| 44| 102|
+-------+-------+
df.toPandas().groupby('user_id')['team_id'].apply(list).to_dict()
Result:
{11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
I am looking for an efficient way in PySpark to create the above dictionary.
You can aggregate the team_id column as a list and then collect the RDD as a dictionary using the collectAsMap method:
import pyspark.sql.functions as F
df.groupBy("user_id").agg(F.collect_list("team_id")).rdd.collectAsMap()
# {33: [101], 11: [101, 102], 44: [102], 22: [111, 112, 101]}
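If you prefer to stay in the DataFrame API and skip the RDD conversion, an equivalent sketch is to collect the aggregated rows and build the dictionary on the driver:
grouped = df.groupBy("user_id").agg(F.collect_list("team_id").alias("team_ids"))
result = {row["user_id"]: row["team_ids"] for row in grouped.collect()}
# e.g. {11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}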
I have a list of maps, e.g.
[{'a': 10, 'b': 20}, {'a': 5, 'b': 20}, {'b': 20}, {'a': 0, 'b': 20}]
I want to get the average of the values of a and b. So the expected output is
a = (10 + 5 + 0) / 3 = 5
b = 80 / 4 = 20
How can I do this efficiently using an RDD?
The easiest might be to map each RDD element to a format like:
init = {'a': {'sum': 0, 'cnt': 0}, 'b': {'sum': 0, 'cnt': 0}}
i.e. record the sum and count for each key, and then reduce it.
Map function:
def map_fun(d, keys=['a', 'b']):
    map_d = {}
    for k in keys:
        if k in d:
            temp = {'sum': d[k], 'cnt': 1}
        else:
            temp = {'sum': 0, 'cnt': 0}
        map_d[k] = temp
    return map_d
Reduce function:
def reduce_fun(a, b, keys=['a', 'b']):
    from collections import defaultdict
    reduce_d = defaultdict(dict)
    for k in keys:
        reduce_d[k]['sum'] = a[k]['sum'] + b[k]['sum']
        reduce_d[k]['cnt'] = a[k]['cnt'] + b[k]['cnt']
    return reduce_d
rdd.map(map_fun).reduce(reduce_fun)
# defaultdict(<type 'dict'>, {'a': {'sum': 15, 'cnt': 3}, 'b': {'sum': 80, 'cnt': 4}})
Calculate the average:
d = rdd.map(map_fun).reduce(reduce_fun)
{k: v['sum']/v['cnt'] for k, v in d.items()}
{'a': 5, 'b': 20}
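A more Spark-idiomatic variant of the same idea, as a sketch assuming the same rdd of dictionaries: flatten each dict into (key, value) pairs and let reduceByKey accumulate the per-key sum and count (float() keeps the division exact on Python 2 as well):
averages = (rdd
            .flatMap(lambda d: d.items())                          # (key, value) pairs
            .mapValues(lambda v: (v, 1))                           # (key, (value, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # per-key (sum, count)
            .mapValues(lambda sc: float(sc[0]) / sc[1])            # per-key average
            .collectAsMap())
# {'a': 5.0, 'b': 20.0}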
Given the structure of your data, you should be able to use the DataFrame API for this calculation. If you need an RDD, it is not too hard to get from the DataFrame back to an RDD.
from pyspark.sql import functions as F
df = spark.createDataFrame([{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}])
The DataFrame looks like this:
+----+---+
| a| b|
+----+---+
| 10| 20|
| 5| 20|
|null| 20|
| 0| 20|
+----+---+
Then it is simple to calculate the averages using the pyspark.sql functions:
cols = df.columns
df_means = df.agg(*[F.mean(F.col(col)).alias(col+"_mean") for col in cols])
df_means.show()
OUTPUT:
+------+------+
|a_mean|b_mean|
+------+------+
| 5.0| 20.0|
+------+------+
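And, as mentioned above, it is easy to get back out of the DataFrame; a sketch for turning the means into a plain Python dict, or into an RDD:
# plain dict on the driver, e.g. {'a_mean': 5.0, 'b_mean': 20.0}
means = df_means.first().asDict()

# or keep it distributed as an RDD of Rows
means_rdd = df_means.rdd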
You can use a defaultdict to collect matching keys and their values into lists, then aggregate by dividing the sum of each list by its number of elements.
from collections import defaultdict
x = [{'a' : 10,'b': 20}, {'a' : 5,'b': 20} , {'b': 20} ,{'a' : 0,'b': 20}]
y = defaultdict(lambda: [])
[y[k].append(v) for i in x for k,v in i.items() ]
for k, v in y.items():
    print k, "=", sum(v)/len(v)
>>> y
defaultdict(<function <lambda> at 0x02A43BB0>, {'a': [10, 5, 0], 'b': [20, 20, 20, 20]})
a = 5
b = 20