Create multidict from pyspark dataframe - python

I am new to PySpark and want to create a dictionary from a PySpark DataFrame. I have working pandas code, but I need the equivalent command in PySpark and I am not able to figure out how to do it.
df = spark.createDataFrame([
(11, 101, 5.9),
(11, 102, 5.4),
(22, 111, 5.2),
(22, 112, 5.9),
(22, 101, 5.7),
(33, 101, 5.2),
(44, 102, 5.3),
], ['user_id', 'team_id', 'height'])
df = df.select(['user_id', 'team_id'])
df.show()
+-------+-------+
|user_id|team_id|
+-------+-------+
| 11| 101|
| 11| 102|
| 22| 111|
| 22| 112|
| 22| 101|
| 33| 101|
| 44| 102|
+-------+-------+
df.toPandas().groupby('user_id')['team_id'].apply(list).to_dict()
Result:
{11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
Looking for efficient way in pyspark to create the above multidict.

You can aggregate the team_id column as a list and then collect the RDD as a dictionary using the collectAsMap method:
import pyspark.sql.functions as F
df.groupBy("user_id").agg(F.collect_list("team_id")).rdd.collectAsMap()
# {33: [101], 11: [101, 102], 44: [102], 22: [111, 112, 101]}
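Note that collectAsMap pulls the whole aggregated result onto the driver. If you would rather stay in the DataFrame API, a minimal equivalent sketch (assuming the same df as above) is:
import pyspark.sql.functions as F

# aggregate the team ids per user, then build the dict on the driver
agg = df.groupBy("user_id").agg(F.collect_list("team_id").alias("team_ids"))
multidict = {row["user_id"]: row["team_ids"] for row in agg.collect()}
# e.g. {11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]} (list order is not guaranteed)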

Related

How to map a column in PySpark DataFrame and avoid getting Null values?

I have a PySpark DataFrame and I want to map values of a column.
Sample dataset:
data = [(1, 'N'),
        (2, 'N'),
        (3, 'C'),
        (4, 'S'),
        (5, 'North'),
        (6, 'Central'),
        (7, 'Central'),
        (8, 'South')]
columns = ["ID", "City"]
df = spark.createDataFrame(data = data, schema = columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
The results show null values in City_New for the rows whose City values are not keys in the mapping dictionary. To solve this, I can define the mapping dictionary as:
{'N': 'North', 'C': 'Central', 'S': 'South',
 'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it is hard to define such a mapping dictionary.
Is there a better way to do this?
You can use coalesce. Here is how it would look (here data_sdf is the question's DataFrame, with lower-cased column names as in the output below):
from pyspark.sql import functions as func

map_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}

# create a separate case-when for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
#  Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
#  Column<'CASE WHEN (upper(city) = S) THEN South END'>]

# pass the case-whens to coalesce, with the `city` field as the final fallback
data_sdf. \
    withColumn('city_new', func.coalesce(*map_whens, 'city')). \
    show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
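Alternatively, the original create_map expression from the question can be kept and simply wrapped in F.coalesce so that unmapped values fall back to the original column (a minimal sketch, assuming the question's df and mapping_dict):
from pyspark.sql import functions as F
from itertools import chain

mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])

# fall back to the original City value when it has no entry in the map
df_new = df.withColumn('City_New', F.coalesce(mapping_expr[F.col('City')], F.col('City')))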

Compare a couple of columns from two different PySpark DataFrames to display the rows that differ

I've got this DataFrame with four columns:
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
and I also have this other DataFrame, df2, with the same schema as df1:
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
+---+---+---+---+
I want to compare the tuple of columns (a, b, d) so that I can obtain the rows that are present in df2 but not in df1, like this:
df3
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
+---+---+---+---+
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
This returns what is in df2 that is not in both df1 and df2:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
+---+---+---+---+
I also agree with @pltc's call-out that you might have made a mistake in your expected output table.
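If the comparison should be done only on the columns (a, b, d) and ignore c, a left anti join is a minimal sketch of that (assuming the same df1 and df2 as above); it reproduces the expected df3:
# keep the rows of df2 whose (a, b, d) combination does not appear in df1
df3 = df2.join(df1, on=['a', 'b', 'd'], how='left_anti').select('a', 'b', 'c', 'd')
df3.show()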

Sort by key (Month) using RDDs in Pyspark

I have this RDD and want to sort it by Month (Jan -> Dec). How can I do it in PySpark?
Note: I don't want to use spark.sql or DataFrames.
+-----+-----+
|Month|count|
+-----+-----+
| Oct| 1176|
| Sep| 1167|
| Dec| 2084|
| Aug| 1126|
| May| 1176|
| Jun| 1424|
| Feb| 1286|
| Nov| 1078|
| Mar| 1740|
| Jan| 1544|
| Apr| 1080|
| Jul| 1237|
+-----+-----+
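For reference, this RDD (called myrdd below) can be rebuilt with sc.parallelize, assuming an active SparkContext named sc:
myrdd = sc.parallelize([
    ('Oct', 1176), ('Sep', 1167), ('Dec', 2084), ('Aug', 1126),
    ('May', 1176), ('Jun', 1424), ('Feb', 1286), ('Nov', 1078),
    ('Mar', 1740), ('Jan', 1544), ('Apr', 1080), ('Jul', 1237),
])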
You can use rdd.sortBy with a helper dictionary built from Python's calendar module, or create your own month dictionary:
import calendar
d = {i:e for e,i in enumerate(calendar.month_abbr[1:],1)}
#{'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
#'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
myrdd.sortBy(keyfunc=lambda x: d.get(x[0])).collect()
[('Jan', 1544),
('Feb', 1286),
('Mar', 1740),
('Apr', 1080),
('May', 1176),
('Jun', 1424),
('Jul', 1237),
('Aug', 1126),
('Sep', 1167),
('Oct', 1176),
('Nov', 1078),
('Dec', 2084)]
Alternatively, collect the RDD and rebuild the ordered list in plain Python:
myList = myrdd.collect()
my_list_dict = dict(myList)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
newList = []
for m in months:
    newList.append((m, my_list_dict[m]))
print(newList)

Creating a dataframe from Lists and string values in pyspark

I have 3 string values and 4 lists from which I'm trying to create a DataFrame.
Lists -
>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']
>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
>>> this_exp_success
[True, True, True, False, False, False]
Strings -
>>> is_overall_success
'False'
>>> data_source
'stores'
>>>
>>> run_time
'2022-02-24T05:43:16.678220+00:00'
Trying to create a dataframe as below.
columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']
dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)
Error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
struct = self._inferSchemaFromList(data, names=schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
for f in a.fields]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
_merge_type(a.valueType, b.valueType, name='value of map %s' % name),
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
Expected Output
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
| data_source | run_time |exp_type |expectations |results |this_exp_success|is_overall_success|
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'country'} |{} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_not_be_null |{u'column': u'country'} |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_be_of_type |{u'column': u'country', u'type_': u'StringType'} |{u'observed_value': u'StringType'} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'countray'} |{} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_table_row_count_to_equal |{u'value': 10} |{u'observed_value': 102} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_sum_to_be_between |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100} |{u'observed_value': 22075.0} |False |False |
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
The original error occurs because Spark tries to infer a single value type for the results maps, whose values mix integers, floats, strings, and lists, and the inferred types cannot be merged. A solution to your problem would be the following:
Zip only the lists, declare an explicit schema, and add the constant strings afterwards as Spark literals.
exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
u'expect_column_sum_to_be_between']
expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
{u'column': u'countray'}, {u'value': 10},
{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
{u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
from pyspark.sql.types import StructType, StructField, StringType, MapType
from pyspark.sql.functions import lit

df_schema = StructType([StructField("exp_type", StringType(), True),
                        StructField("expectations", MapType(StringType(), StringType()), True),
                        StructField("results", StringType(), True),
                        StructField("this_exp_success", StringType(), True)
                        ])
this_exp_success = [True, True, True, False, False, False]
result = spark.createDataFrame(
zip(exp_type, expectations, results, this_exp_success), df_schema)\
.withColumn('is_overall_success', lit('False'))\
.withColumn('data_source', lit('stores'))\
.withColumn('run_time', lit('2022-02-24T05:43:16.678220+00:00'))
result.show(truncate=False)
And the result shows:
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|exp_type |expectations |results |this_exp_success|is_overall_success|data_source|run_time |
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|expect_column_to_exist |[column -> country] |{} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_not_be_null|[column -> country] |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType] |{observed_value=StringType} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_to_exist |[column -> countray] |{} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_table_row_count_to_equal |[value -> 10] |{observed_value=102} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_sum_to_be_between |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
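If the mixed value types keep causing schema-inference trouble, another option (a sketch under the assumption that plain JSON strings are acceptable, not the approach shown above) is to serialize the dict columns with json.dumps before building the DataFrame, so no map schema is needed at all:
import json
from pyspark.sql.functions import lit

rows = zip(exp_type,
           [json.dumps(e) for e in expectations],   # dicts become JSON strings
           [json.dumps(r) for r in results],
           [str(s) for s in this_exp_success])
result_json = spark.createDataFrame(rows, ['exp_type', 'expectations', 'results', 'this_exp_success']) \
    .withColumn('data_source', lit('stores')) \
    .withColumn('run_time', lit('2022-02-24T05:43:16.678220+00:00')) \
    .withColumn('is_overall_success', lit('False'))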

Removing 0 at the end of multiple time series

I have multiple time series stored in a Spark DataFrame as below:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
I am looking for a way (without looping, as my real DataFrame has millions of rows) to remove the trailing 0's at the end of each time series.
In our example, we would obtain:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
Assuming you want to remove the trailing zeros at the end of each country's series when ordered by date:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-13', 'India', 1),
('2020-04-14', 'India', 0),
('2020-04-15', 'India', 0),
('2020-04-16', 'India', 1),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
# convert negatives to positives so that mixed-sign values cannot accidentally sum to 0
df = df.withColumn('y1', F.abs(F.col('y')))
# window that walks each country's rows from the latest date backwards
w = Window.partitionBy('country').orderBy(F.col('date').desc())
# running sum in that reversed order: trailing zeros keep sum_chk at 0,
# and it turns non-zero once the first non-zero value (counting from the end) is reached
df_sum = df.withColumn("sum_chk", F.sum('y1').over(w))
# keep only the rows where the running sum is non-zero; sort just for viewing
df_res = df_sum.where("sum_chk != 0").orderBy('date', ascending=True)
The result:
df_res.show()
+----------+-------+---+---+-------+
| date|country| y| y1|sum_chk|
+----------+-------+---+---+-------+
|2020-03-10| France| 19| 19| 41|
|2020-03-11| France| 22| 22| 22|
|2020-04-08| Japan| 0| 0| 5|
|2020-04-09| Japan| -3| 3| 5|
|2020-04-10| Japan| -2| 2| 2|
|2020-04-10| UK| 12| 12| 21|
|2020-04-11| UK| 0| 0| 9|
|2020-04-12| UK| 9| 9| 9|
|2020-04-13| India| 1| 1| 2|
|2020-04-14| India| 0| 0| 1|
|2020-04-15| India| 0| 0| 1|
|2020-04-16| India| 1| 1| 1|
+----------+-------+---+---+-------+
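The original three columns can be recovered with df_res.select('date', 'country', 'y'). An alternative sketch of the same idea (assuming the same df) keeps each country's rows up to its last date with a non-zero y:
from pyspark.sql import Window, functions as F

w = Window.partitionBy('country')
# last date on which y is non-zero, per country (null if the whole series is zero)
last_nonzero = F.max(F.when(F.col('y') != 0, F.col('date'))).over(w)
df_res2 = df.withColumn('last_nonzero_date', last_nonzero) \
    .where(F.col('date') <= F.col('last_nonzero_date')) \
    .drop('last_nonzero_date')
# countries whose series is all zeros are dropped entirely, as in the approach above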
