I am new to pyspark and want to create a dictionary from a pyspark dataframe. I do have a working pandas code but I need an equivalent command in pyspark and somehow I am not able to figure out how to do it.
df = spark.createDataFrame([
(11, 101, 5.9),
(11, 102, 5.4),
(22, 111, 5.2),
(22, 112, 5.9),
(22, 101, 5.7),
(33, 101, 5.2),
(44, 102, 5.3),
], ['user_id', 'team_id', 'height'])
df = df.select(['user_id', 'team_id'])
df.show()
-------+-------+
|user_id|team_id|
+-------+-------+
| 11| 101|
| 11| 102|
| 22| 111|
| 22| 112|
| 22| 101|
| 33| 101|
| 44| 102|
+-------+-------+
df.toPandas().groupby('user_id')[
'team_id'].apply(list).to_dict()
Result:
{11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
Looking for efficient way in pyspark to create the above multidict.
You can aggregate the team_id column as list and then collect the rdd as dictionary using collectAsMap method:
mport pyspark.sql.functions as F
df.groupBy("user_id").agg(F.collect_list("team_id")).rdd.collectAsMap()
# {33: [101], 11: [101, 102], 44: [102], 22: [111, 112, 101]}
Related
I have a PySpark DataFrame and I want to map values of a column.
Sample dataset:
data = [(1, 'N'), \
(2, 'N'), \
(3, 'C'), \
(4, 'S'), \
(5, 'North'), \
(6, 'Central'), \
(7, 'Central'), \
(8, 'South')
]
columns = ["ID", "City"]
df = spark.createDataFrame(data = data, schema = columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
And the results are:
As you can see, I get Null values for rows which I don't include their values in the mapping dictionary. To solve this, I can define mapping dictionary by:
{'N': 'North', 'C': 'Central', 'S': 'South', \
'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it is hard to define a mapping dictionary.
Is there any better way for this purpose?
you can use a coalesce.
here's how it'd look like.
# create separate case whens for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
# Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
# Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass case whens to coalesce with last value as `city` field
data_sdf. \
withColumn('city_new', func.coalesce(*map_whens, 'city')). \
show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
i've got this dataframe with four columns
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
and i also got this other dataframe df2 with the same schema as the dataframe df1
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
+---+---+---+---+
I want to compare the couple (a, b, d) so that i can obtain the different values that are present in df2 but not in df1 like this
df3
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
+---+---+---+---+
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
I want what is in df2 that is not in both df1 and df2.
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
+---+---+---+---+
I also agree with #pltc that call out you might have made a mistake in your output table.
I have this RDD and wanna sort it by Month (Jan --> Dec). How can i do it in pyspark?
Note: Don't want to use spark.sql or Dataframe.
+-----+-----+
|Month|count|
+-----+-----+
| Oct| 1176|
| Sep| 1167|
| Dec| 2084|
| Aug| 1126|
| May| 1176|
| Jun| 1424|
| Feb| 1286|
| Nov| 1078|
| Mar| 1740|
| Jan| 1544|
| Apr| 1080|
| Jul| 1237|
+-----+-----+
You can use rdd.sortBy with a helper dictionary available in python's calendar module or create your own month dictionary:
import calendar
d = {i:e for e,i in enumerate(calendar.month_abbr[1:],1)}
#{'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
#'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
myrdd.sortBy(keyfunc=lambda x: d.get(x[0])).collect()
[('Jan', 1544),
('Feb', 1286),
('Mar', 1740),
('Apr', 1080),
('May', 1176),
('Jun', 1424),
('Jul', 1237),
('Aug', 1126),
('Sep', 1167),
('Oct', 1176),
('Nov', 1078),
('Dec', 2084)]
myList = myrdd.collect()
my_list_dict = dict(myList)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
newList = []
for m in months:
newList.append((m, my_list_dict[m]))
print(newList)
I have 3 string values and 4 lists that Im trying to create a dataframe.
Lists -
>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']
>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
>>> this_exp_success
[True, True, True, False, False, False]
Strings -
>>> is_overall_success
'False'
>>> data_source
'stores'
>>>
>>> run_time
'2022-02-24T05:43:16.678220+00:00'
Trying to create a dataframe as below.
columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']
dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)
Error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
struct = self._inferSchemaFromList(data, names=schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
for f in a.fields]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
_merge_type(a.valueType, b.valueType, name='value of map %s' % name),
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
Expected Output
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
| data_source | run_time |exp_type |expectations |results |this_exp_success|is_overall_success|
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'country'} |{} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_not_be_null |{u'column': u'country'} |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_values_to_be_of_type |{u'column': u'country', u'type_': u'StringType'} |{u'observed_value': u'StringType'} |True |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_to_exist |{u'column': u'countray'} |{} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_table_row_count_to_equal |{u'value': 10} |{u'observed_value': 102} |False |False |
|stores |2022-02-24T05:43:16.678220+00:00|expect_column_sum_to_be_between |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100} |{u'observed_value': 22075.0} |False |False |
+---------------+--------------------------------+--------------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+
A solution to your problem would be the following:
Zip the lists and append the strings as spark literals.
exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
u'expect_column_sum_to_be_between']
expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
{u'column': u'countray'}, {u'value': 10},
{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
{u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
df_schema = StructType([StructField("exp_type", StringType(), True),
StructField("expectations", MapType(StringType(), StringType()), True),
StructField("results", StringType(), True),
StructField("this_exp_success", StringType(), True)
])
this_exp_success = [True, True, True, False, False, False]
result = spark.createDataFrame(
zip(exp_type, expectations, results, this_exp_success), df_schema)\
.withColumn('is_overall_success', lit('False'))\
.withColumn('data_source', lit('stores'))\
.withColumn('run_time', lit('2022-02-24T05:43:16.678220+00:00'))
result.show(truncate=False)
And the result is showing:
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|exp_type |expectations |results |this_exp_success|is_overall_success|data_source|run_time |
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
|expect_column_to_exist |[column -> country] |{} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_not_be_null|[column -> country] |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType] |{observed_value=StringType} |true |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_to_exist |[column -> countray] |{} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_table_row_count_to_equal |[value -> 10] |{observed_value=102} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
|expect_column_sum_to_be_between |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0} |false |False |stores |2022-02-24T05:43:16.678220+00:00|
+-----------------------------------+--------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------------+-----------+--------------------------------+
I have multiple time series stored in a Spark DataFrame as below:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
I am looking for a way (without looping as my real DataFrame has millions of rows) to remove the 0's at the end of each time series.
In our example, we would obtain:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
Assume you want to remove at the end of every country ordered by date
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-13', 'India', 1),
('2020-04-14', 'India', 0),
('2020-04-15', 'India', 0),
('2020-04-16', 'India', 1),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
# convert negative to positive to avoid accidental summing up to 0
df=df.withColumn('y1',F.abs(F.col('y')))
# Window function to reverse the last rows to first
w=Window.partitionBy('country').orderBy(F.col('date').desc())
# Start summing function. when the first non zero value comes the value changes
df_sum = df.withColumn("sum_chk",F.sum('y1').over(w))
# Filter non zero values, sort it just for viewing
df_res = df_sum.where("sum_chk!=0").orderBy('date',ascending=True)
The result:
df_res.show()
+----------+-------+---+---+-------+
| date|country| y| y1|sum_chk|
+----------+-------+---+---+-------+
|2020-03-10| France| 19| 19| 41|
|2020-03-11| France| 22| 22| 22|
|2020-04-08| Japan| 0| 0| 5|
|2020-04-09| Japan| -3| 3| 5|
|2020-04-10| Japan| -2| 2| 2|
|2020-04-10| UK| 12| 12| 21|
|2020-04-11| UK| 0| 0| 9|
|2020-04-12| UK| 9| 9| 9|
|2020-04-13| India| 1| 1| 2|
|2020-04-14| India| 0| 0| 1|
|2020-04-15| India| 0| 0| 1|
|2020-04-16| India| 1| 1| 1|
+----------+-------+---+---+-------+