I have a PySpark DataFrame and I want to map values of a column.
Sample dataset:
data = [(1, 'N'),
        (2, 'N'),
        (3, 'C'),
        (4, 'S'),
        (5, 'North'),
        (6, 'Central'),
        (7, 'Central'),
        (8, 'South')]
columns = ["ID", "City"]
df = spark.createDataFrame(data=data, schema=columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
And the results are:
+---+-------+--------+
| ID|   City|City_New|
+---+-------+--------+
|  1|      N|   North|
|  2|      N|   North|
|  3|      C| Central|
|  4|      S|   South|
|  5|  North|    null|
|  6|Central|    null|
|  7|Central|    null|
|  8|  South|    null|
+---+-------+--------+
As you can see, I get null values for the rows whose City value is not included in the mapping dictionary. To solve this, I could define the mapping dictionary as:
{'N': 'North', 'C': 'Central', 'S': 'South',
 'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it is hard to define such a mapping dictionary by hand.
Is there any better way for this purpose?
You can use coalesce. Here's how it would look, using `data_sdf` for your DataFrame with lowercase column names.
from pyspark.sql import functions as func

map_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}

# create a separate CASE WHEN for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
# Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
# Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass the CASE WHENs to coalesce, with the `city` field as the final fallback
data_sdf \
    .withColumn('city_new', func.coalesce(*map_whens, 'city')) \
    .show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
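If you'd rather keep your create_map lookup, you can also wrap it in coalesce so that unmapped values fall back to the original column. A minimal sketch built on the question's own code:

from pyspark.sql import functions as F
from itertools import chain

mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])

# fall back to the original City value when the key is not in the map
df_new = df.withColumn('City_New', F.coalesce(mapping_expr[F.col('City')], F.col('City')))

df_new.show() should then give North/Central/South for every row, since the unmapped values are already the full names.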
I've got this DataFrame with four columns:
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
And I also have another DataFrame, df2, with the same schema as df1:
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
+---+---+---+---+
I want to compare on the columns (a, b, d) so that I can obtain the rows that are present in df2 but not in df1, like this:
df3
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
+---+---+---+---+
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
This gives you what is in df2 that is not in both df1 and df2:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
+---+---+---+---+
I also agree with @pltc's call-out that you might have made a mistake in your expected output table.
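If the comparison really should only use the columns (a, b, d) and ignore c, a left anti join on those columns is one way to sketch it (my addition, not part of the answer above): it keeps only the rows of df2 whose (a, b, d) combination does not appear anywhere in df1.

# rows of df2 with no matching (a, b, d) in df1; only df2's columns are returned
df3 = df2.join(df1, on=['a', 'b', 'd'], how='left_anti')
df3.show()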
I'm trying to calculate the proportion of a specific value occurring in a specific column within subgroups.
Sample dataframe
pdf = pd.DataFrame({
'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'letter': ['L', 'A', 'L', 'L', 'L', 'L', 'L', 'A', 'L', 'L']
})
df = spark.createDataFrame(pdf)
df.show()
I tried to rely on this answer, using the following code:
from pyspark.sql.functions import col, count

df\
    .groupby('id')\
    .agg((count(col('letter') == 'L') / count(col('letter'))).alias('prop'))\
    .show()
I obtained a column full of 1.0, even when I changed 'L' to 'A'.
My desired output is, for each group, the proportion of 'L' values within the group:
+---+--------+
| id| prop|
+---+--------+
| 1| 0.75|
| 2| 1.0|
| 3| 0.66667|
+---+--------+
You can use sum with when instead to count the occurrences of L:
from pyspark.sql import functions as F

df.groupby('id')\
    .agg((F.sum(F.when(F.col('letter') == 'L', 1)) / F.count(F.col('letter'))).alias('prop'))\
    .show()
This will give you the proportion only among non-null values. If you want to calculate it over all rows, divide by count("*") instead of count(col('letter')).
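One more thing worth noting (my addition, not part of the answer): if a group contains no 'L' at all, the sum of a when with no otherwise is null, so the proportion comes out as null rather than 0.0. Adding .otherwise(0) avoids that:

from pyspark.sql import functions as F

df.groupby('id')\
    .agg((F.sum(F.when(F.col('letter') == 'L', 1).otherwise(0)) / F.count(F.col('letter'))).alias('prop'))\
    .show()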
Before you count, you need to mask the non-L letters with nulls using when:
from pyspark.sql.functions import col, count, when

df\
    .groupby('id')\
    .agg((count(when(col('letter') == 'L', 1)) / count(col('letter'))).alias('prop'))\
    .show()
Note that count only counts non-null entries; it does not count only true entries, as your code assumed. Your code would be more suitable if you were using count_if from Spark SQL.
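For reference, a rough sketch of that count_if variant, assuming Spark 3.0+ where count_if is available as a SQL function:

from pyspark.sql import functions as F

df.groupby('id')\
    .agg((F.expr("count_if(letter = 'L')") / F.count('letter')).alias('prop'))\
    .show()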
I would like to generate all possible combinations of a given character list with a given length and exclude a few combinations. For example, if I have this list:
chars = ['a', 'b', 'c', '1', '2']
Now I want to exclude formations of more than 2 of the same character in a row, so that combinations like aaaaa or 11111 aren't possible. I also want the output to be a given length, for example 5 characters. Is this possible? I thought of itertools.
Thanks for any help in advance.
import itertools

chars = ['a', 'b', 'c', '1', '2']

for combination in itertools.product(chars, repeat=5):
    if all(combination.count(x) < 3 for x in combination):
        print(combination)
Output:
('c', '1', '1', '2', 'c')
('c', '1', '1', '2', '2')
('c', '1', '2', 'a', 'a')
('c', '1', '2', 'a', 'b')
('c', '1', '2', 'a', 'c')
('c', '1', '2', 'a', '1')
('c', '1', '2', 'a', '2')
('c', '1', '2', 'b', 'a')
('c', '1', '2', 'b', 'b')
('c', '1', '2', 'b', 'c')
('c', '1', '2', 'b', '1')
('c', '1', '2', 'b', '2')
('c', '1', '2', 'c', 'a')
('c', '1', '2', 'c', 'b')
('c', '1', '2', 'c', '1')
('c', '1', '2', 'c', '2')
('c', '1', '2', '1', 'a')
('c', '1', '2', '1', 'b')
('c', '1', '2', '1', 'c')
('c', '1', '2', '1', '2')
('c', '1', '2', '2', 'a')
('c', '1', '2', '2', 'b')
('c', '1', '2', '2', 'c')
('c', '1', '2', '2', '1')
('c', '2', 'a', 'a', 'b')
('c', '2', 'a', 'a', 'c')
('c', '2', 'a', 'a', '1')
('c', '2', 'a', 'a', '2')
('c', '2', 'a', 'b', 'a')
('c', '2', 'a', 'b', 'b')
('c', '2', 'a', 'b', 'c')
('c', '2', 'a', 'b', '1')
('c', '2', 'a', 'b', '2')
('c', '2', 'a', 'c', 'a')
('c', '2', 'a', 'c', 'b')
('c', '2', 'a', 'c', '1')
('c', '2', 'a', 'c', '2')
('c', '2', 'a', '1', 'a')
('c', '2', 'a', '1', 'b')
('c', '2', 'a', '1', 'c')
('c', '2', 'a', '1', '1')
('c', '2', 'a', '1', '2')
('c', '2', 'a', '2', 'a')
('c', '2', 'a', '2', 'b')
('c', '2', 'a', '2', 'c')
('c', '2', 'a', '2', '1')
('c', '2', 'b', 'a', 'a')
('c', '2', 'b', 'a', 'b')
('c', '2', 'b', 'a', 'c')
('c', '2', 'b', 'a', '1')
('c', '2', 'b', 'a', '2')
('c', '2', 'b', 'b', 'a')
('c', '2', 'b', 'b', 'c')
('c', '2', 'b', 'b', '1')
('c', '2', 'b', 'b', '2')
('c', '2', 'b', 'c', 'a')
('c', '2', 'b', 'c', 'b')
('c', '2', 'b', 'c', '1')
('c', '2', 'b', 'c', '2')
('c', '2', 'b', '1', 'a')
('c', '2', 'b', '1', 'b')
('c', '2', 'b', '1', 'c')
('c', '2', 'b', '1', '1')
('c', '2', 'b', '1', '2')
('c', '2', 'b', '2', 'a')
('c', '2', 'b', '2', 'b')
('c', '2', 'b', '2', 'c')
('c', '2', 'b', '2', '1')
('c', '2', 'c', 'a', 'a')
('c', '2', 'c', 'a', 'b')
('c', '2', 'c', 'a', '1')
('c', '2', 'c', 'a', '2')
('c', '2', 'c', 'b', 'a')
('c', '2', 'c', 'b', 'b')
('c', '2', 'c', 'b', '1')
('c', '2', 'c', 'b', '2')
etc...
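Note that the check above rejects any character that appears three or more times in total, anywhere in the combination. If you only want to rule out the same character appearing more than twice consecutively (one reading of "in a row"; this is my addition, not the answer above), a run-length check with itertools.groupby is one way to sketch it:

import itertools

chars = ['a', 'b', 'c', '1', '2']

for combination in itertools.product(chars, repeat=5):
    # groupby collapses consecutive runs of equal characters; reject any run of length 3+
    if all(len(list(run)) < 3 for _, run in itertools.groupby(combination)):
        print(combination)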
I have multiple time series stored in a Spark DataFrame as below:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
I am looking for a way (without looping as my real DataFrame has millions of rows) to remove the 0's at the end of each time series.
In our example, we would obtain:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
Assuming you want to remove the trailing zeros for every country, ordered by date:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-13', 'India', 1),
('2020-04-14', 'India', 0),
('2020-04-15', 'India', 0),
('2020-04-16', 'India', 1),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
# take absolute values so that negatives cannot cancel out to an accidental 0 sum
df = df.withColumn('y1', F.abs(F.col('y')))

# window per country, ordered by date descending, so the running sum starts from the last row
w = Window.partitionBy('country').orderBy(F.col('date').desc())

# reverse cumulative sum: it stays 0 only while we are still inside the trailing zeros
df_sum = df.withColumn("sum_chk", F.sum('y1').over(w))

# filter out the trailing zeros (sum_chk = 0); sort just for viewing
df_res = df_sum.where("sum_chk != 0").orderBy('date', ascending=True)
The result:
df_res.show()
+----------+-------+---+---+-------+
| date|country| y| y1|sum_chk|
+----------+-------+---+---+-------+
|2020-03-10| France| 19| 19| 41|
|2020-03-11| France| 22| 22| 22|
|2020-04-08| Japan| 0| 0| 5|
|2020-04-09| Japan| -3| 3| 5|
|2020-04-10| Japan| -2| 2| 2|
|2020-04-10| UK| 12| 12| 21|
|2020-04-11| UK| 0| 0| 9|
|2020-04-12| UK| 9| 9| 9|
|2020-04-13| India| 1| 1| 2|
|2020-04-14| India| 0| 0| 1|
|2020-04-15| India| 0| 0| 1|
|2020-04-16| India| 1| 1| 1|
+----------+-------+---+---+-------+
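An alternative sketch (my assumption, not part of the answer above): keep each row only if its date is not after the last date with a non-zero y in its country. Since the dates are 'yyyy-MM-dd' strings, the string comparison is chronological.

from pyspark.sql import functions as F
from pyspark.sql import Window

# last date with a non-zero y, broadcast to every row of its country
last_nonzero = F.max(F.when(F.col('y') != 0, F.col('date'))).over(Window.partitionBy('country'))

df_res2 = df.withColumn('last_nonzero_date', last_nonzero) \
    .where(F.col('date') <= F.col('last_nonzero_date')) \
    .drop('last_nonzero_date')
df_res2.orderBy('country', 'date').show()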