Pyspark - replace values in column with dictionary - python

I'm trying to avoid repeating the .when function 12 times, so I thought about using a dictionary. I don't know whether it's a limitation of the Spark function or a logic error on my part. Does the function allow this kind of chaining?
months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
          '7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}

for num, month in months.items():
    custoDF1 = custoDF.\
        withColumn("Month",
                   when(col("Nummes") == num, month)
                   .otherwise(month))

custoDF1.select(col('Nummes').alias('NumMonth'), 'month').distinct().orderBy("NumMonth").show(200)

You can use the replace method of the DataFrame class:
import pyspark.sql.functions as F

months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
          '7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}

df = (df.withColumn('month', F.col('NumMonth').cast('string'))
        .replace(months, subset=['month']))
df.show()
+--------+-----+
|NumMonth|month|
+--------+-----+
| 1| Jan|
| 2| Feb|
| 3| Mar|
| 4| Apr|
| 5| May|
| 6| Jun|
| 7| Jul|
| 8| Aug|
| 9| Sep|
| 10| Oct|
| 11| Nov|
| 12| Dec|
+--------+-----+
Here I had to cast NumMonth to string because the mapping in the months dictionary has string keys; alternatively, you can change the dictionary keys to integers and avoid the cast.
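If you prefer to stay with column expressions instead of replace, another common approach is to turn the dictionary into a literal map column with F.create_map and look the key up directly. A minimal sketch, reusing the months dictionary above:
from itertools import chain
import pyspark.sql.functions as F

# Build a literal MapType column from the dictionary, then index it with the key column
mapping = F.create_map([F.lit(x) for x in chain(*months.items())])
df = df.withColumn('month', mapping[F.col('NumMonth').cast('string')])
The cast is still needed here because the map keys are strings; keys not present in the dictionary simply yield null.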

Related

Compare two couple of columns from two different pyspark dataframe to display the data that are different

I've got this dataframe with four columns:
df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
And I also have another dataframe, df2, with the same schema as df1:
df2 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 3.3, 5),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 7),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 1),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
+---+---+---+---+
I want to compare the tuples (a, b, d) so that I can obtain the rows that are present in df2 but not in df1, like this:
df3
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
+---+---+---+---+
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
That is, it keeps what is in df2 and is not present in both df1 and df2.
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
+---+---+---+---+
I also agree with @pltc's call-out that you might have made a mistake in your expected output table.
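Since the question asks to compare only the (a, b, d) tuple, a left anti join on just those columns is another way to express it. A sketch, assuming the same df1 and df2 as above:
# Rows of df2 whose (a, b, d) combination does not appear in df1
df3 = df2.join(df1, on=['a', 'b', 'd'], how='left_anti')
df3.show()
With the sample data above this should return exactly the three rows shown in the expected df3.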

Sort by key (Month) using RDDs in Pyspark

I have this RDD and want to sort it by month (Jan → Dec). How can I do it in PySpark?
Note: I don't want to use spark.sql or DataFrames.
+-----+-----+
|Month|count|
+-----+-----+
| Oct| 1176|
| Sep| 1167|
| Dec| 2084|
| Aug| 1126|
| May| 1176|
| Jun| 1424|
| Feb| 1286|
| Nov| 1078|
| Mar| 1740|
| Jan| 1544|
| Apr| 1080|
| Jul| 1237|
+-----+-----+
You can use rdd.sortBy with a helper dictionary built from Python's calendar module, or create your own month dictionary:
import calendar

d = {i: e for e, i in enumerate(calendar.month_abbr[1:], 1)}
# {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
#  'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

myrdd.sortBy(keyfunc=lambda x: d.get(x[0])).collect()
[('Jan', 1544),
('Feb', 1286),
('Mar', 1740),
('Apr', 1080),
('May', 1176),
('Jun', 1424),
('Jul', 1237),
('Aug', 1126),
('Sep', 1167),
('Oct', 1176),
('Nov', 1078),
('Dec', 2084)]
Alternatively, if collecting to the driver is acceptable, you can rebuild the list in month order yourself:
myList = myrdd.collect()
my_list_dict = dict(myList)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

newList = []
for m in months:
    newList.append((m, my_list_dict[m]))
print(newList)
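For completeness, a sketch of how myrdd could be built to test either approach (the counts are taken from the question's table):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# (Month, count) pairs from the question, in the original unsorted order
myrdd = sc.parallelize([('Oct', 1176), ('Sep', 1167), ('Dec', 2084), ('Aug', 1126),
                        ('May', 1176), ('Jun', 1424), ('Feb', 1286), ('Nov', 1078),
                        ('Mar', 1740), ('Jan', 1544), ('Apr', 1080), ('Jul', 1237)])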

Removing 0 at the end of multiple time series

I have multiple time series stored in a Spark DataFrame as below:
df = spark.createDataFrame([
    ('2020-03-10', 'France', 19),
    ('2020-03-11', 'France', 22),
    ('2020-03-12', 'France', 0),
    ('2020-03-13', 'France', 0),
    ('2020-03-14', 'France', 0),
    ('2020-04-10', 'UK', 12),
    ('2020-04-11', 'UK', 0),
    ('2020-04-12', 'UK', 9),
    ('2020-04-13', 'UK', 0),
    ('2020-04-08', 'Japan', 0),
    ('2020-04-09', 'Japan', -3),
    ('2020-04-10', 'Japan', -2),
], ['date', 'country', 'y'])
I am looking for a way (without looping as my real DataFrame has millions of rows) to remove the 0's at the end of each time series.
In our example, we would obtain:
df = spark.createDataFrame([
    ('2020-03-10', 'France', 19),
    ('2020-03-11', 'France', 22),
    ('2020-04-10', 'UK', 12),
    ('2020-04-11', 'UK', 0),
    ('2020-04-12', 'UK', 9),
    ('2020-04-08', 'Japan', 0),
    ('2020-04-09', 'Japan', -3),
    ('2020-04-10', 'Japan', -2),
], ['date', 'country', 'y'])
Assuming you want to remove the trailing zeros at the end of every country's series when ordered by date:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Window

df = spark.createDataFrame([
    ('2020-03-10', 'France', 19),
    ('2020-03-11', 'France', 22),
    ('2020-03-12', 'France', 0),
    ('2020-03-13', 'France', 0),
    ('2020-03-14', 'France', 0),
    ('2020-04-10', 'UK', 12),
    ('2020-04-11', 'UK', 0),
    ('2020-04-12', 'UK', 9),
    ('2020-04-13', 'UK', 0),
    ('2020-04-13', 'India', 1),
    ('2020-04-14', 'India', 0),
    ('2020-04-15', 'India', 0),
    ('2020-04-16', 'India', 1),
    ('2020-04-08', 'Japan', 0),
    ('2020-04-09', 'Japan', -3),
    ('2020-04-10', 'Japan', -2),
], ['date', 'country', 'y'])

# Convert negatives to positives so they cannot accidentally cancel out to 0 in the sum
df = df.withColumn('y1', F.abs(F.col('y')))

# Window per country, ordered from the most recent date backwards
w = Window.partitionBy('country').orderBy(F.col('date').desc())

# Running sum from the latest row backwards: it stays 0 over the trailing zeros
# and becomes non-zero as soon as the first non-zero value is reached
df_sum = df.withColumn("sum_chk", F.sum('y1').over(w))

# Keep the non-zero rows; the sort is only for viewing
df_res = df_sum.where("sum_chk!=0").orderBy('date', ascending=True)
The result:
df_res.show()
+----------+-------+---+---+-------+
| date|country| y| y1|sum_chk|
+----------+-------+---+---+-------+
|2020-03-10| France| 19| 19| 41|
|2020-03-11| France| 22| 22| 22|
|2020-04-08| Japan| 0| 0| 5|
|2020-04-09| Japan| -3| 3| 5|
|2020-04-10| Japan| -2| 2| 2|
|2020-04-10| UK| 12| 12| 21|
|2020-04-11| UK| 0| 0| 9|
|2020-04-12| UK| 9| 9| 9|
|2020-04-13| India| 1| 1| 2|
|2020-04-14| India| 0| 0| 1|
|2020-04-15| India| 0| 0| 1|
|2020-04-16| India| 1| 1| 1|
+----------+-------+---+---+-------+
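For comparison, the same idea can be expressed without the abs/running-sum trick by computing the last date with a non-zero y per country and keeping only the rows up to it. A sketch, assuming the same df as above (the names last_nonzero and df_res2 are illustrative, and date is assumed to be a sortable YYYY-MM-DD string):
# Last date with a non-zero value, per country
last_nonzero = (df.where(F.col('y') != 0)
                  .groupBy('country')
                  .agg(F.max('date').alias('last_nonzero_date')))

# Keep only rows up to and including that date
df_res2 = (df.join(last_nonzero, 'country')
             .where(F.col('date') <= F.col('last_nonzero_date'))
             .drop('last_nonzero_date'))
Note that a country whose series is all zeros is dropped entirely by the join, which matches the behaviour of the running-sum filter above.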

pass both string and list to pandas .isin method

I am trying to pass both strings and a list to the pandas .isin() method. Here is my code:
overall_months = ['APR', 'JUL', 'NOV', 'MAR', 'FEB', 'AUG', 'SEP', 'OCT', 'JAN', 'DEC', 'MAY', 'JUN',
                  ['APR', 'JUL', 'NOV', 'MAR', 'FEB', 'AUG', 'SEP', 'OCT', 'JAN', 'DEC', 'MAY', 'JUN']]

for mon in overall_months:
    temp_df = df.month.isin([mon])
The issue is that .isin([mon]) works for each iteration where mon is a string, but when I get to overall_months[-1] it is a list, and I cannot pass a nested list to .isin(). I've tried this, but I cannot remove the quotes because strings are immutable:
str(overall_months[-1]).replace('[', '').replace(']','')
This produces: "'APR', 'JUL', 'NOV', 'MAR', 'FEB', 'AUG', 'SEP', 'OCT', 'JAN', 'DEC', 'MAY', 'JUN'"
It could be passed to my syntax if it was: 'APR', 'JUL', 'NOV', 'MAR', 'FEB', 'AUG', 'SEP', 'OCT', 'JAN', 'DEC', 'MAY', 'JUN'
What is the best way to accomplish this?
You can check if the element is a list with isinstance:
for mon in overall_months:
    if not isinstance(mon, list):
        mon = [mon]
    tmp_df = df.month.isin(mon)
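A minimal, self-contained sketch of how that plays out (the sample data here is made up for illustration):
import pandas as pd

df = pd.DataFrame({'month': ['JAN', 'FEB', 'MAR']})
overall_months = ['JAN', 'FEB', ['JAN', 'FEB', 'MAR']]

for mon in overall_months:
    if not isinstance(mon, list):
        mon = [mon]              # wrap single strings so .isin always receives a list
    mask = df.month.isin(mon)
    print(mon, mask.sum())       # number of matching rows per iteration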

Pyspark: filter dataframe based on list with many conditions

Suppose you have a pyspark dataframe df with columns A and B.
Now, you want to filter the dataframe with many conditions.
The conditions are contained in a list of dicts:
l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}, ...]
The filtering should be done as follows:
df.filter(
    ((df['A'] == l[0]['A']) & (df['B'] == l[0]['B']))
    &
    ((df['A'] == l[1]['A']) & (df['B'] == l[1]['B']))
    &
    ...
)
How can this be done when l contains many conditions, i.e. when manually writing out the filter condition is not practical?
I thought about using separate filter steps, i.e.:
for d in l:
    df = df.filter((df['A'] == d['A']) & (df['B'] == d['B']))
Is there a shorter or more elegant way of doing this, e.g. similar to using list comprehensions?
In addition, this does not work for ORs (|).
You could use your list of dictionaries to build a single SQL expression and pass it to your filter all at once.
l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}]
df.show()
#+----+---+
#| A| B|
#+----+---+
#|val1| 5|
#|val1| 1|
#|val1| 3|
#|val4| 2|
#|val1| 4|
#|val1| 1|
#+----+---+
df.filter(' or '.join(["A = '{}' and B = {}".format(d['A'], d['B']) for d in l])).show()
#+----+---+
#| A| B|
#+----+---+
#|val1| 5|
#|val4| 2|
#+----+---+
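If you would rather stay with Column expressions, for example to avoid quoting issues in the SQL string, here is a sketch using functools.reduce over the same list l:
from functools import reduce
import pyspark.sql.functions as F

# AND together the conditions inside each dict, then OR the per-dict conditions
per_dict = [reduce(lambda a, b: a & b, [F.col(k) == v for k, v in d.items()]) for d in l]
df.filter(reduce(lambda a, b: a | b, per_dict)).show()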
