Removing 0 at the end of multiple time series - python

I have multiple time series stored in a Spark DataFrame as below:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
I am looking for a way (without looping as my real DataFrame has millions of rows) to remove the 0's at the end of each time series.
In our example, we would obtain:
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)

Assume you want to remove the trailing zeros at the end of every country's series, ordered by date:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql import Window
df = spark.createDataFrame([('2020-03-10', 'France', 19),
('2020-03-11', 'France', 22),
('2020-03-12', 'France', 0),
('2020-03-13', 'France', 0),
('2020-03-14', 'France', 0),
('2020-04-10', 'UK', 12),
('2020-04-11', 'UK', 0),
('2020-04-12', 'UK', 9),
('2020-04-13', 'UK', 0),
('2020-04-13', 'India', 1),
('2020-04-14', 'India', 0),
('2020-04-15', 'India', 0),
('2020-04-16', 'India', 1),
('2020-04-08', 'Japan', 0),
('2020-04-09', 'Japan', -3),
('2020-04-10', 'Japan', -2)
],
['date', 'country', 'y']
)
# convert negatives to positive so a mix of signs cannot accidentally sum to 0
df = df.withColumn('y1', F.abs(F.col('y')))
# window that walks each country's series from the latest date backwards
w = Window.partitionBy('country').orderBy(F.col('date').desc())
# running sum over that window: trailing zeros keep a cumulative sum of 0,
# which becomes non-zero at the first non-zero value
df_sum = df.withColumn("sum_chk", F.sum('y1').over(w))
# keep rows where the running sum is non-zero; sort just for viewing
df_res = df_sum.where("sum_chk!=0").orderBy('date', ascending=True)
The result:
df_res.show()
+----------+-------+---+---+-------+
| date|country| y| y1|sum_chk|
+----------+-------+---+---+-------+
|2020-03-10| France| 19| 19| 41|
|2020-03-11| France| 22| 22| 22|
|2020-04-08| Japan| 0| 0| 5|
|2020-04-09| Japan| -3| 3| 5|
|2020-04-10| Japan| -2| 2| 2|
|2020-04-10| UK| 12| 12| 21|
|2020-04-11| UK| 0| 0| 9|
|2020-04-12| UK| 9| 9| 9|
|2020-04-13| India| 1| 1| 2|
|2020-04-14| India| 0| 0| 1|
|2020-04-15| India| 0| 0| 1|
|2020-04-16| India| 1| 1| 1|
+----------+-------+---+---+-------+
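If you only need the original columns back (matching the expected output in the question), a minimal follow-up on df_res could be:
# drop the helper columns y1 and sum_chk used for the trailing-zero check
df_final = df_res.select('date', 'country', 'y')
df_final.show()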

Related

How to map a column in PySpark DataFrame and avoid getting Null values?

I have a PySpark DataFrame and I want to map values of a column.
Sample dataset:
data = [(1, 'N'), \
(2, 'N'), \
(3, 'C'), \
(4, 'S'), \
(5, 'North'), \
(6, 'Central'), \
(7, 'Central'), \
(8, 'South')
]
columns = ["ID", "City"]
df = spark.createDataFrame(data = data, schema = columns)
The mapping dictionary is:
{'N': 'North', 'C': 'Central', 'S': 'South'}
And I use the following code:
from pyspark.sql import functions as F
from itertools import chain
mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
df_new = df.withColumn('City_New', mapping_expr[df['City']])
The result is that I get Null values in City_New for every row whose City value is not a key in the mapping dictionary (IDs 5-8 above). To work around this, I can define the mapping dictionary as:
{'N': 'North', 'C': 'Central', 'S': 'South', \
'North': 'North', 'Central': 'Central', 'South': 'South'}
However, if there are many unique values in the dataset, it is impractical to list them all in the mapping dictionary.
Is there a better way to do this?
You can use coalesce.
Here's how it would look (here func is pyspark.sql.functions, and map_dict and data_sdf are the question's mapping dictionary and dataframe):
# create separate case whens for each key-value pair
map_whens = [func.when(func.upper('city') == k.upper(), v) for k, v in map_dict.items()]
# [Column<'CASE WHEN (upper(city) = N) THEN North END'>,
# Column<'CASE WHEN (upper(city) = C) THEN Central END'>,
# Column<'CASE WHEN (upper(city) = S) THEN South END'>]
# pass case whens to coalesce with last value as `city` field
data_sdf. \
withColumn('city_new', func.coalesce(*map_whens, 'city')). \
show()
# +---+-------+--------+
# | id| city|city_new|
# +---+-------+--------+
# | 1| N| North|
# | 2| N| North|
# | 3| C| Central|
# | 4| S| South|
# | 5| North| North|
# | 6|Central| Central|
# | 7|Central| Central|
# | 8| South| South|
# +---+-------+--------+
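As a sketch of an alternative that keeps the question's create_map approach, you can wrap the lookup in coalesce so that unmapped values fall back to the original column (assuming the question's df and mapping_dict):
from itertools import chain
from pyspark.sql import functions as F

mapping_dict = {'N': 'North', 'C': 'Central', 'S': 'South'}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping_dict.items())])
# fall back to the original City value when the map lookup returns null
df_new = df.withColumn('City_New', F.coalesce(mapping_expr[F.col('City')], F.col('City')))
df_new.show()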

Compare couples of columns from two different pyspark dataframes to display the data that are different

I've got this dataframe with four columns:
df1 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 7.3, 8),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 8),
('e', 'f', 6.0, 3),
('e', 'f', 6.0, 8),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 3),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|7.3| 8|
| c| d|7.3| 2|
| c| d|7.3| 8|
| e| f|6.0| 3|
| e| f|6.0| 8|
| e| f|6.0| 3|
| c| j|4.2| 3|
| c| j|4.3| 9|
+---+---+---+---+
I also have another dataframe df2 with the same schema as df1:
df2 = spark.createDataFrame([
('c', 'd', 3.0, 4),
('c', 'd', 3.3, 5),
('c', 'd', 7.3, 2),
('c', 'd', 7.3, 7),
('e', 'f', 6.0, 3),
('c', 'j', 4.2, 1),
('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df2.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.0| 4|
| c| d|3.3| 5|
| c| d|7.3| 2|
| c| d|7.3| 7|
| e| f|6.0| 3|
| c| j|4.2| 1|
| c| j|4.3| 9|
+---+---+---+---+
I want to compare on the couple of columns (a, b, d) so that I can obtain the rows that are present in df2 but not in df1, like this:
df3
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| d|3.3| 5|
| c| d|7.3| 7|
| c| j|4.2| 1|
+---+---+---+---+
I think what you want is:
df2.subtract(df1.intersect(df2)).show()
This gives what is in df2 that is not in both df1 and df2:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| c| j|4.2| 1|
| c| d|3.3| 5|
| c| d|7.3| 7|
+---+---+---+---+
I also agree with @pltc's call-out that you might have made a mistake in your expected output table.
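If the comparison really should be keyed on the columns (a, b, d) only, as the question states, one way to sketch it is a left anti join on just those columns, which keeps every df2 row whose (a, b, d) combination never appears in df1:
# rows of df2 whose (a, b, d) combination does not appear anywhere in df1
df3 = df2.join(df1.select('a', 'b', 'd'), on=['a', 'b', 'd'], how='left_anti')
df3.show()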

Unexplode in pyspark with sequence conditional

I need to 'unexplode' a column in a pyspark dataframe, where the sequence is marked by a condition column. The input dataframe and the expected output dataframe are shown in the sample code below.
When c1 = 1 on a row, the content of the c4 column has been split and continues on the next row (because its length is over a limit). When c1 = 0, that row holds the last (or only) piece of the c4 content, so nothing follows. A single c4 value can therefore be split over several consecutive rows.
This is the opposite of pyspark.sql.functions.explode(col): I need to 'unexplode', but with the c1 column as the sequence condition (it is not as simple as a group by followed by df.groupby().agg(F.collect_list()), because c1 marks a sequence).
I tried to use a window function, following the topic PySpark - Append previous and next row to current row, but how do I handle the case where c4 is split across multiple following rows?
Sample code
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()
df_in = spark_session.createDataFrame(
[
(1, 'a', 'b', 'c1', 'd'),
(0, 'a', 'b', 'c2', 'd'),
(0, 'e', 'f', 'g', 'h'),
(0, '1', '2', '3', '4'),
(1, 'x', 'y', 'z1', 'k'),
(1, 'x', 'y', 'z2', 'k'),
(1, 'x', 'y', 'z3', 'k'),
(0, 'x', 'y', 'z4', 'k'),
(1, '6', '7', '81', '9'),
(0, '6', '7', '82', '9'),
],
['c1', 'c2', 'c3', 'c4', 'c5']
)
df_out = spark_session.createDataFrame(
[
('a', 'b', 'c1-c2', 'd'),
('e', 'f', 'g', 'h'),
('1', '2', '3', '4'),
('x', 'y', 'z1-z2-z3-z4', 'k'),
('6', '7', '81-82', '9')
],
['c2', 'c3', 'c4', 'c5']
)
df_in.show()
df_out.show()
How can I solve this? Thank you.
UPDATED
Input:
df_in = spark_session.createDataFrame(
[
('0', 1, 'a', 'b', 'c1', 'd'),
('0', 0, 'a', 'b', 'c2', 'd'),
('0', 0, 'e', 'f', 'g', 'h'),
('0', 0, '1', '2', '3', '4'),
('0', 1, 'x', 'y', 'sele', 'k'),
('0', 1, 'x', 'y', 'ct ', 'k'),
('0', 1, 'x', 'y', 'from', 'k'),
('0', 0, 'x', 'y', 'a', 'k'),
('0', 1, '6', '7', '81', '9'),
('0', 0, '6', '7', '82', '9'),
],
['c0', 'c1', 'c2', 'c3', 'c4', 'c5']
)
Expected output row:
x| y|select -from-a| k
This solution works even when your data set is in multiple partitions and not ordered.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
# rank the c4 pieces within each (c2, c3, c5) partition, ordered by (c4, c1 desc)
orderByColumns = [F.col('c4'), F.col('c1').cast('int').desc()]
partitionColumns = [F.col(column) for column in ['c2', 'c3', 'c5']]

df_in.orderBy(orderByColumns)\
    .withColumn('ranked', F.dense_rank().over(Window.partitionBy(partitionColumns).orderBy(orderByColumns)))\
    .withColumn('c4-ranked', F.concat(F.col('ranked'), F.lit('='), F.col('c4')))\
    .groupBy(partitionColumns)\
    .agg(F.collect_list('c4-ranked').alias('c4'))\
    .select(
        F.col('c2'),
        F.col('c3'),
        # join the collected pieces with '-' and strip the 'rank=' prefixes
        F.regexp_replace(F.array_join(F.col('c4'), "-"), r"\d+=", "").alias('c4'),
        F.col('c5')
    )\
    .show()
+---+---+-----------+---+
| c2| c3| c4| c5|
+---+---+-----------+---+
| 1| 2| 3| 4|
| x| y|z1-z2-z3-z4| k|
| e| f| g| h|
| 6| 7| 81-82| 9|
| a| b| c1-c2| d|
+---+---+-----------+---+
Setup
df_in = spark_session.createDataFrame(
[
(1, 'a', 'b', 'c1', 'd'),
(0, 'a', 'b', 'c2', 'd'),
(0, 'e', 'f', 'g', 'h'),
(0, '1', '2', '3', '4'),
(1, 'x', 'y', 'z1', 'k'),
(1, 'x', 'y', 'z2', 'k'),
(1, 'x', 'y', 'z3', 'k'),
(0, 'x', 'y', 'z4', 'k'),
(1, '6', '7', '81', '9'),
(0, '6', '7', '82', '9'),
],
['c1', 'c2', 'c3', 'c4', 'c5']
).repartition(5)
df_in.show()
Output on my run (it may vary between runs):
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 1| x| y| z2| k|
| 0| x| y| z4| k|
| 1| a| b| c1| d|
| 0| 1| 2| 3| 4|
| 0| 6| 7| 82| 9|
| 0| a| b| c2| d|
| 0| e| f| g| h|
| 1| 6| 7| 81| 9|
| 1| x| y| z3| k|
| 1| x| y| z1| k|
+---+---+---+---+---+
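If the original row order can be captured explicitly, another hedged sketch is the classic 'segment id' trick: flag the start of each run, turn the flags into a group id with a running sum, then join the pieces per group. This assumes a hypothetical ordering column idx that reflects the original row order (the sample data above does not carry one), so it is only an alternative sketch, not the approach shown above.
from pyspark.sql import Window
from pyspark.sql import functions as F

# hypothetical: idx is an explicit ordering column reflecting the original row order
w = Window.partitionBy('c2', 'c3', 'c5').orderBy('idx')

df_grouped = (
    df_in
    # a new group starts when there is no previous row or the previous row ended with c1 = 0
    .withColumn('prev_c1', F.lag('c1').over(w))
    .withColumn('new_grp', F.when(F.col('prev_c1').isNull() | (F.col('prev_c1') == 0), 1).otherwise(0))
    .withColumn('grp', F.sum('new_grp').over(w))
    # collect (idx, c4) pairs per group, sort by idx, then join the c4 pieces with '-'
    .groupBy('c2', 'c3', 'c5', 'grp')
    .agg(F.array_sort(F.collect_list(F.struct('idx', 'c4'))).alias('pieces'))
    .withColumn('c4', F.array_join(F.col('pieces.c4'), '-'))
    .select('c2', 'c3', 'c4', 'c5')
)
df_grouped.show()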

Pyspark: filter dataframe based on list with many conditions

Suppose you have a pyspark dataframe df with columns A and B.
Now, you want to filter the dataframe with many conditions.
The conditions are contained in a list of dicts:
l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}, ...]
The filtering should be done as follows:
df.filter(
( (df['A'] == l[0]['A']) & (df['B'] == l[0]['B']) )
&
( (df['A'] == l[1]['A']) & (df['B'] == l[1]['B']) )
&
...
)
How can this be done when l contains many conditions, i.e. when manually writing out the filter condition is not practical?
I thought about using separate filter steps, i.e.:
for d in l:
df = df.filter((df['A'] == d['A']) & (df['B'] == d['B']))
Is there a shorter or more elegant way of doing this, e.g. similar to using list comprehensions?
In addition, this does not work for ORs (|).
You could use your list of dictionaries to create a sql expression and send it to your filter all at once.
l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}]
df.show()
#+----+---+
#| A| B|
#+----+---+
#|val1| 5|
#|val1| 1|
#|val1| 3|
#|val4| 2|
#|val1| 4|
#|val1| 1|
#+----+---+
df.filter(' or '.join(["A"+"="+"'"+d['A']+"'"+" and "+"B"+"="+str(d['B']) for d in l])).show()
#+----+---+
#| A| B|
#+----+---+
#|val1| 5|
#|val4| 2|
#+----+---+
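As a sketch of an alternative that stays with Column expressions (avoiding the string quoting), you could fold the list of dicts with functools.reduce, OR-ing together one (A, B) condition per dict:
from functools import reduce
from pyspark.sql import functions as F

# build a single filter Column: OR across the dicts, AND within each dict
cond = reduce(
    lambda acc, d: acc | ((F.col('A') == d['A']) & (F.col('B') == d['B'])),
    l,
    F.lit(False),
)
df.filter(cond).show()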

Create multidict from pyspark dataframe

I am new to pyspark and want to create a dictionary from a pyspark dataframe. I have working pandas code, but I need an equivalent command in pyspark and I am not able to figure out how to do it.
df = spark.createDataFrame([
(11, 101, 5.9),
(11, 102, 5.4),
(22, 111, 5.2),
(22, 112, 5.9),
(22, 101, 5.7),
(33, 101, 5.2),
(44, 102, 5.3),
], ['user_id', 'team_id', 'height'])
df = df.select(['user_id', 'team_id'])
df.show()
+-------+-------+
|user_id|team_id|
+-------+-------+
| 11| 101|
| 11| 102|
| 22| 111|
| 22| 112|
| 22| 101|
| 33| 101|
| 44| 102|
+-------+-------+
The working pandas code:
df.toPandas().groupby('user_id')['team_id'].apply(list).to_dict()
Result:
{11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}
I am looking for an efficient way in pyspark to create the above multidict.
You can aggregate the team_id column as a list and then collect the RDD as a dictionary using the collectAsMap method:
import pyspark.sql.functions as F
df.groupBy("user_id").agg(F.collect_list("team_id")).rdd.collectAsMap()
# {33: [101], 11: [101, 102], 44: [102], 22: [111, 112, 101]}
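If you prefer to stay in the DataFrame API rather than drop to the RDD, a roughly equivalent sketch is to collect the aggregated rows and build the dict on the driver:
import pyspark.sql.functions as F

# same aggregation, then build the dict from the collected Rows on the driver
rows = df.groupBy('user_id').agg(F.collect_list('team_id').alias('team_ids')).collect()
result = {row['user_id']: row['team_ids'] for row in rows}
# e.g. {11: [101, 102], 22: [111, 112, 101], 33: [101], 44: [102]}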
