I have a crosstab table with 4 rows and multiple columns, containing numerical values (the number of dataset elements at the intersection of two factors). I want to sort the columns of the crosstab by the sum of the values in each column.
e.g. I have:
ct = pd.crosstab(df_flt_reg['experience'], df_flt_reg['region'])
      |  a|  b|  c|  d|  e|
0     |  1|  0|  7|  3|  6|
1     |  2|  4|  1|  5|  4|
2     |  3|  5|  0|  7|  2|
3     |  1|  3|  1|  9|  1|
(sum) |  7| 12|  9| 24| 13|   # this row doesn't exist; shown only to illustrate the logic
What I want:
      |  d|  e|  b|  c|  a|
0     |  3|  6|  0|  7|  1|
1     |  5|  4|  4|  1|  2|
2     |  7|  2|  5|  0|  3|
3     |  9|  1|  3|  1|  1|
(sum) | 24| 13| 12|  9|  7|
I succeeded only in sorting the columns by their names (alphabetically), but that's not what I need. I tried summing the values separately, building a list of properly ordered column labels and passing it to crosstab.sort_values() via the "by" parameter, but it was sorted alphabetically again. I also tried to create a new row "sum", but only managed to create a new column -_-
So I am humbly asking for the community's help.
Calculate the sum and sort the values. Once you have the sorted series get the index and reorder your columns with it.
sorted_df = ct[ct.sum().sort_values(ascending=False).index]
d e b c a
0 3 6 0 7 1
1 5 4 4 1 2
2 7 2 5 0 3
3 9 1 3 1 1
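If you also want the (sum) row itself (the question mentions only managing to add a column rather than a row), a short sketch using the same ct as above; the '(sum)' label is just illustrative:
sorted_df = ct[ct.sum().sort_values(ascending=False).index]
sorted_df.loc['(sum)'] = sorted_df.sum()   # append the column totals as an extra row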
Trying to replicate pandas code in pyspark 2.x.
Say I have a dataframe as follows:
age education country
0 22 A Canada
1 34 B Mongolia
2 55 A Peru
3 44 C Korea
Usually in pandas I would scale the numerical columns, one-hot encode the categorical ones, and get:
age education_A education_B education_C country_Canada country_Mongolia ...
0 0 1 0 0 1 0
1 0.3 0 1 0 0 0
2 1 1 0 0 0 0 ...
3 0.7 0 0 1 ... ...
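For reference, the pandas version I have in mind is roughly this (a sketch, assuming df is the pandas frame shown above; min-max scaling for age, get_dummies for the categoricals):
import pandas as pd

encoded = pd.get_dummies(df, columns=["education", "country"])   # one 0/1 column per category
age = encoded["age"]
encoded["age"] = (age - age.min()) / (age.max() - age.min())     # scale age to [0, 1]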
In pyspark I've done
from pyspark.ml.feature import StringIndexer, OneHotEncoder

str_indexer1 = StringIndexer(inputCol="education", outputCol="education_si", handleInvalid="skip")
str_indexer2 = StringIndexer(inputCol="country", outputCol="country_si", handleInvalid="skip")
mod_df = str_indexer1.fit(df).transform(df)
mod_df = str_indexer2.fit(df).transform(mod_df)
ohe1 = OneHotEncoder(inputCol="education_si", outputCol="education_ohe")
ohe2 = OneHotEncoder(inputCol="country_si", outputCol="country_ohe")
ohe1.fit(mod_df).transform(mod_df)
This gives me
age education country education_si country_si education_ohe
0 0 A Canada 1 1 (1,0,0,0)
1 0.3 B Mongolia 2 2 (0,1,0,0)
2 1 A Peru 1 3 (1,0,0,0)
3 0.7 C Korea 3 4 (0,0,1,0)
From here I cannot figure out how to explode education_ohe into education_A, etc.
How can I do this, and is there a more efficient way to perform one-hot encoding and scaling on a large dataframe?
There is a built-in OneHotEncoder in pyspark.ml.feature, but I could not get it to produce true one-hot encoded columns.
That being said, the following code will get the desired result.
Using the following dataframe:
df.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22| A| Canada|
| 34| B|Mongolia|
| 55| A| Peru|
| 44| C| Korea|
+---+---------+--------+
You can collect the distinct values of each column and iteratively create additional columns.
from pyspark.sql import functions

# collect the distinct values of every column
variables_dict = {}
for col in df.columns:
    variables_dict[col] = [row[col] for row in df.select(col).distinct().collect()]

# add one 0/1 indicator column per (column, value) pair
for col in variables_dict.keys():
    for val in variables_dict[col]:
        df = df.withColumn("{}_{}".format(col, val),
                           functions.when(df[col] == val, 1).otherwise(0))
df.show()
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22| A| Canada| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0|
| 34| B|Mongolia| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0| 1|
| 55| A| Peru| 0| 0| 1| 0| 1| 0| 0| 0| 0| 1| 0|
| 44| C| Korea| 0| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
You can then use the same variables_dict to apply the one_hot_encoder to another dataframe.
df2.show()
+---+---------+--------+
|age|education| country|
+---+---------+--------+
| 22| A| Canada|
| 66| B|Mongolia|
| 55| D| Peru|
| 44| C| China|
+---+---------+--------+
The dataframe above has column values that have not been seen before, i.e. (66, D, China).
We can use the following code to transform df2 to have the same columns as df.
for col in variables_dict.keys():
    for val in variables_dict[col]:
        df2 = df2.withColumn("{}_{}".format(col, val),
                             functions.when(df2[col] == val, 1).otherwise(0))
df2.show()
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
|age|education| country|age_22|age_44|age_55|age_34|education_A|education_C|education_B|country_Canada|country_Korea|country_Peru|country_Mongolia|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
| 22| A| Canada| 1| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0|
| 66| B|Mongolia| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 1|
| 55| D| Peru| 0| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0|
| 44| C| China| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 0|
+---+---------+--------+------+------+------+------+-----------+-----------+-----------+--------------+-------------+------------+----------------+
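For the "is there a more efficient way" part of the question: the usual route for large data is a pyspark.ml Pipeline. It produces vector columns (education_ohe etc.) rather than one 0/1 column per category, but it scales much better. A sketch, assuming Spark 2.3/2.4 where the estimator is called OneHotEncoderEstimator (it is plain OneHotEncoder again in Spark 3.x):
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoderEstimator,
                                VectorAssembler, MinMaxScaler)

cat_cols = ["education", "country"]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_si", handleInvalid="skip")
            for c in cat_cols]
encoder = OneHotEncoderEstimator(inputCols=[c + "_si" for c in cat_cols],
                                 outputCols=[c + "_ohe" for c in cat_cols])
assembler = VectorAssembler(inputCols=["age"], outputCol="age_vec")   # MinMaxScaler needs a vector column
scaler = MinMaxScaler(inputCol="age_vec", outputCol="age_scaled")

pipeline = Pipeline(stages=indexers + [encoder, assembler, scaler])
features = pipeline.fit(df).transform(df)
features.select("age_scaled", "education_ohe", "country_ohe").show(truncate=False)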
I have a PySpark dataframe like this,
+----------+--------+----------+----------+
|id_ | p |d1 | d2 |
+----------+--------+----------+----------+
| 1 | A |2018-09-26|2018-10-26|
| 2 | B |2018-06-21|2018-07-19|
| 2 | B |2018-08-13|2018-10-07|
| 2 | B |2018-12-31|2019-02-27|
| 2 | B |2019-05-28|2019-06-25|
| 3 |C |2018-06-15|2018-07-13|
| 3 |C |2018-08-15|2018-10-09|
| 3 |C |2018-12-03|2019-03-12|
| 3 |C |2019-05-10|2019-06-07|
| 4 | A |2019-01-30|2019-03-01|
| 4 | B |2019-05-30|2019-07-25|
| 5 |C |2018-09-19|2018-10-17|
+----------+--------+----------+----------+
From this dataframe I have to derive another dataframe which has n columns, where each column is a month from month(min(d1)) to month(max(d2)).
For each row in the actual dataframe I want a row in the derived dataframe, and the column values must be the number of days covered in that month.
For example,
for the first row, where id_ is 1 and p is A, I want a row in the derived dataframe where column 201809 has value 5 and column 201810 has value 26.
For the second row, where id_ is 2 and p is B, I want a row where column 201806 should be 9 and 201807 should be 19.
For the second-to-last row, I want column 201905 filled with value 1, column 201906 with 30, and 201907 with 25.
So basically, I want the dataframe populated in such a way that for each row in my original dataframe there is a row in the derived dataframe where every column corresponding to a month in the range min(d1) to max(d2) is filled with the number of days covered in that particular month.
I am currently doing this the hard way. I create n columns, one for each date from min(d1) to max(d2), fill these columns with 1, then melt the data and filter based on the value. Finally I aggregate this dataframe to get my desired result and then select the max-valued p.
In code:
import pandas as pd
from pyspark.sql import functions as F

d = df.select(F.min('d1').alias('d1'), F.max('d2').alias('d2')).first()
cols = [ c.strftime('%Y-%m-%d') for c in pd.period_range(d.d1, d.d2, freq='D') ]
result = df.select('id_', 'p', *[ F.when((df.d1 <= c)&(df.d2 >= c), 1).otherwise(0).alias(c) for c in cols ])
melted_data = melt(result, id_vars=['id_','p'], value_vars=cols)
melted_data = melted_data.withColumn('Month', F.substring(F.regexp_replace('variable', '-', ''), 1, 6))
melted_data = melted_data.groupBy('id_', 'Month', 'p').agg(F.sum('value').alias('days'))
melted_data = melted_data.orderBy('id_', 'Month', 'days', ascending=[False, False, False])
final_data = melted_data.groupBy('id_', 'Month').agg(F.first('p').alias('p'))
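(Note: melt is not a PySpark built-in; the helper used above is roughly the usual explode-of-structs pattern, sketched below with the same argument names.)
from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # build an array of (column-name, value) structs and explode it,
    # mimicking pandas.melt for a PySpark DataFrame
    pairs = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])
    return (df
            .withColumn("_pair", F.explode(pairs))
            .select(*id_vars,
                    F.col("_pair." + var_name).alias(var_name),
                    F.col("_pair." + value_name).alias(value_name)))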
This code takes a lot of time to run even on decent configurations. How can I improve it?
How can I achieve this task in a more optimized manner? Creating a column for every single date in the range doesn't seem to be the best solution.
A small sample of the needed output is shown below,
+---+---+----------+----------+----------+----------+-------+
|id_|p |201806 |201807 |201808 | 201809 | 201810|
+---+---+----------+----------+----------+----------+-------+
| 1 | A | 0| 0 | 0| 4 | 26 |
| 2 | B | 9| 19| 0| 0 | 0 |
| 2 | B | 0| 0 | 18| 30 | 7 |
+---+---+----------+----------+----------+----------+-------+
I think it's slowing down because of freq='D' and the multiple transformations on the dataset.
Please try the approach below.
Edit 1: Per comment, the start date should be included in the final result
Edit 2: Update for the quarter ranges
Edit 3: Per comment, update for daily granularity
Prepared data
#Imports
import pyspark.sql.functions as f
from pyspark.sql.functions import when
import pandas as pd
df.show()
+---+---+----------+----------+
| id| p| d1| d2|
+---+---+----------+----------+
| 1| A|2018-09-26|2018-10-26|
| 2| B|2018-06-21|2018-07-19|
| 2| B|2018-08-13|2018-10-07|
| 2| B|2018-12-31|2019-02-27|
| 2| B|2019-05-28|2019-06-25|
| 3| C|2018-06-15|2018-07-13|
| 3| C|2018-08-15|2018-10-09|
| 3| C|2018-12-03|2019-03-12|
| 3| C|2019-05-10|2019-06-07|
| 4| A|2019-01-30|2019-03-01|
| 4| B|2019-05-30|2019-07-25|
| 5| C|2018-09-19|2018-10-17|
| 5| C|2019-05-16|2019-05-29| # --> Same month case
+---+---+----------+----------+
Get the min and max dates from the dataset, then build the month list with frequency freq='M':
d = df.select(f.min('d1').alias('min'), f.max('d2').alias('max')).first()
dates = pd.period_range(d.min, d.max, freq='M').strftime("%Y%m").tolist()
dates
['201806', '201807', '201808', '201809', '201810', '201811', '201812', '201901', '201902', '201903', '201904', '201905', '201906', '201907']
Now, the final business logic using Spark date operators and functions:
df1 = df.select('id',
'p',
'd1',
'd2', *[ (when( (f.trunc(df.d1, "month") == f.trunc(df.d2, "month")) & (f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d1, "month"))
, f.datediff(df.d2, df.d1) + 1) # same month: (last day - first day) + 1
.when(f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d1, "month") ,
f.datediff(f.last_day(f.to_date(f.lit(c),'yyyyMM')) , df.d1) +1 ) # d1 date (Last day - current day)
.when(f.to_date(f.lit(c),'yyyyMM') == f.trunc(df.d2, "month") ,
f.datediff(df.d2, f.to_date(f.lit(c),'yyyyMM')) +1 ) # d2 date (Currentday - Firstday)
.when(f.to_date(f.lit(c),'yyyyMM').between(f.trunc(df.d1, "month"), df.d2),
f.dayofmonth(f.last_day(f.to_date(f.lit(c),'yyyyMM')))) # Between date (Total days in month)
).otherwise(0) # Rest of the months (0)
.alias(c) for c in dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| id| p| d1| d2|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0| 5| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 19| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0| 0| 0| 0| 1| 31| 27| 0| 0| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 4| 25| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 0| 17| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0| 0| 0| 0| 29| 31| 28| 12| 0| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 22| 7| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 0| 0| 0| 0| 2| 28| 1| 0| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 2| 30| 25|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0| 12| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 14| 0| 0|
+---+---+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit 2: Update for the quarter range:
Note: I am taking the quarter date-range dictionary from #jxc's answer. We are more interested in the optimal solution here; #jxc has done an excellent job and there is no point in reinventing the wheel unless there is a performance issue.
Create the date-range dictionary:
q_dates = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d") ,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.min, d.max, freq='Q')
])
# {'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
Now apply the business logic on quarters:
df1 = df.select('id',
'p',
'd1',
'd2',
*[(when( (df.d1.between(q_dates[c][0], q_dates[c][1])) & (f.trunc(df.d1, "month") == f.trunc(df.d2, "month")),
f.datediff(df.d2 , df.d1) +1 ) # Same month ((Last day - start day) +1 )
.when(df.d1.between(q_dates[c][0], q_dates[c][1]),
f.datediff(f.to_date(f.lit(q_dates[c][1])), df.d1) +1) # Min date , remaining days (Last day of quarter - Min day)
.when(df.d2.between(q_dates[c][0], q_dates[c][1]),
f.datediff(df.d2, f.to_date(f.lit(q_dates[c][0]))) +1 ) # Max date , remaining days (Max day - Start day of quarter )
.when(f.to_date(f.lit(q_dates[c][0])).between(df.d1, df.d2),
f.datediff(f.to_date(f.lit(q_dates[c][1])), f.to_date(f.lit(q_dates[c][0]))) +1) # All remaining days
).otherwise(0)
.alias(c) for c in q_dates ])
df1.show()
+---+---+----------+----------+------+------+------+------+------+------+
| id| p| d1| d2|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+----------+----------+------+------+------+------+------+------+
| 1| A|2018-09-26|2018-10-26| 0| 5| 26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 10| 19| 0| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 49| 7| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 1| 58| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0| 0| 34| 0|
| 3| C|2018-06-15|2018-07-13| 16| 13| 0| 0| 0| 0|
| 3| C|2018-08-15|2018-10-09| 0| 47| 9| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 29| 71| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0| 0| 52| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0| 61| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0| 0| 32| 25|
| 5| C|2018-09-19|2018-10-17| 0| 12| 17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0| 0| 14| 0|
+---+---+----------+----------+------+------+------+------+------+------+
Edit 3: Per comment, update for daily granularity
Since there are many more evaluations here, we need to be careful about performance.
Approach 1: Dataframe/Dataset
Get the date list in yyyy-MM-dd format, but as strings:
df_dates = pd.period_range(d.min, d.max, freq='D').strftime("%Y-%m-%d").tolist()
Now the business logic is quite simple: it's either 1 or 0.
df1 = df.select('id'
, 'p'
, 'd1'
,'d2'
, *[ (when(f.lit(c).between(df.d1, df.d2), 1)) # 1 if the day falls within [d1, d2]
.otherwise(0) # 0 for the rest of the days
.alias(c) for c in df_dates ])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
# Due to answer character limit unable to give the result.
Approach 2: RDD evaluations
Get the date list as date objects:
rdd_dates = [ c.to_timestamp().date() for c in pd.period_range(d.min, d.max, freq='D') ]
Use map on the RDD:
df1 = df \
.rdd \
.map(lambda x : tuple([x.id, x.p, x.d1, x.d2 , *[ 1 if ( x.d1 <= c <=x.d2) else 0 for c in rdd_dates]])) \
.toDF(df.columns + [ c.strftime("%Y-%m-%d") for c in rdd_dates])
df1.show()
+---+---+----------+----------+----------+----------+----------+
| id| p| d1| d2|2018-06-15|2018-06-16|2018-06-17| # and so on....
+---+---+----------+----------+----------+----------+----------+
| 1| A|2018-09-26|2018-10-26| 0| 0| 0|
| 2| B|2018-06-21|2018-07-19| 0| 0| 0|
| 2| B|2018-08-13|2018-10-07| 0| 0| 0|
| 2| B|2018-12-31|2019-02-27| 0| 0| 0|
| 2| B|2019-05-28|2019-06-25| 0| 0| 0|
| 3| C|2018-06-15|2018-07-13| 1| 1| 1|
| 3| C|2018-08-15|2018-10-09| 0| 0| 0|
| 3| C|2018-12-03|2019-03-12| 0| 0| 0|
| 3| C|2019-05-10|2019-06-07| 0| 0| 0|
| 4| A|2019-01-30|2019-03-01| 0| 0| 0|
| 4| B|2019-05-30|2019-07-25| 0| 0| 0|
| 5| C|2018-09-19|2018-10-17| 0| 0| 0|
| 5| C|2019-05-16|2019-05-29| 0| 0| 0|
+---+---+----------+----------+----------+----------+----------+
IIUC, your problem can be simplified using some Spark SQL tricks:
# get start_date and end_date
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
# get a list of month strings (using the first day of the month) between d.start_date and d.end_date
mrange = [ c.strftime("%Y-%m-01") for c in pd.period_range(d.start_date, d.end_date, freq='M') ]
#['2018-06-01',
# '2018-07-01',
# ....
# '2019-06-01',
# '2019-07-01']
Write the following Spark SQL snippet to count the number of days in each month, where {0} will be replaced by the month strings (e.g. "2018-06-01") and {1} will be replaced by the column names (e.g. "201806").
stmt = '''
IF(d2 < "{0}" OR d1 > LAST_DAY("{0}")
, 0
, DATEDIFF(LEAST(d2, LAST_DAY("{0}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND LAST_DAY("{0}"), 0, 1)
) AS `{1}`
'''
This SQL snippet does the following, assuming m is the month string:
if (d1, d2) is out of range, i.e. d1 > LAST_DAY(m) or d2 < m, then return 0
otherwise, we calculate the datediff() between LEAST(d2, LAST_DAY(m)) and GREATEST(d1, m).
Notice there is a 1-day offset added to the above datediff(); it is applied only when d1 is NOT in the current month, i.e. not between m and LAST_DAY(m). A worked check on the first row follows.
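For example, taking the first row (d1 = 2018-09-26, d2 = 2018-10-26) and the month m = "2018-09-01", the arithmetic works out as follows (plain Python used only to check the numbers):
from datetime import date

d1, d2 = date(2018, 9, 26), date(2018, 10, 26)        # first row of df
m_start, m_end = date(2018, 9, 1), date(2018, 9, 30)  # month 201809 and its LAST_DAY

days = (min(d2, m_end) - max(d1, m_start)).days   # DATEDIFF(LEAST(d2, LAST_DAY(m)), GREATEST(d1, m)) = 4
days += 0 if m_start <= d1 <= m_end else 1        # d1 is inside the month, so no +1 offset
print(days)                                       # 4, matching column 201809 in the output below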
We can then calculate the new columns using selectExpr and this SQL snippet:
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(m, m[:7].replace('-','')) for m in mrange ]
)
df_new.show()
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id_| p|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|201907|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| 1| A| 0| 0| 0| 4| 26| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 18| 30| 7| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 31| 27| 0| 0| 0| 0| 0|
| 2| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 3| 25| 0|
| 3| C| 15| 13| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 16| 30| 9| 0| 0| 0| 0| 0| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 28| 31| 28| 12| 0| 0| 0| 0|
| 3| C| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 21| 7| 0|
| 4| A| 0| 0| 0| 0| 0| 0| 0| 1| 28| 1| 0| 0| 0| 0|
| 4| B| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 30| 25|
| 5| C| 0| 0| 0| 11| 17| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+---+---+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
Edit-1: About the Quarterly list
Per your comment, I modified the SQL snippet so that you can extend it to more named date ranges. See below: {0} will be replaced by range_start_date, {1} by range_end_date, and {2} by range_name:
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, TO_DATE("{1}")), GREATEST(d1, TO_DATE("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
Create a dictionary with the quarter names as keys and a list of the corresponding start_date and end_date as values (this part is a pure Python/pandas problem):
range_dict = dict([
(str(c), [ c.to_timestamp().strftime("%Y-%m-%d")
,(c.to_timestamp() + pd.tseries.offsets.QuarterEnd()).strftime("%Y-%m-%d")
]) for c in pd.period_range(d.start_date, d.end_date, freq='Q')
])
#{'2018Q2': ['2018-04-01', '2018-06-30'],
# '2018Q3': ['2018-07-01', '2018-09-30'],
# '2018Q4': ['2018-10-01', '2018-12-31'],
# '2019Q1': ['2019-01-01', '2019-03-31'],
# '2019Q2': ['2019-04-01', '2019-06-30'],
# '2019Q3': ['2019-07-01', '2019-09-30']}
df_new = df.withColumn('d1', F.to_date('d1')) \
.withColumn('d2', F.to_date('d2')) \
.selectExpr(
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
)
df_new.show()
+---+---+------+------+------+------+------+------+
|id_| p|2018Q2|2018Q3|2018Q4|2019Q1|2019Q2|2019Q3|
+---+---+------+------+------+------+------+------+
| 1| A| 0| 4| 26| 0| 0| 0|
| 2| B| 9| 19| 0| 0| 0| 0|
| 2| B| 0| 48| 7| 0| 0| 0|
| 2| B| 0| 0| 0| 58| 0| 0|
| 2| B| 0| 0| 0| 0| 28| 0|
| 3| C| 15| 13| 0| 0| 0| 0|
| 3| C| 0| 46| 9| 0| 0| 0|
| 3| C| 0| 0| 28| 71| 0| 0|
| 3| C| 0| 0| 0| 0| 28| 0|
| 4| A| 0| 0| 0| 30| 0| 0|
| 4| B| 0| 0| 0| 0| 31| 25|
| 5| C| 0| 11| 17| 0| 0| 0|
+---+---+------+------+------+------+------+------+
Edit-2: Regarding the Segmentation errors
I tested the code with a sample dataframe of 56K rows (see below), and everything ran well in my testing environment (a VM with CentOS 7.3, 1 CPU and 2GB RAM, running spark-2.4.0-bin-hadoop2.7 in local mode inside a Docker container; this is far below any production environment). Thus I wonder whether it is a Spark version issue. I rewrote the same code logic using two different approaches: one using only Spark SQL (with a TempView etc.) and another using pure dataframe API functions (similar to #SMaZ's approach). I'd like to see if either of these runs through your environment and data. BTW, I think that, given most of the fields are numeric, 1M rows + 100 columns should not be very huge in terms of big-data projects.
Also, please do make sure whether there is missing data (null d1/d2) or incorrect data (i.e. d1 > d2), and adjust the code to handle such issues if needed; a sketch of such a guard follows.
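For example, a minimal guard could look like this (a sketch; whether to drop, swap, or impute such rows depends on your data):
from pyspark.sql import functions as F

df_clean = (df
    .filter(F.col("d1").isNotNull() & F.col("d2").isNotNull())  # drop rows with missing dates
    .filter(F.col("d1") <= F.col("d2")))                        # drop reversed date ranges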
# sample data-set
import pandas as pd, numpy as np
N = 560000
df1 = pd.DataFrame({
'id_': sorted(np.random.choice(range(100),N))
, 'p': np.random.choice(list('ABCDEFGHIJKLMN'),N)
, 'd1': sorted(np.random.choice(pd.date_range('2016-06-30','2019-06-30',freq='D'),N))
, 'n': np.random.choice(list(map(lambda x: pd.Timedelta(days=x), range(300))),N)
})
df1['d2'] = df1['d1'] + df1['n']
df = spark.createDataFrame(df1)
df.printSchema()
#root
# |-- id_: long (nullable = true)
# |-- p: string (nullable = true)
# |-- d1: timestamp (nullable = true)
# |-- n: long (nullable = true)
# |-- d2: timestamp (nullable = true)
# get the overall date-range of dataset
d = df.select(F.min('d1').alias('start_date'), F.max('d2').alias('end_date')).first()
#Row(start_date=datetime.datetime(2016, 6, 29, 20, 0), end_date=datetime.datetime(2020, 4, 22, 20, 0))
# range_dict for the month data
range_dict = dict([
(c.strftime('%Y%m'), [ c.to_timestamp().date()
,(c.to_timestamp() + pd.tseries.offsets.MonthEnd()).date()
]) for c in pd.period_range(d.start_date, d.end_date, freq='M')
])
#{'201606': [datetime.date(2016, 6, 1), datetime.date(2016, 6, 30)],
# '201607': [datetime.date(2016, 7, 1), datetime.date(2016, 7, 31)],
# '201608': [datetime.date(2016, 8, 1), datetime.date(2016, 8, 31)],
# ....
# '202003': [datetime.date(2020, 3, 1), datetime.date(2020, 3, 31)],
# '202004': [datetime.date(2020, 4, 1), datetime.date(2020, 4, 30)]}
Method-1: Using Spark SQL:
# create TempView `df_table`
df.createOrReplaceTempView('df_table')
# SQL snippet to calculate new column
stmt = '''
IF(d2 < "{0}" OR d1 > "{1}"
, 0
, DATEDIFF(LEAST(d2, to_date("{1}")), GREATEST(d1, to_date("{0}")))
+ IF(d1 BETWEEN "{0}" AND "{1}", 0, 1)
) AS `{2}`
'''
# set up the SQL field list
sql_fields_list = [
'id_'
, 'p'
, *[ stmt.format(range_dict[n][0], range_dict[n][1], n) for n in sorted(range_dict.keys()) ]
]
# create SQL statement
sql_stmt = 'SELECT {} FROM df_table'.format(', '.join(sql_fields_list))
# run the Spark SQL:
df_new = spark.sql(sql_stmt)
Method-2: Using dataframe API functions:
from pyspark.sql.functions import when, col, greatest, least, lit, datediff
df_new = df.select(
'id_'
, 'p'
, *[
when((col('d2') < range_dict[n][0]) | (col('d1') > range_dict[n][1]), 0).otherwise(
datediff(least('d2', lit(range_dict[n][1])), greatest('d1', lit(range_dict[n][0])))
+ when(col('d1').between(range_dict[n][0], range_dict[n][1]), 0).otherwise(1)
).alias(n)
for n in sorted(range_dict.keys())
]
)
If you want to avoid pandas completely (which brings the data back to the driver), then a pure PySpark-based solution can be:
from pyspark.sql import functions as psf
# Assumption made: your dataframe's name is : sample_data and has id, p, d1, d2 columns.
# Add month and days left column using pyspark functions
# I have kept a row id as well just to ensure that if you have duplicates in your data on the keys then it would still be able to handle it - no obligations though
# Note: 'yyyyMM' (calendar year) is the intended format here; 'YYYY' is the
# week-based year, which is what maps 2018-12-31 to 201912 in the outputs below.
data = sample_data.select("id", "p",
                          psf.monotonically_increasing_id().alias("row_id"),
                          psf.date_format("d2", 'yyyyMM').alias("d2_month"),
                          psf.dayofmonth("d2").alias("d2_id"),
                          psf.date_format("d1", 'yyyyMM').alias("d1_month"),
                          psf.datediff(psf.last_day("d1"), sample_data["d1"]).alias("d1_id"))
data.show(5, False)
Result:
+---+---+-----------+--------+-----+--------+-----+
|id |p |row_id |d2_month|d2_id|d1_month|d1_id|
+---+---+-----------+--------+-----+--------+-----+
|1 |A |8589934592 |201810 |26 |201809 |4 |
|2 |B |25769803776|201807 |19 |201806 |9 |
|2 |B |34359738368|201810 |7 |201808 |18 |
|2 |B |51539607552|201902 |27 |201912 |0 |
|2 |B |60129542144|201906 |25 |201905 |3 |
+---+---+-----------+--------+-----+--------+-----+
only showing top 5 rows
Then you can split the dataframe and pivot it:
####
# Create two separate dataframes by pivoting on d1_month and d2_month
####
df1 = data.groupby(["id", "p", "row_id"]).pivot("d1_month").max("d1_id")
df2 = data.groupby(["id", "p", "row_id"]).pivot("d2_month").max("d2_id")
df1.show(5, False), df2.show(5, False)
Result:
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201806|201808|201809|201812|201901|201905|201912|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |16 |null |null |null |null |null |
|2 |B |51539607552 |null |null |null |null |null |null |0 |
|3 |C |77309411328 |15 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |28 |null |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
+---+---+------------+------+------+------+------+------+------+------+
|id |p |row_id |201807|201809|201810|201902|201903|201906|201907|
+---+---+------------+------+------+------+------+------+------+------+
|3 |C |85899345920 |null |null |9 |null |null |null |null |
|2 |B |51539607552 |null |null |null |27 |null |null |null |
|3 |C |77309411328 |13 |null |null |null |null |null |null |
|3 |C |103079215104|null |null |null |null |12 |null |null |
|4 |A |128849018880|null |null |null |null |1 |null |null |
+---+---+------------+------+------+------+------+------+------+------+
only showing top 5 rows
Join back and get your result:
result = df1.join(df2, on=["id", "p","row_id"])\
.select([psf.coalesce(df1[x_], df2[x_]).alias(x_)
if (x_ in df1.columns) and (x_ in df2.columns) else x_
for x_ in set(df1.columns + df2.columns)])\
.orderBy("row_id").drop("row_id")
result.na.fill(0).show(5, False)
Result:
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|201906|201907|201912|201901|201810|p |201812|201905|201902|201903|201809|201808|201807|201806|id |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
|0 |0 |0 |0 |26 |A |0 |0 |0 |0 |4 |0 |0 |0 |1 |
|0 |0 |0 |0 |0 |B |0 |0 |0 |0 |0 |0 |19 |9 |2 |
|0 |0 |0 |0 |7 |B |0 |0 |0 |0 |0 |18 |0 |0 |2 |
|0 |0 |0 |0 |0 |B |0 |0 |27 |0 |0 |0 |0 |0 |2 |
|25 |0 |0 |0 |0 |B |0 |3 |0 |0 |0 |0 |0 |0 |2 |
+------+------+------+------+------+---+------+------+------+------+------+------+------+------+---+
only showing top 5 rows
I have two data frames like below.
df = spark.createDataFrame(sc.parallelize([[1,1,2],[1,2,9],[2,1,2],[2,2,1],
[4,1,5],[4,2,6],[5,1,3],[5,2,8]]), ["sid","cid","Cr"])
df.show()
+---+---+---+
|sid|cid| Cr|
+---+---+---+
| 1| 1| 2|
| 1| 2| 9|
| 2| 1| 2|
| 2| 2| 1|
| 4| 1| 5|
| 4| 2| 6|
| 5| 1| 3|
| 5| 2| 8|
+---+---+---+
Next, I have created df1 like below.
df1 = spark.createDataFrame(sc.parallelize([[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,1],[5,2],[5,3]]), ["sid","cid"])
df1.show()
+---+---+
|sid|cid|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 1|
| 2| 2|
| 2| 3|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 1|
| 5| 2|
| 5| 3|
+---+---+
Now I want my final output to be like below, i.e. if the data is present, i.e.
if (df1.sid==df.sid)&(df1.cid==df.cid), then the flag value is 1, else 0,
and missing Cr values will be 0.
+---+---+---+----+
|sid|cid| Cr|flag|
+---+---+---+----+
| 1| 1| 2| 1 |
| 1| 2| 9| 1 |
| 1| 3| 0| 0 |
| 2| 1| 2| 1 |
| 2| 2| 1| 1 |
| 2| 3| 0| 0 |
| 4| 1| 5| 1 |
| 4| 2| 6| 1 |
| 4| 3| 0| 0 |
| 5| 1| 3| 1 |
| 5| 2| 8| 1 |
| 5| 3| 0| 0 |
+---+---+---+----+
Please help me with this.
With data:
from pyspark.sql.functions import col, when, lit, coalesce
df = spark.createDataFrame(
[(1, 1, 2), (1, 2, 9), (2, 1, 2), (2, 2, 1), (4, 1, 5), (4, 2, 6), (5, 1, 3), (5, 2, 8)],
("sid", "cid", "Cr"))
df1 = spark.createDataFrame(
[[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,1],[5,2],[5,3]],
["sid","cid"])
outer join:
joined = (df.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"rightouter"))
and select
joined.select(
col("df1.*"),
coalesce(col("Cr"), lit(0)).alias("Cr"),
col("df.sid").isNotNull().cast("integer").alias("flag")
).orderBy("sid", "cid").show()
# +---+---+---+----+
# |sid|cid| Cr|flag|
# +---+---+---+----+
# | 1| 1| 2| 1|
# | 1| 2| 9| 1|
# | 1| 3| 0| 0|
# | 2| 1| 2| 1|
# | 2| 2| 1| 1|
# | 2| 3| 0| 0|
# | 4| 1| 5| 1|
# | 4| 2| 6| 1|
# | 4| 3| 0| 0|
# | 5| 1| 3| 1|
# | 5| 2| 8| 1|
# | 5| 3| 0| 0|
# +---+---+---+----+
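Equivalently (a stylistic alternative), you can start from df1 and left-join df, since df1 already defines the full (sid, cid) grid:
result = (df1.join(df, ["sid", "cid"], "left")
             .withColumn("flag", col("Cr").isNotNull().cast("integer"))  # 1 where a matching row exists in df
             .na.fill(0, subset=["Cr"])                                  # missing Cr becomes 0
             .orderBy("sid", "cid"))
result.show()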
I am using pyspark 2.1. Below are my input dataframes. I am stuck on taking dynamic offset values from a different dataframe; please help.
df1=
category value
1 3
2 2
4 5
df2
category year month weeknumber lag_attribute runs
1 0 0 0 0 2
1 2019 1 1 1 0
1 2019 1 2 2 0
1 2019 1 3 3 0
1 2019 1 4 4 1
1 2019 1 5 5 2
1 2019 1 6 6 3
1 2019 1 7 7 4
1 2019 1 8 8 5
1 2019 1 9 9 6
2 0 0 0 9 0
2 2018 1 1 2 0
2 2018 1 2 3 2
2 2018 1 3 4 3
2 2018 1 3 5 4
As shown in the above example, df1 is my lookup table which holds the offset values: for category 1 the offset value is 3 and for category 2 the offset value is 2.
In df2, runs is my output column, so for every category value in df1, if the lag value is 3, then df2 should take lag_attribute and lag it down by 3 rows; hence you can see that the runs repeat every 3 values of lag_attribute.
I tried the code below but it didn't work. Please help.
df1=df1.registerTempTable("df1")
df2=df2.registerTempTable("df2")
sqlCtx.sql("select st.category,st.Year,st.Month,st.weekyear,st.lag_attribute,LAG(st.lag_attribute,df1.value, 0) OVER (PARTITION BY st.cagtegory ORDER BY st.Year,st.Month,st.weekyear) as return_test from df1 st,df2 lkp where df1.category=df2.category")
Please help me to cross this hurdle
lag takes in a column object and an integer (python integer), as shown in the function's signature:
Signature: psf.lag(col, count=1, default=None)
The value for count cannot be a pyspark column object; it has to be a plain Python integer. There are workarounds though; let's start with the sample data:
df1 = spark.createDataFrame([[1, 3],[2, 2],[4, 5]], ["category", "value"])
df2 = spark.createDataFrame([[1, 0, 0, 0, 0, 2],[1, 2019, 1, 1, 1, 0],[1, 2019, 1, 2, 2, 0],[1, 2019, 1, 3, 3, 0],
[1, 2019, 1, 4, 4, 1],[1, 2019, 1, 5, 5, 2],[1, 2019, 1, 6, 6, 3],[1, 2019, 1, 7, 7, 4],
[1, 2019, 1, 8, 8, 5],[1, 2019, 1, 9, 9, 6],[2, 0, 0, 0, 9, 0],[2, 2018, 1, 1, 2, 0],
[2, 2018, 1, 2, 3, 2],[2, 2018, 1, 3, 4, 3],[2, 2018, 1, 3, 5, 4]],
["category", "year", "month", "weeknumber", "lag_attribute", "runs"])
What you could do, if df1 is not too big (meaning a small number of categories and potentially a lot of values in each category), is convert df1 to a list and create an if-elif-elif... condition based on its values:
list1 = df1.collect()
sc.broadcast(list1)
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
cond = eval('psf' + ''.join(['.when(df2.category == ' + str(c) + ', psf.lag("lag_attribute", ' + str(l) + ', 0).over(w))' for c, l in list1]))
Note: this is if c and l are integers; if they are strings, then:
cond = eval('psf' + ''.join(['.when(df2.category == "' + str(c) + '", psf.lag("lag_attribute", "' + str(l) + '", 0).over(w))' for c, l in list1]))
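If you prefer to avoid eval, the same chained condition can be built with a plain loop (a sketch, reusing list1 and w from above):
cond = None
for c, l in list1:
    lagged = psf.lag("lag_attribute", int(l), 0).over(w)
    cond = psf.when(df2.category == c, lagged) if cond is None \
        else cond.when(df2.category == c, lagged)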
Now we can apply the condition:
df2.select("*", cond.alias("return_test")).show()
+--------+----+-----+----------+-------------+----+-----------+
|category|year|month|weeknumber|lag_attribute|runs|return_test|
+--------+----+-----+----------+-------------+----+-----------+
| 1| 0| 0| 0| 0| 2| 0|
| 1|2019| 1| 1| 1| 0| 0|
| 1|2019| 1| 2| 2| 0| 0|
| 1|2019| 1| 3| 3| 0| 0|
| 1|2019| 1| 4| 4| 1| 1|
| 1|2019| 1| 5| 5| 2| 2|
| 1|2019| 1| 6| 6| 3| 3|
| 1|2019| 1| 7| 7| 4| 4|
| 1|2019| 1| 8| 8| 5| 5|
| 1|2019| 1| 9| 9| 6| 6|
| 2| 0| 0| 0| 9| 0| 0|
| 2|2018| 1| 1| 2| 0| 0|
| 2|2018| 1| 2| 3| 2| 9|
| 2|2018| 1| 3| 4| 3| 2|
| 2|2018| 1| 3| 5| 4| 3|
+--------+----+-----+----------+-------------+----+-----------+
If df1 is big, then you can self-join df2 on a computed row-number column:
First we'll bring the values from df1 to df2 using a join:
df = df2.join(df1, "category")
if df1 is not too big, you should broadcast it:
import pyspark.sql.functions as psf
df = df2.join(psf.broadcast(df1), "category")
Now we'll enumerate the rows in each partition and build a lag column:
from pyspark.sql import Window
w = Window.partitionBy("category").orderBy("year", "month", "weeknumber")
left = df.withColumn('rn', psf.row_number().over(w))
right = left.select("category", (left.rn + left.value).alias("rn"), left.lag_attribute.alias("return_test"))  # keep "category" so the join below can use it
left.join(right, ["category", "rn"], "left")\
.na.fill(0)\
.sort("category", "rn").show()
+--------+---+----+-----+----------+-------------+----+-----+-----------+
|category| rn|year|month|weeknumber|lag_attribute|runs|value|return_test|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
| 1| 1| 0| 0| 0| 0| 2| 3| 0|
| 1| 2|2019| 1| 1| 1| 0| 3| 0|
| 1| 3|2019| 1| 2| 2| 0| 3| 0|
| 1| 4|2019| 1| 3| 3| 0| 3| 0|
| 1| 5|2019| 1| 4| 4| 1| 3| 1|
| 1| 6|2019| 1| 5| 5| 2| 3| 2|
| 1| 7|2019| 1| 6| 6| 3| 3| 3|
| 1| 8|2019| 1| 7| 7| 4| 3| 4|
| 1| 9|2019| 1| 8| 8| 5| 3| 5|
| 1| 10|2019| 1| 9| 9| 6| 3| 6|
| 2| 1| 0| 0| 0| 9| 0| 2| 0|
| 2| 2|2018| 1| 1| 2| 0| 2| 0|
| 2| 3|2018| 1| 2| 3| 2| 2| 9|
| 2| 4|2018| 1| 3| 4| 3| 2| 2|
| 2| 5|2018| 1| 3| 5| 4| 2| 3|
+--------+---+----+-----+----------+-------------+----+-----+-----------+
Note: There is a problem with your runs lag values; for category=2, for instance, it is only lagging by 1 instead of 2. Also, some lines have the same ordering keys (e.g. the last two lines in your sample dataframe df2 have the same category, year, month and weeknumber); since shuffling is involved, you might get different results every time you run the code.