I have a dataframe like the one shown below:
df = pd.DataFrame({
'subject_id':[1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
'readings' : ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
'val' :[5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})
The code below works fine in pandas (thanks to Jezrael), but when I apply it to real data (more than 4M records) it runs for a long time, so I am trying to use PySpark instead. Please note I have already tried Dask, Modin, and Pandarallel, which offer pandas-like APIs for large-scale processing, but they didn't help either. What the code below does is generate the summary statistics for each subject and each reading; the expected output described further down should give you an idea.
df_op = (df.groupby(['subject_id','readings'])['val']
.describe()
.unstack()
.swaplevel(0,1,axis=1)
.reindex(df['readings'].unique(), axis=1, level=0))
df_op.columns = df_op.columns.map('_'.join)
df_op = df_op.reset_index()
Can you help me achieve the above operation in PySpark? When I tried the code below, it threw an error:
df.groupby(['subject_id','readings'])['val']
For example, subject_id = 1 has 4 readings but only 3 unique readings, so we get 3 * 8 = 24 columns for subject_id = 1. Why 8? Because the statistics are min, max, count, std, mean, 25th percentile, 50th percentile, and 75th percentile. Hope this helps.
This is the error PySpark returns:
TypeError: 'GroupedData' object is not subscriptable
I expect my output to be a wide table with one row per subject_id and one set of summary-statistic columns per reading, as described above.
You need to group by and get the statistics for each reading first, and then pivot to get the expected outcome.
import pyspark.sql.functions as F
agg_df = df.groupby("subject_id", "readings").agg(F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
F.count(F.col("val")),
F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))
This will give you the following output:
+----------+--------+--------+--------+--------+----------+-----------+-----------+
|subject_id|readings|avg(val)|min(val)|max(val)|count(val)|quantile_25|quantile_75|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
| 2| READ_1| 5.0| 5| 5| 1| 5| 5|
| 2| READ_5| 7.0| 7| 7| 1| 7| 7|
| 2| READ_8| 12.0| 12| 12| 1| 12| 12|
| 4| READ_08| 43.0| 43| 43| 1| 43| 43|
| 1| READ_2| 6.0| 6| 6| 1| 6| 6|
| 1| READ_1| 6.0| 5| 7| 2| 5| 7|
| 2| READ_6| 16.0| 16| 16| 1| 16| 16|
| 1| READ_3| 11.0| 11| 11| 1| 11| 11|
| 4| READ_11| 32.0| 32| 32| 1| 32| 32|
| 3| READ_10| 13.0| 13| 13| 1| 13| 13|
| 3| READ_12| 56.0| 56| 56| 1| 56| 56|
| 4| READ_14| 13.0| 13| 13| 1| 13| 13|
| 4| READ_07| 46.0| 46| 46| 1| 46| 46|
| 4| READ_09| 45.0| 45| 45| 1| 45| 45|
+----------+--------+--------+--------+--------+----------+-----------+-----------+
If you group by subject_id and pivot on readings, you will get the expected output:
agg_df2 = df.groupby("subject_id").pivot("readings").agg(F.mean(F.col("val")), F.min(F.col("val")), F.max(F.col("val")),
F.count(F.col("val")),
F.expr('percentile_approx(val, 0.25)').alias("quantile_25"),
F.expr('percentile_approx(val, 0.75)').alias("quantile_75"))
for i in agg_df2.columns:
    agg_df2 = agg_df2.withColumnRenamed(i, i.replace("(val)", ""))
agg_df2.show()
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
|subject_id|READ_07_avg(val)|READ_07_min(val)|READ_07_max(val)|READ_07_count(val)|READ_07_quantile_25|READ_07_quantile_75|READ_08_avg(val)|READ_08_min(val)|READ_08_max(val)|READ_08_count(val)|READ_08_quantile_25|READ_08_quantile_75|READ_09_avg(val)|READ_09_min(val)|READ_09_max(val)|READ_09_count(val)|READ_09_quantile_25|READ_09_quantile_75|READ_1_avg(val)|READ_1_min(val)|READ_1_max(val)|READ_1_count(val)|READ_1_quantile_25|READ_1_quantile_75|READ_10_avg(val)|READ_10_min(val)|READ_10_max(val)|READ_10_count(val)|READ_10_quantile_25|READ_10_quantile_75|READ_11_avg(val)|READ_11_min(val)|READ_11_max(val)|READ_11_count(val)|READ_11_quantile_25|READ_11_quantile_75|READ_12_avg(val)|READ_12_min(val)|READ_12_max(val)|READ_12_count(val)|READ_12_quantile_25|READ_12_quantile_75|READ_14_avg(val)|READ_14_min(val)|READ_14_max(val)|READ_14_count(val)|READ_14_quantile_25|READ_14_quantile_75|READ_2_avg(val)|READ_2_min(val)|READ_2_max(val)|READ_2_count(val)|READ_2_quantile_25|READ_2_quantile_75|READ_3_avg(val)|READ_3_min(val)|READ_3_max(val)|READ_3_count(val)|READ_3_quantile_25|READ_3_quantile_75|READ_5_avg(val)|READ_5_min(val)|READ_5_max(val)|READ_5_count(val)|READ_5_quantile_25|READ_5_quantile_75|READ_6_avg(val)|READ_6_min(val)|READ_6_max(val)|READ_6_count(val)|READ_6_quantile_25|READ_6_quantile_75|READ_8_avg(val)|READ_8_min(val)|READ_8_max(val)|READ_8_count(val)|READ_8_quantile_25|READ_8_quantile_75|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
| 1| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 5| 7| 2| 5| 7| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 6.0| 6| 6| 1| 6| 6| 11.0| 11| 11| 1| 11| 11| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 3| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| 56.0| 56| 56| 1| 56| 56| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| 2| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 5.0| 5| 5| 1| 5| 5| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| 7.0| 7| 7| 1| 7| 7| 16.0| 16| 16| 1| 16| 16| 12.0| 12| 12| 1| 12| 12|
| 4| 46.0| 46| 46| 1| 46| 46| 43.0| 43| 43| 1| 43| 43| 45.0| 45| 45| 1| 45| 45| null| null| null| null| null| null| null| null| null| null| null| null| 32.0| 32| 32| 1| 32| 32| null| null| null| null| null| null| 13.0| 13| 13| 1| 13| 13| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
+----------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+----------------+----------------+----------------+------------------+-------------------+-------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+---------------+---------------+---------------+-----------------+------------------+------------------+
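Note that pandas describe() also reports the standard deviation and the 50th percentile, which the aggregations above leave out. A sketch of how they could be added to the same pivoted aggregation (percentile_approx is an approximation, unlike the exact pandas quantiles; aliases keep the column names short):
import pyspark.sql.functions as F

agg_df3 = df.groupby("subject_id").pivot("readings").agg(
    F.mean("val").alias("mean"),
    F.stddev("val").alias("std"),
    F.min("val").alias("min"),
    F.max("val").alias("max"),
    F.count("val").alias("count"),
    F.expr("percentile_approx(val, 0.25)").alias("quantile_25"),
    F.expr("percentile_approx(val, 0.5)").alias("quantile_50"),
    F.expr("percentile_approx(val, 0.75)").alias("quantile_75"),
)
# "(val)" no longer appears in the column names because every aggregation is aliased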
I wanted to re-label healthy examples (0) as failures (1) for the 2 days before the actual failure, for all serial numbers in the failure column. Here is my code:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark3.2show').getOrCreate()
print('Spark info :')
spark
url="https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df=spark.read.csv(SparkFiles.get("failure.csv"), header=True,sep='\t')
I wanted to re-label those 0 rows as 1. Also, serial number C was mistakenly present in the database as healthy even after the actual failure.
I would recast the date column as a Timestamp because this will allow you to take the difference between any two Timestamps, which we will need to do.
Then you can create a new column called failure_dates that contains the date whenever a failure occurs, and is null otherwise.
Next, create a new column called 2_days_to_failure: partition by serial_number and take the difference between the max value of the failure_dates column and each date inside the partition to get the number of days to failure, flagging rows that are 2 days or fewer from failure.
Finally, we can create a column called failure_relabeled by combining the information from the 2_days_to_failure column and the original failure column.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
window = Window.partitionBy("serial_number")
df.withColumn(
'date', F.to_timestamp(F.col('date'), 'M/d/yyyy')
).withColumn(
"failure_dates", F.when(F.col('failure') == 1, F.col('date'))
).withColumn(
"2_days_to_failure", F.datediff(F.max(F.col('failure_dates')).over(window), F.col('date')) <= 2
).withColumn(
"failure_relabeled", F.when((F.col('2_days_to_failure') | (F.col('failure') == 1)), F.lit(1)).otherwise(F.lit(0))
).orderBy('serial_number','date').show()
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+
| date|serial_number|failure|smart_5_raw|smart_187_raw| failure_dates|2_days_to_failure|failure_relabeled|
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+
|2014-01-01 00:00:00| A| 0| 0| 60| null| false| 0|
|2014-01-02 00:00:00| A| 0| 0| 180| null| false| 0|
|2014-01-03 00:00:00| A| 0| 0| 140| null| true| 1|
|2014-01-04 00:00:00| A| 0| 0| 280| null| true| 1|
|2014-01-05 00:00:00| A| 1| 0| 400|2014-01-05 00:00:00| true| 1|
|2014-01-01 00:00:00| B| 0| 0| 40| null| null| 0|
|2014-01-02 00:00:00| B| 0| 0| 160| null| null| 0|
|2014-01-03 00:00:00| B| 0| 0| 100| null| null| 0|
|2014-01-04 00:00:00| B| 0| 0| 320| null| null| 0|
|2014-01-05 00:00:00| B| 0| 0| 340| null| null| 0|
|2014-01-06 00:00:00| B| 0| 0| 400| null| null| 0|
|2014-01-01 00:00:00| C| 0| 0| 80| null| true| 1|
|2014-01-02 00:00:00| C| 0| 0| 200| null| true| 1|
|2014-01-03 00:00:00| C| 1| 0| 120|2014-01-03 00:00:00| true| 1|
|2014-01-04 00:00:00| D| 0| 0| 300| null| null| 0|
|2014-01-05 00:00:00| D| 0| 0| 360| null| null| 0|
+-------------------+-------------+-------+-----------+-------------+-------------------+-----------------+-----------------+
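If you only need the relabeled flag, the two helper columns can be dropped at the end. A small usage note (df_relabeled is an illustrative name, assuming the chained result above is assigned to a variable instead of calling .show() directly):
# drop the intermediate helper columns once failure_relabeled has been computed
df_relabeled = df_relabeled.drop('failure_dates', '2_days_to_failure')
df_relabeled.show()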
I am trying to generate multiple lags of the 'status' variable in the dataframe below. The data I have is a time series and it is possible to have gaps within it. What I am trying to do is generate the lags, but when there is a gap, the value should be set to missing/null.
Input DF:
+---+----------+------+
| id| s_date|status|
+---+----------+------+
|123|2007-01-31| 1|
|123|2007-02-28| 1|
|123|2007-03-31| 2|
|123|2007-04-30| 2|
|123|2007-05-31| 1|
|123|2007-06-30| 1|
|123|2007-07-31| 2|
|123|2007-08-31| 2|
|345|2007-08-31| 3|
|123|2007-09-30| 2|
|345|2007-09-30| 2|
|123|2007-10-31| 1|
|345|2007-10-31| 1|
|123|2007-11-30| 1|
|345|2007-11-30| 2|
|123|2008-01-31| 3|
|345|2007-12-31| 2|
|567|2007-12-31| 3|
|123|2008-03-31| 4|
|345|2008-01-31| 2|
+---+----------+------+
from datetime import date
rdd = sc.parallelize([
[123,date(2007,1,31),1],
[123,date(2007,2,28),1],
[123,date(2007,3,31),2],
[123,date(2007,4,30),2],
[123,date(2007,5,31),1],
[123,date(2007,6,30),1],
[123,date(2007,7,31),2],
[123,date(2007,8,31),2],
[345,date(2007,8,31),3],
[123,date(2007,9,30),2],
[345,date(2007,9,30),2],
[123,date(2007,10,31),1],
[345,date(2007,10,31),1],
[123,date(2007,11,30),1],
[345,date(2007,11,30),2],
[123,date(2008,1,31),3],
[345,date(2007,12,31),2],
[567,date(2007,12,31),3],
[123,date(2008,3,31),4],
[345,date(2008,1,31),2],
[567,date(2008,1,31),3],
[123,date(2008,4,30),3],
[123,date(2008,5,31),2],
[123,date(2008,6,30),1]
])
df = rdd.toDF(['id','s_date','status'])
df.show()
# Below is the code that works
import pyspark.sql.functions as fn
from pyspark.sql.window import Window
w = Window().partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    df = df.withColumn(f"lag_{i}", fn.lag(fn.col('status'), i).over(w))\
           .withColumn(f"lag_month_{i}", fn.lag(fn.col('s_date'), i).over(w))\
           .withColumn(f"lag_status_{i}", fn.expr(
               f"case when {i} = 1 and (last_day(add_months(lag_month_{i}, 1)) = last_day(s_date)) "
               f"then lag_{i} else null end"))
In the code above, the column lag_status_1 is correctly populated for i = 1. This column is null for Jan and March 2008, which is what I want for every lag. However, when I add the lines below to handle the other lags (i.e. lag_2, lag_3, etc.), the code does not work.
.withColumn(f"lag_status_{i}", fn.expr("case when "f"{i} = 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day(s_date)) then "f"lag_{i}"" " +
"when "f"{i} > 1 and (last_day(add_months("f"lag_month_{i}"",1)) = last_day("f"lag_month_{i-1})) then "f"lag_{i}"" else null end"))\
Output DF: lag_status_2 and lag_status_3 are null, but I would like to populate them with the same logic as lag_status_1 (only with respect to their own lag).
+---+----------+------+------------+------------+------------+
| id|    s_date|status|lag_status_1|lag_status_2|lag_status_3|
+---+----------+------+------------+------------+------------+
|123|2007-01-31|     1|        null|        null|        null|
|123|2007-02-28|     1|           1|        null|        null|
|123|2007-03-31|     2|           1|        null|        null|
|123|2007-04-30|     2|           2|        null|        null|
|123|2007-05-31|     1|           2|        null|        null|
|123|2007-06-30|     1|           1|        null|        null|
|123|2007-07-31|     2|           1|        null|        null|
|123|2007-08-31|     2|           2|        null|        null|
|123|2007-09-30|     2|           2|        null|        null|
|123|2007-10-31|     1|           2|        null|        null|
|123|2007-11-30|     1|           1|        null|        null|
|123|2008-01-31|     3|        null|        null|        null|
|123|2008-03-31|     4|        null|        null|        null|
|123|2008-04-30|     3|           4|        null|        null|
|123|2008-05-31|     2|           3|        null|        null|
|123|2008-06-30|     1|           2|        null|        null|
|345|2007-08-31|     3|        null|        null|        null|
|345|2007-09-30|     2|           3|        null|        null|
|345|2007-10-31|     1|           2|        null|        null|
|345|2007-11-30|     2|           1|        null|        null|
+---+----------+------+------------+------------+------------+
Can you please guide me on how to resolve this? If there is a better or more efficient solution, please suggest it.
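One way to express the same gap check for every lag without hand-building the SQL string is to use fn.when together with last_day and add_months directly. A minimal sketch, mirroring the conditions described above (untested against the full data; fn and w are as defined in the question):
import pyspark.sql.functions as fn
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('s_date')
for i in range(1, 25):
    # reference date for the "no gap" check: s_date for the first lag,
    # otherwise the month column of the previous lag
    prev_date = fn.col('s_date') if i == 1 else fn.col(f'lag_month_{i-1}')
    df = (df
          .withColumn(f'lag_month_{i}', fn.lag('s_date', i).over(w))
          .withColumn(
              f'lag_status_{i}',
              fn.when(
                  fn.last_day(fn.add_months(fn.col(f'lag_month_{i}'), 1)) == fn.last_day(prev_date),
                  fn.lag('status', i).over(w)
              )  # left null otherwise, i.e. when there is a month gap
          ))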
I have an array of dates:
date_set = ["2019-01-01", "2019-02-01", "2019-03-01"....."2020-01-01"]
and I have this dataframe:
+-------+----+-----+
|   DATE|  ID|VALUE|
+-------+----+-----+
|2019-04|1234|100.0|
|2019-05|4567|200.0|
+-------+----+-----+
For each element of my list, I have to apply the following transformations:
for date in date_set:
    target = date - relativedelta(months=+6)
    dfTemp = df.where(
        (F.col("DATE") <= date) &
        (F.col("DATE") >= target)
    ).groupBy("ID").agg(F.sum("VALUE").alias("VALUE"))
I want to avoid this for loop. How can I do it in an efficient way?
Make a dataframe of your date list, cross join it to your dataframe, and then use window functions to get the rolling sum over the past 6 months:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df.show()
+-------+----+-----+
| DATE| ID|VALUE|
+-------+----+-----+
|2019-04|1234|100.0|
|2019-05|4567|200.0|
+-------+----+-----+
date_set = ["2019-%02d-01" % i for i in range(1,13)] + ["2020-01-01"]
dates = spark.createDataFrame([[i] for i in date_set]).toDF('date')
dates_id = dates.crossJoin(df.select('ID').distinct())
df2 = df.withColumn(
'date',
F.to_date('DATE', 'yyyy-MM')
).join(
dates_id,
['date', 'ID'],
'right'
).withColumn(
'deltamonth',
F.month('date') - F.min(F.month('date')).over(Window.partitionBy('id'))
).withColumn(
'sum_value',
F.sum('value').over(
Window.partitionBy('id').orderBy('deltamonth').rangeBetween(-6,0)
)
).drop('deltamonth').orderBy('date', 'ID')
Results:
df2.show(50)
+----------+----+-----+---------+
| date| ID|VALUE|sum_value|
+----------+----+-----+---------+
|2019-01-01|1234| null| null|
|2019-01-01|4567| null| null|
|2019-02-01|1234| null| null|
|2019-02-01|4567| null| null|
|2019-03-01|1234| null| null|
|2019-03-01|4567| null| null|
|2019-04-01|1234|100.0| 100.0|
|2019-04-01|4567| null| null|
|2019-05-01|1234| null| 100.0|
|2019-05-01|4567|200.0| 200.0|
|2019-06-01|1234| null| 100.0|
|2019-06-01|4567| null| 200.0|
|2019-07-01|1234| null| 100.0|
|2019-07-01|4567| null| 200.0|
|2019-08-01|1234| null| 100.0|
|2019-08-01|4567| null| 200.0|
|2019-09-01|1234| null| 100.0|
|2019-09-01|4567| null| 200.0|
|2019-10-01|1234| null| 100.0|
|2019-10-01|4567| null| 200.0|
|2019-11-01|1234| null| null|
|2019-11-01|4567| null| 200.0|
|2019-12-01|1234| null| null|
|2019-12-01|4567| null| null|
|2020-01-01|1234| null| null|
|2020-01-01|4567| null| null|
+----------+----+-----+---------+
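One caveat with the approach above: deltamonth is derived from F.month alone, so it wraps around at calendar-year boundaries (January 2020 gets the same month number as January 2019). If the date list spans more than one year, a sketch of a year-safe variant is to compute the offset with months_between instead (same dates_id cross join as above):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = df.withColumn(
    'date',
    F.to_date('DATE', 'yyyy-MM')
).join(
    dates_id,
    ['date', 'ID'],
    'right'
).withColumn(
    'deltamonth',
    # months_between keeps counting across year boundaries, unlike month()
    F.months_between('date', F.min('date').over(Window.partitionBy('id'))).cast('int')
).withColumn(
    'sum_value',
    F.sum('value').over(
        Window.partitionBy('id').orderBy('deltamonth').rangeBetween(-6, 0)
    )
).drop('deltamonth').orderBy('date', 'ID')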
Sample of data:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customtargeting |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9 |
|nocid=no;store=3084;tppid=4cd36fde-c59a-41d2-a2b4-b731b6cfbe05 |
|nocid=no;tppid=c688c1be-a9c5-47a2-8c09-aef175a19847 |
|nocid=yes;search=washing liquid;store=3060 |
|pos=top;tppid=278bab7b-d40b-4783-8f89-bef94a9f5150 |
|pos=top;tppid=00bb87fa-f3f5-4b0e-bbf8-16079a1a5efe |
|nocid=no;shelf=cleanser-toner-and-face-mask;store=2019;tppid=84006d41-eb63-4ae1-8c3c-3ac9436d446c |
|pos=top;tppid=ed02b037-066b-46bd-99e6-d183160644a2 |
|nocid=yes;search=salad;store=3060 |
|pos=top;nocid=no;store=2882;tppid=164563e4-8e5c-4366-a5a8-438ffb10da9d |
|nocid=yes;search=beer;store=3060 |
|nocid=no;search=washing capsules;store=5528;tppid=4f9b99eb-65ff-4fbc-b11c-b0552b7f158d |
|pos=right;tppid=ddb54247-a5c9-40a0-9f99-8412d8542b4c |
|nocid=yes;search=bedding;store=3060 |
|pos=top |
|pos=mpu1;keywords=helium canisters;keywords=tesco.com;keywords=helium canisters reviews;keywords=tesco;keywords=helium canisters uk;keywords=balloons;pagetype=category|
I want to convert a PySpark dataframe column to a map type. The column is a string that can contain any number of key=value pairs, and some keys appear multiple times; for those keys I want the values collected into an array as the value for the key.
Try this,
import pyspark.sql.functions as F
from pyspark.sql.types import *
def convert_to_json(_str):
    _split_str = [tuple(x.split('=')) for x in _str.split(';') if len(tuple(x.split('='))) == 2]
    _json = {}
    for k, v in _split_str:
        if k in _json:
            _json[k].append(v)
        else:
            _json[k] = [v]
    return _json
convert_udf = F.udf(convert_to_json, MapType(StringType(),ArrayType(StringType())))
df = df.withColumn('customtargeting', convert_udf('customtargeting'))
print(df.schema)
print(df.limit(5).collect())
This gives you the schema and output as,
StructType(List(StructField(
customtargeting,MapType(StringType,ArrayType(StringType,true),true),true)))
[Row(customtargeting={u'store': [u'2007'], u'tppid': [u'45c566dd-00d7-4193-b5c7-17843c2764e9'], u'nocid': [u'no']}),
Row(customtargeting={u'store': [u'3084'], u'tppid': [u'4cd36fde-c59a-41d2-a2b4-b731b6cfbe05'], u'nocid': [u'no']}),
Row(customtargeting={u'nocid': [u'no'], u'tppid': [u'c688c1be-a9c5-47a2-8c09-aef175a19847']}),
Row(customtargeting={u'search': [u'washing liquid'], u'nocid': [u'yes'], u'store': [u'3060']}),
Row(customtargeting={u'pos': [u'top'], u'tppid': [u'278bab7b-d40b-4783-8f89-bef94a9f5150']})]
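If every key appears at most once per row, Spark's built-in str_to_map SQL expression (callable through F.expr) can avoid the UDF entirely; note that it produces plain string values rather than arrays, so it does not cover the repeated keywords= case in the sample. A sketch:
import pyspark.sql.functions as F

# str_to_map(text, pairDelim, keyValueDelim) builds a map<string,string>
df = df.withColumn(
    'customtargeting_map',
    F.expr("str_to_map(customtargeting, ';', '=')")
)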
If you want to separate the columns and create a new dataframe, you can use pandas features. Find my solution below.
>>> import pandas as pd
>>>
>>> rdd = sc.textFile('/home/ali/text1.txt')
>>> rdd.first()
'nocid=no;store=2007;tppid=45c566dd-00d7-4193-b5c7-17843c2764e9'
>>> rddMap = rdd.map(lambda x: x.split(';'))
>>> rddMap.first()
['nocid=no', 'store=2007', 'tppid=45c566dd-00d7-4193-b5c7-17843c2764e9']
>>>
>>> df1 = pd.DataFrame()
>>> for rdd in rddMap.collect():
... a = {i.split('=')[0]:i.split('=')[1] for i in rdd}
... df2 = pd.DataFrame([a], columns=a.keys())
... df1 = pd.concat([df1, df2])
...
>>> df = spark.createDataFrame(df1.astype(str)).replace('nan',None)
>>> df.show()
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
|keywords|nocid|pagetype| pos| search| shelf|store| tppid|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
| null| no| null| null| null| null| 2007|45c566dd-00d7-419...|
| null| no| null| null| null| null| 3084|4cd36fde-c59a-41d...|
| null| no| null| null| null| null| null|c688c1be-a9c5-47a...|
| null| yes| null| null| washing liquid| null| 3060| null|
| null| null| null| top| null| null| null|278bab7b-d40b-478...|
| null| null| null| top| null| null| null|00bb87fa-f3f5-4b0...|
| null| no| null| null| null|cleanser-toner-an...| 2019|84006d41-eb63-4ae...|
| null| null| null| top| null| null| null|ed02b037-066b-46b...|
| null| yes| null| null| salad| null| 3060| null|
| null| no| null| top| null| null| 2882|164563e4-8e5c-436...|
| null| yes| null| null| beer| null| 3060| null|
| null| no| null| null|washing capsules| null| 5528|4f9b99eb-65ff-4fb...|
| null| null| null|right| null| null| null|ddb54247-a5c9-40a...|
| null| yes| null| null| bedding| null| 3060| null|
| null| null| null| top| null| null| null| null|
|balloons| null|category| mpu1| null| null| null| null|
+--------+-----+--------+-----+----------------+--------------------+-----+--------------------+
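For larger data, a sketch of a Spark-only alternative that avoids collecting everything to the driver is to explode the key=value pairs and pivot them back into columns (row_id is an illustrative helper column; F.first keeps one value per key, so swap in F.collect_list if you need every value, as in the keywords= case):
import pyspark.sql.functions as F

exploded = (df
    .withColumn('row_id', F.monotonically_increasing_id())   # remember which source row each pair came from
    .withColumn('pair', F.explode(F.split('customtargeting', ';')))
    .withColumn('key', F.split('pair', '=')[0])
    .withColumn('value', F.split('pair', '=')[1]))

wide = exploded.groupBy('row_id').pivot('key').agg(F.first('value'))
wide.show()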
I got this dataframe sample:
from pyspark.sql.types import *
schema = StructType([
StructField("ClientId", IntegerType(), True),
StructField("m_ant21", IntegerType(), True),
StructField("m_ant22", IntegerType(), True),
StructField("m_ant23", IntegerType(), True),
StructField("m_ant24", IntegerType(), True)])
df = sqlContext.createDataFrame(
data=[(0, None, None, None, None),
(1, 23, 13, 17, 99),
(2, 0, 0, 0, 1),
(3, 0, None, 1, 0),
(4, None, None, None, None)],
schema=schema)
I have this data frame:
+--------+-------+-------+-------+-------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|
+--------+-------+-------+-------+-------+
| 0| null| null| null| null|
| 1| 23| 13| 17| 99|
| 2| 0| 0| 0| 1|
| 3| 0| null| 1| 0|
| 4| null| null| null| null|
+--------+-------+-------+-------+-------+
And I need to solve this question:
I'd like to create a new variable that counts how many null values each row has. For example:
ClientId 0 should be 4
ClientId 1 should be 0
ClientId 3 should be 1
Note that df is a pyspark.sql.dataframe.DataFrame.
Here is one option:
from pyspark.sql import Row
# add the column schema to the original schema
schema.add(StructField("count_null", IntegerType(), True))
# convert data frame to rdd and append an element to each row to count the number of nulls
df.rdd.map(lambda row: row + Row(sum(x is None for x in row))).toDF(schema).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+
If you don't want to deal with schema, here is another option:
from pyspark.sql.functions import col, when
df.withColumn("count_null", sum([when(col(x).isNull(),1).otherwise(0) for x in df.columns])).show()
+--------+-------+-------+-------+-------+----------+
|ClientId|m_ant21|m_ant22|m_ant23|m_ant24|count_null|
+--------+-------+-------+-------+-------+----------+
| 0| null| null| null| null| 4|
| 1| 23| 13| 17| 99| 0|
| 2| 0| 0| 0| 1| 0|
| 3| 0| null| 1| 0| 1|
| 4| null| null| null| null| 4|
+--------+-------+-------+-------+-------+----------+