I am trying to find the difference between every two columns in a PySpark dataframe with 100+ columns. If there were fewer, I could manually create a new column each time with df.withColumn('delta', df.col1 - df.col2), but I am trying to do this in a more concise way. Any ideas?
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1   |5   |3   |9   |
+----+----+----+----+
Wanted output:
+------+------+
|delta1|delta2|
+------+------+
|4     |6     |
+------+------+
All you have to do is create a proper for loop to iterate through the list of columns and do your subtraction.
Sample data
df = spark.createDataFrame([
    (1, 4, 7, 8),
    (0, 5, 3, 9),
], ['c1', 'c2', 'c3', 'c4'])
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|1 |4 |7 |8 |
|0 |5 |3 |9 |
+---+---+---+---+
Loop through columns
from pyspark.sql import functions as F

# pair up adjacent columns and subtract each even-indexed column from the one after it
cols = []
for i in range(len(df.columns)):
    if i % 2 == 0:
        cols.append((F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i}'))

df.select(cols).show()
+------+------+
|delta0|delta2|
+------+------+
| 3| 1|
| 5| 6|
+------+------+
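As a small variation on the same idea (assuming the same even/odd pairing of columns), the loop can be collapsed into a list comprehension, which also lets you name the results delta1, delta2, ... as in the question:

from pyspark.sql import functions as F

# pair the columns as (c1, c2), (c3, c4), ... and subtract the left column from the right one
deltas = [
    (F.col(right) - F.col(left)).alias(f'delta{i + 1}')
    for i, (left, right) in enumerate(zip(df.columns[::2], df.columns[1::2]))
]
df.select(deltas).show()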
Related
I'm using pyspark, and I have data like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |0   |1   |
|1   |1   |0   |
|1   |1   |0   |
|1   |0   |0   |
+----+----+----+
My desired output is:
+----+---+
|col |sum|
+----+---+
|col1|4  |
|col2|2  |
|col3|1  |
+----+---+
My first thought was to put the column names in a list, loop through it, and each time sum that column and union the results to a new df.
Then I thought, maybe it's possible to do multiple aggregations, e.g.:
df.agg(sum('col1'), sum('col2'))
... and then figure out a way to unpivot.
Is there an easier way?
There is no easier way as far as I know. You can unpivot after aggregating, either by converting to a Pandas dataframe and transposing it, or by creating a map and then exploding the map to get the result as col and sum columns.
# Assuming initial dataframe is df
from pyspark.sql import functions as F

aggDF = df.agg(*[F.sum(F.col(col_name)).alias(col_name) for col_name in df.columns])

# Using pandas
aggDF.toPandas().transpose().reset_index().rename({'index': 'col', 0: 'sum'}, axis=1)

# Going Spark all the way
aggDF.withColumn("col", F.create_map([e for col in aggDF.columns for e in (F.lit(col), F.col(col))])) \
    .selectExpr("explode(col) as (col, sum)").show()
# Both return
"""
+----+---+
| col|sum|
+----+---+
|col1| 4|
|col2| 2|
|col3| 1|
+----+---+
"""
This works for more than 3 columns, if required.
You can use stack SQL function to unpivot a dataframe, as described here. So your code would become, with input as your input dataframe:
from pyspark.sql import functions as F
output = input.agg(
    F.sum("col1").alias("col1"),
    F.sum("col2").alias("col2"),
    F.sum("col3").alias("col3")
).select(
    F.expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col,sum)")
)
If you have the following input dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |0 |1 |
|1 |1 |0 |
|1 |1 |0 |
|1 |0 |0 |
+----+----+----+
You will get the following output dataframe:
+----+---+
|col |sum|
+----+---+
|col1|4 |
|col2|2 |
|col3|1 |
+----+---+
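If you'd rather not hard-code the three column names, a rough sketch that builds the same stack expression dynamically (assuming every column of input should be summed) is:

from pyspark.sql import functions as F

cols = input.columns
# builds "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, sum)" from the column list
stack_expr = "stack({}, {}) as (col, sum)".format(
    len(cols),
    ", ".join(f"'{c}', {c}" for c in cols)
)
output = input.agg(*[F.sum(c).alias(c) for c in cols]).select(F.expr(stack_expr))
output.show()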
You can first sum each column:
// input
val df = List((1,0,1),(1,1,0),(1,1,0),(1,0,0)).toDF("col1", "col2", "col3")
df.show
// sum each column
val sums = df.agg(sum("col1").as("col1"), sum("col2").as("col2"),
  sum("col3").as("col3"))
sums.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 4| 2| 1|
+----+----+----+
This gives you a Dataset with one row and three columns, which you can easily collect. And if that's what you want, create a new dataset with:
val sumRow = sums.first
val sumDS = List("col1" -> sumRow.getAs[Long]("col1"), "col2" -> sumRow.getAs[Long]("col2"),
  "col3" -> sumRow.getAs[Long]("col3")).toDF("col", "sum")
sumDS.show
+----+---+
| col|sum|
+----+---+
|col1| 4|
|col2| 2|
|col3| 1|
+----+---+
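For readers following along in PySpark, a rough equivalent of this collect-and-rebuild approach (assuming aggDF is the one-row dataframe of per-column sums from the earlier answer) would be:

# collect the single row of sums and rebuild a small (col, sum) dataframe from it
row = aggDF.first()
sums_df = spark.createDataFrame([(c, row[c]) for c in aggDF.columns], ["col", "sum"])
sums_df.show()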
I have a dataframe like so:
+----+-----+-----+
|id  |point|count|
+----+-----+-----+
|id_1|5    |9    |
|id_2|5    |1    |
|id_3|4    |3    |
|id_1|3    |3    |
|id_2|4    |3    |
+----+-----+-----+
The id-point pairs are unique.
I would like to group by id and create columns from the point column with values from the count column like so:
+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
+----+-------+-------+-------+
If you can guide me on how to start this or in which direction to start going, it would be much appreciated. I feel stuck on this for a while.
We can use pivot to achieve the required result:
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local[*]").getOrCreate()
#sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")
out_df = in_df.groupby("id").pivot("point").agg(sum("count"))
# To replace null by 0
out_df = out_df.na.fill(0)
# To rename columns
columns_to_rename = out_df.columns
columns_to_rename.remove("id")
for col in columns_to_rename:
    out_df = out_df.withColumnRenamed(col, f"point_{col}")
out_df.show()
+----+-------+-------+-------+
| id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2| 0| 3| 1|
|id_1| 3| 0| 9|
|id_3| 0| 3| 0|
+----+-------+-------+-------+
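As a usage note, if the point values are known up front you can pass them to pivot explicitly, which saves Spark an extra pass over the data to discover the distinct values; the list [3, 4, 5] below is just taken from the sample and is an assumption about your data:

from pyspark.sql import functions as F

# sketch with explicit pivot values; the column rename loop above still applies
out_df = in_df.groupby("id").pivot("point", [3, 4, 5]).agg(F.sum("count")).na.fill(0)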
Is there a more elegant way to filter a dataframe by one column and then, for each subset, further filter by another column, with the resulting data in one dataframe? The filtering information is in a dictionary. The first filter is on col1 using the dict key; the second filter is on col3 using its corresponding value.
df = pd.DataFrame({'col1': [1,1,1,2,2], 'col2': [2,2,2,2,2], 'col3': [1,6,7,5,9]})
df looks like the following
|col1|col2|col3|
|1 |2 |1 |
|1 |2 |6 |
|1 |2 |7 |
|2 |2 |5 |
|2 |2 |9 |
filter_dict = {1:5, 2:7}
df_new = df.somefunction(filter_dict)
Where col1 is 1, filter where the col3 value is greater than 5. Where col1 is 2, filter where the col3 value is greater than 7. This would result in:
df_new
|col1|col2|col3|
|1 |2 |6 |
|1 |2 |7 |
|2 |2 |9 |
List comprehension and boolean indexing with concat
df_new = pd.concat([df[(df['col1'] == k) & (df['col3'] > v)] for k,v in filter_dict.items()])
col1 col2 col3
1 1 2 6
2 1 2 7
4 2 2 9
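An alternative sketch avoids building one frame per key by mapping the dictionary onto col1 to get a per-row threshold and filtering once (this assumes every col1 value has a key in filter_dict; unmapped rows get a NaN threshold and are dropped by the comparison):

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2], 'col2': [2, 2, 2, 2, 2], 'col3': [1, 6, 7, 5, 9]})
filter_dict = {1: 5, 2: 7}

# look up each row's threshold from filter_dict, then apply a single boolean filter
df_new = df[df['col3'] > df['col1'].map(filter_dict)]

This returns the same three rows as the concat approach above.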
Given the following DataFrame, we need to pivot the my_column values from the example into separate columns and then sort by the int_column values that belong to each some_id, in ascending order. The example:
+-------+---------+----------+
|some_id|my_column|int_column|
+-------+---------+----------+
|xx1    |id_1     |3         |
|xx1    |id_2     |4         |
|xx1    |id_3     |5         |
|xx2    |id_1     |6         |
|xx2    |id_2     |1         |
|xx2    |id_3     |3         |
|xx3    |id_1     |4         |
|xx3    |id_2     |8         |
|xx3    |id_3     |9         |
|xx4    |id_1     |1         |
+-------+---------+----------+
Expected output:
+--------+--------+--------+
|id_1    |id_2    |id_3    |
+--------+--------+--------+
|[xx4, 1]|[xx2, 1]|[xx2, 3]|
|[xx1, 3]|[xx1, 4]|[xx1, 5]|
|[xx3, 4]|[xx3, 8]|[xx3, 9]|
|[xx2, 6]|null    |null    |
+--------+--------+--------+
As you can see, for id_1 the lowest number in int_column is 1, right at the end of the DataFrame, and it belongs to xx4 from the some_id column; the next values are 3, 4, and 6, belonging to xx1, xx3, and xx2 respectively.
Any pointers on how to approach this problem? Either PySpark or Pandas can be used.
Code to reproduce the input dataframe:
import pandas as pd
data = {'some_id': ['xx1', 'xx1', 'xx1', 'xx2', 'xx2', 'xx2', 'xx3', 'xx3', 'xx3', 'xx4'], \
'my_column' : ['id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1'],\
'int_column' : [3, 4, 5, 6 , 1, 3, 4, 8, 9, 1]}
df = pd.DataFrame.from_dict(data)
We need a helper key, created by using cumcount, then we use groupby + apply (this part works just like pivot, or you can use pivot_table or crosstab).
df=df.assign(key=df.groupby('my_column').cumcount())
df.groupby(['key','my_column']).apply(lambda x : list(zip(x['some_id'],x['int_column']))[0]).unstack()
Out[378]:
my_column id_1 id_2 id_3
key
0 (xx1, 3) (xx1, 4) (xx1, 5)
1 (xx2, 6) (xx2, 1) (xx2, 3)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx4, 1) None None
If using pivot + sort_values (the cumcount must be computed on the sorted frame, hence the lambda):
df=df.sort_values('int_column').assign(key=lambda d: d.groupby('my_column').cumcount())
df['Value']=list(zip(df['some_id'],df['int_column']))
s=df.pivot(index='key',columns='my_column',values='Value')
s
Out[397]:
my_column id_1 id_2 id_3
key
0 (xx4, 1) (xx2, 1) (xx2, 3)
1 (xx1, 3) (xx1, 4) (xx1, 5)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx2, 6) None None
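The pivot_table route mentioned above would look roughly like this; aggfunc='first' is used because each (key, my_column) cell holds exactly one pair, so nothing is actually aggregated:

# rough sketch of the pivot_table variant; assumes df already has the 'key' and 'Value'
# columns created in the previous snippet
s = df.pivot_table(index='key', columns='my_column', values='Value', aggfunc='first')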
Here's a solution in pyspark.
First define a Window to partition by my_column and order by int_column. We will define an ordering using pyspark.sql.functions.row_number() over this partition.
from pyspark.sql import Window
import pyspark.sql.functions as f
w = Window.partitionBy("my_column").orderBy("int_column")
df.withColumn("order", f.row_number().over(w)).sort("order").show()
#+-------+---------+----------+-----+
#|some_id|my_column|int_column|order|
#+-------+---------+----------+-----+
#| xx4| id_1| 1| 1|
#| xx2| id_2| 1| 1|
#| xx2| id_3| 3| 1|
#| xx1| id_2| 4| 2|
#| xx1| id_1| 3| 2|
#| xx1| id_3| 5| 2|
#| xx3| id_2| 8| 3|
#| xx3| id_3| 9| 3|
#| xx3| id_1| 4| 3|
#| xx2| id_1| 6| 4|
#+-------+---------+----------+-----+
Notice that (xx4, 1) is in the first row after sorting by order, as you explained.
Now you can group by order and pivot the dataframe on my_column. This requires an aggregate function, so I will use pyspark.sql.functions.first() because I am assuming there is only one (some_id, int_column) pair per order. Then simply sort by the order and drop that column to get the desired output:
df.withColumn("order", f.row_number().over(w))\
    .groupBy("order")\
    .pivot("my_column")\
    .agg(f.first(f.array([f.col("some_id"), f.col("int_column")])))\
    .sort("order")\
    .drop("order")\
    .show(truncate=False)
#+--------+--------+--------+
#|id_1 |id_2 |id_3 |
#+--------+--------+--------+
#|[xx4, 1]|[xx2, 1]|[xx2, 3]|
#|[xx1, 3]|[xx1, 4]|[xx1, 5]|
#|[xx3, 4]|[xx3, 8]|[xx3, 9]|
#|[xx2, 6]|null |null |
#+--------+--------+--------+
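One thing to watch: row_number() breaks ties arbitrarily when two rows in the same my_column partition share an int_column value (this sample has no such ties). If you need a deterministic layout, a small tweak is to add a tiebreaker to the window ordering, for example:

# some_id is used purely as a tiebreaker so equal int_column values get a stable order
w = Window.partitionBy("my_column").orderBy("int_column", "some_id")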
I have a DataFrame as below
A B C
1 3 1
1 8 2
1 5 3
2 2 1
My output should be as below, with column B ordered based on the initial column B value:
A B
1 3,1/5,3/8,2
2 2,1
I wrote something like this in Scala:
df.groupBy("A").withColumn("B",collect_list(concat("B",lit(","),"C"))
But it didn't solve my problem.
Given that you have the input dataframe as
+---+---+---+
|A |B |C |
+---+---+---+
|1 |3 |1 |
|1 |8 |2 |
|1 |5 |3 |
|2 |2 |1 |
+---+---+---+
You can get the following output as
+---+---------------+
|A |B |
+---+---------------+
|1 |[3,1, 5,3, 8,2]|
|2 |[2,1] |
+---+---------------+
By doing a simple groupBy and aggregation using the following functions:
df.orderBy("B").groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
You can then use a udf function to get the final desired result as
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
You should get
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/5,3/8,2|
|2 |2,1 |
+---+-----------+
Note that you would need import org.apache.spark.sql.functions._ and import scala.collection.mutable for all of the above to work.
Edited
Column B is ordered based on the initial column B value
For this you can just remove the orderBy part as
import org.apache.spark.sql.functions._
val newdf = df.groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
and you should get output as
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/8,2/5,3|
|2 |2,1 |
+---+-----------+
This is what you can achieve by using the concat_ws function and then grouping by column A and collecting the list:
val df1 = spark.sparkContext.parallelize(Seq(
  (1, 3, 1),
  (1, 8, 2),
  (1, 5, 3),
  (2, 2, 1)
)).toDF("A", "B", "C")
val result = df1.withColumn("B", concat_ws("/", $"B", $"C"))
result.groupBy("A").agg(collect_list($"B").alias("B")).show
Output:
+---+---------------+
| A| B|
+---+---------------+
| 1|[3/1, 8/2, 5/3]|
| 2| [2/1]|
+---+---------------+
Edited:
Here is what you can do if you want to sort by column B:
val format = udf((value : Seq[String]) => {
  value.sortBy(x => {x.split(",")(0)}).mkString("/")
})
val result = df1.withColumn("B", concat_ws(",", $"B", $"C"))
  .groupBy($"A").agg(collect_list($"B").alias("B"))
  .withColumn("B", format($"B"))
result.show()
Output:
+---+-----------+
| A| B|
+---+-----------+
| 1|3,1/5,3/8,2|
| 2| 2,1|
+---+-----------+
Hope this was helpful!
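Both answers here use a UDF to sort and join the collected strings. For readers on Spark 2.4+ working in PySpark (as in most of this thread), a rough no-UDF sketch with sort_array and array_join is below; note that sort_array sorts the strings lexicographically, which matches the desired order here only because B is single-digit:

from pyspark.sql import functions as F

# df is assumed to be a PySpark dataframe with columns A, B, C as in the question
result = (
    df.withColumn("BC", F.concat_ws(",", "B", "C"))
      .groupBy("A")
      .agg(F.array_join(F.sort_array(F.collect_list("BC")), "/").alias("B"))
)
result.show(truncate=False)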