I have a dataframe like so:
+----+-----+-----+
|id  |point|count|
+----+-----+-----+
|id_1|5    |9    |
|id_2|5    |1    |
|id_3|4    |3    |
|id_1|3    |3    |
|id_2|4    |3    |
+----+-----+-----+
The id-point pairs are unique.
I would like to group by id and create columns from the point column with values from the count column like so:
+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
+----+-------+-------+-------+
If you can guide me on how to start this, or which direction to go in, it would be much appreciated. I have been stuck on this for a while.
We can use pivot to achieve the required result:
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master("local[*]").getOrCreate()

# sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")

out_df = in_df.groupby("id").pivot("point").agg(sum("count"))

# replace nulls with 0
out_df = out_df.na.fill(0)

# rename the pivoted columns to point_<value>
columns_to_rename = out_df.columns
columns_to_rename.remove("id")
for c in columns_to_rename:
    out_df = out_df.withColumnRenamed(c, f"point_{c}")

out_df.show()
+----+-------+-------+-------+
| id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2| 0| 3| 1|
|id_1| 3| 0| 9|
|id_3| 0| 3| 0|
+----+-------+-------+-------+
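A small optional tweak: if the distinct point values are known up front, they can be passed to pivot explicitly, which lets Spark skip the extra job it otherwise runs to discover them (assuming the values here really are 3, 4 and 5):

# explicit pivot values avoid the distinct-value scan
out_df = in_df.groupby("id").pivot("point", [3, 4, 5]).agg(sum("count")).na.fill(0)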
I have a PySpark dataframe with many attribute columns (about 160). These columns contain 1s and 0s indicating whether an account has an attribute or not.
I need to do an analysis of the combinations of attributes, so I want to build a string in a new column containing the names of the attributes that the account has.
Here is an example: I have these columns: account, then some other columns, then the attributes. The column I want to add is 'att_list'.
What I have tried is something like this:
I have the list of attributes in a variable
# create a list of all the attribute columns available
att_names = df1.drop('Account', 'other_col1', 'other_col1')
attlist = [x for x in att_names.columns]
I tried with a function, expanding an existing one:
def func_att_list(df, cols=[]):
    att_list_column = ','.join([when(f.col(i) > 0, i) for i in cols])
    return df.withColumn('att_list', att_list_column)

df2 = func_att_list(df1, cols=[i for i in attlist])
This just errors out.
I've also tried this:
att_list_column = [when(df1.col(i) > 0, i) for i in attlist]
df1 = df1.withColumn('att_list', ','.join([i for i in att_list_column]))
This also doesn't work.
I am not confident with functions and find them a bit of a 'black box'.
I would greatly appreciate any help.
You could use concat_ws and pass it a list of when() conditions, one per attribute column: if the attribute column has 1, emit the attribute column name.
Here's a small test example.
import random

from pyspark.sql import functions as func

# sample input creation
data_ls = [
    [random.randint(0, 1) for i in range(10)] for j in range(100)
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['attr' + str(k) for k in range(10)])
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# | 0| 0| 1| 1| 1| 1| 0| 0| 0| 0|
# | 1| 1| 0| 0| 1| 1| 1| 1| 1| 1|
# | 0| 1| 0| 1| 0| 0| 1| 0| 0| 0|
# | 1| 1| 0| 0| 0| 0| 0| 1| 1| 0|
# | 1| 0| 1| 0| 1| 0| 1| 1| 1| 0|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# only showing top 5 rows
# concatenate a when() condition for each attribute field (concat_ws skips the resulting nulls)
data_sdf. \
    withColumn('attr_list',
               func.concat_ws(',',
                              *[func.when(func.col(c) == 1, func.lit(c))
                                for c in data_sdf.columns if c.startswith('attr')]
                              )
               ). \
    show(5, truncate=False)
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|attr_list |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |0 |0 |1 |1 |1 |1 |0 |0 |0 |0 |attr2,attr3,attr4,attr5 |
# |1 |1 |0 |0 |1 |1 |1 |1 |1 |1 |attr0,attr1,attr4,attr5,attr6,attr7,attr8,attr9|
# |0 |1 |0 |1 |0 |0 |1 |0 |0 |0 |attr1,attr3,attr6 |
# |1 |1 |0 |0 |0 |0 |0 |1 |1 |0 |attr0,attr1,attr7,attr8 |
# |1 |0 |1 |0 |1 |0 |1 |1 |1 |0 |attr0,attr2,attr4,attr6,attr7,attr8 |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# only showing top 5 rows
The list comprehension results in the following list of column expressions:
[Column<'CASE WHEN (attr0 = 1) THEN attr0 END'>,
Column<'CASE WHEN (attr1 = 1) THEN attr1 END'>,
Column<'CASE WHEN (attr2 = 1) THEN attr2 END'>,
Column<'CASE WHEN (attr3 = 1) THEN attr3 END'>,
Column<'CASE WHEN (attr4 = 1) THEN attr4 END'>,
Column<'CASE WHEN (attr5 = 1) THEN attr5 END'>,
Column<'CASE WHEN (attr6 = 1) THEN attr6 END'>,
Column<'CASE WHEN (attr7 = 1) THEN attr7 END'>,
Column<'CASE WHEN (attr8 = 1) THEN attr8 END'>,
Column<'CASE WHEN (attr9 = 1) THEN attr9 END'>]
I am trying to find the difference between every two columns in a PySpark dataframe with 100+ columns. If there were fewer, I could manually create a new column each time with df.withColumn('delta', df.col1 - df.col2), but I am trying to do this in a more concise way. Any ideas?
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1   |5   |3   |9   |
+----+----+----+----+
Wanted output:
+------+------+
|delta1|delta2|
+------+------+
|4     |6     |
+------+------+
All you have to do is create a proper for loop to step through the list of columns and do your subtraction.
Sample data
df = spark.createDataFrame([
    (1, 4, 7, 8),
    (0, 5, 3, 9),
], ['c1', 'c2', 'c3', 'c4'])
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|1 |4 |7 |8 |
|0 |5 |3 |9 |
+---+---+---+---+
Loop through columns
from pyspark.sql import functions as F

cols = []
for i in range(len(df.columns)):
    if i % 2 == 0:
        cols.append((F.col(df.columns[i + 1]) - F.col(df.columns[i])).alias(f'delta{i}'))

df.select(cols).show()
+------+------+
|delta0|delta2|
+------+------+
| 3| 1|
| 5| 6|
+------+------+
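If the index arithmetic feels opaque, an equivalent variant pairs the columns up explicitly and produces the delta1, delta2, ... names from the question (a sketch assuming the same df as above and an even number of columns; pairs and deltas are just illustrative names):

from pyspark.sql import functions as F

# pair the columns as (c1, c2), (c3, c4), ... and subtract within each pair
pairs = zip(df.columns[::2], df.columns[1::2])
deltas = [(F.col(b) - F.col(a)).alias(f'delta{n + 1}') for n, (a, b) in enumerate(pairs)]
df.select(deltas).show()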
I have a DataFrame like this:
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

data = [("ID1", 3, 5, 5), ("ID2", 4, 5, 6), ("ID3", 3, 3, 3)]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB", "colC"])
df.show()

cols = df.columns
maxcol = f.udf(lambda row: cols[row.index(max(row)) + 1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5| 5|
|ID2| 4| 5| 6|
|ID3| 3| 3| 3|
+---+----+----+----+
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3 |5 |5 |colB |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA |
+---+----+----+----+-------+
I want to return all column names with the max value in case there are ties. How can I achieve this in PySpark, like this:
+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col       |
+---+----+----+----+--------------+
|ID1|3   |5   |5   |colB,colC     |
|ID2|4   |5   |6   |colC          |
|ID3|3   |3   |3   |colA,colB,colC|
+---+----+----+----+--------------+
Thank you
This seems like a job for a udf.
Iterate over the columns you have (pass them as an input to the function), perform plain Python operations to get the max and check which columns share that value, and return a list (i.e. an array) of the column names.
@udf(returnType=ArrayType(StringType()))
def collect_same_max():
    ...
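For reference, here is one way the stub above could be completed. This is a sketch only, assuming the question's df and its value columns colA, colB, colC; value_cols and tiesDF are illustrative names, and the struct is passed in so the UDF sees the row's values as a Row:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

value_cols = ["colA", "colB", "colC"]  # assumed: the columns to compare

@F.udf(returnType=ArrayType(StringType()))
def collect_same_max(row):
    # plain Python: find the max and keep every column name matching it
    values = [row[c] for c in value_cols]
    m = max(values)
    return [c for c, v in zip(value_cols, values) if v == m]

tiesDF = df.withColumn("Max_col", collect_same_max(F.struct(*[df[c] for c in value_cols])))
# concat_ws gives the comma-separated form shown in the question
tiesDF = tiesDF.withColumn("Max_col", F.concat_ws(",", "Max_col"))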
Or, if it is doable, you can try using the transform function:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
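As an illustration of the no-udf route, here is a sketch that uses the higher-order filter and transform functions on an array of (name, value) structs. This assumes Spark 3.1+ for pyspark.sql.functions.filter/transform, and reuses the same assumed value columns as above; arr, row_max and result are illustrative names:

from pyspark.sql import functions as F

value_cols = ["colA", "colB", "colC"]  # assumed: the columns to compare

# array of structs pairing each column name with its value
arr = F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("val")) for c in value_cols])
row_max = F.greatest(*[F.col(c) for c in value_cols])

# keep the entries whose value equals the row-wise max, then keep just the names
result = df.withColumn(
    "Max_col",
    F.concat_ws(",", F.transform(F.filter(arr, lambda x: x["val"] == row_max),
                                 lambda x: x["name"])),
)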
Data engineering, I would say. See the code and logic below.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num')) for x in df.columns if x != 'ID']))  # create a struct column of column names and their values
       .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # explode the struct column
       .withColumn('z', F.first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))  # column holding the max value for each ID
       .where(F.col('num') == F.col('z'))  # isolate the max values in each ID
       .groupBy(['ID', 'colA', 'colB', 'colC']).agg(F.collect_list('col').alias('Max_col'))  # combine the max column names into a list
       )
+---+----+----+----+------------------+
| ID|colA|colB|colC| Max_col|
+---+----+----+----+------------------+
|ID1| 3| 5| 5| [colB, colC]|
|ID2| 4| 5| 6| [colC]|
|ID3| 3| 3| 3|[colA, colB, colC]|
+---+----+----+----+------------------+
I have a PySpark dataframe that has duplicated ids. There are missing values in some of the records, and differences in the "Time" field among the duplicated ids.
+---+----------+----+------+
|id |Time      |Type|Status|
+---+----------+----+------+
|1  |2020-03-01|    |      |
|1  |2020-03-01|A   |Single|
|1  |2020-03-01|A   |      |
|2  |2020-02-01|C   |Double|
|2  |2020-02-25|    |Double|
+---+----------+----+------+
How can I merge the info in every field into one record per id? And if there are different "Time" values, how can I just choose the most recent one? The ideal dataframe looks like this:
+---+----------+----+------+
|id |Time      |Type|Status|
+---+----------+----+------+
|1  |2020-03-01|A   |Single|
|2  |2020-02-25|C   |Double|
+---+----------+----+------+
And please note that I have around 100 fields in this dataframe, not just the four I am showing.
Try grouping by id, then taking max(Time) (converted to date type) to get the most recent one, along with max of the other columns.
Example:
df.show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| | |
#| 1|2020-03-01| A|Single|
#| 1|2020-03-01| A| |
#| 2|2020-02-01| C|Double|
#| 2|2020-02-25| |Double|
#+---+----------+----+------+
from pyspark.sql.functions import *
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[max(f'{i}').alias(f'{i}') for i in df.columns if i not in ['id','Time']]
#[Column<b'max(to_date(`Time`)) AS `Time`'>, Column<b'max(Type) AS `Type`'>, Column<b'max(Status) AS `Status`'>]
df.groupBy("id").agg(*expr).show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| A|Single|
#| 2|2020-02-25| C|Double|
#+---+----------+----+------+
If you want to get the first/last non-empty Type and Status values instead, first convert the empty strings to null:
when_expr=['id'] + [when(length(f'{i}') ==0 ,lit(None)).otherwise(col(f'{i}')).alias(f'{i}') for i in df.columns if i not in ['id']]
df1=df.select(*when_expr)
df1.show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01|null| null|
#| 1|2020-03-01| A|Single|
#| 1|2020-03-01| A| null|
#| 2|2020-02-01| C|Double|
#| 2|2020-02-25|null|Double|
#+---+----------+----+------+
#using first function
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[first(f'{i}',True).alias(f'{i}') for i in df.columns if i not in ['id','Time']]
#using last function
expr=[max(to_date(f'{i}')).alias("Time") for i in df.columns if i == 'Time'] +[last(f'{i}',True).alias(f'{i}') for i in df.columns if i not in ['id','Time']]
df1.groupBy("id").agg(*expr).show()
#+---+----------+----+------+
#| id| Time|Type|Status|
#+---+----------+----+------+
#| 1|2020-03-01| A|Single|
#| 2|2020-02-25| C|Double|
#+---+----------+----+------+
I have a DataFrame as below
A B C
1 3 1
1 8 2
1 5 3
2 2 1
My output should be as below, where column B is ordered based on the initial column B value:
A B
1 3,1/5,3/8,2
2 2,1
I wrote something like this in Scala:
df.groupBy("A").withColumn("B",collect_list(concat("B",lit(","),"C"))
But it didn't solve my problem.
Given that you have an input dataframe as
+---+---+---+
|A |B |C |
+---+---+---+
|1 |3 |1 |
|1 |8 |2 |
|1 |5 |3 |
|2 |2 |1 |
+---+---+---+
You can get the following output
+---+---------------+
|A |B |
+---+---------------+
|1 |[3,1, 5,3, 8,2]|
|2 |[2,1] |
+---+---------------+
by doing a simple groupBy and aggregation with built-in functions:
val newdf = df.orderBy("B").groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
You can then use a udf function to get the final desired result:
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
You should get
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/5,3/8,2|
|2 |2,1 |
+---+-----------+
Note you would need import org.apache.spark.sql.functions._ and import scala.collection.mutable for all of the above to work.
Edited
Column B is ordered based on the initial column B value
For this you can just remove the orderBy part as
import org.apache.spark.sql.functions._
val newdf = df.groupBy("A").agg(collect_list(concat_ws(",", col("B"), col("C"))) as "B")
def joinString = udf((b: mutable.WrappedArray[String]) => {
  b.mkString("/")
})
newdf.withColumn("B", joinString(col("B"))).show(false)
and you should get output as
+---+-----------+
|A |B |
+---+-----------+
|1 |3,1/8,2/5,3|
|2 |2,1 |
+---+-----------+
This is what you can achieve by using the concat_ws function, then grouping by column A and collecting the list:
val df1 = spark.sparkContext.parallelize(Seq(
  (1, 3, 1),
  (1, 8, 2),
  (1, 5, 3),
  (2, 2, 1)
)).toDF("A", "B", "C")
val result = df1.withColumn("B", concat_ws("/", $"B", $"C"))
result.groupBy("A").agg(collect_list($"B").alias("B")).show
Output:
+---+---------------+
| A| B|
+---+---------------+
| 1|[3/1, 8/2, 5/3]|
| 2| [2/1]|
+---+---------------+
Edited:
Here is what you can do if you want to sort by column B:
val format = udf((value: Seq[String]) => {
  value.sortBy(x => x.split(",")(0)).mkString("/")
})

val result = df1.withColumn("B", concat_ws(",", $"B", $"C"))
  .groupBy($"A").agg(collect_list($"B").alias("B"))
  .withColumn("B", format($"B"))

result.show()
Output:
+---+-----------+
| A| B|
+---+-----------+
| 1|3,1/5,3/8,2|
| 2| 2,1|
+---+-----------+
Hope this was helpful!