PySpark: Concatenating column names into a string column - python

I have a PySpark dataframe with many attribute columns (there are about 160). These columns contain 1s and 0s to show whether an account has an attribute or not.
I need to do an analysis of the combinations of attributes, so I want to put together a string in a new column containing the names of the attributes that the account has.
Here is an example: I have these columns - account, then some other columns, then the attributes. The column I want to add is 'att_list'.
What I have tried is something like this:
I have the list of attributes in a variable:
# create a list of all the attributes available
att_names = df1.drop('Account', 'other_col1', 'other_col2')
attlist = [x for x in att_names.columns]
I tried with a function - expanding on an existing example:
def func_att_list(df, cols=[]):
    att_list_column = ','.join([when(f.col(i) > 0, i) for i in cols])
    return df.withColumn('att_list', att_list_column)

df2 = func_att_list(df1, cols=[i for i in attlist])
This just errors out.
I've also tried this:
att_list_column = [when(df1.col(i) > 0, i) for i in attlist]
df1 = df1.withColumn('att_list', ','.join([i for i in att_list_column]))
This also doesn't work.
I am not confident with functions and find them a bit of a 'black box'.
I would greatly appreciate any help.

You could use concat_ws and pass it a list of case-when conditions, one for each attribute column - each condition being: if the attribute column has 1, then return the attribute column name.
Here's a small test example.
import random

from pyspark.sql import functions as func

# sample input creation
data_ls = [
    [random.randint(0, 1) for i in range(10)] for j in range(100)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['attr' + str(k) for k in range(10)])
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# | 0| 0| 1| 1| 1| 1| 0| 0| 0| 0|
# | 1| 1| 0| 0| 1| 1| 1| 1| 1| 1|
# | 0| 1| 0| 1| 0| 0| 1| 0| 0| 0|
# | 1| 1| 0| 0| 0| 0| 0| 1| 1| 0|
# | 1| 0| 1| 0| 1| 0| 1| 1| 1| 0|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# only showing top 5 rows
# concatenate a when() column for each attribute field
data_sdf. \
    withColumn('attr_list',
               func.concat_ws(',',
                              *[func.when(func.col(c) == 1, func.lit(c))
                                for c in data_sdf.columns if c.startswith('attr')]
                              )
               ). \
    show(5, truncate=False)
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|attr_list |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |0 |0 |1 |1 |1 |1 |0 |0 |0 |0 |attr2,attr3,attr4,attr5 |
# |1 |1 |0 |0 |1 |1 |1 |1 |1 |1 |attr0,attr1,attr4,attr5,attr6,attr7,attr8,attr9|
# |0 |1 |0 |1 |0 |0 |1 |0 |0 |0 |attr1,attr3,attr6 |
# |1 |1 |0 |0 |0 |0 |0 |1 |1 |0 |attr0,attr1,attr7,attr8 |
# |1 |0 |1 |0 |1 |0 |1 |1 |1 |0 |attr0,attr2,attr4,attr6,attr7,attr8 |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# only showing top 5 rows
The list comprehension would result in the following:
[Column<'CASE WHEN (attr0 = 1) THEN attr0 END'>,
Column<'CASE WHEN (attr1 = 1) THEN attr1 END'>,
Column<'CASE WHEN (attr2 = 1) THEN attr2 END'>,
Column<'CASE WHEN (attr3 = 1) THEN attr3 END'>,
Column<'CASE WHEN (attr4 = 1) THEN attr4 END'>,
Column<'CASE WHEN (attr5 = 1) THEN attr5 END'>,
Column<'CASE WHEN (attr6 = 1) THEN attr6 END'>,
Column<'CASE WHEN (attr7 = 1) THEN attr7 END'>,
Column<'CASE WHEN (attr8 = 1) THEN attr8 END'>,
Column<'CASE WHEN (attr9 = 1) THEN attr9 END'>]
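Note that concat_ws skips nulls, so only the matched attribute names end up in the string. If you would rather keep the function shape from the question, here is a minimal sketch built on the same idea (assuming df1 and attlist exist as defined in the question; the names are illustrative):
from pyspark.sql import functions as F

def func_att_list(df, cols):
    # build one CASE WHEN column per attribute and let concat_ws drop the nulls
    att_list_column = F.concat_ws(
        ',',
        *[F.when(F.col(c) > 0, F.lit(c)) for c in cols]
    )
    return df.withColumn('att_list', att_list_column)

# df2 = func_att_list(df1, attlist)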

Related

loop over two variables to create multiple year columns

If I have table
|a | b | c|
|"hello"|"world"| 1|
and the variables
start =2000
end =2015
How do I in pyspark add 16 cols, with the 1st column m2000, the second m2001 and so on up to m2015, where all these new cols have 0, so the new dataframe is
|a | b | c|m2000 | m2001 | m2002 | ... | m2015|
|"hello"|"world"| 1| 0 | 0 | 0 | ... | 0 |
I have tried the below, but
df = df.select(
    '*',
    *["0".alias(f'm{i}') for i in range(2000, 2016)]
)
df.show()
I get the error
AttributeError: 'str' object has no attribute 'alias'
You can simply use withColumn to add relevant columns.
from pyspark.sql.functions import col,lit
df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])
df.show()
+-----+-----+---+
| a| b| c|
+-----+-----+---+
|hello|world| 1|
+-----+-----+---+
for i in range(2000, 2016):
    df = df.withColumn("m" + str(i), lit(0))
df.show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
You can use a one-liner:
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
Full example:
from pyspark.sql import functions as F

df = spark.createDataFrame([["hello","world",1]],["a","b","c"])
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
In pandas, you can do the following:
import pandas as pd
df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)
I am not very familiar with the Spark syntax, but the approach should be near-identical.
What is happening:
The term ['m{}'.format(x) for x in range(2000, 2016)] is a list-comprehension that creates the list of desired column names. We assign the value 0 to these columns. Since the columns do not yet exist, they are added.
Your code for generating the extra columns is perfectly fine - you just need to wrap the "0" in the lit function, like this:
from pyspark.sql.functions import lit
df.select('*', *[lit("0").alias(f'm{i}') for i in range(2000, 2016)]).show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I would be cautious about calling the withColumn method repeatedly - every new call to it creates a new projection in Spark's query execution plan, and that can become computationally very expensive. Using a single select will always be the better approach.
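If you want to see that projection growth for yourself, you can compare the query plans of the two approaches; a quick sketch, assuming the df from the examples above:
from pyspark.sql.functions import lit

# loop of withColumn calls: each call stacks another Project node onto the analyzed plan
df_loop = df
for i in range(2000, 2016):
    df_loop = df_loop.withColumn(f"m{i}", lit(0))
df_loop.explain(extended=True)

# single select: one Project node covering all of the new columns
df_select = df.select("*", *[lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df_select.explain(extended=True)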

Pyspark : Return all column names of max values

I have a DataFrame like this:
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType, StringType
#import numpy as np
data = [(("ID1", 3, 5,5)), (("ID2", 4, 5,6)), (("ID3", 3, 3,3))]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB","colC"])
df.show()
cols = df.columns
maxcol = f.udf(lambda row: cols[row.index(max(row)) +1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5| 5|
|ID2| 4| 5| 6|
|ID3| 3| 3| 3|
+---+----+----+----+
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3 |5 |5 |colB |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA |
+---+----+----+----+-------+
I want to return all column names of the max values in case there are ties. How can I achieve this in pyspark, like this:
+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col |
+---+----+----+----+--------------+
|ID1|3 |5 |5 |colB,colC |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA,ColB,ColC|
+---+----+----+----+--------------+
Thank you
Seems like a udf solution.
Iterate over the columns you have (pass them as an input to the udf), perform the Python operations to get the max and check which columns share that value, and return a list (aka array) of the column names.
@udf(returnType=ArrayType(StringType()))
def collect_same_max():
    ...
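Filled in, a minimal sketch of that udf could look like the following (assuming the ID column is excluded before the row is passed in; the struct usage below is illustrative, not from the original post):
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

@f.udf(returnType=ArrayType(StringType()))
def collect_same_max(row):
    # row is a struct of the value columns; keep every column name tied to the max value
    values = row.asDict()
    max_value = max(values.values())
    return [name for name, value in values.items() if value == max_value]

# maxDF = df.withColumn("Max_col", collect_same_max(f.struct(*df.columns[1:])))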
Or, maybe, if it is doable, you can try to use the transform function:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
Data Engineering, I would say. See code and logic below
from pyspark.sql import Window, functions as F

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num'))
                                    for x in df.columns if x != 'ID']))  # struct column of (column name, value) pairs
       .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # explode the struct column
       .withColumn('z', F.first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))  # max value for each ID
       .where(F.col('num') == F.col('z'))  # isolate the max values in each ID
       .groupBy('ID', 'colA', 'colB', 'colC').agg(F.collect_list('col').alias('Max_col'))  # combine max column names into a list
       )
new.show()
+---+----+----+----+------------------+
| ID|colA|colB|colC| Max_col|
+---+----+----+----+------------------+
|ID1| 3| 5| 5| [colB, colC]|
|ID2| 4| 5| 6| [colC]|
|ID3| 3| 3| 3|[colA, colB, colC]|
+---+----+----+----+------------------+
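If a plain comma-separated string is enough (as in the desired output in the question), here is a udf-free sketch that reuses greatest together with the concat_ws pattern from the first answer on this page (column handling is illustrative):
from pyspark.sql import functions as F

value_cols = [c for c in df.columns if c != 'ID']
row_max = F.greatest(*[F.col(c) for c in value_cols])

# concat_ws drops the nulls produced by the non-matching when() branches
df.withColumn(
    'Max_col',
    F.concat_ws(',', *[F.when(F.col(c) == row_max, F.lit(c)) for c in value_cols])
).show(truncate=False)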

pyspark create multiple columns from values of existing column

I have a dataframe like so:
+------------------------------------+-----+-----+
|id |point|count|
+------------------------------------+-----+-----+
|id_1|5 |9 |
|id_2|5 |1 |
|id_3|4 |3 |
|id_1|3 |3 |
|id_2|4 |3 |
The id-point pairs are unique.
I would like to group by id and create columns from the point column with values from the count column like so:
+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
If you can guide me on how to start this or in which direction to go, it would be much appreciated. I have felt stuck on this for a while.
We can use pivot to achieve the required result:
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local[*]").getOrCreate()
#sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")
out_df = in_df.groupby("id").pivot("point").agg(sum("count"))
# To replace null by 0
out_df = out_df.na.fill(0)
# To rename columns
columns_to_rename = out_df.columns
columns_to_rename.remove("id")
for col in columns_to_rename:
    out_df = out_df.withColumnRenamed(col, f"point_{col}")
out_df.show()
+----+-------+-------+-------+
| id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2| 0| 3| 1|
|id_1| 3| 0| 9|
|id_3| 0| 3| 0|
+----+-------+-------+-------+
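As a side note, the renaming loop can be replaced by a single select, which avoids the repeated withColumnRenamed calls; a sketch applied to the out_df produced right after the pivot, before the loop:
from pyspark.sql import functions as F

# prefix every pivoted column in one projection; "id" keeps its name
out_df = out_df.select(
    "id",
    *[F.col(c).alias(f"point_{c}") for c in out_df.columns if c != "id"]
)
out_df.show()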

Combine two rows in Pyspark if a condition is met

I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: an alternate, slightly more complicated scenario, closer to what I'm trying to solve, where we only aggregate the rows with positive mergeIds:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
The other way would be to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f

df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("mergeId")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
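To adapt the same trick to the original boolean shouldMerge column from the first scenario, only the when condition needs to change; a hedged sketch (the constant -1 key is safe because monotonically_increasing_id is never negative):
import pyspark.sql.functions as f

result = df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        # all shouldMerge == True rows share one key; every other row keeps its own uid
        f.when(f.col("shouldMerge"), f.lit(-1)).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("shouldMerge")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")
result.show()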
You will have to filter out only the rows where shouldMerge is true and aggregate, then union this with all the remaining rows.
import pyspark.sql.functions as functions

df = sqlContext.createDataFrame([
    (True, 1),
    (True, 1),
    (True, 2),
    (False, 3),
    (False, 1),
], ("shouldMerge", "number"))

false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")

result = true_df.groupBy("shouldMerge")\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)

df = sqlContext.createDataFrame([
    (1, 1),
    (2, 1),
    (1, 2),
    (-1, 3),
    (-1, 1),
], ("mergeId", "number"))

merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"

false_df = df.filter(remaining)
true_df = df.filter(merge_condition)

result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP:
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.

Partition pyspark dataframe based on the change in column value

I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to group the data into groups as the value of column A changes. Say
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y),
and the corresponding row data.
If I understand correctly, you want to create a distinct group every time column A changes value.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.
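For example, to materialize the groups, a sketch continuing from the df above:
from pyspark.sql import functions as psf

# gather each consecutive group's rows into a list (collect_list does not guarantee
# ordering inside the list; keep "rn" around before dropping it if the order matters)
groups = df.groupBy("group_id")\
    .agg(psf.collect_list(psf.struct("A", "B")).alias("rows"),
         psf.count("*").alias("size"))\
    .orderBy("group_id")
groups.show(truncate=False)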
