I'm using pyspark, and I have data like this:
col1
col2
col3
1
0
1
1
1
0
1
1
0
1
0
0
My desired output is:
col
sum
col1
4
col2
2
col3
1
My first thought was to put the column names in a list, loop through it, and each time sum that column and union the results to a new df.
Then I thought, maybe it's possible to do multiple aggregations, e.g.:
df.agg(sum('col1), sum('col2))
... and then figure out a way to unpivot.
Is there an easier way?
There is no easier way as far as I know. You can unpivot it after aggregating, either by first converting it to a Pandas dataframe and then invoking transpose on it or creating a map and then exploding the map to get the result as col and sum column.
# Assuming initial dataframe is df
aggDF = df.agg(*[F.sum(F.col(col_name)).alias(col_name) for col_name in df.columns])
# Using pandas
aggDF.toPandas().transpose().reset_index().rename({'index' : 'col', 0: 'sum'}, axis=1)
# Going spark all the way
aggDF.withColumn("col", F.create_map([e for col in aggDF.columns for e in (F.lit(col), F.col(col))])).selectExpr("explode(col) as (col, sum)").show()
# Both return
"""
+----+---+
| col|sum|
+----+---+
|col1| 4|
|col2| 2|
|col3| 1|
+----+---+
"""
This works for more than 3 columns, if required.
You can use stack SQL function to unpivot a dataframe, as described here. So your code would become, with input as your input dataframe:
from pyspark.sql import functions as F
output = input.agg(
F.sum("col1").alias("col1"),
F.sum("col2").alias("col2"),
F.sum("col3").alias("col3")
).select(
F.expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col,sum)")
)
If you have the following input dataframe:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |0 |1 |
|1 |1 |0 |
|1 |1 |0 |
|1 |0 |0 |
+----+----+----+
You will get the following output dataframe:
+----+---+
|col |sum|
+----+---+
|col1|4 |
|col2|2 |
|col3|1 |
+----+---+
You can first sum each column:
// input
val df = List((1,0,1),(1,1,0),(1,1,0),(1,0,0)).toDF("col1", "col2", "col3")
df.show
// sum each column
val sums = df.agg(sum("coL1").as("col1"), sum("col2").as("col2"),
sum("col3").as("col3"))
sums.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 4| 2| 1|
+----+----+----+
This gives you a DS with one row, and 3 columns. Which you can easily collect. And if that's what you want, create a new dataset with:
val sumRow = sums.first
val sumDS = List("col1" -> sumRow.getAs[Long]("col1"), "col2" ->
sumRow.getAs[Long]("col2"), "col3" -> sumRow.getAs[Long]("col3")).toDF("col", "sum")
sumDS.show
+----+---+
| col|sum|
+----+---+
|col1| 4|
|col2| 2|
|col3| 1|
+----+---+
Related
id
val1
val2
1
Y
Flagged
1
N
Flagged
2
N
Flagged
2
Y
Flagged
2
N
Flagged
I have the above table. I want to check the rows in val1 with the same id, if there's at least one Y and one N then all the rows having 1 as id will be marked flagged in val2. In addition, for a more efficient code, I want the code to jump to the next id once it finds a Y.
Assuming the val1 columns contains only Y and N as unique values, you can group the dataframe by id and aggregate val1 using countDistinct to count the unique values, then create a new column flagged corresponding the condition where distinct count > 1, finally join this new column with original dataframe to get the result
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1 may contains other values along with Y, N, then first mask the values which are not in Y and N:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly as having a column named flagged with boolean values makes more sense
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array([F.lit('N'),F.lit('Y')])
out = (df.withColumn("Flagged",F.array_intersect(a,
F.collect_set("val1").over(W.partitionBy("id")))==a))
out.show()
+---+----+-------+-------+
| id|val1| val2|Flagged|
+---+----+-------+-------+
| 1| Y|Flagged| true|
| 1| N|Flagged| true|
| 2| N|Flagged| true|
| 2| Y|Flagged| true|
| 2| N|Flagged| true|
+---+----+-------+-------+
I have a dataframe with many columns. My aim is to produce a dataframe thats lists each column name, along with the number of null values in that column.
Example:
+-------------+-------------+
| Column_Name | NULL_Values |
+-------------+-------------+
| Column_1 | 15 |
| Column_2 | 56 |
| Column_3 | 18 |
| ... | ... |
+-------------+-------------+
I have managed to get the number of null values for ONE column like so:
df.agg(F.count(F.when(F.isnull(c), c)).alias('NULL_Count'))
where c is a column in the dataframe. However, it does not show the name of the column. The output is:
+------------+
| NULL_Count |
+------------+
| 15 |
+------------+
Any ideas?
You can use a list comprehension to loop over all of your columns in the agg, and use alias to rename the output column:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
However, this will return the results in one row as shown below:
df_agg.show()
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
If you wanted the results in one column instead, you could union each column from df_agg using functools.reduce as follows:
from functools import reduce
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df_agg.select(F.lit(c).alias("Column_Name"), F.col(c).alias("NULL_Count"))
for c in df_agg.columns
)
)
df_agg_col.show()
#+-----------+----------+
#|Column_Name|NULL_Count|
#+-----------+----------+
#| Column_1| 15|
#| Column_2| 56|
#| Column_3| 18|
#+-----------+----------+
Or you can skip the intermediate step of creating df_agg and do:
df_agg_col = reduce(
lambda a, b: a.union(b),
(
df.agg(
F.count(F.when(F.isnull(c), c)).alias('NULL_Count')
).select(F.lit(c).alias("Column_Name"), "NULL_Count")
for c in df.columns
)
)
Scala alternative could be
case class Test(id:Int, weight:Option[Int], age:Int, gender: Option[String])
val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
df1.show()
+---+------+---+------+
| id|weight|age|gender|
+---+------+---+------+
| 1| 100| 23| Male|
| 2| null| 25| null|
| 3| null| 33|Female|
+---+------+---+------+
val s = df1.columns.map(c => sum(col(c).isNull.cast("integer")).alias(c))
val df2 = df1.agg(s.head, s.tail:_*)
val t = df2.columns.map(c => df2.select(lit(c).alias("col_name"), col(c).alias("null_count")))
val df_agg_col = t.reduce((df1, df2) => df1.union(df2))
df_agg_col.show()
I have a large dataset of which I would like to drop columns that contain null values and return a new dataframe. How can I do that?
The following only drops a single column or rows containing null.
df.where(col("dt_mvmt").isNull()) #doesnt work because I do not have all the columns names or for 1000's of columns
df.filter(df.dt_mvmt.isNotNull()) #same reason as above
df.na.drop() #drops rows that contain null, instead of columns that contain null
For example
a | b | c
1 | | 0
2 | 2 | 3
In the above case it will drop the whole column B because one of its values is empty.
Here is one possible approach for dropping all columns that have NULL values: See here for the source on the code of counting NULL values per column.
import pyspark.sql.functions as F
# Sample data
df = pd.DataFrame({'x1': ['a', '1', '2'],
'x2': ['b', None, '2'],
'x3': ['c', '0', '3'] })
df = sqlContext.createDataFrame(df)
df.show()
def drop_null_columns(df):
"""
This function drops all columns which contain null values.
:param df: A PySpark DataFrame
"""
null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
to_drop = [k for k, v in null_counts.items() if v > 0]
df = df.drop(*to_drop)
return df
# Drops column b2, because it contains null values
drop_null_columns(df).show()
Before:
+---+----+---+
| x1| x2| x3|
+---+----+---+
| a| b| c|
| 1|null| 0|
| 2| 2| 3|
+---+----+---+
After:
+---+---+
| x1| x3|
+---+---+
| a| c|
| 1| 0|
| 2| 3|
+---+---+
Hope this helps!
If we need to keep only the rows having at least one inspected column not null then use this. Execution time is very less.
from operator import or_
from functools import reduce
inspected = df.columns
df = df.where(reduce(or_, (F.col(c).isNotNull() for c in inspected ), F.lit(False)))```
I have two spark dataframes:
Dataframe A:
|col_1 | col_2 | ... | col_n |
|val_1 | val_2 | ... | val_n |
and dataframe B:
|col_1 | col_2 | ... | col_m |
|val_1 | val_2 | ... | val_m |
Dataframe B can contain duplicate, updated and new rows from dataframe A. I want to write an operation in spark where I can create a new dataframe containing the rows from dataframe A and the updated and new rows from dataframe B.
I started by creating a hash column containing only the columns that are not updatable. This is the unique id. So let's say col1 and col2 can change value (can be updated), but col3,..,coln are unique. I have created a hash function as hash(col3,..,coln):
A=A.withColumn("hash", hash(*[col(colname) for colname in unique_cols_A]))
B=B.withColumn("hash", hash(*[col(colname) for colname in unique_cols_B]))
Now I want to write some spark code that basically selects the rows from B that have the hash not in A (so new rows and updated rows) and join them into a new dataframe together with the rows from A. How can I achieve this in pyspark?
Edit:
Dataframe B can have extra columns from dataframe A, so a union is not possible.
Sample example
Dataframe A:
+-----+-----+
|col_1|col_2|
+-----+-----+
| a| www|
| b| eee|
| c| rrr|
+-----+-----+
Dataframe B:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| d| yyy| 2|
| c| rer| 3|
+-----+-----+-----+
Result:
Dataframe C:
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| a| wew| 1|
| b| eee| null|
| c| rer| 3|
| d| yyy| 2|
+-----+-----+-----+
This is closely related to update a dataframe column with new values, except that you also want to add the rows from DataFrame B. One approach would be to first do what is outlined in the linked question and then union the result with DataFrame B and drop duplicates.
For example:
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
'col_1',
f.when(
~f.isnull(f.col('b.col_2')),
f.col('b.col_2')
).otherwise(f.col('a.col_2')).alias('col_2'),
'b.col_3'
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
#+-----+-----+-----+
#|col_1|col_2|col_3|
#+-----+-----+-----+
#| a| wew| 1|
#| b| eee| null|
#| c| rer| 3|
#| d| yyy| 2|
#+-----+-----+-----+
Or more generically using a list comprehension if you have a lot of columns to replace and you don't want to hard code them all:
cols_to_update = ['col_2']
dfA.alias('a').join(dfB.alias('b'), on=['col_1'], how='left')\
.select(
*[
['col_1'] +
[
f.when(
~f.isnull(f.col('b.{}'.format(c))),
f.col('b.{}'.format(c))
).otherwise(f.col('a.{}'.format(c))).alias(c)
for c in cols_to_update
] +
['b.col_3']
]
)\
.union(dfB)\
.dropDuplicates()\
.sort('col_1')\
.show()
I would opt for different solution, which I believe is less verbose, more generic and does not involve column listing. I would first identify subset of dfA that will be updated (replaceDf) by performing inner join based on keyCols (list). Then I would subtract this replaceDF from dfA and union it with dfB.
replaceDf = dfA.alias('a').join(dfB.alias('b'), on=keyCols, how='inner').select('a.*')
resultDf = dfA.subtract(replaceDf).union(dfB).show()
Even though there will be different columns in dfA and dfB, you can still overcome this with obtaining list of columns from both DataFrames and finding their union. Then I would
prepare select query (instead of "select.('a.')*") so that I would just list columns from dfA that exist in dfB + "null as colname" for those that do not exist in dfB.
If you want to keep only unique values, and require strictly correct results, then union followed by dropDupilcates should do the trick:
columns_which_dont_change = [...]
old_df.union(new_df).dropDuplicates(subset=columns_which_dont_change)
I'm trying to transpose some columns of my table to row.
I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-------------------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
I would like to have somthing like this:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
Does someone know haw I can do it? Thank you for your help.
Spark >= 3.4
You can use built-in melt method. With Python:
df.melt(
ids=["A"], values=["col_1", "col_2"],
variableColumnName="key", valueColumnName="val"
)
with Scala
df.melt(Array($"A"), Array($"col_1", $"col_2"), "key", "val")
Spark < 3.4
It is relatively simple to do with basic Spark SQL functions.
Python
from pyspark.sql.functions import array, col, explode, struct, lit
df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
def to_long(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
to_long(df, ["A"])
Scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")
def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")
val kvs = explode(array(
cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
))
val byExprs = by.map(col(_))
df
.select(byExprs :+ kvs.alias("_kvs"): _*)
.select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}
toLong(df, Seq("A"))
One way to solve with pyspark sql using functions create_map and explode.
from pyspark.sql import functions as func
#Use `create_map` to create the map of columns with constant
df = df.withColumn('mapCol', \
func.create_map(func.lit('col_1'),df.col_1,
func.lit('col_2'),df.col_2,
func.lit('col_3'),df.col_3
)
)
#Use explode function to explode the map
res = df.select('*',func.explode(df.mapCol).alias('col_id','col_value'))
res.show()
The Spark local linear algebra libraries are presently very weak: and they do not include basic operations as the above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written transpose in scala - but not in python. Here is the scala version:
def transpose(mat: DMatrix) = {
val nCols = mat(0).length
val matT = mat
.flatten
.zipWithIndex
.groupBy {
_._2 % nCols
}
.toSeq.sortBy {
_._1
}
.map(_._2)
.map(_.map(_._1))
.toArray
matT
}
So you can convert that to python for your use. I do not have bandwidth to write/test that at this particular moment: let me know if you were unable to do that conversion.
At the least - the following are readily converted to python.
zipWithIndex --> enumerate() (python equivalent - credit to #zero323)
map --> [someOperation(x) for x in ..]
groupBy --> itertools.groupBy()
Here is the implementation for flatten which does not have a python equivalent:
def flatten(L):
for item in L:
try:
for i in flatten(item):
yield i
except TypeError:
yield item
So you should be able to put those together for a solution.
You could use the stack function:
for example:
df.selectExpr("stack(2, 'col_1', col_1, 'col_2', col_2) as (key, value)")
where:
2 is the number of columns to stack (col_1 and col_2)
'col_1' is a string for the key
col_1 is the column from which to take the values
if you have several columns, you could build the whole stack string iterating the column names and pass that to selectExpr
Use flatmap. Something like below should work
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID': k, 'colValue': row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))
I took the Scala answer that #javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what OP was asking...
from itertools import chain
from pyspark.sql import DataFrame
def _sort_transpose_tuple(tup):
x, y = tup
return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]
def transpose(X):
"""Transpose a PySpark DataFrame.
Parameters
----------
X : PySpark ``DataFrame``
The ``DataFrame`` that should be tranposed.
"""
# validate
if not isinstance(X, DataFrame):
raise TypeError('X should be a DataFrame, not a %s'
% type(X))
cols = X.columns
n_features = len(cols)
# Sorry for this unreadability...
return X.rdd.flatMap( # make into an RDD
lambda xs: chain(xs)).zipWithIndex().groupBy( # zip index
lambda val_idx: val_idx[1] % n_features).sortBy( # group by index % n_features as key
lambda grp_res: grp_res[0]).map( # sort by index % n_features key
lambda grp_res: _sort_transpose_tuple(grp_res)).map( # maintain order
lambda key_col: key_col[1]).toDF() # return to DF
For example:
>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
>>> transpose(X).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 4| 7|
| 2| 5| 8|
| 3| 6| 9|
+---+---+---+
A very handy way to implement:
from pyspark.sql import Row
def rowExpander(row):
rowDict = row.asDict()
valA = rowDict.pop('A')
for k in rowDict:
yield Row(**{'A': valA , 'colID' : k, 'colValue' : row[k]})
newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander)
To transpose Dataframe in pySpark, I use pivot over the temporary created column, which I drop at the end of the operation.
Say, we have a table like this. What we wanna do is to find all users over each listed_days_bin value.
+------------------+-------------+
| listed_days_bin | users_count |
+------------------+-------------+
|1 | 5|
|0 | 2|
|0 | 1|
|1 | 3|
|1 | 4|
|2 | 5|
|2 | 7|
|2 | 2|
|1 | 1|
+------------------+-------------+
Create new temp column - 'pvt_value', aggregate over it and pivot results
import pyspark.sql.functions as F
agg_df = df.withColumn('pvt_value', lit(1))\
.groupby('pvt_value')\
.pivot('listed_days_bin')\
.agg(F.sum('users_count')).drop('pvt_value')
New Dataframe should look like:
+----+---+---+
| 0 | 1 | 2 | # Columns
+----+---+---+
| 3| 13| 14| # Users over the bin
+----+---+---+
I found PySpark to be too complicated to transpose so I just convert my dataframe to Pandas and use the transpose() method and convert the dataframe back to PySpark if required.
dfOutput = spark.createDataFrame(dfPySpark.toPandas().transpose())
dfOutput.display()