I'm using PySpark 2.0.1. I'm trying to group my data frame and retrieve the values of all the other fields. I found that
z = data1.groupby('country').agg(F.collect_list('names'))
gives me the values for the country and names attributes, and for the names attribute the column header becomes collect_list(names). But my dataframe has around 15 columns, and I will run a loop that changes the groupby field each time and needs the output for all of the remaining fields. Can you suggest how to do this using collect_list() or any other PySpark function?
I tried this code too
from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got this error message:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
Suppose you have a dataframe:
import pyspark.sql.functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operations can be applied only to single columns.
After aggregation, you can collect the result and iterate over it to separate the combined columns, or you can write a
udf to separate the combined columns.
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def foo(x):
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
Actually, we can do it in PySpark 2.2.
First we create a constant column ("Temp"), group by that column ("Temp"), and apply agg with an unpacked iterable *exprs containing the collect_list expressions.
Below is the code:
import pyspark.sql.functions as ftions

def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
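For completeness, a call on the input above might look like this (my usage sketch, assuming the helper is defined as shown):
result = groupColumnData(df, ["a", "b", "c"])
result.show()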
In Spark 2.4.4 and Python 3.7 (I guess it's also relevant for previous Spark and Python versions):
My suggestion is based on pauli's answer. Instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
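If you then need one array per original column, a dotted field reference on the array of structs should do it without a udf (my sketch, reusing the same df and aggregation as above):
from pyspark.sql.functions import col, collect_list, struct

agg_df = df.groupBy("a").agg(collect_list(struct(["b", "c"])).alias("res"))
# res.b / res.c pull the b and c fields out of every struct in the array
agg_df.select("a", col("res.b").alias("b"), col("res.c").alias("c")).show()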
I just use the concat_ws function and it works perfectly fine:
from pyspark.sql.functions import *

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r')).show()
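Note that this collects comma-joined strings rather than structs. If you need arrays back, one option is to split each element again (my sketch, assuming the same df and Spark 2.4+ for the transform higher-order function):
from pyspark.sql import functions as F

joined = df.groupBy('a').agg(F.collect_list(F.concat_ws(',', 'b', 'c')).alias('r'))
# Split each "b,c" string back into an array of strings
joined.select('a', F.expr("transform(r, x -> split(x, ','))").alias('r_arrays')).show(truncate=False)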
I want to add a new column new_col: if the value of column a is in yes_list, then the value in new_col should be 1, otherwise 0.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = sqlContext.read.json(rdd)
yes_list = ['y']
Something like this:
rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])
But the above is not correct and raises an error:
TypeError: Column is not iterable
How to achieve it?
You can use the when and isin functions from the Spark SQL API. It would go as follows:
from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+
| a| b| c|new_col|
+---+----+----+-------+
| y|null|null| 1|
| y| 2|null| 1|
| n|null| 3| 0|
+---+----+----+-------+
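A variant (my own sketch, not required for the above) is to cast the boolean isin result directly, assuming column a never contains nulls:
from pyspark.sql import functions as F

rdd_df.withColumn("new_col", F.col("a").isin(yes_list).cast("int")).show()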
I have a dataframe in PySpark:
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df.show()
+---+
| id|
+---+
| a|
| b|
| c|
| d|
| e|
+---+
And I have a list of lists:
l = [[1,1], [2,2], [3,3], [4,4], [5,5]]
Is it possible to append this list as a column to df? Namely, the first element of l should appear next to the first row of df, the second element of l next to the second row of df, etc. It should look like this:
+---+------+
| id|     l|
+---+------+
|  a| [1,1]|
|  b| [2,2]|
|  c| [3,3]|
|  d| [4,4]|
|  e| [5,5]|
+---+------+
UDFs are generally slow, but a more efficient way without using any UDFs would be:
import pyspark.sql.functions as F
ldf = spark.createDataFrame(l, schema = "array<int>")
df1 = df.withColumn("m_id", F.monotonically_increasing_id())
df2 = ldf.withColumn("m_id", F.monotonically_increasing_id())
df3 = df2.join(df1, "m_id", "outer").drop("m_id")
df3.select("id", "value").show()
+---+------+
| id| value|
+---+------+
| a|[1, 1]|
| b|[2, 2]|
| d|[4, 4]|
| c|[3, 3]|
| e|[5, 5]|
+---+------+
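Note that monotonically_increasing_id generates ids that depend on the partition layout, so aligning two DataFrames this way is only safe when both sides end up with matching ids, and the join also loses the original order, as the output above shows. A stricter sketch of my own (assuming df.count() == len(l)) zips each side with its positional index first:
# Pair each df row and each list element with its position, join on it, keep the order
indexed_rows = df.rdd.zipWithIndex().map(lambda ri: (ri[1], ri[0]["id"]))
indexed_vals = spark.sparkContext.parallelize(l).zipWithIndex().map(lambda vi: (vi[1], vi[0]))
joined = indexed_rows.join(indexed_vals).sortByKey().map(lambda kv: (kv[1][0], kv[1][1]))
joined.toDF(["id", "l"]).show()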
Assuming that you have the same number of rows in your df as items in your list (df.count() == len(l)), you can add a row_num column to your df (to specify the order) and, based on it, access the item in your list (l).
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import *
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
df.show()
The above code will give you:
+---+-------+
| id|row_num|
+---+-------+
|  a|      1|
|  b|      2|
|  c|      3|
|  d|      4|
|  e|      5|
+---+-------+
Then, you can just iterate over your df and access the corresponding index in your list:
def map_df(row):
    return (row.id, l[row.row_num - 1])

new_df = df.rdd.map(map_df).toDF(["id", "l"])
new_df.show()
Output:
+---+------+
| id|     l|
+---+------+
|  a|[1, 1]|
|  b|[2, 2]|
|  c|[3, 3]|
|  d|[4, 4]|
|  e|[5, 5]|
+---+------+
Thanks to Cesar's answer, I figured out how to do it without making the dataframe an RDD and coming back. It would be something like this:
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import row_number, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, IntegerType
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
new_col = [[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]]
map_list_to_column = udf(lambda row_num: new_col[row_num -1], ArrayType(FloatType()))
df.withColumn('new_col', map_list_to_column(df.row_num)).drop('row_num').show()
I'm trying to filter my dataframe in Pyspark and I want to write my results in a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if one of them is true the resulting row should be written in the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col

df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!
First off, the AttributeError comes from your first attempt: df.where(...) returns a DataFrame, not a Column, so it can't be combined with | inside filter. About the col1 filter, you could do it using isin like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['atribute1', 'atribute2']))|(df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have sample data to test it, sorry.
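A minimal sketch of the full pipeline, assuming the same df and column names as in the question:
filtered = df.filter(
    df['col1'].isin(['attribute1', 'attribute2']) | df['col2'].isNotNull()
)
filtered.write.mode("overwrite").parquet("new_parquet.parquet")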
See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 5| null|
| 9| a|
| 1| b|
| 7| null|
| 3| null|
+---+-----+
Now we do the filter:
df = df.filter( (df['id']==3)|(df['id']=='9')|(~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
| 3| a|
| 9| a|
| 1| b|
| 3| null|
+---+-----+
So you see:
row(3, 'a') and row(3, null) are selected because of df['id']==3
row(9, 'a') is selected because of df['id']=='9'
row(1, 'b') is selected because of ~F.isnull('value'), but row(5, null) and row(7, null) are not selected because they match none of the conditions.
I have a Dataframe, which contains the following data:
df.show()
+-----+------+--------+
| id_A| idx_B| B_value|
+-----+------+--------+
| a| 0| 7|
| b| 0| 5|
| b| 2| 2|
+-----+------+--------+
Assuming B have total of 3 possible indices, I want to create a table that will merge all indices and values into a list (or numpy array) that looks like this:
final_df.show()
+-----+----------+
| id_A| B_values|
+-----+----------+
| a| [7, 0, 0]|
| b| [5, 0, 2]|
+-----+----------+
I've managed to get to this point:
from pyspark.sql import functions as f
temp_df = df.withColumn('B_tuple', f.struct(df['idx_B'], df['B_value']))\
.groupBy('id_A').agg(f.collect_list('B_tuple').alias('B_tuples'))
temp_df.show()
+-----+-----------------+
| id_A| B_tuples|
+-----+-----------------+
| a| [[0, 7]]|
| b| [[0, 5], [2, 2]]|
+-----+-----------------+
But now I can't run a proper udf function to turn temp_df into final_df.
Is there a simpler way to do so?
If not, what is the proper function I should use to finish the transformation?
So I've found a solution:
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, IntegerType

size = 3  # total number of possible idx_B values

def create_vector(tuples_list, size):
    my_list = [0] * size
    for x in tuples_list:
        my_list[x["idx_B"]] = x["B_value"]
    return my_list

create_vector_udf = f.udf(lambda tuples_list: create_vector(tuples_list, size), ArrayType(IntegerType()))
final_df = temp_df.withColumn('B_values', create_vector_udf(temp_df['B_tuples'])).select(['id_A', 'B_values'])
final_df.show()
+-----+----------+
| id_A| B_values|
+-----+----------+
| a| [7, 0, 0]|
| b| [5, 0, 2]|
+-----+----------+
If you already know the size of the array, you can do this without a udf.
Take advantage of the optional second argument to pivot(): values. Per the docs, this takes in a "List of values that will be translated to columns in the output DataFrame".
So groupBy the id_A column, and pivot the DataFrame on the idx_B column. Since not all indices may be present, you can pass in range(size) as the values argument.
import pyspark.sql.functions as f
size = 3
df = df.groupBy("id_A").pivot("idx_B", values=range(size)).agg(f.first("B_value"))
df = df.na.fill(0)
df.show()
#+----+---+---+---+
#|id_A| 0| 1| 2|
#+----+---+---+---+
#| b| 5| 0| 2|
#| a| 7| 0| 0|
#+----+---+---+---+
The indices that are not present in the data will default to null, so we call na.fill(0) as this is the default value.
Once you have your data in this format, you just need to create an array from the columns:
df.select("id_A", f.array([f.col(str(i)) for i in range(size)]).alias("B_values")).show()
#+----+---------+
#|id_A| B_values|
#+----+---------+
#| b|[5, 0, 2]|
#| a|[7, 0, 0]|
#+----+---------+
I'm trying to transpose some columns of my table to row.
I'm using Python and Spark 1.5.0. Here is my initial table:
+-----+-----+-----+-------+
|  A  |col_1|col_2|col_...|
+-----+-----+-----+-------+
|  1  |  0.0|  0.6|  ...  |
|  2  |  0.6|  0.7|  ...  |
|  3  |  0.5|  0.9|  ...  |
| ... |  ...|  ...|  ...  |
+-----+-----+-----+-------+
I would like to have something like this:
+-----+--------+-----------+
|  A  | col_id | col_value |
+-----+--------+-----------+
|  1  |  col_1 |       0.0 |
|  1  |  col_2 |       0.6 |
| ... |    ... |       ... |
|  2  |  col_1 |       0.6 |
|  2  |  col_2 |       0.7 |
| ... |    ... |       ... |
|  3  |  col_1 |       0.5 |
|  3  |  col_2 |       0.9 |
| ... |    ... |       ... |
+-----+--------+-----------+
Does someone know how I can do it? Thank you for your help.
Spark >= 3.4
You can use the built-in melt method. With Python:
df.melt(
    ids=["A"], values=["col_1", "col_2"],
    variableColumnName="key", valueColumnName="val"
)
With Scala:
df.melt(Array($"A"), Array($"col_1", $"col_2"), "key", "val")
Spark < 3.4
It is relatively simple to do with basic Spark SQL functions.
Python
from pyspark.sql.functions import array, col, explode, struct, lit
df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
def to_long(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

to_long(df, ["A"])
Scala:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")
def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")

  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
  ))

  val byExprs = by.map(col(_))

  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}

toLong(df, Seq("A"))
One way to solve this with PySpark SQL is to use the functions create_map and explode:
from pyspark.sql import functions as func

# Use `create_map` to build a map of column-name literals to column values
df = df.withColumn('mapCol',
                   func.create_map(func.lit('col_1'), df.col_1,
                                   func.lit('col_2'), df.col_2,
                                   func.lit('col_3'), df.col_3))

# Use the explode function to explode the map into key/value rows
res = df.select('*', func.explode(df.mapCol).alias('col_id', 'col_value'))
res.show()
The Spark local linear algebra libraries are presently very weak, and they do not include basic operations such as the above.
There is a JIRA for fixing this for Spark 2.1 - but that will not help you today.
Something to consider: performing a transpose will likely require completely shuffling the data.
For now you will need to write RDD code directly. I have written transpose in scala - but not in python. Here is the scala version:
def transpose(mat: DMatrix) = {
  val nCols = mat(0).length
  val matT = mat
    .flatten
    .zipWithIndex
    .groupBy { _._2 % nCols }
    .toSeq.sortBy { _._1 }
    .map(_._2)
    .map(_.map(_._1))
    .toArray
  matT
}
So you can convert that to python for your use. I do not have bandwidth to write/test that at this particular moment: let me know if you were unable to do that conversion.
At the least - the following are readily converted to python.
zipWithIndex --> enumerate() (python equivalent - credit to @zero323)
map --> [someOperation(x) for x in ..]
groupBy --> itertools.groupBy()
Here is the implementation for flatten which does not have a python equivalent:
def flatten(L):
    for item in L:
        try:
            for i in flatten(item):
                yield i
        except TypeError:
            yield item
So you should be able to put those together for a solution.
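Putting those pieces together, here is a minimal plain-Python sketch of the same flatten / zipWithIndex / groupBy pipeline (my own translation of the Scala above, operating on a local list of rows rather than an RDD, and reusing the flatten generator defined above):
from itertools import groupby

def transpose_local(mat):
    # Transpose a matrix given as a list of rows (local Python, not distributed)
    n_cols = len(mat[0])
    indexed = list(enumerate(flatten(mat)))  # flatten + zipWithIndex (index comes first here)
    # Group by column index; sort first because itertools.groupby needs sorted keys,
    # and Python's stable sort preserves the within-column order.
    indexed.sort(key=lambda iv: iv[0] % n_cols)
    return [[v for _, v in grp]
            for _, grp in groupby(indexed, key=lambda iv: iv[0] % n_cols)]

print(transpose_local([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]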
You could use the stack function:
for example:
df.selectExpr("stack(2, 'col_1', col_1, 'col_2', col_2) as (key, value)")
where:
2 is the number of columns to stack (col_1 and col_2)
'col_1' is a string for the key
col_1 is the column from which to take the values
If you have several columns, you could build the whole stack string by iterating over the column names and pass that to selectExpr, as sketched below.
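A minimal sketch of building that expression (my own, assuming the id column is named A and every other column of df should be unpivoted):
value_cols = [c for c in df.columns if c != "A"]                     # columns to unpivot
stack_args = ", ".join("'{0}', {0}".format(c) for c in value_cols)   # 'col_1', col_1, 'col_2', col_2, ...
stack_expr = "stack({}, {}) as (col_id, col_value)".format(len(value_cols), stack_args)
df.selectExpr("A", stack_expr).show()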
Use flatMap. Something like the below should work:
from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('A')
    for k in rowDict:
        yield Row(**{'A': valA, 'colID': k, 'colValue': row[k]})

newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))
I took the Scala answer that @javadba wrote and created a Python version for transposing all columns in a DataFrame. This might be a bit different from what the OP was asking...
from itertools import chain
from pyspark.sql import DataFrame


def _sort_transpose_tuple(tup):
    x, y = tup
    return x, tuple(zip(*sorted(y, key=lambda v_k: v_k[1], reverse=False)))[0]


def transpose(X):
    """Transpose a PySpark DataFrame.

    Parameters
    ----------
    X : PySpark ``DataFrame``
        The ``DataFrame`` that should be transposed.
    """
    # validate
    if not isinstance(X, DataFrame):
        raise TypeError('X should be a DataFrame, not a %s' % type(X))

    cols = X.columns
    n_features = len(cols)

    # Sorry for this unreadability...
    return X.rdd.flatMap(  # make into an RDD
        lambda xs: chain(xs)).zipWithIndex().groupBy(  # zip index
        lambda val_idx: val_idx[1] % n_features).sortBy(  # group by index % n_features as key
        lambda grp_res: grp_res[0]).map(  # sort by index % n_features key
        lambda grp_res: _sort_transpose_tuple(grp_res)).map(  # maintain order
        lambda key_col: key_col[1]).toDF()  # return to DF
For example:
>>> X = sc.parallelize([(1,2,3), (4,5,6), (7,8,9)]).toDF()
>>> X.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
>>> transpose(X).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| 4| 7|
| 2| 5| 8|
| 3| 6| 9|
+---+---+---+
To transpose a Dataframe in PySpark, I use pivot over a temporarily created column, which I drop at the end of the operation.
Say we have a table like this, and we want to find the total users count for each listed_days_bin value.
+-----------------+-------------+
| listed_days_bin | users_count |
+-----------------+-------------+
|               1 |           5 |
|               0 |           2 |
|               0 |           1 |
|               1 |           3 |
|               1 |           4 |
|               2 |           5 |
|               2 |           7 |
|               2 |           2 |
|               1 |           1 |
+-----------------+-------------+
Create a new temp column 'pvt_value', group by it, pivot on 'listed_days_bin', and aggregate:
import pyspark.sql.functions as F

agg_df = df.withColumn('pvt_value', F.lit(1))\
           .groupby('pvt_value')\
           .pivot('listed_days_bin')\
           .agg(F.sum('users_count')).drop('pvt_value')
The new Dataframe should look like this:
+----+---+---+
| 0 | 1 | 2 | # Columns
+----+---+---+
| 3| 13| 14| # Users over the bin
+----+---+---+
I found PySpark too complicated for transposing, so I just convert my dataframe to Pandas, use the transpose() method, and convert it back to PySpark if required. Note that toPandas() collects the whole dataframe onto the driver, so this only works when the data fits in memory.
dfOutput = spark.createDataFrame(dfPySpark.toPandas().transpose())
dfOutput.display()