In Spark, I have the following DataFrame called "df" with some null entries:
+-------+--------------------+--------------------+
| id| features1| features2|
+-------+--------------------+--------------------+
| 185|(5,[0,1,4],[0.1,0...| null|
| 220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
| 225| null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+
df.features1 and df.features2 are of type vector (nullable). I then tried the following code to fill the null entries with SparseVectors:
df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})
This code led to the following error:
AttributeError: 'SparseVector' object has no attribute '_get_object_id'
Then I found following paragraph in spark documentation:
fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
Does this explain my failure to replace null entries with SparseVectors in DataFrame? Or does this mean that there's no way to do this in DataFrame?
I can achieve my goal by converting DataFrame to RDD and replacing None values with SparseVectors, but it will be much more convenient for me to do this directly in DataFrame.
Is there any method to do this directly in DataFrame?
Thanks!
You can use a udf:
from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT
fill_with_vector = udf(
lambda x, i: x if x is not None else SparseVector(i, {}),
VectorUDT()
)
df = sc.parallelize([
(SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])
(df
.withColumn("features1", fill_with_vector("features1", lit(5)))
.withColumn("features2", fill_with_vector("features2", lit(10)))
.show())
# +-------------+---------------+
# | features1| features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# | (5,[],[])| (10,[],[])|
# +-------------+---------------+
Related
Notice: this is for Spark version 2.1.1.2.6.1.0-129
I have a Spark DataFrame. One of the columns holds states as type string (e.g. Illinois, California, Nevada). There are some instances of numbers in this column (e.g. 12, 24, 01, 2). I would like to replace any instance of an integer with a NULL.
The following is some code that I have written:
my_df = my_df.selectExpr(
" regexp_replace(states, '^-?[0-9]+$', '') AS states ",
"someOtherColumn")
This regex expression replaces any instance of an integer with an empty string. I would like to replace it with None in python to designate it as a NULL value in the DataFrame.
I strongly suggest looking at the PySpark SQL functions and using them instead of selectExpr:
from pyspark.sql import functions as F
(df
    # Write to a new column so the output below shows both the original and the fixed values
    .withColumn('states_fixed', F
        .when(F.regexp_replace(F.col('states'), '^-?[0-9]+$', '') == '', None)
        .otherwise(F.col('states'))
    )
    .show()
)
# Output
# +----------+------------+
# | states|states_fixed|
# +----------+------------+
# | Illinois| Illinois|
# | 12| null|
# |California| California|
# | 01| null|
# | Nevada| Nevada|
# +----------+------------+
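To check which values the pattern '^-?[0-9]+$' treats as integers without starting Spark, the same pattern can be tried with Python's re module. This is only a sketch: the helper name null_if_integer is made up for illustration, and Spark uses Java regular expressions, which should behave the same way for a pattern this simple.

```python
import re

# Same pattern as in the answer: an optional minus sign followed by digits only.
INTEGER_PATTERN = re.compile(r"^-?[0-9]+$")

def null_if_integer(value):
    """Return None when the whole string is an integer, else the string unchanged."""
    if value is not None and INTEGER_PATTERN.match(value):
        return None
    return value

states = ["Illinois", "12", "California", "01", "Nevada", "-7", "Route 66"]
cleaned = [null_if_integer(s) for s in states]
print(cleaned)
```

Because the pattern is anchored at both ends, a value such as "Route 66" that merely contains digits is left alone.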
I have three arrays of string type containing the following information:
groupBy array: containing names of the columns I want to group my data by.
aggregate array: containing names of columns I want to aggregate.
operations array: containing the aggregate operations I want to perform
I am trying to use Spark DataFrames to achieve this. Spark DataFrames provide an agg() to which you can pass a Map[String, String] (of column name and respective aggregate operation) as input. However, I want to perform different aggregation operations on the same column of the data. Any suggestions on how to achieve this?
Scala:
You can for example map over a list of functions with a defined mapping from name to function:
import org.apache.spark.sql.functions.{col, min, max, avg}
import org.apache.spark.sql.Column
val df = Seq((1L, 3.0), (1L, 3.0), (2L, -5.0)).toDF("k", "v")
val mapping: Map[String, Column => Column] = Map(
"min" -> min, "max" -> max, "mean" -> avg)
val groupBy = Seq("k")
val aggregate = Seq("v")
val operations = Seq("min", "max", "mean")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))
df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*).show
// +---+------+------+------+
// | k|min(v)|max(v)|avg(v)|
// +---+------+------+------+
// | 1| 3.0| 3.0| 3.0|
// | 2| -5.0| -5.0| -5.0|
// +---+------+------+------+
or
df.groupBy(groupBy.head, groupBy.tail: _*).agg(exprs.head, exprs.tail: _*).show
Unfortunately, the parser used internally by SQLContext is not exposed publicly, but you can always build plain SQL queries:
df.registerTempTable("df")
val groupExprs = groupBy.mkString(",")
val aggExprs = aggregate.flatMap(c => operations.map(
f => s"$f($c) AS ${c}_${f}")
).mkString(",")
sqlContext.sql(s"SELECT $groupExprs, $aggExprs FROM df GROUP BY $groupExprs")
Python:
from pyspark.sql.functions import mean, sum, max, col
df = sc.parallelize([(1, 3.0), (1, 3.0), (2, -5.0)]).toDF(["k", "v"])
groupBy = ["k"]
aggregate = ["v"]
funs = [mean, sum, max]
exprs = [f(col(c)) for f in funs for c in aggregate]
# or equivalent df.groupby(groupBy).agg(*exprs)
df.groupby(*groupBy).agg(*exprs)
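The plain-SQL approach shown for Scala carries over to Python as ordinary string building. The sketch below only assembles the query text from the three lists; the variable names are illustrative, and it assumes the DataFrame has been registered as a temporary table named df so that the result can be passed to sqlContext.sql(query).

```python
group_by = ["k"]
aggregate = ["v"]
operations = ["min", "max", "mean"]

# One "op(col) AS col_op" expression per (column, operation) pair.
group_exprs = ", ".join(group_by)
agg_exprs = ", ".join(
    f"{op}({c}) AS {c}_{op}" for c in aggregate for op in operations
)

query = f"SELECT {group_exprs}, {agg_exprs} FROM df GROUP BY {group_exprs}"
print(query)
```

The column and function names are interpolated directly into the SQL text, so this is only suitable for names you control.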
See also:
Spark SQL: apply aggregate functions to a list of column
For those wondering how @zero323's answer can be written without a list comprehension in Python:
from pyspark.sql.functions import min, max, col
# init your spark dataframe
expr = [min(col("valueName")),max(col("valueName"))]
df.groupBy("keyName").agg(*expr)
Do something like
from pyspark.sql import functions as F
df.groupBy('groupByColName') \
.agg(F.sum('col1').alias('col1_sum'),
F.max('col2').alias('col2_max'),
F.avg('col2').alias('col2_avg')) \
.show()
Here is another straightforward way to apply different aggregate functions to the same column in Scala (this has been tested in Azure Databricks).
val groupByColName = "Store"
val colName = "Weekly_Sales"
df.groupBy(groupByColName)
.agg(min(colName),
max(colName),
round(avg(colName), 2))
.show()
For example, to compute the share of zeroes in each column of a PySpark DataFrame, you can build one expression per column and pass them all to agg:
from pyspark.sql.functions import col, count, sum as _sum
def count_zero_percentage(c):
    # Fraction of rows where column c equals zero (uses Spark's sum, not the Python built-in)
    pred = col(c) == 0
    return (_sum(pred.cast("integer")) / count("*")).alias(c)
df.agg(*[count_zero_percentage(c) for c in df.columns]).show()
case class soExample(firstName: String, lastName: String, Amount: Int)
val df = Seq(soExample("me", "zack", 100)).toDF
import org.apache.spark.sql.functions._
val groupped = df.groupBy("firstName", "lastName").agg(
sum("Amount"),
mean("Amount"),
stddev("Amount"),
count(lit(1)).alias("numOfRecords")
).toDF()
display(groupped)
// Courtesy Zach ..
This is Zach's simplified answer to a post marked as a duplicate:
Spark Scala Data Frame to have multiple aggregation of single Group By
I have a DataFrame (in PySpark) in which one of the columns holds dictionary-like values:
df.show()
And it looks like:
+----+---+-----------------------------+
|name|age|info |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda} |
+----+---+-----------------------------+
Based on the comments, here is more detail:
df.printSchema()
The types are strings
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- dict: string (nullable = true)
Is it possible to take the keys from the dictionary (color and car) and make them columns in the dataframe, and have the values be the rows for those columns?
Expected Result:
+----+---+-----+----------+
|name|age|color|car       |
+----+---+-----+----------+
|rob |26 |red  |volkswagen|
|evan|25 |blue |mazda     |
+----+---+-----+----------+
I don't know whether I have to use df.withColumn() and somehow iterate through the dictionary to pick each key and then make a column out of it. I've tried to find some answers so far, but most were using Pandas, not Spark, so I'm not sure if I can apply the same logic.
Your strings:
"{color: red, car: volkswagen}"
"{color: blue, car: mazda}"
are not in a Python-friendly format. They can't be parsed using json.loads, nor can they be evaluated using ast.literal_eval.
However, if you knew the keys ahead of time and can assume that the strings are always in this format, you should be able to use pyspark.sql.functions.regexp_extract:
For example:
from pyspark.sql.functions import regexp_extract
df.withColumn("color", regexp_extract("info", r"(?<=color: )\w+(?=(,|}))", 0))\
    .withColumn("car", regexp_extract("info", r"(?<=car: )\w+(?=(,|}))", 0))\
    .show(truncate=False)
#+----+---+-----------------------------+-----+----------+
#|name|age|info |color|car |
#+----+---+-----------------------------+-----+----------+
#|rob |26 |{color: red, car: volkswagen}|red |volkswagen|
#|evan|25 |{color: blue, car: mazda} |blue |mazda |
#+----+---+-----------------------------+-----+----------+
The pattern is:
(?<=color: ): A positive look-behind for the literal string "color: "
\w+: One or more word characters
(?=(,|})): A positive look-ahead for either a literal comma or close curly brace.
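The look-behind and look-ahead can be tried outside Spark with Python's re module, which accepts the same pattern. This is a minimal sketch; the helper name extract_value is hypothetical and not part of PySpark.

```python
import re

def extract_value(info, key):
    """Return the value stored under key in a '{k: v, k: v}' string, or None."""
    # Look-behind for "<key>: ", then word characters, then a look-ahead for "," or "}".
    pattern = r"(?<=%s: )\w+(?=(,|}))" % re.escape(key)
    match = re.search(pattern, info)
    return match.group(0) if match else None

row = "{color: red, car: volkswagen}"
print(extract_value(row, "color"))
print(extract_value(row, "car"))
print(extract_value(row, "year"))
```

One difference worth keeping in mind: this helper returns None for a missing key, whereas regexp_extract returns an empty string when the pattern does not match.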
Here is how to generalize this for more than two keys, and handle the case where the key does not exist in the string.
from pyspark.sql.functions import regexp_extract, when, col
from functools import reduce
keys = ["color", "car", "year"]
pat = r"(?<=%s: )\w+(?=(,|}))"
df = reduce(
lambda df, c: df.withColumn(
c,
when(
col("info").rlike(pat%c),
regexp_extract("info", pat%c, 0)
)
),
keys,
df
)
df.drop("info").show(truncate=False)
#+----+---+-----+----------+----+
#|name|age|color|car |year|
#+----+---+-----+----------+----+
#|rob |26 |red |volkswagen|null|
#|evan|25 |blue |mazda |null|
#+----+---+-----+----------+----+
In this case, we use pyspark.sql.functions.when and pyspark.sql.Column.rlike to test whether the string contains the pattern before we try to extract the match.
If you don't know the keys ahead of time, you'll either have to write your own parser or try to modify the data upstream.
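As a rough idea of what "your own parser" could look like, here is a plain-Python sketch that turns one such string into a dict. The function name parse_info is made up, and it assumes that keys contain no colons and that neither keys nor values contain commas. In Spark, a function like this could be wrapped in a udf that returns a map of strings, but that part is not shown here.

```python
def parse_info(info):
    """Parse a string like '{color: red, car: mazda}' into a dict of strings."""
    body = info.strip().strip("{}")
    result = {}
    for pair in body.split(","):
        if ":" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split(":", 1)
        result[key.strip()] = value.strip()
    return result

print(parse_info("{color: red, car: volkswagen}"))
print(parse_info("{color: blue, car: mazda, year: 2015}"))
```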
As the printSchema output shows, Spark treats your dictionary as a string. The function that slices a string so that new columns can be created from the pieces is split(), so a simple solution to this problem could be the following.
Create a UDF that is capable of:
Convert the dictionary string into a comma separated string (removing the keys from the dictionary but keeping the order of the values)
Apply a split and create two new columns from the new format of our dictionary
The code:
from pyspark.sql.functions import udf, split
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def transform_dict(dict_str):
    # Strip the braces and the known keys, leaving only the values
    str_of_dict_values = dict_str.\
        replace("}", "").\
        replace("{", "").\
        replace("color:", "").\
        replace(" car: ", "").\
        strip()
    # output example: 'red,volkswagen'
    return str_of_dict_values
# Create new column with our UDF with the dict values converted to str
df = df.withColumn('info_clean', transform_dict("info"))
# Split these values and store in a tmp variable
split_col = split(df['info_clean'], ',')
# Create new columns with the split values
df = df.withColumn('color', split_col.getItem(0))
df = df.withColumn('car', split_col.getItem(1))
This solution is only correct if we assume that the dictionary elements always come in the same order, and also the keys are fixed.
For more complex cases, we could build a dictionary inside the UDF and form the string of values by explicitly looking up each key, which ensures that the order in the output string is maintained.
I feel the most scalable solution is the following one, which collects the keys and passes them through a lambda to build the columns. Note that map_keys requires info to be a map column rather than a plain string:
from pyspark.sql.functions import explode,map_keys,col
keysDF = df.select(explode(map_keys(df.info))).distinct()
keysList = keysDF.rdd.map(lambda x:x[0]).collect()
keyCols = list(map(lambda x: col("info").getItem(x).alias(str(x)), keysList))
df.select(df.name, df.age, *keyCols).show()
I want to know how to map values in a specific column in a dataframe.
I have a dataframe which looks like:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
+-----+-------+
| col1| col2|
+-----+-------+
|india| japan|
| usa|uruguay|
+-----+-------+
I have a dictionary from where I want to map the values.
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])
The output I want is:
+-----+-------+--------+--------+
| col1| col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india| japan| ind| jpn|
| usa|uruguay| us| urg|
+-----+-------+--------+--------+
I have tried using the lookup function but it doesn't work. It throws error SPARK-5063. Following is my approach which failed:
def map_val(x):
    return dicts.lookup(x)[0]
myfun = udf(lambda x: map_val(x), StringType())
df = df.withColumn('col1_map', myfun('col1')) # doesn't work
df = df.withColumn('col2_map', myfun('col2')) # doesn't work
I think the easier way is just to use a simple dictionary and df.withColumn.
from itertools import chain
from pyspark.sql.functions import create_map, lit
simple_dict = {'india':'ind', 'usa':'us', 'japan':'jpn', 'uruguay':'urg'}
mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
df = df.withColumn('col1_map', mapping_expr[df['col1']])\
.withColumn('col2_map', mapping_expr[df['col2']])
df.show(truncate=False)
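The chain(*simple_dict.items()) call is what puts the dictionary into the shape create_map expects, namely a flat sequence that alternates key, value, key, value. A quick plain-Python look at that intermediate list (no Spark needed):

```python
from itertools import chain

simple_dict = {'india': 'ind', 'usa': 'us', 'japan': 'jpn', 'uruguay': 'urg'}

# items() yields (key, value) pairs; chain(*...) flattens them into one sequence.
flattened = list(chain(*simple_dict.items()))
print(flattened)
```

Each element of this list is then wrapped in lit() so that create_map receives literal columns.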
udf way
I would suggest converting the list of tuples to a dict and broadcasting it so that it can be used in a udf:
dicts = sc.broadcast(dict([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]))
from pyspark.sql import functions as f
from pyspark.sql import types as t
def newCols(x):
    return dicts.value[x]
callnewColsUdf = f.udf(newCols, t.StringType())
df.withColumn('col1_map', callnewColsUdf(f.col('col1')))\
.withColumn('col2_map', callnewColsUdf(f.col('col2')))\
.show(truncate=False)
which should give you
+-----+-------+--------+--------+
|col1 |col2 |col1_map|col2_map|
+-----+-------+--------+--------+
|india|japan |ind |jpn |
|usa |uruguay|us |urg |
+-----+-------+--------+--------+
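One caveat with indexing the dictionary inside the udf: a value that is missing from the dictionary (or a null, which arrives as None) raises a KeyError. A more forgiving lookup could use dict.get instead, sketched here in plain Python with a hypothetical helper name; inside the udf this would correspond to dicts.value.get(x).

```python
mapping = {'india': 'ind', 'usa': 'us', 'japan': 'jpn', 'uruguay': 'urg'}

def lookup_or_none(value, table=mapping):
    """Return the mapped value, or None when the key is absent."""
    return table.get(value)

print(lookup_or_none('india'))
print(lookup_or_none('brazil'))
print(lookup_or_none(None))
```

A udf that returns None produces a null in the resulting column, so unmapped values would show up as null rather than failing the job.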
join way (slower than udf way)
All you have to do is convert the dicts RDD to a DataFrame as well and use two joins, as follows:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]).toDF(['key', 'value'])
from pyspark.sql import functions as f
df.join(dicts, df['col1'] == dicts['key'], 'inner')\
.select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
.join(dicts, df['col2'] == dicts['key'], 'inner') \
.select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
.show(truncate=False)
which should give you the same result
Similar to Ali AzG, but pulling it all out into a handy little method if anyone finds it useful
from itertools import chain
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from typing import Dict
def map_column_values(df: DataFrame, map_dict: Dict, column: str, new_column: str = "") -> DataFrame:
    """Handy method for mapping column values from one value to another
    Args:
        df (DataFrame): Dataframe to operate on
        map_dict (Dict): Dictionary containing the values to map from and to
        column (str): The column containing the values to be mapped
        new_column (str, optional): The name of the column to store the mapped values in.
            If not specified the values will be stored in the original column
    Returns:
        DataFrame
    """
    spark_map = F.create_map([F.lit(x) for x in chain(*map_dict.items())])
    return df.withColumn(new_column or column, spark_map[df[column]])
This can be used as follows
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local[3]").getOrCreate()
df = spark.createDataFrame([Row(A=0), Row(A=1)])
df = map_column_values(df, map_dict={0:"foo", 1:"bar"}, column="A", new_column="B")
df.show()
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#+---+---+
#| A| B|
#+---+---+
#| 0|foo|
#| 1|bar|
#+---+---+
I have a DataFrame like this in PySpark (this is the result of a take(3); the DataFrame is very big):
sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
The same owner can appear in several rows. What I need to do is sum the values of the field a_d per owner after grouping, like this:
b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))
but this throws error
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, the schema contains double values, not strings (this comes from a printSchema()):
root
|-- owner: string (nullable = true)
|-- a_d: double (nullable = true)
So what is happening here?
You are not using the correct sum function: unless you import another one, sum refers to Python's built-in function.
The built-in sum iterates over its argument and adds each item to a start value of 0. Here it receives the column name 'a_d', which is a string, so it attempts 0 + 'a' and raises the TypeError shown above. Ref. Python Official Documentation.
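The error can be reproduced without Spark at all, which shows that it comes from Python's built-in sum rather than from the DataFrame. A minimal sketch:

```python
import builtins

message = None
try:
    # The built-in sum starts from 0 and adds each character of the string to it.
    builtins.sum('a_d')
except TypeError as exc:
    message = str(exc)

print(message)
```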
You'll need to import the proper function from pyspark.sql.functions:
from pyspark.sql import Row
from pyspark.sql.functions import sum as _sum
df = sqlContext.createDataFrame(
[Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)
df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()
# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# | u1| 0.4|
# | u2| 0.0|
# +-----+-------+