How to add a constant column in a Spark DataFrame? - python
I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
1 dt = (messages
2 .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)
/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1166 [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
1167 """
-> 1168 return self.select('*', col.alias(colName))
1169
1170 @ignore_unicode_prefix
AttributeError: 'int' object has no attribute 'alias'
It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):
dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]
This is supremely hacky, right? I assume there is a more legit way to do this?
Spark 2.2+
Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):
import org.apache.spark.sql.functions.typedLit
df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):
The second argument for DataFrame.withColumn should be a Column so you have to use a literal:
from pyspark.sql.functions import lit
df.withColumn('new_column', lit(10))
If you need complex columns, you can build them using building blocks like array:
from pyspark.sql.functions import array, create_map, struct
df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))
Exactly the same methods can be used in Scala.
import org.apache.spark.sql.functions.{array, lit, map, struct}
df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
To provide names for struct fields, use either alias on each field:
df.withColumn(
"some_struct",
struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)
or cast on the whole object
df.withColumn(
"some_struct",
struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)
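The same two options work in PySpark as well; a short sketch, assuming a DataFrame df is already in scope:
from pyspark.sql.functions import lit, struct

# Name each field with alias ...
df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)

# ... or cast the whole struct to a named struct type
df.withColumn(
    "some_struct",
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)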
It is also possible, although slower, to use a UDF.
Note:
The same constructs can be used to pass constant arguments to UDFs or SQL functions.
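For example, here is a hedged PySpark sketch of passing a constant into a UDF (the UDF and the id column are hypothetical):
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType

# Hypothetical UDF that adds an offset to a numeric column
add_offset = udf(lambda x, offset: x + offset, IntegerType())

# The constant must be wrapped in lit() so the UDF receives a Column, not a bare int
df.withColumn("id_plus_ten", add_offset(col("id"), lit(10)))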
In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame:
1) Using lit
2) Using typedLit.
The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.
Sample DataFrame:
val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")
+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
+---+----+
1) Using lit: adding a constant string value in a new column named newcol:
import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))
Result:
+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
|  2|   c| myval|
+---+----+------+
2) Using typedLit:
import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))
Result:
+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+
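If you are working in PySpark, where typedLit is not exposed (at least in the Spark 2.x versions discussed here), roughly the same constant tuple can be built from lit values with struct; a sketch, assuming the df above:
from pyspark.sql.functions import lit, struct

# Rough PySpark equivalent of typedLit(("sample", 10, .044))
df.withColumn("newcol", struct(lit("sample"), lit(10), lit(0.044)))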
As the other answers have described, lit and typedLit are how to add constant columns to DataFrames. lit is an important Spark function that you will use frequently, though usually not for adding constant columns to DataFrames.
You'll commonly be using lit to create org.apache.spark.sql.Column objects, because that's the column type required by most of the functions in org.apache.spark.sql.functions.
Suppose you have a DataFrame with a some_date DateType column and would like to add a column with the days between December 31, 2020 and some_date.
Here's your DataFrame:
+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+
Here's how to calculate the days till the year end:
import java.sql.Date
import org.apache.spark.sql.functions.{col, datediff, lit}

val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
.withColumn("days_till_yearend", diff)
.show()
+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23|               99|
|2020-01-05|              361|
|2020-04-12|              263|
+----------+-----------------+
You could also use lit to create a yearend column and compute days_till_yearend like so:
import java.sql.Date
df
.withColumn("yearend", lit(Date.valueOf("2020-12-31")))
.withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
.show()
+----------+----------+-----------------+
| some_date|   yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31|               99|
|2020-01-05|2020-12-31|              361|
|2020-04-12|2020-12-31|              263|
+----------+----------+-----------------+
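For reference, here is a hedged PySpark sketch of the same calculation (assuming a DataFrame df with a some_date date column):
from pyspark.sql.functions import col, datediff, lit, to_date

df.withColumn(
    "days_till_yearend",
    datediff(to_date(lit("2020-12-31")), col("some_date"))
).show()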
Most of the time, you don't need to use lit to append a constant column to a DataFrame. You just need lit to convert a Scala type to an org.apache.spark.sql.Column object, because that's what the function requires.
See the datediff function signature:
def datediff(end: Column, start: Column): Column
As you can see, datediff requires two Column arguments.
Related
TypeError: col should be Column with apache spark
I have this method where I am gathering positive values:
def pos_values(df, metrics):
    num_pos_values = df.where(df.ttu > 1).count()
    df.withColumn("loader_ttu_pos_value", num_pos_values)
    df.write.json(metrics)
However I get TypeError: col should be Column whenever I go to test it. I tried to cast it but that doesn't seem to be an option.
The reason you're getting this error is that df.withColumn expects a Column object as its second argument, whereas you're passing num_pos_values, which is an integer. If you want to assign a literal value to a column (the same value for each row), you can use the lit function of pyspark.sql.functions. Something like this works:
df = spark.createDataFrame([("2022", "January"), ("2021", "December")], ["Year", "Month"])
df.show()
+----+--------+
|Year|   Month|
+----+--------+
|2022| January|
|2021|December|
+----+--------+
from pyspark.sql.functions import lit
df.withColumn("testColumn", lit(5)).show()
+----+--------+----------+
|Year|   Month|testColumn|
+----+--------+----------+
|2022| January|         5|
|2021|December|         5|
+----+--------+----------+
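Applied to the function from the question, a sketch of the fix is simply to wrap the computed count in lit (and to keep the DataFrame returned by withColumn, since it does not modify df in place):
from pyspark.sql.functions import lit

def pos_values(df, metrics):
    num_pos_values = df.where(df.ttu > 1).count()
    # withColumn returns a new DataFrame, so capture the result
    df = df.withColumn("loader_ttu_pos_value", lit(num_pos_values))
    df.write.json(metrics)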
Is there a Scala Spark equivalent to pandas Grouper freq feature?
In pandas, if we have a time series and need to group it by a certain frequency (say, every two weeks), it's possible to use the Grouper class, like this:
import pandas as pd
df.groupby(pd.Grouper(key='timestamp', freq='2W'))
Is there any equivalent in Spark (more specifically, using Scala) for this feature?
You can use the sql function window. First, you create the timestamp column, if you don't have one yet, from a string type datetime:
val data = Seq(("2022-01-01 00:00:00", 1),
  ("2022-01-01 00:15:00", 1),
  ("2022-01-08 23:30:00", 1),
  ("2022-01-22 23:30:00", 4))
Then, apply the window function to the timestamp column, and do the aggregation on the column you need, to obtain a result per slot:
val df0 = df.groupBy(window(col("date"), "1 week", "1 week", "0 minutes"))
  .agg(sum("a") as "sum_a")
The result includes the calculated windows. Take a look at the doc for a better understanding of the input parameters: https://spark.apache.org/docs/latest/api/sql/index.html#window
val df1 = df0.select("window.start", "window.end", "sum_a")
df1.show()
It gives:
+-------------------+-------------------+-----+
|              start|                end|sum_a|
+-------------------+-------------------+-----+
|2022-01-20 01:00:00|2022-01-27 01:00:00|    4|
|2021-12-30 01:00:00|2022-01-06 01:00:00|    2|
|2022-01-06 01:00:00|2022-01-13 01:00:00|    1|
+-------------------+-------------------+-----+
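For completeness, here is a rough PySpark sketch of the same approach (an illustration built on the sample data above, not the original answer's code):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-grouping").getOrCreate()

data = [("2022-01-01 00:00:00", 1),
        ("2022-01-01 00:15:00", 1),
        ("2022-01-08 23:30:00", 1),
        ("2022-01-22 23:30:00", 4)]
df = (spark.createDataFrame(data, ["date", "a"])
      .withColumn("date", F.to_timestamp("date")))

# Group rows into fixed one-week windows and sum "a" per window
(df.groupBy(F.window(F.col("date"), "1 week"))
   .agg(F.sum("a").alias("sum_a"))
   .select("window.start", "window.end", "sum_a")
   .show(truncate=False))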
How to filter on a string numpy array column in pyspark
I have a pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'col':[['a_b', 'bad'],['a_a', 'good'],[]]})
I would like to filter out all the rows for which 'bad' is in the list of col. I have tried to first create a binary column and then filter on this one:
from pyspark.sql import functions as f
foo = foo.withColumn('at_least_one_bad', f.when(f.col("col").array_contains("bad"),f.lit(1)).otherwise(f.lit(0)))
but I get an error:
TypeError: 'Column' object is not callable
Any ideas?
Your syntax is slightly off - try this code below:
import pyspark.sql.functions as f
foo2 = foo.withColumn('at_least_one_bad', f.array_contains('col', 'bad').cast('int'))
foo2.show()
+-----------+----------------+
|        col|at_least_one_bad|
+-----------+----------------+
| [a_b, bad]|               1|
|[a_a, good]|               0|
|         []|               0|
+-----------+----------------+
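Since the original goal was to drop the rows that contain 'bad', the indicator column isn't strictly needed; you can filter on array_contains directly. A sketch, assuming foo is a Spark DataFrame with an array column named col (as in the answer above, not the pandas frame from the question):
import pyspark.sql.functions as f

# Keep only rows whose array does NOT contain 'bad'
foo_clean = foo.filter(~f.array_contains('col', 'bad'))
foo_clean.show()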
How to properly use reduce with a dictionary
I am using a custom function as part of a reduce operation. For the following example I am getting the following message: TypeError: reduce() takes no keyword arguments. I believe this is due to the way I am using the dictionary mapping in the function exposed_column - could you please help me fix this function?
from pyspark.sql import DataFrame, Row
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
from functools import reduce

def process_data(df: DataFrame):
    col_mapping = dict(zip(["name", "age"], ["a", "b"]))
    # Do other things...
    def exposed_column(df: DataFrame, mapping: dict):
        return df.select([col(c).alias(mapping.get(c, c)) for c in df.columns])
    return reduce(exposed_column, sequence=col_mapping, initial=df)

spark = SparkSession.builder.appName("app").getOrCreate()
l = [
    ("Bob", 25, "Spain"),
    ("Marc", 22, "France"),
    ("Steve", 20, "Belgium"),
    ("Donald", 26, "USA"),
]
rdd = spark.sparkContext.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), country=x[2])).toDF()
people.show()
process_data(people).show()
people.show() is looking like this:
+---+-------+------+
|age|country|  name|
+---+-------+------+
| 25|  Spain|   Bob|
| 22| France|  Marc|
| 20|Belgium| Steve|
| 26|    USA|Donald|
+---+-------+------+
And this is the expected output:
+------+---+
|     a|  b|
+------+---+
|   Bob| 25|
|  Marc| 22|
| Steve| 20|
|Donald| 26|
+------+---+
reduce does not take keywords, that's true. Once you remove the keywords, you'll notice a more serious issue though: when you iterate over a dictionary, you're iterating over its keys only. So the function in which you're trying to batch rename the columns won't do what you had in mind. One way to do a batch column rename would be to iterate over the dictionary's items:
from functools import reduce
from typing import Mapping

from pyspark.sql import DataFrame

def rename_columns(frame: DataFrame, mapping: Mapping[str, str]) -> DataFrame:
    return reduce(lambda f, old_new: f.withColumnRenamed(old_new[0], old_new[1]),
                  mapping.items(), frame)
This allows you to pass in a dictionary (note that the recommendation for adding type hints to arguments is to use Mapping, not dict) that maps column names to other names. Fortunately, withColumnRenamed won't complain if you try to rename a column that isn't in the DataFrame, so this is equivalent to your mapping.get(c, c). One thing I'm noticing in your code is that it is not dropping the country column, so that column will still be in your output.
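A quick usage sketch with the people DataFrame from the question; the trailing select drops the country column to match the expected output:
renamed = rename_columns(people, {"name": "a", "age": "b"})
renamed.select("a", "b").show()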