I have a DataFrame which contains the following data:
df.show()
+-----+------+--------+
| id_A| idx_B| B_value|
+-----+------+--------+
| a| 0| 7|
| b| 0| 5|
| b| 2| 2|
+-----+------+--------+
Assuming B has a total of 3 possible indices, I want to create a table that merges all indices and values into a list (or numpy array) that looks like this:
final_df.show()
+-----+----------+
| id_A| B_values|
+-----+----------+
| a| [7, 0, 0]|
| b| [5, 0, 2]|
+-----+----------+
I've managed to go up to this point:
from pyspark.sql import functions as f
temp_df = df.withColumn('B_tuple', f.struct(df['idx_B'], df['B_value']))\
            .groupBy('id_A').agg(f.collect_list('B_tuple').alias('B_tuples'))
temp_df.show()
+-----+-----------------+
| id_A| B_tuples|
+-----+-----------------+
| a| [[0, 7]]|
| b| [[0, 5], [2, 2]]|
+-----+-----------------+
But now I can't run a proper udf function to turn temp_df into final_df.
Is there a simpler way to do so?
If not, what is the proper function I should use to finish the transformation?
So I've found a solution:
from pyspark.sql.types import ArrayType, IntegerType

size = 3  # total number of possible indices for B

def create_vector(tuples_list, size):
    my_list = [0] * size
    for x in tuples_list:
        my_list[x["idx_B"]] = x["B_value"]
    return my_list

create_vector_udf = f.udf(create_vector, ArrayType(IntegerType()))
final_df = temp_df.withColumn('B_values', create_vector_udf(temp_df['B_tuples'], f.lit(size)))\
                  .select(['id_A', 'B_values'])
final_df.show()
+-----+----------+
| id_A| B_values|
+-----+----------+
| a| [7, 0, 0]|
| b| [5, 0, 2]|
+-----+----------+
If you already know the size of the array, you can do this without a udf.
Take advantage of the optional second argument to pivot(), values, which takes in a "List of values that will be translated to columns in the output DataFrame".
So groupBy the id_A column and pivot the DataFrame on the idx_B column. Since not all indices may be present, you can pass in list(range(size)) as the values argument.
import pyspark.sql.functions as f
size = 3
df = df.groupBy("id_A").pivot("idx_B", values=range(size)).agg(f.first("B_value"))
df = df.na.fill(0)
df.show()
#+----+---+---+---+
#|id_A| 0| 1| 2|
#+----+---+---+---+
#| b| 5| 0| 2|
#| a| 7| 0| 0|
#+----+---+---+---+
Indices that are not present in the data will default to null, so we call na.fill(0) to replace them with the desired default value of 0.
Once you have your data in this format, you just need to create an array from the columns:
df.select("id_A", f.array([f.col(str(i)) for i in range(size)]).alias("B_values")).show()
#+----+---------+
#|id_A| B_values|
#+----+---------+
#| b|[5, 0, 2]|
#| a|[7, 0, 0]|
#+----+---------+
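As a further option (a sketch of my own, not from the answer above, assuming Spark 2.4+ where map_from_entries is available and integer idx_B keys), you can skip the pivot entirely and build the array from a map of index to value, starting again from the original df with columns id_A, idx_B, B_value:
from pyspark.sql import functions as f

size = 3
final_df = (
    df.groupBy("id_A")
      # build a map {idx_B -> B_value} per id_A
      .agg(f.map_from_entries(f.collect_list(f.struct("idx_B", "B_value"))).alias("m"))
      # look up every index 0..size-1 in the map, defaulting missing ones to 0
      .select("id_A",
              f.array([f.coalesce(f.col("m")[i], f.lit(0)) for i in range(size)]).alias("B_values"))
)
final_df.show()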
If I have table
|a | b | c|
|"hello"|"world"| 1|
and the variables
start = 2000
end = 2015
How do I, in PySpark, add 16 columns, the first named m2000, the second m2001, and so on up to m2015, all filled with 0, so that the new dataframe is
|a | b | c|m2000 | m2001 | m2002 | ... | m2015|
|"hello"|"world"| 1| 0 | 0 | 0 | ... | 0 |
I have tried the following, but
df = df.select(
    '*',
    *["0".alias(f'm{i}') for i in range(2000, 2016)]
)
df.show()
I get the error
AttributeError: 'str' object has no attribute 'alias'
You can simply use withColumn to add relevant columns.
from pyspark.sql.functions import col,lit
df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])
df.show()
+-----+-----+---+
| a| b| c|
+-----+-----+---+
|hello|world| 1|
+-----+-----+---+
for i in range(2000, 2016):
    df = df.withColumn("m" + str(i), lit(0))
df.show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
You can use a one-liner:
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
Full example:
df = spark.createDataFrame([["hello","world",1]],["a","b","c"])
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2015)])
[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
In pandas, you can do the following:
import pandas as pd
df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)
I am not very familiar with the Spark syntax, but the approach should be near-identical.
What is happening:
The term ['m{}'.format(x) for x in range(2000, 2016)] is a list comprehension that creates the list of desired column names. We assign the value 0 to these columns; since the columns do not yet exist, they are added.
Your code for generating the extra columns is perfectly fine - you just need to wrap the "0" in the lit function, like this:
from pyspark.sql.functions import lit
df.select('*', *[lit("0").alias(f'm{i}') for i in range(2000, 2016)]).show()
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| a| b| c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I would be cautious with calling the withColumn method repeatedly - every new call to it creates a new projection in Spark's query execution plan, which can become very expensive computationally. Using a single select will always be the better approach.
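If you want to see this for yourself, here is a quick sketch (assuming the same df as above) that compares the query plans of the two variants; the loop stacks one Project node per withColumn call in the parsed plan, while the select produces a single one:
from pyspark.sql.functions import lit

df_loop = df
for i in range(2000, 2016):
    df_loop = df_loop.withColumn(f"m{i}", lit(0))
df_loop.explain(extended=True)   # many nested Project nodes before optimization

df_select = df.select('*', *[lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df_select.explain(extended=True)  # a single Project node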
I have a dataframe like this
+---+---------------------+
| id| csv|
+---+---------------------+
| 1|a,b,c\n1,2,3\n2,3,4\n|
| 2|a,b,c\n3,4,5\n4,5,6\n|
| 3|a,b,c\n5,6,7\n6,7,8\n|
+---+---------------------+
and I want to explode the string-type csv column; in fact, I'm only interested in this column. So I'm looking for a method to obtain the following dataframe from the above.
+--+--+--+
| a| b| c|
+--+--+--+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
| 6| 7| 8|
+--+--+--+
Looking at the from_csv documentation, it seems that the input csv string can contain only one row of data, which I found stated more clearly here. So that's not an option.
I guess I could loop over the individual rows of the input dataframe, extract and parse the csv string from each row and then stitch everything together:
rows = df.collect()
for (i, row) in enumerate(rows):
    data = row['csv']
    data = data.split('\\n')
    rdd = spark.sparkContext.parallelize(data)
    df_row = (spark.read
              .option('header', 'true')
              .schema('a int, b int, c int')
              .csv(rdd))
    if i == 0:
        df_new = df_row
    else:
        df_new = df_new.union(df_row)
df_new.show()
But that seems awfully inefficient. Is there a better way to achieve the desired result?
Using the split and from_csv functions along with transform, you can do something like:
from pyspark.sql import functions as F
df = spark.createDataFrame([
    (1, r"a,b,c\n1,2,3\n2,3,4\n"), (2, r"a,b,c\n3,4,5\n4,5,6\n"),
    (3, r"a,b,c\n5,6,7\n6,7,8\n")], ["id", "csv"]
)

df1 = df.withColumn(
    "csv",
    F.transform(
        F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
        lambda x: F.from_csv(x, "a int, b int, c int")
    )
).selectExpr("inline(csv)")
df1.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 1| 2| 3|
# | 2| 3| 4|
# | 3| 4| 5|
# | 4| 5| 6|
# | 5| 6| 7|
# | 6| 7| 8|
# +---+---+---+
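A sketch of an alternative (same assumption as the answer above: the csv column contains literal \n separators and an embedded a,b,c header) that avoids from_csv by exploding the split rows and casting the fields manually:
from pyspark.sql import functions as F

rows = F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n")
df2 = (df.select(F.explode(rows).alias("row"))
         .withColumn("fields", F.split("row", ","))
         .select(F.col("fields")[0].cast("int").alias("a"),
                 F.col("fields")[1].cast("int").alias("b"),
                 F.col("fields")[2].cast("int").alias("c")))
df2.show()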
I have a list of historical values for a device setting and a dataframe with timestamps.
I need to create a new column in my dataframe based on the comparison of the timestamps column in the dataframe and the timestamp of the setting value in my list.
settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]

dataframe = df.withColumn(
    'setting_col',
    when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0])
    .when(col('device_timestamp') <= settings_history[1][1], settings_history[1][0])
)
The number of entries in the settings_history array is dynamic, and I need to find a way to implement something like the above, but I get a syntax error. I have also tried to use a for loop in my withColumn call, but that didn't work either.
My raw dataframe has values like:
device_timestamp
2020-05-21
2020-12-19
2021-01-03
2021-01-11
My goal is to have something like:
device_timestamp setting_col
2020-05-21 1
2020-12-19 1
2021-01-03 2
2021-01-11 2
I'm using Databricks on Azure for my work.
You can use reduce to chain the when conditions together:
from functools import reduce
from pyspark.sql.functions import col, when

settings_history = [[1, '2021-01-01'], [2, '2021-01-12']]

new_col = reduce(
    lambda c, history: c.when(col('device_timestamp') <= history[1], history[0]),
    settings_history[1:],
    when(col('device_timestamp') <= settings_history[0][1], settings_history[0][0])
)

dataframe = df.withColumn('setting_col', new_col)
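With this chain, rows whose device_timestamp is later than the last settings date end up as null; if you would rather have them fall back to the most recent setting (an assumption about the intended behaviour, not something stated in the question), you can append an otherwise:
dataframe = df.withColumn('setting_col', new_col.otherwise(settings_history[-1][0]))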
Something like the when_expression function created below will be useful in this case, where a when condition is built from whatever information you provide in the settings_array list.
import pandas as pd
from pyspark.sql import functions as F

def when_expression(settings_array):
    when_condition = None
    for a, b in settings_array:
        if when_condition is None:
            when_condition = F.when(F.col('device_timestamp') <= a, F.lit(b))
        else:
            when_condition = when_condition.when(F.col('device_timestamp') <= a, F.lit(b))
    return when_condition

settings_array = [
    [2, 3],     # if <= 2 make it 3
    [5, 7],     # if <= 5 make it 7
    [10, 100],  # if <= 10 make it 100
]

df = pd.DataFrame({'device_timestamp': range(10)})
df = spark.createDataFrame(df)
df.show()

when_condition = when_expression(settings_array)
print(when_condition)

df = df.withColumn('setting_col', when_condition)
df.show()
Output:
+----------------+
|device_timestamp|
+----------------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+----------------+
Column<b'CASE WHEN (device_timestamp <= 2) THEN 3 WHEN (device_timestamp <= 5) THEN 7 WHEN (device_timestamp <= 10) THEN 100 END'>
+----------------+-----------+
|device_timestamp|setting_col|
+----------------+-----------+
| 0| 3|
| 1| 3|
| 2| 3|
| 3| 7|
| 4| 7|
| 5| 7|
| 6| 100|
| 7| 100|
| 8| 100|
| 9| 100|
+----------------+-----------+
Consider the following toy PySpark data frame:
+----+-----+
|name|value|
+----+-----+
| A| 1|
| A| 3|
| A| 4|
| A| 5|
| A| 9|
| B| 1|
| B| 3|
| B| 6|
| B| 7|
| B| 8|
+----+-----+
For each row X, I would like to determine how many rows Y with Y.name == X.name have Y.value in range [X.value - 3, X.value + 3]. For rows Y that satisfy those conditions, I would also like to compute the average value:
+----+-----+-------------+--------------+
|name|value|n_vals_in_rng|avg_val_in_rng|
+----+-----+-------------+--------------+
| A| 1| 3| 2.6666667| # (1 + 3 + 4) / 3 = 2.6666667
| A| 3| 4| 3.25| # (1 + 3 + 4 + 5) / 4 = 3.25
| A| 4| 4| 3.25| # ...
| A| 5| 3| 4.0|
| A| 9| 1| 9.0|
| B| 1| 2| 2.0|
| B| 3| 3| 3.3333333|
| B| 6| 4| 6.0|
| B| 7| 3| 7.0|
| B| 8| 3| 7.0|
+----+-----+-------------+--------------+
Can I do this efficiently in PySpark? If so, how? Is it better to use Pandas to solve this problem? Note that my real dataset has ~400k rows and ~8k distinct names in the name column.
Below is my solution so far. It gives the correct result, but takes forever on a large dataset (several hours for a data frame with ~400k rows).
First, I group the data frame by name and I collect all values into a list stored as a new column.
import pyspark.sql.functions as F
import pyspark.sql.types as T
import numpy as np
# df is the data frame defined above
# define a df to be nested
df_to_nest = df.groupBy("name").agg(F.collect_list("value").alias("values"))
# df_to_nest.show():
# +----+---------------+
# |name| values|
# +----+---------------+
# | A|[1, 3, 4, 5, 9]|
# | B|[1, 3, 6, 7, 8]|
# +----+---------------+
I then join this aggregated data frame (df_to_nest) with the original df:
# join with the original data frame
df = df.join(df_to_nest, "name", "left")
# df.show()
# +----+-----+---------------+
# |name|value| values|
# +----+-----+---------------+
# | A| 1|[1, 3, 4, 5, 9]|
# | A| 3|[1, 3, 4, 5, 9]|
# | A| 4|[1, 3, 4, 5, 9]|
# | A| 5|[1, 3, 4, 5, 9]|
# | A| 9|[1, 3, 4, 5, 9]|
# | B| 1|[1, 3, 6, 7, 8]|
# | B| 3|[1, 3, 6, 7, 8]|
# | B| 6|[1, 3, 6, 7, 8]|
# | B| 7|[1, 3, 6, 7, 8]|
# | B| 8|[1, 3, 6, 7, 8]|
# +----+-----+---------------+
Last, I create a user-defined function (UDF) to process each row.
# define a UDF to process each row
def process_row(row):
    vals_in_range = [x for x in row.values if abs(x - row.value) <= 3]
    return (len(vals_in_range),
            float(np.mean(vals_in_range)))

input_schema = F.struct([df[x] for x in df.columns])
output_schema = T.StructType([
    T.StructField("n_vals_in_rng", T.IntegerType(), nullable=True),
    T.StructField("avg_val_in_rng", T.FloatType(), nullable=True),
])
udf = F.udf(process_row, output_schema)

# apply the UDF
df = df.select("name", "value", "values", udf(input_schema).alias("new_cols"))
# unroll the new columns
df = df.select("name", "value", "new_cols.*", "values")
Result:
# df.show():
# +----+-----+-------------+--------------+---------------+
# |name|value|n_vals_in_rng|avg_val_in_rng| values|
# +----+-----+-------------+--------------+---------------+
# | A| 1| 3| 2.6666667|[1, 3, 4, 5, 9]|
# | A| 3| 4| 3.25|[1, 3, 4, 5, 9]|
# | A| 4| 4| 3.25|[1, 3, 4, 5, 9]|
# | A| 5| 3| 4.0|[1, 3, 4, 5, 9]|
# | A| 9| 1| 9.0|[1, 3, 4, 5, 9]|
# | B| 1| 2| 2.0|[1, 3, 6, 7, 8]|
# | B| 3| 3| 3.3333333|[1, 3, 6, 7, 8]|
# | B| 6| 4| 6.0|[1, 3, 6, 7, 8]|
# | B| 7| 3| 7.0|[1, 3, 6, 7, 8]|
# | B| 8| 3| 7.0|[1, 3, 6, 7, 8]|
# +----+-----+-------------+--------------+---------------+
This can be done with window functions and higher-order functions. It should be more efficient than a UDF.
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, expr, size

df.withColumn("value", col("value").cast("int")) \
  .withColumn("values", collect_list("value").over(Window.partitionBy("name"))) \
  .withColumn("in_range", expr("filter(values, v -> abs(v - value) <= 3)")) \
  .withColumn("n_vals_in_rng", size(col("in_range"))) \
  .withColumn("avg_val_in_rng",
              expr("aggregate(in_range, 0, (acc, value) -> value + acc, acc -> acc / n_vals_in_rng)")) \
  .select("name", "value", "n_vals_in_rng", "avg_val_in_rng") \
  .show()
You can read more about filter and aggregate functions here: https://spark.apache.org/docs/latest/api/sql/
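As a further sketch (an alternative of my own, assuming value is numeric as in the example), a range-based window frame can compute both columns directly, without collecting an array at all:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rows of the same name whose value lies in [value - 3, value + 3]
w = Window.partitionBy("name").orderBy("value").rangeBetween(-3, 3)

result = (df.withColumn("n_vals_in_rng", F.count("value").over(w))
            .withColumn("avg_val_in_rng", F.avg("value").over(w)))
result.show()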
While I liked the solution proposed by barteksch, I have found another solution that works with Spark 2.3 (and lower), and is better-suited for my particular case.
Instead of collecting values in range in a column, the idea is to create a row for each pair of values with the same name and with difference of at most 3. This can be done by exploding the data frame via a self-join, and filtering.
Before using this approach, be aware that it can potentially create a huge intermediate dataset, so make sure it makes sense in your case. For my dataset with ~400k rows, the exploded dataset had almost 400 million rows, but only about 9 million were kept after filtering. The whole script took around 10 minutes to run.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assign an index per row, unique over a window defined by `name`
df = df.withColumn("id", F.row_number().over(Window.partitionBy("name").orderBy("value")))

# join the dataset with itself
df_exploded = df.select("name", "id", "value").join(
    df.select("name", F.col("value").alias("other_value")), "name")

# only keep rows where (value, other_value) are within 3 of each other
df_exploded = df_exploded.filter(F.abs(F.col("value") - F.col("other_value")) <= 3)

# aggregate over groups defined by (name, id)
df = df_exploded.groupBy(["name", "id"]).agg(
    # since `value` is constant per group, just take the max to retrieve it
    F.max("value").alias("value"),
    # compute the actual aggregations: count, mean
    F.count("other_value").alias("n_vals_in_rng"),
    F.mean("other_value").alias("avg_val_in_rng")
)
Result:
# df.show():
# +----+---+-----+---------------+------------------+
# |name| id|value|n_vals_in_rng| avg_val_in_rng|
# +----+---+-----+---------------+------------------+
# | A| 1| 1| 3|2.6666666666666665|
# | A| 2| 3| 4| 3.25|
# | A| 3| 4| 4| 3.25|
# | A| 4| 5| 3| 4.0|
# | A| 5| 9| 1| 9.0|
# | B| 1| 1| 2| 2.0|
# | B| 2| 3| 3|3.3333333333333335|
# | B| 3| 6| 4| 6.0|
# | B| 4| 7| 3| 7.0|
# | B| 5| 8| 3| 7.0|
# +----+---+-----+---------------+------------------+
I am using PySpark 2.0.1. I'm trying to group by my data frame and retrieve the values for all the fields from it. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
will give me values for the country and names attributes, and for the names attribute the column header will be collect_list(names). But for my job I have a dataframe with around 15 columns, and I will run a loop that changes the groupby field on each iteration and needs the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other PySpark functions?
I tried this code too
from pyspark.sql import functions as F

fieldnames = data1.schema.names
names1 = list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy
suppose you have a dataframe
from pyspark.sql import functions as f

df = spark.createDataFrame(sc.parallelize([(0,1,2),(0,4,5),(1,7,8),(1,8,7)])).toDF("a","b","c")
df = df.select("a", f.struct(["b","c"]).alias("newcol"))
df.show()
+---+------+
| a|newcol|
+---+------+
| 0| [1,2]|
| 0| [4,5]|
| 1| [7,8]|
| 1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
| a| collected_col|
+---+--------------+
| 0|[[1,2], [4,5]]|
| 1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operations can only be performed on single columns.
After aggregation, you can either collect the result and iterate over it to separate the combined columns, or you can write a UDF to separate the combined columns.
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def foo(x):
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)

df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
| a| b| c|
+---+------+------+
| 0|[1, 4]|[2, 5]|
| 1|[7, 8]|[8, 7]|
+---+------+------+
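On Spark 2.4 or later, a sketch of the same split without a UDF, using the transform higher-order function (self-contained here, rebuilding the grouped frame from the example data):
from pyspark.sql import functions as f

df0 = spark.createDataFrame([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)], ["a", "b", "c"])
df_grouped = (df0.select("a", f.struct(["b", "c"]).alias("newcol"))
                 .groupBy("a").agg(f.collect_list("newcol").alias("collected_col")))

# pull each struct field out of the collected array with transform
result = df_grouped.select(
    "a",
    f.expr("transform(collected_col, x -> x.b)").alias("b"),
    f.expr("transform(collected_col, x -> x.c)").alias("c"))
result.show()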
Actually, we can do it in PySpark 2.2.
First we need to create a constant column ("Temp"), group by that column ("Temp"), and apply agg with the iterable *exprs containing the collect_list expressions.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools

def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 1| 2|
| 0| 4| 5|
| 1| 7| 8|
| 1| 8| 7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
| a| b| c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
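For reference, the output above can be produced with a call like this (assuming df holds the input data shown):
df = groupColumnData(df, ["a", "b", "c"])
df.show()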
In Spark 2.4.4 and Python 3.7 (I guess it is also relevant for previous Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy("a").agg(collect_list(struct(["b","c"])).alias("res")).show()
Result:
+---+-----------------+
| a|res |
+---+-----------------+
| 0|[[1, 2], [4, 5]] |
| 1|[[7, 8], [8, 7]] |
+---+-----------------+
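One caveat worth adding (my note, not part of the original answer): collect_list does not guarantee the order of the collected structs, so if you need a deterministic order, for example by b, you can wrap the result in sort_array:
from pyspark.sql.functions import collect_list, sort_array, struct

df.groupBy("a").agg(sort_array(collect_list(struct(["b", "c"]))).alias("res")).show()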
I just use the concat_ws function and it works perfectly fine.
from pyspark.sql.functions import *

df = spark.createDataFrame([(0,1,2),(0,4,5),(1,7,8),(1,8,7)]).toDF("a","b","c")
df.groupBy('a').agg(collect_list(concat_ws(',','b','c')).alias('r')).show()