pyspark how to add selected columns based on value - python

For the data structure below, I'd like to return a new dataframe based on the condition column. For example, if "condition" == 'A' the new dataframe should take its values from group1, and if "condition" == 'B' it should take them from group2. The catch is that I don't want to hard-code the column names, since there could be many more group columns. How could I do this? Many thanks for your help. For example, given this input dataframe,
+---------+---------+---------+
|condition| group1| group2|
+---------+---------+---------+
| A|{SEA, WA}|{PDX, OR}|
| B| {NY, NY}| {LA, CA}|
+---------+---------+---------+
I'd like to get this output:
+---------+---------+
|condition| group |
+---------+---------+
| A|{SEA, WA}|
| B| {LA, CA}|
+---------+---------+
The above input dataframe was created from these JSON strings:
jsonStrings = [
    '{"condition":"A","group1":{"city":"SEA","state":"WA"},"group2":{"city":"PDX","state":"OR"}}',
    '{"condition":"B","group1":{"city":"NY","state":"NY"},"group2":{"city":"LA","state":"CA"}}'
]
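For reference, a minimal sketch to materialize that input as a dataframe (assuming an active SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each JSON string becomes one row; group1 and group2 are read as struct columns.
df1 = spark.read.json(spark.sparkContext.parallelize(jsonStrings))
df1.show()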

You could simply use when and chain the conditions as below:
from pyspark.sql.functions import *
conditions = when(col('condition') == 'A', col("group1")) \
    .when(col('condition') == 'B', col("group2")).otherwise(None)
df1.select(col('condition'), conditions.alias("group")).show(truncate=False)
Output:
+---------+---------+
|condition|group |
+---------+---------+
|A |{SEA, WA}|
|B |{LA, CA} |
+---------+---------+
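Since the question asks to avoid hard-coding the column names, the same chain can also be built dynamically. A sketch, assuming the group columns follow the group* naming convention and that the condition values 'A', 'B', ... pair with group1, group2, ... in order:
from pyspark.sql.functions import when, col

# Collect the group columns and pair them with the condition values (assumed ordering).
group_cols = [c for c in df1.columns if c.startswith('group')]
mapping = list(zip(['A', 'B'], group_cols))

# Fold the mapping into a single chained when() expression.
cond_value, group_col = mapping[0]
conditions = when(col('condition') == cond_value, col(group_col))
for cond_value, group_col in mapping[1:]:
    conditions = conditions.when(col('condition') == cond_value, col(group_col))

df1.select(col('condition'), conditions.otherwise(None).alias('group')).show(truncate=False)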

Related

Using lookup structure to search pyspark dataframe

I'm new to PySpark and I'm trying to create a generic .where() function that can accept any lookup structure and use it to check whether the column's value is present in that structure.
TYPES = ('TYPE_1', 'TYPE_2', 'TYPE_3')
Something like this:
(
    df.where(
        df.value in TYPES
    )
)
What is the most efficient way of doing this?
You can construct an array column from your lookup structure and use array_contains to filter on whether the column's value appears in that structure.
e.g.
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,)],['column'])
>>> arr = [2,3,4]
>>> df.withColumn('contains', F.array_contains(F.array(*[F.lit(i) for i in arr]), F.col('column'))).show()
+------+--------+
|column|contains|
+------+--------+
| 1| false|
| 2| true|
| 3| true|
+------+--------+
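To get the generic .where() shape the question asks for, the same idea can be wrapped in a small helper. A sketch (where_in is a hypothetical name, not a PySpark function):
from pyspark.sql import functions as F

def where_in(df, column, lookup):
    # Build an array column from the Python lookup structure and keep only the rows
    # whose value appears in it.
    arr = F.array(*[F.lit(v) for v in lookup])
    return df.where(F.array_contains(arr, F.col(column)))

TYPES = ('TYPE_1', 'TYPE_2', 'TYPE_3')
filtered = where_in(df, 'value', TYPES)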

Counting nulls in PySpark dataframes with total rows and columns

I'm trying to write a query to count all the null values in a large dataframe using PySpark. After reading in the dataset, I am doing this:
import pyspark.sql.functions as F
df_agg = df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])
df_agg.coalesce(1).write.option("header", "true").mode("overwrite").csv(path)
This works fine and the df_agg dataframe gives me something like this:
#+--------+--------+--------+
#|Column_1|Column_2|Column_3|
#+--------+--------+--------+
#| 15| 56| 18|
#+--------+--------+--------+
What I want to do is to also add two columns at the end of the dataframe for total_rows and total_columns so I can run some calculations after writing to a .csv file. I know I can get the numbers from the dataframe like this:
total_rows = df.count()
total_columns = len(df.columns)
I want to add those two numbers as columns, resulting in a dataframe like this, and then write it to a .csv like I did before:
#+--------+--------+--------+--------+--------+
#|Column_1|Column_2|Column_3|t_rows |t_cols |
#+--------+--------+--------+--------+--------+
#| 15| 56| 18| 500| 20|
#+--------+--------+--------+--------+--------+
What I'm concerned about is runtime, since counting the nulls takes a bit of time, and then calculating the shape of the dataframe and adding that to the final df for output. Any help is appreciated!
To get the total row count, you can count F.lit(1) inside the same aggregate, and to get the total column count you can use withColumn to add a new literal (lit) column holding len(df.columns).
df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns], F.count(F.lit(1)).alias("t_rows"))\
.withColumn("t_cols", F.lit(len(df.columns))).show()
+-----+----+--------+------+------+
|query|href|position|t_rows|t_cols|
+-----+----+--------+------+------+
| 3| 2| 0| 12| 3|
+-----+----+--------+------+------+
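Putting it together with the write from the question (a sketch, reusing the df and path from the original code), everything stays a single aggregation followed by one write:
import pyspark.sql.functions as F

df_agg = df.agg(
    *[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns],
    F.count(F.lit(1)).alias("t_rows")
).withColumn("t_cols", F.lit(len(df.columns)))

df_agg.coalesce(1).write.option("header", "true").mode("overwrite").csv(path)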

how to find the max value of all columns in a spark dataframe [duplicate]

This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 4 years ago.
I have a spark data frame of around 60M rows. I want to create a single row data frame that will have the max of all individual columns.
I tried out the following options, but each has its own set of disadvantages:
df.select(col_list).describe().filter("summary = 'max'").show()
-- This query doesn't return the string columns, so the original dimension of the data frame gets reduced.
df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()
-- This query works, but it's disadvantageous when I have around 700 odd columns.
Can someone suggest a better syntax?
The following code will work irrespective of how many columns there are or what mix of datatypes they hold.
Note: the OP suggested in her comments that, for string columns, the first non-null value should be taken while aggregating.
# Import relevant functions
from pyspark.sql.functions import max, first, col
# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 10| 5| null| 50|
| Bob| 15| 15|Simon| 10|
| Jack| 5| 1| Timo| 3|
+-----+----+----+-----+----+
# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
['col1', 'col2', 'col3', 'col4', 'col5']
# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
['col1', 'col4']
# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
['col2', 'col3', 'col5']
Read about first() and its ignorenulls argument in the PySpark documentation.
# Aggregating both string and non-string columns
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])
# Restore the original column order
df = df[[seq_of_columns]]
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 15| 15|Simon| 50|
+-----+----+----+-----+----+
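For reference, the same idea can be collapsed into a single pass driven by df.dtypes. This is a sketch over the original, pre-aggregation df, not code from the answer above:
from pyspark.sql.functions import col, first, max as max_

# max() for non-string columns, first non-null value for string columns;
# iterating df.dtypes keeps the original column order.
aggs = [
    first(col(c), ignorenulls=True).alias(c) if t == 'string' else max_(col(c)).alias(c)
    for c, t in df.dtypes
]
df.select(*aggs).show()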

Apply a function to all cells in Spark DataFrame

I'm trying to convert some Pandas code to Spark for scaling. myfunc is a wrapper to a complex API that takes a string and returns a new string (meaning I can't use vectorized functions).
def myfunc(ds):
    for attribute, value in ds.items():
        value = api_function(attribute, value)
        ds[attribute] = value
    return ds

df = df.apply(myfunc, axis='columns')
myfunc takes a DataSeries, breaks it up into individual cells, calls the API for each cell, and builds a new DataSeries with the same column names. This effectively modifies all cells in the DataFrame.
I'm new to Spark and I want to translate this logic using pyspark. I've converted my pandas DataFrame to Spark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('My app').getOrCreate()
spark_schema = StructType([StructField(c, StringType(), True) for c in df.columns])
spark_df = spark.createDataFrame(df, schema=spark_schema)
This is where I get lost. Do I need a UDF, a pandas_udf? How do I iterate across all cells and return a new string for each using myfunc? spark_df.foreach() doesn't return anything and it doesn't have a map() function.
I can modify myfunc from DataSeries -> DataSeries to string -> string if necessary.
Option 1: Use a UDF on One Column at a Time
The simplest approach would be to rewrite your function to take a string as an argument (so that it is string -> string) and use a UDF; there's an example below. This works on one column at a time. So, if your DataFrame has a reasonable number of columns, you can apply the UDF to each column one at a time:
from pyspark.sql.functions import col
new_df = df.select(udf(col("col1")), udf(col("col2")), ...)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 2| 5|
| 3| 6|
+----+----+
def plus1_udf(x):
    return x + 1
plus1 = spark.udf.register("plus1", plus1_udf)
new_df = df.select(plus1(col("col1")), plus1(col("col2")))
new_df.show()
+-----------+-----------+
|plus1(col1)|plus1(col2)|
+-----------+-----------+
| 2| 5|
| 3| 6|
| 4| 7|
+-----------+-----------+
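One caveat: a UDF registered without an explicit returnType defaults to StringType, so the plus1(...) columns above come back as strings. If the numeric type matters, a return type can be supplied, as in this sketch:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Same transformation, but with an explicit IntegerType return type.
plus1_typed = udf(lambda x: x + 1, IntegerType())
new_df = df.select(plus1_typed(col("col1")).alias("col1"),
                   plus1_typed(col("col2")).alias("col2"))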
Option 2: Map the entire DataFrame at once
map is available for Scala DataFrames, but, at the moment, not in PySpark.
The lower-level RDD API does have a map function in PySpark. So, if you have too many columns to transform one at a time, you could operate on every single cell in the DataFrame like this:
def map_fn(row):
    return [api_function(column, x) for (column, x) in row.asDict().items()]

column_names = df.columns
new_df = df.rdd.map(map_fn).toDF(column_names)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
def map_fn(row):
    return [value + 1 for (_, value) in row.asDict().items()]

columns = df.columns
new_df = df.rdd.map(map_fn).toDF(columns)
new_df.show()
+----+----+
|col1|col2|
+----+----+
| 2| 5|
| 3| 6|
| 4| 7|
+----+----+
Context
The documentation of foreach only gives the example of printing, but we can verify by looking at the code that it indeed does not return anything.
You can read more about pandas_udf elsewhere, but it seems that it is best suited to vectorized functions, which, as you pointed out, you can't use because of api_function.
The solution is:
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

# func is the string -> string version of myfunc: func(column_name, value)
udf_func = udf(func, StringType())
for col_name in spark_df.columns:
    spark_df = spark_df.withColumn(col_name, udf_func(lit(col_name), col_name))
return spark_df.toPandas()
There are 3 key insights that helped me figure this out:
If you use withColumn with the name of an existing column (col_name), Spark "overwrites"/shadows the original column. This essentially gives the appearance of editing the column directly as if it were mutable.
By creating a loop across the original columns and reusing the same DataFrame variable spark_df, I use the same principle to simulate a mutable DataFrame, creating a chain of column-wise transformations, each time "overwriting" a column (per #1 - see below)
Spark UDFs expect all parameters to be Column types, which means it attempts to resolve column values for each parameter. Because api_function's first parameter is a literal value that will be the same for all rows in the vector, you must use the lit() function. Simply passing col_name to the function will attempt to extract the column values for that column. As far as I could tell, passing col_name is equivalent to passing col(col_name).
Assuming 3 columns 'a', 'b' and 'c', unrolling this concept would look like this:
spark_df = spark_df.withColumn('a', udf_func(lit('a'), 'a')) \
    .withColumn('b', udf_func(lit('b'), 'b')) \
    .withColumn('c', udf_func(lit('c'), 'c'))
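As a sanity check, here is a minimal end-to-end sketch of that pattern with a stand-in api_function (the real API from the question is assumed, not shown):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('udf per column sketch').getOrCreate()

def api_function(attribute, value):
    # Stand-in for the real API call: tag each cell with its column name.
    return f"{attribute}:{value}"

udf_func = udf(api_function, StringType())
spark_df = spark.createDataFrame([('1', '4'), ('2', '5')], ['a', 'b'])

for col_name in spark_df.columns:
    spark_df = spark_df.withColumn(col_name, udf_func(lit(col_name), col_name))

spark_df.show()  # every cell becomes '<column>:<value>', e.g. 'a:1'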

PYSPARK set default value in duplicated column name

In PySpark, I have a dataframe with 10 columns like this:
id, last_name, first_name, manager, shop, location, manager, place, country, status
I would like to set a default value on only the first manager column. I've tried:
df.withColumn("manager", "x1")
but it gives me an ambiguous reference error, as there are 2 columns with the same name.
Is there a way to do it without renaming the column?
One workaround is to recreate the dataframe with new column names; it's always better to have unique column names anyway.
>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')
>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
| x1| value2|
| x1| value4|
+--------+--------+
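If the final output really must keep the duplicated manager header, one follow-up (a sketch using the same df1 as above) is to rename back with toDF after setting the value:
>>> df1.withColumn('manager1', lit('x1')).toDF('manager', 'manager').show()
This prints the same two rows as above, but under the original duplicated manager headers.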
