PySpark: set default value in duplicated column name - python

In PySpark, I have a dataframe with 10 columns like this:
id, last_name, first_name, manager, shop, location, manager, place, country, status
I would like to set a default value on only the first manager column. I've tried:
df.withColumn("manager", "x1")
but it gives me an ambiguous reference error, since there are 2 columns with the same name.
Is there a way to do it without renaming the column?

One workaround is to recreate the dataframe with changed column names; it's always better to have unique column names.
>>> df = spark.createDataFrame([('value1','value2'),('value3','value4')],['manager','manager'])
>>> df.show()
+-------+-------+
|manager|manager|
+-------+-------+
| value1| value2|
| value3| value4|
+-------+-------+
>>> df1 = df.toDF('manager1','manager2')
>>> from pyspark.sql.functions import lit
>>> df1.withColumn('manager1',lit('x1')).show()
+--------+--------+
|manager1|manager2|
+--------+--------+
|      x1|  value2|
|      x1|  value4|
+--------+--------+
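If the duplicated names must ultimately stay in place, a minimal sketch of the same idea is to rename positionally, set the value, then rename back (this workflow is an assumption about the intent, not part of the original answer):
from pyspark.sql.functions import lit

original_cols = df.columns                 # ['manager', 'manager']
df1 = df.toDF('manager_1', 'manager_2')    # make the names unique

# set the default on the first manager column only
df1 = df1.withColumn('manager_1', lit('x1'))

# restore the original (duplicated) names if downstream code expects them
df2 = df1.toDF(*original_cols)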

Related

Python or SQL to convert the Data Type - Map to String for a column

I have the below column in a table called 'test'.
How can I get the 'id' and 'value' (e.g. for 'id' = 2, I should get the value '24' and null for the other two ids) from the given table.
The 'data type' for the column 'age' is 'Map' and I'm not sure how to deal with this.
A simple query in Python or SQL or any leads is much appreciated. Also, please advise on the packages to import.
The explode function will "explode" your map into key and value pairs; then you can use them any way you want.
from pyspark.sql import functions as F

(df
    .select('id', F.explode('age').alias('k', 'v'))
    .show()
)
+---+---+----+
| id|  k|   v|
+---+---+----+
|  2|age|  24|
|  3|age|null|
+---+---+----+
You can get it in SQL or Python.
In plain Python, you can try something like:
# 'age' here is the map rendered as a string, e.g. "{age=24}"
agecolumn = age.replace("{", "").replace("}", "").split("=")
if agecolumn[1].strip():
    # do something with the value
    ...
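Since the answer mentions SQL as well, here is a minimal Spark SQL sketch of the same explode approach (the table and column names are assumptions based on the question):
df.createOrReplaceTempView("test")
spark.sql("""
    SELECT id, k, v
    FROM test
    LATERAL VIEW explode(age) exploded AS k, v
""").show()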

pyspark how to add selected columns based on value

For the data structure below, I hope to return a new dataframe based on the condition column. For example, if "condition" == 'A' the new dataframe should have the values from group1, and if "condition" == 'B' it should have the values from group2. The thing is, I do not want to hard-code the column names, as there could be many columns after anothervalue. How could I do this? Many thanks for your help. For example, for this input dataframe,
+---------+---------+---------+
|condition|   group1|   group2|
+---------+---------+---------+
|        A|{SEA, WA}|{PDX, OR}|
|        B| {NY, NY}| {LA, CA}|
+---------+---------+---------+
I'd like to get this output:
+---------+---------+
|condition|    group|
+---------+---------+
|        A|{SEA, WA}|
|        B| {LA, CA}|
+---------+---------+
The above input dataframe was created using this json schema:
jsonStrings = ['{"condition":"A","group1":{"city":"SEA","state":"WA"},"group2":{"city":"PDX","state":"OR"}}','{"condition":"B","group1":{"city":"NY","state":"NY"},"group2":{"city":"LA","state":"CA"}}']
You could simply use when and construct a dynamic chain of conditions, as below:
from pyspark.sql.functions import *

conditions = when(col('condition') == 'A', col('group1')) \
    .when(col('condition') == 'B', col('group2')) \
    .otherwise(None)

df1.select(col('condition'), conditions.alias('group')).show(truncate=False)
Output:
+---------+---------+
|condition|group    |
+---------+---------+
|A        |{SEA, WA}|
|B        |{LA, CA} |
+---------+---------+
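Since the question asks to avoid hard-coding the group columns, one possible generalisation (a sketch; the mapping from condition values to columns is an assumption, adjust it to your real schema) is to build the when chain in a loop:
from pyspark.sql.functions import when, col

# hypothetical mapping from condition value to the column that should be kept
mapping = {'A': 'group1', 'B': 'group2'}

items = list(mapping.items())
first_value, first_column = items[0]
conditions = when(col('condition') == first_value, col(first_column))
for cond_value, col_name in items[1:]:
    conditions = conditions.when(col('condition') == cond_value, col(col_name))
conditions = conditions.otherwise(None)

df1.select(col('condition'), conditions.alias('group')).show(truncate=False)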

Merging two rows into one based on common field

I have a dataframe with the following data:
+----------+------------+-------------+---------------+----------+
|id        |name        |predicted    |actual         |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
|215       |NirPost     |null         |100.10023      |2020-01-10|
|null      |NirPost     |57145        |null           |2020-01-10|
+----------+------------+-------------+---------------+----------+
I want to merge these two rows into one, based on the name. This df is the result of a query which I've restricted to one company and a single day. In the real dataset there are ~70 companies with daily data. I want to rewrite this data into a new table as single rows.
This is the output I'd like:
+----------+------------+-------------+---------------+----------+
|id        |name        |predicted    |actual         |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
|215       |NirPost     |57145        |100.10023      |2020-01-10|
+----------+------------+-------------+---------------+----------+
I've tried this:
df.replace('null','').groupby('name',as_index=False).agg(''.join)
However, this outputs my original df but with NaN instead of null.
`df.dtypes`:
id            float64
name           object
predicted     float64
actual        float64
yyyy_mm_dd     object
dtype: object
How about explicitly passing all the columns to the groupby aggregation with max, so that it eliminates the null values?
import pandas as pd
import numpy as np
data = {'id':[215,np.nan],'name':['nirpost','nirpost'],'predicted':[np.nan,57145],'actual':[100.12,np.nan],'yyyy_mm_dd':['2020-01-10','2020-01-10']}
df = pd.DataFrame(data)
df = df.groupby('name').agg({'id':'max','predicted':'max','actual':'max','yyyy_mm_dd':'max'}).reset_index()
print(df)
Returns:
      name     id  predicted  actual  yyyy_mm_dd
0  nirpost  215.0    57145.0  100.12  2020-01-10
Of course, since you have more data, you should probably consider adding something else to your groupby so as not to delete too many rows, but for the example data you provide, I believe this is a way to solve the issue.
EDIT:
If all the columns end up being named original_column_name_max, you can simply rename them with this:
df.columns = [x[:-4] for x in list(df)]
The list comprehension creates a list that strips the last 4 characters (that is, the _max suffix) from each value in list(df), which is the list of column names. Last, you assign it back with df.columns = ...
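If you would also rather not hard-code the aggregation dict, here is a small sketch of the same groupby built from the remaining columns (this generalisation is an assumption, not part of the original answer):
import pandas as pd
import numpy as np

data = {'id': [215, np.nan], 'name': ['nirpost', 'nirpost'],
        'predicted': [np.nan, 57145], 'actual': [100.12, np.nan],
        'yyyy_mm_dd': ['2020-01-10', '2020-01-10']}
df = pd.DataFrame(data)

# aggregate every column except the grouping key, without listing them by hand
agg_map = {c: 'max' for c in df.columns if c != 'name'}
df = df.groupby('name').agg(agg_map).reset_index()
print(df)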

How to fetch a column name

How can we apply conditions to a dataset in Python and fetch a column name as the output?
Let's say the one below is the dataframe; my question is how we can retrieve a column name (let's say "name") as an output by applying conditions to this dataframe:
       | name     | salary | jobDes
_____________________________________________
store1 | daniel   | 50k    | datascientist
store2 | paladugu | 55k    | datascientist
store3 | theodor  | 53k    | dataEngineer
I want to fetch a column name as a result, like, let's say, "name".
Elaborated:
import pandas as pd

data = {'name': ['daniel', 'paladugu', 'theodor'],
        'jobDes': ['datascientist', 'datascientist', 'dataEngineer']}
df = pd.DataFrame(data)
print(df['name'])  # just that easy
OUTPUT:
0      daniel
1    paladugu
2     theodor
Name: name, dtype: object
Presuming you are using either pandas or dask, you should be able to get a list of column names with
df.columns.
This means that if you wish to know what the first column is called, you can index it as usual with df.columns[0] (indices start at 0 for the first element, as in Python generally), and so on.
If you then wish to access all the data in it, you can use
df[df.columns[0]] or the actual column name, df['name'].
If your data frame is named df, df.columns returns a list of all of the column names.
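As a concrete sketch of fetching a column name by applying a condition (the condition itself is an assumption here, since the question does not specify one), you can filter df.columns:
import pandas as pd

data = {'name': ['daniel', 'paladugu', 'theodor'],
        'salary': ['50k', '55k', '53k'],
        'jobDes': ['datascientist', 'datascientist', 'dataEngineer']}
df = pd.DataFrame(data, index=['store1', 'store2', 'store3'])

# fetch the column names matching a condition, e.g. headers containing 'name'
matching = [c for c in df.columns if 'name' in c.lower()]
print(matching)       # ['name']
print(df.columns[0])  # 'name' - the first column by position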

Apply a function to all cells in Spark DataFrame

I'm trying to convert some Pandas code to Spark for scaling. myfunc is a wrapper to a complex API that takes a string and returns a new string (meaning I can't use vectorized functions).
def myfunc(ds):
    for attribute, value in ds.items():
        value = api_function(attribute, value)
        ds[attribute] = value
    return ds

df = df.apply(myfunc, axis='columns')
myfunc takes a DataSeries, breaks it up into individual cells, calls the API for each cell, and builds a new DataSeries with the same column names. This effectively modifies all cells in the DataFrame.
I'm new to Spark and I want to translate this logic using pyspark. I've converted my pandas DataFrame to Spark:
spark = SparkSession.builder.appName('My app').getOrCreate()
spark_schema = StructType([StructField(c, StringType(), True) for c in df.columns])
spark_df = spark.createDataFrame(df, schema=spark_schema)
This is where I get lost. Do I need a UDF, a pandas_udf? How do I iterate across all cells and return a new string for each using myfunc? spark_df.foreach() doesn't return anything and it doesn't have a map() function.
I can modify myfunc from DataSeries -> DataSeries to string -> string if necessary.
Option 1: Use a UDF on One Column at a Time
The simplest approach would be to rewrite your function to take a string as an argument (so that it is string -> string) and use a UDF. There's a nice example here. This works on one column at a time. So, if your DataFrame has a reasonable number of columns, you can apply the UDF to each column one at a time:
from pyspark.sql.functions import col
new_df = df.select(udf(col("col1")), udf(col("col2")), ...)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   4|
|   2|   5|
|   3|   6|
+----+----+
def plus1_udf(x):
    return x + 1
plus1 = spark.udf.register("plus1", plus1_udf)
new_df = df.select(plus1(col("col1")), plus1(col("col2")))
new_df.show()
+-----------+-----------+
|plus1(col1)|plus1(col2)|
+-----------+-----------+
|          2|          5|
|          3|          6|
|          4|          7|
+-----------+-----------+
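If you want to cover every column without typing each one out, a small variation (a sketch, not part of the original answer) applies the same UDF across df.columns with a list comprehension:
from pyspark.sql.functions import col

new_df = df.select([plus1(col(c)).alias(c) for c in df.columns])
new_df.show()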
Option 2: Map the entire DataFrame at once
map is available for Scala DataFrames, but, at the moment, not in PySpark.
The lower-level RDD API does have a map function in PySpark. So, if you have too many columns to transform one at a time, you could operate on every single cell in the DataFrame like this:
def map_fn(row):
    return [api_function(column, x) for (column, x) in row.asDict().items()]

column_names = df.columns
new_df = df.rdd.map(map_fn).toDF(column_names)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
def map_fn(row):
    return [value + 1 for (_, value) in row.asDict().items()]
columns = df.columns
new_df = df.rdd.map(map_fn).toDF(columns)
new_df.show()
+----+----+
|col1|col2|
+----+----+
|   2|   5|
|   3|   6|
|   4|   7|
+----+----+
Context
The documentation of foreach only gives the example of printing, but we can verify looking at the code that it indeed does not return anything.
You can read about pandas_udf in this post, but it seems that it is most suited to vectorized functions, which, as you pointed out, you can't use because of api_function.
The solution is:
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

udf_func = udf(func, StringType())  # func is the string -> string version of myfunc
for col_name in spark_df.columns:
    spark_df = spark_df.withColumn(col_name, udf_func(lit(col_name), col_name))
return spark_df.toPandas()          # assumes this code lives inside a function
There are 3 key insights that helped me figure this out:
If you use withColumn with the name of an existing column (col_name), Spark "overwrites"/shadows the original column. This essentially gives the appearance of editing the column directly as if it were mutable.
By creating a loop across the original columns and reusing the same DataFrame variable spark_df, I use the same principle to simulate a mutable DataFrame, creating a chain of column-wise transformations, each time "overwriting" a column (per #1 - see below)
Spark UDFs expect all parameters to be Column types, which means it attempts to resolve column values for each parameter. Because api_function's first parameter is a literal value that will be the same for all rows in the vector, you must use the lit() function. Simply passing col_name to the function will attempt to extract the column values for that column. As far as I could tell, passing col_name is equivalent to passing col(col_name).
Assuming 3 columns 'a', 'b' and 'c', unrolling this concept would look like this:
spark_df = spark_df.withColumn('a', udf_func(lit('a'), 'a')) \
    .withColumn('b', udf_func(lit('b'), 'b')) \
    .withColumn('c', udf_func(lit('c'), 'c'))
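Putting the pieces together, here is a self-contained sketch of this approach with api_function stubbed out (the real API is not shown in the question, so the stub and column names are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType, StructType, StructField

def api_function(attribute, value):
    # stand-in for the real API call: string in, string out
    return f"{attribute}:{value}"

spark = SparkSession.builder.appName('My app').getOrCreate()
spark_schema = StructType([StructField(c, StringType(), True) for c in ['a', 'b']])
spark_df = spark.createDataFrame([('1', '4'), ('2', '5')], schema=spark_schema)

udf_func = udf(api_function, StringType())
for col_name in spark_df.columns:
    spark_df = spark_df.withColumn(col_name, udf_func(lit(col_name), col_name))

spark_df.show()
# +---+---+
# |  a|  b|
# +---+---+
# |a:1|b:4|
# |a:2|b:5|
# +---+---+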
