I need to collapse all the values of the dataframe's columns into a single value per column. The columns stay intact, but each one is reduced to the sum of its values.
For this purpose I intend to utilize this function:
def sum_col(data, col):
    return data.select(f.sum(col)).collect()[0][0]
I was now thinking of doing something like this:
data = data.map(lambda current_col: sum_col(data, current_col))
Is this doable, or do I need another way to merge all the values of the columns?
You can achieve this with the sum function:
import pyspark.sql.functions as f
df.select(*[f.sum(cols).alias(cols) for cols in df.columns]).show()
+----+---+---+
|val1| x| y|
+----+---+---+
| 36| 29|159|
+----+---+---+
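If you then want the totals back on the Python side, as in your sum_col function, here is a minimal sketch (the totals name is mine; it assumes the same df and the f alias from above):
# collect the single summed row and turn it into a dict of column -> total
totals = df.select(*[f.sum(c).alias(c) for c in df.columns]).collect()[0].asDict()
# e.g. {'val1': 36, 'x': 29, 'y': 159}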
To sum all your columns into a new column, you can use a list comprehension together with Python's built-in sum function:
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
tst= sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)],schema=['val1','x','y'])
tst_sum= tst.withColumn("sum_col",sum([tst[coln] for coln in tst.columns]))
results:
tst_sum.show()
+----+---+---+-------+
|val1| x| y|sum_col|
+----+---+---+-------+
| 10| 7| 14| 31|
| 5| 1| 4| 10|
| 9| 8| 10| 27|
| 2| 6| 90| 98|
| 7| 2| 30| 39|
| 3| 5| 11| 19|
+----+---+---+-------+
Note: if you had imported the sum function from PySpark as from pyspark.sql.functions import sum, it shadows Python's built-in sum, so you have to import it under some other name, e.g. from pyspark.sql.functions import sum as sum_pyspark.
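For illustration, a minimal sketch of that renamed import, reusing the tst DataFrame from above (the tst_rowwise and tst_totals names are mine):
from pyspark.sql.functions import sum as sum_pyspark  # keeps Python's built-in sum usable

tst_rowwise = tst.withColumn("sum_col", sum([tst[c] for c in tst.columns]))  # built-in sum across columns
tst_totals = tst.select(*[sum_pyspark(c).alias(c) for c in tst.columns])     # PySpark per-column totals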
Related
I have a dataframe like this
+---+---------------------+
| id| csv|
+---+---------------------+
| 1|a,b,c\n1,2,3\n2,3,4\n|
| 2|a,b,c\n3,4,5\n4,5,6\n|
| 3|a,b,c\n5,6,7\n6,7,8\n|
+---+---------------------+
and I want to explode the string type csv column, in fact I'm only interested in this column. So I'm looking for a method to obtain the following dataframe from the above.
+--+--+--+
| a| b| c|
+--+--+--+
| 1| 2| 3|
| 2| 3| 4|
| 3| 4| 5|
| 4| 5| 6|
| 5| 6| 7|
| 6| 7| 8|
+--+--+--+
Looking at the from_csv documentation, it seems that the input csv string can contain only one row of data, which I found stated more clearly here. So that's not an option.
I guess I could loop over the individual rows of the input dataframe, extract and parse the csv string from each row and then stitch everything together:
rows = df.collect()
for (i, row) in enumerate(rows):
    data = row['csv']
    data = data.split('\\n')
    rdd = spark.sparkContext.parallelize(data)
    df_row = (spark.read
              .option('header', 'true')
              .schema('a int, b int, c int')
              .csv(rdd))
    if i == 0:
        df_new = df_row
    else:
        df_new = df_new.union(df_row)
df_new.show()
But that seems awfully inefficient. Is there a better way to achieve the desired result?
Using the split + from_csv functions along with transform, you can do something like this:
from pyspark.sql import functions as F
df = spark.createDataFrame([
(1, r"a,b,c\n1,2,3\n2,3,4\n"), (2, r"a,b,c\n3,4,5\n4,5,6\n"),
(3, r"a,b,c\n5,6,7\n6,7,8\n")], ["id", "csv"]
)
df1 = df.withColumn(
"csv",
F.transform(
F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n"),
lambda x: F.from_csv(x, "a int, b int, c int")
)
).selectExpr("inline(csv)")
df1.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 1| 2| 3|
# | 2| 3| 4|
# | 3| 4| 5|
# | 4| 5| 6|
# | 5| 6| 7|
# | 6| 7| 8|
# +---+---+---+
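If transform is not available in your Spark version, a rough equivalent sketch of my own (not part of the original answer) is to explode the split lines and parse each one with from_csv:
from pyspark.sql import functions as F

df2 = (df
       .withColumn("line", F.explode(F.split(F.regexp_replace("csv", r"^a,b,c\\n|\\n$", ""), r"\\n")))
       .withColumn("row", F.from_csv("line", "a int, b int, c int"))  # parse each line into a struct
       .select("row.*"))
df2.show()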
I have a DataFrame like this:
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType, StringType
#import numpy as np
data = [(("ID1", 3, 5,5)), (("ID2", 4, 5,6)), (("ID3", 3, 3,3))]
df = sqlContext.createDataFrame(data, ["ID", "colA", "colB","colC"])
df.show()
cols = df.columns
maxcol = f.udf(lambda row: cols[row.index(max(row)) +1], StringType())
maxDF = df.withColumn("Max_col", maxcol(f.struct([df[x] for x in df.columns[1:]])))
maxDF.show(truncate=False)
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5| 5|
|ID2| 4| 5| 6|
|ID3| 3| 3| 3|
+---+----+----+----+
+---+----+----+----+-------+
|ID |colA|colB|colC|Max_col|
+---+----+----+----+-------+
|ID1|3 |5 |5 |colB |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA |
+---+----+----+----+-------+
I want to return all the column names that hold the max value when there are ties. How can I achieve this in PySpark, so that the result looks like this:
+---+----+----+----+--------------+
|ID |colA|colB|colC|Max_col |
+---+----+----+----+--------------+
|ID1|3 |5 |5 |colB,colC |
|ID2|4 |5 |6 |colC |
|ID3|3 |3 |3 |colA,ColB,ColC|
+---+----+----+----+--------------+
Thank you
This seems like a job for a udf.
Iterate over the columns you have (pass them as input to the udf), perform the Python operations to get the max and check which columns hold that same value, then return a list (i.e. an array) of those column names.
@udf(returnType=ArrayType(StringType()))
def collect_same_max():
...
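A minimal sketch of that idea, assuming the value columns are colA, colB and colC as in the question (the value_cols name is mine):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

value_cols = ["colA", "colB", "colC"]

@F.udf(returnType=ArrayType(StringType()))
def collect_same_max(*values):
    # return every column name whose value equals the row maximum
    mx = max(values)
    return [name for name, v in zip(value_cols, values) if v == mx]

df.withColumn("Max_col", collect_same_max(*[F.col(c) for c in value_cols])).show(truncate=False)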
Or, if it is doable, you can try to use the transform function
from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.transform.html
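As a udf-free variant, here is my own sketch built on the higher-order functions (it assumes Spark 3.1+ for F.filter and F.transform, and value_cols as above):
from pyspark.sql import functions as F

value_cols = ["colA", "colB", "colC"]
# array of (name, value) structs, the per-row maximum, then keep only the names tied with that maximum
pairs = F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("val")) for c in value_cols])
row_max = F.array_max(F.array(*[F.col(c) for c in value_cols]))

df.withColumn(
    "Max_col",
    F.transform(F.filter(pairs, lambda s: s["val"] == row_max), lambda s: s["name"])
).show(truncate=False)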
Data Engineering, I would say. See code and logic below
from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.withColumn('x', F.array(*[F.struct(F.lit(x).alias('col'), F.col(x).alias('num')) for x in df.columns if x != 'ID']))  # create an array of (column name, value) structs
       .selectExpr('ID', 'colA', 'colB', 'colC', 'inline(x)')  # explode the struct column
       .withColumn('z', F.first('num').over(Window.partitionBy('ID').orderBy(F.desc('num'))))  # column with the max value for each id
       .where(F.col('num') == F.col('z'))  # isolate the max values in each id
       .groupBy(['ID', 'colA', 'colB', 'colC']).agg(F.collect_list('col').alias('Max_col'))  # combine the max columns into a list
       )
new.show()
+---+----+----+----+------------------+
| ID|colA|colB|colC| Max_col|
+---+----+----+----+------------------+
|ID1| 3| 5| 5| [colB, colC]|
|ID2| 4| 5| 6| [colC]|
|ID3| 3| 3| 3|[colA, colB, colC]|
+---+----+----+----+------------------+
I wanted to apply a function to some columns of a Spark DataFrame using two different methods, fn and fn1. Here is how I did that:
from pyspark.sql.functions import udf
from pyspark.sql.types import DecimalType

def fn(x):
    return x * 2

udf_1 = udf(fn, DecimalType())

def fn1(x):
    return x * 3

udf_2 = udf(fn1, DecimalType())

def process_df1(df, col_name):
    df1 = df.withColumn(col_name, udf_1(col_name))
    return df1

def process_df2(df, col_name):
    df2 = df.withColumn(col_name, udf_2(col_name))
    return df2
For a single column it works fine. But now I get a list of dicts containing information on various columns:
cols_info = [{'col_name': 'metric_1', 'process': 'True', 'method':'simple'}, {'col_name': 'metric_2', 'process': 'False', 'method':'hash'}]
How should I parse the cols_info list and apply the above logic only to the columns that have process: True, using the required method for each?
The first thing that comes to mind is to filter out columns with process:False
list(filter(lambda col_info: col_info['process'] == 'True', cols_info))
But I'm still missing a more generic approach here.
The selectExpr function will be useful here:
import pyspark.sql.functions as F

# Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(1,3,4,1),(1,4,5,5),(1,6,7,8),(2,1,9,2),(2,2,9,9)], schema=['col1','col2','col3','col4'])

def fn(x):
    return x * 2

def fn1(x):
    return x * 3

sqlContext.udf.register("fn1", fn)
sqlContext.udf.register("fn2", fn1)

cols_info = [{'col_name': 'col1', 'encrypt': False}, {'col_name': 'col2', 'encrypt': True, 'method': 'fn1'}, {'col_name': 'col3', 'encrypt': True, 'method': 'fn2'}]
# determine which columns need one of the registered methods applied
modified_columns = [x['col_name'] for x in cols_info if x['encrypt']]
# select which columns have to be retained as-is
columns_retain = list(set(tst.columns) - set(modified_columns))
# build the SQL expressions and apply them
expr = columns_retain + [(x['method'] + '(' + x['col_name'] + ') as ' + x['col_name']) for x in cols_info if x['encrypt']]
tst_res = tst.selectExpr(*expr)
The results will be:
+----+----+----+----+
|col4|col1|col2|col3|
+----+----+----+----+
| 4| 1| 4| 9|
| 1| 1| 6| 12|
| 5| 1| 8| 15|
| 8| 1| 12| 21|
| 2| 2| 2| 27|
| 9| 2| 4| 27|
+----+----+----+----+
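The question's cols_info uses process and method keys rather than encrypt, so a hedged adaptation of the same selectExpr idea to that structure could look like this (it assumes UDFs named after the method values, e.g. simple and hash, have been registered):
cols_info = [{'col_name': 'metric_1', 'process': 'True', 'method': 'simple'},
             {'col_name': 'metric_2', 'process': 'False', 'method': 'hash'}]

to_process = [c for c in cols_info if c['process'] == 'True']
processed_names = {c['col_name'] for c in to_process}
columns_retain = [c for c in df.columns if c not in processed_names]
expr = columns_retain + ["{m}({c}) as {c}".format(m=c['method'], c=c['col_name']) for c in to_process]
df_res = df.selectExpr(*expr)  # 'simple' must be registered, e.g. spark.udf.register('simple', fn)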
I have the dataframe below and I want to update its rows dynamically with some values.
input_frame.show()
+----------+----------+---------+
|student_id|name |timestamp|
+----------+----------+---------+
| s1|testuser | t1|
| s1|sampleuser| t2|
| s2|test123 | t1|
| s2|sample123 | t2|
+----------+----------+---------+
input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'),sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and then apply a filter with partial matches on that column. But concatenating to a null column results in a null column again. How can we do this?
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")

# This won't work: concat returns null as soon as one input is null
df = df.withColumn("concat", concat(df.a, df.b))
# This won't work either: casting to string does not remove the nulls
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))
# Do it like this instead
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string doesn't work as you might wish: concat will still return NULL for the whole value whenever any of its inputs is null.
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat if you're willing to suffer the verbosity of it, like this:
df.withColumn("concat_custom", concat(
when(df.a.isNull(), lit('_')).otherwise(df.a),
when(df.b.isNull(), lit('_')).otherwise(df.b))
)
To get, e.g.:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
You can use the coalesce function, which returns the first of its arguments that is not null, and provide a literal as the second argument, which will be used whenever the column has a null value.
df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
You can fill null values with empty strings:
import pyspark.sql.functions as f
from pyspark.sql.types import *
data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
Is that what you were looking for?
I have a data frame in PySpark like the one below. I want to count values in two columns based on some lists and populate new columns for each list.
df.show()
+---+-------------+--------------+
| id| device| device_model|
+---+-------------+--------------+
| 3| mac pro| mac|
| 1| iphone| iphone5|
| 1|android phone| android|
| 1| windows pc| windows|
| 1| spy camera| spy camera|
| 2| | camera|
| 2| iphone| apple iphone|
| 3| spy camera| |
| 3| cctv| cctv|
+---+-------------+--------------+
lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
I want to count the device and device_model values for each id and pivot the counts into a new data frame.
I want to count the values in both the device_model and device columns for each id that match the strings in the lists.
For example: phone_list contains the string iphone, so it should count both iphone and iphone5.
The result I want
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 4| 2| 2|
| 2| 2|null| 1|
| 3| null| 2| 3|
+---+------+----+--------+
I have done like below
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
Using the above I can only handle the device column, and only if the string matches exactly. I am unable to figure out how to do it for both columns and when the value merely contains the string.
How can I achieve the result I want?
Here is a working solution. I have used a udf function for checking the strings and calculating the sum; you could use inbuilt functions instead where possible. (Comments are provided in the code as explanation.)
# creating a dictionary of the lists, keyed by the new column names
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

# udf function for checking the strings and summing the matches
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    total = 0
    for x in columnLists[name]:
        if x in device:
            total += 1
        if x in deviceModel:
            total += 1
    return total

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

# populating the sum returned from the udf function into the respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

# finally grouping and summing
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
which should give you
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
| 3| 0| 2| 3|
| 1| 4| 2| 2|
| 2| 2| 0| 1|
+---+-----+---+--------+
The aggregation part can be generalized just like the rest. Improvements and modifications are all in your hands. :)
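For illustration, a minimal sketch of that generalization, assuming the same columnLists dictionary and F alias as above (the agg_exprs name is mine):
# build one sum aggregation per list, driven by the columnLists keys
agg_exprs = [F.sum(name).alias(name) for name in columnLists]
df.groupBy('id').agg(*agg_exprs).show()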