So I have a dataframe df like so,
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
+---+-----+
I also have a dict like so:
{"COL_B":"abc","COL_C":""}
Now, what I have to do is update df so that each key in the dict becomes a new column name and the key's value becomes the constant value of that column.
Expected df should be like:
+---+-----+-----+-----+
| ID|COL_A|COL_B|COL_C|
+---+-----+-----+-----+
| 1| 123| abc| |
+---+-----+-----+-----+
Here is my Python (pandas) code, which works fine:
import pandas as pd
input_data = pd.read_csv(inputFilePath, dtype=str)
for key, value in mapRow.items():  # mapRow is the dict
    if value is None:
        input_data[key] = ""
    else:
        input_data[key] = value
I'm now migrating this code to PySpark and would like to know how to do the same thing there.
Thanks for the help.
To combine RDDs, we can use zip or join. Below is an example using zip: zip pairs up the rows of the two RDDs, and map flattens each pair into a single list.
from pyspark.sql import Row
rdd_1 = sc.parallelize([Row(ID=1,COL_A=2)])
rdd_2 = sc.parallelize([Row(COL_B="abc",COL_C=" ")])
result_rdd = rdd_1.zip(rdd_2).map(lambda x: [j for i in x for j in i])
NOTE: I don't have PySpark available right now, so this isn't tested.
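With the DataFrame API, a constant column is usually added with withColumn and lit rather than by zipping RDDs. The following is a minimal sketch of that idea, assuming the dict from the question. The helper names normalise_constants and add_constant_columns are made up for illustration, and the Spark part has not been run here.

```python
def normalise_constants(mapping):
    """Replace None values with "" so every new column gets a string constant."""
    return {name: "" if value is None else value for name, value in mapping.items()}


def add_constant_columns(df, mapping):
    """Return df with one constant-valued column per key in mapping."""
    # Imported lazily so normalise_constants can be used without Spark installed.
    from pyspark.sql.functions import lit

    for name, value in normalise_constants(mapping).items():
        df = df.withColumn(name, lit(value))
    return df


map_row = {"COL_B": "abc", "COL_C": None}
print(normalise_constants(map_row))
```

Calling add_constant_columns(df, map_row) on the DataFrame from the question should then give the ID, COL_A, COL_B and COL_C columns shown in the expected output.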
I have a pyspark dataframe with multiple columns. For example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| a| p|
|Jack| b| q|
|Bell| c| r|
|Bell| d| s|
+----+--------+--------+
Now I want to group by "name" and concatenate the values in every row for both columns.
I know how to do it, but if there are thousands of columns my code becomes very ugly.
Here is my solution.
import pyspark.sql.functions as f
t = score_card.groupby("name").agg(
    f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
    f.concat_ws("", f.collect_list("letters2")).alias("letters2")
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| ab| pq|
|Bell| cd| rs|
+----+--------+--------+
But my main concern is about these two lines of code
f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
f.concat_ws("", f.collect_list("letters2")).alias("letters2")
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.
Yes, you can use a list comprehension inside agg and iterate over df.columns. Let me know if it helps.
from pyspark.sql import functions as F
df.show()
# +--------+--------+----+
# |letters1|letters2|name|
# +--------+--------+----+
# | a| p|Jack|
# | b| q|Jack|
# | c| r|Bell|
# | d| s|Bell|
# +--------+--------+----+
df.groupBy("name").agg( *[F.array_join(F.collect_list(column), "").alias(column) for column in df.columns if column !='name' ]).show()
# +----+--------+--------+
# |name|letters1|letters2|
# +----+--------+--------+
# |Bell| cd| rs|
# |Jack| ab| pq|
# +----+--------+--------+
I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|u'['2','4','713']| 10|
| u' ['12','245']| 20|
| u'['101','12',]| 30|
+-----------------+---+
How can I convert the above dataframe so that each element in the list is a float and sits inside a proper list?
I tried the below one :
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list',string_list_to_list(col("list")))
df2 = amp_conversion(df)
But the data remains unchanged.
I don't want to convert the dataframe to pandas or use collect, as that is memory intensive.
If possible, please suggest an optimal solution. I am using PySpark.
That's because you forgot to declare the return type:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
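To see what the pattern does before running it on a cluster, the same expression can be tried with Python's re module. This is only a sketch: Spark uses Java regular expressions, but this particular pattern behaves the same way in both. strip_and_split is a hypothetical helper name, not part of any library.

```python
import re

# Same pattern as in regexp_replace, written as a raw string.
PATTERN = r"^u'\[|\]$"


def strip_and_split(text):
    """Drop the leading u'[ and the trailing ], split on commas, cast to int."""
    inner = re.sub(PATTERN, "", text)
    return [int(part) for part in inner.split(",")]


print(strip_and_split("u'[23,4,77,890,4]"))  # [23, 4, 77, 890, 4]
```

Note that the sample string here holds bare numbers. Elements that are themselves quoted, as in the question, would need their quotes removed as well before the cast.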
I get the correct result in Python 3 with a small change to the definition of df_amp_conversion: you didn't return df_modelamp. This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list',string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
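The question asked for floats, while a UDF without a declared type returns a string column. One way to get there, sketched under the assumption that every element parses as a number, is to parse with ast.literal_eval, convert each element with float, and declare the UDF's return type. parse_float_list and make_float_list_udf are illustrative names, and the UDF wrapper has not been run here.

```python
import ast


def parse_float_list(text):
    """Turn a string such as "['2','4','713']" into [2.0, 4.0, 713.0]."""
    return [float(item) for item in ast.literal_eval(text)]


def make_float_list_udf():
    """Wrap parse_float_list as a Spark UDF that returns array<float>."""
    # Imported lazily so parse_float_list can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType

    return udf(parse_float_list, ArrayType(FloatType()))


print(parse_float_list("['2','4','713']"))  # [2.0, 4.0, 713.0]
```

It would be applied along the lines of df.withColumn('float_list', make_float_list_udf()(col('list'))).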
I want to know how to map values in a specific column in a dataframe.
I have a dataframe which looks like:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
+-----+-------+
| col1| col2|
+-----+-------+
|india| japan|
| usa|uruguay|
+-----+-------+
I have a dictionary from where I want to map the values.
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')])
The output I want is:
+-----+-------+--------+--------+
| col1| col2|col1_map|col2_map|
+-----+-------+--------+--------+
|india| japan| ind| jpn|
| usa|uruguay| us| urg|
+-----+-------+--------+--------+
I have tried using the lookup function but it doesn't work. It throws error SPARK-5063. Following is my approach which failed:
def map_val(x):
    return dicts.lookup(x)[0]
myfun = udf(lambda x: map_val(x), StringType())
df = df.withColumn('col1_map', myfun('col1')) # doesn't work
df = df.withColumn('col2_map', myfun('col2')) # doesn't work
I think the easier way is just to use a simple dictionary and df.withColumn.
from itertools import chain
from pyspark.sql.functions import create_map, lit
simple_dict = {'india':'ind', 'usa':'us', 'japan':'jpn', 'uruguay':'urg'}
mapping_expr = create_map([lit(x) for x in chain(*simple_dict.items())])
df = df.withColumn('col1_map', mapping_expr[df['col1']])\
.withColumn('col2_map', mapping_expr[df['col2']])
df.show(truncate=False)
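create_map expects its arguments as an alternating sequence key1, value1, key2, value2, and so on, which is what chain(*simple_dict.items()) produces. A plain-Python look at that flattening step, with no Spark needed:

```python
from itertools import chain

simple_dict = {'india': 'ind', 'usa': 'us', 'japan': 'jpn', 'uruguay': 'urg'}

# items() yields (key, value) pairs; chain(*pairs) flattens them into one sequence.
flat = list(chain(*simple_dict.items()))
print(flat)
# ['india', 'ind', 'usa', 'us', 'japan', 'jpn', 'uruguay', 'urg']
```

Each element of this list is then wrapped in lit() so that create_map receives column expressions rather than plain Python strings.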
udf way
I would suggest changing the list of tuples to a dict and broadcasting it for use in the udf:
dicts = sc.broadcast(dict([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]))
from pyspark.sql import functions as f
from pyspark.sql import types as t
def newCols(x):
    return dicts.value[x]
callnewColsUdf = f.udf(newCols, t.StringType())
df.withColumn('col1_map', callnewColsUdf(f.col('col1')))\
    .withColumn('col2_map', callnewColsUdf(f.col('col2')))\
    .show(truncate=False)
which should give you
+-----+-------+--------+--------+
|col1 |col2 |col1_map|col2_map|
+-----+-------+--------+--------+
|india|japan |ind |jpn |
|usa |uruguay|us |urg |
+-----+-------+--------+--------+
join way (slower than udf way)
All you have to do is convert the dicts RDD to a dataframe as well and use two joins with aliases, as follows:
df = sc.parallelize([('india','japan'),('usa','uruguay')]).toDF(['col1','col2'])
dicts = sc.parallelize([('india','ind'), ('usa','us'),('japan','jpn'),('uruguay','urg')]).toDF(['key', 'value'])
from pyspark.sql import functions as f
df.join(dicts, df['col1'] == dicts['key'], 'inner')\
.select(f.col('col1'), f.col('col2'), f.col('value').alias('col1_map'))\
.join(dicts, df['col2'] == dicts['key'], 'inner') \
.select(f.col('col1'), f.col('col2'), f.col('col1_map'), f.col('value').alias('col2_map'))\
.show(truncate=False)
which should give you the same result
Similar to Ali AzG, but pulling it all out into a handy little method if anyone finds it useful
from itertools import chain
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from typing import Dict
def map_column_values(df:DataFrame, map_dict:Dict, column:str, new_column:str="")->DataFrame:
    """Handy method for mapping column values from one value to another

    Args:
        df (DataFrame): Dataframe to operate on
        map_dict (Dict): Dictionary containing the values to map from and to
        column (str): The column containing the values to be mapped
        new_column (str, optional): The name of the column to store the mapped values in.
            If not specified the values will be stored in the original column

    Returns:
        DataFrame
    """
    spark_map = F.create_map([F.lit(x) for x in chain(*map_dict.items())])
    return df.withColumn(new_column or column, spark_map[df[column]])
This can be used as follows
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local[3]").getOrCreate()
df = spark.createDataFrame([Row(A=0), Row(A=1)])
df = map_column_values(df, map_dict={0:"foo", 1:"bar"}, column="A", new_column="B")
df.show()
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
#+---+---+
#| A| B|
#+---+---+
#| 0|foo|
#| 1|bar|
#+---+---+
This creates my example dataframe:
from pyspark.sql.functions import lit
df = sc.parallelize([('abc',),('def',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
df.show()
looking like this:
+---+---+
|one|two|
+---+---+
|abc| z|
|def| z|
+---+---+
Now what I want to do is run a series of SQL LIKE checks on column one and append the letter to column two whenever it matches.
In pseudo-code it looks like this:
for letter in ['a','b','c','d']:
    df = df['two'].where(col('one').like("%{}%".format(letter))) += letter
finally resulting in a df looking like this:
+---+----+
|one| two|
+---+----+
|abc|zabc|
|def| zd|
+---+----+
If you are using a list of strings to subset your string column, broadcast variables are a good fit. Let's start with a more realistic example where your strings contain spaces:
df = sc.parallelize([('a b c',),('d e f',)]).toDF()
df = df.selectExpr("_1 as one",)
df = df.withColumn("two", lit('z'))
Then we create a broadcast variable from a list of letters and define a udf that uses it to subset a list of strings, then concatenates the result with the value in another column, returning one string:
from pyspark.sql.functions import udf
letters = ['a','b','c','d']
letters_bd = sc.broadcast(letters)
def subs(col1, col2):
    l_subset = [x for x in col1 if x in letters_bd.value]
    return col2 + ' ' + ' '.join(l_subset)
subs_udf = udf(subs)
To apply the above, the string we are subsetting needs to be converted to a list, so we use split() first and then apply our udf:
from pyspark.sql.functions import col, split
df.withColumn("three", split(col('one'), r'\W+')) \
.withColumn("three", subs_udf("three", "two")) \
.show()
+-----+---+-------+
| one|two| three|
+-----+---+-------+
|a b c| z|z a b c|
|d e f| z| z d|
+-----+---+-------+
Or, without a udf, use regexp_replace and concat, provided your letters fit comfortably into a regular expression:
from pyspark.sql.functions import regexp_replace, col, concat, lit
df.withColumn("three", concat(col('two'), lit(' '),
regexp_replace(col('one'), '[^abcd]', ' ')))
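The regexp_replace variant does not give exactly the same string as the udf: every character outside the class is replaced by a space, so the kept letters stay in their original positions. The effect can be checked with Python's re module. This is a sketch only (Spark uses Java regular expressions, but a simple character class like this one behaves the same), and keep_letters is a hypothetical helper name.

```python
import re


def keep_letters(text, letters="abcd"):
    """Replace every character that is not in `letters` with a space."""
    return re.sub("[^{}]".format(letters), " ", text)


for one in ["a b c", "d e f"]:
    # Mirrors concat(col('two'), lit(' '), regexp_replace(col('one'), '[^abcd]', ' '))
    print(repr("z" + " " + keep_letters(one)))
# 'z a b c'
# 'z d    '
```

If the trailing padding matters, wrapping the result in rtrim from pyspark.sql.functions would drop it.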
I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:
from functools import reduce  # needed on Python 3
to_convert = set([...]) # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use comprehension:
exprs = [
blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]
testDF.select(*exprs)
If you want to specifically operate on string fields please check the answer by robin-loxley.
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
This simply adds on top of zero323's and soulmachine's answers, to convert all StringType fields:
from pyspark.sql.types import StringType
string_fields = []
for i, f in enumerate(test_df.schema.fields):
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
My solution is much better than all the solutions I've seen so far and can deal with as many fields as you want; see the little function below:
// Replace empty Strings with null values
private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
I learned this trick from @liancheng.
If you are using Python, you can check the following.
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| | |
| |name3|null|
+----+-----+----+
from pyspark.sql.functions import col, when
def convertToNull(dfa):
    for i in dfa.columns:
        dfa = dfa.withColumn(i , when(col(i) == '', None ).otherwise(col(i)))
    return dfa
convertToNull(dfa).show()
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| null|null|
|null|name3|null|
+----+-----+----+
I would add a trim to @zero323's solution to deal with values made up only of spaces:
from pyspark.sql.functions import col, when, trim
def blank_as_null(x):
    return when(trim(col(x)) != "", col(x))
Thanks to @zero323, @Tomerikoo and @Robin Loxley.
Ready-to-use function:
def convert_blank_to_null(df, cols=None):
    from pyspark.sql.functions import col, when, trim
    from pyspark.sql.types import StringType
    def blank_as_null(x):
        return when(trim(col(x)) == "", None).otherwise(col(x))
    # Don't know how to parallel
    for f in (df.select(cols) if cols else df).schema.fields:
        if isinstance(f.dataType, StringType):
            df = df.withColumn(f.name, blank_as_null(f.name))
    return df
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}