I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce (in Python 3 it has to be imported from functools):
from functools import reduce

to_convert = set([...]) # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use a comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]
testDF.select(*exprs)
If you want to specifically operate on string fields, please check the answer by robin-loxley.
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
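If you need this for every column rather than a single one, the same built-in pattern can be applied in one select, for example (a sketch; it assumes every column can safely be compared against an empty string):
from pyspark.sql.functions import col, when

# Apply the when/otherwise replacement to every column in a single select (sketch)
df = df.select([when(col(c) == '', None).otherwise(col(c)).alias(c) for c in df.columns])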
Simply adding on top of zero323's and soulmachine's answers, to convert all StringType fields:
from pyspark.sql.types import StringType

string_fields = []
for i, f in enumerate(test_df.schema.fields):
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
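The collected names can then be plugged into the select/alias pattern from the accepted answer, for example (a sketch reusing blank_as_null from above):
exprs = [
    blank_as_null(c).alias(c) if c in string_fields else c
    for c in test_df.columns]
cleaned_df = test_df.select(*exprs)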
My solution is much better than the other solutions I've seen so far because it can deal with as many fields as you want; see the small function below:
// Replace empty Strings with null values
private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
I learned this trick from @liancheng.
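For reference, a rough Python translation of the Scala function above might look like this (a sketch, not a tested drop-in):
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Replace empty strings with null in every StringType column (sketch)
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType())).otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType) else col(f.name)
        for f in df.schema.fields]
    return df.select(*exprs)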
If you are using Python, you can check the following.
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| | |
| |name3|null|
+----+-----+----+
from pyspark.sql.functions import col, when

def convertToNull(dfa):
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == '', None).otherwise(col(i)))
    return dfa
convertToNull(dfa).show()
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| null|null|
|null|name3|null|
+----+-----+----+
I would add a trim to @zero323's solution to deal with cases of multiple white spaces:
from pyspark.sql.functions import col, trim, when

def blank_as_null(x):
    return when(trim(col(x)) != "", col(x))
Thanks to @zero323, @Tomerikoo and @Robin Loxley.
Ready-to-use function:
def convert_blank_to_null(df, cols=None):
    from pyspark.sql.functions import col, when, trim
    from pyspark.sql.types import StringType

    def blank_as_null(x):
        return when(trim(col(x)) == "", None).otherwise(col(x))

    # Don't know how to parallelize this, so apply it column by column
    for f in (df.select(cols) if cols else df).schema.fields:
        if isinstance(f.dataType, StringType):
            df = df.withColumn(f.name, blank_as_null(f.name))
    return df
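Usage is then a one-liner (a sketch using the testDF from the question):
dfWithEmptyReplaced = convert_blank_to_null(testDF)
dfWithEmptyReplaced.show()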
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}
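That said, a rough Python analogue of the fold is possible with functools.reduce (a sketch, not a line-for-line translation):
from functools import reduce
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def empty_strings_to_none(df):
    # Fold over the schema, rewriting each StringType column in turn (sketch)
    return reduce(
        lambda current, field: current.withColumn(
            field.name,
            when(length(col(field.name)) == 0, lit(None)).otherwise(col(field.name)))
        if isinstance(field.dataType, StringType) else current,
        df.schema.fields,
        df)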
df = spark.createDataFrame(
[
(1, "AxtTR"), # create your data here, be consistent in the types.
(2, "HdyOP"),
(3, "EqoPIC"),
(4, "OkTEic"),
], ["id", "label"] )# add your column names here]
df.show()
The code below is in Python (pandas), where I use the apply function to extract the first two letters of every row. I want to replicate the same logic in PySpark, where a function is applied to every single row to produce the output.
def get_string(lst):
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst

df['first_2letter'] = df['label'].apply(get_string)
The expected output is a new column containing the first two lowercase letters of each label.
You can use the relevant Spark SQL functions:
import pyspark.sql.functions as F
df2 = df.withColumn('first_2letter', F.lower('label')[0:2])
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
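The [0:2] slice on a Column is shorthand for a substring call; if you prefer to be explicit, a sketch using substring (positions are 1-based in Spark) would be:
import pyspark.sql.functions as F

# Equivalent, with an explicit substring call instead of the slice syntax (sketch)
df2 = df.withColumn('first_2letter', F.substring(F.lower('label'), 1, 2))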
If you want to use user-defined functions, you can define them as:
def get_string(lst):
lst = str(lst)
lst = lst.lower()
lst = lst[0:2]
return lst
import pyspark.sql.functions as F
df2 = df.withColumn('first_2letter', F.udf(get_string)('label'))
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
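A small note: F.udf defaults to a StringType return type; if you prefer to declare it explicitly (or need a different type later), you can pass it yourself, for example (a sketch):
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

get_string_udf = F.udf(get_string, StringType())
df2 = df.withColumn('first_2letter', get_string_udf('label'))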
I have a pyspark dataframe with multiple columns. For example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| a| p|
|Jack| b| q|
|Bell| c| r|
|Bell| d| s|
+----+--------+--------+
Now I want to group by "name" and concatenate the values in every row for both columns.
I know how to do it, but if there are thousands of columns my code becomes very ugly.
Here is my solution.
import pyspark.sql.functions as f

t = score_card.groupby("name").agg(
    f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
    f.concat_ws("", f.collect_list("letters2")).alias("letters2")
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| ab| pq|
|Bell| cd| rs|
+----+--------+--------+
But my main concern is about these two lines of code:
f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
f.concat_ws("", f.collect_list("letters2")).alias("letters2")
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.
Yes, you can build the aggregation expressions with a comprehension over df.columns and pass them to agg. Let me know if it helps.
from pyspark.sql import functions as F
df.show()
# +--------+--------+----+
# |letters1|letters2|name|
# +--------+--------+----+
# | a| p|Jack|
# | b| q|Jack|
# | c| r|Bell|
# | d| s|Bell|
# +--------+--------+----+
df.groupBy("name").agg( *[F.array_join(F.collect_list(column), "").alias(column) for column in df.columns if column !='name' ]).show()
# +----+--------+--------+
# |name|letters1|letters2|
# +----+--------+--------+
# |Bell| cd| rs|
# |Jack| ab| pq|
# +----+--------+--------+
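array_join requires Spark 2.4+; on older versions the same comprehension should work with concat_ws over the collected list (a sketch):
df.groupBy("name").agg(
    *[F.concat_ws("", F.collect_list(column)).alias(column) for column in df.columns if column != 'name']
).show()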
I have a pyspark dataframe with a column I am trying to extract information from. To give you an example, the column is a combination of 4 foreign keys which could look like this:
Ex 1: 12345-123-12345-4
Ex 2: 5678-4321-123-12
I am trying to extract the last piece of the string, in this case the 4 & 12. Any idea on how I can do this?
I've tried the following:
df.withColumn("result", sf.split(sf.col("column_to_split"), '\_')[1])\
.withColumn("result", sf.col("result").cast('integer'))
However, the result for double digit values is null, and it only returns an integer for single digits (0-9)
Thanks!
For Spark 2.4+, you should use element_at with -1 on your array after the split:
from pyspark.sql import functions as sf
df.withColumn("result", sf.element_at(sf.split("column_to_split","\-"),-1).cast("int")).show()
+-----------------+------+
| column_to_split|result|
+-----------------+------+
|12345-123-12345-4| 4|
| 5678-4321-123-12| 12|
+-----------------+------+
Mohammad's answer is very clean and a nice solution. However, if you need a solution for Spark versions < 2.4, you can use the reverse string function, take the first element, reverse it back and cast it to an integer, e.g.:
import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = pd.DataFrame()
df['column_to_split'] = ["12345-123-12345-4", "5678-4321-123-12"]
df = spark.createDataFrame(df)
df.withColumn("result",
f.reverse(f.split(f.reverse("column_to_split"), "-")[0]). \
cast(t.IntegerType())).show(2, False)
+-----------------+------+
|column_to_split |result|
+-----------------+------+
|12345-123-12345-4|4 |
|5678-4321-123-12 |12 |
+-----------------+------+
This is how to get the last digits of the serial number above:
serial_no = '12345-123-12345-4'
last_digit = serial_no.split('-')[-1]
print(last_digit)
So in your case, use the split function from pyspark.sql.functions rather than Python's str.split (int() and str.split do not work on a Column):
df.withColumn("result", sf.element_at(sf.split("column_to_split", "-"), -1).cast("int"))
If it doesn't work, please share the result.
Adding another way: you can also use the .regexp_extract() or .substring_index() functions:
Example:
df.show()
#+-----------------+
#| column_to_split|
#+-----------------+
#|12345-123-12345-4|
#| 5678-4321-123-12|
#+-----------------+
df.withColumn("result",regexp_extract(col("column_to_split"),"([^-]+$)",1).cast("int")).\
withColumn("result1",substring_index(col("column_to_split"),"-",-1).cast("int")).\
show()
#+-----------------+------+-------+
#| column_to_split|result|result1|
#+-----------------+------+-------+
#|12345-123-12345-4| 4| 4|
#| 5678-4321-123-12| 12| 12|
#+-----------------+------+-------+
I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|u'['2','4','713']| 10|
| u' ['12','245']| 20|
| u'['101','12',]| 30|
+-----------------+---+
How can I convert the above dataframe such that each element in the list is a float and is within a proper list?
I tried the below one :
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same without a change.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive. If possible, please give me an optimal solution. I am using PySpark.
That's because you forgot about the type:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import rtrim, ltrim, split
df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
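If the elements are quoted, as in the question's data, the same pattern can be extended to strip the quotes as well before splitting (a sketch):
from pyspark.sql.functions import regexp_replace, split

df = spark.createDataFrame([u"""u'['2','4','713']"""], "string").toDF("list")

# Strip the leading u'[ , the trailing ] and all single quotes, then split and cast (sketch)
df.select(split(
    regexp_replace("list", "^u'\\[|\\]$|'", ""), ","
).cast("array<integer>").alias("list")).show()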
I can produce the correct result in Python 3 with a small change to the definition of df_amp_conversion: you didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col

values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])

def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp

df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
I have the following UDF applied to a PySpark dataframe. The code works fine, except that when myFun1('oldColumn') is null I want the output to be an empty string instead of null.
myFun1 = udf(lambda x: myModule.myFunction1(x), StringType())
myDF = myDF.withColumn('newColumn', myFun1('oldColumn'))
Is it possible to do this in place instead of creating another UDF? Thanks!
Using df.fillna() or df.na.fill() to replace null values with an empty string worked for me.
You can do replacements by column by supplying the column and value you want to replace nulls with as a parameter:
myDF = myDF.na.fill({'oldColumn': ''})
The PySpark docs have an example:
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height| name|
+---+------+-------+
| 10| 80| Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null|unknown|
+---+------+-------+
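Alternatively, if you would rather handle it inside the original UDF than in a second fill pass, you can coerce a None result to an empty string in the lambda itself (a sketch, assuming myModule.myFunction1 may return None):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Return '' whenever the wrapped function yields None (sketch)
myFun1 = udf(lambda x: myModule.myFunction1(x) or '', StringType())
myDF = myDF.withColumn('newColumn', myFun1('oldColumn'))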