pyspark/dataframe: replace null with empty space - python

I have the following UDF in a PySpark DataFrame. The code works fine, except that when myFun1('oldColumn') is null I want the output to be an empty string instead of null.
myFun1 = udf(lambda x: myModule.myFunction1(x), StringType())
myDF = myDF.withColumn('newColumn', myFun1('oldColumn'))
Is it possible to do this in place instead of creating another udf function? Thanks!

Using df.fillna() or df.na.fill() to replace null values with an empty string worked for me.
You can do per-column replacements by supplying a dict of column names and the values to replace nulls with:
myDF = myDF.na.fill({'oldColumn': ''})
The PySpark docs have an example:
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
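Applied back to the original question, a minimal sketch (reusing myDF and myFun1 from above): fill the UDF's output column afterwards, or wrap the call in coalesce, so no second UDF is needed.
from pyspark.sql import functions as F

# Option 1: fill nulls produced by the UDF after the fact
myDF = myDF.withColumn('newColumn', myFun1('oldColumn')).na.fill({'newColumn': ''})

# Option 2: replace nulls inline with coalesce
myDF = myDF.withColumn('newColumn', F.coalesce(myFun1('oldColumn'), F.lit('')))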

Related

Convert none type to datetime using parser.parse

I'm using the following function to parse strings to dates in PySpark:
func = udf(lambda x: parser.parse(x), DateType())
My date format is:
"22-Jan-2021 00:00"
However, this function does not work with None values, and I have the following Spark DataFrame:
date
----
"22-Jan-2021 00:00"
""
"10-Feb-2020 14:00"
When I apply my func to the date column, I get an error on the second row of the DataFrame saying that it can't parse NoneType. Any tips on solving this problem using PySpark and the above func?
An MVCE:
date = None
date_parsed = parser.parse(date)
It seems like you could just use the to_timestamp function. For example:
df.show()
+---+-----------------+
| id|             date|
+---+-----------------+
|  1|22-Jan-2021 00:00|
|  2|             null|
|  3|10-Feb-2020 14:00|
+---+-----------------+
You can simply use the following code to convert the string in the date column to a timestamp type.
from pyspark.sql import functions
df = df.withColumn("date", functions.to_timestamp("date", "dd-MMM-yyyy HH:mm"))
df.show()
+---+-------------------+
| id|               date|
+---+-------------------+
|  1|2021-01-22 00:00:00|
|  2|               null|
|  3|2020-02-10 14:00:00|
+---+-------------------+
You can also verify that the conversion is done correctly with df.schema
print(df.schema)
StructType(List(StructField(id,StringType,true),StructField(date,TimestampType,true)))
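If you would rather keep the dateutil-based func from the question, a minimal null-safe sketch is below (it assumes from dateutil import parser as in the question; TimestampType is used because parser.parse returns a datetime, not a date):
from dateutil import parser
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

# Skip null/empty values so parser.parse never receives None
func = udf(lambda x: parser.parse(x) if x else None, TimestampType())
df = df.withColumn("date", func("date"))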

How can I concatenate the rows in a pyspark dataframe with multiple columns using groupby and aggregate

I have a pyspark dataframe with multiple columns. For example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack| a| p|
|Jack| b| q|
|Bell| c| r|
|Bell| d| s|
+----+--------+--------+
Now I want to group by "name" and concatenate the values in every row for both columns.
I know how to do it, but if there are thousands of columns my code becomes very ugly.
Here is my solution.
import pyspark.sql.functions as f
t = score_card.groupby("name").agg(
    f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
    f.concat_ws("", f.collect_list("letters2")).alias("letters2")
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|      ab|      pq|
|Bell|      cd|      rs|
+----+--------+--------+
But my main concern is about these two lines of code
f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
f.concat_ws("", f.collect_list("letters2")).alias("letters2")
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.
Yes, you can build the aggregation expressions with a loop (a list comprehension) inside agg and iterate through df.columns. Let me know if it helps.
from pyspark.sql import functions as F
df.show()
# +--------+--------+----+
# |letters1|letters2|name|
# +--------+--------+----+
# |       a|       p|Jack|
# |       b|       q|Jack|
# |       c|       r|Bell|
# |       d|       s|Bell|
# +--------+--------+----+
df.groupBy("name").agg( *[F.array_join(F.collect_list(column), "").alias(column) for column in df.columns if column !='name' ]).show()
# +----+--------+--------+
# |name|letters1|letters2|
# +----+--------+--------+
# |Bell|      cd|      rs|
# |Jack|      ab|      pq|
# +----+--------+--------+
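The same loop also works with concat_ws, matching the function used in the question (a sketch run against the question's score_card DataFrame):
from pyspark.sql import functions as F

# Build one concat_ws/collect_list expression per column except the grouping key
agg_exprs = [
    F.concat_ws("", F.collect_list(c)).alias(c)
    for c in score_card.columns if c != "name"
]
score_card.groupBy("name").agg(*agg_exprs).show()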

How to extract all elements after last underscore in pyspark?

I have a pyspark dataframe with a column I am trying to extract information from. To give you an example, the column is a combination of 4 foreign keys which could look like this:
Ex 1: 12345-123-12345-4
Ex 2: 5678-4321-123-12
I am trying to extract the last piece of the string, in this case the 4 & 12. Any idea on how I can do this?
I've tried the following:
df.withColumn("result", sf.split(sf.col("column_to_split"), '\_')[1])\
.withColumn("result", sf.col("result").cast('integer'))
However, the result for double digit values is null, and it only returns an integer for single digits (0-9)
Thanks!
For Spark 2.4+, you should use element_at with -1 on your array after split:
from pyspark.sql import functions as sf
df.withColumn("result", sf.element_at(sf.split("column_to_split","\-"),-1).cast("int")).show()
+-----------------+------+
|  column_to_split|result|
+-----------------+------+
|12345-123-12345-4|     4|
| 5678-4321-123-12|    12|
+-----------------+------+
Mohammad's answer is very clean and a nice solution. However, if you need a solution for Spark versions < 2.4, you can utilise the reverse string functionality: reverse the string, split, take the first element, reverse it back and turn it into an integer, e.g.:
import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = pd.DataFrame()
df['column_to_split'] = ["12345-123-12345-4", "5678-4321-123-12"]
df = spark.createDataFrame(df)
df.withColumn("result",
f.reverse(f.split(f.reverse("column_to_split"), "-")[0]). \
cast(t.IntegerType())).show(2, False)
+-----------------+------+
|column_to_split  |result|
+-----------------+------+
|12345-123-12345-4|4     |
|5678-4321-123-12 |12    |
+-----------------+------+
This is how to get the last digits of the serial number above:
serial_no = '12345-123-12345-4'
last_digit = serial_no.split('-')[-1]
print(last_digit)
So in your case, the same logic has to go through a UDF, since plain Python string methods do not work directly on Spark columns:
last_piece = sf.udf(lambda s: s.split('-')[-1] if s else None)
df.withColumn("result", last_piece(sf.col("column_to_split")).cast('integer'))
If it doesn't work, please share the result.
Adding another way:
You can also use the .regexp_extract() or .substring_index() functions.
Example:
df.show()
#+-----------------+
#| column_to_split|
#+-----------------+
#|12345-123-12345-4|
#| 5678-4321-123-12|
#+-----------------+
df.withColumn("result",regexp_extract(col("column_to_split"),"([^-]+$)",1).cast("int")).\
withColumn("result1",substring_index(col("column_to_split"),"-",-1).cast("int")).\
show()
#+-----------------+------+-------+
#|  column_to_split|result|result1|
#+-----------------+------+-------+
#|12345-123-12345-4|     4|      4|
#| 5678-4321-123-12|    12|     12|
#+-----------------+------+-------+

How do I convert a unicode list contained in a pyspark column of a dataframe into a float list?

I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
|             list|  A|
+-----------------+---+
|  ['2','4','713']| 10|
|     ['12','245']| 20|
|     ['101','12']| 30|
+-----------------+---+
How can I convert the above dataframe such that each element in the list is a float and is within a proper list?
I tried the following:
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same without a change.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive. If possible, please suggest an optimal solution. I am using PySpark.
That's because you forgot to specify the return type:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split

df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
    regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# |               list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
I can produce the correct result in Python 3 with a small change in the definition of the function df_amp_conversion: you didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# |           list|  A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# |   ['12','245']| 20|  [12, 245]|
# |   ['101','12']| 30|  [101, 12]|
# +---------------+---+-----------+
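Since the question asks for a float list, here is a sketch that builds on the regexp_replace/split approach above and casts the result to array<float> (column names follow the question's df):
from pyspark.sql.functions import regexp_replace, split, col

# Strip brackets, quotes and spaces, split on commas, then cast each element to float
df_float = df.withColumn(
    "float_list",
    split(regexp_replace(col("list"), r"[\[\]' ]", ""), ",").cast("array<float>")
)
df_float.show()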

Replace empty strings with None/null values in DataFrame

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:
from functools import reduce  # needed on Python 3

to_convert = set([...]) # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns]
testDF.select(*exprs)
If you want to specifically operate on string fields please check the answer by robin-loxley.
UDFs are not terribly efficient. The correct way to do this with a built-in method is:
from pyspark.sql.functions import col, when

df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
Simply adding on top of zero323's and soulmachine's answers, to convert all StringType fields:
from pyspark.sql.types import StringType

string_fields = []
for i, f in enumerate(test_df.schema.fields):
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
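A possible way to finish the thought, as a sketch reusing zero323's blank_as_null from above together with the collected string_fields:
from pyspark.sql.functions import col, when

def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)

# Apply blank_as_null only to the string columns collected above
exprs = [blank_as_null(f).alias(f) if f in string_fields else f
         for f in test_df.columns]
test_df.select(*exprs).show()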
My solution is much better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:
// Replace empty Strings with null values
private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }
  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
I learned this trick from @liancheng.
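A rough Python equivalent of the Scala function above, as a sketch (not tied to a particular Spark version):
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df: DataFrame) -> DataFrame:
    # Replace empty strings with nulls in every StringType column
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType())).otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType)
        else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)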
If you are using Python, you can check the following.
+----+-----+----+
|  id| name| age|
+----+-----+----+
|null|name1|  50|
|   2|     |    |
|    |name3|null|
+----+-----+----+
from pyspark.sql.functions import col, when

def convertToNull(dfa):
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == '', None).otherwise(col(i)))
    return dfa
convertToNull(dfa).show()
+----+-----+----+
|  id| name| age|
+----+-----+----+
|null|name1|  50|
|   2| null|null|
|null|name3|null|
+----+-----+----+
I would add a trim to @zero323's solution to also handle values that consist only of white space:
from pyspark.sql.functions import trim

def blank_as_null(x):
    return when(trim(col(x)) != "", col(x))
Thanks to #zero323 , #Tomerikoo and #Robin Loxley
Ready to use function:
def convert_blank_to_null(df, cols=None):
    from pyspark.sql.functions import col, when, trim
    from pyspark.sql.types import StringType

    def blank_as_null(x):
        return when(trim(col(x)) == "", None).otherwise(col(x))

    # Don't know how to parallelize this loop
    for f in (df.select(cols) if cols else df).schema.fields:
        if isinstance(f.dataType, StringType):
            df = df.withColumn(f.name, blank_as_null(f.name))
    return df
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}
