df = spark.createDataFrame(
    [
        (1, "AxtTR"),   # create your data here, be consistent in the types
        (2, "HdyOP"),
        (3, "EqoPIC"),
        (4, "OkTEic"),
    ],
    ["id", "label"],  # add your column names here
)
df.show()
The following code is in Python, where I use the apply function to extract the first two letters of every row. I want to replicate the same logic in PySpark, applying a function to every single row and getting the output.
def get_string(lst):
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst

df['first_2letter'] = df['label'].apply(get_string)
The expected output is a new column containing the lowercased first two letters of each label.
You can use the relevant Spark SQL functions:
import pyspark.sql.functions as F
df2 = df.withColumn('first_2letter', F.lower('label')[0:2])
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
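A slightly more explicit alternative (my variation, not from the original answer) combines lower with substring; note that substring in Spark SQL is 1-indexed:

import pyspark.sql.functions as F

# substring(col, pos, len) is 1-indexed, so this takes the first two characters
df2 = df.withColumn('first_2letter', F.substring(F.lower('label'), 1, 2))
df2.show()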
If you want to use user-defined functions, you can define them as:
def get_string(lst):
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst

import pyspark.sql.functions as F

df2 = df.withColumn('first_2letter', F.udf(get_string)('label'))
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
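As a side note (my addition): F.udf defaults to a StringType return type, which happens to be what is needed here; for other result types it is safer to declare the type explicitly:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

get_first_two = udf(get_string, StringType())  # explicit return type
df2 = df.withColumn('first_2letter', get_first_two('label'))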
I have a dictionary like this:
sample_dict = {
    "A": ["aaaa\.com", "aaaa\.es"],
    "B": ["bbbb\.com", "bbbb\.es", "bbbb\.net"],
    "C": ["ccccc\.com"],
    # many more entries here
}
I would like to add a column in a Spark DataFrame which performs the following operation:
(
    df
    .withColumn(
        "new_col",
        F.when(
            (F.col("filter_col").rlike("aaaa\.com")) |
            (F.col("filter_col").rlike("aaaa\.es")),
            F.lit("A")
        )
        .when(
            (F.col("filter_col").rlike("bbbb\.com")) |
            (F.col("filter_col").rlike("bbbb\.es")) |
            (F.col("filter_col").rlike("bbbb\.net")),
            F.lit("B")
        )
        .when(
            (F.col("filter_col").rlike("ccccc\.com")),
            F.lit("C")
        )
        .otherwise(None)
    )
)
But, of course, I would like this to be dynamic, so that when I add new entries to my dictionary, the column automatically picks them up and assigns the new category based on the rules.
Is this possible?
You can construct the column expression by iterating over the dict and then pass this expression to your withColumn call.
from pyspark.sql import functions as F

sample_dict = {
    "A": ["aaaa\.com", "aaaa\.es"],
    "B": ["bbbb\.com", "bbbb\.es", "bbbb\.net"],
    "C": ["ccccc\.com"],
    # many more entries here
}

data = [("aaaa.com", ), ("aaaa.es", ), ("bbbb.com", ), ("zzzz.com", ), ]
df = spark.createDataFrame(data, ("filter_col", ))

# Start from the functions module so that the first .when call is F.when(...);
# every later iteration chains .when onto the Column built so far.
column_expression = F
for k, conditions in sample_dict.items():
    condition_expression = F.col("filter_col").rlike(conditions[0])
    for condition in conditions[1:]:
        condition_expression |= F.col("filter_col").rlike(condition)
    column_expression = column_expression.when(condition_expression, F.lit(k))

df.withColumn("new_col", column_expression.otherwise(None)).show()
Output
# column_expression is equivalent to writing the expression by hand
Column<'CASE WHEN (RLIKE(filter_col, aaaa\.com) OR RLIKE(filter_col, aaaa\.es)) THEN A WHEN ((RLIKE(filter_col, bbbb\.com) OR RLIKE(filter_col, bbbb\.es)) OR RLIKE(filter_col, bbbb\.net)) THEN B WHEN RLIKE(filter_col, ccccc\.com) THEN C END'>
## Df with expression applied
+----------+-------+
|filter_col|new_col|
+----------+-------+
| aaaa.com| A|
| aaaa.es| A|
| bbbb.com| B|
| zzzz.com| null|
+----------+-------+
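The same CASE WHEN chain can also be built with functools.reduce instead of explicit loops (my variation on the answer above; the helper name any_match is mine):

from functools import reduce
from pyspark.sql import functions as F

def any_match(patterns):
    # OR together one rlike test per pattern
    return reduce(lambda a, b: a | b, [F.col("filter_col").rlike(p) for p in patterns])

column_expression = reduce(
    lambda acc, kv: acc.when(any_match(kv[1]), F.lit(kv[0])),
    sample_dict.items(),
    F,  # start from the functions module so the first .when is F.when(...)
)

df.withColumn("new_col", column_expression.otherwise(None)).show()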
If you could alter your column so that you could look for exact matches, you could use df.replace():
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(filter_col='aaa.de'),
    Row(filter_col='aaa.es'),
    Row(filter_col='bbb.de'),
    Row(filter_col='bbb.es'),
    Row(filter_col='foo'),
])
d = {
    'aaa.de': 'A',
    'aaa.es': 'A',
    'bbb.de': 'B',
    'bbb.es': 'B',
}
(
    df
    .withColumn('new_col', F.col('filter_col'))
    .withColumn('new_col', F.when(F.col('new_col').isin(list(d.keys())), F.col('new_col')))
    .replace(d, None, subset='new_col')
    .show()
)
# Output:
+----------+-------+
|filter_col|new_col|
+----------+-------+
| aaa.de| A|
| aaa.es| A|
| bbb.de| B|
| bbb.es| B|
| foo| null|
+----------+-------+
There might be a more performant way to replace values not mentioned in your dictionary with "None" (your "otherwise" condition).
Update:
If the reformatting is not possible, you would have to iterate through your dict:
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(filter_col='aaa.de/foo'),
    Row(filter_col='aaa.es/foo'),
    Row(filter_col='bbb.de/foo'),
    Row(filter_col='bbb.es/foo'),
    Row(filter_col='foo'),
])
d = {
    'aaa\.de': 'A',
    'aaa\.es': 'A',
    'bbb\.de': 'B',
    'bbb\.es': 'B',
}
df = df.withColumn('new_col', F.lit(None).cast('string'))
for k, v in d.items():
    df = df.withColumn('new_col', F.when(F.col('filter_col').rlike(k), v).otherwise(F.col('new_col')))
df.show()
# Output
+----------+-------+
|filter_col|new_col|
+----------+-------+
|aaa.de/foo| A|
|aaa.es/foo| A|
|bbb.de/foo| B|
|bbb.es/foo| B|
| foo| null|
+----------+-------+
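One side observation (my addition, not from the original answer): calling withColumn in a loop adds one projection per dictionary entry, which can bloat the query plan for large dictionaries. The same result can be expressed as a single coalesce over when expressions, since an unmatched when yields null:

from pyspark.sql import functions as F

# first matching pattern wins, unmatched rows stay null
new_col = F.coalesce(*[F.when(F.col('filter_col').rlike(k), F.lit(v)) for k, v in d.items()])
df = df.withColumn('new_col', new_col)
df.show()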
I have created a dataframe as shown
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+---------------+---+
|           list|  A|
+---------------+---+
|['2','4','713']| 10|
|   ['12','245']| 20|
|   ['101','12']| 30|
+---------------+---+
**How can I convert the above dataframe so that each element in the list is a float and is inside a proper list?**
I tried the below one :
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = df_amp_conversion(df)
But the data remains the same, without any change.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive.
If possible, please suggest an optimal solution. I am using PySpark.
That's because you forgot about the return type:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split

df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")

df.select(split(
    regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
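Since the question asks for floats, the same expression can cast to array<float> instead (a small variation, not from the original answer):

from pyspark.sql.functions import regexp_replace, split

# same parsing, but the elements are cast to float
df.select(split(
    regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<float>").alias("list")).show()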
I can produce the correct result in Python 3 with a little change in the definition of the function df_amp_conversion: you didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col

values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])

def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp

df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
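Note that without an explicit return type the UDF above returns a string. Since the question asks for floats, the return type can be declared and the elements converted (a small variation, not from the original answer):

import ast
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, FloatType

string_list_to_floats = udf(
    lambda s: [float(x) for x in ast.literal_eval(str(s))],
    ArrayType(FloatType()),
)
df2 = df.withColumn('float_list', string_list_to_floats(col('list')))
df2.show()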
I am trying to run exponential weighted moving average in PySpark using a Grouped Map Pandas UDF. It doesn't work though:
def ExpMA(myData):
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.functions import PandasUDFType
    from pyspark.sql import SQLContext

    df = myData
    group_col = 'Name'
    sort_col = 'Date'

    schema = df.select(group_col, sort_col, 'count').schema
    print(schema)

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model

    data = df.groupby('Name').apply(ema)
    return data
I also tried running it without the Pandas udf, just writing the ewma equation in PySpark, but the problem there is that the ewma equation contains the lag of the current ewma.
First of all, your Pandas code is incorrect. This just won't work, Spark or not:
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema which, depending on your data, won't really accommodate the result:
If you want to add ewm, the schema should be extended.
If you want to return only ewm, then the schema is too large.
If you want to just replace the column, the type might not match.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField

def exp_ma(df, group_col='Name', sort_col='Date'):
    schema = (df.select(group_col, sort_col, 'count')
              .schema.add(StructField('ewma', DoubleType())))

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby(group_col).apply(ema)
df = spark.createDataFrame(
    [("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
    ("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count| ewma|
# +----+----+-----+------------------+
# | b| 1| 10| 10.0|
# | b| 8| 3| 5.800000000000001|
# | b| 9| 0|3.0526315789473686|
# | a| 1| 1| 1.0|
# | a| 2| 3| 2.2|
# | a| 3| 3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use Pandas much, so there might be a more elegant way of doing this.
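As a side note (my addition, assuming Spark 3.0 or later is available): the GROUPED_MAP pandas_udf style has since been superseded by GroupedData.applyInPandas, which takes a plain pandas function plus the output schema:

from pyspark.sql.types import DoubleType, StructField

def exp_ma_v3(df, group_col='Name', sort_col='Date'):
    schema = (df.select(group_col, sort_col, 'count')
              .schema.add(StructField('ewma', DoubleType())))

    def ema(pdf):
        # plain pandas function applied per group
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby(group_col).applyInPandas(ema, schema=schema)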
Based on previous questions: 1, 2. Suppose I have the following dataframe:
df = spark.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)],
    ("x1", "x2", "x3"))
I want to add a new column x4, but the values I want to add are in a Python list, e.g. x4_ls = [35.0, 32.0]. Is there a good way to add this new column to the Spark dataframe? (Note that I use Spark 2.1.)
Output should be something like:
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
I can also transform my list to a dataframe, df_x4 = spark.createDataFrame([Row(**{'x4': x}) for x in x4_ls]), but I don't know how to concatenate the two dataframes together.
Thanks to Gaurav Dhama for a great answer! I made a few small changes to his solution. Here is my version, which joins the two dataframes together on an added row_num column.
from pyspark.sql import Row

def flatten_row(r):
    r_ = r.features.asDict()
    r_.update({'row_num': r.row_num})
    return Row(**r_)

def add_row_num(df):
    df_row_num = df.rdd.zipWithIndex().toDF(['features', 'row_num'])
    df_out = df_row_num.rdd.map(lambda x: flatten_row(x)).toDF()
    return df_out

df = add_row_num(df)
df_x4 = add_row_num(df_x4)

df_concat = df.join(df_x4, on='row_num').drop('row_num')
We can concatenate on the basis of row numbers as follows. Suppose we have two dataframes, df and df_x4:
def addrownum(df):
    dff = df.rdd.zipWithIndex().toDF(['features', 'rownum'])
    odf = dff.rdd.map(lambda x: tuple(x.features) + tuple([x.rownum])).toDF(df.columns + ['rownum'])
    return odf

df1 = addrownum(df)
df2 = addrownum(df_x4)

outputdf = df1.join(df2, df1.rownum == df2.rownum).drop(df1.rownum).drop(df2.rownum)
## outputdf
## +---+---+-----+----+
## | x1| x2| x3| x4|
## +---+---+-----+----+
## | 1| a| 23.0|35.0|
## | 3| B|-23.0|32.0|
## +---+---+-----+----+
outputdf is your required output dataframe.
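Another option that is sometimes suggested for this kind of positional join (my addition; it assumes both DataFrames are still in their original row order, which holds for freshly created DataFrames but is not guaranteed in general) is to number rows with row_number over monotonically_increasing_id:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Note: a window with no partitioning pulls all rows into a single partition
w = Window.orderBy(F.monotonically_increasing_id())

df1 = df.withColumn('row_num', F.row_number().over(w))
df2 = df_x4.withColumn('row_num', F.row_number().over(w))

outputdf = df1.join(df2, on='row_num').drop('row_num')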
I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## | | 2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## |null| 2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo| 1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:
from functools import reduce  # needed on Python 3

to_convert = set([...])  # Some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]
testDF.select(*exprs)
If you want to specifically operate on string fields please check the answer by robin-loxley.
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
Simply adding on top of zero323's and soulmachine's answers, here is how to collect all StringType fields to convert:
from pyspark.sql.types import StringType

string_fields = []
for i, f in enumerate(test_df.schema.fields):
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
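To actually apply the conversion, the collected field names can then be plugged into zero323's comprehension above (a small completion of this snippet, reusing the blank_as_null helper defined earlier):

exprs = [
    blank_as_null(f).alias(f) if f in string_fields else f
    for f in test_df.columns
]
test_df = test_df.select(*exprs)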
My solution is much better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:
// Replace empty Strings with null values
private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }

  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
I learned this trick from @liancheng.
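For completeness, a possible Python rewrite of the function above (my sketch, not part of the original answer):

from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Replace empty strings with null values in every string column
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType())).otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType) else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)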
If you are using Python, you can check the following.
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| | |
| |name3|null|
+----+-----+----+
from pyspark.sql.functions import col, when

def convertToNull(dfa):
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == '', None).otherwise(col(i)))
    return dfa

convertToNull(dfa).show()
+----+-----+----+
| id| name| age|
+----+-----+----+
|null|name1| 50|
| 2| null|null|
|null|name3|null|
+----+-----+----+
I would add a trim to @zero323's solution to deal with cases of multiple white spaces:
def blank_as_null(x):
    return when(trim(col(x)) != "", col(x))
Thanks to @zero323, @Tomerikoo and @Robin Loxley.
Ready-to-use function:
def convert_blank_to_null(df, cols=None):
    from pyspark.sql.functions import col, when, trim
    from pyspark.sql.types import StringType

    def blank_as_null(x):
        return when(trim(col(x)) == "", None).otherwise(col(x))

    # Don't know how to parallelize this
    for f in (df.select(cols) if cols else df).schema.fields:
        if isinstance(f.dataType, StringType):
            df = df.withColumn(f.name, blank_as_null(f.name))

    return df
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}
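For what it's worth, a fairly close Python translation is possible with functools.reduce standing in for foldLeft (my sketch, not from the original answer):

from functools import reduce
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def empty_strings_to_none(df):
    # Fold over the schema, rewriting each string column in turn
    return reduce(
        lambda current, field: current.withColumn(
            field.name,
            when(length(col(field.name)) == 0, lit(None)).otherwise(col(field.name)),
        ) if isinstance(field.dataType, StringType) else current,
        df.schema.fields,
        df,
    )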