s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)
+--------------------+
| request|
+--------------------+
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
+--------------------+
I would like to get the following:
+--------------------+---------------+
| request| sub|
+--------------------+---------------+
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}| {'num':123}|
+--------------------+---------------+
I wrote the code below, but it throws an error:
TypeError: Column is not iterable
df = df.withColumn(
    "sub",
    substring(col('request'), locate('{', col('request')),
              length(col('request')) - locate('{', col('request'))))
df.show()
Can someone please help me?
You need to use the substring function in a SQL expression in order to pass columns for the position and length arguments. Note also that you need to add 1 to the length to get the correct result:
import pyspark.sql.functions as F
df = df.withColumn(
    "json",
    F.expr("substring(request, locate('{',request), length(request) - locate('{', request) + 1)")
)
df.show()
#+--------------------+---------------+
#| request| json|
#+--------------------+---------------+
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}| {'num':123}|
#+--------------------+---------------+
You could also consider using regexp_extract function instead of substring like this:
df = df.withColumn(
    "json",
    F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)
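Another option worth noting (a sketch, not part of the answers above) is the Column.substr method, which accepts Column arguments for both the start position and the length, so it sidesteps the original error directly:
from pyspark.sql.functions import col, length, locate

# Sketch: Column.substr takes either two ints or two Columns, so both the
# position and the length can be computed from the data itself.
df = df.withColumn(
    "sub",
    col("request").substr(
        locate("{", col("request")),
        length(col("request")) - locate("{", col("request")) + 1,
    ),
)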
I have this method where I am gathering positive values
def pos_values(df, metrics):
    num_pos_values = df.where(df.ttu > 1).count()
    df.withColumn("loader_ttu_pos_value", num_pos_values)
    df.write.json(metrics)
However, I get TypeError: col should be Column whenever I go to test it. I tried casting it, but that doesn't seem to be an option.
The reason you're getting this error is that df.withColumn expects a Column object as its second argument, whereas you're passing num_pos_values, which is an integer.
If you want to assign a literal value to a column (you'll have the same value for each row), you can use the lit function of pyspark.sql.functions.
Something like this works:
df = spark.createDataFrame([("2022", "January"), ("2021", "December")], ["Year", "Month"])
df.show()
+----+--------+
|Year| Month|
+----+--------+
|2022| January|
|2021|December|
+----+--------+
from pyspark.sql.functions import lit
df.withColumn("testColumn", lit(5)).show()
+----+--------+----------+
|Year| Month|testColumn|
+----+--------+----------+
|2022| January| 5|
|2021|December| 5|
+----+--------+----------+
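Applied to the original pos_values function, a minimal sketch (keeping the question's column and argument names) would be:
from pyspark.sql.functions import lit

def pos_values(df, metrics):
    # count() returns a plain Python int...
    num_pos_values = df.where(df.ttu > 1).count()
    # ...so wrap it in lit() to turn it into a Column before passing it to withColumn
    df = df.withColumn("loader_ttu_pos_value", lit(num_pos_values))
    df.write.json(metrics)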
I want to use the ROUND function like this:
CAST(ROUND(CostAmt,ISNULL(CurrencyDecimalPlaceNum)) AS decimal(32,8))
in pyspark.
In the DataFrame and SQL APIs, the ROUND function takes a column as its first argument and an int as its second, but I want to pass another column as the second argument.
If I try to use a column as the second argument, it gives the error "column is not callable".
Pyspark code:
round(
    col("CostAmt"),
    coalesce(col("CurrencyDecimalPlaceNum").cast(IntegerType()), lit(2)),
).cast(DecimalType(23, 6))
How do I solve this issue?
The round() function takes a column and an int as arguments (see the docs). The problem is that you are passing two columns as arguments, since coalesce returns a column.
I'm not sure how to do it using coalesce; I would use a UDF, create a function that rounds the number, and then apply it to both columns like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
def round_value(value, scale):
    if scale is None:
        scale = 2
    return round(value, scale)

if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    df = spark.createDataFrame(
        [
            (1, 1, 2.3445),
            (2, None, 168.454523),
            (3, 4, 3500.345354),
        ],
        ["id", "CurrencyDecimalPlaceNum", "float_col"],
    )
    round_udf = F.udf(lambda x, y: round_value(x, y))
    df = df.withColumn(
        "round",
        round_udf(
            F.col("float_col"),
            F.col("CurrencyDecimalPlaceNum"),
        ),
    )
    df.show()
Result:
+---+-----------------------+-----------+---------+
| id|CurrencyDecimalPlaceNum| float_col| round|
+---+-----------------------+-----------+---------+
| 1| 1| 2.3445| 2.3|
| 2| null| 168.454523| 168.45|
| 3| 4|3500.345354|3500.3454|
+---+-----------------------+-----------+---------+
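One caveat worth noting: F.udf without an explicit return type produces a string column. If you need a numeric column downstream, you could declare the return type, for example (a small sketch on top of the code above):
from pyspark.sql.types import DoubleType

# Same UDF, but with an explicit numeric return type so the "round"
# column is a double instead of a string.
round_udf = F.udf(lambda x, y: round_value(x, y), DoubleType())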
I have created a dataframe as shown below:
import ast
from pyspark.sql.functions import udf
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+---------------+---+
|           list|  A|
+---------------+---+
|['2','4','713']| 10|
|   ['12','245']| 20|
|   ['101','12']| 30|
+---------------+---+
How can I convert the above dataframe such that each element in the list is a float and is within a proper list?
I tried the below one :
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))

df2 = amp_conversion(df)
But the data remains unchanged.
I don't want to convert the dataframe to pandas or use collect, as it is memory intensive. If possible, please give me an optimal solution. I am using PySpark.
That's because you forgot to specify the return type of the udf:
udf(lambda row: ast.literal_eval(str(row)), "array<integer>")
Though something like this would be more efficient:
from pyspark.sql.functions import regexp_replace, split

df = spark.createDataFrame(["""u'[23,4,77,890,4]"""], "string").toDF("list")
df.select(split(
    regexp_replace("list", "^u'\\[|\\]$", ""), ","
).cast("array<integer>").alias("list")).show()
# +-------------------+
# | list|
# +-------------------+
# |[23, 4, 77, 890, 4]|
# +-------------------+
I can produce the correct result in Python 3 with a small change in the definition of df_amp_conversion: you didn't return the value of df_modelamp! This code works properly for me:
import ast
from pyspark.sql.functions import udf, col
values = [(u"['2','4','713']",10),(u"['12','245']",20),(u"['101','12']",30)]
df = sqlContext.createDataFrame(values,['list','A'])
def df_amp_conversion(df_modelamp):
    string_list_to_list = udf(lambda row: ast.literal_eval(str(row)))
    df_modelamp = df_modelamp.withColumn('float_list', string_list_to_list(col("list")))
    return df_modelamp
df2 = df_amp_conversion(df)
df2.show()
# +---------------+---+-----------+
# | list| A| float_list|
# +---------------+---+-----------+
# |['2','4','713']| 10|[2, 4, 713]|
# | ['12','245']| 20| [12, 245]|
# | ['101','12']| 30| [101, 12]|
# +---------------+---+-----------+
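Since the question asks for floats, a small variant of the same udf (an illustrative sketch, not part of the original answer) can convert the elements and declare the element type explicitly:
from pyspark.sql.types import ArrayType, FloatType

# Parse the string, convert each element to float, and declare the return
# type so Spark produces a real array<float> column.
string_list_to_floats = udf(
    lambda row: [float(x) for x in ast.literal_eval(str(row))],
    ArrayType(FloatType()),
)
df3 = df.withColumn('float_list', string_list_to_floats(col("list")))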
I have a dataframe in PySpark with a string column with value [{"AppId":"APACON","ExtId":"141730"}] (the string is exactly like that in my column, it is a string, not an array)
I want to convert this to an array of struct.
Can I do that simply with native Spark functions, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
With Spark 2.1 or above
You have the following data:
import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
)
you can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])

df = df.withColumn('array', F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
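If you actually need an array of structs (as asked) rather than a single struct, more recent Spark versions (2.4+, as far as I know) also accept an ArrayType schema in from_json; a sketch:
# Sketch: wrap the struct schema in an ArrayType to parse the whole JSON array.
array_schema = ArrayType(StructType([StructField("AppId", StringType()),
                                     StructField("ExtId", StringType())]))
df = df.withColumn('array', F.from_json(F.col('txt'), array_schema))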
Version < Spark 2.1
One way to bypass the issue would be to first slightly modify your input string to strip the enclosing square brackets:
# Use regexp_extract to ignore the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
+---+-------------------------------------+-----------------------------------+
|idx|txt |txt_parsed |
+---+-------------------------------------+-----------------------------------+
|1 |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2 |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
Then you could use pyspark.sql.functions.get_json_object to parse the bracket-free string. Overwriting txt with the parsed value keeps the rest of the code unchanged:
df = df.withColumn('txt', F.col('txt_parsed')).drop('txt_parsed')
df = df.withColumn('AppId', F.get_json_object(df.txt, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt, '$.ExtId'))
df.show()
+---+--------------------+------+------+
|idx| txt| AppId| ExtId|
+---+--------------------+------+------+
| 1|{"AppId":"APACON"...|APACON|141730|
| 2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
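As a side note (worth verifying on your Spark version), get_json_object also understands array subscripts in the JSON path, which might let you skip the bracket-stripping step entirely when applied to the original bracketed txt column:
# Hypothetical shortcut: index into the JSON array directly with $[0].
df = df.withColumn('AppId', F.get_json_object(df.txt, '$[0].AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt, '$[0].ExtId'))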
This creates my example dataframe:
from pyspark.sql.functions import lit

df = sc.parallelize([('abc',), ('def',)]).toDF()
df = df.selectExpr("_1 as one")
df = df.withColumn("two", lit('z'))
df.show()
looking like this:
+---+---+
|one|two|
+---+---+
|abc| z|
|def| z|
+---+---+
Now what I want to do is a series of SQL LIKE-style checks, where each letter is appended to column two if column one matches it.
in "pseudo code" it looks like this:
for letter in ['a','b','c','d']:
    df = df['two'].where(col('one').like("%{}%".format(letter))) += letter
finally resulting in a df looking like this:
+---+----+
|one| two|
+---+----+
|abc|zabc|
|def| zd|
+---+----+
If you are using a list of strings to subset your string column, broadcast variables are your best option. Let's start with a more realistic example where your strings still contain spaces:
from pyspark.sql.functions import lit, udf

df = sc.parallelize([('a b c',), ('d e f',)]).toDF()
df = df.selectExpr("_1 as one")
df = df.withColumn("two", lit('z'))
Then we create a broadcast variable from a list of letters, and define a udf that uses it to subset a list of strings and concatenate the matches with the value in another column, returning one string:
letters = ['a','b','c','d']
letters_bd = sc.broadcast(letters)

def subs(col1, col2):
    l_subset = [x for x in col1 if x in letters_bd.value]
    return col2 + ' ' + ' '.join(l_subset)

subs_udf = udf(subs)
To apply the above, the string we are subsetting needs to be converted to a list, so we use split() first and then apply our udf:
from pyspark.sql.functions import col, split
df.withColumn("three", split(col('one'), r'\W+')) \
.withColumn("three", subs_udf("three", "two")) \
.show()
+-----+---+-------+
| one|two| three|
+-----+---+-------+
|a b c| z|z a b c|
|d e f| z| z d|
+-----+---+-------+
Or, without a udf, you can use regexp_replace and concat if your letters can comfortably fit into the regex expression:
from pyspark.sql.functions import regexp_replace, col, concat, lit

df.withColumn("three", concat(col('two'), lit(' '),
                              regexp_replace(col('one'), '[^abcd]', ' ')))
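And going back to the question's original df (values 'abc' and 'def', no spaces), simply deleting the non-matching characters before concatenating reproduces the asked-for zabc / zd result shown above; a sketch:
from pyspark.sql.functions import regexp_replace, col, concat

# Assumes the question's original df: one = 'abc' / 'def', two = 'z'.
df.withColumn("two", concat(col('two'), regexp_replace(col('one'), '[^abcd]', ''))).show()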