The following is a sample of my Spark DataFrame with the printSchema below it:
+--------------------+---+------+------+--------------------+
| device_id|age|gender| group| apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24| M|M23-26| null|
|-8965335561582270637| 28| F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21| M| M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21| M| M22-| null|
|-8910497777165914301| 25| F|F24-26| null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows
root
|-- device_id: long (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- group: string (nullable = true)
|-- apps: vector (nullable = true)
I'm trying to fill the nulls in the 'apps' column with np.zeros(19237). However, when I execute
df.fillna({'apps': np.zeros(19237)})
I get an error
Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList
Or if I try
df.fillna({'apps': DenseVector(np.zeros(19237))})
I get an error
AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'
Any ideas?
DataFrameNaFunctions supports only a subset of native types (no UDTs), so you'll need a UDF here.
from pyspark.sql.functions import coalesce, col, udf
from pyspark.ml.linalg import Vectors, VectorUDT
def zeros(n):
    def zeros_():
        return Vectors.sparse(n, {})
    return udf(zeros_, VectorUDT())()
Example usage:
df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)],
    ("device_id", "apps"))
df.withColumn("apps", coalesce(col("apps"), zeros(3))).show()
+---------+-------------+
|device_id| apps|
+---------+-------------+
| 1|[1.0,2.0,3.0]|
| 2| (3,[],[])|
+---------+-------------+
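If you specifically want a dense vector of zeros (the direct analogue of np.zeros) rather than an empty sparse vector, the same UDF pattern works; note that for a length like 19237 the sparse version above is much cheaper to store. A minimal sketch, with dense_zeros being a hypothetical helper name:
from pyspark.sql.functions import coalesce, col, udf
from pyspark.ml.linalg import Vectors, VectorUDT

def dense_zeros(n):
    def zeros_():
        # a dense vector of n zeros, equivalent to np.zeros(n)
        return Vectors.dense([0.0] * n)
    return udf(zeros_, VectorUDT())()

df.withColumn("apps", coalesce(col("apps"), dense_zeros(3))).show()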
Related
I have a dataframe and I'm doing this:
df = dataframe.withColumn("test", lit(0.4219759403))
I want to keep just the first four digits after the decimal point, without rounding.
When I cast to DecimalType with .cast(DataTypes.createDecimalType(20,4)),
or even with the round function, the number is rounded to 0.4220.
The only way that I found without rounding is applying the function format_number(), but this function gives me a string, and when I cast this string to DecimalType(20,4), the framework rounds the number again to 0.4220.
I need to convert this number to DecimalType(20,4) without rounding, and I expect to see 0.4219.
If you have numbers with more than one digit before the decimal point, a fixed-length substring is not suitable. Instead, you can use a regex to always extract the first four decimal digits (if present).
You can do this using regexp_extract:
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
Example
import pyspark.sql.functions as F
dataframe = spark.createDataFrame([
    (0.4219759403, ),
    (0.4, ),
    (1.0, ),
    (0.5431293, ),
    (123.769859, )
], ['test'])
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), r'\d+\.\d{0,4}', 0))
df.show()
+------------+--------+
| test| rounded|
+------------+--------+
|0.4219759403| 0.4219|
| 0.4| 0.4|
| 1.0| 1.0|
| 0.5431293| 0.5431|
| 123.769859|123.7698|
+------------+--------+
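Since the extracted string keeps at most four decimal digits, casting it to DecimalType(20,4) afterwards no longer rounds anything away. A minimal sketch (the column name rounded_dec is just illustrative):
from pyspark.sql.types import DecimalType

df = df.withColumn('rounded_dec', F.col('rounded').cast(DecimalType(20, 4)))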
Hi, welcome to Stack Overflow.
Next time, please try to provide a reproducible example with the code you tried. Anyway, this works for me:
import pyspark.sql.functions as F
from pyspark.sql.functions import lit
from pyspark.sql.types import DecimalType
df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"),
], ["ID", "Text"])
df = df.withColumn("test", lit(0.4219759403))
df = df.withColumn("test_string", F.substring(df["test"].cast("string"), 0, 6))
df = df.withColumn("test_string_decimaltype", df["test_string"].cast(DecimalType(20,4)))
df.show()
df.printSchema()
+---+----+------------+-----------+-----------------------+
| ID|Text| test|test_string|test_string_decimaltype|
+---+----+------------+-----------+-----------------------+
| 1| a|0.4219759403| 0.4219| 0.4219|
| 2| b|0.4219759403| 0.4219| 0.4219|
| 3| c|0.4219759403| 0.4219| 0.4219|
+---+----+------------+-----------+-----------------------+
root
|-- ID: long (nullable = true)
|-- Text: string (nullable = true)
|-- test: double (nullable = false)
|-- test_string: string (nullable = false)
|-- test_string_decimaltype: decimal(20,4) (nullable = true)
Of course, if you want, you can overwrite the same column by always using "test"; I chose different names so you can see the steps.
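As the regexp_extract answer above notes, a fixed-length substring breaks down when there is more than one digit before the decimal point; a small illustration (df2 is just a throwaway example):
df2 = spark.createDataFrame([(123.769859,)], ["test"])
# the first 6 characters of "123.769859" are "123.76" -- only two decimal digits survive
df2.select(F.substring(F.col("test").cast("string"), 0, 6).alias("test_string")).show()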
I would like to create an empty DataFrame whose schema matches an existing PySpark DataFrame. I tried using StructType manually.
To create an empty DataFrame, call spark.createDataFrame with an empty list and the schema object from the original DataFrame:
df = spark.createDataFrame([(1, 1)], ('foo', 'bar'))
df.printSchema()
# root
# |-- foo: long (nullable = true)
# |-- bar: long (nullable = true)
df.show()
# +---+---+
# |foo|bar|
# +---+---+
# | 1| 1|
# +---+---+
empty_df = spark.createDataFrame([], df.schema)
empty_df.printSchema()
# root
# |-- foo: long (nullable = true)
# |-- bar: long (nullable = true)
empty_df.show()
# +---+---+
# |foo|bar|
# +---+---+
# +---+---+
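An equivalent sketch builds the empty DataFrame from an empty RDD instead of an empty list; the resulting schema is the same:
empty_df2 = spark.createDataFrame(spark.sparkContext.emptyRDD(), df.schema)
empty_df2.printSchema()
# root
# |-- foo: long (nullable = true)
# |-- bar: long (nullable = true)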
I have a dataframe with a column "EVENT_ID" whose datatype is String.
I am running the FPGrowth algorithm, but it throws the error below:
Py4JJavaError: An error occurred while calling o1711.fit.
:java.lang.IllegalArgumentException: requirement failed:
The input column must be array, but got string.
The column EVENT_ID has values
E_34503_Probe
E_35203_In
E_31901_Cbc
I am using the code below to convert the string column to ArrayType:
df2 = df.withColumn("EVENT_ID", df["EVENT_ID"].cast(types.ArrayType(types.StringType())))
But I get the following error
Py4JJavaError: An error occurred while calling o1874.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`EVENT_ID`' due to data type mismatch: cannot cast string to array<string>;;
How do I either cast this column to array type or run the FPGrowth algorithm with string type?
Original answer
Try the following.
In [0]: from pyspark.sql.types import StringType
from pyspark.sql.functions import col, regexp_replace, split
In [1]: df = spark.createDataFrame(["E_34503_Probe", "E_35203_In", "E_31901_Cbc"], StringType()).toDF("EVENT_ID")
df.show()
Out [1]: +-------------+
| EVENT_ID|
+-------------+
|E_34503_Probe|
| E_35203_In|
| E_31901_Cbc|
+-------------+
In [2]: df_new = df.withColumn("EVENT_ID", split(regexp_replace(col("EVENT_ID"), r"(^\[)|(\]$)|(')", ""), ", "))
df_new.printSchema()
Out [2]: root
|-- EVENT_ID: array (nullable = true)
| |-- element: string (containsNull = true)
I hope it will be helpful.
Edited answer
As @pault pointed out in his comment, a much easier solution is the following:
In [0]: from pyspark.sql.types import StringType
from pyspark.sql.functions import array
In [1]: df = spark.createDataFrame(["E_34503_Probe", "E_35203_In", "E_31901_Cbc"], StringType()).toDF("EVENT_ID")
df.show()
Out [1]: +-------------+
| EVENT_ID|
+-------------+
|E_34503_Probe|
| E_35203_In|
| E_31901_Cbc|
+-------------+
In [2]: df_new = df.withColumn("EVENT_ID", array(df["EVENT_ID"]))
df_new.printSchema()
Out [2]: root
|-- EVENT_ID: array (nullable = false)
| |-- element: string (containsNull = true)
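With EVENT_ID now an array column, FPGrowth can be fitted on it. A minimal sketch (the minSupport and minConfidence values are placeholders):
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="EVENT_ID", minSupport=0.2, minConfidence=0.6)
model = fp.fit(df_new)
model.freqItemsets.show()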
My question is: when converting from an RDD to a DataFrame in PySpark, does the schema depend on the first row?
data1 = [('A','abc',0.1,'',0.562),('B','def',0.15,0.5,0.123),('A','ghi',0.2,0.2,0.1345),('B','jkl','',0.1,0.642),('B','mno',0.1,0.1,'')]
>>> val1=sc.parallelize(data1).toDF()
>>> val1.show()
+---+---+----+---+------+
| _1| _2| _3| _4| _5|
+---+---+----+---+------+
| A|abc| 0.1|   | 0.562| <------ Does it depend on the type of this row?
| B|def|0.15|0.5| 0.123|
| A|ghi| 0.2|0.2|0.1345|
| B|jkl|null|0.1| 0.642|
| B|mno| 0.1|0.1| null|
+---+---+----+---+------+
>>> val1.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: double (nullable = true)
|-- _4: string (nullable = true)
|-- _5: double (nullable = true)
As you can see, column _4 should have been double, but it is treated as a string.
Any suggestions would be helpful.
Thanks!
@Prathik, I think you are right.
toDF() is a shorthand for spark.createDataFrame(rdd, schema, sampleRatio).
Here's the signature for createDataFrame:
def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True)
So by default, the parameters schema and samplingRatio are None.
According to the doc:
If schema inference is needed, samplingRatio is used to determine the ratio of
rows used for schema inference. The first row will be used if samplingRatio is None.
So by default, toDF() will use the first row to infer the data types, which is why it infers StringType for column _4 but DoubleType for column _5.
Here you can't simply specify the schema as DoubleType for columns _4 and _5, since they contain strings.
But you can try setting sampleRatio to 0.3 as below:
data1 = [('A','abc',0.1,'',0.562),('B','def',0.15,0.5,0.123),('A','ghi',0.2,0.2,0.1345),('B','jkl','',0.1,0.642),('B','mno',0.1,0.1,'')]
val1=sc.parallelize(data1).toDF(sampleRatio=0.3)
val1.show()
val1.printSchema()
Sometimes the above code will throw an error if it happens to sample a row containing a string:
Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
but if you are patient and try a few more times (< 10 for me), you may get something like this. You can see that both columns _4 and _5 are now double, because, by luck, the sampling picked numeric values for those columns while running createDataFrame.
+---+---+----+----+------+
| _1| _2| _3| _4| _5|
+---+---+----+----+------+
| A|abc| 0.1|null| 0.562|
| B|def|0.15| 0.5| 0.123|
| A|ghi| 0.2| 0.2|0.1345|
| B|jkl|null| 0.1| 0.642|
| B|mno| 0.1| 0.1| null|
+---+---+----+----+------+
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: double (nullable = true)
|-- _4: double (nullable = true)
|-- _5: double (nullable = true)
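If you don't want to rely on sampling at all, you can convert the empty strings to None first and pass an explicit schema. A sketch under the assumption that '' should become null:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("_1", StringType(), True),
    StructField("_2", StringType(), True),
    StructField("_3", DoubleType(), True),
    StructField("_4", DoubleType(), True),
    StructField("_5", DoubleType(), True),
])

# turn '' into None so the numeric columns can be declared as doubles
cleaned = sc.parallelize(data1).map(lambda row: tuple(None if v == '' else v for v in row))
val1 = spark.createDataFrame(cleaned, schema)
val1.printSchema()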
With a dataframe as follows:
from pyspark.sql.functions import avg, first
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])
df_data.show()
I do a pivot on it,
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost"), first("ship")).show()
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| id|type|201601_avg(cost)|201601_first(ship)()|201602_avg(cost)|201602_first(ship)()|201603_avg(cost)|201603_first(ship)()|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+----------------+--------------------+----------------+--------------------+----------------+--------------------+
But I get really complicated names for the columns. Applying an alias to the aggregation usually works, but in this case, because of the pivot, the names are even worse:
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| id|type|201601_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201601_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201602_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201602_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|201603_(avg(cost),mode=Complete,isDistinct=false) AS cost#1619|201603_(first(ship)(),mode=Complete,isDistinct=false) AS ship#1620|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
+---+----+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+--------------------------------------------------------------+------------------------------------------------------------------+
Is there a way to rename the column names on the fly on the pivot and aggregation?
You can alias the aggregations directly:
pivoted = df_data \
    .groupby(df_data.id, df_data.type) \
    .pivot("date") \
    .agg(
        avg('cost').alias('cost'),
        first("ship").alias('ship')
    )
pivoted.printSchema()
##root
##|-- id: long (nullable = true)
##|-- type: string (nullable = true)
##|-- 201601_cost: double (nullable = true)
##|-- 201601_ship: string (nullable = true)
##|-- 201602_cost: double (nullable = true)
##|-- 201602_ship: string (nullable = true)
##|-- 201603_cost: double (nullable = true)
##|-- 201603_ship: string (nullable = true)
A simple regular expression should do the trick:
import re
def clean_names(df):
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])
pivoted = df_data.groupby(...).pivot(...).agg(...)
clean_names(pivoted).printSchema()
## root
## |-- id: long (nullable = true)
## |-- type: string (nullable = true)
## |-- 201601_cost: double (nullable = true)
## |-- 201601_ship: string (nullable = true)
## |-- 201602_cost: double (nullable = true)
## |-- 201602_ship: string (nullable = true)
## |-- 201603_cost: double (nullable = true)
## |-- 201603_ship: string (nullable = true)
If you want to preserve the function name, change the substitution pattern to, for example, \1_\2_\3.
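For instance (illustrative only):
import re

p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
p.sub(r"\1_\3", "201601_avg(cost)")     # '201601_cost'
p.sub(r"\1_\2_\3", "201601_avg(cost)")  # '201601_avg_cost'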
A simple approach is to use alias after the aggregate function.
I start with the df_data Spark DataFrame you created.
df_data.groupby(df_data.id, df_data.type).pivot("date").agg(avg("cost").alias("avg_cost"), first("ship").alias("first_ship")).show()
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| id|type|201601_avg_cost|201601_first_ship|201602_avg_cost|201602_first_ship|201603_avg_cost|201603_first_ship|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
| 1| B| 3213.0| PORT| 3213.0| DOCK| null| null|
| 2| C| 2321.0| DOCK| null| null| null| null|
| 0| A| 422.0| DOCK| 22.0| PORT| 223.0| PORT|
+---+----+---------------+-----------------+---------------+-----------------+---------------+-----------------+
The column names will have the form "pivotvalue_aliasedname". In your case, the pivot value is 201601, the aliased name is avg_cost, and the resulting column name is 201601_avg_cost (joined by an underscore "_").
I wrote an easy and fast function to do this. Enjoy! :)
# This function efficiently renames a pivot table's ugly column names
def rename_pivot_cols(rename_df, remove_agg):
    """Change a Spark pivot table's default ugly column names with ease.
    Option 1: remove_agg = True: `2_sum(sum_amt)` --> `sum_amt_2`.
    Option 2: remove_agg = False: `2_sum(sum_amt)` --> `sum_sum_amt_2`
    """
    for column in rename_df.columns:
        if remove_agg:
            start_index = column.find('(')
            end_index = column.find(')')
            if (start_index > 0 and end_index > 0):
                rename_df = rename_df.withColumnRenamed(
                    column, column[start_index + 1:end_index] + '_' + column[:1])
        elif '(' in column:  # skip non-pivot columns such as the group-by keys
            new_column = column.replace('(', '_').replace(')', '')
            rename_df = rename_df.withColumnRenamed(
                column, new_column[2:] + '_' + new_column[:1])
    return rename_df
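A hedged usage sketch (pivoted_df stands for any pivot result with columns like 2_sum(sum_amt); note the function keeps only the first character of the prefix, so it fits single-character pivot values as in the docstring):
renamed = rename_pivot_cols(pivoted_df, remove_agg=True)   # `2_sum(sum_amt)` -> `sum_amt_2`
renamed = rename_pivot_cols(pivoted_df, remove_agg=False)  # `2_sum(sum_amt)` -> `sum_sum_amt_2`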
A modified version of zero323's answer, for Spark 2.4:
import re
def clean_names(df):
    p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)(,\s\w+)\)(:\s\w+)?")
    return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])
The current column names look like 0_first(is_flashsale, false): int.
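A quick illustrative check of the modified pattern on that kind of name:
import re

p = re.compile(r"^(\w+?)_([a-z]+)\((\w+)(,\s\w+)\)(:\s\w+)?")
p.sub(r"\1_\3", "0_first(is_flashsale, false)")  # '0_is_flashsale'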