Elegant way to fillna missing values for dates in spark - python

Let me break this problem down into a smaller chunk. I have a DataFrame in PySpark, where I have a column arrival_date in date format -
from pyspark.sql.functions import col, lit, to_date, when
values = [('22.05.2016',), ('13.07.2010',), ('15.09.2012',), (None,)]
df = sqlContext.createDataFrame(values, ['arrival_date'])
# The following line converts the string column into date format
df = df.withColumn('arrival_date', to_date(col('arrival_date'), 'dd.MM.yyyy'))
df.show()
+------------+
|arrival_date|
+------------+
|  2016-05-22|
|  2010-07-13|
|  2012-09-15|
|        null|
+------------+
df.printSchema()
root
|-- arrival_date: date (nullable = true)
After applying a lot of transformations to the DataFrame, I finally wish to fill in the missing dates, marked as null, with 1900-01-01.
One method is to convert the column arrival_date to string, replace the missing values with df.fillna('1900-01-01', subset=['arrival_date']), and finally reconvert the column with to_date. This is very inelegant.
The following code line doesn't work, as expected, and I get an error:
df = df.fillna(to_date(lit('1900-01-01'),'yyyy-MM-dd'), subset=['arrival_date'])
The documentation says: "The value must be of the following type: Int, Long, Float, Double, String, Boolean."
Another way is by using withColumn() and when() -
df = df.withColumn('arrival_date',when(col('arrival_date').isNull(),to_date(lit('01.01.1900'),'dd.MM.yyyy')).otherwise(col('arrival_date')))
Is there a way to directly assign a date of my choice to a date-formatted column using some function?
Does anyone have a better suggestion?

The second way should be the way to do it, but you don't have to use to_date to transform between string and date; just use datetime.date(1900, 1, 1):
import datetime as dt
from pyspark.sql.functions import col, when

df = df.withColumn(
    'arrival_date',
    when(col('arrival_date').isNull(), dt.date(1900, 1, 1)).otherwise(col('arrival_date'))
)
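For the record, a slightly shorter variant (a sketch, not from the answer) uses coalesce, which returns the first non-null value per row; it assumes, as the answer above does, that a datetime.date literal is accepted:
import datetime as dt
from pyspark.sql.functions import coalesce, col, lit

# keep the existing date where present, otherwise fall back to 1900-01-01
df = df.withColumn('arrival_date', coalesce(col('arrival_date'), lit(dt.date(1900, 1, 1))))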

Related

How to Append Pyspark Dataframe with Numpy Array?

I am new to PySpark and am trying to append a numpy array to a dataframe.
I have a numpy array as:
print(category_dimension_vectors)
[[ 5.19333403e-01 -3.36615935e-01 -6.93262848e-02 2.37293671e-01]
[ 4.45220874e-01 1.30108798e-01 1.12913839e-01 1.87161517e-01]]
I would like to append this to a pyspark dataframe as a new column, where each row of the array is stored in its corresponding row of the dataframe.
The number of rows in the array and the number of rows in the dataframe are equal.
This is what I have tried first:
arr_rows = udf(lambda row: category_dimension_vectors[row,:], ArrayType(DecimalType()))
df = df.withColumn("category_dimensions_reduced", arr_rows(df))
Getting the error:
TypeError: Invalid argument, not a string or column
Then I have tried
df = df.withColumn("category_dimensions_reduced", lit(category_dimension_vectors))
But got the error:
org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE]
What I am trying to achieve is:
+----+----+-----------------------------------------------------------------+
| a| b|category_dimension_vectors |
+----+----+-----------------------------------------------------------------+
|foo | 1|[5.19333403e-01,-3.36615935e-01,-6.93262848e-02,2.37293671e-01] |
|bar | 2|[4.45220874e-01,1.30108798e-01,1.12913839e-01,1.87161517e-01] |
+----+----+-----------------------------------------------------------------+
How should I approach this problem?
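One possible approach (a sketch, not from the thread, assuming the DataFrame's row order matches the array's row order) is to give both the DataFrame and the array an explicit row index and then join on it:
# index the existing rows; zipWithIndex yields (Row, index) pairs
indexed_df = (df.rdd.zipWithIndex()
                .toDF(["row", "row_idx"])
                .select("row.*", "row_idx"))

# build a small DataFrame from the numpy array with the same index
arr_df = spark.createDataFrame(
    [(i, [float(v) for v in row]) for i, row in enumerate(category_dimension_vectors)],
    ["row_idx", "category_dimensions_reduced"])

result = indexed_df.join(arr_df, "row_idx").drop("row_idx")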

Python or SQL to convert the Data Type - Map to String for a column

I have the below column in a table called 'test'.
How can I get the 'id' and 'value' (e.g. for 'id' = 2, I should get the value '24', and null for the other two ids) from the given table?
The 'data type' for the column 'age' is 'Map' and I'm not sure how to deal with this.
A simple query in Python or SQL or any leads is much appreciated. Also, please advise on the packages to import.
The explode function will "explode" your map into key and value pairs; then you can use them in any way you want.
from pyspark.sql import functions as F
(df
.select('id', F.explode('age').alias('k', 'v'))
.show()
)
+---+---+----+
| id|  k|   v|
+---+---+----+
|  2|age|  24|
|  3|age|null|
+---+---+----+
You can get it in SQL or Python. In Python you can try:
agecolumn = age.replace("{", "").replace("}", "").split("=")
if agecolumn[1].strip():
    pass  # do something with the value
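Alternatively (a sketch, not from the answers, assuming the map key is literally 'age' as in the exploded output above), the value can be read straight out of the map column without exploding:
from pyspark.sql import functions as F

df.select('id', F.col('age').getItem('age').alias('age_value')).show()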

Merging two rows into one based on common field

I have a dataframe with the following data:
+----------+------------+-------------+---------------+----------+
|id |name |predicted |actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| null|100.10023 |2020-01-10|
| null| NirPost| 57145|null |2020-01-10|
+----------+------------+-------------+---------------+----------+
I want to merge these two rows into one, based on the name. This df is the result of a query which I've restricted to one company and a single day. In the real dataset, there are ~70 companies with daily data. I want to rewrite this data into a new table as single rows.
This is the output I'd like:
+----------+------------+-------------+---------------+----------+
|id |name |predicted | actual |yyyy_mm_dd|
+----------+------------+-------------+---------------+----------+
| 215| NirPost| 57145 |100.10023 |2020-01-10|
+----------+------------+-------------+---------------+----------+
I've tried this:
df.replace('null','').groupby('name',as_index=False).agg(''.join)
However, this outputs my original df but with NaN instead of null.
`df.dtypes`:
id float64
name object
predicted float64
actual float64
yyyy_mm_dd object
dtype: object
How about explicitly passing all the columns to the groupby aggregation with max, so that the null values are eliminated?
import pandas as pd
import numpy as np
data = {'id':[215,np.nan],'name':['nirpost','nirpost'],'predicted':[np.nan,57145],'actual':[100.12,np.nan],'yyyy_mm_dd':['2020-01-10','2020-01-10']}
df = pd.DataFrame(data)
df = df.groupby('name').agg({'id':'max','predicted':'max','actual':'max','yyyy_mm_dd':'max'}).reset_index()
print(df)
Returns:
name id predicted actual yyyy_mm_dd
0 nirpost 215.0 57145.0 100.12 2020-01-10
Of course, since you have more data, you should probably consider adding something else to your groupby so as not to collapse too many rows, but for the example data you provide, I believe this is a way to solve the issue.
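For instance (a sketch building on the example above), grouping by both name and yyyy_mm_dd keeps one row per company per day instead of collapsing rows across days:
df = (df.groupby(['name', 'yyyy_mm_dd'], as_index=False)
        .agg({'id': 'max', 'predicted': 'max', 'actual': 'max'}))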
EDIT:
If all the columns end up being named original_column_name_max, then you can simply use this:
df.columns = [x[:-4] for x in list(df)]
With the list comprehension you create a list that strips the last four characters (the _max suffix) from each value in list(df), which is the list of column names. Last, you assign it back with df.columns =

Pyspark : removing special/numeric strings from array of string

To keep it simple I have a df with the following schema:
root
|-- Event_Time: string (nullable = true)
|-- tokens: array (nullable = true)
| |-- element: string (containsNull = true)
Some of the elements of "tokens" have numbers and special characters, for example:
"431883", "r2b2", "#refe98"
Is there any way I can remove all those and keep only actual words? I want to do LDA later and want to clean my data beforehand.
I tried regexp_replace, explode, and str.replace with no success; maybe I didn't use them correctly.
Thanks
edit2:
df_2 = (df_1.select(explode(df_1.tokens).alias('elements'))
            .select(regexp_replace('elements', '\\w*\\d\\w*', ''))
)
This works only if the column is of string type; with the explode method I can explode an array into strings, but they are not in the same row anymore... Can anyone improve on this?
from pyspark.sql.functions import *
df = spark.createDataFrame([(["#a", "b", "c"],), ([],)], ['data'])
df_1 = df.withColumn('data_1', concat_ws(',', 'data'))
df_1 = df_1.withColumn("data_2", regexp_replace('data_1', "['{#]",""))
#df_1.printSchema()
df_1.show()
+----------+------+------+
| data|data_1|data_2|
+----------+------+------+
|[#a, b, c]|#a,b,c| a,b,c|
| []| | |
+----------+------+------+
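If an array is needed again afterwards (a sketch, not part of the original answer), the cleaned string can be split back on the separator used by concat_ws:
from pyspark.sql.functions import split

df_1 = df_1.withColumn("data_3", split("data_2", ","))
Note that the empty row comes back as a single empty-string element rather than an empty array.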
The solution I found (as also stated by pault in the comment section):
After explode on tokens, I groupBy and agg with collect_list to get the tokens back in the format I want.
Here is pault's comment:
After the explode, you need to groupBy and aggregate with collect_list to get the values back into a single row. Assuming Event_Time is a unique key:
df2 = (df_1
       .select("Event_Time", explode("tokens").alias("elements"))
       .select("Event_Time", regexp_replace("elements", "<your regex here>", "").alias("elements"))
       .groupBy("Event_Time")
       .agg(collect_list("elements").alias("tokens")))
Also, as stated by pault (which I didn't know), there is currently no way to iterate over an array in PySpark without using a udf or rdd.
The transform() function was added in PySpark 3.1.0, which helped me accomplish this task a little more easily. The example in the question would now look like this:
from pyspark.sql import functions as F
df_2 = df_1.withColumn("tokens",
F.expr(""" transform(tokens, x -> regexp_replace(x, '\\w*\\d\\w**')) """))

Transform string column to vector column Spark DataFrames

I have a Spark dataframe that looks as follows:
+-----------+-------------------+
|         ID|           features|
+-----------+-------------------+
|   18156431|(5,[0,1,4],[1,1,1])|
|   20260831|(5,[0,4,5],[2,1,1])|
|   91859831|    (5,[0,1],[1,3])|
|  206186631|  (5,[3,4,5],[1,5])|
|  223134831|(5,[2,3,5],[1,1,1])|
+-----------+-------------------+
In this dataframe the features column is a sparse vector. In my scripts I have to save this DF as a file on disk. When doing this, the features column is saved as a text column, for example "(5,[0,1,4],[1,1,1])".
When importing it again into Spark, the column stays a string, as you would expect. How can I convert the column back to (sparse) vector format?
Not particularly efficient (it would be a good idea to use a format that preserves types) due to the UDF overhead, but you can do something like this:
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

df = sc.parallelize([
    (18156431, "(5,[0,1,4],[1,1,1])")
]).toDF(["id", "features"])

parse = udf(lambda s: Vectors.parse(s), VectorUDT())
df.select(parse("features"))
Please note this doesn't port directly to 2.0.0+ and the ML Vector. Since ML vectors don't provide a parse method, you'd have to parse with the MLlib Vectors and call asML(), declaring the VectorUDT from pyspark.ml.linalg as the return type:
from pyspark.ml.linalg import VectorUDT as MLVectorUDT  # the ML UDT, not the MLlib one imported above

parse = udf(lambda s: Vectors.parse(s).asML(), MLVectorUDT())
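As a side note (a sketch, not from the answer): writing the DataFrame with a typed format such as Parquet preserves the vector column's type, so no string round-trip is needed when reading it back:
df.write.parquet("features.parquet")
df_back = spark.read.parquet("features.parquet")  # features comes back as a vector column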
