Split file name into different columns of pyspark dataframe

Split file name into different columns of pyspark dataframe - python

I am using pyspark SQL function input_file_name to add the input file name as a dataframe column.
df = df.withColumn("filename",input_file_name())
The column now has value like below.
"abc://dev/folder1/date=20200813/id=1"
From the above column I have to create 2 different columns.
Date
ID
I have to get only date and id from the above file name and populate it to the columns mentioned above.
I can use split_col and get it. But if the folder structure changes then it might be a problem.
Is there a way to check if the file name has string "date" and "id" as part of it and get the values after the equal to symbol and populate it two new columns ?
Below is the expected output.
filename date id
abc://dev/folder1/date=20200813/id=1 20200813 1

You could use regexp_extract with a pattern that looks at the date= and id= substrings:
df = sc.parallelize(['abc://dev/folder1/date=20200813/id=1',
'def://dev/folder25/id=3/date=20200814'])\
.map(lambda l: Row(file=l)).toDF()
+-------------------------------------+
|file |
+-------------------------------------+
|abc://dev/folder1/date=20200813/id=1 |
|def://dev/folder25/id=3/date=20200814|
+-------------------------------------+
df = df.withColumn('date', f.regexp_extract(f.col('file'), '(?<=date=)[0-9]+', 0))\
.withColumn('id', f.regexp_extract(f.col('file'), '(?<=id=)[0-9]+', 0))
df.show(truncate=False)
Which outputs:
+-------------------------------------+--------+---+
|file |date |id |
+-------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1 |20200813|1 |
|def://dev/folder25/id=3/date=20200814|20200814|3 |
+-------------------------------------+--------+---+

I have used the withcolumn and split to break the column value into date and id by creating them as columns in the same dataset , code snippet is below:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
adata = [("abc://dev/folder1/date=20200813/id=1",)]
aschema = StructType([StructField("filename",StringType(),True)])
adf = spark.createDataFrame(data=adata,schema=aschema)
bdf = adf.withColumn('date', split(adf['filename'],'date=').getItem(1)[0:8]).withColumn('id',split(adf['filename'],'id=').getItem(1))
bdf.show(truncate=False)
Which outputs to :
+------------------------------------+--------+---+
|filename |date |id |
+------------------------------------+--------+---+
|abc://dev/folder1/date=20200813/id=1|20200813|1 |
+------------------------------------+--------+---+

Related

Optimization on for loop on columns in Pyspark

I don't know if my title is very clear. I have a table with a lot columns (more than a hundred). Some of my columns contains values with brackets and I need to explode them into several rows. Here is a reproducible example:
# Import libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import *
import pandas as ps
# Create an example
columns = ["Name", "Age", "Activity", "Studies"]
data = [("Jame", 25, "[Painting,Yoga]", "[Math,Physics]"), ("Anne", 20, "[Garden,Cooking,Travel]", "[Communication,Marketing]"), ("Jane", 10, "[Gymnastique]", "[Basic School]")]
df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)
it shows the following table:
+----+---+-----------------------+-------------------------+
|Name|Age|Activity |Studies |
+----+---+-----------------------+-------------------------+
|Jame|25 |[Painting,Yoga] |[Math,Physics] |
|Anne|20 |[Garden,Cooking,Travel]|[Communication,Marketing]|
|Jane|10 |[Gymnastique] |[Basic School] |
+----+---+-----------------------+-------------------------+
I need to determine what columns contains brackets as value:
list_col = df.dtypes
df_array_col = spark.createDataFrame(list_col)\
.withColumnRenamed("_1", "Colname")\
.withColumnRenamed("_2", "TypeColumn")\
.filter(col("TypeColumn") == "string")\
.withColumn("IsBracket", lit(0))\
.toPandas()
# Function for determining what column contains brackets as a value
def func_isSquaredBracket(my_col):
A = df.select(first(col(my_col).rlike("\["), ignorenulls=True).alias(my_col))
val_IsBracket = A.select(col(my_col)).collect()[0][0]
return val_IsBracket
# For loop for applying the function
n_array = df_array_col.count()["Colname"]
for index, row in df_array_col.iterrows():
IsBracket_value = func_isSquaredBracket(df_array_col.at[index, "Colname"])
if IsBracket_value == True:
df_array_col.at[index, "IsBracket"] = 1
I succeed what columns have brackets as value. Now I can explode my table:
def func_extractStringInBracket_andSplit(my_col):
extract_string = regexp_extract(my_col, r'(?<=\[).+?(?=\])', 0).alias(my_col)
string_split = split(extract_string, "\||,").alias(my_col)
string_explode_array = explode_outer(string_split).alias(my_col)
return string_explode_array
df_explode_bracket = df
for index, row in df_array_bracket_col.iterrows():
colname = df_array_bracket_col["Colname"][index]
df_explode_bracket = df_explode_bracket.withColumn(colname, func_extractStringInBracket_andSplit(colname))
df_explode_bracket.show(truncate=False)
I obtain the result I want:
+----+---+-----------+-------------+
|Name|Age|Activity |Studies |
+----+---+-----------+-------------+
|Jame|25 |Painting |Math |
|Jame|25 |Painting |Physics |
|Jame|25 |Yoga |Math |
|Jame|25 |Yoga |Physics |
|Anne|20 |Garden |Communication|
|Anne|20 |Garden |Marketing |
|Anne|20 |Cooking |Communication|
|Anne|20 |Cooking |Marketing |
|Anne|20 |Travel |Communication|
|Anne|20 |Travel |Marketing |
|Jane|10 |Gymnastique|Basic School |
+----+---+-----------+-------------+
However, this solution is not optimized when I have more than 100 columns and it takes more than 6 minutes to get the result with the following message:
/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
I am pretty new to PySpark and I am not an expert in Python. My question is: How can I optimize the solution by using PySpark instead of Pandas? For loop is not ideal when you have the opportunity to use parallel processing.

It's actually pretty easy, use regexp_extract_all:
df = (
df.withColumn("Activity_list", F.expr(r"regexp_extract_all(Activity, '(\\w+)', 1)"))
.withColumn("Studies_list", F.expr(r"regexp_extract_all(Studies, '(\\w+)', 1)"))
)
df = (
df.drop("Activity", "Studies")
.withColumn("Activity", F.explode("Activity_list"))
.withColumn("Studies", F.explode("Studies_list"))
)
Edit: It even works with strings without brackets.

How to append a value from exploded value in dataframe in pyspark

The data is
data = [{"_id":"Inst001","Type":"AAAA", "Model001":[{"_id":"Mod001", "Name": "FFFF"},
{"_id":"Mod0011", "Name": "FFFF4"}]},
{"_id":"Inst002", "Type":"BBBB", "Model001":[{"_id":"Mod002", "Name": "DDD"}]}]
Need to frame a dataframe as follows
pid
_id
Name
Inst001
Mod001
FFFF
Inst001
Mod0011
FFFF4
Inst002
Mod002
DDD
The approach I had is
Need to explode "Model001"
Then need to append the main _id to this exploded dataframe. But how this append can be done in pyspark?
Is there any builtin method available in pyspark for the above problem?

Create a dataframe with a proper schema, and do inline on the Model001 column:
df = spark.createDataFrame(
data,
'_id string, Type string, Model001 array<struct<_id:string, Name:String>>'
).selectExpr('_id as pid', 'inline(Model001)')
df.show(truncate=False)
+-------+-------+-----+
|pid |_id |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD |
+-------+-------+-----+

Pandas UDF in pyspark

I am trying to fill a series of observation on a spark dataframe. Basically I have a list of days and I should create the missing one for each group.
In pandas there is the reindex function, which is not available in pyspark.
I tried to implement a pandas UDF:
#pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
df = df.set_index('dates')
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates, fill_value=0).ffill()
This looks like should do what I need, however it fails with this message
AttributeError: Can only use .dt accessor with datetimelike values
. What am I doing wrong here?
Here the full code:
data = spark.createDataFrame(
[(1, "2020-01-01", 0),
(1, "2020-01-03", 42),
(2, "2020-01-01", -1),
(2, "2020-01-03", -2)],
('id', 'dates', 'value'))
data = data.withColumn('dates', col('dates').cast("date"))
schema = StructType([
StructField('id', IntegerType()),
StructField('dates', DateType()),
StructField('value', DoubleType())])
#pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
df = df.set_index('dates')
dates = pd.date_range(df.index.min(), df.index.max())
return df.reindex(dates, fill_value=0).ffill()
data = data.groupby('id').apply(reindex_by_date)
Ideally I would like something like this:
+---+----------+-----+
| id| dates|value|
+---+----------+-----+
| 1|2020-01-01| 0|
| 1|2020-01-02| 0|
| 1|2020-01-03| 42|
| 2|2020-01-01| -1|
| 2|2020-01-02| 0|
| 2|2020-01-03| -2|
+---+----------+-----+

Case 1: Each ID has an individual date range.
I would try to reduce the content of the udf as much as possible. In this case I would only calculate the date range per ID in the udf. For the other parts I would use Spark native functions.
from pyspark.sql import types as T
from pyspark.sql import functions as F
# Get min and max date per ID
date_ranges = data.groupby('id').agg(F.min('dates').alias('date_min'), F.max('dates').alias('date_max'))
# Calculate the date range for each ID
#F.udf(returnType=T.ArrayType(T.DateType()))
def get_date_range(date_min, date_max):
return [t.date() for t in list(pd.date_range(date_min, date_max))]
# To get one row per potential date, we need to explode the UDF output
date_ranges = date_ranges.withColumn(
'dates',
F.explode(get_date_range(F.col('date_min'), F.col('date_max')))
)
date_ranges = date_ranges.drop('date_min', 'date_max')
# Add the value for existing entries and add 0 for others
result = date_ranges.join(
data,
['id', 'dates'],
'left'
)
result = result.fillna({'value': 0})
Case 2: All ids have the same date range
I think there is no need to use a UDF here. What you want to can be archived in a different way: First, you get all possible IDs and all necessary dates. Second, you crossJoin them, which will provide you with all possible combinations. Third, left join the original data onto the combinations. Fourth, replace the occurred null values with 0.
# Get all unique ids
ids_df = data.select('id').distinct()
# Get the date series
date_min, date_max = data.agg(F.min('dates'), F.max('dates')).collect()[0]
dates = [[t.date()] for t in list(pd.date_range(date_min, date_max))]
dates_df = spark.createDataFrame(data=dates, schema="dates:date")
# Calculate all combinations
all_comdinations = ids_df.crossJoin(dates_df)
# Add the value column
result = all_comdinations.join(
data,
['id', 'dates'],
'left'
)
# Replace all null values with 0
result = result.fillna({'value': 0})
Please be aware of the following limitiations with this solution:
crossJoins can be quite costly. One potential solution to cope with the issue can be found in this related question.
The collect statement and use of Pandas results in a not perfectly parallelised Spark transformation.
[EDIT] Split into two cases as I first thought all IDs have the same date range.

How to create dataframe with single header ( 1 row many cols) and update values to this dataframe in pyspark?

I want to create a dataframe in pyspark like the table below :
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
So, the code I tried below :
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()
This is giving an error :
AssertionError: col should be Column
Also, The values of the columns have to be updated again later and the update will also be of 1 line.
How to do this??

I think the problem with your code is lies in lines where you are using variables like .withColumn("bucket",bucket). You are trying to create a new column by giving an integer value. withColumn expects a column and not a single integer value.
To solve this, you can use the lit just like you are already using for "nation"
like :
df = df\
.withColumn("category",F.lit('nation'))\
.withColumn("category_id",F.lit('nation'))\
.withColumn("bucket",F.lit(bucket))\
.withColumn("prop_count",F.lit(prop_count))\
.withColumn("event_count",F.lit(event_count))\
.withColumn("accum_prop_count",F.lit(accum_prop_count))\
.withColumn("accum_event_count",F.lit(accum_event_count))
another simple and cleaner way to write it may be like this :
# create schema
fields = [StructField("category", StringType(),True),
StructField("category_id", StringType(),True),
StructField("bucket", IntegerType(),True),
StructField("prop_count", IntegerType(),True),
StructField("event_count", IntegerType(),True),
StructField("accum_prop_count", IntegerType(),True)
]
schema = StructType(fields)
# load data
data = [["nation","nation",1,222,444,555]]
df = spark.createDataFrame(data, schema)
df.show()

Count in pyspark

I have a spark dataframe df with a column "id" (string) and another column "values" (array of strings). I want to create another column called count with contains the count of values for each id.
df looks like -
id values
1fdf67 [dhjy1,jh87w3,89yt5re]
df45l1 [hj098,hg45l0,sass65r4,dh6t21]
Result should look like -
id values count
1fdf67 [dhjy1,jh87w3,89yt5re] 3
df45l1 [hj098,hg45l0,sass65r4,dh6t21] 4
I am trying to do as below -
df= df.select(id,values).toDF(id,values,values.count())
This doesn't seem to be working for my requirement.

Please use size function:
from pyspark.sql.functions import size
df = spark.createDataFrame([
("1fdf67", ["dhjy1", "jh87w3", "89yt5re"]),
("df45l1", ["hj098", "hg45l0", "sass65r4", "dh6t21"])],
("id", "values"))
df.select("*", size("values").alias("count")).show(2, False)
+------+---------------------------------+-----+
|id |values |count|
+------+---------------------------------+-----+
|1fdf67|[dhjy1, jh87w3, 89yt5re] |3 |
|df45l1|[hj098, hg45l0, sass65r4, dh6t21]|4 |
+------+---------------------------------+-----+

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Split file name into different columns of pyspark dataframe - python

Related

Optimization on for loop on columns in Pyspark

How to append a value from exploded value in dataframe in pyspark

Pandas UDF in pyspark

How to create dataframe with single header ( 1 row many cols) and update values to this dataframe in pyspark?

Count in pyspark

Categories

Resources