extract hyphen-separated values from a column and apply UDF - python

I have a dataframe like as provided below:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 9| 11| 0| XXXX2288|110XXXX2288MKKKKK...| CHAR0088| ERROR|Records out of se...| N|
| 9| 12| 0| XXXX2288|130XXXX22880011ZZ...| CHAR0088| ERROR|Records out of se...| N|
| 9| 18| 0| XXXX2288|140XXXX2288 ...| CHAR0088| ERROR|Records out of se...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
The code below uses UDFs to populate the errorType and errorDescription columns.
The UDFs, resolveErrorTypeUDF and resolveErrorDescUDF, each take a single errorCode as input and return the corresponding errorType and errorDescription respectively.
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", resolveErrorTypeUDF(col("errorCode"))) \
    .withColumn("errorDescription", resolveErrorDescUDF(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()
Please note that previously I only ever received a single error code in the errorCode column. From now on, the errorCode column may contain one or more error codes separated by '-'. I need to look up the matching errorType and errorDescription for every code and write them into their respective columns, also separated by '-'.
The new dataframe would look like this.
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 7| 1| 0| XXXX8822|010XXXX8822XBCDEF...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 11| 0| XXXX8822|110XXXX8822LLLLLL...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 12| 0| XXXX8822|120XXXX8822011GB ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX8822 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX88220 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
What changes would be needed to accommodate the new scenario? Please help. Thank you.

You need minimal changes, limited to your UDFs.
Suppose you have a simple Python function, get_type_from_code, that converts an error-code string to the corresponding type (the same applies to the description).
from pyspark.sql import functions as F, types as T

def get_type_from_code(c: str) -> str:
    """Function to convert an error code to its error type.
    Mind the interface: string in, string out.
    """
    return {'CHAR0009': 'ERROR', 'CHAR0021': 'WARNING'}.get(c, 'UNKNOWN')

@F.udf(returnType=T.StringType())
def convert_errcodes_to_types(codes: str) -> str:
    """Convert a string of error codes separated by '-' into a string of types concatenated with '-'."""
    return '-'.join(map(get_type_from_code, codes.split('-')))
Done!
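To wire this back into the original pipeline, the wrapped UDF simply replaces the old one in the withColumn call. A minimal sketch, assuming a convert_errcodes_to_descs UDF is built the same way for the descriptions:

# convert_errcodes_to_descs is the analogous (assumed) UDF, built from a
# description lookup exactly like convert_errcodes_to_types above
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", convert_errcodes_to_types(col("errorCode"))) \
    .withColumn("errorDescription", convert_errcodes_to_descs(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()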

Related

Can't aggregate count for an RDD?

I am working on a PySpark streaming application and I'm running into the following problem.
After performing some transformations on a DStream object, I end up with the following RDD called "rdd_new":
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
I then run this rdd through the following command that will aggregate the values in the RDD:
rdd_new = rdd_new.updateStateByKey(aggregate_count)
Where aggregate_count looks like this:
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)
But after that line of code is executed I am getting this error:
for obj in iterator:
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2052, in add_shuffle_key
ValueError: too many values to unpack (expected 2)
There are a lot more lines of error but I've narrowed it down to that. The thing is, the aggregate function works if my rdd looks like this:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
The key difference is that there are just 2 columns with this one. Since I really need this aggregate_count function to work for my project, how can I feed my 3 column RDD into the function and have it actually work? I have no idea how to even approach this sort of issue, thanks!
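For reference, the "too many values to unpack (expected 2)" error usually means updateStateByKey received records that are not (key, value) pairs; it only operates on pair DStreams, which is why the two-column version works. A minimal sketch of the usual reshaping, assuming the first field is the key and dstream is the stream the records come from (names are placeholders):

# Reshape 3-field records into (key, value) pairs before stateful aggregation
paired = dstream.map(lambda row: (row[0], (row[1], row[2])))

def aggregate_count(new_values, total):
    # new_values is a list of (v1, v2) tuples seen for this key in the batch
    prev = total or (0, 0)
    return (prev[0] + sum(v[0] for v in new_values),
            prev[1] + sum(v[1] for v in new_values))

rdd_new = paired.updateStateByKey(aggregate_count)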

How to remove elements with a UDF function and Pandas instead of using a for loop in Python

I have a problem ... how do I rewrite a for loop like this as a UDF function?
import cld3

ind_err = []
cnt = 0
cnt_NOT = 0
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if lan != 'en':
        cnt_NOT += 1
        ind_err.append(index)
    elif lan == 'en' and probability < 0.85:
        cnt += 1
        ind_err.append(index)
pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop goes through every row of the pandas dataframe and uses cld3 to check which rows are English and which are not, in order to clean the data up. It saves the indexes in a list and then deletes them with .drop(labels=ind_err, axis=0).
This is the data that I have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
This is the data that I would remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the dataframe that I would have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
The problem with this loop is that it is very slow, since there are 1,119,778 rows.
I know PySpark's withColumn is much faster, but I honestly can't figure out how to select the rows to delete and get them deleted.
How can I turn that for loop into a function and make the language detection a lot faster?
My environment is Google Colab.
Many thanks in advance!!
You can probably do something like this:
import cld3
from pyspark.sql import functions as F, types as T

# assuming df is your dataframe
@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.where(is_english(F.col("content")))
Actually, I do not really understand why you want to go through Spark for that. Using pandas properly should solve your problem:
# I used your example so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
             content
8  unable to update ...
# That's the only line from your truncated example that matches your criteria
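If the plain UDF is still too slow on the ~1.1M rows, a pandas UDF variant is worth trying; it processes rows in Arrow batches, which cuts serialization overhead. A sketch, assuming cld3.get_language exposes language/probability attributes (as in pycld3) and may return None for empty text:

import cld3
import pandas as pd
from pyspark.sql import functions as F, types as T

@F.pandas_udf(T.BooleanType())
def is_english_batch(contents: pd.Series) -> pd.Series:
    # cld3 is still called once per row, but rows arrive in Arrow batches,
    # which reduces the Python/JVM serialization overhead of a plain UDF
    def check(text):
        result = cld3.get_language(text)
        return (result is not None
                and result.language == "en"
                and result.probability >= 0.85)
    return contents.apply(check)

df.where(is_english_batch(F.col("content")))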

Best approach to classify the dataset below into two classes and later remove the redundant rows using Pyspark?

My dataset:
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session| Event_time_NoUTC| Event_timestamp|day_of_week|hour|primaryCategory|secondaryCategory|eventVisits|productCount|secondaryCategoryCount| AvgCatExpense|SessCount|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
|2019-10-06 07:04:...| view| 1004565|2053013555631882655|electronics.smart...| huawei| 169.84|231943435|428ebb99-3568-4e1...|2019-10-06 07:04:50|2019-10-06 07:04:50| 1| 7| electronics| smartphone| 1| 1| 1| 380.2349402627628| 1|
|2019-10-25 03:50:...| view| 5100337|2053013553341792533| electronics.clocks| apple| 319.34|266287781|f55edf02-3fd4-48f...|2019-10-25 03:50:28|2019-10-25 03:50:28| 6| 3| electronics| clocks| 7| 7| 7| 369.7054359810376| 4|
|2019-10-25 03:52:...| view| 1005105|2053013555631882655|electronics.smart...| apple|1397.09|266287781|118dbcd6-fe31-4cc...|2019-10-25 03:52:09|2019-10-25 03:52:09| 6| 3| electronics| smartphone| 7| 7| 7| 369.7054359810376| 4|
|2019-10-26 12:15:...| view| 6000157|2053013560807654091|auto.accessories....|starline| 91.12|266287781|992d03b4-c561-4fb...|2019-10-26 12:15:56|2019-10-26 12:15:56| 7| 12| auto| accessories| 7| 7| 7| 369.7054359810376| 4|
The event_type column has three categories: view, cart and purchase. I want to classify each user_id and product_id pair with a new column is_purchased = 1 if the pair has a purchase event, and 0 otherwise. After that, I would remove the redundant rows as shown below, which would basically help me classify whether a customer will churn or not.
I am thinking of partitioning the data by user_id and product_id and then classifying the pairs that have a purchase. Please suggest approaches to solve this.
Step 1: group the data by user and product and mark if each group contains the event purchase:
from pyspark.sql import functions as F

data = [("A", 123, "view", "other attributes 1"),
        ("A", 123, "cart", "other attributes 2"),
        ("A", 123, "purchase", "other attributes 3"),
        ("B", 123, "cart", "other attributes 4")]
df = spark.createDataFrame(data, schema=["user", "product", "event", "other"])

is_purchased = df.groupBy("user", "product").agg(
    F.array_contains(F.collect_set("event"), "purchase").alias("is_purchased"))
# +----+-------+------------+
# |user|product|is_purchased|
# +----+-------+------------+
# | A| 123| true|
# | B| 123| false|
# +----+-------+------------+
Step 2: join the result from step 1 with the original data and filter out the redundant rows:
result = df.join(is_purchased, on=["user", "product"], how="left") \
    .filter("event = 'cart'")
# +----+-------+-----+------------------+------------+
# |user|product|event| other|is_purchased|
# +----+-------+-----+------------------+------------+
# | A| 123| cart|other attributes 2| true|
# | B| 123| cart|other attributes 4| false|
# +----+-------+-----+------------------+------------+
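If an integer flag is preferred over the boolean, as in the question's is_purchased = 1 wording, the column can simply be cast afterwards (a small sketch on the result above):

# turn the boolean flag into 1/0
result = result.withColumn("is_purchased", F.col("is_purchased").cast("int"))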
You can also apply a window function to get all events of each user and product, then filter (I'm using the same sample data as @werner):
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
 .withColumn('events', F.collect_set('event').over(W.partitionBy('user', 'product')))
 .withColumn('is_purchased', F.array_contains(F.col('events'), 'purchase'))
 .where(F.col('event') == 'cart')
 .show(10, False)
)
+----+-------+-----+------------------+----------------------+------------+
|user|product|event|other |events |is_purchased|
+----+-------+-----+------------------+----------------------+------------+
|A |123 |cart |other attributes 2|[cart, view, purchase]|true |
|B |abc |cart |other attributes 4|[cart] |false |
+----+-------+-----+------------------+----------------------+------------+

How to modify a column for a join in spark dataframe when the join keys are given as a list?

I have been trying to join two dataframes using join keys passed as a list, and I want to add the functionality to join on a subset of the keys if one of the key values is null.
The two dataframes are df_1 and df_2:
data1 = [[1, '2018-07-31', 215, 'a'],
         [2, '2018-07-30', None, 'b'],
         [3, '2017-10-28', 201, 'c']]
df_1 = sqlCtx.createDataFrame(data1,
    ['application_number', 'application_dt', 'account_id', 'var1'])
and
data2 = [[1, '2018-07-31', 215, 'aaa'],
         [2, '2018-07-30', None, 'bbb'],
         [3, '2017-10-28', 201, 'ccc']]
df_2 = sqlCtx.createDataFrame(data2,
    ['application_number', 'application_dt', 'account_id', 'var2'])
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
+------------------+--------------+----------+----+----+
My concern here is, in the case where account_id is null, the join should still work by comparing other 2 keys.
The required output should be like this:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
+------------------+--------------+----------+----+----+
I have found a similar approach to do so by using the statement:
join_elem = "df_1.application_number ==
df_2.application_number|df_1.application_dt ==
df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) ==
F.coalesce(df_2.account_id,F.lit(0))".split("|")
join_elem_column = [eval(x) for x in join_elem]
But the design considerations do not allow me to use a full join expression, and I am stuck with using the list of column names as the join key.
I have been trying to find a way to accommodate this coalesce trick in the list itself, but have not had any success so far.
I would call this solution a workaround.
The issue here is that we have a null value for one of the keys in the DataFrame, and the OP wants the rest of the key columns to be used instead. Why not assign an arbitrary value to this null and then apply the join? Effectively this is the same as joining on the remaining two keys.
from pyspark.sql.functions import col, when

# Let's replace null with an arbitrary value, which has
# little chance of occurring in the dataset, e.g. -100000
df1 = df1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df2 = df2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))

# Do a FULL join
df = df1.join(df2, ['application_number', 'application_dt', 'account_id'], 'full')

# Replace the arbitrary value back with null
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 2| 2018-07-30| null| b| bbb|
| 3| 2017-10-28| 201| c| ccc|
+------------------+--------------+----------+----+----+
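For what it's worth, the sentinel substitution can also be written with fillna, which reads a bit more compactly (a sketch using the same arbitrary value):

# equivalent, more compact null replacement before the join
df1 = df1.fillna({'account_id': -100000})
df2 = df2.fillna({'account_id': -100000})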

Graphframes/Graphx connected components skipping numbers

I'm using the Spark GraphFrames library to create an identity resolution system. I have been able to use Spark to find matches. My plan was to use a graph to find transitive links between people and assign a single id to them for further analysis etc.
I used the following data (from the public febrl database):
vertex data sample:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
Edge data sample:
+---+-----+-----+
|src| dst|match|
+---+-----+-----+
| 0|10000| 1|
| 1|17750| 1|
| 1|10001| 1|
| 1| 7750| 1|
| 2|19656| 1|
| 2|10002| 1|
| 2| 9656| 1|
| 3|19119| 1|
| 3|10003| 1|
| 3| 9119| 1|
+---+-----+-----+
created graph:
g = GraphFrame(vertix_data, edge_data)
used connected components:
connected = g.connectedComponents(algorithm='graphframes')
which results in:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu| 0|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai| 1|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre| 2|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar| 3|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal| 4|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow| 5|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro| 7|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden| 9|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald| 10|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc| 11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
The component column doesn't always increase in increments of 1 but seems to randomly skip numbers. I would like to make sure that it increases in increments of one, as I am using this number to assign each person an id.
Does anybody know why Graphframes does this?
When I look further into this, of the approx. 20,000 rows in my development dataframe, approx. 17% of entries have a skip in them. In extreme cases the gap can be around 20-30, i.e. one row's id is 5846 and the next one is 5868. My worry is that, when I scale to millions and hundreds of millions of rows, the gaps between ids will get very large, which could create problems down the line.
TL;DR: Why does Spark's connected components algorithm seem to randomly skip values and not always increment by 1?
Graphframes documentation never promises consecutive ids - instead the only guarantee it provides is:
The resulting DataFrame contains all the vertex information and one additional column:
component (LongType): unique ID for this component
In practice GraphX implementation uses the smallest ID in the component ("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex"), and Graphframes seems to do the same thing.
Like @user10802135 said, the component values are not guaranteed to be sequential. If you want to make them sequential, you'll need to do some post-processing on the component field. A pyspark solution to this would look something like this:
import pyspark.sql.functions as F
from pyspark.sql import Window
# Define our window for partitioning data on - necessary for dense_rank() function
windowSpec = Window.partitionBy(F.lit(1)).orderBy('component')
# Redefine the component field, now in sequential order
df = df.withColumn('component', F.dense_rank().over(windowSpec))
By partitioning by the literal value of 1, all rows are considered in the dense_rank(), and the ranking order is determined by the .orderBy() argument. In this case the .orderBy() argument is set to 'component', which orders in ascending order by default. dense_rank() ensures that records under the same component get the same returned value and that the returned values are consecutive, with no gaps, which rank() does NOT guarantee.
There are some great examples and explanations of .dense_rank() and other window functions here.
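One caveat: Window.partitionBy(F.lit(1)) funnels every row through a single partition, which may become a bottleneck at the hundreds-of-millions scale mentioned in the question. A sketch of one way to keep the expensive step small, by ranking only the distinct component ids and joining the mapping back (same column names as above):

import pyspark.sql.functions as F
from pyspark.sql import Window

# Rank only the distinct component ids (a much smaller dataframe),
# then join the old-id -> new-id mapping back onto the full data.
component_map = (df.select('component').distinct()
                   .withColumn('new_component',
                               F.dense_rank().over(Window.orderBy('component'))))

df = (df.join(component_map, on='component', how='left')
        .drop('component')
        .withColumnRenamed('new_component', 'component'))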
