Graphframes/Graphx connected components skipping numbers - python

I'm using the Spark GraphFrames library to build an identity resolution system. I have been able to use Spark to find matches. My plan was to use a graph to find transitive links between people and assign a single id to them for further analysis, etc.
I used the following data (from the public febrl database):
vertex data sample:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
Edge data sample:
+---+-----+-----+
|src| dst|match|
+---+-----+-----+
| 0|10000| 1|
| 1|17750| 1|
| 1|10001| 1|
| 1| 7750| 1|
| 2|19656| 1|
| 2|10002| 1|
| 2| 9656| 1|
| 3|19119| 1|
| 3|10003| 1|
| 3| 9119| 1|
+---+-----+-----+
I created the graph:
g = GraphFrame(vertix_data, edge_data)
and ran connected components:
connected = g.connectedComponents(algorithm='graphframes')
which results in:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu| 0|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai| 1|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre| 2|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar| 3|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal| 4|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow| 5|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro| 7|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden| 9|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald| 10|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc| 11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
The component column doesn't always increase in increments of 1 but seems to randomly skip numbers. I would like to make sure that it increases in increments of one, as I am using this number to assign each person an id.
Does anybody know why Graphframes does this?
When I look further into this, for the approx. 20,000 rows in my development dataframe, approx. 17% of entries have a skip in them. In extreme cases the gap can be around 20-30, i.e. one row's id is 5846 and the next one is 5868. My worry is that when I scale to millions and hundreds of millions of rows, the gaps between ids will get very large, which could create problems down the line.
TL;DR: Why does Spark's connected components algorithm seem to randomly skip values and not always increment by 1?

The GraphFrames documentation never promises consecutive ids; the only guarantee it provides is:
The resulting DataFrame contains all the vertex information and one additional column:
component (LongType): unique ID for this component
In practice the GraphX implementation uses the smallest ID in the component ("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex"), and GraphFrames seems to do the same thing.
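To make the skipping concrete, here is a minimal sketch of that behaviour (the SparkSession, the checkpoint directory path and the toy vertex ids are all assumptions, not part of the question): if a component consists of vertices 5 and 7, it is labelled 5, so values such as 1-4 simply never appear as component ids.
from graphframes import GraphFrame
# The 'graphframes' algorithm requires a checkpoint directory (path is an assumption).
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")
v = spark.createDataFrame([(0,), (5,), (7,)], ["id"])
e = spark.createDataFrame([(5, 7)], ["src", "dst"])  # vertex 0 is isolated, 5 and 7 are linked
g = GraphFrame(v, e)
g.connectedComponents().show()
# Expected output (row order may vary), assuming the lowest-id labelling described above:
# +---+---------+
# | id|component|
# +---+---------+
# |  0|        0|
# |  5|        5|   <- labelled 5, not 1: component ids are not consecutive
# |  7|        5|
# +---+---------+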

Like @user10802135 said, the component values are not guaranteed to be sequential. If you want to make them sequential, you'll need to do some post-processing on the component field. A PySpark solution would look something like this:
import pyspark.sql.functions as F
from pyspark.sql import Window
# Define our window for partitioning data on - necessary for dense_rank() function
windowSpec = Window.partitionBy(F.lit(1)).orderBy('component')
# Redefine the component field, now in sequential order
df = df.withColumn('component', F.dense_rank().over(windowSpec))
By partitioning by the literal value 1, all rows are considered in the dense_rank(), and the ranking order is determined by the .orderBy() argument. In this case the .orderBy() argument is set to 'component', which orders ascending by default. dense_rank() gives records with the same component the same returned value and produces consecutive ranks with no gaps, whereas rank() would leave gaps in the sequence after each group of ties.
There are some great examples and explanations of .dense_rank() and other window functions here.
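To see the difference, here is a quick illustration with toy component values (the SparkSession is an assumption):
import pyspark.sql.functions as F
from pyspark.sql import Window
# Toy data: three components with gaps between their ids
df = spark.createDataFrame([(0,), (0,), (3,), (3,), (7,)], ["component"])
w = Window.partitionBy(F.lit(1)).orderBy("component")
df.withColumn("dense_rank", F.dense_rank().over(w)) \
  .withColumn("rank", F.rank().over(w)) \
  .show()
# component 0 -> dense_rank 1, rank 1
# component 3 -> dense_rank 2, rank 3   <- rank() leaves a gap after the tie on 0
# component 7 -> dense_rank 3, rank 5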

Related

extract hypen separated values from a column and apply UDF

I have a dataframe as provided below:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 9| 11| 0| XXXX2288|110XXXX2288MKKKKK...| CHAR0088| ERROR|Records out of se...| N|
| 9| 12| 0| XXXX2288|130XXXX22880011ZZ...| CHAR0088| ERROR|Records out of se...| N|
| 9| 18| 0| XXXX2288|140XXXX2288 ...| CHAR0088| ERROR|Records out of se...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
The code below uses UDFs to populate the errorType and errorDescription columns.
The UDFs, resolveErrorTypeUDF and resolveErrorDescUDF, take one errorCode as input and return the respective errorType and errorDescription.
from pyspark.sql.functions import col, trim, when

errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", resolveErrorTypeUDF(col("errorCode"))) \
    .withColumn("errorDescription", resolveErrorDescUDF(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()
Please note that I used to get only one error code in the errorCode column. From now on, I will be getting single or multiple '-'-separated error codes in the errorCode column, and I need to look up the corresponding errorType and errorDescription for each of them and write them into the respective columns, also '-'-separated.
The new dataframe would look like this.
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 7| 1| 0| XXXX8822|010XXXX8822XBCDEF...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 11| 0| XXXX8822|110XXXX8822LLLLLL...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 12| 0| XXXX8822|120XXXX8822011GB ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX8822 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX88220 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
What changes would be needed to accommodate the new scenario? Please help. Thank you.
You need minimal changes, limited only to your UDFs.
Suppose you have a simple Python function, get_type_from_code, able to convert an error-code string to the corresponding type (the same applies to the description).
from pyspark.sql import functions as F, types as T

def get_type_from_code(c: str) -> str:
    """Convert an error code to its error type.
    Mind the interface: string in, string out.
    """
    return {'CHAR0009': 'ERROR', 'CHAR0021': 'WARNING'}.get(c, 'UNKNOWN')

@F.udf(returnType=T.StringType())
def convert_errcodes_to_types(codes: str) -> str:
    """Convert a string of error codes separated by '-' into a string of types concatenated with '-'."""
    return '-'.join(
        map(get_type_from_code, codes.split('-'))
    )
Done!
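For completeness, a hedged sketch of how this plugs into the pipeline from the question (convert_errcodes_to_descs is a hypothetical counterpart built the same way on top of a description lookup; only the UDFs change, the rest of the pipeline stays as it was):
from pyspark.sql.functions import col, trim, when

# convert_errcodes_to_descs is assumed to be defined like convert_errcodes_to_types,
# just mapping codes to descriptions instead of types.
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", convert_errcodes_to_types(col("errorCode"))) \
    .withColumn("errorDescription", convert_errcodes_to_descs(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()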

Can't aggregate count for an RDD?

I am working on a PySpark streaming application and I'm running into the following problem.
After performing some transformations on a DStream object, I end up with the following RDD called "rdd_new":
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
I then run this rdd through the following command that will aggregate the values in the RDD:
rdd_new = rdd_new.updateStateByKey(aggregate_count)
Where aggregate_count looks like this:
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)
But after that line of code is executed I am getting this error:
for obj in iterator:
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2052, in add_shuffle_key
ValueError: too many values to unpack (expected 2)
There are a lot more lines of error but I've narrowed it down to that. The thing is, the aggregate function works if my rdd looks like this:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
The key difference is that there are just 2 columns in this one. Since I really need this aggregate_count function to work for my project, how can I feed my 3-column RDD into the function and have it actually work? I have no idea how to even approach this sort of issue, thanks!
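The error message itself hints at the cause: updateStateByKey works on pair DStreams, i.e. 2-tuples of (key, value), so the 3-field records have to be reshaped into pairs first. A minimal sketch under those assumptions (the DStream name dstream and the choice of key/value fields are hypothetical, not from the question):
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)

# Records look like ("Python", 36, 10): keep the first field as the key and pick
# (or combine) the numeric fields as the value before the stateful update.
pairs = dstream.map(lambda row: (row[0], row[1]))      # e.g. ("Python", 36)
totals = pairs.updateStateByKey(aggregate_count)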

How to remove elements with a UDF and Pandas instead of using a for loop in Python

I have a problem... how do I turn a for loop like this into a UDF?
import cld3

ind_err = []
cnt = 0
cnt_NOT = 0
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if (lan != 'en'):
        cnt_NOT += 1
        ind_err.append(index)
    elif (lan == 'en' and probability < 0.85):
        cnt += 1
        ind_err.append(index)
pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop goes over all the rows of the pandas dataframe and uses cld3 to detect which content is English and which is not, in order to clean it up. It saves the indexes in a list and then deletes those rows with .drop(labels=ind_err, axis=0).
This is the data that I have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
This is the data that I would remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the dataframe that I would be left with:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
The problem with this loop is that it is very slow, since there are 1,119,778 rows.
I know PySpark's withColumn is much faster, but I honestly can't figure out how to select the rows to delete and get them deleted.
How can I turn that for loop into a function and make the language detection a lot faster?
My environment is Google Colab.
Many thanks in advance!!
You can probably do something like this:
from pyspark.sql import functions as F, types as T
import cld3

# assuming df is your (Spark) dataframe
@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.where(is_english(F.col("content")))
Actually, I do not really understand why you want to go through Spark for that. Using pandas properly should solve your problem:
# I used your example so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
                content
8  unable to update ...
# That's the only line from your truncated example that matches your criteria

Best approach to classify the dataset below into two classes and later remove the redundant rows using Pyspark?

My dataset:
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
| event_time|event_type|product_id| category_id| category_code| brand| price| user_id| user_session| Event_time_NoUTC| Event_timestamp|day_of_week|hour|primaryCategory|secondaryCategory|eventVisits|productCount|secondaryCategoryCount| AvgCatExpense|SessCount|
+--------------------+----------+----------+-------------------+--------------------+--------+-------+---------+--------------------+-------------------+-------------------+-----------+----+---------------+-----------------+-----------+------------+----------------------+------------------+---------+
|2019-10-06 07:04:...| view| 1004565|2053013555631882655|electronics.smart...| huawei| 169.84|231943435|428ebb99-3568-4e1...|2019-10-06 07:04:50|2019-10-06 07:04:50| 1| 7| electronics| smartphone| 1| 1| 1| 380.2349402627628| 1|
|2019-10-25 03:50:...| view| 5100337|2053013553341792533| electronics.clocks| apple| 319.34|266287781|f55edf02-3fd4-48f...|2019-10-25 03:50:28|2019-10-25 03:50:28| 6| 3| electronics| clocks| 7| 7| 7| 369.7054359810376| 4|
|2019-10-25 03:52:...| view| 1005105|2053013555631882655|electronics.smart...| apple|1397.09|266287781|118dbcd6-fe31-4cc...|2019-10-25 03:52:09|2019-10-25 03:52:09| 6| 3| electronics| smartphone| 7| 7| 7| 369.7054359810376| 4|
|2019-10-26 12:15:...| view| 6000157|2053013560807654091|auto.accessories....|starline| 91.12|266287781|992d03b4-c561-4fb...|2019-10-26 12:15:56|2019-10-26 12:15:56| 7| 12| auto| accessories| 7| 7| 7| 369.7054359810376| 4|
The event_type column has three categories: view, cart and purchase. I want to label each user_id and product_id pair with a new column is_purchased = 1 if it has any event of type purchase, and 0 otherwise. After that, I would remove the redundant rows as shown below, which would basically help me classify whether a customer will churn or not.
I am thinking of partitioning the data by user_id and product_id and then flagging those which have a purchase. Please suggest your approaches to solve this.
Step 1: group the data by user and product and mark if each group contains the event purchase:
from pyspark.sql import functions as F
data = [("A",123, "view", "other attributes 1"),
("A",123, "cart", "other attributes 2"),
("A",123, "purchase", "other attributes 3"),
("B",123, "cart", "other attributes 4")]
df = spark.createDataFrame(data, schema = ["user", "product", "event", "other"])
is_purchased = df.groupBy("user", "product").agg(
F.array_contains(F.collect_set("event"), "purchase").alias("is_purchased"))
# +----+-------+------------+
# |user|product|is_purchased|
# +----+-------+------------+
# | A| 123| true|
# | B| 123| false|
# +----+-------+------------+
Step 2: join the result from step 1 with the original data and filter out the redundant rows:
result = df.join(is_purchased, on=["user", "product"], how="left") \
.filter("event= 'cart'")
# +----+-------+-----+------------------+------------+
# |user|product|event| other|is_purchased|
# +----+-------+-----+------------------+------------+
# | A| 123| cart|other attributes 2| true|
# | B| 123| cart|other attributes 4| false|
# +----+-------+-----+------------------+------------+
You can also apply a window function to collect all events of each user and product, then filter (I'm using the same sample data as @werner):
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
.withColumn('events', F.collect_set('event').over(W.partitionBy('user', 'product')))
.withColumn('is_purchased', F.array_contains(F.col('events'), 'purchase'))
.where(F.col('event') == 'cart')
.show(10, False)
)
+----+-------+-----+------------------+----------------------+------------+
|user|product|event|other |events |is_purchased|
+----+-------+-----+------------------+----------------------+------------+
|A |123 |cart |other attributes 2|[cart, view, purchase]|true |
|B |abc |cart |other attributes 4|[cart] |false |
+----+-------+-----+------------------+----------------------+------------+

I have unwanted data in some of my columns. How do I get rid of it?

As you can see below, in the age and gender columns I have some data where the value should be either null or a digit. Why do the cells clash with each other, and how can I clean my columns?
As I understand it, the source of the problem is the description column: some of its cells appear empty or contain stray, hard-to-remove spaces even though they do have data, so when I read the file the content of description spills over into the age and gender columns.
df = sqlContext.read.csv("/FileStore/tables/mtmedical_V6-16623.csv", header=True)
df.show(150)
output:
+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| description| medical_specialty| age| gender|sample_name (What has been done to patient = Treatment)| transcription| keywords|
+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| A 23-year-old wh...| Allergy / Immuno...| 23| female| Allergic Rhinitis |SUBJECTIVE:, Thi...|allergy / immunol...|
| Consult for lapa...| Bariatrics| null| male| Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| Consult for lapa...| Bariatrics| 42| male| Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 2-D M-Mode. Dopp...| Cardiovascular /...| null| null| 2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
| 2-D Echocardiogram| Cardiovascular /...| null| male| 2-D Echocardiogr...|1. The left vent...|cardiovascular / ...|
| Morbid obesity. ...| Bariatrics| 30| male| Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, gastr...|
| Liposuction of t...| null| null| null| null| null| null|
|", Bariatrics,31,...| 1. Deformity| right breast rec...|2. Excess soft t...| anterior abdomen...|3. Lipodystrophy...|POSTOPERATIVE DIA...|
| 2-D Echocardiogram| Cardiovascular /...| null| male| 2-D Echocardiogr...|2-D ECHOCARDIOGRA...|cardiovascular / ...|
| Suction-assisted...| Bariatrics| null| male| Lipectomy - Abdo...|PREOPERATIVE DIAG...|bariatrics, lipod...|
| Echocardiogram a...| Cardiovascular /...| null| null| 2-D Echocardiogr...|DESCRIPTION:,1. ...|cardiovascular / ...|
| Morbid obesity. ...| Bariatrics| 50| male| Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, morbi...|
| Normal left vent...| Cardiovascular /...| null| male| 2-D Doppler |2-D STUDY,1. Mild...|cardiovascular / ...|
| Cerebral Angiogr...| Neurology| 31| male| Moyamoya Disease |"CC:, Confusion a...| she was found ""...|
This is what the csv file looks like.
One alternative would be to map through the dataframe and drop the "bad rows". However, that won't be a very scalable process if you are getting several such csv files.
The second alternative is to clean the csv file itself. It looks to me like the file has incorrect tabs or spaces that could be problematic.
Lastly, you can try the following:
val df = spark.read
.option("wholeFile", true)
.option("multiline",true)
.option("header", true)
.option("inferSchema", "true")
.csv("/FileStore/tables/mtmedical_V6-16623.csv")
This will take care of textual content spanning multiple lines, which might be the issue here.
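Since the question uses PySpark, the equivalent read would look roughly like this (the quote/escape options are an extra assumption that often helps when quoted fields contain newlines or embedded double quotes):
df = (spark.read
      .option("multiLine", True)
      .option("header", True)
      .option("inferSchema", True)
      .option("quote", '"')     # assumption: fields are quoted with double quotes
      .option("escape", '"')    # assumption: embedded quotes are escaped as ""
      .csv("/FileStore/tables/mtmedical_V6-16623.csv"))
df.show(150)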
