Can't aggregate count for an RDD? - python

I am working on a PySpark streaming application and I'm running into the following problem.
After performing some transformations on a DStream object, I end up with the following RDD called "rdd_new":
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
I then run this RDD through the following command, which will aggregate the values in the RDD:
rdd_new = rdd_new.updateStateByKey(aggregate_count)
Where aggregate_count looks like this:
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)
But after that line of code is executed I am getting this error:
for obj in iterator:
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2052, in add_shuffle_key
ValueError: too many values to unpack (expected 2)
There are a lot more lines in the error, but I've narrowed it down to that. The thing is, the aggregate function works if my rdd looks like this:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
The key difference is that this one has just two columns. Since I really need this aggregate_count function to work for my project, how can I feed my three-column RDD into the function and have it actually work? I have no idea how to even approach this sort of issue, thanks!
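A hedged guess at the cause, stated as an assumption rather than a verified fix: updateStateByKey operates on (key, value) pairs, so each record must be a 2-tuple, and a three-element record is exactly what produces "too many values to unpack (expected 2)". A minimal sketch of restructuring the stream before the stateful update:
# Sketch only: keep the language as the key and pack the two counts into one
# value, so every record becomes a (key, value) pair.
pairs = rdd_new.map(lambda row: (row[0], (row[1], row[2])))

def aggregate_count(new_values, total):
    # element-wise sum of the (count_a, count_b) value tuples
    total = total or (0, 0)
    for a, b in new_values:
        total = (total[0] + a, total[1] + b)
    return total

pairs = pairs.updateStateByKey(aggregate_count)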

Related

Extract hyphen-separated values from a column and apply UDF

I have a dataframe as provided below:
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 9| 11| 0| XXXX2288|110XXXX2288MKKKKK...| CHAR0088| ERROR|Records out of se...| N|
| 9| 12| 0| XXXX2288|130XXXX22880011ZZ...| CHAR0088| ERROR|Records out of se...| N|
| 9| 18| 0| XXXX2288|140XXXX2288 ...| CHAR0088| ERROR|Records out of se...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
The code below uses UDFs to populate the errorType and errorDescription columns.
The UDFs, resolveErrorTypeUDF and resolveErrorDescUDF, each take one errorCode as input and return the corresponding errorType and errorDescription respectively.
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", resolveErrorTypeUDF(col("errorCode"))) \
    .withColumn("errorDescription", resolveErrorDescUDF(col("errorCode"))) \
    .withColumn("isSuccessful", when(trim(col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()
Please notice that I used to get only one error code in the errorCode column. From now on, I will be getting one or more '-'-separated error codes in the errorCode column, and I need to look up all the corresponding errorType and errorDescription values and write them into the respective columns, also '-'-separated.
The new dataframe would look like this.
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
|sequence|recType|valCode|registerNumber| rest| errorCode|errorType | errorDescription|isSuccessful|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
| 7| 1| 0| XXXX8822|010XXXX8822XBCDEF...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 11| 0| XXXX8822|110XXXX8822LLLLLL...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 12| 0| XXXX8822|120XXXX8822011GB ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX8822 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
| 7| 18| 0| XXXX8822|180XXXX88220 ...|CHAR0009-CHAR0021|ERROR-WARN|Short Failed-Miss...| N|
+--------+-------+-------+--------------+--------------------+-----------------+----------+--------------------+------------+
What changes would be needed to accommodate the new scenario? Please help. Thank you.
You need minimal changes, limited only to your UDFs.
Suppose you have a simple Python function, get_type_from_code, that converts an error code string to the corresponding type (the same applies to the description).
from pyspark.sql import functions as F, types as T
def get_type_from_code(c: str) -> str:
    """Convert an error code to the corresponding error type.
    Mind the interface: string in, string out.
    """
    return {'CHAR0009': 'ERROR', 'CHAR0021': 'WARNING'}.get(c, 'UNKNOWN')

@F.udf(returnType=T.StringType())
def convert_errcodes_to_types(codes: str) -> str:
    """Convert a string of error codes separated by '-' into a string of types concatenated with '-'."""
    return '-'.join(
        map(get_type_from_code, codes.split('-'))
    )
Done!
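As an illustration (a sketch, assuming a second UDF convert_errcodes_to_descs is built the same way on top of a code-to-description lookup), the new UDFs slot straight into the original chain:
# convert_errcodes_to_descs is hypothetical here: same pattern as
# convert_errcodes_to_types, just backed by the description mapping.
errorFinalDf = errorDfAll.na.fill("") \
    .withColumn("errorType", convert_errcodes_to_types(F.col("errorCode"))) \
    .withColumn("errorDescription", convert_errcodes_to_descs(F.col("errorCode"))) \
    .withColumn("isSuccessful", F.when(F.trim(F.col("errorCode")).eqNullSafe(""), "Y").otherwise("N")) \
    .dropDuplicates()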

How to remove elements with UDF function and Pandas instead of using for loop Python

I have a problem... how do I express a for loop like this as a UDF?
import cld3

ind_err = []
cnt = 0
cnt_NOT = 0
for index, row in pandasDF.iterrows():
    lan, probability, is_reliable, proportion = cld3.get_language(row["content"])
    if lan != 'en':
        cnt_NOT += 1
        ind_err.append(index)
    elif lan == 'en' and probability < 0.85:
        cnt += 1
        ind_err.append(index)

pandasDF = pandasDF.drop(labels=ind_err, axis=0)
This loop goes over every row of the pandas dataframe and uses cld3 to work out which rows are English and which are not, in order to clean the data up. It saves the indexes in a list so they can be deleted with .drop(labels=ind_err, axis=0).
This is the data that I have:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
This is the data that I would remove:
|ciao mamma mi pia...| 1|
|bounjourn whatsa ...| 5|
|hola amigos te qu...| 5|
And this is the dataframe that I would be left with:
+--------------------+-----+
| content|score|
+--------------------+-----+
| what sapp| 1|
| right| 5|
|excellent thank y...| 5|
| whatsapp| 1|
|so frustrating i ...| 1|
|unable to update ...| 1|
| whatsapp| 1|
+--------------------+-----+
The problem with this loop is that it is very slow, since there are 1,119,778 rows.
I know PySpark's withColumn is much faster, but I honestly can't figure out how to select the rows to delete and get them deleted.
How can I turn that for loop into a function and make language detection a lot faster?
My environment is Google Colab.
Many thanks in advance!!
You can probably do something like this:
import cld3
from pyspark.sql import functions as F, types as T

# assuming df is your dataframe
@F.udf(T.BooleanType())
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.where(is_english(F.col("content")))
Actually, I do not really understand why you want to go through Spark for that. Using pandas properly should solve your problem:
# I used your example so I only have partial text...
def is_english(content):
    lan, probability, is_reliable, proportion = cld3.get_language(content)
    if lan == "en" and probability >= 0.85:
        return True
    return False

df.loc[df["content"].apply(is_english)]
                content
8  unable to update ...
# That's the only line from your truncated example that matches your criteria

How to modify a column for a join in a Spark dataframe when the join keys are given as a list?

I have been trying to join two dataframes using a list of join keys, and I want to add the ability to join on a subset of the keys when one of the key values is null.
I have been trying to join two dataframes, df_1 and df_2.
data1 = [[1, '2018-07-31', 215, 'a'],
         [2, '2018-07-30', None, 'b'],
         [3, '2017-10-28', 201, 'c']]
df_1 = sqlCtx.createDataFrame(data1,
                              ['application_number', 'application_dt', 'account_id', 'var1'])
and
data2 = [[1, '2018-07-31', 215, 'aaa'],
         [2, '2018-07-30', None, 'bbb'],
         [3, '2017-10-28', 201, 'ccc']]
df_2 = sqlCtx.createDataFrame(data2,
                              ['application_number', 'application_dt', 'account_id', 'var2'])
The code I use to join is this:
key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')
The output for the same is:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b|null|
+------------------+--------------+----------+----+----+
My concern here is that, in the case where account_id is null, the join should still work by comparing the other two keys.
The required output should be like this:
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 3| 2017-10-28| 201| c| ccc|
| 2| 2018-07-30| null| b| bbb|
+------------------+--------------+----------+----+----+
I have found a similar approach to do so by using the statement:
join_elem = "df_1.application_number == df_2.application_number|df_1.application_dt == df_2.application_dt|F.coalesce(df_1.account_id, F.lit(0)) == F.coalesce(df_2.account_id, F.lit(0))".split("|")
join_elem_column = [eval(x) for x in join_elem]
But the design considerations do not allow me to use a full join expression, and I am stuck with using the list of column names as the join key.
I have been trying to find a way to accommodate this coalesce thing into this list itself but have not found any success so far.
I would call this solution a workaround.
The issue here is that we have a null value for one of the keys in the DataFrame, and the OP wants the rest of the key columns to be used instead. Why not assign an arbitrary value to this null and then apply the join? Effectively this is the same as joining on the remaining two keys.
from pyspark.sql.functions import col, when

# Let's replace Null with an arbitrary value, which has
# little chance of occurring in the Dataset. For eg; -100000
df1 = df1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df2 = df2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
# Do a FULL Join
df = df1.join(df2, ['application_number', 'application_dt', 'account_id'], 'full')
# Replace the arbitrary value back with Null.
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
| 1| 2018-07-31| 215| a| aaa|
| 2| 2018-07-30| null| b| bbb|
| 3| 2017-10-28| 201| c| ccc|
+------------------+--------------+----------+----+----+
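If overwriting account_id itself feels risky, the same idea can be expressed (a sketch under the same assumptions, not tested against the OP's design constraints) with a temporary coalesced key column, so the join still takes a plain list of column names:
from pyspark.sql import functions as F

# Hypothetical helper column; -100000 is the same arbitrary sentinel as above.
df1_k = df_1.withColumn('account_id_key', F.coalesce(F.col('account_id'), F.lit(-100000)))
df2_k = df_2.withColumn('account_id_key', F.coalesce(F.col('account_id'), F.lit(-100000)))

# Drop account_id on one side to avoid a duplicate column, join on the list of
# key names, then drop the helper column again.
joined = df1_k.join(df2_k.drop('account_id'),
                    ['application_number', 'application_dt', 'account_id_key'],
                    'left') \
              .drop('account_id_key')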

Graphframes/Graphx connected components skipping numbers

I'm using the Spark Graphframes library to create an identity resolution system. I have been able to use Spark to find matches. My plan was to use a graph to find transitive links between people and assign a single id to them for further analysis etc.
I used the following data (from the public febrl database):
vertex data sample:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
Edge data sample:
+---+-----+-----+
|src| dst|match|
+---+-----+-----+
| 0|10000| 1|
| 1|17750| 1|
| 1|10001| 1|
| 1| 7750| 1|
| 2|19656| 1|
| 2|10002| 1|
| 2| 9656| 1|
| 3|19119| 1|
| 3|10003| 1|
| 3| 9119| 1|
+---+-----+-----+
created graph:
g = GraphFrame(vertix_data, edge_data)
used connected components:
connected = g.connectedComponents(algorithm='graphframes')
which results in:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu| 0|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai| 1|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre| 2|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar| 3|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal| 4|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow| 5|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro| 7|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden| 9|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald| 10|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc| 11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
The component column doesn't always increase in increments of 1 but seems to randomly skip numbers. I would like to make sure that it increases in increments of one, as I am using this number to assign each person an id.
Does anybody know why Graphframes does this?
When I look further into this, for the approx 20,000 rows in my development dataframe, approx 17% of entries have a skip in them. In extreme cases the gap can be up to around 20-30, i.e. one row's id is 5846 and the next one is 5868. My worry is that when I scale to millions and hundreds of millions of rows, the gaps between ids will get very large, which could create problems down the line.
TL;DR: Why do Spark's connected components seem to randomly skip values and not always increment by 1?
The Graphframes documentation never promises consecutive ids; the only guarantee it provides is:
The resulting DataFrame contains all the vertex information and one additional column:
component (LongType): unique ID for this component
In practice GraphX implementation uses the smallest ID in the component ("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex"), and Graphframes seems to do the same thing.
Like @user10802135 said, the component values are not guaranteed to be sequential. If you want to make them sequential, you'll need to do some post-processing on the component field. A PySpark solution to this would look something like this:
import pyspark.sql.functions as F
from pyspark.sql import Window
# Define our window for partitioning data on - necessary for dense_rank() function
windowSpec = Window.partitionBy(F.lit(1)).orderBy('component')
# Redefine the component field, now in sequential order
df = df.withColumn('component', F.dense_rank().over(windowSpec))
By partitioning by the literal value of 1, all rows are considered in the dense_rank(), and the ranking order is determined by the .orderBy() argument. In this case the .orderBy() argument is set to 'component', which orders in ascending order by default. dense_rank() gives records in the same component the same returned value and leaves no gaps in the numbering after ties, which is what makes the result sequential; plain rank() does NOT ensure that.
There are some great examples and explanations of .dense_rank() and other window functions here.
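For intuition, here is a minimal sketch on toy data (the component values 0, 0, 7 and 11 are made up to mimic the skipped ids) showing why dense_rank() produces gap-free numbering where rank() would not:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame([(0,), (0,), (7,), (11,)], ['component'])

w = Window.partitionBy(F.lit(1)).orderBy('component')
toy.select('component',
           F.rank().over(w).alias('rank'),         # 1, 1, 3, 4 -> gap after the tie
           F.dense_rank().over(w).alias('dense')   # 1, 1, 2, 3 -> consecutive
          ).show()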

PySpark - Pass list as parameter to UDF + iterative dataframe column addition

I borrowed this example from a link!
I would like to understand why dataframe a, after having had column 'category' seemingly added to it, cannot be referenced in a subsequent operation. Is dataframe a somehow immutable? Is there another way to act on dataframe a so that subsequent operations can access column 'category'? Thanks for your help; I am still on the learning curve. Now, it is possible to add all the columns at once to avoid the error, but that isn't what I want to do here.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80), ("E", 0)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

# Passing a list as the default value of a parameter
def cate(feature_list, label=label_list):
    if feature_list == 0:
        return label[4]
    else:
        return 'I am not sure!'

def cate2(feature_list, label=label_list):
    if feature_list == 0:
        return label[4]
    elif feature_list.category == 'I am not sure!':
        return 'Why not?'

udfcate = udf(cate, StringType())
udfcate2 = udf(cate2, StringType())

a.withColumn("category", udfcate("distances"))
a.show()
a.withColumn("category2", udfcate2("category")).show()
a.show()
I get the error:
C:\Users\gowreden\AppData\Local\Continuum\anaconda3\python.exe C:/Users/gowreden/PycharmProjects/DRC/src/tester.py
2018-08-09 09:06:42 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+------+---------+--------------+
|Letter|distances| category|
+------+---------+--------------+
| A| 20|I am not sure!|
| B| 30|I am not sure!|
| D| 80|I am not sure!|
| E| 0| Dead|
+------+---------+--------------+
Traceback (most recent call last):
File "C:\Programs\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\Programs\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`category`' given input columns: [Letter, distances];;
'Project [Letter#0, distances#1L, cate('category) AS category2#20]
+- AnalysisBarrier
+- LogicalRDD [Letter#0, distances#1L], false
....
I think there are two issues with your code:
First of all, as @pault said, withColumn is not in-place, and you need to modify your code accordingly.
Second, your cate2 function is not correct, in the sense that you apply it to the category column while at the same time comparing feature_list.category with something.
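To illustrate the first point only (a minimal sketch that keeps your original UDF): withColumn returns a new dataframe, so the result has to be assigned back to a before the next operation can see the new column.
a = a.withColumn("category", udfcate("distances"))  # reassign; a is never modified in place
a.show()  # 'category' now exists and can be referenced by later withColumn calls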
You may want to get rid of the first function and do the following:
import pyspark.sql.functions as F
a=a.withColumn('category', F.when(a.distances==0, label_list[4]).otherwise('I am not sure!'))
a.show()
Output:
+------+---------+--------------+
|Letter|distances| category|
+------+---------+--------------+
| A| 20|I am not sure!|
| B| 30|I am not sure!|
| D| 80|I am not sure!|
| E| 0| Dead|
+------+---------+--------------+
And do something like this for the second function:
a=a.withColumn('category2', F.when(a.distances==0, label_list[4]).otherwise(F.when(a.category=='I am not sure!', 'Why not?')))
a.show()
Output:
+------+---------+--------------+---------+
|Letter|distances| category|category2|
+------+---------+--------------+---------+
| A| 20|I am not sure!| Why not?|
| B| 30|I am not sure!| Why not?|
| D| 80|I am not sure!| Why not?|
| E| 0| Dead| Dead|
+------+---------+--------------+---------+
