Pyspark self-join with error "Resolved attribute(s) missing" - python

While doing a PySpark dataframe self-join I got an error message:
Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286 in operator !Project [...]. Attribute(s) with the same name appear in the operation: un_val. Please check if the right attribute(s) are used.;;
It is a simple dataframe self-join like the one below. It works fine on its own, but after a couple of operations on the dataframe, such as adding columns or joining with other dataframes, the error above is raised.
df.join(df,on='item_listed')
Using dataframe aliases as below won't work either; the same error message is raised:
df.alias('A').join(df.alias('B'), col('A.my_id') == col('B.my_id'))

I've found a Java workaround in SPARK-14948; translated to PySpark it looks like this:
# add a "_r" suffix to the column names
newcols = [c + '_r' for c in df.columns]
# clone the dataframe with the columns renamed
df2 = df.toDF(*newcols)
# self-join
df.join(df2, df.my_column == df2.my_column_r)
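After this join every column appears twice (once with the _r suffix), so the duplicates that are not needed can be dropped afterwards; a small sketch using the placeholder names from above:
joined = df.join(df2, df.my_column == df2.my_column_r)
# drop the duplicated join key (and any other *_r columns that are not needed)
result = joined.drop('my_column_r')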

Related

How to resolve warning "Boolean Series key will be reindexed to match DataFrame index" while using a dataframe to create a new dataframe

I am writing code to analyze some data, in which I remove rows with a null value in a certain column. It works perfectly fine and I am getting the desired results, but this warning keeps flashing.
Here's the part of the code that's producing the warning:
original_df = pd.read_csv('titanic.csv')
age_wrangled_df = original_df[pd.notnull(original_df['Age'])]
embark_wrangled_df = age_wrangled_df[pd.notnull(original_df['Embarked'])]
The warning I keep getting is
C:\Users\Dell\Downloads\Codes folium\titanic.py:14: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
embark_wrangled_df = age_wrangled_df[pd.notnull(original_df['Embarked'])]
I have read other answers regarding this warning but none helped resolve it. What does the warning mean and how can I fix it?
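The warning usually means the boolean mask was built from a different DataFrame (here original_df) than the one being indexed (age_wrangled_df), so pandas has to reindex the mask to match. A minimal sketch of one way to avoid it, assuming the same titanic.csv columns:
import pandas as pd

original_df = pd.read_csv('titanic.csv')
# build each mask from the frame it is applied to, so the indexes already line up
age_wrangled_df = original_df[original_df['Age'].notnull()]
embark_wrangled_df = age_wrangled_df[age_wrangled_df['Embarked'].notnull()]
# or, equivalently, drop both in one step
wrangled_df = original_df.dropna(subset=['Age', 'Embarked'])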

PySpark Join Error when using .show command?

I am trying to join two dataframes in pyspark. I created a unique key to do so. I keep getting an error which I do not understand.
Code used:
join = df.join(df2, on=['key'], how='inner')
join.show()
I checked the keys and they are correct and unique, and there should be matches. The error I get is as follows:
Py4JJavaError: An error occurred while calling .showString.
If I use this code:
left_join = df.join(df2, df.key == df2.key, 'left')
left_join.show()
I also get an error: Py4JJavaError: An error occurred while calling .showString.
If I remove the .show() command I don't get an error. Does anyone know why this is and how I can solve it?
Thanks in advance!
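Spark is lazily evaluated, so the join line only builds a query plan; nothing runs until an action such as .show() is called, which is why the error only appears there. The real cause is usually further down the Py4JJavaError stack trace. A rough way to narrow it down, assuming df and df2 as in the snippets above:
# check the schemas and the plan without executing anything
df.printSchema()
df2.printSchema()
joined = df.join(df2, on=['key'], how='inner')
joined.explain()
# force execution on a small slice to surface the underlying error
joined.limit(10).show()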

reshaping pandas data frame- Unique row error

I have a data frame as follows (link to the data: https://pastebin.com/GzujhX3d);
I am trying to use the wide_to_long reshape function from the pandas package and it keeps giving me the error
"the id variables need to uniquely identify each row".
This is my code to reshape:
GG_long = pd.wide_to_long(data_GG, stubnames='time_', i=['Customer', 'date'], j='Cons')
The combination of 'Customer' and 'date' uniquely identifies each row within my data, so I don't understand why it throws this error or how I can fix it. Any help is appreciated.
I was able to identify the issue. The error was due to two things: first, the column names had ":" in them, and second, the format of the date column: for some reason it doesn't like dd-mm-yy, but it works with dd/mm/yy.
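Based on that, a rough sketch of the cleanup before reshaping; data_GG and the exact column contents are only assumed from the description above:
import pandas as pd

# assumed: the wide column names contain ':' and 'date' is stored as dd-mm-yy
data_GG.columns = data_GG.columns.str.replace(':', '', regex=False)
data_GG['date'] = pd.to_datetime(data_GG['date'], dayfirst=True).dt.strftime('%d/%m/%y')
GG_long = pd.wide_to_long(data_GG, stubnames='time_', i=['Customer', 'date'], j='Cons')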

error occurred when using df.fillna(0)

Very simple code using spark + python:
df = spark.read.option("header","true").csv(file_name)
df = df_abnor_matrix.fillna(0)
but an error occurred:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name
"cp_com.game.shns.uc" among (ProductVersion, IMEI, FROMTIME, TOTIME,
STATISTICTIME, TimeStamp, label, MD5, cp_com.game.shns.uc,
cp_com.yunchang....
What's wrong with it? cp_com.game.shns.uc is in the list.
Spark does not support the dot character in column names (check this issue), so you need to replace the dots with underscores before working on the CSV.
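For example, a minimal sketch of renaming every column so the dots become underscores (file_name as in the question):
df = spark.read.option("header", "true").csv(file_name)
# replace dots with underscores in all column names
df = df.toDF(*[c.replace('.', '_') for c in df.columns])
df = df.fillna(0)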

How could I order by sum, within a DataFrame in PySpark?

Analogously to:
order_items.groupBy("order_item_order_id").count().orderBy(desc("count")).show()
I have tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("sum")).show()
but this gives an error:
Py4JJavaError: An error occurred while calling o501.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'sum' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I have also tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal)")).show()
but I get the same error:
Py4JJavaError: An error occurred while calling o512.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'SUM(order_item_subtotal)' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I get the right result when executing:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal#429)")).show()
but this was done a posteriori, after having seen the number that Spark appends to the sum column name, i.e. #429.
Is there a way to get the same result but a priori, without knowing which number will be appended?
You should use aliases for your columns:
import pyspark.sql.functions as func

# alias the aggregate, then sort descending on the aliased column
order_items.groupBy("order_item_order_id")\
    .agg(func.sum("order_item_subtotal").alias("sum_column_name"))\
    .orderBy(func.desc("sum_column_name"))\
    .show()
