How can I order by a sum within a DataFrame in PySpark?

Analogously to:
order_items.groupBy("order_item_order_id").count().orderBy(desc("count")).show()
I have tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("sum")).show()
but this gives an error:
Py4JJavaError: An error occurred while calling o501.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'sum' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I have also tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal)")).show()
but I get the same error:
Py4JJavaError: An error occurred while calling o512.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'SUM(order_item_subtotal)' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I get the right result when executing:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal#429)")).show()
but this was done a posteriori, after having seen the number that Spark appends to the sum column name, i.e. #429.
Is there a way to get the same result but a priori, without knowing which number will be appended?

You should use an alias for the aggregated column:
import pyspark.sql.functions as func
order_items.groupBy("order_item_order_id")\
    .agg(func.sum("order_item_subtotal").alias("sum_column_name"))\
    .orderBy(func.desc("sum_column_name"))\
    .show()
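A minimal alternative sketch, not from the answer above, that sidesteps the generated column name by aliasing in Spark SQL; it assumes Spark 2.x or later with an existing spark session, and the view and alias names are illustrative:
# register the DataFrame as a temporary view and alias the sum in SQL
order_items.createOrReplaceTempView("order_items")
spark.sql("""
    SELECT order_item_order_id,
           SUM(order_item_subtotal) AS order_total
    FROM order_items
    GROUP BY order_item_order_id
    ORDER BY order_total DESC
""").show()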

Related

PySpark Join Error when using .show command?

I am trying to join two dataframes in pyspark. I created a unique key to do so. I keep getting an error which I do not understand.
Code used:
join = df.join(df2, on=['key'], how='inner')
join.show()
I checked the keys: they are correct and unique, and there should be matches. The error I get is as follows:
Py4JJavaError: An error occurred while calling .showString.
If I use this code:
left_join = df.join(df2, df.key == df2.key, 'left')
left_join.show()
I also get an error: Py4JJavaError: An error occurred while calling .showString.
If I remove the .show() command I don't get an error. Does anyone know why this is and how I can solve it?
Thanks in advance!
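A side note that may explain why the error only appears with .show(): Spark evaluates DataFrames lazily, so the join only builds a query plan, and nothing is executed (or can fail) until an action forces execution. A minimal sketch with hypothetical data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical tiny frames, just to illustrate the lazy-evaluation point
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['key', 'val_left'])
df2 = spark.createDataFrame([(1, 'x'), (3, 'y')], ['key', 'val_right'])
joined = df.join(df2, on=['key'], how='inner')  # lazy: nothing runs yet
joined.show()  # the action triggers execution; any real error surfaces here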

Reshaping pandas data frame - unique row error

I have a data frame (link to the data: https://pastebin.com/GzujhX3d).
I am trying to use the reshape functionality from the pandas package, and it keeps giving me the error that
"the id variables need to uniquely identify each row".
This is my code to reshape:
GG_long = pd.wide_to_long(data_GG, stubnames='time_', i=['Customer', 'date'], j='Cons')
The combination of 'Customer' and 'Date' uniquely identifies a row within my data, so I don't understand why it throws this error or how I can fix it. Any help is appreciated.
I was able to identify the issue. The error was due to two things: first, the column names having ":" in them, and second, the format of the date. For some reason it doesn't like dd-mm-yy; it works with dd/mm/yy instead.
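A hedged sketch of that fix; the file name and the exact column layout are illustrative, not taken from the linked data:
import pandas as pd

# load the wide data (file name is a placeholder)
data_GG = pd.read_csv('data_GG.csv')
# 1) strip ':' from the column names, e.g. 'time_1:00' -> 'time_100'
data_GG.columns = [c.replace(':', '') for c in data_GG.columns]
# 2) switch the date strings from dd-mm-yy to dd/mm/yy
data_GG['date'] = data_GG['date'].str.replace('-', '/')
GG_long = pd.wide_to_long(data_GG, stubnames='time_', i=['Customer', 'date'], j='Cons')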

Pyspark self-join with error "Resolved attribute(s) missing"

While doing a pyspark dataframe self-join I got an error message:
Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286 in operator !Project [...]. Attribute(s) with the same name appear in the operation: un_val. Please check if the right attribute(s) are used.;;
It is a simple dataframe self-join like the one below. On its own it works fine, but after a couple of operations on the dataframe, such as adding columns or joining with other dataframes, the error mentioned above is raised.
df.join(df,on='item_listed')
Using dataframe aliases like below won't work either; the same error message is raised:
df.alias('A').join(df.alias('B'), col('A.my_id') == col('B.my_id'))
I've found a Java workaround here: SPARK-14948. For pyspark it looks like this:
# add a "_r" suffix to every column name
newcols = [c + '_r' for c in df.columns]
# clone the dataframe with the columns renamed
df2 = df.toDF(*newcols)
# self-join on the original and renamed key columns
df.join(df2, df.my_column == df2.my_column_r)
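A small hedged follow-up: after the self-join both copies of every column are present, so you will often want to keep just one side. Assuming the same df and df2 as above:
# keep only the left-hand copy of each column after the self-join
joined = df.join(df2, df.my_column == df2.my_column_r)
result = joined.select(*df.columns)
result.show()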

Pandas apply function issue

I have a dataframe (data) of numeric variables and I want to analyse the distribution of each column by using the Shapiro test from scipy.
from scipy import stats
data.apply(stats.shapiro, axis=0)
But I keep getting the following error message:
ValueError: ('could not convert string to float: M', u'occurred at index 0')
I've checked the documentation and it says the first argument of the apply function should be a function, which stats.shapiro is (as far as I'm aware).
What am I doing wrong, and how can I fix it?
Found the problem. There was a column of type object, which resulted in the error message above. Applying the function only to the numeric columns solved the issue.
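A minimal sketch of that fix, assuming data is the pandas DataFrame from the question; select_dtypes filters out the non-numeric columns before applying the test:
from scipy import stats

# run the Shapiro test only on the numeric columns
numeric_data = data.select_dtypes(include='number')
results = numeric_data.apply(stats.shapiro, axis=0)
print(results)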

Error occurred when using df.fillna(0)

Very simple code using spark + python:
df = spark.read.option("header","true").csv(file_name)
df = df_abnor_matrix.fillna(0)
but an error occurred:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name
"cp_com.game.shns.uc" among (ProductVersion, IMEI, FROMTIME, TOTIME,
STATISTICTIME, TimeStamp, label, MD5, cp_com.game.shns.uc,
cp_com.yunchang....
What's wrong with it? cp_com.game.shns.uc is in the list.
Spark does not support the dot character in column names (see the related Spark issue), so you need to replace the dots with underscores before working on the csv.
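A hedged sketch of that rename, done on the already-loaded DataFrame rather than on the csv file itself (my reading of the advice above, not code from the answer):
# replace every dot in the column names with an underscore, then fill nulls
df_clean = df.toDF(*[c.replace('.', '_') for c in df.columns])
df_clean = df_clean.fillna(0)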
