I am trying to join two dataframes in pyspark. I created a unique key to do so. I keep getting an error which I do not understand.
Code used:
join = df.join(df2, on=['key'], how='inner')
join.show()
I checked the keys and they are correct and unique, and there should be matches. The error I get is as follows:
Py4JJavaError: An error occurred while calling .showString.
If I use this code:
left_join = df.join(df2, df.key == df2.key, 'left')
left_join.show()
I also get an error: Py4JJavaError: An error occurred while calling .showString.
If I remove the .show() call I don't get an error. Does anyone know why this is and how I can solve it?
Thanks in advance!
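A note on why removing .show() hides the error: join is a lazy transformation, so the line that builds the join does not execute anything; the failure only surfaces when an action such as show(), count() or collect() materializes the plan. A minimal sketch of that behaviour, reusing df and df2 from the question:
# building the join only creates a query plan; no data is read or shuffled yet,
# so this line alone does not raise the error
joined = df.join(df2, on=['key'], how='inner')

# any action (show, count, collect, write, ...) forces execution, and that is
# where the Py4JJavaError surfaces; the full Java stack trace printed after it
# usually points to the real root cause
joined.count()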
I had a big table which I sliced into many smaller tables based on their dates:
dfs = {}
for fecha in fechas:
    dfs[fecha] = df[df['date'] == fecha].set_index('Hour')

# now I can access the tables like this:
dfs['2019-06-23'].head()
I have made some modifications to the specific table dfs['2019-06-23'] and now I would like to save it on my computer. I have tried to do this in two ways:
# first try:
dfs['2019-06-23'].to_csv('specific/path/file.csv')

# second try:
test = dfs['2019-06-23']
test.to_csv('test.csv')
Both of them raised this error:
TypeError: get_handle() got an unexpected keyword argument 'errors'
I don't know why I get this error and haven't found any reason for it. I have saved many files this way but never had this happen before.
My goal: to be able to save this dataframe as a CSV after my modifications.
If you are getting this error, there are two things to check (a quick sketch of both checks is shown below):
Whether what you think is a DataFrame is actually a Series - see (Pandas : to_csv() got an unexpected keyword argument)
Your numpy version. For me, updating to numpy==1.20.1 with pandas==1.2.2 fixed the problem. If you are using Jupyter notebooks, remember to restart the kernel afterwards.
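A minimal sketch of both checks, assuming the dfs dictionary from the question:
import numpy as np
import pandas as pd

# 1. confirm the sliced table really is a DataFrame and not a Series
obj = dfs['2019-06-23']
print(type(obj))
print(isinstance(obj, pd.Series))   # True would point to the Series case in the linked question

# 2. confirm which library versions the running kernel actually loaded
print(np.__version__, pd.__version__)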
In the end, what worked was wrapping the table in pd.DataFrame and exporting it as follows:
to_export = pd.DataFrame(dfs['2019-06-23'])
to_export.to_csv('my_table.csv')
That surprised me, because when I checked the type of the table at the time of the error it was a DataFrame. However, this way it works.
While doing a pyspark dataframe self-join I got an error message:
Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286 in operator !Project [...]. Attribute(s) with the same name appear in the operation: un_val. Please check if the right attribute(s) are used.;;
It is a simple dataframe self-join like the one below. It works fine on its own, but after a couple of operations on the dataframe, such as adding columns or joining with other dataframes, the error above is raised.
df.join(df, on='item_listed')
Using dataframe aliases like the one below won't work either; the same error message is raised:
df.alias('A').join(df.alias('B'), col('A.my_id') == col('B.my_id'))
I've found a Java workaround here (SPARK-14948), and for pyspark it looks like this:
# add a "_r" suffix to every column name
newcols = [c + '_r' for c in df.columns]

# clone the dataframe with the columns renamed
df2 = df.toDF(*newcols)

# self-join on the original and the suffixed key columns
df.join(df2, df.my_column == df2.my_column_r)
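If the suffixed copy of the join key is not needed downstream, it can be dropped right after the join; a small follow-up sketch using the same hypothetical column names:
# keep the joined result and drop the suffixed duplicate of the join key
joined = df.join(df2, df.my_column == df2.my_column_r)
joined = joined.drop('my_column_r')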
Very simple code using spark + python:
df = spark.read.option("header","true").csv(file_name)
df = df.fillna(0)
but an error occurred:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name
"cp_com.game.shns.uc" among (ProductVersion, IMEI, FROMTIME, TOTIME,
STATISTICTIME, TimeStamp, label, MD5, cp_com.game.shns.uc,
cp_com.yunchang....
What's wrong with it? cp_com.game.shns.uc is in the list.
Spark does not support the dot character in column names (check the linked issue), so you need to replace the dots with underscores before working on the CSV, for example as sketched below.
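A minimal sketch of that renaming, reusing df and file_name from the question:
# read the csv as before, then rename every column so that dots become underscores
df = spark.read.option("header", "true").csv(file_name)
df = df.toDF(*[c.replace('.', '_') for c in df.columns])

# operations that resolve column names (such as fillna) should now work
df = df.fillna(0)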
I have tried to write the following code but I get the error message: "ValueError: Cannot shift with no freq."
I have no idea how to fix it. I tried to google the error message but couldn't find any case similar to mine.
df is a python pandas dataframe for which I want to create new columns showing the daily change. The code is shown below. How can I fix the code to avoid the value error?
for column_names in df:
    df[column_names + '%-daily'] = df[column_names].pct_change(freq=1).fillna(0)
The problem was that I had the date as the index. Since only weekdays were present, shifting by a calendar frequency did not work as intended, so I changed from freq to periods:
for column_names in list(df.columns.values):
    df[column_names + '%-daily'] = df[column_names].pct_change(periods=1).fillna(0)
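A small worked example of the fix, with made-up weekday-only data:
import pandas as pd

# made-up example: a weekday-only date index, like the one in the question
example = pd.DataFrame(
    {'price': [100.0, 102.0, 101.0, 103.0]},
    index=pd.to_datetime(['2019-06-20', '2019-06-21', '2019-06-24', '2019-06-25']),
)

# periods=1 compares each row with the previous row, regardless of calendar gaps,
# so the weekend gap between 06-21 and 06-24 does not cause a shift error
example['price%-daily'] = example['price'].pct_change(periods=1).fillna(0)
print(example)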
Analogously to:
order_items.groupBy("order_item_order_id").count().orderBy(desc("count")).show()
I have tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("sum")).show()
but this gives an error:
Py4JJavaError: An error occurred while calling o501.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'sum' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I have also tried:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal)")).show()
but I get the same error:
Py4JJavaError: An error occurred while calling o512.sort.
: org.apache.spark.sql.AnalysisException: cannot resolve 'SUM(order_item_subtotal)' given input columns order_item_order_id, SUM(order_item_subtotal#429);
I get the right result when executing:
order_items.groupBy("order_item_order_id").sum("order_item_subtotal").orderBy(desc("SUM(order_item_subtotal#429)")).show()
but this was done a posteriori, after having seen the number that Spark appends to the sum column name, i.e. #429.
Is there a way to get the same result but a priori, without knowing which number will be appended?
You should use an alias for the aggregated column and then sort by that alias (func.desc keeps the descending order from your count example):
import pyspark.sql.functions as func

order_items.groupBy("order_item_order_id") \
    .agg(func.sum("order_item_subtotal").alias("sum_column_name")) \
    .orderBy(func.desc("sum_column_name")) \
    .show()