Error occurred when using df.fillna(0) - python

Very simple code using spark + python:
df = spark.read.option("header","true").csv(file_name)
df = df.fillna(0)
but an error occurred:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name
"cp_com.game.shns.uc" among (ProductVersion, IMEI, FROMTIME, TOTIME,
STATISTICTIME, TimeStamp, label, MD5, cp_com.game.shns.uc,
cp_com.yunchang....
What's wrong with it? cp_com.game.shns.uc is in the list.

Spark does not support the dot character in plain column names (see the linked Spark issue), so you need to replace the dots with underscores before working on the CSV.
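A minimal sketch of the rename, using a hypothetical column list modeled on the error message (the last name is made up, since the original message is truncated):

```python
# Hypothetical header based on the error message in the question.
columns = ["ProductVersion", "IMEI", "FROMTIME", "TOTIME", "label",
           "cp_com.game.shns.uc", "cp_com.yunchang.x"]

# Replace every dot with an underscore so Spark's analyzer no longer
# tries to interpret the dotted name as a struct-field access.
safe_columns = [c.replace(".", "_") for c in columns]
print(safe_columns)
```

With the actual DataFrame, the renamed list is applied with `df = df.toDF(*safe_columns)` before calling `fillna(0)`.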


How to remove commas in a column within a Pyspark Dataframe

Hi all, thanks for taking the time to help me with this.
Right now I have uploaded a CSV into Spark, and the type of the dataframe is pyspark.sql.dataframe.DataFrame.
I have a column of numbers (stored as strings in this case). They are numbers like 6,000 and I just want to remove all the commas from them. I have tried df.select("col").replace(',', '') and df.withColumn('col', regexp_replace('col', ',', '')), but I seem to be getting the error "DataFrame object does not support item assignment".
Any ideas? I'm fairly new to Spark.
You should strip the commas and then cast it (casting a string like "6,000" directly yields null):
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import IntegerType
df = df.withColumn("col", regexp_replace("col", ",", "").cast(IntegerType()))
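The transformation itself can be sanity-checked in plain Python with made-up sample values; the commented lines show the equivalent PySpark calls, assuming df holds a string column named "col":

```python
# Made-up sample values in the question's "6,000" format.
raw_values = ["6,000", "12,345", "7"]

# Strip the thousands separators, then convert to integers.
cleaned = [int(v.replace(",", "")) for v in raw_values]
print(cleaned)  # [6000, 12345, 7]

# PySpark equivalent (assuming a DataFrame df with string column "col"):
#   from pyspark.sql.functions import regexp_replace
#   from pyspark.sql.types import IntegerType
#   df = df.withColumn("col", regexp_replace("col", ",", "").cast(IntegerType()))
```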

reshaping pandas data frame- Unique row error

I have a data frame as shown in the linked data below.
I am trying to use the reshape function from the pandas package and it keeps giving me the error
"the id variables need to uniquely identify each row".
This is my code to reshape:
link to the data: https://pastebin.com/GzujhX3d
GG_long=pd.wide_to_long(data_GG,stubnames='time_',i=['Customer', 'date'], j='Cons')
The combination of 'Customer' and 'date' uniquely identifies each row in my data, so I don't understand why it throws this error or how I can fix it. Any help is appreciated.
I identified the issue. The error was due to two things: first, some column names contained ":" characters; second, the date format. For some reason it doesn't like dd-mm-yy, but it works with dd/mm/yy.
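A minimal runnable sketch of the reshape with made-up data (the stub column names are hypothetical), showing that wide_to_long works once the i columns uniquely identify each row, and that duplicating an i pair reproduces the error:

```python
import pandas as pd

# Toy data in the question's shape; each Customer/date pair is unique.
data_GG = pd.DataFrame({
    "Customer": ["A", "B"],
    "date":     ["01/02/18", "01/02/18"],
    "time_1":   [10, 20],
    "time_2":   [30, 40],
})

GG_long = pd.wide_to_long(data_GG, stubnames="time_",
                          i=["Customer", "date"], j="Cons")
print(GG_long)

# Duplicating a (Customer, date) pair raises the
# "the id variables need to uniquely identify each row" ValueError:
dup = pd.concat([data_GG, data_GG.iloc[[0]]], ignore_index=True)
try:
    pd.wide_to_long(dup, stubnames="time_", i=["Customer", "date"], j="Cons")
except ValueError as e:
    print(e)
```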

Pyspark self-join with error "Resolved attribute(s) missing"

While doing a pyspark dataframe self-join I got an error message:
Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286 in operator !Project [...]. Attribute(s) with the same name appear in the operation: un_val. Please check if the right attribute(s) are used.;;
It is a simple dataframe self-join like the one below. It works fine on its own, but after a couple of operations on the dataframe, such as adding columns or joining with other dataframes, the error mentioned above is raised.
df.join(df, on='item_listed')
Using dataframe aliases like below won't work either, and the same error message is raised:
df.alias('A').join(df.alias('B'), col('A.my_id') == col('B.my_id'))
I've found a workaround in SPARK-14948, and for pyspark it looks like this:
# Add a "_r" suffix to the column names
newcols = [c + '_r' for c in df.columns]
# Clone the dataframe with the columns renamed
df2 = df.toDF(*newcols)
# Self-join against the renamed copy
df.join(df2, df.my_column == df2.my_column_r)

Pandas error - "ValueError: labels ['attributes'] not contained in axis"

I am extracting data from the Salesforce system and converting it to a DataFrame when I get an error:
ValueError: labels ['attributes'] not contained in axis.
Given below is my Python script:
raw = sf_data_cursor.bulk.Case.query('''SELECT Id, Status, AccountName__c, AccountId FROM Case''')
raw_df = pd.DataFrame(raw).drop('attributes', axis=1, inplace=False)
Could anyone assist?
Generally, this error occurs when the column you're trying to drop (in this case attributes) doesn't exist in the DataFrame.
Run pd.DataFrame(raw).columns and check whether the output actually includes the column name you're trying to drop.
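A sketch with made-up records shaped like Salesforce bulk query rows: dropping with errors="ignore" keeps the script working whether or not the 'attributes' metadata is present in the result:

```python
import pandas as pd

# Made-up records; here the 'attributes' metadata key is absent, which
# is exactly what makes a plain drop('attributes', axis=1) raise
# "labels ['attributes'] not contained in axis".
raw = [{"Id": "1", "Status": "Open", "AccountId": "A1"},
       {"Id": "2", "Status": "Closed", "AccountId": "A2"}]

raw_df = pd.DataFrame(raw)
print(raw_df.columns.tolist())  # inspect the columns before dropping

# errors="ignore" drops 'attributes' when present and no-ops otherwise.
raw_df = raw_df.drop(columns="attributes", errors="ignore")
```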

feature_names must be unique - Xgboost

I am running the xgboost model on a very sparse matrix.
I am getting this error: ValueError: feature_names must be unique
How can I deal with this?
This is my code.
yprob = bst.predict(xgb.DMatrix(test_df))[:,1]
According to the xgboost source code, this error is raised in only one place, inside a DMatrix internal function. Here's the source code excerpt:
if len(feature_names) != len(set(feature_names)):
    raise ValueError('feature_names must be unique')
So, the error text is pretty literal here; your test_df has at least one duplicate feature/column name.
You've tagged pandas on this post; that suggests test_df is a Pandas DataFrame. In this case, DMatrix literally runs df.columns to extract feature_names. Check your test_df for repeat column names, remove or rename them, and then try DMatrix() again.
Assuming the problem is indeed that columns are duplicated, the following line should solve your problem:
test_df = test_df.loc[:,~test_df.columns.duplicated()]
Source: python pandas remove duplicate columns
This line should identify which columns are duplicated:
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
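Both lines can be checked on a tiny made-up frame with a deliberately duplicated column name:

```python
import pandas as pd

# Tiny frame with a duplicated feature name, mimicking the DMatrix error.
test_df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# Identify the repeated names (every occurrence after the first).
duplicate_columns = test_df.columns[test_df.columns.duplicated()]
print(list(duplicate_columns))

# Keep only the first occurrence of each column name.
test_df = test_df.loc[:, ~test_df.columns.duplicated()]
print(list(test_df.columns))
```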
Alternatively, make sure the column names are unique while preparing the data in the first place, and then it should work out.
I converted them with np.array(df) and my problem was solved: a NumPy array has no column names, so DMatrix no longer receives duplicate feature_names.
