How to fix "first row in RDD is empty" error in PySpark - Python

I have the following problem: I am working with PySpark to build a model for a regression problem.
df = spark.read.format("csv").option("header", "true").option("mode", "DROPMALFORMED").load(path)
I cast everything to double/int and it works well. With this line I see no null values:
df.describe().toPandas()
I separate the label from the features:
from pyspark.ml.linalg import DenseVector  # import needed for DenseVector
rdd_ml = df.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
but the next line throws the error ValueError("The first row in RDD is empty, ..."):
df_ml = spark.createDataFrame(rdd_ml, ['label', 'features'])
If I execute this line, for example, I see nothing strange:
df.rdd.collect()
Is there something I am missing here?
Thank you for any advice on this.
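One way to narrow this down (a sketch, assuming df and rdd_ml as defined above): inspect the first element of the mapped RDD, which is what createDataFrame complains about, and count rows whose label field is null, since mode DROPMALFORMED can still leave null fields behind.
# Sketch: look at the first row createDataFrame sees, then count rows
# whose first field (the label) is null. Both assume df as loaded above.
print(rdd_ml.first())
print(df.rdd.filter(lambda x: x[0] is None).count())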

Related

Cannot resolve stack() due to type mismatch

I have PySpark code that looks like this:
from pyspark.sql.functions import expr
unpivotExpr = """stack(14, 'UeEnd', UeEnd,
'Encedreco', Endereco,
'UeSitFun', UeSitFun,
'SitacaoEscola', SituacaoEscola,
'Creche', Creche,
'PreEscola', PreEscola,
'FundAnosIniciais', FundAnosIniciais,
'FundAnosFinais', FundAnosFinais,
'EnsinoMedio', EnsinoMedio,
'Profissionalizante', Profissionalizante,
'EJA', EJA,
'EdEspecial', EdEspecial,
'Conveniada', Conveniada,
'TipoAtoCriacao', TipoAtoCriacao)
as (atributo, valor)"""
unpivotDf = df.select("Id", expr(unpivotExpr))
When I run it I get this error:
cannot resolve 'stack(14, 'UeEnd', `UeEnd`, 'Encedreco', `Endereco`, 'UeSitFun', `UeSitFun`,
'SitacaoEscola', `SituacaoEscola`, 'Creche', `Creche`, 'PreEscola', `PreEscola`,
'FundAnosIniciais', `FundAnosIniciais`, 'FundAnosFinais', `FundAnosFinais`, 'EnsinoMedio',
`EnsinoMedio`, 'Profissionalizante', `Profissionalizante`, 'EJA', `EJA`, 'EdEspecial',
`EdEspecial`, 'Conveniada', `Conveniada`, 'TipoAtoCriacao', `TipoAtoCriacao`)'
due to data type mismatch: Argument 2 (string) != Argument 6 (bigint); line 1 pos 0;
What might be causing this problem?
When you unpivot a group of columns, all of their values end up in the same column. Because of that, you should first make sure that all of the columns you are trying to unpivot into one have the same data type. Otherwise you would end up with a single column holding different types in different rows.
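For example, a minimal sketch (the column list is taken from the question; casting everything to string is just one assumed way of unifying the types):
from pyspark.sql.functions import col, expr

cols = ["UeEnd", "Endereco", "UeSitFun", "SituacaoEscola", "Creche",
        "PreEscola", "FundAnosIniciais", "FundAnosFinais", "EnsinoMedio",
        "Profissionalizante", "EJA", "EdEspecial", "Conveniada", "TipoAtoCriacao"]

# Cast every column that will be unpivoted to string so stack() sees one type.
df_cast = df.select("Id", *[col(c).cast("string").alias(c) for c in cols])
unpivotDf = df_cast.select("Id", expr(unpivotExpr))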

Error occurred when changing a for loop to .apply()

I am currently working on finding the position (row index, column index) of the maximum cell in each column of a dataframe.
There are a lot of similar dataframes like this, so I made a function like the one below.
def FindPosition_max(series_max, txt_name):
    # series_max: only needed to get the number of columns in time_history (index starts from 1)
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None,
                               usecols=[i for i in range(1, len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = []
    for current_col_index in series_max.index:
        row_index.append(time_history.loc[:, current_col_index].idxmax())
    return row_index, col_index.tolist()
This works well, but takes too much time to run over many dataframes. I read that .apply() is much faster than a for loop, so I tried this:
def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None,
                               usecols=[i for i in range(1, len(series_max)+1)])[:-2]
    col_index = series_max.index
    row_index = pd.Series(series_max.index).apply(lambda x: time_history.loc[:, x].idxmax())
    return row_index, series_max.index.tolist()
And the error comes out like this:
File "C:\Users\hwlee\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 844, in _list_of_series_to_arrays
indexer = indexer_cache[id(index)] = index.get_indexer(columns)
AttributeError: 'builtin_function_or_method' object has no attribute 'get_indexer'
I tried to find what causes this error, but it never goes away. Also, when I test the code inside the function separately, it works well.
Could anyone help me solve this problem? Thank you!
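As a side note, pandas can vectorize this pattern directly: DataFrame.idxmax returns, for each column, the row label of its maximum. A sketch of the function without the loop or the .apply(), under the same assumptions about the file layout as above:
def FindPosition_max(series_max, txt_name):
    time_history = pd.read_csv(txt_name, skiprows=len(series_max)+7, header=None,
                               usecols=[i for i in range(1, len(series_max)+1)])[:-2]
    # idxmax computes the row label of the maximum for every column in one call.
    row_index = time_history.idxmax().tolist()
    return row_index, series_max.index.tolist()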

Have I found a bug or made a mistake in pandas df.tail()?

OK, I've got a weird one. I might have found a bug, but let's assume I made a mistake first. Anyway, I am running into some issues with pandas.
I want to locate the last two rows of a dataframe to compare the values of column 'Col'. I run the code inside a for loop because it needs to run on all files in a folder. This code:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.iloc[1]['Col']
    valB = df.iloc[0]['Col']
mostly works. I ran it over 1040 data frames without issues. Then, at number 1041 of about 2000, it causes this error:
Traceback (most recent call last):
File "/path/to/script.py", line 206, in <module>
valA = df.iloc[1]['Col']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1830, in _getitem_axis
self._is_valid_integer(key, axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1713, in _is_valid_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
From this I thought the data frame might be too short. It shouldn't be (I test for this elsewhere), but OK, mistakes happen, so let's print(df) to figure this out.
If I print(df) before the assignment of .tail(2) like this:
print(df)
df = df[['Col']].tail(2)
valA = df.iloc[1]['Col']
valB = df.iloc[0]['Col']
I see a data frame of 37 rows. In my world, 37 > 2.
Now, let's move the print(df) down one line like so:
df = df[['Col']].tail(2)
print(df)
The output is usually two lines, as one would expect. However, at the failing item, df.tail(2) returns a single-row data frame out of a data frame with 37 rows. Not two rows, one row. And this only happens for one item in the loop; all others work fine. If I skip over that item manually like so:
for item in itemList:
    if item == 'troublemaker':
        continue
... the script runs through to the end. No errors happen.
I must add, I am fairly new to all this, so I might be overlooking something entirely. Am I? Suggestions appreciated. Thanks.
Edit: Here's the output of print(df) in the error case:
             Col
Date
2018-11-30  True
and in all other cases:
              Col
Date
2018-10-31  False
2018-11-30   True
Since it does not have a second row, that is why it returns the error. Try using tail and head instead, but be aware that for your one-row sample df, valA and valB will be the same value:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.tail(1)['Col']
    valB = df.head(1)['Col']
I don't think it's a bug, since it only happens to one df out of 2000. Can you show that df?
I also don't think you need tail here. Have you tried
valA = df.iloc[-1]['Col']
valB = df.iloc[-2]['Col']
to get the last values?
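If some frames can legitimately end up with fewer than two rows, a simple length check avoids the IndexError entirely. A sketch (the skip message is just illustrative):
if len(df) >= 2:
    valA = df.iloc[-1]['Col']
    valB = df.iloc[-2]['Col']
else:
    # Handle short frames however fits the pipeline; here we just report.
    print('skipping %s: fewer than two rows' % item)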

Pandas KeyError: value not in index

I have the following code:
import numpy as np
import pandas as pd

df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It has always worked until the csv file doesn't have full coverage of all weekdays. For example, with the following .csv file:
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.
Use reindex to get all the columns you need. It will preserve the ones that are already there and add empty columns for the missing ones.
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'])
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
# fill_value=0 keeps the newly added columns numeric; without it they are
# all NaN and the astype(int) below fails on non-finite values.
p = p.reindex(columns=columns, fill_value=0)
p[columns] = p[columns].astype(int)
I had a very similar issue. I got the same error because the csv contained spaces in the header: my csv had a header "Gender " (with a trailing space) and I had it listed as:
[['Gender']]
If it's easy enough for you to access your csv, you can use the Excel formula TRIM() to clip any spaces in the cells,
or strip them in pandas like this:
df.columns = df.columns.str.strip()
Please try this to clean and format your column names:
# regex=False treats '(' and ')' as literal characters; as bare regex
# patterns they would be invalid.
df.columns = (df.columns.str.strip()
                        .str.upper()
                        .str.replace(' ', '_', regex=False)
                        .str.replace('(', '', regex=False)
                        .str.replace(')', '', regex=False))
I had the same issue.
During initial development I used a .csv file (comma as separator) that I had modified a bit before saving it. After saving, the commas had become semicolons.
On Windows this depends on the "Regional and Language Options" Customize screen, where you find a List separator: this is the character Windows applications expect to be the CSV separator.
When testing from a brand-new file I encountered that issue.
I removed the 'sep' argument from the read_csv call.
before:
df1 = pd.read_csv('myfile.csv', sep=',')
after:
df1 = pd.read_csv('myfile.csv')
That way, the issue disappeared.
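As a side note, pandas can also infer the delimiter by itself if you pass sep=None together with the Python engine (slower, but it detects comma vs. semicolon automatically):
# sep=None makes read_csv sniff the delimiter via csv.Sniffer;
# this requires engine='python'.
df1 = pd.read_csv('myfile.csv', sep=None, engine='python')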

pandas crashes on repeated DataFrame.reset_index()

Very weird bug here: I'm using pandas to merge several dataframes. As part of the merge, I have to call reset_index several times. But when I do, it crashes unexpectedly on the second or third use of reset_index.
Here's minimal code to reproduce the error:
import pandas
A = pandas.DataFrame({
    'val' : ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra' : range(10,14),
})
A = A.reset_index()
A = A.reset_index()
A = A.reset_index()
Here's the relevant part of the traceback:
....
A = A.reset_index()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2393, in reset_index
new_obj.insert(0, name, _maybe_cast(self.index.values))
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1787, in insert
self._data.insert(loc, column, value)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 893, in insert
raise Exception('cannot insert %s, already exists' % item)
Exception: cannot insert level_0, already exists
Any idea what's going wrong here? How do I work around it?
Inspecting frame.py, it looks like pandas tries to insert a column named 'index' or 'level_0', and throws the error if that name is already taken.
Fortunately, there's a "drop" option: instead of inserting the old index as a new column, reset_index simply discards it and falls back to the default integer index. You lose the old index values that way, but if you don't need them you're okay.
"Fixed" code:
import pandas
A = pandas.DataFrame({
    'val' : ['aaaaa', 'acaca', 'ddddd', 'zzzzz'],
    'extra' : range(10,14),
})
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)
A = A.reset_index(drop=True)
You can also do it in place:
A.reset_index(drop=True, inplace=True)
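If you actually want to keep each old index as a column, a sketch of an alternative is to give the index a unique name before each reset, so the inserted columns never collide ('idx_0' is just an illustrative name):
A.index.name = 'idx_0'  # unique name for the column about to be inserted
A = A.reset_index()     # inserts 'idx_0' instead of 'index'/'level_0'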
