I keep getting the following error on Databricks:
SparkPandasNotImplementedError: .iloc requires numeric slice or conditional boolean Index, got You are trying to use pandas function .iloc[..., ...], use spark function select, where
This is my code:
import re
import nltk
import heapq
corpus = []
for i in range(0, len(Y)):
    describe = re.sub('[^a-zA-Z]', ' ', Y.iloc[i, 0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
The code works fine in Spyder, but not in Databricks.
I reproduced your issue successfully with the code below.
import numpy as np
import pandas as pd
import databricks.koalas as ks
dates = pd.date_range('20130101', periods=6)
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df = ks.from_pandas(pdf)
print(pdf.iloc[0, 0])  # pandas: works
print(df.iloc[0, 0])   # Koalas: raises SparkPandasNotImplementedError
Since you did not describe your variable Y, I am guessing that Y is a dataframe; the difference is that it is a pandas dataframe in your local Spyder session but a Koalas dataframe in Databricks.
According to the Koalas documentation for databricks.koalas.DataFrame.iloc, it does not support the .iloc[int, int] operation on a Koalas dataframe.
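Based on that error message, a plain row slice should still work on a Koalas dataframe (a hedged example; I have not verified every indexer Koalas accepts):

print(df.iloc[:3])  # numeric slice over rows, which the error message says is supported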
So if you want to operate on the first column value of each row in Databricks, there are two solutions:
1. Make sure Y is a pandas dataframe within your Databricks notebook (a sketch for this follows the iterrows example below).
2. If Y must be a Koalas dataframe, iterate over its rows with iterrows, as in the code below.
# Here, `Y` is a Koalas dataframe
for row in Y.iterrows():
    describe = re.sub('[^a-zA-Z]', ' ', row[1][0])
    describe = describe.lower()
    describe = describe.split()
    describe = ' '.join(describe)
    corpus.append(describe)
As the sample code above shows, the iterrows function can get the first column value of each row.
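For the first solution, a minimal sketch (assuming Y is a Koalas dataframe small enough to collect into driver memory) is to convert it back to pandas before the loop, so the original .iloc code runs unchanged:

import re

# Convert the Koalas dataframe back to a pandas dataframe on the driver
Y_pd = Y.to_pandas()

corpus = []
for i in range(len(Y_pd)):
    describe = re.sub('[^a-zA-Z]', ' ', Y_pd.iloc[i, 0])
    corpus.append(' '.join(describe.lower().split()))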
I want to use Pandas + Uncertainties. I am getting a strange error; below is a MWE:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df.loc[0,'b'] = ufloat(3,1) # This line fails.
I have noticed that if I try to add the ufloats "on the fly", as I usually do with floats and other types, it fails. If I first create a Series, it works:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df['b'] = pandas.Series([ufloat(3,1)]) # Now it works.
print(df)
This makes it more cumbersome to calculate values on the fly within a loop, as I have to create a temporary Series and, after the loop, add it as a column to my data frame.
Is this a problem of Pandas, a problem of Uncertainties, or am I doing something that is not supposed to be done?
The problem arises because when pandas tries to create a new column it checks the dtype of the new value so that it knows what dtype to assign to that column. For some reason, the dtype check on the ufloat value fails. I believe this is a bug that will have to be fixed in uncertainties.
A workaround in the interim is to manually create the new column with dtype set to object, for example in your case above:
from uncertainties import ufloat
import pandas
import numpy
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
# create a new column with the correct dtype
df.loc[:, 'b'] = numpy.zeros(len(df), dtype=object)
df.loc[0,'b'] = ufloat(3,1) # This line now works.
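If you are filling values on the fly in a loop, another interim option is to collect the results in a plain Python list and assign the whole column once at the end, which avoids a temporary Series per iteration (a sketch, assuming every computed value is a ufloat):

from uncertainties import ufloat
import pandas

df = pandas.DataFrame({'a': [ufloat(2, 1), ufloat(4, 2)]})

# Accumulate values in an ordinary list inside the loop
values = []
for x in df['a']:
    values.append(x + ufloat(3, 1))  # hypothetical per-row computation

# Assign the column once, with dtype object
df['b'] = pandas.Series(values, index=df.index, dtype=object)
print(df)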
I want to replace "?" with NaN in Python.
The following code does not work, and I am not sure what is the reason.
import pandas as pd
import numpy as np
col_names = ['BI_RADS', 'age','shape','margin','density','severity']
dataset = pd.read_csv('mammographic_masses.data.txt', names = col_names)
dataset.replace("?", np.NaN)
After executing the above code, I still get those question marks in the dataset.
The format of the dataset looks like the following:
5,67,3,5,3,1
4,43,1,1,?,1
5,58,?,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
Use inplace=True
Ex:
dataset.replace("?", np.NaN, inplace=True)
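Note that replace returns a new dataframe by default, which is why the original call appeared to do nothing. Two alternatives: assign the result back, or mark "?" as missing while reading the file with na_values:

# Equivalent to inplace=True: assign the returned dataframe back
dataset = dataset.replace("?", np.NaN)

# Or treat "?" as NaN at read time
dataset = pd.read_csv('mammographic_masses.data.txt', names=col_names, na_values='?')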
I want to divide a dataframe into two based on a number
train = corpus.iloc[:, :10000]
test = corpus.iloc[:, 10000:]
This is the code that I am using.
I am getting the following error:
AttributeError: iloc not found
Is iloc not part of Python 3? Is there another method to split the data based on the number of records?
Edit
As mentioned by the user @craig, iloc is pandas, and the datatype that I have is a sparse matrix (scipy.sparse.csr.csr_matrix).
No need for iloc; you can use a row slice directly:
Pandas
import pandas as pd
df = pd.DataFrame(range(10))
df_first_half = df[:5]
df_second_half = df[5:]
Scipy
import numpy as np
from scipy.sparse import csr_matrix
x = csr_matrix((10, 3), dtype=np.int8)
x_first_half = x[:5].toarray()
x_second_half = x[5:].toarray()
If you're unfamiliar with the [5:] notation, see: https://scipy-cookbook.readthedocs.io/items/Indexing.html. Briefly, it's a one-dimensional slice (rows). Multi-dimensional slicing, e.g. [5:, :1], is also available.
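Applied to the question's case (assuming corpus is the scipy.sparse.csr_matrix mentioned in the edit), the split at 10000 records becomes:

# Row-wise split of the sparse matrix at record 10000
train = corpus[:10000]
test = corpus[10000:]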
I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form, leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on @Ami Tavory's and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the * (unpacking) operator, defined here but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster; as it turns out, converting to a numpy array beforehand is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similar to this:
import numpy as np
from scipy.stats import friedmanchisquare

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
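Since iterating over a 2-D NumPy array yields its rows, the unpacking can be written even more compactly (a small variation on the above, assuming each row of a is one group of measurements):

import numpy as np
from scipy.stats import friedmanchisquare

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
# Unpacking the array directly passes each row as a separate sample
print(friedmanchisquare(*a))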
I am an R user currently learning Python. At work I often need to reshape dataframes in which each cell contains a string. In R this is easy with dcast from the reshape2 package. I want to do something similar with pandas, as in the script below:
import numpy as np
import pandas as pd
temp = pd.DataFrame(index=np.arange(10), columns=['a','b','c','d'])
temp['a'] = 'A'
temp['b'] = 'B'
temp['c'] = 'C'
temp['d'] = 'D'
temp = pd.melt(temp, id_vars=['a','b'])
temp
pd.pivot_table(temp,index=['a','b'],columns='variable',values='value')
It keeps giving me the error DataError: No numeric types to aggregate. I think aggfunc is the issue, because the default is np.mean. Is there another aggfunc that keeps the cell contents rather than computing a value from them?
You can pass your own function as aggfunc. For example, to join the unique strings in each cell:
pd.pivot_table(temp, index=['a','b'], columns='variable', values='value',
               aggfunc=lambda x: ', '.join(x.unique()))
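If every (a, b, variable) combination holds exactly one value, no real aggregation is needed at all; aggfunc='first' simply keeps the single string in each cell (a sketch on the question's temp frame):

pd.pivot_table(temp, index=['a','b'], columns='variable',
               values='value', aggfunc='first')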