Using a subset of Pandas dataframe with Scipy Kmeans?


I have a DataFrame that I import using df = pd.read_csv('my.csv',sep=','). In that CSV file, the first row holds the column names, and the first column holds the observation names.
I know how to select a subset of the pandas DataFrame, using:
df.iloc[:,1::]
which gives me only the numeric values. But when I try to use this with scipy.cluster.vq.kmeans using this command,
kmeans(df.iloc[:,1::],3)
I get the error 'DataFrame' object has no attribute 'dtype'
Any suggestions?

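Here is an example that uses KMeans from scikit-learn instead. Note that the DataFrame is converted to a NumPy array with .values before it is passed to the model.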
from sklearn.datasets import make_blobs
from itertools import product
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# try to simulate your data
# =====================================================
X, y = make_blobs(n_samples=1000, n_features=10, centers=3)
columns = ['feature' + str(x) for x in np.arange(1, 11, 1)]
d = {key: values for key, values in zip(columns, X.T)}
d['label'] = y
data = pd.DataFrame(d)
Out[72]:
feature1 feature10 feature2 ... feature8 feature9 label
0 1.2324 -2.6588 -7.2679 ... 5.4166 8.9043 2
1 0.3569 -1.6880 -5.7671 ... -2.2465 -1.7048 0
2 1.0177 -1.7145 -5.8591 ... -0.5755 -0.6969 0
3 1.5735 -0.0597 -4.9009 ... 0.3235 -0.2400 0
4 -0.1042 -1.6703 -4.0541 ... 0.4456 -1.0406 0
.. ... ... ... ... ... ... ...
995 -0.0983 -1.4569 -3.5179 ... -0.3164 -0.6685 0
996 1.3151 -3.3253 -7.0984 ... 3.7563 8.4052 2
997 -0.9177 0.7446 -4.8527 ... -2.3793 -0.4038 0
998 2.0385 -3.9001 -7.7472 ... 5.2290 9.2281 2
999 3.9357 -7.2564 5.7881 ... 1.2288 -2.2305 1
[1000 rows x 11 columns]
# fit your data with KMeans
# =====================================================
kmeans = KMeans(n_clusters=3)
kmeans.fit_predict(data.iloc[:, :-1].values)  # .ix was removed in pandas 1.0; use .iloc
Out[70]: array([1, 0, 0, ..., 0, 1, 2], dtype=int32)
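The answer above switches to scikit-learn. If you would rather stay with scipy.cluster.vq.kmeans as in the question, one likely fix is to hand SciPy a plain float array instead of the DataFrame. The following is only a sketch: the toy data and the column names are made up to stand in for my.csv.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical stand-in for my.csv: first column is the observation name,
# the remaining columns are numeric.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(size=(20, 2)) for c in centers])
df = pd.DataFrame(points, columns=["a", "b"])
df.insert(0, "name", [f"obs{i}" for i in range(len(df))])

# Convert the numeric subset to a plain float ndarray before calling SciPy.
obs = df.iloc[:, 1:].to_numpy(dtype=float)

# whiten rescales each column to unit variance, which SciPy recommends
# doing before kmeans.
obs_w = whiten(obs)

codebook, distortion = kmeans(obs_w, 3)  # centroids and mean distortion
labels, _ = vq(obs_w, codebook)          # nearest-centroid index for each row
print(codebook.shape, labels[:5])
```

Note that SciPy's kmeans returns the centroids (the codebook), not per-row labels, so vq is used afterwards to assign each observation to a cluster.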

Related

Converting pandas.core.series.Series to dataframe with multiple column names

My toy example is as follows:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
### prepare data
Xy = np.c_[load_iris(return_X_y=True)]
mycol = ['x1','x2','x3','x4','group']
df = pd.DataFrame(data=Xy, columns=mycol)
dat = df.iloc[:100,:].copy() #only consider two species; copy so the assignment below does not act on a view of df
dat['group'] = dat.group.apply(lambda x: 1 if x ==0 else 2) #two species means two groups
dat.shape
dat.head()
### Linear discriminant analysis procedure
G1 = dat.iloc[:50,:-1]; x1_bar = G1.mean(); S1 = G1.cov(); n1 = G1.shape[0]
G2 = dat.iloc[50:,:-1]; x2_bar = G2.mean(); S2 = G2.cov(); n2 = G2.shape[0]
Sp = (n1-1)/(n1+n2-2)*S1 + (n2-1)/(n1+n2-2)*S2
a = np.linalg.inv(Sp).dot(x1_bar-x2_bar); u_bar = (x1_bar + x2_bar)/2
m = a.T.dot(u_bar); print("Linear discriminant boundary is {} ".format(m))
def my_lda(x):
    y = a.T.dot(x)
    pred = 1 if y >= m else 2
    return y.round(4), pred
xx = dat.iloc[:,:-1]
xxa = xx.agg(my_lda, axis=1)
xxa.shape
type(xxa)
Here xxa is a pandas.core.series.Series with shape (100,). Note that each element of xxa is a tuple of two values. I want to convert xxa to a pd.DataFrame with 100 rows x 2 columns, so I tried
xxa_df1 = pd.DataFrame(data=xxa, columns=['y','pred'])
which gives ValueError: Shape of passed values is (100, 1), indices imply (100, 2).
Then I tried
xxa2 = xxa.to_frame()
# xxa2 = pd.DataFrame(xxa) #equals `xxa.to_frame()`
xxa_df2 = pd.DataFrame(data=xxa2, columns=['y','pred'])
and xxa_df2 has 100 rows x 2 columns but contains only NaN. What should I do next?

Use Series.tolist() to turn the Series of tuples into a list of rows:
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y','pred'])
print(xxa_df1)
y pred
0 42.0080 1
1 32.3859 1
2 37.5566 1
3 31.0958 1
4 43.5050 1
.. ... ...
95 -56.9613 2
96 -61.8481 2
97 -62.4983 2
98 -38.6006 2
99 -61.4737 2
[100 rows x 2 columns]
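Why this works: every element of the Series is a two-item tuple, so tolist() yields a list of rows that the DataFrame constructor can split into columns. A small self-contained sketch, using made-up values shaped like xxa rather than the real LDA output:

```python
import pandas as pd

# Toy Series whose elements are (score, prediction) tuples, like xxa.
s = pd.Series([(42.008, 1), (32.3859, 1), (-56.9613, 2)])

# Each tuple becomes one row; index=s.index keeps the original row labels.
out = pd.DataFrame(s.tolist(), index=s.index, columns=["y", "pred"])
print(out)
```

Passing index=s.index is optional here, but it matters when the Series does not have a default 0..n-1 index.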

TypeError: 'DataFrame' object is not callable for DBscan

Data set is below
storeid,revenue,profit,country
101,11434,2345,IN
101,12132,3445,US
102,21343,4545,CH
103,34423,3432,CH
103,43435,3234,JP
103,34345,3335,IN
Code is below
import pandas as pd
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from pylab import rcParams
from collections import Counter
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_style('whitegrid')
df = pd.read_csv('1.csv',index_col=None)
df.head()
df.columns = df.columns.str.replace(' ', '')
dummies = pd.get_dummies(data = df)
del dummies['Unnamed:0']
model = DBSCAN(eps = 2.25, min_samples=19).fit(dummies)
print (model)
target = dummies.iloc[:,0]
data = dummies.iloc[:,1:-1]
outliers_df = pd.DataFrame(data)
print (Counter(model.labels_))
print(outliers_df(model.labels_==-1))
The line print(outliers_df(model.labels_==-1)) throws TypeError: 'DataFrame' object is not callable

Use boolean indexing with [] instead of () to filter by the mask:
print(outliers_df[model.labels_==-1])
revenue profit country_CH country_IN country_JP
0 11434 2345 0 1 0
1 12132 3445 0 0 0
2 21343 4545 1 0 0
3 34423 3432 1 0 0
4 43435 3234 0 0 1
5 34345 3335 0 1 0
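A minimal sketch of the same idea in isolation. The labels array below is hypothetical and only stands in for model.labels_; parentheses would try to call the DataFrame, while square brackets index it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [11434, 12132, 21343],
                   "profit": [2345, 3445, 4545]})
labels = np.array([-1, 0, -1])  # hypothetical DBSCAN labels; -1 marks noise

mask = labels == -1   # boolean array, one entry per row
outliers = df[mask]   # square brackets keep the rows where mask is True
print(outliers)
```

In the question's data there are only six rows but min_samples=19, so no point can be a core point and DBSCAN labels every row as noise. That is why the output above lists all six rows.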

How to convert pandas data frame to NumPy array?

Following the suggestions I got from my previous question, I'm converting a pandas DataFrame to a numeric NumPy array. To do this I used numpy.asarray.
My data frame:
DataFrame
----------
label vector
0 0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1 0 1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2 0 1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3 0 1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4 1 1:0.003118149 2:-0.015105667 3:0.040879637 4:...
... ... ...
19779 0 1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780 1 1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781 0 1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782 0 1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783 1 1:0.0016790542 2:-0.028346825 3:0.03908631 4:...
[19784 rows x 2 columns]
DataFrame datatypes :
label object
vector object
dtype: object
To convert into a Numpy Array I'm using this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ', n=1).tolist(),
                  columns = ['label','vector'])
print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)
arr = np.asarray(df, dtype=np.float64)
print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)
I get this error from line 22, arr = np.asarray(df, dtype=np.float64):
ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...
How can I solve this issue?
Regards and thanks for your time

Use a list comprehension that builds one dict per row (splitting each index:value pair on ':') and pass the result to DataFrame:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
1 2 3 4
0 0.0033524514 -0.021896651 0.05087798 0
1 0.02134219 -0.007388343 0.06835007 0
2 0.030515702 -0.0037591448 0.066626 0
3 0.0069114454 -0.0149497045 0.020777626 0
4 0.003118149 -0.015105667 0.040879637 0.4
And then convert to floats and to numpy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
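If each line also starts with a label token that has no ':' in it, as the printed DataFrame in the question suggests, that token needs to be split off before the pairs are parsed. The sketch below assumes a "label idx:value idx:value ..." layout; the sample lines are shortened and hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the assumed "label idx:value idx:value ..." layout.
lines = [
    "0 1:0.0033524514 2:-0.021896651 3:0.05087798",
    "1 1:0.003118149 2:-0.015105667 3:0.040879637",
]

labels, rows = [], []
for line in lines:
    label, *pairs = line.split()  # first token is the label
    labels.append(int(label))
    rows.append({int(k): float(v) for k, v in (p.split(":") for p in pairs)})

# Indices missing from a row become NaN; fillna(0.0) treats them as zeros,
# which is an assumption about the file being in a sparse format.
X = pd.DataFrame(rows).sort_index(axis=1).fillna(0.0).to_numpy()
y = np.array(labels)
print(X.shape, y)
```

This gives a float feature matrix X and a separate label vector y, which is the shape most scikit-learn estimators expect.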

It seems one of your columns holds strings rather than numbers. Either remove that column or encode it numerically before converting the DataFrame to an array.

Replicate the tuple values to create random dataset in python [duplicate]

I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections with 1000 rows in each of them.
How do I draw a random sample of certain size (e.g. 50 rows) of just one of the 100 sections? The df is already ordered such that the first 1000 rows are from the first section, next 1000 rows from another, and so on.

You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*Call it on one of the section DataFrames.
Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
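To apply this to a single section, one option is to slice that section out by position and then call sample on the slice. This is a sketch that assumes 0-based section numbers and the 1000-row layout described in the question; the section number 42 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100_000)})  # toy stand-in for the 100,000 rows

section = 42   # hypothetical section number (0-based)
size = 1_000   # rows per section, as in the question
block = df.iloc[section * size:(section + 1) * size]

# random_state is optional; it only makes the draw reproducible.
picked = block.sample(n=50, random_state=0)
print(picked.head())
```

Because sample draws without replacement by default, the 50 rows are distinct and all come from rows 42000 to 42999.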

One solution is to use the choice function from numpy.
Say you want 50 entries out of the first 1000 rows; you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This is of course not considering your block structure. If you want a 50 item sample from block i for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]

You could add a "section" column to your data then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
Add .query("section == 42") (or similar) if you are interested in only one particular section.
Note this requires pandas 1.1.0 or later; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by @msh5678

Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))

This is a nice place for recursion.
import numpy as np

def main2():
    rows = 8  # say you have 8 rows; with real data use the row count, e.g. len(df)
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # now iterate over the random values

def fun(rands):
    # draw again (recursively) until the value has not been used yet
    gen = np.random.randint(0, 8)
    if gen in rands:
        a = fun(rands)
        return a
    else:
        return gen

if __name__ == "__main__":
    main2()
output: [6, 0, 7, 1, 3, 5, 4, 2]
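The recursive version produces each integer from 0 to 7 exactly once in random order, which is a random permutation. As an alternative sketch (not the answer's method), NumPy can generate that directly without retrying on collisions:

```python
import numpy as np

rows = 8
rng = np.random.default_rng()
order = rng.permutation(rows)  # each integer 0..rows-1 exactly once, shuffled
print(order.tolist())
```

The result can be passed to df.iloc[order] to visit the rows in random order.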

3 dimensional numpy array to pandas dataframe

I have a 3 dimensional numpy array
([[[0.30706802]],
[[0.19451728]],
[[0.19380492]],
[[0.23329106]],
[[0.23849282]],
[[0.27154338]],
[[0.2616704 ]], ... ])
with shape (844,1,1), resulting from an RNN model.predict() call in Keras:
y_prob = loaded_model.predict(X)
My problem is how to convert it to a pandas DataFrame.
My objective is to have this:
0 0.30706802
7 0.19451728
21 0.19380492
35 0.23329106
42 ...
...
815 ...
822 ...
829 ...
836 ...
843 ...
Name: feature, Length: 78, dtype: float32

The idea is to first flatten the nested list into a list of rows, then convert it to a DataFrame using the from_records method of pandas DataFrame:
import numpy as np
import pandas as pd
data = np.array([[[0.30706802]],[[0.19451728]],[[0.19380492]],[[0.23329106]],[[0.23849282]],[[0.27154338]],[[0.2616704 ]]])
import itertools
data = list(itertools.chain(*data))
df = pd.DataFrame.from_records(data)
Without itertools
data = [i for j in data for i in j]
df = pd.DataFrame.from_records(data)
Or you can use the flatten() method, as mentioned in one of the other answers, directly on the array (data must still be the original NumPy array here, not the flattened list):
pd.DataFrame(data.flatten(),columns = ['col1'])

Here you go!
import pandas as pd
y = ([[[[11]],[[13]],[[14]],[[15]]]])
a = []
for i in y[0]:
    a.append(i[0])
df = pd.DataFrame(a)
print(df)
Output:
0
0 11
1 13
2 14
3 15
Feel free to set your custom index values both for axis=0 and axis=1.

You could try:
s = pd.Series(your_array.flatten(), name='feature')
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
You can then convert the series to a dataframe using s.to_frame()
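Putting those two steps together, and attaching row labels like the ones in the desired output, might look like the sketch below. The three-value array and the index [0, 7, 21] are hypothetical stand-ins for the real model.predict() output and its row labels.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the (844, 1, 1) float32 output of model.predict().
y_prob = np.array([[[0.30706802]], [[0.19451728]], [[0.19380492]]],
                  dtype=np.float32)

flat = y_prob.reshape(-1)  # shape (3,), one value per prediction
idx = [0, 7, 21]           # hypothetical row labels to attach
s = pd.Series(flat, index=idx, name="feature")
df = s.to_frame()          # one-column DataFrame named "feature"
print(df)
```

reshape(-1) and flatten() give the same values here; the index list must have one label per prediction.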
