Merge gives me many more rows in the dataframe - python

Update: As mentioned in the comments, my indices weren't unique. I worked around this with a pivot table.
I have the following code that performs clustering on a df of approximately 80K rows (named 'Kmeans'). I have another df, 'Historie', with slightly fewer than 80K rows, which shares a column with 'Kmeans' (namely 'SKU_NR'). I want to merge 'Kmeans' with 'Historie', but when I do this I get over 2M rows. I've done this before and it worked then. What's going wrong in the code?
#load in libraries
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#Load and prepare data
Historie = pd.read_excel("file.xlsx")
Kmeans = Historie[['SKU_NR','ORDER_ADV_CONS_UNITS_WK_PICK']]
Kmeans = Kmeans.dropna()
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(Kmeans)
km.predict(Kmeans)
labels = km.labels_
Kmeans["Classification"] = labels
Kmeans = Kmeans[["SKU_NR","Classification"]]
Historie = Historie[['SKU_NR', 'WEEKNR', 'ORDER_ADV_CONS_UNITS_WK_PICK',
                     'FORECAST_NEC_STOCK_BASE']]
Historie = Historie.merge(Kmeans, on = "SKU_NR")
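The row explosion can be reproduced on toy data. The sketch below assumes, as the update suggests, that 'SKU_NR' repeats on both sides of the merge; the frames `history` and `labels` and all their values are made up for illustration. A SKU with n rows on each side contributes n × n rows to the result. Keeping one label per SKU (here simply the first one, which is an assumption about what is wanted) restores the original row count, and `validate="many_to_one"` makes pandas raise an error if the right-hand key is not unique.

```python
import pandas as pd

# Hypothetical stand-in for 'Historie': several weekly rows per SKU.
history = pd.DataFrame({
    "SKU_NR": [1, 1, 1, 2, 2],
    "WEEKNR": [1, 2, 3, 1, 2],
})

# Hypothetical stand-in for 'Kmeans': derived row by row from the same data,
# so SKU_NR repeats here as well.
labels = pd.DataFrame({
    "SKU_NR": [1, 1, 1, 2, 2],
    "Classification": [0, 0, 1, 2, 2],
})

# Many-to-many merge: 3 * 3 rows for SKU 1 plus 2 * 2 rows for SKU 2.
exploded = history.merge(labels, on="SKU_NR")
print(len(exploded))

# Assumption: one label per SKU is enough, so keep the first one.
labels_unique = labels.drop_duplicates(subset="SKU_NR")

# validate raises pandas.errors.MergeError if SKU_NR is not unique on the right.
merged = history.merge(labels_unique, on="SKU_NR", validate="many_to_one")
print(len(merged))
```

If a SKU can legitimately fall into different clusters in different weeks, the merge key probably needs a second column (for example the week number) instead of dropping duplicates.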

Related

Applying scikit-learn preprocessing to pandas without causing warnings

I'm trying to use scikit-learn's preprocessing to min-max scale a column in pandas. My solution works but gives me 2 warnings, and I am wondering if there is a better way to do it.
Here is my function, which does the min-max scaling given a dataframe and a list of columns:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
This is where I use it
df.loc[:,'pct_mm'] = minMaxScale(df,['pct'])
Where the column 'pct' exists and 'pct_mm' does not exist.
I get the following warning 2 times:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How should I do this the way pandas wants me to?
Cannot reproduce the warnings:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
def minMaxScale(df, cols):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df[cols])
df = sns.load_dataset('iris')
df.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
However, if I do this:
df = sns.load_dataset('iris')
df2 = df[:]
df2.loc[:, 'newcolumn'] = minMaxScale(df, ['sepal_length'])
then I get two warnings as well.
You probably derived df from another dataframe somewhere in your code. I recommend finding the lines where df is created and making an explicit copy there, like: df = old_df.copy().
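A minimal sketch of the copy advice, under a few assumptions: `old_df` and its values are made up, and a hand-written min-max formula stands in for MinMaxScaler so that the example needs only pandas.

```python
import pandas as pd

old_df = pd.DataFrame({"pct": [10.0, 20.0, 30.0, 50.0]})

# Filtering returns a new object that pandas may treat as a slice of old_df;
# .copy() makes df an independent frame, so adding a column is unambiguous.
df = old_df[old_df["pct"] > 10.0].copy()

# Hand-rolled min-max scaling (stand-in for MinMaxScaler).
pct = df["pct"]
df["pct_mm"] = (pct - pct.min()) / (pct.max() - pct.min())
print(df)
```

Because df is an explicit copy, the new column is added to df only and old_df stays unchanged.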

Change pandas DataFrame to numpy array but keeping column names

I have a pandas DataFrame from the sklearn.datasets Boston house price data and am trying to convert this to a numpy array but keeping column names. Here is the code I tried:
from sklearn import datasets ## imports datasets from scikit-learn
import numpy as np
import pandas as pd
data = datasets.load_boston() ## loads Boston dataset (load_boston was removed in scikit-learn 1.2, so this needs an older version)
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df.to_numpy()
print(X.dtype.names)
However this returns None and therefore column names are not kept. Does anyone understand why?
Thanks
Try this:
w = (data.feature_names).reshape(13,1)
X = np.vstack((w.T, data.data))
print (X)
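As for why `X.dtype.names` is None: `to_numpy()` returns a plain homogeneous array, and only structured (record) arrays carry field names. One way to keep the names is `DataFrame.to_records`. A minimal sketch with a small made-up frame (the column names are borrowed from the Boston data, the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3], "RM": [6.5, 6.4, 7.1]})

# A plain float array: dtype.names is None because there are no fields.
plain = df.to_numpy()
print(plain.dtype.names)

# A record array: one named field per DataFrame column.
records = df.to_records(index=False)
print(records.dtype.names)
print(records["RM"])
```

Note that a record array is one-dimensional (one record per row), so code that expects a 2-D float matrix may still need the plain array.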

K-Means Python Syntax When Records Represented by a Cnt Column (in Aggregate)

Trying to accomplish K-Means in Python using aggregated data files. For example, instead of a data frame with 3 records represented by 3 rows, one row will represent all 3 with a column like cnt (arbitrarily named) representing those 3 unique instances with the number 3 in it.
Below is a set of some basic starter code that does NOT use the aggregated representation of the rows. Please let me know if you would like for me to post the .csv too, but it should be pretty basic:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
data = pd.read_csv('../Data/wholesale_data.csv')
data.head()
categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen',
                       'Detergents_Paper', 'Delicassen']
for col in categorical_features: #for each categorical col
    dummies = pd.get_dummies(data[col], prefix=col) #one-hot-encoding
    data = pd.concat([data, dummies], axis=1) #append to data
    data.drop(col, axis=1, inplace=True) #drop orig column
data.head()
mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k) #init model
    km = km.fit(data_transformed) #fit model
    sum_of_squared_distances.append(km.inertia_) #overall SSE
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
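For the aggregated representation, one option is to expand each row `cnt` times before fitting, so that the starter code above runs unchanged. The sketch below uses a made-up frame `agg` with a count column assumed to be named `cnt`. scikit-learn's `KMeans.fit` also accepts a `sample_weight` argument, so passing the counts as weights is an alternative that avoids the expansion; that variant is shown only as a comment.

```python
import pandas as pd

# Hypothetical aggregated data: each row stands for cnt identical records.
agg = pd.DataFrame({
    "Fresh": [10.0, 20.0, 30.0],
    "Milk": [1.0, 2.0, 3.0],
    "cnt": [3, 1, 2],
})

# Repeat each index label cnt times, then select those rows.
expanded = (
    agg.loc[agg.index.repeat(agg["cnt"])]
    .drop(columns="cnt")
    .reset_index(drop=True)
)
print(expanded)

# Alternative without expansion (not run here):
# KMeans(n_clusters=k).fit(agg.drop(columns="cnt"), sample_weight=agg["cnt"])
```

Either way, the `cnt` column itself should be kept out of the feature matrix, otherwise it would be clustered on like any other feature.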

Loading SKLearn cancer dataset into Pandas DataFrame

I'm trying to load a sklearn dataset into a DataFrame, but according to the keys (target_names, target & DESCR) I am missing a column. I have tried various methods to include the last column, but they all produce errors.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
With the code above, it only returns 30 columns, when I need 31 columns. What is the best way to load scikit-learn datasets into a pandas DataFrame?
Another option, a one-liner that creates the dataframe including both the features and the target variable:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
If you want to have a target column you will need to add it because it's not in cancer.data. cancer.target has the column with 0 or 1, and cancer.target_names has the label. I hope the following is what you want:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())
data = data.assign(target=pd.Series(cancer.target))
print(data.describe())
# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape) # data.describe() won't show the "target" column here because its values were converted to strings.
This works too, also using pd.Series.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)
print(data.keys())
print(data.shape)
Only the target column is missing, so you can just add it.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
Mapping target names can be handled elegantly using map():
data["target"] = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
As of scikit-learn 0.23 you can do the following to get a DataFrame with the target column included.
df = load_breast_cancer(as_frame=True)
df.frame

ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')

When trying to convert the sklearn dataset into a pandas dataframe with the following code, I get this error: "ufunc 'add' did not contain a loop with signature matching types dtype('<U23') dtype('<U23') dtype('<U23')"
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'])
Here is how I converted the sklearn dataset to a pandas dataframe. The target column name needs to be appended.
bostonData = pd.DataFrame(data=np.c_[boston['data'], boston['target']],
                          columns=np.append(boston['feature_names'], ['target']))
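The difference between `+` and `np.append` is the heart of the error: on NumPy arrays `+` is element-wise arithmetic, not concatenation, so adding an array of names to an array of integers fails instead of producing a longer list of names. A small sketch with made-up arrays standing in for the dataset fields (`feature_names`, `data` and `target` here are hypothetical values, not the real breast cancer data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for cancer['feature_names'], cancer['data'], cancer['target'].
feature_names = np.array(["mean radius", "mean texture"])
data = np.array([[14.1, 19.3], [20.6, 25.7]])
target = np.array([0, 1])

# '+' tries to add strings and integers element by element.
try:
    feature_names + target
except TypeError as err:
    print("element-wise add failed:", err)

# np.append concatenates, giving one name per column of np.c_[data, target].
columns = np.append(feature_names, ["target"])
df = pd.DataFrame(np.c_[data, target], columns=columns)
print(df.columns.tolist())
```

The exact wording of the error message depends on the NumPy version, which is why the sketch only catches the general TypeError.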
You have a NumPy array of strings. Please post the full error so we can figure out what's missing.
For example, assuming you got dtype('U9'), try adding
dtype=float to your array. I'm not certain, but something like:
data = pd.DataFrame(data= np.c_[cancer['data'], cancer['target']],columns= cancer['feature_names'] + cancer['target'], dtype=float)
Sometimes it's just easier to keep it simple. Create a DF for both data and target, then merge using pandas.
data_df = pd.DataFrame(data=cancer['data'] ,columns=cancer['feature_names'])
target_df = pd.DataFrame(data=cancer['target'], columns=['target']).reset_index(drop=True)
target_df.rename_axis(None)
df = pd.concat([data_df, target_df], axis=1)
