I want to create a CSV file that combines the train and test data and labels for a project. The problem is that in the concat function, even after resetting the index, the labels keep coming out as NaN and I don't understand what is wrong. The datasets are at this link: https://wetransfer.com/downloads/9f0562b7ec341ebb663262af78971b8020211228154538/84d58d
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
data.to_csv('new1.csv', index=False)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
data2.to_csv('new2.csv', index=False)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
train = pd.concat([data_labels, data], axis=1, join='inner')
print(train.shape)
test = pd.concat([data2_labels, data2], axis=1, join='inner')
print(test.shape)
test.reset_index(drop=True, inplace=True)
train.reset_index(drop=True, inplace=True)
frame = pd.concat([train, test], axis=0)
print(frame)
I suspect what's happening is that you have duplicate index values before the concat(). (They're possibly only duplicated between the train and test sets, not necessarily within each set separately.) That can throw off concat(), since index values are assumed to be unique, and it may compensate by setting some values to NaN. The calls to reset_index() give each frame its own index starting from 0, so the two frames still end up sharing the same index values.
To fix this: Set ignore_index=True in pd.concat(). From the docs:
ignore_index : bool, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
If that doesn't work, check whether test and train have NaNs in their indexes before concatenation and after reset_index(). They shouldn't, but if they do, those will carry over into the concat.
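As a quick illustration of what ignore_index=True does (a minimal sketch with toy frames, not the question's actual data):
import pandas as pd
# two toy frames whose indexes both run 0..1, as they would after reset_index()
train = pd.DataFrame({'label': ['a', 'b'], 'x': [1, 2]})
test = pd.DataFrame({'label': ['c', 'd'], 'x': [3, 4]})
# ignore_index=True discards the overlapping indexes and relabels the rows 0..3
frame = pd.concat([train, test], axis=0, ignore_index=True)
print(frame.index.tolist())  # [0, 1, 2, 3]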
I did the concats in a different order and it worked.
The NaNs were the result of not merging the labels correctly: instead of ending up with one single label column, I was creating two half-empty ones, one holding the train labels and one holding the test labels.
import pandas as pd
from sklearn.utils import shuffle
# remove first col from training dataset
data = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_data.csv')
first_column = data.columns[0]
data = data.drop([first_column], axis=1)
print(data.shape)
# remove first col from testing dataset
data2 = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_data.csv')
first_column = data2.columns[0]
data2 = data2.drop([first_column], axis=1)
print(data2.shape)
#read training labels
data_labels = pd.read_csv('/home/katerina/Desktop/PBMC_training_set_label.csv')
print(data_labels.shape)
#read testing labels
data2_labels = pd.read_csv('/home/katerina/Desktop/PBMC_testing_set_label.csv')
print(data2_labels.shape)
# concat data without labels; ignore_index avoids duplicate row indices
frames = [data, data2]
d = pd.concat(frames, ignore_index=True)
# concat labels (DataFrame.append is deprecated, so use pd.concat here as well)
l = pd.concat([data_labels, data2_labels], ignore_index=True)
#create the original dataset
print(d.shape, l.shape)
dataset = pd.concat([l, d], axis=1)
dataset = shuffle(dataset)
dataset
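For reference, the two half-empty label columns from my first attempt are easy to reproduce with a toy example; one way it happens is when the label column has a different name in the two files (the names below are hypothetical):
import pandas as pd
train_labels = pd.DataFrame({'label': ['B', 'T']})      # hypothetical column name
test_labels = pd.DataFrame({'cell_type': ['NK', 'B']})  # hypothetical column name
print(pd.concat([train_labels, test_labels], ignore_index=True))
#   label cell_type
# 0     B       NaN
# 1     T       NaN
# 2   NaN        NK
# 3   NaN         B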
I have two dataframes, train and test. The test set has missing values in one of its columns.
import numpy as np
import pandas as pd
train = [[0,1],[0,2],[0,3],[0,7],[0,7],[1,3],[1,5],[1,2],[1,2]]
test = [[0,0],[0,np.nan],[1,0],[1,np.nan]]
train = pd.DataFrame(train, columns = ['A','B'])
test = pd.DataFrame(test, columns = ['A','B'])
The test set has two missing values in column B. If the groupby column is A:
If the imputing strategy is mode, then the missing values should be imputed with 7 and 2.
If the imputing strategy is mean, then the missing values should be (1+2+3+7+7)/5 = 4 and (3+5+2+2)/4 = 3.
What is a good way to do this?
This question is related, but uses only one dataframe instead of two.
IIUC, here's one way:
from statistics import mode
test_mode = test.set_index('A').fillna(train.groupby('A').agg(mode)).reset_index()
test_mean = test.set_index('A').fillna(train.groupby('A').mean()).reset_index()
If you want a function:
from statistics import mode

def evaluate_nan(strategy='mean'):
    return test.set_index('A').fillna(train.groupby('A').agg(strategy)).reset_index()
test_mean = evaluate_nan()
test_mode = evaluate_nan(strategy = mode)
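Running this against the frames above, I'd expect the imputed values the question asks for (7 and 2 for mode, 4 and 3 for mean):
print(test_mode)
#    A    B
# 0  0  0.0
# 1  0  7.0
# 2  1  0.0
# 3  1  2.0
print(test_mean)
#    A    B
# 0  0  0.0
# 1  0  4.0
# 2  1  0.0
# 3  1  3.0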
I need help understanding this line of code:
y_train2 = train_target2.astype('category').cat.codes
Am I right in saying that y_train2 is being changed to a categorical variable using astype(category) and then cat.codes is used to change it into integers?
Below is the full block of code.
# Train data pre-processing
train_target2 = df_train_01['class_2']
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
# convert text labels to integers
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes
# Test data pre-processing
test_target2 = df_test_01['class_2']
test_target5 = df_test_01['class_5']
# drop 'class_2' and 'class_5' columns
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes
I think your understanding of the function and the attribute is correct: Series.astype('category') turns the values into categorical data, and Series.cat.codes is an attribute (not a method) that gives each value the integer code of its category, starting from 0.
Try the simple snippet below to see how they work.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=['s-length', 's-width', 'p-length', 'p-width'])
print(
    pdf['s-length'].astype('category'),
    len(np.unique(pdf['s-length'])),                          # -> 35 distinct raw values
    len(set(pdf['s-length'].astype('category').cat.codes)),   # -> 35 distinct codes
    np.unique(pdf['s-length'].astype('category').cat.codes),  # -> array([0, 1, ..., 34], dtype=int8)
)
In essence, a pandas categorical data type is a mapping between values that do not have a numeric interpretation and a unique number for each value.
Let's break down your code:
# Take the series `train_target2` and convert it to categorical type
train_target2.astype('category')
# Access the attributes or methods of a categorical series
train_target2.astype('category').cat
# Take the `codes` attribute
train_target2.astype('category').cat.codes
In reality, .codes is not converting the data into numbers. Rather, you are only taking the numeric equivalent of each category. Strictly speaking, .astype('category') is the part that converted your data to categorical.
You can find the other attributes and methods of this data type in the pandas documentation on categorical data.
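A minimal sketch of that mapping, using hypothetical text labels:
import pandas as pd
s = pd.Series(['cat', 'dog', 'cat', 'bird'])  # hypothetical labels
cat_s = s.astype('category')
print(cat_s.cat.categories)      # Index(['bird', 'cat', 'dog'], dtype='object')
print(cat_s.cat.codes.tolist())  # [1, 2, 1, 0]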
I'm using python 3.7.6.
I'm working on a classification problem and I want to scale the feature columns of my dataframe (df).
The dataframe contains 56 columns (55 feature columns, and the last column is the target).
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but it seems inefficient, because I need to rebuild the dataframe (add the target column back to the scaling result).
Is there a way to scale just some columns without changing the others, in an efficient way?
(i.e. the result of the scaling contains the scaled feature columns plus the one column that is not scaled)
Using column positions for scaling or other pre-processing operations is not a very good idea, because the code breaks every time you create a new feature. Use column names instead, e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<features to standardize>]
scaler = StandardScaler()
# fit_transform returns a 2d numpy.array; we cast it back to a pd.DataFrame
standardized_features = pd.DataFrame(scaler.fit_transform(df[features].copy()), columns=features)
old_shape = df.shape
# drop the unscaled features from the dataframe
df.drop(features, axis=1, inplace=True)
# join the scaled features back (this assumes df still has its default 0..n-1 index;
# otherwise reset the index first so the row-wise alignment is correct)
df = pd.concat([df, standardized_features], axis=1)
assert old_shape == df.shape, "something went wrong!"
Or, if you prefer not to split and join the data back, you can use a function like this:
import numpy as np

def normalize(x):
    if np.std(x) == 0:
        raise ValueError('Constant column')
    return (x - np.mean(x)) / np.std(x)

# apply normalize to each whole column (element-wise .map would pass scalars,
# whose standard deviation is 0, and would always raise)
for col in features:
    df[col] = normalize(df[col])
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)
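If feature_scaling isn't available, the same in-place slicing pattern works with scikit-learn (a sketch; MinMaxScaler is assumed here only because the call above passes standardize=False):
from sklearn.preprocessing import MinMaxScaler
# scale every column except the last (the target) in place
df.iloc[:, :-1] = MinMaxScaler().fit_transform(df.iloc[:, :-1])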
I have run into this several times when trying to filter a dataframe using a column from another dataframe. isin seems to incorrectly return True for every row. It is probably just a misunderstanding on my part as to how it should work. Why is it doing this, and is there a better way to code it?
#Read the data into a pandas dataframe
ar_data = pd.read_excel('~/data/Accounts-Receivable.xlsx')
ar_data.set_index('customerID', inplace=True)
#randomly select records for 70/30 train/test split
train = ar_data.sample(frac=.7, random_state = 1)
mask = ~ar_data.index.isin(list(train.index)) #why does this return False for every value?
test = ar_data[mask]
ar_data.shape #returns (2466, 11)
train.shape #(1726, 11)
test.shape #returns (0, 11). Should return 740 rows!
I tried to execute your code with a sample DataFrame and it works:
import pandas as pd
ar_data = [[10,20],[11,2],[9,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (1,1)
The problem you probably have is that the index you used is not unique, i.e. there are multiple rows with the same customerID.
In fact, executing your code with duplicated indexes reproduces the behaviour you encountered:
import pandas as pd
ar_data = [[10,20],[10,2],[10,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (0,1)
Anyway, an easier and faster way to split your dataset would be:
from sklearn.model_selection import train_test_split

# here X and y are both the full frame, since no separate target column was specified
X = ar_data
y = ar_data
train, test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
With this you can also split the features and the targets in a single call, and it does not rely on the index at all.
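If you do want separate feature and target splits, the same call handles both at once (a sketch; 'paid' is a hypothetical target column name, not one taken from the actual file):
X = ar_data.drop(columns=['paid'])  # hypothetical feature columns
y = ar_data['paid']                 # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)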
I am using sklearn in python to perform principal component analysis (PCA) on gene expression data. My data is loaded as a pandas dataframe, for which I can call df.head() and the df looks good. I am using sklearn to generate a loading matrix, but the matrix only displays a generic index and will not accept a column name as an index. I have 1722 genes, so it is important that I obtain the loading score for each gene computationally.
Here is my code for PCA:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing
# Load the data as pandas dataframe
cols = ['gene', 'FC_TSWV', 'FC_WFT', 'FC_TSWV_WFT']
df = pd.read_csv('./PCA.txt', names = cols, header = None, index_col = 'gene')
# preprocess data:
scaled_df = preprocessing.scale(df.T)
# perform PCA
pca = PCA()
pca.fit(scaled_df)
pca_data = pca.transform(scaled_df)
# Generate loading matrix. HERE IS WHERE THE TROUBLE IS:
loading_scores = pd.Series(pca.components_[0], index = df.gene)
# Print loading matrix
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(loading_scores)
I have tried:
loading_scores = pd.Series(pca.components_[0], index = df.gene)
loading_scores = pd.Series(pca.components_[0], index = df['gene'])
loading_scores = pd.Series(pca.components_[0], index = df.loc['gene'])
AttributeError: 'DataFrame' object has no attribute 'gene'.
If I do not specify an index at all, the loading scores are designated with the generic 0-based index.
Anyone know how to fix this?
Use df.index instead of df.gene or df['gene'].
Once you set a column to be the index, you access it through the .index attribute, not through the column name anymore.
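Applied to the snippet above:
# 'gene' became the index via index_col='gene', so pass the index itself
loading_scores = pd.Series(pca.components_[0], index=df.index)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
print(sorted_loading_scores.head(10))  # the ten genes with the largest absolute loadings on PC1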