I have 14784 text documents that I am trying to vectorize so I can run some analysis. I used the CountVectorizer in sklearn to convert the documents to feature vectors. I did this by calling:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents
Now I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (before inserting the text features) has the shape (14784, 5). The shape of my feature vector is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Learn the vocabulary dictionary from the raw documents and return the term-document matrix.
vect = CountVectorizer()
X = vect.fit_transform(docs)
Convert the sparse CSR matrix to dense format and label the columns with the vectorizer's mapping from feature indices to feature names.
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())
Concatenate the original df and the count_vect_df columnwise.
pd.concat([df, count_vect_df], axis=1)
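Putting those steps together, here is a minimal end-to-end sketch. It assumes examples is the list of raw documents from the question and df is the existing (14784, 5) dataframe with a default RangeIndex (pd.concat aligns on the index, so reset it first if it isn't):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# learn the vocabulary and build the (14784, 21343) term-document matrix
vect = CountVectorizer()
X = vect.fit_transform(examples)

# densify and label the columns with the learned feature names
count_vect_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())

# column-wise concatenation with the original 5 columns
combined_df = pd.concat([df.reset_index(drop=True), count_vect_df], axis=1)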
If your base data frame is df, all you need to do is:
import pandas as pd
features_df = pd.DataFrame(features.toarray())  # densify the sparse matrix before building the DataFrame
combined_df = pd.concat([df, features_df], axis=1)
I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features on the vectorizer itself (it is a constructor parameter, not an argument of fit_transform), like vectorizer = CountVectorizer(max_features=1000), to limit the number of features.
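A minimal sketch of that suggestion (stop_words and max_features are CountVectorizer constructor parameters; the 1000 cap is just an illustrative value):
from sklearn.feature_extraction.text import CountVectorizer

# drop common English stop words and keep only the 1000 most frequent terms
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
features = vectorizer.fit_transform(examples)  # shape becomes (14784, 1000)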
I have a dataframe:
import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first using either standardization or normalization.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column using the code below, but I'm struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply a transformation (e.g. scaling) to a selected set of columns.
As you can read in the linked documentation, you need to provide, inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which the selected transformation should be applied
Code example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# specify the columns to scale
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])

# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718,  0.93177476,  0.96056682,  0.46493449],
       [ 0.53109031,  0.45544147,  0.41859563,  0.92419906],
       [-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer will output a numpy array with the transformed values, fitted on the input dataset df. Even though there are no column names now, the array columns are still ordered in the same way as the input dataframe, so it's easy to convert the array back to a pandas dataframe if you need to.
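For example, a short sketch of that conversion (it reuses the ct and columns objects from the snippet above and assumes pandas is imported as pd):
import pandas as pd

scaled = ct.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=columns, index=df.index)
scaled_df.insert(0, 'Company', df['Company'])  # re-attach the column that was not scaled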
In addition to #RicS's answer, note that what the scikit-learn function returns is a numpy array, not a dataframe anymore. Also, the Company column is not included. You may consider this to convert the results back to a dataframe:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
I am exploring PCA in Scikit-learn (0.20 on Python 3), using Pandas to structure my data. When I apply a test/train split (and only then), my input labels seem to no longer match up with the PCA output.
import pandas
import sklearn.datasets
from matplotlib import pyplot
import seaborn
def load_bc_as_dataframe():
    data = sklearn.datasets.load_breast_cancer()
    df = pandas.DataFrame(data.data, columns=data.feature_names)
    df['diagnosis'] = pandas.Series(data.target_names[data.target])
    return data.feature_names.tolist(), df
feature_names, bc_data = load_bc_as_dataframe()
from sklearn.model_selection import train_test_split
# bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train = bc_data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
bc_pca_raw = pca.fit_transform(bc_train[feature_names])
bc_pca = pandas.DataFrame(bc_pca_raw, columns=('PCA 1', 'PCA 2'))
bc_pca['diagnosis'] = bc_train['diagnosis']
seaborn.scatterplot(
    data=bc_pca,
    x='PCA 1',
    y='PCA 2',
    hue='diagnosis',
    style='diagnosis'
)
pyplot.show()
This looks reasonable, and that's borne out by accurate classification results. If I replace the bc_train = bc_data with a train_test_split() call (even with test_size=0), my labels seem to no longer correspond to the original ones.
I realise that train_test_split() is shuffling my data (which I want it to, in general), but I don't see why that would be a problem, since the PCA and the label assignment use the same shuffled data. PCA's transformation is just a projection, and while it obviously doesn't retain the same features (columns), it shouldn't change which label goes with which frame.
How can I correctly relabel my PCA output?
The issue has three parts:
The shuffling in train_test_split() causes the indices in bc_train to be in a random order (compared to the row location).
PCA operates on numerical matrices, and effectively strips the indices from the input. Creating a new DataFrame recreates the indices to be sequential (compared to the row location).
Now we have random indices in bc_train and sequential indices in bc_pca. When I do bc_pca['diagnosis'] = bc_train['diagnosis'], bc_train is reindexed with bc_pca's indices. This reorders the bc_train data so that its indices match bc_pca's.
To put it another way, Pandas does a left-join on the indices when I assign with bc_pca['diagnosis'] (i.e. __setitem__()), not a row-by-row copy (similar to update()).
I don't find this intuitive, and I couldn't find documentation on __setitem__()'s behaviour beyond the source code, but I expect it makes sense to a more experienced Pandas user, and maybe it's documented at a higher level somewhere I haven't seen.
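Here is a tiny illustration of that alignment behaviour, with made-up values, just to show that column assignment matches on index labels rather than on row positions:
import pandas as pd

left = pd.DataFrame({'a': [10, 20, 30]})             # default index 0, 1, 2
right = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])  # shuffled index
left['label'] = right                                # aligned on index, not position
print(left)
#     a label
# 0  10     y
# 1  20     z
# 2  30     x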
There are a number of ways to avoid this. I can reset the index of the training/test data:
bc_train, _ = train_test_split(bc_data, test_size=0)
bc_train.reset_index(inplace=True)
Alternatively I could assign from the values member:
bc_pca['diagnosis'] = bc_train['diagnosis'].values
I could also do a similar thing before constructing the DataFrame (arguably more sensible, since PCA is effectively operating on bc_train[feature_names].values).
I have a dataframe that I have converted to an array in order to model the data using a regression algorithm. I used the following code to do it:
X=df.iloc[:, 0:345].values
Y=df.iloc[:,345].values
Hence X and Y are now arrays. There are many columns because the categorical variables have been converted into dummy variables. Next, I create the train/test split:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.25,random_state=0)
Now, after I have completed building the model and making predictions, I want to get back the values of my categorical variables (X and Y were created after making dummy variables for all categorical variables). For this, I am trying to convert my X_test back to a dataframe with the column names from the original dataframe df. I tried the following code:
dff=df.iloc[:, 0:345]
The above statement is to get the first 345 columns (of the data frame).
Then,
pd.DataFrame(X_test, index=dff.index, columns=dff.columns)
I get the following error
ValueError: Shape of passed values is (345, 25000), indices imply (345, 100000)
I don't understand why it matters how many rows I have. I have fewer rows because my train and test sets have been split 75%-25%, and I am performing the split after the data is converted to an array. How do I now convert the array data into a dataframe with the column names from the dff dataframe?
pd.DataFrame(X_test, index=dff.index, columns=dff.columns) fails here because X_test is a numpy.ndarray holding only 25% of the rows, while dff.index still covers all of them, so the lengths don't match.
I modified the above statement to just this:
df_new=pd.DataFrame(X_test)
df_new.columns=list(dff.columns)
The new dataframe contains the X_test data and the column names are assigned from the dff dataframe to the newly created dataframe as well.
I would recommend using the DataFrame for train_test_split, and then passing in arrays to your algorithm using numpy:
my_algorithm(np.asarray(X_train), np.asarray(y_train))
This way you can look at your data the same way you would for any DataFrame, but still run the model with the array. I'm not sure what library you are using, but I'm pretty sure some can take DataFrames directly for modeling now.
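A rough sketch of that workflow, using the column slicing from the question and my_algorithm as the placeholder above:
import numpy as np
from sklearn.model_selection import train_test_split

X_df = df.iloc[:, 0:345]   # features stay in a DataFrame, so the dummy-variable names survive
y_sr = df.iloc[:, 345]
X_train, X_test, y_train, y_test = train_test_split(X_df, y_sr, test_size=0.25, random_state=0)

# X_test.columns still holds the original names; convert only where the model needs plain arrays
my_algorithm(np.asarray(X_train), np.asarray(y_train))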
I am using scikit-learn to perform some analytics on a large dataset (roughly 34,000 files). Now I was wondering: the HashingVectorizer aims at low memory usage. Is it possible to first convert a bunch of files to HashingVectorizer feature matrices (saving them with pickle.dump), and then load all these files together and convert them to TfIdf features? These features can be calculated from the HashingVectorizer output, because counts are stored and the number of documents can be deduced. I now have the following:
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
    with open(path, 'wb') as handle:
        pickle.dump(features, handle)
Then, loading the files is trivial:
data = []
for path in paths:
    with open(path, 'rb') as handle:
        data.append(pickle.load(handle))

tfidf = TfidfVectorizer()
tfidf.fit_transform(data)
But, the magic does not happen. How can I let the magic happen?
It seems the problem is that you are trying to vectorize your text twice. Once you have built a matrix of counts, you should be able to transform the counts into tf-idf features using sklearn.feature_extraction.text.TfidfTransformer instead of TfidfVectorizer.
Also, it appears your saved data is a sparse matrix. You should stack the loaded matrices using scipy.sparse.vstack() instead of passing a list of matrices to TfidfTransformer.
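A sketch of that approach, assuming data is the list of pickled count matrices loaded above:
import scipy.sparse
from sklearn.feature_extraction.text import TfidfTransformer

counts = scipy.sparse.vstack(data)                # stack the per-document count matrices row-wise
transformer = TfidfTransformer()
tfidf_features = transformer.fit_transform(counts)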
I'm quite worried by your loop
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
Each time you re-fit your vectoriser, maybe it will forget its vocabulary, so the entries in each vector won't correspond to the same words (I'm not sure about this; I guess it depends on how they do the hashing). Why not just fit it on the whole corpus, i.e.
features = vectorizer.fit_transform(texts)
For your actual question, it sounds like you are just trying to normalise the columns of your data matrix by the IDF; you should be able to do this directly on the arrays (I've converted to numpy arrays since I can't work out how the indexing works on the scipy sparse matrices). The mask DF != 0 is necessary since you used the hashing vectoriser, which has 2^20 columns:
import numpy as np
X = np.array(features.todense())
DF = (X != 0).sum(axis=0)
X_TFIDF = X[:,DF != 0]/DF[DF != 0]
I generate a set of features for input, that I store as a table using pandas and the CSV format.
(Each column header represents a feature name, except for the first, blank column, which is where the class labels are stored for each row.)
My next step is reading the table from the csv file into scikit-learn (I'm currently doing this with pandas again). However, after training and experimenting with my models using different feature selection methods (and different initially generated features), I want the NAMES of the selected features.
I assume this should be trivial, but I just haven't found how to do it.
(Note: I am NOT working on standard text documents, so "CountVectorizer" and "NaiveBayes"/nltk and the like do not help me).
I need a method to get the names of the selected features (and preferably something to drop the unselected ones, for when I apply the models and selected features to new "test" data).
Thank you very much!
My data is currently loaded like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.linear_model import LogisticRegression
def load_data(filename="Feat_normalized.csv"):
    df = pd.read_csv(filename, index_col=0)
    lb = LabelEncoder()
    labels = lb.fit_transform(df.index.values)
    features = df.values
    feature_names = list(df.columns)
    feature_names.pop(0)  # Remove index.
    return (features, labels, lb)
features, labels, lb_encoder = load_data(filename)
X, y = features, labels
clf_logit = LogisticRegression(penalty="l1", dual=False, class_weight='auto')
X_reduced = clf_logit.fit_transform(X, y)
print('New sparse (filtered) features matrix size:')
print(X_reduced.shape)
#Then fit to various models, Random forests, SVM, etc'..
Truncated Example of the first 2 rows in the input data/csv:
AA_C AA__D AA__E AA_F AA__G AA_H AA_I AA_K AA_L AA_M
Mammal_sequence_1.0.fasta 3.838099345 0.456591162 3.764884604 3.620232638 3.460992571 3.858487012 2.69247235 3.18710619 3.671029774 4.625996297 1.542632799
(AA_* = feature name; Mammal_sequence_1.0.fasta = class name/label, one per row, stored under the empty first header.)
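To make it concrete, here is a rough sketch of the kind of thing I am after (I have not verified this is the right approach, which is exactly what I am asking): with the L1 penalty, the selected features should be the ones with non-zero coefficients, assuming feature_names lines up with the columns of X:
import numpy as np

clf_logit.fit(X, y)
mask = np.any(clf_logit.coef_ != 0, axis=0)  # True where a feature kept a non-zero L1 coefficient
selected_names = [name for name, keep in zip(feature_names, mask) if keep]
X_reduced = X[:, mask]                       # drop the unselected columns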
Thank you very much!