How to use groupby and CountVectorizer() together in a pandas DataFrame? - python

I have this sample data in a CSV file. I want to create feature vectors from the 'Questions' and 'Replies' columns using the bag-of-words method (CountVectorizer()) and then calculate the cosine similarity between each question and its replies.
So far I have this Python code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# `stop` is a collection of stop words defined elsewhere
topFeaturesValueListColumns = ['cosinSimilarityIpostRpost', 'Class']
topFeaturesValueList = []
featureVectorsPD = pd.DataFrame()
df = pd.read_csv("test1.csv", usecols = ['ThreadID', 'Title', 'UserID_inipst', 'Questions', 'UserID', 'Replies', 'Class'])
df = pd.DataFrame(df)
df = df.apply(lambda x: x.astype(str).str.lower())
for column in df:
    df[column] = df[column].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
cv = CountVectorizer()
# fit a single vocabulary on all text columns combined
features = cv.fit(df['Title']+' '+df['UserID_inipst']+' '+df['Questions']+' '+df['UserID']+' '+df['Replies'])
featureVectorsPD['Questions'] = cv.transform(df['Questions']).toarray().tolist()
featureVectorsPD['Replies'] = cv.transform(df['Replies']).toarray().tolist()
featureVectorsPD['Class'] = df['Class']
for i in range(len(featureVectorsPD)):
    # wrap each vector in a list so cosine_similarity sees a 2D input
    q = [featureVectorsPD['Questions'][i]]
    r = [featureVectorsPD['Replies'][i]]
    label = featureVectorsPD['Class'][i]
    res = cosine_similarity(q, r, dense_output=True)
    res = float(res[0][0])
    row = [res, label]
    topFeaturesValueList.append(row)
topQDFValuesPD = pd.DataFrame(topFeaturesValueList, columns=topFeaturesValueListColumns)
The problem with this code is that the line
features = cv.fit(df['Questions'] + ' ' + df['Replies'])
builds the word dictionary (features.vocabulary_) from the whole "Questions" and "Replies" columns, but my requirement is to compute the vocabulary for each thread individually and then create feature vectors based on that individual dictionary. In other words, whenever the value in the "ThreadID" column changes, a new vocabulary should be created.
I think the groupby function should be used here, but how? I hope the question is clear.
Please help me. I will be very thankful to you.
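One way to get a separate vocabulary per thread is to iterate over df.groupby('ThreadID') and fit a fresh CountVectorizer inside each group. The sketch below assumes a tiny made-up DataFrame in place of test1.csv, and thread_similarity is a hypothetical helper name; treat it as an illustration of the idea rather than a tested solution for the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the CSV; the column names follow the question.
df = pd.DataFrame({
    "ThreadID": [1, 1, 2],
    "Questions": ["how to sort a list", "how to sort a list", "what is a pandas dataframe"],
    "Replies": ["use sorted to sort a list", "no idea", "a dataframe is a table"],
    "Class": [1, 0, 1],
})

def thread_similarity(group):
    """Fit a fresh vocabulary on one thread and score each question/reply pair."""
    cv = CountVectorizer()
    cv.fit(pd.concat([group["Questions"], group["Replies"]]))
    q = cv.transform(group["Questions"])
    r = cv.transform(group["Replies"])
    # Compare question i only with reply i (row-wise, not all pairs).
    sims = [cosine_similarity(q[i], r[i])[0, 0] for i in range(q.shape[0])]
    return pd.Series(sims, index=group.index)

# Each group gets its own CountVectorizer, so vocabularies never mix across threads.
parts = [thread_similarity(group) for _, group in df.groupby("ThreadID")]
df["cosine_similarity"] = pd.concat(parts)
print(df[["ThreadID", "cosine_similarity", "Class"]])
```

Because every Series returned by the helper keeps the original row index, the concatenated result lines up with df when it is assigned back as a column.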


How do I apply CountVectorizer to each row in a dataframe?

I have a DataFrame, say df, with three columns. Columns A and B hold strings; column C is a numeric variable.
I want to convert this to a feature matrix by passing it to a CountVectorizer.
I define my CountVectorizer as:
cv = CountVectorizer(input='content', encoding='iso-8859-1',
decode_error='ignore', analyzer='word',
ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
Next I pass the entire DataFrame to cv.fit_transform(df), which doesn't work.
I get this error:
cannot unpack non-iterable int object
Next I convert each row of the DataFrame to a single string:
sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]
Then I apply:
cv_m = sample.apply(lambda row: cv.fit_transform(row))
I still get an error:
ValueError: Iterable over raw text documents expected, string object received.
Where am I going wrong? Or do I need to take some other approach?
Try this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]
pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})
cv = CountVectorizer()
# use pd.DataFrame here to avoid your error and add your column name
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])
vectorized = cv.fit_transform(sample['Output'])
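To spell out what is probably going on: ngram_range=(1) is just the integer 1, because parentheses without a comma do not make a tuple, and that is most likely what triggers the "cannot unpack non-iterable int object" message. The second error appears because Series.apply hands fit_transform one string at a time, while it expects an iterable of documents. A minimal sketch, assuming the same toy data as above and the default tokenizer, that fits once on the whole Series and labels the result:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

pdt_items = pd.DataFrame({
    "A": ["very good day", "a random thought", "maybe like this"],
    "B": ["so fast and slow", "the meaning of this", "here you go"],
    "C": [1, 2, 3],
})

# ngram_range needs both bounds: (1, 1) is a tuple, (1) is the int 1.
cv = CountVectorizer(ngram_range=(1, 1))

# One string per row; the Series as a whole is the iterable of documents.
sample = pdt_items["A"] + " " + pdt_items["B"] + " " + pdt_items["C"].astype(str)
counts = cv.fit_transform(sample)  # sparse matrix, one row per DataFrame row

# Order the column labels by vocabulary index (works across scikit-learn versions).
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
features = pd.DataFrame(counts.toarray(), columns=columns, index=pdt_items.index)
print(features)
```

Note that the default token pattern ignores single-character tokens, so the digits from column C do not show up as features here.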
With the help of @QuantStats's comment, I applied the cv to each row of the DataFrame as follows:
row_input = df['column_name'].tolist()
kwds = []
for i in range(len(row_input)):
    cell_input = [row_input[i]]
    full_set = row_keywords(cell_input, 1,1)
    candidates = [x for x in full_set if x[1]> 1] # to extract frequencies more than 1
    kwds.append(candidates)
kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col
("row_keywords" is a function for CountVectorizer.)
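Since row_keywords is not shown, here is a rough guess at what such a helper could look like. Its body and its (term, count) return format are assumptions made for illustration, as is the toy column content.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def row_keywords(texts, min_n=1, max_n=1):
    """Hypothetical stand-in: return (term, count) pairs for a list of texts."""
    cv = CountVectorizer(ngram_range=(min_n, max_n))
    # Sum over documents to get one total count per term.
    counts = np.asarray(cv.fit_transform(texts).sum(axis=0)).ravel()
    terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
    return list(zip(terms, counts.tolist()))

df = pd.DataFrame({"column_name": ["red red blue", "green green green tea", "one two"]})

# Fit one vectorizer per cell and keep only terms that occur more than once.
df["Keywords"] = df["column_name"].apply(
    lambda cell: [term for term, n in row_keywords([cell], 1, 1) if n > 1]
)
print(df)
```

Fitting a vectorizer per cell is slow on large frames and raises an error for cells that contain no usable tokens, so this only suits small inputs.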

Filter DataFrame after sklearn.feature_selection

I reduce the dimensionality of a dataset (a pandas DataFrame):
X = df.to_numpy()
sel = VarianceThreshold(threshold=0.1)
X_r = sel.fit_transform(X)
Then I want to get back the reduced DataFrame (i.e. keep only the selected columns).
I have only found this ugly way to do so, which is very inefficient. Do you have a cleaner idea?
cols_OK = sel.get_support() # which columns are OK?
c = list()
for i, col in enumerate(cols_OK):
    if col:
        c.append(df.columns[i])
return df[c]
I think you need this if get_support() returns a boolean mask:
cols_OK = sel.get_support()
df = df.loc[:, cols_OK]
and this if it returns integer indices:
cols_OK = sel.get_support(indices=True)
df = df.iloc[:, cols_OK]
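A small self-contained check of both variants, using made-up column names and a threshold chosen so that only one column survives:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],   # variance 0
    "varied": [0, 5, 10, 15],   # variance 31.25
    "almost": [0, 0, 0, 1],     # variance 0.1875
})

sel = VarianceThreshold(threshold=0.2)
sel.fit(df)

mask = sel.get_support()              # boolean array, one entry per column
idx = sel.get_support(indices=True)   # integer positions of the kept columns

by_mask = df.loc[:, mask]
by_index = df.iloc[:, idx]
print(by_mask.columns.tolist())
```

Both selections keep the original column labels and index, which is the part that fit_transform on a plain array loses.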

How to pass a column of pandas data frame as input to another dataframe

I want to use the values in a column of one pandas DataFrame as the list of columns to drop from another DataFrame.
Corpus words
corpus_top_words = pd.DataFrame(cv_addr.todense(), columns=cv.get_feature_names())
corpus_top_words =corpus_top_words.sum().rename_axis('Word').reset_index(name='Freq')
corpus_top_words=corpus_top_words.drop('Freq', axis=1)
English Dictionary
from nltk.corpus import brown
word_list= pd.DataFrame(word_list,columns=feature_names)
brown_corpus['Word'] = brown_corpus['Word'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Only English Words
english_words_corpus = pd.merge(corpus_top_words, brown_corpus, on='Word', how='inner')
english_words_corpus = pd.DataFrame(english_words_corpus.Word.unique(),columns=feature_names)
I need to pass this English word corpus to the original DataFrame to remove some columns:
data = data.drop(list_of_cols_to_drop, axis=1)
list_of_cols_to_drop = english_words_corpus
This is how the sparse series columns are created:
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_text[:, i].toarray().ravel(), fill_value=0)
To drop columns which match a list you can do this:
data = data.drop([col for col in list_of_cols_to_drop if col in data.columns], axis=1)
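As a variation on the same idea, DataFrame.drop accepts errors="ignore", which skips labels that are not present. The sketch below assumes the words live in a 'Word' column, as in the question, and uses invented toy frames.

```python
import pandas as pd

# Toy stand-ins: `data` has one column per word plus an id column.
data = pd.DataFrame({"id": [1, 2], "hello": [0, 1], "zzxq": [1, 0], "world": [2, 0]})
english_words_corpus = pd.DataFrame({"Word": ["hello", "world", "absent"]})

# Turn the DataFrame column into a plain list of labels first.
list_of_cols_to_drop = english_words_corpus["Word"].tolist()

# errors="ignore" skips labels such as "absent" that are not columns of `data`.
filtered = data.drop(columns=list_of_cols_to_drop, errors="ignore")
print(filtered.columns.tolist())
```

The key step is converting the 'Word' column to a list; passing the DataFrame itself to drop does not work as a list of labels.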

k means on structured data using python - more than one column

How does one run k-means on multiple columns of structured data?
In the example below it is done on one column (name):
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
Here only name is used, but say we wanted to use name and country. Should I add country to the same column as follows?
df_new['name'] = df_new['name'] + " " + df_new['country']
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
It works from a code perspective, and I am still trying to understand the results (I actually have tons of columns in the data), but I wonder whether that is the right way to fit when there is more than one column.
import os
import pandas as pd
import re
import numpy as np
df = pd.read_csv('sample-data.csv')
def split_description(string):
    # name
    string_split = string.split(' - ',1)
    name = string_split[0]
    return name
df_new = pd.DataFrame()
df_new['name'] = df.loc[:,'description'].apply(lambda x: split_description(x))
df_new['id'] = df['id']
def remove(name):
    new_name = re.sub("[0-9]", '', name)
    new_name = ' '.join(new_name.split())
    return new_name
df_new['name'] = df_new.loc[:,'name'].apply(lambda x: remove(x))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    stop_words = 'english',
    ngram_range=(1,4), min_df = 0.01, max_df = 0.8)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])
print (tfidf_matrix.shape)
print (tfidf_vectorizer.get_feature_names())
from sklearn.metrics.pairwise import cosine_similarity
dist = 1.0 - cosine_similarity(tfidf_matrix)
print (dist)
from sklearn.cluster import KMeans
num_clusters = range(1,20)
KM = [KMeans(n_clusters=k, random_state = 1).fit(tfidf_matrix) for k in num_clusters]
No, that is not the right way to fit multiple columns. You are simply jamming several features together and expecting k-means to behave as if it had been applied to those columns as separate features.
You need to use other methods, such as a vectorizer and Pipelines along with TfidfVectorizer, to do this on multiple columns. You can check out this link for more information.
Additionally, you can check out this answer for a possible alternative solution to your problem.
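One possible shape for such a pipeline is scikit-learn's ColumnTransformer, which gives each text column its own TfidfVectorizer and stacks the resulting feature blocks side by side. This is a sketch under assumptions: the toy rows, the 'country' column and the choice of two clusters are invented, and it may not be what the linked resources describe.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

df_new = pd.DataFrame({
    "name": ["trail running shoes", "road running shoes",
             "wool hiking socks", "cotton ankle socks"],
    "country": ["france", "france", "norway", "norway"],
})

# A plain string (not a list) as the column selector passes a 1D Series,
# which is what text vectorizers expect.
features = ColumnTransformer([
    ("name", TfidfVectorizer(), "name"),
    ("country", TfidfVectorizer(), "country"),
])

model = Pipeline([
    ("features", features),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=1)),
])
labels = model.fit_predict(df_new)
print(labels)
```

Each column keeps its own vocabulary and weighting this way, so a word that appears in both name and country is not merged into a single feature.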

Slow Data analysis using pandas

I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following snippet from my code runs disgustingly slowly. It generates a CSV with about 3MM lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            # assumed: a Date list parallel to API, like the other lists
            DateList.append(Date[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split(" ",1)
    for k in range(0,Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)
result.to_csv(ResultPath, index_label=False, index=False)
This snippet has been running for hours, and the output should have ~3MM lines. There has to be a faster way to do this with pandas. Here is what I am trying to accomplish:
for each unique API, I find all occurrences in the main list of APIs
using that information, I build a list of dates
I find the min and max date for each list corresponding to an API
I then build an empty pandas DataFrame that has every month between the two dates for the associated API
I then append this DataFrame to a list "dummydata" and loop to the next API
taking this dummy data list, I then concatenate it into a DataFrame
this DataFrame is then merged with another DataFrame holding the cleaned data
the end result is a CSV that has a 0 value for dates that did not exist but should, between the max and min dates for each corresponding API in the original unclean list
This all takes way longer than I would expect. I would have thought that finding the min and max date for each unique item and interpolating monthly between them, filling in months that don't have data with 0, would be like a three-line thing in pandas. Any options you think I should explore, or any snippets of code that could help me out, would be much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then when you actually build up your full_df it's better to do it all at once rather than one column at a time. So try something like this:
full_df = pd.concat([UniqueAPI[i],
[pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)],
Then you would need to add the column labels which you could also do all at once (in the same order as your lists in the concat() call).
full_df.columns = ['API', 'Production Month', 'Days', etc.]
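Going a step further, the nested loops could probably be replaced by a groupby that finds each API's first and last month, pd.date_range to enumerate the months in between, and a left merge to fill the gaps with 0. The sketch below is only an outline under assumptions: the toy data, the 'Gas' column and the name data_clean are invented, and dates are assumed to fall on the first of the month, as in the original loop.

```python
import pandas as pd

# Toy stand-in for the cleaned data; the real frame has many more columns.
data_clean = pd.DataFrame({
    "API": ["A", "A", "B", "B"],
    "Production Month": pd.to_datetime(
        ["2020-01-01", "2020-04-01", "2021-11-01", "2022-01-01"]),
    "Gas": [10.0, 40.0, 5.0, 7.0],
})

# First and last month per API, computed in one vectorized pass.
spans = data_clean.groupby("API")["Production Month"].agg(["min", "max"])

# One row per (API, month start) between each API's first and last month.
full = pd.concat(
    [pd.DataFrame({"API": api,
                   "Production Month": pd.date_range(first, last, freq="MS")})
     for api, first, last in spans.itertuples(name=None)],
    ignore_index=True,
)

# The left merge keeps every month; months without data become 0.
result = full.merge(data_clean, how="left", on=["API", "Production Month"]).fillna(0)
print(result)
```

This still loops once per API to build the date ranges, but it avoids scanning the full API list for every unique value, which is likely where most of the time goes in the original.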

