Python Access Labels of Sklearn CountVectorizer

Python Access Labels of Sklearn CountVectorizer - python

Here is my df after cleaning:
number summary cleanSummary
0 1-123 he loves ice cream love ice cream
1 1-234 she loves ice love ice
2 1-345 i hate avocado hate avocado
3 1-123 i like skim milk like skim milk
As you can see, there are two records that have the same number. Now I'll create and fit the vectorizer.
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1,1), analyzer='word')
cv.fit(df['cleanSummary'])
Now I'll transform.
freq = cv.transform(df['cleanSummary'])
Now if I take a look at freq...
freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
freq
frequency
0 1
1 1
2 1
3 2
4 1
5 2
6 1
7 1
...there doesn't seem to be a logical way to access the original number. I have tried methods of looping through each row, but this runs into problems because of the potential for multiple summaries per number. A loop using a grouped df...
def extractFeatures(groupedDF, textCol):
features = pd.DataFrame()
for id, group in groupedDF:
freq = cv.transform(group[textCol])
freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
dfinner['number'] = id
dfinner = dfinner.join(freq)
features = features.append(dfinner)
return features
...works, but the performance is terrible (i.e. 12 hours to run through 45,000 documents with one sentence lengths).
If I change
freq = sum(freq).toarray()[0]
to
freq = freq.toarray()
I get an array of frequencies for each ngram for each document. This is good, but then it doesn't allow me to push that array of lists into a dataframe. And I still wouldn't be able to access nunmber.
How do I access the original labels number for each ngram without looping over a grouped df? My desired result is:
number ngram frequency
1-123 love 1
1-123 ice 1
1-123 cream 1
1-234 love 1
1-234 ice 1
1-345 hate 1
1-345 avocado 1
1-123 like 1
1-123 skim 1
1-123 milk 1
Edit: this is somewhat of a revisit to this question:Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows. However, after implementing the method described in that answer, I face memory issues for a large corpus, so it doesn't seem scalable.

freq = cv.fit_transform(df.cleanSummary)
dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names(), index=df.number).stack()
dtm[dtm > 0]
number
1-123 cream 1
ice 1
love 1
1-234 ice 1
love 1
1-345 avocado 1
hate 1
1-123 like 1
milk 1
skim 1
dtype: int64

Related

Find how often products are sold together in Python DataFrame

I have a dataframe that is sturctured like below, but with 300 different products and about 20.000 orders.
Order
Avocado
Mango
Chili
1546
500
20
0
861153
200
500
5
1657446
500
20
0
79854
200
500
1
4654
500
20
0
74654
0
500
800
I found out what combinations are often together with this code (abbrivated here to 3 products).
size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)
Now I want to know per product how often it is bought solo and how often with other products.
Something like this would be my ideal output (fictional numbers) where the percentage shows what percentage of total orders with that product had the other products as well:
Product
Avocado
Mango
Chili
AVOCADO
100%
20 %
1%
MANGO
20 %
100%
3%
CHILI
20%
30%
100%

First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:
df2 = 1*(df.set_index('Order') > 0)
Then I think the easiest is just to use matrix algebra wrapped into a dataframe. Also given the size of your data it is a good idea to go directly to numpy rather than try to manipulate the dataframe.
For actual numbers of orders that contain (product1,product2), we can do
df3 = pd.DataFrame(data = df2.values.T#df2.values, columns = df2.columns, index = df2.columns)
df3 looks like this:
Avocado Mango Chili
------- --------- ------- -------
Avocado 5 5 2
Mango 5 6 3
Chili 2 3 3
eg there are 2 orders that contain Avocado and Chili
If you want percentages as in your question, we need to divide by the total number of orders with the given product. Again I htink going to numpy directly is best:
df4 = pd.DataFrame(data = ( (df2.values/np.sum(df2.values,axis=0)).T#df2.values), columns = df2.columns, index = df2.columns)
df4 is:
Avocado Mango Chili
------- --------- ------- -------
Avocado 1 1 0.4
Mango 0.833333 1 0.5
Chili 0.666667 1 1
the 'main' product is in the index and its companion in column so for example for products with Mango, 0.833333 had avocado and 0.5 had Chili

pandas grouping and visualization

I have to do some analysis using Python3 and pandas with a dataset which is shown as a toy example-
data
'''
location importance agent count
0 London Low chatbot 2
1 NYC Medium chatbot 1
2 London High human 3
3 London Low human 4
4 NYC High human 1
5 NYC Medium chatbot 2
6 Melbourne Low chatbot 3
7 Melbourne Low human 4
8 Melbourne High human 5
9 NYC High chatbot 5
'''
My aim is to group the location and then count the number of Low, Medium and/or High 'importance' column for each location. So far, the code I have come up with is-
data.groupby(['location', 'importance']).aggregate(np.size)
'''
agent count
location importance
London High 1 1
Low 2 2
Melbourne High 1 1
Low 2 2
NYC High 2 2
Medium 2 2
'''
This grouping and count aggregation contains index as the grouping objects-
data.groupby(['location', 'importance']).aggregate(np.size).index
I don't know how to proceed next? Also, how can I visualize this?
Help?

I think you need DataFrame.pivot_table, added aggfunc=sum for aggregate if duplicates and then use DataFrame.plot:
df = data.pivot_table(index='location', columns='importance', values='count', aggfunc='sum')
df.plot()
If need counts of pairs location with importance use crosstab:
df = pd.crosstab(data['location'], data['importance'])
df.plot()

Counting non-filtered value_counts along with filtered values in pandas

Assuming that I have a dataframe of pastries
Pastry Flavor Qty
0 Cupcake Cheese 3
1 Cakeslice Chocolate 2
2 Tart Honey 2
3 Croissant Raspberry 1
And I get the value count of a specific flavor per pastry
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts()
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Then to get the percentile of that flavor qty, I could do this
df[df['Flavor'] == 'Cheese']['Pastry'].value_counts().describe(percentiles=[.75, .85, .95])
And I'd get something like this (from full dataframe)
count 35.00000
mean 1.485714
std 0.853072
min 1.000000
50% 1.000000
75% 2.000000
85% 2.000000
95% 3.300000
max 4.000000
Where the total different pastries that are cheese flavored is 35, so the total cheese qty is distributed amongst those 35 pastries. The mean of qty is 1.48, max qty is 4 (cupcake and tart) etc, etc.
What I want to do is bring that 95th percentile down by counting all other values which are not 'Cheese' in the flavor column, however value_counts() is only counting the ones that are 'Cheese' because I filtered the dataframe. How can I also count the non Cheese rows, so that my percentiles will go down and will represent the distribution of Cheese total in the entire dataframe?
This is an example output:
Cupcake 4
Tart 4
Cakeslice 3
Turnover 3
Creampie 2
Danish 2
Bear Claw 2
Swiss Roll 1
Baklava 0
Cannoli 0
Where the non-cheese flavor pastries are being included with 0 as qty, from there I can just get the percentiles and they will be reduced since there are 0 values now diluting them.

I decided to go and try the long way to try and solve this question and my result gave me the same answer as this question
Here is the long way, in case anyone is curious.
pastries = {}
for p in df['Pastry'].unique():
pastries[p] = df[(df['Flavor'] == 'Cheese') & (df['Pastry'] == p)]['Pastry'].count()
newdf = pd.DataFrame.from_dict(pastries.items())
newdf.describe(percentiles=[.75, .85, .95])

How to increase importance of a column in Decision Tree?

I hava dataset with name,ratings,ratings_count,genres columns.
Ex: Movies_Data.csv
Name ratings ratings_count Action Adventure Horror Musical Thriller
Mad-Max 2 7 1 0 0 0 1
Mitchell[1975] 3.25 2 1 0 0 0 1
John Wick 4.23 4 1 0 0 0 0
Insidious 3.75 10 0 0 1 0 0
I divided it into features and labels. Then Performed Label Encoding for the Name column.
Here's my features Dataset after split.
features:
ratings ratings_count Action Adventure Horror Musical Thriller
2 7 1 0 0 0 1
3.25 2 1 0 0 0 1
4.23 4 1 0 0 0 0
3.75 10 0 0 1 0 0
Now the problem is I have around 18 'Genre' Columns. So i think my decision tree is giving more importance to the these columns rather than ratings and ratings_count.
Like if i ask the tree to predict a movie with the following parameters:
ratings:3 ratings_count:2 Action:1 Adventure:0 Horror:0 Musical:0 Thriller:1
It should obviously predict Mitchell[1975] since the ratings:3 is near to 3.25 and ratings_count is same as my input. But it's predicting Mad-Max.
How can i increase the importance of the ratings and ratings_count column?
I'm new to ML. So is there any other way or any other algorithm can i use for better recommendations?
P.s.I know we can use neural networks but i need to stick to Basic ML algorithms only.
Thanks!

First, Random Forests almost always bring better results than Decision Trees. They have a bit more hyperparameters to tune, but that can also help you to bring better results. It's called an Ensemble algorithm and it works well because it averages lots of Decision Trees. It has less overfitting problems, so it should perform better.
If you're still having trouble, you might try to fuse some categories (or get more data), so your algorithm can correctly infer the rating's importance.
Also, this question might be better suited for Cross Validated, where you can ask more theoretical questions.
Good luck !

groupby comma-separated values in single DataFrame column python/pandas

As an example, let's say I have a python pandas DataFrame that is the following:
# PERSON THINGS
0 Joe Candy Corn, Popsicles
1 Jane Popsicles
2 John Candy Corn, Ice Packs
3 Lefty Ice Packs, Hot Dogs
I would like to use the pandas groupby functionality to have the following output:
THINGS COUNT
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
I generally understand the following groupby command:
df.groupby(['THINGS']).count()
But the output is not by individual item, but by the entire string. I think I understand why this is, but it's not clear to me how to best approach the problem to get the desired output instead of the following:
THINGS PERSON
Candy Corn, Ice Packs 1
Candy Corn, Popsicles 1
Ice Packs, Hot Dogs 1
Popsicles 1
Does pandas have a function like the LIKE in SQL, or am I thinking about how to do this wrong in pandas?
Any assistance appreciated.

Create a series by splitting words, and use value_counts
In [292]: pd.Series(df.THINGS.str.cat(sep=', ').split(', ')).value_counts()
Out[292]:
Popsicles 2
Ice Packs 2
Candy Corn 2
Hot Dogs 1
dtype: int64

You need to split THINGS by , and flatten the series and count values.
pd.Series([item.strip() for sublist in df['THINGS'].str.split(',') for item in sublist]).value_counts()
Output:
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
dtype: int64

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Access Labels of Sklearn CountVectorizer - python

freq = cv.fit_transform(df.cleanSummary) dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names(), index=df.number).stack() dtm[dtm > 0] number 1-123 cream 1 ice 1 love 1 1-234 ice 1 love 1 1-345 avocado 1 hate 1 1-123 like 1 milk 1 skim 1 dtype: int64

Related

Find how often products are sold together in Python DataFrame

pandas grouping and visualization

Counting non-filtered value_counts along with filtered values in pandas

How to increase importance of a column in Decision Tree?

groupby comma-separated values in single DataFrame column python/pandas

Categories

Resources