I have a dataframe df_reps:
RepID RepText
================
1328 Hello, this is me, ..
5744 This is a test reoprt, ..
8417 The UK has begun sending ventilators and oxygen ..
I am trying to extract TFIDF values from RepText, and here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def TFIDF2(df, use_idf, smooth_idf, ngram_range, stop_words):
    # to use only bigrams: ngram_range=(2, 2)
    tf_idf_vec = TfidfVectorizer(use_idf=use_idf,
                                 smooth_idf=smooth_idf,
                                 ngram_range=ngram_range,
                                 stop_words=stop_words)
    tf_idf_data = tf_idf_vec.fit_transform(df)
    tf_idf_dataframe = pd.DataFrame(tf_idf_data.toarray(),
                                    columns=tf_idf_vec.get_feature_names())
    return tf_idf_dataframe

df_tfidf = TFIDF2(df_reps["RepText"], True, False, (1, 1), "english")
But df_tfidf looks like this
df_tfidf
Out[25]:
UK reoort test ... begun sending Hello
0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 ... 0.0 0.0 0.0
3 0.0 0.0 0.0 ... 0.0 0.0 0.0
4 0.0 0.0 0.0 ... 0.0 0.0 0.0
The problem is that the index of df_tfidf is not related to RepID in df_reps.
I want the index of df_tfidf to be the RepID, so it looks like this:
RepID UK reoort test ... begun sending Hello
1328 0.0 0.0 0.0 ... 0.0 0.0 0.0
5744 0.0 0.0 0.0 ... 0.0 0.0 0.0
8417 0.0 0.0 0.0 ... 0.0 0.0 0.0
8823 0.0 0.0 0.0 ... 0.0 0.0 0.0
9938 0.0 0.0 0.0 ... 0.0 0.0 0.0
Here I have a dataset:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14876 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14877 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14878 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The Y-axis represents seconds, and the X-axis represents binned size ranges. This is cloud particle concentration data: for every second, there are 42 values giving the number of particles within each size range.
Each column represents a certain size range; those ranges, for example, are:
(micrometers)
0 = 0.0 to 20.0
1 = 20.0 to 40.0
2 = 40.0 to 60.0
3 = 60.0 to 80.0
4 = 80.0 to 100.0
5 = 100.0 to 125.0
6 = 125.0 to 150.0
7 = 150.0 to 200.0
8 = 200.0 to 250.0
9 = 250.0 to 300.0
10 = 300.0 to 350.0
11 = 350.0 to 400.0
12 = 400.0 to 475.0
etc.
The reason I included so many is I want to show how the bins are spaced. The width of the bins increases, and the increase in width does not follow any sort of formula.
What I want to do is replace the integer index for each column on the X-axis with these binned size ranges, and create a filled contour plot very similar to my reference example.
I am using a pandas dataframe to store the dataset, and I am currently attempting the plotting with pyplot's pcolormesh.
Edit: here is my attempt at starting with this.
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm

# reading the dataset (extracted using h5py) into a pandas dataframe
df = pd.DataFrame(ds_arr)
df = df.replace(-999.0, 0)

# bin midpoints, one per size bin
strcols = [10.0, 30.0, 50.0, 70.0, 90.0, 112.5, 137.5, 175.0, 225.0,
           275.0, 325.0, 375.0, 437.5, 512.5, 587.5, 662.5, 750.0,
           850.0, 950.0, 1100.0, 1300.0, 1500.0, 1700.0, 2000.0,
           2400.0, 2800.0, 3200.0, 3600.0, 4000.0, 4400.0, 4800.0,
           5500.0, 6500.0, 7500.0, 8500.0, 9500.0, 11000.0, 13000.0,
           15000.0, 17000.0, 19000.0, 22500.0]

# add the midpoints as a new column at the end of df
newdf = df
newdf['midpoints'] = strcols

# set the index to be the new column, and delete its name
newdf.set_index('midpoints', drop=True, inplace=True)
newdf.index.name = None

# transpose so the bins end up on the X-axis
newdf = newdf.T
print(newdf)

# check the data type of the indices
print('\ncolumns are:', type(newdf.columns))
print('rows are:', type(newdf.index))

# create the figure
fig, ax = plt.subplots(figsize=(13, 5))
fig.tight_layout(pad=6)

# copy the colormap so it can be modified in place without an error
cmap = cm.gnuplot2.copy()
cmap.set_bad(color='black')

# plot the data using pcolormesh
plot = ax.pcolormesh(newdf, norm=mpl.colors.LogNorm(), cmap=cmap)
plt.title('Number Concentration', pad=12.8)
plt.xlabel("Bins", rotation=0, labelpad=17.5)
plt.ylabel("Time (seconds)", labelpad=8.5)
cb = plt.colorbar(plot, shrink=1, aspect=25, location='right')
cb.ax.set_title('#/m4', pad=10, fontsize=10.5)
plt.show()
The resulting dataframe, where I use the midpoints of my desired bins as the header labels, looks like this:
10.0 30.0 50.0 70.0 90.0 112.5 137.5 175.0 225.0 275.0 325.0 375.0 437.5 ... 4400.0 4800.0 5500.0 6500.0 7500.0 8500.0 9500.0 11000.0 13000.0 15000.0 17000.0 19000.0 22500.0
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14876 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14877 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14878 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
And this is the plot I generated (plot image not shown).
The problem here is that the tick marks on the X-axis are not matching with the column headers in my resulting dataset.
Here is the output when I check what type of data my indices are:
columns are: <class 'pandas.core.indexes.numeric.Float64Index'>
rows are: <class 'pandas.core.indexes.base.Index'>
To be clear, my goal is to give each column a bin width, so that the range of size data on the X-axis starts at 0 and ends at the endpoint of my last bin. I want to be able to hardcode the bin widths for each column individually, and also to display the bins logarithmically scaled, or similar.
How should I configure my dataframe to output a plot similar to the example plot, with unevenly spaced yet logarithmically scaled binned data?
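One approach worth noting (a sketch with synthetic data, not the real dataset): pcolormesh also accepts explicit X and Y coordinate arrays, so you can pass the actual bin edges (one more edge than there are columns) instead of relying on integer positions, then put the X-axis on a log scale. A log axis cannot display 0, so the first edge is nudged to a small positive value here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when plotting interactively
import matplotlib as mpl
import matplotlib.pyplot as plt

# Synthetic concentrations: 100 seconds x 6 size bins (strictly positive for LogNorm)
rng = np.random.default_rng(0)
data = rng.random((100, 6)) * 10 + 0.1

# Uneven bin edges in micrometers: one more edge than columns
x_edges = np.array([1.0, 20.0, 40.0, 60.0, 80.0, 100.0, 125.0])
y_edges = np.arange(data.shape[0] + 1)  # one edge per second boundary

fig, ax = plt.subplots(figsize=(13, 5))
mesh = ax.pcolormesh(x_edges, y_edges, data,
                     norm=mpl.colors.LogNorm(), cmap="gnuplot2")
ax.set_xscale("log")  # log-spaced ticks over the uneven bins
ax.set_xlabel("Particle size (micrometers)")
ax.set_ylabel("Time (seconds)")
fig.colorbar(mesh, ax=ax)
```

With edges supplied explicitly, the tick labels are real sizes rather than column positions, so the uneven bin widths take care of themselves.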
It seems like the while loop should terminate once start == 1, but it keeps going. It also seems it's not actually printing the values, just 0.
Given a positive integer n, the following rules will always create a
sequence that ends with 1, called the hailstone sequence:
If n is even, divide it by 2
If n is odd, multiply it by 3 and add 1 (i.e. 3n + 1)
Continue until n is 1
Write a program that reads an
integer as input and prints the hailstone sequence starting with the
integer entered. Format the output so that ten integers, each
separated by a tab character (\t), are printed per line.
The output format can be achieved as follows: print(n, end='\t')
Ex: If the input is:
25
the output is:
25 76 38 19 58 29 88 44 22 11
34 17 52 26 13 40 20 10 5 16
8 4 2 1
My code:
''' Type your code here. '''
start = int()
while True:
print(start, end='\t')
if start % 2 == 0:
start = start/2
print(start, end='\t')
elif start % 2 == 1:
start = (start *3)+1
print(start, end='\t')
if start == 1:
print(start, end='\t')
break
print(start, end='\t')
Program errors displayed here
Program generated too much output.
Output restricted to 50000 characters.
Check program for any unterminated loops generating output.
Program output displayed here
0	0.0	0.0	0.0	0.0	... (0.0 repeated until the 50000-character output limit)
Your loop isn't terminating because start is 0: int() with no argument returns 0 (your code never reads the input), 0 % 2 == 0 is true, and 0 / 2 == 0, so you're stuck in an infinite loop printing zeros. You could fix this by actually reading the input and raising an exception if start <= 0, like this:
start = int(input())
if start <= 0:
    raise ValueError('start must be strictly positive')
while True:
    print(start, end='\t')
    if start % 2 == 0:
        start //= 2
    else:
        start = 3 * start + 1
    if start == 1:
        print(start, end='\t')  # print the final 1 before leaving the loop
        break
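For completeness, the exercise also asks for ten numbers per line; here is a sketch that adds a simple counter for the line breaks (the function name hailstone is my own):

```python
def hailstone(n):
    """Print the hailstone sequence starting at n, ten numbers per line."""
    if n <= 0:
        raise ValueError("n must be strictly positive")
    count = 0
    while True:
        print(n, end='\t')
        count += 1
        if count % 10 == 0:  # break the line after every tenth number
            print()
        if n == 1:
            break
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    print()

hailstone(25)
```

With input 25 this prints the 24-number sequence from the problem statement, wrapped after every tenth value.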
I'm trying to run a simple TfidfVectorizer example with two requirements:
Ignore numbers
Use min_df (ignore terms that have a document frequency strictly lower than the given threshold)
But I can't get the right results:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import pandas as pd
import nltk
import re
nltk.download('stopwords')
data = fetch_20newsgroups(subset='all')['data']
english_stop_words = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=english_stop_words,
max_features=5000,
min_df=200,
#token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b'
)
tfidf = vectorizer.fit_transform(data)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf.head())
Results:
00 000 01 02 03 04 05 10 100 1000 ... wrote \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.047383
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.024252 0.0 0.0 ... 0.000000
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
When I uncomment the line token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b' I get an error:
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
So I have to comment out the line min_df=200, and I still get strange values:
a b d e f i k l n o p r \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I have tried the answer from this post: How can I prevent TfidfVectorizer to get numbers as vocabulary, but it didn't work.
How can I use TfidfVectorizer so that it both ignores numbers and uses min_df?
I have a matrix of the form :
movie_id 1 2 3 ... 1494 1497 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 1.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0
. ...
.
.
As you can see, even though there are 1500 movies in my dataset, some movies haven't been recorded because of the preprocessing my data has gone through.
What I want is to add all the columns (movie_ids) that haven't been recorded and fill them with 0 (I don't know exactly which movie_ids are missing). So, for example, I want a new matrix of the form:
movie_id 1 2 3 ... 1494 1495 1496 1497 1498 1499 1500
user_id
1600 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1601 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1602 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0
1603 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1604 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0
. ...
.
.
Use DataFrame.reindex along axis=1 with fill_value=0 to conform the dataframe columns to a new index range:
df = df.reindex(range(df.columns.min(), df.columns.max() + 1), axis=1, fill_value=0)
Result:
movie_id    1     2     3  ...  1498  1499  1500
user_id
1600      1.0   0.0   1.0  ...     0     0   1.0
1601      1.0   0.0   0.0  ...     0     0   0.0
1602      0.0   0.0   0.0  ...     0     0   1.0
1603      0.0   0.0   1.0  ...     0     0   0.0
1604      1.0   0.0   0.0  ...     0     0   0.0
I assume the matrix is stored in a variable named matrix:
n_movies = 1500
movie_ids = matrix.columns
for movie_id in range(1, n_movies + 1):
    # if a movie_id is missing, create a column filled with zeros
    if movie_id not in movie_ids:
        matrix[movie_id] = 0
# note: new columns are appended at the end; sort them if order matters:
# matrix = matrix.sort_index(axis=1)
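The reindex approach can be checked on a toy frame; here is a sketch with hypothetical movie_ids 2 and 4 missing:

```python
import pandas as pd

# Toy version of the matrix: movie_ids 2 and 4 never appear
matrix = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]],
    index=pd.Index([1600, 1601], name="user_id"),
    columns=pd.Index([1, 3, 5], name="movie_id"),
)

# Insert the missing movie_id columns, filled with 0, in sorted order
filled = matrix.reindex(range(1, 6), axis=1, fill_value=0)
print(filled.columns.tolist())  # [1, 2, 3, 4, 5]
```

Unlike the loop version, reindex also returns the columns already sorted, with no follow-up sort needed.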
I have dataframe which looks like below:
df:
Review_Text Noun Thumbups
Would be nice to be able to import files from ... [My, Tracks, app, phone, Google, Drive, import... 1.0
No Offline Maps! It used to have offline maps ... [Offline, Maps, menu, option, video, exchange,... 18.0
Great application. Designed with very well tho... [application, application] 16.0
Great App. Nice and simple but accurate. Wish ... [Great, App, Nice, Exported] 0.0
Save For Offline - This does not work. The rou... [Save, Offline, route, filesystem] 12.0
Since latest update app will not run. Subscrip... [update, app, Subscription, March, application] 9.0
Great app. Love it! And all the things it does... [Great, app, Thank, work] 1.0
I have paid for subscription but keeps telling... [subscription, trial, period] 0.0
Error: The route cannot be save for no locatio... [Error, route, i, GPS] 0.0
When try to restore my tracks it says "unable ... [try, file, locally-1] 0.0
Was a good app but since the update it only re... [app, update, metre] 2.0
Based on the 'Noun' column values, I want to create other columns. For example, all values of the Noun list in the first row become columns, and those columns contain the value of the 'Thumbups' column. If a column name is already present in the dataframe, the 'Thumbups' value is added to the existing value of that column.
I was trying to implement this using pivot_table:
pd.pivot_table(latest_review,columns='Noun',values='Thumbups')
But got following error:
TypeError: unhashable type: 'list'
Could anyone help me in fixing the issue?
Use Series.str.join with Series.str.get_dummies to build indicator columns, then multiply by the Thumbups column with DataFrame.mul:
df1 = df['Noun'].str.join('|').str.get_dummies().mul(df['Thumbups'], axis=0)
print(df1)
App Drive Error Exported GPS Google Great Maps March My Nice \
0 0.0 10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Offline Save Subscription Thank Tracks app application exchange \
0 0.0 0.0 0.0 0.0 10.0 10.0 0.0 0.0
1 180.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0
2 0.0 0.0 0.0 0.0 0.0 0.0 160.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 120.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 90.0 0.0 0.0 90.0 90.0 0.0
6 0.0 0.0 0.0 10.0 0.0 10.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
file filesystem i import locally-1 menu metre option period \
0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 180.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
phone route subscription trial try update video work
0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
rows = []
# unpack the list values of the Noun column, one row per noun
_ = df.apply(lambda row: [rows.append([row['Review_Text'], row['Thumbups'], nn])
                          for nn in row.Noun], axis=1)
# create a new dataframe from the unpacked values
# (the column names must match the order the values were appended in)
df_new = pd.DataFrame(rows, columns=['Review_Text', 'Thumbups', 'Noun'])
# now do the pivot operation on df_new
pivot_df = df_new.pivot(index='Review_Text', columns='Noun')
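On pandas >= 0.25, the unpacking step can also be done with DataFrame.explode, then aggregated with pivot_table (which, unlike pivot, tolerates duplicate nouns in one review). A sketch on a two-row toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Review_Text": ["great app", "no offline maps"],
    "Noun": [["app", "Great"], ["Offline", "Maps", "app"]],
    "Thumbups": [1.0, 18.0],
})

# One row per (review, noun) pair
long_df = df.explode("Noun")

# Sum Thumbups per noun; missing noun/review pairs become 0
out = long_df.pivot_table(index="Review_Text", columns="Noun",
                          values="Thumbups", aggfunc="sum", fill_value=0)
```

Each noun column now holds the Thumbups value of the reviews that mention it, summed when a noun repeats.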