Tokenization by date using nltk - python

I have the following dataset:
     Date         D
0 01/18/2020 shares recipes ... - news updates · breaking news emails · lives to remem...
1 01/18/2020 both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2 01/18/2020 honey, tea tree oil ...learn more from webmd about honey ...
3 01/18/2020 years of downtown arts | times leaderas the local community dealt with concerns, pet...
4 01/18/2020 brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. $16.00. smoked ...
5 01/19/2020 santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6 01/19/2020 abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7 01/19/2020 fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9 01/19/2020 100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..
I am applying CountVectorizer as follows:
stop_words = stopwords.words('english')
word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
to get the highest-frequency bi-grams. Since I am interested in getting this information by date (i.e. grouping by 01/18/2020 and 01/19/2020 to get the bi-grams for each date), what I have done is not enough, since
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)
creates a dataframe with no information about Date. How could I group bi-grams by date? If I were interested in unigrams, I would have done something like:
remove_words = list(stopwords.words('english'))
df.D = df.D.str.replace('\d+', '')
df.D = df.D.apply(lambda x: list(word for word in x.split() if word not in remove_words))
df.groupby('Date').agg({'D': 'value_counts'})
I do not know how to do something similar using nltk and CountVectorizer. I hope you can help me.
Expected output:
       Date         Bi-gram           Frequency
0      2019-01-01   This is           1
1      2019-01-01   some sentence     1
....
n-m    2020-01-01   Stackoverlow is   1
....
n      2020-01-01   type now          1

Consider the sample dataframe
Date Sentence
0 2019-01-01 This is some sentence
1 2019-01-01 Another random sentence
2 2020-01-01 Stackoverlow is great
3 2020-01-01 What should I type now
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# fit on entire dataset to get count for a word across the dataset
vec.fit(df["Sentence"])
df.groupby("Date").apply(lambda x: vec.transform(x["Sentence"]).toarray())
This gives you the count of each word per sentence for a given date. As mentioned in the comment, you can map each position back to its word using get_feature_names():
In [34]: print(vec.get_feature_names())
['another', 'great', 'is', 'now', 'random', 'sentence', 'should', 'some', 'stackoverlow', 'this', 'type', 'what']
Output -
Date
2019-01-01 [[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]]
2020-01-01 [[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]]
dtype: object
Consider [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], which corresponds to the first sentence for the date 2019-01-01. The 1 at index 2 means the word 'is' occurs once in that sentence.
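If you want the result in the exact (Date, Bi-gram, Frequency) shape from the question, one option is to fit a bigram vectorizer per date group and sum its counts. A rough sketch, assuming df has the Date and D columns as in the question; the built-in 'english' stop list stands in for the nltk stop words, and older sklearn versions use get_feature_names() instead of get_feature_names_out():
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def bigram_counts(texts):
    # fit a bigram vectorizer on one date's documents and sum the counts per bigram
    vec = CountVectorizer(ngram_range=(2, 2), analyzer='word', stop_words='english')
    counts = vec.fit_transform(texts).sum(axis=0).A1
    return pd.Series(counts, index=vec.get_feature_names_out())

result = df.groupby('Date')['D'].apply(bigram_counts).reset_index()
result.columns = ['Date', 'Bi-gram', 'Frequency']
result = result.sort_values(['Date', 'Frequency'], ascending=[True, False])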

Related

Python: How to compare values of a row with a threshold to determine cycles

I have the following code I made that gets data from a machine in CSV format:
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes the original messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
#calculates rolling and rolling center of pressure values
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
#sets threshold for machine being on or off, if rolling center average is greater than 115 psi, machine is considered on
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
df
The following DF is created:
Throughout the rows in the "Machine On/Off" column there will be values of 1 or 0 based on the threshold I set. I need to write code that goes through these rows and indicates when a cycle has started. The problem is that the data is slightly noisy: during an "on" cycle there will be around 20 rows saying 1, with a couple of rows saying 0 due to poor data received.
I need code that compares the values across the data in order to determine the number of cycles the machine is on or off for. I was thinking that a row-count threshold would work, so that if the value is 1 for more than 6 rows it indicates a cycle, ignoring the incorrect 0's scattered throughout the column.
What would be the best way to program this so I can get a total count of cycles the machine is on or off for across the 20,000 rows of data I have?
Edit: Here is a similar example DataFrame. In this example there are 3 machine cycles (runs of 1 values), and mixed into the on cycles are 0 values (bad data). I need code that counts the total number of cycles and ignores the bad data that may appear in the middle of an 'on' cycle.
import pandas as pd
Machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df2 = pd.DataFrame(Machine)
You can create groups of consecutive rows of on/off using cumsum:
machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df = pd.DataFrame(machine, columns=['Machine On/Off'])
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
df['group_size'] = df.groupby('group')['group'].transform('size')
# Output
    Machine On/Off  group  group_size
0                0      1           6
1                0      1           6
2                0      1           6
3                0      1           6
4                0      1           6
5                0      1           6
6                1      2           5
7                1      2           5
8                1      2           5
9                1      2           5
10               1      2           5
I'm not sure I got your intention on how you would like to filter/alter the values, but probably this can serve as a guide:
threshold = 6
# Replace 0 with 1 where group_size < threshold. Note that the group/group_size columns become stale after this.
df.loc[(df['Machine On/Off'].eq(0)) & (df.group_size.lt(threshold)), 'Machine On/Off'] = 1
# Output df['Machine On/Off'].values
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
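To turn the smoothed column into a total cycle count, you can regroup consecutive values and count the runs of 1. A minimal sketch building on the df from the answer above:
# regroup after the smoothing step, then count the runs where the machine is on
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
runs = df.groupby('group')['Machine On/Off'].first()
n_on_cycles = int((runs == 1).sum())
print(n_on_cycles)  # 3 for the sample data above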

Isolating Rows Of A Dataframe in a loop based on multiple conditions [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
So I asked a related question recently and, while the answer was simple then (I failed to utilize a specific column), this time I don't have that column. Here is the OP. None of the extra answers provided there actually work either :/
The problem is with a multilabel data frame where you want to isolate rows that contain 1 for a given class and 0 for all the others. Here is the code I have so far, but it loops forever and crashes Colab.
In this case I want just the Action row, but I'm also trying to loop it so I can append all rows with Action equal to 1 and the remaining column_list columns equal to 0, then History equal to 1 and all others 0, etc.
Again, the options provided at the link give me a "The truth value of a Series is ambiguous" error.
Index | Drama | Western | Action | History
  0   |   1   |    1    |   0    |    0
  1   |   0   |    0    |   0    |    1
  2   |   0   |    0    |   1    |    0
# Column list to be popped
column_list = list(balanced_df.columns)[1:]
single_labels = []
i = 0
# 28 columns total
while i < 27:
    # defining/resetting the full column list at the start of each loop
    column_list = list(balanced_df.iloc[:, 1:])
    # Pop column name at index i
    x = column_list.pop(i)
    # storing the results in a list of lists
    # Filters for the popped column where the column is 1 & the remaining columns are set to 0
    single_labels.append(balanced_df[(balanced_df[x] == 1) & (balanced_df[column_list] == 0)])
    # increment the column index number for the next run
    i += 1
The output here would be something like
single_labels[0]
Index | Drama | Western | Action | History
  2   |   0   |    0    |   1    |    0
single_labels[1]
Index | Drama | Western | Action | History
  1   |   0   |    0    |   0    |    1
You don't need a loop.
You rarely need loops with pandas.
If you're selecting rows based on conditions, you should use boolean indexing.
In your case, that's:
df.loc[df.sum(axis='columns').eq(1)]
As an example:
import pandas

pandas.DataFrame({
    'A': [1, 0, 0, 0, 0, 1, 1, 0, 0],
    'B': [0, 1, 0, 0, 1, 0, 1, 0, 0],
    'C': [0, 0, 1, 0, 1, 0, 0, 1, 0],
    'D': [0, 0, 0, 1, 0, 1, 0, 1, 0],
}).loc[lambda df: df.sum(axis='columns').eq(1)].values.tolist()
Which outputs:
[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
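If you additionally want the single-label rows split out per class (the single_labels list the question builds), you can reuse the same boolean mask without a while loop. A short sketch, assuming the label columns are everything after the first column, as in the question's code:
label_cols = list(balanced_df.columns)[1:]

# rows where exactly one label column is 1
single = balanced_df[balanced_df[label_cols].sum(axis=1).eq(1)]

# one sub-frame per class, e.g. single_labels['Action'] holds the Action-only rows
single_labels = {col: single[single[col].eq(1)] for col in label_cols}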

Iterate nltk.tokenize across all rows of a pandas dataframe

grateful for your help for what feels like a stupid question. I've pulled a sqlite table into a pandas dataframe so I can tokenize and count the frequency of words from a series of tweets.
With the code below, I can produce this for the first tweet. How do I iterate for the whole table?
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(data['tweet_text'][0])
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
unigram_df
When I change the value to anything other than a single row, I get the following error:
TypeError: expected string or buffer
I know there are other ways of doing this, but I need to do it along these lines because of how I intend to use the output next. Thanks for any help you can provide!
I have tried:
%%time
tokenizer = RegexpTokenizer(r'\w+')
print "Cleaning the tweets...\n"
for i in xrange(0, len(df)):
    if( (i+1)%1000000 == 0 ):
        tokens = tokenizer.tokenize(df['tweet_text'][i])
        words = nltk.FreqDist(tokens)
This looks like it should work, but still only returns words from the first row.
I think your problem can be solved more concisely using CountVectorizer. I'll give you an example. Given the following inputs:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus_tweets = [['I love pizza and hambuerger'],['I love apple and chips'], ['The pen is on the table!!']]
df = pd.DataFrame(corpus_tweets, columns=['tweet_text'])
You can create your bag of words template with these few lines:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.tweet_text)
You can print the obtained vocabulary:
count_vect.vocabulary_
# output: {'love': 5, 'pizza': 8, 'and': 0, 'hambuerger': 3, 'apple': 1, 'chips': 2, 'the': 10, 'pen': 7, 'is': 4, 'on': 6, 'table': 9}
and get the dataframe with word counts:
df_count = pd.DataFrame(X_train_counts.todense(), columns=count_vect.get_feature_names())
   and  apple  chips  hambuerger  is  love  on  pen  pizza  table  the
0    1      0      0           1   0     1   0    0      1      0    0
1    1      1      1           0   0     1   0    0      0      0    0
2    0      0      0           0   1     0   1    1      0      1    2
If it is useful for you, you can merge the dataframe of the counts with the dataframe of the corpus:
pd.concat([df, df_count], axis=1)
tweet_text and apple chips hambuerger is love on \
0 I love pizza and hambuerger 1 0 0 1 0 1 0
1 I love apple and chips 1 1 1 0 0 1 0
2 The pen is on the table!! 0 0 0 0 1 0 1
pen pizza table the
0 0 1 0 0
1 0 0 0 0
2 1 0 1 2
If you want to get the dictionary containing the <word, count> pairs for each document, at this point all you need to do is:
dict_count = df_count.T.to_dict()
{0: {'and': 1,
'apple': 0,
'chips': 0,
'hambuerger': 1,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 1,
'table': 0,
'the': 0},
1: {'and': 1,
'apple': 1,
'chips': 1,
'hambuerger': 0,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 0,
'table': 0,
'the': 0},
2: {'and': 0,
'apple': 0,
'chips': 0,
'hambuerger': 0,
'is': 1,
'love': 0,
'on': 1,
'pen': 1,
'pizza': 0,
'table': 1,
'the': 2}}
Note: turning X_train_counts, which is a scipy sparse matrix, into a dense dataframe is not a good idea for large corpora. But it can be useful for understanding and visualizing the various steps of your model.
After creating the DataFrame, loop over all the rows:
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
fdist = FreqDist()
for txt in data['tweet_text']:
    for word in tokenizer.tokenize(txt):
        fdist[word.lower()] += 1
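If you then want the same WORD/COUNT dataframe the question builds for a single tweet, the frequency distribution converts directly; a small sketch reusing the fdist from above:
import pandas as pd

unigram_df = pd.DataFrame(fdist.most_common(), columns=["WORD", "COUNT"])
print(unigram_df.head())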
In case anyone is interested in this niche use case, here's the code I was eventually able to make work:
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
alldata = str(data)
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
Thanks for your help everyone!

One hot encoding sentences

Here is my implementation of one-hot encoding:
%reset -f
import numpy as np
import pandas as pd
sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'
sentences.append(s1)
sentences.append(s2)
def get_all_words(sentences) :
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf :
        for f2 in f :
            all_words.append(f2)
    return all_words

def get_one_hot(s , s1 , all_words) :
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] :
        for aa in a :
            flattened.append(aa)
    return flattened

all_words = get_all_words(sentences)
print(get_one_hot(sentences , s1 , all_words))
print(get_one_hot(sentences , s2 , all_words))
this returns :
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
As you can see, a long sparse vector is returned for these small sentences. It appears the encoding is occurring at character level instead of word level? How do I correctly one-hot encode the words below?
I think the encodings should be:
s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0
Encoding at character level
This is because of the loop:
for f in unf :
    for f2 in f :
        all_words.append(f2)
that f2 is looping over characters of string f. Indeed you can rewrite the whole function to be:
def get_all_words(sentences) :
    unf = [s.split(' ') for s in sentences]
    return list(set([word for sen in unf for word in sen]))
correct one-hot encoding
This loop
for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] :
    for aa in a :
        flattened.append(aa)
is actually making a very long vector. Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):
   1  2  is  sentence  this
0  0  1   0         0     0
1  0  0   0         0     1
2  1  0   0         0     0
3  0  0   1         0     0
4  0  0   0         1     0
The loop above picks the corresponding columns from this dataframe and appends them to the output flattened. My suggestion is to simply leverage pandas' ability to subset several columns at once, then sum up and clip to either 0 or 1 to get the one-hot encoded vector:
def get_one_hot(s , s1 , all_words) :
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0,1).values
The output will be:
[0 1 1 1 1]
[1 1 0 1 1]
For your two sentences respectively. This is how to interpret them: from the row indices of the one_hot_encoded_df dataframe, we know that index 0 stands for 2, index 1 for this, index 2 for 1, etc. So the output [0 1 1 1 1] means every item in the bag of words except 2, which you can confirm against the input 'this is sentence 1'.
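As an aside (not part of the original answer), scikit-learn can produce the same word-level presence vectors directly with CountVectorizer(binary=True); a minimal sketch, using a custom token_pattern so the single-character tokens '1' and '2' are kept:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['this is sentence 1', 'this is sentence 2']

# binary=True turns raw counts into 0/1 presence indicators (a word-level one-hot bag of words)
vec = CountVectorizer(binary=True, token_pattern=r'\S+')
encoded = vec.fit_transform(sentences).toarray()
print(vec.get_feature_names_out())  # ['1' '2' 'is' 'sentence' 'this'] (vocabulary is sorted)
print(encoded)                      # [[1 0 1 1 1]
                                    #  [0 1 1 1 1]]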

Why DBSCAN clustering returns single cluster on Movie lens data set?

The Scenario:
I'm performing clustering on the MovieLens dataset, which I have in 2 formats:
OLD FORMAT:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
NEW FORMAT:
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
over which I need to perform Clustering using KMeans, DBSCAN and HDBSCAN.
With KMeans I'm able to set and get clusters.
The Problem
The problem occurs only with DBSCAN & HDBSCAN: I'm unable to get a reasonable number of clusters (I do know we cannot set the number of clusters manually).
Techniques Tried:
Tried this with the IRIS data-set, where Species isn't included; it is a string and is the target to be predicted anyway, and everything works fine with that dataset (Snippet 1).
Tried the Movie Lens 100K dataset in OLD FORMAT (with and without UID), on the analogy that UID == SPECIES and hence also trying without it (Snippet 2).
Tried the same with NEW FORMAT (with and without UID), yet the results came out the same.
Snippet 1:
print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris
iris = load_iris()
dbscan = DBSCAN()
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)
Snippet 1 (Output):
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Snippet 2:
import pandas as pd
from sklearn.cluster import DBSCAN

data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch is 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
elif ch is 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Snippet 2 (Output):
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
As seen, it returns only 1 cluster. I'd like to hear what I am doing wrong.
As pointed out by @faraway and @Anony-Mousse, the solution is more about the mathematics of the dataset than about programming.
I could finally figure out the clusters. Here's how:
db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d
The epsilon and outlier concepts became much clearer through this SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?
You need to choose appropriate parameters. With an epsilon that is too small, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen differently for each data set.
You also need to preprocess your data.
It's trivial to get "clusters" with kmeans that are meaningless...
Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.
Firstly, you need to preprocess your data, removing any useless attributes such as ids, and any incomplete instances (in case your chosen distance measure can't handle them).
It's good to understand that these algorithms are from two different paradigms, centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually have the number of clusters as a input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).
Normally in the literature the number of neighbors (minPts) is set to 4 and the radius (eps) is found through analyzing different values. You may find HDBSCAN* easier to use as you only need to inform the number of neighbors (minPts).
If, after trying different configurations, you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.
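As a rough illustration of the HDBSCAN* suggestion above, here is a minimal sketch using the hdbscan package (the parameter values are placeholders, not tuned for the MovieLens matrix):
import hdbscan

# min_cluster_size / min_samples are the main knobs; no eps is required
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=4)
labels = clusterer.fit_predict(data_set)   # -1 marks noise points
print(set(labels))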
Have you tried seeing how the clusters look in 2D space using PCA, for example? If the whole data set is dense and actually forms a single group, then you might well get a single cluster.
Change other parameters like min_samples=5, algorithm, and metric. Possible values of algorithm and metric can be checked from sklearn.neighbors.VALID_METRICS.
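One common way to pick eps, in line with the advice above, is to scale the features and inspect a k-distance plot: sort each point's distance to its k-th nearest neighbour and read eps off the elbow of the curve. A sketch, assuming data_set is the numeric rating matrix from the question:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X = StandardScaler().fit_transform(data_set)

# distances to the k nearest neighbours (the first column is the point itself, distance 0)
k = 4
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
plt.plot(np.sort(dist[:, -1]))
plt.ylabel('distance to %dth nearest neighbour' % k)
plt.show()

eps_from_plot = 0.5  # placeholder: replace with the elbow value read from the plot
labels = DBSCAN(eps=eps_from_plot, min_samples=k).fit_predict(X)
print(set(labels))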
