Iterate nltk.tokenize across all rows of a pandas dataframe - python

Grateful for your help with what feels like a stupid question. I've pulled an SQLite table into a pandas dataframe so I can tokenize and count the frequency of words from a series of tweets.
With the code below, I can produce this for the first tweet. How do I iterate over the whole table?
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(data['tweet_text'][0])
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
unigram_df
When I change the value to anything other than a single row, I get the following error:
TypeError: expected string or buffer
I know there are other ways of doing this, but I need to do it along these lines because of how I intend to use the output next. Thanks for any help you can provide!
I have tried:
%%time
tokenizer = RegexpTokenizer(r'\w+')
print "Cleaning the tweets...\n"
for i in xrange(0,len(df)):
    if( (i+1)%1000000 == 0 ):
        tokens = tokenizer.tokenize(df['tweet_text'][i])
        words = nltk.FreqDist(tokens)
This looks like it should work, but still only returns words from the first row.

I think your problem can be solved more concisely using CountVectorizer. I'll give you an example. Given the following inputs:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus_tweets = [['I love pizza and hambuerger'],['I love apple and chips'], ['The pen is on the table!!']]
df = pd.DataFrame(corpus_tweets, columns=['tweet_text'])
You can create your bag of words template with these few lines:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.tweet_text)
You can print the obtained vocabulary:
count_vect.vocabulary_
# output: {'love': 5, 'pizza': 8, 'and': 0, 'hambuerger': 3, 'apple': 1, 'chips': 2, 'the': 10, 'pen': 7, 'is': 4, 'on': 6, 'table': 9}
and get the dataframe with word counts:
df_count = pd.DataFrame(X_train_counts.todense(), columns=count_vect.get_feature_names())
and apple chips hambuerger is love on pen pizza table the
0 1 0 0 1 0 1 0 0 1 0 0
1 1 1 1 0 0 1 0 0 0 0 0
2 0 0 0 0 1 0 1 1 0 1 2
If it is useful for you, you can merge the dataframe of the counts with the dataframe of the corpus:
pd.concat([df, df_count], axis=1)
tweet_text and apple chips hambuerger is love on \
0 I love pizza and hambuerger 1 0 0 1 0 1 0
1 I love apple and chips 1 1 1 0 0 1 0
2 The pen is on the table!! 0 0 0 0 1 0 1
pen pizza table the
0 0 1 0 0
1 0 0 0 0
2 1 0 1 2
If you want to get the dictionary containing the <word, count> pairs for each document, at this point all you need to do is:
dict_count = df_count.T.to_dict()
{0: {'and': 1,
'apple': 0,
'chips': 0,
'hambuerger': 1,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 1,
'table': 0,
'the': 0},
1: {'and': 1,
'apple': 1,
'chips': 1,
'hambuerger': 0,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 0,
'table': 0,
'the': 0},
2: {'and': 0,
'apple': 0,
'chips': 0,
'hambuerger': 0,
'is': 1,
'love': 0,
'on': 1,
'pen': 1,
'pizza': 0,
'table': 1,
'the': 2}}
Note: turning X_train_counts, which is a scipy sparse matrix, into a dataframe is not a good idea for large corpora. But it can be useful for understanding and visualizing the various steps of your model.
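Since the original question asks for a WORD/COUNT table over all tweets, here is a minimal sketch of how to get the totals from the objects above without densifying the sparse matrix (note that newer scikit-learn versions rename get_feature_names() to get_feature_names_out()):
import numpy as np
import pandas as pd
# Total occurrences of each word across all documents, computed on the sparse matrix
totals = np.asarray(X_train_counts.sum(axis=0)).ravel()
unigram_df = (pd.DataFrame({"WORD": count_vect.get_feature_names(),  # get_feature_names_out() on newer sklearn
                            "COUNT": totals})
              .sort_values("COUNT", ascending=False)
              .reset_index(drop=True))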

After creating the DataFrame, loop over all the rows and accumulate the counts in a single FreqDist:
tokenizer = RegexpTokenizer(r'\w+')
fdist = FreqDist()
for txt in data['tweet_text']:
    for word in tokenizer.tokenize(txt):
        fdist[word.lower()] += 1
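If you then want the same WORD/COUNT dataframe as in the question, the accumulated FreqDist converts directly, since FreqDist supports most_common() just like collections.Counter:
# Build the WORD/COUNT dataframe the question asked for
unigram_df = pd.DataFrame(fdist.most_common(), columns=["WORD", "COUNT"])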

In case anyone is interested in this niche use case, here's the code I was eventually able to make work:
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
alldata = str(data)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
                          columns=["WORD", "COUNT"])
Thanks for your help everyone!
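One caveat about this approach: str(data) uses the DataFrame's string representation, which pandas truncates for large frames and which also pulls in the column name and index numbers as tokens. A minimal alternative, assuming the same data and tokenizer as above, is to join the column values explicitly:
# Join all tweets into one string instead of relying on the truncated repr
alldata = " ".join(data['tweet_text'].astype(str))
tokens = tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)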

Related

Python: How to compare values of a row with a threshold to determine cycles

I have the following code that I wrote to read data from a machine in CSV format:
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes original messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
#calculates rolling and rolling center of pressure values
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
#sets threshold for machine being on or off, if rolling center average is greater than 115 psi, machine is considered on
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
df
The following DataFrame is created:
Throughout the rows of the "Machine On/Off" column there will be values of 1 or 0 based on the threshold I set. I need to write code that goes through these rows and indicates when a cycle has started. The problem is that the data is slightly off: during an "on" cycle there will be around 20 rows saying 1, with a couple of rows saying 0 because of poor data received.
I need code that compares the values across the data in order to determine the number of cycles the machine is on or off for. I was thinking that setting a threshold of around 6 rows would work, so that if the value is 1 for more than 6 rows it indicates a cycle, ignoring the incorrect 0's scattered throughout the column.
What would be the best way to program this so I can get a total count of cycles the machine is on or off for across the 20,000 rows of data I have?
Edit: Here is a similar example DataFrame. In this example there are 3 machine cycles (runs of 1 values), and mixed into the on cycles are 0 values (bad data). I need code that counts the total number of cycles and ignores the bad data that may appear in the middle of an 'on' cycle.
import pandas as pd
Machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df2 = pd.DataFrame(Machine)
You can create groups of consecutive rows of on/off using cumsum:
machine = [0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
df = pd.DataFrame(machine, columns=['Machine On/Off'])
df['group'] = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
df['group_size'] = df.groupby('group')['group'].transform('size')
# Output
Machine On/Off group group_size
0 0 1 6
1 0 1 6
2 0 1 6
3 0 1 6
4 0 1 6
5 0 1 6
6 1 2 5
7 1 2 5
8 1 2 5
9 1 2 5
10 1 2 5
I'm not sure I fully got your intention on how you would like to filter/alter the values, but this can probably serve as a guide:
threshold = 6
# Replace 0 with 1 where group_size < threshold. Note this makes the existing group columns stale.
df.loc[(df['Machine On/Off'].eq(0)) & (df.group_size.lt(threshold)), 'Machine On/Off'] = 1
# Output df['Machine On/Off'].values
array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
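To get the total count the question asks for, a minimal follow-up sketch (assuming the smoothed df above) recomputes the run ids and counts the distinct runs of 1s:
# Recompute run ids after smoothing, then count distinct "on" runs
groups = df['Machine On/Off'].ne(df['Machine On/Off'].shift()).cumsum()
on_cycles = groups[df['Machine On/Off'].eq(1)].nunique()
print(on_cycles)  # 3 for the sample data above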

Tokenization by date using nltk

I have the following dataset:
Date D
0 01/18/2020 shares recipes ... - news updates · breaking news emails · lives to remem...
1 01/18/2020 both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2 01/18/2020 honey, tea tree oil ...learn more from webmd about honey ...
3 01/18/2020 years of downtown arts | times leaderas the local community dealt with concerns, pet...
4 01/18/2020 brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. $16.00. smoked ...
5 01/19/2020 santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6 01/19/2020 abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7 01/19/2020 fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9 01/19/2020 100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..
I am applying CountVectorizer as follows:
stop_words = stopwords.words('english')
word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
to get the highest frequency values for bi-grams. Since I am interested in getting this info by date (i.e. grouping by 01/18/2020 and 01/19/2020 to get the bi-grams for each date), what I have done is not enough, since
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
creates an empty dataframe with no information about Date. How could I group bi-grams per date? If I were interested in unigrams, I would have done something like:
remove_words = list(stopwords.words('english'))
df.D = df.D.str.replace('\d+', '')
df.D = df.D.apply(lambda x: list(word for word in x.split() if word not in remove_words))
df.groupby('Date').agg({'D': 'value_counts'})
I do not know how to do something similar using nltk and CountVectorizer. I hope you can help me.
Expected output:
Date Bi-gram Frequency
0 2019-01-01 This is 1
1 2019-01-01 some sentence 1
....
n-m 2020-01-01 Stackoverlow is 1
....
n 2020-01-01 type now 1
Consider the sample dataframe
Date Sentence
0 2019-01-01 This is some sentence
1 2019-01-01 Another random sentence
2 2020-01-01 Stackoverlow is great
3 2020-01-01 What should I type now
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
# fit on entire dataset to get count for a word across the dataset
vec.fit(df["Sentence"])
df.groupby("Date").apply(lambda x: vec.transform(x["Sentence"]).toarray())
This gives you the count of each word per sentence for a given date. As mentioned in the comments, you can map each position back to its word using get_feature_names().
In [34]: print(vec.get_feature_names())
['another', 'great', 'is', 'now', 'random', 'sentence', 'should', 'some', 'stackoverlow', 'this', 'type', 'what']
Output -
Date
2019-01-01 [[0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]]
2020-01-01 [[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]]
dtype: object
Consider [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0], corresponding to the first sentence for the date 2019-01-01. Here the 1 at index 2 means the word 'is' occurs once in that sentence.
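To get closer to the expected Date / Bi-gram / Frequency output, one possible sketch (my adaptation of the code above, using ngram_range=(2, 2) as in the question) sums the counts within each date and reshapes to long format:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
vec = CountVectorizer(ngram_range=(2, 2))
counts = vec.fit_transform(df["Sentence"])
# One row per sentence, one column per bi-gram, indexed by date
count_df = pd.DataFrame(counts.toarray(),
                        columns=vec.get_feature_names(),  # get_feature_names_out() on newer sklearn
                        index=df["Date"])
# Sum within each date, then reshape to Date / Bi-gram / Frequency
result = count_df.groupby(level="Date").sum().stack().reset_index()
result.columns = ["Date", "Bi-gram", "Frequency"]
result = result[result["Frequency"] > 0]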

One hot encoding sentences

Here is my implementation of one-hot encoding:
%reset -f
import numpy as np
import pandas as pd

sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'
sentences.append(s1)
sentences.append(s2)

def get_all_words(sentences) :
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf :
        for f2 in f :
            all_words.append(f2)
    return all_words

def get_one_hot(s , s1 , all_words) :
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] :
        for aa in a :
            flattened.append(aa)
    return flattened

all_words = get_all_words(sentences)
print(get_one_hot(sentences , s1 , all_words))
print(get_one_hot(sentences , s2 , all_words))
This returns:
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
As you can see, a long sparse vector is returned even for these short sentences. It appears the encoding is happening at the character level instead of the word level? How do I correctly one-hot encode the words below?
I think the encodings should be:
s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0
Encoding at character level
This is because of the loop:
for f in unf :
    for f2 in f :
        all_words.append(f2)
where f2 loops over the characters of the string f. Indeed, you can rewrite the whole function as:
def get_all_words(sentences) :
    unf = [s.split(' ') for s in sentences]
    return list(set([word for sen in unf for word in sen]))
correct one-hot encoding
This loop
for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] :
    for aa in a :
        flattened.append(aa)
is actually making a very long vector. Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):
1 2 is sentence this
0 0 1 0 0 0
1 0 0 0 0 1
2 1 0 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
The loop above picks the corresponding columns from this dataframe and appends them to the output flattened. My suggestion is simply to leverage pandas' ability to subset several columns at once, then sum up and clip to either 0 or 1, to get the one-hot encoded vector:
def get_one_hot(s , s1 , all_words) :
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0,1).values
The output will be:
[0 1 1 1 1]
[1 1 0 1 1]
for your two sentences respectively. Here is how to interpret these: from the row indices of the one_hot_encoded_df dataframe, we know that index 0 stands for '2', index 1 for 'this', index 2 for '1', etc. So the output [0 1 1 1 1] means every item in the bag of words except '2', which you can confirm against the input 'this is sentence 1'.
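As an alternative (not part of the original answer), a minimal scikit-learn sketch produces the same presence/absence vectors in a couple of lines; the custom token_pattern is needed because the default pattern drops single-character tokens such as '1' and '2':
from sklearn.feature_extraction.text import CountVectorizer
sentences = ['this is sentence 1', 'this is sentence 2']
# binary=True records presence/absence instead of counts
vec = CountVectorizer(binary=True, token_pattern=r'\S+')
X = vec.fit_transform(sentences)
print(vec.get_feature_names())  # ['1', '2', 'is', 'sentence', 'this']
print(X.toarray())
# [[1 0 1 1 1]
#  [0 1 1 1 1]]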

Match letter frequency within a word against 26 letters in R (or python)

Currently, I have the string "abdicator". I would like to find out the frequency of the letters in this word compared against all 26 letters of the English alphabet, with an output in the form that follows.
Output:
a b c d e f g h i ... o ... r s t ... x y z
2 1 1 1 0 0 0 0 1 ... 1 ... 1 0 1 ... 0 0 0
This output can be a numeric vector (with the names being the 26 letters). My initial attempt was to first use the strsplit function to split the string into individual letters (using R):
strsplit("abdicator","") #split at every character
#[[1]]
#[1] "a" "b" "c" "d" "e"`
However, I am a little stuck as to what to do for the next step. Can someone enlighten me on this please? Many thanks.
In R:
table(c(letters, strsplit("abdicator", "")[[1]]))-1
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
And extending that a bit to handle the possibility of multiple words and/or capital letters:
words <- c("abdicator", "Syzygy")
letterCount <- function(X) table(c(letters, strsplit(tolower(X), "")[[1]]))-1
t(sapply(words, letterCount))
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# abdicator 2 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
# syzygy 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 1
In Python:
>>> from collections import Counter
>>> s = "abdicator"
>>> Counter(s)
Counter({'a': 2, 'c': 1, 'b': 1, 'd': 1, 'i': 1, 'o': 1, 'r': 1, 't': 1})
>>> map(Counter(s).__getitem__, map(chr, range(ord('a'), ord('z')+1)))
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
Or:
>>> import string
>>> map(Counter(s).__getitem__, string.lowercase)
[2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
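Note that map() returning a list and string.lowercase are Python 2 idioms; a rough Python 3 equivalent (my adaptation, not part of the original answer) would be:
from collections import Counter
import string
s = "abdicator"
counts = Counter(s)
# Counter returns 0 for missing letters, so every letter gets a count
print([counts[c] for c in string.ascii_lowercase])
# [2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]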
Python:
import collections
import string
counts = collections.Counter('abdicator')
chars = string.ascii_lowercase
print(*chars, sep=' ')
print(*[counts[char] for char in chars], sep=' ')
In Python 2:
import string, collections
ctr = collections.Counter('abdicator')
for l in string.ascii_lowercase:
    print l,
print
for l in string.ascii_lowercase:
    print ctr[l],
print
In Python 3, only the syntax of print changes.
This produces exactly the output you requested. The core idea is that a collections.Counter, indexed with a missing key, humbly returns 0 with the obvious semantics "this key has been seen 0 times", fully aligned with the semantics it uses for keys that are present (where it returns their count, i.e., the number of times they have been seen).
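If you also want the multi-word table produced by the R t(sapply(...)) version above, a small pandas sketch (my own adaptation, assuming the same two example words) could look like this:
import string
from collections import Counter
import pandas as pd
words = ["abdicator", "Syzygy"]
# One row per word, one column per letter of the alphabet
letter_counts = pd.DataFrame(
    [{ch: Counter(w.lower())[ch] for ch in string.ascii_lowercase} for w in words],
    index=[w.lower() for w in words],
)
print(letter_counts)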

Pandas DataFrame column concatenation

I have a pandas Dataframe y with 1 million rows and 5 columns.
np.shape(y)
(1037889, 5)
The column values are all 0 or 1. Looks something like this:
y.head()
a, b, c, d, e
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I want a Dataframe with 1 million rows and 1 column.
np.shape(y)
(1037889, )
where the column is just the 5 columns concatenated together.
New column
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I keep trying different things like merge, concat, dstack, etc...
but can't seem to figure this out.
If you want the new column to have all the data concatenated into a string, it's a good case for the apply() function:
>>> df = pd.DataFrame({'a':[0,1,0,0], 'b':[0,0,1,0], 'c':[1,0,1,0], 'd':[0,1,1,0], 'e':[0,1,1,0]})
>>> df
   a  b  c  d  e
0  0  0  1  0  0
1  1  0  0  1  1
2  0  1  1  1  1
3  0  0  0  0  0
>>> df2 = df.apply(lambda row: ','.join(map(str, row)), axis=1)
>>> df2
0    0,0,1,0,0
1    1,0,0,1,1
2    0,1,1,1,1
3    0,0,0,0,0
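On a million rows a row-wise apply can be slow; a rough vectorized alternative (my sketch, using the same df as above) concatenates column by column instead:
# Vectorized column-by-column concatenation, much faster than apply on large frames
joined = df[df.columns[0]].astype(str)
for col in df.columns[1:]:
    joined = joined + ',' + df[col].astype(str)
print(joined.head())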
