Hi, I just discovered the tweepy API. I can create a dataframe with pandas from the tweets object fetched via tweepy, and I want to build a word count dataframe for my tweets. Here's my code:
freq_df = hastag_tweets_df["Tweet"].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis =0).sort_values(ascending=False).reset_index().head(10)
freq_df.columns = ["Words","freq"]
print('FREQ DF\n\n')
print(freq_df)
print('\n\n')
a = freq_df[freq_df.freq > freq_df.freq.mean() + freq_df.freq.std()]
#plotting
fig = a.plot.barh(x="Words", y="freq").get_figure()
This doesn't look the way I want, because the result always starts with an "empty space" entry and the word "the", like:
  Words   freq
0         301.0
1   the   164.0
So how can I get the desired data, without the empty entry and common words like 'the'?
Thank you
You can use the spaCy library to do this. With it you can easily remove words like "the" and "a", known as stop words.
It's easy to install: pip install spacy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# process text using spaCy
def process_text(text):
    # tokenize text
    doc = nlp(text)
    token_list = [t.text for t in doc]
    # remove stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if not lexeme.is_stop:
            filtered_sentence.append(word)
    # here we return the number of filtered words; you can return the list itself as well
    return len(filtered_sentence)

df = (
    df
    .assign(words_count=lambda d: d['comment'].apply(lambda c: process_text(c)))
)
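To tie this back to the original question, here is a minimal sketch (assuming the hastag_tweets_df["Tweet"] column from your code) that drops empty tokens and stop words before counting, so the frequency table no longer starts with blanks or words like "the":

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_tokens(text):
    # keep lowercase tokens that are not stop words, punctuation, or pure whitespace
    return [t.text.lower() for t in nlp(text)
            if not t.is_stop and not t.is_punct and t.text.strip()]

# hastag_tweets_df is the dataframe from the question
all_words = hastag_tweets_df["Tweet"].apply(clean_tokens).explode()
freq_df = (all_words.value_counts()
           .head(10)
           .rename_axis("Words")
           .reset_index(name="freq"))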
Related
I have a dataframe
0 i only need uxy to hit 20 eod to make up for a...
1 oh this isn’t good
2 lads why is my account covered in more red ink...
3 i'm tempted to drop my last 800 into some stup...
4 the sell offs will continue until moral improves.
I want to apply NLP to each comment to identify which ones are positive and which are negative.
Here is what I have:
import pandas as pd
import numpy as np
from nltk.corpus import movie_reviews
from random import shuffle
from nltk import FreqDist
from nltk.corpus import stopwords
import string
from nltk import NaiveBayesClassifier
from nltk import classify
from nltk.tokenize import word_tokenize
df = pd.read_csv("/home/yan/PycharmProjects/pythonProject/comments_binary.csv")
pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)

neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)

stopwords_english = stopwords.words('english')

def bag_of_words(words):
    words_clean = []
    for word in words:
        word = word.lower()
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)
    words_dictionary = dict([word, True] for word in words_clean)
    return words_dictionary

# positive reviews feature set
pos_reviews_set = []
for words in pos_reviews:
    pos_reviews_set.append((bag_of_words(words), 'pos'))

# negative reviews feature set
neg_reviews_set = []
for words in neg_reviews:
    neg_reviews_set.append((bag_of_words(words), 'neg'))

shuffle(pos_reviews_set)
shuffle(neg_reviews_set)

test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]

classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)

custom_review = "I am pretty sure that TSLA will hit 500 today after open"
custom_review_tokens = word_tokenize(custom_review)
custom_review_set = bag_of_words(custom_review_tokens)
print(classifier.classify(custom_review_set)) # Output: pos
I am confused about how to apply the whole pipeline to each row of text and create a separate column with "pos" or "neg" describing each comment.
I tried to create a function:
def my_classification(x):
    return classifier.classify(x)

df["new_column"] = df["text"].apply(my_classification)
But it says AttributeError: 'str' object has no attribute 'copy'
I would highly appreciate your help
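For what it's worth, classifier.classify expects the feature dictionary produced by bag_of_words, not a raw string (NLTK copies the featureset internally, hence the 'copy' error). A minimal sketch of a wrapper, assuming the comment text lives in a "text" column as above:

def my_classification(text):
    # tokenize and build the same bag-of-words features used for training
    tokens = word_tokenize(text)
    return classifier.classify(bag_of_words(tokens))

df["new_column"] = df["text"].apply(my_classification)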
I am trying to remove a list of common words from a set of strings (text) in a Python pandas dataframe. The dataframe's columns look like this:
['Item', 'Label', 'Comment']
I have already removed stop words, but I made a word cloud and there are still some common words left that I want to remove to get better insight into the problem.
This is my current working code, which does well but not well enough:
import re
import nltk
from nltk.corpus import wordnet

# This receives one sentence at a time
# Use a loop (or a lambda) if you want to process a whole dataset
def nlp_preprocess(text, stopwords, lemmatizer, wordnet_map):
    # Remove punctuation
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Remove tags
    text = re.sub("</?.*?>", " <> ", text)
    # Remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)
    # Remove stop words like "and", "is", "a" and "the"
    text = " ".join([word for word in text.split() if word not in stopwords])
    # Find the base word for all words in the sentence
    pos_tagged_text = nltk.pos_tag(text.split())
    text = " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])
    return text

def full_nlp_text_process(df, pandas_params, stopwords, lemmatizer, wordnet_map):
    data = preprocess_dataframe(df, pandas_params)
    nlp_data = data.copy()
    nlp_data["ProComment"] = nlp_data['Comment'].apply(lambda x: nlp_preprocess(x, stopwords, lemmatizer, wordnet_map))
    return data, nlp_data
I know I want to do something similar to this, but I do not know how to fit it in to remove the words, or where it should go (i.e. in the text processing or in the dataframe processing):
fdist2 = nltk.FreqDist(text)
most_list = fdist2.most_common(10)
# Somewhere else
for t in text:
    if t in most_list:
        text.remove(t)
I figured out a new way of doing it and it worked well.
The first method is this:
def find_common_words(df):
    full_text = ""
    for index, row in df.iterrows():
        #print(row['Comment'])
        full_text = full_text + " " + row["ProComment"]

    allWords = nltk.tokenize.word_tokenize(full_text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)
    mostCommon = allWordDist.most_common(10)

    common_words = []
    for item in mostCommon:
        common_words.append(item[0])
    return common_words
And then you will need this method to finish things up.
def remove_common_words(df, common_words):
    for index, row in df.iterrows():
        sentence = row["ProComment"]
        word_tokens = word_tokenize(sentence)
        filtered_sentence = ""
        for w in word_tokens:
            if w not in common_words:
                filtered_sentence = filtered_sentence + " " + w
        # write back with .at, since assigning to the row copy does not update the dataframe
        df.at[index, "ProComment"] = filtered_sentence
This way you take in a dataframe and do all the processing through that.
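A quick usage sketch, assuming the dataframe already carries the preprocessed "ProComment" column:

common_words = find_common_words(df)
remove_common_words(df, common_words)
print(common_words)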
I have created a list of websites, and my goal is to find the most frequently used words (excluding common ones such as "the", "as", "are", etc.) across all of these sites. I set this up to the best of my ability, but I am realizing that my function only returns the words for the first website in my list. I created a loop, but I believe I may have messed something up and am looking for the fix.
My current code is:
from bs4 import BeautifulSoup
import pandas as pd
from nltk.corpus import stopwords
import nltk
import urllib.request
list1 = ["https://en.wikipedia.org/wiki/Netflix", "https://en.wikipedia.org/wiki/Disney+", "https://en.wikipedia.org/wiki/Prime_Video",
"https://en.wikipedia.org/wiki/Apple_TV", "https://en.wikipedia.org/wiki/Hulu", "https://en.wikipedia.org/wiki/Crunchyroll",
"https://en.wikipedia.org/wiki/ESPN%2B", "https://en.wikipedia.org/wiki/HBO_Max", "https://en.wikipedia.org/wiki/Paramount%2B",
"https://en.wikipedia.org/wiki/PBS", "https://en.wikipedia.org/wiki/Peacock_(streaming_service)", "https://en.wikipedia.org/wiki/Discovery%2B",
"https://en.wikipedia.org/wiki/Showtime_(TV_network)", "https://en.wikipedia.org/wiki/Epix", "https://en.wikipedia.org/wiki/Starz",
"https://en.wikipedia.org/wiki/Acorn_TV", "https://en.wikipedia.org/wiki/BritBox", "https://en.wikipedia.org/wiki/Kocowa",
"https://en.wikipedia.org/wiki/Pantelion_Films", "https://en.wikipedia.org/wiki/Spuul"]
freq = [ ]
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    freq1
    print("This is for the website: ", i, "\n", freq1)
The output for freq1 is giving me the most popular 20 words only for the first website, and I am looking for the most popular words across all websites.
Your program is overwriting the list items on this line:
freq1 = freq.most_common(20)
Python gives a nice way to append to a list; see (2).
Sorry, I edited your code; please refer to the lines and comments at (1), (2) and (3):
freq = []
freq_list = []  # added a new list, to avoid conflicts with your variables --(1)
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq_list = freq_list + freq1  # the Python way to add list items --(2)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    #freq1
    print("This is for the website: ", i, "\n", freq1)

df = pd.DataFrame(freq_list, columns=['word', 'frequency'])  # build the dataframe --(3)
df.shape
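If you want a single ranking across all the websites rather than per-site lists, one option (a sketch that reuses the combined df built above) is to group the collected counts by word and sum them:

# aggregate the per-site counts into one overall top-20 list
overall = (df.groupby('word', as_index=False)['frequency']
             .sum()
             .sort_values('frequency', ascending=False)
             .head(20))
print(overall)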
I'm using the spaCy tokenizer to tokenize my data and then build a vocab.
This is my code:
import spacy
nlp = spacy.load("en_core_web_sm")
def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1
            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
But it takes hours to complete since I have more than 800000 sentences.
Is there a faster and better way to achieve this? Thanks.
Update: I tried removing min_freq:
def build_vocab(docs, max_vocab=10000):
    stoi = {'<PAD>': 0, '<UNK>': 1}
    itos = {0: '<PAD>', 1: '<UNK>'}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in stoi:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos
It still takes a long time. Does spaCy have a function to build a vocab like torchtext's .build_vocab?
There are a couple of things you can do to make this faster.
import spacy
from collections import Counter
def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en")  # just the tokenizer
    wc = Counter()
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1

    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq:
            break
        if len(word2id) >= max_vocab:
            break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word
Explanation:
If you only need the tokenizer, you can use spacy.blank
nlp.pipe is fast for lots of texts (less important, and maybe irrelevant with a blank model)
Counter is optimized for exactly this kind of counting task
Another thing: the way you are building your vocab in your initial example, you take the first N words that reach the frequency threshold, not the top N words, which is probably not what you want.
Another thing is that if you're using spaCy you shouldn't build your vocab this way - spaCy has its own built-in vocab class that handles converting tokens to IDs. I guess you might need this mapping for a downstream task or something but look at the vocab docs to see if you can use that instead.
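For reference, a minimal sketch of spaCy's own string-to-ID mapping through the StringStore; note these IDs are stable 64-bit hashes rather than contiguous indices, so they may not be a drop-in replacement if you need 0..N-1 IDs for an embedding layer:

import spacy

nlp = spacy.blank("en")
doc = nlp("oh this isn't good")
for token in doc:
    # the StringStore maps each string to a hash and back
    h = nlp.vocab.strings[token.lower_]
    print(token.lower_, h, nlp.vocab.strings[h])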
I am having some difficulty applying the following code to one column in my dataset:
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])
def f(text):
    pos = ""
    for token in text:
        pos += token.pos_ + " "
    return pos

df['Structure'] = df.Low_Sentences.str.apply(f)
Where Low_Sentences is something like this:
Low_Sentences
A sentence is a set of words that is complete in itself
Seeing a specific word used in a sentence can provide more context and help you better understand proper usage.
Short example: She walks.
But I am getting this error:
AttributeError: 'StringMethods' object has no attribute 'apply'
Can you please tell me how to apply that function in order to get the structure of each row in my dataframe? Thanks
Just remove the str accessor; in pandas you apply directly on the column.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])
def f(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos

df['Structure'] = df.Low_Sentences.apply(f)
FYI: You had forgotten the doc = nlp(text) line!!
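A quick usage sketch with a tiny dataframe built from the example sentences in the question:

import pandas as pd

df = pd.DataFrame({"Low_Sentences": [
    "A sentence is a set of words that is complete in itself",
    "Short example: She walks.",
]})
df['Structure'] = df.Low_Sentences.apply(f)
print(df)
# Structure now holds a space-separated string of POS tags, e.g. "DET NOUN AUX ..."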