Getting sentence structure in a column dataframe - python

I am having some difficulty applying the following code to one column in my dataset:
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def f(text):
    pos = ""
    for token in text:
        pos += token.pos_ + " "
    return pos

df['Structure'] = df.Low_Sentences.str.apply(f)
Where Low_Sentences is something like this:
Low_Sentences
A sentence is a set of words that is complete in itself
Seeing a specific word used in a sentence can provide more context and help you better understand proper usage.
Short example: She walks.
But I am getting this error:
AttributeError: 'StringMethods' object has no attribute 'apply'
Can you please tell me how to apply that function in order to get the structure of each row in my dataframe? Thanks

Just remove the .str accessor; in pandas you call apply directly on the column.
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def f(text):
    doc = nlp(text)
    pos = ""
    for token in doc:
        pos += token.pos_ + " "
    return pos

df['Structure'] = df.Low_Sentences.apply(f)
FYI: You had forgotten the doc = nlp(text) line!!
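If the dataframe is large, nlp.pipe is usually faster than calling nlp() once per row. Here is a minimal sketch of the same idea, assuming the same df with its Low_Sentences column:

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

# Process all rows in one batched pass instead of one nlp() call per row
docs = nlp.pipe(df.Low_Sentences.astype(str))
df['Structure'] = [" ".join(token.pos_ for token in doc) for doc in docs]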

Related

Question about Ranking of Documents using BM25

I'm trying to rank the task sentences of an occupation using BM25. I'm following this tutorial, but I get confused at this part: "Ranking of documents:
Now that we've created our document indexes, we can give it queries and see which documents are the most relevant". I want the queries to be every sentence I have in my corpus column. How can I do that?
!pip install rank_bm25

import pandas as pd
from rank_bm25 import BM25Okapi
import string

corpus = pd.read_excel(r'/content/Job-occ.xlsx')

tokenized_corpus = [doc.split(" ") for doc in corpus['task']]

tokenized_corpus = []
for doc in corpus['task']:
    print(doc)
    doc_tokens = doc.split()
    tokenized_corpus.append(doc_tokens)

bm25 = BM25Okapi(tokenized_corpus)
here is my data
Basically you just need to iterate over your list of documents, for example like this:
import pandas as pd
from rank_bm25 import BM25Okapi
import string

def argsort(seq, reverse):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__, reverse=reverse)

corpus = pd.read_excel(r'Job-occ.xlsx')
tokenized_corpus = [doc.split(" ") for doc in corpus['task']]
bm25 = BM25Okapi(tokenized_corpus)

for doc_as_query in tokenized_corpus:
    scores = bm25.get_scores(doc_as_query)

    # show top 3 (optional):
    print('\nQuery:', ' '.join(doc_as_query))
    top_similar = argsort(scores, True)
    for i in range(3):
        print(' top', i+1, ':', corpus['task'][top_similar[i]])
I added sorting and printing the top 3 similar documents, in case you want this (that's why I added the function argsort).
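As an aside, numpy's argsort can replace the hand-written helper; a minimal sketch under the same assumptions (the tokenized_corpus list and the fitted bm25 object from above):

import numpy as np

for doc_as_query in tokenized_corpus:
    scores = bm25.get_scores(doc_as_query)        # one BM25 score per document
    top_similar = np.argsort(scores)[::-1][:3]    # indices of the 3 highest scores
    print('\nQuery:', ' '.join(doc_as_query))
    for rank, idx in enumerate(top_similar, start=1):
        print(' top', rank, ':', corpus['task'][idx])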

Why does my output come back in a stripped format that cannot be lemmatized/stemmed in Python?

The first step is tokenizing the text from the dataframe using NLTK. Then I apply spelling correction using TextBlob. For this, I convert the output from a tuple to a string. After that, I need to lemmatize/stem it (using NLTK). The problem is that my output comes back in a stripped format, so it cannot be lemmatized/stemmed.
#create a dataframe
import pandas as pd
import nltk

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

#spelling correction
def spell_eng(text):
    text = TextBlob(str(text)).correct()
    #convert from tuple to str
    text = functools.reduce(operator.add, (text))
    return text

df['text3'] = df['text2'].apply(spell_eng)

#lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
Generated output:
Desired output:
text4
--------------
[spell]
[be]
[work,cook,listen]
[study]
I see where the problem is: the dataframe is storing these arrays as strings, so the lemmatization is not working. Also note that it comes from the spell_eng part.
I have written a solution, which is a slight modification for your code.
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = [TextBlob(str(w)).correct() for w in text]           #CHANGE
    #convert from tuple to str
    text = [functools.reduce(operator.add, (w)) for w in text]  #CHANGE
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
df['text4']
Hope these things help.
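As a side note, if you only need plain strings back from TextBlob, a slightly simpler spell_eng is possible. This is just a sketch of an alternative, not part of the original answer, and it assumes text2 holds lists of tokens as above:

from textblob import TextBlob

def spell_eng(text):
    # correct each token and turn the resulting TextBlob back into a plain str
    return [str(TextBlob(w).correct()) for w in text]

df['text3'] = df['text2'].apply(spell_eng)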

How to change the output of lemmatized words that are presented as list comprehension?

I am struggling with my lemmatization approach because the output looks like this:

Row    Lemmatized
1      [i, , b, e, , o, k, a, y, , ...

While I wanted the output to look like the following, so that I can export it to a CSV file and do further analysis:

Row    Lemmatized
1      i be okay
I am using a CSV file to do my analysis and applied the following approach:
Apply part-of-speech tags
Convert the part-of-speech tags to WordNet's format
Apply NLTK's WordNetLemmatizer within a list comprehension
My code is the following:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

df = pd.read_csv(r"C:xxxxx")
rws = df.loc[:, ['MESSAGE']]
rws['pos_tags'] = rws['MESSAGE'].apply(nltk.tag.pos_tag)

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

rws['wordnet_pos'] = rws['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])

wnl = WordNetLemmatizer()
rws['lemmatized'] = rws['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
Can anyone guide me about what am I doing wrong? Thank you in advance.
Got it sorted, guys, it was something very simple (embarrassed). My apologies for not wording it right. I wanted to put each row back into a single string. Therefore, the code I used was:
rws["final"]= rws["lemmatized"].str.join(" ")
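This works because each cell of the lemmatized column holds a Python list, and pandas' .str.join concatenates the elements of each list, just like str.join. A minimal sketch with a made-up column:

import pandas as pd

rws = pd.DataFrame({'lemmatized': [['i', 'be', 'okay'], ['she', 'walk']]})
rws["final"] = rws["lemmatized"].str.join(" ")
print(rws["final"].tolist())   # ['i be okay', 'she walk']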

Python pandas trying to make word count

Hi, I have just discovered the tweepy API. I can create a dataframe with pandas using the tweets object fetched from tweepy. I want to make a word-count df for my tweets. Here's my code:
freq_df = (
    hastag_tweets_df["Tweet"]
    .apply(lambda x: pd.value_counts(x.split(" ")))
    .sum(axis=0)
    .sort_values(ascending=False)
    .reset_index()
    .head(10)
)
freq_df.columns = ["Words", "freq"]
print('FREQ DF\n\n')
print(freq_df)
print('\n\n')
a = freq_df[freq_df.freq > freq_df.freq.mean() + freq_df.freq.std()]
#plotting
fig =a.plot.barh(x = "Words",y = "freq").get_figure()
This does not look the way I want, because the result always starts with an empty token and words like "the":
Words freq
0 301.0
1 the 164.0
So how can I get the desired data, without the empty entry and words like 'the'?
Thank you
You can use the spaCy library to do it. With it you can easily remove words like "the" and "a", known as stop words.
It's easy to install: pip install spacy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# process text using spacy
def process_text(text):
    #Tokenize text
    doc = nlp(text)
    token_list = [t.text for t in doc]

    #remove stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if lexeme.is_stop == False:
            filtered_sentence.append(word)

    # here we return the number of filtered words; you could return the list as well
    return len(filtered_sentence)

df = (
    df
    .assign(words_count=lambda d: d['comment'].apply(lambda c: process_text(c)))
)
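If what you ultimately want is the frequency table itself without the empty tokens and stop words, here is a minimal sketch under the question's assumptions (a hastag_tweets_df dataframe with a "Tweet" column):

from collections import Counter
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

counter = Counter()
for tweet in hastag_tweets_df["Tweet"].astype(str):
    for token in nlp(tweet):
        # skip stop words ("the", "a", ...), punctuation and empty/whitespace tokens
        if not token.is_stop and not token.is_punct and token.text.strip():
            counter[token.text.lower()] += 1

freq_df = pd.DataFrame(counter.most_common(10), columns=["Words", "freq"])
print(freq_df)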

NLTK POS tags extraction, tried key, values but not there yet

I have a list of names which I am POS tagging with NLTK. I use it along with wordsegment, as the names are jumbled up like thisisme.
I have successfully POS tagged these names using a loop; however, I am unable to extract the POS tags. The entire exercise is being done from a CSV.
This is what I want the CSV to look like at the end of the day.
name, length, pos
thisisyou 6 NN, ADJ
My code so far is
import pandas as pd
import nltk
import wordsegment
from wordsegment import segment
from nltk import pos_tag, word_tokenize
from nltk.tag.util import str2tuple

def readdata():
    datafileread = pd.read_csv('data.net.lint.csv')
    domain_names = datafileread.DOMAIN[0:5]
    for domain_name in domain_names:
        seg_words = segment(domain_name)
        postagged = nltk.pos_tag(seg_words)
        limit_names = postagged
        for keys, values in postagged:
            print(values)

readdata()
And I get this result
NN
NN
ADJ
NN
This seems OK, but it is wrong: the POS tags should not each be on a new line. The tags for one name should be joined together, like NN NN.
The print function will insert a newline each time you use it. You need to avoid this. Try it like this:
for domain_name in domain_names:
    seg_words = segment(domain_name)
    postagged = nltk.pos_tag(seg_words)
    tags = ", ".join(t for w, t in postagged)
    print(domain_name, LENGTH, tags)
The join() method returns the POS tags as a single string, separated with ", ". I've just written LENGTH since I have no idea how you got the 6 in your example. Fill in whatever you meant.
PS. You don't need it here, but you can tell print() not to add the final newline like this: print(word, end=" ")
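If the end goal is the CSV with name, length and pos columns from the question, here is a minimal sketch of one way to build it (assumptions: the same data.net.lint.csv with a DOMAIN column, len(domain_name) for the length column since the 6 in the example is unexplained, and a hypothetical output file name domains_pos.csv):

import pandas as pd
import nltk
from wordsegment import segment
# Depending on your wordsegment version, you may also need:
# from wordsegment import load; load()

datafileread = pd.read_csv('data.net.lint.csv')
rows = []
for domain_name in datafileread.DOMAIN[0:5]:
    postagged = nltk.pos_tag(segment(domain_name))
    rows.append({
        'name': domain_name,
        'length': len(domain_name),                 # assumption: length of the raw name
        'pos': ", ".join(t for w, t in postagged),  # e.g. "NN, NN"
    })

pd.DataFrame(rows).to_csv('domains_pos.csv', index=False)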
