I want to make a simple spell corrector system and I have a dataframe like this:
incorrect_word, correct_word
scoohl,school
watn,want
frienf,friend
"I watn to go scoohl"
I want to correct this sentence by replacing each incorrect word from the "incorrect_word" column with its counterpart in the "correct_word" column (if it exists).
How can I do this?
Here is the sample code I wrote; it does not work.
text = " شما رفتین مدرسه شون گفتین دستاشون رو بشورن"  # Persian, roughly: "You went to their school and told them to wash their hands"
# if "دستاشون رو" in text:
#     print("yes")

from hazm import *
import pandas as pd
from src.config.config import *

# letters = word_tokenize(text)
# for text in word_tokenize(text):
#     print(text)

df = pd.read_excel(FILL_DATA).astype(str)  # columns include 'informal' and 'formal1'
text = str(text)
for idx, item in enumerate(df['informal']):
    if item in text:
        text = text.replace(item, df['formal1'].iloc[idx])
        # item = item.replace(df['informal'].iloc[idx], df['formal1'].iloc[idx])
print(text)
I would do it like this:
import pandas as pd

df = pd.DataFrame([['scoohl', 'school'], ['watn', 'want'], ['frienf', 'friend']],
                  columns=['incorrect_word', 'correct_word'])
df.index = df['incorrect_word']
df.drop(columns=['incorrect_word'], inplace=True)

text_to_correct = "I watn to go scoohl"
words = text_to_correct.split(' ')
for c, w in enumerate(words):
    if w in df.index:
        words[c] = df.at[w, 'correct_word']
words = ' '.join(words)
words
Result:
'I want to go school'
Hello, this is very basic Python; you can do it this way:
df['incorrect'] = [x for x in df['Correct'] if len(x) > 2]
You should read up on lambda, list comprehensions, apply and map.
Thank you.
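For reference, here is a minimal sketch of the kind of comprehension/mapping approach hinted at above, assuming the same incorrect_word/correct_word columns as in the question (the dict lookup and the generator expression are my own illustration, not the answer's code):

import pandas as pd

df = pd.DataFrame([['scoohl', 'school'], ['watn', 'want'], ['frienf', 'friend']],
                  columns=['incorrect_word', 'correct_word'])

# Build a plain dict once, then map every token of the sentence through it.
corrections = dict(zip(df['incorrect_word'], df['correct_word']))

text = "I watn to go scoohl"
corrected = " ".join(corrections.get(w, w) for w in text.split())
print(corrected)  # I want to go school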
The first step is tokenizing the text from the dataframe using NLTK. Then I apply spelling correction using TextBlob; for this, I convert the output back to a string. After that, I need to lemmatize/stem (using NLTK). The problem is that my output comes back as a string rather than a list of words, so it cannot be lemmatized/stemmed.
# create a dataframe
import functools
import operator

import pandas as pd
import nltk
from textblob import TextBlob

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

# tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = TextBlob(str(text)).correct()
    # convert from TextBlob back to str
    text = functools.reduce(operator.add, (text))
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
Generated output:
Desired output:
text4
--------------
[spell]
[be]
[work,cook,listen]
[study]
I figured out where the problem is: the dataframe is storing these lists as strings, so the lemmatization is not working. Also note that this comes from the spell_eng part.
I have written a solution, which is a slight modification of your code.
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

# tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = [TextBlob(str(w)).correct() for w in text]            # CHANGE: correct word by word
    # convert from tuple to str
    text = [functools.reduce(operator.add, (w)) for w in text]   # CHANGE
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
df['text4']
Hope these things help.
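As a follow-up, a simpler way to turn each corrected TextBlob back into a plain string is to call str() on it instead of reducing over its contents. A minimal sketch of that variant, assuming the same text2 column as above:

from textblob import TextBlob

def spell_eng(words):
    # correct() returns a TextBlob; str() converts it back to a plain string,
    # so the lemmatizer downstream receives a list of ordinary words.
    return [str(TextBlob(w).correct()) for w in words]

# df['text3'] = df['text2'].apply(spell_eng)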
I ran the code below on about 20k rows. The code works and I am able to get the output, but it runs very slowly: it took almost 45 minutes to finish. Can someone please suggest an appropriate solution?
Code:
import numpy as np
import pandas as pd
import re

def demoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

df = pd.read_csv("data.csv")
print(df['Body'])

tweets = df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["", ""], regex=True)
tweets[u'Body'] = tweets[u'Body'].astype(str)
tweets[u'Body'] = tweets[u'Body'].apply(lambda x: demoji(x))

# Preprocessing: del RT #blablabla
tweets['tweetos'] = ''

# add tweetos first part
for i in range(len(tweets['Body'])):
    try:
        tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]
    except AttributeError:
        tweets['tweetos'][i] = 'other'

# Preprocessing tweetos: select tweetos that contain '#'
for i in range(len(tweets['Body'])):
    if tweets['tweetos'].str.contains('#')[i] == False:
        tweets['tweetos'][i] = 'other'

# remove URLs, RTs, and twitter handles
for i in range(len(tweets['Body'])):
    tweets['Body'][i] = " ".join([word for word in tweets['Body'][i].split()
                                  if 'http' not in word and '#' not in word and '<' not in word])
This code removes special characters such as \n and Twitter mentions; basically, text cleaning.
Whenever you work with Pandas and start iterating over dataframe content, there's a good chance that your approach is lacking. Try to stick to the native Pandas tools/methods, which are highly optimized! Also, watch out for repetition: in your code you do some things over and over again. E.g. in every iteration of
the first loop you split df.Body (tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]), only to pick one item from the resulting frame;
the second loop you evaluate a complete column of the frame (tweets['tweetos'].str.contains('#')), only to pick one item from the result.
Your code could probably look like this:
import pandas as pd
import re

df = pd.read_csv("data.csv")
tweets = df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["", ""], regex=True)
# Why not tweets = df.replace(r'\\t|\\n|\\r|\t|/n|/r|w/|\n|w/|Quote::', ',') ?

re_emoji = re.compile(...)  # As in your code
tweets.Body = tweets.Body.astype(str).str.replace(re_emoji, '', regex=True)  # Is the astype(str) necessary?

body_split = tweets.Body.str.split()
tweets['tweetos'] = body_split.map(lambda l: 'other' if not l else l[0])
tweets.loc[~tweets.tweetos.str.contains('#'), 'tweetos'] = 'other'

re_discard = re.compile(r'http|#|<')
tweets.Body = (body_split.map(lambda l: [w for w in l if not re_discard.search(w)])
               .str.join(' '))
Be aware that I don't have any real insight into the data you're working with - you haven't provided a sample. So there might be bugs in the code I've proposed.
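To make the repetition point concrete, here is a small self-contained timing sketch on synthetic data (not the real data.csv), comparing re-splitting the whole column inside a loop with splitting it once; the dataframe contents here are made up for illustration:

import time
import pandas as pd

tweets = pd.DataFrame({"Body": ["RT #user some tweet text"] * 5000})

# Pattern from the question: the whole column is split again on every iteration.
start = time.perf_counter()
first_words = [tweets["Body"].str.split(" ")[i][0] for i in range(len(tweets))]
print(f"split inside the loop: {time.perf_counter() - start:.2f}s")

# Vectorized pattern: split once, then pick the first token of each row.
start = time.perf_counter()
first_words = tweets["Body"].str.split(" ").str[0]
print(f"split once:            {time.perf_counter() - start:.2f}s")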
I am trying to remove a list of common words from a set of strings (text) in a pandas dataframe. The dataframe has these columns:
['Item', 'Label', 'Comment']
I have already removed stop words, but I made a word cloud and there are still some common words that I want to remove to get better insight into this problem.
This is my current working code; it does a decent job, but not good enough:
import re
import nltk
from nltk.corpus import wordnet

# This receives one sentence at a time.
# Use a loop if you want to process a dataset, or a lambda.
def nlp_preprocess(text, stopwords, lemmatizer, wordnet_map):
    # Remove punctuation
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Remove tags
    text = re.sub("</?.*?>", " <> ", text)
    # Remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)
    # Remove stop words like "and", "is", "a", "the"
    text = " ".join([word for word in text.split() if word not in stopwords])
    # Find the base word for all words in the sentence
    pos_tagged_text = nltk.pos_tag(text.split())
    text = " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
                     for word, pos in pos_tagged_text])
    return text

def full_nlp_text_process(df, pandas_params, stopwords, lemmatizer, wordnet_map):
    data = preprocess_dataframe(df, pandas_params)
    nlp_data = data.copy()
    nlp_data["ProComment"] = nlp_data['Comment'].apply(
        lambda x: nlp_preprocess(x, stopwords, lemmatizer, wordnet_map))
    return data, nlp_data
I know I want to do something similar to this, but I do not know how to fit it in to remove the words, or where it should go (i.e. in the text processing or the dataframe processing):
fdist2 = nltk.FreqDist(text)
most_list = fdist2.most_common(10)

# Somewhere else
for t in text:
    if t in most_list:
        text.remove(t)
I figured out a new way of doing it and it worked well.
The first method is this:
def find_common_words(df):
    full_text = ""
    for index, row in df.iterrows():
        # print(row['Comment'])
        full_text = full_text + " " + row["ProComment"]
    allWords = nltk.tokenize.word_tokenize(full_text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)
    mostCommon = allWordDist.most_common(10)
    common_words = []
    for item in mostCommon:
        common_words.append(item[0])
    return common_words
And then you will need this method to finish things up.
def remove_common_words(df, common_words):
    for index, row in df.iterrows():
        sentence = row["ProComment"]
        word_tokens = word_tokenize(sentence)
        filtered_sentence = ""
        for w in word_tokens:
            if w not in common_words:
                filtered_sentence = filtered_sentence + " " + w
        # write back with df.at so the change actually lands in the dataframe
        # (assigning to the iterrows row only modifies a copy)
        df.at[index, "ProComment"] = filtered_sentence.strip()
This way you take in a dataframe and do all the processing through that.
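For comparison, here is an alternative sketch (not the author's code) that computes the top words once with collections.Counter and strips them with a single apply instead of iterrows; it assumes the same ProComment column:

from collections import Counter
import pandas as pd

def remove_top_words(df, column="ProComment", n=10):
    # Count every word once across the whole column and pick the n most common...
    counts = Counter(" ".join(df[column]).lower().split())
    common = {word for word, _ in counts.most_common(n)}
    # ...then drop them from each row in one pass.
    df[column] = df[column].apply(
        lambda s: " ".join(w for w in s.split() if w.lower() not in common)
    )
    return df

# Usage (hypothetical dataframe):
# nlp_data = remove_top_words(nlp_data, column="ProComment", n=10)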
I have created a list of websites and my goal is to find the most frequently used words (excluding common ones such as "the", "as", "are", etc.) across all of these sites. I set this up to the best of my abilities, but I am realizing that my function only provides the words for the first website in my list. I created a loop, but I believe I may have messed something up and am looking for the fix.
My current code is:
from bs4 import BeautifulSoup
import pandas as pd
from nltk.corpus import stopwords
import nltk
import urllib.request

list1 = ["https://en.wikipedia.org/wiki/Netflix", "https://en.wikipedia.org/wiki/Disney+", "https://en.wikipedia.org/wiki/Prime_Video",
         "https://en.wikipedia.org/wiki/Apple_TV", "https://en.wikipedia.org/wiki/Hulu", "https://en.wikipedia.org/wiki/Crunchyroll",
         "https://en.wikipedia.org/wiki/ESPN%2B", "https://en.wikipedia.org/wiki/HBO_Max", "https://en.wikipedia.org/wiki/Paramount%2B",
         "https://en.wikipedia.org/wiki/PBS", "https://en.wikipedia.org/wiki/Peacock_(streaming_service)", "https://en.wikipedia.org/wiki/Discovery%2B",
         "https://en.wikipedia.org/wiki/Showtime_(TV_network)", "https://en.wikipedia.org/wiki/Epix", "https://en.wikipedia.org/wiki/Starz",
         "https://en.wikipedia.org/wiki/Acorn_TV", "https://en.wikipedia.org/wiki/BritBox", "https://en.wikipedia.org/wiki/Kocowa",
         "https://en.wikipedia.org/wiki/Pantelion_Films", "https://en.wikipedia.org/wiki/Spuul"]

freq = []
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    freq1
    print("This is for the website: ", i, "\n", freq1)
The output for freq1 only gives me the 20 most popular words for the first website, and I am looking for the most popular words across all websites.
Your program is overwriting the list items at the line:
freq1 = freq.most_common(20)
Python gives a nice way to append to a list; refer to (2).
Sorry, I edited your code; please refer to the lines and comments at (1), (2) and (3).
freq = []
freq_list = []  # added a new structure for the list, to avoid conflict with your variables --(1)
for i in list1:
    response = urllib.request.urlopen(i)
    html = response.read()
    soup = BeautifulSoup(html, 'html5lib')
    text = soup.get_text(strip=True)
    tokens = [t for t in text.split()]
    sr = stopwords.words('english')
    clean_tokens = tokens[:]
    for token in tokens:
        if token in stopwords.words('english'):
            clean_tokens.remove(token)
    freq = nltk.FreqDist(clean_tokens)
    freq1 = freq.most_common(20)
    freq_list = freq_list + freq1  # the Python way to add list items --(2)
    freq[i] = pd.DataFrame(freq.items(), columns=['word', 'frequency'])
    # freq1
    print("This is for the website: ", i, "\n", freq1)

df = pd.DataFrame(freq_list, columns=['word', 'frequency'])  # build the dataframe --(3)
df.shape
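If the goal is the top words across all sites combined (rather than each site's own top 20 stacked together), a hedged follow-up sketch is to sum one Counter per site and build a single dataframe at the end; it reuses list1 from the question:

from collections import Counter
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

total = Counter()
english_stopwords = set(stopwords.words('english'))  # build the set once, not per token
for url in list1:                                    # list1 as defined in the question
    html = urllib.request.urlopen(url).read()
    text = BeautifulSoup(html, 'html5lib').get_text(strip=True)
    total += Counter(t for t in text.split() if t not in english_stopwords)

overall = pd.DataFrame(total.most_common(20), columns=['word', 'frequency'])
print(overall)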
Hi, I just discovered the tweepy API; I can create a dataframe with pandas from the tweet objects fetched via tweepy. I want to make a word-count dataframe for my tweets. Here's my code:
freq_df = hastag_tweets_df["Tweet"].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis =0).sort_values(ascending=False).reset_index().head(10)
freq_df.columns = ["Words","freq"]
print('FREQ DF\n\n')
print(freq_df)
print('\n\n')
a = freq_df[freq_df.freq > freq_df.freq.mean() + freq_df.freq.std()]
#plotting
fig =a.plot.barh(x = "Words",y = "freq").get_figure()
This does not look the way I want, because it always starts with an empty token and a word like "the":
  Words   freq
0         301.0
1   the   164.0
So how can I get the desired data, without the empty line and without words like 'the'?
Thank you
You can use the spaCy library to do this. With it you can easily remove words like "the" and "a", known as stop words.
It is easy to install: pip install spacy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")

# process text using spaCy
def process_text(text):
    # tokenize text
    doc = nlp(text)
    token_list = [t.text for t in doc]
    # remove stop words
    filtered_sentence = []
    for word in token_list:
        lexeme = nlp.vocab[word]
        if lexeme.is_stop == False:
            filtered_sentence.append(word)
    # here we return the number of filtered words; you can return the list as well
    return len(filtered_sentence)

df = (
    df
    .assign(words_count=lambda d: d['comment'].apply(lambda c: process_text(c)))
)
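Tying this back to the original freq_df, here is a hedged sketch that filters out stop words, punctuation and whitespace tokens before counting, so the empty row and words like "the" never show up; it assumes the same hastag_tweets_df["Tweet"] column as in the question:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def content_words(text):
    # keep only tokens that are not stop words, punctuation, or plain whitespace
    return [t.text for t in nlp(text) if not (t.is_stop or t.is_punct or t.is_space)]

freq_df = (
    hastag_tweets_df["Tweet"]                       # column from the question
    .apply(lambda x: pd.Series(content_words(x)).value_counts())
    .sum(axis=0)
    .sort_values(ascending=False)
    .head(10)
    .rename_axis("Words")
    .reset_index(name="freq")
)
print(freq_df)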