How do I remove the same words in the dataframe named df3?
My code below doesn't seem to work...
df3 = pd.DataFrame(np.array(c3), columns=["content"]).drop_duplicates()

def text_processing_cat3(df3):
    '''=== Removal of common words ==='''
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[:10]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(
        lambda x: " ".join(word for word in x.split() if word not in freq))
    '''=== Removal of rare words ==='''
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[-10:]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(
        lambda x: " ".join(word for word in x.split() if word not in freq))
    return df3

print(text_processing_cat3(df3))
The sample output for the above is:
cat_id content
0 3 male malay man nkda walking stick home ambulant ws void deck able walk bendemeer mall home bus stop away adli stays daughter family husband none image image image order cancellation note ct brain duplicate image
1 3 yo chinese man nkda phx hypertension hyperlipidemia benign hyperplasia open cholecystectomy gallbladder empyema distal gastrectomy pud penetrating aortic
Please help check and improve the code above. Thank you!!
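For reference, one way to preview which words that function would drop (a rough sketch using collections.Counter, assuming df3 is built as above) is:
from collections import Counter

# count every word across the whole 'content' column
word_counts = Counter(" ".join(df3['content']).split())
common = [w for w, _ in word_counts.most_common(10)]     # 10 most frequent words
rare = [w for w, _ in word_counts.most_common()[-10:]]    # 10 least frequent words
print("common:", common)
print("rare:", rare)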
I have a df dataframe like this:
product name description0 description1 description2 description3
A plane flies air passengers wings
B car rolls road NaN NaN
C boat floats sea passengers NaN
What I want to do is check, for each value in the description columns, whether it can be found in a txt file.
Let's say my test.txt file is:
He flies to London then crosses the sea to reach New-York.
The result would look like this:
product name description0 description1 description2 description3 Match
A plane flies air passengers wings Match
B car rolls road NaN NaN No match
C boat floats sea passengers NaN Match
I know the main structure, but I'm a bit lost for the rest:
with open ("test.txt", 'r') as searchfile:
for line in searchfile:
print line
if re.search() in line:
print(match)
You can search the input text using str.find() since you are searching for string literals. re.search() seems to be overkill.
A quick-and-dirty solution using .apply(axis=1):
Data
# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
Code
input_text_lower = input_text.lower()
def search(row):
    for el in row:  # description 0,1,2,3
        # skip non-string contents; return True as soon as a search succeeds
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False
df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)
Result
print(df)
product name description0 description1 description2 description3 Match
0 A plane flies air passengers wings True
1 B car rolls road NaN NaN False
2 C boat floats sea passengers NaN True
Note
Word boundaries, punctuation, and hyphens are not considered in the original problem. In real cases, additional preprocessing steps are likely to be required. This is outside the scope of the original question.
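If whole-word matching ever matters, one possible refinement (a sketch, not part of the answer above) is to search with word boundaries instead of str.find():
import re

def search_whole_words(row):
    for el in row:
        # \b ensures e.g. "sea" does not match inside "seaside"
        if isinstance(el, str) and re.search(r'\b' + re.escape(el.lower()) + r'\b', input_text_lower):
            return True
    return False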
I made a bag-of-words model, and when I printed it out, the output doesn't quite make sense.
This is the code I used to initialise the bag of words:
#creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(x)
a = headline_bow.transform(x)
b = headline_bow.get_feature_names()
print(a)
This is a sample of the output that comes out from the bag of words model:
(0, 837) 1
(0, 1496) 1
(0, 1952) 1
(0, 2610) 1
From my understanding, "(0, 837) 1" means that in the first list passed through the model, the 837th word in that list appears once. This makes no sense, because when I print
x[0] I get this:
Four ways Bob Corker skewered Donald Trump
There are clearly not 837 words here, so I'm confused as to what's going on.
Here is a sample of what x is: (a bunch of headlines)
['Four ways Bob Corker skewered Donald Trump'
"Linklater's war veteran comedy speaks to modern America, says star"
'Trump’s Fight With Corker Jeopardizes His Legislative Agenda' ...
'Ron Paul on Trump, Anarchism & the AltRight'
'China to accept overseas trial data in bid to speed up drug approvals'
'Vice President Mike Pence Leaves NFL Game Because of Anti-American Protests']
Here is the rest of my code:
data = pd.read_csv("/Users/amanpuranik/Desktop/fake-news-detection/data.csv")
data = data[['Headline', "Label"]]
x = np.array(data['Headline'])
print(x[0])
y = np.array(data["Label"])
# tokenization of the data here
headline_vector = []
for headline in x:
    headline_vector.append(word_tokenize(headline))
print(headline_vector)
stopwords = set(stopwords.words('english'))
#removing stopwords at this part
filtered = [[word for word in sentence if word not in stopwords]
for sentence in headline_vector]
#print(filtered)
stemmed2 = [[stem(word) for word in headline] for headline in filtered]
#print(stemmed2)
#lowercase
lower = [[word.lower() for word in headline] for headline in stemmed2] #start here
#convert lower into a list of strings
lower_sentences = [" ".join(x) for x in lower]
#organising
articles = []
for headline in lower:
    articles.append(headline)
#creating the bag of words model
headline_bow = CountVectorizer()
headline_bow.fit(lower_sentences)
a = headline_bow.transform(lower_sentences)
print(a)
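For what it's worth, the pair in that sparse output is (document index, index into the fitted vocabulary), not a position within a single headline. A quick way to check this (a small sketch, assuming headline_bow and a from the code above and a vocabulary of at least 838 words) is:
vocab = headline_bow.get_feature_names()  # every distinct word seen across all headlines
print(len(vocab))   # vocabulary size, typically far larger than any single headline
print(vocab[837])   # the word that column 837 stands for
print(a[0])         # the non-zero vocabulary columns for the first headline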
Here I have a pandas Series named 'traindata':
0 Published: 4:53AM Friday August 29, 2014 Sourc...
1 8 Have your say\n\n\nPlaying low-level club c...
2 Rohit Shetty has now turned producer. But the ...
3 A TV reporter in Serbia almost lost her job be...
4 THE HAGUE -- Tony de Brum was 9 years old in 1...
5 Australian TV cameraman Harry Burton was kille...
6 President Barack Obama sharply rebuked protest...
7 The car displaying the DIE FOR SYRIA! sticker....
8 \nIf you've ever been, you know that seeing th...
9 \nThe former executive director of JBWere has ...
10 Waterloo Road actor Joe Slater has revealed hi...
...
Name: traindata, Length: 2284, dtype: object
What I want to do is replace the Series values with the stemmed sentences.
My thought is to build a new Series and put the stemmed sentences in.
My code is below:
from nltk.stem.porter import PorterStemmer
stem_word_data = np.zeros([2284,1])
ps = PorterStemmer()
for i in range(0, len(traindata)):
    tst = word_tokenize(traindata[i])
    for word in tst:
        word = ps.stem(word)
        stem_word_data[i] = word
and then an error occurs:
ValueError: could not convert string to float: 'publish'
Does anyone know how to fix this error, or have a better idea of how to replace the Series values with the stemmed sentences? Thanks.
You can use apply on a series and avoid writing loops.
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer
## initialise stemmer class
pst = PorterStemmer()
## sample data frame
df = pd.DataFrame({'senten': ['I am not dancing', 'You are playing']})
## apply here
df['senten'] = df['senten'].apply(word_tokenize)
df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))
print(df)
senten
0 I am not danc
1 you are play
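Applied to the Series from the question, the same idea would look something like this (a sketch, assuming traindata and the imports above):
traindata = traindata.apply(word_tokenize)
traindata = traindata.apply(lambda words: ' '.join(pst.stem(w) for w in words))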
I have two lists of strings, A and B. For each string in A, I'd like to compare it to every string in B and select the most similar match. The comparison function that I'm using is a custom cosine similarity measure that I found on this question. Here is how it works:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]
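For reference, the function is called on two raw strings and returns a score between 0 and 1, higher meaning more similar (a minimal example call, with no particular output claimed):
print(cosine_sim('a little bird', 'a little bird chirps'))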
My issue is that with somewhat long lists (500-1000 items), execution starts to take five or ten minutes. Here's an example using some dummy text:
import requests
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text
A, B = sample[:int(len(sample)/2)], sample[int(len(sample)/2):]
A, B = list(map(''.join, zip(*[iter(A)]*100))), list(map(''.join, zip(*[iter(B)]*100)))
Now that I have two lists, each with ~500 strings (of 100 characters each), I compute the similarities and take the top one. This is done by taking a string from A, iterating through B, sorting by cosine_sim score, and then taking the last element, and then repeating for all elements in A:
matches = [(a, sorted([[b, cosine_sim(a, b)] for b in B],
                      key=lambda x: x[1])[-1])
           for a in A]
The output is a list of matches where each item contains both strings and their calculated similarity score. That final line took 7 minutes to run, though. I'm wondering if there are inefficiencies in my process that are slowing it down, or if there's just a lot to compute (500*500 = 250,000 comparisons, plus sorting for the best match 500 times)?
The biggest issue probably is that you are computing tfidf for every pair of documents (document here merely meaning your unit of text - this could be a tweet, a sentence, a scientific paper, or a book). Also, you shouldn't cook up your own similarity measure if one already exists. Finally, sklearn has a pairwise_distances routine that does what you want and is optimized. Putting this all together, here is a sample script:
import requests
import nltk, string
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text.split('\n\n') # I'm splitting the document by "paragraphs" since it is unclear what you actually want
stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
doc_vectors = vectorizer.fit_transform(sample)
# cosine *distance* is 1 - similarity, so convert before looking for the closest pair
similarities = 1 - pairwise_distances(doc_vectors, metric='cosine')
np.fill_diagonal(similarities, -1)  # a document is trivially identical to itself; exclude it
row_idx = list(enumerate(similarities.argmax(axis=1)))
sorted_pairs = sorted(row_idx, key=lambda x: similarities[x[0], x[1]], reverse=True)
# most similar documents:
i1, i2 = sorted_pairs[0]  # indexes of the most similar pair
print(sample[i1])
print("=="*50)
print(sample[i2])
There were 99 documents in my sample list, and this ran pretty much instantaneously after the download was complete. Also, the output:
Art party taxidermy locavore 3 wolf moon occupy. Tote bag twee tacos
listicle, butcher single-origin coffee raclette gentrify raw denim
helvetica kale chips shaman williamsburg man braid. Poke normcore lomo
health goth waistcoat kogi. Af next level banh mi, deep v locavore
asymmetrical snackwave chillwave. Subway tile viral flexitarian pok
pok vegan, cardigan health goth venmo artisan. Iceland next level twee
adaptogen, dreamcatcher paleo lyft. Selfies shoreditch microdosing
vape, knausgaard hot chicken pitchfork typewriter polaroid lyft
skateboard ethical distillery. Farm-to-table blue bottle yr artisan
wolf try-hard vegan paleo knausgaard deep v salvia ugh offal
snackwave. Succulents taxidermy cornhole wayfarers butcher, street art
polaroid jean shorts williamsburg la croix tumblr raw denim. Hot
chicken health goth taiyaki truffaut pop-up iceland shoreditch
fingerstache.
====================================================================================================
Organic microdosing keytar thundercats chambray, cray raclette. Seitan
irony raclette chia, cornhole YOLO stumptown. Gluten-free palo santo
beard chia. Whatever bushwick stumptown seitan cred quinoa. Small
batch selfies portland, cardigan you probably haven't heard of them
shabby chic yr four dollar toast flexitarian palo santo beard offal
migas. Kinfolk pour-over glossier, hammock poutine pinterest coloring
book kitsch adaptogen wayfarers +1 tattooed lomo yuccie vice. Plaid
fixie portland, letterpress knausgaard sartorial live-edge. Austin
adaptogen YOLO cloud bread wayfarers cliche hammock banjo. Sustainable
organic air plant mustache.
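The original question asked for the best match in B for each string in A; the same vectorised idea could handle that roughly as follows (a sketch, assuming the A and B lists from the question and the vectorizer defined above):
# fit one vocabulary over both lists, then compare every a to every b at once
all_vectors = vectorizer.fit_transform(A + B)
a_vectors, b_vectors = all_vectors[:len(A)], all_vectors[len(A):]
dist = pairwise_distances(a_vectors, b_vectors, metric='cosine')
best = dist.argmin(axis=1)  # index of the closest b for each a
matches = [(A[i], B[j], 1 - dist[i, j]) for i, j in enumerate(best)]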
I just want to ask: how can I find words from an array in my string?
I need to make a filter that will find the words I saved in my array in the text that a user types into a text box on my web page.
I need to have 30+ words in an array or list or something.
Then the user types text into the text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
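If the keywords could ever contain regex metacharacters, a slightly safer variant (an assumption, not part of the answer above) escapes them first:
r = re.compile('|'.join(r'\b{}\b'.format(re.escape(w)) for w in words), flags=re.I)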
Solution 1 uses the regex approach, which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print 'Solution 1 output:'
for keyWord in keyWords:
    print re.findall(keyWord, dataString)
#---------------------- SOLUTION 2
print '\nSolution 2 output:'
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1)  # drop the trailing -1 recorded when find() stops matching
    print indexes
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = s.split(" ")
i = 0
for x in s1:
    if x.strip(',') in words:  # strip trailing punctuation so 'word2,' still matches
        print x
        i += 1
print "count is " + str(i)
output
'word1'
'word2'
count is 2