Would it be possible to take specific n-grams from the whole list of them and look for them in a list of sentences?
For example:
I have the following sentences (from a dataframe column):
example = ['Mary had a little lamb. Jack went up the hill' ,
'Jack went to the beach' ,
'i woke up suddenly' ,
'it was a really bad dream...']
and n-grams (bigrams) obtained from
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]
which produces the frequency of each n-gram.
I would like to select either the most frequent bi-grams or a bi-gram selected manually, and find it within the example list above.
So, let's say that the most frequent bi-gram is Jack went, how could I look for it in the example above?
Also, if I want to look, not at the most frequent bi-grams but at the hill/beach in the example, how could I do it?
To select the rows that have the most frequent ngrams in it, you can do:
df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool)]
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
But if two n-grams have the max frequency, you would get all the rows where both are present.
If you want the top (or bottom) 3 n-grams and all the rows that contain any of them:
top = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
2 i woke up suddenly
3 it was a really bad dream...
# here, that is every row of the example
hill = 3  # the 3 least frequent n-grams (argsort puts the lowest counts first)
print(df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :hill].any(axis=1)])
Example
1 Jack went to the beach
3 it was a really bad dream...
Finally, if you want a specific n-gram:
ng = 'it was'
# get_feature_names_out() replaces get_feature_names(), which recent scikit-learn releases removed
df.loc[mat.toarray()[:, word_v.get_feature_names_out() == ng].any(axis=1)]
Example
3 it was a really bad dream...
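For readers who want to see the selection logic without scikit-learn, the same idea can be sketched with the standard library only. This is a minimal sketch, not the vectorizer-based method above: `tokenize`, `bigrams` and `rows_with` are hypothetical helper names, and the token pattern is an assumption that imitates CountVectorizer's default (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

example = ['Mary had a little lamb. Jack went up the hill',
           'Jack went to the beach',
           'i woke up suddenly',
           'it was a really bad dream...']

def tokenize(sentence):
    # Assumption: lowercase tokens of 2+ word characters, similar to CountVectorizer's default
    return re.findall(r'\b\w\w+\b', sentence.lower())

def bigrams(sentence):
    # Join each pair of neighbouring tokens into one bigram string
    tokens = tokenize(sentence)
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

def rows_with(ngram, sentences):
    # Indexes of the sentences that contain the given bigram
    return [i for i, s in enumerate(sentences) if ngram in bigrams(s)]

counts = Counter(bg for s in example for bg in bigrams(s))
most_frequent, freq = counts.most_common(1)[0]

print(most_frequent, freq)                 # jack went 2
print(rows_with(most_frequent, example))   # [0, 1]
print(rows_with('it was', example))        # [3]
```

A manually chosen bigram goes through the same `rows_with` call as the most frequent one, so both cases from the question share one code path.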
Sorry if it's a simple question, I'm new to Python. I have a string (an array of words) and a two-dimensional array of words, and I want to replace the words one by one, something like the following:
str="Jim is a good person"
# and will convert to:
parts=['Jim','is','a','good','person']
and a two-dimensional array in which each inner array holds the words that can replace the element with the same index in parts. For example, something like this:
replacement=[['john','Nock','Kati'],
['were','was','are'],
['a','an'],
['bad','perfect','awesome'],
['cat','human','dog']]
result can be something like this:
1: nike is a good person
2: John are an bad human
3: Kati were a perfect cat
and so on
Actually, I'm going to replace each word of a sentence with some possible words and then do some calculation on the new sentence. I need to generate all possible replacements.
Many thanks.
itertools.product might be the best choice for creating all of the combinations that you're looking for.
Let's use your replacement list as a starting point for what could work. A way to get all the combinations you're looking for could look something like this
from itertools import product
word_options=[['john','Nock','Kati'],
['were','was','are'],
['a','an'],
['bad','perfect','awesome'],
['cat','human','dog']]
for option in product(*word_options):
    new_sentence = ' '.join(option)
    # do calculation on new_sentence
Each option that is being iterated through is a tuple, where each element is a single choice from each of the individual sub-lists of the original 2D list. Then the ' '.join(option) will combine the individual strings into a single string where the words are separated by a space. If you were to just print new_sentence, the output would look as follows.
john were a bad cat
john were a bad human
john were a bad dog
john were a perfect cat
john were a perfect human
john were a perfect dog
.
.
.
Kati are an perfect cat
Kati are an perfect human
Kati are an perfect dog
Kati are an awesome cat
Kati are an awesome human
Kati are an awesome dog
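Because product grows multiplicatively, it can help to know how many sentences to expect before running a calculation on all of them. A small sketch, assuming Python 3.8+ for math.prod; the names `total` and `sentences` are only for illustration.

```python
from itertools import product
from math import prod

word_options = [['john', 'Nock', 'Kati'],
                ['were', 'was', 'are'],
                ['a', 'an'],
                ['bad', 'perfect', 'awesome'],
                ['cat', 'human', 'dog']]

# The number of combinations is the product of the sub-list lengths: 3 * 3 * 2 * 3 * 3
total = prod(len(options) for options in word_options)

# Materialize every sentence; for large inputs, iterate over product() lazily instead
sentences = [' '.join(option) for option in product(*word_options)]

print(total)            # 162
print(sentences[0])     # john were a bad cat
print(sentences[-1])    # Kati are an awesome dog
```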
Hello, I have a dataset where I want to match my keywords against the location. The problem is that the location "Afghanistan", "Kabul" or "Helmund" appears in my dataset in over 150 combinations, including spelling mistakes, capitalization, and having the city or town attached to its name. What I want to do is create a separate column that returns the value 1 if any of the character sequences "afg", "Afg", "kab" or "helm" is contained in the location. I am not sure if upper or lower case makes a difference.
For instance, there are hundreds of location combinations like these: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it works if the location matches the phrase exactly, but there is too much variation to write every exception down.
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
# how to make a column that shows rows with a certain keyword
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the rows whose value is 1 (the column holds integers, not the string '1')
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that every location matching "Afg", "afg", "kab", "Kab", "kund" or "Kund" is flagged in the column "keyword_solution".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
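If the goal is only the 0/1 column from the question, a case-insensitive substring search may be enough. The following is a sketch under that assumption: `fragments` and `keyword_flag` are hypothetical names, and the sample locations are the ones quoted in the question plus one made-up non-match.

```python
import re

# Short fragments instead of full names, so spelling variants still match
fragments = ['afg', 'kab', 'helm', 'kund']

# One alternation pattern; re.IGNORECASE makes 'Afg' and 'afg' equivalent
pattern = re.compile('|'.join(re.escape(f) for f in fragments), re.IGNORECASE)

def keyword_flag(location):
    # 1 if any fragment occurs anywhere in the location string, else 0
    return 1 if pattern.search(location) else 0

locations = ['Jegdalak, Afghanistan', 'Afghanistan,Ghazni♥', 'Kabul/Afghanistan', 'Paris']
print([keyword_flag(loc) for loc in locations])   # [1, 1, 1, 0]
```

With pandas, the same function could presumably be applied to the column from the question as taleban_2['location'].apply(keyword_flag), provided the column contains only strings.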
If one particular word is not followed by another particular word, skip it. Here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
I want to print and count all the words between john and dead, death or died.
If john is not followed by any of the words died, dead or death, skip it and start again from the next john.
My code:
x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas and other special symbols with spaces
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len([word for word in i.split()]))
my output:
got shot
2
with his john got killed or
6
with his wife
3
output which i want:
got shot
2
got killed or
3
with his wife
3
I don't know where I am making a mistake.
This is just a sample input; I have to check 20,000 inputs at a time.
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len([word for word in i.split()]))
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches 0 or more characters as long as john does not occur within the match.
I also suggest using word boundaries to make it match complete words:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
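To make the word-boundary variant easier to try, here is a self-contained Python 3 sketch that wraps it in a function and returns the phrases together with their word counts. The function name `words_between` and the variable `cleaned` are hypothetical; the pattern and the sample string come from this thread.

```python
import re

x = ('john got shot dead. john with his .... ? , john got killed or died in 1990. '
     'john with his wife dead or died')

# Replace punctuation with spaces, as in the question
cleaned = re.sub(r'[^\w]', ' ', x)

# After "john", consume one character at a time but never step onto another "john";
# stop lazily at the first dead/died/death
pattern = re.compile(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)')

def words_between(text):
    # Return (phrase, word_count) pairs, one per match
    return [(m.strip(), len(m.split())) for m in pattern.findall(text)]

print(words_between(cleaned))
# [('got shot', 2), ('got killed or', 3), ('with his wife', 3)]
```

Compiling the pattern once, as done here, is a reasonable choice when the same expression has to run over many inputs.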
I assume you want to start over when another john follows in your string before dead|died|death occurs.
Then, you can split your string by the word john and start matching on the resulting parts afterwards:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    m = re.match('(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))
yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
I.e., there are no multiple whitespaces left in a sequence. You manage this by splitting by a whitespace later, but I feel this is a bit cleaner.
I just want to ask how I can find words from an array in my string.
I need a filter that finds the words I saved in my array within the text a user types into a text box on my website.
I need to have 30+ words in an array or list or something.
Then the user types text in the text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
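A variation on the snippet above, sketched under the assumption that the word list may contain characters with a special meaning in regular expressions. The helper `build_filter` and the list name `banned` are hypothetical.

```python
import re

def build_filter(words):
    # re.escape keeps characters such as '.' or '+' in a word from being read as regex syntax
    alternatives = '|'.join(re.escape(w) for w in words)
    # A single pair of word boundaries around the whole alternation
    return re.compile(r'\b(?:%s)\b' % alternatives, flags=re.IGNORECASE)

banned = ['word1', 'word2', 'word4']
text = 'Word1 qwerty word2, word3 word44'

found = build_filter(banned).findall(text)
print(found)        # ['Word1', 'word2']
print(len(found))   # 2
```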
Solution 1 uses the regex approach, which returns all instances of the keyword found in the data. Solution 2 returns the indexes of all instances of the keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))
#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1)  # drop the trailing -1 that ended the loop
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
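The indexes from Solution 2 can also be collected with re.finditer, which avoids stepping through the string one position at a time. A minimal sketch; `find_indexes` and the short `sample` string are made up for illustration and are not the data used above.

```python
import re

def find_indexes(keyword, text):
    # Start offset of every non-overlapping occurrence of keyword in text
    return [m.start() for m in re.finditer(re.escape(keyword), text)]

sample = 'rule the stars, then rule the seed'
print(find_indexes('rule', sample))    # [0, 21]
print(find_indexes('stars', sample))   # [9]
print(find_indexes('Life', sample))    # []
```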
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
count = 0
for x in s.split():
    x = x.strip('.,;:!?')  # drop surrounding punctuation so 'word2,' matches 'word2'
    if x in words:
        print(x)
        count += 1
print("count is " + str(count))
output
word1
word2
count is 2
I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:
Text A: One day a man walked over the hill and saw the sun
Text B: One day a man walked over a hill and saw the sun
Text C: One week a woman looked over a hill and saw the sun
In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:
When adding Text to database, check for existing values in database
If values are seen to be very similar then do not add
If values are seen to be different enough, then do add
Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.
Can anyone tell me how I might attempt this in Python?
Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.
This nifty routine takes two strings and calculates a similarity index in the range [0,1].
Quick Demo
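>>> import difflib, itertools
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     print("Similarity Index {}".format(round(difflib.SequenceMatcher(None, a, b).ratio(), 12)))
...     print('-' * 80)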
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
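Turning the similarity index into the add-or-skip rule from the question could look like the sketch below. The threshold of 0.9 is an assumption chosen so that Text B (about 0.96 against Text A) is rejected and Text C (about 0.83) is accepted; `is_near_duplicate`, `add_if_new` and the in-memory list `store` are hypothetical stand-ins for the real database check.

```python
import difflib

def is_near_duplicate(candidate, existing, threshold=0.9):
    # True if candidate is at least `threshold` similar to any stored text
    return any(difflib.SequenceMatcher(None, candidate, text).ratio() >= threshold
               for text in existing)

def add_if_new(candidate, existing, threshold=0.9):
    # Append candidate only when nothing similar enough is already stored
    if is_near_duplicate(candidate, existing, threshold):
        return False
    existing.append(candidate)
    return True

store = []
text_a = 'One day a man walked over the hill and saw the sun'
text_b = 'One day a man walked over a hill and saw the sun'
text_c = 'One week a woman looked over a hill and saw the sun'
results = [add_if_new(t, store) for t in (text_a, text_b, text_c)]
print(results)      # [True, False, True]
```

Comparing every new text against every stored one grows linearly with the size of the store, so this sketch is only suitable for small collections.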
There are a couple of python libraries that can help you with that. Have a look at this Q:.
The Levenshtein distance is a common algorithm. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.
This link will give you an excellent overview:
A primitive way of doing this... but you could iterate through strings, comparing the equivalent sequential word in another string and you get a ratio of matches to fails:
>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12
So in this example, you can see that 11/12 words matched. You can then set a pass/fail level.
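Wrapped in a function, the word-by-word comparison could be sketched as follows. `word_match_ratio` is a hypothetical name and the 0.9 pass level is an assumption. Note that a single inserted or dropped word shifts every later position, so this measure is much stricter than a sequence matcher.

```python
def word_match_ratio(first, second):
    # Compare words position by position; zip stops at the shorter sentence
    pairs = list(zip(first.split(), second.split()))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'
ratio = word_match_ratio(aa, bb)
print(ratio >= 0.9)   # True, since 11 of 12 words match
```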
In Python or any other language, hashes are the easiest way to remove duplicates.
You can maintain a table of already added hashes.
When you add another item, just check whether its hash is present or not.
Use hashlib for it.
Adding hashlib usage example
import hashlib
m1 = hashlib.md5()
m1.update(b" the spammish repetition")  # update() expects bytes in Python 3
print(m1.hexdigest())
m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())
Ans
d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
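The "table of already added hashes" could be sketched like this. Keep in mind that a hash only detects texts that are identical after normalization, so it will not catch the "extremely similar" case from the question. The names `seen_hashes`, `normalize` and `add_if_unseen` are hypothetical, and the choice of SHA-256 and of the normalization rule are assumptions.

```python
import hashlib

seen_hashes = set()

def normalize(text):
    # Assumption: case and extra whitespace should not count as a difference
    return ' '.join(text.lower().split())

def add_if_unseen(text):
    # Hash the normalized text and record it; returns False for a repeat
    digest = hashlib.sha256(normalize(text).encode('utf-8')).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

print(add_if_unseen('the spammish repetition'))    # True
print(add_if_unseen('The  spammish repetition'))   # False
print(add_if_unseen('the spammish'))               # True
```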