Comparing two blocks of text in Python

I have a system where information can come from various sources. I want to make sure I don't add exact (or extremely similar) pieces of information. Here is an example:
Text A: One day a man walked over the hill and saw the sun
Text B: One day a man walked over a hill and saw the sun
Text C: One week a woman looked over a hill and saw the sun
In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:
When adding Text to database, check for existing values in database
If values are seen to be very similar then do not add
If values are seen to be different enough, then do add
Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.
Can anyone tell me how I might attempt this in Python?

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.
This nifty routine takes two strings and calculates a similarity index in the range [0, 1].
Quick Demo
import difflib
import itertools

st = ['One day a man walked over the hill and saw the sun',
      'One week a woman looked over a hill and saw the sun',
      'One day a man walked over a hill and saw the sun']

for a, b in itertools.product(st, st):
    print "Text 1 {}".format(a)
    print "Text 2 {}".format(b)
    print "Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio())
    print '-' * 80
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
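To turn this into the add/skip logic from the question, compare each new text against what is already stored and only insert it when every similarity score stays below a chosen threshold. A minimal sketch (the 0.9 cutoff and the existing list standing in for your database are assumptions to illustrate the idea):
import difflib

existing = []  # stand-in for the values already in your database

def add_if_different(text, threshold=0.9):
    # skip the text if it is too similar to anything already stored
    for old in existing:
        if difflib.SequenceMatcher(None, old, text).ratio() >= threshold:
            return False
    existing.append(text)
    return True

print add_if_different('One day a man walked over the hill and saw the sun')   # True
print add_if_different('One day a man walked over a hill and saw the sun')     # False (ratio ~0.96)
print add_if_different('One week a woman looked over a hill and saw the sun')  # True (ratio ~0.83)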

There are a couple of Python libraries that can help you with that.
The Levenshtein distance is a common algorithm for this. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.
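For illustration, here is a minimal pure-Python Levenshtein distance (the standard dynamic-programming formulation, not tied to any particular library):
def levenshtein(s, t):
    # edit distance: minimum number of insertions, deletions and
    # substitutions needed to turn s into t
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('the hill', 'a hill'))  # 3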

A primitive way of doing this would be to iterate through the strings, comparing each word against the word in the same position in the other string; that gives you a ratio of matches to failures:
>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12
So in this example, 11 of 12 words matched. You can then set a pass/fail level. Note that zip stops at the shorter of the two word lists, so texts of different lengths need extra handling (see the sketch below).
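A small wrapper along these lines, dividing by the longer word count so that length differences count against the score (the function name is just illustrative):
def word_match_ratio(a, b):
    # fraction of positions where the two texts have the same word
    words_a, words_b = a.split(), b.split()
    matches = sum(x == y for x, y in zip(words_a, words_b))
    return matches / float(max(len(words_a), len(words_b)))

print(word_match_ratio('One day a man walked over the hill and saw the sun',
                       'One day a man walked over a hill and saw the sun'))  # ~0.9167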

In Python, or any other language, hashes are the easiest way to remove exact duplicates.
You can maintain a table of already-added hashes.
When you add another piece of text, just check whether its hash is already present.
Use hashlib for this.
A hashlib usage example:
import hashlib
m1 = hashlib.md5()
m1.update(" the spammish repetition")
print m1.hexdigest()
m2 = hashlib.md5()
m2.update(" the spammish")
print m2.hexdigest()
m3 = hashlib.md5()
m3.update(" the spammish repetition")
print m3.hexdigest()
Output:
d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
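Putting that together, a minimal sketch of hash-based de-duplication (the seen set stands in for a table of stored hashes; note this only catches byte-for-byte identical texts, not near-duplicates):
import hashlib

seen = set()  # stand-in for a DB table of stored hashes

def is_new(text):
    # hash the text and remember it; report whether it was seen before
    h = hashlib.md5(text).hexdigest()
    if h in seen:
        return False
    seen.add(h)
    return True

print is_new(" the spammish repetition")  # True
print is_new(" the spammish repetition")  # False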

Related

Look for n-grams in texts

Would it be possible to take a specific n-gram from the whole list of them and look for it in a list of sentences?
For example:
I have the following sentences (from a dataframe column):
example = ['Mary had a little lamb. Jack went up the hill',
           'Jack went to the beach',
           'i woke up suddenly',
           'it was a really bad dream...']
and the n-grams (bigrams) obtained from
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]
which generates the output of the n-grams frequency.
I would like to select
the most frequent bi-grams
a bi-gram selected manually
from the example list above.
So, let's say that the most frequent bi-gram is Jack went, how could I look for it in the example above?
Also, if I want to look, not at the most frequent bi-grams but at the hill/beach in the example, how could I do it?
To select the rows that have the most frequent ngrams in it, you can do:
df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool)]
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
but if two ngrams have the max frequency, you would get all the rows where both are present.
If you want the top 3 ngrams and all the rows that contain any of them:
top = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
2 i woke up suddenly
3 it was a really bad dream...
# here all the rows of the example are returned
hill = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :hill].any(axis=1)])
Example
1 Jack went to the beach
3 it was a really bad dream...
Finally, if you want a specific ngram:
ng = 'it was'
df.loc[mat.toarray()[:, np.array(word_v.get_feature_names())==ng].astype(bool)]
Example
3 it was a really bad dream...
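Putting the question and answer together, here is a self-contained sketch of the whole flow. I use .any(axis=1) to collapse the column mask into a row mask, and the bigram 'jack went' is lowercase because CountVectorizer lowercases by default:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Example': ['Mary had a little lamb. Jack went up the hill',
                               'Jack went to the beach',
                               'i woke up suddenly',
                               'it was a really bad dream...']})

word_v = CountVectorizer(ngram_range=(2, 2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = mat.toarray().sum(axis=0)

# rows containing the most frequent bigram(s)
print(df.loc[mat.toarray()[:, frequencies == frequencies.max()].any(axis=1)])

# rows containing a manually chosen bigram
ng = 'jack went'
cols = np.array(word_v.get_feature_names()) == ng  # get_feature_names_out() in newer scikit-learn
print(df.loc[mat.toarray()[:, cols].any(axis=1)])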

Regular Expressions: How to append matches from raw string text into a list

Hello, I have some messy text that I am unable to process in any good way, and I want to match all zip codes (5-digit numbers) in the raw string and append them to a list. My string looks something like this:
string = '''
January 2020
Zip Code
Current Month
Sales Breakdown
(by type)
Last Month Last Year Year-to-Date
95608
Carmichael
95610
Citrus Heights
95621
Citrus Heights
95624
Elk Grove
95626
Elverta
95628
Fair Oaks
95630
Folsom
95632
Galt
95638
Herald
95641
Isleton
95655
Mather
95660
North Highlands
95662
Orangevale
Total Sales
43 REO Sales 0 45
40 43
Median Sales Price $417,000
$0 $410,000 $400,000
$417,000
'''
It can be done with re.findall and the regular expression \b\d{5}\b or even just \d{5}. Let's see an example:
import re
# `string` is the same messy text defined in the question above
regex = r'\b\d{5}\b'
zip_codes = re.findall(regex, string)
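For the sample text above, zip_codes should come out as:
['95608', '95610', '95621', '95624', '95626', '95628', '95630', '95632', '95638', '95641', '95655', '95660', '95662']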
Then you can get each code from zip_codes. I recommend you read the re documentation and the Regular Expression HOWTO. There are interesting tools for writing and testing regexes, such as Regex101.
I also recommend that next time, you first investigate a bit by yourself and try to do what you want; then, if you have an issue, ask about that specific issue. The help pages How do I ask a good question? and How to create a Minimal, Reproducible Example may help you write a good question.

Matching content, creating a new column

Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, capitalization differences, and city or town names attached. What I want to do is create a separate column that returns the value 1 if any of the substrings "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it is good if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords = ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']

# how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that all results go into the column "keyword_solution", matching either "Afg" or "afg", "kab" or "Kab", "kund" or "Kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
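For this particular task, a shorter pandas-native alternative is str.contains with a joined pattern; note this is a sketch of a different approach rather than a drop-in replacement, since it matches the keywords as substrings anywhere in the sentence:
# case-insensitive substring match across all keywords
df['location'] = df['sentences'].str.contains('|'.join(keywords), case=False)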

Python, find words from array in string

I just want to ask how I can find words from an array in my string.
I need a filter that will find the words I saved in my array in the text a user types into a text window on my web page.
I need to have 30+ words in an array or list or something.
Then the user types text in a text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
Solution 1 uses the regex approach, which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print 'Solution 1 output:'
for keyWord in keyWords:
    print re.findall(keyWord, dataString)

#---------------------- SOLUTION 2
print '\nSolution 2 output:'
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index = indexFound + 1  # continue searching after the last match
    indexes.pop(-1)  # drop the trailing -1 sentinel
    print indexes
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
Try
import re

words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = re.split(r'\W+', s)  # split on non-word characters so 'word2,' matches as 'word2'
i = 0
for x in s1:
    if x in words:
        print x
        i += 1
print "count is " + str(i)
output
word1
word2
count is 2

Python: Finding The Longest/Shortest Sentence In A Random Paragraph?

I am using Python 2.7 and need 2 functions to find the longest and shortest sentence (in terms of word count) in a random paragraph. For example, if I choose to put in this paragraph:
"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
The output for this should be 36 and 16 with 36 meaning there are 36 words in the longest sentence and 16 words in the shortest sentence.
def MaxMinWords(paragraph):
    numWords = [len(sentence.split()) for sentence in paragraph.split('.')]
    return max(numWords), min(numWords)
EDIT : As many have pointed out in the comments, this solution is far from robust. The point of this snippet is to simply serve as a pointer to the OP.
You need a way to split the paragraph into sentences and to count words in a sentence. You could use the nltk package for both:
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count
EDIT: As has been mentioned in the comments below, programmatically determining what constitutes the sentences in a paragraph is quite a complex task. However, given the example you provided, I have outlined a nice start to solving your problem below.
First, we want to tokenize the paragraph into sentences. We do this by splitting the text on every occurrence of ". " (a period followed by a space). This returns a list of strings, each of which is a sentence.
We then want to break each sentence into its corresponding list of words. Then, using this list of lists, we want the sentence (represented as a list of words) whose length is a maximum and the sentence whose length is a minimum. Consider the following code:
par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
# split paragraph into sentences
sentences = par.split(". ")
# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]
# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)
# get shortest word and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)
print longest_sen_len
print shortest_sen_len
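For the example paragraph, this prints 36 and 16, matching the expected longest and shortest word counts.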
