How to extract non-matching text from two documents in Python

Let's say I have two strings:
a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
I am calculating their similarity score using:
from difflib import SequenceMatcher
s = SequenceMatcher(lambda x: x == " ", a, b)
print(s.ratio())
Now I want to print the non-matching sentences from both strings, like this:
a = 'I love cooking.'
b = 'I used to drink a lot.'
Any suggestions on what module or approach I can use to do that? I saw the difflib module (https://pymotw.com/2/difflib/), but it prefixes its output with markers (+, -, !, ...), and I don't want the output in that format.

It is a very simple script, but I hope it gives you an idea of how to do it:
a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'
a = a.split('.')
b = b.split('.')
ca = len(a)
cb = len(b)
if ca > cb:
    l = cb
else:
    l = ca
c = 0
while c < l:
    if a[c].upper() != b[c].upper():
        print(b[c] + '.')
    c = c + 1

Use difflib. You can easily post-process the output of difflib.Differ to strip off the first two characters of each unit and convert them to any format you want. Or you can work with the alignments returned by SequenceMatcher.get_matching_blocks and generate your own output.
Here's how you might do it. If that's not what you want, edit your question to provide a less simplistic example of comparison and the output format you need.
import difflib

# list1 and list2 are the two documents, already split into sentences/lines
differ = difflib.Differ()
for line in differ.compare(list1, list2):
    if line.startswith("-"):
        print("a=" + line[2:])
    elif line.startswith("+"):
        print("b=" + line[2:])
    # else: just ignore the line
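If you would rather take the SequenceMatcher alignment route mentioned above, here is a hedged sketch using get_opcodes (the counterpart of get_matching_blocks). It splits on '.' as a simplification and ignores casing for the comparison, matching the .upper() check in the first answer:

import difflib

a = 'I am Sam. I love cooking.'
b = 'I am sam. I used to drink a lot.'

# Work on whole sentences so differences come back sentence-by-sentence.
sents_a = [s.strip() + '.' for s in a.split('.') if s.strip()]
sents_b = [s.strip() + '.' for s in b.split('.') if s.strip()]

# Compare case-insensitively but keep the original casing for printing.
sm = difflib.SequenceMatcher(None,
                             [s.lower() for s in sents_a],
                             [s.lower() for s in sents_b])
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != 'equal':          # 'replace', 'delete' or 'insert'
        for s in sents_a[i1:i2]:
            print("a = " + s)
        for s in sents_b[j1:j2]:
            print("b = " + s)
# a = I love cooking.
# b = I used to drink a lot.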

Related

Using regular expressions in python to extract location mentions in a sentence

I am writing code in Python to extract the name of a road, street, or highway. For example, given a sentence like "There is an accident along Uhuru Highway", I want my code to extract the name of the highway mentioned. I have written the code below.
sentence="there is an accident along uhuru highway"
listw = [word for word in sentence.lower().split()]
for i in range(len(listw)):
    if listw[i] == "highway":
        print(listw[i-1] + " " + listw[i])
I can achieve this, but my code is not optimized. I am thinking of using regular expressions. Any help, please?
'uhuru highway' can be found as follows:
import re
m = re.search(r'\S+ highway', sentence) # non-white-space followed by ' highway'
print(m.group())
# 'uhuru highway'
If the location you want to extract will always have highway after it, you can use:
>>> sentence = "there is an accident along uhuru highway"
>>> a = re.search(r'.* ([\w\s\d\-\_]+) highway', sentence)
>>> print(a.group(1))
uhuru
You can do the following without using regexes:
sentence.split("highway")[0].strip().split(' ')[-1]
First, split on "highway". You'll get:
['there is an accident along uhuru ', '']
And now you can easily extract the last word from the first part.
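If you later need to handle several road types, not just "highway", one possible generalization is a keyword alternation. This is only a sketch: the keyword list and the second sentence below are illustrative assumptions.

import re

sentence = "there is an accident along uhuru highway near moi avenue"

# Hypothetical keyword list; extend it with whatever road types you expect.
pattern = re.compile(r'(\w+)\s+(highway|road|street|avenue)', re.IGNORECASE)
for name, kind in pattern.findall(sentence):
    print(name, kind)
# uhuru highway
# moi avenue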

Shorter way to code in Python

I am a Python beginner. The following code does exactly what I want, but it looks a little clumsy because of the three for loops. Can somebody show me a smarter/shorter way to achieve it? Maybe a single function, or parallelizing the for loops.
import re
from collections import Counter

def getWordListAndCounts(text):
    words = []
    for t in text:
        for tt in t:
            for ttt in re.split(r"\s+", str(tt)):
                words.append(str(ttt))
    return Counter(words)
text = [['I like Apple' , 'I also like Google']]
getWordListAndCounts(text)
First, remove the redundant outer list (it removes one level of looping):
Since there is no need to store a temporary result in a list, a generator is the preferable and more efficient way.
Check this one-line approach:
text = ['I like Apple' , 'I also like Google']
print(Counter(ttt for t in text for ttt in re.split(r"\s+", str(t))))
Use meaningful variable names; t, tt and ttt don't help the code's readability.
Why not use "for phrase in text" then "for word in phrase"?
Why are you using doubly nested lists? Unless the data already arrives in this format when you read it, I would suggest not doing this.
import re
from collections import Counter

def getWordListAndCounts(text):
    return Counter(re.split(r'\s+', str([' '.join(x) for x in text][0])))

text = [['I like Apple', 'I also like Google']]
print(getWordListAndCounts(text))
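If the nesting can contain more than one sublist, indexing [0] silently drops the rest. A hedged alternative is to flatten with itertools.chain (the function name here is my own):

import re
from collections import Counter
from itertools import chain

def get_word_counts(text):
    # Flatten one level of nesting, then split every phrase on whitespace.
    return Counter(word
                   for phrase in chain.from_iterable(text)
                   for word in re.split(r'\s+', phrase))

text = [['I like Apple', 'I also like Google']]
print(get_word_counts(text))
# Counter({'I': 2, 'like': 2, 'Apple': 1, 'also': 1, 'Google': 1})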

Function that inserts words into text

I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version that inserts the word "like" after every third word of the text?
The outcome should be like this:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution=' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but it is hopefully easy to understand:
words = split text into words
number of words processed = 0
for each word in words:
    output word
    number of words processed += 1
    if number of words processed is divisible by 3 then:
        output "like"
Let us know if you have questions.
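For reference, that pseudocode translates fairly directly into Python. This is only a sketch, using the question's sample text and its hedging name:

def hedging(text):
    words = text.split()
    out = []
    for count, word in enumerate(words, start=1):
        out.append(word)
        if count % 3 == 0:      # after every third word, insert "like"
            out.append("like")
    return ' '.join(out)

text = ("All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act towards "
        "one another in a spirit of brotherhood.")
print(hedging(text))
# All human beings like are born free like and equal in like ...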
You could go with something like this:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])

Python Replacing Strings with Dictionary values

Based on the given input:
I can do waaaaaaaaaaaaay better :DDDD!!!! I am sooooooooo exicted about it :))) Good !!
Desired output:
I can do way/LNG better :D/LNG !/LNG I am so/LNG exicted about it :)/LNG Good !/LNG
Challenges:
"better" vs. "soooooooooo": we need to keep the first one as is, but shorten the second.
For the second, we also need to add a tag (/LNG), as the lengthening might carry some importance as intensification for subjectivity and sentiment.
Problem: I get the error message "unbalanced parentheses".
Any ideas?
My code is:
import re
lengWords = {} # a dictionary of lengthened words
def removeDuplicates(corpus):
    data = (open(corpus, 'r').read()).split()
    myString = " ".join(data)
    for word in data:
        for chr in word:
            countChr = word.count(chr)
            if countChr >= 3:
                lengWords[word] = word + "/LNG"
                lengWords[word] = re.sub(r'([A-Za-z])\1+', r'\1', lengWords[word])
                lengWords[word] = re.sub(r'([\'\!\~\.\?\,\.,\),\(])\1+', r'\1', lengWords[word])
        for k, v in lengWords.items():
            if k == word:
                re.sub(word, v, myString)
    return myString
It's not the perfect solution, but I don't have time to refine it now; I just wanted to get you started with an easy approach:
s = "I can do waaaaaaaaaaaaay better :DDDD!!!! I am sooooooooo exicted about it :))) Good !!"
re.sub(r'(.)(\1{2,})',r'\1/LNG',s)
>> 'I can do wa/LNGy better :D/LNG!/LNG I am so/LNG exicted about it :)/LNG Good !!'
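As for the "unbalanced parentheses" error in the original code: re.sub(word, v, myString) treats word as a regular expression, so a token like :))) is parsed as unbalanced groups. Escaping the word first should avoid that; this is a minimal sketch of the fix, not a rewrite of the whole function:

import re

myString = "I am sooooooooo exicted about it :)))"
word = ":)))"
replacement = ":)/LNG"

# re.escape neutralises the parentheses so the token is matched literally,
# and the result must be reassigned (re.sub does not modify in place).
myString = re.sub(re.escape(word), replacement, myString)
print(myString)   # I am sooooooooo exicted about it :)/LNG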

What is a good strategy to group similar words?

Say I have a list of movie names with misspellings and small variations like this:
"Pirates of the Caribbean: The Curse of the Black Pearl"
"Pirates of the carribean"
"Pirates of the Caribbean: Dead Man's Chest"
"Pirates of the Caribbean trilogy"
"Pirates of the Caribbean"
"Pirates Of The Carribean"
How do I group or find such sets of words, preferably using python and/or redis?
Have a look at "fuzzy matching". There are some great tools in the thread below that calculate similarities between strings.
I'm especially fond of the difflib module:
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
You might notice that similar strings share a large common subsequence, for example:
"Bla bla bLa" and "Bla bla bRa" => the common subsequence is "Bla bla ba" (notice the third word)
To find the common subsequence you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between very similar strings is small, and it grows as the strings become more different): http://en.wikipedia.org/wiki/Levenshtein_distance
For faster performance you may also try to adapt the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex
After calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to define the number of clusters). If you don't know the number of clusters in advance, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of distinct movie titles + 1 (for totally misspelled strings).
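As a rough illustration of the clustering step, here is a hedged sketch that uses difflib's ratio as the similarity measure and a greedy single pass instead of k-means; the 0.5 threshold is an assumption you would need to tune on real data:

from difflib import SequenceMatcher

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]

def similarity(a, b):
    # Case-insensitive ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

groups = []
for title in titles:
    # Attach the title to the first group whose representative is close enough.
    for group in groups:
        if similarity(title, group[0]) > 0.5:
            group.append(title)
            break
    else:
        groups.append([title])

for group in groups:
    print(group)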
I believe there are in fact two distinct problems.
The first is spelling correction. You can find a Python implementation here:
http://norvig.com/spell-correct.html
The second is more functional. Here is what I'd do after the spelling correction: I would make a relation function.
related(sentence1, sentence2) if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than (the, what, is, etc.). You can take a look at the TF/IDF system to determine whether two documents are related using their words. Just googling a bit, I found this:
https://code.google.com/p/tfidf/
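To make the "rare common words" idea concrete, here is a hedged, self-contained TF-IDF sketch with whitespace tokenization and no smoothing (real libraries do this more carefully); the sample documents are made up:

import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell sharply today",
]

def tfidf_vectors(docs):
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in tokenized for word in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Rare words (low df) get a high idf weight, common ones near zero.
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # related: they share 'the' and 'cat'
print(cosine(vecs[0], vecs[2]))   # unrelated: 0.0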
To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this:
def dosearch(terms, searchtype, case, adddir, files=[]):
    found = []
    if files != None:
        titlesrch = re.compile('<title>.*</title>')
        for file in files:
            title = ""
            if not (file.lower().endswith("html") or file.lower().endswith("htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp != None:
                title = filecontents.strip()[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents)
            filecontents = filecontents.lstrip()
            filecontents = filecontents.rstrip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found
Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/
Regards,
Max
One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
Levenshtein distance is commonly used to determine the similarity of two strings; this should help you group strings which differ only by small spelling errors.
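For instance, a small sketch of that pre-processing step (the normalize name is my own):

import re
import string

def normalize(title):
    # Lowercase, strip punctuation, collapse runs of whitespace.
    title = title.lower()
    title = title.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', title).strip()

print(normalize("Pirates Of The  Carribean!"))
# pirates of the carribean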
