Find matching phrases and words in two strings in Python

Using Python, what would be the most efficient way to extract common phrases or words from two given strings?
For example,
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
Would return:
["a","time","there","was a very","called Jack"]
How would one go about doing this efficiently (in my case I would need to do this over thousands of 1000-word documents)?

You can split each string, then intersect the sets.
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))
Result
{'a', 'very', 'Jack', 'time', 'was', 'called'}
Note this only matches individual words. You would have to be more specific about what you consider a "phrase". The longest consecutive matching substring? That could get more complicated.

In natural language processing, you usually extract common patterns and sequences from sentences using n-grams.
In python, you can use the excellent NLTK module for that.
For counting and finding the most common, you can use collections.Counter.
Here's an example for 2-grams:
from nltk.util import ngrams
from collections import Counter
from itertools import chain
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
n = 2
ngrams1 = ngrams(string1.split(" "), n)
ngrams2 = ngrams(string2.split(" "), n)
counter = Counter(chain(ngrams1, ngrams2))  # count occurrences of each n-gram
print([k for k, v in counter.items() if v > 1])  # print all n-grams that occur more than once
output:
[('called', 'Jack'), ('was', 'a'), ('a', 'very')]
output with n=3:
[('was', 'a', 'very')]
output with n=1 (printing k[0] instead of k, to unwrap the 1-tuples):
['Jack', 'a', 'was', 'time', 'called', 'very']
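
To collect shared phrases of several lengths in one pass, here is a minimal sketch that generalizes the above (the helper common_ngrams is my own name, not part of NLTK); it intersects the n-gram sets per n, which, unlike the Counter above, will not flag repeats that occur inside a single string:
from nltk.util import ngrams

def common_ngrams(s1, s2, max_n=4):
    # Collect every n-gram (1 <= n <= max_n) that appears in both strings.
    words1, words2 = s1.split(), s2.split()
    common = []
    for n in range(1, max_n + 1):
        shared = set(ngrams(words1, n)) & set(ngrams(words2, n))
        common.extend(' '.join(gram) for gram in shared)
    return common

print(common_ngrams(string1, string2))  # using string1 and string2 from above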

This is a classic dynamic programming problem. All you need to do is build a suffix tree for string1, with words instead of letters (which is the usual formulation). Here is an illustrative example of a suffix tree.
1. Label all nodes in your tree as s1.
2. Insert all suffixes of string2 one by one.
3. All nodes that the suffixes in step 2 pass through are labeled s2.
4. Any new nodes created in step 2 are also labeled s2.
In the final tree, the path label of every node labeled both s1 and s2 is a common substring.
This algorithm is succinctly explained in this lecture note.
For two strings of lengths n and m, the suffix tree construction takes O(max(n,m)), and all the matching substrings (in your case, words or phrases) can be searched in O(#matches).
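If you would rather not implement a suffix tree yourself, here is a minimal sketch of a substitute technique using the standard library's difflib on word lists; get_matching_blocks() returns the consecutive matching runs, i.e. the common words and phrases (note this is not the suffix-tree algorithm described above, and it only finds one set of non-overlapping matches):
from difflib import SequenceMatcher

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"

words1, words2 = string1.split(), string2.split()
matcher = SequenceMatcher(None, words1, words2)

# get_matching_blocks() yields Match(a, b, size) triples, ending with a
# zero-size sentinel, so filter that out.
phrases = [' '.join(words1[m.a:m.a + m.size])
           for m in matcher.get_matching_blocks() if m.size > 0]
print(phrases)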

A couple of years later, but here is another way I tried, using Counter:
Input:
from collections import Counter
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
string1 += ' ' + string2
string1 = string1.split()
count = Counter(string1)
tag_count = []
for n, c in count.most_common(10):
    dics = {'tag': n, 'count': c}
    tag_count.append(dics)
Output:
[{'tag': 'a', 'count': 4},
 {'tag': 'very', 'count': 3},
 {'tag': 'time', 'count': 2},
 {'tag': 'was', 'count': 2},
 {'tag': 'called', 'count': 2},
 {'tag': 'Jack', 'count': 2},
 {'tag': 'once', 'count': 1},
 {'tag': 'upon', 'count': 1},
 {'tag': 'there', 'count': 1},
 {'tag': 'large', 'count': 1}]
Hopefully it will be useful for someone :)

Related

Extracting set of tokens from list of strings

I have a list of strings and I want to extract all tokens into one set of tokens, not a list of sets. I need all the tokens merged into a single set.
My sentences are stored as a list of strings in "sentences"
So if I try:
words = set([])
a=set(sentences[1].split())
b=set(sentences[2].split())
a.union(b)
I get the contents of a and b in one set, like this, which is what I'm searching for:
{',', '.', '2.252', '35-1/7', '37-year-old', 'B', 'Blood', 'Fred', 'G4', 'Grauman', 'O+', 'P3-5', 'pregnancy', 'product', 'rubella', 'surface', 'the', 'to', 'type', 'week', 'woman'}
But with list comprehension
words = set()
[words.union(set(sent.split())) for sent in sentences]
The output is a list of sets, like this
[{'.', 'Care', 'He', 'Intensive', 'Neonatal'}, {'.', '2.252', '35-1/7', '37-year-old', 'Fred', 'G4', 'Grauman'}]
Is there a way to get what I need with some compact line of code, like a list comprehension?
====
Well, I just did it; after the list comprehension for "words":
a = set()
a.union(*words)  # returns the union of all the sets (union() does not modify a in place)
If your sentences are in strings you can just join them and split them again.
set(" ".join(sentences).split())
turns ['A short sentence', 'A second sentence']
into {'A', 'second', 'sentence', 'short'}
How about doing:
set(' '.join(sentences).split())
Or you could try to use reduce from functools.
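For instance, a minimal sketch of the reduce variant (| is set union):
from functools import reduce

sentences = ['A short sentence', 'A second sentence']
tokens = reduce(lambda acc, s: acc | set(s.split()), sentences, set())
# tokens == {'A', 'second', 'sentence', 'short'}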

Determining the most frequent casing of a word in Python

I have a text and I want to determine the most frequent casing of each word and create a dictionary with it. This is an extract of the text:
PENCIL: A pencil is an object we use to write. Pencil should not be confused by pen, which is a different object. A pencil is usually made from a pigment core inside a protective casing.
For example, a word like "pencil" could appear as "Pencil", "PENCIL" or "pencil" in my text. I would like to create a function that would first determine which of those options is the most frequent one. I have started by classifying all the words into three groups, depending on casing, though I don't know how to determine which case is the most frequent one (I guess I'd have to do a comparison across the three lists, but I don't know how to do that):
list_upper = []
list_lower = []
list_both = []
for word in text:
    if word.isupper():
        list_upper.append(word)
    if word.islower():
        list_lower.append(word)
    if word == word.title():
        list_both.append(word)
Then, I want to create a dictionary in which the keys would be the lowercase words and the values would be the most frequent casing, for example pencil: Pencil. I'm not sure how to do this either... This is my desired output:
my_dictionary = {"pencil":"Pencil", "the":"THE"...}
I'm assuming that text is already an iterable of words and that words like 'pEnCiL' cannot occur.
Instead of building these three lists, you can start constructing a dictionary with the counts right away. I suggest using a defaultdict which returns Counter instances when a key is missing.
from collections import defaultdict, Counter
cases = defaultdict(Counter)
for word in text:
    cases[word.lower()][word] += 1
For a list text with the content
['pencil', 'pencil', 'PENCIL', 'Pencil', 'Pencil', 'PENCIL', 'rubber', 'PENCIL']
this will produce the following cases dictionary.
defaultdict(collections.Counter,
            {'pencil': Counter({'PENCIL': 3, 'Pencil': 2, 'pencil': 2}),
             'rubber': Counter({'rubber': 1})})
From here, you can construct the final result as follows.
result = {w:c.most_common(1)[0][0] for w, c in cases.items()}
This will give you
{'pencil': 'PENCIL', 'rubber': 'rubber'}
in this example. If two cases appear equally often, an arbitrary one is picked as the most common.
Edit:
Turns out text is not an iterable of words. Daniel Mesejo's answer has a regular expression which can help you extract the words from a string.
You could use Counter with defaultdict:
import re
from collections import Counter, defaultdict
def words(t):
    return re.findall(r'\w+', t)
text = """PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
Another casing with different Casing"""
table = defaultdict(list)
for word in words(text):
    table[word.lower()].append(word)
result = {key: Counter(values).most_common(1)[0][0] for key, values in table.items()}
print(result)
Output
{'casing': 'casing', 'be': 'be', 'core': 'core', 'another': 'Another', 'object': 'object', 'should': 'should', 'from': 'from', 'write': 'write', 'pen': 'pen', 'protective': 'protective', 'a': 'a', 'which': 'which', 'pencil': 'pencil', 'different': 'different', 'not': 'not', 'is': 'is', 'by': 'by', 'inside': 'inside', 'to': 'to', 'confused': 'confused', 'with': 'with', 'pigment': 'pigment', 'we': 'we', 'use': 'use', 'an': 'an', 'made': 'made', 'usually': 'usually'}
First create a dictionary where the keys are the lower case variant of each word and the values are a list of the corresponding occurrences. Then use Counter to count the number of each casing and get the most common. Note the use of regex to extract the words.
You have two great answers already. Just for fun, I figured we could try using only the builtins, since you have already tokenized the words:
# Create a temp dict within the main dict that counts the occurrences of each casing
d = {}
for word in words:
    d.setdefault(word.lower(), {}).setdefault(word, 0)
    d[word.lower()][word] += 1

# Create a function that picks the most common casing from an inner dict
def func(dct):
    return sorted(dct.items(), key=lambda x: x[-1])[-1][0]

# Use the function in a dictionary comprehension to convert the results
result = {k: func(v) for k, v in d.items()}
Test case:
text = """
PENCIL: A pencil is an object we use to write.
Pencil should not be confused by pen, which is a different object.
A pencil is usually made from a pigment core inside a protective casing.
PENCIL PENCIL PENCIL Pigment Pigment Pigment Pigment
"""
# Added last line to produce a different result
result
# {'pencil': 'PENCIL', 'a': 'a', 'is': 'is', 'an': 'an',
#  'object': 'object', 'we': 'we', 'use': 'use', 'to': 'to',
#  'write': 'write', 'should': 'should', 'not': 'not', 'be': 'be',
#  'confused': 'confused', 'by': 'by', 'pen': 'pen', 'which': 'which',
#  'different': 'different', 'usually': 'usually', 'made': 'made',
#  'from': 'from', 'pigment': 'Pigment', 'core': 'core',
#  'inside': 'inside', 'protective': 'protective', 'casing': 'casing'}

How to pull a word out of a list then shuffle the word?

So I want to pull a word out of a list, then jumble the word so I have to guess what the mixed-up word is. Once it is guessed correctly, I will move on to the next word.
My code so far:
import random
words = ['Jumble', 'Star', 'Candy', 'Wings', 'Power', 'String', 'Shopping', 'Blonde', 'Steak', 'Speakers', 'Case', 'Stubborn', 'Cat', 'Marker', 'Elevator', 'Taxi', 'Eight', 'Tomato', 'Penguin', 'Custard']
from random import shuffle
shuffle(words)
Start with your code:
import random
words = ['Jumble', 'Star', 'Candy', 'Wings', 'Power', 'String', 'Shopping', 'Blonde', 'Steak', 'Speakers', 'Case', 'Stubborn', 'Cat', 'Marker', 'Elevator', 'Taxi', 'Eight', 'Tomato', 'Penguin', 'Custard']
Use random.choice() to grab a single word:
chosen = random.choice(words)
If you want to grab multiple words and deal with them one at a time, you can use random.sample() instead. If you want to deal with every single word, you can use random.shuffle() as you've already done and then iterate over words (with for word in words: ...).
Finally, shuffle the word:
letters = list(chosen) # list of all the letters in chosen, in order
random.shuffle(letters)
scrambled = ''.join(letters)
Try something like:
for w in words:
    s = list(w)
    shuffle(s)
    print(''.join(s))
Convert each word into a list first, before you apply shuffle to it; then join the letters back together.
Please note that shuffle() changes its input parameter in place and always returns None.
As Kevin suggested, you can pick a random word with random.choice() and feed it in.
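A minimal sketch of that combination (the word list here is just an example):
import random

words = ['Jumble', 'Star', 'Candy', 'Wings', 'Power']
chosen = random.choice(words)   # pick one word at random
letters = list(chosen)          # shuffle() needs a mutable sequence
random.shuffle(letters)         # shuffles in place and returns None
scrambled = ''.join(letters)
print(scrambled)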
I think you are trying to do something like:
sorted(words[2], key=lambda k: random.random())
# ['y', 'd', 'a', 'C', 'n']
You can join it with:
''.join(sorted(words[2], key=lambda k: random.random()))
# 'andCy'
You do not need to import shuffle separately; it is already available through the random module as random.shuffle.

Elegant parsing of text-based key-value list

I'm writing a parser for text-based sequence alignment/map (SAM) files. One of the fields is a concatenated list of key-value pairs comprising a single alphabet character and an integer (the integer comes first). I have working code, but it just feels a bit clunky. What's an elegant pattern for parsing a format such as this? Thanks.
Input:
record['cigar_str'] = '6M1I69M1D34M'
Desired output:
record['cigar'] = [
    {'type': 'M', 'length': 6},
    {'type': 'I', 'length': 1},
    {'type': 'M', 'length': 69},
    {'type': 'D', 'length': 1},
    {'type': 'M', 'length': 34}
]
EDIT: My current approach
cigarettes = re.findall(r'[\d]{0,}[A-Z]{1}', record['cigar_str'])
for cigarette in cigarettes:
    if cigarette[-1] == 'I':
        errors['ins'] += int(cigarette[:-1])
    ...
Here's what I'd do:
>>> import re
>>> s = '6M1I69M1D34M'
>>> matches = re.findall(r'(\d+)([A-Z]{1})', s)
>>> import pprint
>>> pprint.pprint([{'type':m[1], 'length':int(m[0])} for m in matches])
[{'length': 6, 'type': 'M'},
{'length': 1, 'type': 'I'},
{'length': 69, 'type': 'M'},
{'length': 1, 'type': 'D'},
{'length': 34, 'type': 'M'}]
It's pretty similar to what you have, but it uses regex groups to tease out the individual components of the match.
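As a variation on the same idea, named groups can make the intent a bit clearer; a sketch (this uses the m['name'] match indexing, which requires Python 3.6+):
import re

s = '6M1I69M1D34M'
pattern = re.compile(r'(?P<length>\d+)(?P<type>[A-Z])')
cigar = [{'type': m['type'], 'length': int(m['length'])}
         for m in pattern.finditer(s)]
# [{'type': 'M', 'length': 6}, {'type': 'I', 'length': 1}, ...]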

compare words in two lists in python

I would appreciate someone's help on this probably simple matter: I have a long list of words in the form ['word', 'another', 'word', 'and', 'yet', 'another']. I want to compare these words to a list that I specify, checking whether the target words are contained in the first list or not.
I would like to output which of my "search" words are contained in the first list and how many times they appear. I tried something like list(set(a).intersection(set(b))) - but it splits up the words and compares letters instead.
How can I pass in a list of words to compare with the existing long list? And how can I output co-occurrences and their frequencies? Thank you so much for your time and help.
>>> lst = ['word', 'another', 'word', 'and', 'yet', 'another']
>>> search = ['word', 'and', 'but']
>>> [(w, lst.count(w)) for w in set(lst) if w in search]
[('and', 1), ('word', 2)]
This code basically iterates through the unique elements of lst and, if the element is in the search list, adds the word, along with the number of occurrences, to the resulting list.
Preprocess your list of words with a Counter:
from collections import Counter
a = ['word', 'another', 'word', 'and', 'yet', 'another']
c = Counter(a)
# c == Counter({'word': 2, 'another': 2, 'and': 1, 'yet': 1})
Now you can iterate over your new list of words and check whether they are contained within this Counter-dictionary and the value gives you their number of appearance in the original list:
words = ['word', 'no', 'another']
for w in words:
    print(w, c.get(w, 0))
which prints:
word 2
no 0
another 2
or output it in a list:
[(w, c.get(w, 0)) for w in words]
# returns [('word', 2), ('no', 0), ('another', 2)]
