Find occurrences of connecting words in a string with Python

I have a text where all the words are tagged with "parts of speech" tags. example of the text here:
What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT
I need to find all the occurrences where there is a /PUNCT followed by either NOUN, PRON or PROPN - and also count which one occurs the most often.
So one of the answers would appear like this:
?/PUNCT What/NOUN or ./PUNCT What/NOUN
Further on, the word "Deal" appears 6 times in the full text, and I need to show this in code.
I am not allowed to use NLTK, only Collections.
Tried several different things, but I don't really know what to do here. I think I need to use defaultdict, and then somehow do a while loop that gives me back a list with the right connectives.

Here is a test program that does what you want.
It first splits the long string on spaces (' '), which creates a list of word/class elements. The for loop then checks whether the combination of PUNCT followed by NOUN, PRON, or PROPN occurs and saves each match to a list.
The code is as follows:
from collections import Counter
string = "What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT What/NOUN could/VERB happen/VERB next/ADJ ?/PUNCT"
words = string.split(' ')
found = []
for n, (first, second) in enumerate(zip(words[:-1], words[1:])):
    first_class = first.split('/')[1]
    second_class = second.split('/')[1]
    if first_class == 'PUNCT' and second_class in ["NOUN", "PRON", "PROPN"]:
        print(f"Found occurrence at data list index {n} and {n+1} with {first_class}, {second_class}")
        found.append(f'{words[n]} {words[n+1]}')
To count the words:
words_only = [i.split('/')[0] for i in words]
word_counts = Counter(words_only).most_common()
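To answer the two follow-up parts of the question, both counts can be read straight off Counter objects. A minimal sketch reusing found and words_only from above (the word "Deal" does not occur in the short sample string, so its count is 0 here; the asker says it appears 6 times in their full text):
pair_counts = Counter(found)
print(pair_counts.most_common(1))   # the most frequent PUNCT + NOUN/PRON/PROPN pair

word_counter = Counter(words_only)
print(word_counter['Deal'])          # count of one specific word, e.g. "Deal"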

Related

Returning the elements in a list that are over a certain number of characters

I want to be able to get words that are more than three letters long from a list. I am struggling with the process of doing this.
words = "My name is bleh, I am bleh, how are you bleh?"
I would like to be able to extract everything that's above 3 letters.
Try this:
result = []
for word in words.split(' '):
    if len(word) > 3:
        result.append(word)
When you use split you get a list containing every word of the variable 'words'. Then you iterate over it and calculate the length using len().
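The same filter can also be written as a single list comprehension; this is just an equivalent sketch of the loop above:
result = [word for word in words.split(' ') if len(word) > 3]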

Find how many words start with certain letter in a list

I am trying to output the total of how many words start with a letter 'a' in a list from a separate text file. I'm looking for an output such as this.
35 words start with a letter 'a'.
However, I'm outputting all the words that start with an 'a' instead of the total with my current code. Should I be using something other than a for loop?
So far, this is what I have attempted:
wordsFile = open("words.txt", 'r')
words = wordsFile.read()
wordsFile.close()
wordList = words.split()
print("Words:",len(wordList)) # prints number of words in the file.
a_words = 0
for a_words in wordList:
    if a_words[0]=='a':
        print(a_words, "start with the letter 'a'.")
The output I'm getting thus far:
Words: 334
abate start with the letter 'a'.
aberrant start with the letter 'a'.
abeyance start with the letter 'a'.
and so on.
You could replace this with a sum call to which you feed 1 for every word in wordList that starts with 'a':
print(sum(1 for w in wordList if w.startswith('a')), 'start with the letter "a"')
This can be further trimmed down if you use the boolean values returned by startswith instead; since True is treated as 1 in these contexts, the effect is the same:
print(sum(w.startswith('a') for w in wordList), 'start with the letter "a"')
With your current approach, you're not summing anything, you're simply printing any word that matches. In addition, you're rebinding a_words from an int to each word of the list as you iterate through it.
Also, instead of using a_words[0] to check for the first character, you could use startswith(character), which has the same effect and is a bit more readable.
You are using a_words as the loop variable that holds each word, and you're missing a counter. If we change the for loop to use words as the loop variable and reserve a_words for the counter, we can increment the counter each time the criterion is met. You could rename a_words to wordCount or something generic to make it more portable and friendly for other letters.
a_words = 0
for words in wordList:
    if words[0] == 'a':
        a_words += 1
print(a_words, "start with the letter 'a'.")
sum(generator) is the way to go, but for completeness' sake, you may want to do it with a list comprehension (maybe it's slightly more readable, or you want to do something with the words starting with 'a', etc.).
words_starting_with_a = [word for word in word_list if word.startswith('a')]
After that you may use len built-in to retrieve length of your new list.
print(len(words_starting_with_a), "words start with a letter 'a'")
A simple alternative solution using the re.findall function (without splitting the text or using a for loop):
import re
...
words = wordsFile.read()
...
total = len(re.findall(r'\ba\w+?\b', words))
print('Total number of words that start with a letter "a" : ', total)

Count all occurrences of elements with and without special characters in a list from a text file in python

I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.
I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.
This is what I have so far, I have been trying various ways and this post got me the closest:
Python - Finding word frequencies of list of words in text file
So I have a file that contains a couple of paragraphs and my list of strings is:
listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']
My full code is:
#!/usr/bin/python
import re
from collections import Counter
f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall('\w+', f.read().lower())
cnt = Counter()
for word in words:
    if word in wanted:
        print word
        cnt[word] += 1
print cnt
my output thus far looks like:
the
the
the
... (17 lines of "the" in total)
Counter({'the': 17})
It is counting my "the" strings with punctuation but not counting them as separate entries. I know it is because of the \w+. I am just not sure what the proper regex pattern to use here is, or whether I'm going about this the wrong way.
I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by a whitespace or by some punctuation characters such as ;,.!'. You want to count the number of all the distinct instances of this general pattern.
I would define a single (non-disjunctive) regular expression that defines this. Something like this:
import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)
Now, let's say our text is
text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''
We could do
matches = pattern.findall(text)
where matches will be
[' the ',
';the,',
' The ',
"'the ",
' the;',
" the'",
' the;',
' the,',
' the.']
And then you just count.
from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1
which in this case would lead to
Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.
Just to add, a difficulty with using a disjunctive regular expression like
'the|the;|the,|the!'
is that strings like "the," and "the;" will also match the first option, i.e. "the", and that will be returned as the match. Even though this problem could be avoided by more careful ordering of the options, I think it is not easy to get right in general.
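A quick illustration of that pitfall (a made-up two-option pattern just for the demo):
import re

# The alternation tries "the" first, so the punctuation variant never matches.
print(re.findall(r'the|the;', 'the; end'))   # ['the']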
The simplest option is to combine all "wanted" strings into one regular expression:
rr = '|'.join(map(re.escape, wanted))
and then find all matches in the text using re.findall.
To make sure longer strings match first, just sort the wanted list by length:
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
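Putting both steps together with the counting, a minimal sketch (reusing the wanted list and the text.txt file from the question):
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

with open('text.txt') as f:
    text = f.read()

# Longest alternatives first, so "the," is not swallowed by plain "the".
wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))

print(Counter(re.findall(rr, text)))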

Determine if a list of words is in a sentence?

Is there a way (Pattern or Python or NLTK, etc.) to detect whether a sentence contains a list of words?
i.e.
The cat ran into the hat, box, and house. | The list would be hat, box, and house
This could be string processed but we may have more generic lists:
i.e.
The cat likes to run outside, run inside, or jump up the stairs. |
List=run outside, run inside, or jump up the stairs.
This could be in the middle of a paragraph or the end of the sentence which further complicates things.
I've been working with Pattern for Python for a while and I'm not seeing a way to go about this, and I was curious whether there is a way with Pattern or NLTK (the Natural Language Toolkit).
From what I got from your question, I think you want to check whether all the words in your list are present in a sentence.
In general, to search for list elements in a sentence, you can use the all function. It returns True if all of the elements of the iterable you pass it are truthy.
listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"
if all(word in sentence for word in listOfWords):
    print "All words in sentence"
else:
    print "Missing"
OUTPUT:
"All words in sentence"
I think this might serve your purpose. If not, then you can clarify.
Using a trie, you will be able to achieve this in O(m), where m is the number of words in the sentence, after building the trie from the list of words, which takes O(n), where n is the number of words in the list.
Algorithm
Split the sentence into a list of words separated by spaces.
For each word, check whether it is a key in the trie, i.e. whether that word exists in the list.
If it exists, add that word to the result, to keep track of how many words from the list appear in the sentence.
Also keep track of the words that have a subtrie, i.e. words where the current word is a prefix of a longer entry in the list of words.
For each of those prefix words, check whether extending it with the current word gives a key or a subtrie in the list of words.
If it gives a subtrie, add it to the extended_words set and check whether concatenating it with the next words eventually produces an exact match.
Code
import pygtrie

listOfWords = ['word1', 'word2', 'word3', 'two words']

# Build the trie once, using a space as the separator so that
# multi-word entries such as 'two words' become paths in the trie.
trie = pygtrie.StringTrie(separator=' ')
for word in listOfWords:
    trie[word] = True

sentence = "word1 as word2 a fword3 af two words"
sentence_words = sentence.split()

words_found = {}
extended_words = set()  # prefixes seen so far that may still grow into a full entry
for possible_word in sentence_words:
    has_possible_word = trie.has_node(possible_word)
    if has_possible_word & trie.HAS_VALUE:
        words_found[possible_word] = True
    # Try to extend every pending prefix with the current word.
    deep_clone = set(extended_words)
    for extended_word in deep_clone:
        extended_words.remove(extended_word)
        possible_extended_word = extended_word + ' ' + possible_word
        has_possible_extended_word = trie.has_node(possible_extended_word)
        if has_possible_extended_word & trie.HAS_VALUE:
            words_found[possible_extended_word] = True
        if has_possible_extended_word & trie.HAS_SUBTRIE:
            extended_words.add(possible_extended_word)
    if has_possible_word & trie.HAS_SUBTRIE:
        extended_words.add(possible_word)

print(words_found)
print(len(words_found) == len(listOfWords))
This is useful if your list of words is huge and you do not wish to iterate over it every time, or if you have a large number of queries over the same list of words.
What about using from nltk.tokenize import sent_tokenize ?
sent_tokenize("Hello SF Python. This is NLTK.")
["Hello SF Python.", "This is NLTK."]
Then you can use that list of sentences in this way:
for sentence in my_list:
    # test if this sentence contains the words you want,
    # using all() as shown above
    all(word in sentence for word in listOfWords)
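Putting the two pieces together, a minimal sketch (assuming NLTK and its punkt tokenizer data are installed; the sample text is made up):
from nltk.tokenize import sent_tokenize

listOfWords = ['hat', 'box', 'house']
text = "The cat ran into the hat, box, and house. The dog slept."

for sentence in sent_tokenize(text):
    if all(word in sentence for word in listOfWords):
        print(sentence)   # prints only the first sentence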

Breaking a string into individual words in Python

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.
The problem I have is the list is formatted as domain names, for example:
examplecartrading.com
examplepensions.co.uk
exampledeals.org
examplesummeroffers.com
+5996
Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between whole words then run a word count.
For my sanity I would prefer to script this.
I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement this in Python.
We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest
This iterator function first yields the string it is called with, if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may yield nothing, as with ["example", "cart", ...]).
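A quick sanity check of the generator, using a tiny hand-picked word set instead of the full dictionary built below:
demo_words = {"example", "car", "cart", "trading"}
print(list(substrings_in_set("examplecartrading", demo_words)))
# [['example', 'car', 'trading']] -- "cart" dead-ends because "rading" is not in the set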
Then we build the English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())
# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")
# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))
# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
Now we can put things together:
count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk",
"exampledeals.org", "examplesummeroffers.com"]
# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns on the first False.
Some substrings may be a valid word both as a whole and when split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.
EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.
NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
Assuming you only have a few thousand standard domains, you should be able to do this all in memory.
domains = open(domainfile)
# DictionaryFileOfEnglishLanguage and all_sub_strings are placeholders:
# an open word-list file and a helper yielding every substring of a string.
dictionary = set(DictionaryFileOfEnglishLanguage.readlines())
found = []
for domain in domains.readlines():
    for substring in all_sub_strings(domain):
        if substring in dictionary:
            found.append(substring)
from collections import Counter
c = Counter(found)  # this is what you want
print c
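The all_sub_strings helper is not defined in the snippet above; a minimal sketch of what it could look like (the name and behavior are assumptions, not part of the original answer):
def all_sub_strings(s):
    # Hypothetical helper: yield every contiguous substring of s.
    s = s.strip().lower()
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]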
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f.readlines()]

def guess_split(word):
    result = []
    for n in xrange(len(word)):
        if word[:n] in words and word[n:] in words:
            result = [word[:n], word[n:]]
    return result

from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
    for line in f.readlines():
        for word in line.strip().split('.'):
            if len(word) > 3:
                # junks the com, org, stuff
                for x in guess_split(word):
                    word_counts[x] += 1

for spam in word_counts.items():
    print '{word}: {count}'.format(word=spam[0], count=spam[1])
Here's a brute force method which only tries to split the domains into two English words. If the domain doesn't split into two English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max.
output:
deals: 1
example: 2
pensions: 1
