I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:
speech = "I've changed the path of the economy, and I've increased jobs in our own
home state. We're headed in the right direction - you've all been a great help."
So, in this case, I'd like to count four (4) contractions. I have a list of contractions, and here are some of the first few terms:
contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}
My code looks something like this, to begin with:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count
I'm not getting anywhere with this, however, as the code's iterating over every single letter, as opposed to whole words.
Use str.split() to split your string on whitespace:
for word in speech.split():
This will split on arbitrary whitespace; this means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:
from string import punctuation
count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
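As a self-contained sketch of the approach above: the shortened `contractions` dictionary and the function name `count_contractions` are made up for illustration and are not part of the question. It also shows why `strip()` is safe here: it only trims the ends of a word, so the apostrophe inside a contraction is kept.

```python
from string import punctuation

# Hypothetical, shortened stand-in for the real contractions dictionary.
contractions = {"i've": "I have", "we're": "we are", "you've": "you have"}

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

def count_contractions(text, known):
    """Count whitespace-separated words found in `known` after cleanup."""
    count = 0
    for word in text.lower().split():
        # strip() only trims the ends, so the apostrophe in "i've" survives.
        if word.strip(punctuation) in known:
            count += 1
    return count

print("you've,".strip(punctuation))              # you've
print(count_contractions(speech, contractions))  # 4
```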
You're iterating over a string, so the items are characters. To get the words from a string you can use a simple method like str.split(), which gives you a list of strings to iterate over (the words, split on the argument of str.split(); by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need regexes to split this text.
At a minimum you have to lowercase your string with str.lower(), or else put every possible spelling (including capitalized variants) in the dictionary. I strongly recommend the first alternative; the latter isn't really practical. You also need to remove the punctuation. This is still a naive approach, though. If you need a more sophisticated method, split the text with a word tokenizer. NLTK is a good starting point for that; see the NLTK tokenizer. But I strongly suspect that this is not your main problem and that it doesn't really stand in the way of solving your question. :)
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...
# With re you can define more advanced patterns, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # You can also omit the " " argument to split on any whitespace. If you
    # really need better methods (see the answer text above), you have to use
    # a word tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean each word (remove ',', ';', ...). The advantage of using
        # re over `from string import punctuation` is that you have more
        # control over what you remove: you can easily add or drop any
        # punctuation mark. That can be very handy, but it can also be
        # overkill. If it is, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count

print(abbreviation_counter(speech, contractions))
2  # yeah, it worked - I've included I've in your list :)
It's a little bit frustrating to post an answer at the same time as Martijn Pieters does ;), but I hope I have still added some value for you. That's why I've edited my answer to give you some additional hints for future work.
A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.
Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.
So your code should look like this:
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
speech.split(" ") works too, but it only splits on single space characters, not on tabs or newlines, and if there are double spaces you'd get empty elements in your resulting list.
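A small illustration of that difference, using a made-up sample string that contains a double space, a tab and a newline:

```python
text = "one  two\tthree\nfour"  # double space, tab and newline

any_whitespace = text.split()     # splits on runs of any whitespace
single_space = text.split(" ")    # splits on each single space only

print(any_whitespace)  # ['one', 'two', 'three', 'four']
print(single_space)    # ['one', '', 'two\tthree\nfour']
```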
Suppose I have a string, s="panpanIpanAMpanJOEpan". I want to find the word pan and replace it with spaces so that I get the output string "I AM JOE". How can I do it?
Actually, I also don't know how to find a certain substring in a long string without spaces, such as the one mentioned above.
It would be great if someone could help me learn about this.
If you don't know the filler word (pan), you can exploit the fact that the letters you want to keep are all upper case.
fillword = min(set("".join(i if i.islower() else ' ' for i in s).split(' '))-set(['']),key=len)
This works by first replacing all upper case letters with space, then splitting on space and finding the minimal nonempty word.
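The one-liner can be unpacked into separate steps, which may be easier to follow. This is only a sketch under the same assumption as above (the wanted text is upper case and the filler is lower case); the intermediate names `masked` and `pieces` are introduced here for illustration.

```python
s = "panpanIpanAMpanJOEpan"

# Step 1: blank out every character that is not a lowercase letter.
masked = "".join(ch if ch.islower() else " " for ch in s)

# Step 2: split on single spaces; runs of spaces leave empty strings behind,
# so drop the empty string from the set of pieces.
pieces = set(masked.split(" ")) - {""}

# Step 3: take the shortest remaining piece as the filler word.
fillword = min(pieces, key=len)

print(fillword)                          # pan
print(s.replace(fillword, " ").strip())  # I AM JOE
```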
Use replace to replace with space, and then strip to remove excess spacing.
s="panpanIpanAMpanJOEpan"
s.replace(fillword,' ').strip()
gives:
'I AM JOE'
Use replace:
s="panpanIpanAMpanJOEpan"
print(s.replace("pan"," ").strip())
Output:
I AM JOE
As DarrylG and others mentioned, .replace will do what you asked for, where you define what you want to replace ("pan") and what you want to replace it with (" ").
To find a certain string in a longer string you can use .find(), which takes a string you are looking for and optionally where to start and stop looking for it (as integers) as arguments.
If you want to find all of the occurrences of a string in a bigger string, there are two options:
Find the string with find(), then cut the string so it no longer contains your search term, and repeat this until find() returns -1 (meaning the search term is no longer found in the string),
or use the re module and its finditer function to find all occurrences of your string.
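Both options can be sketched as follows. Note one deliberate change: instead of cutting the string, the loop passes a start position to `find()`, which has the same effect for non-overlapping matches. The helper name `find_all` is hypothetical.

```python
import re

s = "panpanIpanAMpanJOEpan"
needle = "pan"

def find_all(haystack, needle):
    """Return the start index of every non-overlapping occurrence."""
    found = []
    start = haystack.find(needle)
    while start != -1:  # find() returns -1 once nothing is left
        found.append(start)
        start = haystack.find(needle, start + len(needle))
    return found

# Same result with re.finditer; re.escape guards against regex metacharacters.
positions = [m.start() for m in re.finditer(re.escape(needle), s)]

print(find_all(s, needle))  # [0, 3, 7, 12, 18]
print(positions)            # [0, 3, 7, 12, 18]
```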
Edit: If you don't know what you are searching for, it becomes a bit more tricky, but you can write a regular expression that extracts this data as well, using the same re module. This is easy if you know what the end result is supposed to be (I AM JOE in your case). If you don't, it becomes more complicated and we would need additional information to help with this.
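For this particular string, one possible regular expression simply collects the runs of upper-case letters. This is a sketch that assumes the wanted words are upper case and the filler is not; it would not carry over to other inputs.

```python
import re

s = "panpanIpanAMpanJOEpan"

# Each run of capital letters is treated as one word.
words = re.findall(r"[A-Z]+", s)
result = " ".join(words)

print(words)   # ['I', 'AM', 'JOE']
print(result)  # I AM JOE
```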
You can use replace to replace all occurrences of a substring at once.
In case you want to find the substrings yourself, you can do it manually:
s = "panpanIpanAMpanJOEpan"
while True:
    panPosition = s.find('pan')  # -1 == 'pan' not found!
    if panPosition == -1:
        s = s.strip()
        break
    # Cut out pan from s and replace it with a blank.
    s = s[:panPosition] + ' ' + s[panPosition + 3:]
print(s)
Out:
I AM JOE
In my program I'm using count=strr2.lower().count("ieee") to calculate the number of occurrences in the following string,
"i love ieee and ieeextream is the best coding competition ever"
Here it also counts "ieeextream" as an occurrence, which is not the result I expect. The expected output is count=1.
So is there any method to check only for the word "ieee", or can we change the same code to a different implementation? Thanks for your time.
If you are trying to find the substring only where it appears as a whole word in the original string, then I guess this is what you need:
count=strr2.lower().split().count("ieee")
If you want to count only whole words, you can use a regular expression, wrapping the word to be found in word-boundary characters \b. This will also work if the word is surrounded by punctuation.
>>> import re
>>> s = "i love IEEE, and ieeextream is the best coding competition ever"
>>> len(re.findall(r"\bieee\b", s.lower()))
1
For the purpose of this project, I'm using more exact regular expressions rather than more general ones. I'm counting occurrences of words from a list stored in a text file, which I import into my script as vocabWords; each word in the list is in the format \bword\b.
When I run my script, \bwhat\b will pick up the words "what" and "what's", but \bwhat's\b will pick up no words. If I switch the order so the apostrophe word is before the root word, words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is using "\b", but I haven't been able to find how to fix this. I cannot have a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
    matched.append(re.findall(regex_all, row))
There are at least two other solutions:
Test that the next symbol isn't an apostrophe: r"\bwhat(?!')\b"
Use a more general rule, r"\bwhat(?:'s)?\b", to catch both variants, with and without the apostrophe.
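A quick check of both patterns on a made-up sentence (the sample text and the names `strict` and `either` are only for illustration):

```python
import re

text = "what is iron, and what's iron's use?"

# Negative lookahead: "what" must not be followed by an apostrophe.
strict = re.compile(r"\bwhat(?!')\b")
# Optional group: matches "what" with or without a trailing 's.
either = re.compile(r"\bwhat(?:'s)?\b")

print(strict.findall(text))  # ['what']
print(either.findall(text))  # ['what', "what's"]
```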
If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
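To see why the order matters: alternation tries the branches from left to right, and \bwhat\b already succeeds on "what's" because there is a word boundary between the t and the apostrophe. The sketch below compares both orders on a made-up row of text; only `vocabWords` comes from the question.

```python
import re

vocabWords = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]
row = "what's iron's use, and what is iron?"

# Original order: the short patterns win before the long ones are tried.
unsorted_re = re.compile("|".join(vocabWords))
# Longest patterns first, as suggested above.
sorted_re = re.compile("|".join(sorted(vocabWords, key=len, reverse=True)))

print(unsorted_re.findall(row))  # ['what', 'iron', 'what', 'iron']
print(sorted_re.findall(row))    # ["what's", "iron's", 'what', 'iron']
```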
How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)
This will print:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is any kind of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with strings like "never going to work,". Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
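The note about dropping the trailing [^\w\s] can be tried out as follows. This is a sketch of that variant only, wrapped in a hypothetical helper named `mark_negated`; without the mandatory punctuation, a negation that runs to the end of the string is tagged too.

```python
import re

def mark_negated(text):
    # Same first step as above, but without the mandatory punctuation mark.
    scope = re.compile(r"\b(?:not|never|no)\b[\w\s]+", flags=re.IGNORECASE)
    # Second step: prefix every word that is preceded by whitespace.
    return scope.sub(
        lambda m: re.sub(r"(\s+)(\w+)", r"\1NEG_\2", m.group(0)),
        text,
    )

print(mark_negated("He did not play so well"))
# He did not NEG_play NEG_so NEG_well
```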
I would not do this with a regexp. Rather, I would:
Split the input on punctuation characters.
For each fragment, do:
    Set the negation counter to 0.
    Split the fragment into words.
    For each word:
        Add "negation counter" copies of NEG_ to the word (or the counter mod 2, or just one if the counter is greater than 0).
        If the original word is in {No, Never, Not}, increase the negation counter by one.
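The steps above can be sketched as follows. Assumptions made here: the "one NEG_ if the counter is greater than 0" variant is used (so a boolean flag replaces the counter), the punctuation set is `.,:;!?`, and `re.split` with capturing groups is used only so that punctuation and spacing can be put back unchanged. The name `tag_negations` is hypothetical.

```python
import re

NEGATIONS = {"no", "not", "never"}

def tag_negations(text):
    out = []
    # The capturing group keeps each punctuation mark as its own fragment.
    for fragment in re.split(r"([.,:;!?])", text):
        negated = False  # reset at the start of every fragment
        # Split on whitespace, keeping the whitespace so spacing is preserved.
        for token in re.split(r"(\s+)", fragment):
            if token.strip() and negated:
                out.append("NEG_" + token)
            else:
                out.append(token)
            # Check the original word only after it has been emitted.
            if token.lower() in NEGATIONS:
                negated = True
    return "".join(out)

print(tag_negations("He did not play so well, so he had to practice some more."))
# He did not NEG_play NEG_so NEG_well, so he had to practice some more.
```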
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
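A minimal sketch of these three steps, using the regex suggested above. The `re.IGNORECASE` flag, the choice to tag every run of non-whitespace characters, and the helper names are assumptions added for illustration. Passing a function to `sub()` takes care of inserting the result back in place of the match.

```python
import re

# Step 1: negation keyword, then everything up to the next punctuation mark.
NEG_SCOPE = re.compile(r"\b(?:not?|never)\b([^.,:;!?]+)", re.IGNORECASE)

def _tag_scope(match):
    scope = match.group(1)
    # Whatever precedes group 1 in the match is the keyword itself.
    keyword = match.group(0)[: len(match.group(0)) - len(scope)]
    # Step 2: prepend NEG_ to every whitespace-separated word in the scope.
    tagged = re.sub(r"(\S+)", r"NEG_\1", scope)
    return keyword + tagged

def tag_negations(text):
    # Step 3: sub() puts each rewritten match back into the original string.
    return NEG_SCOPE.sub(_tag_scope, text)

print(tag_negations("It was never going to work, he thought."))
# It was never NEG_going NEG_to NEG_work, he thought.
```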
I tried multiple solutions here, and although they strip some punctuation, they don't seem to work on multiple punctuation marks in a row, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
which I got from:
"Best way to strip punctuation from a string in Python", was good, but I still encounter problems with double punctuation.
I added a while loop in the hope of iterating over each word to remove multiple punctuation marks, but that does not seem to work; it just gets stuck on the first item, "[, and never exits.
Am I just missing some obvious piece that I am being oblivious to?
I solved the problem by adding redundancy and double looping over my lists, but this takes an extremely long time (well into the minutes) because the sets are fairly large.
I use Python 2.7
Your code doesn't work because regex.match only matches at the beginning of the string.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place; you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print i
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
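A sketch without the while loop: a single `sub()` call already removes every match in the string, including several punctuation marks in a row. The sample words and the name `strip_punct` are made up for illustration.

```python
import re
import string

strip_punct = re.compile("[%s]" % re.escape(string.punctuation))

words = ['"[hello', "it's,,", "a.bc,,", "cdd,gf.f.d,fe"]

# One pass per word; sub() returns a new string, so collect the results.
cleaned = [strip_punct.sub("", w) for w in words]

print(cleaned)  # ['hello', 'its', 'abc', 'cddgffdfe']
```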
This will remove everything that is not alphanumeric or a space:
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot  tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None,'~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using str.translate like this eliminates the punctuation (this three-argument call is the Python 2 form). I usually use this for eliminating numbers from DNA sequence reads.
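The call above is written for Python 2. As a sketch of the Python 3 equivalent (not part of the original answer), `str.maketrans` can build a deletion table from `string.punctuation`; the sample sentence is made up.

```python
import string

text = "Hello, World!! (it's) [fine]."

# Third argument of maketrans: characters to delete.
table = str.maketrans("", "", string.punctuation)
cleaned = text.translate(table)

print(cleaned)  # Hello World its fine
```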