My previous effort was something like this with Python NLTK:
from nltk.tokenize import RegexpTokenizer
a = "miris ribe na balkanu"
capt1 = RegexpTokenizer('[a-b-c]\w+')
capt1.tokenize(a)
['be', 'balkanu']
This was not what I wanted: ribe was cut down to be, starting from the b. This was suggested by Tanzeel but doesn't help:
>>> capt1
RegexpTokenizer(pattern='\x08[abc]\\w+', gaps=False, discard_empty=True, flags=56)
>>> a
'miris ribe na balkanu'
>>> capt1.tokenize(a)
[]
>>> capt1 = RegexpTokenizer('\b[a-b-c]\w+')
>>> capt1.tokenize(a)
[]
How do I change this so that I end up with just the last word?
What you probably need is a word-boundary \b in your regex to match the start of a word.
Updating your regex to \b[abc]\w+ should work.
Update:
Since the OP could not get the regex with the word-boundary to work with NLTK (the word-boundary \b is a valid regex meta-character), I downloaded NLTK and tested the regex myself.
This updated regex works now (?<=\s)[abc]\w+ and it returns the result ['balkanu'] as you'd expect.
I had not worked with NLTK before, so I couldn't explain at first why the word-boundary didn't work. Note the \x08 in the tokenizer's repr above, though: in a regular Python string literal, '\b' is the backspace escape, so the word-boundary pattern needs to be passed as a raw string, r'\b[abc]\w+'.
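For reference, here is a minimal sketch of both fixes; the expected results assume a recent NLTK:
from nltk.tokenize import RegexpTokenizer
a = "miris ribe na balkanu"
# Raw string, so \b is a word boundary and not a backspace character.
capt1 = RegexpTokenizer(r'\b[abc]\w+')
print(capt1.tokenize(a))   # expected: ['balkanu']
# The lookbehind variant from the update; note it would miss a matching
# word at the very start of the string, since there is no preceding space.
capt2 = RegexpTokenizer(r'(?<=\s)[abc]\w+')
print(capt2.tokenize(a))   # ['balkanu']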
The purpose of the RegexpTokenizer is not to pull out selected words from your input, but to break it up into tokens according to your rules. To find all words that begin with a, b or c, use this:
import re
bwords = re.findall(r"\b[abc]\w*", 'miris ribe na balkanu')
I'm not too sure what you are after, so if your goal was actually to extract the last word in a string, use this:
word = re.findall(r"\b\w+$", 'miris ribe na balkanu')[0]
This matches the string of letters between a word boundary and the end of the string.
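For reference, both snippets give the result you'd expect on the sample string:
>>> re.findall(r"\b[abc]\w*", 'miris ribe na balkanu')
['balkanu']
>>> re.findall(r"\b\w+$", 'miris ribe na balkanu')
['balkanu']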
I think you are mixing up the notions of matching and tokenizing.
This line
capt1 = RegexpTokenizer('[abc]\w+')
(don't do [a-b-c]) says that the tokenizer should look for a, b or c and count everything following, up to the end of the word, as a token.
I think what you want to do is tokenize your data, then discard any tokens that don't begin with a or b or c.
That is a separate step.
>>> capt1 = RegexpTokenizer('\w+')
>>> tokens = capt1.tokenize(a)
# --> ['miris', 'ribe', 'na', 'balkanu']
>>> selection = [t for t in tokens if t.startswith(('a','b','c'))]
# --> ['balkanu']
I've used str.startswith() here because it is simple, but you could always use a regular expression for that step too, just not in the same pattern the tokenizer is using.
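For completeness, here is the same selection step done with a regex instead of str.startswith(), as a sketch:
import re
tokens = ['miris', 'ribe', 'na', 'balkanu']
selection = [t for t in tokens if re.match(r'[abc]', t)]
# --> ['balkanu']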
Related
My program replaces tokens with values when they are in a file. It gets stuck when reading in a certain line; here is an example:
1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a
The two tokens in the example are Token100 and Token100a. I need a way to only replace Token100 with its data and not replace Token100a with Token100's data with an a afterwards. I can't look for spaces before and after because sometimes they are in the middle of lines. Any thoughts are appreciated. Thanks.
You can use a regex word boundary, so that Token100 only matches when it is not immediately followed by another word character:
import re
line = "1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a"
match = re.sub(r"Token100\b", "data", line)
print(match)
Outputs:
1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a
More about regex here:
https://www.w3schools.com/python/python_regex.asp
You can use a regular expression with a negative lookahead to ensure that the following character is not an "a":
>>> import re
>>> test = '1.1.1.1.1.1.1.1.1.1 Token100.1 1.1.1.1.1.1.1Token100a'
>>> re.sub(r'Token100(?!a)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'
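If other variants such as Token1000 could also occur, a slightly more general variant (a sketch, same idea) is to forbid any word character after the token instead of just an "a":
>>> re.sub(r'Token100(?!\w)', 'data', test)
'1.1.1.1.1.1.1.1.1.1 data.1 1.1.1.1.1.1.1Token100a'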
I'm trying to solve a problem where they give me a set of strings and I have to count how many times a certain word like 'code' appears within a string, where any variant that changes the 'd' (like 'coze') also counts, but something like 'coz' doesn't. This is what I made:
def count(word):
    count=0
    for i in range(len(word)):
        lo=word[i:i+4]
        if lo=='co': # this is what gives me trouble
            count+=1
    return count
Test if the first two characters match co and the 4th character matches e.
def count(word):
    count = 0
    for i in range(len(word)-3):
        # first two characters are 'co' and the 4th character is 'e'
        if word[i:i+2] == 'co' and word[i+3] == 'e':
            count += 1
    return count
The loop only goes up to len(word)-3 so that word[i+3] won't go out of range.
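A quick check with the fixed function:
print(count('code coze coz'))  # 2 ('code' and 'coze' count, 'coz' doesn't)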
You could use regex for this, through the re module.
import re
string = 'this is a string containing the words code, coze, and coz'
re.findall(r'co.e', string)
['code', 'coze']
from there you could write a function such as:
def count(string, word):
    return len(re.findall(word, string))
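For example, with the sample string above:
count(string, r'co.e')  # -> 2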
Regex is the answer to your question, as mentioned above, but what you need is a more refined pattern. Since you are looking for a certain word appearing, you need to search with word boundaries. So your pattern should be something like this:
pattern = r'\bco.e\b'
This way your search will not match words like testcodetest or cozetest, but only code, coze or coke, with no leading or trailing characters.
If you are going to test multiple times, it's better to use a compiled pattern; that way it's more efficient.
In [1]: import re
In [2]: string = 'this is a string containing the codeorg testcozetest words code, coze, and coz'
In [3]: pattern = re.compile(r'\bco.e\b')
In [4]: pattern.findall(string)
Out[4]: ['code', 'coze']
Hope that helps.
I'm having trouble completing a regex tutorial that went from \w+ matching words to this problem: "Find all capitalized words in my_string and print the result", where some of the words have apostrophes.
Original String:
In [1]: my_string
Out[1]: "Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"
Current Attempt:
# Import the regex module
import re
# Find all capitalized words in my_string and print the result
capitalized_words = r"((?:[A-Z][a-z]+ ?)+)"
print(re.findall(capitalized_words, my_string))
Current Result:
['Let', 'RegEx', 'Won', 'Can ', 'Or ']
What I think the desired outcome is:
["Let's", 'RegEx', "Won't", 'Can', 'Or']
How do you go from r"((?:[A-Z][a-z]+ ?)+)" to also selecting the 's and 't at the end of Let's and Won't, when not everything we're trying to catch is expected to have an apostrophe?
Just add an apostrophe to the second bracket group:
capitalized_words = r"((?:[A-Z][a-z']+)+)"
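A quick check of that pattern, assuming the same my_string as in the question:
>>> re.findall(r"((?:[A-Z][a-z']+)+)", my_string)
["Let's", 'RegEx', "Won't", 'Can', 'Or']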
I suppose you can add a little apostrophe in the group [a-z'].
So it will be like ((?:[A-Z][a-z']+ ?)+)
Hope that works
While you do have your answer, I'd like to provide a more "real-world" solution using nltk:
from nltk import sent_tokenize, regexp_tokenize
my_string = """Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"""
sent = sent_tokenize(my_string)
print(len(sent))
# 5
pattern = r"\b(?i)[a-z][\w']*"
print(len(regexp_tokenize(my_string, pattern)))
# 19
And imo, these are 5 sentences, not 4 unless there's some special requirement for a sentence in place.
I'm trying to learn how to use regular expressions but have a question. Let's say I have the string
line = 'Cow Apple think Woof'
I want to see if line has at least two words that begin with capital letters (which, of course, it does). In Python, I tried to do the following
import re
test = re.search(r'(\b[A-Z]([a-z])*\b){2,}',line)
print(bool(test))
but that prints False. If I instead do
test = re.search(r'(\b[A-Z]([a-z])*\b)',line)
I find that print(test.group(1)) is Cow but print(test.group(2)) is w, the last letter of the first match (there are no further groups).
Any suggestions on pinpointing this issue and/or how to approach the problem better in general?
The last letter of the match ends up in group 2 because of the inner parentheses; just drop those and you'll be fine. (As for the {2,} version returning False: the repeated group must match twice back-to-back, and nothing in the pattern consumes the space between the two words.)
>>> t = re.findall('([A-Z][a-z]+)', line)
>>> t
['Cow', 'Apple', 'Woof']
>>> t = re.findall('([A-Z]([a-z])+)', line)
>>> t
[('Cow', 'w'), ('Apple', 'e'), ('Woof', 'f')]
The count of capitalised words is, of course, len(t).
I use the findall function to find all instances that match the regex, then use len to see how many matches there are; in this case it prints out 3. You can check whether the length is at least 2 and return True or False.
import re
line = 'Cow Apple think Woof'
test = re.findall(r'(\b[A-Z]([a-z])*\b)',line)
print(len(test) >= 2)
If you want to use only regex, you can search for a capitalized word then some characters in between and another capitalized word.
test = re.search(r'(\b[A-Z][a-z]*\b)(.*)(\b[A-Z][a-z]*\b)',line)
print(bool(test))
(\b[A-Z][a-z]*\b) - finds a capitalized word
(.*) - matches 0 or more characters
(\b[A-Z][a-z]*\b) - finds the second capitalized word
This method isn't as flexible, since it won't scale to matching, say, three capitalized words.
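If you do need it to scale, one option (a sketch using a counted repetition with a lazy filler between the words) is:
import re
line = 'Cow Apple think Woof'
n = 3  # require at least three capitalized words
pattern = r'(?:\b[A-Z][a-z]*\b.*?){%d}' % n
print(bool(re.search(pattern, line)))  # True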
import re
sent = "His email is abc@some.com, however his wife uses xyz@gmail.com"
x = re.findall(r'[A-Za-z]+@[A-Za-z.]+', sent)
print(x)
If there is a period at the end of an email ID (abc@some.com.), it will be returned at the end of the email address. However, this can be dealt with separately.
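One simple way to deal with that separately is to strip any trailing period from each match, for example:
emails = [e.rstrip('.') for e in x]  # 'abc@some.com.' -> 'abc@some.com'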
I have a string containing words, where each word has its own tag (e.g. NN/NNP/JJ etc.). I want to take the words that carry the NNP tag, keeping repeated runs together. My code so far:
import re
sentence = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
tes = re.findall(r'(\w+)/NNP', sentence)
print(tes)
The result of the code:
['Rapunzel', 'Sheila', 'Yasir']
As we can see, there are 3 words with NNP: Rapunzel/NNP and Sheila/NNP (which appear next to each other) and Yasir/NNP (separated from the other NNP words). My problem is that I need to separate the consecutive NNP words from the isolated ones. My expected result is:
['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']
What is the best way to perform this task? Thanks.
Match the groups as simple strings, and then split them:
>>> [m.split() for m in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", sentence)]
[['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]
You can get very close to your expected outcome using a different capture group.
>>> re.findall(r'((?:\w+/NNP\s*)+)', sentence)
['Rapunzel/NNP Sheila/NNP ', 'Yasir/NNP']
Capture group ((?:\w+/NNP\s*)+) will group all the \w+/NNP patterns together with optional spaces in between.
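To land exactly on the expected output, the grouped matches only need to be split, the same way as in the answer above:
>>> [m.split() for m in re.findall(r'((?:\w+/NNP\s*)+)', sentence)]
[['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]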
Here's an alternative without any regex. It uses groupby and split():
from itertools import groupby
string = "Rapunzel/NNP Sheila/NNP let/VBD down/RP her/PP$ long/JJ golden/JJ hair/NN in Yasir/NNP"
words = string.split()
def get_token(word):
    return word.split('/')[-1]
print([list(ws) for token, ws in groupby(words, get_token) if token == "NNP"])
# [['Rapunzel/NNP', 'Sheila/NNP'], ['Yasir/NNP']]