I have a string, e.g.:
"This is my very boring string"
In addition, I have the index of a character, counted as if the string had no spaces.
e.g.:
The location 13, which in this example matches the o in the word boring.
What I need is, based on the index I get (13) to return the word (boring).
This code will return the char (o):
re.findall('[a-zA-Z]', s)[13]
But for some reason I can't think of a good way to return the word boring.
Any help will be appreciated.
You can use the regex \w+ to match words and keep accumulating the lengths of the matches until the total length exceeds the target position:
import re

def get_word_at(string, position):
    length = 0
    for word in re.findall(r'\w+', string):
        length += len(word)
        if length > position:
            return word
so that get_word_at('This is my very boring string', 13) would return:
boring
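If you also need to know where the word sits in the original string, the same accumulation should work with re.finditer, which exposes each match's offsets. This is only a sketch: the name word_span_at is made up for illustration, and it counts \w characters exactly as get_word_at does.

```python
import re

# Hypothetical helper (not part of the answer above): like get_word_at,
# but it also reports the word's start and end offsets in the original text.
def word_span_at(text, position):
    seen = 0  # word characters consumed so far
    for match in re.finditer(r'\w+', text):
        seen += len(match.group())
        if seen > position:
            return match.group(), match.start(), match.end()
    return None  # position lies beyond the last word character

print(word_span_at('This is my very boring string', 13))
# ('boring', 16, 22)
```

Returning None for an out-of-range position is a design choice here; raising IndexError would be just as reasonable.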
A non-regex solution that strives for the elegance the OP desires:
def word_out_of_string(string, character_index):
    words = string.split()
    while words and character_index >= len(words[0]):
        character_index -= len(words.pop(0))
    return words.pop(0) if words else None
print(word_out_of_string("This is my very boring string", 13))
This does not require a variable-length lookbehind, which is slow and ugly.
A simple lookahead with a capture group will get the word.
This regex counts non-whitespace characters and stops at the 13th one (counting from 1), leaving its word in capture group 1:
^(?:\s*(?=(?<!\S)(\S+))?\S){13}
Use \w instead if need be, but whichever character class is counted must be
paired with its complement, otherwise nothing will work:
the match stops early because ALL characters up to the target must be matched.
Examples:
\w used with \W
\s used with \S
You can install and use the regex module, which supports patterns with variable-length lookbehinds, so that you can use such a pattern to assert that there are exactly the desired number of word characters, optionally surrounded by white spaces, behind the matching word:
import regex
regex.search(r'\w*(?<=^\s*(\w\s*){13})\w+', 'This is my very boring string').group()
This returns:
boring
This function takes two arguments: a string and an index into that string with its spaces removed. It first converts the index into the equivalent index in the original string, then returns the word that the character at the converted index belongs to.
def find(string, idx):
    # Find the index of the character relative to the original string
    i1 = -1
    remaining = idx
    for i, char in enumerate(string):
        if char != ' ':
            if remaining == 0:
                i1 = i
                break
            remaining -= 1
    # Find which word that index belongs to in the original string
    i2 = 0  # index at which the current word starts
    for word in string.split(' '):
        if i2 <= i1 < i2 + len(word):
            return word
        i2 += len(word) + 1  # skip the word and the space after it
print(find("This is my very boring string", 13))
Output:
boring
If Python's alternative regex engine is used, one could replace matches of the following regular expression with empty strings:
r'^(?:\s*\S){0,13}\s|(?<=(?:\s*\S){13,})\s.*'
For the example string, the 'o' in 'boring' is at index 13 after whitespace has been removed. If both 13s in the regex are changed to any number in the range 12-17, 'boring' is still returned. If they are changed to 11, 'very' is returned; if they are changed to 18, 'string' is returned.
The regex engine performs the following operations.
^ : match beginning of string
(?:\s*\S) : match 0+ ws chars, then 1 non-ws char, in a non-capture group
{0,13} : execute the non-capture group 0-13 times
\s : match a ws char
| : or
(?<= : begin a positive lookbehind
(?:\s*\S) : match 0+ ws chars, then 1 non-ws char, in a non-capture group
{13,} : execute the non-capture group at least 13 times
) : end positive lookbehind
\s : match 1 ws char
.* : match 0+ chars
I want to know how to remove punctuation marks at the beginning and at the end of one or more words.
Punctuation marks inside a word should not be removed.
for example
input:
word = "!.test-one,-"
output:
word = "test-one"
Use strip:
>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'
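Since the question mentions "one or more words", here is a minimal sketch that applies the same strip call to every whitespace-separated word. The helper name strip_words and the sample sentence are made up for illustration.

```python
import string

# Hypothetical helper: strip punctuation from both ends of every
# whitespace-separated word, leaving punctuation inside a word alone.
def strip_words(text):
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop words that consisted of punctuation only.
    return ' '.join(word for word in stripped if word)

print(strip_words('!.test-one,- "quoted" -- (end).'))
# test-one quoted end
```

Note that joining with a single space normalizes the original spacing, which may or may not be what you want.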
The best solution is to use the .strip(chars) method of Python's built-in str class.
Another approach is to use a regular expression with the re module.
To understand what strip() and the regular expression do, take a look at two functions that duplicate the behavior of strip(). The first uses recursion, the second uses while loops:
chars = r'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''
def cstm_strip_1(word, chars):
    # Approach using recursion:
    if not word:
        return word
    w = word[1 if word[0] in chars else 0: -1 if word[-1] in chars else None]
    if w == word:
        return w
    else:
        return cstm_strip_1(w, chars)
def cstm_strip_2(word, chars):
    # Approach using while loops:
    i, j = 0, len(word) - 1
    while i <= j and word[i] in chars:
        i += 1
    while j >= i and word[j] in chars:
        j -= 1
    return word[i:j + 1]
import re, string
chars = string.punctuation
word = "~!.test-one^&test-one--two???"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
word = "__~!.test-one^&test-one--two??__"
wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word
print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc )
print('"',re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='' )
Notice that the result when using the given regular expression pattern can differ from the result when using .strip(string.punctuation) because the set of characters covered by regular expression [^\w] pattern differs from the set of characters in string.punctuation.
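If you want a regular expression that behaves like .strip(string.punctuation), one option is to build the character class from string.punctuation itself. This is a sketch, not the pattern from the answer above; the names _PUNCT_CLASS, _EDGE_PUNCT and strip_punct_re are invented here.

```python
import re
import string

# Character class holding exactly the characters of string.punctuation;
# re.escape protects the ones that are special inside [...] (], \, ^, -).
_PUNCT_CLASS = '[' + re.escape(string.punctuation) + ']'
# \A and \Z anchor to the very start and end of the string.
_EDGE_PUNCT = re.compile(rf'\A{_PUNCT_CLASS}+|{_PUNCT_CLASS}+\Z')

def strip_punct_re(word):
    return _EDGE_PUNCT.sub('', word)

word = "__~!.test-one^&test-one--two??__"
print(strip_punct_re(word))  # test-one^&test-one--two
```

Unlike [^\w], this class treats the underscore as punctuation and leaves whitespace such as tabs alone, which is how strip(string.punctuation) behaves.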
SUPPLEMENT
What does the regular expression pattern:
(^[^\w]+)|([^\w]+$)
mean?
Below a detailed explanation:
The '|' character means 'or' providing two alternatives for the
sub-string (called match) which is to find in the provided string.
'(^[^\w]+)' is the first of the two alternatives for a match
'(' ')' enclose what is called a "capturing group" (^[^\w]+)
The first of the two '^' asserts position at start of a line
'\w' : with \ escaped 'w' means: "word character"
(i.e. letters a-z, A-Z, digits 0-9 and the underscore '_';
for str patterns Python 3 also counts Unicode letters and digits
such as 'ö', unless re.ASCII is set).
The second of the two '^' means: logical "not"
(here not a "word character")
i.e. all characters except a-zA-Z0-9 and '_'
(for example '~' or '-')
Notice that the meaning of '^' depends on context:
'^' outside of [ ] it means start of line/string
'^' inside of [ ] as first char means logical not
and not as first means itself
'[', ']' enclose specification of a set of characters
and mean the occurrence of exactly one of them
'+' means occurrence between one and unlimited times
of what was defined in preceding token
'([^\w]+$)' is the second alternative for a match
differing from the first by stating that the match
should be found at the end of the string
'$' means: "end of the line" (or "end of string")
The regular expression pattern tells the regular expression engine to work as follows:
The engine looks at the start of the string for an occurrence of a non-word
character. If one is found it will be remembered as a match, and the next
character will be checked and added to the already found ones if it is also
a non-word character. This way the start of the string is checked for
occurrences of non-word characters, which will then be removed from the
string if the pattern is used in re.sub(r"(^[^\w]+)|([^\w]+$)", "", word),
which replaces any found characters with an empty string (in other words,
it deletes the found characters from the string).
After the engine hits the first word character in the string, the search at
the start of the string will then jump to the end of the string because
of the second alternative given for the pattern, as the first
alternative is limited to the start of the line.
This way any non-word characters in the intermediate part of the string
will not be searched for.
The engine then looks at the end of the string for a non-word character
and proceeds as at the start, but going backwards, to ensure that the
found non-word characters are at the end of the string.
Using re.sub
import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)
Gives:
test-one
Check this example using slicing (it removes at most one punctuation character from each end):
import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."
if sentence[0] in string.punctuation:
    sentence = sentence[1:]
if sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)
Output:
blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers
I want to insert a space between the letters of a string. For example:
The string BINGO!
should be B I N G O!
I have already tried:
s = "BINGO!"
print(" ".join(s[::1]))
I'd use regex: re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z])', ' ', 'BINGO!')
This basically says, "for every empty string in 'BINGO!' that both follows a letter and precedes a letter, substitute a space."
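To make the difference from a plain join visible, here is a small sketch that wraps the substitution in a function (the name space_letters is just for illustration) and prints both results side by side.

```python
import re

def space_letters(text):
    # Zero-width match: any position with a letter on both sides.
    return re.sub(r'(?<=[a-zA-Z])(?=[a-zA-Z])', ' ', text)

print(space_letters('BINGO!'))  # B I N G O!
print(' '.join('BINGO!'))       # B I N G O !
```

The lookarounds consume nothing, so no characters are lost; only the gaps between two letters receive a space.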
I like what esramish said, but it is more complicated than needed if the string contains only letters. In that case it is as easy as this:
string = "BINGO"
print(" ".join(string))
This returns B I N G O. Note that with "BINGO!" it would also put a space before the "!".
You can use a single lookahead to assert a char A-Z to the right and make the pattern case insensitive.
In the replacement, use the full match \g<0> followed by a space.
(?i)[A-Z](?=[A-Z])
(?i) Inline modifier for a case insensitive match
[A-Z] Match a single char A-Z
(?=[A-Z]) Positive lookahead, assert a char A-Z to the right of the current position
Example
import re
print(re.sub(r'(?i)[A-Z](?=[A-Z])', r'\g<0> ', 'BINGO!'))
Output
B I N G O!
String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?
You can use \w+<[^>]+
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] matches a single character other than >
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
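Putting the pieces described above together, a minimal sketch of how this pattern could be applied to the two strings from the question (the variable names are arbitrary):

```python
import re

lines = [
    "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97",
    "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17",
]

# Word characters, a literal <, then everything up to (not including) the next >.
pattern = re.compile(r'\w+<[^>]+')
results = [pattern.findall(line) for line in lines]

for found in results:
    print(found)
# ['CAR<7:5', 'BIKE<4:0']
# ['CAKE<4:0']
```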
We can use re.findall here with the pattern (\w+<.*?)>:
import re
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
    matches = re.findall(r'(\w+<.*?)>', i)
    print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']
In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match <, then 1+ digits on each side of a :
(?=>) Positive lookahead, assert > directly to the right
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d+)>
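To see what the stricter patterns buy you, here is a sketch that runs both variants on the first string from the question and on one made-up line (the note<draft> text is invented purely to show the difference from a looser \w+<[^>]+ pattern).

```python
import re

line = "[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97"

# Lookaround variant: the whole match is the wanted text.
lookaround = re.findall(r'(?<=[ |])[A-Z]+<\d+:\d+(?=>)', line)
# Capture-group variant: findall returns group 1 only.
captured = re.findall(r'(?:[ |])([A-Z]+<\d+:\d+)>', line)
print(lookaround)  # ['CAR<7:5', 'BIKE<4:0']
print(captured)    # ['CAR<7:5', 'BIKE<4:0']

# Made-up input: the loose pattern also accepts "note<draft".
made_up = "note<draft> CAR<7:5>"
loose = re.findall(r'\w+<[^>]+', made_up)
strict = re.findall(r'(?<=[ |])[A-Z]+<\d+:\d+(?=>)', made_up)
print(loose)   # ['note<draft', 'CAR<7:5']
print(strict)  # ['CAR<7:5']
```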
I would like to match a word when it is after a char m or b
So for example, when the word is men, I would like to return en (only the word that is following m), if the word is beetles then return eetles
Initially I tried (m|b)\w+ but it matches the entire men not en
How do I write regex expression in this case?
Thank you!
You can match only the part after the first letter by using a positive lookbehind. It asserts that what is directly to the left is either m or b, written as the character class [mb], preceded by a word boundary \b
(?<=\b[mb])\w+
(?<= Positive lookbehind, assert what is directly to the left is
\b[mb] Word boundary, match either m or b
) Close lookbehind
\w+ Match 1+ word chars
If there cannot be anything other than whitespace after the word characters, you can assert a whitespace boundary on the right using (?!\S)
(?<=\b[mb])\w+(?!\S)
Example code
import re
test_str = ("beetles men")
regex = r"(?<=\b[mb])\w+"
print(re.findall(regex, test_str))
Output
['eetles', 'en']
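For comparison, a small sketch of the whitespace-boundary variant next to the plain one. The test string with a comma is made up to show where the two differ.

```python
import re

text = "beetles men, bats"  # made-up input: "men" is followed by a comma

plain = re.findall(r"(?<=\b[mb])\w+", text)
# (?!\S) requires whitespace or the end of the string right after the match.
bounded = re.findall(r"(?<=\b[mb])\w+(?!\S)", text)

print(plain)    # ['eetles', 'en', 'ats']
print(bounded)  # ['eetles', 'ats']
```

With the boundary, "men," is skipped entirely because the word characters are followed by a non-whitespace character.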
You may use
\b[mb](\w+)
NOTE: When your known prefixes include multicharacter sequences, say, you want to find words starting with m or be, you will have to use a non-capturing group rather than a character class: \b(?:m|be)(\w+). The current solution can thus be written as \b(?:m|b)(\w+) (however, a character class here looks more natural, unless you have to build the regex dynamically).
Details
\b - a word boundary
[mb] - m or b
(\w+) - Capturing group 1: any one or more word chars, letters, digits or underscores. To match only letters, use ([^\W\d_]+) instead.
Python demo:
import re
rx = re.compile(r'\b[mb](\w+)')
text = "The words are men and beetles."
# First occurrence:
m = rx.search(text)
if m:
    print(m.group(1)) # => en
# All occurrences
print( rx.findall(text) ) # => ['en', 'eetles']
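If the prefixes are only known at run time, the non-capturing group mentioned in the note could be built dynamically, roughly like this. The function name suffixes_after is hypothetical, and re.escape is used on the assumption that prefixes are literal text.

```python
import re

# Hypothetical helper: build \b(?:p1|p2|...)(\w+) from a list of literal prefixes.
def suffixes_after(prefixes, text):
    # Try longer prefixes first so that "be" is preferred over a plain "b".
    ordered = sorted(prefixes, key=len, reverse=True)
    alternation = '|'.join(re.escape(p) for p in ordered)
    return re.findall(rf'\b(?:{alternation})(\w+)', text)

text = "The words are men and beetles."
print(suffixes_after(['m', 'be'], text))  # ['en', 'etles']
print(suffixes_after(['m', 'b'], text))   # ['en', 'eetles']
```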
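(?<=[mb])\w+
You can use the regex above. It matches the word characters that follow an m or b anywhere in the text; put \b before [mb] to accept only an m or b at the start of a word.
(?<=[mb]): positive lookbehind, asserts an m or b directly to the left
\w+: matches one or more word characters (equal to [a-zA-Z0-9_]+)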
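A quick sketch of how this lookbehind behaves with and without a word boundary in front of [mb]; the sample text "number men" is made up for the comparison.

```python
import re

text = "number men"  # made-up input containing "mb" inside a word

# Without \b, the 'm' inside "number" also satisfies the lookbehind.
loose = re.findall(r'(?<=[mb])\w+', text)
# With \b, only an m or b at the start of a word counts.
anchored = re.findall(r'(?<=\b[mb])\w+', text)

print(loose)     # ['ber', 'en']
print(anchored)  # ['en']
```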
I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read comments for @AlanMoore and @bukzor explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if it isn't followed by a match for something else.' Now, take a look at the lookahead assertion followed by a dot: (?!\1). All this means is 'match the next character only if it isn't the same as the first character.'
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
    \b            #The beginning of a word (a word boundary)
    ([a-z])       #One letter
    (?!\w*\1\B)   #The rest of this word may not contain the starting letter except at the end of the word
    [a-z]*        #Any number of other letters
    \1            #The starting letter we captured in step 2
    \b            #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
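On the findall point: when a pattern contains exactly one capturing group, re.findall returns the contents of that group rather than the whole match, so the call above would yield only the first letters. A sketch of one way around that, using re.finditer with the same pattern (variable names are mine):

```python
import re

l = "abca\nbcab\naaba\ncccc\ncbac\nbabb\n"

pattern = re.compile(r'''
    \b            # start of a word
    ([a-z])       # group 1: the first letter
    (?!\w*\1\B)   # the first letter may not reappear before the end of the word
    [a-z]*        # the letters in between
    \1            # the first letter again
    \b            # end of the word
''', re.IGNORECASE | re.VERBOSE)

first_letters = pattern.findall(l)                   # group 1 only
words = [m.group(0) for m in pattern.finditer(l)]    # the whole match

print(first_letters)  # ['a', 'b', 'c']
print(words)          # ['abca', 'bcab', 'cbac']
```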
Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so I chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
To heck with regexes.
[
    word
    for word in words.split()
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line
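One Python-specific detail when trying this pattern: it should be written as a raw string, because in an ordinary string literal '\1' is the control character chr(1) rather than a backreference. A minimal sketch, filtering the lines one by one so that whole words come back (the variable names are mine):

```python
import re

l = "abca\nbcab\naaba\ncccc\ncbac\nbabb\n"

# In a normal string literal '\1' is one control character, not a backreference.
assert '\1' == chr(1)
assert r'\1' == '\\1'

pattern = re.compile(r'^(.)(?:(?!\1).)*\1$')
# Each line is tested on its own, so re.MULTILINE is not needed here.
matching = [s for s in l.splitlines() if pattern.search(s)]
print(matching)  # ['abca', 'bcab', 'cbac']
```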