Removing square brackets from output and stopping regex after first match - python

Python noob here. I'm trying to fix two problems with my current code.
I'm trying to remove the square brackets from my list output
I can't figure out how to stop regex after the first match
For the first problem I've tried a number of different solutions but without success.
str()
"".join()
.replace("[]", '')
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = str((re.findall(r"([^.]*?%s[^.]*\.)" % define_words, page_content)))
I'm currently getting the following output
[apples001][][][][][apples002 apples003]
When I should be getting
apples001
Any help would be much appreciated and sorry about the messy code!

You can try the following:
import re
Test_String = "carrots apples001 carrots apples002 apples003"
Regex_Pattern = r'(apples\S\S\S).*'
match = re.findall(Regex_Pattern, Test_String)
print(''.join(match))

Instead of using re.findall, you could use re.search to search for the first location where the pattern produces a match.
To match the word apples and the following digits, you could use:
\bapples\d+\b
\b Word boundary to prevent being part of a larger word
apples\d+ Match apples followed by 1+ digits
\b Word boundary
Your code could look like:
import re
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = (re.search(r"\b%s\d+\b" % define_words, page_content).group())
print(parsed_content) # apples001
If define_words can start with a non-word character, you might use (?<!\S)%s\d+ instead. The lookbehind asserts that the match is not preceded by a non-whitespace character, so it must begin at the start of the string or right after whitespace.
parsed_content = (re.search(r"(?<!\S)%s\d+" % define_words, page_content).group())
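As a rough sketch building on the patterns above: calling .group() directly raises AttributeError when re.search finds nothing. The helper below (the name first_match is made up for this example) returns None in that case. It also passes the word through re.escape, which is an addition not in the answer, on the assumption that define_words may contain regex metacharacters.

```python
import re

def first_match(word, text):
    """Return the first whitespace-delimited token start `word` + digits, or None."""
    # re.escape is an assumption here: it treats `word` as literal text.
    pattern = r"(?<!\S)%s\d+" % re.escape(word)
    match = re.search(pattern, text)  # re.search stops at the first match
    return match.group() if match else None

page_content = "carrots apples001 carrots apples002 apples003"
print(first_match("apples", page_content))  # apples001
print(first_match("pears", page_content))   # None
```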


A regex pattern that matches all words starting from a word with an s and stopping before a word that starts with an s

I'm trying to capture words in a string such that the first word starts with an s, and the regex stops matching if the next word also starts with an s.
For example. I have the string " Stack, Code and StackOverflow". I want to capture only " Stack, Code and " and not include "StackOverflow" in the match.
This is what I am thinking:
Start with a space followed by an s.
Match everything except if the group is a space and an s (I'm using negative lookahead).
The regex I have tried:
(?<=\s)S[a-z -,]*(?!(\sS))
I don't know how to make it work.
I think this should work. I adapted the regex from another thread. I have also included a non-regex solution: it tracks the first word starting with an 's' and the next word starting with an 's', and takes the words in that range.
import re
teststring = " Stack, Code and StackOverflow"
extractText = re.search(r"(\s)[sS][^*\s]*[^sS]*", teststring)
print(extractText[0])
# non-regex solution
listwords = teststring.split(' ')
start = 0
end = 0
for i, word in enumerate(listwords):
    if word.startswith('s') or word.startswith('S'):
        if start == 0:
            start = i
        else:
            end = i
            break
newstring = " " + " ".join([word for word in listwords[start:end]])
print(newstring)
Output
Stack, Code and
Stack, Code and
You could use for example a capture group:
(S(?<!\S.).*?)\s*S(?<!\S.)
Explanation
( Capture group 1
S(?<!\S.) Match S and assert that it is not preceded by a non-whitespace character (a whitespace boundary to the left)
.*? Match any character, as few as possible
) Close group
\s* Match optional whitespace chars
S(?<!\S.) Match S and assert that it is not preceded by a non-whitespace character (a whitespace boundary to the left)
Example code:
import re
pattern = r"(S(?<!\S.).*?)\s*S(?<!\S.)"
s = "Stack, Code and StackOverflow"
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
Stack, Code and
Another option using a lookaround to assert the S to the right and not consume it to allow multiple matches after each other:
S(?<!\S.).*?(?=\s*S(?<!\S.))
import re
pattern = r"S(?<!\S.).*?(?=\s*S(?<!\S.))"
s = "Stack, Code and StackOverflow test Stack"
print(re.findall(pattern, s))
Output
['Stack, Code and', 'StackOverflow test']
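The S(?<!\S.) construct can be hard to read, so here is a small comparison, offered as a sketch rather than as part of the answer. It assumes that placing a plain (?<!\S) lookbehind before the S expresses the same whitespace boundary; the variable names are made up for the example.

```python
import re

s = "Stack, Code and StackOverflow test Stack"

# Pattern from the answer: the lookbehind sits after the S and looks back two chars.
after_s = r"S(?<!\S.).*?(?=\s*S(?<!\S.))"
# Assumed equivalent spelling: the lookbehind sits before the S.
before_s = r"(?<!\S)S.*?(?=\s*(?<!\S)S)"

result_after = re.findall(after_s, s)
result_before = re.findall(before_s, s)
print(result_after)
print(result_before)
```

Both spellings give the same list for this input; the second form may be easier to recognise as "start of string or preceded by whitespace".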

Inconsistency between regex and python search

I'm doing a small regex that catch all the text before the numbers.
https://regex101.com/r/JhIiG9/2
import re
regex = "^(.*?)(\d*([-.]\d*)*)$"
message = "Myteeeeext 0.366- 0.3700"
result = re.search(regex, message)
print(result.group(1))
https://www.online-python.com/a7smOJHBwp
When I run this regex, instead of just showing the first group, which is Myteeeeext, I get Myteeeeext 0.366-, but regex101 shows only Myteeeeext.
Try this Regex, [^\d.-]+
It catches all the text before the numbers
import re
regex = r"[^\d.-]+"
message = "Myteeeeext 0.366- 0.3700 notMyteeeeext"
result = re.search(regex, message)
print(f"'{result.group()}'")
Outputs:
'Myteeeeext '
Tell me if that works for you.
Your regex:
regex = "^(.*?)(\d*([-.]\d*)*)$"
doesn't allow for the numbers part to have any spaces, but your search string:
message = "Myteeeeext 0.366- 0.3700"
does have a space after the dash, so this part of your regex:
(.*?)
matches up to the second number.
It doesn't look like your test string in the regex101.com example you gave has a space, so that's why your results are different.
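To make the difference concrete, here is a minimal sketch. The lenient pattern is a hypothetical variant, not taken from the question or the answers: it assumes the numeric tail may contain whitespace and that whitespace between the text and the numbers should be dropped.

```python
import re

message = "Myteeeeext 0.366- 0.3700"

# Original pattern: the numeric tail cannot contain whitespace,
# so the lazy first group has to grow until only "0.3700" is left.
strict = re.search(r"^(.*?)(\d*([-.]\d*)*)$", message)

# Hypothetical variant: whitespace is allowed inside the numeric tail.
lenient = re.search(r"^(.*?)\s*([\d.\s-]*)$", message)

print(repr(strict.group(1)))   # 'Myteeeeext 0.366- '
print(repr(lenient.group(1)))  # 'Myteeeeext'
```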

Find a specific sentence with a given word in Python

I would like to find the sentence containing a given word, string1, in the following passage:
Note that
HEAD,Content,11005,{A:1,json:{B:0,C:5,D:-1,E:false,F:Failure},suffix:_A}DC0DHEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646HEAD,Content,11005,{A:1,json:{Z:{Y:false,X:0,Q:1},suffix:}3AA8
So the expected result would be:
HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646
So far, I have used the regular expression as follows to chop the desired sentence:
([^.]*?string1[^.]*)
However, the result is not the desired one: the whole sentence is not captured, only the following:
A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646
I hope someone can help solve this little issue. Thanks!
If all sentences begin with HEAD, you can do the following:
>>> temp = s.split('HEAD')
>>> res = 'HEAD' + [i for i in temp if 'string1' in i][0]
>>> print(res)
HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646
If you want to use a regex, you can match HEAD, then match any character as long as it is not the start of another HEAD.
Then match string1, again followed by any character that is not the start of another HEAD:
HEAD(?:(?!HEAD).)*string1(?:(?!HEAD).)*
import re
pattern = r"HEAD(?:(?!HEAD).)*string1(?:(?!HEAD).)*"
s = ("HEAD,Content,11005,{A:1,json:{B:0,C:5,D:-1,E:false,F:Failure},suffix:_A}DC0DHEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646HEAD,Content,11005,{A:1,json:{Z:{Y:false,X:0,Q:1},suffix:}3AA8\n")
matches = re.findall(pattern, s)
print(matches)
Output
['HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646']

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or the string limits, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
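A quick way to check that pattern is to wrap it in a function and run it on the sample strings from this thread. This is only a sketch; the name collapse_repeats is made up for the example.

```python
import re

# Group 2 captures the word; group 1 captures the first two occurrences,
# which is what the replacement keeps.
REPEATS = re.compile(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)')

def collapse_repeats(text):
    """Replace three or more repetitions of a word with two."""
    return REPEATS.sub(r'\1', text)

print(collapse_repeats("bye! bye! bye!"))
print(collapse_repeats("hi hi hi hi some words words words"))
```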
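You could also try the regex below. Note that the sample uses the third-party regex module; the standard re module rejects the lookbehind (?<= |^) because its alternatives have different widths.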
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max:
            out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)

Regex? Match part of or whole word

I was wondering if it's possible to use regex with python to capture a word, or a part of the word (if it's at the end of the string).
Eg:
target word - potato
string - "this is a sentence about a potato"
string - "this is a sentence about a potat"
string - "this is another sentence about a pota"
Thanks!
import re
def get_matcher(word, minchars):
    reg = '|'.join([word[0:i] for i in range(len(word), minchars - 1, -1)])
    return re.compile('(%s)$' % (reg))
matcher = get_matcher('potato', 4)
for s in ["this is a sentence about a potato", "this is a sentence about a potat", "this is another sentence about a pota"]:
    print(matcher.search(s).groups())
OUTPUT
('potato',)
('potat',)
('pota',)
Don't know how to match a regex in Python, but the regex would be:
"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$"
This would match anything from p to potato if it's the last word in the string, and would not match something like "foopotato", if this is what you want.
The | denotes an alternative, the \b is a "word boundary", so it matches a position (not a character) between a word- and a non-word character. And the $ matches the end of the string (also a position).
Use the $ to match at the end of a string. For example, the following would match 'potato' only at the end of a string (first example):
"potato$"
This would match all of your examples:
"pota[to]{0,2}$"
However, it would also match strings such as "potao" or "potaot".
import re
patt = re.compile(r'(p|po|pot|pota|potat|potato)$')
patt.search(string)
I was tempted to use r'po?t?a?t?o?$', but that would also match poto or pott.
No, you can't do that with a regex as far as I know, without pointless (p|po|pot ...) matches which are excessive. Instead, just pick off the last word, and match that using a substring:
match = re.search(r'\S+$', haystack)
if match and match.group(0) == needle[:len(match.group(0))]:
    print('matches')  # the last word is a prefix of needle
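The last-word approach can be packaged as a small function. This is a sketch under assumptions: the name trailing_partial and the min_chars parameter are made up here (min_chars mirrors the minchars idea from the first answer), and str.startswith replaces the slice comparison.

```python
import re

def trailing_partial(haystack, needle, min_chars=1):
    """Return the last word of haystack if it is a prefix of needle, else None."""
    match = re.search(r'\S+$', haystack)  # last run of non-whitespace characters
    if match is None:
        return None
    last_word = match.group(0)
    if len(last_word) >= min_chars and needle.startswith(last_word):
        return last_word
    return None

for text in ["this is a sentence about a potato",
             "this is a sentence about a potat",
             "this is another sentence about a pota",
             "this one ends with potao"]:
    print(trailing_partial(text, "potato", min_chars=4))
```

Unlike the optional-character patterns, this cannot accept "potao" or "poto", because the last word has to be an exact prefix of the target.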
