Find a specific sentence with a given word in Python - python

I would like to find the sentence that contains a given word, string1, in the following passage:
HEAD,Content,11005,{A:1,json:{B:0,C:5,D:-1,E:false,F:Failure},suffix:_A}DC0DHEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646HEAD,Content,11005,{A:1,json:{Z:{Y:false,X:0,Q:1},suffix:}3AA8
So the expected result would be:
HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646
So far, I have used the following regular expression to extract the desired sentence:
([^.]*?string1[^.]*)
However, it does not capture the whole sentence. It returns only the following:
A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646
Could anyone help me solve this? Thanks!

If all sentences begin with HEAD, you can do the following:
temp = s.split('HEAD')
res = 'HEAD' + [i for i in temp if 'string1' in i][0]
>>> print(res)
HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646

If you want to use a regex, you can match HEAD, then match any character that does not start another HEAD.
Then match string1, again followed by any characters that do not start another HEAD:
HEAD(?:(?!HEAD).)*string1(?:(?!HEAD).)*
import re
pattern = r"HEAD(?:(?!HEAD).)*string1(?:(?!HEAD).)*"
s = ("HEAD,Content,11005,{A:1,json:{B:0,C:5,D:-1,E:false,F:Failure},suffix:_A}DC0DHEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646HEAD,Content,11005,{A:1,json:{Z:{Y:false,X:0,Q:1},suffix:}3AA8\n")
matches = re.findall(pattern, s)
print(matches)
Output
['HEAD,Content,11005,{A:1,json:{BC:true,DE:2,FG:0,HI:0,JK:0,string1:Error},suffix:_A}D646']
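The same idea can be wrapped in a small helper so that the record marker and the search word are not hard-coded. This is only a sketch: the function name find_records and the shortened sample string are made up for illustration, and it assumes every record starts with the marker.

```python
import re

def find_records(text, word, head="HEAD"):
    """Return every record that starts with `head` and contains `word`."""
    h = re.escape(head)   # escape in case the marker contains regex metacharacters
    w = re.escape(word)
    # (?:(?!HEAD).)* consumes one character at a time and stops before the next HEAD
    pattern = rf"{h}(?:(?!{h}).)*{w}(?:(?!{h}).)*"
    return re.findall(pattern, text)

# Shortened, hypothetical sample in the same shape as the passage above
s = "HEAD,a:1}DC0DHEAD,b:2,string1:Error}D646HEAD,c:3}3AA8"
print(find_records(s, "string1"))  # ['HEAD,b:2,string1:Error}D646']
```

Because . does not match a newline by default, a record that spans several lines would need the re.DOTALL flag.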

Related

How can I add a string inside a string?

The problem is simple: I'm given a random string and a random pattern, and I'm told to find all the possible combinations of that pattern that occur in the string and mark them with [target] and [endtarget] at the beginning and end.
For example:
given the following text: "XuyZB8we4"
and the following pattern: "XYZAB"
The expected output would be: "[target]X[endtarget]uy[target]ZB[endtarget]8we4".
I already have the part that identifies all the matches, but I can't find a way to place the [target] and [endtarget] strings before and after each one (called match in the code).
import re
def tagger(text, search):
    place_s = "[target]"
    place_f = "[endtarget]"
    pattern = re.compile(rf"[{search}]+")
    matches = pattern.finditer(text)
    for match in matches:
        print(match)
    return test_string
test_string = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
pattern = "XYZAB"
print(tagger(test_string, pattern))
I also tried the for with the sub method, but I couldn't get it to work.
for match in matches:
    re.sub(match.group(0), place_s + match.group(0) + place_f, text)
return text
re.sub allows you to use backreferences to matched groups in the replacement string. So you need to enclose your pattern in parentheses (or create a named group), and re.sub will then replace all matches in the entire string at once with your desired replacement:
In [10]: re.sub(r'([XYZAB]+)', r'[target]\1[endtarget]', test_string)
Out[10]: 'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
With this approach, re.finditer is not needed at all.
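Applied to the tagger function from the question, this could look roughly like the sketch below. It is one possible version, not the only one: it uses re.escape so that characters such as ] or - in the search string are treated literally inside the character class, and it uses a replacement function instead of \1 so no capturing group is needed. The early return for an empty search string is an added assumption.

```python
import re

def tagger(text, search, place_s="[target]", place_f="[endtarget]"):
    """Wrap every run of characters taken from `search` in the two markers."""
    if not search:
        # An empty character class is not a valid regex, so return the text unchanged
        return text
    pattern = re.compile(rf"[{re.escape(search)}]+")
    # The replacement function receives each match object in turn
    return pattern.sub(lambda m: place_s + m.group(0) + place_f, text)

print(tagger("XuyZB8we4", "XYZAB"))
# [target]X[endtarget]uy[target]ZB[endtarget]8we4
```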

RE to extract middle of line with or without a particular word on the end

I'm trying to extract a string in the middle of a line with or without a particular word on the end. For example, this line:
START - some words and not THIS
should return "some words and not" and likewise, the line:
START - some words and not
should also return the same string. I've tried using lookahead from examples I've found with alternation for EOL, but adding the alternation returns a string ending with THIS. Here is the python regex:
[^-]*- (.+(?= THIS|$))
Removing |$ works, except when the line ends without THIS. The data I'm parsing has a small number of entries missing "THIS", so I need to account for both. What's the correct pattern for this?
You may use a lazy quantifier (.+?) as in
[^-]*- (.+?)(?:THIS|$)
See a demo on regex101.com.
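A quick way to check the lazy-quantifier approach on both kinds of line is sketched below. Note one assumption that differs from the pattern above: a space is placed before THIS inside the non-capturing group so that the captured text has no trailing space. The helper name middle is made up for this example.

```python
import re

# Lazy (.+?) stops at the first position where " THIS" or the end of the line follows
PATTERN = re.compile(r"[^-]*- (.+?)(?: THIS|$)")

def middle(line):
    """Return the text after '- ' and before an optional trailing ' THIS'."""
    m = PATTERN.match(line)
    return m.group(1) if m else None

print(middle("START - some words and not THIS"))  # some words and not
print(middle("START - some words and not"))       # some words and not
```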
Based on your example, the regex (?<=START - )(.*)(?=THIS) will capture "some words and not " (including the trailing space). Note that it only matches when THIS is present.
Hope it helps!
If I understand correctly, this should do the trick:
>>> import re
>>> regex = re.compile(r"(?!THIS)([^-]*- .+?)(THIS)?$")
>>> s1 = 'START - some words and not THIS'
>>> regex.match(s1).groups()
('START - some words and not ', 'THIS')
>>> s2 = 'START - some words and not '
>>> regex.match(s2).groups()
('START - some words and not ', None)

Removing square brackets from output and stopping regex after first match

Python noob here. I'm trying to fix two problems with my current code.
I'm trying to remove the square brackets from my list output
I can't figure out how to stop regex after the first match
For the first problem I've tried a number of different solutions but without success.
str()
"".join()
.replace("[]", "")
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = str((re.findall(r"([^.]*?%s[^.]*\.)" % define_words, page_content)))
I'm currently getting the following output
[apples001][][][][][apples002 apples003]
When I should be getting
apples001
Any help would be much appreciated and sorry about the messy code!
You can try the following:
import re
Test_String = "carrots apples001 carrots apples002 apples003"
Regex_Pattern = r'(apples\S\S\S).*'
match = re.findall(Regex_Pattern, Test_String)
print(''.join(match))
Instead of using re.findall, you could use re.search to search for the first location where the pattern produces a match.
To match the word apples and the following digits, you could use:
\bapples\d+\b
\b Word boundary to prevent being part of a larger word
apples\d+ Match apples followed by 1+ digits
\b Word boundary
Your code could look like:
import re
page_content = "carrots apples001 carrots apples002 apples003"
define_words = 'apples'
parsed_content = (re.search(r"\b%s\d+\b" % define_words, page_content).group())
print(parsed_content) # apples001
If define_words can start with a non-word character, you might use (?<!\S)%s\d+ instead, which asserts that the match is not preceded by a non-whitespace character.
parsed_content = (re.search(r"(?<!\S)%s\d+" % define_words, page_content).group())
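One thing to keep in mind is that re.search returns None when nothing matches, so calling .group() directly raises an AttributeError in that case. A small wrapper can guard against this. The sketch below is only an illustration: the name first_match is hypothetical, and re.escape is added on the assumption that the word should be matched literally.

```python
import re

def first_match(word, text):
    """Return the first occurrence of `word` followed by digits, or None."""
    # (?<!\S) requires start of string or whitespace immediately before the word
    m = re.search(rf"(?<!\S){re.escape(word)}\d+", text)
    return m.group() if m else None

page_content = "carrots apples001 carrots apples002 apples003"
print(first_match("apples", page_content))  # apples001
print(first_match("pears", page_content))   # None
```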

Python Find entire word in string using regex and user input

I'm trying to find an exact whole-word match using regex, where the word I'm searching for is a variable value coming from user input. I've tried this:
regex = r"\b(?=\w)" + re.escape(user_input) + r"\b"
if re.match(regex, string_to_search[i], re.IGNORECASE):
<some code>...
but it matches every occurrence of the string. It matches "var"->"var" which is correct but also matches "var"->"var"iable and I only want it to match "var"->"var" or "string"->"string"
Input: "sword"
String_to_search = "There once was a swordsmith that made a sword"
Desired output: Match "sword" to "sword" and not "swordsmith"
You seem to want a pattern that matches an entire string. Note that the \b word boundary is needed when you want to find partial matches. When you need a full string match, you need anchors. Since re.match anchors the match at the start of the string, all you need is $ (end of string position) at the end of the pattern:
regex = '{}$'.format(re.escape(user_input))
and then use
re.match(regex, search_string, re.IGNORECASE)
You can try re.finditer like that:
>>> import re
>>> user_input = "var"
>>> text = "var variable var variable"
>>> regex = r"(?=\b%s\b)" % re.escape(user_input)
>>> [m.start() for m in re.finditer(regex, text)]
[0, 13]
It'll find all matches iteratively.
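For the "sword" versus "swordsmith" example from the question, a whole-word search over a longer sentence could be sketched like this. The function name whole_word_positions is made up, and the \b anchors assume the user input starts and ends with a word character (letter, digit or underscore).

```python
import re

def whole_word_positions(word, text):
    """Return the start index of every whole-word, case-insensitive match."""
    # re.escape keeps any regex metacharacters in the user input literal
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

text = "There once was a swordsmith that made a sword"
positions = whole_word_positions("sword", text)
print(positions)  # only the final "sword" is reported, not "swordsmith"
```

Using re.search or re.finditer here matters: re.match only tries the pattern at the start of the string.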

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that what is called "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or string limits, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
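To make the pieces of this pattern easier to follow, here is the same expression written with re.VERBOSE and a comment per part. The function name collapse_repeated_words is only illustrative.

```python
import re

REPEATED_WORDS = re.compile(r"""
    (?<!\S)        # the match must begin at the start of a word
    (              # group 1: the two copies that are kept
      (\S+)        # group 2: the word itself
      (?:\s+\2)    # whitespace plus a second copy of the same word
    )
    (?:\s+\2)+     # one or more further copies, which are dropped
    (?!\S)         # the match must end at the end of a word
""", re.VERBOSE)

def collapse_repeated_words(text):
    """Reduce three or more consecutive copies of a word to two."""
    return REPEATED_WORDS.sub(r"\1", text)

print(collapse_repeated_words("bye! bye! bye!"))  # bye! bye!
```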
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
