Find sub-words in a string using regex - python

While trying to solve this challenge (pasted at the bottom of this question) using Python 3, the first of the two solutions below passes all test cases, while the second one doesn't. Since, in my eyes, they're doing pretty much the same thing, this leaves me very confused. Why doesn't the second block of code work?
It must be something very obvious because the second one fails most test cases, but having worked through custom-inputs, I still can't figure it out.
Working solution:
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
text = "\n".join(N)
for s in S:
    print(len(re.findall(r"(?<!\W)(?="+s.strip()+r"\w)", text)))
Broken "solution":
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
for s in S:
    total=0
    for string in N:
        total += len(re.findall("(?<!\W)(?="+s.strip()+"\w)", string))
    print(total)
We define a word character to be any of the following:
An English alphabetic letter (i.e., a-z and A-Z).
A decimal digit (i.e., 0-9).
An underscore (i.e., _, which corresponds to ASCII value 95).
We define a word to be a contiguous sequence of one or more word characters that is preceded and succeeded by one or more occurrences of non-word-characters or line terminators. For example, in the string I l0ve-cheese_?, the words are I, l0ve, and cheese_.
We define a sub-word as follows:
A sequence of word characters (i.e., English alphabetic letters,
digits, and/or underscores) that occur in the same exact order (i.e.,
as a contiguous sequence) inside another word.
It is preceded and succeeded by word characters only.
Given n sentences consisting of one or more words separated by non-word characters, process q queries where each query consists of a single string. To process each query, count the number of occurrences of the query string as a sub-word in all sentences, then print the number of occurrences on a new line.
Input Format
The first line contains an integer, n, denoting the number of sentences.
Each of the n subsequent lines contains a sentence consisting of words separated by non-word characters.
The next line contains an integer, q, denoting the number of queries.
Each of the q subsequent lines contains a query string to check.
Constraints
1 ≤ n ≤ 100
1 ≤ q ≤ 10
Output Format
For each query string, print the total number of times it occurs as a sub-word within all words in all sentences.
Sample Input
1
existing pessimist optimist this is
1
is
Sample Output
3
Explanation
We must count the number of times is occurs as a sub-word in our input sentence(s):
"is" occurs 1 time as a sub-word of existing.
"is" occurs 1 time as a sub-word of pessimist.
"is" occurs 1 time as a sub-word of optimist.
While "is" is a substring of the word this, it's followed by a blank space; because a blank space is non-alphabetic, non-numeric, and not an underscore, we do not count it as a sub-word occurrence.
While "is" is a substring of the word is in the sentence, we do not count it as a match because it is preceded and succeeded by non-word characters (i.e., blank spaces) in the sentence. This means it doesn't count as a sub-word occurrence.
Next, we sum the occurrences of "is" as a sub-word of all our words as 1+1+1+0+0=3. Thus, we print 3 on a new line.

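Write your patterns as raw strings. Without the r prefix, Python processes backslash escapes before the regex engine ever sees the pattern. \W and \w happen to survive because they are not recognized string escapes (recent Python versions warn about them), but a sequence such as \b would turn into a backspace character, and the pattern would not match as you expect.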
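To make the raw-string point concrete, here is a small sketch comparing the same pattern written with and without the r prefix. The sample text and variable names are made up for illustration; it uses \b because that is an escape Python does rewrite in a normal string literal.

```python
import re

text = "this is it"  # made-up sample text

# In a normal string literal "\b" is a single backspace character (ASCII 8),
# so the regex engine never receives a word-boundary token.
plain_pattern = "\bis\b"
# In a raw string the backslash is kept, so the regex engine receives \b.
raw_pattern = r"\bis\b"

plain_hits = re.findall(plain_pattern, text)
raw_hits = re.findall(raw_pattern, text)

print(len(plain_pattern), plain_hits)  # 4 []
print(len(raw_pattern), raw_hits)      # 6 ['is']
```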
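Since you are no longer searching one joined multiline string, each sentence is matched on its own. The negative lookbehind (?<!\W) also succeeds at the very start of a string, where there is no preceding character at all, so a query at the beginning of a sentence gets counted. Change it to a positive lookbehind, (?<=\w), which requires an actual word character before the match.
As Wiktor mentions in his comment, it is a good idea to escape s.strip() so that any characters that could be treated as regex metacharacters are taken literally. You can use re.escape(s.strip()) for that.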
Your code will work with this change:
total += len(re.findall(r"(?<=\w)(?=" + re.escape(s.strip()) + r"\w)", string))
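As a rough, self-contained sketch of the per-sentence approach, the snippet below wraps the corrected pattern in a helper and runs it on the sample input from the challenge. The function name count_subword is made up for illustration, and input parsing is left out.

```python
import re

def count_subword(sentences, query):
    """Count occurrences of query strictly inside words, one sentence at a time."""
    # (?<=\w) requires a word character before the match, and the trailing \w
    # inside the lookahead requires one after it. re.escape() makes the query
    # literal even if it contains regex metacharacters.
    pattern = re.compile(r"(?<=\w)(?=" + re.escape(query.strip()) + r"\w)")
    return sum(len(pattern.findall(sentence)) for sentence in sentences)

sentences = ["existing pessimist optimist this is"]
print(count_subword(sentences, "is"))  # 3, as in the sample output

# Why the negative lookbehind miscounts per sentence: at index 0 there is no
# preceding character, so (?<!\W) succeeds and a word-initial "is" is counted.
print(len(re.findall(r"(?<!\W)(?=is\w)", "island")))  # 1 (false positive)
print(len(re.findall(r"(?<=\w)(?=is\w)", "island")))  # 0
```

Because the match itself is zero-width (a lookbehind followed by a lookahead), overlapping occurrences are each counted once per starting position.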

Related

Regex for exactly phone number with any end

I want to use re.sub to change the phone number format inside a string, but I'm stuck on detecting the number.
I want to detect this format: ###-###-#### and change it to this one: (###)-###-####
My regex: (\d{3}\-)(\d{3}\-)(\d{4})$
My sub: (\1)-\2-\3
My regex can detect the number, but only when it sits at the very end of the string. It fails if the string ends like this: My number is 212-345-9999. In other words, it cannot detect a number that is followed by any other character. When I change my regex to (\d{3}\-)(\d{3}\-)(\d{4}), it also reformats a number like 123-456-78901, which is not something I want to detect as a phone number.
Help me
Just add the word boundary \b to the end of your regex pattern. It requires the last digit to be followed by a non-word character (space, period, etc.) or the end of the string, which rules out any additional trailing digits.
(\d{3}\-)(\d{3}\-)(\d{4})\b
But that will result in duplicate dashes, because the dashes are part of the captured groups and the replacement adds its own. Instead, leave the dash - out of the captured groups so that it isn't duplicated in the resulting string. So use this:
(\d{3})\-(\d{3})\-(\d{4})\b
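A minimal sketch of that pattern applied with re.sub inside longer strings, which is the use case from the question. The names PHONE_RE and reformat_phones are made up for illustration.

```python
import re

# Pattern from the answer above: three digit groups, dashes left outside the
# captures, and a trailing \b so no extra word character may follow.
PHONE_RE = re.compile(r"(\d{3})-(\d{3})-(\d{4})\b")

def reformat_phones(text):
    """Rewrite ###-###-#### as (###)-###-#### wherever it appears in text."""
    return PHONE_RE.sub(r"(\1)-\2-\3", text)

print(reformat_phones("My number is 212-345-9999."))
# My number is (212)-345-9999.
print(reformat_phones("Not a phone number: 123-456-78901"))
# Not a phone number: 123-456-78901
```

Note that this pattern only guards the right-hand side. A number with extra leading digits would still match from a later digit; adding a leading \b would address that, but that goes beyond what the answer proposes.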
If you want a stricter pattern to ensure that the string contains the indicated pattern only and nothing more, match the start and end of the string. Here, we optionally allow one trailing character \W, which matches any non-word character (anything other than a letter, digit, or underscore).
^(\d{3})\-(\d{3})\-(\d{4})\W?$
Just change \W? to \W* if you want to allow an arbitrary number of trailing non-word characters, e.g. 123-456-7890.,
Sample Run:
If you intend to only process the correctly-formatted numbers, then don't call re.sub() right away. First, check if there is a match via re.match():
import re
number_re = re.compile(r"^(\d{3})\-(\d{3})\-(\d{4})\W?$")
for num in [
    "123-456-7890",
    "123-456-78901",
    "123-456-7890.",
    "123-456-7890.1",
]:
    print(num)
    if number_re.match(num):
        print("\t", number_re.sub(r"(\1)-\2-\3", num))
    else:
        print("\tIncorrect format")
Output:
123-456-7890
    (123)-456-7890
123-456-78901
    Incorrect format
123-456-7890.
    (123)-456-7890
123-456-7890.1
    Incorrect format

Match sequence of words with regex

I have a list of strings and I want to extract only the item name from each one, including any spaces it contains.
The strings are in a column named 0; the index is just for reference.
For example, from each index line I want the following results:
Index - Expected result
0 - BOV BCONTRA
1 - BF PARAROLE C
2 - CUBINHOS DACE
... and so on.
Notice that in line 25 the desired result is not separated from the preceding numbers by a space.
There can be a dot . between the words, like in index line 30.
I've tried re.findall(r"\n\d{1,2} \d+(\b\w+\b)") with no success.
Also, re.findall(r"\n\d{1,2} \d+( ?\w+)") returns only the first word, and I want all the words, not only the first one.
The lines start with a \n char that is not printed in the list.
So basically you need all the upper-case strings in the text.
Try this expression, which captures runs of upper-case text with or without spaces:
re.findall('[A-Z]+[ A-Z]*', text)
It seems you want [A-Z .]+, not "words" (represented by r'\w'), bordered by integers. For ASCII text, \w is equivalent to [a-zA-Z0-9_], so it also matches digits and lower-case letters.
So the regex to use is: r'\d+ \d+([A-Z .]+)\d+'.
I don't know what you mean by a newline preceding each line. If you have a string with lines in it, it's perhaps better to split the input into lines with string.splitlines(), then do a regex match on each relevant line (re.match, so the regex only matches from the start of the line).
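The splitlines-plus-re.match suggestion could look roughly like the sketch below. The original data is not shown in the question, so the sample lines here (index, numeric code, item name, trailing number) are an assumed layout invented for illustration, as are the names ITEM_RE and item_names.

```python
import re

# Hypothetical sample lines: the real data is not shown, so this layout
# (index, code, item name, trailing number) is an assumption.
text = """
0 7001 BOV BCONTRA 2
1 7002 BF PARAROLE C 1
25 7003CUBINHOS DACE 3
30 7004 Q.MINAS FRESCAL 5
"""

# Regex from the answer above: two integers, then upper-case letters, spaces
# and dots (captured), then another integer.
ITEM_RE = re.compile(r"\d+ \d+([A-Z .]+)\d+")

def item_names(block):
    names = []
    for line in block.splitlines():
        match = ITEM_RE.match(line)  # re.match anchors at the start of the line
        if match:
            # The capture can include surrounding spaces, so strip them.
            names.append(match.group(1).strip())
    return names

print(item_names(text))
# ['BOV BCONTRA', 'BF PARAROLE C', 'CUBINHOS DACE', 'Q.MINAS FRESCAL']
```

Splitting first also sidesteps the leading \n: empty lines simply fail to match and are skipped.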

Calculate the index of the nth word in a string

Given the index of a word in a string starting at zero ("index" is position two in this sentence), and a word being defined as that which is separated by whitespace, I need to find the index of the first char of that word.
My whitespace regex pattern is "( +|\t+)+", just to cover all my bases (except new line chars, which are excluded). I used split() to separate the string into words, and then summed the lengths of each of those words. However, I need to account for the possibility that more than one whitespace character is used between words, so I can't simply add the number of words minus one to that figure and still be accurate every time.
Example:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
Change your regular expression to include the whitespace around each word to prevent it from being lost. The expression \s*\S+\s* will first consume leading whitespace, then the actual word, then trailing spaces, so only the first word in the resulting list might have leading spaces (if the string itself started with whitespace). The rest consist of the word itself potentially followed by whitespace. After you have that list, simply find the total length of all the words before the one you want, and account for any leading spaces the string may have.
import re

def get_word_index(s, idx):
    words = re.findall(r'\s*\S+\s*', s)
    # Length of everything before the word, plus any leading whitespace of the word itself
    return sum(map(len, words[:idx])) + len(words[idx]) - len(words[idx].lstrip())
Testing:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
>>> example2 = ' ' + example
>>> get_word_index(example2, 2)
9
Maybe you could try with:
your_string.index(your_word)
documentation
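As an alternative sketch (not the method from either answer above), re.finditer can report each word's offset directly via match.start(), which avoids summing lengths. The function name word_start is made up for illustration. The last line shows why a plain str.index lookup by word text can point at the wrong place when the word also appears inside an earlier word.

```python
import re

def word_start(s, idx):
    """Return the index of the first character of the idx-th word (0-based)."""
    # Each match of \S+ is one whitespace-delimited word; match.start() is its
    # offset in the original string, so runs of spaces or tabs need no
    # special handling.
    for position, match in enumerate(re.finditer(r"\S+", s)):
        if position == idx:
            return match.start()
    raise IndexError("the string has fewer than idx + 1 words")

example = "This is an example sentence"
print(word_start(example, 2))        # 8
print(word_start(" " + example, 2))  # 9
print(example.index("is"))           # 2: str.index finds "is" inside "This"
```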

Regex for a string which contains only number with a length of two and has one white space between them

I am working on input validation and need a regex that accepts only two numbers, each with a maximum length of 2 digits, separated by exactly one white space.
Regex for Python
import re
pattern = "^[0-9_ ]{2}$"
check = "01 03"
a = re.match(pattern, check)
if a is None:
    print('Not valid value')
else:
    print("valid value")
The output I get is "Not valid value". What am I doing wrong here?
You're repeating a character set with {2}, which matches exactly two of the preceding token. Combined with the ^ and $ anchors, there will only be a match if the string consists of exactly two characters, and "01 03" has five.
Instead, use the character set [0-9]{1,2} to match one or two digits, followed by a space, followed by that repeated character set again:
[0-9]{1,2} [0-9]{1,2}$
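A minimal validation sketch built on that pattern. It uses re.fullmatch instead of re.match with a trailing $, which is a choice made here rather than something the answer prescribes; the names PAIR_RE and is_valid_pair are made up for illustration.

```python
import re

# One or two digits, a single space, one or two digits.
PAIR_RE = re.compile(r"[0-9]{1,2} [0-9]{1,2}")

def is_valid_pair(value):
    # fullmatch requires the whole string to match, so no anchors are needed.
    return PAIR_RE.fullmatch(value) is not None

for value in ["01 03", "1 3", "0103", "01  03", "123 4"]:
    print(repr(value), is_valid_pair(value))
# '01 03' True
# '1 3' True
# '0103' False
# '01  03' False
# '123 4' False
```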

Generate regex for exact words from a list

I am trying to write a regex that can match any word in the following list, or similar words. The * in these strings is a literal asterisk, not a wildcard for any character.
Jump
J**p
J*m*
J***
***p
J***ing
J***ed
****ed
I want to keep the length fixed.
1. Any string of length 4 that matches the string 'jump'
2. Any string of length 6 that matches 'jumped'
3. Any string of length 7 that matches 'jumping'
I was using the following statements, but for some reason I am not able to get the correct translation. It accepts other strings as well.
p = re.compile('j|\*)(u|\*)(m|\*)...)
bool(p.match('******g'))
This is a fairly straightforward regex. We want to match a word, but allow each character to be an asterisk. The regex is therefore a sequence of character classes of the form [x*]:
[Jj*][u*][m*][p*](?:[i*][n*][g*]|[e*][d*])?
See it in action at regex101.
If you only want to match these exact words, make sure to use the pattern with re.fullmatch.
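Since the question asks about generating such a regex from a list, here is one possible sketch that builds a [x*] class per character for each word and joins the words with alternation. The function name masked_word_regex is made up, and matching case-insensitively with re.IGNORECASE is an assumption (the answer's pattern only allows J or j for the first letter).

```python
import re

def masked_word_regex(words):
    """Build a regex where every character of every word may also be '*'."""
    alternatives = []
    for word in words:
        # One character class per letter: the letter itself or a literal '*'.
        # re.escape keeps characters such as ']' or '\' safe inside the class.
        alternatives.append("".join("[" + re.escape(ch) + "*]" for ch in word))
    # Assumption: case-insensitive matching, so "Jump" and "jump" both pass.
    return re.compile("|".join(alternatives), re.IGNORECASE)

pattern = masked_word_regex(["jump", "jumped", "jumping"])

for candidate in ["Jump", "J**p", "J***ing", "****ed", "******g", "Jumps", "Bump"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
# Jump True
# J**p True
# J***ing True
# ****ed True
# ******g True
# Jumps False
# Bump False
```

Using fullmatch keeps the length fixed per word, so a string only passes if it lines up character by character with one of the listed words.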
