Match sequence of words with regex - python

I have a list of strings and I want to extract only the item name from each one, including any spaces it contains.
The strings are in a column named 0; the index is just for reference.
For example, from each index line I want the following results:
Index - Expected result
0 - BOV BCONTRA
1 - BF PARAROLE C
2 - CUBINHOS DACE
... and so on.
Notice that in line 25 the desired result is not separated from the preceding numbers by a space.
There can also be a dot (.) between the words, as in index line 30.
I've tried re.findall(r"\n\d{1,2} \d+(\b\w+\b)") with no success.
Also re.findall(r"\n\d{1,2} \d+( ?\w+)") brings me only the first word, and I want all the words, not only the first one.
The lines start with a \n character that is not shown in the list.

So basically you need all the uppercase strings in the text.
Try this expression, which captures the uppercase text with or without spaces:
re.findall('[A-Z]+[ A-Z]*', text)
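As a quick check of that expression, here is a small sketch. The sample text is made up for illustration (the question's real data is not shown), and the .strip() call is an addition that drops the trailing space the character class picks up before the next number.

```python
import re

# Hypothetical sample text; the real input is not shown in the question.
text = "\n0 7891 BOV BCONTRA 1\n1 7891BF PARAROLE C 2"

# [ A-Z]* also consumes the space before the next number, so strip it.
names = [m.strip() for m in re.findall(r"[A-Z]+[ A-Z]*", text)]
print(names)
```

Note that this character class does not include the dot, so a name such as CUB.DACE would come back as two separate pieces.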

It seems you want [A-Z .]+, not "words" (represented by r'\w'), bordered by integers. \w maps to [a-zA-Z0-9_] (plus other Unicode word characters for str patterns in Python 3).
The regex you want is r'\d+ \d+([A-Z .]+)\d+'.
I don't know what you mean by a newline preceding each line. If you have a single string with lines in it, it is probably better to split the input into lines with str.splitlines() and then run a regex match on each relevant line (use re.match so the regex only matches from the start of the line).
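The split-then-match approach described above could look roughly like this. The sample lines are invented for illustration (index, a number, the name, a trailing number), and the .strip() on the captured group is an addition to remove the spaces around the name.

```python
import re

# Hypothetical sample; the leading "\n" mimics the question's description.
text = "\n0 7891 BOV BCONTRA 1\n25 7891BF PARAROLE C 2\n30 123 CUB.DACE 3"

pattern = re.compile(r"\d+ \d+([A-Z .]+)\d+")

names = []
for line in text.splitlines():
    m = pattern.match(line)  # re.match only matches at the start of the line
    if m:                    # the empty first line simply does not match
        names.append(m.group(1).strip())
print(names)
```

Because [A-Z .] includes the space and the dot, names with several words or an embedded dot stay in one piece.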

Related

regex select sequences that start with specific number

I want to select all character strings that begin with 0
x= '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'
I have
re.findall(r'\b0\S*', x)
but this returns
['0,39', '0,1,1,755,300', '0,1,1,755,50']
I want
['0,1,1,755,300', '0,1,1,755,50']
The problem is that \b matches the boundaries between digits and commas too. The simplest way might be not to use a regex at all:
thingies = [thingy for thingy in x.split() if thingy.startswith('0')]
Instead of using the boundary \b which will match between the comma and number (between any word [a-zA-Z0-9_] and non word character), you will want to match on start of string or space like (^|\s).
(^|\s)0\S*
https://regex101.com/r/Mrzs8a/1
This matches the start of the string or a whitespace character preceding the target string. It also includes that whitespace character in the match when present, so I would suggest either trimming the matched string or wrapping the latter part in parentheses to make it a group and then taking group 1 from the matches, like:
(?:^|\s)(0\S*)
https://regex101.com/r/Mrzs8a/2
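Both suggestions can be tried side by side on the string from the question. This is only a sketch; the variable names by_split and by_regex are made up here. With a single capturing group, re.findall returns the group contents, so the leading space is not part of the result.

```python
import re

x = '1,1,1075 1,0,39 2,4,1,22409 0,1,1,755,300 0,1,1,755,50'

# No regex: split on whitespace and keep the tokens that start with '0'.
by_split = [token for token in x.split() if token.startswith('0')]

# Regex: require start of string or whitespace before the 0, capture the rest.
by_regex = re.findall(r'(?:^|\s)(0\S*)', x)

print(by_split)
print(by_regex)
```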

Find sub-words in a string using regex

In trying to solve this challenge (which I pasted at the bottom of this question) using Python 3, the first of the two proposed solutions below, passes all test cases, while the second one doesn't. Since, in my eyes, they're doing pretty much the same, this leaves me very confused. Why doesn't the second block of code work?
It must be something very obvious because the second one fails most test cases, but having worked through custom-inputs, I still can't figure it out.
Working solution:
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
text = "\n".join(N)
for s in S:
    print(len(re.findall(r"(?<!\W)(?="+s.strip()+r"\w)", text)))
Broken "solution":
import re
import sys
lines = sys.stdin.readlines()
n=int(lines[0])
q=int(lines[n+1])
N=lines[1:n+1]
S=lines[n+2:]
for s in S:
    total=0
    for string in N:
        total += len(re.findall("(?<!\W)(?="+s.strip()+"\w)", string))
    print(total)
We define a word character to be any of the following:
An English alphabetic letter (i.e., a-z and A-Z).
A decimal digit (i.e., 0-9).
An underscore (i.e., _, which corresponds to ASCII value 95).
We define a word to be a contiguous sequence of one or more word characters that is preceded and succeeded by one or more occurrences of non-word-characters or line terminators. For example, in the string I l0ve-cheese_?, the words are I, l0ve, and cheese_.
We define a sub-word as follows:
A sequence of word characters (i.e., English alphabetic letters,
digits, and/or underscores) that occur in the same exact order (i.e.,
as a contiguous sequence) inside another word.
It is preceded and succeeded by word characters only.
Given n sentences consisting of one or more words separated by non-word characters, process q queries where each query consists of a single string. To process each query, count the number of occurrences of the query string as a sub-word in all sentences, then print the number of occurrences on a new line.
Input Format
The first line contains an integer, n, denoting the number of sentences.
Each of the n subsequent lines contains a sentence consisting of words separated by non-word characters.
The next line contains an integer, q, denoting the number of queries.
Each of the q subsequent lines contains a string to check.
Constraints
1 ≤ n ≤ 100
1 ≤ q ≤ 10
Output Format
For each query string, print the total number of times it occurs as a sub-word within all words in all sentences.
Sample Input
1
existing pessimist optimist this is
1
is
Sample Output
3
Explanation
We must count the number of times "is" occurs as a sub-word in our input sentence(s):
"is" occurs 1 time as a sub-word of existing.
"is" occurs 1 time as a sub-word of pessimist.
"is" occurs 1 time as a sub-word of optimist.
While "is" is a substring of the word this, it's followed by a blank space; because a blank space is non-alphabetic, non-numeric, and not an underscore, we do not count it as a sub-word occurrence.
While "is" is a substring of the word is in the sentence, we do not count it as a match because it is preceded and succeeded by non-word characters (i.e., blank spaces) in the sentence. This means it doesn't count as a sub-word occurrence.
Next, we sum the occurrences of "is" as a sub-word of all our words as 1+1+1+0+0=3. Thus, we print 3 on a new line.
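Write the pattern pieces as raw strings so that the backslashes reach the regex engine unchanged. Python happens to leave unrecognized escapes such as \W and \w intact in ordinary strings, but it warns about them, and other sequences such as \b would be silently converted.
The real difference is that you are no longer searching inside one multiline string. At the start of each separate string there is no preceding character at all, so the negative lookbehind (?<!\W) succeeds and a query at the very start of a sentence gets counted. You'll want to change the negative lookbehind to a positive one that requires a preceding word character: (?<=\w)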
As Wiktor mentions in his comment, it would be a good idea to escape s.strip so that any potential chars that could be treated as regex metachars will be escaped and taken literally. You can use re.escape(s.strip()) for that.
Your code will work with this change:
        total += len(re.findall(r"(?<=\w)(?=" + re.escape(s.strip()) + r"\w)", string))
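To see the corrected pattern in isolation, here is a minimal sketch that wraps it in a helper and runs it on the sample input from the challenge. The function name count_subwords is made up for this example, and reading from stdin is left out.

```python
import re

def count_subwords(sentences, query):
    """Count positions where query starts with a word character on both sides."""
    # Positive lookbehind: a word character must precede the query.
    # Lookahead: the query must be followed by a word character.
    pattern = re.compile(r"(?<=\w)(?=" + re.escape(query) + r"\w)")
    return sum(len(pattern.findall(sentence)) for sentence in sentences)

sentences = ["existing pessimist optimist this is"]
print(count_subwords(sentences, "is"))  # 3
```

Because both conditions are zero-width assertions, overlapping occurrences are counted separately, and each sentence can be searched on its own without joining them.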

Match only list of words in string

I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks
You can use ^...$, where ^ and $ mark the beginning and the end of the string you want to match. Also note that you need to add (...) around your whole group of words for the {1,3} to apply to all of them:
^((investment)|(property)|(something)|(else)){1,3}$
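In Python, the same idea can be sketched by building the alternation from the word list. The helper name only_listed_words is hypothetical, and re.fullmatch is used here in place of writing ^ and $ explicitly; it requires the whole string to match.

```python
import re

words = ["investment", "property", "something", "else"]  # example word list

# Escape each word and join with |, then allow 1 to 3 repetitions of the group.
alternation = "|".join(re.escape(w) for w in words)
pattern = re.compile(r"(?:" + alternation + r"){1,3}")

def only_listed_words(s):
    # True only if the entire string is made of 1 to 3 listed words.
    return pattern.fullmatch(s) is not None

for s in ["investmentproperty", "abcinvestmentproperty", "somethinginvestmentproperty"]:
    print(s, only_listed_words(s))
```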

Regex sub phone number format multiple times on same string

I'm trying to use regular expressions to modify the format of phone numbers in a list.
Here is a sample list:
["(123)456-7890 (321)-654-0987",
"(111) 111-1111",
"222-222-2222",
"(333)333.3333",
"(444).444.4444",
"blah blah blah (555) 555.5555",
"666.666.6666 random text"]
Every valid number has either a space OR start of string character leading, AND either a space OR end of string character trailing. This means that there can be random text in the strings, or multiple numbers on one line. My question is: How can I modify the format of ALL the phone numbers with my match pattern below?
I've written the following pattern to match all valid formats:
p = re.compile(r"""
(((?<=\ )|(?<=^)) #space or start of string
((\([0-9]{3}\))|[0-9]{3}) #Area code
(((-|\ )?[0-9]{3}-[0-9]{4}) #based on '-'
| #or
((\.|\ )?[0-9]{3}\.[0-9]{4})) #based on '.'
(?=(\ |$))) #space or end of string
""", re.X)
I want to modify the numbers so they adhere to the format:
\(\d{3}\)\d{3}-\d{4} #Ex: (123)456-7890
I tried using re.findall, and re.sub but had no luck. I'm confused on how to deal with the circumstance of there being multiple matches on a line.
EDIT: Desired output:
["(123)456-7890 (321)654-0987",
"(111)111-1111",
"(222)222-2222",
"(333)333-3333",
"(444)444-4444",
"blah blah blah (555)555-5555",
"(666)666-6666 random text"]
Here's a simpler solution that works for all of those cases, though it is a little naïve (it doesn't check that the brackets are balanced).
\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\1)\2-\3
Explanation:
It first checks for 3 digits, optionally surrounded by brackets on either side, with \(?(\d{3})\)?. Notice that the 3 digits are in a capturing group.
Next, it checks for an optional separator character (space, dot, or hyphen) and then another 3 digits, also stored in a capturing group: [ .-]?(\d{3}).
Lastly, it repeats the previous step with 4 digits instead of 3: [ .-]?(\d{4})
Python:
To use it in Python, you should just be able to iterate over each element in the list and do:
p.sub('(\\1)\\2-\\3', myString) # Note the double backslashes, or...
p.sub(r'(\1)\2-\3', myString) # Raw strings work too
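Put together, a runnable sketch on the sample list from the question might look like this. The names numbers, phone, and formatted are chosen for this example, and the separator class is written as [ .-] so that the hyphen is taken literally rather than as a range.

```python
import re

numbers = ["(123)456-7890 (321)-654-0987",
           "(111) 111-1111",
           "222-222-2222",
           "(333)333.3333",
           "(444).444.4444",
           "blah blah blah (555) 555.5555",
           "666.666.6666 random text"]

# Optional brackets around the area code, optional separators between groups.
phone = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")

# sub() replaces every match in a string, so multiple numbers per line are fine.
formatted = [phone.sub(r"(\1)\2-\3", s) for s in numbers]
for s in formatted:
    print(s)
```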
EDIT
This solution is a bit more complex, and ensures that if there is a close bracket, there must be a start bracket.
(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})
Replace with:
(\2)\3-\4
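Python's re module supports the conditional group syntax (?(1)...|...) used here, so the stricter pattern can be tried directly. This is only a sketch on two made-up inputs; the name strict is hypothetical, and the separator class is written as [ .-] so the hyphen is literal.

```python
import re

# If group 1 (an opening bracket) matched, the area code must be followed by ")";
# otherwise it must not be followed by ")".
strict = re.compile(
    r"(\()?((?(1)\d{3}(?=\))|\d{3}(?!\))))\)?[ .-]?(\d{3})[ .-]?(\d{4})"
)

converted = strict.sub(r"(\2)\3-\4", "(123)456-7890 222.222.2222")
unbalanced = strict.search("123)456-7890")  # closing bracket without an opening one

print(converted)
print(unbalanced)
```

Since the opening bracket now occupies group 1, the area code and the two remaining digit groups move to groups 2, 3, and 4 in the replacement.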

how to use python re to match a string only with several specific characters?

I want to search the DNA sequences in a file, the sequence contains only [ATGC], 4 characters.
I try this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for every line that contains at least 1 of the characters A, T, G, C.
So how do I find the lines that contain only those 4 characters and nothing else?
Sorry for mis-describing my question. I'm not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters A, T, C, G.
Thanks
Currently your regex is matching against any part of the line. Using ^ $ signs you can force the regex to perform against the whole line having the four characters.
m=re.search('(^[ATGC]+$)',line_in_file)
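As a small illustration of the anchored pattern, the sketch below filters a list of made-up lines, some with a trailing newline as they would have when read from a file. The names lines, dna_only, and hits are hypothetical.

```python
import re

# Hypothetical lines, as they might come from iterating over a file.
lines = ["ATGCCATG\n", "STUPID\n", "ATGXCA\n", "GATTACA"]

# ^ and $ force the character class to cover the whole line;
# $ also matches just before a trailing newline.
dna_only = re.compile(r"^[ATGC]+$")

hits = [line.strip() for line in lines if dna_only.search(line)]
print(hits)
```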
From your clarification above:
If you want to match a sequence like this AAAGGGCCCCCCT with the order AGCT then the regex will be:
(A+G+C+T+)
The square brackets in your search string tell the regex compiler to match any one of the letters in the set, not the full string. Remove the square brackets, and move the + outside your parentheses.
m=re.search('(ATGC)+',a)
EDIT:
According to your comment, this won't match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.
EDIT2:
To match "ATGCCATG" but not "STUPID", try
re.search("[^ATGC]", seq)
Then check for NO match, rather than a match.
The regex hits if there is any character NOT in [ATGC], so you exclude the strings that match.
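A minimal sketch of that negated check, assuming it is done with re.search and the negated character class [^ATGC]; the function name is_dna is made up for this example.

```python
import re

def is_dna(seq):
    # A hit means some character outside A/T/G/C exists, so the string is rejected.
    return re.search(r"[^ATGC]", seq) is None

print(is_dna("ATGCCATG"))
print(is_dna("STUPID"))
```

One design consequence of testing for the absence of bad characters is that an empty string passes the check; add a length test if that matters for your data.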
A slight modification:
import re

def DNAcheck(dna):
    y = dna.upper()
    print(y)
    # The whole sequence must consist of A/T/G/C only
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1
If the entire sequence is composed only of A/T/G/C, the code above returns 2; otherwise it returns 1.
