Using regular expressions to find a pattern

If I have a file that consists of sentences like this:
1001 apple
1003 banana
1004 grapes
1005
1007 orange
Now I want to detect and print all lines that have a number but no corresponding text (e.g. 1005). How can I design a regular expression to find such lines? I find them a bit confusing to construct.
res = []
with open("fruits.txt", "r") as f:
    for fruit in f:
        res.append(fruit.strip().split())
Would it be something like this: re.sub("10**"/.")

Well, you don't need a regular expression for this:
with open("fruits.txt", "r") as f:
    res = [int(line.strip()) for line in f if len(line.split()) == 1]

A regex that would detect a number, then a space, then a word (\w matches letters, digits, and the underscore) is ([0-9])+[ ]\w+.
A good resource for trying that stuff out is http://regexr.com/

The re pattern for this would be "^[0-9][0-9][0-9][0-9]$", used with something like re.match("^[0-9][0-9][0-9][0-9]$", line.strip()). This checks whether the line holds exactly four digits and nothing else, so it will find your 1005.
Hope this helps!

There are two ways to go about this: search() and findall(). The former will find the first instance of a match, and the latter will give a list of every match.
In any case, the regex you want to use is "^\d{4}$". It's a simple regex which matches a 4-digit number that takes up the entirety of a string, or, in multiline mode, a line. So, to find 'only number' sections, you will use the following code:
import re
# assume 'func' is set to either re.search or re.findall, whichever you prefer
with open("fruits.txt", "r") as f:
    solo = func(r"^\d{4}$", f.read(), re.MULTILINE)
# 'solo' now has either the first 'non-labeled' number (as a match object),
# or a list of all such numbers in the file, depending on
# the function you used. search() will return None if there
# are no such numbers, and findall() will return an empty list.
# if you prefer brevity, re.MULTILINE is equivalent to re.M
Additional explanation of the regex:
^ matches at the beginning of the line.
\d is a special sequence which matches any numeric digit.
{4} matches the prior element (\d) exactly four times.
$ matches at the end of the line.
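As a quick check of the pattern above, here is a minimal sketch that runs both functions over the sample data from the question. The data is held in a string (sample, a name made up for this example) instead of fruits.txt so that the snippet runs on its own.

```python
import re

# sample data from the question, inlined instead of reading fruits.txt
sample = "1001 apple\n1003 banana\n1004 grapes\n1005\n1007 orange\n"

# findall returns every line that consists of exactly four digits
unlabeled = re.findall(r"^\d{4}$", sample, re.MULTILINE)
print(unlabeled)  # ['1005']

# search returns a match object for the first such line (or None)
first = re.search(r"^\d{4}$", sample, re.MULTILINE)
print(first.group(0))  # 1005
```

Without re.MULTILINE, ^ and $ would only anchor at the start and end of the whole string, and nothing would be found here.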

Please try:
(?:^|\s+)(\d{4}\b)(?!\s.*\w+)

Related

The Behavior of Alternative Match "|" with .* in a Regex

I have seldom used | together with .* before. Today I used both of them together and found some of the results really confusing. The expression I used is as follows (in Python):
>>> s = "abcdefg"
>>> re.findall(r"((a.*?c)|(.*g))",s)
[('abc', 'abc', ''), ('defg', '', 'defg')]
The first match is all right, but the second is not what I expected: I expected it to be "abcdefg" (the whole string).
Then I reverse the two alternatives:
>>> re.findall(r"(.*?g)|(a.*?c)",s)
[('abcdefg', '')]
It seems that the regex engine only reads the string once: when the whole string is consumed by the first alternative, the engine stops and never checks the second alternative. In the first case, however, the first alternative only consumes "a" to "c", so "d" to "g" are still left in the string, and they match ".*g" in the second alternative. Have I got it right? Also, for an expression with alternatives, does the regex engine check the first alternative first and, if it matches, never check the second one?
Besides, if I want to get both "abc" and "abcdefg", or "abc" and "bcde" (results that overlap), what expression should I use?
Thank you so much!

You cannot have two matches starting from the same location in the regex (the only regex flavor that does it is Perl6).
In re.findall(r"((a.*?c)|(.*g))",s), re.findall will grab all non-overlapping matches in the string, and since the first one starts at the beginning, ends with c, the next one can only be found after c, within defg.
The (.*?g)|(a.*?c) regex matches abcdefg because the regex engine parses the string from left to right, and .*? will get any 0+ chars as few as possible but up to the first g. And since g is the last char, it will match and capture the whole string into Group 1.
To get abc and abcdefg, you may use, say
(a.*?c)?.*g
Python demo:
import re
rx = r"(a.*?c)?.*g"
s = "abcdefg"
m = re.search(rx, s)
if m:
    print(m.group(0)) # => abcdefg
    print(m.group(1)) # => abc
It might not be exactly what you want, but it should give you a hint: you match the bigger part, and capture a subpart of the string.

Re-read the docs for the re.findall method.
findall "return[s] all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found."
Specifically, non-overlapping matches, and left-to-right. So if you have a string abcdefg and one pattern will match abc, then any other patterns must (1) not overlap; and (2) be further to the right.
It's perfectly valid to match abc and defg per the description. It would be a bug to match abc and abcdefg or even abc and cdefg because they would overlap.
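To make the non-overlapping rule concrete, and to show one possible way around it, the sketch below wraps each alternative in a lookahead. Lookaheads consume no characters, so both groups can capture text that starts at the same position. This is an illustration that neither answer above proposes, and the variable names are made up for the example.

```python
import re

s = "abcdefg"

# findall never reuses characters: the second match starts after "abc"
plain = re.findall(r"a.*?c|.*g", s)
print(plain)  # ['abc', 'defg']

# zero-width lookaheads capture without consuming, so both groups
# can start at position 0 and overlap
m = re.search(r"(?=(a.*?c))(?=(.*g))", s)
print(m.groups())  # ('abc', 'abcdefg')
```

The overall match of the second pattern is empty; only the captured groups carry the text.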

Extracting numbers from a text file using regexp

I am trying to write a Python script that reads a text file input.txt, scans it for phone numbers, and writes all matching phone numbers to output.txt.
Let's say the text file is like:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc#xyz.com
It should match 1234567890, +91-1234567890 and +91 5645454545
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
The regexp in no works like this: an optional country code of up to 3 digits, followed by an optional - or space, and then a 10-digit number.

Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ signifies that you want to match only at the beginning of the string, but you want to match multiple times throughout the string.
For Python, you'll also need to make the group non-capturing with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
Note the last sentence: when the pattern contains groups, findall returns the groups rather than the whole match.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall(r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
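To see why the non-capturing group matters, here is a small sketch that runs both variants of the pattern on the sample text from the question (shortened slightly). The variable names are chosen for this example only.

```python
import re

text = ("Hey my number is 1234567890 and another number is +91-1234567890. "
        "but if none of these is available you can call me on +91 5645454545")

# capturing group: findall returns only what the group captured
with_group = re.findall(r'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
print(with_group)  # ['', '+91-', '+91 ']

# non-capturing group: findall returns each whole match
without_group = re.findall(r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
print(without_group)  # ['1234567890', '+91-1234567890', '+91 5645454545']
```

The empty string in the first result comes from the number without a country code, where the optional group did not take part in the match.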

This code will work:
import re
no = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}'  # non-capturing group, no ^ anchor
with open('input.txt') as f1, open('output.txt', 'w') as f2:
    for line in f1:
        for number in re.findall(no, line):
            f2.write(number + '\n')
The output will be:
1234567890
+91-1234567890
+91 5645454545

you can use
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}

pattern = r'\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern, text)
Output: ['1234567890', '+91-1234567890', '+91 5645454545']

How to use Python re to match a string containing only several specific characters?

I want to search for DNA sequences in a file. A sequence contains only the 4 characters [ATGC].
I tried this pattern:
m=re.search('([ATGC]+)',line_in_file)
but it gives me hits for every line that contains at least one of the characters ATGC.
So how do I search for lines that contain only those 4 characters and nothing else?
Sorry for mis-describing my question. I'm not looking for an exact match of ATGC as a word, but for a string containing only the 4 characters ATGC.
Thanks

Currently your regex matches against any part of the line. Using the ^ and $ anchors you can force the regex to match the whole line, so that the line may contain only the four characters.
m=re.search('(^[ATGC]+$)',line_in_file)
From your clarification above:
If you want to match a sequence like this AAAGGGCCCCCCT with the order AGCT then the regex will be:
(A+G+C+T+)

The square brackets in your search string tell the regex compiler to match any of the letters in the set, not the full string. Remove the square brackets, and move the + outside your parens.
m=re.search('(ATGC)+',a)
EDIT:
According to your comment, this won't match the pattern you actually want, just the one I thought you wanted. I can edit again once I understand the actual pattern.
EDIT2:
To match "ATGCCATG" but not "STUPID" try,
re.search("[^ATGC]", s)
Then check for a NOT match, rather than a match.
The regex will hit if there are any characters NOT in [ATGC], so you exclude the strings that match.

A slight modification:
def DNAcheck(dna):
    y = dna.upper()
    print(y)
    if re.match("^[ATGC]+$", y):
        return 2
    else:
        return 1
If the entire sequence is composed of only A/T/G/C, the code above returns 2; otherwise it returns 1.
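As an alternative to writing the ^ and $ anchors by hand, re.fullmatch (available since Python 3.4) only succeeds when the pattern covers the entire string. The sketch below filters a list of hypothetical sample lines; in practice they would come from the file in the question.

```python
import re

# hypothetical sample lines; in practice these would come from the file
lines = ["ATGCCATG", "STUPID", "atgc", "ATG CAT", ""]

dna_only = re.compile(r"[ATGC]+")

# fullmatch succeeds only if the pattern covers the entire string
sequences = [line for line in lines if dna_only.fullmatch(line)]
print(sequences)  # ['ATGCCATG']
```

Lowercase input is rejected here; pass re.IGNORECASE to re.compile, or upper-case the line first, if that is not wanted.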

Find comma space year but ignore comma year without space

I am trying to read in a file and print every occurrence of a comma, a space, and a year. For example, if it finds , 2003 it should print that out, but if it finds ,2003 it should ignore it. I originally used split and was able to match the year, but when I added the comma I realized that split treats them as two different words, so I don't think that would work.
Here is my code:
import string
import re
while True:
    filename=raw_input('Enter a file name: ')
    if filename == 'exit':
        break
    try:
        file = open(filename, 'r')
        text=file.read()
        file.close()
    except:
        print('file does not exist')
    else:
        p=re.compile('^\,\s(19|20)\d\d$')  # this is my regular expression
        print(text)
        m=p.search(text)
        if m:
            print(m.groups())

If you want to search the file for the regex rather than match the entire file contents, remove ^ and $ from the regex.
If you want more than one match per file, use finditer or findall instead of search.
Use raw string when specifying the regex: p=re.compile(r',\s(19|20)\d\d')
Example:
for m in re.finditer(r',\s((19|20)\d\d)', text):
    print(m.group(1))

>>> import re
>>> text = "foo bar, 2003, 2006,1923, derp"
>>> p = re.compile(r',\s((?:19|20)\d\d)')
>>> p.findall(text)
['2003', '2006']
Simplified example. First of all, remove the anchors (^ and $) and use findall instead of search to find all matches. I also used ?: to designate a non-capturing group (it won't show up in the results) and made the year a group instead.

If you just add a * to the \s in your regex, it will match zero or more whitespace characters instead of exactly one. If you only want it to match zero or one, add a ? instead (a + means one or more). Note that both * and ? also let ,2003 match, which the question wants to ignore.
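For reference, the sketch below compares how the different quantifiers on \s behave. The test string and the names text, year and results are made up for this example; the year sub-pattern is the one used in the answers above.

```python
import re

# one space, no space, and two spaces after the comma
text = "a, 2003 b,1923 c,  2010"
year = r"((?:19|20)\d\d)"

# \s  = exactly one, \s* = zero or more, \s? = zero or one, \s+ = one or more
results = {sep: re.findall("," + sep + year, text)
           for sep in (r"\s", r"\s*", r"\s?", r"\s+")}

for sep, found in results.items():
    print(sep, found)
```

Only \s and \s+ skip the ,1923 case that the question wants ignored; \s+ additionally accepts several spaces after the comma.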

Regex in Python. NOT matches

I'll get straight to the point: I have a string like this (but with thousands of lines)
Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2
and I need to remove the lines that do not match a-z and ąčęėįšųūž plus _ plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be
[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier
But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried
re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)
but no good.
Thanks in advance, and sorry if my English is not very good.

As to making the matching case insensitive, you can use the I or IGNORECASE flags from the re module, for example when compiling your regex:
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
As to removing the lines not matching this regex, you can simply construct a new string consisting of the lines that do match:
new_s = "\n".join(line for line in s.split("\n") if regex.match(line))

First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:
abcdefg_nodigitshere
But you can subfilter that this way:
import re
mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)
for line in inputs.splitlines():
    if myreg.match(line):
        pass  # do x
    elif mydigre.search(line):  # search, because this pattern is not anchored at the start
        pass  # do y
    else:
        pass  # line doesn't end with _\d+
Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.
all_lines = set(inputs.splitlines())
alpha_lines = {line for line in all_lines if myreg.match(line)}
nonalpha_lines = all_lines - alpha_lines
nonalpha_digi_lines = {line for line in nonalpha_lines if mydigre.search(line)}

I'm not sure how Python handles modifiers, but to edit in place, use something like this (case insensitive).
Note that some of these characters are non-ASCII (UTF-8 encoded). To use them literally, your editor and language must support this; otherwise use the \u.. escapes in the character class (recommended).
s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;
where the regex is: r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is ''
and the modifiers are multiline and global.
Breakdown:
(?i) // case insensitive flag
^ // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$)) // look ahead, not this form of a line ?
.* // ok then select all except newline or eos
(?:\n|$) // select newline or end of string
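In Python, the substitution above could be sketched roughly as follows. This is an assumed translation, not part of the original answer: re.sub already replaces every match (the "global" part), and re.MULTILINE plays the role of the m modifier. The variable names are made up, and the sample data is taken from the question.

```python
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"

pattern = r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'

# re.sub replaces every match by default; re.MULTILINE lets ^ and $
# match at the start and end of each line
cleaned = re.sub(pattern, '', words, flags=re.MULTILINE)
print(cleaned.splitlines())  # ['Achėmos_18', 'Ąžuolas_4']
```

Each non-conforming line is removed together with its newline, so the remaining lines stay in their original order.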
