Find comma space year but ignore comma year without space - python

I am trying to read in a file and print every occurrence of a comma, a space, and a year. For example, if it finds , 2003 it should print that, but if it finds ,2003 it should ignore it. I originally used split and was able to match the year, but once I added the comma I realized split treats ", 2003" as two separate words, so I don't think that will work.
Here is my code:
import string
import re
while True:
    filename=raw_input('Enter a file name: ')
    if filename == 'exit':
        break
    try:
        file = open(filename, 'r')
        text=file.read()
        file.close()
    except:
        print('file does not exist')
    else:
        p=re.compile('^\,\s(19|20)\d\d$')  # this is my regular expression
        print(text)
        m=p.search(text)
        if m:
            print(m.groups())

If you want to search the file for the regex rather than match the entire file contents, remove ^ and $ from the regex.
If you want more than one match per file, use finditer or findall instead of search.
Use raw string when specifying the regex: p=re.compile(r',\s(19|20)\d\d')
Example:
for m in re.finditer(r',\s((19|20)\d\d)', text):
    print m.group(1)

>>> import re
>>> text = "foo bar, 2003, 2006,1923, derp"
>>> p = re.compile(r',\s((?:19|20)\d\d)')
>>> p.findall(text)
['2003', '2006']
Simplified example. First of all, remove the anchors (^ and $) and use findall instead of search to find all matches. I also used ?: to designate a non-capturing group (it won't show up in the results) and made the year a group instead.

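If you add a * after the \s in your regex, it will match zero or more whitespace characters instead of exactly one. If you only want it to match zero or one, use ? instead; + means one or more. Keep in mind that both * and ? also accept ,2003 with no space, which the question wants to ignore, so \s or \s+ is the better fit here.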
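To compare the quantifiers side by side, here is a minimal sketch that runs the same year pattern with each whitespace variant. The sample string and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample: one space, one space, no space, two spaces.
text = "foo bar, 2003, 2006,1923,  1999 derp"

# Same year pattern, four different whitespace quantifiers after the comma.
variants = {
    r",\s":  "exactly one",
    r",\s?": "zero or one",
    r",\s*": "zero or more",
    r",\s+": "one or more",
}

results = {}
for prefix, meaning in variants.items():
    pattern = re.compile(prefix + r"((?:19|20)\d\d)")
    results[prefix] = pattern.findall(text)
    print("%-5s (%s): %s" % (prefix, meaning, results[prefix]))
```

Only the \s and \s+ variants skip ,1923, which is the behaviour the question asks for.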

Related

python regex [] does not work [duplicate]

I need some help on declaring a regex. My inputs are like the following:
this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>
The required output is:
this is a paragraph with in between and then there are cases ... where the number ranges from 1-100.
and there are many other lines in the txt files
with such tags
I've tried this:
#!/usr/bin/python
import os, sys, re, glob
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    reader = open(infile)  # assumed: the snippet as posted never defines reader
    for line in reader:
        line2 = line.replace('<[1> ', '')
        line = line2.replace('</[1> ', '')
        line2 = line.replace('<[1>', '')
        line = line2.replace('</[1>', '')
        print line
I've also tried this (but it seems like I'm using the wrong regex syntax):
line2 = line.replace('<[*> ', '')
line = line2.replace('</[*> ', '')
line2 = line.replace('<[*>', '')
line = line2.replace('</[*>', '')
I don't want to hard-code the replacements for 1 to 99.
This tested snippet should do it:
import re
line = re.sub(r"</?\[\d+>", "", line)
Edit: Here's a commented version explaining how it works:
line = re.sub(r"""(?x)  # Use free-spacing mode (the flag must come first).
    <     # Match a literal '<'
    /?    # Optionally match a '/'
    \[    # Match a literal '['
    \d+   # Match one or more digits
    >     # Match a literal '>'
    """, "", line)
Regexes are fun! But I would strongly recommend spending an hour or two studying the basics. For starters, you need to learn which characters are special: the "metacharacters" that need to be escaped (i.e. with a backslash placed in front). Note that the rules are different inside and outside character classes. There is an excellent online tutorial at www.regular-expressions.info. The time you spend there will pay for itself many times over. Happy regexing!
str.replace() does fixed replacements. Use re.sub() instead.
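A minimal sketch of the difference, using a shortened version of the sample input (the variable names are illustrative only): str.replace() searches for the literal text it is given, while re.sub() interprets its first argument as a pattern.

```python
import re

line = "with<[1> in between</[1> and the<[99> range</[99>."

# str.replace() looks for the literal text '<[*>' and finds nothing to change.
literal_attempt = line.replace("<[*>", "")

# re.sub() treats its first argument as a pattern, so one call covers every tag.
cleaned = re.sub(r"</?\[\d+>", "", line)

print(literal_attempt == line)
print(cleaned)
```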
I would go like this (regex explained in comments):
import re
# If you need to use the regex more than once it is suggested to compile it.
pattern = re.compile(r"</{0,}\[\d+>")
# <\/{0,}\[\d+>
#
# Match the character “<” literally «<»
# Match the character “/” literally «\/{0,}»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «{0,}»
# Match the character “[” literally «\[»
# Match a single digit 0..9 «\d+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match the character “>” literally «>»
subject = """this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>.
and there are many other lines in the txt files
with<[3> such tags </[3>"""
result = pattern.sub("", subject)
print(result)
If you want to learn more about regex, I recommend reading Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan.
The easiest way (note that this removes anything between < and >, not only the numbered tags):
import re
txt='this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. and there are many other lines in the txt files with<[3> such tags </[3>'
out = re.sub("(<[^>]+>)", '', txt)
print out
The replace method of string objects does not accept regular expressions, only fixed strings (see documentation: http://docs.python.org/2/library/stdtypes.html#str.replace).
You have to use the re module:
import re
newline = re.sub(r"</?\[[0-9]+>", "", line)
You don't have to use a regular expression (for your sample string):
>>> s
'this is a paragraph with<[1> in between</[1> and then there are cases ... where the<[99> number ranges from 1-100</[99>. \nand there are many other lines in the txt files\nwith<[3> such tags </[3>\n'
>>> for w in s.split(">"):
...     if "<" in w:
...         print w.split("<")[0]
...
this is a paragraph with
in between
and then there are cases ... where the
number ranges from 1-100
.
and there are many other lines in the txt files
with
such tags
import os, sys, re, glob
# Match both opening and closing tags such as <[1> and </[99>.
pattern = re.compile(r"</?\[\d+>")
for infile in glob.glob(os.path.join(os.getcwd(), '*.txt')):
    with open(infile) as reader:
        for line in reader:
            # sub() takes the replacement first, then the string to search.
            retline = pattern.sub("", line)
            sys.stdout.write(retline)

Extracting numbers from a text file using regexp

I am trying to write a Python script that reads a text file input.txt, scans it for all phone numbers, and writes every matching phone number to output.txt.
Let's say the text file is like:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc#xyz.com
It should match 1234567890, +91-1234567890 and +91 5645454545.
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
The regexp for no works like this: it takes an optional country code of up to 3 digits, followed by an optional - or space, and then a 10-digit number.
Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ signifies that you want to match only at the beginning of the string, whereas you want to match multiple times throughout the string.
For python, you'll need to specify a non-capturing group as well with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
The last sentence of that quote is the relevant part.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall('(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
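To see what the capturing group does to the result, here is a small sketch that runs both versions of the pattern on a shortened copy of the sample text (the variable names are illustrative):

```python
import re

text = ("Hey my number is 1234567890 and another number is +91-1234567890. "
        "Call me on +91 5645454545")

# Capturing group: findall returns only what the group matched
# (an empty string when the optional group did not take part).
with_group = re.findall(r"(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}", text)

# Non-capturing group: findall returns each whole match.
without_group = re.findall(r"(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}", text)

print(with_group)
print(without_group)
```

With the capturing group only the country-code prefixes come back, which is why the numbers themselves seem to go missing.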
This code will work:
import re
no = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}'  # non-capturing group, no ^ anchor
with open('output.txt', 'w+') as f2:
    for line in open('input.txt'):
        out = re.findall(no, line)
        for i in out:
            f2.write(i + '\n')
The output will be:
1234567890
+91-1234567890
+91 5645454545
You can use:
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}
pattern = r'\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern, text)
# Output: ['1234567890', '+91-1234567890', '+91 5645454545']

Using regular expressions to find a pattern

If I have a file that consists of sentences like this:
1001 apple
1003 banana
1004 grapes
1005
1007 orange
Now I want to detect and print all such sentences where there is a number but no corresponding text (eg 1005), how can I design the regular expression to find such sentences? I find them a bit confusing to construct.
res=[]
with open("fruits.txt","r") as f:
    for fruit in f:
        res.append(fruit.strip().split())
Would it be something like this: re.sub("10**"/.")
Well, you don't need a regular expression for this:
with open("fruits.txt", "r") as f:
    res = [int(line.strip()) for line in f if len(line.split()) == 1]
A regex that would detect a number, then a space, then a word (\w matches letters, digits, and underscores) is ([0-9])+[ ]\w+.
A good resource for trying this out is http://regexr.com/
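The re pattern for this would be [0-9][0-9][0-9][0-9]$, used as re.match("[0-9][0-9][0-9][0-9]$", line). This checks whether the line consists of four digits and nothing else, so it will find your 1005. Without the $ the pattern would also match the numbered lines that have text after them.
Hope this helps!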
There are two ways to go about this: search() and findall(). The former will find the first instance of a match, and the latter will give a list of every match.
In any case, the regex you want to use is "^\d{4}$". It's a simple regex which matches a 4-digit number that takes up the entirety of a string, or, in multiline mode, a line. So, to find 'only number' sections, you will use the following code:
# assume 'func' is set to either be re.search or re.findall, whichever you prefer
with open("fruits.txt", "r") as f:
    solo = func(r"^\d{4}$", f.read(), re.MULTILINE)
# 'solo' now has either a match object for the first 'non-labeled' number,
# or a list of all such numbers in the file, depending on
# the function you used. search() will return None if there
# are no such numbers, and findall() will return an empty list.
# if you prefer brevity, re.MULTILINE is equivalent to re.M
Additional explanation of the regex:
^ matches at the beginning of the line.
\d is a special sequence which matches any numeric digit.
{4} matches the prior element (\d) exactly four times.
$ matches at the end of the line.
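As a rough illustration of the multiline behaviour described above, this sketch uses an in-memory string in place of fruits.txt (an assumption made so the example is self-contained) and compares findall with and without the flag:

```python
import re

# Sample content standing in for fruits.txt.
text = "1001 apple\n1003 banana\n1004 grapes\n1005\n1007 orange\n"

# With re.MULTILINE, ^ and $ anchor at every line of the string.
solo = re.findall(r"^\d{4}$", text, re.MULTILINE)

# Without the flag, ^ and $ only anchor at the ends of the whole string.
no_flag = re.findall(r"^\d{4}$", text)

print(solo)
print(no_flag)
```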
Please try:
(?:^|\s+)(\d{4}\b)(?!\s.*\w+)

find string with a pattern at the end regex python

I want to check if a string ends with a "_INT".
Here is my code
nOther = "c1_1"
tail = re.compile('_\d*$')
if tail.search(nOther):
    nOther = nOther.replace("_","0")
print nOther
output:
c101
c102
c103
c104
but there may be two underscores in the string, I am only interested in the last one.
How can I edit my code to handle this?
Two steps (checking whether the pattern matches, then making the replacement) are unnecessary, because re.sub does both in one step:
txt = re.sub(r'_(?=\d+$)', '0', txt)
The pattern uses a lookahead (?=...) (i.e. "followed by"), which is only a check: the content inside it is not part of the match result. In other words, \d+$ is not replaced.
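A short sketch of the lookahead on a few made-up names, including one with two underscores, suggests that only an underscore directly followed by the trailing digits is replaced:

```python
import re

# Hypothetical inputs; only "c1_1" comes from the question.
names = ["c1_1", "my_var_12", "no_digits_here", "plain"]

# Replace an underscore only when nothing but digits follows it up to the end.
fixed = [re.sub(r"_(?=\d+$)", "0", name) for name in names]

print(fixed)
```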
One way to do it would be to capture everything before and after the last underscore and rebuild the string.
import re
nOther = "c1_1"
tail = re.compile(r'(.*)_(\d*$)')
m = tail.search(nOther)
if m:
    nOther = m.group(1) + '0' + m.group(2)
print nOther
