Extracting numbers from a text file using regexp - python

I am trying to write a Python script that reads a text file, input.txt, finds all phone numbers in it, and writes the matching numbers to output.txt.
Let's say the text file looks like this:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc#xyz.com
It should match 1234567890, +91-1234567890 and +91 5645454545.
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
The regexp in no works like this: an optional country code of up to 3 digits, followed by an optional - or space, and then a 10-digit number.

Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ anchors the match to the beginning of the string, but you want to match multiple times throughout the string. Here's a regex101 demo.
For Python, you'll also need to make the group non-capturing with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
Bold emphasis mine. Here's a relevant question.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall('(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
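To make the difference concrete, here is a minimal sketch that runs both variants of the pattern on the sample sentence from the question. The variable names (capturing, non_capturing, groups_only, full_matches) are only illustrative.

```python
import re

# Sample sentence from the question (the e-mail part is omitted).
text = ("Hey my number is 1234567890 and another number is +91-1234567890. "
        "but if none of these is available you can call me on +91 5645454545")

# Same pattern twice: once with a capturing group, once with a non-capturing one.
capturing = r'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
non_capturing = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'

# With one capturing group, findall returns only what that group matched
# (an empty string when the optional group did not take part in the match).
groups_only = re.findall(capturing, text)

# With no capturing groups, findall returns the complete matches.
full_matches = re.findall(non_capturing, text)

print(groups_only)   # ['', '+91-', '+91 ']
print(full_matches)  # ['1234567890', '+91-1234567890', '+91 5645454545']
```

If you need both the whole match and the country code, re.finditer with m.group(0) and m.group(1) is another option.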

This code will work:
import re
# Non-capturing group, so re.findall returns the whole match
no = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}'
with open('input.txt') as f1, open('output.txt', 'w') as f2:
    for line in f1:
        for number in re.findall(no, line):
            f2.write(number + '\n')
The output will be:
1234567890
+91-1234567890
+91 5645454545

You can use
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}
See the demo.

pattern = r'\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern, text)
Output: ['1234567890', '+91-1234567890', '+91 5645454545']

Related

Regex to capture string if other string present within brackets

I am trying to create a Python regex to capture a file name, but only if the text "external=true" appears within the square brackets after the alleged file name.
I believe I am nearly there, but am missing a specific use-case. Essentially, I want to capture the text between qrcode: and the first [, but only if the text external=true appears between the two square brackets.
I have created the regex qrcode:([^:].*?)\[.*?external=true.*?\], which does not work for the second line below: it incorrectly returns vcard3.txt and does not return vcard4.txt.
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
https://regex101.com/r/bh3IMb/3
As an alternative you can use
qrcode:([\w\.]+)(?=\[[\w\=,]*external=true[^\]]*)
See the regex demo.
Python demo:
import re
regex = re.compile(r"qrcode:([\w\.]+)(?=\[[\w\=,]*external=true[^\]]*)")
sample = """
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
"""
print(regex.findall(sample))
Output:
['vcard1.txt', 'vcard4.txt', 'vcard5.txt']
Using a positive look-behind (for qrcode:) and a positive look-ahead (for [...external=true), with lazy matching to capture the smallest such group.
Regex101 explanation: https://regex101.com/r/bOezIm/1
A complete python example:
import re
pattern = r"(?<=qrcode:)[^:]*?(?=\[[^\]]*?external=true)"
string = """
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
"""
print(re.findall(pattern, string))
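As a quick sanity check, the sketch below runs both patterns from the answers above on the sample text and compares the results. The names lookahead_only, lookaround, first and second are made up for this illustration.

```python
import re

sample = """
qrcode:vcard1.txt[external=true] qrcode:vcard2.txt[xdim=2,ydim=2]
qrcode:vcard3.txt[xdim=2,ydim=2] qrcode:vcard4.txt[xdim=2,ydim=2,external=true]
qrcode:vcard5.txt[xdim=2,ydim=2,external=true,foreground=red,background=white]
qrcode:https://www.github.com[foreground=blue]
"""

# Pattern from the first answer: capture group plus look-ahead.
lookahead_only = re.compile(r"qrcode:([\w\.]+)(?=\[[\w\=,]*external=true[^\]]*)")

# Pattern from the second answer: look-behind plus look-ahead, no capture group.
lookaround = re.compile(r"(?<=qrcode:)[^:]*?(?=\[[^\]]*?external=true)")

first = lookahead_only.findall(sample)
second = lookaround.findall(sample)

print(first)   # ['vcard1.txt', 'vcard4.txt', 'vcard5.txt']
print(second)  # ['vcard1.txt', 'vcard4.txt', 'vcard5.txt']
```

The two patterns agree on this sample, but they are not equivalent in general: the first only accepts word characters and dots in the file name, while the second accepts anything except a colon.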

How can I find all paths in javascript file with regex in Python?

Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ and '/ with a regex. I've tried many URL regexes, but none of them worked for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
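A minimal sketch of the trimming step, assuming a shortened version of the sample JavaScript as content (the real input would be the whole file):

```python
import re

# Shortened stand-in for the JavaScript in the question.
content = (
    'r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),'
    "r.setAttribute(\"sdfdsfsfds\",'/test/path'),t.appendChild(r);"
)

regex = r"""["']\/[^"']*"""

# Each match still starts with the opening quote, so slice it off.
endpoints = [match[1:] for match in re.findall(regex, content)]

for endpoint in endpoints:
    print(endpoint)
# /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
# /test/path
```

An alternative to slicing is to wrap the path part in a capturing group, so that re.findall returns only the path.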
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
    print(el[1])
Each el is a match of the pattern: a single or double quote, then a slash followed by the minimal number of characters, then the same quote character as at the beginning.
.* matches any number of any characters (except newlines, by default), and the following ? makes it non-greedy, i.e. it matches as few characters as possible. \1 refers to the first group, so it matches whichever type of quote was matched at the beginning. Because the pattern has two groups, re.findall returns a tuple per match, and el[1] is the second group, i.e. whatever was matched within the quotes.

Python Regex behaviour with Square Brackets []

This is the text file abc.txt:
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
path = r"C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1 and 2 from s2.
But why?
Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
Also, group(0) is the entire match by default:
If a groupN argument is zero, the corresponding return value is the
entire matching string.
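So you should skip it, and check group(3) if you want the last one.
Also, you can compile the regexp once before the for loop. This can slightly improve the performance of your parser, since the pattern is not looked up in the re module's cache on every iteration.
And you can replace (\w)* with (\w*) if you want to match all the characters between the colons.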
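Putting these suggestions together, here is a small sketch that compiles the pattern once and moves the repetition inside each group. It reads from an in-memory list instead of C:\abc.txt so that it runs anywhere. The names site_re and parsed, and the simplified third group (.*), are assumptions for this illustration.

```python
import re

# Stand-in for the lines of abc.txt.
lines = [
    "aa:s0:education.gov.in",
    "bb:s1:defence.gov.in",
    "cc:s2:finance.gov.in",
]

# Compiled once, outside the loop. The * is inside each group,
# so every group captures the whole field, not just its last character.
site_re = re.compile(r'(\w*):(\w*):(.*)')

parsed = []
for line in lines:
    m = site_re.search(line)
    if m:
        # groups() returns groups 1..3; group(0) would be the whole match.
        parsed.append(m.groups())

print(parsed[0])  # ('aa', 's0', 'education.gov.in')
```

For a fixed, colon-separated format like this one, line.split(':') gives the same three fields without a regex.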

What is a RegEx to find phone numbers in Python?

I am trying to write a regex in Python to detect 7-digit numbers and update contacts in a .vcf file. It should then turn each number into an 8-digit number (just adding 5 before the number). The thing is, the regex does not work.
I am getting the error message "EOL while scanning string literal".
regex=re.compile(r'^(25|29|42[1-3]|42[8-9]|44|47[1-9]|49|7[0-9]|82|85|86|871|87[5-8]|9[0-8])/I s/^/5/')
#Open file for scanning
f = open("sample.vcf")
#scan each line in file
for line in f:
    #find all results corresponding to regex and store in pattern
    pattern=regex.findall(line)
    #isolate results
    for word in pattern:
        print word
        count = count+1 #display number of occurences
        wordprefix = '5{}'.format(word)
        s=open("sample.vcf").read()
        s=s.replace(word,wordprefix)
        f=open("sample.vcf",'w')
        print wordprefix
        f.write(s)
        f.close()
I suspect that my regex is not in the correct format for detecting numbers that start with 2 particular digits, like 25x and 29x, followed by 5 digits that can be anything (7 digits in total).
Can anyone help me with the correct format for such a case?
/I is not how you give modifiers for a regex in Python, and you don't do substitution with s/// either.
You should use re.sub() for substitution, and give the modifier as re.I, as the 2nd argument to re.compile:
reg = re.compile(regexPattern, re.I)
And then for a string s, the substitution would look like:
re.sub(reg, replacement, s)
As it stands, your regex looks weird to me. If you want to match 7-digit numbers starting with 25 or 29, then you should use:
r'(2[59][0-9]{5})'
And for the replacement, use r"5\1" (a raw string, so that \1 is a backreference to group 1 rather than the character \x01). In all, for a string s, your code would look like:
reg = re.compile(r'(2[59][0-9]{5})', re.I)
new_s = re.sub(reg, r"5\1", s)
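Here is a minimal, self-contained sketch of that substitution. The sample text is made up, not taken from a real .vcf file. The \b word boundaries are an addition to the pattern above, assumed here so that a 7-digit run inside a longer number is left alone.

```python
import re

# Hypothetical excerpt of vCard-like lines.
sample = "TEL;CELL:2512345\nTEL;HOME:2987654\nTEL;WORK:4212345"

# \b on both sides is an assumption added for this sketch.
reg = re.compile(r'\b(2[59][0-9]{5})\b')

# The replacement must be a raw string: "5\1" without the r prefix
# is the two characters '5' and '\x01', not a backreference.
updated = reg.sub(r'5\1', sample)

print(updated)
# TEL;CELL:52512345
# TEL;HOME:52987654
# TEL;WORK:4212345
```

Extending the alternation to the other prefixes from the question (42x, 44, 47x and so on) follows the same shape, as long as the whole 7-digit number stays inside the capturing group.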

Find comma space year but ignore comma year without space

I am trying to read in a file and print every occurrence of a comma, a space and a year. For example, if it finds , 2003 it should print that out, but if it finds ,2003 it should ignore it. I originally used split and was able to match the year, but when I added the comma I realized that split treats it as two different words, so I don't think that would work.
Here is my code:
import string
import re
while True:
    filename=raw_input('Enter a file name: ')
    if filename == 'exit':
        break
    try:
        file = open(filename, 'r')
        text=file.read()
        file.close()
    except:
        print('file does not exist')
    else:
        p=re.compile('^\,\s(19|20)\d\d$')  # this is my regular expression
        print(text)
        m=p.search(text)
        if m:
            print(m.groups())
If you want to search the file for the regex rather than match the entire file contents, remove ^ and $ from the regex.
If you want more than one match per file, use finditer or findall instead of search.
Use a raw string when specifying the regex: p=re.compile(r',\s(19|20)\d\d')
Example:
for m in re.finditer(r',\s((19|20)\d\d)', text):
    print(m.group(1))
>>> import re
>>> text = "foo bar, 2003, 2006,1923, derp"
>>> p = re.compile(r',\s((?:19|20)\d\d)')
>>> p.findall(text)
['2003', '2006']
This is a simplified example. First of all, remove the anchors (^ and $) and use findall instead of search to find all matches. I also used ?: to designate a non-capturing group (it won't show up in the results) and made the whole year a capturing group instead.
If you just add a * to the \s in your regex, it will match zero or more whitespace characters instead of exactly one. If you only want it to match zero or one, add a ? instead. Note that either change also makes the regex match ,2003, which the question wants to ignore.
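To compare the quantifiers on \s directly, here is a short check using the sample text from the earlier answer. The result names are illustrative only.

```python
import re

text = "foo bar, 2003, 2006,1923, derp"

# \s   : exactly one whitespace character after the comma
exactly_one = re.findall(r',\s((?:19|20)\d\d)', text)
# \s+  : one or more whitespace characters
one_or_more = re.findall(r',\s+((?:19|20)\d\d)', text)
# \s?  : zero or one whitespace character
zero_or_one = re.findall(r',\s?((?:19|20)\d\d)', text)
# \s*  : zero or more whitespace characters
zero_or_more = re.findall(r',\s*((?:19|20)\d\d)', text)

print(exactly_one)   # ['2003', '2006']
print(one_or_more)   # ['2003', '2006']
print(zero_or_one)   # ['2003', '2006', '1923']
print(zero_or_more)  # ['2003', '2006', '1923']
```

Only \s and \s+ skip ,1923, so those are the variants that fit the requirement of ignoring a year that directly follows the comma.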
