I am looking into the Regex function in Python.
As part of this, I am trying to extract a substring from a string.
For instance, assume I have the string:
<place of birth="Stockholm">
Is there a way to extract Stockholm with a single regex call?
So far, I have:
location_info = "<place of birth="Stockholm">"
#Remove before
location_name1 = re.sub(r"<place of birth=\"", r"", location_info)
#location_name1 --> Stockholm">
#Remove after
location_name2 = re.sub(r"\">", r"", location_name1)
#location_name2 --> Stockholm
Any advice on how to extract the string Stockholm, without using two "re.sub" calls is highly appreciated.
Sure, you can match the beginning up to the double quotes, and match and capture all the characters other than double quotes after that:
import re
p = re.compile(r'<place of birth="([^"]*)')
location_info = "<place of birth=\"Stockholm\">"
match = p.search(location_info)
if match:
print(match.group(1))
See IDEONE demo
The <place of birth=" is matched as a literal, and ([^"]*) is a capture group 1 matching 0 or more characters other than ". The value is accessed with .group(1).
Here is a REGEX demo.
print re.sub(r'^[^"]*"|"[^"]*$',"",location_info)
This should do it for you.See demo.
https://regex101.com/r/vV1wW6/30#python
Is there a specific reason why you are removing the rest of the string, instead of selecting the part you want with something like
location_info = "<place of birth="Stockholm">"
location_info = re.search('<.*="(.*)".*>', location_info, re.IGNORECASE).group(1)
this code tested under python 3.6
test = '<place of birth="Stockholm">'
resp = re.sub(r'.*="(\w+)">',r'\1',test)
print (resp)
Stockholm
Related
Sample Javascript (content):
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("src","/cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray="+e.ray),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
t.appendChild(u),t}},{10:10}],16:[function(e,t,r){e(10);t.exports=function(e){var t=document.createDocumentFragment(),r=document.createElement("img");r.setAttribute("alt",e.empty),r.id="trk_recaptcha",r.setAttribute("sdfdsfsfds",'/test/path'),t.appendChild(r);var n=document.createTextNode(" ");t.appendChild(n);var a=document.createElement("input");a.id="id",a.setAttribute("name","id"),a.setAttribute("type","hidden"),a.setAttribute("value",e.ray),t.appendChild(a);var i=document.createTextNode(" ");t.appendChild(i);
regex = ""
endpoints = re.findall(regex, content)
Output I want:
> /cdn-cgi/images/trace/captcha/js/re/transparent.gif?ray=
> /test/path
I want to find all fields starting with "/ and '/ with regex. I've tried many url regexes but it didn't work for me.
This should do it:
regex = r"""["']\/[^"']*"""
Note that you will need to trim the first character from the match. This also assumes that there are no quotation marks in the path.
Consider:
import re
txt = ... #your code
pat = r"(\"|\')(\/.*?)\1"
for el in re.findall(pat, txt):
print(el[1])
each el will be match of pattern starting with single, or double quote. Then minimal number of characters, then the same character as at the beginning (same type of quote).
.* stands for whatever number of any characters, following ? makes it non-greedy i.e. provides minimal characters match. Then \1 refers to first group, so it will match whatever type of quote was matched at the beginning. Then by specifying el[1] we return second group matched i.e. whatever was matched within quotes.
Hi I have a pattern that I can identify with python regex but I want to return everything but the match
with this function I can extract the year of the string
def stripyear(string):
p_year = re.compile('[2][0][0-9][0-9][0-1][0-9][0-3][0-9]')
ym = p_year.findall(string)
return ''.join(ym)
so stripyear('UTMSIRGAS200020150409')will return '20150409' but i want it to return
'UTMSIRGAS2000'
i.e. the opposite
I tried this patter '\b(?!([2][0][0-9][0-9][0-1][0-9][0-3][0-9])\b)\w+' and it works if there was a space before the patter in here but python does not like it.
thanks for any help.
Use re.sub to replace with empty string:
re.sub("[2][0][0-9][0-9][0-1][0-9][0-3][0-9]", "", input)
Iam trying to make a python script that reads a text file input.txt and then scans all phone numbers in that file and writes back all matching phone no's to output.txt
lets say text file is like:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc#xyz.com
it should match 1234567890, +91-1234567890 and +91 5645454545
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
out = re.findall(no,line)
for i in out :
f2.write(i + '\n')
Regexp for no is like : it takes country codes upto 3 digits and then a - or space which is optional and country code itself is optional and then a 10 digit number.
Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ signifies that you want to match only at the beginning of the string. You want to match multiple times throughout the string. Here's a 101demo.
For python, you'll need to specify a non-capturing group as well with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
Bold emphasis mine. Here's a relevant question.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall('(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
this code will work:
import re
no = '(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
out = re.findall(no,line)
for i in out :
f2.write(i + '\n')
The output will be:
1234567890
+91-1234567890
+91 5645454545
you can use
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}
see the demo at demo
pattern = '\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern,text)
o/p -> ['1234567890', '+91-1234567890', '+91 5645454545']
I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '
In the following script I would like to pull out text between the double quotes ("). However, the python interpreter is not happy and I can't figure out why...
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.match(pattern, text)
print m.group()
The output should be find.me-/\.
match starts searching from the beginning of the text.
Use search instead:
#!/usr/bin/env python
import re
text = 'Hello, "find.me-_/\\" please help with python regex'
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
m = re.search(pattern, text)
print m.group()
match and search return None when they fail to match.
I guess you are getting AttributeError: 'NoneType' object has no attribute 'group' from python: This is because you are assuming you will match without checking the return from re.match.
If you write:
m = re.search(pattern, text)
match: searches at the beginning of text
search: searches all the string
Maybe this helps you to understand:
http://docs.python.org/library/re.html#matching-vs-searching
Split the text on quotes and take every other element starting with the second element:
def text_between_quotes(text):
return text.split('"')[1::2]
my_string = 'Hello, "find.me-_/\\" please help and "this quote" here'
my_string.split('"')[1::2] # ['find.me-_/\\', 'this quote']
'"just one quote"'.split('"')[1::2] # ['just one quote']
This assumes you don't have quotes within quotes, and your text doesn't mix quotes or use other quoting characters like `.
You should validate your input. For example, what do you want to do if there's an odd number of quotes, meaning not all the quotes are balanced? You could do something like discard the last item if you have an even number of things after doing the split
def text_between_quotes(text):
split_text = text.split('"')
between_quotes = split_text[1::2]
# discard the last element if the quotes are unbalanced
if len(split_text) % 2 == 0 and between_quotes and not text.endswith('"'):
between_quotes.pop()
return between_quotes
# ['first quote', 'second quote']
text_between_quotes('"first quote" and "second quote" and "unclosed quote')
or raise an error instead.
Use re.search() instead of re.match(). The latter will match only at the beginning of strings (like an implicit ^).
You need re.search(), not re.match() which is anchored to the start of your input string.
Docs here