For this string:
"https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released"
I'm looking to run a regular expression on it to produce a list like the one below:
list = [port=AAA,rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL,subdir=gp_views/MUS-ALLRET/released]
Here's what I have so far:
re.findall(r'\?(.+)','https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released')
but that just returns one string. I know I need to repeat this pattern, something like \S&+ using [], but I can't seem to figure out the best way to do this all in one regex.
re.findall(r'[^?&]+', s)[1:]
This works by splitting on either ? or & and then throwing away the first match, which is the part up to the ?.
I'm making two assumptions here: first, that there are no ? characters in your fragments, and second, that you really want the first element of your list to be port=AAA-NY.
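For example:
>>> import re
>>> s = 'https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released'
>>> re.findall(r'[^?&]+', s)[1:]
['port=AAA-NY', 'rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL', 'subdir=gp_views/MUS-ALLRET/released']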
Using a regex to parse a URL is a bad idea when Python has a built-in library to do the job:
Python 3
Use urlparse to parse the URL into scheme, host, port, query, etc., then use parse_qs to parse the query string.
Do check out the documentation for parsing options for corner cases.
Example code:
from urllib.parse import urlparse, parse_qs

url_string = 'https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released'
url = urlparse(url_string)
query_parts = parse_qs(url.query)
Printing query_parts:
>>> print(query_parts)
{'rpttag': ['praada_pnl_sum_eq.BMACS_ASST_ALL'], 'port': ['AAA-NY'], 'subdir': ['gp_views/MUS-ALLRET/released']}
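If you want to get back to the flat key=value list from the question, you can rebuild it from that dict (a sketch; note that parse_qs returns a list of values per key):
>>> ['{}={}'.format(k, v[0]) for k, v in query_parts.items()]
['rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL', 'port=AAA-NY', 'subdir=gp_views/MUS-ALLRET/released']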
Python 2
The code in Python 2.* is similar, but you need to import the urlparse module instead of urllib.parse. The functions are more or less the same.
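For instance, a minimal Python 2 sketch of the same parse:
from urlparse import urlparse, parse_qs  # Python 2 location of these functions

url = urlparse('https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released')
query_parts = parse_qs(url.query)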
Related
I'm attempting to parse phone numbers that can come through in different ways. For example:
(321) 123-4567
(321) 1234567
321-123-4567
321123-4567
I then want to grab each of the three parts separately. My thought is to use named groups and some and/or situation, like so:
(^\s*(?P<area>[0-9]{3})\-?(?P<fst>[0-9]{3})\-(?P<lst>[0-9]{4}))|(^\s*\(\area\)\s*(\fst)\-?(\lst))
The problem with that, I believe, is that I am not calling the named groups properly. I'm trying to use https://regex101.com/ to help but am still getting stuck. Because the parentheses around the area code should either both be there or neither should be there, I don't want to use the "?" character like this:
\(?(?P<area>[0-9]{3})\)?
Can anyone help me with this? Thank you so much.
I'm using Python 3.6 and the re package.
There were a few issues with your regex. You didn't make the brackets optional, and you didn't allow optional spaces between the area code and the first part. Without seeing your Python code it's not easy to know how you were doing things, but I did this by compiling the regex and then running it against the list of numbers.
from __future__ import print_function
import re
phone_numbers = [
    '(321) 123-4567',
    '(321) 1234567',
    '321-123-4567',
    '321123-4567',
]
regex = re.compile(r'^\s*\(?(?P<area>[0-9]{3})[) -]*(?P<fst>[0-9]{3})-?(?P<sec>[0-9]{4})')
for p in phone_numbers:
    print(regex.sub(r'(\g<area>) \g<fst>-\g<sec>', p))
This isn't perfect, as it will accept strings that aren't valid syntax (according to your list), but that shouldn't be a problem. For example, '(321))- - )) 123-4567' would also be parsed successfully.
I'd use group testing: ^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$.
In there:
(\()? captures an opening parenthesis in group 1, if one exists.
(?(1)\)) tests whether group 1 captured anything and, if so, matches a closing parenthesis.
The rest is pretty straightforward.
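For example (a quick sketch; Python's re module supports these conditional groups):
import re

pattern = re.compile(r'^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$')
for number in ['(321) 123-4567', '(321) 1234567', '321-123-4567', '321123-4567']:
    m = pattern.match(number)
    if m:
        print(m.group('area'), m.group('fst'), m.group('lst'))
# Each line prints: 321 123 4567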
I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up until and inclusive of /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder, though: is what you want here a split() or a search()? It looks (from the sample) like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/, where XXX is 1 or more digits, to the end of the string. If that's the case, then a match/search might be more suitable, since with a split, if the search regex can match multiple times, you'll split into multiple groups. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URLs into constituent parts, the search() with named groups approach works extremely well, as it allows you to name the various parts in the regex itself and, via groupdict(), get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?#)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Keep only the first two path segments ('watch' and the ID); the empty
# first element from the split preserves the leading '/'
new_path = "/".join(parts.path.split("/")[:3])
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
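For example:
>>> import re
>>> re.match(r'.*\/watch\/\d+', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en').group(0)
'http://www.hulu.jp/watch/589851'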
I am trying to get the address of a website's Facebook page by running a regular expression search on its HTML.
Usually the link is embedded as an anchor tag whose visible text is just "Facebook",
but sometimes the address will be http://www.facebook.com/some.other,
and sometimes it will contain numbers.
At the moment, the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last 2 possibilities.
What is it called when I want the regex to search for something but not fetch it? (For instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it.)
Note: I'm using Python with re and urllib2.
It seems to me your main issue is that you don't understand enough regex.
fb_re = re.compile(r'www.facebook.com([^"]+)')
Then simply:
results = fb_re.findall(url)
Why this works:
In regular expressions, the part in the parentheses () is what is captured. You were putting the www.facebook.com part in the parentheses, and so it was not capturing anything else.
Here I used a character set [] to match anything in it, used the ^ operator to negate it (meaning anything not in the set), and gave it the " character, so it will match everything that comes after www.facebook.com until it reaches a " and then stop.
Note: this catches Facebook links which are embedded in HTML attributes. If the Facebook link simply appears on the page in plain text, you can use:
fb_re = re.compile(r'www.facebook.com(\S+)')
which grabs any run of non-whitespace characters, so it will stop as soon as it hits whitespace.
If you are worried about links ending in periods, you can simply use:
fb_re = re.compile(r'www.facebook.com(\S+)\.\s')
which tells it to search for the same as above, but stop at the end of a sentence: a . followed by whitespace such as a space or newline. This way it will still grab links like /some.other, but when you have something like /some.other. it will drop the final .
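A quick sketch of the embedded case (the HTML snippet here is made up for illustration; note the unescaped dots in the pattern also match any character):
import re

html = '<a href="http://www.facebook.com/some.other">Facebook</a>'
fb_re = re.compile(r'www.facebook.com([^"]+)')
print(fb_re.findall(html))  # ['/some.other']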
If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www.facebook.com(.+?)"',url)
Overall, trying to parse HTML with regex is a bad idea. I suggest you use an HTML parser like lxml.html to find the links and then use urlparse:
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
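A minimal sketch of the lxml.html suggestion (assumes lxml is installed; the HTML snippet is illustrative):
import lxml.html
from urlparse import urlparse  # Python 2; in 3.x use urllib.parse

doc = lxml.html.fromstring('<a href="http://www.facebook.com/some.other">Facebook</a>')
for href in doc.xpath('//a/@href'):
    print(urlparse(href).path)  # '/some.other'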
I'm currently playing with the Stack Overflow data dumps and am trying to construct (what I imagine is) a simple regular expression to extract tag names from inside of < and > characters. So, for each question, I have a list of one or more tags like <tagone><tag-two>...<tag-n> and am trying to extract just a list of tag names. Here are a few example tag strings taken from the data dump:
<javascript><internet-explorer>
<c#><windows><best-practices><winforms><windows-services>
<c><algorithm><sorting><word>
<java>
For reference, I don't need to divide tag names into words, so for examples like <best-practices> I would like to get back best-practices (not best and practices). Also, for what it's worth, I'm using Python if it makes any difference. Any suggestions?
Since Stack Overflow's tag names do not have embedded < or >, you can use the regex:
<(.*?)>
or
<([^>]*)>
Explanation:
<      : a literal <
(...)  : group and remember the match
.*?    : match anything, non-greedily
>      : a literal >
[^>]   : a character class matching anything other than >
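For example:
>>> import re
>>> re.findall(r'<([^>]*)>', '<c#><windows><best-practices>')
['c#', 'windows', 'best-practices']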
Instead of doing data dumps (whatever they are) and using regex, you may be interested in using the Stack Overflow API and json.
For example, to cull the tags from this question, you could do this:
import urllib2
import json
import gzip
import cStringIO
# The 1.0 API returns gzip-compressed JSON, hence the decompression step.
f = urllib2.urlopen('http://api.stackoverflow.com/1.0/questions/3708418?type=jsontext')
g = gzip.GzipFile(fileobj=cStringIO.StringIO(f.read()))
j = json.loads(g.read())
print(j['questions'][0]['tags'])
# [u'python', u'regex']
Here is a quick and dirty solution:
#!/usr/bin/python
import re
pattern = re.compile("<(.*?)>")
data = """
<javascript><internet-explorer>
<c#><windows><best-practices><winforms><windows-services>
<c><algorithm><sorting><word>
<java>
"""
for each in pattern.findall(data):
    print each
Update
Statutory warning: if the data dump is in XML or JSON (as one of the users commented), then you are much better off using a suitable XML or JSON parser.
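For example, if the dump were XML, something along these lines (a sketch; the file name and attribute name are assumptions, so check the actual dump schema):
import xml.etree.ElementTree as ET

for _, row in ET.iterparse('posts.xml'):
    if row.tag == 'row':
        print(row.get('Tags'))  # e.g. '<javascript><internet-explorer>'
        row.clear()  # free memory as we stream through the file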
Although I know I could use some hugeass regex such as the one posted here, I'm wondering if there is some tweaky as hell way to do this either with a standard module or perhaps some third-party add-on?
Simple question, but nothing jumped out on Google (or Stackoverflow).
Look forward to seeing how y'all do this!
I know that it's exactly what you do not want but here's a file with a huge regex:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!#)))"""
I call that file urlmarker.py and when I need it I just import it, e.g.:
import urlmarker
import re
re.findall(urlmarker.URL_REGEX,'some text news.yahoo.com more text')
cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Also, here is what Django (1.6) uses to validate URLFields:
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)
cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50
Django 1.9 has that logic split across a few classes.
Look at Django's approach here: django.utils.urlize(). Regexps are too limited for the job and you have to use heuristics to get results that are mostly right.
There is an excellent comparison of 13 different regex approaches, which can be found at this page: In search of the perfect URL validation regex.
The Diego Perini regex, which passed all the tests, is very long but is available at his gist here.
Note that you will have to convert his PHP version to python regex (there are slight differences).
I ended up using the Imme Emosol version which passes the vast majority of tests and is a fraction of the size of Diego Perini's.
Here is a python-compatible version of the Imme Emosol regex:
r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'
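To actually use it, compile it once and match (a sketch; URL_RE is just a name chosen here, and re.IGNORECASE is added since domains are case-insensitive):
import re

URL_RE = re.compile(
    r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?'
    r'(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}'
    r'(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))'
    r'|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)'
    r'(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*'
    r'(?:\.(?:[a-z\u00a1-\uffff]{2,})))'
    r'(?::\d{2,5})?(?:/[^\s]*)?$', re.IGNORECASE)

print(bool(URL_RE.match('http://example.com')))  # True
print(bool(URL_RE.match('not a url')))           # False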
You can use this library I wrote:
https://github.com/imranghory/urlextractor
It's extremely hacky, but it doesn't rely upon "http://" like many other techniques; rather, it uses the Mozilla TLD list (via the tldextract library) to search for TLDs (i.e. ".co.uk", ".com", etc.) in the text and then attempts to construct URLs around the TLD.
It doesn't aim to be RFC-compliant but rather to be accurate for how URLs are used in practice in the real world. For example, it will reject the technically valid domain "com" (you can actually use a TLD as a domain, although it's rare in practice) and will strip trailing full-stops or commas from URLs.
If you know that there is a URL following a space in the string, you can do something like this (s is the string containing the URL):
>>> t = s[s.find("http://"):]
>>> t = t[:t.find(" ")]
Otherwise you need to check whether find returns -1 or not.
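Wrapped up with that check made explicit, it might look like this (a sketch):
def first_url(s):
    """Return the first http:// URL in s, or None if there isn't one."""
    start = s.find("http://")
    if start == -1:
        return None
    end = s.find(" ", start)
    return s[start:] if end == -1 else s[start:end]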
You can use BeautifulSoup.
from bs4 import BeautifulSoup  # or `from BeautifulSoup import BeautifulSoup` for old BeautifulSoup 3

def extractlinks(html):
    soup = BeautifulSoup(html)
    anchors = soup.findAll('a')
    links = []
    for a in anchors:
        links.append(a['href'])
    return links
Note that the solution with regexes is faster, although it will not be as accurate.
I'm late to the party, but here is a solution someone from #python on freenode suggested to me. It avoids the regex hassle.
from urlparse import urlparse

def extract_urls(text):
    """Return a list of urls from a text string."""
    out = []
    for word in text.split(' '):
        thing = urlparse(word.strip())
        if thing.scheme:
            out.append(word)
    return out
There is another way to extract URLs from text easily. You can use urlextract to do it for you; just install it via pip:
pip install urlextract
and then you can use it like this:
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
print(urls) # prints: ['stackoverflow.com']
You can find more info on my github page: https://github.com/lipoja/URLExtract
NOTE: It downloads a list of TLDs from iana.org to keep itself up to date, so if the program does not have internet access then it's not for you.
This approach is similar to the one in urlextractor (mentioned above), but my code is recent, maintained, and I am open to any suggestions (new features).
import re
# The anchor tag below is illustrative; the original HTML presumably had an
# <a href="..."> element for the regex to find. Note that .+ is greedy, so
# prefer [^"]+ if a line can contain more than one quoted attribute.
text = '<p>Please click <a href="http://example.com">here</a></p>'
aa = re.findall('href="(.+)"', text)
print(aa)  # ['http://example.com']