x = "type='text'"
re.findall("([A-Za-z])='(.*?)')", x) # this will work like a charm and produce
# ['type', 'text']
However, my problem is that I'd like to implement a pipe (alternation) so that the same regex will apply to
x = 'type="text"' # see the quotes
Basically, the following regex should work but with findall it results in something strange:
([A-Za-z]+)=('(.*?)'|"(.*?)")
And I can't use ['"] instead of a pipe because it may end with bad results:
value="hey there what's up?"
Now, how can I build such a regex that would apply to either single or double quotes? By the way, please do not suggest any html or xml parsers as I'm not interested in them.
shlex would do a better job here, but if you insist on re, use ([A-Za-z]+)=(?P<quote>['"])(.+?)(?P=quote)
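For what it's worth, here is a minimal sketch of the shlex route (the sample string is made up for illustration):

import shlex

x = 'type="text" value="hey there what\'s up?"'
# shlex honours the quoting, so each token comes back as name=value with the quotes stripped
pairs = dict(token.split('=', 1) for token in shlex.split(x))
# {'type': 'text', 'value': "hey there what's up?"}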
The problem is that in ([A-Za-z]+)=('(.*?)'|"(.*?)") you have four groups but you only need two (this is probably why findall's results looked strange). If you use ([A-Za-z]+)=('.*?'|".*?") instead, everything should be all right. Remember you can suppress grouping with (?:), so this would be equivalent: ([A-Za-z]+)=('(?:.*?)'|"(?:.*?)").
EDIT: I've just realised that this solution would include the surrounding quotes, which you don't want. You can easily strip them off, though. You could also use a backreference, but then you would have one extra group which should be removed at the end, for example:
import re
from operator import itemgetter
x = "type='text' TYPE=\"TEXT\""
print(list(map(itemgetter(0, 2), re.findall(r"([A-Za-z]+)=(['\"])(.*?)\2", x))))
gives [('type', 'text'), ('TYPE', 'TEXT')].
I'm attempting to parse phone numbers that can come through in different ways. For example:
(321) 123-4567
(321) 1234567
321-123-4567
321123-4567
I then want to graph each of the three parts separately. My thought is to use named groups and some sort of and/or construct, like so:
(^\s*(?P<area>[0-9]{3})\-?(?P<fst>[0-9]{3})\-(?P<lst>[0-9]{4}))|(^\s*\(\area\)\s*(\fst)\-?(\lst))
The problem with that, I believe, is that I am not referencing the named groups properly. I'm trying to use https://regex101.com/ to help but am still getting stuck. Because the parentheses around the area code should either both be there or neither should be, I don't want to use the "?" character like this:
\(?(?P<area>[0-9]{3})\)?
Can anyone help me with this? Thank you so much.
I'm using Python 3.6 and the re module.
There were a few issues with your regex: you didn't make the brackets optional, and you didn't allow optional spaces between the area code and the first part. Without seeing your Python code it's hard to know exactly how you were using it, but I did this by compiling the regex once and then applying it to the list of numbers.
from __future__ import print_function
import re
phone_numbers = [
'(321) 123-4567',
'(321) 1234567',
'321-123-4567',
'321123-4567',
]
regex = re.compile(r'^\s*\(?(?P<area>[0-9]{3})[) -]*(?P<fst>[0-9]{3})-?(?P<sec>[0-9]{4})')
for p in phone_numbers:
print(regex.sub(r'(\g<area>) \g<fst>-\g<sec>', p))
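With the sample list above, each of the four variants should print as the same normalized string, (321) 123-4567.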
This isn't perfect, as it will also accept input that isn't valid syntax according to your list, but that shouldn't be a problem. For example, '(321))- - )) 123-4567' would still be accepted and normalized correctly.
I'd use group testing: ^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$.
In there:
(\()? captures an opening parenthesis in group 1 when it exists.
(?(1)\)) tests whether group 1 matched; if it did, a closing parenthesis is required here.
The rest is pretty straightforward.
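If it helps, here is a minimal sketch of that pattern in Python, run against the sample numbers from the question:

import re

pattern = re.compile(r'^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$')

for number in ['(321) 123-4567', '(321) 1234567', '321-123-4567', '321123-4567']:
    m = pattern.match(number)
    if m:
        # every sample yields area='321', fst='123', lst='4567'
        print(m.group('area'), m.group('fst'), m.group('lst'))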
I have the following URL pattern:
http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en
I would like to get everything up to and including /watch/\d+/.
So far I have:
>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']
But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:
http://www.hulu.jp/watch/589851
You need to use a capture group:
>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.
I wonder, though: is what you want here a split() or a search()? From the sample it looks like you're trying to extract from a URL everything from the first occurrence of /watch/XXX/ (where XXX is one or more digits) to the end of the string. If that's the case, then a match/search might be more suitable, because with a split, if the search regex can match multiple times you'll split into multiple pieces. Ex:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']
Which doesn't look like what you want. Instead perhaps:
result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []
which gives:
('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
You could also use this approach combined with named groups to get extra fancy:
result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}
giving:
{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}
If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:
re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)
giving:
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']
Personally though, I find that when parsing URLs into constituent parts, the search() with named groups approach works extremely well, as it allows you to name the various parts in the regex itself and, via groupdict(), get a nice dictionary you can use for working with those parts.
You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.
This is what a regex for URLs looks like:
^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)
It's just a mess of characters, right? Exactly!
Don't parse URLs with regex... almost.
There is one simple thing:
A path-relative URL must be zero or more path segments separated from each other by a "/".
Splitting the URL should be as simple as url.split("/").
from urllib.parse import urlparse, urlunparse
myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"
# Run a parser over it
parts = urlparse(myurl)
# Keep only the first two path segments, e.g. "/watch/589851"
new_path = "/".join(parts.path.split("/")[:3])
# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'
You can try the following regex:
.*\/watch\/\d+
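A quick sketch of how that could be used (the \/ escapes aren't needed in Python, so they're dropped here):

import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'
m = re.search(r'.*/watch/\d+', url)
if m:
    print(m.group(0))  # http://www.hulu.jp/watch/589851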
My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
How can I get the value from the following strings using one regular expression?
/*##debug_string:value/##*/
or
/*##debug_string:1234/##*/
or
/*##debug_string:http://stackoverflow.com//##*/
The result should be
value
1234
http://stackoverflow.com/
Trying to read between the lines of your pattern, this should do it:
re.findall("/\*##debug_string:(.*?)/##\*/", your_string)
Note that your variations cannot work because you didn't escape the *. In regular expressions, * means a repetition of the previous character/group. If you really mean a literal * character, you must use \*.
import re
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:value/##*/")
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:1234/##*/")
print re.findall("/\*##debug_string:(.*?)/##\*/", "/*##debug_string:http://stackoverflow.com//##*/")
Executes as:
['value']
['1234']
['http://stackoverflow.com/']
EDIT: Ok I see that you can have a URL. I've amended the pattern to take it into account.
Use this regex:
[^:]+:([^/]+)
And use capture group #1 for your value.
Live Demo: http://www.rubular.com/r/FxFnpfPHFn
Your regex will be something like: .*:(.*)/.+. Group 1 will be what you are looking for. However this is a REALLY inclusive regex, you might want to post some more details so that you can create some more restrictions.
Assuming that the format stays consistent:
re.findall('debug_string:([^\/]+)\/##', string)
I have been googling this one fervently, but I can't really narrow it down. I am attempting to interpret a csv file of values, common enough sort of behaviour. But I am being punished by values over a thousand, i.e. in quotations and involving a comma. I have kinda gotten around it by using the csv reader, which creates a list of numbers from the row, but I then have to pick the commas out afterwards.
For purely academic reasons, is there a better way to edit a string with regular expressions? Going from 08/09/2010,"25,132","2,909",650 to 08/09/2010,25132,2909,650.
(If you are into Vim, basically I want to put Python on this:
:1,$s/"\([0-9]*\),\([0-9]*\)"/\1\2/g :D )
Use the csv module for first-stage parsing, and a regex only for seeing if the result can be transformed to a number.
import csv, re
num_re = re.compile('^[0-9]+[0-9,]+$')
for row in csv.reader(open('input_file.csv')):
    for el_num in range(len(row)):
        if num_re.match(row[el_num]):
            row[el_num] = row[el_num].replace(',', '')
...although it would probably be faster not to use the regular expression at all:
for row in ([item.replace(',', '') for item in row]
            for row in csv.reader(open('input_file.csv'))):
    do_something_with_your(row)
I think what you're looking for is, assuming that commas will only appear in numbers, and that those entries will always be quoted:
import re
def remove_commas(mystring):
return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)
UPDATE:
Adding cdarke's comments below, the following should work for arbitrary-length numbers:
import re
def remove_commas_and_quotes(mystring):
return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)',r'\1',mystring))
Python has a regular expressions module, "re":
http://docs.python.org/library/re.html
However, in this case, you might want to consider using the "partition" function:
>>> s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'
>>> left_part,quote_mark,right_part = s.partition(")
>>> right_part
'12,345",more_string,"56,6789",and_some_more'
>>> number,quote_mark,remainder = right_part.partition(")
'12,345'
string.partition("character") splits a string into 3 parts, stuff to the left of the first occurrence of "character", "character" itself and stuff to the right.
Here's a simple regex for removing commas from numbers of any length; it only drops a comma that sits directly between two digits:
re.sub(r'(\d),(?=\d)', r'\1', mystring)
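For example, against the sample row from the question (note the surrounding quotes are untouched, so they would still need to be stripped separately):

import re

line = '08/09/2010,"25,132","2,909",650'
print(re.sub(r'(\d),(?=\d)', r'\1', line))
# 08/09/2010,"25132","2909",650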