Regex giving tuple and not full match - python

I'm trying to use regex to find proxy address on a website. Currently I'm using this piece of regex (\d{1,3}\.){3}\d{1,3}:(\d+). It works on regexr.com and in sublime text, but when I try to use it in Python it doesn't work as expected.
This is the piece of code I'm using:
p = re.compile("(\d{1,3}\.){3}\d{1,3}:(\d+)")
ipCandidates = p.findall(soupString)
It should return proxies like this 120.206.182.172:8123 but it returns tuples like this ('44.', '3128'). What can I do to fix this?
Thank you.

re.findall() only returns the contents of capturing groups instead of the whole match (if you have such groups in your regex).
Then, you're repeating a capturing group three times, which means that only the third repetition is preserved (the other two are overwritten).
Change your regex to
p = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}:\d+")
and you'll get whole matches.
If you do want tuples of the separate submatches (without the dots and colon), you can do that, too, but you can't use repetition then:
p = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)")
Also, always use raw strings for regexes, so regex escape sequences and string escape sequences can't be confused.
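To make the difference concrete, here is a minimal sketch that runs the three patterns discussed above against the same input. The sample string `text` and its addresses are made up for illustration and are not taken from the original site.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "proxies: 120.206.182.172:8123 and 10.0.0.44:3128"

# Repeated capturing group: findall returns only the groups, and the
# repeated group keeps just its last repetition.
capturing = re.findall(r"(\d{1,3}\.){3}\d{1,3}:(\d+)", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}:\d+", text)

# One capturing group per octet: findall returns one tuple per match.
parts = re.findall(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d+)", text)

print(capturing)
print(whole)
print(parts)
```

The first call reproduces the kind of tuples described in the question, while the second returns the full address strings.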

Related

Python regex expression example

I have an input that is valid if it has these parts, in this order:
a part that starts the input and consists of letters (upper and lower case), numbers and some of the following characters (!,#,#,$,?)
a part that begins with = and contains only numbers
a part that begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into five capture groups and not into three, as I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches any character except a newline, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.
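As a rough sketch of how the three-group pattern could be used to validate an input, the snippet below wraps it in a helper. The function name `parse_input` is hypothetical, and using `fullmatch` (so the whole input has to fit the pattern) is an assumption about what "valid" should mean here.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)")


def parse_input(text):
    """Return the three parts as a tuple, or None if text is not valid.

    Hypothetical helper: fullmatch requires the entire input to fit.
    """
    m = pattern.fullmatch(text)
    return m.groups() if m else None


valid = parse_input("!!Hel##lo!#=7<<vbnfhfg")
invalid = parse_input("hello=<<abc")  # no digits after "="

print(valid)
print(invalid)
```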

How should I make these regex capture groups more succinct?

I'm using python's re library to do this, but it's a basic regex question.
I am receiving a string of coordinate information in degrees-minutes-seconds format without spaces, and I'm parsing it out to discrete coordinate pairs for conversion.
The string is fed to me looking like this (fake coords for example):
102030N0102030E203040N0203040E304050N0304050E405060N0405060E
I am catching it like this:
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.match(
    re.compile(r"^(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})(\d+[NS]{1}\d+[EW]{1})"),
    coordstr)
for x in coords.groups():
    print(x)
which gives me
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
And allows me to address each coordinate pair as coords.group(1), coords.group(2) and so on.
So it works, but it feels like I'm being too verbose in the pattern. Is there a more succinct way to crawl the line with one of the capture groups, and add each matched group to .groups() as it's encountered? I know I could do it with brute force string slicing but that seems like more trouble than it's worth.
I've read this but it doesn't seem to address what I'm going after in this question.
Because this is for an enterprise and these strings describe raster bounds, I will be validating the string before introducing the regex search and falling back to a gdal object if the string is not found (or corrupted).
Since you will pre-validate the strings you process with the regex, you need not use re.search / re.match with several groups that share an identical pattern. You can use re.findall to get all \d+[NS]\d+[EW] pattern matches from your strings:
import re
coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'
coords = re.findall(r'\d+[NS]\d+[EW]', coordstr)
for x in coords:
    print(x)
Output:
102030N0102030E
203040N0203040E
304050N0304050E
405060N0405060E
See the Python demo.
NOTE: the list of matches returned by re.findall will always be in the same order as they are in the source text, see this SO post.
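If the latitude and longitude parts are needed separately for the conversion step, one possible variation is to use named groups with re.finditer. This is only a sketch: the group names (`lat`, `ns`, `lon`, `ew`) and the `fullmatch` check standing in for the pre-validation are assumptions, not part of the original answer.

```python
import re

coordstr = '102030N0102030E203040N0203040E304050N0304050E405060N0405060E'

# Assumed validation step: the whole string must consist of coordinate pairs.
is_valid = re.fullmatch(r'(?:\d+[NS]\d+[EW])+', coordstr) is not None

# Named groups split each pair into its latitude and longitude parts.
pair = re.compile(r'(?P<lat>\d+)(?P<ns>[NS])(?P<lon>\d+)(?P<ew>[EW])')
pairs = [m.groupdict() for m in pair.finditer(coordstr)]

for p in pairs:
    print(p['lat'], p['ns'], p['lon'], p['ew'])
```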

Python - Print substrings until matches for all matches

I am looking to find all matches of a pattern in a string and print each match, together with everything that follows it up to the next match, on its own line.
e.g.
"123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
should print:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
where "ABC" is the pattern match which is recurring.
Is there an efficient way I can do so using findall?
New to Python here, using python version 2.4.3
Edit just an F.Y.I:
What I am trying to do is basically I have a 250+Gb file which has control characters showing start and end of line but these Ctrl Characters (because of issues.. mostly network) are embedded within these lines i.e. in between the start/end indicating control characters.
With that, there is no specific distinction between the start/end control chars and the ones that come in between these messages.
So I am basically removing these control chars, and I wish to have one complete message per line, where each message is identified by some specific regex.
The regex here is not necessarily ABC or in order for all of these messages.
I have tried using findall and am able to find all the matches; I just did not know how to get the strings following them until I find the next match. (The regex here can be either -ABC=35nga|DEF=64325:dfaf:1234| or **ABC=35632|DEF=61 and many different forms.)
And I have to break for each line, including the ones which have multiple lines embedded within a line.
Using re.findall:
See the regex in action on regex101.
s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
re.findall("ABC.*?(?=ABC|$)",s)
which gives a list:
['ABC97edf', 'ABCaaabbdd1234', 'ABC0009ui50', 'ABC_1234']
And if you wanted to print the elements in this list, you could simply do:
for sub in re.findall("ABC.*?(?=ABC|$)",s):
    print(sub)
which would output:
ABC97edf
ABCaaabbdd1234
ABC0009ui50
ABC_1234
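Since the question notes that the marker is not always ABC, the same lookahead idea can be sketched with the marker passed in as a parameter. The helper name `split_at_markers` is hypothetical. It assumes the marker fragment contains no capturing groups of its own, since findall would then return those groups instead of the whole match. The example is written for Python 3.

```python
import re


def split_at_markers(text, marker):
    """Return each marker match plus the text up to the next marker.

    `marker` is a regex fragment; it is wrapped in (?:...) so that
    alternations such as "ABC|DEF" work inside the larger pattern.
    """
    pattern = r"(?:%s).*?(?=(?:%s)|$)" % (marker, marker)
    return re.findall(pattern, text)


s = "123ABC97edfABCaaabbdd1234ABC0009ui50ABC_1234"
for sub in split_at_markers(s, "ABC"):
    print(sub)
```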

How to combine multiple regular expressions into one line?

My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
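One practical detail when combining patterns with | is telling which alternative matched. A minimal sketch, assuming named groups (`image` and `video` are names chosen here for illustration) and a made-up `doc` fragment built from the two URLs in the question:

```python
import re

# Hypothetical document fragment built from the URLs in the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'
)

# The two patterns from the answer, joined with | and given group names.
combined = re.compile(
    r'src.\"(?P<image>\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(?P<video>http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

images, videos = [], []
for m in combined.finditer(doc):
    # lastgroup names the alternative that matched: "image" or "video".
    if m.lastgroup == "image":
        images.append(m.group("image"))
    else:
        videos.append(m.group("video"))

print(images)
print(videos)
```

This makes a single pass over the document and still keeps the two kinds of links in separate lists.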

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a capturing group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or make the inner group non-capturing, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
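The behaviour described above can be checked with a short sketch. The variable names are illustrative, the addresses are kept exactly as written in the question, and re.IGNORECASE stands in for Perl's /i flag.

```python
import re

# The addresses are kept exactly as they appear in the question.
to_lines = ["support#company.com", "1234567#tickets.company.com"]

grouped = re.compile(r"\w+#(tickets\.)?company\.com", re.IGNORECASE)
ungrouped = re.compile(r"\w+#(?:tickets\.)?company\.com", re.IGNORECASE)

for line in to_lines:
    # findall with a capturing group returns only that group's text
    # (an empty string when the optional group did not take part).
    print(grouped.findall(line))
    # With a non-capturing group, findall returns the whole match.
    print(ungrouped.findall(line))
    # search().group(0) is the whole match regardless of groups.
    print(grouped.search(line).group(0))
```

So the first address is in fact matched by the original pattern; findall just reports the empty optional group rather than the full address.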
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. In both Perl and Python, your regex captures only "tickets." when it is present. You probably want something like this:
#!/usr/bin/python
import re
# The inner group is non-capturing, so findall returns the whole address.
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
