matching regular expressions in python which contains URLs

I have a list of URLs from which I am trying to extract just the id numbers. I am trying to solve this using a combination of urlparse and regular expressions. Here is what my function looks like:
def url_cleanup(url):
    parsed_url = urlparse(url)
    if parsed_url.query=="fref=ts":
        return 'https://www.facebook.com/'+re.sub('/', '', parsed_url.path)
    else:
        qry = parsed_url.query
        result = re.search('id=(.*)&fref=ts',qry)
        return 'https://www.facebook.com/'+result.group(1)
However, the regular expression in result = re.search('id=(.*)&fref=ts',qry) fails to match some of the URLs, as shown in the examples below.
#1
id=10001332443221607 #No match
#2
id=6383662222426&fref=ts #matched
Following the suggestion in this answer, I rephrased my regular expression as id=(.*).+?(?=&fref=ts), which again matches #2 but not #1 in the examples above.
I am not sure what I am missing here. Any suggestion or hint would be much appreciated.

Your regexes are indeed wrong.
The expression id=(.*)&fref=ts only matches ids that are followed by the literal text &fref=ts.
The expression id=(.*).+?(?=&fref=ts) has the same limitation. The lookahead (?=&fref=ts) is a zero-width assertion: it is not included in the match, but it still requires &fref=ts to follow, so an id without that suffix is not matched. In addition, .+? must consume at least one character, so group 1 loses the last character of the id.
Moreover, id=(.*) matches ids made of digits, letters, symbols, literally anything. Using id=\d+ matches digit-only ids.
So, try using
result = re.search(r'id=(\d+)', qry)
This catches just the digits, assuming your ids are always numeric, and captures them (using the parentheses) for later use.
For further reference, refer to
http://www.regular-expressions.info/python.html
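To make the difference concrete, here is a small sketch that runs the question's two patterns and the digit-only pattern against the two query strings from the question. The names queries, patterns and results are only for illustration.

```python
import re

# The two query strings from the question.
queries = ['id=10001332443221607', 'id=6383662222426&fref=ts']

# The question's two attempts, followed by the digit-only pattern.
patterns = [
    r'id=(.*)&fref=ts',
    r'id=(.*).+?(?=&fref=ts)',
    r'id=(\d+)',
]

results = {}
for pattern in patterns:
    for qry in queries:
        match = re.search(pattern, qry)
        # Store group 1, or None when the pattern does not match at all.
        results[(pattern, qry)] = match.group(1) if match else None
        print(f'{pattern!r:28} {qry!r:30} -> {results[(pattern, qry)]}')
```

Only the last pattern returns the full id for both inputs. The second pattern does match #2, but its group holds 638366222242, one digit short, because .+? has to consume the final digit.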

Your regex needs a slight tweak. Try:
result = re.search(r'id=(\d+)(&fref=ts)?', qry)
id=(\d+) matches one or more digits following id=, and (&fref=ts)? makes the trailing &fref=ts optional. Because the suffix is captured in its own group, you can add it back if necessary.
Note that result.group(1) raises an AttributeError when no match is found, because re.search then returns None. You may want to guard against that:
result = re.search(r'id=(\d+)(&fref=ts)?', qry)
if result:
    return 'https://www.facebook.com/'+result.group(1)
else:
    return None  # no id found; handle the error as appropriate
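Putting the pieces together, one possible shape for the whole function is sketched below. It is not the only way to write it: the (?:^|&) prefix is my own addition so that a parameter such as fbid= is not mistaken for id=, the fallback to the path mirrors the question's first branch, and the example URLs are made up.

```python
import re
from urllib.parse import urlparse

# Match id=<digits> only at the start of the query or right after an '&'.
ID_PATTERN = re.compile(r'(?:^|&)id=(\d+)')

def url_cleanup(url):
    """Return a canonical profile URL built from the id or, failing that, the path."""
    parsed_url = urlparse(url)
    match = ID_PATTERN.search(parsed_url.query)
    if match:
        return 'https://www.facebook.com/' + match.group(1)
    # No numeric id in the query: fall back to the path, as the question does.
    return 'https://www.facebook.com/' + parsed_url.path.replace('/', '')

# Hypothetical URLs, for illustration only.
print(url_cleanup('https://www.facebook.com/profile.php?id=10001332443221607'))
print(url_cleanup('https://www.facebook.com/profile.php?id=6383662222426&fref=ts'))
print(url_cleanup('https://www.facebook.com/some.user?fref=ts'))
```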

Related

Beautiful soup if class not like "string" or regex

I know from a post here that Beautiful Soup can match classes against a regex that contains certain strings. Below is a code example from that post:
regex = re.compile('.*listing-col-.*')
for EachPart in soup.find_all("div", {"class" : regex}):
    print(EachPart.get_text())
Now, is it possible to do the opposite? Basically, find classes that do not match a certain regex. In SQL terms, it's like:
where class not like '%test%'
Thanks in advance!

This can be done with a negative lookahead.
A negative lookahead has the syntax (?!«pattern») and succeeds only if pattern does not match the text that follows the current position in the input string.
In your case, you could use the following regex to match all classes that don't contain listing-col- in their name:
regex = re.compile(r'^((?!listing-col-).)*$')
Here is a breakdown of the regex ^((?!listing-col-).)*$:
^ asserts position at the start of the string
((?!listing-col-).)* is a capturing group repeated by *
    * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed
    (?!listing-col-) is a negative lookahead: it asserts that the text at this position does not match the pattern inside it
        listing-col- matches the characters listing-col- literally (case sensitive)
    . matches any character (except a line break)
$ asserts position at the end of the string (or just before a trailing newline)
Also, you may find the https://regex101.com site useful
It will help you test your patterns and show you a detailed explanation of each step. It's your best friend in writing regular expressions.
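As a quick check of the pattern on its own, independent of Beautiful Soup, the sketch below filters a list of class strings. The class names are made up for illustration.

```python
import re

# Matches only strings that do not contain 'listing-col-' anywhere.
regex = re.compile(r'^((?!listing-col-).)*$')

# Hypothetical class attribute values.
classes = ['listing-col-left', 'sidebar', 'main listing-col-right', 'footer']

kept = [name for name in classes if regex.search(name)]
print(kept)
```

Every string that contains listing-col- is rejected, wherever the substring occurs, so only 'sidebar' and 'footer' remain.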

One possible solution is to use a regex directly.
See the question "Regular expression to match a line that doesn't contain a word".
Alternatively, you can write a function that implements the logic and pass it to find_all as the filter.
See https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#find-all
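A minimal sketch of the function-based approach follows. The helper name lacks_listing_col is hypothetical, and treating a tag without any class as a match is my own choice; the predicate is kept free of Beautiful Soup so it can be tried in isolation.

```python
def lacks_listing_col(css_classes):
    """Return True if no class name in the list contains 'listing-col-'.

    css_classes is the list of class names of one tag, or None when the
    tag has no class attribute (treated here as "does not contain").
    """
    return not any('listing-col-' in name for name in (css_classes or []))

print(lacks_listing_col(['sidebar']))
print(lacks_listing_col(['main', 'listing-col-2']))
print(lacks_listing_col(None))
```

Assuming soup is an existing BeautifulSoup object, the predicate could then be wired up as soup.find_all(lambda tag: tag.name == 'div' and lacks_listing_col(tag.get('class'))), since find_all accepts a function that takes a tag and returns True or False.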

You can use CSS selector syntax with the :not() pseudo-class and the *= (contains) operator:
data = [i.text for i in soup.select('div[class]:not([class*="listing-col-"])')]

Positive Lookbehind Stripping Out Metacharacters

I need to get the sequence at the end of many URLs to label CSV files. The approach I have taken gives me the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the URL while ignoring any metacharacters. I know I can use re.sub() to delete them, but I am interested in learning how to complete the whole process in one regex.
I have searched through many posts on how to do this and experimented with many different approaches, mainly replacing the .+ after the (?<=series\-) with something that excludes the hyphens, but I haven't been able to make it work.
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
res = re.search(r"(?<=series\-).+", url).group(0)
re.sub('-', '', res)
This gives the desired result 'kbw10a'.
Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?
More examples:
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',

You cannot "ignore" characters in a lookaround the way you describe. A match is always a contiguous part of the string: the regex engine consumes it from left to right, matching each subpattern of your regex in turn.
The only way to achieve that is through an additional step that removes the hyphens once the match is found. Note that you do not need another regex to remove hyphens; .replace('-', '') will suffice:
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
resObj = re.search(r"series-(.+)", url)
if resObj:
    res = resObj.group(1).replace('-', '')
Note that it is much safer to first run re.search to get the match object and only then call .group(); otherwise, when there is no match, re.search returns None and calling .group() on it raises an AttributeError.
Also, there is no need for a lookaround in the pattern; a capturing group works just as well.
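Wrapped in a function and applied to the three URLs from the question, the two-step approach might look like the sketch below. The function name series_label is hypothetical.

```python
import re

def series_label(url):
    """Return the text after 'series-' with hyphens removed, or None if absent."""
    match = re.search(r"series-(.+)", url)
    if match is None:
        return None
    return match.group(1).replace('-', '')

urls = [
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
    'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
]

labels = [series_label(url) for url in urls]
print(labels)
```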

Getting the last occurrence of a match that is inside parenthesis using regular expressions

I want to use regular expressions to get the text inside parentheses in a sentence. But if the string has two or more occurrences, the pattern I am using gets everything in between. I googled it, and some sources told me to use a negative lookahead and a backreference, but that is not working as expected.
An example of a string is:
s = "Para atuar no (GCA) do (CNPEM)"
What I want is to get just the last occurrence: "(CNPEM)"
The pattern I am using is:
pattern = "(\(.*\))(?!.*\1)"
But when I run it (using Python's re module), I get this:
output = (GCA) do (CNPEM)
How can I get just the last occurrence in this case?

You could use re.findall here with a lazy \(.*?\), so that each parenthesized group is matched separately, and then access the last match:
s = "Para atuar no (GCA) do (CNPEM)"
last = re.findall(r'\(.*?\)', s)[-1]
print(last) # (CNPEM)
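Indexing with [-1] raises an IndexError when the string contains no parentheses at all. A slightly more defensive variant is sketched below; the function name last_parenthesized is hypothetical, and the negated class [^()]* is swapped in for the lazy .*? (it behaves the same on the example but never crosses another parenthesis).

```python
import re

def last_parenthesized(text):
    """Return the last '(...)' group in text, or None if there is none."""
    matches = re.findall(r'\([^()]*\)', text)
    return matches[-1] if matches else None

print(last_parenthesized("Para atuar no (GCA) do (CNPEM)"))
print(last_parenthesized("Para atuar no GCA"))
```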

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not capture the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than the last digit of the last group (3 in this case).
Any help would be greatly appreciated.

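First, you need a non-greedy .*?. The greedy .* consumes as much of the string as it can, so the group only gets what is left at the very end.
Second, you don't need to escape most of those characters inside [ ].
Third, why not just treat the number as a run of digits and allowed characters? Why is there a \d{1,3} when the example contains 4454?
>>> import re
>>> s = 'UDK .636.32/38.082.4454.2(575.3)'
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'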
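The sketch below isolates the two effects at play: greedy versus lazy .* in front of the group, and the fact that a repeated capturing group only keeps its last iteration. The variable names are illustrative.

```python
import re

s = 'UDK .636.32/38.082.4454.2(575.3)'
allowed = r'[\d.,()\[\]=\':"+/-]+'

# Greedy .* swallows almost everything and leaves one character for the group.
greedy = re.match(r'UDK.*(' + allowed + r')', s).group(1)

# Lazy .*? stops as early as possible, so the group gets the whole number.
lazy = re.match(r'UDK.*?(' + allowed + r')', s).group(1)

# A repeated group such as (\d)+ only remembers its final iteration.
last_iteration = re.match(r'(\d)+', '123').group(1)

print(repr(greedy), repr(lazy), repr(last_iteration))
```

This is why putting the + inside the group, on the character class, captures the whole number, while a + applied to the group itself does not.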

Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

re.match() multiple times in the same string with Python

I have a regular expression to find the :ABC:`hello` pattern. This is the code:
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
    ...
It works well when the pattern occurs once in a line, but for an input such as ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`" it finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I got it working with this code:
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    print(tag, ":::", value)
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL

Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    ....
Note that the groups use negated character classes instead of .*, so each match stops at the next delimiter rather than running on to the last one in the line.
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally, this is probably also making your program much slower than it needs to be, because .* first tries to eat the entire string and then backs up character by character as the rest of the expression tells it "that was too much, go back". Using finditer should perform much better.
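The difference can be seen side by side in the sketch below, which uses only the standard re module and the sample line from the question. The variable names are illustrative, and the pattern for finditer excludes both ':' and '`' from the tag, a slightly stricter choice than in the code above.

```python
import re

line = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"

# Leading greedy .* pushes the match as far right as possible.
greedy = re.compile(r".*:(.*):`(.*)`")
last_only = greedy.match(line).groups()

# No leading .*, and negated classes keep each group within one tag.
tagged = re.compile(r":([^:`]*):`([^`]*)`")
pairs = [m.groups() for m in tagged.finditer(line)]

print(last_only)
print(pairs)
```

The first pattern reports only ('tagbox', 'VHDL'), while finditer yields one (tag, value) pair for each of the three tags.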

A good place to start is the re module docs. In addition to re.match (which only matches at the beginning of the string), there is re.findall (which finds all non-overlapping occurrences of the pattern), and the match and search methods of compiled pattern objects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.

re.findall, or even better regex.findall, can do that for you in a single line:
import regex as re  # third-party regex package; or just: import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]
