Regex for conditionally extracting a named group - python

I have a requirement to write python flavoured regex to extract a field conditionally. The following are the two types of test strings that I need to extract from:
http://domain/string1/path/field_to_extract/path/filename
http://domain/string2/path/90020_10029/path/filename
Below is my requirement:
For string2 we should only pick the number at the fourth location, between slash (/) and (_).
For others we should pick the whole text between the slashes (/) at the fourth location.
I have written the following regex:
(?i)^(?:[^ ]*(?: {1,2})){6}(?:[a-z]+://)(?:[^ /:]+[^ /]/:]+[^ /]+/[^ /]+/)?(?:[^ /]+/){2}(?P<field_name>(?<=/string2/)(?:[^/]+/)([^_]+)|((?<!/string2/)(?:[^/]+/)([^/]+)))
Though the conditional extracting seems to be working fine, this regex also matches the string before the field that is extracted. For example, when used on the first test string, this regex matches path/field_to_extract and on the second it matches path/90020.
Though I have added ignore to the group before the required field, it does not seem to be working.
Please help me in getting the regex right.

Try with pattern '//[^/]+/[^/]+/[^/]+/(\d+(?=_)|[^/]+)'
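A quick check of this pattern against the two sample URLs from the question (a minimal sketch; the names `pattern`, `urls` and `results` are only illustrative). Note that it keys on the shape of the field (digits followed by an underscore) rather than on the presence of string2 earlier in the URL.

```python
import re

# Pattern from the answer above: skip domain and two path segments,
# then capture either leading digits before "_" or the whole segment.
pattern = re.compile(r'//[^/]+/[^/]+/[^/]+/(\d+(?=_)|[^/]+)')

urls = [
    'http://domain/string1/path/field_to_extract/path/filename',
    'http://domain/string2/path/90020_10029/path/filename',
]

results = [pattern.search(url).group(1) for url in urls]
print(results)  # ['field_to_extract', '90020']
```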

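How about using split() instead of a complex regex:
s = 'thelink'.split('/')
if len(s) > 5:
    string1or2 = s[3]
    field = s[5]
    if string1or2 == 'string2':
        # string2 URLs: keep only the number before the underscore
        print(field.split('_')[0])
    else:
        # all other URLs: keep the whole path segment
        print(field)
else:
    raise ValueError("Incorrect URL")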

A pure regex solution:
import re
urls = [
    r'''http://domain/string1/path/field_to_extract/path/filename''',
    r'''http://domain/string2/path/90020_10029/path/filename'''
]
for url in urls:
    print(re.search(r'(?<![:/])/(?:(string2)|[^/]*)/[^/]*/((?(1)[^_]*|[^/]*))', url).group(2))
Explanation:
(?<![:/])/ :: Search for a slash that doesn't follow another slash or a colon.
(?:(string2)|[^/]*)/ :: Match the literal "string2" or any other thing. If it's the first one, save it as group-1 to do a conditional yes-no-pattern later.
[^/]*/ :: Match the second part of the path. Not interesting.
((?(1)[^_]*|[^/]*)) :: If group 1 matched, match up to the first _ ([^_]*). Otherwise match up to the next slash ([^/]*).
It yields:
field_to_extract
90020
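Since the question asks for a named group, the same conditional can be wrapped in one. This is a sketch rather than a drop-in replacement: the helper name `extract_field` is hypothetical, and the yes-branch uses `[^_/]*` instead of `[^_]*` (an assumption on my part) so that it cannot run past the next slash when no underscore is present.

```python
import re

# Group 1 is set only when the segment is literally "string2";
# the named group then chooses its sub-pattern based on that.
pattern = re.compile(
    r'(?<![:/])/(?:(string2)|[^/]*)/[^/]*/(?P<field_name>(?(1)[^_/]*|[^/]*))'
)

def extract_field(url):
    """Return the conditionally extracted field, or None if nothing matches."""
    m = pattern.search(url)
    return m.group('field_name') if m else None

print(extract_field('http://domain/string1/path/field_to_extract/path/filename'))
print(extract_field('http://domain/string2/path/90020_10029/path/filename'))
```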

Related

How to catch a string using regex in python and replace it by desired string

I am new to Python and wrote the following code, which is supposed to find a specific string and replace it with another specific string.
sid=\"1722407313768658\"
I used this regex: sid=(.+?)
but it catches irrelevant string as well
https://tmobile.demdex.net/dest5.html?d_nsid=0#
Also, when I run this regex on sid=\"1722407313768658\" (replacing it with 1900117189066752), I get the following result, which does not replace the string but adds to it: sid=\1900117189066752\ "1722407313768658\"
(Instead of 1722407313768658 I want to have 1900117189066752.)
this is my python code:
import re
content = c.read()
################################################################
# change sessionid in content
replace_small_sid = str('sid=\\' + "\\"+str(sid) + "\\" + " ")
content = re.sub("sid=(.+?)", replace_small_sid, content)
As I understand it you wish to match string patterns in the form:
sid=\"1722407313768658\"
With the aim of replacing the digits.
To achieve this we can use positive lookbehinds and lookaheads as described here:
https://www.regular-expressions.info/lookaround.html
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
In this case our lookbehind will match
sid=\"
Our lookahead will match
\"
Please see the example here: https://regex101.com/r/2pXcMI/2
Finally, we can use this to match and replace as follows:
import re
line = "sid=\"1722407313768658\" safklabsf ipashf oiasfoi asbg fasnk sid=\"65641\" asjobfaosb asbfaosb asf asfauv sid=\"651564165\"."
replace_with = '1900117189066752'
line = re.sub(r'(?<=sid=")\d+(?=")', replace_with, line)
line
This returns
'sid="1900117189066752" safklabsf ipashf oiasfoi asbg fasnk sid="1900117189066752" asjobfaosb asbfaosb asf asfauv sid="1900117189066752".'
Since you want to replace a specific string, you can simply do:
content = content.replace("1722407313768658", "1900117189066752")
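To see why the original pattern misbehaves, here is a small sketch. It assumes the quotes appear in the content as plain `"` characters (if the file really contains literal backslashes, the pattern would need adjusting), and the sample `content` string is made up from the fragments in the question. A lazy group at the very end of a pattern matches as little as possible, so `sid=(.+?)` captures a single character, and without a word boundary it also matches inside `d_nsid=`.

```python
import re

# Hypothetical content assembled from the snippets in the question.
content = 'https://tmobile.demdex.net/dest5.html?d_nsid=0# sid="1722407313768658"'
new_sid = '1900117189066752'

# The lazy group grabs exactly one character, and "d_nsid=" matches too.
print(re.findall(r'sid=(.+?)', content))  # ['0', '"']

# \b keeps "d_nsid" out; requiring the quotes and digits pins down the value.
fixed = re.sub(r'\bsid="\d+"', 'sid="{}"'.format(new_sid), content)
print(fixed)
```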

Regex to match and clean quotes in python

I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quote:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
As you requested this for learning purpose, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which also removes the quotes from the result.
Inside the quotes we lazily match anything, including line breaks, with [\s\S]+?
Online Demo
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quote:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
We do not need any regex flags here. Python's re module has no g (global) flag; use re.findall() or re.finditer() to get multiple matches. The re.MULTILINE flag is not needed either: it only changes the behavior of the positional anchors ^ and $, so that they match at the start and end of each line instead of the whole string. To get results that span lines, use [\s\S] instead of the dot (or pass re.DOTALL).
In your pattern the $ sits in front of the closing quote, so it requires the end of the line to come before ”, which can never match. Adding a positional anchor in the middle like that is just wrong.
One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ is placed before the closing ”, which is logically not possible.
To use $ per line in multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not search for a part of the string that matches the regular expression.
That means re.search should do what you initially expected to accomplish.
So the resulting regex could be:
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)
Note that . does not match newlines, so this only finds quotes that sit on a single line.
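To try the search-based approach on text shaped like the quotes in the question, here is a minimal sketch. The list `raw_quotes` is a stand-in I made up for what `q.text` might return (the real scraped strings may differ), and collapsing whitespace with `split()`/`join()` is my own addition to tidy quotes that wrap across lines.

```python
import re

# Hypothetical stand-ins for q.text, modelled on the printed samples.
raw_quotes = [
    "“Don't cry because it's over, smile because it happened.”\n―\n\n",
    "“If you want to know what a man's like, take a good look at how he\n"
    "treats his inferiors, not his equals.”\n―\n,\n\n",
]

# [\s\S]+? lazily matches any character, including newlines,
# up to the first closing curly quote.
quote_re = re.compile(r'“([\s\S]+?)”')

cleaned_quotes = []
for text in raw_quotes:
    m = quote_re.search(text)
    if m:
        # Collapse line breaks and runs of whitespace inside the quote.
        cleaned_quotes.append(' '.join(m.group(1).split()))

for quote_text in cleaned_quotes:
    print(quote_text)
```

Checking `if m:` before calling `m.group()` is what avoids the AttributeError on elements where nothing matches.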

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried:
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
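If only the text between the two markers is wanted, a capture group does it. The following is a sketch under a few assumptions: `find_inner` is a hypothetical variant of the function above, the sample text is copied from the question, and the lazy `.*?` (my choice) stops at the first occurrence of the suffix instead of the last.

```python
import re

text = """sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
"""

def find_inner(prefix, suffix, text):
    """Return the text strictly between prefix and suffix, or None."""
    # re.escape makes both markers literal; re.DOTALL lets . cross newlines.
    pattern = r"{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

inner = find_inner("_abc:", "xyz", text)
print(inner)
```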

python regex pattern to extract value between two characters

I am trying to extract an id number from urls in the form of
http://www.domain.com/some-slug-here/person/237570
http://www.domain.com/person/237570
either one of these urls could also have params on them
http://www.domain.com/some-slug-here/person/237570?q=some+search+string
http://www.domain.com/person/237570?q=some+search+string
I have tried the following expressions to capture the id value of '237570' from the above urls; each one kind of works, but none works across all four url scenarios.
(?<=person\/)(.*)(?=\?)
(?<=person\/)(.*)(?=\?|\z)
(?<=person\/)(.*)(?=\??*)
What I am seeing is that it gets the 237570 but also includes the ? and the characters that come after it in the url. How can I say: stop capturing when you hit a ?, a /, or the end of the string?
String:
http://www.domain.com/some-slug-here/person/1234?q=some+search+string
http://www.domain.com/person/3456?q=some+search+string
http://www.domain.com/some-slug-here/person/5678
http://www.domain.com/person/7890
Regexp:
person\/(\d{1,})
Output:
>>> regex.findall(string)
[u'1234', u'3456', u'5678', u'7890']
Don't use .* to match the ID. . will match any character (except for line breaks, unless you use the DOTALL option). Just match a bunch of digits: (.*) --> (\d+)
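As a quick check, here is the digit-only pattern applied to the four URL shapes from the question (a minimal sketch; `id_re` and `ids` are illustrative names). Because `\d+` can only consume digits, the match stops on its own at `?`, `/`, or the end of the string, so no lookahead is needed.

```python
import re

urls = [
    'http://www.domain.com/some-slug-here/person/237570',
    'http://www.domain.com/person/237570',
    'http://www.domain.com/some-slug-here/person/237570?q=some+search+string',
    'http://www.domain.com/person/237570?q=some+search+string',
]

# Capture only the run of digits that follows "person/".
id_re = re.compile(r'person/(\d+)')

ids = [id_re.search(url).group(1) for url in urls]
print(ids)  # ['237570', '237570', '237570', '237570']
```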

Python regular expressions matching within set

While testing on http://gskinner.com/RegExr/ (online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exist, however, when I run this regex in python, it only return j or b. How do I make the regex take the whole word "jpg" or "bmp" inside the set ? This may have been asked before however I was not sure how to structure question to find the answer. Thanks !!!
Here is the whole regex if it helps
"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"
Its just basically to look for pictures in a url
Use (jpg|bmp) instead of square brackets.
Square brackets mean - match a character from the set in the square brackets.
Edit - you might want something like that: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
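The difference between the two forms is easy to see with re.findall (a small sketch; the sample string is made up):

```python
import re

sample = 'x.jpg y.bmp'

# Square brackets: a set of single characters (j, p, g, |, b, m).
as_set = re.findall(r'[jpg|bmp]', sample)
print(as_set)    # ['j', 'p', 'g', 'b', 'm', 'p']

# Parentheses with |: alternation between whole words.
as_group = re.findall(r'(?:jpg|bmp)', sample)
print(as_group)  # ['jpg', 'bmp']
```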
When you use [] you are creating a character class that contains all the characters between the brackets.
So you are not matching jpg or bmp; you are matching either a j or a p or a g or a | ...
You should add an anchor for the end of the string to your regex, to ensure that it checks for the file ending at the very end of the string:
http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
If you need double escaping, then apply it everywhere in your pattern:
http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
If you are searching a list of URLs
urls = [
    'http://some.link.com/path/to/file.jpg',
    'http://some.link.com/path/to/another.png',
    'http://and.another.place.com/path/to/not-image.txt',
]
to find ones that match a given pattern you can use:
import re
for url in urls:
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)
which will output
http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png
re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.
If you want just the extension, you can use the following:
for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:
        print(m.groups())
which will print
('jpg',)
('png',)
You will get just the extensions because that's what was defined as a group.
If you need to find the url in a long string of text (such as returned from wget), you need to use re.search() and enclose the part you are interested in with ( )'s. For example,
response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
kdlfjd dkkf aldfkaklfakldfkja df"""
reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)
print(reg.groups())
will print
('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')
or you can use re.findall or re.finditer in place of re.search to get all of the URL's in the long response. Search will only return the first one.
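A sketch of the re.finditer variant follows. The sample `response` and the pattern are my own assumptions, not part of the answer above: the pattern expects URLs to be delimited by whitespace, so it would not cope with a URL glued to surrounding text as in the earlier example.

```python
import re

# Hypothetical response text containing several whitespace-delimited URLs.
response = ("see http://some.url.com/path/to/file.jpg and "
            "http://other.url.com/img/pic.png plus http://x.com/notes.txt")

# \S+? keeps each match inside one whitespace-delimited token;
# \b makes sure the extension ends there (e.g. rejects ".jpgx").
image_url_re = re.compile(r'https?://\S+?\.(?:jpg|png|gif)\b')

image_urls = [m.group(0) for m in image_url_re.finditer(response)]
print(image_urls)
```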
