Include entire string when using re.findall to find string between characters - python

When using re.findall like in my example below, is there a way to include the final four characters (.JPG)? As they may be lowercase or uppercase, I can't just stitch the result together with another string and be certain it will be correct. (In reality it's a list of dozens/hundreds of JPGs, some uppercase and some lowercase.)
I actually found the answer to this about 2 weeks ago but have since lost it (despite a lot of Googling).
I've done a lot of searching/reading and apologize if this exact problem has been asked before.
import re
examplestring = '/home/folder/image.JPG 200x400 20/12/2018'
print(re.findall(r'^(.*?).jpg', examplestring, flags=re.IGNORECASE))
Actual output:
['/home/folder/image']
I'm wanting the output to be:
['/home/folder/image.JPG']

Firstly, make sure to escape the dot since it's a special character in regex.
Either include .jpg in the group
^(.*?\.jpg)
or don't use a group at all
^.*?\.jpg
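As a quick check, here is a minimal sketch that runs both patterns against the string from the question; the variable names with_group and without_group are only illustrative.

```python
import re

examplestring = '/home/folder/image.JPG 200x400 20/12/2018'

# Extension inside the capture group: findall returns the group's text.
with_group = re.findall(r'^(.*?\.jpg)', examplestring, flags=re.IGNORECASE)

# No group at all: findall returns the whole match.
without_group = re.findall(r'^.*?\.jpg', examplestring, flags=re.IGNORECASE)

print(with_group)
print(without_group)
```

Both calls should return the path with the extension in its original case, because re.IGNORECASE only affects matching, not the text that is returned.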

Method 1
Maybe,
(?i)\S+\.jpg
or
(?i)\S+\.jpe?g
which also covers a .jpeg extension, might simply work OK.
We can include additional boundaries, if that'd be necessary, such as start anchor.
Also, the expression does not work if there would be any space in the dir names or filenames.
Method 2
If there would be horizontal spaces in the image path, then
(?i)^[^\r\n]+\.jpg
or
(?i)^[^\r\n]+\.jpe?g
would have been some options to explore.
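A small sketch of the difference between the two methods, assuming a made-up listing in which one path contains spaces. The listing text is hypothetical, and re.MULTILINE is added here (it is not part of the expression above) so that ^ anchors at the start of every line.

```python
import re

# Hypothetical listing; the first path contains spaces.
listing = (
    "/home/my folder/image one.JPG 200x400 20/12/2018\n"
    "/home/folder/image.jpg 200x400 20/12/2018\n"
)

# Method 1: \S+ cannot cross a space, so the first path is cut short.
no_spaces = re.findall(r'(?i)\S+\.jpg', listing)

# Method 2: everything from the start of the line up to the last ".jpg".
with_spaces = re.findall(r'(?i)^[^\r\n]+\.jpg', listing, flags=re.MULTILINE)

print(no_spaces)
print(with_spaces)
```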
Test
import re
string = '''
/home/folder/image.JPG 200x400 20/12/2018
/home/folder/image.jpg 200x400 20/12/2018
/home/folder/image.jpeg 200x400 20/12/2018
'''
expression = r'(?i)\S+\.jpe?g'
print(re.findall(expression, string))
Output
['/home/folder/image.JPG', '/home/folder/image.jpg', '/home/folder/image.jpeg']
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.

Related

Python regex expression example

I have an input that is valid if it has these parts:
starts with letters (upper and lower), numbers and some of the following characters (!,#,#,$,?)
then a part that begins with = and consists only of numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into five capture groups and not into three, which is what I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
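A minimal sketch of how that pattern could be used for validation in Python. The helper name parse and the choice of fullmatch (so the whole input has to fit the pattern) are assumptions for illustration, not something the question specifies.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)")

def parse(text):
    """Return the three parts as a tuple, or None if the input is invalid."""
    match = pattern.fullmatch(text)
    return match.groups() if match else None

print(parse('!!Hel##lo!#=7<<vbnfhfg'))
print(parse('Hello=x<<abc'))
```

The second call fails because the part after = is not numeric, so parse returns None.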

How to substitute a regex with another regex in a string

This question showed how to replace a regex with another regex like this
$string = '"SIP/1037-00000014","SIP/CL-00000015","Dial","SIP/CL/61436523277,45"';
$pattern = '["SIP/CL/(\d*),(\d*)",]';
$replacement = '"SIP/CL/\1|\2",';
$string = preg_replace($pattern, $replacement, $string);
print($string);
However, I couldn't adapt that pattern to solve my case where I want to remove the full stop that lies between 2 words but not between a word and a number:
text = 'this . is bad. Not . 820'
regex1 = r'(\w+)(\s\.\s)(\D+)'
regex2 = r'(\w+)(\s)(\D+)'
re.sub(regex1, regex2, text)
# Desired outcome:
'this is bad. Not . 820'
Basically, I'd like to remove the . between the two alphabetic words. Could someone please help me with this problem? Thank you in advance.
These expressions might be close to what you might have in mind:
\s[.](?=\s\D)
or
(?<=\s)[.](?=\s\D)
Test
import re
regex = r"\s[.](?=\s\D)"
test_str = "this . is bad. Not . 820"
print(re.sub(regex, "", test_str))
Output
this is bad. Not . 820
If you wish to explore, simplify, or modify the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.
Firstly, you can't really take PHP and apply it directly to Python, for obvious reasons.
Secondly, it always helps to specify which version of Python you're using as APIs change. Luckily in this instance, the API of re.sub has remained the same between Python 2.x and Python 3.
Onto your issue.
The second argument to re.sub is either a string or a function. If you pass in regex2 it'll just replace regex1 with the string contents of regex2, it won't apply regex2 as a regex.
If you want to use groups captured by the first regex (as your PHP example does with \1 and \2), you can pass a function that takes a match object as its sole argument, extracts the matching groups, and returns the replacement string. Backreferences such as \1 also work inside a plain replacement string.
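A minimal sketch of the function-based replacement, applied to the text from the question. The pattern is a slightly reduced variant of regex1 (it captures only the word before the dot and the first non-digit after it), and the helper name drop_dot is hypothetical.

```python
import re

text = 'this . is bad. Not . 820'

# Word, then " . ", then a non-digit character; only the outer parts are captured.
pattern = r'(\w+)\s\.\s(\D)'

def drop_dot(match):
    # Rebuild the matched text without the isolated full stop.
    return match.group(1) + ' ' + match.group(2)

with_function = re.sub(pattern, drop_dot, text)

# The same replacement written with backreferences in a replacement string.
with_backrefs = re.sub(pattern, r'\1 \2', text)

print(with_function)
print(with_backrefs)
```

The dot before 820 is left alone because the character after the second space is a digit, which \D does not match.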

Regex optional order of capturing group

I have a simple but tricky question about regex (in Python) that I have not found an answer to anywhere here or on Google. Is there any "trick" to make two capture groups match in either order? Let's say we have the following:
.*abc.*
What I want is to also match this:
.*acb.*
I know i could use
.*abc|acb.*
but the problem is that if we have something more complicated than abc, the code gets very long. Is there any workaround to say, e.g., "match the last two capturing groups (or symbols, etc.) in any order"?
I don't really get what is this in-any-order thing that would make the regex shorter. On the other hand, I can show you how to make this readable, even if you have tons of options.
import re
pattern = """
.* # match from starting the line
(?: # A non-capturing group starts so we can list lots of alternatives
abc| # alternative 1
acb # alternative 2
) # end of alternatives
.* # then match everything up to the end of the line
"""
re.search(pattern, 'qqabcqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqacbqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqaSDqq', re.VERBOSE) # does not return a match
So what did we just see here?
The """ ... """ construct is a convenient way to define multiline strings in python.
Then re.VERBOSE makes the regex engine skip whitespace and comments. As the manual says:
Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
These two things let you add structure and comments to your regex.
With standard regular expressions you can define patterns without order. Example:
[cdgjow]
Of course this example refers to characters.
Alternative sequences must be specified using "|". Example:
abc|cba
Classic regular expression syntax has no element for "these parts in any order", so you have to specify the alternatives yourself. This is a limit of the syntax, not of the automaton constructed from a regular expression.
That means you will have to construct the regular expression with all possible variants yourself. There are two ways to do this:
Do it manually. Take your time, be careful, and build the correct regex by hand.
Do it programmatically. Write some code that generates the regex you require.
If you do it manually, consider @TamasRev's answer. (Thanks @TamasRev! Nice answer!) But if I were you, I'd build the regex programmatically. (That's what programming was invented for anyway :-) )
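One way the programmatic route could look, as a rough sketch: generate every ordering of the parts with itertools.permutations and join them into a non-capturing alternation. The helper name any_order is hypothetical, and the number of alternatives grows factorially with the number of parts, so this only suits short lists.

```python
import re
from itertools import permutations

def any_order(parts):
    """Build a non-capturing group that matches the given literal parts in any order."""
    alternatives = (
        "".join(re.escape(part) for part in ordering)
        for ordering in permutations(parts)
    )
    return "(?:" + "|".join(alternatives) + ")"

# "a" followed by "b" and "c" in either order, as in the question.
pattern = re.compile(".*a" + any_order(["b", "c"]) + ".*")

print(any_order(["b", "c"]))
print(bool(pattern.search('qqabcqq')))
print(bool(pattern.search('qqacbqq')))
print(bool(pattern.search('qqaSDqq')))
```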

Python 2.7 Regex capture groups not working as predicted

I am trying to pattern match and replace first person with second person with Python 2.7.
string = re.sub(r'(\W)I(\W)', '\g<1>you\g<2>',string)
string = re.sub(r'(\W)(me)(\W)', '\g<1>you\g<3>',string)
# but does NOT work
string = re.sub(r'(\W)I|(me)(\W)', '\g<1>you\g<3>',string)
I want to use the last regex, but somehow the capture groups are all messed up and even doing a \g<0> shows strange, irregular matches. I would think that capture group 3 would be the last word boundary, but it doesn't appear to be.
A sample sentence could be: I like candy.
I am not interested very much in the correctness of the replacement (me will never actually be selected since I goes first), but I don't know why the capture groups don't work as I would expect.
Thanks!
Try with following regex.
Regex: \b(I|me)\b
Explanation:
\b on both sides marks the word boundary.
(I|me) matches either I OR me.
Note: You can make it case-insensitive with the i flag (re.IGNORECASE in Python).
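A minimal sketch of that regex used with re.sub. The sample sentence is extended with a made-up second sentence so that both alternatives are exercised. As background, | has the lowest precedence, so the pattern in the question reads as (\W)I or (me)(\W), which is why group 3 is not set whenever the I branch matches.

```python
import re

# Hypothetical sample; the question only gives "I like candy."
sentence = "I like candy. Give me some."

# \b needs no capture groups, so the replacement is just the new word.
result = re.sub(r'\b(I|me)\b', 'you', sentence)

print(result)
```

The "me" inside "some" is not replaced, since there is no word boundary in front of it.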

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some xml files, but have run into a bit of a speed bump. I will show an example of my xml:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant, irrelevant xml code. Focus primarily on the CustomerId and OrderId.
My issue lies in parsing a string, similar to the above statement. I have a regexParse definition that works perfectly. However it is not intuitive. I need to match only the part of the string that contains 44444444.
My Current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if the non-numeric text at the word boundary after them matches CustomerId, return the digits.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
import re
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
# Capture each tag name together with the digits it encloses
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easy for a human to read:
import re
xml = "<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>"
# Match 6 digits only when they directly follow "OrderId>"
a = re.compile(r"(?<=OrderId>)\d{6}")
print(a.findall(xml))  # ['123456']
# Match 8 digits only when they directly follow "CustomerId>"
b = re.compile(r"(?<=CustomerId>)\d{8}")
print(b.findall(xml))  # ['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
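A short sketch of the difference, using an 8-digit pattern to fit the CustomerId sample from the question. The plain pattern is assembled from pieces only so that the example avoids the invalid-escape warning that newer Python versions emit for \d in a non-raw literal.

```python
import re

text = "<CustomerId>44444444</CustomerId>"

# In a plain literal, \b is a single backspace character (0x08).
print('\b' == '\x08', len('\b'), len(r'\b'))

# What the re module receives without the r prefix: backspace, \d{8}, backspace.
plain_pattern = '\b' + r'\d{8}' + '\b'

# With a raw string, \b reaches the re module as a word-boundary assertion.
raw_pattern = r'\b\d{8}\b'

plain_result = re.findall(plain_pattern, text)
raw_result = re.findall(raw_pattern, text)

print(plain_result)
print(raw_result)
```

The plain pattern finds nothing because the text contains no backspace characters, while the raw pattern finds the digits.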
