How to use ^ and $ to parse simple expression? - python

How do I use the ^ and $ symbols to parse only /blog/articles in the following?
I've created ex3.txt that contains this:
/blog/article/1
/blog/articles
/blog
/admin/blog/articles
and the regex:
^/blog/articles$
doesn't appear to work, as in when I type it using 'regetron' (see learning regex the hard way) there is no output on the next line.
This is my exact procedure:
At command line in the correct directory, I type: regetron ex3.txt. ex3.txt contains one line with the following:
/blog/article/1 /blog/articles /blog /admin/blog/articles
although I have tried it with newlines between entries.
I type in ^/blog/article/[0-9]$ and nothing is returned on the next line.
I try the first solution posted, ^\/blog\/articles$, and nothing is returned.
Thanks in advance SOers!

Change your regex to:
^\/blog\/articles$
Escaping the slashes is harmless, though Python's re module does not require it, since / is not a special character there.
Also, ensure there are no trailing spaces on the end of each line in your ex3.txt file.

Based on your update, it sounds like ^ and $ might not be the right operators for you. Those match the beginning and end of a line respectively. If you have multiple strings that you want to match on the same line, then you'll need something more like this:
(?:^|\s)(\/blog\/articles)(?:$|\s)
What this does:
(?:^|\s) Matches, but does not capture (?:), a line start (^) OR (|) a whitespace (\s)
(\/blog\/articles) Matches and captures /blog/articles.
(?:$|\s) Matches, but does not capture (?:), a line end ($) OR (|) a whitespace (\s)
This will work for both cases, but be aware that it will match (but will not capture) up to a single whitespace before and after /blog/articles.
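To make the difference concrete, here is a minimal sketch using Python's re module directly rather than regetron. The two sample strings are assumptions that mirror the two layouts of ex3.txt described in the question, and the slashes are left unescaped because Python does not need them escaped.

```python
import re

# Assumed file contents: one entry per line vs. everything on one line.
lines_text = "/blog/article/1\n/blog/articles\n/blog\n/admin/blog/articles\n"
one_line_text = "/blog/article/1 /blog/articles /blog /admin/blog/articles"

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
anchored = re.compile(r"^/blog/articles$", re.MULTILINE)
per_line_matches = anchored.findall(lines_text)

# On a single line the anchors cannot isolate one entry.
anchored_on_one_line = anchored.findall(one_line_text)

# The whitespace-delimited pattern handles entries that share a line.
delimited = re.compile(r"(?:^|\s)(/blog/articles)(?:$|\s)")
same_line_matches = delimited.findall(one_line_text)

print(per_line_matches)      # ['/blog/articles']
print(anchored_on_one_line)  # []
print(same_line_matches)     # ['/blog/articles']
```

Note that findall returns only the captured group here, so the surrounding whitespace does not show up in the result.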

Related

How can I alphabetize Python functions using Sublime Text?

I installed a plugin that will alphabetize blocks. I just need a way to select all the defs in a python file. So far I've got this regex.
This doesn't select the last line because there isn't any newline. I could enter a newline at the end, but I'd like to avoid that. In fact, ideally I'd like to avoid grabbing all the newlines above.
But I'm worried that if I don't grab the newline, then it won't match functions that have a blank line in the middle.
If there's a better way than what I'm trying--by selecting the blocks and using an alphabetizer plugin--then please suggest it. Otherwise, is there some way I can get the regex to match just the defs?
def.+(\n?\n.+)+
Will accomplish what you want. (Sublime seems to follow the usual "dot is not newline" convention)
Breaking down the components of the expression:
def.+ - match the def line, up to a newline
\n?\n.+ - match a newline, followed by some characters, optionally prepended by another newline (the prepend handles the case of an empty line in the middle of a def)
(...)+ - start a capture group, and match its pattern one or more times
(\n?\n.+)+ - combine the previous two pieces, so we match any sequence of non-empty lines with at most one empty line between any two non-empty lines (pedantically, any sequence of non-empty-line and empty-line-then-non-empty-line blocks)
The final + could be a * instead if it's permissible to match "empty" defs like
def empty():
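Outside Sublime, the same expression can be tried in Python to see what it selects. This is only a sketch: the sample source is made up, and it assumes functions are separated by two blank lines, because with a single blank line between them the pattern would merge neighbouring functions into one match.

```python
import re

# Hypothetical file contents: two functions, two blank lines apart,
# one blank line inside the first function, no trailing newline.
source = (
    "def beta():\n"
    "    x = 1\n"
    "\n"
    "    return x\n"
    "\n"
    "\n"
    "def alpha():\n"
    "    return 0"
)

pattern = re.compile(r"def.+(\n?\n.+)+")

# group(0) is the whole matched block for each function.
blocks = [m.group(0) for m in pattern.finditer(source)]

# Sorting the blocks is what the alphabetizer plugin would do.
alphabetized = sorted(blocks)

for block in alphabetized:
    print(block)
    print("---")
```

The last function is matched even though the text does not end with a newline, and the blank line inside beta() stays part of its block.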
Try this
^(\s*)(def.*(?:\n\1\s+.*|\n\s*)+$)

Regex to match and clean quotes in python

I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
As you requested this for learning purpose, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which also keeps the quotation marks out of the result.
Inside the quotes we lazily match anything, including newlines, with [\s\S]+?
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
Arguably, we do not need any regex flags here. Python's re module has no g (global) flag; use re.findall() or re.finditer() to get multiple matches. The dot does not match a newline by default, which is why [\s\S] is used to get line-spanning results.
The m (re.MULTILINE) flag changes the positional anchors ^ and $ so that they match at the start and end of each line instead of the whole string. Either way, placing one of these anchors in the middle of the pattern, as in “[A-Z].+$”, is just wrong.
One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
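A small sketch of both gotchas, re.match versus re.search and the dot versus [\s\S]. The sample text is an assumption standing in for str(q), and re.DOTALL is shown only as an alternative to [\s\S]; it is not part of the answer above.

```python
import re

# Assumed stand-in for str(q): a quote that wraps onto a second line.
text = (
    "\n“If you want to know what a man's like, take a good look at how he\n"
    "treats his inferiors, not his equals.”\n"
)

lookaround = r"(?<=“)[\s\S]+?(?=”)"

# re.match only tries position 0, where nothing precedes the match,
# so the lookbehind for the opening quote cannot succeed.
match_result = re.match(lookaround, text)

# re.search scans forward until the pattern fits.
search_result = re.search(lookaround, text)

# The plain dot stops at the newline inside the quote, so this fails...
dot_result = re.search(r"(?<=“).+?(?=”)", text)

# ...unless re.DOTALL lets the dot match newlines too.
dotall_result = re.search(r"(?<=“).+?(?=”)", text, re.DOTALL)

print(match_result)           # None
print(search_result.group())  # the quote text, without the quote marks
print(dot_result)             # None
```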
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ comes before the closing ”, which is logically not possible.
To use $ in regexes for multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not look for a part of the string that matches the regular expression.
This means re.search should do what you initially expected.
So the resulting regex could be:
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
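To see the effect of the flag, here is the same pattern run against the sample from the question, with and without re.M. The txt value is an assumption (the "..." line is left out), and the rstrip() call is an extra step to drop the space that sits in front of the comment.

```python
import re

# Assumed file contents, based on the example in the question.
txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

pattern = r'^\s*description\s*=\s*(.*?)(?=(//)|$)'

# With re.M, ^ can match at the start of the description line.
with_flag = re.search(pattern, txt, re.M)

# Without it, ^ only matches at the very start of txt, where "PART" is.
without_flag = re.search(pattern, txt)

# The lazy group stops right before "//", so a trailing space remains.
value = with_flag.group(1).rstrip()

print(repr(with_flag.group(1)))  # 'Some description here. '
print(repr(value))               # 'Some description here.'
print(without_flag)              # None
```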
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
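If converting the file is an option, a configparser version might look like the sketch below. The INI text is a hypothetical rewrite of the PART { ... } block, with the section name [PART] chosen for illustration and the // comments dropped; configparser cannot read the original format as it stands.

```python
import configparser

# Hypothetical INI rewrite of the original PART { ... } block.
ini_text = """
[PART]
title = Some Title
description = Some description here.
tags = qwe rty
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Values are looked up by section name, then by key.
description = parser["PART"]["description"]
title = parser["PART"]["title"]

print(description)  # Some description here.
```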
This is a pretty simple regex: you just need a positive lookbehind, and optionally something to remove the comments, since .* on its own also matches any trailing // comment.
r"(?<=description = ).*"

Python regex example

If I want to replace a pattern in the following statement structure:
cat&345;
bat &#hut;
I want to replace elements starting from & and ending before (not including ;). What is the best way to do so?
Including or not including the & in the replacement?
>>> re.sub(r'&.*?(?=;)','REPL','cat&345;') # including
'catREPL;'
>>> re.sub(r'(?<=&).*?(?=;)','REPL','bat &#hut;') # not including
'bat &REPL;'
Explanation:
Although not required here, use a r'raw string' to prevent having to escape backslashes which often occur in regular expressions.
.*? is a "non-greedy" match of anything, which makes the match stop at the first semicolon.
(?=;) the match must be followed by a semicolon, but it is not included in the match.
(?<=&) the match must be preceded by an ampersand, but it is not included in the match.
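The non-greedy part matters as soon as there is more than one semicolon in the string. A quick comparison, using a made-up input that puts both sample entries on one line:

```python
import re

# Assumed input: both entries from the question on a single line.
text = "cat&345; bat &#hut;"

# Lazy: each match stops at the first semicolon after its ampersand.
lazy = re.sub(r"(?<=&).*?(?=;)", "REPL", text)

# Greedy: the match runs from the first ampersand to the last semicolon.
greedy = re.sub(r"(?<=&).*(?=;)", "REPL", text)

print(lazy)    # cat&REPL; bat &REPL;
print(greedy)  # cat&REPL;
```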
Here is a good regex
import re
result = re.sub("(?<=\\&).*(?=;)", replacementstr, searchText)
Basically this will put the replacement in between the & and the ;
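Maybe go in a different direction altogether and use HTMLParser.unescape(). The unescape() method is undocumented, but it doesn't appear to be "internal" because it doesn't have a leading underscore. (In Python 3.4 and later, the documented equivalent is html.unescape(); HTMLParser.unescape() was removed in Python 3.9.)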
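On Python 3 this would go through html.unescape(). A minimal sketch, assuming the text contains real HTML character references (the sample string is made up; the &345; and &#hut; strings from the question are not valid references, so this approach only helps if the actual data is HTML-escaped text):

```python
import html

# Made-up input containing named and numeric character references.
escaped = "fish &amp; chips &#38; peas &lt;3"

# html.unescape() converts the references back to plain characters.
decoded = html.unescape(escaped)

print(decoded)  # fish & chips & peas <3
```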
You can use negated character classes to do this:
import re
st='''\
cat&345;
bat &#hut;'''
for line in st.splitlines():
    print(line)
    print(re.sub(r'([^&]*)&[^;]*;', r'\1;', line))

match part of a string until it reaches the end of the line (python regex)

If I have a large string with multiple lines and I want to match part of a line only to end of that line, what is the best way to do that?
So, for example I have something like this and I want it to stop matching when it reaches the new line character.
r"(?P<name>[A-Za-z\s.]+)"
I saw this in a previous answer:
$ - indicates matching to the end of the string, or end of a line if
multiline is enabled.
My question is then how do you "enable multiline" as the author of that answer states?
Simply use
r"(?P<name>[A-Za-z\t .]+)"
This will match ASCII letters, spaces, tabs or periods. It'll stop at the first character that's not included in the group - and newlines aren't (whereas they are included in \s, and because of that it's irrelevant whether multiline mode is turned on or off).
You can enable multiline matching by passing re.MULTILINE as the second argument to re.compile(). However, there is a subtlety to watch out for: since the + quantifier is greedy, this regular expression will match as long a string as possible, so if the next line is made up of letters and whitespace, the regex might match more than one line (with re.MULTILINE, $ matches at the end of any line).
There are three solutions to this:
Change your regex so that, instead of matching any whitespace including newline (\s) your repeated character set does not match that newline.
Change the quantifier to +?, the non-greedy ("minimal") version of +, so that it will match as short a string as possible and therefore stop at the first newline.
Change your code to first split the text up into an individual string for each line (using text.split('\n')).
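The three options above can be sketched side by side. The sample text is an assumption, and the lazy variant adds a trailing $ to the pattern, since a lazy quantifier with nothing after it would stop after a single character.

```python
import re

# Assumed input: two lines made up of letters and spaces.
text = "John Smith\nJane Doe\n"

# Original pattern: \s includes the newline, so the match spans lines.
greedy_ws = re.match(r"(?P<name>[A-Za-z\s.]+)", text).group("name")

# Option 1: a character set that leaves the newline out.
no_newline = re.match(r"(?P<name>[A-Za-z\t .]+)", text).group("name")

# Option 2: lazy quantifier, anchored with $ under re.MULTILINE.
lazy = re.match(
    r"(?P<name>[A-Za-z\s.]+?)$", text, re.MULTILINE
).group("name")

# Option 3: split into lines first, then match each line on its own.
per_line = [
    re.match(r"(?P<name>[A-Za-z\s.]+)", line).group("name")
    for line in text.split("\n")
    if line  # skip the empty string after the final newline
]

print(repr(greedy_ws))  # 'John Smith\nJane Doe\n'
print(no_newline)       # John Smith
print(lazy)             # John Smith
print(per_line)         # ['John Smith', 'Jane Doe']
```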
Look at the flags parameter at http://docs.python.org/library/re.html#module-contents
