Pandas extract text notation - python

I'm new to Pandas, using it for a class, and I can't for the life of me find a resource that shows the notation used in pandas when representing text in the extract function. For example:
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
I know this is telling the extract function to extract everything inside the parentheses from examples done in class, but I don't understand which symbols mean what inside the extract function. Is there a resource that can explain what these symbols mean? Thank you.

In General
The string argument of .str.extract is a regular expression (regex), a language used for pattern matching and for extracting parts of strings. The section called "Regular Expression Patterns" in the linked regex reference lists the meaning of the special control characters. (The Python re module documentation covers the same ground under "Regular Expression Syntax".)
This Example
What specifically that regex string means is:
match any character, ., zero or more times, *, until a parenthesis, \(, then extract all the content in the parentheses, (.*), then close parenthesis, \), then any character zero or more times, .*, again.
Essentially this will match any string like: 'xxx(message)xxxx' or '(message)' or 'xx(message)' or '(message)x' and extract the 'message'.
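To see the breakdown in action, here is a minimal sketch using Python's re module, which accepts the same pattern syntax as .str.extract. The sample titles and the helper name extract_parenthesized are made up for illustration.

```python
import re

PATTERN = r'.*\((.*)\).*'  # the pattern from the question

def extract_parenthesized(title):
    """Return the text captured by the unescaped (.*) group, or None."""
    match = re.match(PATTERN, title)
    return match.group(1) if match else None

titles = [
    "Toy Story (1995)",
    "xx(message)xxxx",
    "Title (Alternate Title) (1995)",
    "No parentheses here",
]
results = [extract_parenthesized(t) for t in titles]
print(results)
```

Because the leading .* is greedy, the capture comes from the last parenthesized part of the string, which is why the third title yields the year rather than the alternate title. In pandas, .str.extract applies the pattern to every element of the column and gives NaN where nothing matches.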
Notes on Pandas and Regex
An important part of regular expressions (in general, but particularly for use in pandas with .str.extract) is capturing groups. You can 'capture' or grab part of a string by enclosing the pattern for that part in parentheses. Note that these are the unescaped parentheses in the regex (the inner set, with no preceding backslash) and not the actual parentheses that appear in the string itself, e.g. in 'xxx(message)xxx'.
Check out the docs on .str.extract for a few examples of using regex with capturing groups in pandas:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
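As a small illustration of escaped versus capturing parentheses, the sketch below uses the re module directly. The title string and the \d{4} year pattern are assumptions chosen for the example, not part of the original question.

```python
import re

title = "Toy Story (1995)"

# \( and \) match literal parentheses; the unescaped pair captures.
m = re.search(r'\((\d{4})\)', title)
whole_match = m.group(0)  # the full match, literal parentheses included
captured = m.group(1)     # only what the capturing group matched

# Two capturing groups: one for the name, one for the year.
name, year = re.match(r'(.*) \((\d{4})\)', title).groups()
print(whole_match, captured, name, year)
```

With .str.extract, each capturing group becomes its own column in the result, and a named group such as (?P<year>\d{4}) supplies the column name.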

Related

Python regex fuzzy searching

I have a question about making a pattern using fuzzy regex with the python regex module.
I have several strings such as TCATGCACGTGGGGCTGAC
The first eight characters of this string are variable (multiple options): TCAGTGTG, TCATGCAC, TGGTGGCT. In addition, there is a constant part after the variable part: GTGGGGCTGAC.
I would like to design a regex that can detect this string in a longer string, while allowing for at most 2 substitutions.
For example, this would be acceptable as two characters have been substituted:
TCATGCACGTGGGGCTGAC
TCCTGCACGTGGAGCTGAC
However, more substitutions should not be accepted.
In my code, I tried to do the following:
import regex
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG", "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
string_to_test = "TCATGCACGTGGGGCTGAC"
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)
pattern = regex.compile(r''+motif+'{s<=2}')
print(pattern.search(string_to_test))
I get a match when I run this code and when I change the last character of string_to_test. But when I manually add a substitution in the middle of string_to_test, I do not get any match (even while I want to allow up to 2 substitutions).
Now I know that my regex is probably total crap, but I would like to know what I exactly need to do to make this work and where in the code I need to add/remove/change stuff. Any suggestions/tips are welcome!
Right now, you only add the restriction to the last C in the pattern, which looks like (TCAGTGTG|TCATGCAC|TGGTGGCT|GATAAGTG|ATTAGACG|CACTTCCG|GTCTGTAT|TGTCAAAG)GTGGGGCTGAC{s<=2}.
To apply the {s<=2} quantifier to the whole expression you need to enclose the pattern within a non-capturing group:
pattern = regex.compile(fr'(?:{motif}){{s<=2}}')
The example above shows how to declare your pattern with the help of an f-string literal, where literal braces are defined with {{ and }} (doubled) braces. It yields the same result as pattern = regex.compile('(?:'+motif+'){s<=2}').
Also, note that r''+ is redundant and has no effect on the final pattern.
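To check what {s<=2} should accept without installing the third-party regex module, here is a rough standard-library sketch. It is not the fuzzy-matching algorithm that regex uses. It is a brute-force scan that counts mismatched positions (Hamming distance) against each full motif. The names mismatches and find_motif are made up for illustration.

```python
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG",
                  "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
CONSTANT_PART = "GTGGGGCTGAC"

def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_motif(text, max_subs=2):
    """Return (start, motif) for the leftmost window that is within
    max_subs substitutions of any full motif, or None if there is none."""
    motifs = [v + CONSTANT_PART for v in variable_parts]
    width = len(motifs[0])  # all motifs here have the same length
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        for motif in motifs:
            if mismatches(window, motif) <= max_subs:
                return start, motif
    return None

print(find_motif("TCCTGCACGTGGAGCTGAC"))  # two substitutions
print(find_motif("TCCTGCACGTGGAGCTGAT"))  # three substitutions
```

The first call should report a hit for the TCATGCAC motif, while the second should find nothing. That is the behaviour expected from the corrected (?:...){s<=2} pattern as well.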

Python to detect latex mathematics using regular expressions or other methods

I want to detect if a long text string (input from "somewhere") contains mathematical expressions encoded in LaTeX. This means searching for substrings (denoted ... in what follows) enclosed inside either of:
$...$
\[...\]
\(...\)
\begin{displaymath} ... \end{displaymath}
There are some variations of item 4 with keywords other than displaymath, and there may be whitespace inside the braces, etc., but I suppose I can figure out the rest once I get (1) through (4) working.
For (1), I suppose I can do the following:
import re
if re.search(r"$(\w+)$", str):
    (do something)
But I am having problems with the others, especially when it has the \. Help would be appreciated.
The python version should be 2.7.12 but ideally code that works for both versions 2.x and 3.x will be preferred.
You need to escape \, [, ], {, }, ( and ) because they have special meaning in regular expressions.
So, to match them literally, you need to add an extra \ before each of them.
For your second pattern, use:
\\\[(.+?)\\\]
For third pattern, use:
\\\((.+?)\\\)
For fourth pattern,
\\begin\{displaymath\}(.+?)\\end\{displaymath\}
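Putting the patterns together, a minimal sketch of a detector might look like the following. The \$(.+?)\$ pattern for the first case is an assumption added here (an unescaped $ is an end-of-string anchor, so it has to be escaped too), and the function name contains_latex_math is made up. The code avoids f-strings so that it runs on both Python 2.7 and 3.x.

```python
import re

MATH_PATTERNS = [
    r"\$(.+?)\$",                                       # $...$ (assumed)
    r"\\\[(.+?)\\\]",                                   # \[...\]
    r"\\\((.+?)\\\)",                                   # \(...\)
    r"\\begin\{displaymath\}(.+?)\\end\{displaymath\}",
]

# DOTALL lets . match newlines, so multi-line math is found as well.
MATH_RE = re.compile("|".join(MATH_PATTERNS), re.DOTALL)

def contains_latex_math(text):
    """Return True if any of the four delimiter styles occurs in text."""
    return MATH_RE.search(text) is not None

print(contains_latex_math(r"Euler: $e^{i\pi}+1=0$"))
print(contains_latex_math("plain text with no math"))
```

Note that the $...$ case is only a heuristic: ordinary text with two dollar signs, such as prices, would also be flagged.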

String after Escaped Characters in Regex

I'm using regular expressions in Python. I'm trying to pull out all the data between two delimiters: it starts with {"justin_h and ends with "}, special characters included. However, I'm having trouble with the regex syntax.
I've been using:
[{]["][justin_h...["][}]
And it returns no results. I know for a fact it's in there, and the [{]["] returns results, but it's when I start the string it doesn't seem to work. Where am I going wrong?
Use capturing groups or lookarounds.
r'\{"justin_h(.*?)"}'
Grab the string you want from group index 1. This won't work if the part you want to grab contains a newline character. For that case, you need to use the (?s) DOTALL flag.
r'(?s)\{"justin_h(.*?)"}'
Example:
>>> re.findall(r'\{"justin_h(.*?)"}', 'foo{"justin_hfoobar"}barfoo')
['foobar']
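The answer mentions lookarounds but only shows the capturing-group form. A sketch of both, on a made-up input that contains a newline, could be:

```python
import re

text = 'foo{"justin_hfoo\nbar"}barfoo'

# Capturing-group version, with (?s) so that . also matches the newline.
with_group = re.findall(r'(?s)\{"justin_h(.*?)"}', text)

# Lookaround version: the delimiters are asserted but not consumed,
# so the whole match is already the wanted text.
with_lookaround = re.findall(r'(?s)(?<=\{"justin_h).*?(?="\})', text)

print(with_group, with_lookaround)
```

The lookbehind works with the re module here because the text it asserts has a fixed length.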

How can I find all Markdown links using regular expressions?

In Markdown there are two ways to place a link: one is to just type the raw link, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com ].
I'm trying to write a regular expression that can match both of these, and, if it's the second match to also capture the display string.
So far I have this:
(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])
But this doesn't seem to match either of my two test cases in Debuggex:
http://example.com
(Example)[http://example.com]
Really not sure why the first one isn't matched at the very least, is it something to do with my use of the named group? Which, if possible I'd like to keep using because this is a simplified expression to match the link and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.
What am I doing wrong? Or is this not doable at all?
EDIT: I'm doing this in Python so will be using their regex engine.
The reason your pattern doesn't work is this part: (?<=\((.*)\)\[). The re module of Python doesn't allow variable-length lookbehind.
You can obtain what you want more conveniently with the third-party regex module for Python (the re module has fewer features in comparison).
Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])
pattern details:
(?|                                   # open a branch reset group
    # first case: there is only the url
    (?<txt>                           # in this case, the text and the url
        (?<url>                       # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                   # OR
    # the (text)[url] format
    \( ([^)]+) \)                     # this group will be named "txt" too
    \[ (\g<url>) \]                   # this one "url"
)
This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across the branches of an alternation. In the pattern, since the ?<txt> group is opened first in the first branch of the alternation, the first group in the second branch automatically gets the same name. The same goes for the ?<url> group.
\g<url> is a reference to the named subpattern ?<url> (like an alias, so there is no need to rewrite it in the second branch).
(?<=\P{P}) checks that the last character of the url is not a punctuation character (useful to avoid the closing square bracket, for example). (I'm not sure of the syntax; it may be \P{Punct}.)
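For readers who want to stay with the standard re module, one possible workaround (a sketch, not the approach of the answer above) is to keep the URL sub-pattern in a Python string and interpolate it into both branches, so it is written only once. The URL fragment, the group name bare, and the function find_links are assumptions made for this example, and the f-string needs Python 3.6 or later.

```python
import re

# Simplified URL fragment: stops at whitespace or a closing bracket.
URL = r"(?:ht|f)tps?://[^\s\]]+"

LINK_RE = re.compile(
    rf"\((?P<txt>[^)]+)\)\[\s*(?P<url>{URL})\s*\]"  # (text)[url] form
    rf"|(?P<bare>{URL})"                            # raw link
)

def find_links(text):
    """Return (display_text, url) pairs; a raw link is its own text."""
    links = []
    for m in LINK_RE.finditer(text):
        if m.group("bare") is not None:
            links.append((m.group("bare"), m.group("bare")))
        else:
            links.append((m.group("txt"), m.group("url")))
    return links

sample = "see http://example.com and (Example)[http://example.com/a]"
print(find_links(sample))
```

Group names must be unique in re, hence the separate bare group. This simplified fragment does not strip trailing punctuation from raw links the way the (?<=\P{P}) check is meant to.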

Python Regular Expressions to match option of strings

I am new to Python and Python RE.
I am trying to make a parser for ARM assembly code. I want to make regular expression for matching conditional branch instructions which could be:
beq, bne, blt, bgt
I tried a regular expression of the form
'b[eq|ne|lt|gt]'
But this does not match. Can someone please help me with this?
You should be using parentheses for options, not square brackets:
b(eq|ne|lt|gt)
And you'd usually want a non-capture group:
b(?:eq|ne|lt|gt)
And you can also make it a little more optimised too:
b(?:eq|ne|[lg]t)
Square brackets will be understood as being any of the characters or range of characters. So [eq|ne|lt|gt] effectively means either one of e, q, |, n, e (again, so it becomes redundant), etc.
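A quick way to see the difference is to run both forms over the four mnemonics. This is a minimal sketch with a made-up input string:

```python
import re

text = "beq bne blt bgt"

# Character class: 'b' followed by ONE character from the set.
with_brackets = re.findall(r"b[eq|ne|lt|gt]", text)

# Non-capturing group: 'b' followed by one of the two-letter options.
with_group = re.findall(r"b(?:eq|ne|lt|gt)", text)

# The slightly shortened variant matches the same mnemonics.
with_short = re.findall(r"b(?:eq|ne|[lg]t)", text)

print(with_brackets)
print(with_group)
print(with_short)
```

So the bracket version does find something, but only the first two letters of each instruction.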
Try the following pattern: b(?:eq|ne|lt|gt)
[] Character set: matches exactly one of the characters inside the brackets. You can specify a range of characters with the metacharacter -, e.g. [a-e], or negate the set with the metacharacter ^, e.g. [^aeiou].
() Capturing parentheses: used for grouping and for creating a numbered capturing group. You can disable capturing by placing ?: right after the opening parenthesis, e.g. (?:...).
As mentioned above, you need parentheses to match alternatives that are more than one character long, which is why your pattern using brackets did not match your string.
Please note that the non-capturing parentheses are meant to avoid saving the matched data; you can remove the ?: in order to capture the group.
Python's re module supports Perl-style regular expression syntax, so you can use named capturing groups and numbered backreferences. The main advantage of named groups is that they keep your expression easy to maintain, read, and edit.
Eg:
(?P<opcode>b(?:eq|ne|lt|gt)) - Will capture the match of your pattern b(?:eq|ne|lt|gt) into the backreference name opcode
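As a sketch of how the named group might be used in a parser, the snippet below pulls the opcode out of a few made-up assembly lines. The \b word boundaries are an addition for this example, so that longer mnemonics are not matched by accident.

```python
import re

# Named group "opcode" around the conditional-branch alternatives.
BRANCH_RE = re.compile(r"\b(?P<opcode>b(?:eq|ne|lt|gt))\b")

lines = ["beq loop", "mov r0, r1", "bgt .L3", "bx lr"]

opcodes = []
for line in lines:
    m = BRANCH_RE.search(line)
    # Retrieve the capture by name instead of by position.
    opcodes.append(m.group("opcode") if m else None)

print(opcodes)
```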
