Why does this pandas str.extract pattern work? - python

I have a dataframe "movies" with column "title", which contains movie titles and their release year in the following format:
The Pirates (2014)
I'm testing different ways to extract just the title portion, which in the example above would be "The Pirates", into a new column.
I used pandas Series.str.extract() and found a regex pattern that works, but I'm not sure why it works.
movies['title_only'] = movies['title'].str.extract(r'(.*)[\s]', expand=True)
The above code correctly extracts "The Pirates" into a new column, but why doesn't it extract only "The" (everything before the first whitespace)?

* is a greedy quantifier, meaning it will match as far into the string as possible, so (.*) runs up to the last whitespace rather than the first. To match only the first word, you can switch to the lazy quantifier *?. Also, note that you don't need square brackets around the \s: [\s] is equivalent to \s.
According to CAustin
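A minimal sketch of the greedy/lazy difference, using the standard re module on a single title rather than a DataFrame (pandas str.extract is built on the same regex engine). The anchored pattern at the end is an added suggestion that assumes every title ends in a four-digit year in parentheses.

```python
import re

title = "The Pirates (2014)"

# Greedy: .* grabs everything, then backtracks to the LAST whitespace.
greedy = re.search(r"(.*)\s", title).group(1)

# Lazy: .*? grabs as little as possible, stopping at the FIRST whitespace.
lazy = re.search(r"(.*?)\s", title).group(1)

# Assumption: titles always end with "(YYYY)"; anchoring on it states the intent.
anchored = re.search(r"^(.*?)\s*\(\d{4}\)$", title).group(1)

print(greedy)    # The Pirates
print(lazy)      # The
print(anchored)  # The Pirates
```

The greedy version only works here because the year happens to be the last whitespace-separated token; the anchored version does not rely on that.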

Related

Python ReGex Pattern Finder

I am trying to get better with ReGex in Python, and I am trying to figure out how I would isolate a specific substring in some text. I have some text, and that text could look like any of the following:
possible_strings = [
"some text (and words) more text (and words)",
"textrighthere(with some more)",
"little trickier (this time) with (all of (the)(values))"
]
With each string, despite the fact that I don't know what's in them, I know it always ends with some information in parentheses. This includes examples like #3, where the final pair of parentheses has parentheses inside it.
How could I go about using re/ReGex to isolate the text only inside of the last pair of parentheses? So in the previous example, I would want the output to be:
output = [
"and words",
"with some more",
"all of (the)(values)"
]
Any tips or help would be much appreciated!
In Python you can use the regex module, as it supports recursion:
import regex
pat = r'(\((?:[^()]|(?1))*\))$'
regex.findall(pat, '\n'.join(possible_strings), regex.M)
['(and words)', '(with some more)', '(all of (the)(values))']
The regex might be quite complicated for a beginner.
A bit of explanation:
( # opens the 1st capturing group
\( # matches the character (
(?: # opens a non-capturing group
[^()] # 1st alternative: a single character that is neither ( nor )
| # or
(?1) # 2nd alternative: matches the expression defined in the 1st capturing group, recursively
) # closes the non-capturing group
* # repeats the non-capturing group zero or more times
\) # matches the character )
) # closes the 1st capturing group
$ # asserts position at the end of a line
For the first two, start matching an opening bracket, that could be either of these:
"some text (and words) more text (and words)"
           ^                     ^
followed by anything which isn't an opening bracket:
"some text (and words) more text (and words)"
           ^^^^^^^^^^^^^^^^^^^^^^X^^^^^^^^^^^
                                 |- starting at the first ( hit
                                    another ( which isn't allowed.
followed by end of line. Only the last () fits "no more ( until end of line".
>>> import re
>>> re.findall(r'\([^(]+\)$', "some text (and words) more text (and words)")
['(and words)']
RegEx is not a good fit for your third example: there's no easy way to pair up the parens, so you may have to install and use a different regex engine to get nested structure support. See also:
Matching Nested Structures With Regular Expressions in Python
Python: How to match nested parentheses with regex?
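If installing another regex engine is not an option, the nested case can also be handled without regular expressions. The following is a sketch only, not part of the answers above: a hypothetical helper, last_paren_group, walks backwards from the final closing parenthesis and tracks nesting depth. It assumes the string ends with ")" and returns None otherwise.

```python
def last_paren_group(text):
    """Return the contents of the final balanced (...) group, or None."""
    if not text.endswith(")"):
        return None
    depth = 0
    # Walk backwards from the last character, counting nesting depth.
    for i in range(len(text) - 1, -1, -1):
        if text[i] == ")":
            depth += 1
        elif text[i] == "(":
            depth -= 1
            if depth == 0:
                # Found the opener that pairs with the final ")".
                return text[i + 1:-1]
    return None  # unbalanced parentheses


possible_strings = [
    "some text (and words) more text (and words)",
    "textrighthere(with some more)",
    "little trickier (this time) with (all of (the)(values))",
]
output = [last_paren_group(s) for s in possible_strings]
print(output)
# ['and words', 'with some more', 'all of (the)(values)']
```

Unlike the findall results shown earlier, this returns the text without the outer parentheses, which is what the question asked for.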

Regex (python) to match same group several times only when preceded or followed by specific pattern

Suppose I have the following text:
Products to be destroyed: «Prabo», «Palox 2000», «Remadon strong» (Rule). The customers «Dilora» and «Apple» has to be notified.
I need to match every string within the «» quotes but ONLY in the period starting with the "Products to be destroyed:" pattern or ending with the (Rule) pattern.
In other words in this example I do NOT want to match Dilora nor Apple.
The regex to get the quoted contents in the capturing group is:
«(.+?)»
Is it possible to "anchor" it to either a following pattern (such as Rule) or even to a prior pattern (such as "Products to be destroyed:")?
This is my saved attempt on regex101
Thank you very much.
You can first match a sentence that contains at least one «...» part, and when there is a match, extract all the parts from it using re.findall, for example.
The example data seems to be a single sentence ending in a dot. In that case you can require at least one «...» part and match everything around it with a negated character class that accepts any character except a dot.
Regex demo for at least a single match, and another demo to match the separate parts afterwards
import re
regex = r"\bProducts to be destroyed:[^.]*«[^«»]*»[^.]*\."
s = 'Products to be destroyed: «Prabo», «Palox 2000», «Remadon strong» (Rule). The customers «Dilora» and «Apple» has to be notified.'
result = re.search(regex, s)
if result:
    print(re.findall(r"«([^«»]*)»", result.group()))
Output
['Prabo', 'Palox 2000', 'Remadon strong']
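The question also asks about anchoring on the trailing (Rule) instead of the leading phrase. The same two-step idea can be sketched for that case; the pattern and the helper name quoted_before_rule below are illustrative assumptions, and the approach only holds if sentences contain no dot other than the final one.

```python
import re

# Step 2 pattern: the text between one pair of guillemets.
QUOTED = re.compile(r"«([^«»]*)»")

# Step 1 pattern (assumed variant): a dot-free run that contains at least
# one «...» part and ends with "(Rule)." instead of starting with a phrase.
SENTENCE_WITH_RULE = re.compile(r"[^.]*«[^«»]*»[^.]*\(Rule\)\.")


def quoted_before_rule(text):
    """Collect «...» contents from every sentence that ends in (Rule)."""
    names = []
    for sentence in SENTENCE_WITH_RULE.finditer(text):
        names.extend(QUOTED.findall(sentence.group()))
    return names


s = ('Products to be destroyed: «Prabo», «Palox 2000», «Remadon strong» (Rule). '
     'The customers «Dilora» and «Apple» has to be notified.')
print(quoted_before_rule(s))
# ['Prabo', 'Palox 2000', 'Remadon strong']
```

A name containing a dot, such as a version number, would split the sentence early, so real data may need a stricter sentence boundary.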

Insert string in pandas column using regex if pattern is found

I have a string column in a dataframe and I'd like to insert a # at the beginning of my pattern.
For example:
My pattern is the letters 'pr' followed by any number of digits. If in my column there is a value 'problem in pr123', I would change it to 'problem in #pr123'.
I'm trying a bunch of code snippets but nothing is working for me.
Tried to change the solution to replace for 'pr#123' but this didn't work either.
df['desc_clean'] = df['desc_clean'].str.replace(r'([p][r])(\d+)', r'\1#\2', regex=True)
What's the best way I can replace all values in this column when I find this pattern?
If you need pr#123, you can use
df['desc_clean'] = df['desc_clean'].str.replace(r'(pr)(\d+)', r'\1#\2', regex=True)
To get #pr123, you can use
df['desc_clean'].str.replace(r'pr\d+', r'#\g<0>', regex=True)
To match pr only at the start of a word, you can add a word boundary, \b, in front of pr:
df['desc_clean'].str.replace(r'\bpr\d+', r'#\g<0>', regex=True)
See the regex demo.
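To see what \g<0> and the word boundary do, here is a small sketch with re.sub on plain strings (pandas passes the same pattern and replacement to the re module when regex=True). The helper name tag_pr and the sample strings are made up for illustration.

```python
import re


def tag_pr(text, whole_word=True):
    """Prefix every 'pr<digits>' token with '#'."""
    # \b keeps 'pr' from matching in the middle of a longer word.
    pattern = r"\bpr\d+" if whole_word else r"pr\d+"
    # \g<0> refers to the entire match, so no capture groups are needed.
    return re.sub(pattern, r"#\g<0>", text)


print(tag_pr("problem in pr123"))                 # problem in #pr123
print(tag_pr("spr12 and pr7"))                    # spr12 and #pr7
print(tag_pr("spr12 and pr7", whole_word=False))  # s#pr12 and #pr7
```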

Python 2.7 Regex capture groups not working as predicted

I am trying to pattern match and replace first person with second person with Python 2.7.
string = re.sub(r'(\W)I(\W)', '\g<1>you\g<2>',string)
string = re.sub(r'(\W)(me)(\W)', '\g<1>you\g<3>',string)
# but does NOT work
string = re.sub(r'(\W)I|(me)(\W)', '\g<1>you\g<3>',string)
I want to use the last regex, but somehow the capture groups are all messed up and even doing a \g<0> shows strange, irregular matches. I would think that capture group 3 would be the last word boundary, but it doesn't appear to be.
A sample sentence could be: I like candy.
I am not interested very much in the correctness of the replacement (me will never actually be selected since I goes first), but I don't know why the capture groups don't work as I would expect.
Thanks!
Try the following regex.
Regex: \b(I|me)\b
Explanation:
\b on both sides marks a word boundary.
(I|me) matches either I or me.
Note: you can make it case-insensitive using the i flag (re.IGNORECASE in Python).
Regex101 Demo
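The reason the combined pattern in the question misbehaves is that | has the lowest precedence in a regex, so (\W)I|(me)(\W) means "(\W)I" or "(me)(\W)", and only the groups of the branch that matched are filled. A short sketch, using a made-up sample sentence:

```python
import re

sentence = "I like candy, give me some."

# The alternation splits the whole pattern in two: (\W)I  |  (me)(\W).
# Here only the second branch matches, so group 1 stays unset.
broken = re.search(r"(\W)I|(me)(\W)", sentence)
print(broken.groups())  # (None, 'me', ' ')

# Grouping the alternatives and using \b avoids consuming the neighbours,
# so no backreferences are needed in the replacement.
fixed = re.sub(r"\b(I|me)\b", "you", sentence)
print(fixed)  # you like candy, give you some.
```

Note that \b also matches at the start of the string, which (\W)I cannot, so the leading "I" is replaced as well.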

alternative regex to match all text in between first two dashes

I'm trying to use the following regex: \-(.*?)-|\-(.*?)*. It seems to work fine on regexr, but Python says there's nothing to repeat.
I'm trying to match all text in between the first two dashes or, if a second dash does not exist after the first, all text from the first - onwards.
Also, the regex above includes the dashes, but I would preferably like to exclude these so I don't have to do an extra replace etc.
You can use re.search with this pattern:
-([^-]*)
Note that - doesn't need to be escaped.
Another way is to find the positions of the first two dashes and extract the substring between them. Or you can use split:
>>> 'aaaaa-bbbbbb-ccccc-ddddd'.split('-')[1]
'bbbbbb'
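A quick sketch of the re.search suggestion, wrapped in a hypothetical helper named after_first_dash. It covers both cases from the question: [^-]* stops at the second dash if there is one and otherwise runs to the end of the string.

```python
import re


def after_first_dash(text):
    """Text between the first two dashes, or after the first if it is the only one."""
    match = re.search(r"-([^-]*)", text)
    # group(1) excludes the dash itself, so no extra replace is needed.
    return match.group(1) if match else None


print(after_first_dash("aaaaa-bbbbbb-ccccc-ddddd"))  # bbbbbb
print(after_first_dash("aaaaa-bbbbbb"))              # bbbbbb
print(after_first_dash("aaaaa"))                     # None
```

The split version gives the same result for the first two inputs, but split('-')[1] raises IndexError when the string contains no dash at all.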
