What is the regex to remove the content inside brackets? - python

I want to do something like this,
Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5
to
Alice in the Wonderland Rating 4.5/5
What is the regex command to achieve this ?

You want to escape the brackets and use the non-greedy modifier ? with the catch-all expression .+.
>>> import re
>>> s = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
>>> re.sub(r'\[.+?\]\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
Explanations:
The . means any character and + one or more occurrences. This expression is "greedy" and will match everything (the rest of the string including any closing bracket) so you need the non-greedy modifier ? to make it stop at the closing bracket. Note that x? means zero or one occurrences of "x", so context matters.
Change it to .* if you also want to catch "[]"; * means zero or more occurrences.
The \s matches any whitespace character, so \s* also removes the space that follows each bracket.
You can use the "negated" character class instead of .+? - the [^x] means not "x", but the resulting expression is harder to read: \[[^\]]+\].
Justhalf's observation is very pertinent: this one works as long as brackets are not nested.
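To see the difference between the greedy and non-greedy forms, here is a small sketch. The sample string is made up for illustration (the brackets are separated by other text so that the greedy match visibly overshoots); it is not from the question.

```python
import re

# Hypothetical sample: two bracket groups with text between them.
s = 'a [1] b [2] c'

# Greedy: .+ runs on to the LAST closing bracket, swallowing " b " as well.
greedy = re.sub(r'\[.+\]\s*', '', s)

# Non-greedy: .+? stops at the FIRST closing bracket it can.
lazy = re.sub(r'\[.+?\]\s*', '', s)

# Negated character class: cannot cross a "]" at all, so no modifier is needed.
negated = re.sub(r'\[[^\]]+\]\s*', '', s)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

On the string from the question the greedy version happens to give the same result, because the two bracket groups sit right next to each other.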

Regex is not good at matching an arbitrary number of nested opening and closing brackets, but if they are not nested, it can be done with this regex:
import re
string = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
re.sub(r'\[[^\]]+\]\s*', '', string)
Note that it will also remove any space after the brackets.

You could use re.sub:
>>> re.sub(r'\[[^]]*\]\s?' , '', 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5')
'Alice in the Wonderland Rating 4.5/5'

If you prefer lots of [] in your regex :)
>>> import re
>>> s = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
>>> re.sub(r'[[].*?[]]\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
>>> re.sub(r'[[][^]]*.\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
Reiterating what @justhalf said: Python regexes are no good for nested brackets.
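If the brackets can be nested, one way around the limitation is to skip regex and count bracket depth by hand. This is only a sketch: the function name strip_brackets and the nested sample string are made up for illustration, and the whitespace clean-up at the end is one possible choice.

```python
def strip_brackets(text, open_ch="[", close_ch="]"):
    """Drop everything inside (possibly nested) brackets by tracking depth."""
    out = []
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1          # entering a bracket group
        elif ch == close_ch and depth > 0:
            depth -= 1          # leaving a bracket group
        elif depth == 0:
            out.append(ch)      # keep only characters outside all groups
    # Collapse the runs of whitespace left behind where groups were removed.
    return " ".join("".join(out).split())


# Hypothetical nested variant of the string from the question.
nested = 'Alice in the Wonderland [1865] [Charles [Lutwidge] Dodgson] Rating 4.5/5'
result = strip_brackets(nested)
print(result)
```

An unmatched closing bracket outside any group is kept as-is; whether that is the right behaviour depends on your data.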

Related

Remove any occurrence of a "(...)" from a long string, where "..." could be anything

I am currently using the following code to try to remove the characters from the string, but nothing is being removed. I think it has to do with the fact that the sequence of characters I am trying to remove is between parentheses. So for example, in the following string, "-McQuay International, 13600 Industrial Park Blvd, (p) ", I would want to remove "(p)".
import re
regexp = " \(*\) "
text = re.sub(regexp, "", "-McQuay International, 13600 Industrial Park Blvd, (p)")
I would use the following regex replacement:
inp = "-McQuay International, 13600 Industrial Park Blvd, (p)"
output = re.sub(r'\s*\(.*?\)\s*', ' ', inp).strip()
print(output) # -McQuay International, 13600 Industrial Park Blvd,
First, you should be using lazy dot (.*?) when matching parentheses. This is to avoid matching across multiple sets of parentheses, should your text have that. Second, I use a replacement which also removes unwanted whitespace. The call to strip() will remove any leading/trailing whitespace which might be left over.
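A short sketch of both points, assuming a string with two parenthesised groups. The "(f)" group is invented for illustration and is not in the original question.

```python
import re

# Hypothetical input with two groups; only "(p)" appears in the question.
inp = "-McQuay International, (f) 13600 Industrial Park Blvd, (p)"

# The pattern from the question reads: space, zero or more "(", one ")", space.
# Nothing in the input looks like that, so it finds no match.
no_match = re.search(r" \(*\) ", inp)

# Greedy dot: matches from the first "(" to the last ")" and removes too much.
greedy = re.sub(r"\s*\(.*\)\s*", " ", inp).strip()

# Lazy dot: each group is matched and removed on its own.
cleaned = re.sub(r"\s*\(.*?\)\s*", " ", inp).strip()

print(no_match)
print(greedy)
print(cleaned)
```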
Try this (note that it removes only the parenthesis characters themselves, so the p between them stays):
regexp = r"\(|\)"

How to select certain text between other text in python?

Here is an example string:
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
I am trying to separate the words "patties", "burgers", "fingers", and "meat" from this text. I want to separate the words after chicken but before the closing quotation mark.
I have gotten stumped on how to even separate a single one. I can split after "chicken '", but then how can I select the text up until the next ' ?
I would like to iterate through a list to save the variables to an array. Thanks for any help you can provide.
You can use regular expressions:
import re
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
match = re.findall(r'chicken \'(\S+)\'', text)
print (match)
Outputs:
['patties', 'burgers', 'fingers', 'meat']
This is a good use-case for regex.
import re
print(re.findall(r"chicken '(.*?)'", text))
Here's an explanation of the regex: https://regex101.com/r/8IdseD/1
Here's the python code running: https://repl.it/repls/SquareQuerulousModes
The regex, part by part:
chicken ' - matches that literal text
( - starts a capture group - the part that re.findall will spit out.
. - matches any character...
*? - ...any number of times, but as few as possible (this is to ensure we don't capture the final ')
) - end the capture group
' - match a literal '.
So re.findall will give you a list of all the substrings that are captured in the group.
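As a side note on capture groups: when the pattern has more than one group, re.findall returns a list of tuples, one item per group. The sketch below uses that to collect the quoted words for both "beef" and "chicken"; the two-group pattern and the by_kind dictionary are my own additions, not part of the answer above.

```python
import re

text = ("hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken "
        "'patties' and chicken 'burgers' and also chicken 'fingers' and other "
        "chicken 'meat' too.")

# Two capture groups: the kind of meat, and the quoted word after it.
pairs = re.findall(r"(beef|chicken) '(.*?)'", text)

# Group the quoted words by kind.
by_kind = {}
for kind, word in pairs:
    by_kind.setdefault(kind, []).append(word)

print(pairs)
print(by_kind["chicken"])
```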
You can use zero-width lookarounds to match the surroundings:
(?<=chicken\s')[^']+(?=')
(?<=chicken\s') is zero-width positive lookbehind that matches chicken '
[^']+ matches the portion upto next single quote i.e. the desired substring
(?=') is zero-width positive lookahead that matches ' after the desired substring
Example:
In [713]: text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
In [714]: re.findall(r"(?<=chicken\s')[^']+(?=')", text)
Out[714]: ['patties', 'burgers', 'fingers', 'meat']
Select just the portion of the sentence from the first occurrence of "chicken":
chicken_text = text[text.find("chicken"):]
Split that text on spaces:
chicken_words = chicken_text.split(" ")
Scan the list for words that begin and end with a single quote:
for word in chicken_words:
    if word[0] == "'" and word[-1] == "'":
        print(word[1:-1])
This won't work if the single-quoted words themselves contain spaces, but that isn't the case in the sample text you gave.

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
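In Python, pulling out group 2 could look like the sketch below. To keep it short, txt is a shortened excerpt of the sentence from the question (an assumption on my part, the full sentence should behave the same way), and the variable names are arbitrary.

```python
import re

# Shortened excerpt of the dummy sentence from the question.
txt = '''Saul "Canelo" Alvarez and Floyd 'Money' Mayweather fight in Atlantic City. "Here Goes a String". 'Money Mayweather'. "this is not A string", "This IS a" "Three Word String".'''

pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Group 1 is the quote character, group 2 is the text between the quotes.
names = [m.group(2) for m in pattern.finditer(txt)]
print(names)
```

Using finditer with m.group(2) avoids the list of tuples that re.findall would return for a pattern with two capture groups.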
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

regex to parse out certain value that i want

Using https://regex101.com/
My current regex expression: ^.*'(\d\s*.*)'*$
It doesn't seem to be working. What is the right expression to use?
I want to be able to parse out 4 variables, namely
items, quantity, cost and Total
MY CODE:
import re
str = "xxxxxxxxxxxxxxxxxx"
match = re.match(r"^.*'(\d\s*.*)'*$",str)
print match.group(1)
The following regex matches each ingredient line and stores the wanted information in groups: r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$'
It defines 3 groups, separated from each other by whitespace:
^ marks the start of the string
(\d+) is the first group and looks for at least one digit
\s+ is the first separator between groups and looks for at least one whitespace character
([A-Za-z ]+) is the second group and looks for at least one alphabetic character or space
\s+ is the second separator between groups and looks for at least one whitespace character
(\d+(?:\.\d*)) is the third group and looks for at least one digit followed by a decimal point and any further digits
$ marks the end of the string
A regex to obtain the total does not need to be explained, I think.
Here is test code using your test data. It should be a good starting point:
import re
TEST_DATA = ['Table: Waiter: kenny',
'======================================',
'1 SAUSAGE WRAPPED WITH B 10.00',
'1 ESCARGOT WITH GARLIC H 12.00',
'1 PAN SEARED FOIE GRAS 15.00',
'1 SAUTE FIELD MUSHROOM W 9.00',
'1 CRISPY CHICKEN WINGS 7.00',
'1 ONION RINGS 6.00',
'----------------------------------',
'TOTAL 59.00',
'CASH 59.00',
'CHANGE 0.00',
'Signature:__________________________',
'Thank you & see you again soon!']
INGREDIENT_RE = re.compile(r'^(\d+)\s+([A-Za-z ]+)\s+(\d+(?:\.\d*))$')
TOTAL_RE = re.compile(r'^TOTAL (.+)$')
ingredients = []
total = None
for string in TEST_DATA:
    match = INGREDIENT_RE.match(string)
    if match:
        ingredients.append(match.groups())
        continue
    match = TOTAL_RE.match(string)
    if match:
        total = match.groups()[0]
        break
print(ingredients)
print(total)
this prints:
[('1', 'SAUSAGE WRAPPED WITH B', '10.00'), ('1', 'ESCARGOT WITH GARLIC H', '12.00'), ('1', 'PAN SEARED FOIE GRAS', '15.00'), ('1', 'SAUTE FIELD MUSHROOM W', '9.00'), ('1', 'CRISPY CHICKEN WINGS', '7.00'), ('1', 'ONION RINGS', '6.00')]
59.00
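Since the groups come back as strings, a possible next step is to convert them and check the receipt arithmetic. The sketch below is a variation, not the code above: the named groups, the lazy item group and the Decimal conversion are my own choices, and it assumes the last column of each item line is the amount that counts towards TOTAL.

```python
import re
from decimal import Decimal

LINES = [
    '1 SAUSAGE WRAPPED WITH B 10.00',
    '1 ESCARGOT WITH GARLIC H 12.00',
    '1 PAN SEARED FOIE GRAS 15.00',
    '1 SAUTE FIELD MUSHROOM W 9.00',
    '1 CRISPY CHICKEN WINGS 7.00',
    '1 ONION RINGS 6.00',
    '----------------------------------',
    'TOTAL 59.00',
]

# Named groups make the later code easier to read than positional groups.
ITEM_RE = re.compile(r'^(?P<quantity>\d+)\s+(?P<item>[A-Za-z ]+?)\s+(?P<cost>\d+\.\d*)$')
TOTAL_RE = re.compile(r'^TOTAL\s+(?P<total>\d+\.\d*)$')

items = []
total = None
for line in LINES:
    m = ITEM_RE.match(line)
    if m:
        # Convert the captured strings to numeric types.
        items.append((int(m['quantity']), m['item'], Decimal(m['cost'])))
        continue
    m = TOTAL_RE.match(line)
    if m:
        total = Decimal(m['total'])

# Assumption: the cost column is the line amount, so the total is a plain sum.
computed = sum(cost for _, _, cost in items)
print(computed, total, computed == total)
```

Decimal is used instead of float so that the comparison with the printed total is exact.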
Edit on Python raw strings:
The r prefix before a Python string literal marks it as a raw string, which means that escape sequences (like \t, \n, etc.) are not interpreted.
To be clear, in a standard string \t is one tab character. In a raw string it is two characters: \ and t.
r'\t' is equivalent to '\\t'.
More details are in the Python documentation.
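A quick sketch of why this matters for regexes. The \b example and the sample string 'cat concat' are my own illustration: in a normal string \b is a backspace character, while in a raw string it reaches the regex engine as a word boundary.

```python
import re

# One character (a tab) versus two characters (backslash and "t").
print(len('\t'), len(r'\t'))
print(r'\t' == '\\t')

# Normal string: "\b" is a backspace character, so nothing matches.
plain_hits = re.findall('\bcat\b', 'cat concat')

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_hits = re.findall(r'\bcat\b', 'cat concat')

print(plain_hits)
print(raw_hits)
```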

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [p][/p] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(pattern, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates'].
import re
regex = r"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same
unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the
stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a
backslash: \[P\].
To return only the words inside the tags, place grouping parentheses
around .+?.
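Applying those steps, a sketch that produces both of the outputs asked for in the question could look like this. The variable names are arbitrary, and the \s* around the group is an optional extra that trims the spaces inside the tags.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# No capture group: findall returns the whole match, tags included.
with_tags = re.findall(r"\[P\].+?\[/P\]", line)

# One capture group: findall returns only the text inside the parentheses.
names = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)

print(with_tags)
print(names)
```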
Try this (subject is the string to search):
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print(match.group(1))  # text between the tags
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']
you can replace your pattern with
regex = r"\[P\]([\w\s]+)\[/P\]"
Use this pattern,
pattern = '\[P\].+?\[\/P\]'
