Python regex ignore punctuation when using re.sub

Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
import re
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2', s))
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match because the punctuation now separates repeated copies of the same letter, not just different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.
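One possible direction, as a rough sketch rather than a definitive answer: generate the pattern from a table of stems, and let every letter be a repeatable unit of "letter plus optional punctuation", i.e. (?:c\W*)+ instead of c+\W*. The helper names run, tight_run, STEMS and swap_er are made up for illustration, and the stem table only covers a few of the words from the question. The four groups and the \1\4\3\2 replacement are the same as in the original regex.

```python
import re

def run(chars):
    # One or more of `chars`, each occurrence optionally followed by punctuation.
    return r'(?:[%s]\W*)+' % chars

def tight_run(chars):
    # Same idea, but punctuation after the last occurrence is not consumed.
    return r'[%s](?:\W*[%s])*' % (chars, chars)

# Hypothetical table: one character class per letter of each stem before "er".
STEMS = {
    'cent': ['c', 'e3', 'n', 't'],
    'theat': ['t', 'h', 'e3', 'a', 't'],
    'met': ['m', 'e3', 't'],
    'lit': ['l1', 'i!1', 't'],
}

stem_alt = '|'.join(''.join(run(c) for c in classes) for classes in STEMS.values())

# Groups: 1 = stem, 2 = run of e, 3 = punctuation between e and r, 4 = run of r.
pattern = re.compile(
    r'((?:%s))(%s)(\W*)(%s)' % (stem_alt, tight_run('e3'), tight_run('r')),
    re.I,
)

def swap_er(text):
    # Swap the "e" run and the "r" run, keeping the punctuation in place.
    return pattern.sub(r'\1\4\3\2', text)

print(swap_er("center"))                    # centre
print(swap_er("c.c.e.e.n.n.t.t.e.e.r.r"))  # c.c.e.e.n.n.t.t.r.r.e.e
```

Like the original regex, this sketch has no word boundaries and \W also matches whitespace, so it may still need tightening for real text.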

Related

Extracting a section of a string using regex with repeating ending words

I am attempting to extract some raw strings using the re module in Python. The end of a to-be-extracted section is identified by a word that is repeated multiple times in the text. My current attempts always capture up to the last occurrence of the repeating word. How can I modify this behavior?
A textfile has been extracted from a pdf. The entire PDF is stored as one string. A general formatting of the string is as below:
*"***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"*
The intended string to be captured is: "Collection of alphanumeric words and characters"
The attempted solution used in this situation was:
re.compile(r"*{3}Start of notes:(.+)\sEndofsection")
This attempt tends to match the whole string rather than just "Collection of alphanumeric words and characters" as intended.
One possible approach is to split with Endofsection and then extract the string from the first section only - this works, but I was hoping to find a more elegant solution using re.compile.

There are two problems in your regex:
First, you need to escape * as \*, because it is a metacharacter.
Second, (.+) uses a greedy quantifier, which tries to match as much as possible. Since you want the shortest match, change it to the lazy form (.+?).
Fixing these two issues gives you the intended match.
Python code,
import re
s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
m = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)
if m:
    print(m.group(1))
Prints,
Collection of alphanumeric words and characters

Detect abbreviations in the text in python

I want to find abbreviations in the text and remove them. What I am currently doing is identifying consecutive capital letters and removing them.
But I see that it does not remove abbreviations such as MOOCs, M.O.O.C, M.O.O.Cs. Is there an easy way of doing this in python? Or are there any libraries that I can use instead?

The re regex library is probably the tool for the job.
In order to remove every string of consecutive uppercase letters, the following code can be used:
import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)
Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
The convenient thing about regex is how easily it can be modified for more complex cases. For example:
mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)
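This will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D as well as ABCD. Note that a trailing full stop, as in A.B.C.D., is left behind, because \b cannot match between a full stop and a following space or the end of the string. The \ before the . is not strictly needed inside [...], but outside a character class an unescaped . is used by regex as a kind of wildcard.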
The ? specifier could also be used to remove acronyms that end in s, for example:
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)
This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
The re library also includes functions like findall, or the match function, which allow for programs to locate and process each acronym individually. This might come in handy if you want to, for example, look at a list of the acronyms being removed and check there are no legitimate words there.
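As a small sketch of that last idea, the same pattern can be compiled once and used both to list the acronyms and to remove them. The names ACRONYM, find_acronyms and remove_acronyms are just illustrative, and the sample sentence is made up.

```python
import re

# Same pattern as above: 2+ capitals/full stops, optional trailing "s".
ACRONYM = re.compile(r"\b[A-Z.]{2,}s?\b")

def find_acronyms(text):
    # Return every acronym the pattern would remove, for manual review.
    return ACRONYM.findall(text)

def remove_acronyms(text):
    # Replace each match with the empty string (surrounding spaces are kept).
    return ACRONYM.sub("", text)

mytext = "hello, look an ACRONYM, MOOCs and M.O.O.Cs"
print(find_acronyms(mytext))  # ['ACRONYM', 'MOOCs', 'M.O.O.Cs']
```

Reviewing the output of find_acronyms before calling remove_acronyms is a cheap way to spot legitimate all-caps words that should be kept.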

An intuitive way would be to use a regex.
This regular expression does the job: ([A-Z]\.*){2,}s?
Which gives, in Python:
import re
re.sub(r"([A-Z]\.*){2,}s?", "", your_text)
Please visit regex documentation in case of doubt
https://docs.python.org/2/library/re.html#re.sub

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me with a pattern to match all words ending in s, and words that start with a and end with a (like ana)?
How do I write the ending?

Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group matching either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer generated and not something you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to "formalize" the ..., which should mean "one or more characters"; that translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, like the fing in finger or a lone s at the end of a word, which should not match our regex. What is happening? As you probably already saw, the | can't know "which part it should or": it splits the whole pattern into \b\w+ing and s\b. So we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into a non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A: one or more word characters enclosed by A. You might expect this to match AaA, but it matches all of AaAAbA, because A is itself something \w can match. By default, the *, + and ? quantifiers all try to match as much as possible. Sometimes, as in the A example, you don't want that; you can then put a ? after the quantifier to signal that you want a non-greedy version, one that matches as little as possible.
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character), you often need to be careful not to match too much.
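A quick way to see the greedy/non-greedy difference for yourself is to run both variants on the AaAAbA string (the variable names here are just for illustration):

```python
import re

text = "AaAAbA"

# Greedy: \w+ grabs as much as it can, then backtracks just enough to find a final A.
greedy = re.search(r"A\w+A", text).group(0)

# Non-greedy: \w+? grabs as little as it can before the next A.
lazy = re.search(r"A\w+?A", text).group(0)

print(greedy)  # AaAAbA
print(lazy)    # AaA
```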
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
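Here is a short sketch of both points. The sentence with "Paris" is a made-up example; the findall calls show why a capturing group around the ending gives you the endings instead of the words.

```python
import re

# A capturing group lets you pull out just the part you care about.
m = re.search(r"I want to go to (\w+)", "Today I want to go to Paris")
whole = m.group(0)        # the entire match
destination = m.group(1)  # only the captured word

text = "fishing words"

# With a capturing group, findall returns the group contents, not the words.
with_capture = re.findall(r"\b\w+(ing|s)\b", text)

# With a non-capturing group, findall returns the whole matches.
without_capture = re.findall(r"\b\w+(?:ing|s)\b", text)

print(destination)      # Paris
print(with_capture)     # ['ing', 's']
print(without_capture)  # ['fishing', 'words']
```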
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regex are an arcane thing, only use them for easy and easy to test tasks.
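One likely explanation, assuming the pattern was typed as an ordinary string literal rather than a raw string: in a normal Python string, "\b" is the backspace character, so the regex engine never sees a word boundary. A minimal sketch of the difference (the variable names are mine):

```python
import re

text = "fishing words"

# In an ordinary string literal, "\b" is the backspace character (\x08),
# so this pattern looks for literal backspaces around the word.
not_raw = "\b\\w+(?:ing|s)\b"

# In a raw string, \b reaches the regex engine as a word-boundary assertion.
raw = r"\b\w+(?:ing|s)\b"

print(re.findall(not_raw, text))  # []
print(re.findall(raw, text))      # ['fishing', 'words']
```

Writing every pattern as a raw string (r"...") avoids this class of surprise.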

Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

Translate from TAG format to Regex for Corpus

I'm working with a corpus linguistics tool called AntConc, where you have a document where every word is tagged as a part of speech (noun, adjective, etc), and you use specific commands to pull out matches. For example, if I was looking for a noun (which is tagged NN), I would use *_NN and it would find every noun in the document.
I need to translate my *_TAG syntax into python regex, and I have no idea how to do that. For example, I have a phrase: *_PP$ *_NN *_DT *_JJ *_NN (this translates to possessive pronoun, noun, determiner, adjective, noun; it would find things like "her voice an exact duplicate") in TAG format.
How does one go about changing things like that to regex? For now, I'll take just that basic stuff. Later I'll worry about figuring out how to do "or" and "if this then this" and whatnot.
If you need more info about the tags, try searching for POS tags CLAWS, which should give you a list.
Thanks so much for your help!

So I did some research and found this PDF file describing the notion of embedded tags and non-embedded tags. You are looking to find the embedded tags. So if I'm correct the input would be like this right?
her_PP$ voice_NN an_DT exact_JJ duplicate_NN
Only then in a larger body of text and you don't know the actual words, you just know the _XX tags.
In a regex, you have to be more specific than *. What you want in the place of the * is 1 or more of any character that is part of a word (letters, but could also contain hyphens maybe?). That makes this for the noun:
[\w-]+_NN
This means a character class [...] of word characters \w and the hyphen -, repeated one or more times +, followed by _NN.
For the possessive pronoun, it has a $ in there which has a special meaning in regexes, if you want the character $ and not its special meaning, you need to escape it with a preceding \ like so:
[\w-]+_PP\$
Lastly you want to consider which characters are allowed in between the words. Could be just white-space like spaces, tabs and enters, which would be \s+. Could also be "any character that isn't a word character" to allow periods, commas, quotes, colons, etc. That would be \W+ (note the upper case W to be the opposite of the lowercase \w).
Combined this would amount to this:
[\w-]+_PP\$\W+[\w-]+_NN\W+[\w-]+_DT\W+[\w-]+_JJ\W+[\w-]+_NN
To do "an undetermined amount of unknown words" you would do this:
(?:[\w-]+\W+)*?
So the part that matches the word [\w-]+ and the part that goes in between \W+ are wrapped into a non-capturing group (?:...) and that group is said to occur 0 or more times with the * but as few times as possible with ? to avoid greediness. You can see it here and remove or add an X to see it will still match.
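If you have many such queries, the translation can be automated. Below is a minimal sketch, assuming each query token has the form word_TAG with * standing for "any word"; the function name tag_query_to_regex and the sample text are made up for illustration, and re.escape takes care of special characters such as the $ in PP$.

```python
import re

def tag_query_to_regex(query, gap=r'\W+'):
    # Translate e.g. "*_PP$ *_NN" into "[\w-]+_PP\$\W+[\w-]+_NN".
    parts = []
    for token in query.split():
        word, sep, tag = token.rpartition('_')
        if not sep:
            raise ValueError("expected word_TAG, got %r" % token)
        # "*" means any word; anything else is taken as a literal word.
        word_pattern = r'[\w-]+' if word == '*' else re.escape(word)
        parts.append(word_pattern + '_' + re.escape(tag))
    return gap.join(parts)

pattern = tag_query_to_regex("*_PP$ *_NN *_DT *_JJ *_NN")
print(pattern)

text = "and_CC her_PP$ voice_NN an_DT exact_JJ duplicate_NN"
match = re.search(pattern, text)
print(match.group(0))  # her_PP$ voice_NN an_DT exact_JJ duplicate_NN
```

The generated pattern for this query is the same as the combined regex shown above. Note that a tag like _NN will also match the start of a longer tag such as _NNS unless you add a boundary after it.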

How to add tags to negated words in strings that follow "not", "no" and "never"

How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?

To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)
This prints:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
This matches your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is any kind of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with never going to work, kind of strings. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2

I would not do this with a regexp. Rather, I would:
* Split the input on punctuation characters.
* For each fragment:
  * Set the negation counter to 0.
  * Split the fragment into words.
  * For each word:
    * Prepend NEG_ to the word once per count in the negation counter (or the count mod 2, or just once if the counter is greater than 0).
    * If the original word is in {No, Never, Not}, increase the negation counter by one.
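The steps above could be sketched roughly like this. It uses the "just once if the counter is greater than 0" variant, so a boolean stands in for the counter; the name tag_negations and the punctuation set are my own assumptions. re.split is only used here to keep the punctuation and whitespace so the string can be reassembled unchanged.

```python
import re

NEGATIONS = {"not", "no", "never"}

def tag_negations(text):
    # Split on punctuation; the capturing group keeps each mark as its own fragment.
    fragments = re.split(r'([.,:;!?])', text)
    result = []
    for fragment in fragments:
        negated = False  # reset at every punctuation mark
        # Split into words, keeping the whitespace so it can be restored.
        tokens = re.split(r'(\s+)', fragment)
        for i, token in enumerate(tokens):
            if not token or token.isspace():
                continue
            if negated:
                tokens[i] = 'NEG_' + token
            if token.lower() in NEGATIONS:
                negated = True
        result.append(''.join(tokens))
    return ''.join(result)

print(tag_negations("He did not play so well, so he had to practice some more."))
# He did not NEG_play NEG_so NEG_well, so he had to practice some more.
```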

You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
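A minimal sketch of these steps, under a couple of assumptions of mine: the keyword group is made capturing so it can be put back unchanged, and a replacement function handed to sub does the "insert the result back" step. The names NEGATED_SPAN, _tag_span and tag_negated_words are illustrative only.

```python
import re

# Step 1: a negation keyword (group 1) and everything up to the next
# punctuation mark (group 2).
NEGATED_SPAN = re.compile(r'\b(not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

def _tag_span(match):
    keyword, rest = match.group(1), match.group(2)
    # Step 2: prepend NEG_ to every word in the span after the keyword.
    tagged = re.sub(r'\w+', lambda m: 'NEG_' + m.group(0), rest)
    # Step 3: sub() splices the returned text back into the original string.
    return keyword + tagged

def tag_negated_words(text):
    return NEGATED_SPAN.sub(_tag_span, text)

print(tag_negated_words("It was never going to work, he thought."))
# It was never NEG_going NEG_to NEG_work, he thought.
```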
