Replace string that do not contain any digit, whitespace and parenthesis - python

I want to eliminate all the digits, whitespace and parentheses in a column of a dataframe.
So, I wrote the following code.
df_energy['Country'].replace(r'\d* \(.*\)','',regex=True,inplace=True)
However, it only eliminates the whitespace and parentheses.
{'China2', 'China, Hong Kong Special Administrative Region3'}
Items with a digit at the end still remain the same.
May I know which part of the statement I miswrote?

Your regex specifically looks for the pattern "zero or more digits, then a space, then a parenthesised group", in that order,
meaning that an input like 1 (string) would match.
If you want to eliminate the characters in the input regardless of order, I would suggest something like a simple list of characters to look for: r'[\d ()]+'
The screenshots come from CyrilEx (https://extendsclass.com/), which can visualize a regex and help you debug patterns.
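To make the difference concrete, here is a minimal sketch that applies both patterns to plain strings with Python's re module instead of a DataFrame column. The entry 'Bolivia (Plurinational State of)' is a made-up sample added only for illustration; the two patterns are the ones discussed above.

```python
import re

# "China2" is from the question; the second entry is a hypothetical sample.
names = ["China2", "Bolivia (Plurinational State of)"]

original = r"\d* \(.*\)"    # digits, THEN a space, THEN a parenthesised part
char_class = r"[\d ()]+"    # any run of digits, spaces or parentheses

after_original = [re.sub(original, "", n) for n in names]
after_char_class = [re.sub(char_class, "", n) for n in names]

print(after_original)    # ['China2', 'Bolivia']
print(after_char_class)  # ['China', 'BoliviaPlurinationalStateof']
```

Note that the character class also removes the spaces between words. If only trailing digits and a parenthesised suffix should go, an alternation such as r'\d+| \(.*\)' may be closer to the intent, but that is an assumption about the desired output. In pandas, the same pattern can be passed to df_energy['Country'].str.replace(pattern, '', regex=True).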

Related

Python regex expression example

I have an input that is valid if it has these parts:
starts with letters (upper and lower case), numbers and some of the following characters (!,#,$,?)
begins with = and contains only numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently I am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into five capture groups and not into three, as I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
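A minimal sketch of how the three-group pattern could be used for validation in Python. The helper name parse is hypothetical, and re.fullmatch is used here on the assumption that the whole input must be valid, not just a prefix of it.

```python
import re

# Three groups: allowed characters, "=" plus digits, "<<" plus anything.
pattern = re.compile(r"([a-zA-Z0-9!#$?]{2,})(=[0-9]+)(<<.*)")

def parse(text):
    """Return the three parts as a tuple, or None if the input is invalid."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None

print(parse("!!Hel##lo!#=7<<vbnfhfg"))  # ('!!Hel##lo!#', '=7', '<<vbnfhfg')
print(parse("Hello=<<x"))               # None: no digits after "="
```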

Python regex ignore punctuation when using re.sub

Let's say I want to convert the word center to centre, theater to theatre, etc. In order to do so, I have written a regex like the one below:
s = "center ce..nnnnnnnnteeeerrrr mmmmeeeeet.eeerrr liiiiIIiter l1t3r"
regex = re.compile(r'(?:((?:(?:[l1]+\W*[i!1]+|m+\W*[e3]+|c+\W*[e3]+\W*n+)\W*t+|t+\W*h+\W*[e3]+\W*a+\W*t+|m+\W*a+\W*n+\W*[e3]+\W*u+\W*v+)\W*)([e3]+)(\W*)(r+))', re.I)
print(regex.sub(r'\1\4\3\2', s))
#prints "centre ce..nnnnnnnntrrrreeee mmmmeeeeet.rrreee liiiiIIitre l1tr3"
In order to account for loopholes like c.e.nn.ttteee,/rr (basically repeated characters and added punctuation), I have been forced to add \W* between each character.
However, people are still able to use strings like c.c.e.e.n.n.t.t.e.e.r.r, which don't match as there is punctuation between each letter, not just different letters.
I was wondering whether there is a smarter method of doing this, where I can use re.sub without removing whitespace/punctuation but nonetheless have it match.

What am I doing wrong with this negative lookahead? Filtering out certain numbers in a regex

I have a big piece of code produced by a software. Each instruction has an identifier number and I have to modify only certain numbers:
grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(11)), new Result(0,Boolean.FALSE,"ROSSO")));
grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(12)), new Result(0,Boolean.FALSE,"£££")));
etc...
Now, I am using SublimeText3 to change rapidly all of the wrong lines with this regex:
Of\((11|14|19|20|21|27|28|31)\)\), new Result\(
This regex above allowed me to put "ROSSO" (red) in each line containing those numbers. Now I have to put "VERDE" (green) in the remaining lines. My idea was to add a ?! in the Regex to look for all of the lines NOT CONTAINING those numbers.
From the website Regex101 I get in the description of the regex:
Of matches the characters Of literally (case sensitive)
\( matches the character ( literally (case sensitive)
Negative Lookahead (?!11|14|19|20|21|27|28|31)
Assert that the Regex below does not match
1st Alternative 11
etc...
So why am I not finding the lines containing 12, 13, 15 etc?
Edit: the Actual Regex: Of\((?!11|14|19|20|21|27|28|31)\)\), new Result\(
Your problem is that you are assuming a negative lookahead changes the cursor position; it does not.
That is, a negative lookahead of the form (?!xy) merely verifies that the next two characters are not xy. It does not then swallow two characters from the text. As its name suggests, it merely looks ahead from where you are, without moving ahead!
Thus, if you wish to match further things beyond that assertion you must:
negatively assert it is not xy;
then consume the two characters for whatever they are;
then continue your match.
So try something like:
Of\((?!11|14|19|20|21|27|28|31)..\)\), new Result\(
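The same behaviour can be checked in Python's re module, which treats lookaheads the same way. This is only a sketch: it uses \d+ instead of .. so that identifiers of any length are consumed, and it puts \) inside the lookahead so that a number such as 110 is not rejected merely for starting with 11. Both are variations of mine, not part of the pattern above.

```python
import re

lines = [
    'grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(11)), new Result(0,Boolean.FALSE,"ROSSO")));',
    'grr.add(new GenericRuleResult(RULEX_RULES.get(String.valueOf(12)), new Result(0,Boolean.FALSE,"£££")));',
]

excluded = r"11|14|19|20|21|27|28|31"

# Lookahead only: nothing consumes the digits, so "\)" is tested against "1".
broken = re.compile(r"Of\((?!" + excluded + r")\)\), new Result\(")

# Assert first, then consume the digits, then continue the match.
fixed = re.compile(r"Of\((?!(?:" + excluded + r")\))\d+\)\), new Result\(")

broken_hits = [bool(broken.search(line)) for line in lines]
fixed_hits = [bool(fixed.search(line)) for line in lines]

print(broken_hits)  # [False, False]
print(fixed_hits)   # [False, True]: only the line with 12 is found
```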

Regex which matches the longer string in an OR

Motivation
I'm parsing addresses and need to get the address and the country in separate matches, but the countries might have aliases, e.g.:
UK == United Kingdom,
US == USA == United States,
Korea == South Korea,
and so on...
Explanation
So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:
germany|us|france|chile
But the problem is with multi-word country names and their shorter versions, like:
Republic of Moldova and Moldova
Using this as example, we have the string:
'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
What I want to get from this:
'Somewhere in Moldova, bla bla, 12313'
'Republic of Moldova'
But this is what I get:
'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'
Regex
As there are several cases, here is what I'm using so far:
^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$
As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:
(.*[\d\-]+.*|,.*[:/].*)?
Also, sometimes the country name comes enclosed in parenthesis, so I have \(? and \)? around the second match group, and all the countries go inside it:
(republic of moldova|moldova|...)
Question
The thing is, when there is an entry which is a subset of a longer one, the shorter is chosen over the longer, and the remainder stays in the base_address string.
Is there a way to tell the regex to choose the longest possible match when two alternatives match?
Edit
I'm using Python with built in re module
As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:
'Department of Chemistry, National University of Singapore, 4512436 Singapore'
Matches:
'Department of Chemistry, National University of'
'Singapore'
Here it matches too soon now.
Your problem is greediness.
The .* right at the beginning tries to match as much as possible. That is everything until the end of the string. But then the rest of your pattern fails. So the engine backtracks, and discards the last character matched with .* and tries the rest of the pattern again (which still fails). The engine will repeat this process (fail match, backtrack/discard one character, try again) until it can finally match with the rest of the pattern. The first time this occurs is when .* matches everything up to Moldova (so .* is still consuming Republic of). And then the alternation (which still cannot match republic of moldova) will gladly match moldova and return that as the result.
The simplest solution is to make the repetition ungreedy:
^(.*?)...
Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
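The difference can be seen with a deliberately simplified version of the pattern. The sketch below drops the optional parentheses and the trailing "fax/phone" group from the question's regex, so it is an assumption-laden illustration only; with the full pattern the optional trailing group can change what the ungreedy version matches.

```python
import re

address = "Somewhere in Moldova, bla bla, 12313, Republic of Moldova"
countries = r"(republic of moldova|moldova)"

# Greedy: .* grabs as much as it can and gives back only what it must.
greedy = re.match(r"^(.*),? " + countries + r"$", address, re.I)

# Ungreedy: .*? starts empty and grows one character per failed attempt.
lazy = re.match(r"^(.*?),? " + countries + r"$", address, re.I)

print(greedy.groups())
# ('Somewhere in Moldova, bla bla, 12313, Republic of', 'Moldova')
print(lazy.groups())
# ('Somewhere in Moldova, bla bla, 12313', 'Republic of Moldova')
```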
EDIT:
There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem that countries in earlier parts of the string might be matched. What you can do instead, is to use lookarounds that ensure that there are no word characters (letters, digits, underscore) before or after the country. That means, a country word is only matched, if it is surrounded by commas or either end of the string:
^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$
Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:
There is no word character before the mandatory space preceding the country.
There is no word character after the country that is separated by nothing but spaces.
Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
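A small sketch of the lookaround pattern in Python, with the placeholder alternation replaced by a few country names (the list itself is hypothetical). Because the first group is still greedy, it keeps the comma in front of the country, so the sketch strips it afterwards.

```python
import re

# Hypothetical country list; longer names first, as in the question.
countries = r"(republic of moldova|moldova|singapore)"

pattern = re.compile(
    r"^(.*),?(?<!\w)[ ][(]?" + countries + r"[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$",
    re.I,
)

m = pattern.match("Somewhere in Moldova, bla bla, 12313, Republic of Moldova")
base, country = m.group(1).rstrip(","), m.group(2)
print(base)     # Somewhere in Moldova, bla bla, 12313
print(country)  # Republic of Moldova

# A zip code directly in front of the country is a word character, so the
# lookbehind rejects this input and there is no match at all.
other = pattern.match(
    "Department of Chemistry, National University of Singapore, 4512436 Singapore"
)
print(other)    # None
```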
EDIT 2:
As abarnert mentioned in a comment, how about not using a regex-only solution?
You could split the string on commas, then trim every result, and check these against your list of countries (possibly using regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple ones, then at least you can detect the ambiguity and deal with it properly.
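One possible shape for this split-and-check approach, as a rough sketch. The COUNTRIES set and the helper name split_country are hypothetical, and stripping a leading zip-like number from each component is my own assumption about how to handle inputs such as '4512436 Singapore'.

```python
import re

# Hypothetical, lower-cased list of known country names and aliases.
COUNTRIES = {"republic of moldova", "moldova", "singapore"}

def split_country(address):
    """Return (base_address, country); raise ValueError if none or ambiguous."""
    parts = [p.strip() for p in address.split(",")]
    hits = []
    for i, part in enumerate(parts):
        # Drop surrounding parentheses and a leading zip-like number.
        candidate = re.sub(r"^[\d\-\s]+", "", part.strip("()"))
        if candidate.lower() in COUNTRIES:
            hits.append((i, candidate))
    if len(hits) != 1:
        raise ValueError(f"expected exactly one country, found {len(hits)}")
    index, country = hits[0]
    # Everything before the country component is kept as the base address.
    return ", ".join(parts[:index]), country

print(split_country("Somewhere in Moldova, bla bla, 12313, Republic of Moldova"))
# ('Somewhere in Moldova, bla bla, 12313', 'Republic of Moldova')
print(split_country(
    "Department of Chemistry, National University of Singapore, 4512436 Singapore"
))
# ('Department of Chemistry, National University of Singapore', 'Singapore')
```

In this sketch, a number attached to the country component and anything following the country are simply dropped.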
Sort all the alternatives in the regex: build the regex programmatically from an array of names sorted from longest to shortest. Then wrap the whole alternation in an atomic group (the PCRE engine has it; I don't know whether Python's re engine has it too). Because of the atomic group, the regex engine never backtracks to try another alternative inside the group, and since the alternatives are sorted, the match will always be the longest one.
Tada.
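A sketch of building the alternation programmatically, using a hypothetical list of names. The atomic group is left out so that the code runs on older Python versions (Python's re accepts (?>...) from 3.11 on). Sorting only decides which alternative wins at a given starting position; it does not by itself change what a preceding greedy .* consumes.

```python
import re

# Hypothetical list of names and aliases, deliberately not sorted by length.
names = ["us", "usa", "united states", "moldova", "republic of moldova"]

def build_alternation(names):
    """Join the escaped names with |, longest name first."""
    ordered = sorted(names, key=len, reverse=True)
    return "|".join(re.escape(name) for name in ordered)

unsorted_re = re.compile("|".join(names), re.I)
sorted_re = re.compile(build_alternation(names), re.I)

print(unsorted_re.search("USA").group())  # US  ("us" is tried first and wins)
print(sorted_re.search("USA").group())    # USA ("usa" is tried before "us")
```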

beginning and ending sign in regular expression in python

'[A-Za-z0-9-_]*'
'^[A-Za-z0-9-_]*$'
I want to check whether a string contains only the characters in the above expression; I just want to make sure no other weird characters like #%&/() are in the string.
I am wondering whether there is any difference between these two regular expressions. Do the beginning and ending signs matter? Will they affect the result somehow?
In Python, re.match is anchored at the beginning of the string (re.search is not): hence with re.match the ^ sign at the beginning doesn’t make any difference. However, the $ sign does very much make one: if you don’t include it, you’re only going to match the beginning of your string, and the end could contain anything, including the characters you want to exclude. Just try re.match("[a-z0-9]*", "abcdef/%&").
In addition to that, you may want to use a regular expression that simply excludes the characters you’re testing for; it’s much safer (hence [^#%&/()]; the parentheses do not need escaping inside a character class).
The beginning and end sign match the beginning and end of a String.
The first will match any String that contains zero or more occurrences of the class [A-Za-z0-9-_] (basically any string whatsoever...).
The second will match an empty String, but not one that contains characters not defined in [A-Za-z0-9-_]
Yes it will. A regex can match anywhere in its input. # will match in your first regex.
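The difference between the two expressions can be checked directly. This is a minimal sketch with made-up test strings; re.fullmatch is shown as an alternative to writing the anchors by hand.

```python
import re

unanchored = r"[A-Za-z0-9-_]*"
anchored = r"^[A-Za-z0-9-_]*$"

text = "abc#%&"  # hypothetical input containing unwanted characters

# Without anchors the pattern happily matches just the valid prefix.
prefix = re.match(unanchored, text).group()

# With ^ and $ the unwanted characters make the whole match fail.
anchored_ok = re.match(anchored, text) is not None

# re.fullmatch requires the entire string to match, no anchors needed.
full_ok = re.fullmatch(unanchored, text) is not None
clean_ok = re.fullmatch(unanchored, "abc_-9") is not None

print(prefix)       # abc
print(anchored_ok)  # False
print(full_ok)      # False
print(clean_ok)     # True
```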
