I found a solution on Stack Overflow, but it doesn't seem to work. I have written a string scanner that counts character frequencies and then replaces every character with its "real" character. I've made sure that the character recognition works, but when I replace all the characters in a string, the result no longer matches the expected/calculated characters. (When I replace only two characters, for example, it works fine and matches up perfectly.) Here is my replacement code:
print(text.replace(re,'e').replace(rt,'t').replace(ra,'a').replace(ro,'o').replace(ri,'i').replace(rn,'n').replace(rs,'s').replace(rr,'r').replace(rh,'h').replace(rl,'l').replace(ru,'u').replace(rc,'c').replace(rm,'m').replace(rf,'f').replace(ry,'y').replace(rw,'w').replace(rg,'g').replace(rp,'p').replace(rb,'b').replace(rv,'v').replace(rk,'k').replace(rx,'x').replace(rq,'q').replace(rj,'j').replace(rz,'z').replace(rd,'d'))
You might want to take a look at translate. Your code would probably look something like
text = text.translate(str.maketrans(''.join([ra, rb, rc, rd...]), 'abcd...'))
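A likely reason the chained replace calls go wrong is that each call works on the output of the previous one, so a character that has already been substituted can be substituted again. translate looks up every character exactly once. Here is a minimal Python 3 sketch of the difference; the three-letter cipher mapping is hypothetical and only stands in for the ra, rb, ... variables from the question:

```python
# Hypothetical cipher: 'x' stands for 'e', 'e' stands for 't', 't' stands for 'a'.
cipher_to_plain = {"x": "e", "e": "t", "t": "a"}
text = "ext"

# Chained replace: each call sees the output of the previous call,
# so already-decoded letters get decoded a second and third time.
chained = text
for cipher_char, plain_char in cipher_to_plain.items():
    chained = chained.replace(cipher_char, plain_char)

# translate: one pass, every input character is mapped once.
table = str.maketrans(cipher_to_plain)
translated = text.translate(table)

print(chained)     # aaa
print(translated)  # tea
```

With only two replacements the overlap may never occur, which would explain why the short version appeared to work.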
Bulmaca-Zeka Oyunu<200f>
I have a lot of strings in a text file, and I noticed that one of them contains this <200f> character. I want to find all entries that contain it and remove it, but in Vim searching for '<200f>' finds nothing. It is probably a single character, not six individual characters.
How can I search for or remove it in Python or Vim?
Those literal characters in angle brackets are how Vim handles some problematic unprintable characters. They are quite puzzling at first but they are really easy to figure out because they simply spell out the character's hexadecimal code.
In this case, <200f> is the literal representation of U+200F, which you search like this:
/\%u200f
So, if you want to get rid of the occurrences of <200f> in the current buffer, all you have to do is:
:%s/\%u200f//g
See :help \%u.
You can't find it in vim because you're not searching for the correct string.
Check your vim documentation: angle brackets are meta-characters. Simply escape them, and remove the undesired strings.
:%s/\<200f\>//g
In Python you could perhaps use string replacement :
line_of_text = "Bulmaca-Zeka Oyunu<200f>"
print(line_of_text.replace("<200f>", ""))
Output :
Bulmaca-Zeka Oyunu
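If the file contains the actual RIGHT-TO-LEFT MARK character (U+200F) rather than the six literal characters <200f>, the replacement has to target the code point itself. A small sketch, assuming Python 3 and the sample string from the question:

```python
# "\u200f" is the single code point U+200F (RIGHT-TO-LEFT MARK),
# which Vim displays as <200f>.
line_of_text = "Bulmaca-Zeka Oyunu\u200f"

has_mark = "\u200f" in line_of_text           # search for the character
cleaned = line_of_text.replace("\u200f", "")  # remove every occurrence

print(has_mark)       # True
print(repr(cleaned))  # 'Bulmaca-Zeka Oyunu'
```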
I am writing a personal information filter. When it encounters a VALID phone number or email address, it replaces it with "[PRIVATE]".
A valid phone number is, for example, '0123 45678', while '00123 45678' is invalid, but after filtering I get 0[PRIVATE] for the second one. How do I match only entire words with a regex? \bword\b is not working at all.
I'm betting that you forgot to use raw strings:
re.search("\bword\b")
finds a string that starts with a backspace character, then word, then another backspace character.
re.search(r"\bword\b")
finds an entire word.
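To make the difference concrete, here is a short Python 3 sketch. The phone pattern (four digits, a space, five digits) is only an assumption for illustration, not the asker's actual rule:

```python
import re

invalid = "call 00123 45678 now"
valid = "call 0123 45678 now"

# Without the r prefix, "\b" is one backspace character (U+0008);
# with it, r"\b" is the two characters backslash and b.
print(len("\b"), len(r"\b"))  # 1 2

# Non-raw: the pattern looks for literal backspaces, so it never matches.
plain = re.search("\b\\d{4} \\d{5}\b", valid)

# Raw: \b is a word boundary, so the extra leading digit blocks the match.
raw_invalid = re.search(r"\b\d{4} \d{5}\b", invalid)
raw_valid = re.search(r"\b\d{4} \d{5}\b", valid)

# No boundaries at all: matches inside the invalid number ("0[PRIVATE]").
unanchored = re.search(r"\d{4} \d{5}", invalid)

print(plain, raw_invalid)                          # None None
print(raw_valid.group(0), unanchored.group(0))     # 0123 45678 0123 45678
```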
This one will work:
re.search(r"([\d]+([\s]+)?[\d]+)")
I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way that I intend. It's not displaying correctly here, but the printed string has the acute accent over the :. The regex is moving the acute accent from over the a to over the :. For the string that I'm declaring, this regex is not supposed to be applied. My intention is for this regex to be applied only to examples like the following:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied and I don't want it to be. I think my problem is the second character set followed by the : (ie á:) is what's causing the regex statement to be applied. I've been staring at this for a while and tried a few other things and I feel like this should work but I'm missing something. Any help is appreciated!
The following code, with the re.UNICODE flag, also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', original)
á:tdfrec
Is it because of the diacritic and the tony the pony example for les misérable? The diacritic is on the wrong character after reversing it:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
edit: Definitely an issue with the combining diacritics, you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata
regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1) or two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they behave very differently in a regex because the combining accent can be matched as its own character. That is what is happening here with the string 'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
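The effect can be reproduced with a short sketch. This one assumes Python 3 (where the u and ur prefixes are not needed) and forces the pattern into decomposed (NFD) and composed (NFC) form explicitly, so it does not depend on how the source file happens to be encoded:

```python
import re
import unicodedata

raw_pattern = r"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):"

# NFD splits every accented letter into base letter + combining mark, so the
# character classes end up containing the bare combining accents.
nfd_pattern = unicodedata.normalize("NFD", raw_pattern)
# NFC keeps every accented letter as one precomposed code point.
nfc_pattern = unicodedata.normalize("NFC", raw_pattern)

decomposed = "a\u0301:tdfrec"                        # 'a' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # '\u00e1:tdfrec'

# Decomposed data: 'a' lands in group 1 and the accent in group 2.
moved = re.sub(nfd_pattern, r"\1:\2", decomposed)
# Normalized pattern and data: only one vowel precedes ':', so nothing changes.
unchanged = re.sub(nfc_pattern, r"\1:\2", composed)
# One of the intended cases still works after normalization.
intended = re.sub(nfc_pattern, r"\1:\2",
                  unicodedata.normalize("NFC", "aa\u0308:dtcbd"))

print(moved == "a:\u0301tdfrec")      # True
print(unchanged == composed)          # True
print(intended == "a:\u00e4dtcbd")    # True
```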
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.
I have a city name in Unicode, and I want to match it with a regex, but I also want it to validate when it is a plain string, like "New York".
I searched a little and tried the code below, but could not figure out how to do it.
I tried the regex "([\u0000-\uFFFF]+)" on this website: http://regex101.com/#python and it works there, but I could not get it working in Python.
Thanks in advance!
city=u"H\u0101na"
mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
mcity.group(0)
u'H'
mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
Unlike \x, \u is not a special sequence in regex syntax, so your character group matches a literal backslash, letter U, and so on.
To refer to non-ASCII in a regex you have to include them as raw characters in a Unicode string, for example as:
mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)
(If you don't want to double-backslash the \s, you could also use a ur string, in which \u still works as an escape but the other escapes like \x don't. This is a bit confusing though.)
This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
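That last idea could look roughly like the following. It is only a sketch, written for Python 3; the helper name is_plausible_city is made up for illustration, and "control characters" is taken to mean Unicode category Cc:

```python
import unicodedata


def is_plausible_city(name: str) -> bool:
    """Hypothetical helper: True if anything is left after dropping
    control characters (category Cc) and surrounding whitespace."""
    without_controls = "".join(
        ch for ch in name if unicodedata.category(ch) != "Cc"
    )
    return without_controls.strip() != ""


print(is_plausible_city("H\u0101na"))   # True
print(is_plausible_city("New York"))    # True
print(is_plausible_city(" \t\n"))       # False
```

This accepts non-ASCII names and plain ASCII names alike without trying to enumerate the allowed characters.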
I'm trying to use regular expressions to find three or more of the same character in a string. So for example:
'hello' would not match
'ohhh' would.
I've tried doing things like:
re.compile('(?!.*(.)\1{3,})^[a-zA-Z]*$')
re.compile('(\w)\1{5,}')
but neither seems to work.
(\w)\1{2,} is the regex you are looking for.
In Python it could be quoted like r"(\w)\1{2,}"
If you're looking for the same character three times consecutively, you can do this:
(\w)\1\1
If you want to find the same character three times anywhere in the string, you need to put a dot and an asterisk between the parts of the expression above, like so:
(\w).*\1.*\1
The .* matches any number of any character, so this expression should match any string which has any single word character that appears three or more times, with any number of any characters in between them.
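A quick way to compare the two patterns is to run them over a few sample words. This is a small Python 3 sketch; the word "banana" is an extra test case that is not in the question:

```python
import re

consecutive = re.compile(r"(\w)\1{2,}")  # same character 3+ times in a row
anywhere = re.compile(r"(\w).*\1.*\1")   # same character 3+ times anywhere

results = {}
for word in ["hello", "ohhh", "banana"]:
    results[word] = (
        bool(consecutive.search(word)),
        bool(anywhere.search(word)),
    )
    print(word, results[word])

# hello (False, False)
# ohhh (True, True)
# banana (False, True)
```

Note the raw strings: without the r prefix, "\1" is the character U+0001 rather than a backreference, which is probably part of why the attempts in the question did not match.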
Hope that helps.