find non English characters in python string [duplicate] - python

This question already has answers here:
Detect strings with non English characters in Python
(6 answers)
Closed 2 years ago.
I am collecting strings that may have writing of other languages in it and I want to find all strings that contain non English characters.
for example
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']

Depends on what you mean with "non-english" characters. If you are only allowing characters a-z you could use the string method "isalpha".
lst = ['english1234!', 'Engl1sh', 'not english 行中ワ']
allowed_strings = [string for string in lst if string.isalpha()]
If alphanumeric is allowed, use string.isalnum()
If alphanumeric + standard special characters, you could use string.isascii()
If any other specific scenarios is allowed, use regex.
e.g. in your example if using isascii() in the list comprehension above, you would remove the last string ut keep the first 2.

If you want to also have special character, you cannot use isAlpha() alone, but perhaps that's a start. (it won't accept "hi!" or "hi here")

First you need to decide what English character means. Do you want to reject words like café or naïve?
If you want only A-Z or A-Z and numbers you can use str.isalpha() or str.isalnum(). You can't use str.isascii() in your case, as the 7-bit US-ASCII range doesn't include any accented characters, just some extra symbols.
To include accented characters you can use a regular expression using the regex package and match against specific Unicode scripts or character blocks. For example, \p{IsLatin} will match all characters in the Latin1 script.
To find strings with non-English words you can use [^\p{IsLatin}]:
regex.match(r'[^\{IsLatin}]', 'not english 行中ワ')

Related

Regular expression that accepts tokens of three or more alphabetical characters

I'm trying to build a TFIDVectorizer that only accepts tokens of 3 or more alphabetical characters using TFIdfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")
But it doesn't behave correctly, I know token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of 3 or more alphanumerical characters, so I just don't understand why the former is not working.
What am I missing?
The problem lies in using the \D metacharacter, as it's actually for matching any non-digit character, rather than any alphabetical character. From Python docs:
You can go instead with:
token_pattern="(?i)[a-z]{3,}"
Explanation:
(?i) — inline flag to make matching case-insensitive,
[a-z] — matches any Latin letter,
{3,} — makes the previous token match three or more times (greedily, i.e., as many times as possible).
I hope this answers your question. :)

Regex matching Unicode variable names

In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am in search for a new regex which will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word literal, including numbers and the underscore. I am however unsure whether this character set contains exactly those characters which might be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parentheses will thus match any word literal which at the same time is not a number. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.
You can use a double negative - \W is anything that \w is not - just disallow it to allow any \w:
[^\W0-9]\w*
essentially using any not - non-wordcharacter except 0-9 followed by any word character any number of times.
Doku: regular-expression-syntax
You could try using
^(?![0-9])\w+$
Which will not partial match invalid variable names
Alternatively, if you don't need to use regex. str.isidentifier() will probably do what you want.

Regex - replace word having plus or brackets [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 6 years ago.
In Python, I am trying to do
text = re.sub(r'\b%s\b' % word, "replace_text", text)
to replace a word with some text. Using re rather than just doing text.replace to replace only if the whole word matches using \b. Problem comes when there are characters like +, (, [ etc in word. For example +91xxxxxxxx.
Regex treats this + as wildcard for one or more and breaks with error. sre_constants.error: nothing to repeat. Same is in the case of ( too.
Could find a fix for this after searching around a bit. Is there a way?
Just use re.escape(string):
word = re.escape(word)
text = re.sub(r'\b{}\b'.format(word), "replace_text", text)
It replaces all critical characters with a special meaning in regex patterns with their escape forms (e.g. \+ instead of +).
Just a sidenote: formatting with the percent (%) character is deprecated and was replaced by the .format() method of strings.

Why do I need to add DOTALL to python regular expression to match new line in raw string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
Why does one need to add the DOTALL flag for the python regular expression to match characters including the new line character in a raw string. I ask because a raw string is supposed to ignore the escape of special characters such as the new line character. From the docs:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is my situation:
string = '\nSubject sentence is: Appropriate support for families of children diagnosed with hearing impairment\nCausal Verb is : may have\npredicate sentence is: a direct impact on the success of early hearing detection and intervention programs in reducing the negative effects of permanent hearing loss'
re.search(r"Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string ,re.DOTALL)
results in a match , However , when I remove the DOTALL flag, I get no match.
In regex . means any character except \n
So if you have newlines in your string, then .* will not pass that newline(\n).
But in Python, if you use the re.DOTALL flag(also known as re.S) then it includes the \n(newline) with that dot .
Your source string is not raw, only your pattern string.
maybe try
string = r'\n...\n'
re.search("Subject sentence is:(.*)Causal Verb is :(.*)predicate sentence is:(.*)", string)

re.split not working on ^A [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 5 years ago.
I'm trying to parse lines of input that look like
8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A
where you have different fields of data separated by ^A. I'm trying to get at the individual data fields (like 8=FIX.4.2, 9=0126, 35=0, etc). The problem is that python sometimes interprets ^A as a single character (in vim this is ctrl-v, ctrl-a) and sometimes as the string '^A' with two characters. So I have tried doing
entries = re.split('^A|^A', str(line))
but later when i do
for entry in entries:
print entries
I just end up with the original string, with nothing split. Is this a problem with re.split?
Depends on what that line contains.
If you want to split on the 2-character string '^A', escape the special-to-regexps character ^, in this case probably meaning '\^A'.
It's more likely that this is instead the caret notation way of printing the single character with byte value 0x01, in which case you probably want to split on '\x01' instead.
(You might as well use string's own split() function, I'm guessing it's faster than using regexps for something this simple)
^ has a special meaning in regular expressions, so you should escape it first.
>>> strs = "8=FIX.4.2^A9=0126^A35=0^A34=000742599^A49=L3Q206N^A50=2J6L^A52=20130620-11:16:27.344^A369=000733325^A56=CME^A57=G^A142=US,IL^A1603=OMS2^A1604=0.1^A"
>>> re.split('\^A',strs)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
From docs:
'^' : (Caret.) Matches the start of the string, and in MULTILINE mode also
matches immediately after each newline.
^ is a metacharacter, it matches only at the start of a string. Escape it:
>>> re.split('\^A', line)
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']
There is no need to use a | in your expression, especially not when both 'alternate' strings are the same.
It appears however that you have the \x07 or \a control character, not the two-character ^A string. Just use .split() to split on that value, no need for a regular expression:
>>> line = line.replace('^A', '\a')
>>> line
'8=FIX.4.2\x079=0126\x0735=0\x0734=000742599\x0749=L3Q206N\x0750=2J6L\x0752=20130620-11:16:27.344\x07369=000733325\x0756=CME\x0757=G\x07142=US,IL\x071603=OMS2\x071604=0.1\x07'
>>> line.split('\a')
['8=FIX.4.2', '9=0126', '35=0', '34=000742599', '49=L3Q206N', '50=2J6L', '52=20130620-11:16:27.344', '369=000733325', '56=CME', '57=G', '142=US,IL', '1603=OMS2', '1604=0.1', '']

Categories

Resources