Use search with regex to find Korean characters using Python


Using Python 2.7.9 on Windows 8.1 Enterprise 64-bit
I'm using the following code to search for any Korean characters ( http://lcweb2.loc.gov/diglib/codetables/9.3.html )
line = ['x', 'y', 'z', '쭌', 'a']
if any([re.search("[%s-%s]" % ("\xE3\x84\xB1".decode('utf-8'), "\xEC\xAD\x8C".decode('utf-8')), x) for x in line[3:]]):
    print "found character"
Whenever I run the script and give it the character 쭌, the console shows ∞¡î, which I'm guessing is a result of IDLE / Command Prompt being unable to display Korean characters.
쭌 is the last character that I was hoping to match in the regex
So is the above search correct at least? I'd prefer to know that I at least have the right pattern to search for, and then spend time trying to make the console show the proper Korean characters.
I've tried running chcp 1252 in Command Prompt, with no effect. The script never prints "found character", so I wouldn't ever know.
If it helps, the script is receiving text from an IRC channel where Korean is usually spoken.

Use Unicode strings (note the "u" prefixes):
import re
line = [u'x', u'y', u'z', u'쭌', u'a']
if any([re.search(u'[\u3131-\ucb4c]', x) for x in line[3:]]):
    print "found character"
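One thing to be aware of: the range \u3131-\ucb4c also covers the CJK Unified Ideographs block (U+4E00 to U+9FFF), so Chinese or Japanese text would match as well, and it stops before the last Hangul syllable, U+D7A3. A minimal sketch of a tighter check, written for Python 3 and assuming that the Hangul Jamo, Hangul Compatibility Jamo and Hangul Syllables blocks are the characters of interest (the names HANGUL_RE and contains_hangul are made up for this illustration):

```python
import re

# Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables
HANGUL_RE = re.compile('[\u1100-\u11ff\u3130-\u318f\uac00-\ud7a3]')


def contains_hangul(text):
    """Return True if text has at least one character in the ranges above."""
    return HANGUL_RE.search(text) is not None


# '\ucb4c' is the character from the question, written as an ASCII-only escape
line = ['x', 'y', 'z', '\ucb4c', 'a']
if any(contains_hangul(x) for x in line[3:]):
    print("found character")
```

Writing the character as an escape keeps the source file pure ASCII, so the check does not depend on how the file was saved.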

If you wanted to use the regex library (not to be confused with re), you could do this:
import regex
regex.search(r'\p{IsHangul}', '오소리')
or in a function to detect at least one Hangul character:
import regex
def is_hangul(value):
    if regex.search(r'\p{IsHangul}', value):
        return True
    return False
print(is_hangul('오소리')) # True
print(is_hangul('mushroom')) # False
print(is_hangul('뱀')) # True

Related

Python: \b behaves differently from book's description [duplicate]

I'm reading automate_the_boring_stuff_with_python_2015 and I got to this snippet:
print(positionStr, end='')
print('\b' * len(positionStr), end='', flush=True)
where positionStr is a string defined earlier. I looked at Python's escape sequences and saw that \b is backspace, but for some reason the author says it should erase the printed string:
To erase text, print the \b backspace escape character. This special
character erases a character at the end of the current line on the screen. The line at u uses string replication to produce a string with as many \b
characters as the length of the string stored in positionStr, which has the
effect of erasing the positionStr string that was last printed.
This contradicts what I saw here (table in mid page), and it differs from my results.
As you can see, I got a bunch of backspace characters, as I guess I should have (I ran a loop in which I printed the string and then the \b string).
Now, is the book wrong, or should I have done something different for it to work? Additionally, if the book is wrong, is there a way to achieve this goal (print a string and then delete it)?
As can be seen from the picture, I am working with Python 3.5.3 on Windows 8.1.

Not all consoles support the \b character as a deletion character, especially graphical ones.
(same thing happens when you write it to a file, the previous char is not deleted either)
Try your example in a native shell (Windows or Linux would work) and the characters will be properly deleted.
Windows CMD:
>>> print("a\bc")
c
PyScripter (that's what I have):
>>> print("a\bc")
a<strange char>c
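To see why "a\bc" shows up as just "c" in a console that honours \b, here is a rough sketch of a simplified model in which a backspace only moves the cursor one cell to the left and the next character overwrites whatever is there. The function name render is made up for this illustration, and real terminals handle many more cases (line wrapping, wide characters and so on):

```python
def render(text):
    """Simulate a single terminal line where '\\b' moves the cursor left."""
    cells = []   # characters currently shown on the line
    cursor = 0   # index of the cell the next character will go into
    for ch in text:
        if ch == '\b':
            cursor = max(cursor - 1, 0)   # move left, erase nothing
        else:
            if cursor < len(cells):
                cells[cursor] = ch        # overwrite the existing cell
            else:
                cells.append(ch)
            cursor += 1
    return ''.join(cells)


print(render('a\bc'))                     # c
print(render('abc' + '\b' * 3))           # abc
print(render('abc' + '\b' * 3 + 'xy'))    # xyc
```

In this model the backspaces by themselves leave the text in place; it only disappears once something else is printed over it, which is what the next print in a loop does.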

Python UTF-8 REGEX

I have a problem while trying to find text specified in regex.
Everything worked perfectly fine, but when I added "\£" to my regex it started causing problems. I get a SyntaxError: "Non-ASCII character '\xc2' in file (...) but no encoding declared...".
I've tried to solve this problem by using
import sys
reload(sys) # to enable `setdefaultencoding` again
sys.setdefaultencoding("UTF-8")
but it doesn't help. The re.UNICODE flag doesn't help, and saving the string as unicode (pat) doesn't help either. Is there any way to fix this regex? I just want to build a regular expression and use the pound sign in it. Thanks for your help.
k = text.encode('utf-8')
pat = u'salar.{1,6}?([0-9\-,\. \tkFFRroOMmTtAanNuUMm\$\&\;\£]{2,})'
pattern = re.compile(pat, flags = re.DOTALL|re.I|re.UNICODE)
salary = pattern.search(k).group(1)
print (salary)
The error is still there even if I comment out (put "#" in front of) all of those lines. Maybe it's not connected with the re library but with my settings?

The error message means Python cannot guess which character set you are using. It also tells you that you can fix it by telling it the encoding of your script.
# coding: utf-8
string = "£"
or equivalently
string = u"\u00a3"
Without an encoding declaration, Python sees a bunch of bytes which mean different things in different encodings. Rather than guess, it forces you to tell it what they mean. This is codified in PEP-263.
(ASCII is unambiguous [except if your system is EBCDIC I guess] so it knows what you mean if you use a pure-ASCII representation for everything.)
The encoding settings you were fiddling with affect how files and streams are read, and program I/O generally, but not how the program source is interpreted.
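As a rough illustration of the escape-based approach, here is a sketch written for Python 3. The pattern is a hypothetical, simplified version of the one in the question, and POUND and find_amount are names made up for this example:

```python
import re

POUND = '\u00a3'  # the pound sign, written with an ASCII-only escape

# In UTF-8 the pound sign is the two bytes C2 A3; the first of them is the
# '\xc2' named in the error message.
print(POUND.encode('utf-8'))  # b'\xc2\xa3'

# Hypothetical, simplified pattern: "salar", 1 to 6 characters, then an amount
pattern = re.compile('salar.{1,6}?([0-9,.$' + POUND + ']{2,})',
                     re.DOTALL | re.IGNORECASE)


def find_amount(text):
    """Return the first amount found after "salar...", or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


# The pound sign followed by 45,000
amount = find_amount('Salary: \u00a345,000')
```

Because every non-ASCII character is spelled as an escape, the source file stays pure ASCII and needs no encoding declaration.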

Python :Non-UTF-8 code starting with '\xe8' in file [duplicate]

I am trying to write a binary search program for a class, and I am pretty sure that my logic is right, but I keep getting a non-UTF-8 error. I have never seen this error and any help/clarification would be great! Thanks a bunch.
Here's the code.
def main():
    str names = [‘Ava Fischer’, ‘Bob White’, ‘Chris Rich’, ‘Danielle Porter’, ‘Gordon Pike’, ‘Hannah Beauregard’, ‘Matt Hoyle’, ‘Ross Harrison’, ‘Sasha Ricci’, ‘Xavier Adams’]
    binarySearch(names, input(str("Please Enter a Name.")))
    print("That name is at position "+position)
def binarySearch(array, searchedValue):
    begin = 0
    end = len(array) - 1
    position = -1
    found = False
    while !=found & begin<=end:
        middle=(begin+end)/2
        if array[middle]== searchedValue:
            found=True
            position = middle
        elif array[middle] >value:
            end=middle-1
        else:
            first =middle+1
    return position

Add this line at the top of your code. It may work.
# coding=utf8

Your editor replaced ' (ASCII 39) with U+2018 LEFT SINGLE QUOTATION MARK characters, usually a sign you used Word or a similar word processor instead of a plain text editor; a word processor tries to make your text 'prettier' and auto-replaces things like simple quotes with fancy ones. This was then saved in the Windows 1252 codepage encoding, where the fancy quotes were saved as hex 91 characters.
Python is having none of it. It wants source code saved in UTF-8 and using ' or " for quotation marks. Use notepad, or better still, IDLE to edit your Python code instead.
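If the file has already been mangled, the fancy quotes can be swapped back in bulk. A minimal sketch for Python 3, assuming the text has already been read into a string (the names SMART_QUOTES and straighten_quotes are made up for this illustration):

```python
# Curly quotes that word processors tend to substitute, mapped to ASCII quotes
SMART_QUOTES = str.maketrans({
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
})


def straighten_quotes(source):
    """Return source with curly quotes replaced by their ASCII counterparts."""
    return source.translate(SMART_QUOTES)


print(straighten_quotes('names = [\u2018Ava Fischer\u2019, \u2018Bob White\u2019]'))
# names = ['Ava Fischer', 'Bob White']
```

Note that this replaces every curly quote, including any that were meant to appear inside a string, so it is worth reviewing the result.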
You have numerous other errors in your code; you cannot use spaces in your variable names, for example, and Python uses and, not & as the boolean AND operator. != is an operator requiring 2 operands (it means 'not equal', the opposite of ==), the boolean NOT operator is called not.
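For reference, one way the search could look with those errors fixed. This is only a sketch for Python 3, not necessarily what the assignment expects; it uses integer division (//) so that middle stays an integer, and the snake_case names are my own choice:

```python
def binary_search(array, searched_value):
    """Return the index of searched_value in the sorted list array, or -1."""
    begin = 0
    end = len(array) - 1
    while begin <= end:
        middle = (begin + end) // 2      # integer division keeps a valid index
        if array[middle] == searched_value:
            return middle
        elif array[middle] > searched_value:
            end = middle - 1             # continue in the lower half
        else:
            begin = middle + 1           # continue in the upper half
    return -1


names = ['Ava Fischer', 'Bob White', 'Chris Rich', 'Danielle Porter',
         'Gordon Pike', 'Hannah Beauregard', 'Matt Hoyle', 'Ross Harrison',
         'Sasha Ricci', 'Xavier Adams']
position = binary_search(names, 'Matt Hoyle')
print("That name is at position " + str(position))  # That name is at position 6
```

The function returns the position instead of setting a local variable, since a variable assigned inside binarySearch is not visible in main.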

If you're using Notepad++, click Encoding at the top and choose Encode in UTF-8.

The character you are beginning your constant strings with is not the right string delimiter. You are using
‘Ava Fischer’ # ‘ and ’ as string delimiters
when it should have been either
'Ava Fischer' # Ascii 39 as string delimiter
or maybe
"Ava Fischer" # Ascii 34 as string delimiter

Add this line to the top of your code, it might help
# -*- coding:utf-8 -*-

Shell text to python string

I'm writing a little python utility to help move our shell -help documentation to searchable webpages, but I hit a weird block :
output = subprocess.Popen([sys.argv[1], '--help'],stdout=subprocess.PIPE).communicate()[0]
output = output.split('\n')
print output[4]
#NAME
for l in output[4]:
    print l
#N
#A
#
#A
#M
#
#M
#E
#
#E
#or when written, n?na?am?me?e
It does this for any heading/subheading in the documentation, which makes it near unusable.
Any tips on getting correct formatting? Where did I screw up?
Thanks

The documentation contains overstruck characters done in the ancient line-printer way: print each character, followed by a backspace (\b aka \x08), followed by the same character again. So "NAME" becomes "N\bNA\bAM\bME\bE". If you can convince the program not to output that way, it would be the best; otherwise, you can clean it up with something like output = re.sub(r'\x08.', '', output)
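A small sketch of that cleanup, written for Python 3 and using the same pattern (strip_overstrike is a name made up for this example). Under Python 3 the output of communicate() is bytes, so it would need to be decoded to a string first:

```python
import re


def strip_overstrike(text):
    """Remove each backspace together with the character that follows it."""
    return re.sub(r'\x08.', '', text)


heading = 'N\bNA\bAM\bME\bE'
print(strip_overstrike(heading))  # NAME
```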

A common way to mark a character as bold in a terminal is to print the character, followed by a backspace character, followed by the character itself again (just like you would do it on a mechanical typewriter). Terminal emulators like xterm detect such sequences and turn them into bold characters. Programs shouldn't be printing such sequences if stdout is not a terminal, but if your tool does, you will have to clean up the mess yourself.

New line and tab characters in python on mac

I am printing a string to the Python shell on Mac OS X 10.7.3.
The string contains newline characters, \n\r, and tabs, \t.
I'm not 100% sure which newline characters are used on each platform; however, I've tried every combination (\n, \n\r, \r) and the newline characters are printed literally in the shell:
'hello\n\r\tworld'
I don't know what I'm doing wrong; any help is appreciated.
Thanks

What look to you like newlines and carriage returns are actually two characters each: a backslash plus a normal character.
Fix this by using your_string.decode('string_escape'):
>>> s = 'hello\\n\\r\\tworld' # or s = r'hello\n\r\tworld'
>>> print s
hello\n\r\tworld
>>> print repr(s)
'hello\\n\\r\\tworld'
>>> print s.decode('string_escape')
hello
world
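The 'string_escape' codec exists only in Python 2. Under Python 3, a rough equivalent is to go through the 'unicode_escape' codec. The sketch below assumes the text is pure ASCII (the encode step raises an error otherwise), and unescape is a name made up for this example:

```python
def unescape(text):
    """Turn literal backslash escapes such as \\n into the characters they name."""
    return text.encode('ascii').decode('unicode_escape')


s = r'hello\n\r\tworld'        # backslash + letter pairs, as in the question
print(repr(s))                 # 'hello\\n\\r\\tworld'
print(repr(unescape(s)))       # 'hello\n\r\tworld'
```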
