Can't get the UNICODE chars - python

I have run into a problem while trying to get Unicode characters and put them in a list. The problem is that I'm getting the hex codes of the symbols and not the symbols themselves.
Can anyone help me with that?
My code:
KeysLst = []
for i in range(1000, 1100):
    char = unichr(i)
    KeysLst.append(char)
print KeysLst
Output:
[u'\u03e8', u'\u03e9', u'\u03ea', u'\u03eb', u'\u03ec', u'\u03ed', u'\u03ee', u'\u03ef', u'\u03f0', u'\u03f1', u'\u03f2', u'\u03f3', u'\u03f4', u'\u03f5', u'\u03f6', u'\u03f7', u'\u03f8', u'\u03f9', u'\u03fa', u'\u03fb', u'\u03fc', u'\u03fd', u'\u03fe', u'\u03ff', u'\u0400', u'\u0401', u'\u0402', u'\u0403', u'\u0404', u'\u0405', u'\u0406', u'\u0407', u'\u0408', u'\u0409', u'\u040a', u'\u040b', u'\u040c', u'\u040d', u'\u040e', u'\u040f', u'\u0410', u'\u0411', u'\u0412', u'\u0413', u'\u0414', u'\u0415', u'\u0416', u'\u0417', u'\u0418', u'\u0419', u'\u041a', u'\u041b', u'\u041c', u'\u041d', u'\u041e', u'\u041f', u'\u0420', u'\u0421', u'\u0422', u'\u0423', u'\u0424', u'\u0425', u'\u0426', u'\u0427', u'\u0428', u'\u0429', u'\u042a', u'\u042b', u'\u042c', u'\u042d', u'\u042e', u'\u042f', u'\u0430', u'\u0431', u'\u0432', u'\u0433', u'\u0434', u'\u0435', u'\u0436', u'\u0437', u'\u0438', u'\u0439', u'\u043a', u'\u043b', u'\u043c', u'\u043d', u'\u043e', u'\u043f', u'\u0440', u'\u0441', u'\u0442', u'\u0443', u'\u0444', u'\u0445', u'\u0446', u'\u0447', u'\u0448', u'\u0449', u'\u044a', u'\u044b']

You did get unicode characters.
However, Python is showing you Unicode literal escapes to make debugging easier. Those u'\u03e8' values are still one-character unicode strings, though.
Try printing the individual values in your list:
>>> print KeysLst[0]
Ϩ
>>> print KeysLst[1]
ϩ
>>> KeysLst[0]
u'\u03e8'
>>> KeysLst[1]
u'\u03e9'
The unicode escape representation is used for any codepoint outside of the printable ASCII range:
>>> u'A'
u'A'
>>> u'\n'
u'\n'
>>> u'\x86'
u'\x86'
>>> u'\u0025'
u'%'

When you print a list, you get the repr() of each element inside the list, surrounded by brackets and separated by commas.
If you are trying to print the unicode glyphs, try
KeysLst = []
for i in range(1000, 1100):
    char = unichr(i)
    KeysLst.append(char)

for char in KeysLst:
    print char,
which yields
Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ ϰ ϱ ϲ ϳ ϴ ϵ ϶ Ϸ ϸ Ϲ Ϻ ϻ ϼ Ͻ Ͼ Ͽ Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы
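As an aside, on Python 3 there is no unichr and print is a function; a minimal equivalent sketch (str is already Unicode there):
# Python 3: chr() replaces unichr()
keys = [chr(i) for i in range(1000, 1100)]
print(' '.join(keys))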

Related

invert regex pattern in python

I'm trying to keep only the Arabic characters from a string, but the following function doesn't work for me:
import re
def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile('^[\u0627-\u064a]')
    text = re.sub(non_arabic_char, "", text)
    print(text)
for example:
s = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
The desired output of remove_any_non_arabic_char(s) should be قال جالينوس قد اتفق جل من فسر هذا الكتا but the input string comes back unchanged.
What should I do?
First, you need to fix your regex as suggested in the comments. Then, for a more robust solution, you will need to expand your Unicode character range to cover the full Arabic block. Finally, you need to keep at least one space between Arabic words to keep the Arabic text legible:
import re
def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile(r'[^\s\u0600-\u06FF]')
    text_with_no_spaces = re.sub(non_arabic_char, "", text)
    text_with_single_spaces = " ".join(re.split(r"\s+", text_with_no_spaces))
    return text_with_single_spaces
text_1 = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
text_2 = '''
تغيّر مفهوم كلمة (أدب) من العصر الجاهلي jahili (pre-Islamic) era إلى الآن عبر
مراحل periods التاريخ المتعددة. ففي الجاهلية، كانت كلمة أدب تعني (الدعوة إلى
الطعام). وبعدها، استخدم الرسول محمد (عليه السلام) الكلمة بمعنى "التهذيب والتربية"
education and mannerism. وفي العصر الأموي، اتصلت had to do كلمة أدب
بالتاريخ والفقه والقرآن والحديث. أما في العصرالعباسي، فأصبحت تعني تعلّم الشعر
والنثر prose واتسع الأدب ليشمل أنواع المعرفة وألوانها وخصوصاً علم البلاغة واللغة.
أما في الوقت الحالي، فأصبحت كلمة أدب ذات صلة pertinent بالكلام البليغ
الجميل المؤثر that impacts في أحاسيس القاريء أو السامع.
'''
# Isleem, N. M., & Abuhakema, G. M. (2020). Kalima wa Nagham: A Textbook for
# Teaching Arabic, Volume 2 (Vol. 3). University of Texas Press. (page 5)
print('text_1: \n', remove_any_non_arabic_char(text_1))
print('\ntext_2: \n\n', remove_any_non_arabic_char(text_2))
Running the code on the two texts above in Jupyter leaves only the Arabic text.
Notice that punctuation marks shared between Arabic and English (like periods and brackets) are also removed. To keep those, you would need to handle them explicitly, for example as sketched below.
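A minimal sketch, assuming Python 3 and that periods and square brackets are the only shared punctuation you want to keep (the helper name remove_non_arabic_keep_punct is just for illustration); those characters are simply added to the allowed set:
import re

# keep whitespace, periods, square brackets, and the Arabic block; strip everything else
keep_some_punct = re.compile(r'[^\s.\[\]\u0600-\u06FF]')

def remove_non_arabic_keep_punct(text):
    text = re.sub(keep_some_punct, "", text)
    return " ".join(text.split())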

pyparsing a field that may or may not contain values

I have a dataset that resemebles the following:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
The problem I am having is that I can't figure out how to properly write a capture for the "Capture MICR - Serial" field. This field can either be blank or contain an alphanumeric value of varying length (I have the same problem with the other fields that can be either populated or blank).
I have tried some variations of the following, but am still coming up short.
pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))
I think part of the problem is that the Or matches the longest alternative, which in this case could be a long run of whitespace rather than a single alphanumeric value, but I would still want to capture that single value.
Thanks for everyone's help.
The simplest way to parse text like "A: valueA B: valueB C: valueC" is to use pyparsing's SkipTo class:
a_expr = "A:" + SkipTo("B:")
b_expr = "B:" + SkipTo("C:")
c_expr = "C:" + SkipTo(LineEnd())
line_parser = a_expr + b_expr + c_expr
I'd like to enhance this just a bit more:
add a parse action to strip off leading and trailing whitespace
add a results name to make it easy to get the results after the line has been parsed
Here is how that simple parser looks:
NL = LineEnd()
a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
line_parser = a_expr + b_expr + c_expr
line_parser.runTests("""
A: 100 B: Fred C:
A: B: a value with spaces C: 42
""")
Gives:
A: 100 B: Fred C:
['A:', '100', 'B:', 'Fred', 'C:', '']
- A: '100'
- B: 'Fred'
- C: ''
A: B: a value with spaces C: 42
['A:', '', 'B:', 'a value with spaces', 'C:', '42']
- A: ''
- B: 'a value with spaces'
- C: '42'
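The results names also let you read the parsed values directly after an ordinary parseString call; a small usage sketch with the parser defined above:
result = line_parser.parseString("A: 100 B: Fred C:")
print(result['A'])   # -> 100
print(result['B'])   # -> Fred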
I try to avoid copy/paste code when I can, and would rather automate the "A is followed by B" and "C is followed by end-of-line" pattern with a list describing the different prompt strings, then walk that list to build each sub-expression:
import pyparsing as pp

def make_prompt_expr(s):
    '''Define the expression for prompts as 'ABC:' '''
    return pp.Combine(pp.Literal(s) + ':')

def make_field_value_expr(next_expr):
    '''Define the expression for the field value as SkipTo(what comes next)'''
    return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])

def make_name(s):
    '''Convert prompt string to identifier form for results names'''
    return ''.join(s.split()).replace('-', '_')
# use split to easily define the list of prompts in order - makes it easy to update later if new prompts are added
prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')

# keep a list of all the prompt-value expressions
exprs = []

# get a list of this-prompt, next-prompt pairs
for this_, next_ in zip(prompts, prompts[1:] + [None]):
    field_name = make_name(this_)
    if next_ is not None:
        next_expr = make_prompt_expr(next_)
    else:
        next_expr = pp.LineEnd()
    # define the prompt-value expression for the current prompt string and add to exprs
    this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
    exprs.append(this_expr)

# define a line parser as the And of all of the generated exprs
line_parser = pp.And(exprs)
line_parser.runTests("""\
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
""")
Gives:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
- Acct: ''
- CaptureMICR_Serial: ''
- Opt4: ''
- Pos44: ''
- Split: ''
- Tc: '2064'
- Trrt: '32904'
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
- Acct: ''
- CaptureMICR_Serial: '1729XYZ'
- Opt4: 'XXL'
- Pos44: ''
- Split: '50'
- Tc: '2064'
- Trrt: '32904'
Does this do what you want?
I used Combine merely so that both arms of the Or would produce similar results, i.e., with 'Pos44:' at the end of the result string where it can be plucked off. I'm unhappy about resorting to a regex.
>>> import pyparsing as pp
>>> record_A = 'Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> record_B = 'Capture MICR - Serial: 76ZXP67 Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:'
>>> parser_fragment = pp.Combine(pp.White()+pp.Literal('Pos44:'))
>>> parser = pp.Literal('Capture MICR - Serial:')+pp.Or([parser_fragment,pp.Regex('.*?(?:Pos44\:)')])
>>> parser.parseString(record_A)
(['Capture MICR - Serial:', ' Pos44:'], {})
>>> parser.parseString(record_B)
(['Capture MICR - Serial:', '76ZXP67 Pos44:'], {})

Split and parse (to new file) string every nth character iterating over starting character - python

I asked about a more general approach to this problem in a previous post, but I am getting stuck trying to write my results out to individual files. I want to iterate over a long string, starting at position 1 (Python index 0), and print out every 100 characters. Then I want to move over one character, start at position 2 (Python index 1), and repeat the process until I reach the last 100 characters. I want to write each 100-character chunk to a new file. Here is what I am currently working with:
seq = 7524 # I get this number from a raw_input
read_num = 100
for raw_reads in range(100):
    def nlength_parts(seq, read_num):
        return map(''.join, zip(*[seq[i:] for i in range(read_num)]))
    f = open('read' + str(raw_reads), 'w')
    f.write("read" '\n')
    f.write(nlength_parts(seq, read_num))
    f.close
The error I am constantly getting now is:
f.write(nlength_parts(seq,read_num))
TypeError: expected a character buffer object
Having some issues, any help would be greatly appreciated!
After some help, I have made some changes, but it is still not working properly:
seq = 7524 # I get this number from a raw_input
read_num = 100

def nlength_parts(seq, read_num):
    return map(''.join, zip(*[seq[i:] for i in range(read_num)]))

for raw_reads in range(100): # Should be gene length - 100
    f = open('read' + str(raw_reads), 'w')
    f.write("read" + str(raw_reads))
    f.write(nlength_parts)
    f.close
I may have left out some important variables and definitions to keep my post short, but that has caused confusion. I have pasted my entire code below.
#! /usr/bin/env python
import sys, os
import random
import string

raw = raw_input("Text file: ")

with open(raw) as f:
    joined = "".join(line.strip() for line in f)

f = open(raw + '.txt', 'w')
f.write(joined)
f.closed

seq = str(joined)
read_num = 100

def nlength_parts(seq, read_num):
    return map(''.join, zip(*[seq[i:] for i in range(read_num)]))

for raw_reads in range(100): # ideally I want range to be len(seq)-100
    f = open('read' + str(raw_reads), 'w')
    f.write("read" + str(raw_reads))
    f.write('\n')
    f.write(str(nlength_parts))
    f.close
A few things:
1. You define the variables seq and read_num in the global scope, and then also use the same names as parameters in your function. The parameter names in the function definition should be different, and you should pass those two variables to the function when you call it.
2. When you call nlength_parts, you don't pass it either of the parameters you defined it with, and you also lack (). Fix that in conjunction with #1.
3. You don't seem to define the string you are slicing. You slice seq in your function, but seq is an integer in your code. Is seq the processed output of the file you were talking about in your comment? If so, is it much larger in your actual code?
That being said, I believe this code will do what you want it to do:
def nlength_parts(myStr, length, paddingChar=" "):
    # pad short strings so at least one window of the requested length exists
    if len(myStr) < length:
        myStr += paddingChar * (length - len(myStr))
    sequences = []
    for i in range(0, len(myStr) - length + 1):
        sequences.append(myStr[i:i+length])
    return sequences

foo = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
nlengthfoo = nlength_parts(foo, 10)
for x in range(len(nlengthfoo)):                 # len(), not length()
    with open("read" + str(x + 1), "w") as f:    # str() so the filename concatenation works
        f.write(nlengthfoo[x])
EDIT: Apologies, changed my code in response to your comment.
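For reference, with the corrected function above, the first few windows for foo come out like this:
>>> nlength_parts("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 10)[:3]
['ABCDEFGHIJ', 'BCDEFGHIJK', 'CDEFGHIJKL']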
Edit in response to clarifying comment:
Essentially, you want a rolling window of your string. Say long_string = "012345678901234567890123456789..." for a total length of 100.
In [18]: long_string
Out[18]: '0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789'
In [19]: window = 10
In [20]: for i in range(len(long_string) - window +1):
.....: chunk = long_string[i:i+window]
.....: print(chunk)
.....: with open('chunk_' + str(i+1) + '.txt','w') as f:
.....: f.write(chunk)
.....:
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
1234567890
2345678901
3456789012
4567890123
5678901234
6789012345
7890123456
8901234567
9012345678
0123456789
Finally,
In [21]: ls
chunk_10.txt chunk_20.txt chunk_30.txt chunk_40.txt chunk_50.txt chunk_60.txt chunk_70.txt chunk_80.txt chunk_90.txt
chunk_11.txt chunk_21.txt chunk_31.txt chunk_41.txt chunk_51.txt chunk_61.txt chunk_71.txt chunk_81.txt chunk_91.txt
chunk_12.txt chunk_22.txt chunk_32.txt chunk_42.txt chunk_52.txt chunk_62.txt chunk_72.txt chunk_82.txt chunk_9.txt
chunk_13.txt chunk_23.txt chunk_33.txt chunk_43.txt chunk_53.txt chunk_63.txt chunk_73.txt chunk_83.txt
chunk_14.txt chunk_24.txt chunk_34.txt chunk_44.txt chunk_54.txt chunk_64.txt chunk_74.txt chunk_84.txt
chunk_15.txt chunk_25.txt chunk_35.txt chunk_45.txt chunk_55.txt chunk_65.txt chunk_75.txt chunk_85.txt
chunk_16.txt chunk_26.txt chunk_36.txt chunk_46.txt chunk_56.txt chunk_66.txt chunk_76.txt chunk_86.txt
chunk_17.txt chunk_27.txt chunk_37.txt chunk_47.txt chunk_57.txt chunk_67.txt chunk_77.txt chunk_87.txt
chunk_18.txt chunk_28.txt chunk_38.txt chunk_48.txt chunk_58.txt chunk_68.txt chunk_78.txt chunk_88.txt
chunk_19.txt chunk_29.txt chunk_39.txt chunk_49.txt chunk_59.txt chunk_69.txt chunk_79.txt chunk_89.txt
chunk_1.txt chunk_2.txt chunk_3.txt chunk_4.txt chunk_5.txt chunk_6.txt chunk_7.txt chunk_8.txt
Original response
I would just treat the string like a file. This lets you avoid any slicing headaches and is pretty straightforward because the file API lets you "read" in chunks easily.
In [1]: import io
In [2]: long_string = 'a'*100 + 'b'*100 + 'c'*100 + 'e'*88
In [3]: print(long_string)
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
In [4]: string_io = io.StringIO(long_string)
In [5]: chunk = string_io.read(100)
In [6]: chunk_no = 1
In [7]: while chunk:
....: print(chunk)
....: with open('chunk_' + str(chunk_no) + '.txt','w') as f:
....: f.write(chunk)
....: chunk = string_io.read(100)
....: chunk_no += 1
....:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
Note, I'm using ipython terminal, so you can use terminal commands inside the interpreter session!
In [8]: ls chunk*
chunk_1.txt chunk_2.txt chunk_3.txt chunk_4.txt
In [9]: cat chunk_1.txt
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
In [10]: cat chunk_2.txt
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
In [11]: cat chunk_3.txt
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
In [12]: cat chunk_4.txt
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
In [13]:
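One caveat if you are running this on Python 2 (which the raw_input and print statements in the question suggest): io.StringIO only accepts unicode text, so the string needs a u prefix or an explicit decode. A minimal sketch:
import io

long_string = u'a' * 100 + u'b' * 88   # unicode literal, required by io.StringIO on Python 2
string_io = io.StringIO(long_string)
print(string_io.read(100))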

Python 2 re.sub issue

I have a function that replaces substring matches with the match surrounded by HTML tags. This function will mostly consume strings in English and Greek.
The function:
def highlight_text(st, kwlist, start_tag=None, end_tag=None):
    if start_tag is None:
        start_tag = '<span class="nom">'
    if end_tag is None:
        end_tag = '</span>'
    for kw in kwlist:
        st = re.sub(r'\b' + kw + r'\b', '{}{}{}'.format(start_tag, kw, end_tag), st)
    return st
The testing string is in Greek except the first sub-string [Korais]: st="Korais Ο Αδαμάντιος Κοραής (Σμύρνη, 27 Απριλίου 1748 – Παρίσι, 6 Απριλίου 1833), ήταν Έλληνας φιλόλογος με βαθιά γνώση του ελληνικού πολιτισμού. Ο Κοραής είναι ένας από τους σημαντικότερους εκπροσώπους του νεοελληνικού διαφωτισμού και μνημονεύεται, ανάμεσα σε άλλα, ως πρωτοπόρος στην έκδοση έργων αρχαίας ελληνικής γραμματείας, αλλά και για τις γλωσσικές του απόψεις στην υποστήριξη της καθαρεύουσας, σε μια μετριοπαθή όμως μορφή της με σκοπό την εκκαθάριση των πλείστων ξένων λέξεων που υπήρχαν στη γλώσσα του λαού."
The test code:
kwlist = ['ελληνικού', 'Σμύρνη', 'Αδαμάντιος', 'Korais']
d = highlight_text(st, kwlist, start_tag=None, end_tag=None)
print(d)
When I run the code [st is the above string], only substrings in English get tagged; Greek substrings are ignored. Note that I run the above block on Python 2.7. When I use Python 3.4, all substrings get replaced.
Another issue is that when I run the above function within a Flask application, it throws an error: unexpected end of regular expression.
How should I tackle this issue, without using an external library if possible?
I've been pulling my hair out for two days now.
In Python 2.7, you need to explicitly convert text to Unicode. See the fixed snippet below:
# -*- coding: utf-8 -*-
import re

def highlight_text(st, kwlist, start_tag=None, end_tag=None):
    if start_tag is None:
        start_tag = '<span class="nom">'
    if end_tag is None:
        end_tag = '</span>'
    for kw in kwlist:
        st = re.sub(ur'\b' + kw.decode('utf8') + ur'\b',
                    u'{}{}{}'.format(start_tag.decode('utf8'), kw.decode('utf8'), end_tag.decode('utf8')),
                    st.decode('utf8'), 0, re.U).encode("utf8")
    return st
st="Korais Ο Αδαμάντιος Κοραής (Σμύρνη, 27 Απριλίου 1748 – Παρίσι, 6 Απριλίου 1833), ήταν Έλληνας φιλόλογος με βαθιά γνώση του ελληνικού πολιτισμού. Ο Κοραής είναι ένας από τους σημαντικότερους εκπροσώπους του νεοελληνικού διαφωτισμού και μνημονεύεται, ανάμεσα σε άλλα, ως πρωτοπόρος στην έκδοση έργων αρχαίας ελληνικής γραμματείας, αλλά και για τις γλωσσικές του απόψεις στην υποστήριξη της καθαρεύουσας, σε μια μετριοπαθή όμως μορφή της με σκοπό την εκκαθάριση των πλείστων ξένων λέξεων που υπήρχαν στη γλώσσα του λαού."
kwlist = ['ελληνικού', 'Σμύρνη', 'Αδαμάντιος', 'Korais']
d = highlight_text(st, kwlist, start_tag=None, end_tag=None)
print(d)
See demo
Note that all literals are declared with the u prefix, all variables are decoded, and the re.sub result is encoded back to UTF-8.
"Only substrings in English get tagged; Greek substrings are ignored."
Where does your st come from? Note that in Python 2.x, 'μορφή' != u'μορφή'. Maybe you are comparing a str with a unicode.
Suggestions: Use unicode everywhere when you can, e.g.:
kwlist = [u'ελληνικού', u'Σμύρνη', u'Αδαμάντιος', u'Korais']
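For completeness, a minimal sketch of the same function written with unicode throughout (assuming st is already a unicode object; re.escape is added in case a keyword contains regex metacharacters, and re.UNICODE makes \b treat Greek letters as word characters):
# -*- coding: utf-8 -*-
import re

def highlight_text(st, kwlist, start_tag=u'<span class="nom">', end_tag=u'</span>'):
    # st and every keyword are assumed to already be unicode objects
    for kw in kwlist:
        st = re.sub(ur'\b' + re.escape(kw) + ur'\b',
                    u'{}{}{}'.format(start_tag, kw, end_tag),
                    st, flags=re.UNICODE)
    return st

kwlist = [u'ελληνικού', u'Σμύρνη', u'Αδαμάντιος', u'Korais']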

Python - how to parse this with regex correctly? It parses all the E.164 formats but not the local format

It's working for the 0032, 32, and +32 prefixes, but not for 0487365060 (which is a valid number):
to_user = "0032487365060"
# ^(?:\+|00)(\d+)$ Parse the 0032, 32, +32 & 0487365060
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
to_user = "32487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
to_user = "+32487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
Not working:
to_user = "0487365060"
match = re.search(r'^(?:\+|00)(\d+)$', to_user)
Your last example doesn't work because it does not start with either + or 00. If that is optional you need to mark it as such:
r'^(?:\+|00)?(\d+)$'
Note that your second example doesn't match either; it doesn't start with + or 00.
Demo:
>>> import re
>>> samples = ('0032487365060', '32487365060', '+32487365060', '0487365060')
>>> pattern = re.compile(r'^(?:\+|00)?(\d+)$')
>>> for sample in samples:
...     match = pattern.search(sample)
...     if match is not None:
...         print 'matched:', match.group(1)
...     else:
...         print 'Sample {} did not match'.format(sample)
...
matched: 32487365060
matched: 32487365060
matched: 32487365060
matched: 0487365060
Taking into account the question AND the comment, and in the absence of more information (particularly about the length of the digit sequence that must follow the 32 part, and whether it is always 32 or may be another prefix), what I finally understand you want can be obtained with:
import re

for to_user in ("0032487365060",
                "32487365060",
                "+32487365060",
                "0487365060"):
    m = re.sub('^(?:\+32|0032|32|0)(\d{9})$', '32\\1', to_user)
    print m
Something like this, building on @eyquem's method, to normalize all the international codes (with + or 00) to a form without + or 00; only for Belgium should a bare local number default to 32 plus the number:
import re

for to_user in (# Belgium
                "0032487365060",
                "32487365060",
                "+32487365060",
                "0487365060",
                # USA
                "0012127773456",
                "12127773456",
                "+12127773456",
                # UK
                "004412345678",
                "4412345678",
                "+4412345678"):
    m = re.sub('^(?:\+|00|32|0)(\d{9})$', '32\\1', to_user)
    m = m.replace("+", "")
    m = re.sub('^(?:\+|00)(\d+)$', '\\1', m)
    print m
Output:
32487365060
32487365060
32487365060
32487365060
12127773456
12127773456
12127773456
4412345678
4412345678
4412345678
Why not use the phonenumbers lib?
>>> phonenumbers.parse("0487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
and the other three are OK too:
>>> phonenumbers.parse("0032487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
>>> phonenumbers.parse("+320487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
>>> phonenumbers.parse("320487365060", "BE")
PhoneNumber(country_code=32, national_number=487365060, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)
