Invert regex pattern in Python

I'm trying to filter only the Arabic characters out of a string, but the following function doesn't work for me:
import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile('^[\u0627-\u064a]')
    text = re.sub(non_arabic_char, "", text)
    print(text)
For example:
s = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
The desired output of remove_any_non_arabic_char(s) should be قال جالينوس قد اتفق جل من فسر هذا الكتا, but the input text comes back unchanged.
What should I do?

First, you need to fix your regex as suggested in the comments. Then, for a more complete solution, expand your Unicode character selection to cover the full Arabic block. Finally, keep at least one space between Arabic words so the Arabic text stays legible:
import re

def remove_any_non_arabic_char(text):
    non_arabic_char = re.compile(r'[^\s\u0600-\u06FF]')
    text_with_no_spaces = re.sub(non_arabic_char, "", text)
    text_with_single_spaces = " ".join(re.split(r"\s+", text_with_no_spaces))
    return text_with_single_spaces
text_1 = "Kühn xvii, 346] قال جالينوس: [1] قد اتفق جل من فسر هذا الكتا"
text_2 = '''
تغيّر مفهوم كلمة (أدب) من العصر الجاهلي jahili (pre-Islamic) era إلى الآن عبر
مراحل periods التاريخ المتعددة. ففي الجاهلية، كانت كلمة أدب تعني (الدعوة إلى
الطعام). وبعدها، استخدم الرسول محمد (عليه السلام) الكلمة بمعنى "التهذيب والتربية"
education and mannerism. وفي العصر الأموي، اتصلت had to do كلمة أدب
بالتاريخ والفقه والقرآن والحديث. أما في العصرالعباسي، فأصبحت تعني تعلّم الشعر
والنثر prose واتسع الأدب ليشمل أنواع المعرفة وألوانها وخصوصاً علم البلاغة واللغة.
أما في الوقت الحالي، فأصبحت كلمة أدب ذات صلة pertinent بالكلام البليغ
الجميل المؤثر that impacts في أحاسيس القاريء أو السامع.
'''
# Isleem, N. M., & Abuhakema, G. M. (2020). Kalima wa Nagham: A Textbook for
# Teaching Arabic, Volume 2 (Vol. 3). University of Texas Press. (page 5)
print('text_1: \n', remove_any_non_arabic_char(text_1))
print('\ntext_2: \n\n', remove_any_non_arabic_char(text_2))
Running the code on the two texts above in Jupyter leaves only the Arabic text, with the words joined by single spaces.
Notice that punctuation marks shared between Arabic and English (like periods and brackets) have also been removed. To keep those, you would need to introduce more complex conditionals; a naive extension is sketched below.
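The simplest (and imperfect) extension is to add the punctuation you want to the set of kept characters. A minimal sketch, assuming you only care about periods, commas, colons and brackets; note that it also preserves punctuation left over from the removed Latin text, which is exactly why more careful conditionals are needed:
import re

def remove_non_arabic_keep_punct(text):
    # hypothetical variant: remove everything that is not whitespace, Arabic, or one of a few shared punctuation marks
    removable = re.compile(r'[^\s\u0600-\u06FF.,:()\[\]]')
    stripped = re.sub(removable, "", text)
    return " ".join(re.split(r"\s+", stripped)).strip()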

Related

How to split a str that contains Farsi and English words using the re module in Python

I have some_str = 'دریافت اطلاعات در مورد HDD {hdd}'. I need a regex to split this into Farsi and non-Farsi words, to get a result like this:
['دریافت اطلاعات در مورد', 'HDD {hdd}']
import re
some_str = 'دریافت اطلاعات در مورد HDD {hdd}'
regex = '???'
re.split(regex, some_str)
For another str like "اضافه کردن اعلام کننده {notifier} روی سرور {host} بوسیله کاربر {role}/{user} از آدرس های IP {ip_address}" I expect the following result:
['اضافه کردن اعلام کننده', '{notifier}', 'روی سرور', '{host}', 'بوسیله کاربر', '{role}/{user}', 'از آدرس های', 'IP {ip_address}']
You may use this re.split:
import re
# regex for arabic text
reg = re.compile(r'([\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*)\s*')
# or for matching Persian characters only use:
# [\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]
some_str = 'دریافت اطلاعات در مورد HDD {hdd}'
lst1 = list(filter(None, reg.split(some_str)))
print (lst1)
## ['دریافت اطلاعات در مورد', 'HDD {hdd}']
s = "اضافه کردن اعلام کننده {notifier} روی سرور {host} بوسیله کاربر {role}/{user} از آدرس های IP {ip_address}"
lst2 = list(filter(None, reg.split(s)))
print(lst2)
## ['اضافه کردن اعلام کننده', '{notifier} ', 'روی سرور', '{host} ', 'بوسیله کاربر', '{role}/{user} ', 'از آدرس های', 'IP {ip_address}']
[\u0600-\u06FF] is used to match Persian characters (they fall within the Arabic Unicode block).
RegEx Details:
([\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*): Match a run of space-separated Persian words in capture group #1, which re.split keeps in the result
\s*: Match and discard any whitespace that follows the Persian run
The non-Persian text between two Persian runs is returned by re.split as-is.
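Note that in the second output above the non-Persian chunks keep a trailing space (e.g. '{notifier} '), because only the whitespace after a Persian run is consumed by the pattern. If you need exactly the output shown in the question, a small post-processing step strips it:
lst2 = [chunk.strip() for chunk in filter(None, reg.split(s))]
print(lst2)
## ['اضافه کردن اعلام کننده', '{notifier}', 'روی سرور', '{host}', 'بوسیله کاربر', '{role}/{user}', 'از آدرس های', 'IP {ip_address}']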

How to get text with both spaces and characters using a regular expression?

I am trying to make a programme that goes to a Wikipedia list and lists all countries by population. I used a regular expression to get only the country names; however, country names that include spaces (DR Congo, South Korea, United Kingdom, etc.) are omitted.
import requests
import re
from bs4 import BeautifulSoup
pop = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
soup = BeautifulSoup(pop.text, "html.parser")
f = soup.find_all("td", attrs={"align":"left"})
f = str(f)
x = re.findall(r'data-sort-value="\S+"', f)
print(x)
I know that "\S" only matches non-space characters, but what should I replace "\S" with to get both the spaces and the characters?
This is how the output looks:
['data-sort-value="China"', 'data-sort-value="India"', 'data-sort-value="Indonesia"', 'data-sort-value="Pakistan"', 'data-sort-value="Brazil"', 'data-sort-value="Nigeria"', 'data-sort-value="Bangladesh"', 'data-sort-value="Russia"', 'data-sort-value="Mexico"', 'data-sort-value="Japan"', 'data-sort-value="Ethiopia"', 'data-sort-value="Philippines"', 'data-sort-value="Egypt"', 'data-sort-value="Vietnam"', 'data-sort-value="Germany"', 'data-sort-value="Turkey"', 'data-sort-value="Iran"', 'data-sort-value="Thailand"', 'data-sort-value="France"', 'data-sort-value="Italy"', 'data-sort-value="Tanzania"', 'data-sort-value="Myanmar"', 'data-sort-value="Kenya"', 'data-sort-value="Colombia"', 'data-sort-value="Spain"', 'data-sort-value="Argentina"', 'data-sort-value="Uganda"', 'data-sort-value="Ukraine"', 'data-sort-value="Algeria"', 'data-sort-value="Sudan"', 'data-sort-value="Iraq"', 'data-sort-value="Afghanistan"', 'data-sort-value="Poland"', 'data-sort-value="Canada"', 'data-sort-value="Morocco"', 'data-sort-value="Uzbekistan"', 'data-sort-value="Peru"', 'data-sort-value="Malaysia"', 'data-sort-value="Angola"', 'data-sort-value="Mozambique"', 'data-sort-value="Yemen"', 'data-sort-value="Ghana"', 'data-sort-value="Nepal"', 'data-sort-value="Venezuela"', 'data-sort-value="Madagascar"', 'data-sort-value="Cameroon"', 'data-sort-value="Australia"', 'data-sort-value="Taiwan"', 'data-sort-value="Niger"', 'data-sort-value="Mali"', 'data-sort-value="Romania"', 'data-sort-value="Malawi"', 'data-sort-value="Chile"', 'data-sort-value="Kazakhstan"', 'data-sort-value="Zambia"', 'data-sort-value="Guatemala"', 'data-sort-value="Ecuador"', 'data-sort-value="Netherlands"', 'data-sort-value="Syria"', 'data-sort-value="Cambodia"', 'data-sort-value="Senegal"', 'data-sort-value="Chad"', 'data-sort-value="Somalia"', 'data-sort-value="Zimbabwe"', 'data-sort-value="Guinea"', 'data-sort-value="Rwanda"', 'data-sort-value="Benin"', 'data-sort-value="Tunisia"', 'data-sort-value="Belgium"', 'data-sort-value="Bolivia"', 'data-sort-value="Cuba"', 'data-sort-value="Haiti"', 'data-sort-value="Burundi"', 'data-sort-value="Greece"', 'data-sort-value="Portugal"', 'data-sort-value="Jordan"', 'data-sort-value="Azerbaijan"', 'data-sort-value="Sweden"', 'data-sort-value="Honduras"', 'data-sort-value="Hungary"', 'data-sort-value="Belarus"', 'data-sort-value="Tajikistan"', 'data-sort-value="Austria"', 'data-sort-value="Serbia"', 'data-sort-value="Switzerland"', 'data-sort-value="Israel"', 'data-sort-value="Togo"', 'data-sort-value="Laos"', 'data-sort-value="Paraguay"', 'data-sort-value="Bulgaria"', 'data-sort-value="Lebanon"', 'data-sort-value="Libya"', 'data-sort-value="Nicaragua"', 'data-sort-value="Kyrgyzstan"', 'data-sort-value="Turkmenistan"', 'data-sort-value="Singapore"', 'data-sort-value="Denmark"', 'data-sort-value="Finland"', 'data-sort-value="Slovakia"', 'data-sort-value="Congo"', 'data-sort-value="Norway"', 'data-sort-value="Palestine"', 'data-sort-value="Oman"', 'data-sort-value="Liberia"', 'data-sort-value="Ireland"', 'data-sort-value="Mauritania"', 'data-sort-value="Panama"', 'data-sort-value="Kuwait"', 'data-sort-value="Croatia"', 'data-sort-value="Moldova"', 'data-sort-value="Georgia"', 'data-sort-value="Eritrea"', 'data-sort-value="Uruguay"', 'data-sort-value="Mongolia"', 'data-sort-value="Armenia"', 'data-sort-value="Jamaica"', 'data-sort-value="Albania"', 'data-sort-value="Qatar"', 'data-sort-value="Lithuania"', 'data-sort-value="Namibia"', 'data-sort-value="Gambia"', 'data-sort-value="Botswana"', 
'data-sort-value="Gabon"', 'data-sort-value="Lesotho"', 'data-sort-value="Slovenia"', 'data-sort-value="Guinea-Bissau"', 'data-sort-value="Latvia"', 'data-sort-value="Bahrain"', 'data-sort-value="Estonia"', 'data-sort-value="Mauritius"', 'data-sort-value="Cyprus"', 'data-sort-value="Eswatini"', 'data-sort-value="Djibouti"', 'data-sort-value="Fiji"', 'data-sort-value="Réunion"', 'data-sort-value="Comoros"', 'data-sort-value="Guyana"', 'data-sort-value="Bhutan"', 'data-sort-value="Macau"', 'data-sort-value="Montenegro"', 'data-sort-value="Luxembourg"', 'data-sort-value="Suriname"', 'data-sort-value="Maldives"', 'data-sort-value="Guadeloupe"', 'data-sort-value="Malta"', 'data-sort-value="Brunei"', 'data-sort-value="Belize"', 'data-sort-value="Bahamas"', 'data-sort-value="Martinique"', 'data-sort-value="Iceland"', 'data-sort-value="Vanuatu"', 'data-sort-value="Barbados"', 'data-sort-value="Mayotte"', 'data-sort-value="Guam"', 'data-sort-value="Curaçao"', 'data-sort-value="Kiribati"', 'data-sort-value="Grenada"', 'data-sort-value="Tonga"', 'data-sort-value="Aruba"', 'data-sort-value="Seychelles"', 'data-sort-value="Andorra"', 'data-sort-value="Dominica"', 'data-sort-value="Bermuda"', 'data-sort-value="Greenland"', 'data-sort-value="Monaco"', 'data-sort-value="Liechtenstein"', 'data-sort-value="Gibraltar"', 'data-sort-value="Palau"', 'data-sort-value="Anguilla"', 'data-sort-value="Tuvalu"', 'data-sort-value="Nauru"', 'data-sort-value="Montserrat"', 'data-sort-value="Niue"', 'data-sort-value="Tokelau"']
\D+ should work, since non-digits include spaces.
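A minimal sketch of that suggestion, using the question's variable f and a non-greedy quantifier so each match stops at the closing quote rather than running on to the next digit in the page:
x = re.findall(r'data-sort-value="(\D+?)"', f)
# captures the full names, spaces included, e.g. 'DR Congo'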
You may want to have a general solution to check for all names with multiple words, not just 2 words:
pattern_str = r'data-sort-value="(\S+(\s+\S+)*)"'
x = re.findall(pattern_str, f)
This pattern matches names made up of one or more words, so country names containing spaces are no longer omitted.
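One caveat: because the pattern contains a nested capturing group, re.findall returns a list of tuples rather than plain strings. If you only want the names, you can make the inner group non-capturing (a small adjustment, not part of the suggestion above):
pattern_str = r'data-sort-value="(\S+(?:\s+\S+)*)"'
x = re.findall(pattern_str, f)
# x is now a flat list of names, multi-word ones included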
Here is a solution without regex. It's better to avoid regex when scraping HTML and instead use the methods provided by BeautifulSoup to extract the contents.
import requests
from bs4 import BeautifulSoup
resp = requests.get(
    'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
)
soup = BeautifulSoup(resp.text, "html.parser")
print(
    [x['data-sort-value']
     for x in soup.findAll("span", {"class": "datasortkey"})]
)
['Congo', 'Norway', 'Costa Rica', 'Palestine', 'Oman', 'Cocos (Keeling) Islands', ...]

Tokenize tweet based on Regex

I have the following example text / tweet:
RT @trader $AAPL 2012 is o´o´o´o´o´pen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO
I want to follow the procedure of Table 1 in Li, T, van Dalen, J, & van Rees, P.J. (Pieter Jan). (2017). More than just noise? Examining the information content of stock microblogs on financial markets. Journal of Information Technology. doi:10.1057/s41265-016-0034-2 in order to clean up the tweet.
They clean the tweet up in such a way that the final result is:
{RT|123456} {USER|56789} {TICKER|AAPL} {NUMBER|2012} notooopen nottalk patent {COMPANY|GOOG} notdefinetli treatment {HASH|samsung} {EMOTICON|POS} haha {URL}
I use the following script to tokenize the tweet based on the regex:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
emoticon_string = r"""
(?:
[<>]?
[:;=8] # eyes
[\-o\*\']? # optional nose
[\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
|
[\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
[\-o\*\']? # optional nose
[:;=8] # eyes
[<>]?
)"""
regex_strings = (
# URL:
r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"""
,
# Twitter username:
r"""(?:#[\w_]+)"""
,
# Hashtags:
r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Cashtags:
r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
,
# Remaining word types:
r"""
(?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals.
|
(?:[\w_]+) # Words without apostrophes or dashes.
|
(?:\.(?:\s*\.){1,}) # Ellipsis dots.
|
(?:\S) # Everything else that isn't whitespace.
"""
)
word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)
######################################################################
class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words
if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
This yields the following output:
rt
@trader
$aapl
2012
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
#samsung
got
:-)
heh
url_that_cannot_be_posted_on_SO
How can I adjust this script to get:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|url_that_cannot_be_posted_on_SO}
Thanks in advance for helping me out big time!
You really need to use named capturing groups (mentioned by thebjorn), and use groupdict() to get name-value pairs upon each match. It requires some post-processing though:
All pairs where the value is None must be discarded
If the self.preserve_case is false the value can be turned to lower case at once
If the group name is WORD, ELLIPSIS or ELSE the values are added to words as is
If the group name is HASHTAG, CASHTAG, USER or URL the values are first stripped of the leading $, # and @ chars and then added to words as a {<GROUP_NAME>|<VALUE>} item
All other matches are added to words as {<GROUP_NAME>|<VALUE>} item.
Note that \w matches underscores by default, so [\w_] = \w. I optimized the patterns a little bit.
Here is a fixed code snippet:
import re
emoticon_string = r"""
(?P<EMOTICON>
[<>]?
[:;=8] # eyes
[-o*']? # optional nose
[][()dDpP/:{}#|\\] # mouth
|
[][()dDpP/:}{#|\\] # mouth
[-o*']? # optional nose
[:;=8] # eyes
[<>]?
)"""
regex_strings = (
# URL:
r"""(?P<URL>https?://(?:[-a-zA-Z0-9_$#.&+!*(),]|%[0-9a-fA-F][0-9a-fA-F])+)"""
,
# Twitter username:
r"""(?P<USER>#\w+)"""
,
# Hashtags:
r"""(?P<HASHTAG>\#+\w+[\w'-]*\w+)"""
,
# Cashtags:
r"""(?P<CASHTAG>\$+\w+[\w'-]*\w+)"""
,
# Remaining word types:
r"""
(?P<NUMBER>[+-]?\d+(?:[,/.:-]\d+[+-]?)?) # Numbers, including fractions, decimals.
|
(?P<WORD>\w+) # Words without apostrophes or dashes.
|
(?P<ELLIPSIS>\.(?:\s*\.)+) # Ellipsis dots.
|
(?P<ELSE>\S) # Everything else that isn't whitespace.
"""
)
word_re = re.compile(r"""({}|{})""".format(emoticon_string, "|".join(regex_strings)), re.VERBOSE | re.I | re.UNICODE)
#print(word_re.pattern)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)
######################################################################
class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = []
        for x in word_re.finditer(s):
            for key, val in x.groupdict().items():
                if val:
                    if not self.preserve_case:
                        val = val.lower()
                    if key in ['WORD', 'ELLIPSIS', 'ELSE']:
                        words.append(val)
                    elif key in ['HASHTAG', 'CASHTAG', 'USER', 'URL']:  # Add more here if needed
                        words.append("{{{}|{}}}".format(key, re.sub(r'^[@#$]+', '', val)))
                    else:
                        words.append("{{{}|{}}}".format(key, val))
        return words
if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
With test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com', it outputs:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|http://some.site.here.com}
See the regex demo online.

How to train a sense2vec model

The documentation of sense2vec mentions 3 primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, bzipped file), since merge_text.py tries to open files compressed by bzip2.
The file can be found at:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require?
Further, could anyone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Double line breaks are interpreted as separate documents.
URLs are recognized as such, stripped down to domain.tld and marked as |URL.
Nouns (including nouns that are part of noun phrases) are lemmatized (here, motives becomes motif).
Words with POS tags like DET (determiner) and PUNCT (punctuation) are dropped.
Here's the code. Let me know if you have questions.
I'll probably publish it on github.com/woltob soon.
import spacy
import re
nlp = spacy.load('en')
nlp.matcher = None
LABELS = {
'ENT': 'ENT',
'PERSON': 'PERSON',
'NORP': 'ENT',
'FAC': 'ENT',
'ORG': 'ENT',
'GPE': 'ENT',
'LOC': 'ENT',
'LAW': 'ENT',
'PRODUCT': 'ENT',
'EVENT': 'ENT',
'WORK_OF_ART': 'ENT',
'LANGUAGE': 'ENT',
'DATE': 'DATE',
'TIME': 'TIME',
'PERCENT': 'PERCENT',
'MONEY': 'MONEY',
'QUANTITY': 'QUANTITY',
'ORDINAL': 'ORDINAL',
'CARDINAL': 'CARDINAL'
}
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')
def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text
def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag
corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''
corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original word with the length of the lemma, then add the whitespace, if it was there:
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
        # print(word.text, word.text[:len(word.lemma_)] + word.text_with_ws[len(word.text):])
    # All other words are added normally.
    else:
        corpus_.append(word.text_with_ws)
result = transform_doc(nlp(''.join(corpus_)))
sense2vec_filename = 'text.txt'
file = open(sense2vec_filename,'w')
file.write(result)
file.close()
print(result)
You could visualise your model using Gensim in Tensorboard using this approach:
https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step, just comment it out in the code).
Happy coding,
woltob
The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
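Alternatively, if you prefer to leave merge_text.py untouched and feed it what it expects, you could wrap each line of a plain-text corpus as a JSON object with a 'body' field and bzip the result. A rough sketch (file names are placeholders):
import bz2
import json

with open('corpus.txt', encoding='utf-8') as src, \
        bz2.open('corpus.json.bz2', 'wt', encoding='utf-8') as dst:
    for line in src:
        line = line.strip()
        if line:
            # one JSON object per line, matching the ujson.loads(line)['body'] access above
            dst.write(json.dumps({'body': line}) + '\n')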

Sphinx and Python, multilanguage search

I have a table in UTF-8 called tr_transcriptions. It stores dozens of records with a 'text' field in different languages. There are 34 languages supported at the moment:
Afrikaans, Korean, Arabic, Malayalam, Bahasa, Mandarin, Bahasama, Norwegian, Croatian,
Polish, Czech, Portuguese, Danish, Romanian, Dutch, Russian, English, Slovak, Flemish,
Spanish, French, Swedish, German, Tagalog, Greek, Tamil, Hindi, Telugu, Hungarian, Thai, Italian, Turkish, Kannada, Vietnamese
I want to give users the ability to search through this table, but I have a few issues with that: I can't get it working with Sphinx. Here's my config file:
source transcriptions
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = pass
sql_db = transcriptions
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET NAMES utf8
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query = \
SELECT * \
FROM tr_transcriptions
}
index tr_transcriptions
{
source = transcriptions
charset_type = utf-8
charset_table = U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00E0->a, U+00E1->a,\
U+00E2->a, U+00E3->a, U+00E4->a, U+00E5->a, U+0100->a, U+0101->a, U+0102->a, U+0103->a,\
U+010300->a, U+0104->a, U+0105->a, U+01CD->a, U+01CE->a, U+01DE->a, U+01DF->a, U+01E0->a,\
U+01E1->a, U+01FA->a, U+01FB->a, U+0200->a, U+0201->a, U+0202->a, U+0203->a, U+0226->a,\
U+0227->a, U+023A->a, U+0250->a, U+04D0->a, U+04D1->a, U+1D2C->a, U+1D43->a, U+1D44->a,\
U+1D8F->a, U+1E00->a, U+1E01->a, U+1E9A->a, U+1EA0->a, U+1EA1->a, U+1EA2->a, U+1EA3->a,\
U+1EA4->a, U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a, U+1EA9->a, U+1EAA->a, U+1EAB->a,\
U+1EAC->a, U+1EAD->a, U+1EAE->a, U+1EAF->a, U+1EB0->a, U+1EB1->a, U+1EB2->a, U+1EB3->a,\
U+1EB4->a, U+1EB5->a, U+1EB6->a, U+1EB7->a, U+2090->a, U+2C65->a, U+0180->b, U+0181->b,\
U+0182->b, U+0183->b, U+0243->b, U+0253->b, U+0299->b, U+16D2->b, U+1D03->b, U+1D2E->b,\
U+1D2F->b, U+1D47->b, U+1D6C->b, U+1D80->b, U+1E02->b, U+1E03->b, U+1E04->b, U+1E05->b,\
U+1E06->b, U+1E07->b, U+00C7->c, U+00E7->c, U+0106->c, U+0107->c, U+0108->c, U+0109->c,\
U+010A->c, U+010B->c, U+010C->c, U+010D->c, U+0187->c, U+0188->c, U+023B->c, U+023C->c,\
U+0255->c, U+0297->c, U+1D9C->c, U+1D9D->c, U+1E08->c, U+1E09->c, U+212D->c, U+2184->c,\
U+010E->d, U+010F->d, U+0110->d, U+0111->d, U+0189->d, U+018A->d, U+018B->d, U+018C->d,\
U+01C5->d, U+01F2->d, U+0221->d, U+0256->d, U+0257->d, U+1D05->d, U+1D30->d, U+1D48->d,\
U+1D6D->d, U+1D81->d, U+1D91->d, U+1E0A->d, U+1E0B->d, U+1E0C->d, U+1E0D->d, U+1E0E->d,\
U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d, U+1E13->d, U+00C8->e, U+00C9->e, U+00CA->e,\
U+00CB->e, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e, U+0112->e, U+0113->e, U+0114->e,\
U+0115->e, U+0116->e, U+0117->e, U+0118->e, U+0119->e, U+011A->e, U+011B->e, U+018E->e,\
U+0190->e, U+01DD->e, U+0204->e, U+0205->e, U+0206->e, U+0207->e, U+0228->e, U+0229->e,\
U+0246->e, U+0247->e, U+0258->e, U+025B->e, U+025C->e, U+025D->e, U+025E->e, U+029A->e,\
U+1D07->e, U+1D08->e, U+1D31->e, U+1D32->e, U+1D49->e, U+1D4B->e, U+1D4C->e, U+1D92->e,\
U+1D93->e, U+1D94->e, U+1D9F->e, U+1E14->e, U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e,\
U+1E19->e, U+1E1A->e, U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1EB8->e, U+1EB9->e, U+1EBA->e,\
U+1EBB->e, U+1EBC->e, U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e, U+1EC1->e, U+1EC2->e,\
U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e, U+1EC7->e, U+2091->e, U+0191->f, U+0192->f,\
U+1D6E->f, U+1D82->f, U+1DA0->f, U+1E1E->f, U+1E1F->f, U+011C->g, U+011D->g, U+011E->g,\
U+011F->g, U+0120->g, U+0121->g, U+0122->g, U+0123->g, U+0193->g, U+01E4->g, U+01E5->g,\
U+01E6->g, U+01E7->g, U+01F4->g, U+01F5->g, U+0260->g, U+0261->g, U+0262->g, U+029B->g,\
U+1D33->g, U+1D4D->g, U+1D77->g, U+1D79->g, U+1D83->g, U+1DA2->g, U+1E20->g, U+1E21->g,\
U+0124->h, U+0125->h, U+0126->h, U+0127->h, U+021E->h, U+021F->h, U+0265->h, U+0266->h,\
U+029C->h, U+02AE->h, U+02AF->h, U+02B0->h, U+02B1->h, U+1D34->h, U+1DA3->h, U+1E22->h,\
U+1E23->h, U+1E24->h, U+1E25->h, U+1E26->h, U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h,\
U+1E2B->h, U+1E96->h, U+210C->h, U+2C67->h, U+2C68->h, U+2C75->h, U+2C76->h, U+00CC->i,\
U+00CD->i, U+00CE->i, U+00CF->i, U+00EC->i, U+00ED->i, U+00EE->i, U+00EF->i, U+010309->i,\
U+0128->i, U+0129->i, U+012A->i, U+012B->i, U+012C->i, U+012D->i, U+012E->i, U+012F->i,\
U+0130->i, U+0131->i, U+0197->i, U+01CF->i, U+01D0->i, U+0208->i, U+0209->i, U+020A->i,\
U+020B->i, U+0268->i, U+026A->i, U+040D->i, U+0418->i, U+0419->i, U+0438->i, U+0439->i,\
U+0456->i, U+1D09->i, U+1D35->i, U+1D4E->i, U+1D62->i, U+1D7B->i, U+1D96->i, U+1DA4->i,\
U+1DA6->i, U+1DA7->i, U+1E2C->i, U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1EC8->i, U+1EC9->i,\
U+1ECA->i, U+1ECB->i, U+2071->i, U+2111->i, U+0134->j, U+0135->j, U+01C8->j, U+01CB->j,\
U+01F0->j, U+0237->j, U+0248->j, U+0249->j, U+025F->j, U+0284->j, U+029D->j, U+02B2->j,\
U+1D0A->j, U+1D36->j, U+1DA1->j, U+1DA8->j, U+0136->k, U+0137->k, U+0198->k, U+0199->k,\
U+01E8->k, U+01E9->k, U+029E->k, U+1D0B->k, U+1D37->k, U+1D4F->k, U+1D84->k, U+1E30->k,\
U+1E31->k, U+1E32->k, U+1E33->k, U+1E34->k, U+1E35->k, U+2C69->k, U+2C6A->k, U+0139->l,\
U+013A->l, U+013B->l, U+013C->l, U+013D->l, U+013E->l, U+013F->l, U+0140->l, U+0141->l,\
U+0142->l, U+019A->l, U+01C8->l, U+0234->l, U+023D->l, U+026B->l, U+026C->l, U+026D->l,\
U+029F->l, U+02E1->l, U+1D0C->l, U+1D38->l, U+1D85->l, U+1DA9->l, U+1DAA->l, U+1DAB->l,\
U+1E36->l, U+1E37->l, U+1E38->l, U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l, U+1E3D->l,\
U+2C60->l, U+2C61->l, U+2C62->l, U+019C->m, U+026F->m, U+0270->m, U+0271->m, U+1D0D->m,\
U+1D1F->m, U+1D39->m, U+1D50->m, U+1D5A->m, U+1D6F->m, U+1D86->m, U+1DAC->m, U+1DAD->m,\
U+1E3E->m, U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m, U+1E43->m, U+00D1->n, U+00F1->n,\
U+0143->n, U+0144->n, U+0145->n, U+0146->n, U+0147->n, U+0148->n, U+0149->n, U+019D->n,\
U+019E->n, U+01CB->n, U+01F8->n, U+01F9->n, U+0220->n, U+0235->n, U+0272->n, U+0273->n,\
U+0274->n, U+1D0E->n, U+1D3A->n, U+1D3B->n, U+1D70->n, U+1D87->n, U+1DAE->n, U+1DAF->n,\
U+1DB0->n, U+1E44->n, U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n, U+1E49->n, U+1E4A->n,\
U+1E4B->n, U+207F->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o, U+00D6->o, U+00D8->o,\
U+00F2->o, U+00F3->o, U+00F4->o, U+00F5->o, U+00F6->o, U+00F8->o, U+01030F->o, U+014C->o,\
U+014D->o, U+014E->o, U+014F->o, U+0150->o, U+0151->o, U+0186->o, U+019F->o, U+01A0->o,\
U+01A1->o, U+01D1->o, U+01D2->o, U+01EA->o, U+01EB->o, U+01EC->o, U+01ED->o, U+01FE->o,\
U+01FF->o, U+020C->o, U+020D->o, U+020E->o, U+020F->o, U+022A->o, U+022B->o, U+022C->o,\
U+022D->o, U+022E->o, U+022F->o, U+0230->o, U+0231->o, U+0254->o, U+0275->o, U+043E->o,\
U+04E6->o, U+04E7->o, U+04E8->o, U+04E9->o, U+04EA->o, U+04EB->o, U+1D0F->o, U+1D10->o,\
U+1D11->o, U+1D12->o, U+1D13->o, U+1D16->o, U+1D17->o, U+1D3C->o, U+1D52->o, U+1D53->o,\
U+1D54->o, U+1D55->o, U+1D97->o, U+1DB1->o, U+1E4C->o, U+1E4D->o, U+1E4E->o, U+1E4F->o,\
U+1E50->o, U+1E51->o, U+1E52->o, U+1E53->o, U+1ECC->o, U+1ECD->o, U+1ECE->o, U+1ECF->o,\
U+1ED0->o, U+1ED1->o, U+1ED2->o, U+1ED3->o, U+1ED4->o, U+1ED5->o, U+1ED6->o, U+1ED7->o,\
U+1ED8->o, U+1ED9->o, U+1EDA->o, U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o, U+1EDF->o,\
U+1EE0->o, U+1EE1->o, U+1EE2->o, U+1EE3->o, U+2092->o, U+2C9E->o, U+2C9F->o, U+01A4->p,\
U+01A5->p, U+1D18->p, U+1D3E->p, U+1D56->p, U+1D71->p, U+1D7D->p, U+1D88->p, U+1E54->p,\
U+1E55->p, U+1E56->p, U+1E57->p, U+2C63->p, U+024A->q, U+024B->q, U+02A0->q, U+0154->r,\
U+0155->r, U+0156->r, U+0157->r, U+0158->r, U+0159->r, U+0210->r, U+0211->r, U+0212->r,\
U+0213->r, U+024C->r, U+024D->r, U+0279->r, U+027A->r, U+027B->r, U+027C->r, U+027D->r,\
U+027E->r, U+027F->r, U+0280->r, U+0281->r, U+02B3->r, U+02B4->r, U+02B5->r, U+02B6->r,\
U+1D19->r, U+1D1A->r, U+1D3F->r, U+1D63->r, U+1D72->r, U+1D73->r, U+1D89->r, U+1DCA->r,\
U+1E58->r, U+1E59->r, U+1E5A->r, U+1E5B->r, U+1E5C->r, U+1E5D->r, U+1E5E->r, U+1E5F->r,\
U+211C->r, U+2C64->r, U+00DF->s, U+015A->s, U+015B->s, U+015C->s, U+015D->s, U+015E->s,\
U+015F->s, U+0160->s, U+0161->s, U+017F->s, U+0218->s, U+0219->s, U+023F->s, U+0282->s,\
U+02E2->s, U+1D74->s, U+1D8A->s, U+1DB3->s, U+1E60->s, U+1E61->s, U+1E62->s, U+1E63->s,\
U+1E64->s, U+1E65->s, U+1E66->s, U+1E67->s, U+1E68->s, U+1E69->s, U+1E9B->s, U+0162->t,\
U+0163->t, U+0164->t, U+0165->t, U+0166->t, U+0167->t, U+01AB->t, U+01AC->t, U+01AD->t,\
U+01AE->t, U+021A->t, U+021B->t, U+0236->t, U+023E->t, U+0287->t, U+0288->t, U+1D1B->t,\
U+1D40->t, U+1D57->t, U+1D75->t, U+1DB5->t, U+1E6A->t, U+1E6B->t, U+1E6C->t, U+1E6D->t,\
U+1E6E->t, U+1E6F->t, U+1E70->t, U+1E71->t, U+1E97->t, U+2C66->t, U+00D9->u, U+00DA->u,\
U+00DB->u, U+00DC->u, U+00F9->u, U+00FA->u, U+00FB->u, U+00FC->u, U+010316->u, U+0168->u,\
U+0169->u, U+016A->u, U+016B->u, U+016C->u, U+016D->u, U+016E->u, U+016F->u, U+0170->u,\
U+0171->u, U+0172->u, U+0173->u, U+01AF->u, U+01B0->u, U+01D3->u, U+01D4->u, U+01D5->u,\
U+01D6->u, U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u, U+01DB->u, U+01DC->u, U+0214->u,\
U+0215->u, U+0216->u, U+0217->u, U+0244->u, U+0289->u, U+1D1C->u, U+1D1D->u, U+1D1E->u,\
U+1D41->u, U+1D58->u, U+1D59->u, U+1D64->u, U+1D7E->u, U+1D99->u, U+1DB6->u, U+1DB8->u,\
U+1E72->u, U+1E73->u, U+1E74->u, U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u, U+1E79->u,\
U+1E7A->u, U+1E7B->u, U+1EE4->u, U+1EE5->u, U+1EE6->u, U+1EE7->u, U+1EE8->u, U+1EE9->u,\
U+1EEA->u, U+1EEB->u, U+1EEC->u, U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u, U+1EF1->u,\
U+01B2->v, U+0245->v, U+028B->v, U+028C->v, U+1D20->v, U+1D5B->v, U+1D65->v, U+1D8C->v,\
U+1DB9->v, U+1DBA->v, U+1E7C->v, U+1E7D->v, U+1E7E->v, U+1E7F->v, U+2C74->v, U+0174->w,\
U+0175->w, U+028D->w, U+02B7->w, U+1D21->w, U+1D42->w, U+1E80->w, U+1E81->w, U+1E82->w,\
U+1E83->w, U+1E84->w, U+1E85->w, U+1E86->w, U+1E87->w, U+1E88->w, U+1E89->w, U+1E98->w,\
U+02E3->x, U+1D8D->x, U+1E8A->x, U+1E8B->x, U+1E8C->x, U+1E8D->x, U+2093->x, U+00DD->y,\
U+00FD->y, U+00FF->y, U+0176->y, U+0177->y, U+0178->y, U+01B3->y, U+01B4->y, U+0232->y,\
U+0233->y, U+024E->y, U+024F->y, U+028E->y, U+028F->y, U+02B8->y, U+1E8E->y, U+1E8F->y,\
U+1E99->y, U+1EF2->y, U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y, U+1EF7->y, U+1EF8->y,\
U+1EF9->y, U+0179->z, U+017A->z, U+017B->z, U+017C->z, U+017D->z, U+017E->z, U+01B5->z,\
U+01B6->z, U+0224->z, U+0225->z, U+0240->z, U+0290->z, U+0291->z, U+1D22->z, U+1D76->z,\
U+1D8E->z, U+1DBB->z, U+1DBC->z, U+1DBD->z, U+1E90->z, U+1E91->z, U+1E92->z, U+1E93->z,\
U+1E94->z, U+1E95->z, U+2128->z, U+2C6B->z, U+2C6C->z, U+00C6->U+00E6, U+01E2->U+00E6,\
U+01E3->U+00E6, U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, U+1D02->U+00E6,\
U+1D2D->U+00E6, U+1D46->U+00E6, U+00E6, U+0400->U+0435, U+0401->U+0435, U+0402->U+0452,\
U+0452, U+0403->U+0433, U+0404->U+0454, U+0454, U+0405->U+0455, U+0455, U+0406->U+0456,\
U+0407->U+0456, U+0457->U+0456, U+0456, U+0408..U+040B->U+0458..U+045B, U+0458..U+045B,\
U+040C->U+043A, U+040D->U+0438, U+040E->U+0443, U+040F->U+045F, U+045F, U+0450->U+0435,\
U+0451->U+0435, U+0453->U+0433, U+045C->U+043A, U+045D->U+0438, U+045E->U+0443,\
U+0460->U+0461, U+0461, U+0462->U+0463, U+0463, U+0464->U+0465, U+0465, U+0466->U+0467,\
U+0467, U+0468->U+0469, U+0469, U+046A->U+046B, U+046B, U+046C->U+046D, U+046D,\
U+046E->U+046F, U+046F, U+0470->U+0471, U+0471, U+0472->U+0473, U+0473, U+0474->U+0475,\
U+0476->U+0475, U+0477->U+0475, U+0475, U+0478->U+0479, U+0479, U+047A->U+047B, U+047B,\
U+047C->U+047D, U+047D, U+047E->U+047F, U+047F, U+0480->U+0481, U+0481, U+048A->U+0438,\
U+048B->U+0438, U+048C->U+044C, U+048D->U+044C, U+048E->U+0440, U+048F->U+0440,\
U+0490->U+0433, U+0491->U+0433, U+0490->U+0433, U+0491->U+0433, U+0492->U+0433,\
U+0493->U+0433, U+0494->U+0433, U+0495->U+0433, U+0496->U+0436, U+0497->U+0436,\
U+0498->U+0437, U+0499->U+0437, U+049A->U+043A, U+049B->U+043A, U+049C->U+043A,\
U+049D->U+043A, U+049E->U+043A, U+049F->U+043A, U+04A0->U+043A, U+04A1->U+043A,\
U+04A2->U+043D, U+04A3->U+043D, U+04A4->U+043D, U+04A5->U+043D, U+04A6->U+043F,\
U+04A7->U+043F, U+04A8->U+04A9, U+04A9, U+04AA->U+0441, U+04AB->U+0441, U+04AC->U+0442,\
U+04AD->U+0442, U+04AE->U+0443, U+04AF->U+0443, U+04B0->U+0443, U+04B1->U+0443,\
U+04B2->U+0445, U+04B3->U+0445, U+04B4->U+04B5, U+04B5, U+04B6->U+0447, U+04B7->U+0447,\
U+04B8->U+0447, U+04B9->U+0447, U+04BA->U+04BB, U+04BB, U+04BC->U+04BD, U+04BE->U+04BD,\
U+04BF->U+04BD, U+04BD, U+04C0->U+04CF, U+04CF, U+04C1->U+0436, U+04C2->U+0436,\
U+04C3->U+043A, U+04C4->U+043A, U+04C5->U+043B, U+04C6->U+043B, U+04C7->U+043D,\
U+04C8->U+043D, U+04C9->U+043D, U+04CA->U+043D, U+04CB->U+0447, U+04CC->U+0447,\
U+04CD->U+043C, U+04CE->U+043C, U+04D0->U+0430, U+04D1->U+0430, U+04D2->U+0430,\
U+04D3->U+0430, U+04D4->U+00E6, U+04D5->U+00E6, U+04D6->U+0435, U+04D7->U+0435,\
U+04D8->U+04D9, U+04DA->U+04D9, U+04DB->U+04D9, U+04D9, U+04DC->U+0436, U+04DD->U+0436,\
U+04DE->U+0437, U+04DF->U+0437, U+04E0->U+04E1, U+04E1, U+04E2->U+0438, U+04E3->U+0438,\
U+04E4->U+0438, U+04E5->U+0438, U+04E6->U+043E, U+04E7->U+043E, U+04E8->U+043E,\
U+04E9->U+043E, U+04EA->U+043E, U+04EB->U+043E, U+04EC->U+044D, U+04ED->U+044D,\
U+04EE->U+0443, U+04EF->U+0443, U+04F0->U+0443, U+04F1->U+0443, U+04F2->U+0443,\
U+04F3->U+0443, U+04F4->U+0447, U+04F5->U+0447, U+04F6->U+0433, U+04F7->U+0433,\
U+04F8->U+044B, U+04F9->U+044B, U+04FA->U+0433, U+04FB->U+0433, U+04FC->U+0445,\
U+04FD->U+0445, U+04FE->U+0445, U+04FF->U+0445, U+0410..U+0418->U+0430..U+0438,\
U+0419->U+0438, U+0430..U+0438, U+041A..U+042F->U+043A..U+044F, U+043A..U+044F,\
U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, a..z
}
But when I search for, e.g., Arabic text, it doesn't give me any results, while MySQL's LIKE does.
Also, I wanted to ask: is there a better search server for this kind of task than Sphinx?
Thanks a lot.
