More Pythonic to write Functions with Regex - python

I've got 20'000+ court documents I want to pull specific data points out of: date, document number, verdict. I am using Python and regex to do this.
The verdicts are in three languages (German, French and Italian) and some of them have slightly different formatting. I am trying to develop functions for the various data points that take this and the different languages into account.
I find my functions very clumsy. Does anybody have a more Pythonic way to write these functions?
import re

def gericht(doc):
    Gericht = re.findall(
        r"Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n\n]*)", doc)
    Gericht1 = re.findall(
        r"Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+ [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n\n]*)", doc)
    Gericht2 = re.findall(
        r"Revisionsgesuch gegen das Urteil ([^\n\n]*)", doc)
    Gericht3 = re.findall(
        r"Urteil des ([^\n\n]*)", doc)
    Gericht_it = re.findall(
        r"ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+ [a-z]+ ([^\n\n]*)", doc)
    Gericht_fr = re.findall(
        r"recours contre l'arrêt ([^\n\n]*)", doc)
    Gericht_fr_1 = re.findall(
        r"recours contre le jugement ([^\n\n]*)", doc)
    Gericht_fr_2 = re.findall(
        r"demande de révision de l'arrêt ([^\n\n]*)", doc)
    try:
        if Gericht != None:
            return Gericht[0]
    except:
        None
    try:
        if Gericht1 != None:
            return Gericht1[0]
    except:
        None
    try:
        if Gericht2 != None:
            return Gericht2[0]
    except:
        None
    try:
        if Gericht3 != None:
            return Gericht3[0]
    except:
        None
    try:
        if Gericht_it != None:
            return Gericht_it[0]
    except:
        None
    try:
        if Gericht_fr != None:
            Gericht_fr = Gericht_fr[0].replace('de la ', '').replace('du ', '')
            return Gericht_fr
    except:
        None
    try:
        if Gericht_fr_1 != None:
            Gericht_fr_1 = Gericht_fr_1[0].replace('de la ', '').replace('du ', '')
            return Gericht_fr_1
    except:
        None
    try:
        if Gericht_fr_2 != None:
            Gericht_fr_2 = Gericht_fr_2[0].replace('de la ', '').replace('du ', '')
            return Gericht_fr_2
    except:
        None

The result of re.findall() is never None, so all those if statements testing for it are superfluous. And using findall() when you only want the first result does not make sense.
The replacements in the French results may also remove too much. For instance, the 'du ' replacement does not just remove the word du; it also affects words ending in du.
import re

def gericht(doc):
    for pattern, is_french in [
        (r'Beschwerde gegen [a-z]+ [A-Z][a-züöä]+ ([^\n]*)', False),
        (
            r'Beschwerde nach [A-Za-z]. [0-9]+ [a-z]+. [A-Z]+ [a-z]+'
            r' [a-z]+[A-Za-z]+ [a-z]+ [0-9]+. [A-Za-z]+ [0-9]+ ([^\n]*)',
            False
        ),
        (r'Revisionsgesuch gegen das Urteil ([^\n]*)', False),
        (r'Urteil des ([^\n]*)', False),
        (
            r'ricorso contro la sentenza emanata il [0-9]+ [a-z]+ [0-9]+'
            r' [a-z]+ ([^\n]*)',
            False
        ),
        (r"recours contre l'arrêt ([^\n]*)", True),
        (r'recours contre le jugement ([^\n]*)', True),
        (r"demande de révision de l'arrêt ([^\n]*)", True),
    ]:
        match = re.search(pattern, doc)
        if match:
            result = match.group(1)
            if is_french:
                for removable in [' de la ', ' du ']:
                    result = result.replace(removable, ' ')
            return result
    return None
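If the goal is to drop only the standalone French articles, a word-boundary pattern is safer still than chained replacements inside the loop — a minimal sketch (the sample strings below are made up for illustration):

```python
import re

def strip_french_articles(text):
    # \b ensures only the whole words "de la" / "du" are removed,
    # not the tail of longer words such as "fondu"
    return re.sub(r"\b(?:de la|du) ", "", text)

print(strip_french_articles("de la Cour du canton"))  # Cour canton
print(strip_french_articles("le fondu du jour"))      # le fondu jour
```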


How can I use a regex expression to identify the falses between braces on different lines?

I'm attempting to use Python's re.sub function to replace any instance of false in the example below with "false, \n"
local Mission = {
    start_done = false
    game_over = false
} --End Mission
I've attempted the following, but I'm not getting successful replacements. The idea is that I start and end with the anchor strings, skip over anything that isn't a "false", and return "false + ','" when I get a match. Any help would be appreciated!
re.sub(r'(Mission = {)(.+?)(false)(.+?)(} --End Mission)', r'\1' + ',' + '\n')
You can use
re.sub(r'Mission = {.*?} --End Mission', lambda x: x.group().replace('false', 'false, \n'), text, flags=re.S)
Notes:
The Mission = {.*?} --End Mission regex matches Mission = {, then any zero or more chars, as few as possible, and then } --End Mission
Then false is replaced with false, \n in the matched texts.
See the Python demo:
import re
text = 'local Mission = {\n start_done = false\n game_over = false\n\n} --End Mission'
rx = r'Mission = {.*?} --End Mission'
print(re.sub(rx, lambda x: x.group().replace('false', 'false, \n'), text, flags=re.S))
Another option without regex:
your_string = 'local Mission = {\n start_done = false\n game_over = false\n\n} --End Mission'
print(your_string.replace(' = false\n', ' = false,\n'))
Output:
local Mission = {
start_done = false,
game_over = false,
} --End Mission
Provided that every "false" string which is preceded by = and followed by \n has to be substituted, here is a regex:
re.sub(r'= (false)\n', r'= \1,\n', text)
Note: you introduced 5 groups in your regex, so you should have used \3, not \1, to refer to "false" (groups are numbered starting from 1; see the docs at paragraph \number)
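To see the numbering concretely, here is the five-group pattern run against a one-line sample; group 3 is the false:

```python
import re

m = re.search(r'(Mission = {)(.+?)(false)(.+?)(} --End Mission)',
              'Mission = { start_done = false } --End Mission')
print(m.group(1))  # Mission = {
print(m.group(3))  # false
```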

Text Preprocessing Translation Error Python

I was trying to translate tweet text using the deep-translator library, but I ran into some issues.
Before translating the texts, I did some text preprocessing such as cleaning, removing emoji, etc. These are the defined pre-processing functions:
def deEmojify(text):
    regex_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"
                               u"\u3030"
                               "]+", re.UNICODE)
    return regex_pattern.sub(r'', text)
def cleaningText(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # remove hashtags
    text = re.sub(r'RT[\s]', '', text)         # remove RT
    text = re.sub(r"http\S+", '', text)        # remove links
    text = re.sub(r"[!@#$]", '', text)         # remove special characters
    text = re.sub(r'[0-9]+', '', text)         # remove numbers
    text = text.replace('\n', ' ')             # replace newlines with spaces
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove all punctuation
    text = text.strip(' ')                     # strip spaces from both ends
    return text
def casefoldingText(text):  # Convert all the characters in a text into lower case
    text = text.lower()
    return text

def tokenizingText(text):  # Tokenize, i.e. split a text into a list of tokens
    text = word_tokenize(text)
    return text

def filteringText(text):  # Remove stopwords in a text
    listStopwords = set(stopwords.words('indonesian'))
    filtered = []
    for txt in text:
        if txt not in listStopwords:
            filtered.append(txt)
    text = filtered
    return text

def stemmingText(text):  # Reduce a word to its stem by removing affixes (prefixes and suffixes)
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()
    text = [stemmer.stem(word) for word in text]
    return text

def convert_eng(text):
    text = GoogleTranslator(source='auto', target='en').translate_batch(text)
    return text
And here's the translate function :
def convert_eng(text):
    text = GoogleTranslator(source='auto', target='en').translate(text)
    return text
This is an example of the expected result (text in Indonesian):
text = '#jshuahaee Ketemu agnes mo lagii😍😍'
clean = cleaningText(text)
print('After cleaning ==> ', clean)
emoji = deEmojify(clean)
print('After emoji ==> ', emoji)
cf = casefoldingText(emoji)
print('After case folding ==> ', cf)
token = tokenizingText(cf)
print('After token ==> ', token)
filter= filteringText(token)
print('After filter ==> ', filter)
stem = stemmingText(filter)
print('After Stem ==> ', stem)
en = convert_eng(stem)
print('After translate ==> ', en)
Result :
After cleaning ==> Ketemu agnes mo lagii😍😍
After emoji ==> Ketemu agnes mo lagii
After case folding ==> ketemu agnes mo lagii
After token ==> ['ketemu', 'agnes', 'mo', 'lagii']
After filter ==> ['ketemu', 'agnes', 'mo', 'lagii']
After Stem ==> ['ketemu', 'agnes', 'mo', 'lagi']
After translate ==> ['meet', 'agnes', 'mo', 'again']
But I found issues when the sentences contain some dots; the error happens when, after the stemming step, the token list contains '' (an empty string)
text = 'News update Meski kurang diaspirasi Shoppee yg korea minded dalam waktu indonesa belaja di bulan November Lazada 1… '
clean = cleaningText(text)
print('After cleaning ==> ', clean)
emoji = deEmojify(clean)
print('After emoji ==> ', emoji)
cf = casefoldingText(emoji)
print('After case folding ==> ', cf)
token = tokenizingText(cf)
print('After token ==> ', token)
filter= filteringText(token)
print('After filter ==> ', filter)
stem = stemmingText(filter)
print('After Stem ==> ', stem)
en = convert_eng(stem)
print('After translate ==> ', en)
Result
After cleaning ==> News update Meski kurang diaspirasi Shoppee yg korea minded dalam waktu indonesa belaja di bulan November Lazada …
After emoji ==> News update Meski kurang diaspirasi Shoppee yg korea minded dalam waktu indonesa belaja di bulan November Lazada …
After case folding ==> news update meski kurang diaspirasi shoppee yg korea minded dalam waktu indonesa belaja di bulan november lazada …
After token ==> ['news', 'update', 'meski', 'kurang', 'diaspirasi', 'shoppee', 'yg', 'korea', 'minded', 'dalam', 'waktu', 'indonesa', 'belaja', 'di', 'bulan', 'november', 'lazada', '…']
After filter ==> ['news', 'update', 'diaspirasi', 'shoppee', 'yg', 'korea', 'minded', 'indonesa', 'belaja', 'november', 'lazada', '…']
After Stem ==> ['news', 'update', 'aspirasi', 'shoppee', 'yg', 'korea', 'minded', 'indonesa', 'baja', 'november', 'lazada', '']
This is the error message
NotValidPayload Traceback (most recent call last)
<ipython-input-40-cb9390422d3c> in <module>
14 print('After Stem ==> ', stem)
15
---> 16 en = convert_eng(stem)
17 print('After translate ==> ', en)
<ipython-input-28-28bc36c96914> in convert_eng(text)
8 return text
9 def convert_eng(text):
---> 10 text = GoogleTranslator(source='auto', target='en').translate_batch(text)
11 return text
C:\Python\lib\site-packages\deep_translator\google_trans.py in translate_batch(self, batch, **kwargs)
195 for i, text in enumerate(batch):
196
--> 197 translated = self.translate(text, **kwargs)
198 arr.append(translated)
199 return arr
C:\Python\lib\site-packages\deep_translator\google_trans.py in translate(self, text, **kwargs)
108 """
109
--> 110 if self._validate_payload(text):
111 text = text.strip()
112
C:\Python\lib\site-packages\deep_translator\parent.py in _validate_payload(payload, min_chars, max_chars)
44
45 if not payload or not isinstance(payload, str) or not payload.strip() or payload.isdigit():
---> 46 raise NotValidPayload(payload)
47
48 # check if payload contains only symbols
NotValidPayload: --> text must be a valid text with maximum 5000 character, otherwise it cannot be translated
My idea is to remove the '', i think that was the problem, but I have no idea how to do that.
Anyone, please kindly help me
You need to introduce a bit of error checking into your code, and only process an expected data type. Your convert_eng function (that uses GoogleTranslator#translate_batch) requires a list of non-blank strings as an argument (see if not payload or not isinstance(payload, str) or not payload.strip() or payload.isdigit(): part), and your stem contains an empty string as the last item in the list.
Besides, it is possible that filteringText(text) can return [] because all words can turn out to be stopwords. Also, do not use filter as a name of a variable, it is a built-in.
So, change
filter= filteringText(token)
print('After filter ==> ', filter)
stem = stemmingText(filter)
print('After Stem ==> ', stem)
to
filter1 = filteringText(token)
print('After filter ==> ', filter1)
if filter1:
    stem = stemmingText(filter1)
    print('After Stem ==> ', stem)
    en = convert_eng([x for x in stem if x.strip() and not x.isdigit()])
    print('After translate ==> ', en)
I left out the isinstance(x, str) check because I assume you already know your list only contains strings.
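As a quick sanity check, the guard in that list comprehension can be run on tokens like those from the failing run above: blank and purely numeric strings are dropped, which is exactly what the payload validation requires.

```python
# tokens shaped like the failing stemming output (trailing empty string)
stem = ['news', 'update', 'lazada', '', '42']
payload = [x for x in stem if x.strip() and not x.isdigit()]
print(payload)  # ['news', 'update', 'lazada']
```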

Tokenize tweet based on Regex

I have the following example text / tweet:
RT @trader $AAPL 2012 is o´o´o´o´o´pen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO
I want to follow the procedure of Table 1 in Li, T, van Dalen, J, & van Rees, P.J. (Pieter Jan). (2017). More than just noise? Examining the information content of stock microblogs on financial markets. Journal of Information Technology. doi:10.1057/s41265-016-0034-2 in order to clean up the tweet.
They clean the tweet up in such a way that the final result is:
{RT|123456} {USER|56789} {TICKER|AAPL} {NUMBER|2012} notooopen nottalk patent {COMPANY|GOOG} notdefinetli treatment {HASH|samsung} {EMOTICON|POS} haha {URL}
I use the following script to tokenize the tweet based on the regex:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{#\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+""",
    # Twitter username:
    r"""(?:@[\w_]+)""",
    # Hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)""",
    # Cashtags:
    r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)""",
    # Remaining word types:
    r"""
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
emoticon_re = re.compile(emoticon_string, re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
This yields the following output:
rt
@trader
$aapl
2012
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
#samsung
got
:-)
heh
url_that_cannot_be_posted_on_SO
How can I adjust this script to get:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|url_that_cannot_be_posted_on_SO}
Thanks in advance for helping me out big time!
You really need to use named capturing groups (mentioned by thebjorn), and use groupdict() to get name-value pairs upon each match. It requires some post-processing though:
All pairs where the value is None must be discarded
If the self.preserve_case is false the value can be turned to lower case at once
If the group name is WORD, ELLIPSIS or ELSE the values are added to words as is
If the group name is HASHTAG, CASHTAG, USER or URL the values are first stripped of $, # and @ chars at the start and then added to words as a {<GROUP_NAME>|<VALUE>} item
All other matches are added to words as {<GROUP_NAME>|<VALUE>} item.
Note that \w matches underscores by default, so [\w_] = \w. I optimized the patterns a little bit.
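The groupdict() post-processing can be illustrated in isolation with a toy two-alternative pattern (not the full tokenizer):

```python
import re

toy_re = re.compile(r'(?P<USER>@\w+)|(?P<WORD>\w+)')
for m in toy_re.finditer('RT @trader'):
    # every named group appears in groupdict(); unmatched ones are None
    print({k: v for k, v in m.groupdict().items() if v is not None})
# {'WORD': 'RT'}
# {'USER': '@trader'}
```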
Here is a fixed code snippet:
import re

emoticon_string = r"""
    (?P<EMOTICON>
      [<>]?
      [:;=8]             # eyes
      [-o*']?            # optional nose
      [][()dDpP/:{}#|\\] # mouth
      |
      [][()dDpP/:}{#|\\] # mouth
      [-o*']?            # optional nose
      [:;=8]             # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""(?P<URL>https?://(?:[-a-zA-Z0-9_$#.&+!*(),]|%[0-9a-fA-F][0-9a-fA-F])+)""",
    # Twitter username:
    r"""(?P<USER>@\w+)""",
    # Hashtags:
    r"""(?P<HASHTAG>\#+\w+[\w'-]*\w+)""",
    # Cashtags:
    r"""(?P<CASHTAG>\$+\w+[\w'-]*\w+)""",
    # Remaining word types:
    r"""
    (?P<NUMBER>[+-]?\d+(?:[,/.:-]\d+[+-]?)?) # Numbers, including fractions, decimals.
    |
    (?P<WORD>\w+)                            # Words without apostrophes or dashes.
    |
    (?P<ELLIPSIS>\.(?:\s*\.)+)               # Ellipsis dots.
    |
    (?P<ELSE>\S)                             # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""({}|{})""".format(emoticon_string, "|".join(regex_strings)), re.VERBOSE | re.I | re.UNICODE)
#print(word_re.pattern)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = []
        for x in word_re.finditer(s):
            for key, val in x.groupdict().items():
                if val:
                    if not self.preserve_case:
                        val = val.lower()
                    if key in ['WORD', 'ELLIPSIS', 'ELSE']:
                        words.append(val)
                    elif key in ['HASHTAG', 'CASHTAG', 'USER', 'URL']:  # Add more here if needed
                        words.append("{{{}|{}}}".format(key, re.sub(r'^[@#$]+', '', val)))
                    else:
                        words.append("{{{}|{}}}".format(key, val))
        return words

if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
With test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com', it outputs
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|http://some.site.here.com}

Writing in file with specific format

def sauvegarder_canaux(self, nom_fichier:str) is the method giving me a problem; when the file is saved, it only writes in this format:
5 - TQS (Télévision Quatres-saisons, 0.0 $ extra)
I need it to be like this:
5 : TQS : Télévision Quatres-saisons : 0.0 $ extra
This is the code that I have for now:
from canal import Canal
from forfait_tv import ForfaitTV
from abonne import Abonne

#============= Class ===========================
class Distributeur:
    """
    Description:
    ===========
    This class manages the lists of channels and packages (and, later,
    of subscribers).

    Private data members:
    =====================
    __canaux    # [Canal]     List of existing channels
    __forfaits  # [ForfaitTV] List of available packages
    """
    #----------- Constructor -----------------------------
    def __init__(self):
        self.__canaux = []    # list
        self.__forfaits = []  # list

    #----------- Accessors/Mutators ----------------------
    def ajouter_canal(self, un_canal: Canal):
        self.__canaux.append(un_canal)

    def chercher_canal(self, p_poste: int):
        postex = None
        poste_trouve = None
        for i in range(0, len(self.__canaux), 1):
            postex = self.__canaux[i]
            if postex.get_poste() == p_poste:
                poste_trouve = postex
        return print(poste_trouve)

    def telecharger_canaux(self, nom_fichier: str):
        fichierCanaux = open(nom_fichier, "r")
        for line in fichierCanaux:
            eleCanal = line.strip().split(" : ")  # split, not strip: strip(" : ") only trims characters at the ends
            canal = Canal(eleCanal[0], eleCanal[1], eleCanal[2], eleCanal[3])
            self.__canaux.append(canal)
        return canal

    def sauvegarder_canaux(self, nom_fichier: str):
        fichCanaux = open(nom_fichier, "w")
        for i in self.__canaux:
            fichCanaux.write(str(i) + "\n")
        fichCanaux.close()
You only need to edit the string before you write it. The str.replace method is your friend. Perhaps ...
for i in self.__canaux:
    out_line = str(i)
    for char in "-(,":
        out_line = out_line.replace(char, ':')
    fichCanaux.write(out_line + "\n")
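Note that blanket character replacement would also turn the hyphen inside Quatres-saisons into a colon. An alternative sketch, assuming every line follows the sample's number - acronym (name, price) shape, captures the four fields and rebuilds the line:

```python
import re

line = "5 - TQS (Télévision Quatres-saisons, 0.0 $ extra)"
# four groups: number, acronym, full name, price note
m = re.match(r"(\d+) - (\S+) \((.+), (.+)\)", line)
if m:
    print(" : ".join(m.groups()))  # 5 : TQS : Télévision Quatres-saisons : 0.0 $ extra
```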
If removing the accents is okay, you can normalize the text to NFD with unicodedata, then find the segments of interest, modify them with the desired formatting, and replace them with the formatted segments using regex:
import unicodedata
import re

def format_string(test_str):
    # normalize accents (NFD, then drop the combining marks)
    test_str = unicodedata.normalize('NFD', test_str)
    test_str = test_str.encode('ascii', 'ignore').decode('ascii')
    # segment patterns
    segment_1_ptn = re.compile(r"""[0-9]*(\s)* # natural number
                                   [-](\s)*    # dash
                                   (\w)*(\s)*  # acronym
                               """,
                               re.VERBOSE)
    segment_2_ptn = re.compile(r"""(\w)*(\s)*         # acronym
                                   (\()               # open parenthesis
                                   ((\w*[-]*)*(\s)*)* # words
                               """,
                               re.VERBOSE)
    segment_3_ptn = re.compile(r"""((\w*[-]*)*(\s)*)*          # words
                                   (,)(\s)*                    # comma
                                   [0-9]*(.)[0-9]*(\s)*(\$)(\s) # real number
                               """,
                               re.VERBOSE)
    # format data
    segment_1_match = segment_1_ptn.search(test_str).group()
    test_str = test_str.replace(segment_1_match, " : ".join(segment_1_match.split("-")))
    segment_2_match = segment_2_ptn.search(test_str).group()
    test_str = test_str.replace(segment_2_match, " : ".join(segment_2_match.split("(")))
    segment_3_match = segment_3_ptn.search(test_str).group()
    test_str = test_str.replace(segment_3_match, " : ".join(segment_3_match.split(",")))[:-1]
    test_str = " : ".join([txt.strip() for txt in test_str.split(":")])
    return test_str
Then you can call this function within sauvegarder_canaux
def sauvegarder_canaux(self, nom_fichier: str):
    with open(nom_fichier, "w") as fichCanaux:
        for i in self.__canaux:
            fichCanaux.write(format_string(str(i)) + "\n")
You can also add format_string as a method within your Distributeur class.
Example input:
5 - TQS (Télévision Quatres-saisons, 0.0 $ extra)
Example output:
5 : TQS : Television Quatres-saisons : 0.0 $ extra
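For reference, the accent-normalization step used above can be sketched on its own with just the standard library:

```python
import unicodedata

def strip_accents(s):
    # NFD splits accented characters into base char + combining mark,
    # and the ascii/ignore round-trip drops the marks
    return unicodedata.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii')

print(strip_accents("Télévision Quatres-saisons"))  # Television Quatres-saisons
```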

Python RE ( In a word to check first letter is case sensitive and rest all case insensitive)

In the case below I want to match the string "Singapore", where "S" must always be capital and the rest of the letters may be lower- or uppercase. But in the string below "s" is lowercase and it still gets matched by the search condition. Can anybody tell me how to implement this?
import re
st = "Information in sinGapore "
if re.search("S" "(?i)(ingapore)", st):
    print "matched"
Singapore => matched
sIngapore => notmatched
SinGapore => matched
SINGAPORE => matched
As commented, the Ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be matching Singapore case-insensitive, and then checking that the first letter is S:
>>> reg = re.compile("singapore", re.I)
>>> s = "Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0] == 'S'
False
>>> s = "Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0] == 'S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
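Put together, the elegant approach can be wrapped in a small helper (the function name here is my own):

```python
import re

def match_with_capital_first(word, text):
    # search case-insensitively, then verify the case of the first letter
    m = re.search(word, text, re.I)
    return bool(m) and m.group().startswith(word[0].upper())

print(match_with_capital_first("singapore", "Information in SinGapore"))  # True
print(match_with_capital_first("singapore", "Information in sinGapore"))  # False
```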
Since you want to set a GV code according to the matched phrase (a single name, or several names separated by blanks), there must be a step in which the code is chosen from a dictionary according to that phrase.
So it is easy to take advantage of this step to perform the test on the first letter (which must be uppercase) of the first name in the phrase, something no regex alone is capable of.
I chose certain conditions for the test. For example, a dot in a first name is not mandatory, but uppercase letters are. These conditions can easily be changed as needed.
EDIT 1
import re

def regexize(cntry):
    def doot(x):
        return '\.?'.join(ch for ch in x) + '\.?'
    to_join = []
    for c in cntry:
        cspl = c.split(' ', 1)
        if len(cspl) == 1:  # 'Singapore','Austria',...
            to_join.append('(%s)%s'
                           % (doot(c[0]), doot(c[1:])))
        else:  # 'Den LMM','LMM Den',....
            to_join.append('(%s) +%s'
                           % (doot(cspl[0]),
                              doot(cspl[1].strip(' ').lower())))
    pattern = '|'.join(to_join).join('()')
    return re.compile(pattern, re.I)

def code(X, CNTR, r=regexize):
    r = regexize(CNTR)
    for ma in r.finditer(X):
        beg = ma.group(1).split(' ')[0]
        if beg == ma.group(1):
            GV = countries[beg[0] + beg[1:].replace('.', '').lower()] \
                 if beg[0].upper() == beg[0] else '- bad match -'
        else:
            try:
                k = (ki for ki in countries.iterkeys()
                     if beg.replace('.', '') == ki.split(' ')[0]).next()
                GV = countries[k]
            except StopIteration:
                GV = '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)

countries = {'Singapore': 'SG', 'Austria': 'AU',
             'Swiss': 'CH', 'Chile': 'CL',
             'Den LMM': 'DN', 'LMM Den': 'LM'}

s = (' Singapore SIngapore SiNgapore SinGapore'
     ' SI.Ngapore SIngaPore SinGaporE SinGAPore'
     ' SINGaporE SiNg.aPoR singapore sIngapore'
     ' siNgapore sinGapore sINgap.ore sIngaPore'
     ' sinGaporE sinGAPore sINGaporE siNgaPoRe'
     ' Austria Aus.trIA aUSTria AUSTRiA'
     ' Den L.M.M Den Lm.M DEn Lm.M.'
     ' DEN L.MM De.n L.M.M. Den LmM'
     ' L.MM DEn LMM DeN LM.m Den')

print '\n'
print '\n'.join(res for res in code(s, countries))
EDIT 2
I improved the code. It's shorter and more readable.
The assert(...) instruction verifies that the keys of the dictionary are well formed for the purpose.
import re

def doot(x):
    return '\.?'.join(ch for ch in x) + '\.?'

def regexize(labels, doot=doot,
             wg2='(%s) *( %s)', wnog2='(%s)(%s)',
             ri=re.compile('(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
    to_join = []
    modlabs = {}
    for K in labels.iterkeys():
        g1, g2, g3 = ri.match(K).groups()
        to_join.append((wg2 if g2 else wnog2)
                       % (doot(g1), doot(g3.lower())))
        modlabs[g1 + g2 + g3.lower()] = labels[K]
    return (re.compile('|'.join(to_join), re.I), modlabs)

def code(X, labels, regexize=regexize):
    reglab, modlabs = regexize(labels)
    for ma in reglab.finditer(X):
        a, b = tuple(x for x in ma.groups() if x)
        k = (a + b.lower()).replace('.', '')
        GV = modlabs[k] if k in modlabs else '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(a + b, GV)

countries = {'Singapore': 'SG', 'Austria': 'AU',
             'Swiss': 'CH', 'Chile': 'CL',
             'Den LMM': 'DN', 'LMM Den': 'LM'}

assert(all('.' not in k and
           (k.count(' ') == 1 or k[0].upper() == k[0])
           for k in countries))

s = (' Singapore SIngapore SiNgapore SinGapore'
     ' SI.Ngapore SIngaPore SinGaporE SinGAPore'
     ' SINGaporE SiNg.aPoR singapore sIngapore'
     ' siNgapore sinGapore sINgap.ore sIngaPore'
     ' sinGaporE sinGAPore sINGaporE siNgaPoRe'
     ' Austria Aus.trIA aUSTria AUSTRiA'
     ' Den L.M.M Den Lm.M DEn Lm.M.'
     ' DEN L.MM De.n L.M.M. Den LmM'
     ' L.MM DEn LMM DeN LM.m Den')

print '\n'.join(res for res in code(s, countries))
You could write a simple lambda to generate the ugly-but-all-re-solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
...                                                   (c.upper(), c.lower())
...                                                   for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print leading_caps_re('Kuala Lumpur')
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore"), st):
    ...etc...
and the ugliness of the RE would be purely internal.
Interestingly,
/((S)((?i)ingapore))/
does the right thing in Perl but doesn't seem to work as needed in Python. To be fair, the Python docs spell it out clearly: (?i) alters the whole regexp.
This is the BEST answer:
(?-i:S)(?i)ingapore
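In Python this is expressed with scoped inline flags, available since Python 3.6 (and note that Python 3.11+ rejects a global (?i) that is not at the start of the pattern, so the scoped form is the safe one):

```python
import re

# (?i:...) applies case-insensitivity only inside the group,
# so the leading S stays case-sensitive
pattern = re.compile(r"S(?i:ingapore)")

print(bool(pattern.search("SINGaporE")))  # True
print(bool(pattern.search("sinGapore")))  # False
```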
