Remove prefix from name in Python

names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
I have a list of names like this and need to remove the prefixes ('LIC.', 'C.P.N.', etc.) from each name. This is just a sample; there are many more prefixes in this format.
The output should look like this:
'SEBASTIÁN LASTIRI'
I have tried:
for i in names:
    if '.' in i:
        i.split('.')[1]
But this works only when the prefix contains a single dot.
How can I solve this?

Here is the solution for your issue:
import re
names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
new_names = [re.sub(r"^\s+", "", i.split(".")[-1]) for i in names]
print(new_names)  # ['SEBASTIÁN LASTIRI', 'ROBERTO DANIEL RODRÍGUEZ', 'JULIO DOMINGO BURAK', 'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO']

You can use the following code:
import re
names = [
'LIC. SEBASTIAN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRIGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRAULICO VICTOR AGUSTIN PORRINO'
]
for i in names:
    # Split at the last dot (and any whitespace after it)
    res = re.split(r'\.\s*(?=[^.]+$)', i)
    if len(res) > 1:
        print(res[1])
    else:
        print(res[0])
Output:
SEBASTIAN LASTIRI
ROBERTO DANIEL RODRIGUEZ
JULIO DOMINGO BURAK
INGENIERO HIDRAULICO VICTOR AGUSTIN PORRINO

A simple filter that keeps only the words that do not end with a dot:
names = [
'LIC. SEBASTIÁN LASTIRI',
'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
'C.P.N. JULIO DOMINGO BURAK',
'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO'
]
names_formated = [' '.join([sub for sub in name.split() if sub[-1] != '.']) for name in names]
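The answers above treat a dot anywhere in the name as part of a prefix. A minimal sketch of an alternative, assuming every prefix is a whitespace-free token that ends in a dot and sits at the very start of the name (the names PREFIX_RE and strip_prefix are hypothetical, not from the original answers). Anchoring the pattern at the start leaves a later initial such as 'P.' untouched. Like the other answers, it does not remove spelled-out titles such as 'INGENIERO HIDRÁULICO'.

```python
import re

# Hypothetical helper: one or more leading tokens that end in a dot,
# e.g. 'LIC.', 'ING. AGR.' or 'C.P.N.', each followed by optional whitespace.
PREFIX_RE = re.compile(r'^(?:\S*\.\s*)+')

def strip_prefix(name):
    """Remove dotted prefixes from the start of a name."""
    return PREFIX_RE.sub('', name)

names = [
    'LIC. SEBASTIÁN LASTIRI',
    'ING. AGR. ROBERTO DANIEL RODRÍGUEZ',
    'C.P.N. JULIO DOMINGO BURAK',
    'INGENIERO HIDRÁULICO VÍCTOR AGUSTÍN PORRINO',
]
for name in names:
    print(strip_prefix(name))
```

If the full set of prefixes is known in advance, an explicit list of prefixes would probably be more reliable than any dot-based heuristic.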

Related

Looking at the next word

I would like to find words that start with a capital letter and are followed by another word that also starts with a capital letter.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract the capitalised words I use:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However, this column collects only single words, since the pattern does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Regex is not always the best tool; let us try split with explode:
s = df.Testo.str.split(' ').explode()
s2 = s.groupby(level=0).shift(-1)
assign = (s + ' ' + s2)[s.str.istitle() & s2.str.istitle()].groupby(level=0).agg(list)
Out[244]:
1 [Gli Stati, Stati Uniti]
2 [Il Regno, Regno Unito]
3 [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New']=assign
# Note: rows without a match are assigned NaN
Maybe you could use my code below:
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        if (words[i][0].isupper() and words[i+1][0].isupper()):
            yield f"{words[i]} {words[i+1]}"
This function creates a generator, so you will have to convert the result to a list or whatever container you need.
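As a usage sketch for the generator approach, the following applies the same idea to the sample sentences from the question and collects each result into a list. The function name capitalized_pairs and the rows dictionary are illustrative, not part of the original answer.

```python
def capitalized_pairs(text):
    """Yield each pair of adjacent words that both start with an uppercase letter."""
    words = text.split()
    for first, second in zip(words, words[1:]):
        if first[0].isupper() and second[0].isupper():
            yield f"{first} {second}"

# Sample rows copied from the question (ID -> Testo)
rows = {
    141: "Vivo in una piccola città",
    22: "Gli Stati Uniti sono una grande nazione",
    153: "Il Regno Unito ha votato per uscire dall'Europa",
    64: "Hugh Laurie ha interpretato Dr. House",
    12: "Mi piace bere birra.",
}

# list() consumes the generator and gives one list per row
extracted = {row_id: list(capitalized_pairs(text)) for row_id, text in rows.items()}
for row_id, pairs in extracted.items():
    print(row_id, pairs)
```

With the DataFrame from the question, the same function could presumably be applied per row with df['Testo'].apply(lambda t: list(capitalized_pairs(t))).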
import re
import pandas as pd
x = {141 : 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
     153 : 'Il Regno Unito ha votato per uscire dall\'Europa', 64 : 'Hugh Laurie ha interpretato Dr. House', 12 :'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns = ['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
    # Drop the first word, then collect the remaining capitalised words
    string = string.split(' ')
    string = string[1:]
    string = ' '.join(string)
    caps.append(re.findall(r'([A-Z][a-z]+)', string))
df['Estratte'] = caps
Why not match words that start with a capital letter but are not at the start of the line?
df.Testo.str.findall(r'(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall(r'(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest approach is the third-party regex module: search for pattern-space-pattern and allow overlapping matches:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall(r'[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
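The overlapped=True argument belongs to the third-party regex module; the standard re module has no such option. A minimal sketch of one way to get overlapping pairs with the standard library alone is to wrap the pattern in a lookahead with a capture group, so each match consumes no text. The function name overlapping_pairs is hypothetical.

```python
import re

# The lookahead matches at a position without consuming text, so the next
# search can start inside the previous pair. \b keeps matches on word starts.
PAIR_RE = re.compile(r'(?=\b([A-Z][a-z]+ [A-Z][a-z]+))')

def overlapping_pairs(text):
    """Return all (possibly overlapping) pairs of adjacent capitalised words."""
    return PAIR_RE.findall(text)

print(overlapping_pairs("Gli Stati Uniti sono una grande nazione"))
print(overlapping_pairs("Hugh Laurie ha interpretato Dr. House"))
```

As with the pattern in the answer, "Dr. House" is not captured because of the dot after "Dr".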

Convert in utf16

I am crawling several websites and extracting the names of the products. Some names contain errors like these:
Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&M Discovery)
How can I fix that?
I am using xpath and re.search to get the names.
Every Python file starts with this line: # -*- coding: utf-8 -*-
Edit:
This is the source code that extracts the information:
if '"articleName":' in details:
    closer_to_product = details.split('"articleName":', 1)[1]
    closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
    if debug_product == 1:
        print('product before try:' + repr(closer_to_product_2))
    try:
        # Capture the text between the opening quote and the closing '",'
        found_product = re.search(r'"(.*?)",', closer_to_product_2).group(1)
    except AttributeError:
        found_product = ''
    if debug_product == 1:
        print('cleared product: ', '>>>' + repr(found_product) + '<<<')
    if not found_product:
        print(product_detail_page, found_product)
        items['products'] = 'default'
    else:
        items['products'] = found_product
Details:
product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]
Where is the problem? (Python 3.8.3)
import html
strings = [
    'Bols Watermelon Lik\u00f6r 0,7l',
    'Hayman\u00b4s Sloe Gin',
    'Ron Zacapa Edici\u00f3n Negra',
    'Havana Club A\u00f1ejo Especial',
    'Caol Ila 13 Jahre (G&M Discovery)',
    'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
    'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
for s in strings:
    # Decode HTML entities first, then literal \uXXXX escape sequences
    print(html.unescape(s)
          .encode('raw_unicode_escape')
          .decode('unicode_escape'))
Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
Edit: use .encode('raw_unicode_escape').decode('unicode_escape') for doubled backslashes (reverse solidi); see "Python Specific Encodings" in the codecs documentation.
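The keys "articleName" and "imageTitle" in the question suggest that the names come from a JSON object embedded in the page. If that assumption holds, a possibly simpler route is to let a JSON parser decode the \uXXXX escapes instead of cutting the string out by hand. The fragment below is a made-up example, not the real page content.

```python
import json

# Hypothetical fragment of the embedded JSON; the real page will contain more keys.
raw = '{"articleName": "Bols Watermelon Lik\\u00f6r 0,7l", "imageTitle": "example"}'

data = json.loads(raw)           # json.loads decodes \uXXXX escapes itself
print(data["articleName"])       # the decoded product name
```

This only helps when the whole JSON object can be isolated from the page; for loose string fragments the encode/decode approach above still applies.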

Join whole word by its Tag Python

Let's say I have this sentence:
His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O name/O is/O
Jonas/Name Van/Name Dame/Name
How can I get a result like this:
Petter Jack, Jonas Van Dame.
So far I have tried this, but it only joins two words:
import re
pattern = re.compile(r"\w+\/Name")
sent = sentence.split()
for i, w in enumerate(sent):
    if pattern.match(sent[i]) != None:
        if pattern.match(sent[i+1]) != None:
            #....
            #join sent[i] and sent[i+1] element
            #....
Try something like this:
pattern = re.compile(r"((\w+\/Name\s*)+)")
names = pattern.findall(your_string)
for name in names:
    print(''.join(name[0].split('/Name')))
I'm thinking about a two-phase solution:
r = re.compile(r'\w+\/Name(?:\ \w+\/Name)*')
result = r.findall(s)
# -> ['Petter/Name Jack/Name', 'Jonas/Name Van/Name Dame/Name']
for r in result:
    print(r.replace('/Name', ''))
    # -> Petter Jack
    # -> Jonas Van Dame
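For comparison, here is a sketch without regular expressions that groups consecutive tokens by their tag using itertools.groupby. It assumes every token has the form word/Tag; the function name join_tagged is hypothetical.

```python
from itertools import groupby

def join_tagged(sentence, tag="Name"):
    """Join each run of consecutive words carrying `tag` into one string."""
    # Split 'Petter/Name' into ['Petter', 'Name']; assumes every token contains '/'
    pairs = [token.rsplit("/", 1) for token in sentence.split()]
    runs = []
    for has_tag, group in groupby(pairs, key=lambda pair: pair[1] == tag):
        if has_tag:
            runs.append(" ".join(word for word, _ in group))
    return runs

sentence = ("His/O name/O is/O Petter/Name Jack/Name and/O his/O brother/O "
            "name/O is/O Jonas/Name Van/Name Dame/Name")
print(", ".join(join_tagged(sentence)))
```

The tag parameter makes the same function usable for other labels, should the tag set grow beyond Name and O.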

For-loop does not iterate to next element in a file

I have a problem with my two for-loops in this code:
def batchm():
    searchFile = sys.argv[1]
    namesFile = sys.argv[2]
    writeFile = sys.argv[3]
    countDict = {}
    with open(searchFile, "r") as nlcfile:
        with open(namesFile, "r") as namesList:
            with open(writeFile, "a") as wfile:
                for name in namesList:
                    for line in nlcfile:
                        if name in line:
                            res = line.split("\t")
                            countValue = res[0]
                            countKey = res[-1]
                            countDict[countKey] = countValue
                    countDictMax = sorted(countDict, key = lambda x: x[1], reverse = True)
                    print(countDictMax)
The loop is iterating over this:
namesList:
Greene
Donald
Donald Duck
MacDonald
.
.
.
nlcfile:
123 1999–2000 Northampton Town F.C. season Northampton Town
5 John Simpson Kirkpatrick
167 File talk:NewYorkRangers1940s.png talk
234 Parshu Ram Sharma(Raj Comics) Parshuram Sharma
.
.
.
What I get looks like this:
['Lyn Greene\n', 'Rydbergia grandiflora (Torrey &amp; A. Gray in A. Gray) E. Greene\n', 'Tyler Greene\n', 'Ty Greene\n' ..... ]
and this list appears 48 times, which also happens to be the number of lines in namesList.
Desired output:
("string from namesList" -> "record with highest number in nlcfile")
Greene -> Ly Greene
Donald -> Donald Duck
.
.
.
I think that the two for-loops don't iterate the right way, but I have no clue why.
Can anyone see where the problem is?
Thank you very much!
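A likely cause is that a file object is a one-shot iterator: the inner loop reads nlcfile to the end for the first name, so for every later name it has nothing left to iterate. In addition, each name still carries its trailing newline, and sorted(countDict, key=lambda x: x[1]) sorts the dictionary keys by their second character rather than by the count. Below is a minimal sketch of one possible fix, assuming tab-separated records with the count in the first field and the title in the last, as in the question's code. The function name best_records and the sample data are hypothetical.

```python
def best_records(names, records):
    """Map each name to the title of the matching record with the highest count."""
    # Parse the records once; a file object cannot be looped over a second time.
    parsed = []
    for line in records:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0].isdigit():
            parsed.append((int(fields[0]), fields[-1]))

    best = {}
    for raw_name in names:
        name = raw_name.strip()  # drop the trailing newline before matching
        if not name:
            continue
        matches = [(count, title) for count, title in parsed if name in title]
        if matches:
            # Compare counts as integers, not as strings
            best[name] = max(matches, key=lambda match: match[0])[1]
    return best

# Made-up sample data in the assumed "count<TAB>...<TAB>title" layout
names = ["Greene\n", "Donald\n"]
records = [
    "12\tLyn Greene\tLyn Greene\n",
    "40\tTyler Greene\tTyler Greene\n",
    "7\tDonald Duck\tDonald Duck\n",
]
for name, title in best_records(names, records).items():
    print(name, "->", title)
```

With real files, the two open file objects can be passed in directly, since each is read only once. Calling nlcfile.seek(0) before the inner loop would also work, at the cost of rereading the file for every name.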

Python RE (match a word whose first letter is case-sensitive and the rest case-insensitive)

In the case below I want to match the string "Singapore", where the "S" must always be uppercase and the remaining letters may be lowercase or uppercase. In the string below the "s" is lowercase, yet it still matches the search condition. Can anybody tell me how to implement this?
import re
st = "Information in sinGapore "
if re.search("S""(?i)(ingapore)" , st):
    print("matched")
Singapore => matched
sIngapore => notmatched
SinGapore => matched
SINGAPORE => matched
As commented, the ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be to match Singapore case-insensitively and then check that the first letter is S:
>>> reg=re.compile("singapore", re.I)
>>> s="Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
False
>>> s="Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
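One caveat of the search-then-check approach is that search only returns the first occurrence, so a text containing "singapore" before "Singapore" would be rejected. A small sketch that checks every occurrence instead, using finditer (the function name has_capitalised_match is hypothetical):

```python
import re

SINGAPORE_RE = re.compile(r"singapore", re.IGNORECASE)

def has_capitalised_match(text):
    """True if any case-insensitive match of 'singapore' starts with an uppercase S."""
    return any(m.group().startswith("S") for m in SINGAPORE_RE.finditer(text))

for sample in ["Information in sinGapore ",
               "Information in SinGapore",
               "singapore and SINGAPORE"]:
    print(repr(sample), has_capitalised_match(sample))
```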
Since you want to set a GV code according to the caught phrase (a single name or several names separated by blanks, I know that), there must be a step in which the code is chosen from a dictionary according to the caught phrase.
So it is easy to take advantage of this step to test the first letter (which must be uppercase) of the first name in the phrase, something no regex is capable of.
I chose certain conditions for the test. For example, a dot in a first name is not mandatory, but uppercase letters are. These conditions can easily be changed as needed.
EDIT 1
import re

def regexize(cntry):
    def doot(x):
        # Allow an optional dot after every character
        return r'\.?'.join(ch for ch in x) + r'\.?'
    to_join = []
    for c in cntry:
        cspl = c.split(' ',1)
        if len(cspl)==1: # 'Singapore','Austria',...
            to_join.append('(%s)%s'
                           % (doot(c[0]), doot(c[1:])))
        else: # 'Den LMM','LMM Den',....
            to_join.append('(%s) +%s'
                           % (doot(cspl[0]),
                              doot(cspl[1].strip(' ').lower())))
    pattern = '|'.join(to_join).join('()')
    return re.compile(pattern,re.I)

def code(X,CNTR,r = regexize):
    r = regexize(CNTR)
    for ma in r.finditer(X):
        beg = ma.group(1).split(' ')[0]
        if beg==ma.group(1):
            GV = countries[beg[0]+beg[1:].replace('.','').lower()] \
                 if beg[0].upper()==beg[0] else '- bad match -'
        else:
            try:
                k = next(ki for ki in countries
                         if beg.replace('.','')==ki.split(' ')[0])
                GV = countries[k]
            except StopIteration:
                GV = '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)

countries = {'Singapore':'SG','Austria':'AU',
             'Swiss':'CH','Chile':'CL',
             'Den LMM':'DN','LMM Den':'LM'}
s = (' Singapore SIngapore SiNgapore SinGapore'
     ' SI.Ngapore SIngaPore SinGaporE SinGAPore'
     ' SINGaporE SiNg.aPoR singapore sIngapore'
     ' siNgapore sinGapore sINgap.ore sIngaPore'
     ' sinGaporE sinGAPore sINGaporE siNgaPoRe'
     ' Austria Aus.trIA aUSTria AUSTRiA'
     ' Den L.M.M Den Lm.M DEn Lm.M.'
     ' DEN L.MM De.n L.M.M. Den LmM'
     ' L.MM DEn LMM DeN LM.m Den')
print('\n')
print('\n'.join(res for res in code(s,countries)))
EDIT 2
I improved the code. It's shorter and more readable.
The assert(...) statement verifies that the keys of the dictionary are well formed for the purpose.
import re

def doot(x):
    # Allow an optional dot after every character
    return r'\.?'.join(ch for ch in x) + r'\.?'

def regexize(labels,doot=doot,
             wg2 = '(%s) *( %s)',wnog2 = '(%s)(%s)',
             ri = re.compile(r'(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
    to_join = []
    modlabs = {}
    for K in labels:
        g1,g2,g3 = ri.match(K).groups()
        to_join.append((wg2 if g2 else wnog2)
                       % (doot(g1), doot(g3.lower())))
        modlabs[g1+g2+g3.lower()] = labels[K]
    return (re.compile('|'.join(to_join), re.I), modlabs)

def code(X,labels,regexize = regexize):
    reglab,modlabs = regexize(labels)
    for ma in reglab.finditer(X):
        a,b = tuple(x for x in ma.groups() if x)
        k = (a + b.lower()).replace('.','')
        GV = modlabs[k] if k in modlabs else '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(a+b, GV)

countries = {'Singapore':'SG','Austria':'AU',
             'Swiss':'CH','Chile':'CL',
             'Den LMM':'DN','LMM Den':'LM'}
assert(all('.' not in k and
           (k.count(' ')==1 or k[0].upper()==k[0])
           for k in countries))
s = (' Singapore SIngapore SiNgapore SinGapore'
     ' SI.Ngapore SIngaPore SinGaporE SinGAPore'
     ' SINGaporE SiNg.aPoR singapore sIngapore'
     ' siNgapore sinGapore sINgap.ore sIngaPore'
     ' sinGaporE sinGAPore sINGaporE siNgaPoRe'
     ' Austria Aus.trIA aUSTria AUSTRiA'
     ' Den L.M.M Den Lm.M DEn Lm.M.'
     ' DEN L.MM De.n L.M.M. Den LmM'
     ' L.MM DEn LMM DeN LM.m Den')
print('\n'.join(res for res in code(s,countries)))
You could write a simple lambda to generate the ugly-but-all-re solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
...                                                   (c.upper(),c.lower())
...                                                   for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print(leading_caps_re('Kuala Lumpur'))
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore") , st):
    ...etc...
and the ugliness of the RE would be purely internal.
Interestingly,
/((S)((?i)ingapore))/
does the right thing in Perl but doesn't seem to work as needed in Python. To be fair, the Python docs spell it out clearly: (?i) alters the whole regexp.
This is the BEST answer:
(?-i:S)(?i)ingapore
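In Python the same idea can be written with a scoped inline flag, assuming Python 3.6 or later: S(?i:ingapore) makes only the group case-insensitive. To my knowledge, a global (?i) that is not at the very start of the pattern is deprecated from 3.6 and rejected from 3.11, so the pattern above would need to be rewritten for current versions. A minimal sketch:

```python
import re

# Only the part inside (?i:...) is matched case-insensitively; the leading S is not.
pattern = re.compile(r"S(?i:ingapore)")

for candidate in ["Singapore", "sIngapore", "SinGapore", "SINGAPORE"]:
    outcome = "matched" if pattern.search(candidate) else "notmatched"
    print(candidate, "=>", outcome)
```

This reproduces the four expected results listed in the question without spelling out a character class for every letter.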
