Return first word of the string - python

I have data like below:
Colindale London
London Borough of Bromley
Crystal Palace, London
Bermondsey, London
Camden, London
This is my code:
def clean_whitespace(s):
    out = str(s).replace(' ', '')
    return out.lower()
My code currently just returns the string with the whitespace removed. How can I select the first word of the string? For example:
Crystal Palace, London -> crystal-palace
Bermondsey, London -> bermondsey
Camden, London -> camden

You can try this code:
s = 'Bermondsey, London'
def clean_whitespace(s):
    out = str(s).split(',', 1)[0]
    out = out.strip()
    out = out.replace(' ', '-')
    return out.lower()
print(clean_whitespace(s))
Output:
bermondsey

Try this:
s = "Crystal Palace, London"
output = s.split(',')[0].replace(' ', '-').lower()
print(output)
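Both answers above boil down to the same steps: keep the text before the first comma, replace spaces with hyphens, and lowercase the result. A minimal sketch applying this to the whole sample list follows; the function name first_part_slug is hypothetical, and the behaviour for entries without a comma (the whole string is kept) is an assumption, since the question does not say what those should become.

```python
def first_part_slug(s):
    # Keep only the text before the first comma (or everything if there is none).
    head = str(s).split(",", 1)[0]
    # split() with no arguments drops leading/trailing/repeated whitespace.
    return "-".join(head.split()).lower()


places = [
    "Crystal Palace, London",
    "Bermondsey, London",
    "Camden, London",
    "Colindale London",  # no comma: assumed to be kept whole
]
slugs = [first_part_slug(p) for p in places]
print(slugs)
# ['crystal-palace', 'bermondsey', 'camden', 'colindale-london']
```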

Related

Named Entity Extraction

I am trying to extract a list of persons using the Stanford Named Entity Recognizer (NER) in Python NLTK. The code and the obtained output are shown below.
Code
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
sent = 'joel thompson tracy k smith new work world premierenew york philharmonic commission'
strin = sent.title()
value = st.tag(strin.split())
def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk
named_entities = get_continuous_chunks(value)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
print(named_entities_str)
Obtained Output
[('Joel Thompson Tracy K Smith New Work World Premierenew York Philharmonic Commission',
'PERSON')]
Desired Output
Person 1: Joel Thompson
Person 2: Tracy K Smith
Data : New Work World Premierenew York Philharmonic Commission

Remove item from list Python

I have a document containing a list. When I write the document it prints
['γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece']
['australia']
[]
['brasil']
[]
['canada']
[]
['españa']
[]
What I want is to remove the [] characters. So far I've done the following.
for file_name in list_of_files:
    with open(file_name, 'r', encoding="utf-8") as inf:
        lst = []
        for line in inf:
            #special characters removal
            line = line.lower()
            line = re.sub('\W+',' ', line )
            line = word_tokenize(line)
            #stopwords removal
            line = ' '.join([word for word in line if word not in stopwords_dict])
            line = line.split('\n')
            line = list(filter(None, line))
            lst.append(line)
    inf.close()
Which removes some '' from inside the empty [], which seems reasonable. I have tried several approaches such as strip, remove() and [x for x in strings if x] without success. I am rather inexperienced, what am I missing?
update:
the initial text looks like this
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
Update:
And I am writing the list to a file like this
for line in lst:
    outf.write("%s\n" % line)
outf.close()
It looks like you're writing the list objects themselves after appending everything from the file. If you want to print the items of a list in Python without the surrounding '[' and ']', just loop over each item and write it like so:
for item in items:
    outf.write("%s\n" % item)
and for a list of lists:
for items in lists:
    for item in items:
        outf.write("%s\n" % item)
If your output line always contains exactly one '[' and one ']', you can get what's in between with something like this (it assumes each line is a string):
for line in lst:
    open_split = line.split('[')
    # split() always returns at least one element, so check for a second one
    after_open = open_split[1] if len(open_split) > 1 else ""
    closed_split = after_open.split(']')
    in_between_brackets = closed_split[0]
    outf.write("%s\n" % in_between_brackets)
A shorthand but more fragile version of the split method above can be written like so:
for line in lst:
    outf.write("%s\n" % line.split('[')[1].split(']')[0])
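Since each entry of lst in the question is itself a list of strings, an alternative to splitting on brackets is to join each inner list before writing it. The following is only a sketch under that assumption: the sample data is made up to mirror the question's output, and io.StringIO stands in for the real output file.

```python
import io

# Sample data shaped like the question's `lst`: a list of lists of strings.
lst = [["γύρισε στις φυλακές"], ["australia"], [], ["brasil"], []]

outf = io.StringIO()  # stand-in for the real output file (assumption)
for line in lst:
    if not line:  # skip empty lists so that no "[]" or blank line is written
        continue
    outf.write("%s\n" % " ".join(line))

written = outf.getvalue()
print(written)
```

Joining the inner list writes its contents rather than its repr, so no brackets or quotes reach the file.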
If the expected result after cleaning looks like this
γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece
australia
brasil
canada
españa
Then this code below would help you.
import re
with open('original.txt') as f:
    data = f.read()
with open('cleaned.txt', 'w') as f:
    # Remove chars like [, ] and '
    result = re.sub(r"\[|\]|'", '', data)
    # Remove the extra lines (replace 2 \n by 1).
    result = re.sub('\\n\\n', '\\n', result)
    f.write(result)
After you remove the stop words you likely don't want to:
line = line.split('\n')
line = list(filter(None, line))
You likely want to inspect what is left and just continue if it is "nothing"
import re # mocking for NLTK
stopwords_dict = {
"huffpost": True
}
text_in = '''
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
'''
'''
This emulates NLTK.word_tokenize
'''
def word_tokenize(text):
    return re.sub(r'[^\w\s]', '', text).split()
lst = []
for line in text_in.splitlines():
    line = line.lower()
    line = re.sub(r'\W+',' ', line )
    line_tokens = word_tokenize(line)
    line_tokens = [token for token in line_tokens if token not in stopwords_dict]
    # after cleaning a line, if there is nothing left skip
    if not line_tokens:
        continue
    line = ' '.join(line_tokens)
    lst.append(line)
with open("file_out.txt", "w", encoding='utf-8') as file_out:
    for line in lst:
        file_out.write("%s\n" % line)
This will give you a file with the contents of:
εκτέλεσαν τον δημοσιογράφο γιώργο καραϊβάζ στον άλιμο greece
australia
brasil
canada
españa
france
ελλάδα greece
india
italia
日本 japan
한국 korea
québec en français
united kingdom
united states
ελλάδα greece
Which is what you are hoping for (I think).
I managed to remove the empty lines by changing:
line = list(filter(None, line))
lst.append(line)
to
lst.append(line)
lst = list(filter(None, lst))
Apologies, it was a trivial mistake, thank you for your answers.
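To make the difference concrete: filter(None, ...) drops every falsy element, and an empty list is falsy. Applied to each inner list it only removes empty strings, while applied to the outer list it removes the empty lists themselves. A small sketch with made-up sample data:

```python
# Sample data shaped like the question's `lst` (assumption).
lst = [["australia"], [], ["brasil"], []]

# Empty lists are falsy, so filter(None, ...) on the outer list removes them.
non_empty = list(filter(None, lst))
print(non_empty)  # [['australia'], ['brasil']]
```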

Is there a better way to find specific value in a python dictionary like in list?

I have been practicing on iterating through dictionary and list in Python.
The source file is a csv document containing Country and Capital. It seems I had to go through 2 for loops for country_dict in order to produce the same print result for country_list and capital_list.
Is there a better way to do this in Python dictionary?
The code:
import csv
path = #Path_to_CSV_File
country_list=[]
capital_list=[]
country_dict={'Country':[],'Capital':[]}
with open(path, mode='r') as data:
    for line in csv.DictReader(data):
        locals().update(line)
        country_dict['Country'].append(Country)
        country_dict['Capital'].append(Capital)
        country_list.append(Country)
        capital_list.append(Capital)
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Iterating through Dictionary using for loops---------------------------
if i >= (len(country_dict['Country'])-1):
    print("out of bound")
for count1, element in enumerate(country_dict['Country']):
    if count1==i:
        print('Country = ' + element)
for count2, element in enumerate(country_dict['Capital']):
    if count2==i:
        print('Capital = ' + element)
#--------------------------------Direct print for list----------------------------------------
print('Country = ' + country_list[i] + '\nCapital = ' + capital_list[i])
The output:
Country = Djibouti
Capital = Djibouti (city)
Country = Djibouti
Capital = Djibouti (city)
The CSV file content:
Country,Capital
Algeria,Algiers
Angola,Luanda
Benin,Porto-Novo
Botswana,Gaborone
Burkina Faso,Ouagadougou
Burundi,Gitega
Cabo Verde,Praia
Cameroon,Yaounde
Central African Republic,Bangui
Chad,N'Djamena
Comoros,Moroni
"Congo, Democratic Republic of the",Kinshasa
"Congo, Republic of the",Brazzaville
Cote d'Ivoire,Yamoussoukro
Djibouti,Djibouti (city)
Egypt,Cairo
Equatorial Guinea,"Malabo (de jure), Oyala (seat of government)"
Eritrea,Asmara
Eswatini (formerly Swaziland),"Mbabane (administrative), Lobamba (legislative, royal)"
Ethiopia,Addis Ababa
Gabon,Libreville
Gambia,Banjul
Ghana,Accra
Guinea,Conakry
Guinea-Bissau,Bissau
Kenya,Nairobi
Lesotho,Maseru
Liberia,Monrovia
Libya,Tripoli
Madagascar,Antananarivo
Malawi,Lilongwe
Mali,Bamako
Mauritania,Nouakchott
Mauritius,Port Louis
Morocco,Rabat
Mozambique,Maputo
Namibia,Windhoek
Niger,Niamey
Nigeria,Abuja
Rwanda,Kigali
Sao Tome and Principe,São Tomé
Senegal,Dakar
Seychelles,Victoria
Sierra Leone,Freetown
Somalia,Mogadishu
South Africa,"Pretoria (administrative), Cape Town (legislative), Bloemfontein (judicial)"
South Sudan,Juba
Sudan,Khartoum
Tanzania,Dodoma
Togo,Lomé
Tunisia,Tunis
Uganda,Kampala
Zambia,Lusaka
Zimbabwe,Harare
I am not sure I understand your question, but please check out the code below.
import csv
path = #Path_to_CSV_File
country_dict={}
with open(path, mode='r') as data:
    lines = csv.DictReader(data)
    for idx,line in enumerate(lines):
        # store each row as its own dict, keyed by its row index
        country_dict[idx] = {"Country":line["Country"],"Capital":line["Capital"]}
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Direct lookup in the dictionary by row index ---------------------------
country_info = country_dict.get(i)
#--------------------------------Print the row that was found----------------------------------
print('Country = ' + country_info['Country'] + '\nCapital = ' + country_info['Capital'])
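The for loops in the question are not actually needed: the values stored under each key are ordinary lists, so they can be indexed directly, and csv.DictReader already yields one dict per row. A minimal sketch, assuming a small inline CSV sample (read through io.StringIO instead of a real file) and a hypothetical index i:

```python
import csv
import io

# A few rows from the question's CSV, inlined so the sketch is self-contained.
csv_text = "Country,Capital\nAlgeria,Algiers\nAngola,Luanda\nBenin,Porto-Novo\n"

# Each row becomes a dict such as {'Country': 'Angola', 'Capital': 'Luanda'}.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# The dict-of-lists layout from the question, built without locals().update().
country_dict = {
    "Country": [row["Country"] for row in rows],
    "Capital": [row["Capital"] for row in rows],
}

i = 1  # hypothetical row index (0-based)
if 0 <= i < len(rows):
    # Index the lists stored in the dict directly; no loop is required.
    message = ("Country = " + country_dict["Country"][i]
               + "\nCapital = " + country_dict["Capital"][i])
else:
    message = "out of bound"
print(message)
```

Reading the values from each row dict by key also avoids relying on locals().update(), which only appears to work at module level.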

Python: Parse a list of strings into a dictionary

This is somewhat complicated. I have a list that looks like this:
['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
In my list, the '\n' is what separates one story from the next. What I would like to do is create a dictionary from the above list that looks like this:
dict = {ID1: [19841018, 'Plunging oil... cut in the price'], ID2: [19841018, 'The U.S. dollar... the foreign-exchange markets']}
You can see that the key of my dictionary is the ID, and the values are the date and the combined text of the story. Is that doable?
My IDs are in this format: J00100394, J00384932. So they all start with J00.
The tricky part is splitting your list on a given value, so I took that part from here. Then I parsed the list parts to build the res dict.
>>> import itertools
>>> def isplit(iterable,splitters):
...     return [list(g) for k,g in itertools.groupby(iterable,lambda x:x in splitters) if not k]
...
>>> l = ['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
>>> res = {}
>>> for sublist in isplit(l,('\n',)):
...     id_parts = sublist[0].split()
...     story = ' '.join (sentence.strip() for sentence in sublist[1:])
...     res[id_parts[1].strip()] = [id_parts[0].strip(), story]
...
>>> res
{'ID2': ['19841018', 'The U.S. dollar... the foreign-exchange markets late New York trading'], 'ID1': ['19841018', 'Plunging oil... cut in the price']}
I wrote an answer that uses a generator. The idea is that every time an ID token starts, the generator returns the last entry computed. You can customize it by changing check_fun() and the way the parts of the description are combined.
def trailing_carriage(s):
    if s.endswith('\n'):
        return s[:-1]
    return s
def check_fun(s):
    """
    :param s: Take a string s
    :return: None if s doesn't match the ID rules. Otherwise return the
    name,value of the token
    """
    if ' ' in s:
        id_candidate,name = s.split(" ",1)
        try:
            return trailing_carriage(name),int(id_candidate)
        except ValueError:
            pass
def parser_list(tokens, check_id_prefix=check_fun):
    name = None #key dict
    id_val = None
    desc = "" #description string
    for token in tokens:
        check = check_id_prefix(token)
        if check is not None:
            if name is not None:
                # Return the previously computed entry
                yield name,id_val,desc
            name,id_val = check
            desc = "" # start a fresh description for the new entry
        else:
            # Append the description
            desc += trailing_carriage(token)
    if name is not None:
        # Flush the last entry
        yield name,id_val,desc
>>> lines = ['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n', '19841018 ID2\n', ' The U.S. dollar... \n', 'the foreign-exchange markets \n', 'late New York trading \n', '\n']
>>> print {k:[i,d] for k,i,d in parser_list(lines)}
{'ID2': [19841018, ' The U.S. dollar... the foreign-exchange markets late New York trading '], 'ID1': [19841018, ' Plunging oil... cut in the price ']}
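For comparison, here is a compact Python 3 sketch of the same parsing idea using a plain loop. It assumes that every story starts with a header line of exactly two whitespace-separated fields (date, then ID) and ends with a blank line; the function name parse_stories is hypothetical.

```python
def parse_stories(lines):
    """Build {story_id: [date, text]} from header/body/blank-line groups."""
    result = {}
    current_id = None
    for raw in lines:
        text = raw.strip()
        if not text:
            # A blank line closes the current story.
            current_id = None
            continue
        if current_id is None:
            # First line of a story: "<date> <id>" (assumed format).
            date, current_id = text.split()
            result[current_id] = [int(date), ""]
        else:
            # Body line: append it to the story text, separated by one space.
            entry = result[current_id]
            entry[1] = (entry[1] + " " + text).strip()
    return result


data = ['19841018 ID1\n', ' Plunging oil... \n', 'cut in the price \n', '\n',
        '19841018 ID2\n', ' The U.S. dollar... \n',
        'the foreign-exchange markets \n', 'late New York trading \n', '\n']
stories = parse_stories(data)
print(stories)
```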

Python RE ( In a word to check first letter is case sensitive and rest all case insensitive)

In the case below I want to match the string "Singapore", where "S" must always be capital and the rest of the letters may be lowercase or uppercase. But in the string below "s" is lowercase and it still gets matched by the search condition. Can anybody let me know how to implement this?
import re
st = "Information in sinGapore "
if re.search("S""(?i)(ingapore)" , st):
    print "matched"
Singapore => matched
sIngapore => not matched
SinGapore => matched
SINGAPORE => matched
As commented, the Ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be matching Singapore case-insensitive, and then checking that the first letter is S:
>>> reg=re.compile("singapore", re.I)
>>> s="Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
False
>>> s="Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
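The two-step check above can be wrapped in a small helper. This is a Python 3 sketch, not the answerer's exact code: the function name has_capital_s_match is hypothetical, and it uses finditer so that a later occurrence starting with "S" still counts even when an earlier occurrence starts with "s" (the snippets above only inspect the first match).

```python
import re

reg = re.compile("singapore", re.I)


def has_capital_s_match(text):
    """Return True if any case-insensitive match of 'singapore' starts with 'S'."""
    return any(m.group().startswith("S") for m in reg.finditer(text))


print(has_capital_s_match("Information in sinGapore"))  # False
print(has_capital_s_match("Information in SinGapore"))  # True
```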
Since you want to set a GV code according to the caught phrase (a single name or several names separated by blanks, as I understand it), there must be a step in which the code is chosen from a dictionary according to the caught phrase.
So it's easy to take advantage of this step to perform the test on the first letter (which must be uppercase) or on the first name in the phrase, which no regex is capable of.
I chose certain conditions to make up the test. For example, a dot in a first name is not mandatory, but uppercase letters are. These conditions can easily be changed as needed.
EDIT 1
import re
def regexize(cntry):
    def doot(x):
        return '\.?'.join(ch for ch in x) + '\.?'
    to_join = []
    for c in cntry:
        cspl = c.split(' ',1)
        if len(cspl)==1: # 'Singapore','Austria',...
            to_join.append('(%s)%s'
                           % (doot(c[0]), doot(c[1:])))
        else: # 'Den LMM','LMM Den',....
            to_join.append('(%s) +%s'
                           % (doot(cspl[0]),
                              doot(cspl[1].strip(' ').lower())))
    pattern = '|'.join(to_join).join('()')
    return re.compile(pattern,re.I)
def code(X,CNTR,r = regexize):
    r = regexize(CNTR)
    for ma in r.finditer(X):
        beg = ma.group(1).split(' ')[0]
        if beg==ma.group(1):
            GV = countries[beg[0]+beg[1:].replace('.','').lower()] \
                 if beg[0].upper()==beg[0] else '- bad match -'
        else:
            try:
                k = (ki for ki in countries.iterkeys()
                     if beg.replace('.','')==ki.split(' ')[0]).next()
                GV = countries[k]
            except StopIteration:
                GV = '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'
print '\n'.join(res for res in code(s,countries))
EDIT 2
I improved the code. It's shorter and more readable.
The assert(...) statement verifies that the keys of the dictionary are well formed for the purpose.
import re
def doot(x):
    return '\.?'.join(ch for ch in x) + '\.?'
def regexize(labels,doot=doot,
             wg2 = '(%s) *( %s)',wnog2 = '(%s)(%s)',
             ri = re.compile('(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
    to_join = []
    modlabs = {}
    for K in labels.iterkeys():
        g1,g2,g3 = ri.match(K).groups()
        to_join.append((wg2 if g2 else wnog2)
                       % (doot(g1), doot(g3.lower())))
        modlabs[g1+g2+g3.lower()] = labels[K]
    return (re.compile('|'.join(to_join), re.I), modlabs)
def code(X,labels,regexize = regexize):
    reglab,modlabs = regexize(labels)
    for ma in reglab.finditer(X):
        a,b = tuple(x for x in ma.groups() if x)
        k = (a + b.lower()).replace('.','')
        GV = modlabs[k] if k in modlabs else '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(a+b, GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
assert(all('.' not in k and
(k.count(' ')==1 or k[0].upper()==k[0])
for k in countries))
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'.join(res for res in code(s,countries))
You could write a simple lambda to generate the ugly-but-all-re-solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
...                                                    (c.upper(),c.lower())
...                                                    for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print leading_caps_re('Kuala Lumpur')
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore") , st):
...etc...
and the ugliness of the RE would be purely internal.
Interestingly,
/((S)((?i)ingapore))/
does the right thing in Perl but doesn't seem to work as needed in Python. To be fair, the Python docs spell it out clearly: (?i) alters the whole regexp.
This is the BEST answer:
(?-i:S)(?i)ingapore
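A note for current Python versions: scoped inline flags of the form (?i:...) are available from Python 3.6 onward, and placing a global flag such as (?i) anywhere other than the start of the pattern has been deprecated since 3.6 and is rejected from Python 3.11. Under those conditions the same intent can be written with the case-insensitive part scoped to a group. A minimal sketch, with sample inputs taken from the question:

```python
import re

# "S" is matched case-sensitively; only the "(?i:...)" group ignores case.
pattern = re.compile(r"S(?i:ingapore)")

samples = ["Singapore", "sIngapore", "SinGapore", "SINGAPORE"]
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
# {'Singapore': True, 'sIngapore': False, 'SinGapore': True, 'SINGAPORE': True}
```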
