I am trying to extract the relevant address info from a string and discard the garbage.
So this:
al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos)
Should be:
Avenida de Burgos, 109 - 28050 Madrid
What I've done:
I am using Stanza NER to find locations in the text.
After that, I use the indexes of the found entities to extract the full address.
E.g., if Madrid (a Spanish city) is found at text[120:128], I extract text[60:143] (60 characters before the entity and 15 after it) to get the full address.
My current code is:
##
##STANZA NER FOR LOCATIONS
##
!pip install stanza
#Download the spanish model
import stanza
stanza.download('es')
#create and run the ner tagger
nlp = stanza.Pipeline(lang='es', processors='tokenize,ner')
text = 'al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos) '
doc = nlp(text)
#print results of the NER tagger, one entity per line
print(*[ent for ent in doc.ents if ent.type == "LOC"], sep='\n')
print(*[text[int(ent.start_char)-60:int(ent.end_char)+15] for ent in doc.ents if ent.type=="LOC"], sep='\n')
After this (this particular case should be reproducible), I get the following address:
cilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Aseso
Which contains extra "garbage" info: "cilio social de la Compañía, " at the start and " (Indicar Aseso" at the end.
In the next part of the process, I am using the libpostal library to parse the address, as follows:
!pip install postal
from postal.parser import parse_address
parse_address('Avenida de Burgos, 109 - 28050 Madrid')
Which works reliably in most cases, but only with clean addresses.
[('avenida de burgos', 'road'),
('109', 'house_number'),
('28050', 'postcode'),
('madrid', 'city')]
So, to sum up, I am looking for a technique other than regex to help me discard the garbage info from addresses (libraries that do this, if they exist, or another NLP approach...).
Thanks
For US address extraction from bulk text:
For US addresses in bulk text I have pretty good luck, though not perfect, with the regex below. It won't work on many oddly formatted addresses, and it only captures the first 5 digits of the ZIP.
Explanation:
([0-9]{1,6}) - a string of 1 to 6 digits to start off (the street number)
(.{5,75}?) - any character, 5 to 75 times, lazily. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and the city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - matches the state. Assumes state names are in Title Case.
.{1,2} - accommodates the various separators between the state and the ZIP (a comma and/or a space)
([0-9]{5}) - captures the first 5 digits of the ZIP.
import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,6})(.{5,75}?)((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
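Putting the pieces together, here is a minimal runnable sketch (with a shortened state alternation for brevity; the full list from above drops into the same slot):

```python
import re

# Shortened state alternation for brevity; substitute the full list above.
address_regex = r"([0-9]{1,6})(.{5,75}?)(NY|CA|TX).{1,2}([0-9]{5})"

text = ("is an individual maintaining a residence at "
        "175 Fox Meadow, Orchard Park, NY 14127. 2. other,")

addresses = re.findall(address_regex, text)
# Each match is a tuple of capture groups: join them and collapse whitespace.
cleaned = [" ".join(" ".join(parts).split()) for parts in addresses]
print(cleaned)  # ['175 Fox Meadow, Orchard Park, NY 14127']
```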
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google's or Lob's. These can take a string and break it into parts. There are also some Python solutions for this, such as usaddress.
For outside the US
To port this to other locations, consider how it could be adapted to your region. If you have a finite number of states and a similar structure, look for someone who has already built the "state" regex for your country.
Related
I have texts like this:
['\n 2. Materials and Methods\n 2.1. Data Collection and Metadata Annotations\n \n We searched the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [15]']
I wish to split the string into paragraphs, meaning at least two \n in a row. I'm not sure that all occurrences of \n are separated by the same amount of whitespace.
How can I define a regex of the sort \n + multiple spaces + \n?
Thanks!
Split on \n + any amount of whitespace + \n:
l = re.split(r'\n\s*\n', l)
print (l)
This leaves the leading and trailing spaces from your input in place:
['\n 2. Materials and Methods\n 2.1. Data Collection and Metadata Annotations',
' We searched the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [15]']
but a quick strip will take care of that:
l = [par.strip() for par in re.split(r'\n\s*\n', l)]
print (l)
as it results in
['2. Materials and Methods\n 2.1. Data Collection and Metadata Annotations',
'We searched the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [15]']
A bonus effect of \s* is that runs of more than two successive \n characters are also treated as a single break, since the expression consumes as much whitespace as it can by default.
Maybe something like this?
>>> a = ['\n 2. Materials and Methods\n 2.1. Data Collection and Metadata Annotations\n \n We searched the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [15]']
>>> output = [i.strip() for i in a[0].split('\n') if i.strip() != '']
>>> output
['2. Materials and Methods', '2.1. Data Collection and Metadata Annotations', 'We searched the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database [15]']
I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'Preface' instead of 'Preface2', and so on.
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however, removes the words as well, as seen here:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and substitute the whole match with the captured part:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
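A quick check on the first line of the sample text:

```python
import re

line = "Preface2 Contributors4 Abrreviations5 Acknowledgements8"
# Re-emit only the captured letters, dropping the digits that follow them.
print(re.sub(r"([A-Za-z]+)\d+", r"\1", line))
# Preface Contributors Abrreviations Acknowledgements
```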
You could try a lookbehind assertion to check for a word character before your numbers, plus a word boundary (\b) to force your regex to only match numbers at the end of a word. Note that Python's re module only supports fixed-width lookbehinds, so use (?<=\w) rather than (?<=\w+):
re.sub(r'(?<=\w)\d+\b', '', line)
Hope this helps
EDIT:
Sorry about the glitch, mentioned in the comments, of matching (parts of) numbers that are NOT preceded by words as well. That is because (sorry again) \w matches alphanumeric characters, not only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to only check for English alphabetic characters (you can add characters to the [a-zA-Z] class) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT a digit (\d) or whitespace (\s) before your desired number. Note that this will also match punctuation marks.
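To illustrate the positive version on a small, made-up sample line:

```python
import re

line = "Vietnam26 Chapter 5 Park 24"
# Digits are removed only when directly preceded by a letter;
# standalone numbers like "5" and "24" are kept intact.
print(re.sub(r'(?<=[a-zA-Z])\d+\b', '', line))  # Vietnam Chapter 5 Park 24
```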
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 will match the word, \\2 the number. See: How to use python regex to replace using captured group?
Below, I'm proposing a working sample of code that might solve your problem.
Here's the snippet:
import re

# A function that takes the text data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing number from words ending with digits."""
    # First, let's construct a pattern matching the words we're looking for.
    pattern1 = r"[A-Za-z]+\d+"
    # Let's construct another pattern matching the trailing digits we will
    # strip from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words.
    matches = re.findall(pattern1, data)
    # Construct the replacement for each word by stripping its digits.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, "", match))
    # Pair up (word, replacement) for use with the string method 'replace'.
    changers = zip(matches, replacements)
    # Now replace every matched word. Note that str.replace returns a new
    # string, so the result must be assigned back.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done; we can return the result.
    return output
For test purposes, we run the above function with your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
You can also remove digits with a character class:
re.sub(r"[0-9]", "", line)
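Note that this removes every digit, including standalone numbers that the other answers keep. A quick check on a made-up line:

```python
import re

line = "Preface2 Chapter 5 Park 24"
# Every digit goes, whether or not it follows a word.
print(re.sub(r"[0-9]", "", line))  # 'Preface Chapter  Park '
```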
I have this text
'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New
York, NY 12345. Can you contact him now? If you need any help, call
me on 12345678'''
How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER tagger, which gives me only "New York" as a location. How can I solve this?
Definitely regular expressions :)
Something like
import re
txt = ...
regexp = "[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
# address = ['44 West 22nd Street, New York, NY 12345']
Explanation:
[0-9]{1,3}: 1 to 3 digits, the street number
(space): a space between the number and the street name
.+: street name, any character, any number of times
, : a comma and a space before the city
.+: city, any character, any number of times
, : a comma and a space before the state
[A-Z]{2}: exactly 2 uppercase letters, the state abbreviation
(space): a space before the ZIP
[0-9]{5}: 5 digits, the ZIP
re.findall(expr, string) returns a list with all the occurrences found.
Pyap works best, not just for this particular example but also for other addresses contained in texts:
import pyap

text = ...
addresses = pyap.parse(text, country='US')
Check out libpostal, a library dedicated to parsing and normalizing addresses.
It cannot extract an address from raw text, but it may help with related tasks.
I have a database from which I read. I want to identify the language in a specific cell, defined by column. I am using the langid library for python.
I read from my database like this:
import sqlite3
import langid

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute('''SELECT tags FROM sometable''')
for row in selecter:  # iterate through all the rows in the db
    # print(type(row))  # tuple
    rf = str(row)
    # print(type(rf))  # string
    lan = langid.classify("{}".format(rf))
Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.
So, now comes the weird part.
I wanted to double check some results manually. So I have these words:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
When I perform the language identification on the database, it writes Portuguese into the database.
But, performing it like this:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)
Well, that returns French. Apart from the fact that the text is neither French nor Portuguese, why does it return different results?
In the loop, row is bound to a tuple with a single item, i.e. ('tags',), where 'tags' stands for the list of words. str(row) therefore (in Python 3) returns "('tags',)", and it is this string (including the quotes, comma and parentheses) that is passed to langid.classify(). If you are using Python 2, the string becomes "(u'tags',)".
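A minimal sketch of the difference, using an in-memory table with hypothetical data; indexing into the tuple with row[0] yields the raw text that should go to langid.classify():

```python
import sqlite3

# In-memory stand-in for the database in the question (hypothetical data).
connector = sqlite3.connect(":memory:")
selecter = connector.cursor()
selecter.execute("CREATE TABLE sometable (tags TEXT)")
selecter.execute("INSERT INTO sometable VALUES ('shadow party people')")
selecter.execute("SELECT tags FROM sometable")
for row in selecter:
    print(str(row))  # ('shadow party people',)  <- tuple repr, with quotes and parens
    print(row[0])    # shadow party people      <- the raw text
```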
Now, I am not sure that this explains the different language detection result (my testing in Python 2 suggests it doesn't), but it is an obvious difference between the database-sourced data and the plain string data.
Possibly there is some encoding issue coming into play. How is the data stored in the database? Text data should be UTF-8 encoded.
I have a list of names (actually, authors) stored in a sqlite database. Here is an example:
João Neres, Ruben C. Hartkoorn, Laurent R. Chiarelli, Ramakrishna Gadupudi, Maria Rosalia Pasca, Giorgia Mori, Alberto Venturelli, Svetlana Savina, Vadim Makarov, Gaelle S. Kolly, Elisabetta Molteni, Claudia Binda, Neeraj Dhar, Stefania Ferrari, Priscille Brodin, Vincent Delorme, Valérie Landry, Ana Luisa de Jesus Lopes Ribeiro, Davide Farina, Puneet Saxena, Florence Pojer, Antonio Carta, Rosaria Luciani, Alessio Porta, Giuseppe Zanoni, Edda De Rossi, Maria Paola Costi, Giovanna Riccardi, Stewart T. Cole
It's a string. My goal is to write an efficient "analyser" of names, so I basically perform a LIKE query:
' ' || replace(authors, ',', ' ') || ' ' LIKE '{0}'.format(my_string)
I basically replace all the commas with a space, and insert a space at the end and at the beginning of the string. So if I look for:
% Rossi %
I'll get all the items where one of the authors has "Rossi" as a family name: "Rossi", not "Rossignol" or "Trossi".
It's an efficient way to look for an author with his family name, because I'm sure the string stored in the database contains the family names of the authors, unaltered.
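The query described above can be sketched end-to-end like this (in-memory database, hypothetical data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE papers (authors TEXT)")
cur.execute("INSERT INTO papers VALUES ('Edda De Rossi, Stewart T. Cole')")
cur.execute("INSERT INTO papers VALUES ('Jean-Philippe Rossignol')")
# Pad with spaces and turn commas into spaces so that ' Rossi '
# only matches a whole family name.
query = ("SELECT authors FROM papers "
         "WHERE ' ' || replace(authors, ',', ' ') || ' ' LIKE ?")
print(cur.execute(query, ("% Rossi %",)).fetchall())
# [('Edda De Rossi, Stewart T. Cole',)]
```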
But the main problem lies here: "Rossi" is, for example, a very common family name. So if I want to look for a very particular person, I will add his first name. Let's assume it is "Jean-Philippe". "Jean-Philippe" can be stored in the database under many forms: "J.P Rossi", "Jean-Philippe Rossi", "J. Rossi", "Jean P. Rossi", etc.
So I tried this:
% J%P Rossi %
But of course, it matches everything containing a J, then a P, and finally "Rossi", so it matches the string I gave as an example ("Edda De Rossi").
So I wonder if there is a way to cut the string in the query, on a delimiter, and then match each piece against the search pattern.
Of course I'm open to any other solution. My goal is to match the search pattern against each author name.