Use regex to extract unit number - python

I have a list of descriptions and I want to extract the unit information using a regular expression.
I watched a video on regex and here's what I got:
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit = []
for item in x:
    extract = re.findall(r'[0-9]+.unit', item)
    unit.append(extract)
print(unit)
This works with strings that end in 'unit', but I also have strings that end with 'rental unit', 'apartment', 'bed' and others, as in this example.
I could do this with multiple regexes, but is there a way to do it within one regex?
Thanks!

As long as you're not afraid of making a hideously long regex, you could use something along the lines of:
compiled_re = re.compile(r"(\d+)-unit|(\d+)\srental unit|(\d+)\sbed|(\d+)\sapartment")
unit = []
for item in x:
    extract = re.findall(compiled_re, item)
    unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible way of referring to unit numbers. Because each alternative has its own capture group, re.findall returns one tuple per match, with empty strings for the groups that did not take part. Unfortunately, if the entries are very inconsistent, this approach becomes basically unusable.
Also, might I suggest using a regex tester like Regex101? It really helps to determine whether your regex will do what you want.
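A way to keep everything in one pattern without a separate capture group per keyword is to capture the number once and put the keywords in a non-capturing alternation. The following is only a minimal sketch: it assumes the number is followed by a space or hyphen, an optional "rental", and then one of a small keyword set. The name UNIT_RE and the keyword list are illustrative and would need extending for real data.

```python
import re

x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
     "265 rental units",
     "10 stories and contain 200 apartments",
     "801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
     "4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]

# One capture group for the number; the keywords live in non-capturing groups,
# so findall returns plain strings instead of tuples.
UNIT_RE = re.compile(r"(\d+)[\s-]+(?:rental\s+)?(?:unit|apartment|bed)")

unit = [UNIT_RE.findall(item) for item in x]
print(unit)  # [[], ['265'], ['200'], [], ['4']]
```

Note that "801 residential properties" is not picked up, because "properties" is not in the keyword list; whether it should count as a unit number is a judgment call for your data.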

Related

Find best matches of substring from list in corpus

I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the ""Buyer"") and AIRBUS
S.A.S. (the ""Seller"") have entered into a purchase agreement (the
""Agreement"") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos: for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult but I don't see a way of solving it that does not entail looping through both the corpus and list of matches, which could cause huge slowdowns.
You can concatenate your list l into a single regex, then use the fuzzy matching of the third-party regex module (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109) to match the names against the corpus.
Something like:
my_regex = ""
for pattern in l:
    my_regex += f'(?:{pattern}' + '{1<=e<=3})'  # {1<=e<=3} permits at least 1 and at most 3 errors
    my_regex += '|'
my_regex = my_regex[:-1]  # remove the last |
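If installing the regex module is not an option, a rough approximation is possible with the standard library's difflib. To be clear, this is a different technique from the fuzzy regex above: it scores a similarity ratio rather than counting errors, and it still loops over every word window, so it does not address the efficiency concern for large corpora. The helper name best_match and the cutoff value are assumptions for illustration.

```python
import difflib

corpus = (
    "LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun "
    "Airport, Guangzhou 510405, People's Republic of China Subject: Delays "
    'CHINA SOUTHERN AIRLINES COMPANY LIMITED (the "Buyer") and AIRBUS '
    'S.A.S. (the "Seller") have entered into a purchase agreement'
)
names = ['airbus', 'airbus internal', 'china southern airlines']

def best_match(name, text, cutoff=0.8):
    """Return the word window of `text` most similar to `name`, or None."""
    words = text.lower().split()
    n = len(name.split())
    # Every run of n consecutive words is a candidate for an n-word name.
    windows = sorted({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})
    hits = difflib.get_close_matches(name, windows, n=1, cutoff=cutoff)
    return hits[0] if hits else None

for name in names:
    print(name, "->", best_match(name, corpus))
```

Lowering the cutoff admits looser matches at the cost of more false positives.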

Fuzzy match and get index of a pattern from a string

I have a list of company names that I want to match against a list of sentences and get the index start and end position if a keyword is present in any of the sentences.
I wrote the code for matching the keywords exactly but realized that names in the sentences won't always be an exact match. For example, my keywords list can contain Company One Two Ltd but the sentences can be -
Company OneTwo Ltd won the auction
Company One Two Limited won the auction
The auction was won by Co. One Two Ltd and other variations
Given a company name, I want to find out the index start and end position even if the company name in the sentence is not an exact match but a variation. Below is the code I wrote for exact matching.
def find_index(texts, target):
    idxs = []
    for i, each_sent in enumerate(texts):
        add = [(m.start(0), m.end(0)) for m in re.finditer(target, each_sent)]
        if len(add):
            idxs.append([(i, m.start(0), m.end(0)) for m in re.finditer(target, each_sent)])
    return idxs
I can think of 2-3 possibilities, each with its own pros and cons:
1. Create a more versatile regex:
(Company|Co\.?)\s?One\s?Two\s?(Limited|Ltd)
2. Building on the previous suggestion, iterate through the company list and build a fuzzy search pattern for each name by substituting the common variations:
Company->(Company|Co\.?), ' '->\s?, Limited->(Limited|Ltd), etc.
3. Use a Levenshtein distance calculator, for example the external library fuzzywuzzy; there are alternative fuzzy-matching libraries as well.
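The substitution idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the VARIANTS table, build_pattern and find_spans are hypothetical names, and the table covers just the variations mentioned in the question.

```python
import re

# Hypothetical substitution table: known tokens map to a regex fragment
# that accepts their common variations.
VARIANTS = {
    "company": r"(?:Company|Co\.?)",
    "ltd": r"(?:Limited|Ltd\.?)",
    "limited": r"(?:Limited|Ltd\.?)",
}

def build_pattern(name):
    """Turn a company name into a pattern tolerant of the variations above."""
    parts = [VARIANTS.get(tok.lower(), re.escape(tok)) for tok in name.split()]
    # Spaces between tokens become optional, so "One Two" also matches "OneTwo".
    return re.compile(r"\s?".join(parts), re.IGNORECASE)

def find_spans(texts, name):
    """Return (sentence index, start, end) for every match of the name."""
    pattern = build_pattern(name)
    return [(i, m.start(), m.end())
            for i, sent in enumerate(texts)
            for m in pattern.finditer(sent)]

sentences = [
    "Company OneTwo Ltd won the auction",
    "Company One Two Limited won the auction",
    "The auction was won by Co. One Two Ltd and other variations",
]
spans = find_spans(sentences, "Company One Two Ltd")
for i, start, end in spans:
    print(i, start, end, repr(sentences[i][start:end]))
```

This handles known abbreviations but not arbitrary typos; for those, an edit-distance approach such as the third option is the better fit.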

matching content creating new column

Hello, I have a dataset where I want to match my keywords against the location. The problem is that the location "Afghanistan", "Kabul" or "Helmund" appears in my dataset in over 150 combinations, including spelling mistakes, different capitalization and a city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the character sequences "afg", "Afg", "kab" or "helm" is contained in the location. I am not sure if upper or lower case makes a difference.
For instance, there are hundreds of location combinations like these: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it is good if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
# how to make a column that shows rows with a certain keyword
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the rows flagged with 1 (the function returns the integer 1, not the string '1')
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the column "keyword_solution" flags every location that contains "Afg", "afg", "kab", "Kab", "kund" or "Kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0     True
1    False
2    False
3     True
4     True
5     True
Name: location, dtype: bool
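For the 1/0 column asked for in the question, the same substring idea can be applied directly to the location string without splitting it into words. This is a minimal sketch in plain Python; the function name keyword_flag, the shortened keyword list and the non-matching sample "Baghdad" are assumptions for illustration.

```python
# Short stems are enough because the check is a substring test.
keywords = ['afgh', 'kab', 'helm', 'kand', 'kunduz', 'herat',
            'mazar', 'jalalabad', 'lashkargah']

def keyword_flag(location):
    """Return 1 if any keyword occurs anywhere in the location, else 0."""
    text = str(location).lower()  # lowercase once, so case no longer matters
    return int(any(word in text for word in keywords))

locations = ["Jegdalak, Afghanistan", "Afghanistan,Ghazni♥",
             "Kabul/Afghanistan", "Baghdad"]
flags = [keyword_flag(loc) for loc in locations]
print(flags)  # [1, 1, 1, 0]
```

With pandas, the function should slot into the question's own call: taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_flag).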

How can I remove numbers that may occur at the end of words in a text

I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'Preface' instead of 'Preface2', and so on. I tried:
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however, removes the words as well, as seen here:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and substitute the whole match with that captured part. It is simply:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
You could try a lookbehind assertion to check for word characters before your numbers, plus a word boundary (\b) at the end to force your regex to only match numbers at the end of a word. Python's re module only accepts fixed-width lookbehinds, so a single \w is used:
re.sub(r'(?<=\w)\d+\b', '', line)
Hope this helps
EDIT:
Sorry about the glitch mentioned in the comments, where this also matches numbers that are NOT preceded by words. That is because (sorry again) \w matches alphanumeric characters instead of only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to only check for English alphabetic characters (you can add characters to the [a-zA-Z] class) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT \d (a digit) or \s (whitespace) before your desired number. This will also match numbers preceded by punctuation marks, though.
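As a quick check, here is a small sketch comparing the capture-group substitution with the letter-only lookbehind on a shortened sample line. The sample string is made up from fragments of the question's text.

```python
import re

line = "Preface2 Contributors4 Bach Ma National Park 24 Chapter 5 Vietnam28"

# Capture the letters and write them back, dropping the digits that follow.
by_group = re.sub(r"([A-Za-z]+)\d+", r"\1", line)

# Fixed-width lookbehind: delete digits only when a letter comes right before.
by_lookbehind = re.sub(r"(?<=[a-zA-Z])\d+\b", "", line)

print(by_group)
print(by_lookbehind)
# both print: Preface Contributors Bach Ma National Park 24 Chapter 5 Vietnam
```

The standalone numbers 24 and 5 survive in both cases because they are preceded by a space rather than a letter.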
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 refers to the captured word and \\2 to the captured number. See: How to use python regex to replace using captured group?
Below is a working code sample that might solve your problem.
Here's the snippet:
import re
# I will write a function that takes the test data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing numbers from words in the text data."""
    # First, let's construct a pattern matching the words we're looking for
    pattern1 = r"([A-Za-z]+\d+)"
    # Let's construct another pattern matching the trailing digits that are
    # removed from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words
    matches = re.findall(pattern1, data)
    # Construct a list with the replacement for each word
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))
    # Intermediate variable holding (word, replacement) tuples for
    # use with the string method 'replace'
    changers = zip(matches, replacements)
    # We now replace every matched word, one after another.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done, we can return the result
    return output
For testing purposes, we run the above function with your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
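You can also use a character range for the digits. Note that this removes every digit in the line, including standalone numbers such as 24 and 5: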
re.sub(r"[0-9]", "", line)

How to not match string not containing two consecutive newlines

Demo at regex101. I have the following text file (a bibtex .bbl file):
\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., L.~Spogli, G.~De~Franceschi, V.~Romano, M.~Aquino, A.~Dodson, and
C.~N. Mitchell (2011{\natexlab{a}}), Bipolar climatology of {GPS} ionospheric
scintillation at solar minimum, \textit{Radio Science}, \textit{46}(3),
\doi{10.1029/2010RS004571}.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., L.~Spogli, J.~Tong, G.~De~Franceschi, V.~Romano, A.~Bourdillon,
M.~Le~Huy, and C.~Mitchell (2011{\natexlab{b}}), {GPS} scintillation and
{TEC} gradients at equatorial latitudes in april 2006, \textit{Advances in
Space Research}, \textit{47}(10), 1750--1757,
\doi{10.1016/j.asr.2010.04.020}.

\bibitem[{\textit{Anghel et~al.}(2008)\textit{Anghel, Astilean, Letia, and
Komjathy}}]{anghel2008nrm}
Anghel, A., A.~Astilean, T.~Letia, and A.~Komjathy (2008), Near real-time
monitoring of the ionosphere using dual frequency {GPS} data in a kalman
filter approach, in \textit{{IEEE} International Conference on Automation,
Quality and Testing, Robotics, 2008. {AQTR} 2008}, vol.~2, pp. 54--58,
\doi{10.1109/AQTR.2008.4588793}.

\bibitem[{\textit{Baker and Wing}(1989)}]{baker1989nmc}
Baker, K.~B., and S.~Wing (1989), A new magnetic coordinate system for
conjugate studies at high latitudes, \textit{Journal of Geophysical Research:
Space Physics}, \textit{94}(A7), 9139--9143, \doi{10.1029/JA094iA07p09139}.
I want to match the whole \bibitem command for a single entry (with some capture groups) if I know the reference code at the end of the command. I use this regex, which works for the first entry, but not for the rest (second entry exemplified below):
\\bibitem\[{(.*?)\((.*?)\)(.*?)}\]{alfonsi2011gsa}
This doesn't work, since it matches everything from the start of the first \bibitem command to the end of the second \bibitem command. How can I match only the second \bibitem command? I have tried using a negative lookahead for ^$ and \n\n, but I couldn't get either to work - basically, I want the third (.*?) to match any string not including two consecutive newlines. (If there's any other way to do this, I'm all ears.)
You can use a negative lookahead (?!...) to prevent the match from containing more than one occurrence of 'bibitem'. With this, the match will start at the 'bibitem' which immediately precedes your reference code. This seems to work:
\\bibitem\[{(((?!bibitem).)*?)\((((?!bibitem).)*?)\)(((?!bibitem).)*?)}\]{alfonsi2011gsa}
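In Python, the same idea might look like the sketch below. It assumes the entry header spans several lines, so re.DOTALL is needed for "." to cross newlines. The sample text is an abbreviated two-entry version of the file with shortened entry bodies, and find_bibitem is a hypothetical helper name.

```python
import re

# Abbreviated two-entry sample in the shape of the .bbl file;
# the entry bodies are shortened for illustration.
bbl = r"""\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{a}})\textit{Alfonsi, Spogli,
De~Franceschi, Romano, Aquino, Dodson, and Mitchell}}]{alfonsi2011bcg}
Alfonsi, L., Bipolar climatology of {GPS} ionospheric scintillation.

\bibitem[{\textit{Alfonsi et~al.}(2011{\natexlab{b}})\textit{Alfonsi, Spogli,
Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, and
Mitchell}}]{alfonsi2011gsa}
Alfonsi, L., {GPS} scintillation and {TEC} gradients.
"""

def find_bibitem(text, key):
    """Match the single bibitem header whose citation key is `key`."""
    # Any character, as few as possible, as long as "bibitem" does not start
    # here; this keeps a match from running across into another entry.
    tempered = r"((?:(?!bibitem).)*?)"
    pattern = re.compile(
        r"\\bibitem\[\{" + tempered + r"\(" + tempered + r"\)" + tempered
        + r"\}\]\{" + re.escape(key) + r"\}",
        re.DOTALL,
    )
    return pattern.search(text)

m = find_bibitem(bbl, "alfonsi2011gsa")
print(m.group(2))  # 2011{\natexlab{b}}
```

Using one capture group per section, rather than the nested groups in the pattern above, keeps the group numbering at 1 to 3.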
Regex is not my strong point, but this will get all the content you want without reading the whole file into memory at once:
from itertools import groupby
import re
with open("file.txt") as f:
    r = re.compile(r"\[{(.*?)\((.*?)\)(.*?)}\]\{alfonsi2011gsa\}")
    # group consecutive non-blank lines, so each entry is searched on its own
    for k, v in groupby(map(str.strip, f), key=lambda x: bool(x.strip())):
        match = r.search("".join(v))
        if match:
            print(match.groups())
('\\textit{Alfonsi et~al.}', '2011{\\natexlab{b}}', '\\textit{Alfonsi, Spogli,Tong, De~Franceschi, Romano, Bourdillon, Le~Huy, andMitchell}')
