Extract text between given set of words using Python

I went through various answers before posting and they are all regular expression based and involve symbols and special characters.
Here is my input text and expected output. I want to extract the text between 'Investment Objective' and 'Investment Policy'.
input_text:
"Investment Objective To provide long - term capital growth by investing primarily in a portfolio of African companies. Investment Policy"
output_text:
"To provide long - term capital growth by investing primarily in a portfolio of African companies."

An alternative to Joshua's answer:
input_text="Investment Objective To provide long - term capital growth by investing primarily in a portfolio of African companies. Investment Policy"
start_str = "Investment Objective"
startpos = input_text.find(start_str)
end_str = "Investment Policy"
endpos = input_text.find(end_str)
output_str = input_text[startpos + len(start_str):endpos]
output_str_nospaces = output_str.strip()
print(f"'{output_str}'")
print(f"'{output_str_nospaces}'")
Which prints:
' To provide long - term capital growth by investing primarily in a portfolio of African companies. '
'To provide long - term capital growth by investing primarily in a portfolio of African companies.'
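The find-and-slice approach above can be wrapped in a small helper. This is only a sketch: the function name extract_between is hypothetical, and it assumes you want None back when either marker is missing (with plain find(), a missing marker gives -1 and the slice silently returns the wrong text).

```python
def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition() splits around the first occurrence; the middle element
    # is empty when the separator was not found.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle.strip()


input_text = (
    "Investment Objective To provide long - term capital growth by investing "
    "primarily in a portfolio of African companies. Investment Policy"
)
print(extract_between(input_text, "Investment Objective", "Investment Policy"))
```

Because the end marker is searched for only in the text after the start marker, an "Investment Policy" that happens to appear earlier in the input is ignored.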

Let's say your blacklisted words are:
black = ["Investment Objective", "Investment Policy"]
Now let's remove them:
for i in black:
    input_text = input_text.replace(i, '').strip()
This gives:
'To provide long - term capital growth by investing primarily in a portfolio of African companies.'

Related

How to match complete words for acronym using regex?

I want to get only the complete words that an acronym in ( ) stands for.
For example, there is a sentence
'Lung cancer screening (LCS) reduces NSCLC mortality';
-> I want to get 'Lung cancer screening' as a result.
How can I do it with regex?
original question:
I want to remove runs of uppercase letters:
"HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer" => " acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer"
Assuming you want to target runs of 2 or more capital letters, I would use re.sub here:
import re
inp = "Lung cancer screening (LCS) reduces NSCLC mortality"
output = re.sub(r'\s*(?:\([A-Z]+\)|[A-Z]{2,})\s*', ' ', inp).strip()
print(output) # Lung cancer screening reduces mortality
import re
s = 'HIV acquired immunodeficiency syndrome are at a particularly high risk of cervical cancer'
print(re.sub(r'([A-Z])', lambda pat:'', s).strip()) # Inline; note this removes every uppercase letter
according to @jensgram's answer
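The answers above address the edited question (removing the acronyms). For the question as titled, one possible sketch is to find each parenthesised acronym and check whether the words just before it spell it out. The function name expand_acronyms is hypothetical, and the approach assumes each letter of the acronym is the initial of exactly one preceding word, which will not hold for every acronym.

```python
import re


def expand_acronyms(sentence):
    """Map each parenthesised acronym to the preceding words whose initials spell it."""
    results = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", sentence):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        words = sentence[:match.start()].split()
        # Take as many words as the acronym has letters.
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            results[acronym] = " ".join(candidate)
    return results


print(expand_acronyms("Lung cancer screening (LCS) reduces NSCLC mortality"))
```

Acronyms that skip small words (such as "of") or use more than one letter per word would need a looser comparison than the exact initials check used here.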

Joining output of iterative list in a while loop [duplicate]

I have a text file that I'm extracting text from using its punctuation and indentation patterns. The output should be a list of lists combining two lists: company_name and description
[[company,description],[company,description]]
To do that I'm running a while loop nested within a for loop to extract the description for each company. Here's my code:
for line in file:
    if not re.search(r" ", line, re.MULTILINE):
        name = line.split(',', 1)[0]
        companies.append(name)
        print(companies)
        companies = []
    while re.search(r" ", line, re.MULTILINE):
        desc.append(line)
        print(desc)
        desc = []
        break
Sample from text file:
XYZ Group, a nearly nine-year-old, Copenhagen-based
company that has built a dual-purpose platform, providing both
accountancy software and a marketplace for small and medium
businesses to find accountants, has landed $73 million in growth funding from a single investor,
Lugard Road Capital. TechCrunch has more
here.
Black Lake, a nearly five-year-old, China-based software
platform for factory workers to log their daily tasks and managers
to oversee the plant floor, recently raised $77 million in funding,
including from Singapore’s sovereign wealth fund Temasek,
which led the round, as well as China
Renaissance and Lightspeed Venture
Partners. The outfit has now raised more than $100
million altogether, including from from GGV...
That's the output:
['XYZ Group']
[' company that has built a dual-purpose platform, providing both']
[' accountancy software and a marketplace for small and medium']
[' businesses to find accountants, has landed 73 million in growth funding from a single investor,']
[' Lugard Road Capital TechCrunch has more']
[' here']
['Black Lake']
[' platform for factory workers to log their daily tasks and managers']
[' to oversee the plant floor, recently raised 77 million in funding,']
[' including from Singapore’s sovereign wealth fund Temasek,']
[' which led the round, as well as China']
[' Renaissance and Lightspeed Venture']
[' Partners The outfit has now raised more than 100']
[' million altogether, including from from GGV']
[' Capital, Bertelsmann Asia Investments,']
[' GSR Ventures, ZhenFund']
[' and others TechCrunch has more']
[' here']
The goal is to join the output of desc list under company name into 1 list
Update
I put desc = [] outside of the while loop and I'm getting this:
['XYZ Group']
[' company that has built a dual-purpose platform, providing both']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,', ' Lugard Road Capital TechCrunch has more']
[' company that has built a dual-purpose platform, providing both', ' accountancy software and a marketplace for small and medium', ' businesses to find accountants, has landed 73 million in growth funding from a single investor,', ' Lugard Road Capital TechCrunch has more', ' here']
I only need the last iteration though
Assuming the text always follows a <company_name>, <description> pattern, a very simple approach is based on .split(). Split on the first comma only, by limiting the number of splits with maxsplit=1, to get the name and the full_description, which can be prettified afterwards:
text = "XYZ Group, a nearly nine-year-old, Copenhagen-based company that has built a dual-purpose platform, providing both accountancy software and a marketplace for small and medium businesses to find accountants, has landed $73 million in growth funding from a single investor, Lugard Road Capital. TechCrunch has more here."
name, full_description = text.split(',', 1)
description = [s.strip() for s in full_description.split(',')]
output = [name, description]
print(output)
Output:
['XYZ Group', ['a nearly nine-year-old', 'Copenhagen-based company that has built a dual-purpose platform', 'providing both accountancy software and a marketplace for small and medium businesses to find accountants', 'has landed $73 million in growth funding from a single investor', 'Lugard Road Capital. TechCrunch has more here.']]
Alternatively, you could also use .split(" ") to split on the runs of multiple spaces (the string passed to split must contain as many spaces as the indentation in the file) and ignore any commas.
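To get the [[company, description], ...] structure the question asks for, one option is to collect the continuation lines under the most recent company and join them once at the end. This is a sketch under one stated assumption: a company line starts at the left margin and every continuation line starts with whitespace. The function name group_companies is hypothetical.

```python
def group_companies(lines):
    """Group indented continuation lines under the preceding company line."""
    records = []  # each entry is [name, [description parts]]
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # New company: the name is everything before the first comma.
            name, _, rest = line.partition(",")
            rest = rest.strip()
            records.append([name.strip(), [rest] if rest else []])
        elif records:
            # Indented line: belongs to the most recent company.
            records[-1][1].append(line.strip())
    # Join the parts only once per company, after all lines are read.
    return [[name, " ".join(parts)] for name, parts in records]


sample = [
    "XYZ Group, a nearly nine-year-old, Copenhagen-based\n",
    "    company that has built a dual-purpose platform\n",
    "Black Lake, a nearly five-year-old, China-based software\n",
    "    platform for factory workers\n",
]
print(group_companies(sample))
```

Appending to the last record instead of resetting a desc list inside the loop avoids both symptoms shown in the question: one list per line, and the growing list printed on every iteration.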

Find best matches of substring from list in corpus

I have a corpus that looks something like this
LETTER AGREEMENT N°5 CHINA SOUTHERN AIRLINES COMPANY LIMITED Bai Yun
Airport, Guangzhou 510405, People's Republic of China Subject: Delays
CHINA SOUTHERN AIRLINES COMPANY LIMITED (the "Buyer") and AIRBUS
S.A.S. (the "Seller") have entered into a purchase agreement (the
"Agreement") dated as of even date
And a list of company names that looks like this
l = [ 'airbus', 'airbus internal', 'china southern airlines', ... ]
The elements of this list do not always have exact matches in the corpus, because of different formulations or just typos: for this reason I want to perform fuzzy matching.
What is the most efficient way of finding the best matches of l in the corpus? In theory the task is not super difficult but I don't see a way of solving it that does not entail looping through both the corpus and list of matches, which could cause huge slowdowns.
You can concatenate your list l into a single regular expression, then use the fuzzy matching of the third-party regex module (https://github.com/mrabarnett/mrab-regex#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109) to match the words in the corpus.
Something like
my_regex = ""
for pattern in l:
    my_regex += f'(?:{pattern}' + '{1<=e<=3})' # {1<=e<=3} permits at least 1 and at most 3 errors
    my_regex += '|'
my_regex = my_regex[:-1] # remove the last |
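If installing the regex module is not an option, a rough standard-library alternative is to compare each name against word windows of the corpus with difflib. This is a different technique from the fuzzy regex above, and it does loop over both the names and the corpus, so it is only a sketch for small inputs. The function name best_matches and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def best_matches(corpus, names, threshold=0.8):
    """Return {name: (best matching corpus window, similarity ratio)}."""
    words = corpus.lower().split()
    results = {}
    for name in names:
        n = len(name.split())  # compare against windows of the same word count
        best_score, best_window = 0.0, None
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            score = SequenceMatcher(None, name, window).ratio()
            if score > best_score:
                best_score, best_window = score, window
        if best_score >= threshold:
            results[name] = (best_window, best_score)
    return results


corpus = (
    "CHINA SOUTHERN AIRLINES COMPANY LIMITED and AIRBUS S.A.S. "
    "have entered into a purchase agreement"
)
names = ["airbus", "china southen airlines", "boeing"]
print(best_matches(corpus, names))
```

Names with no window above the threshold are simply left out of the result, so a typo such as "china southen airlines" is still matched while an absent name such as "boeing" is not.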

How can I remove numbers that may occur at the end of words in a text

I have text data to be cleaned using regex. However, some words in the text are immediately followed by numbers which I want to remove.
For example, one row of the text is:
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes
terminology10 Lessons learnt from the RUPES project12 Payment for
environmental service and it potential and example in Vietnam16
Chapter Integrating payment for ecosystem service into Vietnams policy
and programmes17 Chapter Creating incentive for Tri An watershed
protection20 Chapter Sustainable financing for landscape beauty in
Bach Ma National Park 24 Chapter Building payment mechanism for carbon
sequestration in forestry a pilot project in Cao Phong district of Hoa
Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam28 Synthesis and Recommendations30
References32
The first word in the above text should be 'preface' instead of 'preface2' and so on.
line = re.sub(r"[A-Za-z]+(\d+)", "", line)
This, however removes the words as well as seen:
Pes Lessons learnt from the RUPES Payment for environmental service
and it potential and example in Chapter Integrating payment for
ecosystem service into Vietnams policy and Chapter Creating incentive
for Tri An watershed Chapter Sustainable financing for landscape
beauty in Bach Ma National Park 24 Chapter Building payment mechanism
for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Chapter 5 Local revenue sharing Nha
Trang Bay Marine Protected Area Synthesis and
How can I capture only the numbers that immediately follow words?
You can capture the text part and replace the whole match with that captured part:
re.sub(r"([A-Za-z]+)\d+", r"\1", line)
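As a quick check of the capture-group substitution, here is a minimal runnable sketch on a shortened sample line (the sample string and the function name strip_trailing_digits are made up for illustration). Numbers separated from the word by a space are left alone, because the pattern requires at least one letter directly before the digits.

```python
import re


def strip_trailing_digits(line):
    """Remove digits that directly follow letters, keeping the letters."""
    return re.sub(r"([A-Za-z]+)\d+", r"\1", line)


sample = "Preface2 Contributors4 National Park 24 Chapter 5 Vietnam28"
print(strip_trailing_digits(sample))
# Preface Contributors National Park 24 Chapter 5 Vietnam
```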
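You could try a lookbehind assertion to check for a word character before your numbers, together with a word boundary (\b) at the end to force your regex to only match numbers at the end of a word. Python's re module only accepts fixed-width lookbehinds, so the assertion checks a single character:
re.sub(r'(?<=\w)\d+\b', '', line)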
Hope this helps
EDIT:
Sorry about the glitch mentioned in the comments, where numbers that are NOT preceded by words were matched as well. That is because (sorry again) \w matches alphanumeric characters instead of only alphabetic ones. Depending on what you would like to delete, you can use the positive version
re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
to only check for English alphabetic characters (you can add characters to the [a-zA-Z] list) preceding your number, or the negative version
re.sub(r'(?<![\d\s])\d+\b', '', line)
to match anything that is NOT \d (a digit) or \s (whitespace) before your desired number. This will also match numbers preceded by punctuation marks, though.
Try this:
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
\\1 refers to the captured word, \\2 to the captured number. See: How to use python regex to replace using captured group?
Below, I'm proposing a working sample of code that might solve your problem.
Here's the snippet:
import re
# I will write a function that takes the test data as input and returns the
# desired result as stated in your question.
def transform(data):
    """Strip the trailing number from every word that ends with one."""
    # First, let's construct a pattern matching the words we're looking for
    pattern1 = r"([A-Za-z]+\d+)"
    # Let's construct another pattern that matches the trailing number to
    # remove from each matched word.
    pattern2 = r"\d+$"
    # Find all matching words
    matches = re.findall(pattern1, data)
    # Construct a list of replacements, one for each word
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))
    # Intermediate variable to construct tuples of (word, replacement) for
    # use in the string method 'replace'
    changers = zip(matches, replacements)
    # We now iteratively change every matched word.
    output = data
    for changer in changers:
        output = output.replace(*changer)
    # The work is done, we can return the result
    return output
For testing purposes, we run the above function with your test data:
data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons
learnt from the RUPES project12 Payment for environmental service and it potential and
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""
result = transform(data)
print(result)
And the result looks like this:
Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from
the RUPES project Payment for environmental service and it potential and example in
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and
programmes Chapter Creating incentive for Tri An watershed protection Chapter
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay
Marine Protected Area Vietnam Synthesis and Recommendations References
You can use a character range of digits as well, although this removes every digit in the line, including standalone numbers such as 24:
re.sub(r"[0-9]", "", line)

Use regex to extract unit number

I have a list of descriptions and I want to extract the unit information using regular expression
I watched a video on regex and here's what I got
import re
x = ["Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
"265 rental units",
"10 stories and contain 200 apartments",
"801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
"4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)"]
unit=[]
for item in x:
    extract = re.findall('[0-9]+.unit', item)
    unit.append(extract)
print(unit)
This works with strings that end in 'unit', but I also have strings that end with 'rental unit', 'apartment', 'bed' and others, as in this example.
I could do this with multiple regex, but is there a way to do this within one regex?
Thanks!
As long as you're not afraid of making a hideously long regex, you could use something along the lines of:
compiled_re = re.compile(r"(\d+)-unit|(\d+)\srental unit|(\d+)\sbed|(\d+)\sapartment")
unit = []
for item in x:
    extract = re.findall(compiled_re, item)
    unit.append(extract)
You would have to extend the regex pattern with a new "|" followed by a search pattern for each possible type of reference to unit numbers. Unfortunately, if there is very low consistency in the entries this approach would become basically unusable.
Also, might I suggest using a regex tester like Regex101. It really helps you determine whether your regex will do what you want it to.
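A shorter variant of the same idea is to keep a single capture group for the number and put the alternatives in a non-capturing group, so that findall returns plain strings instead of tuples with empty slots. This is a sketch: the name UNIT_RE and the exact list of keywords are assumptions, and the keyword list would need extending for other phrasings.

```python
import re

# One number, then a space or hyphen, an optional "rental ", then a unit keyword.
UNIT_RE = re.compile(
    r"(\d+)[\s-](?:rental\s+)?(?:unit|apartment|bed)",
    re.IGNORECASE,
)

x = [
    "Four 10-story towers - five 11-story residential towers around Lake Peterson - two 9-story hotel towers facing Devon Avenue & four levels of retail below the hotels",
    "265 rental units",
    "10 stories and contain 200 apartments",
    "801 residential properties that include row homes, town homes, condos, single-family housing, apartments, and senior rental units",
    "4-unit townhouse building (6,528 square feet of living space & 2,755 square feet of unheated garage)",
]

unit = [UNIT_RE.findall(item) for item in x]
print(unit)
# [[], ['265'], ['200'], [], ['4']]
```

The keywords match as prefixes, so "units" and "apartments" are covered, but "bed" would also match "bedroom"; add a \b after the keyword group if that is not wanted.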
