Python Regular Expression: re.sub to replace matches - python

I am trying to analyze an earnings call transcript using Python regular expressions.
I want to delete the lines that contain only the name and position of the next speaker.
This is an excerpt of the text I want to analyze:
"Questions and Answers\nOperator [1]\n\n Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]\n I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.\n Timothy D. Cook, Apple Inc. - CEO & Director [3]\n ..."
Each line that I want to delete ends with a number in square brackets, such as [2].
So I used the following line of code to get these lines:
name_lines = re.findall('.*[\d]]', text)
This works and gives me the following list:
['Operator [1]',
' Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]',
' Timothy D. Cook, Apple Inc. - CEO & Director [3]']
So, in the next step I want to remove these strings from the text using the following code:
for i in range(0, len(name_lines)):
    text = re.sub(name_lines[i], '', text)
But this does not work. Replacing a single string without the loop does not work either, and I have no clue why.
Also, if I use re.findall to search for the lines I obtained from the first line of code, I don't get a match.

Try to use re.sub to replace the match:
import re
text = """\
Questions and Answers
Operator [1]
Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
Timothy D. Cook, Apple Inc. - CEO & Director [3]"""
text = re.sub(r".*\d]", "", text)
print(text)
Prints:
Questions and Answers
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
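Note that the pattern above removes the matched text but not the line break after it, so an empty line stays behind for each removed speaker line. A minimal sketch of one way to drop the whole line, assuming every speaker line ends with a bracketed number; the sample text and the name speaker_line are made up for illustration:

```python
import re

# Hypothetical sample in the same shape as the transcript excerpt.
text = (
    "Questions and Answers\n"
    "Operator [1]\n"
    " Jane Doe, Example Research - Analyst [2]\n"
    " I hope everyone is well.\n"
    " John Roe, Example Inc. - CEO [3]\n"
    " Thanks for the question.\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The trailing \n? removes the line break together with the line.
speaker_line = re.compile(r"^.*\[\d+\][ \t]*$\n?", re.MULTILINE)

cleaned = speaker_line.sub("", text)
print(cleaned)
```

Requiring the closing bracket at the end of the line also makes it less likely that a sentence which merely contains a digit followed by ] somewhere in the middle gets removed.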

The first argument to re.sub is treated as a regular expression, so the square brackets get a special meaning and don't match literally.
You don't need a regular expression for this replacement at all though (and you also don't need the loop counter i):
for name_line in name_lines:
    text = text.replace(name_line, '')
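If a regular expression is still wanted, for example to pass flags, re.escape turns a literal string into a pattern that matches it exactly. The following small sketch, with a made-up line, shows both the failure and the fix:

```python
import re

text = "Intro\nOperator [1]\nHello everyone.\n"
name_line = "Operator [1]"

# Used as a pattern, "[1]" is a character class that matches the single
# character "1", so the pattern looks for "Operator 1", which is not in the text.
unchanged = re.sub(name_line, "", text)

# re.escape backslash-escapes the special characters, so the brackets match literally.
removed = re.sub(re.escape(name_line), "", text)

print(unchanged == text)
print(repr(removed))
```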

Related

Python Regular Expression - Get Text starting in the next line after the match was found

I have a question on using regular expressions in Python. This is a part of the text I am analysing.
Amit Jawaharlaz Daryanani, Evercore ISI Institutional Equities, Research Division - Senior MD & Fundamental Research Analyst [19]\n I have 2 as well. I guess, first off, on the channel inventory, I was hoping if you could talk about how did channel inventory look like in the March quarter because it sounds like it may be below the historical ranges. And then the discussion you had for June quarter performance of iPhones, what are you embedding from a channel building back inventory levels in that expectation?\n
My goal is to extract the following part of the text by matching the name of the analyst, Amit Jawaharlaz Daryanani: \n I have 2 as well. I guess, first off, on the channel inventory, I was hoping if you could talk about how did channel inventory look like in the March quarter because it sounds like it may be below the historical ranges. And then the discussion you had for June quarter performance of iPhones, what are you embedding from a channel building back inventory levels in that expectation?\n
I cannot simply match from one \n to the next, because the text is much longer and I specifically need the line of text that comes after his name.
I tried:
re.findall(r'(?<=Amit Jawaharlaz Daryanani).*?(?=\n)', text)
But the output here is
[', Evercore ISI Institutional Equities, Research Division - Senior MD & Fundamental Research Analyst [19]']
So how can I match from the first \n that comes after his name up to the second \n after his name?
You can use a capture group:
\bAmit Jawaharlaz Daryanani\b.*\n\s*(.*)\n
Explanation
\bAmit Jawaharlaz Daryanani\b Match the name
.*\n Match the rest of the line and a newline
\s*(.*)\n Match optional whitespace chars, and capture a whole line in group 1 followed by matching a newline
import re
pattern = r"\bAmit Jawaharlaz Daryanani\b.*\n\s*(.*)\n"
s = ("Amit Jawaharlaz Daryanani, Evercore ISI Institutional Equities, Research Division - Senior MD & Fundamental Research Analyst [19]\n"
" I have 2 as well. I guess, first off, on the channel inventory, I was hoping if you could talk about how did channel inventory look like in the March quarter because it sounds like it may be below the historical ranges. And then the discussion you had for June quarter performance of iPhones, what are you embedding from a channel building back inventory levels in that expectation?\n"
" \n")
m = re.search(pattern, s)
if m:
    print(m.group(1))
Output
I have 2 as well. I guess, first off, on the channel inventory, I was hoping if you could talk about how did channel inventory look like in the March quarter because it sounds like it may be below the historical ranges. And then the discussion you had for June quarter performance of iPhones, what are you embedding from a channel building back inventory levels in that expectation?
Try this:
non-capturing group for the name
look for the first \n
capturing group until the second \n
re.findall(r'(?:Amit Jawaharlaz Daryanani).*?\n(.*?)\n', text)
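This works because . does not match a newline by default, so the non-greedy .*? consumes only the rest of the name line up to the first \n. The capturing group (.*?) then takes the following line, up to the second \n.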
Output:
[' I have 2 as well. I guess, first off, on the channel inventory, I was hoping if you could talk about how did channel inventory look like in the March quarter because it sounds like it may be below the historical ranges. And then the discussion you had for June quarter performance of iPhones, what are you embedding from a channel building back inventory levels in that expectation?']
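Both answers hard-code the analyst's name. As a hedged sketch of how the same idea could be wrapped for any speaker, the helper below (line_after_speaker is a hypothetical name, and the sample text is invented) escapes the name and captures the next line. Unlike the patterns above, it does not require a trailing \n after the captured line.

```python
import re

def line_after_speaker(text, name):
    """Return the line that follows the first line containing `name`, or None."""
    # re.escape guards against special characters in the name, e.g. the dot in "D. Cook".
    pattern = re.escape(name) + r".*\n[ \t]*(.*)"
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Invented sample with two speakers.
sample = (
    "Jane Doe, Example Research - Analyst [19]\n"
    " I have 2 as well.\n"
    "John Roe, Example Inc. - CEO [20]\n"
    " Sure, happy to answer.\n"
)

print(line_after_speaker(sample, "John Roe"))
```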

How to find a specific, pre-defined word surrounded by any word(s) starting with a capital letter(s)?

I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
You can use 2 capture groups instead, and match a single word starting with a capital A-Z on the left or on the right.
Using [^\S\r\n] will match a whitespace char without a newline, as \s can match a newline
\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*
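In Python, the two-alternative pattern could be applied with finditer, reading the whole match rather than the groups. This is only a sketch on an invented sample; note that, as described, each alternative matches a single capitalized neighbour, so longer phrases are not returned in full.

```python
import re

pattern = re.compile(
    r"\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*"
)

# Invented sample: one phrase per line, plus a lowercase "study" that must not match.
text = (
    "We reviewed the Longitudinal Study last year.\n"
    "Later, the Study Initiative began.\n"
    "This study was small.\n"
)

# group(0) is the whole match, whichever alternative produced it.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```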
Ok, this is possibly way out of the actual scope but you could use the newer regex module with subroutines:
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)
See a demo on regex101.com (and mind the modifiers!).
In actual code, this could be:
import regex as re
junk = """
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
"""
pattern = re.compile(r'''
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)''', re.VERBOSE)
for match in pattern.finditer(junk):
    print(match.group(0))
And would yield
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
Another option is to bound the number of capitalized words on each side:
((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})
I still have to test it further to check whether it works for all of the scenarios that occur in real data. I might also need to adjust the '5' in the expression to a lower or higher number to tune performance. I have already tested it on some real datasets and the results have been promising so far. It is fast.
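One thing to watch with the bounded expression: both repetitions may match zero times, so a bare "Study" with no capitalized neighbour also matches, and matches can carry trailing whitespace. A small sketch on an invented sentence, with a post-filter that is my own assumption about how one might handle this:

```python
import re

pattern = re.compile(r"((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})")

# Invented sentence with one real phrase and one bare "Study".
text = "the Europe National Longitudinal Study was cited; a Study alone is not a phrase"

# Strip the whitespace that the trailing \s* may have consumed.
matches = [m.group(1).strip() for m in pattern.finditer(text)]
print(matches)

# Keep only matches that have at least one capitalized neighbour.
phrases = [m for m in matches if m != "Study"]
print(phrases)
```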

How to capture from whitespace +{n} to next {n} in regex

I've cleaned up a document to allow me to properly rip it verse by verse. Being weak in regex I cannot seem to find the right expression to extract these verses.
This is the expression I am using:
(\t?\t?{\d+}.*){
And I'm doing this in python, though I expect that does not matter.
How should I change this so that it matches each verse, from its {x} marker up to, but not including, the next {x} marker?
As you can see, I'm trying to keep it tabs-aware because this doc gives some attention to verse-style writing.
And here is an example doc:
{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
{7} And the earth shall be [[wholly]] rent in sunder,
And all that is upon the earth shall perish,
And there shall be a judgement upon all (men).
{8} But with the righteous He will make peace.
And will protect the elect,
And mercy shall be upon them.
And they shall all belong to God,
And they shall be prospered,
And they shall [[all]] be blessed.
[[And He will help them all]],
And light shall appear unto them,
[[And He will make peace with them]].
{9} And behold! He cometh with ten thousands of [[His]] holy ones
To execute judgement upon all,
And to destroy [[all]] the ungodly:
And to convict all flesh
Of all the works [[of their ungodliness]] which they have ungodly committed,
And of all the hard things which ungodly sinners [[have spoken]] against Him.
[BREAK]
[CHAPTER 2]
Simply split the text on the verse markers with re.split:
import re
text = '''{1} The words of the blessing of Enoch, wherewith he blessed the elect [[[[and]]]] righteous, who will be living in the day of tribulation, when all the wicked [[[[and godless]]]] are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, [[which]] the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, (even) on Mount Sinai,
[[And appear from His camp]]
And appear in the strength of His might from the heaven of heavens.'''
result = [i for i in re.split(r'\{\d+\}', text) if i]
result has four elements, corresponding to {1} through {4} above.
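If the verse numbers are needed as well, re.split keeps them when the number is wrapped in a capturing group: the result then alternates between numbers and verse bodies. A minimal sketch on a shortened, invented text (the gap from 2 to 4 is deliberate):

```python
import re

text = "{1} First verse. {2} Second verse,\nsecond line.\n{4} Fourth verse."

# Because of the capturing group, the split result is
# [text before the first marker, number, body, number, body, ...].
parts = re.split(r"\{(\d+)\}", text)

# Pair each number with the body that follows it.
verses = {int(n): body.strip() for n, body in zip(parts[1::2], parts[2::2])}
print(verses)
```

A dict keyed by verse number does not rely on the numbering being consecutive.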
(\t?\t?{\d+}.*?)(?={)
See demo.
https://regex101.com/r/OCpDb7/1
Edit:
If you want to capture the last verse as well, use
(\t?\t?{\d+}.*?)(?={|\[BREAK\])
See demo.
https://regex101.com/r/OCpDb7/2
Your original regex suffered from 2 problems.
(\t?\t?{\d+}.*){
             ^ ^
1) You used a greedy operator. Use the non-greedy .*? instead.
2) You were consuming the { of the next verse, so that verse could no longer match because its opening brace had already been consumed. Use a lookahead to assert the brace without consuming it.
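In Python, the lookahead version might be used as sketched below. The re.DOTALL flag is an addition here, on the assumption that verses can span several lines: without it, . stops at each line break. The sample text is abbreviated and partly invented.

```python
import re

text = (
    "{1} The words of the blessing. {2} And he took up his parable:\n"
    "\tThe Holy Great One will come forth,\n"
    "{3} And the eternal God will tread upon the earth.\n"
    "[BREAK]\n"
)

# re.DOTALL lets . cross line breaks, so a verse may span several lines.
# The lookahead ends each verse before the next "{" or "[BREAK]" without consuming it.
verse = re.compile(r"(\t?\t?\{\d+\}.*?)(?=\{|\[BREAK\])", re.DOTALL)

verses = verse.findall(text)
for v in verses:
    print(repr(v))
```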
The answer above is good, but the verses are not always numbered consecutively in this book (i.e., it can jump from verse 5 to 7 due to manuscript details), so I had to keep the verse numbers in order to "pluck" them later. Basically, entire verses had to be extracted along with their numbers.
The recipe seemed to be this:
verse = re.compile(r'(\t*{\d+}[^{]*)', re.DOTALL)
In context:
import re
# Read the whole book; the context manager closes the file afterwards.
with open('thebook.txt', 'r') as f:
    book = f.read()
chapters = book.split('[BREAK]')
# \t* keeps any leading tabs; [^{]* runs up to, but not including, the next verse marker.
verse = re.compile(r'(\t*{\d+}[^{]*)', re.DOTALL)
verses = verse.findall(chapters[1])
Please note that it seems to work properly, but I still have to check the results to make sure it accounts for everything.

Ideas for slicing a BeautifulSoup string to compartmentalize into certain categories?

I'm doing some web scraping with BeautifulSoup and am given a sequence of strings in this format:
"PRICE. ADDRESS, PHONE#, " 'WEBSITE
To show what I mean, here are two examples of how these strings appear in the HTML text.
"$10. 2109 W. Chicago Ave., 773-772-0406, "'theoldoaktap.com
"$9. 3619 North Ave., 773-772-8435, "'cemitaspuebla.com
What's the best way to go about this? It would have been easy if a comma followed the price (I could have just used split(",") and addressed the parts by index), but what alternatives do I have now? I can't split on periods, because some addresses with directional streets contain periods (e.g. W. Chicago Ave).
Would the best solution be to split(), extract the first string (the price), build a new string from the leftover parts and then split that on commas (split(","))? That seems very un-Pythonic, and I'm not sure it would work either.
In the end, I want to end up with
Price = $10
Location = 2109 W. Chicago Ave
Phone# = 773-772-0406
Website = http://www.theoldoaktap.com
Thank you all in advance. My brain is fried.
import re
test = '"$10. 2109 W. Chicago Ave., 773-772-0406, "\'theoldoaktap.com'
# Groups: price digits, address, phone number, website (matched as bare text, as in the sample string).
extracted_entities = re.match(r'"\$(\d+)\. ([^,]+), ([\d-]+), "\'(\S+)', test)
if extracted_entities:
    print(extracted_entities.groups())
Basically, since your strings are pretty rigidly formatted, you can simply use a regular expression to extract the components using some predetermined patterns. If you are going to do these types of projects often, I highly suggest studying regex; it's a very powerful tool!
Reference: https://docs.python.org/2/howto/regex.html
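To get from the matched groups to the labelled fields the question asks for, named groups are one option. The sketch below assumes the strings look exactly like the two examples in the question. The names ENTRY and parse_entry are hypothetical, and prefixing http://www. is an assumption based on the desired output.

```python
import re

# Named groups for the four fields; the optional \.? drops the period after "Ave".
ENTRY = re.compile(
    r'"(?P<price>\$\d+)\.\s+(?P<location>.+?)\.?,\s+(?P<phone>[\d-]+),\s*"\'(?P<website>\S+)'
)

def parse_entry(entry):
    """Return a dict of the four fields, or None if the string has another shape."""
    m = ENTRY.match(entry)
    if m is None:
        return None
    fields = m.groupdict()
    # Assumption: every site is reachable under http://www.<domain>.
    fields["website"] = "http://www." + fields["website"]
    return fields

print(parse_entry('"$9. 3619 North Ave., 773-772-8435, "\'cemitaspuebla.com'))
```

Returning None for strings that do not fit makes it easy to collect the entries that need manual attention.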

Simple text parsing library

I have a method that takes addresses from the web, and therefore, there are many known errors like:
123 Awesome St, Pleasantville, NY, Get Directions
Which I want to be:
123 Awesome St, Pleasantville, NY
Is there a web service or Python library that can help with this? It's fine for us to start creating a list of items like ", Get Directions" or a more generalized version of that, but I thought there might be a helper library for this kind of textual analysis.
If the address contains one of those bad strings, walk backwards till you find another non-whitespace character. If the character is one of your separators, say , or :, drop everything from that character onwards. If it's a different character, drop everything after that character.
Make a list of known bad strings. Then, you could take that list and use it to build a gigantic regex and use re.sub().
This is a naive solution, and isn't going to be particularly performant, but it does give you a clean way of adding known bad strings, by adding them to a file called .badstrings or similar and building the list from them.
Note that if you make bad choices about what these bad strings are, you will break the algorithm. But it should work for the simple cases you describe in the comments.
EDIT: Something like this is what I mean:
import re
def sanitize_address(address, regex):
    # Remove every occurrence of a known bad string, together with the separators before it.
    return regex.sub('', address)
badstrings = ['get directions', 'multiple locations']
base_regex = r'[,\s]+('+'|'.join(badstrings)+')'
regex = re.compile(base_regex, re.I)
address = '123 Awesome St, Pleasantville, NY, Get Directions'
print(sanitize_address(address, regex))
which outputs:
123 Awesome St, Pleasantville, NY
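A slightly more defensive variant of the same idea, sketched under two assumptions of mine: the bad strings only ever appear at the end of the address, and they should be treated as literal text. build_cleaner is a hypothetical helper name.

```python
import re

def build_cleaner(bad_strings):
    """Compile one pattern that matches any bad string at the end of an address."""
    # re.escape treats each entry as literal text; longer entries go first so that
    # a short entry cannot shadow a longer one that starts the same way.
    alternatives = "|".join(
        re.escape(s) for s in sorted(bad_strings, key=len, reverse=True)
    )
    return re.compile(r"[,\s]+(?:" + alternatives + r")\s*$", re.IGNORECASE)

def sanitize_address(address, cleaner):
    return cleaner.sub("", address)

cleaner = build_cleaner(["get directions", "multiple locations"])
print(sanitize_address("123 Awesome St, Pleasantville, NY, Get Directions", cleaner))
```

Anchoring at the end with $ leaves an address such as "1 Get Directions Rd, Town, NY" untouched, at the cost of missing bad strings in the middle of the text.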
I would say that the task is impossible to do with a high degree of confidence unless the data is in a fixed format, or you have a gigantic address database to make matches against.
You could possibly get away with having a list of countries, and then a rule set per country that you use. The American rule set could include a list of states, cities and postal codes and a pattern to find street addresses. You would then drop anything that isn't a state, city or postal code and doesn't look like a street address.
You'd still drop things that should be a part of an address though, at least with Swedish addresses, which can include just the name of a farm instead of a street and number. If US countryside addresses are the same, there is just no way to know what is a part of an address and what isn't unless you have access to a database with all US addresses. :-)
Here is a Regex that will parse either one. If you have other examples, I can change the current Regex to work for it
(?<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+(?<city>(?:\w+\s?)+)[,]\s+(?<state>(?:\w+\s?)+)(?:$|[,])
this will even work for addresses that are in similar format to mine (1234 North 1234 West, Pleasantville, NY)
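Note that the (?<name>...) syntax above is not accepted by Python's re module, which spells named groups (?P<name>...). A sketch of the same expression in Python form, otherwise unchanged; the nested quantifiers such as (?:\w+\s?)+ may backtrack heavily on long inputs that do not match, so treat it as illustrative:

```python
import re

# Same expression as above, with Python's (?P<name>...) named-group syntax.
ADDRESS = re.compile(
    r"(?P<address>(?:[0-9]+\s+(?:\w+\s?)+)+)[,]\s+"
    r"(?P<city>(?:\w+\s?)+)[,]\s+"
    r"(?P<state>(?:\w+\s?)+)(?:$|[,])"
)

m = ADDRESS.search("123 Awesome St, Pleasantville, NY, Get Directions")
if m:
    print(m.group("address"), "|", m.group("city"), "|", m.group("state"))
```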
