Text selection using greedy and reluctant quantifiers - Python

I am trying to extract text between markers, but the markers are word tokens that repeat frequently. The input is EDGAR txt files.
My markers are ITEM followed by some number, i.e. ITEM 1 and ITEM 2.
My regex is
MDA_regex = 'item[^a-zA-Z\n]*\d\s*\.\s*management\'s discussion and analysis.*?item[^a-zA-Z\n]*\d\s*'
It works fine, but it fails if the keyword item\d... occurs between ITEM 1 and ITEM 2. If I use .*, it runs on to another unwanted marker, since the reports may contain other item\d... occurrences. If I use .*?, it gets stuck at the first occurrence of item.
I cannot hardcode 1 and 2 because different reports can have the required text under different positions/headers, e.g. item 7 to item 8. I am using Python:
import os
import re

for fileName in os.listdir(path):
    fileName = os.path.join(path, fileName)
    if os.path.isfile(fileName):
        print("opening new file " + fileName)
        with open(fileName, 'r', encoding='utf-8', errors='replace') as in_file:
            content = in_file.read().replace('\n', ' ')
            mda_list = re.findall(MDA_regex, content, re.IGNORECASE | re.DOTALL)
            print(mda_list)
            print(len(mda_list))
My input looks like this:
Management's Discussion and Analysis of Financial Condition and
Results of Operations............................................. 22
Item 3.
ITEM 1 . MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION
AND RESULTS OF OPERATIONS.
FORWARD-LOOKING STATEMENTS
This report contains forward-looking statements. Additional
written or oral forward-looking statements may be made by AMERCO from
time to time in filings with the Securities and Exchange Commission or
otherwise. Management believes such forward-looking statements are
within the meaning of the safe-harbor provisions. Such item 1 statements may
include, but not be limited to, projections of revenues, income or
<PAGE> 33
ITEM 2. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK
DFDFDLF;DF SDLKD dlfldfkdffd;fl;l sdsl; dklkkdsmm,sd item 4
DDFLDFL dlkdsldkf dldfd;lf;f
Also, 'item' cannot be matched in caps only, because some reports use caps and some don't.
Can someone suggest how to handle this? Should I use multiple regexes in if-else conditions to check each case?
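One way to keep .*? from stalling on inline mentions is a tempered greedy token: let the lazy middle consume characters only while the current position does not start another "ITEM <n>." heading. Below is a minimal sketch along those lines; the assumption (mine, not from the question) is that a real heading has a dot after the digit, which distinguishes "ITEM 2." from inline text like "item 1 statements":

import re

# Tempered greedy token: (?:(?!X).)* consumes any character as long as the
# position does not begin X. Here X is an "ITEM <n>." heading, so inline
# "item 1" mentions without a trailing dot no longer end the match early.
MDA_regex = (
    r"item[^a-zA-Z\n]*\d+\s*\.\s*management's discussion and analysis"
    r"(?:(?!item[^a-zA-Z\n]*\d+\s*\.).)*"  # section body
    r"item[^a-zA-Z\n]*\d+\s*\."            # the next "ITEM <n>." heading
)
mda_list = re.findall(MDA_regex, content, re.IGNORECASE | re.DOTALL)

Because the item numbers are not hardcoded, the same pattern also covers filings where the section runs from item 7 to item 8.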

Related

How to filter strings if the first three sentences contain keywords

I have a pandas dataframe called df. It has a column called article. The article column contains 600 strings, each of which represents a news article.
I want to KEEP only those articles whose first four sentences contain the keywords "COVID-19" AND ("China" OR "Chinese"), but I'm unable to find a way to do this on my own.
(In each string, sentences are separated by \n. An example article looks like this:)
\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission.\ .......
First we define a function that returns a boolean based on whether your keywords appear in a given sentence (note the keyword must be spelled 'COVID-19', with the hyphen, to match the question):

def contains_covid_kwds(sentence):
    kw1 = 'COVID-19'
    kw2 = 'China'
    kw3 = 'Chinese'
    return kw1 in sentence and (kw2 in sentence or kw3 in sentence)
Then we create a boolean series by applying this function (using Series.apply) to your df.article column.
Note that we use a lambda function to truncate the string passed to contains_covid_kwds at the fifth occurrence of '\n', i.e. after your first four sentences (the replace call swaps out the first four '\n' characters so that find locates the fifth):
series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
Then we pass the boolean series to df.loc, in order to localize the rows where the series evaluated to True:
filtered_df = df.loc[series]
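Put together, a minimal end-to-end run might look like this (the two-row frame is hypothetical, shaped like the question's data with a leading '\n' per article):

import pandas as pd

df = pd.DataFrame({'article': [
    "\nChina may be past the worst of the COVID-19 pandemic.\nSecond sentence.\nThird.\nFourth.\nFifth is not scanned.",
    "\nUnrelated story.\nNothing here.\nStill nothing.\nNope.\nCOVID-19 China appears only in sentence five.",
]})

def contains_covid_kwds(sentence):
    return 'COVID-19' in sentence and ('China' in sentence or 'Chinese' in sentence)

series = df.article.apply(lambda s: contains_covid_kwds(s[:s.replace('\n', '#', 4).find('\n')]))
filtered_df = df.loc[series]
print(filtered_df)  # only the first row survives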
You can use the pandas apply method the way I did below:
import pandas as pd
string = "\nChina may be past the worst of the COVID-19 pandemic, but they aren’t taking any chances.\nWorkers in Wuhan in service-related jobs would have to take a coronavirus test this week, the government announced, proving they had a clean bill of health before they could leave the city, Reuters reported.\nThe order will affect workers in security, nursing, education and other fields that come with high exposure to the general public, according to the edict, which came down from the country’s National Health Commission."
df = pd.DataFrame({'article':[string]})
def findKeys(string):
    string_list = string.strip().lower().split('\n')
    flag = 0
    keywords = ['china', 'covid-19', 'wuhan']
    # Checking if the article has more than 4 sentences
    if len(string_list) > 4:
        # iterating over the first four sentences
        for i in range(4):
            # iterating over the keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    # Else block is executed when the article has 4 or fewer sentences
    else:
        # iterating over all sentences in string_list
        for i in range(len(string_list)):
            # iterating over the keywords list
            for key in keywords:
                # checking if the sentence contains any keyword
                if key in string_list[i]:
                    flag = 1
                    break
    if flag == 0:
        return False
    else:
        return True
and then call the pandas apply method on df:
df['Contains Keywords?'] = df['article'].apply(findKeys)
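To actually drop the non-matching rows, the new boolean column can then be used as a mask (a small follow-up step, not part of the original answer):

filtered_df = df[df['Contains Keywords?']]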
First I create a series which contains just the first four sentences from the original df['articles'] column, and convert it to lower case, assuming that searches should be case-insensitive.
articles = df['articles'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4])).str.lower()
Then use a simple boolean mask to filter only those rows where the keywords were found in the first four sentences.
df[(articles.str.contains("covid")) & (articles.str.contains("chinese") | articles.str.contains("china"))]
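A variant that folds the lower-casing into the mask itself, via the case parameter of str.contains (a sketch equivalent to the above):

first4 = df['articles'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[:4]))
mask = first4.str.contains("covid-19", case=False) & first4.str.contains("china|chinese", case=False)
filtered = df[mask]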
Here is a plain-loop alternative (with placeholder keywords):

found = []
s1 = "hello"
s2 = "good"
s3 = "great"
for string in article:
    if s1 in string and (s2 in string or s3 in string):
        found.append(string)

Regular expression not finding all the results

I am trying to clean up text from an actual English dictionary as my source. I have already written a Python program which loads the data from a .txt file into a SQL DB in separate columns: id, word, definition. In the next step, I am trying to define what 'type' of word it is by fetching, from the definition of the word, strings like n. for noun, adj. for adjective, adv. for adverb, and so on and so forth.
Now, using the following regex, I am trying to extract all words that end with a '.', like adv./abbr./n./adj. etc., and get a histogram of all such words to see what the different types can be. My assumption here is that such words will obviously be more frequent than normal words which end with '.', but even so I plan to check the top results manually to confirm. Here's my code:
for row in cur:
    temp_var = re.findall('\w+[.]+ ', split)
    if len(temp_var) >= 1:
        temp_var = temp_var.pop()
        typ_dict[temp_var] = typ_dict.get(temp_var, 0) + 1
for key in typ_dict:
    if typ_dict[key] > 50:
        print(key, typ_dict[key])
After running this code I am not getting the desired result: my counts are far lower than the actual numbers of occurrences in the definitions. I have tested the word 'Abbr.', which this code says occurs 125 times, but if you change the regex '\w+[.]+ ' to 'Abbr. ' the result shoots up to 186. I am not sure why my regex is not capturing all the occurrences.
Any idea as to why I am not getting all the matches?
Edit:
Here is the type of text I am working with
Aback - adv. take aback surprise, disconcert. [old english: related to *a2]
Abm - abbr. Anti-ballistic missile
Abnormal - adj. Deviating from the norm; exceptional. abnormality n. (pl. -ies). Abnormally adv. [french: related to *anomalous]
This is broken down in two: the word, and the rest as the definition, and is loaded into a SQL table.
If you are using a dictionary to count items, then the best variant of a dictionary to use is Counter from the collections package. But you have another problem with your code: you check temp_var for length >= 1, but then you only do one pop operation. What happens when findall returns multiple items? You also rebind the name with temp_var = temp_var.pop(), which would prevent you from popping more items even if you wanted to. So the result is to count just the last match of each row.
from collections import Counter

counters = Counter()
for row in cur:
    temp_var = re.findall('\w+[.]+ ', split)
    for x in temp_var:
        counters[x] += 1
for key in counters:
    if counters[key] > 50:
        print(key, counters[key])
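As a side note, Counter can also ingest the whole findall result in one call, and most_common() returns the histogram already sorted for the manual review step (a sketch; split stands in for the row's definition text, as in the question):

counters.update(re.findall('\w+[.]+ ', split))
print(counters.most_common(20))  # the 20 most frequent candidate word types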

Python Regex - Extract text between (multiple) expressions in a textfile

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.
I want to extract all the text which lies between two expressions in a textfile (the beginning and end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a bunch of files; below is an example of what such a textfile looks like. I want to extract all text starting from "Dear" up to "Douglas". In cases where "letter_end" has no match, i.e. no letter_end expression is found, the output should start from the letter beginning and run to the very end of the text file being analyzed.
Edit: the end of the recorded text has to be after the match of "letter_end" and before the first line with 20 characters or more (as is the case for "Random text here as well" -> len=24).
"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
This is my code so far, but it is not able to flexibly catch the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end"):
import re

letter_begin = ["dear", "to our", "estimated"] # all expressions for the beginning of the letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # all expressions for the ending of the letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    text = str(text)
    output = re.findall(regex, text, re.MULTILINE | re.DOTALL | re.IGNORECASE) # record all text between the begin and end expressions
    print(output)
I am very thankful for any help!
You may use
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
This pattern will result in a regex like
(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
See the regex demo. Note you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant.
Details
(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ chars, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ chars other than newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed with any 0+ chars other than newline.
Python demo code:
import re
text="""Some random text here
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))
Output:
['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
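One defensive tweak, not needed for the sample data: if the letter_begin / letter_end lists ever grow entries containing regex metacharacters (a dot, a parenthesis), escape them before joining so they are matched literally:

openings = "|".join(map(re.escape, letter_begin))
closings = "|".join(map(re.escape, letter_end))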

Find and replace for millions of rows with regex in Python

I have two sets of data.
The first, which serves as a dictionary, has two columns, keyword and id, and 180,000 rows. Below is some sample data.
Also, note that keywords can be as short as 2 characters and as long as 700 characters, with no fixed length, although each id has the fixed pattern of a 3-digit number with a hash symbol before and after it.
keyword id
salesman #123#
painter #486#
senior painter #215#
The second file has one column, corpus; it runs to 22 million records, and the length of each record varies between 10 and 1000 characters. Below is sample data, which can be considered the input.
corpus
I am working as a salesman. salesmanship is not my forte, however i have become a good at it
I have been a painter since i was 19
are you the salesman?
Output
corpus
I am working as a #123#. salesmanship is not my forte, however i have become a good at it
I have been a #486# since i was 19
are you the #123#?
Please note that I want to replace complete words only, not overlapping words: in the first sentence, salesman is replaced with #123#, whereas salesmanship is not replaced with #123#ship. This requires me to add the regular expression '\b' before and after the keyword, which is why regex is important for the search function.
So this is a search-and-replace operation over multi-million rows using regex. I have read
Mass string replace in python?
and
Speed up millions of regex replacements in Python 3; however, it is taking me days to do this find and replace, which I can't afford as this is a weekly task. I want to be able to do it much faster. Below is my code:
Id = df_dict.Id.tolist()
# convert keywords to a list of word-bounded regex patterns
keyword = [r'\b' + x + r'\b' for x in df_dict.keyword]
# light on memory: drop the dictionary frame
del df_dict
# replace; regex=True is required, otherwise the \b anchors are matched literally
df_corpus["corpus_text"].replace(keyword, Id, regex=True, inplace=True)
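Since the list form above forces a regex pass per keyword, one commonly suggested speed-up (a sketch, not from the original post) is to compile all keywords into a single alternation and make exactly one pass per corpus row, resolving each hit through a dict. Sorting the alternatives longest-first stops "painter" from shadowing "senior painter":

import re
import pandas as pd

# Hypothetical frames mirroring the sample data above
df_dict = pd.DataFrame({'keyword': ['salesman', 'painter', 'senior painter'],
                        'id': ['#123#', '#486#', '#215#']})
df_corpus = pd.DataFrame({'corpus_text': [
    'I am working as a salesman. salesmanship is not my forte, however i have become a good at it',
    'I have been a painter since i was 19',
    'are you the salesman?']})

mapping = dict(zip(df_dict.keyword, df_dict.id))
# One compiled pattern; \b keeps "salesmanship" from matching "salesman",
# and re.escape guards against metacharacters inside keywords.
pattern = re.compile(
    r'\b(?:' + '|'.join(sorted(map(re.escape, mapping), key=len, reverse=True)) + r')\b'
)
df_corpus['corpus_text'] = df_corpus['corpus_text'].apply(
    lambda s: pattern.sub(lambda m: mapping[m.group(0)], s)
)
print(df_corpus['corpus_text'].tolist())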

Python: Using regex to find the last pair of occurence

Attached is a text file that I want to parse. I want to select the text within the last occurrence of this combination of words:
(1) Item 7 Management Discussion Analysis
(2) Item 8 Financial Statements
I would usually use a regex as follows:
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)
You can see in the text file that the combination of Item 7 and Item 8 occurs often, but if I find the last match of (1) and the last match of (2), I greatly increase the probability of grabbing the desired text.
The desired text in my text file starts with:
"'This Item 7, Management's Discussion and
Analysis of Financial Condition and Results of Operations, and other
parts of this Form 10-K contain forward-looking statements, within the
meaning of the Private Securities Litigation Reform Act of 1995, that
involve risks and..... "
and ends with:
"Item 8.
Financial Statements and Supplementary Data"
How can I adapt my regex code to grab this last pair between Item 7 and Item 8?
UPDATE:
I am also trying to parse this file using the same items.
This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).
import re

discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
    with open(input_file_name) as f:
        doc = f.read()
    item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
    discussion_text = r"[\S\s]*"
    item8 = r"Item 8\.*\s*Financial Statements"
    discussion_pattern = item7 + discussion_text + item8
    results = re.findall(discussion_pattern, doc)
    # Some input files have a table of contents and others don't,
    # so just keep the last match
    discussion = results[len(results) - 1]
    discussions.append((input_file_name, discussion))
The discussions variable contains the results for each of the data files.
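For instance, to sanity-check what was captured from each file (a hypothetical inspection step):

for input_file_name, discussion in discussions:
    print(input_file_name, len(discussion), "characters captured")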
This was the original solution. It does not work for the new file, but it does show the use of named groups. (I am not familiar with StackOverflow protocol here; should I delete this old code?)
By using longer match strings, the number of matches can be reduced to just 2 for both item 7 and item 8: the table of contents and the actual section.
So search for the second occurrence of item 7, and keep all text until item 8. This code uses Python named groups.
import re

with open('Output2.txt') as f:
    doc = f.read()

item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"
discussion_pattern = re.compile(
    r"(?P<item7>" + item7 + ")"
    r"([\S\s]*)"
    r"(?P<item7heading>" + item7 + ")"
    r"(?P<discussion>[\S\s]*)"
    r"(?P<item8heading>" + item8 + ")"
)
match = re.search(discussion_pattern, doc)
discussion = match.group('discussion')
Use this pattern with the s (DOTALL) option:
.*(Item 7.*?Item 8)
The result is in capturing group #1 (see the regex demo).
.        # any character (with the s option, line breaks included)
*        # zero or more (greedy)
(        # start of capturing group 1
Item 7   # literal "Item 7"
.        # any character
*?       # zero or more (lazy)
Item 8   # literal "Item 8"
)        # end of capturing group 1
Try this; a lookahead has been added so that the match is not followed by a later Item 7 or Item 8 section:
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))", text, re.DOTALL)
