Python regex search or match not working - python

I wrote this regex:
re.search(r'^SECTION.*?:', text, re.I | re.M)
re.match(r'^SECTION.*?:', text, re.I | re.M)
to run on this string:
text = 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:\n (a) within 95 days after the end of each fiscal year of the Parent,\n its audited consolidated balance sheet and related statements of income,\n cash flows and stockholders\' equity as of the end of and for such year,\n setting forth in each case in comparative form the figures for the previous\n fiscal year, all reported on by Arthur Andersen LLP or other independent\n public accountants of recognized national standing (without a "going\n concern" or like qualification or exception and without any qualification\n or exception as to the scope of such audit) to the effect that such\n consolidated financial statements present fairly in all material respects\n the financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied;\n (b) within 50 days after the end of each of the first three fiscal\n quarters of each fiscal year of the Parent, its consolidated balance sheet\n and related statements of income, cash flows and stockholders\' equity as of\n the end of and for such fiscal quarter and the then elapsed portion of the\n fiscal year, setting forth in each case in comparative form the figures for\n the corresponding period or periods of (or, in the case of the balance\n sheet, as of the end of) the previous fiscal year, all certified by one of\n its Financial Officers as presenting fairly in all material respects the\n financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied, subject to normal year-end audit adjustments and the\n absence of footnotes;\n '
and I was expecting the following output:
SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:
but I am getting None as the output.
Can anyone tell me what I am doing wrong here?

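By default, . does not match a newline, so .*? cannot get past the end of the first line. The first colon in your text is on the second line, which is why both calls return None. A negated character class such as [^:] does match newlines, so you can use it instead to get the expected result: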
In [32]: m = re.search(r'^SECTION[^:]*?:', text, re.I | re.M)
In [33]: m.group(0)
Out[33]: 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:'
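As an alternative sketch (not part of the answer above), the original pattern can be kept if the re.S (re.DOTALL) flag is added, since that flag lets . match newlines too. The shortened text below is only a stand-in for the full string in the question:

```python
import re

# Shortened stand-in for the text in the question.
text = ('SECTION 5.01. Financial Statements and Other Information. The Parent\n'
        'will furnish to the Administrative Agent:\n'
        ' (a) within 95 days after the end of each fiscal year of the Parent,\n')

# Without re.S, "." stops at the newline, so the lazy ".*?" never reaches the colon.
no_dotall = re.search(r'^SECTION.*?:', text, re.I | re.M)

# With re.S, "." also matches "\n" and the lazy quantifier stops at the first colon.
m = re.search(r'^SECTION.*?:', text, re.I | re.M | re.S)
heading = m.group(0)
print(repr(heading))
```

The negated character class is usually the tighter choice, because it states exactly which character ends the match.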

Related

Python/regex - Bypass table of contents when extracting text

I have a dataframe consisting of the following:
identifier   text
34678        0000950123-04-010521.txt : 20040901.....
87902        0000950123-04-010521.txt : 20040901.....
I am trying to extract a portion of text from the "text" variable in Python that follows a line starting with "Item 5.02". I am placing the extracted text in a new variable called "important_text". With the help of fellow Stack Overflowers, I was able to construct the following code to extract the text:
pattern = (r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|Item\s+7\.01|SIGNATURES|SIGNATURE|'
           + r'Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'.replace(' ', r'\s*'))
pd_00['important_text'] = pd_00['text'].str.extract(pattern, re.IGNORECASE, expand=False)
So, this is extracting all text between the first occurrence of "Item 5.02" and the first occurrence of various terms (i.e., "Item 8.01", "Item 9.01", "SIGNATURES", etc.).
In general, this does a really good job of extracting the portion of text I am looking for. However, in some instances, the text variable contains a Table of Contents that will have a line starting with "Item 5.02". In these instances, the regex code does not grab the portion of text I need. Does anyone have any advice for how to bypass the Table of Contents?
Here is an example that includes a Table of Contents (apologies for the large amount of text...I thought it would be best to give a full example):
<SEC-DOCUMENT>0000950137-05-007782.txt : 20050623
<SEC-HEADER>0000950137-05-007782.hdr.sgml : 20050623
<ACCEPTANCE-DATETIME>20050623154401
ACCESSION NUMBER: 0000950137-05-007782
CONFORMED SUBMISSION TYPE: 8-K/A
PUBLIC DOCUMENT COUNT: 3
CONFORMED PERIOD OF REPORT: 20050511
ITEM INFORMATION: Entry into a Material Definitive Agreement
ITEM INFORMATION: Departure of Directors or Principal Officers; Election of
Directors; Appointment of Principal Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20050623
DATE AS OF CHANGE: 20050623
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: HILLENBRAND INDUSTRIES INC
CENTRAL INDEX KEY: 0000047518
STANDARD INDUSTRIAL CLASSIFICATION: MISCELLANEOUS FURNITURE & FIXTURES [2590]
IRS NUMBER: 351160484
STATE OF INCORPORATION: IN
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 8-K/A
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-06651
FILM NUMBER: 05912533
BUSINESS ADDRESS:
STREET 1: 700 STATE ROUTE 46 E
CITY: BATESVILLE
STATE: IN
ZIP: 47006-8835
BUSINESS PHONE: 8129347000
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K/A
<SEQUENCE>1
<FILENAME>c96192ae8vkza.htm
<DESCRIPTION>AMENDMENT TO CURRENT REPORT
<TEXT>
e8vkza
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 8-K/A
CURRENT REPORT
Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934
Date of Report (Date of earliest event reported): May 11, 2005
HILLENBRAND INDUSTRIES, INC.
(Exact name of registrant as specified in its charter)
Indiana (State or other jurisdiction of incorporation) 1-6651 (Commission File Number) 35-1160484 (IRS Employer Identification No.)
700 State Route 46 East Batesville, Indiana (Address of principal executive offices) 47006-8835 (Zip Code)
Registrants telephone number, including area code: (812) 934-7000
Not Applicable
(Former name or former address, if changed since last report.)
Check the appropriate box below if the Form 8-K filing is intended to simultaneously satisfy
the filing obligation of the registrant under any of the following provisions:
o Written communications pursuant to Rule 425 under the Securities Act (17 CFR 230.425)
o Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR 240.14a-12)
o Pre-commencement communications pursuant to Rule 14d-2(b) under the Exchange Act
(17 CFR 240.14d-2(b))
o Pre-commencement communications pursuant to Rule 13e-4(c) under the Exchange
Act
(17 CFR 240.13e-4(c))
1
TABLE OF CONTENTS
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS
Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS
SIGNATURES
EXHIBIT INDEX
Employment Agreement
Stock Award
Table of Contents
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT.
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
As in the above example, the Table of Contents will usually start with "Table of Contents" and end with "Table of Contents". To further complicate things, the text will sometimes say "Table of Contents" at an arbitrary point towards the beginning of the text (this is also shown in the above example).
Here is what I would like to extract:
DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
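One possible heuristic, sketched here under the assumption that a Table of Contents entry is always shorter than the real section body: collect every "Item 5.02" match and keep the longest capture. The function name extract_longest_section, the simplified stop terms, and the sample text are all hypothetical; the full list of stop terms from the pattern above would go in their place.

```python
import re

# Simplified stop terms for illustration only; the question's pattern has more.
PATTERN = re.compile(
    r'\bItem\s+5\.02\.?\s*([\w\W]*?)(?=\s*(?:Item\s+\d\.\d{2}|SIGNATURES?)\b)',
    re.IGNORECASE,
)

def extract_longest_section(text):
    """Return the longest 'Item 5.02' capture, or None if nothing matches."""
    captures = [m.group(1) for m in PATTERN.finditer(text)]
    return max(captures, key=len) if captures else None

# Hypothetical, shortened filing with a table of contents before the body.
sample = (
    "TABLE OF CONTENTS\n"
    "Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT\n"
    "Item 5.02. DEPARTURE OF DIRECTORS\n"
    "Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS\n"
    "SIGNATURES\n"
    "Table of Contents\n"
    "Item 5.02. DEPARTURE OF DIRECTORS.\n"
    "As previously disclosed, the Board appointed an interim CEO.\n"
    "Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS.\n"
)
result = extract_longest_section(sample)
print(result)
```

On the dataframe this could be applied with pd_00['text'].apply(extract_longest_section). The heuristic would pick the wrong text if a filing had an unusually long Table of Contents entry, so it is worth spot-checking the results.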

Print corresponding value in pandas DF row

How do I print out entries in a df using a keyword search? I have a legislative database I'm running a list of climate keywords against:
climate_key_words = ['climate','gas','coal','greenhouse','carbon monoxide','carbon',\
'carbon dioxide','education',\
'gas tax','regulation']
Here's my for loop:
for bill in df.title:
    for word in climate_key_words:
        if word in bill:
            print(bill)
            print(word)
            print(df.state)
            print('------------')
When it prints, df.state forces everything to print funky:
24313 AK
24314 AK
24315 AK
24316 AK
24317 AK
Name: state, Length: 24318, dtype: object
------------
Relating to limitations on food regulations at farms, farmers' markets, and cottage food production operations.
regulation
But when print(df.state) is absent, it looks much nicer:
------------
Higher education; providing for the protection of certain expressive activities.
education
------------
Schools; allowing a school district board of education to amend certain policy to stock inhalers. Effective date. Emergency.
education
------------
How can I include df.state (and other values) and have them printed only once?
Ideally, my output should look like this:
###bill
###corresponding title
###corresponding state
print(df.state) is going to print out the column/field 'state'. You presumably want the state associated with that row of the dataframe?
So I would suggest tweaking your approach slightly and doing something like:
for row in range(dataframe.shape[0]):  # for each row in the dataframe
    for word in keywords:
        if word in dataframe.iloc[row]['bill']:
            print(dataframe.iloc[row]['bill'])  # iloc[row] selects the row; ['bill'] then selects the column
            print(dataframe.iloc[row]['state'])
            print(dataframe.iloc[row]['title'])
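A minimal sketch of the same idea using itertuples, which yields one row at a time so the title and state always belong together. The two-row dataframe and its state values are made up for illustration; only the column names title and state come from the question.

```python
import pandas as pd

# Hypothetical two-row stand-in for the legislative dataframe.
df = pd.DataFrame({
    "title": [
        "Higher education; providing for the protection of certain expressive activities.",
        "Relating to road repairs.",
    ],
    "state": ["OK", "AK"],
})
climate_key_words = ["climate", "education", "regulation"]

# Collect (title, keyword, state) for every keyword hit, row by row.
matches = []
for row in df.itertuples(index=False):
    for word in climate_key_words:
        if word in row.title:
            matches.append((row.title, word, row.state))

for title, word, state in matches:
    print(title)
    print(word)
    print(state)
    print("------------")
```

Because each row is handled as a unit, the state is printed once per hit rather than as the whole column.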

Filtering rows saving result in a new column

I have a dataset as follows
Name       Surname   Username   Tweet                           Tags
Matthew    Fields    m.fields   I love summertime               summer summertime sun holiday
Fion       Stewart   fion       It is time to enjoy ourselves   time
Christine  Bold      chris89    Enjoy your summer               summer
Vera       Lovable   v.lov2     It's sunny outside              sun summer holiday
I would like to search the following list of strings within three columns (Username, Tweet and Tags):
list_strings=['summer','summertime','sun','holiday']
I want to check whether at least one of these columns contains one or more of the terms above. The result should be saved in a new column, Terms from list, holding the terms found across all three columns with no duplicates (if the same term appears in more than one column, it should be listed only once).
The expected output would be:
Name       Surname   Username   Tweet                Tags                            Terms from list
Matthew    Fields    m.fields   I love summertime    summer summertime sun holiday   summer, summertime, sun, holiday
Christine  Bold      chris89    Enjoy your summer    summer                          summer
Vera       Lovable   v.lov2     It's sunny outside   sun summer holiday              sun, summer, holiday
Could you please give me any advice on how to do this and point me in the right direction? Thank you.
You can try str.contains
df=df[df['Tweet'].str.contains('|'.join(list_strings))]
For multiple columns:
df=df[df[['Tweet','Tags']].apply(lambda x : x.str.contains('|'.join(list_strings))).any(axis=1)]
Try the steps below.
Step 1: for each cell in the three columns, keep the terms from list_strings that appear as whole words in the cell. A cell with no matching terms gives an empty list ([]).
list_strings=['summer','summertime','sun','holiday']
step1 = df[['Username', 'Tweet', 'Tags']].applymap(lambda x: [string for string in list_strings if string in x.split(' ')])
Step 2: assign to the column "Terms in list" the unique elements of the combined lists from the three columns.
df['Terms in list'] = step1.apply(lambda x: set(x['Username'] + x['Tweet'] + x['Tags']), axis=1)
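Another way to build the new column, sketched with a single word-boundary regex per row. The helper terms_found and the two-row dataframe are hypothetical, and whole-word matching is assumed (so "sunny" does not count as "sun", which agrees with the expected output in the question):

```python
import re
import pandas as pd

list_strings = ['summer', 'summertime', 'sun', 'holiday']

# Hypothetical two-row stand-in for the dataset in the question.
df = pd.DataFrame({
    'Username': ['m.fields', 'fion'],
    'Tweet': ['I love summertime', 'It is time to enjoy ourselves'],
    'Tags': ['summer summertime sun holiday', 'time'],
})

# \b keeps 'sun' from matching inside 'sunny'; re.escape guards special characters.
word_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, list_strings)) + r')\b')

def terms_found(row):
    """Join the row's cells and list each matching term once, in list order."""
    found = word_re.findall(' '.join(row))
    return ', '.join(term for term in list_strings if term in found)

df['Terms from list'] = df[['Username', 'Tweet', 'Tags']].apply(terms_found, axis=1)
print(df['Terms from list'].tolist())
```

Rows with no matching term end up with an empty string, so they can be dropped afterwards with df[df['Terms from list'] != ''] if only matching rows are wanted.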

Regex to find a name from a paragraph

I am trying to use a regex to find the name and address in an invoice.
I have tried the regex patterns below:
^([A-Za-z])+$
^[A-Za-z]+(((\'|\-|\.)?([A-Za-z])+))?$
^[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*$
^[A-Za-z]+((\s)?([A-Za-z])+)*$
If regex is not suitable, is there any other way to find the customer name and address in this data?
The invoice data is below:
09/06/2020 Browntape.com | Orders | HtmlInvoice
Original for Receipient
Duplicate for Supplier/Transporter
TAX INVOICE Triplicate for Supplier
ITU TAMAA i
38826532-601 1257816 MLO380675700
comapy
GSTIN 1 29145AKCA223551ZK Invoice Date : 8 Jun 2020
Branch : Karnataka Invoice No. : SSINV/17-18/0887480
PAN =NA Reference No. : 38826532-601 1257816
Place of Supply AS Payment Type : PAID
Customer Name Billing Address Shipping Address
Saswati Saswati Saswati
Addreesss, Addreesss,
Customer GSTIN Khanamukh , Guwahati, AS, India, 781014 Khanamukh , Guwahati, AS, India, 781014
Ph: 1234567890 Ph: 1234567890
Pre-Tax Pre-Tax| pre-tax Taxable
Unit Unit -,
Discount| Shipping Oni NR tans
(INR) (INR)
Desc. of Goods
Z|1880COREABLUE032DD
(ZI1880COREABLUE032DD) 42.86 384.76
42.86 0 384.76
Taxable Amount
Total Tax
Invoice Total
Invoice Total(In Words) | INR Four Hundred and Four
We hope that you like the iterns that you have received. If there is anything about your products that you
are not happy with, please let us know using the contact details below and we will be happy to help you.
For Milastar Retail Pvt Ltd Authorised
We would be grateful for your positive feedback about our service. Signatory
Thank you for your business, we hope to see you again soon!
Warehouse Address: company, Warehouse no 2,
Company Address: company, Karnataka, addreess,
Karnataka, 562123, India, manjunath#zivame.com, company
app.browntape.com/orders/html_invoice/5561154968 Vi
My expected result is
Saswati Address: Khanamukh , Guwahati, AS, India, 781014 Khanamukh , Guwahati, AS, India, 781014
Customer [A-Z]*\s(([\w\s]+,)*([\d\s]+))
captures this particular example, but may need to be customised. The captured text can be further processed in Python with the group() method.
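A short sketch of how that pattern and group() could be used on the relevant lines of the invoice. The second pattern, which takes the first token on the line after the "Customer Name" header as the name, is an assumption added for illustration and is not part of the answer above:

```python
import re

# The relevant lines of the invoice from the question.
invoice_text = (
    "Customer Name Billing Address Shipping Address\n"
    "Saswati Saswati Saswati\n"
    "Addreesss, Addreesss,\n"
    "Customer GSTIN Khanamukh , Guwahati, AS, India, 781014 Khanamukh , Guwahati, AS, India, 781014\n"
    "Ph: 1234567890 Ph: 1234567890\n"
)

# Pattern from the answer: group(1) holds the address text after "Customer GSTIN".
address_match = re.search(r'Customer [A-Z]*\s(([\w\s]+,)*([\d\s]+))', invoice_text)
address = address_match.group(1).strip() if address_match else None

# Assumption: the customer name is the first token on the line after the header.
name_match = re.search(r'Customer Name[^\n]*\n(\S+)', invoice_text)
name = name_match.group(1) if name_match else None

print(name, "Address:", address)
```

Both patterns depend on the layout of this particular invoice, so other invoice formats would likely need their own patterns.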

Extracting first occurrence of number after specific keyword, for each instance of keyword (Python)

I'm relatively new at Python and still learning the basics of dataframes and text extraction.
I have a column of strings that may or may not contain the keyword "discount rate" several times. When "discount rate" is there, I would like to grab the FIRST set of numbers that come after that word, and put those into a new column as a string. The numbers don't always appear immediately after the word "rate" appears, sometimes there may be a word or two in between.
I'm looking for a way to grab this text for ALL instances of "discount rate".
Currently, my code is only grabbing all occurrences of number ranges, but I only want the ones after "discount rate". Here is a snapshot of my code:
df["ext"] = ""
for i, row in df.iterrows():
    df.loc[i, "ext"] = str(set(re.findall(r'\d+\.\d+%', df.loc[i, 'txt']))).strip()
The output of this code gives me a set of strings - which I am later splitting into multiple columns - like this:
{'13.0%', '3.5%', '2.5%', '11.0%'}
For reference, the strings usually look something like this:
...growth rates of 2.5% to 3.5% to xxx calendar year 2025 after-tax
free cash flows. Xxx alsoperformed a discounted cash flow
analysis of the xxx to calculate the present value of the after-tax xxxx that
xxx forecasted would be generated during calendar years 2015(using only the
fourth quarter of 2015) through 2025 and of the terminal value of the xxxx by
applying perpetuity growth rates of 1.0% to 2.0% to the calendar year 2025
after-tax free cash flows. The cash flows andterminal values were discounted
to present value as of September 30,2015 using discount rates ranging from
9.50% to 12.50%, which were based on an estimate of xxxs weighted average
cost of capital. This analysis indicated thefollowing approximate implied per
share equity value reference ranges for xxx as compared to the Merger
Consideration....
I could only make the code specific to the sample text you provided.
sample_text = '''...growth rates of 2.5% to 3.5% to xxx calendar year 2025 after-tax
free cash flows. Xxx alsoperformed a discounted cash flow
analysis of the xxx to calculate the present value of the after-tax xxxx that
xxx forecasted would be generated during calendar years 2015(using only the
fourth quarter of 2015) through 2025 and of the terminal value of the xxxx by
applying perpetuity growth rates of 1.0% to 2.0% to the calendar year 2025
after-tax free cash flows. The cash flows andterminal values were discounted
to present value as of September 30,2015 using discount rates ranging from
9.50% to 12.50%, which were based on an estimate of xxxs weighted average
cost of capital. This analysis indicated thefollowing approximate implied per
share equity value reference ranges for xxx as compared to the Merger
Consideration....'''
split_sample_text = sample_text.split()
discount_ranges = []
for index, word in enumerate(split_sample_text):
    # Look for the two-word keyword; slicing avoids an IndexError at the end of the list.
    if split_sample_text[index:index + 2] == ["discount", "rates"]:
        start_rate = None
        end_rate = None
        for position, rate in enumerate(split_sample_text[index + 2:], start=index + 2):
            if "%" in rate:
                try:
                    float(rate.rstrip("%,"))  # raises ValueError if the word is not a number
                    if not start_rate:
                        start_rate = rate
                    elif not end_rate:
                        end_rate = rate.rstrip(',')
                        break  # both ends of the range have been found
                except ValueError:
                    pass
            elif split_sample_text[position:position + 2] == ["discount", "rates"]:
                break  # the next keyword was reached before a full range was found
        if start_rate and end_rate:
            discount_ranges.append((start_rate, end_rate))
print(discount_ranges)
Gives us:
[('9.50%', '12.50%')]
If you paste your sample text three times, it will extract this same discount rate three times. Hope this helps!
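For the original request (only the first number after each occurrence of the keyword), a regex-based sketch is also possible. The function name first_rate_after_keyword and the sample sentence are hypothetical; the pattern accepts "discount rate" or "discount rates" and lazily skips ahead to the first percentage:

```python
import re

# "discount rate(s)", then as little text as possible, then the first percentage.
RATE_RE = re.compile(r'discount\s+rates?\b.*?(\d+(?:\.\d+)?%)', re.IGNORECASE | re.DOTALL)

def first_rate_after_keyword(text):
    """Return the first percentage following each 'discount rate(s)' occurrence."""
    return [m.group(1) for m in RATE_RE.finditer(text)]

# Hypothetical sample with two keyword occurrences.
sample = ("perpetuity growth rates of 2.5% to 3.5% using discount rates ranging from\n"
          "9.50% to 12.50%, and a discount rate of about 11.0% for the unit.")
rates = first_rate_after_keyword(sample)
print(rates)
```

On the dataframe from the question this could be applied with df['txt'].apply(first_rate_after_keyword). One caveat: if a keyword is not followed by any percentage before the next keyword, the match runs on and picks up the next keyword's number.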
