Regex to find a name from a paragraph

I am trying to use a regex to find the customer name and address in an invoice.
I have tried the regex patterns below:
^([A-Za-z])+$
^[A-Za-z]+(((\'|\-|\.)?([A-Za-z])+))?$
^[A-Za-z]+((\s)?((\'|\-|\.)?([A-Za-z])+))*$
^[A-Za-z]+((\s)?([A-Za-z])+)*$
If a regex is not suitable, is there any other way to find the customer name and address in the given data?
The invoice data is below:
09/06/2020 Browntape.com | Orders | HtmlInvoice
Original for Receipient
Duplicate for Supplier/Transporter
TAX INVOICE Triplicate for Supplier
ITU TAMAA i
38826532-601 1257816 MLO380675700
comapy
GSTIN 1 29145AKCA223551ZK Invoice Date : 8 Jun 2020
Branch : Karnataka Invoice No. : SSINV/17-18/0887480
PAN =NA Reference No. : 38826532-601 1257816
Place of Supply AS Payment Type : PAID
Customer Name Billing Address Shipping Address
Saswati Saswati Saswati
Addreesss, Addreesss,
Customer GSTIN Khanamukh , Guwahati, AS, India, 781014 Khanamukh , Guwahati, AS, India, 781014
Ph: 1234567890 Ph: 1234567890
Pre-Tax Pre-Tax| pre-tax Taxable
Unit Unit -,
Discount| Shipping Oni NR tans
(INR) (INR)
Desc. of Goods
Z|1880COREABLUE032DD
(ZI1880COREABLUE032DD) 42.86 384.76
42.86 0 384.76
Taxable Amount
Total Tax
Invoice Total
Invoice Total(In Words) | INR Four Hundred and Four
We hope that you like the iterns that you have received. If there is anything about your products that you
are not happy with, please let us know using the contact details below and we will be happy to help you.
For Milastar Retail Pvt Ltd Authorised
We would be grateful for your positive feedback about our service. Signatory
Thank you for your business, we hope to see you again soon!
Warehouse Address: company, Warehouse no 2,
Company Address: company, Karnataka, addreess,
Karnataka, 562123, India, manjunath#zivame.com, company
app.browntape.com/orders/html_invoice/5561154968 Vi
My expected result is:
Saswati Address: Khanamukh , Guwahati, AS, India, 781014 Khanamukh , Guwahati, AS, India, 781014

Customer [A-Z]*\s(([\w\s]+,)*([\d\s]+))
captures the address in this particular example, but may need to be customised for other invoices. The captured text can be further processed in Python with the group() method of the match object.
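As a rough sketch of how the pattern above could be applied, the snippet below runs it on an excerpt of the invoice text and reads the address from group(1). The second pattern, NAME_RE, is not part of the answer: it is a hypothetical helper that assumes the customer name is the first word on the line following the "Customer Name" header, which holds for this sample but may not hold for other invoices.

```python
import re

# Excerpt of the OCR text from the question (the lines around the customer block).
invoice = (
    "Customer Name Billing Address Shipping Address\n"
    "Saswati Saswati Saswati\n"
    "Addreesss, Addreesss,\n"
    "Customer GSTIN Khanamukh , Guwahati, AS, India, 781014 "
    "Khanamukh , Guwahati, AS, India, 781014\n"
    "Ph: 1234567890 Ph: 1234567890\n"
)

# Pattern from the answer: comma-separated chunks after "Customer GSTIN",
# ending in a run of digits and whitespace (the postal code).
ADDRESS_RE = re.compile(r"Customer [A-Z]*\s(([\w\s]+,)*([\d\s]+))")

# Assumption (not from the answer): the name is the first word on the
# line that follows the "Customer Name ..." header line.
NAME_RE = re.compile(r"Customer Name[^\n]*\n(\w+)")

address_match = ADDRESS_RE.search(invoice)
name_match = NAME_RE.search(invoice)

# group(1) holds the whole address; strip() drops the trailing newline.
address = address_match.group(1).strip() if address_match else None
name = name_match.group(1) if name_match else None

print(name)
print(address)
```

Because the billing and shipping columns are flattened onto one line by the OCR step, the address is captured twice in a row, which matches the expected result in the question.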

Related

Python/regex - Bypass table of contents when extracting text

I have a dataframe consisting of the following:
identifier  text
34678       0000950123-04-010521.txt : 20040901.....
87902       0000950123-04-010521.txt : 20040901.....
I am trying to extract a portion of text from the "text" variable in Python that follows a line starting with "Item 5.02". I am placing the extracted text in a new variable called "important_text". With the help of fellow stack overflowers, I was able to construct the following code to extract the text:
pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|Item\s+7\.01|SIGNATURES|SIGNATURE|' + r'Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'.replace(' ', r'\s*')
pd_00['important_text'] = pd_00['text'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
So, this is extracting all text between the first occurrence of "Item 5.02" and the first occurrence of various terms (i.e., "Item 8.01", "Item 9.01", "SIGNATURES", etc.).
In general, this does a really good job of extracting the portion of text I am looking for. However, in some instances, the text variable contains a Table of Contents that will have a line starting with "Item 5.02". In these instances, the regex code does not grab the portion of text I need. Does anyone have any advice for how to bypass the Table of Contents?
Here is an example that includes a Table of Contents (apologies for the large amount of text...I thought it would be best to give a full example):
<SEC-DOCUMENT>0000950137-05-007782.txt : 20050623
<SEC-HEADER>0000950137-05-007782.hdr.sgml : 20050623
<ACCEPTANCE-DATETIME>20050623154401
ACCESSION NUMBER: 0000950137-05-007782
CONFORMED SUBMISSION TYPE: 8-K/A
PUBLIC DOCUMENT COUNT: 3
CONFORMED PERIOD OF REPORT: 20050511
ITEM INFORMATION: Entry into a Material Definitive Agreement
ITEM INFORMATION: Departure of Directors or Principal Officers; Election of
Directors; Appointment of Principal Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20050623
DATE AS OF CHANGE: 20050623
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: HILLENBRAND INDUSTRIES INC
CENTRAL INDEX KEY: 0000047518
STANDARD INDUSTRIAL CLASSIFICATION: MISCELLANEOUS FURNITURE & FIXTURES [2590]
IRS NUMBER: 351160484
STATE OF INCORPORATION: IN
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 8-K/A
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-06651
FILM NUMBER: 05912533
BUSINESS ADDRESS:
STREET 1: 700 STATE ROUTE 46 E
CITY: BATESVILLE
STATE: IN
ZIP: 47006-8835
BUSINESS PHONE: 8129347000
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K/A
<SEQUENCE>1
<FILENAME>c96192ae8vkza.htm
<DESCRIPTION>AMENDMENT TO CURRENT REPORT
<TEXT>
e8vkza
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 8-K/A
CURRENT REPORT
Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934
Date of Report (Date of earliest event reported): May 11, 2005
HILLENBRAND INDUSTRIES, INC.
(Exact name of registrant as specified in its charter)
Indiana (State or other jurisdiction of incorporation) 1-6651 (Commission File Number) 35-1160484 (IRS Employer Identification No.)
700 State Route 46 East Batesville, Indiana (Address of principal executive offices) 47006-8835 (Zip Code)
Registrants telephone number, including area code: (812) 934-7000
Not Applicable
(Former name or former address, if changed since last report.)
Check the appropriate box below if the Form 8-K filing is intended to simultaneously satisfy
the filing obligation of the registrant under any of the following provisions:
o Written communications pursuant to Rule 425 under the Securities Act (17 CFR 230.425)
o Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR 240.14a-12)
o Pre-commencement communications pursuant to Rule 14d-2(b) under the Exchange Act
(17 CFR 240.14d-2(b))
o Pre-commencement communications pursuant to Rule 13e-4(c) under the Exchange
Act
(17 CFR 240.13e-4(c))
1
TABLE OF CONTENTS
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS
Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS
SIGNATURES
EXHIBIT INDEX
Employment Agreement
Stock Award
Table of Contents
Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT.
Item 5.02. DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
As in the above example, the Table of Contents will usually start with "Table of Contents" and end with "Table of Contents". To further complicate things, the text will sometimes contain a stray "Table of Contents" towards the beginning (this is also shown in the above example).
Here is what I would like to extract:
DEPARTURE OF DIRECTORS OR PRINCIPAL OFFICERS; ELECTION OF DIRECTORS;
APPOINTMENT OF PRINCIPAL OFFICERS.
As previously disclosed, on May 11, 2005 Hillenbrand Industries, Inc.s Board of
Directors
appointed Rolf A. Classon to serve as President and Chief Executive Officer of
Hillenbrand on an
interim basis. At the time the Form 8-K announcing this appointment was filed,
Mr. Classon did not
have an employment agreement with Hillenbrand.
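One possible way to skip the Table of Contents entry, sketched here under stated assumptions rather than offered as a definitive fix: collect every "Item 5.02" match with re.finditer and keep the longest one, on the assumption that the real section body is longer than its one-line entry in the table of contents. The function name extract_item_502 is hypothetical, the terminator list is shortened for readability, and the optional \.? after "5.02" is an addition so that the captured text does not start with a period.

```python
import re

# Same idea as the pattern in the question, with a shortened terminator list.
ITEM_502_RE = re.compile(
    r"\bItem\s+5\.02\.?\s*([\w\W]*?)"
    r"(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|Item\s+7\.01|SIGNATURES?)\b)",
    re.IGNORECASE,
)

def extract_item_502(text):
    """Return the longest 'Item 5.02' section found in text, or None."""
    candidates = [m.group(1).strip() for m in ITEM_502_RE.finditer(text)]
    # Heuristic: a table-of-contents entry is short, the real section is long.
    return max(candidates, key=len) if candidates else None

sample = (
    "TABLE OF CONTENTS\n"
    "Item 1.01 ENTRY INTO A MATERIAL DEFINITIVE AGREEMENT\n"
    "Item 5.02. DEPARTURE OF DIRECTORS\n"
    "Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS\n"
    "SIGNATURES\n"
    "Table of Contents\n"
    "Item 5.02. DEPARTURE OF DIRECTORS.\n"
    "As previously disclosed, the Board appointed an interim CEO.\n"
    "Item 9.01. FINANCIAL STATEMENTS AND EXHIBITS.\n"
)

section = extract_item_502(sample)
print(section)
```

With a dataframe like the one in the question, the function could be applied row by row, for example with pd_00['text'].apply(extract_item_502). The length heuristic would pick the wrong match if a filing had an unusually short Item 5.02 body, so it is worth spot-checking the results.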

Finding company matches in a list of financial news

I have a dataframe with company ticker ("ticker"), full name ("longName") and short name ("unofficial_name"). The short name is created from the long name by removing suffixes such as inc. and plc.
I also have a separate dataframe with company news: date of the news ("date"), headline ("name"), news text ("text") and sentiment analysis scores.
I am trying to find company name matches in the list of articles and create a new dataframe with unique company-article matches (i.e. if one article mentions more than one company, this article would have one row per company mentioned).
I tried to execute the matching based on the "unofficial_name" with the following code:
dict=[]
for n, c in zip(df_news["text"], sp500_names["unofficial_name"]):
    if c in n:
        x = {"text":n, "unofficial_name":c}
        dict.append(x)
print(dict)
But I get an empty list returned. Any ideas how to solve it?

sp500_names
   ticker  longName                      unofficial_name
0  A       Agilent Technologies, Inc.    Agilent Technologies
1  AAL     American Airlines Group Inc.  American Airlines Group
df_news
   name                                                date        text                                                neg    neu    pos    compound
0  Asian stock markets reverse losses on global p...   2020-03-01  [By Tom Westbrook and Swati Pandey SINGAPORE (...   0.086  0.863  0.051  -0.9790
1  Energy & Precious Metals - Weekly Review and C...   2020-03-01  [By Barani Krishnan Investing.com - How much ...    0.134  0.795  0.071  -0.9982
Thank you!
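A likely reason for the empty list is that zip pairs the first article with the first company, the second article with the second company, and so on, stopping at the shorter sequence, so most article-company combinations are never checked. A minimal sketch of checking every article against every name follows. The function name find_company_mentions and the sample strings are hypothetical, and plain substring matching is assumed to be good enough.

```python
from itertools import product

def find_company_mentions(articles, company_names):
    """Return one record per (article, company) pair where the name occurs."""
    matches = []
    # product() yields every article paired with every company name.
    for text, name in product(articles, company_names):
        # str() guards against non-string cells (the sample "text" values
        # in the question look like lists).
        if name in str(text):
            matches.append({"text": text, "unofficial_name": name})
    return matches

articles = [
    "Agilent Technologies and American Airlines Group both reported earnings.",
    "Oil prices fell on Monday.",
    "American Airlines Group cut its flight schedule.",
]
names = ["Agilent Technologies", "American Airlines Group"]

mentions = find_company_mentions(articles, names)
print(mentions)
```

With the frames from the question, the call would be find_company_mentions(df_news["text"], sp500_names["unofficial_name"]), and the returned list of records can be passed to pd.DataFrame to get one row per match. Substring matching is case-sensitive and can also hit names embedded in longer words, so the check may need tightening for real data.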

How can I get the max value from groupby object with multiple values?

Sorry if the question is confusing, I was not sure how to word it. Please let me know if this is a duplicate question.
I have a groupby object that looks like this:
us.groupby(['category_id', 'title']).sum()[['views']]
us
category_id      title                                    views
Autos & Vehicle  1980 toyota corolla liftback commercial  13061
                 1992 Chevy Lumina Euro commercial        18470406
                 2019 Chevrolet Silverado First Look      13061
Music            Backyard Boys                            133
                 Eminem - Song                            1223
                 Cardi B - Wap                            1111122
Travel & Events  Welcome to Winter PUNderland             437576
                 What Spring Looks Like Around The World  17554672
I want to keep only the row with the maximum views for each category, such as:
category_id      title                                    views
Autos & Vehicle  1992 Chevy Lumina Euro commercial        18470406
Music            Cardi B - Wap                            1111122
Travel & Events  What Spring Looks Like Around The World  17554672
How can I do this?
I tried the .first() method, and also something like us.groupby(['category_id', 'title']).sum()[['views']].sort_values(by='views', ascending=False)[:1], but that only gives the first row of the entire dataframe. Is there a function I can use to keep only the max value within each group?
Thank you!

You can try:
us_group = us.groupby(['category_id', 'title']).sum()[['views']]
(us_group.reset_index().sort_values(['views'])
.drop_duplicates('category_id', keep='last')
)
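As an alternative sketch to the sort-and-drop-duplicates approach above, the per-category maximum can also be picked with idxmax on a second groupby. The small frame below is made up from the values shown in the question, and it assumes pandas is installed.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
us = pd.DataFrame({
    "category_id": ["Autos & Vehicle"] * 3 + ["Music"] * 3 + ["Travel & Events"] * 2,
    "title": [
        "1980 toyota corolla liftback commercial",
        "1992 Chevy Lumina Euro commercial",
        "2019 Chevrolet Silverado First Look",
        "Backyard Boys",
        "Eminem - Song",
        "Cardi B - Wap",
        "Welcome to Winter PUNderland",
        "What Spring Looks Like Around The World",
    ],
    "views": [13061, 18470406, 13061, 133, 1223, 1111122, 437576, 17554672],
})

# Total views per (category, title), kept as ordinary columns.
us_group = us.groupby(["category_id", "title"], as_index=False)["views"].sum()

# idxmax gives, for each category, the row label of its largest "views".
top_rows = us_group.loc[us_group.groupby("category_id")["views"].idxmax()]
print(top_rows)
```

If two titles tie for the maximum within a category, idxmax keeps only the first one it encounters.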

How to convert text article with keywords to pandas data frame

I have about 5,000 text files similar to the one below. I want to extract the article text into one df column and the keywords, as a list, into another df column. I need this to get more training data.
In the sample below, the article I want to extract is everything from 'Addis Abeba' to 'private bank', and the keywords are all the terms after 'SUBJECT', without the percentages in brackets.
Sample of the dataset:
Addis Fortune
February 2011
Declaration? AU Action Needed in Favour of Democracy [opinion]
LENGTH: 692 words
Addis Abeba has been hosting delegates and heads of state for the AU Summit. It
is encouraging to see leaders of Africa discussing issues of continental
importance that accelerate the process of integration and thereby put Africa in
a better bargaining position in its relations with the outside world.
Indeed, "United We Stand, Divided We Fall."
It is time that the AU took a bold step to ensure that the leaders of the
continent win the hearts and minds of their citizens. It should ensure the
existence of democratic governments, which, at a minimum, guarantee popular
participation based on an acceptance of political equality among all citizens,
respect for civil liberties, and meaningful checks and balances on the power of
the executive.
This is also indispensable to the realisation of the age-old dream of the
formation of the United States of Africa. Donor countries and organisations also
have moral obligations to extend much needed support in this aspect.
Dawit Haile is a loan officer at a private bank.
SUBJECT: HEADS OF STATE & GOVERNMENT (90%); ELECTIONS (90%); INTERNATIONAL
ASSISTANCE (89%); INTERNATIONAL RELATIONS (73%); GROSS DOMESTIC PRODUCT (70%);
ECONOMIC NEWS (70%); EMBEZZLEMENT (68%); ELECTION FRAUD (68%) Ethiopia;
International Organizations and Africa
GEOGRAPHIC: AFRICA (96%); EGYPT (93%); UNITED STATES (93%); CHINA (92%);
ETHIOPIA (79%); TUNISIA (79%); ISRAEL (79%) Africa
LOAD-DATE: February 8, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
Copyright 2011 AllAfrica Global Media.
All Rights Reserved
2 of 1352 DOCUMENTS
Addis Fortune
February 2011
Gebrekidan Beyene's Prosecutors Repeat Request for 25 Years
BYLINE: Eden Sahle
LENGTH: 815 words
During the appeals hearing last week of Gebrekidan Beyene, a.k.a. Morocco,
general manager and a shareholder of a private limited company by the same name,
prosecutors of the Ethiopian Revenues and Customs Authority (ERCA) requested
almost the same sentence they originally had, in August 2010: a maximum jail
term and confiscation of properties.
However, the lower court's decision to mitigate the sentence was correct and the
Appeals Bench should release Gebrekidan, either as a free man or on parole, the
defence argued. His good behaviour in prison and the investment he had made in
his country should be counted as mitigating circumstances, the lawyer claimed,
also counting the defendant's poor health in mitigation. The case was adjourned
for a verdict until May 2, 2011.
An alleged similar offence involving money laundering and loan sharking against
Ayalew Tesema, board chairman and major shareholder of Ayat Real Estate, is
underway at the Federal High Court.
SUBJECT: LITIGATION (91%); JUSTICE DEPARTMENTS (90%); BANKING & FINANCE (90%);
EXCISE & CUSTOMS (90%); LIMITED LIABILITY COMPANIES (90%); SENTENCING (90%);
APPEALS (89%); LAW COURTS & TRIBUNALS (89%); JAIL SENTENCING (89%); LAWYERS
(89%); VERDICTS (89%); SUPREME COURTS (89%); FINES & PENALTIES (89%);
SETTLEMENTS & DECISIONS (78%); CRIMINAL CONVICTIONS (78%); DECISIONS & RULINGS
(78%); PRISONS (77%); SUITS & CLAIMS (77%); VALUE ADDED TAX (77%); JUDGES (73%);
INCOME TAX (72%); MONEY LAUNDERING (69%); COUNTERFEITING (68%); INTEREST RATES
(55%); ECONOMIC NEWS (55%) Ethiopia; Legal and Judicial Affairs
GEOGRAPHIC: MOROCCO (90%)
LOAD-DATE: March 1, 2011
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Newspaper
My expected result would be:
df
content keywords
1 'string article 1' [HEADS OF STATE & GOVERNMENT, ELECTIONS, ...]
2 'string article 2' [LITIGATION, JUSTICE DEPARTMENTS, ...]
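A minimal sketch of one way to parse this layout with regular expressions, under several assumptions drawn only from the two sample documents: the article body starts on the line after "LENGTH: N words", it ends at the "SUBJECT:" line, and the subject block ends at the next line that starts with an upper-case field label such as "GEOGRAPHIC:". The names parse_articles, DOC_RE and KEYWORD_RE are hypothetical.

```python
import re

# Body: after the LENGTH line up to SUBJECT. Subject block: up to the next
# upper-case "FIELD:" label at the start of a line, or the end of the text.
DOC_RE = re.compile(
    r"LENGTH:\s*\d+\s+words[ \t]*\n(?P<content>.*?)\nSUBJECT:(?P<subject>.*?)"
    r"(?=\n[A-Z-]+:|\Z)",
    re.DOTALL,
)
# A keyword is the text immediately before a "(NN%)" marker.
KEYWORD_RE = re.compile(r"([^;()]+?)\s*\(\d+%\)")

def parse_articles(raw):
    """Return a list of {'content', 'keywords'} records, one per article."""
    records = []
    for m in DOC_RE.finditer(raw):
        # Collapse hard line wraps so wrapped keywords become one string.
        content = " ".join(m.group("content").split())
        subject = " ".join(m.group("subject").split())
        keywords = [k.strip() for k in KEYWORD_RE.findall(subject)]
        records.append({"content": content, "keywords": keywords})
    return records

raw = (
    "Addis Fortune\nFebruary 2011\nSome headline [opinion]\n"
    "LENGTH: 692 words\n"
    "Addis Abeba has been hosting delegates. It\nis encouraging.\n"
    "Dawit Haile is a loan officer at a private bank.\n"
    "SUBJECT: HEADS OF STATE & GOVERNMENT (90%); ELECTIONS (90%); INTERNATIONAL\n"
    "ASSISTANCE (89%) Ethiopia;\nInternational Organizations and Africa\n"
    "GEOGRAPHIC: AFRICA (96%)\nLOAD-DATE: February 8, 2011\n"
)

records = parse_articles(raw)
print(records)
```

The list of records can be passed to pd.DataFrame to get the "content" and "keywords" columns. Terms without a percentage, such as the trailing "Ethiopia; International Organizations and Africa", are skipped by this sketch.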

Python regex search or match not working

I wrote this regex:
re.search(r'^SECTION.*?:', text, re.I | re.M)
re.match(r'^SECTION.*?:', text, re.I | re.M)
to run on this string:
text = 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:\n (a) within 95 days after the end of each fiscal year of the Parent,\n its audited consolidated balance sheet and related statements of income,\n cash flows and stockholders\' equity as of the end of and for such year,\n setting forth in each case in comparative form the figures for the previous\n fiscal year, all reported on by Arthur Andersen LLP or other independent\n public accountants of recognized national standing (without a "going\n concern" or like qualification or exception and without any qualification\n or exception as to the scope of such audit) to the effect that such\n consolidated financial statements present fairly in all material respects\n the financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied;\n (b) within 50 days after the end of each of the first three fiscal\n quarters of each fiscal year of the Parent, its consolidated balance sheet\n and related statements of income, cash flows and stockholders\' equity as of\n the end of and for such fiscal quarter and the then elapsed portion of the\n fiscal year, setting forth in each case in comparative form the figures for\n the corresponding period or periods of (or, in the case of the balance\n sheet, as of the end of) the previous fiscal year, all certified by one of\n its Financial Officers as presenting fairly in all material respects the\n financial condition and results of operations of the Parent and its\n consolidated Subsidiaries on a consolidated basis in accordance with GAAP\n consistently applied, subject to normal year-end audit adjustments and the\n absence of footnotes;\n '
and I was expecting the following output:
SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:
but I am getting None as the output.
Can anyone tell me what I am doing wrong here?

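By default, . does not match a newline, so .*? cannot cross the line break after "The Parent". The first line contains no colon, so the pattern fails and both search and match return None. A negated character class such as [^:] does match newlines, so you can use it instead to get the expected result: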
In [32]: m = re.search(r'^SECTION[^:]*?:', text, re.I | re.M)
In [33]: m.group(0)
Out[33]: 'SECTION 5.01. Financial Statements and Other Information. The Parent\nwill furnish to the Administrative Agent:'
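To make the behaviour concrete, here is a small sketch on a shortened copy of the text from the question. It compares the original pattern, the negated character class, and a third option that is not in the answer: keeping .*? and adding the re.DOTALL flag so that . also matches newlines.

```python
import re

# Shortened copy of the text from the question.
text = (
    "SECTION 5.01. Financial Statements and Other Information. The Parent\n"
    "will furnish to the Administrative Agent:\n"
    " (a) within 95 days after the end of each fiscal year of the Parent,\n"
)
expected = (
    "SECTION 5.01. Financial Statements and Other Information. The Parent\n"
    "will furnish to the Administrative Agent:"
)

# Original pattern: "." stops at the newline, and line 1 has no colon.
original = re.search(r"^SECTION.*?:", text, re.I | re.M)

# Negated character class: [^:] matches any character except ":",
# including newlines.
negated = re.search(r"^SECTION[^:]*:", text, re.I | re.M)

# Alternative: keep ".*?" but let "." match newlines with re.DOTALL.
dotall = re.search(r"^SECTION.*?:", text, re.I | re.M | re.S)

print(original)
print(negated.group(0) == expected, dotall.group(0) == expected)
```

Note that re.match only tries to match at the very start of the string, even with re.M, so re.search is usually the one to use when the section heading may appear later in the text.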
