Suppose that I have a list of strings, and I want to substitute different parts of it using re.sub. My problem is that sometimes the text to be replaced contains special characters, so this function can't properly match it. One example:
import re
txt = 'May enbd company Ltd (Pty) Ltd, formerly known as apple shop Ltd., is a full service firm which is engaged in the sale and servicing of motor vehicles.'
re.sub('May enbd company Ltd (Pty) Ltd', 'PC (Pty) Ltd', txt)
Here the issue comes from ( and ), but other special characters that I'm not aware of yet may also appear. So I want these special characters to be treated literally and the text replaced with my preferred string. In this case, that means:
'PC (Pty) Ltd, formerly known as apple shop Ltd., is a full service firm which is engaged in the sale and servicing of motor vehicles.'
If no special functionality of regular expressions is needed, this is easier to do with the str.replace method:
txt = 'May enbd company Ltd (Pty) Ltd, formerly known as apple shop Ltd., is a full service firm which is engaged in the sale and servicing of motor vehicles.'
result = txt.replace('May enbd company Ltd (Pty) Ltd', 'PC (Pty) Ltd')
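If the regular-expression machinery is actually needed (for example, to combine several replacements into one pattern), re.escape makes every special character in the search string match literally. A minimal sketch:
import re
txt = 'May enbd company Ltd (Pty) Ltd, formerly known as apple shop Ltd., is a full service firm which is engaged in the sale and servicing of motor vehicles.'
# re.escape backslash-escapes metacharacters such as ( and ), so they match literally
pattern = re.escape('May enbd company Ltd (Pty) Ltd')
result = re.sub(pattern, 'PC (Pty) Ltd', txt)
print(result)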
I'm working on an NLP project and using Spacy. I have identified different entities using Spacy's NER, and I want to remove the ORG entities (those identified as Organisations) from the original input string.
doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"
import spacy

NER = spacy.load('en_core_web_sm')  # assuming an English pipeline is installed
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']
Output of org_stopwords:
['The Financial Times ', 'Abu Dhabi and Lagos', 'Bloomberg ']
This is my code now. I've identified and made a list of everything Spacy labels as ORG, but now I don't know how to remove those from the string. One problem with simply splitting the string and removing the org_stopwords is that the org_stopwords are n-grams. Please help with a coded example of how to tackle this issue.
Use regex instead of replace
import re
org_stopwords = ['The Financial Times',
'Abu Dhabi ',
'U.S. News Editor',
'Independent',
'ANDREW']
regex = re.compile('|'.join(org_stopwords))
new_doc = re.sub(regex, '', doc)
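One caveat with that pattern: the entries in org_stopwords can themselves contain regex metacharacters (the dots in 'U.S. News Editor' match any character, for instance). Escaping each entry before joining is safer; a sketch:
import re
# re.escape makes each stopword match literally inside the alternation
regex = re.compile('|'.join(re.escape(org) for org in org_stopwords))
new_doc = regex.sub('', doc)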
I have two Python lists:
li = ['206 Brookwood Center Drive Suite 508, WMP, Birmingham, AL 35111',
'340 Independence Drive, Homewood, AL 35209',
'41 Doell Drive Southeast, Huntsville, AL 35801',
'3 Mobile Circle, Suite 401, Mobile, AL 36607',
'7209 Copperfield Drive, Montgomery, AL 36117']
mi = ['340 Independence Dr Homewood, AL 35209',
'41 Doell Dr SE, Ste 24 Huntsville, AL 35801',
'3 Mobile Cir, Ste 401 Mobile, AL 36607',
'36 Saint Lukes Dr Montgomery, AL 36117',
'91 Kanis Rd, Ste 300 Little Rock, AR 72205',
'25 S Dobson Rd, Bldg J Chandler, AZ 85224']
I want to loop through li and see if a record does not exist in mi, using some kind of partial text match. I tried startswith and in, but this fails because of differences like "Drive" vs. "Dr" and "Suite" vs. "Ste". Any suggestions? Would some kind of Python regex work?
The output should be '206 Brookwood Center Drive Suite 508, WMP, Birmingham, AL 35111' and '7209 Copperfield Drive, Montgomery, AL 36117'.
If you are doing this for fun, remember that addresses are read from the bottom up, because getting the letter to at least the right building is the biggest step:
City, state and ZIP are the most important.
The street address is the second most important, along with the apt #.
The addressee is the least important item.
The following two methods have a significant advantage over anything you might do with any kind of alias list for abbreviations, etc.: they compare the "standardized" address against a database of all deliverable addresses:
If you are doing a one-off project and confidentiality is not an issue, you can use the U.S. Post Office website for zip code lookup. It will return the standardized address as well. You can automate its use to some extent.
If you are going to do anywhere over 1,000 addresses on a recurring basis, get an address standardization software package, usually in the form of mailing software. $600US/year upwards.
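If you do want a quick, purely in-Python approximation for a small one-off list (with all the caveats above about alias lists), one rough sketch is to expand a few known abbreviations and fuzzy-match with difflib; the abbreviation map and the 0.8 threshold here are guesses, not a standard:
import difflib

# hypothetical abbreviation map; a real one would need to be much larger
ABBREV = {'dr': 'drive', 'ste': 'suite', 'se': 'southeast', 'cir': 'circle'}

def normalize(addr):
    words = addr.lower().replace(',', ' ').split()
    return ' '.join(ABBREV.get(w, w) for w in words)

def found_in(record, candidates, threshold=0.8):
    # fuzzy-compare the normalized strings; the threshold is arbitrary
    return any(difflib.SequenceMatcher(None, normalize(record),
                                       normalize(c)).ratio() >= threshold
               for c in candidates)

missing = [record for record in li if not found_in(record, mi)]
print(missing)
For the two sample lists this should flag the Birmingham and Copperfield addresses, but it will misbehave as soon as an abbreviation is missing from the map.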
I have this website, and I would like to pull all the company names from it via Python, such as West Wood Events or Mitchell Event Planning.
But I am stuck on soup.find, since it returns [].
When I inspect the page, I see, let's say, this:
<div class="LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd">Mitchell Event Planning<wbr></div>
For that element I would write:
week = soup.find(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
print(week)
And I get 0.
Am I missing something? I'm pretty new to this.
This string is not a single class but many classes separated by spaces.
In some modules you would have to use the original string with all its spaces, but in BeautifulSoup it seems you have to use the classes separated by a single space.
The code works for me if I use a single space between LinesEllipsis and vendor-name--55315.
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
Or if I use a CSS selector with a dot before every class in the string:
week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
Minimal working code
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.theknot.com/marketplace/wedding-planners-acworth-ga?page=2'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
#week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
for item in week:
    print(item.text)
Result:
The Charming Details
Enraptured Events
pearl and sky events - planning, design and florals
Unique Occasions ByTNicole, Inc
Platinum Eventions
RED COMPANY ATLANTA
Pop + Fizz: Event Planning and Design
Patricia Elizabeth, certified wedding planners/producer
Rienza Events
Pollyanna Richter Weddings
Calabash Events, Inc.
Weddings by Carmona LLC
Emily Jordan Events
Perfectly Taylored Events
Lindsey Wise Designs
Elegant Weddings and Affairs
Party PLANit
Wedded Bliss
Above the Fray
Willow Jaymes Events
Coco Red Events
Liz D. Events, LLC
Leslie Cox Events
YSE LLC
Marmaros Productions
PerfectionsID, LLC
All Things Love
West Wood Events
Everlasting Elegance
Prestigious Occasions
I am trying to do some text processing on a corpus which has emails.
I have a main directory, under which I have various folders. Each folder has many .txt files, and each .txt file basically contains email conversations.
To give an example of what my text files look like, I am copying a similar-looking text file of emails from the publicly available Enron email corpus. I have the same type of text data, with multiple emails in one text file.
An example text file can look like below:
Message-ID: <3490571.1075846143093.JavaMail.evans#thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean#enron.com
To: kelly.kimberly#enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings#ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron#Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
So, as you can see, there can be multiple emails in one text file, with no clear separation rule except new email headers (To, From, etc.).
I can do os.walk on the main directory; it would go through each subdirectory, parse each text file in that subdirectory, and repeat for the other subdirectories, and so on.
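Something like this minimal sketch (the directory name is a placeholder):
import os

main_dir = 'path/to/main_directory'  # placeholder
for root, dirs, files in os.walk(main_dir):
    for name in files:
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(root, name), encoding='utf-8', errors='ignore') as f:
            raw_text = f.read()
        # parse raw_text into individual emails here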
I need to extract certain parts of each email within a text file and store them as a new row in a dataset (CSV, pandas DataFrame, etc.).
These are the parts which would be helpful to extract and store as columns of a row in the dataset; each row of the dataset would then be one email from one text file.
Fields:
Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email
Edit: I looked at the duplicate question that was added. That one assumes a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the different fields mentioned above.
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
Make sure you're using dotall, multiline, and extended modes on your regex engine.
For the example you posted it works, at least: it captures everything in the different groups (you may need to enable those modes on the regex engine as well, depending on which one it is).
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean#enron.com`
Group `to` 132-156 `kelly.kimberly#enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
And use it to iterate over your stream of text. You mention Python, so here you go:
import re

def match_email(long_string):
    regex = r'''^Date:\ (?P<date>.+?$)
    .+?
    ^From:\ (?P<sender>.+?$)
    .+?
    ^To:\ (?P<to>.+?$)
    .+?
    ^cc:\ (?P<cc>.+?$)
    .+?
    ^Subject:\ (?P<subject>.+?$)'''
    # try to match the thing; extended (X), multiline (M) and dotall (S) modes are all required
    match = re.search(regex, long_string.strip(), re.I | re.X | re.M | re.S)
    # if there is no match, it's over
    if match is None:
        return None, long_string
    # otherwise, get the named groups as a dict
    email = match.groupdict()
    # remove whatever matched from the original string
    long_string = long_string.strip()[match.end():]
    # return the email, and the remaining string
    return email, long_string
# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)
print(emails)
That's directly adapted from here, with just some names changed.
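From there, the list of dicts maps directly onto the dataset the question asks for; for example, with pandas (assuming it is available):
import pandas as pd

# one row per email, columns: date, sender, to, cc, subject
df = pd.DataFrame(emails)
df.to_csv('emails.csv', index=False)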
Keywords:
Keywords={u'secondary': [u'sales growth', u'next generation store', u'Steps Down', u' Profit warning', u'Store Of The Future', u'groceries']}
Paragraph:
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
Is there any way to match the keywords in the paragraph (without using regex)?
Output:
Matched keywords: next generation store, groceries
No need to use NLTK for this. First of all, you will have to clean the text in the paragraph, or change the values in the list for the 'secondary' key: '"next generation" store' and 'next generation store' are two different things.
After this you can iterate over the values of 'secondary', and check if any of those strings exist in your text.
match = [i for i in Keywords['secondary'] if i in paragraph]
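If capitalization can also differ (the list has entries like 'Store Of The Future'), a case-insensitive variant of the same check may help; a sketch that lowercases both sides:
# compare keyword and paragraph in lowercase so capitalization doesn't matter
match = [k for k in Keywords['secondary'] if k.lower() in paragraph.lower()]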
EDIT: As I specified above, '"next generation" store' and 'next generation store' are two different things, which is why you only get one match. If the paragraph contained 'next generation store' without the quotation marks, you would get two matches, as the examples below show.
INPUT:
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
OUTPUT:
['groceries']
INPUT:
paragraph="""HOUSTON -- Target has unveiled its first next generation store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
OUTPUT:
['next generation store','groceries']
Firstly, you don't really need a dict if your Keywords has only one key. Use a set() instead:
Keywords={u'secondary': [u'sales growth', u'next generation store',
u'Steps Down', u' Profit warning',
u'Store Of The Future', u'groceries']}
keywords = {u'sales growth', u'next generation store',
u'Steps Down', u' Profit warning',
u'Store Of The Future', u'groceries'}
paragraph="""HOUSTON -- Target has unveiled its first "next generation" store in the Houston area, part of a multibillion-dollar effort to reimagine more than 1,000 stores nationwide to compete with e-commerce giants.
The 124,000-square-foot store, which opened earlier this week at Aliana market center in Richmond, Texas, has two distinct entrances and aims to appeal to consumers on both ends of the shopping spectrum.
Busy families seeking convenience can enter the "ease" side of the store, which offers a supermarket-style experience. Customers can pick up online orders, both in store and curbside, and buy grab-and-go items like groceries, wine, last-minute gifts, cleaning supplies and prepared meals."""
Then a minor tweak from Find multi-word terms in a tokenized text in Python
from nltk.tokenize import MWETokenizer
from nltk import sent_tokenize, word_tokenize
mwe = MWETokenizer([k.lower().split() for k in keywords], separator='_')
# Clean out the punctuations in your sentence.
import string
puncts = list(string.punctuation)
cleaned_paragraph = ''.join([ch if ch not in puncts else '' for ch in paragraph.lower()])
# compare against lowercased keywords, since the tokens were lowercased above
tokenized_paragraph = [token for token in mwe.tokenize(word_tokenize(cleaned_paragraph))
                       if token.replace('_', ' ') in {k.lower() for k in keywords}]
print(tokenized_paragraph)
[out]:
>>> print(tokenized_paragraph)
['next_generation_store', 'groceries']