I'm working on an NLP project using spaCy. I have identified different entities using spaCy's NER, and I want to remove the ORG entities (those identified as organisations) from the original input string.
doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']
output of org_stopwords
['The Financial Times ', 'Abu Dhabi and Lagos', 'Bloomberg ']
This is my code so far. I've built a list of everything spaCy identifies as ORG, but I don't know how to remove those entries from the string. Simply splitting the string into words and filtering out the org_stopwords doesn't work, because the org_stopwords are n-grams. Please help with a coded example of how to tackle this issue.
Use regex instead of replace
import re
org_stopwords = ['The Financial Times',
'Abu Dhabi ',
'U.S. News Editor',
'Independent',
'ANDREW']
# Escape each phrase so characters such as '.' are matched literally
regex = re.compile('|'.join(map(re.escape, org_stopwords)))
new_doc = re.sub(regex, '', doc)
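An alternative to matching the entity text is to cut the entities out by character position. The sketch below assumes you have a list of (start, end) offsets; with spaCy these could be collected as `[(ent.start_char, ent.end_char) for ent in text.ents if ent.label_ == 'ORG']`. The helper name `remove_spans` and the sample sentence are made up for illustration, and the offsets are computed with `str.index` only so the example runs without a spaCy model.

```python
def remove_spans(text, spans):
    """Return text with every (start, end) character span cut out."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        # Keep the text between the previous span and this one
        pieces.append(text[cursor:start])
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical sample; in practice the spans would come from the NER output
sample = "A journalist from The Financial Times, later at Bloomberg in Tokyo."
orgs = ["The Financial Times", "Bloomberg"]
spans = [(sample.index(o), sample.index(o) + len(o)) for o in orgs]

cleaned = remove_spans(sample, spans)
print(cleaned)
```

Working from offsets removes exactly the occurrences the NER flagged, whereas a regex over the entity text removes every occurrence of the phrase anywhere in the string.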
Hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but hoping I can handle that)
It seems like I should be able to take the div story tag, right? I am struggling through how to do that - I've tried various codes I've found online & have tried to tweak, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code thus far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print (story)
I'm really trying to learn this, not just copy and paste, in my mind right now this says (in English):
Import packages needed
Get URL HTML Text Data ---- (when I printed the complete code it worked fine)
Narrow down the HTML code to only include div tags labeled as "story" -- this obviously is the hiccup
I'm struggling to understand this and will keep playing with it, but figured I'd turn here for some advice. Any thoughts are greatly appreciated. Right now I just get a blank result.
The page is rendered by JavaScript, which requests cannot execute. The story data is still present in the response that requests downloads, but it sits in its raw form inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. 
Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. 
Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
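The chained key lookups in the answer above raise a KeyError as soon as the site changes the shape of its JSON. A minimal sketch of a more defensive lookup follows; the helper name `dig` and the sample dictionary are hypothetical and only mirror the key path used above.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts/lists; return default if any step is missing."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical stand-in for the parsed JSON object
sample = {
    "props": {
        "pageProps": {
            "story": {
                "header": {"headline": "Example headline"},
                "content": ["First paragraph.", "Second paragraph."],
            }
        }
    }
}

headline = dig(sample, "props", "pageProps", "story", "header", "headline")
date = dig(sample, "props", "pageProps", "story", "date", default="unknown")
body = "\n".join(dig(sample, "props", "pageProps", "story", "content", default=[]))

print(headline)
print(date)
print(body)
```

Joining the content list with newlines also turns the list of paragraphs into a single readable story string.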
I am writing Python code using the BeautifulSoup library to pull the titles and authors of all the opinion pieces from a news website. The for loop works as intended for the titles, but the find call inside it, which is meant to pull the author's name for each title, repeatedly returns the author of the first piece.
Any ideas where I am going wrong?
The Code
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml')
opinion = soup.find('div', class_='css-717c4s')
for story in opinion.find_all('article'):
    title = story.h2.text
    print(title)
    author = opinion.find('div', class_='css-1xdt15l')
    print(author.text)
The Output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Judy Batalion
Old Pol, New Tricks
Judy Batalion
Do We Have to Pay Businesses to Obey the Law?
Judy Batalion
I Don’t Want My Role Models Erased
Judy Batalion
Progressive Christians Arise! Hallelujah!
Judy Batalion
What the 2020s Need: Sex and Romance at the Movies
Judy Batalion
Biden Has Disappeared
Judy Batalion
What Republicans Could Learn From My Grandmother
Judy Batalion
Your Home’s Value Is Based on Racism
Judy Batalion
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Judy Batalion
You should do,
author = story.find('div', class_='css-1xdt15l')
What's wrong is, you are doing
author = opinion.find('div', class_='css-1xdt15l')
That call fetches the first author in the whole opinion section. Because the same statement runs on every iteration of the loop, you get that first author every time, no matter how many stories there are.
Replacing it with
author = story.find('div', class_='css-1xdt15l')
fetches you the first author for each story iteration and since each story has a single author, it works fine.
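The difference between searching from the container and searching from each story can be reproduced on a tiny document. The sketch below deliberately uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra installs; the markup and the `author` class name are invented for illustration and are not the real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the opinion section
SAMPLE = """
<div class="opinion">
  <article><h2>First title</h2><div class="author">Author A</div></article>
  <article><h2>Second title</h2><div class="author">Author B</div></article>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
AUTHOR = ".//div[@class='author']"

# Searching from the container returns the first author on every iteration
wrong = [root.find(AUTHOR).text for _ in root.iter("article")]

# Searching from each article returns that article's own author
right = [story.find(AUTHOR).text for story in root.iter("article")]

print(wrong)
print(right)
```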
It's because you target only the first tag, hence the one author.
Here's a fix for your code using zip():
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml').find('div', class_='css-717c4s')
authors = soup.find_all('div', class_='css-1xdt15l')
stories = soup.find_all('article')
for author, story in zip(authors, stories):
    print(author.text)
    print(story.h2.text)
Output:
Judy Batalion
The Nazi-Fighting Women of the Jewish Resistance
Gracy Olmstead
My Great-Grandfather Knew How to Fix America’s Food System
Maureen Dowd
Old Pol, New Tricks
Nikolas Bowie
Do We Have to Pay Businesses to Obey the Law?
Elizabeth Becker
I Don’t Want My Role Models Erased
Nicholas Kristof
Progressive Christians Arise! Hallelujah!
Ross Douthat
What the 2020s Need: Sex and Romance at the Movies
Frank Bruni
Biden Has Disappeared
Cecilia Gentili
What Republicans Could Learn From My Grandmother
Dorothy A. Brown
Your Home’s Value Is Based on Racism
Linsey Marr, Juliet Morrison and Caitlin Rivers
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
A very small mistake.
You need to search within each article (story) object in the loop, not the initial opinion object, i.e.
author = story.find('div', class_='css-1xdt15l')
This produces the desired output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Gracy Olmstead
Old Pol, New Tricks
Maureen Dowd
Do We Have to Pay Businesses to Obey the Law?
Nikolas Bowie
I Don’t Want My Role Models Erased
Elizabeth Becker
Progressive Christians Arise! Hallelujah!
Nicholas Kristof
What the 2020s Need: Sex and Romance at the Movies
Ross Douthat
Biden Has Disappeared
Frank Bruni
What Republicans Could Learn From My Grandmother
Cecilia Gentili
Your Home’s Value Is Based on Racism
Dorothy A. Brown
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Linsey Marr, Juliet Morrison and Caitlin Rivers
I have 3 strings. How can I remove the punctuation, make all the reviews lower-case, and then print out all 3 reviews?
Review1 = 'My great auntie has lived at Everton Park for decades, and once upon a time I even lived here too, and I remember the days before when there was nothing remotely hipster about this housing block. It is really cool to see cute new cafes and coffee shops moving in, and I've been to Nylon every time I'm back in town.'
Review2 = 'Solid coffee in the Outram Park neighborhood. Location is hidden in a HDB block so you definitely need to search for it. Minus one star for limited seating options'
Review3 = 'Deserve it, truly deserves this much reviews. I will describe coffee here as honest, sincere, decent, strong, smart.'
Review1 = "My great auntie has lived at Everton Park for decades, and once upon a time I even lived here too, and I remember the days before when there was nothing remotely hipster about this housing block. It is really cool to see cute new cafes and coffee shops moving in, and I've been to Nylon every time I'm back in town."
import string
Review1_Fixed = Review1.lower().translate(str.maketrans('', '', string.punctuation))
print(Review1_Fixed)
Output:
"my great auntie has lived at everton park for decades and once upon a time i even lived here too and i remember the days before when there was nothing remotely hipster about this housing block it is really cool to see cute new cafes and coffee shops moving in and ive been to nylon every time im back in town"
For more information on what this command is doing, or more ways of doing this see this post.
Using re module:
Review1 = '''My great auntie has lived at Everton Park for decades, and once upon a time I even lived here too, and I remember the days before when there was nothing remotely hipster about this housing block. It is really cool to see cute new cafes and coffee shops moving in, and I've been to Nylon every time I'm back in town.'''
Review2 = '''Solid coffee in the Outram Park neighborhood. Location is hidden in a HDB block so you definitely need to search for it. Minus one star for limited seating options'''
Review3 = '''Deserve it, truly deserves this much reviews. I will describe coffee here as honest, sincere, decent, strong, smart.'''
import re
def strip_punctuation_make_lowercase(*strings):
    return map(lambda s: re.sub(r'[^\s\w]+', '', s).lower(), strings)
Review1, Review2, Review3 = strip_punctuation_make_lowercase(Review1, Review2, Review3)
print(Review1)
print()
print(Review2)
print()
print(Review3)
print()
Prints:
my great auntie has lived at everton park for decades and once upon a time i even lived here too and i remember the days before when there was nothing remotely hipster about this housing block it is really cool to see cute new cafes and coffee shops moving in and ive been to nylon every time im back in town
solid coffee in the outram park neighborhood location is hidden in a hdb block so you definitely need to search for it minus one star for limited seating options
deserve it truly deserves this much reviews i will describe coffee here as honest sincere decent strong smart
In [23]: whitelist = set(string.ascii_letters)
In [24]: rev1 = "My great auntie has lived at Everton Park for decades, and once upon a time I even lived here too, and I remember the days before when there was nothing remotely hipster about this housing block. It is really cool to see cute new cafes and coffee shops moving in, and I've been to Nylon every time I'm
...: back in town."
In [25]: ''.join([char for char in rev1 if char in whitelist])
Out[25]: 'MygreatauntiehaslivedatEvertonParkfordecadesandonceuponatimeIevenlivedheretooandIrememberthedaysbeforewhentherewasnothingremotelyhipsteraboutthishousingblockItisreallycooltoseecutenewcafesandcoffeeshopsmovinginandIvebeentoNyloneverytimeImbackintown'
In [26]: whitelist = set(string.ascii_letters + ' ')
In [27]: ''.join([char for char in rev1 if char in whitelist])
Out[27]: 'My great auntie has lived at Everton Park for decades and once upon a time I even lived here too and I remember the days before when there was nothing remotely hipster about this housing block It is really cool to see cute new cafes and coffee shops moving in and Ive been to Nylon every time Im back in town'
The __contains__ method defines how instances of a class behave when they appear on the right-hand side of the in and not in operators.
from string import ascii_letters
Review1 = "My great auntie has lived at Everton Park for decades, and once upon a time I even lived here too, and I remember the days before when there was nothing remotely hipster about this housing block. It is really cool to see cute new cafes and coffee shops moving in, and I've been to Nylon every time I'm back in town."
key = set(ascii_letters + ' ') # key = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Review1_ = ''.join(filter(key.__contains__, Review1)).lower()
print (Review1_)
output:
my great auntie has lived at everton park for decades and once upon a time i even lived here too and i remember the days before when there was nothing remotely hipster about this housing block it is really cool to see cute new cafes and coffee shops moving in and ive been to nylon every time im back in town
To remove the punctuation (Python 3):
s.translate(str.maketrans('', '', string.punctuation))
or create your own function:
def remove_punctuation(text):
    # Raw string so the backslash is taken literally
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    for ch in text:
        if ch in punctuations:
            text = text.replace(ch, "")
    # Print string without punctuation
    print(text)
For lower case:
s.lower()
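Since the question asks for all three reviews, the pieces above can be combined and applied to a list. This is a minimal sketch; the function name `clean` and the shortened review strings are placeholders for the real data.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(review):
    """Lower-case a review and strip its punctuation."""
    return review.lower().translate(PUNCT_TABLE)


# Shortened stand-ins for Review1, Review2 and Review3
reviews = [
    "It is really cool, and I've been to Nylon every time I'm back in town.",
    "Solid coffee in the Outram Park neighborhood.",
    "Deserve it, truly deserves this much reviews.",
]

cleaned = [clean(r) for r in reviews]
for r in cleaned:
    print(r)
```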
I am trying to do some text processing on a corpus of emails.
I have a main directory, under which I have various folders. Each folder has many .txt files. Each .txt file contains email conversations.
To show what my text files look like, I am copying a similar-looking file from the publicly available Enron email corpus. My data is of the same type, with multiple emails in one text file.
An example text file can look like below:
Message-ID: <3490571.1075846143093.JavaMail.evans@thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean@enron.com
To: kelly.kimberly@enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
As you can see, one text file can contain multiple emails, with no clear separation rule other than the start of a new set of email headers (To, From, etc.).
I can use os.walk on the main directory to go through each sub-directory and parse each text file in it, repeating this for every sub-directory.
I need to extract certain parts of each email within a text file and store them as a new row in a dataset (CSV, pandas DataFrame, etc.).
The parts below would be helpful to extract and store as the columns of a row. Each row of the dataset then corresponds to one email within a text file.
Fields:
Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email |
Edit: I looked at the duplicate question that was linked. It assumes a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the fields mentioned above.
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
Make sure you're using dotall, multiline, and extended modes on your regex engine.
It works at least for the example you posted: it captures each field in its own named group (you may need to enable named groups on the regex engine as well, depending on which one it is).
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean@enron.com`
Group `to` 132-156 `kelly.kimberly@enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
Then use it to iterate over your stream of text. You mention Python, so here you go:
import re

# Verbose pattern: one email = Date, From, To, cc and Subject lines, in that order
EMAIL_REGEX = r'''
    ^Date:\ (?P<date>.+?$)
    .+?
    ^From:\ (?P<sender>.+?$)
    .+?
    ^To:\ (?P<to>.+?$)
    .+?
    ^cc:\ (?P<cc>.+?$)
    .+?
    ^Subject:\ (?P<subject>.+?$)
'''

def match_email(long_string):
    long_string = long_string.strip()
    # try to match the next email; dotall, multiline and extended modes are all needed
    match = re.search(EMAIL_REGEX, long_string, re.I | re.S | re.M | re.X)
    # if there is no match, it's over
    if match is None:
        return None, long_string
    # otherwise return the email and the string remaining after the match
    return match.groupdict(), long_string[match.end():]

# now iterate over the long string (the_long_string holds the text of one file)
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)
print(emails)
That's taken directly from here, with just some names changed.
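Once the emails have been collected as dictionaries, they still need to be stored as rows, as the question asks. The following is a minimal sketch using the standard csv module; the function name `emails_to_csv`, the column order, and the sample row are assumptions, and an in-memory buffer is used in place of a real output file.

```python
import csv
import io

FIELDS = ["date", "sender", "to", "cc", "subject"]


def emails_to_csv(emails, handle):
    """Write a list of dicts (one per email) as CSV rows to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for email in emails:
        # Collapse runs of whitespace and trim the captured values
        writer.writerow({k: " ".join(str(email.get(k, "")).split()) for k in FIELDS})


# Hypothetical row shaped like the groupdict() produced by the regex
sample = [{
    "date": "Wed, 8 Sep 1999 08:50:00 -0700 (PDT)",
    "sender": "steven.kean@enron.com",
    "to": "kelly.kimberly@enron.com",
    "cc": "",
    "subject": "Re: India And The WTO Services Negotiation ",
}]

buffer = io.StringIO()
emails_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be replaced with `open(path, "w", newline="")`, which is the form the csv documentation recommends.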
I have lots of strings like following,
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
I am using NLTK and want to remove the dateline part and recognize the date, location and person name.
Using POS tagging I can find the parts of speech, but I need to determine the location, date and person name. How can I do that?
Update:
Note: I don't want to perform another HTTP request. I need to parse it using my own code. If there is a library, it's okay to use it.
Update:
I tried ne_chunk, but no luck.
import nltk
def pchunk(t):
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)

# txts is a list of those 3 sentences.
for t in txts:
    print(t)
    pchunk(t)
The output is as follows:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
(S
ISLAMABAD/NNP
:/:
Chief/NNP
Justice/NNP
(PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
said/VBD
that/IN
(ORGANIZATION National/NNP Accountab/NNP))
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
(S
(GPE KARACHI/NNP)
,/,
July/NNP
24/CD
--/:
Police/NNP
claimed/VBD
to/TO
have/VB
arrested/VBN
several/JJ
suspects/NNS
in/IN
separate/JJ)
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
(S
(GPE ALUM/NN)
(ORGANIZATION KULAM/NN)
,/,
(PERSON Sri/NNP Lanka/NNP)
--/:
As/IN
gray-bellied/JJ
clouds/NNS
started/VBN
to/TO
blot/VB
out/RP
the/DT
scorchin/NN)
Look carefully: KARACHI is recognized correctly as GPE, but Sri Lanka is recognized as PERSON, and ISLAMABAD is only tagged NNP rather than recognized as GPE.
If using an API vs your own code is OK for your requirements, this is something the Wit API can easily do for you.
Wit will also resolve date/time tokens into normalized dates.
To get started you just have to provide a few examples.
Yahoo has a placefinder API that should help with identifying places. Looks like the places are always at the start so it could be worth taking the first couple of words and throwing them at the API until it hits a limit:
http://developer.yahoo.com/boss/geo/
May also be worth looking at using the dreaded REGEX in order to identify capitals:
Regular expression for checking if capital letters are found consecutively in a string?
Good luck!
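As a rough illustration of the regex idea, the sketch below splits off a leading dateline made of an all-caps place name, an optional extra part after a comma (a date or a country), and a ':' or '--' separator. The pattern and the function name `split_dateline` are assumptions based only on the three sample strings in the question, so other dateline styles would need their own handling.

```python
import re

# All-caps place, optional ", extra" part, then ':' or '--' as the separator
DATELINE = re.compile(
    r"^(?P<place>[A-Z]+(?: [A-Z]+)*)(?:, (?P<extra>[^:]*?))?\s*(?::|--)\s*"
)


def split_dateline(text):
    """Return (place, extra, remaining_text); place and extra are None without a dateline."""
    m = DATELINE.match(text)
    if m is None:
        return None, None, text
    return m.group("place"), m.group("extra"), text[m.end():]


samples = [
    "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab",
    "KARACHI, July 24 -- Police claimed to have arrested several suspects in separate",
    "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin",
]

results = [split_dateline(s) for s in samples]
for place, extra, rest in results:
    print(place, "|", extra, "|", rest)
```

The remaining text, with the dateline stripped, can then be passed to the NER step, which may help it avoid mislabeling the capitalized location tokens.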