I have lots of strings like the following:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
I am using NLTK to remove the dateline part and to recognize the date, location, and person name.
Using POS tagging I can find the parts of speech, but I need to determine the location, date, and person name. How can I do that?
Update:
Note: I don't want to perform another HTTP request. I need to parse it using my own code. If there is a library, it's okay to use it.
Update:
I used ne_chunk, but no luck.
import nltk

def pchunk(t):
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print ne

# txts is a list of those 3 sentences.
for t in txts:
    print t
    pchunk(t)
The output is the following:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)
Check carefully: KARACHI is recognized correctly, but Sri Lanka is recognized as a PERSON, and ISLAMABAD is tagged as NNP, not GPE.
If using an API rather than your own code is OK for your requirements, this is something the Wit API can easily do for you.
Wit will also resolve date/time tokens into normalized dates.
To get started you just have to provide a few examples.
Yahoo has a PlaceFinder API that should help with identifying places. It looks like the places are always at the start, so it could be worth taking the first couple of words and throwing them at the API until it hits a limit:
http://developer.yahoo.com/boss/geo/
It may also be worth looking at using the dreaded regex in order to identify capitals:
Regular expression for checking if capital letters are found consecutively in a string?
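For example, a minimal sketch along those lines, assuming (my assumption, not from the linked question) that the dateline is a leading chunk starting with an all-caps word and ending at the first ":" or "--":

import re

# Assumed pattern: an all-caps first word, then anything up to ":" or "--".
DATELINE = re.compile(r'^([A-Z][A-Z]+[^:]{0,40}?)\s*(?::|--)\s*')

def split_dateline(text):
    # Return (dateline, rest) if a dateline prefix is found, else (None, text).
    m = DATELINE.match(text)
    if m:
        return m.group(1).strip(), text[m.end():]
    return None, text

print(split_dateline("KARACHI, July 24 -- Police claimed to have arrested several suspects"))
# ('KARACHI, July 24', 'Police claimed to have arrested several suspects')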
Good luck!
Related
I'm working on an NLP project using spaCy. I have identified different entities using spaCy's NER, and I want to remove the ORG entities (those identified as organisations) from the original input string.
doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']
Output of org_stopwords:
['The Financial Times ', 'Abu Dhabi and Lagos', 'Bloomberg ']
This is my code now. I've identified and made a list of all the entities labelled ORG by spaCy, but I don't know how to remove those from the string. One problem I face when I simply split the string and remove the org_stopwords is that the org_stopwords are n-grams. Please help with a code example of how to tackle this issue.
Use a regex instead of replace:
import re

org_stopwords = ['The Financial Times',
                 'Abu Dhabi ',
                 'U.S. News Editor',
                 'Independent',
                 'ANDREW']

# Escape each name so characters such as '.' are matched literally,
# then join them into a single alternation pattern.
regex = re.compile('|'.join(re.escape(org) for org in org_stopwords))
new_doc = re.sub(regex, '', doc)
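After the substitution you may be left with doubled spaces where the entities used to be; as a small optional cleanup (my addition, not part of the original answer) you can collapse them:

# Collapse runs of whitespace left behind after removing the entities.
new_doc = re.sub(r'\s{2,}', ' ', new_doc).strip()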
I need a Python package that can get the relevant sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimer were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds.
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot, but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package if they know of one (or if one exists).
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
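A minimal sketch of how that could look, assuming the text is first split into sentences with nltk.sent_tokenize (the sentence splitter and the wiki_text variable are my assumptions, not part of the original answer):

import nltk
from fuzzywuzzy import fuzz

def best_sentence(query, text):
    # Split the text into sentences and score each one against the query.
    sentences = nltk.sent_tokenize(text)
    return max(sentences, key=lambda s: fuzz.ratio(query, s))

# wiki_text would hold the Wikipedia excerpt from the question.
print(best_sentence("JJ Oppenheimer birth date", wiki_text))

For keyword-style queries, fuzz.partial_ratio or fuzz.token_set_ratio may rank sentences better than the plain ratio.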
I am pretty sure a module exists that could do this for you; you could also try to make it yourself by parsing through the text and creating keyword lists like ["date of birth", "born", "birth date", etc.], and doing this for multiple fields. This would allow you to find the information that is available.
The idea is:
you grab your text or whatever you have,
you grab what you are looking for (for example, date of birth),
you then assign "date of birth" to a list of similar words,
you look through your file to see if you find a sentence that has one of those in it.
I am pretty sure there is no ready-made module, maybe I am wrong, but something like the sketch below should work.
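A rough sketch of that idea (the keyword lists here are purely illustrative):

# Hypothetical keyword lists per field of interest.
FIELD_KEYWORDS = {
    "birth date": ["born", "date of birth", "birth date"],
    "trinity test": ["trinity test", "detonated", "atomic bomb"],
}

def sentences_for(field, sentences):
    keywords = FIELD_KEYWORDS[field]
    # Keep every sentence that mentions any of the field's keywords.
    return [s for s in sentences if any(k in s.lower() for k in keywords)]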
The task you describe looks like Information Retrieval: given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer suggesting fuzzywuzzy does. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you will then be able to sort the documents according to their similarity to the query vector. See this SO answer for how to do that.
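A minimal sketch with scikit-learn (the vectorizer and the cosine-similarity ranking are my choice of implementation; the linked answer may do it differently):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(query, sentences):
    # Fit TF-IDF on the sentences, project the query into the same space,
    # and sort the sentences by cosine similarity to the query vector.
    vectorizer = TfidfVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, sentence_vectors).ravel()
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]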
I am trying to do some text processing on a corpus which has emails.
I have a main directory, under which I have various folders. Each folder has many .txt files. Each txt file is basically the email conversations.
To give an example of what my text files with emails look like, I am copying a similar-looking text file of emails from the publicly available Enron email corpus. I have the same type of text data with multiple emails in one text file.
An example text file can look like below:
Message-ID: <3490571.1075846143093.JavaMail.evans#thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean#enron.com
To: kelly.kimberly#enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings#ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron#Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Steven J Kean/HOU/EES#EES,
Jeffrey Sherrick/Corp/Enron#Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
So as you can see, there can be multiple emails in one text file, with no clear separation rule except new email headers (To, From, etc.).
I can do os.walk on the main directory; it would go through each sub-directory, parse each of the text files in that sub-directory, and repeat for the other sub-directories, and so on.
I need to extract certain parts of each email within a text file and store it as new row in a dataset (csv,pandas dataframe etc).
These are the parts which would be helpful to extract and store as columns of a row in a dataset. Each row of this dataset would then be one email from one of the text files.
Fields:
Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email
Edit: I looked at the duplicate question that was added. That considers a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the different fields mentioned above.
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
Make sure you're using dotall, multiline, and extended modes on your regex engine.
For the example you posted it works, at least; it captures everything in different groups (you may need to enable those modes on the regex engine as well, depending on which one it is).
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean#enron.com`
Group `to` 132-156 `kelly.kimberly#enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT#ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
And you can use it to iterate over your stream of text; you mention Python, so there you go:
import re

def match_email(long_string):
    regex = r'''^Date:\ (?P<date>.+?$)
                .+?
                ^From:\ (?P<sender>.+?$)
                .+?
                ^To:\ (?P<to>.+?$)
                .+?
                ^cc:\ (?P<cc>.+?$)
                .+?
                ^Subject:\ (?P<subject>.+?$)'''
    # try to match the headers (dotall, multiline and verbose modes)
    match = re.search(regex, long_string.strip(), re.S | re.M | re.X)
    # if there is no match it's over
    if match is None:
        return None, long_string
    # otherwise, collect the named groups
    email = match.groupdict()
    # remove whatever matched from the original string
    long_string = long_string.strip()[match.end():]
    # return the email, and the remaining string
    return email, long_string

# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)

print(emails)
That's directly stolen from here, just some names changed and stuff.
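Since the question mentions storing each email as a row of a dataset, the resulting list of dicts converts directly (a small addition of mine, assuming pandas is installed):

import pandas as pd

# Each dict becomes a row; the named groups become the columns.
df = pd.DataFrame(emails)
df.to_csv("parsed_emails.csv", index=False)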
I have a news dataset which contains almost 10,000 news items from the last 3 years.
I also have a list of companies (names of companies) which are registered on the NYSE. Now I want to check whether the company names in the list appear in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find that a news item contains a company name if the exact company name is in the text, but as you can see from the above example, that is not the case here.
I also tried another way: I took a distinctive word from the company's full name, i.e. in the above example 'Pont' is a word which should definitely be part of the text whenever this company is mentioned. That worked the majority of the time, but then a problem occurs in the following example:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see that Ennis matches Dennis in the text, so it gives irrelevant news results.
Can someone suggest the right way of doing this? Thanks.
Use a regex with word boundaries for exact matches. Whether you choose the full name or some partial part you think is unique is up to you, but with word boundaries, D'ennis won't match Ennis:
import re
companies = ["name1", "name2", ...]
# re.escape makes punctuation such as '.' match literally
companies_re = re.compile(r"|".join([r"\b{}\b".format(re.escape(name)) for name in companies]))
Depending on how many matches you expect per news item, you may want to use companies_re.search(article) or companies_re.findall(article).
Also for case insensitive matches pass re.I to compile.
If the only line you want to check is always the one starting with company Name:, you can narrow down the search:
for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break
It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard: you can use Wikidata to help you find aliases. For example, look at the aliases in Dupont's entry: it includes both "Dupont" and "Du pont".
Ok so let's assume you have the list of company names with their aliases:
import ahocorasick

A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""
for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
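If you also want the start offset or the matched substring itself, you can derive it from the end index (a small addition of mine, not part of the original answer):

for end_index, idx in A.iter(your_text.lower()):
    name = companies[idx]
    start_index = end_index - len(name) + 1
    # Slice the original text to see the match in context.
    print(start_index, end_index, your_text[start_index:end_index + 1])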
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle
pickle.dump(A, open('company_automaton.pickle', 'wb'))
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle
A = pickle.load(open("company_automaton.pickle", "rb"))
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
You can try
difflib.get_close_matches
with the full company name.
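For reference, a minimal sketch of that call (the company list and the low cutoff are my assumptions; a short mention like "DuPont" shares few characters with the long official name, so the cutoff usually needs tuning on your data and may produce false positives):

import difflib

companies = ["E.I. du Pont de Nemours and Company", "Ennis, Inc.", "Tyco International"]

# Look for the company name closest to a mention found in the news text.
print(difflib.get_close_matches("DuPont", companies, n=1, cutoff=0.2))
# ['E.I. du Pont de Nemours and Company']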
I've tested my regex with Pythex and it works as it's supposed to:
The HTML:
Something Very Important (SVI) 2013 Sercret Information, Big Company
Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was
developed as part of the SUP 2012 Something Task force was held in
conjunction with *SEM 2013, the second joint conference on study of
banana hand grenades and gorilla tactics (Association of Ape Warfare
Studies) interest groups BUDDY HOLLY and LION KING. It is comprised of
one hairy object containing 750 gross stories told in the voice of
Morgan Freeman and his trusty sidekick Michelle Bachman.
My regex:
,[\s\w()-]+,
When used with Pythex it selects the area I'm looking for, which is between the 2 commas in the paragraph:
Something Very Important (SVI) 2013 Sercret Information , Big
Company Name (LBCN) Catalog Number BCN2013R18 and BSSN
3-55564-789-Y, was developed as part of the SUP 2012 Something Task
force was held in conjunction with <a href="http://justaURL.com">*SEM
2013</a>, the second joint
conference on study of banana hand grenades and gorilla tactics
(Association of Ape Warfare Studies) interest groups BUDDY HOLLY and
LION KING. It is comprised of one hairy object containing 750 gross
stories told in the voice of Morgan Freeman and his trusty sidekick
Michelle Bachman.
However when I use BeautifulSoup's text regex:
print HTML.body.p.find_all(text=re.compile('\,[\s\w()-]+\,'))
I'm returned this instead of the area between the commas:
[u'Something Very Important (SVI) 2013 Sercret Information, Big Company Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was developed as part of the SUP 2012 Something Task force was held in conjunction with ']
I've also tried escaping the commas, but no luck. BeautifulSoup just wants to return the whole <p> instead of the part my regex specified. Also, I noticed that it returns the paragraph only up until that link in the middle. Is this a problem with how I'm using BeautifulSoup, or is this a regex problem?
BeautifulSoup uses the regular expression to search for matching elements. That whole text node matches your search.
You still then have to extract the part you want; BeautifulSoup does not do this for you. You could just reuse your regex here:
expression = re.compile(r',[\s\w()-]+,')
textnodes = HTML.body.p.find_all(text=expression)
for textnode in textnodes:
    print expression.search(textnode).group(0)