Python: libphonenumbers - cannot extract obvious phone numbers

I am using the Python version of Google's libphonenumbers, but when I tried this library on different texts, sometimes the function returns nothing even though there is very obviously a phone number there, and sometimes it does return the phone numbers. Please see below:
print(x2)
for match in pnum.PhoneNumberMatcher(x2, "US"):
    print(match)  # for the text above, it did not get the number
output:
I just read your profile and thought it was really great. I also thought you were cute and loved the fact that you go hiking with your brothers every summer. If you want to know anything more about me, just ask. My num 555-121-5468.
With the text above, it does not return any phone number.
But in other situations, like the following, the function gives me the correct output:
x9 = "hay I hate to cut you short, its been fun chatting, but unfortuantely I gotta run. I am gald we became friends though. my number is (323) 2387890"
for match in pnum.PhoneNumberMatcher(x9, "US"):
    print(match)
output:
PhoneNumberMatch [132,145) (323) 2387890
I don't know what is causing this problem. I am new to Python and this library, and I would sincerely appreciate any insight.

555-121-5468 looks like a valid US phone number, but actually it isn't.
The PhoneNumberMatcher class constructor accepts a leniency argument which defines how strictly the class matches candidate phone numbers (code). The default value of leniency is 1, which only matches valid phone numbers. Changing this to 0 will match possible phone numbers like 555-121-5468.
>>> for match in pnum.PhoneNumberMatcher(x2, 'US', leniency=0):
...     print(match)
...
PhoneNumberMatch [220,232) 555-121-5468
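For readability you can also use the named constants on phonenumbers.Leniency instead of the raw integers (Leniency.POSSIBLE is 0, Leniency.VALID is 1). A minimal sketch:

import phonenumbers as pnum

text = "My num 555-121-5468."
# Leniency.POSSIBLE (0) accepts plausible-looking candidates;
# Leniency.VALID (1, the default) requires an actually valid number.
for match in pnum.PhoneNumberMatcher(text, "US", leniency=pnum.Leniency.POSSIBLE):
    print(match.raw_string)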
The 555 prefix is not a real prefix, but is used for fictional phone numbers in US TV and cinema. From Wikipedia:
Telephone numbers with the prefix 555 are widely used for fictitious telephone numbers in North American television shows, films, video games, and other media in order to prevent practical jokers and curious callers from bothering real people and organisations by telephoning numbers they see in works of fiction; generally, in North America, a number with 555 as a prefix will not connect to a real person.

Related

Python package to extract sentence from a textfile based on keyword

I need a Python package that can extract the relevant sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds.
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package if one exists.
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
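A minimal sketch of ranking sentences this way. Note that I'm using fuzz.token_set_ratio here rather than plain fuzz.ratio, since it copes better when the query is much shorter than the sentence (my assumption, not something the answer specifies):

from fuzzywuzzy import fuzz

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.",
]
query = "JJ Oppenheimer birth date"

# Score every sentence against the query and keep the best one.
best = max(sentences, key=lambda s: fuzz.token_set_ratio(query, s))
print(best)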
I am pretty sure a module exists that could do this for you; alternatively, you could try to make it yourself by parsing through the text and creating lists of related words like ["date of birth", "born", "birth date", etc.], and doing this for multiple fields. This would allow you to find the information that is available.
The idea is:
you grab your text or whatever you have,
you grab what you are looking for (for example, date of birth),
you then assign "date of birth" to a list of similar words,
you look through your file to see if you find a sentence that contains any of them.
I am pretty sure there is no module for this, maybe I am wrong, but something like the sketch below should work.
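A minimal sketch of that synonym-list idea (the word list and sentences are made up for illustration):

# Words that signal the field we are looking for (illustrative only).
birth_words = ["date of birth", "born", "birth date"]

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Ella was from Baltimore.",
]

# Print any sentence containing one of the signal words.
for sentence in sentences:
    if any(word in sentence.lower() for word in birth_words):
        print(sentence)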
The task you describe looks like Information Retrieval. Given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the response using fuzzywuzzy is suggesting. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you will then be able to sort the documents according to their similarity to the query vector. There is an SO answer describing how to do that.
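A minimal sketch of that pipeline using scikit-learn (my choice of library; the answer doesn't name one):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.",
]
query = "JJ Oppenheimer Trinity test"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(sentences)  # one Tf-Idf vector per sentence
query_vector = vectorizer.transform([query])

# Rank the sentences by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(sentences[scores.argmax()])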

Python Exact Match - Absolute exact match

Based on the code I have I am trying to find an exact match to any of the job positions listed in the input.
INPUT
dfp1[dfp1.index.str.match('Teacher|Dentist|General Manager|District Manager|Bus Driver|Team Lead|Dancer')]
Output is:
Teacher
Teacher, Middle
Teacher, High
Dentist, Sanford
Dentist
General Manager
General Manager, Dollar Tree
Team Lead
Dancer, 10th
Dancer
Dancer, Previous
I do not want anything extra other than the exact job position I put in the input. I want to specifically see only Teacher or Dentist or General Manager or District Manager or Bus Driver or Team Lead or Dancer.
I am not sure what my code is missing for it to display the job titles and no others.
Fixed your regex. You need to add a ^ at the beginning, a $ at the end, and parentheses around the alternation so the anchors apply to every alternative.
dfp1[dfp1.index.str.match('^(Teacher|Dentist|General Manager|District Manager|Bus Driver|Team Lead|Dancer)$')]
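If your pandas version is 1.1 or newer, str.fullmatch expresses the same intent without writing the anchors yourself (worth checking your version before relying on it):

pattern = 'Teacher|Dentist|General Manager|District Manager|Bus Driver|Team Lead|Dancer'
# fullmatch requires the whole string to match, so no explicit ^...$ is needed.
dfp1[dfp1.index.str.fullmatch(pattern)]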

Best way to parse the human names in a large list of addresses when the order of names is unknown in advance [duplicate]

OK, so basically I am asking the user for their name, and I want this to be one input rather than separate Forename and Surname fields.
Now, is there any way of splitting this name and taking just the last word from the "sentence"? e.g.
name = "Thomas Winter"
print(name.split())
and what I want as output is just "Winter".
You'll find that your key problem with this approach isn't a technical one, but a human one - different people write their names in different ways.
In fact, the terminology of "forename" and "surname" is itself flawed.
While many blended families use a hyphenated family name, such as Smith-Jones, there are some who just use both names separately, "Smith Jones" where both names are the family name.
Many European family names have multiple parts, such as "de Vere" and "van den Neiulaar". Sometimes these extras have important family history - for example, a prefix awarded by a king hundreds of years ago.
Side issue: I've capitalised these correctly for the people I'm referencing - "de" and "van den" don't get capital letters for some families, but do for others.
Conversely, many Asian cultures put the family name first, because the family is considered more important than the individual.
Last point - some people place great store in being "Junior" or "Senior" or "III" - and your code shouldn't treat those as the family name.
Also noting that there are a fair number of people who use a name that isn't the one bestowed by their parents, I've used the following scheme with some success:
Full Name (as normally written for addressing mail);
Family Name;
Known As (the name commonly used in conversation).
e.g:
Full Name: William Gates III; Family Name: Gates; Known As: Bill
Full Name: Soong Li; Family Name: Soong; Known As: Lisa
This is a pretty old issue but I found it searching around for a solution to parsing out pieces from a globbed together name.
http://code.google.com/p/python-nameparser/
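A quick sketch of python-nameparser's HumanName class, which handles many of the edge cases discussed above (the example name is the one from the project's own documentation):

from nameparser import HumanName  # pip install nameparser

name = HumanName("Dr. Juan Q. Xavier de la Vega III")
print(name.first)   # Juan
print(name.last)    # de la Vega
print(name.suffix)  # III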
The problem with trying to split the names from a single input is that you won't get the full surname for people with spaces in their surname, and I don't believe you'll be able to write code to manage that completely.
I would recommend that you ask for the names separately if it is at all possible.
An easy way to do exactly what you asked in Python is
name = "Thomas Winter"
LastName = name.split()[1]
(note the parentheses on the function call split).
split() creates a list where each element is from your original string, delimited by whitespace. You can now grab the second element using name.split()[1] or the last element using name.split()[-1]
However, as others said, unless you're SURE you're just getting a string like "First_Name Last_Name", there are a lot more issues involved.
Golden rule of data - don't aggregate too early - it is much easier to glue fields together than separate them. Most people also have a middle name which should be an optional field. Some people have a plethora of middle names. Some people only have one name, one word. Some cultures commonly have a dictionary of middle names, paying homage to the family tree back to the Golgafrincham Ark landing.
You don't need a code solution here - you need a business rule.
Like this:
print(name.split()[-1])
If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.
This is how I do it in my application:
def get_first_name(fullname):
    firstname = ''
    try:
        firstname = fullname.split()[0]
    except Exception as e:
        print(str(e))
    return firstname

def get_last_name(fullname):
    lastname = ''
    try:
        index = 0
        for part in fullname.split():
            if index > 0:
                if index > 1:
                    lastname += ' '
                lastname += part
            index += 1
    except Exception as e:
        print(str(e))
    return lastname

def get_last_word(string):
    return string.split()[-1]

print(get_first_name('Jim Van Loon'))
print(get_last_name('Jim Van Loon'))
print(get_last_word('Jim Van Loon'))
There are many different variations in how people write their names, but here's a basic way to get the first/last name via regex.
import re

p = re.compile(r'^(\s+)?(Mr(\.)?|Mrs(\.)?)?(?P<FIRST_NAME>.+)(\s+)(?P<LAST_NAME>.+)$', re.IGNORECASE)
m = p.match('Mr. Dingo Bat')
if m is not None:
    first_name = m.group('FIRST_NAME')
    last_name = m.group('LAST_NAME')
Splitting names is harder than it looks. Some names have two-word last names; some people will enter a first, middle, and last name; some names have two-word first names. The more reliable (or least unreliable) way to handle names is to always capture first and last name in separate fields. Of course this raises its own issues, like how to handle people with only one name, or making sure it works for users who order name parts differently.
Names are hard, handle with care.
It's definitely a more complicated task than it appears on the surface. I wrote up some of the challenges as well as my algorithm for solving it on my blog. Be sure to check out my Google Code project for it if you want the latest version in PHP:
http://www.onlineaspect.com/2009/08/17/splitting-names/
Here's how to do it in SQL. But data normalization with this kind of thing is really a bear. I agree with Dave DuPlantis about asking for separate inputs.
I would specify a standard format (some forms use them), such as "Please write your name in First name, Surname form".
It makes it easier for you, as names don't usually contain a comma. It also verifies that your users actually enter both first name and surname.
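If you do request a comma-separated format, splitting becomes trivial. A minimal sketch, assuming the user follows the requested "First name, Surname" convention:

raw = "Thomas, Winter"  # user entered "First name, Surname" as requested
first, last = [part.strip() for part in raw.split(",", 1)]
print(first)  # Thomas
print(last)   # Winter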
name = "Thomas Winter"
first, last = name.split()
print("First = {first}".format(first=first))
#First = Thomas
print("Last = {last}".format(last=" ".join(last)))
#Last = Winter
You can use str.find() for this.
x = input("enter your name ")
l = x.find(" ")  # index of the first space
print("your first name is", x[:l])
print("your last name is", x[l + 1:])
You would probably want to use rsplit for this:
rsplit([sep [,maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done, the rightmost ones. If sep is not specified or None, any whitespace string is a separator. Except for splitting from the right, rsplit() behaves like split() which is described in detail below. New in version 2.4.
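A quick illustration of rsplit with maxsplit=1: it splits only once, from the right, so everything before the final word stays together:

name = "Thomas Middlename Winter"
rest, last = name.rsplit(maxsplit=1)  # one split, from the right
print(rest)  # Thomas Middlename
print(last)  # Winter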

Matching company names in the news data using Python

I have a news dataset which contains almost 10,000 news items from the last 3 years.
I also have a list of companies (names of companies) which are registered on the NYSE. Now I want to check whether the company names in the list have appeared in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find the news containing the company name if the exact company name is in the news, but as you can see from the above example, that is not always the case.
I also tried another way: I took a distinctive word from the company's full name, i.e. in the above example 'Pont', which should definitely be part of the text whenever this company is mentioned. This worked the majority of the time, but then a problem occurs in examples like the following:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see that Ennis matches against Dennis in the text, so it gives irrelevant news results.
Can someone suggest the right way of doing this? Thanks.
Use a regex with word boundaries for exact matches. Whether you choose the full name or some partial part you think is unique is up to you, but using word boundaries, "Dennis" won't match "Ennis":
companies = ["name1", "name2",...]
companies_re = re.compile(r"|".join([r"\b{}\b".format(name) for name in companies]))
Depending on how many matches you expect per news item, you may want to use companies_re.search(article) or companies_re.findall(article).
Also for case insensitive matches pass re.I to compile.
If the only line you want to check is always the one starting with "company Name:", you can narrow down the search:
for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break
It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard; you can use Wikidata to help you find aliases: for example, look at the aliases of Dupont's entry: it includes both "Dupont" and "Du pont".
OK, so let's assume you have the list of company names with their aliases:
import ahocorasick

A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""

for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle

with open('company_automaton.pickle', 'wb') as f:
    pickle.dump(A, f)
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle

with open("company_automaton.pickle", "rb") as f:
    A = pickle.load(f)
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
You can try difflib.get_close_matches with the full company name.
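A minimal sketch of what that call looks like. Note that difflib also scores near-misses like Dennis/Ennis fairly high, so the cutoff needs tuning for your data:

import difflib

companies = ["DuPont", "Ennis", "Monsanto", "Tyco International"]

# Returns the company names most similar to the input, ranked by
# difflib's similarity ratio; names below the cutoff are dropped.
print(difflib.get_close_matches("Du pont", companies, n=3, cutoff=0.6))
# ['DuPont']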

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides the functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers using common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains, you may get a fairly good indication.
E.g. a line with greeting words near the start is the salutation (salutations may also have phrases that refer to the past, e.g. "it was good to see you last time").
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.) and question marks and offerings (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a datasource that has messages of the structure you want you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score,..]
E.g. "hello" could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
This means "hello" would get assigned [900, 10, 3, ..]
"cheers" might get assigned [10, 3, 100, ..]
Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful.
E.g. "catch" might have [100, 101, 80, ..] = a range of 21
(it was good to catch up, wanna go catch a fish, catch you later). "Catch" can occur anywhere.
Now you can reduce the number of words down to about 10,000.
Now, for each line, give the line a score, also of the form [salutation score, body score, signature score, ..].
This score is calculated by adding the vector scores of each word.
E.g. a sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ..] + [10, 3, 100, ..] + .. + .. = [900+10+.., 10+3+.., 3+100+.., ..]
= [1023, 900, 500, ..], say.
Then, because the biggest number is at the start, in the salutation score position, this sentence is a salutation.
Then, if you had to score one of your lines to see what component the line should be in, for each word you would add on its score.
Good luck, there is always a trade-off between computational complexity and accuracy. If you can find a good set of words and make a good model to base your calculations on, it will help.
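A minimal sketch of this scoring scheme, with made-up word scores standing in for the corpus frequencies described above:

# Made-up [salutation, body, signature] counts; a real model would
# derive these from frequency analysis of a labelled corpus.
word_scores = {
    "hello":  [900, 10, 3],
    "cheers": [10, 3, 100],
    "meet":   [5, 200, 2],
}
zones = ["salutation", "body", "signature"]

def classify_line(line):
    total = [0, 0, 0]
    for word in line.lower().split():
        scores = word_scores.get(word, [0, 0, 0])  # unknown words score zero
        total = [t + s for t, s in zip(total, scores)]
    # The zone with the highest summed score wins.
    return zones[total.index(max(total))]

print(classify_line("Hello cheers for giving me your number"))  # salutation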
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code and stuff:
linearray = emailtext.split('\n')
Now you have an array of strings, each one like a paragraph or whatever, so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little more tricky; I noticed that there is a double newline just before it, so maybe do a search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like "cheers", "regards", and whatever else.
Once you figure out where the signature is, the rest is easy.
Hope this helped.
I actually built a pretty cheap API for this, to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs for it here.
Basically you send it an 'x-api-key' header with a JSON body like the following, and it parses all the contacts in the reply chain of an email.
{
  "subject": "Thanks for meeting...",
  "from_address": "bgates@example.com",
  "from_name": "Bill Gates",
  "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
  "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
  "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
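A minimal sketch of calling such an API with requests; the endpoint URL below is a placeholder, not SigParser's real one (check their Swagger docs for that), and the API key is obviously fake:

import requests

response = requests.post(
    "https://example.com/api/parse-email",  # placeholder URL; see the Swagger docs
    headers={"x-api-key": "YOUR_API_KEY"},  # fake key for illustration
    json={
        "from_address": "bgates@example.com",
        "from_name": "Bill Gates",
        "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
        "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)",
    },
)
print(response.json())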
