I have a pandas DataFrame containing instances of web chat between two people: the customer and the service desk operator.
The customer's name is always announced in the first line of the web chat as the customer enters the conversation.
Example 1:
In: df['log'][0]
Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks[14:44:38] James has exited the session.
Example 2:
In: df['log'][1]
Out: [09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: I'm asking on behalf of partner whether she is able to still claim warranty on a coffee machine she purchased a year and a half ago? [09:02:00] Jamie: Thank you for contacting us. Could I start by asking for the type of coffee machine purchased please, and whether she still has a receipt?[09:02:23] Roy Andrews: BRX0403, she no longer has a receipt.[09:05:30] Jamie: Thank you, my interpretation is that she would not be able to claim as she is no longer under warranty. [09:08:46] Jamie: for more information on our product warranty policy please see www.brandx.com/warranty-policy/information[09:09:13] Roy Andrews: Thanks for the links, I will let her know.[09:09:15] Roy Andrews has exited the session.
The names in the chat always vary as different customers use the web chat service.
A customer can enter the chat with a name consisting of one or more words. Example:
James
Ravi
Roy Andrews.
Requirements:
I would like to separate all instances of customer chat (e.g. chat by James and Roy Andrews) from the df['log'] column into a new column df['text_analysis'].
From example 1 above this would look like:
In: df['text_analysis'][0]
Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:44:12] James: Thanks
EDIT:
The optimal solution would extract the substrings as shown in the example above and omit the final timestamped line [14:44:38] James has exited the session.
What I have tried so far:
I have extracted the customer names from the df['log'] column into a new column called df['names'] using:
df['names'] = df['log'].apply(lambda x: x.split(' ')[7].split('[')[0])
I wanted to use the names in the df['names'] column in a pandas str.split() call -- something along the lines of:
df['log'].str.split(df['names']), however this does not work, and even if the split did occur I think it would not properly separate the customer and service operator chats.
Also I have tried incorporating the names into a regex type solution:
df['log'].str.extract('([^.]*{}[^.]*)'.format(df['names']))
But this does not work either (I'm guessing because .extract() does not support a pattern built with format like this).
Any help would be appreciated.
Use regex. Here, longs is your first log string (the chat from Example 1):
import re
re.match(r'^.*(?=\[)', longs).group()
Result:
"[14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks"
You can apply this regex across your dataframe:
df['text_analysis'] = df['log'].apply(lambda x: re.match(r'^.*(?=\[)', x).group())
Explanation: the regex '^.*(?=\[)' means: from the beginning (^), match any number of any character (.*), ending just before a [ without including it ((?=\[)). Since .* is greedy and matches the longest possible string, this runs from the beginning to the last [, which is not included.
Individual lines can be extracted this way (s is the log string; substitute the actual customer name for James):
import re
customerspeak = re.findall(r'(?<=\[(?:\d{2}:){2}\d{2}\]) James:[^\[]*', s)
output:
[" James: Hello, I'm looking to find out more about the services and products you offer.",
' James: I would like to know more about your gardening and guttering service.',
' James: hello?',
' James: Thanks']
If you want these in a single string, you can use ''.join(customerspeak).
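Putting this together with the requirement in the EDIT, here is a sketch (one possible approach, not the only one) that derives the customer's name from the join announcement itself, so multi-word names like "Roy Andrews" also work, keeps only the customer's lines, and omits the exit line:
import re
def customer_lines(log):
    # The first line always announces the customer, so take everything between
    # "You are joining a chat with" and the next '[' as the name; this also
    # covers multi-word names like "Roy Andrews".
    name = re.search(r'You are joining a chat with ([^\[]+)', log).group(1).strip()
    name_re = re.escape(name)
    # Keep the join announcement and every timestamped line spoken by the
    # customer; the exit line has no colon after the name, so it is dropped.
    pattern = (r'\[(?:\d{2}:){2}\d{2}\] '
               r'(?:You are joining a chat with ' + name_re
               + r'|' + name_re + r':[^\[]*)')
    return ''.join(re.findall(pattern, log))
df['text_analysis'] = df['log'].apply(customer_lines)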
Introduction
Hello. I am currently building a web application that takes a random song and puts it into a Spotify playlist. (The user can't choose which songs they get.)
So I search for the input with the Spotify API and get a list of results.
Problem
Since Spotify does not always return the best result first, I want to loop through the results and find the best-matching one. How would you achieve the best result?
My attempt
The first thing I tried, was matching the strings with the fuzzywuzzy library.
This looked something like this:
from fuzzywuzzy.fuzz import ratio
song_ratio = ratio(real_song_name, result_song_name)
This was good and helped a lot, but what about songs that just have different punctuation?
So what I did was remove the punctuation with:
from string import punctuation
song_name = song_name.translate(str.maketrans('', '', punctuation))
I also want to avoid karaoke, remastered, or live versions, etc., e.g.:
Stay with Me Till Dawn - Live in the UK, 1982 / 2010 Remaster from Judie Tzuke
Just filtering on these names would make no sense because they do not appear in a consistent form.
Another problem:
Searching for the song "Fascination" from "Jane Morgan And The Troubadors"
What I get is:
Best found song: Its Been A Long Long Time, 22 % match
Best found artist: Jane Morgan, 54 % match
Had I just queried for the song "Fascination" from "Jane Morgan", I would get:
Best found song: Fascination, 100 % match
Best found artist: Jane Morgan, 100 % match
Question
What is a good way to solve this issue? Is it possible to train a neural network to process my strings into the right format and then find the best match?
Something you could try is to use the advanced query syntax offered by Spotify search, and only search for part of the song title/artist name. For example your query for "Fascination" from "Jane Morgan And The Troubadors" could become:
artist:"Jane Mo" track:"Fascin"
and still return the correct result.
This query looks for the exact string 'Jane Mo' appearing in the artist name and 'Fascin' in the track title.
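As a sketch of how to issue such a query from Python, here is an example using the spotipy client (the client-credentials setup is an assumption; spotipy reads SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET from the environment):
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# Credentials are picked up from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET env vars.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
# Field filters restrict the match to partial artist and track strings.
results = sp.search(q='artist:"Jane Mo" track:"Fascin"', type='track', limit=5)
for item in results['tracks']['items']:
    print(item['name'], '-', item['artists'][0]['name'])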
I need a python package that could get the related sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package, if one exists.
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
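For example, a minimal sketch (the two sentences come from the question's text; token_set_ratio is a fuzzywuzzy scorer that ignores word order, which tends to suit keyword-style queries better than plain ratio):
from fuzzywuzzy import fuzz
wiki_text = ("J. Robert Oppenheimer was born in New York City on April 22, 1904. "
             "The first atomic bomb was successfully detonated on July 16, 1945, "
             "in the Trinity test in New Mexico.")
def best_sentence(text, query):
    # Naive sentence split on periods; a real sentence tokenizer (e.g. nltk) is more robust.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # Score every sentence against the query and return the highest-scoring one.
    return max(sentences, key=lambda s: fuzz.token_set_ratio(query, s))
print(best_sentence(wiki_text, "JJ Oppenheimer birth date"))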
I am pretty sure a module exists that could do this for you; alternatively, you could make it yourself by parsing the text and building keyword lists like ["date of birth", "born", "birth date", etc.], one per field you want to extract. This would let you find the information that is available.
The idea is:
you grab your text or whatever you have,
you grab what you are looking for (for example, date of birth),
you then map "date of birth" to a list of similar words,
you look through your file to see if you find a sentence that contains one of them.
Then again, I am pretty sure there is no ready-made module; maybe I am wrong, but something like this should work.
The task you describe looks like Information Retrieval: given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer using fuzzywuzzy suggests. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document relative to the rest of the corpus.
This results in every document having an associated vector; you can then sort the documents by their similarity to the query vector.
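As an illustrative sketch with scikit-learn (the two sentences are taken from the question; in practice you would index every sentence of the article):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.",
]
query = "JJ Oppenheimer birth date"
# Fit TF-IDF on the sentences, then project the query into the same vector space.
vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)
query_vector = vectorizer.transform([query])
# Rank the sentences by cosine similarity to the query and keep the best one.
scores = cosine_similarity(query_vector, sentence_vectors).ravel()
print(sentences[scores.argmax()])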
I have a DataFrame:
Job Title
CEO
Founder
Co-Founder
Co-founder, Executive Officer & Co-Managing Partner
Co-Founder and Systems/Software Developer
Founder/CEO
I need to keep only the rows that have Founders and Co-Founders in them. Note that I will also have to account for capitalization, spaces, and other characters like "/" and ",".
Please let me know what to do.
If you're using pandas you should be able to do something like this:
df = df[df['Job Title'].str.contains('founder', case=False)]
That accounts for both Founder and Co-Founder, and case=False makes the substring match case-insensitive.
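A quick check against the sample data from the question:
import pandas as pd
df = pd.DataFrame({'Job Title': [
    'CEO', 'Founder', 'Co-Founder',
    'Co-founder, Executive Officer & Co-Managing Partner',
    'Co-Founder and Systems/Software Developer', 'Founder/CEO',
]})
# case=False makes the match case-insensitive, so it catches "Founder",
# "Co-founder" and "Founder/CEO" alike; only the "CEO" row is dropped.
founders = df[df['Job Title'].str.contains('founder', case=False)]
print(founders)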
I have a news dataset which contains almost 10,000 news items from the last 3 years.
I also have a list of companies (names of companies) which are registered on the NYSE. Now I want to check whether the company names in the list appear in the news dataset or not.
Example:
company Name: 'E.I. du Pont de Nemours and Company'
News: 'Monsanto and DuPont settle major disputes with broad patent-licensing deal, with DuPont agreeing to pay at least $1.75 billion over 10 years for rights to technology for herbicide-resistant soybeans.'
Now, I can find the news containing a company name when the exact company name is in the news, but as you can see from the above example that is not always the case.
I also tried another way: I took a distinctive word from the company's full name, i.e. in the above example 'Pont', which should definitely be part of the text whenever this company is mentioned. That worked the majority of the time, but then a problem occurs in the following example:
Company Name: Ennis, Inc.
News: L D`ennis` Kozlowski, former chief executive convicted of looting nearly $100 million from Tyco International, has emerged into far more modest life after serving six-and-a-half year sentence and probation; Kozlowski, who became ultimate symbol of corporate greed in era that included scandals at Enron and WorldCom, describes his personal transformation and more humble pleasures that have replaced his once high-flying lifestyle.
Now you can see that Ennis matches inside D`ennis` in the text, so it gives irrelevant news results.
Can someone suggest the right way of doing this? Thanks.
Use a regex with word boundaries for exact matches. Whether you choose the full name or some partial part you think is unique is up to you, but with word boundaries, D`ennis` won't match Ennis:
companies = ["name1", "name2",...]
companies_re = re.compile(r"|".join([r"\b{}\b".format(name) for name in companies]))
Depending on how many matches you expect per news item, you may want to use companies_re.search(article) or companies_re.findall(article).
Also, for case-insensitive matches pass re.I to compile.
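A quick runnable check of the word-boundary behaviour (names shortened for illustration):
import re
companies = ["Ennis", "Monsanto"]
# \b ensures whole-word matches, so "Ennis" does not match inside "Dennis".
companies_re = re.compile(r"|".join(r"\b{}\b".format(re.escape(name)) for name in companies))
print(companies_re.search("L Dennis Kozlowski, former chief executive"))   # None
print(companies_re.findall("Monsanto and DuPont settle major disputes"))  # ['Monsanto']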
If the only line you want to check is always the one starting with "company Name:", you can narrow down the search:
for line in all_lines:
    if line.startswith("company Name:"):
        name = companies_re.search(line)
        if name:
            ...
        break
It sounds like you need the Aho-Corasick algorithm. There is a nice and fast implementation for python here: https://pypi.python.org/pypi/pyahocorasick/
It will only do exact matching, so you would need to index both "Du pont" and "Dupont", for example. But that's not too hard: you can use Wikidata to help you find aliases; for example, the aliases on Dupont's entry include both "Dupont" and "Du pont".
Ok so let's assume you have the list of company names with their aliases:
import ahocorasick
A = ahocorasick.Automaton()
companies = ["google", "apple", "tesla", "dupont", "du pont"]
for idx, key in enumerate(companies):
    A.add_word(key, idx)
Next, make the automaton (see the link above for details on the algorithm):
A.make_automaton()
Great! Now you can simply search for all companies in some text:
your_text = """
I love my Apple iPhone. Do you know what a Googleplex is?
I ate some apples this morning.
"""
for end_index, idx in A.iter(your_text.lower()):
    print(end_index, companies[idx])
This is the output:
15 apple
49 google
74 apple
The numbers correspond to the index of the last character of the company name in the text.
Easy, right? And super fast, this algorithm is used by some variants of GNU grep.
Saving/loading the automaton
If there are a lot of company names, creating the automaton may take some time, so you may want to create it just once, save it to disk (using pickle), then load it every time you need it:
# create_company_automaton.py
# ... create the automaton (see above)
import pickle
pickle.dump(A, open('company_automaton.pickle', 'wb'))
In the program that will use this automaton, you start by loading the automaton:
# use_company_automaton.py
import ahocorasick
import pickle
A = pickle.load(open("company_automaton.pickle", "rb"))
# ... use the automaton
Hope this helps! :)
Bonus details
If you want to match "Apple" in "Apple releases a new iPhone" but not in "I ate an apple this morning", you are going to have a hard time. But it is doable: for example, you could gather a set of articles containing the word "apple" and about the company, and a set of articles not about the company, then identify words (or n-grams) that are more likely when it's about the company (e.g. "iPhone"). Unfortunately you would need to do this for every company whose name is ambiguous.
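A toy sketch of that idea (the corpus and labels are invented for illustration; real training data would be hand-labelled articles):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Tiny invented corpus: 1 = about the company, 0 = about the fruit.
texts = [
    "Apple releases a new iPhone with a faster chip",
    "Apple stock rises after strong iPhone sales",
    "I ate an apple this morning with breakfast",
    "This apple pie recipe uses tart apples",
]
labels = [1, 1, 0, 0]
# Unigrams and bigrams let the model pick up cues like "iphone" or "apple pie".
vectorizer = CountVectorizer(ngram_range=(1, 2))
clf = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
print(clf.predict(vectorizer.transform(["Apple announces a new iPhone"])))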
You can try
difflib.get_close_matches
with the full company name.
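For example (company names taken from the question; the cutoff is a judgment call, since short queries scored against long official names tend to rate low, so matching against short aliases works better):
import difflib
companies = ["E.I. du Pont de Nemours and Company", "Ennis, Inc.", "Monsanto Company"]
# Returns up to n candidates whose similarity ratio exceeds cutoff (0.0-1.0).
print(difflib.get_close_matches("Monsanto", companies, n=3, cutoff=0.3))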
I am new to Python, and am wondering if anyone can help me with some file loading.
The situation is that I have some text files and I'm trying to do sentiment analysis. Here's the text file format. It is split into three categories: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn it into this:
<category> <user> <review>
I have 50k lines of these data.
I have tried to load it directly into numpy, but it raises an empty-separator error. I looked on Stack Overflow, but I couldn't find a case that handles a varying number of delimiters. For instance, I will never know in advance how many spaces there are in the data set.
My biggest problem is: how do you count the delimiters and assign the pieces to columns? Is there a way to split each line into the three categories <department>, <user>, <review>? Bear in mind that the review text can contain arbitrary commas and spaces which I can't control, so the system must be smart enough to pick that up!
Any ideas? Is there a way I can tell Python that after it reads the user field, everything that follows falls under review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from io import StringIO  # Python 2: from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(category,
                                                          user,
                                                          review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/3/library/stdtypes.html#str.split
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the 2nd part (user)
    user = tmp_split[1]
    # Everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
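To scale this to the 50k lines mentioned in the question, the parsed rows can go straight into a pandas DataFrame (the filename reviews.txt is a placeholder):
import pandas as pd
rows = []
with open("reviews.txt", encoding="utf-8") as f:  # placeholder filename
    for line in f:
        # split(None, 2) splits on runs of whitespace at most twice, so
        # everything after the user name stays intact as the review.
        department, user, review = line.split(None, 2)
        rows.append((department, user, review.strip()))
df = pd.DataFrame(rows, columns=["department", "user", "review"])
print(df.head())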