How to identify substrings in the order of the string? - python

I have a list of sentences as below.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
I also have a set of selected concepts.
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
Now I want to select the concepts in selected_concepts from sentences, in the order they appear in each sentence.
i.e. my output should be as follows.
output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]
I could extract the concepts in the sentences as follows.
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)
However, I have trouble organising the extracted concepts according to sentence order. Is there an easy way of doing this in Python?

One way to do it is to use the .find() method to find the position of each substring and then sort by that value. For example:
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
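For reference, the same idea can be written a bit more compactly with sorted() and a key function; this is an equivalent sketch, assuming the sentences and selected_concepts from the question:

output = []
for sentence in sentences:
    # keep only the concepts that occur, then order them by their position
    present = [c for c in selected_concepts if c in sentence]
    output.append(sorted(present, key=sentence.find))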

You could use .find() and .insert() instead.
Something like:
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert(pos, item)
    output.append(sentence_tokens)
The only problem would be overlaps in selected_concepts, for example 'databases process' and 'process'. In that case they would end up in the opposite of the order they have in selected_concepts. You could potentially fix this with the following:
output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    sentence_tokens = []
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert((selected_concepts_multiplier * pos) + k, item)
    output.append(sentence_tokens)

There is a built-in operator called in. It can check whether one string occurs in another string.
sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]
selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',
    'databases process',
    'information',
    'process'
]

output = []  # prepare the output
for s in sentences:  # now let's check each sentence
    output.append(list())  # add a list to output, so it becomes a multidimensional list
    for c in selected_concepts:  # check all selected_concepts
        if c in s:  # if a selected concept occurs in the sentence
            output[-1].append(c)  # then add it to the last list in output
print(output)

You can use the fact that regular expressions search text in order, left to right, and disallow overlaps:
import re
concept_re = re.compile(r'\b(?:' +
                        '|'.join(re.escape(concept) for concept in selected_concepts) +
                        r')\b')
output = [match
          for sentence in sentences
          for match in concept_re.findall(sentence)]
output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']
This should also be faster than searching for each concept individually, since the algorithm regexes use is more efficient for this task and is implemented entirely in low-level code.
There is one difference, though: if a concept repeats within one sentence, your code gives only one appearance per sentence, while this code outputs them all. If that matters, it is easy to dedupe the list, as sketched below.
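For example, if you also want one list per sentence with repeats removed, an order-preserving dedupe is short; a sketch, assuming the concept_re compiled above:

def dedupe(matches):
    # keep only the first occurrence of each match, preserving order
    seen = set()
    return [m for m in matches if not (m in seen or seen.add(m))]

per_sentence = [dedupe(concept_re.findall(sentence)) for sentence in sentences]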

Here I used the simple re.findall method: if the pattern is matched in the string, re.findall returns the matched pattern; otherwise it returns an empty list. Based on that I wrote this code:
import re
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
output = []
for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)
print(output)
Output:
[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]

Related

Python NLP Spacy : improve bi-gram extraction from a dataframe, and with named entities?

I am using Python and spaCy as my NLP library, working on a big dataframe that contains feedback about different cars, which looks like this:
'feedback' column contains natural language text to be processed,
'lemmatized' column contains lemmatized version of the feedback text,
'entities' column contains named entities extracted from the feedback text (I've trained the pipeline so that it will recognise car models and brands, labelling these as 'CAR_BRAND' and 'CAR_MODEL')
I then created the following function, which applies the spaCy nlp_token pipeline to each row of my dataframe and extracts any [noun + verb], [verb + noun], [adj + noun], [adj + proper noun] combinations.
def bi_gram(x):
    doc = nlp_token(x)
    result = []
    text = ''
    for i in range(len(doc)):
        j = i+1
        if j < len(doc):
            if (doc[i].pos_ == "NOUN" and doc[j].pos_ == "VERB") or (doc[i].pos_ == "VERB" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "NOUN") or (doc[i].pos_ == "ADJ" and doc[j].pos_ == "PROPN"):
                text = doc[i].text + " " + doc[j].text
                result.append(text)
                i = i+1
    return result
Then I applied this function to the 'lemmatized' column:
df['bi_gram'] = df['lemmatized'].apply(bi_gram)
This is where I have a problem...
This is producing only one bigram per row maximum. How can I tweak the code so that more than one bigram can be extracted and put in a column? (Also are there more linguistic combinations I should try?)
Is there a possibility to find out what people are saying about 'CAR_BRAND' and 'CAR_MODEL' named entities extracted in the 'entities' column? For example 'Cool Porsche' - Some brands or models are made of more than two words so it's tricky to tackle.
I am very new to NLP... If there is a more efficient way to tackle this, any advice would be super helpful!
Many thanks for your help in advance.
spaCy has a built-in pattern matching engine that's perfect for your application – it's documented here and in a more extensive usage guide. It allows you to define patterns in a readable and easy-to-maintain way, as lists of dictionaries that define the properties of the tokens to be matched.
Set up the pattern matcher
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # or whatever model you choose
matcher = Matcher(nlp.vocab)

# your patterns
patterns = {
    "noun_verb": [{"POS": "NOUN"}, {"POS": "VERB"}],
    "verb_noun": [{"POS": "VERB"}, {"POS": "NOUN"}],
    "adj_noun": [{"POS": "ADJ"}, {"POS": "NOUN"}],
    "adj_propn": [{"POS": "ADJ"}, {"POS": "PROPN"}],
}

# add the patterns to the matcher
for pattern_name, pattern in patterns.items():
    matcher.add(pattern_name, [pattern])
Extract matches
doc = nlp("The dog chased cats. Fast cats usually escape dogs.")
matches = matcher(doc)
matches is a list of tuples containing
a match id,
the start index of the matched bit and
the end index (exclusive).
This is a test output adapted from the spaCy usage guide:
for match_id, start, end in matches:
    # Get the string representation of the match id
    string_id = nlp.vocab.strings[match_id]
    # The matched span
    span = doc[start:end]
    print(repr(span.text))
    print(match_id, string_id, start, end)
    print()
Result
'dog chased'
1211260348777212867 noun_verb 1 3
'chased cats'
8748318984383740835 verb_noun 2 4
'Fast cats'
2526562708749592420 adj_noun 5 7
'escape dogs'
8748318984383740835 verb_noun 8 10
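To get this into your dataframe, you can wrap the matcher in a small helper and apply it to the 'lemmatized' column, analogous to your original bi_gram function. A sketch, assuming the nlp and matcher objects set up above and the df from your question:

def extract_pos_bigrams(text):
    # run the pipeline and return the text of every matched span
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

df['bi_gram'] = df['lemmatized'].apply(extract_pos_bigrams)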
Some ideas for improvement
Named entity recognition should be able to detect multi-word expressions, so brand and/or model names that consist of more than one token shouldn't be an issue if everything is set up correctly
Matching dependency patterns instead of linear patterns might slightly improve your results (see the sketch below)
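For the dependency-pattern idea, spaCy ships a DependencyMatcher that matches over the parse tree instead of over linear token order. A minimal sketch; the pattern below (an adjectival modifier attached to a noun) and the example sentence are only illustrations:

from spacy.matcher import DependencyMatcher

dep_matcher = DependencyMatcher(nlp.vocab)
dep_pattern = [
    # anchor token: any noun
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    # a direct child of that noun with the 'amod' (adjectival modifier) relation
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "modifier", "RIGHT_ATTRS": {"DEP": "amod"}},
]
dep_matcher.add("adj_modifies_noun", [dep_pattern])

doc = nlp("The new car has a surprisingly quiet engine.")
for match_id, token_ids in dep_matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['car', 'new'], ['engine', 'quiet']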
That being said, what you're trying to do (a kind of sentiment analysis) is quite a difficult task that's normally tackled with machine learning approaches and heaps of training data, so don't expect too much from simple heuristics.

Finding the original form of a word after stemming

I am stemming a list of words and making a dataframe from it. The original data is as follows:
original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text':[original]})
The functions I've used for lemmatisation and stemming are:
# lemmatize & stem
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result
The output will come from running df['text'].map(preprocess)[0] for which I get:
['man',
'fli',
'airplan',
'die',
'air',
'crash',
'wife',
'die',
'coupl',
'week',
'ago']
I wonder how I can map the output back to the original tokens? For instance, I have die, which comes from both died and dies.
Stemming destroys information in the original corpus, by non-reversibly turning multiple tokens into some shared 'stem' form.
If you want the original text, you need to retain it yourself.
But also, note: many algorithms working on large amounts of data, like word2vec under ideal conditions, don't necessarily need or even benefit from stemming. You want to have vectors for all the words in the original text – not just the stems – and with enough data, the related forms of a word will get similar vectors. (Indeed, they'll even differ in useful ways, with all 'past' or 'adverbial' or whatever variants sharing a similar directional skew.)
So only do it if you're sure it helps, within your corpus limits & goals.
You could return the mapping relationship as the result and perform postprocessing later.
def preprocess(text):
    lemma_mapping = {}
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping
Or store it as a by-product.
from collections import defaultdict

lemma_mapping = defaultdict(str)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
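For the post-processing step, one option (a sketch, assuming lemma_mapping has been filled by preprocess above) is to invert the mapping so each stem points back to the original tokens it came from:

from collections import defaultdict

# invert token -> stem into stem -> {original tokens}
stem_to_originals = defaultdict(set)
for original_token, stem in lemma_mapping.items():
    stem_to_originals[stem].add(original_token)

print(stem_to_originals['die'])  # e.g. {'dies', 'died'}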

Trying to create a function to search for a word in a string

It's my first time posting, but I have a question about creating a function in Python that will search a list of strings and return any words that I am looking for. Here is what I have so far:
def search_words(data, search_words):
    keep = []
    for data in data:
        if data in search_words:
            keep.append(data)
    return keep
Here is the data I am searching through and the words I am trying to find:
data = ['SHOP earnings for Q1 are up 5%',
'Subscriptions at SHOP have risen to all-time highs, boosting sales',
"Got a new Mazda, VROOM VROOM Y'ALL",
'I hate getting up at 8am FOR A STUPID ZOOM MEETING',
'TSLA execs hint at a decline in earnings following a capital expansion program']
words = ['earnings', 'sales']
Upon doing print(search_words(data=data, search_words=words)), my list (keep) returns empty brackets [] and I am unsure of how to fix the issue. I know that searching for a word in a string is different from looking for a number in a list, but I cannot figure out how to modify my code to account for that. Any help would be appreciated.
You can use the following. This will keep all the sentences in data that contain at least one of the words:
keep = [s for s in data if any(w in s for w in words)]
Since they are all strings, instead of looping over them all, just combine them and search that. Also make words a set for faster lookup:
words = set(words)
[word for word in ' '.join(data).split() if word in words]
Using regex:
import re

re.findall('|'.join(words), ''.join(data))
['earnings', 'sales', 'earnings']
The following will also keep all the sentences in data that contain one or both of the words.
This is a beginner-friendly version.
Both data and search_words must be lists.
def search_words(data, search_words):
    keep = []
    for dt in data:
        for sw in search_words:
            if sw in dt:
                keep.append(dt)
    return keep
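For example, with the data and words from the question, this should print the matching sentences (each sentence is appended once per word it contains, so duplicates are possible when a sentence matches several words):

print(search_words(data, words))
# ['SHOP earnings for Q1 are up 5%',
#  'Subscriptions at SHOP have risen to all-time highs, boosting sales',
#  'TSLA execs hint at a decline in earnings following a capital expansion program']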

How to efficiently identify substrings in the order of the string in python

This is related to my previous question in: How to identify substrings in the order of the string?
For a given set of sentences and a set of selected_concepts, I want to identify the selected_concepts in the order they appear in the sentences.
I am doing it fine with the code provided below.
output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
However, in my real dataset I have 13242627 selected_concepts and 1234952 sentences. Therefore, I would like to know whether there is any way to optimise this code to run in less time. As I understand it, this is O(n^2), so I am concerned about the time complexity (space complexity is not a problem for me).
A sample is mentioned below.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]
What about using a pre-compiled regex?
Here is an example:
import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # spelling error: “knowledge”
    'databases process',
    'information',
    'process']

re_concepts = [re.escape(t) for t in selected_concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]
You get:
[['data mining',
'process',
'patterns',
'methods',
'machine learning',
'database systems'],
['data mining', 'interdisciplinary subfield', 'information', 'information'],
['data mining', 'databases process']]
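One detail to keep in mind with the alternation: Python's re module tries alternatives left to right, so if one concept is a prefix of another starting at the same position (say a hypothetical shorter concept 'data' alongside 'data mining'), the shorter one can win. Sorting the escaped concepts longest-first avoids that; a sketch:

import re

# longest-first so that longer concepts take precedence over their prefixes
re_concepts = sorted((re.escape(t) for t in selected_concepts), key=len, reverse=True)
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall
output = [find_all_concepts(sentence) for sentence in sentences]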

Create a dictionary with 'word groups'

I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However in addition to the single words and their frequencies I would like to keep meaningful 'word groups' and count them as well.
For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately but rather retain the word group in my dictionary if it frequently occurs together. What is the most efficient method to do that? (I think I won't need to go beyond word groups containing 2 or 3 words.) And: at which point should I do the stopword removal?
Here is an example:
text = ('As a Data Scientist, you will focus on machine '
        'learning and Natural Language Processing')
The dictionary I would like to have is:
dict = ['data scientist', 'machine learning', 'natural language processing',
        'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
        'language', 'processing']
Sounds like what you want to do is use collocations from nltk.
Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.
import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)

tokens = nltk.word_tokenize(text.lower())

# Scan the text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)
If you have huge amounts of text, you could check whether you'll get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing small ngram sizes). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of the returned results.
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])
As for stopword removal: if you remove them, you'll produce ngrams that weren't there before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal (see the sketch below). You'll have to decide if this is something you want, or not. If it were me, I would remove punctuation before the keyword scan, but leave the stopwords in place.
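A quick illustration of that effect (a sketch; assumes the NLTK stopword and tokenizer data are downloaded):

import nltk
from nltk.corpus import stopwords

text = "sewing machine and learning center"
stop = set(stopwords.words("english"))
tokens = [t for t in nltk.word_tokenize(text.lower()) if t not in stop]
# tokens == ['sewing', 'machine', 'learning', 'center']
print(('machine', 'learning') in set(nltk.ngrams(tokens, 2)))  # True: a spurious bigram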
Thanks @Batman, I played around a bit with collocations and ended up only needing a couple of lines of code. (Obviously 'meaningful text' should be a lot longer to find actual collocations.)
import nltk
from nltk import word_tokenize
from nltk.collocations import *

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)
