Lemmatize a doc with spacy? - python

I have a spaCy doc that I would like to lemmatize.
For example:
import spacy
nlp = spacy.load('en_core_web_lg')
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
How can I convert every token in the doc to its lemma?

Each token has a number of attributes; you can iterate through the doc to access them.
For example: [token.lemma_ for token in doc]
If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
For a full list of token attributes see: https://spacy.io/api/token#attributes
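Putting it together, a minimal runnable sketch (using the same model and sentence as in the question):
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Python is the greatest language in the world')

# one lemma per token
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# rebuild the text from the lemmas
print(' '.join(lemmas))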

If you don't need a particular pipeline component (for example, the NER or the parser), you can disable it when loading the model. This can sometimes make a big difference and improve loading and processing speed.
For your case (lemmatizing a doc with spaCy) you only need the tagger component (in newer spaCy versions, also keep the attribute_ruler and lemmatizer); the parser and NER can be disabled.
So here is a sample code:
import spacy
# keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)
Output:
['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
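If you want to see what difference disabling components makes on your machine, a rough timing sketch (exact numbers will vary with the model, spaCy version, and hardware):
import time
import spacy

text = 'Python is the greatest language in the world'

for disabled in ([], ["parser", "ner"]):
    start = time.perf_counter()
    nlp = spacy.load('en_core_web_lg', disable=disabled)
    # process the same sentence repeatedly to get a measurable duration
    for _ in range(100):
        nlp(text)
    print(disabled, round(time.perf_counter() - start, 2), 'seconds')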

This answer covers the case where your text consists of multiple sentences.
If you want to obtain a list of all tokens being lemmatized, do:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut is deprecated; use a full model name
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)
words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output:
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.',
# 'a', 'python', 'be', 'an', 'animal', '.']
If you want to obtain a list of all sentences with each token being lemmatized, do:
sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']

Related

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)

I'm looking to get all sentences in a text file that contain at least one of the conjunctions in the list "conjunctions". However, when applying this function to the text in the variable "text_to_look", like this:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    sentences = text.sents
    for sentence in sentences:
        coord_sents = []
        if any(conjunction in sentence for conjunction in conjunctions):
            coord_sents.append(sentence)
    return coord_sents
wanted_sents = get_coordinate_sents(text_to_look)
I get this error message :
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
There seems to be something about spaCy that I'm not aware of and prevents me from doing this...
The immediate problem is that conjunction is a string while sentence is a Span object; to check whether the sentence text contains a conjunction you need to access the Span's text property. You also re-initialize coord_sents inside the loop, so only the sentences appended during the last iteration survive. Note that a list comprehension looks preferable in such cases.
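For reference, the loop version only needs the initialization moved out of the loop and .text added to the membership test (a minimally corrected version of your own function):
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    coord_sents = []  # initialized once, not once per sentence
    for sentence in text.sents:
        if any(conjunction in sentence.text for conjunction in conjunctions):
            coord_sents.append(sentence)
    return coord_sents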
Written as a list comprehension, the quick fix for your case is
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents
            if any(conjunction in sentence.text for conjunction in conjunctions)]
Here is my test:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
file_to_examine = text_to_look
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
text = lang_model(file_to_examine)
sentences = text.sents
coord_sents = [sentence for sentence in sentences if any(conjunction in sentence.text for conjunction in conjunctions)]
Output:
>>> coord_sents
[She's looking to buy one, but she hasn't got any money., She really wanted to book, so she asks another customer to lend her money., They get along really well, so they both exchange phone numbers and go their separate ways.]
However, the in operator matches substrings, so it will also find 'nor' in 'north', 'so' in 'crimson', etc.
You need a regex here:
import re
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')
def get_coordinate_sents(file_to_examine):
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if rx.search(sentence.text)]
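For example (reusing lang_model and text_to_look from above), you would call it like this:
wanted_sents = get_coordinate_sents(text_to_look)
for sent in wanted_sents:
    print(sent.text)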

How to obtain SentenceTransformer vocab from corpus or query?

I am trying the SentenceTransformer model from SBERT.net and I want to know how it handles entity names. Are they marked as unknown, are they broken down into sub-tokens, etc.? I want to make sure they are used in the comparison.
However, to do that I would need to see the vocab it builds for the query, and perhaps even convert an embedding back to text.
Looking at the API, it's not obvious to me how to do that.
Here is a quick example from their docs:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
# Query sentences:
queries = [
    "A man is eating pasta."
]
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    ...
Your SentenceTransformer model is actually packaging and using a tokenizer from Hugging Face's transformers library under the hood. You can access it as the .tokenizer attribute of your model. The typical behaviour of such a tokenizer is to break unknown words down into word-piece tokens. We can check that this is indeed what it does, which is relatively straightforward:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby."
]
# the tokenizer is just here:
tokenizer = embedder.tokenizer # BertTokenizerFast
# and the vocabulary itself is there, if needed:
vocab = tokenizer.vocab # dict of length 30522
# get the split of sentences according to the vocab, for example:
inputs = tokenizer(corpus, padding='longest', truncation=True)
tokens = [e.tokens for e in inputs.encodings]
# tokens contains:
# [
# ['[CLS]', 'a', 'man', 'is', 'eating', 'food', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
# ['[CLS]', 'a', 'man', 'is', 'eating', 'a', 'piece', 'of', 'bread', '.', '[SEP]']
# ['[CLS]', 'the', 'girl', 'is', 'carrying', 'a', 'baby', '.', '[SEP]', '[PAD]', '[PAD]']
# ]
# now let's try with some unknown tokens and see what it does
queries = [
"Edv Beq is eating pasta."
]
q_inputs = tokenizer(queries, padding='longest', truncation=True)
q_tokens = [e.tokens for e in q_inputs.encodings]
# q_tokens contains:
# [
# ['[CLS]', 'ed', '##v', 'be', '##q', 'is', 'eating', 'pasta', '.', '[SEP]']
# ]
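If you also want to go back from the tokenizer's IDs to readable text, the standard Hugging Face tokenizer methods apply (note this maps the input IDs back to strings; the final sentence embedding itself cannot be converted back to text). A short sketch reusing q_inputs from above:
# ids of the first (and only) query sentence
ids = q_inputs["input_ids"][0]

# map IDs back to word-piece strings (same list as q_tokens above)
print(tokenizer.convert_ids_to_tokens(ids))

# or decode to a plain string, dropping [CLS]/[SEP]/[PAD]
print(tokenizer.decode(ids, skip_special_tokens=True))
# roughly: 'edv beq is eating pasta.' (lowercased by the uncased tokenizer)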

How to extract keyword from text in python? [closed]

I want to extract some keywords from a text and print them, but how?
This is the sample text I want to extract from:
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
These are the sample keywords to extract from the text:
keywords = ('bas agrisi', 'kurtulmak')
and I want to detect these keywords and print them like:
bas agrisi
kurtulmak
How can I do this in Python?
try this:
string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
print(*[key for key in keywords if key in string], sep='\n')
Output:
bas agrisi
kurtulmak
Use the re library to find all occurrences of the keywords.
import re
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
result = re.findall('|'.join(keywords), text)
for key in result:
    print(key)
Output:
bas agrisi
bas agrisi
kurtulmak
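Note that re.findall with a plain '|'.join will also match a keyword inside a longer word (a keyword like 'so' would match inside 'sonbahar'). If that matters, escape the keywords and add word boundaries; a small sketch:
import re

text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')

# \b ensures whole-word matches; re.escape protects any regex metacharacters
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')
for match in pattern.findall(text):
    print(match)
# bas agrisi
# bas agrisi
# kurtulmak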
Do you want Python to understand keywords, or would you like to see the words as tokens in a particular text? For the first, you may need to build a machine learning model or neural network that understands and extracts keywords from the text. For the second, you can tokenize the words in a few very easy steps.
For example,
import nltk #need to download necessary dictionaries
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
text = "I wonder if I have been changed in the night. Let me think. Was
I the same when I got up this morning? I almost can remember feeling a
little different. But if I am not the same, the next question is 'Who
in the world am I?' Ah, that is the great puzzle!" # This is an
#example of a text
tokens = nltk.word_tokenize(text)
tokens  # punctuation is not removed and is treated as part of the tokens
# Output will look like the following:
['I',
'wonder',
'if',
'I',
'have',
'been',
'changed',
'in',
'the',
'night',
'.',
'Let',
'me',
'think',
'.',
'Was',
'I',
'the',
'same',
'when',
'I',....]
# First, clean the text by lowercasing and keeping only alphabetic tokens
tokens2 = [word.lower() for word in tokens if word.isalpha()]
# Second, remove stop words. Stop word lists are available for various
# languages, but the English list is the most complete.
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# Filter the stop words out of the previously created tokens2
# (a list comprehension)
tokens3 = [word for word in tokens2 if word not in stop_words]
# Tokenization is a set-up step for lemmatization, which collapses repeated
# word forms and recovers the base form (lemma) of each word
# lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('stripes', pos='v')  # pos='v' is for verb, pos='n' is for noun
print(lemmatizer.lemmatize("stripes", 'n'))
# output is 'stripe' because the lemma of the word 'stripes' is 'stripe'
# The following is an example for using stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in tokens3]
#output will be
['wonder',
'chang',
'night',
'let',
'think',
'got',
'morn',
'almost',
'rememb',
'feel',
'littl',
'differ',
'next',
'question',
'world',
'ah',
'great',
 'puzzl']
# Stop words such as 'I', 'have', 'been' were eliminated from the text, and
# the stems of the remaining words were retrieved.
# One last thing, to see how the lemmatizer works on the whole text:
tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
print(tokens4)
#Output will be
['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
'almost', 'remember', 'feel', 'little', 'different', 'next',
'question', 'world', 'ah', 'great', 'puzzle']
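Instead of lemmatizing everything twice with fixed POS tags, you can also tag each word first and map the tag to a WordNet POS. A hedged sketch (to_wordnet_pos is just a helper name chosen here, not part of NLTK), reusing tokens3 from above:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

def to_wordnet_pos(treebank_tag):
    # map Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens5 = [lemmatizer.lemmatize(word, to_wordnet_pos(tag))
           for word, tag in nltk.pos_tag(tokens3)]
print(tokens5)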
I hope I was able to explain this clearly. If you would like to go a little further and build a neural network or similar mechanism, you could look into one-hot encoding.

Ignore punctuation symbol while giving tokens in spacy

I am new to using spaCy. Can we tell the spaCy API to ignore punctuation symbols when producing tokens?
Example:
For the sentence "Hi, Welcome to StackOverflow." the tokens are:
Hi
,
Welcome
to
StackOverflow
.
I want spaCy to split tokens only on whitespace, so punctuation stays attached to the word. For the above example, the tokens should be:
Hi,
Welcome
to
StackOverflow.
Try:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "Hi, Welcome to StackOverflow."
doc = nlp(txt)
tokens = [tok.text for tok in doc if not tok.is_punct]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
You may wish to define your own list of punctuation:
punctuation = [".",",","!"]
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
or use the readily available one from the string module:
from string import punctuation
print(punctuation)
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Hi', 'Welcome', 'to', 'StackOverflow']
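If, on the other hand, you really do want tokens split only on whitespace, with the punctuation kept attached ('Hi,' and 'StackOverflow.' as in the question), one option is a bare spaCy Tokenizer with no prefix/suffix/infix rules, which splits on whitespace only. A minimal sketch:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = Tokenizer(nlp.vocab)  # no rules: whitespace-only splitting

doc = nlp("Hi, Welcome to StackOverflow.")
print([tok.text for tok in doc])
# ['Hi,', 'Welcome', 'to', 'StackOverflow.']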

spacy tokenization merges the wrong tokens

I would like to use spacy for tokenizing Wikipedia scrapes. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that the 'hypotheses.[2][3]' is glued together into one token.
How can I prevent spacy from connecting this '[2][3]' to the previous token?
As long as it is split from the word hypotheses and the full stop at the end of the sentence, I don't care how it is handled. But individual words and punctuation should stay separate from this syntactic noise.
So for example, any of the following would be a desirable output:
'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'
I think you could try playing around with the infix rules:
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'''[.]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More on this https://spacy.io/usage/linguistic-features#native-tokenizers
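Another option (a plain pre-processing step, not a spaCy feature) is to insert a space between the sentence-final period and the citation brackets before tokenizing; the default tokenizer then separates 'hypotheses' and '.' from the bracketed references:
import re
import spacy

text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'

# put a space between '.' and a following '[n]' citation marker
cleaned = re.sub(r'\.(\[\d+\])', r'. \1', text)

nlp = spacy.load('en_core_web_sm')
doc = nlp(cleaned)
tokens = [tok.text.lower() for tok in doc]
# 'hypotheses' and '.' are now separate tokens; the '[2][3]' part becomes
# its own token(s), no longer glued to the preceding word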
