I want to extract some keywords from a text and print them, but how?
This is the sample text I want to extract from:
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
These are the sample keywords to extract from the text:
keywords = ('bas agrisi', 'kurtulmak')
and I want to detect these keywords and print them like this:
bas agrisi
kurtulmak
How can I do this in Python?
Try this:
string = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
print(*[key for key in keywords if key in string], sep='\n')
Output:
bas agrisi
kurtulmak
Use the re library to find every occurrence of the keywords:
import re
text = "Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar gunlerinde baslayan bu bas agrisi insanin canini sikmakta. Bu durumdan kurtulmak icin neler yapmali."
keywords = ('bas agrisi', 'kurtulmak')
result = re.findall('|'.join(keywords), text)
for key in result:
    print(key)
bas agrisi
bas agrisi
kurtulmak
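Joining the raw keywords with | works for the sample above, but a keyword containing regex metacharacters would break the pattern, and a short keyword could match inside a longer word. A minimal sketch of one way to guard against both, assuming each keyword should be printed only once, in order of first appearance (the names pattern and found are only illustrative):

```python
import re

text = ("Merhaba bugun bir miktar bas agrisi var, genellikle sonbahar "
        "gunlerinde baslayan bu bas agrisi insanin canini sikmakta. "
        "Bu durumdan kurtulmak icin neler yapmali.")
keywords = ('bas agrisi', 'kurtulmak')

# re.escape() protects against keywords that contain regex metacharacters;
# \b keeps a keyword from matching inside a longer word.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

# dict.fromkeys() drops duplicates while preserving first-seen order.
found = list(dict.fromkeys(pattern.findall(text)))
print(*found, sep='\n')
```

If every occurrence is wanted instead, pattern.findall(text) can be printed directly.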
Do you want Python to understand keywords, or would you like to see the words of a particular text as tokens? For the first, you may need to build a machine learning mechanism or neural network to understand and extract keywords from the text. For the second, you can tokenize the words in a few easy steps.
For example,
import nltk #need to download necessary dictionaries
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
text = ("I wonder if I have been changed in the night. Let me think. Was "
        "I the same when I got up this morning? I almost can remember feeling a "
        "little different. But if I am not the same, the next question is 'Who "
        "in the world am I?' Ah, that is the great puzzle!")  # an example text
tokens = nltk.word_tokenize(text)
tokens  # punctuation is not removed; each punctuation mark becomes its own token
# The output will look like the following:
['I',
'wonder',
'if',
'I',
'have',
'been',
'changed',
'in',
'the',
'night',
'.',
'Let',
'me',
'think',
'.',
'Was',
'I',
'the',
'same',
'when',
'I',....]
# First, you can clean the text by lowercasing the words and dropping non-alphabetic tokens
tokens2 = [ word.lower() for word in tokens if word.isalpha()]
# Second, you can remove stop words from the text. Stop word lists are
# available for various languages, but the English list is admittedly
# the most complete
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# Filter the stop words out of the previously created tokens2
# (the [word for word in ...] construct is called a list comprehension)
tokens3 = [word for word in tokens2 if word not in stop_words]
# Tokenization is a preliminary step for lemmatization, which reduces
# inflected forms of a word to a common base form (the lemma)
# Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('stripes', pos='v')  # pos: 'n' is for noun, 'v' is for verb
print(lemmatizer.lemmatize("stripes", 'n'))
# The output is stripe, because the noun lemma of stripes is stripe
# The following is an example of stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in tokens3]
#output will be
['wonder',
'chang',
'night',
'let',
'think',
'got',
'morn',
'almost',
'rememb',
'feel',
'littl',
'differ',
'next',
'question',
'world',
'ah',
'great',
'puzzl']
# Stop words such as I, have and been were eliminated from the text,
# and the remaining words were reduced to their stems.
# One last thing: see how the lemmatizer works on the same tokens
tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
print(tokens4)
#Output will be
['wonder', 'change', 'night', 'let', 'think', 'get', 'morning',
'almost', 'remember', 'feel', 'little', 'different', 'next',
'question', 'world', 'ah', 'great', 'puzzle']
I hope I was able to explain this clearly. If you would like to go a step further and build a neural network or a similar mechanism, you may use one-hot encoding.
I'm looking to get all sentences in a text file that contain at least one of the conjunctions in the list "conjunctions". However, when applying this function to the text in the variable "text_to_look" like this:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    sentences = text.sents
    for sentence in sentences:
        coord_sents = []
        if any(conjunction in sentence for conjunction in conjunctions):
            coord_sents.append(sentence)
    return coord_sents
wanted_sents = get_coordinate_sents(text_to_look)
I get this error message :
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)
There seems to be something about spaCy that I'm not aware of and that prevents me from doing this...
The immediate problem is that conjunction is a string and sentence is a Span object: to check whether the sentence text contains a conjunction, you need to access the Span's text property. In addition, you re-initialize coord_sents inside the loop, so at most the last sentence survives in the variable. Note that a list comprehension is preferable in such cases.
So, a quick fix for your case is
def get_coordinate_sents(file_to_examine):
    conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if any(conjunction in sentence.text for conjunction in conjunctions)]
Here is my test:
import spacy
lang_model = spacy.load("en_core_web_sm")
text_to_look = "A woman is looking at books in a library. She's looking to buy one, but she hasn't got any money. She really wanted to book, so she asks another customer to lend her money. The man accepts. They get along really well, so they both exchange phone numbers and go their separate ways."
file_to_examine = text_to_look
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
text = lang_model(file_to_examine)
sentences = text.sents
coord_sents = [sentence for sentence in sentences if any(conjunction in sentence.text for conjunction in conjunctions)]
Output:
>>> coord_sents
[She's looking to buy one, but she hasn't got any money., She really wanted to book, so she asks another customer to lend her money., They get along really well, so they both exchange phone numbers and go their separate ways.]
However, the in operator does substring matching, so it will also find "nor" in "north", "so" in "crimson", etc.
You need a regex here:
import re
conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')
def get_coordinate_sents(file_to_examine):
    text = lang_model(file_to_examine)
    return [sentence for sentence in text.sents if rx.search(sentence.text)]
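To see the difference without loading a spaCy model, the same pattern can be tried on plain strings. This is only a sketch; the two sample sentences are made up for illustration:

```python
import re

conjunctions = ['and', 'but', 'for', 'nor', 'or', 'yet', 'so']
rx = re.compile(fr'\b(?:{"|".join(conjunctions)})\b')

samples = [
    "The north wind turned the sky crimson.",       # no conjunction as a whole word
    "She wanted the book, so she asked for help.",  # contains "so" and "for"
]

# Plain substring test: also fires on "nor" inside "north" and "so" inside "crimson"
substring_hits = [any(c in s for c in conjunctions) for s in samples]
# Word-boundary regex: only fires on whole words
regex_hits = [bool(rx.search(s)) for s in samples]

print(substring_hits)
print(regex_hits)
```

The first sample is a false positive for the substring test only, which is the case the \b anchors are meant to rule out.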
I have a spaCy doc that I would like to lemmatize.
For example:
import spacy
nlp = spacy.load('en_core_web_lg')
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
How can I convert every token in the doc to its lemma?
Each token has a number of attributes; you can iterate through the doc to access them.
For example: [token.lemma_ for token in doc]
If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
For a full list of token attributes see: https://spacy.io/api/token#attributes
If you don’t need a particular component of the pipeline – for example, the NER or the parser, you can disable loading it. This can sometimes make a big difference and improve loading speed.
For your case (lemmatizing a doc with spaCy), you only need the tagger component.
Here is some sample code:
import spacy
# keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)
Output:
['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
This answer covers the case where your text consists of multiple sentences.
If you want to obtain a list of all tokens being lemmatized, do:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy v3
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)
words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output:
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.',
# 'a', 'python', 'be', 'an', 'animal', '.']
If you want to obtain a list of all sentences with each token being lemmatized, do:
sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']
I have a numpy array of sentences (strings)
arr = np.array(["It's the most wonderful time of the year.",
                'With the kids jingle belling.',
                'And everyone telling you be of good cheer.',
                "It's the hap-happiest season of all."])
(that I read from a csv file). I need to make a numpy array with all the unique words in these sentences.
So what I need is
array(["It's", "the", "most", "wonderful", "time", "of", "year", "With", "the", "kids", "jingle", "belling", "and", "everyone", "telling", "you", "be", "good", "cheer", "It's", "hap-happiest", "season", "all"])
I could do this like
o = []
for x in arr:
    o += x.split()
words = np.array(o)
unique_words = np.array(list(set(words.tolist())))
but as this involves first making lists and then converting that to numpy array, it's obviously gonna be slow and inefficient for large data.
I also tried nltk as in
words = np.array([])
for x in arr:
    words = np.append(words, nltk.word_tokenize(x))
but this too seems inefficient, as a new array is created on each iteration instead of the old one being modified.
I suppose there's some elegant way of achieving what I want using more of numpy.
Can you point me in the right direction?
I think you can try something like this:
vocab = set()
for x in arr:
    vocab.update(nltk.word_tokenize(x))
set.update() takes an iterable and adds its elements to the existing set.
Update:
Also, you can look at the working of CountVectorizer in scikit-learn which:
converts a collection of text documents to a matrix of token counts.
And it uses a dictionary to keep track of the unique words:
# raw_documents is an iterable of sentences.
for doc in raw_documents:
    feature_counter = {}
    # analyze will split the sentences into tokens
    # and apply some preprocessing on them (like stemming, lemma etc)
    for feature in analyze(doc):
        try:
            # vocabulary is a dictionary mapping each word to its feature index
            feature_idx = vocabulary[feature]
            ...
        ...
And I think it works pretty efficiently. So I think you can also use a dict() instead of a set. I am not familiar with the inner workings of NLTK, but I think it must also contain something equivalent to CountVectorizer.
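The dictionary idea can be sketched with the standard library alone. This is not how CountVectorizer is implemented; it is a minimal illustration that assumes whitespace splitting is an acceptable stand-in for a real tokenizer, and the names sentences, vocabulary and unique_words are only illustrative:

```python
from collections import Counter

sentences = [
    "It's the most wonderful time of the year.",
    "With the kids jingle belling.",
    "And everyone telling you be of good cheer.",
    "It's the hap-happiest season of all.",
]

vocabulary = Counter()
for sentence in sentences:
    # str.split() stands in for a real tokenizer, so punctuation
    # stays attached to words (e.g. "year.")
    vocabulary.update(sentence.split())

# The keys are the unique words, in first-seen order
unique_words = list(vocabulary)
print(unique_words)
print(vocabulary['the'])
```

A Counter is a dict subclass, so it gives the unique words and their counts in a single pass without building intermediate arrays.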
I'm not sure numpy is the best way to go here. You can achieve what you want with nested lists and sets or dictionaries.
One useful thing to know is that the tokenizer methods from nltk can process a list of sentences, and will return a list of tokenized sentences. For example:
from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()
tokenized = wpt.tokenize_sents(arr)
This will return a list of lists of the tokenized sentences in arr, i.e.:
[['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
['With', 'the', 'kids', 'jingle', 'belling', '.'],
['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.']]
nltk comes with lots of different tokenizers, and so will give you options for how best to split the sentences into word tokens. You can then use something like the following to get the unique set of words / tokens:
unique_words = set()
for toks in tokenized:
    unique_words.update(toks)
I have an array called allchats consisting of long strings. Some of its entries look like the following:
allchats[5,0] = "Hi, have you ever seen something like that? no?"
allchats[106,0] = "some word blabla some more words yes"
allchats[410,0] = "I don't know how we will ever get through this..."
I wish to tokenize each string in the array. Furthermore, I wish to use a regex tool to eliminate question marks, commas, etc.
I have tried the following:
import nltk
from nltk.tokenize import RegexpTokenizer
tknzr = RegexpTokenizer(r'\w+')
allchats1 = [[tknzr.tokenize(chat) for chat in str] for str in allchats]
I wish to end up with:
allchats[5,0] = ['Hi', 'have', 'you', 'ever', 'seen', 'something', 'like', 'that', 'no']
allchats[106,0] = ['some', 'word', 'blabla', 'some', 'more', 'words', 'yes']
allchats[410,0] = ['I', 'dont', 'know', 'how', 'we', 'will', 'ever', 'get', 'through', 'this']
I am quite sure that I am doing something wrong with the strings (str) in the for loop, but cannot figure out what I need to correct in order to succeed.
Thank you in advance for your help!
There is a mistake in your list comprehension: the for clauses should be chained rather than nested:
allchats1 = [tknzr.tokenize(chat) for str in allchats for chat in str]
If you want to iterate over words instead of characters, you are looking for the str.split() method. Here is a fully working example:
allchats = ["Hi, have you ever seen something like that? no?", "some word blabla some more words yes", "I don't know how we will ever get through this..."]
def tokenize(word):
    # use real logic here
    return word + 'tokenized'
tokenized = [tokenize(word) for sentence in allchats for word in sentence.split()]
print(tokenized)
If you're not sure that your list contains only strings and you want to process only the strings, you can check with the isinstance function:
tokenized = [tokenize(word) for sentence in allchats if isinstance(sentence, str) for word in sentence.split()]
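For the punctuation-free word lists the question asks for, a regular expression from the standard library may be enough. A minimal sketch, assuming allchats is a flat list of strings as in the example above, and that apostrophes should simply be dropped so that "don't" becomes "dont" (the helper name tokenize_chat is only illustrative):

```python
import re

allchats = [
    "Hi, have you ever seen something like that? no?",
    "some word blabla some more words yes",
    "I don't know how we will ever get through this...",
]

def tokenize_chat(chat):
    # Match runs of word characters and apostrophes so "don't" stays one token,
    # then drop the apostrophes; tokens made only of apostrophes are skipped.
    raw = re.findall(r"[\w']+", chat)
    return [tok.replace("'", "") for tok in raw if tok.strip("'")]

tokenized = [tokenize_chat(chat) for chat in allchats if isinstance(chat, str)]
for tokens in tokenized:
    print(tokens)
```

This yields one list of words per chat rather than a single flat list, which matches the shape shown in the question.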
I have two lists of words that I would like to find in a sentence based on a sequence. Is it possible to use a regular expression, or should I check the sentence with if conditions?
n_ali = set(['ali','aliasghar'])
n_leyla = set(['leyla','lili','leila'])
positive_adj = set(['good','nice','handsome'])
negative_adj = set(['bad','hate','lousy'])
sentence = "aliasghar is nice man. ali is handsome man of my life. lili has so many bad attitude who is next to my friend. "
I would like to find any pattern as below:
n_ali + positive_adj
n_ali + negative_adj
n_leyla + positive_adj
n_leyla + negative_adj
I am using Python 3.5 in VS2015 and I am new to NLTK. I know how to create a regular expression to check for a single word, but I am not sure what the best approach is for a list of similar names. Please suggest the best way to implement this.
You should consider removing stopwords.
>>> import nltk
>>> from nltk.corpus import stopwords
>>> words = [word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')]
>>> words
['aliasghar', 'nice', 'man', '.', 'ali', 'handsome', 'man', 'life', '.', 'lili', 'many', 'bad', 'attitude', 'next', 'friend', '.']
Alright, now you have the data like you want it (mostly). Let's use simple looping to store the results in pairs for ali and leila separately.
>>> ali_adj = []
>>> leila_adj = []
>>> for i, word in enumerate(words[:-1]):
...     if word in n_ali and (words[i+1] in positive_adj.union(negative_adj)):
...         ali_adj.append((word, words[i+1]))
...     if word in n_leyla and (words[i+1] in positive_adj.union(negative_adj)):
...         leila_adj.append((word, words[i+1]))
...
>>>
>>> ali_adj
[('aliasghar', 'nice'), ('ali', 'handsome')]
>>> leila_adj
[]
Note that we could not find any adjectives to describe leila because "many" isn't a stopword. You may have to do this type of cleaning of the sentence manually.
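One way to avoid that manual cleaning is to look a few words ahead of each name instead of only at the next word. A minimal sketch using only the standard library; the helper find_pairs and its window parameter are hypothetical, and the window size of 4 is an arbitrary assumption:

```python
import re

n_ali = {'ali', 'aliasghar'}
n_leyla = {'leyla', 'lili', 'leila'}
positive_adj = {'good', 'nice', 'handsome'}
negative_adj = {'bad', 'hate', 'lousy'}

sentence = ("aliasghar is nice man. ali is handsome man of my life. "
            "lili has so many bad attitude who is next to my friend. ")

def find_pairs(text, names, adjectives, window=4):
    pairs = []
    # Split on full stops so a name is never paired with an adjective
    # from a different sentence.
    for part in text.lower().split('.'):
        words = re.findall(r"\w+", part)
        for i, word in enumerate(words):
            if word in names:
                # Take the first adjective within the next `window` words, if any
                for nxt in words[i + 1:i + 1 + window]:
                    if nxt in adjectives:
                        pairs.append((word, nxt))
                        break
    return pairs

adjectives = positive_adj | negative_adj
ali_pairs = find_pairs(sentence, n_ali, adjectives)
leyla_pairs = find_pairs(sentence, n_leyla, adjectives)
print(ali_pairs)
print(leyla_pairs)
```

A larger window finds more distant adjectives but also raises the chance of pairing a name with an adjective that describes something else.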