Related
I have a text:
text = '''
Wales greatest moment. Lille is so close to the Belgian
border,
this was essentially a home game for one of the tournament favourites. Their
confident supporters mingled with their new Welsh fans on the streets,
buying into the carnival spirit - perhaps more relaxed than some might have
been before a quarter-final because they thought this was their time.
In the driving rain, Wales produced the best performance in their history to
carry the nation into uncharted territory. Nobody could quite believe it.'''
I have this code:
words = text.replace('.',' ').replace(',',' ').replace('\n',' ').split(' ')
print(words)
And the output:
['Wales', 'greatest', 'moment', '', 'Lille', 'is', 'so', 'close', 'to', 'the', 'Belgian', 'border', '', '', 'this', 'was', 'essentially', 'a', 'home', 'game', 'for', 'one', 'of', 'the', 'tournament', 'favourites', '', 'Their', '', 'confident', 'supporters', 'mingled', 'with', 'their', 'new', 'Welsh', 'fans', 'on', 'the', 'streets', '', '', 'buying', 'into', 'the', 'carnival', 'spirit', '-', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might', 'have', '', 'been', 'before', 'a', 'quarter-final', 'because', 'they', 'thought', 'this', 'was', 'their', 'time', '', 'In', 'the', 'driving', 'rain', '', 'Wales', 'produced', 'the', 'best', 'performance', 'in', 'their', 'history', 'to', '', 'carry', 'the', 'nation', 'into', 'uncharted', 'territory', '', 'Nobody', 'could', 'quite', 'believe', 'it', '']
As you can see, the list contains empty strings after I removed '\n', ',' and '.'.
But now I have no idea how to remove these empty strings.
You can filter them out if you don't want them:
no_empties = list(filter(None, words))
If function is None, the identity function is assumed; that is, all elements of iterable that are false are removed.
This works because empty strings are falsy.
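For illustration, an equivalent list comprehension (a minimal sketch reusing the words list from above) does the same filtering:
# keep only truthy (non-empty) strings; equivalent to filter(None, words)
no_empties = [w for w in words if w]
print(no_empties[:6])  # ['Wales', 'greatest', 'moment', 'Lille', 'is', 'so']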
EDIT:
As mentioned in the comments, the original answer does not produce the expected output because of the dash symbol; to keep hyphenated words intact:
import re
words = re.findall(r'[\w-]+', text)
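Assuming the same text variable as above, a quick check confirms that the hyphenated word is now kept as a single token:
import re
words = re.findall(r'[\w-]+', text)
print('quarter-final' in words)  # True: the hyphen no longer splits the word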
Original Answer
You can directly get what you want with the re module
import re
words = re.findall(r'\w+', text)
['Wales', 'greatest', 'moment', 'Lille', 'is', 'so', 'close', 'to', 'the', 'Belgian',
 'border', 'this', 'was', 'essentially', 'a', 'home', 'game', 'for', 'one', 'of', 'the',
 'tournament', 'favourites', 'Their', 'confident', 'supporters', 'mingled', 'with',
 'their', 'new', 'Welsh', 'fans', 'on', 'the', 'streets', 'buying', 'into', 'the',
 'carnival', 'spirit', 'perhaps', 'more', 'relaxed', 'than', 'some', 'might', 'have',
 'been', 'before', 'a', 'quarter', 'final', 'because', 'they', 'thought', 'this', 'was',
 'their', 'time', 'In', 'the', 'driving', 'rain', 'Wales', 'produced', 'the', 'best',
 'performance', 'in', 'their', 'history', 'to', 'carry', 'the', 'nation', 'into',
 'uncharted', 'territory', 'Nobody', 'could', 'quite', 'believe', 'it']
The reason you are getting this issue is that your text value is indented on every line with 4 spaces, not that your code is flawed. If you mean to keep the 4-space indentation on every line, you could add .replace('    ', '') to your 'words' logic to fix this; otherwise you could refer to Thomas Weller's solution, which solves the problem no matter how many consecutive spaces you leave.
Is there a way, with doc2vec, to see the vectors I got per paragraph rather than per word in the vocabulary? Using model.wv.vectors I get all the vectors per word. I need the paragraph vectors in order to apply a clustering algorithm to the embedded paragraphs, though I am not sure whether this approach is good. This is how the paragraphs look:
[TaggedDocument(words=['this', 'is', 'the', 'effect', 'of', 'those', 'states', 'that', 'went', 'into', 'lockdown', 'much', 'later', 'they', 'are', 'just', 'starting', 'to', 'see', 'the', 'large', 'increase', 'now', 'they', 'have', 'to', 'ride', 'it', 'out', 'and', 'hope', 'for', 'the', 'best'], tags=[0])
TaggedDocument(words=['so', 'see', 'the', 'headline', 'is', 'died', 'not', 'revised', 'predictions', 'show', 'more', 'hopeful', 'situation', 'or', 'new', 'york', 'reaching', 'apex', 'long', 'before', 'experts', 'predicted', 'or', 'any', 'such', 'thing', 'got', 'to', 'keep', 'the', 'panic', 'train', 'rolling', 'see'], tags=[1])]
model.docvecs.vectors will contain all the trained-up document vectors.
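As a minimal sketch of how you might use them (assuming the model variable from the question; on gensim 4.x the attribute is model.dv.vectors, and the cluster count below is purely illustrative), you can stack the per-paragraph vectors and feed them to a clustering algorithm:
import numpy as np
from sklearn.cluster import KMeans

# matrix of shape (num_paragraphs, vector_size); individual paragraphs
# can also be looked up by their tag, e.g. model.docvecs[0]
doc_vectors = np.asarray(model.docvecs.vectors)

# cluster the paragraph embeddings (2 clusters chosen only for illustration)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(doc_vectors)
print(labels)  # one cluster label per TaggedDocument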
I have a number of sentences which I would like to split on specific words (e.g. 'and'). However, a sentence sometimes contains two or more of the words I'd like to split on.
Example sentences:
['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because', 'i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is', 'been', 'told', 'not', 'to', 'talk']
so I have written some code to split a sentence:
split_on_word = []
no_splitting = []
for e in example:
    kth = e.split()  # split the string into a list so it looks like the example sentences
    indexPosList = [i for i in range(len(kth)) if kth[i] == 'and']  # positions of the split word in the sentence
    for n in indexPosList:
        if n > 4:  # only split when the word's position is greater than 4
            h = e.split("and")
            for i in h:
                split_on_word.append(i)  # append split sentences
        else:
            no_splitting.append(kth)  # append sentences that don't need to be split
However, when I use this code more than once (e.g. replacing the word to split on with another), I create duplicates or partial duplicates of the sentences that I append to the new lists.
Is there any way to check for multiple conditions, so that if a sentence contains both words (or other combinations of them) I can split the sentence in one go?
The output from the examples should then look like this:
['i', 'am', 'just', 'hoping', 'for', 'strength']
['guidance', 'because']
['i', 'have', 'no', 'idea', 'why']
['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home']
[ 'tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was']
['he', 'is', 'been', 'told', 'not', 'to', 'talk']
You can use itertools.groupby with a function that checks whether a word is a split-word:
In [10]: import itertools as it

In [11]: split_words = {'and', 'because'}
In [12]: [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
Out[12]:
[['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
['he', 'is', 'been', 'told', 'not', 'to', 'talk']]
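For completeness, a plain-script sketch of the same idea applied to both example sentences from the question (using the itertools alias imported above):
import itertools as it

split_words = {'and', 'because'}
examples = [
    ['i', 'am', 'just', 'hoping', 'for', 'strength', 'and', 'guidance', 'because',
     'i', 'have', 'no', 'idea', 'why'],
    ['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home', 'and', 'tell',
     'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was', 'because', 'he', 'is',
     'been', 'told', 'not', 'to', 'talk'],
]

for example in examples:
    # group consecutive words, breaking a group whenever a split word appears
    chunks = [list(g) for k, g in it.groupby(example, key=lambda x: x not in split_words) if k]
    print(chunks)

# [['i', 'am', 'just', 'hoping', 'for', 'strength'], ['guidance'], ['i', 'have', 'no', 'idea', 'why']]
# [['maybe', 'that', 'is', 'why', 'he', 'does', 'not', 'come', 'home'],
#  ['tell', 'you', 'how', 'good', 'his', 'day', 'at', 'work', 'was'],
#  ['he', 'is', 'been', 'told', 'not', 'to', 'talk']]
Note that with this approach the split words themselves are dropped, so the first sentence yields ['guidance'] rather than ['guidance', 'because'].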
I want to get the similarity of one document to other documents. I use gensim. The program runs correctly, but after some steps it exits with a segmentation fault.
Below is my code:
from gensim import corpora, models, similarities
docs = [['Looking', 'for', 'the', 'meanings', 'of', 'words'],
['phrases'],
['and', 'expressions'],
['We', 'provide', 'hundreds', 'of', 'thousands', 'of', 'definitions'],
['synonyms'],
['antonyms'],
['and', 'pronunciations', 'for', 'English', 'and', 'other', 'languages'],
['derived', 'from', 'our', 'language', 'research', 'and', 'expert', 'analysis'],
['We', 'also', 'offer', 'a', 'unique', 'set', 'of', 'examples', 'of', 'real', 'usage'],
['as', 'well', 'as', 'guides', 'to:']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]
nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)
phrases = [['This', 'section', 'gives', 'guidelines', 'on', 'writing', 'in', 'everyday', 'situations'],
           ['from', 'applying', 'for', 'a', 'job', 'to', 'composing', 'letters', 'of', 'complaint',
            'or', 'making', 'an', 'insurance', 'claim'],
           ['There', 'are', 'plenty', 'of', 'sample', 'documents', 'to', 'help', 'you', 'get', 'it',
            'right', 'every', 'time'],
           ['create', 'a', 'good', 'impression'],
           ['and', 'increase', 'the', 'likelihood', 'of', 'achieving', 'your', 'desired', 'outcome']]
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
sims=index[phrase2word]
It runs normally until it gets to sims, but it cannot compute sims, and running it under gdb gives the following info:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffd881d809 in csr_tocsc (n_row=5, n_col=39,
Ap=0x4a4eb10, Aj=0x9fc6ec0, Ax=0x1be4a00, Bp=0xa15f6a0, Bi=0x9f3ee80,
Bx=0x9f85f60) at scipy/sparse/sparsetools/csr.h:411 411
scipy/sparse/sparsetools/csr.h: No such file or directory.
I got the answer from GitHub.
The main reason is that num_features must match the size of the dictionary: because doc2bow is called with allow_update=True, the dictionary grows while phrase2word is built, so the query contains feature ids that the index (built with the old num_features) does not know about.
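A minimal sketch of one possible fix, reusing the variables from the code above: either don't let the query phrases update the dictionary, or update the dictionary first and only then build the index so that num_features covers every token id.
# option 1: do not let the query update the dictionary
phrase2word = [dictionary.doc2bow(text) for text in phrases]
sims = index[phrase2word]

# option 2: update the dictionary first, then build the index with the final feature count
phrase2word = [dictionary.doc2bow(text, allow_update=True) for text in phrases]
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary.dfs))
sims = index[phrase2word]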
Scenario:
I have some tasks performed for a respective "Section Header" (stored as a string), and the result of each task has to be saved against the matching "Existing Section Header" (also stored as a string).
While mapping, if a task's "Section Header" is one of the "Existing Section Headers", the task results are added to it.
If not, the new Section Header gets appended to the Existing Section Header list.
Existing Section Header Looks Like This:
[ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable
running from disk", "Actions from File"]
For the below set of strings the expected behaviour is as follows:
"Activity (Last 30 Days)" - a new section should be added.
"Executables running from disk" - the existing "Executable running from disk" should be referred to [considering the extra "s" in "Executables", it is the same as "Executable"].
"Actions from a file" - the existing "Actions from File" should be referred to [considering the extra article "a"].
Is there any built-in function available in Python that may help incorporate this logic? Any suggestion regarding an algorithm for this is also highly appreciated.
This is a case where you may find regular expressions helpful. You can use re.sub() to find specific substrings and replace them. It searches for non-overlapping matches of a regular expression and replaces them with the specified string.
import re  # this will allow you to use regular expressions

def modifyHeader(header):
    # change the # of days to 30 (the parentheses must be escaped to match literally)
    modifiedHeader = re.sub(r"Activity \(Last \d+ [Dd]ays\)", "Activity (Last 30 Days)", header)
    # add an s to "executable"
    modifiedHeader = re.sub(r"Executable running from disk", "Executables running from disk", modifiedHeader)
    # add "a"
    modifiedHeader = re.sub(r"Actions from File", "Actions from a file", modifiedHeader)
    return modifiedHeader
The r"" refers to raw strings which make it a bit easier to deal with the \ characters needed for regular expressions, \d matches any digit character, and + means "1 or more". Read the page I linked above for more information.
Since you want to compare only the stem or "root word" of a given word, I suggest using a stemming algorithm. Stemming algorithms attempt to automatically remove suffixes (and in some cases prefixes) in order to find the "root word" or stem of a given word. This is useful in various natural language processing scenarios, such as search. Luckily there is a Python package for stemming, available on PyPI.
Next you want to compare strings without stop words (a, an, the, from, etc.), so you need to filter these words out before comparing the strings. You can get a list of stop words from the internet, or you can use the nltk package and import its stop-word list.
If there is any issue with nltk, here is the list of stop words:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
'should', 'now']
Now use this simple code to get your desired output:
from stemming.porter2 import stem
from nltk.corpus import stopwords
stopwords_ = stopwords.words('english')
def addString(x):
    flag = True
    y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
    for i in section:
        i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
        if y==i:
            flag = False
            break
    if flag:
        section.append(x)
        print "\tNew Section Added"
Demo:
>>> from stemming.porter2 import stem
>>> from nltk.corpus import stopwords
>>> stopwords_ = stopwords.words('english')
>>>
>>> def addString(x):
...     flag = True
...     y = [stem(j).lower() for j in x.split() if j.lower() not in stopwords_]
...     for i in section:
...         i = [stem(j).lower() for j in i.split() if j.lower() not in stopwords_]
...         if y==i:
...             flag = False
...             break
...     if flag:
...         section.append(x)
...         print "\tNew Section Added"
...
>>> section = [ "Activity (Last 3 Days)", "Activity (Last 7 days)", "Executable running from disk", "Actions from File"] # initial Section list
>>> addString("Activity (Last 30 Days)")
New Section Added
>>> addString("Executables running from disk")
>>> addString("Actions from a file")
>>> section
['Activity (Last 3 Days)', 'Activity (Last 7 days)', 'Executable running from disk', 'Actions from File', 'Activity (Last 30 Days)'] # Final section list