Stack lists into a nested list like a matryoshka in Python

I have some "big" lists, and some shorter lists that appear inside them as contiguous runs. I would like to nest the sublists into one another, ending up with the one list that contains all the others. The input is sorted by length of the items first.
lists = [['Linden'], ['Linden', 'N.J.'], ['chemical', 'plants'], ['some', 'federal', 'officials'], ['the', 'other', 'side', 'of', 'the', 'country'], ['the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['an', 'industrial', 'corridor', 'of', 'chemical', 'plants', 'that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['At', 'the', 'other', 'side', 'of', 'the', 'country', 'Linden', 'N.J.', 'is', 'part', 'of', 'an', 'industrial', 'corridor', 'of', 'chemical', 'plants', 'and', 'oil', 'refineries', 'that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America']]
and the result should be:
['At', ['the', 'other', 'side', 'of', 'the', 'country'], [['Linden'], 'N.J.'], 'is', 'part', 'of', ['an', 'industrial', 'corridor', 'of', ['chemical', 'plants'], 'and', 'oil', 'refineries', 'that', ['some', 'federal', 'officials'], 'refer', 'to', 'as', ['the', 'most', 'dangerous', 'two', 'miles', 'in', 'America']]]
How can I solve this?

I found my own solution after some failed attempts to ask this on Stack Overflow, so here it is.
If you have optimization ideas, I would be happy to hear them.
I would like to use an iterator instead of the recursive function, as in "iterate python nested lists efficiently", but in the attempt I made, it flattened the nesting of the list.
from more_itertools import replace
import collections.abc
nestlist = [['Linden'],['Linden', 'N.J.'], ['chemical', 'plants'], ['some', 'federal', 'officials'], ['the', 'other', 'side', 'of', 'the', 'country'], ['the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['an', 'industrial', 'corridor', 'of', 'chemical', 'plants', 'that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America'], ['At', 'the', 'other', 'side', 'of', 'the', 'country', 'Linden', 'N.J.', 'is', 'part', 'of', 'an', 'industrial', 'corridor', 'of', 'chemical', 'plants', 'and', 'oil', 'refineries', 'that', 'some', 'federal', 'officials', 'refer', 'to', 'as', 'the', 'most', 'dangerous', 'two', 'miles', 'in', 'America']]
#nestlist = [['Linden', 'N.J.'], ['chemical', 'plants'], ['country', 'Linden', 'N.J.', 'of', 'chemical', 'plants']]
#nestlist = [['Linden', 'N.J.'], ['Linden', 'N.J.', 'of'], ['country', 'Linden', 'N.J.', 'of', 'chemical', 'plants']]
def flatten(iterable):
    # Recursively yield the leaves of an arbitrarily nested iterable.
    for el in iterable:
        if isinstance(el, collections.abc.Iterable) and not isinstance(el, str):
            yield from flatten(el)
        else:
            yield el
def on_each_iterable(lol, fun):
    # Apply fun to every (sub)iterable of a nested list, bottom-up.
    if isinstance(lol, collections.abc.Iterable) and not isinstance(lol, str):
        out = []
        for x in lol:
            out.append(on_each_iterable(x, fun))
        out = fun(out)
    else:
        out = lol
    return out
def stack_matrjoshka(nesting_list):
    # Shortest lists first, so small pieces get nested into bigger ones.
    nesting_list = sorted(nesting_list, key=lambda x: len(x))
    n = 0
    while n < (len(nesting_list) - 2):
        to_fit_there = nesting_list[n]
        flatted_to_fit_there = list(flatten(to_fit_there[:]))

        def is_fitting(*xs):
            # A window fits if its flattened items equal the candidate sublist.
            flatted_compared = list(flatten(xs[:]))
            return flatted_compared == flatted_to_fit_there

        for m in range(n + 1, len(nesting_list)):
            through = list(nesting_list[m])

            def replacing_fun(x):
                return list(replace(list(x), is_fitting, [to_fit_there],
                                    window_size=len(to_fit_there)))

            nesting_list[m] = on_each_iterable(through, replacing_fun)
        n = n + 1
    return nesting_list[-1]
nested_list = stack_matrjoshka(nestlist)
print(nested_list)
which prints:
['At', ['the', 'other', 'side', 'of', 'the', 'country'], [['Linden'], 'N.J.'], 'is', 'part', 'of', 'an', 'industrial', 'corridor', 'of', ['chemical', 'plants'], 'and', 'oil', 'refineries', ['that', ['some', 'federal', 'officials'], 'refer', 'to', 'as', ['the', 'most', 'dangerous', 'two', 'miles', 'in', 'America']]]
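As a quick sanity check (my addition, not from the original post): flattening the nested result should reproduce the longest input list, since the replacements only regroup items without reordering them:
print(list(flatten(nested_list)) == sorted(nestlist, key=len)[-1])  # True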

Related

Why does the NLTK stopwords output not match the NLTK word_tokenize output?

I am currently using NLTK's stopwords and word_tokenize to process some text, and I have encountered some weird behavior.
from nltk import word_tokenize
from nltk.corpus import stopwords

sentence = "this is a test sentence which makes perfectly.sense. doesn't it? it won't. i'm annoyed"
tok_sentence = word_tokenize(sentence)
print(tok_sentence)
print(stopwords.words('english'))
This prints the following:
['this', 'is', 'a', 'test', 'sentence', 'which', 'makes', 'perfectly.sense', '.', 'does', "n't", 'it', '?', 'it', 'wo', "n't", '.', 'i', "'m", 'annoyed']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Focusing on the words that contain the apostrophe: the stopword list clearly contains contractions with it, and all the contracted words in my sample sentence are included in the stopword list, as are their parts ("doesn't" -> included, "doesn" + "t" -> included).
The word_tokenize function, however, splits the word "doesn't" into "does" and "n't".
Filtering out stopwords after using word_tokenize will therefore remove "does" but leave "n't" behind...
I was wondering if this behavior was intentional. If so, could someone please explain why?
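For illustration (my sketch, not from the original thread), here is the mismatch in a few lines; "n't" survives the filter because it appears nowhere in the stopword list:
from nltk import word_tokenize
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
tokens = word_tokenize("doesn't it?")                 # ['does', "n't", 'it', '?']
print([t for t in tokens if t.lower() not in stop])   # ["n't", '?']
One common workaround is to run the stopword list itself through word_tokenize, so both sides split contractions the same way.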

Is there a way to get the equivalent list of vectors per paragraph in doc2vec?

Is there a way, with doc2vec, to see the vectors I got per paragraph rather than per word in the vocabulary? Using model.wv.vectors I get all the per-word vectors. I need the paragraph vectors in order to apply a clustering algorithm to the embedded paragraphs, which I hope to obtain this way. I am not sure, though, whether this approach is good. This is how the paragraphs look:
[TaggedDocument(words=['this', 'is', 'the', 'effect', 'of', 'those', 'states', 'that', 'went', 'into', 'lockdown', 'much', 'later', 'they', 'are', 'just', 'starting', 'to', 'see', 'the', 'large', 'increase', 'now', 'they', 'have', 'to', 'ride', 'it', 'out', 'and', 'hope', 'for', 'the', 'best'], tags=[0]),
TaggedDocument(words=['so', 'see', 'the', 'headline', 'is', 'died', 'not', 'revised', 'predictions', 'show', 'more', 'hopeful', 'situation', 'or', 'new', 'york', 'reaching', 'apex', 'long', 'before', 'experts', 'predicted', 'or', 'any', 'such', 'thing', 'got', 'to', 'keep', 'the', 'panic', 'train', 'rolling', 'see'], tags=[1])]
model.docvecs.vectors will contain all the trained-up document vectors.
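A minimal sketch of how that can feed a clustering step (my assumptions: gensim 4.x, where model.docvecs is aliased to model.dv, and scikit-learn for KMeans):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

docs = [TaggedDocument(words=['this', 'is', 'a', 'paragraph'], tags=[0]),
        TaggedDocument(words=['another', 'short', 'paragraph'], tags=[1])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

X = model.dv.vectors                              # one row per tagged paragraph
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)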

How to make these words into sentences

I lemmatised several sentences, and the results turned out like this (this is for the first few sentences):
['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
All the words are generated together in one flat list, but I need them sentence by sentence; the output format would be better like this:
['She be start on Levofloxacin but the patient become hypotensive at that point with blood pressure of 70/45 and receive a normal saline bolus to boost her blood pressure to 99/60 ; however the patient be admit to the Medical Intensive Care Unit for overnight observation because of her somnolence and hypotension .','11 . History of hemoptysis , on Coumadin .','There be ST scoop in the lateral lead consistent with Dig vs. a question of chronic ischemia change .']
Can anyone help me, please? Thanks a lot.
Try this code:
final = []
sentence = []
for word in words:
    if word in ['.']:  # and whatever other punctuation marks you want to use
        sentence.append(word)
        final.append(' '.join(sentence))
        sentence = []
    else:
        sentence.append(word)
print(final)
Hope this helps! :)
A good starting point might be str.join():
>>> wordsList = ['She', 'be', 'start', 'on', 'Levofloxacin']
>>> ' '.join(wordsList)
'She be start on Levofloxacin'
words=['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the', 'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood', 'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline', 'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';', 'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical', 'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because', 'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History', 'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST', 'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.', 'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
def Wordify(words, sen_lim):
    Array = []
    word = ""
    sen_len = 0
    for w in words:
        word += w + " "
        if w.isalnum():
            sen_len += 1
        if w == "." and sen_len > sen_lim:
            Array.append(word)
            word = ""
            sen_len = 0
    return Array

print(Wordify(words, 5))
Basically, you append the words to a growing string and split off a sentence whenever there is a period, but you also ensure that the current sentence has a minimum number of words. This prevents fragments like "11 ." from being split off as separate sentences. sen_lim is a parameter you can tune to your convenience.
You can try string concatenation by looping through the list:
list1 = ['She', 'be', 'start', 'on', 'Levofloxacin', 'but', 'the',
'patient', 'become', 'hypotensive', 'at', 'that', 'point', 'with', 'blood',
'pressure', 'of', '70/45', 'and', 'receive', 'a', 'normal', 'saline',
'bolus', 'to', 'boost', 'her', 'blood', 'pressure', 'to', '99/60', ';',
'however', 'the', 'patient', 'be', 'admit', 'to', 'the', 'Medical',
'Intensive', 'Care', 'Unit', 'for', 'overnight', 'observation', 'because',
'of', 'her', 'somnolence', 'and', 'hypotension', '.', '11', '.', 'History',
'of', 'hemoptysis', ',', 'on', 'Coumadin', '.', 'There', 'be', 'ST',
'scoop', 'in', 'the', 'lateral', 'lead', 'consistent', 'with', 'Dig', 'vs.',
'a', 'question', 'of', 'chronic', 'ischemia', 'change', '.']
list2 = []
string = ""
for element in list1:
    if string == "" or element == ".":
        string = string + element
    else:
        string = string + " " + element
list2.append(string)
print(list2)
You could try this:
# list of words.
words = ['This', 'is', 'a', 'sentence', '.']

def sentence_from_list(words):
    sentence = ""
    # iterate the list and append to the string.
    for word in words:
        sentence += word + " "
    result = [sentence]
    # print the result.
    print(result)

sentence_from_list(words)
You may need to delete the last space, just before the '.'
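A slightly tidier variant (my sketch, not from the answer above): build the string with str.join, which avoids the trailing space altogether:
def sentence_from_list(words):
    return [' '.join(words)]

print(sentence_from_list(['This', 'is', 'a', 'sentence', '.']))  # ['This is a sentence .']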

gensim.similarities.SparseMatrixSimilarity gets a segmentation fault

I want to get the similarity of one document to other documents, using gensim. The program runs correctly up to a certain step, then exits with a segmentation fault.
Below is my code:
from gensim import corpora, models, similarities
docs = [['Looking', 'for', 'the', 'meanings', 'of', 'words'],
['phrases'],
['and', 'expressions'],
['We', 'provide', 'hundreds', 'of', 'thousands', 'of', 'definitions'],
['synonyms'],
['antonyms'],
['and', 'pronunciations', 'for', 'English', 'and', 'other', 'languages'],
['derived', 'from', 'our', 'language', 'research', 'and', 'expert', 'analysis'],
['We', 'also', 'offer', 'a', 'unique', 'set', 'of', 'examples', 'of', 'real', 'usage'],
['as', 'well', 'as', 'guides', 'to:']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(text) for text in docs]
nf=len(dictionary.dfs)
index = similarities.SparseMatrixSimilarity(corpus, num_features=nf)
phrases = [['This', 'section', 'gives', 'guidelines', 'on', 'writing', 'in', 'everyday', 'situations'],
           ['from', 'applying', 'for', 'a', 'job', 'to', 'composing', 'letters', 'of', 'complaint', 'or', 'making', 'an', 'insurance', 'claim'],
           ['There', 'are', 'plenty', 'of', 'sample', 'documents', 'to', 'help', 'you', 'get', 'it', 'right', 'every', 'time'],
           ['create', 'a', 'good', 'impression'],
           ['and', 'increase', 'the', 'likelihood', 'of', 'achieving', 'your', 'desired', 'outcome']]
phrase2word=[dictionary.doc2bow(text,allow_update=True) for text in phrases]
sims=index[phrase2word]
It runs normally until computing sims; there it crashes, and gdb gives the following info:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffd881d809 in csr_tocsc (n_row=5, n_col=39,
Ap=0x4a4eb10, Aj=0x9fc6ec0, Ax=0x1be4a00, Bp=0xa15f6a0, Bi=0x9f3ee80,
Bx=0x9f85f60) at scipy/sparse/sparsetools/csr.h:411 411
scipy/sparse/sparsetools/csr.h: No such file or directory.
I got the answer from GitHub: the main reason is that num_features must match the size of the dictionary (len(dictionary.dfs)).
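My reading of that fix (a sketch, not from the original answer): doc2bow(..., allow_update=True) adds the unseen words from phrases to the dictionary, so some token ids exceed the num_features the index was built with. Either stop updating the dictionary, or rebuild the index after updating it:
# Option 1: don't grow the dictionary after the index is built.
phrase2word = [dictionary.doc2bow(text) for text in phrases]

# Option 2: if the dictionary must be updated, rebuild the index afterwards.
phrase2word = [dictionary.doc2bow(text, allow_update=True) for text in phrases]
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))

sims = index[phrase2word]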

NLTK tokenize questions

Just a quick beginner's question here with NLTK.
I am trying to tokenize and generate bigrams, trigrams and quadgrams from a corpus.
I need to add <s> to the beginning of my lists and </s> to the end in place of a period if there is one.
The list is taken from the Brown corpus in NLTK (a specific section of it, at that).
so.. I have
from nltk.corpus import brown
news = brown.sents(categories = 'editorial')
Am I making this too difficult?
Thanks a lot.
import nltk.corpus as corpus

def mark_sentence(row):
    # Replace a trailing period with </s>, or append </s> if there is none.
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

news = corpus.brown.sents(categories='editorial')
for row in news[:5]:
    print(mark_sentence(row))
yields
['<s>', 'Assembly', 'session', 'brought', 'much', 'good', '</s>']
['<s>', 'The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '</s>']
['<s>', 'It', 'was', 'faced', 'immediately', 'with', 'a', 'showdown', 'on', 'the', 'schools', ',', 'an', 'issue', 'which', 'was', 'met', 'squarely', 'in', 'conjunction', 'with', 'the', 'governor', 'with', 'a', 'decision', 'not', 'to', 'risk', 'abandoning', 'public', 'education', '</s>']
['<s>', 'There', 'followed', 'the', 'historic', 'appropriations', 'and', 'budget', 'fight', ',', 'in', 'which', 'the', 'General', 'Assembly', 'decided', 'to', 'tackle', 'executive', 'powers', '</s>']
['<s>', 'The', 'final', 'decision', 'went', 'to', 'the', 'executive', 'but', 'a', 'way', 'has', 'been', 'opened', 'for', 'strengthening', 'budgeting', 'procedures', 'and', 'to', 'provide', 'legislators', 'information', 'they', 'need', '</s>']
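The question also asks for bigrams, trigrams and quadgrams; a sketch for that part (my addition, using nltk.ngrams):
from nltk import ngrams

marked = mark_sentence(list(news[0]))  # copy, so the corpus row is not mutated
bigrams = list(ngrams(marked, 2))
trigrams = list(ngrams(marked, 3))
quadgrams = list(ngrams(marked, 4))
print(bigrams[:3])  # [('<s>', 'Assembly'), ('Assembly', 'session'), ('session', 'brought')]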
