Spacy MemoryError - python

I managed to install spaCy, but when I try to use nlp I get a MemoryError for some weird reason.
The code I wrote is as follows:
import spacy
import re
from nltk.corpus import gutenberg

def clean_text(astring):
    # replace newlines with space
    newstring = re.sub(r"\n", " ", astring)
    # remove title and chapter headings
    newstring = re.sub(r"\[[^\]]*\]", " ", newstring)
    newstring = re.sub(r"VOLUME \S+", " ", newstring)
    newstring = re.sub(r"CHAPTER \S+", " ", newstring)
    # collapse runs of whitespace
    newstring = re.sub(r"\s\s+", " ", newstring)
    return newstring.strip()

nlp = spacy.load('en')
alice = clean_text(gutenberg.raw('carroll-alice.txt'))
nlp_alice = list(nlp(alice).sents)
The error I am getting is as follows:
(screenshot of the MemoryError traceback)
However, when my code is something like this, it works:
import spacy
nlp=spacy.load('en')
alice=nlp("hello Hello")
If anybody could point out what I am doing wrong, I would be very grateful.

I'm guessing you truly are running out of memory. I couldn't find an exact number, but I'm sure Carroll's Alice's Adventures in Wonderland has tens of thousands of sentences, which equates to tens of thousands of Span objects from spaCy. Without modification, nlp() determines everything from POS tags to dependencies for the string passed to it. Moreover, the sents property returns an iterator, which should be taken advantage of rather than immediately expanded into a list.
Basically, you're attempting a computation that may well be running into a memory constraint. How much memory does your machine have? In the comments Joe suggested watching your machine's memory usage; I second this. My recommendations: check whether you are actually running out of memory, limit the functionality of nlp(), or consider doing your work with the iterator functionality:
for sentence in nlp(alice).sents:
    pass
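One way to limit the functionality of nlp(), sketched below on the assumption of a modern spaCy (v3): a blank pipeline with only a rule-based sentencizer splits sentences without computing POS tags or parses, so far less is computed and stored per token. The sample text here is made up.

```python
import spacy

# Sketch (assumes spaCy v3): a blank English pipeline with only the
# rule-based "sentencizer" component splits sentences without tagging
# or parsing, which keeps memory use far lower than the full pipeline.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Alice was beginning to get very tired. So she sat down.")
sents = [sent.text for sent in doc.sents]  # iterate lazily in real use
print(sents)
```

For a book-sized text you would still iterate over doc.sents directly instead of building the list.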

Related

How to create an IUPAC dna object without Bio.Alphabet module in Biopython after the update?

I am new to bioinformatics, so this question may be a little silly, but I really need a clear answer and I can't find it anywhere on the web.
I know that before the update it was something like this:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

dna_iupac = Seq('ATGATCTCGTAA', IUPAC.unambiguous_dna)
Alphabets were removed in Biopython 1.78 because they were very rarely needed and led to unnecessary complications. You can easily create a sequence object without one, e.g. Seq('ATG'). If you really want to record whether you're using DNA/RNA/protein, you'll have to keep that in a separate attribute or in the variable name.
You can read more about the history behind this decision if you're interested:
https://biopython.org/wiki/Alphabet
https://github.com/biopython/biopython/issues/2046

How to split SpaCy dependency tree into subclauses?

I am trying to split units of text by their dependency trees (according to spaCy). I have experimented with much of spaCy's documentation, but I cannot figure out how to accomplish this task. To visualize, see below:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('I was, I dont remember. Do you want to go home?')
dependency_flow = displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})
The code above results in this dependency tree graph (which is split into 2 screenshots due to size):
Intuitively, this indicates that there are 2 independent clauses in the original sentence. The original sentence was 'I was, I dont remember. Do you want to go home?', and it is effectively split into two clauses, 'I was, I dont remember.', and 'Do you want to go home?'.
How, using SpaCy or any other tool, can I split the original utterance into those two clauses, so that the output is:
['I was, I dont remember.', 'Do you want to go home?']?
My current approach is rather lengthy and expensive. It involves finding the two biggest subtrees in the original text whose relative indices span the range of the original text indices, but I'm sure there is another, better way.
Given your input and output, a clause does not span multiple sentences. So, instead of going down the dependency-tree rabbit hole, it is better to get the clauses as sentences (internally they are Spans) from the doc:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('I was, I dont remember. Do you want to go home?')
print([sent.text for sent in doc.sents])
Output
['I was, I dont remember.', 'Do you want to go home?']

Extract list of variables with start attribute from Modelica model

Is there an easy way to extract a list of all variables with a start attribute from a Modelica model? The ultimate goal is to run a simulation until it reaches steady state, then run a Python script that compares the values of the start attributes against the steady-state values, so that I can identify start values that were chosen badly.
In the Dymola Python interface I could not find such functionality. Another approach could be to generate the modelDescription.xml and parse it; I assume the information is in there somewhere, but for that approach I also feel I need help getting started.
Similar to this answer, you can easily extract that info from the modelDescription.xml inside an FMU with FMPy.
Here is a small runnable example:
from fmpy import read_model_description
from fmpy.util import download_test_file
from pprint import pprint
fmu_filename = 'CoupledClutches.fmu'
download_test_file('2.0', 'CoSimulation', 'MapleSim', '2016.2', 'CoupledClutches', fmu_filename)
model_description = read_model_description(fmu_filename)
start_vars = [v for v in model_description.modelVariables if v.start and v.causality == 'local']
pprint(start_vars)
The files dsin.txt and dsfinal.txt might help you with this. They have the same structure, with values at the start and at the end of the simulation; by renaming dsfinal.txt to dsin.txt you can start your simulation from the (e.g. steady-state) values you computed in a previous run.
It might be worthwhile working with these two files if you already plan to use such values for running other simulations.
They also give you information about solver/simulation settings that you won't find in the .mat result files (if those are of any interest for your case).
However, if it is only a comparison between start and final values of variables that are present in the result files anyway, a better choice might be to use Python and a library to read the result.mat file (DyMat, ModelicaRes, etc.). It is then a matter of comparing the start and end values of the signals of interest.
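For that comparison step, once both sets of values are in plain dicts (however you read them in; all names and numbers below are made up for illustration), it is only a few lines of Python:

```python
# Hypothetical example data: start values (e.g. parsed from the model)
# and steady-state values (read from the result file with some library).
start_values = {"clutch1.phi_rel": 0.0, "inertia1.w": 10.0}
final_values = {"clutch1.phi_rel": 0.018, "inertia1.w": 9.7}

def badly_chosen(start, final, rel_tol=0.05):
    """Return names whose start value deviates from the final value by
    more than rel_tol, relative to the final value (guarding against 0)."""
    flagged = []
    for name, s in start.items():
        f = final[name]
        scale = max(abs(f), 1e-12)
        if abs(s - f) / scale > rel_tol:
            flagged.append(name)
    return flagged

print(badly_chosen(start_values, final_values))  # ['clutch1.phi_rel']
```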
After some trial and error, I came up with this python code snippet to get that information from modelDescription.xml:
import xml.etree.ElementTree as ET

root = ET.parse('modelDescription.xml').getroot()
for ScalarVariable in root.findall('ModelVariables/ScalarVariable'):
    # find any child element that carries a start attribute
    varStart = ScalarVariable.find('*[@start]')
    if varStart is not None:
        name = ScalarVariable.get('name')
        value = varStart.get('start')
        print(f"{name} = {value};")
To generate the modelDescription.xml file, run Dymola translation with the flag
Advanced.FMI.GenerateModelDescriptionInterface2 = true;
Python standard library has several modules for processing XML:
https://docs.python.org/3/library/xml.html
This snippet uses ElementTree.
This is just a first step, not sure if I missed something basic.
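For reference, here is a self-contained version of the same ElementTree approach, run against a minimal made-up modelDescription.xml fragment (ElementTree's limited XPath supports '[@attrib]' predicates for matching elements that carry an attribute):

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical modelDescription.xml fragment for demonstration.
xml_text = """
<fmiModelDescription>
  <ModelVariables>
    <ScalarVariable name="clutch1.phi_rel">
      <Real start="0.0"/>
    </ScalarVariable>
    <ScalarVariable name="clutch1.w_rel">
      <Real/>
    </ScalarVariable>
  </ModelVariables>
</fmiModelDescription>
"""

root = ET.fromstring(xml_text)
starts = {}
for sv in root.findall('ModelVariables/ScalarVariable'):
    child = sv.find('*[@start]')  # any child element with a start attribute
    if child is not None:
        starts[sv.get('name')] = child.get('start')

print(starts)  # {'clutch1.phi_rel': '0.0'}
```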

Optimizing Python algorithm

I am running my code off a 10-year-old potato computer (i5 with 4 GB RAM) and need to do a lot of language processing with NLTK. I cannot afford a new computer yet. I wrote a simple function (as part of a bigger program). The problem is, I do not know which version is more efficient, requires less computing power, and is quicker overall.
This snippet uses more variables:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects,
# simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    custom_sent_tokenizer = PunktSentenceTokenizer(training_text.read())  # You need to train the tokenizer on sample data.
    tokenized = custom_sent_tokenizer.tokenize(target_text.read())  # Use the trained tokenizer to tag your target file.
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagging = nltk.pos_tag(words)
        tagged.append(tagging)
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Or is this more efficient? I actually prefer this:
import nltk
from nltk.tokenize import PunktSentenceTokenizer  # Unsupervised machine learning tokenizer.

# This is the custom tagger I created. To use it in future projects,
# simply import it from Learn_NLTK and call it in your project.
def custom_tagger(training_file, target_file):
    tagged = []
    training_text = open(training_file, "r")
    target_text = open(target_file, "r")
    # Use the trained tokenizer to tag your target file.
    for i in PunktSentenceTokenizer(training_text.read()).tokenize(target_text.read()): tagged.append(nltk.pos_tag(nltk.word_tokenize(i)))
    training_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    target_text.close()  # ALWAYS close opened files! This is why I have included the extra code in this function!
    return tagged
Does anyone have any other suggestions for optimizing code?
It does not matter which one you choose.
The bulk of the computation is likely done by the tokenizer, not by the for loop in the presented code.
Moreover, the two examples do the same thing; one merely has fewer explicit variables, but the data still needs to be stored somewhere.
Usually, algorithmic speedups come from clever elimination of loop iterations, e.g. in sorting algorithms speedups may come from avoiding value comparisons that will not result in a change to the order of elements (ones that don't advance the sorting). Here the number of loop iterations is the same in both cases.
As mentioned by Daniel, timing functions would be your best way to figure out which method is faster.
I'd recommend using an IPython console to test the timing of each function:
%timeit custom_tagger(training_file, target_file)
I don't think there will be much of a speed difference between the two functions as the second is merely a refactoring of the first. Having all that text on one line won't speed up your code, and it makes it quite difficult to follow. If you're concerned about code length, I'd first clean up the way you read the files.
For example:
with open(target_file) as f:
    target_text = f.read()
This is much safer, as the file is closed immediately after reading. You could also improve the way you name your variables. In your code target_text is actually a file object, when its name suggests it's a string.
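Outside an IPython session, the standard-library timeit module does the same job as the %timeit magic. A minimal sketch (the stub function here is a made-up stand-in for custom_tagger, which needs real files on disk to run):

```python
import timeit

def custom_tagger_stub():
    # Stand-in for custom_tagger(training_file, target_file);
    # any zero-argument callable can be timed the same way.
    return [w.lower() for w in "A small piece of work to time".split()]

# Run the callable 10,000 times and report the total elapsed seconds.
elapsed = timeit.timeit(custom_tagger_stub, number=10000)
print("10000 calls took %.3fs" % elapsed)
```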

Email parser works on individual strings; breaks when used in loops and list comprehensions, then breaks on the original data as well... then works with map

There's some weird mysterious behavior here.
EDIT This has gotten really long and tangled, and I've edited it like 10 times. The TL/DR is that in the course of processing some text, I've managed to write a function that:
works on individual strings of a list
throws a variety of errors when I try to apply it to the whole list with a list comprehension
throws similar errors when I try to apply it to the whole list with a loop
after throwing those errors, stops working on the individual strings until I re-run the function definition and feed it some sample data, then it starts working again, and finally
turns out to work when I apply it to the whole list with map().
There's an ipython notebook saved as html which displays the whole mess here: http://paul-gowder.com/wtf.html ---I've put a link at the top to jump past some irrelevant stuff. I've also made a[nother] gist that just has the problem code and some sample data, but since this problem seems to throw around a bunch of state somehow, I can't guarantee it'll be reproducible from it: https://gist.github.com/paultopia/402891d05dd8c05995d2
End TL/DR, begin mess
I'm doing some toy text-mining on that old Enron dataset, and I have the following set of functions to clean up the emails, preparatory to turning them into a document-term matrix, after loading nltk stopwords and such. The following uses the email library in Python 2.7:
def parseEmail(document):
    # strip unnecessary headers, header text, etc.
    theMessage = email.message_from_string(document)
    tofield = theMessage['to']
    fromfield = theMessage['from']
    subjectfield = theMessage['subject']
    bodyfield = theMessage.get_payload()
    wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
    # get rid of any fields that don't exist in the email
    cleanMsgList = [x for x in wholeMsgList if x is not None]
    # now return a string with all that stuff run together
    return ' '.join(cleanMsgList)

def lettersOnly(document):
    return re.sub("[^a-zA-Z]", " ", document)

def wordBag(document):
    return lettersOnly(parseEmail(document)).lower().split()

def cleanDoc(document):
    dasbag = wordBag(document)
    # get rid of "enron" for obvious reasons, also the .com
    bagB = [word for word in dasbag if not word in ['enron', 'com']]
    unstemmed = [word for word in bagB if not word in stopwords.words("english")]
    return [stemmer.stem(word) for word in unstemmed]

print enronEmails[0][1]
print cleanDoc(enronEmails[0][1])
First (T-minus half an hour) running this on an email represented as a unicode string produced the expected result: print cleanDoc(enronEmails[0][1]) yielded a list of stemmed words. To be clear, the underlying data enronEmails is a list of [label, message] lists, where label is an integer 0 or 1, and message is a unicode string. (In python 2.7.)
Then at t-10, I added a couple lines of code (since deleted and lost, unfortunately...but see below), with some list comprehensions in them to just extract the messages from the enronEmails, run my cleanup function on them, and then join them back into strings for convenient conversion into document term matrix via sklearn. But the function started throwing errors. So I put my debugging hat on...
First I tried rerunning the original definition and test cell. But when I re-ran that cell, my email parsing function suddenly started throwing an error in the message_from_string method:
AttributeError: 'list' object has no attribute 'message_from_string'
So that was bizarre. This was exactly the same function, called on exactly the same data: cleanDoc(enronEmails[0][1]). The function had been working, on the same data, and I hadn't changed it.
So I checked to make extra sure I didn't mutate the data. enronEmails[0][1] was still a string, not a list. I have no idea why the traceback was of the opinion that I was passing a list to cleanDoc(). I wasn't.
But the plot thickens
So then I went to make a gist to create a wholly reproducible example for the purpose of posting this SO question. I started with the working part. The gist: https://gist.github.com/paultopia/c8c3e066c39336e5f3c2.
To make sure it was working, first I stuck it in a normal .py file and ran it from command line. It worked.
Then I stuck it in a cell at the bottom of my ipython notebook with all the other stuff in it. That worked too.
Then I tried the parseEmail function on enronEmails[0][1]. That worked again. Then I went all the way back up to the original cell that was throwing an error not five minutes ago and re-ran it (including the import from sklearn, and including the original definition of all functions). And it freaking worked.
BUT THEN
I then went back in and tried again with the list comprehensions and such. This time I kept more careful track of what was going on, adding the following cells:
1.
def atLeastThreeString(cleandoc):
    return ' '.join([w for w in cleandoc if len(w) > 2])

print atLeastThreeString(cleanDoc(enronEmails[0][1]))
THIS works, and produces the expected output: a string with words over 2 letters. But then:
2.
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
and all of a sudden it starts throwing a whole new error, same place in the traceback:
AttributeError: 'unicode' object has no attribute 'message_from_string'
which is extra funny, because I was passing it unicode strings a minute ago and it was doing just fine. And, just to thicken the plot, going back and rerunning cleanDoc(enronEmails[0][1]) now throws the same error.
This is driving me insane. How is it possible that creating a new list, and then attempting to run function A on that list, not only throws an error on the new list, but ALSO causes function A to throw an error on data that it was previously working on? I know I'm not mutating the original list...
I've posted the entire notebook in html form here, if anyone wants to see full code and traceback: http://paul-gowder.com/wtf.html The relevant parts start about 2/3 of the way down, at the cells numbered 24-5, where it works, and then the cell numbered 26, where it blows up.
help??
Another edit: I've added some more debugging efforts to the bottom of the above-linked html notebook. As you can see, I've traced the problem down to the act of looping, whether done implicitly in list comprehension form or explicitly. My function works on an individual item in the list of just e-mails, but then fails on every single item when I try to loop over that list, except when I use map() to do it. ???? Has the world gone insane?
I believe the problem is these statements:
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
In Python 2, the loop variable email leaks out of the list comprehension into the enclosing namespace, so you are overwriting the name of the email module and then trying to call a module function on a Python string. I don't have nltk in Python 2, so I can't test it, but I think this must be it.
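You can reproduce the shadowing with a toy example (in Python 3 the comprehension variable no longer leaks, but a plain for loop shadows the module name in both versions; the message strings below are made up):

```python
import email  # the stdlib module the parser calls message_from_string on

raw_messages = ["To: a@x.com\n\nhello", "To: b@x.com\n\nworld"]

# A plain for-loop variable lives on in the enclosing namespace after the
# loop (in Python 2, list-comprehension variables leak the same way).
# Naming it "email" rebinds the module name to a plain string:
for email in raw_messages:
    pass

try:
    email.message_from_string(raw_messages[0])
    shadowed = False
except AttributeError:  # 'str' object has no attribute 'message_from_string'
    shadowed = True

print(shadowed)  # the module name is gone; rename the loop variable to fix
```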
