I am trying to load the 20newsgroups corpus with the NLTK corpus reader, then extract the words from all documents and tag them. But an error is raised when I try to build the list of extracted and tagged words.
Here is the CODE:
import nltk
import random
from nltk.tokenize import word_tokenize
newsgroups = nltk.corpus.reader.CategorizedPlaintextCorpusReader(
    r"C:\nltk_data\corpora\20newsgroups",
    r'(?!\.).*\.txt',
    cat_pattern=r'(not_sports|sports)/.*',
    encoding="utf8")

documents = [(list(newsgroups.words(fileid)), category)
             for category in newsgroups.categories()
             for fileid in newsgroups.fileids(category)]

random.shuffle(documents)
And the corresponding ERROR is:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-10-de2a1a6859ea> in <module>()
1 documents = [(list(newsgroups.words(fileid)), category)
----> 2 for category in newsgroups.categories()
3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
<ipython-input-10-de2a1a6859ea> in <listcomp>(.0)
1 documents = [(list(newsgroups.words(fileid)), category)
2 for category in newsgroups.categories()
----> 3 for fileid in newsgroups.fileids(category)]
4
5 random.shuffle(documents)
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in __len__(self)
231 # iterate_from() sets self._len when it reaches the end
232 # of the file:
--> 233 for tok in self.iterate_from(self._toknum[-1]): pass
234 return self._len
235
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
294 self._current_toknum = toknum
295 self._current_blocknum = block_index
--> 296 tokens = self.read_block(self._stream)
297 assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
298 'block reader %s() should return list or tuple.' %
C:\ProgramData\Anaconda3\lib\site-packages\nltk\corpus\reader\plaintext.py in _read_word_block(self, stream)
120 words = []
121 for i in range(20): # Read 20 lines at a time.
--> 122 words.extend(self._word_tokenizer.tokenize(stream.readline()))
123 return words
124
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in readline(self, size)
1166 while True:
1167 startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168 new_chars = self._read(readsize)
1169
1170 # If we're at a '\r', then read one extra character, since
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _read(self, size)
1398
1399 # Decode the bytes into unicode characters
-> 1400 chars, bytes_decoded = self._incr_decode(bytes)
1401
1402 # If we got bytes but couldn't decode any, then read further.
C:\ProgramData\Anaconda3\lib\site-packages\nltk\data.py in _incr_decode(self, bytes)
1429 while True:
1430 try:
-> 1431 return self.decode(bytes, 'strict')
1432 except UnicodeDecodeError as exc:
1433 # If the exception occurs at the end of the string,
C:\ProgramData\Anaconda3\lib\encodings\utf_8.py in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6: invalid start byte
I have tried changing the encoding in the corpus reader to ascii and utf16 as well; neither works. I am also not sure whether the regex I have provided is the right one. The filenames in the 20newsgroups corpus are in the form of two numbers separated by a hyphen (-), such as:
5-53286
102-53553
8642-104983
The second thing I am worried about is whether the error is coming from the document contents themselves as they are read for feature extraction.
Here is what documents in the 20newsgroups corpus look like:
From: bil#okcforum.osrhe.edu (Bill Conner) Subject: Re: free moral
agency
dean.kaflowitz (decay#cbnewsj.cb.att.com) wrote: : > : > I think
you're letting atheist mythology
: Great start. I realize immediately that you are not interested : in
discussion and are going to thump your babble at me. I would : much
prefer an answer from Ms Healy, who seems to have a : reasonable and
reasoned approach to things. Say, aren't you the : creationist guy
who made a lot of silly statements about : evolution some time ago?
: Duh, gee, then we must be talking Christian mythology now. I : was
hoping to discuss something with a reasonable, logical : person, but
all you seem to have for your side is a repetition : of the same
boring mythology I've seen a thousand times before. : I am deleting
the rest of your remarks, unless I spot something : that approaches an
answer, because they are merely a repetition : of some uninteresting
doctrine or other and contain no thought : at all.
: I have to congratulate you, though, Bill. You wouldn't : know a
logical argument if it bit you on the balls. Such : a persistent lack
of function in the face of repeated : attempts to assist you in
learning (which I have seen : in this forum and others in the past)
speaks of a talent : that goes well beyond my own, meager abilities.
I just don't : seem to have that capacity for ignoring outside
influences.
: Dean Kaflowitz
Dean,
Re-read your comments, do you think that merely characterizing an
argument is the same as refuting it? Do you think that ad hominum
attacks are sufficient to make any point other than you disapproval of
me? Do you have any contribution to make at all?
Bill
From: cmk#athena.mit.edu (Charles M Kozierok) Subject: Re: Jack Morris
In article <1993Apr19.024222.11181#newshub.ariel.yorku.ca> cs902043#ariel.yorku.ca (SHAWN LUDDINGTON) writes: } In article <1993Apr18.032345.5178#cs.cornell.edu> tedward#cs.cornell.edu (Edward [Ted] Fischer) writes: } >In article <1993Apr18.030412.1210#mnemosyne.cs.du.edu> gspira#nyx.cs.du.edu (Greg Spira) writes: } >>Howard_Wong#mindlink.bc.ca (Howard Wong) writes: }
>> } >>>Has Jack lost a bit of his edge? What is the worst start Jack Morris has had? } >> } >>Uh, Jack lost his edge about 5 years ago, and has had only one above } >>average year in the last 5. } > } >Again goes to prove that it is better to be good than lucky. You can }
>count on good tomorrow. Lucky seems to be prone to bad starts (and a } >bad finish last year :-). } > } >(Yes, I am enjoying every last run he gives up. Who was it who said } >Morris was a better signing than Viola?) } } Hey Valentine, I don't see Boston with any world series rings on their } fingers.
oooooo. cheap shot. :^)
} Damn, Morris now has three and probably the Hall of Fame in his } future.
who cares? he had two of them before he came to Toronto; and if the Jays had signed Viola instead of Morris, it would have been Frank who won 20 and got the ring. and he would be on his way to 20 this year, too.
} Therefore, I would have to say Toronto easily made the best } signing.
your logic is curious, and spurious.
there is no reason to believe that Viola wouldn't have won as many games had *he* signed with Toronto. when you compare their stupid W-L records, be sure to compare their team's offensive averages too.
now, looking at anything like the Morris-Viola sweepstakes a year later is basically hindsight. but there were plenty of reasons why it should have been apparent that Viola was the better pitcher, based on previous recent years and also based on age (Frank is almost 5 years younger! how many knew that?). people got caught up in the '91 World Series, and then on Morris' 21 wins last year. wins are the stupidest, most misleading statistic in baseball, far worse than RBI or R. that he won 21 just means that the Jays got him a lot of runs.
the only really valid retort to Valentine is: weren't the Red Sox trying to get Morris too? oh, sure, they *said* Viola was their first choice afterwards, but what should we have expected they would say?
} And don't tell me Boston will win this year. They won't } even be in the top 4 in the division, more like 6th.
if this is true, it won't be for lack of contribution by Viola, so who cares?
-*- charles
Please tell me whether the error occurs while loading the documents or while reading the files and extracting words. What do I need to do to load the corpus correctly?
NLTK's corpus readers can run into decoding issues like this when loading local corpora.
You can load the useful category data with scikit-learn instead:
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
Here newsgroups_train.target_names gives you the category names.
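If you then want the same (word list, category) pairs as in the question, a minimal sketch built on top of this loader could look like the following (word_tokenize and the category choice are carried over from the question, not part of fetch_20newsgroups itself):
import random
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import word_tokenize

cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# pair each tokenized document with its category name, mirroring the
# (word list, category) structure built with NLTK in the question
documents = [(word_tokenize(text), newsgroups_train.target_names[label])
             for text, label in zip(newsgroups_train.data, newsgroups_train.target)]

random.shuffle(documents)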
Related
I am working on a Kaggle dataset and trying to extract BILUO entities using spaCy's
'training.offsets_to_biluo_tags'
function. The original data is in CSV format, which I have managed to convert into the JSON format below:
{
"entities": [
{
"feature_text": "Lack-of-other-thyroid-symptoms",
"location": "['564 566;588 600', '564 566;602 609', '564 566;632 633', '564 566;634 635']"
},
{
"feature_text": "anxious-OR-nervous",
"location": "['13 24', '454 465']"
},
{
"feature_text": "Lack of Sleep",
"location": "['289 314']"
},
{
"feature_text": "Insomnia",
"location": "['289 314']"
},
{
"feature_text": "Female",
"location": "['6 7']"
},
{
"feature_text": "45-year",
"location": "['0 5']"
}
],
"pn_history": "45 yo F. CC: nervousness x 3 weeks. Increased stress at work. Change in role from researcher to lecturer. Also many responsibilities at home, caring for elderly mother and in-laws, and 17 and 19 yo sons. Noticed decreased appetite, but forces herself to eat 3 meals a day. Associated with difficulty falling asleep (duration 30 to 60 min), but attaining full 7 hours with no interruptions, no early morning awakenings. Also decreased libido for 2 weeks. Nervousness worsened on Sunday and Monday when preparing for lectures for the week. \r\nROS: no recent illness, no headache, dizziness, palpitations, tremors, chest pain, SOB, n/v/d/c, pain\r\nPMH: none, no pasMeds: none, Past hosp/surgeries: 2 vaginal births no complications, FHx: no pysch hx, father passed from acute MI at age 65 yo, no thyroid disease\r\nLMP: 1 week ago \r\nSHx: English literature professor, no smoking, occasional EtOH, no ilicit drug use, sexually active."
}
In the JSON, the entities part contains each feature text and its location(s) in the text, and the pn_history part contains the entire text document.
The first problem I have is that the dataset contains instances where a single text portion is tagged with more than one unique entity. For instance, the text located at position [289 314] belongs to two different entities, 'Insomnia' and 'Lack of Sleep'. While processing this type of instance spaCy runs into:
ValueError [E103] Trying to set conflicting doc.ents while creating
custom NER
The second problem I have with the dataset is that in some cases the start and end positions are given plainly, for instance [13 24], but in other cases the indices are scattered, e.g. '564 566;588 600', which contains a semicolon: it is expected to pick the first set of word(s) from location 564 566 and the second set from location 588 600. I cannot pass these kinds of indices to the spaCy function.
Please advise how I can solve these problems.
OK, it sounds like you have two separate problems.
Overlapping entities. You'll need to decide what to do with these and filter your data; spaCy won't automatically handle this for you. It's up to you to decide what's "correct". Usually you would want the longest entities. You could also use the recently released spancat, which is like NER but can handle overlapping annotations.
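For the longest-entity route, here is a minimal sketch using spacy.util.filter_spans; the sentence, offsets, and label names are made up, standing in for the conflicting 'Insomnia' / 'Lack of Sleep' annotations:
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("difficulty falling asleep for two weeks")

# two annotations over the same characters, as in the question
spans = [
    doc.char_span(0, 25, label="INSOMNIA"),
    doc.char_span(0, 25, label="LACK_OF_SLEEP"),
]

# filter_spans keeps the (first) longest span of each overlapping group,
# so assigning doc.ents no longer raises E103
doc.ents = filter_spans([s for s in spans if s is not None])
print(doc.ents)  # (difficulty falling asleep,)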
Discontinuous entities. These are your annotations with ;. These are harder: spaCy has no way to handle them at the moment (and in my experience, few systems handle discontinuous entities). Here's an example annotation from your sample:
[no] headache, dizziness, [palpitations]
Sometimes with discontinuous entities you can just include the middle part, but that won't work here. I don't think there's any good way to translate this into spaCy, because your input tag is "lack of thyroid symptoms". Usually I would model this as "thyroid symptoms" and handle negation separately; in this case that means you could just tag palpitations.
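If you decide to preprocess the locations yourself, here is a small sketch for parsing the location strings; note that it simply turns each ;-separated fragment into its own (start, end) pair, which loses the information that the fragments belong together:
import ast

def parse_locations(location_field):
    # "['564 566;588 600', '13 24']" -> [(564, 566), (588, 600), (13, 24)]
    spans = []
    for loc in ast.literal_eval(location_field):
        for fragment in loc.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
    return spans

print(parse_locations("['564 566;588 600', '13 24']"))
# [(564, 566), (588, 600), (13, 24)]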
After importing the dataset and the libraries, my code so far basically looks like this:
data = pd.read_csv("/content/gdrive/MyDrive/Data/tripadvisor_hotel_reviews.csv")
reviews = data['Review'].str.lower()
#Check
print(reviews)
print(type('Review'))
print(type(reviews))
The output, however, looks like this:
0 nice hotel expensive parking got good deal sta...
1 ok nothing special charge diamond member hilto...
2 nice rooms not 4* experience hotel monaco seat...
3 unique, great stay, wonderful time hotel monac...
4 great stay great stay, went seahawk game aweso...
...
20486 best kept secret 3rd time staying charm, not 5...
20487 great location price view hotel great quick pl...
20488 ok just looks nice modern outside, desk staff ...
20489 hotel theft ruined vacation hotel opened sept ...
20490 people talking, ca n't believe excellent ratin...
Name: Review, Length: 20491, dtype: object
<class 'str'>
<class 'pandas.core.series.Series'>
I want to know why the variable "reviews" has a different type than the data column "Review" if I (supposedly) set them equal.
This is a problem because when I try to tokenize my data, it throws an error.
My tokenization code:
word_tokenize(reviews)
The error I get:
**TypeError** Traceback (most recent call last)
<ipython-input-9-ebaf7dca0fec> in <module>()
----> 1 word_tokenize(reviews)
/usr/local/lib/python3.7/dist-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
1287 def _slices_from_text(self, text):
1288 last_break = 0
-> 1289 for match in self._lang_vars.period_context_re().finditer(text):
1290 context = match.group() + match.group('after_tok')
1291 if self.text_contains_sentbreak(context):
**TypeError:** expected string or bytes-like object
There are many things going on here. First of all, reviews is a pd.Series. This means that
word_tokenize(reviews)
won't work, because you can't tokenize a Series of strings. You can, however, tokenize a single string. The following should work:
tokens = [word_tokenize(review) for review in reviews]
because review above is a string, and you are tokenizing each string in the whole pd.Series of strings named reviews.
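An equivalent, pandas-native sketch (assuming the data DataFrame and reviews Series from the question) that stores the token lists in a new column:
# apply word_tokenize to every string in the Series;
# fillna("") guards against any missing reviews
data['tokens'] = reviews.fillna("").apply(word_tokenize)
print(data['tokens'].head())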
Also, comparing type('Review') and type(reviews) makes no sense. reviews is a pd.Series (i.e. an iterable) holding many different strings, while 'Review' is a string instance that holds the English word "Review". type('Review') will always be str. In contrast, type(reviews) may change depending on what value the variable reviews holds.
I'm currently learning the BW2 and Wurst packages, and I'm new to Python. I've been attempting to replicate "When the background matters: Using scenarios from integrated assessment models in prospective lifecycle assessment" by Mendoza Beltran et al. (2018) using EcoInvent 3.6. If anyone has also replicated the study, any help with the problem below would be awesome! :)
In creating regional versions of the additional datasets, add_new_locations_to_added_datasets(input_db) is called, which in turn calls geomatcher.intersects(('IMAGE', reg)) to create a new version of all added electricity generation datasets for each IMAGE region in the Integrated Assessment Model. However, I keep getting KeyError: "Can't find this location", and I think there is an obvious adjustment/fix I need to make to the regions/locations in the provided code, but my beginner brain can't seem to work backwards and solve this. I can provide more information in case anything is unclear or needs further defining.
KeyError Traceback (most recent call last)
Input In [89], in <module>
----> 1 add_new_locations_to_added_datasets(input_db)
File ~\AppData\Local\Temp\ipykernel_19308\2544958206.py:13, in add_new_locations_to_added_datasets(db)
11 possibles = {}
12 for reg in REGIONS[:-1]:
---> 13 temp= [x for x in geomatcher.intersects(('IMAGE', reg))if type(x) !=tuple]
14 possibles[reg] = [x for x in temp if len(ecoinvent_to_image_locations(x)) ==1 ]
15 if not len(possibles[reg]): print(reg, ' has no good candidate')
File ~\Miniconda3\envs\bw2\lib\site-packages\constructive_geometries\geomatcher.py:152, in Geomatcher.intersects(self, key, include_self, exclusive, biggest_first, only)
149 if key == "RoW" and "RoW" not in self:
150 return ["RoW"] if "RoW" in possibles else []
--> 152 faces = self[key]
153 lst = [
154 (k, (len(v.intersection(faces)), len(v)))
155 for k, v in possibles.items()
156 if (faces.intersection(v))
157 ]
158 return self._finish_filter(lst, key, include_self, exclusive, biggest_first)
File ~\Miniconda3\envs\bw2\lib\site-packages\constructive_geometries\geomatcher.py:72, in Geomatcher.__getitem__(self, key)
70 if key == "RoW" and "RoW" not in self.topology:
71 return set()
---> 72 return self.topology[self._actual_key(key)]
File ~\Miniconda3\envs\bw2\lib\site-packages\constructive_geometries\geomatcher.py:105, in Geomatcher._actual_key(self, key)
102 print("Geomatcher: Used '{}' for '{}'".format(new, key))
103 return new
--> 105 raise KeyError("Can't find this location")
KeyError: "Can't find this location"
The IMAGE region labels in Wurst have been updated; hence the names in the provided 'Image variable names-correct.csv' spreadsheet needed to be updated, as suggested by Chris Mutel in the comments.
In the _Functions_to_modify_ecoinvent notebook being used, in the add_new_locations_to_added_datasets(db) function below, the list comprehension [x for x in temp if len(ecoinvent_to_image_locations(x)) == 1] needed to be changed to == 2 to match EcoInvent's two-letter codes and provide possibles. I am not sure whether this was due to an update or an accidental typo.
def add_new_locations_to_added_datasets(db):
    # We create a new version of all added electricity generation datasets for each IMAGE region.
    # We allow the upstream production to remain global, as we are mostly interested in regionalizing
    # to take advantage of the regionalized IMAGE data.
    # step 1: make copies of all datasets for new locations
    # best would be to regionalize datasets for every location with an electricity market like this:
    # locations = {x['location'] for x in get_many(db, *electricity_market_filter_high_voltage)}
    # but this takes quite a long time. For now, we just use 1 location that is uniquely in each IMAGE region.
    possibles = {}
    for reg in REGIONS[:-1]:
        temp = [x for x in geomatcher.intersects(('IMAGE', reg)) if type(x) != tuple]
        possibles[reg] = [x for x in temp if len(ecoinvent_to_image_locations(x)) == 2]
        if not len(possibles[reg]):
            print(reg, ' has no good candidate')
    locations = [v[0] for v in possibles.values()]
For anyone using newer versions of EcoInvent: where previously the IAI Area regions needed to be fixed in rename_locations(input_db, fix_names), that is no longer needed, as they have been updated. Additionally, for writing results back into the database, the rename_locations(db, fix_names_back) call needs to be rename_locations(db, fix_names).
# These locations aren't found correctly by the constructive geometries library - we correct them here:
fix_names= {'CSG' : 'CN-CSG',
'SGCC': 'CN-SGCC',
'RFC' : 'US-RFC',
'SERC' : 'US-SERC',
'TRE': 'US-TRE',
'ASCC': 'US-ASCC',
'HICC': 'US-HICC',
'FRCC': 'US-FRCC',
'SPP' : 'US-SPP',
'MRO, US only' : 'US-MRO',
'NPCC, US only': 'US-NPCC',
'WECC, US only': 'US-WECC',
'IAI Area 1, Africa':'IAI Area, Africa',
'IAI Area 3, South America':'IAI Area, South America',
'IAI Area 4&5, without China':'IAI Area, Asia, without China and GCC',
'IAI Area 2, without Quebec':'IAI Area, North America, without Quebec',
'IAI Area 8, Gulf':'IAI Area, Gulf Cooperation Council',
}
fix_names_back = {v:k for k,v in fix_names.items()}
def rename_locations(db, name_dict):
    for ds in db:
        if ds['location'] in name_dict:
            ds['location'] = name_dict[ds['location']]
        for exc in w.technosphere(ds):
            if exc['location'] in name_dict:
                exc['location'] = name_dict[exc['location']]

rename_locations(input_db, fix_names)

for key in database_dict.keys():
    db = database_dict[key]['db']
    rename_locations(db, fix_names_back)
    write_brightway2_database(db, key)
The IMAGE region labels were changed in 2021. I am not sure why, but I guess this is what is causing the error. Probably wurst was adapted to use newer IMAGE scenarios, or something similar.
You can use %debug to start a debugger, print the exact location that is causing the error, and add that to your question.
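For reference, a minimal way to do that in Jupyter (a sketch; the variable names are the ones visible in the traceback above):
# run this in the cell immediately after the one that raised the KeyError
%debug

# at the ipdb prompt you can then inspect the offending values, e.g.:
#   ipdb> p key    # the location geomatcher could not find
#   ipdb> up       # walk up the stack into your own code
#   ipdb> p reg    # the IMAGE region being processed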
I am looking for sentiment analysis code with at least 80% accuracy. I tried VADER and found it easy and usable, but it was only giving about 64% accuracy.
Now I was looking at some BERT models, and I noticed they need to be re-trained? Is that correct? Aren't they pre-trained? Is re-training necessary?
You can use pre-trained models from HuggingFace. There are plenty to choose from. Search for emotion or sentiment models.
Here is an example of a model with 26 emotions. The current implementation works but is very slow for large datasets.
import pandas as pd
from transformers import RobertaTokenizerFast, TFRobertaForSequenceClassification, pipeline
tokenizer = RobertaTokenizerFast.from_pretrained("arpanghoshal/EmoRoBERTa")
model = TFRobertaForSequenceClassification.from_pretrained("arpanghoshal/EmoRoBERTa")
emotion = pipeline('sentiment-analysis', model='arpanghoshal/EmoRoBERTa')
# example data
DATA_URI = "https://github.com/AFAgarap/ecommerce-reviews-analysis/raw/master/Womens%20Clothing%20E-Commerce%20Reviews.csv"
dataf = pd.read_csv(DATA_URI, usecols=["Review Text",])
# This is super slow, I will find a better optimization ASAP
dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: (d["Review Text"]
                                    .fillna("")
                                    .map(lambda x: emotion(x)[0].get("label", None))))
        )
We could also refactor it a bit
...
# a bit faster than the previous but still slow
def emotion_func(text: str) -> str:
    if not text:
        return None
    return emotion(text)[0].get("label", None)

dataf = (dataf
         .head(50)  # comment this out for the whole dataset
         .assign(Emotion=lambda d: d["Review Text"].map(emotion_func))
        )
Results:
Review Text Emotion
0 Absolutely wonderful - silky and sexy and comf... admiration
1 Love this dress! it's sooo pretty. i happene... love
2 I had such high hopes for this dress and reall... fear
3 I love, love, love this jumpsuit. it's fun, fl... love
...
6 I aded this in my basket at hte last mintue to... admiration
7 I ordered this in carbon for store pick up, an... neutral
8 I love this dress. i usually get an xs but it ... love
9 I'm 5"5' and 125 lbs. i ordered the s petite t... love
...
16 Material and color is nice. the leg opening i... neutral
17 Took a chance on this blouse and so glad i did... admiration
...
26 I have been waiting for this sweater coat to s... excitement
27 The colors weren't what i expected either. the... disapproval
...
31 I never would have given these pants a second ... love
32 These pants are even better in person. the onl... disapproval
33 I ordered this 3 months ago, and it finally ca... disappointment
34 This is such a neat dress. the color is great ... admiration
35 Wouldn't have given them a second look but tri... love
36 This is a comfortable skirt that can span seas... approval
...
40 Pretty and unique. great with jeans or i have ... admiration
41 This is a beautiful top. it's unique and not s... admiration
42 This poncho is so cute i love the plaid check ... love
43 First, this is thermal ,so naturally i didn't ... love
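If the per-row map is too slow on a large dataset, one possible speed-up (a sketch, not benchmarked here, and batching behaviour depends on your transformers version) is to pass the whole list of texts to the pipeline at once:
texts = dataf["Review Text"].fillna("").tolist()

# the pipeline accepts a list of strings and returns one result dict per text;
# very long reviews may still need truncation on the tokenizer side
results = emotion(texts)
dataf["Emotion"] = [r["label"] for r in results]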
You can use pickle.
Pickle lets you... well, pickle your model for later use. In fact, you can use a loop to keep training the model until it reaches a certain accuracy, then exit the loop and pickle the model for later use.
You can find many tutorials on YouTube on how to pickle a model.
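A minimal sketch of the save/load round-trip (model here is a placeholder for whatever trained classifier object you end up with):
import pickle

# save the trained model to disk
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...later, load it back without re-training
with open("sentiment_model.pkl", "rb") as f:
    model = pickle.load(f)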
I'm attempting a similar operation as shown here.
I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", AND cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as data types.
df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])
I have already pre-cleaned the data; the format of the first few rows is shown below:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
[IN] df['cleaned'].head()
[OUT] 0 acquaint hous receiv follow letter clerk crown...
1 ask secretari state war whether issu statement...
2 i beg present petit sign upward motor car driv...
3 i desir ask secretari state war second lieuten...
4 ask secretari state war whether would introduc...
Name: cleaned, dtype: object
Then I initialise the TfidfVectorizer:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
Following this, calling the line below results in:
[IN] x = v.fit_transform(df['cleaned'])
[OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.
I overcame this using the solution in the aforementioned thread:
[IN] x = v.fit_transform(df['cleaned'].values.astype('U'))
however, this resulted in a Memory Error (Full Traceback).
I've looked into storing things with Pickle to circumvent the mass memory usage, but I'm not sure how to work it into this scenario. Any tips would be much appreciated, and thanks for reading.
[UPDATE]
#pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:
[IN] train_X = pandas.get_dummies(df['cleaned'])
[IN] train_X.shape
[OUT] (2405, 2380)
[IN] x = v.fit_transform(train_X)
[IN] type(x)
[OUT] scipy.sparse.csr.csr_matrix
I thought I should update any readers while I see what I can do with this development. If there are any predicted pitfalls with this method, I'd love to hear them.
I believe it's the conversion to dtype('<Unn') that might be giving you trouble. Check out the size of the array on a relative basis, using just the first few documents plus a NaN:
>>> df['cleaned'].values
array(['acquaint hous receiv follow letter clerk crown',
'ask secretari state war whether issu statement',
'i beg present petit sign upward motor car driv',
'i desir ask secretari state war second lieuten',
'ask secretari state war whether would introduc', nan],
dtype=object)
>>> df['cleaned'].values.astype('U').nbytes
1104
>>> df['cleaned'].values.nbytes
48
It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
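Putting that together, a minimal sketch reusing the file name and column names from the question (those names are assumptions carried over from there):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("ukgovClean.csv", encoding="utf-8", usecols=[0, 2])
df.dropna(subset=["cleaned"], inplace=True)   # drop rows with missing text first

v = TfidfVectorizer(decode_error="replace", encoding="utf-8")
x = v.fit_transform(df["cleaned"].tolist())   # plain list of str, no '<U..' copy
print(x.shape)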