Sklearn CountVectorizer "Empty Vocabulary" error in Dataframe when computing nGram - python

I have a dataframe (data) with 3 records:
id text
0001 The farmer plants grain
0002 The fisher catches tuna
0003 The police officer fights crime
I group that dataframe by id:
data_grouped = data.groupby('id')
Describing the resulting groupby object shows that all the records remain.
I then run this code to find the nGrams in the text and join them to the id:
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2),
analyzer='word')
for id, group in data_grouped:
X = word_vectorizer.fit_transform(group['text'])
frequencies = sum(X).toarray()[0]
results = pd.DataFrame(frequencies, columns=['frequency'])
dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
dfinner['id'] = id
final = results.join(dfinner)
When I run all of this code together, an error kicks out for the word_vectorizer that states "empty vocabulary; perhaps the documents only contain stop words". I know this error has been mentioned in many other questions, but I couldn't find one that deals with a Dataframe.
To further complicate the issue, the error doesn't always show up. I am pulling the data from a SQL DB, and depending on how many records I pull in, the error may or may not show up. For instance, pulling in Top 10 records causes the error, but Top 5 doesn't.
EDIT:
Complete Traceback
Traceback (most recent call last):
File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
X = word_vectorizer.fit_transform(group['cleanComments'])
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

I see what is going on here, but in running through it I have a nagging question. Why are you doing this? I'm not quite sure I understand the value of fitting the CountVectorizer to each document in a collection of documents. Generally the idea is to fit it to the entire corpus and then do you analysis from there. I get that maybe you want to be able to see what grams exist in each document but there are other, much easier and optimized, ways of doing this. For example:
df = pd.DataFrame({'id': [1,2,3], 'text': ['The farmer plants grain', 'The fisher catches tuna', 'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
print(cv.get_feature_names())
['catches tuna',
'farmer plants',
'fights crime',
'fisher catches',
'officer fights',
'plants grain',
'police officer',
'the farmer',
'the fisher',
'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
[1 0 0 1 0 0 0 0 1 0]
[0 0 1 0 1 0 1 0 0 1]]
Great so there you can see the features extracted by the CountVectorizer and the matrix representation of what features exist in each document. dt_mat is the Document Term Matrix and represents the count of each gram (the frequency) in the vocabulary (the features) for each document. To map this back to the grams, and even place it into your DataFrame you can do the following:
df['grams'] = cv.inverse_transform(dt_mat)
print(df)
id text \
0 1 The farmer plants grain
1 2 The fisher catches tuna
2 3 The police officer fights crime
grams
0 [plants grain, farmer plants, the farmer]
1 [catches tuna, fisher catches, the fisher]
2 [fights crime, officer fights, police officer,...
Personally this feels more meaningful, because you are fitting the CountVectorizer to the entire corpus and not just a single document at a time. You can still extract the same information (the frequency and the grams) and this will be much faster as you scale up in documents.

Related

Computational tractability of algorithm for matching names in two files in python

So I have two .txt files that I'm trying to match up. The first .txt file is just lines of about 12,500 names.
John Smith
Jane Smith
Joe Smith
The second .txt file also contains lines with names (that might repeat) but also extra info, about 17GB total.
584 19423 john smith John Smith 79946792 5 5 11 2016-06-24
584 19434 john smith John Smith 79923732 5 4 11 2018-03-14
584 19423 jane smith Jane Smith 79946792 5 5 11 2016-06-24
My goal is to find all the names from File 1 in File 2, and then spit out the File 2 lines that contain any of those File 1 names.
Here is my python code:
with open("Documents/File1.txt", "r") as t:
terms = [x.rstrip('\n') for x in t]
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
if any([term in line for term in terms]):
w.write(line)
So this code definitely works, but it has been running for 3 days (and is still going!!!). I did some back-of-the-envelope calculations, and I'm very worried that my algorithm is computationally intractable (or hyper inefficient) given the size of the data.
Would anyone be able to provide feedback re: (1) whether this is actually intractable and/or extremely inefficient and if so (2) what an alternative algorithm might be?
Thank you!!
First, when testing membership, set and dict are going to be much, much faster, so terms should be a set:
with open("Documents/File1.txt", "r") as t:
terms = set(line.strip() for line in t)
Next, I would split each line into a list, and check if the name is in the set, not if members of the set are in the line, which is O(N) where N is the length of each line. This way you can directly pick out the column numbers (via slicing) that contain the first and last name:
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
# split the line on whitespace
names = line.split()
# Your names seem to occur here
name = ' '.join(names[4:6])
if name in terms:
w.write(line)

Convert Text to Excel

I have the following text file as input
Patient Name: XXX,A
Date of Service: 12/12/2018
Speaker ID: 10531
Visit Start: 06/07/2018
Visit End: 06/18/2018
Recipient:
REQUESTING PHYSICIAN:
Mr.XXX
REASON FOR CONSULTATION:
Acute asthma.
HISTORY OF PRESENT ILLNESS:
The patient is a 64-year-old female who is well known to our practice. She has not been feeling well over the last 3 weeks and has been complaining of increasing shortness of breath, cough, wheezing, and chest tightness. She was prescribed systemic steroids and Zithromax. Her respiratory symptoms persisted; and subsequently, she went to Capital Health Emergency Room. She presented to the office again yesterday with increasing shortness of breath, chest tightness, wheezing, and cough productive of thick sputum. She also noted some low-grade temperature.
PAST MEDICAL HISTORY:
Remarkable for bronchial asthma, peptic ulcer disease, hyperlipidemia, coronary artery disease with anomalous coronary artery, status post tonsillectomy, appendectomy, sinus surgery, and status post rotator cuff surgery.
HOME MEDICATIONS:
Include;
1. Armodafinil.
2. Atorvastatin.
3. Bisoprolol.
4. Symbicort.
5. Prolia.
6. Nexium.
7. Gabapentin.
8. Synthroid.
9. Linzess_____.
10. Montelukast.
11. Domperidone.
12. Tramadol.
ALLERGIES:
1. CEPHALOSPORIN.
2. PENICILLIN.
3. SULFA.
SOCIAL HISTORY:
She is a lifelong nonsmoker.
PHYSICAL EXAMINATION:
GENERAL: Shows a pleasant 64-year-old female.
VITAL SIGNS: Blood pressure 108/56, pulse of 70, respiratory rate is 26, and pulse oximetry is 94% on room air. She is afebrile.
HEENT: Conjunctivae are pink. Oral cavity is clear.
CHEST: Shows increased AP diameter and decreased breath sounds with diffuse inspiratory and expiratory wheeze and prolonged expiratory phase.
CARDIOVASCULAR: Regular rate and rhythm.
ABDOMEN: Soft.
EXTREMITIES: Does not show any edema.
LABORATORY DATA:
Her INR is 1.1. Chemistry; sodium 139, potassium 3.3, chloride 106, CO2 of 25, BUN is 10, creatinine 0.74, and glucose is 110. BNP is 40. White count on admission 16,800; hemoglobin 12.5; and neutrophils 88%. Two sets of blood cultures are negative. CT scan of the chest is obtained, which is consistent with tree-in-bud opacities of the lung involving bilateral lower lobes with patchy infiltrate involving the right upper lobe. There is mild bilateral bronchial wall thickening.
IMPRESSION:
1. Acute asthma.
2. Community acquired pneumonia.
3. Probable allergic bronchopulmonary aspergillosis.
I want the text file to be converted as an excel file
Patient Name Date of Service Speaker ID Visit Start Visit End Recipient ..... IMPRESSION:
XYZ 2/27/2018 10101 06-07-2018 06/18/2018 NA ....... 1. Acute asthma.
2. Community
acquired
pneumonia.
3. Probable
allergic
I wrote the following code
with open('1.txt') as infile:
registrations = []
fields = OrderedDict()
d = {}
for line in infile:
line = line.strip()
if line:
key, value = [s.strip() for s in line.split(':', 1)]
d[key] = value
fields[key] = None
else:
if d:
registrations.append(d)
d = {}
else:
if d: # handle EOF
registrations.append(d)
with open('registrations.csv', 'w') as outfile:
writer = DictWriter(outfile, fieldnames=fields)
writer.writeheader()
writer.writerows(registrations)
I'm getting an error
ValueError: not enough values to unpack (expected 2, got 1)
I'm not sure what the error is saying. I searched through websites but could not find a solution. I tried editing the file to remove the space and tried the above code, it was working. But in real-time scenario there will be hundreds of thousands of files so manually editing every file to remove all the spaces is not possible.
Your particular error is likely from
key, value = [s.strip() for s in line.split(':', 1)]
Some of your lines don't have a colon, so there is only one value in your list, and we can't assign one value to the pair key, value.
For example:
line = 'this is some text with a : colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)
returns:
this is some text with a
colon
But you'll get your error with
line = 'this is some text without a colon'
key, value = [s.strip() for s in line.split(':', 1)]
print(key)
print(value)

Obtain tsv from text with a specific pattern

I'm a biologist and I need to take information on a text file
I have a file with plain text like that:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
aa|amino acid|0.99818
Ad|adenovirus|0.96935
MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the id, the second line is the title, the third line used to be the abstract (sometimes there are abbreviations) and the lasts lines (if there are) are abbreviations with double space, the abbreviation, the meaning, and a number. You can see :
GA|general anesthesia|0.99818
Then there is a line in blank and start again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
And I need to take this data and convert to a TSV file like that:
12018411 TAI timed artificial insemination
8406022 aa amino acids
8406022 Ad adenovirus
... ... ...
First column ID, second column Abbreviation, and third column Meaning of this abbreviation.
I tried to convert first in a Dataframe and then convert to TSV but I don't know how take the information of the text with the structure I need.
And I tried with this code too:
from collections import namedtuple
import pandas as pd
Item= namedtuple('Item', 'ID')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
lines= f.readlines()
for line in lines:
if line== '\n':
ID= ¿nextline?
if line.startswith(" "):
Abbreviation = line
items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line and the code not found because there are some lines in blank in the middle between the corpus and the title sometimes.
I'm using python 3.8
Thank you very much in advance.
Assuming test.txt has your input data, I used simple file read functions to process the data -
file1 = open('test.txt', 'r')
Lines = file1.readlines()
outputlines = []
outputline=""
counter = 0
for l in Lines:
if l.strip()=="":
outputline = ""
counter = 0
elif counter==0:
outputline = outputline + l.strip() + "|"
counter = counter + 1
elif counter==1:
counter = counter + 1
else:
if len(l.split("|"))==3 and l[0:2]==" " :
outputlines.append(outputline + l.strip() +"\n")
counter = counter + 1
file1 = open('myfile.txt', 'w')
file1.writelines(outputlines)
file1.close()
Here file is read, line by line, a counter is kept and reset when there is a blank line, and ID is read in just next line. If there are 3 field "|" separated row, with two spaces in beginning, row is exported with ID

Python: I have a list of words, and want to check the number of occurrences of those words in each line in a file

So I want to count the occurrences of certain words, per line, in a text file. How many times each specific word occurred doesnt matter, just how many times any of them occurred per line. I have a file containing a list of words, delimited by newline character. It looks like this:
amazingly
astoundingly
awful
bloody
exceptionally
frightfully
.....
very
I then have another text file containing lines of text. Lets say for example:
frightfully frightfully amazingly Male. Don't forget male
green flag stops? bloody bloody bloody bloody
I'm biased.
LOOKS like he was headed very
green flag stops?
amazingly exceptionally exceptionally
astoundingly
hello world
I want my output to look like:
3
4
0
1
0
3
1
Here's my code:
def checkLine(line):
count = 0
with open("intensifiers.txt") as f:
for word in f:
if word[:-1] in line:
count += 1
print count
for line in open("intense.txt", "r"):
checkLine(line)
Here's my actual output:
4
1
0
1
0
2
1
0
any ideas?
How about this:
def checkLine(line):
with open("intensifiers.txt") as fh:
line_words = line.rstrip().split(' ')
check_words = [word.rstrip() for word in fh]
print sum(line_words.count(w) for w in check_words)
for line in open("intense.txt", "r"):
checkLine(line)
Output:
3
4
0
1
0
3
1
0

Word match in multiple files

I have a corpus of words like these. There are more than 3000 words. But there are 2 files:
File #1:
#fabulous 7.526 2301 2
#excellent 7.247 2612 3
#superb 7.199 1660 2
#perfection 7.099 3004 4
#terrific 6.922 629 1
#magnificent 6.672 490 1
File #2:
) #perfect 6.021 511 2
? #great 5.995 249 1
! #magnificent 5.979 245 1
) #ideal 5.925 232 1
day #great 5.867 219 1
bed #perfect 5.858 217 1
) #heavenly 5.73 191 1
night #perfect 5.671 180 1
night #great 5.654 177 1
. #partytime 5.427 141 1
I have many sentences like this, more than 3000 lines like below:
superb, All I know is the road for that Lomardi start at TONIGHT!!!! We will set a record for a pre-season MNF I can guarantee it, perfection.
All Blue and White fam, we r meeting at Golden Corral for dinner to night at 6pm....great
I have to go through every line and do the following task:
1) find if those corpus of words match anywhere in the sentences
2) find if those corpus of words match leading and trailing of sentences
I am able to do part 2) and not part 1). I can do it but finding a efficient way.
I have the following code:
for line in sys.stdin:
(id,num,senti,words) = re.split("\t+",line.strip())
sentence = re.split("\s+", words.strip().lower())
for line1 in f1: #f1 is the file containing all corpus of words like File #1
(term2,sentimentScore,numPos,numNeg) = re.split("\t", line1.strip())
wordanalysis["trail"] = found if re.match(sentence[(len(sentence)-1)],term2.lower()) else not(found)
wordanalysis["lead"] = found if re.match(sentence[0],term2.lower()) else not(found)
for line in sys.stdin:
(id,num,senti,words) = re.split("\t+",line.strip())
sentence = re.split("\s+", words.strip().lower())
for line1 in f1: #f1 is the file containing all corpus of words like File #1
(term2,sentimentScore,numPos,numNeg) = re.split("\t", line1.strip())
wordanalysis["trail"] = found if re.match(sentence[(len(sentence)-1)],term2.lower()) else not(found)
wordanalysis["lead"] = found if re.match(sentence[0],term2.lower()) else not(found)
for line1 in f2: #f2 is the file containing all corpus of words like File #2
(term2,sentimentScore,numPos,numNeg) = re.split("\t", line1.strip())
wordanalysis["trail_2"] = found if re.match(sentence[(len(sentence)-1)],term.lower()) else not(found)
wordanalysis["lead_2"] = found if re.match(sentence[0],term.lower()) else not(found)
Am I doing this right? Is there a better way to do it.
this is a classic map reduce problem, if you want to get serious about efficiency you should consider something like: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
and if you are too lazy / have too few resources to set your own hadoop environment you can try a ready made one http://aws.amazon.com/elasticmapreduce/
feel free to post your code here after its done :) it will be nice to see how it is translated into a mapreduce algorithm...

Categories

Resources